The ElevenLabs Speech to Text node transcribes audio files into text. It uses ElevenLabs’ API to convert spoken words into a written transcript, supporting features like automatic language detection, identifying different speakers, and tagging non-speech sounds like music or laughter.

Inputs

| Parameter | Data Type | Required | Range/Options | Description |
|---|---|---|---|---|
| audio | AUDIO | Yes | - | Audio to transcribe. |
| model | COMBO | Yes | "scribe_v2" | Model to use for transcription. Selecting this model reveals additional parameters. |
| tag_audio_events | BOOLEAN | No | - | Annotate sounds like (laughter), (music), etc. in the transcript. Revealed when the "scribe_v2" model is selected. (default: False) |
| diarize | BOOLEAN | No | - | Annotate which speaker is talking. Revealed when the "scribe_v2" model is selected. (default: False) |
| diarization_threshold | FLOAT | No | 0.1 - 0.4 | Speaker separation sensitivity; lower values are more sensitive to speaker changes. Revealed when the "scribe_v2" model is selected and diarize is enabled. (default: 0.22) |
| temperature | FLOAT | No | 0.0 - 2.0 | Randomness control. 0.0 uses the model default; higher values increase randomness. Revealed when the "scribe_v2" model is selected. (default: 0.0) |
| timestamps_granularity | COMBO | No | "word", "character", "none" | Timing precision for transcript words. Revealed when the "scribe_v2" model is selected. (default: "word") |
| language_code | STRING | No | - | ISO-639-1 or ISO-639-3 language code (e.g., "en", "es", "fra"). Leave empty for automatic detection. (default: "") |
| num_speakers | INT | No | 0 - 32 | Maximum number of speakers to predict. Set to 0 for automatic detection. (default: 0) |
| seed | INT | No | 0 - 2147483647 | Seed for reproducibility (determinism not guaranteed). (default: 1) |
Note: The num_speakers parameter cannot be set to a value greater than 0 when the diarize option is enabled. You must either disable diarize or set num_speakers to 0.
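The constraint in the note above can be checked before running the node. A minimal sketch (the function name and error messages are illustrative, not part of the node's API):

```python
def validate_speaker_settings(diarize: bool, num_speakers: int) -> None:
    """Reject the invalid diarize/num_speakers combination described above.

    The node forbids num_speakers > 0 while diarize is enabled; this
    helper mirrors that rule client-side. (Illustrative only.)
    """
    if not 0 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 0 and 32")
    if diarize and num_speakers > 0:
        raise ValueError(
            "num_speakers cannot be greater than 0 when diarize is enabled; "
            "disable diarize or set num_speakers to 0"
        )

validate_speaker_settings(diarize=True, num_speakers=0)   # OK: automatic detection
validate_speaker_settings(diarize=False, num_speakers=4)  # OK: diarize disabled
```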

Outputs

| Output Name | Data Type | Description |
|---|---|---|
| text | STRING | The transcribed text from the audio. |
| language_code | STRING | The detected language code of the audio. |
| words_json | STRING | A JSON-formatted string containing detailed word-level information, including timestamps and speaker labels if enabled. |
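Downstream code can parse words_json to, for example, group words by speaker when diarize is enabled. A sketch assuming each entry is an object with "text", "start", "end", and "speaker_id" fields — the exact field names depend on the ElevenLabs response and should be verified against real output:

```python
import json
from collections import defaultdict

def words_by_speaker(words_json: str) -> dict:
    """Group transcript words by speaker label.

    Assumes words_json is a JSON array of objects with "text" and
    "speaker_id" keys (an assumed schema, not confirmed by this doc).
    """
    groups = defaultdict(list)
    for word in json.loads(words_json):
        groups[word.get("speaker_id", "unknown")].append(word["text"])
    return {speaker: " ".join(words) for speaker, words in groups.items()}

# Hypothetical payload for illustration only:
sample = json.dumps([
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker_id": "speaker_0"},
    {"text": "there", "start": 0.5, "end": 0.9, "speaker_id": "speaker_0"},
    {"text": "Hi",    "start": 1.2, "end": 1.4, "speaker_id": "speaker_1"},
])
print(words_by_speaker(sample))  # {'speaker_0': 'Hello there', 'speaker_1': 'Hi'}
```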