The ElevenLabs Speech to Text node transcribes audio files into text. It uses ElevenLabs’ API to convert spoken words into a written transcript, supporting features like automatic language detection, identifying different speakers, and tagging non-speech sounds like music or laughter.

Inputs

| Parameter | Data Type | Required | Range/Options | Description |
|---|---|---|---|---|
| audio | AUDIO | Yes | - | Audio to transcribe. |
| model | COMBO | Yes | "scribe_v2" | Model to use for transcription. Selecting this model reveals additional parameters. |
| tag_audio_events | BOOLEAN | No | - | Annotate sounds like (laughter), (music), etc. in the transcript. Revealed when the "scribe_v2" model is selected. (default: False) |
| diarize | BOOLEAN | No | - | Annotate which speaker is talking. Revealed when the "scribe_v2" model is selected. (default: False) |
| diarization_threshold | FLOAT | No | 0.1 - 0.4 | Speaker separation sensitivity; lower values are more sensitive to speaker changes. Revealed when the "scribe_v2" model is selected and diarize is enabled. (default: 0.22) |
| temperature | FLOAT | No | 0.0 - 2.0 | Randomness control. 0.0 uses the model default; higher values increase randomness. Revealed when the "scribe_v2" model is selected. (default: 0.0) |
| timestamps_granularity | COMBO | No | "word", "character", "none" | Timing precision for transcript words. Revealed when the "scribe_v2" model is selected. (default: "word") |
| language_code | STRING | No | - | ISO-639-1 or ISO-639-3 language code (e.g., "en", "es", "fra"). Leave empty for automatic detection. (default: "") |
| num_speakers | INT | No | 0 - 32 | Maximum number of speakers to predict. Set to 0 for automatic detection. (default: 0) |
| seed | INT | No | 0 - 2147483647 | Seed for reproducibility (determinism not guaranteed). (default: 1) |
Note: The num_speakers parameter cannot be set to a value greater than 0 when the diarize option is enabled. You must either disable diarize or set num_speakers to 0.
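The constraint in the note above can be checked before running the node. A minimal sketch (the function name and error messages are illustrative, not part of the node's API):

```python
def validate_speaker_settings(diarize: bool, num_speakers: int) -> None:
    """Reject the invalid diarize/num_speakers combination described above.

    The node forbids num_speakers > 0 while diarize is enabled; this
    helper mirrors that rule client-side. (Illustrative only.)
    """
    if not 0 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 0 and 32")
    if diarize and num_speakers > 0:
        raise ValueError(
            "num_speakers cannot be greater than 0 when diarize is enabled; "
            "disable diarize or set num_speakers to 0"
        )

validate_speaker_settings(diarize=True, num_speakers=0)   # OK: automatic detection
validate_speaker_settings(diarize=False, num_speakers=4)  # OK: diarize disabled
```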

Outputs

| Output Name | Data Type | Description |
|---|---|---|
| text | STRING | The transcribed text from the audio. |
| language_code | STRING | The detected language code of the audio. |
| words_json | STRING | A JSON-formatted string containing detailed word-level information, including timestamps and speaker labels if enabled. |
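Downstream code can parse words_json to, for example, group words by speaker when diarize is enabled. A sketch assuming each entry is an object with "text", "start", "end", and "speaker_id" fields — the exact field names depend on the ElevenLabs response and should be verified against real output:

```python
import json
from collections import defaultdict

def words_by_speaker(words_json: str) -> dict:
    """Group transcript words by speaker label.

    Assumes words_json is a JSON array of objects with "text" and
    "speaker_id" keys (an assumed schema, not confirmed by this doc).
    """
    groups = defaultdict(list)
    for word in json.loads(words_json):
        groups[word.get("speaker_id", "unknown")].append(word["text"])
    return {speaker: " ".join(words) for speaker, words in groups.items()}

# Hypothetical payload for illustration only:
sample = json.dumps([
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker_id": "speaker_0"},
    {"text": "there", "start": 0.5, "end": 0.9, "speaker_id": "speaker_0"},
    {"text": "Hi",    "start": 1.2, "end": 1.4, "speaker_id": "speaker_1"},
])
print(words_by_speaker(sample))  # {'speaker_0': 'Hello there', 'speaker_1': 'Hi'}
```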