Speech To text
The speech to text area is composed by the following ROS nodes:
-
AudioCapturer
devices/AudioCapturer [python]: It is a node that captures the audio using PyAudio and publishes it to the topic rawAudioChunk.
-
GetUsefulAudio
devices/UsefulAudio [python]: A node that takes the chunks of audio and, using webrtcvad, checks for a voice, cut the audio and publishes it to the topic UsefulAudio. Webrtcvad approach was made as an alternative that don´t remove silence but obtains the pieces of audio when someone speaks perfectly, it has a very good performance.
-
Engine Selector
action_selectors/hear [python]: This node receives the requests of STT. It checks if there is an internet connection, to know whether to call the offline or online engine; this can be overridden with FORCE_ENGINE parameter. * Online engine: it is in AzureSpeechToText node. For that, this node processes the audio of UsefulAudio do a resample of 16KHz and publishes it in a new topic called UsefulAudioAzure to relay it to that node. * Offline engine: it is in DeepSpeech node. For that, this node redirect the audio of UsefulAudio to a new topic called UsefulAudioDeepSpeech to relay it to that node.
-
Speech to text Engines Both are nodes that receive the audio and convert it to text. The difference is that one is online and the other is offline.
action_selectors/AzureSpeechToText [c++]: A node that takes the audio published in the topic UsefulAudioAzure and send it to the Azure SpeechToText API, receives the text and publishes it to the topic RawInput.
action_selectors/DeepSpeech [python]: A node that takes the audio published in the topic UsefulAudioDeepSpeech and calls DeepSpeech2, converts it to text, and publishes it to the topic RawInput.