RichASR uses a cutting-edge end-to-end (E2E) ASR architecture to transcribe audio signals while automatically tagging sensitive entities, which range from credit card information to profanities and other entity types.
For the 1-step Speech Processing Unit (1SPU), we leverage a commercially available end-to-end automatic speech recognition (E2E ASR) system, initializing its tokenizer with user-defined dummy tokens. These labels behave like unique special tokens, much like a ‘<pad>’ token, and are not used during the ASR’s initial training phase. The incorporation of these inactive tokens therefore does not affect training or inference time, since they can be masked during E2E ASR optimization. Subsequently, we re-purpose the pre-trained speech encoder to generate transcriptions that embed semantic tokens indicating event tags, optimizing with the CTC loss function. This strategy diverges from previous methods that relied on special characters for tagging within transcriptions under CTC loss. Because our approach introduces the dummy tags during the ASR’s pre-training, it obviates any alteration to the tokenizer or output layer before fine-tuning.
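The idea of reserving inactive tag tokens in the vocabulary, then reusing them as CTC targets at fine-tuning time, can be illustrated with a minimal sketch. This is an assumption-laden toy (character-level tokenizer, invented tag names such as ‘<cc>’ and ‘<prof>’), not the system’s actual tokenizer:

```python
# Toy character-level vocabulary. Tag names and vocabulary layout are
# illustrative assumptions, not the paper's actual implementation.
BASE_VOCAB = ["<blank>", "<pad>"] + list("abcdefghijklmnopqrstuvwxyz ")
# Dummy tag tokens reserved at pre-training time; like <pad>, they are
# never emitted during pre-training (their logits can be masked).
TAG_TOKENS = ["<cc>", "</cc>", "<prof>", "</prof>"]
VOCAB = BASE_VOCAB + TAG_TOKENS
TOK2ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(transcript_with_tags: str) -> list[int]:
    """Encode a (possibly tagged) transcript into CTC target ids.

    Pre-training transcripts contain no tags, so tag ids never appear in
    the targets. Fine-tuning reuses the same vocabulary, embedding the tag
    tokens directly in the label sequence, so neither the tokenizer nor
    the output layer needs to be resized."""
    ids, i = [], 0
    while i < len(transcript_with_tags):
        for tag in TAG_TOKENS:
            if transcript_with_tags.startswith(tag, i):
                ids.append(TOK2ID[tag])
                i += len(tag)
                break
        else:
            ids.append(TOK2ID[transcript_with_tags[i]])
            i += 1
    return ids

# Pre-training target: plain transcription, tag ids unused.
plain = encode("pay four now")
# Fine-tuning target: entity tags embedded as ordinary label tokens.
tagged = encode("pay <cc>four</cc> now")
```

Because the tag ids already exist in the output layer from the start, switching from `plain`-style to `tagged`-style targets is purely a change of labels, which is the property the paragraph above relies on.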