Why don't Automatic Speech Recognition (ASR) models use prompting? [D]
I've been working on the listening part of my full-duplex speech model and I realized that ASR prompting could be very useful.
Deepgram allows for word boosting, but that doesn't work well in real-world applications.
Another thing that's missing is feeding the whole conversation history as context to the ASR model. This could be very useful for voice agents.
TL;DR: during testing I realized the model can be fine-tuned for prompting with text like:

<text>Expect a license plate (3 letters, 3 numbers). For example ABC123.</text><|start|>
or
<text>Expect a person's name. It could also contain a last name. For example John Doe.</text><|start|>

Instead of specifying every individual word to boost (which is sometimes not feasible, or you'd run out of context window), we can just specify a category of words and the model will know what to boost:

<text>Boost words: [Australian cities, food names, TV shows]</text><|start|>

I thought that by now this would surely be something most ASR models support, but it seems like none do.
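For anyone curious how the fine-tuning side of this could look: a minimal sketch of building one prompt-conditioned training example, where the prompt tokens are prepended before the <|start|> token and masked out of the loss so the decoder learns to condition on the prompt without being trained to reproduce it. The token IDs, vocabulary, and the -100 ignore-index convention (common in HF-style training loops) are all illustrative assumptions, not any specific model's format.

```python
# Hypothetical sketch: prompt-conditioned ASR training example.
# All token IDs below are made up for illustration.

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_training_example(prompt_ids, transcript_ids, start_id, end_id):
    """Concatenate [prompt] <|start|> transcript <|end|>, and build labels
    so the model is only trained to predict the transcript, not the prompt."""
    input_ids = prompt_ids + [start_id] + transcript_ids + [end_id]
    # Mask the prompt and the <|start|> token: they condition the decoder
    # but should not contribute to the training loss.
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + transcript_ids + [end_id]
    return input_ids, labels


# Toy usage with fake IDs:
prompt = [11, 12, 13]      # e.g. tokens for "<text>Expect a license plate...</text>"
transcript = [21, 22, 23]  # e.g. tokens for "ABC123"
inp, lab = build_training_example(prompt, transcript, start_id=1, end_id=2)
print(inp)  # [11, 12, 13, 1, 21, 22, 23, 2]
print(lab)  # [-100, -100, -100, -100, 21, 22, 23, 2]
```

At inference time you'd prepend the same <text>...</text> prompt tokens before <|start|> and let the decoder generate from there; because the prompt was never a prediction target during training, the model treats it purely as context.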
Is there a reason why this is not a common feature?