Omni SenseVoice: High-Speed Speech Recognition with Words Timestamps

  • Looks cool! Combine this with the new TTS that was released today, which looks really good, plus an LLM, and you'd have a pretty good all-local voice assistant! https://github.com/SWivid/F5-TTS

  • I’ve been building a production app on top of ASR and find the range of models kind of bewildering compared to LLMs and video. The commercial offerings seem to be custom or built on top of Whisper or maybe NVIDIA Canary/Parakeet, and then you have stuff like SpeechBrain that seems to run on top of lots of different open models for different tasks. Sometimes it’s genuinely hard to tell what’s a foundation model and what isn’t.

    Separately, I wonder if this is the model Speechmatics uses.

  • How does the accuracy compare to Whisper?

  • This looks really nice. What I find interesting is that it seems to advertise itself for the transcription use case, but if it is "lightning fast" I wonder if there are better use cases for it.

    I use AWS Transcribe[1] primarily. It costs me $0.024 per minute of video and also provides timestamps. Without running the numbers, it's unclear to me whether I could do any better with this model, seeing as it needs a GPU to run.

    With that said, I always love to see these things in the Open Source domain. Competition drives innovation.

    Edit: Doing some math, with spot instances on EC2 or serverless GPU on some other platforms it could be relatively price competitive with AWS Transcribe if throughput is even modestly faster than real time (about 2 hours of audio transcribed per hour of compute to break even). Of course the devops work for running your own model is higher.

    [1] https://aws.amazon.com/transcribe/
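
    For anyone who wants to redo that break-even math with their own numbers, here is a rough sketch. The $0.024 per minute is the AWS Transcribe price quoted above; the GPU hourly rate is a hypothetical placeholder (the comment doesn't state one), picked only because it happens to reproduce the 2x figure. It ignores idle time, storage, and ops overhead.

```python
# Back-of-the-envelope break-even for self-hosted ASR vs. AWS Transcribe.
# Only the per-minute AWS price comes from the comment above; the GPU
# rate below is an assumed, illustrative value.

AWS_PRICE_PER_AUDIO_MINUTE = 0.024  # USD, as quoted in the comment


def aws_cost_per_audio_hour(price_per_minute: float = AWS_PRICE_PER_AUDIO_MINUTE) -> float:
    """Cost in USD to transcribe one hour of audio at a per-minute price."""
    return price_per_minute * 60


def break_even_throughput(gpu_cost_per_hour: float,
                          price_per_minute: float = AWS_PRICE_PER_AUDIO_MINUTE) -> float:
    """Hours of audio that must be transcribed per GPU-hour to match the AWS price."""
    return gpu_cost_per_hour / aws_cost_per_audio_hour(price_per_minute)


if __name__ == "__main__":
    hypothetical_gpu_rate = 2.88  # USD per GPU-hour, assumption for illustration
    needed = break_even_throughput(hypothetical_gpu_rate)
    print(f"AWS cost per audio hour: ${aws_cost_per_audio_hour():.2f}")
    print(f"Break-even throughput: {needed:.2f} audio hours per GPU-hour")
```

    A cheaper GPU lowers the bar proportionally: at half that hourly rate, real-time transcription would already break even.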

  • Can it diarize?

  • OOMs even in quantized mode on a 3090. What's a better option for personal use?

    > torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 43.71 GiB. GPU 0 has a total capacity of 24.00 GiB of which 20.74 GiB is free.

  • Can't wait for a bundle of something like this with screen capture. I'd love to pipe my convos/habits/apps/etc. to a local index for search. Seems we're getting close.

  • Does it do diarization?

  • With timestamps?! I gotta try this.

  • Which languages does it support?

  • Does it work with chorus?
