Hacker News

Omni SenseVoice: High-Speed Speech Recognition with Words Timestamps

by ringer007 on 10/13/2024, 12:48:25 AM with 12 comments

by modeless on 10/13/2024, 2:43:44 AM
Looks cool! Combine this with this new TTS that released today that looks really good and an LLM and you'd have a pretty good all-local voice assistant! https://github.com/SWivid/F5-TTS
by staticautomatic on 10/13/2024, 3:34:47 AM
I’ve been building a production app on top of ASR and find the range of models kind of bewildering compared to LLMs and video. The commercial offerings seem to be custom or built on top of Whisper or maybe NVIDIA Canary/Parakeet, and then you have stuff like SpeechBrain that seems to run on top of lots of different open models for different tasks. Sometimes it’s genuinely hard to tell what’s a foundation model and what isn’t.
Separately, I wonder if this is the model Speechmatics uses.
by steinvakt on 10/13/2024, 6:56:09 AM
How does the accuracy compare to Whisper?
by throwaway2016a on 10/13/2024, 3:54:50 PM
This looks really nice. What I find interesting is that it seems to advertise itself for the transcription use case, but if it is "lightning fast" I wonder if there are better use cases for it.
I use AWS Transcribe[1] primarily. It costs me $0.024 per minute of video and also provides timestamps. It's unclear to me without running the numbers if using this model I could do any better than that seeing as it needs a GPU to run.
With that said, I always love to see these things in the Open Source domain. Competition drives innovation.
Edit: Doing some math, with spot instances on EC2 or serverless GPU on some other platforms it could be relatively price competitive with AWS Transcribe if the performance is even slightly fast (2 hours of transcription per hour to break even). Of course the devops work for running your own model is higher.
[1] https://aws.amazon.com/transcribe/
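A rough sketch of the break-even arithmetic from the edit above. The $0.024 per minute figure is the one quoted in the comment; the GPU hourly price is a purely hypothetical placeholder (it is not stated anywhere in the thread), picked only so that the result lines up with the "2 hours of transcription per hour" estimate. Plug in a real spot or serverless price to get a meaningful number.

```python
# Price quoted in the comment for AWS Transcribe, in USD per minute of audio.
AWS_TRANSCRIBE_USD_PER_MIN = 0.024


def break_even_speed(gpu_usd_per_hour: float,
                     api_usd_per_audio_min: float = AWS_TRANSCRIBE_USD_PER_MIN) -> float:
    """Hours of audio a self-hosted model must transcribe per GPU-hour
    for its compute cost to match the per-minute API price."""
    api_usd_per_audio_hour = api_usd_per_audio_min * 60
    return gpu_usd_per_hour / api_usd_per_audio_hour


# ASSUMPTION: 2.88 USD/hour is a made-up GPU price for illustration only.
hypothetical_gpu_usd_per_hour = 2.88
speed = break_even_speed(hypothetical_gpu_usd_per_hour)
print(f"Break-even: {speed:.2f} hours of audio per GPU-hour")
```

Anything faster than the break-even speed is cheaper per audio hour than the API on compute alone; the devops overhead mentioned above is not modeled.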
by satvikpendem on 10/13/2024, 4:24:26 AM
Can it diarize?
by jbellis on 10/13/2024, 3:12:51 PM
OOMs even in quantized mode on a 3090. What's a better option for personal use?
> torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 43.71 GiB. GPU 0 has a total capacity of 24.00 GiB of which 20.74 GiB is free.
by unshavedyak on 10/13/2024, 3:33:11 PM
Can't wait for a bundle of something like this with screen capture. I'd love to pipe my convos/habits/apps/etc to a local index for search. Seems we're getting close
by deegles on 10/13/2024, 5:29:46 AM
Does it do diarization?
by mrkramer on 10/13/2024, 11:31:26 AM
With timestamps?! I gotta try this.
by riiii on 10/13/2024, 9:27:29 PM
Which languages does it support?
by frozencell on 10/13/2024, 9:28:49 AM
Does it work with chorus?
by BLACK_hHOLE2729 on 10/13/2024, 6:30:29 AM
[flagged]