MacWhisper: Transcribe audio files on your Mac

  • I've been using MacWhisper for a few months; it's fantastic.

    Sometimes I'll send an mp3 audio file or mp4 video through it and use the resulting transcript directly.

    Other times I'll run a second step through https://claude.ai/ (because of its 100,000 token context) to clean it up. My prompt for that at the moment is:

    > Reformat this transcript into paragraphs and sentences, fix the capitalization and make very light edits such as removing ums

    That's often not necessary with Whisper output. It's great if you extract captions directly from YouTube, though - I wrote more about that here: https://simonwillison.net/2023/Aug/6/annotated-presentations...
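    For short transcripts, a rough local approximation of that cleanup step can be sketched in Python (the filler-word list and capitalization heuristics are my own guesses, not what Claude actually does):

```python
import re

# Filler words to strip; extend the alternation to taste.
FILLERS = re.compile(r"\b(um+|uh+|erm+)\b[,.]?\s*", re.IGNORECASE)

def clean_transcript(text: str) -> str:
    """Remove filler words, collapse whitespace, and capitalize sentences."""
    text = FILLERS.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Capitalize the first letter of each sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s[0].upper() + s[1:] if s else s for s in sentences)

print(clean_transcript("um, so whisper is, uh, pretty good. it runs locally."))
# → So whisper is, pretty good. It runs locally.
```

    This obviously can't restructure paragraphs the way an LLM can; it only handles the mechanical part of the prompt.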

  • I have a Python script on my mac that detects when I press-and-hold the right option key, and records audio while it's pressed. On release, it transcribes it with whispercpp and pastes it. Makes it very easy to record quick voice notes. Here it is: https://github.com/corlinp/whisperer/tree/whisper.cpp

    I was working on a native version in the form of a taskbar app with a customizable prompt and all. However, I quickly realized that the behaviors I want require a bunch of accessibility permissions that would block it from the App Store and require more setup steps.

    Would anybody still find that useful?
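    The core press-and-hold flow can be sketched independently of the hardware hooks. This is a minimal state machine only: the real recording, whisper.cpp transcription, and paste steps are stubbed out as callbacks, and the "alt_r" key name stands in for the right Option key as reported by a listener library such as pynput:

```python
class HoldToRecord:
    """Start recording while a key is held; transcribe and paste on release."""

    def __init__(self, start_recording, stop_and_transcribe):
        self.start_recording = start_recording          # e.g. begin mic capture
        self.stop_and_transcribe = stop_and_transcribe  # e.g. run whisper.cpp, paste text
        self.recording = False

    def on_press(self, key):
        if key == "alt_r" and not self.recording:
            self.recording = True
            self.start_recording()

    def on_release(self, key):
        if key == "alt_r" and self.recording:
            self.recording = False
            self.stop_and_transcribe()

events = []
rec = HoldToRecord(lambda: events.append("start"), lambda: events.append("stop"))
rec.on_press("alt_r")    # key goes down: start recording
rec.on_press("alt_r")    # key auto-repeat: ignored
rec.on_release("alt_r")  # key released: transcribe and paste
print(events)  # ['start', 'stop']
```

    The `recording` flag is what keeps OS key auto-repeat from restarting the capture mid-hold.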

  • Whisper is cool. Back in college, 10-12 years ago, I wanted to do some projects using speech-to-text and text-to-speech as an interface, but at that point the only option was Google APIs that charged by the word or second.

    On top of that, constantly sending data to Google would have chewed through a ton of battery compared to the "activation word" style solutions ("OK Google"/"Hey Siri") that can be done on-device. The power needed for on-device processing was obviously going to come down over time, while wireless is much more governed by the laws of physics, and connectivity power budgets haven't come down nearly as much. I am pretty sure there is a fundamental asymptotic limit here, governed by the Shannon limit (channel capacity for a given bandwidth) and power output: in the presence of a noise floor of X, for a bandwidth of Y, you simply cannot use less than Z total power to move a given amount of data.

    BTLE is really the first game-changer (especially if you are hooking into a broad network of receivers like Apple does with AirTags), but even then you are not really breaking this rule - you are just transmitting less often and sending less data. It's just a different spot on the curve that happens to be useful for IoT. If you are, say, running a keyboard over BTLE where the duty cycle is higher, the power will be too. Applications that need a "100% duty cycle" (interactive, reachable at any time with minimal latency) still have not improved very much.

    In hindsight I guess the answer would have been writing a mobile app that ties into Google/Siri keywords and actions, letting the phone be the UI and only transmitting BT/BTLE to the device. But BTLE hadn't hit the scene back then (or at least not nearly to the extent it has now) and I was less experienced/less aware of that solution space.
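    The Shannon argument above can be made concrete with the Shannon–Hartley theorem, C = B·log2(1 + S/N): solving for the signal power S gives the floor any radio must pay to move data at rate C over bandwidth B against noise power N. A sketch with illustrative numbers (the watt values are made up, not measurements):

```python
import math

def min_signal_power(rate_bps, bandwidth_hz, noise_w):
    """Minimum signal power (watts) for rate_bps over bandwidth_hz, from
    Shannon-Hartley: C = B * log2(1 + S/N)  =>  S = N * (2**(C/B) - 1)."""
    return noise_w * (2 ** (rate_bps / bandwidth_hz) - 1)

def average_power(tx_power_w, duty_cycle):
    """The BTLE trick: duty cycling lowers *average* power by transmitting
    less often, but the per-transmission physics is unchanged."""
    return tx_power_w * duty_cycle

# Doubling the rate over the same channel more than doubles the required
# power (the exponential in Shannon-Hartley) - the asymmetry described above.
p1 = min_signal_power(1e6, 1e6, 1e-9)  # 1 Mbps over 1 MHz
p2 = min_signal_power(2e6, 1e6, 1e-9)  # 2 Mbps over the same channel
assert p2 > 2 * p1

# A 1% duty cycle cuts average power 100x - at the cost of 100x less airtime.
assert math.isclose(average_power(0.01, 0.01), 1e-4)
```

    This is why "reachable at any time with minimal latency" applications can't ride the duty-cycle curve: they have to keep the radio's average power close to its peak power.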

  • If you're looking for an alternative that runs on Linux, I just recently discovered Speech Note. It does speech to text, text to speech, and machine translation, all offline, with a GUI:

    https://flathub.org/apps/net.mkiol.SpeechNote

    https://github.com/mkiol/dsnote

    While whisper.cpp is faster than faster-whisper on macOS thanks to Apple's Neural Engine [0], if you have a GPU on Windows or Linux, faster-whisper [1] is a lot faster than both OpenAI's reference Whisper implementation and whisper.cpp. The CLI options are wscribe or whisper-ctranslate2, since faster-whisper itself is only a Python library. It's pretty good.

    [0] https://github.com/guillaumekln/faster-whisper/discussions/3...

    [1] https://github.com/guillaumekln/faster-whisper

  • This basically does the same thing but free:

    https://apps.apple.com/us/app/aiko/id1672085276

  • Here's a multi-platform open source app that does the same thing but uses Vosk instead of Whisper.

    https://github.com/bugbakery/audapolis

  • Been using it for a couple of months, and Jordi keeps improving it at a steady clip. It's great!!

  • I've used this for a few months to transcribe interviews and it works pretty well. The UI for dealing with multiple speakers is a bit cumbersome, and there are occasional crashes, but overall it's definitely a great app and worth the money.

  • The main problem I have faced with the Whisper model (large) is that when there is silence or a sizable gap without audio, it hallucinates and just repeatedly puts out random gibberish until the transcription ends. How does this app handle this?
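    For context (I don't know how MacWhisper handles it): the reference openai/whisper implementation reports a no_speech_prob and avg_logprob for every segment and uses them internally to detect silence (defaults 0.6 and -1.0), so one common mitigation is to re-apply a stricter version of that check to the output. A sketch with illustrative segment data:

```python
def drop_hallucinated(segments, no_speech_threshold=0.6, min_avg_logprob=-1.0):
    """Keep only segments Whisper is reasonably confident contain real speech.
    A stricter variant of the no_speech_prob / avg_logprob check that
    openai/whisper's own decoder applies; thresholds are illustrative."""
    return [
        s for s in segments
        if s["no_speech_prob"] < no_speech_threshold
        and s["avg_logprob"] > min_avg_logprob
    ]

segments = [
    {"text": "Hello everyone.", "no_speech_prob": 0.02, "avg_logprob": -0.3},
    {"text": "Thanks for watching! Thanks for watching!",  # classic silence loop
     "no_speech_prob": 0.92, "avg_logprob": -1.8},
]
print([s["text"] for s in drop_hallucinated(segments)])  # ['Hello everyone.']
```

    Trimming long silences with a VAD pass before transcription helps too, since the gibberish loops tend to start on empty audio.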

  • https://github.com/MahmoudAshraf97/whisper-diarization

    This project has been alright for transcribing audio with speaker diarization. A bit finicky. The OpenAI model is better than other paid products (Descript, Riverside), so I’m looking forward to trying MacWhisper.

  • I really like this app. I wish there was a way to play the video while editing the subtitles, though!

  • There is a great library that supports not only OpenAI's Whisper but many other engines that also work offline: https://github.com/Uberi/speech_recognition

  • Out of curiosity, does anyone know what the state of the art for transcription is? Is there a possibility it will soon be "better than a person carefully listening and manually transcribing"?

    I ask because I asked a friend to record a (for fun) lecture I couldn't attend, and unfortunately the speech audio levels are quite low, and I'm trying to figure out how to extract as much info as possible so I can hear it. If I could add context to the transcriber like "This is about the Bronze Age collapse and uses terminology commonly used in discussions on that topic", it might be even more useful.
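    Whisper does support exactly that kind of hint: the reference openai/whisper CLI takes an --initial_prompt string that biases the decoder toward your vocabulary, and ffmpeg's loudnorm filter can bring the quiet speech up first. A sketch (filenames and the prompt wording are placeholders):

```shell
# Normalize the quiet recording (ffmpeg's loudnorm loudness filter),
# then give Whisper domain vocabulary via --initial_prompt.
ffmpeg -i lecture.m4a -af loudnorm lecture-norm.wav
whisper lecture-norm.wav --model medium \
  --initial_prompt "A lecture on the Bronze Age collapse: Sea Peoples, Hittites, Ugarit, Mycenae."
```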

  • A few weeks ago I found myself wanting a speech-to-text transcriber that directly captures my computer's audio output (i.e. not mic input, not an audio file), but I could not find one. The best alternative I found was to have my computer direct audio output to a virtual audio input device, but I could not do this on my desktop because I do not have a sound card. I found software that did this, but it did not allow me to listen to the audio output while it was redirected to a virtual audio input.

    Has anyone else tried to do something similar? How did you achieve it?
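    On Linux with PulseAudio or PipeWire this is built in: every output sink exposes a .monitor source you can record from while the audio keeps playing normally, no sound card tricks required (on macOS the usual equivalent is a virtual device like BlackHole). A sketch for the Linux case, assuming pactl 15+ for get-default-sink:

```shell
# Record whatever is playing through the default output device while
# still hearing it; the ".monitor" source taps the sink's output.
SINK=$(pactl get-default-sink)
parecord --device="${SINK}.monitor" capture.wav
```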

  • Love the idea behind this. High quality transcription + the data not leaving your device is excellent.

    Any chance there's an iOS version of this coming down the pike? It would be great to have a voice-based note-taking app for when you're driving or walking and don't want to type into your phone, but just want to quickly dictate a thought and have it accessible as text later.

  • I didn’t know whisper could differentiate voices for the per speaker transcription. Is that new? Is it also available in the command line whisper builds?

  • https://github.com/chidiwilliams/buzz

    brew install buzz

    It's great.

  • If you want a quick and free web transcription and editing tool, we've built https://revoldiv.com/ with speaker detection and timestamps. It takes less than a minute to transcribe an hour-long video or audio file.

  • Is gumroad a good platform for selling software like this? How is licensing handled?

  • Would be nice if it allowed importing mkv files; in the end it's just a container...

  • If you'd rather use a web app with minimal cost upfront check out PlainScribe :) https://www.plainscribe.com/

  • Does anyone know of an easy-to-use Whisper fork with speaker attribution (diarization)?

  • Shameless plug: recently launched LLMStack (https://github.com/trypromptly/LLMStack) and I have some custom pipelines built as apps on LLMStack that I use to transcribe and translate.

    Granted my use cases are not high volume or frequent but being able to take output from Whisper and pipe it to other models has been very powerful for me. It is also amazing how good the quality of Whisper is when handling non English audio.

    We added LocalAI (https://localai.io) support to LLMStack in the last release. Will try to use whisper.cpp and see how that compares for my use cases.

  • Seriously great program. The licensing model is just fine. I use this all the time, and so do my colleagues at other companies.

    The developer, Jordi, has given a great talk online about product development.

  • Is this just a front end to OpenAI's Whisper?

    https://github.com/openai/whisper

  • Seems shady to me to charge for running larger free models you don't provide on hardware your users provide. You are charging for OpenAI's features, not yours.

  • Many such apps exist. I use Hello Transcribe from the App Store, $7 across all iDevices, with CoreML optimization.

  • I’ve gotten confused between the different Whispers. How is this different from the OpenAI API endpoint?

  • Great tool, but I can't wait until it can do real-time live transcribing.

  • superwhisper.com is also cool

  • So this is not Whisper Transcription 4 from the App Store?

  • Any insight on how Whisper works on older Intel Macs? I have a 2012 Mac mini with 16GB of RAM doing nothing; if I could use it to (slowly) transcribe media in the background, this becomes a must-buy.

  • Anyone have a cached page? Seems to have been hugged to death.

  • Why? Just use whisper directly. The model and code are available, and I think there’s even a homebrew formula...