As an aside, has anyone else seen big hallucinations in the Gemini Meet summaries? I've been using them for a week or so and love the quality of the writing, but I've noticed two recurring problems: omitting what was actually the most important point raised, and hallucinating things like “person x suggested y do z” when, really, that is absolutely the last thing x would ever suggest!
ASR: Automatic Speech Recognition
This is pretty cool. But at the risk of a digression, I can't imagine sharing my API keys with a random website on HN. There has to be a safer approach to this, like limited-use API keys, rate-limited API keys, or deliberately expendable throwaway keys.
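For anyone wondering what a safer approach could look like in practice, here's a minimal sketch of one option: keep the key on a server you control and expose only a rate-limited proxy. Flask, the /correct endpoint, the env var name, and the limit are all illustrative assumptions, not anything this site actually does:

```python
# Minimal sketch: the API key stays server-side; visitors only ever
# talk to this rate-limited proxy, never to the provider directly.
import os
import time
from flask import Flask, request, jsonify

app = Flask(__name__)
API_KEY = os.environ["GEMINI_API_KEY"]  # hypothetical env var; never leaves this server

_calls = []
MAX_CALLS_PER_MINUTE = 10  # illustrative limit

@app.route("/correct", methods=["POST"])
def correct():
    now = time.time()
    _calls[:] = [t for t in _calls if now - t < 60]  # drop calls older than a minute
    if len(_calls) >= MAX_CALLS_PER_MINUTE:
        return jsonify(error="rate limited"), 429
    _calls.append(now)
    transcript = request.json.get("transcript", "")
    # ... forward `transcript` to the LLM provider using API_KEY and return its output ...
    return jsonify(received=len(transcript))
```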
Seems like one of the places where LLMs make a lot of sense. I see some boneheaded transcriptions in videos pretty regularly. Comparing them against "more-likely" words or phrases seems like an ideal use case.
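A minimal sketch of that idea, assuming an OpenAI-style chat API; the model name and prompt wording are my own illustrations, not from the post:

```python
# Hand the raw ASR output to a chat model and ask for a minimally
# edited correction of likely mis-hearings only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are correcting an automatic speech-recognition transcript. "
    "Fix only words that are likely mis-heard (homophones, garbled proper "
    "nouns). Do not rephrase, summarize, or change the speaker's wording."
)

def correct_transcript(raw: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model would do
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": raw},
        ],
        temperature=0,  # favor the most likely reading, not creativity
    )
    return resp.choices[0].message.content

print(correct_transcript("the house passed the bill with a too thirds majority"))
```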
Nice use of an LLM. We use Groq 70B models for this in our pipelines at work, after running WhisperX ASR on meeting files and such.
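Roughly, a pipeline like that could look like the sketch below. The model names, the prompt, and the use of Groq's OpenAI-compatible endpoint are assumptions based on public docs, not the commenter's actual code:

```python
# Sketch: WhisperX for ASR, then a Groq-hosted 70B model for cleanup.
import os
import whisperx
from openai import OpenAI  # Groq exposes an OpenAI-compatible endpoint

# Step 1: transcribe the meeting audio with WhisperX.
model = whisperx.load_model("large-v2", device="cuda")
audio = whisperx.load_audio("meeting.wav")
result = model.transcribe(audio)
raw_text = " ".join(seg["text"] for seg in result["segments"])

# Step 2: send the raw transcript to a fast 70B model for cleanup.
groq = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)
resp = groq.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumption: whichever 70B model Groq offers
    messages=[
        {"role": "system", "content": "Clean up this meeting transcript without changing its meaning."},
        {"role": "user", "content": raw_text},
    ],
)
print(resp.choices[0].message.content)
```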
That's one of the better reasons I've found to use Cerebras/Groq: you can get huge amounts of clean text back fast for processing in other ways.
Using an LLM to correct text is a good idea, but the text transcript doesn't carry any information about how confident the speech-to-text conversion was. Whisper can output a confidence score for each word; feeding that through would probably make for a better pipeline. It would surprise me if Google doesn't do something like this soon, although maybe a good speech-to-text model is too computationally expensive for YouTube at the moment.
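A sketch of what the confidence-aware version might look like with openai-whisper; the 0.5 threshold and the file name are arbitrary illustrations:

```python
# Have Whisper emit per-word probabilities and flag only the
# low-confidence words for an LLM to reconsider.
import whisper

model = whisper.load_model("small")
result = model.transcribe("clip.mp3", word_timestamps=True)

suspect = []
for segment in result["segments"]:
    for word in segment.get("words", []):
        if word["probability"] < 0.5:  # low ASR confidence
            suspect.append((word["word"], word["start"], word["probability"]))

# An LLM pass could now be asked to reconsider only these spans,
# instead of rewriting the whole transcript.
for text, start, prob in suspect:
    print(f"{start:7.2f}s  p={prob:.2f}  {text!r}")
```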
Can I use this to generate subtitles for my own videos? I would love to have subtitles on them but I can't be bothered to do all the timing synchronization by hand. Surely there must be a way to automate that?
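Yes, this is automatable: ASR models like Whisper emit start/end timestamps per segment, so the synchronization comes for free. A hedged sketch with openai-whisper (model size and file names are placeholders; the whisper CLI can also write .srt directly via --output_format srt):

```python
# Transcribe a video and write the segments out as a timed .srt file.
import whisper

def srt_time(seconds: float) -> str:
    # Format seconds as SRT's HH:MM:SS,mmm
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = whisper.load_model("small")
result = model.transcribe("my_video.mp4")

with open("my_video.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```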
The main challenge with using LLMs pretrained on internet text for transcript correction is that they reduce verbatimicity: by its nature, an LLM wants to format every transcript as internet text.
Talking has a lot of nuances to it. Just try to read a Donald Trump transcript. A professional author would never write a book's dialogue like that.
Using a generic LLM on transcripts almost always reduces accuracy overall; we have endless benchmark data demonstrating this at RevAI. It does, however, help with custom vocabulary, rare words, and proper nouns, and some people prefer the "readability" of an LLM-formatted transcript. It will read more like a Wikipedia page or a book, as opposed to the true nature of a transcript, which can be ugly, messy, and hard to parse at times.
Google should have the tech needed for good AI transcription. Why don't they integrate it into their auto-captioning instead of offering those crappy auto subtitles?
Hmm, so this is expecting me to upload a personal API key...
The first time I used Gemini, I gave it a youtube link and asked for a transcript. It told me how I could transcribe it myself. Honestly, I haven't used it since. Was that unfair of me?
In my experience Gemini Advanced is still far behind ChatGPT and Claude. Recently it flat out refused to answer a fairly straightforward question, saying “I am just a large language model and cannot help you with that”. The conversation was totally benign, but it shit the bed so badly that I canceled my subscription right then and there.
Thinking about that time Berkeley delisted thousands of recordings of course content after a lawsuit complaining that they were not accessible to deaf users. Could that be resolved with current technology? Google's auto-captioning has been abysmal up to this point, and I've often wondered what it would cost Google to run modern tech over the entire backlog of YouTube. At least then they might have a new source of training data.
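As a pure back-of-envelope (both numbers below are loose assumptions, not reported figures):

```python
# Back-of-envelope only: assume ~1 billion hours of backlog video and
# the public Whisper API list price of $0.006 per audio minute.
BACKLOG_HOURS = 1_000_000_000   # assumed backlog size, not an official number
PRICE_PER_MINUTE = 0.006        # USD, OpenAI's Whisper API list price

cost = BACKLOG_HOURS * 60 * PRICE_PER_MINUTE
print(f"${cost:,.0f}")  # ~$360,000,000 at these assumptions
```

Google's internal cost per hour would presumably be far lower than retail API pricing, so this is an upper-end sketch at best.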
https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...
Discussed at the time (2017): https://news.ycombinator.com/item?id=13768856