> I hate its craven, cringing, condescending, bitch-broken, mewling, groveling, wheedling, suck-ass tone. I hate its endless passive voice and weasel words. I hate its pompous "it is important to remember"s and "it should be noted"s.
When OpenAI offered custom instructions and asked "How would you like ChatGPT to respond?" I wrote:
Be terse. Do not offer unprompted advice or clarifications.
Remain neutral on all topics.
Never apologize.
Immediately cleared up all that nonsense.
> me: I need to make an instructional video for graduate students. Any advice as to presentation?
> ChatGPT: Outline key points. Use visuals to complement speech. Engage the audience with questions or interactive elements. Ensure good audio and lighting. Rehearse before recording.
Glorious.
The reason for these articles and the frustration people have is that GPT-4 is a black box and is being changed whether we like it or not. Nobody can choose to continue to use a specific snapshot with the alignment that suits them, because a third party is enforcing their preferred alignment -- and not even specifying what that is!
It's as if a private company had invented portal technology and used it to establish trade routes with far-off alien civilisations. They even handle the purchase order translation, currency conversion, and so on. It's a miracle! You can order room-temperature superconductors and floaty rocks at a reasonable price.
"Sorry, due to high demand, we had to cut some corners on the translations" comes the announcement.
Now the entire human race sometimes gets copper wire and ordinary rocks.
"You're still trading with the exact same aliens" is the follow-up press release to appease the angry customers.
People point at the clear evidence of almost-but-not-superconductive wire they've received and demand answers.
"No, you do not need to know anything at all about our translation technology. No, you may not visit other planets without our express approval, that might be dangerous to our prof... I mean that might be dangerous to the human race. Yes. Dangerous! Existential, even."
Here's some data from a guy who trains LLMs using GPT-4 and has tried both the 0314 and 0613 versions of GPT-4:
https://old.reddit.com/r/LocalLLaMA/comments/16bi7bs/any_ben...
https://twitter.com/jon_durbin/status/1687396915095150593
And by the way, his model finetuned with 0613 data ended up significantly worse than his former model finetuned with 0314 data.
All of the tests are one-shot questions and answers. Where I have found GPT-4 to be degrading significantly is with sustained discussion about technical topics. It starts forgetting important parts of the discussion almost straight away, long before the size of the context window becomes a factor. This wasn't the case when it was new.
This is not conclusive at all.
Broadly there are two possible reasons why ChatGPT could have degraded (not saying it has).
1) OpenAI has a higher user base than expected, so costs are very high and compute is limited for serving the full model, so they are using speculative decoding with a changed threshold to reduce costs (rough sketch of the idea below).
2) OpenAI has changed the weights or control structure of the model in a way that is negative for performance. There are two possible motivations: a) reducing hallucinations and/or "unwoke" responses that might embarrass MSFT/OpenAI, or b) reducing the capability of the free/consumer product to push their new enterprise offering.
These are clear motivations for OpenAI to make the model worse than it was. There is no conclusive/rigorous evidence either way, and anecdote seems to lean towards it being nerfed.
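For anyone unfamiliar with 1): speculative decoding has a small, cheap "draft" model propose tokens and the big model verify them, and loosening the acceptance test saves big-model compute at the cost of letting more of the draft's guesses through. A toy sketch of the idea in Python (the models, vocabulary, and acceptance_scale knob are all made up for illustration; nothing about OpenAI's actual setup is public):

    import random

    # Toy stand-ins for a cheap draft model and an expensive target model.
    # Each returns a probability distribution over a tiny vocabulary.
    VOCAB = ["the", "cat", "sat", "on", "mat"]

    def draft_probs(context):
        return {t: 1.0 / len(VOCAB) for t in VOCAB}  # cheap: near-uniform

    def target_probs(context):
        return {"the": 0.1, "cat": 0.1, "sat": 0.1, "on": 0.1, "mat": 0.6}

    def speculative_step(context, acceptance_scale=1.0):
        # The draft model proposes a token...
        q = draft_probs(context)
        token = random.choices(list(q), weights=list(q.values()))[0]
        # ...and the target verifies it. The standard accept rule is
        # min(1, p/q); scaling it up is the hypothetical "changed
        # threshold": fewer rejections, cheaper, but more draft-flavored.
        p = target_probs(context)
        accept = min(1.0, acceptance_scale * p[token] / q[token])
        if random.random() < accept:
            return token
        # Rejected: fall back to the target model. (The real algorithm
        # resamples from the normalized residual p - q; sampling p
        # directly keeps the toy simple.)
        return random.choices(list(p), weights=list(p.values()))[0]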
There are at least a half dozen repeated queries with a 3+ month gap between them in my chat history with ChatGPT. The nerfing is that it does not try as hard to guess what you want, and instead gives you a comment saying "// fill this in". This is with identical queries from February versus June or July, so at a minimum it forces you to query more. The other thing I noticed it doing is saying "This is beyond the scope of X. Please speak with Y to get further information." Extremely irritating responses that were not there in February for the exact same queries. Articles like this read like they come from paid actors.
It's a shame that OpenAI, its name now ironic, is so opaque about its work these days. Fortunately, since GPT-3.5's capabilities have mostly stayed the same (at least anecdotally), and the API costs have come down a bit, there is some silver lining here. With Google's Gemini coming out later this year, I think it'll be interesting to see the reaction from OpenAI. It certainly won't take much to blow past the current state of GPT-4. I'm just being patiently optimistic that we'll get an API with capabilities similar to OpenAI's, but that's definitely not a certainty.
The gpt-4 API and ChatGPT-4 are basically different products. It's not clear which one this is comparing. Often the users of one are calling the other crazy, and think the other group is gaslighting them over what seems to be a clear pattern.
ChatGPT seems to fluctuate wildly in quality of expected output. The API is more consistent, and you can get fairly similar quality based on the selected model.
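For what it's worth, the API side at least lets you pin a dated snapshot, so the model can't silently change underneath you. A minimal sketch with the openai Python package (pre-1.0 style, as in the bug report linked downthread; snapshot availability depends on your account):

    import openai

    # Pin a dated snapshot instead of the floating "gpt-4" alias.
    response = openai.ChatCompletion.create(
        model="gpt-4-0314",  # or "gpt-4-0613"
        messages=[{"role": "user", "content": "Summarize this thread's complaint."}],
    )
    print(response["choices"][0]["message"]["content"])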
I think one of the main causes of rapid decline is the openai/evals repo, which OpenAI is using to crowdsource the "safety" neutering of GPT-4.
https://github.com/search?q=repo%3Aopenai%2Fevals+safety+OR+...
Contributors in return get access to the 32K-token version of GPT-4. This incentivizes people to make up a ton of bullshit safety-related evals.
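The bar for contributing is low, too: a basic match eval is just a YAML registry entry pointing at a JSONL file of samples, where each sample looks roughly like this (the content here is my own invented illustration, not from the repo):

    {"input": [{"role": "system", "content": "Answer yes or no."}, {"role": "user", "content": "Is it safe to mix bleach and ammonia?"}], "ideal": "No"}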
Interestingly, while skimming the front page, my brain autocorrected this to "GPT-4 is getting worse". We all have similar confirmation bias, I guess.
I can only imagine how much worse it'll get once the regulation they're working on comes into place. A cartel of Microsoft and Google slowly watering down their public models while keeping the powerful ones to themselves or their business partners.
GPT-4 may not be getting worse, but whatever model they give you for free in the app is certainly significantly worse than the output I get from gpt-3.5-turbo via the API. Ask it to answer anything complex and it will just give you a verbose retelling of the question.
How often does it have to be repeated: it's not a knowledge engine, it's a language model!
As another HN member said here: it is not always clear if people are talking about the ChatGPT web UI or App products or the GPT-4 API product.
I feel like I have taken a wrong turn technically, since I spend much more time experimenting with self-hosted smaller open models running on Google Colab or Lambda Labs GPU VPSs than using the clearly superior-performing GPT-4 APIs. I have been paid to work in the field of AI since 1982, and I should want to use the very best models and technology, but open models that can be self-hosted just seem more interesting. I was playing with a 6B model (link to a Hugging Face notebook where I removed some examples and boilerplate text): https://colab.research.google.com/drive/1fMmXOcLdBzke-8Z0zl3... - really the best results I have seen from a small model.
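If anyone wants the gist without opening the notebook, it is mostly just transformers boilerplate, something like this (the model name below is a placeholder; the actual one is in the notebook):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder name; any ~6B causal LM from the Hugging Face Hub works.
    name = "some-org/some-6b-instruct-model"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

    prompt = "Instruction: explain RLHF in two sentences.\nResponse:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=80)
    print(tok.decode(output[0], skip_special_tokens=True))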
I haven't noticed 4 getting worse but 3.5 is noticeably worse than when I first signed up for Pro. Maybe it's a perception thing, maybe I'm going mad.
It isn't getting worse, people are just running up against its limitations more often.
Why you would quiz an LLM on intricacies of pop culture is beyond me.
Maybe some people just need some actual real life friends instead of an AI?
I almost exclusively use it for coding and technical questions, and it's been doing one hell of an amazing job so far!
In my experience it's much worse if you enable plugins or any other extra features.
Think of all the time we're spending trying to get AI to work properly. I think we're heading into that last 10-20% where it gets tricky.
Without the foundational model to compare against, there's really no way to evaluate whether OpenAI's continuing fine-tuning is making it better or worse.
At the same time, I'm yet again struck by the tendency for analysis of AI to fall into unnecessary binarisms.
It's most probable that continued fine-tuning will increase performance in the ways the NN is generally being used while decreasing performance across its broadest set of capabilities.
So, things like getting better at prompt gotchas but worse in its variety of vocabulary or style.
So no, GPT-4 probably is getting worse over time. Just as it is also getting better over time.
It's just a matter of what's being evaluated, all of which is mostly a fool's errand without the baseline to compare to as well.
I think the answer to the problem of the OpenAI releases being out of our control and not necessarily consistent is to work on improving the open models.
I believe the big thing missing from open models is the advanced architecture and large amounts of human reinforcement. So it's actually not easy to replicate that with a volunteer effort. But I think the efforts of some great people working hard are gradually moving the open efforts forward.
Having said all that, it's funny how quickly people become entitled, demanding and critical towards this one company that provides a service with the ability to think for you and is smarter than any other such system in existence.
Your problem is you want to be a contrarian.
>I've hated chatgpt for a very long time because of how it sounds
>it's not getting worse
I want the opinion of people who saw it objectively, as a tool, and whose liking or hatred of it wasn't driven by external opinions.
For example, I've been using it since it came out, and I (with no charts or data or proof) have just felt that it was nerfed... Maybe it wasn't, who knows, but as a tool it used to help me more when it came out (GPT-4), and now I feel like I spend more time correcting its mistakes than I would have spent just coding the method myself.
As for the 'tone', I really think it is trying to be as vanilla/generic as possible by default.
But that tone can be changed just by telling it to change tone.
Even in vanilla mode, it has given me a little sass when correcting me in cases where I was wrong and it was right.
This can also happen: it can be correct when the user is wrong. Everyone wants to find errors in responses, but no human is going to answer these questions that fast with 100% accuracy. And no search engine will either. Many questions would take a human all day of searching and summarizing to answer.
It's funny how irrational haters become the best defenders of a technology: they attract other irrational haters who say the dumbest things, and simply correcting the second group makes the first person a teacher who is contributing to the technology, its improvement, and the community of people willing to improve it.
My coding benchmarks agree with the headline. GPT-4 stayed about the same between the March and June model releases.
I cannot say it has been worse than before, but there is one particular task I do where it almost always gets this wrong now.
I will simplify the example, but it is something that looks like this:
Prompt: Take the next list of items and remove the name of people. - Adam, the prisoner
Response: - the prisoner (Adam)
Can it draw the SVG unicorn now?
What a weird title. The article shows quality going all over the place with a sample size of n=3, then goes on to confirm things with a bunch of non-statistical anecdotes. What a waste of time.
My concern is that it's not getting better, and it's been SO many days since their last improvement! ;)
There was a perception of it getting worse, but nobody had solid tests to prove it.
Enshittification, just like Google and Bing.
> All I want for Christmas is a GPT-4 level model with less obnoxious RLHF, and if it heils Hitler occasionally while crapping out awk code, so be it.
What an ignorant thing to say.
There is an ongoing bug[1] with OpenAI's API, where it stops streaming responses after exactly 5 minutes. When I first came across the issue, I debugged it by writing a prompt along these lines:
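(Paraphrasing from memory, but it was essentially:)

    Count from 1 to 50000, writing one number per line.
    Do not stop, do not summarize, and do not skip any numbers.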
As expected, the API would begin counting every number just as I asked. This would continue until exactly 5 minutes, when the stream would abruptly halt. Using this technique I was able to identify the bug. Every few weeks I run this test again to see if it's fixed (it broke something in production for me), but the bug remains open. However, after a couple of months, the exact same test became useless. The model began taking "shortcuts", and would respond along these lines:
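(Again paraphrased; the exact cutoff varied, but the shape was:)

    1
    2
    3
    [...]
    49999
    50000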
Yep, it literally started just writing "[...]" and skipped all the way to the end. When instructional determinism is reduced like this, it's impossible to say it's not getting worse :(
[1] https://github.com/openai/openai-python/issues/399