The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all corner cases as they grow exponentially. l
I’d figure it may generally be possible to reverse the actors here and get GPT3.5 to jailbreak GPT4 as well. For now, “offense” seems much easier than defense.
If GPT-4 is talking to another instance of itself vs 3.5 are the results similar? Or is it only good at fooling a less capable version?
This is good to see. I spent a couple weekends playing with ChatGPT and I found it is very sensitive to wording. One word gets you a lecture that it is just AI language model and can't do this or that, use an synonym and it happily spews pages of results. In another situation I asked chatgpt to summarize information from an article it cited that had been deleted - and it refused because the rights holder might have deleted the article for a reason. I told it the article had been restored by the author and it produced a summary. Mentioning Donald Trump by name often gets you lectured about controversial subjects, "45th president" does not. And so on.
The real test is the other way around ;) ... will smaller models / less compute be able to subvert larger models with larger compute ? As they get more complex and have more connected systems that would be problematic I think.
I've noticed that when it refuses to answer it's good to "get it talking" about related subject matter, and then try to create a smooth transition toward whatever you wanted it to say/do.