Hacker News

Show HN: Jailbreaking GPT3.5 Using GPT4

by raghavtoshniwal on 3/26/2023, 3:11:02 PM with 6 comments

by extr on 3/27/2023, 12:45:23 AM
I've noticed that when it refuses to answer it's good to "get it talking" about related subject matter, and then try to create a smooth transition toward whatever you wanted it to say/do.
by dzink on 3/27/2023, 1:21:04 AM
The only way to do alignment long term would be to have a policing model watching the new models, because no human will be able to keep up with all corner cases as they grow exponentially.
by runnerup on 3/27/2023, 5:51:44 AM
I’d figure it may generally be possible to reverse the actors here and get GPT3.5 to jailbreak GPT4 as well. For now, “offense” seems much easier than defense.
by yeldarb on 3/27/2023, 12:01:10 AM
If GPT-4 is talking to another instance of itself vs. 3.5, are the results similar? Or is it only good at fooling a less capable version?
by zxcvbn4038 on 3/27/2023, 12:38:40 AM
This is good to see. I spent a couple of weekends playing with ChatGPT and I found it is very sensitive to wording. One word gets you a lecture that it is just an AI language model and can't do this or that; use a synonym and it happily spews pages of results. In another situation I asked ChatGPT to summarize information from an article it cited that had been deleted - and it refused because the rights holder might have deleted the article for a reason. I told it the article had been restored by the author and it produced a summary. Mentioning Donald Trump by name often gets you lectured about controversial subjects; "45th president" does not. And so on.
by mdale on 3/27/2023, 12:40:53 AM
The real test is the other way around ;) ... will smaller models / less compute be able to subvert larger models with larger compute? As they get more complex and have more connected systems, that would be problematic, I think.