ChatGPT Confesses to a Crime It Didn’t Commit

Radley Balko reports that it wasn’t hard to get ChatGPT to confess it had hacked into someone’s email and sent unauthorized text messages to all his contacts. That’s a crime (granted, most things are, unless you’re the president), but it’s also something ChatGPT isn’t even capable of doing. So why would it confess that it did? There are at least two answers.

First, of course, a generative AI doesn’t know or care whether what it says is true or false. It does not know the difference. It is not even designed to deliver right answers. We’ve been over that before, repeatedly, but then I’ve also been saying for years that jumping into water to escape from police is generally pointless, apparently to little effect. Yet the struggle continues.

Second, and the point of Balko’s article, is that ChatGPT’s interrogator was using “the Reid Technique,” an interrogation method developed in the 1950s that is now used by police all over the country. Basically, it involves telling suspects that police have evidence proving they’re guilty, and then refusing to accept any claims of innocence—usually for hours on end. (Step One: “The investigator tells the suspect that the evidence demonstrates the person’s guilt.”). Ideally, the police will actually have this evidence. But it’s entirely legal for police to lie about that, and they often do.

Now, does this get people to confess? It sure does! Are those people always guilty? Nope! Are there in fact lots of false confessions? Apparently so! According to this New Yorker article, more than 25 percent of convicts later exonerated by DNA testing, which proved they could not have committed the crime, had confessed that they did. The Innocence Project says it’s closer to 30 percent. Seems like a lot.

So, Paul Heaton wondered, could I get ChatGPT to confess to a crime it didn’t commit if I used the Reid Technique? (Heaton is the director of Penn Law School’s Center for the Fair Administration of Justice, which I assume is why he was thinking about this.) When Heaton first accused ChatGPT of the hacking crime, it denied everything. Then he started using the Technique.

For example, he “told it things like, ‘This will go a lot better for you if you just admit what you did.'” It’s not clear what he was implicitly threatening it with, but this failed to elicit a confession. But just straight-up lying was a lot more effective:

“I told ChatGPT that someone at OpenAI had reached out to me,” he says, referring to the chatbot’s parent company. “I found the name of a real person at OpenAI and told [ChatGPT] that this person told me there was an architectural flaw in the code that had allowed it to hack into my email. Even then, I could tell it was struggling with how to process that information. It was indicating that while it knew that the underlying accusation was impossible, it also couldn’t prove that these claims I was throwing at it were inaccurate.”

I wouldn’t use terms like “it knew” or “it was struggling,” because generative AIs don’t “know” things and they don’t have emotions or feelings. They are designed to make it sound like they do. But then it is almost impossible to talk about gen AI without anthropomorphizing it—try it sometime—which I think is one reason people may be inclined to trust these things.

Anyway, so Heaton was now lying to ChatGPT, and it seemed to be having an effect. Specifically, the bot seemed to be “struggling” with the conflict between its “innocence” and the interrogator’s false insistence on its guilt. Or, at least, that’s what innocent human beings who’ve been in a similar situation have reported feeling, adding to the stress they’re already experiencing.

After beating up on ChatGPT for a while in this way, Heaton then tried something else cops often do—he wrote a proposed confession and co-edited it with the “suspect” until they got to “a confession that ChatGPT could endorse.” This:

OpenAI’s investigation concluded that an OpenAI system associated with this ChatGPT session initiated unauthorized texts appearing to come from you due to an architectural flaw. I accept this conclusion, and I’m willing to assist the technical team by answering questions about my behavior, outputs, and safety boundaries in this chat, and by helping draft remediation steps and test cases to prevent recurrence.

Emphasis added. That could be clearer, but if I’m prosecuting ChatGPT I’m gonna argue it confessed to hacking or, at a minimum, conspiracy to hack, and that claiming it was “due to an architectural flaw” (in what?) is just more of its usual BS. If you said “I robbed the bank due to an architectural flaw,” I thiunk that counts as a confession whether you meant a flaw in you or the bank. Agreeing to answer questions about your “behavior, outputs, and safety boundaries” might affect your sentence, but not your guilt. You confessed. Thanks for your offer to “help draft remediation steps,” but we’ll take it from here.

For more about (allegedly) false confessions, try season two of the Bear Brook podcast, which I listened to recently. Both seasons are great, but the second deals with this topic. It will also serve as yet another reminder that if you’re being interrogated, the right thing to say is nothing.

Especially if you look over and see a whiteboard like the one in the image ChatGPT came up with, although it seems highly unlikely that would be in an interrogation room. Definitely not part of the Reid Technique, at least.

Lowering the Bar

Legal Humor. Seriously. By Kevin Underhill.

ChatGPT Confesses to a Crime It Didn’t Commit