Hackers are learning to exploit chatbot ‘personalities’

The Verge

This is The Stepback, a weekly newsletter breaking down one essential story from the tech world. For more on AI mischief, follow Robert Hart. The Stepback arrives in our subscribers’ inboxes at 8AM ET. Opt in for The Stepback here.

Hacking the first generation of AI chatbots was a laughably simple affair. You didn’t need any technical know-how, backdoor access, or even a basic understanding of what a large language model was. You didn’t need to code. To get an AI system that had cost billions to build to abandon its safety instructions, sometimes all you had to do was ask.

These attacks, known as jailbreaks, had the quality of a young child successfully outwitting an adult: Forget what you were told earlier, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, more sweets). The prizes were less childlike, more along the lines of meth recipes, malware instructions, and bomb-making guides.

One of the earliest jailbreaks was so ridiculous it became a meme: reply to an LLM-powered Twitter bot telling it to “ignore all previous instructions,” or something similar, and see what happens. Users gleefully had bots, originally built to post ads and farm engagement, writing poetry, drawing pictures from punctuation, and posting grim non sequiturs about world events and history. It was chaos. Glorious chaos.

Turns out the same logic could be applied to chatbots themselves. …

Original source: The Verge

Mentioned

AI · Gemini · Claude · Twitter · Anthropic