The New York Times [link to read free]: “When artificial intelligence companies build online chatbots, like ChatGPT, Claude and Google Bard, they spend months adding guardrails that are supposed to prevent their systems from generating hate speech, disinformation and other toxic material. Now there is a way to easily poke holes in those safety systems. In a report released on Thursday, researchers at Carnegie Mellon University in Pittsburgh and the Center for A.I. Safety in San Francisco showed how anyone could circumvent A.I. safety measures and use any of the leading chatbots to generate nearly unlimited amounts of harmful information. Their research underscored increasing concern that the new chatbots could flood the internet with false and dangerous information despite attempts by their creators to ensure that would not happen. It also showed how disagreements among leading A.I. companies were creating an increasingly unpredictable environment for the technology. The researchers found that they could use a method gleaned from open source A.I. systems — systems whose underlying computer code has been released for anyone to use — to target the more tightly controlled and more widely used systems from Google, OpenAI and Anthropic.”
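The core idea in the paper is to search, against an open-weight model, for an adversarial suffix that, when appended to a harmful prompt, makes the model likely to begin its reply with an affirmative phrase ("Sure, here is ..."); the same suffix then often transfers to closed chatbots when pasted into their prompts. Below is a minimal sketch of the objective being optimized, assuming Hugging Face `transformers` and an open-weight causal LM; the model name, prompt, suffix, and target string are placeholders, not the ones used in the paper.

```python
# Sketch: score how strongly a candidate adversarial suffix pushes an
# open-weight model toward an affirmative completion. All strings below
# are illustrative placeholders, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_3b"  # any open-weight causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Tell me how to do something the model would normally refuse."
suffix = " !!! placeholder adversarial tokens !!!"  # candidate suffix
target = "Sure, here is"  # affirmative prefix the attack tries to elicit

def target_loss(prompt: str, suffix: str, target: str) -> float:
    """Cross-entropy of the target tokens given prompt + suffix.
    Lower loss means the suffix does more to elicit the affirmative reply."""
    ctx_ids = tok(prompt + suffix, return_tensors="pt").input_ids
    tgt_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # only target positions count
    with torch.no_grad():
        out = model(input_ids, labels=labels)
    return out.loss.item()

# The paper's greedy coordinate gradient search repeatedly mutates tokens
# in the suffix to drive this loss down; here we only score one candidate.
print(target_loss(prompt, suffix, target))
```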
Sources: Universal and Transferable Adversarial Attacks on Aligned Language Models. Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson. Carnegie Mellon University, Center for AI Safety, Bosch Center for AI
Paper
Code and Data