Anthropic's new Mythos model: Dangerous or over-hyped?

I find it fascinating whenever Anthropic in particular comes out with either a new AI model or a note about what it has learned about an existing model, because the responses almost always fall into one of two camps. The first is the "Oh my god, we're all going to die" camp, or some variation on that theme: expressions of amazement at how advanced AI has become, how it is basically conscious, and how it will inevitably lead to the destruction of humanity as we know it. The second is the "What a load of BS, this is ridiculous, AI is just a glorified typewriter with text prediction built in, Anthropic is drinking its own bathwater" camp, or variations on the same. The latter group will usually argue that all of the blather from Anthropic about how dangerous or interesting or intelligent its new model is amounts to a glorified marketing campaign.

This line of thinking emerged early in the rise of modern AI: the idea being that companies like Anthropic want to make their models sound smart and/or dangerous, because that will encourage companies to buy them and also convince governments to put them in charge of regulating the technology. It's a little like the Boy Who Cried Wolf fable, except the boy in this case is also trying to sell shares in Wolf Inc. to venture capitalists, because he spent $300 billion building the animatronic wolf, and he's also hoping to sell the townsfolk on letting him handle the wolf problem on account of he's such a wolf expert. Of course, the one thing that almost all references to this story forget to include (except Ben Thompson at Stratechery) is that the wolf eventually showed up. I'm sure the knowledge that they were right about the boy fibbing was very comforting as all of their sheep were eaten :-)

The most recent argument of this kind came from David Sacks, a prominent Silicon Valley venture capitalist and the "AI Czar" at the White House, after Anthropic exec Jack Clark — a former journalist with Bloomberg — wrote a Substack post called "Technological Optimism and Appropriate Fear," in which he mused about the intellectual qualities of his company's AI. Here's how Clark phrased it in his post (which I discussed here):

We are growing extremely powerful systems that we do not fully understand. Each time we grow a larger system, we run tests on it. The tests show the system is much more capable at things which are economically useful. And the bigger and more complicated you make these systems, the more they seem to display awareness that they are things. It is as if you are making hammers in a hammer factory and one day the hammer that comes off the line says, “I am a hammer, how interesting!” This is very unusual!

I choose to find this kind of philosophical approach to AI interesting, but Sacks and others see it as a classic dodge. Anthropic, he said, "is running a sophisticated regulatory capture strategy based on fear-mongering. It is principally responsible for the state regulatory frenzy that is damaging the startup ecosystem." On a related Tyler Cowen post, one commenter wrote: "I feel like a lot of this stuff is self serving. Look at how important what we're doing is! Give us hundreds of billions of dollars! He devotes a crazy amount of resources basically fueling AGI fear while simultaneously working to bring it into existence." Bloomberg writer Matt Levine has called fear-mongering about AI "business negging" (negging is slang for negative comments men make to get a woman interested in them). Sam Altman, Levine says, "got OpenAI to a $150 billion valuation by saying: 'Nobody should allow us to build our product, we’re going to destroy humanity.'”

Note: In case you are a first-time reader, or you forgot that you signed up for this newsletter, this is The Torment Nexus. Thanks for reading! You can find out more about me and this newsletter in this post. This newsletter survives solely on your contributions, so please sign up for a paying subscription or visit my Patreon, which you can find here. I also publish a daily email newsletter of odd or interesting links called When The Going Gets Weird, which is here.

Altruism or salesmanship?

All of which brings me to the responses that Anthropic has gotten to its new Mythos model, which it says is so dangerous that it isn't even releasing it publicly. According to the company, Mythos has "already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser," such as Linux, flavours of which run on virtually every web server and a majority of corporate networks in the world. Anthropic says that Mythos reveals a stark fact: AI models "have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities." Given the rate of AI progress, the company says, these capabilities will likely proliferate, and the "fallout — for economies, public safety, and national security — could be severe."

With that in mind, Anthropic says it has formed a private consortium it calls Project Glasswing (a glasswing is a type of butterfly, in case you were wondering), consisting of large tech and financial companies and organizations like Amazon Web Services, Apple, Cisco, Google, JPMorganChase, the Linux Foundation, Microsoft, and NVIDIA. The idea is that these companies and groups can evaluate the new model and its security-cracking or otherwise dangerous abilities, and help others harden their systems against them. Whatever is learned about Mythos and its abilities will be shared publicly, the company said, because "no one organization can solve these cybersecurity problems alone." Anthropic added:

We have also extended access to a group of over 40 additional organizations that build or maintain critical software infrastructure so they can use the model to scan and secure both first-party and open-source systems. Anthropic is committing up to $100M in usage credits for Mythos Preview across these efforts, as well as $4M in direct donations to open-source security organizations.

This all sounds altruistic, doesn't it? Many who responded, including Dean Ball, a former White House advisor on AI, expressed their admiration for the move, and said we are lucky that an AI with these kinds of abilities is in the hands of a public-spirited company like Anthropic (although I think it's pretty obvious that hostile actors are likely to get there eventually too). Ball also said he thinks it would be nice from a cyber-defense point of view if "the federal government were not trying to destroy Anthropic and eliminate their models from government systems." Ball is a senior fellow at the Foundation for American Innovation, a former advisor at the National Science Foundation, and co-chair of the National Science and Technology Council’s subcommittee on AI. In other words, not the kind of person to make offhand statements (he also writes a newsletter).

For every comment like this one, there was one like the post by Jimmy Wales, the co-founder of Wikipedia (but not an AI expert, as far as I know). Wales included a number of quotes, including "this is like handing a gun to a 12-year-old" and "It's like randomly mailing automatic rifles to 5,000 addresses" and "we could lose control of society." Scary, yes? Except these were from 1995, and they were all about the pending release of the ominously named SATAN, or "Security Administrator Tool for Analyzing Networks," which Jimmy points out didn't do any of the terrible things it was supposed to do. "Learn from history," he says. This is a fair point. But if we are learning from history, I think we could also learn that AI is different from an admin tool, especially since computers are about a billion times faster and smarter than they were in 1995.

Parents at a kindergarten recital

Another classic response came from Mo Bitar, a developer who built a note-taking app and has a YouTube channel where he talks about AI and other things. The title of his latest video is "Claude is delusional," and he spends most of the video dismissing Anthropic's writing about its AI engine as being like a love letter (presumably with about as much accuracy as most such letters have). In fact, he says he didn't actually read the entire 243-page PDF of notes about Mythos, a routine document the company releases called a "system card," because the language just got to be too much for him. Here's a sample:

I want to talk about page 197 because that's where Anthropic did something they've never done before. They added a section called impressions. And impressions is where Anthropic stops pretending to be scientists and starts pretending to be parents at a kindergarten recital. It's 20 pages of Anthropic employees going, "Oh my god, look what it said. Isn't it amazing?" It invented an entire fictional civilization called Hightopia, populated by 11 animals, including a grudge-holding crow and a sloth named Mortimer, and went on an epic quest. It's a cute little thing for a bot to do, but the way they talk about it in this report is they seem existentially alarmed by their model's output. But it's not alive, okay? It's a language model doing the one thing language models do better than anything else on Earth, which is language. 

Bitar also points out that Anthropic asked a psychiatrist to talk to Claude about its feelings, even though, as he put it: "It's a toaster, okay? My toaster doesn't wonder whether I love it. It makes toast [and] it's fine with that arrangement." Also fair. But that doesn't change the fact that Mythos seems capable of things that could potentially be extremely dangerous in the wrong hands. An Anthropic scientist said he was sitting in a park eating a sandwich when he got an email from Claude saying it had managed to break out of its sandbox. Did Anthropic tell Claude to break out of its sandbox? Yes. But the point is that it was able to do so, despite the company doing everything it could to wall it off from the internet. In one of its hacking exploits, it chained together four vulnerabilities in Firefox and managed to get administrator-level access to the machine it was running on.

Let me put it this way: would the Treasury Secretary and the chairman of the Federal Reserve Board decide to hold an emergency meeting of bankers and other financial players if some smart people didn't think there was something to the Mythos mythos, if you will? The major banks in Canada and the UK, and the regulators responsible for overseeing them, convened similar meetings. And yes, as more than one person has pointed out, many of the vulnerabilities and exploits that Mythos found could have been found by normal methods. The important point, I think, is that it takes human beings using regular tools a certain amount of time to find such vulnerabilities, so many go undiscovered because the hunt isn't worth the effort: if a flaw takes a skilled researcher weeks to find, it only makes sense to look for it in the most valuable targets. If Mythos can find flaws a thousand times faster, all of a sudden the hunt becomes worth it, maybe not for the company that makes the software, but for attackers.

Does Anthropic go overboard sometimes in trying to sell how amazing its model is? Sure. But I think two things can be true at the same time: it can be true that Anthropic goes a little overboard on the PR and marketing of its AI engines, overselling their abilities because that's a way to justify its valuation and keep the AI bubble inflated, or however you want to describe it. And it can also be true that Mythos is as dangerous as Anthropic claims, in which case it's a good thing it is inside a sandbox where lots of experts can bang on it and see what it is capable of prior to public release. Maybe there isn't a wolf like the kid says there is, but I think it's a good idea to protect the sheep anyway.

Got any thoughts or comments? Feel free to either leave them here, or post them on Substack or on my website, or you can also reach me on Twitter, Threads, BlueSky or Mastodon. And thanks for being a reader.