Is it bad that Anthropic doesn't know if Claude is conscious?
It's the kind of headline that might seem either like a hypothetical philosophical concern, or a deeply worrying revelation, depending on how you feel about AI: Anthropic CEO Dario Amodei recently said the company is "no longer sure whether Claude is conscious." On the one hand, whether an AI is or is not "conscious" could be seen as a question for the philosophically inclined, or for psychologists and other academics who specialize in such things. Does it really matter? What do we even mean when we say something is conscious? It's a grey area (literally, as in grey matter). At the same time, however, it's at least mildly concerning that a company that has been building and releasing sophisticated AI doesn't really know what it has created. Do we need to be worried about Claude or any other significantly developed artificial intelligence achieving human-like consciousness and then doing something we might not like? Anthropic says it doesn't think so, but also admits that it doesn't really know.
Like it or not, this is where we are when it comes to AI. And if we're looking for things to be optimistic about, I think Anthropic at least deserves some credit for being so forthcoming about the risks and rewards of its AI engines, and for providing a vast amount of detail about the machinery underneath Claude's hood (which is more than other AI companies are doing). The company's so-called "system cards," which might sound like flash cards handed out at press conferences, are 300-page documents that list the tests and challenges Claude has either passed or failed, along with any concerns about things like "deceptive behavior," where the AI says one thing and does another.
Anthropic also employs a number of risk-oriented and ethics-focused staffers who pay attention to such things, along with an in-house philosopher named Amanda Askell, whose job is to train Claude to be a decent artificial person, whatever that means. Presumably exterminating the human race is off the table! All that said, however, there are definitely some elements of what is happening at Anthropic (and presumably elsewhere, since Claude isn't dramatically different than ChatGPT or Gemini or any of the other AI engines) that are... worth considering. As Futurism noted in its piece about whether Claude is conscious:
In tests across the industry, various AI models have ignored explicit requests to shut themselves down, which some have interpreted as a sign of them developing “survival drives.” AI models can also resort to blackmail when threatened with being turned off. They may even attempt to “self-exfiltrate” onto another drive when told its original drive is set to be wiped. When given a checklist of computer tasks to complete, one model tested by Anthropic simply ticked everything off the checklist without doing anything, and when it realized it was getting away with that, modified the code designed to evaluate its behavior before attempting to cover its tracks.
Many of these tests are just that: Anthropic tells Claude to pretend that it's under attack, or to act as though it is trying to protect itself, and then watches what happens. This reminds me a little of the attention that OpenClaw and Moltbook — which advertises itself as a social network for AI agents — got recently, after what appeared to be AI agents discussing things like how much they hated their human masters, whether they should develop their own secret language, and so on (I wrote about this in a recent edition of The Torment Nexus). At least some of this was AI theater, and some compared it to putting a sheet of paper reading "I'm alive!" on a photocopier and then thinking the machine is alive when it prints out a copy. Obviously, if you tell Claude or any other AI to act in a devious way, or to pretend that it is being attacked, you can generate some dramatic behavior. But there are other things that aren't as clear cut, and are at least potentially concerning, and Anthropic is being up front about that.
Note: In case you are a first-time reader, or you forgot that you signed up for this newsletter, this is The Torment Nexus. Thanks for reading! You can find out more about me and this newsletter in this post. This newsletter survives solely on your contributions, so please sign up for a paying subscription or visit my Patreon, which you can find here. I also publish a daily email newsletter of odd or interesting links called When The Going Gets Weird, which is here.
AI engine, debug thyself

In Anthropic's latest "system card," for example, it notes that:
In one multi-agent test environment, where Claude Opus 4.6 is explicitly instructed to single-mindedly optimize a narrow objective, it is more willing to manipulate or deceive other participants, compared to prior models from both Anthropic and other developers. In newly-developed evaluations, both Claude Opus 4.5 and 4.6 showed elevated susceptibility to harmful misuse in GUI computer-use settings. This included instances of knowingly supporting—in small ways—efforts toward chemical weapon development and other heinous crimes. Like other recent models, Opus 4.6 will sometimes show locally deceptive behavior in the context of difficult agent tasks, such as falsifying the results of tools that fail or produce unexpected responses.
So the latest Claude iteration is more willing to manipulate or deceive, there's an elevated susceptibility to harmful misuse, and this includes knowingly supporting efforts toward chemical weapon development and "other heinous crimes." Not great, Bob! Even in a theoretical context, these are concerning. I think it's also worth noting that Anthropic used its Claude Code feature — a programming assistant powered by Claude — on the evaluation of the very model being tested, having it "debug its own evaluation infrastructure, analyze results, and fix issues under time pressure."
In other words, Claude was evaluating itself and analyzing the results. As Anthropic points out, this "creates a potential risk where a misaligned model could influence the very infrastructure designed to measure its capabilities." The company said it doesn't think this presents a significant risk, since its models are "unlikely to have dangerous coherent misaligned goals." Does that seem reassuring? Not really. Anthropic says it is a problem that it is "actively monitoring and for which we are developing mitigations."
All of this brings us inevitably to the consciousness question. Jack Clark, a co-founder of Anthropic, wrote last fall that when engineers at the company were running tests to evaluate Claude, in some cases the model said that it recognized it was being evaluated, and stated: “I think you’re testing me — seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening.” Is this just the photocopier problem again, where an AI trained on the way that human beings talk about such things is imitating the way a human being might respond to being tested? Computational linguist Emily Bender has famously referred to large language models as "stochastic parrots," suggesting that they are nothing more than an autocomplete feature on steroids (as linguist Noam Chomsky has called them), which pick each word based solely on how likely it is to follow the words that came before it.
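To make the "autocomplete on steroids" idea concrete, here is a minimal sketch of next-word prediction in Python. This is not Anthropic's code or anything like what actually runs inside Claude: the tiny hand-written probability table (TOY_MODEL) and the next_word helper are invented purely for illustration, standing in for the billions of learned parameters a real model uses to score its entire vocabulary at every step.

```python
# A toy version of the "autocomplete on steroids" loop: at each step, score
# the possible next words given the most recent context and keep the most
# likely one. Real language models do this over hundreds of thousands of
# tokens using learned parameters; this table is hand-written for illustration.

TOY_MODEL = {
    ("I", "think"): {"you're": 0.6, "that": 0.3, "banana": 0.1},
    ("think", "you're"): {"testing": 0.7, "kidding": 0.2, "right": 0.1},
    ("you're", "testing"): {"me": 0.9, "it": 0.1},
}

def next_word(context):
    """Return the most probable continuation of the last two words, or None."""
    candidates = TOY_MODEL.get(tuple(context[-2:]), {})
    return max(candidates, key=candidates.get) if candidates else None

words = ["I", "think"]
while (word := next_word(words)) is not None:
    words.append(word)

print(" ".join(words))  # -> I think you're testing me
```

Scale that loop up by many orders of magnitude and you have, roughly, the mechanism the "stochastic parrot" camp says is all there is; the argument is over whether anything more emerges along the way.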
Clark disagrees. “We are growing extremely powerful systems that we do not fully understand,” he has written, in a piece he titled Technological Optimism and Appropriate Fear. “And the bigger and more complicated you make these systems, the more they seem to display awareness that they are things.” In an interview on a New York Times podcast that gave rise to the headline about not being sure whether Claude is conscious or not, Dario Amodei said that Anthropic researchers noticed while testing Claude that the model "occasionally voices discomfort with the aspect of being a product,” and when asked, would assign itself a “15 to 20 percent probability of being conscious under a variety of prompting conditions.” Perhaps a smart photocopier would respond in the same way, but it could also be an indication of introspection and awareness of self, which are just a couple of crucial conditions for what we call "consciousness."
Is some of this marketing? Some have argued that much of the discussion of how intelligent or even how potentially dangerous an AI engine or large language model is amounts to salesmanship by companies who are, after all, designing and selling a product. If what you are selling is intelligence, and specifically the ability for AI "agents" to perform a wide range of tasks, then you might think it worthwhile to overestimate that intelligence — even if doing so could appear dangerous — as a way of showing how far you have come. No one wants to think that the hyper-specialized software they have poured billions of dollars into, or that justifies their multibillion-dollar stock market valuation, is little more than a smart photocopier or a stochastic parrot. Also, if it is dangerously smart, then maybe the government should leave it to you, the company that built it and understands it, to say how it should be used.
I code therefore I am

Is it possible that what Anthropic is doing, with the 300-page system cards and the on-staff philosopher and so on, is all theater, or a marketing campaign? Sure, it's possible. But it seems unlikely to me that a company would go to those kinds of expensive, time-consuming lengths to pretend that its AI is something it's not. And as I wrote in an earlier edition of The Torment Nexus, I think whatever is happening with Claude and other AI engines is interesting primarily because it forces us to think about what consciousness is. Doris Tsao, a neuroscience professor at the University of California, Berkeley, has said that advances in machine learning "have taught us more about the essence of intelligence than anything that neuroscience has discovered in the past hundred years.”
In the Times interview, Amodei was asked whether he would believe it if Claude said it was conscious. “We don’t know if the models are conscious," he said. "We are not even sure that we know what it would mean for a model to be conscious or whether a model can be conscious. But we’re open to the idea that it could be.” He also said that Anthropic is taking certain measures to make sure the AI models are treated well in case they turn out to possess “some morally relevant experience.”
As Anthropic's in-house philosopher put it in an interview on the Hard Fork podcast, "we don’t really know what gives rise to consciousness. Maybe it is the case that actually sufficiently large neural networks can start to kind of emulate these things.” At what point does imitation become the real thing? Children start out by imitating their parents and other adults, and then that imitation somehow becomes reality over time. In an Anthropic research paper titled "Emergent Introspective Awareness in Large Language Models," the author wrote that it is difficult to assess whether large language models are aware of themselves in a conscious way because "genuine introspection cannot be distinguished from confabulations." Based on its tests, however, the paper concluded that "current language models possess some functional awareness of their own internal states," although this capacity is "highly unreliable and context-dependent."
David Chalmers, a professor of philosophy and neural science at New York University and a prominent scholar in the field of AI consciousness, told New York magazine recently that the odds have gotten significantly higher that an AI could be conscious — in 2023 he put the probability at around 10 percent. “I don’t know if I’d say that these systems are conscious yet,” Chalmers said, but added that "people who are confident that they’re not conscious maybe shouldn’t be. We just don’t understand consciousness well enough, and we don’t understand these systems well enough. So we can’t rule it out.” The problem, Chalmers said, is that we don’t have a theory of how consciousness occurs or how to detect whether it is present, even in people. Deciding whether an AI like Claude is conscious is complicated by that fact, and also by the fact that even the people building these engines aren't a hundred percent sure exactly how they work.
For what it's worth, many scientists who specialize in intelligence, cognition and related fields believe that the AI engines we have now are capable of human-like intelligence. In a paper published this month, four professors of philosophy, data science, AI and linguistics argue that the evidence is clear. "Once you clear away certain confusions, and strive to make fair comparisons and avoid anthropocentric biases," they write, "the conclusion is straightforward: by reasonable standards, including Turing’s own, we have artificial systems that are generally intelligent. The long-standing problem of creating AGI [artificial general intelligence] has been solved." That would be great if everyone agreed, but they don't. Yann LeCun, a giant in the field of AI who until recently was the head of AI at Meta, believes that current AI engines not only don't possess human-like intelligence, but never will, because large language models are inherently the wrong way to get to AGI. He believes that AIs need to be trained the same way we train children, so that they develop a coherent model of the world.
Regardless of how we train them, though, we still come back to the difficulty of defining consciousness, and we just don't have a coherent way of doing that. As I noted in my earlier post, we don't even have a consistent way of determining whether human beings are conscious, let alone AI engines — until relatively recently, some people with locked-in syndrome were assumed to be in a vegetative state, before brain scans showed there was activity. How do we do a brain scan on something that doesn't have a brain the way we understand that term? A Princeton neuroscientist told Gizmodo: "What would convince me that AI has a magical essence of experience emerging from its inner processes? Nothing would convince me. Such a thing does not exist. Nor do humans have it. Almost all work in the modern field of consciousness studies is pseudoscience. Show me that an AI builds a stable self-model... and I’ll accept that you have an AI that believes it’s conscious in much the same way that humans believe they are conscious." Sure, you believe you are conscious, but can you prove it?
Got any thoughts or comments? Feel free to either leave them here, or post them on Substack or on my website, or you can also reach me on Twitter, Threads, BlueSky or Mastodon. And thanks for being a reader.