Judge says AI engines can index books but can't pirate them

Ever since ChatGPT first emerged on the scene in 2022, there has been a vociferous debate about whether the indexing (or "scraping") of public content that AI companies do when training a large language model should be considered an infringement of the copyright held by publishers and/or authors, or whether it should be covered by the "fair use" exemption in US copyright law. As some of you may know, I have consistently been on the latter side of the debate: in a piece for the Columbia Journalism Review and then in an edition of The Torment Nexus, I argued that the scraping or indexing of public content by LLMs should be legally no different from the indexing of books that Google did in the early 2000s as part of its Google Books project. After a court case that lasted a number of years, Judge Denny Chin ruled in 2013 that Google's indexing of content was covered by the fair use exemption because he believed it to be a "transformative" use, a key consideration under the first of the four factors that judges must weigh when making such a decision. As I wrote last year:

Judges have to balance the competing elements of the "four factor" test, namely: 1) What is the purpose of the use? In other words, is it intended as parody or satire, is it for scholarly research or journalism, etc. 2) What is the nature of the original work? Is it artistic in nature? Is it fiction or nonfiction? 3) How much of the original does the infringing use involve — is it an excerpt or the entire work? and 4) What impact does the infringing use have on the market for the original? In the Google Books case, the scanning of millions of books was not done for research or journalism, in many cases the books in question were creative works of fiction, the entire book was copied, and the Authors Guild argued that it would have a negative impact on the market. One element in Google's favour, however, was that while its indexing process made copies of the whole book, its search engine never showed users the entire thing.

As you can see from the four factors, a fair-use decision is effectively a balancing act between different and competing interests: the interests of the author and/or publisher, in protecting and making money from their works, and the interest of the public in having "transformative" uses of art available to them. This kind of balancing is necessary because copyright itself was designed as a balancing act, between the commercial interests of creators and the public benefit of freely available artistic work — to "promote the progress of science and useful arts," as the US Constitution describes it. Some authors and publishers (but not all) believe that copyright's sole purpose is to enrich creators, but that's not accurate; revenue for creators is important, but so is society's interest in having publicly available and usable art. Judge Chin decided that the scanning of books in order to make them searchable and provide excerpts was transformative enough that it outweighed the infringement of copyright and potential market impact.

In a recent case involving the AI company Anthropic, the judge did his own balancing act, between the fair use exemption on one hand and the questionable legality of the content the company was indexing on the other. Last August, authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson filed a class-action lawsuit against Anthropic, alleging that its scanning of books constituted copyright infringement on a massive scale. Anthropic asked the judge for a summary judgement (in other words, a ruling on a legal question without a full trial, available when the material facts are not in dispute) on the question of whether its indexing qualified as fair use. This week, Judge William Alsup of the Northern District of California delivered that summary judgement, and his ruling broke down into two distinct parts: on the subject of scanning and fair use generally, the judge said that Anthropic's indexing qualified as fair use. However, he also said that building a library of pirated books to use as source material did not qualify as fair use.

Note: In case you are a first-time reader, or you forgot that you signed up for this newsletter, this is The Torment Nexus. You can find out more about me and this newsletter in this post. This newsletter survives solely on your contributions, so please sign up for a paying subscription or visit my Patreon, which you can find here. I also publish a daily email newsletter of odd or interesting links called When The Going Gets Weird, which is here.

Partly fair and partly unfair

In his decision, Judge Alsup said that scanning and indexing books was clearly covered by the fair use exemption; in fact, he called Anthropic's use of the works "among the most transformative many of us will see in our lifetimes." Building a central library of pirated works, however, was clearly not fair use:

The copies used to train specific LLMs were justified as a fair use. Every factor but the nature of the copyrighted work favors this result. The technology at issue was among the most transformative many of us will see in our lifetimes. The copies used to convert purchased print library copies into digital library copies were justified, too, though for a different fair use. The first factor strongly favors this result, and the third favors it, too. The fourth is neutral. Only the second slightly disfavors it. On balance, as the purchased print copy was destroyed and its digital replacement not redistributed, this was a fair use. The downloaded pirated copies used to build a central library were not justified by a fair use. Every factor points against fair use. Anthropic employees said copies of works (pirated ones, too) would be retained “forever” for “general purpose” even after Anthropic determined they would never be used for training LLMs. A separate justification was required for each use. None is even offered here except for Anthropic’s pocketbook and convenience.

In effect, Judge Alsup said that Anthropic's use of several popular databases of pirated books (including one called Books3, one known as LibGen, and one called the Pirate Library Mirror, or PiLiMi) meant that its copying of those titles did not qualify as fair use. The company now faces a trial to determine what penalties it will have to pay for that infringement, and the cost could be considerable: there are more than 7 million books in the "central library" that Anthropic created from various pirated databases, and US copyright law sets statutory damages at between $750 and $30,000 per work, with penalties of as much as $150,000 per work for willful infringement. That could translate into a penalty of hundreds of billions of dollars! The judge noted in his decision that Anthropic had many places from which it could have purchased copyrighted books in order to index them for training purposes, but instead it decided to "steal them" to avoid what Anthropic co-founder and chief executive officer Dario Amodei described as "legal/practice/business slog."
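For a rough sense of the scale involved, here is a back-of-the-envelope calculation (my own illustration, not from the ruling) that simply multiplies the 7-million-work figure by the statutory ranges above:

```python
# Rough statutory-damages math for the Anthropic case. The 7 million
# figure and the per-work dollar amounts come from the article above;
# treating every book as a separately infringed work is my own
# simplification of how a court might actually count.
WORKS = 7_000_000

scenarios = {
    "statutory minimum ($750/work)": 750,
    "statutory ceiling ($30,000/work)": 30_000,
    "willful maximum ($150,000/work)": 150_000,
}

for label, per_work in scenarios.items():
    print(f"{label}: ${WORKS * per_work:,}")

# statutory minimum ($750/work): $5,250,000,000       -> about $5 billion
# statutory ceiling ($30,000/work): $210,000,000,000  -> $210 billion
# willful maximum ($150,000/work): $1,050,000,000,000 -> over $1 trillion
```

Even at the statutory minimum the exposure runs to billions of dollars, which helps explain why the upcoming damages trial looms so much larger for Anthropic than the fair-use question it just won.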

I should note that I have some issues with the judge's use of the term "steal": infringing on copyright isn't typically considered theft in a legal sense, because it involves making a digital copy rather than depriving the owner of a physical item, as theft does. While I may not agree with his reasoning entirely, however, I think his decision is fair: Anthropic (and other AI companies that have used such material) had the option to purchase the books in question through legal means, or even to borrow them from libraries and other collectors, as the Internet Archive does, but for pragmatic reasons it decided not to. That may have made business sense for the AI companies, but as Judge Alsup noted, it's not the court's duty to think of Anthropic's bottom line when making its decisions.

According to the decision, Anthropic at some point decided that the legal risks of pirating books were not worth taking, and so it purchased large numbers of books that it had already gained access to through pirate databases. Whether this will affect the ultimate decision on penalties remains to be seen: the judge said in his ruling that the fact that Anthropic "later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft, but it may affect the extent of statutory damages." So the penalty might not be the maximum available, but there's no question that the "legal/practice/business slog" that Dario Amodei was concerned about when he authorized the use of pirated works pales in comparison to what he and his company face now.

Inputs versus outputs

One important aspect of this case that is not mentioned in much of the coverage of the decision (apart from Ben Thompson's overview in his Stratechery newsletter) is that Judge Alsup's ruling refers specifically to whether the input to Anthropic's AI training qualifies as fair use, meaning the scraping or indexing of copyrighted books. What he doesn't address (because the lawsuit doesn't raise it) is whether the output from Anthropic's Claude AI (that is, the responses Claude provides to user prompts, which in some cases might contain excerpts from copyrighted works) also qualifies as fair use. As the judge described it in his ruling:

Authors do not allege that any LLM output provided to users infringed upon Authors’ works. Our record shows the opposite. Users interacted only with the Claude service, which placed additional software between the user and the underlying LLM to ensure that no infringing output ever reached the users. This was akin to the limits Google imposed on how many snippets of text from any one book could be seen by any one user through its Google Books service, preventing its search tool from devolving into a reading tool. Here, if the outputs seen by users had been infringing, Authors would have a different case. And, if the outputs were ever to become infringing, Authors could bring such a case. But that is not this case. Instead, Authors challenge only the inputs, not the outputs, of these LLMs. They point to the fully trained LLMs and the Claude service only to shed light on how training itself uses copies of their works.
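The "additional software" the judge mentions is essentially a copyright guardrail on the output side. Anthropic hasn't published how its filter works, so what follows is only a minimal sketch of the general idea, with the function names, the 50-word threshold, and the brute-force corpus scan all assumed for illustration: block any response that shares a long verbatim run of words with a protected source text.

```python
# Hypothetical sketch of an output-side copyright filter, the kind of
# layer the ruling says sat between users and the underlying LLM.
# Nothing here reflects Anthropic's actual implementation; the threshold
# and the brute-force search are illustrative assumptions.

def longest_verbatim_run(candidate: str, source: str) -> int:
    """Length in words of the longest verbatim passage shared by both texts."""
    words = candidate.split()
    src = f" {' '.join(source.split())} "  # pad so matches respect word boundaries
    best = 0
    for i in range(len(words)):
        j = i + best + 1  # only test runs longer than the current best
        while j <= len(words) and f" {' '.join(words[i:j])} " in src:
            best = j - i
            j += 1
    return best

def filter_response(candidate: str, protected_texts: list[str], max_run: int = 50) -> str:
    """Pass the model's answer through only if it contains no long excerpt."""
    for text in protected_texts:
        if longest_verbatim_run(candidate, text) > max_run:
            return "[response withheld: matched a long passage from a protected work]"
    return candidate
```

A real system would need something far faster than this brute-force scan (a hashed n-gram index, for example), but the judge's analogy to Google Books snippets holds either way: the service caps how much of any one work can ever reach a user.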

In my previous Torment Nexus post on this topic, I drew a similar distinction, because I think there is an important point there. If an AI startup (or a company like Google or the Internet Archive) scrapes or indexes hundreds, thousands, or even millions of books for the purpose of creating a publicly available archive that only ever provides short excerpts to an end user, that process seems to me to qualify as fair use because it is transformative (although a court might differ with me on one of the four factors). However, that doesn't mean that the output, the response to the prompt, necessarily qualifies as fair use as well. If an LLM that has been trained on all of a given artist's work produces work (visual or written) that is identical to that artist's output, it could be argued that this would be classic copyright infringement, ineligible for the fair use exemption because of the impact the AI output might have on market demand. If the infringing content doesn't rework or present the material in some new way, a court could find that the works are too similar to qualify as fair use. As my previous post put it:

So if AI engines don't produce word-for-word copies or artistic duplicates, should it still be copyright infringement to ingest billions of creative works and produce a work in the style of a specific author or artist? That is something the courts will have to consider, but I would argue that it should not. Does Edgar Allan Poe or Dashiell Hammett somehow own or have the right to control my ability to write a novel or story with a gumshoe detective? Should Stan Lee or Marvel get paid if I create the image of a man flying through the air wearing tights? Should the estate of Mary Shelley profit from every work that involves a mad scientist creating a terrible (but misunderstood) monster? Probably not. Are there cases where an AI engine might produce an obvious copy of a copyrighted work? Of course, in which case fair use likely wouldn't apply. But the ingestion or indexing of all that content should be considered fair use, because of the benefits that AI could generate in non-infringing ways.

It seems that Judge Alsup, who has a reputation as one of the most technologically savvy judges in the Northern District (if not the entire lower court system in the US), is convinced that scanning and indexing of copyrighted content is clearly transformative — significantly so — and therefore should be covered by fair use. In other words, he agrees with me :-) Whether other courts concur with this assessment, however, remains to be seen.

Got any thoughts or comments? Feel free to either leave them here, or post them on Substack or on my website, or you can also reach me on Twitter, Threads, BlueSky or Mastodon. And thanks for being a reader.