Why AI content scraping should qualify as fair use
Back in October, I wrote about artificial intelligence, and specifically about one of the crucial questions experts still can't seem to agree on: whether it is going to destroy us or not. In that piece, I also mentioned the debate over whether the indexing or "ingesting" that AI large language models do is — or at least should be — covered by the fair-use exception in copyright law. I didn't spend a lot of time on it because it wasn't directly relevant to the danger question, but I wanted to expand on some of the points I made then, and in a Columbia Journalism Review piece I wrote last year.
I am not a cheerleader for giant technology companies by any means, but I think there is an important principle at stake. And at the heart of it are some key questions: What (or who) is copyright law for? What was it originally designed to do? And does AI scraping or indexing of copyrighted content fit into that, and if so, how?
The case against AI indexing of content is relatively straightforward: by hoovering up content online and using it to create a massive database for training large language models, AI engines copy that content without asking and without paying for it (unless the publisher or owner has signed a licensing deal with the AI company, as some news outlets have). This, the Authors Guild and the New York Times and a number of others have argued and continue to argue, amounts to copyright infringement on its face. Similarly, one could imagine that if a company were to copy millions of books and use them to create a massive index of content, that would pretty clearly qualify as infringement as well: copying without permission or payment.
In case you are a first-time reader, or you forgot that you signed up for this newsletter, this is The Torment Nexus (you can find out more about me and this newsletter — and why I chose to call it that — in this post.)
The major difference between these two cases is that the second, hypothetical one actually happened: Google scanned millions of books as part of its Google Books project starting in the early 2000s, and created an index that allowed users to search for content from those books. That led to a lawsuit in which the Authors Guild and others argued that Google was guilty of copyright infringement on a massive scale. After years of back-and-forth negotiations over a settlement failed, Judge Denny Chin of the Southern District of New York — who in the early days of the case seemed to agree with the authors — changed his mind, and ruled that Google's book-scanning activity was covered by the fair-use exception under US copyright law.
How could such massive and obviously unauthorized copying of content owned by someone else be permitted? Because Judge Chin ruled that Google Books was a "transformative" use of the content, which many see as the crucial factor in deciding whether something qualifies as fair use. As I described in my CJR piece, the question of whether AI indexing should be seen as fair use is complicated, in part because large language models are complicated, but also because the concept of fair use itself is complicated. Anyone who says that a specific activity clearly qualifies as fair use — like universities that instruct students not to use more than six sentences from a published work — either doesn't know what they are talking about or is being deliberately reductionist, because it doesn't work that way.
The Four Factor Test
The only way to determine whether something is fair use is to go to court, where a judge or judges have to balance the competing elements of the "four factor" test, namely:

1) What is the purpose of the use? Is it intended as parody or satire, is it for scholarly research or journalism, and so on. This is also where the "transformative" test comes in: does the use create something new or qualitatively different?

2) What is the nature of the original work? Is it artistic in nature? Is it fiction or nonfiction?

3) How much of the original does the infringing use involve? Is it an excerpt or the entire work?

4) What impact does the infringing use have on the market for the original?
In the Google Books case, for example, the scanning of millions of books was not done for research or journalism; in many cases the books in question were creative works of fiction; entire books were copied; and the Authors Guild argued that the copying would have a negative impact on the market for those works. One element in Google's favour, however, was that while its indexing process made copies of whole books, its search engine never showed users the entire thing, only excerpts. But the most crucial factor was that Judge Chin ruled that the transformative nature of the use — creating a digital database where people could search for out-of-print or hard-to-find books — outweighed and effectively negated the apparent infringement. He wrote:
In my view, Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders. It has become an invaluable research tool that permits students, teachers, librarians, and others to more efficiently identify and locate books. It has given scholars the ability, for the first time, to conduct full-text searches of tens of millions of books. It preserves books, in particular out-of-print and old books that have been forgotten in the bowels of libraries, and it gives them new life. It facilitates access to books for print-disabled and remote or underserved populations. It generates new audiences and creates new sources of income for authors and publishers. Indeed, all society benefits.
Who is copyright for?
This goes directly to the questions I mentioned above: What (or who) is copyright law for? What is it supposed to accomplish? Some authors and artists (and most publishers) would like you to believe that it was designed solely for creators, to enable them to make money from their creations, full stop. If Disney wants to make billions of dollars from an animated mouse, that is their business, and there's nothing any of us can do about it. The Statute of Anne from 1709, the first modern copyright law, gave authors the exclusive right to print and profit from their books, and said nothing about other uses. But the US Constitution notes that the ultimate goal of copyright law is to "promote the progress of science and useful arts" for the benefit of society.
Obviously, incentivizing creators by allowing them to make money from their work accomplishes this goal. But I think the framers of the fair-use principle understood that if copyright were solely about compensating creators — if no one could use even a small part of an existing work without asking permission and paying money — artistic and intellectual creativity could be impeded. As the Supreme Court noted, fair use was designed to counterbalance cases that might "stifle the very creativity which [copyright law] is designed to foster." Copyright law, in other words, is designed for the benefit of society as a whole, not just as a way for creators to make money. And the "transformative" test exists to decide whether the other aspects of an infringing use compensate for, or counterbalance, the obvious infringement.
So does the activity of indexing creative work to train a large language model qualify? Needless to say, there is a lot of debate on that question, and ultimately the courts will have to decide (one court has already thrown out most of a case brought by a group of authors who claimed ChatGPT engaged in copyright infringement). From my perspective — and that of a number of copyright and intellectual-property experts — LLM indexing should qualify as fair use, in the same way that Google's book-scanning did. In most cases, AI engines do not provide word-for-word reproductions. The New York Times lawsuit against OpenAI says that ChatGPT reproduced entire articles, but it took the newspaper a considerable amount of effort to get it to do so.
The landmark Sony v. Universal City Studios case — better known as the Betamax case — found that a technology is not illegal if it is capable of "substantial non-infringing uses," and I would argue that large language models are. Obviously there are those who claim that all AI creates is "slop," and that anything we can do with artificial intelligence or large language models could be done more easily and cheaply by other means. By that logic, if you assume that nothing good or socially beneficial could ever come from AI, then you would probably argue that it doesn't deserve fair-use protection. But I think there is already plenty of evidence that AI can generate useful content in a variety of ways, and we are only a year or two into this process. Already, studies have found that people prefer AI-generated poetry to the real thing. That may shock or depress you, but do those poetry readers care? Unlikely.
Potential benefits outweigh the infringement
There are a number of groups that support the concept of fair use and believe that AI indexing of content should qualify. To take just one example, the Library Copyright Alliance has argued that the ingestion of copyrighted works to create large language models or other AI training databases "generally is a fair use," an opinion it also provided in a submission to the US Copyright Office's notice of inquiry on copyright and AI. Its reasoning is that if copyright maximalists were to lock content away from large language models, they might also make that content less available to everyone else. As the Alliance put it:
As champions of fair use, free speech, and freedom of information, libraries have a stake in maintaining the balance of copyright law so that it is not used to block or restrict access to information. We drafted the principles on AI and copyright in response to efforts to amend copyright law to require licensing schemes for generative AI that could stunt the development of this technology, and undermine its utility to researchers, students, creators, and the public.
So if AI engines don't produce word-for-word copies or artistic duplicates, should it still be copyright infringement to ingest billions of pieces of creative content and produce a work in the style of a specific author or artist? That is something the courts will have to consider, but I would argue that it should not. Does Edgar Allan Poe or Dashiell Hammett somehow own, or have the right to control, my ability to write a novel or story with a gumshoe detective? Should Stan Lee or Marvel get paid if I create the image of a man flying through the air wearing tights? Should the estate of Mary Shelley profit from every work that involves a mad scientist creating a terrible (but misunderstood) monster? Probably not.
Are there cases where an AI engine might produce an obvious copy of a copyrighted work? Of course, and in those cases fair use likely wouldn't apply. But the ingestion or indexing of all that content should be considered fair use, because of the benefits that AI could generate in non-infringing ways.
Got any thoughts or comments? Feel free to either leave them here, or post them on Substack or on my website, or you can also reach me on Twitter, Threads, BlueSky or Mastodon. And thanks for being a reader.