The Internet Archive should be protected, not attacked

In a recent edition of The Torment Nexus, I wrote about Wikipedia, which I argued was one of the best things the internet ever created (or that we all created with the help of the internet). In my opinion, there is another thing that ranks right up there with Wikipedia on the list of great things, and that is the Internet Archive. Just as Jimmy Wales created Wikipedia as a crowdsourced repository of information, Brewster Kahle created the Internet Archive as a repository for as much of the internet as he could save. Want to find the original Google.com page from 1998? Or a version of the Apple website from 1996? Or the original version of Wikipedia from 2001? The archive's Wayback Machine can find it. And much like Wikipedia – which has come under fire from Elon Musk's competing Grokipedia and others who dislike the truth and want to replace it with their preferred version – the Internet Archive has been and continues to be under attack on a variety of fronts, mostly from commercial interests who dislike free information.

There's a conventional wisdom that "the internet never forgets," and therefore anything that has been posted will survive forever, but the internet and the web forget things all the time. This was one of the reasons why Kahle and others decided to create the Internet Archive in 1996 – because of what became known as "link rot," where websites disappear for one reason or another, and then everyone who linked to them is left with a dead link where that information used to be. I've had to deal with this on a more personal level multiple times, when companies I worked for removed their archives and articles I worked on disappeared instantly – which is why I use a service called Authory, so I have a personal archive of everything I've published. Here's how Kahle described the rationale behind the Archive in a piece for Scientific American in 1997:

The early manuscripts at the Library of Alexandria were burned, much of early printing was not saved, and many early films were recycled for their silver content. While the Internet’s World Wide Web is unprecedented in spreading the popular voice of millions that would never have been published before, no one recorded these documents and images from 1 year ago. The history of early materials of each medium is one of loss and eventual partial reconstruction through fragments. Even though the documents on the Internet are the easy documents to collect and archive, the average lifetime of a document is 75 days and then it is gone. While the changing nature of the Internet brings a freshness and vitality, it also creates problems for historians and users alike.

Kahle, who built a web-traffic analysis service called Alexa and sold it to Amazon in 1999, added that the Archive was designed to be the digital equivalent of a library, and that if it were built and used properly, it would "offer insights into human endeavor and lead to the creation of new services," and that as the internet became a serious publishing system, archives like the IA would be available to "serve documents that are no longer in print." The service recently announced that it has stored more than one trillion webpages, an event that was celebrated by Sir Tim Berners-Lee, the creator of the web, and Vint Cerf, who co-designed the TCP/IP protocols that underpin the internet. Unfortunately, the most recent attacks on the Archive – both of which have been successful to some extent – have come about not because the service has strayed from that original purpose or become corrupted, but rather because the Archive and Kahle have continued to execute that original vision in a way that some powerful industries have come to see as a threat.

It's important to note that the Archive doesn't just store webpages, although that is the majority of what it has saved – about 100,000 terabytes' worth of data. In addition to websites, the Archive has also backed up and stored hundreds of thousands of computer programs, including computer games, as well as movies, TV shows, news and public affairs content, animations, radio shows, audio books, 78 rpm records, podcasts and plenty of other material that might otherwise be lost to time or degraded to the point where it would become unusable. All of the publicly shareable images from Flickr are backed up in the Archive, and so are all of NASA's images of deep space. The Archive also has the complete text of publicly available books from the open-source Project Gutenberg and other similar efforts. They are all searchable and downloadable. It also maintains something called the Open Library, which is the source of one of the more contentious (pun intended) legal battles the Archive has faced to date.

Note: In case you are a first-time reader, or you forgot that you signed up for this newsletter, this is The Torment Nexus. You can find out more about me and this newsletter in this post. This newsletter survives solely on your contributions, so please sign up for a paying subscription or visit my Patreon, which you can find here. I also publish a daily email newsletter of odd or interesting links called When The Going Gets Weird, which is here.

When is a library not a library?

I wrote about the Open Library project in 2023 when I was still the chief digital writer for the Columbia Journalism Review, and the headline I chose was "When is a library not a library? When it's online, apparently" (if you want to use the version that was captured by the Archive, there's one here). In a similar way to the Google Books project – which was also the subject of a multi-year copyright lawsuit, although in that case Google was victorious – the Archive digitally scanned millions of books, many of them either out of copyright or out of print or otherwise unavailable. If the service had provided digital copies of all of those books for free to anyone who asked, that would have been an obvious copyright violation, but the Archive didn't do that. Instead, it implemented a controlled digital lending program similar to the lending that physical libraries offer: only one person at a time could "take out" a digital copy of a specific book, and only once that lending period expired could it be "loaned" again to someone else.

Not surprisingly, the publishing industry and some authors' rights groups didn't love this idea, but it remained under the radar until COVID-19 came along, and the Internet Archive made what in hindsight turns out to have been an unfortunate decision. Since COVID lockdowns made it difficult for people to get to physical bookstores and libraries, Kahle and the Archive decided to open up digital lending of their scanned books to anyone, creating something they called the National Emergency Library. In effect, they removed any of the controls on their lending program – no limit on the number of books that could be "loaned," no limit on the number of people who could simultaneously "borrow" them, etc. Publisher and author groups seemed to see this as a red flag – a bridge too far – and decided to press their copyright claims against the Open Library project. So they sued the Internet Archive in federal court in June of 2020.

I wrote about the result of that case in only the second edition of The Torment Nexus ever published, in September of last year (you can probably get a hint of what I thought of the court's decision from my headline: "The Second Circuit's decision in the Internet Archive case is bad"). In order to understand the context of the case, it's worth revisiting the Google Books decision in 2013, in which the company was found not liable for infringement for the scanning and indexing of millions of books because Judge Denny Chin ruled that providing access to a search function and short snippets of text from those books was a "transformative" use, and that the benefits of this use outweighed the copyright interests of publishers and authors. Here's how he put it:

Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders. It has become an invaluable research tool that permits students, teachers, librarians, and others to more efficiently identify and locate books. It has given scholars the ability, for the first time, to conduct full-text searches of tens of millions of books. It preserves books, in particular out-of-print and old books that have been forgotten in the bowels of libraries, and it gives them new life. It facilitates access to books for print-disabled and remote or underserved populations. It generates new audiences and creates new sources of income for authors and publishers. Indeed, all society benefits.

So given that decision, why didn't the Open Library book-scanning project get the same kind of protection from copyright infringement? In a nutshell, Judge John G. Koeltl of the Southern District of New York ruled that while the Internet Archive's project might be of positive benefit to society, it wasn't transformative enough to make up for the harm done to publishers and authors by having digital copies of their work loaned out for free to anyone who wanted them. The Archive made the case that its lending program enabled "new and expanding interactions between library books and the web," but Koeltl ruled that "making an invaluable contribution to the progress of science and cultivation of the arts" did not constitute a transformative use. In the case of Google Books, the judge wrote, the company increased the utility of the books it scanned by making them searchable, whereas the Open Library just made copies available.

We need to fight for fair use

You can read my response to the Open Library decision in more detail in that Torment Nexus post from last year, but the bottom line for me is that the whole reason there is a concept of fair use in US copyright law at all is to act as a counterweight to the commercial interests of authors and publishers, to ensure that financial concerns don't squash potential contributions to the progress of science and the cultivation of the arts. If a library – digital or physical – isn't one of the best ways of doing this, I don't know what is. The idea that the Open Library lending individual copies of books (many of them out of print or otherwise unavailable) to specific people for short periods of time is an unreasonable infringement of authors' and publishers' rights seems to give their commercial interests all of the weight, and the open circulation of ideas and information virtually none – especially when research shows library lending has little or no impact on the sales of commercially published books.

A somewhat smaller case against the Internet Archive was mounted in 2023 by a number of record companies, who argued that the Archive's collection of 78 rpm records was a clear copyright infringement. According to a release from the Archive, that case was settled, although the terms of the settlement remain confidential. How did providing access to recordings that are in most cases not even available any more, let alone capable of being played, inconvenience those record labels? Because they can charge radio stations and other services for access to them, and apparently their right to do this outweighs any broader social or public interest there might be in giving the public access to some of these recordings for free, despite the fact that these giant record labels have already been paid for them many times over – just as publishers have already been paid many times for the books the Internet Archive scanned.

In a recent story by Ars Technica about the Archive, Brewster Kahle argued that "the world became stupider" when the Open Library was disemboweled by the Hachette ruling, and it's hard to argue with him:

Meredith Rose, senior policy counsel for Public Knowledge, told Ars Technica that the Open Library could have served to surface information that’s often buried in books, giving researchers a streamlined path to source accurate information online. But Kahle said the lawsuits against the Archive showed that “massive multibillion-dollar media conglomerates” have their own interests in controlling the flow of information. “We don’t want libraries to become Hulu or Netflix,” said Courtney of the eBook Study Group. He, like Kahle, is concerned that libraries will become unable to fulfill their longtime role—preserving culture and providing equal access to knowledge. Remote access, Courtney noted, benefits people who can’t easily get to libraries, like the elderly, people with disabilities, rural communities, and foreign-deployed troops.

Cases like the Hachette lawsuit reinforce something I've often thought, which is that if public libraries didn't already exist, corporate interests like book publishers and movie companies and record labels would never allow them to be created. "People get to read our books or watch our movies or listen to our records and we don't get paid anything?" they would shout. "That's communism!" Incidentally, it's worth noting that plenty of authors' groups and individual creators are fans of the Internet Archive, and many didn't support the Hachette case. But cases like this are why Congress wrote fair use into copyright law – because without it, commercial entities will press their interests relentlessly until every book and film and song and work of art is locked up in a giant vending machine where the price keeps going up, and your purchase is deleted if you don't agree to its terms.

That kind of thing should be resisted at all costs, in my opinion, and one of the best ways of doing so is to support open-source projects like the Internet Archive and Wikipedia (and the InterPlanetary File System and archive.today – which was recently served with a subpoena by the FBI), and fight back against attempts to stop them from sharing information without asking for a credit card.

Got any thoughts or comments? Feel free to either leave them here, or post them on Substack or on my website, or you can also reach me on Twitter, Threads, Bluesky or Mastodon. And thanks for being a reader.