Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Category Archives: Copyright

The battle over copyright in the age of ChatGPT

Boston Review: “Questions of AI authorship and ownership can be divided into two broad types. One concerns the vast troves of human-authored material fed into AI models as part of their “training” (the process by which their algorithms “learn” from data). The other concerns ownership of what AIs produce. Call these, respectively, the input and output problems. So far, attention—and lawsuits—have clustered around the input problem. The basic business model for LLMs relies on the mass appropriation of human-written text, and there simply isn’t anywhere near enough in the public domain. OpenAI hasn’t been very forthcoming about its training data, but GPT-4 was reportedly trained on around thirteen trillion “tokens,” roughly the equivalent of ten trillion words. This text is drawn in large part from online repositories known as “crawls,” which scrape the internet for troves of text from news sites, forums, and other sources. Fully aware that vast data scraping is legally untested—to say the least—developers charged ahead anyway, resigning themselves to litigating the issue in retrospect. Lawyer Peter Schoppert has called the training of LLMs without permission the industry’s “original sin”—to be added, we might say, to the technology’s mind-boggling consumption of energy and water in an overheating planet. (In September, Bloomberg reported that plans for new gas-fired power plants have exploded as energy companies are “racing to meet a surge in demand from power-hungry AI data centers.”) The scale of the prize is vast: intellectual property accounts for some 90 percent of recent U.S. economic growth. Indeed, crawls contain enormous amounts of copyrighted information; the Common Crawl alone, a standard repository maintained by a nonprofit and used to train many LLMs, contains most of b-ok.org, a huge repository of pirated ebooks that was shut down by the FBI in 2022. 
The work of many living human authors was on another crawl, called Books3, which Meta used to train LLaMA. Novelist Richard Flanagan said that this training made him feel “as if my soul had been strip mined and I was powerless to stop it.” A number of authors, including Junot Díaz, Ta-Nehisi Coates, and Sarah Silverman, sued OpenAI in 2023 for the unauthorized use of their work for training, though the suit was partially dismissed early this year. Meanwhile, the New York Times is in ongoing litigation against OpenAI and Microsoft for using its content to train chatbots that, it claims, are now its competitors. As of this writing, AI companies have largely responded to lawsuits with defensiveness and evasion, refusing in most cases even to divulge what exact corpora of text their models are trained on. Some newspapers, less sure they can beat the AI companies, have opted to join them: the Financial Times, for one, minted a “strategic partnership” with OpenAI in April, while in July Perplexity launched a revenue-sharing “publisher’s program” that now counts Time, Fortune, Texas Tribune, and WordPress.com among its partners. At the heart of these disputes, the input problem asks: Is it fair to train the LLMs on all that copyrighted text without remunerating the humans who produced it? The answer you’re likely to give depends on how you think about LLMs…”

Every AI Copyright Lawsuit in the US, Visualized

Wired: “WIRED is following every copyright battle involving the AI industry—and we’ve created some handy visualizations that will be updated as the cases progress. In May 2020, the media and technology conglomerate Thomson Reuters sued a small legal AI startup called Ross Intelligence, alleging that it had violated US copyright law by reproducing materials from… Continue Reading

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

Wired – “The project’s leader says that allowing everyone to access the collection of public-domain books will help “level the playing field” in the AI industry. Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI… Continue Reading

How ChatGPT Search (Mis)represents Publisher Content

Columbia Journalism Review – “ChatGPT search—which is positioned as a competitor to search engines like Google and Bing—launched with a press release from OpenAI touting claims that the company had “collaborated extensively with the news industry” and “carefully listened to feedback” from certain news organizations that have signed content licensing agreements with the company. In… Continue Reading

Canadian legal information database sues company behind AI chatbot

CBA – Lawsuit filed in B.C. Supreme Court alleges that Caseway AI violates CanLII’s terms of service and copyrights: “The Canadian Legal Information Institute (CanLII) has taken the makers of an AI chatbot to court over what it says is a violation of its terms of service, due to the chatbot scraping CanLII’s database in… Continue Reading

Ziff Davis study says AI firms rely on publisher data to train models

Axios: “Leading AI companies such as OpenAI, Google and Meta rely more on content from premium publishers to train their large language models (LLMs) than they publicly admit, according to new research from executives at Ziff Davis, one of the largest publicly-traded digital media companies. Why it matters: Publishers believe that the more they can… Continue Reading

Google Asked to Remove 10 Billion “Pirate” Search Results

TorrentFreak – “Rightsholders have asked Google to remove more than 10 billion ‘copyright infringing’ URLs from its search results. The search engine doesn’t celebrate the milestone in any way, but the takedown notices document intriguing shifts in volume over time, as well as shifting takedown interests. While search engines are extremely helpful for the average… Continue Reading

Metropolitan Museum of Art Puts 490,000 High-Res Images Online & Makes Them Free to Use

Open Culture: “The Metropolitan Museum of Art has put online 492,000 high-resolution images of artistic works. Even better, the museum has placed the vast majority of these images into the public domain, meaning they can be downloaded directly from the museum’s website for non-commercial use. When you browse the Met collection and find an image… Continue Reading

Vanishing Culture: A Report on Our Fragile Cultural Record

Internet Archive Blogs: “In today’s digital landscape, corporate interests, shifting distribution models, and malicious cyber attacks are threatening public access to our shared cultural history. The rise of streaming platforms and temporary licensing agreements means that sound recordings, books, films, and other cultural artifacts that used to be owned in physical form, are now at… Continue Reading

What are the current swing states, and how have they changed over time?

USA Facts: “Swing states, also known as battleground states, are states that could “swing” to either Democratic or Republican candidates depending on the election. Because of their potential to be won by either candidate, political parties often spend a disproportionate amount of time and campaign resources on winning these states. While there is no universal… Continue Reading