Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Daily Archives: September 25, 2023

These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech

The Atlantic – “Use our new search tool to see which authors have been used to train the machines. This summer, I acquired a data set of more than 191,000 books that were used without permission to train generative-AI systems by Meta, Bloomberg, and others. I wrote in The Atlantic about how the data set, known as “Books3,” was based on a collection of pirated ebooks, most of them published in the past 20 years. Since then, I’ve done a deep analysis of what’s actually in the data set, which is now at the center of several lawsuits brought against Meta by writers such as Sarah Silverman, Michael Chabon, and Paul Tremblay, who claim that its use in training generative AI amounts to copyright infringement. Since my article appeared, I’ve heard from several authors wanting to know if their work is in Books3. In almost all cases, the answer has been yes. These authors spent years thinking, researching, imagining, and writing, and had no idea that their books were being used to train machines that could one day replace them. Meanwhile, the people building and training these machines stand to profit enormously. Reached for comment, a spokesperson for Meta did not directly answer questions about the use of pirated books to train LLaMA, the company’s generative-AI product. Instead, she pointed me to a court filing from last week related to the Silverman lawsuit, in which lawyers for Meta argue that the case should be dismissed in part because neither the LLaMA model nor its outputs are “substantially similar” to the authors’ books. It may be beyond the scope of copyright law to address the harms being done to authors by generative AI, and the point remains that AI-training practices are secretive and fundamentally nonconsensual. Very few people understand exactly how these programs are developed, even as such initiatives threaten to upend the world as we know it. Books are stored in Books3 as large, unlabeled blocks of text. To identify their authors and titles, I extracted ISBNs from these blocks of text and looked them up in a book database. Of the 191,000 titles I identified, 183,000 have associated author information. You can use the search tool below to look up authors in this subset and see which of their titles are included…”
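
The ISBN step described in the excerpt can be illustrated with a short Python sketch. This is not the article author's code: the regular expression, the function names, and the tiny in-memory `BOOK_DB` standing in for "a book database" are all assumptions made for illustration. The idea is to find ISBN-shaped strings in an unlabeled block of text, keep only those whose check digit is valid, and then look them up.

```python
import re

# Loose pattern for ISBN-shaped strings: an optional 978/979 prefix, then ten
# characters (the last may be X), with optional hyphens or spaces in between.
ISBN_CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn10(s: str) -> bool:
    """ISBN-10 check: weights 10..1, total divisible by 11 (X counts as 10)."""
    if len(s) != 10 or not s[:9].isdigit() or not (s[9].isdigit() or s[9] == "X"):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s: str) -> bool:
    """ISBN-13 check: alternating weights 1 and 3, total divisible by 10."""
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def extract_isbns(text: str) -> list[str]:
    """Return the distinct valid ISBNs in text, in order of first appearance."""
    found: list[str] = []
    for match in ISBN_CANDIDATE.finditer(text):
        isbn = re.sub(r"[- ]", "", match.group()).upper()
        if (is_valid_isbn13(isbn) or is_valid_isbn10(isbn)) and isbn not in found:
            found.append(isbn)
    return found


# Hypothetical stand-in for a real book database; the entry is a placeholder.
BOOK_DB = {"9780306406157": {"title": "Example Title", "author": "Example Author"}}


def identify(text: str) -> list[dict]:
    """Look up each extracted ISBN; ISBNs missing from the database are skipped."""
    return [BOOK_DB[isbn] for isbn in extract_isbns(text) if isbn in BOOK_DB]
```

Validating the check digit before the lookup filters out phone numbers and other digit runs that merely look like ISBNs, which matters when the input is raw, unlabeled text.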

You Can Now Get Your Free Credit Report Every Week, Forever

Lifehacker: “Monitoring your credit history regularly reduces the likelihood that reporting errors (best case) or identity theft (worst case) will derail your financial health—and you can now do this at no cost every single week, indefinitely, through Equifax, Experian, and TransUnion. Prior to the COVID-19 pandemic, each credit bureau offered one free credit report per… Continue Reading

I Was Wrong About the Death of the Book. And Umberto Eco was right.

The Atlantic [read free]: “Fifteen years ago, in What Would Google Do?, I called for the book to be rethought and renovated, digital and connected, so that it could be updated and made searchable, conversational, collaborative, linkable, less expensive to produce, and cheaper to buy. The problem, I said, was that we so revered the… Continue Reading

New phone call etiquette: Text first and never leave a voice mail

Washington Post: “Phone calls have been around for 147 years, the iPhone 16 years and FaceTime video voice mails about a week. Not surprisingly, how we make calls has changed drastically alongside advances in technology. Now people can have conversations in public on their smartwatches, see voice mails transcribed in real time and dial internationally… Continue Reading

The Cambridge Law Corpus: A Corpus for Legal AI Research

Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek. arXiv:2309.12269 [cs.CL] [v1] Thu, 21 Sep 2023 17:24:40 UTC “We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250 000 court… Continue Reading

Project Gutenberg turned ebooks in its digital library into audiobooks without any need for human voices

Quartz: “The oldest digital library in the world, Project Gutenberg, has transformed thousands of ebooks into audiobooks using AI—bypassing the longer (and more expensive) process of hiring a human reader to do the job. It’s exactly the kind of AI application that actors, who are currently on strike in the US for the first time… Continue Reading

Wikipedia search-by-vibes through millions of pages offline

“What is This? This is a browser-based search engine for Wikipedia, where you can search for “the reddish tall trees on the san francisco coast” and find results like “Sequoia sempervirens” (a name of a redwood tree). The browser downloads the database, and search happens offline. To download two million Wikipedia pages with their titles… Continue Reading

SEC obtains Wall Street firms’ private chats in probe of WhatsApp, Signal use

Ars Technica: “The US Securities and Exchange Commission has “collected thousands of staff messages from more than a dozen major investment companies” as it expands a probe into how employees and executives at Wall Street firms use private messaging platforms such as WhatsApp and Signal, Reuters reported today, citing “four people with direct knowledge of… Continue Reading