The Unbelievable Scale of AI’s Pirated-Books Problem

The Atlantic – no paywall: “…Meta and OpenAI have both argued in court that it’s “fair use” to train their generative-AI models on copyrighted work without a license, because LLMs “transform” the original material into new work. The defense raises thorny questions and is likely a long way from resolution. But the use of LibGen raises another issue. Bulk downloading is often done with BitTorrent, the file-sharing protocol popular with pirates for its anonymity, and downloading with BitTorrent typically involves uploading to other users simultaneously. Internal communications show employees saying that Meta did indeed torrent LibGen, which means that Meta could have not only accessed pirated material but also distributed it to others—well established as illegal under copyright law, regardless of what the courts determine about the use of copyrighted material to train generative AI. (Meta has claimed that it “took precautions not to ‘seed’ any downloaded files” and that there are “no facts to show” that it distributed the books to others.) OpenAI’s download method is not yet known. Meta employees acknowledged in their internal communications that training Llama on LibGen presented a “medium-high legal risk,” and discussed a variety of “mitigations” to mask their activity. One employee recommended that developers “remove data clearly marked as pirated/stolen” and “do not externally cite the use of any training data including LibGen.” Another discussed removing any line containing ISBN, Copyright, ©, All rights reserved. A Llama-team senior manager suggested fine-tuning Llama to “refuse to answer queries like: ‘reproduce the first three pages of “Harry Potter and the Sorcerer’s Stone.”’” One employee remarked that “torrenting from a corporate laptop doesn’t feel right.””

See also The Atlantic – Search LibGen, the Pirated-Books Database That Meta Used to Train AI. Millions of books and scientific papers are captured in the collection’s current iteration. Editor’s note: This search tool is part of The Atlantic’s investigation into the Library Genesis data set. You can read an analysis about LibGen and its contents here. Find The Atlantic’s search tool for movie and television writing used to train AI here.
