Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

Wired – “The project’s leader says that allowing everyone to access the collection of public-domain books will help “level the playing field” in the AI industry. Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright. Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly-refined and curated content repositories that normally only established tech giants have the resources to assemble. “It’s gone through rigorous review,” he says…However the IDI’s dataset is released, it will be joining a host of similar projects, startups, and initiatives that promise to give companies access to substantial and high-quality AI training materials without the risk of running into copyright issues. Firms like Calliope Networks and ProRata have emerged to issue licenses and manage compensation schemes designed to get creators and rights holders paid for providing AI training data…”

Sorry, comments are closed for this post.