Has your paper been used to train an AI model? Almost certainly

Nature – Artificial-intelligence developers are buying access to valuable data sets that contain research papers, raising uncomfortable questions about copyright. “Academic publishers are selling access to research papers to technology firms to train artificial-intelligence (AI) models. Some researchers have reacted with dismay that such deals are happening without authors being consulted. The trend is raising questions about the use of published and sometimes copyrighted work to train the exploding number of AI chatbots in development. Experts say that, if a research paper hasn’t yet been used to train a large language model (LLM), it probably will be soon. Researchers are exploring technical ways for authors to spot if their content is being used. AI models fed AI-generated data quickly spew nonsense. Last month, it emerged that the UK academic publisher Taylor & Francis had signed a US$10-million deal with Microsoft, allowing the US technology company to access the publisher’s data to improve its AI systems. And in June, an investor update showed that US publisher Wiley had earned $23 million from allowing an unnamed company to train generative-AI models on its content. Anything that is available to read online, whether in an open-access repository or not, is “pretty likely” to have been fed into an LLM already, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle. “And if a paper has already been used as training data in a model, there’s no way to remove that paper after the model has been trained,” she adds…”
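The excerpt mentions that researchers are exploring technical ways for authors to spot whether their content has been used as training data. One family of such approaches is membership inference: an open-weights language model tends to assign unusually low loss (i.e., show unusual familiarity) to text it memorized during training. The sketch below illustrates only that general idea; it is not the specific method any researcher in the article uses, the "gpt2" model name is merely a stand-in for any open causal language model, and a single loss comparison is far too noisy to prove anything on its own.

```python
# Minimal sketch of a loss-based membership-inference heuristic,
# assuming access to an open-weights model via Hugging Face transformers.
# The model name and example passages are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; any open causal LM works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def passage_loss(text: str) -> float:
    """Mean cross-entropy the model assigns to `text`.

    Unusually low loss on a verbatim passage is weak evidence that
    the passage appeared in the model's training data.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()


if __name__ == "__main__":
    candidate = "Paste a verbatim passage from the paper under test here."
    baseline = "Paste comparable text the model cannot have seen here."
    print(f"candidate loss: {passage_loss(candidate):.3f}")
    print(f"baseline loss:  {passage_loss(baseline):.3f}")
```

In practice, published detection methods compare scores across many passages against calibrated baselines rather than a single pair, and even then the signal is probabilistic rather than conclusive.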

  • See also Ars Technica – Artists Claim ‘Big’ Win In Copyright Suit Fighting AI Image Generators
  • See also Washington Post – A group of visual artists and illustrators is celebrating a federal judge’s decision this week to allow key parts of their class-action lawsuit against the makers of popular AI image generators to move forward. The artists allege that tech start-ups including Midjourney and Stability AI violated various laws by training their AI image tools on the artists’ work without consent. They say the tools encourage users to generate images that closely mimic a given artist’s style. In a 33-page ruling issued Monday, U.S. District Judge William Orrick dismissed some of the artists’ claims but left core parts of the suit intact. That means the case can proceed to the discovery phase, which could bring to light internal communications about how the companies developed their AI tools.