Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Daily Archives: November 6, 2024

Ziff Davis study says AI firms rely on publisher data to train models

Axios: “Leading AI companies such as OpenAI, Google and Meta rely more on content from premium publishers to train their large language models (LLMs) than they publicly admit, according to new research from executives at Ziff Davis, one of the largest publicly-traded digital media companies.

Why it matters: Publishers believe that the more they can show that their high-end content has contributed to training LLMs, the more leverage they will have in seeking copyright protection and compensation for their material in the AI era.

Zoom in: While AI firms generally do not say exactly what data they use for training, executives from Ziff Davis say their analysis of publicly available datasets makes it clear that AI firms rely disproportionately on commercial publishers of news and media websites to train their LLMs.

  • The paper — authored by Ziff Davis’ lead AI attorney, George Wukoson, and its chief technology officer, Joey Fortuna — finds that for some large language models, content from a set of 15 premium publishers made up a significant amount of the data sets used for training.
  • For example, when analyzing an open-source replication of the OpenWebText dataset from OpenAI that was used to train GPT-2, executives found that nearly 10% of the URLs featured came from the set of 15 premium publishers it studied.

Of note: Ziff Davis is a member of the News/Media Alliance (NMA), a trade group that represents thousands of premium publishers. The new study’s findings resemble those of a research paper submitted by NMA to the U.S. Copyright Office last year…”
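
To make the URL-share figure in the excerpt concrete, here is a minimal Python sketch of how one might measure what fraction of a dataset's URLs fall under a chosen set of publisher domains. It is not the method used in the Ziff Davis paper, which the excerpt does not describe. The function names, the placeholder domains and the sample URLs are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if none can be parsed."""
    try:
        return (urlsplit(url).hostname or "").lower()
    except ValueError:
        # urlsplit raises ValueError on some malformed URLs (e.g. bad IPv6 brackets).
        return ""


def matches_domain(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def publisher_share(urls, publisher_domains) -> float:
    """Fraction of URLs whose host falls under any of the given domains."""
    urls = list(urls)
    if not urls:
        return 0.0
    hits = sum(
        1
        for u in urls
        if any(matches_domain(host_of(u), d) for d in publisher_domains)
    )
    return hits / len(urls)


# Hypothetical placeholder domains, NOT the 15 publishers studied in the paper.
PUBLISHERS = {"publisher-a.example", "publisher-b.example"}

# Hypothetical sample of dataset URLs.
SAMPLE_URLS = [
    "https://www.publisher-a.example/story/1",
    "https://publisher-b.example/reviews/2",
    "https://blog.other-site.example/post/3",
    "https://forum.example/thread/4",
    "https://notpublisher-a.example/page/5",
]

if __name__ == "__main__":
    share = publisher_share(SAMPLE_URLS, PUBLISHERS)
    print(f"Share of URLs from the publisher set: {share:.1%}")
```

Matching on the full hostname or a dot-separated suffix, rather than a plain substring, avoids counting look-alike hosts such as `notpublisher-a.example` as publisher content.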