Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

New web crawler launched by Meta last month is quietly scraping the internet for AI training data

Fortune [no paywall]: “Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model. The crawler, named the Meta External Agent, was launched last month according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups. A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data. Meta, the parent company of Facebook, Instagram, and Whatsapp, updated a corporate website for developers with a tab disclosing the existence of the new scraper in late July, according to a version history found using the Internet Archive. Besides updating the page, Meta has not publicly announced the new crawler. A Meta spokesman said the company has had a crawler under a different name “for years,” although this crawler—dubbed Facebook External Hit— “has been used for different purposes over time, like sharing link previews.” “Like other companies, we train our generative AI models on content that is publicly available online,” the spokesman said. “We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.” Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. 
Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data (Fortune was among several news providers that announced a revenue-sharing deal with Perplexity in July)…”
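
For publishers wondering how such an exclusion might be checked in practice, here is a minimal sketch. It assumes the opt-out is expressed in a site's robots.txt file, which the excerpt does not state. The user-agent tokens `meta-externalagent` and `GPTBot`, the sample rules, and the `example.com` URL are illustrative assumptions, not details taken from the Fortune article. Honoring robots.txt is voluntary on the crawler's side, so this only shows what a site has asked for, not what a bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# The user-agent tokens below are assumptions; confirm the current
# tokens in each company's own crawler documentation.
ROBOTS_TXT = """\
User-agent: meta-externalagent
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return a dict mapping each agent name to whether it may fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["meta-externalagent", "GPTBot", "SomeOtherBot"],
    "https://example.com/news/story.html",
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these sample rules, the two named crawlers are refused everywhere on the site, while any other bot falls through to the wildcard entry and is allowed.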
