Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

News homepages, archived

Data is Plural: “Since launching in March 2022, homepages.news has archived millions of screenshots, performance audits, robots.txt files, accessibility trees, and hyperlink lists from the homepages of 1,100+ news sites. The open-source project, run by journalist Ben Welsh, provides bulk data for each of those assets. The screenshots themselves are stored on the Internet Archive; you can also view the latest screenshots from all the sites on one page. To date, the publications span 32 countries and 17 languages. Related: Welsh and volunteer Alex Garcia are using the robots.txt data to track which sites block OpenAI, Google AI, and Common Crawl — findings that have been cited widely.”
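For readers curious how a robots.txt file can reveal whether a site blocks AI crawlers, here is a minimal sketch using Python's standard library. It is not the project's actual method, which the excerpt does not describe. The user-agent tokens (`GPTBot` for OpenAI, `Google-Extended` for Google AI, `CCBot` for Common Crawl) are the publicly documented crawler names; the function name, the sample file, and the `example.com` URL are hypothetical and only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Publicly documented crawler tokens; the mapping itself is an assumption
# about which agents one would want to check.
AI_CRAWLERS = {
    "OpenAI": "GPTBot",
    "Google AI": "Google-Extended",
    "Common Crawl": "CCBot",
}


def blocked_crawlers(robots_txt: str, site_url: str = "https://example.com/") -> dict[str, bool]:
    """Return, per crawler, whether the robots.txt text disallows the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        name: not parser.can_fetch(agent, site_url)
        for name, agent in AI_CRAWLERS.items()
    }


# Hypothetical robots.txt content, not taken from any real site.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for name, blocked in blocked_crawlers(SAMPLE).items():
        print(f"{name}: {'blocked' if blocked else 'allowed'}")
```

This only checks the site root against each crawler's rules; a fuller analysis would also need to handle partial disallows and sites that serve no robots.txt at all.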
