Jill Lepore, The New Yorker: “The Wayback Machine has archived more than four hundred and thirty billion Web pages. The Web is global, but, aside from the Internet Archive, a handful of fledgling commercial enterprises, and a growing number of university Web archives, most Web archives are run by national libraries. They collect chiefly what’s in their own domains (the Web Archive of the National Library of Sweden, for instance, includes every Web page that ends in “.se”). The Library of Congress has archived nine billion pages, the British Library six billion. Those collections, like the collections of most national libraries, are in one way or another dependent on the Wayback Machine; the majority also use Heritrix, the Internet Archive’s open-source code. The British Library and the Bibliothèque Nationale de France backfilled the early years of their collections by using the Internet Archive’s crawls of the .uk and .fr domains. The Library of Congress doesn’t actually do its own Web crawling; it contracts with the Internet Archive to do it instead.”