“The HathiTrust Research Center is pleased to announce the release of its Extracted Features Dataset, a dataset derived from 4.8 million public domain volumes, totaling over 1.8 billion pages currently available in the HathiTrust Digital Library collection. The dataset includes over 734 billion words, dozens of languages, and spans multiple centuries. The release of this dataset enables analytical work by a single researcher at a scale that, before now, had been virtually impossible. Humanities scholars can analyze all or part of the data to make new discoveries and further understanding about history, culture, and language.

“‘Large views of our published literature are extremely valuable for observing historic, cultural, and linguistic trends. This dataset addresses and solves common problems that researchers face including access to that literature, the technical obstacles to processing it, and the copyright issues involved when working with consumable (that is, individually readable) books,’ said Peter Organisciak, doctoral candidate at the Graduate School of Library and Information Science (GSLIS) at Illinois and researcher on the project.

“Researchers from GSLIS, the Illinois Informatics Initiative, and the Department of English at Illinois contributed to the creation of the dataset. After writing the code that identifies the many facets of the text, the team processed the enormous amount of data using Blue Waters, one of the most powerful supercomputers in the world, located at the National Center for Supercomputing Applications (NCSA) on the Illinois campus.

“In addition to applications in digital humanities, the dataset is also a useful tool for computer modeling and machine learning. Computer scientists can use the dataset to build algorithms that can determine whether a piece of text is written in English or French, for example.”