Jeremy Singer-Vine, Data is Plural: “Melissa Dell et al.’s American Stories dataset contains the text of ~400 million newspaper articles, extracted from ~20 million public-domain scans in the Library of Congress’s Chronicling America project (DIP 2017.08.16). To construct the dataset, the authors built ‘a novel deep learning pipeline that incorporates layout detection, legibility classification, custom OCR, and the association of article texts spanning multiple bounding boxes.’ For each article, the dataset provides the newspaper name, edition number, date of publication (largely in the 1800s–1920s), page number, headline, byline, and article text. Previously: The LOC’s Newspaper Navigator dataset (DIP 2020.10.07), which extracts visual content from the Chronicling America scans. [h/t Derek M. Jones]”