Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2023
Item Type: | Dataset |
---|---|
Title: | Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - October 2023 |
Alternative Title: | RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets extracted from the October 2023 Common Crawl |
Date: | February 2024 |
Creator: | Brinkmann, Alexander and Bizer, Christian ORCID: https://orcid.org/0000-0003-2367-0237 |
Divisions: | School of Business Informatics and Mathematics > Wirtschaftsinformatik V (Bizer) |
DDC Classification: | 004 Computer science, internet |
Keywords: | Information Extraction, Semantic Annotations, Schema.org, Web Science |
Abstract: | The Web Data Commons RDFa, Microdata and Microformats data sets have been extracted from the September/October 2023 release of the Common Crawl. In summary, we found structured data within 1.7 billion HTML pages out of the 3.4 billion pages contained in the crawl (50.60%). These pages originate from 15 million different pay-level-domains out of the 34 million pay-level-domains covered by the crawl (42.89%). Altogether, the extracted data sets consist of 86 billion RDF quads. |
URL: | https://madata.bib.uni-mannheim.de/429/ |
DOI: | https://doi.org/10.7801/429 |
Availability (Controlled): | Download |
Availability: | The extracted structured data is available in the form of RDF quads. Instructions on how to download the RDFa, Microdata, Embedded JSON-LD, and Microformats data sets can be found here: https://webdatacommons.org/structureddata/2023-12/stats/how_to_get_the_da |
Publication(s) (MADOC): |
Brinkmann, Alexander and Primpeli, Anna and Bizer, Christian (2023), The Web Data Commons Schema.Org Data Set Series |
Reference URL (External): |
https://webdatacommons.org/structureddata/2023-12/... |
Project: |
Project Title: Web Data Commons - Microdata, RDFa, JSON-LD, and Microformat Data Sets
Project Description: More and more websites embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as Microdata, JSON-LD, RDFa, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far, the project has provided 12 different data set releases extracted from Common Crawl releases from 2010 to 2023. The project provides the extracted data for download and publishes statistics about the deployment of the different formats. |
Full text not available from this repository.
Depositing User: | Renat Shigapov |
---|---|
Date Deposited: | 09 Feb 2024 05:19 |
Last Modified: | 12 Feb 2024 08:02 |