Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1)
Item Type: | Dataset |
---|---|
Title: | Product Datasets from the MWPD2020 Challenge at the ISWC2020 Conference (Task 1) |
Alternative Title: | Product Data Matching Task derived from the WDC Product Data Corpus Large-Scale Product Matching - Version 2.0 used for the MWPD2020 Challenge at the ISWC2020 Conference |
Date: | November 2020 |
Creator: | Bizer, Christian ORCID: https://orcid.org/0000-0003-2367-0237, Peeters, Ralph and Primpeli, Anna |
Divisions: | School of Business Informatics and Mathematics > Wirtschaftsinformatik V (Bizer) |
DDC Classification: | 004 Computer science, internet |
Keywords: | schema.org ; product matching ; entity matching ; identity resolution ; record linkage ; e-commerce |
Abstract: | The goal of Task 1 of the Mining the Web of Product Data Challenge (MWPD2020) was to compare the performance of methods for identifying offers for the same product from different e-shops. The datasets provided to the participants of the competition contain product offers from different e-shops in the form of binary product pairs (each labeled “match” or “no match”) from the product category computers. The data is available as training, validation, and test sets for machine learning experiments. The training set consists of ~70K product pairs which were automatically labeled using the weak supervision of marked-up product identifiers on the web. The validation set contains 1,100 manually labeled pairs. The test set, which was used for the evaluation of participating systems, consists of 1,500 manually labeled pairs. The test set is intentionally harder than the other sets: it contains more very hard matching cases, and a subset of its pairs covers a variety of matching challenges, e.g. products that have no training data in the training set or products which have had typos introduced. These pairs can be used to measure the performance of methods on these kinds of matching challenges. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites that mark up their offers with the schema.org vocabulary. For more information and download links for the corpus itself, please follow the links below. |
URL: | https://madata.bib.uni-mannheim.de/352/ |
DOI: | https://doi.org/10.7801/352 |
Availability (Controlled): | Download |
Availability: | The data is available in JSON format. |
Publication(s) (MADOC): |
Zhang, Ziqi and Bizer, Christian and Peeters, Ralph and Primpeli, Anna (2020), MWPD2020: Semantic Web challenge on Mining the Web of HTML-embedded product data
Reference URL (External): |
https://ir-ischool-uos.github.io/mwpd/
http://webdatacommons.org/largescaleproductcorpus/... http://webdatacommons.org/largescaleproductcorpus/ |
Project: |
Project Title: Semantic Web Challenge @ ISWC2020: Mining the Web of HTML-embedded Product Data (Task 1) Project Description: Recent years have seen significant growth of semantic annotations on the Web, using markup languages such as Microdata together with the schema.org vocabulary. A domain witnessing a particular boom in semantic annotations is e-commerce, where online shops are increasingly embedding schema.org annotations into HTML pages describing products in order to enable search engines to easily identify product offers and potentially drive traffic to the respective websites. The potential as well as the challenges resulting from the widespread availability of semantically annotated product data on the Web motivated the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data (MWPD2020), as well as the specific tasks of the challenge: product matching and product classification. In the first task, participants need to identify offers for the same product originating from different websites. The goal of the second task is to categorize offers from different websites into the GS1 GPC product hierarchy. For both tasks, we have assembled training, validation, and test sets consisting of semantically annotated product data from a wide variety of different websites. Six teams from the USA, China, Japan, and Germany participated in the challenge. The winning system in Task 1, PMap, achieved an F1 score of 86.05 using an ensemble of transformer-based language models. More information about the challenge as well as the participating systems can be found in the MWPD2020 Proceedings. |
File | Filename / Infos | Link |
---|---|---|
Archive | Filename: ISWC2020_SWC_MWPD_challenge.zip | Download (32 MB) |
Depositing User: | Ralph Peeters |
---|---|
Date Deposited: | 26 Nov 2020 17:10 |
Last Modified: | 29 Feb 2024 20:30 |
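
Usage sketch: the record above reports system performance on Task 1 as an F1 score over binary pair labels. The following minimal Python sketch shows how precision, recall, and F1 could be computed for the "match" class. It assumes labels are encoded as 1 ("match") and 0 ("no match") and that the score is computed on the positive class; the function name `pair_f1` and the toy label lists are hypothetical and are not taken from the dataset or the challenge tooling.

```python
def pair_f1(gold, predicted):
    """Return (precision, recall, f1) for the positive ("match") class.

    Assumption: labels are integers, 1 = "match", 0 = "no match".
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only (not real challenge data).
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

precision, recall, f1 = pair_f1(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The function returns values between 0 and 1; the score of 86.05 quoted in the project description appears to be on a 0 to 100 scale, so multiply by 100 before comparing.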