
Mannheim Research Data

Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0

Item Type:	Dataset
Title:	Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0
Alternative Title:	Product Matching Task derived from the WDC Product Data Corpus - Version 2.0
Date:	October 2019
Creator:	Bizer, Christian ORCID: 0000-0003-2367-0237 ; Primpeli, Anna ; Peeters, Ralph
Divisions:	School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)

DDC Classification:	004 Computer science, internet
Keywords:	schema.org ; product matching ; entity matching ; identity resolution ; record linkage ; e-commerce
Abstract:	Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. To support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). In addition, for each training set we provide a set of ids for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

URL:	https://madata.bib.uni-mannheim.de/351/
DOI:	https://doi.org/10.7801/351
Availability (Controlled):	Download
Availability:	The data is available in JSON format.
Publication(s) (MADOC):	Primpeli, Anna and Peeters, Ralph and Bizer, Christian (2019), The WDC training dataset and gold standard for large-scale product matching ; Peeters, Ralph and Primpeli, Anna and Wichtlhuber, Benedikt and Bizer, Christian (2020), Using schema.org annotations for training and maintaining product matchers
Reference URL (External):	http://webdatacommons.org/largescaleproductcorpus/... http://webdatacommons.org/largescaleproductcorpus/
Project:	Project Title: WDC Product Data Corpus and Gold Standard for Large-Scale Product Matching - Version 2.0
	Project Description: The research focus in the field of identity resolution (also called duplicate detection, record linkage, or link discovery) is moving from traditional symbolic matching methods towards matching methods based on embeddings and deep neural networks. One problem with evaluating deep learning-based matchers is that they require large amounts of training data to play to their strengths. The benchmark datasets that have been used so far for comparing matching methods are often too small to properly evaluate this new family of methods. Another problem with existing benchmark datasets is that they are mostly based on data from a small set of data sources and thus do not properly reflect the heterogeneity found in large-scale integration scenarios. The WDC gold standard and training sets for large-scale product matching tackle both challenges: they are derived from a large product data corpus originating from many websites that annotate product descriptions using the schema.org vocabulary.

File	Filename / Infos	Link
	Archive Filename: wdc_lspc_v2_sets.zip	Download (413MB)

Depositing User:	Ralph Peeters
Date Deposited:	26 Nov 2020 16:50
Last Modified:	05 Mar 2024 13:55
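
The abstract mentions that the validation id sets were produced by a stratified random draw. The archive already ships those id sets, so the following is only a minimal sketch of what such a draw does, not the creators' procedure. The field names `pair_id` and `label`, the validation fraction of 0.2, the seed, and the toy records are all assumptions made for illustration and are not taken from the dataset.

```python
import random
from collections import defaultdict


def stratified_validation_ids(pairs, valid_fraction=0.2, seed=42):
    """Draw validation ids so that each label keeps roughly its original share.

    `pairs` is a list of dicts with the hypothetical keys "pair_id" and "label".
    """
    # Group pair ids by their label ("match" = 1, "no match" = 0 in this toy setup).
    by_label = defaultdict(list)
    for pair in pairs:
        by_label[pair["label"]].append(pair["pair_id"])

    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    valid_ids = set()
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        k = round(len(ids) * valid_fraction)  # per-label sample size
        valid_ids.update(rng.sample(ids, k))
    return valid_ids


# Toy data (assumption): 100 pairs, every fourth one labelled as a match.
pairs = [{"pair_id": f"p{i}", "label": 1 if i % 4 == 0 else 0} for i in range(100)]

valid_ids = stratified_validation_ids(pairs)
train = [p for p in pairs if p["pair_id"] not in valid_ids]
valid = [p for p in pairs if p["pair_id"] in valid_ids]
```

Sampling within each label group separately is what makes the draw stratified: the share of matches in the validation split mirrors the share in the full training set, which matters when matches are much rarer than non-matches.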
