Labeled Reference Data from the Linked Open Citation Database (LOC-DB) Project
==============================================================================

The data consists of 515 pages of reference lists from books and chapters, together with a labeled box for each entry in the reference list. The XML files contain the coordinates of the 10,722 boxes and a label for each box (`box` or `incomplete`).

The XML files are in PASCAL VOC format. A box entry looks, for example, like this:

```
<object>
    <name>box</name> # "box" for boxes which contain the whole reference string, "incomplete" for boxes which contain only a part of the reference string
    <pose>Unspecified</pose> # this element is not used for the reference analysis, it always contains a default value
    <truncated>0</truncated> # this element is not used for the reference analysis, it always contains a default value
    <difficult>0</difficult> # this element is not used for the reference analysis, it always contains a default value
    <bndbox>
        <xmin>558</xmin> # x coordinate of the upper left corner of the bounding box in pixels counting from the upper left corner of the page
        <ymin>775</ymin> # y coordinate of the upper left corner of the bounding box in pixels counting from the upper left corner of the page
        <xmax>1944</xmax> # x coordinate of the lower right corner of the bounding box in pixels counting from the upper left corner of the page
        <ymax>864</ymax> # y coordinate of the lower right corner of the bounding box in pixels counting from the upper left corner of the page
    </bndbox>
</object>
```

## Folders and file names

The files are organized in 5 folders:

* single-column-pdfs: contains reference pages from e-book PDFs (born digital) with single-column layout
* single-column-scans: contains scanned reference pages from print books with single-column layout
* single-column-scans-cropped: contains scanned reference pages from print books with single-column layout and cropped margins
* two-columns-scans: contains scanned reference pages from print books with two-column layout
* x-references-intervene-normal-text: contains non-standard reference lists (e.g.
  from annotated bibliographies or endnotes). Here, the "incomplete" label is also used for boxes containing information which is not part of the reference string (e.g. annotations).

The file names contain the ID (called PPN) from the SWB union catalog http://swb.bsz-bw.de/DB=2.1/SET=1/TTL=1/START_WELCOME where the bibliographic metadata of the book can easily be found.

## Details about the labeling process

* The labeling took place from February to May 2017 and was done by student workers as well as librarians of the Mannheim University Library.
* The data was produced during the LOC-DB project https://locdb.bib.uni-mannheim.de/
* The images were labeled with the software labelImg, versions 1.2.1 and 1.2.2.
* Each box should contain the whole reference, but may also include a little extra space, e.g. to the right. Text before or after references is not labeled.

## Examining the data visually

Possible steps to get a visual impression of the data:

1. Download the software labelImg (e.g. version 1.6.0): https://tzutalin.github.io/labelImg/
2. Unzip and run labelImg
3. Click 'Change default saved annotation folder' in Menu/File and choose one of the subfolders here
4. Click 'Open Dir' and choose the same subfolder

## Copyright and Data Citations

All data here is CC0 and can be reused without further limitations. However, we encourage you to include a data citation in any publication using the data here:

Akansha Bhardwaj, Laura Erhard, Annette Klein, Sylvia Zander, Philipp Zumstein (2018): Labeled Reference Data from the Linked Open Citation Database (LOC-DB) Project. Universitätsbibliothek Mannheim. https://doi.org/10.7801/268
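
## Reading the annotations programmatically

Besides inspecting the pages in labelImg, the box labels and coordinates described at the top of this README can be read with a few lines of Python. The following is only a minimal sketch using the standard library: the names `Box`, `parse_boxes`, `read_boxes` and `SAMPLE` are made up for illustration, the `<annotation>` root element is assumed from the usual PASCAL VOC layout, and the second object in the sample uses hypothetical coordinates. It also assumes that the actual XML files do not contain the `#` explanations shown in the example above.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str  # "box" or "incomplete"
    xmin: int   # upper left corner, pixels from the upper left of the page
    ymin: int
    xmax: int   # lower right corner, pixels from the upper left of the page
    ymax: int


def parse_boxes(root: ET.Element) -> List[Box]:
    """Collect every <object> element found below the given root element."""
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        boxes.append(
            Box(
                label=obj.findtext("name").strip(),
                xmin=int(bndbox.findtext("xmin")),
                ymin=int(bndbox.findtext("ymin")),
                xmax=int(bndbox.findtext("xmax")),
                ymax=int(bndbox.findtext("ymax")),
            )
        )
    return boxes


def read_boxes(xml_path: str) -> List[Box]:
    """Read all boxes from one annotation file on disk."""
    return parse_boxes(ET.parse(xml_path).getroot())


# Hypothetical minimal annotation: the first object uses the sample values
# from this README, the second one is invented to show the other label.
SAMPLE = """<annotation>
  <object>
    <name>box</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>558</xmin><ymin>775</ymin><xmax>1944</xmax><ymax>864</ymax></bndbox>
  </object>
  <object>
    <name>incomplete</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>558</xmin><ymin>900</ymin><xmax>1944</xmax><ymax>940</ymax></bndbox>
  </object>
</annotation>"""

sample_boxes = parse_boxes(ET.fromstring(SAMPLE))

# Keep only boxes that contain a whole reference string.
complete_boxes = [b for b in sample_boxes if b.label == "box"]

for b in sample_boxes:
    print(b.label, "width:", b.xmax - b.xmin, "height:", b.ymax - b.ymin)
```

For a real file, `read_boxes` can be called with the path to one of the XML files in the subfolders; filtering on the label as above separates complete reference strings from partial ones.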