## LLM4DDC

[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](https://github.com/shigapov/ReproResearch/blob/main/CODE_OF_CONDUCT.md) 
[![Open Code](https://badgen.net/static/open/code/green)](https://github.com/shigapov/ReproResearch/tree/main/code)
[![Open Data](https://badgen.net/static/open/data/green)](https://github.com/shigapov/ReproResearch/tree/main/data)
[![Open Science](https://badgen.net/static/open/science/green)](https://en.wikipedia.org/wiki/Open_science)

**LLM4DDC**: Adopting Large Language Models (LLMs) for Research Data Classification Using Dewey Decimal Classification (DDC)

## Table of contents

* [Abstract](#abstract)
* [Work-in-progress](#work-in-progress)
* [Repo structure](#repo-structure)
* [How to contribute](#how-to-contribute)
* [License](#license)
* [Attribution](#attribution)

## Abstract

This replication package accompanies the paper "LLM4DDC: Adopting Large Language Models (LLMs) for Research Data Classification Using Dewey Decimal Classification (DDC)". The study investigates the use of LLMs and small language models (SLMs) to automate the classification of research data metadata into DDC classes. The package includes the code and dataset used in the experiments, enabling reproducibility and further research. It is intended to support researchers and practitioners in integrating automated classification approaches into research data infrastructures. All resources are also openly accessible on GitHub.

## Work-in-progress

* Our reference paper is: Tobias Weber, Dieter Kranzlmüller, Michael Fromm, Nelson Tavares de Sousa; Using supervised learning to classify metadata of research data by field of study. Quantitative Science Studies 2020; 1 (2): 525–550. doi: https://doi.org/10.1162/qss_a_00049
* We use the small-sized dataset from the reference paper: Weber, T. (2019). s-sized Training and Evaluation Data for Publication “Using Supervised Learning to Classify Metadata of Research Data by Field of Study.” https://doi.org/10.5281/zenodo.3490396
* Results of LLM4DDC:
  The figures below show the RoBERTa classification results at both the domain and the subject level, with predicted labels on the x-axis and true labels on the y-axis.
  Classification at the more detailed subject level performs better than at the domain level.
  The best results were obtained for the Science domain and its subjects, followed by Technology.

![Results of Top domain](https://github.com/TransforMA-WP3/LLM4DDC/blob/main/data/topdomain.png)
![Results of Subject](https://github.com/TransforMA-WP3/LLM4DDC/blob/main/data/subjects.png)
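
For readers who want to reproduce this kind of plot from their own predictions, the following is a minimal sketch, assuming the figures are confusion matrices (the text only states the axis convention: true labels on the y-axis, predicted labels on the x-axis). The function names and the example labels are hypothetical and are not taken from the scripts in `code/` or from the paper's results.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Return a nested list: rows are true labels, columns are predicted labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have the same length")
    # Count how often each (true, predicted) pair occurs.
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def per_class_recall(matrix):
    """Share of each true class predicted correctly (diagonal entry / row sum)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical labels for illustration only; these are NOT results from the paper.
classes = ["Science", "Technology", "Social sciences"]
y_true = ["Science", "Science", "Technology", "Social sciences", "Technology"]
y_pred = ["Science", "Science", "Technology", "Science", "Science"]

matrix = confusion_matrix(y_true, y_pred, classes)
recall = per_class_recall(matrix)

for name, row in zip(classes, matrix):
    print(f"{name:>16}: {row}")
print("recall per class:", recall)
```

A strong diagonal in such a matrix corresponds to a class that is mostly predicted correctly, which is how statements like "best results for Science" can be read off the figures.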


## Repo structure

The layout below was produced with `tree -L 2 -F -r`:

```
LLM4DDC/
├── docs/
│   ├── README_docs.md
│   └── E-Science-Days-abstract/
├── data/
│   ├── topdomain.png
│   ├── subjects.png
│   └── ddc_with_1000_1000.csv
├── code/
│   ├── roberta_subjects.py
│   └── roberta_domain.py
├── README.md
├── LICENSE.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
└── CITATION.cff
```
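
Before running the scripts, it can help to look at the dataset file listed in the tree above. The following is a minimal sketch using only the Python standard library. The column names in the stand-in sample are invented for illustration, since the actual columns of `data/ddc_with_1000_1000.csv` are not documented in this README, and the UTF-8 encoding in the commented-out part is likewise an assumption.

```python
import csv
import io


def summarize_csv(handle, max_rows=3):
    """Return the header and up to max_rows data rows of a CSV file-like object."""
    reader = csv.reader(handle)
    header = next(reader, [])
    # zip stops after max_rows, so large files are not read in full.
    rows = [row for _, row in zip(range(max_rows), reader)]
    return header, rows


# Stand-in content with hypothetical columns; replace with the real file (see below).
sample = io.StringIO(
    "title,label\n"
    "Ocean temperature records,Science\n"
    "Bridge load tests,Technology\n"
)
header, rows = summarize_csv(sample)
print(header)
print(rows)

# To inspect the actual file from the repository root (encoding is an assumption):
# with open("data/ddc_with_1000_1000.csv", newline="", encoding="utf-8") as f:
#     print(summarize_csv(f))
```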

## How to contribute

Thank you for your interest in contributing to the LLM4DDC repository. All contributions are welcome.

To get started, please follow these steps:

1. Open an issue describing your contribution.
2. Fork the repository and clone your fork to your local machine.
3. Create a new branch for your changes.
4. Make your changes and commit them with clear commit messages.
5. Push your changes to your forked repository.
6. Submit a pull request to the main repository.

See [CONTRIBUTING.md](https://github.com/TransforMA-WP3/LLM4DDC/blob/main/CONTRIBUTING.md) for more information.

## License

This work is licensed under the MIT License (code) and the Creative Commons Attribution 4.0 International license (everything else). You are free to share and adapt the material for any purpose, even commercially, as long as you provide attribution (see below).

## Attribution

This repository: Shahi, G. K., Shigapov, R., & Hummel, O. (2024). LLM4DDC: Adopting Large Language Models (LLMs) for Research Data Classification Using Dewey Decimal Classification (DDC) [Computer software]. https://github.com/TransforMA-WP3/LLM4DDC

Paper: Shahi, G. K., Shigapov, R., & Hummel, O. (2025). LLM4DDC: Adopting Large Language Models (LLMs) for Research Data Classification Using Dewey Decimal Classification (DDC). In: E-Science-Tage 2025 “Research Data Management: Challenges in a Changing World”. Heidelberg.

Archived replication package: Shahi, G. K., Shigapov, R., & Hummel, O. (2024). LLM4DDC: Adopting Large Language Models (LLMs) for Research Data Classification Using Dewey Decimal Classification (DDC) [Mannheim Data Repository MADATA]. https://www.doi.org/10.7801/479

Dataset: Weber, T. (2019). s-sized Training and Evaluation Data for Publication "Using Supervised Learning to Classify Metadata of Research Data by Field of Study" [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3490396