This folder contains the data (SentenceComplexityDataset.txt) used in the paper:

Štajner, S., Ponzetto, S. P., Stuckenschmidt, H. 2017. Automatic Assessment of Absolute Sentence Complexity. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, pp. 4096-4102.

The dataset consists of 1131 sentences annotated for their complexity on a 1-5 scale, where:

1 - very complex (very difficult to understand)
2 - complex (difficult to understand)
3 - neutral (neither difficult nor easy to understand)
4 - simple (easy to understand)
5 - very simple (very easy to understand)

The distribution of sentences per score:

112 sentences scored 1
210 sentences scored 2
283 sentences scored 3
309 sentences scored 4
217 sentences scored 5

This data is licensed under a Creative Commons Attribution 4.0 International License.
https://creativecommons.org/licenses/by/4.0/

### ANNOTATIONS ###

The sentences were annotated by three non-native but fluent speakers of English; their scores were averaged and rounded to the closest integer. The annotators were not given any guidelines as to which kind of complexity to rate (e.g. lexical or syntactic); they were simply asked to mark how difficult the whole sentence was for them to understand. The average pairwise inter-annotator agreement (IAA), measured by quadratic Cohen's κ (to account for different degrees of annotator disagreement, as the task uses an ordinal scale), was 0.62 (0.58, 0.60, and 0.69 for the three pairs of annotators).

### SOURCES ###

The dataset consists of 150 original sentences from news articles and 150 original sentences from the English Wikipedia, together with their automatic simplifications produced by several different automated text simplification systems (for more details, please see the paper mentioned above). Sentences obtained by automatic simplification can be ungrammatical or meaningless.
This influences the obtained scores for complexity as well.
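The pairwise IAA figures reported above use quadratically weighted Cohen's κ. For reference, a minimal sketch of that metric for two raters on a 1-5 ordinal scale could look as follows (the function name and arguments are illustrative, not part of the released dataset):

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, k=5):
    """Quadratically weighted Cohen's kappa for two raters
    scoring the same items on an ordinal 1..k scale."""
    n = len(a)
    obs = Counter(zip(a, b))          # observed joint score counts
    ma, mb = Counter(a), Counter(b)   # marginal counts per rater
    num = den = 0.0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            w = (i - j) ** 2 / (k - 1) ** 2      # quadratic disagreement weight
            num += w * obs[(i, j)]               # weighted observed disagreement
            den += w * ma[i] * mb[j] / n         # weighted chance disagreement
    return 1.0 - num / den

# Perfect agreement yields kappa = 1.0; any disagreement lowers it,
# with larger score gaps penalized quadratically.
print(quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
```

Averaging three such pairwise values gives the reported overall IAA of 0.62.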