This archive contains the code and data that accompany the EACL 2017 short paper: "Unsupervised Cross-Lingual Scaling of Political Texts" 

1. CODE

The provided code (Python scripts) supports the following:

i. Training of translation matrices and construction of multilingual embedding spaces
ii. Computation of semantic similarity scores for input texts (of the same or different languages)
iii. Graph-based text scaling from pairwise document similarities, using the Harmonic Function Label Propagation (HFLP) and PageRank algorithms
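
The graph-based scaling in step iii can be illustrated with a minimal sketch of harmonic-function label propagation. This is not the code from graph.py: the function name, the choice of two pivot documents fixed at -1 and 1, and the toy similarity matrix are assumptions made for illustration. Each non-pivot document repeatedly takes the similarity-weighted average of its neighbours' scores, which converges to the harmonic solution on a connected graph.

```python
def harmonic_scaling(sim, pivots, max_iter=1000, tol=1e-10):
    """Scale documents from a symmetric pairwise similarity matrix.

    sim    -- list of lists, sim[i][j] >= 0 is the similarity of documents i and j
    pivots -- dict {document index: fixed score}, e.g. {0: -1.0, 2: 1.0} (assumed setup)
    """
    n = len(sim)
    # Pivot documents keep their fixed scores; all others start at 0.
    scores = [pivots.get(i, 0.0) for i in range(n)]
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(n):
            if i in pivots:
                continue
            total = sum(sim[i][j] for j in range(n) if j != i)
            if total == 0.0:
                continue  # isolated document: leave its score unchanged
            # Similarity-weighted average of the other documents' current scores.
            new = sum(sim[i][j] * scores[j] for j in range(n) if j != i) / total
            max_change = max(max_change, abs(new - scores[i]))
            scores[i] = new
        if max_change < tol:
            break
    return scores


# Toy example: document 1 is much more similar to pivot 0 than to pivot 2.
toy_sim = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
toy_scores = harmonic_scaling(toy_sim, {0: -1.0, 2: 1.0})
```

In this toy graph, document 1 ends up close to the pivot it resembles most, which is the behaviour one would expect from a similarity-driven scaling.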

The archive contains the following Python scripts: 

* corpus.py -- functionality for loading and preprocessing a corpus of textual documents
* embeddings.py -- loads word embeddings (possibly for multiple languages in parallel) and computes the semantic similarities between words
* evaluation.py -- implementation of metrics for evaluating automatically generated positions of political texts
* graph.py -- implementation of a generic graph structure and the HFLP and PageRank algorithms
* io_helper.py -- a collection of helper I/O functions
* main.py -- example code showing how to use the functionality of the other scripts to perform text scaling
* scaler.py -- implementation of the Wordfish text scaling model
* textsim.py -- implementation of alignment-based and aggregation-based semantic similarity metrics
* transmat.py -- implementation of the translation matrix model used to construct the multilingual embedding space
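
As a rough illustration of the idea behind a translation matrix, the following sketch learns a linear map from source-language vectors to target-language vectors using word translation pairs. It is not the code from transmat.py: the function names, the plain gradient-descent training, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
def learn_translation_matrix(src, tgt, lr=0.1, epochs=2000):
    """Fit M so that x @ M approximates y for each translation pair (x, y).

    Minimises the mean squared error with batch gradient descent.
    src, tgt -- equally long lists of source and target word vectors
    """
    d_src, d_tgt = len(src[0]), len(tgt[0])
    n = len(src)
    M = [[0.0] * d_tgt for _ in range(d_src)]
    for _ in range(epochs):
        grad = [[0.0] * d_tgt for _ in range(d_src)]
        for x, y in zip(src, tgt):
            pred = [sum(x[i] * M[i][k] for i in range(d_src)) for k in range(d_tgt)]
            for i in range(d_src):
                for k in range(d_tgt):
                    # Gradient of (1/n) * sum ||xM - y||^2 with respect to M[i][k].
                    grad[i][k] += 2.0 * x[i] * (pred[k] - y[k]) / n
        for i in range(d_src):
            for k in range(d_tgt):
                M[i][k] -= lr * grad[i][k]
    return M


def translate(x, M):
    """Project a source-language vector into the target embedding space."""
    return [sum(x[i] * M[i][k] for i in range(len(x))) for k in range(len(M[0]))]


# Toy vectors (invented): the target space is the source space rotated by 90 degrees.
toy_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_tgt = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
toy_M = learn_translation_matrix(toy_src, toy_tgt)
projected = translate([2.0, 3.0], toy_M)
```

Once every non-English vocabulary has been projected this way, words from all languages live in one space and can be compared with an ordinary similarity measure such as cosine similarity.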

2. DATA 

The dataset is a corpus of transcripts of speeches given by representatives in the European Parliament (more details are provided in the paper). We concatenated all speeches of representatives from the same party into a single document per party. The dataset covers 25 parties from 5 countries:

1. Germany
301: Christian-Democratic Union (CDU)
302: Social Democratic Party of Germany (SPD)
306: Party of Democratic Socialism (PDS)
308: Christian Social Union in Bavaria (CSU)

2. Spain
502: People's Party (PP)
505: Convergence and Union (CiU)
506: Basque Nationalist Party (PNV)
513: Galician Nationalist Block (BNG) 
516: Andalusian Party (PA)

3. France
601: French Communist Party (PCF)
605: Socialist Party (PS)
609: Rally for the Republic (RPR)
610: National Front (NF)

4. Italy
802: Democrats of the Left (DS)
803: Communist Refoundation (RC)
805: National Alliance (AN)
815: League North / Northern League (LN)
818: United Christian Democrats (CDU)
819: Democrats (Dem)
823: Italian People's Party (PPI)

5. United Kingdom
1101: Conservative Party (Cons)
1102: Labour Party (Lab)
1104: Liberal Democrats (LibDems)
1105: Scottish National Party (SNP)
1106: Party of Wales (Plaid)

We provide both the monolingual and multilingual versions of the dataset:

i. Monolingual (English) variant: data/dataset/monolingual
ii. Multilingual (speeches in native languages of speakers/parties: English, Spanish, Italian, German, French): data/dataset/multilingual 

Besides the texts, the "data/dataset" folder contains two files with gold-standard political positions of the 25 parties. The gold positions were obtained from the 2002 Chapel Hill Expert Survey (http://chesdata.eu/):

i. gs-l2r-ideology.txt contains scores assigned to parties by experts (political scientists), corresponding to the parties' left-to-right ideology positions
ii. gs-eu-intergation.txt contains scores assigned to parties by experts (political scientists), corresponding to the parties' positions on European integration (pro-European vs. anti-European)
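
Gold scores like these are typically compared with automatically produced positions using rank-based or correlation-based measures. The sketch below shows two common choices, pairwise accuracy and Pearson correlation. It is not the code from evaluation.py, it does not read the gold files (their format is not described here), and the numbers are invented purely for illustration.

```python
from itertools import combinations
from math import sqrt


def pairwise_accuracy(gold, pred):
    """Fraction of party pairs that the predicted scores order the same way as the gold scores."""
    pairs = list(combinations(range(len(gold)), 2))
    correct = sum(
        1 for i, j in pairs if (gold[i] - gold[j]) * (pred[i] - pred[j]) > 0
    )
    return correct / len(pairs)


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes both lists have non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy scores for four hypothetical parties (not taken from the gold files).
toy_gold = [1.0, 2.0, 3.0, 4.0]
toy_pred = [0.1, 0.3, 0.2, 0.9]
toy_accuracy = pairwise_accuracy(toy_gold, toy_pred)
toy_correlation = pearson(toy_gold, toy_pred)
```

Pairwise accuracy only looks at the ordering of parties, while Pearson correlation is also sensitive to the distances between them, so the two measures can disagree on which scaling is better.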


Additionally, in the "data" folder, we provide: 

* Lists of stopwords for each of the five languages involved in the evaluation ("data/stopwords")
* Lists of word translation pairs used to learn (and test) the translation matrices: ES-EN, IT-EN, DE-EN, FR-EN ("data/trans-pairs")