Introducing ‘Russian state institutions full-text datasets’ (2024 edition)

datasets
russia
text-mining
Stable version now officially published: a collection of corpora based on textual contents extracted from the websites of Russian state institutions
Author

Giorgio Comai

Published

October 30, 2024

Quick links

I am happy to announce that I have finally released, under a permissive license, a set of textual datasets extracted from the websites of Russian state institutions.

They are free to download, accompanied by detailed documentation at the following stable address (and, for redundancy, on Zenodo):

Giorgio Comai (2024): Russian state institutions full-text datasets – A collection of corpora based on contents extracted from the websites of Russian state institutions, v. 1.0, Discuss Data, https://doi.org/10.48320/0578D7FE-35F7-4E9E-A29D-926618A5C6BD

You can also find the same information on this website, by clicking on a corpus name in the summary table below:

| institution | website | language | corpus name | start date | end date | n_items |
| --- | --- | --- | --- | --- | --- | --- |
| Russia’s president | kremlin.ru | en | kremlin.ru_en_2024 | 1999-12-31 | 2023-12-31 | 33 165 |
| Russia’s president | kremlin.ru | ru | kremlin.ru_ru_2024 | 1999-12-31 | 2023-12-31 | 45 538 |
| Russia’s MFA | mid.ru | ru | mid.ru_ru_2024 | 2003-01-02 | 2023-12-31 | 56 203 |
| Russia’s MFA | mid.ru | en | mid.ru_en_2024 | 2003-01-04 | 2023-12-31 | 25 943 |
| Russia’s government | government.ru | ru | government.ru_ru_2024 | 2012-04-24 | 2023-12-30 | 17 135 |
| Russia’s government (archived version) | archive.government.ru | ru | archive.government.ru_ru_2024 | 2008-05-07 | 2013-05-21 | 7 103 |
| Russia’s prime minister (archived version) | archive.premier.gov.ru | ru | archive.premier.gov.ru_ru_2024 | 2008-05-07 | 2012-05-07 | 3 323 |
| Russia’s Duma | duma.gov.ru | ru | duma.gov.ru_ru_2024 | 2006-04-05 | 2023-12-30 | 29 094 |
| Russia’s Duma (transcripts) | transcript.duma.gov.ru | ru | transcript.duma.gov.ru_ru_2024 | 1994-01-11 | 2023-12-15 | 6 032 |

More context about the file formats and the data included is available on the page dedicated to this release. The official release on Discuss Data includes a PDF file for each of these datasets, outlining data processing and data quality issues.

These are the same datasets I have used in some of my previous posts.

This is a stable release, and it is expected to be updated once a year. The current version includes all posts published until the end of 2023; you may expect an updated version in the early months of 2025.

For each item, all datasets include a title, a date, and the text, plus further metadata where available: for example, if a website showed a “tag” section along with each post, or a field dedicated to named individuals, these are included as well.
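As a rough illustration of working with items that carry a title, a date, and a text, here is a minimal Python sketch that reads such records and counts items per year. It assumes a CSV export with columns named `title`, `date` (in ISO format) and `text`; both the file format and the column names are assumptions made for this example, so check the documentation of each dataset for the actual layout. The rows below are invented placeholders, not real items from the corpora.

```python
import csv
import io
from collections import Counter
from datetime import date

# Toy stand-in for a downloaded file. The column names and the rows are
# invented placeholders for illustration, not actual corpus contents.
SAMPLE_CSV = """title,date,text
Placeholder title A,2022-03-01,First placeholder text.
Placeholder title B,2022-11-15,Second placeholder text.
Placeholder title C,2023-06-30,Third placeholder text.
"""


def read_items(handle):
    """Yield one dict per item, with the date parsed into a datetime.date."""
    for row in csv.DictReader(handle):
        row["date"] = date.fromisoformat(row["date"])
        yield row


def items_per_year(items):
    """Count how many items fall in each calendar year."""
    return Counter(item["date"].year for item in items)


items = list(read_items(io.StringIO(SAMPLE_CSV)))
counts = items_per_year(items)
print(sorted(counts.items()))
```

With a real file, the `io.StringIO(SAMPLE_CSV)` handle would be replaced by something like `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded dataset.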

For the dataset based on the Kremlin’s website, an additional attempt has been made to identify the location from which each press release was issued. The result may not be 100% accurate, but it should enable some additional approaches to analysis. It also allows for exploring speeches and statements through an interactive map (posts issued from Moscow are excluded).

These datasets are free to download in full and should be easy to import into any content or text analysis software you may be using. For quick searches or simple word frequency exploration, I have also made them available through a web interface that supports basic operations, as well as filtering and exporting of text. The interface is not a finalised product and will eventually be updated, but it should be mostly functional (although slow when the “keyword-in-context” option is enabled).
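For readers who prefer to run this kind of exploration locally, the following Python sketch shows one simple way to compute word frequencies and keyword-in-context lines on a list of texts. It is not the code behind the web interface: the function names, the whitespace-insensitive tokenizer, and the sample sentences are all assumptions made for illustration. The `\w+` pattern matches Unicode word characters, so it should also work on Cyrillic text.

```python
import re
from collections import Counter

# Naive tokenizer: runs of Unicode word characters, lowercased.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split a text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def word_frequencies(texts):
    """Count token occurrences across an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def keyword_in_context(text, keyword, window=3):
    """Return (left, keyword, right) tuples for each exact token match,
    with up to `window` tokens of context on each side."""
    tokens = tokenize(text)
    keyword = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Invented sample sentences, used only to demonstrate the functions.
texts = ["The quick brown fox jumps over the lazy dog.", "The dog sleeps."]
freqs = word_frequencies(texts)
hits = keyword_in_context(texts[0], "fox", window=2)
print(freqs.most_common(2))
print(hits)
```

Exact token matching keeps the sketch short; for Russian-language corpora, where words are heavily inflected, matching on lemmas or word stems would likely give more useful results.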

Note

This textual dataset has been created in the context of the project “Framings of Russia’s invasion of Ukraine in Russia’s pro-Kremlin public discourse” carried out with the support of the Italian Ministry of Foreign Affairs and International Cooperation under art. 23 bis, D.P.R. 18/1967. Find more information and the full disclaimer on the project’s web page.