Russian state institutions 2024

This is a collection of full-text datasets based on contents extracted from the websites of Russian institutions.

This is a stable release.

no dataset includes materials published after 31 December 2023
download URLs and links to metadata pages are expected to remain available at their current location; updates with new contents may be announced here, but will be published separately; any quality adjustment introduced after final publication will be documented
a formal release in an established repository is forthcoming; relevant links will be added here
context and use cases are described in a dedicated book chapter:

Comai, Giorgio (2025, forthcoming), “Text-mining on-line sources from Russia openly”, in Autocracy, Influence, War: Russian Propaganda Today, edited by Paul Goode

The name of each corpus is composed of the bare domain name, a two-letter code for the main language of the contents, and the year of release of the dataset, separated by underscores, e.g. kremlin.ru_ru_2024.
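As an illustration of this naming scheme, the following minimal sketch splits a corpus name back into its three parts. The function and class names are hypothetical and not part of the datasets; the sketch assumes that the language code and the year never contain an underscore, so that splitting from the right is safe even for domains with several dots.

```python
from typing import NamedTuple


class CorpusName(NamedTuple):
    domain: str    # bare domain, e.g. "kremlin.ru"
    language: str  # two-letter language code, e.g. "ru"
    year: int      # year of release of the dataset


def parse_corpus_name(name: str) -> CorpusName:
    """Split a corpus name into domain, language code and release year."""
    # Split from the right: only the two trailing underscores act as separators.
    domain, language, year = name.rsplit("_", 2)
    if len(language) != 2 or not year.isdigit():
        raise ValueError(f"unexpected corpus name: {name!r}")
    return CorpusName(domain, language, int(year))


print(parse_corpus_name("kremlin.ru_ru_2024"))
print(parse_corpus_name("archive.premier.gov.ru_ru_2024"))
```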

Dataset format

Datasets are published as compressed csv files (.csv.gz), as well as in .ods format.
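The compressed csv files can be read without decompressing them first. The sketch below uses only Python's standard library; it assumes UTF-8 encoding and a header row, neither of which is stated above, and it writes a tiny made-up file (with a hypothetical file name) so that the example is self-contained.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def read_corpus(path):
    """Read a gzip-compressed csv file into a list of dicts (one per item)."""
    # Full texts can be long; raise the csv module's default field size limit.
    csv.field_size_limit(10_000_000)
    # newline="" is what the csv module expects for text-mode files.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


# Self-contained demo: write a tiny made-up file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "example.org_en_2024.csv.gz"  # hypothetical name
    with gzip.open(sample_path, mode="wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["doc_id", "text", "title", "date", "url"])
        writer.writerow(["example.org_en_1", "Body text.", "A title",
                         "2023-12-31", "https://example.org/1"])
    rows = read_corpus(sample_path)

print(len(rows), rows[0]["doc_id"])
```

For the larger corpora, iterating over the reader row by row instead of building a list may be preferable, as it keeps memory use low.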

In line with the tif (Text Interchange Formats) standard, each corpus has a few standard columns, as well as additional metadata depending on availability:

the first column is always doc_id, composed of the bare corpus name (the base domain of the source and the language code) and a numeric id, separated by an underscore. For the Russian version of the Kremlin’s website, such an id looks as follows: kremlin.ru_ru_12345 (where 12345 is the numeric id associated with the given item). Numeric identifiers have no inherent meaning, and their order may be arbitrary. If the original source website includes a unique numeric id in the url, this is kept in the doc_id; otherwise an id is assigned at database creation (and the numbering order may depend on how the extraction process was implemented). This format makes it possible to combine datasets while keeping doc_id unique.
the second column is always text: this is the main text included in the source page
the third column is always title
the fourth column is always date
other time-related fields, such as time and datetime, may follow if available (time and date refer to the original publication timezone; in this release, this is always Moscow time)
additional columns include fields and metadata, depending on availability of contents: this may include substantive text contents (e.g. a separate lede or description field), categories, tags, location, author, additional identifiers, etc.
finally, url is always the last column
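The column conventions listed above can be checked programmatically. The following sketch is only an illustration: both helper names are hypothetical, and the numeric id is kept as a string on the assumption that it is an opaque label rather than a number to compute with.

```python
def split_doc_id(doc_id):
    """Split a doc_id into (corpus name without year, numeric id as a string)."""
    # The numeric id follows the last underscore.
    corpus, _, numeric_id = doc_id.rpartition("_")
    if not corpus or not numeric_id.isdigit():
        raise ValueError(f"unexpected doc_id: {doc_id!r}")
    return corpus, numeric_id


def check_column_layout(columns):
    """Return a list of deviations from the documented column order."""
    problems = []
    expected_start = ["doc_id", "text", "title", "date"]
    if list(columns[:4]) != expected_start:
        problems.append(
            f"first columns are {list(columns[:4])}, expected {expected_start}"
        )
    if not columns or columns[-1] != "url":
        problems.append("last column is not url")
    return problems


print(split_doc_id("kremlin.ru_ru_12345"))
print(check_column_layout(["doc_id", "text", "title", "date", "time", "url"]))
```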

doc_id and url are conceptually unique and always present. In all of these datasets, date is also always present. All other fields may be missing or empty for some items (e.g. there may be items with a title but no text, or vice versa). See the documentation accompanying each dataset for more details.
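When combining several corpora, it may be useful to confirm that doc_id remains unique and to see how many items lack a given field. The sketch below works on rows represented as dictionaries; the function names and the sample rows are made up for illustration and do not come from the datasets.

```python
from collections import Counter


def combine_corpora(*corpora):
    """Concatenate corpora (lists of row dicts) and check doc_id uniqueness."""
    combined = [row for corpus in corpora for row in corpus]
    counts = Counter(row["doc_id"] for row in combined)
    duplicates = sorted(doc_id for doc_id, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicated doc_id values: {duplicates[:5]}")
    return combined


def count_missing(rows, fields=("text", "title", "date")):
    """Count the rows in which each field is absent or empty."""
    return {field: sum(1 for row in rows if not row.get(field)) for field in fields}


# Made-up rows: one item without a title, one item without text.
rows_en = [{"doc_id": "example.org_en_1", "text": "Some text", "title": "",
            "date": "2023-01-01", "url": "https://example.org/en/1"}]
rows_ru = [{"doc_id": "example.org_ru_1", "text": "", "title": "Title only",
            "date": "2023-01-01", "url": "https://example.org/ru/1"}]

combined = combine_corpora(rows_en, rows_ru)
print(len(combined), count_missing(combined))
```

Because the language code is part of the doc_id, the two sample items do not clash even though they share the same numeric id.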

License

Details about licensing are included along with the documentation of each corpus. The specifics vary slightly, but all of the source websites used to create this collection explicitly allowed for re-publication of contents under a Creative Commons (CC-BY) license or similar. To the extent that it is possible, the datasets themselves are also distributed by their creator, Giorgio Comai, under the Open Data Commons Attribution license (ODC-BY).

Summary statistics

Click on the corpus name for more information and download links

institution	website	language	corpus name	start date	end date	n_items
Russia’s president	kremlin.ru	en	kremlin.ru_en_2024	1999-12-31	2023-12-31	33 165
Russia’s president	kremlin.ru	ru	kremlin.ru_ru_2024	1999-12-31	2023-12-31	45 538
Russia’s MFA	mid.ru	ru	mid.ru_ru_2024	2003-01-02	2023-12-31	56 203
Russia’s MFA	mid.ru	en	mid.ru_en_2024	2003-01-04	2023-12-31	25 943
Russia’s government	government.ru	ru	government.ru_ru_2024	2012-04-24	2023-12-30	17 135
Russia’s government (archived version)	archive.government.ru	ru	archive.government.ru_ru_2024	2008-05-07	2013-05-21	7 103
Russia’s prime minister (archived version)	archive.premier.gov.ru	ru	archive.premier.gov.ru_ru_2024	2008-05-07	2012-05-07	3 323
Russia’s Duma	duma.gov.ru	ru	duma.gov.ru_ru_2024	2006-04-05	2023-12-30	29 094
Russia’s Duma (transcripts)	transcript.duma.gov.ru	ru	transcript.duma.gov.ru_ru_2024	1994-01-11	2023-12-15	6 032

List of available datasets

Title	Description	Categories
archive.government.ru_ru_2024	Corpus based on the archived version of Russia’s government website (in Russian, 2008-2013)	dataset, Russian institutions, Russian government, Russian language
archive.premier.gov.ru_ru_2024	Corpus based on the archived version of the website of Russia’s prime minister (in Russian, 2008-2012)	dataset, Russian institutions, Russian government, Russian language
duma.gov.ru_ru_2024	Corpus based on Russia’s Duma website (in Russian, 2006-2023)	dataset, Russian institutions, Russian government, Russian language
government.ru_ru_2024	Corpus based on Russia’s government website (in Russian, 2012-2023)	dataset, Russian institutions, Russian government, Russian language
kremlin.ru_en_2024	Corpus based on the website of Russia’s president (in English, 1999-2023)	corpus, full corpus, Russian institutions, Russia’s president, English language
kremlin.ru_ru_2024	Corpus based on the website of Russia’s president (in Russian, 1999-2023)	dataset, Russian institutions, Russian language
mid.ru_en_2024	Corpus based on the website of Russia’s MFA (in English, 2003-2023)	corpus, full corpus, Russian institutions, Russia’s MFA, English language
mid.ru_ru_2024	Corpus based on the website of Russia’s MFA (in Russian, 2003-2023)	corpus, full corpus, Russian institutions, Russia’s MFA, Russian language
transcript.duma.gov.ru_ru_2024	Corpus based on the transcripts published on Russia’s Duma website (in Russian, 1994-2023)	dataset, Russian institutions, Russian parliament, Russian language
