Russian institutions 2024
This is a collection of text datasets based on contents extracted from the websites of Russian institutions.
This is a stable release.
- no dataset includes materials published after 31 December 2023
- download urls and links to metadata pages are expected to remain available at their current location; updates with new contents may be announced here, but will be published separately; any quality adjustment introduced after final publication will be documented
- a formal release in an established repository is forthcoming - relevant links will be added here
- context and use cases will be described in a dedicated book chapter [forthcoming]
The name of each corpus is composed of the bare domain name, a two-letter code for the main language of the contents, and the year of release of the dataset, separated by underscores, e.g. `kremlin.ru_ru_2024`.
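The naming convention can be sketched in a few lines of Python. This is only an illustration: `CorpusName`, `parse_corpus_name` and `build_corpus_name` are hypothetical helper names, not part of the release, and the sketch assumes that the language code and the year never contain an underscore.

```python
from typing import NamedTuple


class CorpusName(NamedTuple):
    """Hypothetical container for the three parts of a corpus name."""
    domain: str
    language: str
    year: int


def parse_corpus_name(name: str) -> CorpusName:
    # Split from the right, so only the last two underscores act as separators.
    domain, language, year = name.rsplit("_", 2)
    if len(language) != 2 or not year.isdigit():
        raise ValueError(f"unexpected corpus name: {name!r}")
    return CorpusName(domain, language, int(year))


def build_corpus_name(domain: str, language: str, year: int) -> str:
    # Inverse of parse_corpus_name: join the three parts with underscores.
    return f"{domain}_{language}_{year}"
```

For instance, `parse_corpus_name("kremlin.ru_ru_2024")` gives the domain `kremlin.ru`, the language `ru` and the year `2024`.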
Dataset format
Datasets are published as compressed csv files (.csv.gz), as well as in .ods format.
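A compressed csv file can be read without unpacking it first. The following is a minimal sketch using only the Python standard library; it assumes UTF-8 encoding and comma-separated fields, which should be checked against the actual files, and `read_corpus` is a hypothetical helper name.

```python
import csv
import gzip

# Long documents may exceed the csv module's default field size limit
# (131072 characters), so the limit is raised before reading.
csv.field_size_limit(2**31 - 1)


def read_corpus(path):
    """Yield one dict per row, keyed by the column names in the header."""
    # "rt" opens the gzip stream in text mode; newline="" lets the csv
    # module handle line breaks that occur inside quoted text fields.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)
```

Rows are produced lazily, so a large corpus can be processed one item at a time, e.g. `for row in read_corpus("kremlin.ru_en_2024.csv.gz"): ...` (the file name here is an assumption based on the corpus naming scheme).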
In line with the tif standard, each corpus has a few standard columns, as well as additional metadata depending on availability:
- the first column is always `doc_id`, composed of the bare corpus name (based on the base domain of the source and the language) and a numeric id, separated by an underscore. For the Russian version of the Kremlin’s website, such an id would look as follows: `kremlin.ru_ru_12345` (where 12345 is the numeric id associated with the given item). Numeric identifiers have no inherent meaning, and their order may carry no information. If the original source website includes a unique numeric id in the url, this is maintained in the `doc_id`; otherwise an id is assigned at database creation (and the numbering order may depend on how the extraction process was implemented). This format makes it possible to combine datasets while ensuring that `doc_id` is still unique.
- the second column is always `text`: this is the main text included in the source page
- the third column is always `title`
- the fourth column is always `date`
- other time-related fields, such as `time` and `datetime`, may follow if available (time and date refer to the original publication timezone; in this release, this is always Moscow time)
- additional columns include fields and metadata, depending on availability of contents: this may include substantive text contents (e.g. a separate `lede` or `description` field), categories, tags, location, author, additional identifiers, etc.
- finally, `url` is always the last column
`doc_id` and `url` are unique and always present. In all of these datasets, `date` is also always present.
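The column layout and the `doc_id` convention described above lend themselves to a quick sanity check. The sketch below is one possible way to do it in plain Python; `check_layout`, `split_doc_id` and `combine` are hypothetical helper names, and rows are assumed to be dictionaries keyed by column name.

```python
LEADING_COLUMNS = ["doc_id", "text", "title", "date"]


def check_layout(header):
    """Return a list of problems found in a list of column names."""
    problems = []
    if header[:4] != LEADING_COLUMNS:
        problems.append("first four columns are not doc_id, text, title, date")
    if not header or header[-1] != "url":
        problems.append("last column is not url")
    return problems


def split_doc_id(doc_id):
    # The numeric id follows the last underscore; it is kept as a string
    # because it carries no inherent meaning as a number.
    corpus, _, numeric_id = doc_id.rpartition("_")
    return corpus, numeric_id


def combine(*corpora):
    """Concatenate rows from several corpora, checking doc_id uniqueness."""
    seen = set()
    combined = []
    for rows in corpora:
        for row in rows:
            if row["doc_id"] in seen:
                raise ValueError(f"duplicate doc_id: {row['doc_id']}")
            seen.add(row["doc_id"])
            combined.append(row)
    return combined
```

Since each `doc_id` starts with the corpus name, rows from different corpora should not collide, so `combine` is expected to raise only if the same dataset is passed in twice or a file has been altered.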
Summary statistics
Click on the corpus name for more information and download links
institution | website | language | corpus name | start date | end date | n_items |
---|---|---|---|---|---|---|
Russia’s president | kremlin.ru | en | kremlin.ru_en_2024 | 1999-12-31 | 2023-12-31 | 33 165 |
Russia’s president | kremlin.ru | ru | kremlin.ru_ru_2024 | 1999-12-31 | 2023-12-31 | 45 538 |
Russia’s MFA | mid.ru | ru | mid.ru_ru_2024 | 2003-01-02 | 2023-12-31 | 55 644 |
Russia’s MFA | mid.ru | en | mid.ru_en_2024 | 2003-01-04 | 2023-12-31 | 25 882 |
Russia’s government | government.ru | ru | government.ru_ru_2024 | 2006-06-22 | 2023-12-30 | 17 137 |
Russia’s government (archived version) | archive.government.gov.ru | ru | archive.government.gov.ru_ru_2024 | 2008-05-07 | 2013-05-21 | 7 103 |
Russia’s prime minister (archived version) | archive.premier.gov.ru | ru | archive.premier.gov.ru_ru_2024 | 2008-05-07 | 2012-05-07 | 3 569 |
Russia’s Duma | duma.gov.ru | ru | duma.gov.ru_ru_2024 | 2006-04-05 | 2023-12-30 | 29 094 |
Russia’s Duma (transcripts) | transcript.duma.gov.ru | ru | transcript.duma.gov.ru_ru_2024 | 1994-01-11 | 2023-12-15 | 5 960 |
List of available datasets
Title | Description | Categories |
---|---|---|
kremlin.ru_en_2024 | Corpus based on Russia's president website (in English, 1999-2023) | dataset, Russian institutions, Russia's president, English language |
kremlin.ru_ru_2024 | Corpus based on Russia’s president website (in Russian, 1999-2023) | dataset, Russian institutions, Russian language |