Russian institutions
Russian language
Corpus based on the website of Russia’s president (in Russian, 1999-2023)

Giorgio Comai


February 24, 2024

Explore or download this dataset

Scope of this dataset

This textual dataset is based on kremlin.ru, the Russian-language version of the official website of the president of the Russian Federation. It includes only the main sections with news and updates; it does not include other sections of the website, such as legal documents or the Constitution.

This dataset includes content published between 31 December 1999 and 31 December 2023, under two Russian presidents: Vladimir Putin and Dmitri Medvedev.

Summary statistics

Dataset name: kremlin.ru_ru_2024

Dataset description: all news items published on the Russian-language version of kremlin.ru

Start date: 1999-12-31

End date: 2023-12-31

Total items: 45 538

Available columns: doc_id; text; title; date; time; datetime; announcement; location; description; sections; themes; themes_id; persons; persons_id; countries; countries_id; url_id; url

License: Creative Commons Attribution 4.0 International

Link for download: kremlin.ru_ru_2024

field         present  missing  missing_share
doc_id         45 538        0           0.0%
text           41 572    3 966           8.7%
title          45 538        0           0.0%
date           45 538        0           0.0%
time           44 925      613           1.3%
datetime       44 925      613           1.3%
announcement   45 538        0           0.0%
location       22 321   23 217          51.0%
description    17 462   28 076          61.7%
sections       45 538        0           0.0%
themes         19 652   25 886          56.8%
themes_id      19 652   25 886          56.8%
persons        11 040   34 498          75.8%
persons_id     11 050   34 488          75.7%
countries       6 409   39 129          85.9%
countries_id    6 410   39 128          85.9%
url_id         45 538        0           0.0%
url            45 538        0           0.0%
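
The present/missing counts above can be recomputed from the exported file. The following is a minimal sketch, assuming the dataset is available as a CSV file with the column names listed above and that missing values are stored as empty strings or `NA` (both are assumptions about the export format). The function name `missing_summary` and the sample rows are hypothetical.

```python
import csv
import io


def missing_summary(rows, fields, missing_markers=("", "NA")):
    """Count present and missing values per field.

    `rows` is an iterable of dicts (e.g. from csv.DictReader).
    Treating "" and "NA" as missing is an assumption about the export format.
    """
    rows = list(rows)
    total = len(rows)
    summary = []
    for field in fields:
        missing = sum(
            1 for row in rows if (row.get(field) or "") in missing_markers
        )
        share = missing / total if total else 0.0
        summary.append(
            {
                "field": field,
                "present": total - missing,
                "missing": missing,
                "missing_share": f"{share:.1%}",
            }
        )
    return summary


# Tiny made-up sample standing in for the real export.
sample = io.StringIO(
    "doc_id,title,location\n"
    "a,First,Moscow\n"
    "b,Second,\n"
    "c,Third,NA\n"
    "d,Fourth,Sochi\n"
)
result = missing_summary(csv.DictReader(sample), ["doc_id", "title", "location"])
for line in result:
    print(line)
```

With the real file, the same function could be called on `csv.DictReader(open(path, encoding="utf-8", newline=""))`, where `path` points to the downloaded export.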

Narrative explanation of how this textual dataset was built

The website publishes all of its news items in one or more of the following sections:

This dataset was generated by parsing each of these sections, much as one would by repeatedly clicking the “show more” link at the bottom of the relevant index pages until the oldest post is reached.
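
The index-page traversal described above can be sketched as a simple loop. This is not the scraper actually used to build the dataset: `fetch_page` is a hypothetical callable that returns the item links found on one index page together with a reference to the next (older) page, or `None` once the oldest post has been reached. The fake pages and paths below are made up and stand in for real HTTP requests.

```python
def collect_all_links(fetch_page, first_page):
    """Follow "show more" style pagination until no older page is left."""
    links = []
    seen_pages = set()
    page = first_page
    while page is not None and page not in seen_pages:
        seen_pages.add(page)  # guard against pagination loops
        page_links, next_page = fetch_page(page)
        links.extend(page_links)
        page = next_page
    return links


# Made-up index pages used instead of network calls.
FAKE_INDEX = {
    "page/1": (["/item/3", "/item/2"], "page/2"),
    "page/2": (["/item/1"], None),
}

all_links = collect_all_links(FAKE_INDEX.__getitem__, "page/1")
print(all_links)  # ['/item/3', '/item/2', '/item/1']
```

Passing the fetcher in as an argument keeps the loop testable without network access; a real fetcher would download and parse each index page.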

Some items are posted in more than one section under different urls; they nevertheless keep the same internal id, a series of up to 5 digits at the end of each url. For example, the article “Meeting with permanent members of the Security Council” was posted on 4 February 2011 at both of the following urls:


To prevent duplicated content, only one of these articles is preserved in the final dataset; for consistency, only the first match, according to the order in which sections are listed above, is kept. This makes it easy to see which posts are defined as “transcripts” and gives precedence to more specific sections (the generic “news” is used only if the given item was not posted in a preceding section). This choice should be substantively irrelevant for most use cases, as all sections are in any case recorded in a separate field.
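
The deduplication rule can be illustrated with a short sketch. It assumes that the internal id is the run of up to 5 digits at the very end of the url, as described above. The precedence list `SECTION_ORDER` is hypothetical and contains only the two sections named in this paragraph, and the urls use a placeholder domain rather than the real addresses.

```python
import re

# Hypothetical precedence list; the real order is the order in which
# sections are listed on the source website.
SECTION_ORDER = ["transcripts", "news"]

# Up to 5 digits at the end of the url, optionally followed by a slash.
URL_ID_PATTERN = re.compile(r"/(\d{1,5})/?$")


def extract_url_id(url):
    """Return the trailing numeric id of a url as a string, or None."""
    match = URL_ID_PATTERN.search(url)
    return match.group(1) if match else None


def deduplicate(items):
    """Keep one item per url_id, preferring earlier sections.

    Each item is a dict with "url" and "section"; every section in which
    the item appeared is recorded in a separate "sections" list.
    """
    ranked = sorted(items, key=lambda item: SECTION_ORDER.index(item["section"]))
    kept = {}
    for item in ranked:
        url_id = extract_url_id(item["url"])
        if url_id is None:
            continue  # no numeric id found; skipped in this sketch
        if url_id not in kept:
            kept[url_id] = {**item, "url_id": url_id, "sections": []}
        kept[url_id]["sections"].append(item["section"])
    return list(kept.values())


# Made-up urls on a placeholder domain.
items = [
    {"url": "https://example.org/news/10234", "section": "news"},
    {"url": "https://example.org/transcripts/10234", "section": "transcripts"},
    {"url": "https://example.org/news/10235", "section": "news"},
]
unique = deduplicate(items)
for item in unique:
    print(item["url"], item["sections"])
```

Because the sort is stable and ordered by section precedence, the first occurrence of each id is always the one from the most specific section, while the `sections` list still records every section in which the item appeared.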

License information

At the time the contents were retrieved, the footer of the website, as well as the dedicated copyright page, made clear that:

“all materials published on this website are available with the following license”: Creative Commons Attribution 4.0 International

This license gives the right to “copy and redistribute the material in any medium or format”, and to “remix, transform, and build upon the material for any purpose, even commercially”, as long as appropriate credit is given to the source and the license is included.

The contents of this dataset (“kremlin.ru_ru”) are distributed under the terms of this license. To the extent possible, the dataset itself is also distributed by its creator, Giorgio Comai, under the same CC-BY license, as well as under the Open Data Commons Attribution license (ODC-BY).

Dataset cleaning and reordering

The following steps are applied to the original dataset before exporting:

  • ensure all items have a date
  • ensure no post published after the cut-off date (2023-12-31) is included
  • ensure no two posts with the same numerical component of the url (url_id) are included
  • introduce a doc_id column (composed of the website base url, the language of the dataset, and the url_id) and set this as the first column of the dataset
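
The steps listed above can be sketched as a small cleaning function. This is an illustration under stated assumptions, not the actual export code: it assumes each item is a dict with a `date` (a `datetime.date` or `None`) and a `url_id`, that items failing a check are dropped rather than reported, that the url_id step means removing rows with a missing or repeated url_id, and that the doc_id joins its three components with underscores. The sample rows are made up.

```python
from datetime import date

CUTOFF = date(2023, 12, 31)
BASE_URL = "kremlin.ru"  # taken from the dataset name
LANGUAGE = "ru"


def clean(items):
    """Apply the listed checks and put doc_id first in each row."""
    cleaned = []
    seen_ids = set()
    for item in items:
        item_date = item.get("date")
        if item_date is None or item_date > CUTOFF:
            continue  # drop undated items and posts after the cut-off date
        url_id = item.get("url_id")
        if not url_id or url_id in seen_ids:
            continue  # assumption: missing or repeated url_id rows are dropped
        seen_ids.add(url_id)
        # Assumed doc_id format: base url, language and url_id joined by "_".
        doc_id = f"{BASE_URL}_{LANGUAGE}_{url_id}"
        cleaned.append({"doc_id": doc_id, **item})
    return cleaned


# Made-up rows covering each check.
rows = [
    {"url_id": "101", "date": date(2011, 2, 4), "title": "A"},
    {"url_id": "101", "date": date(2011, 2, 4), "title": "A duplicate"},
    {"url_id": "102", "date": None, "title": "Undated"},
    {"url_id": "103", "date": date(2024, 1, 5), "title": "Too recent"},
    {"url_id": "104", "date": date(2023, 12, 31), "title": "Last day"},
]
cleaned = clean(rows)
for row in cleaned:
    print(row["doc_id"], row["date"], row["title"])
```

Building each output row as `{"doc_id": ..., **item}` relies on Python dicts preserving insertion order, so `doc_id` ends up as the first column when the rows are written out.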