kremlin.ru_en_2024

corpus
full corpus
Russian institutions
Russia's president
English language
Corpus based on the website of Russia’s president (in English, 1999-2023)
Author

Giorgio Comai

Published

May 3, 2024

Explore or download this dataset

Scope of this corpus

This textual dataset is based on en.kremlin.ru, i.e. the English-language version of the official website of the president of the Russian Federation. It includes only its main sections with news and updates; it does not include other sections of the website such as legal documents, the Constitution, etc.

This dataset includes contents published between 31 December 1999 and 31 December 2023, under two Russian presidents: Vladimir Putin and Dmitri Medvedev.

Summary statistics

Dataset name: kremlin.ru_en_2024

Dataset description: all news items published on the English-language version of Kremlin.ru

Start date: 1999-12-31

End date: 2023-12-31

Total items: 33 165

Available columns: doc_id; text; title; date; time; datetime; location; description; keywords; tags; tags_links; persons_id; persons_name; url_id; url

License: Creative Commons Attribution 4.0 International

Link for download: kremlin.ru_en_2024

field           present   missing   missing_share
doc_id           33 165         0            0.0%
text             33 165         0            0.0%
title            33 165         0            0.0%
date             33 165         0            0.0%
time             32 552       613            1.8%
datetime         32 552       613            1.8%
location         17 635    15 530           46.8%
description      16 734    16 431           49.5%
keywords         33 165         0            0.0%
tags             16 708    16 457           49.6%
tags_links       16 708    16 457           49.6%
persons_id        9 253    23 912           72.1%
persons_name      6 354    26 811           80.8%
url_id           33 165         0            0.0%
url              33 165         0            0.0%
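
The figures in this table can be recomputed from the dataset itself. The following is a minimal sketch in Python, assuming the rows have already been loaded as a list of dictionaries keyed by column name. The function name `missing_summary` and the sample rows are made up for illustration, and treating `None` and empty strings as missing is an assumption, since the encoding of missing values in the exported files is not described here.

```python
def missing_summary(rows, fields):
    """Return one (field, present, missing, missing_share) tuple per field."""
    total = len(rows)
    summary = []
    for field in fields:
        # Assumption: None and empty or whitespace-only strings count as missing.
        present = sum(
            1
            for row in rows
            if row.get(field) is not None and str(row.get(field)).strip() != ""
        )
        missing = total - present
        share = missing / total if total else 0.0
        summary.append((field, present, missing, share))
    return summary


# Tiny made-up sample, not taken from the real dataset.
sample_rows = [
    {"doc_id": "a", "location": "Moscow"},
    {"doc_id": "b", "location": ""},
    {"doc_id": "c", "location": None},
    {"doc_id": "d", "location": "Sochi"},
]

for field, present, missing, share in missing_summary(sample_rows, ["doc_id", "location"]):
    print(f"{field}: present={present} missing={missing} missing_share={share:.1%}")
```

The share is simply missing divided by total items: for the time column above, 613 out of 33 165 items is about 1.8%.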

Narrative explanation of how this textual dataset was built

Kremlin.ru publishes all of its news items in one or more of the following sections:

This dataset has been generated by parsing each of these sections, much as one would by repeatedly clicking the “show more” link at the bottom of the relevant index pages until the oldest post has been reached.
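
The "keep clicking show more" logic can be sketched as a simple loop. This is an illustration in Python, not the code actually used to build the dataset: the function `collect_index_items`, the `fetch_page` callback and the fake page identifiers are all hypothetical, and real HTTP requests and HTML parsing are deliberately left out.

```python
def collect_index_items(first_page_url, fetch_page):
    """Follow "show more" links from an index page until none is left.

    fetch_page is a caller-supplied (hypothetical) function that takes a URL
    and returns a pair: the list of item URLs found on that page, and the URL
    of the next page, or None once the oldest post has been reached.
    """
    item_urls = []
    seen_pages = set()
    page_url = first_page_url
    while page_url is not None and page_url not in seen_pages:
        seen_pages.add(page_url)  # guard against accidental loops
        page_items, next_page_url = fetch_page(page_url)
        item_urls.extend(page_items)
        page_url = next_page_url
    return item_urls


# Stand-in for real requests: three fake index pages chained together.
FAKE_SITE = {
    "index/1": (["item/30", "item/29"], "index/2"),
    "index/2": (["item/28"], "index/3"),
    "index/3": (["item/27"], None),
}

print(collect_index_items("index/1", FAKE_SITE.__getitem__))
```

Passing the page fetcher in as an argument keeps the pagination loop separate from network access, so it can be tried out on fake pages as above.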

Some items are posted in more than one section under different urls; they nevertheless keep the same internal id, a series of up to 5 digits at the end of each url. For example, the article “Meeting with permanent members of the Security Council” was posted on 4 February 2011 at both of the following urls:

  • [http://en.kremlin.ru/events/president/news/10235]
  • [http://en.kremlin.ru/events/security-council/10235]

To prevent duplicate content, only one of these articles is preserved in the final dataset; for consistency, only the first match, following the order in which sections are listed above, is kept. This makes it easy to see which posts are defined as “transcripts” and gives precedence to more specific sections (the generic “news” is used only if the given item was not posted in an earlier section). This choice should be substantively irrelevant for most use cases, as all sections are in any case included in a separate field.
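
A minimal sketch of this de-duplication in Python, using the two urls given above. The names `SECTION_ORDER`, `url_id`, `section_rank` and `deduplicate` are made up for this example, and the precedence order shown is an assumption for illustration only: the actual order is the one in which the sections are listed in this document.

```python
import re

# Hypothetical precedence order, most specific section first (an assumption).
SECTION_ORDER = ["events/security-council", "events/president/news"]


def url_id(url):
    """Return the run of digits at the end of a url, or None if there is none."""
    match = re.search(r"/(\d+)/?$", url)
    return match.group(1) if match else None


def section_rank(url):
    """Position of the url's section in SECTION_ORDER (unknown sections go last)."""
    for rank, section in enumerate(SECTION_ORDER):
        if f"/{section}/" in url:
            return rank
    return len(SECTION_ORDER)


def deduplicate(urls):
    """Keep one url per internal id: the one from the highest-ranked section."""
    kept = {}
    for url in sorted(urls, key=section_rank):  # sorted() is stable
        item_id = url_id(url)
        if item_id is not None and item_id not in kept:
            kept[item_id] = url
    return kept


urls = [
    "http://en.kremlin.ru/events/president/news/10235",
    "http://en.kremlin.ru/events/security-council/10235",
]
print(deduplicate(urls))
```

With the assumed order, the more specific section is the one retained for id 10235; a different section order would retain the other url.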

License information

At the time the contents were retrieved, both the footer of kremlin.ru and the dedicated copyright page made clear that:

“all materials published on this website are available with the following license”: Creative Commons Attribution 4.0 International

This license gives the right to “copy and redistribute the material in any medium or format”, and to “remix, transform, and build upon the material for any purpose, even commercially”, as long as appropriate credit is given to the source and the license is included.

The contents of this dataset - “kremlin.ru_en” - are distributed within the terms of this license. To the extent that it is possible, the dataset itself is also distributed by its creator, Giorgio Comai, under the same CC-BY license, as well as under the Open Data Commons Attribution license (ODC-BY).

Dataset cleaning and reordering

The following steps are applied to the original dataset before exporting:

  • ensure all items have a date
  • ensure no post following the cut-off date (2023-12-31) is included
  • ensure no two posts with the same numerical component of the url (url_id) are included
  • introduce a doc_id column (composed of the website base url, the language of the dataset, and the url_id) and set this as the first column of the dataset
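
The steps above can be sketched roughly as follows in Python. This is an illustration under stated assumptions rather than the actual export code: the function `clean` and the sample items are hypothetical, the checks are implemented as filters (they could equally be assertions), the url_id step is read here as requiring a non-empty, unique url_id, and the underscore separator in doc_id is a guess, as only its three components are documented.

```python
from datetime import date

CUTOFF = date(2023, 12, 31)  # cut-off date stated in the steps above


def clean(items, base_url="kremlin.ru", language="en"):
    """Apply the cleaning steps to a list of dicts and add doc_id first."""
    cleaned = []
    seen_ids = set()
    for item in items:
        item_date = item.get("date")
        # Skip items without a date or published after the cut-off date.
        if item_date is None or item_date > CUTOFF:
            continue
        # Assumed reading: url_id must be present and must not repeat.
        item_id = item.get("url_id")
        if not item_id or item_id in seen_ids:
            continue
        seen_ids.add(item_id)
        # Assumed doc_id format: base url, language and url_id joined by "_".
        row = {"doc_id": f"{base_url}_{language}_{item_id}"}
        row.update(item)  # dicts keep insertion order, so doc_id stays first
        cleaned.append(row)
    return cleaned


# Sample items: only the first title and date come from this document.
items = [
    {"url_id": "10235", "date": date(2011, 2, 4),
     "title": "Meeting with permanent members of the Security Council"},
    {"url_id": "10235", "date": date(2011, 2, 4), "title": "made-up duplicate"},
    {"url_id": "99999", "date": date(2024, 1, 5), "title": "made-up late post"},
    {"url_id": "11111", "date": None, "title": "made-up undated post"},
]

for row in clean(items):
    print(row["doc_id"], row["date"], row["title"])
```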