kremlin.ru_en

dataset
Russian institutions
English language
All items published on the English language version of the Kremlin’s website
Author

Giorgio Comai

Published

March 13, 2023

Work-in-progress

This is an early release of the dataset. Only limited quality checks have been conducted, so if you intend to use it, make sure it is fit for purpose.

A full release in a proper data repository with better documentation is forthcoming.

Check for duplicates

The dataset was originally generated by parsing all index pages for the following categories on the Kremlin’s website:

index_group
news
transcripts
administration
state-council
security-council
councils

In total, the corpus thus generated has 40 178 items. It appears, however, that from 2008 onwards, items included in more than one category are published under a separate url for each category. Such items share the same date, the same title, and exactly (or almost exactly) the same text, yet they are counted each time they appear because each copy has its own url. Here are a few examples:

| date | url | title | internal_id |
|---|---|---|---|
| 2008-07-05 | http://en.kremlin.ru/events/president/transcripts/683 | Dmitry Medvedev met with King of Jordan Abdullah II | 683 |
| 2008-07-05 | http://en.kremlin.ru/events/president/news/683 | Dmitry Medvedev met with King of Jordan Abdullah II | 683 |
| 2008-08-24 | http://en.kremlin.ru/events/president/transcripts/1185 | Dmitry Medvedev met with King of Jordan Abdullah II | 1185 |
| 2008-08-24 | http://en.kremlin.ru/events/president/news/1185 | Dmitry Medvedev met with King of Jordan Abdullah II | 1185 |
| 2008-08-28 | http://en.kremlin.ru/events/president/transcripts/1252 | Dmitry Medvedev met with President of Afghanistan Hamid Karzai. | 1252 |
| 2008-08-28 | http://en.kremlin.ru/events/president/news/1252 | Dmitry Medvedev met with President of Afghanistan Hamid Karzai. | 1252 |

Luckily, such duplicated items are easy to identify: they share the same internal id, i.e. the numeric part at the end of the url.
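As an illustration of how the internal id relates to the url, here is a minimal Python sketch. The function name `internal_id_from_url` is hypothetical and not part of the dataset's actual processing pipeline; it simply assumes the id is the last path segment and consists only of digits.

```python
from typing import Optional
from urllib.parse import urlparse


def internal_id_from_url(url: str) -> Optional[str]:
    """Return the trailing numeric path segment of a url, or None if there is none.

    Hypothetical helper: assumes the internal id is the last path segment
    and is made up of digits only.
    """
    path = urlparse(url).path.rstrip("/")
    last_segment = path.rsplit("/", 1)[-1]
    return last_segment if last_segment.isdigit() else None


# Two urls taken from the examples above: different categories, same id.
urls = [
    "http://en.kremlin.ru/events/president/transcripts/683",
    "http://en.kremlin.ru/events/president/news/683",
]
ids = [internal_id_from_url(u) for u in urls]
print(ids)  # ['683', '683']
```

Since the published dataset already includes an `internal_id` column, this step is only relevant when starting from raw urls.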

For analytical purposes, it makes sense to drop such duplicated items.
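The deduplication step can be sketched roughly as follows, using only the Python standard library. This is an illustration rather than the procedure actually used to build the dataset: the function name `drop_duplicates_by_internal_id` is hypothetical, and keeping the first occurrence of each id is an assumption, since it is not stated which copy is retained.

```python
def drop_duplicates_by_internal_id(items):
    """Keep only the first item seen for each internal_id, preserving order.

    Assumption: the first occurrence is the one to keep.
    """
    seen = set()
    unique = []
    for item in items:
        key = item["internal_id"]
        if key in seen:
            continue  # same id already kept under another category url
        seen.add(key)
        unique.append(item)
    return unique


# Rows from the examples above, reduced to the columns needed here.
base = "http://en.kremlin.ru/events/president"
items = [
    {"date": "2008-07-05", "url": f"{base}/transcripts/683", "internal_id": "683"},
    {"date": "2008-07-05", "url": f"{base}/news/683", "internal_id": "683"},
    {"date": "2008-08-24", "url": f"{base}/transcripts/1185", "internal_id": "1185"},
    {"date": "2008-08-24", "url": f"{base}/news/1185", "internal_id": "1185"},
    {"date": "2008-08-28", "url": f"{base}/transcripts/1252", "internal_id": "1252"},
    {"date": "2008-08-28", "url": f"{base}/news/1252", "internal_id": "1252"},
]

unique_items = drop_duplicates_by_internal_id(items)
print(len(items), len(unique_items))  # 6 3
```

With this ordering, the copy kept for each id is the one listed first (here, the transcripts url); a different policy, such as preferring a given category, would require sorting the items before the call.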

Once these duplicated items are removed, 32 775 items remain, i.e. 7 403 fewer than the initial 40 178. These are the items included in the published dataset and used in further analyses.

Summary statistics

Dataset name: kremlin.ru_en

Dataset description: all items published on the English language version of the Kremlin’s website

Start date: 1999-12-31

End date: 2023-09-12

Total items: 32 775

Available columns: id; url; title; date; time; datetime; location; description; keywords; text; tags; tags_links; internal_id

License: Creative Commons Attribution 4.0 International

Link for download: kremlin.ru_en