kremlin.ru_en
This is an early release of the dataset. Only limited quality checks have been conducted, so if you intend to use it, make sure it is fit for your purpose.
A full release in a proper data repository with better documentation is forthcoming.
Check for duplicates
The dataset was originally generated by parsing all index pages for the following categories available on the Kremlin’s website:
index_group |
---|
news |
transcripts |
administration |
state-council |
security-council |
councils |
In total, the corpus thus generated has 40 178 items. It appears, however, that starting in 2008, items included in more than one category are published under a separate url for each category. Such items have the same date, the same title, and exactly (or almost exactly) the same text, yet they are counted each time they appear because each appearance has its own url. Here are a few examples:
date | url | title | internal_id |
---|---|---|---|
2008-07-05 | http://en.kremlin.ru/events/president/transcripts/683 | Dmitry Medvedev met with King of Jordan Abdullah II | 683 |
2008-07-05 | http://en.kremlin.ru/events/president/news/683 | Dmitry Medvedev met with King of Jordan Abdullah II | 683 |
2008-08-24 | http://en.kremlin.ru/events/president/transcripts/1185 | Dmitry Medvedev met with King of Jordan Abdullah II | 1185 |
2008-08-24 | http://en.kremlin.ru/events/president/news/1185 | Dmitry Medvedev met with King of Jordan Abdullah II | 1185 |
2008-08-28 | http://en.kremlin.ru/events/president/transcripts/1252 | Dmitry Medvedev met with President of Afghanistan Hamid Karzai. | 1252 |
2008-08-28 | http://en.kremlin.ru/events/president/news/1252 | Dmitry Medvedev met with President of Afghanistan Hamid Karzai. | 1252 |
Fortunately, such duplicated articles are easy to spot: they share the same internal id, which is the numeric part at the end of the url.
For analytical purposes, it makes sense to drop such duplicated items.
Once these duplicated items are removed, the total number of items is effectively 32 775. These are the ones that are included in the published dataset and used in further analyses.
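The deduplication described above can be sketched in a few lines of Python. This is only an illustration and not necessarily how the published dataset was produced: the function names are hypothetical, the regular expression assumes the internal id is the trailing run of digits in the url, and keeping the first occurrence of each id is an assumption, since the text does not say which copy is retained.

```python
import re


def internal_id_from_url(url):
    """Return the trailing numeric part of a url, or None if there is none."""
    match = re.search(r"/(\d+)/?$", url)
    return match.group(1) if match else None


def drop_duplicates_by_internal_id(items):
    """Keep one item per internal id.

    Assumption: the first occurrence is the one kept.
    """
    seen = set()
    unique = []
    for item in items:
        key = internal_id_from_url(item["url"])
        if key is None:
            # No recognisable id in the url: keep the item rather than guess.
            unique.append(item)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(item)
    return unique


# A few rows taken from the examples table above.
items = [
    {"date": "2008-07-05", "url": "http://en.kremlin.ru/events/president/transcripts/683"},
    {"date": "2008-07-05", "url": "http://en.kremlin.ru/events/president/news/683"},
    {"date": "2008-08-28", "url": "http://en.kremlin.ru/events/president/transcripts/1252"},
    {"date": "2008-08-28", "url": "http://en.kremlin.ru/events/president/news/1252"},
]

unique_items = drop_duplicates_by_internal_id(items)
print(len(unique_items))
```

Matching on the id alone treats two urls as the same item even when their dates or titles differ, so a stricter variant might also compare the date before dropping a row.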
Summary statistics
Dataset name: kremlin.ru_en
Dataset description: all items published on the English-language version of the Kremlin’s website
Start date: 1999-12-31
End date: 2023-09-12
Total items: 32 775
Available columns: id; url; title; date; time; datetime; location; description; keywords; text; tags; tags_links; internal_id
License: Creative Commons Attribution 4.0 International
Link for download: kremlin.ru_en