archive.premier.gov.ru_ru_2024

dataset
Russian institutions
Russian government
Russian language
Corpus based on the archived version of the website of Russia’s prime minister (in Russian, 2008-2012)
Author

Giorgio Comai

Published

August 26, 2024

Explore or download this dataset

Scope of this corpus

This corpus is based on all contents published in the “news” section of the website archive.premier.gov.ru as it was available online in early 2024.

Users should be aware that for broadly the same period (the time during which Vladimir Putin was prime minister), a separate website for the government was maintained; its archived version is still available online at archive.government.gov.ru.

Summary statistics

Dataset name: archive.premier.gov.ru_ru_2024

Dataset description: all news items published on archive.premier.gov.ru

Start date: 2008-05-07

End date: 2012-05-07

Total items: 3 323

Available columns: doc_id; text; title; date; datetime; section; internal_id; url

License: Creative Commons Attribution 3.0 Unported

Link for download: archive.premier.gov.ru_ru_2024

field        present  missing  missing_share
doc_id       3 323    0        0.0%
text         3 272    51       1.5%
title        3 323    0        0.0%
date         3 323    0        0.0%
datetime     3 323    0        0.0%
section      3 323    0        0.0%
internal_id  3 323    0        0.0%
url          3 323    0        0.0%
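The table above can be recomputed from the downloaded data. The following is a minimal sketch, assuming the records have already been loaded as a list of dictionaries keyed by column name; the loading step, the helper names, and the sample rows are illustrative and not part of the dataset tooling.

```python
# Columns as listed in the summary statistics.
COLUMNS = ["doc_id", "text", "title", "date", "datetime",
           "section", "internal_id", "url"]


def is_missing(value):
    """Assumption: None and empty or whitespace-only strings count as missing."""
    return value is None or str(value).strip() == ""


def missing_summary(records, columns=COLUMNS):
    """Return one (field, present, missing, missing_share) tuple per column."""
    total = len(records)
    rows = []
    for column in columns:
        missing = sum(1 for record in records if is_missing(record.get(column)))
        share = missing / total if total else 0.0
        rows.append((column, total - missing, missing, f"{share:.1%}"))
    return rows


# Made-up sample rows, not real corpus data.
sample = [
    {"doc_id": "a1", "title": "First item", "text": "Some text"},
    {"doc_id": "a2", "title": "Second item", "text": ""},
]

for row in missing_summary(sample, ["doc_id", "title", "text"]):
    print(*row)
```

Applied to the full dataset, the same calculation should give 51 missing values out of 3 323 for the text column, which rounds to the 1.5% reported above.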

Narrative explanation of how this corpus has been created

This corpus has been built from the index pages of the “news” section, retrieving links starting from the earliest publication.

Links to photo, video, and audio pages have been removed; only textual contents have been kept.
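The filtering step could look roughly like the sketch below. The path markers and the example.org URLs are hypothetical: the actual URL patterns used for media pages on the website are not documented here, so the markers would need to be adapted after inspecting the real links.

```python
from urllib.parse import urlparse

# Hypothetical markers; the real path fragments for media pages may differ.
MEDIA_MARKERS = ("photo", "video", "audio")


def is_media_link(url, markers=MEDIA_MARKERS):
    """Return True if the URL path contains one of the media markers."""
    path = urlparse(url).path.lower()
    return any(marker in path for marker in markers)


def keep_text_links(urls):
    """Drop links to media pages, preserving the original order."""
    return [url for url in urls if not is_media_link(url)]


# Illustrative links only, not taken from the website.
links = [
    "https://example.org/events/news/1234/",
    "https://example.org/events/news/1234/photo.html",
    "https://example.org/events/news/1234/video.html",
    "https://example.org/events/news/1235/",
]
text_links = keep_text_links(links)
print(text_links)
```

Matching on the path only, rather than on the full URL, avoids discarding links merely because the host name happens to contain one of the markers.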

Text and metadata have been extracted from the resulting pages.

Duplicates

Some items have been posted on the same date, with the same title and the same text, under different URLs (but with the same numeric component in the URL, here recorded as internal_id). In such cases, duplicates have been removed.
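A minimal sketch of this deduplication logic follows. It assumes that the first occurrence is the one kept (the source does not say which copy was retained) and that internal_id is the first run of digits in the URL path; the function names and sample records are illustrative.

```python
import re
from urllib.parse import urlparse


def extract_internal_id(url):
    """Assumption: internal_id is the first run of digits in the URL path."""
    match = re.search(r"\d+", urlparse(url).path)
    return match.group(0) if match else None


def drop_duplicates(records):
    """Keep the first record for each (date, title, text, internal_id) key."""
    seen = set()
    unique = []
    for record in records:
        key = (record["date"], record["title"], record["text"],
               record["internal_id"])
        if key in seen:
            continue  # same item published under a different URL
        seen.add(key)
        unique.append(record)
    return unique


# Made-up records with illustrative URLs, not real corpus data.
urls = [
    "https://example.org/events/news/1234/",
    "https://example.org/events/news/1234/print/",
    "https://example.org/events/news/1235/",
]
records = [
    {"date": "2008-05-08", "title": "A", "text": "Same text", "url": urls[0]},
    {"date": "2008-05-08", "title": "A", "text": "Same text", "url": urls[1]},
    {"date": "2008-05-09", "title": "B", "text": "Other text", "url": urls[2]},
]
for record in records:
    record["internal_id"] = extract_internal_id(record["url"])

unique = drop_duplicates(records)
print(len(records), len(unique))
```

The URL is deliberately left out of the comparison key, since the duplicates described above differ only in their URL.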

Items with title but no text

There are 51 items with a title but no text. These are kept in the dataset, as the title may still offer relevant contents.

License information

At the time the contents were retrieved, the footer of the website made clear that all available contents are published under a Creative Commons Attribution 3.0 license:

Creative Commons Attribution 3.0 Непортированная (Unported)

The contents of this dataset, “archive.premier.gov.ru_ru”, are distributed under the terms of this license. To the extent possible, the dataset itself is also distributed by its creator, Giorgio Comai, under the same CC-BY license, as well as under the Open Data Commons Attribution license (ODC-BY).