government.ru_ru_2024

dataset
Russian institutions
Russian government
Russian language
Corpus based on the Russian government's website (in Russian, 2013-2023)
Author

Giorgio Comai

Published

August 22, 2024

Explore or download this dataset

Scope of this corpus

This corpus is based on all contents published in the “news” section of the website government.ru as it was available online in early 2024.

Summary statistics

Dataset name: government.ru_ru_2024

Dataset description: all news items published on government.ru

Start date: 2012-04-24

End date: 2023-12-30

Total items: 17 135

Available columns: doc_id; text; title; date; internal_id; time; place; tags; url

License: Creative Commons Attribution 3.0 International

Link for download: government.ru_ru_2024

field         present   missing   missing_share
doc_id         17 135         0            0.0%
text           17 107        28            0.2%
title          17 135         0            0.0%
date           17 135         0            0.0%
internal_id    17 135         0            0.0%
time           17 091        44            0.3%
place           6 299    10 836           63.2%
tags           16 439       696            4.1%
url            17 135         0            0.0%
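
The figures in this table can be recomputed from any list of records. The following is a minimal sketch, not the code used to build the corpus: the function name `missing_summary` and the toy records are hypothetical, and it assumes a value counts as missing when it is absent, `None`, or an empty string.

```python
def missing_summary(records, fields):
    """Return {field: (present, missing, missing_share_percent)}."""
    total = len(records)
    summary = {}
    for field in fields:
        # Assumption: absent keys, None and blank strings all count as missing.
        missing = sum(
            1
            for record in records
            if record.get(field) is None or str(record.get(field)).strip() == ""
        )
        share = round(100 * missing / total, 1) if total else 0.0
        summary[field] = (total - missing, missing, share)
    return summary


# Toy records for illustration only, not real corpus rows.
toy_records = [
    {"title": "A", "place": "Moscow"},
    {"title": "B", "place": ""},
    {"title": "C", "place": None},
    {"title": "D"},
]
result = missing_summary(toy_records, ["title", "place"])
```

With the numbers reported above, the same arithmetic gives 100 × 28 / 17 135 ≈ 0.2% for the text field and 100 × 10 836 / 17 135 ≈ 63.2% for place.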

Narrative explanation of how this corpus has been created

This corpus has been built from the index pages of the “news” section of the website, retrieving older posts in the same way they would be auto-loaded when scrolling down the news index page.

Text and metadata have been extracted from the resulting pages, relying on the well-structured format of the news pages, which present each field in a dedicated HTML element:

  • the title is always included in a <h3> element of class reader_article_headline
  • the date is always included in a <span> element of class reader_article_dateline__date
  • the time is always included in a <span> element of class reader_article_dateline__time
  • the place is always included in a <span> element of class entry__meta__date__place
  • each tag is always included in a <li> element of class reader_article_tags_item
  • the main text is always included in a <div> element of class reader_article_body
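
The element-to-field mapping listed above can be sketched with Python's standard-library HTML parser. This is an illustrative sketch under stated assumptions, not the extraction code actually used for this corpus: the names `NewsItemParser` and `extract_fields` and the sample HTML are hypothetical, only the class names come from the list above, and whitespace (including paragraph breaks) is collapsed for simplicity.

```python
from html.parser import HTMLParser

# Class names as listed above; the field names follow the dataset columns.
FIELD_CLASSES = {
    "reader_article_headline": "title",
    "reader_article_dateline__date": "date",
    "reader_article_dateline__time": "time",
    "entry__meta__date__place": "place",
    "reader_article_tags_item": "tags",
    "reader_article_body": "text",
}


class NewsItemParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in FIELD_CLASSES.values()}
        self._stack = []  # open elements as (tag, field name or None)

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next((FIELD_CLASSES[c] for c in classes if c in FIELD_CLASSES), None)
        self._stack.append((tag, field))
        if field:
            self.fields[field].append("")  # start a new text chunk

    def handle_endtag(self, tag):
        if tag == "p":
            self.handle_data("\n")  # keep paragraphs from running together
        open_tags = [t for t, _ in self._stack]
        if tag in open_tags:
            # Drop this element and anything left unclosed inside it (e.g. <br>).
            index = len(open_tags) - 1 - open_tags[::-1].index(tag)
            del self._stack[index:]

    def handle_data(self, data):
        # Attribute text to the innermost enclosing tracked element, if any.
        for _, field in reversed(self._stack):
            if field:
                self.fields[field][-1] += data
                break


def extract_fields(html):
    parser = NewsItemParser()
    parser.feed(html)
    parser.close()
    item = {}
    for name, chunks in parser.fields.items():
        cleaned = [" ".join(chunk.split()) for chunk in chunks]
        cleaned = [chunk for chunk in cleaned if chunk]
        if name == "tags":
            item[name] = cleaned  # one entry per <li>
        else:
            item[name] = cleaned[0] if cleaned else None
    return item


# Hypothetical page fragment, written only to exercise the parser.
sample = """
<h3 class="reader_article_headline">Sample title</h3>
<span class="reader_article_dateline__date">24 April 2012</span>
<span class="reader_article_dateline__time">10:00</span>
<ul><li class="reader_article_tags_item">Economy</li>
<li class="reader_article_tags_item">Budget</li></ul>
<div class="reader_article_body"><p>First paragraph.</p><p>Second.</p></div>
"""
item = extract_fields(sample)
```

In this sketch a field whose element is absent from the page, such as the place in the sample, is returned as `None`, which is how missing values would end up counted in the summary statistics.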

Data cleaning

Among all news items extracted, only two do not have a date (one and two), seemingly because they effectively link to other contents. They have been removed from the dataset.

Besides, two items dated many years earlier than all the rest appear on the website among other news (one and two). They have also been removed for clarity.

There are 28 items with date, title, and tags, but an empty text field. They mostly refer to meetings such as this one; the titles have a format similar to “Medvedev met the governor of X”, with no additional content shared. They are kept in the dataset, as title and tags may still contain useful information.
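
The cleaning rules described in this section can be summarised in a few lines. This is only a sketch of the logic, not the procedure actually followed: the function name `clean_items`, the record layout, and the use of the dataset's start date (2012-04-24) as a cutoff for the much older items are all assumptions made for illustration.

```python
from datetime import date


def clean_items(items, earliest=date(2012, 4, 24)):
    """Drop undated and far-older items; keep items with empty text."""
    kept = []
    for item in items:
        item_date = item.get("date")
        if item_date is None:
            continue  # undated items are removed
        if item_date < earliest:
            continue  # assumed cutoff for items dated years before the rest
        kept.append(item)  # empty text is kept: title and tags may be useful
    return kept


# Toy records for illustration only, not real corpus rows.
toy_items = [
    {"doc_id": "a", "date": date(2015, 3, 1), "text": "Some text"},
    {"doc_id": "b", "date": None, "text": "Links elsewhere"},
    {"doc_id": "c", "date": date(2001, 1, 1), "text": "Much older item"},
    {"doc_id": "d", "date": date(2016, 5, 5), "text": ""},
]
kept_ids = [item["doc_id"] for item in clean_items(toy_items)]
```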

License information

At the time the contents were retrieved, the footer of the website made clear that all contents available are published with a Creative Commons Attribution 3.0 license:

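Все материалы сайта доступны по лицензии: Creative Common Attribution 4.0 (in English: “All materials on the website are available under the license: Creative Common Attribution 4.0”)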

The contents of this dataset - “government.ru_ru” - are distributed within the terms of this license. To the extent that it is possible, the dataset itself is also distributed by its creator, Giorgio Comai, under the same CC-BY license, as well as under the Open Data Commons Attribution license (ODC-BY).