government.ru_ru_2024

dataset

Russian institutions

Russian government

Russian language

Corpus based on the Russian government’s website (in Russian, 2013-2023)

Author

Giorgio Comai

Published

August 22, 2024

Explore or download this dataset

Explore in an interactive web interface

Links for download: compressed csv / ods

Scope of this corpus

This corpus is based on all contents published in the “news” section of the website government.ru as it was available online in early 2024.

Summary statistics

Dataset name: government.ru_ru_2024

Dataset description: all news items published on government.ru

Start date: 2012-04-24

End date: 2023-12-30

Total items: 17 135

Available columns: doc_id; text; title; date; internal_id; time; place; tags; url

License: Creative Commons Attribution 3.0 International

Link for download: government.ru_ru_2024

field	present	missing	missing_share
doc_id	17 135	0	0.0%
text	17 107	28	0.2%
title	17 135	0	0.0%
date	17 135	0	0.0%
internal_id	17 135	0	0.0%
time	17 091	44	0.3%
place	6 299	10 836	63.2%
tags	16 439	696	4.1%
url	17 135	0	0.0%
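The completeness table above can be recomputed from the downloaded CSV. The following is a minimal sketch, assuming that a field counts as missing when it is empty or contains only whitespace (the exact rule used for the published table is not stated). The function name `missing_summary` and the sample rows are made up for illustration.

```python
import csv
import io


def missing_summary(rows, fields):
    """Return {field: (present, missing, missing_share)} for a list of dict rows.

    Assumption: a value is "missing" when it is absent, empty, or whitespace only.
    """
    total = len(rows)
    summary = {}
    for field in fields:
        missing = sum(1 for row in rows if not (row.get(field) or "").strip())
        share = missing / total if total else 0.0
        summary[field] = (total - missing, missing, share)
    return summary


# Tiny made-up sample standing in for the real CSV export.
sample_csv = io.StringIO(
    "doc_id,title,place\n"
    "1,First item,Moscow\n"
    "2,Second item,\n"
    "3,Third item,\n"
    "4,Fourth item,Sochi\n"
)
rows = list(csv.DictReader(sample_csv))
summary = missing_summary(rows, ["doc_id", "title", "place"])

# Print in the same layout as the table: field, present, missing, missing_share.
for field, (present, missing, share) in summary.items():
    print(f"{field}\t{present}\t{missing}\t{share:.1%}")
```

To run it on the real data, replace `sample_csv` with an open file handle on the decompressed CSV and pass the nine column names listed under "Available columns".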

Narrative explanation of how this corpus has been created

This corpus has been built from the index pages of the “news” section of the website, retrieving older posts in the same way they are auto-loaded when scrolling the news index page.

Text and metadata have been extracted from the resulting pages, relying on the well-structured format of the news pages, which present each field in a dedicated HTML element:

the title is always included in a <h3> element of class reader_article_headline
the date is always included in a <span> element of class reader_article_dateline__date
the time is always included in a <span> element of class reader_article_dateline__time
the place is always included in a <span> element of class entry__meta__date__place
the tags are always included in <li> elements of class reader_article_tags_item
the main text is always included in a <div> element of class reader_article_body
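The extraction rules listed above can be illustrated with a short sketch. It is not the tool used to build this corpus: it is a minimal example using only Python's standard-library `html.parser`, and it assumes well-formed HTML with every non-void tag closed. The class names come from the list above; the names `FIELD_CLASSES`, `NewsItemParser` and `extract_item`, as well as the sample HTML, are hypothetical.

```python
from html.parser import HTMLParser

# CSS class -> output field. Class names are those listed above.
FIELD_CLASSES = {
    "reader_article_headline": "title",
    "reader_article_dateline__date": "date",
    "reader_article_dateline__time": "time",
    "entry__meta__date__place": "place",
    "reader_article_tags_item": "tags",
    "reader_article_body": "text",
}
# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NewsItemParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in FIELD_CLASSES.values()}
        self._stack = []  # one entry per open tag: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((FIELD_CLASSES[c] for c in classes if c in FIELD_CLASSES), None)
        if field is not None:
            self.fields[field].append("")  # start a new chunk for this element
        elif self._stack:
            field = self._stack[-1]  # nested tag inherits the enclosing field
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        field = self._stack.pop()
        if field is not None and tag == "p":
            self.fields[field][-1] += "\n"  # keep paragraphs apart

    def handle_data(self, data):
        field = self._stack[-1] if self._stack else None
        if field is not None:
            self.fields[field][-1] += data


def extract_item(html):
    """Return a dict of fields; tags as a list, absent fields as None."""
    parser = NewsItemParser()
    parser.feed(html)
    parser.close()
    item = {}
    for field, chunks in parser.fields.items():
        cleaned = [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
        if field == "tags":
            item[field] = cleaned
        else:
            item[field] = " ".join(cleaned) if cleaned else None
    return item


# Made-up page fragment, not copied from the website.
sample = """
<h3 class="reader_article_headline">Example headline</h3>
<span class="reader_article_dateline__date">30 December 2023</span>
<span class="reader_article_dateline__time">12:00</span>
<ul><li class="reader_article_tags_item">Economy</li>
<li class="reader_article_tags_item">Regions</li></ul>
<div class="reader_article_body"><p>First paragraph.</p><p>Second <b>one</b>.</p></div>
"""
item = extract_item(sample)
print(item)
```

In this sample no element carries the place class, so `place` comes back as `None`, which mirrors the large share of missing values for that column.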

Data cleaning

Among all news items extracted, only two do not have a date (one and two), seemingly because they effectively link to other contents. They have been removed from the dataset.

In addition, two items dated many years earlier than all the rest appear on the website among other news (one and two). They have also been removed for clarity.

There are 28 items with date, title, and tags, but an empty text field. They mostly refer to meetings such as this one; the titles have a format similar to “Medvedev met the governor of X”, with no additional content shared. They are retained in the dataset, as title and tags may still contain useful information.
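The cleaning rules described in this section can be summarised in a short sketch. It assumes each item is a dictionary with a `date` value that is either a `datetime.date` or `None`. The function name `clean_items` and the `earliest_plausible` cut-off are hypothetical: the text above does not state how the two early-dated items were identified, so the threshold here only illustrates the idea.

```python
from datetime import date


def clean_items(items, earliest_plausible):
    """Drop undated items and isolated early outliers; keep empty-text items."""
    kept = []
    for item in items:
        if item.get("date") is None:
            continue  # undated entries that effectively link to other contents
        if item["date"] < earliest_plausible:
            continue  # items dated long before all the rest
        kept.append(item)  # items with empty text are kept on purpose
    return kept


# Made-up records for illustration only.
items = [
    {"doc_id": "a", "date": date(2015, 3, 1), "text": "Some text"},
    {"doc_id": "b", "date": None, "text": ""},
    {"doc_id": "c", "date": date(2003, 1, 1), "text": "Old item"},
    {"doc_id": "d", "date": date(2016, 5, 2), "text": ""},
]
cleaned = clean_items(items, earliest_plausible=date(2012, 1, 1))
print([item["doc_id"] for item in cleaned])
```

Item `d` has an empty text field but is kept, in line with the decision to retain such items for their title and tags.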

License information

At the time the contents were retrieved, the footer of the website made clear that all contents available are published with a Creative Commons Attribution 3.0 license:

“All materials on this website are available under the license: Creative Commons Attribution 4.0”

The contents of this dataset - “government.ru_ru” - are distributed within the terms of this license. To the extent that it is possible, the dataset itself is also distributed by its creator, Giorgio Comai, with the same CC-BY license, as well as under the Open Data Commons Attribution license (ODC-BY).