government.ru_ru_2024
Explore in an interactive web interface
Links for download: compressed csv / ods
Scope of this corpus
This corpus is based on all contents published in the “news” section of the website government.ru as it was available online in early 2024.
Summary statistics
Dataset name: government.ru_ru_2024
Dataset description: all news items published on government.ru
Start date: 2012-04-24
End date: 2023-12-30
Total items: 17 135
Available columns: doc_id; text; title; date; internal_id; time; place; tags; url
License: Creative Commons Attribution 3.0 International
Link for download: government.ru_ru_2024
field | present | missing | missing_share |
---|---|---|---|
doc_id | 17 135 | 0 | 0.0% |
text | 17 107 | 28 | 0.2% |
title | 17 135 | 0 | 0.0% |
date | 17 135 | 0 | 0.0% |
internal_id | 17 135 | 0 | 0.0% |
time | 17 091 | 44 | 0.3% |
place | 6 299 | 10 836 | 63.2% |
tags | 16 439 | 696 | 4.1% |
url | 17 135 | 0 | 0.0% |
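The present/missing counts in the table above can be recomputed from the downloaded file. The following is a minimal sketch using only the Python standard library; the function name `missing_summary` and the tiny inline sample are illustrative assumptions, not part of the published dataset tooling, and a value is treated as missing when it is absent or blank.

```python
import csv
import io


def missing_summary(rows, fields):
    """Count present and missing (blank) values for each field."""
    total = len(rows)
    summary = []
    for field in fields:
        # A value counts as missing when it is absent, None, or only whitespace.
        present = sum(1 for row in rows if (row.get(field) or "").strip())
        missing = total - present
        share = (missing / total * 100) if total else 0.0
        summary.append((field, present, missing, f"{share:.1f}%"))
    return summary


# Made-up sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "doc_id,title,place\n"
    "1,First item,Moscow\n"
    "2,Second item,\n"
    "3,Third item,\n"
)
rows = list(csv.DictReader(sample_csv))
for line in missing_summary(rows, ["doc_id", "title", "place"]):
    print(line)
```

To apply it to the actual corpus, the `io.StringIO` sample would be replaced by the opened CSV file, assuming the column names listed under "Available columns".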
Narrative explanation of how this corpus has been created
This corpus was built from the index pages of the “news” section of the website, retrieving older posts in the same way they are auto-loaded when a visitor scrolls down the news index page.
Text and metadata were extracted from the resulting pages, relying on the well-structured format of the news pages, which present each field in a dedicated HTML element:
- the title is always included in an `<h3>` element of class `reader_article_headline`
- the date is always included in a `<span>` element of class `reader_article_dateline__date`
- the time is always included in a `<span>` element of class `reader_article_dateline__time`
- the place is always included in a `<span>` element of class `entry__meta__date__place`
- each tag is always included in a `<li>` element of class `reader_article_tags_item`
- the main text is always included in a `<div>` element of class `reader_article_body`
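The class-based extraction described above can be sketched with Python's standard `html.parser` module. This is an illustration rather than the code actually used to build the corpus: the class names come from the list above, while the `NewsItemParser` and `extract_item` names, the mapping of classes to column names, and the sample HTML (including its date format) are assumptions.

```python
from html.parser import HTMLParser

# Class names as listed above; the mapping to column names is an assumption.
CLASS_TO_FIELD = {
    "reader_article_headline": "title",
    "reader_article_dateline__date": "date",
    "reader_article_dateline__time": "time",
    "entry__meta__date__place": "place",
    "reader_article_tags_item": "tags",
    "reader_article_body": "text",
}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NewsItemParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in CLASS_TO_FIELD.values()}
        self._stack = []  # open tags as (tag, field or None)

    def _current(self):
        # Innermost enclosing element that maps to a field, if any.
        for _, field in reversed(self._stack):
            if field:
                return field
        return None

    def handle_starttag(self, tag, attrs):
        enclosing = self._current()
        if enclosing and tag in ("p", "br"):
            self.fields[enclosing][-1] += " "  # keep paragraphs apart
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        classes = (dict(attrs).get("class") or "").split()
        field = next((CLASS_TO_FIELD[c] for c in classes if c in CLASS_TO_FIELD), None)
        if field:
            self.fields[field].append("")  # start collecting a new value
        self._stack.append((tag, field))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self._stack):
            return  # stray closing tag, ignore it
        while self._stack.pop()[0] != tag:
            pass  # also drop unclosed tags nested inside

    def handle_data(self, data):
        field = self._current()
        if field:
            self.fields[field][-1] += data


def extract_item(html):
    parser = NewsItemParser()
    parser.feed(html)
    parser.close()
    item = {}
    for field, values in parser.fields.items():
        cleaned = [" ".join(v.split()) for v in values if v.strip()]
        # Tags may repeat; the other fields are expected once per page.
        item[field] = cleaned if field == "tags" else (cleaned[0] if cleaned else None)
    return item


# Made-up page fragment; real pages contain much more markup.
sample = """
<h3 class="reader_article_headline">Sample headline</h3>
<span class="reader_article_dateline__date">30 December 2023</span>
<span class="reader_article_dateline__time">12:00</span>
<ul><li class="reader_article_tags_item">Economy</li>
<li class="reader_article_tags_item">Regions</li></ul>
<div class="reader_article_body"><p>First paragraph.</p><p>Second one.</p></div>
"""
print(extract_item(sample))
```

In this sketch a missing element yields `None` (or an empty list for tags), which mirrors how `place` is absent for most items in the summary table.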
Data cleaning
Among all news items extracted, only two do not have a date (one and two), seemingly because they effectively link to other contents. They have been removed from the dataset.
In addition, two items dated many years before all the rest appear on the website among other news (one and two). They have also been removed for clarity.
There are 28 items with date, title, and tags, but an empty text field. They mostly refer to meetings such as this one; their titles follow a format similar to “Medvedev met the governor of X”, with no additional content shared. They have been kept in the dataset, as title and tags may still contain useful information.
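The cleaning rules above can be summarised in a short filter. This is a sketch under stated assumptions: the function name `clean_items` is hypothetical, items are represented as dictionaries with a parsed `date`, and the dataset start date (2012-04-24) is used as the cutoff for "many years before all the rest", which the original process may well have decided by manual inspection instead.

```python
from datetime import date


def clean_items(items, earliest=date(2012, 4, 24)):
    """Drop undated and out-of-range items; keep items with empty text."""
    kept = []
    for item in items:
        item_date = item.get("date")
        if item_date is None:
            continue  # undated items are removed
        if item_date < earliest:
            continue  # assumed cutoff for items far older than the rest
        kept.append(item)  # an empty text field is not a reason to drop
    return kept


# Made-up items illustrating each rule.
items = [
    {"title": "Regular item", "date": date(2020, 5, 1), "text": "Some text"},
    {"title": "No date", "date": None, "text": "Links elsewhere"},
    {"title": "Very old item", "date": date(2001, 1, 1), "text": "Old text"},
    {"title": "Meeting with governor", "date": date(2015, 3, 2), "text": ""},
]
for item in clean_items(items):
    print(item["title"])
```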
License information
At the time the contents were retrieved, the footer of the website made clear that all contents available are published with a Creative Commons Attribution 3.0 license:
“All materials on the website are available under the license: Creative Commons Attribution 4.0”
The contents of this dataset - “government.ru_ru” - are distributed within the terms of this license. To the extent that it is possible, the dataset itself is also distributed by its creator, Giorgio Comai, with the same CC-BY license, as well as under the Open Data Commons Attribution license (ODC-BY).