full corpus
Russian institutions
Russia’s MFA
English language
Corpus based on the website of Russia’s MFA (in English, 2003-2023)

Giorgio Comai


May 4, 2024

Explore or download this dataset

Scope of this corpus

This corpus includes all news items published on the English-language version of the website of Russia’s MFA.

Summary statistics

Dataset name: mid.ru_en_2024

Dataset description: all news items published on the English-language version of mid.ru

Start date: 2003-01-04

End date: 2023-12-31

Total items: 25 943

Available columns: doc_id; text; date; datetime; title; internal_id; url_id; translations; url

License: Permissive (see details)

Link for download: mid.ru_en_2024

field | present | missing | missing_share
doc_id | 25 943 | 0 | 0.0%
text | 25 938 | 5 | 0.0%
date | 25 943 | 0 | 0.0%
datetime | 25 943 | 0 | 0.0%
title | 25 943 | 0 | 0.0%
internal_id | 25 856 | 87 | 0.3%
url_id | 25 943 | 0 | 0.0%
translations | 25 917 | 26 | 0.1%
url | 25 943 | 0 | 0.0%
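
The present/missing counts above can be recomputed from any tabular export of the dataset. The following is a minimal sketch in Python, assuming the records have been loaded as a list of dictionaries keyed by the column names listed above, and that a missing value is stored as `None` or as an empty string. The function name `summarise_missing` and the sample rows are hypothetical and only serve as illustration.

```python
def summarise_missing(rows, fields):
    """Return {field: (present, missing, missing_share)} for the given rows."""
    total = len(rows)
    summary = {}
    for field in fields:
        # Assumption: None and empty or whitespace-only strings count as missing.
        missing = sum(
            1
            for row in rows
            if row.get(field) is None or str(row.get(field)).strip() == ""
        )
        share = missing / total if total else 0.0
        summary[field] = (total - missing, missing, share)
    return summary


# Hypothetical sample rows, not taken from the corpus.
sample = [
    {"doc_id": "a", "text": "Some text", "internal_id": "1383-22-09-2011"},
    {"doc_id": "b", "text": "", "internal_id": None},
]

for field, (present, missing, share) in summarise_missing(
    sample, ["doc_id", "text", "internal_id"]
).items():
    print(f"{field}: present={present} missing={missing} share={share:.1%}")
```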

Narrative explanation of how this textual dataset was built

The website of Russia’s MFA makes it possible to search its news section by date. The index pages for each date, starting with the earliest publications, have been retrieved. On the few occasions when more than 20 items were published on the same day, a second page for the relevant day was also retrieved. Here is an example of such an index page:

Direct links to news items were extracted from these pages.
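
The retrieval logic described above can be sketched as follows. This is a minimal illustration in Python, not the code actually used to build the corpus. `fetch_index` is a hypothetical callable standing in for whatever returns the HTML of the index page for a given date and page number; the page size of 20 comes from the description above, and the link pattern is inferred from the news item URLs cited in this document.

```python
import re
from datetime import date, timedelta

# Pattern inferred from item URLs such as
# https://mid.ru/en/foreign_policy/news/1927386/
NEWS_LINK = re.compile(r"/en/foreign_policy/news/(\d+)/")
ITEMS_PER_PAGE = 20  # page size, per the description above


def extract_url_ids(html):
    """Return numeric ids of linked news items, in order, without duplicates."""
    seen = []
    for url_id in NEWS_LINK.findall(html):
        if url_id not in seen:
            seen.append(url_id)
    return seen


def collect_url_ids(start, end, fetch_index):
    """Walk every date from start to end (inclusive) and gather item ids.

    fetch_index(day, page) is a hypothetical function returning the HTML
    of the index page for that day and page number.
    """
    collected = []
    day = start
    while day <= end:
        page = 1
        while True:
            ids = extract_url_ids(fetch_index(day, page))
            collected.extend(ids)
            # A full page suggests there may be more items on the next page.
            if len(ids) < ITEMS_PER_PAGE:
                break
            page += 1
        day += timedelta(days=1)
    return collected
```

Note that this sketch keeps requesting further pages for as long as a page is full, which generalises the "second page" case mentioned above.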

The corpus includes the limited metadata available through the website, namely:

  • title
  • date and time of publication
  • an internal id which is included in almost all posts (see note below)
  • a list of the languages in which a given post has been published


This section lists some issues that may be of interest to users of this corpus:

  • Many items include the string: “Unofficial translation from Russian”
  • Along with news, the MFA publishes items that detail the timing and accreditation rules for press briefings; see for example: https://mid.ru/en/foreign_policy/news/1927386/. As these do not include substantive content, they are not included in the dataset.
  • Almost all news items are published with an identifier, e.g. “1383-22-09-2011” for this item. In a few dozen instances the identifier is missing, and in a handful it is not unique. As a consequence, the numeric component of the url is likely preferable as the main unique identifier.
  • The Russian-language version of this corpus has a significantly larger number of publications.
  • There are 5 items with empty text fields, listed below. They include no text besides the title, or include just a link to an external file (not included in this corpus).
date | title | url
2009-05-05 | Statement by H.E. Ambassador Anatoly Antonov, Head of the Delegation of the Russian Federation at the Third Session of the Preparatory Committee for the 2010 Review Conference of the Parties to the Treaty on the Non–Proliferation of Nuclear Weapons, New York, 4 May 2009 | https://mid.ru/en/foreign_policy/news/1711875/
2012-12-06 | Report on the human rights situation in the European Union | https://mid.ru/en/foreign_policy/news/1653435/
2013-05-21 | Speech of and answers to questions of mass media by Russian Foreign Minister Sergey Lavrov during joint press conference summarizing the results of negotiations with Secretary General of the Council of Europe Thorbjørn Jagland, Sochi, 20 May 2013 | https://mid.ru/en/foreign_policy/news/1587812/
2014-03-28 | The Hague Nuclear Security Summit Communiqué | https://mid.ru/en/foreign_policy/news/1709087/
2015-08-05 | The Ministry of Foreign Affairs of the Russian Federation on certain legal issues highlighted by the action of the Arctic Sunrise against Prirazlomnaya platform | https://mid.ru/en/foreign_policy/news/1512649/
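
As noted in the list above, the numeric component of the URL is likely a more reliable key than the internal identifier. The following is a minimal sketch in Python of how one might derive that component and check the internal identifier for gaps and duplicates. The function names are hypothetical, the URL pattern is inferred from the links shown in this document, and records are assumed to be dictionaries with an `internal_id` key.

```python
import re
from collections import Counter

# Pattern inferred from URLs such as
# https://mid.ru/en/foreign_policy/news/1711875/
URL_ID = re.compile(r"/news/(\d+)/?$")


def url_id_from_url(url):
    """Return the numeric component of a news item URL, or None if absent."""
    match = URL_ID.search(url)
    return match.group(1) if match else None


def check_internal_ids(records):
    """Return (count of records lacking internal_id, sorted duplicated ids)."""
    ids = [record.get("internal_id") for record in records]
    # Assumption: None and empty strings both mean the identifier is missing.
    missing = sum(1 for value in ids if not value)
    counts = Counter(value for value in ids if value)
    duplicated = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicated
```

For instance, `url_id_from_url("https://mid.ru/en/foreign_policy/news/1711875/")` yields the string `"1711875"`, which can then serve as a key when joining or deduplicating records.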

License information

At the time the contents were retrieved, the page on the conditions for the use of website contents made clear that contents could be used for research purposes and re-published, as long as reference is always made to the website of the MFA.

Materials on the website of the Russian Ministry of Foreign Affairs are generally accessible and open for non-commercial use (personal, family, education, research, etc.).

Their reprinting, as well as any quoting in the mass media is allowed only with a reference to the website of the Russian Ministry of Foreign Affairs as a source of the information.

No specific license is, however, mentioned.

The contents of this dataset - “mid.ru_en” - are distributed within the limits of these conditions of use. To the extent that it is possible, the dataset itself is also distributed by its creator, Giorgio Comai, under the same conditions, as well as under the Open Data Commons Attribution license (ODC-BY).