Available textual datasets that are relevant for the analysis of Russia’s invasion of Ukraine and conflicts in the region

review of literature

content analysis

data sources

What datasets are publicly available?

Author

Giorgio Comai

Published

November 21, 2022

Work-in-progress

This page is still a work-in-progress. It is shared in the spirit of keeping the research process as open as possible, but it is still a draft document, possibly an early draft: incomplete, unedited, and possibly inaccurate. Datasets included may likewise not be fully verified.

Context

This project is based on the assumption that the researcher has no access to proprietary software or proprietary datasets. For those who actually need them or can afford them, these are surely sensible points of reference; there may also be few good alternatives for some media formats or for large-scale analyses.

Such sources are however rarely affordable for anyone outside established academic institutions or businesses. Besides, in practice, contents generated in contested territories are rarely included in established datasets.

The textual datasets at the core of this project are generated via text mining using open source solutions in a way that can be replicated by anyone with a computer, an internet connection, and the necessary skills. However, online sources retrieved through text mining do not exist in a vacuum, and of course need not be the sole point of reference.

With this in mind, in this section I include references to potentially relevant sources.

Reviews of different datasets/corpora

Kopotev, Mustajoki, and Bonch-Osmolovskaya (2021) offer a review of “Corpora in Text-Based Russian Studies”. They point to a variety of sources, ranging from do-it-yourself datasets to established online resources, highlighting two in particular: the Russian National Corpus and the Integrum database.

Open access datasets

Duma Speeches: A Term Frequency Analysis

Russian State Duma Transcripts 1994–2021

dekoder.org (2021)
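
To give a rough idea of what a term frequency analysis of transcripts involves, here is a minimal sketch in Python. It is not the method used by dekoder.org: the function name `term_frequencies`, the two placeholder texts, and the simple word-splitting rule are all assumptions made for illustration.

```python
import re
from collections import Counter


def term_frequencies(texts, terms):
    """Count exact, case-insensitive occurrences of each term in each text.

    `texts` maps a label (here, a year) to a string; `terms` is a list of
    single words. Hypothetical helper, written only for this illustration.
    """
    results = {}
    for label, text in texts.items():
        # Split on word characters; in Python 3 this also matches Cyrillic
        tokens = re.findall(r"\w+", text.lower())
        counts = Counter(tokens)
        # A Counter returns 0 for terms that never occur
        results[label] = {term: counts[term.lower()] for term in terms}
    return results


# Invented placeholder texts, NOT taken from any real transcript
example_texts = {
    "2014": "Sanctions were discussed. The sanctions debate continued on Crimea.",
    "2021": "The budget was discussed; sanctions were mentioned once.",
}

frequencies = term_frequencies(example_texts, ["sanctions", "crimea"])
for year, counts in frequencies.items():
    print(year, counts)
```

This sketch matches exact word forms only. Russian is highly inflected, so an analysis of actual transcripts would typically need lemmatisation or pattern matching on word stems, and would usually normalise counts by the total number of words per period.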

Commercial sources of Russian media contents

Integrum World Wide

Integrum is a Russia-based provider of services such as media and social media monitoring. It includes transcripts of TV broadcasts. It is used, for example, by the RuMOR project.

Website: http://www.integrumworld.com/

Used by:

Lankina and Watanabe (2017)
RuMOR project (see e.g. this post)

Non-commercial sources of media contents

GDELT - The Global Database of Events, Language, and Tone

Website: https://www.gdeltproject.org/

GDELT also includes TV recordings, including those of Russian TV stations:

https://api.gdeltproject.org/api/v2/tvv/tvv

Media Cloud

An initiative by:

Media Cloud is a consortium research project across multiple institutions, including the University of Massachusetts Amherst, Northeastern University, and the Berkman Klein Center for Internet & Society at Harvard University.

Website: https://mediacloud.org/

N.B.: Due to a number of issues, Media Cloud has partially incomplete data for various periods of time. For details, see e.g. this November 2022 update, and make sure to check the current state of affairs before using these data for structured analyses.

International providers of media and other textual contents

LexisNexis
ProQuest
Factiva

A list of lists

Ultimate 🚀 Data Sources for [Textual] Content Analysis in Communication and Media Research (UDS4CA)

A curated list (apparently) by Cornelius Puschmann, with the credited support of the “Twitter Hive Mind”.

Meteor - Media texts Open Registry

Self-described as:

OPTED Meteor (Media Text Open Registry) is an inventory for European journalistic texts and is part of the EU-funded Project OPTED where researchers work towards the creation of a new European research infrastructure for the study of political communication in Europe. Please visit opted.eu for more information about the project.

Website: https://meteor.opted.eu/

References

dekoder.org. 2021. “Duma Speeches: A Term Frequency Analysis. Russian State Duma Transcripts 1994–2021.” https://discuss-data.net/dataset/fb52dac2-66e3-47a3-86c5-b2a3dadf41bf/, August. https://doi.org/10.48320/FB52DAC2-66E3-47A3-86C5-B2A3DADF41BF.

Kopotev, Mikhail, Arto Mustajoki, and Anastasia Bonch-Osmolovskaya. 2021. “Corpora in Text-Based Russian Studies.” In The Palgrave Handbook of Digital Russia Studies, edited by Daria Gritsenko, Mariëlle Wijermars, and Mikhail Kopotev, 299–317. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-42855-6_17.

Lankina, Tomila, and Kohei Watanabe. 2017. “‘Russian Spring’ or ‘Spring Betrayal’? The Media as a Mirror of Putin’s Evolving Strategy in Ukraine.” Europe-Asia Studies 69 (10): 1526–56. https://doi.org/10.1080/09668136.2017.1397603.