Available textual datasets that are relevant for the analysis of Russia’s invasion of Ukraine and conflicts in the region
This page is still a work-in-progress. It is shared in the spirit of keeping the research process as open as possible, but it is still a draft document, possibly an early draft: incomplete, unedited, and possibly inaccurate. Datasets included may likewise not be fully verified.
Context
This project is based on the assumption that the researcher has no access to proprietary software or proprietary datasets. For those who actually need them or can afford them, these are surely sensible points of reference; there may also be few good alternatives for some media formats or for large-scale analyses.
Such sources are, however, rarely affordable for anyone outside established academic institutions or businesses. Besides, in practice, content generated in contested territories is rarely included in established datasets.
The textual datasets at the core of this project are generated via text mining using open source solutions in a way that can be replicated by anyone with a computer, an internet connection, and the necessary skills. However, online sources retrieved through text mining do not exist in a vacuum, and of course need not be the sole point of reference.
With this in mind, in this section I include references to potentially relevant sources.
Reviews of different datasets/corpora
Kopotev, Mustajoki, and Bonch-Osmolovskaya (2021) offer a review of “Corpora in Text-Based Russian Studies”. They point to a variety of sources, ranging from do-it-yourself datasets to established online resources, highlighting two in particular: the Russian National Corpus and the Integrum database.
Open access datasets
Duma Speeches: A Term Frequency Analysis
Russian State Duma Transcripts 1994–2021
dekoder.org (2021)
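Since the first entry above is a term frequency analysis, here is a minimal sketch of what counting term frequencies involves. It is not the method used by any of the sources listed: the sample sentences are invented placeholders rather than real transcripts, and the names `tokenize`, `term_frequencies`, and `STOPWORDS` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import Counter

# Invented placeholder sentences, NOT taken from any real transcript.
speeches = [
    "The committee discussed the budget and the budget amendments.",
    "Deputies debated security, the budget, and regional policy.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    In Python 3, \\w matches Unicode word characters, so the same
    pattern also works for Cyrillic text.
    """
    return re.findall(r"\w+", text.lower())


def term_frequencies(documents, stopwords=frozenset()):
    """Count how often each non-stopword term occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in stopwords)
    return counts


# Hypothetical, deliberately tiny stopword list for the toy corpus.
STOPWORDS = frozenset({"the", "and"})

frequencies = term_frequencies(speeches, STOPWORDS)
print(frequencies.most_common(3))
```

A real analysis would likely add lemmatisation and a proper stopword list for the language in question; this sketch only shows the counting step.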
Commercial sources of Russian media contents
Integrum World Wide
Integrum is a Russia-based provider of services such as media and social media monitoring. It includes transcripts of TV broadcasts. It is used, for example, by the RuMOR project.
Website: http://www.integrumworld.com/
Used by:
Non-commercial sources of media contents
GDELT - The Global Database of Events, Language, and Tone
Website: https://www.gdeltproject.org/
GDELT also includes TV recordings, including those of Russian TV stations:
https://api.gdeltproject.org/api/v2/tvv/tvv
Media Cloud
An initiative by:
> Media Cloud is a consortium research project across multiple institutions, including the University of Massachusetts Amherst, Northeastern University, and the Berkman Klein Center for Internet & Society at Harvard University.
Website: https://mediacloud.org/
N.B.: Due to a number of issues, Media Cloud has incomplete data for various periods of time. For details, see e.g. the November 2022 update, and make sure you check the current state of affairs before using these data for structured analyses.
International providers of media and other textual contents
- LexisNexis
- ProQuest
- Factiva
A list of lists
Ultimate 🚀 Data Sources for [Textual] Content Analysis in Communication and Media Research (UDS4CA)
A curated list (apparently) by Cornelius Puschmann, with the credited support of the “Twitter Hive Mind”.
Meteor - Media Text Open Registry
Self-described as:
OPTED Meteor (Media Text Open Registry) is an inventory for European journalistic texts and is part of the EU-funded Project OPTED where researchers work towards the creation of a new European research infrastructure for the study of political communication in Europe. Please visit opted.eu for more information about the project.
Website: https://meteor.opted.eu/