Text as data & data in the text

Studying conflicts in post-Soviet spaces through structured analysis of textual contents available on-line

A project led by Giorgio Comai, researcher and data analyst at OBCT/CCI, carried out with the support of the Italian MFA (see below for details and disclaimers).

About this project

Text as data & data in the text

What is this project about?

Datasets and tools

Find out more about the datasets and the tech stack at the core of this project.

Giorgio Comai

Giorgio Comai is lead researcher of the project.

Posts and updates

Russia and the “root causes” of the conflict

This is best summarised by prominent Russian commentator Mikhail Leontyev, as he was about to explain why there shouldn’t be expectations about any breakthrough in the…

May 30, 2025

Giorgio Comai

From ‘residents of the Donbas’ to ‘root causes’

How the concern with ‘residents of the Donbas’ and humanitarian issues disappeared from Russian media

May 5, 2025

Giorgio Comai

Why Kyiv cannot possibly fix the ‘root causes’ of the conflict

No amount of unilateral Ukrainian concessions would address what Russia considers the ‘root causes’ of the conflict.

May 5, 2025

Giorgio Comai

Analysing Russian news via Telegram, processing them with open LLMs

Testing alternatives for investigating public discourse.

Mar 18, 2025

Giorgio Comai

Introducing ‘Russian state institutions full-text datasets’ (2024 edition)

Stable version now officially published: a collection of corpora based on textual contents extracted from the websites of Russian state institutions

Oct 30, 2024

Giorgio Comai

Transnistria ‘under blockade’: an analysis of local media

A quantitative analysis of the main local news agency and TV station, including visual evidence of about 2 000 mentions of ‘blockade’ on Transnistria’s TV.

Jun 5, 2024

Giorgio Comai

How unhinged has Medvedev become? Testing inter-coder reliability of open Large Language Models (LLM)

Testing the feasibility of using locally-deployed LLMs as coders of Russian-language text

Apr 9, 2024

Giorgio Comai

Russophobia in Russian official statements and media

A word frequency analysis.

Sep 26, 2023

Giorgio Comai

From the ‘battle of Bakhmut’ to the ‘march of justice’: Prigozhin’s audio files, transcribed

Prigozhin’s press service actively responds via Telegram to questions asked by journalists. Questions are mostly posted as screenshots and responses as audio messages; other posts include video. How do we turn these into something that can be searched and analysed?

Aug 22, 2023

Giorgio Comai

Traditional, conservative, Christian, distinct… how supposedly old values emerged in the official and media discourse in Russia since 2012

A data-driven exploration of how references to different types of values evolved in official statements and the media in Russia in the last two decades.

May 16, 2023

Giorgio Comai

Who said it first? ‘The collective West’ in Russia’s nationalist media and official statements

Expressions such as ‘the collective West’ have entered official discourse only recently. They have been in common use in nationalist media for much longer.

Mar 14, 2023

Giorgio Comai

Introducing ‘tadadit.xyz’

A short introduction to the core principles and ideas at the basis of this initiative.

Mar 12, 2023

Giorgio Comai

What’s needed in a content analysis toolkit for starters?

In order to analyse text, one needs to have text in a structured format. How do we get there?

Nov 7, 2022

Giorgio Comai

Review of literature

Tools for text mining and text analysis

Software packages or other solutions that can be used to facilitate retrieving, analysing, and processing textual data.

Local media and institutional news sources in Russia-held territories in Ukraine and beyond

A dataset with basic details about relevant online sources

Words that count

A selection of words or expressions that have acquired a special significance in late Putin’s Russia

Content analysis on Russian media and conflict in Ukraine: from Putin’s third term to Russia’s invasion of Ukraine

A review of the scholarly literature

What is a corpus? And what shape should it take?

Definitions, as well as standards and formats for storing and sharing textual data

Available textual datasets that are relevant for the analysis of Russia’s invasion of Ukraine and conflicts in the region

What datasets are publicly available?

Initiatives conducting content analysis related to Russia’s invasion of Ukraine

A review of ongoing initiatives and past studies that rely on content analysis and relate to Russia’s invasion of Ukraine, local dynamics in the Donbas, or other Russia-controlled territories in Ukraine

Datasets

Title	Description	Categories
Russian state institutions 2024	This is a collection of full-text datasets based on contents extracted from the websites of Russian institutions.
Russian state institutions 2025 (draft)	As long as this notice is displayed, please ignore this work-in-progress section and refer to the 2024 version of this dataset or other sections of this website.
Zavtra.ru - Summary of articles	A sample of all items published on the website of the Russian weekly magazine ‘Zavtra’, summarised with a locally deployed LLM (gemma:4b)	dataset, Russian media, Russian language, llm, summary
zavtra.ru_ru_2025	Corpus based on the website of Russian weekly newspaper ‘Zavtra’ (in Russian, 1996-2024)	corpus, full corpus, Russian media, Russian language
transcript.duma.gov.ru_ru_2024	Corpus based on the website of Russia’s Duma (in Russian, 2006-2023)	dataset, Russian institutions, Russian parliament, Russian language
archive.premier.gov.ru_ru_2024	Corpus based on the archived version of the website of Russia’s prime minister (in Russian, 2008-2012)	dataset, Russian institutions, Russian government, Russian language
duma.gov.ru_ru_2024	Corpus based on the website of Russia’s Duma (in Russian, 2006-2023)	dataset, Russian institutions, Russian government, Russian language
government.ru_ru_2024	Corpus based on Russia’s government website (in Russian, 2013-2023)	dataset, Russian institutions, Russian government, Russian language
archive.government.ru_ru_2024	Corpus based on the archived version of Russia’s government website (in Russian, 2008-2013)	dataset, Russian institutions, Russian government, Russian language
mid.ru_en_2024	Corpus based on the website of Russia’s MFA (in English, 2003-2023)	corpus, full corpus, Russian institutions, Russia’s MFA, English language
mid.ru_ru_2024	Corpus based on the website of Russia’s MFA (in Russian, 2003-2023)	corpus, full corpus, Russian institutions, Russia’s MFA, Russian language
kremlin.ru_en_2024	Corpus based on the website of Russia’s president (in English, 1999-2023)	corpus, full corpus, Russian institutions, Russia’s president, English language
zavtra.ru_ru_2024	Corpus based on the website of Russian weekly newspaper ‘Zavtra’ (in Russian, 1996-2023)	corpus, full corpus, Russian media, Russian language
kremlin.ru_ru_2024	Corpus based on the website of Russia’s president (in Russian, 1999-2023)	dataset, Russian institutions, Russian language
rg.ru_ru	All items published on Rossiiskaya Gazeta	dataset, Russian media, Russian language
novostipmr.com_ru	All items published on the website of Transnistria’s news agency Novosti PMR	dataset, Russian language, Transnistria
patriarhia.ru_ru	All items published on the official website of the Moscow Patriarchate	dataset, Russian language
Prigozhin audio files, transcribed	An automatic transcription of all the audio messages posted on Prigozhin’s official Telegram channel	dataset, automatic transcription, Telegram, Russian language, Russian media
mid.ru_en	All English-language news items published on the website of the Russian Ministry of Foreign Affairs	dataset, Russian institutions, English language
mid.ru_ru	All Russian-language news items published on the website of the Russian Ministry of Foreign Affairs	dataset, Russian institutions, Russian language
duma.gov.ru_ru	All news items published on the website of the Russian Duma	dataset, Russian institutions, Russian language
tsargrad.tv_ru	All textual items published on the website of the Russian TV broadcaster ‘Tsargrad’	dataset, Russian media, Russian language
kp.ru_ru	All items published in the politics section of Komsomolskaya Pravda	dataset, Russian media, Russian language
ng.ru_ru	All items published on Nezavisimaya Gazeta	dataset, Russian media, Russian language
zavtra.ru_ru	All items published on the website of the Russian weekly magazine ‘Zavtra’	dataset, Russian media, Russian language
1tv.ru_ru	All items published on Pervy Kanal (1tv.ru)	dataset, Russian media, Russian language
kremlin.ru_en	All items published on the English language version of the Kremlin’s website	dataset, Russian institutions, English language
kremlin.ru_ru	All items published on the Russian language version of the Kremlin’s website	dataset, Russian institutions, Russian language

Tutorials

The tutorials are mostly based on castarter (Content Analysis Starter Toolkit for the R programming language) and will target users with beginner or beginner-intermediate coding skills. As the package gains new features, the tutorials will become more accessible; eventually, some of them will be accessible to users with no coding experience at all.

A draft version of the documentation for the castarter package is already available online. Both the documentation and the functionality of the package will mature in the coming months.

Extracting textual contents from the Kremlin’s website with `castarter`

A first introduction to extracting textual contents for further analysis

Feb 27, 2023

Giorgio Comai

Funding and disclaimers

This project is hosted by OBCT/CCI. It is carried out with the support of the Italian Ministry of Foreign Affairs and International Cooperation under art. 23 bis, D.P.R. 18/1967. All opinions expressed within the scope of this project represent the opinions of their authors and not those of the Ministry.

“The positions contained in this report are solely those of the authors and do not necessarily represent the positions of the Ministry of Foreign Affairs and International Cooperation”