Text as data & data in the text

Studying conflicts in the post-Soviet space through structured analysis of textual contents available online

What is this project about?

Published

October 20, 2022

A project by Giorgio Comai, researcher and data analyst at OBCT/CCI

Structured analysis of textual contents can take many forms and can be conducted through a variety of approaches of varying complexity.

This emerges most clearly in the burgeoning number of research efforts that include a text as data component, ranging from basic word frequency to complex methods that rely on machine learning. In all these cases, the data come from the text itself, from the words themselves.
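The simplest end of this range, basic word frequency, can be sketched in a few lines. The example below is only an illustration in Python, not the project's own tooling (which is built around R); the function name `word_frequency` and the sample sentences are invented for this purpose.

```python
import re
from collections import Counter


def word_frequency(texts, terms):
    """Count how often each term appears across a list of texts (case-insensitive)."""
    wanted = {term.lower() for term in terms}
    counts = Counter()
    for text in texts:
        # \w+ matches runs of letters and digits, including non-Latin scripts
        for token in re.findall(r"\w+", text.lower()):
            if token in wanted:
                counts[token] += 1
    # Report every requested term, including those never found (count 0)
    return {term.lower(): counts[term.lower()] for term in terms}


# Invented sample documents, for illustration only
sample_texts = [
    "Pensions were raised this year. Pensions remain low.",
    "The budget covers salaries and pensions.",
]
frequencies = word_frequency(sample_texts, ["pensions", "budget", "salaries"])
print(frequencies)
```

In practice, counts like these are usually normalised by the total number of words or documents per period, so that changes in publication volume are not mistaken for changes in attention.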

Data, however, can also be found in the text itself, between words. For example, socio-economic data that are not readily available in nicely formatted tables may be scattered around press releases of local authorities, local news services, or other on-line sources. Researchers commonly quote such data, which they may have found serendipitously, through patient use of search engines, or through some other unstructured approach. I argue that, in particular for scholars and analysts working on a specific area or case, a structured approach for finding data in the text is often both feasible and preferable.
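One way to picture a structured approach to "data in the text" is pattern matching over a dated corpus: every document is searched for a figure followed by a keyword, and each match is kept together with the date of its source. The sketch below makes several assumptions that are not part of this project: the keyword ("pensioners"), the `PATTERN` regular expression, the `extract_figures` helper, and the press-release snippets with their figures are all invented for illustration.

```python
import re

# Hypothetical pattern: a number (optionally with comma thousands separators)
# immediately followed by the word "pensioners"
PATTERN = re.compile(r"\b(\d{1,3}(?:,\d{3})+|\d+)\s+pensioners", re.IGNORECASE)


def extract_figures(documents):
    """Return (date, figure) pairs for every match found in the documents."""
    results = []
    for doc in documents:
        for match in PATTERN.finditer(doc["text"]):
            # Drop thousands separators before converting to an integer
            figure = int(match.group(1).replace(",", ""))
            results.append((doc["date"], figure))
    return results


# Invented press-release snippets, for illustration only
press_releases = [
    {"date": "2016-01-15", "text": "As of January, 145,300 pensioners were registered."},
    {"date": "2017-01-20", "text": "The fund now serves 147000 pensioners in the region."},
    {"date": "2017-02-03", "text": "The minister discussed the budget."},
]
figures = extract_figures(press_releases)
print(figures)
```

Matches found this way still need to be read in context, since the same pattern may capture figures that refer to different things; the gain is that every relevant mention in the corpus is surfaced, not only those a search engine happens to return.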

While references to on-line sources have become ubiquitous, and content analysis of selected official documents is quite common, structured analysis of on-line contents remains rare, especially outside of studies that are explicitly focused on content analysis.

Finally, a key component of this whole endeavour is working in the open. If you explore this website, you will see that many pages, including reviews of literature, are in draft format, sometimes just early drafts (and are marked as such). This is a feature, not a bug. Significant efforts will be dedicated to radical transparency throughout all phases of this endeavour, favouring open debates, earlier sharing of knowledge, positive feedback loops, and the advancement of open science practices in academia.

Project objectives and outputs

This project aims to facilitate structured analysis of on-line contents related to conflicts in the post-Soviet space by providing easier access to relevant datasets and tools.

Outputs include:

- basic case studies in the form of brief blog posts, each addressing a meaningful substantive question that these approaches can help answer (both “text as data” and “data in the text”)
- easier access to basic content analysis tools and relevant textual corpora for this area of study through publicly available web interfaces
- the publication of relevant textual corpora with full metadata in on-line repositories, if compatible with copyright restrictions on the original contents
- tutorials and tools facilitating the creation and update of textual datasets for alternative case studies, with a dedicated package for the R programming language
- a review of literature, accompanied by posts that outline how to replicate relevant parts of these studies, or to explore the same research question with different and possibly complementary datasets

Sources

- textual contents related to contested and occupied territories in Ukraine, as well as areas administered by Russian proxies since 2014
- textual contents published in post-Soviet de facto states
- textual contents published by Russian media and institutions

Previous work

This project is part of a research journey that has focused on contested spaces in the countries of the former Soviet Union, dealing in particular with socio-economic aspects and sources of external support for the entities that govern these territories.

It also stems from a parallel journey in computer programming focused on data analysis and data visualisation, and from a conviction that technical hurdles unnecessarily restrict access to approaches that should be more readily available to scholars working in area studies, as well as in peace and conflict research.

I have outlined some of these considerations on multiple occasions, including:

most briefly, in section 4 of this blog post, “There’s a lot of evidence on-line, and often Google won’t help”, 8 April 2018

in a scholarly article:
- Comai, Giorgio. 2017. “Quantitative Analysis of Web Content in Support of Qualitative Research. Examples from the Study of Post-Soviet de Facto States.” Studies of Transition States and Societies 9 (1).
in public presentations, including:
- “Surfing the post-Soviet web with style. Text mining post-Soviet de facto states”, in Tbilisi in 2016
- “Roundtable: Research Data Quality Assessment for the Area-Studies on the post-Soviet region: New Approaches needed?”, in Tartu in 2016 - see presentation slides
- “Victims of double standards: double victimhood and changing narratives in Azerbaijan’s public rhetoric” - in Gorizia in 2018, see key points in the form of a Twitter thread
in blog posts or articles that to different extents rely on these methods, including:
- “Russophobia in Russian official statements and media. A word frequency analysis”, 1 August 2021
- “How much has the 2020 war in Nagorno Karabakh been in the news? A comparison with August 2008 war in South Ossetia”, 14 October 2020
- “After a new president came to power, what happened to Transnistria’s media?”, 18 June 2018
- “Abkhazia’s parliamentary elections: not for the famous?”, 31 March 2017
- “Word frequency of Ukraine, Crimea, DNR/LNR and Novorossiya on 1tv.ru”, 20 March 2017
- “Russia and pensions in post-Soviet de facto states”, 1 February 2016
implicitly, by publishing this dataset:
- “kremlin_en - A textual dataset based on the contents published on the English-language version of the Kremlin’s website”, May 2021

Funding and disclaimers

This project is hosted by OBCT/CCI. It is carried out with the support of the Italian Ministry of Foreign Affairs and International Cooperation under art. 23 bis, D.P.R. 18/1967. All opinions expressed within the scope of this project are those of their author and not those of the Ministry.

“The positions contained in this report are solely those of the authors and do not necessarily represent the positions of the Ministry of Foreign Affairs and International Cooperation”

More about this project

Datasets and tools

Find out more about the datasets and the tech stack at the core of this project.

Giorgio Comai

Giorgio Comai is the lead researcher of the project.