Text as data & data in the text

Studying conflicts in the post-Soviet space through structured analysis of textual contents available online

What is this project about?

October 20, 2022

A project by Giorgio Comai, researcher and data analyst at OBCT/CCI

Structured analysis of textual contents can take many forms and can be conducted through a variety of approaches of varying complexity.

This emerges most clearly in the growing number of studies that include a “text as data” component, ranging from basic word-frequency counts to complex methods that rely on machine learning. In all these cases, the words themselves are the data.
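To make the simplest end of this range concrete, here is a minimal sketch of a word-frequency count. It is written in Python purely for illustration and is not the project's own tooling; the mini-corpus, the `tokenize` and `word_frequency` helpers, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: (date, text) pairs standing in for scraped
# press releases. The sentences are invented for illustration.
corpus = [
    ("2021-03-01", "The ministry discussed pensions and the budget."),
    ("2021-03-02", "Pensions were raised; the budget was approved."),
    ("2021-03-03", "A new school opened in the capital."),
]


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_frequency(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for _date, text in documents:
        counts.update(tokenize(text))
    return counts


frequencies = word_frequency(corpus)
print(frequencies["pensions"])  # 2
```

A real corpus would also call for language-specific steps such as stemming or lemmatisation, which this sketch deliberately leaves out.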

Data, however, can also be found within the text itself, among the words. For example, socio-economic data that are not readily available in nicely formatted tables may be scattered across press releases of local authorities, local news services, or other on-line sources. Researchers commonly quote such data, which they may have found serendipitously, through patient use of search engines, or through some other unstructured approach. I argue that, particularly for scholars and analysts working on a specific area or case, a structured approach to finding data in the text is often both feasible and preferable.
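One way to picture such a structured approach is to search every document in a corpus for the same pattern and collect each figure together with its date and the matched phrase. The sketch below does this in Python with a regular expression; it is an illustration under stated assumptions, not the method used in this project. The press releases, the figures in them, the `PATTERN` expression, and the `extract_values` helper are all hypothetical.

```python
import re

# Hypothetical press releases; texts and figures are invented for illustration.
releases = [
    ("2020-01-15", "The average pension reached 1,450 roubles in December."),
    ("2020-07-15", "Officials met with teachers. The average pension is now 1,520 roubles."),
    ("2020-09-01", "The school year began today."),
]

# Match "average pension", then up to 40 non-digit characters, then a number
# (optionally with thousands separators) followed by the word "roubles".
PATTERN = re.compile(r"average pension\D{0,40}?(\d[\d,]*)\s+roubles", re.IGNORECASE)


def extract_values(documents, pattern):
    """Return (date, value, matched phrase) for every match in every document."""
    rows = []
    for date, text in documents:
        for match in pattern.finditer(text):
            value = int(match.group(1).replace(",", ""))
            rows.append((date, value, match.group(0)))
    return rows


for date, value, context in extract_values(releases, PATTERN):
    print(date, value, context)
```

Keeping the matched phrase next to each extracted value makes it easy to check every figure against its source, which is what distinguishes this from a number found by chance through a search engine.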

While references to on-line sources have become ubiquitous, and content analysis of selected official documents is quite common, structured analysis of on-line contents remains rare, especially outside of studies that are explicitly focused on content analysis.

Finally, a key component of this whole endeavour is working in the open. If you explore this website, you will see that many pages, including reviews of literature, are in draft form, sometimes just early drafts, and are marked as such. This is a feature, not a bug. Significant effort will be dedicated to radical transparency throughout all phases of this endeavour, favouring open debates, earlier sharing of knowledge, positive feedback loops, and the advancement of open science practices in academia.

Project objectives and outputs

This project aims at facilitating structured analysis of on-line contents related to conflicts in the post-Soviet space by providing easier access to relevant datasets and tools.

Outputs include:

  • basic case studies in the form of brief blog posts, each addressing a meaningful substantive question that can be answered with these approaches (both “text as data” and “data in the text”)

  • easier access to basic content analysis tools and relevant textual corpora for this area of study through publicly available web interfaces

  • the publication of relevant textual corpora with full metadata in on-line repositories if compatible with copyright restrictions on the original contents

  • tutorials and tools, including a dedicated package for the R programming language, that facilitate creating and updating textual datasets for alternative case studies

  • a review of literature, accompanied by posts that outline how to replicate relevant parts of these studies, or to explore the same research question with different and possibly complementary datasets

The textual contents covered by the project include:


  • textual contents related to contested and occupied territories in Ukraine, as well as areas administered by Russian proxies since 2014

  • textual contents published in post-Soviet de facto states

  • textual contents published by Russian media and institutions

Previous work

This project is part of a research journey that has focused on contested spaces in the countries of the former Soviet Union, dealing in particular with socio-economic aspects and sources of external support for the entities that govern these territories.

It also stems from a parallel journey with computer programming focused on data analysis and data visualisation, and from a conviction that technical hurdles unnecessarily restrict access to approaches that should be more readily available to scholars working in area studies, as well as in peace and conflict research.

I have outlined some of these considerations on multiple occasions.

Funding and disclaimers

This project is hosted by OBCT/CCI. It is carried out with the support of the Italian Ministry of Foreign Affairs and International Cooperation under art. 23 bis, D.P.R. 18/1967. All opinions expressed within the scope of this project are those of their author and not those of the Ministry.

The positions contained in this report are exclusively those of the authors and do not necessarily represent the positions of the Ministry of Foreign Affairs and International Cooperation.
