Tools for text mining and text analysis

Software packages or other solutions that can be used to facilitate retrieving, analysing, and processing textual data.

Author

Giorgio Comai

Context

Some of the most common software packages used by scholars who work on content analysis are proprietary (e.g. NVivo or MAXQDA). This project is, however, based on the assumption that the researcher has no access to proprietary software or proprietary datasets, which is often the case for anyone working outside of established academic institutions or businesses. In any case, most advanced analyses are already conducted using open source tools, which offer more solid foundations for long-term research commitments. Hence, for both practical and philosophical reasons, this section will focus on free and open source solutions available to anyone willing to learn how to use them and, quite possibly, improve them.

With this in mind, I have been developing a package for the R programming language that makes it easier to create textual datasets from on-line resources and conduct basic analyses. This is based on the idea that many of the features needed in a content analysis starter toolkit are not readily available. The package, castarter, is currently available on-line, with extensive documentation to be published on this website soon.

There are, however, a great many solutions for processing and analysing textual datasets. In this section, I will list some of the most established. I will also include references to some tools that facilitate the retrieval of textual data from various on-line sources.

R packages for processing textual data

In the R ecosystem, there are many packages dedicated to processing textual data and working on text as data. Over the last decade, approaches have gone through various iterations, but as of 2023 there are some clear points of reference for anyone interested in working with textual data in R.

The most obvious starting point is the book Text Mining with R, written by the authors of the tidytext R package.

Taking it from there, here are some of the most established packages (or collections of packages) to further advance your textual analysis journey from within R.

tidytext

Self-described as: “Text mining using tidy tools”

Website: https://juliasilge.github.io/tidytext/
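
The "tidy" approach mentioned in the self-description above is usually understood as keeping text in a table with one token per row, which can then be counted, filtered, or joined like any other dataset. The following is a minimal, language-neutral sketch of that idea, written in Python with only the standard library. It does not use tidytext or reproduce its API; the two sample documents, the `tokenize` helper, and the variable names are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, made up for illustration.
documents = {
    "doc1": "Text mining turns text into data.",
    "doc2": "Tidy data makes text analysis easier.",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# One row per token: (document id, token).
tidy_rows = [
    (doc_id, token)
    for doc_id, text in documents.items()
    for token in tokenize(text)
]

# Count how often each token occurs across the whole corpus.
token_counts = Counter(token for _, token in tidy_rows)

# Print the most frequent tokens with their counts.
for token, n in token_counts.most_common(3):
    print(token, n)
```

Once text is in this shape, common steps such as removing stop words or comparing word frequencies between documents become ordinary table operations, which is roughly what the packages listed in this section automate at scale.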

quanteda

Self-described as: “Quantitative Analysis of Textual Data”

Website: https://quanteda.io/

text

Self-described as: “An R-package for analysing natural language with transformers from HuggingFace using Natural Language Processing and Machine Learning.”

Website: https://r-text.org/

R packages for retrieving textual data

paperboy

Self-described as: “A comprehensive (eventually) collection of web scraping scripts for news media sites”

Website: https://github.com/JBGruber/paperboy

newspapers

Self-described as: “R package to import articles from newspaper databases”

Website: https://github.com/koheiw/newspapers

Python libraries for retrieving textual data

news-please

Self-described as: “news-please - an integrated web crawler and information extractor for news that just works”

Website: https://github.com/fhamborg/news-please

Telethon

Self-described as: “Full-featured Telegram client library for Python 3”

Website: https://github.com/LonamiWebs/Telethon