A systematic evaluation of text mining methods for short texts: Mapping individuals’ internal states from online posts

Abstract

Short texts generated by individuals in online environments can provide social and behavioral scientists with rich insights into these individuals’ internal states. Trained manual coders can reliably interpret expressions of such internal states in text. However, manual coding imposes restrictions on the number of texts that can be analyzed, limiting our ability to extract insights from large-scale textual data. We evaluate the performance of several automatic text analysis methods in approximating trained human coders’ evaluations across four coding tasks encompassing expressions of motives, norms, emotions, and stances. Our findings suggest that commonly used dictionaries, although performing well in identifying infrequent categories, generate false positives too frequently compared to other methods. We show that large language models trained on manually coded data yield the highest performance across all case studies. However, there are also instances where simpler methods show almost equal performance. Additionally, we evaluate the effectiveness of cutting-edge generative language models like GPT-4 in coding texts for internal states with the help of short instructions (so-called zero-shot classification). While promising, these models fall short of the performance of models trained on manually analyzed data. We discuss the strengths and weaknesses of various models and explore the trade-offs between model complexity and performance in different applications. Our work informs social and behavioral scientists of the challenges associated with text mining of large textual datasets, while providing best-practice recommendations.
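To make the zero-shot setup concrete: below is a minimal sketch of instruction-based coding with a generative model, assuming the OpenAI Python client. The target category (joy) and the prompt wording are illustrative placeholders, not the instructions evaluated in the study.

```python
# A hedged sketch of zero-shot coding, assuming the OpenAI Python client
# (pip install openai) and an API key in the OPENAI_API_KEY environment
# variable. Prompt and label are illustrative, not the study's materials.
from openai import OpenAI

client = OpenAI()

def zero_shot_code(text: str) -> str:
    """Ask the model whether a post expresses joy; returns 'yes' or 'no'."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output suits coding tasks
        messages=[
            {"role": "system",
             "content": "You are a content coder. Answer only 'yes' or 'no'."},
            {"role": "user",
             "content": f"Does the following post express joy?\n\n{text}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(zero_shot_code("Finally got the job. Best day ever!"))  # e.g., 'yes'
```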


What is the best way to prepare textual data?

We find that stop word removal improved the performance of custom-made dictionaries, whereas lemmatization mattered to a lesser extent. Text preparation had a rather modest effect on model performance when “bag-of-words” representations were used with supervised machine learning models. Finally, we saw no improvement from fine-tuned transformer models whose pre-processing pipelines are designed specifically for short social media texts (BERTweet). Our results thus echo the notion that there is no universally “best” approach to text pre-processing (Denny & Spirling, 2018), but they also suggest that the choice of method matters more than the specific pre-processing decisions taken when working with a particular method.
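As an illustration of the pre-processing choices discussed above, here is a minimal sketch assuming NLTK (with its stopwords and WordNet resources downloaded) and scikit-learn; the two example posts are ours, not the study's data.

```python
# A hedged sketch of stop word removal and lemmatization ahead of a
# bag-of-words representation. Assumes NLTK resources have been fetched
# via nltk.download('stopwords') and nltk.download('wordnet').
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    # Lowercase, strip simple punctuation, drop stop words, lemmatize.
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    tokens = [lemmatizer.lemmatize(t) for t in tokens
              if t.isalpha() and t not in stop_words]
    return " ".join(tokens)

posts = ["I am so happy about these results!", "These results worry me."]
cleaned = [preprocess(p) for p in posts]

# Bag-of-words matrix for a downstream supervised model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())  # e.g., ['happy' 'result' 'worry']
```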

Before concluding, we emphasize that it is crucial to bear in mind the limitations of measuring concepts in text more generally. Although there is evidence that internal experiences are reflected in language use (Boyd & Pennebaker, 2017; Tausczik & Pennebaker, 2010) and can be recovered by external observers (Koutsoumpis et al., 2022), there are still important factors to consider. First, not all internal states appear to be equally observable in text (Kennedy et al., 2021). Second, it remains an open question to what extent manual coding for internal states accurately captures the intentions of the author rather than the coders’ subjective perceptions (Boyd & Schwartz, 2021; Kennedy et al., 2022; Koutsoumpis et al., 2022; Vazire, 2010). Concepts extracted from text by manual coders still require extensive validation against alternative measures, including self-reports, observer reports, physical states, and actual behaviors (Amador Diaz Lopez et al., 2017; Bleidorn & Hopwood, 2019; Boyd & Schwartz, 2021; Kennedy et al., 2021; Lykousas et al., 2019; Malko et al., 2021; Matsuo et al., 2019; Troiano et al., 2019; Vine et al., 2020). Finally, existing methods for measurement in text usually involve substantial simplification. For instance, nuanced internal states are often reduced to binary indicators (e.g., whether the emotion of joy is present in the text or not). While such reductions facilitate our analyses (including those relying on text mining), they can discard important details about the intensity with which individuals perceive and communicate their internal experiences.

Bearing these limitations in mind, we demonstrate how manual coding of internal states can be reliably extended across many texts using different text mining methods. While we examined method families separately for simplicity and clarity, researchers can also combine them effectively. For example, dictionaries can be enhanced with the help of distributed representations (Di Natale & Garcia, 2023; Garten et al., 2018; Mpouli et al., 2020), and dictionary outputs can serve as input for supervised machine learning methods (Farnadi et al., 2021), as sketched below. We regard text mining methods as valuable tools that can augment our analytical capabilities but are prone to replicating the shortcomings of our research procedures (Grimmer et al., 2022). Therefore, it is important to rely on text-based measures that are rooted in theory and substantive knowledge about the concepts of interest (Grimmer et al., 2022; Kennedy et al., 2022), evolve through engagement with textual data (Lazer et al., 2021; Nelson, 2017), and are built with consideration of the potential structural biases inherent in our data and computational methods (Bonikowski & Nelson, 2022).
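A minimal sketch of one such combination, in which matches against a tiny illustrative dictionary become features for a supervised classifier; the dictionary, texts, and labels are placeholders rather than the resources used in the paper.

```python
# A hedged sketch of feeding dictionary outputs into a supervised model.
# The mini-dictionary and manually coded labels are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

joy_dictionary = {"happy", "joy", "delighted", "great"}

def dictionary_features(text: str) -> list:
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    hits = sum(t in joy_dictionary for t in tokens)
    return [hits, hits / max(len(tokens), 1)]  # raw count and hit rate

texts = ["What a great, happy day!", "Traffic was terrible again.",
         "So delighted with the outcome.", "I lost my keys."]
labels = [1, 0, 1, 0]  # manual codes: 1 = joy expressed

X = np.array([dictionary_features(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([dictionary_features("Such a happy surprise!")]))  # [1]
```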

Big textual data come with their own limitations for applications in the social and behavioral sciences (Boyd & Schwartz, 2021; Kern et al., 2016; Lazer et al., 2021; Macanovic, 2022). It can therefore be beneficial to supplement them with questionnaire, census, or experimental data (Boyd & Schwartz, 2021; Salganik, 2017; van Loon et al., 2020). Finally, as we have done in this survey, it is crucial to conduct transparent and reproducible text mining analyses (Nelson, 2019) and to validate model performance against a reliable (human) reference (Bonikowski & Nelson, 2022; Grimmer et al., 2022; Grimmer & Stewart, 2013; van Atteveldt et al., 2021). We conclude by reiterating the advice of Grimmer and colleagues (2022): despite the abundance of data and computational tools, we should not disregard the valuable lessons learned from working with scarce data, namely prioritizing theory, sound research design, and good implementation.
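As an illustration of validating automatic output against a human reference, the following sketch computes standard agreement metrics with scikit-learn; the labels are invented for illustration.

```python
# A hedged sketch of model-vs-coder validation using standard metrics.
# Both label lists are illustrative placeholders, not real study data.
from sklearn.metrics import (cohen_kappa_score, f1_score,
                             precision_score, recall_score)

human = [1, 0, 1, 1, 0, 0, 1, 0]   # trained coders' labels
model = [1, 0, 0, 1, 0, 1, 1, 0]   # automatic classifier's labels

print("Cohen's kappa:", cohen_kappa_score(human, model))
print("Precision:    ", precision_score(human, model))
print("Recall:       ", recall_score(human, model))
print("F1:           ", f1_score(human, model))
```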

(Ana Macanovic & Wojtek Przepiorka)

https://link.springer.com/article/10.3758/s13428-024-02381-9

https://www.semanticscholar.org/paper/A-systematic-evaluation-of-text-mining-methods-for-Macanovic-Przepiorka/f565c7a1216bd69ef8c4d682fb6c9df84002ce7f