Political scientists and political psychologists are turning to text to study politics, as evidenced by a proliferation of studies using automated text analysis methods to explore, for example, policy positions and topics, speaker sentiment or even personality (Benoit & Laver, 2007; Grimmer, 2010; Slapin & Proksch, 2008; Young & Soroka, 2012). These developments are to be applauded as they bring about novel insights about policies and politics using new sources of unstructured data. However, a divide exists between researchers using text as data in political science on the one hand and (political) psychology on the other, with cross-disciplinary work the exception rather than the rule. Generally, political psychologists are more likely to apply supervised methods like dictionaries (which assume that underlying categories are known) to learn from text about stable characteristics of the author or speaker (e.g., personality type or linguistic styles). Political scientists, on the other hand, more often use unsupervised methods like topic models or scaling models (which assume that underlying categories are unknown) to learn from text about topical content or policy positions.i Each approach is valuable, but as this paper demonstrates, both could benefit from better integration (for a similar argument for political science and political psychology more generally see Druckman, Kuklinski, & Sigelman, 2009). For example, political scientists could learn from political psychologists about how individual characteristics are reflected in stable language patterns among politicians, whereas political psychologists could learn from political scientists how the political context (e.g., the dynamics of a political campaign or the intended audience of a speech) pressures these politicians into changing their language use.
To further advance the promise that text as data holds, this paper provides a multidisciplinary assessment of crucial assumptions in a typical text as data project, highlighting differences between political scientists and political psychologists. Building on Grimmer and Stewart (2013) and Wilkerson and Casas (2017), this assessment is structured around four central steps in a typical text as data research design: (i) sampling text; (ii) authorship as meta data; (iii) preprocessing text; and (iv) analyzing text. Our discussion is intended to raise awareness of each of these issues as well as to provide practical suggestions on how to deal with them. Along the way we demonstrate that the assessment of speaker characteristics may crucially depend on the text sources under study, and that the use of sentiment words correlates with estimates of policy positions, with implications for the interpretation of the latter. Our discussion is by no means intended to disqualify published results. Rather, we want to highlight the importance of considering each of these issues when starting a text as data project. In the next four sections, we discuss each issue in turn. We then summarize our discussion by offering a set of best practices and finish with some concluding thoughts.
Sampling Text
The first issue in every text as data project concerns sampling. What text sources should be used to build a corpus? And what text sources should not be used? And why? A key consideration is – of course – to identify the text source best able to capture the theoretical construct of interest. Among political psychologists, researchers interested in personality or leadership style generally consider interview responses to be the most valuable source of text. The argument goes that in interviews the language used by the interviewee is more natural than in most other settings, and therefore ideal for capturing personality and style (Hermann, 2005; Slatcher, Chung, Pennebaker, & Stone, 2007; Winter, Hermann, Weintraub, & Walker, 1991). Other sources of “spontaneous” text are used as well. For example, in their study of linguistic styles among four presidential candidates, Slatcher and colleagues (2007, p. 64) note: “Although the final drafts of verbal texts yield useful knowledge about a person, more accurate indicators of people’s individual differences are spontaneous speech samples across varied social contexts. Among politicians, examples of available speech samples include press conferences, public interviews, and debates.” Thus, it is argued that as long as the text is produced “spontaneously” it can be analyzed for speaker characteristics, regardless of whether it is a debate text, an interview, a press conference or something else.
Work in political science, on the other hand, has argued that it is preferable to use similar text sources, because otherwise model output may be biased (Gemenis, 2013). The reason for this is that specific words may be more common in one text source than another, which may depend on the intended audience or the (political) process of how the text came about (i.e., the “data-generating process”). For example, De Lange and Van Erkel (2013) compared election manifestos of parties and subsequent coalition agreements to study which parties were most influential during the coalition formation process. Using Wordscores to scale these texts on an underlying dimension, their results showed that coalition agreements were estimated to be more extreme than the positions of all parties involved. As these authors argue, it is highly unlikely that coalition parties would settle on such an extreme coalition agreement. A more plausible explanation is that language use varies between coalition agreements and election manifestos, which leads the Wordscores procedure – rather than picking up on ideological differences – to simply distinguish between election manifestos and coalition agreements. This example is not unique. Biases may even emerge if texts serve similar purposes but the way they came about is different. For example, political parties use election manifestos for various purposes: sometimes they are the result of an extended process of negotiation between several party factions, while in other cases the election manifesto is the expression of a powerful party leader. These variations in party organization can lead to differences in language use in manifestos as well.
Turning back to the example of spontaneity as a criterion for analyzing text for speaker characteristics, it is informative to note that in their well-cited analysis of linguistic style of U.S. presidential and vice presidential candidates, Slatcher and colleagues (2007, p. 69) report considerable aggregate differences between text sources – which they all consider to be spontaneous – among speakers on three out of six linguistic measures in the 271 texts under study: “Candidates used language more like that of a depressed person in interviews compared to press conferences (d = .92) and town hall meetings (d = 1.19); they used language more like that of an older person in press conferences compared to interviews (d = .84) and debates (d = 1.30) and in town hall meetings compared to interviews (d = .56) and debates (d = 1.12); their language was less presidential in interviews compared to press conferences (d = 1.01) and town hall meetings (d = .83).” That is, among the same set of speakers (George W. Bush, John Kerry, Dick Cheney and John Edwards) aggregate language use in press conferences, town hall meetings and interviews varies on multiple dimensions, which – from the perspective of political scientists – makes sense since these text sources are targeted at different audiences. Additional analysis of their data (see Figure 1) confirms that these patterns also exist within individual speakers.ii In a comparison of text sources for which we have at least 10 observations per speaker, it appears that George W. Bush scores significantly lower on honesty, aging and presidentiality during town hall meetings than during press conferences, and significantly higher on cognitive complexity. John Kerry, on the other hand, speaks more like an older, depressed, and less presidential person in his press conferences than in his network interviews. Substantive conclusions about their language use would thus depend on the type of text source used. This indicates that individual differences and political context together impact language use.iii
Figure 1
Linguistic style of George W. Bush and John Kerry on six linguistic style dimensions.
Note. This figure displays standardized LIWC scores for George W. Bush and John Kerry on six linguistic style dimensions (aging, complexity, depression, honesty, presidentiality, femininity) for various text sources for which we have at least 10 observations: network interviews (Kerry: n = 44), press conferences (Bush: n = 57; Kerry: n = 21), and town hall meetings (Bush: n = 38) (for more information, see Slatcher et al., 2007).
The preceding discussion is not intended to disqualify these published results but rather to highlight that both political scientists and political psychologists will need to put in careful work when constructing a corpus. We propose that, if analysts have reason to believe that text sources are systematically different from each other, they account for these differences in their models. For example, political scientists have developed the structural topic model (Roberts et al., 2014), which allows meta data (such as author, type of audience, occasion, etc.) to influence model results. In sum, when constructing a corpus, we propose analysts use similar text sources to the extent possible. When a corpus consists of multiple text sources, analysts should account for this in their models as meta data. We will turn to a particular application of using meta data next.
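To make this advice concrete, the sketch below shows one way that text-source meta data could enter a structural topic model using the stm package in R. It is a minimal, illustrative sketch rather than the analysis of any study cited here; the data frame `docs` with columns `text`, `source` and `speaker`, and the choice of K = 20 topics, are hypothetical.

```r
# Minimal sketch: letting text-source meta data shape topic prevalence in a
# structural topic model. `docs` is a hypothetical data frame with columns
# `text`, `source` (e.g., "interview", "press_conference") and `speaker`.
library(stm)

processed <- textProcessor(docs$text, metadata = docs)
prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Topic prevalence is allowed to vary with the text source and the speaker,
# so that systematic differences between sources are modeled rather than ignored
fit <- stm(documents = prepped$documents,
           vocab = prepped$vocab,
           K = 20,
           prevalence = ~ source + speaker,
           data = prepped$meta)

# Estimate how much each topic's prevalence shifts across text sources
effects <- estimateEffect(1:20 ~ source, fit, metadata = prepped$meta)
summary(effects)
```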
Authorship as Meta Data
Analysts interested in psychological constructs like personality may use politicians’ speeches to measure such constructs “at-a-distance”. This approach opens up many opportunities. For one, a direct approach of interviewing political elites is not always feasible since survey response rates among this group are generally low (Dietrich, Lasley, Mondak, Remmel, & Turner, 2012) and – importantly for those researchers interested in historical data – interviewing is limited to those politicians who are still alive. The beauty of text analysis is that – once a text is archived – it can be studied, no matter what the time span. But of course, analysts face hurdles as well; there is no free lunch. Importantly, it is likely that not the politician but a speech writer wrote the text. Yet the impact this issue has on our ability to learn about politicians’ characteristics from these texts is far from clear. For example, a comparison between private recordings of John F. Kennedy and his public speeches revealed no differences in leadership assessment (Renshon, 2009). Dille (2000), on the other hand, found small but important differences in leadership style assessment between spontaneous and prepared remarks for George H. W. Bush and Ronald Reagan. However, when either president was involved in drafting the speech, these differences disappeared.
To understand speech writers’ involvement, we should understand their role conception and incentives. To this end we consulted guidebooks on becoming a speech writer and worked our way through several speech writer biographies. In terms of advice, a clear lesson we learned is that the speech should be an authentic and recognizable reflection of the “best possible version” of the speaker (Collins, 2012, p. 11). What is this best possible version? According to Collins (2012, p. 5), a speech performance is an artificial moment and your essential character will need to be drawn in “primary colours, sometimes in lurid colours, to make sure it is visible from the distant point in the audience”. Peggy Noonan, speech writer for Ronald Reagan, notes that “you have to find their sound” (Noonan, 1998, p. 101). “The way people speak usually reflects how they think. And so you must listen closely, not only so that the work you do sounds like them, but so it sounds like them thinking” (Noonan, 1998, p. 101). Following this advice, speech writers should write a speech in such a way that the personality of the speaker is visible to the audience. As a result, personality as it appears in speeches may be slightly exaggerated, but probably not by much.
But do actual speech writers stick to this advice? It depends. Obama’s speech writers – Obama referred to his speech writer Jon Favreau as his “mind reader” – had a lot of material to work with, but other speech writers were less fortunate. Jimmy Carter, for example, delivered few speeches about national issues and never met with his speech writers. This makes speech writing more difficult (Noonan, 1998), and perhaps in such cases personality assessment from speeches may be off. Consider the example of Barton Swaim (Swaim, 2015), a speech writer for former South Carolina Governor Mark Sanford. In contrast to the Obama-Favreau tandem, the working relationship between Swaim and Sanford was awful. Swaim had multiple speeches and op-eds sent back to him. He recalls: “It was then that he [Nat, Sanford’s chief of staff] told me that everyone who worked for this governor had one goal. It wasn’t to please him with your superior work, because that would never happen. The goal was to take away any reason he might have to bitch at you. It was then too that Nat explained that my job wasn’t to write well; it was to write like the governor. I wasn’t hired to come up with brilliant phrases. I was hired to write what the governor would have written if he had had the time” (Swaim, 2015, p. 9). The Swaim anecdote confirms that career incentives push speech writers to write like their clients, not to write what they themselves like. For Swaim the job became so awful that he commented: “Sometimes I felt no more attachment to the words I was writing than a dog has to its vomit” (Swaim, 2015, p. 6). For analysts, on the other hand, it may strengthen confidence that one can learn about leader characteristics even when the text is written by a speech writer.
Based on our reading of these speech writer guidebooks, we are optimistic about the possibility of learning about leader characteristics from speech writer text. That being said, we encourage analysts to look into how the speech was produced. Does the politician have a speech writer at all? How many? Are they a long-running tandem or does the politician change speech writers often? And how is the politician involved in drafting the text? The analyst could also turn the search around and use supervised methods to predict whether politicians or speech writers wrote a text (see Airoldi, Fienberg, & Skinner, 2007, for an analysis of co-authorship of Ronald Reagan’s radio addresses). If such a classifier cannot reliably tell the two apart, this may serve as evidence that speech writer text can indeed be used for learning about psychological constructs of the politician.iv These are all pieces of information that are knowable but rarely considered in either political psychology or political science research. Just as with our previous discussion of sampling from different text sources, we consider authorship patterns a form of meta data which can tell us something about how a text came about. Rather than discarding such texts altogether, their meta data should be included when the analyst builds a corpus.
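As an illustration of this authorship check, the sketch below trains a simple Naive Bayes classifier to separate politician-written from speech-writer-written texts and evaluates it on held-out documents. It is a minimal sketch under stated assumptions: the data frame `drafts` with columns `text` and `author`, and the 80/20 split, are hypothetical, and Airoldi and colleagues (2007) use a more sophisticated model than the one shown here.

```r
# Minimal sketch: can word use alone distinguish politician-written text from
# speech writer text? Held-out accuracy near chance would suggest the two are
# hard to tell apart. `drafts` is a hypothetical data frame with columns
# `text` and `author` ("politician" or "speechwriter").
library(quanteda)
library(quanteda.textmodels)

corp  <- corpus(drafts, text_field = "text")
dfmat <- dfm(tokens(corp, remove_punct = TRUE))

# split documents into a training set and a held-out test set
set.seed(42)
train_ids <- sample(ndoc(dfmat), size = floor(0.8 * ndoc(dfmat)))
dfm_train <- dfmat[train_ids, ]
dfm_test  <- dfm_match(dfmat[-train_ids, ], features = featnames(dfm_train))

nb   <- textmodel_nb(dfm_train, y = docvars(dfm_train, "author"))
pred <- predict(nb, newdata = dfm_test)

# held-out accuracy; values close to 0.5 indicate the classifier struggles
mean(pred == docvars(dfm_test, "author"))
```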
Preprocessing Text
When analyzing political language, it is common to “preprocess” text in order to simplify the inputs to an analysis without altering its substantive conclusions. Common preprocessing steps include, for example, the removal of numbers, punctuation and stop words, or word stemming. These preprocessing steps are typically presented as innocuous procedures, but in fact they may have non-trivial substantive consequences. For example, Denny and Spirling (2018) show how substantive conclusions from scaling methods and topic models may crucially depend on seemingly arbitrary decisions during preprocessing. To address this issue, Denny and Spirling (2018) propose that analysts first collect results from text analysis models under various combinations of preprocessing steps. In a second step, the analyst evaluates whether model results are robust to particular combinations of preprocessing steps. If model results are not sensitive to the applied preprocessing procedure, this increases confidence in their robustness. If model results vary with particular preprocessing steps, the analyst will need to report these dependencies.v This is an important step forward in establishing robust results from unsupervised text as data models. However, it does not provide an explanation for why and when the results of unsupervised models depend on specific preprocessing steps. In this section, we argue that correlations between ideology, personality differences and linguistic habits could explain such patterns, and we present evidence to that effect.
Work in psychology and linguistics reports that linguistic habits and the use of function words correlate with personality characteristics and policy positions (Pennebaker, 2011).vi For example, introverted speakers prefer a rich vocabulary (Oberlander & Gill, 2006), use more negations (e.g., Pennebaker & King, 1999) and fewer expressions and connectives (Oberlander & Gill, 2006). Neurotic speakers tend not to use a rich vocabulary (Oberlander & Gill, 2006), are more likely to use the first-person singular and more likely to use words associated with negative emotions (Pennebaker & King, 1999). People high on openness to experience use more tentative words, such as ‘maybe’ or ‘perhaps’, and they use longer words (Pennebaker & King, 1999). People who score low on conscientiousness also use negations more often and are more likely to use negative emotion words (Pennebaker & King, 1999). At the same time, there is an extensive literature documenting correlations between personality and ideology among citizens (e.g., Bakker, 2017) and political elites (e.g., Caprara, Francescato, Mebane, Sorace, & Vecchione, 2010; Dietrich et al., 2012).
Given these correlations between personality characteristics, ideology and linguistic habits, common preprocessing steps may not be “ideologically neutral”. That is, they may affect some speakers more than others, leading to unreliable estimates from subsequent unsupervised models. This may in part depend on the amount of text data under study, with the impact of preprocessing steps likely to be larger in smaller corpora.vii We empirically evaluate both possibilities using the EUSpeech dataset (Schumacher, Schoonvelde, Dahiya, & De Vries, 2016; Schumacher, Schoonvelde, Traber, Dahiya, & De Vries, 2016). EUSpeech consists of all publicly available speeches from the main European institutions plus the IMF and the speeches of prime ministers – or president in the case of France – of 10 EU countries for the period after 1 January 2007: Czech Republic, France, Germany, Greece, Netherlands, Italy, Spain, United Kingdom, Poland and Portugal. From this dataset we select the English speeches longer than 200 words from all the group leaders in the European Parliament as well as heads of government (n = 3,301). For each speech we count all stop words and divide that number by the total number of words in that speech to obtain a proportion (the mean proportion of stop words across all speeches = 0.54). For this we used a stop word list from Quanteda (Benoit & Nulty, 2016) containing 174 words.viii We collect similar statistics for three other common preprocessing steps in a typical text as data project: stemming, the use of numbers, and punctuation (see Denny & Spirling, 2018; Grimmer & Stewart, 2013; Wilkerson & Casas, 2017). Stemming concerns the algorithmic conversion of inflected forms of words into their root forms (for example, stemming the words “fish”, “fishing” and “fishes” converts all of them to “fish”). We use the Porter stemmer to obtain the number of unique tokens in a stemmed speech, which we divide by the number of unique tokens in the same, unstemmed speech. The lower this proportion, the higher the impact of stemming (mean proportion of stemmed tokens across all speeches = 0.92). We also calculate the proportion of numbers relative to the total number of tokens in each unstemmed speech (mean proportion of numbers = 0.006). Furthermore, we calculate the number of punctuation tokens as a proportion of the total number of tokens (mean proportion of punctuation characters = 0.10). Punctuation tokens are periods, colons, semicolons, and so on.
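These per-speech statistics are straightforward to compute. The sketch below is a minimal version using quanteda; the character vector `speeches` is hypothetical, the built-in English stop word list stands in for the 174-word list used here, quanteda's default English (Snowball) stemmer stands in for the Porter stemmer, and the exact token definitions may differ slightly from ours.

```r
# Minimal sketch of the per-speech preprocessing statistics described above.
# `speeches` is a hypothetical character vector with one element per speech.
library(quanteda)

toks <- tokens(speeches)   # baseline tokens, punctuation and numbers retained

# proportion of stop words per speech
stop_prop <- ntoken(tokens_keep(toks, stopwords("en"))) / ntoken(toks)

# impact of stemming: unique stems relative to unique unstemmed tokens
# (lower values mean stemming collapses more of the vocabulary)
stem_prop <- ntype(tokens_wordstem(toks)) / ntype(toks)

# proportions of number tokens and punctuation tokens, measured as the share
# of tokens that disappears when each is removed
num_prop   <- 1 - ntoken(tokens(speeches, remove_numbers = TRUE)) / ntoken(toks)
punct_prop <- 1 - ntoken(tokens(speeches, remove_punct = TRUE)) / ntoken(toks)
```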
To assess the degree to which these preprocessing steps correlate with political ideology, we collect two ideology measures from the Comparative Manifesto Project database, which has systematically coded election manifestos into specific topics or positions on topics (Volkens, Lehmann, Matthieß, Merz, Regel, & Werner, 2016). We use the cultural progressive-conservative itemsix and the economic left-right itemsx from this database to calculate a progressive-conservative position and a left-right position for each speaker. Because the Manifesto Project contains data for each election, we use the score from the party’s most recent election manifesto as the speaker’s position.
We aggregate the proportion of numbers, punctuation, stop words and stemming for all 31 speakers in the corpus. Figure 2 shows the bivariate relationship between each of the four preprocessing measures and left-right and progressive-conservative ideology, respectively, using a Loess curve. These scatter plots reveal interesting patterns. For example, it appears that moving from the left to the center of the left-right dimension is positively correlated with the use of numbers: speakers in the center use on average more numbers than speakers on the left. Furthermore, moving from the progressive end to the center of the progressive-conservative dimension relates to an increase in the use of numbers as well. When comparing progressive and conservative speakers, we observe first a decrease and then a slight increase in the use of punctuation. We also find some variation across left-right and progressive-conservative speakers when it comes to using unique words: politicians on the extremes of these ideological scales tend to use slightly more unique word stems than politicians located in the ideological center (as evidenced by a lower proportion, and thus a higher impact of stemming, for the latter). Tables B1 (stop words), B2 (numbers), B3 (stemming) and B4 (punctuation) contain OLS regression results modeling preprocessing scores at the speech level as a function of left-right ideology, progressive-conservative ideology and their interaction, with fixed effects for countries (thus controlling for language-specific differences). For each of the four preprocessing steps we find evidence that they are related to the left-right and progressive-conservative scores of the speakers. Taking the proportion of punctuation as an example, the model estimates that Gabriele Zimmer and Lothar Bisky, the left-most speakers in the corpus (left-right score of -3.2), use about 10.5 punctuation characters per 100 words, whereas Nigel Farage (left-right score of +0.8) uses about 12 punctuation characters per 100 words (and thus, on average, shorter sentences).
Figure 2
Average use of numbers, punctuation, and stop words, and impact of stemming, by politicians in the European Union, sorted by ideological left-right and progressive-conservative positions.
Note. This figure displays the average scores on each of the four preprocessing dimensions (numbers, punctuation, stemming and stop words) for EU politicians (heads of government and group leaders in the European Parliament), sorted by left-right and progressive-conservative ideology.
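The speech-level regressions reported in Tables B1 through B4 take roughly the following form. This is a minimal sketch; the data frame `speech_data`, with one row per speech and the variable names used below, is hypothetical.

```r
# Minimal sketch of the speech-level OLS regressions: a preprocessing score as
# a function of ideology, its interaction, and country fixed effects.
# `speech_data` is a hypothetical data frame with columns punct_prop,
# left_right, prog_cons and country.
m_punct <- lm(punct_prop ~ left_right * prog_cons + factor(country),
              data = speech_data)
summary(m_punct)

# The same specification is repeated with stop_prop, num_prop and stem_prop
# as the dependent variable (Tables B1-B4).
```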
These results serve as evidence that these preprocessing steps are indeed not “ideologically neutral”, but the question remains what implications this has for unsupervised models like Wordfish in a large corpus like EUSpeech. In order to explore this question, we fitted a Wordfish model on our corpus five times: once with stop words removed; once with stemming; once with numbers removed; once with punctuation removed; and once without any of these preprocessing steps applied. We excluded words that appeared among fewer than 10 speakers. The results are in Figure 3, which displays near-perfect positive Spearman’s rank correlations between Wordfish positions estimated without any preprocessing steps and Wordfish positions estimated with each one of the four preprocessing steps, respectively (ρ = 0.98 for removing stop words and ρ = 0.99 for the other three preprocessing steps). The likely reason is that the change in the number of features following each preprocessing step is modest compared to the size of the total corpus: for example, removing stop words removes only 139 features, a number that is swamped by the total number of features (5,564 and 5,425 respectively).xi The same goes for the other preprocessing steps: their impact on the corpus on which the Wordfish estimates are based is negligible when compared against the total number of features: when removing numbers, the total number of features is 5,463; when removing punctuation, the total number of features is 5,545; when stemming the corpus, the total number of features decreases considerably more, to 3,840, but this has virtually no impact on Wordfish positions either.
Figure 3
Wordfish estimates based on speeches by politicians in the European Union, with and without four preprocessing steps applied to the corpus.
Note. This figure displays Wordfish estimates based on speeches by politicians in the European Union, with and without four common preprocessing steps applied to the corpus. All four preprocessing steps (removal of numbers, punctuation, stop words, and stemming) appear not to have an influence on the estimated Wordfish positions.
From this we conclude that the amount of data (3,301 speeches from 31 speakers) overrides the potentially detrimental impact of preprocessing on subsequent scaling results.
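A compact version of this robustness check can be sketched as follows. It is an illustrative sketch rather than our exact estimation code; the named character vector `speaker_texts` (one concatenated text per speaker) is hypothetical, and textmodel_wordfish lives in the quanteda.textmodels package.

```r
# Minimal sketch: fit Wordfish with and without a single preprocessing step
# and compare the resulting rank orders of speaker positions.
# `speaker_texts` is a hypothetical named character vector, one text per speaker.
library(quanteda)
library(quanteda.textmodels)

make_dfm <- function(texts, remove_stop = FALSE, stem = FALSE,
                     remove_num = FALSE, remove_punct = FALSE) {
  toks <- tokens(texts, remove_numbers = remove_num, remove_punct = remove_punct)
  if (remove_stop) toks <- tokens_remove(toks, stopwords("en"))
  if (stem)        toks <- tokens_wordstem(toks)
  dfm_trim(dfm(toks), min_docfreq = 10)   # keep features used by 10+ speakers
}

theta <- function(dfmat) as.numeric(textmodel_wordfish(dfmat)$theta)

base <- theta(make_dfm(speaker_texts))
variants <- list(
  stopwords   = theta(make_dfm(speaker_texts, remove_stop = TRUE)),
  stemming    = theta(make_dfm(speaker_texts, stem = TRUE)),
  numbers     = theta(make_dfm(speaker_texts, remove_num = TRUE)),
  punctuation = theta(make_dfm(speaker_texts, remove_punct = TRUE))
)

# Spearman rank correlations between the baseline and each variant; the sign
# of the Wordfish dimension is arbitrary, so the absolute value is what matters
sapply(variants, function(v) abs(cor(base, v, method = "spearman")))
```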
Political scientists have recently been alerted to the dangers of preprocessing for unsupervised models (Denny & Spirling, 2018). We showed some evidence that seemingly arbitrary preprocessing steps (such as taking out stop words, numbers and punctuation, as well as stemming) correlate with stable characteristics like left-right ideology and progressive-conservative ideology (see Figure 2). Denny and Spirling (2018) demonstrate that these preprocessing steps can also produce substantively different results. Using a much larger corpus, we do not find that preprocessing steps influence estimated Wordfish positions (see Figure 3). The likely reason for this is that these preprocessing steps have a very small impact on the total number of features in this large corpus. In terms of concrete advice, we propose that when applying preprocessing steps to a small corpus, researchers are well-advised to consider the lessons from Denny and Spirling (2018) by using preText and averaging results across different model specifications. We also note that in a larger corpus, preprocessing steps may be less influential. We think that, moving forward, work in political psychology on stable language patterns can inform a theory of exactly when and why preprocessing matters.
Analyzing Text
Can multifaceted concepts such as policy positions, topics, sentiment, complexity and personality be extracted from text? We think they can. Yet the problem is that we typically extract all of these concepts simultaneously, while intending to extract just one. In other words, catching one construct may come with by-catch of another construct. This has implications for the substantive interpretation of the construct under study. This section illustrates this issue further, through a conceptual and an empirical example.
Let us start with a conceptual illustration of construct by-catch, using topic models and scaling models as an example. Scaling models use word co-occurrences between texts to place them on a single policy dimension. The more words co-occur between two texts, the closer they are placed on this dimension. This approach assumes that this dimension is what drives the dissimilarities between texts (Grimmer & Stewart, 2013), which requires that politicians talk differently about similar topics. For example, in terms of word overlap the sentence “we will raise unemployment benefits” is very similar to “we will not raise unemployment benefits” and much more dissimilar to the ideologically similar sentence “levels of unemployment assistance should be increased”. For scaling to work, the assumption is that a right-wing politician does not say “we will not raise unemployment benefits”, but instead says “handouts to the poor should be slashed”. Topic models, on the other hand, use word co-occurrences to classify text into one or more issues or topics. This builds on the assumption that different actors use identical words when talking about a topic. Taking the example of unemployment benefits, a topic model would place these sentences in separate topics, one characterized by words like “unemployment” and “benefits”, and the other by “handouts” and “poor”. The drivers of politicians’ language use are thus of crucial importance. If politicians only emphasize issues on which they are perceived as strong (Budge & Farlie, 1983), then political texts vary in their language as a result of different parties talking about different topics. However, when parties engage with each other on issues (e.g., Green-Pedersen & Mortensen, 2015), scaling only works if these parties use language on similar topics that is different enough. In practice, parties sometimes engage with and sometimes avoid issues (Green-Pedersen & Mortensen, 2015). If the scaling procedure finds dissimilarities between parties, it is either because they talk about different issues or because they take different positions on the same issues. Only in the latter scenario do scaling methods help distinguish policy positions. In the former scenario, scaling methods would distinguish between different topics instead. Thus, meaningful interpretation of scaling models depends on potential by-catch of topics.
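The word-overlap logic behind this example can be made tangible with a toy computation. The sketch below, using quanteda, simply builds a document-feature matrix for the three sentences and computes their cosine similarities; it illustrates the general point and is not part of our analysis.

```r
# Toy illustration: overlap-based similarity treats the negated sentence as
# closest to the original, and the ideologically similar paraphrase as most
# distant, because they share almost no words.
library(quanteda)
library(quanteda.textstats)

sents <- c(a = "we will raise unemployment benefits",
           b = "we will not raise unemployment benefits",
           c = "levels of unemployment assistance should be increased")

dfmat <- dfm(tokens(sents))
textstat_simil(dfmat, method = "cosine")
```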
As a further illustration of the issue of construct by-catch, we provide an example involving scaling and sentiment. We take speeches longer than 200 words (again from the EUSpeech data) that were originally delivered in English by prime ministers from nine EU member states and party leaders in the European Parliament (n = 3,301). Using Quanteda (Benoit & Nulty, 2016), we fit a Wordfish scaling model on these speeches and collect the estimated speech positions. We also collect for each speech the percentage of negative and positive words using the Lexicoder Sentiment Dictionary (Young & Soroka, 2012), by taking the number of sentiment word occurrences and dividing by the total number of words in that speech. We then aggregate both positions and sentiment word proportions across the 31 speakers.
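A minimal sketch of this combination of scaling and sentiment, using quanteda and the Lexicoder Sentiment Dictionary bundled with it, could look as follows. The named character vector `speaker_texts` (one concatenated text per speaker) is hypothetical, and the sketch aggregates at the speaker level directly rather than at the speech level first.

```r
# Minimal sketch: Wordfish positions alongside Lexicoder sentiment proportions.
# `speaker_texts` is a hypothetical named character vector, one text per speaker.
library(quanteda)
library(quanteda.textmodels)

toks  <- tokens(speaker_texts)
dfmat <- dfm_trim(dfm(toks), min_docfreq = 10)

wf <- textmodel_wordfish(dfmat)   # estimated positions in wf$theta

# proportion of positive and negative words per speaker, using the
# Lexicoder Sentiment Dictionary shipped with quanteda
lsd <- convert(dfm_lookup(dfm(toks), dictionary = data_dictionary_LSD2015),
               to = "data.frame")
pos_prop <- lsd$positive / ntoken(toks)
neg_prop <- lsd$negative / ntoken(toks)

# if scaling captured position alone, these correlations should be near zero
cor(wf$theta, pos_prop)
cor(wf$theta, neg_prop)
```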
If scaling captured position alone, there should be no relationship between Wordfish positions and the use of sentiment words. Yet this is not what we observe (see Figure 4). Instead, speakers on one end of the underlying dimension use more positive words than speakers on the other end (r = -0.34). For example, about 10 percent of the words used by British Prime Minister David Cameron are positive, whereas on the other side of the Wordfish dimension, about 8 percent of the words used by Greek Prime Minister Papademos are positive.xii The relationship between the use of negative emotion words and Wordfish positions is less pronounced (r = 0.14).xiii
Figure 4
Use of sentiment and estimated Wordfish positions of speeches by politicians in the European Union.
Note. These scatterplots denote the average use of negative sentiment (left) and average use of positive sentiment (right) over the range of the estimated average Wordfish position of heads of government and MEP group leaders. It shows that Wordfish scores and the use of positive sentiment are correlated with each other.
The examples in this section show that the words on which co-occurrence models like Wordfish are based may not have anything to do with policy positions, other than being correlated with them. This makes it difficult to substantively interpret positions on the dimension that Wordfish estimates. For example, the conclusion that “speaker A is more left-wing than speaker B” may be based on the fact that speaker A uses more sentiment words than speaker B. This may reflect real policy differences – speaker A is more sentimental about the topic – but it may also reflect real personality differences – speaker A is more emotional than speaker B. In any case, sentiment and position become blurred and we do not know which conclusion is justified. The underlying issue is one of construct by-catch. To distinguish between different constructs, the analyst could apply a predictive validity criterion: if estimated policy positions are known to correlate with the use of emotion words, the analyst will need to account for these emotion words in subsequent analyses. If this alters the results, this should also alter the substantive conclusions. We thus encourage researchers – both in political psychology and political science – to be aware of construct by-catch and to check their results against other possible explanations. A popular way to cross-validate results is to use human coders and training data. This, however, is not a feasible option for many researchers, especially for those working outside the English language context. Instead, we propose that researchers cross-validate their findings using other tools from the automated text analysis toolkit. In our example, we combined scaling and sentiment analyses. Other options are to use different dictionary methods or topic modeling.
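One simple way to operationalize this check is sketched below: regress the estimated positions on the sentiment proportions and ask whether the ordering of speakers changes once sentiment word use is partialled out. The sketch reuses the hypothetical objects from the previous example (wf, pos_prop, neg_prop) and is meant as an illustration of the logic, not as a definitive correction.

```r
# Minimal sketch of the suggested predictive-validity check: does the speaker
# ordering survive once sentiment word use is accounted for?
chk <- lm(wf$theta ~ pos_prop + neg_prop)
summary(chk)   # how much of the variation in positions does sentiment absorb?

# positions with sentiment partialled out; a large change in the rank order
# relative to wf$theta signals construct by-catch
theta_resid <- residuals(chk)
cor(rank(wf$theta), rank(theta_resid))
```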
Conclusion
Text is not a silver bullet for learning about politics and psychology. We have highlighted a number of issues to consider for each text as data project around four steps in the research process: (i) sampling text; (ii) authorship as meta data; (iii) preprocessing text; and (iv) analyzing text.
Our discussion of these issues has reflected our optimism about the possibilities of text as a data source in both political psychology and political science, and we highlighted a set of guidelines which we summarize in Table 1.
Table 1
Guidelines for Text as Data Projects in Political Psychology and Political Science
1. When collecting a corpus, use similar text sources to the extent possible. When multiple text sources are the only option, account for this in the analysis.
2. Get to know your data. How did a text come about? Who was involved? Incorporate this information in the analysis.
3. Consider in what ways preprocessing steps can correlate with stable speaker characteristics. Average results across preprocessing steps, particularly when working with a small corpus.
4. Use multiple methods to evaluate the possibility of construct by-catch when analyzing text.
We would like to conclude with a few observations. First, our disciplines (political psychology and political science) have come to expect much from the quality of, say, survey or experimental data, and it would be good to apply that same rigorous standard to text. We encourage analysts in psychology and political science to be mindful of the quality of the texts they use, with an eye towards the construct under study, and we made some suggestions to that end. Second, our ability to extract those constructs will require us to think about data theory (Jacoby, 1991), research design and preprocessing steps. For example, preprocessing steps may have substantive implications when text is mined for constructs like personality (which research in political psychology has shown to be reflected in stable language patterns). Third, we want to emphasize the importance of further theory-building and concept development. As it currently stands, the literature converges on the use of existing “gold standards” like the Linguistic Inquiry and Word Count (LIWC) dictionary (Pennebaker et al., 2015) or the Affective Norms for English Words (ANEW). Although these measurement instruments are highly valuable, analysts should keep questioning them – and building alternatives – to avoid running the risk of depending too much on them.
In their early review paper, Grimmer and Stewart (2013) urged researchers using text as data to “validate, validate, validate” the outputs of their models. We believe that an important way of doing so for researchers working on text as data projects is by integrating different perspectives in their work. For example, political scientists could learn from political psychologists about how individual characteristics are reflected in stable language patterns among politicians, whereas political psychologists could learn from political scientists how the political context (e.g., the dynamics of a political campaign or the intended audience of a speech) pressures these politicians into changing their language use. The promise that text as data holds for political psychology and political science will be bolstered by more cross-fertilization – theoretical and empirical – between the two disciplines.