Validating Automated Integrative Complexity: Natural Language Processing and the Donald Trump Test

Computer algorithms that analyze language (natural language processing systems) have seen a great increase in usage recently. While use of these systems to score key constructs in social and political psychology has many advantages, it is also dangerous if we do not fully evaluate the validity of these systems. In the present article, we evaluate a natural language processing system for one particular construct that has implications for solving key societal issues: Integrative complexity. We first review the growing body of evidence for the validity of the Automated Integrative Complexity (AutoIC) method for computer-scoring integrative complexity. We then provide five new validity tests: AutoIC successfully distinguished fourteen classic philosophic works from a large sample of both lay populations and political leaders (Test 1) and further distinguished classic philosophic works from the rhetoric of Donald Trump at higher rates than an alternative system (Test 2). Additionally, AutoIC successfully replicated key findings from the hand-scored IC literature on smoking cessation (Test 3), U.S. Presidents’ State of the Union Speeches (Test 4), and the ideology-complexity relationship (Test 5). Taken in total, this large body of evidence not only suggests that AutoIC is a valid system for scoring integrative complexity, but it also reveals important theory-building insights into key issues at the intersection of social and political psychology (health, leadership, and ideology). We close by discussing the broader contributions of the present validity tests to our understanding of issues vital to natural language processing.

Computerized text analysis methods, often referred to as natural language processing methods, allow researchers to gain new insights by scoring massive amounts of material at levels heretofore unheard of.
However, as the authors also note (Schoonvelde et al., 2019), there is a danger as well: It is possible that we will move ahead with seductively easy-to-score measurements without proper scientific discussion about what they mean, or whether they are even measuring what they claim to measure. As a result, ongoing validation of any natural language processing system is vital to the health of the field (see Schoonvelde et al., 2019, for an excellent discussion).
To that end, we evaluate evidence pertaining to the validity of a natural language processing system designed to measure a construct with a long and storied history at the intersection of social and political psychology: Integrative complexity.

Human-Scored Integrative Complexity
Designed in its current instantiation by Peter Suedfeld's lab (e.g., Suedfeld et al., 1977), integrative complexity is the measurement of the degree that spoken or written materials have differentiation (the recognition of different distinct dimensions) and integration (the subsequent recognition of interrelations among differentiated dimensions).
At a conceptual level, lower integrative complexity scores indicate rigid, black-and-white communication; higher integrative complexity scores reflect more multi-dimensional language. i Human-scored integrative complexity has proven vitally important in understanding behavior at the intersection of social and political psychology. For example, integrative complexity has been directly tied to the reduction of social problems such as war (Suedfeld & Jhangiani, 2009; Suedfeld, Tetlock, & Ramirez, 1977; Tetlock, 1985; see Conway et al., 2001; Conway et al., 2018, for reviews), terrorism (Andrews Fearon & Boyd-MacMillan, 2016; Houck et al., 2018), and poor health. It has further been tied to constructs that are directly related to solving societal problems, such as political ideology (Conway et al., 2016a; Conway et al., 2016b; Houck & Conway, 2019; Suedfeld, 2010; Tetlock, 1983, 1984) and world leaders' success in gaining and keeping power (Conway et al., 2012; Suedfeld & Rank, 1976).

Automated Integrative Complexity
Due to the massive advantages of automated text scoring (see Boyd & Pennebaker, 2017), over the past decade there has been an increasing push to automate integrative complexity (Houck et al., 2014; Robertson et al., 2019; Tetlock et al., 2014; Young & Hermann, 2014). However, given integrative complexity's strong ties to key societal issues, it is vital that we continuously evaluate such systems.
Indeed, the need for further discussion on this topic is highlighted by the fact that, in the last five years, research articles have used the Linguistic Inquiry and Word Count's complexity/analytic thinking score as a direct measurement of integrative complexity (e.g., Brundidge et al., 2014; Vergani & Bliuc, 2018), in spite of the fact that validation studies show it correlates at only r = .14 with expert-scored integrative complexity. ii
The AutoIC system (Houck et al., 2014) scores differentiation and integration in the same hierarchical fashion as human-scored IC. Although its development was informed by both rudimentary correlational machine learning and expert human input, it was guided far more by human input than by machine learning.
Specifically, expert human integrative complexity scorers performed linguistic analysis of every word or phrase that might be associated with integratively complex or integratively simple language, using synonym trees to include as wide a range as possible. After creating an initial system, researchers trained the system on a set of data, using expert human integrative complexity scoring as the benchmark, and then prospectively tested it on an entirely new set of data.
The resulting system has over 3,500 complexity-relevant root words and phrases. Many of these words are lemmatized (e.g., "complex*" scores "complexity," "complexly," etc.) and thus the actual number of scored words/phrases is appreciably higher than the root number. AutoIC breaks documents down into equal-size paragraphs and thus, like human-scored IC, provides paragraph-level averages. The resulting AutoIC algorithm is probabilistic and hierarchical. (1) It is probabilistic because it has 13 separate dictionary categories that differentially assign points to particular words/phrases when they appear, depending on the probability that the word/phrase is associated with higher complexity. For example, the phrase "on the other hand" rarely appears without indicating differentiation, so when that phrase appears, 2 points are added (from the base score of 1). (2) It is hierarchical because words are parsed into those associated with integration and differentiation in a manner conceptually identical to human scoring. As a result, while it is possible for multiple words/phrases associated with differentiation to raise a score to three, no additional words from differentiation lists would increase the score beyond three. Instead, in a manner conceptually identical to human scoring, achieving an AutoIC score above three requires words from one of several integration lists (words/phrases like "proportional to" and "integrated with").
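As a rough illustration, the probabilistic and hierarchical logic described above can be sketched in a few lines of Python. The dictionary entries, point values, and cap below are illustrative stand-ins (the actual AutoIC system uses 13 weighted dictionaries with thousands of lemmatized entries), but the capping rule mirrors the hierarchy described in the text:

```python
# Hypothetical sketch of AutoIC-style scoring logic. Dictionary contents
# and point values are illustrative, NOT the actual AutoIC lists.

DIFFERENTIATION_PHRASES = {
    "on the other hand": 2.0,  # almost always signals differentiation
    "however": 0.5,            # weaker probabilistic cue (illustrative value)
}
INTEGRATION_PHRASES = {
    "proportional to": 2.0,
    "integrated with": 2.0,
}

BASE_SCORE = 1.0
DIFFERENTIATION_CAP = 3.0  # differentiation cues alone cannot push past 3
MAX_SCORE = 7.0

def score_paragraph(text: str) -> float:
    text = text.lower()
    score = BASE_SCORE
    # Differentiation cues raise the score, but only up to the cap.
    for phrase, points in DIFFERENTIATION_PHRASES.items():
        if phrase in text:
            score = min(score + points, DIFFERENTIATION_CAP)
    # Integration cues are required to move beyond the differentiation cap,
    # mirroring the hierarchical structure of human IC scoring.
    if score >= DIFFERENTIATION_CAP:
        for phrase, points in INTEGRATION_PHRASES.items():
            if phrase in text:
                score = min(score + points, MAX_SCORE)
    return score
```

In this sketch, a paragraph with only differentiation cues tops out at 3, and only the presence of an integration phrase can move it higher, which is the conceptual parallel to human IC scoring described above.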
In the original validity paper, AutoIC (1) showed higher correlations with expert human scorers than other attempts to automate the construct and (2) showed that both the differentiation and integration dictionaries contribute positively to the overall score. Further, (3) AutoIC replicated effects from human-scored IC in the Bush/Kerry debates, Obama/McCain debates, early Christian writings, and smoking/health domains. However, as acknowledged by the authors (Houck et al., 2014), automating integrative complexity was at that point still in a comparatively early stage of development, and therefore validation evidence for the AutoIC system was in its early stages as well. The purpose of the present paper is to summarize updated evidence for the AutoIC system, provide additional evidence for the system that pertains to key issues in social and political psychology, and discuss the specific contributions AutoIC has made to the social and political psychology literature.

Overlap With Expert Human Scorers
The most direct kind of validity is the degree to which an automated system overlaps with expert human scorers on the same material, particularly on a prospective test that was not used in automated "training" (Tetlock et al., 2014). In the original work (Houck et al., 2014), AutoIC's average correlation with human scorers at the document level was r = .82. At the paragraph level, the overall correlation was r = .46; for prospective tests only, the paragraph-level correlation was r = .41. Since the original paper, several additional studies have also correlated expert human scorers with AutoIC across religious documents, comparisons of fictional versus non-fictional characters (McCullough & Conway, 2018a), decision-making scenarios (Prinsloo, 2016), and health (Test 3, this paper). As can be seen in Table 1, the correlations with human scorers exceeded the tests from the original validity paper in every case.
Note. All effect sizes = r. Paragraph level = correlations paragraph-by-paragraph. Document level = correlations summarizing the same exact corpus of materials at the appropriate document/person level.

Theoretical Contributions of AutoIC
Additional validity tests involve the existence of theoretically-interpretable findings that were obtained using the measurement tool. While such findings are not direct evidence of the ability of the tool in question to measure complexity per se (because it is possible the findings may have emerged for factors unrelated to complexity), they are nonetheless important, as any system is in one sense only as good as the interpretable findings it has produced.
In the five years since Conway et al. (2014), evidence showing the descriptive usefulness of the system to understand theoretically-interpretable phenomena has grown. For example, terrorism is a vitally important research area at the intersection of social and political psychology, and yet terrorist groups are very difficult to study. Thus, at-a-distance methods such as integrative complexity have proven to be important in both understanding and preventing terrorism. AutoIC has recently added to our knowledge of terrorism in this regard. Extending prior work using hand-scored IC (Smith et al., 2008), research has shown that AutoIC is lower in a more extreme terrorist group than in a terrorist group using less extreme methods, and that extremity is tied to drops in complexity over time. Importantly, this work suggests that terrorist groups differ from other terrorist groups, such that more violent terrorist groups are lower in complexity. Other work on terrorists has revealed that peace-based dialogue sessions with convicted Indonesian terrorists increase terrorists' AutoIC in theoretically-expected ways (Putra et al., 2018). This work with AutoIC suggests that it is possible to use interventions to increase terrorists' integrative complexity in ways consistent with violence reduction (Putra et al., 2018). Taken together, this work on terrorism has not merely replicated what has come before. Rather, it has revealed new and important avenues for understanding that would not have existed without AutoIC.
AutoIC has similarly advanced our understanding of the individual stability of complex thinking (Conway & Woodard, 2019), the complexity of real versus fictional writings (McCullough & Conway, 2018a), educational interventions (Felts, 2017; Prinsloo, 2016; University of Montana Psychology Department, 2018), the popularity of movies (McCullough & Conway, 2018b), the rated quality of video game dialogue (McCullough, 2019a), the success of fan fiction (McCullough, 2020), critical response to horror films (McCullough, 2019b), and the complexity of Twitter (McCullough & Conway, 2019). Thus, AutoIC has begun to offer theoretical insights into multiple important psychological arenas.

Additional Validity Tests: Transitional Summary
In summary, evidence across dozens of studies reveals that (a) AutoIC is moderately correlated with human-scored IC across multiple contexts, and (b) AutoIC helps us understand theoretically-interpretable phenomena across varied domains. However, additional validity tests are needed (see Houck et al., 2014, for discussion). To fill this gap, we provide five new validity tests.
First, Houck et al. (2014) discuss the need for tests that compare groups or conditions on which complexity ought to differ. Such tests would intentionally not provide carefully controlled conditions attempting to isolate a key variable; rather, they would purposefully stack the proverbial deck, such that it is clear that one group should be higher than another group in complexity. Thus, if an automated system failed to distinguish between groups that ought to differ in complexity in this way, it would call its validity into question, in much the same way that Google's object recognition tool was called into question when it was unable to distinguish a cat from an avocado (Ross, 2019). In the present study, we provide two such tests (described in more detail below): tests that attempt to distinguish higher-complexity groups (classic philosophers) from lower-complexity groups (modern political rhetoric and layperson opinions).
A second valuable piece of validity evidence involves replications of previously-found hand-scored IC effects. In their original paper, Conway et al. (2014) discuss several attempts at such replications. Clearly, however, the area needs more work. In Validity Tests 3-5, we attempt to replicate some of the key findings from three published papers (Conway et al., 2017; Houck & Conway, 2019; Thoemmes & Conway, 2007).

Politicians and Lay Populations
We would expect classic philosophical works to be higher in complexity on average than political rhetoric or the opinions of lay persons. Classic philosophical works involve some of the greatest minds of all time (thus, those persons who ought to have the most ability to think complexly), with abundant time and cognitive resources (thus, conditions that ought to afford maximum resources to think complexly), working through complex problems with the goal of parsing them in a complex way for a highly intelligent audience (thus, high motivation and domain-specific likelihood of complex communication). As a result, we would expect that classic philosophy should be higher in complexity than modern political speeches designed largely for lay audiences, or the opinions of those lay audiences themselves. Although there are always exceptions, it is nonetheless the case that any complexity system that failed to consistently distinguish classic philosophy on average from these other forms of communication would be called into question as a complexity-measuring tool (see Houck et al., 2014; Tetlock et al., 2014). For Validity Test 1, we compare these works' AutoIC scores to modern political rhetoric and lay populations. In Validity Test 2, we do a more focused test comparing these works to one particular modern politician: Donald Trump.

Validity Test 1: Comparing Classic Philosophy to Modern Political Rhetoric and Layperson Opinions
We scored each of the above philosophy works for AutoIC in its entirety. Although for Validity Test 1 we use the philosophy work (N = 14) as the unit of analysis, this scoring of the classic philosophers entailed over 19,000 paragraphs and over 1.4 million words.
For comparison groups in Validity Test 1, we used two groups that we would expect to fall into the average-to-low range for IC: State of the Union (SOTU) speeches from U.S. Presidents and over 37,000 This I Believe essays from lay persons. iii Both form excellent comparison groups for the present purpose. We would expect SOTU speeches to be low-to-average in complexity, and indeed, when scored by expert human coders, SOTU speeches showed a fairly low mean score (mean IC = 1.78). This I Believe essays represent average opinions of typical lay people about what they believe, and thus we would expect them to be lower in complexity on average than classic, serious philosophical works.
AutoIC passed this validity test: The AutoIC score for philosophers (M = 2.60) was higher than that of both comparison groups.

Validity Test 2: Comparing Classic Philosophy to Donald Trump
Prior work suggests that Donald Trump's rhetoric is distinctively low in complexity (Conway & Zubrod, 2020). Given that, in general, we would expect classic philosophy to be higher in complexity than political rhetoric for the masses, we ought especially to expect classic philosophy to be higher in complexity than a politician for whom there are unique reasons to expect low complexity. Thus, this "Donald Trump" test clearly qualifies as a strong expected validity test as described by Houck et al. (2014).
For this additional test, we further compared AutoIC to a new method for scoring integrative complexity: V+POStags (Robertson et al., 2019). Admirably, V+POStags involves both an attempt to create a human-scored vocabulary of words associated with complexity and a machine learning approach focused on syntax (see Robertson et al., 2019). To create comparable scoring units, we divided each philosophical work into groupings (for works that did not have four groupings, we created as many as the material allowed).
Our primary question of interest is the degree to which AutoIC and V+POStags can each consistently distinguish Trump from classic philosophy. To accomplish this, we compared (separately for each system) each debate transcript score against each philosophical grouping score. This provided 153 separate comparisons to evaluate whether each system assigned a higher score to a famous philosophical work than to Donald Trump. When a comparison yielded a higher score for classic philosophy, it was counted as a successful trial; when it did not, it was counted as a failed trial. It is worth noting that, throughout, AutoIC and V+POStags scored the exact same materials.
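The tallying procedure described above amounts to a simple pairwise comparison count, which can be sketched as follows (the scores below are hypothetical placeholders, not the actual AutoIC or V+POStags values):

```python
# Sketch of the pairwise "success trial" tally described above. The real
# analysis paired each Trump debate transcript with each philosophical
# grouping (153 comparisons in all); scores here are made-up placeholders.

def count_successes(philosophy_scores, trump_scores):
    """Return (successes, trials): pairings where philosophy outscores Trump."""
    trials = 0
    successes = 0
    for p_score in philosophy_scores:
        for t_score in trump_scores:
            trials += 1
            if p_score > t_score:
                successes += 1
    return successes, trials

# Three hypothetical philosophical groupings vs. two hypothetical debates:
successes, trials = count_successes([2.6, 2.8, 2.4], [1.5, 1.7])
print(f"{successes}/{trials} successful trials")  # prints "6/6 successful trials"
```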
AutoIC assigned a higher complexity score to the philosophic work on all 153 trials (100% success). By contrast, V+POStags assigned a higher score to the philosophic work on only 33% (51 of 153) of the trials.

Discussion of Validity Tests 1 and 2
Validity tests that demonstrate expected differences between groups on a linguistic variable are vital forms of natural language processing validity, and yet no such validity tests previously existed for automated integrative complexity measurements. Validity Tests 1 and 2 help fill this gap. Across both tests, AutoIC consistently showed higher levels of complexity for classic philosophical works than for politicians and laypersons.
Validity Test 2 further compared AutoIC to a newly-developed system (V+POStags; Robertson et al., 2019), a comparison that speaks to the relative roles of machine learning and human learning in system development. Machine learning is excellent at detecting patterns in large datasets that humans cannot detect. However, it is less good at predicting alterations that might occur to those patterns in new data that it was not "trained" on. While both AutoIC and V+POStags used both human learning and machine learning in development, AutoIC focused mostly on human learning and V+POStags focused comparatively more on machine learning. Thus, one possible reason for AutoIC's success is that systems developed primarily via human learning are more stable across contexts. We return to this larger issue in the discussion. We also suspect that part of the reason AutoIC outperformed V+POStags on this validity test is more specific to the V+POStags methodology (and not machine learning in general). Specifically, a closer look at the numbers for V+POStags suggests that part of the problem is the commitment of that system to assigning discrete integers, instead of scoring (as AutoIC does) on a sliding scale. Indeed, V+POStags assigned the exact same score to all philosophic works (IC = 3), improbably suggesting there is no variability among the philosophers on IC. Further, the additional probability assessments provided by V+POStags for each discrete score (1, 2, 3, etc.) suggest that there is a strong tendency for the philosophers (compared to Trump) to be assigned higher probability values associated with higher scores, a fact that suggests the system is discarding useful variability in assigning scores.
Of course, while the Trump test is an important step in our understanding of natural language processing validity, we do not want to make too much of one validity test comparison. Nor are we suggesting that V+POStags has no value. Quite the contrary: We believe the V+POStags system is an excellent (and much-needed) machine learning-focused effort for the natural language processing of integrative complexity, and we commend the authors for their work in this regard. Rather, our evaluation of this test is that V+POStags is a promising system that, like all newly-developed systems, requires more work to fulfill that potential.

Validity Tests 3-5: Replication of Existing Studies
Below, we further provide three additional validity tests of a different type. Using AutoIC, we attempt to replicate key aspects of three published studies that originally used human-scored IC. Table 2 provides a larger summary of these data. As can be seen there, we use a two-fold rubric for evaluating these replication attempts.
(1) First, computing similar tests as the original studies, we evaluate whether or not the replication attempt showed a similarly significant result in the same direction as the original study. We consider the replication successful in this regard if the original study showed a significant effect that is also significant in the replication, or if the original study showed no significant effect and the replication attempt likewise showed no effect.
(2) Second, we compute common effect size metrics for each study and compare the strength of each effect (significant or otherwise) for each comparable effect in the original study and the AutoIC replication attempt. As can be seen in Table 2, we provide not only these descriptive statistics for each comparison, but also a brief subjective summary comment for ease of discussion.
Below, we briefly describe each test and offer a narrative summary of the outcomes.

Validity Test 3: Replication of Smoking Attitudes Study From Conway et al. (2017)
Few problems are more pressing in modern society than the issue of health, and smoking remains one of the largest health issues in the world (see Conway et al., 2017). In Validity Test 3, we first scored all the materials for each counselling session (total paragraph N = 6,906); we then additionally analyzed the subset of materials scored in the original study separately. Although no inferential differences emerged between the whole corpus and the identical corpus, narrowing the focus to only those materials scored in the original study did yield effect sizes that were more in the range of the original study (see Table 2).
These results importantly contribute to our understanding of the relationship between smoking behavior and complexity during counselling sessions. Contrary to the assumption that complexity is an unqualified panacea, complexity in health contexts can often backfire because people need simple-minded focus to make positive health-related change (see Conway et al., 2017). Yet, despite the potential utility of this idea, data testing the effects of complexity in health contexts are scarce. Thus, the present results importantly validate this original finding. And, because (unlike the original study) this study scored the entire corpus of materials, they rule out the possibility that something about the original selection process may have influenced the results. Further, because AutoIC is much faster than hand-scoring, validating AutoIC for this context opens up a tool for researchers that might be very pragmatically and theoretically useful moving forward.

Validity Test 4: Replication of U.S. Presidents' Study From Thoemmes and Conway (2007)
Thoemmes and Conway (2007) hand-scored a sample of paragraphs from U.S. Presidents' State of the Union (SOTU) speeches; in contrast, we did not randomly sample paragraphs, instead scoring all the materials in that corpus. viii We attempt to replicate findings from the Thoemmes and Conway (2007) study that fall into two categories: (1) whether or not patterns systematically differed over four years in office, and (2) whether or not individual differences across presidents were in evidence.

Systematic Patterns Over Time
The primary large-scale finding of Thoemmes and Conway (2007) using hand-scored integrative complexity was that SOTU speeches tended to drop in complexity for all presidents over the course of the first term. This primary result was replicated with AutoIC: There was a similar main effect of Year of Speech, F(3, 18492) = 4.65, p = .003. Descriptive results for this pattern were very close to those using hand-scoring reported by Thoemmes and Conway (2007): There was a drop in complexity from Year 1 to Year 4. Consistent with other work using large samples, the effect sizes were smaller for these comparisons using AutoIC than in the original study (averaging Year 1-4 and Year 2-4 comparisons: original study d = 0.33, AutoIC d = 0.06). Overall, however, the AutoIC pattern closely (and significantly) replicates that of Thoemmes and Conway (2007).
Thoemmes and Conway also reported an interaction between success and year of term for hand-scored IC. Unlike in Thoemmes and Conway (2007), for AutoIC there was no significant interaction between success and year of term for complexity (p = .629), and the directional pattern bore little resemblance to the one from the original study.
Taken together, what are we to make of these results? First, they importantly reaffirm one of the basic conclusions of the Thoemmes and Conway (2007) study: Presidential complexity drops over the course of the first term. This not only validates AutoIC, but simultaneously provides needed triangulating support to the effect of time in office on integrative complexity.
Why did the time in office X electoral success interaction not replicate using AutoIC? There are several possibilities.
(1) It is possible that human scoring is tracking a nuance that is important in the existence of the effect, a nuance that AutoIC does not score as effectively.
(2) Of course, a failure to replicate can occur for multiple reasons that have little to do with the system under scrutiny (see, e.g., Conway et al., 2014). For example, it is conceptually possible that, because AutoIC is scoring vastly more of the SOTU materials, this failure to replicate casts doubt on the original finding (perhaps if the original study had scored the other 96.3% of the material, it would have likewise shown a non-effect in this case). This possibility is itself an important contribution. We cannot completely know the exact cause of a failure to replicate in the present case without more data. Importantly, however, in addition to validating the original drop-over-time finding of the Thoemmes and Conway (2007) study, the present results suggest that larger-scale election studies are needed to understand the relationship between electoral success and integrative complexity.

Individual Differences
Individual differences-based tests were also provided by Thoemmes and Conway (2007). Importantly, replicating Thoemmes and Conway (2007), the present results showed an effect of the individual president, F(40, 18456) = 9.00, ICC = .80, p < .001, suggesting that part of the variance in complexity is accounted for by individual differences between persons.
Thoemmes and Conway also attempted to ascertain what personality traits might be associated with presidential complexity by correlating trait scores for each president with their overall IC score. We used AutoIC to perform identical analyses with these personality traits. These results are presented in full in Table 2. Generally, these analyses reveal a similar pattern of results for AutoIC as for human-scored IC, although the AutoIC pattern is weaker overall (average effect size for human-scored IC = .30; for AutoIC = .16). Given that one of the best predictors of integrative complexity has generally been affiliation-related variables (see, e.g., Thoemmes & Conway, 2007), it is perhaps noteworthy that AutoIC (like human-scored IC) showed a significant positive correlation with affiliation motive (r = .35, p = .049). AutoIC also showed a positive relationship with political liberalism that is not only almost identical to that found in the original Thoemmes and Conway (2007) study, but is further validated across multiple studies of politicians via meta-analyses (Houck & Conway, 2019).
Taken together, these results provide an important contribution to our understanding of presidential integrative complexity. First, the original Thoemmes and Conway finding that substantive variance is attributable to stable differences across presidents has been discussed as one of the few empirical investigations into individual differences in politicians' complexity (see Conway & Woodard, 2019). Given the vital implications of understanding the degree that persons are (or are not) chronically complex, the present replication's finding that individual variability in presidential integrative complexity accounts for a significant percentage of the variance is important. It further validates additional recent work (Conway & Woodard, 2019) suggesting that integrative complexity can reasonably be construed, in part, as an individual difference variable.
The present results also generally validate the conclusions of Thoemmes and Conway (2007) concerning what the chronically complex person might look like. That person is especially likely to be high in the affiliation motive and (to a lesser degree) liberal. While it is tempting to over-interpret differences across the studies, it seems clear that, in the main, these results tend to point to roughly similar conclusions as the original study. ix

Validity Test 5: Replication of Meta-Analysis on Political Ideology From Houck and Conway (2019)
Some prior work suggests that liberals use more complex rhetoric than conservatives (e.g., Tetlock, 1983, 1984, 1985; see Jost et al., 2003, for a summary), while other work suggests no differences between liberals and conservatives in their use of complex rhetoric (e.g., Conway et al., 2016a; see Houck & Conway, 2019, for a summary).
To help resolve this puzzle, Houck and Conway (2019) performed a meta-analysis of 35 studies that had measurements of integrative complexity and political ideology. Because this test used only precise measurements of both constructs -for example, they only used political ideology measurements that were unlikely to be contaminated with complexity-relevant variables such as dogmatism or authoritarianism -this study provides a litmus test of the relationship between ideology and the use of complex language.
Houck and Conway's (2019) results suggested a clear resolution to the puzzle of the ideology-complexity relationship: Whereas liberal political elites were significantly more complex than their conservative counterparts, liberal and conservative laypersons showed very similar levels of complexity. Drawing on previous work in other domains on strategic ideological communication (Repke et al., 2018; Tetlock, 1981), Houck and Conway (2019) suggested this difference is due to differing norms for conservatives and liberals that cause liberal (but not conservative) politicians to strategically alter their communications to better meet the expected norms of their populaces.
Houck and Conway's (2019) meta-analysis is one important piece of evidence in our understanding of the ideology-complexity relationship. However, it is increasingly important to provide multiple triangulating tests of a particular theory or model (see Crandall & Sherman, 2016), especially when the issue is as hotly debated as the ideology-complexity link (e.g., Baron & Jost, 2019; Clark & Winegard, 2020). The present study provides a conceptual replication of Houck and Conway's (2019) model on an almost entirely new set of data, using AutoIC (as opposed to hand-scored IC) for measuring integrative complexity.
In the present study, we performed a mini meta-analysis (see Goh, Hall, & Rosenthal, 2016) on samples of data collected and scored for AutoIC by the authors. From this potential sample, we followed the same inclusion criteria, coding procedures, and analytic strategy as used in Houck and Conway (2019). Results are presented in Table 3 and Table 4. These results provide important additional evidence that the relationship between political conservatism and complexity differs for public officials and private citizens. In 11 separate samples (encompassing 5,877 persons, 11,859 documents, and 40,428 paragraphs), political conservatism showed a significant negative relationship with integrative complexity for public political officials, but no such relationship for private citizens. This basic pattern is identical to that of a separate meta-analysis of hand-scored IC studies (Houck & Conway, 2019).
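The pooling step in such a mini meta-analysis can be sketched as follows: each sample's correlation is Fisher z-transformed, weighted by its sample size, averaged, and then back-transformed (the standard approach described by Goh, Hall, & Rosenthal, 2016). All numbers below are hypothetical illustrations, not the values reported in Tables 3 and 4:

```python
import math

def fisher_z_meta(samples):
    """Combine per-sample correlations into one pooled r.

    Each sample is (r, n): a correlation between conservatism and
    integrative complexity, and its sample size. Correlations are
    Fisher z-transformed, weighted by n - 3 (the inverse variance
    of z), averaged, then back-transformed to r.
    """
    zs = [(math.atanh(r), n - 3) for r, n in samples]
    mean_z = sum(z * w for z, w in zs) / sum(w for _, w in zs)
    return math.tanh(mean_z)

# Hypothetical sub-samples (r, n) for illustration only.
officials = [(-0.25, 120), (-0.18, 200), (-0.30, 80)]
laypersons = [(0.02, 900), (-0.01, 1500)]

print(fisher_z_meta(officials))   # pooled r is clearly negative
print(fisher_z_meta(laypersons))  # pooled r is near zero
```

The same function is run separately for the public-official and layperson sub-samples, mirroring the paper's moderator logic: the pattern of interest is a negative pooled correlation in one group and a near-zero one in the other.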

Effect Sizes of Natural Language Processing and Big Data
Although the basic pattern and inferential statistics for Tests 3-5 were in most cases identical to those of the prior studies, the replication attempts using AutoIC generally yielded smaller effect sizes. What does this mean?
There are two potential reasons why AutoIC yielded smaller effect sizes, on average, than human-scored IC. (1) Others have noted that big data (whether natural language processing data or otherwise) can often produce smaller effect sizes (see, e.g., Slavin & Smith, 2009; see also Houck et al., 2018; Kramer, Guillory, & Hancock, 2014). Thus, it is possible that the effect sizes for AutoIC are smaller than those for human-scored IC simply because it generally scores a much larger amount of material.
(2) It is also possible that effect sizes differed for a simpler reason: The present AutoIC results generally did not draw from the exact same paragraphs as the human-scored studies. AutoIC and human-scored IC results might be more similar if we had more frequently been able to use the exact same paragraphs for both systems.
The present results cannot definitively distinguish between these two possibilities, but they can offer some clues.
In Validation Test 3, using an identical corpus showed effects more similar in size to human-scored IC than those using the whole corpus (see Table 2, Test 3 Whole Corpus versus Identical Corpus; note that Test 3 had both Whole and Identical corpora). However, both sets of analyses used the same aggregated unit (meaning they had the same n for computing effect sizes). This suggests that the increased AutoIC effect size for the identical corpus was not due to a general "large n" problem, but rather to AutoIC scoring the exact same set of materials as the original (and thus giving it the best chance at replication due to direct overlap). The paragraph-by-paragraph match hypothesis is further bolstered by the fact that in the present work we found smaller AutoIC effect sizes for the personality measurements in Test 4 (where an identical unit of analysis was used, but no paragraph-by-paragraph match). Taken together, this set of results suggests that the lower effect sizes in this work are generally due to the lack of specificity (and not to a general large-n problem), and that if we had the exact same materials available for scoring, AutoIC effect sizes would be closer to their human-scored counterparts.
However, because we have few cases that can distinguish between the various competing explanations, it is still possible that a more general "big data = small effect size" problem might account for some of our smaller effect sizes. Importantly, while others have commented on effect size issues with big data/natural language processing and argued that we should not dismiss subsequent small effect sizes out of hand (see, e.g., Slavin & Smith, 2009; see also Houck et al., 2018; Kramer, Guillory, & Hancock, 2014), no study that we know of attempts to set empirical boundaries on the exact parameters of when (and how much) the big data effect size reduction occurs, or (conversely) at what point additional data becomes redundant in natural language processing (e.g., Schönbrodt & Perugini, 2013). Moving forward, it would be very useful for social scientists to more fully explore this issue in empirical studies.

Machine Learning Versus Human Learning

When the artificial intelligence android K-2SO claims that the probability of Jyn Erso betraying them is "very high" in the Star Wars movie Rogue One, it carries a lot of weight with the intended audience. After all, we tend to view computer intelligences as unburdened by human limitations such as slow processing speed and emotional biases. Similarly, there may be a tendency to assume that "machine learning" is a superior method of approaching any linguistic problem. However, the truth is that machine learning has clear positive and negative trade-offs. Indeed, when we originally laid plans for an automated integrative complexity system, we first used a rudimentary machine learning approach by evaluating which words and punctuation marks were associated with higher or lower complexity scores. What we found was that, while we could construct effective algorithms for each data set this way, what worked in one data set often failed on another (see Houck et al., 2014). (2) Partially as a result of this, a human can imagine what would potentially happen in other scenarios that the data in question do not remotely cover. For example, imagine that in one dataset the phrase "on the other hand" ("Republicans' foreign policy is bad; on the other hand, their economic policy is…") and the phrase "apart from" ("quite apart from the influence of the war in Iraq, Bush's domestic policies…") are consistently used by one political author to signal clear differentiation (an IC score of 3). Based on these data, a computer algorithm would subsequently assign both of those phrases a high probability score for 3. But human coders would do something else entirely. They would realize that, in many other contexts, the phrase "apart from" is actually used in a purely descriptive fashion that implies no complexity at all ("I do not wish to be apart from you"), whereas the number of contexts in which "on the other hand" is likely to be used in a non-complex way is comparatively much smaller. Therefore, a human-based approach would use the data from the computer learning in a different way, making estimates of how each phrase would fare beyond the dataset in question, and thus assign "on the other hand" a higher complexity probability score than "apart from."
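The difference between a purely data-driven classifier and the human-estimated, graded phrase weighting described above can be sketched in a few lines. The phrases and weights here are hypothetical illustrations, not AutoIC's actual lexicon:

```python
# Hypothetical phrase weights: the estimated probability that a
# phrase signals genuine differentiation (IC >= 3) in an arbitrary
# new context. "on the other hand" rarely appears non-complexly,
# so it receives a high weight; "apart from" is often purely
# descriptive ("I do not wish to be apart from you"), so a human
# coder discounts it, even if both phrases predicted IC = 3
# equally well in a single training corpus.
DIFFERENTIATION_WEIGHTS = {
    "on the other hand": 0.9,
    "apart from": 0.4,
}

def differentiation_score(paragraph: str) -> float:
    """Return the strongest differentiation signal found in the text."""
    text = paragraph.lower()
    return max(
        (w for phrase, w in DIFFERENTIATION_WEIGHTS.items() if phrase in text),
        default=0.0,
    )

complex_text = ("Republicans' foreign policy is bad; "
                "on the other hand, their economic policy is sound.")
descriptive_text = "I do not wish to be apart from you."

print(differentiation_score(complex_text))      # strong signal
print(differentiation_score(descriptive_text))  # weak signal
```

A naive machine-learned system would, in effect, set both weights to the same high value because both phrases co-occurred with IC = 3 in the training data; the human-estimated weights encode how each phrase is expected to behave beyond that dataset.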
This is one of the reasons that, we believe, AutoIC has consistently performed at similar levels across multiple new contexts beyond those on which it was originally developed, while systems such as V+POStags (which focus more extensively on machine learning) have fared more poorly when faced with a new context (such as the philosophic writings in the current paper). Indeed, AutoIC researchers spent a larger proportion of their time developing human-generated dictionaries of words and phrases than other researchers have done. Consider that, in contrast to V+POStags (whose developers spent more time on machine learning and thus built a human dictionary that had 312 base words), AutoIC has thousands of complexity-related words and phrases in its lexicon. Further, AutoIC researchers spent more time estimating the probability of each word or phrase's contribution to complexity, whereas V+POStags researchers used a simple binary classification system (see Robertson et al., 2019) that lost potential human-inspired nuance.
Of course, machine learning has advantages too: It can often uncover complex relationships that humans cannot. We expect that, as advances in machine learning grow, it will be used more and more effectively. Indeed, one clear implication of the above line of reasoning is that the greatest current need for machine learning approaches is a larger corpus of human-scored IC data for development. The best way to deal with the problem of continuity across datasets head-on is to develop natural language processing systems on as large and as varied a set of data as possible. Because human scoring of IC takes a great deal of time, the existing corpus is likely not yet large enough to be sufficient. However, an important goal of future research should be to expand that corpus so that it is large enough to more fully take advantage of the strengths of machine learning.
Thus, our point is not to undermine the validity of machine learning, but simply to point out that it has both great strengths and severe limitations, and to encourage methods-building from multiple perspectives based on rigorous scientific standards (see Schoonvelde, Schumacher, & Bakker, 2019, for a discussion). While the next generation of improvements will likely indeed come from machine learning (and we would applaud those improvements), we should not assume that just because a system was developed via "machine learning" it is de facto an improvement. This is an empirical field, and those assumptions (however appealing) must still be put to the empirical test.

ii) Evidence suggests that LIWC's measurements are relevant to complex thinking/rhetoric (see, e.g., Boyd et al., 2015; Jordan et al., 2019), but not to integrative complexity specifically. We suspect that researchers are simply unaware that the measure they view as a measurement of "integrative complexity" is largely unrelated to that instantiation of complexity.
iii) The This I Believe essays were first scored for AutoIC in Houck et al. (2018). However, their use in the present research is entirely novel.

iv) To compute the effect size for the This I Believe dataset, we randomly selected a number of participants equal to the number of philosophers and used that randomly selected set to compute the reported effect size.

v) They further provide a validity test showing that V+POStags finds an expected effect in a social media analysis. However, they did not score AutoIC on this subsequent test.