Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Zhao Wang (1) and Aron Culotta (2)
(1) Department of Computer Science, Illinois Institute of Technology, Chicago, IL
(2) Department of Computer Science, Tulane University, New Orleans, LA
zwang185@hawk.iit.edu, aculotta@tulane.edu

Abstract

Spurious correlations threaten the validity of statistical classifiers. While model accuracy may appear high when the test data is drawn from the same distribution as the training data, it can quickly degrade when the test distribution changes. For example, it has been shown that classifiers perform poorly when humans make minor modifications to change the label of an example. One solution to increase model reliability and generalizability is to identify causal associations between features and classes. In this paper, we propose to train a robust text classifier by augmenting the training data with automatically generated counterfactual data. We first identify likely causal features using a statistical matching approach. Next, we generate counterfactual samples for the original training data by substituting causal features with their antonyms and then assigning opposite labels to the counterfactual samples. Finally, we combine the original data and counterfactual data to train a robust classifier. Experiments on two classification tasks show that a traditional classifier trained on the original data does very poorly on human-generated counterfactual samples (e.g., a 10%-37% drop in accuracy). However, the classifier trained on the combined data is more robust and performs well on both the original test data and the counterfactual test data (e.g., a 12%-25% increase in accuracy compared with the traditional classifier). Detailed analysis shows that the robust classifier makes meaningful and trustworthy predictions by emphasizing causal features and de-emphasizing non-causal features.

Introduction

Despite the remarkable performance machine learning models have achieved in various tasks, studies have shown that statistical models typically learn correlational associations between features and classes, and model validity and reliability are threatened by spurious correlations. Examples include: a sentiment classifier learns that "Spielberg" is correlated with positive movie reviews (Wang and Culotta 2020); a toxicity classifier learns that "gay" is correlated with toxic comments (Wulczyn, Thain, and Dixon 2017); a medical system learns that a disease is associated with patient ID (Kaufman, Rosset, and Perlich 2011); an object detection system recognizes a sheep based on the grass in the background (Ghorbani, Abid, and Zou 2019). If these kinds of spurious correlations are built into the model during training, the model can fail when the test data has a different distribution, or even on samples with minor changes, and its predictions become problematic and suffer from algorithmic fairness or trust issues.

One solution to achieve robustness is to learn causal associations between features and classes. E.g., in the sentence "This was a free book that sounded boring to me", the word most responsible for the label being negative is "boring" instead of "free". Identifying causal associations provides a way to build more robust and generalizable models.

Recent works try to achieve robustness with the aid of human-in-the-loop systems. Srivastava, Hashimoto, and Liang (2020) present a framework to make models robust to spurious correlations by leveraging human common sense of causality. They augment training data with crowd-sourced annotations about reasoning of possible shifts in unmeasured variables, and finally conduct robust optimization to control worst-case loss. Similarly, Kaushik, Hovy, and Lipton (2020) ask humans to revise documents with minimal edits to change the class label, then augment the original training data with the counterfactual samples. Results show that the resulting classifier is less sensitive to spurious correlations.
While these prior works show the potential of using human annotations to improve model robustness, collecting such annotations can be costly.

In this paper, we propose to train a robust classifier with automatically generated counterfactual samples. Specifically, we first identify likely causal features using the closest opposite matching approach, then generate counterfactual training samples by substituting causal features with their antonyms and assigning opposite labels to the newly generated samples. Finally, we combine the original training data with the counterfactual data to train a more robust classifier.

We experiment with sentiment classification tasks on two datasets (IMDB movie reviews and Amazon Kindle reviews). For each dataset, we have the original training and testing data, plus additional human-generated counterfactual testing data. We first train a traditional classifier using the original data, which performs poorly on the counterfactual testing data (i.e., a 10%-37% drop in accuracy). Then, we train a robust classifier on the combination of original training data and automatically-generated counterfactual training data, and it performs well on both the original testing data and the counterfactual testing data (i.e., a 12%-25% absolute improvement over the baseline). Additionally, we consider limited human supervision in the form of human-provided causal features, which we then use to generate counterfactual training samples. We find that a small number of causal features (e.g., 50) results in accuracy comparable to a model trained with 1.7K human-generated counterfactual training samples from previous work.

Related Work

Spurious correlations are problematic and can be introduced in many ways. Sagawa et al. (2020) investigate how overparameterization exacerbates spurious correlations: they compare overparameterized models with underparameterized models and show that overparameterization encodes spurious correlations that do not hold in worst-group data. Kiritchenko and Mohammad (2018) showed that training data imbalances can lead to unintended bias and unfair applications (e.g., bias towards gender or race). Besides that, data leakage (Roemmele, Bejan, and Gordon 2011) and distribution shift between training and testing data (Quionero-Candela et al. 2009) are particularly challenging and hard to detect, as they introduce spurious correlations during model training and hurt model performance when deployed. Another new type of threat is the backdoor attack (Dai, Chen, and Li 2019), where an attacker intentionally poisons a model by injecting spurious correlations into the training data and manipulates model performance with specific triggers.

A growing line of research explores the challenges and benefits of using causal inference to improve model robustness. Wood-Doughty, Shpitser, and Dredze (2018) use text classifiers in causal analyses to address issues of missing data and measurement error. Keith, Jensen, and O'Connor (2020) introduce methods to remove confounding from causal estimates. Paul (2017) proposes a propensity score matching method to learn meaningful causal associations between variables. Jia et al. (2019) consider label-preserving transformations to improve model robustness to adversarial perturbations with Interval Bound Propagation. Landeiro and Culotta (2018) address the issue of spurious correlations by performing back-door adjustment to control for known confounders. Wang and Culotta (2020) train a classifier to distinguish between spurious features and genuine features, and gradually remove spurious features to improve the worst-case accuracy of minority groups.

Recent works investigate how additional human supervision can reduce spurious correlations and improve model robustness. Roemmele, Bejan, and Gordon (2011) and Sap et al. (2018) show that humans achieve high performance on commonsense causal reasoning and counterfactual tasks. Zaidan and Eisner (2008) ask annotators to provide rationales as hints to guide classifiers toward relevant features. Lu et al. (2018) and Zmigrod et al. (2019) use counterfactual data augmentation to mitigate bias. Ribeiro et al. (2020) evaluate model robustness using generated counterfactuals that require significant human intervention (either by specifying substitution terms or by generating templates and labeling examples). Garg et al. (2019) presume a predefined set of 50 counterfactually fair tokens and augment the training data with counterfactuals to improve toxicity classifier fairness.

While recent works have proposed the idea of generating and augmenting with counterfactuals for robust classification, the main contributions of this paper are as follows:
• We propose to discover likely causal features using statistical matching techniques.
• Using these features, we automatically generate counterfactual samples by substituting causal features with antonyms, which significantly reduces human effort.
• We conduct experiments demonstrating the improved robustness of the resulting classifier to spurious features.
• We conduct additional analyses to show how the robust classifier increases the importance of causal features and decreases the importance of spurious features.
Problem and Motivation

To train a classification model, we fit a function $f(\cdot)$ on a set of labeled data and learn a map between input features and output labels. We consider a binary text classification task with the simple approach of a logistic regression model using bag-of-words features: $f(x;\theta) = \frac{1}{1 + e^{-\langle x, \theta \rangle}}$. Specifically, each document is a sequence of words $d = \langle w_1 \ldots w_k \rangle$ that is transformed into a feature vector $x$ via a one-hot representation $x = \langle x_1 \ldots x_V \rangle$ ($V$ is the vocabulary size), and has a binary label $y \in \{-1, 1\}$. The model is fit on a set of labeled documents $D = \{(d_1, y_1) \ldots (d_n, y_n)\}$, and parameters are estimated by minimizing the loss function $\mathcal{L}$: $\theta^* \leftarrow \arg\min_{\theta} \mathcal{L}(D; \theta)$. We can examine the (partial) correlations between features and labels via the model coefficients. (Our approach is model agnostic; we focus on logistic regression for interpretability and clarity.)

Spurious correlations are very common in statistical models, and they can mislead classifiers. For example, in our experimental dataset of Amazon Kindle reviews, the classifier learns that "free" has a strong correlation with negative sentiment because "free" has a high frequency in negative book reviews (e.g., "This was a free book that sounded boring to me"), and thus the classifier makes errors when predicting positive documents that contain "free".

Previous works have tried various methods to reduce spurious correlations (e.g., regularization, feature selection, back-door adjustment (Hoerl and Kennard 1970; Forman 2003; Landeiro and Culotta 2018)). However, a more direct solution is to learn meaningful causal associations between features and classes. While expressing causality in the context of text classification can be challenging, we follow previous work (Paul 2017) to operationalize the definition of a causal feature as follows: term $w$ is a causal feature in document $d$ if, all else being equal, one would expect $w$ to be a determining factor in assigning a label to $d$. For example, in the sentence "This was a free book that sounded boring to me", "boring" is primarily responsible for the negative sentiment. In contrast, the term "free" itself does not convey negative sentiment. We consider "boring" a causal term and "free" a non-causal term (term refers to a word feature).
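As a concrete illustration of this setup (not from the paper's released code), the following minimal sketch fits a bag-of-words logistic regression with scikit-learn on two hypothetical toy documents and reads off the per-term coefficients that the rest of the paper inspects:

```python
# Minimal sketch of the setup above: bag-of-words logistic regression whose
# coefficients expose (partial) feature-label correlations. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["This was a free book that sounded boring to me",
        "The book was great and interesting"]
labels = [-1, 1]  # y in {-1, 1}

vec = CountVectorizer(binary=True)   # one-hot bag-of-words features
clf = LogisticRegression().fit(vec.fit_transform(docs), labels)

for term, coef in zip(vec.get_feature_names_out(), clf.coef_[0]):
    print(f"{term}\t{coef:+.3f}")    # e.g., "free" picks up a negative weight
```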
Our approach in this paper is to first identify such causal features and then use them to automatically generate counterfactual training samples. Specifically, for a sample $(d, y)$, we get the corresponding counterfactual sample $(d', y')$ by (i) substituting causal terms in $d$ with their antonyms to get $d'$, and (ii) assigning an opposite label $y'$ to $d'$. Let's consider the previous example to see how augmenting with counterfactual samples might work. A traditional classifier trained on the original data learns that "free" is correlated with the negative class due to its high frequency in negative book reviews. For every negative document containing "free", we generate one corresponding counterfactual document. The counterfactual sample for "This was a free book that sounded boring to me" (neg) would be "This was a free book that sounded interesting to me" (pos). When augmenting the original training data with counterfactual data, "free" would get equal frequency in both classes in the ideal case (i.e., if we could generate counterfactual samples for all documents containing "free"). Thus, a classifier fit on the combined dataset should have a reduced coefficient for "free" and increased coefficients for "boring" and "interesting."

Methods

Our approach is a two-stage process: we first identify likely causal features and then generate counterfactual training data using those causal features. To identify causal features, we consider the counterfactual framework of causal inference (Winship and Morgan 1999): if word $w$ in document $d$ were replaced with some other word $w'$, how likely is it that the label $y$ would change? Since conducting randomized control trials to answer this question is infeasible, we instead use matching methods (Imbens 2004; King and Nielsen 2019). The intuition is as follows: if $w$ is a reliable piece of evidence for determining the label of $d$, we should be able to find a very similar document $d'$ that (i) does not contain $w$, and (ii) has the opposite label of $d$. For example, $(d, y)$ = ("This was a free book that sounded boring to me", neg) and $(d', y')$ = ("This was a free book that sounded interesting to me", pos) would be an ideal match, where substituting the causal term "boring" with another term "interesting" flips the label. While this is not a necessary condition of a causal feature (there may not be a good match in a limited training set), in the experiments below we find this to be a fairly precise approach to generate a small number of high-quality causal features.

The full steps of our approach are as follows:
1. We first train an initial classifier and extract strongly correlated terms $\langle t_1 \ldots t_k \rangle$ as candidate causal features. E.g., for a logistic regression model, we would extract features with high-magnitude coefficients. For more complex models, other transparency algorithms may be used (Martens and Provost 2014).
2. For each top term $t$ and the set of documents containing $t$, $D_t = \langle d_1 \ldots d_n \rangle$, we search for a set of matched documents $D'_t = \langle d'_1 \ldots d'_n \rangle$ and get $D_{\mathrm{match}} = \{(d_1, d'_1, \mathrm{score}_1) \ldots (d_n, d'_n, \mathrm{score}_n)\}$, where the score for each match is the context similarity of $d$ and $d'$. The matched documents have opposite labels.
3. Then, for each term $t$ and its corresponding matching set $D_{\mathrm{match}}$, we pick the tuple $(d_i, d'_i, \mathrm{score}_i)$ with the highest similarity score as the closest opposite match. We then identify likely causal features by picking those whose closest opposite matches have scores greater than a threshold (0.95 is used below).
4. We use PyDictionary to get antonyms for causal terms.
5. For each training sample, we generate its counterfactual sample by substituting causal terms with antonyms and assigning an opposite label to the counterfactual sample.
6. Finally, we train a robust classifier using the combination of original training data and counterfactual data.

We provide more details on these steps below.
Identifying Likely Causal Features

We expect causal features to have at least some correlation with the target class, so we first fit an initial binary classifier $f(x;\theta)$ on the original training data $D = \{(d_1, y_1) \ldots (d_n, y_n)\}$ and extract top terms $\langle t_1 \ldots t_k \rangle$ that have relatively large magnitude coefficients (e.g., $> 1$ in the experiments below). For a top term $t$ and a document $d$ containing $t$, we let $d[\bar{t}]$ represent the context of $d$ after removing $t$. We search for another document $d'$ that (i) contains another top term $t'$ but does not contain $t$, and (ii) has the opposite label of $d$. We use a best-match approach to find the $d'[\bar{t}']$ with the highest semantic similarity to $d[\bar{t}]$ among all possible candidates: $d' \leftarrow \arg\max_{\hat{d}} \mathrm{sim}(d[\bar{t}], \hat{d}[\bar{t}'])$. For a term $t$, we get a set of corresponding matches $D_{\mathrm{match}} = \{(d_1, d'_1, \mathrm{score}_1) \ldots (d_n, d'_n, \mathrm{score}_n)\}$, where the score for each match is the semantic similarity between $d[\bar{t}]$ and $d'[\bar{t}']$. Each context is represented by concatenating the last four layers of a pre-trained BERT model (as recommended by Devlin et al. (2019)). We then select the match $(d_i, d'_i, \mathrm{score}_i)$ with the highest score in $D_{\mathrm{match}}$ as the closest opposite match for $t$. Table 1 shows examples of closest opposite matches.

Original sentence | Matched sentence | Context similarity
This was an amazing book. | This was a boring book. | 0.977
It was a boring read. | The book was great and long. | 0.998
This short story was a disappointment. | This was a great short story. | 0.992
This is one of the funniest movies I have seen. | This is one of the worst movies I have ever seen. | 0.980
Fantastic film. | Terrible film. | 1.00

Table 1: Examples of closest opposite matches with corresponding context similarity scores.

From the previous step, we get the closest opposite match for each top term. We then identify terms with closest opposite match scores greater than 0.95 as likely causal terms. To evaluate the quality of this approach, the left panel of Figure 1 shows terms annotated by a human as likely to be causal or not, plotted by both their closest opposite match scores and the magnitude of their coefficients from the classifier trained on the original data. We can see that terms with very high closest opposite match scores are very likely to be causal. Note that this is not necessarily the case for terms with high coefficients (y-axis). The high-precision, low-recall pattern is further supported by the right panel.

[Figure 1: The "closest opposite match" score provides a high-precision indicator of causal features (IMDB dataset).]
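A minimal sketch of this context-similarity computation, assuming the Hugging Face transformers API; the paper specifies only that contexts are represented by concatenating the last four BERT layers, so the mean-pooling over tokens and the cosine comparison here are our assumptions:

```python
# Sketch: embed a document with the candidate term removed using the
# concatenated last four hidden layers of pre-trained BERT, then compare
# two contexts with cosine similarity. Pooling choice is an assumption.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def context_embedding(doc, term):
    """Embed `doc` with `term` removed, via the last four BERT layers."""
    context = " ".join(w for w in doc.split() if w.lower() != term)
    inputs = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states    # tuple: embeddings + 12 layers
    last_four = torch.cat(hidden[-4:], dim=-1)    # (1, seq_len, 4*768)
    return last_four.mean(dim=1).squeeze(0)       # mean-pool over tokens

sim = torch.nn.functional.cosine_similarity(
    context_embedding("This was an amazing book.", "amazing"),
    context_embedding("This was a boring book.", "boring"),
    dim=0)
print(f"context similarity: {sim.item():.3f}")
```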
Selecting Antonyms for Causal Terms

After identifying causal terms, we search for their antonyms using PyDictionary (https://github.com/geekpradd/PyDictionary). This package provides simple interfaces for getting meanings from WordNet, and synonyms and antonyms from synonym.com. To reduce the noise of the returned antonyms, we require the antonyms to have coefficients of the opposite sign from the causal terms. Specifically, for each causal term $t$, we search for its antonyms as follows:
• First, check the direct antonyms for $t$ and save those that satisfy the coefficient requirement as candidate antonyms.
• If no satisfying antonym is found, we then get synonyms for $t$, iteratively search each synonym's antonyms, and save the satisfying antonyms as candidate antonyms.

After these two steps, we get at least one candidate antonym for each causal term $t$: $\{a_1 \ldots a_k\}, k \geq 1$. Table 2 shows examples of the antonyms we get for causal terms.

Causal term (coef.) | Antonyms (coef.)
fantastic: 1.638 | unimpressive: -0.462; inferior: -0.644
awesome: 1.202 | unimpressive: -0.462
pleasant: 1.106 | unpleasant: -0.333
dull: -1.881 | lively: 0.302; colorful: 0.252
boring: -2.592 | interesting: 0.734

Table 2: Discovered antonyms for causal terms and corresponding coefficients from the initial classifier.
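A sketch of this two-step antonym search, assuming PyDictionary's antonym() and synonym() calls; the coefficient values here are illustrative stand-ins for the initial classifier's weights:

```python
# Sketch of the antonym search with the coefficient-sign filter. `coef` maps
# vocabulary terms to initial-classifier coefficients (hypothetical values).
from PyDictionary import PyDictionary

dictionary = PyDictionary()
coef = {"boring": -2.592, "interesting": 0.734, "dull": -1.881, "lively": 0.302}

def candidate_antonyms(term):
    """Return antonyms of `term` whose coefficients have the opposite sign."""
    def opposite_sign(a):
        return a in coef and coef[a] * coef[term] < 0

    # Step 1: direct antonyms that satisfy the coefficient requirement.
    direct = [a for a in (dictionary.antonym(term) or []) if opposite_sign(a)]
    if direct:
        return direct
    # Step 2: otherwise, search each synonym's antonyms.
    found = []
    for syn in dictionary.synonym(term) or []:
        found += [a for a in (dictionary.antonym(syn) or []) if opposite_sign(a)]
    return found

print(candidate_antonyms("boring"))  # e.g., ['interesting']
```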
Generating Counterfactual Samples

Next, for each training document $d$, we first identify all the causal terms in $d$, $\langle t_1 \ldots t_m \rangle$, and then substitute all causal terms with their corresponding antonyms. If a causal term has multiple candidate antonyms, we randomly pick one to substitute. We only generate counterfactuals for documents containing at least one causal term. Finally, we assign opposite labels to the generated samples. Table 5 shows examples of generated counterfactual sentences. While most substitutions result in reasonable sentences, future work may investigate more sophisticated language models to ensure the fluency of the generated counterfactuals.
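A minimal sketch of this generation step, with a hypothetical causal-term-to-antonym map standing in for the output of the previous two subsections:

```python
# Sketch: substitute every causal term with one of its candidate antonyms and
# flip the label. The antonym map is illustrative, not from the paper's code.
import random

antonyms = {"boring": ["interesting"], "awesome": ["unimpressive"]}

def make_counterfactual(doc, label):
    """Return (doc', -label) with causal terms replaced, or None if `doc`
    contains no causal term."""
    words = doc.split()
    if not any(w.lower() in antonyms for w in words):
        return None  # only documents with at least one causal term are used
    flipped = [random.choice(antonyms[w.lower()]) if w.lower() in antonyms else w
               for w in words]
    return " ".join(flipped), -label

print(make_counterfactual("This was a free book that sounded boring to me", -1))
# ('This was a free book that sounded interesting to me', 1)
```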
Training a Robust Text Classifier

We augment the original training data with the automatically generated counterfactual data to train a robust classifier. We perform experiments below to investigate how causal terms affect the quantity and quality of automatically generated counterfactual samples.

Data

We perform sentiment classification experiments on the following two datasets (code and data available at: https://github.com/tapilab/aaai-2021-counterfactuals). Each dataset has human-edited counterfactual testing samples to provide benchmark performance for classifier robustness.

IMDB movie reviews: This dataset is sampled from the original IMDB dataset (Pang and Lee 2005), and the counterfactual part is collected and published by Kaushik, Hovy, and Lipton (2020). They randomly sampled 2.5K reviews with balanced class distributions and partitioned them into 1707 training, 245 validation, and 488 testing samples. They then instructed Amazon Mechanical Turk workers to revise each document with minimum changes towards a counterfactual label, and finally collected 2.5K counterfactually-manipulated samples.

Each document of this dataset is a long paragraph. We are interested in exploring classifier performance for both long texts and short texts, so we additionally create a version of this dataset segmented into single sentences. To do so, we first fit a binary classifier on the original data and identify strongly correlated terms as keywords. Then we split each original document into single sentences and keep those containing at least one keyword (a sketch of this filtering step appears at the end of this section). Sentence labels are inherited from the original document labels. To justify the validity of this approach, we randomly sampled 500 sentences and manually checked their labels; the inherited labels were correct for 484 sentences (i.e., 96.8% accuracy). We refer to the IMDB dataset with long texts as IMDB-L and with short texts as IMDB-S.

Amazon Kindle reviews (Kindle): This dataset contains book reviews from the Amazon Kindle Store, where each review has a rating ranging from 1 to 5 (He and McAuley 2016). We label reviews with ratings {4, 5} as positive and reviews with ratings {1, 2} as negative, and then process this dataset into single sentences following the approach used for IMDB.

Human-edited counterfactuals: For the IMDB dataset, we have the human-generated counterfactual training data and counterfactual testing data. For the Kindle dataset, we randomly select 500 samples as test data (comparable in size with the test data from IMDB-L) and manually edit them to be counterfactual samples with minimum edits.

Ground truth causal terms: We manually annotated a set of ground truth causal terms for each dataset. Specifically, we asked two student annotators to label a term as causal if, all else being equal, the term is a determining factor in assigning a label to a document. While there is some subjectivity in the annotation, we did a round of training to resolve disagreements prior to annotation, and the final agreement was generally high for this task (e.g., 96% raw agreement by fraction of labels that agree).

Table 3 shows the basic data statistics. For the top terms, we select them by thresholding on the magnitude of coefficients: for IMDB-L we use threshold 0.4, and for IMDB-S and Kindle we use threshold 1.0.

             | IMDB-L      | IMDB-S       | Kindle
             | pos    neg  | pos    neg   | pos    neg
Train        | 856    851  | 4114   4059  | 5000   5000
Test         | 245    243  | 1144   1101  | 250    250
Top terms    | 231         | 198          | 194
Causal terms | 282         | 285          | 264

Table 3: Dataset summary.
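The keyword-based sentence filtering used to build the short-text datasets can be sketched as follows; the splitter and the keyword set are simplified stand-ins (real keywords come from the initial classifier's high-magnitude coefficients, using the thresholds above):

```python
# Sketch of the segmentation step: split reviews into sentences, keep those
# containing at least one strongly correlated keyword, inherit the doc label.
import re

keywords = {"boring", "great", "terrible"}   # illustrative keyword set

def keyword_sentences(doc, label):
    sentences = re.split(r"(?<=[.!?])\s+", doc)  # naive sentence splitter
    return [(s, label) for s in sentences
            if keywords & {w.strip(".,!?").lower() for w in s.split()}]

review = "The plot was boring. I saw it on a plane. The ending was great!"
print(keyword_sentences(review, -1))
# [('The plot was boring.', -1), ('The ending was great!', -1)]
```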
Experiments and Discussion

Causal Term Identification

According to the left panel of Figure 1, we find that the similarity scores of closest opposite matches are a viable signal of true causal terms. The right panel shows the performance of identifying causal terms when thresholding on the closest opposite match scores. Using threshold 0.95, we identify 32 causal terms for the IMDB-L and IMDB-S datasets, of which 27 are true causal terms (i.e., precision: 84%), and 23 causal terms for the Kindle dataset, of which 19 are true causal terms (i.e., precision: 83%).

Robust Classification for Counterfactual Test Data

We fit five binary logistic regression classifiers with different training data (using scikit-learn (Pedregosa et al. 2011)) and evaluate their performance on the original test samples as well as the counterfactual test samples. The training data compared below have increasing requirements for human supervision. For the first and second, only original training data is required. For the third and fourth, a human provides a list of causal terms, either by selecting from the list of top terms or from the entire vocabulary. In the final setting, humans manually annotate counterfactual training samples (equivalent to the approach of Kaushik, Hovy, and Lipton (2020)). The five levels of human supervision are as follows:
1. Only original training samples.
2. The original training samples are augmented with automatically generated counterfactual training samples using predicted causal terms.
3. The original training samples are augmented with counterfactual samples automatically generated using human-annotated causal terms from the top words (i.e., 65 for IMDB-L, 80 for IMDB-S, and 76 for Kindle).
4. The original training samples are augmented with counterfactual samples automatically generated using human-annotated causal terms from the entire vocabulary (i.e., 282 for IMDB-L, 285 for IMDB-S, and 264 for Kindle).
5. The original training samples are augmented with human-generated counterfactual training samples.

We train the classifiers using the five different training sets and compare their performance on the original test samples and the human-generated counterfactual test samples. Table 4 shows the results (a sketch of this evaluation protocol follows the table).

Counterfactual train samples | Causal terms | IMDB-L (Orig / CTF) | IMDB-S (Orig / CTF) | Kindle (Orig / CTF)
not used | not used | .816 / .615 | .711 / .605 | .888 / .514
auto-generated | predicted from top words | .742 / .744 | .685 / .660 | .866 / .624
auto-generated | annotated from top words | .760 / .818 | .679 / .696 | .882 / .662
auto-generated | annotated from whole vocabulary | .773 / .857 | .685 / .726 | .752 / .720
human-generated | not used | .818 / .869 | .705 / .762 | n/a / n/a

Table 4: Classification accuracy. Training data is the original train samples plus the listed counterfactual train samples. (CTF is human-generated counterfactual testing data.) We lack human-generated counterfactual training samples for the Kindle dataset, so that result is omitted.
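A sketch of this protocol; the scikit-learn calls are real, while the data variables in the usage comments are placeholders for the training configurations above:

```python
# Sketch of the Table 4 evaluation: fit one classifier per training
# configuration and score it on both the original and counterfactual tests.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def evaluate(train_docs, train_labels, test_sets):
    """Fit bag-of-words logistic regression; return accuracy per test set."""
    clf = make_pipeline(CountVectorizer(binary=True),
                        LogisticRegression(max_iter=1000))
    clf.fit(train_docs, train_labels)
    return {name: clf.score(docs, labels)
            for name, (docs, labels) in test_sets.items()}

# Usage (placeholder variables):
# test_sets = {"Orig": (orig_test, orig_test_y), "CTF": (ctf_test, ctf_test_y)}
# evaluate(orig_train, orig_train_y, test_sets)                     # row 1
# evaluate(orig_train + cf_train, orig_train_y + cf_y, test_sets)   # row 2
```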
cause the causal term “good” has a small positive coefficient and the prediction is misled by the spuriously correlated Performance Change with Different Number of negative term “movie” . The robust classifier corrects this Human Annotated Causal Terms prediction by increasing the coefficient of the causal term “good” and decreasing the coefficient of the non-causal term To further investigate how many human-provided causal “movie.” terms are needed to improve robustness, Figure 2 shows the We conduct a final analysis to explore the impact of classification performance with different numbers of causal causal versus non-causal terms when correcting misclassi- terms used for generating counterfactual samples. The qual- fications. For each corrected sample, we compute separately ity of automatically generated counterfactuals depends on the change in coefficient magnitudes for causal and non- the causal terms used for antonym substitutions. We ob- causal terms. We then aggregate across all corrected sam- serve that performance seems to plateau after about 100 ples to summarize the impact each type of correction has. causal terms, which suggests that we can get similar perfor- As shown in Table 7, for IMDB-L, increasing coefficients mance by annotating 100 causal terms, as opposed to creat- of causal terms is more important than decreasing coeffi- ing > 1K counterfactual training samples. The cause of the cients of non-causal terms, and the reverse is true for the plateau is likely due to the infrequency of subsequent terms other two datasets. This suggests that document length is and the fact that such terms co-occur with other causal terms, an important factor in determining whether increasing co- so they do not result in many new counterfactual samples. efficients of causal terms has bigger impacts or decreasing coefficients of non-causal terms has bigger impacts. Exam- Coefficient Change for Causal and Non-causal ining the average coefficient change of each term, the robust Terms To understand why training with counterfactual data im- In the data, “film” correlates with high ratings, while “movie” proves classifier robustness, Table 5 shows examples of the correlates with low ratings. 14029 Term Original coef Robust coef Original sentence Counterfactual sentence movie -0.236 0.028 Terrible movie Fantastic movie Non-causal free -1.41 -0.919 This was a free book that This was a free book that terms sounded boring to me. sounded interesting to me. awesome 0.584 1.838 He was an awesome actor. He was an awful actor. Causal terrible -1.283 -2.336 The whole movie consists of The whole movie consists terms terrible dialogue. of pleasant dialogue. Table 5: Coefficient change of causal and non-causal terms. Corrected samples Original coef Robust coef Acknowledgments good:0.231 good:0.714 Really good movie.(pos) This research was funded in part by the National Science movie:-0.236 movie:0.028 Foundation under grant #1618244. Zhao Wang was funded dubbing:-0.472 dubbing:-0.1 The dubbing was as good in part by a Dissertation Fellowship from the Computer Sci- good:0.231 good:0.714 as I have seen.(pos) ence department at Illinois Tech. We would also like to thank story:-0.171 story:-0.083 the anonymous reviewers for useful feedback. The story was incredibly incredibly:-0.874 incredibly:0.029 interesting.(pos) interesting:-0.874 interesting:1.012 References Dai, J.; Chen, C.; and Li, Y. 2019. A Backdoor Attack Table 6: Explanation for robust classifier corrected samples. 
Error Analysis

Table 6 shows several test sentences that are misclassified by the original classifier and later corrected by the robust classifier. We can see again that the robust classifier increases coefficients of causal terms and decreases coefficients of non-causal terms. For example, "Really good movie" is incorrectly predicted as negative by the original classifier, because the causal term "good" has a small positive coefficient and the prediction is misled by the spuriously correlated negative term "movie" (in the data, "film" correlates with high ratings, while "movie" correlates with low ratings). The robust classifier corrects this prediction by increasing the coefficient of the causal term "good" and decreasing the coefficient of the non-causal term "movie."

Corrected sample | Original coef. | Robust coef.
Really good movie. (pos) | good: 0.231; movie: -0.236 | good: 0.714; movie: 0.028
The dubbing was as good as I have seen. (pos) | dubbing: -0.472; good: 0.231 | dubbing: -0.1; good: 0.714
The story was incredibly interesting. (pos) | story: -0.171; incredibly: -0.874; interesting: -0.874 | story: -0.083; incredibly: 0.029; interesting: 1.012

Table 6: Explanation of samples corrected by the robust classifier.

We conduct a final analysis to explore the impact of causal versus non-causal terms when correcting misclassifications. For each corrected sample, we compute separately the change in coefficient magnitudes for causal and non-causal terms. We then aggregate across all corrected samples to summarize the impact each type of correction has. As shown in Table 7, for IMDB-L, increasing coefficients of causal terms is more important than decreasing coefficients of non-causal terms, and the reverse is true for the other two datasets. This suggests that document length is an important factor in determining whether increasing coefficients of causal terms or decreasing coefficients of non-causal terms has the bigger impact. Examining the average coefficient change per term, the robust classifier tends to make bigger increases for causal terms and smaller decreases for non-causal terms. However, the greater frequency of non-causal terms can lead these changes to have a greater overall impact on classification accuracy.

       | Change per document     | Change per term
       | causal | non-causal     | causal | non-causal
IMDB-L | 1.888  | -0.734         | 0.327  | -0.01
IMDB-S | 0.435  | -0.626         | 0.302  | -0.042
Kindle | 0.293  | -0.772         | 0.315  | -0.109

Table 7: Original versus robust classifier coefficient change for causal versus non-causal terms, for corrected samples.

Conclusion and Future Work

We have presented a framework to automatically generate counterfactual training samples from causal terms and then train a robust classifier on the combination of original data and counterfactual data. Using this framework, we can easily improve classifier robustness even with few causal terms. If enough causal terms are annotated (e.g., 100 in our experiments), it is possible to achieve performance comparable to using human-generated counterfactuals. In future work, we will investigate extensions to increase the precision and recall of causal term identification, to further reduce the reliance on human supervision. Additionally, it would be interesting to extend this framework to other tasks, such as topic classification. To do so, we would need to generalize the notion of "antonyms" to include terms that indicate a different topic (e.g., to convert a sports news story to a political news story, we might change the sentence "watch the game" to "watch the debate"). We could then generate "counterfactuals" by substituting topic-related terms with terms that are not semantically related to the current topic (or are related to other topics).
Acknowledgments

This research was funded in part by the National Science Foundation under grant #1618244. Zhao Wang was funded in part by a Dissertation Fellowship from the Computer Science department at Illinois Tech. We would also like to thank the anonymous reviewers for useful feedback.

References

Dai, J.; Chen, C.; and Li, Y. 2019. A Backdoor Attack Against LSTM-Based Text Classification Systems. IEEE Access 7: 138872–138878.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
Forman, G. 2003. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 3: 1289–1305.
Garg, S.; Perot, V.; Limtiaco, N.; Taly, A.; Chi, E. H.; and Beutel, A. 2019. Counterfactual Fairness in Text Classification through Robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES '19).
Ghorbani, A.; Abid, A.; and Zou, J. 2019. Interpretation of Neural Networks Is Fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3681–3688.
He, R.; and McAuley, J. 2016. Ups and Downs. In Proceedings of the 25th International Conference on World Wide Web (WWW '16).
Hoerl, A. E.; and Kennard, R. W. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12(1): 55–67.
Imbens, G. W. 2004. Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review. Review of Economics and Statistics 86(1): 4–29.
Jia, R.; Raghunathan, A.; Göksel, K.; and Liang, P. 2019. Certified Robustness to Adversarial Word Substitutions. In Proceedings of EMNLP-IJCNLP, 4129–4142. Hong Kong, China.
Kaufman, S.; Rosset, S.; and Perlich, C. 2011. Leakage in Data Mining: Formulation, Detection, and Avoidance. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), 556–563.
Kaushik, D.; Hovy, E.; and Lipton, Z. 2020. Learning the Difference That Makes a Difference with Counterfactually-Augmented Data. In International Conference on Learning Representations (ICLR '20).
Keith, K.; Jensen, D.; and O'Connor, B. 2020. Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5332–5344.
King, G.; and Nielsen, R. 2019. Why Propensity Scores Should Not Be Used for Matching. Political Analysis 27(4): 435–454.
Kiritchenko, S.; and Mohammad, S. 2018. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, 43–53. New Orleans, Louisiana: ACL.
Landeiro, V.; and Culotta, A. 2018. Robust Text Classification under Confounding Shift. Journal of Artificial Intelligence Research 63: 391–419.
Lu, K.; Mardziel, P.; Wu, F.; Amancharla, P.; and Datta, A. 2018. Gender Bias in Neural Natural Language Processing. In Logic, Language, and Security, volume 12300, 189–202. Springer, Cham.
Martens, D.; and Provost, F. 2014. Explaining Data-Driven Document Classifications. MIS Quarterly 38(1): 73–100.
Pang, B.; and Lee, L. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 115–124.
Paul, M. J. 2017. Feature Selection as Causal Inference: Experiments with Text Classification. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada: ACL.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
Quionero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N. D. 2009. Dataset Shift in Machine Learning. The MIT Press.
Ribeiro, M. T.; Wu, T.; Guestrin, C.; and Singh, S. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Roemmele, M.; Bejan, C.; and Gordon, A. 2011. Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. In AAAI Spring Symposium.
Sagawa, S.; Raghunathan, A.; Koh, P. W.; and Liang, P. 2020. An Investigation of Why Overparameterization Exacerbates Spurious Correlations. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020).
Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2018. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. arXiv:1811.00146.
Srivastava, M.; Hashimoto, T.; and Liang, P. 2020. Robustness to Spurious Correlations via Human Annotations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, 9109–9119.
Wang, Z.; and Culotta, A. 2020. Identifying Spurious Correlations for Robust Text Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020.
Winship, C.; and Morgan, S. L. 1999. The Estimation of Causal Effects from Observational Data. Annual Review of Sociology 25(1): 659–706.
Wood-Doughty, Z.; Shpitser, I.; and Dredze, M. 2018. Challenges of Using Text Classifiers for Causal Inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), 4586–4598.
Wulczyn, E.; Thain, N.; and Dixon, L. 2017. Ex Machina: Personal Attacks Seen at Scale. In Proceedings of the 26th International Conference on World Wide Web (WWW '17).
Zaidan, O. F.; and Eisner, J. 2008. Modeling Annotators: A Generative Approach to Learning from Annotator Rationales. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). ACL.
Zmigrod, R.; Mielke, S. J.; Wallach, H.; and Cotterell, R. 2019. Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. ACL.

Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals

Proceedings of the AAAI Conference on Artificial IntelligenceMay 18, 2021

Loading next page...
 
/lp/unpaywall/robustness-to-spurious-correlations-in-text-classification-via-BXq0Ozk0ta

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Unpaywall
ISSN
2159-5399
DOI
10.1609/aaai.v35i16.17651
Publisher site
See Article on Publisher Site

Abstract

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals 1 2 Zhao Wang and Aron Culotta Department of Computer Science, Illinois Institute of Technology, Chicago, IL Department of Computer Science, Tulane University, New Orleans, LA zwang185@hawk.iit.edu, aculotta@tulane.edu Abstract system recognizes a sheep based on the grass in the back- ground (Ghorbani, Abid, and Zou 2019). If these kinds of Spurious correlations threaten the validity of statisti- spurious correlations are built into the model during train- cal classifiers. While model accuracy may appear high ing time, the model could fail when test data has a different when the test data is from the same distribution as the distribution or even on samples with minor changes, and the training data, it can quickly degrade when the test dis- predictions will be problematic and suffer from algorithm tribution changes. For example, it has been shown that fairness or trust issues. classifiers perform poorly when humans make minor modifications to change the label of an example. One One solution to achieve robustness is to learn causal asso- solution to increase model reliability and generalizabil- ciations between features and classes. E.g., in the sentence ity is to identify causal associations between features “This was a free book that sounded boring to me”, the word and classes. In this paper, we propose to train a robust most responsible for the label being negative is “boring” in- text classifier by augmenting the training data with au- stead of “free”. Identifying causal associations provides a tomatically generated counterfactual data. We first iden- way to build more robust and generalizable models. tify likely causal features using a statistical matching Recent works try to achieve robustness with the aid approach. Next, we generate counterfactual samples for the original training data by substituting causal features of human-in-the-loop systems. Srivastava, Hashimoto, and with their antonyms and then assigning opposite labels Liang (2020) present a framework to make models robust to to the counterfactual samples. Finally, we combine the spurious correlations by leveraging human common sense of original data and counterfactual data to train a robust causality. They augment training data with crowd-sourced classifier. Experiments on two classification tasks show annotations about reasoning of possible shifts in unmea- that a traditional classifier trained on the original data sured variables and finally conduct robust optimization to does very poorly on human-generated counterfactual control worst-case loss. Similarly, Kaushik, Hovy, and Lip- samples (e.g., 10%-37% drop in accuracy). However, ton (2020) ask humans to revise documents with minimal the classifier trained on the combined data is more ro- edits to change the class label, then augment the original bust and performs well on both the original test data and training data with the counterfactual samples. Results show the counterfactual test data (e.g., 12%-25% increase in accuracy compared with the traditional classifier). De- that the robust classifier is less sensitive to spurious corre- tailed analysis shows that the robust classifier makes lations. 
While these prior works show the potential of using meaningful and trustworthy predictions by emphasizing human annotations to improve model robustness, collecting causal features and de-emphasizing non-causal features. such annotations can be costly. In this paper, we propose to train a robust classifier with automatically generated counterfactual samples. Specifi- Introduction cally, we first identify likely causal features using the clos- Despite the remarkable performance machine learning mod- est opposite matching approach and then generate counter- els have achieved in various tasks, studies have shown that factual training samples by substituting causal features with statistical models typically learn correlational associations their antonyms and assigning opposite labels to the newly between features and classes, and model validity and relia- generated samples. Finally, we combine the original training bility are threatened by spurious correlations. Examples in- data with counterfactual data to train a more robust classifier. clude: a sentiment classifier learns that “Spielberg” is corre- We experiment with sentiment classification tasks on lated with positive movie reviews (Wang and Culotta 2020); two datasets (IMDB movie reviews and Amazon kindle re- a toxicity classifier learns that “gay” is correlated with toxic views). For each dataset, we have the original training data comments (Wulczyn, Thain, and Dixon 2017); a medical and testing data, and additional human-generated counter- system learns that the disease is associated with patient factual testing data. We first train a traditional classifier us- ID (Kaufman, Rosset, and Perlich 2011); an object detection ing the original data, which performs poorly on the counter- factual testing data (i.e., 10%-37% drop in accuracy). Then, Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. we train a robust classifier with the combination of orig- 14024 inal training data and automatically-generated counterfac- labeling examples). Garg et al. (2019) presume a predefined tual training data, and it performs well on both the origi- set of 50 counterfactually fair tokens and augment the train- nal testing data and the counterfactual testing data (i.e., 12% ing data with counterfactuals to improve toxicity classifier - 25% absolute improvement over the baseline). Addition- fairness. ally, we consider limited human supervision in the form of While recent works have proposed the idea of generating human-provided causal features, which we then use to gen- and augmenting with counterfactuals for robust classifica- erate counterfactual training samples. We find that a small tions, the main contributions of this paper are as follows: number of causal features (e.g., 50) results in accuracy that is • We propose to discover likely causal features using statis- comparable to a model trained with 1:7K human-generated tical matching techniques. counterfactual training samples from previous work. • Using these features, we automatically generate coun- terfactual samples by substituting causal features with Related Work antonyms, which significantly reduces human effort. Spurious correlations are problematic and could be in- troduced in many ways. Sagawa et al. (2020) investigate • We conduct experiments demonstrating the improved ro- how overparameterization exacerbates spurious correlations. bustness of the resulting classifier to spurious features. 
They compare overparameterized models with underparam- • We conduct additional analyses to show how the robust eterized models and show that overparameterization encodes classifier increases the importance of causal features and spurious correlations that do not hold in worst-group data. decreases the importance of spurious features. Kiritchenko and Mohammad (2018) showed that training data imbalances can lead to unintended bias and unfair ap- Problem and Motivation plications (e.g., bias towards gender, race). Besides that, data leakage (Roemmele, Bejan, and Gordon 2011) and distribu- To train a classification model, we fit a function f() with tion shift between training data and testing data (Quionero- a set of labeled data and learn a map between input fea- Candela et al. 2009) are particularly challenging and hard to tures and output labels. We consider a binary text classifi- detect as they introduce spurious correlations during model cation task with the simple approach of logistic regression training and hurt model performance when deployed. An- model : f(x; ) = using bag-of-words features. hx;i 1+e other new type of threat is backdoor attack (Dai, Chen, and Specifically, each document is a sequence of words d = Li 2019), where an attacker intentionally poisons a model by hw : : : w i that is transformed into a feature vector x via 1 k injecting spurious correlations into training data and manip- one-hot representation x = hx : : : x i (V is the vocabulary 1 V ulating model performance by specific triggers. size), and has a binary label y 2 f1; 1g. The model is fit on A growing line of research explores the challenges and a set of labeled documents D = f(d ; y ) : : : (d ; y )g, and 1 1 n n benefits of using causal inference to improve model robust- parameters are estimated by minimizing the loss functionL: ness. Wood-Doughty, Shpitser, and Dredze (2018) uses text arg min L(D; ). We can examine the (partial) cor- classifiers in causal analyses to address issues of missing relations between features and labels by model coefficients. data and measurement error. Keith, Jensen, and O’Connor Spurious correlations are very common in statistical mod- (2020) introduce methods to remove confounding from els and they could mislead classifiers. For example, in our causal estimates. Paul (2017) proposes a propensity score experimental dataset of Amazon kindle reviews, the classi- matching method to learn meaningful causal associations fier learns that “free” has a strong correlation with negative between variables. Jia et al. (2019) consider label preserv- sentiment because “free” has a high frequency in negative ing transformations to improve model robustness to adver- book reviews (e.g., “This was a free book that sounded bor- sarial perturbations with Interval Bound Propagation. Lan- ing to me”), and thus the classifier makes errors when pre- deiro and Culotta (2018) address the issue of spurious corre- dicting positive documents that contain “free”. lations by doing back-door adjustment to control for known Previous works have tried various methods to reduce confounders. Wang and Culotta (2020) train a classifier to spurious correlations (e.g., regularization, feature selection, distinguish between spurious features and genuine features, back-door adjustment (Hoerl and Kennard 1970; Forman and gradually remove spurious features to improve worst- 2003; Landeiro and Culotta 2018)). However, a more direct case accuracy of minority groups. 
solution is to learn meaningful causal associations between Recent works investigate how additional human supervi- features and classes. While expressing causality in the con- sion can reduce spurious correlations and improve model text of text classification can be challenging, we follow the robustness. Roemmele, Bejan, and Gordon (2011) and Sap previous work (Paul 2017) to operationalize the definition et al. (2018) show that humans achieve high performance of a causal feature as follows: term w is a causal feature in on commonsense causal reasoning and counterfactual tasks. document d if, all else being equal, one would expect w to be Zaidan and Eisner (2008) ask annotators to provide ratio- a determining factor in assigning a label to d. For example, nales as hints to guide classifiers paying attention to relevant in the sentence “This was a free book that sounded boring to features. Lu et al. (2018) and Zmigrod et al. (2019) use coun- me”, “boring” is primarily responsible for the negative senti- terfactual data augmentation to mitigate bias. Ribeiro et al. ment. In contrast, the term “free” itself does not convey neg- (2020) evaluate model robustness using generated counter- factuals that requires significant human intervention (either Our approach is model agnostic. We focus on logistic regres- by specifying substitution terms or generating templates and sion for interpretability and clarity. 14025 0 ative sentiment. We consider “boring” as a causal term and for each match is the context similarity of d and d . The “free” as a non-causal term (term refers to word feature). matched documents have opposite labels. Our approach in this paper is to first identify such causal 3. Then for each term t and its corresponding matching set features and then use them to automatically generate coun- D , we pick the tuple (d ; d ; score ) that has the match i i terfactual training samples. Specifically, for a sample (d; y), highest similarity score as the closest opposite match. 0 0 we get the corresponding counterfactual sample (d ; y ) by We then identify likely causal features by picking those (i) substituting causal terms in d with their antonyms to get whose closest opposite matches have scores greater than 0 0 0 d , and (ii) assigning an opposite label y to d . Let’s consider a threshold (0.95 is used below). the previous example to see how augmenting with counter- 4. We use PyDictionary to get antonyms for causal terms. factual samples might work. Traditional classifiers trained on original data learns that “free” is correlated with the nega- 5. For each training sample, we generate its counterfactual tive class due to its high frequency in negative book reviews. sample by substituting causal terms with antonyms and For every negative document containing “free”, we generate assigning an opposite label to the counterfactual sample. one corresponding counterfactual document. The counter- 6. Finally, we train a robust classifier using the combination factual sample for “This was a free book that sounded boring of original training data and counterfactual data. to me”(neg) would be “This was a free book that sounded in- We provide more details on these steps below. teresting to me”(pos). 
When augmenting the original training data with counterfactual data, "free" gets equal frequency in both classes in the ideal case (i.e., if we could generate counterfactual samples for all documents containing "free"). Thus, a classifier fit on the combined dataset should have a reduced coefficient for "free" and increased coefficients for "boring" and "interesting".

Methods

Our approach is a two-stage process: we first identify likely causal features and then generate counterfactual training data using those features. To identify causal features, we consider the counterfactual framework of causal inference (Winship and Morgan 1999): if word w in document d were replaced with some other word w', how likely is it that the label y would change? Since conducting randomized controlled trials to answer this question is infeasible, we instead use matching methods (Imbens 2004; King and Nielsen 2019). The intuition is as follows: if w is a reliable piece of evidence for determining the label of d, we should be able to find a very similar document d' that (i) does not contain w, and (ii) has the opposite label of d. For example, (d, y) = ("This was a free book that sounded boring to me", neg) and (d', y') = ("This was a free book that sounded interesting to me", pos) would be an ideal match, where substituting the causal term "boring" with another term "interesting" flips the label. While this is not a necessary condition of a causal feature (there may not be a good match in a limited training set), in the experiments below we find it to be a fairly precise way to generate a small number of high-quality causal features.

The full steps of our approach are as follows:

1. We first train an initial classifier and extract strongly correlated terms ⟨t_1 ... t_k⟩ as candidate causal features. E.g., for a logistic regression model, we extract features with high magnitude coefficients; for more complex models, other transparency algorithms may be used (Martens and Provost 2014).

2. For each top term t and the set of documents containing it, D_t = ⟨d_1 ... d_n⟩, we search for a set of matched documents D'_t = ⟨d'_1 ... d'_n⟩ and get D_match = {(d_1, d'_1, score_1) ... (d_n, d'_n, score_n)}, where the score for each match is the context similarity of d and d'. The matched documents have opposite labels.

3. Then, for each term t and its matching set D_match, we pick the tuple (d_i, d'_i, score_i) with the highest similarity score as the closest opposite match. We identify likely causal features by picking those whose closest opposite matches have scores greater than a threshold (0.95 is used below).

4. We use PyDictionary (https://github.com/geekpradd/PyDictionary) to get antonyms for the causal terms.

5. For each training sample, we generate a counterfactual sample by substituting causal terms with antonyms and assigning the opposite label.

6. Finally, we train a robust classifier on the combination of the original training data and the counterfactual data.

We provide more details on these steps below.
Identifying Likely Causal Features

We expect causal features to have at least some correlation with the target class, so we first fit an initial binary classifier f(x; θ) on the original training data D = {(d_1, y_1) ... (d_n, y_n)} and extract top terms ⟨t_1 ... t_k⟩ with relatively large magnitude coefficients (e.g., > 1 in the experiments below).

For a top term t and a document d containing t, we let d[t] denote the context of d with t removed. We search for another document d' such that (i) t' ∈ d' and t ∉ d', where t' is another top term, and (ii) d' has the opposite label of d. We use a best-match approach to find the context d'[t'] with the highest semantic similarity to d[t] among all candidates: d' ← argmax_d̂ sim(d[t], d̂[t']). For a term t, we thus get a set of matches D_match = {(d_1, d'_1, score_1) ... (d_n, d'_n, score_n)}, where the score for each match is the semantic similarity between d[t] and d'[t']. Each context is represented by concatenating the last four layers of a pre-trained BERT model (as recommended by Devlin et al. (2019)). We then select the match (d_i, d'_i, score_i) with the highest score in D_match as the closest opposite match for t. Table 1 shows examples of closest opposite matches.

Original sentence | Matched sentence | Context similarity
This was an amazing book. | This was a boring book. | 0.977
It was a boring read. | The book was great and long. | 0.998
This short story was a disappointment. | This was a great short story. | 0.992
This is one of the funniest movies I have seen. | This is one of the worst movies I have ever seen. | 0.980
Fantastic film. | Terrible film. | 1.00

Table 1: Examples of closest opposite matches with corresponding context similarity scores.

From the previous step, we get the closest opposite match for each top term, and we identify the terms whose closest opposite match scores exceed 0.95 as likely causal terms. To evaluate the quality of this approach, the left panel of Figure 1 plots terms annotated by a human as likely causal or not, by both their closest opposite match scores and the magnitude of their coefficients in the classifier trained on the original data. We can see that terms with very high closest opposite match scores are very likely to be causal; note that this is not necessarily the case for terms with high coefficients (y-axis). This high-precision, low-recall pattern is further supported by the right panel.

Figure 1: The "closest opposite match" score provides a high-precision indicator of causal features (IMDB dataset).
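As an illustration, the sketch below computes these context similarities with the HuggingFace transformers library. Concatenating the last four BERT layers follows the description above; the whitespace-based removal of the term, the mean-pooling over tokens, and the cosine similarity are simplifying assumptions of this sketch:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased",
                                      output_hidden_states=True)
    model.eval()

    def context_embedding(doc, term):
        """Embed d[t]: the document with candidate term t removed."""
        context = " ".join(w for w in doc.split() if w.lower() != term)
        inputs = tokenizer(context, return_tensors="pt", truncation=True)
        with torch.no_grad():
            layers = model(**inputs).hidden_states  # embeddings + 12 layers
        concat = torch.cat(layers[-4:], dim=-1)     # concatenate last four layers
        return concat.mean(dim=1).squeeze(0)        # mean-pool over tokens

    def context_similarity(d, t, d2, t2):
        a, b = context_embedding(d, t), context_embedding(d2, t2)
        return torch.cosine_similarity(a, b, dim=0).item()

    print(context_similarity(
        "This was a free book that sounded boring to me", "boring",
        "This was a free book that sounded interesting to me", "interesting"))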
Selecting Antonyms for Causal Terms

After identifying causal terms, we search for their antonyms using PyDictionary, which provides simple interfaces for getting meanings from WordNet and synonyms and antonyms from synonym.com. To reduce noise in the returned antonyms, we require each antonym to have a coefficient of opposite sign to that of its causal term. Specifically, for each causal term t, we search for antonyms as follows:

• First, check the direct antonyms of t and save those that satisfy the coefficient requirement as candidate antonyms.

• If no satisfying antonym is found, get the synonyms of t, iteratively search each synonym's antonyms, and save those that satisfy the requirement as candidate antonyms.

After these two steps, we get at least one candidate antonym {a_1 ... a_k} (k ≥ 1) for each causal term t. Table 2 shows examples of the discovered antonyms.

Causal term | Antonyms
fantastic: 1.638 | unimpressive: -0.462; inferior: -0.644
awesome: 1.202 | unimpressive: -0.462
pleasant: 1.106 | unpleasant: -0.333
dull: -1.881 | lively: 0.302; colorful: 0.252
boring: -2.592 | interesting: 0.734

Table 2: Discovered antonyms for causal terms and the corresponding coefficients from the initial classifier.

Generating Counterfactual Samples

Next, for each training document d, we identify all the causal terms in d, ⟨t_1 ... t_m⟩, and substitute each causal term with one of its corresponding antonyms; if a causal term has multiple candidate antonyms, we randomly pick one. We only generate counterfactuals for documents containing at least one causal term. Finally, we assign the opposite label to each generated sample. Table 5 shows examples of generated counterfactual sentences. While most substitutions result in reasonable sentences, future work may investigate more sophisticated language models to ensure the fluency of the generated counterfactuals. (Both the antonym selection and the substitution step are sketched in code below.)
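The following sketch combines the antonym selection rules with the substitution step, assuming PyDictionary's antonym()/synonym() interfaces; coefs (a mapping from term to initial-classifier coefficient) and causal_terms are illustrative names for quantities computed earlier:

    import random
    from PyDictionary import PyDictionary

    dictionary = PyDictionary()

    def candidate_antonyms(term, coefs):
        """Antonyms of `term` whose coefficients have the opposite sign."""
        def opposite_sign(a):
            return coefs.get(a, 0.0) * coefs.get(term, 0.0) < 0
        candidates = [a for a in (dictionary.antonym(term) or [])
                      if opposite_sign(a)]
        if not candidates:  # fall back to the antonyms of synonyms
            for syn in (dictionary.synonym(term) or []):
                candidates += [a for a in (dictionary.antonym(syn) or [])
                               if opposite_sign(a)]
        return candidates

    def counterfactual(doc, label, causal_terms, coefs):
        """Substitute causal terms with antonyms and flip the label;
        returns None if the document contains no causal term."""
        words, changed = doc.split(), False
        for i, w in enumerate(words):
            if w.lower() not in causal_terms:
                continue
            antonyms = candidate_antonyms(w.lower(), coefs)
            if antonyms:
                words[i] = random.choice(antonyms)  # random pick among candidates
                changed = True
        return (" ".join(words), -label) if changed else None  # y in {1, -1}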
and compare their performances on the original test sam- Ground truth causal terms: We manually annotated a ples and the human-generated counterfactual test samples. set of ground truth causal terms for each dataset. Specif- Table 4 shows the results. ically, we asked two student annotators to label a term as When the classifier is trained on original training samples, causal if, all else being equal, this term is a determining fac- it performs well on the original test data, but the accuracy de- tor in assigning a label to a document. While there is some grades quickly when tested on human-generated counterfac- subjectivity in the annotation, we did a round of training tual data (e.g., 20.1% absolute decrease for IMDB-L, 10.6% to resolve disagreements prior to annotation and the final decrease for IMDB-S, 37.4% decrease for Kindle). This in- agreement was generally high for this task (e.g., 96% raw dicates that spurious correlations learned in the original clas- agreement by fraction of labels that agree). sifier do not generalize well on the counterfactual test data. Table 3 shows the basic data statistics. For the top terms, When evaluating on human-generated counterfactual test we select them by thresholding on the magnitude of coeffi- samples, the classifier performance increases when we aug- cients. For IMDB-L, we use threshold 0.4, and for IMDB-S ment the original training data with counterfactual data. and Kindle, we use threshold 1.0. Even with no additional human supervision, the approach that automatically identifies causal terms outperforms the Experiments and Discussion original classifier across all datasets (13%, 5.5%, 11% ab- solute improvement). Further improvements occur with ad- Causal Term Identification ditional human supervision in the form of causal terms. Us- According to the left panel of Figure 1, we find that the simi- ing all causal terms (less than 300 terms per dataset), the larity scores of closest opposite matches seem to be a viable approach achieves comparable performance to the more ex- signal of true causal terms. The right panel shows the per- pensive baseline which requires humans to edit > 1K coun- formance of identifying causal terms when thresholding on 4 terfactual samples. the closest opposite match scores. Using threshold 0.95, we We also observe that model accuracy slightly decreases identify 32 causal terms for IMDB-L and IMDB-S datasets, on the original test data. This is because the spurious correla- of which 27 are true causal terms (i.e., precision: 84%), and tions hold in the original test data, but the importance of such 23 causal terms for Kindle dataset, of which 19 are true features is reduced in the models trained on counterfactual causal terms (i.e., precision: 83%). samples. This suggests a potential tradeoff between accuracy on a specific dataset and generalizability of the model. Robust Classification for Counterfactual Test Data Alternative Experiments We fit five binary LogisticRegression classifiers with differ- The Appendix provides additional results using more com- ent training data (using scikit-learn (Pedregosa et al. 2011)) plex neural network models (LSTM with distributed word and evaluate their performance on the original test samples representations). The baseline classification accuracy is as well as counterfactual test samples. 
Ground-truth causal terms: We manually annotated a set of ground-truth causal terms for each dataset. Specifically, we asked two student annotators to label a term as causal if, all else being equal, the term is a determining factor in assigning a label to a document. While there is some subjectivity in this annotation, we conducted a round of training to resolve disagreements prior to annotation, and the final agreement was generally high (e.g., 96% raw agreement, measured as the fraction of labels that agree).

Experiments and Discussion

Causal Term Identification

According to the left panel of Figure 1, the similarity scores of closest opposite matches are a viable signal of true causal terms. The right panel shows the performance of identifying causal terms when thresholding on the closest opposite match scores. Using threshold 0.95, we identify 32 causal terms for the IMDB-L and IMDB-S datasets, of which 27 are true causal terms (precision: 84%), and 23 causal terms for the Kindle dataset, of which 19 are true causal terms (precision: 83%).

Robust Classification for Counterfactual Test Data

We fit five binary logistic regression classifiers with different training data (using scikit-learn (Pedregosa et al. 2011)) and evaluate their performance on the original test samples as well as the counterfactual test samples. The training sets compared below have increasing requirements for human supervision: for the first and second, only the original training data is required; for the third and fourth, a human provides a list of causal terms, either selected from the list of top terms or from the entire vocabulary; in the final setting, humans manually create counterfactual training samples (equivalent to the approach of Kaushik, Hovy, and Lipton (2020)). The five levels of human supervision are:

1. Only the original training samples.

2. The original training samples, augmented with counterfactual training samples automatically generated using predicted causal terms.

3. The original training samples, augmented with counterfactual samples automatically generated using human-annotated causal terms selected from the top words (65 for IMDB-L, 80 for IMDB-S, and 76 for Kindle).

4. The original training samples, augmented with counterfactual samples automatically generated using human-annotated causal terms selected from the entire vocabulary (282 for IMDB-L, 285 for IMDB-S, and 264 for Kindle).

5. The original training samples, augmented with human-generated counterfactual training samples. (We lack human-generated counterfactual training samples for the Kindle dataset, so we omit that result from Table 4.)

We train a classifier on each of the five training sets and compare their performance on the original test samples (Orig) and the human-generated counterfactual test samples (CTF); the evaluation loop is sketched in code at the end of this subsection. Table 4 shows the results.

Counterfactual train samples | Causal terms | IMDB-L Orig / CTF | IMDB-S Orig / CTF | Kindle Orig / CTF
not used | not used | .816 / .615 | .711 / .605 | .888 / .514
auto-generated | predicted from top words | .742 / .744 | .685 / .660 | .866 / .624
auto-generated | annotated from top words | .760 / .818 | .679 / .696 | .882 / .662
auto-generated | annotated from whole vocabulary | .773 / .857 | .685 / .726 | .752 / .720
human-generated | not used | .818 / .869 | .705 / .762 | n/a

Table 4: Classification accuracy on original (Orig) and human-generated counterfactual (CTF) test data.

When the classifier is trained only on the original training samples, it performs well on the original test data, but accuracy degrades quickly on the human-generated counterfactual data (20.1% absolute decrease for IMDB-L, 10.6% for IMDB-S, and 37.4% for Kindle). This indicates that the spurious correlations learned by the original classifier do not generalize to the counterfactual test data.

On the human-generated counterfactual test samples, performance increases when we augment the original training data with counterfactual data. Even with no additional human supervision, the approach that automatically identifies causal terms outperforms the original classifier across all datasets (13%, 5.5%, and 11% absolute improvement). Further improvements come from additional human supervision in the form of causal terms: using all annotated causal terms (fewer than 300 per dataset), the approach achieves performance comparable to the more expensive baseline that requires humans to edit more than 1K counterfactual samples.

We also observe that model accuracy slightly decreases on the original test data. This is because the spurious correlations still hold in the original test data, while the importance of such features is reduced in the models trained on counterfactual samples. This suggests a potential tradeoff between accuracy on a specific dataset and the generalizability of the model.
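The evaluation protocol above reduces to the short loop below; the construction of the five training sets is assumed to have happened already, and all names are illustrative:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    def evaluate(training_sets, orig_test, ctf_test):
        """training_sets maps a setting name to (docs, labels);
        orig_test and ctf_test are (docs, labels) pairs."""
        for name, (docs, labels) in training_sets.items():
            vec = CountVectorizer(binary=True).fit(docs)
            clf = LogisticRegression().fit(vec.transform(docs), labels)
            acc_orig = clf.score(vec.transform(orig_test[0]), orig_test[1])
            acc_ctf = clf.score(vec.transform(ctf_test[0]), ctf_test[1])
            print(f"{name}: Orig={acc_orig:.3f}  CTF={acc_ctf:.3f}")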
cause the causal term “good” has a small positive coefficient and the prediction is misled by the spuriously correlated Performance Change with Different Number of negative term “movie” . The robust classifier corrects this Human Annotated Causal Terms prediction by increasing the coefficient of the causal term “good” and decreasing the coefficient of the non-causal term To further investigate how many human-provided causal “movie.” terms are needed to improve robustness, Figure 2 shows the We conduct a final analysis to explore the impact of classification performance with different numbers of causal causal versus non-causal terms when correcting misclassi- terms used for generating counterfactual samples. The qual- fications. For each corrected sample, we compute separately ity of automatically generated counterfactuals depends on the change in coefficient magnitudes for causal and non- the causal terms used for antonym substitutions. We ob- causal terms. We then aggregate across all corrected sam- serve that performance seems to plateau after about 100 ples to summarize the impact each type of correction has. causal terms, which suggests that we can get similar perfor- As shown in Table 7, for IMDB-L, increasing coefficients mance by annotating 100 causal terms, as opposed to creat- of causal terms is more important than decreasing coeffi- ing > 1K counterfactual training samples. The cause of the cients of non-causal terms, and the reverse is true for the plateau is likely due to the infrequency of subsequent terms other two datasets. This suggests that document length is and the fact that such terms co-occur with other causal terms, an important factor in determining whether increasing co- so they do not result in many new counterfactual samples. efficients of causal terms has bigger impacts or decreasing coefficients of non-causal terms has bigger impacts. Exam- Coefficient Change for Causal and Non-causal ining the average coefficient change of each term, the robust Terms To understand why training with counterfactual data im- In the data, “film” correlates with high ratings, while “movie” proves classifier robustness, Table 5 shows examples of the correlates with low ratings. 14029 Term Original coef Robust coef Original sentence Counterfactual sentence movie -0.236 0.028 Terrible movie Fantastic movie Non-causal free -1.41 -0.919 This was a free book that This was a free book that terms sounded boring to me. sounded interesting to me. awesome 0.584 1.838 He was an awesome actor. He was an awful actor. Causal terrible -1.283 -2.336 The whole movie consists of The whole movie consists terms terrible dialogue. of pleasant dialogue. Table 5: Coefficient change of causal and non-causal terms. Corrected samples Original coef Robust coef Acknowledgments good:0.231 good:0.714 Really good movie.(pos) This research was funded in part by the National Science movie:-0.236 movie:0.028 Foundation under grant #1618244. Zhao Wang was funded dubbing:-0.472 dubbing:-0.1 The dubbing was as good in part by a Dissertation Fellowship from the Computer Sci- good:0.231 good:0.714 as I have seen.(pos) ence department at Illinois Tech. We would also like to thank story:-0.171 story:-0.083 the anonymous reviewers for useful feedback. The story was incredibly incredibly:-0.874 incredibly:0.029 interesting.(pos) interesting:-0.874 interesting:1.012 References Dai, J.; Chen, C.; and Li, Y. 2019. A Backdoor Attack Table 6: Explanation for robust classifier corrected samples. 
Conclusion and Future Work

We have presented a framework to automatically generate counterfactual training samples from causal terms and then train a robust classifier on the combination of original and counterfactual data. Using this framework, we can improve classifier robustness even with few causal terms, and if enough causal terms are annotated (e.g., 100 in our experiments), it is possible to achieve performance comparable to using human-generated counterfactuals. In future work, we will investigate extensions that increase the precision and recall of causal term identification to further reduce the reliance on human supervision. Additionally, it would be interesting to extend this framework to other tasks such as topic classification. To do so, we would need to generalize the notion of "antonyms" to include terms that indicate a different topic (e.g., to convert a sports news story to a political news story, we might change the sentence "watch the game" to "watch the debate"). We could then generate "counterfactuals" by substituting topic-related terms with terms that are not semantically related to the current topic (or that are related to other topics).

Acknowledgments

This research was funded in part by the National Science Foundation under grant #1618244. Zhao Wang was funded in part by a Dissertation Fellowship from the Computer Science department at Illinois Tech. We would also like to thank the anonymous reviewers for useful feedback.

References

Dai, J.; Chen, C.; and Li, Y. 2019. A Backdoor Attack Against LSTM-Based Text Classification Systems. IEEE Access 7: 138872–138878.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.

Forman, G. 2003. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 3: 1289–1305.

Garg, S.; Perot, V.; Limtiaco, N.; Taly, A.; Chi, E. H.; and Beutel, A. 2019. Counterfactual Fairness in Text Classification through Robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES '19.

Ghorbani, A.; Abid, A.; and Zou, J. 2019. Interpretation of Neural Networks Is Fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3681–3688.

He, R.; and McAuley, J. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16.

Hoerl, A. E.; and Kennard, R. W. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12(1): 55–67.

Imbens, G. W. 2004. Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review. Review of Economics and Statistics 86(1): 4–29.

Jia, R.; Raghunathan, A.; Göksel, K.; and Liang, P. 2019. Certified Robustness to Adversarial Word Substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4129–4142. Hong Kong, China.

Kaufman, S.; Rosset, S.; and Perlich, C. 2011. Leakage in Data Mining: Formulation, Detection, and Avoidance. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, 556–563. New York, NY, USA.
Kaushik, D.; Hovy, E.; and Lipton, Z. 2020. Learning the Difference That Makes a Difference with Counterfactually-Augmented Data. In International Conference on Learning Representations, ICLR '20.

Keith, K.; Jensen, D.; and O'Connor, B. 2020. Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5332–5344. Online: ACL.

King, G.; and Nielsen, R. 2019. Why Propensity Scores Should Not Be Used for Matching. Political Analysis 27(4): 435–454.

Kiritchenko, S.; and Mohammad, S. 2018. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, 43–53. New Orleans, Louisiana: ACL.

Landeiro, V.; and Culotta, A. 2018. Robust Text Classification under Confounding Shift. Journal of Artificial Intelligence Research 63: 391–419.

Lu, K.; Mardziel, P.; Wu, F.; Amancharla, P.; and Datta, A. 2018. Gender Bias in Neural Natural Language Processing. In Logic, Language, and Security, volume 12300, 189–202. Springer, Cham.

Martens, D.; and Provost, F. 2014. Explaining Data-Driven Document Classifications. MIS Quarterly 38(1): 73–100.

Pang, B.; and Lee, L. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 115–124. ACL.

Paul, M. J. 2017. Feature Selection as Causal Inference: Experiments with Text Classification. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada: ACL.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.

Quionero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N. D. 2009. Dataset Shift in Machine Learning. The MIT Press.

Ribeiro, M. T.; Wu, T.; Guestrin, C.; and Singh, S. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ACL.

Roemmele, M.; Bejan, C.; and Gordon, A. 2011. Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. In AAAI Spring Symposium - Technical Report.

Sagawa, S.; Raghunathan, A.; Koh, P. W.; and Liang, P. 2020. An Investigation of Why Overparameterization Exacerbates Spurious Correlations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020.

Sap, M.; Bras, R. L.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N. A.; and Choi, Y. 2018. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. arXiv abs/1811.00146.

Srivastava, M.; Hashimoto, T.; and Liang, P. 2020. Robustness to Spurious Correlations via Human Annotations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, 9109–9119.

Wang, Z.; and Culotta, A. 2020. Identifying Spurious Correlations for Robust Text Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020.

Winship, C.; and Morgan, S. L. 1999. The Estimation of Causal Effects from Observational Data. Annual Review of Sociology 25(1): 659–706.

Wood-Doughty, Z.; Shpitser, I.; and Dredze, M. 2018. Challenges of Using Text Classifiers for Causal Inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, 4586–4598.

Wulczyn, E.; Thain, N.; and Dixon, L. 2017. Ex Machina: Personal Attacks Seen at Scale. In Proceedings of the 26th International Conference on World Wide Web, WWW '17.

Zaidan, O. F.; and Eisner, J. 2008. Modeling Annotators: A Generative Approach to Learning from Annotator Rationales. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08. ACL.

Zmigrod, R.; Mielke, S. J.; Wallach, H.; and Cotterell, R. 2019. Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. ACL.
