Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Counterfactual Inference for Text Classification Debiasing

Counterfactual Inference for Text Classification Debiasing Counterfactual Inference for Text Classification Debiasing Chen Qian Fuli Feng Lijie Wen Tsinghua University National University of Singapore Tsinghua University qc16@mails.tsinghua.edu.cn fulifeng93@gmail.com wenlj@tsinghua.edu.cn Chunping Ma Pengjun Xie Alibaba DAMO Academy Alibaba DAMO Academy chunping.mcp@alibaba-inc.com chengchen.xpjg@taobao.com Abstract partisanship recognition (Kiesel et al., 2019) and spam detection (Castillo et al., 2007). Machine Today’s text classifiers inevitably suffer from learning models have become the default choice unintended dataset biases, especially the of solving text classification, owing to their abil- document-level label bias and word-level key- ity to recognize the textual patterns from the la- word bias, which may hurt models’ general- beled documents (Kim, 2014; Howard and Ruder, ization. Many previous studies employed data- 2018). Nevertheless, they are at the risk of inad- level manipulations or model-level balancing mechanisms to recover unbiased distributions vertently capturing and even amplifying the unin- and thus prevent models from capturing the tended dataset biases (Zhao et al., 2017; Zhang two types of biases. Unfortunately, they ei- et al., 2020; Feder et al., 2020; Blodgett et al., ther suffer from the extra cost of data col- 2020), which can be at document-level (i.e., label lection/selection/annotation or need an elab- bias) and word-level (i.e., keyword bias). orate design of balancing strategies. Differ- The label bias issue occurs in the scenarios ent from traditional factual inference in which where a portion of the categories possesses a ma- debiasing occurs before or during training, jority of training examples than others. For ex- counterfactual inference mitigates the influ- ence brought by unintended confounders after ample, the label distribution of a binary sentiment training, which can make unbiased decisions analysis dataset could be 95%:5% (Dixon et al., with biased observations. Inspired by this, 2018). Many previous studies found that the mod- we propose a model-agnostic text classifica- els trained on such data are potentially at the risk tion debiasing framework – C ORSAIR, which of simply predicting the majority answers (Dixon can effectively avoid employing data manip- et al., 2018; Zhang et al., 2020). The keyword ulations or designing balancing mechanisms. bias issue occurs in the situation where trained Concretely, CORSAIR first trains a base model on a training set directly, allowing the dataset models exhibit excessive correlations between cer- biases “poison” the trained model. In infer- tain words and categories, e.g., some sentiment- ence, given a factual input document, C OR- irrelevant words – “black” or “islam” – are always SAIR imagines its two counterfactual counter- connected to negative category. As such, mod- parts to distill and mitigate the two biases cap- els always lean to unfairly predict any document tured by the poisonous model. Extensive ex- containing those keywords to a specific category periments demonstrate CORSAIR’s effective- according to the biased statistical information in- ness, generalizability and fairness. stead of intrinsic textual semantics (Waseem and 1 Introduction Hovy, 2016; Liu and Avci, 2019). The serious disadvantages limit models’ generalization, espe- Text classification, mapping text documents to a cially in the scenarios where the training data is set of predefined categories, is a fundamental and differently-distributed with the testing data (Niu important technique serving for many applications et al., 2021; Goyal et al., 2017). such as sentiment analysis (Qian et al., 2020b), To resolve the issues, an effective solution is to This work was partly done during Chen Qian’s intern- perform data-level manipulations (e.g., resampling ship at Alibaba DAMO academy. Fuli Feng and Lijie Wen (Qian et al., 2020b)), which effectively transforms are the co-corresponding authors. 1 a training set to a relatively balanced one before The code is available at https://github.com/ qianc62/Corsair. training. Another line of debiasing work typically Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 5434–5445 August 1–6, 2021. ©2021 Association for Computational Linguistics designs model-level balancing mechanisms (e.g., a base model on an original training set, allowing reweighting (Zhang et al., 2020)), aiming to adap- the unintended dataset biases “poison” the model. tively decrease the influence of majority categories To “rescue” the testing documents from the poi- while increasing the minority during training. The sonous model, in testing, for each factual input core of the two types of solutions is to explicitly document, C ORSAIR imagines its two types of or implicitly recover unbiased distributions and counterfactual counterparts to produce two coun- prevent models from capturing the unintended bi- terfactual outputs as the distilled label bias and ases. Unfortunately, the data-level strategy typi- keyword bias. Lastly, CORSAIR performs a bias cally suffers from the extra manual cost of data removal operation to produce a counterfactual pre- collection, selection and annotation (Zhang et al., diction that corresponds to a debiased decision. 2020), requires much longer training time and nor- To verify, we perform extensive experiments on mally enlarges the gap between training and test- multiple public benchmark datasets. The results ing data distributions. The model-level strategy demonstrate our proposed framework’s effective- typically needs elaborate selection or definition ness, generalizability and fairness, proving that of balancing strategies and needs relearning from C ORSAIR, when employed on four different types scratch once certain balancing mechanisms (e.g., of base models, is significantly helpful to mitigate an unbiased training objective) are redesigned. the two types of dataset biases. Must machine learning models perform debias- 2 Methodology ing before or during training? Think about the dif- ference in the decision making processes between Problem Formalization Let X and Y denote machines and humans. Machine learning systems the input (text document) and output (category) are forced to imitate the behavior from observa- spaces, respectively. Given a labeled training set tions via maximizing the prior probability, from D = f(x ; y ) 2 X  Yg (i.e., the observed train i i which the decision is directly drawn during infer- data), the goal is to learn a text classifier M on ence. By contrast, we humans, although born and D , which serves as a mapping function f() : train raised in a biased nature, have the ability of coun- X 7! Y to accurately classify testing examples in terfactual inference to make unbiased decisions D = fx^jx^ 2 Xg. test with biased observations (Niu et al., 2021). To il- Considering that the dataset biases would not be lustrate, we briefly compare the traditional factual completely eliminated via data manipulations, em- inference and the counterfactual inference in text ploying data manipulations (e.g., resampling) or classification: designing balancing mechanisms (e.g., reweight- Factual Inference: What will the prediction be ing) may be not a directly-reasonable solution. In- if seeing an input document? spired by the success of counterfactual inference Counterfactual Inference: What will the pre- in mitigating biases in computer vision (Niu et al., diction be if seeing the main content of an input 2021; Wang et al., 2020; Tang et al., 2020; Yang document only and had not seen the confounding et al., 2020; Goyal et al., 2017), we propose a dataset biases? counterfactual-inference-based text-classification debiasing framework (CORSAIR), which is able The counterfactual inference essentially gifts hu- to make unbiased decisions with biased observa- mans the imagination ability (i.e., had not done) tions. The core idea of C ORSAIR is to train a to make decisions with a collaboration of the main “poisonous” text classifier regardless the dataset content and the confounding biases (Tang et al., biases and post-adjust the biased predictions ac- 2020), as well as to introspect whether our deci- cording to the causes of the biases in inference. sion is deceived (Niu et al., 2021), i.e., counter- It’s worth mentioning that our proposed CORSAIR factual inference leads to debiased prediction. can be applied to almost any parameterized base Inspired by this, we propose a novel model- model, including traditional one-stage classifiers agnostic paradigm (CORSAIR), which adopts fac- (e.g., T EXTCNN (Kim, 2014), RCNN (Lai et al., tual learning before mitigating the negative influ- 2015) and L ECO (Qian et al., 2020b)) and cur- ence of the dataset biases in inference (i.e., after rently prevalent two-stage classifiers (e.g., U LM- training), without the need of employing data ma- nipulations or designing balancing mechanisms. 2 For brevity, two-stage classifiers refer to two-stage lan- Concretely, in training, C ORSAIR directly trains guage models with an additional prediction layer. 5435 M Biased Learning Bias Distillation f(x)=Y2 Bias Removal (on training set) (on testing set) (on testing set) w1 w2 w3 w8 Y1 λ X Y * * λ λ w2 w5 wn Y2 Factual World Factual Output Label Bias (Traditional Inference) (i.e., Factual Prediction) λ Document Counterfactual Input M Word-Masked Document w6 w1 w8 w2 w8 Ym w1 w2 w3 w4 w5 see nothing Probability Distribution or Logit Vector w1 w2 w3 w4 w5 Keyword Bias Factual Input Imagination ˜ Scaling Factor λ λ w1 w2 w3 Y1 Y2 Elastic Scaling (on validation set) Counterfactual Output Counterfactual World Element-Wise Subtraction w4 w5 … Y3 … (i.e., Label Bias) (Fully-Blindfolded) Vocabulary Categories Counterfactual Input c(x)=Y4 M w1 w2 w3 w4 w5 see some * * = λ λ Imagination X̃ Y Factual Label Keyword Counterfactual X Y Counterfactual World Counterfactual Output Prediction Bias Bias Prediction Factual Learning (Partially-Blindfolded) (i.e., Keyword Bias) Figure 1: The architecture of our proposed model-agnostic framework (CORSAIR). Specifically, C ORSAIR first trains a base model on the training data directly so as to preserve the dataset biases in the trained model. In the inference phase, given a factual input document, CORSAIR first imagines its two types of counterfactual documents to produce two counterfactual outputs as the distilled label bias and keyword bias. Finally, C ORSAIR searches two adaptive parameters to perform bias removal to produce a counterfactual prediction for a debiased answer. F IT (Howard and Ruder, 2018), BERT (Devlin operation on the trained base model to obtain the et al., 2019) and RoBERTa (Liu et al., 2019)). probability distribution over Y (i.e., factual pre- For brevity, we will elaborate C ORSAIR by taking diction) for a most possible answer. However, in RoBERTa (a robustly optimized BERT-shape addition to the textual contents of the document, language model) as the example base model, and the prediction is also affected by unintended con- binary sentiment analysis as the example applica- founders (Pearl and Mackenzie, 2018) which may tion. The high-level architecture of CORSAIR is il- produce the label bias and keyword bias. Aiming lustrated in Figure 1, which consists of three main to obtain unbiased prediction, the key is to debias components: biased learning, bias distillation and during inference by blocking the spread of the bi- bias removal. ases from learning to inference. To achieve that, inspired by the counterfactual studies in causal 2.1 Biased Learning reasoning (Niu et al., 2021; Tang et al., 2020), we design an effective strategy based on causal inter- In the learning phase (i.e., training), C ORSAIR first vention (Pearl, 2013; Pearl and Mackenzie, 2018) trains the base model RoBERTa to learn a mapping to distill the potentially-harmful biases captured relation based on training data. Similar to tradi- by the trained model (Niu et al., 2021; Tang et al., tional training, C ORSAIR uses feedforward to pre- 2020), and then mitigate them via bias removal. dict batch examples and backward to update those learnable parameters in an end-to-end fashion. In 2.2.1 Causal Graph practice, we adopt the standard cross entropy as the training objective (i.e., loss function): Aiming to conduct proper causal intervention, we first formulate the causal graph (Pearl, 2013; Pearl X X L() =  ln i;y i;y and Mackenzie, 2018; Tang et al., 2020) for the (1) i=1 y2Y text classification models (see the left-bottom part = softmax(f(x )) i i of Figure 1), which sheds light on how the docu- where  denotes the learnable parameters of the ment contents and dataset biases affecting the pre- base model f(), n is the number of batch exam- diction. Formally, a causal graph is a directed ples,  is the ground-truth label distribution (over acyclic graph G = (N;E), indicating how a set Y ) and  is the predicted probability distribution i of variables N causally interact with each other (overY ) for a given training example x . i through the causal links E . It provides a sketch of the causal relations behind the data and how 2.2 Bias Distillation variables obtain their values (Tang et al., 2020), In the inference phase (i.e., testing), traditional de- e.g., (X; M)!Y . In this causal graph, X , Y and biasing methods making predictions for each test- M denote a text document’s embedding, its corre- ing document via the conventional feedforward sponding prediction and the trained model which 5436 inevitably captures unintended confounders exist- information is given. Thus, the fully-blindfolded ing in training data, respectively. counterfactual output: P (Yjdo(X)) = f(x^) = f(hw ; w ; ; w i) 1 2 n 2.2.2 Label Bias Distillation (4) 8w 2 x; ^ w [MASK] i i According to the causal graph, we diagnose how naturally reflects as the label bias captured by M , the dataset biases existing in training data misleads where [MASK] is a special token to mask a single inference. Concretely, by using Bayes rule (Wang word. Due to x^ is fully-blindfolded and indepen- et al., 2020), we can view the inference as: dent with trained model M , in implementation, we follow Wang et al. (2020) to use the average doc- f(x) = P (YjX) = P (YjX; c)P (cjX) (2) c ument feature on the whole training set as its em- bedding of the counterfactual document. where c could be any confounder captured by the model trained on a biased training set (e.g., 2.2.3 Keyword Bias Distillation the overwhelming majority of training documents Inspired by the factual inference where all tex- fall in P OSITIVE). Under such circumstances, tual information in test documents are exposed once the training documents corresponding to the to the base model and the fully-blindfolded case P OSITIVE category are dominating than NEGA- where all textual information in each test docu- TIVE, the trained model tends to build strong spu- ment are not exposed, we make the first attempt to rious connections between testing documents and utilize a partially-blindfolded counterfactual docu- P OSITIVE, achieving high accuracy even with- ment where some words in the test document x are out knowing testing documents’ main contents. masked to distill the keyword bias from the trained As such, the model is inadvertently contaminated base model. by the spurious causal correlation: X M!Y , Specifically, we deliberately expose some a.k.a. a back-door path in causal theory (Pearl words which may potentially cause spurious cor- and Mackenzie, 2018; Pearl, 2013). To decouple relations (e.g., the spurious “black”-to-NEGATIVE the spurious causal correlation, the back-door ad- mapping) to the trained model to exhibit their justment (Pearl and Mackenzie, 2018; Pearl, 2013; potentially negative influence. Some evil words Pearl et al., 2016) predicts an actively intervened may serve as unintended confounders (Tang et al., answer via the do() operation: 2020), splitting a document into two pieces: main P (Yjdo(X)) = P (YjX = x^) = f(x^) (3) content and relatively-unimportant context. In the following, we use x~ to denote another counterfac- where x^ could be any counterfactual embedding tual document where the main-content words in as long as it is no longer dependent on M to detach a test document x are masked while other con- the connection between X and M . As illustrated text words are not, and f(x~) as the corresponding in the fully-blindfolded counterfactual world in counterfactual output. To achieve that, an effective Figure 1, the causal intervention operation wipes masking strategy is to use discriminative text sum- out all the in-coming links of a cause variable X , marization methods to extract the main content of which encourages the model M to inference with- the document, before masking content words (im- out seeing any testing document, i.e., RoBERTa portant classification clues) and exposing others should be fully blind in order to detaching the as potentially harmful biasing factors. Since the connection between M and X . To achieve that, model is forced to see only the non-masked con- we use x^ to denote the imagined fully-blindfolded text words in x, f(x~) actually reflects the influence counterfactual document where all words in the from both the potentially harmful contexts and test document x are consistently masked (to cre- the trained model. Thus, the partially-blindfolded ate a counterfactual embedding), and f(x^) as the counterfactual output: corresponding counterfactual output via feedfor- f(x~) = f(hw ; w ; ; w i) 1 2 n ward through the trained model. Since the model (5) w [MASK] if w 2 x cannot see any word in the factual input x after i i content 8w 2 x; ~ w w if w 2 x i i i context fully blindfolding, f(x^) actually reflects the pure influence from the trained base model M . Further- naturally reflects as the keyword bias captured by more, f(x^) refers to the output (e.g., a probabil- M for a specific text document x, where x content ity distribution or a logit vector) where no textual and x denote the main content and the con- context 5437 text of x, respectively. Inspired by a recent coun- where is a metric function (e.g., recall, precision terfactual word-embedding study of Feder et al. and F -score) to evaluate the performance on the (2020), to realize discriminative text summariza- validation set D =(X ; Y ); a and b are the dev dev dev tion, we use Jieba tool, whose TextRank-based boundaries of the search range. The two factors interface can effectively extract the words that are at dataset-level and thus searched only once for may influence the semantics of a sentence as con- each validation set, and would be used in inference tent, leaving potentially discriminative/unfair key- for all testing documents. words (e.g., stop words, a part of adjectives, and 3 Evaluation semantically-unimportant particles) as contexts. Empirically, the average ratio of contents to con- Baselines We choose four types of represen- texts produced by Jieba on all datasets is approxi- tative text classifiers as the base models of mately 62.03%:37.97%. our proposed framework, covering classical, data-manipulation-based, model-balancing-based, 2.3 Bias Removal as well as large-scale and two-stage methods. Our final goal is to use the direct effect from X to T EXTCNN (Kim, 2014) is a classical classifier Y for debiased prediction, removing (n) the label that uses convolutional neural networks (CNN) bias and the keyword bias existing in training data with scale-variant convolution filters to capture lo- (i.e., blocking the spread of the biases from train- cal textual features, which may potentially cap- ing data to inference): f(x)nf(x^)nf(x~). The de- ture spurious correlations between certain key- biased prediction via bias removal can be formal- words and categories. L ECO (Qian et al., 2020b) ized via the conceptually simple and empirically utilizes the combination of the implicit encod- powerful element-wise subtraction operation: ing of deep linguistic information and the ex- ^ ~ c(x) = f(x)nf(x^)nf(x~) = f(x) f(x^) f(x~) (6) plicit encoding of morphological features, which would also capture the keyword bias inadvertently. where f(x) and c(x) correspond to the traditional factual prediction and our counterfactual predic- Besides, it uses a sentence-level over-sampling mechanism (He and Garcia, 2009) to mitigate the tion, respectively; f(x^) and f(x~) correspond to the label bias and the keyword bias distilled from label bias, and we further enhance it via a pow- ^ ~ erful word-level augmentation technique (EDA) the trained base model, respectively;  and are two independent parameters balancing the two (Wei and Zou, 2019) to mitigate the keyword bias, types of biases. denoted as L ECOEDA. W EIGHT (Zhang et al., 2020) is a most recent debiasing text classifier that Note that the two distilled biases could be prob- uses a specially-designed reweighting technique ability distributions over all categories or logit under an unbiased objective for fair (i.e., non- vectors (i.e., without normalization), and they typ- discrimination) learning, which is proven effective ically do not contribute completely equally to to mitigate the unfairness or discrimination issue the final classification. As such, in Equation 6, caused by unintended dataset biases. RoBERTa directly subtracting without adaptive parameters ^ ~ (Liu et al., 2019) is an improved version of BERT, (i.e., == ) would cause that mitigating a certain whose effective modifications allow RoBERTa to bias too much or too less for a specific testing set. generalize better and match or exceed the perfor- Therefore, we propose the elastic scaling mecha- mance of many post-BERT methods, serving as a nism to search two adaptive parameters (scaling ^ ~ very strong baseline in recent work (Gururangan factors) –  and  – on the validation set to et al., 2020). amplify or penalize the two biases, which would dynamically adapt to different datasets accord- Datasets We use multiple English benchmark ing to the extent to which two biases in training datasets (used mainly in academic commu- set “poison” the validation set. In practice, elas- nity): HyperPartisan (Kiesel et al., 2019), Twit- tic scaling can be implemented using grid beam ter (Huang et al., 2017), ARC (Jurgens et al., search (Hokamp and Liu, 2017) in a scoped two- 2018), SCIERC (Luan et al., 2018), ChemProt dimensional space: (Kringelum et al., 2016), Economy (Huang and ^ ~ ^ ~ ^ ~ ;  = arg max (D ; c(x; ; )) ;  2 [a; b] (7) dev Paul, 2018), News (Lang, 1995), Parties (Huang ^ ~ and Paul, 2018), YelpHotel (Zhang et al., 2014); https://github.com/fxsjy/jieba and also randomly collect real-world query- 5438 Table 1: Statistics of the datasets. #D denotes the aver- traditional factual predictions and two counter- age number of characters per document. #C denotes the factual outputs to produce counterfactual predic- number of categories. #Train, #Dev and #Test denote tions, the comparison between each baseline and the number of training set, validation set and testing set, its CORSAIR-equipped counterparts highlights the respectively. importance of the counterfactual inference, which Dataset Domain/Genre #D #C #Train" #Dev #Test is largely ignored by most of previous text clas- HYP Political News 3,265.64 2 516 64 65 TWI Social Network 84.32 2 1,631 272 272 sification methods. Particularly, CORSAIR can ARC Computer Science 222.49 6 1,688 125 128 SCI Computer Science 192.92 7 3,219 712 717 even benefit the data-manipulation-based method CHE Biomedicine 220.28 13 4,169 2,944 2,952 ECO Finance 1,152.22 2 4,744 595 596 (i.e., L ECOEDA) and the model-balancing-based NEW News 1,801.20 20 9,445 4,689 4,694 PAR Political Speech 140.31 2 10,059 2,012 2,012 method (i.e., WEIGHT) consistently, which in turn YEL User Comment 651.73 3 20,975 6,991 6,993 TAO E-Commerce 8.09 143 68,086 6,949 7,022 verifies our initial intuition that the dataset biases SUN E-Commerce 7.70 56 234,074 50,851 50,844 would not be completely eliminated via data ma- nipulations merely, and further illuminates our key category pairs (used in industrial community) insight – preserving biases in models before debi- from two famous Chinese e-commerce platforms: asing in inference. 4 5 Taobao and Suning . For brevity, we will use the We can also notice that CORSAIR sometimes first three letters to denote each dataset (e.g., HYP hurts performance (e.g., RoBERTa+CORSAIR on for HyperPartisan). The statistics of the datasets HYP and ARC); we conjecture the phenomenon are summarized in Table 1. comes from the small-scale data, making the giant model RoBERTa overfits and thus “fail” to dis- Metric We use the widely-used macro-F met- till two potential biases that are identically dis- ric, which is the balanced harmonic mean of pre- tributed with the ideal distributions of factual bi- cision and recall. Furthermore, macro-F is more ases. Moreover, finetuning a RoBERTa model on suitable than micro-F to reflect the extent of the large-datasets (e.g., SUN) would take about 36 dataset biases, especially for the highly-skewed hours, nearly 50 times that of training a WEIGHT cases, since macro-F is strongly influenced by model (about 44 minutes); we thus suggest to use the performance in each category (i.e., category- lightweight base models in practice with consid- sensitive) but micro-F easily gives equal weight ering systems’ robustness and efficiency. Besides, over all documents (i.e., category-agnostic) (Kim the proposed framework works only in inference et al., 2019). and can thus be employed on the previous already- Implementation Details The search range in trained models. Therefore, by leveraging coun- Equation 7 is set as [2:0; 2:0]. Each training terfactual inference, our approach can serve as a is run for 10 epochs with the Adam optimizer powerful, “data-manipulation-free” and “model- (Kingma and Ba, 2015), a mini-batch size of 16, balancing-free” weapon to enhance different types a learning rate of 2e , and a dropout rate of 0.1. of text classification methods. We implement C ORSAIR via Python 3.7.3 and Py- 3.2 Bias Analysis torch 1.0.1. All of our experiments are run on a machine equipped with seven standard NVIDIA According to Sweeney and Najafian (2020), the TITAN-RTX GPUs. more imbalanced/skewed a prediction produced by a trained model is, the more unfair opportuni- 3.1 Overall Performance ties it gives over predefined categories, the more We report the average results over five different unfairly-discriminative the trained model is. We initiations in Table 2. We can observe that COR- thus follow previous work (Xiang and Ding, 2020; SAIR consistently improves the four types of rep- Sweeney and Najafian, 2020) to use the metric – resentative baselines on almost all datasets with a imbalance divergence – to evaluate whether a pre- significance level, regardless of the languages, do- diction (normally a probability distribution) P is mains, volumes and applications of the datasets, imbalanced/skewed/unfair: which validates the effectiveness and the general- D(P; U) =JS(PjjU) (8) izability of the proposed framework. Furthermore, since CORSAIR performs debiasing between the where D() is defined as the distance of P and the uniform distribution U (with jPj elements). Con- https://www.taobao.com https://www.suning.com cretely, we use the JS divergence as the distance 5439 Table 2: Experimental results (F ; %) of all methods on all benchmark datasets (higher is better). For each dataset, the best-performing results among all methods are highlighted with boldfaces. For each baseline, the best- performing results between the baseline and our approach are highlighted with. y denotes statistical significance (p0.05) between a baseline and the counterpart employed on our framework. Method HYP TWI ARC SCI CHE ECO NEW PAR YEL TAO SUN AVG. T EXTCNN 40.48 65.94 12.46 10.09 18.96 46.07 12.07 54.94 51.49 08.16 10.90 30.14 – y y y y y y y y y y y T EXTCNN+C ORSAIR 46.71 69.03 17.03 19.85 22.55 59.74 16.18 56.39 58.37 08.70 14.20 35.34 5.20" L ECOEDA 58.78 72.43 52.64 22.37 30.22 60.81 54.39 57.33 60.60 12.02 17.17 45.34 – y y y y y y L ECOEDA+C ORSAIR 60.46 74.62 53.10 23.28 30.42 61.81 54.48 57.51 60.87 14.25 22.62 46.67 1.33" W EIGHT 49.14 60.80 12.71 09.80 11.98 44.67 15.19 54.90 45.73 01.67 06.54 28.46 – y y y y y y y y y y y W EIGHT+CORSAIR 55.03 68.35 18.04 17.73 22.08 59.24 20.93 55.70 58.47 06.54 14.02 36.01 7.55" RoBERTa 87.92 88.71 68.76 81.76 50.10 53.55 85.38 65.54 77.67 50.70 44.05 68.55 – y y y y y y RoBERTa+C ORSAIR 86.45 89.12 68.10 82.21 51.65 61.31 86.83 67.09 77.69 51.52 46.15 69.82 1.27" Table 3: Experimental results (imbalance divergence or unfairness; %) of all methods on all benchmark datasets (lower is better). The top subtable shows the average document-level imbalance of predictions for label bias evaluation, and the bottom one shows the average word-level imbalance of predictions for keyword bias evaluation. Method HYP TWI ARC SCI CHE ECO NEW PAR YEL TAO SUN AVG. T EXTCNN 01.39 06.31 11.88 09.99 18.86 06.62 28.21 01.41 09.43 41.87 46.12 16.55 – y y y y y y T EXTCNN+CORSAIR 01.07 05.18 02.27 01.62 11.53 01.52 28.49 01.49 09.23 42.01 46.77 13.74 2.81# y y y y L ECOEDA 01.11 07.47 10.42 11.08 08.93 03.51 05.36 00.64 06.66 26.91 22.25 09.48 – L ECOEDA+C ORSAIR 01.21 11.29 12.96 11.99 09.26 04.47 06.05 00.72 05.08 26.06 23.05 10.19 0.71" WEIGHT 00.81 03.19 07.06 05.10 12.65 03.81 01.99 00.18 02.43 25.71 34.76 08.88 – y y y y y y y y y y WEIGHT+C ORSAIR 00.88 01.66 01.95 00.98 04.68 00.56 01.30 00.16 01.21 14.08 14.01 03.77 5.11# RoBERTa 01.29 02.96 14.57 18.10 16.74 06.69 00.16 00.01 02.55 57.74 56.76 16.14 – y y y y y y y y y y RoBERTa+C ORSAIR 00.11 01.27 01.66 12.57 02.76 02.15 00.02 00.01 00.82 28.83 22.91 06.64 9.50# T EXTCNN 17.96 17.39 44.76 47.39 37.35 20.69 38.23 05.76 18.46 65.37 60.87 34.02 – y y y y y y y y y y y T EXTCNN+CORSAIR 07.44 15.17 29.36 22.36 28.84 08.51 35.80 05.09 12.02 64.81 58.37 26.16 7.86# L ECOEDA 06.77 11.93 26.54 15.01 24.16 07.71 30.05 05.09 12.39 65.30 60.63 24.14 – y y y y y y L ECOEDA+C ORSAIR 06.61 14.46 25.94 14.13 22.53 04.77 30.03 05.05 12.58 57.51 52.98 22.41 1.73# WEIGHT 10.32 18.77 43.64 47.70 46.53 21.29 38.98 06.30 21.34 66.75 61.73 34.85 – y y y y y y y y y y y WEIGHT+C ORSAIR 06.34 13.70 33.29 23.40 28.97 08.80 34.74 05.32 10.12 64.87 58.63 26.19 8.66# RoBERTa 21.58 21.58 45.39 41.57 54.57 21.58 59.26 21.58 31.83 67.23 64.82 40.99 – y y y y y y y y y y RoBERTa+C ORSAIR 19.40 13.52 35.87 34.19 53.37 18.99 55.82 17.74 30.52 62.23 60.82 36.58 4.41# metric since it is symmetric (i.e., JS(PjjU) = duces the imbalance metrics (lower is better) when JS(UjjP )) and strictly scoped (in [0:0; 1:0]) com- employed on non-data-balanced baselines signif- pared with the KL divergence. Based on this, to icantly and consistently, indicating it is indeed evaluate the label bias and the keyword bias of a helpful to mitigate the two dataset bias issues. trained model M , we average its relative label im- We all know that data-balanced LECOEDA per- balance (RLI) over the predicted distributions of fectly mitigates the label bias issue via data bal- all the testing documents, and the relative keyword ancing, thus achieving the lowest RLI. Due to the imbalance (RKI) over all the testing documents powerful debiasing operations via strictly balanc- containing whichever context word, respectively: ing data, it serves as the skyline of RLI. This finding is similar to previous evidence of Morik RLI(M) = D(P (x); U) et al. (2020). Moreover, we can also see that jDj x2D L ECOEDA reduces the RKI, validating that data RKI(M;V) = D(P (fxjw 2 x^ x 2 Dg); U) manipulation methodology is indeed helpful to jVj w2V debias the keyword bias issue but fails to elimi- (9) nate it completely; our framework can further re- where a prediction P (x) could be a factual predic- duce RKI (1.73#). Note that W EIGHT exhibits a tion f(x) or a counterfactual one c(x); V denotes more severe keyword bias than label bias (34.85 the vocabulary of context words. The two metrics vs. 08.88). The key reason is that WEIGHT ex- implicitly capture the distance between all predic- plicitly balances each category according to a the- tions and the fair uniform distribution U . oretically fair objective but ignores the consider- Table 3 shows the average results of the bias ation of label distributions conditioned on finer- analysis investigation over five different initia- grained words. Moreover, RoBERTa exhibits the tions. The results show that our framework re- Keyword Imbalance Label Imbalance (RKI) (RLI) Table 4: Ablation study on main components or mecha- most imbalanced prediction against all baselines nisms of our framework evaluated on all datasets. n de- and across small- and large-scale datasets (e.g., notes the removing operation. # denotes performance ARC and TAO), indicating that its answers ex- drop. The worst scores are underlined. cessively distribute on certain categories due to L ECOEDA+C ORSAIR 46.67  WEIGHT+C ORSAIR 36.01 the overfitting phenomenon rooted from its large- y y nC ORSAIR 45.34 1.33# nC ORSAIR 28.46 7.55# y y scale parameters (about 110M). Luckily, by being nLBR 40.82 5.85# nLBR 33.05 2.96# y y nKBR 45.30 1.37# nKBR 30.05 5.96# y y equipped with our framework, the RoBERTa case nES 43.97 2.70# nES 32.85 3.16# remarkably reduces the imbalance issue caused by dataset biases (9.50# and 4.41#). 3.4 Further Investigation on Counterfactual Another finding is that the keyword bias issue Learning typically is more severe than the label bias, mean- ing that trained models typically utilize the word- Recall that our proposed framework first trains a level information to inference, which could catch base model on a training set directly (factual learn- angel keywords as good clues but also inevitably ing) so as to preserve dataset biases in the trained utilize evil keywords that are potential biases. Ad- model, and in the inference phase, given a factual ditionally, the keyword bias issue, compared with input document, CORSAIR imagines two types of label bias, is much harder to be completely elim- counterfactual documents aiming to produce two inated via data manipulations, which imposes a counterfactual outputs as the distilled label bias caution for relevant studies to keep a watchful eye and keyword bias for bias removal. That is, the on the detrimental causal correlations. framework deliberately causes the discrepancy be- tween learning and inference, leading to an opera- tional gap between the two phases. In this section, 3.3 Ablation Study we investigate more deeply to explore what will happen if the operational gap is bridged. We conduct ablation studies on CORSAIR to em- pirically examine the contribution of its main  Factual Learning. Learn with L(; f(x ); y ) i i mechanisms/components, including the label bias as objective, i.e., to minimize the loss between fac- removal operation (nLBR), the keyword bias re- tual predictions and ground-truth labels. Then, in- moval operation (nKBR) and the elastic scaling ference via counterfactual predictions. mechanism (nES). Counterfactual Learning. Learn with L(; c(x ); y ) as objective, i.e., to minimize The average results of the ablation study are i i the loss between counterfactual predictions and shown in Table 4. We can see that removing the ground-truth labels. Then, inference directly. proposed CORSAIR causes serious performance degradation, dropping F -score by 7.55 points for The average results of TEXTCNN on ECO the W EIGHT case. Additionally, it also provides (jYj=2) and CHE (jYj=13) are reported in Fig- evidence that using the counterfactual framework ure 2. We observe that these configurations con- for text classification can explicitly mitigate two verge at different F scores as the number of types of dataset biases to generalize better on un- epochs increases gradually. As for each dataset, seen examples. Moreover, we observe that mit- the configuration of a factual model with coun- igating the two types of biases are consistently terfactual inference (i.e., C ORSAIR) achieves the helpful for classification tasks. The key reason best performance with even a relatively more rapid is that the distilled label bias provides a global convergence. More interestingly, in the early (i.e., document-agnostic) offset and the distilled phases of model training (e.g., epoch=0), C OR- keyword bias provides a local (i.e., document- SAIR usually provides a higher starting point than specific) one to “move” in the predicted space, traditional factual inference. We conjecture that which makes the trained models “blind” to see po- the superiority may come from the use of average tentially harmful biases existing in observed data embedding which usually produces a stable distri- so as to focus only on the main content of each bution similarly distributed with ideal biases, mak- document to inference. Meanwhile, elastic scal- ing a base model happen to “see” the label bias ing effectively finds two dynamic scaling factors once the initiation operation is done. This phe- to amplify or shrink two biases, making the biases nomenon is empirically held, especially for small- be mitigated properly and adaptively. scale classification tasks. 5441 60.00 30.00 vectors, some subsequent studies explored differ- 50.00 25.00 ent types of downstream text classification mod- 40.00 20.00 els, including support vector machine (Joachims, 30.00 15.00 1999), maximum entropy model (Nigamy and Mc- 20.00 10.00 Callum, 1999), naive Bayes (Pang et al., 2002), Asymptote word clustering (Baker and McCallum, 1998) and 10.00 05.00 Factual Learning via Factual Inference Factual Learning via Counterfactual Inference neural networks (Kim, 2014; Zhou et al., 2016; 00.00 Counterfactual Learning via Direct Inference 00.00 0 1 2 3 4 5 6 7 8 9 10 ECO CHE Howard and Ruder, 2018; Devlin et al., 2019; Liu Epoch et al., 2019). Figure 2: The average results of three types of different To solve the dataset bias issue, a straightfor- learning paradigms on two datasets, including a factual ward solution is to perform data-level manipula- learning with factual inference, a factual learning with tions to prevent models from capturing the unin- counterfactual inference (i.e., CORSAIR) and a coun- tended dataset biases in model training, including terfactual learning with direct inference. data balance (Dixon et al., 2018; Geng et al., 2007; Chen et al., 2017; Sun et al., 2018; Rayhan et al., Surprisingly, counterfactual learning converges 2017; Nguyen et al., 2011) (a.k.a. resampling) at the factual learning case. This finding consis- and data augmentation (Wei and Zou, 2019; Qian tently holds on all other baselines across datasets, et al., 2020b). Another common paradigm for text which means that the so-called counterfactual classification is typically to design model-level learning actually degrades to a factual inference. balancing mechanisms, including unbiased em- This indicates that if a training model explicitly bedding (Bolukbasi et al., 2016; Kaneko and Bol- mitigates two types of dataset biases in an end-to- legala, 2019), threshold correction (Kang et al., end fashion, i.e., without the operational gap, it ac- 2020; Provost, 2000; Calders and Verwer, 2010) tually loses the function to perform debiased infer- and instance weighting (Zhang et al., 2020; Zhao ence. The important reason is that under such cir- et al., 2017; Jiang and Zhai, 2007). cumstance, the potential biases actually “spread” throughout the whole model architecture, instead 5 Conclusion of the mere part before bias removal is operated, which makes bias removal only look like debi- asing but is just a factual feedforward operation We have designed a counterfactual framework for that is unable to capture, distill and even miti- text classification debiasing. Extensive experi- gate biases. Therefore, the counterfactual infer- ments demonstrated the framework’s good effec- ence works only when the operational gap be- tiveness, generalizability and fairness. Future tween learning and inferencing exists. This ben- work will design a joint-learning technique to dy- eficial gap instead makes the biases spread only namically decide each document’s main content. We hope the paradigm can illuminate a promising throughout the part before the bias removal mod- ule, and thus enables them to be distilled via coun- technical direction of causal inference in natural terfactual inference. language processing. 4 Related Work Acknowledgements Text classification is a backbone component in many downstream tasks or applications (Broder We thank the anonymous reviewers for their en- et al., 2007; Chen et al., 2019; Sun et al., 2019; couraging feedbacks. The work was supported by Qian et al., 2020a,c). Earlier text classifica- the National Key Research and Development Pro- tion methods focus on manual feature engineering gram of China (No. 2019YFB1704003), the Na- (Aggarwal and Zhai, 2012; Cavnar and Trenkle, tional Nature Science Foundation of China (No. 1994; Post and Bergsma, 2013). The key factor 71690231), Tsinghua BNRist, Alibaba DAMO of text classification lies in the quality of text rep- academy, NExT++ Research Center and Beijing resentation (Mikolov et al., 2013b,a; Pennington Key Laboratory of Industrial Bigdata System and et al., 2014; Canuto et al., 2019; Yan, 2009; Qian Application. et al., 2021). Benefiting from high-quality word F1-Score (%) References Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. 2019. BERT: Pre-training of Deep Bidirectional Charu C. Aggarwal and ChengXiang Zhai. 2012. A Transformers for Language Understanding. In the Survey of Text Classification Algorithms. In Mining North American Chapter of the Association for Text Data, pages 163–222. Computational Linguistics (NAACL), pages 4171– L. Douglas Baker and Andrew Kachites McCallum. 1998. Distributional Clustering of Words for Text Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, Classification. In the ACM SIGIR Conference on and Lucy Vasserman. 2018. Measuring and Mitigat- Research and Development in Information Retrieval ing Unintended Bias in Text Classification. In the (SIGIR), pages 96–103. AAAI/ACM Conference on AI, Ethics, and Society (AIES), pages 67–73. Su Lin Blodgett, Solon Barocas, Hal Daume ´ III, and Hanna Wallach. 2020. Language (Technology) is Amir Feder, Nadav Oved, Uri Shalit, and Roi Re- Power: A Critical Survey of Bias in NLP. In the An- ichart. 2020. CausaLM: Causal Model Explana- nual Meeting of the Association for Computational tion Through Counterfactual Language Models. In Linguistics (ACL), pages 5454–5476. arXiv:2005.13407. Tolga Bolukbasi, Kai-Wei Chang, James Zou, Guang-Gang Geng, Chun-Heng Wang, Qiu-Dan Li, Venkatesh Saligrama, and Adam Kalai. 2016. Man Lei Xu, and Xiao-Bo Jin. 2007. Boosting the is to Computer Programmer as Woman is to Home- Performance of Web Spam Detection with Ensem- maker? Debiasing Word Embeddings. In the Con- ble Under-Sampling Classification. In the Confer- ference on Neural Information Processing Systems ence on Fuzzy Systems and Knowledge Discovery (FSKD), pages 583–587. (NeurIPS), pages 4356–4364. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Andrei Broder, Marcus Fontoura, Evgeniy Dhruv Batra, and Devi Parikh. 2017. Making the Gabrilovich, et al. 2007. Robust Classification V in VQA Matter: Elevating the Role of Image Un- of Rare Queries Using Web Knowledge. In the derstanding in Visual Question Answering. In the ACM SIGIR Conference on Research and Devel- Conference on Computer Vision and Pattern Recog- opment in Information Retrieval (SIGIR), pages nition (CVPR), pages 6904–6913. 231–238. Suchin Gururangan, Ana Marasovic, Swabha Toon Calders and Sicco Verwer. 2010. Three Naive Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Bayes Approaches for Discrimination-free Classifi- and Noah A. Smith. 2020. Don’t Stop Pretraining: cation. In Data Mining and Knowledge Discovery, Adapt Language Models to Domains and Tasks. In pages 277–292. the Annual Meeting of the Association for Compu- tational Linguistics (ACL), pages 8342–8360. Sergio Canuto, Thiago Salles, et al. 2019. Similarity- Based Synthetic Document Representations for Haibo He and Edwardo A. Garcia. 2009. Learning Meta-Feature Generation in Text Classification. In from Imbalanced Data. In IEEE Transactions on the ACM SIGIR Conference on Research and De- Knowledge and Data Engineering (TKDE), pages velopment in Information Retrieval (SIGIR), pages 1263–1284. 355–364. Chris Hokamp and Qun Liu. 2017. Lexically Con- Carlos Castillo, Debora Donato, Aristides Gionis, strained Decoding for Sequence Generation Using Vanessa Graham Murdock, and Fabrizio Silvestri. Grid Beam Search. In the Annual Meeting of the 2007. Know Your Neighbors: Web Spam Detection Association for Computational Linguistics (ACL), using the Web Topology. In the ACM SIGIR Confer- pages 1535–1546. ence on Research and Development in Information Retrieval (SIGIR), pages 423–430. Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. William B Cavnar and John M Trenkle. 1994. N-gram- In the Annual Meeting of the Association for Com- based Text Categorization. In Annual Symposium putational Linguistics (ACL), pages 328–339. on Document Analysis and Information Retrieval (SDAIR). Xiaolei Huang and Michael J. Paul. 2018. Examining Temporality in Document Classification. In the An- XiaoShuang Chen, SiSi Li, and MengChu Zhou. 2017. nual Meeting of the Association for Computational A Noise-Filtered Under-Sampling Scheme for Im- Linguistics (ACL), pages 694–699. balanced Classification. In IEEE Transactions on Xiaolei Huang, Michael C. Smith, Michael J. Paul, Cybernetics, pages 4263–4274. Dmytro Ryzhkov, Sandra C. Quinn, David A. Bro- Zhenpeng Chen, Sheng Shen, Ziniu Hu, et al. 2019. niatowski, and Mark Dredze. 2017. Examining Emoji-Powered Representation Learning for Cross- Patterns of Influenza Vaccination in Social Media. Lingual Sentiment Classification. In the World Wide In the AAAI Conference on Artificial Intelligence Web Conference (WWW), pages 251–262. (AAAI), pages 4–5. 5443 Jing Jiang and ChengXiang Zhai. 2007. Instance for Computational Linguistics (ACL), pages 6274– Weighting for Domain Adaptation in NLP. In the 6283. Annual Meeting of the Association for Computa- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- tional Linguistics (ACL), pages 264–271. dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Thorsten Joachims. 1999. Transductive Inference for Luke Zettlemoyer, and Veselin Stoyanov. 2019. Text Classification using Support Vector Machines. RoBERTa: A Robustly Optimized BERT Pretrain- In the International Conference on Machine Learn- ing Approach. In arXiv:1907.11692. ing (ICML), pages 200–209. Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh David Jurgens, Srijan Kumar, Raine Hoover, Dan Hajishirzi. 2018. Multi-Task Identification of En- McFarland, and Dan Jurafsky. 2018. Measuring tities and Relations and and Coreference for Scien- the Evolution of a Scientific Field through Citation tific Knowledge Graph Construction. In the Con- Frames. In Transactions of the Association for Com- ference on Empirical Methods in Natural Language putational Linguistics (TACL), pages 391–406. Processing (EMNLP), pages 3219–3232. Masahiro Kaneko and Danushka Bollegala. 2019. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Gender-preserving Debiasing for Pre-trained Word Dean. 2013a. Efficient Estimation of Word Repre- Embeddings. In the Annual Meeting of the Asso- sentations in Vector Space. In arXiv:1301.3781. ciation for Computational Linguistics (ACL), pages 1641–1650. Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. 2013b. Distributed Representations of Words and Phrases Bingyi Kang, Saining Xie, Marcus Rohrbach, and their Compositionality. In the Conference on Zhicheng Yan, Albert Gordo, and Jiashi Feng. Neural Information Processing Systems (NeurIPS), 2020. Decoupling Representation and Classifier pages 3111–3119. for Long-tailed Recognition. In the International Conference on Learning Representations (ICLR). Marco Morik, Ashudeep Singh, Jessica Hong, and Thorsten Joachims. 2020. Controlling Fairness and Johannes Kiesel, Maria Mestre, Rishabh Shukla, Em- Bias in Dynamic Learning-to-Rank. In the ACM SI- manuel Vincent, Payam Adineh, David Corney, GIR Conference on Research and Development in Benno Stein, and Martin Potthast. 2019. SemEval- Information Retrieval (SIGIR), pages 429–438. 2019 Task 4: Hyperpartisan News Detection. In the International Workshop on Semantic Evaluation, Hien M. Nguyen, Eric W. Cooper, and Katsuari Kamei. pages 829–839. 2011. Borderline Over-Sampling for Imbalanced Data Classification. In the International Journal of Kang-Min Kim, Yeachan Kim, Jungho Lee, et al. 2019. Knowledge Engineering and Soft Data Paradigms From Small-scale to Large-scale Text Classification. (IJKESDP), pages 4–21. In the World Wide Web Conference (WWW), pages 853–862. Kamal Nigamy and Andrew McCallum. 1999. Using Maximum Entropy for Text Classification. In the Yoon Kim. 2014. Convolutional Neural Networks for International Joint Conference on Artificial Intelli- Sentence Classification. In the Conference on Em- gence (IJCAI), pages 61–67. pirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. 2021. Coun- Diederik P. Kingma and Jimmy Lei Ba. 2015. terfactual VQA: A Cause-Effect Look at Language Adam: A method for stochastic optimization. In Bias. In the Conference on Computer Vision and arXiv:1412.6980. Pattern Recognition (CVPR). Jens Kringelum, Sonny Kim Kjaerulff, Soren Brunak, Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Ole Lund, Tudor I Oprea, and Olivier Taboureau. 2002. Thumbs Up: Sentiment Classification using 2016. ChemProt-3.0: A Global Chemical Biology Machine Llearning Techniques. In the Conference Diseases Mapping. In Database (Oxford). on Empirical Methods in Natural Language Pro- cessing (EMNLP), pages 79–86. Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Judea Pearl. 2013. Direct and Indirect Effects. In Classification. In the AAAI Conference on Artificial arXiv:1301.2300. Intelligence (AAAI), pages 2267–2273. Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. Ken Lang. 1995. Newsweeder: Learning to Filter Net- 2016. Causal Inference in Statistics: A Primer. In news. In the International Conference on Machine John Wiley and Sons. Learning (ICML), pages 331–339. Frederick Liu and Besim Avci. 2019. Incorporating Judea Pearl and Dana Mackenzie. 2018. The Book of Priors with Feature Attribution on Text Classifica- Why: The New Science of Cause and Effect. In tion. In the Annual Meeting of the Association Basic Books. 5444 Jeffrey Pennington, Richard Socher, and Christo- Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, pher D. Manning. 2014. Glove: Global Vectors for and Hanwang Zhang. 2020. Unbiased Scene Graph Word Representation. In the Conference on Em- Generation from Biased Training. In the Confer- pirical Methods in Natural Language Processing ence on Computer Vision and Pattern Recognition (EMNLP), pages 1532–1543. (CVPR), pages 3716–3725. Matt Post and Shane Bergsma. 2013. Explicit and Im- Tan Wang, Jianqiang Huang, Hanwang Zhang, and plicit Syntactic Features for Text Classification. In Qianru Sun. 2020. Visual Commonsense R-CNN. the Annual Meeting of the Association for Computa- In the Conference on Computer Vision and Pattern tional Linguistics (ACL), pages 866–872. Recognition (CVPR), pages 10760–10770. Foster Provost. 2000. Machine Learning from Imbal- Zeerak Waseem and Dirk Hovy. 2016. Hateful Sym- anced Data Sets 101. In the AAAI Conference on bols or Hateful People? Predictive Features for Hate Artificial Intelligence (AAAI), pages 1–3. Speech Detection on Twitter. In the North American Chapter of the Association for Computational Lin- Chen Qian, Fuli Feng, Lijie Wen, Zhenpeng Chen, guistics (NAACL), pages 88–93. Li Lin, Yanan Zheng, and Tat-Seng Chua. 2020a. Solving Sequential Text Classification as Board- Jason Wei and Kai Zou. 2019. EDA: Easy Data Aug- Game Playing. In the AAAI Conference on Artificial mentation Techniques for Boosting Performance on Intelligence (AAAI), pages 8640–8648. Text Classification Tasks. In the Conference on Empirical Methods in Natural Language Processing Chen Qian, Fuli Feng, Lijie Wen, and Tat-Seng Chua. (EMNLP), pages 6382–6388. 2021. Conceptualized and Contextualized Gaussian Embedding. In the AAAI Conference on Artificial Liuyu Xiang and Guiguang Ding. 2020. Learning Intelligence (AAAI). From Multiple Experts: Self-paced Knowledge Dis- tillation for Long-tailed Classification. In the Euro- Chen Qian, Fuli Feng, Lijie Wen, Li Lin, and Tat-Seng pean Conference on Computer Vision (ECCV). Chua. 2020b. Enhancing Text Classification via Discovering Additional Semantic Clues from Lo- Jun Yan. 2009. Text Representation. In Encyclopedia gograms. In the ACM SIGIR Conference on Re- of Database Systems. search and Development in Information Retrieval Xu Yang, Hanwang Zhang, and Jianfei Cai. 2020. De- (SIGIR), pages 1201–1210. confounded Image Captioning: A Causal Retro- Chen Qian, Lijie Wen, Akhil Kumar, Leilei Lin, Li Lin, spect. In arXiv:2003.03923. Zan Zong, Shuang Li, and Jianmin Wang. 2020c. An approach for process model extraction by multi- Guanhua Zhang, Bing Bai, Junqi Zhang, Kun Bai, Con- grained text classification. In Proceedings of The ghui Zhu, and Tiejun Zhao. 2020. Demographics 32nd International Conference on Advanced Infor- Should Not Be the Reason of Toxicity: Mitigating mation Systems Engineering (CAiSE), pages 268– Discrimination in Text Classifications with Instance 282. Weighting. In the Annual Meeting of the Association for Computational Linguistics (ACL), pages 4134– Farshid Rayhan, Sajid Ahmed, Asif Mahbub, Rafsan Jani, Swakkhar Shatabda, and Dewan Md. Farid. 2017. CUSBoost: Cluster-Based Under-Sampling Yongfeng Zhang, Guokun Lai, et al. 2014. Explicit with Boosting for Imbalanced Classification. In the Factor Models for Explainable Recommendation International Conference on Computational Systems based on Phrase-level Sentiment Analysis. In the and Information Technology for Sustainable Solu- ACM SIGIR Conference on Research and Develop- tion (CSITSS), pages 1–5. ment in Information Retrieval (SIGIR), pages 83–92. Bo Sun, Haiyan Chen, Jiandong Wang, and Hua Xie. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Or- 2018. Evolutionary Under-Sampling based Bagging donez, and Kai-Wei Chang. 2017. Men Also Like Ensemble Method for Imbalanced Data Classifica- Shopping: Reducing Gender Bias Amplification us- tion. In Frontiers of Computer Science, pages 331– ing Corpus-level Constraints. In the Conference on 350. Empirical Methods in Natural Language Processing (EMNLP), pages 2979–2989. Lihua Sun, Junpeng Guo, and Yanlin Zhu. 2019. Ap- plying Uncertainty Theory into the Restaurant Rec- Peng Zhou, Wei Shi, Jun Tian, et al. 2016. Attention- ommender System based on Sentiment Analysis of Based Bidirectional Long Short-Term Memory Net- Online Chinese Reviews. In the World Wide Web works for Relation Classification. In the Annual Conference (WWW), pages 83–100. Meeting of the Association for Computational Lin- guistics (ACL), pages 207–212. Chris Sweeney and Maryam Najafian. 2020. A Trans- parent Framework for Evaluating Unintended De- mographic Bias in Word Embeddings. In the An- nual Meeting of the Association for Computational Linguistics (ACL), pages 1662–1667. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) Unpaywall

Counterfactual Inference for Text Classification Debiasing

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)Jan 1, 2021

Loading next page...
 
/lp/unpaywall/counterfactual-inference-for-text-classification-debiasing-NX7v9r4G0R

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Unpaywall
DOI
10.18653/v1/2021.acl-long.422
Publisher site
See Article on Publisher Site

Abstract

Counterfactual Inference for Text Classification Debiasing Chen Qian Fuli Feng Lijie Wen Tsinghua University National University of Singapore Tsinghua University qc16@mails.tsinghua.edu.cn fulifeng93@gmail.com wenlj@tsinghua.edu.cn Chunping Ma Pengjun Xie Alibaba DAMO Academy Alibaba DAMO Academy chunping.mcp@alibaba-inc.com chengchen.xpjg@taobao.com Abstract partisanship recognition (Kiesel et al., 2019) and spam detection (Castillo et al., 2007). Machine Today’s text classifiers inevitably suffer from learning models have become the default choice unintended dataset biases, especially the of solving text classification, owing to their abil- document-level label bias and word-level key- ity to recognize the textual patterns from the la- word bias, which may hurt models’ general- beled documents (Kim, 2014; Howard and Ruder, ization. Many previous studies employed data- 2018). Nevertheless, they are at the risk of inad- level manipulations or model-level balancing mechanisms to recover unbiased distributions vertently capturing and even amplifying the unin- and thus prevent models from capturing the tended dataset biases (Zhao et al., 2017; Zhang two types of biases. Unfortunately, they ei- et al., 2020; Feder et al., 2020; Blodgett et al., ther suffer from the extra cost of data col- 2020), which can be at document-level (i.e., label lection/selection/annotation or need an elab- bias) and word-level (i.e., keyword bias). orate design of balancing strategies. Differ- The label bias issue occurs in the scenarios ent from traditional factual inference in which where a portion of the categories possesses a ma- debiasing occurs before or during training, jority of training examples than others. For ex- counterfactual inference mitigates the influ- ence brought by unintended confounders after ample, the label distribution of a binary sentiment training, which can make unbiased decisions analysis dataset could be 95%:5% (Dixon et al., with biased observations. Inspired by this, 2018). Many previous studies found that the mod- we propose a model-agnostic text classifica- els trained on such data are potentially at the risk tion debiasing framework – C ORSAIR, which of simply predicting the majority answers (Dixon can effectively avoid employing data manip- et al., 2018; Zhang et al., 2020). The keyword ulations or designing balancing mechanisms. bias issue occurs in the situation where trained Concretely, CORSAIR first trains a base model on a training set directly, allowing the dataset models exhibit excessive correlations between cer- biases “poison” the trained model. In infer- tain words and categories, e.g., some sentiment- ence, given a factual input document, C OR- irrelevant words – “black” or “islam” – are always SAIR imagines its two counterfactual counter- connected to negative category. As such, mod- parts to distill and mitigate the two biases cap- els always lean to unfairly predict any document tured by the poisonous model. Extensive ex- containing those keywords to a specific category periments demonstrate CORSAIR’s effective- according to the biased statistical information in- ness, generalizability and fairness. stead of intrinsic textual semantics (Waseem and 1 Introduction Hovy, 2016; Liu and Avci, 2019). The serious disadvantages limit models’ generalization, espe- Text classification, mapping text documents to a cially in the scenarios where the training data is set of predefined categories, is a fundamental and differently-distributed with the testing data (Niu important technique serving for many applications et al., 2021; Goyal et al., 2017). such as sentiment analysis (Qian et al., 2020b), To resolve the issues, an effective solution is to This work was partly done during Chen Qian’s intern- perform data-level manipulations (e.g., resampling ship at Alibaba DAMO academy. Fuli Feng and Lijie Wen (Qian et al., 2020b)), which effectively transforms are the co-corresponding authors. 1 a training set to a relatively balanced one before The code is available at https://github.com/ qianc62/Corsair. training. Another line of debiasing work typically Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 5434–5445 August 1–6, 2021. ©2021 Association for Computational Linguistics designs model-level balancing mechanisms (e.g., a base model on an original training set, allowing reweighting (Zhang et al., 2020)), aiming to adap- the unintended dataset biases “poison” the model. tively decrease the influence of majority categories To “rescue” the testing documents from the poi- while increasing the minority during training. The sonous model, in testing, for each factual input core of the two types of solutions is to explicitly document, C ORSAIR imagines its two types of or implicitly recover unbiased distributions and counterfactual counterparts to produce two coun- prevent models from capturing the unintended bi- terfactual outputs as the distilled label bias and ases. Unfortunately, the data-level strategy typi- keyword bias. Lastly, CORSAIR performs a bias cally suffers from the extra manual cost of data removal operation to produce a counterfactual pre- collection, selection and annotation (Zhang et al., diction that corresponds to a debiased decision. 2020), requires much longer training time and nor- To verify, we perform extensive experiments on mally enlarges the gap between training and test- multiple public benchmark datasets. The results ing data distributions. The model-level strategy demonstrate our proposed framework’s effective- typically needs elaborate selection or definition ness, generalizability and fairness, proving that of balancing strategies and needs relearning from C ORSAIR, when employed on four different types scratch once certain balancing mechanisms (e.g., of base models, is significantly helpful to mitigate an unbiased training objective) are redesigned. the two types of dataset biases. Must machine learning models perform debias- 2 Methodology ing before or during training? Think about the dif- ference in the decision making processes between Problem Formalization Let X and Y denote machines and humans. Machine learning systems the input (text document) and output (category) are forced to imitate the behavior from observa- spaces, respectively. Given a labeled training set tions via maximizing the prior probability, from D = f(x ; y ) 2 X  Yg (i.e., the observed train i i which the decision is directly drawn during infer- data), the goal is to learn a text classifier M on ence. By contrast, we humans, although born and D , which serves as a mapping function f() : train raised in a biased nature, have the ability of coun- X 7! Y to accurately classify testing examples in terfactual inference to make unbiased decisions D = fx^jx^ 2 Xg. test with biased observations (Niu et al., 2021). To il- Considering that the dataset biases would not be lustrate, we briefly compare the traditional factual completely eliminated via data manipulations, em- inference and the counterfactual inference in text ploying data manipulations (e.g., resampling) or classification: designing balancing mechanisms (e.g., reweight- Factual Inference: What will the prediction be ing) may be not a directly-reasonable solution. In- if seeing an input document? spired by the success of counterfactual inference Counterfactual Inference: What will the pre- in mitigating biases in computer vision (Niu et al., diction be if seeing the main content of an input 2021; Wang et al., 2020; Tang et al., 2020; Yang document only and had not seen the confounding et al., 2020; Goyal et al., 2017), we propose a dataset biases? counterfactual-inference-based text-classification debiasing framework (CORSAIR), which is able The counterfactual inference essentially gifts hu- to make unbiased decisions with biased observa- mans the imagination ability (i.e., had not done) tions. The core idea of C ORSAIR is to train a to make decisions with a collaboration of the main “poisonous” text classifier regardless the dataset content and the confounding biases (Tang et al., biases and post-adjust the biased predictions ac- 2020), as well as to introspect whether our deci- cording to the causes of the biases in inference. sion is deceived (Niu et al., 2021), i.e., counter- It’s worth mentioning that our proposed CORSAIR factual inference leads to debiased prediction. can be applied to almost any parameterized base Inspired by this, we propose a novel model- model, including traditional one-stage classifiers agnostic paradigm (CORSAIR), which adopts fac- (e.g., T EXTCNN (Kim, 2014), RCNN (Lai et al., tual learning before mitigating the negative influ- 2015) and L ECO (Qian et al., 2020b)) and cur- ence of the dataset biases in inference (i.e., after rently prevalent two-stage classifiers (e.g., U LM- training), without the need of employing data ma- nipulations or designing balancing mechanisms. 2 For brevity, two-stage classifiers refer to two-stage lan- Concretely, in training, C ORSAIR directly trains guage models with an additional prediction layer. 5435 M Biased Learning Bias Distillation f(x)=Y2 Bias Removal (on training set) (on testing set) (on testing set) w1 w2 w3 w8 Y1 λ X Y * * λ λ w2 w5 wn Y2 Factual World Factual Output Label Bias (Traditional Inference) (i.e., Factual Prediction) λ Document Counterfactual Input M Word-Masked Document w6 w1 w8 w2 w8 Ym w1 w2 w3 w4 w5 see nothing Probability Distribution or Logit Vector w1 w2 w3 w4 w5 Keyword Bias Factual Input Imagination ˜ Scaling Factor λ λ w1 w2 w3 Y1 Y2 Elastic Scaling (on validation set) Counterfactual Output Counterfactual World Element-Wise Subtraction w4 w5 … Y3 … (i.e., Label Bias) (Fully-Blindfolded) Vocabulary Categories Counterfactual Input c(x)=Y4 M w1 w2 w3 w4 w5 see some * * = λ λ Imagination X̃ Y Factual Label Keyword Counterfactual X Y Counterfactual World Counterfactual Output Prediction Bias Bias Prediction Factual Learning (Partially-Blindfolded) (i.e., Keyword Bias) Figure 1: The architecture of our proposed model-agnostic framework (CORSAIR). Specifically, C ORSAIR first trains a base model on the training data directly so as to preserve the dataset biases in the trained model. In the inference phase, given a factual input document, CORSAIR first imagines its two types of counterfactual documents to produce two counterfactual outputs as the distilled label bias and keyword bias. Finally, C ORSAIR searches two adaptive parameters to perform bias removal to produce a counterfactual prediction for a debiased answer. F IT (Howard and Ruder, 2018), BERT (Devlin operation on the trained base model to obtain the et al., 2019) and RoBERTa (Liu et al., 2019)). probability distribution over Y (i.e., factual pre- For brevity, we will elaborate C ORSAIR by taking diction) for a most possible answer. However, in RoBERTa (a robustly optimized BERT-shape addition to the textual contents of the document, language model) as the example base model, and the prediction is also affected by unintended con- binary sentiment analysis as the example applica- founders (Pearl and Mackenzie, 2018) which may tion. The high-level architecture of CORSAIR is il- produce the label bias and keyword bias. Aiming lustrated in Figure 1, which consists of three main to obtain unbiased prediction, the key is to debias components: biased learning, bias distillation and during inference by blocking the spread of the bi- bias removal. ases from learning to inference. To achieve that, inspired by the counterfactual studies in causal 2.1 Biased Learning reasoning (Niu et al., 2021; Tang et al., 2020), we design an effective strategy based on causal inter- In the learning phase (i.e., training), C ORSAIR first vention (Pearl, 2013; Pearl and Mackenzie, 2018) trains the base model RoBERTa to learn a mapping to distill the potentially-harmful biases captured relation based on training data. Similar to tradi- by the trained model (Niu et al., 2021; Tang et al., tional training, C ORSAIR uses feedforward to pre- 2020), and then mitigate them via bias removal. dict batch examples and backward to update those learnable parameters in an end-to-end fashion. In 2.2.1 Causal Graph practice, we adopt the standard cross entropy as the training objective (i.e., loss function): Aiming to conduct proper causal intervention, we first formulate the causal graph (Pearl, 2013; Pearl X X L() =  ln i;y i;y and Mackenzie, 2018; Tang et al., 2020) for the (1) i=1 y2Y text classification models (see the left-bottom part = softmax(f(x )) i i of Figure 1), which sheds light on how the docu- where  denotes the learnable parameters of the ment contents and dataset biases affecting the pre- base model f(), n is the number of batch exam- diction. Formally, a causal graph is a directed ples,  is the ground-truth label distribution (over acyclic graph G = (N;E), indicating how a set Y ) and  is the predicted probability distribution i of variables N causally interact with each other (overY ) for a given training example x . i through the causal links E . It provides a sketch of the causal relations behind the data and how 2.2 Bias Distillation variables obtain their values (Tang et al., 2020), In the inference phase (i.e., testing), traditional de- e.g., (X; M)!Y . In this causal graph, X , Y and biasing methods making predictions for each test- M denote a text document’s embedding, its corre- ing document via the conventional feedforward sponding prediction and the trained model which 5436 inevitably captures unintended confounders exist- information is given. Thus, the fully-blindfolded ing in training data, respectively. counterfactual output: P (Yjdo(X)) = f(x^) = f(hw ; w ; ; w i) 1 2 n 2.2.2 Label Bias Distillation (4) 8w 2 x; ^ w [MASK] i i According to the causal graph, we diagnose how naturally reflects as the label bias captured by M , the dataset biases existing in training data misleads where [MASK] is a special token to mask a single inference. Concretely, by using Bayes rule (Wang word. Due to x^ is fully-blindfolded and indepen- et al., 2020), we can view the inference as: dent with trained model M , in implementation, we follow Wang et al. (2020) to use the average doc- f(x) = P (YjX) = P (YjX; c)P (cjX) (2) c ument feature on the whole training set as its em- bedding of the counterfactual document. where c could be any confounder captured by the model trained on a biased training set (e.g., 2.2.3 Keyword Bias Distillation the overwhelming majority of training documents Inspired by the factual inference where all tex- fall in P OSITIVE). Under such circumstances, tual information in test documents are exposed once the training documents corresponding to the to the base model and the fully-blindfolded case P OSITIVE category are dominating than NEGA- where all textual information in each test docu- TIVE, the trained model tends to build strong spu- ment are not exposed, we make the first attempt to rious connections between testing documents and utilize a partially-blindfolded counterfactual docu- P OSITIVE, achieving high accuracy even with- ment where some words in the test document x are out knowing testing documents’ main contents. masked to distill the keyword bias from the trained As such, the model is inadvertently contaminated base model. by the spurious causal correlation: X M!Y , Specifically, we deliberately expose some a.k.a. a back-door path in causal theory (Pearl words which may potentially cause spurious cor- and Mackenzie, 2018; Pearl, 2013). To decouple relations (e.g., the spurious “black”-to-NEGATIVE the spurious causal correlation, the back-door ad- mapping) to the trained model to exhibit their justment (Pearl and Mackenzie, 2018; Pearl, 2013; potentially negative influence. Some evil words Pearl et al., 2016) predicts an actively intervened may serve as unintended confounders (Tang et al., answer via the do() operation: 2020), splitting a document into two pieces: main P (Yjdo(X)) = P (YjX = x^) = f(x^) (3) content and relatively-unimportant context. In the following, we use x~ to denote another counterfac- where x^ could be any counterfactual embedding tual document where the main-content words in as long as it is no longer dependent on M to detach a test document x are masked while other con- the connection between X and M . As illustrated text words are not, and f(x~) as the corresponding in the fully-blindfolded counterfactual world in counterfactual output. To achieve that, an effective Figure 1, the causal intervention operation wipes masking strategy is to use discriminative text sum- out all the in-coming links of a cause variable X , marization methods to extract the main content of which encourages the model M to inference with- the document, before masking content words (im- out seeing any testing document, i.e., RoBERTa portant classification clues) and exposing others should be fully blind in order to detaching the as potentially harmful biasing factors. Since the connection between M and X . To achieve that, model is forced to see only the non-masked con- we use x^ to denote the imagined fully-blindfolded text words in x, f(x~) actually reflects the influence counterfactual document where all words in the from both the potentially harmful contexts and test document x are consistently masked (to cre- the trained model. Thus, the partially-blindfolded ate a counterfactual embedding), and f(x^) as the counterfactual output: corresponding counterfactual output via feedfor- f(x~) = f(hw ; w ; ; w i) 1 2 n ward through the trained model. Since the model (5) w [MASK] if w 2 x cannot see any word in the factual input x after i i content 8w 2 x; ~ w w if w 2 x i i i context fully blindfolding, f(x^) actually reflects the pure influence from the trained base model M . Further- naturally reflects as the keyword bias captured by more, f(x^) refers to the output (e.g., a probabil- M for a specific text document x, where x content ity distribution or a logit vector) where no textual and x denote the main content and the con- context 5437 text of x, respectively. Inspired by a recent coun- where is a metric function (e.g., recall, precision terfactual word-embedding study of Feder et al. and F -score) to evaluate the performance on the (2020), to realize discriminative text summariza- validation set D =(X ; Y ); a and b are the dev dev dev tion, we use Jieba tool, whose TextRank-based boundaries of the search range. The two factors interface can effectively extract the words that are at dataset-level and thus searched only once for may influence the semantics of a sentence as con- each validation set, and would be used in inference tent, leaving potentially discriminative/unfair key- for all testing documents. words (e.g., stop words, a part of adjectives, and 3 Evaluation semantically-unimportant particles) as contexts. Empirically, the average ratio of contents to con- Baselines We choose four types of represen- texts produced by Jieba on all datasets is approxi- tative text classifiers as the base models of mately 62.03%:37.97%. our proposed framework, covering classical, data-manipulation-based, model-balancing-based, 2.3 Bias Removal as well as large-scale and two-stage methods. Our final goal is to use the direct effect from X to T EXTCNN (Kim, 2014) is a classical classifier Y for debiased prediction, removing (n) the label that uses convolutional neural networks (CNN) bias and the keyword bias existing in training data with scale-variant convolution filters to capture lo- (i.e., blocking the spread of the biases from train- cal textual features, which may potentially cap- ing data to inference): f(x)nf(x^)nf(x~). The de- ture spurious correlations between certain key- biased prediction via bias removal can be formal- words and categories. L ECO (Qian et al., 2020b) ized via the conceptually simple and empirically utilizes the combination of the implicit encod- powerful element-wise subtraction operation: ing of deep linguistic information and the ex- ^ ~ c(x) = f(x)nf(x^)nf(x~) = f(x) f(x^) f(x~) (6) plicit encoding of morphological features, which would also capture the keyword bias inadvertently. where f(x) and c(x) correspond to the traditional factual prediction and our counterfactual predic- Besides, it uses a sentence-level over-sampling mechanism (He and Garcia, 2009) to mitigate the tion, respectively; f(x^) and f(x~) correspond to the label bias and the keyword bias distilled from label bias, and we further enhance it via a pow- ^ ~ erful word-level augmentation technique (EDA) the trained base model, respectively;  and are two independent parameters balancing the two (Wei and Zou, 2019) to mitigate the keyword bias, types of biases. denoted as L ECOEDA. W EIGHT (Zhang et al., 2020) is a most recent debiasing text classifier that Note that the two distilled biases could be prob- uses a specially-designed reweighting technique ability distributions over all categories or logit under an unbiased objective for fair (i.e., non- vectors (i.e., without normalization), and they typ- discrimination) learning, which is proven effective ically do not contribute completely equally to to mitigate the unfairness or discrimination issue the final classification. As such, in Equation 6, caused by unintended dataset biases. RoBERTa directly subtracting without adaptive parameters ^ ~ (Liu et al., 2019) is an improved version of BERT, (i.e., == ) would cause that mitigating a certain whose effective modifications allow RoBERTa to bias too much or too less for a specific testing set. generalize better and match or exceed the perfor- Therefore, we propose the elastic scaling mecha- mance of many post-BERT methods, serving as a nism to search two adaptive parameters (scaling ^ ~ very strong baseline in recent work (Gururangan factors) –  and  – on the validation set to et al., 2020). amplify or penalize the two biases, which would dynamically adapt to different datasets accord- Datasets We use multiple English benchmark ing to the extent to which two biases in training datasets (used mainly in academic commu- set “poison” the validation set. In practice, elas- nity): HyperPartisan (Kiesel et al., 2019), Twit- tic scaling can be implemented using grid beam ter (Huang et al., 2017), ARC (Jurgens et al., search (Hokamp and Liu, 2017) in a scoped two- 2018), SCIERC (Luan et al., 2018), ChemProt dimensional space: (Kringelum et al., 2016), Economy (Huang and ^ ~ ^ ~ ^ ~ ;  = arg max (D ; c(x; ; )) ;  2 [a; b] (7) dev Paul, 2018), News (Lang, 1995), Parties (Huang ^ ~ and Paul, 2018), YelpHotel (Zhang et al., 2014); https://github.com/fxsjy/jieba and also randomly collect real-world query- 5438 Table 1: Statistics of the datasets. #D denotes the aver- traditional factual predictions and two counter- age number of characters per document. #C denotes the factual outputs to produce counterfactual predic- number of categories. #Train, #Dev and #Test denote tions, the comparison between each baseline and the number of training set, validation set and testing set, its CORSAIR-equipped counterparts highlights the respectively. importance of the counterfactual inference, which Dataset Domain/Genre #D #C #Train" #Dev #Test is largely ignored by most of previous text clas- HYP Political News 3,265.64 2 516 64 65 TWI Social Network 84.32 2 1,631 272 272 sification methods. Particularly, CORSAIR can ARC Computer Science 222.49 6 1,688 125 128 SCI Computer Science 192.92 7 3,219 712 717 even benefit the data-manipulation-based method CHE Biomedicine 220.28 13 4,169 2,944 2,952 ECO Finance 1,152.22 2 4,744 595 596 (i.e., L ECOEDA) and the model-balancing-based NEW News 1,801.20 20 9,445 4,689 4,694 PAR Political Speech 140.31 2 10,059 2,012 2,012 method (i.e., WEIGHT) consistently, which in turn YEL User Comment 651.73 3 20,975 6,991 6,993 TAO E-Commerce 8.09 143 68,086 6,949 7,022 verifies our initial intuition that the dataset biases SUN E-Commerce 7.70 56 234,074 50,851 50,844 would not be completely eliminated via data ma- nipulations merely, and further illuminates our key category pairs (used in industrial community) insight – preserving biases in models before debi- from two famous Chinese e-commerce platforms: asing in inference. 4 5 Taobao and Suning . For brevity, we will use the We can also notice that CORSAIR sometimes first three letters to denote each dataset (e.g., HYP hurts performance (e.g., RoBERTa+CORSAIR on for HyperPartisan). The statistics of the datasets HYP and ARC); we conjecture the phenomenon are summarized in Table 1. comes from the small-scale data, making the giant model RoBERTa overfits and thus “fail” to dis- Metric We use the widely-used macro-F met- till two potential biases that are identically dis- ric, which is the balanced harmonic mean of pre- tributed with the ideal distributions of factual bi- cision and recall. Furthermore, macro-F is more ases. Moreover, finetuning a RoBERTa model on suitable than micro-F to reflect the extent of the large-datasets (e.g., SUN) would take about 36 dataset biases, especially for the highly-skewed hours, nearly 50 times that of training a WEIGHT cases, since macro-F is strongly influenced by model (about 44 minutes); we thus suggest to use the performance in each category (i.e., category- lightweight base models in practice with consid- sensitive) but micro-F easily gives equal weight ering systems’ robustness and efficiency. Besides, over all documents (i.e., category-agnostic) (Kim the proposed framework works only in inference et al., 2019). and can thus be employed on the previous already- Implementation Details The search range in trained models. Therefore, by leveraging coun- Equation 7 is set as [2:0; 2:0]. Each training terfactual inference, our approach can serve as a is run for 10 epochs with the Adam optimizer powerful, “data-manipulation-free” and “model- (Kingma and Ba, 2015), a mini-batch size of 16, balancing-free” weapon to enhance different types a learning rate of 2e , and a dropout rate of 0.1. of text classification methods. We implement C ORSAIR via Python 3.7.3 and Py- 3.2 Bias Analysis torch 1.0.1. All of our experiments are run on a machine equipped with seven standard NVIDIA According to Sweeney and Najafian (2020), the TITAN-RTX GPUs. more imbalanced/skewed a prediction produced by a trained model is, the more unfair opportuni- 3.1 Overall Performance ties it gives over predefined categories, the more We report the average results over five different unfairly-discriminative the trained model is. We initiations in Table 2. We can observe that COR- thus follow previous work (Xiang and Ding, 2020; SAIR consistently improves the four types of rep- Sweeney and Najafian, 2020) to use the metric – resentative baselines on almost all datasets with a imbalance divergence – to evaluate whether a pre- significance level, regardless of the languages, do- diction (normally a probability distribution) P is mains, volumes and applications of the datasets, imbalanced/skewed/unfair: which validates the effectiveness and the general- D(P; U) =JS(PjjU) (8) izability of the proposed framework. Furthermore, since CORSAIR performs debiasing between the where D() is defined as the distance of P and the uniform distribution U (with jPj elements). Con- https://www.taobao.com https://www.suning.com cretely, we use the JS divergence as the distance 5439 Table 2: Experimental results (F ; %) of all methods on all benchmark datasets (higher is better). For each dataset, the best-performing results among all methods are highlighted with boldfaces. For each baseline, the best- performing results between the baseline and our approach are highlighted with. y denotes statistical significance (p0.05) between a baseline and the counterpart employed on our framework. Method HYP TWI ARC SCI CHE ECO NEW PAR YEL TAO SUN AVG. T EXTCNN 40.48 65.94 12.46 10.09 18.96 46.07 12.07 54.94 51.49 08.16 10.90 30.14 – y y y y y y y y y y y T EXTCNN+C ORSAIR 46.71 69.03 17.03 19.85 22.55 59.74 16.18 56.39 58.37 08.70 14.20 35.34 5.20" L ECOEDA 58.78 72.43 52.64 22.37 30.22 60.81 54.39 57.33 60.60 12.02 17.17 45.34 – y y y y y y L ECOEDA+C ORSAIR 60.46 74.62 53.10 23.28 30.42 61.81 54.48 57.51 60.87 14.25 22.62 46.67 1.33" W EIGHT 49.14 60.80 12.71 09.80 11.98 44.67 15.19 54.90 45.73 01.67 06.54 28.46 – y y y y y y y y y y y W EIGHT+CORSAIR 55.03 68.35 18.04 17.73 22.08 59.24 20.93 55.70 58.47 06.54 14.02 36.01 7.55" RoBERTa 87.92 88.71 68.76 81.76 50.10 53.55 85.38 65.54 77.67 50.70 44.05 68.55 – y y y y y y RoBERTa+C ORSAIR 86.45 89.12 68.10 82.21 51.65 61.31 86.83 67.09 77.69 51.52 46.15 69.82 1.27" Table 3: Experimental results (imbalance divergence or unfairness; %) of all methods on all benchmark datasets (lower is better). The top subtable shows the average document-level imbalance of predictions for label bias evaluation, and the bottom one shows the average word-level imbalance of predictions for keyword bias evaluation. Method HYP TWI ARC SCI CHE ECO NEW PAR YEL TAO SUN AVG. T EXTCNN 01.39 06.31 11.88 09.99 18.86 06.62 28.21 01.41 09.43 41.87 46.12 16.55 – y y y y y y T EXTCNN+CORSAIR 01.07 05.18 02.27 01.62 11.53 01.52 28.49 01.49 09.23 42.01 46.77 13.74 2.81# y y y y L ECOEDA 01.11 07.47 10.42 11.08 08.93 03.51 05.36 00.64 06.66 26.91 22.25 09.48 – L ECOEDA+C ORSAIR 01.21 11.29 12.96 11.99 09.26 04.47 06.05 00.72 05.08 26.06 23.05 10.19 0.71" WEIGHT 00.81 03.19 07.06 05.10 12.65 03.81 01.99 00.18 02.43 25.71 34.76 08.88 – y y y y y y y y y y WEIGHT+C ORSAIR 00.88 01.66 01.95 00.98 04.68 00.56 01.30 00.16 01.21 14.08 14.01 03.77 5.11# RoBERTa 01.29 02.96 14.57 18.10 16.74 06.69 00.16 00.01 02.55 57.74 56.76 16.14 – y y y y y y y y y y RoBERTa+C ORSAIR 00.11 01.27 01.66 12.57 02.76 02.15 00.02 00.01 00.82 28.83 22.91 06.64 9.50# T EXTCNN 17.96 17.39 44.76 47.39 37.35 20.69 38.23 05.76 18.46 65.37 60.87 34.02 – y y y y y y y y y y y T EXTCNN+CORSAIR 07.44 15.17 29.36 22.36 28.84 08.51 35.80 05.09 12.02 64.81 58.37 26.16 7.86# L ECOEDA 06.77 11.93 26.54 15.01 24.16 07.71 30.05 05.09 12.39 65.30 60.63 24.14 – y y y y y y L ECOEDA+C ORSAIR 06.61 14.46 25.94 14.13 22.53 04.77 30.03 05.05 12.58 57.51 52.98 22.41 1.73# WEIGHT 10.32 18.77 43.64 47.70 46.53 21.29 38.98 06.30 21.34 66.75 61.73 34.85 – y y y y y y y y y y y WEIGHT+C ORSAIR 06.34 13.70 33.29 23.40 28.97 08.80 34.74 05.32 10.12 64.87 58.63 26.19 8.66# RoBERTa 21.58 21.58 45.39 41.57 54.57 21.58 59.26 21.58 31.83 67.23 64.82 40.99 – y y y y y y y y y y RoBERTa+C ORSAIR 19.40 13.52 35.87 34.19 53.37 18.99 55.82 17.74 30.52 62.23 60.82 36.58 4.41# metric since it is symmetric (i.e., JS(PjjU) = duces the imbalance metrics (lower is better) when JS(UjjP )) and strictly scoped (in [0:0; 1:0]) com- employed on non-data-balanced baselines signif- pared with the KL divergence. Based on this, to icantly and consistently, indicating it is indeed evaluate the label bias and the keyword bias of a helpful to mitigate the two dataset bias issues. trained model M , we average its relative label im- We all know that data-balanced LECOEDA per- balance (RLI) over the predicted distributions of fectly mitigates the label bias issue via data bal- all the testing documents, and the relative keyword ancing, thus achieving the lowest RLI. Due to the imbalance (RKI) over all the testing documents powerful debiasing operations via strictly balanc- containing whichever context word, respectively: ing data, it serves as the skyline of RLI. This finding is similar to previous evidence of Morik RLI(M) = D(P (x); U) et al. (2020). Moreover, we can also see that jDj x2D L ECOEDA reduces the RKI, validating that data RKI(M;V) = D(P (fxjw 2 x^ x 2 Dg); U) manipulation methodology is indeed helpful to jVj w2V debias the keyword bias issue but fails to elimi- (9) nate it completely; our framework can further re- where a prediction P (x) could be a factual predic- duce RKI (1.73#). Note that W EIGHT exhibits a tion f(x) or a counterfactual one c(x); V denotes more severe keyword bias than label bias (34.85 the vocabulary of context words. The two metrics vs. 08.88). The key reason is that WEIGHT ex- implicitly capture the distance between all predic- plicitly balances each category according to a the- tions and the fair uniform distribution U . oretically fair objective but ignores the consider- Table 3 shows the average results of the bias ation of label distributions conditioned on finer- analysis investigation over five different initia- grained words. Moreover, RoBERTa exhibits the tions. The results show that our framework re- Keyword Imbalance Label Imbalance (RKI) (RLI) Table 4: Ablation study on main components or mecha- most imbalanced prediction against all baselines nisms of our framework evaluated on all datasets. n de- and across small- and large-scale datasets (e.g., notes the removing operation. # denotes performance ARC and TAO), indicating that its answers ex- drop. The worst scores are underlined. cessively distribute on certain categories due to L ECOEDA+C ORSAIR 46.67  WEIGHT+C ORSAIR 36.01 the overfitting phenomenon rooted from its large- y y nC ORSAIR 45.34 1.33# nC ORSAIR 28.46 7.55# y y scale parameters (about 110M). Luckily, by being nLBR 40.82 5.85# nLBR 33.05 2.96# y y nKBR 45.30 1.37# nKBR 30.05 5.96# y y equipped with our framework, the RoBERTa case nES 43.97 2.70# nES 32.85 3.16# remarkably reduces the imbalance issue caused by dataset biases (9.50# and 4.41#). 3.4 Further Investigation on Counterfactual Another finding is that the keyword bias issue Learning typically is more severe than the label bias, mean- ing that trained models typically utilize the word- Recall that our proposed framework first trains a level information to inference, which could catch base model on a training set directly (factual learn- angel keywords as good clues but also inevitably ing) so as to preserve dataset biases in the trained utilize evil keywords that are potential biases. Ad- model, and in the inference phase, given a factual ditionally, the keyword bias issue, compared with input document, CORSAIR imagines two types of label bias, is much harder to be completely elim- counterfactual documents aiming to produce two inated via data manipulations, which imposes a counterfactual outputs as the distilled label bias caution for relevant studies to keep a watchful eye and keyword bias for bias removal. That is, the on the detrimental causal correlations. framework deliberately causes the discrepancy be- tween learning and inference, leading to an opera- tional gap between the two phases. In this section, 3.3 Ablation Study we investigate more deeply to explore what will happen if the operational gap is bridged. We conduct ablation studies on CORSAIR to em- pirically examine the contribution of its main  Factual Learning. Learn with L(; f(x ); y ) i i mechanisms/components, including the label bias as objective, i.e., to minimize the loss between fac- removal operation (nLBR), the keyword bias re- tual predictions and ground-truth labels. Then, in- moval operation (nKBR) and the elastic scaling ference via counterfactual predictions. mechanism (nES). Counterfactual Learning. Learn with L(; c(x ); y ) as objective, i.e., to minimize The average results of the ablation study are i i the loss between counterfactual predictions and shown in Table 4. We can see that removing the ground-truth labels. Then, inference directly. proposed CORSAIR causes serious performance degradation, dropping F -score by 7.55 points for The average results of TEXTCNN on ECO the W EIGHT case. Additionally, it also provides (jYj=2) and CHE (jYj=13) are reported in Fig- evidence that using the counterfactual framework ure 2. We observe that these configurations con- for text classification can explicitly mitigate two verge at different F scores as the number of types of dataset biases to generalize better on un- epochs increases gradually. As for each dataset, seen examples. Moreover, we observe that mit- the configuration of a factual model with coun- igating the two types of biases are consistently terfactual inference (i.e., C ORSAIR) achieves the helpful for classification tasks. The key reason best performance with even a relatively more rapid is that the distilled label bias provides a global convergence. More interestingly, in the early (i.e., document-agnostic) offset and the distilled phases of model training (e.g., epoch=0), C OR- keyword bias provides a local (i.e., document- SAIR usually provides a higher starting point than specific) one to “move” in the predicted space, traditional factual inference. We conjecture that which makes the trained models “blind” to see po- the superiority may come from the use of average tentially harmful biases existing in observed data embedding which usually produces a stable distri- so as to focus only on the main content of each bution similarly distributed with ideal biases, mak- document to inference. Meanwhile, elastic scal- ing a base model happen to “see” the label bias ing effectively finds two dynamic scaling factors once the initiation operation is done. This phe- to amplify or shrink two biases, making the biases nomenon is empirically held, especially for small- be mitigated properly and adaptively. scale classification tasks. 5441 60.00 30.00 vectors, some subsequent studies explored differ- 50.00 25.00 ent types of downstream text classification mod- 40.00 20.00 els, including support vector machine (Joachims, 30.00 15.00 1999), maximum entropy model (Nigamy and Mc- 20.00 10.00 Callum, 1999), naive Bayes (Pang et al., 2002), Asymptote word clustering (Baker and McCallum, 1998) and 10.00 05.00 Factual Learning via Factual Inference Factual Learning via Counterfactual Inference neural networks (Kim, 2014; Zhou et al., 2016; 00.00 Counterfactual Learning via Direct Inference 00.00 0 1 2 3 4 5 6 7 8 9 10 ECO CHE Howard and Ruder, 2018; Devlin et al., 2019; Liu Epoch et al., 2019). Figure 2: The average results of three types of different To solve the dataset bias issue, a straightfor- learning paradigms on two datasets, including a factual ward solution is to perform data-level manipula- learning with factual inference, a factual learning with tions to prevent models from capturing the unin- counterfactual inference (i.e., CORSAIR) and a coun- tended dataset biases in model training, including terfactual learning with direct inference. data balance (Dixon et al., 2018; Geng et al., 2007; Chen et al., 2017; Sun et al., 2018; Rayhan et al., Surprisingly, counterfactual learning converges 2017; Nguyen et al., 2011) (a.k.a. resampling) at the factual learning case. This finding consis- and data augmentation (Wei and Zou, 2019; Qian tently holds on all other baselines across datasets, et al., 2020b). Another common paradigm for text which means that the so-called counterfactual classification is typically to design model-level learning actually degrades to a factual inference. balancing mechanisms, including unbiased em- This indicates that if a training model explicitly bedding (Bolukbasi et al., 2016; Kaneko and Bol- mitigates two types of dataset biases in an end-to- legala, 2019), threshold correction (Kang et al., end fashion, i.e., without the operational gap, it ac- 2020; Provost, 2000; Calders and Verwer, 2010) tually loses the function to perform debiased infer- and instance weighting (Zhang et al., 2020; Zhao ence. The important reason is that under such cir- et al., 2017; Jiang and Zhai, 2007). cumstance, the potential biases actually “spread” throughout the whole model architecture, instead 5 Conclusion of the mere part before bias removal is operated, which makes bias removal only look like debi- asing but is just a factual feedforward operation We have designed a counterfactual framework for that is unable to capture, distill and even miti- text classification debiasing. Extensive experi- gate biases. Therefore, the counterfactual infer- ments demonstrated the framework’s good effec- ence works only when the operational gap be- tiveness, generalizability and fairness. Future tween learning and inferencing exists. This ben- work will design a joint-learning technique to dy- eficial gap instead makes the biases spread only namically decide each document’s main content. We hope the paradigm can illuminate a promising throughout the part before the bias removal mod- ule, and thus enables them to be distilled via coun- technical direction of causal inference in natural terfactual inference. language processing. 4 Related Work Acknowledgements Text classification is a backbone component in many downstream tasks or applications (Broder We thank the anonymous reviewers for their en- et al., 2007; Chen et al., 2019; Sun et al., 2019; couraging feedbacks. The work was supported by Qian et al., 2020a,c). Earlier text classifica- the National Key Research and Development Pro- tion methods focus on manual feature engineering gram of China (No. 2019YFB1704003), the Na- (Aggarwal and Zhai, 2012; Cavnar and Trenkle, tional Nature Science Foundation of China (No. 1994; Post and Bergsma, 2013). The key factor 71690231), Tsinghua BNRist, Alibaba DAMO of text classification lies in the quality of text rep- academy, NExT++ Research Center and Beijing resentation (Mikolov et al., 2013b,a; Pennington Key Laboratory of Industrial Bigdata System and et al., 2014; Canuto et al., 2019; Yan, 2009; Qian Application. et al., 2021). Benefiting from high-quality word F1-Score (%) References Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. 2019. BERT: Pre-training of Deep Bidirectional Charu C. Aggarwal and ChengXiang Zhai. 2012. A Transformers for Language Understanding. In the Survey of Text Classification Algorithms. In Mining North American Chapter of the Association for Text Data, pages 163–222. Computational Linguistics (NAACL), pages 4171– L. Douglas Baker and Andrew Kachites McCallum. 1998. Distributional Clustering of Words for Text Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, Classification. In the ACM SIGIR Conference on and Lucy Vasserman. 2018. Measuring and Mitigat- Research and Development in Information Retrieval ing Unintended Bias in Text Classification. In the (SIGIR), pages 96–103. AAAI/ACM Conference on AI, Ethics, and Society (AIES), pages 67–73. Su Lin Blodgett, Solon Barocas, Hal Daume ´ III, and Hanna Wallach. 2020. Language (Technology) is Amir Feder, Nadav Oved, Uri Shalit, and Roi Re- Power: A Critical Survey of Bias in NLP. In the An- ichart. 2020. CausaLM: Causal Model Explana- nual Meeting of the Association for Computational tion Through Counterfactual Language Models. In Linguistics (ACL), pages 5454–5476. arXiv:2005.13407. Tolga Bolukbasi, Kai-Wei Chang, James Zou, Guang-Gang Geng, Chun-Heng Wang, Qiu-Dan Li, Venkatesh Saligrama, and Adam Kalai. 2016. Man Lei Xu, and Xiao-Bo Jin. 2007. Boosting the is to Computer Programmer as Woman is to Home- Performance of Web Spam Detection with Ensem- maker? Debiasing Word Embeddings. In the Con- ble Under-Sampling Classification. In the Confer- ference on Neural Information Processing Systems ence on Fuzzy Systems and Knowledge Discovery (FSKD), pages 583–587. (NeurIPS), pages 4356–4364. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Andrei Broder, Marcus Fontoura, Evgeniy Dhruv Batra, and Devi Parikh. 2017. Making the Gabrilovich, et al. 2007. Robust Classification V in VQA Matter: Elevating the Role of Image Un- of Rare Queries Using Web Knowledge. In the derstanding in Visual Question Answering. In the ACM SIGIR Conference on Research and Devel- Conference on Computer Vision and Pattern Recog- opment in Information Retrieval (SIGIR), pages nition (CVPR), pages 6904–6913. 231–238. Suchin Gururangan, Ana Marasovic, Swabha Toon Calders and Sicco Verwer. 2010. Three Naive Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Bayes Approaches for Discrimination-free Classifi- and Noah A. Smith. 2020. Don’t Stop Pretraining: cation. In Data Mining and Knowledge Discovery, Adapt Language Models to Domains and Tasks. In pages 277–292. the Annual Meeting of the Association for Compu- tational Linguistics (ACL), pages 8342–8360. Sergio Canuto, Thiago Salles, et al. 2019. Similarity- Based Synthetic Document Representations for Haibo He and Edwardo A. Garcia. 2009. Learning Meta-Feature Generation in Text Classification. In from Imbalanced Data. In IEEE Transactions on the ACM SIGIR Conference on Research and De- Knowledge and Data Engineering (TKDE), pages velopment in Information Retrieval (SIGIR), pages 1263–1284. 355–364. Chris Hokamp and Qun Liu. 2017. Lexically Con- Carlos Castillo, Debora Donato, Aristides Gionis, strained Decoding for Sequence Generation Using Vanessa Graham Murdock, and Fabrizio Silvestri. Grid Beam Search. In the Annual Meeting of the 2007. Know Your Neighbors: Web Spam Detection Association for Computational Linguistics (ACL), using the Web Topology. In the ACM SIGIR Confer- pages 1535–1546. ence on Research and Development in Information Retrieval (SIGIR), pages 423–430. Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. William B Cavnar and John M Trenkle. 1994. N-gram- In the Annual Meeting of the Association for Com- based Text Categorization. In Annual Symposium putational Linguistics (ACL), pages 328–339. on Document Analysis and Information Retrieval (SDAIR). Xiaolei Huang and Michael J. Paul. 2018. Examining Temporality in Document Classification. In the An- XiaoShuang Chen, SiSi Li, and MengChu Zhou. 2017. nual Meeting of the Association for Computational A Noise-Filtered Under-Sampling Scheme for Im- Linguistics (ACL), pages 694–699. balanced Classification. In IEEE Transactions on Xiaolei Huang, Michael C. Smith, Michael J. Paul, Cybernetics, pages 4263–4274. Dmytro Ryzhkov, Sandra C. Quinn, David A. Bro- Zhenpeng Chen, Sheng Shen, Ziniu Hu, et al. 2019. niatowski, and Mark Dredze. 2017. Examining Emoji-Powered Representation Learning for Cross- Patterns of Influenza Vaccination in Social Media. Lingual Sentiment Classification. In the World Wide In the AAAI Conference on Artificial Intelligence Web Conference (WWW), pages 251–262. (AAAI), pages 4–5. 5443 Jing Jiang and ChengXiang Zhai. 2007. Instance for Computational Linguistics (ACL), pages 6274– Weighting for Domain Adaptation in NLP. In the 6283. Annual Meeting of the Association for Computa- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- tional Linguistics (ACL), pages 264–271. dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Thorsten Joachims. 1999. Transductive Inference for Luke Zettlemoyer, and Veselin Stoyanov. 2019. Text Classification using Support Vector Machines. RoBERTa: A Robustly Optimized BERT Pretrain- In the International Conference on Machine Learn- ing Approach. In arXiv:1907.11692. ing (ICML), pages 200–209. Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh David Jurgens, Srijan Kumar, Raine Hoover, Dan Hajishirzi. 2018. Multi-Task Identification of En- McFarland, and Dan Jurafsky. 2018. Measuring tities and Relations and and Coreference for Scien- the Evolution of a Scientific Field through Citation tific Knowledge Graph Construction. In the Con- Frames. In Transactions of the Association for Com- ference on Empirical Methods in Natural Language putational Linguistics (TACL), pages 391–406. Processing (EMNLP), pages 3219–3232. Masahiro Kaneko and Danushka Bollegala. 2019. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Gender-preserving Debiasing for Pre-trained Word Dean. 2013a. Efficient Estimation of Word Repre- Embeddings. In the Annual Meeting of the Asso- sentations in Vector Space. In arXiv:1301.3781. ciation for Computational Linguistics (ACL), pages 1641–1650. Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. 2013b. Distributed Representations of Words and Phrases Bingyi Kang, Saining Xie, Marcus Rohrbach, and their Compositionality. In the Conference on Zhicheng Yan, Albert Gordo, and Jiashi Feng. Neural Information Processing Systems (NeurIPS), 2020. Decoupling Representation and Classifier pages 3111–3119. for Long-tailed Recognition. In the International Conference on Learning Representations (ICLR). Marco Morik, Ashudeep Singh, Jessica Hong, and Thorsten Joachims. 2020. Controlling Fairness and Johannes Kiesel, Maria Mestre, Rishabh Shukla, Em- Bias in Dynamic Learning-to-Rank. In the ACM SI- manuel Vincent, Payam Adineh, David Corney, GIR Conference on Research and Development in Benno Stein, and Martin Potthast. 2019. SemEval- Information Retrieval (SIGIR), pages 429–438. 2019 Task 4: Hyperpartisan News Detection. In the International Workshop on Semantic Evaluation, Hien M. Nguyen, Eric W. Cooper, and Katsuari Kamei. pages 829–839. 2011. Borderline Over-Sampling for Imbalanced Data Classification. In the International Journal of Kang-Min Kim, Yeachan Kim, Jungho Lee, et al. 2019. Knowledge Engineering and Soft Data Paradigms From Small-scale to Large-scale Text Classification. (IJKESDP), pages 4–21. In the World Wide Web Conference (WWW), pages 853–862. Kamal Nigamy and Andrew McCallum. 1999. Using Maximum Entropy for Text Classification. In the Yoon Kim. 2014. Convolutional Neural Networks for International Joint Conference on Artificial Intelli- Sentence Classification. In the Conference on Em- gence (IJCAI), pages 61–67. pirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. 2021. Coun- Diederik P. Kingma and Jimmy Lei Ba. 2015. terfactual VQA: A Cause-Effect Look at Language Adam: A method for stochastic optimization. In Bias. In the Conference on Computer Vision and arXiv:1412.6980. Pattern Recognition (CVPR). Jens Kringelum, Sonny Kim Kjaerulff, Soren Brunak, Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Ole Lund, Tudor I Oprea, and Olivier Taboureau. 2002. Thumbs Up: Sentiment Classification using 2016. ChemProt-3.0: A Global Chemical Biology Machine Llearning Techniques. In the Conference Diseases Mapping. In Database (Oxford). on Empirical Methods in Natural Language Pro- cessing (EMNLP), pages 79–86. Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Judea Pearl. 2013. Direct and Indirect Effects. In Classification. In the AAAI Conference on Artificial arXiv:1301.2300. Intelligence (AAAI), pages 2267–2273. Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. Ken Lang. 1995. Newsweeder: Learning to Filter Net- 2016. Causal Inference in Statistics: A Primer. In news. In the International Conference on Machine John Wiley and Sons. Learning (ICML), pages 331–339. Frederick Liu and Besim Avci. 2019. Incorporating Judea Pearl and Dana Mackenzie. 2018. The Book of Priors with Feature Attribution on Text Classifica- Why: The New Science of Cause and Effect. In tion. In the Annual Meeting of the Association Basic Books. 5444 Jeffrey Pennington, Richard Socher, and Christo- Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, pher D. Manning. 2014. Glove: Global Vectors for and Hanwang Zhang. 2020. Unbiased Scene Graph Word Representation. In the Conference on Em- Generation from Biased Training. In the Confer- pirical Methods in Natural Language Processing ence on Computer Vision and Pattern Recognition (EMNLP), pages 1532–1543. (CVPR), pages 3716–3725. Matt Post and Shane Bergsma. 2013. Explicit and Im- Tan Wang, Jianqiang Huang, Hanwang Zhang, and plicit Syntactic Features for Text Classification. In Qianru Sun. 2020. Visual Commonsense R-CNN. the Annual Meeting of the Association for Computa- In the Conference on Computer Vision and Pattern tional Linguistics (ACL), pages 866–872. Recognition (CVPR), pages 10760–10770. Foster Provost. 2000. Machine Learning from Imbal- Zeerak Waseem and Dirk Hovy. 2016. Hateful Sym- anced Data Sets 101. In the AAAI Conference on bols or Hateful People? Predictive Features for Hate Artificial Intelligence (AAAI), pages 1–3. Speech Detection on Twitter. In the North American Chapter of the Association for Computational Lin- Chen Qian, Fuli Feng, Lijie Wen, Zhenpeng Chen, guistics (NAACL), pages 88–93. Li Lin, Yanan Zheng, and Tat-Seng Chua. 2020a. Solving Sequential Text Classification as Board- Jason Wei and Kai Zou. 2019. EDA: Easy Data Aug- Game Playing. In the AAAI Conference on Artificial mentation Techniques for Boosting Performance on Intelligence (AAAI), pages 8640–8648. Text Classification Tasks. In the Conference on Empirical Methods in Natural Language Processing Chen Qian, Fuli Feng, Lijie Wen, and Tat-Seng Chua. (EMNLP), pages 6382–6388. 2021. Conceptualized and Contextualized Gaussian Embedding. In the AAAI Conference on Artificial Liuyu Xiang and Guiguang Ding. 2020. Learning Intelligence (AAAI). From Multiple Experts: Self-paced Knowledge Dis- tillation for Long-tailed Classification. In the Euro- Chen Qian, Fuli Feng, Lijie Wen, Li Lin, and Tat-Seng pean Conference on Computer Vision (ECCV). Chua. 2020b. Enhancing Text Classification via Discovering Additional Semantic Clues from Lo- Jun Yan. 2009. Text Representation. In Encyclopedia gograms. In the ACM SIGIR Conference on Re- of Database Systems. search and Development in Information Retrieval Xu Yang, Hanwang Zhang, and Jianfei Cai. 2020. De- (SIGIR), pages 1201–1210. confounded Image Captioning: A Causal Retro- Chen Qian, Lijie Wen, Akhil Kumar, Leilei Lin, Li Lin, spect. In arXiv:2003.03923. Zan Zong, Shuang Li, and Jianmin Wang. 2020c. An approach for process model extraction by multi- Guanhua Zhang, Bing Bai, Junqi Zhang, Kun Bai, Con- grained text classification. In Proceedings of The ghui Zhu, and Tiejun Zhao. 2020. Demographics 32nd International Conference on Advanced Infor- Should Not Be the Reason of Toxicity: Mitigating mation Systems Engineering (CAiSE), pages 268– Discrimination in Text Classifications with Instance 282. Weighting. In the Annual Meeting of the Association for Computational Linguistics (ACL), pages 4134– Farshid Rayhan, Sajid Ahmed, Asif Mahbub, Rafsan Jani, Swakkhar Shatabda, and Dewan Md. Farid. 2017. CUSBoost: Cluster-Based Under-Sampling Yongfeng Zhang, Guokun Lai, et al. 2014. Explicit with Boosting for Imbalanced Classification. In the Factor Models for Explainable Recommendation International Conference on Computational Systems based on Phrase-level Sentiment Analysis. In the and Information Technology for Sustainable Solu- ACM SIGIR Conference on Research and Develop- tion (CSITSS), pages 1–5. ment in Information Retrieval (SIGIR), pages 83–92. Bo Sun, Haiyan Chen, Jiandong Wang, and Hua Xie. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Or- 2018. Evolutionary Under-Sampling based Bagging donez, and Kai-Wei Chang. 2017. Men Also Like Ensemble Method for Imbalanced Data Classifica- Shopping: Reducing Gender Bias Amplification us- tion. In Frontiers of Computer Science, pages 331– ing Corpus-level Constraints. In the Conference on 350. Empirical Methods in Natural Language Processing (EMNLP), pages 2979–2989. Lihua Sun, Junpeng Guo, and Yanlin Zhu. 2019. Ap- plying Uncertainty Theory into the Restaurant Rec- Peng Zhou, Wei Shi, Jun Tian, et al. 2016. Attention- ommender System based on Sentiment Analysis of Based Bidirectional Long Short-Term Memory Net- Online Chinese Reviews. In the World Wide Web works for Relation Classification. In the Annual Conference (WWW), pages 83–100. Meeting of the Association for Computational Lin- guistics (ACL), pages 207–212. Chris Sweeney and Maryam Najafian. 2020. A Trans- parent Framework for Evaluating Unintended De- mographic Bias in Word Embeddings. In the An- nual Meeting of the Association for Computational Linguistics (ACL), pages 1662–1667.

Journal

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)Unpaywall

Published: Jan 1, 2021

There are no references for this article.