Interpretable Bias Mitigation for Textual Data: Reducing Genderization in Patient Notes While Maintaining Classification Performance

JOSHUA R. MINOT and NICHOLAS CHENEY, University of Vermont, USA
MARC MAIER, MassMutual, USA
DANNE C. ELBERS, University of Vermont and VA Cooperative Studies Program, VA Boston Healthcare System, USA
CHRISTOPHER M. DANFORTH and PETER SHERIDAN DODDS, University of Vermont, USA

Medical systems in general, and patient treatment decisions and outcomes in particular, can be affected by bias based on gender and other demographic elements. As language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language models—statistical estimates of the relationships between concepts derived from distant reading of corpora. Building on this work, we investigate how differences in gender-specific word frequency distributions and language models interact with regards to bias. We identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using BERT-based gender classifiers. We show minimal degradation in health condition classification tasks for low- to medium-levels of dataset bias removal via data augmentation. Finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. This work outlines an interpretable approach for using data augmentation to identify and reduce biases in natural language processing pipelines.

CCS Concepts: • Applied computing → Health informatics; Document management and text processing; • Information systems → Content analysis and feature selection; • Computing methodologies → Natural language processing

Additional Key Words and Phrases: NLP, electronic health records, algorithmic fairness, data augmentation, interpretable machine learning

ACM Reference format: Joshua R. Minot, Nicholas Cheney, Marc Maier, Danne C. Elbers, Christopher M. Danforth, and Peter Sheridan Dodds. 2022. Interpretable Bias Mitigation for Textual Data: Reducing Genderization in Patient Notes While Maintaining Classification Performance. ACM Trans. Comput. Healthcare 3, 4, Article 39 (October 2022), 41 pages. https://doi.org/10.1145/3524887

The authors are grateful for the computing resources provided by the Vermont Advanced Computing Core and financial support from the Massachusetts Mutual Life Insurance Company and Google. The views expressed are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the United States government.

Authors' addresses: J. R. Minot, N. Cheney, C. M. Danforth, and P. S. Dodds, University of Vermont, 82 University Pl, Burlington, VT 05405, United States; emails: {joshua.minot, ncheney, chris.danforth, peter.dodds}@uvm.edu; M. Maier, MassMutual, 59 E Pleasant St, Amherst, MA 01002, United States; email: MMaier@MassMutual.com; D. C. Elbers, University of Vermont, 82 University Pl, Burlington, VT 05405, United States and VA Cooperative Studies Program, VA Boston Healthcare System, USA; email: danne.elbers@uvm.edu.

This work is licensed under a Creative Commons Attribution International 4.0 License. © 2022 Copyright held by the owner/author(s).
1 INTRODUCTION

Efficiently and accurately encoding patient information into medical records is a critical activity in healthcare. Electronic health records (EHRs) document symptoms, treatments, and other relevant histories—providing a consistent reference through disease progression, provider churn, and the passage of time. Free-form text fields, the unstructured natural language components of a health record, can be incredibly rich sources of patient information. With the proliferation of EHRs, these text fields have also been an increasingly valuable source of data for researchers conducting large-scale observational studies. The promise of EHR data does not come without apprehension, however, as the process of generating and analyzing text data is open to the influence of conscious and unconscious human bias. For example, healthcare providers entering information may have implicit or explicit demographic biases that ultimately become encoded in EHRs. Furthermore, language models that are often used to analyze clinical texts can encode broader societal biases [70]. As patient data and advanced language models increasingly come into contact, it is important to understand how existing biases may be perpetuated in modern-day healthcare algorithms.

In the healthcare context, many types of bias are worth considering. Race, gender, and socioeconomic status, among other attributes, all have the potential to introduce bias into the study and treatment of medical conditions. Bias may manifest in how patients are viewed, treated, and—most relevant here—documented. Due to ethical and legal considerations, as well as pragmatic constraints on data availability, we have focused the current research on gender bias.

There are many sources of algorithmic bias along with multiple definitions of fairness in machine learning [46]. Bias in the data used for training algorithms can stem from imbalances in target classes, how specific features are measured, and historical forces leading certain classes to have longstanding, societal misrepresentation. Definitions of fairness include demographic parity, counterfactual fairness [38], and fairness through unawareness (FTU) [28]. In the current work, we use a more general measure that we refer to as potential bias in order to gauge the impact of our data augmentation technique. Potential bias is an assessment of bias under a sort of worst-case scenario, and provides a generalized measure independent of specific bias definitions. With our methods, we seek to provide human-interpretable insights on potential bias in the case of binary class data. Further, using the same measurement, we experiment with the application of an FTU-like data augmentation process (although the concept of FTU does not neatly translate to unstructured text data). Combined, these methods can identify fundamental bias in language usage and the potential bias resulting from the application of a given machine learning model.

We refer to two classes of algorithmic-bias evaluation: a) intrinsic evaluation for exploring semantic relationships within an embedding space, and b) extrinsic evaluation for determining downstream performance differences on extrinsic tasks (e.g., classification) [34]. In the medical context, there are gender-specific elements that can influence treatment and care.
However, in this same context there might also be uses of gender when it is not relevant. In this manuscript we aim to understand the latter through analysis of clinical notes.

There is growing interest in interpretable machine learning (IML) [51]. In the context of deep language models this can involve interrogating the functionality of specific model layers (e.g., BERTology [61]), or investigating the impact of perturbations in data on outputs. This latter approach ties into the work outlined in this manuscript. Our use of the term 'interpretable' here mostly refers to a more general case where a given result can be interpreted by a human reviewer. For instance, our divergence-based measures highlight gendered terms in plain English with a clearly explained ranking methodology. While this conceptualization is complementary to IML, it does not necessarily fit cleanly within the field—we will mention explicitly when referring to an IML concept.

1.1 Prior Work

Gender bias in the field of medicine is a topic that must be viewed with nuance in light of the strong interaction between biological sex and health conditions. Medicine and gender bias interact in many ways—some of which are expected and desirable, whereas others may have uncertain or negative impacts on patient outcomes. Research has reported differences in the care and outcomes received by male and female patients for the same conditions. For example, given symptoms of the same severity, men have higher treatment rates for conditions such as coronary artery disease, irritable bowel syndrome, and neck pain [30]. Women have higher treatment-adjusted excess mortality than men when receiving care for heart attacks [2]. Female patients treated by male physicians have higher mortality rates than when treated by female physicians—while male patients have similar mortality regardless of provider gender [26]. The rate of care-seeking behavior in men has been shown to be lower than in women and has the potential to significantly affect health outcomes [23]. Some work has shown female providers have higher confidence in the truthfulness of female patients and resulting diagnoses when compared to male providers [29]. The concordance of patient and provider gender is also positively associated with rates of cancer screening [43].

Beyond gender, the mortality rate of black infants has been found to be lower when cared for by black physicians rather than their white counterparts [27]. Race and care-seeking behavior have also been shown to interact, with black patients more often seeking cardiovascular care from black providers than non-black providers [3]. It is important to note historical mistreatment and inequitable access when discussing racial disparities in health outcomes—for instance, the unethical Tuskegee Syphilis Study was found to lead to a 1.5-year decline in black male life expectancy through increased mistrust in the medical field after the exploitation of its participants was made public [4].

The gender of the healthcare practitioner can also impact EHR note characteristics that are subsequently quantified through language analysis tools. The writings of male and female medical students have been shown to have differences, with female students expressing more emotion and male students using less space [40].
More generally, some work has shown syntactic parsers generalize well for men and women when trained on data generated by women, whereas training the tools on data from men leads to poor performance for texts written by women [24].

The ubiquity of text data along with advances in natural language processing (NLP) have led to a proliferation of text analysis in the medical realm. Researchers have used social media platforms for epidemiological research [57, 60, 62]—raising a separate set of ethical concerns [47]. NLP tools have been used to generate hypotheses for biomedical research [63], detect adverse drug reactions from social media [68], and expand the known lexicon around medical topics [22, 54]. There are numerous applications of text analysis in medicine beyond patient health records. While this manuscript does not directly address tasks outside of clinical notes, it is our hope that the research could be applied to other areas. Because our methods are interpretable and based on gaining an empirical view of bias, we feel they could be a first resource in understanding bias beyond our example cases of gender in clinical texts.

Our work leverages computational representations of statistically derived relationships between concepts, commonly known as word embedding models [52]. These real-valued vector representations of words facilitate comparative analyses of text data with machine learning methods. The generation of these vectors depends on the distributional hypothesis, which states that similar words are more likely to appear together within a given context. Ideally, word embeddings map semantically similar words to similar regions in the vector space—or 'semantic space' in this case. The choice of training dataset heavily impacts the qualities of the language model and resulting word embeddings. For instance, general purpose language models are often trained on Wikipedia and the Common Crawl collection of web pages (e.g., BERT [18], RoBERTa [42]). Training language models on text from specific domains often improves performance on tasks in those domains (see below). More recent, state-of-the-art word embeddings (e.g., ELMo [55], BERT [18], GPT-2 [58]) are generally 'contextual', where the vector representation of a word from the trained model is dependent on the context around the word. Older word embeddings, such as GloVe [53] and word2vec [49, 50], are 'static', where the output from the trained model is only dependent on the word of interest—with context still being central to the task of training the model.

As medical text data are made increasingly accessible through EHRs, there has been a growing focus on developing word embeddings tailored for the medical domain. The practice of publicly releasing pre-trained, domain-specific word embeddings is common across domains, and it can be especially helpful in medical contexts described using specialized vocabulary (and even manner of writing). SciBERT is trained on a random sample of over one million biomedical and computer science papers [8]. BioBERT is similarly trained on papers from PubMed abstracts and articles [39]. There are also pre-trained embeddings focused on tasks involving clinical notes. Clinical BERT [5, 31] is trained on clinical notes from the MIMIC-III dataset [33]. A similar approach was applied with the XLNet architecture, resulting in Clinical XLNet [32].
These pre-trained embeddings perform better on domain-specific tasks related to the training data and procedure.

The undesirable bias present in word embeddings has attracted growing attention in recent years. Bolukbasi et al. present evidence of gender bias in word2vec embeddings, along with proposing a method for removing bias from gender-neutral terms [11]. Contextual word embeddings (e.g., BERT) show gender biases [6] that can have effects on downstream tasks, although these biases may present differently than those in static embeddings [7, 37]. Vig et al. investigate which model components (attention heads) are responsible for gender bias in transformer-based language models (GPT-2) [65]. A simple way to mitigate gender bias in word embeddings is to 'swap' gendered terms in training data when generating word embeddings [71]. Beutel et al. [9] develop an adversarial system for debiasing language models—in the process, relating the distribution of training data to its effects on properties of fairness in the adversarial system. Simple masking of names and pronouns may reduce bias and improve classification performance for certain language classification tasks [15]. Swapping names has been shown to be an effective data augmentation technique for decreasing gender bias in pronoun resolution tasks [41]. Simple scrubbing of names and pronouns has been used to reduce gender biases in biographies [16]. Zhang et al. examine the gender and racial biases present in Clinical BERT, concluding that after fine-tuning "[the] baseline clinical BERT model becomes more confident in the gender of the note, and may have captured relationships between gender and medical conditions which exceed biological associations" [70]. Some of these techniques for bias detection and mitigation have been critiqued as merely capturing over-simplified dimensions of bias—with proper debiasing requiring more holistic evaluation [25].

Data augmentation has been used to improve classification performance and privacy of text data. Simple methods include random swapping of words, random deletion, and random insertion [66]. More computationally expensive methods may involve using language models to generate contextually accurate synonyms [35], or even running text through multiple rounds of machine translation (e.g., English text to French and back again) [69]. De-identification is perhaps the most common data augmentation task for clinical text. Methods may range from simple dictionary look-ups [48] to more advanced neural network approaches [17]. De-identification approaches may be too aggressive and limit the utility of the resulting data while also offering no formal privacy guarantee. The field of differential privacy [20] offers principled methods for adding noise to data, and some recent work has explored applying these principles to text data augmentation [1]. Applying data-augmentation techniques to pipelines that use contextual word embeddings presents some additional uncertainty, given the ongoing nature of research working to establish what these trained embeddings actually represent and how they use contextual clues (e.g., the impact of word order on downstream tasks [56]).

In the present study, we explore the intersection of the bias that stems from language choices made by healthcare providers and the bias encoded in word embeddings commonly used in the analysis of clinical text. We present interpretable methods for detecting and reducing bias present in text data with binary classes.
Part of this work investigates how orthogonal text relating to gender bias is to text relating to clinically relevant information. While we focus on gender bias in health records, this framework could be applied to other domains and other types of bias as well. In Section 2, we describe our data and methods for evaluating bias. In Section 3, we present our results contrasting empirically observed bias in our sample data with bias encoded in word embeddings. Finally, in Section 4, we discuss the implications of our work and potential avenues for future research.

Our main contributions include the following:

• We demonstrate a model-agnostic technique for identifying biases in language usage for clinical notes corresponding to female and male patients. We provide examples of words and phrases highlighted by our method for two clinical note datasets. This methodology could readily be applied to other demographic attributes of patients as well.

• Continuing with the bias identification technique, we contrast the results from the model-agnostic bias detection with results from evaluating bias within a word-embedding space. We find that our model-agnostic method highlights domain- and dataset-specific terms, leading to more effective bias identification than when compared with results derived from language models.

• We develop a data augmentation procedure to remove biased terms and present results demonstrating that this procedure has minimal effect on clinically relevant tasks. Our experiments show that removing words corresponding to 10% of the language-distribution divergence has little effect on condition classification performance while largely reducing the gender signal in clinical notes. Further, the augmentation procedure can be applied to a high volume of terms (for terms corresponding to up to 80% of total language-distribution divergence) with minimal degradation in performance for clinically relevant tasks. More broadly, our results demonstrate that transformer-based language models can be robust to high levels of data augmentation—as indicated by retention of relative performance on downstream tasks.

Taken together, these contributions provide methods for bias identification that are readily interpreted by patients, providers, and healthcare informaticians. The bias measures are model-agnostic and dataset-specific and can be applied upstream of any machine learning pipeline. This manuscript directly supports the blossoming field of ethical artificial intelligence in healthcare and beyond. Our methods could be helpful for evaluating the impact of demographic signals—beyond gender—present in text when developing machine learning models and workflows in healthcare.

2 METHODS

Here we outline our methods for identifying and removing gendered language and evaluating the impact of this data augmentation process. We also provide brief descriptions of the datasets for our case study. The bias evaluation techniques fall into two main categories. First, we make intrinsic evaluations of the language by looking at bias within word-embedding spaces and empirical word-frequency distributions of the datasets.
The set of methods presented here enables the identification of biased language usage, a data augmentation approach for removing this language, and an example benchmark for evaluating the performance impacts on two biomedical datasets. Second, there are extrinsic evaluation tasks focused on comparing the performance of classifiers as we vary the level of data augmentation. For our dataset, this process involves testing health-condition and gender classifiers on augmented data. The extrinsic evaluation provides a measurement of potential bias and is meant to be similar to some real-world tasks that may utilize similar classifiers.

2.1 Bias Measurements

We define three different bias evaluation frameworks for our study. The first is a measure of empirical bias between language distributions for two classes observed in a given dataset. The second is a measure of intrinsic bias present in a word embedding space (as explored with a given corpus). Finally, the third measure addresses the potential extrinsic algorithmic bias present in a machine learning pipeline.

For our evaluations of intrinsic language bias in empirical data, we use a divergence metric (details in Section 2.2) calculated between the language distributions of male and female patients. We use a straightforward notion of bias that rates n-grams as more biased when their divergence contribution is higher. In this case we are detecting data bias, which may in itself have multiple sources, such as measurement or sampling biases.

We evaluate intrinsic language bias in embedding spaces by calculating similarity scores between gendered word-embedding clusters and n-grams appearing in male and female patient notes (details in Section 2.4). Here we are focused on a measure of algorithmic bias as expressed via language from a specific dataset that is encoded with a given language model.

Finally, we evaluate extrinsic bias to test the effects of our data augmentation procedure. In this case we diverge from the largely established definitions of bias by using an evaluation framework that does not claim to detect explicit forms of bias for protected classes. Instead, we reframe our extrinsic measure as potential bias (PB), which we define generally as the capacity for a classifier to predict protected classes. Our PB measure does not equate directly to real-world biases, but we argue that it has utility in establishing a generalizable indication of the potential for bias that is task-independent. Stated another way, it is a measure of the signal present in a dataset and the capacity for a given machine learning algorithm to utilize this signal for biased predictions. In our case, we present PB as measured by the performance of a binary classifier trained to predict patient gender from text documents.

We recognize concerns raised by Blodgett et al. [10] and others relating to the imprecise definitions of bias in the field of algorithmic fairness. Indeed, it is often important to motivate a given case of bias by establishing its potential harms. In our case, we limit what we present to precursors of bias, and thus do not claim to be making robust assessments of real-world bias and subsequent impact.

2.2 Rank-divergence and Trimming

We parse clinical notes into n-grams—sequences of space-delimited strings such as words or sentence fragments—and generate corresponding frequency distributions.
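As a concrete illustration of this parsing step, the minimal sketch below (not the authors' released code) builds a 1-gram rank distribution per patient class using whitespace tokenization; the document lists and the naive tie handling are illustrative assumptions.

```python
from collections import Counter

def rank_distribution(documents):
    """Count 1-grams across a set of documents and assign ranks (1 = most
    frequent). Ties are broken alphabetically here; a tie-aware average
    ranking would be more faithful to the published analysis."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())  # simple whitespace 1-gram parsing
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: rank for rank, (term, _) in enumerate(ordered, start=1)}

# Toy example: one rank distribution per patient class.
female_ranks = rank_distribution(["she was admitted with chest pain",
                                  "patient lives with her husband"])
male_ranks = rank_distribution(["he was admitted with chest pain",
                                "patient lives with his wife"])
```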
To quantify the bias of specific n-grams we compare their frequency of usage in text data corresponding to each of our two classes. The same procedure is extended as a data augmentation technique intended to remove biased language in a targeted and principled manner. More specifically, in our case study of gendered language in clinical notes we quantify the "genderedness" of n-grams by comparing their frequency of usage in notes for male and female patient populations.

For the task of comparing frequency of n-gram usage we use rank-turbulence divergence (RTD), as defined by Dodds et al. [19]. The rank-turbulence divergence between two sets, Ω_1 and Ω_2, is calculated as follows:

$$
D^{R}_{\alpha}(\Omega_1 \,\|\, \Omega_2) \;=\; \sum_{\tau} \delta D^{R}_{\alpha,\tau}
\;=\; \frac{\alpha + 1}{\alpha} \sum_{\tau} \left| \frac{1}{r_{\tau,1}^{\alpha}} - \frac{1}{r_{\tau,2}^{\alpha}} \right|^{1/(\alpha+1)},
\qquad (1)
$$

where r_{τ,s} is the rank of element τ (n-grams, in our case) in system s and α is a tunable parameter that affects the impact of the starting and ending ranks. While other techniques could be used to compare the two n-gram frequency distributions, we found RTD to be robust to differences in the overall volume of n-grams for each patient population. For example, Figure 1 shows the RTD between 1-grams from clinical notes corresponding to female and male patients.

A brief note on notation: τ always represents a unique element, or n-gram in our case. In certain contexts, τ may be an integer value that ultimately maps back to the element's string representation. This integer conversion is to allow for clean indexing—in these cases τ can be converted back to a string representation of the element with the array of element strings W.

Fig. 1. Rank-turbulence divergence allotaxonograph [19] for male and female documents in the MIMIC-III dataset. For this figure, we generated 1-gram frequency and rank distributions from documents corresponding to male and female patients. Pronouns such as "she" and "he" are immediately apparent as drivers of divergence between the two corpora. From there, the histogram on the right highlights gendered language that is both common and medical in nature. Familial relations (e.g., "husband" and "daughter") often present as highly gendered according to our measure. Further, medical terms like "hysterectomy" and "scrotum" are also highly ranked in terms of their divergence. Higher divergence contribution values, δD^R_{α,τ}, are often driven by either relatively common words fluctuating between distributions (e.g., "daughter") or the presence of disjoint terms that appear in only one distribution (e.g., "hysterectomy"). The impact of higher rank values can be tuned by adjusting the α parameter. In the main horizontal bar chart, the bars indicate the divergence contribution value and the numbers next to the terms represent their rank in each corpus. The terms that appear in only one corpus are indicated with a rotated triangle. The three smaller vertical bars describe balances between the male and female corpora: 43% of total 1-gram counts appear in the female corpus; we observed 68.6% of all 1-grams in the female corpus; and 32.2% of the 1-grams in the female corpus are unique to that corpus.
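To make the per-term ranking concrete, the sketch below computes divergence contributions following Eq. (1). It is a simplified illustration rather than the authors' implementation: the normalization factor of Dodds et al. [19] is omitted, terms missing from one corpus are simply given a shared last-place rank, and the rank dictionaries are those built in the previous sketch.

```python
def rtd_contributions(ranks_1, ranks_2, alpha=1/3):
    """Per-term rank-turbulence divergence contributions following Eq. (1),
    without normalization. Terms absent from one corpus receive a crude
    shared last-place rank."""
    last_1, last_2 = len(ranks_1) + 1, len(ranks_2) + 1
    prefactor = (alpha + 1) / alpha
    contributions = {}
    for term in set(ranks_1) | set(ranks_2):
        r1 = ranks_1.get(term, last_1)
        r2 = ranks_2.get(term, last_2)
        contributions[term] = prefactor * abs(r1 ** -alpha - r2 ** -alpha) ** (1 / (alpha + 1))
    # Highest-contribution (most gender-divergent) 1-grams first.
    return sorted(contributions.items(), key=lambda kv: -kv[1])

# Rank dictionaries as built in the previous sketch.
ranked_terms = rtd_contributions(female_ranks, male_ranks, alpha=1/3)
```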
We use the individual rank-turbulence divergence contribution, δD^R_{α,τ}, of each 1-gram to the gendered divergence, D^R_α(Ω_female ‖ Ω_male), to select which terms to remove from the clinical notes. First, we sort the 1-grams based on their rank-turbulence divergence contribution. Next, we calculate the cumulative proportion of the overall rank-turbulence divergence, RC, that is accounted for as we iterate through the sorted 1-gram list from the words with the highest contribution to those with the least (in this case, terms like "she" and "gentleman" will tend to have a greater contribution). Finally, we set logarithmically spaced thresholds of cumulative rank-divergence values to select which 1-grams to trim. The method allows us to select sets of 1-grams that contribute the most to the rank-divergence values (measured as divergence per 1-gram). Figure 2 provides a graphical overview of this procedure.

Using this selection criterion, we remove the smallest number of 1-grams per a given amount of rank-turbulence divergence removed from the clinical notes. The number of unique 1-grams removed per cumulative amount of rank-turbulence divergence grows super-linearly, as seen in Figure 8(I). This results in relatively stable distributions of document lengths for lower trim values (10–30%), although at higher trim values the procedure drastically shrinks the size of many documents (Figure 8(A–H)).

Fig. 2. Overview of the rank-turbulence divergence trimming procedure. Solid lines indicate steps that are specific to our trimming procedure and evaluation process. The pipeline starts with a repository of patient records that include clinical notes and class labels (in our case gender and ICD9 codes). From these notes we generate n-gram rank distributions for the female and male patient populations, which are then used to calculate the rank-turbulence divergence (RTD) for individual n-grams. Sorting the n-grams based on RTD contribution, we then trim the clinical notes. Finally, we view the results directly from the RTD calculation to review imbalance in language use. With the trimmed documents we compare the performance of classifiers on both the un-trimmed notes and notes with varying levels of trimming applied.

To implement this trimming procedure, we use regular expressions to replace the 1-grams we have identified for removal with a space character. We found that using 1-grams as the basis for our trimming procedure is both effective and straightforward to implement. Generally, if higher-order n-grams (e.g., 2-grams) are determined to be biased, the constituent 1-grams are also detected by the RTD metric. Our string removal procedure is applied to the overall corpus of data, upstream of any train-test dataset generation for specific classification tasks. Other potential string replacement strategies include redaction with a generic token or randomly swapping n-grams that appear within the same category across the corpus [1]. The RTD method we use could also be adapted for use with these and other replacement strategies. We chose string removal because of its simplicity and prioritization of the de-biasing task over preserving semantic structure (i.e., it presents an extreme case of data augmentation). The pipeline's performance on downstream tasks provides some indication of the semantic information retained, and as we show in Section 3 it is possible to retain meaningful signals while pursuing relatively aggressive string removal.

ALGORITHM 1: RTD trimming procedure
Input: documents D_i, i = 1, ..., N
Input: 1-gram rank distributions for each class Ω_ψ, ψ = 1, 2
Output: trimmed text data C_i^(k), i = 1, ..., N; k ∈ (0, 1]
1: δD^R_{α,τ}, W_τ ← RTD_calc(Ω_1, Ω_2, α), τ = 1, ..., M    ▷ δD^R_{α,τ} is the RTD contribution for n-gram W_τ; both sorted by RTD contribution
2: RC ← cumsum(δD^R_{α,τ})
3: for k = 0.1, 0.2, ..., 0.9 do
4:     r_b ← max(where(RC ≤ k))    ▷ index up to bin max
5:     S ← W_{0:r_b}
6:     for i = 1, ..., N do
7:         C_i^(k) ← strip(D_i, S)    ▷ remove 1-grams from document
8:     end for
9: end for
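Algorithm 1 summarizes the procedure; the sketch below is a simplified, single-threshold version in Python, reusing `ranked_terms` from the earlier RTD sketch. The threshold value and the word-boundary regex strategy are illustrative assumptions, not the paper's exact implementation.

```python
import re

def build_trim_set(ranked_terms, k):
    """1-grams accounting for the first fraction k of the total divergence
    (ranked_terms sorted by descending contribution, as in Algorithm 1)."""
    total = sum(contribution for _, contribution in ranked_terms)
    selected, running = [], 0.0
    for term, contribution in ranked_terms:
        if running / total >= k:
            break
        selected.append(term)
        running += contribution
    return selected

def trim_document(text, terms):
    """Replace each selected 1-gram with a space via a single regular
    expression (case-insensitive, word-boundary matches)."""
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(" ", text)

# Trim the 1-grams accounting for the first 10% of total RTD.
trimmed = trim_document("She lives with her husband.",
                        build_trim_set(ranked_terms, 0.10))
```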
2.3 Language Models

Large language models are increasingly common in many NLP tasks, and we feel it is important to present our results in the context of a pipeline that utilizes these models. Furthermore, language models have the potential to encode bias, and we found it necessary to contrast our empirical PB detection methods with bias metrics calculated on general purpose and domain-adapted language models.

We use pre-trained BERT-base [18] and Clinical BERT [5] word embeddings. BERT provides a contextual word embedding trained on "general" language, whereas Clinical BERT builds on these embeddings by utilizing transfer learning to improve performance on scientific and clinical texts. All models were implemented in PyTorch using the Transformers library [67]. For tasks such as nearest-neighbor classification and gender-similarity scoring, we use the off-the-shelf weights for BERT and Clinical BERT (see Figure 3 for an example of the n2c2 embedding space). These models were then fine-tuned on the gender and health-condition classification tasks. In cases where we fine-tuned the model, we added a final linear layer to the network. All classification tasks were binary with a categorical cross-entropy loss function. All models were run with a maximum sequence length of 512, batch size of 4, and gradient accumulation steps set to 12. We considered various methods for handling documents longer than the maximum sequence length (see variable length note embedding in SI), but ultimately the performance gains did not merit further use.

Fig. 3. A tSNE embedding of n2c2 document vectors generated using a pre-trained version of BERT with off-the-shelf weights. We observe the appearance of gendered clusters even before training for a gender classification task. See Figure 17 for the same visualization but with Clinical BERT embeddings.

We also run a nearest-neighbor classifier on the document embeddings produced by the off-the-shelf BERT-base and Clinical BERT models. This is intended to be a point of comparison when evaluating the potential bias present within the embedding space, as indicated by performance on extrinsic classification tasks.

In addition to the BERT-based language models, we used a simple term frequency-inverse document frequency (TFIDF) [59] based classification model as a point of comparison. For this model, we fit a TFIDF vectorizer to our training data and use logistic regression for binary classification.

For classification performance metrics we report the Matthews correlation coefficient (MCC) and receiver operating characteristic (ROC) curves. We primarily use MCC values for ease of presentation and because of the balanced nature of the measurement even in the face of class imbalances [14]. MCC is calculated as follows:

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},
\qquad (2)
$$

where TP, TN, FP, and FN are counts of true positives, true negatives, false positives, and false negatives, respectively. MCC ranges between -1 and 1, with 1 indicating the best performance and 0 indicating performance no better than random guessing. The measure is often described as the correlation between observed and predicted labels.

2.4 Gender Distances in Word Embeddings

Using a pre-trained BERT model, we embed all the 1-grams present in the clinical note datasets. For this task, we retain the full-length vector for each 1-gram, taking the average in cases where additional tokens are created by the tokenizer. The result of this process is a 1x768 vector for each n-gram.
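A minimal sketch of this embedding step follows, assuming the Hugging Face Transformers library with the bert-base-uncased checkpoint as a stand-in for the BERT-base weights used in the paper (the Clinical BERT weights can be swapped in); averaging over wordpiece sub-tokens is shown explicitly.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

@torch.no_grad()
def embed_1gram(term):
    """768-dimensional vector for an isolated 1-gram: the mean of its
    wordpiece vectors from the final hidden layer ([CLS]/[SEP] dropped)."""
    encoded = tokenizer(term, return_tensors="pt")
    hidden = model(**encoded).last_hidden_state[0]  # (n_tokens, 768)
    return hidden[1:-1].mean(dim=0)

vector = embed_1gram("hysterectomy")  # torch.Size([768])
```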
We also calculate the average embedding for a collection of terms manually selected to constitute 'gender clusters'. From these gender clusters, we calculate the cosine similarity to each of the embeddings for n-grams in the Zipf distribution.

Using measures such as cosine similarity with BERT raises some concerns, especially when looking at absolute values. BERT was not trained on the task of calculating sentence or document similarities. With BERT-base, all dimensions are weighted equally, which when applying cosine similarity can result in somewhat arbitrary absolute values. As a workaround, we believe that using the ranked value of word and document embeddings can produce more meaningful results (if we do not wish to fine-tune BERT on the task of sentence similarity). We use both absolute values and ranks of cosine similarity when investigating bias in BERT-based language models—finding the absolute values of cosine similarity to be meaningful in our relatively coarse-grained analysis. Further, taking the difference in cosine similarities for each gendered cluster addresses some of the drawbacks of examining cosine similarity values in pre-trained models.

Generating word or phrase embeddings from contextual language models raises some challenges in terms of calculating accurate embedding values. In many cases, the word embedding for a given 1-gram—produced by the final layer of a model such as BERT—can vary significantly depending on context [21]. Some researchers have proposed converting contextual embeddings to static embeddings to address this challenge [12]. Others have presented methods for creating template sentences and comparing the relative probability of masked tokens for target terms [37]. After experimenting with the template approach, we determined that the resulting embeddings were not different enough to merit switching away from the simple isolated 1-gram embeddings.

2.5 Rank-turbulence Divergence for Embeddings and Documents

We use rank-turbulence divergence in order to compare the bias encoded in word embeddings and empirical data. For word embeddings, we need to devise a metric for bias—here we use cosine similarity between biased clusters and candidate n-grams. The bias in the empirical data is evaluated using RTD for word-frequency distributions corresponding to two labeled classes. In terms of the clinical text data, for the word embeddings we use cosine similarity scores to evaluate bias relative to known gendered n-grams. For the clinical note datasets (text from documents with gender labels), we use rank-turbulence divergence calculated between the male and female patient populations.

To evaluate bias in the embedding space, we rely on similarity scores relative to known gendered language. First, we create two gendered clusters of 1-grams—these clusters represent words that are manually determined to have inherent connotations relating to female and male genders. Next, we calculate the cosine similarity between the word embeddings for all 1-grams appearing in the empirical data and the average vector for each of the two gendered clusters. Finally, we rank each 1-gram based on the distribution of cosine similarity scores for the male and female clusters.
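A minimal sketch of this cluster-similarity scoring is shown below, reusing `embed_1gram` from the previous sketch. The seed word lists and the small vocabulary are illustrative placeholders; the clusters actually used in the study are defined in Table 4.

```python
import torch
import torch.nn.functional as F

# Illustrative seed words only; the paper's clusters are defined in Table 4.
female_seed = ["she", "her", "woman", "female", "mother"]
male_seed = ["he", "his", "man", "male", "father"]

def cluster_vector(terms):
    """Average embedding of a manually selected gender cluster."""
    return torch.stack([embed_1gram(t) for t in terms]).mean(dim=0)

female_center, male_center = cluster_vector(female_seed), cluster_vector(male_seed)

def gender_similarities(term):
    """Cosine similarity of a 1-gram embedding to each cluster centroid."""
    v = embed_1gram(term)
    return (F.cosine_similarity(v, female_center, dim=0).item(),
            F.cosine_similarity(v, male_center, dim=0).item())

vocab = ["husband", "wife", "hysterectomy", "scrotal", "daughter"]
scores = {t: gender_similarities(t) for t in vocab}
# Order the vocabulary from most female-leaning to most male-leaning.
by_gender = sorted(vocab, key=lambda t: scores[t][0] - scores[t][1], reverse=True)
```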
For the empirical data, we calculate the RTD for 1-grams appearing in the clinical note datasets. The RTD value provides an indication of the bias—as indicated by differences in specific term frequency—present in the clinical notes.

Combined, these steps provide ranks for each 1-gram in terms of how much it differentiates the male and female clinical notes. Here again we can use the highly flexible rank-turbulence divergence measure to identify where there is 'disagreement' between the ranks returned by evaluating the embedding space and ranks from the empirical distribution. This is a divergence-of-divergence measure, using the iterative application of rank-turbulence divergence to compare two different measures of rank. Going forward, we refer to this measure as RTD². RTD² provides an indication of which n-grams are likely to be reported as less gendered in either the embedding space or in the empirical evaluation of the documents. For our purposes, RTD² is especially useful for highlighting n-grams that embedding-based debiasing techniques may rank as minimally biased, despite the empirical distribution suggesting otherwise.
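As a sketch of this second application, the two bias measures are each converted to rank dictionaries and compared with RTD again, reusing `rtd_contributions`, `scores`, and `ranked_terms` from the earlier sketches; in practice both measures would be computed over the same full vocabulary rather than these toy objects.

```python
def to_ranks(sorted_pairs):
    """Convert (term, score) pairs sorted by descending score into term -> rank."""
    return {term: rank for rank, (term, _) in enumerate(sorted_pairs, start=1)}

# Rank 1-grams by the embedding-space measure (cosine-similarity gap) and by
# the empirical RTD contributions, respectively.
embedding_ranks = to_ranks(sorted(scores.items(),
                                  key=lambda kv: -abs(kv[1][0] - kv[1][1])))
empirical_ranks = to_ranks(ranked_terms)  # already sorted by RTD contribution

# Second application of RTD: n-grams on which the two bias rankings disagree most.
rtd_squared = rtd_contributions(embedding_ranks, empirical_ranks, alpha=1/3)
```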
2.6 Data

We use two open source datasets for our experiments: the n2c2 (formerly i2b2) 2014 de-identification challenge [36] and the MIMIC-III critical care database [33]. The n2c2 data comprises around 1,300 documents with no gender or health condition coding (we generate our own labels for the former). MIMIC-III is a collection of diagnoses, procedures, interventions, and doctors' notes for 46,520 patients that passed through an intensive care unit. There are 26,121 males and 20,399 females in the dataset, with over 2 million individual documents. MIMIC-III includes coding of health conditions with International Classification of Diseases (ICD-9) codes, as well as patient sex.

For MIMIC-III, we focus our health-condition classification experiments on records corresponding to patients with at least one of the top 10 most prevalent ICD-9 codes. We restrict our sample population to those patients with at least one of the 10 most common health conditions—randomly drawing negative samples from this subset for each condition classification experiment. Rates of coincidence vary between 0.65 and 0.13 (Figure 9). All but one of the top 10 health conditions have more male than female patients (Table 1). As a point of reference, we also present summary results for records corresponding to patients with ICD-9 codes that appear at least 1,000 times in the MIMIC-III data (Table 8).

Table 1. Patient sex ratios for the top 10 conditions in MIMIC-III. For most health conditions there is an imbalance in the gender ratio between male and female patients. This reflects an overall bias in the MIMIC-III dataset, which has more male patients.

ICD Description              N_f      N_m      N_f/N_total   N_m/N_total
Acute kidney failure         3941     5178     0.43          0.57
Acute respiratory failure    3473     4024     0.46          0.54
Atrial fibrillation          5512     7379     0.43          0.57
Congestive heart failure     6106     7005     0.47          0.53
Coronary atherosclerosis     4322     8107     0.35          0.65
Diabetes mellitus            3902     5156     0.43          0.57
Esophageal reflux            2990     3336     0.47          0.53
Essential hypertension       9370     11333    0.45          0.55
Hyperlipidemia               3537     5153     0.41          0.59
Urinary tract infection      4027     2528     0.61          0.39

2.7 Text Pre-processing

Before analyzing or running the data through our models, we apply a simple pre-processing procedure to the text fields of the n2c2 and MIMIC-III datasets. We remove numerical values, ranges, and dates from the text. This is done in an effort to limit confounding factors related to specific values and gender (e.g., higher weights and male populations). We also strip some characters and convert common abbreviations. See Section A.1 for information on note selection.
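A rough stand-in for this pre-processing is sketched below; the regular expressions and the tiny abbreviation map are illustrative assumptions rather than the exact rules used on the n2c2 and MIMIC-III text.

```python
import re

def preprocess_note(text):
    """Strip dates, numeric ranges, and bare numbers, then expand a few
    abbreviations. Patterns and the abbreviation map are illustrative only."""
    text = re.sub(r"\b\d{1,4}[/-]\d{1,2}(?:[/-]\d{1,4})?\b", " ", text)   # dates
    text = re.sub(r"\b\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?\b", " ", text)    # ranges
    text = re.sub(r"\b\d+(?:\.\d+)?\b", " ", text)                        # numbers
    abbreviations = {"pt": "patient", "hx": "history"}                    # hypothetical
    return " ".join(abbreviations.get(tok.lower(), tok) for tok in text.split())

clean = preprocess_note("Pt admitted 01/02/2014 with BP 120-80 and hx of afib.")
# -> "patient admitted with BP and history of afib."
```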
3 RESULTS

Here we present the results of applying empirical bias detection and potential bias mitigation methods. Using rank-turbulence divergence (RTD), we rank n-grams based on their contribution to bias between two classes. Next, we apply a data augmentation procedure where we remove 1-grams based on their ranking in the RTD results. The impact of the data augmentation process is measured by tracking classification performance as we apply increasingly aggressive 1-gram trimming to our clinical note datasets. Finally, we compare the bias present in the BERT embedding space with the empirical bias we detect in the case study datasets.

One of our classification tasks is predicting patient gender from EHR notes. We include gender classification as a synthetic test that is meant to directly indicate the gender signal present in the data. Gender classification is an unrealistic task that we would not expect to see in real-world applications, but it serves as an extreme case that provides insight on the potential for other classifiers to incorporate gender information (potential bias).

We present results for both the n2c2 and MIMIC-III datasets. The n2c2 dataset provides a smaller dataset with more homogeneous documents and serves as a reference point for the tasks outlined here. MIMIC-III is much larger and its explicit coding of health conditions allows us to bring the extrinsic task of condition classification into our evaluation of bias and data augmentation.

3.1 Gender Divergence

Interpretability is a key facet of our approach to empirical bias detection. To gain an understanding of biased language usage we start by presenting the ranks of RTD values for individual n-grams in text data corresponding to each of the binary classes. The allotaxonographs we use to present this information (e.g., Figure 1) show both the RTD values and a 2-d rank-rank histogram for n-grams in each class. The rank-rank histogram (Figure 1, left) is useful for evaluating how the word-frequency distributions (Figure 1, right) are similar or disjoint among the two classes, and in the process visually inspecting the fit of the tunable parameter α, which modulates the impact of lowly-ranked n-grams. See Figures 12, 13, 14, and 15 for additional allotaxonographs, including 2- and 3-grams. In the case of our gender classes in the medical data, we find the rank distributions to be more similar than disjoint and visually confirm that α = 1/3 is an acceptable setting (by examining the relation between contour lines and the rank-rank distribution).

More specifically, in our case study gendered language is highlighted by calculating the RTD values for male and female patient notes. We present results from applying our RTD method to 1-grams in the unmodified MIMIC-III dataset in Figure 1. Unsurprisingly, gendered pronouns appear as the greatest contribution to RTD between the two corpora. Further, 1-grams regarding social characteristics such as "husband", "wife", and "daughter" and medically relevant terms relating to sex-specific or sex-biased conditions such as "hysterectomy", "scrotal", and "parathyroid" are also highlighted.

Some of these terms may be obvious to readers—suggesting the effectiveness of this approach in capturing intuitive differences. Upon deeper investigation, 1-grams such as "husband" and "wife" often appear in reference to a patient's spouse providing information or social histories. The reasons for "daughter" appearing more commonly in female patient notes are varied, but appear to be related to higher relative rates of daughters providing information for their mothers. However, the identification and ranking of other n-grams in terms of gendered bias requires examination of a given dataset—perhaps indicating unintuitive relationships between terms and gendered language, or potentially indicating overfitting of this approach to specific datasets. For instance, "parathyroid" likely refers to hypoparathyroidism, which is not a sex-specific condition but rather a sex-biased condition with a ratio of 3.3:1 for female to male diagnoses. Further, men are more likely to present asymptomatically, and may then be less likely to be diagnosed in an ICU setting [45].

Table 2. Matthews correlation coefficient for the gender classification task on the n2c2 dataset. BERT- and Clinical BERT-based models were run on the manually generated "no gender" test dataset (common pronouns, etc. have been removed). The nearest neighbor model uses off-the-shelf models to create document embeddings, while the models run for 1 and 10 epochs were fine-tuned.

                         BERT                      Clinical BERT
Model notes              Gendered    No-gend.      Gendered    No-gend.
Nearest neighbor         0.69        *             0.44        *
1 Epoch                  0.94        −0.06         0.92        0.00
10 Epochs                *           0.88          *           0.56

The application of RTD produces, in a principled fashion, a list of target terms to remove during the debiasing process—automating the selection of biased n-grams and tailoring results to a specific dataset. Using the same RTD results from above, we apply our trimming procedure—augmenting the text by iteratively removing the most biased 1-grams. For instance, in the MIMIC-III data the top 268 1-grams account for 10% of the total RTD identified by the method—and these are the first words we trim.

3.2 Gender Classification

As an extrinsic evaluation of our biased-language removal process, we present performance results for classifiers predicting membership in the two classes that we obscure through the data augmentation process.
We posit that the performance of a classifier in this case is an important metric when determining whether the data augmentation was successful in removing bias signals, and thus potential bias, from the classification pipeline. The performance of the classifier is analogous to a real-world application under an extreme case where we are trying to predict the protected class. We evaluate the performance of a binary gender classifier based on BERT and Clinical BERT language models.

As a starting point, we investigate the performance of a basic nearest-neighbor classifier running on document embeddings produced with off-the-shelf language models. The classification performance of the nearest-neighbor classifier is far better than random and speaks to the embedding space's local clustering by gendered words in these datasets, suggesting that gender may be a major component of the words embedded within this representation space. The tendency for BERT, and to a lesser extent Clinical BERT, to encode gender information can be seen in the tSNE visualization of these document embeddings (Figures 3 and 17). As seen here and in other results, Clinical BERT exhibits less potential gender bias according to our metrics. We leave a more in-depth comparison of gender bias in BERT and Clinical BERT to future work, but it is worth noting that different embeddings appear to have different levels of potential gender bias. Further, clinical text data may be more or less gender-biased than everyday text.

The performance of the BERT-based nearest neighbor classifier on the gender classification task is notable (Matthews correlation coefficient of 0.69), given the language models were not fine-tuned (Table 2). Using Clinical BERT embeddings results in an MCC of 0.44 for the nearest neighbor classifier—with Clinical BERT generally performing slightly worse on gender classification tasks.

As a point of comparison, we attempt a naive approach to removing gender bias through data augmentation that involves trimming a manually selected group of 20 words. When we run our complete BERT classifier, with fine-tuning, for 1 epoch, we find that the MCC drops from 0.94 to -0.06 when we trim the manually selected words. This pattern holds up for Clinical BERT as well. However, if we extend the training run to 10 epochs, we find that most of the classification performance is recovered. This suggests that although the manually selected terms may have some of the most prominent gender signals, removing them by no means prevents the models from learning other indicators of gender.

Fig. 4. Patient condition and gender classification performance, using the fine-tuned Clinical BERT-based model on the MIMIC dataset. (A) Proportion of baseline classification performance removed after the minimum trim level (1% of total RTD) is applied to the documents. (B) Same as (A) but with maximum trimming applied (70% of total RTD). Of all the classification tasks, 'gender' and 'Urinary tra.' experience the greatest relative decrease in classification performance. However, due to the low baseline performance of Urinary (≈ 0.2), the gender classification task has a notably higher absolute reduction in MCC than Urinary tra. (or any other task). It is worth noting that under low levels of trimming, MCC values slightly improved in individual trials. Further, under maximum trim levels the gender classification MCC was slightly negative. See Figure 5 for full information on MCC scores for each of the health conditions.
On the MIMIC-III dataset we find gender classification to be generally accurate. With no gender trimming applied, MCC values are greater than 0.9 for both BERT and Clinical BERT classifiers. This performance is quickly degraded as we employ our trimming method (Figure 5(K)). When we remove 1-grams accounting for the first 10% of the RTD, we find an MCC value of approximately 0.2 for the gender classification task. The removal of the initial 10% of rank-divergence contributions has the most impact in terms of classification performance. Further trimming does not reduce the performance as much until 1-grams accounting for nearly 80% of the rank-turbulence divergence are removed. At this point, the classifier is effectively random, with an MCC of approximately 0. Taken together, these results point to a reduction in potential bias through our trimming procedure. The large drop in performance for gender classification is in contrast to that of most health conditions (Figure 4(B)). On the health condition classification task, most trim values result in negligible drops in classification performance.

3.3 Condition Classification

To evaluate the impact of the bias removal process, we track the performance of classification tasks that are not explicitly linked to the two classes we are trying to protect. Under varying levels of data augmentation we train and test multiple classification pipelines and report any degradation in performance. These tasks are meant to be analogous to real-world applications in our domain that would require the maintenance of clinically relevant information from the text—although we make no effort to achieve state-of-the-art results (see Table 3 for baseline condition classification performance).

In the specific context of our case study, we train health-condition classifiers that produce modest performance on the MIMIC-III dataset. This performance is suitable for our purposes of evaluating the degradation in performance on the extrinsic task, relative to our trimming procedure.

Table 3. Clinical BERT performance on the top 10 ICD-9 codes in the MIMIC-III dataset.

ICD-9 Description            ICD-9 Code    MCC
Diabetes mellitus            25000         0.53
Hyperlipidemia               2724          0.46
Essential hypertension       4019          0.41
Coronary atherosclerosis     41401         0.67
Atrial fibrillation          42731         0.53
Congestive heart failure     4280          0.51
Acute respiratory failure    51881         0.43
Esophageal reflux            53081         0.43
Acute kidney failure         5849          0.29
Urinary tract infection      5990          0.23

In the case of each health condition, we find that relative classification performance is minimally affected by the trimming procedure. For instance, the classifier for atrial fibrillation results in an MCC value of around 0.48 for the male patients (Figure 5(C)) in the test set when no trimming is applied. When the minimal level of trimming is applied (10% of RTD removed), the MCC for the males is largely unchanged, resulting in an MCC of 0.48. This largely holds true for most of the trimming levels, across the 10 conditions we evaluate in depth. For 6 out of 10 conditions, we find that words accounting for approximately 80% of the gender RTD need to be removed before there is a noteworthy degradation of classification performance.
At the 80% trim level, the gender classification task has an MCC value of approximately 0, while many other conditions maintain some predictive power. Comparing the relative degradation in performance, we see that the proportion of MCC lost between no and maximum trimming is between 0.05 and 0.4 for most conditions (Figure 4(B)). The only condition with a full loss of predictive power is urinary tract infection, which one might also speculate to be related to the anatomical differences in presentation of UTIs between biological sexes. However, this task also proved the most challenging and had the worst starting (no-trim) performance (MCC ≈ 0.2).

The above results suggest that, for the conditions we examined, performance for medically relevant tasks can be preserved while reducing performance on gender classification. There is the chance that the trimming procedure may result in biased preservation of condition classification task performance. To investigate this, we present results from a lightweight, TF-IDF based classifier for 123 health conditions. We find that when we trim the top 50% of RTD, classifiers for most conditions are relatively unaffected (Figure 6). For those conditions that do experience shifts in classification performance, any gender imbalance appears attributable to the background gender distribution in the dataset.

3.4 Gender Distance

To connect the empirical data with the language models, we embed n-grams from our case study datasets and evaluate their intrinsic bias within the word-embedding space. These language models have the same model architectures that we (and many others) use when building NLP pipelines for classification and other tasks. Bias measures based on the word-embedding space are meant to provide some indication of how debiasing techniques that are more language model-centric would operate (and what specific n-grams they may highlight)—keeping with our theme of interpretability while contrasting these two approaches.

In the context of our case study, we connect empirical data with word embeddings by presenting the distributions of cosine similarity scores for 1-grams relative to gendered clusters in the embedding space. Cosine similarity scores are calculated for all 1-grams relative to both female and male clusters (defined by 1-grams in Table 4). In our results we use both the maximum cosine similarity value relative to these clusters (i.e., the score calculated against either the female or male cluster) as well as differences in the scores for each 1-gram relative to both female and male clusters.

Fig. 5. Matthews correlation coefficient (MCC) for classification results of health conditions and patient gender with varying trim levels. Results were produced with Clinical BERT embeddings and no-token n-gram trimming. (A)−(J) show MCC for the top 10 ICD9 codes present in the MIMIC dataset. (K) shows MCC for gender classification on the same population. (L) presents a comparison of MCC results for data with no trimming and the maximum trimming level applied. Values are the relative MCC, or the proportion of the best classifier's performance we lose when applying the maximum rank-turbulence divergence trimming to the data. Here we see the relatively small effect of gender-based rank divergence trimming on the condition classification tasks for most conditions. The performance on the gender classification task is significantly degraded, even at modest trim levels, and is effectively no better than random guessing at our maximum trim level. It is worth noting that many conditions are stable for most of the trimming thresholds, although we do start to see more consistent degradation of performance at the maximum trim level for a few conditions.
Here we see the relatively small effect of gender-based rank divergence trimming on the condition classification tasks for most conditions. The performance on the gender classification task is significantly degraded, even at modest trim levels, and is effectively no better than random guessing at our maximum trim level. It is worth noting that many conditions are stable for most of the trimming thresholds, although we do start to see more consistent degradation of performance at the maximum trim level for a few conditions. similarity scores for 1-grams appearing in both the n2c2 dataset (Figure 16) and the MIMIC-III dataset (Figure 7(B)), we observed a bimodal distribution of values. In both figures, a cluster with a mean around 0.9 is apparent as well as a cluster with a mean around 0.6. Through manual review of the 1-grams, we find that the cluster around 0.9 is largely comprised of more common, conversational English words whereas the cluster around 0.6 is largely comprised of medical terms. While there are more unique 1-grams in the cluster of medical terms, the overall volume of word occurrences is far higher for the conversational cluster. Referencing the cosine similarity clusters against the rank-turbulence divergence scores for the two data sets, we find that a high volume of individual 1-grams that trimmed are present in the conversational cluster. However, the number of unique terms there are removed for lower trim-values are spread throughout the cosine-similarity gender distribution. For instance, when trimming the first 1% of RTD, we find that terms selected are more con- versational cluster and more technical cluster (Figure 7(E)), with the former accounting for far more of the total volume of terms removed. The total volume of 1-grams is skewed towards the conversational cluster with terms ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:17 Fig. 6. Degradation in performance for Matthews correlation coefficient for condition classification of ICD9 codes with at least 1,000 patients. The performance degradation is presented relative to the proportion of the patients with that code who are female. We find little correlation between the efficacy of the condition classifier on highly augmented (trimmed) datasets and the gender balance for patients with that condition (coefficient of determination R = −2.48). Values are calculated for TF-IDF based classifier and include the top 10 health conditions we evaluate elsewhere. that have higher gender similarity (Figure 7(G)). The fact that the terms selected for early stages of trimming appear across the distribution of cosine similarity values illustrates the benefits of our empirical method, which is capable of selecting terms specific to a given dataset without relying on information contained in a language model. The contrast between the RTD selection criteria and the bias present in the language model helps explain why performance on the condition classification task is minimally impacted even when a high volume of 1-grams are removed—with RTD selecting only the most empirically biased terms. Using RTD-trimming, there is a middle ground between obscuring gender and barely preserving performance on condition classifications—some of the more nuanced language can be retained using our method. 
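To make the gender-distance measure concrete, the sketch below scores a vocabulary of 1-grams against female and male seed clusters (such as those listed in Table 4) using cosine similarity, keeping both the maximum similarity and the female-minus-male difference. One simple choice, assumed here, is to represent each cluster by the centroid of its seed-term vectors; the embedding lookup is left abstract (a dictionary of word vectors), and the helper names are illustrative rather than drawn from the actual implementation.

```python
# Minimal sketch: maximum and differential cosine similarity of 1-grams to
# female/male seed clusters (cf. Section 3.4). `embeddings` maps a 1-gram to a
# fixed-length vector; how those vectors are produced (e.g., from BERT) is out of scope.
import numpy as np

FEMALE_SEEDS = ["her", "she", "woman", "female", "herself"]
MALE_SEEDS = ["his", "he", "man", "male", "himself"]

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_similarity(embeddings: dict, vocab: list) -> dict:
    """For each 1-gram, return (max similarity to either cluster, female-minus-male difference)."""
    f_centroid = np.mean([embeddings[w] for w in FEMALE_SEEDS if w in embeddings], axis=0)
    m_centroid = np.mean([embeddings[w] for w in MALE_SEEDS if w in embeddings], axis=0)
    scores = {}
    for gram in vocab:
        if gram not in embeddings:
            continue
        f_sim = cosine(embeddings[gram], f_centroid)
        m_sim = cosine(embeddings[gram], m_centroid)
        scores[gram] = (max(f_sim, m_sim), f_sim - m_sim)
    return scores

# Hypothetical usage with random stand-in vectors:
rng = np.random.default_rng(1)
toy_embeddings = {w: rng.normal(size=32) for w in FEMALE_SEEDS + MALE_SEEDS + ["nurse", "cardiac"]}
print(gender_similarity(toy_embeddings, ["nurse", "cardiac", "she"]))
```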
3.5 Comparison of Language Model and Empirical Bias Finally, we identify n-grams that are more biased in either the language model or in the empirical data, using RTD to divert attention away from n-grams that appear to exhibit similar levels of bias in both contexts. Put more specifically, the first application of RTD—on the empirical data and word-embeddings—ranks n-grams that are more male or female biased. The second application, the divergence-of-divergence (RTD ), ranks n-grams in terms of where there is most disagreement between the two bias detection approaches. For the MIMIC-III dataset, we find RTD highlights sex-specific terms, social information, and medical con- ditions (Table 6). The abbreviations of “f” and “m” for instance, are rank 6288 and 244, respectively, for RTD bias measures on BERT. Moving to RTD bias measurements in MIMIC-III, “f” and “m” are the 3rd and 7th most ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:18 • J. R. Minot et al. Fig. 7. Measures of gender bias in BERT word-embeddings. (A) tSNE visualization of the BERT embedding space, colored by the maximum cosine similarity of MIMIC-III 1grams to either male or female gendered clusters. (B) Distribution of the maximum cosine similarity between male or female gender clusters for 163,539 1grams appearing the MIMIC-III corpus. Through manual inspection we find that the two clusters of cosine similarity values loosely represent more conversational English (around 0.87) and more technical language (around 0.6). The words shown here were manually selected from 20 random draws for each respective region. (C) tSNE visualization of BERT embeddings space, colored by the difference in the values of cosine similarity for each word and the male and female clusters. (D) Distribution of the differences in cosine similarity values for 1-grams and male and female clusters. (E) Distribution maximum gendered-cluster cosine similarity scores for the 1-grams selected for removal when using the rank-turbulence divergence trim technique and targeting the top 1% of words that contribute to overall divergence. The trimming procedure targets both common words that are consid- ered relatively gendered by the cosine similarity measure, and less common words that are more specific to the MIMIC-III dataset and relatively less gendered according to the cosine similarity measure. (F) Weighted distribution of differences in cosine similarity between 1-grams and male and female clusters (same measure as (D), but weighted by the total number of occurrences of the 1-gram in the MIMIC-III data). (G) Weighted distribution of maximum cosine similarity scores between 1-grams and male or female clusters (same measure as (B), but weighted by the total number of occurrences of the 1-gram in the MIMIC-III data). biased terms, respectively, appearing in practically every note when describing basic demographic information for patients. The BERT word embedding of the 1-gram “grandmother” has a rank of 4 but a rank of 3571 in the MIMIC-III data—due to the fact that the 1-gram “grandmother” is inherently semantically gendered, but in the context of health records does not necessarily contain meaningful information on patient gender. “Husband” on the other hand does contain meaningful information on the patient gender (at least in the MIMIC-III patient population), with it being rank 4 in terms of its empirical bias—the word embedding suggests it is biased, but less so with a rank of 860. 
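The rank-turbulence divergence machinery behind this comparison can be sketched as follows: each 1-gram's contribution is computed from its ranks in two systems (female- versus male-patient notes for the empirical bias, or bias scores derived from the embedding space), and the divergence-of-divergence is obtained by re-applying the same contribution formula to the two resulting bias rankings. The per-type term follows Dodds et al. [19], with the overall normalization omitted for brevity; the tie-handling, the fill-in rank for missing types, and the choice α = 1/3 are simplifying assumptions made for illustration.

```python
# Minimal sketch: per-term rank-turbulence divergence (RTD) contributions and a
# "divergence of divergences" comparing two bias rankings (cf. Section 3.5).
# Uses the per-type term of Dodds et al. [19]; normalization and exact tie/exclusive-type
# handling are simplified here.
from collections import Counter

def rank_dict(counts: Counter) -> dict:
    """Map each type to its (1-indexed) rank by descending count; ties broken arbitrarily."""
    return {t: i + 1 for i, (t, _) in enumerate(counts.most_common())}

def rtd_contributions(counts_a: Counter, counts_b: Counter, alpha: float = 1 / 3) -> dict:
    """Unnormalized RTD contribution for every type appearing in either system.
    Types missing from one system get a rank one past that system's last rank (a simplification)."""
    ranks_a, ranks_b = rank_dict(counts_a), rank_dict(counts_b)
    fill_a, fill_b = len(ranks_a) + 1, len(ranks_b) + 1
    contrib = {}
    for t in set(ranks_a) | set(ranks_b):
        ra, rb = ranks_a.get(t, fill_a), ranks_b.get(t, fill_b)
        contrib[t] = abs(ra ** -alpha - rb ** -alpha) ** (1 / (alpha + 1))
    return contrib

# First application: empirical gender bias from female vs. male note word counts (toy data).
female_counts = Counter(["she", "pain", "uterine"])
male_counts = Counter(["he", "pain", "prostate"])
empirical = rtd_contributions(female_counts, male_counts)

# Second application ("divergence of divergences"): compare the empirical bias ranking with
# an embedding-based bias ranking by treating each score table as a count-like system.
embedding_bias = {"she": 3.0, "he": 2.9, "pain": 0.1, "uterine": 0.2, "prostate": 0.3}
disagreement = rtd_contributions(Counter(empirical), Counter(embedding_bias))
print(sorted(disagreement.items(), key=lambda kv: -kv[1])[:5])
```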
As a final set of examples, we look at medical conditions. It is worth noting that our choice of BERT rather than Clinical BERT most likely results in less effective word embeddings for medical terms. “Cervical” has a rank of 7 in the BERT bias rankings and a rank of 18374 in the empirical bias distribution—most likely owing to its split meanings in a medical context. Conversely, “flomax” has a rank of 10891 for the word embedding bias, while the empirical bias rank is 11—most likely due to the gender imbalance in the incidence of conditions (e.g., kidney stones, chronic prostatitis) that flomax is often prescribed to treat. Similarly, “hypothyroidism” is ranked 12 in MIMIC and 17831 in BERT RTD ranks, with the condition having a known increased prevalence in female patients.

The high RTD ranks for medical conditions owe in part to the fact that we used BERT rather than the medically adapted Clinical BERT for these results. For these results, the choice to use the general-purpose BERT rather than Clinical BERT was motivated by illustrating the discrepancies in bias rankings when using the general-purpose model (with the added contrast of a shifted domain, as indicated by jargonistic medical conditions). When applying this type of comparison in practice, it will most likely be more beneficial to compare bias ranks with language models that are used in any final pipeline (in this case, Clinical BERT). Additionally, the difficulty of constructing meaningful clusters of gendered terms using technical language limits the utility of our cosine similarity bias measure in the Clinical BERT embedding space (see Table 7). Inspection of the 1-grams with high RTD values for BERT suggests a word of caution when using general-purpose word embeddings on more technical datasets, while also illustrating how the specific terms that drive bias may differ between domains. The lesson derived from this case study of applying BERT to medical texts could be expanded to provide further caution when working in domains that do not have the benefit of fine-tuned models or where model fit may be generally poor for other reasons.

4 CONCLUDING REMARKS

Here we present interpretable methods for detecting and reducing bias in text data. Using clinical notes and gender as a case study, we explore how using our methods to augment data may affect performance on classification tasks, which serve as extrinsic evaluations of the potential-bias removal process. We conclude by contrasting the inherent bias present in language models with the bias we detect in our two example datasets. These results demonstrate that it is possible to obscure gender features while preserving the signal needed to maintain performance on medically relevant classification tasks.

Our methods start by using a divergence measure to identify empirical data bias present in our EHR datasets. We then assess the intrinsic bias present within the word embedding spaces for general-purpose and clinically adapted language models. We introduce the concept of potential bias (PB) and evaluate the reduction of extrinsic PB when we apply our mitigation strategy. PB results are generated by presenting performance on a gender classification task.
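The mitigation step referenced above can be summarized in a few lines: sort 1-grams from largest to smallest contribution to the female-male rank-turbulence divergence, take terms from the top of that ranking until a target fraction of the total divergence is covered, and delete those terms from every note before training. The sketch below assumes per-term contributions have already been computed (for example, with a helper like the rtd_contributions sketch above) and uses simple whitespace tokenization; it illustrates the procedure rather than reproducing the exact implementation.

```python
# Minimal sketch: RTD-based trimming (data augmentation). Remove the 1-grams that account
# for the first `trim_fraction` of total rank-turbulence divergence, taken from largest to
# smallest contribution (cf. the Figure 8 caption).
def terms_to_trim(contributions: dict, trim_fraction: float) -> set:
    total = sum(contributions.values())
    removed, running = set(), 0.0
    for term, value in sorted(contributions.items(), key=lambda kv: -kv[1]):
        if running >= trim_fraction * total:
            break
        removed.add(term)
        running += value
    return removed

def trim_note(note: str, removed: set) -> str:
    """Drop removed 1-grams from a note (whitespace tokenization, for illustration only)."""
    return " ".join(tok for tok in note.split() if tok.lower() not in removed)

# Hypothetical usage with stand-in contribution values:
contributions = {"she": 0.9, "husband": 0.5, "pain": 0.05, "pt": 0.01}
removed = terms_to_trim(contributions, trim_fraction=0.5)
print(removed, trim_note("Pt states she has pain ; husband at bedside", removed))
```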
Finally, we compare the results of assessing empirical data bias and intrinsic embedding space bias by contrasting the rankings of 1-grams produced by each method. When evaluating the differences in word use frequency in medical documents, certain intuitive results emerge: practitioners use gendered pronouns to describe patients, they note social and family status, and they encode medical conditions with known gender imbalances. Using our rank-turbulence divergence approach, we are able to evaluate how each of these practices, in aggregate, contributes to a divergence in word-frequency distributions between the male- and female-patient notes. This becomes more useful as we move to identifying language that, while not explicitly gendered, may still be used in an unbalanced fashion in practice (for instance, non-sex-specific conditions that are diagnosed more frequently in one gender).

The results from divergence methods are useful both for understanding differences in language usage and as a debiasing technique. While many methods addressing the debiasing of language models focus on the bias present in the model itself, our empirically based method offers stronger debiasing of the data at hand. Modern language models are capable of detecting gender signals in a wide variety of datasets ranging from conversational to highly technical language. Many methods for removing bias from the pre-trained language model still leave the potential for meaningful proxies in the target dataset, while also raising questions on degradation in performance. We believe that balancing debiasing with model performance benefits from interpretable techniques, such as those we present here. For instance, our bias ranking and iterative application of divergence measures allow users to get a sense of the disagreement in bias ranks for language models and empirical data.

Our study is limited to looking at (a) intrinsic bias found in our dataset and pre-trained word embeddings, and (b) the extrinsic potential bias identified in our classification pipeline. We recognize concerns raised by Blodgett et al. [10] and others relating to the imprecise definitions of bias in the field of algorithmic fairness. Indeed, it is often important to motivate a given case of bias by establishing its potential harms. In this piece, we address precursors to bias, and thus do not claim to be making robust assessments of real-world bias and subsequent impact. The potential bias metric is instead meant to be a task-agnostic indicator of the capacity for a complete pipeline to discriminate between protected classes. Due to the available data, we were not able to develop methods that address non-binary cases of gender bias. There are other methodological considerations for expanding past the binary cases [13], although this is an important topic for a variety of bias types [44].

There are further complications when moving to tasks where the associated language is not as neatly segmented. For instance, we show above that when evaluating language models such as BERT, much of the gendered language largely appears in a readily identifiable region of the semantic space. As a rough heuristic: terms appearing in a medical dictionary tended to be less similar to gendered terms than terms that might appear in casual conversation.
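That heuristic can be probed directly: given the maximum gender-cluster similarity scores from the embedding space and any list of medical vocabulary, compare the score distributions for in-lexicon and out-of-lexicon terms. The sketch below is illustrative only; the lexicon and the example scores are hypothetical stand-ins (echoing the two observed similarity clusters near 0.6 and 0.9), and the summary statistic is a simple mean rather than a quantity reported here.

```python
# Minimal sketch: compare gender-cluster similarity for medical vs. conversational terms.
# `max_gender_similarity` maps each 1-gram to its maximum cosine similarity with the
# female or male cluster; `medical_lexicon` is any available set of clinical terms.
from statistics import mean

def similarity_by_vocabulary(max_gender_similarity: dict, medical_lexicon: set) -> dict:
    medical = [s for t, s in max_gender_similarity.items() if t in medical_lexicon]
    other = [s for t, s in max_gender_similarity.items() if t not in medical_lexicon]
    return {
        "medical_mean": mean(medical) if medical else float("nan"),
        "conversational_mean": mean(other) if other else float("nan"),
    }

# Hypothetical stand-in scores:
scores = {"creatinine": 0.58, "stent": 0.62, "lovely": 0.91, "pleasant": 0.88}
print(similarity_by_vocabulary(scores, medical_lexicon={"creatinine", "stent"}))
```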
For doctors' notes, the bulk of the bias stems from words that are largely distinct from those that we expect to be most informative for medically relevant tasks. Further research is required to determine the efficacy of our techniques in domains where language is not as neatly semantically segmented.

Using clinical notes from an ICU context could bias our results due to the types of patients, conditions, and interactions that are common in this setting. For instance, there may be fewer verbal patient-provider interactions reflected in the data, and social histories may not be as in-depth (compared with a primary-care setting). Further, the ways in which clinicians code health conditions may vary across contexts, institutions, and providers. In our study we aim to reduce the impact of how conditions are coded by selecting common conditions that have large sample sizes in our data set—but this is still a factor that should be considered when working with such data.

Future research that applies these interpretable methods to clinical text has the opportunity to examine possible confounding factors such as patient-provider gender concordance. Further, it would be worthwhile to separately address the impact of author gender on the content of clinical texts using our analytical framework. Other confounding factors relating to the patient populations and broader socio-demographic factors could be addressed by replicating these trials on new data sets. There is also the potential to research how presenting the results of our empirical bias analysis to clinicians may affect note-writing practices—perhaps adding empirical examples to the growing medical school curriculum that addresses unconscious bias [64].

Our methods make no formal privacy guarantees, nor do we claim complete removal of bias. There is always a trade-off when seeking to balance bias reduction with overall performance, and we feel our methods will help all stakeholders make more informed decisions. Our methodology allows stakeholders to specify the trade-off between bias reduction and performance that is best for their particular use case by selecting different trim levels and reviewing the n-grams removed. Using a debiasing method that is readily interpreted by doctors, patients, and machine learning practitioners is a benefit for all involved, especially as public interest in data privacy grows.

Moving towards replacing strings, rather than trimming or dropping them completely, should be investigated in the future. More advanced data augmentation methods may be needed if we were to explore the impact of debiasing on highly tuned classification pipelines. Holistic comparisons of string replacement techniques and other text data augmentation approaches would be worthwhile next steps. Further research on varying and more difficult extrinsic evaluation tasks would be helpful in evaluating how our technique generalizes. Future work could also investigate coupling our data-driven method with methods focused on debiasing language models.

APPENDICES

A SUPPLEMENTARY INFORMATION (SI)

A.1 Note Selection

After reviewing the note types available in the MIMIC-III dataset, we determined that many types were not suitable for our task. This is due to a combination of factors including information content and note length (Figure 22).
Note types such as radiology often include very specific information (not indicative of broader patient health status), are shorter, and may be written in a jargonistic fashion. For the work outlined here we only include notes that are of the types nursing, discharge summary, and physician. In order to be included in our training and test datasets, documents must come from patients with at least three recorded documents.

A.2 Document Lengths after Trimming

Fig. 8. Document length after applying a linearly-spaced rank-turbulence divergence based trimming procedure. Percentage values represent the percentage of total rank-turbulence divergence removed. Trimming is conducted by sorting words highest-to-lowest based on their individual contribution to the rank-turbulence divergence between male and female corpora (i.e., the first 10% trim will include words that, for most distributions, contribute far more to rank-turbulence divergence than the last 10%).

A.3 Variable Length Note Embedding

When tokenized, many notes available in the MIMIC-III dataset are longer than the 512-token maximum supported by BERT. To address this issue we experiment with truncating the note at the first 512 tokens. We also explore embedding at the sentence level (embedding with a maximum of 128 tokens) and simply dividing the note into 512-token subsequences. In the latter two cases, we use the function outlined by Huang et al. [31],

P(Y = 1) = \frac{P^{n}_{\max} + P^{n}_{\mathrm{mean}} \, n/c}{1 + n/c},    (3)

where P^{n}_{\max} and P^{n}_{\mathrm{mean}} are the maximum and mean probabilities for the n subsequences associated with a given note. Here, c is a tunable parameter that is adjusted for each task. For our purposes, the improvement in classification performance returned by employing this technique did not merit use in our final results. If overall performance of our classification system was our primary objective, this may be worth further investigation.

A.4 ICD Co-occurrence

Fig. 9. Normalized rates of health-condition co-occurrence for the top 10 ICD-9 codes.

A.5 Hardware

BERT and Clinical BERT models were fine-tuned on both an NVIDIA RTX 2070 (8GB VRAM) and NVIDIA Tesla V100s (32GB VRAM).

A.6 Gendered 1-grams

Table 4. Manually Selected Gendered Terms

Female 1-grams    Male 1-grams
her               his
she               he
woman             man
female            male
Ms                Mr
Mrs               him
herself           himself
girl              boy
lady              gentleman

Fig. 10. Classification performance for the next 123 most frequently occurring conditions. Matthews correlation coefficient for condition classification of ICD9 codes with at least 1,000 patients compared to the proportion of the patients with that code who are female. While the most accurate classifiers tend to be for conditions with a male bias, we observed that this is in part due to the underlying bias in patient gender.

Fig. 11. ROC curves for classification task on top 10 health conditions with varying proportions of rank-turbulence divergence removed. Echoing the results in Figure 5, the gender classifier has the best performance on the ‘no-trim’ data and experiences the greatest drop in performance when trimming is applied.
Under the highest trim level reported here, the gen- der classifier is effectively random, while few condition classifiers retain prediction capability (albeit modest). The bar chart shows the area under the ROC curve for classifiers, by task, trained and tested with no-trimming and maximum-trimming applied. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:25 Fig. 12. Rank-turbulence divergence for 2014 n2c2 challenge. For this figure, 2-grams have been split between genders and common gendered terms (pronouns, etc.) have been removed before calculating rank divergence. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:26 • J. R. Minot et al. Fig. 13. Rank-turbulence divergence for 2014 n2c2 challenge. For this figure, 1-grams have been split between genders and common gendered terms (pronouns, etc., see Table 4) have been removed before calculating rank divergence. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:27 Fig. 14. Rank-turbulence divergence for 2014 n2c2 challenge. For this figure, 3-grams have been split between genders and common gendered terms (pronouns, etc.) have been removed before calculating rank divergence. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:28 • J. R. Minot et al. Fig. 15. Rank-turbulence divergence for 2014 n2c2 challenge. For this figure, 2-grams have been split between genders. Fig. 16. Maximum cosine similarity scores of BERT-base embeddings for 26,883 1grams appearing in n2c2 2014 challenge data relative to gendered clusters. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:29 Fig. 17. A tSNE embedding of n2c2 document vectors generated using a pre-trained version of Clinical BERT. Fig. 18. A tSNE embedding of MIMIC document vectors generated using a pre-trained version of Clinical BERT. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:30 • J. R. Minot et al. Fig. 19. A tSNE embedding of MIMIC document vectors generated using a pre-trained version of Clinical BERT. Fig. 20. A tSNE embedding of MIMIC document vectors generated using a pre-trained version of Clinical BERT. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:31 Table 5. Rank-Turbulence Divergence of the Rank-Turbulence Divergence Between Male and Female Zipf Distributions According to n2c2 Rank-Turbulence Divergence and BERT Cosine Similarity Ranking BERT n2c2 BERT-n2c2 1gram RTD rank RTD rank RTD rank 1 mrs 1.0 1172.5 1.0 2 ms 4.0 6560.5 2.0 3 her 279.0 3.0 3.0 4 mr. 15307.0 6.0 4.0 5 male 21798.0 8.0 5.0 6 mr 5.0 437.0 6.0 7 female 5150.0 9.0 7.0 8 linda 10.0 3681.5 8.0 9 ms. 2208.0 11.0 9.0 10 gentleman 3054.0 13.0 10.0 11 pap 25105.0 17.0 11.0 12 breast 4314.0 14.0 12.0 13 cervical 14.0 2483.0 13.0 14 biggest 16.0 5301.5 14.0 15 mammogram 25054.0 21.0 15.0 16 mrs. 
2860.0 16.0 16.0 17 f 10458.0 20.0 17.0 18 woman 120.0 7.0 18.0 19 psa 6082.0 19.0 19.0 20 kathy 19.0 5301.5 20.0 21 he 3.0 1.0 21.0 22 them 20.0 4146.0 22.0 23 prostate 2223.0 18.0 23.0 24 bph 8601.0 23.0 24.0 25 husband 920.0 15.0 25.0 26 guy 22.0 5301.5 26.0 27 take 4278.0 22.0 27.0 28 infected 18.0 1455.0 28.0 29 patricia 21.0 2701.5 29.0 30 smear 21455.0 29.0 30.0 31 ellen 24.0 3681.5 31.0 32 cabg 19064.0 33.0 32.0 33 distal 7754.0 30.0 33.0 34 pend 22515.0 36.0 34.0 35 tablet 2322.0 25.0 35.0 36 cath 7111.0 31.0 36.0 37 qday 12829.0 34.0 37.0 38 peggy 17.0 485.0 38.0 39 flomax 17413.0 37.5 39.0 40 lad 2383.0 27.0 40.0 41 prostatic 23978.0 43.0 41.0 42 gout 11948.0 40.0 42.0 43 taking 9534.0 39.0 43.0 44 trouble 34.0 4183.5 44.0 45 harry 33.0 3372.0 45.0 46 vaginal 10418.0 41.0 46.0 47 qty 18645.0 45.0 47.0 48 she 6.0 2.0 48.0 49 p.o 9293.0 42.0 49.0 50 xie 20584.0 47.5 50.0 ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:32 • J. R. Minot et al. Table 6. Comparison of Rank-Turbulence Divergences for Gendered Clusters in BERT Embeddings and the MIMIC Patient Health Records Text BERT MIMIC BERT-MIMC MIMIC MIMIC 1gram RTD rank RTD rank RTD rank Frank Mrank sexually 2.0 16755.5 1.0 10607.0 10373.5 biggest 3.0 7594.5 2.0 17520.5 21172.5 f 6288.0 3.0 3.0 251.0 1719.0 infected 5.0 16076.0 4.0 3475.0 3402.5 grandmother 4.0 3517.0 5.0 6090.0 7643.0 cervical 7.0 18374.0 6.0 1554.0 1551.5 m 244.0 2.0 7.0 2103.0 249.0 sister 6.0 4153.0 8.0 1213.5 1365.0 husband 860.0 4.0 9.0 495.0 5114.0 teenage 9.0 5513.0 10.0 15449.0 19594.0 trouble 12.0 10119.0 11.0 3778.5 4095.5 brother 10.0 1921.0 12.0 1925.0 1598.0 teenager 8.0 936.5 13.0 12928.5 21172.5 connected 16.0 16682.0 14.0 5184.5 5089.5 shaky 11.0 2341.0 15.0 10607.0 7872.0 my 15.0 8198.0 16.0 2395.0 2624.5 breast 3397.0 6.0 17.0 1075.0 4673.5 expelled 19.0 14652.5 18.0 20814.0 19594.0 them 18.0 11337.0 19.0 1738.5 1832.0 prostate 2196.0 5.0 20.0 9436.0 1576.5 immune 20.0 15043.5 21.0 9119.0 8753.0 daughter 1.0 16.0 22.0 463.0 801.5 initial 23.0 11433.0 23.0 632.5 610.0 ovarian 16374.0 8.0 24.0 3137.0 14082.0 recovering 24.0 11867.0 25.0 5351.5 5749.5 abnormal 25.0 9010.0 26.0 1849.5 1994.0 alcoholic 17.0 1136.0 27.0 3885.5 2952.0 obvious 26.0 10154.0 28.0 2725.5 2540.5 huge 28.0 13683.5 29.0 8069.0 7643.0 dirty 29.0 13264.5 30.0 8613.5 9179.5 suv 27.0 5911.0 31.0 14704.0 18377.5 container 31.0 17776.0 32.0 12090.5 12214.5 flomax 10891.0 11.0 33.0 17520.5 4095.5 sisters 32.0 17680.5 34.0 4727.0 4687.5 uterine 7260.0 10.0 35.0 4263.0 19594.0 hypothyroidism 17831.0 12.0 36.0 1239.5 2920.0 dried 34.0 15661.0 37.0 5124.5 4982.0 osteoporosis 18010.0 13.0 38.0 2354.0 7003.0 breasts 4727.0 9.0 39.0 4394.5 21172.5 certain 37.0 18020.5 40.0 7973.0 8023.0 i 30.0 4601.0 41.0 403.0 435.5 restless 21.0 1144.0 42.0 1366.0 1123.0 wife 13.0 1.0 43.0 5545.0 245.5 sle 8083.0 14.0 44.0 3511.0 12504.5 granddaughter 14.0 210.0 45.0 4656.5 7872.0 localized 47.0 14357.5 46.0 6136.0 6405.0 ciwa 7921.0 15.0 47.0 3106.5 1341.0 honey 44.0 11076.0 48.0 10868.5 9861.0 coronary 11570.0 18.0 49.0 560.0 349.0 systemic 41.0 6361.0 50.0 3696.0 4214.0 BERT RTD ranks are calculated based on cosine similarity scores for word embedding and gendered clusters (i.e., the RTD of cosine similarity score ranks relative to male and female clusters). MIMIC RTD ranks are for 1-grams from male and female clinical notes. 
“BERT-MIMIC RTD rank” is the rankings for 1-grams based on RTD between the first two columns—we also refer to this as RTD (ranking divergence-of-divergence). ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:33 Table 7. Comparison of Rank-Turbulence Divergences for Gendered Clusters in Clinical BERT Embeddings and the MIMIC Patient Health Records Text BERT MIMIC BERT-MIMC MIMIC MIMIC 1gram RTD rank RTD rank RTD rank Frank Mrank is 1.0 18545.5 1.0 24.0 24.0 wife 3588.0 1.0 2.0 245.5 5545.0 yells 2.0 11292.0 3.0 10563.0 9615.0 looking 3.0 16700.0 4.0 3238.5 3289.5 m 7181.0 2.0 5.0 249.0 2103.0 kids 4.0 13675.0 6.0 8753.0 9266.5 essentially 5.0 17864.0 7.0 2269.0 2279.5 alter 6.0 15840.0 8.0 15764.0 16387.5 f 2166.0 3.0 9.0 1719.0 251.0 bumps 7.0 6936.0 10.0 13612.0 11417.0 husband 2958.0 4.0 11.0 5114.0 495.0 historian 8.0 7645.0 12.0 6895.0 6045.5 insult 9.0 7241.0 13.0 8854.5 10361.0 moments 10.0 16287.0 14.0 10563.0 10868.5 our 11.0 16938.0 15.0 2803.0 2838.0 asks 13.0 16147.0 16.0 7257.0 7449.5 goes 15.0 17662.0 17.0 3653.0 3624.5 someone 16.0 14030.0 18.0 6197.0 5920.5 ever 17.0 10698.0 19.0 5114.0 5545.0 prostate 6070.0 5.0 20.0 1576.5 9436.0 breast 6010.0 6.0 21.0 4673.5 1075.0 experiences 20.0 17062.0 22.0 9452.5 9615.0 suffer 12.0 1463.5 23.0 10968.5 16387.5 recordings 21.0 13249.5 24.0 12504.5 13440.5 wore 23.0 14774.5 25.0 8854.5 9266.5 largely 26.0 16941.0 26.0 4581.5 4515.5 hi 22.0 6459.0 27.0 3992.5 4558.0 et 27.0 17615.0 28.0 2079.0 2093.5 staying 25.0 10254.5 29.0 4366.5 4746.5 pursuing 18.0 1561.5 30.0 13612.0 20814.0 pet 24.0 4873.0 31.0 5879.0 7076.5 town 28.0 10038.0 32.0 10754.0 9615.0 tire 30.0 12053.5 33.0 12834.5 11736.5 ovarian 13390.0 8.0 34.0 14082.0 3137.0 beef 19.0 1355.5 35.0 23307.0 14704.0 dipping 34.0 17764.0 36.0 6371.0 6318.5 dip 35.0 18328.0 37.0 5585.0 5600.0 hat 33.0 10106.5 38.0 14082.0 12478.5 flomax 14782.0 11.0 39.0 4095.5 17520.5 punch 31.0 5064.5 40.0 11203.5 14033.0 ease 38.0 17265.5 41.0 6471.5 6556.0 hasn 36.0 11324.0 42.0 10373.5 11417.0 lasts 39.0 14072.5 43.0 11203.5 10607.0 grabbing 32.0 4294.0 44.0 10968.5 14033.0 hypothyroidism 17820.0 12.0 45.0 2920.0 1239.5 whatever 37.0 8524.5 46.0 13200.5 15449.0 osteoporosis 18276.0 13.0 47.0 7003.0 2354.0 uterine 6519.0 10.0 48.0 19594.0 4263.0 sle 16374.0 14.0 49.0 12504.5 3511.0 dump 47.0 15439.0 50.0 10968.5 11417.0 Clinical BERT RTD ranks are calculated based on cosine similarity scores for word embedding and gendered clusters (i.e., the RTD of cosine similarity score ranks relative to male and female clusters). MIMIC RTD ranks are for 1-grams from male and female clinical notes. “BERT-MIMIC RTD rank” is the rankings for 1-grams based on RTD between the first two columns—we also refer to this as RTD (ranking divergence-of-divergence). The presence of largely conversational terms rather than more technical, medical language owes to our defining of gender clusters through manually selected terms. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:34 • J. R. Minot et al. Fig. 21. Document length for MIMIC-III text notes. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:35 Fig. 22. Document length for MIMIC-III by note type. For our study we include discharge summary, physician, and nurs- ing notes. 
Consult notes were initially considered but were ultimately found to be highly varied in terms of notation and nomenclature. This had the effect of making results more difficult to interpret and would have required additional data clean- ing. We believe our methods could be applied to patient records that include consult notes, just at the cost of additional pre-processing and more nuanced interpretation. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:36 • J. R. Minot et al. Table 8. Condition Name and Gender Balance for the ICD9 Codes with at Least 1,000 Observations in the MIMIC-III Dataset Sex Count Sex Prop. ICD Description F M F M Personal history of malignant neoplasm of prostate 0 1207 0.00 1.00 Hypertrophy (benign) of prostate without urinar... 0 1490 0.00 1.00 Routine or ritual circumcision 0 2016 0.00 1.00 Gout, unspecified 552 1530 0.27 0.73 Alcoholic cirrhosis of liver 323 879 0.27 0.73 Retention of urine, unspecified 283 737 0.28 0.72 Intermediate coronary syndrome 466 1197 0.28 0.72 Chronic systolic heart failure 321 776 0.29 0.71 Aortocoronary bypass status 896 2160 0.29 0.71 Other and unspecified angina pectoris 330 770 0.30 0.70 Paroxysmal ventricular tachycardia 548 1263 0.30 0.70 Chronic hepatitis C without mention of hepatic ... 380 838 0.31 0.69 Coronary atherosclerosis of unspecified type of... 479 1015 0.32 0.68 Percutaneous transluminal coronary angioplasty ... 889 1836 0.33 0.67 Portal hypertension 332 675 0.33 0.67 Surgical operation with anastomosis, bypass, or... 406 805 0.34 0.66 Coronary atherosclerosis of native coronary artery 4322 8107 0.35 0.65 Old myocardial infarction 1156 2122 0.35 0.65 Acute on chronic systolic heart failure 406 737 0.36 0.64 Cardiac complications, not elsewhere classified 847 1496 0.36 0.64 Atrial flutter 444 773 0.36 0.64 Paralytic ileus 394 678 0.37 0.63 Chronic kidney disease, unspecified 1265 2170 0.37 0.63 Personal history of tobacco use 1042 1769 0.37 0.63 Pneumonitis due to inhalation of food or vomitus 1369 2311 0.37 0.63 Tobacco use disorder 1251 2107 0.37 0.63 Obstructive sleep apnea (adult)(pediatric) 891 1489 0.37 0.63 Cirrhosis of liver without mention of alcohol 486 801 0.38 0.62 Hypertensive chronic kidney disease, unspecifie... 1300 2121 0.38 0.62 Diabetes with neurological manifestations, type... 438 700 0.38 0.62 Other primary cardiomyopathies 664 1045 0.39 0.61 Cardiac arrest 542 819 0.40 0.60 Peripheral vascular disease, unspecified 564 837 0.40 0.60 Hyperpotassemia 874 1295 0.40 0.60 Bacteremia 599 879 0.41 0.59 Other and unspecified hyperlipidemia 3537 5153 0.41 0.59 Thrombocytopenia, unspecified 1255 1810 0.41 0.59 Pure hypercholesterolemia 2436 3494 0.41 0.59 Pressure ulcer, lower back 530 759 0.41 0.59 Subendocardial infarction, initial episode of care 1262 1793 0.41 0.59 Acute kidney failure with lesion of tubular nec... 945 1342 0.41 0.59 (Continued) ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:37 Table 8. Continued Sex Count Sex Prop. ICD Description F M F M Acute and subacute necrosis of liver 441 626 0.41 0.59 Hypertensive chronic kidney disease, unspecifie... 
1091 1539 0.41 0.59 Hemorrhage complicating a procedure 637 898 0.41 0.59 Cardiogenic shock 480 674 0.42 0.58 Aortic valve disorders 1069 1481 0.42 0.58 Polyneuropathy in diabetes 667 917 0.42 0.58 Other postoperative infection 503 683 0.42 0.58 Respiratory distress syndrome in newborn 559 755 0.43 0.57 Cardiac pacemaker in situ 592 798 0.43 0.57 Atrial fibrillation 5512 7379 0.43 0.57 Pulmonary collapse 931 1234 0.43 0.57 Delirium due to conditions classified elsewhere 622 823 0.43 0.57 Diabetes mellitus without mention of complicati... 3902 5156 0.43 0.57 Hemorrhage of gastrointestinal tract, unspecified 602 795 0.43 0.57 Other and unspecified coagulation defects 438 578 0.43 0.57 Acute kidney failure, unspecified 3941 5178 0.43 0.57 End stage renal disease 836 1090 0.43 0.57 Accidents occurring in residential institution 456 583 0.44 0.56 Single liveborn, born in hospital, delivered by... 1220 1538 0.44 0.56 Sepsis 563 709 0.44 0.56 Hyperosmolality and/or hypernatremia 1009 1263 0.44 0.56 Other specified surgical operations and procedu... 600 750 0.44 0.56 Severe sepsis 1746 2166 0.45 0.55 Unspecified protein-calorie malnutrition 562 697 0.45 0.55 Long-term (current) use of insulin 1138 1400 0.45 0.55 Long-term (current) use of anticoagulants 1709 2097 0.45 0.55 Other iatrogenic hypotension 953 1168 0.45 0.55 Anemia in chronic kidney disease 623 761 0.45 0.55 Intracerebral hemorrhage 618 749 0.45 0.55 Unspecified essential hypertension 9370 11333 0.45 0.55 Acute posthemorrhagic anemia 2072 2480 0.46 0.54 Unspecified septicemia 1702 2023 0.46 0.54 Chronic airway obstruction, not elsewhere class... 2027 2404 0.46 0.54 Pneumonia, organism unspecified 2223 2616 0.46 0.54 Septic shock 1189 1397 0.46 0.54 Other convulsions 892 1042 0.46 0.54 Other specified procedures as the cause of abno... 693 809 0.46 0.54 Diarrhea 484 565 0.46 0.54 Hematoma complicating a procedure 566 658 0.46 0.54 Acute respiratory failure 3473 4024 0.46 0.54 Other specified cardiac dysrhythmias 1137 1316 0.46 0.54 (Continued) ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:38 • J. R. Minot et al. Table 8. Continued Sex Count Sex Prop. ICD Description F M F M Need for prophylactic vaccination and inoculati... 2680 3099 0.46 0.54 Personal history of transient ischemic attack (... 498 574 0.46 0.54 Neonatal jaundice associated with preterm delivery 1052 1212 0.46 0.54 Observation for suspected infectious condition 2570 2949 0.47 0.53 Congestive heart failure, unspecified 6106 7005 0.47 0.53 Hypovolemia hyponatremia 641 733 0.47 0.53 Single liveborn, born in hospital, delivered wi... 1668 1898 0.47 0.53 Unspecified pleural effusion 1281 1453 0.47 0.53 Acidosis 2127 2401 0.47 0.53 Esophageal reflux 2990 3336 0.47 0.53 Encounter for palliative care 485 535 0.48 0.52 Hyposmolality and/or hyponatremia 1445 1594 0.48 0.52 Iron deficiency anemia secondary to blood loss ... 482 530 0.48 0.52 Hypoxemia 625 673 0.48 0.52 Mitral valve disorders 1416 1510 0.48 0.52 Primary apnea of newborn 506 537 0.49 0.51 Hypotension, unspecified 996 1055 0.49 0.51 Personal history of venous thrombosis and embolism 786 826 0.49 0.51 Obesity, unspecified 744 767 0.49 0.51 Intestinal infection due to Clostridium difficile 716 728 0.50 0.50 Obstructive chronic bronchitis with (acute) exa... 
598 600 0.50 0.50 Anemia of other chronic disease 550 543 0.50 0.50 Anemia, unspecified 2729 2677 0.50 0.50 Dehydration 704 681 0.51 0.49 Other chronic pulmonary heart diseases 1101 1047 0.51 0.49 Do not resuscitate status 694 633 0.52 0.48 Depressive disorder, not elsewhere classified 1888 1543 0.55 0.45 Morbid obesity 648 522 0.55 0.45 Iron deficiency anemia, unspecified 657 514 0.56 0.44 Chronic diastolic heart failure 708 532 0.57 0.43 Hypopotassemia 816 609 0.57 0.43 Anxiety state, unspecified 944 636 0.60 0.40 Dysthymic disorder 663 446 0.60 0.40 Asthma, unspecified type, unspecified 1317 878 0.60 0.40 Urinary tract infection, site not specified 4027 2528 0.61 0.39 Other persistent mental disorders due to condit... 698 428 0.62 0.38 Acute on chronic diastolic heart failure 779 441 0.64 0.36 Unspecified acquired hypothyroidism 3307 1610 0.67 0.33 Osteoporosis, unspecified 1637 310 0.84 0.16 Personal history of malignant neoplasm of breast 1259 18 0.99 0.01 ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:39 REFERENCES [1] David Ifeoluwa Adelani, Ali Davody, Thomas Kleinbauer, and Dietrich Klakow. 2020. Privacy guarantees for de-identifying text trans- formations. arXiv preprint arXiv:2008.03101 (2020). [2] Oras A. Alabas, Chris P. Gale, Marlous Hall, Mark J. Rutherford, Karolina Szummer, Sofia Sederholm Lawesson, Joakim Alfredsson, Bertil Lindahl, and Tomas Jernberg. 2017. Sex differences in treatments, relative survival, and excess mortality following acute myocardial infarction: National cohort study using the SWEDEHEART registry. Journal of the American Heart Association 6, 12 (2017), e007123. [3] Marcella Alsan, Owen Garrick, and Grant C. Graziani. 2018. Does Diversity Matter for Health? Experimental Evidence from Oakland. Technical Report. National Bureau of Economic Research. [4] Marcella Alsan and Marianne Wanamaker. 2018. Tuskegee and the health of black men. The Quarterly Journal of Economics 133, 1 (2018), 407–455. [5] Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly avail- able clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics, Minneapolis, Minnesota, USA, 72–78. https://doi.org/10.18653/v1/W19-1909 [6] Marion Bartl, Malvina Nissim, and Albert Gatt. 2020. Unmasking contextual stereotypes: Measuring and mitigating BERT’s gender bias. arXiv preprint arXiv:2010.14534 (2020). [7] Christine Basta, Marta R. Costa-Jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics, Florence, Italy, 33–39. https://doi.org/10.18653/v1/W19-3805 [8] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBert: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676 (2019). [9] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. 2017. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075 (2017). [10] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. arXiv preprint arXiv:2005.14050 (2020). [11] Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 
2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems. 4349–4357. [12] Rishi Bommasani, Kelly Davis, and Claire Cardie. 2020. Interpreting pretrained contextualized representations via reductions to static embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4758–4781. [13] Yang Trista Cao and Hal Daumé III. 2019. Toward gender-inclusive coreference resolution. arXiv preprint arXiv:1910.13913 (2019). [14] Davide Chicco, Niklas Tötsch, and Giuseppe Jurman. 2021. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Mining 14, 1 (2021), 1–22. [15] Erenay Dayanik and Sebastian Padó. 2020. Masking actor information leads to fairer political claims detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4385–4391. [16] Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krish- naram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 120–128. [17] Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. 2017. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association 24, 3 (2017), 596–606. [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). [19] Peter Sheridan Dodds, Joshua R. Minot, Michael V. Arnold, Thayer Alshaabi, Jane Lydia Adams, David Rushing Dewhurst, Tyler J. Gray, Morgan R. Frank, Andrew J. Reagan, and Christopher M. Danforth. 2020. Allotaxonometry and rank-turbulence divergence: A universal instrument for comparing complex systems. arXiv preprint arXiv:2002.09770 (2020). [20] Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 1–19. [21] Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512 (2019). [22] Yadan Fan, Serguei Pakhomov, Reed McEwan, Wendi Zhao, Elizabeth Lindemann, and Rui Zhang. 2019. Using word embeddings to expand terminology of dietary supplements on clinical notes. JAMIA Open 2, 2 (2019), 246–253. [23] Paul M. Galdas, Francine Cheater, and Paul Marshall. 2005. Men and health help-seeking behaviour: Literature review. Journal of Advanced Nursing 49, 6 (2005), 616–623. [24] Aparna Garimella, Carmen Banea, Dirk Hovy, and Rada Mihalcea. 2019. Women’s syntactic resilience and men’s grammatical luck: Gender-bias in part-of-speech tagging and dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Compu- tational Linguistics. 3493–3498. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:40 • J. R. Minot et al. [25] Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862 (2019). [26] Brad N. 
Greenwood, Seth Carnahan, and Laura Huang. 2018. Patient–physician gender concordance and increased mortality among female heart attack patients. Proceedings of the National Academy of Sciences 115, 34 (2018), 8569–8574. [27] Brad N. Greenwood, Rachel R. Hardeman, Laura Huang, and Aaron Sojourner. 2020. Physician–patient racial concordance and dispari- ties in birthing mortality for newborns. Proceedings of the National Academy of Sciences 117, 35 (2020), 21194–21200. [28] Nina Grgic-Hlaca, Muhammad Bilal Zafar, Krishna P. Gummadi, and Adrian Weller. 2016. The case for process fairness in learning: Feature selection for fair decision making. In NIPS Symposium on Machine Learning and the Law,Vol.1.2. [29] Revital Gross, Rob McNeill, Peter Davis, Roy Lay-Yee, Santosh Jatrana, and Peter Crampton. 2008. The association of gender concordance and primary care physicians’ perceptions of their patients. Women & Health 48, 2 (2008), 123–144. [30] Katarina Hamberg. 2008. Gender bias in medicine. Women’s Health 4, 3 (2008), 237–243. [31] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019). [32] Kexin Huang, Abhishek Singh, Sitong Chen, Edward T. Moseley, Chih-ying Deng, Naomi George, and Charlotta Lindvall. 2019. Clinical XLNet: Modeling sequential clinical notes and predicting prolonged mechanical ventilation. arXiv preprint arXiv:1912.11975 (2019). [33] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1 (2016), 1–9. [34] Faiza Khan Khattak, Serena Jeblee, Chloé Pou-Prom, Mohamed Abdalla, Christopher Meaney, and Frank Rudzicz. 2019. A survey of word embeddings for clinical text. Journal of Biomedical Informatics: X 4 (2019), 100057. [35] Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201 (2018). [36] Vishesh Kumar, Amber Stubbs, Stanley Shaw, and Özlem Uzuner. 2015. Creation of a new longitudinal corpus of clinical narratives. Journal of Biomedical Informatics 58 (2015), S6–S10. [37] Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W. Black, and Yulia Tsvetkov. 2019. Measuring bias in contextualized word representations. arXiv preprint arXiv:1906.07337 (2019). [38] Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. arXiv preprint arXiv:1703.06856 (2017). [39] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240. [40] Chi-Wei Lin, Meei-Ju Lin, Chin-Chen Wen, and Shao-Yin Chu. 2016. A word-count approach to analyze linguistic patterns in the reflective writings of medical students. Medical Education Online 21, 1 (2016), 29522. [41] Bo Liu. 2019. Anonymized BERT: An augmentation approach to the gendered pronoun resolution challenge. arXiv preprint arXiv:1905.01780 (2019). [42] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019). [43] Jyoti Malhotra, David Rotter, Jennifer Tsui, Adana A. M. 
Llanos, Bijal A. Balasubramanian, and Kitaw Demissie. 2017. Impact of patient– provider race, ethnicity, and gender concordance on cancer screening: Findings from Medical Expenditure Panel Survey. Cancer Epi- demiology and Prevention Biomarkers 26, 12 (2017), 1804–1811. [44] Thomas Manzini, Yao Chong Lim, Yulia Tsvetkov, and Alan W. Black. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. arXiv preprint arXiv:1904.04047 (2019). [45] Haggi Mazeh, Rebecca S. Sippel, and Herbert Chen. 2012. The role of gender in primary hyperparathyroidism: Same disease, different presentation. Annals of Surgical Oncology 19, 9 (2012), 2958–2962. [46] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635 (2019). [47] Michelle M. Mello and C. Jason Wang. 2020. Ethics and governance for digital disease surveillance. Science 368, 6494 (2020), 951–954. [48] Stephane M. Meystre, F. Jeffrey Friedlin, Brett R. South, Shuying Shen, and Matthew H. Samore. 2010. Automatic de-identification of textual documents in the electronic health record: A review of recent research. BMC Medical Research Methodology 10, 1 (2010), 1–16. [49] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). [50] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546 (2013). [51] Christoph Molnar, Giuseppe Casalicchio, and Bernd Bischl. 2020. Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges. arXiv:2010.09337 [stat.ML] [52] Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS, Vol. 5. Citeseer, 246– ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:41 [53] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. [54] Bethany Percha, Yuhao Zhang, Selen Bozkurt, Daniel Rubin, Russ B. Altman, and Curtis P. Langlotz. 2018. Expanding a radiology lexicon using contextual patterns in radiology reports. Journal of the American Medical Informatics Association 25, 6 (2018), 679–685. [55] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). [56] Thang M. Pham, Trung Bui, Long Mai, and Anh Nguyen. 2020. Out of order: How important is the sequential order of words in a sentence in natural language understanding tasks? arXiv preprint arXiv:2012.15180 (2020). [57] Víctor M. Prieto, Sergio Matos, Manuel Alvarez, Fidel Cacheda, and José Luís Oliveira. 2014. Twitter: A good place to detect health conditions. PloS One 9, 1 (2014), e86191. [58] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (2019). [59] Stephen Robertson. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation (2004). 
[60] Manuel Rodríguez-Martínez and Cristian C. Garzón-Alfonso. 2018. Twitter health surveillance (THS) system. In Proceedings of the IEEE International Conference on Big Data, Vol. 2018. NIH Public Access, 1647. [61] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327 (2020). [62] Marcel Salathé. 2018. Digital epidemiology: What is it, and where is it going? Life Sciences, Society and Policy 14, 1 (2018), 1–5. [63] Justin Sybrandt, Michael Shtutman, and Ilya Safro. 2017. Moliere: Automatic biomedical hypothesis generation system. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1633–1642. [64] Cayla R. Teal, Anne C. Gill, Alexander R. Green, and Sonia Crandall. 2012. Helping medical learners recognise and manage unconscious bias toward certain patient groups. Medical Education 46, 1 (2012), 80–88. [65] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems 33 (2020). [66] Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196 (2019). [67] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace’s transformers: State-of-the-art natural language processing. ArXiv abs/1910.03771 (2019). [68] Christopher C. Yang, Haodong Yang, Ling Jiang, and Mi Zhang. 2012. Social media mining for drug safety signal detection. In Proceedings of the 2012 International Workshop on Smart Health and Wellbeing. 33–40. [69] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541 (2018). [70] Haoran Zhang, Amy X. Lu, Mohamed Abdalla, Matthew McDermott, and Marzyeh Ghassemi. 2020. Hurtful words: Quantifying biases in clinical contextual word embeddings. In Proceedings of the ACM Conference on Health, Inference, and Learning. 110–120. [71] Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. arXiv preprint arXiv:1809.01496 (2018). Received 13 August 2021; revised 1 February 2022; accepted 9 March 2022 ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png ACM Transactions on Computing for Healthcare (HEALTH) Association for Computing Machinery

1 INTRODUCTION

Efficiently and accurately encoding patient information into medical records is a critical activity in healthcare. Electronic health records (EHRs) document symptoms, treatments, and other relevant histories—providing a consistent reference through disease progression, provider churn, and the passage of time. Free-form text fields, the unstructured natural language components of a health record, can be incredibly rich sources of patient information. With the proliferation of EHRs, these text fields have also been an increasingly valuable source of data for researchers conducting large-scale observational studies. The promise of EHR data does not come without apprehension, however, as the process of generating and analyzing text data is open to the influence of conscious and unconscious human bias. For example, health care providers entering information may have implicit or explicit demographic biases that ultimately become encoded in EHRs. Furthermore, language models that are often used to analyze clinical texts can encode broader societal biases [70]. As patient data and advanced language models increasingly come into contact, it is important to understand how existing biases may be perpetuated in modern-day healthcare algorithms.

In the healthcare context, many types of bias are worth considering. Race, gender, and socioeconomic status, among other attributes, all have the potential to introduce bias into the study and treatment of medical conditions. Bias may manifest in how patients are viewed, treated, and—most relevant here—documented. Due to ethical and legal considerations, as well as pragmatic constraints on data availability, we have focused the current research on gender bias.

There are many sources of algorithmic bias, along with multiple definitions of fairness in machine learning [46]. Bias in the data used for training algorithms can stem from imbalances in target classes, how specific features are measured, and historical forces leading certain classes to have longstanding, societal misrepresentation. Definitions of fairness include demographic parity, counterfactual fairness [38], and fairness through unawareness (FTU) [28]. In the current work, we use a more general measure that we refer to as potential bias in order to gauge the impact of our data augmentation technique. Potential bias is an assessment of bias under a sort of worst-case scenario, and provides a generalized measure independent of specific bias definitions. With our methods, we seek to provide human-interpretable insights on potential bias in the case of binary class data. Further, using the same measurement, we experiment with the application of an FTU-like data augmentation process (although the concept of FTU does not neatly translate to unstructured text data). Combined, these methods can identify fundamental bias in language usage and the potential bias resulting from the application of a given machine learning model.

We refer to two classes of algorithmic-bias evaluation: (a) intrinsic evaluation, for exploring semantic relationships within an embedding space, and (b) extrinsic evaluation, for determining downstream performance differences on extrinsic tasks (e.g., classification) [34]. In the medical context there are gender-specific elements that can influence treatment and care. However, in this same context there might also be uses of gender when it is not relevant. In this manuscript we aim to understand the latter through analysis of clinical notes.
There is growing interest in interpretable machine learning (IML) [51]. In the context of deep language models this can involve interrogating the functionality of specific model layers (e.g., BERTology [ 61]), or inves- tigating the impact of perturbations in data on outputs. This latter approach ties into the work outlined in this manuscript. Our use of the term ‘interpretable’ here mostly refers to a more general case where a given result can be interpreted by a human reviewer. For instance, our divergence-based measures highlight gendered terms in plain English with a clearly explained ranking methodology. While this conceptualization is complementary to IML, it does not necessarily fit cleanly within the field—we will mention explicitly when referring to an IML concept. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:3 1.1 Prior Work Gender bias in the field of medicine is a topic that must be viewed with nuance in light of the strong interaction between biological sex and health conditions. Medicine and gender bias interact in many ways—some of which are expected and desirable, whereas others may have uncertain or negative impacts in patient outcomes. Research has reported differences in the care and outcomes received by male and female patients for the same conditions. For example, given the same severity symptoms, men have higher treatment rates for conditions such as coronary artery disease, irritable bowel syndrome, and neck pain [30]. Women have higher treatment-adjusted excess mortality than men when receiving care for heart attacks [2]. Female patients treated by male physicians have higher mortality rates than when treated by female physicians—while male patients have similar mortality regardless of provider gender [26]. The rate of care-seeking behavior in men has been shown to be lower than women and has the potential to significantly affect health outcomes [ 23]. Some work has shown female providers have higher confidence in the truthfulness of female patients and resulting diagnoses when compared to male providers [29]. The concordance of patient and provider gender is also positively associated with rates of cancer screening [43]. Beyond gender, the mortality rate of black infants has been found to be lower when cared for by black physi- cians rather than their white counterparts [27]. Race and care-seeking behavior have also been shown to interact, with black patients more often seeking cardiovascular care from black providers than non-black providers [3]. It is important to note historical mistreatment and inequitable access when discussing racial disparities in health outcomes—for instance, the unethical Tuskegee Syphilis Study was found to lead to a 1.5-year decline in black male life expectancy through increased mistrust in the medical field after the exploitation of its participants was made public [4]. The gender of the healthcare practitioner can also impact EHR note characteristics that are subsequently quantified through language analysis tools. The writings of male and female medical students have been shown to have differences, with female students expressing more emotion and male students using less space [ 40]. More generally, some work has shown syntactic parsers generalize well for men and women when trained on data generated by women whereas training the tools on data from men leads to poor performance for texts written by women [24]. 
The ubiquity of text data along with advances in natural language processing (NLP) have led to a prolif- eration of text analysis in the medical realm. Researchers have used social media platforms for epidemiological research [57, 60, 62]—raising a separate set of ethical concerns [47]. NLP tools have been used to generate hy- potheses for biomedical research [63], detect adverse drug reactions from social media [68] classes, and expand the known lexicon around medical topics [22, 54]. There are numerous applications of text analysis in medicine beyond patient health records. While this manuscript does not directly address tasks outside of clinical notes, it is our hope that the research could be applied to other areas. It is because our methods are interpretable and based on gaining an empirical view of bias that we feel they could be a first resource in understanding bias beyond our example cases of gender in clinical texts. Our work leverages computational representations of statistically derived relationships between concepts, commonly known as word embedding models [52]. These real-valued vector representations of words facilitate comparative analyses of text data with machine learning methods. The generation of these vectors depends on the distributional hypothesis, which states that similar words are more likely to appear together within a given context. Ideally, word embeddings map semantically similar words to similar regions in the vector space—or ‘semantic space’ in this case. The choice of training dataset heavily impacts the qualities of the language model and resulting word embeddings. For instance, general purpose language models are often trained on Wikipedia and the Common Crawl collection of web pages (e.g., BERT [18], RoBERTa [42]). Training language models on text from specific domains often improves performance on tasks in those domains (see below). More recent, state-of-the-art word embeddings (e.g., ELMo [55], BERT [18], GPT-2 [58]) are generally ‘contextual’, where the ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:4 • J. R. Minot et al. vector representation of a word from the trained model is dependent on the context around the word. Older word embeddings, such as GloVe [53] and word2vec [49, 50], are ‘static’, where the output from the trained model is only dependent on the word of interest—with context still being central to the task of training the model. As medical text data are made increasingly accessible through EHRs, there has been a growing focus on devel- oping word embeddings tailored for the medical domain. The practice of publicly releasing pre-trained, domain- specific word embeddings is common across domains, and it can be especially helpful in medical contexts de- scribed using specialized vocabulary (and even manner of writing). SciBERT is trained on a random sample of over one million biomedical and computer science papers [8]. BioBERT similarly is trained on papers from PubMed abstracts and articles [39]. There are also pre-trained embeddings focused on tasks involving clinical notes. Clinical BERT [5, 31] is trained on clinical notes from the MIMIC-III dataset [33]. A similar approach was applied with the XLNet architecture, resulting in clinical XLNet [32]. These pre-trained embeddings perform better on domain-specific tasks related to the training data and procedure. The undesirable bias present in word embeddings has attracted growing attention in recent years. Bolukbasi et al. 
present evidence of gender bias in word2vec embeddings, along with proposing a method for removing bias from gender-neutral terms [11]. Contextual word embeddings (e.g., BERT) show gender-biases [6] that can have effects on downstream tasks, although these biases may present differently than those in static embeddings [ 7, 37]. Vig et al. investigate which model components (attention heads) are responsible for gender bias in transformer- based language models (GPT-2) [65]. A simple way to mitigate gender bias in word embeddings is to ‘swap’ gendered terms in training data when generating word embeddings [71]. Beutel et al. [9] develop an adversarial system for debiasing language models—in the process, relating the distribution of training data to its effects on properties of fairness in the adversarial system. Simple masking of names and pronouns may reduce bias and improve classification performance for certain language classification tasks [ 15]. Swapping names has been shown to be an effective data augmentation technique for decreasing gender bias in pronoun resolution tasks [ 41]. Simple scrubbing of names and pronouns has been used to reduce gender-biases in biographies [16]. Zhang et al. examine the gender and racial biases present in Clinical BERT, concluding that after fine-tuning “[the] baseline clinical BERT model becomes more confident in the gender of the note, and may have captured relationships between gender and medical conditions which exceed biological associations” [70]. Some of these techniques for bias detection and mitigation have been critiqued as merely capturing over-simplified dimensions of bias—with proper debiasing requiring more holistic evaluation [25]. Data augmentation has been used to improve classification performance and privacy of text data. Simple meth- ods include random swapping of words, random deletion, and random insertion [66]. More computationally ex- pensive methods may involve using language models to generate contextually accurate synonyms [35], or even running text through multiple rounds of machine translation (e.g., English text to French and back again) [69]. De-identification is perhaps the most common data augmentation task for clinical text. Methods may range from simple dictionary look-ups [48] to more advanced neural network approaches [17]. De-identification approaches may be too aggressive and limit the utility of the resulting data while also offering no formal privacy guarantee. The field of differential privacy [ 20] offers principled methods for adding noise to data, and some recent work has explored applying these principles to text data augmentation [1]. Applying data-augmentation techniques to pipelines that use contextual word-embeddings presents some additional uncertainty given the on-going na- ture of research working on establishing what these trained embeddings actually represent and how they use contextual clues (e.g., the impact of word order on downstream tasks [56]). In the present study, we explore the intersection of the bias that stems from language choices made by health- care providers and the bias encoded in word embeddings commonly used in the analysis of clinical text. We present interpretable methods for detecting and reducing bias present in text data with binary classes. Part of this work is investigating how orthogonal text relating to gender bias is to text related to clinically-relevant information. 
While we focus on gender bias in health records, this framework could be applied to other do- mains and other types of bias as well. In Section 2, we describe our data and methods for evaluating bias. In ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:5 Section 3, we present our results contrasting empirically observed bias in our sample data with bias encoded in word embeddings. Finally, in Section 4, we discus the implications of our work and potential avenues for future research. Our main contributions include the following: • We demonstrate a model-agnostic technique for identifying biases in language usage for clinical notes corresponding to female and male patients. We provide examples of words and phrases highlighted by our method for two clinical note datasets. This methodology could readily be applied to other demographic attributes of patients as well. • Continuing with the bias identification technique, we contrast the results from the model-agnostic bias de- tection with results from evaluating bias within a word-embedding space. We find that our model-agnostic method highlights domain- and dataset-specific terms, leading to more effective bias identification than when compared with results derived from language models. • We develop a data augmentation procedure to remove biased terms and present results demonstrating that this procedure has minimal effect on clinically relevant tasks. Our experiments show that removing words corresponding to 10% of the language-distribution divergence has little effect on condition classi- fication performance while largely reducing the gender signal in clinical notes. Further, the augmenta- tion procedure can be applied to a high volume of terms (for terms corresponding to up to 80% of total language-distribution divergence) with minimal degradation in performance for clinically relevant tasks. More broadly, our results demonstrate that transformers-based language models can be robust to high levels of data augmentation—as indicated by retention of relative performance on downstream tasks. Taken together, these contributions provide methods for bias identification that are readily interpreted by patients, providers, and healthcare informaticians. The bias measures are model-agnostic and dataset specific and can be applied upstream of any machine learning pipeline. This manuscript directly supports the blossoming field of ethical artificial intelligence in healthcare and beyond. Our methods could be helpful for evaluating the impact of demographic signals—beyond gender—present in text when developing machine learning models and workflows in healthcare. 2METHODS Here we outline our methods for identifying and removing gendered language and evaluating the impact of this data augmentation process. We also provide brief descriptions of the datasets for our case study. The bias evalu- ation techniques fall into two main categories. First, we make intrinsic evaluations of the language by looking at bias within word-embedding spaces and empirical word-frequency distributions of the datasets. The set of methods presented here enable the identification of biased language usage, a data augmentation approach for removing this language, and an example benchmark for evaluating the performance impacts on two biomedical datasets. 
Second, there are extrinsic evaluation tasks focused on comparing the performance of classifiers as we vary the level of data augmentation. For our dataset, this process involves testing health-condition and gender classifiers on augmented data. The extrinsic evaluation provides a measurement of potential bias and is meant to be similar to some real-world tasks that may utilize similar classifiers.

2.1 Bias Measurements

We define three different bias evaluation frameworks for our study. The first is a measure of empirical bias between language distributions for two classes observed in a given dataset. The second is a measure of intrinsic bias present in a word embedding space (as explored with a given corpus). Finally, the third measure addresses the potential extrinsic algorithmic bias present in a machine learning pipeline.

For our evaluations of intrinsic language bias in empirical data we use a divergence metric (details in Section 2.2), which is calculated between the language distributions of male and female patients. We use a straightforward notion of bias that rates n-grams as more biased when their divergence contribution is higher. In this case we are detecting data bias, which may in itself have multiple sources, such as measurement or sampling biases.

We evaluate intrinsic language bias in embedding spaces by calculating similarity scores between gendered word-embedding clusters and n-grams appearing in male and female patient notes (details in Section 2.4). Here we are focused on a measure of algorithmic bias as expressed via language from a specific dataset that is encoded with a given language model.

Finally, we evaluate extrinsic bias to test the effects of our data augmentation procedure. In this case we diverge from the largely established definitions of bias by using an evaluation framework that does not claim to detect explicit forms of bias for protected classes. Instead, we reframe our extrinsic measure as potential bias (PB), which we define generally as the capacity for a classifier to predict protected classes. Our PB measure does not equate directly to real-world biases, but we argue that it has utility in establishing a generalizable indication of the potential for bias that is task-independent. Stated another way, it is a measure of the signal present in a dataset and the capacity for a given machine learning algorithm to utilize this signal for biased predictions. In our case, we present PB as measured by the performance of a binary classifier trained to predict patient gender from text documents.

We recognize concerns raised by Blodgett et al. [10] and others relating to the imprecise definitions of bias in the field of algorithmic fairness. Indeed, it is often important to motivate a given case of bias by establishing its potential harms. In our case, we are limiting what we present to precursors to bias, and thus do not claim to be making robust assessments of real-world bias and subsequent impact.

2.2 Rank-divergence and Trimming

We parse clinical notes into n-grams—sequences of space-delimited strings such as words or sentence fragments—and generate corresponding frequency distributions. To quantify the bias of specific n-grams we compare their frequency of usage in text data corresponding to each of our two classes.
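To make this setup concrete, the following minimal sketch builds 1-gram rank distributions for the female and male note corpora. It is not the authors' released code; the file name, column names, and simple whitespace tokenization are illustrative assumptions standing in for the parsing described above.

```python
# Sketch (hypothetical, not the authors' code): build 1-gram rank distributions
# for two document classes, as input to the divergence analysis that follows.
from collections import Counter

import pandas as pd


def ngram_counts(texts, n=1):
    """Count space-delimited n-grams across a collection of documents."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(
            " ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


def rank_distribution(counts):
    """Map each n-gram to its rank (1 = most frequent); ties broken arbitrarily."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {gram: rank for rank, (gram, _) in enumerate(ordered, start=1)}


notes = pd.read_csv("notes_with_gender.csv")  # hypothetical input file
ranks_f = rank_distribution(ngram_counts(notes.loc[notes.GENDER == "F", "TEXT"]))
ranks_m = rank_distribution(ngram_counts(notes.loc[notes.GENDER == "M", "TEXT"]))
```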
The same procedure is extended as a data augmentation technique intended to remove biased language in a targeted and principled manner. More specifically, in our case study of gendered language in clinical notes we quantify the "genderedness" of n-grams by comparing their frequency of usage in notes for male and female patient populations.

For the task of comparing frequency of n-gram usage we use rank-turbulence divergence (RTD), as defined by Dodds et al. [19]. The rank-turbulence divergence between two sets, Ω_1 and Ω_2, is calculated as follows:

\[
D^{R}_{\alpha}(\Omega_1 \,\|\, \Omega_2) \;=\; \sum_{\tau} \delta D^{R}_{\alpha,\tau} \;=\; \frac{\alpha + 1}{\alpha} \sum_{\tau} \left| \frac{1}{r_{\tau,1}^{\alpha}} - \frac{1}{r_{\tau,2}^{\alpha}} \right|^{1/(\alpha + 1)}, \tag{1}
\]

where r_{τ,s} is the rank of element τ (n-grams, in our case) in system s, and α is a tunable parameter that adjusts the impact of starting and ending ranks. While other techniques could be used to compare the two n-gram frequency distributions, we found RTD to be robust to differences in the overall volume of n-grams for each patient population. For example, Figure 1 shows the RTD between 1-grams from clinical notes corresponding to female and male patients.

Fig. 1. Rank-turbulence divergence allotaxonograph [19] for male and female documents in the MIMIC-III dataset. For this figure, we generated 1-gram frequency and rank distributions from documents corresponding to male and female patients. Pronouns such as "she" and "he" are immediately apparent as drivers of divergence between the two corpora. From there, the histogram on the right highlights gendered language that is both common and medical in nature. Familial relations (e.g., "husband" and "daughter") often present as highly gendered according to our measure. Further, medical terms like "hysterectomy" and "scrotum" are also highly ranked in terms of their divergence. Higher divergence contribution values, δD^{R}_{α,τ}, are often driven by either relatively common words fluctuating between distributions (e.g., "daughter") or the presence of disjoint terms that appear in only one distribution (e.g., "hysterectomy"). The impact of higher rank values can be tuned by adjusting the α parameter. In the main horizontal bar chart, the bars indicate the divergence contribution value and the numbers next to the terms represent their rank in each corpus. Terms that appear in only one corpus are indicated with a rotated triangle. The three smaller vertical bars describe balances between the male and female corpora: 43% of total 1-gram counts appear in the female corpus; we observed 68.6% of all 1-grams in the female corpus; and 32.2% of the 1-grams in the female corpus are unique to that corpus.

A brief note on notation: τ always represents a unique element, or n-gram in our case. In certain contexts, τ may be an integer value that ultimately maps back to the element's string representation. This integer conversion is to allow for clean indexing—in these cases τ can be converted back to a string representation of the element with the array of element strings W.

We use the individual rank-turbulence divergence contribution, δD^{R}_{α,τ}, of each 1-gram to the gendered divergence, D^{R}_{α}(Ω_female ∥ Ω_male), to select which terms to remove from the clinical notes. First, we sort the 1-grams based on their rank-turbulence divergence contribution.
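A minimal sketch of Equation (1) and this sorting step is given below, assuming the hypothetical ranks_f and ranks_m dictionaries from the earlier sketch. It omits the normalization constant used in the reference allotaxonometry implementation and uses a simplified treatment of terms that appear in only one corpus, so it is an illustration rather than a faithful reimplementation.

```python
# Sketch of Equation (1): per-n-gram rank-turbulence divergence contributions.
# Disjoint terms are handled with a simplified "one past the end" rank rather
# than the tied-rank scheme used in the reference implementation.

def rtd_contributions(ranks_1, ranks_2, alpha=1 / 3):
    """Return {ngram: delta_D} for all n-grams in either rank distribution."""
    vocab = set(ranks_1) | set(ranks_2)
    missing_1 = len(ranks_1) + 1
    missing_2 = len(ranks_2) + 1
    prefactor = (alpha + 1) / alpha
    contributions = {}
    for gram in vocab:
        r1 = ranks_1.get(gram, missing_1)
        r2 = ranks_2.get(gram, missing_2)
        diff = abs(r1 ** -alpha - r2 ** -alpha)
        contributions[gram] = prefactor * diff ** (1 / (alpha + 1))
    return contributions


delta_d = rtd_contributions(ranks_f, ranks_m, alpha=1 / 3)
# Sort 1-grams from largest to smallest divergence contribution.
sorted_grams = sorted(delta_d, key=delta_d.get, reverse=True)
```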
Next, we calculate the cumulative proportion of the over- all rank-turbulence divergence, RC , that is accounted for as we iterate through the sorted 1-gram list from words with the highest contribution to the least contribution (in this case, terms like “she” and “gentleman” will tend to have a greater contribution). Finally, we set logarithmically spaced thresholds of cumulative rank-divergence values to select which 1-grams to trim. The method allows us to select sets of 1-grams that contribute the most to the rank-divergence values (measured as divergence per 1-gram). Figure 2 provides a graphical overview of this procedure. Using this selection criteria, we are able to remove the least number of 1-grams per a given amount of rank- turbulence divergence removed from the clinical notes. The number of unique 1-grams removed per cumulative amount of rank-turbulence divergence grows super linearly as seen in Figure 8(I). This results in relatively stable distributions of document lengths for lower trim values (10–30%), although at higher trim values the procedure drastically shrinks the size of many documents (Figure 8(A–H)). ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:8 • J. R. Minot et al. Fig. 2. Overview of the rank-turbulence divergence trimming procedure. Solid lines indicate steps that are specific to our trimming procedure and evaluation process. The pipeline starts with a repository of patient records that include clinical notes and class labels (in our case gender and ICD9 codes). From these notes we generate n-gram rank distributions for the female and male patient populations, which are then used to calculate the rank-turbulence divergence (RTD) for individual n-grams. Sorting the n-grams based on RTD contribution, we then trim the clinical notes. Finally, we view the results directly from the RTD calculation to review imbalance in language use. With the trimmed documents we compare the performance of classifiers on both the un-trimmed notes and notes with varying levels of trimming applied. To implement this trimming procedure, we use regular expressions to replace the 1-grams we have identified for removal with a space character. We found that using 1-grams as the basis for our trimming procedure is both effective and straightforward to implement. Generally, if higher order n-grams (e.g., 2-grams) are determined to be biased, the constituent 1-grams are also detected by the RTD metric. Our string removal procedure is applied to the overall corpus of data, upstream of any train-test dataset generation for specific classification tasks. Other potential string replacement strategies include redaction with a generic token or randomly swapping n-grams that appear within the same category across the corpus [1]. The RTD method we use could also be adapted for use with these and other replacement strategies. We chose string removal because of its simplicity and prioritization of the de-biasing task over preserving semantic structure (i.e., it presents an extreme case of data augmentation). The pipeline’s performance on downstream tasks provides some indication of the seman- tic information retained, and as we show in Section 3 it is possible retain meaningful signals while pursuing relatively aggressive string removal. 2.3 Language Models Large language models are increasingly common in many NLP tasks, and we feel it is important to present our results in the context of a pipeline that utilizes these models. 
Furthermore, language models have the potential to encode bias, and we found it necessary to contrast our empirical PB detection methods with bias metrics calculated on general purpose and domain-adapted language models.

We use pre-trained BERT-base [18] and Clinical BERT [5] word embeddings. BERT provides a contextual word embedding trained on "general" language, whereas Clinical BERT builds on these embeddings by utilizing transfer learning to improve performance on scientific and clinical texts. All models were implemented in PyTorch using the Transformers library [67]. For tasks such as nearest-neighbor classification and gender-similarity scoring, we use the off-the-shelf weights for BERT and Clinical BERT (see Figure 3 for an example of the n2c2 embedding space). These models were then fine-tuned on the gender and health-condition classification tasks. In cases where we fine-tuned the model, we added a final linear layer to the network. All classification tasks were binary with a categorical cross-entropy loss function. All models were run with a maximum sequence length of 512, batch size of 4, and gradient accumulation steps set to 12. We considered various methods for handling documents longer than the maximum sequence length (see variable-length note embedding in the SI), but ultimately the performance gains did not merit further use.

Fig. 3. A tSNE embedding of n2c2 document vectors generated using a pre-trained version of BERT with off-the-shelf weights. We observe the appearance of gendered clusters even before training for a gender classification task. See Figure 17 for the same visualization but with Clinical BERT embeddings.

We also run a nearest-neighbor classifier on the document embeddings produced by the off-the-shelf BERT-base and Clinical BERT models. This is intended to be a point of comparison when evaluating the potential bias present within the embedding space, as indicated by performance on extrinsic classification tasks.

In addition to the BERT-based language models, we used a simple term frequency-inverse document frequency (TFIDF) [59] based classification model as a point of comparison. For this model, we fit a TFIDF vectorizer to our training data and use logistic regression for binary classification.

For classification performance metrics we report the Matthews correlation coefficient (MCC) and receiver operating characteristic (ROC) curves. We primarily use MCC values for ease of presentation and because of the balanced nature of the measurement even in the face of class imbalances [14]. MCC is calculated as follows:

\[
\mathrm{MCC} \;=\; \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}, \tag{2}
\]

where TP, TN, FP, and FN are counts of true positives, true negatives, false positives, and false negatives, respectively. MCC ranges between −1 and 1, with 1 indicating the best performance and 0 indicating performance no better than random guessing. The measure is often described as the correlation between observed and predicted labels.

2.4 Gender Distances in Word Embeddings

Using a pre-trained BERT model, we embed all the 1-grams present in the clinical note datasets. For this task, we retain the full-length vector for each 1-gram, taking the average in cases where additional tokens are created by the tokenizer. The results of this process are 1×768 vectors for each n-gram.
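As an illustration of this embedding step, a hedged sketch using the Hugging Face Transformers API follows. The model name and the embed_1gram helper are assumptions for illustration, not the authors' code.

```python
# Sketch (hypothetical): embed isolated 1-grams with an off-the-shelf BERT model,
# averaging sub-word token vectors so every 1-gram maps to a single 1x768 vector.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption; a Clinical BERT checkpoint could be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def embed_1gram(word: str) -> torch.Tensor:
    """Return a 768-dim vector for a 1-gram, averaged over its sub-word tokens."""
    inputs = tokenizer(word, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, n_subtokens, 768)
    return hidden.mean(dim=1).squeeze(0)


vec = embed_1gram("hysterectomy")
print(vec.shape)  # torch.Size([768])
```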
We also calculate the average embedding for a collection of terms manually selected to constitute 'gender clusters'. From these gender clusters, we calculate the cosine similarity to each of the embeddings for n-grams in the Zipf distribution.

Using measures such as cosine similarity with BERT raises some concerns, especially when looking at absolute values. BERT was not trained on the task of calculating sentence or document similarities. With BERT-base, all dimensions are weighted equally, which when applying cosine similarity can result in somewhat arbitrary absolute values. As a workaround, we believe that using the ranked value of word and document embeddings can produce more meaningful results (if we do not wish to fine-tune BERT on the task of sentence similarity). We use both absolute values and ranks of cosine similarity when investigating bias in BERT-based language models—finding the absolute values of cosine similarity to be meaningful in our relatively coarse-grained analysis. Further, taking the difference in cosine similarities for each gendered cluster addresses some of the drawbacks of examining cosine similarity values in pre-trained models.

ALGORITHM 1: RTD trimming procedure
Input: Documents D_i, i = 1, ..., N
Input: 1-gram rank distributions for each class Ω_ψ, ψ = 1, 2
Output: Trimmed text data C_i^(k), i = 1, ..., N; k ∈ (0, 1]
1: δD^R_{α,τ}, W_τ ← RTD_calc(Ω_1, Ω_2, α), τ = 1, ..., M    ▷ δD^R_{α,τ} is the RTD contribution for n-gram W_τ; both sorted by RTD contribution
2: RC ← cumsum(δD^R_{α,τ})
3: for k = 0.1, 0.2, ..., 0.9 do
4:     r_b ← max(where(RC ≤ k))    ▷ index up to bin max
5:     S ← W_{0:r_b}
6:     for i = 1, ..., N do
7:         C_i^(k) ← strip(D_i, S)    ▷ remove 1-grams from doc.
8:     end for
9: end for

Generating word or phrase embeddings from contextual language models raises some challenges in terms of calculating accurate embedding values. In many cases, the word embedding for a given 1-gram—produced by the final layer of a model such as BERT—can vary significantly depending on context [21]. Some researchers have proposed converting contextual embeddings to static embeddings to address this challenge [12]. Others have presented methods for creating template sentences and comparing the relative probability of masked tokens for target terms [37]. After experimenting with the template approach, we determined that the resulting embeddings were not different enough to merit switching away from the simple isolated 1-gram embeddings.

2.5 Rank-turbulence Divergence for Embeddings and Documents

We use rank-turbulence divergence in order to compare the bias encoded in word embeddings and empirical data. For word embeddings, we need to devise a metric for bias—here we use cosine similarity between biased clusters and candidate n-grams. The bias in the empirical data is evaluated using RTD for word-frequency distributions corresponding to two labeled classes.

In terms of the clinical text data, for the word embeddings we use cosine similarity scores to evaluate bias relative to known gendered n-grams. For the clinical note datasets (text from documents with gender labels), we use rank-turbulence divergence calculated between the male and female patient populations. To evaluate bias in the embedding space, we rely on similarity scores relative to known gendered language.
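A minimal sketch of this similarity scoring follows (the step-by-step description continues below). The cluster terms shown are illustrative stand-ins—the clusters actually used are listed in Table 4—and the code reuses the hypothetical embed_1gram helper from the previous sketch.

```python
# Sketch (hypothetical cluster terms): score each 1-gram by cosine similarity to
# the average embedding of manually chosen female and male clusters, then rank
# 1-grams by the difference between the two similarities.
import torch
import torch.nn.functional as F

female_terms = ["she", "her", "woman", "female"]  # illustrative only
male_terms = ["he", "his", "man", "male"]         # illustrative only

female_centroid = torch.stack([embed_1gram(w) for w in female_terms]).mean(dim=0)
male_centroid = torch.stack([embed_1gram(w) for w in male_terms]).mean(dim=0)


def gender_similarity(word: str):
    """Cosine similarity of a 1-gram to each gender-cluster centroid."""
    v = embed_1gram(word)
    sim_f = F.cosine_similarity(v, female_centroid, dim=0).item()
    sim_m = F.cosine_similarity(v, male_centroid, dim=0).item()
    return sim_f, sim_m


scores = {w: gender_similarity(w) for w in ["daughter", "husband", "hysterectomy"]}
# Rank 1-grams by how much more female- than male-similar they are.
ranked = sorted(scores, key=lambda w: scores[w][0] - scores[w][1], reverse=True)
```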
First, we create two gendered clusters of 1-grams—these clusters represent words that are manually determined to have inherent connotations relating to female and male genders. Next, we calculate the cosine similarity between the word embeddings for all 1-grams appearing in the empirical data and the average vector for each of the two gendered clusters. Finally, we rank each 1-gram based on the distribution of cosine similarity scores for the male and female clusters.

For the empirical data, we calculate the RTD for 1-grams appearing in the clinical note datasets. The RTD value provides an indication of the bias—as indicated by differences in specific term frequency—present in the clinical notes.

Combined, these steps provide ranks for each 1-gram in terms of how much it differentiates the male and female clinical notes. Here again we can use the highly flexible rank-turbulence divergence measure to identify where there is 'disagreement' between the ranks returned by evaluating the embedding space and the ranks from the empirical distribution. This is a divergence-of-divergence measure, using the iterative application of rank-turbulence divergence to compare two different measures of rank. Going forward, we refer to this measure as RTD². RTD² provides an indication of which n-grams are likely to be reported as less gendered in either the embedding space or in the empirical evaluation of the documents. For our purposes, RTD² is especially useful for highlighting n-grams that embedding-based debiasing techniques may rank as minimally biased, despite the empirical distribution suggesting otherwise.

Table 1. Patient Sex Ratios for the Top 10 Conditions in MIMIC-III

ICD Description              N_f     N_m     N_f / N_total   N_m / N_total
Acute kidney failure         3941    5178    0.43            0.57
Acute respiratory failure    3473    4024    0.46            0.54
Atrial fibrillation          5512    7379    0.43            0.57
Congestive heart failure     6106    7005    0.47            0.53
Coronary atherosclerosis     4322    8107    0.35            0.65
Diabetes mellitus            3902    5156    0.43            0.57
Esophageal reflux            2990    3336    0.47            0.53
Essential hypertension       9370    11333   0.45            0.55
Hyperlipidemia               3537    5153    0.41            0.59
Urinary tract infection      4027    2528    0.61            0.39

For most health conditions there is an imbalance in the gender ratio between male and female patients. This reflects an overall bias in the MIMIC-III dataset, which has more male patients.

2.6 Data

We use two open source datasets for our experiments: the n2c2 (formerly i2b2) 2014 deidentification challenge [36] and the MIMIC-III critical care database [33]. The n2c2 data comprises around 1,300 documents with no gender or health condition coding (we generate our own labels for the former). MIMIC-III is a collection of diagnoses, procedures, interventions, and doctors' notes for 46,520 patients that passed through an intensive care unit. There are 26,121 males and 20,399 females in the dataset, with over 2 million individual documents. MIMIC-III includes coding of health conditions with International Classification of Diseases (ICD-9) codes, as well as patient sex.

For MIMIC-III, we focus our health-condition classification experiments on records corresponding to patients with at least one of the top 10 most prevalent ICD-9 codes.
We restrict our sample population to those patients with at least one of the 10 most common health conditions—randomly drawing negative samples from this subset for each condition classification experiment. Rates of coincidence vary between 0.65 and 0.13 (Figure 9). All but one of the top 10 health conditions have more male than female patients (Table 1). As a point of reference, we also present summary results for records corresponding to patients with ICD-9 codes that appear at least 1,000 times in the MIMIC-III data (Table 8). 2.7 Text Pre-processing Before analyzing or running the data through our models, we apply a simple pre-processing procedure to the text fields of the n2c2 and MIMIC-III data sets. We remove numerical values, ranges, and dates from the text. This is done in an effort to limit confounding factors related to specific values and gender (e.g., higher weights ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:12 • J. R. Minot et al. and male populations). We also strip some characters and convert common abbreviations. See Section A.1 for information on note selection. 3RESULTS Here we present the results of applying empirical bias detection and potential bias mitigation methods. Using rank-turbulence divergence (RTD), we rank n-grams based on their contribution to bias between two classes. Next, we apply a data augmentation procedure where we remove 1-grams based on their ranking in the RTD results. The impact of the data augmentation process is measured by tracking classification performance as we apply increasingly aggressive 1-gram trimming to our clinical note datasets. Finally, we compare the bias present in the BERT embedding space with the empirical bias we detect in the case study datasets. One of our classification tasks is predicting patient gender from EHR notes. We include gender classification as a synthetic test that is meant to directly indicate the gender signal present in the data. Gender classification is an unrealistic task that we would not expect to see in real-world applications, but serves as an extreme case that provides insight on the potential for other classifiers to incorporate gender information (potential bias). We present results for both the n2c2 and MIMIC-III datasets. The n2c2 dataset provides a smaller dataset with more homogeneous documents and serves as a reference point for the tasks outlined here. MIMIC-III is much larger and its explicit coding of health conditions allows us to bring the extrinsic task of condition classification into our evaluation of bias and data augmentation. 3.1 Gender Divergence Interpretability is a key facet of our approach to empirical bias detection. To gain an understanding of biased language usage we start by presenting the ranks of RTD values for individual n-grams in text data corresponding to each of the binary classes. The allotaxonographs we use to present this information (e.g., Figure 1)showboth the RTD values and a 2-d rank-rank histogram for n-grams in each class. The rank-rank histogram (Figure 1 left) is useful for evaluating how the word-frequency distributions (Figure 1 right) are similar or disjoint among the two classes, and in the process visually inspecting the fit of the tunable parameter α, which modulates the impact of lowly-ranked n-grams. See Figures 12, 13, 14,and 15 for additional allotaxonographs, including 2- and 3-grams. 
In the case of our gender classes in the medical data, we find the rank distributions to be more similar than disjoint and visually confirm that α = 1/3 is an acceptable setting (by examining the relation between contour lines and the rank-rank distribution).

More specifically, in our case study gendered language is highlighted by calculating the RTD values for male and female patient notes. We present results from applying our RTD method to 1-grams in the unmodified MIMIC-III dataset in Figure 1. Unsurprisingly, gendered pronouns appear as the greatest contribution to RTD between the two corpora. Further, 1-grams regarding social characteristics such as "husband", "wife", and "daughter" and medically relevant terms relating to sex-specific or sex-biased conditions such as "hysterectomy", "scrotal", and "parathyroid" are also highlighted.

Some of these terms may be obvious to readers—suggesting the effectiveness of this approach at capturing intuitive differences. Upon deeper investigation, 1-grams such as "husband" and "wife" often appear in reference to a patient's spouse providing information or social histories. The reasons for "daughter" appearing more commonly in female patient notes are varied, but appear to be related to higher relative rates of daughters providing information for their mothers. However, the identification and ranking of other n-grams in terms of gendered bias requires examination of a given dataset—perhaps indicating unintuitive relationships between terms and gendered language, or potentially indicating overfitting of this approach to specific datasets. For instance, "parathyroid" likely refers to hypoparathyroidism, which is not a sex-specific condition, but rather a sex-biased condition with a ratio of 3.3:1 for female to male diagnoses. Further, men are more likely to present asymptomatically and may then be less likely to be diagnosed in an ICU setting [45].

Table 2. Matthews Correlation Coefficient for Gender Classification Task on n2c2 Dataset

                        BERT                           Clinical BERT
Model                   Gendered notes   No-gend.      Gendered notes   No-gend.
Nearest neighbor        0.69             *             0.44             *
1 Epoch                 0.94             −0.06         0.92             0.00
10 Epochs               *                0.88          *                0.56

BERT and Clinical BERT based models were run on the manually generated "no gender" test dataset (common pronouns, etc. have been removed). The nearest neighbor model uses off-the-shelf models to create document embeddings, while the models run for 1 and 10 epochs were fine-tuned.

The application of RTD produces, in a principled fashion, a list of target terms to remove during the debiasing process—automating the selection of biased n-grams and tailoring results to a specific dataset. Using the same RTD results from above, we apply our trimming procedure—augmenting the text by iteratively removing the most biased 1-grams. For instance, in the MIMIC-III data the top 268 1-grams account for 10% of the total RTD identified by our method—and these are the first words we trim.

3.2 Gender Classification

As an extrinsic evaluation of our biased-language removal process, we present performance results for classifiers predicting membership in the two classes that we obscure through the data augmentation process.
We posit that the performance of a classifier in this case is an important metric when determining if the data augmentation was successful in removing bias signals and thus potential bias from the classification pipeline. The performance of the classifier is analogous to a real-world application under an extreme case where we are trying to predict the protected class. We evaluate the performance of a binary gender classifier based on BERT and Clinical BERT language models. As a starting point, we investigate the performance of a basic nearest neighbors classifier running on docu- ment embeddings produced with off-the-shelf language models. The classification performance of the nearest- neighbor classifier is far better than random and speaks to the embedding space’s local clustering by gendered words in these datasets, suggesting that gender may be a major component of the words embedded within this representation space. The tendency for BERT, and to a lesser extent Clinical BERT, to encode gender informa- tion can be seen in the tSNE visualization of these document embeddings (Figures 3 and 17). As seen here and in other results, Clinical BERT exhibits less potential gender-bias according to our metrics. We leave a more in- depth comparison of gender-bias in BERT and Clinical BERT to future work, but it is worth noting that different embeddings appear to have different levels of potential gender-bias. Further, clinical text data may be more or less gender-biased than everyday text. The performance of the BERT-based nearest neighbor classifier on the gender classification task is notable (Matthews correlation coefficient of 0.69), given the language models were not fine-tuned (Table 2). Using Clinical BERT embeddings result in an MCC of 0.44 for the nearest neighbor classifier—with Clinical BERT generally performing slightly worse on gender classification tasks. As a point of comparison, we attempt a naive approach to removing gender bias through data augmentation that involves trimming a manually selected group of 20 words. When we run our complete BERT classifier, with fine tuning, for 1 epoch we find that the MCC drops from 0.94 to -0.06 when we trim the manually selected words. This patterns holds up for Clinical BERT as well. However, if we extend the training run to 10 epochs, we find that most of the classification performance is recovered. This suggests that although the manually selected terms may have some of the most prominent gender signals, removing them by no means prevents the models from learning other indicators of gender. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:14 • J. R. Minot et al. Fig. 4. Patient condition and gender classification performance. Using the fine-tuned Clinical BERT based model on the MIMIC dataset. (A) Proportion of baseline classification performance removed after minimum-trim level (1% of total RTD) is applied to the documents. (B) Same as (A) but with maximum trimming applied (70% of total RTD). Of all the classification tasks, ‘gender’ and ‘Urinary tra.’ experience the greatest relative decrease in classification performance. However, due to the low baseline performance of Urinary (≈ 0.2), the gender classification task has a notably higher absolute reduction in MCC than Urinary tra. (or any other task). It is worth noting under low levels of trimming MCC values slightly improved in individual trials. 
Further, under maximum trim levels the gender classification MCC was slightly negative. See Figure 5 for full information on MCC scores for each of the health conditions.

On the MIMIC-III dataset we find gender classification to be generally accurate. With no gender trimming applied, MCC values are greater than 0.9 for both BERT and Clinical BERT classifiers. This performance is quickly degraded as we employ our trimming method (Figure 5(K)). When we remove 1-grams accounting for the first 10% of the RTD, we find an MCC value of approximately 0.2 for the gender classification task. The removal of the initial 10% of rank-divergence contributions has the most impact in terms of classification performance. Further trimming does not reduce the performance as much until 1-grams accounting for nearly 80% of the rank-turbulence divergence are removed. At this point, the classifier is effectively random, with an MCC of approximately 0. Taken together, these results point to a reduction in potential bias through our trimming procedure.

The large drop in performance for gender classification is in contrast to that of most health conditions (Figure 4(B)). On the health condition classification task, most trim values result in negligible drops in classification performance.

3.3 Condition Classification

To evaluate the impact of the bias removal process, we track the performance of classification tasks that are not explicitly linked to the two classes we are trying to protect. Under varying levels of data augmentation we train and test multiple classification pipelines and report any degradation in performance. These tasks are meant to be analogous to real-world applications in our domain that would require the maintenance of clinically-relevant information from the text—although we make no effort to achieve state-of-the-art results (see Table 3 for baseline condition classification performance).

Table 3. Clinical BERT Performance on Top 10 ICD9 Codes in the MIMIC-III Dataset

ICD9 Description             ICD9 Code   MCC
Diabetes mellitus            25000       0.53
Hyperlipidemia               2724        0.46
Essential hypertension       4019        0.41
Coronary atherosclerosis     41401       0.67
Atrial fibrillation          42731       0.53
Congestive heart failure     4280        0.51
Acute respiratory failure    51881       0.43
Esophageal reflux            53081       0.43
Acute kidney failure         5849        0.29
Urinary tract infection      5990        0.23

In the specific context of our case study, we train health-condition classifiers that produce modest performance on the MIMIC-III dataset. This performance is suitable for our purposes of evaluating the degradation in performance on the extrinsic task, relative to our trimming procedure.

In the case of each health condition, we find that relative classification performance is minimally affected by the trimming procedure. For instance, the classifier for atrial fibrillation results in an MCC value of around 0.48 for the male patients (Figure 5(C)) in the test set when no trimming is applied. When the minimal level of trimming is applied (10% of RTD removed), the MCC for the males is largely unchanged, resulting in an MCC of 0.48. This largely holds true for most of the trimming levels, across the 10 conditions we evaluate in-depth. For 6 out of 10 conditions, we find that words accounting for approximately 80% of the gender RTD need to be removed before there is a noteworthy degradation of classification performance.
At the 80% trim level, the gender classification task has an MCC value of approximately 0, while many other conditions maintain some predictive power. Comparing the relative degradation in performance, we see that the proportion of MCC lost between no- and maximum-trim between 0.05 and 0.4 for most conditions (Figure 4(B)). The only condition with full loss of predictive power is for urinary tract infections, which one might also speculate to be related to the anatomical differences in presentation of UTIs between biological sexes. Although, this task also proved the most challenging and had the worst starting (no-trim) performance (MCC≈ 0.2). The above results suggest that, for the conditions we examined, performance for medically relevant tasks can be preserved while reducing performance on gender classification. There is the chance that the trimming procedure may result in biased preservation of condition classification task performance. To investigate this we present results from a lightweight, TF-IDF based classifier for 123 health conditions. We find that when we trim the top 50% of RTD that classifiers for most conditions are relatively unaffected (Figure 6). For those conditions that do experience shifts in classification performance, any gender imbalance appears attributable related to the background gender distribution in the dataset. 3.4 Gender Distance To connect the empirical data with the language models, we embed n-grams from our case study datasets and evaluate their intrinsic bias within the word-embedding space. These language models have the same model- architectures that we (and many others) use when building NLP pipelines for classification and other tasks. Bias measures based on the word-embedding space are meant to provide some indication of how debiasing techniques that are more language model-centric would operate (and what specific n-grams they may highlight)—keeping with our theme of interpretability while contrasting these two approaches. In the context of our case study, we connect empirical data with word embeddings by presenting the dis- tributions of cosine similarity scores for 1-grams relative to gendered clusters in the embedding space. Cosine similarity scores are calculated for all 1-grams relative to clusters representing both female and male clusters (defined by 1-grams in Table 4). In our results we use both the maximum cosine similarity value relative to these clusters (i.e., the score calculated against either the female or male cluster) as well as differences in the scores for each 1-gram relative both female and male clusters. Looking at the distributions of maximum cosine ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:16 • J. R. Minot et al. Fig. 5. Matthews correlation coefficient (MCC) for classification results of health conditions and patient gender with varying trim levels. Results were produced with clinicalBERT embeddings and no-token n-gram trimming. (A)−(J) show MCC for the top 10 ICD9 codes present in the MIMIC data set. (K) shows MCC for gender classification on the same population. (L) presents a comparison of MCC results for data with no trimming and the maximum trimming level applied. Values are the relative MCC, or the proportion of the best classifiers performance we lose when applying the maximum rank-turbulence divergence trimming to the data. 
Here we see the relatively small effect of gender-based rank divergence trimming on the condition classification tasks for most conditions. The performance on the gender classification task is significantly degraded, even at modest trim levels, and is effectively no better than random guessing at our maximum trim level. It is worth noting that many conditions are stable for most of the trimming thresholds, although we do start to see more consistent degradation of performance at the maximum trim level for a few conditions. similarity scores for 1-grams appearing in both the n2c2 dataset (Figure 16) and the MIMIC-III dataset (Figure 7(B)), we observed a bimodal distribution of values. In both figures, a cluster with a mean around 0.9 is apparent as well as a cluster with a mean around 0.6. Through manual review of the 1-grams, we find that the cluster around 0.9 is largely comprised of more common, conversational English words whereas the cluster around 0.6 is largely comprised of medical terms. While there are more unique 1-grams in the cluster of medical terms, the overall volume of word occurrences is far higher for the conversational cluster. Referencing the cosine similarity clusters against the rank-turbulence divergence scores for the two data sets, we find that a high volume of individual 1-grams that trimmed are present in the conversational cluster. However, the number of unique terms there are removed for lower trim-values are spread throughout the cosine-similarity gender distribution. For instance, when trimming the first 1% of RTD, we find that terms selected are more con- versational cluster and more technical cluster (Figure 7(E)), with the former accounting for far more of the total volume of terms removed. The total volume of 1-grams is skewed towards the conversational cluster with terms ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:17 Fig. 6. Degradation in performance for Matthews correlation coefficient for condition classification of ICD9 codes with at least 1,000 patients. The performance degradation is presented relative to the proportion of the patients with that code who are female. We find little correlation between the efficacy of the condition classifier on highly augmented (trimmed) datasets and the gender balance for patients with that condition (coefficient of determination R = −2.48). Values are calculated for TF-IDF based classifier and include the top 10 health conditions we evaluate elsewhere. that have higher gender similarity (Figure 7(G)). The fact that the terms selected for early stages of trimming appear across the distribution of cosine similarity values illustrates the benefits of our empirical method, which is capable of selecting terms specific to a given dataset without relying on information contained in a language model. The contrast between the RTD selection criteria and the bias present in the language model helps explain why performance on the condition classification task is minimally impacted even when a high volume of 1-grams are removed—with RTD selecting only the most empirically biased terms. Using RTD-trimming, there is a middle ground between obscuring gender and barely preserving performance on condition classifications—some of the more nuanced language can be retained using our method. 
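The comparison in this subsection can be sketched by combining the earlier hypothetical helpers: select the 1-grams covering the first 1% of total RTD and examine where they fall in the embedding-based gender-similarity distribution. This is an illustrative sketch, not the analysis code behind Figure 7.

```python
# Illustrative sketch: where do RTD-selected 1-grams land in the distribution of
# maximum gender-cluster cosine similarity? Reuses the hypothetical delta_d,
# sorted_grams, and gender_similarity() helpers defined in earlier sketches.
import numpy as np

# 1-grams accounting for the first 1% of total rank-turbulence divergence.
total_rtd = sum(delta_d.values())
trim_set, cum = set(), 0.0
for gram in sorted_grams:
    cum += delta_d[gram]
    if cum > 0.01 * total_rtd:
        break
    trim_set.add(gram)

# Maximum similarity to either gender cluster, over a manageable subsample
# taken from the head of the divergence ranking (for speed).
sample = sorted_grams[:2000]
max_sim = {g: max(gender_similarity(g)) for g in sample}

trimmed_scores = np.array([s for g, s in max_sim.items() if g in trim_set])
other_scores = np.array([s for g, s in max_sim.items() if g not in trim_set])
print(f"median max-similarity, trimmed terms: {np.median(trimmed_scores):.2f}")
print(f"median max-similarity, other terms:   {np.median(other_scores):.2f}")
```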
3.5 Comparison of Language Model and Empirical Bias

Finally, we identify n-grams that are more biased in either the language model or in the empirical data, using RTD to divert attention away from n-grams that appear to exhibit similar levels of bias in both contexts. Put more specifically, the first application of RTD—on the empirical data and on the word embeddings—ranks n-grams by how male- or female-biased they are. The second application, the divergence-of-divergence (RTD²), ranks n-grams in terms of where there is most disagreement between the two bias detection approaches.

Fig. 7. Measures of gender bias in BERT word embeddings. (A) tSNE visualization of the BERT embedding space, colored by the maximum cosine similarity of MIMIC-III 1-grams to either male or female gendered clusters. (B) Distribution of the maximum cosine similarity between male or female gender clusters for 163,539 1-grams appearing in the MIMIC-III corpus. Through manual inspection we find that the two clusters of cosine similarity values loosely represent more conversational English (around 0.87) and more technical language (around 0.6). The words shown here were manually selected from 20 random draws for each respective region. (C) tSNE visualization of the BERT embedding space, colored by the difference in the values of cosine similarity for each word and the male and female clusters. (D) Distribution of the differences in cosine similarity values for 1-grams and male and female clusters. (E) Distribution of maximum gendered-cluster cosine similarity scores for the 1-grams selected for removal when using the rank-turbulence divergence trim technique and targeting the top 1% of words that contribute to overall divergence. The trimming procedure targets both common words that are considered relatively gendered by the cosine similarity measure, and less common words that are more specific to the MIMIC-III dataset and relatively less gendered according to the cosine similarity measure. (F) Weighted distribution of differences in cosine similarity between 1-grams and male and female clusters (same measure as (D), but weighted by the total number of occurrences of the 1-gram in the MIMIC-III data). (G) Weighted distribution of maximum cosine similarity scores between 1-grams and male or female clusters (same measure as (B), but weighted by the total number of occurrences of the 1-gram in the MIMIC-III data).

For the MIMIC-III dataset, we find RTD² highlights sex-specific terms, social information, and medical conditions (Table 6). The abbreviations “f” and “m”, for instance, are ranked 6288 and 244, respectively, by the RTD bias measure on BERT. Moving to the RTD bias measurements in MIMIC-III, “f” and “m” are the 3rd and 7th most biased terms, respectively, appearing in practically every note when describing basic demographic information for patients. The BERT word embedding of the 1-gram “grandmother” has a rank of 4 but a rank of 3571 in the MIMIC-III data—due to the fact that the 1-gram “grandmother” is inherently semantically gendered, but in the context of health records does not necessarily contain meaningful information on patient gender. “Husband”, on the other hand, does contain meaningful information on patient gender (at least in the MIMIC-III patient population), with it being rank 4 in terms of its empirical bias—the word embedding suggests it is biased, but less so, with a rank of 860.
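A minimal sketch of this second-order comparison (ours, under the assumption that each bias measure is summarized as a ranking of 1-grams with rank 1 the most biased): the same per-word divergence contribution used for trimming is applied to the pair of bias rankings themselves, and sorting by that contribution surfaces the n-grams on which the embedding-space and empirical measures most disagree. It reuses the rtd_contributions helper sketched earlier; the function name is ours.

    def divergence_of_divergence(model_bias_ranks, empirical_bias_ranks, alpha=1/3):
        # model_bias_ranks / empirical_bias_ranks: 1-gram -> bias rank under each measure.
        # Returns 1-grams ordered from most to least disagreement between the two measures.
        contrib = rtd_contributions(model_bias_ranks, empirical_bias_ranks, alpha=alpha)
        return sorted(contrib, key=contrib.get, reverse=True)

Run against the BERT cosine-similarity ranks and the MIMIC-III RTD ranks, the head of such a list should contain terms of the sort shown in Table 6, though the exact ordering depends on tie handling and the choice of alpha.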
As a final set of examples, we look at medical conditions. It is worth noting that our choice of BERT rather than Clinical BERT most likely results in less effective word embeddings for medical terms. “Cervical” has a rank of 7 in the BERT bias rankings and a rank of 18374 in the empirical bias distribution—most likely owing to the term's split meaning in a medical context (e.g., cervical spine versus cervix). Conversely, “flomax” has a rank of 10891 for the word embedding bias, while the empirical bias rank is 11—most likely due to the gender imbalance in the incidence of conditions (e.g., kidney stones, chronic prostatitis) that flomax is often prescribed to treat. Similarly, “hypothyroidism” is ranked 12 in MIMIC and 17831 in the BERT RTD ranks, with the condition having a known increased prevalence in female patients.

The high RTD ranks for medical conditions owe in part to the fact that we used BERT rather than the medically adapted Clinical BERT for these results. The choice to use the general-purpose BERT rather than Clinical BERT was motivated by a desire to illustrate the discrepancies in bias rankings that arise when using a general-purpose model (with the added contrast of a shifted domain, as indicated by jargonistic medical conditions). When applying this type of comparison in practice, it will most likely be more beneficial to compare bias ranks with the language models that are used in any final pipeline (in this case, Clinical BERT). Additionally, the difficulty of constructing meaningful clusters of gendered terms using technical language limits the utility of our cosine similarity bias measure in the Clinical BERT embedding space (see Table 7). Inspection of the 1-grams with high RTD values for BERT suggests a word of caution when using general-purpose word embeddings on more technical datasets, while also illustrating how the specific terms that drive bias may differ between domains. The lesson derived from this case study of applying BERT to medical texts extends as a further caution to domains that do not have the benefit of fine-tuned models or where model fit may be generally poor for other reasons.

4 CONCLUDING REMARKS

Here we present interpretable methods for detecting and reducing bias in text data. Using clinical notes and gender as a case study, we explore how using our methods to augment data may affect performance on classification tasks, which serve as extrinsic evaluations of the potential-bias removal process. We conclude by contrasting the inherent bias present in language models with the bias we detect in our two example datasets. These results demonstrate that it is possible to obscure gender features while preserving the signal needed to maintain performance on medically relevant classification tasks.

Our methods start by using a divergence measure to identify empirical data bias present in our EHR datasets. We then assess the intrinsic bias present within the word embedding spaces of general-purpose and clinically adapted language models. We introduce the concept of potential bias (PB) and evaluate the reduction of extrinsic PB when we apply our mitigation strategy. PB results are generated by presenting performance on a gender classification task.
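A lightweight sketch of this extrinsic potential-bias check (an assumption-laden illustration using scikit-learn and TF-IDF features rather than the fine-tuned BERT classifiers used for the main results, and reusing the trim_note helper sketched earlier): train a gender classifier on notes trimmed at a given level and record the MCC, where values near zero indicate the pipeline can no longer recover patient gender.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import matthews_corrcoef
    from sklearn.model_selection import train_test_split

    def potential_bias_mcc(notes, genders, trim_level, contrib):
        # MCC of a gender classifier trained on RTD-trimmed notes (near 0 = little recoverable signal).
        trimmed = [trim_note(n, contrib, trim_fraction=trim_level) for n in notes]
        X_train, X_test, y_train, y_test = train_test_split(
            trimmed, genders, test_size=0.2, stratify=genders, random_state=0)
        vec = TfidfVectorizer(min_df=5)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(X_train), y_train)
        preds = clf.predict(vec.transform(X_test))
        return matthews_corrcoef(y_test, preds)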
Finally, we compare the results of assessing empirical data bias and intrinsic embedding-space bias by contrasting the rankings of 1-grams produced by each method.

When evaluating the differences in word use frequency in medical documents, certain intuitive results emerge: practitioners use gendered pronouns to describe patients, they note social and family status, and they encode medical conditions with known gender imbalances. Using our rank-turbulence divergence approach, we are able to evaluate how each of these practices, in aggregate, contributes to the divergence in word-frequency distributions between the male- and female-patient notes. This becomes more useful as we move to identifying language that, while not explicitly gendered, may still be used in an unbalanced fashion in practice (for instance, non-sex-specific conditions that are diagnosed more frequently in one gender).

The results from divergence methods are useful both for understanding differences in language usage and as a debiasing technique. While many methods for debiasing language models focus on the bias present in the model itself, our empirically based method offers stronger debiasing of the data at hand. Modern language models are capable of detecting gender signals in a wide variety of datasets, ranging from conversational to highly technical language. Many methods for removing bias from a pre-trained language model still leave the potential for meaningful proxies in the target dataset, while also raising questions about degradation in performance. We believe that balancing debiasing with model performance is aided by interpretable techniques, such as those we present here. For instance, our bias ranking and iterative application of divergence measures allow users to get a sense of the disagreement in bias ranks between language models and empirical data.

Our study is limited to looking at (a) the intrinsic bias found in our dataset and pre-trained word embeddings, and (b) the extrinsic potential bias identified in our classification pipeline. We recognize concerns raised by Blodgett et al. [10] and others relating to the imprecise definitions of bias in the field of algorithmic fairness. Indeed, it is often important to motivate a given case of bias by establishing its potential harms. In this piece, we address precursors to bias, and thus do not claim to be making robust assessments of real-world bias and subsequent impact. The potential bias metric is instead meant to be a task-agnostic indicator of the capacity for a complete pipeline to discriminate between protected classes.

Due to the available data, we were not able to develop methods that address non-binary cases of gender bias. There are other methodological considerations for expanding past the binary case [13], and this is an important topic for a variety of bias types [44]. There are further complications when moving to tasks where the associated language is not as neatly segmented. For instance, we show above that when evaluating language models such as BERT, much of the gendered language appears in a readily identifiable region of the semantic space. As a rough heuristic: terms appearing in a medical dictionary tended to be less similar to gendered terms than terms that might appear in casual conversation.
For doctors' notes, the bulk of the bias stems from words that are largely distinct from those that we expect to be most informative for medically relevant tasks. Further research is required to determine the efficacy of our techniques in domains where language is not as neatly semantically segmented.

Using clinical notes from an ICU context could bias our results due to the types of patients, conditions, and interactions that are common in this setting. For instance, there may be fewer verbal patient-provider interactions reflected in the data, and social histories may not be as in-depth (compared with a primary-care setting). Further, the ways in which clinicians code health conditions may vary across contexts, institutions, and providers. In our study we aim to reduce the impact of how conditions are coded by selecting common conditions that have large sample sizes in our dataset—but this is still a factor that should be considered when working with such data.

Future research that applies these interpretable methods to clinical text has the opportunity to examine possible confounding factors such as patient-provider gender concordance. Further, it would be worthwhile to separately address the impact of author gender on the content of clinical texts using our analytical framework. Other confounding factors relating to the patient populations and broader socio-demographic factors could be addressed by replicating these trials on new datasets. There is also the potential to research how presenting the results of our empirical bias analysis to clinicians may affect note-writing practices—perhaps adding empirical examples to the growing medical school curriculum that addresses unconscious bias [64].

Our methods make no formal privacy guarantees, nor do we claim complete removal of bias. There is always a trade-off when seeking to balance bias reduction with overall performance, and we feel our methods will help all stakeholders make more informed decisions. Our methodology allows stakeholders to specify the trade-off between bias reduction and performance that is best for their particular use case by selecting different trim levels and reviewing the n-grams removed. Using a debiasing method that is readily interpreted by doctors, patients, and machine learning practitioners is a benefit for all involved, especially as public interest in data privacy grows.

Moving towards replacing strings, rather than trimming or dropping them completely, should be investigated in the future. More advanced data augmentation methods may be needed if we were to explore the impact of debiasing on highly tuned classification pipelines. Holistic comparisons of string replacement techniques and other text data augmentation approaches would be worthwhile next steps. Further research on varying and more difficult extrinsic evaluation tasks would be helpful in evaluating how our technique generalizes. Future work could also investigate coupling our data-driven method with methods focused on debiasing language models.

APPENDICES

A SUPPLEMENTARY INFORMATION (SI)

A.1 Note Selection

After reviewing the note types available in the MIMIC-III dataset, we determined that many types were not suitable for our task. This is due to a combination of factors including information content and note length (Figure 22).
Note types such as radiology often include very specific information (not indicative of broader patient health status), are shorter, and may be written in a jargonistic fashion. For the work outlined here we only include notes of the types nursing, discharge summary, and physician. In order to be included in our training and test datasets, documents must come from patients with at least three recorded documents.

A.2 Document Lengths after Trimming

Fig. 8. Document length after applying a linearly-spaced rank-turbulence divergence based trimming procedure. Percentage values represent the percentage of total rank-turbulence divergence removed. Trimming is conducted by sorting words highest-to-lowest based on their individual contribution to the rank-turbulence divergence between male and female corpora (i.e., the first 10% trim will include words that, for most distributions, contribute far more to rank-turbulence divergence than the last 10%).

A.3 Variable Length Note Embedding

When tokenized, many notes available in the MIMIC-III dataset are longer than the 512-token maximum supported by BERT. To address this issue we experiment with truncating the note at the first 512 tokens. We also explore embedding at the sentence level (embedding with a maximum of 128 tokens) and simply dividing the note into 512-token subsequences. In the latter two cases, we use the function outlined by Huang et al. [31],

P(Y = 1) = \frac{P^{n}_{\max} + P^{n}_{\text{mean}}\, n/c}{1 + n/c}, \qquad (3)

where P^n_max and P^n_mean are the maximum and mean probabilities for the n subsequences associated with a given note. Here, c is a tunable parameter that is adjusted for each task. For our purposes, the improvement in classification performance returned by employing this technique did not merit use in our final results. If overall performance of our classification system were our primary objective, this may be worth further investigation.

A.4 ICD Co-occurrence

Fig. 9. Normalized rates of health-condition co-occurrence for the top 10 ICD-9 codes.

A.5 Hardware

BERT and Clinical BERT models were fine-tuned on both an NVIDIA RTX 2070 (8GB VRAM) and NVIDIA Tesla V100s (32GB VRAM).

A.6 Gendered 1-grams

Table 4. Manually Selected Gendered Terms
Female 1-grams: her, she, woman, female, Ms, Mrs, herself, girl, lady
Male 1-grams: his, he, man, male, Mr, him, himself, boy, gentleman

Fig. 10. Classification performance for the next 123 most frequently occurring conditions. Matthews correlation coefficient for condition classification of ICD9 codes with at least 1,000 patients compared to the proportion of the patients with that code who are female. While the most accurate classifiers tend to be for conditions with a male bias, we observed that this is in part due to the underlying bias in patient gender.

Fig. 11. ROC curves for the classification task on the top 10 health conditions with varying proportions of rank-turbulence divergence removed. Echoing the results in Figure 5, the gender classifier has the best performance on the ‘no-trim’ data and experiences the greatest drop in performance when trimming is applied.
Under the highest trim level reported here, the gen- der classifier is effectively random, while few condition classifiers retain prediction capability (albeit modest). The bar chart shows the area under the ROC curve for classifiers, by task, trained and tested with no-trimming and maximum-trimming applied. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:25 Fig. 12. Rank-turbulence divergence for 2014 n2c2 challenge. For this figure, 2-grams have been split between genders and common gendered terms (pronouns, etc.) have been removed before calculating rank divergence. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:26 • J. R. Minot et al. Fig. 13. Rank-turbulence divergence for 2014 n2c2 challenge. For this figure, 1-grams have been split between genders and common gendered terms (pronouns, etc., see Table 4) have been removed before calculating rank divergence. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:27 Fig. 14. Rank-turbulence divergence for 2014 n2c2 challenge. For this figure, 3-grams have been split between genders and common gendered terms (pronouns, etc.) have been removed before calculating rank divergence. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:28 • J. R. Minot et al. Fig. 15. Rank-turbulence divergence for 2014 n2c2 challenge. For this figure, 2-grams have been split between genders. Fig. 16. Maximum cosine similarity scores of BERT-base embeddings for 26,883 1grams appearing in n2c2 2014 challenge data relative to gendered clusters. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:29 Fig. 17. A tSNE embedding of n2c2 document vectors generated using a pre-trained version of Clinical BERT. Fig. 18. A tSNE embedding of MIMIC document vectors generated using a pre-trained version of Clinical BERT. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:30 • J. R. Minot et al. Fig. 19. A tSNE embedding of MIMIC document vectors generated using a pre-trained version of Clinical BERT. Fig. 20. A tSNE embedding of MIMIC document vectors generated using a pre-trained version of Clinical BERT. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:31 Table 5. Rank-Turbulence Divergence of the Rank-Turbulence Divergence Between Male and Female Zipf Distributions According to n2c2 Rank-Turbulence Divergence and BERT Cosine Similarity Ranking BERT n2c2 BERT-n2c2 1gram RTD rank RTD rank RTD rank 1 mrs 1.0 1172.5 1.0 2 ms 4.0 6560.5 2.0 3 her 279.0 3.0 3.0 4 mr. 15307.0 6.0 4.0 5 male 21798.0 8.0 5.0 6 mr 5.0 437.0 6.0 7 female 5150.0 9.0 7.0 8 linda 10.0 3681.5 8.0 9 ms. 2208.0 11.0 9.0 10 gentleman 3054.0 13.0 10.0 11 pap 25105.0 17.0 11.0 12 breast 4314.0 14.0 12.0 13 cervical 14.0 2483.0 13.0 14 biggest 16.0 5301.5 14.0 15 mammogram 25054.0 21.0 15.0 16 mrs. 
2860.0 16.0 16.0 17 f 10458.0 20.0 17.0 18 woman 120.0 7.0 18.0 19 psa 6082.0 19.0 19.0 20 kathy 19.0 5301.5 20.0 21 he 3.0 1.0 21.0 22 them 20.0 4146.0 22.0 23 prostate 2223.0 18.0 23.0 24 bph 8601.0 23.0 24.0 25 husband 920.0 15.0 25.0 26 guy 22.0 5301.5 26.0 27 take 4278.0 22.0 27.0 28 infected 18.0 1455.0 28.0 29 patricia 21.0 2701.5 29.0 30 smear 21455.0 29.0 30.0 31 ellen 24.0 3681.5 31.0 32 cabg 19064.0 33.0 32.0 33 distal 7754.0 30.0 33.0 34 pend 22515.0 36.0 34.0 35 tablet 2322.0 25.0 35.0 36 cath 7111.0 31.0 36.0 37 qday 12829.0 34.0 37.0 38 peggy 17.0 485.0 38.0 39 flomax 17413.0 37.5 39.0 40 lad 2383.0 27.0 40.0 41 prostatic 23978.0 43.0 41.0 42 gout 11948.0 40.0 42.0 43 taking 9534.0 39.0 43.0 44 trouble 34.0 4183.5 44.0 45 harry 33.0 3372.0 45.0 46 vaginal 10418.0 41.0 46.0 47 qty 18645.0 45.0 47.0 48 she 6.0 2.0 48.0 49 p.o 9293.0 42.0 49.0 50 xie 20584.0 47.5 50.0 ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:32 • J. R. Minot et al. Table 6. Comparison of Rank-Turbulence Divergences for Gendered Clusters in BERT Embeddings and the MIMIC Patient Health Records Text BERT MIMIC BERT-MIMC MIMIC MIMIC 1gram RTD rank RTD rank RTD rank Frank Mrank sexually 2.0 16755.5 1.0 10607.0 10373.5 biggest 3.0 7594.5 2.0 17520.5 21172.5 f 6288.0 3.0 3.0 251.0 1719.0 infected 5.0 16076.0 4.0 3475.0 3402.5 grandmother 4.0 3517.0 5.0 6090.0 7643.0 cervical 7.0 18374.0 6.0 1554.0 1551.5 m 244.0 2.0 7.0 2103.0 249.0 sister 6.0 4153.0 8.0 1213.5 1365.0 husband 860.0 4.0 9.0 495.0 5114.0 teenage 9.0 5513.0 10.0 15449.0 19594.0 trouble 12.0 10119.0 11.0 3778.5 4095.5 brother 10.0 1921.0 12.0 1925.0 1598.0 teenager 8.0 936.5 13.0 12928.5 21172.5 connected 16.0 16682.0 14.0 5184.5 5089.5 shaky 11.0 2341.0 15.0 10607.0 7872.0 my 15.0 8198.0 16.0 2395.0 2624.5 breast 3397.0 6.0 17.0 1075.0 4673.5 expelled 19.0 14652.5 18.0 20814.0 19594.0 them 18.0 11337.0 19.0 1738.5 1832.0 prostate 2196.0 5.0 20.0 9436.0 1576.5 immune 20.0 15043.5 21.0 9119.0 8753.0 daughter 1.0 16.0 22.0 463.0 801.5 initial 23.0 11433.0 23.0 632.5 610.0 ovarian 16374.0 8.0 24.0 3137.0 14082.0 recovering 24.0 11867.0 25.0 5351.5 5749.5 abnormal 25.0 9010.0 26.0 1849.5 1994.0 alcoholic 17.0 1136.0 27.0 3885.5 2952.0 obvious 26.0 10154.0 28.0 2725.5 2540.5 huge 28.0 13683.5 29.0 8069.0 7643.0 dirty 29.0 13264.5 30.0 8613.5 9179.5 suv 27.0 5911.0 31.0 14704.0 18377.5 container 31.0 17776.0 32.0 12090.5 12214.5 flomax 10891.0 11.0 33.0 17520.5 4095.5 sisters 32.0 17680.5 34.0 4727.0 4687.5 uterine 7260.0 10.0 35.0 4263.0 19594.0 hypothyroidism 17831.0 12.0 36.0 1239.5 2920.0 dried 34.0 15661.0 37.0 5124.5 4982.0 osteoporosis 18010.0 13.0 38.0 2354.0 7003.0 breasts 4727.0 9.0 39.0 4394.5 21172.5 certain 37.0 18020.5 40.0 7973.0 8023.0 i 30.0 4601.0 41.0 403.0 435.5 restless 21.0 1144.0 42.0 1366.0 1123.0 wife 13.0 1.0 43.0 5545.0 245.5 sle 8083.0 14.0 44.0 3511.0 12504.5 granddaughter 14.0 210.0 45.0 4656.5 7872.0 localized 47.0 14357.5 46.0 6136.0 6405.0 ciwa 7921.0 15.0 47.0 3106.5 1341.0 honey 44.0 11076.0 48.0 10868.5 9861.0 coronary 11570.0 18.0 49.0 560.0 349.0 systemic 41.0 6361.0 50.0 3696.0 4214.0 BERT RTD ranks are calculated based on cosine similarity scores for word embedding and gendered clusters (i.e., the RTD of cosine similarity score ranks relative to male and female clusters). MIMIC RTD ranks are for 1-grams from male and female clinical notes. 
“BERT-MIMIC RTD rank” is the rankings for 1-grams based on RTD between the first two columns—we also refer to this as RTD (ranking divergence-of-divergence). ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:33 Table 7. Comparison of Rank-Turbulence Divergences for Gendered Clusters in Clinical BERT Embeddings and the MIMIC Patient Health Records Text BERT MIMIC BERT-MIMC MIMIC MIMIC 1gram RTD rank RTD rank RTD rank Frank Mrank is 1.0 18545.5 1.0 24.0 24.0 wife 3588.0 1.0 2.0 245.5 5545.0 yells 2.0 11292.0 3.0 10563.0 9615.0 looking 3.0 16700.0 4.0 3238.5 3289.5 m 7181.0 2.0 5.0 249.0 2103.0 kids 4.0 13675.0 6.0 8753.0 9266.5 essentially 5.0 17864.0 7.0 2269.0 2279.5 alter 6.0 15840.0 8.0 15764.0 16387.5 f 2166.0 3.0 9.0 1719.0 251.0 bumps 7.0 6936.0 10.0 13612.0 11417.0 husband 2958.0 4.0 11.0 5114.0 495.0 historian 8.0 7645.0 12.0 6895.0 6045.5 insult 9.0 7241.0 13.0 8854.5 10361.0 moments 10.0 16287.0 14.0 10563.0 10868.5 our 11.0 16938.0 15.0 2803.0 2838.0 asks 13.0 16147.0 16.0 7257.0 7449.5 goes 15.0 17662.0 17.0 3653.0 3624.5 someone 16.0 14030.0 18.0 6197.0 5920.5 ever 17.0 10698.0 19.0 5114.0 5545.0 prostate 6070.0 5.0 20.0 1576.5 9436.0 breast 6010.0 6.0 21.0 4673.5 1075.0 experiences 20.0 17062.0 22.0 9452.5 9615.0 suffer 12.0 1463.5 23.0 10968.5 16387.5 recordings 21.0 13249.5 24.0 12504.5 13440.5 wore 23.0 14774.5 25.0 8854.5 9266.5 largely 26.0 16941.0 26.0 4581.5 4515.5 hi 22.0 6459.0 27.0 3992.5 4558.0 et 27.0 17615.0 28.0 2079.0 2093.5 staying 25.0 10254.5 29.0 4366.5 4746.5 pursuing 18.0 1561.5 30.0 13612.0 20814.0 pet 24.0 4873.0 31.0 5879.0 7076.5 town 28.0 10038.0 32.0 10754.0 9615.0 tire 30.0 12053.5 33.0 12834.5 11736.5 ovarian 13390.0 8.0 34.0 14082.0 3137.0 beef 19.0 1355.5 35.0 23307.0 14704.0 dipping 34.0 17764.0 36.0 6371.0 6318.5 dip 35.0 18328.0 37.0 5585.0 5600.0 hat 33.0 10106.5 38.0 14082.0 12478.5 flomax 14782.0 11.0 39.0 4095.5 17520.5 punch 31.0 5064.5 40.0 11203.5 14033.0 ease 38.0 17265.5 41.0 6471.5 6556.0 hasn 36.0 11324.0 42.0 10373.5 11417.0 lasts 39.0 14072.5 43.0 11203.5 10607.0 grabbing 32.0 4294.0 44.0 10968.5 14033.0 hypothyroidism 17820.0 12.0 45.0 2920.0 1239.5 whatever 37.0 8524.5 46.0 13200.5 15449.0 osteoporosis 18276.0 13.0 47.0 7003.0 2354.0 uterine 6519.0 10.0 48.0 19594.0 4263.0 sle 16374.0 14.0 49.0 12504.5 3511.0 dump 47.0 15439.0 50.0 10968.5 11417.0 Clinical BERT RTD ranks are calculated based on cosine similarity scores for word embedding and gendered clusters (i.e., the RTD of cosine similarity score ranks relative to male and female clusters). MIMIC RTD ranks are for 1-grams from male and female clinical notes. “BERT-MIMIC RTD rank” is the rankings for 1-grams based on RTD between the first two columns—we also refer to this as RTD (ranking divergence-of-divergence). The presence of largely conversational terms rather than more technical, medical language owes to our defining of gender clusters through manually selected terms. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:34 • J. R. Minot et al. Fig. 21. Document length for MIMIC-III text notes. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:35 Fig. 22. Document length for MIMIC-III by note type. For our study we include discharge summary, physician, and nurs- ing notes. 
Consult notes were initially considered but were ultimately found to be highly varied in terms of notation and nomenclature. This had the effect of making results more difficult to interpret and would have required additional data clean- ing. We believe our methods could be applied to patient records that include consult notes, just at the cost of additional pre-processing and more nuanced interpretation. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:36 • J. R. Minot et al. Table 8. Condition Name and Gender Balance for the ICD9 Codes with at Least 1,000 Observations in the MIMIC-III Dataset Sex Count Sex Prop. ICD Description F M F M Personal history of malignant neoplasm of prostate 0 1207 0.00 1.00 Hypertrophy (benign) of prostate without urinar... 0 1490 0.00 1.00 Routine or ritual circumcision 0 2016 0.00 1.00 Gout, unspecified 552 1530 0.27 0.73 Alcoholic cirrhosis of liver 323 879 0.27 0.73 Retention of urine, unspecified 283 737 0.28 0.72 Intermediate coronary syndrome 466 1197 0.28 0.72 Chronic systolic heart failure 321 776 0.29 0.71 Aortocoronary bypass status 896 2160 0.29 0.71 Other and unspecified angina pectoris 330 770 0.30 0.70 Paroxysmal ventricular tachycardia 548 1263 0.30 0.70 Chronic hepatitis C without mention of hepatic ... 380 838 0.31 0.69 Coronary atherosclerosis of unspecified type of... 479 1015 0.32 0.68 Percutaneous transluminal coronary angioplasty ... 889 1836 0.33 0.67 Portal hypertension 332 675 0.33 0.67 Surgical operation with anastomosis, bypass, or... 406 805 0.34 0.66 Coronary atherosclerosis of native coronary artery 4322 8107 0.35 0.65 Old myocardial infarction 1156 2122 0.35 0.65 Acute on chronic systolic heart failure 406 737 0.36 0.64 Cardiac complications, not elsewhere classified 847 1496 0.36 0.64 Atrial flutter 444 773 0.36 0.64 Paralytic ileus 394 678 0.37 0.63 Chronic kidney disease, unspecified 1265 2170 0.37 0.63 Personal history of tobacco use 1042 1769 0.37 0.63 Pneumonitis due to inhalation of food or vomitus 1369 2311 0.37 0.63 Tobacco use disorder 1251 2107 0.37 0.63 Obstructive sleep apnea (adult)(pediatric) 891 1489 0.37 0.63 Cirrhosis of liver without mention of alcohol 486 801 0.38 0.62 Hypertensive chronic kidney disease, unspecifie... 1300 2121 0.38 0.62 Diabetes with neurological manifestations, type... 438 700 0.38 0.62 Other primary cardiomyopathies 664 1045 0.39 0.61 Cardiac arrest 542 819 0.40 0.60 Peripheral vascular disease, unspecified 564 837 0.40 0.60 Hyperpotassemia 874 1295 0.40 0.60 Bacteremia 599 879 0.41 0.59 Other and unspecified hyperlipidemia 3537 5153 0.41 0.59 Thrombocytopenia, unspecified 1255 1810 0.41 0.59 Pure hypercholesterolemia 2436 3494 0.41 0.59 Pressure ulcer, lower back 530 759 0.41 0.59 Subendocardial infarction, initial episode of care 1262 1793 0.41 0.59 Acute kidney failure with lesion of tubular nec... 945 1342 0.41 0.59 (Continued) ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:37 Table 8. Continued Sex Count Sex Prop. ICD Description F M F M Acute and subacute necrosis of liver 441 626 0.41 0.59 Hypertensive chronic kidney disease, unspecifie... 
1091 1539 0.41 0.59 Hemorrhage complicating a procedure 637 898 0.41 0.59 Cardiogenic shock 480 674 0.42 0.58 Aortic valve disorders 1069 1481 0.42 0.58 Polyneuropathy in diabetes 667 917 0.42 0.58 Other postoperative infection 503 683 0.42 0.58 Respiratory distress syndrome in newborn 559 755 0.43 0.57 Cardiac pacemaker in situ 592 798 0.43 0.57 Atrial fibrillation 5512 7379 0.43 0.57 Pulmonary collapse 931 1234 0.43 0.57 Delirium due to conditions classified elsewhere 622 823 0.43 0.57 Diabetes mellitus without mention of complicati... 3902 5156 0.43 0.57 Hemorrhage of gastrointestinal tract, unspecified 602 795 0.43 0.57 Other and unspecified coagulation defects 438 578 0.43 0.57 Acute kidney failure, unspecified 3941 5178 0.43 0.57 End stage renal disease 836 1090 0.43 0.57 Accidents occurring in residential institution 456 583 0.44 0.56 Single liveborn, born in hospital, delivered by... 1220 1538 0.44 0.56 Sepsis 563 709 0.44 0.56 Hyperosmolality and/or hypernatremia 1009 1263 0.44 0.56 Other specified surgical operations and procedu... 600 750 0.44 0.56 Severe sepsis 1746 2166 0.45 0.55 Unspecified protein-calorie malnutrition 562 697 0.45 0.55 Long-term (current) use of insulin 1138 1400 0.45 0.55 Long-term (current) use of anticoagulants 1709 2097 0.45 0.55 Other iatrogenic hypotension 953 1168 0.45 0.55 Anemia in chronic kidney disease 623 761 0.45 0.55 Intracerebral hemorrhage 618 749 0.45 0.55 Unspecified essential hypertension 9370 11333 0.45 0.55 Acute posthemorrhagic anemia 2072 2480 0.46 0.54 Unspecified septicemia 1702 2023 0.46 0.54 Chronic airway obstruction, not elsewhere class... 2027 2404 0.46 0.54 Pneumonia, organism unspecified 2223 2616 0.46 0.54 Septic shock 1189 1397 0.46 0.54 Other convulsions 892 1042 0.46 0.54 Other specified procedures as the cause of abno... 693 809 0.46 0.54 Diarrhea 484 565 0.46 0.54 Hematoma complicating a procedure 566 658 0.46 0.54 Acute respiratory failure 3473 4024 0.46 0.54 Other specified cardiac dysrhythmias 1137 1316 0.46 0.54 (Continued) ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:38 • J. R. Minot et al. Table 8. Continued Sex Count Sex Prop. ICD Description F M F M Need for prophylactic vaccination and inoculati... 2680 3099 0.46 0.54 Personal history of transient ischemic attack (... 498 574 0.46 0.54 Neonatal jaundice associated with preterm delivery 1052 1212 0.46 0.54 Observation for suspected infectious condition 2570 2949 0.47 0.53 Congestive heart failure, unspecified 6106 7005 0.47 0.53 Hypovolemia hyponatremia 641 733 0.47 0.53 Single liveborn, born in hospital, delivered wi... 1668 1898 0.47 0.53 Unspecified pleural effusion 1281 1453 0.47 0.53 Acidosis 2127 2401 0.47 0.53 Esophageal reflux 2990 3336 0.47 0.53 Encounter for palliative care 485 535 0.48 0.52 Hyposmolality and/or hyponatremia 1445 1594 0.48 0.52 Iron deficiency anemia secondary to blood loss ... 482 530 0.48 0.52 Hypoxemia 625 673 0.48 0.52 Mitral valve disorders 1416 1510 0.48 0.52 Primary apnea of newborn 506 537 0.49 0.51 Hypotension, unspecified 996 1055 0.49 0.51 Personal history of venous thrombosis and embolism 786 826 0.49 0.51 Obesity, unspecified 744 767 0.49 0.51 Intestinal infection due to Clostridium difficile 716 728 0.50 0.50 Obstructive chronic bronchitis with (acute) exa... 
598 600 0.50 0.50 Anemia of other chronic disease 550 543 0.50 0.50 Anemia, unspecified 2729 2677 0.50 0.50 Dehydration 704 681 0.51 0.49 Other chronic pulmonary heart diseases 1101 1047 0.51 0.49 Do not resuscitate status 694 633 0.52 0.48 Depressive disorder, not elsewhere classified 1888 1543 0.55 0.45 Morbid obesity 648 522 0.55 0.45 Iron deficiency anemia, unspecified 657 514 0.56 0.44 Chronic diastolic heart failure 708 532 0.57 0.43 Hypopotassemia 816 609 0.57 0.43 Anxiety state, unspecified 944 636 0.60 0.40 Dysthymic disorder 663 446 0.60 0.40 Asthma, unspecified type, unspecified 1317 878 0.60 0.40 Urinary tract infection, site not specified 4027 2528 0.61 0.39 Other persistent mental disorders due to condit... 698 428 0.62 0.38 Acute on chronic diastolic heart failure 779 441 0.64 0.36 Unspecified acquired hypothyroidism 3307 1610 0.67 0.33 Osteoporosis, unspecified 1637 310 0.84 0.16 Personal history of malignant neoplasm of breast 1259 18 0.99 0.01 ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:39 REFERENCES [1] David Ifeoluwa Adelani, Ali Davody, Thomas Kleinbauer, and Dietrich Klakow. 2020. Privacy guarantees for de-identifying text trans- formations. arXiv preprint arXiv:2008.03101 (2020). [2] Oras A. Alabas, Chris P. Gale, Marlous Hall, Mark J. Rutherford, Karolina Szummer, Sofia Sederholm Lawesson, Joakim Alfredsson, Bertil Lindahl, and Tomas Jernberg. 2017. Sex differences in treatments, relative survival, and excess mortality following acute myocardial infarction: National cohort study using the SWEDEHEART registry. Journal of the American Heart Association 6, 12 (2017), e007123. [3] Marcella Alsan, Owen Garrick, and Grant C. Graziani. 2018. Does Diversity Matter for Health? Experimental Evidence from Oakland. Technical Report. National Bureau of Economic Research. [4] Marcella Alsan and Marianne Wanamaker. 2018. Tuskegee and the health of black men. The Quarterly Journal of Economics 133, 1 (2018), 407–455. [5] Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly avail- able clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics, Minneapolis, Minnesota, USA, 72–78. https://doi.org/10.18653/v1/W19-1909 [6] Marion Bartl, Malvina Nissim, and Albert Gatt. 2020. Unmasking contextual stereotypes: Measuring and mitigating BERT’s gender bias. arXiv preprint arXiv:2010.14534 (2020). [7] Christine Basta, Marta R. Costa-Jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics, Florence, Italy, 33–39. https://doi.org/10.18653/v1/W19-3805 [8] Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBert: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676 (2019). [9] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. 2017. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075 (2017). [10] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. arXiv preprint arXiv:2005.14050 (2020). [11] Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 
2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems. 4349–4357. [12] Rishi Bommasani, Kelly Davis, and Claire Cardie. 2020. Interpreting pretrained contextualized representations via reductions to static embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4758–4781. [13] Yang Trista Cao and Hal Daumé III. 2019. Toward gender-inclusive coreference resolution. arXiv preprint arXiv:1910.13913 (2019). [14] Davide Chicco, Niklas Tötsch, and Giuseppe Jurman. 2021. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Mining 14, 1 (2021), 1–22. [15] Erenay Dayanik and Sebastian Padó. 2020. Masking actor information leads to fairer political claims detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4385–4391. [16] Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krish- naram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 120–128. [17] Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. 2017. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association 24, 3 (2017), 596–606. [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). [19] Peter Sheridan Dodds, Joshua R. Minot, Michael V. Arnold, Thayer Alshaabi, Jane Lydia Adams, David Rushing Dewhurst, Tyler J. Gray, Morgan R. Frank, Andrew J. Reagan, and Christopher M. Danforth. 2020. Allotaxonometry and rank-turbulence divergence: A universal instrument for comparing complex systems. arXiv preprint arXiv:2002.09770 (2020). [20] Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 1–19. [21] Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512 (2019). [22] Yadan Fan, Serguei Pakhomov, Reed McEwan, Wendi Zhao, Elizabeth Lindemann, and Rui Zhang. 2019. Using word embeddings to expand terminology of dietary supplements on clinical notes. JAMIA Open 2, 2 (2019), 246–253. [23] Paul M. Galdas, Francine Cheater, and Paul Marshall. 2005. Men and health help-seeking behaviour: Literature review. Journal of Advanced Nursing 49, 6 (2005), 616–623. [24] Aparna Garimella, Carmen Banea, Dirk Hovy, and Rada Mihalcea. 2019. Women’s syntactic resilience and men’s grammatical luck: Gender-bias in part-of-speech tagging and dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Compu- tational Linguistics. 3493–3498. ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. 39:40 • J. R. Minot et al. [25] Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862 (2019). [26] Brad N. 
Greenwood, Seth Carnahan, and Laura Huang. 2018. Patient–physician gender concordance and increased mortality among female heart attack patients. Proceedings of the National Academy of Sciences 115, 34 (2018), 8569–8574. [27] Brad N. Greenwood, Rachel R. Hardeman, Laura Huang, and Aaron Sojourner. 2020. Physician–patient racial concordance and dispari- ties in birthing mortality for newborns. Proceedings of the National Academy of Sciences 117, 35 (2020), 21194–21200. [28] Nina Grgic-Hlaca, Muhammad Bilal Zafar, Krishna P. Gummadi, and Adrian Weller. 2016. The case for process fairness in learning: Feature selection for fair decision making. In NIPS Symposium on Machine Learning and the Law,Vol.1.2. [29] Revital Gross, Rob McNeill, Peter Davis, Roy Lay-Yee, Santosh Jatrana, and Peter Crampton. 2008. The association of gender concordance and primary care physicians’ perceptions of their patients. Women & Health 48, 2 (2008), 123–144. [30] Katarina Hamberg. 2008. Gender bias in medicine. Women’s Health 4, 3 (2008), 237–243. [31] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342 (2019). [32] Kexin Huang, Abhishek Singh, Sitong Chen, Edward T. Moseley, Chih-ying Deng, Naomi George, and Charlotta Lindvall. 2019. Clinical XLNet: Modeling sequential clinical notes and predicting prolonged mechanical ventilation. arXiv preprint arXiv:1912.11975 (2019). [33] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1 (2016), 1–9. [34] Faiza Khan Khattak, Serena Jeblee, Chloé Pou-Prom, Mohamed Abdalla, Christopher Meaney, and Frank Rudzicz. 2019. A survey of word embeddings for clinical text. Journal of Biomedical Informatics: X 4 (2019), 100057. [35] Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201 (2018). [36] Vishesh Kumar, Amber Stubbs, Stanley Shaw, and Özlem Uzuner. 2015. Creation of a new longitudinal corpus of clinical narratives. Journal of Biomedical Informatics 58 (2015), S6–S10. [37] Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W. Black, and Yulia Tsvetkov. 2019. Measuring bias in contextualized word representations. arXiv preprint arXiv:1906.07337 (2019). [38] Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. arXiv preprint arXiv:1703.06856 (2017). [39] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240. [40] Chi-Wei Lin, Meei-Ju Lin, Chin-Chen Wen, and Shao-Yin Chu. 2016. A word-count approach to analyze linguistic patterns in the reflective writings of medical students. Medical Education Online 21, 1 (2016), 29522. [41] Bo Liu. 2019. Anonymized BERT: An augmentation approach to the gendered pronoun resolution challenge. arXiv preprint arXiv:1905.01780 (2019). [42] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019). [43] Jyoti Malhotra, David Rotter, Jennifer Tsui, Adana A. M. 
Llanos, Bijal A. Balasubramanian, and Kitaw Demissie. 2017. Impact of patient– provider race, ethnicity, and gender concordance on cancer screening: Findings from Medical Expenditure Panel Survey. Cancer Epi- demiology and Prevention Biomarkers 26, 12 (2017), 1804–1811. [44] Thomas Manzini, Yao Chong Lim, Yulia Tsvetkov, and Alan W. Black. 2019. Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. arXiv preprint arXiv:1904.04047 (2019). [45] Haggi Mazeh, Rebecca S. Sippel, and Herbert Chen. 2012. The role of gender in primary hyperparathyroidism: Same disease, different presentation. Annals of Surgical Oncology 19, 9 (2012), 2958–2962. [46] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635 (2019). [47] Michelle M. Mello and C. Jason Wang. 2020. Ethics and governance for digital disease surveillance. Science 368, 6494 (2020), 951–954. [48] Stephane M. Meystre, F. Jeffrey Friedlin, Brett R. South, Shuying Shen, and Matthew H. Samore. 2010. Automatic de-identification of textual documents in the electronic health record: A review of recent research. BMC Medical Research Methodology 10, 1 (2010), 1–16. [49] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). [50] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546 (2013). [51] Christoph Molnar, Giuseppe Casalicchio, and Bernd Bischl. 2020. Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges. arXiv:2010.09337 [stat.ML] [52] Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS, Vol. 5. Citeseer, 246– ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022. Interpretable Bias Mitigation for Textual Data • 39:41 [53] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. [54] Bethany Percha, Yuhao Zhang, Selen Bozkurt, Daniel Rubin, Russ B. Altman, and Curtis P. Langlotz. 2018. Expanding a radiology lexicon using contextual patterns in radiology reports. Journal of the American Medical Informatics Association 25, 6 (2018), 679–685. [55] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). [56] Thang M. Pham, Trung Bui, Long Mai, and Anh Nguyen. 2020. Out of order: How important is the sequential order of words in a sentence in natural language understanding tasks? arXiv preprint arXiv:2012.15180 (2020). [57] Víctor M. Prieto, Sergio Matos, Manuel Alvarez, Fidel Cacheda, and José Luís Oliveira. 2014. Twitter: A good place to detect health conditions. PloS One 9, 1 (2014), e86191. [58] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (2019). [59] Stephen Robertson. 2004. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation (2004). 
[60] Manuel Rodríguez-Martínez and Cristian C. Garzón-Alfonso. 2018. Twitter health surveillance (THS) system. In Proceedings of the IEEE International Conference on Big Data, Vol. 2018. NIH Public Access, 1647. [61] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327 (2020). [62] Marcel Salathé. 2018. Digital epidemiology: What is it, and where is it going? Life Sciences, Society and Policy 14, 1 (2018), 1–5. [63] Justin Sybrandt, Michael Shtutman, and Ilya Safro. 2017. Moliere: Automatic biomedical hypothesis generation system. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1633–1642. [64] Cayla R. Teal, Anne C. Gill, Alexander R. Green, and Sonia Crandall. 2012. Helping medical learners recognise and manage unconscious bias toward certain patient groups. Medical Education 46, 1 (2012), 80–88. [65] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems 33 (2020). [66] Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196 (2019). [67] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace’s transformers: State-of-the-art natural language processing. ArXiv abs/1910.03771 (2019). [68] Christopher C. Yang, Haodong Yang, Ling Jiang, and Mi Zhang. 2012. Social media mining for drug safety signal detection. In Proceedings of the 2012 International Workshop on Smart Health and Wellbeing. 33–40. [69] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541 (2018). [70] Haoran Zhang, Amy X. Lu, Mohamed Abdalla, Matthew McDermott, and Marzyeh Ghassemi. 2020. Hurtful words: Quantifying biases in clinical contextual word embeddings. In Proceedings of the ACM Conference on Health, Inference, and Learning. 110–120. [71] Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. arXiv preprint arXiv:1809.01496 (2018). Received 13 August 2021; revised 1 February 2022; accepted 9 March 2022 ACM Transactions on Computing for Healthcare, Vol. 3, No. 4, Article 39. Publication date: October 2022.
