Enhancing Label Correlation Feedback in Multi-Label Text Classification via Multi-Task Learning

Ximing Zhang¹†, Qian-Wen Zhang², Zhao Yan², Ruifang Liu¹, Yunbo Cao²
¹Beijing University of Posts and Telecommunications, Beijing 100876, China
²Tencent Cloud Xiaowei, Beijing 100080, China
ximingzhang@bupt.edu.cn, cowenzhang@tencent.com, zhaoyan@tencent.com, lrf@bupt.edu.cn, yunbocao@tencent.com

† Equal contribution. Work done during an internship at Tencent.

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1190-1200, August 1-6, 2021. ©2021 Association for Computational Linguistics

Abstract

In multi-label text classification (MLTC), each given document is associated with a set of correlated labels. To capture label correlations, previous classifier-chain and sequence-to-sequence models transform MLTC into a sequence prediction task. However, they tend to suffer from label order dependency, label combination over-fitting, and error propagation problems. To address these problems, we introduce a novel approach with multi-task learning to enhance label correlation feedback. We first utilize a joint embedding (JE) mechanism to obtain the text and label representations simultaneously. In the MLTC task, a document-label cross attention (CA) mechanism is adopted to generate a more discriminative document representation. Furthermore, we propose two auxiliary label co-occurrence prediction tasks to enhance label correlation learning: 1) Pairwise Label Co-occurrence Prediction (PLCP), and 2) Conditional Label Co-occurrence Prediction (CLCP). Experimental results on the AAPD and RCV1-V2 datasets show that our method outperforms competitive baselines by a large margin. We analyze low-frequency label performance, label dependency, label combination diversity, and coverage speed to show the effectiveness of our proposed method on label correlation learning. Our code is available at https://github.com/EiraZhang/LACO.

1 Introduction

Multi-label text classification (MLTC) is an important natural language processing task with applications in text categorization, information retrieval, web mining, and many other real-world scenarios (Zhang and Zhou, 2014; Liu et al., 2020). In MLTC, each given document is associated with a set of labels which are often related statistically and semantically. Label correlations should be sufficiently utilized to build multi-label classification models with strong generalization performance (Tsoumakas et al., 2009; Gibaja and Ventura, 2015). In particular, learning the dependencies between labels might be helpful in modeling low-frequency labels, because real-world classification problems tend to exhibit a long-tail label distribution, where low-frequency labels are associated with only a few instances and are difficult to learn (Menon et al., 2020).

Previous sequence-to-sequence (Seq2Seq) based methods (Nam et al., 2017; Yang et al., 2018) have been shown to have a powerful ability to capture label correlations by using the current hidden state of the model and the prefix label predictions. However, the exposure bias phenomenon (Bengio et al., 2015) may cause such models to overfit to the frequent label sequences in the training set, which leads to several problems. First, Seq2Seq-based methods heavily rely on a predefined ordering of labels and are sensitive to the label order (Vinyals et al.; Yang et al., 2019; Qin et al., 2019), even though labels are essentially an order-independent set in the MLTC task. Second, Seq2Seq-based methods suffer from low generalization ability, since they tend to overfit the label combinations in the training set and have difficulty generating unseen label combinations. Third, Seq2Seq-based methods rely on previous, potentially incorrect, prediction results. Errors may propagate during the inference stage, where true previous target labels are unavailable and are thus replaced by labels generated by the model itself.

To circumvent the potential issues mentioned above, we introduce a multi-task learning based approach that does not rely on the Seq2Seq architecture. The approach contains a shared encoder, an MLTC task specific module, and a label correlation enhancing module. In the shared parameter layers, we introduce a joint embedding (JE) mechanism which takes advantage of a Transformer-based encoder to obtain document and label representations jointly. Correlations among labels are learned implicitly through the self-attention mechanism, which is different from previous label embedding methods (Wang et al., 2018; Xiao et al., 2019) that treat labels independently. In the MLTC task specific module, we generate the label-specific document representation with the document-label cross attention (CA) mechanism, which retains discriminatory information. The shared encoder and the MLTC task specific module form the basic model called LACO, i.e. LAbel COrrelation aware multi-label text classification.

The co-occurrence relationship among labels is one of the important signals that reflect label correlations explicitly, and it can be obtained without additional manual annotation. In the label correlation enhancing module, we propose two label co-occurrence prediction tasks, which are jointly trained with the MLTC task. The first is the Pairwise Label Co-occurrence Prediction (PLCP) task, which captures second-order label correlations by examining two-by-two label combinations and distinguishing whether they appear together in the set of relevant labels. The other is the Conditional Label Co-occurrence Prediction (CLCP) task, which captures high-order label correlations by using a given partial relevant label set to predict the relevance of the other, unknown labels.

We conduct experiments on the AAPD and RCV1-V2 datasets, and show that our method outperforms competitive baselines by a large margin. Comprehensive experimental results are provided to analyze low-frequency label performance, label dependency, label combination diversity, and coverage speed, which are essential to measure the ability of label correlation learning. We highlight our contributions as follows:

1. We propose a novel and effective approach for MLTC, which not only sufficiently learns the features of documents and labels through the joint space, but also reinforces correlations through a multi-task design without depending on the label order.

2. We propose two feasible tasks (PLCP and CLCP) to enhance the feedback of label correlations, which helps induce a multi-label predictive model with strong generalization performance.

3. We compare our approach with competitive baseline models on two multi-label classification datasets and systematically demonstrate the superiority of the proposed models.

2 Related Work

Our work mainly relates to two fields of the MLTC task: label correlation learning and document representation learning.

2.1 Label Correlation Learning

For the MLTC task, a simple but widely used method is binary relevance (BR) (Boutell et al., 2004), which decomposes the MLTC task into multiple independent binary classification problems without considering the correlations between labels. To capture label correlations, label powerset (LP) (Tsoumakas and Katakis, 2007) treats the MLTC task as a multi-class classification problem by training a classifier on all unique label combinations. The Classifier Chains (CC) based method (Read et al., 2011) exploits the chain rule and feeds predictions from previous classifiers as input. Seq2Seq architectures transform MLTC into a label sequence generation problem by encoding input text sequences and decoding labels sequentially (Nam et al., 2017). However, both CC and Seq2Seq-based methods heavily rely on a predefined ordering of labels and are sensitive to the label order. To tackle the label order dependency problem, various methods have been explored: sorting heuristically (Yang et al., 2018), dynamic programming (Liu and Tsang, 2015), reinforcement learning (Yang et al., 2019), and multi-task learning (Tsai and Lee, 2020; Zhao et al., 2020). Different from these works, our method learns label correlations through a non-Seq2Seq-based approach without suffering from the above-mentioned problems.

More recently, researchers have proposed a variety of label correlation modeling methods for MLTC that are not based on the Seq2Seq architecture. Wang et al. (2020) propose a multi-label reasoner mechanism that employs multiple rounds of predictions, relying on ensembling the multi-round results or determining a proper order, which is computationally expensive. CorNetBertXML (Xun et al., 2020) utilizes BERT (Devlin et al., 2019) to obtain a joint representation of the text and all candidate labels, and adds extra exponential linear units (ELU) at the prediction layer to make use of label correlation knowledge. Different from the above works, we exploit extra label co-occurrence prediction tasks to explicitly model the label correlations in a multi-task framework.

2.2 Document Representation Learning

Text representation plays a significant role in text classification tasks. Early models depended on extracting essential hand-crafted features (Joachims, 1998). Deep neural network based MLTC models have achieved great success, including CNNs (Kurata et al., 2016; Liu et al., 2017), RNNs (Liu et al., 2016), CNN-RNN hybrids (Chen et al., 2017; Lai et al., 2015), and attention mechanisms (Yang et al., 2016; You et al., 2018; Adhikari et al., 2019). BERT (Devlin et al., 2019) marked an important turning point in the development of text classification; it generates contextualized word vectors using Transformers. Deep learning methods have become popular because of their ability to learn sophisticated semantic representations from text, which are much richer than hand-crafted features (Guo et al., 2020). However, these methods tend to ignore the semantics of labels while focusing only on the representation of the document.

Recently, label embedding has been used to improve multi-label text classification. Liu et al. (2017) propose the first DNN-based multi-label embedding method, which seeks a deep latent space to jointly embed the instances and labels. LEAM (Wang et al., 2018) applies label embedding in text classification, obtaining each label's embedding from its corresponding text descriptions. LSAN (Xiao et al., 2019) makes use of document content and label text to learn a label-specific document representation with the aid of self-attention and label-attention mechanisms. Our work differs from these works in that we consider not only the relevance between the document and the labels but also the correlations between labels.

3 Methodology

The framework of LACO is shown in Figure 1. The lower layers are shared across all tasks, while the top layers are task-specific. In this section, we first introduce the standard formal definition of MLTC. After that, we present the detailed technical implementation of LACO.

[Figure 1: The framework of our proposed approach. Note that the shaded square in the CLCP task is the embedding of the given labels, and + / - represent related and unrelated labels, respectively.]

3.1 Problem Formulation

The multi-label task studies the classification problem where each single instance is associated with a set of labels simultaneously. Given a training set S = {(D_i, Y_i) | 1 <= i <= N} of multi-label text classification data, D_i is a text sequence and Y_i is its corresponding label set. Specifically, a text sequence D of length m is composed of word tokens D = {x_1, x_2, ..., x_m}, and Y = {y_1, y_2, ..., y_n} denotes the label space consisting of n class labels. The aim of MLTC is to learn a predictive function f : D -> 2^Y that predicts the associated label set for unseen text. For this, the model must optimize a loss function which ensures that the relevant and irrelevant labels of each training text are predicted with minimal misclassification.

3.2 Document-Label Joint Embedding (JE)

Following BERT (Devlin et al., 2019), the first token is always the [CLS] token. The output vector corresponding to the [CLS] token aggregates the features of the whole document and can be used for classification. Different from this habitual operation, we propose a novel input structure that directly uses label information in constructing the token-level representations.

As shown in Figure 1, the inputs are packed as a sequence pair (D, Y): we separate the text sequence D and the label sequence Y with a special token [SEP].
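As a concrete illustration of this input structure (a minimal sketch, not the authors' code; token names are placeholders for whatever vocabulary the encoder uses), the document tokens and the label tokens are packed into one sequence so that a Transformer encoder can attend across both:

```python
# Sketch of the document-label joint input of Section 3.2:
# [CLS] x_1 ... x_m [SEP] y_1 ... y_n [SEP]
def pack_joint_input(doc_tokens, label_tokens):
    """Pack a document and the label vocabulary into a single input sequence."""
    return ["[CLS]"] + list(doc_tokens) + ["[SEP]"] + list(label_tokens) + ["[SEP]"]

seq = pack_joint_input(["deep", "learning", "for", "text"], ["cs.CL", "cs.LG"])
# seq == ['[CLS]', 'deep', 'learning', 'for', 'text', '[SEP]', 'cs.CL', 'cs.LG', '[SEP]']
```

Because words and labels share one sequence, self-attention can model both document-label and label-label interactions, which is the point of the JE mechanism.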
Note that the label sequence is the concatenation of all label tokens. The shared layers map the inputs into a sequence of embedding vectors, one for each token, called token-level representations. Formally, let {[CLS], x_1, ..., x_m, [SEP], y_1, ..., y_n, [SEP]} be the input sequence of the encoder; we obtain the output contextualized token-level representations {h_[CLS], h_x1, ..., h_xm, h_[SEP], h_y1, ..., h_yn, h_[SEP]}.

The input structure is designed to guarantee that words and labels are embedded together in the same space. With the joint embedding mechanism, our model can pay attention to two facets: 1) the correlations between the document and the labels - different documents have different influences on a specific label, while the same document fragment may affect multiple labels; and 2) the correlations among labels - the semantic information of labels is interrelated, and label co-occurrence indicates strong semantic correlations between them.

3.3 Multi-Label Text Classification

In this subsection, we introduce the MLTC task specific module, including Document-Label Cross Attention (CA) and Label Prediction.

3.3.1 Document-Label Cross Attention (CA)

To explicitly model the semantic relationship between each word and label token, we measure the compatibility of label-word pairs via the dot product:

M = H_D^T H_Y    (1)

where H_D = [h_x1, ..., h_xm] is the text sequence embedding, H_Y = [h_y1, ..., h_yn] is the label sequence embedding, and M ∈ R^{m×n}. Considering the semantic information among consecutive words, we further generalize M through a non-linear network. Specifically, for a text fragment of length 2r+1 centered at position i, the local matrix block M_{i-r,i+r} of M measures the correlations of label-phrase pairs. To improve the effectiveness of the sparse regularization, we use a CNN with ReLU activation in the hidden layers, and perform max-pooling and a hyperbolic tangent sequentially in the function φ(·):

c = φ(M_{i-r,i+r}) H_D    (2)

Note that the final document representation c is generated by aggregating the word representations H_D, weighted by the label-specific attention vector φ(·).

3.3.2 Label Prediction

Once we have the discriminative document representation, we build the multi-label text classifier via a fully connected layer that captures more fine-grained features from different regions of the document:

p = sigmoid(W_1 c + b_1)    (3)

where W_1 ∈ R^{n×k} and b_1 ∈ R^n. We use binary cross-entropy as the loss function for the multi-label classification problem:

L_mlc = - Σ_{i=1}^{n} [q_i ln p_i + (1 - q_i) ln(1 - p_i)]    (4)

where p_i = P(y_i | D) is the probability of y_i predicted by the model, and q_i ∈ {0, 1} is the ground-truth category information of y_i. We train the model by minimizing this cross-entropy error.

3.4 Multi-Task Learning with Label Correlations

In this subsection, we introduce two auxiliary tasks, Pairwise Label Co-occurrence Prediction (PLCP) and Conditional Label Co-occurrence Prediction (CLCP), to explore second-order and high-order label relationships, respectively.

3.4.1 PLCP Task

Suppose that each document D has a corresponding (relevant) label set Y+ and a non-corresponding label set Y-. To train the model to understand second-order label relationships, we propose a binarized label-pair prediction task named PLCP, whose training data can be trivially generated from the multi-label classification corpus. The strategy for selecting label pairs for co-occurrence prediction is straightforward: one part is sampled only from Y+ and marked IsCo-occur, while the other part is sampled from Y+ and Y- respectively and marked NotCo-occur. To construct the training dataset, we empirically fix the ratio of IsCo-occur to NotCo-occur pairs. As Figure 1 shows, we concatenate the embeddings of the two labels, [h_yi; h_yj], as the input features. An additional binary classifier is used to predict whether the state of the two labels is IsCo-occur or NotCo-occur. The loss function is as follows:

L_plcp = - Σ_{(i,j)} [q_ij ln p_ij + (1 - q_ij) ln(1 - p_ij)]    (5)

where p_ij = p(y_j | D, y_i) denotes the output probability of co-occurrence of the label pair, and q_ij is the ground truth: q_ij = 1 means IsCo-occur and q_ij = 0 means NotCo-occur.

3.4.2 CLCP Task

To further learn high-order label relationships, we propose the conditional label co-occurrence prediction (CLCP) task. We first randomly pick s labels from Y+ to form Y^G, i.e. Y^G ⊆ Y+, and then predict whether the remaining labels of Y are relevant to them. Specifically, we introduce an additional position vector E_Y = [e_y1, ..., e_yn], where e_yi = 0 indicates that the y_i at that position is a sampled label, i.e. y_i ∈ Y^G, and e_yi = 1 indicates y_i ∈ Y - Y^G. The average of the embeddings of the zero-position labels, h_Y^G, is concatenated to each nonzero-position label embedding as the input features to predict whether each remaining label should co-occur with the known sampled labels. In Figure 1, p(y_i | D, Y^G) denotes the probability of y_i predicted by the additional sigmoid classifier. The loss for this classification is the sum of the binary cross-entropy losses over the nonzero positions:

L_clcp = - Σ_{i=1}^{n-s} [q_i ln p_i + (1 - q_i) ln(1 - p_i)]    (6)

where q_i ∈ {0, 1} is the ground truth denoting whether the label y_i should co-occur with Y^G, and p_i = p(y_i | D, Y^G) is the output probability for each masked label y_i.

3.4.3 Training Objectives

The same inputs are first fed into the shared layers; each sub-task module then takes the contextualized token-level representations generated by the joint embedding and produces a probability distribution over its own target labels. The overall loss is calculated as:

L = L_mlc + λ L_plcp + (1 - λ) L_clcp    (7)

where λ is a hyperparameter in (0, 1), and L_plcp and L_clcp are the task-specific cross-entropy losses for the PLCP task and the CLCP task, respectively.

4 Experimental Setup

4.1 Datasets

We validate our proposed model on two multi-label text classification datasets. The Arxiv Academic Paper Dataset (AAPD) (Yang et al., 2018) collects 55,840 abstracts of papers in the field of computer science, organized into 54 related topics; each paper is assigned multiple topics. Reuters Corpus Volume I (RCV1-V2) (Lewis et al., 2004) is composed of 804,414 manually categorized newswire stories for research purposes; each story can be assigned multiple topics, and there are 103 topics in total. Table 1 shows the statistics of the datasets. Each dataset is divided into a training set, a validation set, and a test set; we follow the division used by Yang et al. (2018).

Table 1: Statistics of datasets. |D| and |Y| denote the total numbers of documents and labels; avg. len is the average length of all documents; avg. |Y| is the average number of labels associated with a document.

Dataset   | |D|     | |Y| | avg. len | avg. |Y|
AAPD      | 55,840  | 54  | 163.42   | 2.41
RCV1-V2   | 804,414 | 103 | 123.94   | 3.24

4.2 Evaluation Metrics

Multi-label classification can be evaluated with a group of metrics that capture different aspects of the task (Zhang and Zhou, 2014). Following previous works (Yang et al., 2018; Tsai and Lee, 2020), we adopt hamming loss and Micro/Macro-F1 scores as our main evaluation metrics; Micro/Macro-Precision and Micro/Macro-Recall are also reported to assist analysis. A macro-average treats all labels equally, whereas a micro-average weights each label by its frequency.

4.3 Comparing Algorithms

We adopt a variety of methods as baselines, which can be divided into two groups according to whether label correlations are considered.

The first group of approaches does not consider label correlations. Binary Relevance (BR) (Boutell et al., 2004) independently trains one binary classifier (a linear SVM) for each label. CNN (Kim, 2014) utilizes multiple convolution kernels to extract text features and then outputs the probability distribution over the label space. LEAM (Wang et al., 2018) involves label embedding to obtain a more discriminative text representation in text classification. LSAN (Xiao et al., 2019) learns a label-specific text representation with the help of attention mechanisms. We also implement a BERT (Devlin et al., 2019) classifier, which first encodes a document into a vector space and then outputs the probability for each label independently.

The second group of methods considers label correlations. Classifier Chains (CC) (Read et al., 2011) transforms the MLTC problem into a chain of binary classification problems. SGM (Yang et al., 2018) proposes a Seq2Seq model with a global embedding mechanism to capture label correlations. Seq2Set (Yang et al., 2019) applies deep reinforcement learning to improve the performance of the Seq2Seq model. We also implement a Seq2Seq baseline with a 12-layer Transformer, named Seq2Seq_Bert.

Table 2: Predictive performance of each comparing algorithm on the two datasets. Hamming Loss (HL), Micro (Mi-) and Macro (Ma-) averaged Precision (P), Recall (R), and F1-score (F1) are used as evaluation metrics. For HL, lower is better; for the other metrics, higher is better. Models with † have results quoted from previous papers; models with ‖ are Seq2Seq-based models.

                              AAPD                                          RCV1-V2
Algorithm                     HL      Mi-P/R/F1        Ma-P/R/F1        |  HL      Mi-P/R/F1        Ma-P/R/F1
BR (Boutell et al., 2004)     0.0316  64.4/64.8/64.6   -                |  0.0086  90.4/81.6/85.8   -
CNN (Kim, 2014)               0.0256  84.9/54.5/66.4   -                |  0.0089  92.2/79.8/85.5   -
LEAM (Wang et al., 2018)      0.0261  76.5/59.6/67.0   52.4/40.3/45.6   |  0.0090  87.1/84.1/85.6   69.5/65.8/67.6
LSAN (Xiao et al., 2019)      0.0242  77.7/64.6/70.6   67.6/47.2/53.5   |  0.0075  91.3/84.1/87.5   74.9/65.0/68.4
BERT (Devlin et al., 2019)    0.0224  78.6/68.7/73.4   68.7/52.1/57.2   |  0.0073  92.7/83.2/87.7   77.3/61.9/66.7
CC (Read et al., 2011)        0.0306  65.7/65.1/65.4   -                |  0.0087  88.7/82.8/85.7   -
SGM (Yang et al., 2018)†‖     0.0251  74.6/65.9/69.9   -                |  0.0081  88.7/85.0/86.9   -
Seq2Set (Yang et al., 2019)†‖ 0.0247  73.9/67.4/70.5   -                |  0.0073  90.0/85.8/87.9   -
OCD (Tsai and Lee, 2020)†‖    -       - / - /72.0      - / - /58.5      |  -       -                -
ML-R (Wang et al., 2020)      0.0248  72.6/71.8/72.2   -                |  0.0079  89.0/85.2/87.1   -
Seq2Seq_Bert (Nam et al.)‖    0.0275  69.8/68.2/69.0   56.2/53.7/54.0   |  0.0074  88.5/87.4/87.9   69.8/65.5/66.1
SeqTag_Bert                   0.0238  74.3/71.5/72.9   61.5/57.5/58.5   |  0.0073  90.6/84.9/87.7   73.7/66.7/68.7
LACO                          0.0213  80.2/69.6/74.5   70.4/54.0/59.1   |  0.0072  90.8/85.6/88.1   75.9/66.6/69.2
LACO+plcp                     0.0212  79.5/70.8/74.9   68.4/55.8/59.9   |  0.0070  90.8/86.2/88.4   76.1/66.5/69.2
LACO+clcp                     0.0215  78.9/70.8/74.7   71.9/56.6/61.2   |  0.0070  90.6/86.4/88.5   77.6/71.5/73.1
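For concreteness, the hamming loss and micro/macro-averaged F1 reported above can be computed as in the following sketch (an illustrative implementation of the standard definitions, not code from the paper), for binary indicator matrices y_true and y_pred of shape (num_docs, num_labels):

```python
# Hamming loss: fraction of label decisions that are wrong.
def hamming_loss(y_true, y_pred):
    n = len(y_true) * len(y_true[0])
    return sum(t != p
               for row_t, row_p in zip(y_true, y_pred)
               for t, p in zip(row_t, row_p)) / n

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Micro-F1 pools the counts over all labels (weighting labels by frequency);
# Macro-F1 averages the per-label F1 scores (treating all labels equally).
def micro_macro_f1(y_true, y_pred):
    num_labels = len(y_true[0])
    tps, fps, fns = [0] * num_labels, [0] * num_labels, [0] * num_labels
    for row_t, row_p in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(row_t, row_p)):
            tps[j] += t * p
            fps[j] += (1 - t) * p
            fns[j] += t * (1 - p)
    micro = f1(sum(tps), sum(fps), sum(fns))
    macro = sum(f1(tps[j], fps[j], fns[j]) for j in range(num_labels)) / num_labels
    return micro, macro
```

This makes the macro/micro contrast in Section 4.2 explicit: a rare label with F1 = 0 pulls Macro-F1 down by 1/n, while its effect on Micro-F1 is proportional to its few instances.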
More recently, OCD (Tsai and Lee, 2020) proposes a framework with one encoder and two decoders for MLTC to alleviate exposure bias. ML-Reasoner (Wang et al., 2020) employs a binary classifier to predict all labels simultaneously and applies a novel iterative reasoning mechanism. Besides, we provide another strong baseline, SeqTag_Bert, which transforms the multi-label classification task into a sequential tagging task: it first obtains the embedding of each label (H_Y in Sec. 3.3) from our shared encoder and then outputs a probability for each label sequentially with a BiLSTM-CRF model (Huang et al., 2015).

The results of BR, CNN, CC, SGM, Seq2Set, OCD, and ML-R are quoted from previous papers; the results of the other baselines are implemented by us. All algorithms follow the same data division.

4.4 Experimental Setting

We implement our model in TensorFlow and run it on an NVIDIA Tesla P40. We fine-tune models on the English base-uncased version of BERT (https://github.com/google-research/bert). The batch size is 32, and the maximum total input sequence length is 320. The window size of the additional layer is 10, and we set λ to 0.5. We use Adam (Kingma and Ba, 2015) with a learning rate of 5e-5, and train the models by monitoring the Micro-F1 score on the validation set, stopping training if there is no increase for 50,000 consecutive steps.

5 Results and Analysis

In this section, we report the main experimental results of the baseline models and the proposed method on the two text datasets. Besides, we analyze performance on labels of different frequencies, and further evaluate whether our method effectively learns label correlations through label-pair confidence distribution learning and label combination prediction. Finally, we give a detailed analysis of the convergence study, which demonstrates the generalization ability of our method.

5.1 Experiment Results

We report the experimental results of all comparing algorithms on the two datasets in Table 2. The first block includes methods that do not learn label correlations, the second block contains methods that consider label correlations, and the third block contains our proposed LACO methods. As shown in Table 2, the LACO-based models outperform all baselines by a large margin on the main evaluation metrics. The following observations can be made from the results.

Our basic model LACO, trained only on the MLTC task, significantly improves previous results on hamming loss and Micro-F1. Specifically, on the AAPD dataset, compared to Seq2Set, which models label correlations, our basic model reduces hamming loss by 13.8% and improves Micro-F1 by 5.67%. Compared with a label embedding method such as LSAN, LACO achieves a 4.00% reduction in hamming loss and a 0.69% improvement in Micro-F1 on the RCV1-V2 dataset. Also, BERT is still a strong baseline, which shows that obtaining a high-quality discriminative document representation is important for the MLTC task. Here, we train LACO with 3 random seeds and calculate the mean and standard deviation, and we perform significance tests comparing LACO with the two strong baselines BERT and SeqTag_Bert in Table 3. Against both strong baselines, all P-values of LACO are below the threshold (p < 0.05), suggesting that the performance difference is statistically significant. In addition, we apply the Friedman test (Demšar, 2006) to the hamming loss and Micro-F1 metrics. The Friedman statistic F is 7.875 for hamming loss and 6.125 for Micro-F1, while the corresponding critical value is 2.8179 (number of comparing algorithms k = 12, number of datasets N = 2). As a result, the null hypothesis of indistinguishable performance among the compared algorithms is clearly rejected at the 0.05 significance level.

Table 3: Statistical analysis results: P-values of significance tests comparing LACO with the two strong baselines BERT and SeqTag_Bert.

              AAPD                  RCV1-V2
Model         HL        Mi-F1      HL        Mi-F1
BERT          9.39e-09  3.80e-10   4.95e-04  3.67e-08
SeqTag_Bert   7.76e-16  1.86e-07   4.95e-04  3.67e-08

Compared with SGM, Seq2Seq_Bert does not achieve significant improvements, but SeqTag_Bert shows good performance based on the Transformer encoder shared between document and labels. Notably, the Micro-F1 of SeqTag_Bert is comparable to BERT, but its Macro-F1 is observably higher. This illustrates that label correlation information is especially important for learning low-frequency labels.

As for the results of the multi-task learning methods, the two subtasks introduced by our method each bring a certain degree of improvement on the main metrics of the two datasets. Specifically, we observe that the PLCP task performs better on AAPD, with the best Micro-F1 score of 74.9, while the CLCP task gives the best Micro-F1 on RCV1-V2, at 88.5. Furthermore, the proposed multi-task framework shows great improvements over the basic LACO model on Macro-F1, which demonstrates that performance on low-frequency labels can be greatly improved through our label-correlation-guided subtasks; more detailed analyses are given in Sections 5.3 and 5.5. Notably, the CLCP task performs better on Macro-F1 by considering high-order correlations. We also ran the experiment using the losses of all three tasks together; since the two auxiliary tasks have a similar goal, their combination cannot further improve performance compared to LACO+plcp or LACO+clcp, which we attribute to the strong relevance between the two tasks.

5.2 Ablation Study

In this section, we demonstrate the effectiveness of the two cores of the proposed LACO model: the document-label joint embedding (JE) mechanism and the document-label cross attention (CA) mechanism.

Table 4: Ablation over the proposed joint embedding (JE) and cross attention (CA) mechanisms using the LACO model on the AAPD and RCV1-V2 datasets.

              AAPD                      RCV1-V2
Model         HL      Mi-F1  Ma-F1     HL      Mi-F1  Ma-F1
LACO          0.0213  74.5   59.1      0.0072  88.1   69.2
w/o JE        0.0237  72.6   57.7      0.0077  87.5   68.4
w/o CA        0.0220  73.5   58.4      0.0073  87.8   68.5
w/o JE & CA   0.0224  73.4   57.2      0.0073  87.7   66.7

Note that the w/o JE & CA setting is equivalent to the BERT baseline in Table 2, which encodes the document only and predicts the probability for each label based on [CLS].
In the w/o JE setting, the document embedding is encoded by BERT while each label embedding is a learnable, randomly initialized vector; the label prediction layer is the same as LACO's. In the w/o CA setting, document and label embeddings are obtained jointly by BERT, and the probability for each label is predicted based on [CLS]. Table 4 shows that JE and CA are both important for obtaining a more discriminative text representation. After removing the JE and CA mechanisms, performance drops more on the AAPD dataset than on RCV1-V2. We believe this is mainly due to the smaller number of training instances in AAPD, which makes relevant features more difficult to learn, especially for low-frequency labels.

5.3 Low-frequency Label Performance

Figure 2(a) illustrates the label frequency distribution on the AAPD training set, which is a typical big-head, long-tail distribution. We divide all labels into four groups according to their frequency: the big-head group (Group 1), the high-frequency group (Group 2), the middle-frequency group (Group 3), and the low-frequency group (Group 4). As shown in Figure 2(b), the performance of all methods decreases with the frequency of label occurrence. The performance gap between Seq2Seq_Bert and the LACO-based methods increases as the frequency decreases; in particular, on Group 4, LACO+clcp achieves a 74.5% improvement over the Seq2Seq_Bert model, which demonstrates that performance on low-frequency labels can be enhanced by the conditional label co-occurrence prediction task.

[Figure 2: Label classification performance on different frequency distributions. Subfigure (a) shows the frequency of each label on the AAPD training set; subfigure (b) illustrates the Macro-F1 performance of the different methods in the four groups.]

5.4 Label Correlation Analysis

The co-occurrence relationship between labels is one of the important aspects that reflect label correlation. In this experiment, we utilize the conditional probability p(y_b | y_a) between labels y_a and y_b to represent their dependency quantitatively. Furthermore, we calculate the conditional Kullback-Leibler divergence of p(y_b | y_a) to measure the "distance" between the model prediction distribution P^p and the ground-truth distribution of the training/testing set P^g. The score is calculated as:

KL(P^g || P^p) = Σ_{y_a, y_b ∈ Y} p^g(y_b | y_a) log( p^g(y_b | y_a) / p^p(y_b | y_a) )
p(y_b | y_a) = #(y_a, y_b) / #(y_a)    (8)

where # denotes the count of a single label or a label combination in the training/testing dataset.

The KL-distances on the AAPD and RCV1-V2 datasets are shown in Table 5. On the testing-set setting, we find that LACO has much better fitting ability for the dependency relationships between labels, especially after introducing the co-occurrence relationship prediction tasks. The Seq2Seq_Bert model achieves the lowest KL-distance to the training set on both AAPD and RCV1-V2 but larger scores on the test set, which further shows that the Seq2Seq-based model is prone to over-fitting label pairs during training. It should be emphasized that this KL-distance only quantifies how much inter-dependence between label pairs a model has learned; it cannot directly measure the prediction accuracy of the model.

Table 5: KL(P^g || P^p) for different models on the AAPD and RCV1-V2 datasets. P^g is the ground-truth distribution of the dataset and P^p is the model distribution; smaller scores indicate that the two distributions are closer.

               AAPD             RCV1-V2
Model          train   test     train   test
Seq2Seq_Bert   1.27    1.30     0.08    0.94
SeqTag_Bert    1.40    1.28     0.09    0.95
LACO           1.40    1.27     0.09    0.94
LACO+plcp      1.35    1.28     0.08    0.76
LACO+clcp      1.32    1.10     0.08    0.91

5.5 Label Combination Diversity Analysis

Table 6 shows the number of different predicted label combinations (C_Test) and the subset accuracy (Acc), a strict metric that gives the percentage of samples whose labels are all classified correctly. Seq2Seq_Bert produces fewer kinds of label combinations on the two datasets: as Seq2Seq models tend to "remember" label combinations, the generated label sets are most alike, indicating poor generalization to unseen label combinations. Because Seq2Seq_Bert is conservative and only generates label combinations it has seen in the training set, it achieves high Acc values, especially on RCV1-V2.

Table 6: Statistics on the number of label combinations. C_Test is the number of different predicted label combinations; Acc is the subset accuracy on the testing set.

               AAPD              RCV1-V2
Model          C_Test  Acc      C_Test  Acc
Ground Truth   392     1.000    278     1.000
Seq2Seq_Bert   214     0.392    87      0.669
OCD            302     0.403    -       -
SeqTag_Bert    289     0.410    187     0.637
LACO           315     0.425    241     0.642
LACO+plcp      320     0.439    241     0.644
LACO+clcp      321     0.427    239     0.660

6 Conclusion

Experimental results show that our method outperforms competitive baselines by a large margin. Detailed analyses show the effectiveness of our proposed architecture, which uses the semantic connections between document-label and label-label pairs to obtain a discriminative text representation. Furthermore, the multi-task framework shows strong capability on low-frequency label prediction and label correlation learning. For Extreme Multi-label Text Classification, which involves an extremely large label set, LACO could be further extended through scheduled label sampling, a hierarchical label embedding strategy, and so on.
We hope that further research could get clues from our work. Acknowledgements We would like to thank the ACL reviewers for their valuable comments and Keqing He, Haoyan Liu, (a) Covergence speed of AAPD (b) Covergence speed of RCV1-V2 Zizhen Wang, Chenyang Liao and Rui Pan for their Figure 3: The convergence speed of five BERT-based generous help and discussion. methods. The x-axis refers to the training steps, and the y-axis refers to the Micro-F1 score performance. References Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and RCV1-V2 dataset. For our models, they produce Jimmy Lin. 2019. Docbert: Bert for document clas- more diverse label combinations while obtaining sification. arXiv:1904.08398. good Acc since we do not regard multi-label clas- Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and sification as a sequence generation task that uses a Noam Shazeer. 2015. Scheduled sampling for se- decoder to model the relationship between labels. quence prediction with recurrent neural networks. Instead, we learn the correlations among labels on In Proceedings of the 28th International Conference the encoding side, and the scoring between labels on Neural Information Processing Systems-Volume 1, pages 1171–1179. does not interfere with each other, which leads to a higher probability of generating label combinations Matthew R Boutella, Jiebo Luob, Xipeng Shena, and not seen during training than the Seq2Seq-based Christopher M Browna. 2004. Learning multi-label scene classification. Pattern Recognition, 37:1757– models. 5.6 Coverage Speed Guibin Chen, Deheng Ye, Zhenchang Xing, Jieshan The convergence speed of five BERT-based models Chen, and Erik Cambria. 2017. Ensemble applica- tion of convolutional and recurrent neural networks are shown in Figure 3. Our basic model LACO for multi-label text categorization. 
In 2017 interna- outperforms other BERT-based models in terms of tional joint conference on neural networks (IJCNN), convergence speed, and the proposed multi-task pages 2377–2383. IEEE. mechanisms are able to enhance L ACO to converge Janez Demsar ˇ . 2006. Statistical comparisons of classi- much faster. The main reason might be that the fiers over multiple data sets. The Journal of Machine feature exchanging through multi-tasks accelerates Learning Research, 7:1–30. the model to learn a more robust and common rep- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and resentation. Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language under- 6 Conclusions and Future Work standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for In this paper, we propose a new method for MLTC Computational Linguistics: Human Language Tech- based on document-label joint embedding and cor- nologies, Volume 1 (Long and Short Papers), pages relation aware multi-task learning. Experimental 4171–4186. Eva Gibaja and Sebastian ´ Ventura. 2015. A tutorial Weiwei Liu and Ivor W Tsang. 2015. On the optimal- on multilabel learning. ACM Computing Surveys ity of classifier chain for multi-label classification. (CSUR), 47(3):1–38. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pages 712–720. Biyang Guo, Songqiao Han, Xiao Han, Hailiang Huang, and Ting Lu. 2020. Label confusion Aditya Krishna Menon, Sadeep Jayasumana, learning to enhance text classification models. Ankit Singh Rawat, Himanshu Jain, Andreas arXiv:2012.04987. Veit, and Sanjiv Kumar. 2020. Long-tail learning via logit adjustment. arXiv:2007.07314. Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidi- rectional lstm-crf models for sequence tagging. Jinseok Nam, Eneldo Loza Menc´ ıa, Hyunwoo J Kim, arXiv:1508.01991. and Johannes Furnkranz. ¨ 2017. 
Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Advances in Neural Information Processing Systems, pages 5413–5423.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 521–526.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397.

Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2873–2879.

Weiwei Liu, Xiaobo Shen, Haobo Wang, and Ivor W. Tsang. 2020. The emerging trends of multi-label learning. arXiv:2011.11197.

Kechen Qin, Cheng Li, Virgil Pavlu, and Javed Aslam. 2019. Adapting RNN sequence prediction model to multi-label set prediction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3181–3190.

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification. Machine Learning, 85(3):333.

Che-Ping Tsai and Hung-Yi Lee. 2020. Order-free learning alleviating exposure bias in multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6038–.

Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. 2009. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets.

Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2321–2331.

Ran Wang, Robert Ridley, Weiguang Qu, Xinyu Dai, et al. 2020. A novel reasoning mechanism for multi-label text classification. Information Processing & Management, 58(2):102441.

Lin Xiao, Xin Huang, Boli Chen, and Liping Jing. 2019. Label-specific document representation for multi-label text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 466–475.

Guangxu Xun, Kishlay Jha, Jianhui Sun, and Aidong Zhang. 2020. Correlation networks for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1074–.

Pengcheng Yang, Fuli Luo, Shuming Ma, Junyang Lin, and Xu Sun. 2019. A deep reinforced sequence-to-set model for multi-label classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5252–5258.

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence generation model for multi-label classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, and Shanfeng Zhu. 2018. AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. arXiv:1811.01727.

Min-Ling Zhang and Zhi-Hua Zhou. 2014. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837.

Wei Zhao, Hui Gao, Shuhui Chen, and Nan Wang. 2020. Generative multi-task learning for text classification. IEEE Access, 8:86380–86387.
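The label-correlation statistics used in Sections 5.4 and 5.5 (the conditional co-occurrence distribution of Eq. 8, the subset accuracy Acc, and the count of distinct predicted combinations C_Test) can be reproduced directly from raw label sets. Below is a minimal sketch; the function names are our own, and the `eps` smoothing of empty counts is our assumption, since the paper does not state how zero counts are handled.

```python
import math
from collections import Counter
from itertools import permutations

def conditional_dist(label_sets, labels, eps=1e-9):
    """Estimate p(y_b | y_a) = #(y_b, y_a) / #(y_a) from a list of label sets."""
    single, pair = Counter(), Counter()
    for s in label_sets:
        single.update(s)                    # count single labels
        pair.update(permutations(s, 2))     # count ordered pairs (y_a, y_b)
    return {(a, b): (pair[(a, b)] + eps) / (single[a] + eps)
            for a in labels for b in labels if a != b}

def conditional_kl(gt_sets, pred_sets, labels):
    """KL(P^g || P^p) over conditional co-occurrence distributions (Eq. 8)."""
    pg = conditional_dist(gt_sets, labels)
    pp = conditional_dist(pred_sets, labels)
    return sum(p * math.log(p / pp[k]) for k, p in pg.items() if p > 0)

def subset_accuracy(gt_sets, pred_sets):
    """Fraction of samples whose predicted label set matches exactly (Acc)."""
    return sum(g == p for g, p in zip(gt_sets, pred_sets)) / len(gt_sets)

def num_combinations(pred_sets):
    """Number of distinct predicted label combinations (C_Test)."""
    return len({frozenset(s) for s in pred_sets})
```

Passing the ground-truth sets as both arguments gives a KL distance of zero, matching the paper's reading that smaller scores mean the two distributions are closer.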

Enhancing Label Correlation Feedback in Multi-Label Text Classification via Multi-Task Learning

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021Jan 1, 2021



DOI
10.18653/v1/2021.findings-acl.101

Abstract

Enhancing Label Correlation Feedback in Multi-Label Text Classification via Multi-Task Learning 1y 2 2 1 2 Ximing Zhang , Qian-Wen Zhang , Zhao Yan , Ruifang Liu , Yunbo Cao Beijing University of Posts and Telecommunications, Beijing 100876, China Tencent Cloud Xiaowei, Beijing 100080, China ximingzhang@bupt.edu.cn, cowenzhang@tencent.com, zhaoyan@tencent.com, lrf@bupt.edu.cn, yunbocao@tencent.com Abstract set of labels which are often related statistically and semantically. Label correlations should be In multi-label text classification (MLTC), each sufficiently utilized to build multi-label classifi- given document is associated with a set of cation models with strong generalization perfor- correlated labels. To capture label correla- mance (Tsoumakas et al., 2009; Gibaja and Ven- tions, previous classifier-chain and sequence- tura, 2015). In particular, learning the dependen- to-sequence models transform MLTC to a se- cies between labels might be helpful in modeling quence prediction task. However, they tend the low-frequency labels, because real-world clas- to suffer from label order dependency, la- bel combination over-fitting and error prop- sification problems tend to exhibit long-tail label agation problems. To address these prob- distribution, where low-frequency labels are asso- lems, we introduce a novel approach with ciated with only a few instances and are difficult to multi-task learning to enhance label correla- learn (Menon et al., 2020). tion feedback. We first utilize a joint em- Previous sequence-to-sequence (Seq2Seq) based bedding (JE) mechanism to obtain the text methods (Nam et al., 2017; Yang et al., 2018) have and label representation simultaneously. In been shown to have a powerful ability to capture la- MLTC task, a document-label cross atten- tion (CA) mechanism is adopted to gener- bel correlations with using the current hidden state ate a more discriminative document represen- of the model and the prefix label predictions. How- tation. 
Furthermore, we propose two auxil- ever, exposure bias phenomenon (Bengio et al., iary label co-occurrence prediction tasks to en- 2015) may cause the models overfit to the frequent hance label correlation learning: 1) Pairwise label sequence in training set, thus lead to several Label Co-occurrence Prediction (PLCP), and problems. First, Seq2Seq-based methods heavily 2) Conditional Label Co-occurrence Predic- rely on a predefined ordering of labels and perform tion (CLCP). Experimental results on AAPD and RCV1-V2 datasets show that our method sensitively to the label order (Vinyals et al.; Yang outperforms competitive baselines by a large et al., 2019; Qin et al., 2019). Actually, labels are margin. We analyze low-frequency label per- essentially an order-independent set in the MLTC formance, label dependency, label combina- task. Second, the Seq2Seq-based methods suffer tion diversity and coverage speed to show the from low generalization ability problem since they effectiveness of our proposed method on label tend to overfit the label combinations in the train- correlation learning. Our code is available at ing set and have difficulty to generate the unseen https://github.com/EiraZhang/LACO. label combination. Third, Seq2Seq-based methods rely on the previous potentially incorrect predic- 1 Introduction tion results. The errors may propagate during the Multi-label text classification (MLTC) is an impor- inference stage where true previous target labels tant natural language processing task with applica- are unavailable and are thus replaced by labels gen- tions in text categorization, information retrieval, erated by the model itself. web mining, and many other real-world scenar- To circumvent the potential issues mentioned ios (Zhang and Zhou, 2014; Liu et al., 2020). In above, we introduce a multi-task learning based MLTC, each given document is associated with a approach that does not rely on Seq2Seq architec- ture. 
The approach contains a shared encoder, a Equal contribution. Work done during an internship at Tencent. MLTC task specific module and a label correla- Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1190–1200 August 1–6, 2021. ©2021 Association for Computational Linguistics tion enhancing module. In the shared parameter the multi-label predictive model with strong layers, we introduce a joint embedding (JE) mecha- generalization performance. nism which takes advantage of a transformer-based 3. We compare our approach with competitive encoder to obtain document and label representa- baseline models on two multi-label classifica- tion jointly. Correlations among labels are learned tion datasets and systematically demonstrate implicitly through the self-attention mechanism, the superiority of the proposed models. which is different from previous label embedding methods (Wang et al., 2018; Xiao et al., 2019) that 2 Related Work treat labels independently. In MLTC task specific module, we generate the label-specific document Our work mainly relates to two fields of MLTC representation by the document-label cross atten- task: label correlation learning and document rep- tion (CA) mechanism, which retains discriminatory resentation learning. information. The shared encoder and the MLTC task specific module form the basic model called 2.1 Label Correlation Learning LACO, i.e. LAbel COrrelation aware multi-label For MLTC task, a simple but widely used method text classification. is binary relevance (BR) (Boutella et al., 2004), The co-occurrence relationship among labels which decomposes the MLC task into multiple in- is one of the important signal that can reflect la- dependent binary classification problem without bel correlations explicitly, which can be obtained considering the correlations between labels. without additional manual annotation. 
In label To capture label correlations, label powerset (LP) correlation enhancing module, we propose two (Tsoumakas and Katakis, 2007) take MLTC task as label co-occurrence prediction tasks, which are a multi-class classification problem by training a jointly trained with the MLTC task. The one is the classifier on all unique label combinations. Classi- Pairwise Label Co-occurrence Prediction (PLCP) fier Chains (CC) based method (Read et al., 2011) task for capturing second-order label correlations exploits the chain rule and predictions from the through the two-by-two combinations to distin- previous classifiers as input. Seq2Seq architectures guish whether they appear together in the set of are proposed to transform MLTC into a label se- relevant labels. The other one is the Conditional quence generation problem by encoding input text Label Co-occurrence Prediction (CLCP) task for sequences and decoding labels sequentially (Nam capturing high-order label correlations through a et al., 2017). However, both CC and Seq2Seq- given partial relevant label set to predict the rele- based methods heavily rely on a predefined order- vance of other unknown labels. ing of labels and perform sensitively to the label or- We conduct experiments on AAPD and RCV1- der. To tackle the label order dependency problem, V2 datasets, and show that our method outperforms various methods have been explored: by sorting competitive baselines by a large margin. Compre- heuristically (Yang et al., 2018), by dynamic pro- hensive experimental results are provided to analy- gramming (Liu and Tsang, 2015), by reinforcement sis low-frequency label performance, label depen- learning (Yang et al., 2019), by multi-task learning dency, label combination diversity and coverage (Tsai and Lee, 2020; Zhao et al., 2020). Different speed, which are essential to measure the ability of from these works, our method learns the label cor- label correlation learning. 
We highlight our contri- relations through a non-Seq2Seq-based approach butions as follows: without suffering the above mentioned problems. More recently, researchers have proposed a va- 1. We propose a novel and effective approach for riety of label correlation modeling methods for MLTC, which not only sufficiently learns the MLTC that are not based on Seq2Seq architecture. features of documents and labels through the Wang et al. (2020) propose a multi-label reasoner joint space, but also reinforces correlations mechanism that employs multiple rounds of pre- through multi-task design without depending dictions, and relies on predicting multiple rounds on the label order. of results to ensemble or determine a proper or- 2. We propose two feasible tasks (PLCP and der, which is computationally expensive. CorNet- CLCP) to enhance the feedback of label cor- BertXML (Xun et al., 2020) utilizes BERT (Devlin relations, which is beneficial to help induce et al., 2019) to obtain the joint representation of 1191 ... ... ... MLTC Task PLCP Task CLCP Task text and all candidate labels and extra exponen- " " p(# |%), … , p(# |%) p(# |%, # ) ! " p y #, % , p y #, % , … # $ ! # tial linear units (ELU) at the prediction layer to Task make use of label correlation knowledge. Differ- specific "⃑ +/- h h h h + y y y y 2 3 n layers h h … ℎ !ℎ ! ℎ ! ℎ ! y y i j ! ! ! ! ent from the above works, we exploit extra label co-occurrence prediction tasks to explicitly model the label correlations in a multi-task framework. ... ... h h h h h h h h h h h h [CLS] x x x x4 xm [SEP ] y y y y [SEP ] 1 2 3 1 2 3 n Transformer Layer x K 2.2 Document Representation Learning Shared Doc-Label Label-Label layers Text representation plays a significant role in text ... ... classification tasks. It is crucial to extract essential ... ... [CLS] [SEP] y y y [SEP] x x x x x y 1 2 n 1 2 3 4 m 3 hand-crafted features for early models (Joachims, Inputs Document (D) Labels(Y) 1998). 
Deep neural network based MLTC models have achieved great success such as CNN (Kurata Figure 1: The framework of our proposed approach. et al., 2016; Liu et al., 2017), RNN (Liu et al., Note that the shaded square in the CLCP task is the 2016), CNN-RNN (Chen et al., 2017; Lai et al., embedding of given labels, and +, represent related 2015), attention mechanism (Yang et al., 2016; You label and unrelated label respectively. et al., 2018; Adhikari et al., 2019) and etc. (De- vlin et al., 2019) is an important turning point in 3.1 Problem Formulation the development of text classification task and it works by generating contextualized word vectors Multi-label task studies the classification problem using Transformer. The reason why deep learning where each single instance is associated with a methods have become so popular is their ability to set of labels simultaneously. Given a training set learn sophisticated semantic representations from S = f(D ; Y )j1  i  Ng of multi-label text text, which are much richer than hand-crafted fea- classification data, D is the text sequence and Y tures(Guo et al., 2020). However, these methods is its corresponding labels. Specifically, a text se- tend to ignore the semantics of labels while focus- quence D of length m is composed of word tokens ing only on the representation of the document. D = fx ; x ; :::; x g, and Y = fy ; y ; :::; y g 1 2 m 1 2 n Recently, label embedding is considered to im- denote the label space consisting of n class labels. prove multi-label text classification tasks. (Liu The aim of MLTC is to learn a predictive function et al., 2017) is the first DNN-based multi-label em- f : D ! 2 to predict the associated label set for the unseen text. For such, the model must optimize bedding method that seeks a deep latent space to jointly embed the instances and labels. 
LEAM a loss function which ensures that the relevant and irrelevant labels of each training text are predicted (Wang et al., 2018) applies label embedding in text classification, which obtains each label’s embed- with minimal misclassification. ding by its corresponding text descriptions. LSAN 3.2 Document-Label Joint Embedding (JE) (Xiao et al., 2019) makes use of document content and label text to learn the label-specific document Following BERT (Devlin et al., 2019), the first representation with the aid of self-attention and token is always the [CLS] token. The output vec- label-attention mechanisms. Our work differs from tor corresponding to the [CLS] token aggregates these works in that the goal of our work is to con- the features of the whole document and can be sider not only the relevance between the document used for classification. Different from this habitual and labels but also the correlations between labels. operation, we propose a novel input structure to directly use label information in constructing the token-level representations. 3 Methodology As shown in Figure 1, the inputs are packed The framework of L ACO is shown in Figure 1. The by a sequence pair (D; Y ), we separate the lower layers are shared across all tasks, while the text sequence D and the label sequence Y top layers are task-specific. In this section, we with a special token [SEP]. Note that the label first introduce the standard formal definition of sequence is to concatenate all label tokens. The MLTC. After that, we present the detailed technical shared layers map the inputs into a sequence implementation of LACO. of embedding vectors, one for each token, 1192 called token-level representations. Formally, let via a fully connected layer that captures more fine- f[CLS]; x ; :::; x ; [SEP ]; y ; :::; y ; [SEP ]g be grained features from different regions of the docu- 1 m 1 n the input sequence of the encoder, we obtain the ment: output contextualized token-level representations ! 
! p = sigmoid(W c + b ) (3) 1 1 fh ; h ; :::; h ; h ; h ; :::; h ; h g. x x y y [CLS] 1 m [SEP] 1 n [SEP] The input structure is designed to guarantee that nk n where W 2 R and b 2 R . We use Binary 1 1 words and labels are embedded together in the Cross Entropy as the loss function for the multi- same space. With the joint embedding mechanism, label classification problem: our model could pay more attention to two facets: 1) The correlations between document and labels. Different document have different influences on a L = [q ln p + (1 q ) ln(1 p )] (4) mlc i i i i specific label, while the same document fragment i=1 may affect multiple labels. 2) The correlations where p = P (y jD) is the probability of y pre- i i i among labels. The semantic information of labels dicted by the model, and q 2 f0; 1g is the cate- is interrelated, and label co-occurrence indicates gorical information of y . We train the model by strong semantic correlations between them. minimizing the cross-entropy error. 3.3 Multi-Label Text Classification 3.4 Multi-Task Learning with Label In this subsection, we introduce the MLTC task Correlations specific module, including Document-Label Cross In this subsection, we introduce two auxiliary tasks, Attention (CA) and Label Predication. Pairwise Label Co-occurrence Prediction (PLCP) and Conditional Label Co-occurrence Prediction 3.3.1 Document-Label Cross Attention (CA) (CLCP), to explore the second-order and high-order To explicitly model the semantic relationship be- label relationships, respectively. tween each word and label token, we measure the compatibility of label-word pairs via dot product: 3.4.1 PLCP Task T Suppose that each document D contains the cor- M = H H (1) responding label set Y and the uncorresponding label set Y . 
In order to train the model to un- where H = [h ; :::; h ] is the text sequence D x x 1 m derstand second-order label relationships, we pro- embedding, H = [h ; :::; h ] is the label se- Y y y 1 n mn pose a binarized label-pair prediction task named quence embedding and M 2 R . Consid- as PLCP that can be trivially generated from the ering the semantic information among consecu- multi-label classification corpus. The strategy of tive words, we further generalize M through non- selecting label pairs for co-occurrence prediction linearity network. Specifically, for a text fragment is straightforward. One part is sampled only from of length 2r + 1 centered at i, the local matrix Y , which is marked as IsCo-occur, and the other block M in M measures the correlation for ir;i+r part is sampled from Y and Y , respectively, the label-phrase pairs. To improve the effective- which is marked as NotCo-occur. To construct the ness of the sparse regularization, we use CNN with manual training dataset, we empirically set the ratio ReLU activation in the hidden layers, and perform of IsCo-occur and NotCo-occur to . As Figure 1 max-pooling and hyperbolic tangent sequentially shows, we concat the embedding of the two labels in the function [y ; y ] together as the input features. The addi- i j c = (M ) H (2) tional binary classifier is used to predict whether ir;i+r D the state of the two labels is IsCo-occur or NotCo- Note that the final document representation c is occur. The loss function is as followed: generated by aggregation of word representations L = [q ln p + (1 q ) ln(1 p )] (5) H , and weighted by the label-specific attention plcp ij ij ij ij vector (). where p = p(y jD; y ) denotes the output prob- ij j i 3.3.2 Label Predication ability of the the co-occurrance of the label-pair, Once having the discriminative document repre- and q is the ground-truth where q = 1 means ij sentation, we build the multi-label text classifier IsCo-occur and q = 0 means NotCo-occur. 
3.4.2 CLCP Task

To further learn high-order label relationships, we propose the conditional label co-occurrence prediction (CLCP) task. We first randomly pick s labels from Y+ to form Y_G, i.e., Y_G ⊆ Y+, and then predict whether the remaining labels of Y are relevant to them. Specifically, we introduce an additional position vector E_Y = [e_{y_1}, ..., e_{y_n}], where e_{y_i} = 0 indicates that the label y_i at that position is a sampled label, i.e., y_i ∈ Y_G, and e_{y_i} = 1 indicates y_i ∈ Y - Y_G. The average of the embeddings of the zero-position labels, h_G, is concatenated to each nonzero-position label embedding as the input features, to predict whether each of the remaining labels should co-occur given the sampled labels. In Figure 1, p(y_i | D, Y_G) denotes the probability of y_i predicted by the additional sigmoid classifier. The loss for the classification is the sum of the binary cross-entropy losses over the nonzero positions:

    L_clcp = - Σ_{i=1}^{n-s} [ q_i ln p_i + (1 - q_i) ln(1 - p_i) ]    (6)

where q_i ∈ {0, 1} is the ground truth denoting whether the label y_i should co-occur with Y_G, and p_i = p(y_i | D, Y_G) is the output probability of each masked label y_i.

3.4.3 Training Objectives

The same inputs are first fed into the shared layers; each sub-task module then takes the contextualized token-level representations generated by the joint embedding and produces a probability distribution over its own target labels. The overall loss is calculated as:

    L = L_mlc + λ L_plcp + (1 - λ) L_clcp    (7)

where λ is a hyperparameter in (0, 1), and L_plcp and L_clcp are the task-specific cross-entropy losses for the PLCP and CLCP tasks, respectively.

4 Experimental Setup

4.1 Datasets

We validate our proposed model on two multi-label text classification datasets. The Arxiv Academic Paper Dataset (AAPD) (Yang et al., 2018) collects 55,840 abstracts of papers in the field of computer science, organized into 54 related topics; each paper is assigned multiple topics. Reuters Corpus Volume I (RCV1-V2) (Lewis et al., 2004) is composed of 804,414 manually categorized newswire stories collected for research purposes; each story can be assigned multiple topics, and there are 103 topics in total. Table 1 shows the statistics of the datasets. Each dataset is divided into a training set, a validation set, and a test set; we follow the division of these two datasets used by Yang et al. (2018).

| Dataset | #Documents | #Labels | Avg. document length | Avg. #labels per document |
| AAPD    | 55,840     | 54      | 163.42               | 2.41                      |
| RCV1-V2 | 804,414    | 103     | 123.94               | 3.24                      |

Table 1: Statistics of the datasets: the total number of documents and labels, the average length of the documents, and the average number of labels associated with a document.

4.2 Evaluation Metrics

Multi-label classification can be evaluated with a group of metrics that capture different aspects of the task (Zhang and Zhou, 2014). Following previous work (Yang et al., 2018; Tsai and Lee, 2020), we adopt hamming loss and Micro/Macro-F1 scores as our main evaluation metrics; Micro/Macro-Precision and Micro/Macro-Recall are also reported to assist the analysis. A Macro-average treats all labels equally, whereas a Micro-average weights each label by its frequency.

4.3 Comparing Algorithms

We adopt a variety of methods as baselines, which can be divided into two groups according to whether label correlations are considered.

The first group of approaches does not consider label correlations. Binary Relevance (BR) (Boutell et al., 2004) independently trains one binary classifier (a linear SVM) for each label. CNN (Kim, 2014) utilizes multiple convolution kernels to extract text features and then outputs a probability distribution over the label space. LEAM (Wang et al., 2018) uses label embeddings to obtain a more discriminative text representation. LSAN (Xiao et al., 2019) learns a label-specific text representation with the help of attention mechanisms. We also implement a BERT (Devlin et al., 2019) classifier, which first encodes a document into a vector space and then outputs the probability for each label independently.

The second group of methods considers label correlations. Classifier Chains (CC) (Read et al., 2011) transform the MLTC problem into a chain of binary classification problems. SGM (Yang et al., 2018) proposes a Seq2Seq model with a global embedding mechanism to capture label correlations. Seq2Set (Yang et al., 2019) applies deep reinforcement learning to improve the performance of the Seq2Seq model. We also implement a Seq2Seq baseline with a 12-layer Transformer, named Seq2Seq_Bert. More recently, OCD (Tsai and Lee, 2020) proposes a framework with one encoder and two decoders for MLTC to alleviate exposure bias. ML-Reasoner (Wang et al., 2020) employs a binary classifier to predict all labels simultaneously and applies a novel iterative reasoning mechanism. Besides, we provide another strong baseline: SeqTag_Bert transforms the multi-label classification task into a sequential tagging task; it first obtains the embedding of each label (H_Y in Sec. 3.3) from our shared encoder and then outputs a probability for each label sequentially with a BiLSTM-CRF model (Huang et al., 2015).

The results of BR, CNN, CC, SGM, Seq2Set, OCD and ML-R are cited from previous papers; the results of the other baselines are produced by our implementations. All algorithms follow the same data division.

4.4 Experimental Setting

We implement our model in TensorFlow and run it on an NVIDIA Tesla P40. We fine-tune the English base-uncased version of BERT (https://github.com/google-research/bert). The batch size is 32, and the maximum total input sequence length is 320. The window size of the additional layer is 10, and we set λ to 0.5. We use Adam (Kingma and Ba, 2015) with a learning rate of 5e-5, and train the models by monitoring the Micro-F1 score on the validation set, stopping training if there is no increase for 50,000 consecutive steps.

5 Results and Analysis

In this section, we report the main experimental results of the baseline models and the proposed method on the two text datasets. Besides, we analyze the performance on labels of different frequencies, and further evaluate whether our method effectively learns label correlations, through label-pair confidence distribution learning and label combination prediction. Finally, we give a detailed analysis of the convergence study, which demonstrates the generalization ability of our method.

5.1 Experiment Results

We report the experimental results of all comparing algorithms on the two datasets in Table 2. The first block includes methods that do not learn label correlations, the second block contains the methods that consider label correlations, and the third block is our proposed LACO models.

| Algorithm                        | AAPD HL(↓) | AAPD Mi-P/R/F1(↑) | AAPD Ma-P/R/F1(↑) | RCV1-V2 HL(↓) | RCV1-V2 Mi-P/R/F1(↑) | RCV1-V2 Ma-P/R/F1(↑) |
| BR (Boutell et al., 2004)        | 0.0316 | 64.4 / 64.8 / 64.6 | -                  | 0.0086 | 90.4 / 81.6 / 85.8 | -                  |
| CNN (Kim, 2014)                  | 0.0256 | 84.9 / 54.5 / 66.4 | -                  | 0.0089 | 92.2 / 79.8 / 85.5 | -                  |
| LEAM (Wang et al., 2018)         | 0.0261 | 76.5 / 59.6 / 67.0 | 52.4 / 40.3 / 45.6 | 0.0090 | 87.1 / 84.1 / 85.6 | 69.5 / 65.8 / 67.6 |
| LSAN (Xiao et al., 2019)         | 0.0242 | 77.7 / 64.6 / 70.6 | 67.6 / 47.2 / 53.5 | 0.0075 | 91.3 / 84.1 / 87.5 | 74.9 / 65.0 / 68.4 |
| BERT (Devlin et al., 2019)       | 0.0224 | 78.6 / 68.7 / 73.4 | 68.7 / 52.1 / 57.2 | 0.0073 | 92.7 / 83.2 / 87.7 | 77.3 / 61.9 / 66.7 |
| CC (Read et al., 2011)           | 0.0306 | 65.7 / 65.1 / 65.4 | -                  | 0.0087 | 88.7 / 82.8 / 85.7 | -                  |
| SGM†‡ (Yang et al., 2018)        | 0.0251 | 74.6 / 65.9 / 69.9 | -                  | 0.0081 | 88.7 / 85.0 / 86.9 | -                  |
| Seq2Set†‡ (Yang et al., 2019)    | 0.0247 | 73.9 / 67.4 / 70.5 | -                  | 0.0073 | 90.0 / 85.8 / 87.9 | -                  |
| OCD†‡ (Tsai and Lee, 2020)       | -      | - / - / 72.0       | - / - / 58.5       | -      | -                  | -                  |
| ML-R (Wang et al., 2020)         | 0.0248 | 72.6 / 71.8 / 72.2 | -                  | 0.0079 | 89.0 / 85.2 / 87.1 | -                  |
| Seq2Seq_Bert‡ (Nam et al., 2017) | 0.0275 | 69.8 / 68.2 / 69.0 | 56.2 / 53.7 / 54.0 | 0.0074 | 88.5 / 87.4 / 87.9 | 69.8 / 65.5 / 66.1 |
| SeqTag_Bert                      | 0.0238 | 74.3 / 71.5 / 72.9 | 61.5 / 57.5 / 58.5 | 0.0073 | 90.6 / 84.9 / 87.7 | 73.7 / 66.7 / 68.7 |
| LACO                             | 0.0213 | 80.2 / 69.6 / 74.5 | 70.4 / 54.0 / 59.1 | 0.0072 | 90.8 / 85.6 / 88.1 | 75.9 / 66.6 / 69.2 |
| LACO+plcp                        | 0.0212 | 79.5 / 70.8 / 74.9 | 68.4 / 55.8 / 59.9 | 0.0070 | 90.8 / 86.2 / 88.4 | 76.1 / 66.5 / 69.2 |
| LACO+clcp                        | 0.0215 | 78.9 / 70.8 / 74.7 | 71.9 / 56.6 / 61.2 | 0.0070 | 90.6 / 86.4 / 88.5 | 77.6 / 71.5 / 73.1 |

Table 2: Predictive performance of each comparing algorithm on the two datasets. Hamming Loss (HL) and Micro (Mi-) / Macro (Ma-) average Precision (P), Recall (R) and F1-score (F1) are used as evaluation metrics; ↓ means lower is better, ↑ the opposite. Models marked with † have results quoted from previous papers; models marked with ‡ are Seq2Seq-based. (Footnote: we also implemented training with the three task losses together; since the two auxiliary tasks have a similar goal, there is no further performance gain.)

| Model       | AAPD HL  | AAPD Mi-F1 | RCV1-V2 HL | RCV1-V2 Mi-F1 |
| BERT        | 9.39e-09 | 3.80e-10   | 4.95e-04   | 3.67e-08      |
| SeqTag_Bert | 7.76e-16 | 1.86e-07   | 4.95e-04   | 3.67e-08      |

Table 3: Statistical analysis results: the P-values of the significance tests comparing LACO with the two strong baselines BERT and SeqTag_Bert.

| Model       | AAPD HL | Mi-F1 | Ma-F1 | RCV1-V2 HL | Mi-F1 | Ma-F1 |
| LACO        | 0.0213  | 74.5  | 59.1  | 0.0072     | 88.1  | 69.2  |
| w/o JE      | 0.0237  | 72.6  | 57.7  | 0.0077     | 87.5  | 68.4  |
| w/o CA      | 0.0220  | 73.5  | 58.4  | 0.0073     | 87.8  | 68.5  |
| w/o JE & CA | 0.0224  | 73.4  | 57.2  | 0.0073     | 87.7  | 66.7  |

Table 4: Ablation over the proposed joint embedding (JE) and cross attention (CA) mechanisms in the LACO model on the AAPD and RCV1-V2 datasets.

As shown in Table 2, the LACO-based models outperform all baselines by a large margin in the main evaluation metrics.
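For reference, the headline metrics in Table 2 can be computed over binary indicator matrices roughly as below. This is a simplified sketch of the standard definitions, not the evaluation code used in the paper.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly (lower is better)."""
    return float(np.mean(y_true != y_pred))

def micro_f1(y_true, y_pred):
    """Micro-averaging pools TP/FP/FN over all labels, so frequent labels dominate."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return float(2 * p * r / (p + r)) if p + r else 0.0

def macro_f1(y_true, y_pred):
    """Macro-averaging scores each label separately and averages, treating all labels equally."""
    return float(np.mean([micro_f1(y_true[:, [j]], y_pred[:, [j]])
                          for j in range(y_true.shape[1])]))
```

Note the asymmetry the analysis below relies on: a model can score well on Micro-F1 by handling frequent labels, while Macro-F1 still exposes weak low-frequency labels.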
The following observations can be made according to the results.

Our basic model LACO, trained only with the MLTC task, significantly improves previous results on hamming loss and Micro-F1. Specifically, on the AAPD dataset, compared to Seq2Set, which considers modeling the label correlations, our basic model reduces hamming loss by 13.8% and improves Micro-F1 by 5.67%. Compared with a label embedding method like LSAN, LACO achieves a 4.00% reduction in hamming loss and a 0.69% improvement in Micro-F1 on the RCV1-V2 dataset. Also, BERT is still a strong baseline, which shows that obtaining a high-quality, discriminative document representation is important for the MLTC task. Here, we train LACO with 3 random seeds and report the mean and the standard deviation. We perform a significance test comparing LACO with the two strong baselines BERT and SeqTag_Bert in Table 3.

As for the results of the multi-task learning methods, the two subtasks introduced by our method yield a certain degree of improvement on the main metrics of both datasets. Specifically, we observe that the PLCP task shows better performance and presents the best Micro-F1 score of 74.9 on the AAPD dataset, while the CLCP task presents the best Micro-F1 of 88.5 on the RCV1-V2 dataset. Furthermore, the proposed multi-task framework shows greater improvements than the basic LACO model on Macro-F1, which demonstrates that the performance on low-frequency labels can be greatly improved through our label-correlation-guided subtasks. More detailed analyses are given in Sections 5.3 and 5.5. Notably, the CLCP task performs better on Macro-F1 by considering high-order correlations.
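The interpolation between the two auxiliary losses in the overall objective of Eq. (7), with λ = 0.5 as in our experimental setting, can be sketched as below. `joint_loss` is a hypothetical helper name, not part of the released implementation.

```python
def joint_loss(l_mlc, l_plcp, l_clcp, lam=0.5):
    """Overall objective of Eq. (7): the main MLTC loss plus an
    interpolation of the PLCP and CLCP losses weighted by lambda in (0, 1)."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lambda must lie in the open interval (0, 1)")
    return l_mlc + lam * l_plcp + (1.0 - lam) * l_clcp
```

Note that λ trades the two auxiliary signals off against each other rather than against the main task: the MLTC loss always enters with weight 1.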
We also implement an experiment using the losses of all three tasks together; however, the combination of the two subtasks cannot further improve the model performance compared to LACO+plcp or LACO+clcp, which we attribute to the strong relevance between the two tasks.

Comparing with the two strong baseline models, all of the P-values of LACO are below the threshold (p < 0.05), suggesting that the performance gains are statistically significant. In addition, we conduct the Friedman test (Demšar, 2006) for the hamming loss and Micro-F1 metrics. The Friedman statistic F is 7.875 for hamming loss and 6.125 for Micro-F1, while the corresponding critical value is 2.8179 (number of comparing algorithms k = 12, number of datasets N = 2). As a result, the null hypothesis of indistinguishable performance among the compared algorithms is clearly rejected at the 0.05 significance level.

Compared with SGM, Seq2Seq_Bert does not achieve significant improvements, but SeqTag_Bert shows good performance based on the Transformer encoder shared between the document and the labels. Notably, the result of SeqTag_Bert on Micro-F1 is comparable to BERT, but its result on Macro-F1 is observably higher. This illustrates that label correlation information is more important for learning low-frequency labels.

5.2 Ablation Study

In this section, we demonstrate the effectiveness of the two cores of the proposed LACO model, namely the document-label joint embedding (JE) mechanism and the document-label cross attention (CA) mechanism. Note that the w/o JE & CA setting is equivalent to the BERT baseline in Table 2, which encodes the document only and predicts the probability for each label based on [CLS]. In the w/o JE setting, the document embedding is encoded by BERT while each label embedding is a learnable, randomly initialized vector; its label prediction layer is the same as in LACO. In the w/o CA setting, the document and label embeddings are obtained by BERT jointly, and the probability for each label is predicted based on [CLS]. Table 4 shows that JE and CA are both
important to obtain a more discriminative text representation. After removing the JE and CA mechanisms, the performance drops more on the AAPD dataset than on the RCV1-V2 dataset. We believe this is mainly due to the smaller number of training instances in AAPD, which makes it more difficult to learn relevant features, especially for low-frequency labels.

5.3 Low-frequency Label Performance

[Figure 2 omitted: (a) the label distribution of AAPD; (b) Macro-F1 for the four groups on AAPD.] Figure 2: Label classification performance under different frequency distributions. Subfigure (a) shows the frequency of each label on the AAPD training set; subfigure (b) illustrates the Macro-F1 performance of the different methods in the four groups.

Figure 2(a) illustrates the label frequency distribution on the AAPD training set, which is a typical big-head-long-tail distribution. We divide all the labels into four groups according to frequency: the big-head group (Group 1), the high-frequency group (Group 2), the middle-frequency group (Group 3), and the low-frequency group (Group 4). As shown in Figure 2(b), the performance of all methods decreases with the label frequency of occurrence. The performance gap between Seq2Seq_Bert and the LACO-based methods increases as the frequency decreases; in Group 4 especially, LACO+clcp achieves a 74.5% improvement over the Seq2Seq_Bert model, which demonstrates that the performance on low-frequency labels can be enhanced by the conditional label co-occurrence prediction task.

5.4 Label Correlation Analysis

The co-occurrence relationship between labels is one of the important aspects that reflect label correlation. In this experiment, we utilize the conditional probability p(y_b | y_a) between labels y_a and y_b to represent their dependency quantitatively. Furthermore, we calculate the conditional Kullback-Leibler divergence of p(y_b | y_a) to measure the "distance" between the model prediction distribution (P^p) and the ground-truth distribution of the training/testing set (P^g). The score is calculated as:

    KL(P^g || P^p) = Σ_{y_a, y_b ∈ Y} p^g(y_b | y_a) log [ p^g(y_b | y_a) / p^p(y_b | y_a) ],   p(y_b | y_a) = #(y_a, y_b) / #(y_a)    (8)

where # denotes the count of the single label or the label combination in the training/testing dataset.

| Model        | AAPD train | AAPD test | RCV1-V2 train | RCV1-V2 test |
| Seq2Seq_Bert | 1.27       | 1.30      | 0.08          | 0.94         |
| SeqTag_Bert  | 1.40       | 1.28      | 0.09          | 0.95         |
| LACO         | 1.40       | 1.27      | 0.09          | 0.94         |
| LACO+plcp    | 1.35       | 1.28      | 0.08          | 0.76         |
| LACO+clcp    | 1.32       | 1.10      | 0.08          | 0.91         |

Table 5: KL(P^g || P^p) for the different models on the AAPD and RCV1-V2 datasets. P^g is the ground-truth distribution of the dataset and P^p is the model distribution; smaller scores indicate that the two distributions are closer.

The KL-distances on the AAPD and RCV1-V2 datasets are shown in Table 5. In the testing-set setting, we find that LACO fits the dependency relationships between labels much better, especially after introducing the co-occurrence relationship prediction tasks. The Seq2Seq_Bert model achieves the lowest KL-distance to the training set on both AAPD and RCV1-V2 but larger scores on the test set, which further proves that Seq2Seq-based models are prone to over-fitting label pairs during training. It should be emphasized that this KL-distance only quantifies how much inter-label dependence the model has learned; it cannot directly measure the prediction accuracy of the model.

5.5 Label Combination Diversity Analysis

| Model        | AAPD C_Test | AAPD Acc | RCV1-V2 C_Test | RCV1-V2 Acc |
| Ground Truth | 392         | 1.000    | 278            | 1.000       |
| Seq2Seq_Bert | 214         | 0.392    | 87             | 0.669       |
| OCD          | 302         | 0.403    | -              | -           |
| SeqTag_Bert  | 289         | 0.410    | 187            | 0.637       |
| LACO         | 315         | 0.425    | 241            | 0.642       |
| LACO+plcp    | 320         | 0.439    | 241            | 0.644       |
| LACO+clcp    | 321         | 0.427    | 239            | 0.660       |

Table 6: Statistics on the number of label combinations. C_Test is the number of different predicted label combinations; Acc is the subset accuracy on the testing set.

Table 6 shows the number of different predicted label combinations (C_Test) and the subset accuracy (Acc), a strict metric that indicates the percentage of samples that have all their labels classified correctly. Seq2Seq_Bert produces fewer kinds of label combinations on the two datasets. As Seq2Seq-based models tend to "remember" label combinations, the generated label sets are very much alike, indicating a poor generalization ability to unseen label combinations. Because Seq2Seq_Bert is conservative and only generates label combinations it has seen in the training set, it achieves high Acc values, especially on the RCV1-V2 dataset. For our models, they produce more diverse label combinations while obtaining

… results show that our method outperforms competitive baselines by a large margin. Detailed analyses show the effectiveness of our proposed architecture, which uses the semantic connections between document-label and label-label pairs to obtain a discriminative text representation. Furthermore, the multi-task framework shows a strong capability for low-frequency label prediction and label correlation learning. Considering extreme multi-label text classification, which involves an extremely large label set, LACO could be further exploited through scheduled label sampling, hierarchical label embedding strategies, and so on. We hope that further research can take clues from our work.

Acknowledgements

We would like to thank the ACL reviewers for their valuable comments and Keqing He, Haoyan Liu, Zizhen Wang, Chenyang Liao and Rui Pan for their generous help and discussion.

[Figure 3 omitted: (a) convergence speed on AAPD; (b) convergence speed on RCV1-V2.] Figure 3: The convergence speed of the five BERT-based methods. The x-axis refers to the training steps, and the y-axis to the Micro-F1 score.

References

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for document classification.
arXiv:1904.08398.

good Acc since we do not regard multi-label classification as a sequence generation task that uses a decoder to model the relationships between labels. Instead, we learn the correlations among labels on the encoding side, and the scoring between labels does not interfere with each other, which leads to a higher probability of generating label combinations not seen during training than for the Seq2Seq-based models.

5.6 Convergence Speed

The convergence speed of the five BERT-based models is shown in Figure 3. Our basic model LACO outperforms the other BERT-based models in terms of convergence speed, and the proposed multi-task mechanisms enable LACO to converge much faster. The main reason might be that the feature exchange through the multiple tasks helps the model learn a more robust and common representation.

6 Conclusions and Future Work

In this paper, we propose a new method for MLTC based on document-label joint embedding and correlation-aware multi-task learning. Experimental …

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 1171–1179.

Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. 2004. Learning multi-label scene classification. Pattern Recognition, 37:1757–1771.

Guibin Chen, Deheng Ye, Zhenchang Xing, Jieshan Chen, and Erik Cambria. 2017. Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2377–2383. IEEE.

Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Eva Gibaja and Sebastián Ventura. 2015. A tutorial on multilabel learning. ACM Computing Surveys (CSUR), 47(3):1–38.

Biyang Guo, Songqiao Han, Xiao Han, Hailiang Huang, and Ting Lu. 2020. Label confusion learning to enhance text classification models. arXiv:2012.04987.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv:1508.01991.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 521–526.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397.

Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2873–2879.

Weiwei Liu and Ivor W. Tsang. 2015. On the optimality of classifier chain for multi-label classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 712–720.

Weiwei Liu, Xiaobo Shen, Haobo Wang, and Ivor W. Tsang. 2020. The emerging trends of multi-label learning. arXiv:2011.11197.

Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. 2020. Long-tail learning via logit adjustment. arXiv:2007.07314.

Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J. Kim, and Johannes Fürnkranz. 2017. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Advances in Neural Information Processing Systems, pages 5413–5423.

Kechen Qin, Cheng Li, Virgil Pavlu, and Javed Aslam. 2019. Adapting RNN sequence prediction model to multi-label set prediction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3181–3190.

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification. Machine Learning, 85(3):333.

Che-Ping Tsai and Hung-Yi Lee. 2020. Order-free learning alleviating exposure bias in multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6038–.

Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13.

Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. 2009. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets.

Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2321–2331.

Ran Wang, Robert Ridley, Weiguang Qu, Xinyu Dai, et al. 2020. A novel reasoning mechanism for multi-label text classification. Information Processing & Management, 58(2):102441.

Lin Xiao, Xin Huang, Boli Chen, and Liping Jing. 2019. Label-specific document representation for multi-label text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 466–475.

Guangxu Xun, Kishlay Jha, Jianhui Sun, and Aidong Zhang. 2020. Correlation networks for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1074–.

Pengcheng Yang, Fuli Luo, Shuming Ma, Junyang Lin, and Xu Sun. 2019. A deep reinforced sequence-to-set model for multi-label classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5252–5258.

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence generation model for multi-label classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Ronghui You, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, and Shanfeng Zhu. 2018. AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. arXiv:1811.01727.

Min-Ling Zhang and Zhi-Hua Zhou. 2014. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837.

Wei Zhao, Hui Gao, Shuhui Chen, and Nan Wang. 2020. Generative multi-task learning for text classification. IEEE Access, 8:86380–86387.


Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Published: Jan 1, 2021
