Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Deconfounded Recommendation for Alleviating Bias Amplification

Deconfounded Recommendation for Alleviating Bias Amplification Deconfounded Recommendation for Alleviating Bias Amplification 1 12∗ 3 12 1 Wenjie Wang , Fuli Feng , Xiangnan He , Xiang Wang , and Tat-Seng Chua 1 2 3 National University of Singapore, Sea-NExT Joint Lab, University of Science and Technology of China {wenjiewang96,fulifeng93,xiangnanhe}@gmail.com,xiangwang@u.nus.edu,dcscts@nus.edu.sg ABSTRACT ACM Reference Format: Wenjie Wang, Fuli Feng, Xiangnan He, Xiang Wang, and Tat-Seng Chua. Recommender systems usually amplify the biases in the data. The 2021. Deconfounded Recommendation for Alleviating Bias Amplification. model learned from historical interactions with imbalanced item In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery distribution will amplify the imbalance by over-recommending and Data Mining (KDD ’21), August 14–18, 2021, Virtual Event, Singapore. items from the major groups. Addressing this issue is essential ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3447548.3467249 for a healthy ecosystem of recommendation in the long run. Existing works apply bias control to the ranking targets (e.g., 1 INTRODUCTION calibration, fairness, and diversity), but ignore the true reason for Recommender System (RS) has been widely used to achieve bias amplification and trade off the recommendation accuracy. personalized recommendation in most online services, such as In this work, we scrutinize the cause-effect factors for bias social networks and advertising [39]. Its default choice is to learn amplification, identifying the main reason lies in the confounder user interest from historical interactions (e.g., clicks and purchases), effect of imbalanced item distribution on user representation and which typically exhibit data bias, i.e., the distribution over item prediction score. The existence of such confounder pushes us to go groups (e.g., the genre of movies) is imbalanced. Consequently, beyond merely modeling the conditional probability and embrace recommender models face the bias amplification issue [32]: over- the causal modeling for recommendation. Towards this end, we recommending the majority group and amplifying the imbalance. propose a Deconfounded Recommender System (DecRS), which Figure 1(a) illustrates this issue with an example in movie models the causal effect of user representation on the prediction recommendation, where 70% of the movies watched by a user are score. The key to eliminating the impact of the confounder lies action movies, but action movies take 90% of the recommendation in backdoor adjustment, which is however difficult to do due to slots. Undoubtedly, over-emphasizing the items from the majority the infinite sample space of the confounder. For this challenge, we groups will limit a user’s view and decrease the effectiveness of contribute an approximation operator for backdoor adjustment recommendations. Worse still, due to feedback loop [7], such bias which can be easily plugged into most recommender models. Lastly, amplification will intensify with time, causing more issues like filter we devise an inference strategy to dynamically regulate backdoor bubbles [22] and echo chambers [14]. adjustment according to user status. We instantiate DecRS on two Existing works alleviate bias amplification by introducing bias representative models FM [29] and NFM [16], and conduct extensive control into the ranking objective of recommender models, which experiments over two benchmarks to validate the superiority of are mainly from three perspectives: 1) fairness [21, 31], which our proposed DecRS. pursues equal exposure opportunities for items of different groups; 2) diversity [6], which intentionally increases the covered groups CCS CONCEPTS in a recommendation list, and 3) calibration [32], which encourages • Information systems→ Recommender systems; Collabora- the distribution of recommended item groups to follow that of tive filtering . interacted items of the user. However, these methods alleviate bias amplification at the cost of sacrificing recommendation KEYWORDS accuracy [31, 32]. More importantly, the fundamental question is not answered: what is the root reason for bias amplification? Deconfounded Recommendation, User Interest Imbalance, Bias After inspecting the cause-effect factors in recommender Amplification modeling, we attribute bias amplification to a confounder [25]. The ∗ Corresponding author: Fuli Feng (fulifeng93@gmail.com). This research is supported historical distribution of a user over item groups (e.g., [0.7, 0.3] in by the Sea-NExT Joint Lab, the National Natural Science Foundation of China Figure 1(a)) is a confounder between the user’s representation and (61972372), and National Key Research and Development Program of China (2020AAA0106000). the prediction score. In the conventional RS, the user/item features (e.g., ID and attributes) are first embedded into the representation Permission to make digital or hard copies of all or part of this work for personal or vectors, which are then fed into an interaction module (e.g., classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation factorization machines (FM) [29]) to calculate the prediction score on the first page. Copyrights for components of this work owned by others than ACM for the user-item pair [17]. In other words, recommender models must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, estimate the conditional probability of clicks given user/item to post on servers or to redistribute to lists, requires prior specific permission and /or a fee. Request permissions from permissions@acm.org. representations. From a causal view, user and item representations KDD ’21, August 14–18, 2021, Virtual Event, Singapore can be regarded as the causes of the prediction score, and the © 2021 Association for Computing Machinery. interaction module should encode the causal relations between ACM ISBN 978-1-4503-8332-5/21/08. . . $15.00 https://doi.org/10.1145/3447548.3467249 them [25]. But inspecting the causal relations, we find that the arXiv:2105.10648v1 [cs.IR] 22 May 2021 Action movie Romance movie Majority group Minority group D I D I U User representation User browsing history I Item representation 70% 30% 0.70 User historical distribution 0.59 Bias amplification Recommender Feedback loop 0.58 X over item groups 0.47 Backdoor Group-level user representation New recommendation list M Adjustment 90% 10% ratings Y Prediction score 4 5 U Y U User feedback (b) Prediction score difference between the items (a) (b) (a) An example of bias amplification. in the majority and minority groups over ML-1M. Figure 2: (a) The causal graph of conventional RS. (b) The Underwater (2020) Historical distribution Marriage Story (2019) 𝑝 (𝑔 ) % & causal graph used in DecRS. Action movie of user u over item groups. Romance movie Rating by user u: 3.0/5.0 Rating by user u: 5.0/5.0 𝑔 𝑔 ' ( item groups • We instantiate DecRS on two representative recommender Item representation User representation Item representation Interaction Interaction models, and conduct extensive experiments on two benchmarks module module Rating by user u: 3.0 < 5.0 which validate the effectiveness of our proposal. 0.6 0.5 Prediction score: 0.6 > 0.5 (c) An example on the cause of bias amplification. 2 METHODOLOGY Figure 1: Illustration of bias amplification. In this section, we first analyze the conventional RS from a causal hidden confounder, i.e., the user historical distribution over item view and explain the reason for bias amplification, which is followed groups, affects both the user representation and the prediction by the introduction of the proposed DecRS. score. Due to the modeling of conditional probability, recommender models are affected by the confounder and thus suffer from a 2.1 A Causal View on Bias Amplification spurious correlation between the user and the prediction score. To study bias amplification, we build up a causal graph to explicitly That is, given two item groups, the one that the user interacted analyze the causal relations in the conventional RS. more in the history will receive higher prediction scores, even though their items have the same matching level. Figure 1(b) shows 2.1.1 Causal Graph. We scrutinize the causal relations in empirical evidence from the FM on ML-1M dataset: among the items recommender models and abstract a causal graph, as shown in with the same ratings (e.g., ratings = 4), the ones in the majority Figure 2(a), which consists of five variables: 𝑈 , 𝐼 , 𝐷 , 𝑀 , and 𝑌 . Note group will receive higher prediction scores. Therefore, the items in that we use the capital letter (e.g., 𝑈 ), lowercase letter (e.g., 𝒖), and the majority group, even including those undesirable or low-quality letter in the calligraphic font (e.g., U) to represent a variable, its ones (see example in Figure 1(c)), could deprive the recommendation particular value, and its sample space, respectively. In particular, opportunities of the items in the minority group. • 𝑈 denotes user representation. For one user, 𝒖 = [𝒖 , ..., 𝒖 ] 1 𝐾 The key to addressing bias amplification lies in eliminating the represents the embeddings of 𝐾 user features (e.g., ID, gender, spurious correlation in the recommender modeling. To achieve this and age) [29], where 𝒖 ∈ R is one feature embedding. goal, we need to push the conventional RS to go beyond modeling • 𝐼 is item representation and each 𝒊 denotes the embeddings of the conditional probability and embrace the causal modeling several item features (e.g., ID and genre) which are similar to 𝒖. of user representation on the prediction score. We propose a • 𝐷 represents the user historical distribution over item groups. novel Deconfounded Recommender System (DecRS), which explicitly Groups can be decided by item attributes or similarity [32]. Given models the causal relations during training, and leverages backdoor 𝑁 item groups {𝑔 , ...,𝑔 }, 𝒅 = [𝑝 (𝑔 ), ...,𝑝 (𝑔 )] ∈ R is a 1 𝑁 𝑢 𝑢 1 𝑢 𝑁 adjustment [25] to eliminate the impact of the confounder. However, particular value of 𝐷 when the user is𝑢, where 𝑝 (𝑔 ) is the click 𝑢 𝑛 the sample space of the confounder is huge, making the traditional frequency of user 𝑢 over group 𝑔 in the history . For instance, implementation of backdoor adjustment infeasible. To this end, for the user 𝑢 in Figure 1(a), 𝒅 is [0.7, 0.3] if 𝑁 = 2. we derive an approximation of backdoor adjustment, which is • 𝑀 is the group-level user representation. A particular value 𝒎 ∈ universally applicable to most recommender models. Lastly, we R is a vector which describes how much the user likes different propose a user-specific inference strategy to dynamically regulate item groups. 𝒎 can be obtained from the values of 𝑈 and 𝐷 . the influence of backdoor adjustment based on the user status. We That is, 𝑀 is deterministic if 𝑈 and 𝐷 are given so that we can instantiate DecRS on two representative models FM [29] and neural represent 𝒎 by a function 𝑀(𝒅, 𝒖) with 𝒅 and 𝒖 as inputs. To factorization machines (NFM) [16]. Extensive experiments over two keep generality, we incorporate 𝑀 into the causal graph because benchmarks demonstrate that our DecRS not only alleviates bias many recommender models (e.g., FM) have modeled the user amplification effectively, but also improves the recommendation preference over item groups explicitly or implicitly by using the accuracy over the backbone models. group-related features (e.g., movie genre). Overall, the main contributions of this work are threefold: • 𝑌 with 𝑦 ∈ [0, 1] is the prediction score for the user-item pair. • We construct a causal graph to analyze the causal relations The edges in the graph describe the causal relations between in recommender models, which reveals the cause of bias variables, e.g., 𝑈 → 𝑌 means that 𝑈 has a direct causal effect [25] amplification from a causal view. on 𝑌 , i.e., changes on 𝑈 will affect the value of 𝑌 . In particular, • We propose a novel DecRS with an approximation of backdoor adjustment to eliminate the impact of the confounder, which can In this work, we use click to represent any implicit feedback, such as purchase and be incorporated into existing recommender models to alleviate watch. For brevity, 𝑢 and 𝑖 may be used to denote the user and item, respectively. bias amplification. Besides, 𝑛 is used to represent any value in {1, 2, ..., 𝑁}. avg. prediction score Time t 𝒑 𝒈 , 𝒖 𝟐 • 𝐷 → 𝑈 : the user historical distribution over item groups Romance movie Romance movie 10% 20% affects user representation 𝑈 , making it favor the group with a (Minority group) 30% 50% higher click frequency (i.e., majority group). This is because user 𝟏 0.6 90% 0.4 Action movie 80% 70% representation is optimized to fit the imbalanced historical data. (Majority group) 50% 𝟎. 𝟔 • (𝐷,𝑈) → 𝑀 : 𝐷 and𝑈 decide the group-level user representation. 0.8 0.2 Training data Testing data • (𝑈, 𝑀, 𝐼) → 𝑌 : The edges show that 𝑈 affects 𝑌 by two paths: 1) the direct path 𝑈 → 𝑌 which denotes the user’s pure preference 𝟎. 𝟐 𝒅 𝒅 𝒑 (𝒈 ) 𝒖 𝒑 (𝒈 ) 𝒖 𝒖 𝒏 𝒖 𝒏 over the item, and 2) the indirect path 𝑈 → 𝑀 → 𝑌 , indicating drift 𝟎. 𝟖 𝟏 𝟎 𝟎. 𝟒 𝒑 𝒈 , that the prediction score could be high because the user shows 𝒖 𝟏 𝒈 𝒈 𝟏 𝟐 group 𝒈𝟏𝒈 group Action movie interest in the item group rather than the item. (b) Possible values of D and the probabilities. (a) User interest is changing over time. According to the causal theory [25], since 𝐷 affects both 𝑈 and Figure 3: (a) Illustration of user interest drift. (b) An example 𝑌 , 𝐷 is a confounder between 𝑈 and 𝑌 , resulting in the spurious of the distribution of 𝐷 when the item group number is 2. correlation when estimating the correlation between 𝑈 and 𝑌 . Each node in the line represents a particular value 𝒅, and a 2.1.2 Conventional RS. Due to the confounder, existing recom- darker color denotes a higher probability of 𝒅, i.e., 𝑃(𝒅). mender models that estimate the conditional probability 𝑃(𝑌|𝑈, 𝐼) recommendation list and narrow down the user interest. Besides, face the spurious correlation, which leads to bias amplification. the undesirable and low-quality items in the majority group Formally, given 𝑈 = 𝒖 and 𝐼 = 𝒊, we can derive the conditional will dissatisfy users, leading to poor recommendation accuracy. probability 𝑃(𝑌|𝑈, 𝐼) by: Worse still, by analyzing Eq. 1(d), we have a new observation: the 𝑃(𝑌|𝑈 = 𝒖, 𝐼 = 𝒊) prediction score 𝑌 heavily relies on the user historical distribution Í Í 𝑃(𝒅)𝑃(𝒖|𝒅)𝑃(𝒎|𝒅, 𝒖)𝑃(𝒊)𝑃(𝑌|𝒖, 𝒊, 𝒎) over item groups, i.e., 𝒅 . Once users’ future interest in item groups 𝒅∈D 𝒎∈M 𝑢 = (1a) changes (i.e., user interest drift), the recommendations will be 𝑃(𝒖)𝑃(𝒊) ∑︁ ∑︁ dissatisfying. For instance, as shown in Figure 3(a), the user interest = 𝑃(𝒅|𝒖)𝑃(𝒎|𝒅, 𝒖)𝑃(𝑌|𝒖, 𝒊, 𝒎) (1b) in item groups is not stable, and thus the correlation caused by 𝒅∈D 𝒎∈M ∑︁ the confounder 𝐷 will not be reliable if the distribution 𝑑 is = 𝑃(𝒅|𝒖)𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅, 𝒖)) (1c) inconsistent between training and testing data. 𝒅∈D = 𝑃(𝒅 |𝒖)𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅 , 𝒖)), (1d) 𝑢 𝑢 2.2 Deconfounded Recommender System To resolve the impact of the confounder, DecRS estimates the causal where D andM are the sample spaces of 𝐷 and 𝑀 , respectively . effect of user representation on the prediction score. Experimentally, In particular, Eq. (1a) follows the law of total probability; Eq. (1b) the target can be achieved by collecting intervened data where the is obtained by Bayes rule; since 𝑀 can only take a value 𝑀(𝒅, 𝒖) if user representation is forcibly adjusted to eliminate the impact of 𝑈 = 𝒖 and 𝐷 = 𝒅, i.e., 𝑃(𝑀(𝒅, 𝒖)|𝒅, 𝒖) = 1, the sum overM in Eq. confounder. However, such an experiment is too costly to achieve in (1b) is removed; 𝐷 is known if 𝑈 = 𝒖 is given. Thus the probability large-scale and faces the risk of hurting user experience in practice. of 𝒖 having the distribution 𝒅 (i.e., 𝑃(𝒅|𝒖)) is 1 if and only if 𝒅 is DecRS thus resorts to the causal technique: backdoor adjustment [25, 𝒅 ; otherwise 𝑃(𝒅|𝒖) = 0, where 𝒅 is the historical distribution of 𝑢 𝑢 26, 41], which enables the estimation of causal effect from the user 𝑢 over item groups. observational data. From Eq. (1d), we can find that 𝒅 does not only affect the user representation 𝒖 but also affects 𝑌 via 𝑀(𝒅 , 𝒖), causing the 2.2.1 Backdoor Adjustment. According to the theory of backdoor spurious correlation: given the item 𝑖 in a group 𝑔 , the more items adjustment [25], the target of DecRS is formulated as: 𝑃(𝑌|𝑑𝑜(𝑈 = in group 𝑔 the user 𝑢 has clicked in the history, the higher the 𝒖), 𝐼 = 𝒊) where 𝑑𝑜(𝑈 = 𝒖) can be intuitively seen as cutting off the prediction score 𝑌 becomes. In other words, the high prediction edge 𝐷 → 𝑈 in the causal graph and blocking the effect of 𝐷 on 𝑈 scores are caused by the users’ historical interest in the group (cf. Figure 2(b)). We then derive the specific expression of backdoor instead of the items themselves. From the perspective of model adjustment. Formally, prediction, 𝒅 affects 𝒖, which makes 𝒖 favor the majority group. In 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊) Eq. (1d), a higher click frequency 𝑝 (𝑔 ) in 𝒅 will make 𝑀(𝒅 , 𝒖) 𝑢 𝑛 𝑢 𝑢 ∑︁ represent a strong interest in group 𝑔 , increasing the prediction = 𝑃(𝒅|𝑑𝑜(𝑈 = 𝒖))𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝒊, 𝑀(𝒅,𝑑𝑜(𝑈 = 𝒖))) (2a) scores of items in group 𝑔 via 𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅 , 𝒖)). As such, the 𝑛 𝑢 𝒅∈D ∑︁ items in the majority group, even including the low-quality ones, = 𝑃(𝒅)𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝒊, 𝑀(𝒅,𝑑𝑜(𝑈 = 𝒖))) (2b) are easy to have high prediction scores due to the effect of the 𝒅∈D confounder 𝐷 . They occupy the recommendation opportunities of ∑︁ = 𝑃(𝒅)𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅, 𝒖)), (2c) items in the minority group, and thus bias amplification happens. 𝒅∈D The spurious correlation is harmful for most users because the items in the majority group are likely to dominate the where the derivation of Eq. (2a) is the same as Eq. (1c), which follows the law of total probability and Bayes rule. Besides, Eq. (2b) and Theoretically, 𝐷 has an infinite sample space. But the values are finite in a specific Eq. (2c) are obtained by two do calculus rules: insertion/deletion of dataset. To simplify the notations, we use the discrete set D to represent the sample space of 𝐷 , and so is 𝑀 . actions and action/observation exchange in Theorem 3.4.1 of [25]. Table 1: Key notations and descriptions. As compared to Eq. 1(d), DecRS estimates the prediction score Notation Description with consideration of every possible value of 𝐷 subject to the prior 𝒖 = [𝒖 , ..., 𝒖 ], 𝒖 ∈ R The representation vectors of 𝐾 user features. 1 𝐾 𝑘 𝑃(𝒅), rather than the probability of 𝒅 conditioned on 𝒖. Therefore, The feature values of a user’s 𝐾 features [29], e.g., the items in the majority group will not receive high prediction 𝒙 = [𝑥 , ...,𝑥 ] 𝑢 𝑢,1 𝑢,𝐾 [0.5, 1, ..., 0.2] . scores purely because of a high click probability in 𝒅 . And thus 𝑝 (𝑔 ) denotes the click frequency of user𝑢 over 𝑢 𝑛 𝒅 = [𝑝 (𝑔 ), ...,𝑝 (𝑔 )] 𝑢 𝑢 1 𝑢 𝑁 backdoor adjustment alleviates bias amplification by removing the group 𝑔 in the history, e.g., 𝒅 = [0.8, 0.2] . 𝑛 𝑢 The group-level representation of user 𝑢 under a effect of 𝐷 on 𝑈 . 𝐻 𝒎 = 𝑀(𝒅, 𝒖) ∈ R historical distribution 𝒅 . Intuitively, as shown in Figure 3(b), 𝐷 has extensive possible H The set of the items clicked by user 𝑢 . values in a specific dataset, i.e., users have various historical U,I The user and item sets, respectively. distributions over item groups. In DecRS, the prediction score 𝑞 denotes the probability of item 𝑖 belonging to 𝑖 𝑖 𝑖 𝑁 𝑔 𝒒 = [𝑞 , ...,𝑞 ] ∈ R 𝑌 considers various possible values of 𝐷 . As such, 1) inevitably, 𝑔 𝑔 1 𝑁 𝑖 group 𝑔 , e.g., 𝒒 = [1, 0, 0] . DecRS removes the dependency on 𝒅 in Eq. 1(d) and mitigates the 𝐻 𝒗 = [𝒗 , ..., 𝒗 ], 𝒗 ∈ R 𝒗 denotes the representation of group 𝑔 . 1 𝑁 𝑛 𝑛 𝑛 spurious correlation, and 2) theoretically, when user interest drift The symmetric KL divergence value of user 𝑢 and 𝘂 ,𝘂 ˆ 𝑢 𝑢 the normalized one, respectively. happens in the testing data, DecRS can produce a more robust and accurate prediction because the model has “seen” many different values of 𝐷 during training and doesn’t heavily depend on the The error of the approximation 𝜖 is measured by the Jensen gap [1]: unreliable distribution 𝒅 in Eq. 1(d). 𝜖 = |E [𝑓 (·)] −𝑓 (𝒖, 𝒊, 𝑀(E [𝒅], 𝒖))|. 2.2.2 Backdoor Adjustment Approximation. Theoretically, the (6) 𝒅 𝒅 sample space of 𝐷 is infinite, which makes the calculation of Eq. Theorem 2.1. If 𝑓 is a linear function with a random variable 𝑋 (2c) intractable. Therefore, it is essential to derive an efficient as the input, then 𝐸[𝑓 (𝑋)] = 𝑓 (𝐸[𝑋]) holds under any probability approximation of Eq. (2c). distribution 𝑃(𝑋). Refer to [1, 13] for the proof. • Sampling of 𝐷. To estimate the distribution of 𝐷 , we sample users’ historical distributions over item groups in the training data, Theorem 2.2. If a random variable 𝑋 with the probability which comprise a discrete set D. Formally, given a user 𝑢, 𝒅 = distribution 𝑃(𝑋) has the expectation 𝘇, and the non-linear function [𝑝 (𝑔 ), ...,𝑝 (𝑔 )] ∈ D and each click frequency 𝑝 (𝑔 ) over 𝑢 1 𝑢 𝑁 𝑢 𝑛 𝑓 : 𝐺 → R where 𝐺 is a closed subset of R, following: group 𝑔 is calculated by (1) 𝑓 is bounded on any compact subset of 𝐺; ∑︁ (2) |𝑓 (𝑥) − 𝑓 (𝘇)| = 𝑂(|𝑥 − 𝘇| ) at 𝑥 → 𝘇 for 𝛽 > 0; 𝑖∈H 𝑔 𝑢 𝑛 𝑝 (𝑔 ) = 𝑝(𝑔 |𝑖)𝑝(𝑖|𝑢) = , (3) 𝑢 𝑛 𝑛 (3) |𝑓 (𝑥)| = 𝑂(|𝑥| ) as 𝑥 → +∞ for 𝛾 ≥ 𝛽, |H | 𝑖∈I 𝛽 𝛾 then the inequality holds: |E[𝑓 (𝑋)] − 𝑓 (𝘇)| ≤ 𝑇(𝜌 + 𝜌 ), where √︃ where I is the set of all items, H denotes the clicked item set |𝑓 (𝑥)−𝑓 (𝘇)| 𝑖 𝜌 = E[|𝑋 − 𝘇| ], and 𝑇 = sup does not 𝛽 𝛽 𝛾 by user 𝑢, and 𝑞 represents the probability of item 𝑖 belonging 𝑥∈𝐺\{𝘇} |𝑥−𝘇| +|𝑥−𝘇| 𝑖 𝑖 depend on 𝑃(𝑋). The proof can be found in [13]. to group 𝑔 . For instance, 𝒒 = [1, 0, 0] with 𝑞 = 1 denotes that item 𝑖 only belongs to the first group. In this work, we sample 𝐷 From Theorem 2.1, we know that the error𝜖 in Eq. 6 is zero if 𝑓 (·) according to the user-item interactions in the training data, and in Eq. 5 is a linear function. However, most existing recommender |H | thus the probability 𝑃(𝒅 ) of user 𝑢 is obtained by where 𝑣 models use non-linear functions to increase the representation 𝑣∈U U represents the user set. As such, we can estimate Eq. (2c) by capacity. In these cases, there is an upper bound of 𝜖 which can ∑︁ be estimated by Theorem 2.2. It can be proven that the common 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊) ≈ 𝑃(𝒅)𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅, 𝒖)) non-linear functions in recommender models (e.g., sigmoid in [29]) 𝒅∈D satisfy the conditions in Theorem 2.2, and the upper bound is ∑︁ (4) small, especially when the distribution of 𝐷 concentrates around = 𝑃(𝒅)𝑓 (𝒖, 𝒊, 𝑀(𝒅, 𝒖)), its expectation [13]. 𝒅∈D where each 𝒅 is a distribution from one user, and we use a 2.3 Backdoor Adjustment Operator function 𝑓 (·) (e.g., FM [29]) to calculate the conditional probability To facilitate the usage of DecRS, we design the operator to 𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅, 𝒖)), similar to conventional recommender models. instantiate backdoor adjustment, which can be easily plugged into • Approximation of E [𝑓 (·)]. The expected value of function recommender models to alleviate bias amplification. From Eq. 5, 𝑓 (·) of 𝒅 in Eq. 4 is hard to compute because we need to calculate the we can find that in addition to 𝒖 and 𝒊, 𝑓 (·) takes 𝑀(𝒅, 𝒖) as the results of 𝑓 (·) for each 𝒅 and the possible values inD are extensive. Í model input where 𝒅 = 𝑃(𝒅)𝒅 . That is, if we can implement 𝒅∈D A popular solution [1, 35] in statistics and machine learning theory 𝑀(𝒅, 𝒖), existing recommender models can take it as one additional is to make the approximation E [𝑓 (·)] ≈ 𝑓 (𝒖, 𝒊, 𝑀(E [𝒅], 𝒖)). 𝒅 𝒅 input to achieve backdoor adjustment. Formally, the approximation takes the outer sum 𝑃(𝒅)𝑓 (·) into Recall that 𝑀 denotes the group-level user representation the calculation within 𝑓 (·): which describes the user preference over item groups. Given ∑︁ 𝒅 = [𝑝(𝑔 ), ...,𝑝(𝑔 )], item group representation 𝒗 = [𝒗 , ..., 𝒗 ], 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊) ≈ 𝑓 (𝒖, 𝒊, 𝑀( 𝑃(𝒅)𝒅, 𝒖)). 1 𝑁 1 𝑁 (5) and user representation 𝒖 = [𝒖 , ..., 𝒖 ] with feature values 𝒙 = ˜ 1 𝐾 𝑢 𝒅∈D ¯ [𝑥 , ...,𝑥 ] [16], we calculate 𝑀(𝒅, 𝒖) by adjustment is essential to alleviate bias amplification. Otherwise, 𝑢,1 𝑢,𝐾 the impact of backdoor adjustment should be controlled. 𝑁 𝐾 ∑︁ ∑︁ ¯ • Symmetric KL Divergence. We employ the symmetric 𝑀(𝒅, 𝒖) = 𝑝(𝑔 )𝒗 ⊙ 𝑥 𝒖 , (7) 𝑎 𝑎 𝑢,𝑏 𝑏 Kullback–Leibler (KL) divergence to quantify the user interest drift 𝑎=1 𝑏=1 in the history. In detail, we divide the historical interaction sequence where ⊙ denotes the element-wise product, and 𝒗 ∈ R is the of user 𝑢 into two parts according to the timestamps. For each part, item group representation for group 𝑔 proposed by us, which is we calculate the historical distribution over item groups by Eq. 3, randomly initialized like 𝒖. The feature values in 𝒙 are usually one, 1 1 1 2 2 2 obtaining 𝒅 = [𝑝 (𝑔 ), ...,𝑝 (𝑔 )] and 𝒅 = [𝑝 (𝑔 ), ...,𝑝 (𝑔 )]. 1 𝑁 1 𝑁 𝑢 𝑢 𝑢 𝑢 𝑢 𝑢 but in some special cases, it could be a float number. For instance, Then, the distance between these two distributions is measured by a user may have two jobs and the feature value for these two the symmetric KL divergence: features can be set as 0.5 separately. Besides, we can also leverage 1 2 2 1 a FM module [29] or other high-order operators [10]. Formally, 𝘂 = 𝐾𝐿(𝒅 |𝒅 ) + 𝐾𝐿(𝒅 |𝒅 ) 𝑢 𝑢 𝑢 𝑢 we can obtain 𝒘 = [𝒅, 𝒙 ] = [𝑝(𝑔 ), ...,𝑝(𝑔 ),𝑥 , ...,𝑥 ] and 𝑢 1 𝑢,1 𝑁 𝑢,𝐾 𝑁 𝑁 ∑︁ 1 ∑︁ 2 (10) 𝑃 (𝑔 ) 𝑃 (𝑔 ) 𝑛 𝑛 𝑢 𝑢 𝒄 = [𝒗 , 𝒖] = [𝒗 , ..., 𝒗 , 𝒖 , ..., 𝒖 ] via concatenation, and then 1 2 1 1 𝑁 𝐾 = 𝑃 (𝑔 ) log + 𝑃 (𝑔 ) log , 𝑢 𝑛 𝑢 𝑛 2 1 𝑃 (𝑔 ) 𝑃 (𝑔 ) 𝑀(𝒅, 𝒖) can be calculated by a second-order FM module: 𝑢 𝑛 𝑢 𝑛 𝑛=1 𝑛=1 𝑁+𝐾 𝑁+𝐾 ∑︁ ∑︁ where 𝘂 denotes the distribution distance of user 𝑢. A higher 𝘂 𝑢 𝑢 𝑀(𝒅, 𝒖) = 𝑤 𝒄 ⊙ 𝑤 𝒄 , (8) 𝑎 𝑎 𝑏 𝑏 represents that the user is easier to change the interest distribution 𝑎=1 𝑏=1 over item groups. Here, we only divide the historical interaction where 𝑀(𝒅, 𝒖) considers the interactions within 𝒖 and 𝒗 like FM, sequence into two parts to reduce the computation cost. More fine- which is the main difference from Eq. 7. Next, the group-level grained division can be explored in future work if necessary. user representation 𝑀(𝒅, 𝒖) can be incorporated into existing Based on the signal of 𝘂 , we utilize an inference strategy to recommender models as one additional user representation. adaptively fuse the prediction scores from the conventional RS Formally, if the generalized recommender models (e.g., FM) are able and DecRS. Specifically, we first train the recommender model by to incorporate multiple feature representations, 𝑀(𝒅, 𝒖) is directly 𝑃(𝑌|𝑈 = 𝒖, 𝐼 = 𝒊) and 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊), respectively, and fed into the models to calculate 𝑓 (𝒖, 𝒊, 𝑀(𝒅, 𝒖)). Otherwise, 𝑓 (·) can their prediction scores are then automatically fused to regulate the be implemented by a later-fusion manner, i.e., 𝑓 (·) = 𝛿∗𝑓 (𝒖, 𝒊)+(1− impact of backdoor adjustment. Formally, ′ ′ 𝛿) ∗ 𝑓 (𝑀(𝒅, 𝒖), 𝒊) where 𝛿 is a hyperparameter and 𝑓 (·) denotes 𝑅𝑆 𝐷𝐸 𝑌 = (1− 𝘂 ˆ ) ∗ 𝑌 + 𝘂 ˆ ∗ 𝑌 , (11) 𝑢,𝑖 𝑢 𝑢 𝑢,𝑖 𝑢,𝑖 the interaction module (e.g., dot product) in recommender models to calculate the prediction score given user/item representations, 𝑅𝑆 where 𝑌 is the inference score for user 𝑢 and item 𝑖 , 𝑌 and 𝑢,𝑖 𝑢,𝑖 such as neural collaborative filtering [ 17]. Then the parameters 𝘃 𝐷𝐸 𝑌 are the prediction scores from the conventional RS and DecRS, 𝑢,𝑖 in the recommender models are optimized by ∑︁ respectively. In particular, 𝘂 ˆ is calculated by ¯ ¯ 𝘃 = arg min 𝑙(𝑓 (𝒖, 𝒊, 𝑀(𝒅, 𝒖)),𝑦 ¯ ), 𝑢,𝑖 𝘂 − 𝘂 (9) 𝑢 𝑚𝑖𝑛 ¯ 𝘂 ˆ = ( ) 𝑢 (12) (𝑢,𝑖,𝑦 ¯ )∈T 𝑢,𝑖 𝘂 − 𝘂 𝑚𝑎𝑥 𝑚𝑖𝑛 where 𝑦 ¯ ∈ {0, 1} represents whether user 𝑢 has interacted with 𝑢,𝑖 where the normalized 𝘂 ˆ ∈ [0, 1], 𝘂 and 𝘂 are the minimum 𝑢 𝑚𝑖𝑛 𝑚𝑎𝑥 item 𝑖 (i.e., 𝑦 ¯ = 1) or not (i.e., 𝑦 ¯ = 0), T denotes the training 𝑢,𝑖 𝑢,𝑖 and maximum symmetric KL divergence values across all users, data, and 𝑙(·) is the loss function, e.g., log loss [17]. respectively. Besides, 𝛼 ∈ [0,+∞) is a hyper-parameter to further 𝑅𝑆 𝐷𝐸 control the weights of 𝑌 and 𝑌 by human intervention. 𝑢,𝑖 𝑢,𝑖 2.4 Inference Strategy Specifically, 𝘂 ˆ becomes larger if 𝛼 → 0 due to 𝘂 ˆ ∈ [0, 1] which 𝑢 𝑢 As mentioned before, DecRS alleviates bias amplification and 𝐷𝐸 makes 𝑌 favor 𝑌 , and 𝘂 ˆ decreases if 𝛼 → +∞. 𝑢,𝑖 𝑢 𝑢,𝑖 produces more robust predictions when user interest drift happens. From Eq. 11, we can find that the inference for the users with Indeed, for some users, bias amplification might be beneficial to 𝐷𝐸 high 𝘂 ˆ will rely more on 𝑌 . That is, 𝘂 automatically adjusts 𝑢 𝑢 𝑢,𝑖 exclude the item groups they dislike. For example, users might only 𝑅𝑆 𝐷𝐸 the balance between 𝑌 and 𝑌 . Besides, we can regulate the like action movies so that they don’t watch the movies in other 𝑢,𝑖 𝑢,𝑖 impact of backdoor adjustment by tuning the hyper-parameter 𝛼 in groups. In these special cases, it makes sense to purely recommend Eq. 12 for different datasets or recommender models. Theoretically, extensive action movies. Therefore, it is better to develop a user- 𝛼 is usually close to 0 because mitigating the spurious correlation specific inference strategy to regulate the impact of backdoor improves the recommendation accuracy for most users. adjustment dynamically. To summarize, the proposed DecRS has three main differences By analyzing the user behavior, we find that many users have from the conventional RS: diverse interest and are likely to have interest drift while few users have stable interest in item groups over time (e.g., only liking action • DecRS models the causal effect 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊) instead of movies). This inspires us to explore the user characteristics: is this the conditional probability 𝑃(𝑌|𝑈 = 𝒖, 𝐼 = 𝒊). user easy to change the interest distribution over item groups? • DecRS equips the recommender models with a backdoor Based on that, we propose a user-specific inference strategy for adjustment operator (e.g., Equation 7). item ranking. If the user is easy to change the interest distribution • DecRS makes recommendations with a user-specific inference over item groups in the history, we assume that he/she has diverse strategy instead of the simple model prediction (e.g., a forward interest and will change it easily in future. And thus backdoor propagation). Table 2: The statistics of the datasets. 3 RELATED WORK Dataset #Users #Items #Interactions # Features #Group In this work, we explore how to alleviate bias amplification of ML-1M 3,883 6,040 575,276 13,408 18 recommender models by causal inference, which is highly related Amazon-Book 29,115 16,845 1,712,409 46,213 253 to fairness, diversity, and causal recommendation. Inverse Propensity Scoring (IPS) [2, 28, 41], which first estimates the Negative Effect of Bias Amplification. Due to the existence propensity score based on some assumptions, and then uses the of feedback loop [7], bias amplification will become increasingly inverse propensity score to re-weight the samples. For instance, serious. Consequently, it will result in many negative issues: 1) Saito et al. estimated the exposure propensity for each user-item narrowing down the user interest gradually, which is similar pair, and re-weighted the samples via IPS to solve the miss-not-at- to the effect of filter bubbles [22]. Worse still, the issue might random problem [30]. However, IPS methods heavily rely on the evolve into echo chambers [14], in which users’ imbalanced interest accurate propensity estimation, and usually suffer from the high is further reinforced by the repeated exposure to similar items; propensity variance. Thus it is often followed by the propensity 2) low-quality items that users dislike might be recommended clipping technique [2, 30]. Another line of causal recommendation purely because they are in the majority group, which deprive the studies the effect of taking recommendations as treatments on recommendation opportunities of other high-quality items, causing user/system behaviors [48], which is totally different from our low recommendation accuracy and unfairness. work because we focus on the causal relations within the models. Fairness in Recommendation. With the increasing attention on the fairness of machine learning algorithms [19], many 4 EXPERIMENTS works explore the definitions of fairness in recommendation and We conduct extensive experiments to demonstrate the effectiveness information retrieval [20, 24, 27]. Generally speaking, they have of our DecRS by investigating the following research questions: two categories: individual fairness and group fairness. Individual • RQ1: How does the proposed DecRS perform across different fairness denotes that similar individuals (e.g., users or items) users in terms of recommendation accuracy? should receive similar treatments (e.g., exposure or clicks), such as • RQ2: How does DecRS perform to alleviate bias amplification, amortized equity of attention [3]. Besides, group fairness indicates compared to the state-of-the-art methods? that all groups are supposed to be treated fairly where individuals • RQ3: How do the different components affect the performance are divided into groups according to the protected attributes (e.g., of DecRS, such as the inference strategy and the implementation item category and user gender) [46]. The particular definitions span of function 𝑀(·)? from discounted cumulative fairness [44], fairness of exposure [31], to multi-sided fairness [5]. 4.1 Experimental Settings Another representative direction in fairness to reduce bias amplification is calibrated recommendation [ 32]. It re-ranks the Datasets. We use two benchmark datasets, ML-1M and Amazon- items to make the distribution of the recommended item groups Book, in different real-world scenarios. 1) ML-1M is a movie follow the proportion in the browsing history. For example, if a recommendation dataset , which involves rich user/item features, user has watched 70% action movies and 30% romance movies, the such as user gender, and movie genre. We partition the items into recommendation list is expected to have the same proportion of groups according to the movie genre. 2) Amazon-Book is one of the movies. Although the fairness-related works, including calibrated Amazon product datasets , where the book items can be divided recommendation, may alleviate bias amplification well, they are into groups based on the book category (e.g., sports). To ensure data making the trade-off between ranking accuracy and fairness [ 21, quality, we adopt the 20-core settings, i.e., discarding the users and 31, 32]. The reason possibly lies in that they neglect the true cause items with less than 20 interactions. We summarize the statistics of of bias amplification. datasets in Table 2. For each dataset, we sort the user-item interactions by the Diversity in Recommendation. Diversity is regarded as one timestamps, and split them into the training, validation, and testing essential direction to get users out of filter bubbles in the subsets with the ratio of 80%, 10%, and 10%. For each interaction with information filtering systems [ 32]. As to recommendation, diversity the rating≥ 4, we treat it as a positive instance. During training, we pursues the dissimilarity of the recommended items [8, 33], where adopt the negative sampling strategy to randomly sample one item similarity can be measured by many factors, such as item category that the user did not interact with before as a negative instance. and embeddings [6]. However, most works might recommend many dissatisfying items when making diverse recommendations. For Baselines. As our proposed DecRS is model-agnostic, we example, the recommender model may trade off the accuracy to instantiate it on two representative recommender models, FM [29] reduce the intra-list similarity by re-ranking [47]. and NFM [16], to alleviate bias amplification and boost the Causal Recommendation. Causal inference has been widely predictive performance. We compare DecRS with the state-of-the- used in many machine learning applications, spanning from art methods that might alleviate bias amplification of FM and NFM computer vision [23, 34], natural language processing [11, 12, 43], to backbone models. In particular, information retrieval [4]. In recommendation, most works on causal • Unawareness [15, 19] removes the features of item groups (e.g., inference [25] focus on debiasing various biases in user feedback, movie genre in ML-1M) from the input of item representation 𝐼 . including position bias [18], clickbait issue [37], and popularity https://grouplens.org/datasets/movielens/1m/. bias [45]. The most representative idea in the existing works is https://jmcauley.ucsd.edu/data/amazon/. Table 3: Overall performance comparison between DecRS and the baselines on ML-1M and Amazon-Book. %improv. denotes the relative performance improvement achieved by DecRS over FM or NFM. The best results are highlighted in bold. FM NFM ML-1M Amazon-Book ML-1M Amazon-Book Method R@10 R@20 N@10 N@20 R@10 R@20 N@10 N@20 R@10 R@20 N@10 N@20 R@10 R@20 N@10 N@20 FM/NFM [16, 29] 0.0676 0.1162 0.0566 0.0715 0.0213 0.0370 0.0134 0.0187 0.0659 0.1135 0.0551 0.0697 0.0222 0.0389 0.0144 0.0199 Unawareness [15] 0.0679 0.1179 0.0575 0.0730 0.0216 0.0377 0.0138 0.0191 0.0648 0.1143 0.0556 0.0708 0.0206 0.0381 0.0133 0.0190 FairCo [21] 0.0676 0.1165 0.0570 0.0720 0.0212 0.0370 0.0135 0.0188 0.0651 0.1152 0.0554 0.0708 0.0219 0.0390 0.0142 0.0199 Calibration [32] 0.0647 0.1149 0.0539 0.0695 0.0202 0.0359 0.0129 0.0181 0.0636 0.1131 0.0526 0.0682 0.0194 0.0335 0.0131 0.0178 Diversity [47] 0.0670 0.1159 0.0555 0.0706 0.0207 0.0369 0.0131 0.0185 0.0641 0.1133 0.0540 0.0693 0.0215 0.0386 0.0140 0.0197 IPS [30] 0.0663 0.1188 0.0556 0.0718 0.0213 0.0369 0.0135 0.0187 0.0648 0.1135 0.0544 0.0692 0.0213 0.0370 0.0137 0.0189 DecRS 0.0704 0.1231 0.0578 0.0737 0.0231 0.0405 0.0148 0.0205 0.0694 0.1218 0.0580 0.0742 0.0236 0.0413 0.0153 0.0211 %improv. 4.14% 5.94% 2.12% 3.08% 8.45% 9.46% 10.45% 9.63% 5.31% 7.31% 5.26% 6.46% 6.31% 6.17% 6.25% 6.03% Table 4: Performance comparison across different user groups on ML-1M and Amazon-Book. Each line denotes the performance over the user group with 𝘂 > the threshold. We omit the results of threshold > 4 due to the similar trend. ML-1M Amazon-Book FM R@20 N@20 R@20 N@20 Threshold FM DecRS %improv. FM DecRS %improv. FM DecRS %improv. FM DecRS %improv. 0 0.1162 0.1231 5.94% 0.0715 0.0737 3.08% 0.0370 0.0405 9.46% 0.0187 0.0205 9.63% 0.5 0.1215 0.1296 6.67% 0.0704 0.0730 3.69% 0.0383 0.0424 10.70% 0.0192 0.0213 10.94% 1 0.1303 0.1412 8.37% 0.0707 0.0741 4.81% 0.0430 0.0479 11.40% 0.0208 0.0232 11.54% 2 0.1432 0.1646 14.94% 0.0706 0.0786 11.33% 0.0518 0.0595 14.86% 0.0231 0.0274 18.61% 3 0.1477 0.1637 10.83% 0.0620 0.0711 14.68% 0.0586 0.0684 16.72% 0.0256 0.0318 24.22% 4 0.1454 0.1768 21.60% 0.0595 0.0737 23.87% 0.0659 0.0793 20.33% 0.0284 0.0362 27.46% NFM R@20 N@20 R@20 N@20 Threshold NFM DecRS %improv. NFM DecRS %improv. NFM DecRS %improv. NFM DecRS %improv. 0 0.1135 0.1218 7.31% 0.0697 0.0742 6.46% 0.0389 0.0413 6.17% 0.0199 0.0211 6.03% 0.5 0.1187 0.1280 7.83% 0.0688 0.0735 6.83% 0.0401 0.0426 6.23% 0.0202 0.0218 7.92% 1 0.1272 0.1391 9.36% 0.0692 0.0747 7.95% 0.0438 0.0473 7.99% 0.0212 0.0234 10.38% 2 0.1452 0.1584 9.09% 0.0701 0.0771 9.99% 0.0530 0.0580 9.43% 0.0234 0.0269 14.96% 3 0.1478 0.1740 17.73% 0.0639 0.0723 13.15% 0.0614 0.0660 7.49% 0.0275 0.0319 16.00% 4 0.1442 0.1775 23.09% 0.0542 0.0699 28.97% 0.0709 0.0795 12.13% 0.0308 0.0371 20.45% • FairCo [21] introduces one error term to control the exposure recommendation list (comprised by the top-20 items). Higher 𝐶 𝐾𝐿 fairness across item groups. In this work, we calculate the error scores suggest a more serious issue of bias amplification. term based on the ranking list sorted by relevance, and its Parameter Settings. We implement our DecRS in the PyTorch coefficient 𝘆 in the ranking target is tuned in {0.01, 0.02, ..., 0.5}. implementation of FM and NFM. Closely following the original • Calibration [32] is one state-of-the-art method to alleviate bias papers [16, 29], we use the following settings: in FM and NFM, amplification. Specifically, it proposes a calibration metric 𝐶 to 𝐾𝐿 the embedding size of user/item features is 64, log loss [17] is measure the imbalance between the history and recommendation applied and the optimizer is set as Adagrad [9]; in NFM, a 64- list, and minimizes𝐶 by re-ranking. Here the hyper-parameter 𝐾𝐿 dimension fully-connected layer is used. We adopt a grid search 𝘆 in the ranking target is searched in {0.01, 0.02, ..., 0.5}. to tune their hyperparameters: the learning rate is searched in • Diversity [47] aims to decrease the intra-list similarity, where {0.005, 0.01, 0.05}; the batch size is tuned in {512, 1024, 2048}; the diversification factor is tuned in {0.01, 0.02, ..., 0.2}. the normalization coefficient is searched in {0, 0.1, 0.2}, and the • IPS [30] is a classical method in causal recommendation. Here we dropout ratio is confirmed in {0.2, 0.3, ..., 0.5}. Besides, 𝛼 in the use 𝑃(𝒅 ) as the propensity of user𝑢 to down-weight the items in proposed inference strategy is tuned in {0.1, 0.2, ..., 10}, and the the majority group during debiasing training, and we employ the model performs the best in {0.2, 0.3, 0.4}, where 𝛼 is close to 0, propensity clipping technique [30] to reduce propensity variance, proving the advantages of our DecRS over the conventional RS as where the clipping threshold is searched in {2, 3, ..., 10}. ¯ discussed in Section 2.4. We use Eq. 8 to implement 𝑀(𝒅, 𝒖) and Evaluation Metrics. We evaluate the performance of all methods the backbone models take 𝑀(𝒅, 𝒖) as one additional feature. The from two perspectives: recommendation accuracy and effectiveness exploration of the late-fusion manner is left to future work because of alleviating bias amplification. In terms of accuracy, two widely- it is not our main contribution. Furthermore, we use the early used metrics [40], Recall@K (R@K) and NDCG@K (N@K), are stopping strategy [38, 42] — stop training if R@10 on the validation adopted under all ranking protocol [36, 39], which test the top-K set does not increase for 10 successive epochs. For all approaches, recommendations over all items that users never interact with in we tune the hyper-parameters to choose the best models w.r.t. R@10 the training data. As to alleviating bias amplification, we use the on the validation set, and report the results on the testing set. We representative calibration metric 𝐶 [32], which quantifies the 𝐾𝐿 released code and data at https://github.com/WenjieWWJ/DecRS. distribution drift over item groups between the history and the new 4.2 Performance Comparison (RQ1 & RQ2) FM NFM 0.48 0.6 FM Calibration DecRS 4.2.1 Overall Performance w.r.t. Accuracy. We present the NFM Calibration DecRS 0.55 0.46 empirical results of all baselines and DecRS in Table 3. Moreover, 0.5 to further analyze the characteristics of DecRS, we split users into 0.44 groups based on the symmetric KL divergence (cf. Eq. 10) and report 0.45 0.42 the performance comparison over the user groups in Table 4. From 0.4 the two tables, we have the following findings: 0.4 0.35 • Unawareness and FairCo only achieve comparable performance 0.38 0.3 or marginal improvements over the vanilla FM and NFM on the 0 0.5 1 2 3 4 0 0.5 1 2 3 4 two datasets. Possible reasons are the trade-offs among different Threshold Threshold user groups. To be more specific, for some users, discarding Figure 4: The performance comparison between the base- group features or preserving group fairness is able to reduce bias lines and DecRS on alleviating bias amplification. amplification and recommend more satisfying items. However, FM for most users with imbalanced interest in item groups, these NFM 0.2 0.2 approaches possibly recommend many disappointing items by FM DecRS (w/o) DecRS NFM DecRS (w/o) DecRS pursuing group fairness. 0.18 0.18 • Calibration and Diversity perform worse than the vanilla 0.16 0.16 backbone models, suggesting that simple re-ranking does hurt the recommendation accuracy. This is consistent with the findings in 0.14 0.14 [32, 47]. Moreover, we ascribe the inferior performance of IPS to the inaccurate estimation and high variance of propensity scores. 0.12 0.12 That is, the propensity cannot precisely estimate the effect of 𝐷 on 𝑈 , even if the propensity clipping technique [30] is applied. 0.1 0.1 0 0.5 1 2 3 4 0 0.5 1 2 3 4 • DecRS effectively improves the recommendation performance of Threshold Threshold FM and NFM on the two datasets. As shown in Table 3, the relative Figure 5: Ablation study of DecRS on ML-1M. improvements of DecRS over FM w.r.t. R@20 are 5.94% and 9.46% on ML-1M and Amazon-Book, respectively. This verifies the Table 5: Effect of the design of 𝑀(·). effectiveness of backdoor adjustment, which enables DecRS to Method R@10 R@20 N@10 N@20 remove the effect of confounder for many users. As a result, many FM 0.0676 0.1162 0.0566 0.0715 DecRS-EP 0.0685 0.1205 0.0573 0.0730 less-interested or low-quality items from the majority group will DecRS-FM 0.0704 0.1231 0.0578 0.0737 not be recommended, thus increasing the accuracy. • As Table 4 shows, with the increase of 𝘂 , the performance 4.3 In-depth Analysis (RQ3) gap between DecRS and the backbone models becomes larger. 4.3.1 Effect of the Inference Strategy . We first answer the For example, in the user group with 𝘂 > 4, the relative question: Is it of importance to conduct the inference strategy for improvements w.r.t. N@20 over FM and NFM are 23.87% and DecRS? Towards this end, one variant “DecRS (w/o)” is constructed 28.97%, respectively. We attribute such improvements to the by disabling the inference strategy and only using the prediction robust recommendation produced by DecRS. Specifically, DecRS 𝐷𝐸 𝑌 in Eq. 11 for inference. We illustrate its results in Figure 5 with equipped with backdoor adjustment is superior in reducing the following key findings. 1) The performance of “DecRS (w/o)” the spurious correlation and predicting users’ diverse interest, drops as compared with that of DecRS, indicating the effectiveness especially for the users with the interest drift (i.e., high 𝘂 ). of the inference strategy. 2) “DecRS (w/o)” still outperforms FM 4.2.2 Performance on Alleviating Bias Amplification. In and NFM consistently, especially over the users with high 𝘂 . This Figure 4, we present the performance comparison w.r.t. 𝐶 𝐾𝐿 suggests the superiority of DecRS over the conventional RS. It between the vanilla FM/NFM, calibrated recommendation, and achieves more accurate predictions of user interest by mitigating the DecRS on ML-1M. Due to space limitation, we omit other baselines effect of the confounder via backdoor adjustment approximation. that perform worse than calibrated recommendation and the results on Amazon-Book which have similar trends. We have the following 4.3.2 Effect of the Implementation of 𝑀(·). As mentioned in observations from Figure 4. 1) As compared to the vanilla models, Section 2.3, we can implement the function 𝑀(·) by either Eq. 7 or calibrated recommendation achieves lower 𝐶 scores, suggesting Eq. 8. We investigate the influence of different implementations and 𝐾𝐿 that the bias amplification is reduced. However, it comes at the construct two variants, DecRS-EP and DecRS-FM, which employ cost of lower recommendation accuracy, as shown in Table 3. 2) the element-wise product in Eq. 7 and the FM module in Eq. 8, Our DecRS consistently achieves lower 𝐶 scores than calibrated respectively. We summarize their performance comparison over 𝐾𝐿 recommendation across all user groups. More importantly, DecRS FM on ML-1M in Table 5. While being inferior to DecRS-FM, DecRS- does not hurt the recommendation accuracy. This evidently shows EP still performs better than FM. This proves the superiority of that DecRS solves the bias amplification problem well by embracing DecRS-FM over DecRS-EP, and also shows that DecRS with different causal modeling for recommendation, and justifies the effectiveness implementations still surpasses the vanilla backbone models, which of backdoor adjustment on reducing spurious correlations. further suggests the stability and effectiveness of DecRS. Recall@20 C_KL C_KL Recall@20 5 CONCLUSION AND FUTURE WORK [18] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In WSDM. ACM, 781–789. In this work, we explained that bias amplification in recommender [19] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual Fairness. In NeuIPS. Curran Associates, Inc., 4066–4076. models is caused by the confounder from a causal view. To [20] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and alleviate bias amplification, we proposed a novel DecRS with an Fernando Diaz. 2018. Towards a fair marketplace: Counterfactual evaluation of approximation operator for backdoor adjustment. DecRS explicitly the trade-off between relevance, fairness and satisfaction in recommendation systems. In CIKM. ACM, 2243–2251. models the causal relations in recommender models, and leverages [21] Marco Morik, Ashudeep Singh, Jessica Hong, and Thorsten Joachims. 2020. backdoor adjustment to remove the spurious correlation caused Controlling Fairness and Bias in Dynamic Learning-to-Rank. ACM, 429–438. by the confounder. Besides, we developed an inference strategy to [22] Tien T Nguyen, Pik-Mai Hui, F Maxwell Harper, Loren Terveen, and Joseph A Konstan. 2014. Exploring the filter bubble: the effect of using recommender regulate the impact of backdoor adjustment. Extensive experiments systems on content diversity. In WWW. ACM, 677–686. validate the effectiveness of DecRS on alleviating bias amplification [23] Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji- and improving recommendation accuracy. Rong Wen. 2021. Counterfactual VQA: A Cause-Effect Look at Language Bias. In CVPR. IEEE. This work takes the first step to incorporate backdoor adjustment [24] Gourab K Patro, Arpita Biswas, Niloy Ganguly, Krishna P Gummadi, and Abhijnan into existing recommender models. In future, there are many Chakraborty. 2020. Fairrec: Two-sided fairness for personalized recommendations in two-sided platforms. In WWW. ACM, 1194–1204. research directions that deserve our attention. 1) The discovery [25] Judea Pearl. 2009. Causality. Cambridge university press. of more fine-grained causal relations in recommendation models. [26] Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of This work starts to mitigate the spurious correlations caused by Cause and Effect (1st ed.). Basic Books, Inc. [27] Evaggelia Pitoura, Georgia Koutrika, and Kostas Stefanidis. 2020. Fairness in the confounder while recommendation is an extremely complex Rankings and Recommenders.. In EDBT. ACM, 651–654. scenario, involving many observed/hidden variables that are [28] Zhen Qin, Suming J. Chen, Donald Metzler, Yongwoo Noh, Jingzheng Qin, and Xuanhui Wang. 2020. Attribute-Based Propensity for Unbiased Learning in waiting for causal discovery. 2) The proposed DecRS has the Recommender Systems: Algorithm and Case Studies. In KDD. ACM, 2359–2367. potential to reduce various biases in information retrieval and [29] Steffen Rendle. 2010. Factorization machines. In ICDM. IEEE, 995–1000. recommendation, such as position bias and popularity bias. The [30] Yuta Saito, Suguru Yaginuma, Yuta Nishino, Hayato Sakata, and Kazuhide Nakata. 2020. Unbiased Recommender Learning from Missing-Not-At-Random Implicit causes of the biases are also related to the imbalanced training data. Feedback. In WSDM. ACM, 501–509. 3) Bias amplification is one essential cause of the filter bubble [ 22] [31] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. and echo chambers [14]. The effect of DecRS on mitigating these In KDD. ACM, 2219–2228. [32] Harald Steck. 2018. Calibrated recommendations. In RecSys. ACM, 154–162. issues can be studied in future work. [33] Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, and Mark REFERENCES Coates. 2020. A Framework for Recommending Accurate and Diverse Items Using [1] Shoshana Abramovich and Lars-Erik Persson. 2016. Some new estimates of the Bayesian Graph Convolutional Neural Networks. In KDD. ACM, 2030–2039. ‘Jensen gap’. Journal of Inequalities and Applications 2016, 1 (2016), 1–9. [34] Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. 2020. Long-Tailed [2] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W. Bruce Croft. 2018. Classification by Keeping the Good and Removing the Bad Momentum Causal Unbiased Learning to Rank with Unbiased Propensity Estimation. In SIGIR. ACM, Effect. In NeuIPS. 385–394. [35] Tan Wang, Jianqiang Huang, Hanwang Zhang, and Qianru Sun. 2020. Visual [3] Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. 2018. Equity of attention: commonsense r-cnn. In CVPR. IEEE, 10760–10770. Amortizing individual fairness in rankings. In SIGIR. ACM, 405–414. [36] Wenjie Wang, Fuli Feng, Xiangnan He, Liqiang Nie, and Tat-Seng Chua. 2021. [4] Stephen Bonner and Flavian Vasile. 2018. Causal embeddings for recommendation. Denoising implicit feedback for recommendation. In WSDM. ACM, 373–381. In RecSys. ACM, 104–112. [37] Wenjie Wang, Fuli Feng, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. [5] Robin Burke. 2017. Multisided fairness for recommendation. In FAT ML. 2021. Click can be Cheating: Counterfactual Recommendation for Mitigating [6] Praveen Chandar and Ben Carterette. 2013. Preference based evaluation measures Clickbait Issue. In SIGIR. ACM. for novelty and diversity. In SIGIR. ACM, 413–422. [38] Wenjie Wang, Minlie Huang, Xin-Shun Xu, Fumin Shen, and Liqiang Nie. 2018. [7] Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. 2018. How Chat more: Deepening and widening the chatting topic via a deep model. In algorithmic confounding in recommendation systems increases homogeneity SIGIR. ACM, 255–264. and decreases utility. In RecSys. ACM, 224–232. [39] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. [8] Peizhe Cheng, Shuaiqiang Wang, Jun Ma, Jiankai Sun, and Hui Xiong. 2017. Neural Graph Collaborative Filtering. In SIGIR. ACM, 165–174. Learning to Recommend Accurate and Diverse Items. In WWW. IW3C2, 183–192. [40] Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. [9] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods 2020. Disentangled Graph Collaborative Filtering. In SIGIR. ACM, 1001–1010. for online learning and stochastic optimization. JMLR 12, 7 (2011). [41] Yixin Wang, Dawen Liang, Laurent Charlin, and David M Blei. 2018. The [10] Fuli Feng, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2021. Cross-GCN: deconfounded recommender: A causal inference approach to recommendation. Enhancing Graph Convolutional Network with k-Order Feature Interactions. In arXiv:1808.06581. TKDE (2021). [42] Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng [11] Fuli Feng, Weiran Huang, Xin Xin, Xiangnan He, and Tat-Seng Chua. 2021. Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Should Graph Convolution Trust Neighbors? A Simple Causal Inference Method. Recommendation of Micro-video. In MM. ACM, 1437–1445. In SIGIR. ACM. [43] Yiquan Wu, Kun Kuang, Yating Zhang, Xiaozhong Liu, Changlong Sun, Jun Xiao, [12] Fuli Feng, Jizhi Zhang, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2021. Yueting Zhuang, Luo Si, and Fei Wu. 2020. De-Biased Court’s View Generation Empowering Language Understanding with Counterfactual Reasoning. In ACL- with Causality. In EMNLP. ACL, 763–780. IJCNLP Findings. ACL. [44] Ke Yang and Julia Stoyanovich. 2017. Measuring fairness in ranked outputs. In [13] Xiang Gao, Meera Sitharam, and Adrian E. Roitberg. 2019. Bounds on the Jensen SSDBM. ACM, 1–6. Gap, and Implications for Mean-Concentrated Distributions. AJMAA 16, 14 [45] Yang Zhang, Fuli Feng, Xiangnan He, Tianxin Wei, Chonggang Song, Guohui (2019), 1–16. Issue 2. Ling, and Yongdong Zhang. 2021. Causal Intervention for Leveraging Popularity [14] Yingqiang Ge, Shuya Zhao, Honglu Zhou, Changhua Pei, Fei Sun, Wenwu Ou, Bias in Recommendation. In SIGIR. ACM. and Yongfeng Zhang. 2020. Understanding Echo Chambers in E-Commerce [46] Ziwei Zhu, Jianling Wang, and James Caverlee. 2020. Measuring and Mitigating Recommender Systems. In SIGIR. ACM, 2261–2270. Item Under-Recommendation Bias in Personalized Ranking Systems. In SIGIR. [15] Nina Grgic-Hlaca, Muhammad Bilal Zafar, Krishna P Gummadi, and Adrian ACM, 449–458. Weller. 2016. The case for process fairness in learning: Feature selection for fair [47] Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. 2005. decision making. In NeuIPS. Improving recommendation lists through topic diversification. In WWW. ACM, [16] Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse 22–32. predictive analytics. In SIGIR. ACM, 355–364. [48] Hao Zou, Peng Cui, Bo Li, Zheyan Shen, Jianxin Ma, Hongxia Yang, and Yue He. [17] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng 2020. Counterfactual Prediction for Bundle Treatment. NeuIPS (2020). Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Computing Research Repository arXiv (Cornell University)

Deconfounded Recommendation for Alleviating Bias Amplification

Loading next page...
 
/lp/arxiv-cornell-university/deconfounded-recommendation-for-alleviating-bias-amplification-hi2kBcrBTL

References (53)

eISSN
ARCH-3344
DOI
10.1145/3447548.3467249
Publisher site
See Article on Publisher Site

Abstract

Deconfounded Recommendation for Alleviating Bias Amplification 1 12∗ 3 12 1 Wenjie Wang , Fuli Feng , Xiangnan He , Xiang Wang , and Tat-Seng Chua 1 2 3 National University of Singapore, Sea-NExT Joint Lab, University of Science and Technology of China {wenjiewang96,fulifeng93,xiangnanhe}@gmail.com,xiangwang@u.nus.edu,dcscts@nus.edu.sg ABSTRACT ACM Reference Format: Wenjie Wang, Fuli Feng, Xiangnan He, Xiang Wang, and Tat-Seng Chua. Recommender systems usually amplify the biases in the data. The 2021. Deconfounded Recommendation for Alleviating Bias Amplification. model learned from historical interactions with imbalanced item In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery distribution will amplify the imbalance by over-recommending and Data Mining (KDD ’21), August 14–18, 2021, Virtual Event, Singapore. items from the major groups. Addressing this issue is essential ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3447548.3467249 for a healthy ecosystem of recommendation in the long run. Existing works apply bias control to the ranking targets (e.g., 1 INTRODUCTION calibration, fairness, and diversity), but ignore the true reason for Recommender System (RS) has been widely used to achieve bias amplification and trade off the recommendation accuracy. personalized recommendation in most online services, such as In this work, we scrutinize the cause-effect factors for bias social networks and advertising [39]. Its default choice is to learn amplification, identifying the main reason lies in the confounder user interest from historical interactions (e.g., clicks and purchases), effect of imbalanced item distribution on user representation and which typically exhibit data bias, i.e., the distribution over item prediction score. The existence of such confounder pushes us to go groups (e.g., the genre of movies) is imbalanced. Consequently, beyond merely modeling the conditional probability and embrace recommender models face the bias amplification issue [32]: over- the causal modeling for recommendation. Towards this end, we recommending the majority group and amplifying the imbalance. propose a Deconfounded Recommender System (DecRS), which Figure 1(a) illustrates this issue with an example in movie models the causal effect of user representation on the prediction recommendation, where 70% of the movies watched by a user are score. The key to eliminating the impact of the confounder lies action movies, but action movies take 90% of the recommendation in backdoor adjustment, which is however difficult to do due to slots. Undoubtedly, over-emphasizing the items from the majority the infinite sample space of the confounder. For this challenge, we groups will limit a user’s view and decrease the effectiveness of contribute an approximation operator for backdoor adjustment recommendations. Worse still, due to feedback loop [7], such bias which can be easily plugged into most recommender models. Lastly, amplification will intensify with time, causing more issues like filter we devise an inference strategy to dynamically regulate backdoor bubbles [22] and echo chambers [14]. adjustment according to user status. We instantiate DecRS on two Existing works alleviate bias amplification by introducing bias representative models FM [29] and NFM [16], and conduct extensive control into the ranking objective of recommender models, which experiments over two benchmarks to validate the superiority of are mainly from three perspectives: 1) fairness [21, 31], which our proposed DecRS. pursues equal exposure opportunities for items of different groups; 2) diversity [6], which intentionally increases the covered groups CCS CONCEPTS in a recommendation list, and 3) calibration [32], which encourages • Information systems→ Recommender systems; Collabora- the distribution of recommended item groups to follow that of tive filtering . interacted items of the user. However, these methods alleviate bias amplification at the cost of sacrificing recommendation KEYWORDS accuracy [31, 32]. More importantly, the fundamental question is not answered: what is the root reason for bias amplification? Deconfounded Recommendation, User Interest Imbalance, Bias After inspecting the cause-effect factors in recommender Amplification modeling, we attribute bias amplification to a confounder [25]. The ∗ Corresponding author: Fuli Feng (fulifeng93@gmail.com). This research is supported historical distribution of a user over item groups (e.g., [0.7, 0.3] in by the Sea-NExT Joint Lab, the National Natural Science Foundation of China Figure 1(a)) is a confounder between the user’s representation and (61972372), and National Key Research and Development Program of China (2020AAA0106000). the prediction score. In the conventional RS, the user/item features (e.g., ID and attributes) are first embedded into the representation Permission to make digital or hard copies of all or part of this work for personal or vectors, which are then fed into an interaction module (e.g., classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation factorization machines (FM) [29]) to calculate the prediction score on the first page. Copyrights for components of this work owned by others than ACM for the user-item pair [17]. In other words, recommender models must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, estimate the conditional probability of clicks given user/item to post on servers or to redistribute to lists, requires prior specific permission and /or a fee. Request permissions from permissions@acm.org. representations. From a causal view, user and item representations KDD ’21, August 14–18, 2021, Virtual Event, Singapore can be regarded as the causes of the prediction score, and the © 2021 Association for Computing Machinery. interaction module should encode the causal relations between ACM ISBN 978-1-4503-8332-5/21/08. . . $15.00 https://doi.org/10.1145/3447548.3467249 them [25]. But inspecting the causal relations, we find that the arXiv:2105.10648v1 [cs.IR] 22 May 2021 Action movie Romance movie Majority group Minority group D I D I U User representation User browsing history I Item representation 70% 30% 0.70 User historical distribution 0.59 Bias amplification Recommender Feedback loop 0.58 X over item groups 0.47 Backdoor Group-level user representation New recommendation list M Adjustment 90% 10% ratings Y Prediction score 4 5 U Y U User feedback (b) Prediction score difference between the items (a) (b) (a) An example of bias amplification. in the majority and minority groups over ML-1M. Figure 2: (a) The causal graph of conventional RS. (b) The Underwater (2020) Historical distribution Marriage Story (2019) 𝑝 (𝑔 ) % & causal graph used in DecRS. Action movie of user u over item groups. Romance movie Rating by user u: 3.0/5.0 Rating by user u: 5.0/5.0 𝑔 𝑔 ' ( item groups • We instantiate DecRS on two representative recommender Item representation User representation Item representation Interaction Interaction models, and conduct extensive experiments on two benchmarks module module Rating by user u: 3.0 < 5.0 which validate the effectiveness of our proposal. 0.6 0.5 Prediction score: 0.6 > 0.5 (c) An example on the cause of bias amplification. 2 METHODOLOGY Figure 1: Illustration of bias amplification. In this section, we first analyze the conventional RS from a causal hidden confounder, i.e., the user historical distribution over item view and explain the reason for bias amplification, which is followed groups, affects both the user representation and the prediction by the introduction of the proposed DecRS. score. Due to the modeling of conditional probability, recommender models are affected by the confounder and thus suffer from a 2.1 A Causal View on Bias Amplification spurious correlation between the user and the prediction score. To study bias amplification, we build up a causal graph to explicitly That is, given two item groups, the one that the user interacted analyze the causal relations in the conventional RS. more in the history will receive higher prediction scores, even though their items have the same matching level. Figure 1(b) shows 2.1.1 Causal Graph. We scrutinize the causal relations in empirical evidence from the FM on ML-1M dataset: among the items recommender models and abstract a causal graph, as shown in with the same ratings (e.g., ratings = 4), the ones in the majority Figure 2(a), which consists of five variables: 𝑈 , 𝐼 , 𝐷 , 𝑀 , and 𝑌 . Note group will receive higher prediction scores. Therefore, the items in that we use the capital letter (e.g., 𝑈 ), lowercase letter (e.g., 𝒖), and the majority group, even including those undesirable or low-quality letter in the calligraphic font (e.g., U) to represent a variable, its ones (see example in Figure 1(c)), could deprive the recommendation particular value, and its sample space, respectively. In particular, opportunities of the items in the minority group. • 𝑈 denotes user representation. For one user, 𝒖 = [𝒖 , ..., 𝒖 ] 1 𝐾 The key to addressing bias amplification lies in eliminating the represents the embeddings of 𝐾 user features (e.g., ID, gender, spurious correlation in the recommender modeling. To achieve this and age) [29], where 𝒖 ∈ R is one feature embedding. goal, we need to push the conventional RS to go beyond modeling • 𝐼 is item representation and each 𝒊 denotes the embeddings of the conditional probability and embrace the causal modeling several item features (e.g., ID and genre) which are similar to 𝒖. of user representation on the prediction score. We propose a • 𝐷 represents the user historical distribution over item groups. novel Deconfounded Recommender System (DecRS), which explicitly Groups can be decided by item attributes or similarity [32]. Given models the causal relations during training, and leverages backdoor 𝑁 item groups {𝑔 , ...,𝑔 }, 𝒅 = [𝑝 (𝑔 ), ...,𝑝 (𝑔 )] ∈ R is a 1 𝑁 𝑢 𝑢 1 𝑢 𝑁 adjustment [25] to eliminate the impact of the confounder. However, particular value of 𝐷 when the user is𝑢, where 𝑝 (𝑔 ) is the click 𝑢 𝑛 the sample space of the confounder is huge, making the traditional frequency of user 𝑢 over group 𝑔 in the history . For instance, implementation of backdoor adjustment infeasible. To this end, for the user 𝑢 in Figure 1(a), 𝒅 is [0.7, 0.3] if 𝑁 = 2. we derive an approximation of backdoor adjustment, which is • 𝑀 is the group-level user representation. A particular value 𝒎 ∈ universally applicable to most recommender models. Lastly, we R is a vector which describes how much the user likes different propose a user-specific inference strategy to dynamically regulate item groups. 𝒎 can be obtained from the values of 𝑈 and 𝐷 . the influence of backdoor adjustment based on the user status. We That is, 𝑀 is deterministic if 𝑈 and 𝐷 are given so that we can instantiate DecRS on two representative models FM [29] and neural represent 𝒎 by a function 𝑀(𝒅, 𝒖) with 𝒅 and 𝒖 as inputs. To factorization machines (NFM) [16]. Extensive experiments over two keep generality, we incorporate 𝑀 into the causal graph because benchmarks demonstrate that our DecRS not only alleviates bias many recommender models (e.g., FM) have modeled the user amplification effectively, but also improves the recommendation preference over item groups explicitly or implicitly by using the accuracy over the backbone models. group-related features (e.g., movie genre). Overall, the main contributions of this work are threefold: • 𝑌 with 𝑦 ∈ [0, 1] is the prediction score for the user-item pair. • We construct a causal graph to analyze the causal relations The edges in the graph describe the causal relations between in recommender models, which reveals the cause of bias variables, e.g., 𝑈 → 𝑌 means that 𝑈 has a direct causal effect [25] amplification from a causal view. on 𝑌 , i.e., changes on 𝑈 will affect the value of 𝑌 . In particular, • We propose a novel DecRS with an approximation of backdoor adjustment to eliminate the impact of the confounder, which can In this work, we use click to represent any implicit feedback, such as purchase and be incorporated into existing recommender models to alleviate watch. For brevity, 𝑢 and 𝑖 may be used to denote the user and item, respectively. bias amplification. Besides, 𝑛 is used to represent any value in {1, 2, ..., 𝑁}. avg. prediction score Time t 𝒑 𝒈 , 𝒖 𝟐 • 𝐷 → 𝑈 : the user historical distribution over item groups Romance movie Romance movie 10% 20% affects user representation 𝑈 , making it favor the group with a (Minority group) 30% 50% higher click frequency (i.e., majority group). This is because user 𝟏 0.6 90% 0.4 Action movie 80% 70% representation is optimized to fit the imbalanced historical data. (Majority group) 50% 𝟎. 𝟔 • (𝐷,𝑈) → 𝑀 : 𝐷 and𝑈 decide the group-level user representation. 0.8 0.2 Training data Testing data • (𝑈, 𝑀, 𝐼) → 𝑌 : The edges show that 𝑈 affects 𝑌 by two paths: 1) the direct path 𝑈 → 𝑌 which denotes the user’s pure preference 𝟎. 𝟐 𝒅 𝒅 𝒑 (𝒈 ) 𝒖 𝒑 (𝒈 ) 𝒖 𝒖 𝒏 𝒖 𝒏 over the item, and 2) the indirect path 𝑈 → 𝑀 → 𝑌 , indicating drift 𝟎. 𝟖 𝟏 𝟎 𝟎. 𝟒 𝒑 𝒈 , that the prediction score could be high because the user shows 𝒖 𝟏 𝒈 𝒈 𝟏 𝟐 group 𝒈𝟏𝒈 group Action movie interest in the item group rather than the item. (b) Possible values of D and the probabilities. (a) User interest is changing over time. According to the causal theory [25], since 𝐷 affects both 𝑈 and Figure 3: (a) Illustration of user interest drift. (b) An example 𝑌 , 𝐷 is a confounder between 𝑈 and 𝑌 , resulting in the spurious of the distribution of 𝐷 when the item group number is 2. correlation when estimating the correlation between 𝑈 and 𝑌 . Each node in the line represents a particular value 𝒅, and a 2.1.2 Conventional RS. Due to the confounder, existing recom- darker color denotes a higher probability of 𝒅, i.e., 𝑃(𝒅). mender models that estimate the conditional probability 𝑃(𝑌|𝑈, 𝐼) recommendation list and narrow down the user interest. Besides, face the spurious correlation, which leads to bias amplification. the undesirable and low-quality items in the majority group Formally, given 𝑈 = 𝒖 and 𝐼 = 𝒊, we can derive the conditional will dissatisfy users, leading to poor recommendation accuracy. probability 𝑃(𝑌|𝑈, 𝐼) by: Worse still, by analyzing Eq. 1(d), we have a new observation: the 𝑃(𝑌|𝑈 = 𝒖, 𝐼 = 𝒊) prediction score 𝑌 heavily relies on the user historical distribution Í Í 𝑃(𝒅)𝑃(𝒖|𝒅)𝑃(𝒎|𝒅, 𝒖)𝑃(𝒊)𝑃(𝑌|𝒖, 𝒊, 𝒎) over item groups, i.e., 𝒅 . Once users’ future interest in item groups 𝒅∈D 𝒎∈M 𝑢 = (1a) changes (i.e., user interest drift), the recommendations will be 𝑃(𝒖)𝑃(𝒊) ∑︁ ∑︁ dissatisfying. For instance, as shown in Figure 3(a), the user interest = 𝑃(𝒅|𝒖)𝑃(𝒎|𝒅, 𝒖)𝑃(𝑌|𝒖, 𝒊, 𝒎) (1b) in item groups is not stable, and thus the correlation caused by 𝒅∈D 𝒎∈M ∑︁ the confounder 𝐷 will not be reliable if the distribution 𝑑 is = 𝑃(𝒅|𝒖)𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅, 𝒖)) (1c) inconsistent between training and testing data. 𝒅∈D = 𝑃(𝒅 |𝒖)𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅 , 𝒖)), (1d) 𝑢 𝑢 2.2 Deconfounded Recommender System To resolve the impact of the confounder, DecRS estimates the causal where D andM are the sample spaces of 𝐷 and 𝑀 , respectively . effect of user representation on the prediction score. Experimentally, In particular, Eq. (1a) follows the law of total probability; Eq. (1b) the target can be achieved by collecting intervened data where the is obtained by Bayes rule; since 𝑀 can only take a value 𝑀(𝒅, 𝒖) if user representation is forcibly adjusted to eliminate the impact of 𝑈 = 𝒖 and 𝐷 = 𝒅, i.e., 𝑃(𝑀(𝒅, 𝒖)|𝒅, 𝒖) = 1, the sum overM in Eq. confounder. However, such an experiment is too costly to achieve in (1b) is removed; 𝐷 is known if 𝑈 = 𝒖 is given. Thus the probability large-scale and faces the risk of hurting user experience in practice. of 𝒖 having the distribution 𝒅 (i.e., 𝑃(𝒅|𝒖)) is 1 if and only if 𝒅 is DecRS thus resorts to the causal technique: backdoor adjustment [25, 𝒅 ; otherwise 𝑃(𝒅|𝒖) = 0, where 𝒅 is the historical distribution of 𝑢 𝑢 26, 41], which enables the estimation of causal effect from the user 𝑢 over item groups. observational data. From Eq. (1d), we can find that 𝒅 does not only affect the user representation 𝒖 but also affects 𝑌 via 𝑀(𝒅 , 𝒖), causing the 2.2.1 Backdoor Adjustment. According to the theory of backdoor spurious correlation: given the item 𝑖 in a group 𝑔 , the more items adjustment [25], the target of DecRS is formulated as: 𝑃(𝑌|𝑑𝑜(𝑈 = in group 𝑔 the user 𝑢 has clicked in the history, the higher the 𝒖), 𝐼 = 𝒊) where 𝑑𝑜(𝑈 = 𝒖) can be intuitively seen as cutting off the prediction score 𝑌 becomes. In other words, the high prediction edge 𝐷 → 𝑈 in the causal graph and blocking the effect of 𝐷 on 𝑈 scores are caused by the users’ historical interest in the group (cf. Figure 2(b)). We then derive the specific expression of backdoor instead of the items themselves. From the perspective of model adjustment. Formally, prediction, 𝒅 affects 𝒖, which makes 𝒖 favor the majority group. In 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊) Eq. (1d), a higher click frequency 𝑝 (𝑔 ) in 𝒅 will make 𝑀(𝒅 , 𝒖) 𝑢 𝑛 𝑢 𝑢 ∑︁ represent a strong interest in group 𝑔 , increasing the prediction = 𝑃(𝒅|𝑑𝑜(𝑈 = 𝒖))𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝒊, 𝑀(𝒅,𝑑𝑜(𝑈 = 𝒖))) (2a) scores of items in group 𝑔 via 𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅 , 𝒖)). As such, the 𝑛 𝑢 𝒅∈D ∑︁ items in the majority group, even including the low-quality ones, = 𝑃(𝒅)𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝒊, 𝑀(𝒅,𝑑𝑜(𝑈 = 𝒖))) (2b) are easy to have high prediction scores due to the effect of the 𝒅∈D confounder 𝐷 . They occupy the recommendation opportunities of ∑︁ = 𝑃(𝒅)𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅, 𝒖)), (2c) items in the minority group, and thus bias amplification happens. 𝒅∈D The spurious correlation is harmful for most users because the items in the majority group are likely to dominate the where the derivation of Eq. (2a) is the same as Eq. (1c), which follows the law of total probability and Bayes rule. Besides, Eq. (2b) and Theoretically, 𝐷 has an infinite sample space. But the values are finite in a specific Eq. (2c) are obtained by two do calculus rules: insertion/deletion of dataset. To simplify the notations, we use the discrete set D to represent the sample space of 𝐷 , and so is 𝑀 . actions and action/observation exchange in Theorem 3.4.1 of [25]. Table 1: Key notations and descriptions. As compared to Eq. 1(d), DecRS estimates the prediction score Notation Description with consideration of every possible value of 𝐷 subject to the prior 𝒖 = [𝒖 , ..., 𝒖 ], 𝒖 ∈ R The representation vectors of 𝐾 user features. 1 𝐾 𝑘 𝑃(𝒅), rather than the probability of 𝒅 conditioned on 𝒖. Therefore, The feature values of a user’s 𝐾 features [29], e.g., the items in the majority group will not receive high prediction 𝒙 = [𝑥 , ...,𝑥 ] 𝑢 𝑢,1 𝑢,𝐾 [0.5, 1, ..., 0.2] . scores purely because of a high click probability in 𝒅 . And thus 𝑝 (𝑔 ) denotes the click frequency of user𝑢 over 𝑢 𝑛 𝒅 = [𝑝 (𝑔 ), ...,𝑝 (𝑔 )] 𝑢 𝑢 1 𝑢 𝑁 backdoor adjustment alleviates bias amplification by removing the group 𝑔 in the history, e.g., 𝒅 = [0.8, 0.2] . 𝑛 𝑢 The group-level representation of user 𝑢 under a effect of 𝐷 on 𝑈 . 𝐻 𝒎 = 𝑀(𝒅, 𝒖) ∈ R historical distribution 𝒅 . Intuitively, as shown in Figure 3(b), 𝐷 has extensive possible H The set of the items clicked by user 𝑢 . values in a specific dataset, i.e., users have various historical U,I The user and item sets, respectively. distributions over item groups. In DecRS, the prediction score 𝑞 denotes the probability of item 𝑖 belonging to 𝑖 𝑖 𝑖 𝑁 𝑔 𝒒 = [𝑞 , ...,𝑞 ] ∈ R 𝑌 considers various possible values of 𝐷 . As such, 1) inevitably, 𝑔 𝑔 1 𝑁 𝑖 group 𝑔 , e.g., 𝒒 = [1, 0, 0] . DecRS removes the dependency on 𝒅 in Eq. 1(d) and mitigates the 𝐻 𝒗 = [𝒗 , ..., 𝒗 ], 𝒗 ∈ R 𝒗 denotes the representation of group 𝑔 . 1 𝑁 𝑛 𝑛 𝑛 spurious correlation, and 2) theoretically, when user interest drift The symmetric KL divergence value of user 𝑢 and 𝘂 ,𝘂 ˆ 𝑢 𝑢 the normalized one, respectively. happens in the testing data, DecRS can produce a more robust and accurate prediction because the model has “seen” many different values of 𝐷 during training and doesn’t heavily depend on the The error of the approximation 𝜖 is measured by the Jensen gap [1]: unreliable distribution 𝒅 in Eq. 1(d). 𝜖 = |E [𝑓 (·)] −𝑓 (𝒖, 𝒊, 𝑀(E [𝒅], 𝒖))|. 2.2.2 Backdoor Adjustment Approximation. Theoretically, the (6) 𝒅 𝒅 sample space of 𝐷 is infinite, which makes the calculation of Eq. Theorem 2.1. If 𝑓 is a linear function with a random variable 𝑋 (2c) intractable. Therefore, it is essential to derive an efficient as the input, then 𝐸[𝑓 (𝑋)] = 𝑓 (𝐸[𝑋]) holds under any probability approximation of Eq. (2c). distribution 𝑃(𝑋). Refer to [1, 13] for the proof. • Sampling of 𝐷. To estimate the distribution of 𝐷 , we sample users’ historical distributions over item groups in the training data, Theorem 2.2. If a random variable 𝑋 with the probability which comprise a discrete set D. Formally, given a user 𝑢, 𝒅 = distribution 𝑃(𝑋) has the expectation 𝘇, and the non-linear function [𝑝 (𝑔 ), ...,𝑝 (𝑔 )] ∈ D and each click frequency 𝑝 (𝑔 ) over 𝑢 1 𝑢 𝑁 𝑢 𝑛 𝑓 : 𝐺 → R where 𝐺 is a closed subset of R, following: group 𝑔 is calculated by (1) 𝑓 is bounded on any compact subset of 𝐺; ∑︁ (2) |𝑓 (𝑥) − 𝑓 (𝘇)| = 𝑂(|𝑥 − 𝘇| ) at 𝑥 → 𝘇 for 𝛽 > 0; 𝑖∈H 𝑔 𝑢 𝑛 𝑝 (𝑔 ) = 𝑝(𝑔 |𝑖)𝑝(𝑖|𝑢) = , (3) 𝑢 𝑛 𝑛 (3) |𝑓 (𝑥)| = 𝑂(|𝑥| ) as 𝑥 → +∞ for 𝛾 ≥ 𝛽, |H | 𝑖∈I 𝛽 𝛾 then the inequality holds: |E[𝑓 (𝑋)] − 𝑓 (𝘇)| ≤ 𝑇(𝜌 + 𝜌 ), where √︃ where I is the set of all items, H denotes the clicked item set |𝑓 (𝑥)−𝑓 (𝘇)| 𝑖 𝜌 = E[|𝑋 − 𝘇| ], and 𝑇 = sup does not 𝛽 𝛽 𝛾 by user 𝑢, and 𝑞 represents the probability of item 𝑖 belonging 𝑥∈𝐺\{𝘇} |𝑥−𝘇| +|𝑥−𝘇| 𝑖 𝑖 depend on 𝑃(𝑋). The proof can be found in [13]. to group 𝑔 . For instance, 𝒒 = [1, 0, 0] with 𝑞 = 1 denotes that item 𝑖 only belongs to the first group. In this work, we sample 𝐷 From Theorem 2.1, we know that the error𝜖 in Eq. 6 is zero if 𝑓 (·) according to the user-item interactions in the training data, and in Eq. 5 is a linear function. However, most existing recommender |H | thus the probability 𝑃(𝒅 ) of user 𝑢 is obtained by where 𝑣 models use non-linear functions to increase the representation 𝑣∈U U represents the user set. As such, we can estimate Eq. (2c) by capacity. In these cases, there is an upper bound of 𝜖 which can ∑︁ be estimated by Theorem 2.2. It can be proven that the common 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊) ≈ 𝑃(𝒅)𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅, 𝒖)) non-linear functions in recommender models (e.g., sigmoid in [29]) 𝒅∈D satisfy the conditions in Theorem 2.2, and the upper bound is ∑︁ (4) small, especially when the distribution of 𝐷 concentrates around = 𝑃(𝒅)𝑓 (𝒖, 𝒊, 𝑀(𝒅, 𝒖)), its expectation [13]. 𝒅∈D where each 𝒅 is a distribution from one user, and we use a 2.3 Backdoor Adjustment Operator function 𝑓 (·) (e.g., FM [29]) to calculate the conditional probability To facilitate the usage of DecRS, we design the operator to 𝑃(𝑌|𝒖, 𝒊, 𝑀(𝒅, 𝒖)), similar to conventional recommender models. instantiate backdoor adjustment, which can be easily plugged into • Approximation of E [𝑓 (·)]. The expected value of function recommender models to alleviate bias amplification. From Eq. 5, 𝑓 (·) of 𝒅 in Eq. 4 is hard to compute because we need to calculate the we can find that in addition to 𝒖 and 𝒊, 𝑓 (·) takes 𝑀(𝒅, 𝒖) as the results of 𝑓 (·) for each 𝒅 and the possible values inD are extensive. Í model input where 𝒅 = 𝑃(𝒅)𝒅 . That is, if we can implement 𝒅∈D A popular solution [1, 35] in statistics and machine learning theory 𝑀(𝒅, 𝒖), existing recommender models can take it as one additional is to make the approximation E [𝑓 (·)] ≈ 𝑓 (𝒖, 𝒊, 𝑀(E [𝒅], 𝒖)). 𝒅 𝒅 input to achieve backdoor adjustment. Formally, the approximation takes the outer sum 𝑃(𝒅)𝑓 (·) into Recall that 𝑀 denotes the group-level user representation the calculation within 𝑓 (·): which describes the user preference over item groups. Given ∑︁ 𝒅 = [𝑝(𝑔 ), ...,𝑝(𝑔 )], item group representation 𝒗 = [𝒗 , ..., 𝒗 ], 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊) ≈ 𝑓 (𝒖, 𝒊, 𝑀( 𝑃(𝒅)𝒅, 𝒖)). 1 𝑁 1 𝑁 (5) and user representation 𝒖 = [𝒖 , ..., 𝒖 ] with feature values 𝒙 = ˜ 1 𝐾 𝑢 𝒅∈D ¯ [𝑥 , ...,𝑥 ] [16], we calculate 𝑀(𝒅, 𝒖) by adjustment is essential to alleviate bias amplification. Otherwise, 𝑢,1 𝑢,𝐾 the impact of backdoor adjustment should be controlled. 𝑁 𝐾 ∑︁ ∑︁ ¯ • Symmetric KL Divergence. We employ the symmetric 𝑀(𝒅, 𝒖) = 𝑝(𝑔 )𝒗 ⊙ 𝑥 𝒖 , (7) 𝑎 𝑎 𝑢,𝑏 𝑏 Kullback–Leibler (KL) divergence to quantify the user interest drift 𝑎=1 𝑏=1 in the history. In detail, we divide the historical interaction sequence where ⊙ denotes the element-wise product, and 𝒗 ∈ R is the of user 𝑢 into two parts according to the timestamps. For each part, item group representation for group 𝑔 proposed by us, which is we calculate the historical distribution over item groups by Eq. 3, randomly initialized like 𝒖. The feature values in 𝒙 are usually one, 1 1 1 2 2 2 obtaining 𝒅 = [𝑝 (𝑔 ), ...,𝑝 (𝑔 )] and 𝒅 = [𝑝 (𝑔 ), ...,𝑝 (𝑔 )]. 1 𝑁 1 𝑁 𝑢 𝑢 𝑢 𝑢 𝑢 𝑢 but in some special cases, it could be a float number. For instance, Then, the distance between these two distributions is measured by a user may have two jobs and the feature value for these two the symmetric KL divergence: features can be set as 0.5 separately. Besides, we can also leverage 1 2 2 1 a FM module [29] or other high-order operators [10]. Formally, 𝘂 = 𝐾𝐿(𝒅 |𝒅 ) + 𝐾𝐿(𝒅 |𝒅 ) 𝑢 𝑢 𝑢 𝑢 we can obtain 𝒘 = [𝒅, 𝒙 ] = [𝑝(𝑔 ), ...,𝑝(𝑔 ),𝑥 , ...,𝑥 ] and 𝑢 1 𝑢,1 𝑁 𝑢,𝐾 𝑁 𝑁 ∑︁ 1 ∑︁ 2 (10) 𝑃 (𝑔 ) 𝑃 (𝑔 ) 𝑛 𝑛 𝑢 𝑢 𝒄 = [𝒗 , 𝒖] = [𝒗 , ..., 𝒗 , 𝒖 , ..., 𝒖 ] via concatenation, and then 1 2 1 1 𝑁 𝐾 = 𝑃 (𝑔 ) log + 𝑃 (𝑔 ) log , 𝑢 𝑛 𝑢 𝑛 2 1 𝑃 (𝑔 ) 𝑃 (𝑔 ) 𝑀(𝒅, 𝒖) can be calculated by a second-order FM module: 𝑢 𝑛 𝑢 𝑛 𝑛=1 𝑛=1 𝑁+𝐾 𝑁+𝐾 ∑︁ ∑︁ where 𝘂 denotes the distribution distance of user 𝑢. A higher 𝘂 𝑢 𝑢 𝑀(𝒅, 𝒖) = 𝑤 𝒄 ⊙ 𝑤 𝒄 , (8) 𝑎 𝑎 𝑏 𝑏 represents that the user is easier to change the interest distribution 𝑎=1 𝑏=1 over item groups. Here, we only divide the historical interaction where 𝑀(𝒅, 𝒖) considers the interactions within 𝒖 and 𝒗 like FM, sequence into two parts to reduce the computation cost. More fine- which is the main difference from Eq. 7. Next, the group-level grained division can be explored in future work if necessary. user representation 𝑀(𝒅, 𝒖) can be incorporated into existing Based on the signal of 𝘂 , we utilize an inference strategy to recommender models as one additional user representation. adaptively fuse the prediction scores from the conventional RS Formally, if the generalized recommender models (e.g., FM) are able and DecRS. Specifically, we first train the recommender model by to incorporate multiple feature representations, 𝑀(𝒅, 𝒖) is directly 𝑃(𝑌|𝑈 = 𝒖, 𝐼 = 𝒊) and 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊), respectively, and fed into the models to calculate 𝑓 (𝒖, 𝒊, 𝑀(𝒅, 𝒖)). Otherwise, 𝑓 (·) can their prediction scores are then automatically fused to regulate the be implemented by a later-fusion manner, i.e., 𝑓 (·) = 𝛿∗𝑓 (𝒖, 𝒊)+(1− impact of backdoor adjustment. Formally, ′ ′ 𝛿) ∗ 𝑓 (𝑀(𝒅, 𝒖), 𝒊) where 𝛿 is a hyperparameter and 𝑓 (·) denotes 𝑅𝑆 𝐷𝐸 𝑌 = (1− 𝘂 ˆ ) ∗ 𝑌 + 𝘂 ˆ ∗ 𝑌 , (11) 𝑢,𝑖 𝑢 𝑢 𝑢,𝑖 𝑢,𝑖 the interaction module (e.g., dot product) in recommender models to calculate the prediction score given user/item representations, 𝑅𝑆 where 𝑌 is the inference score for user 𝑢 and item 𝑖 , 𝑌 and 𝑢,𝑖 𝑢,𝑖 such as neural collaborative filtering [ 17]. Then the parameters 𝘃 𝐷𝐸 𝑌 are the prediction scores from the conventional RS and DecRS, 𝑢,𝑖 in the recommender models are optimized by ∑︁ respectively. In particular, 𝘂 ˆ is calculated by ¯ ¯ 𝘃 = arg min 𝑙(𝑓 (𝒖, 𝒊, 𝑀(𝒅, 𝒖)),𝑦 ¯ ), 𝑢,𝑖 𝘂 − 𝘂 (9) 𝑢 𝑚𝑖𝑛 ¯ 𝘂 ˆ = ( ) 𝑢 (12) (𝑢,𝑖,𝑦 ¯ )∈T 𝑢,𝑖 𝘂 − 𝘂 𝑚𝑎𝑥 𝑚𝑖𝑛 where 𝑦 ¯ ∈ {0, 1} represents whether user 𝑢 has interacted with 𝑢,𝑖 where the normalized 𝘂 ˆ ∈ [0, 1], 𝘂 and 𝘂 are the minimum 𝑢 𝑚𝑖𝑛 𝑚𝑎𝑥 item 𝑖 (i.e., 𝑦 ¯ = 1) or not (i.e., 𝑦 ¯ = 0), T denotes the training 𝑢,𝑖 𝑢,𝑖 and maximum symmetric KL divergence values across all users, data, and 𝑙(·) is the loss function, e.g., log loss [17]. respectively. Besides, 𝛼 ∈ [0,+∞) is a hyper-parameter to further 𝑅𝑆 𝐷𝐸 control the weights of 𝑌 and 𝑌 by human intervention. 𝑢,𝑖 𝑢,𝑖 2.4 Inference Strategy Specifically, 𝘂 ˆ becomes larger if 𝛼 → 0 due to 𝘂 ˆ ∈ [0, 1] which 𝑢 𝑢 As mentioned before, DecRS alleviates bias amplification and 𝐷𝐸 makes 𝑌 favor 𝑌 , and 𝘂 ˆ decreases if 𝛼 → +∞. 𝑢,𝑖 𝑢 𝑢,𝑖 produces more robust predictions when user interest drift happens. From Eq. 11, we can find that the inference for the users with Indeed, for some users, bias amplification might be beneficial to 𝐷𝐸 high 𝘂 ˆ will rely more on 𝑌 . That is, 𝘂 automatically adjusts 𝑢 𝑢 𝑢,𝑖 exclude the item groups they dislike. For example, users might only 𝑅𝑆 𝐷𝐸 the balance between 𝑌 and 𝑌 . Besides, we can regulate the like action movies so that they don’t watch the movies in other 𝑢,𝑖 𝑢,𝑖 impact of backdoor adjustment by tuning the hyper-parameter 𝛼 in groups. In these special cases, it makes sense to purely recommend Eq. 12 for different datasets or recommender models. Theoretically, extensive action movies. Therefore, it is better to develop a user- 𝛼 is usually close to 0 because mitigating the spurious correlation specific inference strategy to regulate the impact of backdoor improves the recommendation accuracy for most users. adjustment dynamically. To summarize, the proposed DecRS has three main differences By analyzing the user behavior, we find that many users have from the conventional RS: diverse interest and are likely to have interest drift while few users have stable interest in item groups over time (e.g., only liking action • DecRS models the causal effect 𝑃(𝑌|𝑑𝑜(𝑈 = 𝒖), 𝐼 = 𝒊) instead of movies). This inspires us to explore the user characteristics: is this the conditional probability 𝑃(𝑌|𝑈 = 𝒖, 𝐼 = 𝒊). user easy to change the interest distribution over item groups? • DecRS equips the recommender models with a backdoor Based on that, we propose a user-specific inference strategy for adjustment operator (e.g., Equation 7). item ranking. If the user is easy to change the interest distribution • DecRS makes recommendations with a user-specific inference over item groups in the history, we assume that he/she has diverse strategy instead of the simple model prediction (e.g., a forward interest and will change it easily in future. And thus backdoor propagation). Table 2: The statistics of the datasets. 3 RELATED WORK Dataset #Users #Items #Interactions # Features #Group In this work, we explore how to alleviate bias amplification of ML-1M 3,883 6,040 575,276 13,408 18 recommender models by causal inference, which is highly related Amazon-Book 29,115 16,845 1,712,409 46,213 253 to fairness, diversity, and causal recommendation. Inverse Propensity Scoring (IPS) [2, 28, 41], which first estimates the Negative Effect of Bias Amplification. Due to the existence propensity score based on some assumptions, and then uses the of feedback loop [7], bias amplification will become increasingly inverse propensity score to re-weight the samples. For instance, serious. Consequently, it will result in many negative issues: 1) Saito et al. estimated the exposure propensity for each user-item narrowing down the user interest gradually, which is similar pair, and re-weighted the samples via IPS to solve the miss-not-at- to the effect of filter bubbles [22]. Worse still, the issue might random problem [30]. However, IPS methods heavily rely on the evolve into echo chambers [14], in which users’ imbalanced interest accurate propensity estimation, and usually suffer from the high is further reinforced by the repeated exposure to similar items; propensity variance. Thus it is often followed by the propensity 2) low-quality items that users dislike might be recommended clipping technique [2, 30]. Another line of causal recommendation purely because they are in the majority group, which deprive the studies the effect of taking recommendations as treatments on recommendation opportunities of other high-quality items, causing user/system behaviors [48], which is totally different from our low recommendation accuracy and unfairness. work because we focus on the causal relations within the models. Fairness in Recommendation. With the increasing attention on the fairness of machine learning algorithms [19], many 4 EXPERIMENTS works explore the definitions of fairness in recommendation and We conduct extensive experiments to demonstrate the effectiveness information retrieval [20, 24, 27]. Generally speaking, they have of our DecRS by investigating the following research questions: two categories: individual fairness and group fairness. Individual • RQ1: How does the proposed DecRS perform across different fairness denotes that similar individuals (e.g., users or items) users in terms of recommendation accuracy? should receive similar treatments (e.g., exposure or clicks), such as • RQ2: How does DecRS perform to alleviate bias amplification, amortized equity of attention [3]. Besides, group fairness indicates compared to the state-of-the-art methods? that all groups are supposed to be treated fairly where individuals • RQ3: How do the different components affect the performance are divided into groups according to the protected attributes (e.g., of DecRS, such as the inference strategy and the implementation item category and user gender) [46]. The particular definitions span of function 𝑀(·)? from discounted cumulative fairness [44], fairness of exposure [31], to multi-sided fairness [5]. 4.1 Experimental Settings Another representative direction in fairness to reduce bias amplification is calibrated recommendation [ 32]. It re-ranks the Datasets. We use two benchmark datasets, ML-1M and Amazon- items to make the distribution of the recommended item groups Book, in different real-world scenarios. 1) ML-1M is a movie follow the proportion in the browsing history. For example, if a recommendation dataset , which involves rich user/item features, user has watched 70% action movies and 30% romance movies, the such as user gender, and movie genre. We partition the items into recommendation list is expected to have the same proportion of groups according to the movie genre. 2) Amazon-Book is one of the movies. Although the fairness-related works, including calibrated Amazon product datasets , where the book items can be divided recommendation, may alleviate bias amplification well, they are into groups based on the book category (e.g., sports). To ensure data making the trade-off between ranking accuracy and fairness [ 21, quality, we adopt the 20-core settings, i.e., discarding the users and 31, 32]. The reason possibly lies in that they neglect the true cause items with less than 20 interactions. We summarize the statistics of of bias amplification. datasets in Table 2. For each dataset, we sort the user-item interactions by the Diversity in Recommendation. Diversity is regarded as one timestamps, and split them into the training, validation, and testing essential direction to get users out of filter bubbles in the subsets with the ratio of 80%, 10%, and 10%. For each interaction with information filtering systems [ 32]. As to recommendation, diversity the rating≥ 4, we treat it as a positive instance. During training, we pursues the dissimilarity of the recommended items [8, 33], where adopt the negative sampling strategy to randomly sample one item similarity can be measured by many factors, such as item category that the user did not interact with before as a negative instance. and embeddings [6]. However, most works might recommend many dissatisfying items when making diverse recommendations. For Baselines. As our proposed DecRS is model-agnostic, we example, the recommender model may trade off the accuracy to instantiate it on two representative recommender models, FM [29] reduce the intra-list similarity by re-ranking [47]. and NFM [16], to alleviate bias amplification and boost the Causal Recommendation. Causal inference has been widely predictive performance. We compare DecRS with the state-of-the- used in many machine learning applications, spanning from art methods that might alleviate bias amplification of FM and NFM computer vision [23, 34], natural language processing [11, 12, 43], to backbone models. In particular, information retrieval [4]. In recommendation, most works on causal • Unawareness [15, 19] removes the features of item groups (e.g., inference [25] focus on debiasing various biases in user feedback, movie genre in ML-1M) from the input of item representation 𝐼 . including position bias [18], clickbait issue [37], and popularity https://grouplens.org/datasets/movielens/1m/. bias [45]. The most representative idea in the existing works is https://jmcauley.ucsd.edu/data/amazon/. Table 3: Overall performance comparison between DecRS and the baselines on ML-1M and Amazon-Book. %improv. denotes the relative performance improvement achieved by DecRS over FM or NFM. The best results are highlighted in bold. FM NFM ML-1M Amazon-Book ML-1M Amazon-Book Method R@10 R@20 N@10 N@20 R@10 R@20 N@10 N@20 R@10 R@20 N@10 N@20 R@10 R@20 N@10 N@20 FM/NFM [16, 29] 0.0676 0.1162 0.0566 0.0715 0.0213 0.0370 0.0134 0.0187 0.0659 0.1135 0.0551 0.0697 0.0222 0.0389 0.0144 0.0199 Unawareness [15] 0.0679 0.1179 0.0575 0.0730 0.0216 0.0377 0.0138 0.0191 0.0648 0.1143 0.0556 0.0708 0.0206 0.0381 0.0133 0.0190 FairCo [21] 0.0676 0.1165 0.0570 0.0720 0.0212 0.0370 0.0135 0.0188 0.0651 0.1152 0.0554 0.0708 0.0219 0.0390 0.0142 0.0199 Calibration [32] 0.0647 0.1149 0.0539 0.0695 0.0202 0.0359 0.0129 0.0181 0.0636 0.1131 0.0526 0.0682 0.0194 0.0335 0.0131 0.0178 Diversity [47] 0.0670 0.1159 0.0555 0.0706 0.0207 0.0369 0.0131 0.0185 0.0641 0.1133 0.0540 0.0693 0.0215 0.0386 0.0140 0.0197 IPS [30] 0.0663 0.1188 0.0556 0.0718 0.0213 0.0369 0.0135 0.0187 0.0648 0.1135 0.0544 0.0692 0.0213 0.0370 0.0137 0.0189 DecRS 0.0704 0.1231 0.0578 0.0737 0.0231 0.0405 0.0148 0.0205 0.0694 0.1218 0.0580 0.0742 0.0236 0.0413 0.0153 0.0211 %improv. 4.14% 5.94% 2.12% 3.08% 8.45% 9.46% 10.45% 9.63% 5.31% 7.31% 5.26% 6.46% 6.31% 6.17% 6.25% 6.03% Table 4: Performance comparison across different user groups on ML-1M and Amazon-Book. Each line denotes the performance over the user group with 𝘂 > the threshold. We omit the results of threshold > 4 due to the similar trend. ML-1M Amazon-Book FM R@20 N@20 R@20 N@20 Threshold FM DecRS %improv. FM DecRS %improv. FM DecRS %improv. FM DecRS %improv. 0 0.1162 0.1231 5.94% 0.0715 0.0737 3.08% 0.0370 0.0405 9.46% 0.0187 0.0205 9.63% 0.5 0.1215 0.1296 6.67% 0.0704 0.0730 3.69% 0.0383 0.0424 10.70% 0.0192 0.0213 10.94% 1 0.1303 0.1412 8.37% 0.0707 0.0741 4.81% 0.0430 0.0479 11.40% 0.0208 0.0232 11.54% 2 0.1432 0.1646 14.94% 0.0706 0.0786 11.33% 0.0518 0.0595 14.86% 0.0231 0.0274 18.61% 3 0.1477 0.1637 10.83% 0.0620 0.0711 14.68% 0.0586 0.0684 16.72% 0.0256 0.0318 24.22% 4 0.1454 0.1768 21.60% 0.0595 0.0737 23.87% 0.0659 0.0793 20.33% 0.0284 0.0362 27.46% NFM R@20 N@20 R@20 N@20 Threshold NFM DecRS %improv. NFM DecRS %improv. NFM DecRS %improv. NFM DecRS %improv. 0 0.1135 0.1218 7.31% 0.0697 0.0742 6.46% 0.0389 0.0413 6.17% 0.0199 0.0211 6.03% 0.5 0.1187 0.1280 7.83% 0.0688 0.0735 6.83% 0.0401 0.0426 6.23% 0.0202 0.0218 7.92% 1 0.1272 0.1391 9.36% 0.0692 0.0747 7.95% 0.0438 0.0473 7.99% 0.0212 0.0234 10.38% 2 0.1452 0.1584 9.09% 0.0701 0.0771 9.99% 0.0530 0.0580 9.43% 0.0234 0.0269 14.96% 3 0.1478 0.1740 17.73% 0.0639 0.0723 13.15% 0.0614 0.0660 7.49% 0.0275 0.0319 16.00% 4 0.1442 0.1775 23.09% 0.0542 0.0699 28.97% 0.0709 0.0795 12.13% 0.0308 0.0371 20.45% • FairCo [21] introduces one error term to control the exposure recommendation list (comprised by the top-20 items). Higher 𝐶 𝐾𝐿 fairness across item groups. In this work, we calculate the error scores suggest a more serious issue of bias amplification. term based on the ranking list sorted by relevance, and its Parameter Settings. We implement our DecRS in the PyTorch coefficient 𝘆 in the ranking target is tuned in {0.01, 0.02, ..., 0.5}. implementation of FM and NFM. Closely following the original • Calibration [32] is one state-of-the-art method to alleviate bias papers [16, 29], we use the following settings: in FM and NFM, amplification. Specifically, it proposes a calibration metric 𝐶 to 𝐾𝐿 the embedding size of user/item features is 64, log loss [17] is measure the imbalance between the history and recommendation applied and the optimizer is set as Adagrad [9]; in NFM, a 64- list, and minimizes𝐶 by re-ranking. Here the hyper-parameter 𝐾𝐿 dimension fully-connected layer is used. We adopt a grid search 𝘆 in the ranking target is searched in {0.01, 0.02, ..., 0.5}. to tune their hyperparameters: the learning rate is searched in • Diversity [47] aims to decrease the intra-list similarity, where {0.005, 0.01, 0.05}; the batch size is tuned in {512, 1024, 2048}; the diversification factor is tuned in {0.01, 0.02, ..., 0.2}. the normalization coefficient is searched in {0, 0.1, 0.2}, and the • IPS [30] is a classical method in causal recommendation. Here we dropout ratio is confirmed in {0.2, 0.3, ..., 0.5}. Besides, 𝛼 in the use 𝑃(𝒅 ) as the propensity of user𝑢 to down-weight the items in proposed inference strategy is tuned in {0.1, 0.2, ..., 10}, and the the majority group during debiasing training, and we employ the model performs the best in {0.2, 0.3, 0.4}, where 𝛼 is close to 0, propensity clipping technique [30] to reduce propensity variance, proving the advantages of our DecRS over the conventional RS as where the clipping threshold is searched in {2, 3, ..., 10}. ¯ discussed in Section 2.4. We use Eq. 8 to implement 𝑀(𝒅, 𝒖) and Evaluation Metrics. We evaluate the performance of all methods the backbone models take 𝑀(𝒅, 𝒖) as one additional feature. The from two perspectives: recommendation accuracy and effectiveness exploration of the late-fusion manner is left to future work because of alleviating bias amplification. In terms of accuracy, two widely- it is not our main contribution. Furthermore, we use the early used metrics [40], Recall@K (R@K) and NDCG@K (N@K), are stopping strategy [38, 42] — stop training if R@10 on the validation adopted under all ranking protocol [36, 39], which test the top-K set does not increase for 10 successive epochs. For all approaches, recommendations over all items that users never interact with in we tune the hyper-parameters to choose the best models w.r.t. R@10 the training data. As to alleviating bias amplification, we use the on the validation set, and report the results on the testing set. We representative calibration metric 𝐶 [32], which quantifies the 𝐾𝐿 released code and data at https://github.com/WenjieWWJ/DecRS. distribution drift over item groups between the history and the new 4.2 Performance Comparison (RQ1 & RQ2) FM NFM 0.48 0.6 FM Calibration DecRS 4.2.1 Overall Performance w.r.t. Accuracy. We present the NFM Calibration DecRS 0.55 0.46 empirical results of all baselines and DecRS in Table 3. Moreover, 0.5 to further analyze the characteristics of DecRS, we split users into 0.44 groups based on the symmetric KL divergence (cf. Eq. 10) and report 0.45 0.42 the performance comparison over the user groups in Table 4. From 0.4 the two tables, we have the following findings: 0.4 0.35 • Unawareness and FairCo only achieve comparable performance 0.38 0.3 or marginal improvements over the vanilla FM and NFM on the 0 0.5 1 2 3 4 0 0.5 1 2 3 4 two datasets. Possible reasons are the trade-offs among different Threshold Threshold user groups. To be more specific, for some users, discarding Figure 4: The performance comparison between the base- group features or preserving group fairness is able to reduce bias lines and DecRS on alleviating bias amplification. amplification and recommend more satisfying items. However, FM for most users with imbalanced interest in item groups, these NFM 0.2 0.2 approaches possibly recommend many disappointing items by FM DecRS (w/o) DecRS NFM DecRS (w/o) DecRS pursuing group fairness. 0.18 0.18 • Calibration and Diversity perform worse than the vanilla 0.16 0.16 backbone models, suggesting that simple re-ranking does hurt the recommendation accuracy. This is consistent with the findings in 0.14 0.14 [32, 47]. Moreover, we ascribe the inferior performance of IPS to the inaccurate estimation and high variance of propensity scores. 0.12 0.12 That is, the propensity cannot precisely estimate the effect of 𝐷 on 𝑈 , even if the propensity clipping technique [30] is applied. 0.1 0.1 0 0.5 1 2 3 4 0 0.5 1 2 3 4 • DecRS effectively improves the recommendation performance of Threshold Threshold FM and NFM on the two datasets. As shown in Table 3, the relative Figure 5: Ablation study of DecRS on ML-1M. improvements of DecRS over FM w.r.t. R@20 are 5.94% and 9.46% on ML-1M and Amazon-Book, respectively. This verifies the Table 5: Effect of the design of 𝑀(·). effectiveness of backdoor adjustment, which enables DecRS to Method R@10 R@20 N@10 N@20 remove the effect of confounder for many users. As a result, many FM 0.0676 0.1162 0.0566 0.0715 DecRS-EP 0.0685 0.1205 0.0573 0.0730 less-interested or low-quality items from the majority group will DecRS-FM 0.0704 0.1231 0.0578 0.0737 not be recommended, thus increasing the accuracy. • As Table 4 shows, with the increase of 𝘂 , the performance 4.3 In-depth Analysis (RQ3) gap between DecRS and the backbone models becomes larger. 4.3.1 Effect of the Inference Strategy . We first answer the For example, in the user group with 𝘂 > 4, the relative question: Is it of importance to conduct the inference strategy for improvements w.r.t. N@20 over FM and NFM are 23.87% and DecRS? Towards this end, one variant “DecRS (w/o)” is constructed 28.97%, respectively. We attribute such improvements to the by disabling the inference strategy and only using the prediction robust recommendation produced by DecRS. Specifically, DecRS 𝐷𝐸 𝑌 in Eq. 11 for inference. We illustrate its results in Figure 5 with equipped with backdoor adjustment is superior in reducing the following key findings. 1) The performance of “DecRS (w/o)” the spurious correlation and predicting users’ diverse interest, drops as compared with that of DecRS, indicating the effectiveness especially for the users with the interest drift (i.e., high 𝘂 ). of the inference strategy. 2) “DecRS (w/o)” still outperforms FM 4.2.2 Performance on Alleviating Bias Amplification. In and NFM consistently, especially over the users with high 𝘂 . This Figure 4, we present the performance comparison w.r.t. 𝐶 𝐾𝐿 suggests the superiority of DecRS over the conventional RS. It between the vanilla FM/NFM, calibrated recommendation, and achieves more accurate predictions of user interest by mitigating the DecRS on ML-1M. Due to space limitation, we omit other baselines effect of the confounder via backdoor adjustment approximation. that perform worse than calibrated recommendation and the results on Amazon-Book which have similar trends. We have the following 4.3.2 Effect of the Implementation of 𝑀(·). As mentioned in observations from Figure 4. 1) As compared to the vanilla models, Section 2.3, we can implement the function 𝑀(·) by either Eq. 7 or calibrated recommendation achieves lower 𝐶 scores, suggesting Eq. 8. We investigate the influence of different implementations and 𝐾𝐿 that the bias amplification is reduced. However, it comes at the construct two variants, DecRS-EP and DecRS-FM, which employ cost of lower recommendation accuracy, as shown in Table 3. 2) the element-wise product in Eq. 7 and the FM module in Eq. 8, Our DecRS consistently achieves lower 𝐶 scores than calibrated respectively. We summarize their performance comparison over 𝐾𝐿 recommendation across all user groups. More importantly, DecRS FM on ML-1M in Table 5. While being inferior to DecRS-FM, DecRS- does not hurt the recommendation accuracy. This evidently shows EP still performs better than FM. This proves the superiority of that DecRS solves the bias amplification problem well by embracing DecRS-FM over DecRS-EP, and also shows that DecRS with different causal modeling for recommendation, and justifies the effectiveness implementations still surpasses the vanilla backbone models, which of backdoor adjustment on reducing spurious correlations. further suggests the stability and effectiveness of DecRS. Recall@20 C_KL C_KL Recall@20 5 CONCLUSION AND FUTURE WORK [18] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In WSDM. ACM, 781–789. In this work, we explained that bias amplification in recommender [19] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual Fairness. In NeuIPS. Curran Associates, Inc., 4066–4076. models is caused by the confounder from a causal view. To [20] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and alleviate bias amplification, we proposed a novel DecRS with an Fernando Diaz. 2018. Towards a fair marketplace: Counterfactual evaluation of approximation operator for backdoor adjustment. DecRS explicitly the trade-off between relevance, fairness and satisfaction in recommendation systems. In CIKM. ACM, 2243–2251. models the causal relations in recommender models, and leverages [21] Marco Morik, Ashudeep Singh, Jessica Hong, and Thorsten Joachims. 2020. backdoor adjustment to remove the spurious correlation caused Controlling Fairness and Bias in Dynamic Learning-to-Rank. ACM, 429–438. by the confounder. Besides, we developed an inference strategy to [22] Tien T Nguyen, Pik-Mai Hui, F Maxwell Harper, Loren Terveen, and Joseph A Konstan. 2014. Exploring the filter bubble: the effect of using recommender regulate the impact of backdoor adjustment. Extensive experiments systems on content diversity. In WWW. ACM, 677–686. validate the effectiveness of DecRS on alleviating bias amplification [23] Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji- and improving recommendation accuracy. Rong Wen. 2021. Counterfactual VQA: A Cause-Effect Look at Language Bias. In CVPR. IEEE. This work takes the first step to incorporate backdoor adjustment [24] Gourab K Patro, Arpita Biswas, Niloy Ganguly, Krishna P Gummadi, and Abhijnan into existing recommender models. In future, there are many Chakraborty. 2020. Fairrec: Two-sided fairness for personalized recommendations in two-sided platforms. In WWW. ACM, 1194–1204. research directions that deserve our attention. 1) The discovery [25] Judea Pearl. 2009. Causality. Cambridge university press. of more fine-grained causal relations in recommendation models. [26] Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of This work starts to mitigate the spurious correlations caused by Cause and Effect (1st ed.). Basic Books, Inc. [27] Evaggelia Pitoura, Georgia Koutrika, and Kostas Stefanidis. 2020. Fairness in the confounder while recommendation is an extremely complex Rankings and Recommenders.. In EDBT. ACM, 651–654. scenario, involving many observed/hidden variables that are [28] Zhen Qin, Suming J. Chen, Donald Metzler, Yongwoo Noh, Jingzheng Qin, and Xuanhui Wang. 2020. Attribute-Based Propensity for Unbiased Learning in waiting for causal discovery. 2) The proposed DecRS has the Recommender Systems: Algorithm and Case Studies. In KDD. ACM, 2359–2367. potential to reduce various biases in information retrieval and [29] Steffen Rendle. 2010. Factorization machines. In ICDM. IEEE, 995–1000. recommendation, such as position bias and popularity bias. The [30] Yuta Saito, Suguru Yaginuma, Yuta Nishino, Hayato Sakata, and Kazuhide Nakata. 2020. Unbiased Recommender Learning from Missing-Not-At-Random Implicit causes of the biases are also related to the imbalanced training data. Feedback. In WSDM. ACM, 501–509. 3) Bias amplification is one essential cause of the filter bubble [ 22] [31] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. and echo chambers [14]. The effect of DecRS on mitigating these In KDD. ACM, 2219–2228. [32] Harald Steck. 2018. Calibrated recommendations. In RecSys. ACM, 154–162. issues can be studied in future work. [33] Jianing Sun, Wei Guo, Dengcheng Zhang, Yingxue Zhang, Florence Regol, Yaochen Hu, Huifeng Guo, Ruiming Tang, Han Yuan, Xiuqiang He, and Mark REFERENCES Coates. 2020. A Framework for Recommending Accurate and Diverse Items Using [1] Shoshana Abramovich and Lars-Erik Persson. 2016. Some new estimates of the Bayesian Graph Convolutional Neural Networks. In KDD. ACM, 2030–2039. ‘Jensen gap’. Journal of Inequalities and Applications 2016, 1 (2016), 1–9. [34] Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. 2020. Long-Tailed [2] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W. Bruce Croft. 2018. Classification by Keeping the Good and Removing the Bad Momentum Causal Unbiased Learning to Rank with Unbiased Propensity Estimation. In SIGIR. ACM, Effect. In NeuIPS. 385–394. [35] Tan Wang, Jianqiang Huang, Hanwang Zhang, and Qianru Sun. 2020. Visual [3] Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. 2018. Equity of attention: commonsense r-cnn. In CVPR. IEEE, 10760–10770. Amortizing individual fairness in rankings. In SIGIR. ACM, 405–414. [36] Wenjie Wang, Fuli Feng, Xiangnan He, Liqiang Nie, and Tat-Seng Chua. 2021. [4] Stephen Bonner and Flavian Vasile. 2018. Causal embeddings for recommendation. Denoising implicit feedback for recommendation. In WSDM. ACM, 373–381. In RecSys. ACM, 104–112. [37] Wenjie Wang, Fuli Feng, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. [5] Robin Burke. 2017. Multisided fairness for recommendation. In FAT ML. 2021. Click can be Cheating: Counterfactual Recommendation for Mitigating [6] Praveen Chandar and Ben Carterette. 2013. Preference based evaluation measures Clickbait Issue. In SIGIR. ACM. for novelty and diversity. In SIGIR. ACM, 413–422. [38] Wenjie Wang, Minlie Huang, Xin-Shun Xu, Fumin Shen, and Liqiang Nie. 2018. [7] Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. 2018. How Chat more: Deepening and widening the chatting topic via a deep model. In algorithmic confounding in recommendation systems increases homogeneity SIGIR. ACM, 255–264. and decreases utility. In RecSys. ACM, 224–232. [39] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. [8] Peizhe Cheng, Shuaiqiang Wang, Jun Ma, Jiankai Sun, and Hui Xiong. 2017. Neural Graph Collaborative Filtering. In SIGIR. ACM, 165–174. Learning to Recommend Accurate and Diverse Items. In WWW. IW3C2, 183–192. [40] Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. [9] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods 2020. Disentangled Graph Collaborative Filtering. In SIGIR. ACM, 1001–1010. for online learning and stochastic optimization. JMLR 12, 7 (2011). [41] Yixin Wang, Dawen Liang, Laurent Charlin, and David M Blei. 2018. The [10] Fuli Feng, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2021. Cross-GCN: deconfounded recommender: A causal inference approach to recommendation. Enhancing Graph Convolutional Network with k-Order Feature Interactions. In arXiv:1808.06581. TKDE (2021). [42] Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng [11] Fuli Feng, Weiran Huang, Xin Xin, Xiangnan He, and Tat-Seng Chua. 2021. Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Should Graph Convolution Trust Neighbors? A Simple Causal Inference Method. Recommendation of Micro-video. In MM. ACM, 1437–1445. In SIGIR. ACM. [43] Yiquan Wu, Kun Kuang, Yating Zhang, Xiaozhong Liu, Changlong Sun, Jun Xiao, [12] Fuli Feng, Jizhi Zhang, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2021. Yueting Zhuang, Luo Si, and Fei Wu. 2020. De-Biased Court’s View Generation Empowering Language Understanding with Counterfactual Reasoning. In ACL- with Causality. In EMNLP. ACL, 763–780. IJCNLP Findings. ACL. [44] Ke Yang and Julia Stoyanovich. 2017. Measuring fairness in ranked outputs. In [13] Xiang Gao, Meera Sitharam, and Adrian E. Roitberg. 2019. Bounds on the Jensen SSDBM. ACM, 1–6. Gap, and Implications for Mean-Concentrated Distributions. AJMAA 16, 14 [45] Yang Zhang, Fuli Feng, Xiangnan He, Tianxin Wei, Chonggang Song, Guohui (2019), 1–16. Issue 2. Ling, and Yongdong Zhang. 2021. Causal Intervention for Leveraging Popularity [14] Yingqiang Ge, Shuya Zhao, Honglu Zhou, Changhua Pei, Fei Sun, Wenwu Ou, Bias in Recommendation. In SIGIR. ACM. and Yongfeng Zhang. 2020. Understanding Echo Chambers in E-Commerce [46] Ziwei Zhu, Jianling Wang, and James Caverlee. 2020. Measuring and Mitigating Recommender Systems. In SIGIR. ACM, 2261–2270. Item Under-Recommendation Bias in Personalized Ranking Systems. In SIGIR. [15] Nina Grgic-Hlaca, Muhammad Bilal Zafar, Krishna P Gummadi, and Adrian ACM, 449–458. Weller. 2016. The case for process fairness in learning: Feature selection for fair [47] Cai-Nicolas Ziegler, Sean M McNee, Joseph A Konstan, and Georg Lausen. 2005. decision making. In NeuIPS. Improving recommendation lists through topic diversification. In WWW. ACM, [16] Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse 22–32. predictive analytics. In SIGIR. ACM, 355–364. [48] Hao Zou, Peng Cui, Bo Li, Zheyan Shen, Jianxin Ma, Hongxia Yang, and Yue He. [17] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng 2020. Counterfactual Prediction for Bundle Treatment. NeuIPS (2020). Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.

Journal

Computing Research RepositoryarXiv (Cornell University)

Published: May 22, 2021

There are no references for this article.