Rethinking the framework constructed by counterfactual functional model

The causal inference represented by counterfactual inference technology breathes new life into the current field of artificial intelligence. Although the fusion of causal inference and artificial intelligence performs well in many applications, some theoretical justifications have not been well resolved. In this paper, we focus on two fundamental issues in causal inference: the probabilistic evaluation of counterfactual queries and the assumptions used to evaluate causal effects. Both issues are closely related to counterfactual inference tasks: counterfactual queries concern the outcome of the inference task, and the assumptions provide the preconditions for performing it. A counterfactual query asks what kind of causality would arise if we artificially applied conditions contrary to the facts. In general, to obtain a unique solution, the evaluation of counterfactual queries requires the assistance of a functional model. We analyze the limitations of the original functional model when evaluating a specific query and find that the model arrives at ambiguous conclusions when the unique probability solution is 0. In the task of estimating causal effects, experiments are conducted under strong assumptions, such as treatment-unit additivity. However, such assumptions are often unsatisfiable in real-world tasks, and the assumptions themselves lack a scientific representation. To alleviate this problem, we propose a mild version of the treatment-unit additivity assumption, coined M-TUA, based on the damped vibration equation in physics. M-TUA reduces the strength of the constraints in the original assumption with a reasonable formal expression.
Keywords Causal effect · Counterfactual approach · Functional model · Treatment-unit additivity assumption 1 Introduction when we modify a factual prior event and then evaluate the consequences of that change [1]. In the classic Rubin causal model (RCM), counterfactual results usually refer Counterfactual inference, as an indispensable method of causal inference, helps create human self-awareness and to unobserved potential outcomes [2]. A typical application imbue life experiences with meaning, which is embodied representative is counterfactual queries (CQs) [3]. A counterfactual query is a question of what kind of causality would arise if we artificially adopt the conditions contrary to Chao Wang cwang17@fudan.edu.cn the facts. Formally, the evaluation of CQs can be expressed as “If C happened, would B have occurred?”, where C is the Linfang Liu counterfactual antecedent. liulf19@fudan.edu.cn CQs embody our reflections on what already happening Shichao Sun in the real world. For example, in Fig. 1, data released by bruce.sun@connect.polyu.hk Johns Hopkins University (JHU) shows that as of August 19, 2021, EST, the cumulative number of confirmed cases of Wei Wang COVID-19 (coronavirus disease 2019) in the United States wangwei1@fudan.edu.cn amounted to 37,155,209 cases and the cumulative number of deaths amounted to 624,253 cases. The data also shows Shanghai Key Laboratory of Data Science, School of that the current cumulative number of confirmed cases in Computer Science, Fudan University, Shanghai, China Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China https://coronavirus.jhu.edu/map.html 12958 C. Wang et al. Fig. 
1 COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at JHU the United States accounts for about 17.75% of the more than Note that, in CQ1, there is a clear causal relationship, 200 million confirmed cases worldwide; the cumulative that is, COVID-19 has caused the unemployment rate in number of deaths in the United States accounts for about the United States to rise. Therefore, in response to CQ1, 14.20% of the more than 4.3 million deaths worldwide. In an essential task is to be able to evaluate the degree of face of the onslaught of the epidemic, one might ponder the belief in the counterfactual consequence (i.e., probability following query: If the U.S. has taken decisive measures, evaluation) after considering the facts that have already would the number of confirmed cases have been effectively happened. In other words, it is equivalent to evaluating controlled instead of spreading as wildly as it is now? the probability of a potential (or counterfactual) outcome Originally, most studies on counterfactual inferences given the antecedent. Moreover, in CQ1, it is a fact that the (such as the query above) focus on the field of philosophy. COVID-19 sweeps the world and causes the unemployment Philosophers establish the form of a logical relationship rate in the United States to rise. Hence, we should focus constituting a logical world, which is consistent with the on analyzing what is the probability that the unemployment counterfactual antecedent and must be the closest to the real rate in the United States will rise if there is no COVID-19? world (for the convenience of description, we call it the This is undoubtedly an influence on the government to make closest world approach) [4]. Further, Ginsberg [5] applies decisions. Therefore, evaluating counterfactual queries like similar counterfactual logic to analyze problems of AI tasks, these has far-reaching significance for practical application. 
which relies on logic based on the closest world approach. With the widespread application of causal inference in However, the disadvantage of the closest world approach is the field of AI [7, 8], the current popular method is to adopt that it lacks constraints on closeness measures. the functional model (FM) [9] for inference. FM takes a Regarding the above issue, Balke and Pearl [3]are com- CQ as an input and finally outputs the probability evalua- mitted to explaining the closest world approach. Specif- tion of the CQ by combining prior knowledge and internal ically, they suggest that turning a CQ into a probability inference mechanisms. The evaluation of CQs has benefited problem, named, the probabilistic evaluation of counterfac- many research fields and tasks, such as the determina- tual queries (PECQs). In other words, PECQs focus more on tion of person liable [10], marketing and economics [11], the probability of an event occurring in a specific CQ, rather personalized policies [12], medical imaging analysis [13, than just outputting “True” or “False” (or “Yes” or “No”, 14], Bayesian network [7], high dimensional data analy- etc.) for this query. PECQs motivate us to deeply rethink sis [15], abduction reasoning [16], the intervention of tabu- counterfactual problems in many AI applications. For exam- lar data [8], epidemiology [17], natural language processing ple, we know that COVID-19 has caused economic losses (NLP) [18, 19] and graph neural networks (GNN) [20, 21]. and increased unemployment in the United States [6]. An In particular, FM can provide powerful interpretability for important reason is that the government has not dealt with machine learning model decisions [22–25], which is one of the epidemic promptly. Based on the facts that have already the most concerning issues in the Artificial Intelligence (AI) occurred, we may reflect on the following question, CQ1: If community today. 
the government issued effective policies in time to control the spread of COVID-19, would the unemployment rate in 1.1 Motivation the United States still have raised? Judea Pearl discusses the limitations of the current machine 2 learning theory and points out that current machine learning https://www.nytimes.com/2020/05/28/business/ models are difficult to be used as the basis for strong unemployment-stock-market-coronavirus.html Rethinking the framework constructed by counterfactual functional model 12959 AI [9]. An important reason is that the current machine ambiguity cannot be eliminated, we must consider what learning approach is almost entirely in the form of statistics may cause ambiguity and how to avoid trouble caused or “black box”, which brings serious theoretical limitations by ambiguity. to its performance [26]. For example, it is difficult for 2) The assumptions used to estimate causal effects in current smart devices to make counterfactual inferences. the data are strong, which are often violated in real- A large number of researchers are increasingly interested world applications. Some strong assumptions tend in combining counterfactual inference with AI [27, 28], to constrain on individuals (e.g., individuals u in such as explaining consumer behavior [29], the study U) to obtain the ideal an experimental population of viral pathogenesis [30], and predicting the risk of environment in an experiment. This neglects to obtain flight delays [31]. In addition, counterfactual inference the equivalent form of the assumption directly from the has shown advantages in improving the robustness of the abstract level (e.g., the experimental population U,the model [32, 33] and optimizing text generation tasks [34]and dataset itself). In some practical applications of causal classification tasks [35]. 
Although counterfactual inference inference, a challenging task requires researchers to has set off a new upsurge in the field of machine learning, a make causal inferences in the absence of data. For deeper understanding of the existing models and methods is example, in RCM, the causal effect is described as notably lacking. O − O ,where O (O ) is the result variable O t,u c,u t,u c,u In our work, we focus on two basic aspects in the task displayed by subject (or individual) u under the control of counterfactual inference. The first aspect focuses on the (c) group or treatment (t) group. Unfortunately, we have no idea how to obtain O ,and O at the same counterfactual framework and this aspect is related to the t,u c,u inference results of the model. The second aspect focuses time no matter how large the dataset is. This situation is on the preconditions for the counterfactual inference also called the fundamental problem of causal inference tasks. Specifically, the first aspect is based on a type (FPCI) [37]. of counterfactual approach (e.g., the functional model) Owing to the existence of FPCI, we can only apply in causal science. We analyze the credibility of some additional assumptions on the data distribution to avoid it. results obtained by using this counterfactual approach to Some typical assumptions are shown below: evaluate CQs. Another aspect we are concerned about is the assumptions used in causal inference to estimate – Stable Unit Treatment Value Assumption (SUTVA) [38], causal effects. Since causal effects depend on the potential where each O of u is treated as an independent event; results, however, we cannot observe all the potential – Assumption of Homogeneity (AOH) [39], which requires that for any individual u and u ,and any outcomes of the experimental individual simultaneously i j,j =i (unobservable outcomes are usually called counterfactual intervention method t , O = O always holds; t ,u t ,u i j,j =i outcomes). 
Therefore, some assumptions are often needed – Treatment-Unit Additivity (TUA) [36], some studies when estimating the causal effect. We pay attention to a also call it the assumption of constant effect (AOCE). commonly used strong assumption (i.e., the Treatment-Unit The TUA assumption constrains such an equivalence Additivity (TUA) assumption) and weaken it using some relationship that, for all individuals, the causal effect mathematical methods. Next, we specify the above two is the same for each individual under a defined aspects to the following two issues (we use a real inference intervention method. task (i.e., PECQs) as an example to explain the relationship (u ) = (u ) = ··· = (u ), (1) 1 2 |U | between the two issues in Fig. 2. where (u ) denotes the individual causal effect of u ∈ i i 1) In the CQs tasks, although the output result of the FM U,and |U | is the cardinality of set U. Apparently, AOH is unique, this unique solution sometimes is ambiguous. is stronger than TUA. Therefore, in the second aspect, For example, in the task of evaluating the probability we focus on TUA, aiming to obtain the milder TUA solution of CQs by FM, if the model predicts that assumption. the probability of a CQ is 0, the result may be To address the two issues mentioned above, in this paper, ambiguous. In other words, although the probability our contributions are three-fold: value predicted by the model in this situation is 0, it is – We focus on a basic problem in the FM and primarily still possible that the event will happen. Intuitively, the existence of statistical uncertainty may cause ambiguity analyze the evaluation method of [3]. We find that FM of the inference results. Dawid [36] proves that even sometimes produces ambiguous output results for some if the statistical uncertainty can be eliminated, the CQs, even if the final output result is unique. One of inference may also produce ambiguity. 
Therefore, when the important reasons is that FM needs to calculate the 12960 C. Wang et al. Fig. 2 The framework of the probabilistic evaluation of counterfac- counterfactual inference task, the plausibility of the output affects, the tual queries: these two issues spread over the same inference task, and user’s confidence, and the strong assumptions premise determines the these two issues are independent of each other. However, for the same scope of the task intersection between the two sets to get the final result and analyze the rationality and limitations of M-TUA. The when estimating the output probability. However, the comparison between TUA and M-TUA is in Section 6. intersection may be an empty set ∅, when estimating Section 7 summarizes this paper. some special CQs. – We provide a mild TUA assumption, called M-TUA, which incorporates the idea of the damped vibration 2 Notation equation. – We prove theoretically that M-TUA can be applied to In this section, the key mathematical notations and their large datasets, and give a reasonable and rigorous math- descriptions are listed in Table 1. ematical description of this theory (see Theorem 1). Especially for some complex internal principles, we do not choose to use the “black box” method but hope to 3 Inference mechanism and result credibility use M-TUA to try to reveal the complex internal rela- analysis of FM tionship between certain parameters and assumptions and make some reasonable description and explanation. In this section, we first introduce the definition of PECQs [3], which is a probabilistic description of the counterfactual 1.2 Paper organization query. Second, we review the inference mechanism of FM in Fig. 3. Finally, we exhaustively analyze the inference The rest of this paper is organized as follows: In Section 2, mechanism in FM by some examples and find that when we give the mathematical notation and their descriptions. 
the probabilistic evaluation of a CQ is 0, the result causes In Section 3, we give a visualization of the FM inference unreliable guidance for decision-making. mechanism and analyze the pitfalls of this inference mechanism based on concrete examples. In Section 4 and 3.1 Definition of PECQs Section 5, we give a mild version of the TUA assumption (i.e., M-TUA), and theoretically prove the equivalent Definition 1 (Probabilistic Evaluation of Counterfactual representation of the TUA assumption in the vector space Queries, PECQs [3]) The core idea of PECQs is to transform Rethinking the framework constructed by counterfactual functional model 12961 Table 1 Key Notations and Notation Description Descriptions ∅ the empty set R ={R , ...,R } the set of variables R 1 n i r /rˆ the value of R in the real/counterfactual world i i i {t, c} t and c represent two different treatments (or intervention variables) U = U ∪ U a population with a huge number of units u t c i t t U ={u , ...,u } the set of some units receiving treatment t 1 k c c U ={u , ...,u } the set of other units receiving treatment c, i.e., U ∩ U =∅ c  t c 1 k R, Z , C the set of real numbers, positive integers, and complex numbers A ∈ C the complex conjugate of A ∈ C |S| the cardinality of finite set S, e.g., |R|= n, |U |= k and |U |= k t c {·} the finite set containing n elements, e,g., R ={R , ...,R }={R } n 1 n i n c all unknown factors that may influence β in the inference mechanism of FM Pr(c ) the probability distribution of c in the inference mechanism of FM n n L the Euclidean distance from point a to point o coordinate system ao p(x)  q(x) means that function p(x) is equivalent to function q(x) a CQ into a probabilistic evaluation problem, which can be Example 1 CQ1 can be translated into (2) for evaluation. 
formalized as: Specifically, for (α , β ), we observe that there is an 0 0 ineffective policy (i.e., α ) that causes the unemployment Pr(β |ˆ α ) , (2) 1 1 |(α ,β ) 0 0 ˆ rate rise (i.e., β ); Pr(β |ˆ α ) indicates the probability of 0 1 1 where “|(α , β )” represents the evidence (or observed data) unemployment rate falls (i.e., β )ifweimplement effective 0 0 1 we have observed in the real world, and the value of policies (i.e., α ˆ ). evidence be considered as a conditional probability (e.g., (α , β )  Pr(β |α ) = p ). Pr(β |ˆ α ) is the counterfactual 3.2 Inference mechanism of FM 0 0 0 0 0 1 1 outcome that we need to infer based on evidence. The probabilistic evaluation of (2) can be obtained by the The inference mechanism of FM is shown in Fig. 3.More inference mechanism of FM [3] (i.e,. Figure 3). detailed information on the inference mechanism of FM is Fig. 3 The inference mechanism of FM when evaluating the CQ1 12962 C. Wang et al. elaborated upon in [3], and we will not repeat it here in this derived from real predictive inferences or the processing section. of some special counterfactual queries (e.g., V-CQ2)by the inference mechanism. Therefore, when the probabilistic 3.3 Analysis of the inference mechanism of FM evaluation of a CQ is 0, the decision based on this result is not credible, that is, the result is ambiguous. Although FM can output a unique solution for a CQ, 2) In addition, in V-CQ2, α does not constitute a however, we find that the results are not credible when the counterfactual condition, it still belongs to the assumptions in the real world, in this case, the β is also known evidence probability estimate of the FM output is 0. In other words, the output value of Pr(·) = 0 does not mean that the event in the real world, i.e., β = β . Hence, we have 1 1 will not occur. 
Next, we introduce some simple examples to Pr(β |α ) = Pr(β |α ) 1 0 |(α ,β ) 1 0 |(α ,β ) 0 0 0 0 reveal the untrustworthy guidance that this ambiguity may = 1 − Pr(β |α ) = 1 − p , (4) 0 0 0 bring to the decision-making. which contradicts with the result of (3). This shows that α Example 2 CQ2 [36]: Patient P has a headache. Will it help does not constitute an intervention that affects the outcome if P takes aspirin? The information we observe is that the of the counterfactual world. Therefore, the estimated current patient has a headache (denoted as β )and is not value of (3) obtained by FM violates the counterfactual taking aspirin (denoted as α ). Therefore, Pr(β |ˆ α ) consistency rule [40]. 0 1 1 |(α ,β ) 0 0 is equivalent to the probability evaluation of CQ2 (the The impact of ambiguity in inference results on decision- query of this form like CQ2 can also be called the effects making We discuss the impact of the unique solution on of causes [36]). However, consider a situation (denoted decision-making by two examples as follows: as the variant of CQ2, which is abbreviated as V-CQ2 ) where the patient still does not take aspirin. What is the Example 3 In predicting the probability value of 0.8 or probability of the headache disappearing? It is equivalent 0.9 for an earthquake to occur at a certain location, there to evaluating Pr(β |α ) . If we still choose to use 1 0 |(α ,β ) 0 0 is little difference in decision-making for this probability. FM to estimate this query, we first determine the value of However, when the probability of an earthquake is estimated n ∈{1, 2} (n refers to the value of n,which is (α ,β ) (α ,β ) 0 0 0 0 to be 0 and unique, it is essential for us to verify its determined according to (α , β )), and then we determine 0 0 rationality, because this may directly lead to the need for the the new value of n ∈{3, 4} (n refers to the value ˆ ˆ (α ˆ ,β ) (α ˆ ,β ) 1 1 1 1 corresponding deployment. 
In other words, how confident of n, which is determined according to (α ˆ , β )). Finally, the 1 1 are we to ensure that there will be no earthquake based on evaluation of Pr(β |α ) is the sum of Pr(c ) 1 0 |(α ,β ) 3,β |(α ,β ) 0 0 0 0 the prediction of FM? Therefore, the fact that there exist and Pr(c ) , i.e., 4,β |(α ,β ) 0 0 queries that cannot be answered using FM does not mean ˆ that the evaluation of these queries is meaningless. Pr(β |α ) = Pr(c ) + Pr(c ) 1 0 |(α ,β ) 3,β |(α ,β ) 4,β |(α ,β ) 0 0 0 0 0 0 = 0 + 0 =0(3) Example 4 CQ3:The murderer assassinated President Kennedy, if the assassination had failed, would Kennedy Why is the evaluation of V-CQ2 equal to 0, and what does still be alive? Formally, if the shot hits the target (α ) with this mean? 1) When using FM to estimate the results of a high probability (p ) that the hit target will die (β ), 0 0 CQ1 and V-CQ2, a key step is to calculate the intersection then we estimate Pr(β |α ) =? We will eventually 1 0 |(α ,β ) 0 0 of N and N ,where N ={n } refers (α ,β ) ˆ (α ,β ) (α ,β ) 0 0 0 0 0 0 (α ˆ ,β ) 1 1 ˆ get Pr(β |α ) = 0 using FM (the prediction 1 0 |(α ,β ) 0 0 to the set of n , which is determined by the observed (α ,β ) 0 0 process is similar to predicting V-CQ2). Obviously, if the evidence in the real world, and N ={n } refers ˆ ˆ (α ˆ ,β ) (α ˆ ,β ) 1 1 1 1 assassination failed (that is, the shot was successfully fired to the set of n , which is updated in the counterfactual (α ˆ ,β ) 1 1 but did not cause the target to die) and Kennedy is still alive, world. For example, N ={1, 2} and N = (α ,β ) 0 0 (α ˆ ,β ) 1 1 this situation may affect the assassin’s further decisions {2, 4} in CQ1 can be derived from Fig. 3. Therefore, the and deployment. For Kennedy’s team, this may affect probabilistic evaluation of CQ1 is uniquely determined by the deployment of security measures for similar activities. N ∩ N = n = 2. 
Hence, the probabilistic (α ,β ) ˆ 0 0 (α ˆ ,β ) Therefore, when the estimated result of a CQ is 0, the 1 1 evaluation of CQ1 is Pr(c ) . 2,β |(α ,β ) 0 0 result cannot provide credible and sufficient opinions for However, unlike CQ1,the N in V-CQ2 is {3, 4}, decision-making. (α ˆ ,β ) 1 1 which causes the probability evaluation of V-CQ2 to be 0 (i.e., (3), because N ∩ N =∅). This (α ,β ) A straightforward solution Through the above series of 0 0 (α ˆ ,β ) 1 1 probability estimate is not completely credible. The reason analyses, it is not difficult to find that when the probability is that we cannot be sure whether the output results are of a CQ is evaluated as 0, for this situation, further Rethinking the framework constructed by counterfactual functional model 12963 verification and analysis are indispensable. Because the cannot observe the potential outcomes (i.e, counterfactual inference mechanism of FM itself will inevitably introduce outcomes) in the counterfactual world (e.g., O in c,u ambiguity for the evaluation result of Pr(·) = 0. Since the Example 6). This situation where all potential outcomes of evaluation of the FM determines the final output solution units cannot be observed simultaneously is also called FPCI through the intersection between two sets, there is a certain we mentioned earlier. Formally, for binary intervention probability that the intersection is an empty set. variables, let d ∈{t = 1,c = 0}, the observation outcome A straightforward solution is that if an empty set appears O and the potential outcome Y can be expressed by the d,u o in the estimation process, we need to stop using the FM for following formula: estimation because the above analysis shows that we cannot Y = O , if d =1 o 1,u define the empty set as Pr(·) = 0. Therefore, when this Y = d ·O +(1−d)·O = (5) o 1,u 0,u i i Y = O , if d =0. 
o 0,u happens, we should estimate the output probability in the i real world instead of the counterfactual world to avoid the Where O ∈{O ,O } represents the potential d,u t,u c,u i i i appearance of ambiguous results. In this case, Pr(·) = 0 outcome of treatment d ∈{t, c} on unit u ∈ U. plays a role in prompting a replacement prediction strategy. For a more intuitive description, we focus on the Therefore, to comply with the counterfactual consistency following 2-dimensional Gaussian distribution model rule, we must use the prior probability (4) (i.e.,1 − p )to G(O ,O ) ∼ N (μ ,μ ,σ ,σ ,ρ).(6) replace Pr(·) = 0. t,u c,u t c t c i i Specifically, we introduce the following example [36]and use it as a basic background for subsequent analysis. 4 The mild treatment-unit additivity assumption Example 5 Given the pair (O ,O ), O ,and O t,u c,u t,u c,u i i i i are independent and identically distributed (i.i.d.), each For the second reflection in Fig. 2, in this section, we with the 2-dimensional Gaussian distribution with means analyze the TUA assumption, which is often used as a (μ ,μ ), σ = σ = σ (for simplicity of calculation, we t c c t o strong prerequisite for estimating causal effects in data. We assume that the distribution has a common variance σ ), and first review the potential outcome framework (Section 4.1), the correlation ρ ∈ (0, 1). Furthermore, we use the mixed individual causal effect (Section 4.2), the definition of model to describe the specific structure, i.e., TUA (Section 4.3), and provide an equivalent description O  μ + τ + λ of TUA utilizing vectorization (Section 4.4). Second, d,u d u d,u i i i based on the idea of the Damped Vibration Equation ⎪ μ = μ or μ d t c (7) (DVE) [41], we propose a mild TUA assumption (called M- s.t. , τ ∼ N (0,σ ) = N (0,ρσ ) u τ o TUA) (Section 4.5). M-TUA not only weakens the original λ ∼ N (0,σ ) = N (0,(1 − ρ)σ ) d,u λ o assumption but also has good mathematical properties and interpretability . 
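The intersection step of Section 3.3 and the fallback proposed under "A straightforward solution" can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `evaluate_cq`, the dictionary `pr_c`, and all numeric values (including `p0 = 0.9`) are hypothetical and are not taken from Fig. 3 or from [3]; only the index sets {1, 2}, {2, 4} and {3, 4} come from the text.

```python
from typing import Dict, Set


def evaluate_cq(n_real: Set[int], n_cf: Set[int],
                pr_c: Dict[int, float], prior_fallback: float) -> float:
    """Sum Pr(c_n | evidence) over the indices shared by both worlds.

    If the intersection is empty, the FM output of 0 is treated as
    ambiguous and the real-world prior (1 - p0 in Eq. (4)) is returned.
    """
    common = n_real & n_cf
    if not common:
        return prior_fallback
    return sum(pr_c[n] for n in common)


# Hypothetical posterior Pr(c_n | (alpha_0, beta_0)); illustrative numbers only.
pr_c = {1: 0.3, 2: 0.7, 3: 0.0, 4: 0.0}
p0 = 0.9  # assumed value of Pr(beta_0 | alpha_0)

cq1 = evaluate_cq({1, 2}, {2, 4}, pr_c, 1 - p0)   # intersection {2}
vcq2 = evaluate_cq({1, 2}, {3, 4}, pr_c, 1 - p0)  # empty intersection, prior used
print(cq1, vcq2)
```

Returning the prior instead of 0 is one possible reading of the replacement strategy; a caller that needs to distinguish the two cases could instead return a flag alongside the value.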
where μ indicates the treatment effects applicable to all Our main conclusion in this section is presented based units. τ represents the effect on unit u ∈ U, called unit u i on two lemmas, and the specific proof process is mainly effects, and this effect applies to all units, i,e., τ = τ . u u i j,j =i divided into the following two steps. First, we describe λ stands for the effect between treatment and unit, called d,u the relationship between TUA and ICE in the counterfac- unit-treatment interaction. This internal mechanism reveals tual approach, and we explore the equivalence of ICE and the change from one treatment to another for unit u . τ and i u residual causal effect (RCE) in the TUA assumption (i.e., λ are independent random variables. d,u Lemma 1). Second, we innovatively introduce the defini- tions of positive effects and negative effects, and on this 4.2 Individual causal effect basis, we obtain the equivalent form of TUA in vector space by Lemma 2. Dawid [36] adopts the model of (7) to analyze the pros and cons of the counterfactual based on the idea of decision- 4.1 Potential outcome framework making and mentions an assumption that is often used in the counterfactual analysis, which is called TUA (Definition 2). According to the viewpoint of Rubin [42], there is an As the TUA assumption has strong constraints on data, it intervention in the causal inference, which means that will lead to a reduction in the practicability and scope of there is no cause and effect without intervention, and one use of TUA. Hence, in this paper, another goal of a study intervention state corresponds to a potential outcome. When is to design a mild TUA assumption that constrains the the intervention state is realized, we can only observe dataset itself or the experimental population as a whole, the potential outcomes in the realization state, that is, we rather than a strong constraint on each individual, as in 12964 C. Wang et al. the traditional TUA assumption. 
In the rest of this section, For example, we can use (11) to estimate the ICE of the we try to optimize TUA to make it have a broader scope of new unit u . Because inferring (u ) is equivalent to new new application in the context of large data. inferring  (u ) and 2(1−ρ)σ under (11). Unfortunately, ACE i o Specifically, we first analyze the individual and average we cannot accurately determine the value of 2(1 − ρ)σ . causal effect based on (7). In an experimental study, the individual causal effect (ICE) is the basic object (or Example 6 (Calculation of causal effect parameters (i.e., a basic measure). It describes the differences in various ICE, ACE) in the ideal case). In Table 2, we construct potential outcomes of a given unit u ∈ U under all possible a simple example to demonstrate the calculation of the treatments d ∈{t, c}. Generally, for one unit u ∈ U,the causal effect parameters, such as ICE, ACE. Suppose a ICE can be represented as population contains four subjects, labeled as u , u , u and 1 2 3 u , respectively. For each u , the potential outcomes in both 4 i (u )  O − O.(8) i t,u c,u i i intervention states are known (in reality only one potential For different tasks, the ICE can also have other forms of outcome can be observed). Where individuals 1 and 2 are in description, such as (u ) = log O /O . Therefore, the intervention group (i.e., the set of some units receiving i t,u c,u i i from a broader perspective, the subtraction in the definition treatment t) and individuals 3 and 4 are in the control group of ICE may not necessarily be a subtraction in R.Note (the set of some units receiving treatment c). that no matter which form is used, only one potential outcome can be observed [43]. Researchers usually do not According to Table 2, we can obtain: pay attention to ICE directly, but focus on the average value (u ) = E(O − O ) of the causal effect of all units, that is, ACE, also known ACE i t,u t,u i i as average treatment effect (ATE). 
ACE can be expressed = (30 + 0 + 10 + 0) = 10. (13) by the following formula, (u )  E((u )) = E O − O .(9) ACE i i t,u c,u Meanwhile, based on the information in Table 2, we can i i further obtain information on two other causal effect param- Apparently, in (7),  (u ) = μ − μ . ACE i t c eters, one is average treatment effect for the treated (ATT) and the other is average treatment effect for the control Limitations of the counterfactual approach focused on (ATC). Where, ICE We utilize the above Example 5 for our analysis. Specifically, according to (7)and (8), we have that, (u ) = E(O −O |d = t) = (30+0) = 15. (14) ATT i t,u t,u i i (u ) = O −O = (μ − μ ) + (λ − λ ) i t,u c,u t c t,u d,u i i i i (10) =  (u ) + (λ ), ACE i u and where (λ )  λ − λ is called residual causal u t,u d,u i i i (u ) = E(O − O |d = c) = (10 + 0) = 5. (15) ) ∼ ATC i t,u t,u effect (RCE) [36]. It is easy to verify that (λ u i i N (0, 2(1 − ρ)σ ). Thus, according to (7)-(9), we could Unfortunately, in the real world, the boldface numbers obtain the distribution of ICE as follows: (e.g., O , O )inTable 2 are not observable to us. The c,u t,u 2 3 (u ) ∼ N ( (u ), 2(1 − ρ)σ ). (11) i ACE i o reason is that the treatment received by subject u is d = t, However, in (11), 2(1 − ρ)σ cannot be inferred from we can not observe the potential outcome of u receiving o 2 observed data and has nothing to do with the size of the treatment d = c at the same time. Therefore, in the real data. Because even if the marginal distributions of O and world, the calculation and estimation of the causal effect t,u O are known, the joint distribution of random variables parameters require additional constraints (e.g., Treatment- c,u G(O ,O ) cannot be determined, and the marginal unit additivity assumption (Definition 2) to be imposed on t,u t,u i i distribution of the Gaussian distribution does not depend on the data. the parameter σ . 
Table 2 Causal effect parameters

Subject   $O_{t,u_i}$   $O_{c,u_i}$   Observed $O$   $d$   $O_{t,u_i} - O_{c,u_i}$
$u_1$     30            0*            30             t     30
$u_2$     10            10*           10             t     0
$u_3$     10*           0             0              c     10
$u_4$     10*           10            10             c     0

* Boldface entry: a potential outcome that cannot be observed.

Moreover, for the variance term in (11), according to (7), we have

$$\begin{cases} 2(1-\rho)\sigma_o = 2\sigma_\lambda \in (0, 2\sigma_o), & \text{if } \rho \in (0, 1) \\ 2(1-\rho)\sigma_o = 2\sigma_\lambda \in (2\sigma_o, 4\sigma_o), & \text{if } \rho \in (-1, 0). \end{cases} \tag{12}$$

(12) indicates that different values of $\rho$ determine different variances of the distribution of $\Delta(u_i)$. We can only get a range of $\sigma_\lambda$, and a different $\rho$ will lead to a different $\sigma_\lambda$, which will cause a variety of uncertain results for reasoning.

4.3 Treatment-unit additivity

In summary, the POF focuses on the inference of causal effects but does not explain the mechanism of influence between variables [44]. A computational bottleneck is the prediction of parameter $\rho$ through the marginal distribution. Therefore, in the task of using the causal model for inference, additional constraints (e.g., Example 7) are usually required to ensure that the inference result is obtained under this constraint.

Example 7 Under the TUA, $\Delta(u_{new}) = \Delta_{\mathrm{ACE}}(u_i)$ implies that $\rho = 1$.

Definition 2 (Treatment-Unit Additivity (TUA) [36]). The TUA assumption deals with the non-uniformity of data through a strong processing method. Specifically, TUA requires that $\Delta(u_i)$ in $\Delta(u_i) \triangleq O_{t,u_i} - O_{c,u_i}$ is the same for all units in $U$, e.g., $\Delta(u_1) = \Delta(u_2) = \cdots = \Delta(u_{|U|}) = \Delta_{\mathrm{ACE}}(u_i)$.

TUA can be equivalently regarded as the assumption of constant effect (AOCE). For example, we can set $\Delta(u_i) = \Delta(u_{j,j\neq i})$ = a specific constant (e.g., $\Delta_{\mathrm{ACE}}(u_i)$). Generally speaking, AOCE uses the average effect in the sample to estimate the causal effect. Next, we give a simple example to demonstrate the relation between TUA and ACE and the application of TUA.

Example 8 Considering a fundamental problem of causal inference, let $u_1$ be a patient. We want to know whether a certain medication has a therapeutic effect on $u_1$. Suppose that the data about patient $u_1$ is shown in Table 3. According to Table 3, we only know that $O_{t,u_1} = 13$. Due to the existence of the FPCI, we cannot simultaneously observe the effects of $u_1$ taking the medication and not taking the medication. Therefore, we rely on adding additional constraints (i.e., TUA) to estimate the value of $O_{c,u_1}$.

Suppose we also have additional data (as shown in Table 4). We can then use the TUA assumption to infer the values of $O_{c,u_i}$ and $O_{t,u_i} - O_{c,u_i}$ ($i = 1, 2, 3, 4, 5$). For example, according to

$$\Delta(u_i) = O_{t,u_i} - O_{c,u_i} = \Delta_{\mathrm{ACE}}(u_i) = -1, \tag{16}$$

we can obtain the following complete prediction data (see Table 5).

Table 3 The data of $u_1$, where $O_{c,u_1}$ and $O_{t,u_1} - O_{c,u_1}$ are unknown

Subject   $O_{t,u_i}$   $O_{c,u_i}$   $O_{t,u_i} - O_{c,u_i}$
$u_1$     13            ?             ?

Table 4 Additional information about all $u_i$

Subject   $O_{t,u_i}$   $O_{c,u_i}$   $O_{t,u_i} - O_{c,u_i}$
$u_1$     13            ?             ?
$u_2$     ?             12.5          ?
$u_3$     10            ?             ?
$u_4$     ?             13            ?
$u_5$     ?             12            ?
mean      11.5          12.5          −1

Table 5 Assignment mechanism based on TUA assumption with $\Delta_{\mathrm{ACE}}(u_i) = -1$

Subject   $O_{t,u_i}$   $O_{c,u_i}$   $O_{t,u_i} - O_{c,u_i}$
$u_1$     13            14            −1
$u_2$     11.5          12.5          −1
$u_3$     10            11            −1
$u_4$     12            13            −1
$u_5$     11            12            −1
mean      11.5          12.5          −1

4.4 Equivalent form of TUA

TUA assumes that the causal effect $\Delta(u_i)$ is the same for all units in $U$, e.g., $\Delta(u_i) = \Delta_{\mathrm{ACE}}(u_i)$, $i \in [1, \ldots, |U|]$. Unfortunately, as a commonly used prerequisite, TUA is a strong assumption, which cannot be tested on observable data and lacks a more transparent explanation in the real world [36]. This leads to some interesting questions worth exploring, such as:

– For applications of TUA, how to obtain a mild version of the TUA assumption to make the TUA more broadly applicable?
– For interpretability of TUA, based on the TUA assumption (or a mild TUA), how to establish a formal expression to describe the impact of the main factors inside the data on estimating ICE?

To address these issues, we first provide an equivalent form of the TUA assumption under the 2-dimensional Gaussian distribution (i.e., Lemma 1).

Lemma 1 If the data follows a Gaussian distribution as in Example 5, then the TUA assumption has the following equivalent form, i.e.,

$$\underbrace{\Delta(u_1) = \Delta(u_2) = \cdots = \Delta(u_q)}_{\text{TUA}} \Longrightarrow \lim_{q \to q^{*}} \left[ \Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}}) \right] = 0, \tag{17}$$

where $u_i, u_{j,j\neq i} \in U$, $i, j \in [1, \ldots, q]$, $q = |U|$, $q^{*}$ is a sufficiently large positive integer ($q \ll q^{*}$), and $\Delta(\lambda_{d,u_i}) = \lambda_{t,u_i} - \lambda_{c,u_i}$.

Proof Given two units $u_i$ and $u_{j,j\neq i}$, according to (7) and (8), we have that

$$\Delta(u_i) - \Delta(u_{j,j\neq i}) = (\lambda_{t,u_i} - \lambda_{c,u_i}) - (\lambda_{t,u_{j,j\neq i}} - \lambda_{c,u_{j,j\neq i}}) = \Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}}). \tag{18}$$

Hence, a reasonable idea based on (18) is that we can shift our attention from the constraint on $\lambda_{d,u_i}$ to a constraint on the RCE $\Delta(\lambda_{d,u_i})$. Note that the predicted average value of $O_{t,u_i} - O_{c,u_i}$ (denoted as $\bar{O}_{t,u_i} - \bar{O}_{c,u_i}$) will be closer to $E(O_{t,u_i} - O_{c,u_i})$ if the size of the data is large enough. Therefore, $\Delta_{\mathrm{ACE}}(u_i)$ can be identified, from a large experiment, as $\bar{O}_{t,u_i} - \bar{O}_{c,u_i}$. This means that the impact of $\Delta(\lambda_{d,u_i})$ on the data may be related to the size of the data.

Consider a group $U = \{u_1^d, u_2^d, \ldots, u_q^d\}$ containing $q$ units, where $u_j^d$ means that the unit $u_j$ will receive treatment $d$. We can assign "treatment" through Randomized Controlled Trials (RCT) and collect all potential outcomes, i.e., $O_t = \{O_{t,u_j}\}_k$ and $O_c = \{O_{c,u_j}\}_{q-k}$. Suppose that $q$ is a large positive integer and naturally let $E(O_{t,u_i} - O_{c,u_i}) = \bar{O}_{t,u_i} - \bar{O}_{c,u_i}$. We then have that

$$\Delta_{\mathrm{ACE}}(u_i) = \left( \frac{1}{k} \sum_{j=1}^{k} O_{t,u_j} - \frac{1}{q-k} \sum_{j=k+1}^{q} O_{c,u_j} \right) = \hat{\Delta} \quad \text{s.t. } O_{t,u_j} \sim N(\mu_t, \sigma_o),\ O_{c,u_j} \sim N(\mu_c, \sigma_o), \tag{19}$$

where $\frac{1}{k}\sum_{j=1}^{k} O_{t,u_j}$ represents the average of the responses of the $k$ units receiving treatment $t$, and $\frac{1}{q-k}\sum_{j=k+1}^{q} O_{c,u_j}$ is the average of the responses of the $q - k$ units receiving treatment $c$. $q$, $k$, and $q - k$ are all large numbers. Therefore, $\Delta_{\mathrm{ACE}}(u_i) = \hat{\Delta}$ is estimable and close to the true value.

Next, we employ the TUA constraint on (18), which is equivalent to setting $\Delta(u_i) - \Delta(u_{j,j\neq i}) = 0$. According to (18), it is unnecessary for us to constrain every $\lambda_{d,u_i}$ to a fixed value if $q$ is large enough (e.g., $q \to q^{*}$). The alternative solution is that we consider the difference between two $\Delta(\lambda_{d,u_i})$, and formally characterize $\Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}})$ so that it gradually approaches 0 when $q$ is a large number. Therefore, in the case of the considered RCE, we obtain the equivalent form of the TUA assumption,

$$\lim_{q \to q^{*}} \left[ \Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}}) \right] = 0, \tag{20}$$

which proves the lemma.

4.5 The properties of $\Delta(\lambda_{d,u_i})$ in 2-dimensional vector space

Further, we analyze the properties of TUA in 2-dimensional vector space. Through the above analysis, it is not difficult to find that both the TUA and the equivalent form given by Lemma 1 are only numerical constraints (e.g., $\Delta(u_i) = \Delta(u_{j,j\neq i})$, $\Delta(\lambda_{d,u_1}) - \Delta(\lambda_{d,u_2})$). In other words, neither the TUA assumption itself nor Lemma 1 reflects their internal influence on the data. To explore the internal influence of TUA on the data, our core idea is to transform the original TUA assumption of constraints on values (i.e., scalars) into constraints on vectors. Specifically, we analyze the TUA assumption by vectorizing $\lambda_{d,u_i}$ (i.e., Lemma 2) and introducing a definition of the positive and negative effects of $\lambda_{d,u_i}$ on the data (i.e., Definition 3).

Lemma 2 For any $\lambda_{d,u_i}$, let $\Delta_{+}(\lambda_{d,u_i})$ denote the positive effect of $\lambda_{d,u_i}$ on the data, and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ denote the negative effect of $\lambda_{d,u_{j,j\neq i}}$ on the data. Then the TUA assumption has the following equivalent form in the vector space, i.e.,

$$\underbrace{\Delta(u_1) = \Delta(u_2) = \cdots = \Delta(u_q)}_{\text{TUA}} \Longrightarrow \lim_{q \to q^{*}} \left( \sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) - \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \tag{21}$$

where $q^{+} + q^{-} = q$.

Before proving Lemma 2, we need to introduce the definition of the vectorization of $\lambda_{d,u_i}$, positive effects, and negative effects.

Definition 3 (The vectorization of $\lambda_{d,u_i}$.) Let $\lambda_{d,u_i} = L_{ao}$ represent the distance from a certain point $a$ to the point $o$ in the coordinate system (e.g., in Fig. 4a, $L_{ao}$ represents $\lambda_{d,u_i}$ and $L_{bo}$ represents $\lambda_{d,u_{j,j\neq i}}$). The vectorization of $\lambda_{d,u_i}$ refers to assigning the characteristics of a vector to $\lambda_{d,u_i}$ to describe the possible positive or negative effect of $\lambda_{d,u_i}$ on the data. As shown in Fig. 4b, for each $\lambda_{d,u_i}$,

$$\Delta(\lambda_{d,u_i}) \triangleq \begin{cases} \text{positive effect}, & \text{if } \Delta(\lambda_{d,u_i}) \text{ is in the first or second quadrant} \\ \text{negative effect}, & \text{if } \Delta(\lambda_{d,u_i}) \text{ is in the third or fourth quadrant.} \end{cases} \tag{22}$$

As shown in Fig. 4-(c) and (d), for $\lambda_{d,u}$,

$$\Delta(\lambda_{d,u}) \triangleq \begin{cases} \text{positive effect}, & \text{if } \Delta(\lambda_{d,u}) \text{ are in the first or second quadrant} \\ \text{negative effect}, & \text{if } \Delta(\lambda_{d,u}) \text{ are in the third or fourth quadrant.} \end{cases} \tag{23}$$

There is a one-to-one correspondence between positive effects and negative effects. In other words, if a positive effect "+" exists, there must be a negative effect "-" corresponding to it.

Rationality analysis  According to Definition 3, we transform the original TUA assumption of constraints on scalars into constraints on vectors. For example, some individuals insist on eating nuts in actual life because nuts are good for their health (i.e., positive effect), but some people are allergic to nuts, and eating them will bring pain and even life-threatening effects (i.e., negative effect). Therefore, we argue that it is necessary to consider the positive or negative effects of $\lambda_{d,u_i}$. Definition 3 provides an intuitive representation of positive/negative effects in the vector space, and according to the definition, we next give a proof of Lemma 2.

Proof For ease of understanding, we combine Fig. 4 with the proof. Consider the representation of $\lambda_{d,u_i}$ in a 2-dimensional plane. As shown in Fig. 4a, we first represent $\lambda_{d,u_i}$ as the Euclidean distance in the plane, i.e.,

$$L_{ao} = \Delta(\lambda_{d,u_i}), \quad \text{and} \quad L_{bo} = \Delta(\lambda_{d,u_{j,j\neq i}}). \tag{24}$$

According to Lemma 1, $\Delta(u_i) = \Delta(u_{j,j\neq i})$ can be regarded as $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$. Then, we can use $L_{ao} = L_{bo}$ to equivalently describe $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$.

Second, we consider the representation of the TUA in 2-dimensional vector space. According to Definition 3, we can vectorize $\lambda_{d,u_i}$. The meaning of vectorization is to give each $\Delta(\lambda_{d,u_i})$ a measure, which aims to describe the positive or negative effects of $\Delta(\lambda_{d,u_i})$ on the data. In order to maintain consistency with the original TUA assumption, we assume that $|\Delta(\lambda_{d,u_i})| = |\Delta(\lambda_{d,u_{j,j\neq i}})|$. For instance, as shown in Fig. 4b, let $|\Delta_{+}(\lambda_{d,u_i})|$ ($|\Delta_{-}(\lambda_{d,u_{j,j\neq i}})|$) denote the positive (negative) effect of $\Delta(\lambda_{d,u_i})$ on the data; although $|\Delta(\lambda_{d,u_i})| = |\Delta(\lambda_{d,u_{j,j\neq i}})|$, as vectors $\Delta(\lambda_{d,u_i}) \neq \Delta(\lambda_{d,u_{j,j\neq i}})$.

Third, we consider extending $\Delta(\lambda_{d,u_i})$ to the entire dataset. Since the background of our research is the context of large datasets, we imply a condition here, that is, in the entire data, the positive effects $\sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i})$ and negative effects $\sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on data generation are basically the same. Furthermore, since $|\Delta(\lambda_{d,u_i})| = |\Delta(\lambda_{d,u_{j,j\neq i}})|$, we can visualize the entire data as a circle in a 2-dimensional plane, where $|\Delta(\lambda_{d,u_i})| = |\Delta(\lambda_{d,u_{j,j\neq i}})| = r$.

Intuitively, under the TUA constraint, $\sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) = \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ always holds. However, this equality does not need the strong constraint $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$ in order to hold. In other words, in Fig. 4-(b), it is sufficient that the area of red is the same as the area of blue. Therefore, we can relax the restriction on $\Delta(\lambda_{d,u_i})$ by only assuming $\sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) = \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ without $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$. In summary, we obtain the following conclusion based on TUA, i.e.,

$$\lim_{q \to q^{*}} \left( \sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) - \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \tag{25}$$

which proves the lemma.

Rationality analysis  The traditional TUA strongly constrains all $\lambda_{d,u_i}$ (or $\Delta(u_i)$) to be the same for $u_i \in U$, which undoubtedly ignores the effect of $\lambda_{d,u_i}$ on the data and the estimated ICE. However, ignoring this effect by applying TUA does not mean that the effect of $\lambda_{d,u_i}$ on the data does not exist. Therefore, we do not ignore this potential impact but represent it by introducing the vectorization method (i.e., positive and negative effects in Definition 3). In addition, Lemma 2 relaxes the constraint on the data to the level of the entire dataset $U$ rather than imposing a strong constraint on each unit $u_i$. Therefore, Lemma 2 can be considered as an equivalent form of TUA at the abstract level.

Fig. 4 Figures (a)-(d) describe the equivalent representation of the TUA in the vector space by vectorizing $\lambda_{d,u_i}$. (a) is the geometric description of the traditional TUA assumption in the coordinate system. According to Lemma 1, $\Delta(u_i) = \Delta(u_{j,j\neq i})$ can be regarded as $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$. Hence, in the 2-dimensional plane, we can use the Euclidean distance $L_{ao} = L_{bo}$ to describe $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$; (b) describes the vectorization of $\Delta(\lambda_{d,u_i})$. According to the definitions of positive (red) and negative (blue) effects and the TUA assumption, we have $|\Delta_{+}(\lambda_{d,u_i})| = |\Delta_{-}(\lambda_{d,u_{j,j\neq i}})|$; (c) describes the vectorization of $\Delta(\lambda_{d,u})$. It should be noted that the positive and negative effects of $\lambda_{d,u_i}$ on the data are almost equal when the number of samples is large enough. Since $|\Delta_{+}(\lambda_{d,u_i})| = |\Delta_{-}(\lambda_{d,u_{j,j\neq i}})|$, all $\Delta(\lambda_{d,u})$ after vectorization can form a circle in a 2-dimensional plane; (d) reflects the expansion of the TUA assumption in the vector space. It can be regarded as a visualization of the TUA assumption at an abstract level (that is, constraints are applied to the dataset $U$ rather than to each $u_i$). In other words, it is no longer necessary that $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$

5 The convergence of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$

Through the above analysis, we provide the equivalent form of the TUA, which is based on the 2-dimensional Gaussian distribution and a large dataset. By performing vectorization operations on $\lambda_{d,u_i}$, $u_i \in U$, we introduce the definitions of positive and negative effects, respectively, aiming to study the effect of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on the data under the premise $\Delta(\lambda_{d,u_i}) \neq \Delta(\lambda_{d,u_{j,j\neq i}})$. Although we assume that the effects of $\sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i})$ and $\sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ are equal in a large dataset, we hope that $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ will have less and less impact on the data as $q$ approaches $q^{*}$. This concern is necessary because if the sample size is not large enough, the positive and negative effects may not cancel each other out. For example, the positive effects may be greater than the negative effects or vice versa. Quantifying $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ requires rigorous and rational mathematical expressions. Therefore, a natural question is: how to describe the convergence of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ when $q$ approaches $q^{*}$? We answer this question in Theorem 1.

5.1 The descriptive equation of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$

In classical physics, damping refers to the characteristic that the amplitude of vibration gradually decreases in any oscillating system, which may be caused by external influences or the system itself [45]. We introduce the above ideas into the study of the descriptive equation of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$.
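As background for the damping analogy, the standard textbook solution of an underdamped oscillator is recalled here. This is an assumption about which physical model is intended, since the text does not write it out; the symbols $x$, $t$, $\gamma$, $\omega_0$, $\omega_d$ and $\varphi$ are introduced only for this illustration.

```latex
% Underdamped harmonic oscillator (0 < \gamma < \omega_0):
%   \ddot{x} + 2\gamma \dot{x} + \omega_0^2 x = 0
\[
  x(t) = A\, e^{-\gamma t} \cos(\omega_d t + \varphi),
  \qquad
  \omega_d = \sqrt{\omega_0^{2} - \gamma^{2}} .
\]
% A        : initial amplitude
% \gamma   : damping rate (controls how fast the envelope A e^{-\gamma t} shrinks)
% \omega_d : angular frequency of the damped oscillation
% \varphi  : phase
```

Read with the sample size $q$ in the role of time $t$, a decay rate $\eta$ in the role of $\gamma$, an oscillation frequency treated as a free multiple $n \cdot \eta$, and $\varphi = 0$, this has the shape $A e^{-\eta \cdot q} \cos(n \cdot \eta \cdot q)$: an exponentially shrinking envelope multiplied by a bounded oscillation.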
In this section, we provide a descriptive equation for $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$, which satisfies that when $q$ approaches $q^{*}$, $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ converge strictly to 0 (see Theorem 1).

Theorem 1 For $\lambda_{d,u_i}$, suppose there are a positive effect $\Delta_{+}(\lambda_{d,u_i})$ and a negative effect $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ of $\Delta(\lambda_{d,u_i})$ on the data, and $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ satisfy (or approximately satisfy) the following equation,

$$\begin{cases} S(\Delta_{+}(\lambda_{d,u_i}), q) \triangleq A_{+} e^{-\eta_{+} \cdot q} \cos(n \cdot \eta_{+} \cdot q) \\ S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q) \triangleq A_{-} e^{-\eta_{-} \cdot q} \cos(n \cdot \eta_{-} \cdot q), \end{cases} \tag{26}$$

where $n \in \mathbb{Z}^{+}$, and $\eta_{+} > 0$, $\eta_{-} > 0$ are adjustment parameters. $e^{-\eta_{+} \cdot q}$ and $e^{-\eta_{-} \cdot q}$ are attenuation parameters. $A_{+}$ and $A_{-}$ are the initial values of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$, respectively. Then $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ will gradually converge to 0 as $q$ approaches $q^{*}$.

Proof Let us first analyze the first factor of (26), i.e.,

$$\begin{cases} S_1(\Delta_{+}(\lambda_{d,u_i}), q) \triangleq A_{+} e^{-\eta_{+} \cdot q} \\ S_2(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q) \triangleq A_{-} e^{-\eta_{-} \cdot q}, \end{cases} \tag{27}$$

where $A_{+}$ and $A_{-}$ are the initial values of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$, respectively. Because $\eta_{+} > 0$ and $\eta_{-} > 0$, the two terms $e^{-\eta_{+} \cdot q}$ and $e^{-\eta_{-} \cdot q}$ in the equation decay with the data size $q$.

Unfortunately, if the equation only uses (27) to describe the exponential decay trend of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$, it cannot reflect the potential impact of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on the data. In other words, $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ do not necessarily follow a strictly monotonically decreasing function for convergence (see Fig. 5). Therefore, we need to consider the volatility effect of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on the data.

Consider that the influence of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on the data may be volatile. Therefore, we add the term "$\cos(n \cdot \eta \cdot q)$" to (27) to describe the volatility effect of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on the data. We can rewrite (27) as follows:

$$\begin{cases} S(\Delta_{+}(\lambda_{d,u_i}), q) \triangleq A_{+} e^{-\eta_{+} \cdot q} \cdot \cos(n \cdot \eta_{+} \cdot q) \\ S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q) \triangleq A_{-} e^{-\eta_{-} \cdot q} \cdot \cos(n \cdot \eta_{-} \cdot q), \end{cases} \tag{28}$$

where $n$ and $\eta_{+} > 0$, $\eta_{-} > 0$ are adjustment parameters, and $e^{-\eta_{+} \cdot q}$ and $e^{-\eta_{-} \cdot q}$ are attenuation parameters. Since $|\cos(n \cdot \eta_{+/-} \cdot q)| \le 1$, the cosine factor does not disturb the exponential decay of $A_{+/-} e^{-\eta_{+/-} \cdot q}$, so (26) also decays.

According to Fig. 5, we can intuitively understand the meaning of parameter $A_{+}$ and parameter $\eta_{+}$ in (27). The parameter $A_{+}$ determines the initial maximum value of the positive effect. The parameter $\eta_{+}$ determines the convergence speed of the function $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$. Although $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$ can describe that the positive effect converges to 0 quickly as the number of samples increases, it ignores the volatility of positive effects. The proof for $S_2(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ is similar.

Similarly, according to Fig. 6, we can intuitively understand the meaning of parameter $A_{+}$ and parameter $\eta_{+}$ in (26). The parameter $A_{+}$ determines the initial maximum value of the positive effect, the parameter $\eta_{+}$ determines the convergence speed of the function $S(\Delta_{+}(\lambda_{d,u_i}), q)$, and the $\cos(n \cdot \eta_{+} \cdot q)$ term reflects the volatility of the positive and negative effect.

Fig. 5 A visualization of the influence of parameters $(A_{+}, \eta_{+})$ on equation $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$. The situation of $S_2(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ is similar to the description of $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$

Fig. 6 A visualization of the influence of parameters $(A_{+}, n, \eta_{+})$ and $\cos(n \cdot \eta_{+} \cdot q)$ on equation $S(\Delta_{+}(\lambda_{d,u_i}), q)$. The situation of $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ is similar to the description of $S(\Delta_{+}(\lambda_{d,u_i}), q)$
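The behaviour described for Figs. 5 and 6 can be reproduced numerically. The sketch below evaluates the pure exponential form of (27) and the damped cosine form of (26) on a grid of sample sizes. The parameter values `a=1.0`, `eta=0.05` and `n=4` and the function names `s1` and `s` are arbitrary illustrative choices, not values taken from the paper.

```python
import math


def s1(q, a=1.0, eta=0.05):
    """Pure exponential decay, the form of (27): A * exp(-eta * q)."""
    return a * math.exp(-eta * q)


def s(q, a=1.0, eta=0.05, n=4):
    """Damped cosine, the form of (26): A * exp(-eta * q) * cos(n * eta * q)."""
    return s1(q, a, eta) * math.cos(n * eta * q)


qs = range(0, 501)
envelope = [s1(q) for q in qs]
damped = [s(q) for q in qs]

# (27) is strictly decreasing, so it cannot express any volatility.
strictly_decreasing = all(x > y for x, y in zip(envelope, envelope[1:]))

# (26) switches between positive and negative values ...
changes_sign = any(d > 0 for d in damped) and any(d < 0 for d in damped)

# ... but its magnitude never exceeds the exponential envelope of (27).
inside_envelope = all(abs(d) <= e + 1e-12 for d, e in zip(damped, envelope))

print(strictly_decreasing, changes_sign, inside_envelope)
print(envelope[-1], abs(damped[-1]))
```

Because the cosine factor is bounded by 1 in absolute value, the envelope alone controls how fast both curves approach 0; a larger `eta` speeds up the convergence and a larger `n` only adds more sign changes inside the same envelope.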
The purpose of introducing $\cos(n \cdot \eta \cdot q)$ is to reflect the conversion between the positive effect and the negative effect as much as possible. Regarding the form of conversion, it can either be a positive effect that becomes a negative effect or vice versa. However, no matter how it is converted, it will eventually converge to 0 strictly under the attenuation term $A_{+} e^{-\eta_{+} \cdot q}$. The proof for $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ is similar.

5.2 The rationality analysis of equations $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$

The rationality analysis of equations $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ mainly includes two aspects:

– One is the analysis of the visualization results of $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$.
– The other is the interpretability of $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$.

The function of $\cos(n \cdot \eta_{+/-} \cdot q)$  To simplify the presentation, we only analyze positive effects in this subsection. The analysis of negative effects is similar. As shown in Fig. 5, $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$ only reflects the nature of exponential decay as $q$ increases. Although $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$ can also eventually converge to 0, it does not reflect its potential impact on data, because $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$ directly describes the positive effect as a strictly monotonically decreasing function. However, a representation based on strict monotonic decrease ignores the description of its internal complexities. The effect of $\Delta_{+}(\lambda_{d,u_i})$ on data may be volatile (the situation may also be more complex). Therefore, in order to describe the volatility of $\Delta_{+}(\lambda_{d,u_i})$, we introduce the $\cos(\cdot)$ function. Apparently, $S(\Delta_{+}(\lambda_{d,u_i}), q)$ presents a trend of exponential decay with volatility. Finally, as $q$ increases, $S(\Delta_{+}(\lambda_{d,u_i}), q)$ will strictly converge to zero.

Attenuation parameters $e^{(-\eta_{+/-}) \cdot q}$  The purpose of introducing the attenuation parameter $e^{-\eta_{+/-} \cdot q}$ is to ensure that the positive effect and the negative effect can exhibit exponential decay characteristics as $q$ increases. Although we improve TUA by vectorization, we hope that $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ will have minimal impact on the overall data. Therefore, even while acknowledging the existence of positive and negative effects, we hope that $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ can decay as quickly as possible in an exponential decay manner.

In fact, according to Lemma 1, Lemma 2, and Theorem 1, we provide a milder TUA assumption (referred to as M-TUA for short) through vectorization operations. In particular, (26) provides a formal description of positive effects and negative effects, which makes M-TUA interpretable. In summary, the above conclusion provides a mild form of TUA at the abstract level and an explicit (but not unique) mathematical description.

6 Comparison of TUA and M-TUA

In this section, we compare the traditional TUA and M-TUA to illustrate the similarities and differences between them.

Table 6 Observation data with $\Delta_{\mathrm{ACE}}(u_i) = 1$

Subject    $O_{t,u_i}$   $O_{c,u_i}$   $O_{t,u_i} - O_{c,u_i}$
$u_1$      13            ?             ?
$u_2$      ?             9.5           ?
$u_3$      ?             8             ?
$u_4$      ?             10            ?
$u_5$      11            ?             ?
$u_6$      15            ?             ?
$u_7$      ?             9.5           ?
$u_8$      9             ?             ?
$u_9$      ?             10            ?
$u_{10}$   ?             9             ?
mean       ?             ?             1

– $\Delta(u_i)$ and $\Delta(\lambda_{d,u_i})$. TUA assumes that the value of ICE is the same for all $u_i \in U$ ($|U| = q$), e.g., $\Delta(u_i) = \Delta_{\mathrm{ACE}}(u_i)$, where $i \in [1, \ldots, q]$. M-TUA transfers the above problem to the constraint of $\Delta(\lambda_{d,u_i})$ by the vectorization operation, that is,

$$\text{TUA} \Longrightarrow \begin{cases} \lim_{q \to q^{*}} \left( \Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \\ \lim_{q \to q^{*}} \left( \sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) - \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \end{cases} \tag{29}$$

where $q^{+} + q^{-} = q$.
– Vector $\Delta_{+/-}(\lambda_{d,u_{i/j}})$ and Scalar $\Delta(\lambda_{d,u_i})$. M-TUA provides a vector description of positive and negative effects for $\Delta(\lambda_{d,u_i})$ (i.e., $\Delta_{+/-}(\lambda_{d,u_{i/j}})$), aiming to distinguish M-TUA from traditional TUA.
1 individuals to exist, that is, (λ ) = (λ ) d,u d,u i j,j =i is allowed under the premise of  (λ ) = + d,u As shown in Tables 8 and 9, it is not difficult to see that (λ ). Therefore, M-TUA achieves the − d,u j,j =i based on the assumption of M-TUA (i.e., (λ ) = weakening of TUA. d,u i=1 i (0.2+0.2+0.2+0.2+0.1+0.1)−(0.2+0.2+0.3+0.3) = 0), – Variance. For a randomized experiment, the assumption the data can be more in line with the assignment mechanism of TUA implies that the variance is constant for on the condition that the ACE value remains unchanged, all treatments. Constant variance is not a necessary thereby avoiding either O <O (i.e.,  (u )> condition for MTUA, MTUA should be used in data c,u t,u ACE i i i 0), or O >O (i.e.,  (u )< 0). For example, with small variance to constrain the dispersion of the c,u t,u ACE i i i according to (10), we have that population. (u ) =  (u ) + (λ ), i ∈[1, 2, ..., 10]. (31) For the intuitiveness of description, we use a simple i ACE i d,u example to further illustrate how M-TUA weakens the TUA Further, we obtain that, assumption. 10 10 1 1 (u ) = · (u ) = ·  (u ) + (λ ) . Example 9 (Difference between data generated by TUA ACE i i ACE i d,u 10 10 i=1 i=1 and M-TUA) TUA is different from M-TUA in a number (32) of respects. A simple goal in this example is to compare the differences in the data under different assumptions via 10 Since  (u ) =  (u ),in(32), only (λ ) = ACE i ACE i d,u i=1 i estimating the unobserved potential outcomes from Table 6. 0 needs to be satisfied. There are countless equations that satisfy (λ ) = 0. – Similar to Example 8, in Table 7, we construct a set of d,u i=1 i data (including 10 subjects u ,i ∈[1, 2, ..., 10])that meets the TUA assumption, where Table 7 Assignment mechanism based on TUA assumption with (u ) = 1 ACE i (u ) = E(O − O ) ACE i t,u c,u i i Subject O O O − O 10 t,u c,u t,u c,u i i i i = (O − O ) = 1. 
(30) t,u c,u i i u 13 12 1 10 1 i=1 u 11.5 10.5 1 –Tables 8 and 9 are constructed based on M-TUA u 10 9 1 assumption. u 12 11 1 u 11 10 1 As can be seen from Table 7, we know that the data only u 15 14 1 follows two situations, i.e., O <O (i.e.,  (u )> c,u t,u ACE i i i u 13 12 1 0), or O >O (i.e.,  (u )< 0). However, this 7 c,u t,u ACE i i i u 98 1 strong assumption is often violated in the real world, which u 8.5 7.5 1 forces all subjects to have the same (u ). M-TUA alleviates 9 u 12 11 1 this scenario and makes it more in line with the complex 10 situations in real data (note that the values of (λ ) in d,u mean 11.5 10.5 1 Tables 8 and 9 are not unique). Rethinking the framework constructed by counterfactual functional model 12971 Table 8 Assignment Subject O O (u) = O − O (λ ) = λ − λ mechanism based on M-TUA t,u c,u t,u c,u d,u t,u c,u i i i i i i i assumption with  (u ) = 1 ACE i u 13 11.8 1.2 0.2 u 10.7 9.5 1.2 0.2 u 9.2 8 1.2 0.2 u 11.2 10 1.2 0.2 u 11 9.9 1.1 0.1 u 15 13.9 1.1 0.1 u 10.3 9.5 0.8 −0.2 u 98.2 0.8 −0.2 u 10.7 10 0.7 −0.3 u 9.7 9 0.7 −0.3 mean 10.98 9.98 1 0 Example 9 shows that the data constructed based on the only requires a small value of variance (e.g., the variance M-TUA assumption allows for differences between various of (λ ) in Table 8 is less than 0.5, and the variance of d,u u ’s (e.g., (u ,u ,u ,u )< 0, (u )> 0and (λ ) in Table 9 is close to 1). i 7 8 9 10 1,2,3,4,5 d,u (u ) = 0), while ensuring that  (u ) is constant (e.g., 6 ACE i (u ) = 1 ), which is more in line with the diversity of Limitations Although M-TUA has realized the weakening ACE i experimental samples in real tasks. 
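The constraints discussed in Example 9 can be verified directly on the tabulated values. The following sketch copies the $O_{t,u_i}$ and $O_{c,u_i}$ columns of Tables 7, 8 and 9 and computes, for each table, the ACE, the sum of the residual causal effects and their variance. The stated ACE of 1 comes from the table captions; the helper name `summarize` and the choice of the population variance (`statistics.pvariance`) are assumptions made for this illustration, since the text does not say which variance estimator is meant.

```python
from statistics import mean, pvariance

ACE = 1.0  # stated in the captions of Tables 7, 8 and 9

# (O_t column, O_c column) for subjects u_1 .. u_10.
tables = {
    "Table 7 (TUA)": (
        [13, 11.5, 10, 12, 11, 15, 13, 9, 8.5, 12],
        [12, 10.5, 9, 11, 10, 14, 12, 8, 7.5, 11],
    ),
    "Table 8 (M-TUA)": (
        [13, 10.7, 9.2, 11.2, 11, 15, 10.3, 9, 10.7, 9.7],
        [11.8, 9.5, 8, 10, 9.9, 13.9, 9.5, 8.2, 10, 9],
    ),
    "Table 9 (M-TUA)": (
        [13, 11.53, 10.04, 12.04, 11, 15, 9.48, 9, 9.96, 8.96],
        [10.98, 9.5, 8, 10, 9, 15, 9.5, 9.03, 10, 9],
    ),
}


def summarize(o_t, o_c):
    """Return (mean ICE, sum of RCE, population variance of RCE)."""
    ice = [t - c for t, c in zip(o_t, o_c)]   # ICE, as in (8)
    rce = [d - ACE for d in ice]              # RCE, rearranged from (31)
    return mean(ice), sum(rce), pvariance(rce)


results = {name: summarize(*columns) for name, columns in tables.items()}
for name, (ace, rce_sum, rce_var) in results.items():
    print(f"{name}: ACE={ace:.3f}, sum RCE={rce_sum:.3f}, var RCE={rce_var:.3f}")
```

All three tables share the same ACE and a residual sum of zero (up to floating-point rounding), but only the TUA table has zero residual variance; the two M-TUA tables differ from each other mainly in how dispersed the residuals are.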
However, note that it is not sufficient to simply require that $\sum_{i=1}^{10} \Delta(\lambda_{d,u_i}) = 0$ holds, because this constraint alone does not guarantee that the data keeps good dispersion. Therefore, an indispensable measure is to introduce variance as a metric to constrain the data so that the data constructed based on M-TUA maintains good dispersion. The reason is that the larger the population and the smaller the variance, the closer the ACE would be to the true ACE regardless of the specific units randomly assigned to treatments. As mentioned above, for a randomized experiment, the TUA implies that the variance is constant for all treatments, which means that a necessary condition for TUA is that the variance is constant, while M-TUA only requires the variance to be small.

Table 9 Assignment mechanism based on M-TUA assumption with $\Delta_{\mathrm{ACE}}(u_i) = 1$

Subject    $O_{t,u_i}$   $O_{c,u_i}$   $\Delta(u_i) = O_{t,u_i} - O_{c,u_i}$   $\Delta(\lambda_{d,u_i}) = \lambda_{t,u_i} - \lambda_{c,u_i}$
$u_1$      13            10.98         2.02                                    1.02
$u_2$      11.53         9.5           2.03                                    1.03
$u_3$      10.04         8             2.04                                    1.04
$u_4$      12.04         10            2.04                                    1.04
$u_5$      11            9             2                                       1.00
$u_6$      15            15            0                                       −1.00
$u_7$      9.48          9.5           −0.02                                   −1.02
$u_8$      9             9.03          −0.03                                   −1.03
$u_9$      9.96          10            −0.04                                   −1.04
$u_{10}$   8.96          9             −0.04                                   −1.04
mean       11.001        10.001        1                                       0

Limitations  Although M-TUA has realized the weakening of TUA to a certain extent and expanded the use scope of the original TUA, M-TUA itself is based on some assumptions and finally achieves the equivalence with TUA in the case of large samples, i.e., $q \to q^{*}$. Therefore, M-TUA still has the following limitations.

– Dimensionality limitation of vector space. We take the 2-dimensional Gaussian distribution as an example. Based on Example 5, we analyze the equivalent form of TUA in 2-dimensional vector space. The vectorization operation in 2-dimensional space can easily be extended to 3-dimensional space. However, the equivalent form of the TUA for data in high-dimensional space has not been rigorously established.
– $\sum \Delta_{+/-}(\lambda_{d,u_{i/j\neq i}})$. As shown in Fig. 4d, M-TUA implies a premise that

$$\lim_{q \to q^{*}} \left( \sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) - \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_j}) \right) = 0, \tag{33}$$

where $q^{+} + q^{-} = q$. It requires a large enough sample size to ensure that the equation

$$\sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) = \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_j}) \tag{34}$$

holds with a high probability, because the effect of any $\Delta(\lambda_{d,u_i})$ may be positive or negative (this is similar to the classical coin toss experiment: when the number of tosses is sufficient, the numbers of heads and tails are basically equal).
– Decay rate. The $e^{-\eta_{+/-} \cdot q}$ in Theorem 1 ensures that (26) will eventually converge to 0 with exponential decay. Of course, the purpose of choosing exponential decay is to make $\Delta_{+}(\lambda_{d,u_i})$ or $\Delta_{-}(\lambda_{d,u_j})$ converge quickly so that as the amount of sample data increases, the impact of $\Delta_{+}(\lambda_{d,u_i})$ or $\Delta_{-}(\lambda_{d,u_j})$ on the data will be minimal (or as small as possible) and eventually reach a negligible level.
– Ignorability. Since M-TUA is a constraint imposed on the task of making causal inferences in the POF, the ignorability (i.e., $(O_{t,u_i}, O_{c,u_i}) \perp d$) still needs to hold. In addition, we argue that estimating the variance of the data is still necessary (e.g., Example 9), because if the population is larger and the variance is smaller, the ACE would be closer to the true ACE regardless of the specific units randomly assigned to treatment.

Interpretability  Since the TUA cannot be tested and verified on the observed data, this will lead to limitations in the use of many models (e.g., the model of (7)) [36]. Therefore, it is necessary to obtain a milder and interpretable assumption. In general, M-TUA offers several advantages in terms of interpretability as follows:

– Based on the idea of the damped vibration equation (DVE), we establish the relationship between TUA and RCE and try to provide some reasonable explanations for $\lambda_{d,u_i}$.
– Through vectorization operations, we endow $\lambda_{d,u_i}$ with the ability to describe positive and negative effects on data, and theoretically prove the rationality of M-TUA under the large dataset.
– M-TUA not only weakens the strength of the original TUA assumption but also provides a geometric description of the TUA.
– In particular, the M-TUA has an explicit mathematical expression that represents the meaning of the original TUA assumption at an abstract level through a set of interpretable parameters.

7 Conclusion and future work

In this paper, we first use an example to illustrate the underlying problems of using the functional model to estimate the probability solution of counterfactual queries. We analyze the inference mechanism of the functional model and point out that there are ambiguous conclusions when the unique output probability solution is 0 under the functional model. In other words, when the probability solution obtained by the functional model is 0, it does not mean that the estimated event will not occur. Secondly, for the TUA assumption commonly used in counterfactual models, we provide an equivalent description form of the TUA in the low-dimensional space. We weaken the TUA assumption by vectorizing the original TUA and finally obtain a milder TUA assumption, i.e., M-TUA. In addition, we also give a theoretical proof and an analysis of the rationality and limitations of M-TUA.

As pointed out earlier, in M-TUA, the constraints on the unit are related to the dataset and RCE, instead of mandatory constraints for each unit. We argue this is very necessary, especially in the case of big data. A mild version of an assumption (not just M-TUA) can be viewed as an abstraction from the micro world to the macro world [46]. An intuitive example is that if we want to measure the water temperature of a swimming pool, it is impossible for us to measure every drop of water in the swimming pool. However, we do not think that the conclusion of this paper is the final form of the M-TUA. Therefore, we will focus on the following points in our future work.

Practicality  Causal science has shown vigorous vitality in the field of AI and public health [47]. However, a large number of tasks can only be carried out under the premise of satisfying strong assumptions. The use of some assumptions is also not differentiated according to the different tasks. Therefore, whether versions of such assumptions (including M-TUA) can be further developed for different AI task scenarios is a topic worthy of our further consideration.

Challenges posed by high-dimensional data  As a theoretical exploration of weakening TUA, M-TUA presents the equivalent form of TUA in vector space through vectorization and gives it a certain degree of interpretability. However, with the explosion of data, AI practitioners are confronted with data that are very large in both volume and dimensionality. Although our theorem shows that M-TUA is applicable in the case of big data, high-dimensional data brings new challenges. Therefore, how to develop assumptions based on M-TUA with theoretical guarantees and applicable to high-dimensional data is also the focus of our future work.

Acknowledgements This work was supported by the National Key R&D Program of China under Grant 2018YFB1403200.

References

1. Heintzelman SJ, Christopher J, Trent J, King LA (2013) Counterfactual thinking about one's birth enhances well-being judgments. J Posit Psychol 8(1):44–49
2. Morgan SL, Winship C (2015) Counterfactuals and causal inference. Cambridge University Press
3. Balke A, Pearl J (1994) Probabilistic evaluation of counterfactual queries. In: Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, pp 230–237
4. Lewis D (1976) Probabilities of conditionals and conditional probabilities. In: Ifs. Springer, pp 129–147
5. Ginsberg ML (1986) Counterfactuals. Artif Intell 30(1):35–79
6. Kong E, Prinz D (2020) Disentangling policy effects using proxy data: Which shutdown policies affected unemployment during the covid-19 pandemic? J Public Econ 189:104257
7. Luo G, Zhao B, Du S (2019) Causal inference and bayesian network structure learning from nominal data. Appl Intell 49(1):253–264
19. Abbasnejad E, Teney D, Parvaneh A, Shi J, Hengel A (2020) Counterfactual vision and language learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10044–10054
20. Bajaj M, Chu L, Xue ZY, Pei J, Wang L, Lam PC-H, Zhang Y (2021) Robust counterfactual explanations on graph neural networks. Adv Neural Inf Process Syst 34
21. Holzinger A, Malle B, Saranti A, Pfeifer B (2021) Towards multi-modal causability with graph neural networks enabling information fusion for explainable ai. Inf Fusion 71:28–37
22. Wachter S, Mittelstadt B, Russell C (2017) Counterfactual explanations without opening the black box: Automated decisions and the gdpr. Harv JL Tech 31:841
23. Hendricks LA, Hu R, Darrell T, Akata Z (2018) Generating counterfactual explanations with natural language. arXiv:1806.09809
24. Ustun B, Spangher A, Liu Y (2019) Actionable recourse in linear classification. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 10–19
25. Barocas S, Selbst AD, Raghavan M (2020) The hidden assumptions behind counterfactual explanations and principal reasons. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp 80–89
26. Pearl J (2018) Theoretical impediments to machine learning with seven sparks from the causal revolution. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp 3–3
27. Marx A, Vreeken J (2019) Telling cause from effect by local and global regression. Knowl Inf Syst 60(3):1277–1305
28. Bertossi L (2021) Specifying and computing causes for query answers in databases via database repairs and repair-programs.
8.
Liu Y, Yu J, Xu L, Wang L, Yang J (2021) Sissos: intervention of Knowl Inf Syst 63(1):199–231 tabular data and its applications. Appl Intell:1–15 29. Hair Jr J F, Sarstedt M (2021) Data, measurement, and causal 9. Pearl J, Mackenzie D (2018) The book of why: the new science of inferences in machine learning: opportunities and challenges for cause and effect. Basic Books marketing. J Mark Theory Pract:1–13 10. Venzke I (2018) What if? counterfactual (hi) stories of interna- 30. Zucker J, Paneri K, Mohammad-Taheri S, Bhargava S, Kolambkar tional law. Asian J Int Law 8(2):403–431 P, Bakker C, Teuton J, Hoyt CT, Oxford K, Ness R et al (2021) 11. Pesaran MH, Smith RP (2016) Counterfactual analysis in Leveraging structured biological knowledge for counterfactual macroeconometrics: An empirical investigation into the effects of inference: A case study of viral pathogenesis. IEEE Trans Big quantitative easing. Res Econ 70(2):262–280 Data 7(1):25–37 12. Atan O, Zame WR, Feng Q, van der Schaar M (2019) 31. Truong D (2021) Using causal machine learning for predicting the Constructing effective personalized policies using counterfactual risk of flight delays in air transportation. J Air Transport Manag inference from biased data sets with many features. Mach Learn 91:101993 108(6):945–970 32. Kumar V, Choudhary A, Cho E (2020) Data augmentation using 13. Major D, Lenis D, Wimmer M, Sluiter G, Berg A, Buhler ¨ K (2020) pre-trained transformer models. arXiv:2003.02245 Interpreting medical image classifiers by optimization based 33. Wu X, Lv S, Zang L, Han J, Hu S (2019) Conditional counterfactual impact analysis. In: 2020 IEEE 17th International bert contextual augmentation. In: International Conference on Symposium on Biomedical Imaging (ISBI). IEEE, pp 1096–1100 Computational Science. Springer, pp 84–95 14. Castro DC, Walker I, Glocker B (2020) Causality matters in 34. Qin L, Bosselut A, Holtzman A, Bhagavatula C, Clark E, Choi medical imaging. 
Nat Commun 11(1):1–10 Y (2019) Counterfactual story reasoning and generation. In: 15. Hao Z, Zhang H, Cai R, Wen W, Li Z (2015) Causal discovery on Proceedings of the 2019 Conference on Empirical Methods in high dimensional data. Appl Intell 42(3):594–607 Natural Language Processing and the 9th International Joint 16. Qin L, Shwartz V, West P, Bhagavatula C, Hwang JD, Le Bras Conference on Natural Language Processing (EMNLP-IJCNLP), R, Bosselut A, Choi Y (2020) Backpropagation-based decoding pp 5043–5053 for unsupervised counterfactual and abductive reasoning. In: 35. Qian C, Feng F, Wen L, Ma C, Xie P (2021) Counterfactual Proceedings of the 2020 Conference on Empirical Methods in inference for text classification debiasing. ACL-IJCNLP Natural Language Processing (EMNLP), pp 794–805 36. Dawid AP (2000) Causal inference without counterfactuals. J 17. Nguyen T-L, Collins GS, Landais P, Le Manach Y (2020) Amer Stat Assoc 95(450):407–424 Counterfactual clinical prediction models could help to infer 37. Holland PW (1986) Statistics and causal inference. J Amer Stat individualised treatment effects in randomised controlled trials–an Assoc 81(396):945–960 illustration with the international stroke trial. J Clin Epidemiol 38. Rubin DB (1980) Randomization analysis of experimental data: 18. Niu Y, Tang K, Zhang H, Lu Z, Hua X-S, Wen J-R (2021) The fisher randomization test comment. J Amer Stat Assoc Counterfactual vqa: A cause-effect look at language bias. In: 75(371):591–593 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12700–12710 12974 C. Wang et al. 39. Pearl J (2009) Causality. Cambridge university press 45. Geradin ´ M, Rixen DJ (2014) Mechanical vibrations: theory and 40. Pearl J, Glymour M, Jewell NP (2016) Causal inference in application to structural dynamics. Wiley statistics: A primer. Wiley 46. Beckers S, Eberhardt F, Halpern JY (2020) Approximate causal 41. Humar J (2012) Dynamics of structures. CRC press abstractions. 
In: Uncertainty in Artificial Intelligence. PMLR, 42. Rubin DB (1974) Estimating causal effects of treatments in ran- pp 606–615 domized and nonrandomized studies. J Educ Psychol 66(5):688 47. Mohimont L, Chemchem A, Alin F, Krajecki M, Steffenel LA 43. Imbens GW, Rubin DB (1997) Bayesian inference for causal (2021) Convolutional neural networks and temporal cnns for effects in randomized experiments with noncompliance. Ann covid-19 forecasting in france. Appl Intell:1–26 Stat:305–327 44. Heckman JJ (2010) Building bridges between structural and program evaluation approaches to evaluating policy. J Econ Publisher’s note Springer Nature remains neutral with regard to Literature 48(2):356–98 jurisdictional claims in published maps and institutional affiliations. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Applied Intelligence (Dordrecht, Netherlands) Pubmed Central

Rethinking the framework constructed by counterfactual functional model

Applied Intelligence (Dordrecht, Netherlands) , Volume 52 (11) – Feb 17, 2022


Publisher: Pubmed Central
Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022
ISSN: 0924-669X
eISSN: 1573-7497
DOI: 10.1007/s10489-022-03161-8

Abstract

The causal inference represented by counterfactual inference technology breathes new life into the current field of artificial intelligence. Although the fusion of causal inference and artificial intelligence performs excellently in many applications, some theoretical justifications have not been well resolved. In this paper, we focus on two fundamental issues in causal inference: the probabilistic evaluation of counterfactual queries and the assumptions used to evaluate causal effects. Both of these issues are closely related to counterfactual inference tasks. Among them, counterfactual queries focus on the outcome of the inference task, and the assumptions provide the preconditions for performing the inference task. A counterfactual query asks what kind of causality would arise if we artificially applied conditions contrary to the facts. In general, to obtain a unique solution, the evaluation of counterfactual queries requires the assistance of a functional model. We analyze the limitations of the original functional model when evaluating a specific query and find that the model arrives at ambiguous conclusions when the unique probability solution is 0. In the task of estimating causal effects, the experiments are conducted under some strong assumptions, such as treatment-unit additivity. However, such assumptions often cannot be satisfied in real-world tasks, and there is also a lack of scientific representation of the assumptions themselves. To alleviate this problem, we propose a mild version of the treatment-unit additivity assumption, called M-TUA, based on the damped vibration equation in physics. M-TUA reduces the strength of the constraints in the original assumptions with a reasonable formal expression.

Keywords Causal effect · Counterfactual approach · Functional model · Treatment-unit additivity assumption

Chao Wang, cwang17@fudan.edu.cn
Linfang Liu, liulf19@fudan.edu.cn
Shichao Sun, bruce.sun@connect.polyu.hk
Wei Wang, wangwei1@fudan.edu.cn
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China
Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China

1 Introduction

Counterfactual inference, as an indispensable method of causal inference, helps create human self-awareness and imbue life experiences with meaning, which is embodied when we modify a factual prior event and then evaluate the consequences of that change [1]. In the classic Rubin causal model (RCM), counterfactual results usually refer to unobserved potential outcomes [2]. A typical representative application is the counterfactual query (CQ) [3]. A counterfactual query is a question of what kind of causality would arise if we artificially adopted conditions contrary to the facts. Formally, the evaluation of CQs can be expressed as "If C happened, would B have occurred?", where C is the counterfactual antecedent.

CQs embody our reflections on what has already happened in the real world. For example, in Fig. 1, data released by Johns Hopkins University (JHU) shows that as of August 19, 2021, EST, the cumulative number of confirmed cases of COVID-19 (coronavirus disease 2019) in the United States amounted to 37,155,209 cases and the cumulative number of deaths amounted to 624,253 cases. The data also shows that the current cumulative number of confirmed cases in the United States accounts for about 17.75% of the more than 200 million confirmed cases worldwide; the cumulative number of deaths in the United States accounts for about 14.20% of the more than 4.3 million deaths worldwide. In the face of the onslaught of the epidemic, one might ponder the following query: If the U.S. had taken decisive measures, would the number of confirmed cases have been effectively controlled instead of spreading as wildly as it is now?

1 https://coronavirus.jhu.edu/map.html

Fig. 1 COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at JHU

Originally, most studies on counterfactual inferences (such as the query above) focused on the field of philosophy. Philosophers establish the form of a logical relationship constituting a logical world, which is consistent with the counterfactual antecedent and must be the closest to the real world (for the convenience of description, we call it the closest world approach) [4]. Further, Ginsberg [5] applies similar counterfactual logic, based on the closest world approach, to analyze problems of AI tasks. However, the disadvantage of the closest world approach is that it lacks constraints on closeness measures.

Regarding the above issue, Balke and Pearl [3] are committed to explaining the closest world approach. Specifically, they suggest turning a CQ into a probability problem, namely the probabilistic evaluation of counterfactual queries (PECQs). In other words, PECQs focus more on the probability of an event occurring in a specific CQ, rather than just outputting "True" or "False" (or "Yes" or "No", etc.) for this query. PECQs motivate us to deeply rethink counterfactual problems in many AI applications. For example, we know that COVID-19 has caused economic losses and increased unemployment in the United States [6]. An important reason is that the government has not dealt with the epidemic promptly. Based on the facts that have already occurred, we may reflect on the following question, CQ1: If the government had issued effective policies in time to control the spread of COVID-19, would the unemployment rate in the United States still have risen?

2 https://www.nytimes.com/2020/05/28/business/unemployment-stock-market-coronavirus.html

Note that, in CQ1, there is a clear causal relationship, that is, COVID-19 has caused the unemployment rate in the United States to rise. Therefore, in response to CQ1, an essential task is to be able to evaluate the degree of belief in the counterfactual consequence (i.e., probability evaluation) after considering the facts that have already happened. In other words, it is equivalent to evaluating the probability of a potential (or counterfactual) outcome given the antecedent. Moreover, in CQ1, it is a fact that COVID-19 swept the world and caused the unemployment rate in the United States to rise. Hence, we should focus on analyzing the probability that the unemployment rate in the United States would have risen had there been no COVID-19. This undoubtedly influences government decision-making. Therefore, evaluating counterfactual queries like these has far-reaching significance for practical application.

With the widespread application of causal inference in the field of AI [7, 8], the current popular method is to adopt the functional model (FM) [9] for inference. FM takes a CQ as an input and finally outputs the probability evaluation of the CQ by combining prior knowledge and internal inference mechanisms. The evaluation of CQs has benefited many research fields and tasks, such as determining the liable party [10], marketing and economics [11], personalized policies [12], medical imaging analysis [13, 14], Bayesian networks [7], high dimensional data analysis [15], abductive reasoning [16], the intervention of tabular data [8], epidemiology [17], natural language processing (NLP) [18, 19] and graph neural networks (GNN) [20, 21]. In particular, FM can provide powerful interpretability for machine learning model decisions [22–25], which is one of the issues of greatest concern in the Artificial Intelligence (AI) community today.

1.1 Motivation

Judea Pearl discusses the limitations of the current machine learning theory and points out that current machine learning models are difficult to use as the basis for strong AI [9].
An important reason is that the current machine learning approach is almost entirely in the form of statistics or a "black box", which brings serious theoretical limitations to its performance [26]. For example, it is difficult for current smart devices to make counterfactual inferences. A large number of researchers are increasingly interested in combining counterfactual inference with AI [27, 28], such as explaining consumer behavior [29], the study of viral pathogenesis [30], and predicting the risk of flight delays [31]. In addition, counterfactual inference has shown advantages in improving the robustness of the model [32, 33] and optimizing text generation tasks [34] and classification tasks [35]. Although counterfactual inference has set off a new upsurge in the field of machine learning, a deeper understanding of the existing models and methods is notably lacking.

In our work, we focus on two basic aspects of the task of counterfactual inference. The first aspect concerns the counterfactual framework and is related to the inference results of the model. The second aspect concerns the preconditions for counterfactual inference tasks. Specifically, the first aspect is based on a type of counterfactual approach (e.g., the functional model) in causal science. We analyze the credibility of some results obtained by using this counterfactual approach to evaluate CQs. The other aspect we are concerned about is the assumptions used in causal inference to estimate causal effects. Causal effects depend on the potential results; however, we cannot observe all the potential outcomes of the experimental individual simultaneously (unobservable outcomes are usually called counterfactual outcomes). Therefore, some assumptions are often needed when estimating the causal effect. We pay attention to a commonly used strong assumption (i.e., the Treatment-Unit Additivity (TUA) assumption) and weaken it using some mathematical methods. Next, we specify the above two aspects as the following two issues (we use a real inference task, i.e., PECQs, as an example to explain the relationship between the two issues in Fig. 2).

1) In the CQs tasks, although the output result of the FM is unique, this unique solution is sometimes ambiguous. For example, in the task of evaluating the probability solution of CQs by FM, if the model predicts that the probability of a CQ is 0, the result may be ambiguous. In other words, although the probability value predicted by the model in this situation is 0, it is still possible that the event will happen. Intuitively, the existence of statistical uncertainty may cause ambiguity of the inference results. Dawid [36] proves that even if the statistical uncertainty can be eliminated, the inference may also produce ambiguity. Therefore, when ambiguity cannot be eliminated, we must consider what may cause ambiguity and how to avoid trouble caused by ambiguity.

2) The assumptions used to estimate causal effects in the data are strong and are often violated in real-world applications. Some strong assumptions tend to constrain individuals (e.g., individuals $u$ in an experimental population $U$) to obtain the ideal environment in an experiment. This neglects to obtain the equivalent form of the assumption directly from the abstract level (e.g., the experimental population $U$, the dataset itself). In some practical applications of causal inference, a challenging task requires researchers to make causal inferences in the absence of data. For example, in RCM, the causal effect is described as $O_{t,u} - O_{c,u}$, where $O_{t,u}$ ($O_{c,u}$) is the result variable $O$ displayed by subject (or individual) $u$ under the treatment ($t$) group or control ($c$) group. Unfortunately, we have no idea how to obtain $O_{t,u}$ and $O_{c,u}$ at the same time no matter how large the dataset is. This situation is also called the fundamental problem of causal inference (FPCI) [37].

Owing to the existence of FPCI, we can only apply additional assumptions on the data distribution to avoid it. Some typical assumptions are shown below:

– Stable Unit Treatment Value Assumption (SUTVA) [38], where each $O$ of $u$ is treated as an independent event;

– Assumption of Homogeneity (AOH) [39], which requires that for any individuals $u_i$ and $u_{j, j \neq i}$, and any intervention method $t'$, $O_{t',u_i} = O_{t',u_{j, j \neq i}}$ always holds;

– Treatment-Unit Additivity (TUA) [36], which some studies also call the assumption of constant effect (AOCE). The TUA assumption imposes the equivalence relationship that, under a defined intervention method, the causal effect is the same for every individual:

$$\mathrm{ICE}(u_1) = \mathrm{ICE}(u_2) = \cdots = \mathrm{ICE}(u_{|U|}), \tag{1}$$

where $\mathrm{ICE}(u_i)$ denotes the individual causal effect of $u_i \in U$, and $|U|$ is the cardinality of set $U$. Apparently, AOH is stronger than TUA. Therefore, in the second aspect, we focus on TUA, aiming to obtain a milder TUA assumption.

Fig. 2 The framework of the probabilistic evaluation of counterfactual queries: these two issues spread over the same inference task, and these two issues are independent of each other. However, for the same counterfactual inference task, the plausibility of the output affects the user's confidence, and the strong assumptions premise determines the scope of the task

To address the two issues mentioned above, our contributions in this paper are three-fold:

– We focus on a basic problem in the FM and primarily analyze the evaluation method of [3]. We find that FM sometimes produces ambiguous output results for some CQs, even if the final output result is unique. One of the important reasons is that FM needs to calculate the intersection between two sets to get the final result when estimating the output probability. However, the intersection may be an empty set ∅ when estimating some special CQs.

– We provide a mild TUA assumption, called M-TUA, which incorporates the idea of the damped vibration equation.

– We prove theoretically that M-TUA can be applied to large datasets, and give a reasonable and rigorous mathematical description of this theory (see Theorem 1). Especially for some complex internal principles, we do not choose to use the "black box" method but hope to use M-TUA to try to reveal the complex internal relationship between certain parameters and assumptions and make some reasonable description and explanation.

1.2 Paper organization

The rest of this paper is organized as follows: In Section 2, we give the mathematical notation and their descriptions. In Section 3, we give a visualization of the FM inference mechanism and analyze the pitfalls of this inference mechanism based on concrete examples. In Section 4 and Section 5, we give a mild version of the TUA assumption (i.e., M-TUA), theoretically prove the equivalent representation of the TUA assumption in the vector space, and analyze the rationality and limitations of M-TUA. The comparison between TUA and M-TUA is in Section 6. Section 7 summarizes this paper.

2 Notation

In this section, the key mathematical notations and their descriptions are listed in Table 1.

Table 1 Key Notations and Descriptions

Notation | Description
∅ | the empty set
$R = \{R_1, ..., R_n\}$ | the set of variables $R_i$
$r_i / \hat{r}_i$ | the value of $R_i$ in the real/counterfactual world
$\{t, c\}$ | $t$ and $c$ represent two different treatments (or intervention variables)
$U = U_t \cup U_c$ | a population with a huge number of units $u_i$
$U_t = \{u^t_1, ..., u^t_k\}$ | the set of some units receiving treatment $t$
$U_c = \{u^c_1, ..., u^c_{k'}\}$ | the set of other units receiving treatment $c$, i.e., $U_t \cap U_c = \emptyset$
$\mathbb{R}, \mathbb{Z}^{+}, \mathbb{C}$ | the set of real numbers, positive integers, and complex numbers
$\bar{A} \in \mathbb{C}$ | the complex conjugate of $A \in \mathbb{C}$
$|S|$ | the cardinality of finite set $S$, e.g., $|R| = n$, $|U_t| = k$ and $|U_c| = k'$
$\{\cdot\}_n$ | the finite set containing $n$ elements, e.g., $R = \{R_1, ..., R_n\} = \{R_i\}_n$
$c_n$ | all unknown factors that may influence $\beta$ in the inference mechanism of FM
$\Pr(c_n)$ | the probability distribution of $c_n$ in the inference mechanism of FM
$L_{ao}$ | the Euclidean distance from point $a$ to point $o$ in the coordinate system
$p(x) \triangleq q(x)$ | means that function $p(x)$ is equivalent to function $q(x)$

3 Inference mechanism and result credibility analysis of FM

In this section, we first introduce the definition of PECQs [3], which is a probabilistic description of the counterfactual query. Second, we review the inference mechanism of FM in Fig. 3. Finally, we exhaustively analyze the inference mechanism in FM through some examples and find that when the probabilistic evaluation of a CQ is 0, the result gives unreliable guidance for decision-making.

3.1 Definition of PECQs

Definition 1 (Probabilistic Evaluation of Counterfactual Queries, PECQs [3]) The core idea of PECQs is to transform a CQ into a probabilistic evaluation problem, which can be formalized as:

$$\Pr(\hat{\beta}_1 \mid \hat{\alpha}_1)\big|_{(\alpha_0, \beta_0)}, \tag{2}$$

where "$|(\alpha_0, \beta_0)$" represents the evidence (or observed data) we have observed in the real world, and the value of the evidence can be considered as a conditional probability (e.g., $(\alpha_0, \beta_0) \triangleq \Pr(\beta_0 \mid \alpha_0) = p_0$). $\Pr(\hat{\beta}_1 \mid \hat{\alpha}_1)$ is the counterfactual outcome that we need to infer based on evidence. The probabilistic evaluation of (2) can be obtained by the inference mechanism of FM [3] (i.e., Fig. 3).

Example 1 CQ1 can be translated into (2) for evaluation. Specifically, for $(\alpha_0, \beta_0)$, we observe that there is an ineffective policy (i.e., $\alpha_0$) that causes the unemployment rate to rise (i.e., $\beta_0$); $\Pr(\hat{\beta}_1 \mid \hat{\alpha}_1)$ indicates the probability that the unemployment rate falls (i.e., $\hat{\beta}_1$) if we implement effective policies (i.e., $\hat{\alpha}_1$).

3.2 Inference mechanism of FM

The inference mechanism of FM is shown in Fig. 3. More detailed information on the inference mechanism of FM is elaborated upon in [3], and we will not repeat it in this section.

Fig. 3 The inference mechanism of FM when evaluating the CQ1

3.3 Analysis of the inference mechanism of FM

Although FM can output a unique solution for a CQ, we find that the results are not credible when the probability estimate of the FM output is 0. In other words, the output value of $\Pr(\cdot) = 0$ does not mean that the event will not occur. Next, we introduce some simple examples to reveal the untrustworthy guidance that this ambiguity may bring to decision-making.

Example 2 CQ2 [36]: Patient P has a headache. Will it help if P takes aspirin? The information we observe is that the current patient has a headache (denoted as $\beta_0$) and is not taking aspirin (denoted as $\alpha_0$). Therefore, $\Pr(\hat{\beta}_1 \mid \hat{\alpha}_1)|_{(\alpha_0, \beta_0)}$ is equivalent to the probability evaluation of CQ2 (a query of the form of CQ2 can also be called the effects of causes [36]). However, consider a situation (denoted as the variant of CQ2, which is abbreviated as V-CQ2) where the patient still does not take aspirin. What is the probability of the headache disappearing? It is equivalent to evaluating $\Pr(\hat{\beta}_1 \mid \alpha_0)|_{(\alpha_0, \beta_0)}$. If we still choose to use FM to estimate this query, we first determine the value of $n_{(\alpha_0, \beta_0)} \in \{1, 2\}$ ($n_{(\alpha_0, \beta_0)}$ refers to the value of $n$, which is determined according to $(\alpha_0, \beta_0)$), and then we determine the new value of $n_{(\hat{\alpha}_1, \hat{\beta}_1)} \in \{3, 4\}$ ($n_{(\hat{\alpha}_1, \hat{\beta}_1)}$ refers to the value of $n$, which is determined according to $(\hat{\alpha}_1, \hat{\beta}_1)$). Finally, the evaluation of $\Pr(\hat{\beta}_1 \mid \alpha_0)|_{(\alpha_0, \beta_0)}$ is the sum of $\Pr(c_{3,\beta})|_{(\alpha_0, \beta_0)}$ and $\Pr(c_{4,\beta})|_{(\alpha_0, \beta_0)}$, i.e.,

$$\Pr(\hat{\beta}_1 \mid \alpha_0)\big|_{(\alpha_0, \beta_0)} = \Pr(c_{3,\beta})\big|_{(\alpha_0, \beta_0)} + \Pr(c_{4,\beta})\big|_{(\alpha_0, \beta_0)} = 0 + 0 = 0 \tag{3}$$

Why is the evaluation of V-CQ2 equal to 0, and what does this mean?

1) When using FM to estimate the results of CQ1 and V-CQ2, a key step is to calculate the intersection of $N_{(\alpha_0, \beta_0)}$ and $N_{(\hat{\alpha}_1, \hat{\beta}_1)}$, where $N_{(\alpha_0, \beta_0)} = \{n_{(\alpha_0, \beta_0)}\}$ refers to the set of $n_{(\alpha_0, \beta_0)}$, which is determined by the observed evidence in the real world, and $N_{(\hat{\alpha}_1, \hat{\beta}_1)} = \{n_{(\hat{\alpha}_1, \hat{\beta}_1)}\}$ refers to the set of $n_{(\hat{\alpha}_1, \hat{\beta}_1)}$, which is updated in the counterfactual world. For example, $N_{(\alpha_0, \beta_0)} = \{1, 2\}$ and $N_{(\hat{\alpha}_1, \hat{\beta}_1)} = \{2, 4\}$ in CQ1 can be derived from Fig. 3. Therefore, the probabilistic evaluation of CQ1 is uniquely determined by $N_{(\alpha_0, \beta_0)} \cap N_{(\hat{\alpha}_1, \hat{\beta}_1)}$, i.e., $n = 2$. Hence, the probabilistic evaluation of CQ1 is $\Pr(c_{2,\beta})|_{(\alpha_0, \beta_0)}$.

However, unlike CQ1, the $N_{(\hat{\alpha}_1, \hat{\beta}_1)}$ in V-CQ2 is $\{3, 4\}$, which causes the probability evaluation of V-CQ2 to be 0 (i.e., (3), because $N_{(\alpha_0, \beta_0)} \cap N_{(\hat{\alpha}_1, \hat{\beta}_1)} = \emptyset$). This probability estimate is not completely credible. The reason is that we cannot be sure whether the output results are derived from real predictive inferences or from the processing of some special counterfactual queries (e.g., V-CQ2) by the inference mechanism. Therefore, when the probabilistic evaluation of a CQ is 0, the decision based on this result is not credible, that is, the result is ambiguous.

2) In addition, in V-CQ2, $\alpha_0$ does not constitute a counterfactual condition; it still belongs to the assumptions in the real world. In this case, $\hat{\beta}_1$ is also known evidence in the real world, i.e., $\hat{\beta}_1 = \beta_1$. Hence, we have

$$\Pr(\hat{\beta}_1 \mid \alpha_0)\big|_{(\alpha_0, \beta_0)} = \Pr(\beta_1 \mid \alpha_0)\big|_{(\alpha_0, \beta_0)} = 1 - \Pr(\beta_0 \mid \alpha_0) = 1 - p_0, \tag{4}$$

which contradicts the result of (3). This shows that $\alpha_0$ does not constitute an intervention that affects the outcome of the counterfactual world. Therefore, the estimated value of (3) obtained by FM violates the counterfactual consistency rule [40].

The impact of ambiguity in inference results on decision-making. We discuss the impact of the unique solution on decision-making through the following two examples:

Example 3 When the predicted probability of an earthquake occurring at a certain location is 0.8 or 0.9, the difference matters little for decision-making. However, when the probability of an earthquake is estimated to be 0 and unique, it is essential for us to verify its rationality, because this may directly determine whether the corresponding deployment is needed. In other words, how confident are we that there will be no earthquake based on the prediction of FM? Therefore, the fact that there exist queries that cannot be answered using FM does not mean that the evaluation of these queries is meaningless.

Example 4 CQ3: The murderer assassinated President Kennedy; if the assassination had failed, would Kennedy still be alive? Formally, if the shot hits the target ($\alpha_0$) and there is a high probability ($p_0$) that the hit target will die ($\beta_0$), then we estimate $\Pr(\hat{\beta}_1 \mid \alpha_0)|_{(\alpha_0, \beta_0)} = ?$ We will eventually get $\Pr(\hat{\beta}_1 \mid \alpha_0)|_{(\alpha_0, \beta_0)} = 0$ using FM (the prediction process is similar to predicting V-CQ2). Obviously, if the assassination failed (that is, the shot was successfully fired but did not cause the target to die) and Kennedy is still alive, this situation may affect the assassin's further decisions and deployment. For Kennedy's team, this may affect the deployment of security measures for similar activities. Therefore, when the estimated result of a CQ is 0, the result cannot provide credible and sufficient opinions for decision-making.

A straightforward solution. Through the above series of analyses, it is not difficult to find that when the probability of a CQ is evaluated as 0, further verification and analysis are indispensable, because the inference mechanism of FM itself will inevitably introduce ambiguity for the evaluation result of $\Pr(\cdot) = 0$. Since the evaluation of the FM determines the final output solution through the intersection between two sets, there is a certain probability that the intersection is an empty set.

A straightforward solution is that if an empty set appears in the estimation process, we need to stop using the FM for estimation, because the above analysis shows that we cannot define the empty set as $\Pr(\cdot) = 0$. Therefore, when this happens, we should estimate the output probability in the real world instead of the counterfactual world to avoid the appearance of ambiguous results. In this case, $\Pr(\cdot) = 0$ plays a role in prompting a replacement prediction strategy. Therefore, to comply with the counterfactual consistency rule, we must use the prior probability (4) (i.e., $1 - p_0$) to replace $\Pr(\cdot) = 0$.

4 The mild treatment-unit additivity assumption

For the second reflection in Fig. 2, in this section, we analyze the TUA assumption, which is often used as a strong prerequisite for estimating causal effects in data. We first review the potential outcome framework (Section 4.1), the individual causal effect (Section 4.2), and the definition of TUA (Section 4.3), and provide an equivalent description of TUA utilizing vectorization (Section 4.4). Second, based on the idea of the Damped Vibration Equation (DVE) [41], we propose a mild TUA assumption (called M-TUA) (Section 4.5). M-TUA not only weakens the original assumption but also has good mathematical properties and interpretability.

Our main conclusion in this section is presented based on two lemmas, and the specific proof process is mainly divided into the following two steps. First, we describe the relationship between TUA and ICE in the counterfac-

cannot observe the potential outcomes (i.e., counterfactual outcomes) in the counterfactual world (e.g., $O_{c,u}$ in Example 6). This situation where all potential outcomes of units cannot be observed simultaneously is also called FPCI, as we mentioned earlier. Formally, for binary intervention variables, let $d \in \{t = 1, c = 0\}$; the observed outcome $Y_o$ and the potential outcome $O_{d,u_i}$ can be expressed by the following formula:

$$Y_o = d \cdot O_{1,u_i} + (1 - d) \cdot O_{0,u_i} = \begin{cases} O_{1,u_i}, & \text{if } d = 1 \\ O_{0,u_i}, & \text{if } d = 0, \end{cases} \tag{5}$$

where $O_{d,u_i} \in \{O_{t,u_i}, O_{c,u_i}\}$ represents the potential outcome of treatment $d \in \{t, c\}$ on unit $u_i \in U$.

For a more intuitive description, we focus on the following 2-dimensional Gaussian distribution model

$$G(O_{t,u_i}, O_{c,u_i}) \sim N(\mu_t, \mu_c, \sigma_t, \sigma_c, \rho). \tag{6}$$

Specifically, we introduce the following example [36] and use it as a basic background for subsequent analysis.

Example 5 The pairs $(O_{t,u_i}, O_{c,u_i})$ are independent and identically distributed (i.i.d.), each with the 2-dimensional Gaussian distribution with means $(\mu_t, \mu_c)$, $\sigma_c = \sigma_t = \sigma_o$ (for simplicity of calculation, we assume that the distribution has a common variance $\sigma_o$), and correlation $\rho \in (0, 1)$. Furthermore, we use the mixed model to describe the specific structure, i.e.,

$$O_{d,u_i} \triangleq \mu_d + \tau_{u_i} + \lambda_{d,u_i} \quad \text{s.t.} \quad \begin{cases} \mu_d = \mu_t \text{ or } \mu_c \\ \tau_{u_i} \sim N(0, \sigma_\tau) = N(0, \rho\sigma_o) \\ \lambda_{d,u_i} \sim N(0, \sigma_\lambda) = N(0, (1 - \rho)\sigma_o) \end{cases} \tag{7}$$

where $\mu_d$ indicates the treatment effects applicable to all units. $\tau_{u_i}$ represents the effect on unit $u_i \in U$, called the unit effect, and this effect applies to all units, i.e., $\tau_{u_i} = \tau_{u_{j, j \neq i}}$. $\lambda_{d,u_i}$ stands for the effect between treatment and unit, called the unit-treatment interaction.
This internal mechanism reveals tual approach, and we explore the equivalence of ICE and the change from one treatment to another for unit u . τ and i u residual causal effect (RCE) in the TUA assumption (i.e., λ are independent random variables. d,u Lemma 1). Second, we innovatively introduce the defini- tions of positive effects and negative effects, and on this 4.2 Individual causal effect basis, we obtain the equivalent form of TUA in vector space by Lemma 2. Dawid [36] adopts the model of (7) to analyze the pros and cons of the counterfactual based on the idea of decision- 4.1 Potential outcome framework making and mentions an assumption that is often used in the counterfactual analysis, which is called TUA (Definition 2). According to the viewpoint of Rubin [42], there is an As the TUA assumption has strong constraints on data, it intervention in the causal inference, which means that will lead to a reduction in the practicability and scope of there is no cause and effect without intervention, and one use of TUA. Hence, in this paper, another goal of a study intervention state corresponds to a potential outcome. When is to design a mild TUA assumption that constrains the the intervention state is realized, we can only observe dataset itself or the experimental population as a whole, the potential outcomes in the realization state, that is, we rather than a strong constraint on each individual, as in 12964 C. Wang et al. the traditional TUA assumption. In the rest of this section, For example, we can use (11) to estimate the ICE of the we try to optimize TUA to make it have a broader scope of new unit u . Because inferring (u ) is equivalent to new new application in the context of large data. inferring  (u ) and 2(1−ρ)σ under (11). Unfortunately, ACE i o Specifically, we first analyze the individual and average we cannot accurately determine the value of 2(1 − ρ)σ . causal effect based on (7). 
In an experimental study, the individual causal effect (ICE) is the basic object (or Example 6 (Calculation of causal effect parameters (i.e., a basic measure). It describes the differences in various ICE, ACE) in the ideal case). In Table 2, we construct potential outcomes of a given unit u ∈ U under all possible a simple example to demonstrate the calculation of the treatments d ∈{t, c}. Generally, for one unit u ∈ U,the causal effect parameters, such as ICE, ACE. Suppose a ICE can be represented as population contains four subjects, labeled as u , u , u and 1 2 3 u , respectively. For each u , the potential outcomes in both 4 i (u )  O − O.(8) i t,u c,u i i intervention states are known (in reality only one potential For different tasks, the ICE can also have other forms of outcome can be observed). Where individuals 1 and 2 are in description, such as (u ) = log O /O . Therefore, the intervention group (i.e., the set of some units receiving i t,u c,u i i from a broader perspective, the subtraction in the definition treatment t) and individuals 3 and 4 are in the control group of ICE may not necessarily be a subtraction in R.Note (the set of some units receiving treatment c). that no matter which form is used, only one potential outcome can be observed [43]. Researchers usually do not According to Table 2, we can obtain: pay attention to ICE directly, but focus on the average value (u ) = E(O − O ) of the causal effect of all units, that is, ACE, also known ACE i t,u t,u i i as average treatment effect (ATE). ACE can be expressed = (30 + 0 + 10 + 0) = 10. (13) by the following formula, (u )  E((u )) = E O − O .(9) ACE i i t,u c,u Meanwhile, based on the information in Table 2, we can i i further obtain information on two other causal effect param- Apparently, in (7),  (u ) = μ − μ . 
ACE i t c eters, one is average treatment effect for the treated (ATT) and the other is average treatment effect for the control Limitations of the counterfactual approach focused on (ATC). Where, ICE We utilize the above Example 5 for our analysis. Specifically, according to (7)and (8), we have that, (u ) = E(O −O |d = t) = (30+0) = 15. (14) ATT i t,u t,u i i (u ) = O −O = (μ − μ ) + (λ − λ ) i t,u c,u t c t,u d,u i i i i (10) =  (u ) + (λ ), ACE i u and where (λ )  λ − λ is called residual causal u t,u d,u i i i (u ) = E(O − O |d = c) = (10 + 0) = 5. (15) ) ∼ ATC i t,u t,u effect (RCE) [36]. It is easy to verify that (λ u i i N (0, 2(1 − ρ)σ ). Thus, according to (7)-(9), we could Unfortunately, in the real world, the boldface numbers obtain the distribution of ICE as follows: (e.g., O , O )inTable 2 are not observable to us. The c,u t,u 2 3 (u ) ∼ N ( (u ), 2(1 − ρ)σ ). (11) i ACE i o reason is that the treatment received by subject u is d = t, However, in (11), 2(1 − ρ)σ cannot be inferred from we can not observe the potential outcome of u receiving o 2 observed data and has nothing to do with the size of the treatment d = c at the same time. Therefore, in the real data. Because even if the marginal distributions of O and world, the calculation and estimation of the causal effect t,u O are known, the joint distribution of random variables parameters require additional constraints (e.g., Treatment- c,u G(O ,O ) cannot be determined, and the marginal unit additivity assumption (Definition 2) to be imposed on t,u t,u i i distribution of the Gaussian distribution does not depend on the data. the parameter σ . Moreover, according to (7), we have Table 2 Causal effect parameters 2(1 − ρ)σ = 2σ ∈ (0, 2σ ), if ρ ∈ (0, 1) o λ o Subject O O O dO − O t,u c,u t,u t,u c,u . (12) i i i i i 2(1 − ρ)σ = 2σ ∈ (2σ , 4σ ), if ρ ∈ (−1, 0) o λ o o u 30 0 30 t 30 (12) indicates that different values of ρ determine different u 10 10 10 t 0 variances of the distribution of (u ). 
We can only get a u 10 00 c 10 range of σ , and a different ρ will lead to a different σ , λ λ u 10 10 10 c 0 which will cause a variety of uncertain results for reasoning. Rethinking the framework constructed by counterfactual functional model 12965 Table 4 Additional information about all u 4.3 Treatment-unit additivity Subject O O O − O t,u c,u t,u c,u i i i i In summary, the POF focuses on the inference of causal effects but does not explain the mechanism of influence u 13 ? ? between variables [44]. A computational bottleneck is the u ? 12.5 ? prediction of parameter ρ through the marginal distribution. u 10 ? ? Therefore, in the task of using the causal model for infer- u ?13 ? ence, additional constraints (e.g., Example 7) are usually u ?12 ? required to ensure that the inference result is obtained under mean 11.5 12.5 −1 this constraint. Example 7 Under the TUA, (u ) =  (u ) implies new ACE i 4.4 Equivalent form of TUA that ρ = 1. TUA assumes that the causal effect (u ) has the same effect Definition 2 (Treatment-Unit Additivity (TUA) [36]). on all units in U, e.g., (u ) =  (u ), i ∈[1, ..., |U |]. i ACE i The TUA assumption is to deal with the non-uniformity of Unfortunately, as a commonly used prerequisite, TUA is a data through a strong processing method. Specifically, TUA strong assumption, which cannot be tested on observable requirements that (u ) in (u )  O − O has the i i t,u c,u i i data and lacks a more transparent explanation in the real same effect on all units in U, e.g., (u ) = (u ) = ··· = 1 2 world [36]. This leads to some interesting questions worth (u ) =  (u ). ACE i |U | exploring, such as: – For applications of TUA, how to obtain a mild version TUA can be equivalently regarded as the assumption of constant effect (AOCE). For example, we can set (u ) = of the TUA assumption to make the TUA more broadly applicable? (u )= a specific constant (e.g.,  (u ). 
Generally j,j =i ACE i speaking, AOCE uses the average effect in the sample to – For interpretability of TUA, based on the TUA assump- tion (or a mild TUA), how to establish a formal expres- estimate the causal effect. Next, we will give a simple example to demonstrate the relation between TUA and ACE sion to describe the impact of the main factors inside the data on estimating ICE? and the application of TUA. To address these issues, next, we first provide an equivalent Example 8 Considering a fundamental problem of causal form of the TUA assumption under the 2-dimensional inference, let u be a patient. We want to know whether Gaussian distribution (i.e., Lemma 1). certain medication has a therapeutic effect on u . Suppose that the data about patient u isshowninTable 3. Lemma 1 If the data follows a Gaussian distribution as According to Table 3, we only know that O =13. Due t,u Example 5, then the TUA assumption has the following to the existence of FPCI, we cannot simultaneously observe equivalent form, i.e., the effects of u taking the medication and not taking the medication. Therefore, we rely on adding additional (u ) = (u ) = ··· = (u ) 1 2 q constraints (i.e., TUA) to estimate the value of O . c,u TUA Suppose we also have additional data (as shown in =⇒ lim (λ ) − (λ ) = 0. (17) d,u d,u i j,j =i Table 4), we can then use TUA assumption to infer the q→q values of O and O − O (i = 1, 2, 3, 4, 5). For c,u t,u c,u i i i example, according to Table 5 Assignment mechanism based on TUA assumption with (u ) =−1 ACE i (u ) = O − O =  (u ) =−1, (16) i t,u c,u ACE i i i Subject O O O − O t,u c,u t,u c,u i i i i we can obtain the following complete prediction data (see Table 5). u 13 14 −1 u 11.5 12.5 −1 u 10 11 −1 Table 3 The data of u ,where O and O − O are unknown 1 c,u t,u c,u 1 1 1 u 12 13 −1 Subject O O O − O t,u c,u t,u c,u u 11 12 −1 i i i i 5 u 13 ? ? mean 11.5 12.5 −1 1 12966 C. Wang et al. 
In (17), $u_i, u_{j,j\neq i} \in U$, $i, j \in [1, \ldots, q]$, $q = |U|$ is a sufficiently large positive integer, and $\Delta(\lambda_{d,u_i}) = \lambda_{t,u_i} - \lambda_{c,u_i}$.

Proof Given two units $u_i$ and $u_{j,j\neq i}$, according to (7) and (8), we have

$$\Delta(u_i) - \Delta(u_{j,j\neq i}) = (\lambda_{t,u_i} - \lambda_{c,u_i}) - (\lambda_{t,u_{j,j\neq i}} - \lambda_{c,u_{j,j\neq i}}) = \Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}}). \tag{18}$$

Hence, a reasonable idea based on (18) is to shift our attention from the constraint on $\lambda_{d,u_i}$ to a constraint on the RCE $\Delta(\lambda_{d,u_i})$. Note that the predicted average value of $O_{t,u_i} - O_{c,u_i}$ (denoted as $\overline{O}_{t,u_i} - \overline{O}_{c,u_i}$) will be closer to $E(O_{t,u_i} - O_{c,u_i})$ if the size of the data is large enough. Therefore, $\Delta_{ACE}(u_i)$ can be identified, from a large experiment, as $\overline{O}_{t,u_i} - \overline{O}_{c,u_i}$. This means that the impact of $\Delta(\lambda_{d,u})$ on the data may be related to the size of the data.

Given a group $U^{d} = \{u_1^{d}, u_2^{d}, \ldots, u_q^{d}\}$ containing $q$ units, where $u^{d}_{j,j\neq i}$ means that the unit $u_{j,j\neq i}$ receives treatment $d$, we can assign "treatment" through Randomized Controlled Trials (RCT) and collect all potential outcomes, i.e., $O_t = \{O_{t,u_{j,j\neq i}}\}_{k}$ and $O_c = \{O_{c,u_{j,j\neq i}}\}_{q-k}$. Suppose that $q$ is a large positive integer and let $E(O_{t,u_i} - O_{c,u_i}) = \overline{O}_{t,u_i} - \overline{O}_{c,u_i}$. We then have

$$\Delta_{ACE}(u_i) = \frac{1}{k}\sum_{j=1}^{k} O_{t,u_{j,j\neq i}} - \frac{1}{q-k}\sum_{j=k+1}^{q} O_{c,u_{j,j\neq i}} = \hat{\Delta} \quad \text{s.t.} \quad O_{t,u_{j,j\neq i}} \sim N(\mu_t, \sigma_o),\; O_{c,u_{j,j\neq i}} \sim N(\mu_c, \sigma_o), \tag{19}$$

where $\frac{1}{k}\sum_{j=1}^{k} O_{t,u_{j,j\neq i}}$ is the average of the responses of the $k$ units receiving treatment $t$, and $\frac{1}{q-k}\sum_{j=k+1}^{q} O_{c,u_{j,j\neq i}}$ is the average of the responses of the $q - k$ units receiving treatment $c$. $q$, $k$, and $q - k$ are all large numbers. Therefore, $\Delta_{ACE}(u_i) = \hat{\Delta}$ is estimable and close to the true value.

Next, we apply the TUA constraint to (18), which is equivalent to setting $\Delta(u_i) - \Delta(u_{j,j\neq i}) = 0$. According to (18), it is unnecessary to constrain every $\lambda_{d,u}$ to a fixed value if $q$ is large enough. The alternative is to consider the difference between two RCEs and formally characterize $\Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}})$ so that it gradually approaches 0 when $q$ is large. Therefore, in terms of the RCE, we obtain the equivalent form of the TUA assumption,

$$\lim_{q \to \infty} \left( \Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \tag{20}$$

which proves the lemma.

4.5 The properties of $\Delta(\lambda_{d,u})$ in 2-dimensional vector space

Further, we analyze the properties of TUA in 2-dimensional vector space. The above analysis shows that both TUA and the equivalent form given by Lemma 1 are only numerical constraints (e.g., $\Delta(u_i) = \Delta(u_{j,j\neq i})$, $\Delta(\lambda_{d,u_1}) - \Delta(\lambda_{d,u_2})$). In other words, neither the TUA assumption itself nor Lemma 1 reflects their internal influence on the data. To explore the internal influence of TUA on the data, our core idea is to transform the original TUA assumption from constraints on values (i.e., scalars) into constraints on vectors. Specifically, we analyze the TUA assumption by vectorizing $\lambda_{d,u}$ (i.e., Lemma 2) and introducing a definition of the positive and negative effects of $\lambda_{d,u_i}$ on the data (i.e., Definition 3).

Lemma 2 For any $\lambda_{d,u_i}$, let $\Delta_{+}(\lambda_{d,u_i})$ denote the positive effect of $\lambda_{d,u_i}$ on the data, and let $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ denote the negative effect of $\lambda_{d,u_{j,j\neq i}}$ on the data. Then the TUA assumption has the following equivalent form in the vector space, i.e.,

$$\underbrace{\Delta(u_1) = \Delta(u_2) = \cdots = \Delta(u_q)}_{\text{TUA}} \Longrightarrow \lim_{q \to \infty} \left( \sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) - \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \tag{21}$$

where $q^{+} + q^{-} = q$.

Before proving Lemma 2, we need to introduce the definitions of the vectorization of $\lambda_{d,u}$, positive effects, and negative effects.

Definition 3 (The vectorization of $\lambda_{d,u}$.) Let $\lambda_{d,u_i} = L_{ao}$ represent the distance from a certain point $a$ to the point $o$ in the coordinate system (e.g., in Fig. 4a, $L_{ao}$ represents $\lambda_{d,u_i}$ and $L_{bo}$ represents $\lambda_{d,u_{j,j\neq i}}$). The vectorization of $\lambda_{d,u}$ refers to assigning the characteristics of a vector to $\lambda_{d,u}$ to describe the possible positive or negative effect of $\lambda_{d,u}$ on the data. As shown in Fig. 4b, for each $\lambda_{d,u_i}$,

$$\vec{\Delta}(\lambda_{d,u_i}) \triangleq \begin{cases} \text{positive effect}, & \text{if } \vec{\Delta}(\lambda_{d,u_i}) \text{ is in the first or second quadrant} \\ \text{negative effect}, & \text{if } \vec{\Delta}(\lambda_{d,u_i}) \text{ is in the third or fourth quadrant.} \end{cases} \tag{22}$$

As shown in Fig. 4-(c) and (d), for $\lambda_{d,u}$,

$$\vec{\Delta}(\lambda_{d,u}) \triangleq \begin{cases} \text{positive effect}, & \text{if } \vec{\Delta}(\lambda_{d,u}) \text{ are in the first or second quadrant} \\ \text{negative effect}, & \text{if } \vec{\Delta}(\lambda_{d,u}) \text{ are in the third or fourth quadrant.} \end{cases} \tag{23}$$

There is a one-to-one correspondence between positive effects and negative effects. In other words, if a positive effect "+" exists, there must be a negative effect "-" corresponding to it.

Fig. 4 Figures (a)-(d) describe the equivalent representation of the TUA in the vector space by vectorizing $\lambda_{d,u_i}$. (a) is the geometric description of the traditional TUA assumption in the coordinate system. According to Lemma 1, $\Delta(u_i) = \Delta(u_{j,j\neq i})$ can be regarded as $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$. Hence, in the 2-dimensional plane, we can use the Euclidean distance $L_{ao} = L_{bo}$ to describe $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$; (b) describes the vectorization of $\Delta(\lambda_{d,u_i})$. According to the definitions of positive (red) and negative (blue) effects and the TUA assumption, we have $|\Delta_{+}(\lambda_{d,u_i})| = |\Delta_{-}(\lambda_{d,u_{j,j\neq i}})|$; (c) describes the vectorization of $\Delta(\lambda_{d,u})$. It should be noted that the positive and negative effects of $\lambda_{d,u_i}$ on the data are almost equal when the number of samples is large enough. Since $|\Delta_{+}(\lambda_{d,u_i})| = |\Delta_{-}(\lambda_{d,u_{j,j\neq i}})|$, all vectorized $\Delta(\lambda_{d,u})$ form a circle in a 2-dimensional plane; (d) reflects the expansion of the TUA assumption in the vector space. It can be regarded as a visualization of the TUA assumption at an abstract level (that is, constraints are applied to the dataset $U$ rather than to each $u_i$). In other words, it is no longer necessary that $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$

Rationality analysis According to Definition 3, we transform the original TUA assumption from constraints on scalars into constraints on vectors. For example, some individuals insist on eating nuts because nuts are good for their health (i.e., positive effect), but some people are allergic to nuts, and eating them brings pain and even life-threatening effects (i.e., negative effect). Therefore, we argue that it is necessary to consider the positive or negative effects of $\lambda_{d,u}$. Definition 3 provides an intuitive representation of positive and negative effects in the vector space, and based on this definition we give a proof of Lemma 2 as follows.

Proof For ease of understanding, we refer to Fig. 4 throughout the proof. Consider the representation of $\lambda_{d,u}$ in a 2-dimensional plane. As shown in Fig. 4a, we first represent $\lambda_{d,u}$ as the Euclidean distance in the plane, i.e.,

$$L_{ao} = \Delta(\lambda_{d,u_i}), \quad \text{and} \quad L_{bo} = \Delta(\lambda_{d,u_{j,j\neq i}}). \tag{24}$$

According to Lemma 1, $\Delta(u_i) = \Delta(u_{j,j\neq i})$ can be regarded as $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$. Then, we can use $L_{ao} = L_{bo}$ to equivalently describe $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$.

Second, we consider the representation of the TUA in 2-dimensional vector space. According to Definition 3, we can vectorize $\lambda_{d,u}$. Vectorization gives each $\Delta(\lambda_{d,u})$ a measure that describes the positive or negative effect of $\Delta(\lambda_{d,u})$ on the data. In order to maintain consistency with the original TUA assumption, we assume that $|\vec{\Delta}(\lambda_{d,u_i})| = |\vec{\Delta}(\lambda_{d,u_{j,j\neq i}})|$. For instance, as shown in Fig. 4b, let $|\Delta_{+}(\lambda_{d,u_i})|$ ($|\Delta_{-}(\lambda_{d,u_{j,j\neq i}})|$) denote the positive (negative) effect of $\Delta(\lambda_{d,u})$ on the data; although $|\vec{\Delta}(\lambda_{d,u_i})| = |\vec{\Delta}(\lambda_{d,u_{j,j\neq i}})|$, we have $\vec{\Delta}(\lambda_{d,u_i}) \neq \vec{\Delta}(\lambda_{d,u_{j,j\neq i}})$.

Third, we consider extending $\Delta(\lambda_{d,u})$ to the entire dataset. Since our research is set in the context of large datasets, we imply a condition here: in the entire data, the positive effects $\Delta_{+}(\lambda_{d,u_i})$ and negative effects $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on data generation are basically the same. Furthermore, since $|\vec{\Delta}(\lambda_{d,u_i})| = |\vec{\Delta}(\lambda_{d,u_{j,j\neq i}})|$, we can visualize the entire data as a circle in a 2-dimensional plane, where $|\vec{\Delta}(\lambda_{d,u_i})| = |\vec{\Delta}(\lambda_{d,u_{j,j\neq i}})| = r$.

Intuitively, under the TUA constraint, $\Delta_{+}(\lambda_{d,u_i}) = \Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ always holds. However, $\Delta_{+}(\lambda_{d,u_i}) = \Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ does not require the strong constraint $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$ to hold. In other words, in Fig. 4-(b), it is sufficient that the area of red is the same as the area of blue. Therefore, we can relax the restriction on $\Delta(\lambda_{d,u})$ by assuming only $\Delta_{+}(\lambda_{d,u_i}) = \Delta_{-}(\lambda_{d,u_{j,j\neq i}})$, without $\Delta(\lambda_{d,u_i}) = \Delta(\lambda_{d,u_{j,j\neq i}})$. In summary, we obtain the following conclusion based on TUA, i.e.,

$$\lim_{q \to \infty} \left( \sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) - \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \tag{25}$$

which proves the lemma.

Rationality analysis The traditional TUA strongly constrains all $\lambda_{d,u_i}$ (or $\Delta(u_i)$) to be the same for $u_i \in U$, which ignores the effect of $\lambda_{d,u}$ on the data and on the estimated ICE. However, ignoring this effect by applying TUA does not mean that the effect of $\lambda_{d,u}$ on the data does not exist. Therefore, instead of ignoring this potential impact, we represent it through the vectorization method (i.e., the positive and negative effects in Definition 3). In addition, Lemma 2 relaxes the constraint on the data to the level of the entire dataset $U$ rather than imposing a strong constraint on each unit $u_i$. Therefore, Lemma 2 can be considered an equivalent form of TUA at the abstract level.

5 The convergence of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$

Through the above analysis, we provide the equivalent form of the TUA, which is based on the 2-dimensional Gaussian distribution and a large dataset. By vectorizing $\lambda_{d,u_i}$, $u_i \in U$, we introduce the definitions of positive and negative effects, aiming to study the effect of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on the data under the premise $\Delta(\lambda_{d,u_i}) \neq \Delta(\lambda_{d,u_{j,j\neq i}})$. Although we assume that the effects of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ are equal in a large dataset, we hope that they will have less and less impact on the data as $q$ grows large. This concern is necessary because if the sample size is not large enough, the positive and negative effects may not cancel each other out. For example, the positive effects may be greater than the negative effects or vice versa. Quantifying $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ requires rigorous mathematical expressions. Therefore, a natural question is: how can we describe the convergence of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ as $q$ grows large? We answer this question in Theorem 1.

5.1 The descriptive equation of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$

In classical physics, damping refers to the characteristic that the amplitude of vibration gradually decreases in any oscillating system, which may be caused by external influences or by the system itself [45]. We introduce this idea into the study of the descriptive equation of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$. In this section, we provide a descriptive equation for $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ under which both converge strictly to 0 as $q$ grows large (see Theorem 1).

Theorem 1 For $\lambda_{d,u_i}$, if there are a positive effect $\Delta_{+}(\lambda_{d,u_i})$ and a negative effect $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ of $\Delta(\lambda_{d,u_i})$ on the data, and $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ satisfy (or approximately satisfy) the following equation,

$$\begin{cases} S(\Delta_{+}(\lambda_{d,u_i}), q) \triangleq A_{+} e^{-\eta_{+} \cdot q} \cos(n \cdot \eta_{+} \cdot q) \\ S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q) \triangleq A_{-} e^{-\eta_{-} \cdot q} \cos(n \cdot \eta_{-} \cdot q), \end{cases} \tag{26}$$

where $n \in \mathbb{Z}^{+}$ and $\eta_{+} > 0$, $\eta_{-} > 0$ are adjustment parameters, $e^{-\eta_{+} \cdot q}$ and $e^{-\eta_{-} \cdot q}$ are attenuation parameters, and $A_{+}$ and $A_{-}$ are the initial values of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$, respectively, then $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ gradually converge to 0 as $q$ grows large.

Proof We first analyze the first factor of (26), i.e.,

$$\begin{cases} S_1(\Delta_{+}(\lambda_{d,u_i}), q) \triangleq A_{+} e^{-\eta_{+} \cdot q} \\ S_2(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q) \triangleq A_{-} e^{-\eta_{-} \cdot q}, \end{cases} \tag{27}$$

where $A_{+}$ and $A_{-}$ are the initial values of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$, respectively. Because $\eta_{+} > 0$ and $\eta_{-} > 0$, the two terms $e^{-\eta_{+} \cdot q}$ and $e^{-\eta_{-} \cdot q}$ decay with the data size $q$.

Unfortunately, if we only use (27) to describe the exponential decay trend of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$, it cannot reflect their potential impact on the data. In other words, $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ do not necessarily follow a strictly monotonically decreasing function as they converge (see Fig. 5). Therefore, we need to consider the volatility effect of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on the data.

Because the influence of $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ on the data may be volatile, we add the term $\cos(n \cdot \eta \cdot q)$ to (27) to describe this volatility. We can rewrite (27) as follows:

$$\begin{cases} S(\Delta_{+}(\lambda_{d,u_i}), q) \triangleq A_{+} e^{-\eta_{+} \cdot q} \cdot \cos(n \cdot \eta_{+} \cdot q) \\ S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q) \triangleq A_{-} e^{-\eta_{-} \cdot q} \cdot \cos(n \cdot \eta_{-} \cdot q), \end{cases} \tag{28}$$

where $n$ and $\eta_{+} > 0$, $\eta_{-} > 0$ are adjustment parameters, and $e^{-\eta_{+} \cdot q}$ and $e^{-\eta_{-} \cdot q}$ are attenuation parameters. Because $|\cos(n \cdot \eta_{+/-} \cdot q)| \leq 1$, the magnitude of (26) is bounded by $A_{+/-} e^{-\eta_{+/-} \cdot q}$, so (26) still decays exponentially while oscillating.

According to Fig. 5, we can intuitively understand the meaning of the parameters $A_{+}$ and $\eta_{+}$ in (27). The parameter $A_{+}$ determines the initial maximum value of the positive effect. The parameter $\eta_{+}$ determines the convergence speed of the function $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$. Although $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$ can describe that the positive effect converges to 0 quickly as the number of samples increases, it ignores the volatility of positive effects. The proof for $S_2(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ is similar.

Fig. 5 A visualization of the influence of the parameters $(A_{+}, \eta_{+})$ on equation $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$. The situation of $S_2(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ is similar to the description of $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$

Similarly, according to Fig. 6, we can intuitively understand the meaning of the parameters $A_{+}$ and $\eta_{+}$ in (26). The parameter $A_{+}$ determines the initial maximum value of the positive effect, the parameter $\eta_{+}$ determines the convergence speed of the function $S(\Delta_{+}(\lambda_{d,u_i}), q)$, and $\cos(n \cdot \eta \cdot q)$ reflects the volatility of the positive and negative effects. The purpose of introducing $\cos(n \cdot \eta \cdot q)$ is to reflect, as far as possible, the conversion between the positive effect and the negative effect. The conversion can go either way: a positive effect can become a negative effect or vice versa. However, no matter how it is converted, it will eventually converge strictly to 0 under $A_{+} e^{-\eta_{+} \cdot q}$. The proof for $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ is similar.

Fig. 6 A visualization of the influence of the parameters $(A_{+}, n, \eta_{+})$ and $\cos(n \cdot \eta_{+} \cdot q)$ on equation $S(\Delta_{+}(\lambda_{d,u_i}), q)$. The situation of $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ is similar to the description of $S(\Delta_{+}(\lambda_{d,u_i}), q)$

5.2 The rationality analysis of equations $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$

The rationality analysis of equations $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ mainly includes two aspects:

– One is the analysis of the visualization results of $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$.
– The other is the interpretability of $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$.

The function of $\cos(n \cdot \eta_{+/-} \cdot q)$ To simplify the presentation, we only analyze positive effects in this subsection. The analysis of negative effects is similar. As shown in Fig. 5, $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$ only reflects exponential decay as $q$ increases. Although $S_1(\Delta_{+}(\lambda_{d,u_i}), q)$ eventually converges to 0, it does not reflect the potential impact on the data, because it directly describes the positive effect as a strictly monotonically decreasing function. A representation based on strict monotonic decrease ignores internal complexities. The effect of $\Delta_{+}(\lambda_{d,u_i})$ on data may be volatile (the situation may also be more complex). Therefore, in order to describe the volatility of $\Delta_{+}(\lambda_{d,u_i})$, we introduce the $\cos(\cdot)$ function. Apparently, $S(\Delta_{+}(\lambda_{d,u_i}), q)$ presents a trend of exponential decay with volatility. Finally, as $q$ increases, $S(\Delta_{+}(\lambda_{d,u_i}), q)$ will strictly converge to zero.

Attenuation parameters $e^{(-\eta_{+/-}) \cdot q}$ The purpose of introducing the attenuation parameter $e^{-\eta_{+/-} \cdot q}$ is to ensure that the positive effect and the negative effect exhibit exponential decay as $q$ increases. Although we improve TUA by vectorization, we hope that $S(\Delta_{+}(\lambda_{d,u_i}), q)$ and $S(\Delta_{-}(\lambda_{d,u_{j,j\neq i}}), q)$ will have minimal impact on the overall data. Therefore, even while acknowledging the existence of positive and negative effects, we hope that $\Delta_{+}(\lambda_{d,u_i})$ and $\Delta_{-}(\lambda_{d,u_{j,j\neq i}})$ decay as quickly as possible, in an exponential manner.

In fact, according to Lemma 1, Lemma 2, and Theorem 1, we provide a milder TUA assumption (referred to as M-TUA for short) through vectorization operations. In particular, (26) provides a formal description of positive effects and negative effects, which makes M-TUA interpretable. In summary, the above conclusion provides a mild form of TUA at the abstract level and an explicit (but not unique) mathematical description.

6 Comparison of TUA and M-TUA

In this section, we compare the traditional TUA and M-TUA to illustrate their similarities and differences.

– $\Delta(u_i)$ and $\Delta(\lambda_{d,u_i})$. TUA assumes that the value of ICE is the same for all $u_i \in U$ ($|U| = q$), e.g., $\Delta(u_i) = \Delta_{ACE}(u_i)$, where $i \in [1, \ldots, q]$. M-TUA transfers the above problem to a constraint on $\Delta(\lambda_{d,u_i})$ by the vectorization operation, that is,

$$\text{TUA} \Longrightarrow \begin{cases} \lim_{q \to \infty} \left( \Delta(\lambda_{d,u_i}) - \Delta(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \\ \lim_{q \to \infty} \left( \sum_{q^{+}} \Delta_{+}(\lambda_{d,u_i}) - \sum_{q^{-}} \Delta_{-}(\lambda_{d,u_{j,j\neq i}}) \right) = 0, \end{cases} \tag{29}$$

where $q^{+} + q^{-} = q$.
– Vector $\Delta_{+/-}(\lambda_{d,u_{i/j}})$ and scalar $\Delta(\lambda_{d,u_i})$. M-TUA provides a vector description of positive and negative effects for $\Delta(\lambda_{d,u_i})$ (i.e., $\Delta_{+/-}(\lambda_{d,u_{i/j}})$), which distinguishes M-TUA from the traditional TUA. The vectorization operation allows for differences between individuals, that is, $\Delta(\lambda_{d,u_i}) \neq \Delta(\lambda_{d,u_{j,j\neq i}})$ is allowed under the premise $\Delta_{+}(\lambda_{d,u_i}) = \Delta_{-}(\lambda_{d,u_{j,j\neq i}})$. Therefore, M-TUA achieves a weakening of TUA.
– Variance. For a randomized experiment, the TUA assumption implies that the variance is constant for all treatments. Constant variance is not a necessary condition for M-TUA; M-TUA should be used on data with small variance to constrain the dispersion of the population.

To make the description more intuitive, we use a simple example to illustrate how M-TUA weakens the TUA assumption.

Example 9 (Difference between data generated by TUA and M-TUA) TUA differs from M-TUA in a number of respects. The goal of this example is to compare the data obtained under the two assumptions by estimating the unobserved potential outcomes in Table 6.

– Similar to Example 8, in Table 7, we construct a set of data (including 10 subjects $u_i$, $i \in [1, 2, \ldots, 10]$) that meets the TUA assumption, where

$$\Delta_{ACE}(u_i) = E(O_{t,u_i} - O_{c,u_i}) = \frac{1}{10}\sum_{i=1}^{10} (O_{t,u_i} - O_{c,u_i}) = 1. \tag{30}$$

– Tables 8 and 9 are constructed based on the M-TUA assumption.

Table 6 Observation data with $\Delta_{ACE}(u_i) = 1$

| Subject | $O_{t,u_i}$ | $O_{c,u_i}$ | $O_{t,u_i} - O_{c,u_i}$ |
|---|---|---|---|
| $u_1$ | 13 | ? | ? |
| $u_2$ | ? | 9.5 | ? |
| $u_3$ | ? | 8 | ? |
| $u_4$ | ? | 10 | ? |
| $u_5$ | 11 | ? | ? |
| $u_6$ | 15 | ? | ? |
| $u_7$ | ? | 9.5 | ? |
| $u_8$ | 9 | ? | ? |
| $u_9$ | ? | 10 | ? |
| $u_{10}$ | ? | 9 | ? |
| mean | ? | ? | 1 |

Table 7 Assignment mechanism based on TUA assumption with $\Delta_{ACE}(u_i) = 1$

| Subject | $O_{t,u_i}$ | $O_{c,u_i}$ | $O_{t,u_i} - O_{c,u_i}$ |
|---|---|---|---|
| $u_1$ | 13 | 12 | 1 |
| $u_2$ | 11.5 | 10.5 | 1 |
| $u_3$ | 10 | 9 | 1 |
| $u_4$ | 12 | 11 | 1 |
| $u_5$ | 11 | 10 | 1 |
| $u_6$ | 15 | 14 | 1 |
| $u_7$ | 13 | 12 | 1 |
| $u_8$ | 9 | 8 | 1 |
| $u_9$ | 8.5 | 7.5 | 1 |
| $u_{10}$ | 12 | 11 | 1 |
| mean | 11.5 | 10.5 | 1 |

As can be seen from Table 7, the data can only follow one of two situations, i.e., $O_{c,u_i} < O_{t,u_i}$ (i.e., $\Delta_{ACE}(u_i) > 0$), or $O_{c,u_i} > O_{t,u_i}$ (i.e., $\Delta_{ACE}(u_i) < 0$). However, this strong assumption, which forces all subjects to have the same $\Delta(u_i)$, is often violated in the real world. M-TUA alleviates this problem and is more in line with the complex situations in real data (note that the values of $\Delta(\lambda_{d,u_i})$ in Tables 8 and 9 are not unique).

As shown in Tables 8 and 9, based on the assumption of M-TUA (i.e., $\sum_{i=1}^{10} \Delta(\lambda_{d,u_i}) = (0.2 + 0.2 + 0.2 + 0.2 + 0.1 + 0.1) - (0.2 + 0.2 + 0.3 + 0.3) = 0$), the data can be more in line with the assignment mechanism while the ACE value remains unchanged, thereby avoiding the restriction to either $O_{c,u_i} < O_{t,u_i}$ (i.e., $\Delta_{ACE}(u_i) > 0$) or $O_{c,u_i} > O_{t,u_i}$ (i.e., $\Delta_{ACE}(u_i) < 0$). For example, according to (10), we have

$$\Delta(u_i) = \Delta_{ACE}(u_i) + \Delta(\lambda_{d,u_i}), \quad i \in [1, 2, \ldots, 10]. \tag{31}$$

Further, we obtain

$$\Delta_{ACE}(u_i) = \frac{1}{10} \cdot \sum_{i=1}^{10} \Delta(u_i) = \frac{1}{10} \cdot \sum_{i=1}^{10} \left( \Delta_{ACE}(u_i) + \Delta(\lambda_{d,u_i}) \right). \tag{32}$$

Since $\Delta_{ACE}(u_i)$ appears on both sides of (32), only $\sum_{i=1}^{10} \Delta(\lambda_{d,u_i}) = 0$ needs to be satisfied. There are countless assignments that satisfy $\sum_{i=1}^{10} \Delta(\lambda_{d,u_i}) = 0$.

Table 8 Assignment mechanism based on M-TUA assumption with $\Delta_{ACE}(u_i) = 1$

| Subject | $O_{t,u_i}$ | $O_{c,u_i}$ | $\Delta(u_i) = O_{t,u_i} - O_{c,u_i}$ | $\Delta(\lambda_{d,u_i}) = \lambda_{t,u_i} - \lambda_{c,u_i}$ |
|---|---|---|---|---|
| $u_1$ | 13 | 11.8 | 1.2 | 0.2 |
| $u_2$ | 10.7 | 9.5 | 1.2 | 0.2 |
| $u_3$ | 9.2 | 8 | 1.2 | 0.2 |
| $u_4$ | 11.2 | 10 | 1.2 | 0.2 |
| $u_5$ | 11 | 9.9 | 1.1 | 0.1 |
| $u_6$ | 15 | 13.9 | 1.1 | 0.1 |
| $u_7$ | 10.3 | 9.5 | 0.8 | −0.2 |
| $u_8$ | 9 | 8.2 | 0.8 | −0.2 |
| $u_9$ | 10.7 | 10 | 0.7 | −0.3 |
| $u_{10}$ | 9.7 | 9 | 0.7 | −0.3 |
| mean | 10.98 | 9.98 | 1 | 0 |

Example 9 shows that the data constructed based on the M-TUA assumption allows for differences between the various $u_i$ (e.g., $\Delta(u_7, u_8, u_9, u_{10}) < 0$, $\Delta(u_{1,2,3,4,5}) > 0$ and $\Delta(u_6) = 0$), while ensuring that $\Delta_{ACE}(u_i)$ is constant (e.g., $\Delta_{ACE}(u_i) = 1$), which is more in line with the diversity of experimental samples in real tasks.

However, note that it is not sufficient to simply require that $\sum_{i=1}^{10} \Delta(\lambda_{d,u_i}) = 0$ holds, because this constraint alone does not guarantee that the data keeps good dispersion. Therefore, an indispensable measure is to introduce variance as a metric to constrain the data, so that the data constructed based on M-TUA maintains good dispersion.

only requires a small value of variance (e.g., the variance of $\Delta(\lambda_{d,u_i})$ in Table 8 is less than 0.5, and the variance of $\Delta(\lambda_{d,u_i})$ in Table 9 is close to 1).

Limitations Although M-TUA weakens TUA to a certain extent and expands the scope of use of the original TUA, M-TUA itself is based on some assumptions and only achieves equivalence with TUA in the case of large samples, i.e., $q \to \infty$. Therefore, M-TUA still has the following limitations.

– Dimensionality limitation of vector space. We take the
The reason is that the population is larger and 2-dimensional Gaussian distribution as an example. the variance is less, the ACE would be closer to the true Based on Example 5, we analyze the equivalent form of ACE regardless of the specific units randomly assigned TUA in 2-dimensional vector space. The vectorization to treatments. As mentioned above, for a randomized operation in 2-dimensional space can easily be extended experiment, the TUA implies that the variance is constant to 3-dimensional space. However, the equivalent form for all treatments, which means that a necessary condition of the TUA for data in high-dimensional space has not for TUA is that the variance is constant, while M-TUA been rigorously established. Table 9 Assignment Subject O O (u) = O − O (λ ) = λ − λ t,u c,u t,u c,u d,u t,u c,u mechanism based on M-TUA i i i i i i i assumption with  (u ) = 1 ACE i u 13 10.98 2.02 1.02 u 11.53 9.5 2.03 1.03 u 10.04 8 2.04 1.04 u 12.04 10 2.04 1.04 u 11 9 2 1.00 u 15 15 0 −1.00 u 9.48 9.5 −0.02 −1.02 u 99.03 −0.03 −1.03 u 9.96 10 −0.04 −1.04 u 8.96 9 −0.04 −1.04 mean 11.001 10.001 1 0 12972 C. Wang et al. –  (λ ). As shown in Fig. 4d, M-TUA TUA assumption at an abstract level through a set of +/− d,u i/j =i implies a premise that interpretable parameters. ⎛ ⎞ ⎝ ⎠ lim  (λ ) −  (λ ) = 0, (33) + d,u − d,u i j 7 Conclusion and future work q→q + − q q + − In this paper, we first use an example to illustrate the where q + q = q. It requires a large enough sample underlying problems of using the functional model to size to ensure that the equation estimate the probability solution of counterfactual queries. (λ ) =  (λ ) (34) We analyze the inference mechanism of the functional + d,u − d,u i j + − q q model and point out that there are ambiguous conclusions when the unique output probability solution is 0 under holds with a high probability. Because the effects of any the functional model. 
In other words, when the probability (λ ) may be positive or negative (this is similar to d,u solution obtained by the functional model is 0, it does not the classical coin toss experiment, when the number of mean that the estimated event will not occur. Secondly, experiments is sufficient, the numbers of positive and for the TUA assumption commonly used in counterfactual negative coin occurrences are basically equal). models, we provide an equivalent description form of the −η ·q +/− – Decay rate.The e in Theorem 1 ensures that (26) TUA in the low-dimensional space. We weaken the TUA will eventually converge to 0 with exponential decay. Of assumption by vectorizing the original TUA and finally course, the purpose of choosing exponential decay is to obtain a milder TUA assumption, i.e., M-TUA. In addition, make  (λ ) or  (λ ) converge quickly so that + d,u − d,u i − we also give theoretical proof and exhaustive analysis of the as the amount of sample data increases, the impact of rationality and limitations of M-TUA. (λ ) or  (λ ) on the data will be minimal (or + d,u − d,u i j As pointed out earlier, in M-TUA, the constraints on the as small as possible) and eventually reach a negligible unit are related to the dataset and RCE, instead of mandatory level. constraints for each unit. We argue this is very necessary, – Ignorability. Since M-TUA is a constraint imposes especially in the case of big data. Mild version assumption on the task of making causal inferences in the POF, (not just M-TUA) can be viewed as an abstraction from the ignorability(i.e., (O ,O ) ⊥ d)) still needs to hold. t,u c,u i i micro world to the macro world [46]. An intuitive example In addition, we argue that estimating the variance of is that if we want to measure the water temperature of a the data is still necessary (e.g, Example 9). Because swimming pool, it is impossible for us to measure every if the population is larger and the variance is less, the drop of water in the swimming pool. 
However, we do not ACE would be closer to the true ACE regardless of the think that the conclusion of this paper is the final form of the specific units randomly assigned to treatment. M-TUA. Therefore, we will focus on the following points in our future work. Interpretability Since the TUA cannot be tested and veri- fied on the observed data, this will lead to limitations in the Practicality Causal science has shown vigorous vitality in use of many models (e.g., the model of (7)) [36]. Therefore, the field of AI and public health [47]. However, a large it is necessary to obtain a milder and interpretable assump- number of tasks can only be carried out under the premise of tion. In general, M-TUA offers several advantages in terms satisfying strong assumptions. The use of some assumptions of interpretability as follows: is also not differentiated according to the different tasks. – Based on the idea of DVE, we establish the relationship Therefore, including M-TUA, whether the version for between TUA and RCE and try to provide some different AI task scenarios can be further developed is a reasonable explanations for λ . topic worthy of our further consideration. d,u – Through vectorization operations, we endow λ with d,u the ability to describe positive and negative effects on Challenges posed by high-dimensional data .Asatheo- data, and theoretically prove the rationality of M-TUA retical exploration of weakening TUA, M-TUA presents under the large dataset. the equivalent form of TUA in vector space through vec- – M-TUA not only weakens the strength of the torization and gives it a certain degree of interpretability. original TUA assumption but also provides a geometric However, with the explosion of data, AI practitioners are description of the TUA. confronted with data that are very large in both volume and – In particular, the M-TUA has an explicit mathematical dimensionality. 
Although our theorem shows that M-TUA is applicable in the case of big data, high-dimensional data brings new challenges. Therefore, how to develop assumptions based on M-TUA with theoretical guarantees and applicable to high-dimensional data is also the focus of our future work.

Acknowledgements This work was supported by the National Key R&D Program of China under Grant 2018YFB1403200.

References

1. Heintzelman SJ, Christopher J, Trent J, King LA (2013) Counterfactual thinking about one’s birth enhances well-being judgments. J Posit Psychol 8(1):44–49
2. Morgan SL, Winship C (2015) Counterfactuals and causal inference. Cambridge University Press
3. Balke A, Pearl J (1994) Probabilistic evaluation of counterfactual queries. In: Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, pp 230–237
4. Lewis D (1976) Probabilities of conditionals and conditional probabilities. In: Ifs. Springer, pp 129–147
5. Ginsberg ML (1986) Counterfactuals. Artif Intell 30(1):35–79
6. Kong E, Prinz D (2020) Disentangling policy effects using proxy data: Which shutdown policies affected unemployment during the covid-19 pandemic? J Public Econ 189:104257
7. Luo G, Zhao B, Du S (2019) Causal inference and bayesian network structure learning from nominal data. Appl Intell 49(1):253–264
8. Liu Y, Yu J, Xu L, Wang L, Yang J (2021) Sissos: intervention of tabular data and its applications. Appl Intell:1–15
9. Pearl J, Mackenzie D (2018) The book of why: the new science of cause and effect. Basic Books
10. Venzke I (2018) What if? counterfactual (hi) stories of international law. Asian J Int Law 8(2):403–431
11. Pesaran MH, Smith RP (2016) Counterfactual analysis in macroeconometrics: An empirical investigation into the effects of quantitative easing. Res Econ 70(2):262–280
12. Atan O, Zame WR, Feng Q, van der Schaar M (2019) Constructing effective personalized policies using counterfactual inference from biased data sets with many features. Mach Learn 108(6):945–970
13. Major D, Lenis D, Wimmer M, Sluiter G, Berg A, Bühler K (2020) Interpreting medical image classifiers by optimization based counterfactual impact analysis. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, pp 1096–1100
14. Castro DC, Walker I, Glocker B (2020) Causality matters in medical imaging. Nat Commun 11(1):1–10
15. Hao Z, Zhang H, Cai R, Wen W, Li Z (2015) Causal discovery on high dimensional data. Appl Intell 42(3):594–607
16. Qin L, Shwartz V, West P, Bhagavatula C, Hwang JD, Le Bras R, Bosselut A, Choi Y (2020) Backpropagation-based decoding for unsupervised counterfactual and abductive reasoning. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 794–805
17. Nguyen T-L, Collins GS, Landais P, Le Manach Y (2020) Counterfactual clinical prediction models could help to infer individualised treatment effects in randomised controlled trials–an illustration with the international stroke trial. J Clin Epidemiol
18. Niu Y, Tang K, Zhang H, Lu Z, Hua X-S, Wen J-R (2021) Counterfactual vqa: A cause-effect look at language bias. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12700–12710
19. Abbasnejad E, Teney D, Parvaneh A, Shi J, Hengel A (2020) Counterfactual vision and language learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10044–10054
20. Bajaj M, Chu L, Xue ZY, Pei J, Wang L, Lam PC-H, Zhang Y (2021) Robust counterfactual explanations on graph neural networks. Adv Neural Inf Process Syst 34
21. Holzinger A, Malle B, Saranti A, Pfeifer B (2021) Towards multi-modal causability with graph neural networks enabling information fusion for explainable ai. Inf Fusion 71:28–37
22. Wachter S, Mittelstadt B, Russell C (2017) Counterfactual explanations without opening the black box: Automated decisions and the gdpr. Harv JL Tech 31:841
23. Hendricks LA, Hu R, Darrell T, Akata Z (2018) Generating counterfactual explanations with natural language. arXiv:1806.09809
24. Ustun B, Spangher A, Liu Y (2019) Actionable recourse in linear classification. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp 10–19
25. Barocas S, Selbst AD, Raghavan M (2020) The hidden assumptions behind counterfactual explanations and principal reasons. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp 80–89
26. Pearl J (2018) Theoretical impediments to machine learning with seven sparks from the causal revolution. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp 3–3
27. Marx A, Vreeken J (2019) Telling cause from effect by local and global regression. Knowl Inf Syst 60(3):1277–1305
28. Bertossi L (2021) Specifying and computing causes for query answers in databases via database repairs and repair-programs. Knowl Inf Syst 63(1):199–231
29. Hair Jr JF, Sarstedt M (2021) Data, measurement, and causal inferences in machine learning: opportunities and challenges for marketing. J Mark Theory Pract:1–13
30. Zucker J, Paneri K, Mohammad-Taheri S, Bhargava S, Kolambkar P, Bakker C, Teuton J, Hoyt CT, Oxford K, Ness R et al (2021) Leveraging structured biological knowledge for counterfactual inference: A case study of viral pathogenesis. IEEE Trans Big Data 7(1):25–37
31. Truong D (2021) Using causal machine learning for predicting the risk of flight delays in air transportation. J Air Transport Manag 91:101993
32. Kumar V, Choudhary A, Cho E (2020) Data augmentation using pre-trained transformer models. arXiv:2003.02245
33. Wu X, Lv S, Zang L, Han J, Hu S (2019) Conditional bert contextual augmentation. In: International Conference on Computational Science. Springer, pp 84–95
34. Qin L, Bosselut A, Holtzman A, Bhagavatula C, Clark E, Choi Y (2019) Counterfactual story reasoning and generation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp 5043–5053
35. Qian C, Feng F, Wen L, Ma C, Xie P (2021) Counterfactual inference for text classification debiasing. ACL-IJCNLP
36. Dawid AP (2000) Causal inference without counterfactuals. J Amer Stat Assoc 95(450):407–424
37. Holland PW (1986) Statistics and causal inference. J Amer Stat Assoc 81(396):945–960
38. Rubin DB (1980) Randomization analysis of experimental data: The fisher randomization test comment. J Amer Stat Assoc 75(371):591–593
39. Pearl J (2009) Causality. Cambridge university press
40. Pearl J, Glymour M, Jewell NP (2016) Causal inference in statistics: A primer. Wiley
41. Humar J (2012) Dynamics of structures. CRC press
42. Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66(5):688
43. Imbens GW, Rubin DB (1997) Bayesian inference for causal effects in randomized experiments with noncompliance. Ann Stat:305–327
44. Heckman JJ (2010) Building bridges between structural and program evaluation approaches to evaluating policy. J Econ Literature 48(2):356–98
45. Géradin M, Rixen DJ (2014) Mechanical vibrations: theory and application to structural dynamics. Wiley
46. Beckers S, Eberhardt F, Halpern JY (2020) Approximate causal abstractions. In: Uncertainty in Artificial Intelligence. PMLR, pp 606–615
47. Mohimont L, Chemchem A, Alin F, Krajecki M, Steffenel LA (2021) Convolutional neural networks and temporal cnns for covid-19 forecasting in france. Appl Intell:1–26

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
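
As a supplementary check on Example 9, the arithmetic behind Tables 7, 8 and 9 can be reproduced with a short script. This is only a minimal sketch and not part of the original method: the names `TARGET_ACE`, `TABLES`, `summarize` and `results` are hypothetical, the offsets are computed as $\Delta(u_i) - \Delta_{ACE}(u_i)$ following (31), and the use of the population variance is an assumption, since the text does not state which variance estimator is meant.

```python
from statistics import pvariance

# Assumed target value of the ACE, taken from the captions of Tables 7-9.
TARGET_ACE = 1.0

# (O_t, O_c) pairs for subjects u_1..u_10, copied from Tables 7, 8 and 9.
TABLES = {
    "Table 7": [(13, 12), (11.5, 10.5), (10, 9), (12, 11), (11, 10),
                (15, 14), (13, 12), (9, 8), (8.5, 7.5), (12, 11)],
    "Table 8": [(13, 11.8), (10.7, 9.5), (9.2, 8), (11.2, 10), (11, 9.9),
                (15, 13.9), (10.3, 9.5), (9, 8.2), (10.7, 10), (9.7, 9)],
    "Table 9": [(13, 10.98), (11.53, 9.5), (10.04, 8), (12.04, 10), (11, 9),
                (15, 15), (9.48, 9.5), (9, 9.03), (9.96, 10), (8.96, 9)],
}


def summarize(rows, target_ace=TARGET_ACE):
    """Return (ACE, sum of offsets, population variance of offsets)."""
    ice = [o_t - o_c for o_t, o_c in rows]    # individual effects, Delta(u_i)
    ace = sum(ice) / len(ice)                 # average effect, as in (30)
    offsets = [d - target_ace for d in ice]   # Delta(lambda_{d,u_i}), via (31)
    return ace, sum(offsets), pvariance(offsets)


results = {name: summarize(rows) for name, rows in TABLES.items()}

for name, (ace, offset_sum, offset_var) in results.items():
    print(f"{name}: ACE={ace:.3f}, "
          f"|sum of offsets|={abs(offset_sum):.3f}, "
          f"variance of offsets={offset_var:.3f}")
```

Under these assumptions, all three tables keep the ACE at 1 and the offsets sum to 0 up to floating-point rounding, while the offset variance is 0 for Table 7, about 0.04 for Table 8 and about 1.05 for Table 9, which matches the statements that the variance is "less than 0.5" and "close to 1".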

Journal

Applied Intelligence (Dordrecht, Netherlands), Pubmed Central

Published: Feb 17, 2022
