Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear Cluster-Weighted Models

Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear... In recent years, the research into cluster-weighted models has been intense. However, esti- mating the covariance matrix of the maximum likelihood estimator under a cluster-weighted model is still an open issue. Here, an approach is developed in which information-based esti- mators of such a covariance matrix are obtained from the incomplete data log-likelihood of the multivariate Gaussian linear cluster-weighted model. To this end, analytical expressions for the score vector and Hessian matrix are provided. Three estimators of the asymptotic covariance matrix of the maximum likelihood estimator, based on the score vector and Hes- sian matrix, are introduced. The performances of these estimators are numerically evaluated using simulated datasets in comparison with a bootstrap-based estimator; their usefulness is illustrated through a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions. Keywords Gaussian mixture model · Hessian matrix · Linear regression · Model-based cluster analysis · Sandwich estimator · Score vector 1 Introduction Cluster-weighted models constitute an approach to regression analysis with random covari- ates in which supervised (regression) and unsupervised (model-based cluster analysis) learning methods are jointly exploited (Hastie et al., 2009). In this approach, a given ran- dom vector is assumed to be composed of an outcome Y (response, dependent variable) and its explanatory variables X (covariates, predictors). Furthermore, sample observations are allowed to come from a population composed of several unknown sub-populations. Finally, the joint distribution of the outcome and the covariates is modelled through a finite mixture model specified so as to account for a different effect of the covariates on the response within each sub-population. Thus, cluster-weighted models are useful to perform Gabriele Soffritti gabriele.soffritti@unibo.it Department of Statistical Sciences, Alma Mater Studiorum - University of Bologna, via delle Belle Arti, 41 - 40126 Bologna, Italy Journal of Classification (2021) 38:594–625 595 model-based cluster analysis in a multivariate regression setting with random covariates and unobserved heterogeneity. Since the introduction of this approach (Gershenfeld, 1997), the research into cluster- weighted models has been intense, especially in the last 8 years. Models for continuous variables under normal mixture models have been proposed by Ingrassia et al. (2012)and Dang et al. (2017). Robustified solutions have been developed by Ingrassia et al. (2014) and Punzo and McNicholas (2017), based on the use of Student t and contaminated normal mixture distributions, respectively. Punzo and Ingrassia (2013), Punzo and Ingrassia (2016), Ingrassia et al. (2015) and Di Mari et al. (2020) have introduced models for various types of responses. Models able to deal with non-linear relationships or many covariates have been proposed by Punzo (2014), Subedi et al. (2013) and Subedi et al. (2015). By focusing the attention on Gaussian cluster-weighted models in which the effects of the covariates on the response within each sub-population are linear, model parame- ters are generally estimated through the maximum likelihood (ML) method by resorting to the expectation-maximisation (EM) algorithm (Dempster et al., 1977), which is widely employed in incomplete-data problems. In these models, the observed data S ={(x , y ), i i i = 1,...,I } are incomplete because the specific component density that generates the I sample observations is missing. This missing information is modelled through an unob- served variable coming from a pre-specified multinomial distribution and is added to the observed data so as to form the so-called complete data. Then, the ML estimate is computed from the maximisation of the complete data log-likelihood. A description of the EM algo- rithm for the linear Gaussian cluster-weighted model can be found in Dang et al. (2017). Specific functions implementing such algorithm for models with a univariate response are available in the package flexCWM (Mazza et al., 2018)for the R software environment (R Core Team, 2020). A by-product of this estimating approach is a set of K estimated posterior probabili- ties that each sample observation comes from the K Gaussian distributions of the mixture. Thus, a by-product of fitting a linear Gaussian cluster-weighted model is a clustering of the I sample observations, based on a rule that assigns an observation to the distribution of the mixture to which it has the highest posterior probability of belonging. However, an estimat- ing approach based on the use of an EM algorithm has the drawback of not automatically producing any estimate of the covariance matrix of the ML estimator. The assessment of the sample variability of the parameter estimates in a linear Gaussian cluster-weighted model is a necessary step in the subsequent development of inference methods for the model param- eters, such as asymptotic confidence intervals, tests for the significance of the effect of any covariate on a given response within any sub-population and tests for the significance of the difference between the effects of the same covariate on a given response in two different sub-populations. Thus, additional computations are necessary to obtain an assessment of the sample variability of model parameter estimates. To the best of the author’s knowledge, the only solution currently available for the linear Gaussian cluster-weighted models is imple- mented in the flexCWM package, where approximated standard errors are only provided for the intercepts and regression coefficients according to an approach in which a number of separate linear regression analyses with random covariates are carried out (one for each component of the mixture), and the sample observations are weighted with their estimated posterior probabilities of coming from the different components of the mixture. However, this approach only partially exploits the sample information about the parameters under a linear normal cluster-weighted model. Thus, other approaches could be investigated and detected. A solution can be obtained by resorting to bootstrap methods (see, e.g., Newton 596 Journal of Classification (2021) 38:594–625 &Raftery 1994; Basford et al. 1997; McLachlan & Peel 2000). However, the overall com- putational process associated with the use of bootstrap techniques can become particularly time-consuming and complex because of difficulties typically associated with the fitting of finite mixture models (e.g. label-switching problems, possible convergence failures of the EM algorithm on the bootstrap samples). These inconveniences could be avoided through an approach in which the observed information matrix is obtained from the incomplete data log-likelihood and employed to compute information-based estimators of the covariance matrix of the ML estimator (see e.g. McLachlan & Peel 2000). This task has been success- fully carried out under normal mixture models (Boldea & Magnus, 2009) and clusterwise linear regression models (Galimberti et al., 2021). In order to make it possible to properly assess both the variability of and the covariance between ML estimates of all the parameters under multivariate linear normal cluster- weighted models with a multivariate response, the gradient vector and second-order derivative matrix of the incomplete data log-likelihood for these models are explicitly derived here. Then, these results are used to obtain three estimators of the observed infor- mation matrix and the covariance matrix of the ML estimator. Properties of such estimators are numerically investigated using simulated datasets in comparison with the paramet- ric bootstrap and the approach implemented in flexCWM. A numerical evaluation of the relationships between the estimators introduced here and those described by Boldea and Magnus (2009) is also provided. The practical usefulness of the proposed estimators is illus- trated in a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions. The remainder of the paper is organised as follows. Section 2 provides the defini- tion of multivariate Gaussian linear cluster-weighted model together with some quantities employed in the derivation of the score vector and the Hessian matrix. Section 3 describes the estimators of the information matrix and the covariance matrix of the ML estimator. Sections 4 and 5 summarize the main results obtained from the analysis of simulated and real datasets, respectively. The analytical expressions of the score vector and the Hessian matrix are reported in Appendix. Technical details and additional results from the analysis of simulated datasets can be found in a separate document as supplementary materials. 2 Score Vector and Hessian Matrix of Gaussian Linear Cluster-Weighted Models Let X = (X , ...,X ) and Y = (Y , ...,Y ) be two continuous random vectors with joint 1 p 1 q probability density function (p.d.f.) f(x, y). The response vector Y and the covariate vector q p X take values in R and R , respectively. Following Dang et al. (2017), (X , Y ) follows a cluster-weighted model of order G if f(x, y) = π f(x| )f (y|x, ), (1) g g g g=1 where  ,..., denote the G unknown sub-populations ( ∩  =∅ ∀g = j), 1 G g j π = P( ), π > 0 ∀g, π = 1, f(x| ) is the conditional p.d.f. of X given g g g g g g=1 , f(y|x, ) is the conditional p.d.f. of Y given X and  . A Gaussian linear cluster- g g g weighted model is obtained from Eq. 1 by additionally assuming that the distributions of Journal of Classification (2021) 38:594–625 597 X| and Y|(X = x, ) are Gaussian for g = 1,...,G and the effect of X on Y for any g g is linear. By embedding all these assumptions into model (1), f(x, y) becomes f(x, y; ϑ ) = π φ (x; μ ,  )φ (y|x; B x ,  ), (2) g p X q Y X g g g g g=1 where φ (·; μ, ) represents the p.d.f. of a normal d-dimensional random vector with expected value μ and positive definite covariance matrix , B x = E(Y|X = x, ), g = 1,...,G, (3) (1+p)×q ∗ with B ∈ R , x = (1, x ) ,and ϑ is the vector of the unknown parameters. It has been proved that linear Gaussian cluster-weighted models of order G define the same family of probability distributions generated by mixtures of G Gaussian models for the random vector Z = (X , Y ) (Ingrassia et al., 2012). However, it is important to stress that this latter type of mixtures cannot be employed to account for local linear dependencies between X and Y. The score vector and Hessian matrix of model (2) are derived by taking account of the fact that the weights π ,...,π sum to one and the covariance matrices are symmetric. 1 G The first constraint is introduced in the maximization of the likelihood function by differ- entiating with respect to π = (π ,...,π ) and by setting π = 1 − π −· · ·− π . 1 G−1 G 1 G−1 The constraints on the covariance matrices are dealt with by using the operators vec(·),v(·) and the duplication matrix. Namely, vec(B) is the column vector obtained by stacking the columns of matrix B one underneath the other. v() denotes the column vector obtained from vec() by eliminating all supradiagonal elements of a symmetric matrix  (thus, v() contains only the lower triangular part of ). The duplication matrix G is the unique matrix which transforms v() into vec() (Gv() = vec()) (see e.g. Magnus & Neudecker 1988). Using this notation, the vector of the unknown parameters in model (2) can be denoted as ϑ = (π , θ ,..., θ ) ,where θ = μ , v  , vec B , v  . g X g Y 1 G X g g Suppose that the observed data S ={(x , y ), i = 1,...,I } is composed of I indepen- i i dent and identically distributed observations. Then, the incomplete log-likelihood function under the model (2)is I G l(ϑ ) = ln π φ x ; μ ,  φ y |x ; B x ,  .(4) g p i X q i Y X g i g i g i=1 g=1 Each sample observation provides its own contribution to the gth term of the sum that defines the mixture (2). As far as the contribution of the ith observation is concerned, it is given by: 1 1 −1 − − 2 2 p = π (2π) det( ) exp − (x − μ )  (x − μ ) gi g X i i g X X g X g 1 1 − −  ∗  −1  ∗ 2 2 (2π) det( ) exp − (y − B x )  (y − B x ).(5) g i g i Y i g i By exploiting properties of the logarithm and trace, the following equality holds true: p 1 1 −1 ln p = ln π − ln(2π) − ln det( ) − tr  (x − μ )(x − μ ) + gi g X i i X X g X g g 2 2 2 q 1 1 −1  ∗  ∗ − ln(2π) − ln det( ) − tr  (y − B x )(y − B x ).(6) g i g i g Y i i 2 2 2 598 Journal of Classification (2021) 38:594–625 The explicit forms of the score vector and Hessian matrix, as developed here, require the introduction of some additional notation. Namely, let gi α = , (7) gi ji j =1 a = e,g = 1,...,G − 1, (8) g g a =− 1 , G G−1 where e denotes the gth column of the identity matrix of order G − 1and 1 is the g G−1 (G−1)×1 vector having each element equal to 1. The following quantities are also required: −1  ∗ o =  (y − B x ), (9) gi Y i g i −1 O =  − o o , (10) gi gi gi −1 f =  (x − μ ), (11) gi i X g −1 F =  − f f . (12) gi gi X gi The explicit forms of the score vector S(ϑ ) and Hessian matrix H(ϑ ) for a Gaussian linear cluster-weighted model are provided in Theorems 1 and 2 (see Appendix). Proofs can be found in the document with the supplementary materials. 3 Covariance Matrix Estimation of the ML Estimator In the ML approach, the information matrix I(ϑ ) plays a crucial role, as it is used to asymptotically estimate the covariance of the ML estimator of the model parameters. Under regularity conditions and if the model is correctly specified, I(ϑ ) is given either by the covariance of the score function E S(ϑ )S(ϑ ) or the negative of the expected value of the Hessian matrix −E (H(ϑ )). Clearly, an analytical evaluation of the expectations required to obtain I(ϑ ) under model (2) is quite complex. By exploiting some asymptotic results concerning ML estimation (White, 1982), it is possible to obtain the following asymptotic estimators of I(ϑ ): I I ˆ ˆ ˆ I = S (ϑ )S (ϑ ) , I =− H (ϑ ), 1 i i 2 i i=1 i=1 ˆ ˆ where S (ϑ ) and H (ϑ ) denote the contribution of the ith sample observation to the score i i function and Hessian matrix evaluated at the ML estimate ϑ, respectively. They can be used ˆ ˆ to obtain the following asymptotic estimators of Cov(ϑ ), the covariance matrix of ϑ: −1 Cov (ϑ ) = I , (13) −1 Cov (ϑ ) = I . (14) Under suitable conditions (see e.g. Newey & McFadden 1994; Ritter 2015, for a general discussion and some results specifically dealing with finite mixture models, respectively), ˆ ˆ ˆ Cov (ϑ ) and Cov (ϑ ) can be considered consistent estimators of Cov(ϑ ) when the model 1 2 Journal of Classification (2021) 38:594–625 599 is correctly specified. By exploiting the so-called sandwich approach (see e.g. White 1980), the following robust estimator can also be employed: −1 −1 Cov (ϑ ) = I I I . (15) 3 1 2 2 ˆ ˆ It is worth noting that for the existence of the estimators Cov (ϑ ) and Cov (ϑ ) it is required 2 3 that matrix I is positive definite and well conditioned. The same requirements have to be fulfilled by matrix I in order to guarantee that Cov (ϑ ) exists. With large-scale covariance 1 1 matrices and small sample sizes, I and/or I could be ill-conditioned. These situations can 1 2 be managed by resorting to procedures able to produce improved estimators of I(ϑ ) from either I or I . For example, the algorithm by Higham (1988) computes the nearest positive 1 2 definite matrix of a given symmetric matrix by adjusting its eigenvalues. Other approaches which exploit techniques of variance reduction could also be adopted (see e.g. Schafer ¨ and Strimmer 2005). 4 Experimental Results from Simulated Datasets 4.1 Numerical Study of the Properties of the Proposed Estimators ˆ ˆ ˆ In order to evaluate the properties of Cov (ϑ ), Cov (ϑ ) and Cov (ϑ ) in comparison with the 1 2 3 estimators based on the parametric bootstrap and the approach implemented in flexCWM, five Monte Carlo studies have been performed. In the first study, the artificial datasets have been generated under a model defined by Eqs. 2–3 where G = 2, q = 1and p = 2. As far as the model parameters are concerned, the following values have been employed: π = 0.7, π = 0.3,  = 1.5,  = 1, B = (5, 2, 2), B = (1, −2, −2), μ = (−2, −2), 2 Y Y 1 2 1 2 X 1.0 0.2 1.0 0.4 μ = (2, 2),  = ,  = . Such values have been chosen so as to X X X 1 0.2 1.0 2 0.4 1.0 produce two well-separated groups of observations (see the upper panel of Fig. 1, with the pairwise scatterplots of the variables X , X and Y for a sample of size I = 500 generated 1 2 1 as just described). In this way, problems of label switching across simulations are less likely to occur. Furthermore, the ML estimates of ϑ are expected to be accurate enough to let the analysis be focused on the different ways of estimating the standard error of ϑ.Usingthese parameter values, R = 10, 000 datasets (of size I) have been generated as follows: 1. For the rth dataset (r = 1,...,R), a sample of size I is drawn from the standard p-dimensional normal distribution; this gives the vectors  ,...,  ; 1r Ir 2. For the rth dataset (r = 1,...,R), a sample of size I is drawn from the standard q-dimensional normal distribution; this gives the vectors η ,..., η ; 1r Ir 3. For the rth dataset (r = 1,...,R), asampleofsize I is drawn from the Bernoulli distribution with parameter π ; this produces the 0-1 vector z = (z ,...,z ) ; 1 r 1r Ir 4. For the ith observation (i = 1,...,I)ofthe rth dataset, vectors x and y are obtained ir ir as follows: x = μ + A  , y = B x + A η if z = 1, ir X ir ir ir Y ir X ir 1 1 1 1 = μ + A  , y = B x + A η if z = 0, ir X ir ir ir Y ir X 2 2 2 ir where A and A are matrices obtained from the spectral decompositions of  and X Y X g g g , respectively. Such matrices are constructed such that A A =  , A A = Y X X Y g g g g X Y g g ,for g = 1, 2. g 600 Journal of Classification (2021) 38:594–625 Fig. 1 Pairwise scatterplots of X , X and Y forfour samplesofsize I = 500 generated in the first four 1 2 1 studies. Upper and lower panels refer to the first and fourth studies, respectively; intermediate panels refer to the second and third ones. Black circles and red triangles correspond to g = 1and g = 2, respectively In the second study, the datasets have been obtained through the same procedure used in the first one except from the computation of vectors  and η . Namely, a sample of size I ·p is ir ir drawn from the uniform distribution in the interval (0,1) for the rth dataset (r = 1,...,R); ∗ ∗ this produces a vector  , whose elements are transformed as follows:  = 12( − jr r jr 0.5), j = 1,...,I · p; the vector  resulting from this transformation has zero mean and unit variance; partitioning  into Ip−dimensional vectors leads to  ,...,  .Thesame r 1r Ir process has been applied to obtain vectors η ,..., η . The second panel of Fig. 1 provides 1r Ir Journal of Classification (2021) 38:594–625 601 the pairwise scatterplots of X , X and Y for a sample of size I = 500 from this second 1 2 1 study. In the third and fourth studies, the datasets have been generated as in the first and sec- ond studies, respectively, but using the following model parameters: μ = (−1, −1) , μ = (1, 1) . This change in the values of μ and μ leads to overlapping groups of X X X 2 1 2 observations (see the pairwise scatterplots of X , X and Y for samples of size I = 500 1 2 1 in third and fourth panels of Fig. 1). The total number of model parameters in the first four studies is 19. The fifth study has been carried out with the same settings of the first study but with p = 8 explanatory variables. The model parameters pertaining to X which have been employed to generate the datasets are as follows: μ =−2 · 1 , μ = 2 · 1 , V(X | ) = 1 ∀j for X 8 X 8 j g 1 2 |j −h| |j −h| g = 1, 2, Cov(X ,X | ) = 1 − ∀j = h,Cov(X ,X | ) = 1 − ∀j = h. j h 1 j h 2 8 4 As far as the effects of the regressors on Y are concerned, they have been set as follows: B = (5, μ ) , B = (1, μ ) . In this latter study, the total number of model parameters 1 2 X X 2 1 is 109. In all studies, Monte Carlo experiments have been performed with two different sample sizes: I = 250, 500 in the first four studies, I = 300, 500 in the last study. In all experi- ments, it has been assumed that the rth dataset {(x , y ),...,(x , y )} is generated from 1r 1r Ir Ir a model defined by Eqs. 2–3 with G = 2. Thus, the maximum likelihood estimate ϑ of ϑ has been computed for r = 1,...,R under such an assumption. Parameter estimation has been carried out through the EM algorithm as implemented in the function cwm of the package flexCWM. As far as the initialisation of the parameters is concerned, an option has been employed, which executes 5 independent runs of the k-means algorithm and picks the solution maximising the observed-data log-likelihood among these runs. The maximum number of iterations of the EM algorithm has been set equal to 400. A convergence crite- −6 rion based on the Aitken acceleration has been used, with a threshold  = 10 (for further details, see Mazza et al. 2018). The R independent estimates of ϑ are used to approximate the true distribution of ϑ and, in particular, the true standard errors of all the elements of ϑ. Estimates of these standard errors have been computed using the three information-based estimators and the parametric bootstrap for R = 2000 datasets obtained as described above. For each bootstrap estimate, 100 bootstrap samples have been employed. For the ML estimates of the model intercepts and regression coefficients, the standard errors estimated by the function cwm of the pack- age flexCWM using the approach illustrated in the introduction have been included in the comparison. The performances of these strategies have been evaluated on the basis of an estimate of their biases and root mean squared errors (RMSE). A comparative evaluation of such approaches has been carried out also through the coverage probabilities (CP) of 90% and 95% confidence intervals based on the examined standard errors’ estimates and the standard normal quantiles. In this latter comparison, the attention is focused on the expected mean values of the regressors (i.e. μ and μ ) and the regression coefficients (all the X X 1 2 entries in the first column of B and B except the first one). 1 2 Tables 1, 2, 3 and 4 contain the biases and RMSEs for the first four Monte Carlo studies with samples of size 250. The same information for the last study and the sample size I = 300 can be found in Table 5. The corresponding values for the CPs are summarised in Tables 6, 7, 8, 9 and 10. In all the tables, the results obtained using the function cwm of the package flexCWM, the bootstrap and the estimators defined in Eqs. 13–15 are denoted as cwm, Boot, C , C and C , respectively. From now on, B [j, k] is used to denote the 1 2 3 g element on the j-th row and k-th column of matrix B ; μ [j ] represents the j-th element g g 602 Journal of Classification (2021) 38:594–625 Table 1 Biases and root mean squared errors of the examined standard error estimators of ϑ in the first study, I = 250 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π 0.031 0.050 0.038 0.040 0.217 0.096 0.090 0.092 B [1, 1]−5.048 −2.829 −4.068 −2.958 −0.602 5.301 3.651 4.596 3.380 3.156 B [2, 1]−1.766 −1.013 −1.426 −1.076 −0.256 1.874 1.335 1.641 1.232 1.088 B [3, 1]−1.724 −0.958 −1.373 −1.032 −0.215 1.821 1.274 1.577 1.175 1.078 −0.693 4.722 4.331 4.473 0.833 4.823 4.340 4.601 μ [1]−0.048 0.168 −0.047 −0.039 0.687 0.511 0.448 0.448 μ [2] 0.010 0.240 0.026 0.033 0.697 0.521 0.432 0.434 [1, 1]−0.146 0.527 −0.078 −0.170 1.424 1.684 1.199 1.533 [1, 2]−0.085 0.408 −0.023 −0.070 1.419 1.650 1.197 1.525 [2, 2]−0.001 0.688 0.046 −0.080 1.417 1.741 1.198 1.525 B [1, 1]−15.263 0.223 3.462 −0.630 −1.199 15.395 4.171 6.374 3.415 5.514 B [2, 1]−6.046 0.301 1.871 −0.103 −0.445 6.105 1.793 2.934 1.416 2.255 B [3, 1]−6.226 0.085 1.620 −0.300 −0.617 6.280 1.742 2.760 1.383 2.208 0.040 10.541 8.444 8.227 0.781 10.828 8.486 8.545 μ [1]−0.080 0.632 −0.139 −0.076 1.480 1.567 1.244 1.323 μ [2]−0.104 0.624 −0.145 −0.087 1.496 1.574 1.256 1.327 [1, 1]−0.442 1.759 −0.419 −0.702 3.516 4.710 3.360 4.257 [1, 2]−0.468 1.125 −0.455 −0.548 3.519 4.512 3.364 4.234 [2, 2]−0.341 1.853 −0.352 −0.676 3.504 4.746 3.352 4.253 All entries have been multiplied by 100 Journal of Classification (2021) 38:594–625 603 Table 2 Biases and root mean squared errors of the examined standard error estimators of ϑ in the second study, I = 250 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π 0.006 0.024 0.001 0.001 0.221 0.087 0.082 0.082 B [1, 1]−4.220 −1.959 −4.021 −2.477 −0.184 4.367 2.827 4.231 2.721 1.737 B [2, 1]−1.446 −0.683 −1.397 −0.813 0.050 1.514 1.034 1.499 0.935 0.701 B [3, 1]−1.639 −0.863 −1.562 −1.008 −0.174 1.700 1.160 1.652 1.103 0.703 1.722 12.779 6.679 3.074 1.780 12.815 6.683 3.094 μ [1]−0.130 0.019 −0.142 −0.141 0.647 0.385 0.392 0.392 μ [2] 0.051 0.214 0.051 0.052 0.664 0.458 0.384 0.384 [1, 1] 1.730 5.682 1.712 −0.128 2.126 6.030 1.967 0.678 [1, 2] 2.793 7.498 2.761 −0.020 3.054 7.764 2.926 0.666 [2, 2] 1.709 5.689 1.719 −0.123 2.109 6.036 1.973 0.677 B [1, 1]−13.882 1.521 2.659 −0.015 −0.617 13.951 4.027 4.590 2.618 3.525 B [2, 1]−6.010 0.290 0.950 −0.184 −0.413 6.042 1.595 1.918 1.187 1.622 B [3, 1]−6.090 0.247 0.904 −0.255 −0.508 6.120 1.570 1.909 1.185 1.645 2.767 21.632 11.065 5.659 2.868 21.835 11.095 5.791 μ [1] 0.148 0.589 −0.041 −0.043 1.343 1.299 0.986 0.984 μ [2] 0.164 0.585 −0.046 −0.048 1.320 1.270 0.964 0.964 [1, 1] 3.549 10.810 3.160 −0.135 4.532 12.011 4.027 1.765 [1, 2] 4.838 12.754 4.430 −0.002 5.599 13.788 5.084 1.760 [2, 2] 3.518 10.708 3.151 −0.094 4.508 11.920 4.020 1.763 All entries have been multiplied by 100 604 Journal of Classification (2021) 38:594–625 Table 3 Biases and root mean squared errors of the examined standard error estimators of ϑ in the third study, I = 250 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π −0.000 0.044 0.029 0.054 0.243 0.131 0.128 0.156 B [1, 1]−3.552 −1.794 −2.556 −1.668 0.061 3.674 2.347 2.873 1.982 2.091 B [2, 1]−1.977 −1.054 −1.468 −1.061 −0.118 2.079 1.551 1.697 1.256 1.214 B [3, 1]−1.902 −0.950 −1.362 −0.985 −0.062 1.998 1.482 1.589 1.174 1.197 −0.759 4.911 4.647 5.113 0.906 5.024 4.681 5.343 μ [1]−0.042 0.196 −0.031 −0.015 0.749 0.531 0.456 0.462 μ [2] 0.005 0.258 0.030 0.043 0.753 0.544 0.446 0.453 [1, 1]−0.141 0.551 −0.064 −0.150 1.438 1.712 1.215 1.539 [1, 2]−0.061 0.448 0.005 −0.041 1.432 1.681 1.213 1.533 [2, 2] 0.027 0.736 0.081 −0.040 1.431 1.780 1.216 1.533 B [1, 1]−12.209 −0.143 1.969 −0.536 −0.266 12.282 3.176 4.327 3.009 5.175 B [2, 1]−7.068 0.144 1.863 −0.261 −0.442 7.126 2.091 3.175 1.749 2.918 B [3, 1]−7.042 0.159 1.828 −0.268 −0.449 7.096 2.064 3.101 1.675 2.758 −0.070 11.005 8.879 9.018 0.844 11.323 8.947 9.464 μ [1]−0.344 0.749 −0.195 −0.018 1.689 1.894 1.463 1.814 μ [2]−0.425 0.701 −0.271 −0.123 1.697 1.879 1.455 1.720 [1, 1]−0.779 2.310 −0.486 −0.818 3.659 5.263 3.470 4.814 [1, 2]−0.629 1.846 −0.366 −0.593 3.630 5.077 3.455 4.781 [2, 2]−0.539 2.619 −0.255 −0.664 3.616 5.406 3.445 4.791 All entries have been multiplied by 100 Journal of Classification (2021) 38:594–625 605 Table 4 Biases and root mean squared errors of the examined standard error estimators of ϑ in the fourth study, I = 250 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π −0.080 0.054 −0.014 −0.012 0.258 0.164 0.135 0.141 B [1, 1]−9.960 −8.445 −9.492 −7.788 −5.135 9.984 8.543 9.538 7.861 5.506 B [2, 1]−2.944 −2.158 −2.820 −2.069 −0.925 2.981 2.303 2.878 2.143 1.316 B [3, 1]−2.854 −2.042 −2.706 −1.984 −0.880 2.892 2.198 2.767 2.058 1.259 1.270 12.324 6.503 3.441 1.361 12.365 6.517 3.528 μ [1]−0.288 −0.040 −0.221 −0.211 0.707 0.401 0.436 0.435 μ [2]−0.195 0.064 −0.119 −0.110 0.694 0.428 0.411 0.415 [1, 1] 1.098 5.222 1.147 −0.695 1.649 5.596 1.491 0.958 [1, 2] 1.696 6.562 1.705 −1.108 2.096 6.864 1.953 1.290 [2, 2] 1.045 5.187 1.115 −0.725 1.615 5.563 1.466 0.981 B [1, 1]−13.889 −0.999 0.779 −1.230 −1.327 13.932 3.227 3.449 3.060 4.382 B [2, 1]−9.282 −1.819 −0.835 −2.155 −2.318 9.309 2.621 2.264 2.656 3.180 B [3, 1]−9.201 −1.693 −0.689 −2.053 −2.255 9.227 2.505 2.210 2.560 3.130 2.759 22.773 11.819 6.563 2.882 23.022 11.865 6.753 μ [1]−0.600 0.441 −0.375 −0.206 1.666 1.754 1.437 1.641 μ [2]−0.654 0.391 −0.425 −0.255 1.668 1.682 1.415 1.541 [1, 1] 3.047 12.617 3.316 0.036 4.361 14.157 4.448 2.594 [1, 2] 5.083 15.362 5.038 0.118 5.964 16.650 5.845 2.596 [2, 2] 2.905 12.415 3.206 −0.034 4.262 13.978 4.366 2.594 All entries have been multiplied by 100 606 Journal of Classification (2021) 38:594–625 Table 5 Biases and root mean squared errors of the examined standard error estimators of ϑ in the fifth study, I = 300 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π 0.003 0.037 0.011 0.011 0.196 0.079 0.068 0.068 B [1, 1]−4.292 −0.898 −0.286 −2.723 −0.828 4.518 19.941 2.232 3.005 2.445 B [−1, 1]−4.002 −2.017 −0.147 −2.543 −0.767 4.227 3.121 2.114 2.823 2.368 −0.403 5.809 3.905 4.025 4.250 5.891 3.910 4.115 μ 0.008 1.117 −0.014 −0.014 0.851 1.228 0.359 0.359 v( ) 0.098 1.922 0.032 −0.050 2.505 2.727 1.587 1.791 B [1, 1]−30.889 2.842 54.228 −2.217 −4.911 31.112 20.522 61.001 6.438 10.367 B [−1, 1]−10.633 0.797 18.699 −0.645 −1.557 10.714 3.544 20.958 2.297 3.645 0.056 19.834 7.571 7.339 3.695 20.658 7.603 7.592 μ −0.037 6.633 −0.141 −0.141 1.848 7.429 0.942 0.942 v( ) 0.010 10.143 −0.093 −0.344 3.884 12.418 3.120 3.485 All entries have been multiplied by 100 Journal of Classification (2021) 38:594–625 607 Table 6 Coverage probability of the 90% and 95% confidence intervals for μ , B [2, 1] and B [3, 1] based on the standard error estimators of ϑ in the first study, I = 250 g g CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ [1] 0.896 0.909 0.899 0.900 0.948 0.955 0.949 0.949 μ [2] 0.901 0.913 0.905 0.905 0.946 0.956 0.951 0.951 μ [1] 0.886 0.909 0.886 0.889 0.940 0.957 0.941 0.941 μ [2] 0.885 0.903 0.885 0.883 0.941 0.959 0.941 0.941 B [2, 1] 0.818 0.856 0.838 0.859 0.889 0.890 0.918 0.908 0.923 0.941 B [3, 1] 0.822 0.858 0.843 0.861 0.893 0.897 0.920 0.907 0.921 0.942 B [2, 1] 0.616 0.914 0.938 0.900 0.888 0.705 0.961 0.972 0.954 0.941 B [3, 1] 0.627 0.906 0.934 0.895 0.875 0.705 0.953 0.971 0.949 0.929 Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125) 608 Journal of Classification (2021) 38:594–625 Table 7 Coverage probability of the 90% and 95% confidence intervals for μ , B [2, 1] and B [3, 1] based on the standard error estimators of ϑ in the second study, I = 250 g g CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ [1] 0.888 0.895 0.890 0.890 0.944 0.949 0.943 0.943 μ [2] 0.913 0.920 0.910 0.910 0.956 0.964 0.958 0.958 μ [1] 0.900 0.919 0.900 0.898 0.942 0.958 0.943 0.942 μ [2] 0.893 0.909 0.893 0.893 0.939 0.950 0.938 0.938 B [2, 1] 0.839 0.872 0.846 0.872 0.903 0.904 0.931 0.908 0.935 0.951 B [3, 1] 0.835 0.863 0.834 0.866 0.895 0.899 0.920 0.903 0.919 0.941 B [2, 1] 0.630 0.905 0.930 0.896 0.885 0.709 0.941 0.970 0.937 0.930 B [3, 1] 0.618 0.905 0.932 0.891 0.876 0.696 0.954 0.977 0.946 0.934 Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125) Journal of Classification (2021) 38:594–625 609 Table 8 Coverage probability of the 90% and 95% confidence intervals for μ , B [2, 1] and B [3, 1] based on the standard error estimators of ϑ in the third study, I = 250 g g CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ [1] 0.898 0.911 0.899 0.899 0.944 0.954 0.947 0.946 μ [2] 0.897 0.912 0.902 0.903 0.947 0.954 0.948 0.948 μ [1] 0.881 0.911 0.887 0.885 0.932 0.951 0.935 0.938 μ [2] 0.878 0.910 0.885 0.887 0.939 0.954 0.936 0.941 B [2, 1] 0.805 0.853 0.837 0.857 0.892 0.882 0.917 0.905 0.921 0.949 B [3, 1] 0.821 0.871 0.846 0.870 0.894 0.892 0.920 0.914 0.924 0.943 B [2, 1] 0.574 0.906 0.939 0.898 0.882 0.665 0.952 0.972 0.947 0.935 B [3, 1] 0.590 0.904 0.938 0.898 0.880 0.685 0.953 0.972 0.950 0.942 Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125) 610 Journal of Classification (2021) 38:594–625 Table 9 Coverage probability of the 90% and 95% confidence intervals for μ , B [2, 1] and B [3, 1] based on the standard error estimators of ϑ in the fourth study, I = 250 g g CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ [1] 0.884 0.901 0.892 0.892 0.938 0.948 0.940 0.940 μ [2] 0.894 0.910 0.903 0.904 0.947 0.959 0.949 0.949 μ [1] 0.866 0.892 0.876 0.885 0.922 0.940 0.928 0.941 μ [2] 0.853 0.886 0.866 0.871 0.920 0.937 0.926 0.932 B [2, 1] 0.820 0.857 0.828 0.866 0.903 0.895 0.920 0.903 0.923 0.948 B [3, 1] 0.810 0.848 0.815 0.854 0.893 0.882 0.904 0.890 0.913 0.940 B [2, 1] 0.580 0.894 0.925 0.887 0.881 0.660 0.944 0.973 0.938 0.922 B [3, 1] 0.566 0.898 0.930 0.891 0.874 0.648 0.947 0.976 0.945 0.926 Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125) Journal of Classification (2021) 38:594–625 611 Table 10 Coverage probability of the 90% and 95% confidence intervals for μ and B [−1, 1] based on the standard error estimators of ϑ in the fifth study, I = 300 CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ 0.891 0.939 0.893 0.893 0.944 0.975 0.946 0.946 μ 0.897 0.988 0.898 0.898 0.947 0.997 0.947 0.947 B [−1, 1] 0.831 0.869 0.898 0.862 0.888 0.897 0.924 0.947 0.920 0.940 B [−1, 1] 0.616 0.906 0.995 0.889 0.869 0.699 0.955 0.999 0.945 0.926 2 612 Journal of Classification (2021) 38:594–625 of μ . Due to the large number of parameters pertaining to the regressors in the fifth study, Table 5 provides the mean values of the biases and RMSEs of the ML estimates over the elements of each of the following vectors of model parameters: μ ,v( ), B [−1, 1] for X g X g g = 1, 2, where B [−1, 1] is the vector obtained by dropping the first element from the first column of B . Thus, B [−1, 1] comprises the regression coefficients of the p covariates on g g Y given  . In a similar way, Table 10 contains the mean values of the CPs of 90% and 1 g 95% confidence intervals over μ and B [−1, 1] for g = 1, 2. X g Under the experimental conditions considered in the first Monte Carlo study, biases are generally small for all the estimated standard errors (see Table 1). The overall best perfor- mance in terms of accuracy seems to be achieved by means of the estimator Cov (ϑ ).The bootstrap approach appears to provide the most precise estimates of the standard errors of . The sandwich method is slightly more accurate than the bootstrap approach in esti- mating the standard errors of the ML estimates of the expected values of the regressors; the opposite result holds true when dealing with the estimation of the standard errors of  . The highest root mean square errors are mostly obtained using either the function cwm of the package flexCWM or the estimator Cov (ϑ ), which are therefore not recommended. These results confirm both the best performance of an estimator based on the Hessian and the poor performance of an estimator based on the gradient vector under correctly speci- fied models registered in a study dealing with multivariate normal mixture models (Boldea & Magnus, 2009). It is also worth noting that the accuracy of the approach implemented in flexCWM sharply deteriorates when the ML estimates of the intercept and regression coef- ficients of the second group (B ) are considered. As far as the effective confidence levels for the parameters μ , B [2, 1] and B [3, 1] are concerned (Table 6), the obtained results are g g generally similar to one another and quite close to the nominal confidence levels for all the examined methods except the one implemented in flexCWM. With this latter method, the effective confidence levels for the regression coefficients clearly deviate from the nominal ones, especially in the second group of observations. All these results have been employed to run tests for the hypotheses of equality between effective and nominal confidence levels. This task has been carried out through asymptotic two-tailed normal tests for a propor- tion at a 0.00125 significance level (the Bonferroni correction 0.01/8 has been adopted to account for multiple tests performed for each estimation method and each nominal con- fidence level). All the effective CPs of the confidence intervals for the model regression coefficients obtained using both the estimator based on the gradient and the approach imple- mented in flexCWM appear to be significantly different from the corresponding nominal ones (see the entries in italics in Table 6). As far as the results from the bootstrap-based and Hessian-based estimators are concerned, the null hypothesis of equality between effec- tive and nominal confidence levels should be rejected for two regression coefficients at both examined confidence levels; the same null hypothesis has to be rejected for only one regression coefficient when using the the sandwich approach. In the second Monte Carlo study, a substantial increase in the biases of the estimated ˆ ˆ standard errors of  and  has been registered with all the examined estimators except X Y g g for the sandwich method (Table 2); this latter method is also the most accurate. Using a Gaussian cluster-weighted model for the analysis of datasets generated under a uniform cluster-weighted model seems to have a little impact on the confidence intervals for both the expected values of the regressors and the regression coefficients (Table 7). When the data are obtained from two overlapping groups of observations drawn from Gaussian distributions (third study), the resulting biases and RMSEs (see Table 3) are quite similar to the ones from the first study; the main effect of the reduction in the separation Journal of Classification (2021) 38:594–625 613 of the two groups is a general slight increase in the RMSEs with all the examined meth- ods. The best accuracy is still achieved by the estimator Cov (ϑ ) for all model parameters, with the exception of  , whose standard errors are more accurately estimated by the bootstrap approach. As far as the effect of this reduction on the effective CPs of the confi- dence intervals is concerned (Table 8), the most remarkable result is an increase in the gap between nominal and effective CPs associated with the use of the approach implemented in flexCWM for the regression coefficients in the second group of observations. When the two overlapping groups of observations are generated from the uniform dis- tribution (fourth study), both biases and RMSEs of B results remarkably increased with all the examined methods (see Table 4). However, it is worth noting that the lowest of such increases has always been associated with the use of Cov (ϑ ), which is also the estimator with the best accuracy for the ML estimate of variances and covariances of the regressors in both groups and the majority of the model parameter estimates. Furthermore, the sandwich estimator shows the best performance in terms of effective CPs that are not significantly different from the nominal ones (Table 9). In the presence of datasets generated under a Gaussian cluster-weighted model with p = 8 regressors (fifth study), the most remarkable effects of a larger number of covariates on the performance of the examined estimators with samples of size I = 300 appear to be (see Table 5 in comparison with Table 1) a sharp decrease in the accuracy of the estimates of the standard errors produced by the method based on the gradient for all the parameters of the second group of observations and a deterioration in the performances of the bootstrap approach in reference to all the parameters of the conditional distribution of Y given X and ˆ ˆ , especially B [1, 1],for g = 1, 2. As far as the methods Cov (ϑ ) and Cov (ϑ ) are g g 2 3 concerned, biases and RMSEs result to be quite similar to the ones from the first study for all parameter estimates except the intercepts for both groups. Thus, the best overall accuracy is still achieved by the estimator Cov (ϑ ). The results from the five studies with samples of size 500 (see Tables A–J in the separate document with the supplementary materials) are generally in line with those just described. It is worth noting that using a larger sample size leads to a reduction in the RMSEs for all the examined estimators. With datasets containing two separated groups of observa- tions (first and second studies), all the effective CPs of the confidence intervals obtained using the sandwich approach appear to be not significantly different from the correspond- ing nominal ones. When overlapping groups are considered (third and fourth studies), the estimator Cov (ϑ ) has produced confidence intervals whose effective levels are the closest to the nominal ones. Thus, overall, the obtained results show the robustness of the sandwich method. 4.2 A Comparison with Some Estimators Under Normal Mixtures As already mentioned in the “Introduction”, three information-based estimators of the covariance matrix of the ML estimator for finite normal mixture models were developed by Boldea and Magnus (2009): two of them are based on the gradient vector and the Hessian matrix of the incomplete log-likelihood under a normal mixture model; the third estimator exploits the sandwich approach. From now on, these three estimators will be denoted as BM , BM and BM , respectively. Furthermore, it has been already highlighted that finite 1 2 3 mixtures of Gaussian distributions and linear Gaussian cluster-weighted models define the same family of probability distributions (Ingrassia et al., 2012). Thus, it could be interesting to obtain an evaluation of the relationships between the estimators described in Section 3 and the estimators developed by Boldea and Magnus (2009). This task has been numerically 614 Journal of Classification (2021) 38:594–625 performed by means of two additional simulation studies. The model employed to generate the simulated datasets is: f(x, y; ψ ) = π φ (x, y; μ ,  ), (16) g 2 g g g=1 where p = q = 1, G = 2, ψ = (π , ψ , ψ ) , ψ = (μ , v( ) ) , ψ = (μ , v( ) ) , 1 1 2 1 2 1 1 1 2 1.0 0.0 2.0 1.0 π = 0.7, π = 0.3, μ = (0, 0), μ = (, ),  = ,  = .The two 1 2 1 2 1 2 0.0 1.0 1.0 2.0 studies have been carried out using 5 and 10 as values of  so as to obtain two different levels of separation between μ and μ . In each study, 100 datasets have been generated for 1 2 each of two sample sizes: I = 100, 250. The R packages mclust (Scrucca et al., 2016)and flexCWM have been employed to compute the ML estimates of ψ in model (16) with G = 2 and the ML estimates of ϑ in model (2) with G = 2, respectively. Furthermore, the standard ˆ ˆ errors of ϑ have been estimated using Eqs. 13–15. As far as ψ is concerned, estimated standard errors have been computed according to the solutions described in Boldea and Magnus (2009). Models (16)and (2) are characterised by the same value of π . Furthermore, as illus- trated by Ingrassia et al. (2012), some elements in ϑ coincide with some elements in ψ; namely, μ [1]= μ [1],  [1, 1]=  [1, 1], μ [1]= μ [1],  [1, 1]= X 1 X X 1 1 X 2 2 1 2 [1, 1]. Thus, the comparison between the estimators described in Section 3 and the estimators developed by Boldea and Magnus (2009) has been focused on the follow- ing two subvectors of model parameters: ϑ = (π , μ [1],  [1, 1], μ [1],  [1, 1]), 1 X X X X 1 2 1 2 ψ = (π , μ [1],  [1, 1], μ [1],  [1, 1]).Let se (ϑ [j ]) be the standard error of the 1 1 2 m r 1 2 j-th element of ϑ computed using the estimator C and the r-th dataset. Furthermore, ˆ ˆ ¯ ¯ let se (ψ [j ]) be the standard error of the j-th element of ψ obtained from the estima- tor BM . In order to compare such estimated standard errors, the following differences ˆ ˆ ¯ ¯ have been computed: d (j ) = se (ψ [j ]) − se (ϑ [j ]), j = 1,..., 5, m = 1, 2, 3, rm m r m r r = 1,..., 100. The results obtained for  = 5and  = 10 with samples of size I = 100 are graphically represented in Figs. 2 and 3, respectively. The distributions of the differ- ences d (j ) for almost all the examined parameters appear to be centred around 0 and rm highly homogeneous, thus highlighting a general equivalence between the standard errors resulting from the two models. This result holds true especially for the estimators based on the Hessian matrix and the sandwich approach (m = 2, 3) when the separation between the two groups is larger ( = 10), and for the estimator based on the gradient vector (m = 1) when the separation is low ( = 5). However, it is also worth noting that with both levels of separation the differences in the standard errors of π ˆ computed using the two estimators based on the gradient vector show a median value slightly below 0 and a distribution with negative skewness. Similar results have been obtained with samples of size I = 250 (see Figures A and B in the supplementary material). 5 Analysing Regional Tourism Data in Italy Similar to other studies (see e.g. Cellini & Cuccia 2013), the analysis summarised here aims at evaluating the link between tourism flows and attendance at museums and mon- uments, with a focus on two Italian regions: Emilia Romagna (ER) and Veneto (Ve). For both regions, three variables have been examined: tourist arrivals (denoted Arriv), Journal of Classification (2021) 38:594–625 615 Fig. 2 Boxplots of the differences d (j ) with samples of size I = 100 and  = 5 rm tourist overnights (Overn) and visits to State museums, monuments and museum networks (Visit). Measurements for such variables are available with a monthly frequency over the period January 1999 to December 2017. The source for Visit is the website of the Italian Ministry of Cultural Heritage (http://www.statistica.beniculturali.it). Data on Arriv and Fig. 3 Boxplots of the differences d (j ) with samples of size I = 100 and  = 10 rm 616 Journal of Classification (2021) 38:594–625 Table 11 Maximised Gl(ϑ)BIC log-likelihood and Bayesian information criterion of eight 1 −8596.886 17,340.36 cluster-weighted models fitted to the tourism dataset 2 −8056.517 16,411.65 3 −7803.084 16,056.80 4 −7692.981 15,988.62 5 −7593.141 15,940.96 6 −7539.561 15,985.82 7 −7456.020 15,970.76 8 −7401.414 16,013.57 1 2 Overn have been obtained from the websites of Emilia-Romagna and Veneto regional governments. Overall, the analysed dataset is composed of I = 228 monthly observations for six variables. Because of the goal of the analysis, these variables have been partitioned as follows: Y = (Visit ER, Visit Ve) , X = (Arriv ER, Overn ER, Arriv Ve, Overn Ve) . The analysed data are expressed in thousands. Models obtained from Eqs. 2–3 have been fitted to the dataset for G from 1 to 8. To this end, a function written for the R software environment which implements an EM algo- rithm for the ML estimation of multivariate Gaussian cluster-weighted linear models has been employed. In order to prevent problems due to the presence of singular or nearly singular matrices during the iterations, all covariance matrices have been required to have −20 eigenvalues greater than 10 ; furthermore, the ratio between the smallest and the largest −10 eigenvalues of such matrices is required to be not lower than 10 . Model parameters are initialised according to a two-step strategy. In the first step, the joint distribution of the covariates and responses is estimated using a mixture of G Gaussian models through the mclust package. This produces the required starting values for the G weights, mean vec- tors and covariance matrices for the predictors. In the second step, the initialisation of B and  is obtained from the fitting of a Gaussian linear regression model to the sample of observations that have been assigned to the gth component of the mixture model esti- mated in the first step. The R package systemfit (Henningsen & Hamann, 2007)has been exploited to perform this task. The maximum number of iterations of the EM algo- rithm has been set equal to 500. A convergence criterion based on the Aitken acceleration −8 has been used, with a threshold  = 10 . Table 11 shows the values of the maximised log-likelihood (l(ϑ )) and the Bayesian information criterion (BIC) (Schwarz, 1978) for the eight fitted models, where BI C = 2l(ϑ ) − npar ln(I ),and npar denotes the number of model parameters. According to this criterion, the model with the best trade-off between the fit and complexity seems to be the one with G = 5 clusters of months. With this model, there is a perfect correspondence between some clusters and some months (see Table 12): cluster 2 only contains observa- tions in April and May; cluster 4 only contains observations in June, July and August; cluster 5 only contains observations in January, February, November and December. As far as the remaining months are concerned, observations in September for the years 2005–2017 have been assigned to cluster 1; cluster 3 comprises the remaining observations in Septem- ber together with all the observations in March and October. The obtained cluster structure https://statistica.regione.emilia-romagna.it/turismo https://www.veneto.eu/web/area-operatori/statistiche Journal of Classification (2021) 38:594–625 617 Table 12 Cross-classification of the observations from the tourism dataset, based on their variable time identified by month and maximum posterior probability estimated from the cluster-weighted model with G = 5 g Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 10 0 0 0 0 0 0 0 13 0 0 0 2 0 0 0 19 19 0 0 0 0 0 0 0 30 0 19 0 0 0 0 0 6 19 0 0 4 0 0 0 0 0 19 19 19 0 0 0 0 519 19 0 0 0 0 0 0 0 0 19 19 clearly reflects seasonal patterns characterising tourism flows. Observations in cluster 4 (June, July and August) are characterized by the highest mean values of tourist arrivals and overnights in both regions, followed by those in cluster 1 (September 2005–2017), cluster 2 (April and May), cluster 3 (March, October and September 1999–2004) and cluster 4 (from November to February) (see the μ ˆ [j ] in Table 13). In all clusters, Veneto is characterised by mean values of both regressors which are always higher than those of Emilia-Romagna. During the examined period of time, there seems to be heterogeneity also on the effects of the tourist arrivals and overnights on the number of visits in both regions (see the B [j, k] in Table 13). Such effects do not result to be always positive. Furthermore, an increase in the tourist arrivals and overnights in one region does not necessarily have a positive impact on the number of visits to State museums, monuments and museum networks of the other region. Estimates of the standard errors for the parameter estimates of the selected cluster- weighted model have been computed by the boostrap approach, using 100 bootstrap samples generated from the selected model. Furthermore, Cov(ϑ ) has been estimated by resorting to Eqs. 13–15. The algorithm by Higham (1988), as implemented in the R package corpcor, Table 13 Estimated π , μ and B of the cluster-weighted model with G = 5 fitted to the tourism dataset g g g 12 34 5 π ˆ 0.057 0.168 0.192 0.250 0.333 μ [1] 848.1 766.9 515.6 1327.0 338.5 μ [2] 3654.3 2158.4 1573.2 7765.1 864.7 μ [3] 1634.4 1250.6 908.2 2095.6 561.6 μ [4] 6867.7 3906.2 2852.5 11420.1 1587.0 B [1, 1] 47.771 197.894 5.204 −0.408 −2.183 B [2, 1]−0.173 0.229 0.447 0.138 0.171 B [3, 1] 0.021 −0.037 −0.089 0.008 −0.020 B [4, 1] 0.141 −0.124 −0.215 −0.061 0.013 B [5, 1]−0.018 0.007 0.063 −0.005 −0.008 B [1, 2] 95.694 85.555 23.956 36.059 −19.050 B [2, 2]−0.121 0.002 0.181 0.132 −0.425 B [3, 2] 0.022 0.025 −0.007 −0.010 0.141 B [4, 2] 0.114 0.071 0.020 −0.053 0.172 B [5, 2]−0.024 −0.029 −0.013 0.005 −0.004 g 618 Journal of Classification (2021) 38:594–625 Table 14 B [j, k], estimated standard errors, z values and p-values obtained using the bootstrap (columns g gj k 5–7) and Eq. 13 (columns 8–10) ˆ ˆ ˆ gj k B [j, k] se(B [j, k]) z p-value se(B [j, k]) z p-value g g gj k g gj k 12 1 −0.173 0.318 −0.545 0.586 2.090 −0.083 0.934 1 3 1 0.021 0.078 0.262 0.793 0.305 0.067 0.946 1 4 1 0.141 0.172 0.820 0.412 0.648 0.217 0.828 15 1 −0.018 0.055 −0.328 0.743 0.133 −0.134 0.893 12 2 −0.121 0.330 −0.366 0.715 1.472 −0.082 0.935 1 3 2 0.022 0.080 0.278 0.781 0.196 0.113 0.910 1 4 2 0.114 0.175 0.652 0.515 0.474 0.240 0.810 15 2 −0.024 0.045 −0.538 0.590 0.053 −0.457 0.648 2 2 1 0.229 0.167 1.373 0.170 0.328 0.698 0.485 23 1 −0.037 0.044 −0.827 0.408 0.101 −0.364 0.716 24 1 −0.124 0.111 −1.121 0.262 0.196 −0.633 0.527 2 5 1 0.007 0.030 0.216 0.829 0.052 0.125 0.901 2 2 2 0.002 0.178 0.009 0.993 0.136 0.012 0.991 2 3 2 0.025 0.047 0.532 0.595 0.029 0.848 0.396 2 4 2 0.071 0.071 1.002 0.316 0.072 0.993 0.321 25 2 −0.029 0.021 −1.395 0.163 0.016 −1.868 0.062 3 2 1 0.447 0.148 3.023 0.003 0.168 2.663 0.008 33 1 −0.089 0.040 −2.224 0.026 0.037 −2.437 0.015 34 1 −0.215 0.094 −2.282 0.022 0.083 −2.570 0.010 3 5 1 0.063 0.028 2.299 0.022 0.028 2.270 0.023 3 2 2 0.181 0.195 0.927 0.354 0.145 1.251 0.211 33 2 −0.007 0.047 −0.154 0.878 0.036 −0.204 0.838 3 4 2 0.020 0.074 0.265 0.791 0.076 0.260 0.795 35 2 −0.013 0.018 −0.710 0.478 0.028 −0.456 0.648 4 2 1 0.138 0.088 1.565 0.118 0.058 2.384 0.017 4 3 1 0.008 0.024 0.343 0.732 0.011 0.778 0.436 44 1 −0.061 0.055 −1.100 0.271 0.035 −1.748 0.080 45 1 −0.005 0.013 −0.363 0.717 0.006 −0.737 0.461 4 2 2 0.132 0.200 0.657 0.511 0.032 4.130 0.000 43 2 −0.010 0.054 −0.195 0.846 0.008 −1.350 0.177 44 2 −0.053 0.084 −0.635 0.526 0.018 −2.941 0.003 4 5 2 0.005 0.015 0.337 0.736 0.004 1.155 0.248 5 2 1 0.171 0.082 2.078 0.038 0.069 2.489 0.013 53 1 −0.019 0.019 −1.017 0.309 0.018 −1.056 0.291 5 4 1 0.013 0.049 0.265 0.791 0.033 0.385 0.701 55 1 −0.008 0.008 −0.972 0.331 0.012 −0.647 0.517 52 2 −0.425 0.260 −1.635 0.102 0.108 −3.925 0.000 5 3 2 0.141 0.070 2.026 0.043 0.029 4.919 0.000 5 4 2 0.171 0.094 1.824 0.068 0.049 3.472 0.001 55 2 −0.004 0.013 −0.340 0.734 0.016 −0.280 0.779 Journal of Classification (2021) 38:594–625 619 Table 15 B [j, k], estimated standard errors, z values and p-values obtained using Eqs. 14 (columns 5–7) g gj k and 15 (columns 8–10) ˆ ˆ ˆ gj k B [j, k] se(B [j, k]) z p-value se(B [j, k]) z p-value g g gj k g gj k 12 1 −0.173 0.141 −1.234 0.217 0.157 −1.106 0.269 1 3 1 0.021 0.029 0.712 0.477 0.026 0.794 0.427 1 4 1 0.141 0.061 2.319 0.020 0.070 2.012 0.044 15 1 −0.018 0.015 −1.191 0.234 0.012 −1.478 0.139 12 2 −0.121 0.077 −1.570 0.116 0.070 −1.719 0.086 1 3 2 0.022 0.016 1.404 0.160 0.013 1.761 0.078 1 4 2 0.114 0.033 3.442 0.001 0.029 3.909 0.000 15 2 −0.024 0.008 −2.975 0.003 0.006 −4.254 0.000 2 2 1 0.229 0.162 1.411 0.158 0.148 1.545 0.122 23 1 −0.037 0.045 −0.817 0.414 0.037 −0.986 0.324 24 1 −0.124 0.098 −1.261 0.207 0.091 −1.366 0.172 2 5 1 0.007 0.024 0.268 0.789 0.022 0.302 0.763 2 2 2 0.002 0.090 0.017 0.986 0.091 0.017 0.986 2 3 2 0.025 0.025 1.004 0.315 0.027 0.916 0.360 2 4 2 0.071 0.055 1.302 0.193 0.057 1.254 0.210 25 2 −0.029 0.014 −2.146 0.032 0.015 −1.944 0.052 3 2 1 0.447 0.107 4.164 0.000 0.125 3.570 0.000 33 1 −0.089 0.027 −3.245 0.001 0.029 −3.110 0.002 34 1 −0.215 0.058 −3.691 0.000 0.060 −3.568 0.000 3 5 1 0.063 0.022 2.889 0.004 0.022 2.865 0.004 3 2 2 0.181 0.094 1.927 0.054 0.108 1.683 0.092 33 2 −0.007 0.025 −0.296 0.767 0.024 −0.306 0.759 3 4 2 0.020 0.052 0.381 0.704 0.052 0.379 0.705 35 2 −0.013 0.020 −0.660 0.509 0.019 −0.679 0.497 4 2 1 0.138 0.034 3.994 0.000 0.025 5.423 0.000 4 3 1 0.008 0.007 1.135 0.256 0.007 1.104 0.270 44 1 −0.061 0.021 −2.858 0.004 0.016 −3.751 0.000 45 1 −0.005 0.004 −1.051 0.293 0.004 −1.056 0.291 4 2 2 0.132 0.026 5.083 0.000 0.027 4.954 0.000 43 2 −0.010 0.005 −1.914 0.056 0.005 −2.015 0.044 44 2 −0.053 0.016 −3.346 0.001 0.017 −3.052 0.002 4 5 2 0.005 0.003 1.527 0.127 0.003 1.511 0.131 5 2 1 0.171 0.055 3.132 0.002 0.056 3.069 0.002 53 1 −0.019 0.016 −1.240 0.215 0.017 −1.156 0.247 5 4 1 0.013 0.024 0.540 0.589 0.022 0.576 0.564 55 1 −0.008 0.008 −0.920 0.357 0.008 −1.016 0.310 52 2 −0.425 0.075 −5.695 0.000 0.075 −5.684 0.000 5 3 2 0.141 0.022 6.561 0.000 0.022 6.271 0.000 5 4 2 0.171 0.033 5.261 0.000 0.028 6.032 0.000 55 2 −0.004 0.012 −0.379 0.704 0.011 −0.391 0.696 620 Journal of Classification (2021) 38:594–625 has been employed to adjust the eigenvalues of I and I so as to obtain their nearest pos- 1 2 itive definite matrices. Then, the estimated standard errors have been employed to run tests for the hypotheses H : B [j, k]= 0for j = 2,...,p, k = 1,...,q, g = 1,...,G.Such 0 g tests have been run under an asymptotic normal distribution for the z statistics, where gj k B [j,k] ˆ ˆ z = , with se(B [j, k]) denoting the estimated standard error of B [j, k]. gj k g g se(B [j,k]) Table 14 summarises the results obtained using the parametric bootstrap-based estimates and the score-based estimates; the results derived from the use of the other two estimators are reported in Table 15. According to all the examined methods and using α = 0.05, the four examined regressors show a significant linear effect on the visits to State museums, monuments and museum networks in Emilia Romagna over the period 1999 to 2017 in March, October and over the period 1999 to 2004 in September (cluster 3); furthermore, in the central months of the winters from 1999 to 2017 (cluster 5), there are positive signifi- cant effects of Arriv on Visit in Emilia Romagna and of Overn on Visit in Veneto. According to the estimator based on I , five additional regression coefficients (three within cluster 4, two within cluster 5) result to be significantly different from zero; the same con- ˆ ˆ clusion is obtained from the use of Cov (ϑ ) and Cov (ϑ ). The results obtained using the 2 3 three methods illustrated in Section 3 suggest that tourist arrivals in Emilia Romagna have a positive significant effect on the visits to State museums, monuments and museum networks both in Emilia Romagna and in Veneto in June, July and August. Arriv ER has a sig- nificant and negative effect on Visit Ve in January, February, November and December. Furthermore, Arriv Ve seems to significantly affect Visit Ve, with a positive effect in January, February, November and December and a negative effect in June, July and August. ˆ ˆ The tests based on the estimators Cov (ϑ ) and Cov (ϑ ) lead to a rejection of H for fur- 2 3 0 ther five regression coefficients, three of which concern the effect of tourist arrivals and overnights in Veneto in September for the years 2005–2017 (cluster 1). As far as cluster 2 is concerned, all regression coefficients seem to be not significantly different from 0 accord- ing to all approaches except for the effect of Overn Ve on Visit Ve, which results to be significant only when standard errors are estimated from the Hessian matrix. By exploiting the results contained in the non-diagonal elements of matrices Cov (ϑ ), ˆ ˆ Cov (ϑ ) and Cov (ϑ ), it is also possible to run tests for the significance of the difference 2 3 between the effects of two different predictors on the same response in a given cluster or tests for the significance of the difference between the effects of the same predictor on the same response in two different clusters. For any given pair of regression coefficients in the model, the null hypothesis can be expressed as follows: H : B [j ,k ]= B [j ,k ]. 0 g 1 1 g 2 2 1 2 Three illustrative examples of hypotheses for the analysis of the tourism dataset are sum- ˆ ˆ ˆ marisedinTables 16 and 17,where δ = B [j ,k ]− B [j ,k ].These (g ,j ,k )(g ,j ,k ) g 1 1 g 2 2 1 1 1 2 2 2 1 2 examples have been obtained from the following questions: Table 16 Values of δ for testing three examples of H : B [j ,k ]= B [j ,k ] in the (g ,j ,k )(g ,j ,k ) 0 g 1 1 g 2 2 1 1 1 2 2 2 1 2 tourism data Example (g ,j ,k)(g ,j ,k ) δ 1 1 1 2 2 2 (g ,j ,k )(g ,j ,k ) 1 1 1 2 2 2 1 (4, 2, 1)(4, 2, 2) 0.0061 2 (5, 2, 1)(5, 4, 2) −0.0007 3 (3, 2, 1)(5, 2, 1) 0.2757 Journal of Classification (2021) 38:594–625 621 Table 17 Estimated standard errors (se)of δ , z values and p-values obtained using Eqs. 13 (g ,j ,k )(g ,j ,k ) 1 1 1 2 2 2 (columns 2–4), 14 (columns 5–7) and 15 (columns 8–10) for testing three examples of H : B [j ,k ]= 0 g 1 1 B [j ,k ] in the tourism data g 2 2 Example se z p-value se z p-value se z p-value 1 0.057 0.107 0.915 0.039 0.155 0.877 0.034 0.177 0.859 2 0.101 −0.007 0.995 0.079 −0.008 0.993 0.079 −0.008 0.993 3 0.181 1.522 0.128 0.120 2.292 0.022 0.137 2.015 0.044 (a) In June, July and August (cluster 4), do tourist arrivals in Emilia Romagna have a dif- ferent effect on Visit ER and Visit Ve (first example)? (b) In January, February, November and December (cluster 5), is the effect of tourist arrivals in Emilia Romagna on Visit ER different from the effect of tourist arrivals in Veneto on Visit Ve (second example)? (c) Is the effect of tourist arrivals in Emilia Romagna on Visit ER in January, February, November and December (cluster 5) different from the one in March, September 1999– 2004 and October (cluster 3) (third example)? For each of these illustrations, Table 17 summarises the results of the z tests run using the three estimated covariance matrices (the non-diagonal elements are not reported). For the first two examples, the compared methods lead to results which are all in favour of the null hypothesis. In the third illustration, the null hypothesis should be rejected (α = 0.05) when variances and covariances are estimated using both the Hessian and sandwich approaches. 6 Conclusions Three information-based estimators of the asymptotic covariance matrix of the ML esti- mator under multivariate Gaussian linear cluster-weighted models have been illustrated. For their computation, formulae for the score vector and Hessian matrix of the incomplete log-likelihood have been derived. Properties of these estimators have been numerically eval- uated using simulated samples in comparison with the parametric bootstrap-based estimator. For the ML estimates of the model intercepts and regression coefficients, the comparison has included an approach implemented in the package flexCWM in which estimated stan- dard errors are computed by fitting G separate linear weighted regression models using the estimated posterior probabilities as weights. With correctly specified models, the most accurate estimator of the standard error of the ML estimator is the one based on the Hes- sian matrix. When Gaussian cluster-weighted models are fitted to datasets generated from a uniform distribution, the best accuracy is achieved with the sandwich estimator. Overall, the obtained results show the robustness of this latter method. Through these information- based estimators, the tasks of computing approximated confidence intervals and running tests concerning pairs of parameters can be easily carried out, as illustrated through a study aiming at evaluating the link between tourism flows and attendance at museums and monu- ments in two Italian regions. Asymptotic properties of the estimators introduced here could also be studied from a theoretical point of view. For example, suitable regularity conditions can be defined so as to provide a general assessment of their consistency (see for example Galimberti et al. (2020) for a similar study in the context of clusterwise regression models). 622 Journal of Classification (2021) 38:594–625 Appendix Theorem 1 The score vector S(ϑ ) for a cluster-weighted model of order G contains the following subvectors: I I I ∂l(ϑ ) ∂l(ϑ ) ∂l(ϑ ) 1 = a ¯ , = α f ∀g, = α J vec(F ) ∀g, i gi gi gi gi ∂π ∂μ ∂v( ) 2 X g i=1 i=1 i=1 I I ∂l(ϑ ) ∂l(ϑ ) 1 = α vec(x o ) ∀g, =− α L vec(O ) ∀g, gi gi gi i gi ∂vec(B ) ∂v( ) 2 g Y i=1 i=1 G q(q+1) with a ¯ = α a , L and J denoting duplication matrices with dimensions q × i gi g g=1 2 p(p+1) and p × , respectively. Theorem 2 The Hessian matrix H(ϑ ) for a cluster-weighted model of order G contains the following submatrices: I I 2 2 ∂ l(ϑ ) ∂ l(ϑ ) =− a ¯ a ¯ , = α (a −¯ a )vec(x o ) ∀g, i gi g i i i gi ∂π ∂π ∂π∂(vec(B )) i=1 i=1 2 2 ∂ l(ϑ ) 1 ∂ l(ϑ ) =− α (a −¯ a )vec(O ) L ∀g, gi g i gi ∂π∂(v( )) 2 ∂π ∂μ g X i=1 = α (a −¯ a )f ∀g, gi g i gi i=1 ∂ l(ϑ ) 1 =− α (a −¯ a )vec(F ) J ∀g, gi g i gi ∂π∂(v( )) 2 i=1 ∂ l(ϑ ) −1 ∗ ∗ =− α  ⊗ (x x ) gi Y i i ∂vec(B )∂(vec(B )) g g i=1 ∗  ∗ −(1 − α )vec(x o )vec(x o ) ∀g, gi i gi i gi ∂ l(ϑ ) ∗  ∗ =− α α vec(x o )vec(x o ) ∀g = j, gi ji i gi i ji ∂vec(B )∂(vec(B )) g j i=1 ∂ l(ϑ ) −1 =− α  ⊗ (x o ) gi i gi ∂vec(B )∂(v( )) g Y i=1 + (1 − α )vec(x o )vec(O ) L ∀g, gi gi i gi ∂ l(ϑ ) 1 = α α vec(x o )vec(O ) L ∀g = j, gi ji ji i gi ∂vec(B )∂(v( )) 2 g Y i=1 ∂ l(ϑ ) = α (1 − α )vec(x o )f ∀g, gi gi i gi gi ∂vec(B )∂μ i=1 Journal of Classification (2021) 38:594–625 623 ∂ l(ϑ ) =− α α vec(x o )f ∀g = j, gi ji i gi ji ∂vec(B )∂μ i=1 ∂ l(ϑ ) 1 =− α (1 − α )vec(x o )vec(F ) J ∀g, gi gi gi i gi ∂vec(B )∂(v( )) 2 g X i=1 ∂ l(ϑ ) 1 = α α vec(x o )vec(F ) J ∀g = j, gi ji ji i gi ∂vec(B )∂(v( )) 2 g X i=1 ∂ l(ϑ ) 1 −1  −1 =− α L ( − 2O ) ⊗ gi gi Y Y g g ∂v( )∂(v( )) 2 Y Y g g i=1 − (1 − α )vec(O )vec(O ) L ∀g, gi gi gi ∂ l(ϑ ) 1 =− α α L vec(O )vec(O ) L ∀g = j, gi ji gi ji ∂v( )∂(v( )) 4 Y Y g j i=1 ∂ l(ϑ ) 1 =− α (1 − α )L vec(O )f ∀g, gi gi gi gi ∂v( )∂μ 2 g X i=1 ∂ l(ϑ ) 1 = α α L vec(O )f ∀g = j, gi ji gi ji ∂v( )∂μ 2 i=1 ∂ l(ϑ ) 1 = α (1 − α )L vec(O )vec(F ) J ∀g, gi gi gi gi ∂v( )∂(v( )) 4 Y X g g i=1 ∂ l(ϑ ) 1 =− α α L vec(O )vec(F ) J ∀g = j, gi ji gi ji ∂v( )∂(v( )) 4 Y X g j i=1 ∂ l(ϑ ) −1 =− α  − (1 − α )f f ∀g, gi gi gi X gi ∂μ ∂μ X X g i=1 ∂ l(ϑ ) =− α α f f ∀g = j, gi ji gi ji ∂μ ∂μ g j i=1 ∂ l(ϑ ) 1 −1 =− α (f ⊗  ) + (1 − α )f vec(F ) J ∀g, gi gi gi gi gi ∂μ ∂(v( )) 2 X X g g i=1 ∂ l(ϑ ) 1 = α α f vec(F ) J ∀g = j, gi ji ji gi ∂μ ∂(v( )) 2 g j i=1 ∂ l(ϑ ) 1 −1 −1 =− α J ( − 2F ) ⊗ gi gi X X g g ∂v( )∂(v( )) 2 X X g g i=1 − (1 − α )vec(F )vec(F ) J ∀g, gi gi gi ∂ l(ϑ ) 1 =− α α J vec(F )vec(F ) J ∀g = j. gi ji gi ji ∂v( )∂(v( )) 4 X X g j i=1 624 Journal of Classification (2021) 38:594–625 Funding Open access funding provided by Alma Mater Studiorum - Universita di Bologna within the CRUI- CARE Agreement. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. References Basford, K. E., Greenway, D. R., McLachlan, G. J., & Peel, D (1997). Standard errors of fitted component means of normal mixtures. Computational Statistics, 12, 1–17. Boldea, O., & Magnus, J. R. (2009). Maximum likelihood estimation of the multivariate normal mixture model. Journal of the American Statistical Association, 104, 1539–1549. Cellini, R., & Cuccia, T. (2013). Museum and monument attendance and tourism flow: A time series analysis approach. Applied Economics, 45, 3473–3482. Dang, U., Punzo, A., McNicholas, P., Ingrassia, S., & Browne, R. (2017). Multivariate response and parsimony for Gaussian cluster-weighted models. Journal of Classification, 34, 4–34. Dempster, A. P., Laird, N. M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39, 1–38. Di Mari, R., Bakk, Z., & Punzo, A. (2020). A random-covariate approach for distal outcome prediction with latent class analysis. Structural Equation Modeling: A Multidisciplinary Journal, 27(3), 351–368. Gershenfeld, N. (1997). Nonlinear inference and cluster-weighted modeling. Annals of the New York Academy of Sciences, 808, 18–24. Galimberti, G., Nuzzi, L., & Soffritti, G. (2021). Covariance matrix estimation of the maximum likelihood estimation in multivariate clusterwise linear regression. Statistical Methods & Applications, 30, 235– Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction, 2nd edn. New York: Springer. Henningsen, A., & Hamann, J. D. (2007). systemfit: A package for estimating systems of simultaneous equations in R. Journal of Statistical Software, 23(4), 1–40. Higham, N. J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications, 103, 103–118. Ingrassia, S., Minotti, S. C., & Vittadini, G. (2012). Local statistical modeling via a cluster-weighted approch with elliptical distributions. Journal of Classification, 29, 363–401. Ingrassia, S., Minotti, S. C., & Punzo, A. (2014). Model-based clustering via linear cluster-weighted models. Computational Statistics & Data Analysis, 71, 159–182. Ingrassia, S., Punzo, A., Vittadini, G., & Minotti, S.C. (2015). The generalized linear mixed cluster-weighted model. Journal of Classification, 32, 85–113. Magnus, J. R., & Neudecker, H. (1988). Matrix differential calculus with applications in statistics and econometrics. New York: Wiley. Mazza, A., Punzo, A., & Ingrassia, S. (2018). flexCWM: A flexible framework for cluster-weighted models. Journal of Statistical Software, 86(2), 1–30. McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley. Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. In Handbook of econometrics (Volume 4, Chapter 36, pp. 2111–2245). Newton, M. A., & Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical Society: Series B, 56, 3–48. Punzo, A., & Ingrassia, S. (2013). On the use of the generalized linear exponential cluster-weighted model to assess local linear independence in bivariate data. QdS - Journal of Methodological and Applied Statistics, 15, 131–144. Punzo, A. (2014). Flexible mixture modeling with the polynomial Gaussian cluster-weighted model. Statistical Modelling, 14, 257–291. Journal of Classification (2021) 38:594–625 625 Punzo, A., & Ingrassia, S. (2016). Clustering bivariate mixed-type data via the cluster-weighted model. Computational Statistics, 31, 989–1013. Punzo, A., & McNicholas, P. D. (2017). Robust clustering in regression analysis via the contaminated Gaussian cluster-weighted model. Journal of Classification, 34, 249–293. R Core Team. (2020). R: a language and environment for statistical computing Vienna. Austria: R Foundation for Statistical Computing. Ritter, G. (2015). Robust cluster analysis and variable selection. Boca Raton: CRC Press. Schafer, ¨ J., & Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), Article, 32. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464. Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A.E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8/1, 205–223. Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P.D. (2013). Clustering and classification via cluster- weighted factor analyzers. Advances in Data Analysis and Classification, 7(1), 5–40. Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P.D. (2015). Cluster-weighted t-factor analyzers for robust model-based clustering and dimension reduction. Statistical Methods & Applications, 24, 623– White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for het- eroskedasticity. Econometrica, 48, 817–838. White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Classification Springer Journals

Estimating the Covariance Matrix of the Maximum Likelihood Estimator Under Linear Cluster-Weighted Models

Journal of Classification , Volume 38 (3) – Oct 1, 2021

Loading next page...
 
/lp/springer-journals/estimating-the-covariance-matrix-of-the-maximum-likelihood-estimator-es6EQpJmKw

References (44)

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2021
ISSN
0176-4268
eISSN
1432-1343
DOI
10.1007/s00357-021-09390-9
Publisher site
See Article on Publisher Site

Abstract

In recent years, the research into cluster-weighted models has been intense. However, esti- mating the covariance matrix of the maximum likelihood estimator under a cluster-weighted model is still an open issue. Here, an approach is developed in which information-based esti- mators of such a covariance matrix are obtained from the incomplete data log-likelihood of the multivariate Gaussian linear cluster-weighted model. To this end, analytical expressions for the score vector and Hessian matrix are provided. Three estimators of the asymptotic covariance matrix of the maximum likelihood estimator, based on the score vector and Hes- sian matrix, are introduced. The performances of these estimators are numerically evaluated using simulated datasets in comparison with a bootstrap-based estimator; their usefulness is illustrated through a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions. Keywords Gaussian mixture model · Hessian matrix · Linear regression · Model-based cluster analysis · Sandwich estimator · Score vector 1 Introduction Cluster-weighted models constitute an approach to regression analysis with random covari- ates in which supervised (regression) and unsupervised (model-based cluster analysis) learning methods are jointly exploited (Hastie et al., 2009). In this approach, a given ran- dom vector is assumed to be composed of an outcome Y (response, dependent variable) and its explanatory variables X (covariates, predictors). Furthermore, sample observations are allowed to come from a population composed of several unknown sub-populations. Finally, the joint distribution of the outcome and the covariates is modelled through a finite mixture model specified so as to account for a different effect of the covariates on the response within each sub-population. Thus, cluster-weighted models are useful to perform Gabriele Soffritti gabriele.soffritti@unibo.it Department of Statistical Sciences, Alma Mater Studiorum - University of Bologna, via delle Belle Arti, 41 - 40126 Bologna, Italy Journal of Classification (2021) 38:594–625 595 model-based cluster analysis in a multivariate regression setting with random covariates and unobserved heterogeneity. Since the introduction of this approach (Gershenfeld, 1997), the research into cluster- weighted models has been intense, especially in the last 8 years. Models for continuous variables under normal mixture models have been proposed by Ingrassia et al. (2012)and Dang et al. (2017). Robustified solutions have been developed by Ingrassia et al. (2014) and Punzo and McNicholas (2017), based on the use of Student t and contaminated normal mixture distributions, respectively. Punzo and Ingrassia (2013), Punzo and Ingrassia (2016), Ingrassia et al. (2015) and Di Mari et al. (2020) have introduced models for various types of responses. Models able to deal with non-linear relationships or many covariates have been proposed by Punzo (2014), Subedi et al. (2013) and Subedi et al. (2015). By focusing the attention on Gaussian cluster-weighted models in which the effects of the covariates on the response within each sub-population are linear, model parame- ters are generally estimated through the maximum likelihood (ML) method by resorting to the expectation-maximisation (EM) algorithm (Dempster et al., 1977), which is widely employed in incomplete-data problems. In these models, the observed data S ={(x , y ), i i i = 1,...,I } are incomplete because the specific component density that generates the I sample observations is missing. This missing information is modelled through an unob- served variable coming from a pre-specified multinomial distribution and is added to the observed data so as to form the so-called complete data. Then, the ML estimate is computed from the maximisation of the complete data log-likelihood. A description of the EM algo- rithm for the linear Gaussian cluster-weighted model can be found in Dang et al. (2017). Specific functions implementing such algorithm for models with a univariate response are available in the package flexCWM (Mazza et al., 2018)for the R software environment (R Core Team, 2020). A by-product of this estimating approach is a set of K estimated posterior probabili- ties that each sample observation comes from the K Gaussian distributions of the mixture. Thus, a by-product of fitting a linear Gaussian cluster-weighted model is a clustering of the I sample observations, based on a rule that assigns an observation to the distribution of the mixture to which it has the highest posterior probability of belonging. However, an estimat- ing approach based on the use of an EM algorithm has the drawback of not automatically producing any estimate of the covariance matrix of the ML estimator. The assessment of the sample variability of the parameter estimates in a linear Gaussian cluster-weighted model is a necessary step in the subsequent development of inference methods for the model param- eters, such as asymptotic confidence intervals, tests for the significance of the effect of any covariate on a given response within any sub-population and tests for the significance of the difference between the effects of the same covariate on a given response in two different sub-populations. Thus, additional computations are necessary to obtain an assessment of the sample variability of model parameter estimates. To the best of the author’s knowledge, the only solution currently available for the linear Gaussian cluster-weighted models is imple- mented in the flexCWM package, where approximated standard errors are only provided for the intercepts and regression coefficients according to an approach in which a number of separate linear regression analyses with random covariates are carried out (one for each component of the mixture), and the sample observations are weighted with their estimated posterior probabilities of coming from the different components of the mixture. However, this approach only partially exploits the sample information about the parameters under a linear normal cluster-weighted model. Thus, other approaches could be investigated and detected. A solution can be obtained by resorting to bootstrap methods (see, e.g., Newton 596 Journal of Classification (2021) 38:594–625 &Raftery 1994; Basford et al. 1997; McLachlan & Peel 2000). However, the overall com- putational process associated with the use of bootstrap techniques can become particularly time-consuming and complex because of difficulties typically associated with the fitting of finite mixture models (e.g. label-switching problems, possible convergence failures of the EM algorithm on the bootstrap samples). These inconveniences could be avoided through an approach in which the observed information matrix is obtained from the incomplete data log-likelihood and employed to compute information-based estimators of the covariance matrix of the ML estimator (see e.g. McLachlan & Peel 2000). This task has been success- fully carried out under normal mixture models (Boldea & Magnus, 2009) and clusterwise linear regression models (Galimberti et al., 2021). In order to make it possible to properly assess both the variability of and the covariance between ML estimates of all the parameters under multivariate linear normal cluster- weighted models with a multivariate response, the gradient vector and second-order derivative matrix of the incomplete data log-likelihood for these models are explicitly derived here. Then, these results are used to obtain three estimators of the observed infor- mation matrix and the covariance matrix of the ML estimator. Properties of such estimators are numerically investigated using simulated datasets in comparison with the paramet- ric bootstrap and the approach implemented in flexCWM. A numerical evaluation of the relationships between the estimators introduced here and those described by Boldea and Magnus (2009) is also provided. The practical usefulness of the proposed estimators is illus- trated in a study aiming at evaluating the link between tourism flows and attendance at museums and monuments in two Italian regions. The remainder of the paper is organised as follows. Section 2 provides the defini- tion of multivariate Gaussian linear cluster-weighted model together with some quantities employed in the derivation of the score vector and the Hessian matrix. Section 3 describes the estimators of the information matrix and the covariance matrix of the ML estimator. Sections 4 and 5 summarize the main results obtained from the analysis of simulated and real datasets, respectively. The analytical expressions of the score vector and the Hessian matrix are reported in Appendix. Technical details and additional results from the analysis of simulated datasets can be found in a separate document as supplementary materials. 2 Score Vector and Hessian Matrix of Gaussian Linear Cluster-Weighted Models Let X = (X , ...,X ) and Y = (Y , ...,Y ) be two continuous random vectors with joint 1 p 1 q probability density function (p.d.f.) f(x, y). The response vector Y and the covariate vector q p X take values in R and R , respectively. Following Dang et al. (2017), (X , Y ) follows a cluster-weighted model of order G if f(x, y) = π f(x| )f (y|x, ), (1) g g g g=1 where  ,..., denote the G unknown sub-populations ( ∩  =∅ ∀g = j), 1 G g j π = P( ), π > 0 ∀g, π = 1, f(x| ) is the conditional p.d.f. of X given g g g g g g=1 , f(y|x, ) is the conditional p.d.f. of Y given X and  . A Gaussian linear cluster- g g g weighted model is obtained from Eq. 1 by additionally assuming that the distributions of Journal of Classification (2021) 38:594–625 597 X| and Y|(X = x, ) are Gaussian for g = 1,...,G and the effect of X on Y for any g g is linear. By embedding all these assumptions into model (1), f(x, y) becomes f(x, y; ϑ ) = π φ (x; μ ,  )φ (y|x; B x ,  ), (2) g p X q Y X g g g g g=1 where φ (·; μ, ) represents the p.d.f. of a normal d-dimensional random vector with expected value μ and positive definite covariance matrix , B x = E(Y|X = x, ), g = 1,...,G, (3) (1+p)×q ∗ with B ∈ R , x = (1, x ) ,and ϑ is the vector of the unknown parameters. It has been proved that linear Gaussian cluster-weighted models of order G define the same family of probability distributions generated by mixtures of G Gaussian models for the random vector Z = (X , Y ) (Ingrassia et al., 2012). However, it is important to stress that this latter type of mixtures cannot be employed to account for local linear dependencies between X and Y. The score vector and Hessian matrix of model (2) are derived by taking account of the fact that the weights π ,...,π sum to one and the covariance matrices are symmetric. 1 G The first constraint is introduced in the maximization of the likelihood function by differ- entiating with respect to π = (π ,...,π ) and by setting π = 1 − π −· · ·− π . 1 G−1 G 1 G−1 The constraints on the covariance matrices are dealt with by using the operators vec(·),v(·) and the duplication matrix. Namely, vec(B) is the column vector obtained by stacking the columns of matrix B one underneath the other. v() denotes the column vector obtained from vec() by eliminating all supradiagonal elements of a symmetric matrix  (thus, v() contains only the lower triangular part of ). The duplication matrix G is the unique matrix which transforms v() into vec() (Gv() = vec()) (see e.g. Magnus & Neudecker 1988). Using this notation, the vector of the unknown parameters in model (2) can be denoted as ϑ = (π , θ ,..., θ ) ,where θ = μ , v  , vec B , v  . g X g Y 1 G X g g Suppose that the observed data S ={(x , y ), i = 1,...,I } is composed of I indepen- i i dent and identically distributed observations. Then, the incomplete log-likelihood function under the model (2)is I G l(ϑ ) = ln π φ x ; μ ,  φ y |x ; B x ,  .(4) g p i X q i Y X g i g i g i=1 g=1 Each sample observation provides its own contribution to the gth term of the sum that defines the mixture (2). As far as the contribution of the ith observation is concerned, it is given by: 1 1 −1 − − 2 2 p = π (2π) det( ) exp − (x − μ )  (x − μ ) gi g X i i g X X g X g 1 1 − −  ∗  −1  ∗ 2 2 (2π) det( ) exp − (y − B x )  (y − B x ).(5) g i g i Y i g i By exploiting properties of the logarithm and trace, the following equality holds true: p 1 1 −1 ln p = ln π − ln(2π) − ln det( ) − tr  (x − μ )(x − μ ) + gi g X i i X X g X g g 2 2 2 q 1 1 −1  ∗  ∗ − ln(2π) − ln det( ) − tr  (y − B x )(y − B x ).(6) g i g i g Y i i 2 2 2 598 Journal of Classification (2021) 38:594–625 The explicit forms of the score vector and Hessian matrix, as developed here, require the introduction of some additional notation. Namely, let gi α = , (7) gi ji j =1 a = e,g = 1,...,G − 1, (8) g g a =− 1 , G G−1 where e denotes the gth column of the identity matrix of order G − 1and 1 is the g G−1 (G−1)×1 vector having each element equal to 1. The following quantities are also required: −1  ∗ o =  (y − B x ), (9) gi Y i g i −1 O =  − o o , (10) gi gi gi −1 f =  (x − μ ), (11) gi i X g −1 F =  − f f . (12) gi gi X gi The explicit forms of the score vector S(ϑ ) and Hessian matrix H(ϑ ) for a Gaussian linear cluster-weighted model are provided in Theorems 1 and 2 (see Appendix). Proofs can be found in the document with the supplementary materials. 3 Covariance Matrix Estimation of the ML Estimator In the ML approach, the information matrix I(ϑ ) plays a crucial role, as it is used to asymptotically estimate the covariance of the ML estimator of the model parameters. Under regularity conditions and if the model is correctly specified, I(ϑ ) is given either by the covariance of the score function E S(ϑ )S(ϑ ) or the negative of the expected value of the Hessian matrix −E (H(ϑ )). Clearly, an analytical evaluation of the expectations required to obtain I(ϑ ) under model (2) is quite complex. By exploiting some asymptotic results concerning ML estimation (White, 1982), it is possible to obtain the following asymptotic estimators of I(ϑ ): I I ˆ ˆ ˆ I = S (ϑ )S (ϑ ) , I =− H (ϑ ), 1 i i 2 i i=1 i=1 ˆ ˆ where S (ϑ ) and H (ϑ ) denote the contribution of the ith sample observation to the score i i function and Hessian matrix evaluated at the ML estimate ϑ, respectively. They can be used ˆ ˆ to obtain the following asymptotic estimators of Cov(ϑ ), the covariance matrix of ϑ: −1 Cov (ϑ ) = I , (13) −1 Cov (ϑ ) = I . (14) Under suitable conditions (see e.g. Newey & McFadden 1994; Ritter 2015, for a general discussion and some results specifically dealing with finite mixture models, respectively), ˆ ˆ ˆ Cov (ϑ ) and Cov (ϑ ) can be considered consistent estimators of Cov(ϑ ) when the model 1 2 Journal of Classification (2021) 38:594–625 599 is correctly specified. By exploiting the so-called sandwich approach (see e.g. White 1980), the following robust estimator can also be employed: −1 −1 Cov (ϑ ) = I I I . (15) 3 1 2 2 ˆ ˆ It is worth noting that for the existence of the estimators Cov (ϑ ) and Cov (ϑ ) it is required 2 3 that matrix I is positive definite and well conditioned. The same requirements have to be fulfilled by matrix I in order to guarantee that Cov (ϑ ) exists. With large-scale covariance 1 1 matrices and small sample sizes, I and/or I could be ill-conditioned. These situations can 1 2 be managed by resorting to procedures able to produce improved estimators of I(ϑ ) from either I or I . For example, the algorithm by Higham (1988) computes the nearest positive 1 2 definite matrix of a given symmetric matrix by adjusting its eigenvalues. Other approaches which exploit techniques of variance reduction could also be adopted (see e.g. Schafer ¨ and Strimmer 2005). 4 Experimental Results from Simulated Datasets 4.1 Numerical Study of the Properties of the Proposed Estimators ˆ ˆ ˆ In order to evaluate the properties of Cov (ϑ ), Cov (ϑ ) and Cov (ϑ ) in comparison with the 1 2 3 estimators based on the parametric bootstrap and the approach implemented in flexCWM, five Monte Carlo studies have been performed. In the first study, the artificial datasets have been generated under a model defined by Eqs. 2–3 where G = 2, q = 1and p = 2. As far as the model parameters are concerned, the following values have been employed: π = 0.7, π = 0.3,  = 1.5,  = 1, B = (5, 2, 2), B = (1, −2, −2), μ = (−2, −2), 2 Y Y 1 2 1 2 X 1.0 0.2 1.0 0.4 μ = (2, 2),  = ,  = . Such values have been chosen so as to X X X 1 0.2 1.0 2 0.4 1.0 produce two well-separated groups of observations (see the upper panel of Fig. 1, with the pairwise scatterplots of the variables X , X and Y for a sample of size I = 500 generated 1 2 1 as just described). In this way, problems of label switching across simulations are less likely to occur. Furthermore, the ML estimates of ϑ are expected to be accurate enough to let the analysis be focused on the different ways of estimating the standard error of ϑ.Usingthese parameter values, R = 10, 000 datasets (of size I) have been generated as follows: 1. For the rth dataset (r = 1,...,R), a sample of size I is drawn from the standard p-dimensional normal distribution; this gives the vectors  ,...,  ; 1r Ir 2. For the rth dataset (r = 1,...,R), a sample of size I is drawn from the standard q-dimensional normal distribution; this gives the vectors η ,..., η ; 1r Ir 3. For the rth dataset (r = 1,...,R), asampleofsize I is drawn from the Bernoulli distribution with parameter π ; this produces the 0-1 vector z = (z ,...,z ) ; 1 r 1r Ir 4. For the ith observation (i = 1,...,I)ofthe rth dataset, vectors x and y are obtained ir ir as follows: x = μ + A  , y = B x + A η if z = 1, ir X ir ir ir Y ir X ir 1 1 1 1 = μ + A  , y = B x + A η if z = 0, ir X ir ir ir Y ir X 2 2 2 ir where A and A are matrices obtained from the spectral decompositions of  and X Y X g g g , respectively. Such matrices are constructed such that A A =  , A A = Y X X Y g g g g X Y g g ,for g = 1, 2. g 600 Journal of Classification (2021) 38:594–625 Fig. 1 Pairwise scatterplots of X , X and Y forfour samplesofsize I = 500 generated in the first four 1 2 1 studies. Upper and lower panels refer to the first and fourth studies, respectively; intermediate panels refer to the second and third ones. Black circles and red triangles correspond to g = 1and g = 2, respectively In the second study, the datasets have been obtained through the same procedure used in the first one except from the computation of vectors  and η . Namely, a sample of size I ·p is ir ir drawn from the uniform distribution in the interval (0,1) for the rth dataset (r = 1,...,R); ∗ ∗ this produces a vector  , whose elements are transformed as follows:  = 12( − jr r jr 0.5), j = 1,...,I · p; the vector  resulting from this transformation has zero mean and unit variance; partitioning  into Ip−dimensional vectors leads to  ,...,  .Thesame r 1r Ir process has been applied to obtain vectors η ,..., η . The second panel of Fig. 1 provides 1r Ir Journal of Classification (2021) 38:594–625 601 the pairwise scatterplots of X , X and Y for a sample of size I = 500 from this second 1 2 1 study. In the third and fourth studies, the datasets have been generated as in the first and sec- ond studies, respectively, but using the following model parameters: μ = (−1, −1) , μ = (1, 1) . This change in the values of μ and μ leads to overlapping groups of X X X 2 1 2 observations (see the pairwise scatterplots of X , X and Y for samples of size I = 500 1 2 1 in third and fourth panels of Fig. 1). The total number of model parameters in the first four studies is 19. The fifth study has been carried out with the same settings of the first study but with p = 8 explanatory variables. The model parameters pertaining to X which have been employed to generate the datasets are as follows: μ =−2 · 1 , μ = 2 · 1 , V(X | ) = 1 ∀j for X 8 X 8 j g 1 2 |j −h| |j −h| g = 1, 2, Cov(X ,X | ) = 1 − ∀j = h,Cov(X ,X | ) = 1 − ∀j = h. j h 1 j h 2 8 4 As far as the effects of the regressors on Y are concerned, they have been set as follows: B = (5, μ ) , B = (1, μ ) . In this latter study, the total number of model parameters 1 2 X X 2 1 is 109. In all studies, Monte Carlo experiments have been performed with two different sample sizes: I = 250, 500 in the first four studies, I = 300, 500 in the last study. In all experi- ments, it has been assumed that the rth dataset {(x , y ),...,(x , y )} is generated from 1r 1r Ir Ir a model defined by Eqs. 2–3 with G = 2. Thus, the maximum likelihood estimate ϑ of ϑ has been computed for r = 1,...,R under such an assumption. Parameter estimation has been carried out through the EM algorithm as implemented in the function cwm of the package flexCWM. As far as the initialisation of the parameters is concerned, an option has been employed, which executes 5 independent runs of the k-means algorithm and picks the solution maximising the observed-data log-likelihood among these runs. The maximum number of iterations of the EM algorithm has been set equal to 400. A convergence crite- −6 rion based on the Aitken acceleration has been used, with a threshold  = 10 (for further details, see Mazza et al. 2018). The R independent estimates of ϑ are used to approximate the true distribution of ϑ and, in particular, the true standard errors of all the elements of ϑ. Estimates of these standard errors have been computed using the three information-based estimators and the parametric bootstrap for R = 2000 datasets obtained as described above. For each bootstrap estimate, 100 bootstrap samples have been employed. For the ML estimates of the model intercepts and regression coefficients, the standard errors estimated by the function cwm of the pack- age flexCWM using the approach illustrated in the introduction have been included in the comparison. The performances of these strategies have been evaluated on the basis of an estimate of their biases and root mean squared errors (RMSE). A comparative evaluation of such approaches has been carried out also through the coverage probabilities (CP) of 90% and 95% confidence intervals based on the examined standard errors’ estimates and the standard normal quantiles. In this latter comparison, the attention is focused on the expected mean values of the regressors (i.e. μ and μ ) and the regression coefficients (all the X X 1 2 entries in the first column of B and B except the first one). 1 2 Tables 1, 2, 3 and 4 contain the biases and RMSEs for the first four Monte Carlo studies with samples of size 250. The same information for the last study and the sample size I = 300 can be found in Table 5. The corresponding values for the CPs are summarised in Tables 6, 7, 8, 9 and 10. In all the tables, the results obtained using the function cwm of the package flexCWM, the bootstrap and the estimators defined in Eqs. 13–15 are denoted as cwm, Boot, C , C and C , respectively. From now on, B [j, k] is used to denote the 1 2 3 g element on the j-th row and k-th column of matrix B ; μ [j ] represents the j-th element g g 602 Journal of Classification (2021) 38:594–625 Table 1 Biases and root mean squared errors of the examined standard error estimators of ϑ in the first study, I = 250 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π 0.031 0.050 0.038 0.040 0.217 0.096 0.090 0.092 B [1, 1]−5.048 −2.829 −4.068 −2.958 −0.602 5.301 3.651 4.596 3.380 3.156 B [2, 1]−1.766 −1.013 −1.426 −1.076 −0.256 1.874 1.335 1.641 1.232 1.088 B [3, 1]−1.724 −0.958 −1.373 −1.032 −0.215 1.821 1.274 1.577 1.175 1.078 −0.693 4.722 4.331 4.473 0.833 4.823 4.340 4.601 μ [1]−0.048 0.168 −0.047 −0.039 0.687 0.511 0.448 0.448 μ [2] 0.010 0.240 0.026 0.033 0.697 0.521 0.432 0.434 [1, 1]−0.146 0.527 −0.078 −0.170 1.424 1.684 1.199 1.533 [1, 2]−0.085 0.408 −0.023 −0.070 1.419 1.650 1.197 1.525 [2, 2]−0.001 0.688 0.046 −0.080 1.417 1.741 1.198 1.525 B [1, 1]−15.263 0.223 3.462 −0.630 −1.199 15.395 4.171 6.374 3.415 5.514 B [2, 1]−6.046 0.301 1.871 −0.103 −0.445 6.105 1.793 2.934 1.416 2.255 B [3, 1]−6.226 0.085 1.620 −0.300 −0.617 6.280 1.742 2.760 1.383 2.208 0.040 10.541 8.444 8.227 0.781 10.828 8.486 8.545 μ [1]−0.080 0.632 −0.139 −0.076 1.480 1.567 1.244 1.323 μ [2]−0.104 0.624 −0.145 −0.087 1.496 1.574 1.256 1.327 [1, 1]−0.442 1.759 −0.419 −0.702 3.516 4.710 3.360 4.257 [1, 2]−0.468 1.125 −0.455 −0.548 3.519 4.512 3.364 4.234 [2, 2]−0.341 1.853 −0.352 −0.676 3.504 4.746 3.352 4.253 All entries have been multiplied by 100 Journal of Classification (2021) 38:594–625 603 Table 2 Biases and root mean squared errors of the examined standard error estimators of ϑ in the second study, I = 250 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π 0.006 0.024 0.001 0.001 0.221 0.087 0.082 0.082 B [1, 1]−4.220 −1.959 −4.021 −2.477 −0.184 4.367 2.827 4.231 2.721 1.737 B [2, 1]−1.446 −0.683 −1.397 −0.813 0.050 1.514 1.034 1.499 0.935 0.701 B [3, 1]−1.639 −0.863 −1.562 −1.008 −0.174 1.700 1.160 1.652 1.103 0.703 1.722 12.779 6.679 3.074 1.780 12.815 6.683 3.094 μ [1]−0.130 0.019 −0.142 −0.141 0.647 0.385 0.392 0.392 μ [2] 0.051 0.214 0.051 0.052 0.664 0.458 0.384 0.384 [1, 1] 1.730 5.682 1.712 −0.128 2.126 6.030 1.967 0.678 [1, 2] 2.793 7.498 2.761 −0.020 3.054 7.764 2.926 0.666 [2, 2] 1.709 5.689 1.719 −0.123 2.109 6.036 1.973 0.677 B [1, 1]−13.882 1.521 2.659 −0.015 −0.617 13.951 4.027 4.590 2.618 3.525 B [2, 1]−6.010 0.290 0.950 −0.184 −0.413 6.042 1.595 1.918 1.187 1.622 B [3, 1]−6.090 0.247 0.904 −0.255 −0.508 6.120 1.570 1.909 1.185 1.645 2.767 21.632 11.065 5.659 2.868 21.835 11.095 5.791 μ [1] 0.148 0.589 −0.041 −0.043 1.343 1.299 0.986 0.984 μ [2] 0.164 0.585 −0.046 −0.048 1.320 1.270 0.964 0.964 [1, 1] 3.549 10.810 3.160 −0.135 4.532 12.011 4.027 1.765 [1, 2] 4.838 12.754 4.430 −0.002 5.599 13.788 5.084 1.760 [2, 2] 3.518 10.708 3.151 −0.094 4.508 11.920 4.020 1.763 All entries have been multiplied by 100 604 Journal of Classification (2021) 38:594–625 Table 3 Biases and root mean squared errors of the examined standard error estimators of ϑ in the third study, I = 250 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π −0.000 0.044 0.029 0.054 0.243 0.131 0.128 0.156 B [1, 1]−3.552 −1.794 −2.556 −1.668 0.061 3.674 2.347 2.873 1.982 2.091 B [2, 1]−1.977 −1.054 −1.468 −1.061 −0.118 2.079 1.551 1.697 1.256 1.214 B [3, 1]−1.902 −0.950 −1.362 −0.985 −0.062 1.998 1.482 1.589 1.174 1.197 −0.759 4.911 4.647 5.113 0.906 5.024 4.681 5.343 μ [1]−0.042 0.196 −0.031 −0.015 0.749 0.531 0.456 0.462 μ [2] 0.005 0.258 0.030 0.043 0.753 0.544 0.446 0.453 [1, 1]−0.141 0.551 −0.064 −0.150 1.438 1.712 1.215 1.539 [1, 2]−0.061 0.448 0.005 −0.041 1.432 1.681 1.213 1.533 [2, 2] 0.027 0.736 0.081 −0.040 1.431 1.780 1.216 1.533 B [1, 1]−12.209 −0.143 1.969 −0.536 −0.266 12.282 3.176 4.327 3.009 5.175 B [2, 1]−7.068 0.144 1.863 −0.261 −0.442 7.126 2.091 3.175 1.749 2.918 B [3, 1]−7.042 0.159 1.828 −0.268 −0.449 7.096 2.064 3.101 1.675 2.758 −0.070 11.005 8.879 9.018 0.844 11.323 8.947 9.464 μ [1]−0.344 0.749 −0.195 −0.018 1.689 1.894 1.463 1.814 μ [2]−0.425 0.701 −0.271 −0.123 1.697 1.879 1.455 1.720 [1, 1]−0.779 2.310 −0.486 −0.818 3.659 5.263 3.470 4.814 [1, 2]−0.629 1.846 −0.366 −0.593 3.630 5.077 3.455 4.781 [2, 2]−0.539 2.619 −0.255 −0.664 3.616 5.406 3.445 4.791 All entries have been multiplied by 100 Journal of Classification (2021) 38:594–625 605 Table 4 Biases and root mean squared errors of the examined standard error estimators of ϑ in the fourth study, I = 250 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π −0.080 0.054 −0.014 −0.012 0.258 0.164 0.135 0.141 B [1, 1]−9.960 −8.445 −9.492 −7.788 −5.135 9.984 8.543 9.538 7.861 5.506 B [2, 1]−2.944 −2.158 −2.820 −2.069 −0.925 2.981 2.303 2.878 2.143 1.316 B [3, 1]−2.854 −2.042 −2.706 −1.984 −0.880 2.892 2.198 2.767 2.058 1.259 1.270 12.324 6.503 3.441 1.361 12.365 6.517 3.528 μ [1]−0.288 −0.040 −0.221 −0.211 0.707 0.401 0.436 0.435 μ [2]−0.195 0.064 −0.119 −0.110 0.694 0.428 0.411 0.415 [1, 1] 1.098 5.222 1.147 −0.695 1.649 5.596 1.491 0.958 [1, 2] 1.696 6.562 1.705 −1.108 2.096 6.864 1.953 1.290 [2, 2] 1.045 5.187 1.115 −0.725 1.615 5.563 1.466 0.981 B [1, 1]−13.889 −0.999 0.779 −1.230 −1.327 13.932 3.227 3.449 3.060 4.382 B [2, 1]−9.282 −1.819 −0.835 −2.155 −2.318 9.309 2.621 2.264 2.656 3.180 B [3, 1]−9.201 −1.693 −0.689 −2.053 −2.255 9.227 2.505 2.210 2.560 3.130 2.759 22.773 11.819 6.563 2.882 23.022 11.865 6.753 μ [1]−0.600 0.441 −0.375 −0.206 1.666 1.754 1.437 1.641 μ [2]−0.654 0.391 −0.425 −0.255 1.668 1.682 1.415 1.541 [1, 1] 3.047 12.617 3.316 0.036 4.361 14.157 4.448 2.594 [1, 2] 5.083 15.362 5.038 0.118 5.964 16.650 5.845 2.596 [2, 2] 2.905 12.415 3.206 −0.034 4.262 13.978 4.366 2.594 All entries have been multiplied by 100 606 Journal of Classification (2021) 38:594–625 Table 5 Biases and root mean squared errors of the examined standard error estimators of ϑ in the fifth study, I = 300 BIAS RMSE ϑ cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 π 0.003 0.037 0.011 0.011 0.196 0.079 0.068 0.068 B [1, 1]−4.292 −0.898 −0.286 −2.723 −0.828 4.518 19.941 2.232 3.005 2.445 B [−1, 1]−4.002 −2.017 −0.147 −2.543 −0.767 4.227 3.121 2.114 2.823 2.368 −0.403 5.809 3.905 4.025 4.250 5.891 3.910 4.115 μ 0.008 1.117 −0.014 −0.014 0.851 1.228 0.359 0.359 v( ) 0.098 1.922 0.032 −0.050 2.505 2.727 1.587 1.791 B [1, 1]−30.889 2.842 54.228 −2.217 −4.911 31.112 20.522 61.001 6.438 10.367 B [−1, 1]−10.633 0.797 18.699 −0.645 −1.557 10.714 3.544 20.958 2.297 3.645 0.056 19.834 7.571 7.339 3.695 20.658 7.603 7.592 μ −0.037 6.633 −0.141 −0.141 1.848 7.429 0.942 0.942 v( ) 0.010 10.143 −0.093 −0.344 3.884 12.418 3.120 3.485 All entries have been multiplied by 100 Journal of Classification (2021) 38:594–625 607 Table 6 Coverage probability of the 90% and 95% confidence intervals for μ , B [2, 1] and B [3, 1] based on the standard error estimators of ϑ in the first study, I = 250 g g CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ [1] 0.896 0.909 0.899 0.900 0.948 0.955 0.949 0.949 μ [2] 0.901 0.913 0.905 0.905 0.946 0.956 0.951 0.951 μ [1] 0.886 0.909 0.886 0.889 0.940 0.957 0.941 0.941 μ [2] 0.885 0.903 0.885 0.883 0.941 0.959 0.941 0.941 B [2, 1] 0.818 0.856 0.838 0.859 0.889 0.890 0.918 0.908 0.923 0.941 B [3, 1] 0.822 0.858 0.843 0.861 0.893 0.897 0.920 0.907 0.921 0.942 B [2, 1] 0.616 0.914 0.938 0.900 0.888 0.705 0.961 0.972 0.954 0.941 B [3, 1] 0.627 0.906 0.934 0.895 0.875 0.705 0.953 0.971 0.949 0.929 Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125) 608 Journal of Classification (2021) 38:594–625 Table 7 Coverage probability of the 90% and 95% confidence intervals for μ , B [2, 1] and B [3, 1] based on the standard error estimators of ϑ in the second study, I = 250 g g CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ [1] 0.888 0.895 0.890 0.890 0.944 0.949 0.943 0.943 μ [2] 0.913 0.920 0.910 0.910 0.956 0.964 0.958 0.958 μ [1] 0.900 0.919 0.900 0.898 0.942 0.958 0.943 0.942 μ [2] 0.893 0.909 0.893 0.893 0.939 0.950 0.938 0.938 B [2, 1] 0.839 0.872 0.846 0.872 0.903 0.904 0.931 0.908 0.935 0.951 B [3, 1] 0.835 0.863 0.834 0.866 0.895 0.899 0.920 0.903 0.919 0.941 B [2, 1] 0.630 0.905 0.930 0.896 0.885 0.709 0.941 0.970 0.937 0.930 B [3, 1] 0.618 0.905 0.932 0.891 0.876 0.696 0.954 0.977 0.946 0.934 Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125) Journal of Classification (2021) 38:594–625 609 Table 8 Coverage probability of the 90% and 95% confidence intervals for μ , B [2, 1] and B [3, 1] based on the standard error estimators of ϑ in the third study, I = 250 g g CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ [1] 0.898 0.911 0.899 0.899 0.944 0.954 0.947 0.946 μ [2] 0.897 0.912 0.902 0.903 0.947 0.954 0.948 0.948 μ [1] 0.881 0.911 0.887 0.885 0.932 0.951 0.935 0.938 μ [2] 0.878 0.910 0.885 0.887 0.939 0.954 0.936 0.941 B [2, 1] 0.805 0.853 0.837 0.857 0.892 0.882 0.917 0.905 0.921 0.949 B [3, 1] 0.821 0.871 0.846 0.870 0.894 0.892 0.920 0.914 0.924 0.943 B [2, 1] 0.574 0.906 0.939 0.898 0.882 0.665 0.952 0.972 0.947 0.935 B [3, 1] 0.590 0.904 0.938 0.898 0.880 0.685 0.953 0.972 0.950 0.942 Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125) 610 Journal of Classification (2021) 38:594–625 Table 9 Coverage probability of the 90% and 95% confidence intervals for μ , B [2, 1] and B [3, 1] based on the standard error estimators of ϑ in the fourth study, I = 250 g g CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ [1] 0.884 0.901 0.892 0.892 0.938 0.948 0.940 0.940 μ [2] 0.894 0.910 0.903 0.904 0.947 0.959 0.949 0.949 μ [1] 0.866 0.892 0.876 0.885 0.922 0.940 0.928 0.941 μ [2] 0.853 0.886 0.866 0.871 0.920 0.937 0.926 0.932 B [2, 1] 0.820 0.857 0.828 0.866 0.903 0.895 0.920 0.903 0.923 0.948 B [3, 1] 0.810 0.848 0.815 0.854 0.893 0.882 0.904 0.890 0.913 0.940 B [2, 1] 0.580 0.894 0.925 0.887 0.881 0.660 0.944 0.973 0.938 0.922 B [3, 1] 0.566 0.898 0.930 0.891 0.874 0.648 0.947 0.976 0.945 0.926 Entries in italics denote effective coverage probabilities significantly different from the corresponding nominal ones (two-tailed normal tests, α = 0.00125) Journal of Classification (2021) 38:594–625 611 Table 10 Coverage probability of the 90% and 95% confidence intervals for μ and B [−1, 1] based on the standard error estimators of ϑ in the fifth study, I = 300 CP 90% CP 95% cwm Boot C C C cwm Boot C C C 1 2 3 1 2 3 μ 0.891 0.939 0.893 0.893 0.944 0.975 0.946 0.946 μ 0.897 0.988 0.898 0.898 0.947 0.997 0.947 0.947 B [−1, 1] 0.831 0.869 0.898 0.862 0.888 0.897 0.924 0.947 0.920 0.940 B [−1, 1] 0.616 0.906 0.995 0.889 0.869 0.699 0.955 0.999 0.945 0.926 2 612 Journal of Classification (2021) 38:594–625 of μ . Due to the large number of parameters pertaining to the regressors in the fifth study, Table 5 provides the mean values of the biases and RMSEs of the ML estimates over the elements of each of the following vectors of model parameters: μ ,v( ), B [−1, 1] for X g X g g = 1, 2, where B [−1, 1] is the vector obtained by dropping the first element from the first column of B . Thus, B [−1, 1] comprises the regression coefficients of the p covariates on g g Y given  . In a similar way, Table 10 contains the mean values of the CPs of 90% and 1 g 95% confidence intervals over μ and B [−1, 1] for g = 1, 2. X g Under the experimental conditions considered in the first Monte Carlo study, biases are generally small for all the estimated standard errors (see Table 1). The overall best perfor- mance in terms of accuracy seems to be achieved by means of the estimator Cov (ϑ ).The bootstrap approach appears to provide the most precise estimates of the standard errors of . The sandwich method is slightly more accurate than the bootstrap approach in esti- mating the standard errors of the ML estimates of the expected values of the regressors; the opposite result holds true when dealing with the estimation of the standard errors of  . The highest root mean square errors are mostly obtained using either the function cwm of the package flexCWM or the estimator Cov (ϑ ), which are therefore not recommended. These results confirm both the best performance of an estimator based on the Hessian and the poor performance of an estimator based on the gradient vector under correctly speci- fied models registered in a study dealing with multivariate normal mixture models (Boldea & Magnus, 2009). It is also worth noting that the accuracy of the approach implemented in flexCWM sharply deteriorates when the ML estimates of the intercept and regression coef- ficients of the second group (B ) are considered. As far as the effective confidence levels for the parameters μ , B [2, 1] and B [3, 1] are concerned (Table 6), the obtained results are g g generally similar to one another and quite close to the nominal confidence levels for all the examined methods except the one implemented in flexCWM. With this latter method, the effective confidence levels for the regression coefficients clearly deviate from the nominal ones, especially in the second group of observations. All these results have been employed to run tests for the hypotheses of equality between effective and nominal confidence levels. This task has been carried out through asymptotic two-tailed normal tests for a propor- tion at a 0.00125 significance level (the Bonferroni correction 0.01/8 has been adopted to account for multiple tests performed for each estimation method and each nominal con- fidence level). All the effective CPs of the confidence intervals for the model regression coefficients obtained using both the estimator based on the gradient and the approach imple- mented in flexCWM appear to be significantly different from the corresponding nominal ones (see the entries in italics in Table 6). As far as the results from the bootstrap-based and Hessian-based estimators are concerned, the null hypothesis of equality between effec- tive and nominal confidence levels should be rejected for two regression coefficients at both examined confidence levels; the same null hypothesis has to be rejected for only one regression coefficient when using the the sandwich approach. In the second Monte Carlo study, a substantial increase in the biases of the estimated ˆ ˆ standard errors of  and  has been registered with all the examined estimators except X Y g g for the sandwich method (Table 2); this latter method is also the most accurate. Using a Gaussian cluster-weighted model for the analysis of datasets generated under a uniform cluster-weighted model seems to have a little impact on the confidence intervals for both the expected values of the regressors and the regression coefficients (Table 7). When the data are obtained from two overlapping groups of observations drawn from Gaussian distributions (third study), the resulting biases and RMSEs (see Table 3) are quite similar to the ones from the first study; the main effect of the reduction in the separation Journal of Classification (2021) 38:594–625 613 of the two groups is a general slight increase in the RMSEs with all the examined meth- ods. The best accuracy is still achieved by the estimator Cov (ϑ ) for all model parameters, with the exception of  , whose standard errors are more accurately estimated by the bootstrap approach. As far as the effect of this reduction on the effective CPs of the confi- dence intervals is concerned (Table 8), the most remarkable result is an increase in the gap between nominal and effective CPs associated with the use of the approach implemented in flexCWM for the regression coefficients in the second group of observations. When the two overlapping groups of observations are generated from the uniform dis- tribution (fourth study), both biases and RMSEs of B results remarkably increased with all the examined methods (see Table 4). However, it is worth noting that the lowest of such increases has always been associated with the use of Cov (ϑ ), which is also the estimator with the best accuracy for the ML estimate of variances and covariances of the regressors in both groups and the majority of the model parameter estimates. Furthermore, the sandwich estimator shows the best performance in terms of effective CPs that are not significantly different from the nominal ones (Table 9). In the presence of datasets generated under a Gaussian cluster-weighted model with p = 8 regressors (fifth study), the most remarkable effects of a larger number of covariates on the performance of the examined estimators with samples of size I = 300 appear to be (see Table 5 in comparison with Table 1) a sharp decrease in the accuracy of the estimates of the standard errors produced by the method based on the gradient for all the parameters of the second group of observations and a deterioration in the performances of the bootstrap approach in reference to all the parameters of the conditional distribution of Y given X and ˆ ˆ , especially B [1, 1],for g = 1, 2. As far as the methods Cov (ϑ ) and Cov (ϑ ) are g g 2 3 concerned, biases and RMSEs result to be quite similar to the ones from the first study for all parameter estimates except the intercepts for both groups. Thus, the best overall accuracy is still achieved by the estimator Cov (ϑ ). The results from the five studies with samples of size 500 (see Tables A–J in the separate document with the supplementary materials) are generally in line with those just described. It is worth noting that using a larger sample size leads to a reduction in the RMSEs for all the examined estimators. With datasets containing two separated groups of observa- tions (first and second studies), all the effective CPs of the confidence intervals obtained using the sandwich approach appear to be not significantly different from the correspond- ing nominal ones. When overlapping groups are considered (third and fourth studies), the estimator Cov (ϑ ) has produced confidence intervals whose effective levels are the closest to the nominal ones. Thus, overall, the obtained results show the robustness of the sandwich method. 4.2 A Comparison with Some Estimators Under Normal Mixtures As already mentioned in the “Introduction”, three information-based estimators of the covariance matrix of the ML estimator for finite normal mixture models were developed by Boldea and Magnus (2009): two of them are based on the gradient vector and the Hessian matrix of the incomplete log-likelihood under a normal mixture model; the third estimator exploits the sandwich approach. From now on, these three estimators will be denoted as BM , BM and BM , respectively. Furthermore, it has been already highlighted that finite 1 2 3 mixtures of Gaussian distributions and linear Gaussian cluster-weighted models define the same family of probability distributions (Ingrassia et al., 2012). Thus, it could be interesting to obtain an evaluation of the relationships between the estimators described in Section 3 and the estimators developed by Boldea and Magnus (2009). This task has been numerically 614 Journal of Classification (2021) 38:594–625 performed by means of two additional simulation studies. The model employed to generate the simulated datasets is: f(x, y; ψ ) = π φ (x, y; μ ,  ), (16) g 2 g g g=1 where p = q = 1, G = 2, ψ = (π , ψ , ψ ) , ψ = (μ , v( ) ) , ψ = (μ , v( ) ) , 1 1 2 1 2 1 1 1 2 1.0 0.0 2.0 1.0 π = 0.7, π = 0.3, μ = (0, 0), μ = (, ),  = ,  = .The two 1 2 1 2 1 2 0.0 1.0 1.0 2.0 studies have been carried out using 5 and 10 as values of  so as to obtain two different levels of separation between μ and μ . In each study, 100 datasets have been generated for 1 2 each of two sample sizes: I = 100, 250. The R packages mclust (Scrucca et al., 2016)and flexCWM have been employed to compute the ML estimates of ψ in model (16) with G = 2 and the ML estimates of ϑ in model (2) with G = 2, respectively. Furthermore, the standard ˆ ˆ errors of ϑ have been estimated using Eqs. 13–15. As far as ψ is concerned, estimated standard errors have been computed according to the solutions described in Boldea and Magnus (2009). Models (16)and (2) are characterised by the same value of π . Furthermore, as illus- trated by Ingrassia et al. (2012), some elements in ϑ coincide with some elements in ψ; namely, μ [1]= μ [1],  [1, 1]=  [1, 1], μ [1]= μ [1],  [1, 1]= X 1 X X 1 1 X 2 2 1 2 [1, 1]. Thus, the comparison between the estimators described in Section 3 and the estimators developed by Boldea and Magnus (2009) has been focused on the follow- ing two subvectors of model parameters: ϑ = (π , μ [1],  [1, 1], μ [1],  [1, 1]), 1 X X X X 1 2 1 2 ψ = (π , μ [1],  [1, 1], μ [1],  [1, 1]).Let se (ϑ [j ]) be the standard error of the 1 1 2 m r 1 2 j-th element of ϑ computed using the estimator C and the r-th dataset. Furthermore, ˆ ˆ ¯ ¯ let se (ψ [j ]) be the standard error of the j-th element of ψ obtained from the estima- tor BM . In order to compare such estimated standard errors, the following differences ˆ ˆ ¯ ¯ have been computed: d (j ) = se (ψ [j ]) − se (ϑ [j ]), j = 1,..., 5, m = 1, 2, 3, rm m r m r r = 1,..., 100. The results obtained for  = 5and  = 10 with samples of size I = 100 are graphically represented in Figs. 2 and 3, respectively. The distributions of the differ- ences d (j ) for almost all the examined parameters appear to be centred around 0 and rm highly homogeneous, thus highlighting a general equivalence between the standard errors resulting from the two models. This result holds true especially for the estimators based on the Hessian matrix and the sandwich approach (m = 2, 3) when the separation between the two groups is larger ( = 10), and for the estimator based on the gradient vector (m = 1) when the separation is low ( = 5). However, it is also worth noting that with both levels of separation the differences in the standard errors of π ˆ computed using the two estimators based on the gradient vector show a median value slightly below 0 and a distribution with negative skewness. Similar results have been obtained with samples of size I = 250 (see Figures A and B in the supplementary material). 5 Analysing Regional Tourism Data in Italy Similar to other studies (see e.g. Cellini & Cuccia 2013), the analysis summarised here aims at evaluating the link between tourism flows and attendance at museums and mon- uments, with a focus on two Italian regions: Emilia Romagna (ER) and Veneto (Ve). For both regions, three variables have been examined: tourist arrivals (denoted Arriv), Journal of Classification (2021) 38:594–625 615 Fig. 2 Boxplots of the differences d (j ) with samples of size I = 100 and  = 5 rm tourist overnights (Overn) and visits to State museums, monuments and museum networks (Visit). Measurements for such variables are available with a monthly frequency over the period January 1999 to December 2017. The source for Visit is the website of the Italian Ministry of Cultural Heritage (http://www.statistica.beniculturali.it). Data on Arriv and Fig. 3 Boxplots of the differences d (j ) with samples of size I = 100 and  = 10 rm 616 Journal of Classification (2021) 38:594–625 Table 11 Maximised Gl(ϑ)BIC log-likelihood and Bayesian information criterion of eight 1 −8596.886 17,340.36 cluster-weighted models fitted to the tourism dataset 2 −8056.517 16,411.65 3 −7803.084 16,056.80 4 −7692.981 15,988.62 5 −7593.141 15,940.96 6 −7539.561 15,985.82 7 −7456.020 15,970.76 8 −7401.414 16,013.57 1 2 Overn have been obtained from the websites of Emilia-Romagna and Veneto regional governments. Overall, the analysed dataset is composed of I = 228 monthly observations for six variables. Because of the goal of the analysis, these variables have been partitioned as follows: Y = (Visit ER, Visit Ve) , X = (Arriv ER, Overn ER, Arriv Ve, Overn Ve) . The analysed data are expressed in thousands. Models obtained from Eqs. 2–3 have been fitted to the dataset for G from 1 to 8. To this end, a function written for the R software environment which implements an EM algo- rithm for the ML estimation of multivariate Gaussian cluster-weighted linear models has been employed. In order to prevent problems due to the presence of singular or nearly singular matrices during the iterations, all covariance matrices have been required to have −20 eigenvalues greater than 10 ; furthermore, the ratio between the smallest and the largest −10 eigenvalues of such matrices is required to be not lower than 10 . Model parameters are initialised according to a two-step strategy. In the first step, the joint distribution of the covariates and responses is estimated using a mixture of G Gaussian models through the mclust package. This produces the required starting values for the G weights, mean vec- tors and covariance matrices for the predictors. In the second step, the initialisation of B and  is obtained from the fitting of a Gaussian linear regression model to the sample of observations that have been assigned to the gth component of the mixture model esti- mated in the first step. The R package systemfit (Henningsen & Hamann, 2007)has been exploited to perform this task. The maximum number of iterations of the EM algo- rithm has been set equal to 500. A convergence criterion based on the Aitken acceleration −8 has been used, with a threshold  = 10 . Table 11 shows the values of the maximised log-likelihood (l(ϑ )) and the Bayesian information criterion (BIC) (Schwarz, 1978) for the eight fitted models, where BI C = 2l(ϑ ) − npar ln(I ),and npar denotes the number of model parameters. According to this criterion, the model with the best trade-off between the fit and complexity seems to be the one with G = 5 clusters of months. With this model, there is a perfect correspondence between some clusters and some months (see Table 12): cluster 2 only contains observa- tions in April and May; cluster 4 only contains observations in June, July and August; cluster 5 only contains observations in January, February, November and December. As far as the remaining months are concerned, observations in September for the years 2005–2017 have been assigned to cluster 1; cluster 3 comprises the remaining observations in Septem- ber together with all the observations in March and October. The obtained cluster structure https://statistica.regione.emilia-romagna.it/turismo https://www.veneto.eu/web/area-operatori/statistiche Journal of Classification (2021) 38:594–625 617 Table 12 Cross-classification of the observations from the tourism dataset, based on their variable time identified by month and maximum posterior probability estimated from the cluster-weighted model with G = 5 g Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 10 0 0 0 0 0 0 0 13 0 0 0 2 0 0 0 19 19 0 0 0 0 0 0 0 30 0 19 0 0 0 0 0 6 19 0 0 4 0 0 0 0 0 19 19 19 0 0 0 0 519 19 0 0 0 0 0 0 0 0 19 19 clearly reflects seasonal patterns characterising tourism flows. Observations in cluster 4 (June, July and August) are characterized by the highest mean values of tourist arrivals and overnights in both regions, followed by those in cluster 1 (September 2005–2017), cluster 2 (April and May), cluster 3 (March, October and September 1999–2004) and cluster 4 (from November to February) (see the μ ˆ [j ] in Table 13). In all clusters, Veneto is characterised by mean values of both regressors which are always higher than those of Emilia-Romagna. During the examined period of time, there seems to be heterogeneity also on the effects of the tourist arrivals and overnights on the number of visits in both regions (see the B [j, k] in Table 13). Such effects do not result to be always positive. Furthermore, an increase in the tourist arrivals and overnights in one region does not necessarily have a positive impact on the number of visits to State museums, monuments and museum networks of the other region. Estimates of the standard errors for the parameter estimates of the selected cluster- weighted model have been computed by the boostrap approach, using 100 bootstrap samples generated from the selected model. Furthermore, Cov(ϑ ) has been estimated by resorting to Eqs. 13–15. The algorithm by Higham (1988), as implemented in the R package corpcor, Table 13 Estimated π , μ and B of the cluster-weighted model with G = 5 fitted to the tourism dataset g g g 12 34 5 π ˆ 0.057 0.168 0.192 0.250 0.333 μ [1] 848.1 766.9 515.6 1327.0 338.5 μ [2] 3654.3 2158.4 1573.2 7765.1 864.7 μ [3] 1634.4 1250.6 908.2 2095.6 561.6 μ [4] 6867.7 3906.2 2852.5 11420.1 1587.0 B [1, 1] 47.771 197.894 5.204 −0.408 −2.183 B [2, 1]−0.173 0.229 0.447 0.138 0.171 B [3, 1] 0.021 −0.037 −0.089 0.008 −0.020 B [4, 1] 0.141 −0.124 −0.215 −0.061 0.013 B [5, 1]−0.018 0.007 0.063 −0.005 −0.008 B [1, 2] 95.694 85.555 23.956 36.059 −19.050 B [2, 2]−0.121 0.002 0.181 0.132 −0.425 B [3, 2] 0.022 0.025 −0.007 −0.010 0.141 B [4, 2] 0.114 0.071 0.020 −0.053 0.172 B [5, 2]−0.024 −0.029 −0.013 0.005 −0.004 g 618 Journal of Classification (2021) 38:594–625 Table 14 B [j, k], estimated standard errors, z values and p-values obtained using the bootstrap (columns g gj k 5–7) and Eq. 13 (columns 8–10) ˆ ˆ ˆ gj k B [j, k] se(B [j, k]) z p-value se(B [j, k]) z p-value g g gj k g gj k 12 1 −0.173 0.318 −0.545 0.586 2.090 −0.083 0.934 1 3 1 0.021 0.078 0.262 0.793 0.305 0.067 0.946 1 4 1 0.141 0.172 0.820 0.412 0.648 0.217 0.828 15 1 −0.018 0.055 −0.328 0.743 0.133 −0.134 0.893 12 2 −0.121 0.330 −0.366 0.715 1.472 −0.082 0.935 1 3 2 0.022 0.080 0.278 0.781 0.196 0.113 0.910 1 4 2 0.114 0.175 0.652 0.515 0.474 0.240 0.810 15 2 −0.024 0.045 −0.538 0.590 0.053 −0.457 0.648 2 2 1 0.229 0.167 1.373 0.170 0.328 0.698 0.485 23 1 −0.037 0.044 −0.827 0.408 0.101 −0.364 0.716 24 1 −0.124 0.111 −1.121 0.262 0.196 −0.633 0.527 2 5 1 0.007 0.030 0.216 0.829 0.052 0.125 0.901 2 2 2 0.002 0.178 0.009 0.993 0.136 0.012 0.991 2 3 2 0.025 0.047 0.532 0.595 0.029 0.848 0.396 2 4 2 0.071 0.071 1.002 0.316 0.072 0.993 0.321 25 2 −0.029 0.021 −1.395 0.163 0.016 −1.868 0.062 3 2 1 0.447 0.148 3.023 0.003 0.168 2.663 0.008 33 1 −0.089 0.040 −2.224 0.026 0.037 −2.437 0.015 34 1 −0.215 0.094 −2.282 0.022 0.083 −2.570 0.010 3 5 1 0.063 0.028 2.299 0.022 0.028 2.270 0.023 3 2 2 0.181 0.195 0.927 0.354 0.145 1.251 0.211 33 2 −0.007 0.047 −0.154 0.878 0.036 −0.204 0.838 3 4 2 0.020 0.074 0.265 0.791 0.076 0.260 0.795 35 2 −0.013 0.018 −0.710 0.478 0.028 −0.456 0.648 4 2 1 0.138 0.088 1.565 0.118 0.058 2.384 0.017 4 3 1 0.008 0.024 0.343 0.732 0.011 0.778 0.436 44 1 −0.061 0.055 −1.100 0.271 0.035 −1.748 0.080 45 1 −0.005 0.013 −0.363 0.717 0.006 −0.737 0.461 4 2 2 0.132 0.200 0.657 0.511 0.032 4.130 0.000 43 2 −0.010 0.054 −0.195 0.846 0.008 −1.350 0.177 44 2 −0.053 0.084 −0.635 0.526 0.018 −2.941 0.003 4 5 2 0.005 0.015 0.337 0.736 0.004 1.155 0.248 5 2 1 0.171 0.082 2.078 0.038 0.069 2.489 0.013 53 1 −0.019 0.019 −1.017 0.309 0.018 −1.056 0.291 5 4 1 0.013 0.049 0.265 0.791 0.033 0.385 0.701 55 1 −0.008 0.008 −0.972 0.331 0.012 −0.647 0.517 52 2 −0.425 0.260 −1.635 0.102 0.108 −3.925 0.000 5 3 2 0.141 0.070 2.026 0.043 0.029 4.919 0.000 5 4 2 0.171 0.094 1.824 0.068 0.049 3.472 0.001 55 2 −0.004 0.013 −0.340 0.734 0.016 −0.280 0.779 Journal of Classification (2021) 38:594–625 619 Table 15 B [j, k], estimated standard errors, z values and p-values obtained using Eqs. 14 (columns 5–7) g gj k and 15 (columns 8–10) ˆ ˆ ˆ gj k B [j, k] se(B [j, k]) z p-value se(B [j, k]) z p-value g g gj k g gj k 12 1 −0.173 0.141 −1.234 0.217 0.157 −1.106 0.269 1 3 1 0.021 0.029 0.712 0.477 0.026 0.794 0.427 1 4 1 0.141 0.061 2.319 0.020 0.070 2.012 0.044 15 1 −0.018 0.015 −1.191 0.234 0.012 −1.478 0.139 12 2 −0.121 0.077 −1.570 0.116 0.070 −1.719 0.086 1 3 2 0.022 0.016 1.404 0.160 0.013 1.761 0.078 1 4 2 0.114 0.033 3.442 0.001 0.029 3.909 0.000 15 2 −0.024 0.008 −2.975 0.003 0.006 −4.254 0.000 2 2 1 0.229 0.162 1.411 0.158 0.148 1.545 0.122 23 1 −0.037 0.045 −0.817 0.414 0.037 −0.986 0.324 24 1 −0.124 0.098 −1.261 0.207 0.091 −1.366 0.172 2 5 1 0.007 0.024 0.268 0.789 0.022 0.302 0.763 2 2 2 0.002 0.090 0.017 0.986 0.091 0.017 0.986 2 3 2 0.025 0.025 1.004 0.315 0.027 0.916 0.360 2 4 2 0.071 0.055 1.302 0.193 0.057 1.254 0.210 25 2 −0.029 0.014 −2.146 0.032 0.015 −1.944 0.052 3 2 1 0.447 0.107 4.164 0.000 0.125 3.570 0.000 33 1 −0.089 0.027 −3.245 0.001 0.029 −3.110 0.002 34 1 −0.215 0.058 −3.691 0.000 0.060 −3.568 0.000 3 5 1 0.063 0.022 2.889 0.004 0.022 2.865 0.004 3 2 2 0.181 0.094 1.927 0.054 0.108 1.683 0.092 33 2 −0.007 0.025 −0.296 0.767 0.024 −0.306 0.759 3 4 2 0.020 0.052 0.381 0.704 0.052 0.379 0.705 35 2 −0.013 0.020 −0.660 0.509 0.019 −0.679 0.497 4 2 1 0.138 0.034 3.994 0.000 0.025 5.423 0.000 4 3 1 0.008 0.007 1.135 0.256 0.007 1.104 0.270 44 1 −0.061 0.021 −2.858 0.004 0.016 −3.751 0.000 45 1 −0.005 0.004 −1.051 0.293 0.004 −1.056 0.291 4 2 2 0.132 0.026 5.083 0.000 0.027 4.954 0.000 43 2 −0.010 0.005 −1.914 0.056 0.005 −2.015 0.044 44 2 −0.053 0.016 −3.346 0.001 0.017 −3.052 0.002 4 5 2 0.005 0.003 1.527 0.127 0.003 1.511 0.131 5 2 1 0.171 0.055 3.132 0.002 0.056 3.069 0.002 53 1 −0.019 0.016 −1.240 0.215 0.017 −1.156 0.247 5 4 1 0.013 0.024 0.540 0.589 0.022 0.576 0.564 55 1 −0.008 0.008 −0.920 0.357 0.008 −1.016 0.310 52 2 −0.425 0.075 −5.695 0.000 0.075 −5.684 0.000 5 3 2 0.141 0.022 6.561 0.000 0.022 6.271 0.000 5 4 2 0.171 0.033 5.261 0.000 0.028 6.032 0.000 55 2 −0.004 0.012 −0.379 0.704 0.011 −0.391 0.696 620 Journal of Classification (2021) 38:594–625 has been employed to adjust the eigenvalues of I and I so as to obtain their nearest pos- 1 2 itive definite matrices. Then, the estimated standard errors have been employed to run tests for the hypotheses H : B [j, k]= 0for j = 2,...,p, k = 1,...,q, g = 1,...,G.Such 0 g tests have been run under an asymptotic normal distribution for the z statistics, where gj k B [j,k] ˆ ˆ z = , with se(B [j, k]) denoting the estimated standard error of B [j, k]. gj k g g se(B [j,k]) Table 14 summarises the results obtained using the parametric bootstrap-based estimates and the score-based estimates; the results derived from the use of the other two estimators are reported in Table 15. According to all the examined methods and using α = 0.05, the four examined regressors show a significant linear effect on the visits to State museums, monuments and museum networks in Emilia Romagna over the period 1999 to 2017 in March, October and over the period 1999 to 2004 in September (cluster 3); furthermore, in the central months of the winters from 1999 to 2017 (cluster 5), there are positive signifi- cant effects of Arriv on Visit in Emilia Romagna and of Overn on Visit in Veneto. According to the estimator based on I , five additional regression coefficients (three within cluster 4, two within cluster 5) result to be significantly different from zero; the same con- ˆ ˆ clusion is obtained from the use of Cov (ϑ ) and Cov (ϑ ). The results obtained using the 2 3 three methods illustrated in Section 3 suggest that tourist arrivals in Emilia Romagna have a positive significant effect on the visits to State museums, monuments and museum networks both in Emilia Romagna and in Veneto in June, July and August. Arriv ER has a sig- nificant and negative effect on Visit Ve in January, February, November and December. Furthermore, Arriv Ve seems to significantly affect Visit Ve, with a positive effect in January, February, November and December and a negative effect in June, July and August. ˆ ˆ The tests based on the estimators Cov (ϑ ) and Cov (ϑ ) lead to a rejection of H for fur- 2 3 0 ther five regression coefficients, three of which concern the effect of tourist arrivals and overnights in Veneto in September for the years 2005–2017 (cluster 1). As far as cluster 2 is concerned, all regression coefficients seem to be not significantly different from 0 accord- ing to all approaches except for the effect of Overn Ve on Visit Ve, which results to be significant only when standard errors are estimated from the Hessian matrix. By exploiting the results contained in the non-diagonal elements of matrices Cov (ϑ ), ˆ ˆ Cov (ϑ ) and Cov (ϑ ), it is also possible to run tests for the significance of the difference 2 3 between the effects of two different predictors on the same response in a given cluster or tests for the significance of the difference between the effects of the same predictor on the same response in two different clusters. For any given pair of regression coefficients in the model, the null hypothesis can be expressed as follows: H : B [j ,k ]= B [j ,k ]. 0 g 1 1 g 2 2 1 2 Three illustrative examples of hypotheses for the analysis of the tourism dataset are sum- ˆ ˆ ˆ marisedinTables 16 and 17,where δ = B [j ,k ]− B [j ,k ].These (g ,j ,k )(g ,j ,k ) g 1 1 g 2 2 1 1 1 2 2 2 1 2 examples have been obtained from the following questions: Table 16 Values of δ for testing three examples of H : B [j ,k ]= B [j ,k ] in the (g ,j ,k )(g ,j ,k ) 0 g 1 1 g 2 2 1 1 1 2 2 2 1 2 tourism data Example (g ,j ,k)(g ,j ,k ) δ 1 1 1 2 2 2 (g ,j ,k )(g ,j ,k ) 1 1 1 2 2 2 1 (4, 2, 1)(4, 2, 2) 0.0061 2 (5, 2, 1)(5, 4, 2) −0.0007 3 (3, 2, 1)(5, 2, 1) 0.2757 Journal of Classification (2021) 38:594–625 621 Table 17 Estimated standard errors (se)of δ , z values and p-values obtained using Eqs. 13 (g ,j ,k )(g ,j ,k ) 1 1 1 2 2 2 (columns 2–4), 14 (columns 5–7) and 15 (columns 8–10) for testing three examples of H : B [j ,k ]= 0 g 1 1 B [j ,k ] in the tourism data g 2 2 Example se z p-value se z p-value se z p-value 1 0.057 0.107 0.915 0.039 0.155 0.877 0.034 0.177 0.859 2 0.101 −0.007 0.995 0.079 −0.008 0.993 0.079 −0.008 0.993 3 0.181 1.522 0.128 0.120 2.292 0.022 0.137 2.015 0.044 (a) In June, July and August (cluster 4), do tourist arrivals in Emilia Romagna have a dif- ferent effect on Visit ER and Visit Ve (first example)? (b) In January, February, November and December (cluster 5), is the effect of tourist arrivals in Emilia Romagna on Visit ER different from the effect of tourist arrivals in Veneto on Visit Ve (second example)? (c) Is the effect of tourist arrivals in Emilia Romagna on Visit ER in January, February, November and December (cluster 5) different from the one in March, September 1999– 2004 and October (cluster 3) (third example)? For each of these illustrations, Table 17 summarises the results of the z tests run using the three estimated covariance matrices (the non-diagonal elements are not reported). For the first two examples, the compared methods lead to results which are all in favour of the null hypothesis. In the third illustration, the null hypothesis should be rejected (α = 0.05) when variances and covariances are estimated using both the Hessian and sandwich approaches. 6 Conclusions Three information-based estimators of the asymptotic covariance matrix of the ML esti- mator under multivariate Gaussian linear cluster-weighted models have been illustrated. For their computation, formulae for the score vector and Hessian matrix of the incomplete log-likelihood have been derived. Properties of these estimators have been numerically eval- uated using simulated samples in comparison with the parametric bootstrap-based estimator. For the ML estimates of the model intercepts and regression coefficients, the comparison has included an approach implemented in the package flexCWM in which estimated stan- dard errors are computed by fitting G separate linear weighted regression models using the estimated posterior probabilities as weights. With correctly specified models, the most accurate estimator of the standard error of the ML estimator is the one based on the Hes- sian matrix. When Gaussian cluster-weighted models are fitted to datasets generated from a uniform distribution, the best accuracy is achieved with the sandwich estimator. Overall, the obtained results show the robustness of this latter method. Through these information- based estimators, the tasks of computing approximated confidence intervals and running tests concerning pairs of parameters can be easily carried out, as illustrated through a study aiming at evaluating the link between tourism flows and attendance at museums and monu- ments in two Italian regions. Asymptotic properties of the estimators introduced here could also be studied from a theoretical point of view. For example, suitable regularity conditions can be defined so as to provide a general assessment of their consistency (see for example Galimberti et al. (2020) for a similar study in the context of clusterwise regression models). 622 Journal of Classification (2021) 38:594–625 Appendix Theorem 1 The score vector S(ϑ ) for a cluster-weighted model of order G contains the following subvectors: I I I ∂l(ϑ ) ∂l(ϑ ) ∂l(ϑ ) 1 = a ¯ , = α f ∀g, = α J vec(F ) ∀g, i gi gi gi gi ∂π ∂μ ∂v( ) 2 X g i=1 i=1 i=1 I I ∂l(ϑ ) ∂l(ϑ ) 1 = α vec(x o ) ∀g, =− α L vec(O ) ∀g, gi gi gi i gi ∂vec(B ) ∂v( ) 2 g Y i=1 i=1 G q(q+1) with a ¯ = α a , L and J denoting duplication matrices with dimensions q × i gi g g=1 2 p(p+1) and p × , respectively. Theorem 2 The Hessian matrix H(ϑ ) for a cluster-weighted model of order G contains the following submatrices: I I 2 2 ∂ l(ϑ ) ∂ l(ϑ ) =− a ¯ a ¯ , = α (a −¯ a )vec(x o ) ∀g, i gi g i i i gi ∂π ∂π ∂π∂(vec(B )) i=1 i=1 2 2 ∂ l(ϑ ) 1 ∂ l(ϑ ) =− α (a −¯ a )vec(O ) L ∀g, gi g i gi ∂π∂(v( )) 2 ∂π ∂μ g X i=1 = α (a −¯ a )f ∀g, gi g i gi i=1 ∂ l(ϑ ) 1 =− α (a −¯ a )vec(F ) J ∀g, gi g i gi ∂π∂(v( )) 2 i=1 ∂ l(ϑ ) −1 ∗ ∗ =− α  ⊗ (x x ) gi Y i i ∂vec(B )∂(vec(B )) g g i=1 ∗  ∗ −(1 − α )vec(x o )vec(x o ) ∀g, gi i gi i gi ∂ l(ϑ ) ∗  ∗ =− α α vec(x o )vec(x o ) ∀g = j, gi ji i gi i ji ∂vec(B )∂(vec(B )) g j i=1 ∂ l(ϑ ) −1 =− α  ⊗ (x o ) gi i gi ∂vec(B )∂(v( )) g Y i=1 + (1 − α )vec(x o )vec(O ) L ∀g, gi gi i gi ∂ l(ϑ ) 1 = α α vec(x o )vec(O ) L ∀g = j, gi ji ji i gi ∂vec(B )∂(v( )) 2 g Y i=1 ∂ l(ϑ ) = α (1 − α )vec(x o )f ∀g, gi gi i gi gi ∂vec(B )∂μ i=1 Journal of Classification (2021) 38:594–625 623 ∂ l(ϑ ) =− α α vec(x o )f ∀g = j, gi ji i gi ji ∂vec(B )∂μ i=1 ∂ l(ϑ ) 1 =− α (1 − α )vec(x o )vec(F ) J ∀g, gi gi gi i gi ∂vec(B )∂(v( )) 2 g X i=1 ∂ l(ϑ ) 1 = α α vec(x o )vec(F ) J ∀g = j, gi ji ji i gi ∂vec(B )∂(v( )) 2 g X i=1 ∂ l(ϑ ) 1 −1  −1 =− α L ( − 2O ) ⊗ gi gi Y Y g g ∂v( )∂(v( )) 2 Y Y g g i=1 − (1 − α )vec(O )vec(O ) L ∀g, gi gi gi ∂ l(ϑ ) 1 =− α α L vec(O )vec(O ) L ∀g = j, gi ji gi ji ∂v( )∂(v( )) 4 Y Y g j i=1 ∂ l(ϑ ) 1 =− α (1 − α )L vec(O )f ∀g, gi gi gi gi ∂v( )∂μ 2 g X i=1 ∂ l(ϑ ) 1 = α α L vec(O )f ∀g = j, gi ji gi ji ∂v( )∂μ 2 i=1 ∂ l(ϑ ) 1 = α (1 − α )L vec(O )vec(F ) J ∀g, gi gi gi gi ∂v( )∂(v( )) 4 Y X g g i=1 ∂ l(ϑ ) 1 =− α α L vec(O )vec(F ) J ∀g = j, gi ji gi ji ∂v( )∂(v( )) 4 Y X g j i=1 ∂ l(ϑ ) −1 =− α  − (1 − α )f f ∀g, gi gi gi X gi ∂μ ∂μ X X g i=1 ∂ l(ϑ ) =− α α f f ∀g = j, gi ji gi ji ∂μ ∂μ g j i=1 ∂ l(ϑ ) 1 −1 =− α (f ⊗  ) + (1 − α )f vec(F ) J ∀g, gi gi gi gi gi ∂μ ∂(v( )) 2 X X g g i=1 ∂ l(ϑ ) 1 = α α f vec(F ) J ∀g = j, gi ji ji gi ∂μ ∂(v( )) 2 g j i=1 ∂ l(ϑ ) 1 −1 −1 =− α J ( − 2F ) ⊗ gi gi X X g g ∂v( )∂(v( )) 2 X X g g i=1 − (1 − α )vec(F )vec(F ) J ∀g, gi gi gi ∂ l(ϑ ) 1 =− α α J vec(F )vec(F ) J ∀g = j. gi ji gi ji ∂v( )∂(v( )) 4 X X g j i=1 624 Journal of Classification (2021) 38:594–625 Funding Open access funding provided by Alma Mater Studiorum - Universita di Bologna within the CRUI- CARE Agreement. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. References Basford, K. E., Greenway, D. R., McLachlan, G. J., & Peel, D (1997). Standard errors of fitted component means of normal mixtures. Computational Statistics, 12, 1–17. Boldea, O., & Magnus, J. R. (2009). Maximum likelihood estimation of the multivariate normal mixture model. Journal of the American Statistical Association, 104, 1539–1549. Cellini, R., & Cuccia, T. (2013). Museum and monument attendance and tourism flow: A time series analysis approach. Applied Economics, 45, 3473–3482. Dang, U., Punzo, A., McNicholas, P., Ingrassia, S., & Browne, R. (2017). Multivariate response and parsimony for Gaussian cluster-weighted models. Journal of Classification, 34, 4–34. Dempster, A. P., Laird, N. M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39, 1–38. Di Mari, R., Bakk, Z., & Punzo, A. (2020). A random-covariate approach for distal outcome prediction with latent class analysis. Structural Equation Modeling: A Multidisciplinary Journal, 27(3), 351–368. Gershenfeld, N. (1997). Nonlinear inference and cluster-weighted modeling. Annals of the New York Academy of Sciences, 808, 18–24. Galimberti, G., Nuzzi, L., & Soffritti, G. (2021). Covariance matrix estimation of the maximum likelihood estimation in multivariate clusterwise linear regression. Statistical Methods & Applications, 30, 235– Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction, 2nd edn. New York: Springer. Henningsen, A., & Hamann, J. D. (2007). systemfit: A package for estimating systems of simultaneous equations in R. Journal of Statistical Software, 23(4), 1–40. Higham, N. J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications, 103, 103–118. Ingrassia, S., Minotti, S. C., & Vittadini, G. (2012). Local statistical modeling via a cluster-weighted approch with elliptical distributions. Journal of Classification, 29, 363–401. Ingrassia, S., Minotti, S. C., & Punzo, A. (2014). Model-based clustering via linear cluster-weighted models. Computational Statistics & Data Analysis, 71, 159–182. Ingrassia, S., Punzo, A., Vittadini, G., & Minotti, S.C. (2015). The generalized linear mixed cluster-weighted model. Journal of Classification, 32, 85–113. Magnus, J. R., & Neudecker, H. (1988). Matrix differential calculus with applications in statistics and econometrics. New York: Wiley. Mazza, A., Punzo, A., & Ingrassia, S. (2018). flexCWM: A flexible framework for cluster-weighted models. Journal of Statistical Software, 86(2), 1–30. McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley. Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. In Handbook of econometrics (Volume 4, Chapter 36, pp. 2111–2245). Newton, M. A., & Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical Society: Series B, 56, 3–48. Punzo, A., & Ingrassia, S. (2013). On the use of the generalized linear exponential cluster-weighted model to assess local linear independence in bivariate data. QdS - Journal of Methodological and Applied Statistics, 15, 131–144. Punzo, A. (2014). Flexible mixture modeling with the polynomial Gaussian cluster-weighted model. Statistical Modelling, 14, 257–291. Journal of Classification (2021) 38:594–625 625 Punzo, A., & Ingrassia, S. (2016). Clustering bivariate mixed-type data via the cluster-weighted model. Computational Statistics, 31, 989–1013. Punzo, A., & McNicholas, P. D. (2017). Robust clustering in regression analysis via the contaminated Gaussian cluster-weighted model. Journal of Classification, 34, 249–293. R Core Team. (2020). R: a language and environment for statistical computing Vienna. Austria: R Foundation for Statistical Computing. Ritter, G. (2015). Robust cluster analysis and variable selection. Boca Raton: CRC Press. Schafer, ¨ J., & Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), Article, 32. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464. Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A.E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8/1, 205–223. Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P.D. (2013). Clustering and classification via cluster- weighted factor analyzers. Advances in Data Analysis and Classification, 7(1), 5–40. Subedi, S., Punzo, A., Ingrassia, S., & McNicholas, P.D. (2015). Cluster-weighted t-factor analyzers for robust model-based clustering and dimension reduction. Statistical Methods & Applications, 24, 623– White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for het- eroskedasticity. Econometrica, 48, 817–838. White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25. Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Journal

Journal of ClassificationSpringer Journals

Published: Oct 1, 2021

Keywords: Gaussian mixture model; Hessian matrix; Linear regression; Model-based cluster analysis; Sandwich estimator; Score vector

There are no references for this article.