S. Ingrassia, A. Punzo (2019). Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition. Journal of Classification, 37.
A. Punzo (2012). Flexible Mixture Modeling with the Polynomial Gaussian Cluster-Weighted Model. arXiv: Methodology.
G. McLachlan, D. Peel (2000). Finite Mixture Models.
A. Dempster, N. Laird, D. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society: Series B (Methodological), 39.
M. Vichi (2011). Studies in Classification, Data Analysis, and Knowledge Organization.
W. DeSarbo, W. Cron (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5.
C. Viroli (2011). Finite mixtures of matrix normal distributions for classifying three-way data. Statistics and Computing, 21.
S. Ingrassia, A. Punzo, G. Vittadini, S. Minotti (2015). The Generalized Linear Mixed Cluster-Weighted Model. Journal of Classification, 32.
A. Mazza, A. Punzo, S. Ingrassia (2018). flexCWM: A Flexible Framework for Cluster-Weighted Models. Journal of Statistical Software, 86.
A. Mazza, M. Battisti, S. Ingrassia, A. Punzo (2017). Modeling Return to Education in Heterogeneous Populations: An Application to Italy. Statistical Learning of Complex Data.
G. Millo, G. Carmeci (2011). Non-life insurance consumption in Italy: a sub-regional panel data analysis. Journal of Geographical Systems, 13.
S. Tomarchio, A. Punzo, L. Bagnato (2020). Two new matrix-variate distributions with application in model-based clustering. Computational Statistics & Data Analysis, 152.
A. Punzo, S. Ingrassia (2016). Clustering bivariate mixed-type data via the cluster-weighted model. Computational Statistics, 31.
U. Dang, P. McNicholas (2013). Families of Parsimonious Finite Mixtures of Regression Models. arXiv: Methodology.
A. Punzo, S. Ingrassia (2015). Parsimonious Generalized Linear Gaussian Cluster-Weighted Models.
L. Hubert, P. Arabie (1985). Comparing partitions. Journal of Classification, 2.
S. Ingrassia, S. Minotti, A. Punzo (2012). Model-based clustering via linear cluster-weighted models. Computational Statistics & Data Analysis, 71.
L. Anderlucci, A. Montanari, C. Viroli (2014). A Matrix-Variate Regression Model with Canonical States: An Application to Elderly Danish Twins. Statistica, 74.
L. Anderlucci, C. Viroli (2014). Covariance pattern mixture models for the analysis of multivariate heterogeneous longitudinal data. The Annals of Applied Statistics, 9.
S. Ingrassia, A. Punzo (2016). Decision boundaries for mixtures of regressions. Journal of the Korean Statistical Society, 45.
U. Dang, A. Punzo, P. McNicholas, S. Ingrassia, R. Browne (2014). Multivariate Response and Parsimony for Gaussian Cluster-Weighted Models. Journal of Classification, 34.
G. Millo, G. Piras (2012). splm: Spatial Panel Data Models in R. Journal of Statistical Software, 47.
S. Subedi, A. Punzo, S. Ingrassia, P. McNicholas (2012). Clustering and classification via cluster-weighted factor analyzers. Advances in Data Analysis and Classification, 7.
C. Dayton, G. Macready (1988). Concomitant-Variable Latent-Class Models. Journal of the American Statistical Association, 83.
A. Punzo, P. McNicholas (2014). Robust Clustering in Regression Analysis via the Contaminated Gaussian Cluster-Weighted Model. Journal of Classification, 34.
F. Leisch (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11.
N. Gershenfeld (1997). Nonlinear Inference and Cluster-Weighted Modeling. Annals of the New York Academy of Sciences, 808.
W. Qiu, H. Joe (2015). clusterGeneration: Random Cluster Generation (with Specified Degree of Separation).
S. Ingrassia, S. Minotti, G. Vittadini (2012). Local Statistical Modeling via a Cluster-Weighted Approach with Elliptical Distributions. Journal of Classification, 29.
G. Schwarz (1978). Estimating the Dimension of a Model. Annals of Statistics, 6.
P. McNicholas (2016). Model-Based Clustering. Journal of Classification, 33.
N. Gershenfeld, B. Schoner, E. Metois (1999). Cluster-weighted modelling for time-series analysis. Nature, 397.
M. Cugmas, A. Ferligoj (2015). On comparing partitions. International Federation of Classification Societies.
V. Melnykov, X. Zhu (2018). On model-based clustering of skewed matrix data. Journal of Multivariate Analysis, 167.
X. Meng, D. Rubin (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80.
P. McNicholas (2016). Mixture Model-Based Classification.
M. Gallaugher, P. McNicholas (2017). A matrix variate skew-t distribution. Stat, 6.
R Core Team (2018). R: A language and environment for statistical computing.
P. Dutilleul (1999). The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64.
(2006). On being strategic about the first year.
C. Viroli (2012). On matrix-variate regression analysis. Journal of Multivariate Analysis, 111.
S. Michael, V. Melnykov (2016). An effective strategy for initializing the EM algorithm in finite mixture models. Advances in Data Analysis and Classification, 10.
R. Maitra, V. Melnykov (2010). Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms. Journal of Computational and Graphical Statistics, 19.
S. Sarkar, X. Zhu, V. Melnykov, S. Ingrassia (2020). On parsimonious models for modeling matrix data. Computational Statistics & Data Analysis, 142.
V. Melnykov, X. Zhu (2018). Studying crime trends in the USA over the years 2000–2012. Advances in Data Analysis and Classification, 13.
R Core Team (2014). R: A language and environment for statistical computing. MSOR Connections, 1.
S. Frühwirth-Schnatter (2006). Finite Mixture and Markov Switching Models.
X. Meng, D. van Dyk (1997). The EM Algorithm: An Old Folk-song Sung to a Fast New Tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59.
M. Gallaugher, P. McNicholas (2018). Finite mixtures of skewed matrix variate distributions. Pattern Recognition, 80.
C. Davino, D. Vistocco (2015). Advances in statistical models for data analysis.
Finite mixtures of regressions with fixed covariates are a commonly used model-based clustering methodology for regression data. However, they assume assignment independence, i.e., the allocation of data points to the clusters is made independently of the distribution of the covariates. To take the latter aspect into account, finite mixtures of regressions with random covariates, also known as cluster-weighted models (CWMs), have been proposed in the univariate and multivariate literature. In this paper, the CWM is extended to matrix data, e.g., data where a set of variables is simultaneously observed at different time points or locations. Specifically, the cluster-specific marginal distribution of the covariates and the cluster-specific conditional distribution of the responses given the covariates are assumed to be matrix normal. Maximum likelihood parameter estimates are derived using an expectation-conditional maximization algorithm. Parameter recovery, classification assessment, and the capability of the Bayesian information criterion to detect the underlying groups are investigated using simulated data. Finally, two real data applications concerning educational indicators and the Italian non-life insurance market are presented.

Keywords: Mixture models · Matrix-variate · Classification · Clustering

1 Introduction

Finite mixture models have seen increasing use over the last decades (for a recent survey, see McNicholas 2016b). Because of their flexibility, they are a suitable statistical tool for modeling a wide range of phenomena characterized by unobserved heterogeneity, and they constitute a powerful device for clustering and classification. Specifically, in a typical "direct application," each mixture component represents a group (or cluster) within the data, and the scope is to identify these groups and estimate the parameters of the group-conditional distributions (McLachlan and Peel 2000; McNicholas 2016a).
If no exogenous variables explain the means and the variances of the components, such models are also called unconditional mixture models. However, when there is a linear relationship between some of the variables, important insight can be gained by accounting for the functional dependencies between them. For this reason, finite mixtures of regression models with fixed covariates (FMR) have been proposed in the literature (see DeSarbo and Cron 1988; Frühwirth-Schnatter 2006, for examples). Finite mixtures of regression models with concomitant variables (FMRC; Dayton and Macready 1988) are an extension of FMR in which the mixing weights depend on some concomitant variables (often the same covariates) and are usually modeled by a multinomial logistic model (see Ingrassia and Punzo 2016, 2020 and Mazza et al. 2019 for details). Unfortunately, these methodologies do not explicitly use the distribution of the covariates for clustering, i.e., the assignment of data points to clusters does not directly utilize any information from the distribution of the covariates.

As an alternative to these approaches, finite mixtures of regressions with random covariates (Gershenfeld 1997; Gershenfeld et al. 1999), also known as cluster-weighted models (CWMs), allow for such a functional dependency. This occurs because, for each mixture component, CWMs decompose the joint distribution of responses and covariates into the product of the marginal distribution of the covariates and the conditional distribution of the responses given the covariates. Several CWMs have been introduced in the univariate and multivariate literature.

Correspondence: Salvatore D. Tomarchio, daniele.tomarchio@unict.it. Department of Economics and Business, University of Catania, Catania, Italy; Department of Mathematics and Statistics, McMaster University, Hamilton, Canada. Journal of Classification (2021) 38:556–575.
Most of them consider a univariate response variable, along with a set of covariates, modeled by a univariate and a multivariate distribution, respectively (see Ingrassia et al. 2012, 2014; Punzo 2014, for examples). Fewer CWMs exist for the case of a multivariate response (Punzo and McNicholas 2017; Dang et al. 2017).

In recent years, there has been an increasing interest in applications involving matrix-variate (three-way) data, e.g., Viroli (2011), Anderlucci et al. (2015), Gallaugher and McNicholas (2017, 2018), Melnykov and Zhu (2018, 2019), and Tomarchio et al. (2020). This data structure can occur in several different application domains, such as longitudinal data on multiple response variables, spatial multivariate data, multivariate repeated measures, or spatio-temporal data. Nevertheless, there exists a limited number of contributions involving matrix-variate regression models. First introduced by Viroli (2012), these models are a natural generalization of the multivariate-multiple regression (see also Anderlucci et al. 2014). Within the mixture framework, finite mixtures of matrix-variate regressions (MN-FMR) have recently been considered by Melnykov and Zhu (2019). There are no matrix-variate CWMs in the literature, and this paper aims to fill this gap by introducing and discussing a matrix-variate CWM in which the cluster-specific marginal distribution of the covariates and the cluster-specific conditional distribution of the responses given the covariates are assumed to be matrix normal.

The remainder of the paper is organized as follows. In Section 2, some preliminary aspects are described. The matrix normal cluster-weighted model (MN-CWM) and the expectation-conditional maximization (ECM) algorithm (Meng and Rubin 1993) for parameter estimation are discussed in Section 3. Computational and operational aspects are laid out in Section 4.
In the simulation study outlined in Section 5, the parameter recovery and the classification performance of the MN-CWM are investigated, as well as the capability of the Bayesian information criterion (BIC; Schwarz 1978) to detect the underlying group structure. The MN-CWM is also compared therein to the MN-FMR and to the multivariate-multiple normal CWM (MMN-CWM). The application of the MN-CWM to two real datasets concerning educational indicators and the Italian non-life insurance market is then analyzed in Section 6, whereas some conclusions and ideas for future developments are drawn in Section 7.

2 Background

2.1 Matrix Normal Distribution

A p × r continuous random matrix Y ∈ R^{p×r} has a matrix normal (MN) distribution, denoted by N_{p×r}(M, Σ, Ψ), if its density is

φ_{p×r}(Y; M, Σ, Ψ) = (2π)^{−pr/2} |Σ|^{−r/2} |Ψ|^{−p/2} exp{ −(1/2) tr[Σ^{−1}(Y − M)Ψ^{−1}(Y − M)′] },   (1)

where M is the p × r mean matrix, and Σ and Ψ are the p × p and r × r covariance matrices associated with the p rows and the r columns, respectively. An equivalent definition specifies the (p × r)-matrix normal distribution as a special case of the pr-multivariate normal distribution. Specifically,

Y ∼ N_{p×r}(M, Σ, Ψ) ⇔ vec(Y) ∼ N_{pr}(vec(M), Ψ ⊗ Σ),   (2)

where N_{pr}(·) denotes the pr-variate normal distribution, vec(·) is the vectorization operator, and ⊗ denotes the Kronecker product. However, the MN distribution has the desirable feature of simultaneously modeling and identifying the between- and within-variable variabilities, as well as reducing the number of free covariance parameters from pr(pr + 1)/2 to [r(r + 1)/2] + [p(p + 1)/2] − 1.

2.2 Matrix-Variate Regression Model

Let Y ∈ R^{p×r} be a continuous random matrix of dimension p × r, containing p responses measured on r occasions. Suppose we observe a set of q covariates for each occasion, collected in a matrix X of dimension q × r.
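The equivalence in Eq. 2 and the covariance parameter counts of Section 2.1 can be checked numerically. The sketch below is our illustration in Python/NumPy (the paper's own code is in R): it evaluates the matrix normal log-density of Eq. 1 directly, and again via the vectorized pr-variate normal with covariance Ψ ⊗ Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 3, 4  # illustrative dimensions, not tied to the paper's data

# Random positive definite row (Sigma, p x p) and column (Psi, r x r) covariances.
A = rng.standard_normal((p, p)); Sigma = A @ A.T + p * np.eye(p)
C = rng.standard_normal((r, r)); Psi = C @ C.T + r * np.eye(r)
M = rng.standard_normal((p, r))
Y = M + rng.standard_normal((p, r))

def mn_logpdf(Y, M, Sigma, Psi):
    """Log-density of the matrix normal distribution, Eq. (1)."""
    p, r = Y.shape
    R = Y - M
    # tr[Sigma^{-1} R Psi^{-1} R']
    quad = np.trace(np.linalg.solve(Sigma, R) @ np.linalg.solve(Psi, R.T))
    return -0.5 * (p * r * np.log(2 * np.pi)
                   + r * np.linalg.slogdet(Sigma)[1]
                   + p * np.linalg.slogdet(Psi)[1]
                   + quad)

# Equivalence (2): vec(Y) ~ N_pr(vec(M), Psi kron Sigma), column-major vec.
vec = lambda Z: Z.flatten(order="F")
cov = np.kron(Psi, Sigma)
d = vec(Y) - vec(M)
lp_vec = -0.5 * (p * r * np.log(2 * np.pi)
                 + np.linalg.slogdet(cov)[1]
                 + d @ np.linalg.solve(cov, d))
assert np.isclose(mn_logpdf(Y, M, Sigma, Psi), lp_vec)

# Free covariance parameters: pr(pr+1)/2 versus p(p+1)/2 + r(r+1)/2 - 1.
print(p * r * (p * r + 1) // 2, p * (p + 1) // 2 + r * (r + 1) // 2 - 1)  # 78 15
```

With p = 3 and r = 4 the two log-densities coincide, and the covariance parameter count drops from 78 to 15.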
A generic matrix-variate regression model for Y has the form

Y = βw′ + BX + U,   (3)

where β is the p × 1 vector of intercept parameters, w is an r × 1 vector of ones, B is the p × q matrix containing the parameters related to the q covariates, and U is the p × r error term matrix. Model (3) can be expressed in compact notation as

Y = B*X* + U,   (4)

where B* is the p × (1 + q) matrix involving all the parameters to be estimated and X* is the (1 + q) × r matrix containing the information about the intercept and the q covariates (Viroli 2012). If we assume U ∼ N_{p×r}(0, Σ, Ψ), then Y|X* ∼ N_{p×r}(B*X*, Σ, Ψ). Therefore, a matrix-variate regression can be viewed as an encompassing framework containing as special cases the multivariate-multiple regression, when r = 1, and the univariate-multiple regression, when r = 1 and p = 1.

3 Methodology

3.1 Matrix Normal CWM

Let (X, Y) be a pair of random matrices, defined as in Section 2.2, with joint distribution p(X, Y). Then, a general matrix CWM has the following joint distribution:

p(X, Y) = ∑_{g=1}^{G} π_g p_g(Y|X) p_g(X),   (5)

where p_g(Y|X) is the cluster-specific conditional distribution of the responses, p_g(X) is the cluster-specific marginal distribution of the covariates, and π_g > 0 is the mixing weight (with ∑_{g=1}^{G} π_g = 1). Furthermore, we assume that in each group the conditional expectation E(Y|X*) is a linear function of X* depending on some parameters. In this paper, we focus on model (5) by assuming that both p_g(Y|X) and p_g(X) are matrix normal densities, and E(Y|X*) = B*_g X*, as described in Section 2.2. Thus, model (5) can be rewritten as

p(X, Y; Θ) = ∑_{g=1}^{G} π_g φ_{p×r}(Y|X*; B*_g X*, Σ_{Y_g}, Ψ_{Y_g}) φ_{q×r}(X; M_g, Σ_{X_g}, Ψ_{X_g}),   (6)

where Θ denotes the set of all parameters. For ease of understanding, a subscript with the variable name is added to the respective covariance matrices.
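Returning to the compact form (4): stacking a row of ones on top of X, and placing the intercept vector β next to B, reproduces the mean of model (3) exactly. A small Python sketch (ours, with arbitrary illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 2, 3, 4  # illustrative dimensions, not tied to the paper's data

beta = rng.standard_normal((p, 1))   # intercepts
B = rng.standard_normal((p, q))      # slopes for the q covariates
X = rng.standard_normal((q, r))      # covariate matrix
w = np.ones((r, 1))                  # r x 1 vector of ones

# Model (3): mean of Y is beta w' + B X.
mean_long = beta @ w.T + B @ X

# Compact form (4): B* = [beta | B] (p x (1+q)), X* = [w' ; X] ((1+q) x r).
B_star = np.hstack([beta, B])
X_star = np.vstack([np.ones((1, r)), X])
mean_compact = B_star @ X_star

assert np.allclose(mean_long, mean_compact)
```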
Notice that there is an identifiability issue concerning the covariance matrices. Indeed, for a MN distribution, Ψ* ⊗ Σ* = Ψ ⊗ Σ if Σ* = aΣ and Ψ* = a^{−1}Ψ. As a result, Σ and Ψ are identifiable only up to a multiplicative constant a (Melnykov and Zhu 2019). According to Gallaugher and McNicholas (2017, 2018), a way to obtain a unique solution is to fix the first diagonal element of the row covariance matrix to 1. Therefore, we adopt this approach in model (6) by setting the first diagonal element of Σ_{Y_g} and Σ_{X_g} to 1.

If the MN-CWM were not available, a possible approach would be to vectorize the matrices and consider the MMN-CWM, of which the MN-CWM is a special case; see Eq. 2. However, such a procedure raises two principal concerns. The first is the overparameterization of the vectorized model. Second, this increased number of free parameters may directly impact model selection, as will be analyzed further in Section 5.3.

3.2 Parameter Estimation

Parameter estimation is carried out via the ECM algorithm, which differs from the expectation-maximization (EM) algorithm (Dempster et al. 1977) in that the M-step is replaced by a sequence of simpler and computationally convenient CM-steps. The EM algorithm cannot be directly implemented because there is no closed-form solution for the covariance matrices of the MN distribution, i.e., one of the two depends on the value of the other at the previous iteration (Dutilleul 1999).

Let S = {(X_i, Y_i)}_{i=1}^{N} be a sample of N independent observations from model (6). Then, the incomplete-data likelihood function is

L(Θ|S) = ∏_{i=1}^{N} p(X_i, Y_i; Θ) = ∏_{i=1}^{N} [ ∑_{g=1}^{G} π_g φ_{p×r}(Y_i|X*_i; B*_g X*_i, Σ_{Y_g}, Ψ_{Y_g}) φ_{q×r}(X_i; M_g, Σ_{X_g}, Ψ_{X_g}) ].   (7)

Within the formulation of mixture models, S is viewed as being incomplete because, for each observation, we do not know its component membership.
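The scale invariance behind this identifiability issue, and the normalization that removes it, can be illustrated with a short Python sketch (ours; the matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p, r, a = 3, 4, 2.5  # illustrative dimensions and scaling constant

A = rng.standard_normal((p, p)); Sigma = A @ A.T + p * np.eye(p)
C = rng.standard_normal((r, r)); Psi = C @ C.T + r * np.eye(r)

# (a * Sigma, Psi / a) yields exactly the same Kronecker covariance ...
assert np.allclose(np.kron(Psi, Sigma), np.kron(Psi / a, a * Sigma))

# ... so a unique representative is picked by rescaling until the first
# diagonal element of the row covariance equals 1, mirroring the
# constraint adopted for model (6).
s11 = Sigma[0, 0]
Sigma_id, Psi_id = Sigma / s11, Psi * s11
assert np.isclose(Sigma_id[0, 0], 1.0)
assert np.allclose(np.kron(Psi_id, Sigma_id), np.kron(Psi, Sigma))
```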
Let z_i = (z_{i1}, …, z_{iG})′ be the component membership vector such that z_{ig} = 1 if (X_i, Y_i) comes from group (i.e., component) g and z_{ig} = 0 otherwise. Now, the complete data are S_c = {(X_i, Y_i, z_i)}_{i=1}^{N}, and the complete-data likelihood is

L_c(Θ|S_c) = ∏_{i=1}^{N} ∏_{g=1}^{G} [π_g φ_{p×r}(Y_i|X*_i; B*_g X*_i, Σ_{Y_g}, Ψ_{Y_g}) φ_{q×r}(X_i; M_g, Σ_{X_g}, Ψ_{X_g})]^{z_{ig}}.   (8)

Therefore, the corresponding complete-data log-likelihood can be written as

l_c(Θ|S_c) = ∑_{i=1}^{N} ∑_{g=1}^{G} z_{ig} ln π_g + ∑_{i=1}^{N} ∑_{g=1}^{G} z_{ig} ln φ_{p×r}(Y_i|X*_i; B*_g X*_i, Σ_{Y_g}, Ψ_{Y_g}) + ∑_{i=1}^{N} ∑_{g=1}^{G} z_{ig} ln φ_{q×r}(X_i; M_g, Σ_{X_g}, Ψ_{X_g}).   (9)

In the following, adopting the notation used in Tomarchio et al. (2020), quantities marked with one dot correspond to the updates at the previous iteration, and those marked with two dots represent the updates at the current iteration.

E-Step  The E-step requires the calculation of the conditional expectation of Eq. 9, given the observed data and the current estimate Θ̇ of the parameters. To do this, we need to calculate

z̈_{ig} = E_{Θ̇}(Z_{ig}|X_i, Y_i) = π̇_g φ_{p×r}(Y_i|X*_i; Ḃ*_g X*_i, Σ̇_{Y_g}, Ψ̇_{Y_g}) φ_{q×r}(X_i; Ṁ_g, Σ̇_{X_g}, Ψ̇_{X_g}) / ∑_{j=1}^{G} π̇_j φ_{p×r}(Y_i|X*_i; Ḃ*_j X*_i, Σ̇_{Y_j}, Ψ̇_{Y_j}) φ_{q×r}(X_i; Ṁ_j, Σ̇_{X_j}, Ψ̇_{X_j}),   (10)

which corresponds to the posterior probability that the unlabeled observation (X_i, Y_i) belongs to the gth component of the mixture.

CM-Step 1  At the first CM-step, we maximize the expectation of the complete-data log-likelihood with respect to Θ_1 = {π_g, M_g, Σ_{X_g}, B*_g, Σ_{Y_g}}_{g=1}^{G}, fixing Θ_2 = {Ψ_{X_g}, Ψ_{Y_g}}_{g=1}^{G} at Θ̇_2. In particular, we obtain

π̈_g = (1/N) ∑_{i=1}^{N} z̈_{ig},   M̈_g = ∑_{i=1}^{N} z̈_{ig} X_i / ∑_{i=1}^{N} z̈_{ig},   (11)

B̈*_g = [∑_{i=1}^{N} z̈_{ig} Y_i (Ψ̇_{Y_g})^{−1} (X*_i)′] [∑_{i=1}^{N} z̈_{ig} X*_i (Ψ̇_{Y_g})^{−1} (X*_i)′]^{−1},   (12)

Σ̈_{X_g} = (1/(r ∑_{i=1}^{N} z̈_{ig})) ∑_{i=1}^{N} z̈_{ig} (X_i − M̈_g) (Ψ̇_{X_g})^{−1} (X_i − M̈_g)′,   (13)

Σ̈_{Y_g} = (1/(r ∑_{i=1}^{N} z̈_{ig})) ∑_{i=1}^{N} z̈_{ig} (Y_i − B̈*_g X*_i) (Ψ̇_{Y_g})^{−1} (Y_i − B̈*_g X*_i)′.   (14)
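In practice, the E-step of Eq. 10 is usually computed on the log scale to avoid numerical underflow of the matrix normal densities. A minimal Python sketch of this device (ours; the paper's implementation is in R, and `log_dens` is a hypothetical array of precomputed component log-densities):

```python
import numpy as np

def e_step(log_dens, pi):
    """Posterior membership probabilities as in Eq. (10).

    log_dens[i, g] is assumed to hold the sum of the conditional and
    marginal matrix normal log-densities of observation i in component g.
    """
    log_num = np.log(pi)[None, :] + log_dens          # N x G numerators, log scale
    log_num -= log_num.max(axis=1, keepdims=True)     # log-sum-exp stabilization
    z = np.exp(log_num)
    return z / z.sum(axis=1, keepdims=True)           # normalize rows

rng = np.random.default_rng(3)
log_dens = rng.standard_normal((5, 2))                # toy values, N = 5, G = 2
z = e_step(log_dens, np.array([0.4, 0.6]))
assert np.allclose(z.sum(axis=1), 1.0)                # each row is a probability vector
```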
CM-Step 2  At the second CM-step, we maximize the expectation of the complete-data log-likelihood with respect to Θ_2, keeping Θ_1 fixed at Θ̈_1. Therefore, we have

Ψ̈_{X_g} = (1/(q ∑_{i=1}^{N} z̈_{ig})) ∑_{i=1}^{N} z̈_{ig} (X_i − M̈_g)′ (Σ̈_{X_g})^{−1} (X_i − M̈_g),   (15)

Ψ̈_{Y_g} = (1/(p ∑_{i=1}^{N} z̈_{ig})) ∑_{i=1}^{N} z̈_{ig} (Y_i − B̈*_g X*_i)′ (Σ̈_{Y_g})^{−1} (Y_i − B̈*_g X*_i).   (16)

4 Computational and Operational Aspects

The code for the ECM algorithm previously described, and for the analyses in the following sections, is written in the R computing environment (R Core Team 2018).

4.1 ECM Initialization

The choice of the starting values is an important aspect of EM-based algorithms (see, e.g., Maitra and Melnykov 2010; Michael and Melnykov 2016). A common way to start an EM-based algorithm consists in providing initial values for the z_{ig} in Eq. 10 during the first E-step (McLachlan and Peel 2000). In our case, we also need to provide initial values for Ψ_{X_g} and Ψ_{Y_g} in Eqs. 13 and 14, respectively, during the first CM-step. Therefore, the following initialization strategy is implemented:

1. Generate G random positive definite matrices for both Ψ_{X_g} and Ψ_{Y_g}. This is done via the genPositiveDefMat() function of the clusterGeneration package, using the "eigen" method. For further details, see Qiu and Joe (2015).
2. Generate N random vectors z_i = (z_{i1}, …, z_{iG})′, i = 1, …, N. This is done using the following three approaches:
   2.1 In a "soft" way, by generating G positive random values from a uniform distribution on [0, 1] for each observation, which are subsequently normalized to have a unitary sum. Being purely random, this procedure is repeated 15 times, and the solution maximizing the observed-data log-likelihood among these runs is considered;
   2.2 In a "hard" way, by using the classification produced by the k-means algorithm on the vectorized and merged data.
Specifically, after computing {vec(X_i)}_{i=1}^{N} and {vec(Y_i)}_{i=1}^{N}, the data are merged so that for each observation we have a vector of dimension (pr + qr) × 1;
   2.3 In a "hard" way, by using the classification produced by mixtures of matrix-normal distributions, computed on the merged data. In detail, for each observation, we have a (p + q) × r matrix.

The approach producing the largest (observed) log-likelihood is finally selected.

4.2 Spurious Clusters

A well-known issue in mixture models is the possibility for EM-based algorithms to converge to spurious local maxima. This is not a failing of these algorithms; such solutions reflect a random pattern in the data rather than an underlying group structure. They can usually be detected by the presence of data groups with very few observations or with small eigenvalues compared to the others. Consequently, these solutions often have a high likelihood, but are of little practical use or real-world interpretation (McLachlan and Peel 2000). Various approaches to mitigate this problem have been proposed in the literature. For example, Leisch (2004) and Dang and McNicholas (2015) impose a minimum size for the clusters by forcing π_g ≥ 0.05. Melnykov and Zhu (2019) tackle this issue by excluding all the solutions that involve clusters consisting of fewer than five points. Herein, we deal with this problem by removing the solutions that include clusters with estimated π_g of approximately 0.05 or less, and with eigenvalues close to 0 in one or more of the estimated covariance matrices.

4.3 Model Selection and Performance Evaluation

It is often the case that the number of groups G is not known in advance, and model selection criteria are commonly used for estimating it. Among them, the BIC is one of the most popular, and it will be used in the following.
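The spurious-solution screen of Section 4.2 and the BIC can be sketched as follows (illustrative Python; the function names are ours, the screen flags either symptom, which is a stricter variant than requiring both, and the thresholds match the values quoted above):

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    # BIC in the "smaller is better" form -2*loglik + m*ln(N); some software
    # equivalently maximizes 2*loglik - m*ln(N).
    return -2.0 * loglik + n_params * np.log(n_obs)

def is_spurious(pi_hat, cov_list, pi_min=0.05, eig_tol=1e-6):
    """Flag a fitted solution with a tiny cluster or a near-singular covariance."""
    if np.any(pi_hat <= pi_min):
        return True
    return any(np.linalg.eigvalsh(S).min() < eig_tol for S in cov_list)

# Toy checks: a 3% cluster, or a degenerate covariance, is screened out.
assert is_spurious(np.array([0.97, 0.03]), [np.eye(2), np.eye(2)])
assert is_spurious(np.array([0.5, 0.5]), [np.eye(2), np.diag([1.0, 0.0])])
assert not is_spurious(np.array([0.5, 0.5]), [np.eye(2), np.eye(2)])
assert bic(-100.0, 10, 200) > bic(-100.0, 5, 200)  # more parameters, larger penalty
```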
Furthermore, when the true classification of the data is known, the adjusted Rand index (ARI; Hubert and Arabie 1985) can be used to evaluate the clustering performance of a model. Specifically, the ARI measures the agreement between the true classification and the one predicted by the model: an ARI of 1 indicates perfect agreement between the two partitions, whereas the expected value of the ARI under a random classification is 0. The ARI will be used in the manuscript along with the misclassification rate η, i.e., the percentage of misclassified units.

5 Simulation Studies

5.1 Simulation 1: a Focus on the Matrix-Normal CWM

In this study, several aspects of our model are analyzed. First of all, since the ECM algorithm is used to fit the model, it is desirable to evaluate its parameter recovery, i.e., whether it can recover the generating parameters accurately. For this reason, data are generated from a four-component MN-CWM with p = q = r = 3. Two scenarios are then evaluated, according to the different levels of overlap of the mixture components. In the first scenario (labeled "Scenario A1"), the mixture components are well separated both in X, by assuming relatively distant mean matrices, and in Y|X, by using different intercepts and slopes. On the contrary, in the second scenario (labeled "Scenario B1"), there is a certain amount of overlap, because the intercepts are all equal among the mixture components, while the slopes and the mean matrices assume approximately the same values among the mixture components. The parameters used for Scenario A1 are displayed in Appendix 1. Under Scenario B1, the set of parameters {π_g, Σ_{X_g}, Ψ_{X_g}, Σ_{Y_g}, Ψ_{Y_g}}_{g=1}^{G}, M_1, and the slopes in B*_1 and B*_3 are the same as in Scenario A1.
The other mean matrices are obtained by adding a constant c to each element of the corresponding mean matrices used for Scenario A1. In detail, c is equal to −5, 5, and −10 for M_2, M_3, and M_4, respectively. The intercept column of all the mixture components is equal to (7, 2, 5)′, whereas the slopes in B*_2 and B*_4 are all multiplied by −1 with respect to those used in Scenario A1. Lastly, two sample sizes are considered within each scenario, i.e., N = 200 and N = 500.

On all the generated datasets, the MN-CWM is fitted directly with G = 4, and the bias and mean squared error (MSE) of the parameter estimates are computed. For brevity's sake, and as also supported by the existing CWM literature (see, e.g., Punzo 2014; Ingrassia et al. 2015; Punzo and Ingrassia 2016; Punzo and McNicholas 2017), attention will be focused only on the regression coefficients B*_1, …, B*_G. Before showing the obtained results, it is important to underline the well-known label switching issue, caused by the invariance of a mixture distribution to relabeling of its components (Frühwirth-Schnatter 2006). There are no generally accepted labeling methods. Herein, to assign the correct labels, an analysis of the overall estimated parameters is conducted on each generated dataset to properly identify each mixture component.

Table 1 summarizes the estimated bias and MSE of the parameter estimates for Scenario A1, over one hundred replications for each sample size N, after fitting the MN-CWM with G = 4. The same is reported for Scenario B1 in Table 2. The first and most immediate result is that the biases and the MSEs take very small values in both scenarios. This is particularly relevant for Scenario B1 because of the presence of overlap between the mixture components. Furthermore, within each scenario, an increase in the sample size leads to a rough improvement of the parameter estimates in terms of bias, whereas it systematically reduces the MSE.
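The elementwise bias and MSE summarized in Tables 1 and 2 can be computed over replications as in the following sketch (toy numbers, not the paper's estimates):

```python
import numpy as np

# Monte Carlo bias and MSE of a matrix of regression coefficients, computed
# elementwise over replications (illustrative data only).
rng = np.random.default_rng(4)
B_true = np.array([[1.0, -2.0], [0.5, 3.0]])
B_hat = B_true + 0.1 * rng.standard_normal((100, 2, 2))  # 100 replications

bias = B_hat.mean(axis=0) - B_true           # average estimation error
mse = ((B_hat - B_true) ** 2).mean(axis=0)   # average squared error

assert bias.shape == (2, 2) and mse.shape == (2, 2)
assert np.all(mse >= bias ** 2)   # MSE = variance + bias^2, elementwise
```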
Other aspects that are investigated concern the evaluation of the classification produced by our model, as well as the capability of the BIC to identify the correct number of groups in the data. For this reason, under each of the considered scenarios, the MN-CWM is fitted to the generated datasets for G ∈ {1, 2, 3, 4, 5}, and the results are reported in Table 3. It is easy to see that, under Scenario A1, a perfect classification is always obtained regardless of the sample size. Additionally, the BIC regularly detects the correct number of groups in the data. Under Scenario B1, because of the larger overlap, the ARI assumes lower, but in any case good, values. Relatedly, the percentage of misclassified units stands at around 3% for both sample sizes. Concerning the BIC, also in this case it properly identifies the underlying group structure, with only one exception (when N = 200).

A final aspect evaluated in this study concerns the initialization strategy. Specifically, Table 4 displays the number of times each strategy for the z_i produces the highest log-likelihood at convergence, within each scenario and for both sample sizes. The initial G random matrices for Ψ_{X_g} and Ψ_{Y_g} are assumed to be the same.

The first result suggests the importance of considering multiple initialization strategies, because none of them is preferred in all the generated datasets. However, the random strategy is quite close to this target, because it only fails in three datasets under Scenario B1. Very similar performances are obtained when the mixture initialization is used.
On the contrary, the k-means strategy provides the worst performance, even though it produces the best solution in approximately 80% of the datasets.

Table 1  Estimated bias and MSE of the regression coefficients {B*_g}_{g=1}^{4}, over 100 replications, under Scenario A1. [For each group, the 3 × 4 bias matrices contain entries within ±0.10 and the MSE matrices entries below 0.42 for N = 200; for N = 500, the biases shrink further and the MSEs fall below 0.25.]

5.2 Simulation 2: a Comparison Between the Matrix-Normal CWM and the Matrix-Normal FMR

In this study, the matrix-normal CWM is compared to the matrix-normal FMR.
Specifically, three scenarios with N = 200, p = 2, q = 3, and r = 4 are considered, and in each of them 30 datasets are generated from a matrix-normal CWM with G = 2. The first scenario (hereafter simply referred to as "Scenario A_2") is characterized by the fact that the two groups differ only in the intercepts and the covariance matrices. This implies that they have totally overlapping mean matrices, which should make the distribution of the covariates p(X) not very important for clustering. The parameters used to generate the datasets are displayed in Appendix 2. In the second scenario ("Scenario B_2"), the two groups have the same B_g and π_g. The parameters used to generate the datasets are the same as for Scenario A_2, with only two differences: a value c = 5 is added to each element of M_2, and we set B_2* = B_1*. Lastly, in the third scenario ("Scenario C_2"), the two groups have only the same slopes and π_g. Here, with respect to the parameters used under Scenario B_2, the only difference is in the intercept vectors, which are (−3, −4) and (−7, −8) for the first and the second groups, respectively.

[Table 2: Estimated bias and MSE of the regression coefficients B_g, over 100 replications, under Scenario B_1 (N = 200 and N = 500, Groups 1–4).]

Table 3 ARI and η values, along with the number of times the true G is selected by the BIC, over 100 replications, for scenarios A_1 and B_1

              N = 200               N = 500
              ARI   η      True G   ARI   η      True G
Scenario A_1  1.00  0.00%  100      1.00  0.00%  100
Scenario B_1  0.91  3.04%  99       0.92  2.71%  100

Table 4 Number of times, over 100 replications, the considered initialization strategies produced the highest log-likelihood at convergence, for scenarios A_1 and B_1

              N = 200                    N = 500
              Random  k-Means  Mixture   Random  k-Means  Mixture
Scenario A_1  100     77       98        100     79       95
Scenario B_1  97      74       87        100     83       100

The MN-CWM and the MN-FMR are then fitted to the datasets of each scenario for G ∈ {1, 2, 3}, and the results in terms of model selection and clustering are reported in Table 5. It is possible to see that, in Scenario A_2, the BIC correctly selects two groups for both models and the classifications produced are perfect. Therefore, even if the two groups have the same means and are strongly overlapping, the MN-CWM seems able to properly identify the true underlying group structure.
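The adjusted Rand index (ARI; Hubert and Arabie 1985) and the misclassification rate η used in these comparisons are standard clustering diagnostics. A minimal sketch of how they can be computed, assuming scikit-learn and SciPy are available; the example labelings and the helper `misclassification_rate` are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def misclassification_rate(true_labels, est_labels):
    """eta: share of misclassified units after optimally matching the
    estimated group labels to the true ones (Hungarian algorithm)."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    k = int(max(true_labels.max(), est_labels.max())) + 1
    cm = np.zeros((k, k), dtype=int)       # confusion matrix of the two partitions
    for t, e in zip(true_labels, est_labels):
        cm[t, e] += 1
    row, col = linear_sum_assignment(-cm)  # match labels to maximize agreements
    return 1.0 - cm[row, col].sum() / len(true_labels)

true = [0, 0, 0, 1, 1, 1, 2, 2]
est  = [1, 1, 1, 0, 0, 2, 2, 2]            # one unit misassigned

ari = adjusted_rand_score(true, est)
eta = misclassification_rate(true, est)    # 1/8 = 0.125 here
```

Note that the ARI is invariant to label permutations, whereas η requires the explicit label matching performed above.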
However, under such a scenario, the MN-FMR should be preferred, because the distribution of the covariates p(X) is not useful for clustering and the MN-FMR is more parsimonious than the MN-CWM. On the contrary, Scenarios B_2 and C_2 represent typical examples of the usefulness of p(X). Specifically, the BIC always identifies just one group under both scenarios for the MN-FMR, with obvious consequences in terms of the classification produced. Notice that, even if the MN-FMR had been fitted directly with G = 2, the resulting classifications would lead to almost identical ARI and η for Scenario B_2, and only slightly better performance for Scenario C_2, where ARI = 0.15 and η = 32.48%. This underlines how, regardless of the BIC, the MN-FMR is not able to properly model such data structures.

5.3 Simulation 3: a Comparison Between the Matrix-Normal CWM and the Multivariate-Multiple Normal CWM

In this study, the MN-CWM is compared to the MMN-CWM. To show the effects of data vectorization, we consider two experimental factors: the matrix dimensionality and the number of groups G. As for the dimensionality, we assume square matrices having the same dimensions both for the responses and the covariates, i.e., p = q = r ∈ {2, 3, 4}. Similarly, situations with three different numbers of groups are evaluated, i.e., G ∈ {2, 3, 4}. By combining both experimental factors, nine scenarios are obtained and, for each of them, 30 datasets are generated from a MN-CWM. The parameters used to generate the data come from Section 5.1 and are shown in Appendix 1. In detail, when p = q = r = 2 they are obtained by taking the submatrix in the upper-left corner of each parameter, when p = q = r = 3 they are exactly as displayed, whereas when p = q = r = 4 a row and a column are added to each parameter matrix, which for brevity's sake are not reported here.
About the number of groups, and by considering Appendix 1, when G = 2 and G = 3 the first two and three groups are selected, respectively, while when G = 4 all of them are considered.

Table 5 ARI and η values, along with the number of times the true G is selected by the BIC, over 30 replications, for scenarios A_2, B_2, and C_2

              MN-CWM                   MN-FMR
              ARI   η       True G     ARI   η        True G
Scenario A_2  1.00  0.00%   100        1.00  0.00%    100
Scenario B_2  0.99  0.03%   100        0.00  47.22%   0
Scenario C_2  1.00  0.01%   100        0.00  47.18%   0

The MN-CWM is then fitted to each dataset for G ∈ {1, 2, 3, 4, 5}. The same is done for the MMN-CWM after data vectorization, and the results of both models in terms of model selection via the BIC are shown in Table 6. As we can see from Table 6, when the MN-CWM is considered, regardless of the data dimensionality and the number of groups, the BIC always selects the correct number of groups. The same also holds for the MMN-CWM when p = q = r = 2 or, regardless of the data dimensionality, when G = 2. However, when p = q = r = 3, the BIC starts to face issues for G = 3, because the true number of groups is detected only 11 times (in the other 19 cases, G = 2 is selected), and it systematically fails when G = 4. The problem gets even worse when p = q = r = 4 (with the exclusion of G = 2). The reason for such failures is the increased number of parameters with respect to the MN-CWM. Therefore, on the one hand, we have a model that can become seriously overparameterized, with negative effects also on model selection (the MMN-CWM); on the other hand, we have a model (the MN-CWM) which is able to fit the same data far more parsimoniously and without causing problems in model selection.
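The overparameterization gap behind these model-selection failures can be quantified by counting free covariance parameters: a matrix-normal distribution for a p × r matrix needs one p × p and one r × r symmetric matrix (minus one parameter, since the two matrices are identified only up to a multiplicative constant), while the unrestricted covariance of the vectorized pr-dimensional normal grows quadratically in pr. A small illustrative sketch (function names are mine, not from the paper):

```python
# Sketch: free covariance parameters of a matrix-normal model for a
# p x r matrix versus its vectorized pr-variate normal counterpart.

def matrix_normal_cov_params(p: int, r: int) -> int:
    # One p x p and one r x r symmetric matrix; the identifiability
    # trade-off Sigma -> c*Sigma, Psi -> Psi/c removes one parameter.
    return p * (p + 1) // 2 + r * (r + 1) // 2 - 1

def vectorized_cov_params(p: int, r: int) -> int:
    # Unrestricted symmetric pr x pr covariance after vectorization.
    n = p * r
    return n * (n + 1) // 2

for d in (2, 3, 4):
    print(d, matrix_normal_cov_params(d, d), vectorized_cov_params(d, d))
    # d = 2: 5 vs 10;  d = 3: 11 vs 45;  d = 4: 19 vs 136
```

The gap widens rapidly with the matrix dimension, which is consistent with the BIC behaving well for the MMN-CWM at p = q = r = 2 and deteriorating at p = q = r = 3 and 4.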
6 Real Data Applications

6.1 Education Data

The first dataset comes from the Italian national agency for the evaluation of universities and research institutes, which makes available to Italian universities quantitative indicators related to the academic careers of the students and the results of the training activities. For this application, the following p = 2 responses, which measure the level of completion of studies by students, are considered:

1. The percentage of students that graduate within T + 1 years (Complete), and
2. The percentage of students that drop out after T + 1 years (Drop),

where T is the normal duration of the study program. Moreover, the following q = 2 covariates that may be helpful in explaining this progress are taken into account:

1. The percentage of course credits earned in the first year over the total to be achieved (Credits), and
2. The percentage of students that have earned at least 40 course credits during the solar year (Students).

For the sake of simplicity, these variables will hereafter be referred to by the names given in the associated parentheses.

Table 6 The number of times, over 30 replications, the true G is selected by the BIC in each of the nine scenarios, for the MN-CWM and the MMN-CWM

               MN-CWM                  MMN-CWM
               G = 2   G = 3   G = 4   G = 2   G = 3   G = 4
p = q = r = 2  30      30      30      30      30      30
p = q = r = 3  30      30      30      30      11      0
p = q = r = 4  30      30      30      30      0       0

All the measurements refer to N = 75 study programs in the non-telematic Italian universities over r = 3 years. Each study program is measured at the national level, i.e., it is the average value of all the study programs of the same type across the country, for the reference period. There are two groups in the data, namely N_1 = 33 bachelor's degrees and N_2 = 42 master's degrees. The MN-CWM and the MN-FMR are fitted to the data for G ∈ {1, 2, 3} and their results are reported in Table 7.
The BIC selects a two-component MN-CWM, which yields a perfect classification of the data. On the contrary, a three-component MN-FMR is chosen by the BIC, with 6.67% of the units misclassified (ARI = 0.88). Therefore, our model is able to completely recognize the underlying group structure, differently from the MN-FMR. Notice that, even if we fit the MN-FMR directly with G = 2, it produces a similar classification, with an ARI of 0.89. Additionally, if we vectorize the data and fit the MMN-CWM for G ∈ {1, 2, 3}, the BIC chooses a one-component model. This is probably due to the overparameterization issue discussed in Section 5.3. In any case, if we ignore this issue and fit the MMN-CWM directly with G = 2, the ARI is 0.84, which is lower than those of the matrix-variate models.

The regression coefficients estimated by the MN-CWM are displayed in Table 8 for the two groups. As it is plausible to expect, both covariates have a positive effect on the successful completion of the study programs, even if their magnitude differs between the two groups. The credits obtained during the first year of study might be more important for bachelor's students, considering the difficulties that arise in the transition from high school to university (Krause 2006). At the same time, obtaining at least 40 course credits per year should be easier for master's students, resulting in a greater importance for the completion of the studies. Conversely, both covariates have a negative impact on the drop rates, with the exception of the Credits variable, which surprisingly turns out to have a positive sign for the master's courses.

6.2 Insurance Data

For this second real data application, the "Insurance" dataset included in the splm package (Millo and Piras 2012) is used. This dataset was introduced by Millo and Carmeci (2011) to study the consumption of non-life insurance across the N = 103 Italian provinces in the years 1998–2002 (r = 5).
In this application we select, as responses, the following p = 2 variables related to the consumption and the presence of insurance products in the market:

1. Real per-capita non-life premiums in 2000 Euros (PPCD), and
2. Density of insurance agencies per 1000 inhabitants (AGEN),

as well as the following q = 3 financial covariates:

1. Real per-capita GDP (RGDP),
2. Real per-capita bank deposits (BANK), and
3. Real interest rate on lending to families and small enterprises (RIRS).

Table 7 ARI and η values for the MN-CWM and MN-FMR selected by the BIC for the education data

Model    G   ARI    η
MN-CWM   2   1.00   0.00%
MN-FMR   3   0.88   6.67%

Table 8 Estimated regression coefficients for the MN-CWM for the education data

           Intercept   Credits   Students
(a) Bachelor's degrees
Complete   −0.087      0.973     0.102
Drop        0.727     −0.557    −0.178
(b) Master's degrees
Complete    0.309      0.101     0.587
Drop        0.049      0.044    −0.011

There are two reasons why we focus on this subset of covariates. Firstly, they are regularly used in the literature, and their relevant effects on the consumption or development of insurance products have been widely discussed (see the references in Millo and Carmeci 2011 for further details). Indeed, they are commonly used as proxies for income and the general level of economic activity (RGDP), the stock of wealth (BANK), and the opportunity cost of allocating funds to insurance policies (RIRS). Secondly, we wish to avoid excessive parameterization of the models. Notice that, for a better interpretation of the regression coefficients, the variables RGDP and BANK are divided by 1000; thus, thousands of euros are considered.

Differently from the previous application, we do not have a "ground truth" classification of the data. However, because we are using p(X) in Eq. 6, the sampling distribution of each covariate could provide useful insights. These distributions are reported in Fig. 1, for the full 5-year data, as done by Melnykov and Zhu (2019).
The bimodality in all these histograms seems to suggest the presence of two groups in the data, validating the findings of Millo and Carmeci (2011). They underline the existence of two macro areas, namely Central-Northern Italy, characterized by an insurance penetration level relatively close to the European averages, and Southern Italy, where a general economic underdevelopment has long stood as a fundamental social and political problem. The MN-CWM and the MN-FMR models are hence fitted to the data for G ∈ {1, 2, 3}, and the BIC selects a two-component MN-CWM and a three-component MN-FMR, respectively. These two partitions are illustrated in Fig. 2 using the Italian political map. Specifically, the Italian regions are bordered in yellow (islands excluded), while the internal provinces are delimited with black lines and colored according to the estimated group membership, both for the MN-CWM and the MN-FMR. Because we do not have a "ground truth" classification for these data, we cannot compute either the ARI or η. Nevertheless, the partition produced by the MN-CWM seems in line with the findings of Millo and Carmeci (2011), with a clear separation between Central-Northern Italy and Southern-Insular Italy. Furthermore, with the exclusion of three cases, all the provinces belonging to the same region are clustered together. The only exceptions are the following: the province of Rome (in the Lazio region), which due to its socio-economic development is reasonably assigned to the Central-Northern Italy group; the province of Ascoli-Piceno (in the Marche region); and the province of Massa-Carrara (in the Toscana region). On the contrary, the three groups detected by the MN-FMR are not supported by the literature and are difficult to interpret, also because they put together provinces spanning the whole country without a straightforward and reasonable justification.
Notice that, also in this case, if we vectorize the data and fit the MMN-CWM for G ∈ {1, 2, 3}, the BIC selects a one-component model. This is in contrast to all the evidence discussed above, and it is plausibly driven by overparameterization, which for this dataset is even higher than in the previous application, given that the matrices have a greater dimensionality.

Fig. 1 Sampling distributions of the covariates for the insurance data

As in Section 6.1, the regression coefficients estimated by the MN-CWM are briefly presented in Table 9 for the two groups. Overall, the coefficients are quite different between the two groups, highlighting the divergences that characterize these two macro areas. The RIRS has a negative impact on the PPCD because the higher the interest rate on lending, the higher the opportunity cost of investing in insurance products. Conversely, an increase in the RIRS has a positive effect on the AGEN because, from the insurance companies' point of view, it raises the gains of investing the premiums on financial markets in the time between premium collection and claims settling. Therefore, these increased revenues might lead to an expansion of the insurance companies over the territory. Similarly, and in accordance with the existing literature, income (RGDP) and wealth (BANK) have a positive influence on the PPCD (Millo and Carmeci 2011). It would be reasonable to expect that they also have a positive effect on the AGEN.

Fig. 2 Partitions produced by the MN-CWM (a) and MN-FMR (b) for the insurance data

Table 9 Estimated regression coefficients by the MN-CWM for the insurance data

        Intercept   RGDP      BANK     RIRS
Central-Northern Italy
PPCD    85.7936     9.1932    1.8513   −7.3079
AGEN     0.5343    −0.0085    0.0071    0.0073
Southern Italy
PPCD    −3.6968     6.0029    4.2062   −1.2885
AGEN     0.0307     0.0041    0.0279    0.0039
However, this is not the case, at least for Central-Northern Italy, which shows lower coefficients than Southern Italy. In the southern part of the country, where the insurance density and penetration are much lower, increased income and wealth can give more space to growth and investment opportunities.

7 Conclusion

The matrix-variate CWM has been introduced. In the MN-CWM framework, the sets of responses and covariates may be simultaneously observed at different time points or locations. The matrix normal distribution was used for both the cluster-specific distributions of the covariates and of the responses given the covariates, and an ECM algorithm was used for parameter estimation. A simulation study analyzed parameter recovery and the classification performance of the proposed model, as well as the capability of the BIC to detect the underlying group structure of the data. A comparison among different initialization strategies has also been conducted. Additionally, the MN-CWM has been compared to the MN-FMR, discussing under which scenarios the cluster-specific distribution of the covariates is useful, and to the MMN-CWM on the vectorized data, where the overparameterization issue and its consequences for model selection have been analyzed. Lastly, in the first real data application, the MN-CWM produced a perfect classification, differently from the MN-FMR model. Similarly, in the second real data analysis, our model seemed to provide a more reliable partition of the Italian provinces than the MN-FMR model.

Further model developments can be readily proposed. It is reasonable to assume that similar overparameterization concerns can also affect the MN-CWM when the dimensions of the matrices are quite high. To further increase the parsimony of our model, constrained parameterizations of the covariance matrices can be employed, following the approaches of Sarkar et al. (2020), Subedi et al. (2013), Punzo and Ingrassia (2015), and Mazza et al. (2018).
Furthermore, to accommodate skewness or mild outliers in the data, skewed or heavy-tailed matrix-variate distributions could also be considered for the mixing components of the model (e.g., Melnykov and Zhu 2018; Gallaugher and McNicholas 2018; Tomarchio et al. 2020).

Appendix 1

[Table 10: Parameters used to generate the simulated datasets under Scenario A_1 (mixing proportions, mean matrices, covariance matrices, and regression coefficients for Groups 1–4).]
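Given parameter sets like those in this appendix, a single matrix-variate observation can be drawn with the standard matrix-normal sampling construction X = M + AZB', where Σ = AA' and Ψ = BB'. A minimal NumPy sketch under illustrative parameters (the helper name rmatnorm and the parameter values are mine, not from the paper):

```python
import numpy as np

def rmatnorm(M, Sigma, Psi, rng):
    """Draw X ~ MN(M, Sigma, Psi) via X = M + A Z B', where
    Sigma = A A' (row covariance), Psi = B B' (column covariance),
    and Z has iid standard normal entries."""
    A = np.linalg.cholesky(Sigma)
    B = np.linalg.cholesky(Psi)
    Z = rng.standard_normal(M.shape)
    return M + A @ Z @ B.T

rng = np.random.default_rng(0)
p, r = 2, 3
M = np.zeros((p, r))
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])   # p x p row covariance (illustrative)
Psi = np.eye(r)                  # r x r column covariance (illustrative)

draws = np.stack([rmatnorm(M, Sigma, Psi, rng) for _ in range(20000)])

# Check against the vec characterization: NumPy's row-major flattening
# of X corresponds to vec(X'), whose covariance is Sigma (x) Psi.
emp = np.cov(draws.reshape(len(draws), -1), rowvar=False)
theo = np.kron(Sigma, Psi)
```

The Kronecker check at the end is exactly the vectorization link exploited in Section 5.3: the same data, viewed as pr-dimensional vectors, have covariance Σ ⊗ Ψ, which the MMN-CWM instead estimates without the Kronecker restriction.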
Appendix 2

[Table 11: Parameters used to generate the simulated datasets under Scenario A_2 (mixing proportions, mean matrices, covariance matrices, and regression coefficients for Groups 1 and 2).]

Acknowledgements We thank the three anonymous reviewers and the associate editor for their useful comments and suggestions. We also thank Professor Salvatore Ingrassia for providing the education dataset as well as for helpful discussions.

Funding Open access funding provided by Università degli Studi di Catania within the CRUI-CARE Agreement.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Anderlucci, L., Montanari, A., Viroli, C. (2014). A matrix-variate regression model with canonical states: An application to elderly Danish twins. Statistica, 74(4), 367–381.
Anderlucci, L., & Viroli, C. (2015). Covariance pattern mixture models for the analysis of multivariate heterogeneous longitudinal data. The Annals of Applied Statistics, 9(2), 777–800.
Dang, U.J., & McNicholas, P.D. (2015). Families of parsimonious finite mixtures of regression models. In Advances in statistical models for data analysis (pp. 73–84). Springer.
Dang, U.J., Punzo, A., McNicholas, P.D., Ingrassia, S., Browne, R.P. (2017). Multivariate response and parsimony for Gaussian cluster-weighted models. Journal of Classification, 34(1), 4–34.
Dayton, C.M., & Macready, G.B. (1988). Concomitant-variable latent-class models. Journal of the American Statistical Association, 83(401), 173–178.
Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.
DeSarbo, W.S., & Cron, W.L. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5(2), 249–282.
Dutilleul, P. (1999). The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2), 105–123.
Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. New York: Springer Science & Business Media.
Gallaugher, M.P.B., & McNicholas, P.D. (2017). A matrix variate skew-t distribution. Stat, 6(1), 160–170.
Gallaugher, M.P.B., & McNicholas, P.D. (2018). Finite mixtures of skewed matrix variate distributions. Pattern Recognition, 80, 83–93.
Gershenfeld, N. (1997). Nonlinear inference and cluster-weighted modeling. Annals of the New York Academy of Sciences, 808(1), 18–24.
Gershenfeld, N., Schoner, B., Metois, E. (1999). Cluster-weighted modelling for time-series analysis. Nature, 397(6717), 329.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Ingrassia, S., Minotti, S.C., Punzo, A. (2014). Model-based clustering via linear cluster-weighted models. Computational Statistics & Data Analysis, 71, 159–182.
Ingrassia, S., Minotti, S.C., Vittadini, G. (2012). Local statistical modeling via a cluster-weighted approach with elliptical distributions. Journal of Classification, 29(3), 363–401.
Ingrassia, S., & Punzo, A. (2016). Decision boundaries for mixtures of regressions. Journal of the Korean Statistical Society, 45(2), 295–306.
Ingrassia, S., & Punzo, A. (2020). Cluster validation for mixtures of regressions via the total sum of squares decomposition. Journal of Classification, 37(2), 526–547.
Ingrassia, S., Punzo, A., Vittadini, G., Minotti, S.C. (2015). The generalized linear mixed cluster-weighted model. Journal of Classification, 32(1), 85–113.
Krause, K. (2006). On being strategic about the first year. In Keynote presentation, Queensland University of Technology First Year Forum, Vol. 5.
Leisch, F. (2004). Flexmix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1–18.
Maitra, R., & Melnykov, V. (2010). Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2), 354–376.
Mazza, A., Ingrassia, S., Punzo, A. (2019). Modeling return to education in heterogeneous populations: An application to Italy. In Greselin, F., Deldossi, L., Bagnato, L., Vichi, M. (Eds.), Statistical learning of complex data, volume 88 of Studies in Classification, Data Analysis, and Knowledge Organization (pp. 121–131). Cham, Switzerland: Springer.
Mazza, A., Punzo, A., Ingrassia, S. (2018). flexCWM: A flexible framework for cluster-weighted models. Journal of Statistical Software, 86, 1–30.
McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.
McNicholas, P.D. (2016a). Mixture model-based classification. Boca Raton: Chapman and Hall/CRC Press.
McNicholas, P.D. (2016b). Model-based clustering. Journal of Classification, 33(3), 331–373.
Melnykov, V., & Zhu, X. (2018). On model-based clustering of skewed matrix data. Journal of Multivariate Analysis, 167, 181–194.
Melnykov, V., & Zhu, X. (2019). Studying crime trends in the USA over the years 2000–2012. Advances in Data Analysis and Classification, 13(1), 325–341.
Meng, X.-L., & Rubin, D.B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80, 267–278.
Michael, S., & Melnykov, V. (2016). An effective strategy for initializing the EM algorithm in finite mixture models. Advances in Data Analysis and Classification, 10(4), 563–583.
Millo, G., & Carmeci, G. (2011). Non-life insurance consumption in Italy: A sub-regional panel data analysis. Journal of Geographical Systems, 13(3), 273–298.
Millo, G., & Piras, G. (2012). splm: Spatial panel data models in R. Journal of Statistical Software, 47(1), 1–38.
Punzo, A. (2014). Flexible mixture modelling with the polynomial Gaussian cluster-weighted model. Statistical Modelling, 14(3), 257–291.
Punzo, A., & Ingrassia, S. (2015). Parsimonious generalized linear Gaussian cluster-weighted models. In Morlini, I., Minerva, T., Vichi, M. (Eds.), Advances in statistical models for data analysis, Studies in Classification, Data Analysis and Knowledge Organization (pp. 201–209). Switzerland: Springer International Publishing.
Punzo, A., & Ingrassia, S. (2016). Clustering bivariate mixed-type data via the cluster-weighted model. Computational Statistics, 31(3), 989–1013.
Punzo, A., & McNicholas, P.D. (2017). Robust clustering in regression analysis via the contaminated Gaussian cluster-weighted model. Journal of Classification, 34(2), 249–293.
Qiu, W., & Joe, H. (2015). clusterGeneration: Random cluster generation (with specified degree of separation). R package version 1.3.4.
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Sarkar, S., Zhu, X., Melnykov, V., Ingrassia, S. (2020). On parsimonious models for modeling matrix data. Computational Statistics & Data Analysis, 142, 106822.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Subedi, S., Punzo, A., Ingrassia, S., McNicholas, P.D. (2013). Clustering and classification via cluster-weighted factor analyzers. Advances in Data Analysis and Classification, 7(1), 5–40.
Tomarchio, S.D., Punzo, A., Bagnato, L. (2020). Two new matrix-variate distributions with application in model-based clustering. Computational Statistics & Data Analysis, 152, 107050.
Viroli, C. (2011). Finite mixtures of matrix normal distributions for classifying three-way data. Statistics and Computing, 21(4), 511–522.
Viroli, C. (2012). On matrix-variate regression analysis. Journal of Multivariate Analysis, 111, 296–309.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Journal of Classification – Springer Journals
Published: Oct 1, 2021
Keywords: Mixture models; Matrix-variate; Classification; Clustering