Abstract

We introduce a new dynamic clustering method for multivariate panel data characterized by time-variation in cluster locations and shapes, cluster compositions, and possibly the number of clusters. To avoid overly frequent cluster switching (flickering), we extend standard cross-sectional clustering techniques with a penalty that shrinks observations toward the current center of their previous cluster assignment. This links consecutive cross-sections in the panel together, substantially reduces flickering, and enhances the economic interpretability of the outcome. We choose the shrinkage parameter in a data-driven way and study its misclassification properties theoretically as well as in several challenging simulation settings. The method is illustrated using a multivariate panel of four accounting ratios for 28 large European insurance firms between 2010 and 2020.

We propose a new method to cluster multivariate panel data in a dynamic yet stable and economically meaningful way. Building on established cross-sectional clustering methods, such as k-means clustering, we provide a straightforward and intuitive algorithm to link consecutive cross-sections over time by introducing persistence in cluster assignments via a penalty parameter. This parameter can be chosen in a data-driven way. The approach results in clusters that can be time-varying in location, dispersion, size/composition, and (possibly) in the number of clusters. As our approach ties the different cross-sections together, changes happen gradually over time and cluster switches become more persistent. Both of these features are important in many economic and financial applications; see, for example, Bonhomme and Manresa (2015) for a clustering model of economic development and Patton and Weller (2021) for an asset pricing model based on time-invariant clusters. Lumsdaine, Okui, and Wang (2022) present a panel model with structural breaks, allowing for moderate time-variation in the cluster structure, which they apply to describe firms' sales growth over time. Custodio João et al. (2022) analyze the persistent dynamics of banks' business models using a model-based clustering technique.

Many existing econometric approaches for modeling grouped panel data fail to incorporate dynamics in cluster composition, that is, potential changes in units' cluster membership over time. In economic applications, however, we often expect several units to switch clusters over the sample period, particularly when the time-series dimension is large and/or the sample contains periods of stress. Most of the earlier work focuses on clustering entire time series, while allowing for different types of unobserved heterogeneity in the panel units. Examples include Lin and Ng (2012), Bonhomme and Manresa (2015), Bonhomme, Lamadon, and Manresa (2022), Cheng, Schorfheide, and Shao (2019), and Patton and Weller (2021), who use variations of k-means to iteratively cluster time series and estimate the structure of a linear or nonlinear regression model. A variety of model-based methods to cluster panel data are surveyed in Frühwirth-Schnatter (2011); see also Frühwirth-Schnatter and Malsiner-Walli (2019) for a finite mixture model approach in which the number of mixtures can be different from the number of clusters in the observed data. Another line of literature studies clusters in panel data by means of repeated cross-sectional clustering; see, for example, Oliveira and Gama (2012).
This allows for cluster switches, but typically generates clusters that are (too) unstable over time, as the structure obtained at one point in time has no bearing on the next cross-section. In addition, it is often unclear how groups can be tracked over time: cluster labeling is partly arbitrary, and cluster identification across different cross-sections is therefore difficult; see Frühwirth-Schnatter (2006). Cluster assignment instabilities are also likely to occur when the panel is treated as one large cross-section, to which a hierarchical clustering algorithm is applied that ignores the time dimension entirely; see for instance Ayadi et al. (2021).

To accommodate economically meaningful cluster switching, while at the same time avoiding overly frequent switching behavior that ceases to be interpretable, we propose a new penalized model-free approach. The approach extends the repeated cross-sectional clustering framework of Oliveira and Gama (2012) by adding time-dependence to the cluster assignments. The context we have in mind is one where units that switch do so gradually and persistently. For instance, when statistically describing firms' business models, we would not expect a unit to cross from group A to B in one period, only to return from B to A in the next. We label such erratic moves between clusters as "flickering," a feature that we wish to mitigate, while still allowing for flexible dynamics. Specifically, we do so by shrinking observations toward the new (time t) centroid of their previous (time t−1) cluster, before grouping all observations into new clusters at time t. To track the identity of the resulting dynamic clusters, we build on algorithmic ideas that identify clusters by maximizing the overlap in cluster membership over time; see, for instance, Kalnis, Mamoulis, and Bakiras (2005) and Oliveira and Gama (2010).

The penalty parameter that determines the extent of shrinkage in our approach is set in a data-driven way using a modified version of the silhouette index, a widely used cluster validation index introduced by Rousseeuw (1987). We first study the properties of this parameter in terms of misclassification rates in a stylized setting. This allows us to determine optimal values for the penalty parameter analytically. Next, we investigate the approach in a number of challenging simulation settings that are analytically intractable, and verify the theoretical properties numerically as well.

We apply our approach to multivariate panel data of N = 28 European insurance companies covering D = 4 accounting ratios sampled annually between 2010 and 2020. Our sample is close to the set of companies chosen by the European Insurance and Occupational Pensions Authority (EIOPA) for its 2021 insurance sector stress test; see EIOPA (2021, Annex A). We allocate each insurer to one distinct business model (peer) group at each point in time. To our knowledge, our study is the first to do so for the insurance industry. Reliable up-to-date listings of business model peer groups are useful, for example, for prudential supervision. Insurance supervisors, such as the Federal Insurance Office at the U.S. Treasury, or EIOPA as an important part of the European System of Financial Supervision, routinely need to benchmark insurers' capital positions, cost-to-income ratios, and profitability measures.
They do so by comparing each firm's incoming data to that of broadly similar firms; see, for example, SSM (2016) and Lucas, Schaumburg, and Schwaab (2019) for a discussion in a banking context. We recover four clusters: re-insurers, life insurers, non-life insurers, and financial conglomerates. The shrinkage parameter is chosen to decrease the number of incidental switches (flickering) while retaining a high overall fit to the data (in terms of the silhouette index). Our clustering approach leads to stable cluster allocations over time. In contrast, we verify that the clustering outcomes are visibly more volatile and much harder to interpret economically if no shrinkage is imposed. The results are qualitatively similar whether or not we allow the number of clusters to vary over time as well.

Before proceeding, we mention three other links to earlier literature. First, our work relates to the literature on segmenting audio recordings; see, for instance, Fox et al. (2011). A typical finding in this literature is that hidden Markov models (HMMs) can produce over-segmentation, that is, too frequent jumping between states, or "flickering." The problem is typically addressed in a Bayesian way by introducing a parameter for self-transitioning and imposing a prior on it. Our approach is different in that we reduce the dynamic problem to a collection of static ones, and introduce a stickiness (or self-transitioning) hyper-parameter chosen by well-known cluster validation criteria. In addition, our approach is model-free and does not require the choice of a prior as in a Bayesian setting. Second, our work is related to Catania (2021) and Custodio João et al. (2022). Both papers use a dynamic mixture modeling approach that allows for changes in cluster membership. The former does so in a score-driven way, and the latter uses an HMM. Both papers are potentially subject to an over-segmentation or flickering problem; see Fox et al. (2011). Custodio João et al. (2022) address this by enlarging the HMM dynamics with inactive states, ruling out further transitions for some time after an initial transition. Our methodology differs in at least two ways. First, we adopt a more standard, non-parametric approach to the clustering problem without leaning on explicit distributional assumptions as in Catania (2021) and Custodio João et al. (2022). This allows for an easy generalization of our approach to different clustering algorithms. Second, our penalty parameter determining the stickiness in cluster membership is chosen in a data-driven way, whereas the one in Custodio João et al. (2022) is set exogenously using economic arguments. A final strand of recent literature that is somewhat related to ours focuses on grouped heterogeneity in panels and structural breaks; see, for instance, Lumsdaine, Okui, and Wang (2022), Smith (2022), and Wang and Tsay (2019). Even though in these approaches the number and timing of the structural breaks are unknown and can be estimated, a main assumption is that there is a small number of breaks and that the breaks are common to the parameters and group memberships of all units. Our method is different as it allows for cluster switches of individual units at any point in time.

The remainder of this article is set up as follows. In Section 1, we introduce the methodology. Section 2 considers a simplified setting where we study misclassification probabilities and optimal penalty parameters analytically.
Section 3 studies the new approach in a controlled environment and shows reductions in overall misclassification rates in line with our analytical results. Section 4 discusses the empirical application, while Section 5 concludes. The Appendix provides proofs and further technical and empirical results.

1 Methodology

In this section, we first introduce our robust clustering methodology. Next, we explain how we link cluster identities over time, which is a crucial step in our method. Finally, we provide data-driven ways to select the shrinkage penalty parameter in our approach.

1.1 Penalized Cross-Sectional Clustering

Consider a panel of multivariate financial data, with $x_{i,t} \in \mathbb{R}^{D \times 1}$ denoting a vector of observed characteristics for unit $i=1,\ldots,N$ at time $t=1,\ldots,T$. Our goal is to assign each unit i to a peer group of similar units at each point in time t. An example of such a situation is the monitoring of business models in the financial industry by a prudential supervisor as in Custodio João et al. (2022). In the realistic setting of changing market conditions, technological advances, and shifts in regulatory requirements, we expect that some firms may move to a different group or business model at some point in time. However, switching from group A to group B at one point in time, only to switch back from B to A in the following period, is unrealistic in many situations that involve long-term strategies. A suitable clustering method should therefore mitigate excessive cluster switches.

To illustrate this, consider an example with D = 2 features in Figure 1. Assume we cluster each cross-section t separately into two clusters by, for instance, a k-means approach. Units are then assigned to the cluster with the closest cluster center. This divides the space into two regions. If an observation $x_{i,t}$ at time t is close to the border that separates the clusters, as in the left-hand panel, even a small disturbance to its position might shift it to the other cluster. A second switch might then occur if it is subject to another small and roughly opposite disturbance in the next period, and so on. We would observe short-lived cluster switches, or "flickering," caused by little actual movement. Such flickering might not be economically meaningful, and is therefore undesirable.

Figure 1. k-means clustering at two consecutive times. The red circles represent the location of the cluster centers. The blue line separates the clusters and is halfway between both cluster centers.

The approach presented in this article takes the cross-section at time t and combines it with the t−1 cluster assignments to produce sticky assignments over consecutive cross-sections. For instance, if a unit is assigned to cluster A at time t−1, we first shrink that unit's observation at time t toward the mean of cluster A at time t before re-classifying it. To solve the arbitrary labeling of clusters over different cross-sections, we propose a mapping procedure based on the maximum overlap in cluster membership: for instance, if 90% of the units in a particular cluster at time t have the same identity as what was called "cluster A" at time t−1, then we label that cluster also "cluster A" at time t. The precise mapping procedure is explained in detail in Section 1.2. We introduce some notation and present the formal algorithm, which is summarized in Algorithm 1.
Let $h_{i,t}$ denote the cluster assignment of unit i at time t, such that $h_t = (h_{1,t},\ldots,h_{N,t})'$ denotes the $N \times 1$ vector of all cluster assignments for cross-section t. We start at time t = 1 with a standard cross-sectional clustering algorithm and cluster selection criterion to obtain the number of clusters $K_t$ and the cluster identities $h_t$ at t = 1. Next, we move to t = 2 and run a clustering algorithm to obtain a candidate set of cluster assignments $\tilde h_t$. Using the mapping methodology M of Section 1.2, we relabel the cluster identities in $\tilde h_t$ to $\tilde h_t' = M(h_{t-1}, \tilde h_t)$, such that the identities in $h_{t-1}$ and $\tilde h_t'$ are comparable. Based on $\tilde h_t'$, we compute the current (candidate) location of each of the previous clusters in $h_{t-1}$, except for the clusters that were discontinued. For example, if we estimate the cluster location by its mean, the current (candidate) location of unit i's previous cluster can be estimated by $c(h_{i,t-1}, \tilde h_t') = (\# P_i)^{-1} \sum_{j \in P_i} x_{j,t}$, where $P_i = \{ j \mid \tilde h_{j,t}' = h_{i,t-1} \}$ is the set of units currently (candidate) assigned to the cluster that unit i belonged to one period ago, and $\# P_i$ denotes the number of elements in $P_i$. If the number of elements in $P_i$ is positive, we then shrink $x_{i,t}$ toward the current location of its previous cluster. We do so by defining

$$ \tilde x_{i,t} = (1-\varepsilon)\, x_{i,t} + \varepsilon\, c(h_{i,t-1}, \tilde h_t'), \tag{1} $$

where ε is a fixed penalty parameter in the unit interval. The effect can be seen in Figure 2.

Algorithm 1. Dynamic clustering with shrinkage

We can interpret $\tilde x_{i,t}$ as an artificial position lying an ε fraction of the way from the current position $x_{i,t}$ toward the current center of its last cluster $h_{i,t-1}$. Using the shrunk observations $\tilde x_{i,t}$, we run a second pass of the cluster assignments as

$$ h_{i,t} = \mathbb{1}_{i,t} \cdot \tilde h_{i,t}' + (1 - \mathbb{1}_{i,t}) \cdot h_{i,t-1}, \tag{2} $$

$$ \mathbb{1}_{i,t} = \begin{cases} 1 & \text{if } \# P_i = 0 \ \text{ or } \ d\big(\tilde x_{i,t},\, c(\tilde h_{i,t}', \tilde h_t')\big) < d\big(\tilde x_{i,t},\, c(h_{i,t-1}, \tilde h_t')\big), \\ 0 & \text{else}, \end{cases} \tag{3} $$

where d is a distance measure. In words: if the shrunk observation $\tilde x_{i,t}$ is closer to the new candidate cluster, or if the old cluster is discontinued, the unit switches to the new cluster. Otherwise, the unit remains in the old cluster.1 The shrinkage of the observation toward the current location of the previous cluster ensures that cluster switches become less likely. If ε equals zero, there is no shrinkage and units can switch cluster identity freely from one cross-section to the next. The steps are repeated for all cross-sections $1,\ldots,T$, including a step to determine the number of clusters in each cross-section. The complete algorithm is summarized in Algorithm 1.

The initial clustering at t = 1 is the best approximation of the true group structure in the absence of past data. By construction, any error made then will persist for longer the higher ε is set. Section 2 studies conditions under which the initial error tapers off.

It is important to note that we have been silent thus far about which clustering algorithm is used, which distance measure d, and which measure of cluster centroid c. This means that the current shrinkage technique can be applied in a wide variety of settings. Any cross-sectional clustering algorithm that produces a distance measure can be adapted in the above way to feature stickiness. For example, in graph-based algorithms such as in Zahn (1971) or Grundmann et al. (2010), we can shrink the weight of edges connecting points that belonged to the same cluster at t−1. For simplicity, we use k-means in our simulations and empirical application, but other methods are possible if the data call for more exotic cluster shapes.
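To make the second pass concrete, the following is a minimal sketch of the shrinkage and re-assignment step in Equations (1)-(3), assuming mean centroids and Euclidean distance as in the k-means case; the function and variable names are ours, not part of the paper.

```python
import numpy as np

def shrinkage_step(X_t, h_prev, h_cand, eps):
    """One pass of Equations (1)-(3): shrink each unit toward the current
    location of its previous cluster, then keep or switch its assignment.

    X_t    : (N, D) array of observations at time t
    h_prev : (N,) cluster labels at t-1 (already in the common label space)
    h_cand : (N,) candidate labels at t (relabeled via the mapping step)
    eps    : shrinkage parameter in [0, 1)
    """
    h_new = h_cand.copy()
    for i in range(len(X_t)):
        P_i = np.flatnonzero(h_cand == h_prev[i])    # units now in i's old cluster
        if len(P_i) == 0:                            # old cluster discontinued:
            continue                                 # keep the candidate label
        c_old = X_t[P_i].mean(axis=0)                # c(h_{i,t-1}, h~'_t)
        x_shrunk = (1 - eps) * X_t[i] + eps * c_old  # Equation (1)
        c_new = X_t[h_cand == h_cand[i]].mean(axis=0)
        # Equations (2)-(3): switch only if the shrunk point is strictly
        # closer to the candidate cluster's current center
        if np.linalg.norm(x_shrunk - c_new) >= np.linalg.norm(x_shrunk - c_old):
            h_new[i] = h_prev[i]
    return h_new
```

In a full implementation of Algorithm 1, this step would be preceded by a standard cross-sectional clustering of the raw $x_{i,t}$ and by the mapping step of Section 1.2.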
We can be similarly flexible with regard to the choices of distance measures d and centroids c. For instance, if distances are Mahalanobis-based, we can choose to pool across all cross-sections to compute (cluster) covariance matrices, or alternatively compute such matrices per cross-section, thus allowing for heteroskedasticity.

To select the number of clusters in each cross-section in Algorithm 1, we use the silhouette index of Rousseeuw (1987). Like other cluster selection criteria, it favors homogeneity of units within each cluster as well as heterogeneity between clusters; better scores on either dimension result in higher values of the index. We pick the number of clusters $K_t$ that maximizes the average silhouette index. The silhouette of point i at time t for a given $K_t$ is

$$ s_{it} = \frac{b(x_{i,t}) - a(x_{i,t})}{\max\{a(x_{i,t}),\, b(x_{i,t})\}}, \qquad a(x_{i,t}) = d(x_{i,t}, C_{h_{i,t},t}), \qquad b(x_{i,t}) = \min_{k \neq h_{i,t}} d(x_{i,t}, C_{k,t}), \tag{4} $$

where $a(x_{i,t})$ is the average distance from point i to the other points in its own cluster $C_{h_{i,t},t}$ and $b(x_{i,t})$ is the average distance from point i to the points in the nearest other cluster. Following Rousseeuw (1987), we set $s_{it} = 0$ if $C_{h_{i,t},t}$ only contains unit i. Intuitively, the average silhouette index

$$ s_t = \frac{1}{N} \sum_{i=1}^{N} s_{it} \tag{5} $$

measures how tightly the observations are clustered around the cluster mean [when $a(x_{i,t})$ is low on average] and how separate the clusters are from each other [when $b(x_{i,t})$ is high on average]. This makes it a useful measure of fit, which we adapt in Section 1.3 to obtain a data-driven way to select the shrinkage parameter ε.

1.2 Mapping

Standard cross-sectional clustering algorithms produce arbitrary cluster labels that have no relation to the labels assigned in previous cross-sections. This complicates the identification of the current location of an observation's previous cluster. To remedy this, we need to find a correspondence between the labels at t−1 and new candidate labels at time t. We do so by looking at the overlap of every two clusters at consecutive times, as in Kalnis, Mamoulis, and Bakiras (2005). To illustrate, consider a setting where at time t the cross-sectional clustering algorithm produces clusters A and B, while at time t + 1 it produces clusters labeled C and D. If all units that belong to cluster A at time t also belong to cluster D at time t + 1, and all units that belong to cluster B at time t belong to cluster C at t + 1, then the most natural correspondence is to assign the same label to A and D, and similarly to B and C. Following Oliveira and Gama (2010), we refer to this procedure as mapping.

To generalize this idea to the less obvious case where there are switches, we form a contingency matrix whose element in row i and column j gives how many points were assigned to cluster i at time t and to cluster j at time t + 1. We can then formalize the idea of maximizing overlap between clusters at different times as maximizing the trace of this matrix with respect to the ordering of the columns. For example, if both periods have two clusters and the maximum is attained when the contingency matrix is formed with the column corresponding to cluster D on the left and the one for cluster C on the right, then cluster D at t + 1 maps to cluster A (row 1) at t, and so on, as in the example below (rows correspond to clusters A and B, columns to C and D, and then to D and C after reordering):

$$ \operatorname{tr} \begin{pmatrix} 3 & 2 \\ 5 & 1 \end{pmatrix} = 4 \quad \rightarrow \quad \operatorname{tr} \begin{pmatrix} 2 & 3 \\ 1 & 5 \end{pmatrix} = 7. $$

This formal problem can be written as

$$ \max_{P}\ \operatorname{tr}(C_t^* P), \tag{6} $$

where $C_t^*$ is the contingency matrix from time t to t + 1, and P is a permutation matrix.
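As a sketch of how Equation (6) can be solved in practice, the snippet below builds the contingency matrix and maximizes its trace with SciPy's optimal assignment solver (the Hungarian algorithm discussed next); the zero-padding anticipates the rectangular case treated in the remainder of this subsection. All names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_labels(h_prev, h_cand):
    """Relabel candidate clusters at time t to maximize membership overlap
    with the clusters at t-1, i.e., solve max_P tr(C P) in Equation (6)."""
    old_ids = np.unique(h_prev)
    new_ids = np.unique(h_cand)
    k = max(len(old_ids), len(new_ids))
    C = np.zeros((k, k))                     # zero-padding as in Equation (7)
    for a, i in enumerate(old_ids):
        for b, j in enumerate(new_ids):
            C[a, b] = np.sum((h_prev == i) & (h_cand == j))
    row, col = linear_sum_assignment(C, maximize=True)
    relabel = {new_ids[b]: old_ids[a] for a, b in zip(row, col)
               if a < len(old_ids) and b < len(new_ids)}
    # clusters without a counterpart at t-1 receive fresh labels
    next_label = old_ids.max() + 1
    out = np.empty_like(h_cand)
    for j in new_ids:
        lab = relabel.get(j)
        if lab is None:
            lab, next_label = next_label, next_label + 1
        out[h_cand == j] = lab
    return out
```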
The optimal P can easily be interpreted: cluster i at t + 1 maps to cluster j at t if $P_{i,j} = 1$. For a small number of clusters at t + 1 (7 or 8, say), Equation (6) can be solved by exhaustive search. For larger numbers of clusters, an efficient algorithm is available, known as the Hungarian algorithm (Kuhn 1955).

An extension of the above method to the situation where the number of clusters increases or decreases over time can be defined as follows. The contingency matrix then becomes rectangular, so it is no longer possible to compute its trace. This is solved by maximizing the trace of the largest square matrix inside it by switching the columns of the rectangular matrix. That is, the extra clusters' overlap will not enter the objective function. We can still formulate the problem as in Equation (6) if we augment $C_t^*$ with a matrix of zeroes such that the resulting matrix is square, that is,

$$ C_t^{**} = \begin{pmatrix} C_t^* \\ 0_{m-n \times m} \end{pmatrix}, \qquad C_t^{**} = \begin{pmatrix} C_t^* & 0_{n \times n-m} \end{pmatrix}, \tag{7} $$

for the case n < m and n > m, respectively, where $0_{a \times b}$ is the matrix of zeroes of dimension a × b. The problem can then again be written as Equation (6), with $C_t^{**}$ taking the place of $C_t^*$. This formulation is equivalent to stating that the extra clusters exist in both time steps but have no members in one of them [as Frühwirth-Schnatter and Malsiner-Walli (2019) do in their paper]. Therefore, this extended problem can still be solved by the Hungarian algorithm. It does not, however, provide a solution for ties, that is, when two relabelings of the clusters at t + 1 provide the same trace $\operatorname{tr}(C_t^* P)$. Luckily, such situations are empirically exceedingly rare. Still, in such rare cases, ties can be broken in a variety of ways, for instance by considering the overlap with the cross-section at t−2, or by using the correspondence with the closest cluster means.

1.3 Selection of the Shrinkage Parameter

In this section, we propose a modification of the silhouette index to set the shrinkage parameter ε in Equation (1). In Section 3, we benchmark this statistic against the cross-validation (CV) approach of Fu and Perry (2020), which we also briefly introduce here. The latter, however, turns out to work less well.

Our aim is to reduce misclassification rates of observations to clusters. As clustering is an unsupervised learning technique, such misclassification can only be studied in a controlled setting, as we do in Sections 2 and 3. For real data, misclassification cannot be measured and we look instead at measures of cluster fit based on the silhouette index. Here, a trade-off has to be made between the best fit on the one hand, and stability of cluster assignments (no undue flickering) on the other. As flickering is a highly transitory phenomenon, we take advantage of this fact to inform our choice of ε. Specifically, we look for values of ε that have a large effect on bringing down the number of switches, but only a modest effect on the overall clustering fit as measured by the silhouette index. To aggregate the silhouette index across cross-sections, we use the Gini-weighted average version as in David (1968), re-scaled by N:

$$ G_t = \sum_{k=1}^{K_t} \frac{(2k - K_t - 1) \cdot \# P_{k:K_t}}{K_t \cdot N} = \sum_{i=1}^{K_t} \sum_{j=1}^{K_t} \frac{|\# P_i - \# P_j|}{2 K_t \cdot N}, \qquad \mathrm{GWS} = \sum_{t=1}^{T} (1 - G_t) \cdot s_t, $$

where $\# P_{k:K_t}$ is the number of units in the k-th smallest cluster, and $s_t$ denotes the silhouette index of cross-section t. The second expression for $G_t$ clearly shows that $G_t$ equals zero when the clusters have homogeneous sizes, and increases as inequality in cluster size increases.
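A compact sketch of the GWS computation, under the assumption that the per-period silhouette indices and cluster sizes have already been collected as by-products of Algorithm 1; the names are illustrative.

```python
import numpy as np

def gini_weighted_silhouette(sil_by_t, sizes_by_t):
    """Gini-weighted silhouette (GWS): down-weights cross-sections whose
    cluster sizes are very unequal (e.g., single-outlier clusters).

    sil_by_t   : list of average silhouette indices s_t, one per cross-section
    sizes_by_t : list of arrays with the cluster sizes #P_k at each t
    """
    gws = 0.0
    for s_t, sizes in zip(sil_by_t, sizes_by_t):
        sizes = np.asarray(sizes, dtype=float)
        K, N = len(sizes), sizes.sum()
        # Gini coefficient of cluster sizes, rescaled by N as in the text
        G_t = np.abs(sizes[:, None] - sizes[None, :]).sum() / (2 * K * N)
        gws += (1 - G_t) * s_t
    return gws
```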
An important reason for choosing the Gini-weighted silhouette (GWS) over other measures is that it penalizes clusters with a single outlying observation. Rousseeuw (1987) already notes that the simple average silhouette can be vulnerable to outliers: "a situation where the data set contains one far outlier is also an example of a strong clustering structure. Indeed, when the outlier is far enough, the other data look like a tight cluster by comparison." By applying the Gini weights rather than computing a simple average, we avoid picking cluster numbers that result in single (outlying) observation clusters. This is also in line with our objective to reduce flickering: we want to discourage the short-lived birth and death of small, isolated clusters from one cross-section to the next. The GWS statistic is easy to compute and a direct by-product of Algorithm 1.2

To benchmark the GWS index, we also compute the CV statistic for clustering proposed by Fu and Perry (2020). In their paper, Fu and Perry (2020) use CV to determine the optimal number of clusters in a cross-sectional clustering problem. We, instead, use their approach to set the shrinkage parameter ε. Following Fu and Perry (2020), we first randomly split the units of our dataset into three equal groups, and the variables into two groups. Next, we build six folds out of these groups in the following way. One of the three groups of units is assigned to be the training set, while the other two form the test set. Also, one group of variables is taken as predictor variables ($X_t^{tr}$ and $X_t^{te}$ for the training and test sample, respectively). The other variables are called response variables ($Y_t^{tr}$ and $Y_t^{te}$).

In each data fold, we apply the following steps to arrive at a measure of CV error. First, we cluster $Y_t^{tr}$ using our shrinkage methodology to obtain labels $c_t^{tr}$ and corresponding cluster means $\mu_{t,k}^{tr,Y}$ for the training response variables and the $k=1,\ldots,K_t$ clusters. We also cluster the observations in $Y_t^{te}$ using the same shrinkage approach, but based on the already estimated cluster means $\mu_{t,k}^{tr,Y}$ and the same number of clusters $K_t$. This gives us cluster labels $c_t^{te}$. We treat the labels $c_t^{tr}$ and $c_t^{te}$ as observed "pseudo-labels" in the next step. Next, we perform a classification step on $X_t^{te}$. Though in principle any classification model could be used, we follow Fu and Perry (2020) and use a simple classifier that estimates cluster means $\mu_{t,k}^{tr,X}$ of $X_t^{tr}$ based on the assignments $c_t^{tr}$. We then predict cluster assignments $\hat c_t^{te}$ for $X_t^{te}$ by assigning each observation in $X_t^{te}$ to the cluster with the closest mean $\mu_{t,k}^{tr,X}$. Our CV error is then $\| Y_t^{te} - \mu_{t,\hat c_t^{te}}^{te,Y} \|^2$, that is, the prediction error for the response variable in the testing sample based on the predictor variables' classification model. The squared CV errors are averaged over all observations and all folds, and subsequently minimized to compute the optimal shrinkage parameter ε.

2 Analytical Results for a Simplified Model

This section presents a simplified clustering model that allows us to investigate analytically under which conditions a higher shrinkage parameter ε improves correct classification rates. We consider a univariate data generating process, where the observation $x_t$ depends on its cluster center $c_t$ according to

$$ x_t = c_t + \eta_t, $$

where $\eta_t$ has cdf $F(\eta_t)$ with zero mean and variance $\sigma^2$. The clusters are labeled based on their cluster center $c_t$, which we normalize to 0 and 1, that is, $c_t \in \{0,1\}$. The similarity of the clusters is then fully determined by the cdf $F(\cdot)$.
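For later reference, this stylized DGP can be simulated in a few lines. The sketch below draws Gaussian noise, the distribution also used in the figures of this section; the helper name is ours.

```python
import numpy as np

def simulate_dgp(T, p, sigma, seed=None):
    """Simulate x_t = c_t + eta_t with c_t in {0, 1} a symmetric Markov
    chain that switches with probability p, and eta_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    c = np.empty(T, dtype=int)
    c[0] = rng.integers(2)
    for t in range(1, T):
        c[t] = 1 - c[t - 1] if rng.random() < p else c[t - 1]
    x = c + rng.normal(0.0, sigma, size=T)
    return x, c
```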
If the support of $F(\eta_t)$ is highly concentrated around $\eta_t = 0$, the clusters are well-separated. If, in contrast, the support of $F(\eta_t)$ is widespread around zero, the clusters are very similar and difficult to distinguish. We allow for potential switching of cluster membership by assuming that $c_t$ follows a Markov chain with transition probability p, that is, a transition matrix P of the form

$$ P = \begin{pmatrix} P(c_t=0 \mid c_{t-1}=0) & P(c_t=0 \mid c_{t-1}=1) \\ P(c_t=1 \mid c_{t-1}=0) & P(c_t=1 \mid c_{t-1}=1) \end{pmatrix} = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}. $$

The repeated k-means clustering procedure in this setting then corresponds to classifying $x_t$ to either the cluster with center 0 ($\hat c_t = 0$) or 1 ($\hat c_t = 1$), regardless of the previous cluster assignment $\hat c_{t-1}$, that is, using ε = 0 and basing the assignment on $\hat c_t = \arg\min_{c \in \{0,1\}} |x_t - c|$.3 The penalized clustering methodology introduced in Section 1 relies on the previous assignment $\hat c_{t-1}$. It can be written as

$$ \hat c_t^{\varepsilon} = \hat c^{\varepsilon}(x_t \mid \hat c_{t-1}) = \begin{cases} 1 & \text{if } x_t(1-\varepsilon) + \varepsilon \hat c_{t-1} > 1/2, \\ 0 & \text{otherwise.} \end{cases} \tag{8} $$

We call this the ε-classifier. The repeated k-means procedure is a special case of this approach for ε = 0.

Using the above set-up, we can analytically derive the probability of misclassification $P(\hat c_t^{\varepsilon} \neq c_t)$. In our first result, we define the one-step-ahead misclassification rate as the misclassification probability at t given perfect information about the true cluster membership at t−1.

Proposition 1: Given $c_{t-1}$, the one-step-ahead misclassification probability of the ε-classifier is

$$ P(\hat c_t^{\varepsilon} \neq c_t \mid c_{t-1}) = F\!\left(\frac{\varepsilon - 1/2}{1-\varepsilon}\right) p + F\!\left(\frac{-1/2}{1-\varepsilon}\right)(1-p), \tag{9} $$

where F denotes the cdf of $\eta_t$.

All proofs can be found in Appendix A. A few features of Equation (9) are worth noting. The k-means error (ε = 0) simplifies to $P(\hat c_t \neq c_t \mid c_{t-1}) = F(-1/2)$, which is insensitive to p. At the other extreme, as ε → 1, the error approaches p. Figure 3 plots the misclassification probability for the more interesting intermediate values of ε using a normal distribution $N(0,\sigma^2)$ for $F(\eta_t)$. The minimum of each curve is marked by a dot. In most cases, the minimum classification error is obtained at some intermediate value of ε. The reduction in classification errors is larger for smaller values of p, that is, situations where there is infrequent switching and where time t−1 information is most informative about the cluster identity at time t. Improvements are also larger when clusters are more similar (high $\sigma^2$). Only for p = 0.5 does the position at t−1 carry no information for the cluster assignment at t, and there is no benefit in introducing membership persistence via ε > 0. In Section 3, we will see that these analytical results bear close resemblance to the simulation results.

Figure 3. Plot of the one-step-ahead misclassification probability in Equation (9). The dot on each curve indicates the lowest misclassification rate. The figure is based on $\eta_t \sim N(0,\sigma^2)$, $x_t = c_t + \eta_t$, and $c_t \in \{0,1\}$ a Markov chain with switching probability p. For σ = 0.25, the clusters are well-separated, while for σ = 1 the clusters largely overlap.

Using Proposition 1, we can analytically characterize the optimal value of ε that minimizes the misclassification rate.

Proposition 2: If $\eta_t \sim$ iid $N(0,\sigma^2)$, the value $\varepsilon^*$ which minimizes the misclassification rate [Equation (9)] for $0 < p < \tfrac{1}{2}$ is

$$ \varepsilon^* = \frac{2\sigma^2 \log\!\big(\tfrac{p}{1-p}\big)}{2\sigma^2 \log\!\big(\tfrac{p}{1-p}\big) - 1}. \tag{10} $$

Proposition 2 confirms what can be seen in Figure 3. Higher levels of noise and lower switching probabilities p push $\varepsilon^*$ upward, implying that higher shrinkage and thus more persistence in the classifier is optimal in such cases.
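Equations (9) and (10) are straightforward to evaluate numerically; the sketch below does so for Gaussian noise and can be used to reproduce the qualitative pattern in Figure 3. This is illustrative code, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def misclass_one_step(eps, p, sigma):
    """One-step-ahead misclassification probability, Equation (9),
    with eta_t ~ N(0, sigma^2)."""
    F = lambda x: norm.cdf(x, scale=sigma)
    return F((eps - 0.5) / (1 - eps)) * p + F(-0.5 / (1 - eps)) * (1 - p)

def eps_star(p, sigma):
    """Optimal shrinkage parameter, Equation (10), valid for 0 < p < 1/2."""
    L = 2 * sigma**2 * np.log(p / (1 - p))
    return L / (L - 1)

# Example: infrequent switching with heavily overlapping clusters
p, sigma = 0.01, 1.0
e = eps_star(p, sigma)                     # analytic optimum
print(e, misclass_one_step(e, p, sigma))   # lower error than for eps = 0
print(misclass_one_step(0.0, p, sigma))    # repeated k-means benchmark
```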
In most cases of empirical interest, switching is present but infrequent (0 < p < 0.5), leading to strictly positive values of $\varepsilon^*$. We can extend Proposition 1 to the case where $x_t$ exhibits mean-reverting dynamics in addition to a Markov switching center.

Corollary 1: If $x_t$ follows the dynamics $x_t = c_t + \beta(x_{t-1} - c_t) + \eta_t$, then the one-step-ahead probability of error of the ε-classifier is

$$ P(\hat c_t^{\varepsilon} \neq c_t \mid c_{t-1}) = F\!\left(\frac{\varepsilon - 1/2}{1-\varepsilon} + \beta\big(x_{t-1}(2c_{t-1}-1) + 1 - c_{t-1}\big)\right) p + F\!\left(\frac{-1/2}{1-\varepsilon} + \beta\big(x_{t-1}(1-2c_{t-1}) + c_{t-1}\big)\right)(1-p), \tag{11} $$

where F denotes the cdf of $\eta_t$.

We note that the introduction of the term $\beta(x_{t-1} - c_t)$ does not change the main features of the misclassification probability [Equation (11)] when compared with Equation (9). In particular, the concavity of the misclassification rate is still present, as is the strictly positive optimal value of ε; see Figure B.1 in Appendix B.

Proposition 1 assumes that the true past cluster mean $c_{t-1}$ is known. This is admittedly unrealistic. To arrive at an unconditional misclassification rate, and a corresponding optimal shrinkage parameter $\varepsilon^*$, we propagate the classification process n steps ahead to derive the misclassification rate $P(\hat c_t^{\varepsilon} \neq c_t \mid c_{t-n})$. Let the conditional correct classification probabilities at time t be $q_{i,t} = P(\hat c_t^{\varepsilon} = i \mid c_t = i)$ for i = 0, 1, and let $q_t = (q_{0,t}, q_{1,t})'$. Also, define the marginal probabilities of the true states $\pi_{i,t} = P(c_t = i)$ with $\pi_t = (\pi_{0,t}, \pi_{1,t})'$, such that the probability of correct classification can be written as $\pi_t' q_t = q_{0,t}\pi_{0,t} + q_{1,t}\pi_{1,t}$. Then the following proposition gives the recursion for $q_{t+1}$.

Proposition 3: The conditional correct classification probabilities $q_t$ follow the recursion

$$ q_{t+1} = \begin{pmatrix} (z_{00} - z_{10})(1-p)\,\dfrac{\pi_{0,t}}{\pi_{0,t+1}} & (z_{10} - z_{00})\,p\,\dfrac{\pi_{1,t}}{\pi_{0,t+1}} \\[2ex] (z_{01} - z_{11})\,p\,\dfrac{\pi_{0,t}}{\pi_{1,t+1}} & (z_{11} - z_{01})(1-p)\,\dfrac{\pi_{1,t}}{\pi_{1,t+1}} \end{pmatrix} q_t + \begin{pmatrix} z_{00}\,p\,\dfrac{\pi_{1,t}}{\pi_{0,t+1}} + z_{10}(1-p)\,\dfrac{\pi_{0,t}}{\pi_{0,t+1}} \\[2ex] z_{01}(1-p)\,\dfrac{\pi_{1,t}}{\pi_{1,t+1}} + z_{11}\,p\,\dfrac{\pi_{0,t}}{\pi_{1,t+1}} \end{pmatrix}, \tag{12} $$

where

$$ z_{i0} = F\!\left(\frac{1/2 - i\,\varepsilon}{1-\varepsilon}\right), \qquad z_{i1} = 1 - F\!\left(\frac{1/2 - i\,\varepsilon}{1-\varepsilon} - 1\right). $$

We note that Proposition 1 is a special case of Equation (12) obtained by taking $q_t = (1,1)'$. Other values can be chosen to reflect the uncertainty in the first step. In particular, we can use the output of the first step as input for a second step to obtain the two-step-ahead error rate. This process can be repeated n steps. By iterating further and further, we can study whether introducing persistence in clustering (ε > 0) has a lasting benefit, whatever the initialization used. The following two corollaries present the results for n → ∞, establishing that some strictly positive shrinkage parameter is generally optimal even if no information is available about the previous cluster label $c_{t-1}$. The result is established for our case of a symmetric Markov chain for $c_t$, where $\lim_{t \to \infty} \pi_{i,t} = 0.5$ for p > 0. Derivations for an asymmetric Markov chain are very similar.

Corollary 2: The limiting probabilities of correct classification q for a symmetric Markov chain $P(c_t=1 \mid c_{t-1}=0) = P(c_t=0 \mid c_{t-1}=1) = p$ are

$$ q = \begin{pmatrix} 1 - (1-p)(z_{00} - z_{10}) & p\,(z_{00} - z_{10}) \\ p\,(z_{11} - z_{01}) & 1 - (1-p)(z_{11} - z_{01}) \end{pmatrix}^{-1} \begin{pmatrix} z_{10} + p\,(z_{00} - z_{10}) \\ z_{01} + p\,(z_{11} - z_{01}) \end{pmatrix}. $$

The corresponding limiting misclassification probability is

$$ \lim_{t \to \infty} P(\hat c_t^{\varepsilon} \neq c_t) = 1 - \frac{1}{2}\, \frac{z_{01}(1 - \tilde z_{00}) + z_{10}(1 - \tilde z_{11}) + p\,(\tilde z_{00} + \tilde z_{11} - 2\tilde z_{11}\tilde z_{00})}{1 - (1-p)(\tilde z_{00} + \tilde z_{11}) + (1-2p)\,\tilde z_{00}\tilde z_{11}}, \tag{13} $$

where $\tilde z_{00} = z_{00} - z_{10}$ and $\tilde z_{11} = z_{11} - z_{01}$.

Corollary 3: Let $f(\eta_t)$ be the pdf of $\eta_t$ corresponding to the cdf $F(\eta_t)$, and let the limiting misclassification probability be $\tilde q = 1 - \tfrac{1}{2}(\iota_2' q)$, where $\iota_2 \in \mathbb{R}^2$ is a vector of ones.
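The recursion in Equation (12) is easy to iterate numerically. The sketch below specializes to the symmetric steady-state case $\pi_{0,t} = \pi_{1,t} = 1/2$, so the probability ratios drop out; starting from $q_t = (1,1)'$, one iteration reproduces Proposition 1, and for large n the result approximates the limit in Corollary 2. Function names are ours.

```python
import numpy as np
from scipy.stats import norm

def limiting_misclassification(eps, p, sigma, n=1000):
    """Iterate Equation (12) with pi_0 = pi_1 = 1/2 and Gaussian noise,
    starting from q = (1, 1)' (perfect initial information)."""
    F = lambda x: norm.cdf(x, scale=sigma)
    z = lambda i, j: (F((0.5 - i * eps) / (1 - eps)) if j == 0
                      else 1 - F((0.5 - i * eps) / (1 - eps) - 1))
    z00, z10, z01, z11 = z(0, 0), z(1, 0), z(0, 1), z(1, 1)
    M = np.array([[(1 - p) * (z00 - z10), p * (z10 - z00)],
                  [p * (z01 - z11), (1 - p) * (z11 - z01)]])
    v = np.array([p * z00 + (1 - p) * z10,
                  (1 - p) * z01 + p * z11])
    q = np.ones(2)
    for _ in range(n):
        q = M @ q + v
    return 1 - 0.5 * q.sum()   # misclassification = 1 - (q0 + q1)/2
```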
Then, under the same conditions as Corollary 2, the derivative of the limiting misclassification probability at ε = 0 is given by

$$ \left.\frac{\partial \tilde q}{\partial \varepsilon}\right|_{\varepsilon=0} = \frac{1}{4}\Big( f(\tfrac{1}{2}) - f(-\tfrac{1}{2}) \Big) + \frac{1}{2}\, p\, \Big( f(-\tfrac{1}{2}) F(\tfrac{1}{2}) - f(\tfrac{1}{2}) F(-\tfrac{1}{2}) \Big) + \frac{1}{2}(p-1)\Big( f(\tfrac{1}{2}) F(\tfrac{1}{2}) - f(-\tfrac{1}{2}) F(-\tfrac{1}{2}) \Big). $$

If the pdf f is symmetric around zero, this expression simplifies to

$$ \left.\frac{\partial \tilde q}{\partial \varepsilon}\right|_{\varepsilon=0} = -\frac{1}{2}(1-2p)\, f(\tfrac{1}{2}) \big( 2 F(\tfrac{1}{2}) - 1 \big), $$

which is negative for p < 0.5.

Figure 4 plots the n-step-ahead misclassification rate from Proposition 3 and its limit from Corollary 2 for n ∈ {1, 5, ∞} and $\eta_t \sim N(0, 0.5^2)$. Introducing clustering persistence clearly pays off both in the short and the long term. The minimum misclassification rate is reached at some ε > 0 as long as p < 0.5. This follows from the derivative of the misclassification rate at the origin ε = 0 in Corollary 3, which is negative for any p < 0.5 and any distribution F of $\eta_t$ that is symmetric around zero. We also note that the drop in the misclassification rate remains substantial in the limiting case n → ∞ for p ≤ 0.1. Finally, all these results align with our simulation results in Section 3: under moderate switching, the error rate is concave in ε, and more so if clusters are less well separated (i.e., higher σ). This helps to recognize situations in which it is advisable to allow for cluster switching over time, balanced with the shrinkage approach proposed in this article.

Figure 4. The n-step-ahead misclassification rate $\pi_t' q_t$ for n = 1, 5 and n → ∞ using Equation (12) and σ = 0.5.

3 Simulation Study

In this section, we investigate the ability of our method to assign units to their respective clusters at each point in time. All simulations are done using a six-dimensional Gaussian distribution (D = 6) to allow for at least three variables in each fold of the CV approach of Fu and Perry (2020). The different cluster centers are drawn randomly from the vertices of a six-dimensional unit hypercube. In the baseline simulation setting, the cluster covariance matrices are set equal to the identity matrix. Unit variances for cluster centers on the vertices of a unit cube imply that there is substantial cluster overlap and thus substantial misclassification risk. We therefore also consider a second setting with variances equal to 0.5 for each component. Throughout the simulations, the true number of clusters is fixed at two (K = 2). At each time, observations are drawn from their current cluster distribution. Units switch clusters from time t to t + 1 with probability p, where we vary p from 0 to 0.25 across different designs. In all settings, we use T = 20 time points, N = 120 units, and 100 simulation runs. As our first-pass cross-sectional clustering algorithm, we choose a simple k-means approach, although, as stated before, our approach can also accommodate other cross-sectional clustering methods, distance definitions, and centroid measures.

The baseline simulation results are shown in Figure 5. Overall, the shape of the misclassification curve closely aligns with our analytical results in Figures 3 and 4 in Section 2. At low levels of switching in the DGP, that is, p ∈ {0, 0.01, 0.1}, our method with positive ε improves on the repeated k-means case (ε = 0). Moreover, there are clearly optimal values for ε in the misclassification plot (upper-left panel). Setting ε to these optimal values reduces misclassification errors from 16% down to 9% for both p = 0 and p = 0.01. The shrinkage approach performs worse than repeated cross-sectional clustering only for highly frequent actual switching (p = 0.25).
We do not expect this to be a major problem in practice, as our model is primarily intended for dynamic settings with only occasional switches and substantial persistence in cluster membership.

Figure 5. Simulation results for four values of p. Baseline setting.

Without knowing the true classifications, it is still striking that the GWS index peaks at about the optimal ε (lower-right panel), while the CV error flattens out around the same point (upper-right panel). The switching rate (lower-left panel) combined with the GWS index shows exactly what the approach seeks to achieve: a drastic reduction in the number of cluster switches (lower-left), without sacrificing the fit in terms of the silhouette index (lower-right). Increasing ε avoids frequent reclassification of observations on the borderline between clusters. Such observations only marginally affect the silhouette index, as it takes the distances of the observations to the nearest clusters into account. Increasing ε does, however, bring down the switching rate considerably. This may help in setting the value of ε in empirical applications: we look for values of ε that reduce the switching rate without considerable decreases in the silhouette index.

We emphasize that our baseline setting poses a major challenge to any clustering algorithm, owing to the substantial cluster overlap. In a setting with lower variances, such as Figure B.2 in Appendix B, we find much smaller misclassification rates, while still achieving reductions in misclassification rates of about 67% (from around 7.5% to around 2.5%) for p = 0.00 and 0.01. Again, the pronounced concavity and minimum of the misclassification curve reflect the theoretical results in Section 2.

Figure 5 also suggests that the CV error approach to select ε works less well. CV errors appear lowest for high values of ε that yield too much persistence in cluster membership. If ε were set based on this criterion, misclassification rates would be higher than those associated with the GWS approach. We therefore prefer the latter over the former in our empirical work in Section 4.

To see the effect of choosing the number of clusters, we extend the previous simulation setup by also letting the algorithm choose the number of clusters $K_t$ in each cross-section. We vary the number of clusters in the model from 2 to 4, whereas the true number of clusters is always 2. The results are presented in Figures B.3 and B.4 in Appendix B. The case of an unknown number of clusters, combined with a large cluster overlap, poses a substantial challenge for any clustering method. Misclassification rates are high throughout, and only for p = 0.00 and 0.01 do we observe a clear dependence on the shrinkage parameter ε. For those two cases, the reduction in misclassification is substantial, at more than 15 percentage points when the optimal penalty parameter is chosen. The GWS index points to values of ε between 0.3 and 0.6, where the sharpest declines in the GWS index occur. These values lie slightly below the optimal values for misclassification at around ε = 0.6. The CV error, in contrast, seems to flatten out at too high a value of around ε = 0.8, and thus again exhibits worse behavior. The picture is even clearer if we bring down the error variances to 0.5, reducing cluster overlap. For low values of p, the GWS index now decreases sharply after the optimal (from a misclassification perspective) value of ε has been reached.
This allows us to cut misclassification by close to 50% in a data-driven way without sacrificing much of the fit in terms of the GWS index. In contrast, applying the CV-based approach again results in too high values of ε and may therefore miss important aspects of the dynamics of the data.

Finally, to benchmark our new clustering approach, we compare it to three versions of Ward's hierarchical clustering. The first approach (Ward plain) clusters each cross-section separately and links the labels through the mapping step of Section 1.2. Second, the pooled Ward approach takes all observations of all units over time and treats them as a single cross-section of N × T separate units. Third, the time-aggregated Ward approach stacks the $x_{i,t}$ over time into a vector $x_t$ and considers each of its coordinates as one of N × D separate variables. This last approach does not allow for switches and effectively clusters the whole time series of a unit. The results are presented in Table 1 and can be compared with the left-hand curves in Figures 5 and B.2.

All benchmark approaches produce larger misclassification errors than our new penalized dynamic clustering approach. Only Ward's time-aggregated approach for small p appears to fare slightly better, but at the cost of not allowing for any switches at all. As a consequence, it produces disproportionately large misclassification rates as p increases, exceeding those of the penalized clustering approach of this article. The difference in misclassification rates is substantial: even for a large set of sub-optimal choices of ε, the new method still beats the benchmarks. For instance, in the baseline design with p set at 0 and 0.01, any choice of ε produces lower errors than either the Ward plain or the Ward pooled benchmark. For p = 0.1, the misclassification rate of our approach is only higher when ε ≥ 0.65. This suggests that even if cluster switches happen more often, a wide range of (optimal or sub-optimal) shrinkage parameters ε results in improvements over the considered benchmarks.

Table 1. Misclassification rates and switches for the benchmark models

                               Misclassification         Switching rate
Model                   p      Baseline   Half-var.      Baseline   Half-var.
Ward plain              0.00   0.182      0.101          0.391      0.285
                        0.01   0.182      0.108          0.392      0.305
                        0.10   0.183      0.119          0.408      0.348
                        0.25   0.208      0.133          0.431      0.393
Ward pooled             0.00   0.182      0.117          0.368      0.267
                        0.01   0.170      0.128          0.368      0.284
                        0.10   0.176      0.122          0.386      0.330
                        0.25   0.185      0.121          0.423      0.379
Ward time-aggregated    0.00   0.019      0.001
                        0.01   0.045      0.043
                        0.10   0.212      0.195
                        0.25   0.284      0.281

Notes: The time-aggregated setting does not allow for switches. The baseline has σ = 1, and thus large cluster overlaps. For the half-variance case, σ = 0.5, and the overlap is smaller.
4 Empirical Illustration

This section applies the clustering methodology of Section 1 to a multivariate panel of D = 4 accounting ratios for N = 28 European insurance companies over the period 2010 to 2020 (T = 11). Each year, we allocate insurers to one of $k=1,\ldots,K_t$ distinct business model (peer) groups. We proceed by first describing the data, followed by the empirical results.

4.1 Data

Our sample of N = 28 European insurance companies overlaps strongly with the set of 44 insurance companies chosen by EIOPA for its 2021 insurance sector stress test; see EIOPA (2021, Annex A). We observe annual insurer-level accounting data from InsuranceFocus (Bureau van Dijk). We start with EIOPA's selection of insurers, which together cover approximately 75% of the European Economic Area's insurance market based on total assets. We exclude companies for which a complete set of data is not available, resulting in 26 companies, before adding two large Swiss insurance companies to complement the sample (Swiss Re and Zurich Insurance). Table B.1 in Appendix B provides a listing of all firms, along with a subset of estimated cluster allocations.

We select a parsimonious set of variables to classify our selection of European insurers into broadly similar peer groups. Our choice of variables is motivated by the desire to tell apart four types of insurers: re-insurers, non-life insurers, life insurers, and financial conglomerates. The first three types are insurers that focus on a specific part of the insurance business. The fourth type is a large insurer that owns at least one sizable deposit-taking (bank) subsidiary. To allocate insurers into peer groups, we consider the following variables: insurers' (i) ratio of total reinsurance premia received over total traditional (life and non-life) insurance premia; (ii) share of life insurance premia in total premia; (iii) share of non-life insurance premia in total premia; and (iv) share of banking assets (loans and mortgages) in total assets. The first three variables are taken from the insurers' profit-and-loss ("technical accounts") statements, while the fourth variable is taken from the insurers' consolidated balance sheets. The first variable allows us to distinguish reinsurance firms from "regular" insurers. The second and third variables allow us to further subdivide regular insurers into life and non-life insurers. The fourth variable allows us to distinguish financial conglomerates.
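For concreteness, the feature construction could look as follows in pandas. The raw column names are placeholders, since the exact InsuranceFocus field names are not given here, and the precise premium denominators in (ii) and (iii) are our reading of the variable definitions.

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Construct the four clustering variables from raw accounting items.
    Columns reins_premia, life_premia, nonlife_premia, loans_mortgages,
    and total_assets are hypothetical names for the underlying data fields."""
    traditional = df["life_premia"] + df["nonlife_premia"]
    total = traditional + df["reins_premia"]
    out = pd.DataFrame(index=df.index)
    out["reins_ratio"] = df["reins_premia"] / traditional     # variable (i)
    out["life_share"] = df["life_premia"] / total             # variable (ii)
    out["nonlife_share"] = df["nonlife_premia"] / total       # variable (iii)
    out["bank_share"] = df["loans_mortgages"] / df["total_assets"]  # (iv)
    return out
```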
We rely on International Financial Reporting Standards (IFRS) accounting data, and use domestic Generally Accepted Accounting Principles (GAAP) accounting data when IFRS data are not available.

4.2 Clustering Outcomes

We first discuss the results for a fixed number of clusters K = 4, which is in line with the highest GWS index at almost all time points (see below) and with our reading of the general industry perception. As a robustness check, we also provide clustering outcomes when $K_t$ is allowed to vary between two and six, corroborating that K = 4 is an appropriate choice.

We initialize our clustering method by applying threshold rules to the first cross-section. These threshold rules divide the data into four mutually exclusive and economically interpretable clusters. Firms receiving more reinsurance premia than non-reinsurance premia are allocated to cluster 1 ("reinsurance"). Non-reinsurance firms receiving more than half of their total premium income from life contracts are allocated to cluster 2 ("life"). Non-reinsurance firms receiving most premium income from non-life insurance are allocated to cluster 3 ("non-life"). Firms exhibiting banking assets (total loans and mortgages) of more than a third of total assets are allocated to cluster 4 ("conglomerate"), potentially overriding the other splits. This approach allocates each firm uniquely to one of the four clusters; a code sketch follows below. We then use our dynamic clustering method, in conjunction with k-means clustering, to allocate the remaining cross-sections conditional on the initial allocation. Our initialization approach has no effect on subsequent cross-sections when no shrinkage is imposed (ε = 0). In that case, the clustering outcomes quickly revert to the outcomes implied by independent k-means clustering of each cross-section in isolation. The higher the amount of shrinkage, however, the stickier and the more important the initialization.
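A sketch of this threshold initialization, assuming the four features are ordered as in Section 4.1; the thresholds follow the rules above, while the exact definition of the premium shares is our reading of the text.

```python
import numpy as np

def initial_clusters(f):
    """Threshold initialization for t = 1. f is an (N, 4) array with columns
    [reins_ratio, life_share, nonlife_share, bank_share]. Returns labels
    1 = reinsurance, 2 = life, 3 = non-life, 4 = conglomerate."""
    labels = np.where(f[:, 0] > 1.0, 1,                # more reinsurance than
                      np.where(f[:, 1] > 0.5, 2, 3))   # traditional premia
    labels = np.where(f[:, 3] > 1 / 3, 4, labels)      # banking-asset override
    return labels
```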
Figure 6 presents clustering diagnostics as a function of the shrinkage parameter ε. Our goal is to decrease the number of incidental switches (flickering) while retaining a high overall fit to the data. Figure 6 allows us to compare the GWS index (our measure of fit, in the bottom panel) to the number of switches (in the top panel) associated with each value of ε. The bottom panel of Figure 6 indicates that there is a local maximum in fit at ε = 0.45, coinciding with a low number of cluster switches.4 After ε = 0.45, the fit decreases sharply. We therefore choose ε = 0.45 for the remainder of the analysis based on K = 4.

Figure 6. Clustering diagnostics as a function of the shrinkage parameter ε. Top panel: total number of cluster switches. Bottom panel: average GWS index.

Our clustering approach leads to stable cluster allocations over time. Figure 7(a) summarizes our cluster allocation outcomes for K ≡ 4 and ε = 0.45. Each column refers to one cluster indexed by $k=1,\ldots,4$. Each row denotes one year between 2010 and 2020. Two cluster transitions are indicated by arrows. The four clusters contain 3, 8, 15, and 2 members, respectively, most of the time, with slight variation in membership only across the last three groups. Traditional non-life and life insurers are the most frequently observed (popular) business models in our sample, ahead of re-insurers and financial conglomerates. The labels given to each cluster correspond closely with what an inspection of the empirical cluster centroids (means) would suggest.

Figure 7(b) plots the time-varying cluster means for all the variables contained in $x_{i,t}$. The first cluster is characterized by a large ratio of reinsurance premia to life and non-life premia. The second and third clusters are characterized by large ratios of life and non-life premia to total non-reinsurance premia, respectively. The fourth cluster is characterized by a substantial ratio of banking assets to total assets.

Figure 7. Clustering composition and transitions and cluster means for K ≡ 4 and ε = 0.45. Each column in the left-hand panel refers to one cluster indexed by $k=1,\ldots,4$. Each row in the same panel denotes one year between 2010 and 2020, while an arrow represents a transition across clusters. The right-hand panel shows the evolution of the cluster means over time.

Figure 8(a) summarizes the cluster allocation outcomes for K ≡ 4 and ε = 0. The clustering outcomes are visibly more volatile, and much harder to interpret economically, if no shrinkage is imposed to link the cross-sections over time. Two outcomes are worth noting. First, the reinsurance cluster now shrinks in membership early in the sample (in 2012), from three members to only one. This can be traced to the first variable being substantially higher for one firm (Swiss Re) than for the other two reinsurance firms (Munich Re and Hannover Re). The fact that the two migrating firms carry "Re" in their names suggests that these transitions may not be readily interpretable. Imposing shrinkage removes these transitions; cf. Figure 7(a). Second, there is some noticeable back and forth between the life and non-life clusters. This can be traced back to a few insurers that engage in both life and non-life business, with the precise split between the two being subject to accounting windfalls and other one-off effects (similar to the setting in Figure 1). Such "middle-of-the-road" or "multi-line" insurers are relatively harder to cluster. Imposing shrinkage prevents these firms from flickering back and forth between the life and non-life clusters, yielding more stable clustering outcomes and, in turn, enhancing economic interpretability.

Figure 8. Clustering composition and transitions for K ≡ 4 and ε = 0.00 (panel a) as well as $K_t \in \{2,\ldots,6\}$ and ε = 0.55 (panel b). Each column in the figures refers to one cluster. Each row denotes one year between 2010 and 2020. Each arrow represents a transition across clusters.

Finally, we allow $K_t \in \{2,3,4,5,6\}$ to vary over time. As indicated by Algorithm 1, $K_t$ can be chosen to maximize the local time-t silhouette index. We continue to start our clustering algorithm at $K_t = 4$ for t = 1, and increase the amount of shrinkage slightly to ε = 0.55 to balance the additional source of instability, trading off goodness-of-fit against clustering stability as before. Figure 8(b) presents the clustering outcomes. We note two features. First, four clusters are selected almost always, even though $K_t$ is allowed to vary. $K_t$ = 2, $K_t$ = 5, and $K_t$ = 6 are never selected, and $K_t$ = 3 is selected only twice. This supports our initial choice of K = 4. Second, the fourth ("conglomerate") cluster appears to merge with the second ("life") cluster late in the sample (in 2019 and 2020). This can be traced back to the two cluster means moving closer together at that time. Whether this corresponds to a permanent "structural" feature of our data going forward is currently unclear and left for future research.
Finally, we observe that choosing ε > 0 at a moderately high value allows us to obtain stable clustering results for multivariate panel data both when $K_t$ is time-invariant and when it is time-varying.

5 Conclusion

In this article, we propose a new approach to clustering in a panel setting, allowing for dynamics in the cluster locations, cluster composition, and number of clusters, while ensuring stability and persistence of assignments via a shrinkage penalty parameter. The method is widely applicable and extends to any cross-sectional clustering algorithm that produces a distance measure, including, for instance, k-means, k-medians, or hierarchical clustering. We show how the penalty parameter can be chosen in a data-driven way with a simple weighted version of the well-known silhouette index. We also show analytically and in simulations that selecting a strictly positive shrinkage parameter helps to reduce misclassification in empirically relevant conditions. An application to business models in the European insurance sector underlines the usefulness of our method in balancing flexibility, that is, allowing for cluster transitions, against penalizing excessive back-and-forth switching between clusters in economic settings.

Appendix A: Proofs of Propositions

This appendix presents the proofs of the propositions in Section 2. Consider a univariate data generating process, where the observation $x_t$ depends on its cluster center $c_t$, given by $x_t = c_t + \eta_t$, where $\eta_t$ has cdf $F(\eta_t)$. The cluster center $c_t$ follows a Markov chain with transition probability p. In this setting, our clustering methodology can be written as

$$ \hat c_t^{\varepsilon} = \hat c^{\varepsilon}(x_t \mid \hat c_{t-1}) = \begin{cases} 1 & \text{if } x_t(1-\varepsilon) + \varepsilon \hat c_{t-1} > 1/2, \\ 0 & \text{otherwise.} \end{cases} \tag{A.1} $$

We call this the ε-classifier. Before discussing the results of Section 2, we present Lemma 1, which will be useful below.

Lemma 1: Given three real numbers x, a, and b,

$$ |a+x| < |b+x| \iff \begin{cases} x > -(a+b)/2 & \text{if } b > a, \\ x < -(a+b)/2 & \text{if } b < a, \end{cases} $$

and $|a+x| = |b+x|$ if a = b or $x = -(a+b)/2$.

Proof: Assume b < a and $|a+x| < |b+x|$. Then $b+x < a+x \leq |a+x| < |b+x|$, so $b+x < |b+x|$, which implies $b+x < 0$ and hence $|a+x| < -(b+x)$. If $a+x \leq 0$, then $x \leq -a < -(a+b)/2$ because $b < a$. If $a+x > 0$, then $a+x < -(b+x)$, that is, $x < -(a+b)/2$. Similarly, assume b > a and $|a+x| < |b+x|$. If $b+x < 0$, then both $a+x$ and $b+x$ would be negative and we would have $|a+x| > |b+x|$; so $b+x > 0$ and $|a+x| < b+x$. If $a+x \geq 0$, then $x \geq -a > -(a+b)/2$ because $b > a$. If $a+x < 0$, then $-a-x < b+x$, that is, $x > -(a+b)/2$. □

In our first result, we define the one-step-ahead misclassification rate as the misclassification probability at t given the information of the true cluster assignment at t−1. Recall Proposition 1 from Section 2:

Proposition 1: Given $c_{t-1}$, the one-step-ahead misclassification probability of the ε-classifier is

$$ P(\hat c_t^{\varepsilon} \neq c_t \mid c_{t-1}) = F\!\left(\frac{\varepsilon - 1/2}{1-\varepsilon}\right) p + F\!\left(\frac{-1/2}{1-\varepsilon}\right)(1-p), \tag{9} $$

where F denotes the cdf of $\eta_t$.

Proof: First, decompose the misclassification probability into a case where a switch happens between t−1 and t, and a case where it does not:

$$ \begin{aligned} P(\hat c_t^{\varepsilon} \neq c_t \mid c_{t-1}) &= P(\hat c_t^{\varepsilon} = c_{t-1} \mid c_t \neq c_{t-1})\, P(c_t \neq c_{t-1}) + P(\hat c_t^{\varepsilon} \neq c_{t-1} \mid c_t = c_{t-1})\, P(c_t = c_{t-1}) \\ &= P(\hat c_t^{\varepsilon} = c_{t-1} \mid c_t \neq c_{t-1})\, p + P(\hat c_t^{\varepsilon} \neq c_{t-1} \mid c_t = c_{t-1})\,(1-p). \end{aligned} \tag{A.2} $$

This expression is composed of two probabilities: the error given a switch, $P(\hat c_t^{\varepsilon} = c_{t-1} \mid c_t \neq c_{t-1})$, and the error given no switch, $P(\hat c_t^{\varepsilon} \neq c_{t-1} \mid c_t = c_{t-1})$. We split the proof in two parts, each calculating one of these two probabilities.

Calculation of the probability of error given a switch, $P(\hat c_t^{\varepsilon} = c_{t-1} \mid c_t \neq c_{t-1})$. Recalling our definition of $\hat c_t^{\varepsilon}$ in Equation (A.1), we can write the two conditional probabilities in terms of the distance between $x_t$ and the centers, and then in terms of the noise.
Before discussing the results of Section 2, we present Lemma 1, which will be useful below.

Lemma 1. Given three real numbers x, a, and b:
$$|a+x| < |b+x| \iff \begin{cases} x > -(a+b)/2 & \text{if } b > a, \\ x < -(a+b)/2 & \text{if } b < a, \end{cases}$$
and $|a+x| = |b+x|$ if $a = b$ or $x = -(a+b)/2$.

Proof. Assume $b < a$ and $|a+x| < |b+x|$. Then $b+x < a+x \le |a+x| < |b+x|$, so $b+x < |b+x|$, which implies $b+x < 0$ and hence $|a+x| < -(b+x)$. If $a+x \ge 0$, this reads $a+x < -(b+x)$, that is, $x < -(a+b)/2$. If $a+x < 0$, then $x < -a < -(a+b)/2$, where the second inequality uses $b < a$. So in both cases $x < -(a+b)/2$.

Similarly, assume $b > a$ and $|a+x| < |b+x|$. If $b+x < 0$, then $a+x < b+x < 0$ would give $|a+x| > |b+x|$, a contradiction; so $b+x > 0$ and $|a+x| < b+x$. If $a+x < 0$, this reads $-(a+x) < b+x$, that is, $x > -(a+b)/2$. If $a+x \ge 0$, then $x \ge -a > -(a+b)/2$, where the second inequality uses $b > a$. So in both cases $x > -(a+b)/2$. The equality case follows from $(a+x)^2 = (b+x)^2 \iff (a-b)(a+b+2x) = 0$, and the converse implications hold because the three relations on each side are mutually exclusive and exhaustive. □

In our first result, we define the one-step-ahead misclassification rate as the misclassification probability at time t given the true cluster assignment at t−1; the classifier is accordingly evaluated at $\hat{c}_{t-1} = c_{t-1}$. Recall Proposition 1 from Section 2:

Proposition 1. Given $c_{t-1}$, the one-step-ahead misclassification probability of the ε-classifier is
$$P(\hat{c}_t^{\varepsilon} \ne c_t \mid c_{t-1}) = F\!\left(\frac{\varepsilon - 1/2}{1-\varepsilon}\right) p + F\!\left(\frac{-1/2}{1-\varepsilon}\right)(1-p), \tag{9}$$
where F denotes the cdf of $\eta_t$.

Proof. First, decompose the misclassification probability into a case where a switch happens between t−1 and t, and a case where it does not:
$$\begin{aligned} P(\hat{c}_t^{\varepsilon} \ne c_t \mid c_{t-1}) &= P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1})\,P(c_t \ne c_{t-1}) + P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1})\,P(c_t = c_{t-1}) \\ &= P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1})\,p + P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1})\,(1-p). \end{aligned} \tag{A.2}$$
Equation (A.2) is composed of two probabilities: the error given a switch, $P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1})$, and the error given no switch, $P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1})$. We split the proof into two parts, each calculating one of these probabilities.

Calculation of the probability of error given a switch. Recalling our definition of $\hat{c}_t^{\varepsilon}$ in Equation (A.1), we can write the conditional probability in terms of the distance between the shrunk observation and the two centers, and then in terms of the noise:
$$\begin{aligned} P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) &= P\big(|x_t(1-\varepsilon)+\varepsilon c_{t-1} - c_{t-1}| < |x_t(1-\varepsilon)+\varepsilon c_{t-1} - (1-c_{t-1})| \,\big|\, c_t \ne c_{t-1}\big) \\ &= P\big(|(c_t+\eta_t)(1-\varepsilon)+\varepsilon c_{t-1} - c_{t-1}| < |(c_t+\eta_t)(1-\varepsilon)+\varepsilon c_{t-1} - (1-c_{t-1})| \,\big|\, c_t \ne c_{t-1}\big) \\ &= P\big(|(1-2c_{t-1}+\eta_t)(1-\varepsilon)| < |(1-2c_{t-1}+\eta_t)(1-\varepsilon) - (1-2c_{t-1})| \,\big|\, c_t \ne c_{t-1}\big), \end{aligned}$$
where the last step substitutes $c_t = 1-c_{t-1}$. Dividing both sides by $1-\varepsilon > 0$,
$$P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) = P\big(|1-2c_{t-1}+\eta_t| < |1-2c_{t-1}+\eta_t + (2c_{t-1}-1)/(1-\varepsilon)| \,\big|\, c_t \ne c_{t-1}\big). \tag{A.3}$$
Applying Lemma 1, we can dispense with the absolute values. Write Equation (A.3) in terms of a and b:
$$a = 1-2c_{t-1}, \qquad b = 1-2c_{t-1} + (2c_{t-1}-1)/(1-\varepsilon).$$
Now check whether a < b or a > b:
$$a < b \iff 0 < (2c_{t-1}-1)/(1-\varepsilon) \iff c_{t-1} = 1, \qquad a > b \iff c_{t-1} = 0.$$
So we have two cases depending on the true cluster at t−1. Applying Lemma 1 to Equation (A.3),
$$P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) = \begin{cases} P\big(\eta_t > -\tfrac{1}{2}(a+b) \,\big|\, c_t \ne c_{t-1}\big) & \text{if } c_{t-1}=1, \\ P\big(\eta_t < -\tfrac{1}{2}(a+b) \,\big|\, c_t \ne c_{t-1}\big) & \text{if } c_{t-1}=0. \end{cases}$$
Calculating the term $-\tfrac{1}{2}(a+b)$ for each case,
$$-\tfrac{1}{2}(a+b) = 2c_{t-1}-1 + \frac{1/2-c_{t-1}}{1-\varepsilon} = \begin{cases} \dfrac{1/2-\varepsilon}{1-\varepsilon} & \text{if } c_{t-1}=1, \\[4pt] \dfrac{\varepsilon-1/2}{1-\varepsilon} & \text{if } c_{t-1}=0. \end{cases}$$
Substituting in each case and using F, the cdf of $\eta_t$,
$$P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) = \begin{cases} 1 - F\big((1/2-\varepsilon)/(1-\varepsilon)\big) & \text{if } c_{t-1}=1, \\ F\big((\varepsilon-1/2)/(1-\varepsilon)\big) & \text{if } c_{t-1}=0, \end{cases}$$
so that, by the symmetry of F, in both cases
$$P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) = F\!\left(\frac{\varepsilon-1/2}{1-\varepsilon}\right). \tag{A.4}$$

Calculation of the probability of error given no switch. We follow the same steps as for the probability of error given a switch. For the second conditional probability in Equation (A.2),
$$\begin{aligned} P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1}) &= P\big(|x_t(1-\varepsilon)+\varepsilon c_{t-1} - c_{t-1}| > |x_t(1-\varepsilon)+\varepsilon c_{t-1} - (1-c_{t-1})| \,\big|\, c_t = c_{t-1}\big) \\ &= P\big(|\eta_t(1-\varepsilon)| > |\eta_t(1-\varepsilon) - (1-2c_{t-1})| \,\big|\, c_t = c_{t-1}\big), \end{aligned}$$
where the second step substitutes $c_t = c_{t-1}$. Dividing by $1-\varepsilon$,
$$P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1}) = P\big(|\eta_t| > |\eta_t + (2c_{t-1}-1)/(1-\varepsilon)| \,\big|\, c_t = c_{t-1}\big). \tag{A.5}$$
Again we apply Lemma 1 so that we can dispense with the absolute values, now with
$$a = 0, \qquad b = (2c_{t-1}-1)/(1-\varepsilon),$$
so that $a < b \iff c_{t-1}=1$, $a > b \iff c_{t-1}=0$, and $-\tfrac{1}{2}(a+b) = (1/2-c_{t-1})/(1-\varepsilon)$. Since the event in Equation (A.5) is $|b+\eta_t| < |a+\eta_t|$, Lemma 1 yields
$$P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1}) = \begin{cases} P\big(\eta_t > (1/2)/(1-\varepsilon) \,\big|\, c_t = c_{t-1}\big) & \text{if } c_{t-1}=0, \\ P\big(\eta_t < -(1/2)/(1-\varepsilon) \,\big|\, c_t = c_{t-1}\big) & \text{if } c_{t-1}=1, \end{cases} = \begin{cases} 1 - F\big((1/2)/(1-\varepsilon)\big) & \text{if } c_{t-1}=0, \\ F\big(-(1/2)/(1-\varepsilon)\big) & \text{if } c_{t-1}=1. \end{cases}$$
Finally, using the symmetry of F,
$$P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1}) = F\!\left(\frac{-1/2}{1-\varepsilon}\right). \tag{A.6}$$
We conclude the proof by substituting Equations (A.4) and (A.6) into Equation (A.2), which yields Equation (9). □
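As a sanity check (ours, not the paper's), the closed form in Equation (9) can be compared with a Monte Carlo estimate in which the classifier is evaluated at a correct previous assignment. Parameter values below are illustrative assumptions.

```python
# Numerical check of Equation (9): analytical one-step-ahead misclassification
# probability versus a Monte Carlo estimate with hat{c}_{t-1} = c_{t-1}.
import numpy as np
from scipy.stats import norm

p, sigma, eps = 0.05, 0.4, 0.4
F = lambda z: norm.cdf(z, scale=sigma)     # cdf of eta_t ~ N(0, sigma^2)

analytical = F((eps - 0.5) / (1 - eps)) * p + F(-0.5 / (1 - eps)) * (1 - p)

rng = np.random.default_rng(1)
n = 1_000_000
c_prev = rng.integers(0, 2, n)                          # true cluster at t-1
c_t = np.where(rng.random(n) < p, 1 - c_prev, c_prev)   # true cluster at t
x_t = c_t + sigma * rng.standard_normal(n)
c_hat = (x_t * (1 - eps) + eps * c_prev > 0.5).astype(int)

print("analytical:", analytical, " Monte Carlo:", np.mean(c_hat != c_t))
```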
Next, we prove Proposition 2. Recall that it states:

Proposition 2. If $\eta_t \sim \text{iid } N(0,\sigma^2)$, the value $\varepsilon^*$ which minimizes the misclassification rate of Equation (9) for $0 < p < \tfrac{1}{2}$ is
$$\varepsilon^* = \frac{2\sigma^2 \log\!\big(\tfrac{p}{1-p}\big)}{2\sigma^2 \log\!\big(\tfrac{p}{1-p}\big) - 1}. \tag{10}$$

Proof. The proof is a straightforward minimization of Equation (9) with $\eta_t \sim N(0,\sigma^2)$:
$$\min_{\varepsilon}\, P(\hat{c}_t^{\varepsilon} \ne c_t \mid c_{t-1}) = \min_{\varepsilon}\, F\!\left(\frac{\varepsilon-1/2}{1-\varepsilon}\right) p + F\!\left(\frac{-1/2}{1-\varepsilon}\right)(1-p).$$
Taking the derivative, and letting f denote the Gaussian pdf,
$$\frac{\partial P(\cdot)}{\partial \varepsilon} = p\, f\!\left(\frac{\varepsilon-1/2}{1-\varepsilon}\right) \frac{1}{2(1-\varepsilon)^2} - (1-p)\, f\!\left(\frac{-1/2}{1-\varepsilon}\right) \frac{1}{2(1-\varepsilon)^2}.$$
Setting it to zero and canceling the common factor $\tfrac{1}{2}(1-\varepsilon^*)^{-2}$,
$$0 = p\, \exp\!\left(-\frac{(\varepsilon^*-1/2)^2}{2(1-\varepsilon^*)^2\sigma^2}\right) - (1-p)\, \exp\!\left(-\frac{1}{8(1-\varepsilon^*)^2\sigma^2}\right).$$
Taking logarithms and rearranging,
$$\begin{aligned} \frac{1}{8(1-\varepsilon^*)^2\sigma^2} &= -\log\!\left(\frac{p}{1-p}\right) + \frac{(\varepsilon^*-1/2)^2}{2(1-\varepsilon^*)^2\sigma^2}, \\ \frac{1/4 - (\varepsilon^*-1/2)^2}{(1-\varepsilon^*)^2\sigma^2} &= -2\log\!\left(\frac{p}{1-p}\right), \\ \frac{\varepsilon^*(1-\varepsilon^*)}{(1-\varepsilon^*)^2\sigma^2} &= -2\log\!\left(\frac{p}{1-p}\right), \\ \frac{\varepsilon^*}{1-\varepsilon^*} &= -2\sigma^2\log\!\left(\frac{p}{1-p}\right), \end{aligned}$$
and solving for $\varepsilon^*$ yields Equation (10). Note that $\log(p/(1-p)) < 0$ for $p < \tfrac{1}{2}$, so that $\varepsilon^* \in (0,1)$. □
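A short numerical illustration (with assumed parameter values) evaluates $\varepsilon^*$ from Equation (10) and confirms on a grid that it minimizes the one-step-ahead misclassification rate of Equation (9):

```python
# Evaluate epsilon* from Equation (10) and confirm on a grid that it minimizes
# the misclassification rate of Equation (9). Parameter values are assumed.
import numpy as np
from scipy.stats import norm

p, sigma = 0.05, 0.4
log_odds = np.log(p / (1 - p))                       # negative for p < 1/2
eps_star = 2 * sigma**2 * log_odds / (2 * sigma**2 * log_odds - 1)

def misclass(eps):
    F = lambda z: norm.cdf(z, scale=sigma)
    return F((eps - 0.5) / (1 - eps)) * p + F(-0.5 / (1 - eps)) * (1 - p)

grid = np.linspace(0.0, 0.95, 951)
print("eps* =", eps_star)
print("argmin on grid:", grid[int(np.argmin([misclass(e) for e in grid]))])
```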
Corollary 1 extends Proposition 1 to the case of a mean-reverting process.

Corollary 1. If $x_t$ follows the dynamics
$$x_t = c_t + \beta(x_{t-1} - c_t) + \eta_t,$$
then the one-step-ahead probability of error of the ε-classifier is
$$\begin{aligned} P(\hat{c}_t^{\varepsilon} \ne c_t \mid c_{t-1}) &= F\!\left(\frac{\varepsilon-1/2}{1-\varepsilon} + \beta\big(x_{t-1}(2c_{t-1}-1) + 1 - c_{t-1}\big)\right) p \\ &\quad + F\!\left(\frac{-1/2}{1-\varepsilon} + \beta\big(x_{t-1}(1-2c_{t-1}) + c_{t-1}\big)\right)(1-p), \end{aligned} \tag{11}$$
where F denotes the cdf of $\eta_t$.

Proof. This proof follows closely that of Proposition 1. As before, first decompose the misclassification probability into a case where a switch happens and a case where it does not:
$$P(\hat{c}_t^{\varepsilon} \ne c_t \mid c_{t-1}) = P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1})\,p + P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1})\,(1-p). \tag{A.7}$$
We again split the proof into two parts, each calculating one of these probabilities.

Calculation of the probability of error given a switch. Substituting $x_t = c_t + \beta(x_{t-1}-c_t) + \eta_t$ with $c_t = 1-c_{t-1}$ into the definition of $\hat{c}_t^{\varepsilon}$ in Equation (A.1), and proceeding as in the proof of Proposition 1,
$$\begin{aligned} P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) = P\big(&|1-2c_{t-1}+\beta(x_{t-1}-1+c_{t-1})+\eta_t| \\ &< |1-2c_{t-1}+\beta(x_{t-1}-1+c_{t-1})+\eta_t + (2c_{t-1}-1)/(1-\varepsilon)| \,\big|\, c_t \ne c_{t-1}\big). \end{aligned} \tag{A.8}$$
Applying Lemma 1 so that we can dispense with the absolute values, write Equation (A.8) in terms of a and b:
$$a = 1-2c_{t-1}+\beta(x_{t-1}-1+c_{t-1}), \qquad b = a + (2c_{t-1}-1)/(1-\varepsilon).$$
Since the β terms cancel in the comparison, we again have $a < b \iff c_{t-1}=1$ and $a > b \iff c_{t-1}=0$. Calculating $-\tfrac{1}{2}(a+b)$ for each case,
$$-\tfrac{1}{2}(a+b) = 2c_{t-1}-1 - \beta(x_{t-1}-1+c_{t-1}) + \frac{1/2-c_{t-1}}{1-\varepsilon} = \begin{cases} -\beta x_{t-1} + \dfrac{1/2-\varepsilon}{1-\varepsilon} & \text{if } c_{t-1}=1, \\[4pt] -\beta(x_{t-1}-1) + \dfrac{\varepsilon-1/2}{1-\varepsilon} & \text{if } c_{t-1}=0. \end{cases}$$
Substituting in each case, we have
$$P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) = \begin{cases} P\big(\eta_t > -\beta x_{t-1} + (1/2-\varepsilon)/(1-\varepsilon) \,\big|\, c_t \ne c_{t-1}\big) & \text{if } c_{t-1}=1, \\ P\big(\eta_t < -\beta(x_{t-1}-1) + (\varepsilon-1/2)/(1-\varepsilon) \,\big|\, c_t \ne c_{t-1}\big) & \text{if } c_{t-1}=0. \end{cases}$$
Using F, the cdf of $\eta_t$, and its symmetry,
$$P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) = \begin{cases} 1 - F\big(-\beta x_{t-1} + (1/2-\varepsilon)/(1-\varepsilon)\big) & \text{if } c_{t-1}=1, \\ F\big(-\beta(x_{t-1}-1) + (\varepsilon-1/2)/(1-\varepsilon)\big) & \text{if } c_{t-1}=0, \end{cases}$$
which we can write more compactly as
$$P(\hat{c}_t^{\varepsilon} = c_{t-1} \mid c_t \ne c_{t-1}) = F\!\left(\beta\big(x_{t-1}(2c_{t-1}-1)+1-c_{t-1}\big) + \frac{\varepsilon-1/2}{1-\varepsilon}\right). \tag{A.9}$$

Calculation of the probability of error given no switch. We follow the same steps as for the probability of error given a switch. Substituting $c_t = c_{t-1}$ and proceeding as before,
$$P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1}) = P\big(|\beta(x_{t-1}-c_{t-1})+\eta_t| > |\beta(x_{t-1}-c_{t-1})+\eta_t + (2c_{t-1}-1)/(1-\varepsilon)| \,\big|\, c_t = c_{t-1}\big).$$
Again applying Lemma 1 with
$$a = \beta(x_{t-1}-c_{t-1}), \qquad b = a + (2c_{t-1}-1)/(1-\varepsilon),$$
we have $a < b \iff c_{t-1}=1$, $a > b \iff c_{t-1}=0$, and $-\tfrac{1}{2}(a+b) = -\beta(x_{t-1}-c_{t-1}) + (1/2-c_{t-1})/(1-\varepsilon)$. Then,
$$P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1}) = \begin{cases} P\big(\eta_t > -\beta x_{t-1} + (1/2)/(1-\varepsilon) \,\big|\, c_t = c_{t-1}\big) & \text{if } c_{t-1}=0, \\ P\big(\eta_t < -\beta(x_{t-1}-1) - (1/2)/(1-\varepsilon) \,\big|\, c_t = c_{t-1}\big) & \text{if } c_{t-1}=1, \end{cases} = \begin{cases} 1 - F\big(-\beta x_{t-1} + (1/2)/(1-\varepsilon)\big) & \text{if } c_{t-1}=0, \\ F\big(-\beta(x_{t-1}-1) - (1/2)/(1-\varepsilon)\big) & \text{if } c_{t-1}=1. \end{cases}$$
Finally, using the symmetry of F,
$$P(\hat{c}_t^{\varepsilon} \ne c_{t-1} \mid c_t = c_{t-1}) = F\!\left(\beta\big(x_{t-1}(1-2c_{t-1})+c_{t-1}\big) - \frac{1/2}{1-\varepsilon}\right). \tag{A.10}$$
We conclude the proof by substituting Equations (A.9) and (A.10) into Equation (A.7), which yields Equation (11). □
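In the spirit of Figure B.1 in Appendix B, the following sketch traces Equation (11) over ε for a mean-reverting process; the values of β, σ, p, $x_{t-1}$, and $c_{t-1}$ are assumptions for illustration only.

```python
# Trace the one-step-ahead misclassification rate of Equation (11) over epsilon
# for a mean-reverting process. All parameter values are illustrative.
import numpy as np
from scipy.stats import norm

p, sigma, beta, x_prev, c_prev = 0.05, 0.4, 0.3, 0.2, 0
F = lambda z: norm.cdf(z, scale=sigma)

def misclass(eps):
    a = beta * (x_prev * (2 * c_prev - 1) + 1 - c_prev)  # switch shift
    b = beta * (x_prev * (1 - 2 * c_prev) + c_prev)      # no-switch shift
    return (F((eps - 0.5) / (1 - eps) + a) * p
            + F(-0.5 / (1 - eps) + b) * (1 - p))

grid = np.linspace(0.0, 0.9, 181)
rates = [misclass(e) for e in grid]
print("minimizing eps on grid:", grid[int(np.argmin(rates))])
```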
Proposition 3 states the probabilities of correct classification in a recursive form.

Proposition 3. The conditional correct classification probabilities $q_t$ follow the recursion
$$q_{t+1} = \begin{pmatrix} (z_{00}-z_{10})(1-p)\dfrac{\pi_{0,t}}{\pi_{0,t+1}} & (z_{10}-z_{00})\,p\,\dfrac{\pi_{1,t}}{\pi_{0,t+1}} \\[6pt] (z_{01}-z_{11})\,p\,\dfrac{\pi_{0,t}}{\pi_{1,t+1}} & (z_{11}-z_{01})(1-p)\dfrac{\pi_{1,t}}{\pi_{1,t+1}} \end{pmatrix} q_t + \begin{pmatrix} z_{00}\,p\,\dfrac{\pi_{1,t}}{\pi_{0,t+1}} + z_{10}(1-p)\dfrac{\pi_{0,t}}{\pi_{0,t+1}} \\[6pt] z_{01}(1-p)\dfrac{\pi_{1,t}}{\pi_{1,t+1}} + z_{11}\,p\,\dfrac{\pi_{0,t}}{\pi_{1,t+1}} \end{pmatrix}, \tag{12}$$
where
$$z_{i0} = F\!\left(\frac{1/2 - i\varepsilon}{1-\varepsilon}\right), \qquad z_{i1} = 1 - F\!\left(\frac{1/2 - i\varepsilon}{1-\varepsilon} - 1\right).$$

Proof. Define the marginal probability of the true state as $\pi_{i,t} = P(c_t = i)$, and define the conditional correct classification probabilities
$$q_t = \begin{pmatrix} q_{0,t} \\ q_{1,t} \end{pmatrix} = \begin{pmatrix} P(\hat{c}_t^{\varepsilon} = 0 \mid c_t = 0) \\ P(\hat{c}_t^{\varepsilon} = 1 \mid c_t = 1) \end{pmatrix},$$
so that the correct classification probability is given by $q_{0,t}\pi_{0,t} + q_{1,t}\pi_{1,t}$. Finally, note that, given $\hat{c}_t^{\varepsilon} = i$, the classifier assigns $\hat{c}_{t+1}^{\varepsilon} = 0$ if and only if $x_{t+1} \le (1/2 - i\varepsilon)/(1-\varepsilon)$, so that
$$z_{i0} = P(\hat{c}_{t+1}^{\varepsilon}=0 \mid \hat{c}_t^{\varepsilon}=i,\, c_{t+1}=0), \qquad z_{i1} = P(\hat{c}_{t+1}^{\varepsilon}=1 \mid \hat{c}_t^{\varepsilon}=i,\, c_{t+1}=1).$$
We split the proof into the calculation of $q_{0,t+1}$ and $q_{1,t+1}$. Conditioning on the previous assignment,
$$\begin{aligned} q_{0,t+1} &= P(\hat{c}_{t+1}^{\varepsilon}=0 \mid c_{t+1}=0) \\ &= P(\hat{c}_{t+1}^{\varepsilon}=0 \mid \hat{c}_t^{\varepsilon}=0, c_{t+1}=0)\,P(\hat{c}_t^{\varepsilon}=0 \mid c_{t+1}=0) + P(\hat{c}_{t+1}^{\varepsilon}=0 \mid \hat{c}_t^{\varepsilon}=1, c_{t+1}=0)\,P(\hat{c}_t^{\varepsilon}=1 \mid c_{t+1}=0) \\ &= z_{00}\,P(\hat{c}_t^{\varepsilon}=0 \mid c_{t+1}=0) + z_{10}\,P(\hat{c}_t^{\varepsilon}=1 \mid c_{t+1}=0). \end{aligned}$$
Decomposing each term over the true state $c_t$, and using that $P(\hat{c}_t^{\varepsilon} \mid c_{t+1}, c_t) = P(\hat{c}_t^{\varepsilon} \mid c_t)$,
$$\begin{aligned} q_{0,t+1} &= \frac{z_{00}}{\pi_{0,t+1}}\Big[P(c_{t+1}=0 \mid c_t=0)\,P(\hat{c}_t^{\varepsilon}=0 \mid c_t=0)\,\pi_{0,t} + P(c_{t+1}=0 \mid c_t=1)\,P(\hat{c}_t^{\varepsilon}=0 \mid c_t=1)\,\pi_{1,t}\Big] \\ &\quad + \frac{z_{10}}{\pi_{0,t+1}}\Big[P(c_{t+1}=0 \mid c_t=0)\,P(\hat{c}_t^{\varepsilon}=1 \mid c_t=0)\,\pi_{0,t} + P(c_{t+1}=0 \mid c_t=1)\,P(\hat{c}_t^{\varepsilon}=1 \mid c_t=1)\,\pi_{1,t}\Big] \\ &= \frac{z_{00}}{\pi_{0,t+1}}(1-p)\,q_{0,t}\,\pi_{0,t} + \frac{z_{00}}{\pi_{0,t+1}}\,p\,(1-q_{1,t})\,\pi_{1,t} + \frac{z_{10}}{\pi_{0,t+1}}(1-p)(1-q_{0,t})\,\pi_{0,t} + \frac{z_{10}}{\pi_{0,t+1}}\,p\,q_{1,t}\,\pi_{1,t}. \end{aligned}$$
Analogously,
$$\begin{aligned} q_{1,t+1} &= z_{01}\,P(\hat{c}_t^{\varepsilon}=0 \mid c_{t+1}=1) + z_{11}\,P(\hat{c}_t^{\varepsilon}=1 \mid c_{t+1}=1) \\ &= \frac{z_{01}}{\pi_{1,t+1}}\,p\,q_{0,t}\,\pi_{0,t} + \frac{z_{01}}{\pi_{1,t+1}}(1-p)(1-q_{1,t})\,\pi_{1,t} + \frac{z_{11}}{\pi_{1,t+1}}\,p\,(1-q_{0,t})\,\pi_{0,t} + \frac{z_{11}}{\pi_{1,t+1}}(1-p)\,q_{1,t}\,\pi_{1,t}. \end{aligned}$$
Collecting the terms in $q_{0,t}$ and $q_{1,t}$, and putting $q_{0,t+1}$ and $q_{1,t+1}$ together in a system of equations, yields Equation (12). □

Corollary 2. The limiting probabilities of correct classification q for a symmetric Markov chain, $P(c_t=1 \mid c_{t-1}=0) = P(c_t=0 \mid c_{t-1}=1) = p$, are
$$q = \begin{pmatrix} 1-(1-p)(z_{00}-z_{10}) & p\,(z_{00}-z_{10}) \\ p\,(z_{11}-z_{01}) & 1-(1-p)(z_{11}-z_{01}) \end{pmatrix}^{-1} \begin{pmatrix} z_{10} + p\,(z_{00}-z_{10}) \\ z_{01} + p\,(z_{11}-z_{01}) \end{pmatrix}.$$
The corresponding limiting misclassification probability is
$$\lim_{t\to\infty} P(\hat{c}_t^{\varepsilon} \ne c_t) = 1 - \frac{1}{2}\cdot\frac{z_{01}(1-\tilde{z}_{00}) + z_{10}(1-\tilde{z}_{11}) + p\,(\tilde{z}_{00}+\tilde{z}_{11}-2\tilde{z}_{11}\tilde{z}_{00})}{1-(1-p)(\tilde{z}_{00}+\tilde{z}_{11}) + (1-2p)\,\tilde{z}_{00}\tilde{z}_{11}}, \tag{13}$$
where $\tilde{z}_{00} = z_{00}-z_{10}$ and $\tilde{z}_{11} = z_{11}-z_{01}$.

Proof. First note that for the symmetric Markov chain, $\lim_{t\to\infty}\pi_{i,t} = 1/2$. The statement of Proposition 3 then becomes
$$q_{t+1} = \begin{pmatrix} (1-p)\,\tilde{z}_{00} & -p\,\tilde{z}_{00} \\ -p\,\tilde{z}_{11} & (1-p)\,\tilde{z}_{11} \end{pmatrix} q_t + \begin{pmatrix} z_{10} + p\,\tilde{z}_{00} \\ z_{01} + p\,\tilde{z}_{11} \end{pmatrix} = A\,q_t + b.$$
In the limit, $q_{t+1} = q_t = q$, so that $(I_2-A)\,q = b$ and $q = (I_2-A)^{-1} b$, which is the first statement. Using the standard inverse of a 2×2 matrix,
$$(I_2-A)^{-1} = \frac{1}{|I_2-A|}\begin{pmatrix} 1-(1-p)\tilde{z}_{11} & -p\,\tilde{z}_{00} \\ -p\,\tilde{z}_{11} & 1-(1-p)\tilde{z}_{00} \end{pmatrix}, \qquad |I_2-A| = 1-(1-p)(\tilde{z}_{00}+\tilde{z}_{11}) + (1-2p)\,\tilde{z}_{00}\tilde{z}_{11},$$
so that
$$q_0 + q_1 = \frac{(z_{10}+p\tilde{z}_{00})(1-\tilde{z}_{11}) + (z_{01}+p\tilde{z}_{11})(1-\tilde{z}_{00})}{|I_2-A|} = \frac{z_{10}(1-\tilde{z}_{11}) + z_{01}(1-\tilde{z}_{00}) + p\,(\tilde{z}_{00}+\tilde{z}_{11}-2\tilde{z}_{00}\tilde{z}_{11})}{|I_2-A|}.$$
The limiting misclassification probability is $1 - (0.5,\,0.5)\,q = 1 - \tfrac{1}{2}(q_0+q_1)$, which yields Equation (13). □
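The recursion and its fixed point are straightforward to compute. The sketch below (our illustration, with assumed Gaussian noise and parameter values) iterates Equation (12) in the symmetric case $\pi_{i,t}=1/2$ and compares the iterate with the closed-form limit of Corollary 2, including the limiting misclassification probability of Equation (13).

```python
# Iterate the recursion of Equation (12) for the symmetric case pi_{i,t} = 1/2
# and compare with the fixed point q = (I - A)^{-1} b of Corollary 2.
# Gaussian noise and all parameter values are illustrative assumptions.
import numpy as np
from scipy.stats import norm

p, sigma, eps = 0.05, 0.4, 0.4
F = lambda u: norm.cdf(u, scale=sigma)

def z(i, j):
    thr = (0.5 - i * eps) / (1 - eps)
    return F(thr) if j == 0 else 1 - F(thr - 1)

zt00 = z(0, 0) - z(1, 0)                    # tilde z_00
zt11 = z(1, 1) - z(0, 1)                    # tilde z_11
A = np.array([[(1 - p) * zt00, -p * zt00],
              [-p * zt11, (1 - p) * zt11]])
b = np.array([z(1, 0) + p * zt00, z(0, 1) + p * zt11])

q = np.array([0.5, 0.5])
for _ in range(500):                        # q_{t+1} = A q_t + b
    q = A @ q + b
q_limit = np.linalg.solve(np.eye(2) - A, b)
print("iterated q:", q, " fixed point:", q_limit)
print("limiting misclassification:", 1 - 0.5 * q_limit.sum())
```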
Corollary 3. Let $f(\eta_t)$ be the pdf of $\eta_t$ corresponding to the cdf $F(\eta_t)$, and let the limiting misclassification probability be $\tilde{q} = 1 - \tfrac{1}{2}(\iota_2' q)$, where $\iota_2 \in \mathbb{R}^2$ is a vector of ones. Then, under the same conditions as Corollary 2, the derivative of the limiting misclassification probability at $\varepsilon = 0$ is given by
$$\frac{\partial \tilde{q}}{\partial \varepsilon}\bigg|_{\varepsilon=0} = \frac{1}{4}\Big(f(\tfrac{1}{2}) - f(-\tfrac{1}{2})\Big) + \frac{p}{2}\Big(f(-\tfrac{1}{2})F(\tfrac{1}{2}) - f(\tfrac{1}{2})F(-\tfrac{1}{2})\Big) + \frac{p-1}{2}\Big(f(\tfrac{1}{2})F(\tfrac{1}{2}) - f(-\tfrac{1}{2})F(-\tfrac{1}{2})\Big).$$
If the pdf f is symmetric around zero, this expression simplifies to
$$\frac{\partial \tilde{q}}{\partial \varepsilon}\bigg|_{\varepsilon=0} = -\frac{1}{2}(1-2p)\,f(\tfrac{1}{2})\big(2F(\tfrac{1}{2})-1\big),$$
which is negative for p < 0.5.

Proof. From Corollary 2, write the limiting misclassification probability as
$$\tilde{q} = 1 - \tfrac{1}{2}\,\iota_2' A^{-1} b \tag{A.11}$$
with
$$A = \begin{pmatrix} 1-(1-p)(z_{00}-z_{10}) & p\,(z_{00}-z_{10}) \\ p\,(z_{11}-z_{01}) & 1-(1-p)(z_{11}-z_{01}) \end{pmatrix}, \qquad b = \begin{pmatrix} z_{10}+p\,(z_{00}-z_{10}) \\ z_{01}+p\,(z_{11}-z_{01}) \end{pmatrix}.$$
Let $\nabla z$ denote $\partial z / \partial\varepsilon|_{\varepsilon=0}$. The derivative of Equation (A.11) can be written as
$$-\frac{1}{2}\,|A|^{-2}\Big(|A|\,(1,1)(\nabla A^{\star})\,b + |A|\,(1,1)A^{\star}(\nabla b) - (\nabla|A|)\,(1,1)A^{\star} b\Big), \tag{A.12}$$
where $A^{\star}$ denotes the transposed matrix of cofactors, such that $A^{-1} = A^{\star}/|A|$. Define $f(\eta) = \mathrm{d}F(\eta)/\mathrm{d}\eta$. Then,
$$\begin{aligned} z_{00} &= F\!\left(\frac{1/2}{1-\varepsilon}\right) &&\Rightarrow\quad \nabla z_{00} = \tfrac{1}{2}\,f(\tfrac{1}{2}), \\ z_{10} &= F\!\left(\frac{1/2-\varepsilon}{1-\varepsilon}\right) &&\Rightarrow\quad \nabla z_{10} = -\tfrac{1}{2}\,f(\tfrac{1}{2}), \\ z_{01} &= 1-F\!\left(\frac{1/2}{1-\varepsilon}-1\right) &&\Rightarrow\quad \nabla z_{01} = -\tfrac{1}{2}\,f(-\tfrac{1}{2}), \\ z_{11} &= 1-F\!\left(\frac{1/2-\varepsilon}{1-\varepsilon}-1\right) &&\Rightarrow\quad \nabla z_{11} = \tfrac{1}{2}\,f(-\tfrac{1}{2}), \end{aligned}$$
with $z_{00}|_{\varepsilon=0} = z_{10}|_{\varepsilon=0} = F(\tfrac{1}{2})$ and $z_{01}|_{\varepsilon=0} = z_{11}|_{\varepsilon=0} = 1-F(-\tfrac{1}{2})$. For A and b this implies
$$\begin{aligned} A|_{\varepsilon=0} &= A^{\star}|_{\varepsilon=0} = I_2, \qquad |A|\,\big|_{\varepsilon=0} = 1, \\ \nabla A^{\star} &= -\begin{pmatrix} (1-p)\,f(-\tfrac{1}{2}) & p\,f(\tfrac{1}{2}) \\ p\,f(-\tfrac{1}{2}) & (1-p)\,f(\tfrac{1}{2}) \end{pmatrix}, \\ \nabla|A| &= -(1-p)\,f(\tfrac{1}{2}) - (1-p)\,f(-\tfrac{1}{2}) = -(1-p)\big(f(\tfrac{1}{2})+f(-\tfrac{1}{2})\big), \\ b|_{\varepsilon=0} &= \big(F(\tfrac{1}{2}),\; 1-F(-\tfrac{1}{2})\big)', \\ \nabla b &= -\tfrac{1}{2}(1-2p)\big(f(\tfrac{1}{2}),\; f(-\tfrac{1}{2})\big)'. \end{aligned}$$
Going through each term of Equation (A.12) at $\varepsilon = 0$, we have
$$\begin{aligned} |A|\,(1,1)(\nabla A^{\star})\,b &= -\big(f(-\tfrac{1}{2}),\; f(\tfrac{1}{2})\big)\,b = -f(-\tfrac{1}{2})F(\tfrac{1}{2}) - f(\tfrac{1}{2})\big(1-F(-\tfrac{1}{2})\big), \\ |A|\,(1,1)A^{\star}(\nabla b) &= -\tfrac{1}{2}(1-2p)\big(f(\tfrac{1}{2})+f(-\tfrac{1}{2})\big), \\ -(\nabla|A|)\,(1,1)A^{\star} b &= (1-p)\big(f(\tfrac{1}{2})+f(-\tfrac{1}{2})\big)\big(F(\tfrac{1}{2})+1-F(-\tfrac{1}{2})\big). \end{aligned}$$
Gathering all terms, we obtain the expression for the derivative stated in the corollary. Under symmetry of f around zero, we have $f(-\tfrac{1}{2}) = f(\tfrac{1}{2})$ and $F(-\tfrac{1}{2}) = 1-F(\tfrac{1}{2})$, so that the expression simplifies to
$$\frac{\partial \tilde{q}}{\partial \varepsilon}\bigg|_{\varepsilon=0} = \frac{p}{2}\,f(\tfrac{1}{2})\big(F(\tfrac{1}{2})-F(-\tfrac{1}{2})\big) + \frac{p-1}{2}\,f(\tfrac{1}{2})\big(F(\tfrac{1}{2})-F(-\tfrac{1}{2})\big) = -\frac{1}{2}(1-2p)\,f(\tfrac{1}{2})\big(2F(\tfrac{1}{2})-1\big). \quad \Box$$
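The sign prediction of Corollary 3 can also be checked numerically. The sketch below (an illustration under assumed symmetric Gaussian noise, not the paper's code) compares the closed-form derivative under symmetry with a forward finite difference of the limiting misclassification probability.

```python
# Numerical check of Corollary 3 under symmetric Gaussian noise: compare the
# closed-form derivative at eps = 0 with a forward finite difference of the
# limiting misclassification probability. Parameter values are assumptions.
import numpy as np
from scipy.stats import norm

p, sigma = 0.05, 0.4
f = lambda u: norm.pdf(u, scale=sigma)      # pdf of eta_t
F = lambda u: norm.cdf(u, scale=sigma)      # cdf of eta_t

closed_form = -0.5 * (1 - 2 * p) * f(0.5) * (2 * F(0.5) - 1)

def q_tilde(eps):
    z00, z10 = F(0.5 / (1 - eps)), F((0.5 - eps) / (1 - eps))
    z01, z11 = 1 - F(0.5 / (1 - eps) - 1), 1 - F((0.5 - eps) / (1 - eps) - 1)
    zt00, zt11 = z00 - z10, z11 - z01
    A = np.array([[(1 - p) * zt00, -p * zt00],
                  [-p * zt11, (1 - p) * zt11]])
    b = np.array([z10 + p * zt00, z01 + p * zt11])
    return 1 - 0.5 * np.linalg.solve(np.eye(2) - A, b).sum()

h = 1e-6
print("closed form:", closed_form)
print("finite difference:", (q_tilde(h) - q_tilde(0.0)) / h)
```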
Appendix B: Additional Figures and Tables

Figure B.1. One-step-ahead misclassification rate using Equation (11) and $c_{t-1}=0$. The minimum of each curve is marked by a dot. The misclassification rate still exhibits a minimum, as in the case where β = 0, and the minimum misclassification probability is usually attained at non-trivial values of ε.

Figure B.2. Simulation results for four values of p. Half-variance setting.

Figure B.3. Simulation results for four values of p. Benchmark setting. The number of clusters Kt may vary between 2 and 4, while the true number is always 2.

Figure B.4. Simulation results for four values of p. Half-variance setting. The number of clusters Kt may vary between 2 and 4, while the true number is always 2.

Table B.1. Cluster assignments for K≡4 and ε=0.45

Name                                          2010  2012  2016  2020
Achmea Schadeverzekeringen NV                    3     3     3     3
Allianz SE                                       3     3     3     3
Alte Leipziger                                   4     4     4     4
Assicurazioni Generali Spa                       3     3     3     3
Covea                                            3     3     3     3
Credit Agricole Assurances                       3     3     3     3
Danica Pension Livsforsikringsaktieselskab       2     2     2     2
Fidelidade—Companhia De Seguros SA               3     3     2     2
Gjensidige Forsikring Asa                        3     3     3     3
Groupe Des Assurances Credit Mutuel SA           2     2     2     2
Hannover Re AG                                   1     1     1     1
KBC Verzekeringen                                2     2     2     2
Livforsakringsbolaget Skandia, Omsesidigt        2     2     2     2
Mapfre SA                                        3     3     3     3
Munich Re AG                                     1     1     1     1
Nn Group NV                                      2     2     2     4
Pfa Holding AS                                   2     2     2     2
Pohjola Vakuutus OY                              3     3     3     3
Powszechny Zaklad Ubezpieczen SA                 3     3     3     3
R + V Versicherung AG                            4     4     4     4
Sampo Oyj                                        3     3     3     3
Swiss Re AG                                      1     1     1     1
Ethniki Hellenic General Insurance Co. SA        3     3     3     3
Unipol Gruppo Spa                                3     3     3     3
Vidacaixa Sa De Seguros Y Reaseguros             2     2     2     2
Vienna Insurance Group AG                        3     3     3     3
Zavarovalnica Triglav                            2     2     2     2
Zurich Insurance Group AG                        3     3     3     3

Footnotes

I.C.J. and A.L. acknowledge support from the Dutch National Science Foundation (NWO) under grant 406.18.EB.011. J.S. acknowledges support from the Dutch National Science Foundation (NWO) under grant VI.VIDI.191.169. The views expressed in this paper are those of the authors and they do not necessarily reflect the views or policies of the European Central Bank.

1 The procedure of candidate clustering, mapping, shrinking, and reassignment could be iterated if desired. Also note that the approach could, in principle, be extended from the current hard clustering assignment procedure to a soft clustering assignment.

2 The silhouette index is available at the level of each unit i (see Equation (4)), on average across i at any point in time t, and for the entire data. At the unit level, it compares the closest fit of unit i to its second-best cluster alternative at time t, taking into account all other possible cluster allocations. As a result, sit can play a role similar to the role that the cluster probabilities τij,1:T play in Lucas, Schaumburg, and Schwaab (2019), and that the filtered cluster probabilities τij,t|t play in Custodio João et al. (2022).

3 Of course, knowing that the data are generated by a Markov chain, optimal filters would be available (Hamilton, 1989). We are, however, interested in settings that require only minimal assumptions about the data generating process. Our algorithm can be used with any cross-sectional clustering method, while being flexible over time.

4 The Gini-weighted silhouette index need not be monotonically decreasing in the shrinkage parameter ε. The clustering outcomes at time t can influence the clustering outcomes at later times, leading to non-monotonicity in the aggregate fit; see also the discussion in Section 1.3.
References

Ayadi R., Bongini P., Casu B., Cucinelli D. 2021. Banks’ Business Model Migrations in Europe: Determinants and Effects. British Journal of Management 32: 1007–1026.
Bonhomme S., Lamadon T., Manresa E. 2022. Discretizing Unobserved Heterogeneity. Econometrica 90: 625–643.
Bonhomme S., Manresa E. 2015. Grouped Patterns of Heterogeneity in Panel Data. Econometrica 83: 1147–1184.
Catania L. 2021. Dynamic Adaptive Mixture Models with an Application to Volatility and Risk. Journal of Financial Econometrics 19: 531–564.
Cheng X., Schorfheide F., Shao P. 2019. Clustering for Multi-dimensional Heterogeneity. Working paper.
Custodio João I., Lucas A., Schaumburg J., Schwaab B. 2022. Dynamic Clustering of Multivariate Panel Data. Journal of Econometrics. https://doi.org/10.1016/j.jeconom.2022.03.003
David H. A. 1968. Gini’s Mean Difference Rediscovered. Biometrika 55: 573–575.
EIOPA. 2021. 2021 Insurance Stress Test Report. EIOPA-BoS-21/552, December 16, 2021, pp. 1–59. Available at www.eiopa.eu.
Fox E. B., Sudderth E. B., Jordan M. I., Willsky A. S. 2011. A Sticky HDP-HMM with Application to Speaker Diarization. The Annals of Applied Statistics 5: 1020–1056.
Frühwirth-Schnatter S. 2006. Finite Mixture and Markov Switching Models, Vol. 425. Springer.
Frühwirth-Schnatter S. 2011. Panel Data Analysis: A Survey on Model-Based Clustering of Time Series. Advances in Data Analysis and Classification 5: 251–280.
Frühwirth-Schnatter S., Malsiner-Walli G. 2019. From Here to Infinity: Sparse Finite versus Dirichlet Process Mixtures in Model-Based Clustering. Advances in Data Analysis and Classification 13: 33–64.
Fu W., Perry P. O. 2020. Estimating the Number of Clusters Using Cross-Validation. Journal of Computational and Graphical Statistics 29: 162–173.
Grundmann M., Kwatra V., Han M., Essa I. 2010. Efficient Hierarchical Graph-Based Video Segmentation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA: IEEE, pp. 2141–2148.
Hamilton J. D. 1989. A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle. Econometrica 57: 357–384.
Kalnis P., Mamoulis N., Bakiras S. 2005. On Discovering Moving Clusters in Spatio-Temporal Data. In Bauzer Medeiros C., Egenhofer M. J., Bertino E. (eds.), Advances in Spatial and Temporal Databases. Springer, pp. 364–381.
Kuhn H. W. 1955. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly 2: 83–97.
Lin C.-C., Ng S. 2012. Estimation of Panel Data Models with Parameter Heterogeneity When Group Membership Is Unknown. Journal of Econometric Methods 1: 42–55.
Lucas A., Schaumburg J., Schwaab B. 2019. Bank Business Models at Zero Interest Rates. Journal of Business & Economic Statistics 37: 542–555.
Lumsdaine R. L., Okui R., Wang W. 2022. Estimation of Panel Group Structure Models with Structural Breaks in Group Memberships and Coefficients. Journal of Econometrics. https://doi.org/10.1016/j.jeconom.2022.01.001
Oliveira M., Gama J. 2010. Bipartite Graphs for Monitoring Clusters Transitions. In Cohen P. R., Adams N. M., Berthold M. R. (eds.), Advances in Intelligent Data Analysis IX. Springer, pp. 114–124.
Oliveira M., Gama J. 2012. A Framework to Monitor Clusters Evolution Applied to Economy and Finance Problems. Intelligent Data Analysis 16: 93–111.
Patton A. J., Weller B. M. 2021. Risk Price Variation: The Missing Half of Empirical Asset Pricing. ERID Working Paper No. 274.
Rousseeuw P. J. 1987. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20: 53–65.
Smith S. C. 2022. Structural Breaks in Grouped Heterogeneity. Journal of Business & Economic Statistics, 1–13. https://doi.org/10.1080/07350015.2022.2063132
SSM. 2016. SSM SREP Methodology Booklet, pp. 1–36. Available at http://www.bankingsupervision.europa.eu (accessed April 14, 2016).
Wang Y., Tsay R. S. 2019. Clustering Multiple Time Series with Structural Breaks. Journal of Time Series Analysis 40: 182–202.
Zahn C. 1971. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. IEEE Transactions on Computers C-20: 68–86.

© The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Journal of Financial Econometrics, Oxford University Press. Published: December 15, 2022.