Statistics, Volume 2018 (1808) – Aug 1, 2018


- ISSN: 0022-2496
- eISSN: ARCH-3347
- DOI: 10.1016/j.jmp.2014.06.002

The Fisher information approximation (FIA) is an implementation of the minimum description length principle for model selection. Unlike information criteria such as AIC or BIC, it has the advantage of taking the functional form of a model into account. Unfortunately, FIA can be misleading in finite samples, resulting in an inversion of the correct rank order of complexity terms for competing models in the worst case. As a remedy, we propose a lower-bound $N'$ for the sample size that suffices to preclude such errors. We illustrate the approach using three examples from the family of multinomial processing tree models.

Keywords: Fisher information approximation; minimum description length; normalized maximum likelihood; model selection.

FISHER INFORMATION APPROXIMATION

Model selection by minimum description length: Lower-bound sample sizes for the Fisher information approximation

1. Model selection and minimum description length

In selecting a model from a set of competing models, a trade-off between the fit of a model and its complexity has to be made. On the one hand, a good model should describe observed data reasonably well; on the other hand, it should be as simple as possible so that results generalize beyond the current set of data. Flexible models with many parameters tend to fit too much noise beyond systematic patterns and hence might not predict new data well, a phenomenon known as overfitting (Myung, 2000). Consequently, reflecting the principle of Occam's razor, if a less flexible model can account for the data equally well as a more complex model, the simple model is preferred, thus ensuring a high level of generalizability.

Implementations of Occam's razor include well-known information criteria such as the Akaike information criterion (AIC; Akaike, 1973) or the Bayesian information criterion (BIC; Schwarz, 1978). Both indices weigh the fit of a model in terms of the maximized likelihood against its complexity as measured by the number of free parameters.
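For reference, the two criteria are commonly written as shown below. The formulas are the standard textbook definitions rather than a quotation from this article; $S$ (number of free parameters), $N$ (number of observations), and LML (maximized log-likelihood) follow the notation used in the later sections.

```latex
\mathrm{AIC} = -2\,\mathrm{LML} + 2S,
\qquad
\mathrm{BIC} = -2\,\mathrm{LML} + S \ln N
```

In both expressions the penalty depends on the model only through $S$, so two models with the same number of parameters receive the same penalty regardless of how those parameters enter the likelihood.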
These criteria share the drawback that merely counting the number of free parameters fails to address the functional form of a model appropriately, that is, the structure of how the parameters are connected to each other. For example, in the context of psychophysics, Pitt et al. (2002) showed that Stevens' power law is more complex than Fechner's logarithmic law even though both comprise the same number of free parameters. Likewise, taking the functional complexity into account is of fundamental importance when comparing models involving order constraints. Obviously, a model with order restrictions on the parameter vector $\theta$ is less complex than a corresponding unrestricted model. However, neither AIC nor BIC can account for this difference because order restrictions do not affect the number of parameters of a model.

To overcome this limitation, Grünwald (2000) proposed to rely on the principle of minimum description length (MDL) when selecting among competing models. This approach was developed in the field of algorithmic coding theory (Rissanen, 1978) and addresses the issue of how much a given set of data can be compressed by a model. A model is preferred if it covers regularities in the data by means of the shortest code length (Grünwald, 2007). In the extreme case of randomly generated data, no compression is possible at all. In contrast, if data are generated deterministically without any noise, the code to describe these data can be shortened dramatically by giving the rule which generated the data (Grünwald et al., 2005). Models that compress data tightly provide a high level of generalizability, because they cover systematic patterns of the data that are likely to occur in future data as well.
An implementation of the MDL principle was provided by Rissanen (2001), who derived the normalized maximum likelihood (NML) to measure the stochastic complexity of a model given a data set,

$$\mathrm{NML} = -\mathrm{LML} + C_{\mathrm{NML}}(N), \tag{1}$$

where the maximum log-likelihood (LML) as a measure of fit is weighted against the complexity term

$$C_{\mathrm{NML}}(N) = \ln \int_{\mathcal{X}} f(x \mid \hat{\theta}(x))\,dx. \tag{2}$$

This complexity term is the natural logarithm of the integral over the maximum likelihoods across the whole outcome space $\mathcal{X}$ of potentially observable vectors $x$ with number of observations $N$. Accordingly, a complex model that fits a wide range of observable data vectors will have a large value of $C_{\mathrm{NML}}(N)$ compared to a model that fits only a small subset of observable data (Myung et al., 2006). Unfortunately, there is no general closed-form expression of $C_{\mathrm{NML}}(N)$, and numerical estimation techniques such as Monte Carlo (MC) integration are often too time intensive for practical purposes.

An alternative is the Fisher information approximation (FIA; Rissanen, 1996),

$$\mathrm{FIA} = -\mathrm{LML} + C_{\mathrm{FIA}}(N), \tag{3}$$

which is asymptotically equivalent to NML. The complexity term $C_{\mathrm{FIA}}(N)$ covers the number of free parameters $S$ and the number of observations $N$ in the first summand and considers the functional form of the model in the second,

$$C_{\mathrm{FIA}}(N) = \frac{S}{2} \ln \frac{N}{2\pi} + \ln \int_{\Omega} \sqrt{|I(\theta)|}\,d\theta, \tag{4}$$

where $I(\theta)$ is the Fisher information matrix of sample size one. This matrix contains the negative expected values of the second partial derivatives of the log-likelihood function and thereby captures the functional form of the model. Since the Fisher information matrix is often available in closed form, the integral in the second part of $C_{\mathrm{FIA}}(N)$ is more tractable than the integral in $C_{\mathrm{NML}}(N)$. $C_{\mathrm{FIA}}(N)$ can be estimated by means of MC integration (Pitt et al., 2002).

2. FIA in finite samples

Although FIA approaches NML asymptotically, both measures can deviate substantially in finite samples.
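To make the finite-sample gap concrete, the sketch below compares both complexity terms for a model that is not one of the article's examples: a single binomial rate parameter observed over $N$ trials. This choice is an assumption made purely for illustration, because both quantities are then available without simulation: the NML integral becomes a sum over the $N + 1$ possible success counts, and the FIA integral of $\sqrt{I(\theta)} = 1/\sqrt{\theta(1-\theta)}$ over $(0, 1)$ equals $\pi$. All function names are hypothetical.

```python
import math


def _k_log_ratio(k: int, n: int) -> float:
    """Return k * ln(k / n), using the convention 0 * ln(0) = 0."""
    return 0.0 if k == 0 else k * math.log(k / n)


def c_nml_binomial(n: int) -> float:
    """NML complexity (Eq. 2) of a one-parameter binomial model with n trials.

    Sums the maximized likelihood over all success counts k = 0..n,
    working in log space to avoid overflow in the binomial coefficient.
    """
    total = 0.0
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_max_lik = _k_log_ratio(k, n) + _k_log_ratio(n - k, n)
        total += math.exp(log_binom + log_max_lik)
    return math.log(total)


def c_fia_binomial(n: int) -> float:
    """FIA complexity (Eq. 4) for the same model: S = 1, integral = pi."""
    return 0.5 * math.log(n / (2.0 * math.pi)) + math.log(math.pi)


if __name__ == "__main__":
    for n in (5, 20, 100, 1000):
        nml, fia = c_nml_binomial(n), c_fia_binomial(n)
        print(f"N={n:5d}  C_NML={nml:.4f}  C_FIA={fia:.4f}  gap={nml - fia:.4f}")
```

For the sample sizes printed, FIA lies below NML and the gap shrinks as $N$ grows, which is the asymptotic behavior described above. The toy case says nothing about how large the gap is for the multinomial processing tree models considered in the article.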
For hierarchical model families in particular, Navarro (2004) noted that the FIA complexity term for a nested model can become larger than that of a nesting model. Such a rank order reversal is obviously problematic, as a nested model, by definition, must be associated with a smaller complexity value. This notion is adequately reflected by NML, because the integral of the maximum likelihoods across the data space is always smaller for the nested model. Using FIA, however, such inverted rank orders of the complexity terms may occur, in turn resulting in a biased model selection in favor of the nesting model, as it always fits at least as well as the nested model. Navarro (2004) supposed that the source of this problem is related to possible violations of the requirement that the maximum likelihood estimator $\hat{\theta}$ must lie sufficiently within the model manifold, an assumption underlying the approximation of NML by FIA. However, the precise conditions under which FIA results are misleading are not yet established.

Inversions of the rank order of FIA complexity terms for small $N$ may not only pose problems for model selection in nested model families but also for model selection in the more general class of NML stable models.

Definition 1 (NML stability). A set of stochastic models is called NML stable if the rank order of NML complexities $C_{\mathrm{NML}}(N)$ across these models is identical for all $N \in \mathbb{N}$.

The definition of $C_{\mathrm{NML}}(N)$ directly implies NML stability for all pairs of nested models, as the NML complexity of a nested model is smaller than that of the nesting model for all sample sizes. However, NML stability does not necessarily hold for sets of non-nested models in general. As an example for a rank order inversion of NML complexities in non-nested model families, consider a hierarchical model that assigns a separate set of parameters to each participant.
If such a model is compared to a different (non-nested) model that assumes a constant set of parameters for all $N$, the NML complexity of the latter model compared to the hierarchical model might indeed be larger for small samples, but smaller for larger samples. However, considering typical sets of non-nested models each having a constant number of parameters and presupposing independent and identically distributed data, NML stability is a reasonable assumption. In Section 4, we provide an example of an NML stable, non-nested model family with a FIA complexity rank order that deviates from the NML rank order for small $N$'s. Note that, unlike for nested models, an inverted FIA rank order for non-nested models does not necessarily imply selection of the model with a larger NML complexity, because an overly small $C_{\mathrm{FIA}}(N)$ can be compensated by a larger negative log-likelihood (cf. Eq. 3). Nevertheless, in such a setting, FIA-based model selection will also be biased towards the more complex model.

To avoid biases in model selection using FIA for NML stable models, we propose to check whether the $C_{\mathrm{FIA}}(N)$ rank order of the candidate models is invariant across different numbers $N$ of observations. Based on the definition of FIA in (4) it is easy to show that for any two models with a fixed but unequal number of parameters $S_i$ and $S_j$, respectively, the $C_{\mathrm{FIA}}(N)$ rank order cannot be identical for all possible sample sizes. Since the integral in (4) is independent of $N$, it is straightforward to determine the (single) sample size $N'_{i,j}$ for which the complexity terms of two models with $S_i \neq S_j$ are equal. Equating the FIA terms of two models $i$ and $j$ and solving for $N$ yields

$$N'_{i,j} = 2\pi \exp\left[ \frac{2}{S_i - S_j} \left( \ln \int_{\Omega_j} \sqrt{|I_j(\theta)|}\,d\theta - \ln \int_{\Omega_i} \sqrt{|I_i(\theta)|}\,d\theta \right) \right]. \tag{5}$$

When $N > N'_{i,j}$, the $C_{\mathrm{FIA}}(N)$ terms of the two competing models $i$ and $j$ will always result in the same rank order.
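The crossing point in Eq. (5) is simple to compute once the log-integrals of the two models are known. The sketch below assumes those log-integrals have already been obtained (for example by MC integration); the function names and the numerical values in the usage example are hypothetical and do not come from the article. The second helper takes the largest pairwise crossing point in a candidate set, on the reasoning that a sample size above every pairwise crossing point leaves all pairwise rank orders fixed.

```python
import math
from itertools import combinations


def c_fia(n_obs: float, n_params: int, log_integral: float) -> float:
    """FIA complexity term, Eq. (4), given ln of the Fisher-information integral."""
    return 0.5 * n_params * math.log(n_obs / (2.0 * math.pi)) + log_integral


def crossover_n(s_i: int, log_int_i: float, s_j: int, log_int_j: float) -> float:
    """Sample size at which the FIA complexities of models i and j are equal, Eq. (5).

    Returns 0.0 when both models have the same number of parameters,
    because their complexity curves then never cross.
    """
    if s_i == s_j:
        return 0.0
    return 2.0 * math.pi * math.exp(2.0 / (s_i - s_j) * (log_int_j - log_int_i))


def largest_crossover(models: list[tuple[int, float]]) -> float:
    """Largest pairwise crossing point over (n_params, log_integral) tuples."""
    return max(
        (crossover_n(a[0], a[1], b[0], b[1]) for a, b in combinations(models, 2)),
        default=0.0,
    )


if __name__ == "__main__":
    # Hypothetical values: a 3-parameter nesting model and a 2-parameter nested model.
    nesting, nested = (3, 0.5), (2, 2.0)
    n_cross = crossover_n(*nesting, *nested)
    print(f"crossing point: {n_cross:.1f}")
    for n in (50, 500):
        print(n, c_fia(n, *nesting), c_fia(n, *nested))
```

With these made-up inputs the nested model receives the larger FIA penalty below the crossing point and the smaller one above it, which is the inversion pattern discussed in the text.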
Because $C_{\mathrm{FIA}}(N)$ approximates $C_{\mathrm{NML}}(N)$ for increasing $N$, the rank order obtained for $N > N'_{i,j}$ must be the correct (i.e., NML-consistent) rank order. By implication, for any $N < N'_{i,j}$ the rank order of complexity terms is incorrectly inverted.

Footnote: Thanks to Dan Navarro for the example of a rank order inversion of NML complexities involving a hierarchical model.

When plotting the FIA complexity terms for two models as logarithmic functions of $N$, $N'_{i,j}$ gives the number of observations at which these two lines intersect. Figure 1B illustrates this for two exemplary models that will be discussed in more detail in Section 4 of this article. Ideally, $N'_{i,j}$ should be as small as possible, thus ensuring the absence of the aforementioned bias for a large range of sample sizes. For two models with the same number of free parameters, $S_i = S_j$, the complexity curves never intersect since the FIA complexity penalties can only differ in their intercepts; the lines are either parallel or identical. In such a case, $N'_{i,j}$ is undefined and reversals of $C_{\mathrm{FIA}}$ rank orders cannot occur.

Figure 1. NML and FIA complexities for two decision strategies, take-the-best (TTB) and a probabilistic weighted-additive rule (WADDprob; Figure 2), as functions of the number of observations $N$ (panel A: NML complexity; panel B: FIA complexity). The dotted vertical line for FIA marks the lower bound $N' = 80$.

Consequently, to make sure that the correct rank order of FIA complexity terms is found for a candidate set of two or more NML stable models, the actual number of observations in a study should exceed $N'_{i,j}$ for all pairs of models $(i, j)$ in the competition:

Definition 2 (Lower-bound $N'$). Considering a set of $M \in \mathbb{N}$ NML stable models, the lower-bound sample size $N'$ for the application of FIA to this set is

$$N' := \max_{i,j \in \{1,\dots,M\}} N'_{i,j}, \tag{6}$$

with $N'_{i,j}$ given by (5) if $S_i \neq S_j$ and $N'_{i,j} = 0$ otherwise.

3. Nested model comparison

We demonstrate the relevance of our approach using examples from the family of multinomial processing tree (MPT) models (Batchelder and Riefer, 1999; Erdfelder et al., 2009). The models differ in size, thus showing that inverted rank orders in $C_{\mathrm{FIA}}(N)$ can occur in various situations. An algorithm for the computation of the FIA complexity term for MPT models by means of MC integration was proposed by Wu et al. (2010; for implementations see also Moshagen, 2010; Singmann and Kellen, 2013). In all examples, we estimated the integral of the FIA complexity term based on one million samples.

Figure 2. Decision strategies as used in Hilbig and Moshagen (in press). Panels: take-the-best (TTB) and probabilistic weighted-additive (WADDprob), each with item types 1 to 3.

The first model is the one-high-threshold model of source monitoring as introduced by Batchelder and Riefer (1990, Figure 3) to disentangle the effects of item recognition ($D_1$ and $D_2$) and source discrimination ($d_1$ and $d_2$). In the nesting model, the item detection parameters ($D_1 = D_2$) and two of the guessing parameters ($a = g$) are set equal (model 5b). One nested model (model 4) additionally assumes equal source discrimination for both sources ($d_1 = d_2$). For this model pair, Wu et al. (2010) observed an inversion of the rank order of the FIA complexity penalties for an extreme proportion of 5% new items and $N = 1{,}000$. Using the proposed lower-bound $N'$, we can predict that this inversion vanishes when $N$ exceeds $N' = 1{,}393$. Importantly, in case of MPT models, both $C_{\mathrm{NML}}(N)$ and $C_{\mathrm{FIA}}(N)$ can vary depending on the proportion of observations per tree even when the overall $N$ remains constant. The effect of the proportion of new items on $N'$ is shown in Table 1.
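The MC estimation of the integral in the FIA complexity term can be sketched on a deliberately small case. The code below is not the algorithm of Wu et al. (2010). It assumes a hypothetical one-parameter model in which a binary response is correct with probability $p = 0.25 + 0.5\theta$ for $\theta \in (0, 1)$, chosen only because its Fisher information is bounded and the integral has the closed-form value $\pi/3$ against which the estimate can be checked. Function names, the seed, and the sample count are illustrative.

```python
import math
import random


def sqrt_fisher_info(theta: float) -> float:
    """sqrt(I(theta)) for a Bernoulli response with p = 0.25 + 0.5 * theta.

    For a Bernoulli model, I(theta) = (dp/dtheta)^2 / (p * (1 - p)).
    The model itself is a hypothetical toy example.
    """
    p = 0.25 + 0.5 * theta
    dp_dtheta = 0.5
    return abs(dp_dtheta) / math.sqrt(p * (1.0 - p))


def mc_log_integral(sqrt_info, n_samples: int = 100_000, seed: int = 1) -> float:
    """Estimate ln of the integral of sqrt_info over (0, 1) by uniform sampling.

    The parameter space has volume 1, so the integral is estimated
    by the plain sample mean of the integrand.
    """
    rng = random.Random(seed)
    mean = sum(sqrt_info(rng.random()) for _ in range(n_samples)) / n_samples
    return math.log(mean)


def c_fia(n_obs: float, n_params: int, log_integral: float) -> float:
    """FIA complexity term, Eq. (4)."""
    return 0.5 * n_params * math.log(n_obs / (2.0 * math.pi)) + log_integral


if __name__ == "__main__":
    log_int = mc_log_integral(sqrt_fisher_info)
    print("MC estimate of integral:", math.exp(log_int))
    print("closed form (pi / 3):   ", math.pi / 3.0)
    print("C_FIA at N = 100:       ", c_fia(100, 1, log_int))
```

Plain uniform sampling is adequate here because the integrand is bounded. For models whose Fisher information diverges at the boundary of the parameter space, a different sampling scheme would be needed.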
Table 1 shows that a minimum is reached for a proportion of 50% new items ($N' = 292$), while choosing more extreme proportions leads to an increase in $N'$. Note that this minimum $N'$ is still larger than the number of observations for thought-disordered and non-thought-disordered manic participants ($n = 240$ each) in a data set by Harvey (1985) as discussed in Batchelder and Riefer (1990). By implication, if these data were reanalyzed by means of FIA, the more complex nesting model 5b would always be preferred over the less complex nested model 4, irrespective of the data. Clearly, FIA-based model selection would be severely misleading in such a case.

Figure 3. Source-monitoring model by Batchelder and Riefer (1990). Trees: Tree 1 (Source A items), Tree 2 (Source B items), Tree 3 (New items).

The second example is an MPT model for the "Who said what?" paradigm applied in a social psychological setting (Figure 4; Klauer and Wegener, 1998). This model is an extension of the two-high-threshold model to measure memory for categories, statements, and persons. As in the studies reported by Klauer and Wegener (1998), the nesting model assumes equal probabilities for detecting a statement ($D_A = D_B = D_N$). A nested model additionally constrains the probabilities of discriminating the category of a statement to be equal ($d_A = d_B$). In Table 1, the resulting lower-bound $N'$ is shown as a function of the proportion of distractor items given four persons per category. As for the source-monitoring model, a ratio of 50% distractors is optimal for the application of FIA ($N' = 2{,}568$), but still results in a lower-bound $N'$ that is larger than the number of observations of Experiments 2, 3, and 5 (each $n = 1{,}920$) of Klauer and Wegener (1998), for example. Since the fit of the nested model can never compensate for its larger complexity term, the nesting model would always be selected by FIA for these data sets. Again, model selection by FIA would be misleading.

Table 1. Lower-bound $N'$ for three model families as a function of the percentage of new items (in source-monitoring models), distractor statements (in the "Who said what?" paradigm), and Type-3 items (in the decision strategies model).

| Model family | 10% | 30% | 50% | 70% | 90% |
|---|---|---|---|---|---|
| Source-monitoring | 750 | 340 | 292 | 344 | 770 |
| "Who said what?" | 8,616 | 3,415 | 2,568 | 2,711 | 5,114 |
| Decision strategies | 108 | 80 | 87 | 122 | 323 |

Note. Column headings give the proportion of new items, distractor statements, and Type-3 items, respectively. The ratio of the observations for the remaining item types is held constant at 1:1.

4. Non-nested model comparison

The third example demonstrates the relevance of our approach for non-nested NML stable models. The models stem from the field of judgment and decision making and describe the behavior of choosing one of two choice options to maximize a given criterion (i.e., decision strategy; Bröder and Schiffer, 2003a). Recently, Hilbig and Moshagen (in press) proposed to classify participants according to strategy usage by means of FIA. Figure 2 depicts two possible decision strategies, take-the-best (TTB) and a probabilistic weighted-additive rule (WADDprob), each including the error terms $e_1$, $e_2$, and $e_3$ of making a strategy-inconsistent decision. Whereas TTB assumes homogeneous error terms, smaller or equal to chance ($e_1 = e_2 = e_3 \le 0.5$), WADDprob poses an order restriction on the error probabilities ($e_1 \le e_3 \le e_2 \le 0.5$). Note that TTB and WADDprob are non-nested due to a different definition of the error term $e_2$ ($e_{2,\mathrm{TTB}} = 1 - e_{2,\mathrm{WADDprob}}$).

Figure 4. MPT model for the "Who said what?" paradigm by Klauer and Wegener (1998). Trees: statement by a speaker from category A, statement by a speaker from category B, distractor statement.

Since the number of possible observations increases only according to a cubic law in respect to $N$ (e.g., $|\mathcal{X}| = [(N + 1)/3]^3$ for equal proportions of each item type), $C_{\mathrm{NML}}(N)$ can be computed directly for small to moderate sample sizes without requiring extensive numeric integration techniques. The NML complexity terms are shown in panel A of Figure 1. It is evident that the NML complexity curves do not intersect for an equal proportion of item types. Thus, the two non-nested models are NML stable for $N \le 150$. The respective FIA complexity curves are shown in panel B of Figure 1. Comparing the FIA and NML complexities indicates that FIA reasonably approximates NML for the TTB strategy across all $N$, but strongly underestimates NML for the WADDprob strategy. Thus, in the present example, the observed inversion of FIA complexity terms at $N = 80$ can be attributed to a bias in FIA associated with the WADDprob model for small $N$'s. This result also shows that $N > N'$ does not guarantee that FIA approximates NML well, as the bias in FIA for WADDprob is still substantial even for the largest $N$ considered.
Nevertheless, ensuring that the number of observations exceeds the lower-bound $N'$ will lead to correct model selections (in terms of NML) and thus still informs the decision to rely on FIA, as severe biases in model selection are avoided and settings are revealed in which direct estimation of NML should be considered.

In interpreting the resulting lower bound $N'$ of 80 it is important to consider that strategies are usually classified for each participant separately, sometimes based on relevant response frequencies as low as 15 (e.g., Bröder and Schiffer, 2003b). Table 1 shows that the lower-bound $N'$ cannot be reduced further by changing the proportion of item types. Therefore, for FIA-based model selection to make sense at the individual level, the minimum requirement for Hilbig and Moshagen's (in press) candidate models is to obtain at least 27 decisions for each of the three item types per participant.

Footnote: We also checked the NML stability of TTB and WADDprob for other proportions of item types, with identical results regarding NML stability.

5. Discussion

The advantage of considering the functional form in measuring model complexity has often been discussed in the literature (e.g., Grünwald, 2000; Myung et al., 2006). Unfortunately, the approximation of NML by means of FIA can be misleading (Navarro, 2004), but no precise condition or model characteristic is known yet to predict or correct this problem. In extreme cases, this problem may result in a reversed rank order of FIA complexity terms for NML stable models, in turn leading to severely biased model comparisons. As a practical solution, we propose to calculate $N'$, the lower-bound sample size for the application of FIA to NML stable model families. If the observed sample size exceeds the lower-bound $N'$, researchers can be confident that the rank order of the $C_{\mathrm{FIA}}(N)$ complexity terms agrees with the corresponding NML rank order.
Otherwise, FIA should not be used, as model selection will be biased in favor of the more complex model in terms of NML. Specifically, in the case of two nested models, the more flexible nesting model will always be selected for $N < N'$, regardless of the data. We demonstrated the relevance of our approach using three examples in which the FIA criterion results in a misleading model selection even for moderately large sample sizes. However, although $N > N'$ ensures that using FIA does not distort model comparisons, the lower-bound $N'$ cannot be used to determine the required sample size to guarantee a reliable approximation of NML via FIA. In any case, it is clear that the source of the current difficulty does not lie within the MDL principle itself: The NML criterion does not suffer from this problem at any $N$.

It is also important to keep in mind that implementations of the MDL principle such as FIA provide the advantage of considering the functional complexity of models with the same number of free parameters even if $N$ is small. Such models may include order restrictions and different functional forms that are not captured by information criteria such as AIC and BIC. FIA can readily be applied to compare models with equal numbers of parameters, as the rank order of the complexity terms is always stable because the complexity penalties differ only in the integral. Most importantly in the present context, for NML stable models that differ in the number of parameters, the approach advocated herein provides a safety belt for substantive researchers who want to take advantage of the benefits of FIA while avoiding severe biases in model selection that may occur in case of small samples.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, pages 267–281.

Batchelder, W. H. and Riefer, D. M. (1990). Multinomial processing models of source monitoring. Psychological Review, 97:548–564.

Batchelder, W. H. and Riefer, D. M. (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6:57–86.

Bröder, A. and Schiffer, S. (2003a). Bayesian strategy assessment in multi-attribute decision making. Journal of Behavioral Decision Making, 16:193–213.

Bröder, A. and Schiffer, S. (2003b). Take the best versus simultaneous feature matching: Probabilistic inferences from memory and effects of representation format. Journal of Experimental Psychology: General, 132:277–293.

Erdfelder, E., Auer, T.-S., Hilbig, B. E., Assfalg, A., Moshagen, M., and Nadarevic, L. (2009). Multinomial processing tree models: A review of the literature. Zeitschrift für Psychologie/Journal of Psychology, 217:108–124.

Grünwald, P. (2000). Model selection based on minimum description length. Journal of Mathematical Psychology, 44:133–152.

Grünwald, P. (2007). The minimum description length principle. MIT Press, Cambridge, MA.

Grünwald, P., Myung, J. I., and Pitt, M. A. (2005). Advances in minimum description length: Theory and applications. MIT Press, Cambridge, MA.

Hilbig, B. E. and Moshagen, M. (in press). Generalized outcome-based strategy classification: A guide to comparing deterministic and probabilistic choice models. Psychonomic Bulletin & Review.

Klauer, K. C. and Wegener, I. (1998). Unraveling social categorization in the "Who said what?" paradigm. Journal of Personality and Social Psychology, 75:1155–1178.

Moshagen, M. (2010). multiTree: A computer program for the analysis of multinomial processing tree models. Behavior Research Methods, 42:42–54.

Myung, J. I. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44:190–204.

Myung, J. I., Navarro, D. J., and Pitt, M. A. (2006). Model selection by normalized maximum likelihood. Journal of Mathematical Psychology, 50:167–179.

Navarro, D. J. (2004). A note on the applied use of MDL approximations. Neural Computation, 16:1763–1768.

Pitt, M. A., Myung, I. J., and Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109:472–491.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14:465–471.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42:40–47.

Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47:1712–1717.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.

Singmann, H. and Kellen, D. (2013). MPTinR: Analysis of multinomial processing tree models in R. Behavior Research Methods, 45:560–575.

Wu, H., Myung, J. I., and Batchelder, W. H. (2010). On the minimum description length complexity of multinomial processing tree models. Journal of Mathematical Psychology, 54:291–303.

Statistics – arXiv (Cornell University)

**Published:** Aug 1, 2018
