Schnuerch, M., Haaf, J. M., Sarafoglou, A., & Rouder, J. N. (2022). Meaningful Comparisons With Ordinal-Scale Items. Collabra: Psychology, 8(1). https://doi.org/10.1525/collabra.38594

Methodology and Research Practice

Martin Schnuerch (1), Julia M. Haaf (2), Alexandra Sarafoglou (2), Jeffrey N. Rouder (3)

(1) Department of Psychology, University of Mannheim, Mannheim, Germany; (2) University of Amsterdam, Amsterdam, Netherlands; (3) University of California, Irvine, CA, US, and University of Mannheim, Mannheim, Germany

Keywords: stochastic dominance, Bayes factors, ordinal scales, meaningful comparisons, Likert items

Collabra: Psychology, Vol. 8, Issue 1, 2022

Ordinal-scale items—say items that assess agreement with a proposition on an ordinal rating scale from strongly disagree to strongly agree—are exceedingly popular in psychological research. A common research question concerns the comparison of response distributions on ordinal-scale items across conditions. In this context, there is often a lingering question of whether metric-level descriptions of the results and parametric tests are appropriate. We consider a different problem, perhaps one that supersedes the parametric-vs-nonparametric issue: When is it appropriate to reduce the comparison of two (ordinal) distributions to the comparison of simple summary statistics (e.g., measures of location)? In this paper, we provide a Bayesian modeling approach to help researchers perform meaningful comparisons of two response distributions and draw appropriate inferences from ordinal-scale items. We develop four statistical models that represent possible relationships between two distributions: an unconstrained model representing a complex, non-ordinal relationship, a nonparametric stochastic-dominance model, a parametric shift model, and a null model representing equivalence in distribution.
We show how these models can be compared in light of data with Bayes factors and illustrate their usefulness with two real-world examples. We also provide a freely available web applet for researchers who wish to adopt the approach.

Correspondence: martin.schnuerch@uni-mannheim.de

It is hard to overstate the popularity of ordinal data in social science research. Applications of ordinal variables such as Likert items to assess respondents' opinions, affective states, or unobservable behavior have become customary in political science, economics, educational research, health sciences, and psychology. Likert items refer to statements or questions with discrete, naturally ordered response categories (Bürkner & Vuorre, 2019; Liddell & Kruschke, 2018; see also Likert, 1932). A key question in many applications is how responses on Likert items differ between two conditions (say, two groups of respondents). Although there is near universal recognition that Likert items are ordinal variables, these comparisons are commonly characterized with means and t-tests. While some have defended the use of parametric statistics in the context of Likert data (Norman, 2010), others have criticized it as a biased and error-prone practice (Liddell & Kruschke, 2018; Winship & Mare, 1984). The typical recommendation is to rely on nonparametric statistics instead to ensure robust inferences (e.g., Jamieson, 2004; Kuzon et al., 1996; Nanna & Sawilowsky, 1998).

The ordinal-vs-metric issue is well known, and there is a large body of literature on it (e.g., Bürkner & Vuorre, 2019; Clason & Dormody, 1994; Jamieson, 2004; Kuzon et al., 1996; Liddell & Kruschke, 2018; McKelvey & Zavoina, 1975; Nanna & Sawilowsky, 1998; Norman, 2010; Sullivan & Artino, 2013; Winship & Mare, 1984). Following Townsend (1990), however, we believe that there is a different, far more fundamental issue that has received considerably less attention (but see Clason & Dormody, 1994): In the usual course of testing the effect of condition on some outcome variable, researchers typically rely on the comparison of summary statistics (e.g., measures of central tendency). These comparisons may establish a certain order relationship at the level of this summary statistic, for example, "The mean value is larger in Condition A than in Condition B". They do not imply, however, that this relationship holds in a more general sense, that is, at the level of distributions. In fact, if the relationship between distributions as a whole is qualitatively different from that between the considered summary statistics, a comparison of the latter would not be meaningful and may even mislead the analyst. This state holds across different levels of measurement, and it is true for parametric and nonparametric tests alike.

Based on Townsend's (1990) theory of hierarchical inference, we argue that when comparing responses on a Likert item between two conditions, researchers should first test for order relationships at the level of distributions. If a certain ordering holds at this level, it is also implied at a lower level, that is, for a summary statistic such as the mean or median. The reverse, of course, is not true. An ordering may hold for some summary statistic but not for the distribution as a whole, in which case the summary statistic does not represent the phenomena of interest. Tests of summary statistics are meaningful, in our opinion, only when the ordering of the summary statistic indeed represents the ordering of distributions.

This condition where distributions order is called stochastic dominance, and it is a well-known concept, for example, in economics (Abadie, 2002; Levy, 1992). Stochastic dominance describes an order relationship between distributions such that one cumulative distribution function is "greater" (or "less") than the other cumulative function for all possible values (Speckman et al., 2008). Heathcote et al. (2010) developed methods to assess stochastic dominance and compared their performance with that of existing procedures (e.g., Kolmogorov-Smirnov tests). These tests are only suited for continuous data, however, which renders them inappropriate for Likert items. In fact, the limited availability of suitable test procedures may be one of the reasons why stochastic dominance is rarely considered in applications with Likert data (cf. Madden, 2009; Tubeuf & Perronnin, 2008).

In this paper, we provide a Bayes-factor approach to help researchers use data to assess stochastic dominance and draw appropriate inferences from Likert items. In the following, we briefly outline conventional approaches to analyzing Likert items and highlight the role of stochastic dominance. We then develop four statistical models that represent possible order relationships between two response distributions: an unconstrained model representing a complex, non-ordinal relationship, a nonparametric stochastic-dominance model, a parametric shift model, and a null model representing equivalence in distribution. We show how these models may be evaluated in light of data by means of Bayes factors and present a user-friendly web applet for readers who wish to adopt the analysis in their own research. Finally, we demonstrate the usefulness of the approach by applying it to two real-world examples and assess the sensitivity of Bayes factor model comparisons to reasonable variations in prior settings.

Likert-Item Distributions

To illustrate why the parametric-vs-nonparametric debate does not address the heart of the problem, consider the following hypothetical example: Suppose we wanted to compare the frequency of being sad between first-year Marines and first-year college students. From each group, we let 100 individuals indicate on a 5-point Likert item how often they felt sad, with response options ranging from "never" to "always". Table 1 shows hypothetical data for two different scenarios labeled plainly Scenario I and Scenario II. For each scenario, we may ask whether there is a difference between Marines and college students.

A nonparametric alternative to t-tests for addressing this question is the Wilcoxon rank-sum test. Unlike the t-test, the Wilcoxon test does not consider the difference between values but only the rank order. Nanna and Sawilowsky (1998) compared the performance of both tests in the context of Likert data and found that the nonparametric test outperformed the parametric test in terms of Type I error control and statistical power. Despite these differences in performance, both procedures have in common that they compare distributions by comparing central tendencies.

In Scenario I, the response distributions of Marines and college students differ in their central tendencies. College students seem to be more often sad than Marines, and both a t-test and a Wilcoxon rank-sum test will detect that difference. Importantly, this relationship holds qualitatively across the response scale: College students' reported frequency of being sad is unambiguously higher than that of Marines.

A different picture emerges in Scenario II: Comparing Marines' and college students' answers by means of central tendencies implies the same ordering as in Scenario I, that is, students seem to report being sad more often than Marines. This ordering is not preserved at the level of distributions, however. While many Marines report never being sad, many also report always being sad. Thus, tests of central tendencies, parametric and nonparametric alike, do not allow for a meaningful comparison of conditions (Clason & Dormody, 1994).

The crucial difference between the scenarios is that in Scenario I, one distribution stochastically dominates the other, whereas in Scenario II, this dominance does not hold. Stochastic dominance describes the relationship among cumulative probabilities and, for observed data, may be visualized using cumulative proportions. Table 2 presents these cumulatives for Scenarios I and II. Each number denotes the proportion of people whose response fell into the respective or a lower category. For example, the first two values for Marines in Scenario I are .30 and .55, and these values indicate that 30% of Marines report to be never sad and 55% report to be either never sad or rarely sad. The key property here is the comparison of these cumulatives to those for college students. The values for college students indicate that only 20% are never sad and 40% are either never or rarely sad. The cumulative proportions for the Marines are always at least as great as those for the college students, and this property holds across all categories.

The pattern is more complex in Scenario II. Here, 60% of Marines report to be sometimes, rarely, or never sad, while only 36% of college students do. Thus, for these three categories, Marines report a lower frequency of being sad than college students. This relationship reverses at Often, however. While 80% of college students report to be sad often or less, leaving 20% to be always sad, only 70% of Marines chose Often or less, leaving 30% for the highest category. There is no stochastic dominance in this case: Marines are both more frequently never sad and more frequently always sad.
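These two scenarios are easy to check numerically. The following Python sketch (our illustration, not code from the paper) computes mean scores, coding Never = 1 through Always = 5, and tests the sample-level dominance pattern from the hypothetical counts in Table 1:

```python
# Hypothetical counts from Table 1 (response options coded 1 = Never ... 5 = Always).
scenarios = {
    "I":  {"Marines": [30, 25, 20, 15, 10], "College": [20, 20, 20, 20, 20]},
    "II": {"Marines": [40, 15, 5, 10, 30],  "College": [5, 12, 19, 44, 20]},
}

def mean_score(counts):
    """Mean rating under a metric 1..5 coding of the ordinal categories."""
    n = sum(counts)
    return sum((j + 1) * c for j, c in enumerate(counts)) / n

def cumulative(counts):
    """Running cumulative proportions, as in Table 2."""
    n, total, out = sum(counts), 0, []
    for c in counts:
        total += c
        out.append(total / n)
    return out

def dominates(a, b):
    # a dominates b (toward the low end of the scale) if every cumulative
    # proportion of a is at least as large as the corresponding one of b.
    return all(x >= y for x, y in zip(cumulative(a), cumulative(b)))

for label, groups in scenarios.items():
    m, c = groups["Marines"], groups["College"]
    print(label, round(mean_score(m), 2), round(mean_score(c), 2),
          dominates(m, c) or dominates(c, m))
```

In Scenario II, the college students' mean (3.62) exceeds the Marines' mean (2.75), just as the students' mean exceeds it in Scenario I, yet neither group's cumulative proportions dominate the other's in Scenario II. That is exactly the situation in which a comparison of central tendencies is not meaningful.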
Table 1. Ratings Distributions for Hypothetical Sadness Example.

                        Never   Rarely   Sometimes   Often   Always
Scenario I
  Marines                  30       25          20      15       10
  College Students         20       20          20      20       20
Scenario II
  Marines                  40       15           5      10       30
  College Students          5       12          19      44       20

Note. Question: How often do you feel sad?

Table 2. Cumulative Proportions for Hypothetical Sadness Example.

                        Never   Rarely   Sometimes   Often   Always     N
Scenario I
  Marines                0.30     0.55        0.75    0.90     1.00   100
  College Students       0.20     0.40        0.60    0.80     1.00   100
Scenario II
  Marines                0.40     0.55        0.60    0.70     1.00   100
  College Students       0.05     0.17        0.36    0.80     1.00   100

Note. Question: How often do you feel sad?

Bayesian Models for Ordinal-Scale Data

So far, we have focused on cumulative proportions, which are sample-level data. As researchers, however, we are typically interested in the underlying population-level probabilities, that is, the behavior of these proportions in the large-sample limit. To assess whether stochastic dominance holds in the population, we need a hypothesis test suitable for ordinal data.

Tests of stochastic dominance that assume continuous data (such as the Kolmogorov-Smirnov test) are not appropriate for Likert data. As an extension of one of these tests, Yalonetzky (2013) developed a method for testing stochastic dominance with ordinal data. The test is based on the asymptotic approximation of the multinomial distribution by a multivariate normal distribution. Klugkist et al. (2010) developed a Bayesian hypothesis-testing procedure for inequality/equality-constrained hypotheses for contingency tables. This nonparametric approach is very general and allows the analyst to test certain expected orderings of cell probabilities. Thus, the method could be used to test a certain ordering of response probabilities implied by stochastic dominance in Likert data. Heck and Davis-Stober (2019) discuss a similar approach for testing order constraints, including stochastic dominance, in multinomial models (see also Sarafoglou et al., 2021).

We suggest a related approach to assessing stochastic dominance with Likert data. Our main goal is to provide four models that encode a series of nested nonparametric and parametric constraints. While the aforementioned methods could also be used to encode and test nonparametric constraints, the approach that we propose makes it straightforward to specify and test both nonparametric and parametric constraints.

Under the most constrained of the four models, distributions across the two conditions are identical. At the next most constrained level, the distributions differ, but this difference is captured in a (semi-)parametric model that underlies ordinal-regression (also referred to as ordered-probit or cumulative) model settings (Bürkner & Vuorre, 2019; Liddell & Kruschke, 2018; McKelvey & Zavoina, 1975; Winship & Mare, 1984). In the third model, the semi-parametric form is further relaxed, leaving a model that has only a nonparametric stochastic-dominance constraint. And finally, even this constraint is relaxed, allowing for more complex, non-ordinal relationships. By comparing the strength of evidence from data for these four models, researchers can make insightful, meaningful comparisons across conditions.
Ordinal-Regression Setup

It is convenient to start with the well-known ordinal-regression approach (McKelvey & Zavoina, 1975; Winship & Mare, 1984). Here, the observed variable (i.e., the choice of a response category) results from the categorization of an underlying continuous variable. Consider a hypothetical survey study where respondents are asked to rate a statement on a 5-point scale ranging from "Strongly Disagree" to "Strongly Agree". The model posits that agreement with this statement can be represented as a continuous, latent variable. This latent variable maps onto rating categories by partitioning the latent space into regions. These regions are defined by thresholds, and the probability of a response falling into a certain category is simply the area under the latent probability distribution between the respective thresholds (Winship & Mare, 1984). The model setup is illustrated in Figure 1. Note that this setup is conceptually equivalent to that underlying signal-detection theory.

Figure 1. Ordinal-Regression Model

The latent variable is typically assumed to be normally distributed, although the model may be based on other probability distributions (e.g., a logistic function; Bürkner & Vuorre, 2019). The upper panel of Figure 1 shows a latent variable that is partitioned into five regions by four thresholds (represented by the vertical lines). Whenever the latent value exceeds a threshold, the observed response is the associated category (lower panel). Thus, the probability of a latent value falling into a certain region corresponds to the probability of observing the associated response. For more details, we refer the reader to an accessible tutorial by Bürkner and Vuorre (2019), who provide an extensive overview of this and related models for the analysis of Likert items.

In the usual ordinal-regression approach, the thresholds are fixed across conditions, and differences in distributions are captured by shifting the central tendency of the latent distribution. This usual approach may be considered semi-parametric as there is no model on thresholds but a parametric model on the effect of conditions. We are going to start with a fully nonparametric model that is an unconstrained generalization of the ordinal-regression approach, and then add in increasing degrees of constraint.

We start by setting the latent distribution for both conditions to a standard normal. The free parameters in this setup are the category thresholds. Let γ_jk denote the threshold between response categories j and j + 1 in condition k. For the setup to be valid, thresholds within each condition have to order, that is, γ_1k < γ_2k < ... < γ_(K−1)k for K response options. Although it may appear that the choice of identical standard normals is assumptive, in this setup with free threshold parameters, it is not. The latent distribution serves merely as a technical device that maps observed response frequencies onto regions on the real line. Importantly, all observed Likert distributions across conditions may be accounted for by appropriate settings of the thresholds. Thus, at this point, the model is unconstrained, nonparametric, and vacuous; there are as many parameters as degrees of freedom in the data.

To add constraint, it is useful to reparameterize the thresholds as follows:

  γ_j1 = μ_j + δ_j/2,   γ_j2 = μ_j − δ_j/2,

where μ_j = (γ_j1 + γ_j2)/2 and δ_j = γ_j1 − γ_j2. Here, μ_j is the average for the jth threshold, and δ_j is the difference for the jth threshold. The key feature of this parameterization is that δ_j denotes a comparison of distributions for the jth threshold. Thus, by placing constraints on the δ_j, we can model different types of (ordinal) relationships between the two response distributions.
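The mapping between thresholds and response probabilities is easy to make concrete. In the Python sketch below (our illustration; the paper's own software is written in R), category probabilities are areas under the standard normal between adjacent thresholds, and thresholds can conversely be read off observed cumulative proportions through the inverse normal CDF:

```python
from statistics import NormalDist

Z = NormalDist()  # the latent standard normal

def category_probs(thresholds):
    """Category probabilities implied by ordered thresholds on a standard normal."""
    cuts = [float("-inf")] + list(thresholds) + [float("inf")]
    return [Z.cdf(b) - Z.cdf(a) for a, b in zip(cuts, cuts[1:])]

def thresholds_from_props(props):
    """Recover thresholds from category proportions: gamma_j = Phi^{-1}(F_j)."""
    out, total = [], 0.0
    for p in props[:-1]:          # the last cumulative is 1 and has no threshold
        total += p
        out.append(Z.inv_cdf(total))
    return out

# College students, Scenario I: 20% of responses in each of the five categories.
gammas = thresholds_from_props([0.2] * 5)
probs = category_probs(gammas)
print([round(g, 2) for g in gammas])   # -> [-0.84, -0.25, 0.25, 0.84]
print([round(p, 2) for p in probs])    # -> [0.2, 0.2, 0.2, 0.2, 0.2]
```

The round trip illustrates why the standard normal is only a technical device: any observed response distribution can be reproduced exactly by an appropriate setting of the four free thresholds.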
Models

We specify four statistical models on (δ_1, ..., δ_(K−1)), each representing a different constraint on the relationship between conditions. The models are shown in Figure 2, illustrating the construction in the context of our hypothetical sadness example (Table 1).

Unconstrained Model: The first row shows a model that imposes no order constraints on the relationship between conditions. So long as the thresholds order within a condition (which is imposed by the likelihood function), there is no restriction on the values and relative order of thresholds across conditions. We denote this model as M_u. There are 2(K − 1) free parameters in this model: K − 1 mean-threshold parameters μ_j and equally many difference parameters δ_j. The unconstrained model can account for any type of relationship between conditions, including complex relationships where response distributions differ in a way that cannot be captured by an order relationship.
Dominance Model: The second row shows the dominance model, M_d. For this model, there are again a total of 2(K − 1) free parameters. To capture the notion of stochastic dominance, however, we impose an order constraint in this model: all δ_j ≥ 0 (or, alternatively, all δ_j ≤ 0). This constraint implies that thresholds are at least as large—and hence, so are cumulative probabilities—in one condition as in the other one. For the example in Figure 2, the threshold that separates Rarely from Never for college students is denoted γ_12, and the corresponding value for Marines is denoted γ_11. There, γ_11 > γ_12: Marines have a higher probability of being never sad than college students. Importantly, this inequality holds for all corresponding thresholds, that is, because δ_j ≥ 0 for all thresholds, it follows that γ_j1 ≥ γ_j2 for all threshold pairs.

There are two possible dominance conditions: one in which all δ_j ≥ 0 and one in which all δ_j ≤ 0. Whether one or the other or both should be used is a specification decision that researchers should make ahead of time depending on context. We will discuss how these decisions may be made subsequently.

Constant Shift Model: The next row describes a very simple effect where the thresholds in one condition all shift by the same amount compared to the other condition. The model is denoted by M_s and imposes the parametric constraint δ_j = δ for all j. We include this model because it corresponds to the classical probit-regression model presented above (see Figure 1). In the probit-regression model, the shifts are in the mean of the normal, but this is mathematically equivalent to fixing the mean and shifting all the thresholds by a constant amount. Unlike the other models we propose in our framework that are nonparametric, the constant shift model imposes a parametric constraint on the latent threshold parameters, that is, constancy is made with respect to the normal distribution. Thus, for this model, the choice of identical latent distributions is indeed a substantive statement about the data. Of note, even though constancy reflects the choice of latent distribution, dominance does not. If thresholds order between conditions for one latent distribution, they must order for all other latent distributions.

In Figure 2, the value of the threshold between Never and Rarely for Marines is −0.54, and this value is 0.30 greater than the bound between Never and Rarely for college students. This difference is preserved across corresponding thresholds. For example, the thresholds between Rarely and Sometimes are 0.05 and −0.25 for Marines and college students, respectively. The difference, 0.30, is the same as between Never and Rarely. The constant shift model explicitly states that the effect of condition on the ratings can be captured by a single parameter δ. It is comprised of K free parameters (i.e., K − 1 mean thresholds and one difference δ). In our view, the constant-shift model is useful for cases where the effect of condition is relatively straightforward and can be captured by a shift in central tendency.

Null Model: The last row depicts the null model, which posits that there is no effect of condition. This model is denoted M_0 and imposes the constraint δ_j = 0 for all j. Thus, the corresponding thresholds for college students and Marines are identical in this model. For example, the value of the threshold between Never and Rarely in Figure 2 is the same for college students as for Marines. Because all the corresponding thresholds are the same in value, the distributions are the same as well. There is no difference among the conditions; hence, there is no effect. The null model has one free parameter for each threshold, that is, K − 1 parameters in total.

Figure 2. Illustration of Statistical Models
Note. The latent distribution is fixed as a standard normal and latent thresholds are free parameters. The four models, depicted across the rows, capture different types of relationships between conditions (college students vs. Marines).
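To see the reparameterization in action, the sketch below (ours, for illustration) converts the Scenario I counts from Table 1 into empirical thresholds for each condition via the inverse normal CDF and then forms the mean and difference parameters. The sample estimates of δ_j all share one sign, consistent with the dominance model, but they are not exactly constant, so the point estimates alone do not pin down an exact constant shift:

```python
from statistics import NormalDist

inv = NormalDist().inv_cdf

def thresholds(counts):
    """Empirical thresholds gamma_j = Phi^{-1}(cumulative proportion)."""
    n, total, out = sum(counts), 0, []
    for c in counts[:-1]:
        total += c
        out.append(inv(total / n))
    return out

# Scenario I from Table 1: condition 1 = Marines, condition 2 = college students.
g1 = thresholds([30, 25, 20, 15, 10])
g2 = thresholds([20, 20, 20, 20, 20])

mu    = [(a + b) / 2 for a, b in zip(g1, g2)]   # mean thresholds mu_j
delta = [a - b for a, b in zip(g1, g2)]         # difference thresholds delta_j

print([round(d, 2) for d in delta])   # -> [0.32, 0.38, 0.42, 0.44]
print(all(d > 0 for d in delta))      # -> True: consistent with dominance
```

All four sample deltas are positive (the Marines' cumulative proportions exceed the students' at every category), none are zero (so the null constraint fails in-sample), and they are close to, but not exactly, a common value; the model comparison developed next weighs these possibilities formally.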
Priors on Parameters

Our approach is Bayesian, and in Bayesian analysis priors are needed on parameters. All four models considered here comprise parameters for the mean thresholds μ_j, so the priors for these parameters should be identical across models. A typical choice for these priors are independent normal distributions (e.g., Bürkner & Vuorre, 2019; Liddell & Kruschke, 2018):

  μ_j ~ Normal(0, σ_μ²),

where σ_μ is a prior standard deviation setting that must be chosen before analysis.

In contrast to the priors on μ_j, the priors on the difference parameters δ_j reflect the substantively motivated constraints under the four models. As for the mean thresholds, we propose a flexible normal distribution as a basis for these priors. Under M_u, we specify independent normal distributions for each δ_j:

  δ_j ~ Normal(0, σ_δ²),

where σ_δ is again specified before analysis. Under M_d, truncated normal distributions are placed on δ_j to impose the notion of stochastic dominance:

  δ_j ~ Normal⁺(0, σ_δ²)   or   δ_j ~ Normal⁻(0, σ_δ²),

where Normal⁺ and Normal⁻ denote a normal distribution with either a lower or an upper bound at 0, respectively. Under M_s, there is just one difference parameter, and thus δ ~ Normal(0, σ_δ²). Finally, no prior on δ is needed under M_0, as the difference in thresholds between conditions is constrained to be 0.

Before analysis, researchers can adjust the prior parameters σ_μ and σ_δ as needed. Thus, the normal prior setting offers the flexibility to provide substantive context through the choice—and range—of these prior parameters. Here is some guidance for setting σ_μ and σ_δ in practice: Since thresholds are placed on a standard normal, reasonable values of σ_μ should be around 1.0. Figure 3 shows the marginal prior distribution on mean category probabilities across conditions for 5 rating options and for select values of σ_μ. For the middle panel, the marginal priors have the same distribution, centered around .2, for each of the five rating options. Small values of σ_μ correspond to a belief that extremes are used excessively at the expense of the middle category (left panel); a large value of σ_μ corresponds to a belief that extremes are used rarely (right panel). The setting σ_μ = 1 is a good, weakly informative default, and it is hard to imagine reasonable settings much smaller than this or larger than 3.

Figure 3. Marginal Prior Distributions on Average Category Probabilities
Note. σ_μ = prior standard deviation setting on μ_j; five rating options.

The prior standard deviation on the difference parameters, in contrast, should typically be much smaller than that on the mean thresholds. As for any difference parameter, however, the exact choice depends on the analyst's expectation about how strongly the distributions may differ from each other. Thus, this choice should be determined by substantive, rather than statistical, arguments. For our purposes, we choose a relatively small prior standard deviation σ_δ. We address the consequences of this choice and how it affects model comparison results subsequently.

Data Visualization

The four models correspond to the following helpful data visualizations. Much like in signal-detection analysis, the running cumulative proportions become the target for plotting. Table 2 shows the cumulatives for the two hypothetical sadness scenarios. The usual approach is to plot receiver operating characteristic curves (ROCs), and an example for Scenario I is shown in Figure 4A. The levels of constraint are as follows: If the null model holds, the ROC curve traces the diagonal. If the shift model holds, then the resulting curve is the stereotypical one (Figure 4A) that is common in memory and perception research. The dominance model implies that the points all lie on one or the other side of the diagonal. The unconstrained model implies only that the points increase on the x and y axes, respectively (Figure 4C). For analyzing real-world contrasts, it is advantageous to plot the differences across the conditions as in Figures 4B and D. The advantage here is that it is easier to spot trends because the axis may be scaled for differences rather than the entire range from 0 to 1. The constraints now center around the horizontal zero line. The null model corresponds to this line; the shift and dominance models correspond to curves strictly on one side of it; the unconstrained model has no such constraint. Figures 4C and 4D show the ROC and the difference plot for the data in Scenario II.

Bayes Factors

We can measure the strength of evidence from the data for the four models using Bayes factors (Jeffreys, 1961), which are a measure of how well each model predicted the data before they were observed (Rouder & Morey, 2018). Readers who are new to Bayes factors are invited to consider one of the many tutorials on their use; perhaps one of the most helpful resources is the 2018 Psychonomic Bulletin & Review special issue on Bayesian inference (Vandekerckhove et al., 2018).

There are many approaches to computing Bayes factors. For the models developed here, we use two different approaches, as follows. Some models differ in dimensionality: for K response options, there are 2(K − 1) parameters in the unconstrained model, K parameters in the shift model, and K − 1 parameters in the null model. Where the models differ by a relatively small number of parameters, we find that the bridge sampling approach proposed by Meng and Wong (1996) works well. Gronau et al. (2017) provide a detailed and accessible tutorial on computing Bayes factors with bridge sampling. The approach has been implemented in an R package by Gronau et al. (2020), which we use in our work as well.

We follow a different approach to compare models that have the same number of parameters, namely, the unconstrained and dominance models. The dominance model is more constrained by virtue of the inequalities. Thus, although the models have the same dimensionality, the parameter space for the dominance model is smaller than that for the unconstrained model. In fact, the unconstrained model encompasses the dominance model (Heck & Davis-Stober, 2019; Klugkist et al., 2010). When models are encompassed, the Bayes factor may be computed by considering the posterior and prior probabilities of the constraint under the unconstrained model (Gelfand et al., 1992). The resulting Bayes factor between the dominance and the unconstrained model is

  BF_du = Pr(all δ_j ≥ 0 | data, M_u) / Pr(all δ_j ≥ 0 | M_u).

The first step is calculating the denominator, that is, the prior probability that one distribution dominates another. This calculation may be done by Monte-Carlo simulation from the priors on the collections of μ_j and δ_j under the unconstrained model. The next step is calculating the numerator.
In practice, the computation is surprisingly uncomplicated. We follow the approach discussed in Haaf and Rouder (2017), which is based on the pioneering work of Klugkist et al. (2005). One simply counts the relative frequency of posterior samples under M_u that satisfy the dominance constraint (see Sarafoglou et al., 2021, for an alternative, efficient routine for calculating Bayes factors for order constraints using bridge sampling). Note that Bayes factors calculated with the encompassing-prior approach are bounded by the reciprocal of the prior probability of the constraint under the unconstrained model. Thus, even if there is unequivocal evidence that the dominance constraint holds, the Bayes factor may be no larger than this bound.

As outlined before, there are two dominance conditions, because either distribution could possibly dominate the other. A test of stochastic dominance can be two-sided if there is no prediction about which distribution dominates the other. In this two-sided case, the prior probability of stochastic dominance is twice that of a directed test, that is, of a test where a researcher a priori predicts that one distribution dominates the other and not the reverse. For a directed test, the posterior probability is estimated as the relative frequency of posterior samples in the predicted direction only. If stochastic dominance is observed in this predicted direction, the corresponding Bayes factor will yield stronger evidence than in the two-sided case. Thus, if theoretical considerations indicate a dominance relation in a specific direction, the Bayes factor should be calculated accordingly.

We do not recommend that researchers compare both stochastic dominance models, with one in each direction. This recommendation is a matter of judgment. The motivation is that model comparison and testing should occur when researchers have good reason to suspect an effect in a theoretically meaningful direction. When researchers have no such reasons, exploratory approaches may be more appropriate than model comparison.

Software for Computing Bayes Factors

We created a user-friendly R web applet for analysis. The user inputs the frequency counts in two conditions, such as in Table 1. The outputs are Bayes factors for the four models. Additional prior inputs, such as the standard deviations σ_μ and σ_δ, may be provided as well. The web applet is available at https://martinschnuerch.shinyapps.io/likertBF/; the underlying source code as well as a set of useful R functions are available at https://github.com/mschnuerch/likertBF.

We illustrate this applet with the example data about sadness in Marines and college students, Scenario I. A screenshot of the applet while analyzing the data is shown in Figure 5. Once the data are inputted, we may press "Plot Data," and under "Data Visualization," we may see the diagnostic plots that are shown in Figure 4. Then, to compute Bayes factors, we may press "Start Analysis," and after some time for sampling, the Bayes factors are returned.
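Behind the applet's dominance output is the counting approach described above, which itself takes only a few lines. The mock-up below (ours, not the applet's R code) contrasts prior draws with stand-in "posterior" draws that strongly favor dominance; in a real analysis, the posterior samples would come from MCMC under the unconstrained model:

```python
import random

random.seed(2)

def proportion_satisfying(samples):
    """Relative frequency of draws in which every delta is positive."""
    return sum(all(d > 0 for d in s) for s in samples) / len(samples)

# Prior draws: four independent Normal(0, 0.5) deltas under the unconstrained model.
prior = [[random.gauss(0.0, 0.5) for _ in range(4)] for _ in range(50_000)]

# Stand-in "posterior" draws, for illustration only; real ones come from MCMC.
post = [[random.gauss(0.35, 0.10) for _ in range(4)] for _ in range(50_000)]

bf_du = proportion_satisfying(post) / proportion_satisfying(prior)
print(round(bf_du, 1))   # near the encompassing bound, 1 / Pr(constraint) = 16
```

When nearly all posterior mass satisfies the constraint, as here, the estimated Bayes factor approaches its encompassing-prior bound, which is why strong evidence for dominance is capped at the reciprocal of the constraint's prior probability.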
We is, where a researcher a priori predicts that one distribution may even choose which dominance model we wish by se- dominates the other and not the reverse. The posterior lecting the respective output option. Let’s say a priori we probability is estimated as the relative frequency of poste- may have thought college students would be more often rior samples in the predicted direction only. If stochastic happy. Because we entered the Marines under Condition 1 dominance is observed in this predicted direction, the cor- and the scale ranges from “never sad” to “always sad”, we responding Bayes factor will yield stronger evidence than in specify the one-sided dominance model as “2 > 1”. The re- the two-sided case. Thus, if theoretical considerations indi- sults, shown in the center panel, clearly indicate that the constant shift model is preferred. Finally, by clicking “Plot Collabra: Psychology 7 Meaningful Comparisons With Ordinal-Scale Items Figure 4. ROC and Difference Plots for Hypothetical Sadness Example Note. See Table 2. A, B = Scenario I; C, D = Scenario II. MCMC”, we can visually inspect MCMC samples from the The second example comes from the Pew Research Cen- unconstrained model for and ter’s Election News Pathways Project (Pew Research Center, 2020). Over respondents were surveyed about their Applications perception of the Covid-19 pandemic in late March, 2020. We contrast two questions: In one, participants were asked In this section, we provide two real-world examples of to rate how well US President Trump was responding to these fine-grained analyses. The first example comes from the pandemic; in the other, they were asked to rate how Collingwood et al. (2018) who asked respondents their well their respective state leaders were responding to the opinions about controversial policies of the US adminis- pandemic. 
The observed proportions and sample sizes are tration under former president Donald Trump, including shown in the panel labeled All in Table 4. the ban on immigration from select Islamic nations and Collingwood et al. (2018) claimed that the Muslim immi- the continuation of the Keystone pipeline project. Colling- gration ban became more popular after it was implemented. wood et al. (2018) conducted two survey waves: one when We use the four models to assess whether there really was the policy was proposed and the other during implementa- an effect, and if so, whether it may be captured with an tion. The observed proportions and sample sizes are shown order relationship as implied by the dominance and shift in Table 3. models. Figure 6, top left, shows the difference in cumu- 1 The data set is publicly available from https://github.com/PerceptionAndCognitionLab/bf-likert 2 The data set is freely available upon registration from https://www.pewresearch.org/politics/dataset/american-trends-panel-wave-64/ Collabra: Psychology 8 Meaningful Comparisons With Ordinal-Scale Items Figure 5. Screenshot of the Accompanying Web Applet Note. The analyzed data shown in the screenshot correspond to the hypothetical Scenario I in Table 1. Collabra: Psychology 9 Meaningful Comparisons With Ordinal-Scale Items Table 3. Ratings Distributions from Collingwood et al. (2018). Observed Proportions Strongly Disagree Disagree Neutral Agree Strongly Agree Immigration Ban First Wave 0.30 0.14 0.14 0.14 0.29 411 Second Wave 0.40 0.11 0.09 0.16 0.23 311 Keystone Pipeline First Wave 0.42 0.08 0.18 0.13 0.18 409 Second Wave 0.39 0.14 0.12 0.14 0.2 311 Note. 1. Agreement with President Trump’s executive order restricting immigration from Syria, Iran, Iraq, Libya, Yemen, Somalia, and Sudan. 2. Agreement with President Trump’s executive order allowing for the Keystone and Dakota Access Pipelines. Table 4. Ratings Distributions from the Election News Pathway Project. 
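The Bayes-factor computation described above—estimate the prior probability of dominance by Monte Carlo and the posterior probability by counting samples that satisfy the constraint—can be sketched compactly. The following Python sketch is not the authors' R implementation: a flat Dirichlet prior stands in for the priors described earlier, and the counts are hypothetical. It also computes the running cumulative proportions that underlie the difference plots.

```python
import numpy as np

rng = np.random.default_rng(2022)

# Hypothetical frequency counts on a 5-point item in two conditions.
y1 = np.array([30, 14, 14, 14, 28])   # condition 1
y2 = np.array([12, 10, 18, 25, 35])   # condition 2, shifted toward agreement

# Difference plot: running cumulative proportions, condition 1 minus condition 2.
cum1 = np.cumsum(y1 / y1.sum())[:-1]
cum2 = np.cumsum(y2 / y2.sum())[:-1]
diff = cum1 - cum2   # no zero crossing here, so dominance is plausible

def p_dominates(a1, a2, n=200_000):
    """Monte Carlo probability that the condition-2 distribution
    stochastically dominates condition 1 (its running cumulative
    proportions are never larger), with Dirichlet probability vectors."""
    c1 = np.cumsum(rng.dirichlet(a1, n), axis=1)[:, :-1]
    c2 = np.cumsum(rng.dirichlet(a2, n), axis=1)[:, :-1]
    return np.mean(np.all(c2 <= c1, axis=1))

flat = np.ones(5)                            # stand-in flat Dirichlet prior
prior_p = p_dominates(flat, flat)            # denominator: prior probability
post_p = p_dominates(flat + y1, flat + y2)   # numerator: posterior probability
bf_dominance_vs_unconstrained = post_p / prior_p   # bounded above by 1 / prior_p
```

Note how the encompassing-prior bound appears automatically: because the posterior probability cannot exceed 1, the Bayes factor can be no larger than the reciprocal of the prior probability of the constraint.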
Table 4. Ratings Distributions from the Election News Pathway Project

                       Observed Proportions
                       Excellent  Good  Fair  Poor      N
  All
    Trump                0.24     0.25  0.19  0.32   11491
    State Officials      0.21     0.49  0.22  0.08   11432
  Democrats
    Trump                0.04     0.14  0.26  0.56    5937
    State Officials      0.21     0.48  0.23  0.07    5914
  Republicans
    Trump                0.47     0.36  0.11  0.06    5101
    State Officials      0.21     0.52  0.20  0.08    5076

Note. How would you rate the job each of the following is doing responding to the coronavirus outbreak? A. Donald Trump. B. Your elected state officials.

As can be seen in Figure 6 (top left), the curve does not cross the zero line, indicating the plausibility of stochastic dominance. The Bayes factors for the four models are shown in Table 5. As expected, the winning model is the one-sided dominance model, followed by the shift model. Hence, we conclude that there is evidence for an effect. The effect is simple and can be reduced to an order relationship. The same analysis may be applied to the question about the Keystone pipeline. For these data, the null has a Bayes factor of at least 2.5-to-1 against any competitor, indicating anecdotal evidence for a lack of an effect of wave on the ratings distribution.

Perhaps the most interesting data are those about leadership in the Covid-19 pandemic. Here, we have strong evidence for an indominant effect. The unconstrained model is preferred by several hundred orders of magnitude to any competitor. Donald Trump seems to be a polarizing figure compared to state leaders: People were more likely to give Donald Trump extreme ratings than state leaders. This polarization may be seen in the difference plot in Figure 6 (bottom right panel). Here, the curve crosses zero, and though the deflection may appear slight, it is highly evidential because the sample sizes are so large. Accordingly, it makes little sense to discuss whether Donald Trump is viewed as having responded better or worse than local leaders.

The complexity of the effect is easily resolved in this case by conditioning the data on political-party preference. Among those that are Republican, Donald Trump is judged quite well in responding to the crisis; among those that are Democratic, he is judged quite poorly. This partisan divide is not present among state leaders. Thus, when we condition responses on political-party preference, the dominance model in the expected direction wins.

Figure 6. Difference Plots for the Real-World Data in Tables 3 and 4
Note. There is anecdotal evidence for a constant shift in the top-left panel and for a lack of an effect in the top-right panel. In the middle-left panel, there is strong evidence for an indominant effect, while there is strong evidence for stochastic dominance in the remaining figures.

Table 5. Bayes Factors for Empirical Examples

                       Null  Shift  Dominance  Unconstrained
  Immigration Ban      0.21  0.76     1.00        0.13
  Keystone Pipeline    1.00  0.29     0.29        0.40
  Covid, All           0.00  0.00     0.00        1.00
  Covid, Democrats     0.00  0.00     1.00        0.15
  Covid, Republicans   0.00  0.00     1.00        0.14

Note. The winning model is assigned a value of 1.00. Bayes factors for all other models are relative to this winning model.

Sensitivity To Prior Settings

The Bayesian analysis presented here requires the analyst to set the prior standard deviations on mean bounds and effects. Such requirements have given some researchers pause in adopting Bayesian methods. It seems reasonable as a starting point to require that if two researchers run the same experiment and obtain the same data, they should reach similar, if not the same, conclusions. To harmonize Bayesian inference with this starting point, some analysts actively seek to minimize these effects by choosing likelihoods, prior parametric forms, and heuristic methods of inference so that variation in prior settings has minimal influence (Aitkin, 1991; Gelman et al., 2004; Kruschke, 2012; Spiegelhalter et al., 2002).

We reject the starting point above, including the view that minimization of prior effects is necessary. The choice of prior settings is important because it affects the models' predictions about data. Therefore, these settings necessarily affect model comparison. Whatever this effect, it is the degree resulting from the usage of Bayes' rule, which in turn mandates that evidence for competing models is the degree to which they improve predictive accuracy.

When different researchers use different priors, they may arrive at different opinions about the data. This variation is not problematic, however, so long as the various prior settings are justifiable: The variation in results reflects the legitimate diversity of opinion (Rouder et al., 2016). When different reasonable prior settings suggest conflicting conclusions, the data simply do not afford the precision to arrive at a clear verdict between the positions.

With this argument as context, we may assess whether reasonable variation in prior settings affects Bayes factor conclusions among the models. In Figure 3, we show that our setting is a good default choice for the prior on the bounds, and this choice may be made without undue influence on model comparison results. The prior choice on the difference parameters is more consequential. For the previous analyses we used a default setting; for this setting, we consider a range from 1/2 the original setting to 2 times the original setting to be reasonable.

Table 6. Bayes Factors for Modified Election News Pathways Project Data

                       Null  Shift  Dominance  Unconstrained
  Covid, All           0.00  0.00     0.13        1.00
  Covid, Republicans   0.00  0.00     1.00        0.23

Note. The winning model is assigned a value of 1.00. Bayes factors for all other models are relative to this winning model.
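The normalization used in Tables 5 and 6, where the winning model is assigned 1.00 and all other models are expressed relative to it, is a simple rescaling. A small sketch with hypothetical Bayes factors (each computed against the same reference model; these are not the values in the tables):

```python
def normalize_to_winner(bfs):
    """Rescale Bayes factors, all computed against a common reference
    model, so the best model gets 1.00 and the rest are relative to it."""
    best = max(bfs.values())
    return {model: bf / best for model, bf in bfs.items()}

# Hypothetical Bayes factors of each model against the unconstrained model.
bfs = {"null": 0.02, "shift": 3.5, "dominance": 4.6, "unconstrained": 1.0}
relative = normalize_to_winner(bfs)   # dominance -> 1.00, shift -> ~0.76
```

Because Bayes factors are transitive, dividing by the winner's value leaves every pairwise comparison intact while making the table easy to scan.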
Values below this range place excessive weight on extremely small differences between conditions, while values above it place excessive weight on overwhelmingly large differences.

To see how variation in this prior setting affects the Bayes factors, we use a modified version of the Election News Pathways Project data. Unfortunately, the sample size is too large to be typical of psychological data. A more typical set would have fewer observations, and so for the purposes here we took the frequencies in Table 4 and divided them by 10. We used the complete data set with both Republicans and Democrats because here we found strong evidence for an indominant effect. Along with these data, we used the subset of Republicans, as these showed evidence for a simpler structure, namely, stochastic dominance.

The Bayes factors for the modified data set with the same prior setting as in the previous analysis are shown in Table 6. Without considering political-party preference, the unconstrained model is still preferred over the others. The closest competitor is the dominance model, and the corresponding Bayes factor is approximately 8-to-1. That this value is more moderate than that in Table 5 reflects the reduced sample size. Among Republicans, the dominance model is again preferred over the unconstrained model, by a factor of approximately 4-to-1. The question is whether these two values depend heavily on the range of prior settings.

The dependence is shown in Figure 7. Here, Bayes factors of all models against the preferred model within three orders of magnitude are displayed. Although the exact figures vary slightly, there is no consequential dependence of Bayes factors across the reasonable range of prior settings. Both for the complete set (left panel) and the subset (right panel), the winning model is preferred over its nearest competitor by a relatively constant amount. Hence, the Bayes factor method provides evidence that is fairly robust to reasonable variation in prior expectations about data.

Figure 7. Dependence of Bayes Factors on Prior Settings
Note. Sensitivity analyses were performed on 1/10th of the Election News Pathways Project data. Only models with Bayes factors within three orders of magnitude (10^-3) against the preferred model are shown.

Conclusion

Although the use of Likert items is exceedingly popular, we argue here that researchers have overlooked a defining primitive in analysis (Townsend, 1990). Instead of debating the use of parametric vs. nonparametric statistics, we should assess whether or not two response distributions can be meaningfully compared by means of their central tendencies. If there is no order relationship at the level of distributions (i.e., no stochastic dominance), common parametric and even nonparametric tests of differences miss the underlying structure and may mislead the analyst (Clason & Dormody, 1994).

The statistical models developed herein allow for a more fine-grained analysis of Likert and other ordinal-scale items. The null, constant-shift, dominance, and unconstrained models provide for a rich description of possible structure in the relationship between two distributions, and the strength of evidence from data for them may be stated via Bayes factors. The models as well as the Bayes factor comparisons are straightforward and computationally convenient. We demonstrated their usefulness with two real-world examples and created an easily accessible, user-friendly web applet for researchers.

Although we think that researchers will benefit from the development presented herein, there are also limitations: 1. The concept of the threshold here is not psychological and should not be interpreted as such. In this framework, thresholds describe the proportion of people that endorse particular responses. They do not describe the internal process by which people respond to Likert items. Likewise, the models do not address whether people use the same processes or the same response styles. In this regard, the model is a statistical account for addressing constraints at the population level. 2. Although the unconstrained, dominance, and null models are nonparametric, the constant-shift model, which we suspect will be a simple, parsimonious account of condition effects, is parametric. Whether shifts are constant or not depends on the distributional form, and, here, the choice of identical normal distributions for all respondents is a substantive assumption. 3. So far, the development only applies to the comparison of two independent distributions. Of course, psychologists are often interested in more complicated designs. For example, the data from Collingwood et al. (2018) are panel data in which the same people answered in both waves. We do not take into account any shared variation from the panel design. 4. Finally, analysis is not always run for a single item across just two levels of a covariate. It is more typical to use multiple items to construct latent Likert scales. And in this case, questions about a shift or stochastic dominance in the data should be addressed at the scale level.

It is one of the strengths of the proposed analysis framework that it affords the flexibility to incorporate other types of constraints and data structures. In this paper, we focused on the common case of comparing two independent response distributions on a single Likert item. However, future efforts may be devoted to extending our approach to formulate and test other types of constraints, across more than two conditions and with multiple items (i.e., Likert scales). At this point, our development constitutes only
a useful first step toward a more complete framework for the meaningful analysis of ordinal-scale items.

Contributions

The first author contributed to the theoretical development and implementation of the models and the bridge sampling routine, developed and implemented the web applet, acquired and analyzed the data, and drafted and revised the manuscript; the second author contributed to the theoretical development of the models and revised the manuscript; the third author contributed to the theoretical development of the models and revised the manuscript; the last author contributed to the theoretical development and implementation of the models and the bridge sampling routine, acquired and analyzed the data, and drafted and revised the manuscript.

Funding Information

The first and the last author were supported by a grant from the German Research Foundation to the research training group Statistical Modeling in Psychology (GRK 2277); the third author was supported by a Van der Gaag Fund, Royal Netherlands Academy of Arts & Sciences.

Data Accessibility Statement

The data used in the first real-world application example by Collingwood et al. (2018) are available from https://github.com/PerceptionAndCognitionLab/bf-likert. The data for the second example by Pew Research Center (2020) are freely available upon registration from https://www.pewresearch.org/politics/dataset/american-trends-panel-wave-64/. All code for analyses and figures is included in the R Markdown file of this manuscript. The Markdown file and supporting files are curated at https://github.com/PerceptionAndCognitionLab/bf-likert.

Competing Interests

The authors declare no competing interests. The second author is an associate editor at Collabra: Psychology.

Submitted: April 20, 2021 PDT; Accepted: August 30, 2022 PDT

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
View this license's legal deed at http://creativecommons.org/licenses/by/4.0 and legal code at http://creativecommons.org/licenses/by/4.0/legalcode for more information.

References

Abadie, A. (2002). Bootstrap Tests for Distributional Treatment Effects in Instrumental Variable Models. Journal of the American Statistical Association, 97(457), 284–292. https://doi.org/10.1198/016214502
Aitkin, M. (1991). Posterior Bayes Factors. Journal of the Royal Statistical Society: Series B (Methodological), 53(1), 111–128. https://doi.org/10.1111/j.2517-6161.1991.tb01812.x
Bürkner, P.-C., & Vuorre, M. (2019). Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science, 2(1), 77–101. https://doi.org/10.1177/2515245918823199
Clason, D., & Dormody, T. (1994). Analyzing Data Measured by Individual Likert-Type Items. Journal of Agricultural Education, 35(4). https://doi.org/10.5032/jae.1994.04031
Collingwood, L., Lajevardi, N., & Oskooii, K. A. R. (2018). A Change of Heart? Why Individual-Level Public Opinion Shifted Against Trump's "Muslim Ban." Political Behavior, 40(4), 1035–1072. https://doi.org/10.1007/s11109-017-9439-z
Gelfand, A. E., Smith, A. F. M., & Lee, T.-M. (1992). Bayesian Analysis of Constrained Parameter and Truncated Data Problems Using Gibbs Sampling. Journal of the American Statistical Association, 87(418), 523–532. https://doi.org/10.1080/01621459.1992.10475235
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Chapman and Hall.
Gronau, Q. F., Sarafoglou, A., Matzke, D., Ly, A., Boehm, U., Marsman, M., Leslie, D. S., Forster, J. J., Wagenmakers, E.-J., & Steingroever, H. (2017). A tutorial on bridge sampling. Journal of Mathematical Psychology, 81, 80–97. https://doi.org/10.1016/j.jmp.2017.09.005
Gronau, Q. F., Singmann, H., & Wagenmakers, E.-J. (2020). bridgesampling: An R Package for Estimating Normalizing Constants. Journal of Statistical Software, 92(10), 1–29. https://doi.org/10.18637/jss.v092.i10
Haaf, J. M., & Rouder, J. N. (2017). Developing Constraint in Bayesian Mixed Models. Psychological Methods, 22(4), 779–798. https://doi.org/10.1037/met
Heathcote, A., Brown, S., Wagenmakers, E. J., & Eidels, A. (2010). Distribution-Free Tests of Stochastic Dominance for Small Samples. Journal of Mathematical Psychology, 54(5), 454–463. https://doi.org/10.1016/j.jmp.2010.06.005
Heck, D. W., & Davis-Stober, C. P. (2019). Multinomial models with linear inequality constraints: Overview and improvements of computational methods for Bayesian inference. Journal of Mathematical Psychology, 91, 70–87. https://doi.org/10.1016/j.jmp.2019.03.004
Jamieson, S. (2004). Likert Scales: How to (Ab)Use Them. Medical Education, 38(12), 1217–1218. https://doi.org/10.1111/j.1365-2929.2004.02012.x
Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford University Press.
Klugkist, I., Kato, B., & Hoijtink, H. (2005). Bayesian model selection using encompassing priors. Statistica Neerlandica, 59(1), 57–69. https://doi.org/10.1111/j.1467-9574.2005.00279.x
Klugkist, I., Laudy, O., & Hoijtink, H. (2010). Bayesian Evaluation of Inequality and Equality Constrained Hypotheses for Contingency Tables. Psychological Methods, 15(3), 281–299. https://doi.org/10.1037/a0020137
Kruschke, J. K. (2012). Bayesian Estimation Supersedes the t Test. Journal of Experimental Psychology: General, 142, 573–603. https://doi.org/10.1037/e502412013-055
Kuzon, W. M. J., Urbanchek, M. G., & McCabe, S. (1996). The Seven Deadly Sins of Statistical Analysis. Annals of Plastic Surgery, 37(3), 265–272. https://doi.org/10.1097/00000637-199609000-00006
Levy, H. (1992). Stochastic Dominance and Expected Utility: Survey and Analysis. Management Science, 38(4), 555–593. https://doi.org/10.1287/mnsc.38.4.555
Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348. https://doi.org/10.1016/j.jesp.2018.08.0
Likert, R. (1932). A Technique for the Measurement of Attitudes. Archives of Psychology, 22(140), 5–55.
Madden, D. (2009). Mental stress in Ireland, 1994-2000: A stochastic dominance approach. Health Economics, 18(10), 1202–1217. https://doi.org/10.1002/hec.1425
McKelvey, R. D., & Zavoina, W. (1975). A Statistical Model for the Analysis of Ordinal Level Dependent Variables. The Journal of Mathematical Sociology, 4(1), 103–120. https://doi.org/10.1080/0022250x.1975.998
Meng, X., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6, 831–860.
Nanna, M. J., & Sawilowsky, S. S. (1998). Analysis of Likert scale data in disability and medical rehabilitation research. Psychological Methods, 3(1), 55–67. https://doi.org/10.1037/1082-989x.3.1.55
Norman, G. (2010). Likert scales, levels of measurement and the "laws" of statistics. Advances in Health Sciences Education, 15(5), 625–632. https://doi.org/10.1007/s10459-010-9222-y
Pew Research Center. (2020, March 26). Worries about coronavirus surge, as most Americans expect a recession – or worse. https://www.pewresearch.org/politics/2020/03/26/worries-about-coronavirus-surge-as-most-americans-expect-a-recession-or-worse/
Rouder, J. N., & Morey, R. D. (2018). Teaching Bayes' Theorem: Strength of Evidence as Predictive Accuracy. The American Statistician, 73(2), 186–190. https://doi.org/10.1080/00031305.2017.1341334
Rouder, J. N., Morey, R. D., & Wagenmakers, E.-J. (2016). The Interplay between Subjectivity, Statistical Practice, and Psychological Science. Collabra: Psychology, 2(1), 6. https://doi.org/10.1525/collabra.28
Sarafoglou, A., Haaf, J. M., Ly, A., Gronau, Q. F., Wagenmakers, E.-J., & Marsman, M. (2021). Evaluating multinomial order restrictions with bridge sampling. Psychological Methods. https://doi.org/10.1037/met0000411
Speckman, P. L., Rouder, J. N., Morey, R. D., & Pratte, M. S. (2008). Delta plots and coherent distribution ordering. The American Statistician, 62(3), 262–266. https://doi.org/10.1198/000313008x333493
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639. https://doi.org/10.1111/1467-9868.00353
Sullivan, G. M., & Artino, A. R. J. (2013). Analyzing and interpreting data from Likert-type scales. Journal of Graduate Medical Education, 5(4), 541–542. https://doi.org/10.4300/jgme-5-4-18
Townsend, J. T. (1990). Truth and consequences of ordinal differences in statistical distributions: Toward a theory of hierarchical inference. Psychological Bulletin, 108(3), 551–567. https://doi.org/10.1037/0033-2909.108.3.551
Tubeuf, S., & Perronnin, M. (2008). New prospects in the analysis of inequalities in health: A measurement of health encompassing several dimensions of health (p. 48) [Tech. rep.]. University of York.
Vandekerckhove, J., Rouder, J. N., & Kruschke, J. K. (2018). Editorial: Bayesian methods for advancing psychological science. Psychonomic Bulletin & Review, 25(1), 1–4. https://doi.org/10.3758/s13423-018-1443-8
Winship, C., & Mare, R. D. (1984). Regression models with ordinal variables. American Sociological Review, 49(4), 512. https://doi.org/10.2307/2095465
Yalonetzky, G. (2013). Stochastic dominance with ordinal variables: Conditions and a test. Econometric Reviews, 32(1), 126–163. https://doi.org/10.1080/07474938.2012.690653

Supplementary Materials

Response letter version 3: https://collabra.scholasticahq.com/article/38594-meaningful-comparisons-with-ordinal-scale-items/attachment/101003.pdf?auth_token=qavCoVfW8Qa0T9mzCEFa
Response letter version 2: https://collabra.scholasticahq.com/article/38594-meaningful-comparisons-with-ordinal-scale-items/attachment/101004.pdf?auth_token=qavCoVfW8Qa0T9mzCEFa
Peer Review History: https://collabra.scholasticahq.com/article/38594-meaningful-comparisons-with-ordinal-scale-items/attachment/101005.docx?auth_token=qavCoVfW8Qa0T9mzCEFa
Collabra Psychology – University of California Press
Published: Oct 24, 2022
Keywords: stochastic dominance; bayes factors; ordinal scales; meaningful comparisons; Likert items