The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study

Qual Life Res (2010) 19:539–549. DOI 10.1007/s11136-010-9606-8

Lidwine B. Mokkink · Caroline B. Terwee · Donald L. Patrick · Jordi Alonso · Paul W. Stratford · Dirk L. Knol · Lex M. Bouter · Henrica C. W. de Vet

Accepted: 2 February 2010 / Published online: 19 February 2010. © The Author(s) 2010. This article is published with open access at Springerlink.com.

Abstract

Background: The aim of the COSMIN study (COnsensus-based Standards for the selection of health status Measurement INstruments) was to develop a consensus-based checklist to evaluate the methodological quality of studies on measurement properties. We present the COSMIN checklist and the agreement of the panel on the items of the checklist.

Methods: A four-round Delphi study was performed with international experts (psychologists, epidemiologists, statisticians and clinicians). Of the 91 invited experts, 57 agreed to participate (63%). Panel members were asked to rate their (dis)agreement with each proposal on a five-point scale. Consensus was considered to be reached when at least 67% of the panel members indicated 'agree' or 'strongly agree'.

Results: Consensus was reached on the inclusion of the following measurement properties: internal consistency, reliability, measurement error, content validity (including face validity), construct validity (including structural validity, hypotheses testing and cross-cultural validity), criterion validity, responsiveness, and interpretability. The latter was not considered a measurement property. The panel also reached consensus on how these properties should be assessed.

Conclusions: The resulting COSMIN checklist could be useful when selecting a measurement instrument, peer-reviewing a manuscript, designing or reporting a study on measurement properties, or for educational purposes.

Author affiliations: L. B. Mokkink, C. B. Terwee, D. L. Knol, L. M. Bouter and H. C. W. de Vet: Department of Epidemiology and Biostatistics and the EMGO Institute for Health and Care Research, VU University Medical Center, Van der Boechorststraat 7, 1081 BT Amsterdam, The Netherlands (e-mail: w.mokkink@vumc.nl, cb.terwee@vumc.nl, d.knol@vumc.nl, lm.bouter@dienst.vu.nl, hcw.devet@vumc.nl; URL: www.emgo.nl, www.cosmin.nl). L. M. Bouter is also at the Executive Board of VU University Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands. D. L. Patrick: Department of Health Services, University of Washington, Thur Canal St Research Office, 146N Canal Suite 310, Seattle, WA 98103, USA (e-mail: donald@u.washington.edu). J. Alonso: Health Services Research Unit, Institut Municipal d'Investigacio Medica (IMIM-Hospital del Mar), Doctor Aiguader 88, 08003 Barcelona, Spain, and CIBER en Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain (e-mail: jalonso@imim.es). P. W. Stratford: School of Rehabilitation Science and Department of Clinical Epidemiology and Biostatistics, McMaster University, 1400 Main St. West, Hamilton, ON, Canada (e-mail: stratford@mcmaster.ca).

Keywords: Delphi technique · Outcome assessment · Psychometrics · Quality of life · Questionnaire
Introduction

Measurement of health outcomes is essential in scientific research and in clinical practice. Based on the scores obtained with measurement instruments, decisions are made about the application of subsequent diagnostic tests and treatments. Health status measurement instruments should therefore be reliable and valid; otherwise there is a serious risk of imprecise or biased results that might lead to wrong conclusions. Organisations such as the US Food and Drug Administration (FDA) and the European Medicines Agency (EMEA) require that measurement instruments be well validated for their purpose [1, 2]. The need for reliable and valid measurement instruments of health outcomes was clearly demonstrated by Marshall, who showed in schizophrenia trials that authors were more likely to report that treatment was superior to control when an unpublished measurement instrument, rather than a published instrument, was used in the comparison [3].

Before a health status measurement instrument can be used in research or clinical practice, its measurement properties, i.e. reliability, validity and responsiveness, should be assessed and considered adequate. Studies evaluating measurement properties should be of high methodological quality to guarantee appropriate conclusions about the measurement properties of an instrument. To evaluate the methodological quality of a study on measurement properties, standards are needed. Although many standards and criteria have been proposed, they have not been operationalised into user-friendly and easily applicable checklists (e.g. [4, 5]). Moreover, these standards do not pay attention to studies that apply Item Response Theory (IRT) models, or are not consensus based (e.g. [6, 7]). Such a checklist should contain a complete set of standards (i.e. design requirements and preferred statistical methods) and criteria of adequacy for what constitutes good measurement properties. Broad consensus is necessary in order to achieve wide acceptance of a checklist.

Research on measurement properties is particularly important for health outcomes that are directly reported by patients, i.e. health-related patient-reported outcomes (HR-PROs). A HR-PRO is a measurement of any aspect of a patient's health status that is directly assessed by the patient, i.e. without the interpretation of the patient's responses by a physician or anyone else [2]. Modes of data collection for HR-PRO instruments include interviewer-administered, self-administered, or computer-administered instruments [2]. Examples of HR-PROs are questionnaires assessing symptoms, functional status, and health-related quality of life. These are constructs which are not directly measurable. Because of the subjective nature of these constructs, it is very important to evaluate whether measurement instruments measure them in a valid and reliable way.

The COSMIN initiative (COnsensus-based Standards for the selection of health Measurement INstruments) aims to improve the selection of health measurement instruments. As part of this initiative, the aim of this study was to develop a checklist containing standards for evaluating the methodological quality of studies on measurement properties. The checklist was developed in a multidisciplinary, international collaboration with all relevant expertise involved. We performed a Delphi study to address two research questions:

1. Which measurement properties should be included in the checklist?
2. How should these measurement properties be evaluated in terms of study design and statistical analysis (i.e. standards)?

In this paper, we present the COSMIN checklist and describe the agreement of the panel concerning the items included in the checklist.

Methods

Focus of the COSMIN checklist

The COSMIN checklist is focused on evaluating the methodological quality of studies on measurement properties of HR-PROs. We chose to focus on HR-PROs because of the complexity of these instruments: they measure constructs that are both multidimensional and not directly measurable.

In addition, we focused on evaluative applications of HR-PRO instruments, i.e. longitudinal applications assessing treatment effects or changes in health over time. The specification of an evaluative application is necessary, because the requirements for measurement properties vary with the application of the instrument [8]. For example, instruments used for evaluation need to be responsive, while instruments used for discrimination do not.

The COSMIN Steering Committee (Appendix 1) searched the literature to determine how measurement properties are generally evaluated. Two searches were performed. (1) A systematic literature search was performed to identify all existing systematic reviews on measurement properties of health status measurement instruments [9]. From these reviews, information was extracted on which measurement properties were evaluated and on the standards that were used to evaluate the measurement properties of the included studies. For each measurement property, we found several different standards, some of which were contradictory [9]. (2) The Steering Committee also performed another systematic literature search (available on request from the authors) to identify methodological articles and textbooks containing standards for the evaluation of measurement properties of health status measurement instruments. Articles were selected if the purpose of the article was to present a checklist or standards for measurement properties. Standards identified in this literature were used as input in the Delphi rounds.

International Delphi study

Subsequently, a Delphi study was performed, which consisted of four written rounds. The first questionnaire was sent in March 2006, the last in November 2007. We decided to invite at least 80 international experts to participate in our Delphi panel in order to ensure 30 responders in the last round. Based on previous experiences with Delphi studies [10, 11], we expected that 70% of the people invited would agree to participate, and that of these 65% would complete the first list. Once started, we expected that 75% would stay involved. We included experts in the fields of psychology, epidemiology, statistics, and clinical medicine. Among those invited were authors of reviews, methodological articles, or textbooks. Experts had to have at least five publications on the (methods of) measurement of health status in PubMed. We invited people from different parts of the world.

In the first round, we asked questions about which measurement properties should be included in the checklist, and about their terms and definitions. For example, for the measurement property internal consistency we asked 'Which term do you consider the best for this measurement property?', with the response options 'internal consistency', 'internal consistency reliability', 'homogeneity', 'internal scale consistency', 'split-half reliability', 'internal reliability', 'structural reliability', 'item consistency', 'intra-item reliability', or 'other', with some space to give an alternative term. Regarding the definitions, we asked 'Which definition do you consider the best for internal consistency?', and provided seven definitions found in the literature plus the option 'other', where a panel member could provide an alternative definition.

In round two, we introduced questions about preferred standards for each measurement property. We asked questions about design issues, e.g. 'Do you agree with the following requirements for the design of a study evaluating internal consistency of HR-PRO instruments in an evaluative application? (1) One administration should be available. (2) A check for uni-dimensionality per (sub)scale should be performed. (3) Internal consistency statistics should be calculated for each (sub)scale separately.' The panel could answer each item on a 5-point scale ranging from strongly disagree to strongly agree. Next, the panel was asked to rate which statistical methods they considered adequate for evaluating the measurement property concerned. A list of potentially relevant statistical methods for each measurement property was provided. For example, for internal consistency the following often-used methods were proposed: 'Cronbach's alpha', 'Kuder-Richardson formula-20', 'average item-total correlation', 'average inter-item correlation', 'split-half analysis', 'goodness of fit (IRT) at a global level, i.e. index of (subject) separation', 'goodness of fit (IRT) at a local level, i.e. specific item tests', or 'other'. Panel members could indicate more than one method.

In the third round, we presented the most often chosen methods, both the one based on CTT and the one based on IRT, and asked whether the panel considered each of these the most preferred method to evaluate the measurement property. For internal consistency, these were 'Cronbach's alpha' and 'goodness of fit (IRT) at a global level, i.e. index of (subject) separation', respectively. In the third round, the panel members were also asked whether the other methods (i.e. 'Kuder-Richardson formula-20', 'average item-total correlation', 'average inter-item correlation', 'split-half analysis', 'goodness of fit (IRT) at a local level, i.e. specific item tests') were also considered appropriate. Panel members could also have indicated 'other methods' in round 2; for internal consistency, the methods indicated were 'eigenvalues or percentage of variance explained of factor analysis', 'Mokken Rho' and 'Loevinger H'. In round 3, the panel was also asked whether they considered these methods appropriate for assessing internal consistency.
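As a concrete illustration of the CTT statistic most often chosen by the panel, the following minimal Python sketch (not part of the COSMIN study; the data are hypothetical) computes Cronbach's alpha for one (sub)scale from a respondents-by-items score matrix. With dichotomous (0/1) items, the same formula reduces to the Kuder-Richardson formula-20 (KR-20).

    import numpy as np

    def cronbach_alpha(items):
        """Cronbach's alpha for an n_persons x n_items score matrix of one (sub)scale."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]                         # number of items in the (sub)scale
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale score
        return k / (k - 1) * (1 - item_vars.sum() / total_var)

    # Toy data: 6 respondents, 4 items scored 0-4 (purely illustrative).
    scores = [[3, 4, 3, 4],
              [2, 2, 3, 2],
              [4, 4, 4, 3],
              [1, 2, 1, 2],
              [3, 3, 4, 4],
              [2, 1, 2, 1]]
    print(round(cronbach_alpha(scores), 2))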
In the final Delphi round, all measurement properties and standards that the panel had agreed upon were integrated by the Steering Committee into a preliminary version of the checklist for evaluating the methodological quality of studies on measurement properties.

In each Delphi round, the results of the previous round were presented in a feedback report. Panel members were asked to rate their (dis)agreement with the proposals. Agreement was rated on a 5-point scale (strongly disagree—disagree—no opinion—agree—strongly agree). The panel members were encouraged to give arguments for their choices to convince other panel members, to suggest alternatives, or to add new issues. Consensus on an issue was considered to be reached when at least 67% of the panel members indicated 'agree' or 'strongly agree' on the 5-point scale. If less than 67% agreement was reached on a question, we asked it again in the next round, providing pro and contra arguments given by the panel members, or we proposed an alternative. When no consensus was reached, the Steering Committee took the final decision.
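To make the consensus rule concrete, here is a minimal sketch (hypothetical ratings, not taken from the COSMIN study materials) that applies the 67% criterion to one proposal rated on the 5-point scale:

    from collections import Counter

    def consensus_reached(ratings, threshold=0.67):
        """True if at least 67% of panel members rated 'agree' or 'strongly agree'."""
        counts = Counter(ratings)
        agreement = (counts["agree"] + counts["strongly agree"]) / len(ratings)
        return agreement >= threshold, agreement

    # Hypothetical ratings from 30 panel members on one proposal.
    ratings = ["agree"] * 14 + ["strongly agree"] * 7 + ["no opinion"] * 5 + ["disagree"] * 4
    reached, pct = consensus_reached(ratings)
    print(f"{pct:.0%} agreement -> consensus reached: {reached}")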
When necessary, we asked the panel members to indicate the preferred statistical methods separately for each measurement theory, i.e. Classical Test Theory (CTT) or Item Response Theory (IRT), or for each type of score, such as dichotomous, nominal, ordinal, or continuous scores.

Results

Panel members

We invited 91 experts to participate, of whom 57 (63%) agreed. The main reason for non-participation was lack of time. Nineteen experts (21%) did not respond. Of the 57 experts who agreed to participate, 43 (75%) participated in at least one round, and 20 (35%) participated in all four rounds. The average number (minimum–maximum) of years of experience in measuring health or comparable fields (e.g. educational or psychological measurement) was 20 (6–40) years. Most of the panel members came from Northern America (n = 25) and Europe (n = 29), while two were from Australia and one was from Asia. The response rate of the rounds ranged from 48 to 74%. Six panel members (11%) dropped out during the process. The names of all panel members who completed at least one round are presented in the 'Acknowledgments'.

The COSMIN taxonomy

In the Delphi study, we also developed a taxonomy of the relationships of the measurement properties that are relevant for evaluating HR-PRO instruments, and reached consensus on terminology and definitions of these measurement properties. The relationships between all properties are presented in a taxonomy (Fig. 1). The taxonomy comprises three domains (i.e. reliability, validity, and responsiveness), which contain the measurement properties. The measurement property construct validity contains three aspects, i.e. structural validity, hypotheses testing, and cross-cultural validity. Interpretability was also included in the taxonomy and checklist; although it was not considered a measurement property, it is nevertheless an important characteristic. The percentages of agreement on terminology and position in the taxonomy are described elsewhere [12].

[Fig. 1: COSMIN taxonomy of relationships of measurement properties of a HR-PRO. The domain reliability contains internal consistency, reliability (test-retest, inter-rater, intra-rater) and measurement error (test-retest, inter-rater, intra-rater); the domain validity contains content validity (including face validity), construct validity (structural validity, hypotheses testing, cross-cultural validity) and criterion validity (concurrent validity, predictive validity); the domain responsiveness contains responsiveness. Interpretability is included as a separate characteristic.]
The COSMIN checklist

The results of the consensus reached in the Delphi rounds were used to construct the COSMIN checklist (Appendix 2). The checklist contains twelve boxes. Ten boxes can be used to assess whether a study meets the standards for good methodological quality. Nine of these boxes contain standards for the included measurement properties (internal consistency (box A), reliability (box B), measurement error (box C), content validity (box D), structural validity (box E), hypotheses testing (box F), cross-cultural validity (box G), criterion validity (box H) and responsiveness (box I)), and one box contains standards for studies on interpretability (box J). In addition, two boxes are included that contain general requirements for articles in which IRT methods are applied (IRT box) and general requirements for the generalizability of the results (Generalizability box), respectively.

To complete the COSMIN checklist, a 4-step procedure should be followed (Fig. 2) [13]. Step 1 is to determine which properties are evaluated in an article. Step 2 is to determine whether the statistical methods used in the article are based on Classical Test Theory (CTT) or on Item Response Theory (IRT); for studies that apply IRT, the IRT box should be completed. Step 3 is to complete the boxes with the standards accompanying the properties chosen in step 1. These boxes contain questions to rate whether a study meets the standards for good methodological quality; they include items about the design requirements and preferred statistical methods of each of the measurement properties (boxes A to I), and a box with items on the interpretability of the (change) score (box J). The number of items in these boxes ranges from 5 to 18. Step 4 is to complete the box on general requirements for the generalizability of the results; this Generalizability box should be completed for each property identified in step 1. We developed a manual describing the rationale of each item and suggestions for scoring [13].

[Fig. 2: The 4-step procedure to complete the COSMIN checklist. Step 1: mark the properties assessed in the article (A internal consistency, B reliability, C measurement error, D content validity including face validity, construct validity: E structural validity, F hypotheses testing, G cross-cultural validity, H criterion validity, I responsiveness, J interpretability). Step 2: if IRT methods are used in the article, complete the IRT box. Step 3: complete the corresponding box A to J for each property marked in step 1. Step 4: complete the Generalisability box for each property marked in step 1.]

Consensus among the panel

In Table 1, we present the ranges of percentage agreement of the panel members for each box, both for the design requirements and for the statistical methods. Most of these issues were discussed in rounds 2 and 3.

Table 1: Percentage agreement of panel members who (strongly) agreed with the items about design requirements and statistical methods for the COSMIN boxes A–J (R = round in which consensus was reached; na = not applicable).

    Property                                      Design requirements         Statistical methods
    Internal consistency                          77-92% (R2)                 40-88% (R2-4)
    Reliability                                   77-97% (R2)                 80-92% (R3)
    Measurement error                             same items as reliability   20-76% (R3)
    Content validity                              90-94% (R2)                 na
    Structural validity                           72% (R3)                    68-100% (R3)
    Hypotheses testing                            77-92% (R2, R4)             90% (R2)
    Cross-cultural validity                       70-79% (R3-4)               68-94% (R3)
    Criterion validity                            88% (R3)                    88% (R3)
    Responsiveness (general)                      90-97% (R2)                 na
    Responsiveness (no gold standard available)   64-68% (R3)                 88% (R3)
    Responsiveness (gold standard available)      80% (R3)                    60-76% (R3)
    Interpretability                              na                          72-96% (R3)

Percentage agreement among the panel members on items 1–3 in the IRT box ranged from 81 to 96%. Item 4 (i.e. checking the assumptions for estimating parameters of the IRT model) was included based on a suggestion of a panel member in round 4; therefore, no consensus was rated, and the Steering Committee decided to include this item.

Four items included in the checklist had less than 67% agreement of the panel: item 9 of box A internal consistency, item 11 of box C measurement error, and items 11 and 17 of box I responsiveness. All but one were about the statistical methods. For different reasons, which we explain successively below, the Steering Committee decided to include these four items in the checklist.

When asking about the preferred statistical method for internal consistency (item 9), we initially did not distinguish between types of scores, i.e. dichotomous or ordinal scores. Therefore, Cronbach's alpha was preferred over the Kuder-Richardson Formula 20 (KR-20). However, the Steering Committee decided afterward that KR-20 is appropriate for dichotomous scores as well.

Item 11 of box C measurement error contains three methods, i.e. the standard error of measurement (SEM), the smallest detectable change (SDC) and the Limits of Agreement (LOA). In round 3, the SEM was chosen as the preferred method for assessing measurement error (76% agreement). When asked about other appropriate methods, only 20% agreed with the SDC, and 28% with the LOA. Despite the low percentages of agreement reached in round 3 on accepting the SDC and LOA as appropriate methods, the Steering Committee decided afterward that both methods should be considered appropriate for assessing measurement error, and they were included in the checklist. The SDC is a linear transformation of the SEM [14], i.e. SDC = 1.96 × √2 × SEM; because the SEM is an appropriate method, the SDC should also be considered appropriate. The LOA is a parameter indicating how much two measures differ [15]. When these two measures are repeated measures in stable patients, it can be used as a method for assessing measurement error. The LOA is directly related to the SEM [16], and we therefore decided to include this method in the checklist.
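As a worked illustration of these relationships (the numbers are hypothetical and not from the COSMIN study), the sketch below derives the SDC from the SEM and shows how the Bland-Altman limits of agreement relate to the same quantity when the two measures are repeated measures in stable patients:

    import math

    def smallest_detectable_change(sem):
        """SDC for an individual: 1.96 * sqrt(2) * SEM (the linear transformation in [14])."""
        return 1.96 * math.sqrt(2) * sem

    def limits_of_agreement(mean_diff, sd_diff):
        """Bland-Altman 95% limits of agreement: mean difference +/- 1.96 * SD of the differences."""
        half_width = 1.96 * sd_diff
        return mean_diff - half_width, mean_diff + half_width

    sem = 2.0                      # hypothetical SEM on a 0-100 scale
    sd_diff = math.sqrt(2) * sem   # for test-retest data without systematic change, SD_diff = sqrt(2) * SEM
    print(round(smallest_detectable_change(sem), 1))                       # about 5.5 scale points
    print(tuple(round(x, 1) for x in limits_of_agreement(0.0, sd_diff)))   # about (-5.5, 5.5)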
Item 11 of box I responsiveness (i.e. 'was an adequate description provided of the comparator instrument(s)?') was approved by 64% of the panel. Although the percentage agreement was slightly too low, we decided to include this item because it was also included in box F hypotheses testing, reflecting the similarity between construct validity and responsiveness.

Item 17 of box I contains two methods, i.e. correlations between change scores and the area under the receiver operating characteristic (ROC) curve. Seventy-six percent of the panel considered the first method the preferred method; it can be used when both the measurement instrument under study and its gold standard are continuous measures. Only 60% considered the ROC method an appropriate method to assess responsiveness when a (dichotomous) gold standard is available. In analogy to diagnostic research, the Steering Committee considered the ROC method an appropriate method to evaluate whether a measurement instrument is as good as its gold standard, and therefore decided to include this method.
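For illustration only (hypothetical data, not from the COSMIN study), the following sketch computes the two statistics named in item 17: the correlation between the change scores of two continuous instruments, and the area under the ROC curve of change scores against a dichotomous external criterion:

    import numpy as np

    def change_score_correlation(change_a, change_b):
        """Pearson correlation between the change scores of two continuous instruments."""
        return np.corrcoef(change_a, change_b)[0, 1]

    def roc_auc(change, improved):
        """Area under the ROC curve of change scores against a dichotomous criterion
        (1 = improved, 0 = not improved), computed over all improved/not-improved pairs."""
        pos = [c for c, g in zip(change, improved) if g == 1]
        neg = [c for c, g in zip(change, improved) if g == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    # Hypothetical change scores for 8 patients.
    change_new = [12, 8, 15, 3, -2, 9, 1, 14]    # instrument under study
    change_ref = [10, 6, 18, 4, -1, 7, 0, 16]    # comparator instrument
    improved   = [1, 1, 1, 0, 0, 1, 0, 1]        # dichotomous external criterion
    print(round(change_score_correlation(change_new, change_ref), 2))
    print(round(roc_auc(change_new, improved), 2))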
Discussion

In this Delphi study, we developed a checklist containing standards for evaluating the methodological quality of studies on measurement properties. We consider it useful to separate the evaluation of the methodological quality of a study from the evaluation of its results, similar to what is done for trials. The COSMIN checklist is meant for evaluating the methodological quality of a study on the measurement properties of a HR-PRO instrument, not for evaluating the quality of the HR-PRO instrument itself. To assess the quality of the instrument, criteria for what constitutes good measurement properties should be applied to the results of a study on measurement properties. Examples of such criteria were previously published by members of our group [6]; however, these criteria are not consensus based. Note that the COSMIN checklist does not include these criteria of adequacy. Although we initially intended to develop such criteria [17], due to lack of time and the complexity of the issues we have not yet developed criteria of adequacy of measurement properties. Consensus on such criteria should be obtained in the future. In addition, it might be useful to develop a rating system by which a study can be classified into different quality levels, e.g. excellent/good/fair/poor methodological quality.

The COSMIN checklist can be used to evaluate the methodological quality of studies on measurement properties of health status measurement instruments. For example, it can be used to assess the quality of a study on one measurement instrument, or to compare the measurement properties of a number of measurement instruments in a systematic review (e.g. [18, 19]). In such a review, it is important to take the methodological quality of the selected studies into account: if the results of high-quality studies differ from the results of low-quality studies, this can be an indication of bias. The COSMIN checklist can also be used as guidance for designing or reporting a study on measurement properties. Furthermore, students can use it when learning about measurement properties, and reviewers or editors of journals can use it to appraise the methodological quality of articles or grant applications concerning studies on measurement properties.

There are theoretical arguments that there is a need for an instrument to demonstrate good reliability, validity, and responsiveness. To our knowledge, Marshall [3] is the only one who has empirically shown that the results of studies can differ when validated measurement instruments are used compared to studies in which non-validated instruments are used. However, more empirical research should be conducted to support this need. Studies could be conducted for this purpose in which, for example, the results of randomized controlled trials (RCTs) that use well-responsive measurement instruments are compared with those of RCTs that use instruments with unknown responsiveness.

A Delphi approach is useful for situations in which there is a lack of empirical evidence and there are strong differences of opinion. The answers to the research questions of the COSMIN study cannot be investigated empirically; therefore, agreement among experts is useful. In the literature, cut-offs between 55 and 100% are used [20]. The cut-off of 67% for consensus was arbitrarily chosen.

It is impossible to draw a random sample from all experts; therefore, the selection of experts was necessarily non-systematic. All first and last authors identified by either of the two systematic literature searches described in the Methods section were considered potential experts. We added people whom we considered experts and who were not yet on the list. As a check of expertise, we searched PubMed to see whether an author had published at least five articles on measurement issues. We considered a total of 30 experts sufficient to cover the variety of opinion, and not too large to keep the process manageable.

In this study, we focused on HR-PRO instruments. However, the same measurement properties are likely to be relevant for other kinds of health-related measurement instruments, such as performance-based instruments and clinical rating scales. Furthermore, we focused on evaluative instruments; however, for discriminative or predictive purposes, the design requirements and standards for the measurement properties are likely the same.

The COSMIN checklist gives general recommendations for HR-PRO measurements. Some of the standards in the COSMIN checklist need further refinement, e.g. by defining what an adequate sample size is, what an adequate test–retest time interval is, or when something is adequately described. Since these issues are highly dependent on the construct to be measured, users should make these decisions for their own application.
To help future users of the COSMIN checklist, we have described some of the discussions we had in the Delphi rounds about the standards elsewhere [21]. In the manual [13], we describe a rationale for each item and give suggestions for scoring the items in the checklist.

The COSMIN initiative aims to improve the selection of measurement instruments. As a first step, we have reached consensus on which measurement properties are important and we have developed standards for how to evaluate these measurement properties. The COSMIN checklist was developed with the participation of many experts in the field and will facilitate the selection of the most appropriate HR-PRO measure among competing instruments. By the involvement of many experts in the development process, it is highly probable that all relevant items of all relevant measurement properties are included, contributing to its content validity. In addition, we are planning to evaluate the inter-rater reliability of the COSMIN checklist in a large international group of researchers.

Acknowledgments

We are grateful to all the panel members who participated in the COSMIN study: Neil Aaronson, Linda Abetz, Elena Andresen, Dorcas Beaton, Martijn Berger, Giorgio Bertolotti, Monika Bullinger, David Cella, Joost Dekker, Dominique Dubois, Arne Evers, Diane Fairclough, David Feeny, Raymond Fitzpatrick, Andrew Garratt, Francis Guillemin, Dennis Hart, Graeme Hawthorne, Ron Hays, Elizabeth Juniper, Robert Kane, Donna Lamping, Marissa Lassere, Matthew Liang, Kathleen Lohr, Patrick Marquis, Chris McCarthy, Elaine McColl, Ian McDowell, Don Mellenbergh, Mauro Niero, Geoffrey Norman, Manoj Pandey, Luis Rajmil, Bryce Reeve, Dennis Revicki, Margaret Rothman, Mirjam Sprangers, David Streiner, Gerold Stucki, Giulio Vidotto, Sharon Wood-Dauphinee, and Albert Wu. An additional thanks to Sharon Wood-Dauphinee for language corrections within the COSMIN checklist.
This study was financially supported by the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, and the Anna Foundation, Leiden, The Netherlands. These funding organizations did not play any role in the study design, data collection, data analysis, data interpretation, or publication.

Open Access: This article is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Appendix 1: Members of the COSMIN Steering Committee

Wieneke Mokkink (epidemiologist): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands.
Caroline Terwee (epidemiologist): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands.
Donald Patrick (epidemiologist): Department of Health Services, University of Washington, Seattle, USA.
Jordi Alonso (clinician): Health Services Research Unit, Institut Municipal d'Investigacio Medica (IMIM-Hospital del Mar), Barcelona, Spain; CIBER en Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain.
Paul Stratford (physiotherapist): School of Rehabilitation Science and Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Canada.
Dirk Knol (psychometrician): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands.
Lex Bouter (epidemiologist): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, and Executive Board of VU University Amsterdam, Amsterdam, The Netherlands.
Riekie de Vet (epidemiologist): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands.

Appendix 2: The COSMIN checklist

[The checklist boxes themselves (boxes A–J, the IRT box and the Generalizability box) are not reproduced in this text extraction; they are available in the original article and in the COSMIN manual at www.cosmin.nl [13].]

References

1. Committee for Medicinal Products for Human Use (CHMP). (2005). Reflection paper on the regulatory guidance for the use of health related quality of life (HRQL) measures in the evaluation of medicinal products. London: EMEA. Available at: www.emea.europa.eu/pdfs/human/ewp/13939104en.pdf. Accessed November 10, 2008.
2. US Department of Health and Human Services FDA Center for Drug Evaluation and Research, FDA Center for Biologics Evaluation and Research, & FDA Center for Devices and Radiological Health. (2006). Guidance for industry: Patient-reported outcome measures: Use in medical product development to support labeling claims: Draft guidance. Health and Quality of Life Outcomes, 4, 79.
3. Marshall, M., Lockwood, A., Bradley, C., Adams, C., Joy, C., & Fenton, M. (2000). Unpublished rating scales: A major source of bias in randomised controlled trials of treatments for schizophrenia. British Journal of Psychiatry, 176, 249–252.
4. Lohr, K. N., Aaronson, N. K., Alonso, J., Burnam, M. A., Patrick, D. L., Perrin, E. B., et al. (1996). Evaluating quality-of-life and health status instruments: Development of scientific review criteria. Clinical Therapeutics, 18, 979–992.
5. Nunnally, J. C. (1978). Psychometric theory. New York, London: McGraw-Hill.
6. Terwee, C. B., Bot, S. D., De Boer, M. R., Van der Windt, D. A., Knol, D. L., Dekker, J., et al. (2007). Quality criteria were proposed for measurement properties of health status questionnaires. Journal of Clinical Epidemiology, 60, 34–42.
7. Valderas, J. M., Ferrer, M., Mendivil, J., Garin, O., Rajmil, L., Herdman, M., et al. (2008). Development of EMPRO: A tool for the standardized assessment of patient-reported outcome measures. Value in Health, 11, 700–708.
8. Kirshner, B., & Guyatt, G. H. (1985). A methodological framework for assessing health indexes. Journal of Chronic Diseases, 38, 27–36.
9. Mokkink, L. B., Terwee, C. B., Stratford, P. W., Alonso, J., Patrick, D. L., Riphagen, I., et al. (2009). Evaluation of the methodological quality of systematic reviews of health status measurement instruments. Quality of Life Research, 18, 313–333.
10. Evers, S., Goossens, M., De Vet, H., Van Tulder, M., & Ament, A. (2005). Criteria list for assessment of methodological quality of economic evaluations: Consensus on Health Economic Criteria. International Journal of Technology Assessment in Health Care, 21, 240–245.
11. Verhagen, A. P., De Vet, H. C. W., De Bie, R. A., Kessels, A. G., Boers, M., Bouter, L. M., et al. (1998). The Delphi list: A criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus. Journal of Clinical Epidemiology, 51, 1235–1241.
12. Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., Bouter, L. M., & De Vet, H. C. W. International consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes: Results of the COSMIN study. Journal of Clinical Epidemiology (accepted for publication).
13. Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2009). The COSMIN checklist manual. http://www.cosmin.nl. Accessed September 2009.
14. Pfennings, L. E., Van der Ploeg, H. M., Cohen, L., & Polman, C. H. (1999). A comparison of responsiveness indices in multiple sclerosis patients. Quality of Life Research, 8, 481–489.
15. Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1, 307–310.
16. De Vet, H. C. W., Terwee, C. B., Knol, D. L., & Bouter, L. M. (2006). When to use agreement versus reliability measures. Journal of Clinical Epidemiology, 59, 1033–1039.
17. Mokkink, L. B., Terwee, C. B., Knol, D. L., Stratford, P. W., Alonso, J., Patrick, D. L., et al. (2006). Protocol of the COSMIN study: COnsensus-based Standards for the selection of health Measurement INstruments. BMC Medical Research Methodology, 6, 2.
18. De Boer, M. R., Moll, A. C., De Vet, H. C. W., Terwee, C. B., Volker-Dieben, H. J., & Van Rens, G. H. (2004). Psychometric properties of vision-related quality of life questionnaires: A systematic review. Ophthalmic and Physiological Optics, 24, 257–273.
19. Veenhof, C., Bijlsma, J. W., Van den Ende, C. H., Van Dijk, G. M., Pisters, M. F., & Dekker, J. (2006). Psychometric evaluation of osteoarthritis questionnaires: A systematic review of the literature. Arthritis and Rheumatism, 55, 480–492.
20. Powell, C. (2003). The Delphi technique: Myths and realities. Journal of Advanced Nursing, 41, 376–382.
21. Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2010). The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: A clarification on its content. BMC Medical Research Methodology.

The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study

Loading next page...
 
/lp/pubmed-central/the-cosmin-checklist-for-assessing-the-methodological-quality-of-HTAybDIA5Q

References (31)

Publisher
Pubmed Central
Copyright
© The Author(s) 2010
ISSN
0962-9343
eISSN
1573-2649
DOI
10.1007/s11136-010-9606-8
Publisher site
See Article on Publisher Site

Abstract

Qual Life Res (2010) 19:539–549 DOI 10.1007/s11136-010-9606-8 The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study • • Lidwine B. Mokkink Caroline B. Terwee • • • Donald L. Patrick Jordi Alonso Paul W. Stratford • • Dirk L. Knol Lex M. Bouter Henrica C. W. de Vet Accepted: 2 February 2010 / Published online: 19 February 2010 The Author(s) 2010. This article is published with open access at Springerlink.com Abstract least 67% of the panel members indicated ‘agree’ or Background Aim of the COSMIN study (COnsensus- ‘strongly agree’. based Standards for the selection of health status Mea- Results Consensus was reached on the inclusion of the surement INstruments) was to develop a consensus-based following measurement properties: internal consistency, checklist to evaluate the methodological quality of studies reliability, measurement error, content validity (including on measurement properties. We present the COSMIN face validity), construct validity (including structural checklist and the agreement of the panel on the items of the validity, hypotheses testing and cross-cultural validity), checklist. criterion validity, responsiveness, and interpretability. The Methods A four-round Delphi study was performed with latter was not considered a measurement property. The international experts (psychologists, epidemiologists, stat- panel also reached consensus on how these properties isticians and clinicians). Of the 91 invited experts, 57 should be assessed. agreed to participate (63%). Panel members were asked to Conclusions The resulting COSMIN checklist could be rate their (dis)agreement with each proposal on a five-point useful when selecting a measurement instrument, peer- scale. Consensus was considered to be reached when at reviewing a manuscript, designing or reporting a study on measurement properties, or for educational purposes. J. Alonso L. B. Mokkink (&)  C. B. Terwee  D. L. Knol Health Services Research Unit, Institut Municipal d’Investigacio L. M. Bouter  H. C. W. de Vet Medica (IMIM-Hospital del Mar), Doctor Aiguader 88, 08003 Department of Epidemiology and Biostatistics and the EMGO Barcelona, Spain Institute for Health and Care Research, VU University Medical e-mail: jalonso@imim.es Center, Van der Boechorststraat 7, 1081 BT Amsterdam, The Netherlands J. Alonso e-mail: w.mokkink@vumc.nl CIBER en Epidemiologı ´a y Salud Pu ´ blica (CIBERESP), URL: www.emgo.nl; www.cosmin.nl Barcelona, Spain C. B. Terwee e-mail: cb.terwee@vumc.nl P. W. Stratford School of Rehabilitation Science and Department of Clinical D. L. Knol Epidemiology and Biostatistics, McMaster University, 1400 e-mail: d.knol@vumc.nl Main St. West, Hamilton, ON, Canada L. M. Bouter e-mail: stratford@mcmaster.ca e-mail: lm.bouter@dienst.vu.nl L. M. Bouter H. C. W. de Vet Executive Board of VU University Amsterdam, De Boelelaan e-mail: hcw.devet@vumc.nl 1105, 1081 HV Amsterdam, The Netherlands D. L. Patrick Department of Health Services, University of Washington, Thur Canal St Research Office, 146N Canal Suite 310, Seattle, WA 98103, USA e-mail: donald@u.washington.edu 123 540 Qual Life Res (2010) 19:539–549 Keywords Delphi technique  Outcome assessment  functional status, and health-related quality of life. These Psychometrics  Quality of life  Questionnaire are constructs which are not directly measurable. 
Because of the subjective nature of these constructs, it is very Introduction important to evaluate whether the measurement instru- ments measure these constructs in a valid and reliable way. Measurement of health outcomes is essential in scientific The COSMIN initiative (COnsensus-based Standards research and in clinical practice. Based on the scores for the selection of health Measurement Instruments) aims obtained with measurement instruments, decisions are to improve the selection of health measurement instru- made about the application of subsequent diagnostic tests ments. As part of this initiative, the aim of this study was to and treatments. Health status measurement instruments develop a checklist containing standards for evaluating the should therefore be reliable and valid. Otherwise there is a methodological quality of studies on measurement prop- serious risk of imprecise or biased results that might lead to erties. The checklist was developed as a multidisciplinary, wrong conclusions. Organisations such as the US Food and international collaboration with all relevant expertise Drug Administration (FDA) and the European Medicines involved. We performed a Delphi study to address two Agency (EMEA) require that measurement instruments research questions: must be well validated for its purpose [1, 2]. The need for 1. Which measurement properties should be included in reliable and valid measurement instruments of health out- the checklist? comes was clearly demonstrated by Marshall, who showed 2. How should these measurement properties be evalu- in schizophrenia trials that authors were more likely to ated in terms of study design and statistical analysis report that treatment was superior to control when an (i.e. standards)? unpublished measurement instrument was used in the comparison, rather than a published instrument [3]. In this paper, we present the COSMIN checklist, and Before a health status measurement instrument can be describe the agreement of the panel concerning the items used in research or clinical practice, its measurement included in the checklist. properties, i.e. reliability, validity and responsiveness, should be assessed and considered adequate. Studies evaluating measurement properties should be of high Methods methodological quality to guarantee appropriate conclu- sions about the measurement properties of an instrument. Focus of the COSMIN checklist To evaluate the methodological quality of a study on measurement properties, standards are needed. Although The COSMIN checklist is focused on evaluating the many standards and criteria have been proposed, these have methodological quality of studies on measurement prop- not been operationalised into user-friendly and easily erties of HR-PROs. We choose to focus on HR-PROs, applicable checklists (e.g. [4, 5]). Moreover, these stan- because of the complexity of these instruments. These dards do not pay attention to studies that apply Item instruments measure constructs that are both multidimen- Response Theory (IRT) models, or are not consensus based sional and not directly measurable. (e.g. [6, 7]). Such a checklist should contain a complete set In addition, we focused on evaluative applications of of standards (which refers to the design requirements and HR-PRO instruments, i.e. longitudinal applications preferred statistical methods) and criteria of adequacy for assessing treatment effects or changes in health over time. what constitutes good measurement properties. 
Broad The specification of evaluative is necessary, because the consensus is necessary in order to achieve wide acceptance requirements for measurement properties vary with the of a checklist. application of the instrument [8]. For example, instruments Research on measurement properties is particularly used for evaluation need to be responsive, while instru- important for health outcomes that are directly reported by ments used for discrimination do not. patients, i.e. health-related patient-reported outcomes The COSMIN Steering Committee (Appendix 1) sear- (HR-PROs). A HR-PRO is a measurement of any aspect of ched the literature to determine how measurement prop- a patient’s health status that is directly assessed by the erties are generally evaluated. Two searches were patient, i.e. without the interpretation of the patient’s performed: (1) a systematic literature search was per- responses by a physician or anyone else [2]. Modes of data- formed to identify all existing systematic reviews on collection for HR-PRO instruments include interviewer- measurement properties of health status measurement administered instruments, self-administered instruments, or instruments [9]. From these reviews, information was computer-administered instrument [2]. Examples of extracted on which measurement properties were evalu- HR-PROs are questionnaires assessing symptoms, ated, and on standards that were used to evaluate the 123 Qual Life Res (2010) 19:539–549 541 measurement properties of the included studies. For each performed. (3) Internal consistency statistics should be measurement property, we found several different stan- calculated for each (sub) scale separately’. The panel could dards, some of which were contradictory [9]. (2) The answer each item on a 5-point scale ranging from strongly steering committee also performed another systematic lit- disagree to strongly agree. Next, the panel was asked to rate erature search (available on request from the authors) to which statistical methods they considered adequate for identify methodological articles and textbooks containing evaluating the measurement property concerned. A list of standards for the evaluation of measurement properties of potential relevant statistical methods for each measurement health status measurement instruments. Articles were property was provided. For example, for internal consis- selected if the purpose of the article was to present a tency the following often used methods were proposed: checklist or standards for measurement properties. Stan- ‘Cronbach’s alpha’, ‘Kuder-Richardson formula-20’, dards identified in the aforementioned literature were used ‘average item-total correlation’, ‘average inter-item corre- as input in the Delphi rounds. lation’, ‘split-half analysis’, ‘goodness of fit (IRT) at a global level, i.e. index of (subject) separation’, ‘goodness International Delphi study of fit (IRT) at a local level, i.e. specific item tests’, or ‘other’. Panel members could indicate more than one Subsequently, a Delphi study was performed, which con- method. In the third round, we presented the most often sisted of four written rounds. The first questionnaire was chosen method, both the one based on CTT and the one sent in March 2006, the last questionnaire in November based on IRT, and asked if the panel considered this 2007. We decided to invite at least 80 international experts method as the most preferred method to evaluate the to participate in our Delphi panel in order to ensure 30 measurement property. 
For internal consistency, these were responders in the last round. Based on previous experiences ‘Cronbach’s alpha’ and ‘goodness of fit (IRT) at a global with Delphi studies [10, 11], we expected that 70% of the level, i.e. index of (subject) separation’, respectively. In the people invited would agree to participate, and of these third round, the panel members were asked whether the people 65% would complete the first list. Once started, we other methods (i.e. ‘Kuder-Richardson formula-20’, ‘aver- expected that 75% would stay involved. We included age item-total correlation’, ‘average inter-item correlation’, experts in the field of psychology, epidemiology, statistics, ‘split-half analysis’, ‘goodness of fit (IRT) at a local level, and clinical medicine. Among those invited were authors of i.e. specific item tests’) were also considered appropriate. reviews, methodological articles, or textbooks. Experts had Panel members could also have indicated ‘other methods’ to have at least five publications on the (methods of) in round 2. Indicated methods were ‘eigen-values or per- measurement of health status in PubMed. We invited centage of variance explained of factor analysis,’ ‘Mokken Rho’ or ‘Loevinger H’ for internal consistency. In round 3, people from different parts of the world. In the first round, we asked questions about which the panel was also asked whether they considered these measurement properties should be included in the checklist, methods as appropriate for assessing internal consistency. and about their terms and definitions. For example, we In the final Delphi round, all measurement properties and asked for the measurement property internal consistency standards that the panel agreed upon were integrated by the ‘which term do you consider the best for this measurement steering committee into a preliminary version of the property?’, with the response options ‘internal consis- checklist for evaluating the methodological quality of tency’, ‘internal consistency reliability’, ‘homogeneity’, studies on measurement properties. ‘internal scale consistency’, ‘split-half reliability’, ‘internal In each Delphi round, the results of the previous round reliability’, ‘structural reliability’, ‘item consistency’, were presented in a feedback report. Panel members were ‘intra-item reliability’, or ‘other’ with some space to give asked to rate their (dis)agreement with regard to proposals. an alternative term. Regarding the definitions, we asked Agreement was rated on a 5-point scale (strongly dis- ‘Which definition do you consider the best for internal agree—disagree—no opinion—agree—strongly agree). consistency?’, and provided seven definitions that were The panel members were encouraged to give arguments for found in the literature and the option ‘other’ where a panel their choices to convince other panel members, to suggest member could provide an alternative definition. In round alternatives, or to add new issues. Consensus on an issue two, we introduced questions about preferred standards for was considered to be reached when at least 67% of the each measurement property. We asked questions about panel members indicated ‘agree’ or ‘strongly agree’ on the design issues, i.e. ‘Do you agree with the following 5-point scale. 
If less than 67% agreement was reached on a requirements for the design of a study evaluating internal question, we asked it again in the next round, providing pro consistency of HR-PRO instruments in an evaluative and contra arguments given by the panel members, or we application? (1) One administration should be available. (2) proposed an alternative. When no consensus was reached, A check for uni-dimensionality per (sub) scale should be the Steering Committee took the final decision. 123 542 Qual Life Res (2010) 19:539–549 When necessary, we asked the panel members to indi- was from Asia. The response rate of the rounds ranged cate the preferred statistical methods separately for each from 48 to 74%. Six panel members (11%) dropped out measurement theory, i.e. Classical Test Theory (CTT) or during the process. The names of all panel members who Item Response Theory (IRT), or for each type of score, completed at least one round are presented in the such as dichotomous, nominal, ordinal, or continuous ‘‘Acknowledgements’’. scores. The COSMIN taxonomy Results In the Delphi study, we also developed a taxonomy of the relationships of measurement properties that are relevant Panel members for evaluating HR-PRO instruments, and reached consen- sus on terminology and definitions of these measurement We invited 91 experts to participate of whom 57 (63%) properties. The relationships between all properties are agreed to participate. The main reason for non-participa- presented in a taxonomy (Fig. 1). The taxonomy comprises tion was lack of time. Nineteen experts (21%) did not three domains (i.e. reliability, validity, and responsive- respond. Of the 57 experts who agreed to participate, 43 ness), which contain the measurement properties. The (75%) experts participated in at least one round, and 20 measurement property construct validity contains three (35%) participated in all four rounds. The average number aspects, i.e. structural validity, hypotheses testing, and (minimum–maximum) of years of experience in measuring cross-cultural validity. Interpretability was also included in health or comparable fields (e.g. in educational or psy- the taxonomy and checklist, although it was not considered chological measurements) was 20 (6–40) years. Most of the a measurement property, but nevertheless an important panel members came from Northern America (n = 25) and characteristic. The percentages agreement on terminology Europe (n = 29), while two were from Australia and one and position in the taxonomy are described elsewhere [12]. Fig. 1 COSMIN taxonomy of relationships of measurement properties QUALITY of a HR-PRO QUALITY of a HR-PRO Reliability Internal Consistency Reliability (test-retest, Inter-rater, Validity Intra-rater) Content validity Measurement face error Construct validity (test-retest, validity Inter-rater, Intra-rater) Criterion validity (concurrent validity, Structural validity Hypotheses-testing predictive validity) Responsiveness Cross-cultural validity Responsiveness Interpretability 123 Qual Life Res (2010) 19:539–549 543 The COSMIN checklist consensus was rated, and the Steering Committee decided on including this item. The results of the consensus reached in the Delphi rounds Four items included in the checklist had less than 67% were used to construct the COSMIN checklist (Appendix agreement of the panel: item 9 of box A internal consis- 2). The checklist contains twelve boxes. 
Ten boxes can be tency, item 11 for box C measurement error, and items 11 used to assess whether a study meets the standard for good and 17 of box I responsiveness. All but one was about the methodological quality. Nine of these boxes contain stan- statistical methods. For different reasons, which we will dards for the included measurement properties (internal successively explain, the Steering Committee decided to consistency (box A), reliability (box B), measurement error include these four items in the checklist. (box C), content validity (box D), structural validity (box When asking about the preferred statistical method for E), hypotheses testing (box F), cross-cultural validity (box internal consistency, we initially did not distinguish G), criterion validity (box H) and responsiveness (box I), between types of scores, i.e. dichotomous or ordinal scores and one box contains standards for studies on interpret- (item 9). Therefore, Cronbach alpha was preferred over ability (box J). In addition, two boxes are included in the Kuder-Richardson Formula 20 (KR-20). However, the checklist that contain general requirements for articles in Steering Committee decided afterward that KR-20 was which IRT methods are applied (IRT box), and general considered appropriate for dichotomous scores as well. requirements for the generalizability of the results (Gen- Item 11 of box C measurement error contains three eralizability box), respectively. methods, i.e. standard error of measurement (SEM), To complete the COSMIN checklist, a 4-step procedure smallest detectable change (SDC) and Limits of Agreement should be followed (Fig. 2)[13]. Step 1 is to determine (LOA). In round 3, SEM was chosen as the preferred which properties are evaluated in an article. Step 2 is to method for measuring measurement error (76% agree- determine if the statistical methods used in the article are ment). When asking about other appropriate methods, only based on Classical Test Theory (CTT) or on Item 20% agreed with SDC, and 28% with LOA. Despite the Response Theory (IRT). For studies that apply IRT, the low percentages agreement reached in round 3 on accept- IRT box should be completed. Step 3 is to complete the ing SDC and LOA as appropriate methods, the Steering boxes with standards accompanying the properties chosen Committee decided afterward that both methods should be in step 1. These boxes contain questions to rate whether a considered appropriate to measure measurement error and study meets the standards for good methodological qual- were included in the checklist. The SDC is a linear trans- ity. Items are included about design requirements and formation of the SEM [14], i.e., 1.96 9 H2 9 SEM. preferred statistical methods of each of the measurement Because the SEM is an appropriate method, SDC should properties (boxes A to I). In addition, a box with items on also be considered appropriate. The LOA is a parameter interpretability of the (change) score is included (box J). indicating how much two measures differ [15]. When these The number of items in these boxes range from 5 to 18. two measures are repeated measures in stable patients, it Step 4 of the procedure is to complete the box on general can be used as a method for assessing measurement error. requirements for the generalizability of the results. This LOA is directly related to SEM [16], and we therefore Generalizability box should be completed for each prop- decided to include this method in the checklist. erty identified in step 1. 
We developed a manual describing the rationale of each item and giving suggestions for scoring [13].

Consensus among the panel

In Table 1, we present the ranges of the percentage agreement of the panel members for each box, both for the design requirements and for the statistical methods. Most of these issues were discussed in rounds 2 and 3.

Table 1 Percentage agreement of panel members who (strongly) agreed with the items about design requirements and statistical methods for the COSMIN boxes A–J

                                              Design requirements (%)         Statistical methods (%)
Internal consistency                          77–92 (R2)                      40–88 (R2–4)
Reliability                                   77–97 (R2)                      80–92 (R3)
Measurement error                             Same items as for reliability   20–76 (R3)
Content validity                              90–94 (R2)                      na
Structural validity                           72 (R3)                         68–100 (R3)
Hypotheses testing                            77–92 (R2, R4)                  90 (R2)
Cross-cultural validity                       70–79 (R3–4)                    68–94 (R3)
Criterion validity                            88 (R3)                         88 (R3)
Responsiveness (general)                      90–97 (R2)                      na
Responsiveness (no gold standard available)   64–68 (R3)                      88 (R3)
Responsiveness (gold standard available)      80 (R3)                         60–76 (R3)
Interpretability                              na                              72–96 (R3)

R = round in which consensus was reached; na = not applicable

Percentage agreement among the panel members on items 1–3 in the IRT box ranged from 81 to 96%. Item 4 (i.e. checking the assumptions for estimating the parameters of the IRT model) was included based on a suggestion of a panel member in round 4; therefore, no consensus was rated for this item, and the Steering Committee decided to include it.

Four items included in the checklist had less than 67% agreement of the panel: item 9 of box A internal consistency, item 11 of box C measurement error, and items 11 and 17 of box I responsiveness. All but one concerned the statistical methods. For different reasons, which we explain successively below, the Steering Committee decided to include these four items in the checklist.

When asking about the preferred statistical method for internal consistency (item 9), we initially did not distinguish between types of scores, i.e. dichotomous or ordinal scores. Therefore, Cronbach's alpha was preferred over the Kuder-Richardson Formula 20 (KR-20). However, the Steering Committee decided afterward that KR-20 should be considered appropriate for dichotomous scores as well.
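The reasoning behind this decision can be made concrete with a small numerical sketch: for dichotomous (0/1) items, KR-20 is Cronbach's alpha written in terms of item proportions, so the two formulas yield the same value. The code below is a hedged illustration with simulated data; the function names and the data are hypothetical and not part of the checklist.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_persons, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=0).sum()
    total_var = scores.sum(axis=1).var(ddof=0)
    return k / (k - 1) * (1 - item_var / total_var)

def kr20(scores):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    p = scores.mean(axis=0)                        # proportion scoring 1 on each item
    total_var = scores.sum(axis=1).var(ddof=0)
    return k / (k - 1) * (1 - (p * (1 - p)).sum() / total_var)

rng = np.random.default_rng(0)
trait = rng.normal(size=(50, 1))                                 # simulated common trait
items = ((trait + rng.normal(size=(50, 10))) > 0).astype(int)    # 10 dichotomous items
print(round(cronbach_alpha(items), 3), round(kr20(items), 3))    # identical values
```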
Item 11 of box C measurement error lists three methods, i.e. the standard error of measurement (SEM), the smallest detectable change (SDC) and the limits of agreement (LOA). In round 3, the SEM was chosen as the preferred method for assessing measurement error (76% agreement). When asked about other appropriate methods, only 20% agreed with the SDC and 28% with the LOA. Despite the low percentages of agreement reached in round 3 on accepting the SDC and the LOA as appropriate methods, the Steering Committee decided afterward that both methods should be considered appropriate for assessing measurement error, and both were included in the checklist. The SDC is a linear transformation of the SEM [14], i.e. SDC = 1.96 × √2 × SEM; because the SEM is an appropriate method, the SDC should also be considered appropriate. The LOA is a parameter indicating how much two measures differ [15]. When these two measures are repeated measurements in stable patients, it can be used as a method for assessing measurement error. The LOA is directly related to the SEM [16], and we therefore decided to include this method in the checklist as well.
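To illustrate how these three quantities relate, the sketch below computes them from a simple test–retest design in stable patients, assuming no systematic change between measurements (the SEM is then obtained from the standard deviation of the differences divided by √2). The data and function name are hypothetical and are not a COSMIN standard; other ways of estimating the SEM (e.g. from an ANOVA model) exist.

```python
import numpy as np

def measurement_error_indices(test, retest):
    """SEM, SDC and 95% limits of agreement from two measurements in stable patients."""
    d = np.asarray(retest, dtype=float) - np.asarray(test, dtype=float)
    sd_diff = d.std(ddof=1)
    sem = sd_diff / np.sqrt(2)                        # SEM from the SD of the differences
    sdc = 1.96 * np.sqrt(2) * sem                     # SDC = 1.96 * sqrt(2) * SEM = 1.96 * SD_diff
    loa = (d.mean() - 1.96 * sd_diff,                 # Bland-Altman 95% limits of agreement
           d.mean() + 1.96 * sd_diff)
    return sem, sdc, loa

test   = [42, 55, 61, 48, 70, 39, 58, 66]
retest = [45, 53, 63, 50, 68, 41, 60, 64]
sem, sdc, loa = measurement_error_indices(test, retest)
print(f"SEM = {sem:.2f}, SDC = {sdc:.2f}, LOA = ({loa[0]:.2f}, {loa[1]:.2f})")
```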
Item 11 of box I responsiveness (i.e. "was an adequate description provided of the comparator instrument(s)?") was approved by 64% of the panel. Although this percentage of agreement was slightly too low, we decided to include the item because it was also included in box F hypotheses testing, reflecting the similarity between construct validity and responsiveness.

Item 17 of box I contains two methods, i.e. correlations between change scores and the area under the receiver operating characteristic (ROC) curve. Seventy-six percent of the panel considered the first method the preferred one; it can be used when both the measurement instrument under study and its gold standard are continuous measures. Only 60% considered the ROC method an appropriate method for assessing responsiveness when a (dichotomous) gold standard is available. In analogy to diagnostic research, the Steering Committee considered the ROC method appropriate for evaluating whether a measurement instrument is as good as its gold standard, and therefore decided to include this method.
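For illustration only, the sketch below shows both options on made-up data: the correlation between change scores when a continuous gold standard is available, and the area under the ROC curve of the change score against a dichotomous gold standard (here computed with scikit-learn). Neither the data nor the code are part of the COSMIN checklist.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# change scores on the instrument under study and on a continuous gold standard (toy data)
change      = np.array([ 5, 12, -1,  8,  0, 15,  3,  9, -2, 11])
gold_change = np.array([ 4, 10,  0,  7,  1, 14,  2,  8, -1, 12])
print(np.corrcoef(change, gold_change)[0, 1])      # correlation between change scores

# dichotomous gold standard (e.g. improved yes/no): area under the ROC curve of the change score
improved = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])
print(roc_auc_score(improved, change))
```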
Discussion

In this Delphi study, we developed a checklist containing standards for evaluating the methodological quality of studies on measurement properties. We consider it useful to separate the evaluation of the methodological quality of a study from the evaluation of its results, similar to what is done for trials. The COSMIN checklist is meant for evaluating the methodological quality of a study on the measurement properties of a HR-PRO instrument, not for evaluating the quality of the HR-PRO instrument itself. To assess the quality of the instrument, criteria for what constitutes good measurement properties should be applied to the results of a study on measurement properties. Examples of such criteria were previously published by members of our group [6]; however, these criteria are not consensus based. Note that the COSMIN checklist does not include these criteria of adequacy.

Although we initially intended to develop such criteria [17], we have not yet done so, due to lack of time and the complexity of the issues. Consensus on such criteria should be obtained in the future. In addition, it might be useful to develop a rating system by which a study can be classified into different quality levels, e.g. excellent, good, fair or poor methodological quality.

The COSMIN checklist can be used to evaluate the methodological quality of studies on measurement properties of health status measurement instruments. For example, it can be used to assess the quality of a study on one measurement instrument, or to compare the measurement properties of a number of measurement instruments in a systematic review (e.g. [18, 19]). In such a review, it is important to take the methodological quality of the selected studies into account: if the results of high-quality studies differ from the results of low-quality studies, this can be an indication of bias. The COSMIN checklist can also be used as guidance for designing or reporting a study on measurement properties. Furthermore, students can use it when learning about measurement properties, and reviewers or journal editors can use it to appraise the methodological quality of articles or grant applications concerning studies on measurement properties.

There are theoretical arguments that an instrument needs to demonstrate good reliability, validity, and responsiveness. To our knowledge, Marshall et al. [3] provided the only empirical demonstration that the results of studies can differ when validated measurement instruments are used compared with studies in which non-validated instruments are used. More empirical research should be conducted to support this need, for example studies comparing the results of randomized controlled trials (RCTs) that use well-responsive measurement instruments with RCTs that use instruments of unknown responsiveness.

A Delphi approach is useful in situations in which there is a lack of empirical evidence and there are strong differences of opinion. The research questions of the COSMIN study cannot be answered empirically; therefore, agreement among experts is useful. In the literature, cut-offs between 55 and 100% are used [20]; the cut-off of 67% for consensus was chosen arbitrarily.

It is impossible to draw a random sample from all experts; therefore, the selection of experts was necessarily non-systematic. All first and last authors identified by either of the two systematic literature searches described in the Methods section were considered potential experts. We added people whom we considered experts and who were not yet on the list. As a check of expertise, we searched PubMed to see whether an author had published at least five articles on measurement issues. We considered a total of 30 experts sufficient to cover the variety of opinion, yet not too large to remain manageable.

In this study, we focused on HR-PRO instruments. However, the same measurement properties are likely to be relevant for other kinds of health-related measurement instruments, such as performance-based instruments and clinical rating scales. Furthermore, we focused on evaluative instruments; for discriminative or predictive purposes, the design requirements and standards for the measurement properties are likely to be the same.

The COSMIN checklist gives general recommendations for HR-PRO measurements. Some of the standards in the COSMIN checklist need further refinement, e.g. by defining what constitutes an adequate sample size, an adequate test–retest time interval, or an adequate description. Since these issues depend strongly on the construct to be measured, users should make these decisions for their own application.

To help future users of the COSMIN checklist, we have described some of the discussions we had in the Delphi rounds about the standards elsewhere [21]. In the manual [13], we describe a rationale for each item and give suggestions for scoring the items in the checklist.

The COSMIN initiative aims to improve the selection of measurement instruments. As a first step, we have reached consensus on which measurement properties are important and we have developed standards for how to evaluate these measurement properties. The COSMIN checklist was developed with the participation of many experts in the field and will facilitate the selection of the most appropriate HR-PRO measure among competing instruments. Because many experts were involved in the development process, it is highly probable that all relevant items of all relevant measurement properties are included, contributing to its content validity. In addition, we are planning to evaluate the inter-rater reliability of the COSMIN checklist in a large international group of researchers.

Acknowledgments We are grateful to all the panel members who have participated in the COSMIN study: Neil Aaronson, Linda Abetz, Elena Andresen, Dorcas Beaton, Martijn Berger, Giorgio Bertolotti, Monika Bullinger, David Cella, Joost Dekker, Dominique Dubois, Arne Evers, Diane Fairclough, David Feeny, Raymond Fitzpatrick, Andrew Garratt, Francis Guillemin, Dennis Hart, Graeme Hawthorne, Ron Hays, Elizabeth Juniper, Robert Kane, Donna Lamping, Marissa Lassere, Matthew Liang, Kathleen Lohr, Patrick Marquis, Chris McCarthy, Elaine McColl, Ian McDowell, Don Mellenbergh, Mauro Niero, Geoffrey Norman, Manoj Pandey, Luis Rajmil, Bryce Reeve, Dennis Revicki, Margaret Rothman, Mirjam Sprangers, David Streiner, Gerold Stucki, Giulio Vidotto, Sharon Wood-Dauphinee, and Albert Wu. An additional thanks goes to Sharon Wood-Dauphinee for language corrections within the COSMIN checklist.

This study was financially supported by the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, and the Anna Foundation, Leiden, The Netherlands. These funding organizations did not play any role in the study design, data collection, data analysis, data interpretation, or publication.

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Appendix 1: Members of the COSMIN Steering Committee

Wieneke Mokkink (epidemiologist): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands.
Caroline Terwee (epidemiologist): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands.
Donald Patrick (epidemiologist): Department of Health Services, University of Washington, Seattle, USA.
Jordi Alonso (clinician): Health Services Research Unit, Institut Municipal d'Investigacio Medica (IMIM-Hospital del Mar), Barcelona, Spain; CIBER en Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain.
Paul Stratford (physiotherapist): School of Rehabilitation Science and Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Canada.
Dirk Knol (psychometrician): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands.
Lex Bouter (epidemiologist): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, and Executive Board of VU University Amsterdam, Amsterdam, The Netherlands.
Riekie de Vet (epidemiologist): Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands.

Appendix 2: The COSMIN checklist

[The COSMIN checklist itself (boxes A–J, the IRT box and the Generalizability box) appears as an appendix in the original article and is not reproduced in this text version.]

References
1. Committee for Medicinal Products for Human Use (CHMP). (2005). Reflection paper on the regulatory guidance for the use of health related quality of life (HRQL) measures in the evaluation of medicinal products. London: EMEA. Available at: www.emea.europa.eu/pdfs/human/ewp/13939104en.pdf. Accessed November 10, 2008.
2. US Department of Health and Human Services FDA Center for Drug Evaluation and Research, US Department of Health and Human Services FDA Center for Biologics Evaluation and Research, & US Department of Health and Human Services FDA Center for Devices and Radiological Health. (2006). Guidance for industry: Patient-reported outcome measures: Use in medical product development to support labeling claims: Draft guidance. Health and Quality of Life Outcomes, 4, 79.
3. Marshall, M., Lockwood, A., Bradley, C., Adams, C., Joy, C., & Fenton, M. (2000). Unpublished rating scales: A major source of bias in randomised controlled trials of treatments for schizophrenia. British Journal of Psychiatry, 176, 249–252.
4. Lohr, K. N., Aaronson, N. K., Alonso, J., Burnam, M. A., Patrick, D. L., Perrin, E. B., et al. (1996). Evaluating quality-of-life and health status instruments: Development of scientific review criteria. Clinical Therapeutics, 18, 979–992.
5. Nunnally, J. C. (1978). Psychometric theory. New York, London: McGraw-Hill.
6. Terwee, C. B., Bot, S. D., De Boer, M. R., Van der Windt, D. A., Knol, D. L., Dekker, J., et al. (2007). Quality criteria were proposed for measurement properties of health status questionnaires. Journal of Clinical Epidemiology, 60, 34–42.
7. Valderas, J. M., Ferrer, M., Mendivil, J., Garin, O., Rajmil, L., Herdman, M., et al. (2008). Development of EMPRO: A tool for the standardized assessment of patient-reported outcome measures. Value in Health, 11, 700–708.
8. Kirshner, B., & Guyatt, G. H. (1985). A methodological framework for assessing health indexes. Journal of Chronic Diseases, 38, 27–36.
9. Mokkink, L. B., Terwee, C. B., Stratford, P. W., Alonso, J., Patrick, D. L., Riphagen, I., et al. (2009). Evaluation of the methodological quality of systematic reviews of health status measurement instruments. Quality of Life Research, 18, 313–333.
10. Evers, S., Goossens, M., De Vet, H., Van Tulder, M., & Ament, A. (2005). Criteria list for assessment of methodological quality of economic evaluations: Consensus on Health Economic Criteria. International Journal of Technology Assessment in Health Care, 21, 240–245.
11. Verhagen, A. P., De Vet, H. C. W., De Bie, R. A., Kessels, A. G., Boers, M., Bouter, L. M., et al. (1998). The Delphi list: A criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus. Journal of Clinical Epidemiology, 51, 1235–1241.
12. Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., Bouter, L. M., & De Vet, H. C. W. International consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes: Results of the COSMIN study. Journal of Clinical Epidemiology (accepted for publication).
13. Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2009). The COSMIN checklist manual. http://www.cosmin.nl. Accessed September 2009.
14. Pfennings, L. E., Van der Ploeg, H. M., Cohen, L., & Polman, C. H. (1999). A comparison of responsiveness indices in multiple sclerosis patients. Quality of Life Research, 8, 481–489.
15. Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, 1, 307–310.
16. De Vet, H. C. W., Terwee, C. B., Knol, D. L., & Bouter, L. M. (2006). When to use agreement versus reliability measures. Journal of Clinical Epidemiology, 59, 1033–1039.
17. Mokkink, L. B., Terwee, C. B., Knol, D. L., Stratford, P. W., Alonso, J., Patrick, D. L., et al. (2006). Protocol of the COSMIN study: COnsensus-based Standards for the selection of health Measurement INstruments. BMC Medical Research Methodology, 6, 2.
18. De Boer, M. R., Moll, A. C., De Vet, H. C. W., Terwee, C. B., Volker-Dieben, H. J., & Van Rens, G. H. (2004). Psychometric properties of vision-related quality of life questionnaires: A systematic review. Ophthalmic and Physiological Optics, 24, 257–273.
19. Veenhof, C., Bijlsma, J. W., Van den Ende, C. H., Van Dijk, G. M., Pisters, M. F., & Dekker, J. (2006). Psychometric evaluation of osteoarthritis questionnaires: A systematic review of the literature. Arthritis and Rheumatism, 55, 480–492.
20. Powell, C. (2003). The Delphi technique: Myths and realities. Journal of Advanced Nursing, 41, 376–382.
21. Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2010). The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: A clarification on its content. BMC Medical Research Methodology.
