Floor and ceiling effects are limitations encountered when measuring health status. A floor effect occurs when scores cannot take on a value lower than some particular number; a ceiling effect occurs when scores cannot take on a value higher than an upper limit. Health status instruments or surveys that assess domains or attributes of health status use a rating scale, commonly a Likert-type scale with response options running, for example, from 1 to 10. Such instruments have limitations when they are used to measure health status for either evaluative or discriminative purposes. An awareness of these limitations is important because they can cause problems in interpreting results, regardless of the domain being measured or the instrument being used. In interventional clinical trials, the degree to which health status changes is an important outcome, and the results of a study can be affected by floor and ceiling effects. In cost-effectiveness evaluations, the denominator of the reported ratio could be higher or lower than anticipated if there is a floor or ceiling effect.
Therefore, recognizing ceiling and floor effects, and doing the best to minimize or eliminate these limitations, is important for studies that affect medical decision making. This entry further defines floor and ceiling effects, discusses how these effects are typically detected and potentially accounted for, provides examples of ways researchers try to minimize these scaling effects, and discusses the implications of floor and ceiling effects on randomized clinical trials and policy decisions. Finally, newer psychometric methods that are emerging to minimize such effects are briefly discussed.
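To make the cost-effectiveness point concrete, the following minimal Python sketch computes an incremental cost-effectiveness ratio (incremental cost divided by incremental health effect) twice: once with a hypothetical true health gain and once with the smaller gain an instrument might record if some patients were already at its ceiling. The function name and all numbers are illustrative assumptions, not values from any study.

```python
def icer(delta_cost, delta_effect):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of health effect."""
    if delta_effect == 0:
        raise ValueError("incremental effect is zero; the ratio is undefined")
    return delta_cost / delta_effect

# Hypothetical inputs (assumptions for illustration only).
extra_cost = 5000.0     # added cost of the new treatment per patient
true_gain = 0.10        # health gain that actually occurred
observed_gain = 0.05    # gain recorded by an instrument with a ceiling effect

icer_true = icer(extra_cost, true_gain)
icer_observed = icer(extra_cost, observed_gain)
```

In this example the ceiling halves the measured gain, so the reported ratio is about twice the ratio based on the true gain, and the treatment would look less cost-effective than it is.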
A ceiling effect occurs when the majority of scores are at or near the maximum possible score for the variable that the health status survey instrument is measuring. The survey instrument cannot measure scores above its ceiling. If a high percentage of people score at the top of a scale, it is impossible to detect an improvement in health for that group.
Measures of activities of daily living (ADL) often have ceiling or floor effects in certain populations. For example, some individuals with specific chronic diseases such as stroke may exhibit high ceiling effects on more general surveys of health status, thus limiting the ability to distinguish certain aspects of health status between individuals scoring at the ceiling. Ceiling effects are particularly important limitations when researchers are looking for the impact of treatment interventions on changes in health status.
A floor effect occurs when the majority of scores are at or near the minimum possible score for the variable that the health status survey instrument is measuring. If a high percentage of people score at the bottom of a scale, it is impossible to detect a decline in health for that group. In clinical trials, for example, floor effects occur when outcomes are poor in the treatment and control conditions.
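The way a bounded scale hides change at either extreme can be illustrated with a small Python sketch. It assumes, purely for illustration, a 0-to-100 scale and a "latent" health level that can move beyond what the instrument records; the function names and patient values are hypothetical.

```python
def observed_score(latent, scale_min=0, scale_max=100):
    """Score an instrument can record: the latent value clamped to the scale range."""
    return max(scale_min, min(scale_max, latent))

def observed_change(latent_before, latent_after, scale_min=0, scale_max=100):
    """Change the instrument records between two assessments."""
    return (observed_score(latent_after, scale_min, scale_max)
            - observed_score(latent_before, scale_min, scale_max))

# Hypothetical patients who each truly improve by 10 points.
high_baseline = [50, 85, 95, 100]
ceiling_changes = [observed_change(b, b + 10) for b in high_baseline]
# ceiling_changes == [10, 10, 5, 0]

# Hypothetical patients who each truly decline by 10 points.
low_baseline = [40, 8, 3, 0]
floor_changes = [observed_change(b, b - 10) for b in low_baseline]
# floor_changes == [-10, -8, -3, 0]
```

Patients who start at or near a bound show little or no recorded change even though every patient changed by the same amount.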
There are numerous health status measurement survey instruments that are generally divided into generic and disease-specific measures. Common examples of generic health status questionnaires for individuals with chronic diseases include several iterations of the Medical Outcomes Study survey and the EuroQol (EQ-5D). Some examples of disease-specific health status questionnaires include the Chronic Heart Failure Questionnaire (CHQ) and the Peripheral Artery Questionnaire.
Effects can vary by instrument. For example, comparative examinations of the SF-6D and EQ-5D across seven patient/population groups (chronic obstructive airways disease, osteoarthritis, irritable bowel syndrome, lower back pain, leg ulcers, postmenopausal women, and the elderly) revealed evidence of floor effects in the SF-6D and ceiling effects in the EQ-5D. This suggested that the SF-6D tended to discriminate better at higher levels of function and had heavy floor effects, while the EQ-5D performed in the opposite manner: it did well at lower levels of function but had high ceiling effects. The choice of an instrument depends on what one wishes to measure. If the population has considerable morbidity, the EQ-5D may be a better choice. For a generally healthy population, the SF-6D may be the better choice. Another illustrative example is that of the problems encountered in the Veterans Health Study that used the MOS-VA. The VA had to extend the MOS SF-12/36 to include some instrumental activities of daily living (IADL)/ADL type items because of floor effects that occurred with the standard MOS. The pervasiveness of ceiling and floor effects has prompted the quest for a more appropriate approach to health status questions to accurately assess the health status of individuals and populations.
Detecting Ceiling and Floor Effects
Traditionally, classical test theory (CTT), a type of psychometric theory that analyzes measurement responses to questionnaires, has been used to evaluate the psychometric properties of health status measures. Determining if a floor or ceiling effect exists requires an examination of the acceptability of the distribution of scores for the health domains obtained from the health status instrument.
Measures of central tendency of the data, including mean and median, as well as the range, standard deviation, and skewness, are used for such purposes. A score would generally be considered acceptable if the values are distributed in a normal or bell-shaped curve, with the mean near the midpoint of the scale. Floor effects can be determined by examining the proportion of subjects with the lowest possible scores. Similarly, ceiling effects are calculated by determining the proportion of subjects who achieved the highest possible score. Criteria for defining floor and ceiling effects are controversial. Some recommend a skewness statistic between -1 and +1 as acceptable for eliminating the possibility of a floor or ceiling effect.
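The checks described above can be sketched in a few lines of Python using only the standard library. The function name, the example scores, and the use of the moment-based (population) skewness formula are assumptions made for illustration; statistical packages may apply a slightly different small-sample correction.

```python
from statistics import mean, median, pstdev

def floor_ceiling_summary(scores, scale_min, scale_max):
    """Summarize a score distribution for signs of floor or ceiling effects."""
    n = len(scores)
    m = mean(scores)
    sd = pstdev(scores)
    # Moment-based (population) skewness; 0 for a symmetric distribution.
    skew = 0.0 if sd == 0 else sum((x - m) ** 3 for x in scores) / (n * sd ** 3)
    return {
        "mean": m,
        "median": median(scores),
        "range": (min(scores), max(scores)),
        "sd": sd,
        "skewness": skew,
        # Percentage of respondents at the lowest and highest possible scores.
        "pct_floor": 100.0 * sum(x == scale_min for x in scores) / n,
        "pct_ceiling": 100.0 * sum(x == scale_max for x in scores) / n,
        # The -1 to +1 skewness guideline mentioned in the text.
        "skew_within_pm1": -1.0 <= skew <= 1.0,
    }

# Hypothetical scores on a 1-to-10 scale, with many respondents at the top.
scores = [10, 10, 10, 10, 10, 10, 9, 9, 8, 3]
summary = floor_ceiling_summary(scores, scale_min=1, scale_max=10)
```

For these hypothetical scores, 60% of respondents sit at the maximum and the skewness falls well below -1, so both checks point to a ceiling effect.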
Dealing with scales where the distribution is skewed, that is, where there is a ceiling or floor effect, is most problematic when comparing groups, as many statistical procedures rely on scores being approximately normally distributed. Making comparisons between groups in a clinical trial, or testing the effect of an intervention, may require additional advanced statistical techniques to adjust or account for the skewness of the data.
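The entry does not name a specific technique for skewed scores. As one possible illustration, and not a recommendation drawn from the source, the Python sketch below computes a rank-based "probability of superiority", which is the Mann-Whitney U statistic divided by the number of cross-group pairs. It makes no normality assumption and counts the ties produced by a ceiling as half a win. The function name and trial scores are hypothetical, and the sketch gives a descriptive effect size only, not a significance test.

```python
def probability_of_superiority(group_a, group_b):
    """Estimate P(A > B) + 0.5 * P(A == B) over all cross-group pairs.

    This equals the Mann-Whitney U statistic for group_a divided by
    len(group_a) * len(group_b); ties count as half a win.
    """
    wins = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(group_a) * len(group_b))

# Hypothetical trial scores on a 0-100 scale with many patients at the ceiling.
treatment = [100, 100, 100, 95, 90, 100, 85, 100]
control = [100, 90, 80, 95, 70, 100, 75, 85]
ps = probability_of_superiority(treatment, control)
```

A value of 0.5 would indicate no tendency for either group to score higher; values closer to 1 indicate that treated patients tend to score higher than controls.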
Minimizing Ceiling and Floor Effects
There are considerable conceptual and methodological challenges that confront users of health status instruments. Some individuals believe that ceiling and floor effects can be managed with statistical techniques. Others believe that these effects can be avoided or minimized by using disease-specific health surveys. Another option is to begin with a generic survey and use the disease-specific survey only if a ceiling or floor effect is observed. Still others believe that valuable information about the quality of life for individuals can be obtained by using both types of surveys.
Implications in Clinical Trials
Increasingly, researchers believe that measures of health status should be included in clinical trials. Historically, clinical research has focused on laboratory outcomes such as blood pressure, HbA1c, morbidity, and/or mortality. These have been the outcome measures of greatest interest to researchers, clinicians, and patients. It is now necessary to employ health status measures to obtain a comprehensive assessment of practical health outcomes for individuals enrolled in clinical trials. The selection of the survey depends on the objectives of the evaluation, the targeted disease and population, and psychometric characteristics. Many of the disease-specific health status measures are sensitive to the occurrence of clinical symptoms or relatively small differences between treatment interventions (in particular, in studies examining the effect of medications), thus reducing the possibility of ceiling effects. Detecting worsening health among people who are already ill presents a different challenge. Low baseline scores make it difficult to detect health status decline, arguing again for disease-specific measures to avoid floor effects. In general, if one encounters a floor or ceiling effect in a study using a general health status measure, then a disease-specific measure, which is purposefully designed to be responsive to disease progression and/or treatment responsiveness issues, should be administered as well.
Disease-specific measures are believed to be more sensitive to treatment effects; however, a number of generic health status measurement scales have demonstrated both clinical responsiveness and the ability to discriminate between groups. Thus, while many argue for the exclusive use of disease- and domain-specific measures for different disease conditions, the general recommended approach in randomized clinical trials of new medical therapies is to incorporate both generic and specific instruments to comprehensively assess health status. It may be worthwhile to pilot measures in the type of population to be studied, thus establishing that the measures adequately represent the health of the population before using them to establish the effectiveness of interventions. There is general agreement on the need for more comprehensive measures with multiple domains and multiple items to detect subtle changes in both healthy and severely ill populations.
Because health status measures can provide comparisons across conditions and populations, they are of interest to policy and decision makers. Such information has the potential to improve the quality of care and establish reasonable reimbursement practices.
These measures are also of interest to clinicians because they help to determine the impact of therapeutic interventions and quality of life in their particular patient populations. Health status measures may provide clinicians with information not otherwise obtained from patient histories. Surveys can be self-administered, scanned, and used to provide rapid feedback of health status data, a phenomenon already occurring in many parts of the United States.
However, these measures must also be interpretable by policy and decision makers, and challenges exist in ensuring that decision and policy makers and clinicians understand these more complex scaling issues with health status measures. Without a full understanding of the concepts and methods, results could impart an incorrect message to a clinician or policy maker and ultimately discourage continued use of the measure. Strategies to make scores interpretable have been described. For an evaluative instrument, one might classify patients into those who experienced an important improvement, such as change in mobility, and those who did not and examine the changes in scores in the two groups. Data suggest that small, medium, and large effects correspond to changes of approximately 0.5, 1.0, and greater than 1.0 per question for instruments that present response options on 7-point scales.
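The anchor-based strategy described above, in which patients are split by whether they experienced an important improvement and the change scores of the two groups are compared, can be sketched in Python as follows. The function name, the record layout, and the data are hypothetical and serve only to illustrate the comparison.

```python
from statistics import mean

def mean_change_by_anchor(records):
    """Average change score per question, split by an external anchor.

    Each record is (change_per_question, improved), where `improved` is True
    if the patient was judged to have had an important improvement.
    """
    improved = [c for c, flag in records if flag]
    not_improved = [c for c, flag in records if not flag]
    return {
        "improved": mean(improved) if improved else None,
        "not_improved": mean(not_improved) if not_improved else None,
    }

# Hypothetical change scores per question on a 7-point response scale.
records = [
    (1.0, True), (0.5, True), (1.5, True),
    (0.0, False), (0.25, False), (-0.25, False),
]
result = mean_change_by_anchor(records)
# result == {"improved": 1.0, "not_improved": 0.0}
```

The mean change in the improved group can then be read against the approximate per-question benchmarks quoted in the text.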
Item Response Theory
CTT remains the dominant theory used by researchers and clinicians to measure health status. However, in the field of psychometrics, CTT is increasingly being replaced by more sophisticated, complex models. Item response theory (IRT) potentially provides information that enables a researcher to improve the reliability of an assessment beyond that obtained with CTT. Although both theories have the same aims, IRT is considered to be stronger in its ability to reliably assess health status. IRT allows scaling of the level of difficulty of any item in a domain (e.g., physical function). Thus, theoretically, an item bank could have hundreds or thousands of survey questions covering a huge range of capabilities in a domain. Computerized adaptive testing (CAT) is a way of iteratively homing in on a person's level of ability in a particular domain by selectively asking questions across the broad domain and narrowing the estimate of ability by selecting new items to ask the person based on his or her responses to previous items. For example, if a person has told you that he or she can run a mile, there is no need to ask if he or she can walk one block.
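The adaptive idea can be sketched in Python under a strong simplifying assumption: responses are deterministic, so a person who can do a harder activity can also do every easier one. Operational CAT systems instead rely on a probabilistic IRT model with calibrated item parameters and statistical ability estimation. The item bank, difficulty values, and function names below are hypothetical.

```python
def adaptive_test(item_bank, can_do, max_items=10):
    """Narrow down ability by asking only items that are still informative.

    item_bank: list of (label, difficulty) pairs on a common difficulty scale.
    can_do:    callable taking a difficulty and returning True or False.
    Returns (lower_bound, upper_bound, labels_of_items_asked).
    """
    low, high = float("-inf"), float("inf")
    asked = []
    for _ in range(max_items):
        # Only items strictly inside the current interval have answers
        # that are not already implied by earlier responses.
        candidates = sorted(
            ((label, d) for label, d in item_bank if low < d < high),
            key=lambda item: item[1],
        )
        if not candidates:
            break
        # Ask the median-difficulty item among the remaining candidates.
        label, difficulty = candidates[len(candidates) // 2]
        asked.append(label)
        if can_do(difficulty):
            low = difficulty    # can do this, so easier items are implied
        else:
            high = difficulty   # cannot do this, so harder items are implied
    return low, high, asked

# Hypothetical physical-function item bank (difficulty values are made up).
bank = [
    ("walk across a room", 1), ("walk one block", 2),
    ("climb one flight of stairs", 3), ("walk a mile", 4),
    ("run a mile", 5), ("run five miles", 6),
]
low, high, asked = adaptive_test(bank, can_do=lambda d: d <= 5.5)
# asked == ["walk a mile", "run five miles", "run a mile"]
```

Here the respondent is located between "run a mile" and "run five miles" after three questions, and "walk one block" is never asked. An infinite bound in the result would mean the bank ran out of items at that end, the item-bank analogue of a floor or ceiling effect.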
CAT could potentially eliminate floor and ceiling effects by having an item bank so broad that all meaningful levels of ability are covered. However, the newer models are complex and spreading slowly in mainstream research. It is reasonable to assume that IRT will gradually overtake CTT, but CTT will likely remain the theory of choice for many researchers, clinicians, and decision makers until certain complexity issues associated with IRT can be resolved.
Barbara A. Bartman
See also Decision Making in Advanced Disease; Decisions Faced by Nongovernment Payers of Healthcare: Managed Care; EuroQoL (EQ-5D); Government Perspective, Informed Policy Choice; Health Outcomes Assessment; Health Status Measurement, Generic Versus Condition-Specific Measures; Health Status Measurement, Minimal Clinically Significant Differences, and Anchor Versus Distribution Methods; Health Status Measurement Standards; Measures of Central Tendency; Outcomes Research; Randomized Clinical Trials; Scaling; SF-6D; SF-36 and SF12 Health Surveys
Brazier, J., Roberts, J., Tsuchiya, A., & Busschbach, J. (2004). A comparison of the EQ-5D and SF-6D across seven patient groups. Health Economics, 13(9), 873–884.
Guyatt, G. H., Feeny, D. H., & Patrick, D. L. (1993). Measuring health-related quality of life. Annals of Internal Medicine, 118, 622–629.
Kazis, L. E., Miller, D. R., Clark, J. A., Skinner, K. M., Lee, A., Ren, X. S., et al. (2004). Improving the response choices on the veterans SF-36 health survey role functioning scales: Results from the Veterans Health Study. Journal of Ambulatory Care Management, 27(3), 263–280.
Kind, P. (2005). EQ-5D concepts and methods: A developmental history. New York: Springer.
McDowell, I. (2006). Measuring health: A guide to rating scales and questionnaires. Oxford, UK: Oxford University Press.
Mesbah, M. (2002). Statistical methods for quality of life studies: Design, measurements, and analysis. New York: Springer.