Latent Variable Modeling in Biostatistics Research – Measuring Concepts
By Melanie Wall, Ph.D.
Associate Professor
Division of Biostatistics
University of Minnesota
School of Public Health
One of the jobs of a biostatistics researcher is to develop good mathematical and computational techniques for summarizing data in order to answer questions about populations and individuals in those populations. Typical research questions involve variables that are measured on individuals in the population, such as how X is related to Y, how X is distributed in the population, or how X changes for different individuals or for the whole population over time. For many research questions, the variables X and Y can be measured directly. For example, X is "body mass index" and Y is "blood pressure," or X is "drug treatment" and Y is "time until death." In these cases, a multitude of biostatistical tools can be used to answer a specific research question.
Latent Variables
But what if the variables of interest cannot be measured directly? What if the variables are conceptual phenomena or constructions of ideas, such as quality of life? If this is the case, the first question that needs to be addressed is how to measure the variable in the first place. Conceptual phenomena are often called "latent" variables, and methods for measuring them have been developed within the fields of psychology, sociology, and education. "Intelligence" is the classic example of a latent variable, and Spearman (1904) is credited with the original development of a statistical model--called factor analysis--to measure it. More recently in the health sciences, there has been a growing desire to address research questions that involve conceptual variables, the types that traditionally would only have been examined by psychologists and social scientists. Streiner and Norman (2001) write in their text, Health Measurement:
"In the past 20 years or so, the situation in clinical research has become more complex. The effects of new drugs or surgical procedures on quantity of life is likely to be marginal, conversely, there is increased awareness of the impact of health and health care on the quality of human life…If the efforts of these disciplines are to be placed on sound scientific basis, methods must be devised to measure what was previously thought to be unmeasurable, and assess in a reproducible and valid fashion those subjective states which cannot be converted into the position of a needle on a dial"
Besides quality of life, there are a multitude of latent variables of interest in the health sciences, particularly in the assessment of health services and in behavioral public health. Some examples of latent variables being examined in the health sciences are stress, coping, self-esteem, socioeconomic status, unhealthy dieting, parental weight-related norms, satisfaction, body image, social support, self-restraint problems, and motivation for abstaining. It is therefore of interest to develop instruments, typically batteries of self-reported questionnaire items, together with statistical modeling methods. The aim is to combine the responses to these questionnaire items in order to produce reproducible and valid measures of the underlying latent variables. Beyond measuring the latent variables, it is also of interest to use statistical models to address research questions in which the latent variables are related to one another or to other types of variables.
Continuous versus Categorical Variables
Directly observable variables can be described as either continuous or categorical. While we practically never measure anything on a truly continuous scale--due to the discrete nature of all measurement instruments--a variable like weight (in pounds) is usually considered continuous since a person's weight can theoretically take on any incremental value, for example 140 pounds or 140.3 pounds or 140.31 pounds. Categorical variables, on the other hand, can take on only a finite number of values. For example, blood type can take on four possible values: A, B, O, or AB.
Like directly observable variables, latent variables can also be (hypothesized to be) continuous or categorical. The type of statistical model depends on what is assumed about the distribution of the latent variable and on what kind of observed variables (i.e., continuous or categorical) are used to measure it. Historically, continuous latent variable models such as exploratory factor analysis, confirmatory factor analysis, structural equation modeling, and item response theory have been treated as a separate field from categorical latent variable models such as latent class models, latent profile models, and more generally finite mixture models.
Recently, in my own work funded by a grant from the National Institutes of Health (NIH), a more universal framework for considering latent variables has been developed. Very generally, latent variable models can be characterized as being made up of two parts: a measurement model and a structural model. By measurement model we refer to the part that focuses on the way that the observed variables actually measure the latent variables. By structural model we refer to the part that captures the way that different latent variables are related to one another or how they are related to other observed covariates.
I have developed statistical models that simultaneously include both continuous and categorical latent variables in order to answer research questions that involve relationships between different types of conceptual variables. Here I present an example that uses the statistical method we developed for performing a generalized logistic regression--a technique for estimating relationships between variables when the outcome is categorical--where both the predictor variable and the outcome variable are latent variables, that is, they are not observed directly. This example comes from the paper by Guo, Wall, and Amemiya (2006).
An example from Project EAT
In the past decade, a large comprehensive study of adolescent nutrition and obesity called Project EAT was conducted. Principal Investigator and SPH professor Dianne Neumark-Sztainer (see e.g. Neumark-Sztainer et al., 2002) collected self-reported survey data from students in seventh and tenth grade at 31 Twin Cities schools in the 1998–1999 school year. One research question focused on whether a personal trait related to a girl's body satisfaction could predict her eating disorder risk class. Neither body satisfaction nor eating disorder risk can be measured directly (without error) with a single self-report questionnaire item. But both can be considered as latent variables underlying a series of questionnaire items--all of which may be measuring the latent variables with error.
Body satisfaction was hypothesized by the researchers to be a continuous latent variable measured by a battery of self-report items related to satisfaction with different parts of one's body (e.g., hips, shoulders, waist). The outcome variable of interest, eating disorder risk class, was hypothesized to be a categorical latent variable representing different types of eating disorder risk, distinguishing girls engaging in purging behaviors from those engaging in restricting behaviors. The questionnaire included a checklist of nine unhealthy weight control behaviors. No absolute classification rule based on the checklist existed, but given a girl's particular (unobserved) eating disorder risk, the researchers expected certain behaviors to show up more than others. Thus, the researchers were interested in a regression of a categorical latent variable (eating disorder risk) on a continuous latent variable (body satisfaction) while controlling for other observed covariates.
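One way to write down the structure described above is sketched below. The notation is an assumption made for illustration and is not taken from the original paper: f_i is the continuous body satisfaction factor for girl i, y_ij are her J body satisfaction items, c_i is her latent risk class (one of K), u_im are her nine binary checklist responses, and x_i are observed covariates. The sketch also assumes the usual latent class convention that the checklist items are independent given class membership.

```latex
\begin{align*}
% Measurement model for the continuous latent variable (factor model)
y_{ij} &= \nu_j + \lambda_j f_i + \varepsilon_{ij}, \qquad j = 1, \dots, J \\[4pt]
% Measurement model for the categorical latent variable (latent class model)
P(u_{i1}, \dots, u_{i9} \mid c_i = k) &= \prod_{m=1}^{9} \pi_{mk}^{\,u_{im}} \, (1 - \pi_{mk})^{1 - u_{im}} \\[4pt]
% Structural model: generalized logistic regression of class on factor and covariates
P(c_i = k \mid f_i, x_i) &= \frac{\exp(\alpha_k + \beta_k f_i + \gamma_k^{\top} x_i)}
                                 {\sum_{l=1}^{K} \exp(\alpha_l + \beta_l f_i + \gamma_l^{\top} x_i)},
\qquad \alpha_1 = \beta_1 = 0, \; \gamma_1 = 0
\end{align*}
```

The first two lines play the role of the measurement model and the third the role of the structural model. Fixing the coefficients of class 1 at zero makes it the reference class, so each beta_k describes how body satisfaction shifts the odds of class k relative to class 1.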
A new statistical model, based on certain assumptions about the relationships between the latent and observed variables, was developed, and maximum likelihood, a commonly used estimation method in statistics, was used to estimate the relationships of interest from the model's fit to the real data (Guo, Wall, and Amemiya, 2006). The maximum likelihood method required some advanced computational techniques that were developed specifically for this kind of latent variable model. Once the new statistical method was fully developed, it was applied to the data for the body satisfaction/eating disorder risk class research question.
Based on the data, the statistical model estimated that there were three latent classes of eating disorder risk: a group of girls who were not engaging in any risk behaviors (59% of sample), a group that was engaging in restricting behaviors like skipping meals and fasting (35% of sample), and a third group that showed a much higher probability of engaging in both the restricting behaviors and the purging behaviors like vomiting and using diuretics (6% of sample). The categorical latent variable was thus used to describe different typologies of behaviors. In addition, the statistical model estimated that the continuous underlying latent variable labeled as "body satisfaction" was strongly predictive of being in the no-risk group. In other words, girls with higher body satisfaction were much less likely to be in either of the two eating disorder risk groups--those that use restricting behaviors or those that use both restricting and purging behaviors.
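To make the direction of this relationship concrete, here is a minimal Python sketch of a generalized (multinomial) logistic link between a body satisfaction score and the three risk classes. The intercepts and slopes are hypothetical values, not estimates from the paper; they were chosen only so that the probabilities at an average score of 0 roughly resemble the class sizes reported above. The function name and the standardized scale of the score are likewise assumptions.

```python
import math

# Hypothetical class labels; the first class is the reference (no-risk) class.
CLASS_NAMES = ["no risk", "restricting", "restricting + purging"]

# Hypothetical coefficients for the two non-reference classes.
# Negative slopes mean higher body satisfaction lowers the odds of a risk class.
INTERCEPTS = [-0.5, -2.3]
SLOPES = [-1.0, -1.5]


def class_probabilities(body_satisfaction, intercepts, slopes):
    """Return multinomial-logit class probabilities for one factor score.

    The reference class has its linear predictor fixed at 0.
    """
    scores = [0.0] + [a + b * body_satisfaction
                      for a, b in zip(intercepts, slopes)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Compare a low, an average, and a high (standardized) body satisfaction score.
for score in (-1.0, 0.0, 1.0):
    probs = class_probabilities(score, INTERCEPTS, SLOPES)
    summary = ", ".join(f"{name}: {p:.2f}"
                        for name, p in zip(CLASS_NAMES, probs))
    print(f"body satisfaction {score:+.1f} -> {summary}")
```

In the actual analysis the body satisfaction score is itself unobserved, so the coefficients must be estimated jointly with the measurement models rather than plugged in as they are here.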
As biostatistical methods researchers, we developed a statistical model for addressing research questions that related latent variables of different types to one another, and a computational method to obtain useful estimates from the data. The method also included a diagnostic procedure to verify whether the data truly followed the model's assumptions. One of the most fulfilling aspects of being a biostatistics researcher is the synergy between applications and methods development. That is, the latent variable modeling methods developed here--inspired by a real research question about eating disorder risk and body satisfaction--provided a tool that can then be applied to many different types of research questions.
References
Guo, J., Wall, M.M., and Amemiya, Y. (2006) "Latent class regression on latent factors", Biostatistics, 7, 1, pp. 145-163.
Neumark-Sztainer, D., Story, M., Hannan, P., and Moe, J. (2002) "Overweight status and eating patterns among adolescents: where do youth stand in comparison to the Healthy People 2010 Objectives?", American Journal of Public Health, 92, pp. 844-851.
Spearman, C. (1904) "'General intelligence,' objectively determined and measured", American Journal of Psychology, 15, pp. 201-293.