The data and the code for reproducing the analyses presented in this vignette are available at the following links:


Introduction

Repeated measures latent class analysis (RMLCA) is part of the family of mixture models (Collins and Lanza, 2010; Killian et al., 2019; McLachlan and Peel, 2000). These models use a probabilistic approach, rather than an algorithmic or heuristic one, to capture heterogeneity that is not directly observed within populations. RMLCA is distinguished by its use of categorical observed variables to generate latent classes (subgroups of individuals), which are themselves categorical (Lanza, 2016). In terms of method, RMLCA relies on the distribution and covariation of the observed variables to test which solution (number of classes) best suits the data (Hagenaars and McCutcheon, 2002; Lanza et al., 2012). The central hypothesis is that the relationships between the variables are explained by an unmeasured latent variable (class membership), which we attempt to estimate. This estimate is obtained through iterative calculations that maximize a likelihood function so that the observations are distributed according to a previously unknown classification. Calculations are initiated by treating the categorical latent variable (class membership) as missing for all subjects in the sample, followed by repeated estimates, from a set of starting values, of participants’ probability of belonging to each class (for a more in-depth definition, see Asparouhov and Muthén, 2019 and Nylund-Gibson and Choi, 2018).
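
To make the estimation step more concrete, here is a minimal Python sketch of how posterior class-membership probabilities could be computed for one respondent once class proportions and conditional response probabilities are available. It assumes local independence of the indicators within classes. The function name, the two-class model and all numbers are hypothetical illustrations, not poLCA or StepMix code.

```python
from math import prod

def posterior_class_probs(response, class_weights, item_probs):
    """Posterior P(class | response), assuming local independence.

    response      : observed category index for each item
    class_weights : prior class proportions (summing to 1)
    item_probs    : item_probs[k][j][c] = P(item j = category c | class k)
    """
    # Joint probability of the response pattern and each class
    joint = [
        weight * prod(item_probs[k][j][c] for j, c in enumerate(response))
        for k, weight in enumerate(class_weights)
    ]
    total = sum(joint)  # marginal likelihood of the response pattern
    return [p / total for p in joint]

# Hypothetical two-class model with two binary items
class_weights = [0.6, 0.4]
item_probs = [
    [[0.9, 0.1], [0.8, 0.2]],  # class 0
    [[0.2, 0.8], [0.3, 0.7]],  # class 1
]
post = posterior_class_probs([0, 0], class_weights, item_probs)
print([round(p, 3) for p in post])
```

In an iterative estimation procedure, a step of this kind would alternate with an update of the class proportions and conditional probabilities until the likelihood stops improving.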

Once the latent variable has been estimated, it can then be stored for further analysis. In particular, the estimated latent classes can be used to predict a dependent variable, also called “distal outcome”, or they can be used as a known variable that can be predicted by independent variables (predictors/covariates). In this type of modeling, the mixture model is referred to as the “measurement model” and the relationship between the measurement model (i.e. the latent variable) and the external variables is referred to as the “structural model”. As the structural model treats the latent classes as a known variable, which can take on the role of dependent or independent variable depending on the goals, we use well-known models from the generalized linear models family to estimate them (ANOVA, multinomial logistic regression, etc.).

StepMix

StepMix is a new library available in R (CRAN) and Python (PyPI) that can be used to fit mixture models (with or without external variables) through a modular, easy-to-use interface. Although StepMix is still under development, it can currently fit a wide range of mixture models depending on the distribution of the observed variables (e.g., categorical, normal, mixed). A second distinguishing feature of StepMix is that it can estimate structural models using different stepwise approaches.

The aim of this vignette is to introduce StepMix by comparing the results of an RMLCA carried out with StepMix and those obtained with the popular poLCA library. A structural model will also be used to present some of the bias-adjusted stepwise approaches offered by StepMix.

Data

The analyses presented in this vignette are based on data from a research project studying the school-to-work transition among vulnerable youth (Dupéré et al., 2018; Thouin, 2022). The RMLCA model is built from 16 observed variables, each measuring the occupation status of 386 young people at a different measurement time. Occupation status has four categories: 1) neither at work nor in education (N), 2) at work (W), 3) in secondary education (SE) and 4) in post-secondary education (PE). The RMLCA will thus make it possible to identify different school-to-work transition paths using an inductive approach. A binary variable measuring belonging to an ethnic minority group (0=No, 1=Yes) will then be used as a predictor. We will compare the results of the structural model obtained with poLCA and those obtained with the more robust StepMix methods.


Results: class estimation

The choice of the optimal number of classes to retain in RMLCA must take into account both fit indices and relevant theoretical concepts. Table 1 shows the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) and log-likelihood (Llik) for models with one to eight classes estimated with both poLCA and StepMix. As the eight-class model has more parameters than there are observations (model over-parameterization), it is not considered in the comparative analysis. The fit indices obtained with both packages are very similar.

As for model choice, the AIC and BIC fit indices conflict: the AIC indicates that the seven-class model best fits the data, whereas the BIC selects the six-class model. This situation is common, as the BIC penalizes each additional parameter, and therefore each additional latent class, more heavily than the AIC does in samples of this size. The BIC is thus generally more robust to overfitting than the AIC. Taking into account a visual assessment of the measurement model graphs as well, the six-class solution seemed to offer the best fit and better answered the research goal of distinguishing theoretically relevant subgroups.


Table 1: Fit indices of models estimated in poLCA and StepMix

Model fit indices by number of latent classes
Comparison of models estimated in poLCA and StepMix

Number of        poLCA                             StepMix
latent classes   AIC        BIC        Llik        AIC        BIC        Llik
1                14934.63   15124.51   -7419.31    14934.63   15124.51   -7419.31
2                12778.04   13161.75   -6292.02    12778.04   13161.75   -6292.02
3                11818.19   12395.74   -5763.09    11818.19   12395.74   -5763.09
4                11192.37   11963.76   -5401.18    11192.37   11963.76   -5401.18
5                10916.39   11881.62   -5214.20    10918.49   11883.71   -5215.24
6                10649.06   11808.12   -5031.53    10648.65   11807.71   -5031.33
7                10499.49   11852.39   -4907.75    10498.45   11851.35   -4907.23
8                10403.42   11950.16   -4810.71    10403.57   11950.30   -4810.79

Llik: Log-likelihood
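
The fit indices in Table 1 can be recomputed from the log-likelihoods, as in the minimal Python sketch below. It assumes an unrestricted model in which each of the 16 four-category indicators contributes 3 free probabilities per class and the class proportions contribute K − 1 free parameters. This parameter count is our assumption rather than something read from the package output, but it is consistent with the reported values and with the over-parameterization of the eight-class model.

```python
from math import log

N_OBS = 386        # sample size reported in the Data section
N_ITEMS = 16       # observed variables (measurement times)
N_CATEGORIES = 4   # occupation statuses: N, W, SE, PE

def n_parameters(n_classes):
    """Free parameters of an unrestricted latent class model (assumed count)."""
    item_params = n_classes * N_ITEMS * (N_CATEGORIES - 1)
    return item_params + (n_classes - 1)

def aic(loglik, n_classes):
    return -2 * loglik + 2 * n_parameters(n_classes)

def bic(loglik, n_classes):
    return -2 * loglik + n_parameters(n_classes) * log(N_OBS)

# poLCA log-likelihoods taken from Table 1 (rounded to two decimals)
logliks = {1: -7419.31, 6: -5031.53, 8: -4810.71}
for k, ll in logliks.items():
    print(k, n_parameters(k), round(aic(ll, k), 2), round(bic(ll, k), 2))
```

Because the log-likelihoods in the table are rounded, the recomputed indices can differ from the reported ones in the second decimal. Under this count the eight-class model has 391 free parameters, more than the 386 observations.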


Visually inspecting the sequence graphs (Figures 1 and 2), both packages present six classes that can be interpreted identically: 1) the first class is made up of young people with a higher probability of making an early transition from high school to work; 2) the second class represents a transition from high school to work that is neither early nor late; 3) the third class is characterized by a late transition from high school to work; 4) the fourth class is characterized by an early transition from high school to post-secondary education; 5) the fifth class represents a transition from high school to post-secondary education that is neither early nor late; 6) the sixth class is made up of young people with a higher probability of being neither in work nor in school.


Figure 1: Measurement model estimated with poLCA (conditional probabilities)


Figure 2: Measurement model estimated with StepMix (conditional probabilities)


The slight difference between the models estimated by poLCA and StepMix is better reflected in class prevalence (i.e., the proportions of the groups in the population). The difference in prevalence is greatest for the class characterized by a transition from high school to work that is neither early nor late (class 2): StepMix estimated its prevalence (22.5%; n=87) to be around 1.6 percentage points higher than poLCA did (20.9%; n=81)\(^a\). Statistically, the similarity of the two classifications can be measured using the Adjusted Rand Index (ARI), which quantifies the agreement between two classifications over all pairs of cases, corrected for the agreement expected by chance (Rand, 1971; Santos and Embrechts, 2009). The closer the index is to 1, the more similar the classifications. In our case, we obtained an ARI of 0.91\(^a\), indicating that the two models assign cases to classes in a very similar way.

\(^a\)Since the model doesn’t always converge at exactly the same point with poLCA, you may get slightly different results. StepMix results will always be identical.

## [1] "Rand Index"
## [1] 0.9733127
## [1] "Adjusted Rand Index"
## [1] 0.9087011
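
For readers who want to see what these two indices measure, the following Python sketch computes the Rand index and the adjusted Rand index from two vectors of class labels using pair counts. The function name and the toy labels are illustrative assumptions and are not the code that produced the output above.

```python
from collections import Counter
from math import comb

def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two partitions."""
    pairs_total = comb(len(labels_a), 2)
    # Number of pairs grouped together in both partitions, in A, and in B
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Rand index: share of pairs on which the two partitions agree
    agreements = pairs_total + 2 * together_both - together_a - together_b
    rand = agreements / pairs_total

    # Adjusted Rand index: pair agreement corrected for chance
    expected = together_a * together_b / pairs_total
    max_index = (together_a + together_b) / 2
    ari = (together_both - expected) / (max_index - expected)
    return rand, ari

# Toy example: six cases classified by two hypothetical models
model_1 = [0, 0, 1, 1, 2, 2]
model_2 = [0, 0, 1, 2, 2, 2]
print(rand_indices(model_1, model_2))
```

Both indices depend only on which cases are grouped together, so they are unaffected when the two packages number the same classes differently. Note that the adjusted index is undefined in this sketch when both partitions place all cases in a single class.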


RMLCA with predictor

As mentioned in the introduction, once the measurement model has been estimated, one generally seeks to use the obtained latent classes as an observed variable in a structural model. Here, we seek to predict class membership using a predictor, namely ethnic minority.

Unlike StepMix, poLCA doesn’t allow us to integrate the predictor directly (one-step approach) without the risk of distorting the measurement model. Therefore, the categorical variable (6 categories) representing membership in the latent classes obtained from the measurement model must be extracted as an observed variable and used in a subsequent analysis. The nnet library was used to perform the multinomial logistic regression. Table 2 presents the results, with young people experiencing an early transition from high school to work as the reference category (class 1). The results indicate that young people belonging to ethnic minority groups are significantly more likely than those from non-minority groups to belong to the class characterized by a late transition from high school to work (class 3) than to the class characterized by an early transition from high school to work (B=1.11, z=3.02, p=0.003). This is a strong relationship: the odds ratio of 3.03 indicates that the odds of experiencing a late rather than an early transition to employment are about three times higher for young people belonging to ethnic minority groups. The logistic regression shows no other significant relationship (p>0.05).


Table 2: Multinomial Logistic Regression with poLCA (3-step Naive Approach)

              Coeff. (B)   SE      Z        Sig. (p-value)
Class 2
  Intercept   -0.058       0.170   -0.340   0.734
  Minority    -0.422       0.392   -1.076   0.282
Class 3
  Intercept   -0.895       0.220   -4.063   0.000
  Minority     1.109       0.367    3.022   0.003
Class 4
  Intercept   -0.829       0.215   -3.850   0.000
  Minority    -0.424       0.511   -0.831   0.406
Class 5
  Intercept   -0.413       0.188   -2.194   0.028
  Minority    -0.067       0.400   -0.168   0.867
Class 6
  Intercept   -0.292       0.182   -1.611   0.107
  Minority    -0.555       0.438   -1.267   0.205
## [1] "Odds Ratio"
## [1] 3.031195
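
The test statistic, p-value and odds ratio for the Minority coefficient of class 3 can be recovered from the coefficient and its standard error, as in the short Python sketch below. It assumes a two-sided Wald test based on the standard normal distribution and uses the rounded values from Table 2, so the results match the reported figures only up to rounding. The function name is ours.

```python
from math import erfc, exp, sqrt

def wald_summary(coef, se):
    """Wald z statistic, two-sided normal p-value and odds ratio."""
    z = coef / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    odds_ratio = exp(coef)
    return z, p_value, odds_ratio

# Minority coefficient and standard error for class 3 (Table 2)
z, p, odds_ratio = wald_summary(1.109, 0.367)
print(round(z, 2), round(p, 4), round(odds_ratio, 2))
```

Exponentiating the coefficient gives the ratio of the odds of belonging to class 3 rather than the reference class for minority versus non-minority youth.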

However, the previous interpretation is biased by the use of a “naive” 3-step approach, in which we: 1) produced the RMLCA model; 2) assigned individuals to the class to which they had the highest probability of belonging (i.e., creating a six-category variable); 3) modeled the relationship between the newly created variable and the predictor. Since mixture models are probabilistic, individuals may have several non-zero probabilities of belonging to one or other of the estimated classes. For example, the 11th participant has a posterior probability of around 0.25 of belonging to the first class and a posterior probability of around 0.75 of belonging to the class characterized by higher probabilities of being neither working nor studying. So, in creating a new variable (step 2), we ignored the uncertainty of class assignment and forced participants to have a probability of 1.00 of belonging to one or other of the classes (modal assignment).
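
The information lost in step 2 can be illustrated with a few lines of Python. The posterior probabilities below are hypothetical and only the first row mimics the 0.25/0.75 split described above. The sketch shows the class chosen by modal assignment and the share of posterior probability that the hard assignment ignores.

```python
# Hypothetical posterior probabilities for three participants over six classes
posteriors = [
    [0.25, 0.00, 0.00, 0.00, 0.00, 0.75],  # split between class 1 and class 6
    [0.98, 0.01, 0.01, 0.00, 0.00, 0.00],  # almost certain member of class 1
    [0.10, 0.45, 0.40, 0.05, 0.00, 0.00],  # very uncertain assignment
]

def modal_assignment(row):
    """Return the 1-based index of the most probable class."""
    return max(range(len(row)), key=row.__getitem__) + 1

for row in posteriors:
    assigned = modal_assignment(row)
    ignored = 1 - max(row)  # posterior mass discarded by the hard assignment
    print(assigned, round(ignored, 2))
```

The third participant is assigned to class 2 even though more than half of the posterior probability lies elsewhere, which is the kind of uncertainty the naive approach leaves out of the structural model.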

To correct this bias, various approaches have been developed by statisticians and made available mainly in commercial software such as Mplus and Latent GOLD. StepMix is the first freely available library to include these various bias-adjusted stepwise approaches. Table 3 shows the multinomial regression coefficients obtained with a naive 3-step approach and with three different robust stepwise approaches currently available in StepMix. We invite you to consult the articles published by the researchers who originally developed the different stepwise approaches for more information on their usefulness and the reasoning behind them (Bakk and Kuha, 2018; Bandeen-Roche et al., 1997; Bolck et al., 2004; Vermunt, 2010). This vast literature will help guide interested researchers in adopting the most suitable approach depending on the study context (sample size, missing data, number of parameters, etc.). Briefly, the variation in coefficients in this example suggests that the interpretation of results can be affected by the chosen approach, hence the importance of having easy access to these different approaches.
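
As a rough illustration of the ingredient shared by the BCH and ML corrections, the Python sketch below estimates the classification-error matrix from posterior probabilities, that is, the probability of being assigned to class s given true membership in class t under modal assignment. This is a simplified sketch based on our reading of the cited literature, with hypothetical data and names. It is not StepMix's implementation, and the corrections themselves involve further steps that are not shown.

```python
def classification_error_matrix(posteriors):
    """D[t][s] = estimated P(assigned class = s | true class = t)."""
    n_classes = len(posteriors[0])
    counts = [[0.0] * n_classes for _ in range(n_classes)]
    for row in posteriors:
        # Modal assignment: index of the most probable class
        assigned = max(range(n_classes), key=row.__getitem__)
        # Each case contributes its posterior mass to the assigned column
        for true_class, prob in enumerate(row):
            counts[true_class][assigned] += prob
    # Normalize each row (assumes every class has non-zero posterior mass)
    return [[c / sum(r) for c in r] for r in counts]

# Hypothetical posteriors for four participants and two classes
posteriors = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.1, 0.9],
]
D = classification_error_matrix(posteriors)
for row in D:
    print([round(v, 3) for v in row])
```

Off-diagonal entries quantify how often members of one class are expected to be assigned to another. The naive approach implicitly treats these entries as zero.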


Table 3: Regression Coefficients Obtained From Different Stepwise Approaches with StepMix

              Approach
Classes       Naïve    BCH      ML       2-step
Class 2
  Intercept    0.053    0.031    0.021    0.037
  Minority    -0.536   -0.557   -0.477   -0.556
Class 3
  Intercept   -0.848   -0.857   -0.934   -0.915
  Minority     1.191    1.181    1.136    1.088
Class 4
  Intercept   -0.792   -0.79    -0.818   -0.797
  Minority    -0.1     -0.115   -0.16    -0.248
Class 5
  Intercept   -0.45    -0.435   -0.364   -0.369
  Minority    -0.306   -0.327   -0.377   -0.354
Class 6
  Intercept   -0.445   -0.402   -0.484   -0.316
  Minority    -0.375   -0.406   -0.484   -0.572

Naïve: naive 3-step approach / BCH: Bolck-Croon-Hagenaars approach / ML: Maximum likelihood bias-corrected approach / 2-step: two-step approach


StepMix: Other advantages and future developments

StepMix already has a number of advantages that set it apart from other open-source packages. For example, it does not depend on third-party packages to estimate structural models. In the example presented above, the multinomial logistic regression between poLCA’s RMLCA classes and the variable measuring belonging to an ethnic minority group had to be estimated with the nnet package, which makes it difficult and risky to compare the “naive” 3-step results obtained with poLCA with those obtained with StepMix. A related advantage of StepMix is its ability to model latent groups from observed variables following several types of distribution, which significantly reduces the number of packages needed and makes these models easier for researchers to learn. Thus, StepMix can be used to perform latent profile analysis (LPA) without the need for other software or packages (e.g. mclust). StepMix also allows models to be built from variables with different distributions, as when some variables are categorical and others are numeric and normally distributed. In practice, this avoids the need to introduce dummy variables, particularly in the common case where quantitative variables are transformed into categorical ones. Please refer to the tutorials on the StepMix GitHub page to discover its many other features (missing data management, bootstrap, graphics, etc.).

StepMix is still under active development. The methods it offers have been designed by a group of developers with backgrounds in artificial intelligence and data science. In the future, we will also develop modules and indices better adapted to the needs of social science researchers. For example, StepMix currently offers a non-parametric bootstrap module enabling inference via confidence intervals, an approach widely used in machine learning. As p-values are still very popular in the social sciences, they will be integrated into a future version of the package to facilitate its use in research contexts. Check out the vignettes available on CRAN and follow the GitHub page to stay updated on future developments!

\(^b\) For the time being, this option is only available in the Python version of StepMix, but will soon be made available in R.


References

Asparouhov, T. and Muthén, B. (2019). Random Starting Values and Multistage Optimization. Mplus. https://www.statmodel.com/download/StartsUpdate.pdf

Bakk, Z. and Kuha, J. (2018). Two-step estimation of models between latent classes and external variables. Psychometrika, 83, 871-892. https://doi.org/10.1007/s11336-017-9592-7

Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L. and Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92(440), 1375-1386. https://doi.org/10.1080/01621459.1997.10473658

Barban, N. and Billari, F. C. (2012). Classifying life course trajectories: A comparison of latent class and sequence analysis. Journal of the Royal Statistical Society. Series C (Applied Statistics), 61(5), 765-784.

Bolck, A., Croon, M. and Hagenaars, J. (2004). Estimating latent structure models with categorical variables: One-step versus three-step estimators. Political Analysis, 12, 3-27. https://doi.org/10.1093/pan/mph001

Collins, L. M., Graham, J. W., Rousculp, S. S. and Hansen, W. B. (1997). Heavy caffeine use and the beginning of the substance use onset process: An illustration of latent transition analysis. In The science of prevention: Methodological advances from alcohol and substance abuse research (p. 79-99). American Psychological Association. https://doi.org/10.1037/10222-003

Collins, L. M. and Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. Wiley. https://doi.org/10.1002/9780470567333

Dupéré, V., Dion, E., Leventhal, T., Archambault, I., Crosnoe, R. and Janosz, M. (2018). High school dropout in proximal context: The triggering role of stressful life events. Child Development, 89(2), e107-e122. https://doi.org/10.1111/cdev.12792

Hagenaars, J. A. and McCutcheon, A. L. (2002). Applied latent class analysis. Cambridge University Press.

Han, Y., Liefbroer, A. C. and Elzinga, C. H. (2017). Comparing methods of classifying life courses: Sequence analysis and latent class analysis. Longitudinal and Life Course Studies, 8(4), 319-341. https://doi.org/10.14301/llcs.v8i4.409

Johnston, C. A., Crosnoe, R., Mernitz, S. E. and Pollitt, A. M. (2020). Two Methods for Studying the Developmental Significance of Family Structure Trajectories. Journal of Marriage and Family, 82(3), 1110-1123. https://doi.org/10.1111/jomf.12639

Killian, M. O., Cimino, A. N., Weller, B. E. and Hyun Seo, C. (2019). A Systematic Review of Latent Variable Mixture Modeling Research in Social Work Journals. Journal of Evidence-Based Social Work, 16(2), 192-210. https://doi.org/10.1080/23761407.2019.1577783

Lanza, S. T. (2016). Latent Class Analysis for Developmental Research. Child Development Perspectives, 10(1), 59-64. https://doi.org/10.1111/cdep.12163

Lanza, S. T., Bray, B. C. and Collins, L. M. (2012). An introduction to latent class and latent transition analysis (vol. 2).

McLachlan, G. J. and Peel, D. (2000). Finite mixture models. J. Wiley. http://catalogue.bnf.fr/ark:/12148/cb39038849q

Nylund-Gibson, K. and Choi, A. Y. (2018). Ten frequently asked questions about latent class analysis. Translational Issues in Psychological Science, 4(4), 440-461. https://doi.org/10.1037/tps0000176

Rand, W. M. (1971). Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336), 846-850. https://doi.org/10.1080/01621459.1971.10482356

Santos, J. M. and Embrechts, M. (2009). On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classification. In C. Alippi, M. Polycarpou, C. Panayiotou and G. Ellinas (Eds.), Artificial Neural Networks – ICANN 2009 (vol. 5769, p. 175-184). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-04277-5_18

Thouin, É., Courdi, C., Olivier, E., Dupéré, V., Denault, A.-S. and Lacourse, É. (2022). Introduction à l’analyse de séquence et illustration de son application en sciences sociales à partir de patrons de transitions de l’école au travail. Revue de psychoéducation, 51(2), 427–449. https://doi.org/10.7202/1093470ar

Thouin, É. (2022). La transition de l’école au travail chez les jeunes en situation de vulnérabilité scolaire ou sociale : examen des déterminants, des conséquences et des processus explicatifs [doctoral dissertation, Université de Montréal]. Papyrus. https://bib.umontreal.ca/citer/styles-biblioFigures/apa?tab=5248896

Vermunt, J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18(4), 450-469. https://doi.org/10.1093/pan/mpq025