BOOTSTRAPPING WITH SMALL SAMPLES IN STRUCTURAL EQUATION MODELING: GOODNESS OF FIT AND CONFIDENCE INTERVALS

Structural equation modeling (SEM) has become a staple of social science research; however, very little is known about its use with small samples. A sample size of 200 or larger for SEM models has been advocated (Boomsma, 1983; Kline, 2011), and the main test of model fit (the χ² goodness-of-fit test) is sample size dependent and performs optimally in a range of at least 200-400 (Kenny, 2012). Model complexity in SEM can vary, however, and a simple model could hold potential benefits for a researcher without the ability to attain 200 observations. Thus, research on models with fewer than 200 observations needs more consideration. Two manuscripts are presented, both stemming from a 3 x 3 factorial simulation with varied sample sizes (n = 50, 100, 200) and factor loadings (λ = 0.60, 0.75, 0.90), with bootstrap samples drawn to the sample size n and to a population sample of size N = 400. The first study looks at SEM fit indices and their independence from the χ² test as well as their potential for bootstrap extension. The second study analyzes the use and ease of bootstrap confidence intervals (CIs) for any of the fit indices used in traditional SEM publications, a much needed addition to the field.

This is a manuscript-based thesis submission for a master's degree in psychology at the University of Rhode Island. This is the first manuscript of two for this thesis.

Manuscript 1
Title: SEM Fit Indices and Chi-Square Independence
Co-Author: Dr. Lisa L. Harlow

Intended Journal: Structural Equation Modeling
Status: Not yet submitted, but with the intention to submit by early 2014.
Manuscript is prepared for submission.
Abstract
Structural equation model (SEM) fit indices have proliferated, but many indices are based on the sample-size-dependent χ² goodness-of-fit test and thus may have similar problems in assessing fit. Debate over SEM fit index usage has continued, but instead of a monochromatic approach, the ideal would be to choose several indices that behave well under many conditions and display a model-based lack of dependence on the χ² goodness-of-fit. Bootstrapping has steadily become a common supplement to statistical parameter estimation and can also assist with potential small sample issues. Using a bootstrapping approach, the current research assesses models with small to moderate sample sizes (50, 100, 200) and moderate to large factor loadings (0.60, 0.75, 0.90), with the idea that large loadings can compensate for small samples (Guadagnoli & Velicer, 1988). Because of the highly correlated nature of fit indices, a partial least squares (PLS) regression was used to assess which indices showed the lowest association with the χ² goodness-of-fit under the multiple conditions. Ratios of the model fit between bootstrap replicates and the initial samples were also compared, along with model non-conformity. The standardized root mean-square residual (SRMR) and McDonald's Centrality Index (MCI) may offer reasonable indices of fit, as they were less related to the χ² goodness-of-fit test under the conditions presented and displayed the least variation across varied bootstrap replications.

SEM Fit Indices and Chi-Square Independence
The debate among structural equation modeling (SEM) practitioners is how to truly assess model fit. Since LISREL was brought to the SEM forefront, the field has been burgeoning with fit indices. The catalog of indices has grown because of known sample size issues with the χ² goodness-of-fit test, which is the only true statistical test used to assess SEM model fit. Bentler (1985) continued the push forward with his SEM program EQS, which presented even more potential fit indices to help supplement the χ². An issue thus surfaces: many of the fit indices regularly used in SEM are themselves based on the χ² goodness-of-fit test, the very test whose shortcomings they were intended to remedy.
Past research has focused on specific indices and their relation to sample size. For example, the Tucker-Lewis Index (TLI) was demonstrated by its authors to be sample size independent. In a similar fashion, the Incremental Fit Index (IFI) was mathematically derived by its author to not rely on sample size.
However, a caveat is that SEM research traditionally involves large sample sizes, with the χ² behaving optimally for sample sizes of at least 200-400. Smaller sample sizes have become slightly taboo in the SEM literature; the reality, however, is that large sample sizes are not always possible. Other research has suggested that simplified designs can still be optimal with fewer than 200 observations, indicating that small samples may be workable in SEM when there are strong factor loadings (Guadagnoli & Velicer, 1988).
Methodological researchers offer varying opinions about the choice of fit indices and the χ² test. One position holds inflexibly that only the χ² should be reported in SEM journals; a rebuttal states that fit indices have no "golden rules" and should be used as supplementary to the χ² index. A further view supports the notion that indices along with power and confidence intervals are necessary for a broader picture of the SEM model. What results is a mix of proponents and non-supporters of fit indices, and a catalog that is so large that any newcomer to SEM may become overwhelmed or simply confused (Gerbing & Anderson, 1992).
Fit indices are known to be correlated quite often with sample size, even though some researchers have tried to mathematically deduce and analyze sample size independence. Strong correlations can still exist between indices and the χ² goodness-of-fit as well as among the other indices themselves. An analysis of actual model fit, and of how well each index predicts the χ² goodness-of-fit, is thus imperative to expand beyond just mathematical independence from sample size.
Several indices have garnered a good amount of use in publications; these include the root mean square error of approximation (RMSEA), the standardized root mean-square residual (SRMR), the comparative fit index (CFI), the normed fit index (NFI), and McDonald's centrality index (MCI). Beyond sample size independence, what could also be beneficial is a comparison of fit index independence from the χ² goodness-of-fit test itself. By addressing sample size and χ² independence, we could perhaps begin to narrow down the list of fit indices to a select few that actually work well as a supplement to the χ² goodness-of-fit test. We should also keep in mind that extremely large sample sizes are rather uncommon in practice, so keeping perspective on realistic sample sizes also seems worthwhile.
One option for small sample analysis is a process called bootstrapping, or sampling with replacement from an original data set to a desired sample size. By taking an empirical sample of size n, we can randomly draw repeated samples (anywhere from 1,000 to 10,000 is common) with replacement to the same size n, with the goal that the repeated sampling will display a comprehensive picture of the sampling distribution (Efron & Tibshirani, 1986). Bootstrapping can also be extended to draw random samples not only of size n but of even larger sizes (e.g., 3n, 5n). Other research has advised against using bootstrapping with samples of size 100 or less, while offering the caveat that further research with simpler models could be possible and beneficial. Others have posited that strong factor loadings can compensate for a small sample size when the model has simple structure (Guadagnoli & Velicer, 1988). A brief overview of bootstrapping is presented next to help illustrate this iterative procedure.

Bootstrapping Basics
Start with a data matrix of observations in which n is the sample size and each of the c columns represents a recorded variable. Each row (r) of the matrix acts as a single observation. The r x c matrix of parameters (p) is expected to be an adequate representation of the expected population parameters θ_p. The r x c matrix for any p can be used in a bootstrapping framework such that each row of the matrix is available for every replicate, with replacement, to create a bootstrap-based data matrix θ̂_p of randomized rows from θ_p. Every row has a probability of 1/r of selection regardless of any previous selections. With thousands of bootstrap replicates (B), θ̂_B will contain B samples of size n, each with r rows randomly selected from θ_p with replacement. A rather thorough picture of the sampling distribution can thus be assumed for any of the parameters p in the B θ̂_p matrices. For data from an unknown distribution F, where X_1, X_2, ..., X_n ∼ F with realizations X_1 = x_1, X_2 = x_2, ..., X_n = x_n, the expected mean is

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i    (1.1)

Bootstrap estimates of the CI can be constructed via a non-parametric approach that finds the desired cutoff values of the empirically based probability mass function. Finding CI estimates from bootstrapping is described briefly below. Efron and Tibshirani (1986) state that a minimum of B = 1000 replicates is necessary for non-parametric CI estimation because of its complex nature. Teetor (2011) displays the use of the quantile function in R with an even larger number of replicates (9,999) to provide an empirical estimation of the CI.
In the case of SEM, the Bollen-Stine bootstrap (BSBS) transforms the data so that the χ² statistic is properly tested. Each θ̂_p will represent a bootstrap sample from which a covariance matrix can be determined, and the SEM model is fit for every one of the thousands of replicates. The parameters of interest would typically be the B χ² values; however, the fit indices of each fitted model could also be estimated B times.
Modern computing power has made bootstrapping readily available, but the understanding of the process itself can perhaps seem daunting. Appendix A displays the structure of a typical bootstrap process, with the resulting θ̂_B matrices allowing for a simulated mean or median value for starters, as well as a model-based CI of any percentage. The quantile function in R can easily compute the cutoff points for a string of means or medians based on θ̂_B at a given CI level.
As an example, a particular fit index of an SEM model, say the SRMR, could be estimated using B = 2000 replicates, or double the minimum B suggested previously (Efron & Tibshirani, 1986). The compiled θ̂_B will be a close approximation to θ_p. Any variable or set of variables in a data set can be bootstrapped with this same simple process (e.g., CFI, χ², RMSEA, etc.). Instead of a manual device such as dice, one could let a computer randomly select the r rows in each replicate so that θ̂_p is equivalent in length to θ_p. Extending θ̂_p by drawing a larger number of observations r than the data set contains is possible as well, but bootstrapping to the same size n as the sample data is usual.
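The resampling and percentile steps described in this section can be sketched in a few lines. The following is a minimal illustration in Python rather than the R code of the appendices; the function names, the seed, and the made-up 7-point data are assumptions introduced here, and the sample mean stands in for whatever statistic (a fit index, for instance) would be re-estimated on each replicate.

```python
import random
import statistics

def percentile(sorted_values, q):
    """Linear-interpolation quantile (0 <= q <= 1) of pre-sorted values."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def bootstrap_ci(rows, statistic, n_boot=2000, size=None, level=0.90, seed=1):
    """Percentile bootstrap: resample rows with replacement n_boot (B) times."""
    rng = random.Random(seed)
    size = len(rows) if size is None else size
    # Each draw picks any row with probability 1/r, regardless of earlier draws.
    estimates = sorted(
        statistic(rng.choices(rows, k=size)) for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    return (percentile(estimates, tail),
            statistics.median(estimates),
            percentile(estimates, 1 - tail))

# Hypothetical data: 50 responses on a 7-point scale (not the thesis data).
data_rng = random.Random(42)
sample = [data_rng.randint(1, 7) for _ in range(50)]

# Bootstrap to the original size n, then extend to a larger size (here 400).
lower, median, upper = bootstrap_ci(sample, statistics.mean)
lower_N, median_N, upper_N = bootstrap_ci(sample, statistics.mean, size=400)
```

The `size` argument corresponds to the extension mentioned above: leaving it unset resamples to the original n, while a larger value draws more rows per replicate from the same observed data.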

Population Covariance Matrices
Matrix algebra can be used to construct a covariance matrix from which data are generated according to the varied parameters in a simulation context. Based on a two factor CFA model with three items each (Figure 1.1), a matrix would contain six manifest variables, and thus the covariance matrix would be 6 x 6, representing two factors with three measures each. To construct a population matrix for the CFA design in Figure 1.1, several matrices are needed: Σ represents a 6 x 6 population matrix, Λ is a 6 x 2 matrix of factor loadings, Φ is a 2 x 2 matrix of the factor correlations, and Θ is a 6 x 6 matrix of measurement errors that standardizes the result so that Σ has a diagonal of all ones. Using this, one can create a population sample of N from which to bootstrap samples of n.
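As a rough illustration of how these matrices combine, the sketch below builds Σ = ΛΦΛ' + Θ in plain Python for one loading value. It assumes identical loadings within each factor, a factor correlation of 0.50, and error variances chosen so that the diagonal of Σ equals one; the function names are hypothetical and this is not the thesis's R code.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def population_covariance(loading, factor_corr=0.50, items_per_factor=3):
    """Sigma = Lambda Phi Lambda' + Theta for a two-factor CFA."""
    # Lambda: first three items load on factor 1, last three on factor 2.
    lam = ([[loading, 0.0] for _ in range(items_per_factor)] +
           [[0.0, loading] for _ in range(items_per_factor)])
    phi = [[1.0, factor_corr], [factor_corr, 1.0]]
    common = matmul(matmul(lam, phi), transpose(lam))
    p = len(lam)
    # Theta: diagonal error variances that bring each diagonal of Sigma to 1.
    theta = [[(1.0 - common[i][i]) if i == j else 0.0 for j in range(p)]
             for i in range(p)]
    return [[common[i][j] + theta[i][j] for j in range(p)] for i in range(p)]

sigma = population_covariance(0.60)
for row in sigma:
    print([round(value, 2) for value in row])
```

With a loading of 0.60, items on the same factor share a covariance of 0.60 x 0.60 = 0.36, and items on different factors share 0.60 x 0.50 x 0.60 = 0.18.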
SEM requires one item loading or the variance of each factor to be set to 1.00 before analysis. The constructed Λ matrices contain identical values within an acceptable range (0.50, 1.00), and the minimum criterion of three items per factor to prevent model misfit was also met. Thus, the Λ loading matrix is comprised of:

\Lambda = \begin{bmatrix} \lambda & 0 \\ \lambda & 0 \\ \lambda & 0 \\ 0 & \lambda \\ 0 & \lambda \\ 0 & \lambda \end{bmatrix}

Based on suggestions from past research, the factor correlation should be minimally 0.50, and therefore the factor correlation matrix consists of:

\Phi = \begin{bmatrix} 1.00 & 0.50 \\ 0.50 & 1.00 \end{bmatrix}

Finally, the error matrix Θ is constructed such that Σ ends up with all diagonals of 1.00. The computation necessary for the covariance matrix Σ needed for simulating the population-based datasets of N = 400 is completed with:

\Sigma = \Lambda \Phi \Lambda' + \Theta

A unique aspect of bootstrapping within SEM is that because the χ² goodness-of-fit is testing against the null hypothesis, a readjustment needs to be made to the sample so that we are properly resampling the model fit within a χ² framework. Thus, a Bollen-Stine bootstrap (BSBS) requires a transformation of the data matrix so that the samples report accurate simulated p-values (Figure 1.2). The resulting BSBS gives a composite Bollen-Stine p-value (BSp) that is useful to help verify the model fit but also allows for examining fit index behavior under repeated samples. The transformation converts the covariance matrix used in the χ² goodness-of-fit test to accurately test the null by forming a new data set, Z, from the initial data, where Y represents the centered raw data, S represents the sample covariance matrix of Y, and Σ̂ is the implied covariance matrix, such that:

Z = Y S^{-1/2} \hat{\Sigma}^{1/2}

Bollen and Stine (1992) then mathematically deduce that the covariance matrix of Z, Z'Z/(N - 1), is now equal to the implied covariance matrix Σ̂, and that bootstrapping from the transformed data in Z will accurately test the null hypothesis in all bootstrap samples.
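The equality deduced by Bollen and Stine (1992) can be checked in a few lines. The sketch below assumes the usual form of the transformation, Z = Y S^{-1/2} \hat{\Sigma}^{1/2}, with Y centered so that S = Y'Y/(N - 1), and with S^{-1/2} and \hat{\Sigma}^{1/2} taken to be symmetric matrix square roots:

```latex
\begin{aligned}
\frac{Z'Z}{N-1}
  &= \hat{\Sigma}^{1/2} S^{-1/2} \left( \frac{Y'Y}{N-1} \right) S^{-1/2} \hat{\Sigma}^{1/2} \\
  &= \hat{\Sigma}^{1/2} S^{-1/2} \, S \, S^{-1/2} \hat{\Sigma}^{1/2} \\
  &= \hat{\Sigma}^{1/2} \hat{\Sigma}^{1/2} = \hat{\Sigma}
\end{aligned}
```

Because the transformed data reproduce the model-implied covariance matrix exactly, the null hypothesis holds in the data being resampled, which is what allows the bootstrap distribution of χ² to serve as a reference distribution.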

Absolute Fit Indices
The absolute indices are based simply on the covariance matrix and are most notably represented by the χ² goodness-of-fit test and its comparison to a critical value for the given degrees of freedom (df). It is widely known that χ² is affected by sample size; however, it is usually reported, and a non-significant value is preferred in SEM research. Absolute fit indices in the current study will thus be represented by the χ² as well as another common absolute fit index, the SRMR. The suggested maximum value for the SRMR has generally been 0.08, but this value has been noted as being too liberal, and 0.06 might be better.
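To make the SRMR concrete, the sketch below computes it under one common definition: the square root of the mean squared standardized residual over the p(p + 1)/2 unique elements of the covariance matrix. Software packages differ in the details of the standardization, so this is an assumption-laden illustration with made-up matrices, not the computation used by any particular program.

```python
import math

def srmr(sample_cov, implied_cov):
    """Root mean squared standardized residual over the lower triangle."""
    p = len(sample_cov)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            # Standardize each residual by the sample standard deviations.
            scale = math.sqrt(sample_cov[i][i] * sample_cov[j][j])
            resid = (sample_cov[i][j] - implied_cov[i][j]) / scale
            total += resid ** 2
    return math.sqrt(total / (p * (p + 1) / 2))

# Hypothetical 2 x 2 example: the model under-reproduces one covariance by 0.10.
observed = [[1.0, 0.5], [0.5, 1.0]]
implied = [[1.0, 0.4], [0.4, 1.0]]
value = srmr(observed, implied)
```

Note that neither the χ² statistic nor the sample size enters this calculation, which is one reason the SRMR can behave differently from the χ²-based indices.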

Non-centrality Fit Indices
Non-centrality indices are those which measure the distance between the covariance matrix and the matrix of the null model. The RMSEA has been reported very often in the literature, particularly because it assumes initially that the model will not have a perfect fit. Similar to the SRMR, the RMSEA has been reported to hold rather steady even with slightly non-normal data, as is common in psychometrics. The use of the RMSEA with a 90% CI has become common and has been recommended by several researchers within SEM. Not unlike the RMSEA, the MCI has also been shown to lack dependence on sample size and has performed well in simulations (Ding, 1996; Gerbing & Anderson, 1992); it will be studied further in the current research. The CFI is a third non-centrality index that has shown some sample size independence and should be considered simply because of its common use in the SEM field, with values between 0.95 and 1.00 being preferred (Ding, 1996).
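The dependence of these indices on the χ² can be seen directly in their usual textbook formulas, sketched below. The formulas are the commonly cited ones (some programs use N rather than N - 1), the function names are hypothetical, and the χ² values are invented for illustration; df = 8 and a baseline df of 15 correspond to a two-factor, six-indicator CFA.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation from chi-square, df, and n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def mci(chi2, df, n):
    """McDonald's centrality index."""
    return math.exp(-0.5 * (chi2 - df) / (n - 1))

def cfi(chi2, df, chi2_base, df_base):
    """Comparative fit index relative to a baseline (null) model."""
    d_model = max(chi2 - df, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

# Same hypothetical chi-square (16 on 8 df) evaluated at the three sample sizes.
for n in (50, 100, 200):
    print(n, round(rmsea(16.0, 8, n), 3), round(mci(16.0, 8, n), 3))
print(round(cfi(16.0, 8, 216.0, 15), 3))
```

All three functions are driven by the same quantity, χ² - df, which is why strong correlations between these indices and the χ² are plausible even when the indices adjust for sample size.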

Relative Fit Indices
The TLI has become one of the primary indices for relative fit since its introduction and has performed well, but it has shown some dependence on the interaction of sample size and factor loading. The similarity of the TLI to the non-centrality based CFI has been noted, with the advice that because of such high intercorrelation only one should be reported, and the TLI is sometimes preferred. This research has opted to go with both the CFI (non-centrality) and the TLI due to both their popularity and the need for further analysis of both. The IFI will be the second relative fit index tested, particularly because of its reported independence from sample size. The Normed Fit Index (NFI) will also be included as a relative fit index, due in part to its popularity even with some advice against its use (Ding, 1996). While the IFI and TLI can slightly exceed values of 1.00, all three relative fit indices are generally expected to fall between 0.00 and 1.00, with a preferred acceptable range of 0.95 to 1.00.
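The relative indices compare the fitted model's χ² with that of a baseline (null) model. The sketch below uses the commonly cited formulas with hypothetical function names and invented χ² values; it also shows how the TLI and IFI can exceed 1.00 when the model χ² falls below its degrees of freedom, while the NFI cannot.

```python
def nfi(chi2, chi2_base):
    """Normed fit index: proportional reduction in chi-square from baseline."""
    return (chi2_base - chi2) / chi2_base

def tli(chi2, df, chi2_base, df_base):
    """Tucker-Lewis index, based on chi-square/df ratios."""
    ratio_base = chi2_base / df_base
    return (ratio_base - chi2 / df) / (ratio_base - 1.0)

def ifi(chi2, df, chi2_base):
    """Incremental fit index."""
    return (chi2_base - chi2) / (chi2_base - df)

# Hypothetical values: model chi-square of 6 on 8 df, baseline of 216 on 15 df.
print(round(nfi(6.0, 216.0), 3))
print(round(tli(6.0, 8, 216.0, 15), 3))
print(round(ifi(6.0, 8, 216.0), 3))
```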
Thus, the goal of the current research is to address the two-part question of χ² independence and the bootstrapping potential of select fit indices that have seemingly fared well in past research, such as the MCI and IFI (Ding, 1996; Gerbing & Anderson, 1992). Focus will be limited to small and moderate sample sizes of 50, 100, and 200 with moderate to good loadings of 0.60, 0.75, and 0.90 in a simple two factor CFA. Bootstrapped samples can be compared to the initial sample as well as bootstrapped back to the population sample size of N = 400 for direct comparison.
Further research has been needed for some time on the fit indices and their supplementation of the χ² for small n designs (Ding, 1996; Gerbing & Anderson, 1992). What is preferred are indices that remain stable over varied model parameters and bootstrap replicates that demonstrate a strong correlation with the original data set. Indices that show minimal correlation with the χ² based on model fit would also be highly beneficial. Model non-convergence can also be an issue, particularly with samples below 100, and thus analysis of non-convergence will also be important (Guadagnoli & Velicer, 1988).

Research Goals
First, based on past SEM simulation research, it is hypothesized that the MCI might be a prime candidate for use in small sample SEM designs (Gerbing & Anderson, 1992). Second, the TLI, CFI, and IFI tend to be strongly correlated, and thus it is also hypothesized that they will behave similarly across conditions and that only one really needs to be reported. Third, it is hypothesized that bootstrap replicates will highlight very inconsistent χ² values and that stable fit indices could help support the sample-size-influenced χ² test. Fourth, the MCI and SRMR will show strong bootstrap consistency between population (N) and samples (n) of varied conditions. Finally, samples of n = 50 will likely see the largest number of non-converging models, particularly with lower levels of λ (Guadagnoli & Velicer, 1988).

Methods
A simulation-based study was used to address the bootstrap-based qualities of eight SEM fit indices with smaller sample sizes. A simple confirmatory factor analysis (CFA) model with two factors and three items each (Figure 1.1) was used, based on the suggestions of past research, with varied small sample sizes (n = 50, 100, 200) and moderate to strong factor loadings (λ = 0.60, 0.75, 0.90), which can help compensate for a small n (Guadagnoli & Velicer, 1988). The factor correlation (φ) between the two factors was set at a modest 0.50, which is considered a lower bound for well-fitting models but could also help prevent over-powering the varied sizes of n and λ.
A population matrix based sample of N = 400 was generated for every condition. Thirty total samples were drawn to increase the robustness of the study and to exceed the recommended five samples per population matrix (Guadagnoli & Velicer, 1988). Seeds were used to randomly draw the data in R (see Appendix B); however, univariate and multivariate kurtosis were not checked, to allow for varied types of model fit in each condition. The data were simulated on a 7-point integer-based Likert scale, quite typical in social science research. Figure 1.3 depicts the simulation process used for this research.
All simulations were conducted using the lavaan package in R (R Core Team, 2013). A total of 270 simulated samples (i.e., 3 n's x 3 λ's x 30 seeds) were analyzed, with 2,000 bootstrap replicates taken in each scenario, for a total of 558,000 models being fit. The initial fit for the population samples (N) and for the samples drawn from the population (n) was stored, as were the bootstrap replicated fit indices and CIs for the sample n bootstrapped to a size of n as well as to the population sample size for comparison (N = 400). R code for the simulation is presented in Appendix B.
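The factorial layout of the 270 simulated samples can be enumerated as follows. This is only a sketch of the design grid in Python; the seed values 1 through 30 are hypothetical placeholders, since the actual seeds are given in Appendix B.

```python
from itertools import product

sample_sizes = (50, 100, 200)
loadings = (0.60, 0.75, 0.90)
seeds = range(1, 31)  # hypothetical seed values, one per replication

# One entry per simulated sample: 3 sample sizes x 3 loadings x 30 seeds.
design = [
    {"n": n, "loading": lam, "seed": seed}
    for n, lam, seed in product(sample_sizes, loadings, seeds)
]

# Number of simulated samples within each n-by-loading condition.
per_condition = {}
for cell in design:
    key = (cell["n"], cell["loading"])
    per_condition[key] = per_condition.get(key, 0) + 1

print(len(design), len(per_condition))
```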
Because of the high amount of correlation between most of the fit indices and the truncated nature of the zero to one boundaries, standard multiple regression would not be advised to assess findings. Thus, a partial least squares (PLS) method of regression will be used, which utilizes a principal components analysis approach and has been shown to be optimal with small data sets as well as with high intercorrelations. The goal is to use a PLS regression to find the weakest fit index predictors of the χ² value for each condition. The indices that are the weakest predictors can be presumed to be less associated with the χ² goodness-of-fit; however, that does not mean there is no correlation between the fit index and the χ². The plsdepot package in R was used for the PLS regression.
Also recorded were the counts of model non-convergence for the three sizes of n and λ. The fit indices for the original population sample N were recorded, as were the indices for the sample n randomly selected from N. A BSp was calculated and recorded for each of the 270 total samples bootstrapped, as were the quantile-based 90% CI values for the bootstrapped models and a corresponding median value of the replicates.
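The summaries recorded for each bootstrapped sample can be sketched as below. The sketch assumes the χ² values have already been obtained from models fit to Bollen-Stine transformed resamples, and it takes the BSp to be the proportion of bootstrap χ² values at least as large as the observed one. The function names and the ten made-up values are illustrative only; an actual run would use thousands of replicates.

```python
import statistics

def bollen_stine_p(observed_chi2, bootstrap_chi2):
    """Proportion of bootstrap chi-square values >= the observed value."""
    exceed = sum(1 for value in bootstrap_chi2 if value >= observed_chi2)
    return exceed / len(bootstrap_chi2)

def summarize(replicates):
    """Median and quantile-based 90% interval (5th and 95th percentiles)."""
    cuts = statistics.quantiles(replicates, n=20, method="inclusive")
    return cuts[0], statistics.median(replicates), cuts[-1]

# Hypothetical bootstrap chi-square values for one sample.
boot_chi2 = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0]
bsp = bollen_stine_p(15.0, boot_chi2)
lower, median, upper = summarize(boot_chi2)
```

The same `summarize` step applies unchanged to the replicated values of any fit index, which is what makes interval estimates available beyond the RMSEA.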

Results
The correlation matrix of the indices with the χ² goodness-of-fit in the original population sample of N and the sample of n, as well as in the bootstrap replicates of both sizes, suggests varied results (see Table 1.1). The NFI has the lowest correlations with χ² regardless of sample size or whether bootstrapping was used. ... followed closely by the IFI, again supporting strong correlation between the three indices (see Table 1.2). The SRMR was the least significant indicator of χ² for the samples of size n (standardized weight = 0.008), followed by the correlated CFI and TLI, which had identical values (0.025). The IFI is the least associated with χ² for the bootstrap replicate models of size n (0.006), whereas the RMSEA (0.485) and MCI (0.405) both display very large standardized weights with χ². In the replicates of size N = 400, the SRMR shows a very weak association with χ² (0.026). Compiling all four of these conditions, a PLS regression was run for the indices, with the TLI and CFI being the weakest in association with the χ² (r = 0.009). The SRMR also shows a weak association in the overall PLS comparison with the χ² test (r = 0.053).
Correlations between bootstrapped (BS) models and the original samples of N and n for all the fit indices (see Table 1. ...

Non-conformity within the n = 50 bootstrap replicates was influenced by λ, with loadings of 0.60 (n = 8879), 0.75 (n = 847), and 0.90 (n = 2) (see Table 1.5). The replicates to size n = 400 for n = 50 only resulted in non-conforming models for λ = 0.60 (n = 1970), while λ = 0.75 and 0.90 had no issues with conformity. Samples of 100 (n = 2752) and 200 (n = 107) for the λ = 0.60 subgroup in the replicates to size n, as well as samples of 100 (n = 1309) and 200 (n = 5) in the replicates to n = 400, had varied conformity results. Table 1.5 also shows that the replicates in the λ = 0.75 and λ = 0.90 conditions all had 0 non-conforming models for BS replicates from n back to the size of 400. Interactions of n and λ are evident for the χ² goodness-of-fit in the initial samples drawn from N (Figure 1.5). An interaction is not evident for BS replicates of χ² with size n (Figure 1.6), but an interaction is observed for the BS replicates of size N (Figure 1.7). The RMSEA has an interaction effect for the samples of size n (Figure 1.8), but not for the bootstrap replicates of size n (Figure 1.9). There is an interaction again for the RMSEA in Figure 1.10, where n is BS replicated to size N. The MCI has an interaction of n and λ for samples of n (Figure 1.11), but not for the BS replicates of size n (Figure 1.12).

Discussion
The correlational analysis of the fit indices from this research portrays the RMSEA as highly correlated with the χ² test; the RMSEA also has the strongest association with the χ² in the PLS analysis. The RMSEA may thus be acceptable as a supplement, but its strong association with the χ² for model fit in the context of small samples suggests that it should not be a primary index of SEM model fit.
According to the PLS analysis in this research (see Table 1.2), the IFI seems to show the weakest correlation with χ², while the SRMR excelled in the BS replicates to the population size of N = 400. The NFI appeared weakly correlated with the χ², particularly for the initial population fits; however, the advice against its use should be kept in mind as well (Ding, 1996).
The NFI and SRMR also showed strong inter-correlation between their initial values and the BS replicate values. The RMSEA, on the other hand, had very little inter-correlation within its values, which again supports the notion that the RMSEA might not be an optimal primary fit index because of its high correlation with χ² and its potential for interaction effects. Because of these findings, the RMSEA could be advised against for bootstrap extensions within SEM, given its poor inter-correlation between samples and bootstrap replicates.
The CFI, IFI, and TLI behaved similarly in their ratios of initial population and sample fit measures to their bootstrap replicated fit measures, regardless of sample size. Because of the similar correlations and ratios, it may be necessary to select only one of the three indices, since they have negligible differences. The χ² and RMSEA, not surprisingly, had varied and rather large overall ratios of non-bootstrapped initial fit measures to BS fit measures for both sizes of n and N. The MCI, however, seemed to hold a ratio closest to one for its initial fit in comparison to its BS replicate fits and could be an ideal stable index for use in small n models and bootstrapping in general.
The issues of non-conformity for sizes of n and N were quite evident when λ = 0.60, but improved as n increased. The combination of λ = 0.60 and n = 50 seems to be very inconsistent with regard to conforming models and would thus be ill-advised for bootstrapping use. Extending the bootstrap size to N did seem to alleviate some of the non-conformity issues, particularly for λ = 0.75. The overall number of non-conforming models for λ = 0.60 did decrease considerably with the larger bootstrap size, so this option could potentially assist a very small n in assessing its model fit. The BSp values did not show a difference with regard to λ, but did for n, as would be expected given that χ² is influenced by n. The interaction of n and λ for the BSp (see Figure 1.3) does suggest that the BSp is much smoother for λ = 0.90 than for the other λ's. Again, larger loadings could help with consistency when n is small, and even for bootstrap purposes.
The hypothesis that the MCI could be an optimal index for small samples was supported, as the MCI displayed a very consistent ratio of initial and bootstrap values across sample sizes. The hypothesis that the MCI and SRMR would be optimally consistent for small n models was only partly supported, in that the MCI did not have a very strong correlation between n and N replicates, although the SRMR did fare well. The hypotheses that the χ² fit would be sporadic across the varied conditions and that the samples of n = 50 would have the most non-conforming models were both fully supported. Finally, the TLI, CFI, and IFI did prove to be rather correlated as expected, with the IFI potentially being slightly better with regard to χ² independence.
Simulation studies always have the limitation of non-real data being used. While the samples used in this simulation were based strictly on 1-7 Likert scale interval-based results, they were still artificial. The samples were not specifically checked for univariate and multivariate skewness and kurtosis as is advised, so that the simulated data would vary beyond only very good fitting models. Empirical data should ideally meet univariate normality for skewness and kurtosis, as well as multivariate measures of normality such as that provided by Mardia's test for multivariate normality, to allow for proper estimation by maximum likelihood (Gao, Mokhtarian, & Johnston, 2008; Harlow, 1985).
Future research could aim at a bootstrap approach to larger samples as well. Fit indices could vary with larger n designs, especially since many are based on the χ², which is influenced by n. The use of real empirical data sets with this bootstrap approach to small n designs could help add credence to the reliability of the BSBS for SEM models. More research with varied factor loadings could display some different patterns as well, and thus a larger set of conditions could help to delineate the relationship between fit indices and λ.
What seems to be reasonable for fit indices is to act as a supplement to the χ² goodness-of-fit. Instead of focusing on whether fit indices should even be used, perhaps fit indices should be used as they were intended: as supporting evidence alongside the χ². Fit index independence from sample size alone is an issue of its own, but when determining which fit indices are optimal for reporting in SEM literature, looking at the correlations among the fit indices, and particularly with the χ², may be more important. Including two or three fit indices that function differently from one another and yet supplement the χ² goodness-of-fit regardless of sample size could perhaps be the most beneficial use of indices in SEM.
Keeping in mind that indices are not true statistical tests and that there really are no set cutoff limits, it becomes the researchers' role to provide specific interpretation relative to their study. However, cherry-picking the indices that best support one's model within a single study runs counter to the purpose of using a fit index in the first place and should not have a place in SEM research.
Therefore, a primary goal for fit indices is to thin out the massive catalog of them and find the several that are least correlated with one another and with the χ² goodness-of-fit test. The current research suggests the CFI, TLI, and IFI as potentials, but again their high intercorrelation should lead researchers to utilize only one of the indices. Highly correlated index measures such as the RMSEA may not provide enough non-redundant information, as they are mimicking the χ² goodness-of-fit too closely. Thus, more research on this index is also needed.
Further work is needed in the area of fit index correlation and χ² goodness-of-fit independence, beyond the initial presentation in this research, to make solid claims about which indices are truly optimal. Model specification and complexity will ultimately weigh heavily on optimal fit index usage, so continued research is needed. However, within a small n framework, it appears that the IFI and SRMR might provide good indices because of their relatively low association with the χ². The MCI also fared rather well in this small sample bootstrap framework, particularly in keeping a very consistent ratio of actual fit to bootstrap fit. The SRMR and MCI could both be strongly advised for small n models as well as when bootstrapping a model becomes a reasonable option.
The χ² goodness-of-fit is usually presented for covariance modeling, but the fit indices used need further clarification. This research seems to suggest that the χ² together with the SRMR and MCI, as well as either the IFI, TLI, or CFI, would be most optimal. The BSp should also be included for any SEM research, especially when the BSBS is a readily available option. More work is needed in the area of using bootstrap replication to address small sample size issues in SEM; however, there seem to be some potential strengths within some indices such as the MCI, SRMR, and IFI. While BS sampling can never replace real data, it can help to assess the sampling distribution used and could offer an opportunity for research where samples of at least 200 might be hard to achieve.

Correlations between the population samples of N = 400, the samples of size n drawn from N, and the bootstrap (BS) replicated models of size n and n = 400. Cronbach's α is also included for each index between the samples of N = 400 and n = n as well as the bootstrap replicates of n = n and n = 400.
Index r(N,BS N ) r(n, BS n) r(n, N ) Cronbach ↵

Manuscript format is in use.
This is a manuscript-based thesis submission for a master's degree in psychology at the University of Rhode Island. This is the second manuscript of two for this thesis.

Manuscript 2
Title: Confidence Interval Estimation for SEM Fit Indices

Confidence Interval Estimation for SEM Fit Indices
Confidence intervals (CIs) have become a widely accepted, and in many cases required, practice in modern scientific research (Cumming, 2012). The rapidly growing field of structural equation modeling (SEM) has, however, been slow to present viable CIs for its many indices of fit, the root mean square error of approximation (RMSEA) being the exception. This deficit has led some to conclude that fit indices may not be appropriate. For example, some have argued for eliminating fit indices altogether, whereas others have countered with the suggestion that power and confidence intervals be made more readily available in SEM research.
The RMSEA is a fit index that is commonly presented with a 90% CI in SEM results; however, that CI is not empirically based, nor have any other indices adopted CIs. It has also been noted that there are no "golden" cutoff values for indices, and therefore CIs could offer a more complete picture.
The potential problem for SEM CIs is that estimation is best done by bootstrapping, a procedure that repeatedly resamples with replacement from an empirical set of data to estimate a parameter's sampling distribution as well as its CI. A bootstrap simulation within the SEM framework requires a transformation of the data matrix; otherwise the null hypothesis is not tested correctly by the χ² goodness-of-fit (Figure 2.1), and the results will be erroneous without a proper shift of the χ² distribution. This resampling problem in SEM models is corrected by the Bollen-Stine bootstrap (BSBS), which allows for an adjusted simulated p-value for an empirical SEM model. The transformation also allows sampling distributions to be constructed for fit indices from empirical data, so 90% CI estimation is possible not only for the RMSEA or the χ² goodness-of-fit test, but also for any fit index a researcher might choose to use. The BSBS has proven quite useful in SEM, yet it has not become common knowledge in the field.
A naive bootstrap, which selects observations repeatedly with replacement from the empirical sample, is the simplest bootstrapping approach and generally works with most data. However, naive bootstrapping in SEM has repeatedly been shown to be inaccurate. Thus, the Bollen-Stine transformation must be performed so that the covariance matrix of the data becomes consistent with the null hypothesis. The resulting Bollen-Stine p-value (BSp) is then the number of bootstrapped χ² values larger than the χ² statistic of the empirical dataset used to bootstrap the SEM model, divided by the number of replicates. The transformation to the new data set Z is given by the following equation, where Y represents the centered raw data, S represents the sample covariance matrix of Y, and $\hat{\Sigma}$ is the implied covariance matrix:

$$Z = Y S^{-1/2} \hat{\Sigma}^{1/2} \qquad (2.1)$$

It can then be deduced mathematically that the covariance matrix of Z, $Z'Z/(N - 1)$, is equal to the implied covariance matrix $\hat{\Sigma}$, so bootstrapping from the transformed data in Z will accurately test the null hypothesis in all bootstrap samples. While this process may seem tedious, the lavaan package in R is capable of BSBS replications, and thus CI estimation is nowhere near as difficult as might be presumed. Some preliminary information on bootstrapping is presented below.
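The covariance property of the Bollen-Stine transformation can be checked numerically. The following is a minimal base R sketch, assuming the usual form of the transformation, Z = Y S^(-1/2) Σ̂^(1/2). The helper mat_power, the simulated data, and the stand-in matrix Sigma_hat are hypothetical and for illustration only; in a real analysis Σ̂ would be the covariance matrix implied by the fitted model.

```r
# Hypothetical helper: power of a symmetric positive definite matrix
# computed through its eigendecomposition
mat_power <- function(M, p) {
  e <- eigen(M, symmetric = TRUE)
  e$vectors %*% diag(e$values^p, nrow = length(e$values)) %*% t(e$vectors)
}

set.seed(1)
n <- 50

# Centered raw data Y (simulated stand-in for an empirical sample)
Y <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)
S <- cov(Y)                                    # sample covariance matrix of Y

# Stand-in for the model-implied covariance matrix (an assumption, not a fitted model)
Sigma_hat <- matrix(c(1.0, 0.5, 0.5,
                      0.5, 1.0, 0.5,
                      0.5, 0.5, 1.0), 3, 3)

# Transformed data set Z
Z <- Y %*% mat_power(S, -0.5) %*% mat_power(Sigma_hat, 0.5)

# Largest absolute difference between cov(Z) and Sigma_hat (numerically zero)
max_diff <- max(abs(cov(Z) - Sigma_hat))
max_diff
```

Because Y is centered, cov(Z) equals Z'Z/(n - 1), so resampling rows of Z draws from data whose covariance matrix matches the null hypothesis exactly.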

Bootstrapping Basics
Start with a data matrix of observations in which n is the sample size and each of the c columns represents a recorded variable. Each row (r) of the matrix acts as a single observation. The r x c matrix of parameters (p) is expected to be an adequate representation of the population parameters $\theta_p$. The r x c matrix for any p can be used in a bootstrapping framework such that each row is available, with replacement, for every replicate, creating a bootstrap-based data matrix $\hat{\theta}_p$ of randomly selected rows from $\theta_p$. Every row has a probability of 1/r of being selected, regardless of any previous selections. With thousands of bootstrap replicates (B), $\hat{\theta}_B$ will contain B samples of size n, each with r rows randomly selected with replacement from $\theta_p$. A rather thorough picture of the sampling distribution can thus be obtained for any of the parameters p in the B $\hat{\theta}_p$ matrices. For data from an unknown distribution F, where $X_1, X_2, \ldots, X_n \sim F$ with observed values $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, the expected mean is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (2.2)$$

Bootstrap estimates of the CI can be constructed via a non-parametric approach that finds the desired cutoff values of the empirically based probability mass function; this is described briefly below. Efron and Tibshirani (1986) state that a minimum of B = 1000 replicates is necessary for non-parametric CI estimation because of its complex nature. Teetor (2011) demonstrates the quantile function in R with an even larger number of replicates (9,999) to provide an empirical estimate of the CI.
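The row-resampling scheme described above can be sketched in a few lines of base R. This is only an illustration under assumed values: the data matrix dat, its variable names, and the choice of the mean of x1 as the statistic are hypothetical.

```r
set.seed(1)

# Hypothetical data matrix: n = 50 observations (rows), 3 variables (columns)
n <- 50
dat <- matrix(sample(1:7, n * 3, replace = TRUE), nrow = n,
              dimnames = list(NULL, c("x1", "x2", "x3")))

B <- 1000                       # minimum number of replicates noted in the text
boot_stat <- numeric(B)

for (b in seq_len(B)) {
  # Draw n row indices with replacement; every row has probability 1/n each draw
  rows <- sample(n, size = n, replace = TRUE)
  boot_dat <- dat[rows, , drop = FALSE]     # one bootstrap data matrix
  boot_stat[b] <- mean(boot_dat[, "x1"])    # statistic of interest for this replicate
}

# Non-parametric 90% CI from the empirical distribution of the replicates
quantile(boot_stat, c(0.05, 0.95))
```

Any other statistic computed from boot_dat, including a fit index from a refitted model, could be stored in place of the mean.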
In the case of SEM, the Bollen-Stine bootstrap (BSBS) transforms the data so that the χ² statistic is properly tested. Each $\hat{\theta}_p$ represents a bootstrap sample from which a covariance matrix can be computed and the SEM model fit, for every one of the thousands of replicates. The parameters of interest would typically be the B χ² values; however, the fit indices of each fitted model can also be estimated B times.
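Once the B bootstrapped χ² values are in hand, the Bollen-Stine p-value and a percentile CI follow directly. The sketch below uses stand-in numbers only: chisq_boot is drawn from a χ² distribution with an assumed 8 degrees of freedom, and chisq_obs is a hypothetical observed statistic. In practice both would come from fitting the model to the empirical sample and to each transformed bootstrap sample.

```r
set.seed(7)

# Stand-in values (assumptions for illustration, not results from a fitted model)
B <- 2000
chisq_boot <- rchisq(B, df = 8)   # B bootstrapped chi-square statistics
chisq_obs  <- 14.2                # hypothetical chi-square of the empirical sample

# BSp: number of bootstrapped values larger than the observed statistic,
# divided by the number of replicates
bsp <- sum(chisq_boot > chisq_obs) / B
bsp

# 90% percentile interval and median of the bootstrapped chi-square values
quantile(chisq_boot, c(0.05, 0.50, 0.95))
```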
Modern computing power has made bootstrapping readily available, but understanding the process itself can seem daunting. Appendix A displays the structure of a typical bootstrap process, in which the resulting $\hat{\theta}_B$ matrices allow for a simulated mean or median value, for starters, as well as a model-based CI of any percentage. The quantile function in R can easily compute the cutoff points for a vector of means or medians based on $\hat{\theta}_B$ at a given CI level.
As an example, a particular fit index of an SEM model, say the SRMR, could be estimated using B = 2000 replicates, or double the minimum B suggested previously (Efron & Tibshirani, 1986). The 90% CI could then be found from $\hat{\theta}_{SRMR}$ (stored in R as theta.srmr), which holds the SRMR of the model fit to each of the 2000 replicates. To find a 90% CI, we need to estimate the cutoffs for the lower 5% and the upper 5% of the replicated values:

quantile(theta.srmr, c(0.05, 0.95))

This command can be extended easily by adding a third quantile value (inside the c( ) in the command) that gives the median of the sample. The quantile function in R works similarly to constructing a box-plot, but with the user able to specify the cutoff points for the CI. In this case, we could find the 90% CI and the median for $\hat{\theta}_{SRMR}$ as such:

quantile(theta.srmr, c(0.05, 0.50, 0.95))

A similar process can be carried out for any fit index of an SEM model, including the χ². While tedious, a typical model needs only one set of bootstrap replicates, so bootstrapping is not as time consuming as one might perceive. Bootstrapping packages exist in R, and most commercial programs offer bootstrap simulation as well. Allowing the computer to store all of the $\hat{\theta}_p$ matrices and then compute the necessary information makes for an uncomplicated bootstrap process that adds great depth to any researcher's expertise. The process assumes that the large number of $\hat{\theta}_p$ replicates will adequately map the sampling distribution of $\theta_p$. In doing so, difficult parameters (i.e., fit indices) can be estimated with CIs that are based on the data as opposed to complex mathematical processes.

Simple Bootstrapping Examples
A very simple bootstrap procedure can be illustrated with the roll of a die. Say we roll the same die 20 times: the numbers on the die never change, and every value has an equal chance of selection on each roll, as in sampling with replacement. Using the quantile function in R, we could find the median along with the 5th and 95th percentiles, which bound a basic 90% interval, for a standard die whose recorded values are stored in a variable we'll call dice.
> dice
 [1] 6 3 3 5 3 2 2 5 1 3 1 1 3 5 1 4 5 3 3 4
> mean(dice)
[1] 3.15
> quantile(dice, c(0.05, 0.50, 0.95))
  5%  50%  95%
1.00 3.00 5.05

While the sample of one die is generic, it gives the basic concept of what bootstrap replication accomplishes (i.e., a mean, median, CI). Now if we were to roll five dice at a time, and take the mean of each roll 20 times, we'd be able to construct a CI around the mean of the five dice. Note that each roll can have a mix of any numbers, ranging anywhere from all different values to all the same. The compiled $\hat{\theta}_B$ will be a close approximation to $\theta_p$. The same simple bootstrap process can be applied to any variable or variables in a data set (e.g., CFI, χ², RMSEA, etc.). Instead of dice, one could let a computer randomly select r rows in each replicate so that $\hat{\theta}_p$ is equivalent in length to $\theta_p$. Extending $\hat{\theta}_p$ by drawing more observations r than the data set contains is possible as well, but bootstrapping to the same size n as the sample data is usual.
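The five-dice version of the example can be sketched as follows. The seed, the object names, and the choice of B = 2000 are assumptions made for illustration; the rolls are simulated rather than recorded.

```r
set.seed(42)

# 20 rolls of five fair dice; each row of the matrix is one roll
rolls <- matrix(sample(1:6, 20 * 5, replace = TRUE), nrow = 20, ncol = 5)
roll_means <- rowMeans(rolls)        # mean of the five dice on each roll

# Naive bootstrap: resample the 20 roll means with replacement B times
B <- 2000
boot_means <- replicate(B, mean(sample(roll_means, replace = TRUE)))

# 90% percentile CI and median of the bootstrapped means
ci <- quantile(boot_means, c(0.05, 0.50, 0.95))
ci
```

Unlike the quantiles of the raw rolls, this interval describes the sampling variability of the mean itself, which is the quantity a bootstrap CI is meant to capture.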

Applying Bootstrap CIs to SEM Fit Indices
To find the 90% CI of a bootstrapped index or parameter, regardless of the number of replicates, one must find the interval between the 5th and 95th percentiles of the replicated values, which can be done with the quantile function in R. The quantile limits thus give the range within which 90% of the replicated values fall; the narrower the CI, the stronger the presumption that the simulated parameter is accurate. It should also be presumed that the empirical index value falls within this CI range; otherwise the index might not represent the model well. Some researchers warn against bootstrapping SEM models with samples smaller than 100, and others state that SEM sample sizes should be at least 200 for reasons of accuracy. However, work with small n models with simple designs could be possible and needs further research. In essence, small sample size fit indices could be verified or rejected by CI estimation using the BSBS procedure, and empirical CIs with a median could therefore be readily presented for any of the fit indices in SEM.
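The two checks described above, whether the empirical index falls inside its bootstrapped CI and how wide that CI is, can be sketched in base R. The replicate values in theta.srmr are simulated stand-ins (the beta distribution is an arbitrary choice), and srmr_obs is a hypothetical sample value.

```r
set.seed(3)

# Stand-in replicate values; real ones would be the SRMR of each of the
# B models fit to Bollen-Stine bootstrap samples
theta.srmr <- rbeta(2000, shape1 = 4, shape2 = 60)
srmr_obs <- 0.06                      # hypothetical SRMR of the empirical sample

lims <- quantile(theta.srmr, c(0.05, 0.50, 0.95))
lcl <- lims[["5%"]]                   # lower confidence limit
med <- lims[["50%"]]                  # median of the replicates
ucl <- lims[["95%"]]                  # upper confidence limit

# Is the empirical index contained in its 90% CI?
inside <- srmr_obs >= lcl && srmr_obs <= ucl

# Width of the CI and the distance of each limit from the median
ci_range <- ucl - lcl
gaps <- c(lower_gap = med - lcl, upper_gap = ucl - med)

inside
ci_range
gaps
```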
It is thus expected that CI construction will be relatively easy when the appropriate BSBS process is used. From the sampling distributions of various models, 90% CIs can be constructed along with a median index value to supplement the actual index values. Because CI research in SEM is minimal, this research is largely exploratory: it focuses on the process and on determining which fit indices appear to behave well in small n SEM models with respect to CI estimation.
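As a rough sketch of how the BSBS process might look with lavaan (this is not the code used in this research), the example below fits a two-factor, six-indicator CFA to the HolzingerSwineford1939 data bundled with the package and bootstraps several fit indices. The data set, the model, the seed, and the small number of replicates (R = 100, chosen only so the sketch runs quickly) are all assumptions; the code is skipped if lavaan is not installed.

```r
if (requireNamespace("lavaan", quietly = TRUE)) {
  library(lavaan)

  # Two-factor, six-indicator CFA on a data set that ships with lavaan (stand-in data)
  model <- "
    visual  =~ x1 + x2 + x3
    textual =~ x4 + x5 + x6
  "
  fit <- cfa(model, data = HolzingerSwineford1939)

  # Fit indices to track; "mfi" is lavaan's name for McDonald's index
  indices <- c("chisq", "srmr", "mfi", "cfi", "rmsea")
  obs <- fitMeasures(fit, indices)

  # Bollen-Stine bootstrap of the fit indices; use B >= 1000 in practice
  set.seed(2013)
  boot <- bootstrapLavaan(fit, R = 100, type = "bollen.stine",
                          FUN = fitMeasures, fit.measures = indices)
  colnames(boot) <- names(obs)   # same function call, so same column order

  # 90% percentile CI and median for every index
  ci <- apply(boot, 2, quantile, probs = c(0.05, 0.50, 0.95), na.rm = TRUE)
  print(ci)

  # Bollen-Stine p-value from the bootstrapped chi-square values
  bsp <- mean(boot[, "chisq"] > as.numeric(obs["chisq"]), na.rm = TRUE)
  print(bsp)
}
```

Each column of ci then holds the lower limit, median, and upper limit for one index, which can be reported next to the sample value in obs.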

Methods
A simulation study was conducted using a 3 x 3 factorial design that varied sample size from small to moderate (n = 50, 100, 200) and factor loadings from moderate to large (λ = 0.60, 0.75, 0.90), with bootstrap replicates drawn both at size n and at 400, the size of the initial sample derived from the population matrix. Because of the minimal research with small samples in SEM, it seems important to start with these different conditions for CI estimation and then move forward. Because of the small n constraint, a simple six-item, two-factor confirmatory factor analysis (CFA) model will be used (Figure 2.2).
A selected set of commonly used fit indices will be monitored with a 90% CI similar to that commonly presented with the RMSEA, but with an empirical bootstrap basis. Two absolute fit indices (χ² goodness-of-fit; and standardized root mean-square residual, SRMR) will be used, which are generally covariance matrix based indices. Three relative fit indices (Tucker-Lewis Index, TLI; Bollen's incremental fit index, IFI; and normed fit index, NFI) will be used with the caveat that the TLI, IFI, and the comparative fit index (CFI) correlate heavily with one another. Finally, three non-centrality indices (McDonald's centrality index, MCI; CFI; and RMSEA), which measure the distance of the proposed model from a null model, will be used.
For robustness, 30 random samples were drawn in each of the nine simulated conditions. All data were simulated as integer-based 7-point Likert scales similar to those used in much of the SEM literature. The samples for the nine conditions were pulled from a population-matrix-based sample created for each level of n and λ. Population matrices were constructed randomly, and a sample of 400 was drawn from each as a population sample for the sake of comparison.
The samples of size n were drawn from the population samples of size 400 and were also randomly generated for all 30 samples in each of the nine conditions, for a total of 270 simulated models. Figure 2.3 illustrates the organization of the methods used in this simulation. Following Efron and Tibshirani (1986), at least B = 1000 bootstrap replicates are needed for non-parametric CI estimation. The range of the CI is of importance and should be analyzed, as should the likelihood that the fit indices from the model are actually contained within the simulated 90% CI. Any size CI can be constructed, but this research opted for a 90% interval since it is commonly used with the RMSEA. The standard RMSEA CI is mathematically derived, so using the non-parametric BSBS to obtain an RMSEA CI is also of interest.
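One way such data could be generated for a single condition is sketched below. This is a hedged illustration, not the procedure used in this research: the factor correlation phi, the equal loadings within each factor, the equal-probability cut points, and sampling without replacement from the population sample are all assumptions, since the text does not specify them.

```r
set.seed(2013)

lambda <- 0.75      # one of the three loading levels studied (0.60, 0.75, 0.90)
phi    <- 0.30      # hypothetical factor correlation (not given in the text)
N <- 400            # size of the population sample
n <- 50             # one of the three sample sizes studied (50, 100, 200)

# Two correlated factors with unit variance
Phi <- matrix(c(1, phi, phi, 1), 2, 2)
eta <- matrix(rnorm(N * 2), N, 2) %*% chol(Phi)

# Loading matrix: three indicators per factor (6 x 2)
Lambda <- rbind(cbind(rep(lambda, 3), 0), cbind(0, rep(lambda, 3)))

# Continuous indicators with unit variance
E <- matrix(rnorm(N * 6, sd = sqrt(1 - lambda^2)), N, 6)
X <- eta %*% t(Lambda) + E

# Cut each indicator into an integer 7-point Likert-type scale
breaks <- c(-Inf, qnorm((1:6) / 7), Inf)
pop <- apply(X, 2, function(x) as.integer(cut(x, breaks)))
colnames(pop) <- paste0("x", 1:6)

# Sample of size n drawn from the population sample of size N
samp <- pop[sample(N, n), ]
head(samp)
```

Repeating the last step 30 times per condition, and looping over the three levels of n and λ, would give 270 samples in the same layout as the design described above.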

Results
Tabulations of the model fit indices encapsulated by their respective 90% CIs for the varied conditions of n and λ can highlight subtle differences in the indices. The dispersion of the indices bounded between zero and one (Figure 2.4) suggests very little difference in the spread of the indices at the bootstrap level of sample size n. Furthermore, the SRMR, MCI, and RMSEA are without outliers in the box-plots (Figure 2.4), whereas the TLI has the largest spread of values and outliers. Plotting the difference between the lower confidence limit (LCL) and the median against the difference between the upper confidence limit (UCL) and the median (Figure 2.5) for several indices with small CI ranges shows that the SRMR and MCI both have minimal variation in their CIs and the smallest differences between the limits and the median.

Discussion
Fit index CIs have strong potential for use by SEM researchers, as the levels of confidence can add great depth to any fit index. The lavaan package in R is one example of simplifying the BSBS procedure to obtain not only a BSp-value but also 90% CIs for any or all of the fit indices available through the bootstrapLavaan function (see Appendix B). The SRMR also seems to be rather consistent in including the actual fit index value in its bootstrapped CI, which could be beneficial for small n SEM models. With a simple CFA model and sample sizes of 50, 100, and 200, the SRMR and MCI behaved with the least variance. Across all levels of n and λ, the MCI and SRMR, with their minimal variances and differences in means, appear to be least sensitive to parameter variation. The RMSEA similarly appeared effective across all nine conditions, but with slightly larger CI ranges than the SRMR and MCI (Figure 2.5). The TLI, IFI, NFI, and CFI behaved similarly across conditions, likely because of their high intercorrelations, and all four had the widest variance in CI ranges across the varied levels of n and λ.
Many of the indices were only moderately successful at producing simulated CIs that contained their respective fit index value from the sample itself (Table 2.1). The NFI most notably was adequate only when n = 200, and thus its use in small sample SEM could be ill-advised, further supporting the notion that the NFI is over-sensitive to parameters (Ding, 1996). The ratios of fit indices to bootstrap fit indices again suggest the MCI as one of the closest to an ideal ratio of 1:1, as well as having a minimal 90% CI range based on these values (see Table 2.3). The IFI and CFI had ratios slightly closer to 1 than the MCI, but the 90% CIs for their respective 270 ratios were almost double the size of the MCI CI range.
The addition of a 90% CI for the χ² goodness-of-fit would be valuable to any SEM researcher and should therefore be considered whenever possible. An added CI for the χ² test would add depth to the research presented in the field. Presenting an empirical CI for the χ² of an SEM model lends credence to how strong the model may or may not be, makes the model clearer to the reader, and even allows for power and effect size estimation for current and future use.
Within the eight indices presented in this research, the SRMR showed strong bootstrap stability (i.e., the fit index being contained in the 90% CI) as well as minimal variance within the CI ranges themselves. The SRMR thus seems to have an edge over the other indices on both counts.
The MCI displayed small CI ranges as well, regardless of n and λ, and also showed a ratio of fit index to bootstrap fit index median close to 1:1 with minimal variance in the ratios. The IFI and CFI both have benefits in these areas as well, and could be worth adding to an SEM CI estimation.
The use of several indices, as opposed to just one, is generally advised, and therefore the SRMR may very well be a good option for any researcher to use or continue using as a supplement to the χ² goodness-of-fit. The MCI could make a good complement to the SRMR and χ², and presenting empirical CIs for all three would add strength to SEM research. Due to the high correlation between the IFI and CFI, either one could make a viable addition to the fit indices used with bootstrap CIs, but certainly both are not necessary.
A limitation of this research is that it was conducted with only one CFA model under a limited set of conditions (i.e., three sample sizes and three factor loading sizes). Future research could examine additional conditions, including different SEM model types and sizes, as well as larger sample sizes and other levels of the factors. Another limitation is that the artificiality of any simulation may not always generalize to real-world research.
However, the constructed data sets in this research were made to be integer based and similar to what a real Likert scale study would look like. Still, true empirical data would obviously lend more credit to this CI estimation method and would be complementary to the current findings.
Future research could also include statistical power analysis as a factor to extend the information that can be presented with a given SEM model. Power estimation in SEM is a much more difficult task than CI estimation, and thus may be better addressed on its own. However, both power and CI estimation would offer more clarity and depth to SEM research. Although many of the common and well-supported SEM indices were included in the current research, other indices can also be examined with the procedures presented in this study.
Other research could also focus directly on the expansion of small sample sizes to larger ones with bootstrapping. Analyses could investigate how large the bootstrap extensions should be (i.e., to what extended sample size) and which fit indices would be best to use. Another possible bootstrapping approach to SEM could focus on one model with varied subsamples of different sizes. Overall model fit might be good, but specific groups within the overall sample (e.g., by gender or race) may not have balanced subsample sizes, and thus comparison of the model across groups could be somewhat difficult. The issue to address then becomes: should a researcher bootstrap smaller samples to the size of the larger samples, or would smaller bootstrap replicates of the larger subsamples be more adequate?
The need for CIs in any research has become obvious and in many cases a requirement (e.g., Cumming, 2012). It has been aptly suggested that no measure of central tendency should ever be trusted without a range of confidence. SEM has been surprisingly lacking in CIs other than for the RMSEA, and thus the goal of this research is to help illustrate the necessity and the rather non-complex nature of adding 90% CIs to all published models. Modern computing has opened many new gateways, and it has become rather easy in the 21st century to conduct repeated iterations of any model in a bootstrap fashion, especially since many programs and packages already allow replication. Researchers need to become more familiar with these methods, and journals need to provide more guidelines for SEM publication requirements, which should include CIs and fit indices adequately justified a priori.

This design can be used for parameter estimation, standard error estimation, confidence limits, or any combination of these.