A MIXED-EFFECTS REGRESSION MODEL WITH APPLICATION TO LONGITUDINAL MISSING AT RANDOM MICROBIOME DATA

Gut microbiota in the human lower gastrointestinal tract can impact the health through human functions as well as triggering numerous diseases. The colonization of the microbiota occurs immediately after the birth of an infant. Early life factors can highly influence the composition of gut microbiome, which ultimately affects infant’s health. Therefore, the relationship between microbiome composition and clinical outcomes of preterm infants observed in Neonatal Intensive Care Units (NICUs) is of critical importance. To study this relationship, it is common to use longitudinal study designs. One of the major challenges in these designs is the huge percentage of missingness of the microbiome composition data, which needs to be appropriately accounted for during the study design. In this thesis, we propose a mixed-effects zero-inflated Beta regression model for longitudinal composition designs with missing at random data. This model captures the dependence of repeated measures for each subject by assuming a first-order autoregressive correlation structure. A Bayesian approach was employed for parameter estimations and inferences under this model. Performance of the model was investigated by a simulation study using different settings of missing data mechanisms. A sensitivity analysis was conducted to study the model misspecification issue. The developed model was further illustrated by a real data analysis on gut microbiome compositions of NICU preterm infants.


Introduction
Gut microbiota in the human lower gastrointestinal tract consists trillions of microorganisms and it accommodates approximately 150 times more genes than the entire human genomes [1,2]. The bacterial cells in the gut is 10 times higher than the cells in the human body [3]. The gut mictobiota plays a major role in human health and disease [4]. Multiple studies reported the connection between the imbalance of gut microbiome and different human diseases such as obesity, inflammatory bowel disease, and diabetes [5,6,7,8]. The exposure to psychosocial stressors could change the composition of these gut microbiota where this alteration can increase the vulnerability of an enteric pathogen [9,10,2].
Colonization of microbiota in the gastrointestinal tract begins immediately after the birth of an infant. Pattern of colonization for an infant will be affected based on the delivery method, the diet method, and environmental factors [11,12].
This early life microbiome composition may affect the human health conditions in their later life stages [13,14,15]. This introduces the desirable need of understanding the microbiome composition of early life.
World Health Organization (WHO) defines a newborn as a preterm infant if the infant was born before the completion of thirty-seven weeks of the gestation period [16]. These preterm infants will be taken care at Neonatal Intensive Care Units (NICUs), which exposes them to different stressors. These stressors may result by the surrounding environment of the NICUs (eg: continuous bright lights and alarm/equipment noises) and painful caretaking procedures administered to the infants during their stay in NICUs [17,18,19]. Such exposure can affect the development of the infant's brain and could lead to learning difficulties in their later life stages [20]. According to D'Agata et al. [21], this early life stress experienced in the NICU can influence the developing gut microbiome of preterm infants.
Most of the gut microbiome studies quantify the gut microbiota using DNA based methods such as high-throughput and low-cost sequencing methods [22].
A common approach is 16S ribosomal RNA (rRNA) gene sequencing method and the sequencing reads are used to identify the bacterial operational taxanomic units (OTUs or taxa) [23,24]. The OTU read counts used to obtain the bacterial relative abundances that are bounded [0, 1). A major challenge faced when conducting statistical analysis for microbiome composition data is the existence of excess zeros.
This may caused due to the absence of the bacterial taxa in each sample [25].
Xia and Sun [26] reviewed different statistical methods which have been employed to analyze microbiome data. The two sample t-test and Wilcoxon rank sum test were used to test the species diversity in each individual sample between two groups [27,28]. One-way analysis of variance (ANOVA) was applied to compare cutaneous microbiota of psoriatic group (affected group), unaffected group, and control group to perceive patterns that control skin colonization [29]. Voigt et al.
[30] exerted a two-way ANOVA to observe taxonomic and functional specific biases initiated by RNALater preservation across all subjects. However, these standard statistical methods have their limitations on applying to longitudinal composition microbiome data [31]. For example, these methods does not account for the dependence of the repeated measures which occurs in longitudinal study designs.
Considering the excess amount of zeros present in OTU counts, Xu et al. [32] compared the performance of the zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), and hurdle models by conducting extensive simulations with an application to a microbiome study. An additive logistic normal multinomial regression model was proposed by Xia et al. [33], which links covariates to bacterial taxa counts with an application and tests the association between diet and stool microbiome composition. Li et al. [34] employed a multivariate zero-inflated logistic-normal model to analyze the association of disease risk factors with individual microbial taxa and overall microbial community composition.
A zero-inflated generalized Dirichlet multinomial (ZIGDM) regression model was developed by Tang and Chen [35] to model the multivariate taxon counts.
To investigate how microbiome compositions change over time and how its trajectories relate to clinical outcomes, it is common to use longitudinal study designs [25,36,37]. In longitudinal study designs, the information is collected from the same subject repeatedly over a period of time. In these study designs, it is more often to confront intermittent missingness (non-monotone) and dropout (monotone) of subjects. The intermittent missingness may occur if a participant misses at least one visit during the study period and dropouts may occur if the participant withdraws from the study prematurely [38]. Little and Rubin [39] introduced three different types of missing data mechanisms: Missing Completely with the assumption that parameters of the missing data mechanism are distinct from the parameters of the sampling model. That is because the missing data mechanism does not need to be included in the likelihood function. On the other hand, MNAR is referred as non-ignorable missing data mechanism since the missing data mechanism cannot be omitted in the likelihood specification. A detailed description of the missing data mechanisms can be found in Chapter 2.
In statistical literature, several approaches have been proposed to handle different missing data mechanisms. The basic approach is casewise deletion where the cases with missing values in a variable(s) are deleted. The method is also known as listwise deletion or complete case analysis, which provides bias if the missing data mechanism is not MCAR [40]. Methods to incorporate MAR data mechanism include maximum likelihood approach using the expectation maximization (EM) algorithm, weighted generalized estimated equations, and multiple imputation [41,42,43]. Ibrahim et al. [44]  where the missing data mechanism is non-ignorable and missing data pattern is non-monotone.
Chen and Li [25] proposed a zero-inflated beta regression model (ZIBR) to study the relationship between the relative abundance of the microbiome with clinical covariates in a longitudinal study. This ZIBR model is a mixture of binary and beta regression components. The binary component models the presence of the bacterial genus and beta component models the non-zero microbiome composition.
In their study, they included the random effects in the model to account for the correlation between repeated measurements of each subject in the study. However, their ZIBR model cannot handle the missing data encountered in longitudinal study designs. This ZIBR model was also used by D'Agata et al. [21] to assess the impact of early life stress on the trajectory of the gut microbial structure, where MCAR missing data mechanism was assumed. However, both ZIBR models proposed in the two studies assume that each individual shares the same random effect among all repeated measures. This assumption could not always be satisfied because the random effect of an individual may not be the same in all the repeated measures of longitudinal study designs.
In this study, we developed a mixed-effects ZIBR model to understand the association between gut microbiome composition of the NICU preterm infants and their stress level when MAR missing data mechanism is assumed in the study design. This new model relaxes the assumption that each individual shares the same random effects in repeated measures by assuming first-order autoregressive covariance structure, AR (1). This correlation structure incorporates a high correlation between adjacent visits of an individual and the correlation systematically decreases when the distance between the visits increases. The Bayesian approach was used to estimate the parameters of this model using the Markov Chain Monte Carlo (MCMC) algorithm, which was implemented in Nimble statistical software [45]. Finally, the obtained results were compared with the study results of D'Agata et al. [21], which used the same data set for their analysis.

List of References
[1] I. Cho and M. J. Blaser, "The human microbiome: At the interface of health and disease," Nature Reviews Genetics, vol. 13, no. 4, p. 260, 2012. [2] J. F. Cryan and T. G. Dinan, "Mind-altering microorganisms: The impact of the gut microbiota on brain and behaviour," Nature Reviews Neuroscience, vol. 13, no. 10, p. 701, 2012.  The motivating data set was obtained from D'Agata et al. study [1]. The data were originally collected from very low birth weight (VLBW) infants who were admitted to the NICU of Tampa General Hospital, Florida between the years 2011 and 2015. In the data collection process, they excluded infants whose birth weights were greater than 1500g, infants who had major congenital anomalies, moribund infants, and infants whose mothers were HIV-infected. A total of 82 VLBW infants were included in the final study sample after the exclusion procedure. A detailed description of the data collection method can be found in D'Agata et al. [1].
According to D'Agata et al. [1], the stools samples of VLBW infants were collected for six consecutive weeks since they were admitted to the NICU. After conducting the microbiome analysis, bacterial operational taxonomic units (OTUs) were reported at the genus level. These reported OTUs were used to calculate the bacterial relative abundance (microbiome composition), which was the dependent variable of this analysis. The infants' stress experience was recorded daily for 6 weeks since the admission to the NICU. These data were collected using the Neonatal Infant Stressor Scale (NISS), a quantitative instrument which measures the interventions performed in NICU [2]. The time dependent variables, stress scores two weeks (lag 2) and one week (lag 1) prior to the current week sampling and the sampling week were considered in the analysis as independent variables.
Other covariates considered includes the baseline variables, infant's gender, birth weight, gestational age, and antibiotic usage. The weekly NISS stress scores of each infant are shown in Figure 1. The average weekly stress scores (shown by the black points) decrease over time, suggesting that infants have higher stress during the first couple of weeks in the NICU.
In some instances, during this longitudinal study, some infants had multiple samples while some infants had no samples collected during certain weeks. We thus used the weekly average of the microbiome composition of the samples as the dependent variables. However, missing data were encountered when no samples were collected during the entire week. Due to the procedure of the data collection [1], MAR missing data mechanism was assumed throughout the study. After excluding the infants with missing covariates, the study sample was reduced to 68. three weeks in the study sample. We consider data from week 3 to week 6 during the analysis due to the higher occurrence of missingness in the first two weeks, as in D'Agata et al. [1] The overall missing data percentage of the response variable was 22.05% and all the covariates of the 68 infants were completely observed. Finally, after removing the low abundant genera from the obtained sample [1], 21 bacterial genera were included in the analysis.

Model for Longitudinal Microbiome Data
The microbiome composition of each bacterial genus is bounded from [0,1), zero-inflated, and the distribution is highly skewed (see Figure 3) [3]. Therefore, the ZIBR model with random effects in the beta component was used to study the association of gut microbiome composition and NICU stress of preterm infants.
Let y t be the relative abundance for a given bacterial genus at time t, for t = 0, 1, . . . , T , where y 0 represents the baseline measurement. Conditioning on the subject-level random effects η t , we assume that y t (0 ≤ y t < 1) follows zero-500 1000 Week 1 Week 2 Week 3 Week 4 Week 5 Week 6 Week NISS Score inflated Beta distribution given by and where ( are the mean and the precision parameters of the Beta distribution, respectively. corresponding to x t and z t where p and q are the dimensions of each vector. Let with the (a, b) entries of the matrix being ρ |b−a| (−1 < ρ < 1).

Missing Data Mechanism
Incomplete data are often encountered in many fields of studies. The missingness might be caused by various reasons, such as non-responses in survey questions, study drop-outs or failure to follow-up, instrumentation drawbacks and so forth.
The three missing data mechanisms reviewed by Little and Rubin [4] which is mentioned in Chapter 1 can be shown using mathematical notations as follows.
Let Y , R, Y obs , and Y mis denote the complete data, missing data indicator matrix, observed, and missing measures of Y , respectively. The missing data mechanism is given by f (R | Y, φ), the conditional distribution of R given Y and φ, where φ is the unknown parameter vector. Then the MCAR mechanism can be denoted as, where the missingness does not depend on the values of the observed or missing Y . The MAR mechanism can be denoted as, where missingness depends only on observed data. Finally, the MNAR mechanism is when the missingness (R) depends also on the missing values of Y (Y mis ).
Let the joint probability distribution of Y and R given by where θ is the vector of parameters of interest.
Then the joint probability distribution of observed data (Y obs ) and R can be obtained by integrating out Y mis .
Let N denote the total number of subjects then the full likelihood function of θ and φ is given by where i = (1, . . . , N ) denotes the i th subject and Ω is the parameter space.
The likelihood function of θ ignoring the missing data mechanism is given by If the distribution of the missing data mechanism does not depend on the unobserved Y mis i.e., MAR (see (4)) and θ and φ are priori independent, π(θ, φ) = π(θ)π(φ), then (6) can be simplified as follows, Finally, the likelihood-based inference for Y from L f ull (θ, φ | Y obs , R) will be the same as the likelihood-based inference for Y from L ign (θ | Y obs ).

Bayesian Inference
Unlike the frequentist framework, where the parameters are considered as unknown constants, the parameters in the Bayesian framework are considered as random variables. As given by Gelman et al. [5], let θ be the parameter vector of interest and y be the vector of data obtained. By using the Baye's rule, the posterior density can be given by π(θ | y) = π(θ, y) π(y) = π(y | θ)π(θ) π(y) , where π(θ), π(y | θ), and π(y) = π(θ)π(y | θ)dθ are the prior distribution, sampling distribution, and the normalizing constant respectively. Since π(y) does not depend on θ, (10) can be written as, Therefore, the posterior density can be obtained by the likelihood function of parameters times the prior distribution of the parameters.
Let θ = (β, γ, φ, σ, ρ) be the vector of all parameters. Then the likelihood function is given by where n i = (1, 2, . . . , T + 1) denote the number of observed values of each subject i.
By the assumption of missing at random data, the likelihood function will be given by where P i denote the n i × (T + 1) selection matrix to select the observed values.

Prior Distributions
The parameters of interest in this model are β, γ, φ, σ 2 , and ρ. In the Bayesian analysis, we used non-informative priors for parameters. The proposed prior distributions for the coefficients of fixed effects of binary and beta components are multivariate normal: β ∼ N p (0, 10 4 I p ) and γ ∼ N q (0, 10 4 I q ) where I p and I q are p × p and q × q identity matrices respectively. For the precision parameter, we proposed an inverse gamma distribution: φ ∼ IG(10 −3 , 10 −3 ). The variance and correlation parameters of the random effects, we used inverse gamma distribution: σ 2 ∼ IG(10 −3 , 10 −3 ) and uniform distribution: ρ ∼ U (0, 1).
The Bayesian analysis was conducted using the proposed prior distributions in R version 3.5.3 [6] using the nimble package version 0.8.0 [7,8,9]. This software uses Markov Chain Monte Carlo (MCMC) algorithms to obtain the samples from posterior distributions of the parameters.

Simulation Study
In this section, we describe a simulation study that was conducted to evaluate the performance of the proposed model. We set N = 100 and T = 3 visits, where t = 0 is the baseline for each subject. Two covariates were generated for the analysis, where x i1 ∼ N (0, 1) and x i2 ∼ Bernoulli(0.5). For both binary and beta components, same covariates were used. The parameters of the model were set to mimick the real data set used in this study, where it generated approximately similar average zero percentage in the response variable as in real data. The coefficients of fixed-effects of the binary and beta components were set to β = (β 0 , β 1 , β 2 ) T = (2, 2, 1) T and γ = (γ 0 , γ 1 , γ 2 ) T = (−2, 1, 2) T , where β 0 and γ 0 denote the coefficients of the intercept terms, respectively. The random effects of the beta component were generated using multivariate normal distribution, η i ∼ N (0, σ 2 Σ) where σ 2 = 1 and Σ is a 4×4 AR(1) correlation matrix with ρ = 0.8. The response variable y it was thus simulated from model in (1), such that the average overall zero percentage of the response variable to 20%.
In this analysis, the three missing data mechanisms (MCAR, MAR, and MNAR) were used to generate missing data in the response variable. The missing data indicators were simulated for the three cases as follows: where k = 1, 2, 3 represent each missing data mechanism, P it1 = 0.25, P it2 = 1 (1+exp(1+x it1 +x it2 )) , and P it3 = 1 (1+exp(53+x it1 +x it2 −100y it )) . Under this setting, the average overall missing data percentage of each missing data mechanism was 25%.
This percentage is approximately close to the missing data percentage in real data set.
The prior distributions for the parameters in the model was set as follows: β ∼ N (0, 10 4 I 3 ), γ ∼ N (0, 10 4 I 3 ), φ ∼ IG(10 −3 , 10 −3 ), σ 2 ∼ IG(10 −3 , 10 −3 ), and ρ ∼ U (0, 1). After setting up the priors, 100 simulations were conducted for each missing data mechanism and the true values of the parameters were assigned as the initial values. To acquire the convergence of MCMC samples, the first 1000 iterations of the sampler were discarded and obtained every tenth iteration from 9000 iterations. Therefore, 900 samples were generated from the posterior distributions for each parameter and it was used to compute the posterior summaries.  values were close to each other for the parametes β 0 , β 1 , β 2 , γ 0 , and ρ and the CP of the parameters were smaller than 95% except for parameters β 0 and β 1 . These results indicate that the proposed model was more adequate if the missing data were MCAR or MAR than MNAR. The performance of the proposed model was further examined by considering a mis-specified response model. This model was generated by using an unstructured   Then the response variable y it was generated using the above unstructured correlation matrix. The missing data for y it were simulated using the MAR missing data mechanism with average overall missing data percentage of 25%. The analysis was conducted using the proposed model for 100 simulations using the same  prior distributions. The first 1000 iterations were discarded in each simulation and every tenth iteration was obtained from 9000 iterations. Moreover, the true values were assigned as the initial values except for the parameter ρ. Table 4 shows the true values (TRUE), posterior median (EST), SD, SE, RMSE, and CP of each coefficient parameter of fixed effects calculated using the data that were generated from AR(1) structured and unstructured correlation matrices (mis-specified model). When comparing the results of the two models, the correctly specified model shows better results than the mis-specified model. This also can be clearly seen in figure 7 where the bias of each parameter was closer to zero in the correctly specified model compared to the mis-specified model.

Analysis of Gut Microbiome Composition Data of Preterm Infants
This section depicts the results of the analysis conducted for microbiome composition data of NICU preterm infants. The data of the 68 preterm infants were considered for this analysis using the data of the last four weeks as described in Chapter 2. Therefore, T = 3, where t = 0 denotes the baseline (week 3) and t = 1 to t = 3 denotes the follow-up weeks from weeks 4 to 6.  Table 5 shows the descriptive statistics of the study sample. The median gestational age of the preterm infants was 28 weeks, median birth weight was 1082.5g, approximately 53% of the infants were not treated with antibiotic, and the majority of the infants in this sample were female (approximately 53%). Moreover, the median cumulative weekly stress scores of infants were decreased as the number of weeks in NICU increase.
The same covariates, week (week 3 -reference level), antibiotic usage (1 = yes), infant's gender (1 = female), birth weight, gestational age, and the weekly stress score which is 2 weeks and 1 week prior, used in both binary and beta components of the analysis. The continuous variables, birth weight, gestational age, and stress scores were standardized prior to the analysis.
Out of 21 genera, the Bayesian analysis was conducted for the genus, 'Veillonella'. The reason behind the selection was that NICU stress exposure had a significant effect on this genus during D'Agata et al., [1] study. To obtain the MCMC samples from the posterior distributions of the parameters, 600,000 iterations were used, and every 500 th iteration was taken after removing the first  Both tables include, estimates (EST F ) and FDR-adjusted p-values (P-Value) of the parameters obtained from the study D'Agata et al. [1], where they used the frequentist approach with n = 47. In both approaches, estimated parameters were considered as statistically significant at the 0.05 significance level. In the Bayesian approach, the estimate is considered as statistically significant if the 95% HPD interval did not include 0.
According to table 6, in both approaches, the stress two weeks prior (Stress t−2 ) has a significant effect on the genus Veillonella in the beta component. This suggests that when the genus Veillonella was present in the stool sample, larger stress scores were associated with the larger relative abundance of the genus. In the Bayesian approach, antibiotic usage was significant in the binary component.
This indicates antibiotic usage is associated with a higher probability of Veillonella being present in the stool sample. i.e. using antibiotic is 2.147 (exp(0.764)) times the odds to have the genus to be present in the stool sample than non-antibiotic users. The coefficient of week 6 is significant in the beta component in both approaches which shows that sampling week has a significant effect on Veillonella.
In table 7 the antibiotic usage in the binary component was not significant in the Bayesian approach. However, it was significant in the beta component where the infants who were treated with antibiotics were associated with the larger relative abundance of Veillonella compared to infants who were not treated with an antibiotic. Similar to table 6, the stress two weeks prior collecting the stool sample was significant in the beta component in Bayesian approach. This indicates that the Bayesian approach also agrees that the stress exposure of the preterm infants treated in NICU has a significant effect on the genus Veillonella.

Discussion
In this study, we developed a ZIBR model with mixed effects, which can handle the MAR data using the Bayesian approach. This model can be used to overcome two main challenges, the presence of a large number of zeros in microbiome composition data and intermittent missingness with MAR missing data mechanism that occurs in longitudinal studies.
The performance of the proposed ZIBR model was evaluated by a simulation study (see Section 3.1 in Chapter 3). Estimates of the parameters in the simulation study were closer to true values, except for precision parameter φ when the missing data mechanism is MAR. In addition, the coverage probabilities of each parameter were closer to 95%. When the missing data mechanism is MNAR, most of the parameter estimates were not close to true values and the coverage probabilites were also far from 95%. The large bias of the precision parameter φ in the beta component highlights deserves future research. Performance of the model was further examined by simulating another data set employing an unstructured correlation matrix for random effects but using the same proposed model. Moreover, the proposed model was applied to real data on gut microbiome composition of NICU preterm infants.
A major limitation faced in this study was conducting the analysis for all the 21 genera. The multiple comparison issue occurs when performing a higher number of tests simultaneously. Thus, we were not able to report the results of the all 21 genera and instead we report the results of one genus. There are several methods that can be used to adjust this issue in a frequentist framework such as Bonferroni correction and false discovery rate (FDR) [1,2,3]. In Bayesian framework, one approach is to use a joint model to model all the genera to solve this issue [4,5].
Therefore, for future research a multivariate zero inflated model can be used to model all bacterial genera together [6].
The Monte Carlo Expectation Maximization (MCEM) algorithm introduced by Wei and Tanner [7] can be applied to obtain the maximum likelihood estimates (MLE) of parameters for mixed-effect models. In this method, the random effects were treated as unobserved or missing values [8]. This approach can be applied to obtain MLE of parameters of the proposed model. With this approach, one can obtain the fisher's information matrix of parameter estimates and can compare the information gain between the complete case and all case analysis [9]. For future research it is interesting to use the MCEM algorithm to estimate the parameters of the proposed model.