INTERVAL-CENSORED DATA WITH INACCURATE TIME BOUNDS AND TIME-DEPENDENT COVARIATES

A key characteristic that distinguishes survival analysis from other fields of statistics is that survival data are usually censored or incomplete in some way. The event time is interval-censored when the exact event time is unknown and the event is known only to occur within some interval of time. The purposes of this study were (a) to develop an easy-to-use code for interval-censored survival data with both fixed and time-dependent covariates; (b) to conduct extensive simulations to investigate the robustness of interval-censored survival analysis with inaccurate time bounds and time-dependent covariates, under both noninformative and informative censoring; and (c) to conduct a real data analysis using the ACTG 175 data to investigate the robustness of findings under inaccurate time bounds. The likelihood approach was used to draw inferences about the unknown parameters. Parameter estimates and standard errors were obtained, and confidence intervals were constructed. The simulation findings demonstrated that, under both noninformative and informative censoring, parameter estimates and coverage probabilities were more robust against deviations from the true time bounds for the regression coefficients of the fixed and time-dependent covariates than for the general hazard function parameters. The real data analysis demonstrated that, under both noninformative and informative censoring, the estimates and p-values for the fixed and time-dependent covariates were robust against deviations from the true time bounds. This study was the first to develop code for interval-censored survival data with both fixed and time-dependent covariates, and the first to investigate the robustness of interval-censored survival analysis under inaccurate time bounds.


Survival analysis is used in many areas for analyzing data involving the times of transition among several states or conditions (Leung, Elashoff, & Afifi, 1997).
Survival analysis is also called lifetime data analysis, time-to-event analysis, reliability analysis, or event history analysis, depending on the focus and the type of application (Leung et al., 1997; Prinja, Gupta, & Verma, 2010).
In some instances, the events are actual deaths of individuals and "lifetime" refers to the length of life measured from some particular starting point. In other instances, terms such as "lifetime", "death", or "failure" denote the event of interest and are used in a figurative sense (Lawless, 2011).
A key characteristic that distinguishes survival analysis from other statistics fields is that survival data are usually censored or incomplete in some way (Leung et al., 1997). There are three types of censoring, left censoring, right censoring, and interval censoring.

The event times are censored when they are not observed accurately (Sparling, Younes, Lachin, & Bautista, 2006). In clinical trials, censoring usually occurs when information on time to outcome event is unavailable for all participants, and observations are censored when information on time to event is unavailable due to loss to follow-up or non-occurrence of outcome event before the trial end (Prinja et al., 2010).
The event time is right-censored when follow-up is curtailed without observing the event (Sparling et al., 2006). In clinical trials, an observation is considered right-censored if the person was alive at study termination or was lost to follow-up at any time during the study (Leung et al., 1997).
The event time is left-censored when the event occurs at some unknown time prior to an individual's inclusion in a cohort (Sparling et al., 2006). In clinical trials, an observation is considered left-censored if the person had been on risk for disease for a period before entering the study (Prinja et al., 2010).
The event time is interval-censored when the event occurs within some interval of time, but the exact event time is unknown (Sparling et al., 2006). Interval censoring can occur when observing a value requires follow-ups or inspections. Interval censoring occurs frequently in epidemiological, financial, sociological, and clinical (especially HIV and cancer) studies where the event of interest is known only to occur within an interval induced by periodic monitoring (Song & Ma, 2008; Zeng, Mao, & Lin, 2016). Practically, most observational studies dealing with non-lethal outcomes have periodic examination schedules and are interval-censored (Prinja et al., 2010; Zeng et al., 2016).
Right censoring (where the upper bound is infinity) and left censoring (where the lower bound is 0) can be considered as special cases of interval censoring.
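This unified interval representation can be sketched in R as follows; this is an illustrative encoding, not code from this study, with right-censored observations taking R = Inf and left-censored observations taking L = 0.

```r
# Encode the three censoring types as intervals (L, R]:
# interval-censored: finite 0 < L < R; right-censored: R = Inf;
# left-censored: L = 0. Values here are purely illustrative.
obs <- data.frame(
  L    = c(2, 5, 0),
  R    = c(4, Inf, 3),
  type = c("interval", "right", "left")
)
```

Treating all observations in this common form lets one likelihood expression cover every censoring type.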
Therefore, this project focused on the more general type of censoring, interval censoring, and conducted interval-censored survival analysis with time-dependent covariates.

Time-dependent Covariates
In many instances, factors or covariates affecting an individual's lifetime may vary over time, and we refer to them as time-dependent covariates (Lawless, 2011).
In this project, interval-censored survival analysis with time-dependent covariates was conducted under two scenarios: noninformative censoring and informative censoring.

Noninformative Censoring and Informative Censoring
Most studies have assumed that the examination time and the lifetime of interest are completely independent (e.g., Groeneboom & Wellner, 1992) or conditionally independent given covariates (e.g., Rossini & Tsiatis, 1996). This is known as the assumption of noninformative censoring. Chen, Lu, Chen, and Hsu (2012) suggested that if the examination time is intrinsically related to the lifetime of interest, the censoring is considered informative. Such data may occur in many fields, including epidemiological and medical studies.
Dettoni, Marra, and Radice (2020) noted that censoring is independent when the hazard rate of the event of interest is the same for censored and uncensored observations, and dependent otherwise; censoring is informative, or dependent, when the censoring times contain information about the parameters of the distribution of the event variable (e.g., Kalbfleisch & Prentice, 2002). Ranganathan and Pramesh (2012) stated that censoring in survival analysis should be noninformative, meaning participants who drop out of the study should do so for reasons unrelated to the study, while informative censoring occurs when participants are lost to follow-up for reasons related to the study.
Although informative censoring has been well studied in survival analysis (e.g., Chen et al., 2012; Leung et al., 1997; Emura & Chen, 2018), studies that specifically analyze the problem of informative censoring remain scarce, even though ignoring it may have detrimental consequences for inferential conclusions (e.g., Lu & Zhang, 2012).
For this project, the analysis of interval-censored data was studied under both noninformative and informative censoring scenarios, using both simulations and the real data.

The Real Data -ACTG 175 Data
In real life, we may not be able to record the exact event time. For example, in the data from the AIDS Clinical Trials Group (ACTG) 175 study (e.g., Hammer et al., 1996), time to progression to AIDS may not be accurately observed. Some previous studies treated the data as right-censored (e.g., Hammer et al., 1996; Scharfstein & Robins, 2002), while others treated the data as interval-censored (e.g., Song & Ma, 2008; Song & Wang, 2017; Wen, 2012; Wen & Chen, 2014). Several previous studies analyzing the ACTG 175 data (e.g., Hammer et al., 1996; Huang & Zhang, 2008; Scharfstein & Robins, 2002; Song, Davidian, & Tsiatis, 2002; Song & Ma, 2008; Song & Wang, 2017; Wen, 2012; Wen & Chen, 2014) are discussed below. In the original trial (Hammer et al., 1996), patients were examined at weeks 2, 4, and 8 and then every 12 weeks, with CD4 cell counts determined from week 8 onward. Since the examinations of CD4 cell counts were conducted every 12 weeks and progression to AIDS depends on the CD4 cell counts, the event time should be considered interval-censored. As a result, analyses of the ACTG 175 data using models for right-censored data should be viewed as approximations, and the accuracy of such approximations needs to be evaluated. In addition, if the periodic examination schedules were followed exactly, the censoring would be noninformative; in practice, however, there might be factors that affect the examination date and the event time simultaneously, which would result in informative censoring. Although both studies used the conditional score approach to handle mismeasured covariates, the study of Song et al.
(2002) generalized the conditional score approach to multiple time-dependent covariates to handle mismeasured covariates, while the study of Song and Ma (2008) proposed a multiple data augmentation approach that can convert interval-censored data into right-censored data and then employed the conditional score approach, which was developed for analyzing right-censored data, to handle mismeasured covariates. The goal of the study of Song and Ma (2008) was to fill the gap between methodologies for analyzing interval-censored data without mismeasured covariates and methodologies for analyzing right-censored data with mismeasured covariates. The performance of the multiple augmentation approach was assessed via simulation studies and the ACTG 175 data. The ACTG 175 data were analyzed as interval-censored data with mismeasured covariates. The variables used in the analysis included CD4 count and treatment. The findings showed that this approach had satisfactory empirical performance for interval-censored data. Song and Ma (2008) mentioned that this approach can be applied under other semiparametric model assumptions, such as the additive risk model, and that the conditional score approach can be replaced by other error-dealing approaches for right-censored data, such as correction approaches.
The study of Song and Wang (2017) also used the conditional score approach to handle mismeasured covariates. However, unlike the studies of Song et al. (2002) and Song and Ma (2008), Song and Wang (2017) proposed a partially time-varying coefficient proportional hazards model to explore the associations between the hazard of failure and both time-varying and fixed covariates. The coefficients were estimated using a polynomial spline approach, and the corrected score and conditional score approaches were used to deal with mismeasured covariates. The proposed model was assessed by simulations and applied to the ACTG 175 data to explore the temporal dynamics of the effects of treatment and CD4 count on time to AIDS or death. The variables included in the model were treatment and CD4 count. Although the corrected score and conditional score approaches are asymptotically equivalent, the conditional score approach was recommended for its better finite-sample performance. The proposed approach has applicability in a wide range of studies dealing with survival data with longitudinal mismeasured covariates, and the proportional hazards model may be generalized to include functions of the random effects by analogy to Song et al. (2002).
Similar to the study of Song and Ma (2008), the study of Wen (2012) also analyzed general interval-censored data with mismeasured covariates, although the two studies took different approaches; Song and Ma (2008) converted interval-censored data into right-censored data using the multiple augmentation approach and then employed the conditional score approach to handle mismeasured covariates. Another study analyzing the ACTG 175 data (Huang & Zhang, 2008) proposed an estimation method for the bivariate proportional hazards model for competing risks.
The copula approach was used to conduct a sensitivity analysis for the Cox proportional hazards models, and joint modeling was used to address informative censoring. In this approach, the joint distribution of the survival time and the censoring time is assumed to be a function of their marginal distributions. Huang and Zhang (2008) used this approach to assess the effect of informative censoring on the parameter estimates. The proposed approach was assessed by simulation studies and applied to the ACTG 175 data. In the ACTG 175 data analysis, covariates were selected using a stepwise selection algorithm. A total of nine variables were selected for the analysis, including treatment, hemophilia, symptomatic HIV infection, gender, intravenous drug use, age, race, prior antiretroviral therapy, and CD4 count.
The ACTG 175 data were also analyzed in the present study, and the robustness of the findings was investigated. The details of the present study are described below.

Present Study
In real life, we may not be able to record the exact event time or the accurate lower and upper bounds for interval-censored survival data. For example, in the ACTG 175 data, the examinations of CD4 cell counts may not have been conducted exactly every 12 weeks, and the time bounds may be affected by the unobservable event time, which may result in inaccurate time bounds. If this is the case, the estimates as well as the standard errors may not be calculated consistently, which may affect the accuracy of the findings. In this project, the impact of inaccurate time bounds on the analysis of interval-censored data was studied under both noninformative and informative censoring scenarios, using simulations and the real data. The purposes of the present study were:
• To develop an easy-to-use code for interval-censored survival data with both fixed and time-dependent covariates.
• To conduct extensive simulations to investigate the robustness of interval-censored survival analysis with inaccurate time bounds and time-dependent covariates, particularly under noninformative censoring and informative censoring.
• To conduct a real data analysis using the ACTG 175 data to investigate the robustness of findings under inaccurate time bounds.
This study was the first to develop code for interval-censored survival data with both fixed and time-dependent covariates, and the first to investigate the robustness of interval-censored survival analysis under inaccurate time bounds.
This study contributed to the research literature on this topic and to the methodology, procedures, and analysis of future survival studies examining similar topics.

Data Setting
The data consist of independent and identically distributed (i.i.d.) measurements from n subjects. Each subject has a sequence of visit times, a vector of fixed covariates, a vector of time-dependent covariates, and an event time.
Specifically, for subject $i = 1, \ldots, n$, let $t_{i,1} < \cdots < t_{i,k_i}$ be a random sequence of visit times, where $k_i$ denotes the total number of visits for the $i$th subject. For convenience, let $t_{i,k_i+1} = \infty$. The first visit is at $t_{i,1} = 0$, the beginning of the study. The event time $T_i$ is not observed exactly; it is known only to fall in an interval $(L_i, R_i]$ formed by adjacent visit times, so each subject contributes
$$P(L_i < T_i \le R_i) = S(L_i) - S(R_i)$$
to the likelihood, where $S(\cdot)$ denotes the survival function. This formula allows for left-censored ($L_i = 0$), right-censored ($R_i = \infty$), and interval-censored observations. Let $\beta$ and $\theta$ be the coefficient vectors for the fixed covariates $x_i$ and the time-dependent covariates $z_i(t)$, respectively, and let $\alpha$ be the intercept; $\beta$ is $p$-dimensional and $\theta$ is $q$-dimensional. With time-dependent covariates, the hazard function can be expressed as
$$\lambda(t; x_i, z_i(t)) = (t + \kappa)^{\gamma} \exp\{\alpha + x_i^{\top}\beta + z_i(t)^{\top}\theta\},$$
where $\kappa \ge 0$ and $\gamma$ are general hazard function parameters; the parameter vector is $\eta = (\alpha, \gamma, \kappa, \beta, \theta)$. Specific values of $\gamma$ and $\kappa$ yield a specific distribution (Sparling et al., 2006). For instance, $\kappa = 0$ yields a Weibull hazard. With no time-dependent covariates, the hazard function simplifies to $\lambda(t; x_i) = (t + \kappa)^{\gamma} \exp\{\alpha + x_i^{\top}\beta\}$. Based on the hazard function, the cumulative hazard function is $\Lambda(t) = \int_0^t \lambda(u)\,du$, and the survival function is $S(t) = \exp\{-\Lambda(t)\}$.
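The hazard, cumulative hazard, and survival functions above can be sketched in R as follows. The function names, the use of numeric integration, the single fixed covariate, and the example covariate path z(t) are assumptions for this sketch, not the study's actual code.

```r
# Sketch of lambda(t) = (t + kappa)^gamma * exp(alpha + x*beta + theta*z(t))
# for one subject with a single fixed covariate x and a time-dependent
# covariate path z_fun(t). Illustrative only.
hazard <- function(t, eta, x, z_fun) {
  (t + eta[["kappa"]])^eta[["gamma"]] *
    exp(eta[["alpha"]] + x * eta[["beta"]] + eta[["theta"]] * z_fun(t))
}

# Cumulative hazard by numeric quadrature; survival S(t) = exp(-Lambda(t))
cum_hazard <- function(t, eta, x, z_fun) {
  integrate(function(u) hazard(u, eta, x, z_fun), lower = 0, upper = t)$value
}
surv <- function(t, eta, x, z_fun) exp(-cum_hazard(t, eta, x, z_fun))

# Example: a binary time-dependent covariate that switches on at t = 3
z_fun <- function(t) as.numeric(t >= 3)
eta <- c(alpha = 0, beta = 0.8, gamma = 0, kappa = 2, theta = -0.5)

# Probability that the event falls in the interval (2, 4]
p_interval <- surv(2, eta, x = 1, z_fun) - surv(4, eta, x = 1, z_fun)
```

With gamma = 0 the hazard is piecewise constant in t, so the integral could also be computed in closed form; integrate() keeps the sketch general for other values of gamma and kappa.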

Model
The likelihood approach was used to draw inferences about the unknown parameters. The log-likelihood function can be expressed as
$$\ell(\eta) = \sum_{i=1}^{n} \log\{S(L_i; \eta) - S(R_i; \eta)\},$$
where the parameter vector $\eta = (\alpha, \gamma, \kappa, \beta, \theta)$. The estimator $\hat{\eta}$ was the maximizer of the log-likelihood function and is asymptotically normal. The R function optim with the BFGS method (Nocedal & Wright, 1999) was used to compute the maximizer.
The R package numDeriv was used to calculate the Hessian matrix of the negative log-likelihood. The estimated variances were the diagonal elements of the inverse of the Hessian matrix, and the standard errors of the estimates were the square roots of the estimated variances. The 95% confidence intervals were constructed as the estimates plus and minus 1.96 times the standard errors.
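This estimation workflow can be sketched for a simplified special case: gamma = 0 (a constant hazard given covariates) with one fixed covariate and no time-dependent covariate. The simulated data and true values (alpha = -1, beta = 0.8) are illustrative, and for self-containment the Hessian is taken from optim's hessian = TRUE rather than from numDeriv as in the study itself.

```r
# Simulate interval-censored data under a constant hazard exp(alpha + beta*x)
set.seed(1)
n  <- 400
x  <- rbinom(n, 1, 0.5)
tt <- rexp(n, rate = exp(-1 + 0.8 * x))  # latent event times
L  <- floor(tt)                          # observed interval (L, L + 1]
R  <- L + 1

surv <- function(t, alpha, beta, x) exp(-exp(alpha + beta * x) * t)

# Negative log-likelihood: each subject contributes S(L) - S(R)
negloglik <- function(par) {
  -sum(log(surv(L, par[1], par[2], x) - surv(R, par[1], par[2], x)))
}

fit <- optim(c(0, 0), negloglik, method = "BFGS", hessian = TRUE)
se  <- sqrt(diag(solve(fit$hessian)))    # SEs from the inverse Hessian
ci  <- cbind(fit$par - 1.96 * se, fit$par + 1.96 * se)
```

The same pattern (build the interval likelihood, maximize, invert the Hessian) extends to the full model with time-dependent covariates.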

Development of an Easy-to-Use Implementation
Existing software provides limited support for fitting regression models to interval-censored survival data with time-dependent covariates. For example, the survreg function in R can be used for left-, right-, or interval-censored data, but only with fixed covariates; the coxph function handles both fixed and time-dependent covariates, but only right-censored data. To facilitate the study of the impact of inaccurate time bounds, I developed my own code for interval-censored survival data that accommodates both fixed and time-dependent covariates. In the simulations, the true values of α, β1, β2, γ, κ, and θ were 0, 0.8, 0.8, 0, 2, and −0.5, respectively.
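Data generation under this simulation setting can be sketched as follows. With γ = 0 the term (t + κ)^γ equals 1, so the hazard reduces to exp(α + β1·x1 + β2·x2); the time-dependent covariate is omitted here for brevity, and the periodic visit schedule is an illustrative assumption.

```r
# Generate latent event times under the simulation setting (gamma = 0,
# so the hazard is constant given covariates), then record only the
# visit interval that brackets each event. Visit schedule is assumed.
set.seed(3)
n    <- 200
x1   <- rbinom(n, 1, 0.5)
x2   <- rnorm(n)
rate <- exp(0 + 0.8 * x1 + 0.8 * x2)   # alpha = 0, beta1 = beta2 = 0.8
tt   <- rexp(n, rate)                  # latent event times

visits <- 0:10                         # hypothetical periodic visit times
L <- sapply(tt, function(t) max(visits[visits < t]))
R <- sapply(tt, function(t) if (t > max(visits)) Inf else min(visits[visits >= t]))
```

Events occurring after the last visit become right-censored (R = Inf), matching the data setting described in Chapter 2.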

Simulation Studies
In this study, extensive simulations were conducted to investigate the robustness of the interval-censored survival analysis with inaccurate time bounds and time-dependent covariates, particularly under the following two scenarios.
1) Noninformative censoring: the new lower bound and upper bound of the interval were defined as $L_i^* = L_i U_i$ and $R_i^* = R_i V_i$, where $U_i$ was drawn from a uniform distribution on $[1/M, 1]$, $V_i$ was drawn from a uniform distribution on $[1, M]$, and $M \ge 1$ was a tuning parameter taking a sequence of 20 values between 1 and 10.
2) Informative censoring: the new lower bound and upper bound of the interval were defined in the same form, $L_i^* = L_i U_i$ and $R_i^* = R_i V_i$, where $U_i$ was drawn from a uniform distribution on $[1/M_i, 1]$ and $V_i$ was drawn from a uniform distribution on $[1, M_i]$; here $M_i \ge 1$ was a tuning parameter defined as a function of the event time $T_i$ and of $M$, where $M \ge 1$ was a tuning parameter as in the noninformative censoring scenario.
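The bound-perturbation step in both scenarios can be sketched as follows. The multiplicative form and the specific function linking M_i to the event time in the informative case are illustrative assumptions for this sketch.

```r
# Perturb interval bounds: shrink L by a factor in [1/M, 1] and stretch
# R by a factor in [1, M]. In the informative case, M_i depends on the
# event time; the particular M_i below is a hypothetical choice.
perturb_bounds <- function(L, R, M, event_time = NULL) {
  if (is.null(event_time)) {                      # noninformative: fixed M
    Mi <- rep(M, length(L))
  } else {                                        # informative: M_i = f(T, M)
    Mi <- 1 + (M - 1) * event_time / (1 + event_time)
  }
  u <- runif(length(L), 1 / Mi, 1)                # shrinks the lower bound
  v <- runif(length(R), 1, Mi)                    # stretches the upper bound
  list(L = L * u, R = R * v)
}

set.seed(2)
b <- perturb_bounds(L = c(1, 2), R = c(2, 3), M = 4)
```

By construction the perturbed interval always contains the original one, so larger M corresponds to less accurate time bounds.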
The method described in Chapter 2 was applied to obtain the parameter estimates. Then the bias, the standard error, the mean squared error (i.e., the average squared difference between the estimated values and the true value), and the coverage probability (i.e., the probability that a procedure for constructing random regions will produce an interval containing the true value) were calculated for each parameter. The programming language was R.
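Given replicated fits, the four metrics can be computed as sketched below; `est` and `se` are hypothetical matrices (replications by parameters) of estimates and standard errors, and `truth` holds the true parameter values.

```r
# Compute bias, average SE, MSE, and 95% CI coverage across replications.
# Rows of est/se are replications; columns are parameters.
sim_metrics <- function(est, se, truth) {
  tr <- matrix(truth, nrow(est), ncol(est), byrow = TRUE)
  lo <- est - 1.96 * se
  hi <- est + 1.96 * se
  data.frame(
    bias     = colMeans(est) - truth,
    avg_se   = colMeans(se),
    mse      = colMeans((est - tr)^2),
    coverage = colMeans(lo <= tr & tr <= hi)
  )
}

# Tiny degenerate check: estimates exactly equal to the truth
m <- sim_metrics(est = matrix(1, 2, 2), se = matrix(0.1, 2, 2), truth = c(1, 1))
```

In the study these summaries were tabulated for each parameter at each of the 20 M values.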

Real Data Application
The ACTG 175 data (e.g., Hammer et al., 1996) were used for the real data application, in which the robustness of the findings under inaccurate time bounds was investigated.

The average bias across 500 data sets for each parameter across the 20 M values for scenario 2 is presented in Table 1 for noninformative censoring and in Table 5 for informative censoring. Table 1 shows that, for noninformative censoring, bias was more robust against change across the 20 M values for β1, β2, and θ than for α, γ, and κ; Table 5 shows the same pattern for informative censoring.

The average standard error across 500 data sets for each parameter across the 20 M values for scenario 2 is presented in Table 2 for noninformative censoring and in Table 6 for informative censoring. Tables 2 and 6 show that the standard error slightly increased across M values for each parameter, under both noninformative and informative censoring.
The average mean squared error across 500 data sets for each parameter across the 20 M values for scenario 2 is presented in Table 3 for noninformative censoring and in Table 7 for informative censoring. Table 3 shows that, for noninformative censoring, the mean squared error was more robust against change across the 20 M values for β1, β2, and θ than for α, γ, and κ; Table 7 shows the same pattern for informative censoring.
The coverage probability of the 95% confidence interval across the 20 M values for each parameter for scenario 2 is presented in Table 4 for noninformative censoring and in Table 8 for informative censoring. Scatter plots of the coverage probability were also drawn for noninformative and informative censoring (see Figure 2). Table 4 and Figure 2(a) show that, for noninformative censoring, the coverage probability was more robust against change across the 20 M values for β1, β2, and θ than for α, γ, and κ; Table 8 and Figure 2(b) show the same pattern for informative censoring.
Additionally, box plots of the parameter estimates across the 20 M values were drawn for noninformative and informative censoring for the simulated data including only fixed covariates (i.e., scenario 1); see Figure 3. Scatter plots of the coverage probability of the 95% confidence interval across the 20 M values were drawn for each parameter for noninformative and informative censoring in scenario 1 (see Figure 4). To summarize, under both noninformative and informative censoring in scenario 2, parameter estimates and coverage probabilities were more robust against change across M values for β1, β2, and θ than for α, γ, and κ, which were not robust against change across M values; likewise, under both censoring scenarios in scenario 1, parameter estimates and coverage probabilities were more robust against change across M values for β1 and β2 than for α, γ, and κ.

In both scenario 2 and scenario 1, the parameter estimates and coverage probabilities for the general hazard function parameters (α, γ, and κ) were more robust against change across M values under noninformative censoring than under informative censoring.
In the scenarios considered in this simulation study, noninformative censoring is almost irrelevant to the survival time, so as the M value changed, the scale and shape parameters (α, γ, and κ) changed under noninformative censoring, but not as severely as under informative censoring, which is related to the survival time.

The estimates, standard errors, and p-values across the 20 M values for treat and cd4 are presented in Table 9 for noninformative censoring and in Table 10 for informative censoring. Table 9 shows that, for noninformative censoring, the estimates and p-values for treat and cd4 were relatively robust against change across M values. Table 10 shows that, for informative censoring, the estimates and p-values for treat and cd4 were even more robust against change across M values.

The results of the analysis including additional covariates are presented in Tables 11 and 12 for noninformative censoring and in Tables 13 and 14 for informative censoring. Tables 11 and 12 show that, for noninformative censoring, the estimates and p-values for treat and cd4 were relatively robust against change across M values, with values and ranges similar to those in the analysis using only treat and cd4. For example, when only treat and cd4 were used, the estimates for treat ranged over approximately (−0.334, −0.042); when more covariates were included, they ranged over approximately (−0.349, −0.057); the p-values for treat and cd4 in both analyses were all less than 0.05 and thus significant. Tables 13 and 14 show that, for informative censoring, the estimates and p-values for treat and cd4 were likewise robust against change across M values.
In the scenarios considered in this simulation study, for both noninformative and informative censoring, the scale and shape parameters (α, γ, and κ) changed as the M value changed, so the parameter estimates and coverage probabilities for α, γ, and κ were not robust against change across M values. However, since noninformative censoring is not related to the survival time while informative censoring is, the change in the scale and shape parameters under noninformative censoring was not as severe as under informative censoring. In addition, the informative censoring scenario considered in this study has no strong relationship with the fixed and time-dependent covariates; thus, under informative censoring, the parameter estimates and coverage probabilities for β1, β2, and θ were also robust against change across M values.
In the real data analysis, the estimates, standard errors, and p-values across the 20 M values were obtained for the fixed covariate, treat, and the time-dependent covariate, cd4, in ACTG 175. The findings demonstrated that for noninformative censoring, the estimates as well as the p-values for treat and cd4 were relatively robust against change across M values; for informative censoring, the estimates as well as the p-values for treat and cd4 were even more robust against change across M values. The estimates for treat and cd4 were all negative and significant under both noninformative and informative censoring, indicating a lower hazard and hence longer survival times. The estimates under informative censoring were the most robust against change across M values, which was consistent with the simulation findings and is worth further investigation.
The real data analysis in this study also included additional covariates. The estimates, standard errors, and p-values across the 20 M values were obtained for the fixed covariates treat, karnof, and symptom and the time-dependent covariate cd4. The findings of this analysis were compared with those of the analysis using only treat and cd4; the results for treat and cd4 in the two analyses were very similar.
This study was the first to develop code for interval-censored survival data with both fixed and time-dependent covariates, and the first to investigate the robustness of interval-censored survival analysis under inaccurate time bounds.
This study contributed to the research literature on this topic and to the methodology, procedures, and analysis of future survival studies examining similar topics.
This project used ACTG 175 as the real data and was an illustration of how to evaluate the robustness of findings under inaccurate time bounds. Similar procedures can be conducted on other datasets with inaccurate time bounds. Moreover, this study analyzed clinical trial data and included one time-dependent covariate in the analysis.
Future studies can include more time-dependent covariates and can extend the approach to other areas, such as finance, accounting, and marketing; for example, it could be used for bankruptcy prediction and credit default prediction. Moreover, this study selected the covariates included in the real data analysis based on previous studies analyzing the ACTG 175 data (e.g., Hammer et al., 1996; Huang & Zhang, 2008; Scharfstein & Robins, 2002; Song et al., 2002; Song & Ma, 2008; Song & Wang, 2017; Wen, 2012; Wen & Chen, 2014). Future studies can use automatic variable selection methods such as forward selection, backward elimination, stepwise regression, and the LASSO (e.g., Tibshirani, 1996) to select the best subset of predictors. In addition, for interval-censored data with a large number of observations, future studies can use a divide-and-conquer strategy. For example, for simulated data with 500,000 observations, the data can be divided into 500 blocks, the parameter estimates and their variances can be calculated within each block, and the 500 results can then be merged into one.
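The merge step of such a divide-and-conquer scheme can be sketched as inverse-variance weighted pooling of the per-block estimates; this is a standard pooling choice assumed here for illustration, not a method taken from this study, and the block estimates below are hypothetical.

```r
# Pool per-block estimates of one parameter by inverse-variance weighting:
# blocks with smaller variance get larger weight, and the pooled variance
# is smaller than any single block's variance.
merge_blocks <- function(est, var) {
  w <- 1 / var                        # inverse-variance weights
  list(est = sum(w * est) / sum(w),   # pooled estimate
       var = 1 / sum(w))              # pooled variance
}

pooled <- merge_blocks(est = c(0.79, 0.83, 0.80), var = c(0.04, 0.05, 0.04))
```

In practice each block's estimate and variance would come from fitting the interval-censored model within that block.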
In addition, future studies may consider joint models for longitudinal and time-to-event data, which bring these two data types together into a single model so that one can infer the association between the longitudinal biomarker and the time to event, better examine treatment effects, and reduce bias in the estimates of treatment effects. One example of such a joint model can be described as follows.
We can utilize subject-specific random effects $b_i$ to account for the possible dependence between the event time $T_i$ and the visit times $t_{i,j}$. This approach involves two models: one for the gap time between adjacent visits, and the other for the event time. As an illustration, we can assume that the hazard function for the gap time and the hazard function for the event time both depend on the shared random effect $b_i$. We can still use the likelihood approach to estimate the model parameters, with the likelihood obtained by integrating the joint contribution of the visit process and the event time over the distribution of the random effects.