Risk Classification in High Dimensional Survival Models

Sparse regression models are an actively burgeoning area of statistical learning research. A subset of these models seek to separate out significant and non-trivial main effects from noise effects within the regression framework (yielding so-called “sparse” coefficient estimates, where many estimated effects are zero) by imposing penalty terms on a likelihood-based estimator. As this area of the field is relatively recent, many published techniques have not yet been investigated under a wide range of applications. Our goal is to fit several penalty-based estimators for the Cox semiparametric survival model in the context of genomic covariates on breast cancer survival data where there are potentially many more covariates than observations. We use the elastic net family of estimators, special cases of which are the LASSO and ridge regression. Simultaneously, we aim to investigate whether the finer resolution of next-generation genetic sequencing techniques adds improved predictive power to the breast cancer patient survival models. Models are compared using estimates of concordance, namely the c-statistic and a variant which we refer to as Uno’s C. We find that ridge regression models best fit our dataset. Concordance estimates suggest finer resolution genetic covariates improve model predictions, though further work with more observations is required.


LIST OF TABLES
1 A confusion matrix giving the expected distribution of a model's prediction outcomes in terms of its sensitivity and specificity, given some P "true" outcomes and N "false" outcomes. . . . . . . . . . . . . 12 2 An example of within-sample normalization using counts per million (CPM) yielding misleading results: it is more likely that genes 1-4 are expressed identically in patients 1 and 2, while only gene 5 is differentially expressed. CPM normalization only considers the number of counts within each patient and yields a result suggesting all five genes are differentially expressed.  Table   Page viii 9 Summary of cross-validated loss in models fit to the covariates determined by univariate filtering at the isoform level. For each given level of α, the model with the lowest value of mean cross-validated loss ("cvm") over all possible values of λ is shown. Models run on the 332 constituent isoforms are labeled as "iso" type, and those run on the 76 gene-level aggregates are labeled as "gene" type. The standard deviation ("cvsd") and number of non-zero model coefficient estimates ("nzero") are displayed. For comparison purposes, we've added the number and proportion of the 76 genes represented by at least one non-zero isoform or aggregate coefficient estimate in the "gzero" and "percnzero" columns. . . . . . . . . . . . . . . . . . . . . . 36 10 Summary of cross-validated loss in models fit to the covariates determined by univariate filtering at the gene-aggregate level. For each given level of α, the model with the lowest value of mean cross-validated loss ("cvm") over all possible values of λ is shown. Models run on the 914 constituent isoforms are labeled as "iso" type, and those run on the 298 gene-level aggregates are labeled as "gene" type. The standard deviation ("cvsd") and number of nonzero model coefficient estimates ("nzero") are displayed. For comparison purposes, we've added the number and proportion of the 298 genes represented by at least one non-zero isoform or aggregate coefficient estimate in the "gzero" and "percnzero" columns. . . . . . 40 11 C-statistics for each model from Table 9. These models were fit to the set of covariates determined by univariate filtering at the isoform-level. "iso" models were fit to the set of 332 constituent isoforms and "gene" models were fit to 76 gene-level aggregates. . . . . 45 12 C-statistics for each model from Table 10. These models were fit to the set of covariates determined by univariate filtering at the isoform-level. "iso" models were fit to the set of 914 constituent isoforms and "gene" models were fit to 298 gene-level aggregates. . . . . 46 13 Difference in Uno's C between each "gene" and "iso" pair from Table 11. The difference is given in the "concordance" column. A negative difference indicates Uno's C is higher for the "iso" level model.  Table   Page ix 14 Difference in Uno's C between each "gene" and "iso" pair from Table 12. The difference is given in the "concordance" column. A negative difference indicates Uno's C is higher for the "iso" level model. Test statistics for log-rank tests for the final ridge models. Tests were on the association between survival outcome and whether the patient's risk score was below the median risk but used the same set of observations the models were fit on. However, there is a large difference in the χ 2 values between the two models. . . . . . . . . . . . . . 51

Figure Page
1 Genes, encoded as DNA, are expressed as proteins through a multistage process beginning with transcription into RNA. Protein coding exon regions are then spliced and translated into protein structures. Gene expression can be measured by counting mRNA splicing variants. The expression can either be treated as the sum or the set of all mRNA counts derived from a gene The lasso tends to produce sparse coefficient estimates, while the ridge estimator shrinks all estimates towards zero. In the case of estimating two coefficients, the free parameter λ in the lasso estimator dictates the size of a square region (left) corresponding to the L 1 norm penalty, while the ridge estimator's L 2 2 penalty creates a circular constraint (right). The estimators return the values in the parameter space closest to the maximum likelihood estimate (β) which meet their constraints. In the case of the lasso, this tends to be on a vertex of the square (or edge of the hypercube in higher dimensions), yielding one or more coefficient estimates of zero  Tables 9 and 10 plotted on the same set of axes, α versus "cvm". The "type" column is denoted by shape, and Table 9's models appear in blue, while xiii 17 C-Statistics and Uno's C for models from Tables 9 and 10 plotted on the same set of axes, α versus "concordance". The "type" column is denoted by shape, and

Introduction and Background
Information is being collected at an ever-increasing rate, yet more information does not guarantee its holder more knowledge. The areas of statistical learning and machine learning have erupted with new techniques to automate the process of gleaning knowledge from massive stores of information. This thesis is an exploration of a sliver of these methods, applied to a specific problem domain: cancer progress prediction through genomic sequencing. We investigate whether finer-grained genomic information, available through newer sequencing methods, provides additional predictive power over the previous resolution of information.

Gene Expression
The central dogma of molecular biology describes the mechanisms by which information encoded at a cellular level is expressed as proteins by an organism . In a simplified form, it suggests information, which partially determines the physical traits of the organism, is encoded as linear sequences of deoxyribonucleic acid (DNA). This information is used to construct proteins, large three dimensional structures which facilitate nearly all cellular functions within the organism. A gene is a sequence of DNA that maps to a particular protein, and the genome comprises all genes. Genes are expressed as proteins through a series of steps. DNA is transcribed into an intermediate ribonucleic acid (RNA) form, and the RNA sequence is spliced to remove non-coding "intron" regions and join the remaining proteincoding "exon" regions into new contiguous sequences . The order and sequence of the exon regions determines the specific protein variant which will be generated. Multiple splicing variants of a gene may exist. The splicing variants of a gene, referred to as its "isoforms", use different arrangements and subsets of ex- Many diseases, such as breast cancer, arise from corruption of the genetic code within exon regions (the exome) . Cancer is the manifestation of uncontrolled cell growth. Since cellular functions are dependent on protein, excessive upregulation of protein production can increase cell reproduction. Similarly, certain regulatory functions are maintained through protein structures. In these cases, downregulation of these proteins could also result in increased cell reproduction.
The cell constantly dismantles and regenerates many proteins in order to facilitate these regulation mechanisms. Measuring the level of gene expression within a cell is an indicator of protein activity. This thesis aims to compare isoform-level expression information with overall gene expression in cancerous cells.

Survival Models
A survival model, or time-to-event model, is a set of assumptions made about the survivor function of a random event. The survivor function, denoted S(t ), is defined as the probability of the random event occurring after some time t , denoted Pr(T > t ) where T is the random variable time of event and both t and T are defined in relation to a common starting time. The survivor function is the complement of the cumulative distribution function of T . We assume the event will eventually occur such that T is strictly positive and continuous. Sometimes T is referred to as the "failure time". If we assume S(t ): • equals 1 when t = 0 (the event cannot occur until some time has passed), • is non-increasing for t > 0, and • approaches 0 as t → ∞ (the event must occur given infinite time).
then we can show S(t ) has a one-to-one relationship with a hazard function h(t ) which represents the instantaneous rate of the event occurring at time t given that it has not yet occurred. The relationship between survivor function and hazard function is given by The hazard at time t is a rate, not a probability, and has less restrictions on its form. Namely, a valid hazard function yields non-negative values for all values of t and integrates to infinity over the positive real line. By modeling a valid hazard function we've effectively modeled a valid survivor function using the relationships shown above.
Modeling the hazard function allows us to answer questions in a similar fashion to modeling the log-odds of the event occurring, such as the determination of risk factors in a logistic regression. However, the hazard function explicitly accounts for the passage of time. This can yield more information from "right-censored" observations when fitting a model. Right-censored observations are when the event must have occurred after a time t i , but we do not know the actual time-of-event. Modeling the log-odds ignores the censoring time t i and only considers the absence of the event. Modeling the hazard rate, however, uses both censoring times and time of events during the model fitting, as discussed below in the specific case of the Cox proportional hazards model.

The Cox Proportional Hazards Model
One way to model the effect of covariates on the time of a random event uses the approach of hazard regression. If the covariates X are unchanging with respect to time, their effect on the hazard function can be modeled as h(t |X ) = h(t )g (X ) where g (X ) > 0. A popular model family for this conditional hazard is based on the Cox semi-parametric proportional hazards model ). The Cox model assumes that the ratio of two rates with different covariates, i.e. conditional hazard ratio, for the same event is constant and a function of the covariates. The covariates modify the shared underlying hazard function h 0 (t ), denoted as the baseline hazard.
The log hazard equation above shows how the Cox model may be seen as a generalized linear model (Nelder and Baker, 1972), where the intercept γ(t ) contains the baseline hazard and a log link is used to estimate the hazard rate rather than an expected value. Extensions to the Cox model allow for time-varying covariate ef-fects  but lose the ability to predict individual survivor functions. In this thesis, however, we do not consider any time-varying covariates.
The Cox model is semi-parametric as it assumes linear covariate effects ( X i β) on the hazard rate and it fits the effects β (also referred to as the coefficients of the model) to observations using maximum likelihood of a partial likelihood function which does not specify the baseline hazard. A fully parametric model with a traditional likelihood function would have to specify, and thus make assumptions about, the baseline hazard. The partial likelihood function, L P L (β), for the Cox model is given by where D is the set of event times, R r is the set of observations at risk of the event at time r , and j r is the observation with the event at time r (Tibshirani et al., 1997).
Maximum partial likelihood suggests the values of β which produce the highest value of the partial likelihood function given a dataset are the best estimates for the true parameters of the model. In general, maximum partial likelihood estimation is solved as an optimization problem. For mathematical and computational convenience, this is often reformulated as maximizing the log partial likelihood function log L P L (β), and the maximum partial likelihood estimator (MPLE) of β,β, is given bŷ Fitting a Cox model allows a researcher to make probabilistic statements about relative risk (i.e a patient who smokes has twice the risk of experiencing a heart attack as one who does not smoke, all else being equal.) If the covariates are continuous, e β j can be considered a the change in relative risk due to a one unit change of covariate X j .

Regularized Models
In problems where there are more unknown parameters to be fitted than observations, the Cox model is over-parameterized; that is, more than one combination of parameter estimates maximize the partial likelihood function. Some other information would be needed to distinguish the "best" set of parameter estimates. Modifying this over-parameterized model such that the MPLE yields a single unique estimate is termed "regularizing" the model. If a regularized likelihood function is convex, the MPLE of the regularized model may be estimated using efficient convex optimization techniques. This is important to fit models on large datasets in a reasonable amount of time (i.e. hours or days!) Moreover, if it is known apriori that not all, or possibly just a few, covariates affect the time of the outcome event then the model should also account for this. The assumption of relatively few true covariate effects is termed "sparsity" . This thesis considers several cases of one family of regularization methods which exploit the sparsity assumption. In these methods β is estimated using a "penalized" estimator which can be written aŝ λ is a free parameter which dictates the bias towards 0 exerted by a penalty term P T (β), where P T (β) ≥ 0. If λ = 0 and the model is not over-parameterized, β * yields the maximum likelihood estimate given byβ. As λ approaches ∞, the estimatorβ * becomes smaller and smaller, with all covariate effects becoming 0. A penalized estimator will give estimates which are biased towards 0. In the context of prediction this property may be advantageous even if the underlying model is not over-parameterized. If the data contain many "noisy" measurements or covariates which do not truly influence the outcome, reducing the magnitudes of the estimated covariate effects may reduce model overfitting.

Ridge Regression, LASSO, and Elastic Net
The ridge  estimator is a penalized estimator which uses a penalty term of the squared Euclidean distance, or L 2 2 norm, of the coefficients.
The L 2 2 norm of β is defined as This norm shrinks coefficients towards zero as λ increases, but the estima-  , 1992) showed this is asymptotically true for a ridge estimator in the case of logistic regression, and that a ridge estimator can be applied to the Cox model (Verweij and Van Houwelingen, 1994). Additionally, the penalty function used to find the ridge estimator is convex and so the ridge estimator is unique and may be found quickly even for large datasets.
A related penalized estimator is the LASSO (Tibshirani, 1996), or lasso, estimator. LASSO is short for Least Absolute Shrinkage and Selection Operator. The lasso estimator uses a penalty term of the Manhattan distance, or the L 1 norm, of the coefficients. The L 1 norm of β is defined as The L 1 norm has several useful properties which have boosted the popularity of the lasso. It is the "smallest" of the L p family of norms (in terms of p) which is convex, and so it yields an estimator with a convex penalty function . The "smallest" member of the L p family of penalty terms is the L 0 generalized norm which is optimal for variable selection, as it corresponds to best subset selection where the Figure 2. The lasso tends to produce sparse coefficient estimates, while the ridge estimator shrinks all estimates towards zero. In the case of estimating two coefficients, the free parameter λ in the lasso estimator dictates the size of a square region (left) corresponding to the L 1 norm penalty, while the ridge estimator's L 2 2 penalty creates a circular constraint (right). The estimators return the values in the parameter space closest to the maximum likelihood estimate (β) which meet their constraints. In the case of the lasso, this tends to be on a vertex of the square (or edge of the hypercube in higher dimensions), yielding one or more coefficient estimates of zero. Source: Statistical Learning with Sparsity likelihoods of all combinations of inclusions for predictor variables are considered.
Best subset selection, however, requires computational resources which are infeasible for non-trivial problems. The L 1 norm as a penalty term tends to remove predictor variables from fitted models, allowing the lasso estimator to yield "sparse" estimates.
The use of the L 1 penalty also has the propensity to shrink small-magnitude covariate effects to 0 while leaving large-magnitude effects close to their unbiased MPLE estimates. However, if the covariates are highly correlated the lasso estimator will arbitrarily shrink one effect towards zero.
The elastic net (Zou and Hastie, 2005) estimator attempts to yield sparse estimates similar to the lasso while imposing uniform shrinkage on the effects of covariates by using a weighted combination of the L 1 and L 2 2 norms as the penalty term.
It introduces a second free parameter, α ∈ (0, 1), and uses the weighted combination 2 ) as the penalty term. It can be thought of as a generalization of the ridge and lasso estimators.

Bayesian Interpretation
These estimators can also be considered in the context of the Bayesian frame- The "maximum a-posteriori" or MAP estimate of β can be obtained as the mode of Pr(β|D) and is equivalent to maximizing log Pr(β|D), yielding the equation Using the partial likelihood for the Cox model, the ridge estimator is equivalent to the MAP estimator where the effects β i have i.i.d Gaussian prior distributions such that . The lasso estimator is equivalent to a MAP estimator using i.i.d Laplacian prior distributions on each β i , centered at zero and with spread parameters determined by λ. The elastic net is equivalent to a MAP estimator using a mixture of penalty terms as the i.i.d priors on each β i . Figure 3. Comparison of Laplace and Gaussian distributions with identical location and scale parameters (0 and 1, respectively). The Laplace distribution places more probability mass at its central value and in its tails than the Gaussian distribution. This characteristic is indicative of the lasso estimator's sparsity properties.

Model Assessment
Many techniques to order statistical models from "better" to "worse" quantify the "fit" of each model. That is, they use criteria derived from each model's likelihood function. A well fitting model can then be interpreted in the context of its assumptions.
In contrast, we are interested in comparing the predictive power of our estimated models. Several measures have appeared in the literature, such as the cross-validated partial likelihood, Brier Score, and pseudo R 2 (Van Wieringen et al., 2009). Crossvalidated partial likelihood is based on the model likelihood function, discussed previously and is presented in our results. Here we discuss another measure of predictive power for a single model (concordance), one of its interpretations (area under a receiver operating characteristics curve), and a variant of a common estimator for it.
Due to the asymptotic normality of the estimator we can test for differences in pre-dictive power between models, which we also present in our results.

Concordance and Receiver Operating Characteristics Curves
Concordance is defined as where (T i ,X i ), i = 1, 2 are two independent failure times and covariate sets and pred(X i ) is a predictor of the hazard rate as a function of the covariates. The estimated linear predictor,βX i , is used as pred(X i ) in the case of Cox and penalized Cox models. In the absence of right-censoring the concordance is equivalent to the area under a Receiver Operating Characteristics (ROC) curve for the given model  and is related to the sensitivity and specificity of the model.
The ROC curve for a predictive model with two outcomes is a graphical representation of the model's ability to correctly predict observations experiencing the event or "positive" observations, referred to as sensitivity, as the tolerance for incorrect positive predictions or "false positives" ranges from nil to unbounded.
In a binary outcome scenario, where outcomes are "true" or "false" and we have a set of covariates for each "true" or "false" observation, a model which computes a score based on the covariates can be considered to predict a "true" outcome if the score is above a fixed threshold and "false" otherwise. The model's performance can be evaluated according to its sensitivity and specificity. The sensitivity is the proportion of true positives (the outcome was "true" and the model predicted "true") among all "true" observations, and the specificity is the proportion of true negatives (the outcome was "false" and the model predicted "false") among all "false" observations. As seen in Table 1, a confusion matrix showing the expected distribution of model predictions against actual outcomes can be constructed from the model's sensitivity and specificity along with a set of outcomes.
Model: "true" Model: "false" Total Actual: "true" sensitivity × P (1-sensitivity) × P P Actual: "false" (1-specificity) × N specificity × N N Table 1. A confusion matrix giving the expected distribution of a model's prediction outcomes in terms of its sensitivity and specificity, given some P "true" outcomes and N "false" outcomes.
The ROC curve plots specificity against sensitivity as the threshold for class prediction is varied. The horizontal axis denotes (1-specificity) and the vertical axis denotes (sensitivity).
A model which has no predictive power will appear as a straight line between 0,0 (denoting the threshold at which all observations are classified as "false"), and 1,1 (denoting the threshold at which all observations are classified as "true") since There are several methods to assess whether two models fitted on the same observations have statistically different AUCs including those based on the asymptotic normality of concordance .

C-Statistics
Survival models model time-to-event, not binary outcome, and must use adapted ROC methods. Instead of computing the full ROC curve to determine the AUC, a prevalent estimator of concordance for a survival model is the cstatistic (Lee and Mark, 1996). The c-statistic is defined as where concord is the number of concordant pairs, discord is the number of discordant pairs, and tied is the number of tied pairs. If all observations experienced the event, the c-statistic is an unbiased estimator of concordance. In the presence of right-censoring the distribution of the c-statistic is dependent on the censoring distribution and may no longer directly estimate the concordance ).
We note that detecting whether a survival model has increased predictive power over another may not be possible by the comparing the concordances if both models have reasonable predictive power (Pencina et al., 2008).
However, the use of concordance is still prevalent in the literature without a widely accepted alternative, though work is ongoing to introduce new measures Lee, K. L. and Mark, D. B. (1996). Tutorial in biostatistics multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.

Cleaning
The Firehose portal provides aggregated outputs from a pipeline sequence of tools. However, the raw output was not in a format immediately ready for analysis.
Specifically, the data was organized in row-column format such that each row corresponded to a covariate (cancer stage, patient age, gene expression count) and each column corresponded to a patient. We used a set of R (R Development Core Team, ) and Python (VanRossum and Drake, 2010) helper routines to transpose the raw data into a row-column format where each row represents a patient (an observation) and each column represents a variable of interest. Separate cleaning and transformation steps were required by each dataset (gene expression counts and survival time & clinical covariates) before merging the two for analysis in R as detailed below.

RSEM Data
In order to give researchers the ability to explore many aspects of breast cancer gene expression, the BRCA data contains gene expression estimates of tissue samples from breast tumors and normal non-cancerous regions. Though the underlying gene expression counts are integers, the RSEM count estimates take on real values. We removed the 112 normal tissue samples from the data, leaving 1100 sets of breast tumor gene expression estimates, with each set containing 73,599 isoform count estimates.

Equivalence of Gene and Isoform Counts
Although it was not used in our analysis, TCGA provides other datasets, including gene-level expression estimates. We aggregated the isoform expression dataset at the gene-level and compared it to the provided gene-level estimates to verify our aggregation procedure was correct. All (patient,gene-level) counts were within .01 units, which we attribute to rounding precision. We used this as justification for working directly with the isoform expression dataset and later aggregating it at the gene-level. 6229 isoforms out of the 73,599 total did not correspond to a known gene.
We verified this was an anomaly of the underlying dataset and not an artifact of our procedures or an error. Although outside of the scope of this thesis, these isoforms do not map to a known gene in the HG19 reference genome used by the RSEM procedure to create expression count estimates. These isoforms were included in the normalization procedures detailed below if they exhibited non-trivial variance among patients, but are excluded from the analysis as we cannot compare them to a gene-level aggregate.

Removal of low-variance Isoform Counts
Many of the isoform expression estimates were zero for all patients. Since we are only interested in isoforms which have an effect on survival time, we removed Figure 4. Histogram of log sample variance for estimated isoform counts in the original TCGA dataset. Isoforms with log sample variance less than or equal to zero were removed from the data.
all isoforms with low sample variance of expression counts among patients. Figure 4 shows the histogram of log sample variance for each isoform across all patients. Isoforms with log sample variance less than or equal to 0 were excluded from all subsequent cleanup and analysis, and are ignored for the rest of this thesis. 67,027 isoforms had sample variances of at least 1 and were not excluded, and 63,214 isoforms had a known gene correspondence. Although the 67,027 isoforms were used in the normalization procedures, our final RSEM data contained information from 63,213 isoforms corresponding to 19,330 genes.

Between and Within Sample Normalization
The isoform expression estimates are dependent on the amount of tissue sampled (with more biomass we would expect to see higher estimated counts) and need to be rescaled. A common scaled unit which we adopt is "counts per million" or CPM, which is just a scaled sample proportion of each estimated count per gene or isoform  Table 2. An example of within-sample normalization using counts per million (CPM) yielding misleading results: it is more likely that genes 1-4 are expressed identically in patients 1 and 2, while only gene 5 is differentially expressed. CPM normalization only considers the number of counts within each patient and yields a result suggesting all five genes are differentially expressed. Source: Harold Pimentel i in patient j . In this section we will use the term "gene" to refer to both gene-level aggregates and isoforms when describing normalization techniques.
CPM, defined as only normalizes counts relative to a patient, or within-sample. An example from (Pimentel, 2014) illustrates the issue with this well; if we have counts from 5 genes for two patients where genes 1-4 are known to be expressed identically but gene 5 is expressed several orders of magnitude more in patient 2, we end up with the misleading CPMs shown in Table 2.
We need to normalize counts between patients as well in order to compare them. This requires the use of a between-sample normalization method. One method by (Bullard et al., 2010) uses an upper quantile based method to correct for between-sample differences. Normalized count data is available from Firehose using this method. We use the normalization method and implementation described by  in which the "sequencing depth" of each patient is estimated using a log-linear model. They define sequencing depth as a measure of the relative counts between patients. In the example from Table 2 the sequencing depth of patient 2 would be 3 relative to patient 1, since genes 1-4 had raw counts three times higher but are expected to show identical expression in both patients.
The normalization method is iterative, and uses half of the measured genes to estimate sequencing depth under the assumption that gene expression counts come from a Poisson distribution where the mean µ i , j count for patient i and gene j is equal to the expression level of j scaled by the sequencing depth of i . A simplified description and example of the method in action is described below.
• In the initial step, the sequencing depth is estimated as the proportion of each patient's total gene counts. For the example in Table 2, the proportion vector would be approximately (0.09, 0.91), by taking the marginal counts of (10, 100) and dividing by the sum.
• Next, each total gene count is scaled by the current estimation of the sequencing depth. For • The genes with GOF values within the first and third quartiles of the GOF vector are used to estimate the sequencing depth as in the initial step, and the procedure repeats. In the example, genes 1-4 would be used to estimate sequencing depth in the next iteration, and gene 5 would be ignored. When the estimates of sequencing depth remain nearly constant between iterations, the procedure can be terminated.
Our estimates of sequencing depth are summarized in Figure 5 and Table 3. The normalization procedure centers the final estimates around 1. The implementation also pre-filters genes with overall small counts before starting the procedure -this yielded 3012 isoforms out of 67,027 being filtered. We scaled each patient's isoform counts by the estimated sequencing depth to perform between-sample normalization. We then added 1 to each normalized count before converting each patient's normalized counts to CPM, guaranteeing all CPM values are greater than zero and allowing us to perform logarithmic transformations during the analysis.  Table 3. Descriptive statistics of estimated sequencing depth distribution. n = 1100.

Survival Times and Censoring
We used the "clinical pick" merged dataset, which contains high-level clinical covariates and survival outcomes for 1097 patients. We created the dependent variable time (time of event or censoring time) by using the "days_to_death" field when available or "days_to_last_followup" if missing. We also created a censoring indicator variable censoring with a value of 1 if days_to_death was used or 0 if days_to_last_followup was used.
The dataset contained observations from 12 males with breast cancer, of which along with 95% confidence intervals, is shown in Figure 6.

Merged Data and Clinical Covariates
We merged the isoform CPM transformed data and filtered "clinical pick" data using the common patient identifiers. This yielded a merged dataset containing 1074 observations. These observations contained all of the high-level clinical covariates from the "clinical pick" dataset.
We considered including several clinical covariates in our analysis as controlling variables. Originally we considered including "years_to_birth", "radiation_therapy", "number_of_lymph_nodes", "pathologic_stage", and "race". Age and number of   Table 4. Counts of "pathologic_stage" and collapsed stage variables, tabulated against survival outcome. "NA" and "stage x" were collapsed into the "unknown" level.
"Race" yielded a similar problem -the only "american indian or alaska native" was censored, and a number of patients did not have a recorded race. We elected to omit "race" as a categorical covariate.
We ran a standard Cox model on the remaining clinical covariates; "race" Censored Deceased  american indian or alaska native  1  0  asian  58  3  black or african american  151  29  white  630  108  NA  87  7   Table 5. Counts of the "race" variable tabulated against survival outcome.
"years_to_birth", "radiation_therapy", "number_of_lymph_nodes", and stage.  Table 6. Cox model estimates, along with Wald statistics and associated p-values for each coefficient where stage1 is the reference class, and Likelihood Ratio Test (LRT) of the model compared to a model with the lymph node covariate excluded. "num-ber_of_lymph_nodes" is the only coefficient with a statistically insignificant p-value. The LRT suggests there is not a statistically significant difference between the two models.
Our final dataset consisted of survival outcome, "years_to_birth", "radia-tion_therapy", stage, and RSEM covariates for each patient. Patients with missing information were omitted. This left 967 observations, with 109 events and 858 cen-

Filtering
Although the methods we use during the analysis fit over-parameterized models, the resulting covariate effect estimates are still limited by the number of observations. In particular, the lasso estimator will return at most 967 coefficient estimates. This is just over 1% of the isoform covariates! While the ridge estimator will return coefficient estimates for all covariates, it is unlikely that these will represent the true effects due to the lack of power. In order to increase the chances of fitting models with meaningful predictive power, we reduced the number of isoform covariates by con-ducting per-variable selection procedure while correcting for multiple testing using a local false discovery rate (fdr) threshold.

Per-Variable Models
We fit a standard Cox model for each individual isoform which has a known correspondence to a gene (63,213 isoforms), where each model contained only the single isoform covariate. As is common in genomic analysis of CPMs, the base-2 logarithmic transformation was taken on each covariate.
For each model, we collected the probability of estimating a covariate effect at least as large as the model's estimated effect under a null hypothesis where the true covariate effect is 0 (the p-value for the covariate). This was done using a Likelihood Ratio Test (LRT): the scaled difference between log likelihoods of models with and without the effect comes from a χ 2 distribution with one degree of freedom if the models are equivalent, as the sample size goes to infinity .
These are plotted in Figure 8 Table 7.
# models 63213 # non-NA 63213 # <= .05 5417 # <= .001 215 Table 7. Counts of per-isoform Cox models and models with extremely low p-values. Isoform filtering is based on a later correction to meet a false discovery rate threshold due to multiple testing.
We aggregated the isoform counts into gene-level counts, yielding 19,330 gene  Table 8.

Multiple Testing Correction
As alluded to in the previous section, conducting 63,213 or 19,330 tests (in the case of isoform or gene-level testing, respectively) constitutes a multiple testing is- Figure 9. Relative histogram of binned p-values of gene-level covariates in univariate per-aggregate Cox models, compared to the theoretical distribution under the null assumption of no effect on survival time. The resulting plot is similar to Figure 8.
sue. Under the assumption of isoform independence and the null hypothesis, we'd expect to see 3160 isoforms with a p-value less than 0.05. That is, even if all isoform expression levels have no effect on patient survival time (assuming all isoform expression levels are independent of one another) we'd likely find a large number spurious correlations between expression and survival time relative to the number of patients in our dataset.
One approach to identifying and separating the "interesting" isoforms or genes from those with spurious correlations is to control the local false discovery rate . We assume there are two underlying classes of covariates ("interesting" and "unimportant") and that each isoform or gene covariate belongs to one of the two classes. "Unimportant" covariates are assumed to have no "true" effect on patient survival time, while "interesting" covariates have a nonzero effect. We assume each covariate has a η 0 probability of being "unimportant" and a 1 − η 0 probability of being "interesting". If the "unimportant" and "interesting" covariates are identically distributed within their classes, the p-values, denoted y, from the per-covariate models follow a mixture density where f 0 is the density function of the "unimportant" p-values and f 1 is the density function of the "interesting" p-values. Since f 0 is the density function of the null hypothesis, we can define the probability of a covariate being "unimportant" given its p-value equals y as which is the definition of the local false discovery rate, or fdr. Here, f 0 is a Uniform 0,1 density. We use the R package "fdrtool"  to estimate botĥ η 0 andf (y) and construct the set of fdr probabilities for each isoform and gene-level covariate.
η 0 was estimated at 0.8408 for the isoform mixture and 0.7772 for the gene-level mixture. This makes sense; there is likely a higher proportion of "noise" covariates at the isoform level than when they are collapsed down into aggregates representing an overall gene level.
We plotted an unscaled version of the empirical cumulative distribution function (CDF) for the estimated fdr values in Figures 10 and 11.
We used a cutoff threshold of 0.20; that is, accepting covariates with an estimated probability of being "unimportant" given their p-value -a false discovery -less than  constituent 332 isoform variants (including the 81 "interesting" isoforms). The 298 "interesting" genes correspond to 914 isoform variants.

Analysis
We fit a battery of penalized models to the merged and filtered dataset. Two overall sets of models were fit to the data. Both sets of models were fit with the highlevel clinical covariates. The first set of models included the genetic covariates determined by the isoforms with significant univariate association with overall survival after controlling for a false discovery rate of 0.20. The second set included the genetic covariates based on gene-level aggregates with significant associations with overall survival after controlling for the same false discovery rate. In all cases, the log base2 transformation of the CPM was used.

Cross-Validated Loss
We performed a grid search to assess the impact of α and λ on the fit of the data. α is the model parameter which determines the proportion of the L 1 and L 2 norms in the penalty term of the elastic net, where α = 0 corresponds to the LASSO and α = 1 corresponds to ridge regression. λ is the model parameter which determines the magnitude of the penalty term, where λ = 0 corresponds to an unpenalized model.
Quality of fit was determined using the cross-validated loss procedure of  as implemented in the package "glmnet" . The goodness-of-fit for a model is given by splitting the dataset into a series of "folds", and estimating the model parameters for each partition of the dataset which excludes a "fold". The difference between the partiallikelihood of the estimated parameters on the full dataset and the partial-likelihood of the estimated parameters on the partition is taken, and the sum of the differences is the cross-validated loss. The values of λ which minimize the cross-validated loss are considered to "best" fit the data.
However, if the number of folds is less than the number of observations, the partitioning of the data is not necessarily unique and the cross validation procedure should be repeated to assess the impact of this variability. We opted to eliminate this variability by finding the cross-validated loss using leave-one-out cross validation (LOOC) where partition contained all but one observations. This guarantees that each partition is unique and that the estimated model parameters found will not vary.

Concordance
We computed the c-statistics using the procedure implemented in the package "survival"  for the subset of the models in our cross validated grid search which had the "best" values of λ for a given α parameter.
We also compared isoform and gene-aggregate models using the variant of the c-statistic implemented in the package "survC1" . We provided the risk scores for the cross validated models specified above as univariate predictors and computed the concordance. We then tested the difference in concordance for statistical significance between the model corresponding to the gene-aggregate covariates and the model corresponding to their constituent isoform covariates, given a fixed value of α and each model's "best" value of λ according to the cross-validation procedure. This procedure was run for both sets of covariates, those determined by isoform-level filtering as well as those determined by gene-level filtering.

Cross-validated Model Fitting
We present the results of the cross-validated loss procedure and discuss the fits and number of nonzero coefficient estimates for isoform and gene-aggregate models.
Cross-validated loss per observation,ĈV i , as described previously (2.3.1), is defined mathematically as a function of λ and α: where is the partial-likelihood function of the complete dataset, −i is the partial-likelihood function of the dataset excluding the i th observation, andβ −i are the coefficient estimates for the dataset excluding the i th observation. We used leaveone-out cross-validated (LOOC) loss, the mean overĈV i for all i observations, to compare isoform and gene-aggregate models. We sometimes refer to LOOC loss as the overall cross-validated loss.
The values of overall cross-validated loss are show in Tables 9 and 10 for each value of α in our grid search and the corresponding value of λ which minimized the loss.  Table 9. Summary of cross-validated loss in models fit to the covariates determined by univariate filtering at the isoform level. For each given level of α, the model with the lowest value of mean cross-validated loss ("cvm") over all possible values of λ is shown. Models run on the 332 constituent isoforms are labeled as "iso" type, and those run on the 76 gene-level aggregates are labeled as "gene" type. The standard deviation ("cvsd") and number of non-zero model coefficient estimates ("nzero") are displayed. For comparison purposes, we've added the number and proportion of the 76 genes represented by at least one non-zero isoform or aggregate coefficient estimate in the "gzero" and "percnzero" columns.

Isoform-based Models
The cross validated loss ("cvm" column) for all of the 11 penalized models is lower for models fit on the "iso" type of covariates and the lowest loss occurs when using the ridge estimator (α = 0). The ridge estimator does not perform variable selection and thus produces non-zero coefficient estimates for all 332 isoform covariates or all 76 gene-aggregate covariates (as well as the 6 clinical variables) as shown in the "nzero" column. As more weight is placed on the L 1 norm in the penalty term, the number of variables in the model drops to 69 and 33 respectively when using the LASSO estimator (α = 1). We've added a column representing the number of the 76 genes which are represented by at least one non-zero coefficient estimate ("gzero") to facilitate comparison between the variable selections in both types of models. The models fit on the "iso" type of covariates always represent variants of the same or more unique genes than those fit on the "gene" type of covariates and even the LASSO estimator produces a model with over half of the genes represented. The proportion of genes with non-zero coefficient estimates is shown in the "percnzero" column.
The grid search over λ and α on these covariates is denoted in Figures 12 and 13.
The models which minimize the loss tend to have a number of null or zero coefficient estimates (with the exception of the ridge estimator model), even for low values of α.
As α decreases, the value of λ for the model with the lowest loss increases. The relationship of cross validated loss ("cvm" column) for all of the 11 penalized models is reversed from the previous section; for any given value of α, the loss is lower for models fit on the "gene" type of covariates. As before, the lowest loss occurs when using the ridge estimator (α = 0). The number of covariates removed as α increases Figure 12. Partial Likelihood Loss as a function of log(λ) on the covariates determined by univariate filtering at the isoform level.

Gene-based Models
The top plot refers to models fit with the "iso" covariates (332 isoforms) while the bottom plot refers to those fit with the "gene" covariates (76 gene-level aggregates). α values of 0 and 1 are the ridge and LASSO estimator, respectively. Figure 13. Models with partial likelihood losses of less than 15, as a function of log(λ) and α on the covariates determined by univariate filtering at the isoform level. The top plot refers to models fit with the "iso" covariates (332 isoforms) while the bottom plot refers to those fit with the "gene" covariates (76 gene-level aggregates).  Table 10. Summary of cross-validated loss in models fit to the covariates determined by univariate filtering at the gene-aggregate level. For each given level of α, the model with the lowest value of mean cross-validated loss ("cvm") over all possible values of λ is shown. Models run on the 914 constituent isoforms are labeled as "iso" type, and those run on the 298 gene-level aggregates are labeled as "gene" type. The standard deviation ("cvsd") and number of non-zero model coefficient estimates ("nzero") are displayed. For comparison purposes, we've added the number and proportion of the 298 genes represented by at least one non-zero isoform or aggregate coefficient estimate in the "gzero" and "percnzero" columns. and more weight is placed on the L 1 norm is also much greater than in the previous section, though this is likely an artifact of the larger number of covariates overall.
The column representing the number of the 298 genes which are represented by at least one non-zero coefficient estimate ("gzero") shows that, as in the previous section, models fit on the "iso" type of covariates always represent the same or more unique genes than those fit on the "gene" type of covariates.
The grid search over λ and α on these covariates is denoted in Figures 14 and 15.

Comparison
The relationship between the models fit on the isoform-filtering covariates and the gene-filtering covariates is shown in Figure 16. The reversed relationship between "type" and "cvm" is visible. The cross-validated loss is lowest for models fit on isoform covariates determined by isoform-level univariate filtering, but is highest for models fit on isoform covariates determined by gene-level univariate filtering.

Concordance
We present high level results of concordance measures for the models presented in the previous section. Concordance was computed on the observations of the models fit above. Rationale and shortcomings of this approach are treated in the discussion.

C-Statistics
The c-statistic, as defined previously in 1.4.2, of the linear predictor,βX i , for all the models in Tables 9 and 10, is given in Tables 11 and 12.
The relationship between the "iso" and "gene" models remains consistent between both sets of covariates. For both sets of covariates, "iso" type models have a higher point estimate of concordance than "gene" type models. The most concordant models are often those fit using the ridge estimator. Both sets of covariates show Figure 14. Partial Likelihood Loss as a function of log(λ) on the covariates determined by univariate filtering at the gene-aggregate level. The top plot refers to models fit with the "iso" covariates (914 isoforms) while the bottom plot refers to those fit with the "gene" covariates (298 gene-level aggregates). α values of 0 and 1 are the ridge and LASSO estimator, respectively. Figure 15. Models with partial likelihood losses of less than 15, as a function of log(λ) and α on the covariates determined by univariate filtering at the gene-aggregate level. The top plot refers to models fit with the "iso" covariates (914 isoforms) while the bottom plot refers to those fit with the "gene" covariates (298 gene-level aggregates).
α values of 0 and 1 are the ridge and LASSO estimator, respectively. The size of the points represents the number of genes with non-zero coefficients out of the 298 total.
Nearly null models (with no genetic covariates) can be seen as λ increases and the number of non-zero coefficients plummets. Figure 16. Models from Tables 9 and 10 plotted on the same set of axes, α versus "cvm". The "type" column is denoted by shape, and  Table 11. C-statistics for each model from Table 9. These models were fit to the set of covariates determined by univariate filtering at the isoform-level. "iso" models were fit to the set of 332 constituent isoforms and "gene" models were fit to 76 gene-level aggregates.

Uno's C
We used Uno's C estimator to re-estimate concordance and obtain an estimate of the standard error. Uno's C (along with standard errors) are shown graphically against the c-statistics in Figure 17. Since the sampling distribution of Uno's C is asymptotically normal, we constructed confidence intervals for the difference in concordance between "iso" and "gene" type models and present the results in Tables 13 and 14. The differences are also shown graphically in Figure 18.  Table 13. Difference in Uno's C between each "gene" and "iso" pair from Table 11. The difference is given in the "concordance" column. A negative difference indicates Uno's C is higher for the "iso" level model. The standard error of the difference and a 95% confidence interval based on the asymptotic distribution of the difference are reported as well.  Table 14. Difference in Uno's C between each "gene" and "iso" pair from Table 12.
The difference is given in the "concordance" column. A negative difference indicates Uno's C is higher for the "iso" level model. The standard error of the difference and a 95% confidence interval based on the asymptotic distribution of the difference are reported as well.
Figure 17. C-Statistics and Uno's C for models from Tables 9 and 10 plotted on the same set of axes, α versus "concordance".
The "type" column is denoted by shape, and Table 9's models appear in blue, while Table 10's models are plotted in red. The estimated 95% confidence interval for each value of Uno's C, based on its asymptotic normality, is given by the line surrounding each point. Although the values differ between the estimators, the point estimates of the c-statistics lie within the estimated confidence intervals of Uno's C. The points are slightly jittered to show overlap in confidence intervals. Figure 18. Difference in Uno's C for models from Tables 13 and 14 plotted on the same set of axes, α versus difference in "concordance". Table 13's models appear in blue, while Table 14's models are plotted in red. The estimated 95% confidence interval for each difference is given by the line surrounding each point. Standard error and confidence intervals were estimated by using 1000 iterations of Uno's perturbation-resampling technique. Intervals which contain 0 are denoted by shape. Although the models fit on the isoform-level filtering covariate set have a larger difference in concordance between isoform and gene-aggregate models, the standard error is greater. The points are slightly jittered to show overlap.

Final Models
We have two types of models we wish to compare and two sets of covariates they were fit on. We refer to these four categories by shorthand, where the first designation gives the pre-selection covariate group and the second gives the granularity of the model covariates. Iso-iso models are fit on the covariates pre-selected by isoformlevel filtering and genomic covariates are represented by isoform log2 CPM ("iso" type models). Iso-gene models are fit on the set of gene-aggregated covariates ("gene" type models) pre-selected by isoform-level filtering. Gene-iso and gene-gene models are similar but fit on the set of covariates pre-selected by gene-level filtering.
We are particularly interested in iso-iso models and gene-gene models. Genegene models represent the pre-selection and model fitting possible before nextgeneration sequencing techniques, while iso-iso models represent the finer grained pre-selection and fitting possible with RSEM.
We found that fixing α = 0 and searching over λ minimized the cross-validated loss for both of these model categories. The resulting estimated models are equivalent to those found using the ridge estimator. We focus our discussion of concordance on these ridge models for the iso-iso and gene-gene categories. We refer to these ridge models as the iso-iso and gene-gene models.
The c-statistic for the iso-iso model is higher than the gene-gene model. This suggests that models created and fit on the finer grained covariates have increased predictive power. We used the asymptotic normality of Uno's C estimator to construct a 95% confidence interval for the difference in concordance between the iso-iso and gene-gene models. The difference, presented in  Table 15. Estimates of Uno's C for the iso-iso and gene-gene models, along with 95% confidence intervals. The 95% confidence interval for the difference in estimates, based on the asymptotic normality of the estimator, suggests the difference in estimates is statistically significant.
ified by median risk score in Figures 19 and 20. The curves are estimated using the same observations used to fit the models. As is the case with the concordance estimates, the curves likely overestimate model performance. However, we assume the difference between the curves is still indicative of the true difference between models. Test statistics from log-rank tests on the stratifications are given in Table 16. Model Iso-Iso 198 Gene-Gene 142 Table 16. Test statistics for log-rank tests for the final ridge models. Tests were on the association between survival outcome and whether the patient's risk score was below the median risk but used the same set of observations the models were fit on. However, there is a large difference in the χ 2 values between the two models.

Discussion
We present a short summary of the work performed and discuss the final models.
Shortcomings in our methodology and further work are outlined.

Summary
Our intention is to investigate whether isoform-level expression information improves cancer survival predictions as compared to overall gene expression.
Our investigation is focused on women with breast cancer from the TCGA's BRCA datasets. One dataset contained high level clinical covariates and survival time for 1097 patients. Another contained RSEM estimated isoform counts sequenced from  1212 tissue samples, of which 1100 belong to breast tumors. We found that patient age, cancer stage, and the presence of radiation therapy were clinical covariates strongly associated with survival outcome.
The original RSEM data contained estimated counts for 73,559 isoforms per sample. However, many of these estimated counts were always 0 or exhibited extremely low variance. We took the sample variance of each isoform and ignored isoforms with a sample variance less than or equal to 1.
We normalized the remaining RSEM data between cancerous tissue samples by estimating and correcting for the "sequencing depth". We then added one to all corrected counts and converted them to CPM, or counts per million. Isoforms without a known gene correspondence were ignored, leaving 63,213 isoform (or 19,330 genelevel aggregate) counts for each sample.
Merging the two datasets yielded 967 observations of women breast cancer patients. 109 women died under observation. We performed variable pre-selection at the isoform and the gene-aggregate levels yielding two sets of covariates ("isoformlevel filtering" and "gene-level filtering"). Pre-selection involved fitting a univariate Cox model for each covariate and collecting the p-value associated with a likelihood ratio test of the model. The set of p-values were corrected to control for the false discovery rate, and we selected the covariates which had a corrected p-value below 0.20.
Our models are fit on either the log2 transform of CPM ("iso" type models) or the log2 transform of CPM aggregates ("gene" type models), along with the clinical covariates. We performed a grid search to minimize cross-validated loss over the free parameters, α and λ, of the elastic net penalized Cox model. When α = 0 the penalty term reduces to the ridge penalty, and when α = 1 the model fits the LASSO.
We found that the ridge models minimized the cross-validated loss for our model categories of interest, "iso" models fit to covariates determined by "isoform-level filtering" and "gene" models fit to covariates determined "gene-level filtering". We computed estimates of concordance for the ridge models that minimized cross-validated loss and tested the difference in concordance for significance. We also estimated Kaplan-Meier survival curves for each model, stratifying observations by relationship to the median risk score.

Investigation of Final Models
Our results suggest there is a difference in the predictive power between our final models. The concordance estimate for the iso-iso model is higher than for the genegene model. When we created a 95% confidence interval for the difference using the asymptotic normality of Uno's C, we found the interval did not contain 0 and that the difference was statistically significant.
Ridge models do not perform variable selection and estimate non-zero effects for all covariates. This suggests the difference in concordances is partially due to the difference between the univariate filtering procedures. The filtering procedures selected two sets of genes with non-trivial differences. 43 common genes were selected by both procedures. This leaves 33 different genes out of the 76 selected by only the isoform-level filtering, and 255 different genes out of the 298 selected by only the gene-level filtering procedure.
It is of particular interest that the iso-iso model had a higher concordance estimate than the gene-gene model. The isoform filtering showed a higher proportion of "unimportant" isoforms. Additionally, isoform counts exhibit greater variability than genes, since the latter are aggregate counts. The results suggest that despite this increase in noise and variability, isoform level information may provide meaningful advantages over gene-level information.
Furthermore, the concordance estimates suggest that both of these models have increased predictive performance compared to models fit on clinical covariates alone. Repeating our methodology on just the clinical covariates yielded a ridge model which minimized the cross-validated loss across the grid search of α and λ.
The estimated c-statistic of this model was 0.79 and the value of Uno's C was 0.717, with a 95% confidence interval of (0.649,0.785). Both final models with genomic covariates had statistically significant higher estimates of concordance, validating our assumption that genomic information has a non-trivial association with breast cancer survival.

Shortcomings
We used the same observations to fit each model and then to compute the concordance measurements. An ideal methodology would have computed the concordance on a separate set of observations. However, we would have to create this set by partitioning our data and this would have caused some issues due to the low number of events within the dataset. Our dataset had 109 observations with a known failure time. Reducing this number would introduce non-trivial additional variation in the estimated models dependent on the choice of partitioning. Similarly, concordance is estimated using the relationship between observations with at least one known failure time. Again, the choice of partitioning would introduce non-trivial variability into the estimations of concordance. We could account for this variability by systematically estimating models and concordances for all possible partitions but were unable to perform this due to computing constraints.
We rationalize our results due to our focus on model comparisons. The use of the same observations to measure model performance means we overestimated the performance. However, we aren't interested in the actual model performance but the differences between models. We compared models using the predicted risk scores of each model on the same set of observations as would be done under a partitioning scheme. We assume that though each model's performance is overestimated, the relative difference is still a valid estimate of the true difference in predictive power.
Another important caveat with this work stems from the assumption that the underlying RSEM count values are uncorrelated between observations and are free of systematic error. Our data (and other available TCGA datasets) are the result of rolling collection by a number of participating centers. Recent work has shown that inadvertent biases introduced per center or batch of samples are not accounted for by RSEM and can introduce false positives in work that relies on RSEM counts (Love et al., 2015).

Future Work
Much of the work done is related to the preprocessing, cleaning, and filtering of the data. A thorough investigation into the additional predictive power of isoformlevel covariates on cancer survival should include additional TCGA datasets following the same procedure. A comparison of our results with a replication on non-TCGA breast cancer survival data could also yield interesting findings.
Alternatively, testing the discriminative power of concordance using a known data-generating process would aid in the interpretation of results on real world data.
This could also be used to validate our preprocessing and filtering techniques.