MODELING THE PROBABILITY OF MORTGAGE DEFAULT VIA LOGISTIC REGRESSION AND SURVIVAL ANALYSIS

The goal of this thesis is to model and predict the probability of default (PD) for a mortgage portfolio. To achieve this goal, logistic regression and survival analysis methods are applied to a large dataset of mortgage portfolios recorded by one of the national banks. While logistic regression has been commonly used for modeling PD in the banking industry, survival analysis has not been explored extensively in this area. Here, survival analysis is offered as a competitive alternative to logistic regression. The results of the final modeling for both methods show very similar fit in terms of the ROC, with the survival model performing slightly better than logistic regression in the training dataset and almost the same in the testing dataset. In terms of prediction of defaulted and non-defaulted mortgage portfolios, the logistic regression model outperforms survival analysis in the training dataset, while the survival model outperforms logistic regression in the testing dataset. Overall, the results support the claim that the survival analysis approach is competitive with the logistic regression approach traditionally used in the banking industry. In addition, the survival methodology offers a number of advantages useful for both credit risk management and capital management.

Credit risk affects virtually every financial contract. Therefore, the measurement, pricing, and management of credit risk have received much attention from financial economists, bank supervisors and regulators, and financial market practitioners. Profits realized on loan products, such as credit cards and mortgage loans, depend heavily on whether customers pay interest regularly or miss payments and default on their loans. The latter constitutes credit risk, which is the dominant source of risk for banks.
The key focus of credit risk management is to predict whether a customer will default on her mortgage loan in the future, that is, to evaluate the probability of default (PD). The PD can be estimated based on the customer's credit bureau data, such as past credit activity, her application data, and her payment behavior on the loans on the book. A lower predicted probability of default indicates better creditworthiness. At loan origination, a bank generally sets a cut-off threshold and approves credit for those customers whose predicted PD is less than the pre-defined threshold. For ongoing credit risk management, the predicted probability is combined with other risk factors to determine the allowance for loan and lease losses (ALLL), which in turn is used to cover the losses when loans default. The PD is important not only for effective risk and capital management, but also for the pricing of credit assets: bonds, loans, and more sophisticated instruments such as derivatives.

The goal of this thesis is to predict the PD for a mortgage portfolio. A mortgage portfolio consists of all mortgage loans on a bank's book, and a mortgage loan is a loan secured by real property through the use of a mortgage note, which serves as evidence of the loan's existence. A mortgage loan has a risk-based interest rate and is scheduled to amortize over a set period of time (called the term), typically 15 or 30 years.
All types of real property can be, and usually are, secured with a mortgage and bear an interest rate that is supposed to reflect the lender's risk. The lender's risk is based on the predicted PD and other risk parameters.
In order to predict PD, one needs to define the dependent variable indicating whether the mortgage loan defaults or not. The criterion that determines whether a loan defaults varies with the product and the regulations. In what follows, a mortgage loan is flagged as defaulted whenever one of the following conditions appears in the account's monthly data (a minimal coding sketch follows the list): 1) The payment is 180 or more days past due.
2) There is a charge-off or a partial charge-off event for this account.
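To make this definition concrete, the sketch below flags monthly account records using the two conditions; it is a minimal Python illustration, and the column names (days_past_due, charge_off_amount) are hypothetical stand-ins, since the thesis does not show the bank's actual field names.

```python
import pandas as pd

# Hypothetical monthly account-level records; column names are illustrative.
monthly = pd.DataFrame({
    "account_id":        [1, 1, 1, 2, 2],
    "days_past_due":     [0, 95, 182, 0, 0],
    "charge_off_amount": [0.0, 0.0, 0.0, 0.0, 1500.0],
})

# Condition 1: the payment is 180 or more days past due.
# Condition 2: a charge-off or partial charge-off event is recorded.
monthly["default_flag"] = (
    (monthly["days_past_due"] >= 180) | (monthly["charge_off_amount"] > 0)
).astype(int)

# A loan is treated as defaulted if it is flagged in any month.
defaulted = monthly.groupby("account_id")["default_flag"].max()
print(defaulted)
```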
The bank maintains records of when each payment is due for every loan on the book. The due date information is used to populate the due date on a customer's mortgage bill or credit card bill. If a payment is delayed, the system starts counting the days that accumulate before the payment is recorded on the book, known as the days past due (DPD). The bank has a monitoring system for loans with past-due status, and different banks may have different response systems. For example, if a customer misses only one payment, it could just trigger the warning process, as the customer might be on vacation and may have forgotten to mail in the payment. In this situation, the bank may send a reminder to the customer. If the customer responds to the reminder and pays in the following month, the number of days past due goes back to zero. However, if the customer keeps delaying the payment, the number of days past due keeps accumulating, and once it exceeds a certain threshold (such as 90 or 120 days), the bank evaluates the loan and decides whether any impairment is needed. The bank may request an appraisal of the underlying property and, in the meantime, send letters to let the customer know that the property will be taken by the bank if payment is still not received within a certain period.

In some situations, a customer may face temporary financial hardship, such as losing a job or having a large medical bill to pay; the bank may then choose to work with the customer to reduce the monthly payment, either by extending the loan term or even by taking a partial charge-off to further cut the bill. A charge-off means that the bank covers the loan from its ALLL (the reserve for loan losses); a partial charge-off means that the bank covers part of the loan. This is one of the strategies for resolving a defaulted loan. In other situations, if the customer decides not to pay at all or there is no way the customer can keep up the payments even after a reduction, the bank starts the foreclosure process to recover the loan from the sale of the property.
Logistic regression has found wide acceptance as a model for the dependence of a binary response variable on a vector of explanatory variables (Strauss, 1992). It has been the most commonly used method for predicting PD (Stepanova and Thomas, 2002), and many methodologies have been investigated (Altman, 2010; Gurný, 2009; Gurný, 2010). Survival analysis is one of the alternatives to logistic regression that has recently been explored with application to different portfolios (Stepanova, 2000; Allen and Rose, 2006; Im et al., 2012). Originally, the methods of survival analysis were developed and intensively applied in medical fields, specifically in life-and-death clinical trials. Recently, some banks have started exploring the application of survival analysis to predicting PD. Looking at a mortgage loan from a life cycle view, one can represent the time to mortgage default as a time to event (similarly to the time to death in a clinical trial) and model this time using survival analysis methods. In this thesis, I apply both logistic regression and survival analysis methods to a large dataset of mortgage portfolios and compare the results in terms of prediction and interpretation.

Risk Profile Review
Many factors impact default rates, such as the FICO score, the loan to value (LTV) ratio, the month on book, etc. In what follows, I discuss the key factors in more detail and explain how these factors are tested using the mortgage portfolio data in later sections.
Industry (Mester, 1997; Brown et al., 2010) and academic research (Altman and Saunders, 1997; Avery et al., 2003) suggest that the mortgage default rate depends on FICO scores. A FICO score is a credit score developed by FICO, a company that specializes in what is known as "predictive analytics," meaning it takes information and analyzes it to predict what is likely to happen. The FICO score is the best-known and most widely used credit score model in the United States. It is used in about 90% of consumer-lending decisions, according to the financial-services research firm CEB TowerGroup (Andriotis, 2015). Using mathematical models, the FICO score takes into account various factors in each of five areas to determine credit risk: payment history, current level of indebtedness, length of credit history, types of credit used, and new credit. FICO is not a credit reporting agency; to create credit scores, it takes information provided by one of the three major credit reporting agencies - Equifax, Experian, or TransUnion. Both the Federal Home Loan Mortgage Corporation (Freddie Mac) and the Federal National Mortgage Association (Fannie Mae) have encouraged mortgage lenders to rely on credit scoring in order to increase consistency across underwriters (Mester, 1997).
While assessing credit risk, reliance on the credit score alone is considered insufficient. Even before the mortgage meltdown, industry experts began to worry about the possibility of not fully capturing the credit risk embedded in mortgages.
Reasons for concern before sub-prime mortgages began to default included rising loan to value (LTV) ratios and a decreasing dependence on documentation of a borrower's assets, employment, and income (OCC, 2005). The LTV is calculated as the loan amount divided by the underlying property value. It is one of the key factors banks check and monitor from a credit risk perspective. Because the property is used as collateral, if the loan defaults, the bank can take the property and sell it to recover the loss. Therefore, when the property value is higher than the loan amount, the borrower has less motivation to default: the customer can decide to sell the property and pay off the loan, keeping the extra money, if she cannot keep up the monthly payments. Since it costs time and money to sell a house, a bank normally requires a 20% cushion at mortgage loan origination. This is why most banks require a 20% down payment when a customer applies for a mortgage loan. Such a loan is called a prime loan. If the customer cannot make the 20% down payment, a subprime loan may be originated instead. Subprime loans normally carry a much higher interest rate than prime loans.
While the Equal Credit Opportunity Act (implemented by the Federal Reserve Board's Regulation B, also called fair lending) prohibits creditors from discriminating in any way during a credit transaction because of an applicant's demographic characteristics, such as race, religion, national origin, gender, marital status, or age, empirical research has shown that these factors do have predictive power for credit risk. A basic breakdown of borrowers into sub-prime and prime mortgages reveals some significant demographic distinctions. Sub-prime borrowers are disproportionately minorities, have less income, are older, and have fewer years of education and significantly less financial sophistication (Lax, 2004). These demographic variables correlate quite well with FICO scores and LTV ratios, as borrowers in the sub-prime segment have both lower FICO scores and higher LTV ratios than borrowers in the prime segment (Banasik et al., 1996).
In order to calculate the LTV, the bank needs both the loan amount and the collateral value. The loan amount is easily captured on the book. In terms of the collateral value, there are multiple ways to obtain the house value. The most accurate way is to order a formal appraisal, which costs about $350-$500 for a single-family house.

Another way is to update the house value based on a house price index, as the property value is heavily impacted by the market, and a house price index is a good indicator of the housing market in different locations. If the housing market is going up, the house value will go up as well, because the house can be sold at a higher price.

The key risk drivers within a mortgage portfolio can best be analyzed by examining the relationships among the variables just described. In addition to these factors, credit risk can depend on macroeconomic variables. In economic downturns, default probabilities increase and risk ratings deteriorate. The macroeconomic factors considered in this project include the unemployment rate and the Case-Shiller (CS) house price index, sourced primarily from the U.S. federal government and Moody's economy.com.

Methodologies
Traditional risk assessment methods include discriminant analysis (DA) and logistic regression. Altman (1968) built a famous multivariate warning model, the Z-model, using multivariate discriminant analysis. Ohlson (1980) was the first to use the logistic regression model to predict financial risks. Wiginton (1980) was among the first to apply both a logistic regression model and discriminant analysis to credit rating and compare the two methods. Wiginton showed that the logistic regression model performed better than discriminant analysis in terms of the proportion of individuals who were correctly classified. However, according to his findings, even logistic regression failed to make a high enough proportion of correct classifications to warrant the use of his model for unaided decision-making.
Later, Tang (2002) tested the accuracy of the logistic regression model by sampling 5 listed companies with good financial conditions and 5 companies with bad conditions from the Shanghai and Shenzhen securities markets and found that logistic regression could distinguish the companies in good condition from those in bad condition. Logistic regression has since become the main approach to the classification step in credit scoring and the most commonly used approach in credit risk management.
Numerous other statistical methods that attempt to fit more complex models with higher degrees of nonlinearity between the predictors and the response, such as support vector machines, neural networks, and Bayesian network classifiers, have also been investigated for credit scoring (Im et al., 2012). The results do not consistently show that one method is better than the others. For example, Desai et al. (1996) found that neural networks performed significantly better than linear discriminant analysis for predicting 'bad' loans, whereas Yobas et al. (2000) reported that the latter outperformed the former. Furthermore, most of these studies evaluated only a limited number of classification techniques on one particular credit scoring dataset. Hand (2006) argued that potential performance improvements attainable using more complex models were often offset by other sources of uncertainty that were exacerbated by the added complexity. In addition, in the real business world, the choice of the best methodology also takes costs and benefits into consideration. The increased complexity of these methodologies may increase the implementation cost with only a marginal benefit; that is why logistic regression has become the standard approach in the banking industry.
Survival analysis is an area of statistics that deals with the analysis of survival data. Survival data can be collected in medical or reliability studies, for example, when a deteriorating system is monitored and the time until an event of interest is recorded. Credit risk data are very similar to survival data: the time until a loan defaults can be viewed as the time until the event of interest (e.g., death) in survival data. In this interpretation, survival analysis can serve as a useful statistical tool for credit risk management. The idea of employing survival analysis for building credit-scoring models was first introduced by Narain (1992) and then developed further by Thomas et al. (1999). Narain (1992) applied the accelerated life exponential model to 24 months of loan data. The author showed that the proposed model estimated the number of failures at each failure time well. Then a scorecard was built using multiple regression, and it was shown that a better credit-granting decision could be made if the score was supported by the estimated survival times. Thus, Narain (1992) concluded that survival analysis could add a new dimension to the standard approach. However, the author did not make any comparison with alternative methods.
However, survival analysis has not been thoroughly investigated and applied in the industry. The main purpose of this thesis is to apply both logistic regression and survival analysis methods to a large dataset of mortgage portfolios and to compare the two methods in terms of data fit and prediction power. The long-term goal of this thesis is to learn both methodologies and their respective advantages and to be able to apply them effectively in my actual work.
The rest of this thesis is organized as follows. Chapter 2 introduces the basic concepts and the literature review on the methods used in this thesis. Chapter 3 describes the initial data analysis. Chapter 4 discusses the model results from both logistic regression and survival analysis methods and compares the model performances. Chapter 5 provides the final comments as well as the potential broad impact.

Logistic Regression
Logistic regression is a generalized linear model technique that allows one to predict discrete outcomes. The response variable in logistic regression is a Bernoulli variable that takes the value 1 with probability of success π, or the value 0 with probability of failure 1 − π. For credit risk analysis, let us define a random variable D that takes values 1 and 0, where D = 1 means the loan is in default and D = 0 means it is not. Then the probability of default is defined as the probability of success for the random variable D, that is, π = P(D = 1). Although not as common and not discussed in this thesis, logistic regression can be extended to cases where the response variable has more than two categories, known as multinomial regression.
In logistic regression, the relationship between the response and the independent variables is described by the logit transformation of π:

logit(π) = log[π / (1 − π)] = β0 + β1·x1 + ... + βk·xk,

where x1, ..., xk are the explanatory variables and β0, ..., βk are coefficients estimated by maximum likelihood.

Baesens et al. (2003) benchmarked a range of classification techniques on eight real-life credit scoring datasets, evaluating them by the area under the ROC curve (AUC) and by the percentage of correctly classified (PCC) observations, which measures the proportion of correctly classified cases. Based on the eight datasets, the results indicated that different modeling techniques performed differently on different datasets. For example, the authors found that the linear SVM had the best performance on the Australian portfolio, while the NN worked best on the German portfolio, in terms of both PCC and AUC. In general, the best average rank was attributed to the NN classifier. However, the simpler linear classification techniques, such as LDA and LR, also performed very well, in the majority of cases not statistically differently from the SVM and NN classifiers. Based on this research, Baesens et al. (2003) concluded that the more complex models generally performed quite similarly to logistic regression in terms of predicting the probability of default.
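To illustrate fitting the logit model defined above in practice, here is a minimal sketch that estimates a PD logistic regression with Python's statsmodels on simulated loan-level data; the covariates, coefficients, and column names are invented for the example and are not the thesis's estimates (the thesis itself used SAS).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000

# Simulated loan-level data standing in for the bank's portfolio.
loans = pd.DataFrame({
    "fico":          rng.normal(720, 50, n) / 100,   # raw FICO score / 100
    "current_ltv":   rng.uniform(0.3, 1.1, n),
    "month_on_book": rng.integers(1, 120, n),
})
true_logit = -4.0 - 1.2 * (loans["fico"] - 7.2) + 2.5 * (loans["current_ltv"] - 0.7)
loans["default"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Fit logit(pi) = b0 + b1*FICO + b2*LTV + b3*month_on_book by maximum likelihood.
X = sm.add_constant(loans[["fico", "current_ltv", "month_on_book"]])
fit = sm.Logit(loans["default"], X).fit()
print(fit.summary())

# Predicted probability of default for each loan.
pd_hat = fit.predict(X)
print(pd_hat.head())
```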

Survival Analysis
Survival analysis is one of the alternative approaches to logistic regression that have not been extensively explored; selected studies include Thomas et al. (1999), Stepanova et al. (2002), and Im et al. (2012). Survival analysis is generally defined as a set of methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest. In survival analysis, subjects are usually followed over a specified time period, and the focus is on the time at which the event of interest occurs.
In this thesis, the event of interest is the default of a mortgage loan.
The time to event can be measured, for instance, in days, weeks, months, or years.
In this thesis, the time to default is recorded in months. Let T denote the time to default of a mortgage loan, let f(t) be the probability density function (pdf), and let F(t) be the cumulative distribution function (cdf), that is, the probability that the default time is less than or equal to any value t, F(t) = Pr(T <= t). Then the survival function is defined by the following equation:

S(t) = Pr(T > t) = 1 − F(t).

For continuous survival data, the hazard function is a more popular characteristic than the pdf for describing the distribution. The hazard function is defined as the limit

h(t) = lim_{Δt → 0} Pr(t <= T < t + Δt | T >= t) / Δt,

which represents the instantaneous risk that the event occurs at time t.
Specifically, the survival probability S(t) is the probability that a loan will survive (remain non-defaulted) beyond time t. To apply survival analysis in consumer credit modeling, we suppose that one or more further measurements are available for each individual, so that we have a vector of covariates X, e.g., characteristics such as the current FICO score, the current loan to value, etc. In order to assess the relationship between the distribution of the default time and these covariates, Cox (1972) proposed the following model:

h(t | X) = h0(t) exp(β'X),     (2.1)

where β is a vector of unknown parameters and h0(t) is an unknown function giving the hazard for the standard set of conditions, when X = 0. It is called the proportional hazards (PH) model because the assumption is that the hazard of an individual with characteristics X is proportional to the unknown baseline hazard h0(t). The vector of coefficients β is estimated using maximum likelihood.
PH models assume that the hazard functions are continuous. However, credit performance data are usually recorded only monthly, so several defaults at one time can be observed. These are tied default times, and the likelihood function must be modified because it is now unclear which individuals to include in the risk set at each default time. The exact likelihood function has to include all possible orderings of tied defaults (Kalbfleisch and Prentice, 1980), and hence is computationally very difficult. A number of approximations have been developed. One of these is achieved by replacing equation (2.1) with a discrete logistic model (Cox, 1972):

h(t_j | X) / [1 − h(t_j | X)] = {h0(t_j) / [1 − h0(t_j)]} exp(β'X),

where h(t_j | X) is the conditional probability of default at time t_j given survival up to t_j. Then, similarly to logistic regression, a logit link function can be used:

log{ h(t_j | X) / [1 − h(t_j | X)] } = α_j + β'X,

where α_j = log{ h0(t_j) / [1 − h0(t_j)] }, or, in terms of the credit risk variables,

log{ h(t_j | X) / [1 − h(t_j | X)] } = α_j + β1·CS + β2·FICO + β3·LTV,

where CS is the Case-Shiller house price index, FICO is the credit score, and LTV is the loan to value, calculated as the loan amount divided by the collateral value. A more detailed description of the credit risk variables can be found in Chapter 1.1: Risk Profile Review.

Thomas et al. (1999) compared the performance of exponential, Weibull, and Cox's nonparametric models with logistic regression and found that survival analysis methods were competitive with, and sometimes superior to, the traditional logistic regression approach. The paper was based on personal loan data from a major UK financial institution. The data consisted of application information for 50,000 loans accepted between June 1994 and March 1997, together with their monthly performance records for the period up to July 1997. The monthly performance indicators were used to determine whether each loan was censored or defaulted; therefore, for each loan there was a survival time. In order to compare with standard credit scoring approaches, the data were also used to develop a logistic regression model.
The analysis and results suggested that the proportional hazards models investigated on this sample were competitive with the logistic regression approach in identifying the loans that defaulted in the first year. The proportional hazards results for the second year, with fewer defaults, were not as encouraging and suggested that more sophisticated models might be appropriate. The survival analysis approach benefited more from a large sample of 'bads' than did the logistic regression approach. As noted by Thomas et al. (1999), the poor performance under the second-year criterion was also partly due to the fact that the ordering of the risk of default did not change whatever the time period.
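To make the discrete-time formulation above concrete, the following sketch builds hypothetical account-month ("person-period") records and fits the discrete logistic hazard model as an ordinary logistic regression, with the month on book entering linearly as a stand-in for the period effects α_j; all data and coefficients are simulated for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# One row per loan per month on book, with default = 1 only in the month
# the loan defaults; a loan leaves the risk set once it defaults.
rows = []
for loan in range(2000):
    fico = rng.normal(7.2, 0.5)      # raw FICO / 100
    ltv = rng.uniform(0.3, 1.1)
    for t in range(1, 61):           # follow each loan for up to 60 months
        h = 1 / (1 + np.exp(-(-6.0 + 0.02 * t - 0.8 * (fico - 7.2) + 2.0 * (ltv - 0.7))))
        d = rng.binomial(1, h)
        rows.append({"month": t, "fico": fico, "ltv": ltv, "default": d})
        if d:
            break
data = pd.DataFrame(rows)

# Discrete-time hazard: logit of the conditional default probability in month t.
fit = smf.logit("default ~ month + fico + ltv", data=data).fit()
print(fit.params)
```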

Univariate Analysis
As discussed earlier, the FICO score is a very important factor that the majority of the credit industry uses for risk management. The origination FICO score is the FICO score from the loan's application file. The origination FICO score does not change over the loan period; however, it captures the status of the customer's credit at application, which may then serve as a good indication of PD over the loan lifetime, as shown in Figure 2.
As shown in Figure 2, loans with origination FICO scores less than 660 have a significantly higher default rate than loans with origination FICO scores greater than or equal to 660. Generally, many banks have a credit policy that sets the lowest FICO score for which a loan application can be approved, but almost all banks also have an exception policy under which some loan applications that do not meet the credit requirements can still be approved. The lowest required FICO score varies among banks, ranging from 620 to 660; therefore, loans with FICO scores lower than 660 could sometimes be approved as exceptions. The FICO score is an indicator of risk at a particular point in time. It changes as new information is added and as historical information ages. For example, past credit problems impact one's credit score less as time passes. Lenders request a current score when a new credit application is submitted, so they have the most recent information available.

As shown in Figure 3, similarly to the relationship between the rate of default and the origination FICO score, as the current FICO score increases, the rate of default decreases quickly for FICO scores below 660 and then remains relatively low for FICO scores above 750. After removing the loans with current FICO scores less than 660, the rate of default shows a slightly different trend: it decreases for current FICO scores less than 748, increases for scores between 748 and 790, and then decreases again for scores higher than 790. This observation suggests modeling the PD of loans with origination/current FICO scores below and above 660 differently.

As explained in Chapter 1.1, the current LTV is calculated as the current total loan amount divided by the current property value. The LTV is another key factor that determines whether a loan can be approved. In traditional residential mortgages and home equity loans there is an 80% rule: if the mortgage's LTV is more than 80%, the loan is most likely not approved or has to go through the exception review process. Following this 80% rule, a binary dummy variable that takes the value 1 when the LTV is greater than 80% is created and included in our initial modeling.
Figure 4 also shows that when the current LTV is greater than 80%, the default rate increases dramatically, from below 0.4% to over 1%.

Figure 4: Default rate by current LTV
The property value is heavily impacted by the local housing market, which is reflected in the CS index. Based on the data, there is no clear direct trend between the default rate and the CS index. However, I observed that if the CS one-year growth rate is less than -12%, meaning house prices decreased by 12% compared with a year ago, the PD is significantly higher, as the blue line in Figure 5 shows. The CS one-year growth rate is calculated as the CS index today minus the CS index a year ago, divided by the CS index a year ago. After removing the loans with CS one-year growth less than -12%, the default rates generally decrease as the CS growth rate increases, shown as the red line in Figure 5.

The unemployment rate is the most important macroeconomic factor that many banks track. When the unemployment rate gets higher, more people lose the jobs from which many of them get their main source of mortgage payment. The mortgage data analyzed in this thesis also confirm that the higher the unemployment rate, the higher the probability of default, as shown in Figure 6, especially after the unemployment rate reaches around 6.5-7%, which creates panic among customers and then impacts the confidence index.

Month on book is another key factor in the bank's monitoring process. The default rate is very low in the first year or two for mortgage loans. After three to five years (36 to 60 months on book), the loan default rate may increase dramatically, as seen in Figure 7.
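The CS one-year growth rate and the -12% dummy defined above take only a few lines to compute; the index values below are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly Case-Shiller index series for one location.
cs = pd.Series([150.0, 151.2, 149.8, 148.5, 147.0, 145.1, 143.0,
                141.2, 139.5, 137.8, 136.0, 134.5, 131.0])

# One-year growth: (CS today - CS a year ago) / CS a year ago.
cs_growth_1y = (cs - cs.shift(12)) / cs.shift(12)

# Dummy used in the univariate analysis: 1 when growth is greater than -12%.
cs_growth_gt_minus12 = (cs_growth_1y > -0.12).astype(int)

print(cs_growth_1y.iloc[-1])          # about -0.127, i.e. a 12.7% decline
print(cs_growth_gt_minus12.iloc[-1])  # 0: this observation is in the risky segment
```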
Due to the specific history of the underlying data, the unemployment rate has been increasing along with the month on book; therefore, the default rate shows a very similar trend with respect to both variables.

Correlation Analysis
Table 3 is a matrix of correlation coefficients for each pair of the most important variables. The purpose of this correlation analysis is to find the pairs of variables that are highly correlated and would require additional caution if included in the same model. The p-values are all less than 0.05, which means that the correlations among all variables are statistically significant. However, one should not confuse statistical significance with practical importance: if the sample size is large enough, even a weak correlation can be statistically significant.
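As a small illustration of why a tiny p-value need not imply practical importance, the sketch below computes a Pearson correlation and its p-value on a large simulated sample; the two variables are stand-ins for the month on book and the current LTV, not the thesis data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n = 100_000                                       # a large sample, as in the thesis

mob = rng.integers(1, 120, n).astype(float)       # month on book
ltv = 0.6 + 0.0005 * mob + rng.normal(0, 0.2, n)  # weakly related to mob

r, p_value = pearsonr(mob, ltv)
print(f"r = {r:.5f}, p-value = {p_value:.3g}")    # tiny p-value despite a weak r
print(f"R-squared = {r**2:.4%}")                  # shared variance is still small
```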
In order to assess practical importance, one common computation is to square the correlation coefficient to get the coefficient of determination, which shows how much of the variation in one variable is associated with the variation in the other. For example, an r of 0.06273 between the current LTV and the month on book produces an R-squared of only 0.39% (0.06273 × 0.06273 = 0.0039, or 0.39%). This means that knowledge of the month on book would account for only 0.39% of the variance in the current LTV, even though the p-value for their correlation is less than 0.05. Hinkle et al. (2003) proposed a rule of thumb for interpreting the size of a correlation coefficient (see Table 2). Note that the interpretation of a correlation can also vary with the size of the data analyzed. The sample size in this thesis is large, so only correlations higher than 0.6-0.8 are considered high enough for further consideration; this corresponds to a range of 36% to 64% for the coefficient of determination. Therefore, only the origination FICO score and the current FICO score are considered highly correlated and require extra caution if both are included in the model. The correlation between the month on book and the unemployment rate is 0.55766, which means that 31% of the variance in one variable can be explained by the variance in the other; even though it is not over 0.6, we also need to be careful if both variables are to be included in the same model. The month on book and the CS one-year growth rate are in a similar situation.

Turning to the fitted coefficients, the month on book and the current LTV have positive signs, which means the default rate is higher when the month on book and the LTV are higher. The current FICO score has a negative sign, which means the default rate is lower when the FICO score is higher. The dummy for CS index one-year growth greater than -12% also has a negative sign, which means the default rate is lower for that segment compared with the segment of loans with growth less than or equal to -12%. All the variables are statistically significant and make business sense. For example, a higher FICO score means better creditworthiness for a customer, and hence a smaller probability of default.

Model Fit Statistics
A Wald test is used to test the statistical significance of each coefficient β_j in the model. The Wald test calculates a z statistic,

z = β̂_j / SE(β̂_j),

and this z value is then squared, yielding a Wald statistic with a chi-square distribution.
However, several authors have identified problems with the use of the Wald statistic. Menard (1995) noted that for large coefficients, the standard error is inflated, lowering the Wald statistic (chi-square) value. Agresti (1996) suggested that the likelihood-ratio test is more reliable than the Wald test for small sample sizes. The deviance test is therefore used instead as the statistic for the overall fit of the logistic regression model; it measures the fit of the observed values to the expected values.
The bigger the difference (or "deviance") between the observed and expected values, the poorer the fit of the model. Maximum likelihood is a way of finding the smallest possible deviance between the observed and predicted values. The deviance is usually referred to as the "negative two log likelihood" (shown as "-2 Log L" in SAS). The deviance statistic is called -2LL by Cohen et al. (2003) and D by some other authors (Hosmer and Lemeshow, 1989), and it can be thought of as a chi-square value.

Hosmer-Lemeshow Goodness of Fit Test
The Hosmer-Lemeshow test is a statistical test of the goodness of fit of the logistic regression model. The data are divided into approximately ten groups defined by increasing order of estimated risk. The observed and expected numbers of cases in each group are calculated, and a chi-squared statistic is computed as follows:

χ² = Σ_{g=1..n} (O_g − E_g)² / [ E_g (1 − E_g / N_g) ],

where O_g, E_g, and N_g are the observed events, the expected events, and the number of observations for the g-th risk decile group, and n is the number of groups. The test statistic follows a chi-squared distribution with n − 2 degrees of freedom. A large chi-squared value (with small p-value < 0.05) indicates poor fit, and a small chi-squared value (with p-value >= 0.05) indicates a good logistic regression model fit. The p-value here is 0.1359, which means the model fits well.
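A minimal implementation of this test, assuming only a vector of observed outcomes and a vector of predicted PDs (both simulated here), might look as follows.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow chi-square statistic over deciles of predicted risk."""
    df = pd.DataFrame({"y": y, "p": p_hat})
    df["decile"] = pd.qcut(df["p"], groups, labels=False, duplicates="drop")
    stat = 0.0
    for _, g in df.groupby("decile"):
        obs, exp, n_g = g["y"].sum(), g["p"].sum(), len(g)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_g))
    dof = df["decile"].nunique() - 2          # chi-square with n - 2 df
    return stat, chi2.sf(stat, dof)

# Illustration with outcomes drawn from the predicted probabilities themselves,
# so a large p-value (good calibration) is expected.
rng = np.random.default_rng(3)
p = rng.uniform(0.01, 0.20, 5000)
y = rng.binomial(1, p)
print(hosmer_lemeshow(y, p))
```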

Rank Ordering Testing
The Receiver Operating Characteristic (ROC) curve is a two-dimensional graph that visually depicts the performance and performance trade-offs of a classification model (Fawcett, 2004). ROC curves are industry-standard methods for comparing two or more scoring algorithms (Thomas et al., 2004). In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 - specificity) for different cut-off points. Each point on the ROC plot represents a sensitivity/specificity pair corresponding to a particular decision threshold.

Besides the ROC curve, another widely used way to evaluate the quality of a scorecard is the Gini coefficient. The Gini coefficient had its first application in economics, measuring the degree of inequality in income distribution, and was calculated using the Lorenz curve (Kleiber, 2007). The Gini index has since been applied in many areas (Hand, 2005; Chatterjee et al., 2007), including credit scoring, where it is often referred to as the accuracy ratio or power ratio. The Gini coefficient is used as a measure of how well a scorecard or variable is able to distinguish goods and bads. It is a rank ordering correlation coefficient and is exactly the same as the Somers' D statistic provided by SAS, which is used to determine the strength and direction of the relation between pairs of variables. Its values range from -1.0 (all pairs disagree) to 1.0 (all pairs agree). It is defined as (nc - nd) / t, where nc is the number of concordant pairs, nd is the number of discordant pairs, and t is the total number of pairs with different responses.

Another common measure of discrimination used in credit scoring is the Kolmogorov-Smirnov (KS) statistic. Traditionally, the KS statistic is used to compare an unknown, observed distribution to a known, theoretical distribution. The maximum distance between the cumulative distributions is calculated and measured against a critical value. If the distance is less than the critical value, there is a good chance that the distributions are the same.
In credit scoring, the KS statistic is often calculated as the maximum distance between the cumulative distribution of the predicted probabilities of default for the defaulted loans and that for the non-defaulted loans.
Let F_D(x) and F_N(x) be the empirical cumulative distribution functions of the default segment and the non-default segment, respectively. The KS statistic in this case is

D = sup_x | F_D(x) − F_N(x) |,

where sup is the supremum function, so D is the maximum distance between the two distributions.

The null hypothesis that the two segments are from the same population is rejected at level α if

D > c(α) · sqrt( (n + m) / (n m) ),

where n and m are the sizes of the two segments and c(α) is the critical value. For example, in this analysis, c(α) · sqrt((n + m)/(nm)) at level 0.05 (c(0.05) is 1.36) is much smaller than 0.828; therefore, the null hypothesis is rejected. In the credit world, the D value itself is more informative than the formal test. The D value ranges from 0 to 1 (or 0 to 100 in percent format); the higher the D, the better the model distinguishes the default and non-default segments, and hence the better the model performance. The D value for the model in the training dataset is 0.828.
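The two-sample KS computation described above is available off the shelf; the sketch below applies scipy's ks_2samp to simulated predicted-PD distributions for the two segments and also computes the critical value c(α) · sqrt((n + m)/(nm)). The beta-distributed scores are invented for the example.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)

# Simulated predicted PDs for the default and non-default segments.
pd_defaults = rng.beta(3, 5, 500)          # defaulted loans: higher scores
pd_non_defaults = rng.beta(1.5, 8, 9500)   # non-defaulted loans: lower scores

# D is the maximum distance between the two empirical CDFs.
result = ks_2samp(pd_defaults, pd_non_defaults)
print(f"D = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")

# Critical value at alpha = 0.05, with c(0.05) = 1.36.
n, m = len(pd_defaults), len(pd_non_defaults)
print("critical value:", 1.36 * np.sqrt((n + m) / (n * m)))
```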

Residual Analysis
The deviance residual for the i-th observation is

d_i = ± sqrt( 2 w_i [ r_i log( r_i / (n_i π̂_i) ) + (n_i − r_i) log( (n_i − r_i) / (n_i (1 − π̂_i)) ) ] ),

taking the sign of (r_i − n_i π̂_i), where r_i is the number of event responses out of n_i trials for the i-th observation; w_i is the weight of the i-th observation; π̂_i is the estimate of π_i evaluated at β̂, and q̂_i = 1 − π̂_i; π_i is the probability of an event response for the i-th observation, given by π_i = F(x_i'β), where F(·) is the inverse link function; and β̂ is the maximum likelihood estimate of β. Pregibon (1981) suggests using index plots of several diagnostic statistics to identify influential observations and to quantify their effects on various aspects of the maximum likelihood fit. In general, the distributions of these diagnostic statistics are not known, so cutoff values cannot be given for determining when the values are large. However, the plots provide displays of the diagnostic values, allowing visual inspection and comparison of the values across observations. As shown in Figure 10, the model fits the non-default segment better than the default segment. This finding is in line with business expectations, as default is a rare event and hence hard to model.

Cross Validation
For model prediction, we would like an estimation method with low bias and low variance. There are many sources of bias and variance, such as model misspecification, data scarcity, and overfitting. Cross validation is one of the testing methods used to check for bias and variance. Cross validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent dataset.
There are several types of cross validation, including leave-p-out, leave-one-out, k-fold, and repeated random sub-sampling validation, which is the method used in this thesis. One round of such cross validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (the training set) and validating the analysis on the other subset (the validation or testing set). Following the sampling selection, a random 70% of the sample is selected as the training dataset and the remaining 30% as the testing dataset in each round. To reduce variability, 1000 rounds of cross validation are performed, and the validation results are averaged over the rounds. The advantage of this method (over k-fold cross validation) is that the proportion of the training/testing split does not depend on the number of iterations (folds). The disadvantage is that some observations may never be selected into the testing subsample, whereas others may be selected more than once. A minimal sketch of this procedure appears below.
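A compact version of this repeated random sub-sampling scheme, written with scikit-learn on simulated data (the thesis itself used SAS), might look as follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=(n, 3))                     # stand-in covariates
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.0, -0.5, 0.3]) - 2.0))))

coefs, aucs = [], []
for i in range(1000):                           # 1000 rounds, as in the thesis
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                              random_state=i)  # 70/30 split
    fit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    coefs.append(fit.coef_.ravel())
    aucs.append(roc_auc_score(y_te, fit.predict_proba(X_te)[:, 1]))

coefs = np.array(coefs)
print("mean coefficients:", coefs.mean(axis=0))  # compare to Table 11
print("coefficient std:  ", coefs.std(axis=0))
print("mean test AUC:    ", np.mean(aucs))
```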
Based on the 1000 runs, the average coefficient as well as the standard deviation for every factor in the model is calculated, as listed in Table 11. We find that the coefficients for the model selected in this thesis all reside within the 95% confidence intervals. For example, the coefficient for the month on book is 0.0304, and the average coefficient for this variable is 0.0306, with a 95% confidence interval from 0.0229 to 0.0383. Figure 11 displays the distributions of the coefficients from the 1000 runs for each variable in the model. All of them are approximately normally distributed, with the means shown in Table 11. The cross validation results show that the coefficients are stable for these factors, indicating small bias and variance.

Survival Analysis
Survival analysis models the factors that influence the time to an event. Ordinary least squares estimation falls short because the residuals in survival analysis generally do not follow a normal distribution and because the model cannot handle censoring, which is very common in survival data.

Probability Density Function (pdf)
Density functions are essentially histograms composed of bins of vanishingly small width. As indicated in Figure 13, the shorter survival times, between 30 and 60 months, are more probable, indicating that the risk of loan default in this period is high.

Hazard curve
The primary focus of survival analysis is typically to model the hazard rate h(t), which has the following relationship with the pdf and S(t): h(t) = f(t) / S(t). The hazard function describes the probability of the event occurring at time t (f(t)), conditional on the subject's survival up to that time (S(t)). The hazard rate thus describes the instantaneous rate of failure at time t and ignores the accumulation of hazard up to time t.

Maximum likelihood from survival analysis
As discussed in more detail in Chapter 2, the model applied here is the proportional hazards model

h(t | X) = h0(t) exp(β'X).

The model is estimated by Proc PHreg in SAS, which implements the regression method proposed by Cox (1972); PH in Proc PHreg stands for proportional hazards. The hazard function is

h(t | X) = h0(t) exp(β'X).

The Cox regression model is called the proportional hazards model because the hazard for any individual is a fixed proportion of the hazard for any other individual. If we take the ratio of the hazards for two individuals i and j, we get

h_i(t) / h_j(t) = exp( β'(X_i − X_j) ).
We can see that h0(t) cancels out of the numerator and the denominator. Therefore, Proc PHreg estimates the coefficients of the proportional hazards model without having to specify the baseline hazard function h0(t); this is partial maximum likelihood. Based on the univariate analysis, Table 12 shows the survival model based on the same variables as in the logistic regression. The hypothesis that each coefficient is 0 is tested by the test statistics in Table 13. The p-value is less than 0.0001 for both tests, so the null hypothesis is rejected and at least one of the coefficients is nonzero. Notice that in Table 12 there is no intercept estimate, which is a characteristic feature of partial likelihood estimation. The last column, the hazard ratio, is just exp(β̂).
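For readers working outside SAS, the lifelines library offers an analogue of Proc PHreg; the sketch below fits a Cox PH model by partial likelihood on simulated loan data, with the data, column names, and coefficients all invented for illustration.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(6)
n = 3000

# Simulated loan-level survival data: time to default in months, censored
# at 84 months for loans that never default in the observation window.
df = pd.DataFrame({
    "fico": rng.normal(7.2, 0.5, n),     # raw FICO / 100
    "ltv":  rng.uniform(0.3, 1.1, n),
})
latent = rng.exponential(120, n) * np.exp(0.8 * (df["fico"] - 7.2)
                                          - 1.5 * (df["ltv"] - 0.7))
df["time"] = np.minimum(latent, 84)
df["default"] = (df["time"] < 84).astype(int)

# Partial-likelihood estimation of the Cox PH model: no baseline hazard
# h0(t) needs to be specified, mirroring Proc PHreg.
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="default")
cph.print_summary()     # includes exp(coef), i.e. the hazard ratios
```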

For a dummy variable with values 1 and 0, the hazard ratio is the ratio of the estimated hazard for those with value 1 to the estimated hazard for those with value 0, controlling for the other covariates. For the dummy variable indicating CS one-year growth greater than -12%, the hazard ratio is 0.54. This means that the estimated hazard of default for accounts with CS one-year growth greater than -12% is only about 54% of the hazard for those with CS one-year growth less than or equal to -12%, holding all the other variables the same.
For quantitative covariates, the estimated percent change in the hazard for each 1-unit increase in the covariate can be obtained by subtracting 1 from the hazard ratio and multiplying by 100. For the current FICO score, the hazard ratio is 0.303, which yields 100(0.303 − 1) = −69.7. Since the FICO score enters the model as the raw score divided by 100, each 100-point increase in the FICO score decreases the estimated hazard of default by 69.7%.
Similarly, for one additional month on book, the hazard of default goes up by an estimated 1.5%. However, the p-value for the month on book is greater than 0.05, which means this variable is not statistically significant. To ensure the best model is selected, I also tried removing the month on book and using the unemployment rate instead, obtaining the following model, in which all variables have p-values less than 0.05.

Model Selection
As explained in Chapter 4.1, the AIC, the SC, and the log likelihood multiplied by -2 can be used to compare models with different sets of covariates. Even though these statistics cannot be used to construct a formal hypothesis test, the comparison gives an indication of fit, with a smaller value meaning a better fit. As shown in Table 15, model 2, with the unemployment rate, fits slightly better than model 1, which uses the same set of variables as the logistic regression model. The main purpose of a model used in credit risk management is to predict which customers are more likely to default so the bank can take actions to actively manage such accounts; therefore, model 1 is selected, and Chapter 4.2 compares this model with the logistic regression model.

Predicted Time to Default
One of the main outputs of a survival model is the predicted time to event, which in this thesis is the time to default, for future accounts with specific covariates. Median survival times are often used in medical studies as a way to characterize the survival experience of a group of patients. The median survival time can be estimated well provided that the censoring is not too heavy (Ying et al., 1995). Under heavy censoring, there may be a significant percentage of cases for which the median survival time cannot be estimated, because the probability that the estimated survival curve never crosses 0.5 can be substantial (Strawderman et al., 1997). In practice, the proportion of defaulted credits is very small, so the proportion of censored data is very large in credit risk management data. This often introduces challenges for time-to-default prediction. Lee et al. (2007) tried to tackle the heavy censoring issue by predicting lower quantiles instead. However, predictions from the lower quantiles are prone to relatively larger tail errors, which is a limitation of using survival analysis for time prediction on heavily censored data. Figure 18 shows the actual and predicted times to default for the defaulted loans. We can see that the predicted time to default has a similar distribution but with a fatter tail. Another way to look at the prediction error is to calculate the delta as the predicted time to default minus the actual time to default. Figure 19 shows the distribution of the delta. The distribution is asymmetric, with mean equal to 0 based on the t-test (Table 17).
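Continuing the lifelines sketch from the previous section (it reuses the fitted cph and the data frame df), predicted median and lower-quantile times to default can be obtained as follows; under heavy censoring the predicted median can be undefined (returned as infinity), which is exactly the issue noted above.

```python
# Predicted time to default for a few loans, using the cph model fitted above.
new_loans = df[["fico", "ltv"]].head(5)

# Median time to default: infinite when the predicted survival curve
# never crosses 0.5, which is common under heavy censoring.
median_ttd = cph.predict_median(new_loans)

# An earlier quantile is estimable more often: p = 0.75 returns the time t
# at which the predicted S(t) = 0.75, i.e. the 25th percentile of default times.
q25_ttd = cph.predict_percentile(new_loans, p=0.75)

print(median_ttd)
print(q25_ttd)
```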

Comparison and Summary
The results of the final modeling from both methods show very similar fit in terms of the ROC, with the survival model performing slightly better than logistic regression in the training dataset and almost the same in the testing dataset. In terms of prediction of defaulted and non-defaulted mortgage portfolios, the logistic regression model outperforms survival analysis in the training dataset, while the survival model outperforms logistic regression in the testing dataset. Using a very large set of real data from a bank's mortgage portfolio, the two approaches show similar performance in terms of rank ordering power. For the survival model, I implemented a hazard model with time-varying covariates to predict the time to default and then predicted the non-defaults and defaults over the same time frame. The analysis supports the claim that survival analysis models are competitive with the industry-standard logistic regression approach.
As discussed earlier, many more complex models have been investigated in the literature; however, none of them has become common practice in the real world. This is partially due to the fact that the flexibility attainable with more complex models sometimes leads to poor predictive performance. Moreover, the cost of implementing such models is higher than the potential value they add.
Survival analysis is an alternative to logistic regression that is still reasonably simple.
My thesis confirmed that for the sole purpose of predicting the probability of default within a single specific period, survival modeling has little advantage over the logistic regression model. This is consistent with the findings of Stepanova and Thomas (2003). However, the survival analysis methodology offers a number of advantages that are very useful for both credit risk management and capital management. First, it provides a consistent method of predicting the probability of default over arbitrarily different periods of time. With the logistic regression model, in order to obtain predictions for different time windows, different models have to be built, perhaps with different data structures. Second, survival analysis can take into consideration the most recent data. In contrast, for a logistic regression predicting the probability of default within 24 months, the latest 24 months of data cannot be used, since a performance window of at least 24 months is needed to observe the actual defaults.
Third, as Stepanova and Thomas (2001) illustrated, the survival probability can also be used to calculate the expected profit from a loan. Their article introduced the idea of the expected profit from a loan, calculated as the sum of the present values of the installments, each multiplied by the probability of receiving it (the customer's survival probability), less the loan amount. In this way, the profit from a loan can be estimated and then used in profitability management.
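A minimal sketch of this profit calculation, assuming a fixed monthly installment, a monthly discount rate, and a survival curve S(t) (all invented here), is given below.

```python
import numpy as np

def expected_profit(survival, installment, loan_amount, monthly_rate):
    """Sum of discounted installments, each weighted by the probability
    the loan survives to that month, less the loan amount."""
    t = np.arange(1, len(survival) + 1)
    pv = installment * survival / (1 + monthly_rate) ** t
    return pv.sum() - loan_amount

# Toy survival curve S(t) for months 1..360 (hypothetical).
S = np.exp(-0.001 * np.arange(1, 361))
print(expected_profit(S, installment=1200.0, loan_amount=200000.0,
                      monthly_rate=0.04 / 12))
```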
Last but not least, survival analysis provides more complete information through the predicted time-to-default distribution. Even with certain limitations due to heavy censoring, the knowledge obtained from the predicted distribution of T can be useful in the broader context of profit modeling.

There are also limitations in using survival methods for this type of data. The first limitation concerns the application of survival analysis from the banking industry perspective. All models built in the banking industry need to be understood by the business users so they can better manage the business based on the model outputs. Logistic regression is built for binary data (in our application, default or not) and is usually estimated using maximum likelihood. Both the coefficients of the parameters and the probability of default can be interpreted in a straightforward fashion. In contrast, the survival analysis model, especially the PH model, models the hazard rate. The hazard rate is more difficult to interpret from a business point of view, the estimation is carried out using partial maximum likelihood without having to define the baseline hazard, and the approach is less common among practitioners. This could be one of the reasons why logistic regression is still the prevailing method in the industry for default analysis. Another potential limitation is the cost-benefit perspective. One of the main uses of the default probability is the reserve calculation. For calculating reserves, exactly when the loan will default does not matter much, as long as we know the probability that the loan will default in the next year. The logistic regression model gives the predicted probability of default for the next 12 months directly; however, additional calculations are needed to obtain this probability from the survival analysis model. This increases the implementation cost without adding much value.
In summary, as default modeling receives more attention in broader areas such as profitability management, the direction in which the banking industry is heading, the additional benefits of survival analysis modeling can be leveraged.

Future Steps
This thesis was carried out on a sample of the mortgage loans originated in 2004. It will be helpful to test the model on mortgage loans outside this sample when data are available. As default is a rare event, it would be of interest to investigate