Network-Based Statistical Methods for the Analysis of Stock Returns

To maximize returns and diversify a financial portfolio, stock market participants have always been interested in learning how the stock price returns of different companies are associated. The five primary goals of this thesis are: (1) to evaluate and infer associations of stock returns between different companies in selected industrial sectors and countries, (2) to identify groups of companies that exhibit the most similar stock market trends, (3) to evaluate changes in associations between companies in the time period from 2009 to 2015, (4) to forecast future return movements using selected classification methods, and (5) to explore the relationship between the accuracy of classification of stock return movements and network node properties. This thesis analyzed daily stock price data collected from a publicly available source, Yahoo Finance, for a sample of eighty-nine selected companies from four industrial sectors and three countries (China, Germany, and the US) over a period of seven years from 2009 to 2015. Daily prices were converted into returns and then used to compute a correlation matrix and a corresponding association network. The resulting network was employed to identify clusters of companies that exhibit similar return trends and to evaluate the relationships within and between different industrial sectors. To assess changes in associations between companies during special financial events, annual dynamic networks were created. Four classification methods, namely Linear Discriminant Analysis, Quadratic Discriminant Analysis, k-Nearest Neighbors, and Logistic Regression, were built to predict price movements for all selected companies. The relationships between classification accuracy rates and network properties were evaluated graphically. The results of the network-based analysis showed that companies that traded in the same stock market and/or belonged to the same industrial sector had significant associations.
Specifically, Chinese companies had higher inner correlations in the banking and telecommunication sectors, while the US and German companies had stronger associations in the banking and auto manufacturing sectors. Interestingly, the associations among companies became stronger, and more companies tended to be grouped together in the network, during significant financial events and in the early recovery periods. The results of the classification analysis revealed the superior performance of the logistic regression method compared to the other three classification methods, particularly for the Chinese companies. Remarkably, companies that acted as followers and belonged to medium-size clusters with eight to thirteen neighbors in the association network were easier to classify than other companies, thereby supporting the relationship between classification and network-based methods.

Figure 1 The stock market index performance and linear trends for three countries

Beyond discovering associations between the financial indices of different countries, recovering the associations between companies based on their stock price returns is a complicated but important task that has received much interest among investors and financial researchers. Pair trading, used in financial corporations and hedge funds, is a very illustrative example of utilizing the association between different companies to gain profits.
Pair trading was originally proposed by Gerry Bamberger and Nunzio Tartaglia's quantitative group at Morgan Stanley (Bookstaber 2007). Pair trading is done by closely monitoring two stocks whose prices are highly correlated. When the two stocks temporarily go out of sync, the trader goes long on the stock that is relatively lower in price and shorts the one that is higher in price. This way, when the two stock prices converge again, the trader benefits from both the long and the short positions.
Simultaneous long and short selling are widely used techniques in pair trading that aim at gaining profits from the relative movement of stock prices in both an upward-trending market and a downward-trending market (Ehrman 2006). [...] (4) forecast of future return movements using parametric and non-parametric classification methods, and (5) assessment of relationships between the accuracy of classification of stock return movements and network node properties.
To achieve these goals, daily historical stock price data were collected from publicly available national and international sources for multiple companies in selected industrial sectors in the US, China, and Germany for a period of seven years (from 2009 to 2015). The collected prices were used to create a correlation matrix and to characterize the corresponding association network. Network properties, such as the average node degree, network density, clustering coefficient, and average betweenness centrality, were computed for the generated correlation network. A network-based community detection method was used to find cohesive sets of companies that exhibit the most similar stock return trends. Note that the outlined network characteristics and community detection were applied both to a stationary network (inferred from all seven years of data) and to a sequence of dynamic networks (inferred from annual data). The purpose of analyzing the dynamic (annual) networks is to assess changes in associations between companies during special financial events. Since many companies depend on loans and credit, fluctuations in their stock prices may affect the stock prices of their lenders in other industries. As a result, the associations between different companies could change over time, especially during major financial events and in periods of early market recovery. To predict future stock return trends, parametric classification methods, including linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression, and the non-parametric k-nearest neighbor method were applied to the collected price data.
The rest of this thesis is organized as follows. Chapter 2 reviews related work in network-based and multivariate analysis of financial data. Chapter 3 outlines the data collection process and preliminary data analysis. Chapter 4 describes the network methods and predictive models used in this thesis. The results and conclusions are presented in Chapter 5 and Chapter 6, respectively.

REVIEW OF LITERATURE
The application of correlation networks to the analysis of associations between different companies and/or countries based on stock market prices or exchange indices has received considerable attention among statisticians and financiers over the last ten years (Heimo et al. 2007; Nobi et al. 2014; Sienkiewicz et al. 2013; Song et al. 2011).
To analyze the movement of associations in stock market data, correlation graphs are often used to infer associations among countries, industries, and/or companies, and then more sophisticated network-based techniques are employed to extract information about the global structure of the stock market. For example, a correlation graph-based approach was used by Song et al. (2011) to analyze the presence of both short-term and long-term association dynamics among the stock exchange indices of 57 countries in the time period from 1996 to 2009. Threshold-based correlation networks were utilized by Nobi et al. (2014) to explore the effect of the global financial crisis of 2008 on the association of stock prices in local Korean stock markets. [...] constructed based on the correlation network, and then used for identification of the industry sectors of central nodes in the network by Heimo et al. (2007).
Besides analyzing the movement of associations in stock market data, using historical data to create an optimal predictive model for current and future events has become very popular in the new millennium (Leung, Daouk, and Chen 2000; Wang and Shang 2014; Kara, Acar Boyacioglu, and Baykan 2011; Alrasheedi 2012; Nguyen, Shirai, and Velcin 2015; Zhang et al. 2015; Peng 2015).
Specifically, to predict the movement direction of a future stock price or index, various statistical and machine learning classification models, such as Linear [...] Twitter associated with these companies (Zhang et al. 2015). Similarly, financial data collected from Bloomberg have been used as an additional predictor for a neural network model forecasting stock price movement (Peng 2015).
To predict future stock price returns or volatility, many time series models, such as Autoregressive (AR), Autoregressive Integrated Moving Average (ARIMA), and Generalized Autoregressive Conditional Heteroscedasticity (GARCH) models, have been proposed. A time series model normally uses previously observed values to predict the future output. For instance, the output of an autoregressive model at time t is a linear regression on its own previous values. The AR(2) model uses the records of the previous two days, that is, the value at time t-1 and the value at time t-2, as its two predictors to forecast the output at time t. In this study, instead of using time series models to build an optimal predictive model, the focus is on comparing the predictive performance for different objects using the same predictors.
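For reference, the AR(2) model mentioned above is commonly written as shown below. The intercept c, the coefficients φ1 and φ2, and the noise term ε_t are standard textbook notation assumed here; they are not symbols taken from this thesis.

```latex
X_t = c + \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \varepsilon_t
```

Here X_{t-1} and X_{t-2} play the role of the two predictors, and ε_t is the part of X_t that the two lagged values do not explain.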
Unlike previous research that focused on exploring changes in the association of global and local exchange indices, the research in this thesis focuses on evaluating associations among a subset of leading companies from multiple industries (banking, communication, manufacturing, and pharmaceutical) in three countries with high market activity (USA, Germany, and China) located on three different continents (North America, Europe, and Asia). Additionally, a correlation-based threshold network is used to detect and clearly represent the associations between companies and to discover the clusters of companies following the most similar trends. Besides the static network, annual dynamic networks are employed to assess changes in associations between companies in the time period from 2009 to 2015. Also, four classification models are created, and their prediction performance is compared for different objects across the three countries under the same treatment and standard. Six years of historical data, from 2009 to 2014, are used as predictors to create the classification models, and the accuracy rate is used as the criterion for evaluating the performance of the different models. Finally, graphical tools, such as scatter charts and level plots, are applied to evaluate the relationships between classification accuracy rates and network node properties.

DATA DESCRIPTION AND PRELIMINARY ANALYSIS
In order to perform a robust and accurate data analysis that would bring valid insights, it was essential to go through three data processing steps: data collection, cleaning, and formatting. Section 3.1 explains how the data were initially obtained from a large financial data provider, Yahoo Finance. Section 3.2 illustrates the extensive data cleaning process that was applied to improve the data quality.
Specifically, Section 3.2 focuses on situations with missing data values or differences in trading dates. Additionally, this section describes how all collected stock prices were converted to US dollars to ensure that the stock returns are comparable across different companies, industry sectors, and countries. Finally, Section 3.3 provides the results of an extensive preliminary analysis that gives the reader a general overview of the data.

Data Collection
Public daily closing stock prices recorded between January 2, 2009 and December 31, 2015 were collected from Yahoo Finance for a subset of leading companies from multiple industrial sectors (banking, communication, manufacturing, and pharmaceutical) in three countries with high market activity (USA, Germany, and China). Closing prices were chosen because they are among the most accurate and commonly used measures of stock price values when financial data are collected on a daily basis.
The distribution of the companies among the four industries in the three countries is shown in Table 1.

Data Preparation
The project data was collected in a period time from Jan 2nd

Preliminary Analysis
Price charts and candlestick charts are commonly used plots in finance that describe how a stock evolved over a given period of time. The following three price charts are depicted in Figure [...]. The reasons for a rise or fall in a stock price vary. In general, there are three main factors that could affect stock price movements. The first factor is the fundamental performance of a company, for example, a change of management, news related to earnings and profits, the release of a new product, employee layoffs, etc. The second factor is related to industry performance and the behavior of competitors. The last factor is related to the overall national economy and economic policy.

Returns
In this thesis, returns were calculated and used for the further analysis. The main benefit of using stock returns instead of stock prices is that returns reduce the variance of stock changes, or volatility.
By contrast, a stock price can change by an extremely large amount compared to the previously recorded price, especially during significant financial events. Song et al.
where is the closing price of company i on day t.
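As an illustration of converting daily closing prices into returns, the following minimal Python sketch assumes the logarithmic return, r_i(t) = ln P_i(t) - ln P_i(t-1); the thesis may instead use the simple return P_i(t)/P_i(t-1) - 1. The notation P_i(t) for the closing price, the function name, and the price values are all hypothetical.

```python
import math


def log_returns(prices):
    """Return ln(P_t) - ln(P_{t-1}) for each pair of consecutive closing prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical closing prices of one company on four consecutive trading days.
closing = [100.0, 102.0, 99.96, 99.96]
returns = log_returns(closing)  # one return fewer than there are prices
```

A series of n prices yields n - 1 returns, and a day with an unchanged price yields a return of zero.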

Return Distribution
Eighty-nine companies were selected for this project, with 20 to 30 companies from each country.

Table 2 Summary of stock returns for three countries and across 7 years.

• Pearson Correlation Matrix
Pearson correlation is one of the most popular statistical measures of association between two numerical variables. It can be used to estimate the similarity between the stock returns of two companies i and j as follows:

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}},$$

where $\sigma_{ij}$ is the sample covariance between the stock returns of company i and company j, and $\sigma_{ii}$ and $\sigma_{jj}$ are the sample variances of the stock returns of company i and company j, respectively. Figure 5 illustrates the estimated values of the correlation coefficient $\rho_{ij}$ between companies i and j obtained from 7 years of daily data. A correlation coefficient ranges from -1 to 1, with higher positive values indicating a stronger direct linear relationship between stock returns. The left panel of Figure 5 shows the correlation matrix, in which the majority of correlation values are positive, meaning that the returns of most companies followed the same trends. The right panel of Figure 5 supports this conclusion: the distribution of correlation coefficients is right-skewed, with values ranging from -0.1 to 0.85.

Table 3 Mean correlation across 3 countries and 4 industries.
To evaluate the inner (within-group) correlation for each of the 3 countries and 4 industries, we computed the average correlation coefficients and summarized the results in Table 3.
One can see that the inner correlations of the German communication and pharmaceutical sectors are smaller than those of the other two German sectors, especially German manufacturing. US banking has the largest inner correlation, equal to 0.685. Likewise, the Chinese communication and banking sectors have greater inner correlations than the other two Chinese industries.
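The inner correlation of a group can be sketched in a few lines of Python. This is a minimal illustration, assuming that the inner correlation is the plain average of the Pearson coefficients over all pairs of companies within one country-sector group; the company labels, return values, and function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation; the 1/(n-1) factors cancel in the ratio."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def mean_inner_correlation(returns_by_company):
    """Average correlation over all unordered pairs of companies in one group."""
    pairs = list(combinations(returns_by_company, 2))
    total = sum(pearson(returns_by_company[a], returns_by_company[b]) for a, b in pairs)
    return total / len(pairs)


# Hypothetical daily returns for three companies in one group.
group = {
    "A": [0.01, -0.02, 0.03, 0.00],
    "B": [0.02, -0.04, 0.06, 0.00],   # twice A, so perfectly correlated with A
    "C": [-0.01, 0.02, -0.03, 0.00],  # minus A, so perfectly anti-correlated
}
inner = mean_inner_correlation(group)
```

In this toy group the three pairwise correlations are 1, -1, and -1, so the inner correlation is -1/3; a group whose members move together would instead give a value close to 1.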

• Returns Annual Volatility
Volatility, also known as the variation of a stock or index, estimates how risky a particular stock or index is (Ensor 2014). Volatility can be measured using the security's standard deviation, which describes how tightly the stock prices are distributed around the mean. Because a standard deviation is never negative, volatility is also non-negative. In this section, the annual average volatility is computed from the standard deviation of daily returns and the time period of the return, P. [...] Chinese stock market turbulence. Other financial events could also contribute to significant changes in the average correlation and the average volatility, but they remain beyond the scope of this project.
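A common way to combine the daily standard deviation with the time period P is the square-root-of-time rule, in which the annual volatility equals the daily standard deviation multiplied by the square root of P. The sketch below assumes that convention, which may differ from the formula used in this thesis; the function name, the return values, and the choice of P = 232 (roughly the average number of daily records per year in the data) are illustrative assumptions.

```python
import math
import statistics


def annualized_volatility(daily_returns, periods_per_year):
    """Sample standard deviation of daily returns scaled by sqrt(P)."""
    return statistics.stdev(daily_returns) * math.sqrt(periods_per_year)


# Hypothetical daily returns for one company within one year.
daily = [0.01, -0.01, 0.01, -0.01]
vol = annualized_volatility(daily, periods_per_year=232)
```

Averaging this quantity over all companies within a year would give one annual average volatility value per year.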
The preliminary analysis shows that the returns have a mean of zero and vary within a small range. The predominantly positive correlation coefficients between different companies imply that the returns of most companies followed the same trends and were affected by the national economy. Volatility represents the variation of returns and usually follows a trend similar to that of the average correlation of the 89 companies studied. However, volatility only describes the variation of a specific asset or index and cannot provide information on how relationships change within the data. Thus, in order to fill this gap and detect the associations between companies, the network-based analysis methods and classification models are introduced and described in the following chapter. The main findings and conclusions of the proposed study are summarized in Chapter 5 and Chapter 6.

Correlation-Based Network
Statistical analysis of network data combines mathematical graph theory and statistical data analysis. In the proposed study, correlation-based networks are used to represent the associations between individual objects.
Any network comprises two main components, namely vertices and edges. Formally, a network graph, or simply a graph, G = (V, E) is used to represent a set of elements (vertices) and a set of interconnections (edges) between the elements, where V denotes the set of vertices (also commonly known as nodes) and E denotes the set of edges (also commonly called links). In what follows, the network graph under consideration is assumed to be a simple, undirected graph, that is, a graph with no multi-edges between any elements, no edges that connect an element to itself (self-loops), and where the direction of edges is not important.
For the purposes of the proposed study, the vertices (V) are defined as the leading companies selected from multiple industrial sectors (banking, communication, manufacturing, and pharmaceutical) in three countries with high market activity (USA, Germany, and China), while the edges (E) are defined as the connections between pairs of vertices (companies). The correlation network of stock returns can be constructed in two steps: (1) evaluate the similarities between different vertices using the Pearson correlations introduced and computed in the previous chapter of this thesis; and (2) use an appropriate test statistic to verify whether the similarities between the daily stock returns of different companies are statistically different from zero.
In Section 3.3, Pearson correlation was introduced as a measure of the similarity between the stock returns of different companies within the dataset. Here, $\rho_{ij}$ denotes the correlation coefficient between the time series of stock returns for a pair of companies i and j.
The following hypothesis tests are applied to verify the existence of significant linear relationships between all possible pairs of companies:

$$H_0: \rho_{ij} = 0 \quad \text{versus} \quad H_1: \rho_{ij} \neq 0. \qquad [4]$$

However, when many hypothesis tests are performed on the same data, the chance of at least one false rejection grows with the number of tests, and therefore the overall type I error rate may be much higher than the pre-defined significance level α (the problem of multiple testing). To address this problem, the Benjamini-Hochberg adjustment is applied here, in which the p-values are adjusted so as to control the false discovery rate (FDR). The FDR procedure is organized as follows. After finding the p-values of the multiple tests described by Equation 4, one sorts these p-values from smallest to largest, generating a sequence p(1) ≤ p(2) ≤ ⋯ ≤ p(N), and finds the largest k for which p(k) ≤ (k/N)γ, where p(k) is the k-th smallest p-value among the N tests. The null hypothesis is then rejected at level γ for every test whose p-value is no larger than p(k), and the corresponding edges are assigned to the network. Here, the level γ is a user-specified value.
After the p-values are adjusted, each adjusted p-value is compared with the significance level; here, the standard 0.05 significance level is used.
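The Benjamini-Hochberg step-up rule described above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the thesis; the function name and the p-values are hypothetical, and gamma corresponds to the user-specified level γ.

```python
def benjamini_hochberg(p_values, gamma=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / n) * gamma.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * gamma:
            k_max = rank
    # Reject every hypothesis whose p-value is among the k_max smallest.
    rejected = [False] * n
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per pair of companies.
edges_kept = benjamini_hochberg([0.001, 0.30, 0.012, 0.045, 0.02])
# Step-up behaviour: 0.02 misses its own cutoff but is rejected because 0.03 passes.
step_up = benjamini_hochberg([0.2, 0.02, 0.03])
```

Each True entry corresponds to a pair of companies that would be joined by an edge in the correlation network.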

Threshold Correlation Network
The threshold correlation network approach is similar to the correlation network approach described above. It is, in fact, a simpler approach that can be very helpful if the inferred correlation network is too complex and a large percentage of the correlations are significantly different from zero. The threshold network ignores associations between two vertices whose correlation is smaller than a pre-set threshold and preserves the associations whose correlation is greater than the threshold.
Formally, the network graph G = (V, E) is created by determining the edge set E according to formula [6] (see Nobi et al. 2014):

$$e_{ij} = \begin{cases} 1, & \rho_{ij} > \theta, \\ 0, & \text{otherwise}, \end{cases} \qquad [6]$$

where $\theta$ stands for the correlation threshold, and $\rho_{ij}$ represents the estimated correlation between company i and company j. The indicator $e_{ij}$ denotes the edge between nodes i and j.
Note that the threshold network yields only an undirected network graph that depicts whether the correlation between two vertices is larger than the threshold value $\theta$; it cannot describe the direction of edges. The value of the threshold $\theta$ should also be chosen with caution. The network will be fully connected if $\theta$ is set too small, and the network will be empty if $\theta$ is too large. Therefore, choosing a suitable $\theta$ is one of the key elements of this part of the analysis.
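The effect of the threshold can be illustrated with a short Python sketch. It assumes that an edge is kept when the estimated correlation is strictly greater than the threshold (written theta here); the company labels, correlation values, and function name are hypothetical.

```python
def threshold_network(names, corr, theta):
    """Return the set of undirected edges (name_i, name_j) with corr[i][j] > theta."""
    edges = set()
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only: no self-loops
            if corr[i][j] > theta:
                edges.add((names[i], names[j]))
    return edges


# Hypothetical symmetric correlation matrix for three companies.
names = ["A", "B", "C"]
corr = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
moderate = threshold_network(names, corr, theta=0.4)   # A-B and B-C survive
too_large = threshold_network(names, corr, theta=0.9)  # empty network
too_small = threshold_network(names, corr, theta=-1.0) # fully connected network
```

The last two calls reproduce the two degenerate cases mentioned above: an empty network for an overly large threshold and a fully connected one for an overly small threshold.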

Random Graph Models
Random graph models are frequently used to test the 'significance' of characteristics in a constructed network graph. Here, we adopt two random graph models: the classical random graph model originally proposed by Erdős and Rényi, and a generalized random graph model.
The core concept of classical random graph theory is the successive addition of random edges to a set of N isolated vertices. To test the significance of structural characteristics of the observed network, a sequence of classical random graphs is created, where each graph has the same number of nodes and the same number of edges as the observed network graph. Formally, for the network graph G, 1,000 random graphs are simulated, each with the vertex count and the edge count of the observed network graph.
Similarly, a sequence of generalized random graphs is created, where each graph has the same number of nodes and the same degree sequence as the observed network graph (see Section 4.1 for more details).
After obtaining the simulated graphs, one can compute the distribution of structural network characteristics over the random graphs. Examples of structural network characteristics include, but are not limited to, graph density, average betweenness centrality, average vertex degree, and the clustering coefficient. All these characteristics are introduced in the next section. The distribution of a given network characteristic constructed from the simulated classical or generalized random graphs can be considered a reference distribution, which one can use to examine how likely the characteristic of the observed graph is under this distribution. This likelihood can then be used to assess the significance of the observed network characteristic compared to a random graph structure.
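The reference-distribution idea can be sketched for the classical random graph with a fixed number of vertices and edges. The Python example below is a minimal illustration under several assumptions: the observed graph is a hypothetical six-vertex network, the characteristic is the global clustering coefficient, and significance is summarized by the share of simulated graphs whose value is at least as large as the observed one. The generalized (fixed degree sequence) model is not covered.

```python
import random
from itertools import combinations


def gnm_random_graph(n, m, rng):
    """Classical random graph: m distinct edges drawn uniformly among n vertices."""
    all_pairs = list(combinations(range(n), 2))
    return set(rng.sample(all_pairs, m))


def transitivity(n, edges):
    """Share of connected triples that close into triangles."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Each triangle is counted once per corner, i.e. three times in total.
    closed = sum(1 for v in adj for a, b in combinations(adj[v], 2) if b in adj[a])
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    return closed / triples if triples else 0.0


# Hypothetical observed network: two triangles joined by one edge.
observed_edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}
n_vertices, n_edges = 6, len(observed_edges)
observed = transitivity(n_vertices, observed_edges)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
reference = [
    transitivity(n_vertices, gnm_random_graph(n_vertices, n_edges, rng))
    for _ in range(1000)
]
p_value = sum(value >= observed for value in reference) / len(reference)
```

A small p_value would indicate that the observed network is more clustered than random graphs of the same size would typically be.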

Network Characteristics
To detect and describe the structure in an observed network graph, the following network characteristics are computed: vertex degree, graph density, clustering coefficient, and betweenness centrality.

• Vertex Degree distribution
Vertex degree is a measure of vertex connectivity in a given network.
In the network graph G = (V, E), the degree of a vertex v is the number of edges in G incident to v. Hence, the degree of a company is the number of other companies associated with it, that is, the number of companies whose price returns have a strong linear relationship with the price returns of the given company. The degree distribution is often used as a fundamental property of a network graph because it is easily computed and can be interpreted as a representation of the network connectivity.
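As a small illustration, vertex degrees and the degree distribution can be computed directly from an edge list. The following Python sketch uses a hypothetical four-company network; the names are placeholders.

```python
from collections import Counter


def degrees(vertices, edges):
    """Count, for every vertex, the number of edges incident to it."""
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


# Hypothetical network: A, B, and C are mutually associated, D is isolated.
vertices = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C")]
deg = degrees(vertices, edges)
degree_distribution = Counter(deg.values())  # maps degree -> number of vertices
```

Here three companies have degree 2 and one company has degree 0, so the sum of degrees equals twice the number of edges.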

• Graph Density
To evaluate how cohesive the network graph is, the network density characteristic is included in the proposed analysis.
The global density of a graph can be defined as the frequency of realized edges relative to potential edges, and it lies between zero and one (Kolaczyk 2014). The number of potential edges in an undirected graph G = (V, E) with no self-loops and no multiple edges is equal to $N_v(N_v - 1)/2$. Thus, the density of graph G can be defined as:

$$\mathrm{den}(G) = \frac{|E|}{|V|\,(|V| - 1)/2},$$

where |E| is the number of realized edges in G and |V| = $N_v$ is the number of vertices.
Similarly, the density of a subgraph $H = (V_H, E_H)$ is:

$$\mathrm{den}(H) = \frac{|E_H|}{|V_H|\,(|V_H| - 1)/2},$$

which can be used to measure how close the subgraph H is to a clique.
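Both densities can be illustrated with a short Python sketch. The network below is hypothetical, and the functions assume a simple undirected graph with at least two vertices.

```python
def density(n_vertices, n_edges):
    """Realized edges divided by the n(n-1)/2 potential edges (needs n >= 2)."""
    return n_edges / (n_vertices * (n_vertices - 1) / 2)


def subgraph_density(edges, subset):
    """Density of the subgraph induced by the vertices in `subset`."""
    subset = set(subset)
    inner = [(u, v) for u, v in edges if u in subset and v in subset]
    return density(len(subset), len(inner))


# Hypothetical network: A, B, and C form a triangle, D is isolated.
vertices = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C")]
whole = density(len(vertices), len(edges))          # 3 of 6 potential edges
clique = subgraph_density(edges, {"A", "B", "C"})   # all 3 potential edges
```

A subgraph density of 1 means the subgraph is a clique, which is the case for the triangle in this toy network.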

• Clustering coefficient
The clustering coefficient (cl) is a network characteristic similar to graph density.
Both the clustering coefficient and graph density are used to describe the cohesive properties of a given graph.
The clustering coefficient is a measure of the frequency with which connected triples 'close' to form full triangles in the undirected graph G = (V, E). It can be computed using equation [9]:

$$\mathrm{cl}_T(G) = \frac{3\,\tau_\Delta(G)}{\tau_3(G)}, \qquad [9]$$

where $\tau_\Delta(G)$ is the number of triangles in graph G (equivalently, one third of the sum over all vertices of the number of triangles containing each vertex), and $\tau_3(G)$ is the number of connected triples.
The clustering coefficient can also be computed locally. The local clustering coefficient of a node helps to determine whether the neighbors of the node are also connected to each other and how close they are to forming a clique. The clustering coefficient of a vertex $v_i$ in graph G can be obtained using the following equation:

$$\mathrm{cl}(v_i) = \frac{2\,\left|\{e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E\}\right|}{k_i\,(k_i - 1)},$$

where $v_j$ and $v_k$ are neighbors of vertex $v_i$, $N_i$ is the set of neighbors of $v_i$, $e_{jk}$ represents the edge between nodes j and k, and $k_i$ denotes the number of neighbors of node i.
The value of $\mathrm{cl}_T$ is also called the transitivity of the graph and is widely used in the social network literature. In a social network, clustering indicates how likely one person's friends are to befriend each other. Similarly, the local clustering coefficient in this thesis indicates the extent to which the companies associated with a specific node are also highly correlated with each other. The global version of the clustering coefficient evaluates the overall level of clustering in the network. It shows how likely the corporations in the data are to be interconnected and to form communities.
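The local and global clustering coefficients can be computed from an adjacency structure as sketched below in Python. The network is hypothetical, and assigning a local coefficient of zero to vertices with fewer than two neighbors is a convention assumed here.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Share of pairs of neighbors of v that are themselves connected."""
    k = len(adj[v])
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than two neighbors
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def transitivity(adj):
    """Global coefficient: 3 * (number of triangles) / (number of connected triples)."""
    # Counting closed neighbor pairs at every vertex counts each triangle three times.
    closed = sum(1 for v in adj for a, b in combinations(adj[v], 2) if b in adj[a])
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    return closed / triples if triples else 0.0


# Hypothetical network: triangle A-B-C with a pendant vertex D attached to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cl_a = local_clustering(adj, "A")
cl_c = local_clustering(adj, "C")
cl_global = transitivity(adj)
```

In this toy network both neighbors of A are connected (local coefficient 1), only one of the three neighbor pairs of C is connected (1/3), and the graph has one triangle and five connected triples (global coefficient 0.6).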

• Betweenness centrality
To investigate and quantify the 'importance' of vertices in a network graph, a betweenness centrality measure is used. There are several variants available, but the most commonly used definition of betweenness centrality is:

$$c_B(v) = \sum_{s \neq t \neq v \in V} \frac{\sigma(s, t \mid v)}{\sigma(s, t)},$$

where $\sigma(s, t \mid v)$ is the total number of shortest paths between s and t that pass through v, and $\sigma(s, t)$ is the total number of shortest paths between s and t (whether or not they pass through v). The vertex v with the largest $c_B(v)$ is the one most likely to be a central vertex in the network graph. Here, the average of the betweenness centrality coefficients of all vertices was computed at different thresholds.
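Betweenness centrality for an unweighted, undirected graph can be computed with Brandes' algorithm, which is swapped in here as a standard technique and is not necessarily the routine used in this thesis. The Python sketch below counts each unordered pair of endpoints once; the four-vertex path network is hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for a graph given as {vertex: set of neighbors}."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies, farthest vertices first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Every unordered pair (s, t) was visited from both ends, so halve the totals.
    return {v: value / 2 for v, value in cb.items()}


# Hypothetical network: a path A - B - C - D.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
cb = betweenness(adj)
average_cb = sum(cb.values()) / len(cb)
```

On this path the two inner vertices each lie on two shortest paths between other vertices, the end vertices lie on none, and the average betweenness is 1.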

Network Community Detection
In order to identify groups of companies that exhibit the most similar stock market trends, the agglomerative hierarchical clustering method is utilized in this thesis.
The reason for using agglomerative rather than divisive hierarchical clustering is that the former strategy detects and aggregates similar companies until only one cluster is left, whereas the latter starts with all companies in a single cluster and gradually detaches the dissimilar corporations. This thesis is more interested in the similarities between different companies; thus, the agglomerative hierarchical clustering method is employed.
The agglomerative hierarchical clustering algorithm works as follows. In the beginning, the algorithm places every element into its own cluster. Next, according to the hierarchical clustering principle, the two clusters with the most similar properties merge, thereby creating a new cluster. On each subsequent step of the algorithm, the two most similar clusters merge. The process continues until all clusters are merged and all elements belong to a single cluster.
In this thesis, the elements are the companies in the selected sectors and countries. Many methods can be used to measure similarities between companies.
Here, Euclidean distances are used to measure the similarities between stock price returns.
Formally, the Euclidean distance is defined as the length of the straight line segment from one point to another in Euclidean space. Suppose p is one company (point) with n return records, $p = (p_1, p_2, \ldots, p_n)$, and q is another company (point) with n return records, $q = (q_1, q_2, \ldots, q_n)$, in Euclidean n-space. The distance between the two companies (points) p and q can be computed using formula [12]:

$$d(p, q) = d(q, p) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}. \qquad [12]$$

Note that d(p, q), the distance between the two points p and q, is the length of the undirected line segment connecting them. In this project, the data contain 1621 observations (n = 1621) for each of the 89 companies, so the pairwise distances between all pairs of the 89 vectors of size n need to be computed.
Given the matrix of computed Euclidean distances, one can apply the hierarchical clustering algorithm. First, all companies are placed in their own clusters.
Next, the two companies with the minimal distance are merged. Once this new cluster is formed, the distance between two clusters with multiple elements needs to be computed. In this thesis, complete linkage is used to calculate such a distance [13]:

$$d_{AB} = \max_{p \in A,\ q \in B} d_{pq}, \qquad [13]$$

where $d_{AB}$ is the distance between two clusters A and B, and $d_{pq}$ is the distance between the vectors of stock returns p and q in clusters A and B, respectively. In this way, the maximum of $d_{pq}$ is used as the distance between the two clusters A and B. After computing the distances between the new clusters, the clusters with the smallest distance are merged.
As mentioned earlier, this process continues until all companies are merged into one cluster. As a result, a dendrogram is constructed that illustrates the arrangement of the merged clusters. Usually, clusters are defined by cutting the dendrogram at a specific height, which is a measure of the closeness of different companies or clusters. Note that cutting the dendrogram at different heights will result in different clustering solutions.
There are other options for computing the distance between two clusters, including single linkage and average linkage, to name a few. The reason for using complete linkage instead of, for example, the commonly used single linkage is that the research data are highly correlated (see the preliminary results in Chapter 3). Single linkage clustering is at a disadvantage when analyzing highly correlated data. With single linkage on highly correlated data, on each new iteration the existing cluster tends to absorb the one new observation that is closest to it. In the end, if one cuts the tree at a specific value, the tree is divided into two main parts: one large cluster and a set of separate single-company clusters. Such a result has little explanatory value. Complete linkage clustering does not have this limitation: for highly correlated data, it forms some small cliques first and then uses the most dissimilar companies in each clique to compute the similarity between two clusters. If one cuts the tree at a specific value, the tree is divided into different clusters containing companies that are similar to one another. Thus, complete linkage clustering is a more suitable technique for achieving one of the research goals, namely detecting the companies with similar stock return trends.
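The agglomerative procedure with Euclidean distances and complete linkage can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the thesis: instead of building the full dendrogram, it stops merging once a requested number of clusters remains, which corresponds to cutting the tree at one particular height. The company labels and the two-dimensional return vectors are hypothetical.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two return vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def complete_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain; returns a list of sets."""
    clusters = [{name} for name in points]

    def cluster_distance(a, b):
        # Complete linkage: distance between the two most dissimilar members.
        return max(euclidean(points[p], points[q]) for p in a for q in b)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical return vectors (two observations per company for readability).
points = {"A": (0.0, 0.0), "B": (0.1, 0.0), "C": (1.0, 1.0), "D": (1.1, 1.0)}
result = complete_linkage(points, n_clusters=2)
```

With these toy vectors, A and B merge first, then C and D, leaving two clusters of companies with similar return patterns.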

Dynamic Networks
A dynamic network is adopted here for the purpose of discovering the changes in associations among the eighty-nine selected companies from four industrial sectors and three countries (China, Germany, and the US) over seven years from 2009 to 2015. The dynamic network is a statistical network analysis method used to describe a complex dynamic system. Compared to the previously described (static) networks, which are constructed from the daily observations of all seven years, the dynamic network conceptually splits the data into different time windows and uses the data from each window sequentially to create a set of new networks.
In this thesis, dynamic analysis of network data is applied annually, with 232 daily records per year on average, to explore the structural changes in the association networks of companies, industries, and countries. Specifically, the data was separated by year (for example, the first record was computed using the stock return records in the period …). In order to describe each network graph, three network characteristics are computed across the seven research years: graph density, average betweenness centrality, and clustering coefficient.
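The annual windowing can be sketched as follows. This is a minimal illustration under stated assumptions: dates are taken to be ISO-formatted strings, the two return series are invented toy values, and the helper names `split_by_year` and `pearson` are hypothetical.

```python
import math
from collections import defaultdict


def split_by_year(records):
    """Group (date, returns_row) records into one window per calendar year."""
    windows = defaultdict(list)
    for date, row in records:
        windows[int(date[:4])].append(row)
    return dict(windows)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy daily returns of two companies (invented values, not thesis data).
records = [
    ("2009-01-05", [0.01, 0.02]),
    ("2009-01-06", [-0.02, -0.04]),
    ("2009-01-07", [0.03, 0.06]),
    ("2010-01-04", [0.01, -0.01]),
    ("2010-01-05", [-0.02, 0.02]),
    ("2010-01-06", [0.03, -0.03]),
]

# One correlation per yearly window; each window would yield its own network.
yearly_corr = {}
for year, rows in split_by_year(records).items():
    first = [row[0] for row in rows]
    second = [row[1] for row in rows]
    yearly_corr[year] = pearson(first, second)
```

With 89 companies, the same loop would produce one 89-by-89 correlation matrix per year, from which the annual network and its characteristics can be derived.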
To omit the overlapping edges and represent the relationships between connected companies more clearly, a technique from graph topology, the spanning tree, is applied in this section. A spanning tree is a subgraph of a graph G that covers all the vertices with the minimum number of edges. For example, if three vertices are mutually connected and form a triangle in a graph G, a spanning tree H is a subgraph of G that connects all three vertices with only two edges. Hence, there are no cycles (loops) in a spanning tree.
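The triangle example can be reproduced with a short sketch. It assumes an unweighted graph and uses a breadth-first search, which is only one of several ways to obtain a spanning tree; the thesis does not state which algorithm was used, and the function name `spanning_tree` is hypothetical.

```python
from collections import deque


def spanning_tree(vertices, edges):
    """Return the edges of a spanning tree (or forest) found by BFS."""
    neighbors = {v: [] for v in vertices}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    visited, tree = set(), []
    for root in vertices:  # restart the search in every connected component
        if root in visited:
            continue
        visited.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in visited:
                    visited.add(v)
                    tree.append((u, v))  # an edge reaching a new vertex never closes a cycle
                    queue.append(v)
    return tree


# Three mutually connected vertices form a triangle.
triangle_tree = spanning_tree(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")])
```

For a connected graph with n vertices, the result always has n - 1 edges, which is why the triangle is reduced to two edges.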

Classification Methods
In what follows, the focus is on forecasting the movement directions of stock price returns and on understanding the relationships between a model's predictive power and the network/data properties. Four classification methods are explored: LDA, QDA, KNN, and Logistic Regression. The performance of these methods is compared based on the prediction accuracy of the directions of stock returns.
A typical approach to assessing model performance is to separate the data into two sets, a training dataset and a test dataset. The training data is usually used to build a model and estimate the related parameters; the test data is normally used to test the performance of the developed model. Here, the dependent variable (Y) represents the movement direction of stock price returns (Upward, Y=1; Downward, Y=0); the predictors are the returns of the eighty-nine research companies. The predictor $X_{t-1}$ is utilized to predict the stock movement direction of a target company j at time t, $Y_{j,t}$. The predictor is a matrix with 1620 rows and 89 columns, which can be formally written as $X_{t-1} = (X_{1,t-1}, X_{2,t-1}, X_{3,t-1}, X_{4,t-1}, \ldots, X_{p,t-1})$. The structure of the predictors is as follows:
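The construction of the lagged predictors and the direction labels can be sketched as follows. The helper names and the toy returns are hypothetical, and the chronological split (the most recent observations held out for testing) is an assumption made for this illustration.

```python
def make_lagged_dataset(returns, target):
    """Pair previous-day returns of all companies with the next-day direction.

    returns: list of rows, one row of company returns per trading day.
    target: column index of the target company j.
    """
    features, labels = [], []
    for t in range(1, len(returns)):
        features.append(returns[t - 1])                    # X_{t-1}
        labels.append(1 if returns[t][target] > 0 else 0)  # Y_{j,t}
    return features, labels


def chronological_split(features, labels, n_test):
    """Hold out the last n_test observations as the test set (an assumption)."""
    return (features[:-n_test], labels[:-n_test],
            features[-n_test:], labels[-n_test:])


# Toy returns for two companies over five days (invented values).
toy_returns = [
    [0.01, -0.02],
    [0.03, 0.01],
    [-0.01, 0.02],
    [0.02, -0.03],
    [-0.02, 0.01],
]
features, labels = make_lagged_dataset(toy_returns, target=0)
train_x, train_y, test_x, test_y = chronological_split(features, labels, n_test=1)
```

Each label is positive only when the target company's return on day t is positive, which matches the coding Y=1 for 'Upward' and Y=0 for 'Downward'.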

Linear Discriminant Analysis
Linear Discriminant Analysis is based on the assumption that the predictors $X = (X_1, X_2, X_3, X_4, \ldots, X_p)$ are drawn from a multivariate Gaussian distribution with a class-specific mean and a common covariance matrix, and that Y is a categorical class variable (James et al. 2013). In this project, Y is a two-level class variable that represents the movement direction; that is, Y is equal to one if the return is positive ('Upward' movement), and Y is equal to zero ('Downward' movement) otherwise.
The conditional densities of X given Y=1 and given Y=0 are denoted as $f_1(x)=f(X=x|Y=1)$ and $f_0(x)=f(X=x|Y=0)$, respectively, and can be computed as follows: $f_c(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_c)^T \Sigma^{-1} (x-\mu_c)\right)$, $c \in \{0, 1\}$, where p is the dimension of the random variable X, $\Sigma$ is the covariance matrix, common to both classes, and $\mu_1$, $\mu_0$ are the means of X for class 1 and class 0, respectively.
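The optimal Bayes classification rule can be written as follows: assign x to class 1 if $\pi_1 f_1(x) > \pi_0 f_0(x)$, or equivalently if $\delta_1(x) > \delta_0(x)$, and to class 0 otherwise, where $\pi_1$ and $\pi_0$ are the prior class probabilities, and $\delta_c(x)$ is a linear function which projects the p-dimensional vector X onto one dimension with maximum class separability: $\delta_c(x) = x^T \Sigma^{-1} \mu_c - \frac{1}{2} \mu_c^T \Sigma^{-1} \mu_c + \log \pi_c$. [16] The unknown parameters $\pi_c$, $\mu_1$, $\mu_0$, and $\Sigma$ are estimated from the training set: $\hat{\pi}_c = N_c / N$, $\hat{\mu}_c = \frac{1}{N_c} \sum_{i: y_i = c} x_i$, $\hat{\Sigma} = \frac{1}{N-2} \sum_{c \in \{0,1\}} \sum_{i: y_i = c} (x_i - \hat{\mu}_c)(x_i - \hat{\mu}_c)^T$, [17] where N is the total number of observations in the training set and $N_1$ and $N_0$ are the numbers of observations in class 1 and class 0, respectively. The index c denotes the class of Y; here, Y has two classes.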
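To make the classification rule concrete, the following minimal sketch implements the textbook LDA discriminant (James et al. 2013) for the simplified case of a single predictor (p = 1), where the common covariance matrix reduces to a pooled variance. The function names and the toy returns are hypothetical; the thesis itself uses 89 predictors.

```python
import math


def fit_lda_1d(x, y):
    """Estimate class priors, class means, and the pooled variance for p = 1."""
    n = len(x)
    params = {}
    pooled = 0.0
    for c in (0, 1):
        xc = [xi for xi, yi in zip(x, y) if yi == c]
        mean = sum(xc) / len(xc)
        pooled += sum((xi - mean) ** 2 for xi in xc)
        params[c] = {"prior": len(xc) / n, "mean": mean}
    variance = pooled / (n - 2)  # two classes, hence N - 2 in the denominator
    return params, variance


def discriminant(x0, c, params, variance):
    """Linear discriminant score of class c at the point x0."""
    mu, prior = params[c]["mean"], params[c]["prior"]
    return x0 * mu / variance - mu ** 2 / (2 * variance) + math.log(prior)


def predict_lda_1d(x0, params, variance):
    """Assign the class with the larger discriminant score."""
    up = discriminant(x0, 1, params, variance)
    down = discriminant(x0, 0, params, variance)
    return 1 if up > down else 0


# Toy previous-day returns and next-day directions (invented values).
toy_x = [-0.03, -0.01, -0.02, 0.02, 0.01, 0.03]
toy_y = [0, 0, 0, 1, 1, 1]
lda_params, lda_variance = fit_lda_1d(toy_x, toy_y)
```

With equal priors and class means of -0.02 and 0.02, the decision boundary in this toy example lies at zero, so `predict_lda_1d(0.015, lda_params, lda_variance)` returns 1.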

Quadratic Discriminant Analysis
Unlike LDA, QDA does not assume that the covariance is homogeneous; it allows the two groups within the data to have different covariance matrices, $\Sigma_1$ and $\Sigma_0$. The conditional densities of X given Y=1 and given Y=0 are denoted as $f_1(x)=f(X=x|Y=1)$ and $f_0(x)=f(X=x|Y=0)$, respectively, and can be computed as follows: $f_c(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_c|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_c)^T \Sigma_c^{-1} (x-\mu_c)\right)$, $c \in \{0, 1\}$. [14] In addition, in QDA the decision boundary is quadratic.
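For reference, the quadratic decision boundary follows from the standard textbook form of the QDA discriminant function (see, e.g., James et al. 2013). It is stated here as an assumption about the form intended, using class priors $\pi_c$, class means $\mu_c$, and class-specific covariance matrices $\Sigma_c$:

```latex
\delta_c(x) = -\frac{1}{2}\log|\Sigma_c|
              - \frac{1}{2}(x-\mu_c)^{T}\Sigma_c^{-1}(x-\mu_c)
              + \log \pi_c , \qquad c \in \{0, 1\}
```

An observation is assigned to the class with the larger $\delta_c(x)$. Because $\Sigma_c$ differs between the classes, the term $x^{T}\Sigma_c^{-1}x$ does not cancel when the two scores are compared, which makes the boundary quadratic in x.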

K Nearest Neighbors
K-Nearest Neighbors (KNN) classification is a distance-based, non-parametric classification method that does not require the feature vector X to follow any particular distribution. The steps of the KNN algorithm are: (1) label all observations in the training set and compute the distances from each observation in the test set to each observation in the training set, in a pair-wise fashion; (2) sort the distances and find the k closest neighbors for each point in the test set; (3) use a majority vote to label the points in the test set using the labels of the k closest neighbors in the training set.
The Euclidean distance is utilized here to calculate the distances between all objects in the test set and all objects in the training set. The procedure for computing Euclidean distances in KNN differs somewhat from building a distance matrix in hierarchical clustering. As for any classification algorithm, for KNN the data is divided into two sets, a training set and a test set. Let p be a vector of price returns in the test set, $p = (p_1, p_2, \ldots, p_p)$, and let q be a vector of price returns in the training set, $q = (q_1, q_2, \ldots, q_p)$. Every vector in the training set is labeled. In order to correctly label the vectors in the test set, the Euclidean distance is computed using the formula [12]: $d(p, q) = d(q, p) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_p - q_p)^2}$, where the distance between two vectors p and q is the length of the line segment connecting these two points. Here, the training dataset has 1388 vectors and each vector has 89 numbers; the test dataset has 232 vectors. The Euclidean space has 89 dimensions.
The nearest vectors $X_i$ are obtained by sorting the computed distance values.
If there are no ties, the classification (label) for each row of the test set is decided by a majority vote among the K nearest neighbors; if there are ties for the K-th nearest vector, all tied candidates are included in the vote. For example, if more than one-half of the k neighbors around vector p have the label 'Upward' movement (Y = 1), p will be assigned the label 'Upward'; otherwise, p will be assigned the label 'Downward' movement (Y = 0).
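The three KNN steps, including the tie rule, can be sketched as follows. The function names and the toy training vectors are hypothetical and only illustrate the procedure; the thesis data has 89 dimensions rather than two.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_x, train_y, point, k):
    """Label one test point by majority vote among its k nearest neighbors."""
    # Steps 1 and 2: compute all distances to the training set and sort them.
    ranked = sorted((euclidean(point, x), label) for x, label in zip(train_x, train_y))
    # Tie rule: every candidate as close as the k-th neighbor takes part in the vote.
    cutoff = ranked[k - 1][0]
    votes = [label for dist, label in ranked if dist <= cutoff]
    # Step 3: 'Upward' (1) only if more than one-half of the voters are 'Upward'.
    return 1 if sum(votes) > len(votes) / 2 else 0


# Toy training vectors with two returns each (invented values).
knn_train_x = [[0.01, 0.02], [0.02, 0.01], [-0.01, -0.02], [-0.02, -0.01], [0.015, 0.015]]
knn_train_y = [1, 1, 0, 0, 1]
upward_label = knn_predict(knn_train_x, knn_train_y, [0.012, 0.018], k=3)
downward_label = knn_predict(knn_train_x, knn_train_y, [-0.015, -0.015], k=3)
```

In the second call, the three nearest neighbors carry the labels 0, 0, and 1, so fewer than one-half of the votes are 'Upward' and the point is labeled 'Downward'.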

Logistic Regression
The logistic regression model is another classification method. It can be considered a special case of the generalized linear model with a binary dependent class variable Y, and it is used to predict the probability that Y belongs to a particular class. In this thesis, the variable Y represents the movement direction of a price return; Y equals 1 ('Upward' movement) if the corresponding daily return is positive, and Y equals 0 ('Downward' movement) otherwise.
Formally, the logistic regression model can be written as follows: $p(X) = \frac{e^{X\beta}}{1 + e^{X\beta}}$. [20] In Equation [20], p(X) is an estimate of how likely the dependent variable Y is to be equal to one ('Upward' movement); X is the predictor matrix, and $\beta$ is the vector of corresponding coefficients. If the data have N trading days and p predictors, X will be an N×(p+1) matrix, where $X = (1, X_1, X_2, X_3, X_4, \ldots, X_p)$, and $\beta$ is a (p+1)-vector of coefficients. The probability value, p(X), lies in the range between 0 and 1.
Generally, a threshold of 0.5 is applied to determine to which class an object belongs; that is, if p(X) > 0.5, Y will be equal to one and the 'Upward' movement label will be assigned; otherwise, Y will be equal to zero and the 'Downward' movement label will be assigned.
The vector of coefficients, $\beta$, in the logistic function is estimated by maximizing the likelihood function: $L(\beta) = \prod_{i: Y_i = 1} p(X_i) \prod_{i: Y_i = 0} (1 - p(X_i))$, where $X_i$ is the i-th observation in the training data and $Y_i$ denotes the corresponding observed class (Y=1 or Y=0). The probability of being 'Upward' is $p(X_i)$, and the probability of being 'Downward' is $1 - p(X_i)$.
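A minimal sketch of the logistic model, the 0.5 threshold, and likelihood maximization is given below. It assumes plain gradient ascent on the log-likelihood with a hypothetical learning rate and number of steps; statistical software typically uses a different optimizer, and the toy data are invented.

```python
import math


def predict_proba(beta, x):
    """p(X) for one observation; beta[0] is the intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(beta, x, threshold=0.5):
    """Assign 'Upward' (1) when the estimated probability exceeds the threshold."""
    return 1 if predict_proba(beta, x) > threshold else 0


def log_likelihood(beta, xs, ys):
    """Logarithm of the likelihood of the observed classes."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(beta, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, rate=0.1, steps=500):
    """Maximize the log-likelihood by gradient ascent (illustrative settings)."""
    beta = [0.0] * (len(xs[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            err = y - predict_proba(beta, x)
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        beta = [b + rate * g for b, g in zip(beta, grad)]
    return beta


# Toy data with one predictor; the classes overlap, so the estimate stays finite.
logit_x = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
logit_y = [0, 0, 1, 0, 1, 1]
beta_hat = fit_logistic(logit_x, logit_y)
```

The fitted coefficients should give a higher log-likelihood than the all-zero starting vector, for which every observation has probability 0.5.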

Evaluation of Model Performance
Since multiple models have been applied to the prediction of price return movements, their individual performance is of interest in this study. To evaluate the performance and validate the results of prediction, the confusion matrices and accuracy rates are computed.
Table 4 The confusion table layout
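The computation of a confusion matrix and the corresponding accuracy rate can be sketched as follows. The cell names TP, TN, FP, and FN are the conventional ones and are an assumption here, since the layout of Table 4 is not reproduced in this text; the label vectors are invented.

```python
def confusion_matrix(actual, predicted):
    """Count the four outcomes for labels coded 1 (Upward) and 0 (Downward)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # Upward correctly predicted
        elif a == 0 and p == 0:
            counts["TN"] += 1   # Downward correctly predicted
        elif a == 0 and p == 1:
            counts["FP"] += 1   # Downward predicted as Upward
        else:
            counts["FN"] += 1   # Upward predicted as Downward
    return counts


def accuracy(counts):
    """Share of correctly classified observations."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Toy actual and predicted movement directions (invented values).
toy_actual = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predicted = [1, 0, 0, 1, 1, 0, 1, 0]
toy_counts = confusion_matrix(toy_actual, toy_predicted)
toy_accuracy = accuracy(toy_counts)
```

In this toy example six of the eight directions are predicted correctly, so the accuracy rate is 0.75.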

Confusion matrix is a table that is broadly
The accuracy rate can be obtained using the following equation: accuracy rate = (number of correctly classified observations) / (total number of observations).
In this chapter, selected methods from two areas of statistics, Statistical Analysis of Network Data and Classification Analysis, were introduced. The numerical results of the analysis and the conclusions will be presented in the following Chapter 6 and Chapter 7.
Given the statistical methods introduced in the previous chapter, this chapter utilizes them to investigate the structure of the correlation networks inferred from price returns data (previously described in Chapter 2) and the predictive power of several classification models for corresponding price return movements. At the end of the chapter, the relationship between the accuracy of classification and the network node properties is explored.
Specifically, this chapter is organized as follows. Section 5.1 presents the correlation network inference and the analysis of its structure. The following network properties are computed: the node degree distribution, density, clustering coefficient, and betweenness centrality. Section 5.2 describes the clustering analysis and the identification of groups of companies that exhibit the most similar stock market price return trends. Furthermore, Section 5.3 evaluates the annual changes in associations between companies in the post-crisis time period from 2009 to 2015. Section 5.4 outlines and evaluates the predictive power of four selected classification methods.
Finally, Section 5.5 addresses the relationship between the accuracy of classification on stock return movements and network node properties.

Correlation-Based Network
To detect the hidden associations between different companies in the selected dataset, a correlation network G is inferred from stock market price data (see Chapter 2 for the data description). In the graph G, nodes represent companies and edges represent sufficient correlations between vectors of stock market price returns computed over the period of seven years. In total, there are 89 companies represented as nodes and 3961 possible edges in the network. The existence of a significant association between each pair of companies has been verified by testing a set of hypotheses on the pairwise correlations. For illustration, the left panel of Figure 7 presents the network with all 3961 potential edges, and the right panel of Figure 7 depicts the network with the 2809 significant edges only. One can see that the graph on the left panel is denser than the graph on the right panel; the network densities of the two graphs are 1 and 0.709, respectively. However, the associations between different vertices are still not clear: it is difficult to draw any meaningful conclusion from a complex network graph with so many overlapping edges. One research goal of this thesis is to use network analysis methods to evaluate the associations between different companies and industries, and overlapping edges contribute little to achieving this goal.

Threshold Network
As explained before, a substantial number of associations between pairs of corporations are significant at the 5% level. The inferred network graph is extremely dense if one depicts all significant edges in the graph. Alternatively, to clarify the most important associations between different companies, the threshold network, described in Chapter 3, can be utilized.
The choice of the threshold value is of particular importance in this analysis. If the threshold is set too high, only a few extremely influential associations will be present, while a large number of moderately strong correlated edges will be absent; if the threshold is set too low, too many 'less important' edges will be present in the network. Here, in order to determine a suitable threshold, three important network characteristics are utilized: graph density, clustering coefficient, and betweenness centrality.

• Threshold Value Selection
The left graph in Figure 8 demonstrates that the graph density decreases as the threshold value for correlation increases. The weak correlations are cut off as the threshold changes from 0 to 0.4; at the same time, the graph density drops from 0.962 to 0.097. A graph density very close to zero implies that the network graph is nearly an empty graph, with the number of edges close to 0.
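The effect of the threshold on the graph density can be illustrated with a minimal sketch. The density formula is the standard one for a simple undirected graph and is an assumption here, as is the 4-by-4 correlation matrix, which is invented for the illustration; the helper names are hypothetical.

```python
def threshold_edges(corr, theta):
    """Keep an edge between i and j when their correlation exceeds theta."""
    n = len(corr)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if corr[i][j] > theta]


def graph_density(n_vertices, n_edges):
    """Realized edges divided by possible edges in a simple undirected graph."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))


# Toy correlation matrix for four companies (invented values).
toy_corr = [
    [1.00, 0.60, 0.30, 0.05],
    [0.60, 1.00, 0.45, 0.10],
    [0.30, 0.45, 1.00, 0.20],
    [0.05, 0.10, 0.20, 1.00],
]
densities = [
    graph_density(len(toy_corr), len(threshold_edges(toy_corr, theta)))
    for theta in (0.0, 0.25, 0.4)
]
```

As the threshold rises from 0 to 0.25 to 0.4, the toy graph keeps six, three, and two of its six possible edges, so the density falls monotonically.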
The average betweenness centrality and the overall clustering coefficient are shown in the right panel of Figure 8. One can see that the clustering coefficient decreases until it reaches its first low point at threshold θ=0.06. Then the clustering coefficient begins to increase and climbs to its first peak at θ=0.179. It is reasonable to conclude that the edges with insufficient correlation are very unlikely to form triangles and are cut off first. At the same time, edges with stronger correlation values, which tend to form triangles with their connected neighbors (causing the clustering coefficient to increase), remain in the network. If the threshold θ is set to even larger values, the inner-cluster edges are removed as well (causing the clustering coefficient to decrease).
When the threshold θ equals 0.415, the network graph has a low clustering coefficient but a high average betweenness centrality; this result indicates that a few vertices in the graph have extremely high centrality compared with the other vertices.
When the threshold equals 0.415, the corresponding graph density is 0.085, meaning that 8.5% of the potential undirected edges are actually present. The top left panel of Figure 9 shows a network graph created at threshold θ=0.06.
This inferred graph is very dense compared to the other three graphs. One can easily deduce that the threshold value of 0.06 is too small and cannot produce a meaningful result. The top right panel of Figure 9 displays a still very dense graph (inferred at threshold θ=0.17) with one isolated vertex, 22 (Biotest Pharmaceuticals Corporation).
When θ=0.555, most of the vertices are isolated, but the majority of Chinese banks and US banks are grouped together with other banks in the same country. The bottom left panel of Figure 9 shows a network graph created at threshold θ=0.415 with 89 nodes and 335 edges. This picture exhibits the associations among the 89 research corporations more clearly, and hence will be used in the further analysis.

• Characteristics of the Threshold Network
To understand the significance of the structural properties of the created threshold network, three major network characteristics, vertex degree, betweenness, and local clustering (vertex clustering), are evaluated in this section.
In the left panel of Figure 10, one can see that there are 3 distinct groups in the degree distribution: (1) with vertex degree less than 17, (2) with vertex degree in the range from 19 to 22, and (3) with vertex degree greater than 25. For example, HSBC and TD Bank are two US banks from the third group with degree 28, implying that these two banks have strong associations (correlation coefficient greater than 0.415) with 28 other corporations in the network.
The betweenness centrality distribution, illustrated in the middle panel of Figure 10, shows that more than 50 corporations have betweenness equal to zero, which means that these companies do not lie on the shortest paths between other pairs of vertices. TD Bank and HSBC have the largest betweenness values, 250 and 177, respectively.
The distribution of the local clustering is displayed in the right panel of Figure 10.
The local clustering is a measure that describes how likely the neighbors of the target node are to form a cluster (that is, to be neighbors of each other). There are 16 vertices with local clustering equal to 1, which implies that the neighbors connected to these nodes are also connected with each other within the neighborhood. The clustering coefficients of TD Bank and HSBC are equal to 0.339 and 0.449, respectively.
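The vertex degree and the local clustering coefficient can be computed as in the following sketch. The toy graph and helper names are hypothetical, and returning 0.0 for vertices with fewer than two neighbors is a convention chosen for this illustration (some software reports a missing value instead).

```python
def adjacency(edges):
    """Build a neighbor set for every vertex of an undirected graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def degree(neighbors, v):
    """Number of vertices directly connected to v."""
    return len(neighbors[v])


def local_clustering(neighbors, v):
    """Share of pairs of v's neighbors that are themselves connected."""
    nbrs = sorted(neighbors[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention for this sketch: undefined cases are reported as 0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbors[nbrs[i]]
    )
    return 2 * links / (k * (k - 1))


# Toy graph: a triangle A-B-C plus a pendant vertex D attached to A.
toy_neighbors = adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")])
```

In this toy graph, vertex B has local clustering 1 because its two neighbors are connected, whereas vertex A, with the largest degree, has local clustering 1/3.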

• Assessing Significance of Network Characteristics
In order to test the significance of the network characteristics, two random graph simulation techniques are applied in this section. Specifically, the results of 1000 classical random graph (CRG) and 1000 generalized random graph (GRG) simulations are presented here. In the classical random graphs, each graph has the same number of vertices and the same number of edges as the original threshold network G: 89 vertices and 335 edges. The difference between CRG and GRG is that the latter is constrained to a prescribed degree sequence, while the former does not have this restriction. The results of these two simulation methods are shown in Figure 11 and Figure 12. Figure 11 depicts the distributions of the characteristics of the classical random graphs. The two histograms show that both characteristics, namely clustering and betweenness, follow a bell-shaped distribution with means equal to 0.085 and 62.65, respectively. The clustering and betweenness distributions of the simulated generalized random graphs are shown in Figure 12, with means equal to 0.208 and 37.88, respectively.
Figure 11 The distribution of classical random graphs' characteristics
Figure 12 The distribution of generalized random graphs' characteristics
The mean of clustering coefficients in Figure 11 and Figure …
The statistics and the corresponding p-values computed using the two random graph simulation methods are presented in Table 5.
Table 5 Significance test results of Network characteristics
A two-tailed test is used here to verify the significance of the network characteristics, and one can observe that the obtained p-values of degree and density are both equal to 1 for both classical random graphs and generalized random graphs. This result is expected due to the nature of the simulation process of the random graphs. In both cases, the number of edges (335) and the number of vertices (89) are fixed (the same as in the original graph G), and the density is determined entirely by these two numbers.
Thus, all 1000 simulated random graphs, by construction, have the same graph density as graph G.
It is worth mentioning that the clustering coefficient is computed by treating the network as a whole, whereas the average betweenness is obtained by averaging the 89 vertex betweenness values. The p-values of the clustering coefficient and the average betweenness are both zero, suggesting that the constructed network graph with threshold equal to 0.415 is significantly different from the classical and generalized random graphs and captures the important associations in the dataset.
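The simulation-based test can be sketched for the classical random graph case. This is a minimal illustration under stated assumptions: the observed graph is a small invented example rather than the 89-vertex network, only 200 graphs are simulated, the two-sided empirical p-value is one common convention (the thesis does not spell out its exact definition), and all helper names are hypothetical.

```python
import random
from itertools import combinations


def transitivity(n, edges):
    """Global clustering coefficient: closed triples over connected triples."""
    neighbors = [set() for _ in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in neighbors)
    closed = sum(
        1
        for v in range(n)
        for a, b in combinations(sorted(neighbors[v]), 2)
        if b in neighbors[a]
    )
    return closed / triples if triples else 0.0


def graph_density(n, edges):
    """Density of a simple undirected graph (standard definition)."""
    return 2 * len(edges) / (n * (n - 1))


def classical_random_graph(n, m, rng):
    """Draw m distinct edges uniformly from all possible vertex pairs."""
    return rng.sample(list(combinations(range(n), 2)), m)


def two_sided_p_value(observed, simulated):
    """Empirical two-sided p-value (an assumed convention)."""
    upper = sum(s >= observed for s in simulated) / len(simulated)
    lower = sum(s <= observed for s in simulated) / len(simulated)
    return min(1.0, 2 * min(upper, lower))


# Invented observed graph: two 4-cliques joined by a single edge.
n_vertices = 8
observed_edges = (
    list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
)

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
simulated = [
    classical_random_graph(n_vertices, len(observed_edges), rng) for _ in range(200)
]

observed_clustering = transitivity(n_vertices, observed_edges)
p_clustering = two_sided_p_value(
    observed_clustering, [transitivity(n_vertices, g) for g in simulated]
)
p_density = two_sided_p_value(
    graph_density(n_vertices, observed_edges),
    [graph_density(n_vertices, g) for g in simulated],
)
```

Because every simulated graph has the same numbers of vertices and edges as the observed graph, the density never varies and its p-value equals 1, which mirrors the reasoning in the text.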
In order to make the network graph more visually understandable, the decorated network will be presented in the next section.

Visualization of Network
Decorating the graph layout can be very helpful in visualizing a large network. Here, three different vertex shapes, namely circle, square, and triangle, are used to denote the three countries: the US, Germany, and China, respectively. Four different vertex colors, red, green, blue, and purple, are utilized to represent the four industrial sectors: Banking, Manufacturing, Telecommunication, and Pharmaceutical, respectively. Comparing the decorated graph in Figure 13 to the non-decorated graph in Figure 9, one can observe that the decorated graph is more readable and gives a better overall impression of the relationships between corporations across different countries and sectors. For example, Figure 13 shows that Chinese companies (triangles) are close to each other and do not have strong associations with US or German companies. The US corporations (circles) and German companies (squares) are in one connected group, and the US banks are located in the center of the group. In addition, three Chinese telecommunication companies (blue triangles) trading in the US are grouped with the majority of US and German companies instead of with Chinese companies.

Network Community Detection
One of the research goals of this thesis is to identify groups of companies that exhibit the most similar stock market trends. Here, the hierarchical clustering method and the reduced network technique are utilized to detect the associations between corporations and the relationships within and between different industries, respectively.
In order to present the graph in a readable way, company IDs are created from the country initial and the sector initial. For example, the three research countries, Germany, US, and China, are shortened to G, U, and C, respectively. Similarly, the four research industrial sectors, Banking, Pharmaceuticals, Manufacturing, and Telecommunications, are shortened to B, P, M, and T.
In addition, companies in the same country and sector are labeled using these initials followed by a number. For example, there are 3 Chinese telecommunication companies; thus, these three are labeled as 'CT1', 'CT2', 'CT3'.
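The labeling scheme can be written down as a short sketch. The company names below are placeholders and the function name `assign_ids` is hypothetical; only the initials and the numbering rule follow the description above.

```python
from collections import defaultdict

COUNTRY_INITIAL = {"Germany": "G", "US": "U", "China": "C"}
SECTOR_INITIAL = {
    "Banking": "B",
    "Pharmaceuticals": "P",
    "Manufacturing": "M",
    "Telecommunications": "T",
}


def assign_ids(companies):
    """Map each (name, country, sector) record to an ID such as 'CT1'."""
    counters = defaultdict(int)
    ids = {}
    for name, country, sector in companies:
        prefix = COUNTRY_INITIAL[country] + SECTOR_INITIAL[sector]
        counters[prefix] += 1  # running number within the country and sector
        ids[name] = f"{prefix}{counters[prefix]}"
    return ids


# Placeholder company names (not the companies of the thesis).
company_ids = assign_ids([
    ("Telecom A", "China", "Telecommunications"),
    ("Telecom B", "China", "Telecommunications"),
    ("Telecom C", "China", "Telecommunications"),
    ("Bank A", "US", "Banking"),
    ("Pharma A", "Germany", "Pharmaceuticals"),
])
```

The three Chinese telecommunication placeholders receive the labels 'CT1', 'CT2', and 'CT3', in the order in which they are listed.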
The hierarchical clustering dendrogram is shown in Figure 14 with 22 clusters. If one cuts the dendrogram tree at height 1.48, the graph will be divided into five main communities, and each community will have more than 4 components. In the Figure …
In this section, the clustering method detected the companies that exhibit the most similar stock market trends. In the next section, the reduced network is used to evaluate associations between different industries.

Dynamic Analysis Results
To explore the annual changes in company/country/industry associations over the past seven years (from 2009 to 2015), the annual dynamic association networks with the spanning trees are created. From Table 6 and Figure 16, one could find that before …
The network characteristics are summarized in Table 6 and include graph density, clustering coefficient, and average betweenness. The network graph became increasingly denser and started to form more connected clusters in the early recovery period, 2-3 years after the crisis (2009 to 2011).
In order to detect the annual changes in company/country/industry associations for some specific companies and sectors, the annual dynamic spanning trees are constructed and presented in Figure 16. As introduced in Chapter 4, the spanning tree is a subgraph of the network graph that simply connects all originally connected nodes and omits the loops. One can see that most Chinese companies maintained their connections and did not change them much in the time period from 2009 to 2014.
However, the associations within Chinese companies became denser in 2015 due to the Chinese stock market crash. The US pharmaceutical companies had weak associations during the crisis, and stronger inner relationships after the crisis.
Discovering the associations between different companies statically and dynamically from 2009 to 2015 is one of the research goals of this thesis, which has been achieved in this section. On the other hand, exploring the relationship between the accuracy of classification models and network node properties is the last research goal of this thesis. In the next section, four classification models will be used to predict the future stock movement and discover the associations between accuracy rate and node properties graphically.

The Performance of Four Classification Models
Forecasting the future return movements of different companies has been particularly popular in the field of financial data analysis. One of the goals of this thesis is to discover whether there is a relationship between the accuracy of classification of stock return movements and network node properties. The previous-day returns, $(X_{1,t-1}, X_{2,t-1}, X_{3,t-1}, X_{4,t-1}, \ldots, X_{p,t-1})$, are utilized to classify the next-day stock movement direction $Y_t$ of a target company.

• Model performance across different countries
A typical approach to assessing model performance is to separate the data into two parts, a training set and a test set. The training data is used to build a model and estimate the related parameters; the test data is used to test the performance of the developed model. To evaluate the model performance and select the better-performing classification model, the true accuracy rates are used in this section.
The true accuracy rates of the four classification models across the three countries are listed in the following four tables, Table 7 to Table 10.
Table 7 The prediction accuracy of LDA model across 3 different countries
The QDA model performance across three different countries is displayed in …
Table 9 The prediction accuracy of KNN model across 3 different countries
Table 9 shows the average accuracy of the KNN model to be 0.51. The accuracy rates of the KNN predictions for Chinese companies are slightly higher than those for the other two countries, but the differences are still minor.
The classification accuracy results for logistic regression are listed in Table 10.
One can find that the average accuracy rates of German and Chinese companies are around 0.53, greater than the accuracy rates achieved by the three previously described classification models. The mean accuracy rate of the US companies is only slightly higher than a random guess.
Table 10 The prediction accuracy of Logistic regression model across different countries
To compare performance across the three countries for the four outlined models, four parallel box plots are created below. The first three panels of Figure 17 compare the performance of the models in each country separately. The last panel of Figure 17 (bottom right) is created by combining all companies in all three countries and compares the accuracy of the four classification models. The top left panel of Figure 17 shows that for German companies the QDA model has the poorest performance, while the LDA and Logistic regression models predict better than the other two models. In the US, all four model medians are around 0.5. The LDA and Logistic regression models perform better than the other two models for the selected Chinese corporations.

SUMMARIZATION
The last chart suggests that the performance of the LDA and Logistic regression models is similar, but the logistic regression model has a slightly higher standard deviation than LDA.
Bottom right panel: compare model performance in general

Associations Between Network Features and Regression Model Performance
The accuracy rates of the four classification models were compared in the previous section, and the LDA and Logistic regression models performed better than the other two selected models. The aim of this section is to explore the association between the Logistic regression classification accuracy and the threshold network properties.
The main reason for using the Logistic regression model instead of LDA is as follows: the LDA model assumes that the predictors follow a multivariate normal distribution, whereas logistic regression does not require normally distributed predictors. The preliminary results showed that the returns in the research data were not normally distributed. Thus, the results of LDA are suspect, and this research uses logistic regression as a more reliable model for the further analysis.
In this section, two graphical methods, the scatter plot and the level plot, are applied to detect the relationship between the logistic regression model and node properties. The scatter plots are displayed in Figure 18, and the level plots are presented in Figure 19. It is worth noting that the nodes in the network are the research companies and that the logistic regression classification accuracy rates are calculated for each company separately.
From Figure 18, one can clearly see that vertex clustering has a positive relationship with the classification accuracy rate. The maximum accuracy is equal to 0.6 when the vertex clustering equals 1, that is, when the neighbors of the node are also connected to each other and form a clique. When the vertex degree is between 8 and 13, the accuracy rates of most companies are greater than 0.5 and show small variability. The betweenness centrality of most vertices is less than 100, and 9 out of 89 corporations are outliers with very high betweenness. The accuracy rates of these nine organizations vary over a large range.
Figure 18 The scatter plots between classification accuracy rates and threshold network node properties.
The scatter plot is a tool for detecting relationships (if present) between two variables. Here, the level plot is applied to detect the association among three variables.
The main idea of the level plot is that the X-axis and Y-axis are divided into cells.
If many observations are located in the same cell, this cell is divided into smaller cells. Thus, a large cell does not indicate that most cases fall there; on the contrary, it indicates that observations are rare in that region. In addition, the shade of the color represents the value of the accuracy rate.
The level plot in Figure 19 describes the relationships between the network node properties and the model accuracy rate, where the X-axis and Y-axis are the network characteristics, and different shades of green and red color denote whether the accuracy rate is greater than 0.5.
Figure 19 The level plots of the relationship between classification accuracy rate and network node properties.
From Figure 19, one can see that the cells are colored bright blue when the vertex has a clustering coefficient greater than 0.7 and a degree greater than 7. One can also note from this figure that vertex degree and clustering have a negative relationship: as the vertex degree decreases, the clustering coefficient increases. An 'important' company with an extremely large degree rarely forms a small clique. The top right panel of Figure 19 shows that the maximum company correlation coefficient is positively correlated with vertex degree. There is no clear pattern in the bottom left panel of Figure 19.
Hence, based on the observed results from the scatter plots and the level plots, one can conclude that the movement direction of a company's stock price return is easier to classify with the logistic regression model than that of other companies if the corresponding vertex satisfies the following conditions: (1) the vertex is more likely to be a follower than a leader, with eight to thirteen neighbors in the association network, and (2) the vertex tends to form small cliques, in which its connected neighbors are also close and connected to each other.

CHAPTER 6 CONCLUSION
In order to reach the five goals of this thesis, we used the 89 companies daily stock price data collected from a publicly available source, Yahoo Finance, for a time period from 2009 to 2015. To reduce the variance of the data, the close price data was converted into daily returns, and then used to compute a correlation matrix and create a corresponding association network.
After obtaining the association network, the community detection method, agglomerative hierarchical clustering, was applied in this study to identify companies that exhibit the most similar return trends. The results suggested that the companies that traded in the same stock market and/or belonged to the same industrial sectors had significant associations. Specifically, the Chinese companies had higher inner correlations in banking and telecommunication sectors; while the US and the German companies had stronger associations in banking and auto-manufacturing sectors.
In addition to detecting static associations between companies over the research period from 2009 to 2015, annual dynamic networks were created to assess yearly changes in associations between the selected companies during a special financial period, i.e., the 2009 European Debt Crisis. The results showed that the associations among companies became stronger, and more companies tended to be grouped together in the network, during the European Debt Crisis and in the early recovery periods.
Another focus of this thesis was on discovering the relationship between classification accuracy rates and the network node properties. Four classification models, namely Linear Discriminant Analysis, Quadratic Discriminant Analysis, k-Nearest Neighbors, and Logistic Regression, were created and evaluated. The results revealed the superior performance of the logistic regression method compared to the other three classification methods, particularly for the Chinese companies. Thus, logistic regression was subsequently used to examine the relationship between model accuracy rates and network node properties. Two graphical tools, scatter plots and level plots, were applied for this purpose. The results illustrated that companies that acted as followers and belonged to medium-size clusters with eight to thirteen neighbors in the association network were easier to classify than the other companies.
Logistic regression performed better than the other three classification methods (LDA, QDA, and KNN), especially for Chinese corporations. However, it is worth mentioning that the accuracy of the logistic regression model for Chinese companies has a mean of 0.530 and a standard deviation of 0.033. If one assumes that the model accuracy follows a normal distribution, then the 95% confidence interval for the true accuracy ranges from 0.478 to 0.598. Because this interval contains 0.5, it is difficult to conclude that logistic regression performs significantly better than a random guess.
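The interval argument can be checked with a short calculation. The sketch below assumes the interval is formed as the mean plus or minus the 0.975 standard normal quantile (about 1.96) times the standard deviation; the thesis does not state the exact formula, and the variable names are illustrative only.

```python
from statistics import NormalDist

mean_accuracy = 0.530  # reported mean accuracy for Chinese companies
sd_accuracy = 0.033    # reported standard deviation

# Two-sided 95% normal quantile (about 1.96); assumed interval construction.
z = NormalDist().inv_cdf(0.975)
lower = mean_accuracy - z * sd_accuracy
upper = mean_accuracy + z * sd_accuracy

# A random guess corresponds to an accuracy of 0.5.
includes_chance = lower <= 0.5 <= upper
```

Under this assumption the endpoints come out near 0.465 and 0.595, slightly different from the range quoted in the text, which may reflect a different multiplier or rounding. In either case the value 0.5 lies inside the interval, so the conclusion is unchanged.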
Thus, to improve the classification accuracy, in a future study I aim to use the data fusion classification method proposed by Dr. Natallia Katenka to predict future price movements more accurately. This method will leverage each vertex's own historical data and the data of its related neighbors as predictors to forecast future return directions.