A Statistical Analysis and Modeling of Information Diffusion Across Online Social Networks

In recent years, online social networks have become a very popular and effective forum for information exchange. These large, highly interconnected networks span the globe and can disseminate information in a fraction of the time it would take other communication networks. Given the myriad ways in which online social networks can be used, creating accurate, predictive models for the spread of information across them is very valuable. However, modeling processes on large networks is a difficult task. It is computationally expensive, and usually prohibitive, to model a process on the entirety of a very large network. Given these complexities, creating smaller network graphs that are characteristically similar to the original network graphs enables researchers to run models that are otherwise not feasible. This project aims to create prototypic networks and model the spread of information across them using network-based epidemiological models to better understand how information spreads across an online social network. More specifically, the focus will be on the spread of the news of a scientific discovery, i.e., the Higgs boson particle, on Twitter.

Normalized mean square error (NMSE) of sample graph local density distributions as a function of all possible density values, i.e., zero to one, denoted as k. NMSE is computed from the squared difference between the proportion of nodes in a sampled graph with a density less than or equal to k and the proportion of nodes in the original graph with a density less than or equal to k.

Figure 20. RMSE of SI models as a function of the transmission rate, β, for three graphs generated from sampling techniques. The minimum of each line represents the transmission rate at which an SI model on a particular graph best fits the actual proportion of Twitter users becoming aware of the Higgs boson news (Figure 19). The MHRW RMSE has the smallest minimum value of the three RMSE plots and has the best fit to the actual proportion of users becoming aware over time. MHRW minimum at β = .056.

Figure 21. SI epidemic processes run across three of the sampled graph types. The black line is the actual information diffusion process across the entire Twitter network. This network is used to compute the mean square error for each process by comparing the proportion of informed users at each timestamp. The processes on the Barabasi-Albert and power-law models have a transmission rate of 1. Total number of iterations for epidemic processes run across the Barabasi-Albert and power-law graphs with a β = .4. The number of iterations visualized was extended from 7 to the total number required to completely reach the entire graph to illustrate that the graphs still conform to a shape similar to the MHRW and actual information diffusion curves (a sigmoid curve), but it takes more iterations to complete the process.

Introduction
In recent years, online social networks have become a very popular and effective forum for information exchange. These large networks span the globe and can disseminate information in a fraction of the time it would take other communication networks. Given the myriad ways in which online social networks can be used, creating accurate, predictive models for the spread of information across them is very valuable. Modeling processes on large networks, however, is a difficult task. It is computationally expensive, and usually prohibitive, to model a process on the entirety of a very large network with most of the currently available software packages. This project aims to create prototypic networks and model the spread of information across them using network-based epidemiological models to better understand how information diffuses across an online social network. More specifically, the focus will be on the spread of the news of a scientific discovery, i.e., the Higgs boson, on Twitter.
The discovery of the Higgs boson was one of the greatest achievements in modern science. The elusive nature of the particle and its hypothesized ability to unify the three major forces into what is known as the Grand Unified Theory made its discovery a momentous achievement for the scientific community [1]. The particle and its importance had also reached the general public, where it had been dubbed the God Particle because of its potential to shed light on the framework of the universe [1]. News of such a discovery was of particular interest to both the scientific community and the general public.
A discovery of this magnitude had not been made in the recent past, and had certainly not been made since the explosion of popularity of online social networks.
Given Twitter's role as a global, online news outlet [2], tracking the spread of such momentous news across Twitter offers a very interesting look at how significant events spread across social communities in the era of online social networks.

Overview of Online Social Networks
Online social networks began to form in the 1990s when the Internet started to become available to households and individual users [3]. These networks offered very limited functionality, e.g., forums, chatrooms, and instant messaging.
However, social media websites grew much more popular at the turn of the 21st century when a website named Friendster was launched. Friendster attempted to mimic real-world friendship communities by creating connections between users who had a common interest or considered themselves friends. In the years that followed, sites such as LinkedIn, Myspace, and Facebook adopted the idea of organizing an online social network around communities formed by shared interests and friendship.
The focus of this thesis is on another popular online social media website called Twitter. Twitter was created in 2006 and offered a slightly different social experience than the preceding social networks. Twitter allows users to post short messages, referred to as 'tweets', on their own profile wall. A user's wall is a space on their profile that is used to share various types of tweets. Users can share personal experiences, daily activities, momentous events, etc.
Twitter users form connections with other users in a way that is slightly different from other social networks such as Facebook or LinkedIn. Rather than being friends with another user, a Twitter user has the option of following another user. Following another Twitter user enables one to view all of the postings made to that user's wall. This unidirectional relationship does not guarantee that the user being followed will see any of the content posted to the wall of the person doing the following. Most other social networks instead establish bidirectional connections between users who are considered friends. Bidirectional connections require a mutual agreement of reciprocal information sharing, as opposed to the unidirectional structure that is used on Twitter. This structure of social relationships is atypical when one thinks about social networks because friendship is usually thought of as a reciprocal relationship. So, the idea of sharing information with one's friends must be altered when thinking about information exchange on Twitter. The Twitter wall of each user represents a "news feed", an aggregate of the information posted by all of the users they are following on Twitter. The news feed is where each user interacts with the Twitter community. It is where each user gains access to all of the information being offered by the friends, news agencies, celebrities, etc., whom they are following. Users also have the ability to perform a number of actions on the tweets of others. The three most popular actions are retweeting, mentioning, and replying. Retweeting is the process of reposting another user's message onto one's own wall. This, in turn, offers that person's piece of information to all of the followers of the retweeter.
Mentioning another user in a tweet alerts that user to the information contained in that particular post. Replying is a simple action that allows a user to respond to a given tweet. The replies to a tweet are shown at the bottom of the tweet and resemble web forum discussions.
Until now, networks in this thesis have been described in a more relaxed manner with an emphasis on describing social networks in the real world. However, along with this applied notion, a more formal body of knowledge will also be required to fully understand social networks. The branch of mathematics known as graph theory provides the formal understanding necessary to analyze networks.
Graph theory is an integral part of network analysis because networks are most commonly expressed and studied as mathematical graphs (see Chapter 3 for more details). Generally, networks represent a system of interacting elements, e.g., Twitter users follow each other. A network graph represents a set of predefined nodes (elements) and edges (links).

Introduction to Information Diffusion and Epidemic Modeling
Information diffusion can be regarded as a dynamic process that takes place on a social network and can be thought of as the evolution of time-indexed vertex attributes on a given network [4]. For example, the spread of knowledge or a rumor through a population can be thought of as a dynamic, time indexed process in which the piece of information reaches a certain number of people at a given time period.
The information diffusion models are primarily derived from classical epidemiology. There are, however, differences between the spread of disease via physical contact and the spread of information online.
The spread of the disease via physical contact relies on geographic proximity of different individuals and usually the transmission of the disease is unintentional.
Also, infected individuals may be unaware of their infection. In online social networks, information diffuses from one user to a follower or to individuals within that user's community. Also, information is transmitted intentionally. Physical contact and geographical location have much less bearing on the probability of whether or not information will be transmitted from one user to another. Online social networks span the globe and, as a result, a Twitter user on one continent can potentially become infected with the information of the news story from a Twitter user on a different continent.
An important difference between Twitter and other social networks is unidirectionality, meaning that information can diffuse along a connection in only one direction. For example, Person A may be following Person B on Twitter, so there will be a connection on the network, but this does not mean that Person A can spread information to Person B. This notion of one-way information flow leads to interesting phenomena in epidemic spreads regarding the rates of infection [5].
In what follows in Chapter 3 and Chapter 4, network-based epidemic models similar to those discussed in [6] for computer viruses will be used to model the spread of the news of the Higgs boson discovery on Twitter over time. Classical epidemiological models make the assumption that the population is homogeneous and moving randomly throughout a given space. This assumption, which amounts to modeling across a random network graph, cannot be made when modeling the spread of information across Twitter (for reasons described earlier). Given that, network-based epidemiological models will be used instead of classical epidemiological models. With these network-based models, this thesis aims to accurately predict the spread of information across a social network.
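To make the idea of a network-based epidemic model concrete, the following is a minimal sketch of a discrete-time susceptible-infected (SI) process on a directed graph. It is an illustration under stated assumptions rather than the implementation used in this thesis: the function name `simulate_si`, the `followers` dictionary layout, and the rule that every informed account gets one independent chance per step to inform each uninformed follower with probability `beta` are all choices made here for the example.

```python
import random

def simulate_si(followers, seeds, beta, steps, rng=None):
    """Discrete-time SI process on a directed graph (illustrative sketch).

    followers[u] lists the accounts that follow u, so information
    can flow from u to each account in followers[u], never the reverse.
    Returns the fraction of informed nodes after each step.
    """
    rng = rng or random.Random(0)
    # Collect every node that appears as a key or as a follower.
    nodes = set(followers)
    for targets in followers.values():
        nodes.update(targets)
    informed = set(seeds)
    history = [len(informed) / len(nodes)]
    for _ in range(steps):
        newly_informed = set()
        for u in informed:
            for v in followers.get(u, ()):
                # An informed account passes the news to an uninformed
                # follower with probability beta (assumed update rule).
                if v not in informed and rng.random() < beta:
                    newly_informed.add(v)
        informed |= newly_informed
        history.append(len(informed) / len(nodes))
    return history
```

Because edges are only traversed from followee to follower, seeding the process at an account with no followers leaves the informed fraction unchanged, which is the one-way behavior described above.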

Case Study: News of the Higgs-Boson Discovery on Twitter
In what follows, the structure of the Twitter network graph as well as the data related to the news of the Higgs boson discovery on Twitter will be introduced. The nodes of the Twitter graphs represent individual users that possess accounts on Twitter. The connections, or edges, that comprise the network graph indicate that one account is following another account.
The connections, or edges, that comprise the network graph on Twitter create a unidirectional relationship that allows information to be transmitted in only one direction (e.g., from the followee to the follower). This type of connection implies that Twitter should be represented as a directed graph.
Conversely, online social networks like Facebook have bidirectional relationships between friends because information and communication can flow in both directions (from one friend to another, or vice versa).
Following on Twitter allows one user to track the postings or comments of another user. If account A follows account B, the postings of B will appear on the newsfeed of A. Therefore, if B posts a particular news story or makes a reference to a particular event, it is likely that this information will appear on the newsfeed of account A. Retweeting is an action that can be performed by users who want to post someone else's postings on their own page. If account A enjoyed account B's comment or news story, A can retweet it and have it appear on A's own page.
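The follow and retweet mechanics described above can be sketched as a small directed-graph example. The function names `build_follower_index` and `audience`, and the simplification that a retweet exposes the post to all followers of the retweeter, are assumptions introduced for illustration only.

```python
from collections import defaultdict

def build_follower_index(follow_edges):
    """follow_edges holds (follower, followee) pairs.

    Returns a mapping followee -> set of followers, i.e. the direction
    in which a post travels.
    """
    followers = defaultdict(set)
    for follower, followee in follow_edges:
        followers[followee].add(follower)
    return followers

def audience(followers, author, retweeters=()):
    """Accounts whose newsfeed shows a post by `author`.

    Each account in `retweeters` is assumed to repost the message,
    which exposes it to that account's own followers as well.
    """
    reached = set(followers.get(author, ()))
    for r in retweeters:
        reached |= followers.get(r, set())
    return reached - {author}
```

For example, if A follows B and C follows A, a post by B reaches only A, and reaches C as well once A retweets it. A post by A never reaches B, since B does not follow A.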
Mentioning and replying are other features that are a little more self-explanatory than the previous features. If someone wants to mention another user in a post of theirs, they use the @ sign followed by the other account holder's name, and that account holder will be notified of the posting. Replying is a chat-like feature that allows people to comment on any post they feel deserves a comment.
The social network of followers will serve as the underlying network graph on which the processes of information diffusion will be modeled. Although all three actions can contribute to the information diffusion on Twitter, only retweet and mentioning actions will be modeled as dynamic processes via network-based epidemic models.
The data used in this thesis was originally collected by De Domenico et al. [1] using the Twitter API service and can be found on the Stanford Network Analysis Project website [7]. The dataset contains four large network graphs that contain interactions related to the spread of the news about the Higgs boson: the social, retweet, mention, and reply networks. The social network is the largest one, containing roughly 450,000 nodes with 14,000,000 edges, and represents the following activity of all the users engaged in the spread of this news story. The retweet network, 425,000 nodes and 733,00 edges, represents all the retweets done during the spreading period. The mention and reply networks are significantly smaller and represent chatting done between different account holders.
Information diffusion processes depend on node connectivity and clustering that comprise the main characteristics of the underlying social network structure.
In the beginning of Chapter 4, the following network characteristics will be analyzed: the number of connections each node has to the rest of the graph, the clustering of nodes, the cohesion of nodes, and the correlation between the number of links an element possesses and the number of links its neighbors possess.
The social network of Twitter followers is similar to many other online social networks in that its distribution of connections, or edges, closely follows a power-law distribution [8]. This means that the overwhelming majority of users have between zero and one hundred connections to other users, but a small percentage of users have a very large number of connections, upwards of 50,000. This type of graph will have certain areas of very high clustering around the accounts that have 50,000 followers. The clustering coefficient, a measure of how closely nodes cluster, is an important statistic because rates of information diffusion vary greatly depending on how connected a given node is to the surrounding nodes.
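As an illustration of how clustering around a node can be quantified, the sketch below computes one common form of the local clustering coefficient: the fraction of pairs of a node's neighbors that are themselves connected. This particular definition, the undirected treatment of edges, and the name `local_clustering` are assumptions made for the example and may differ from the measure used in Chapter 4.

```python
from itertools import combinations

def local_clustering(neighbors, v):
    """Fraction of pairs of v's neighbors that are linked to each other.

    neighbors maps each node to the set of nodes it is connected to,
    ignoring edge direction (a simplifying assumption).
    """
    nbrs = neighbors[v] - {v}
    k = len(nbrs)
    if k < 2:
        # With fewer than two neighbors no pair exists; return 0 by convention.
        return 0.0
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return linked / (k * (k - 1) / 2)
```

A value near one indicates that the accounts around a node are densely interconnected, while a value near zero indicates a hub whose neighbors rarely connect to one another.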
The network of Twitter users described previously is very large, both in number of nodes and edges. The size of the network makes it difficult, and often prohibitive, to model processes on it using available software packages. Sampling from this large network to obtain scaled down, but still representative, networks is an important first step taken before simulating information diffusion processes.
Different sampling methods provide different networks and network structure characteristics. The network characteristics of the graphs created from each sampling technique are compared to the characteristics of the original network to determine which method produces the most similar scaled-down, tractable network (see Chapters 3.2 and 4.2 for more information).
The remaining content of the thesis is organized as follows. Chapter 2 is dedicated to a review of the existing literature related to three main areas: quantitative analysis of online social networks, scalability of network graphs, and information diffusion processes. Chapter 3 provides background information and mathematical definitions of network analysis characteristics, network sampling methods, and epidemic models for information diffusion. Main results and findings are summarized in Chapter 4, followed by conclusions in Chapter 5.

CHAPTER 2
Review of Literature

Quantitative Analysis of Online Social Networks
Social scientists have studied the relationships and connections that people forge with one another in a network-based framework since the 1930s [1]. These early social networks refer to a collection of entities that are not interacting on the Internet.
Social networks exist in many forms and include, but are not limited to, friendships among people, co-authorship on journal publications, business alliances between certain companies, and ally networks between nations [1]. All of these social networks possess entities that are linked together, and interactions between entities depend on the links between them. Similarly, online social networks are complex systems in which interactions between entities (e.g., blog owners, Twitter users, etc.) take place on the Internet.
Social networks have been shown to exhibit a nonrandom structure of links between the entities. The nonrandom structure of social networks is typically classified into one of two major categories. The first, small-world networks, have been shown to be more typical of offline social networks [2]. Small-world networks have a structure that requires only a small number of links to be traversed to reach any of the entities on the graph. This characteristic of a social network of friendships among people has been popularized by the idea of the small-world phenomenon, which was first studied by Milgram in [3]. This theory posits that if one person is randomly selected, it would take a small number of friendship-link traversals (six is a popular, but controversial, estimate) to reach any other person in the world. While the accuracy of this idea is controversial, it has been shown that the average number of link traversals between two Americans is six [4].
The second type, power law networks, is much more common among online social networks [2]. Power law networks are networks in which a large percentage of the entities have a small number of links to other entities and a small percentage of the entities have a very large number of links to other entities. Information networks, such as Twitter, have been shown to exhibit power law characteristics. It has also been shown that Internet networks such as blogging services often exhibit properties of power law networks [1]. Power law networks also typically possess a very large connected component that encompasses nearly all of the entities [2]. It will be illustrated in Chapter 4.1 that the Twitter network being analyzed also has a very large connected component that accounts for nearly all of the entities and links of the entire graph.
The increasing popularity of online social networks has led to a rapid increase in network size. It has also increased the number of online social networks. This increase in size and number indicates that these networks are becoming more integrated into daily life, while also getting larger and more complex. Given this, the value of analyzing these networks is increasing, but the task of analyzing them is becoming more difficult due to the very large size of the networks. Sampling methods are one way to address this problem.

Sampling of Network Graphs and Scalability
The sampling of online social networks is motivated by the sudden explosion in popularity of these networks. This very rapid increase in popularity has dramatically increased the size of these networks and has made them intractable in their natural state. Thus, scaling down the massive online social networks using sampling methods is required before conducting any further inferential analyses [5].
Two popular techniques used to create sampled networks are element-based and link-based sampling. Element-based sampling techniques involve randomly selecting a subset of elements from the whole graph and constructing links between the subset of nodes based on a certain set of conditions [1]. Link-based sampling involves randomly sampling a set of edges and creating a network from the subset of edges [1]. The goal of sampling in this thesis, regardless of the technique employed, is to obtain a scaled-down network that possesses link and element features that are similar to those of the original network.
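The two families of techniques can be sketched side by side. The example below is a simplified illustration, not the procedure applied in Chapter 4: the function names are hypothetical, the element-based variant keeps an edge only when both endpoints were sampled (the induced-subgraph condition), and both variants sample uniformly without replacement.

```python
import random

def induced_subgraph_sample(edges, n_nodes, rng=None):
    """Element-based sketch: sample nodes uniformly, then keep every
    edge whose two endpoints were both sampled."""
    rng = rng or random.Random(0)
    nodes = sorted({u for edge in edges for u in edge})
    kept = set(rng.sample(nodes, n_nodes))
    return [(u, v) for u, v in edges if u in kept and v in kept]

def incident_subgraph_sample(edges, n_edges, rng=None):
    """Link-based sketch: sample edges uniformly, then keep the nodes
    incident to the sampled edges."""
    rng = rng or random.Random(0)
    sampled = rng.sample(list(edges), n_edges)
    nodes = {u for edge in sampled for u in edge}
    return sampled, nodes
```

Note that neither function favors high-degree nodes explicitly, which is one reason such uniform schemes can struggle to reproduce heavy-tailed degree distributions.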
In power law networks, a small number of elements have a very large number of connections to other elements in the graph. This characteristic is very different from random graphs, whose degree distribution closely follows a normal distribution around a most common degree value. Given this, it is unsurprising that a large number of sampling techniques that apply to random graphs will fall short of preserving the characteristics of a very large scale-free network. If nodes are randomly selected, it is very unlikely that the very high-degree nodes will be selected. Furthermore, if the nodes are selected based on a likelihood that is proportional to their degree, it is even more unlikely that they will be selected. Given that these high-degree nodes are crucial when analyzing the topology of a power law network, it is easy to see how some of the more traditional sampling techniques mentioned below can lead to very misleading inferences about the topology or processes of a given network.
It has been shown in [6] that subgraphs of a scale-free network (a form of power-law network) that are obtained through random sampling do not have a power-law degree distribution. Common sampling methods such as induced subgraph sampling (a form of element-based sampling), incident subgraph sampling (a form of link-based sampling), snowball sampling, and many other sampling techniques have been shown to be unable to preserve the degree distribution of a scale-free network [7].
With that, it has been shown that random walk algorithms are better suited to sample representatively from a power law network [7]. Random walk algorithms employ a Bayesian sampling method known as the Metropolis-Hastings algorithm to crawl along the graph from one vertex to another [8]. This Markov chain Monte Carlo (MCMC) approach has been proven to be effective at preserving the characteristics of directed online social networks [9]. In fact, it has been shown that the Markov chain of a Metropolis-Hastings Random Walk can be tailored to any vertex distribution [10].
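A minimal sketch of a Metropolis-Hastings random walk (MHRW) is given below. It assumes a uniform target distribution over vertices, an undirected view of the graph in which every node has at least one neighbor, and the standard acceptance probability min(1, deg(current) / deg(candidate)). The function name `mhrw_sample` and the data layout are hypothetical, and the details of the sampler used in this thesis may differ.

```python
import random

def mhrw_sample(neighbors, start, n_steps, rng=None):
    """Metropolis-Hastings random walk with a uniform target (sketch).

    neighbors maps each node to a non-empty list of adjacent nodes.
    Returns the sequence of visited nodes, including repeats that
    arise when a proposed move is rejected.
    """
    rng = rng or random.Random(0)
    current = start
    visited = [current]
    for _ in range(n_steps):
        # Propose a neighbor uniformly at random.
        candidate = rng.choice(neighbors[current])
        # Accept with probability min(1, deg(current) / deg(candidate)),
        # which offsets a plain random walk's bias toward high-degree nodes.
        accept = min(1.0, len(neighbors[current]) / len(neighbors[candidate]))
        if rng.random() < accept:
            current = candidate
        visited.append(current)
    return visited
```

The distinct nodes in the returned sequence, together with the edges among them, can then serve as the sampled graph. Changing the acceptance ratio changes the target distribution, which reflects the flexibility noted above.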
The sampling designs discussed in this section will be considered as appropriate sampling techniques, described in detail in Chapter 3, and performed on the Twitter graph in Chapter 4.

Network Processes: Information Diffusion and Epidemic Modeling
The diffusion of information across online networks has been analyzed since the advent of the Internet. The idea of information cascading in a systematic manner across an online network was applied to computer viruses in the early nineties [11].
The result of this analysis showed that epidemiological models could effectively be applied to modeling information spread on a directed computer network [11]. It also showed that network topology greatly influenced the scope of computer virus propagation through a network of computers [11]. These early types of information diffusion models have been altered over the years to model the spread of information across online social networks.
News of important events and rumors have been described as infections of the mind, and this analogy to ideas and information acting as infectious pathogens has given rise to a large body of research devoted to modeling the spread of information across social networks using epidemic modeling techniques [12].
This research led to numerous discoveries regarding the benefits and downsides of applying epidemic models to the spread of information [12]. It has been shown that network topology not only impacts the scope of the spread, but also the transmission rate required for an epidemic to spread [12]. Traditional epidemic models rely on an assumption that the people or entities being studied comprise a random network. This assumption of a random network means that all uninfected people are equally likely to be infected by a randomly selected infected person.
As a network, this would mean that every person, or element, has a link to every other person in the network. As discussed earlier in this chapter, online social networks are rarely random and, as a result, the epidemic processes modeled on them exhibit characteristics that were previously thought to be very unlikely to occur.
Traditional epidemic models rely heavily on the value of the reproductive ratio, $R_0$, which represents the average number of people a given person will infect over the course of the epidemic [13]. If the reproductive ratio is less than 1, then the epidemic will die out over time. Conversely, if the reproductive ratio is greater than 1, the epidemic will continue to spread until all elements have become infected [13].
This methodology of gauging the severity and scope of an epidemic based on $R_0$ only works for random networks. As one might expect, $R_0$ is not an effective gauge of epidemic spreading when dealing with nonrandom networks. In fact, it has been shown that epidemics can spread exponentially in scale-free networks regardless of the reproductive ratio and even the transmission rate [14]. This result is due to the fact that a typical node is likely connected to, or a short path away from, a highly connected node [15]. The existence of the highly connected nodes enables the epidemic to continue to spread even when the reproductive ratio and transmission rates are still low [15]. Unsurprisingly, it has been shown that networks possessing power-law degree distributions are particularly susceptible to epidemics [16].
Twitter is one of the few online social networks that has been shown to act more as an information dissemination service than as a true social network with required reciprocal connections between users [17]. Furthermore, a topological analysis of the degree distributions shows low reciprocity of following [17]. This understanding of Twitter as an information diffusion medium has been used to construct various information diffusion models.
Current Twitter information diffusion models mostly fall into two broad categories. The first analyzes only the retweet, i.e., spreading, network and its topological characteristics over time [18]. The second analyzes the retweet network in conjunction with the underlying social network (i.e., a Twitter network of followers) [19].
The first method is primarily used to understand how cascading trees evolve over the course of a given time period. Cascading trees are directed networks that originate at a root node, an individual Twitter user in this case, and extend as far as that individual's influence will reach [19]. This method of analyzing cascading sequences is often used when comparing the influence of one user with another [20].
This type of analysis is also used to study how far across the network the average user's influence will go over a specified time period [18]. It is important to note that this type of analysis does not presume a specific social network structure prior to analyzing the retweet network.
The second method involves modeling the spread of information after characterizing the original social network. This type of analysis often includes epidemic modeling because one is aware of the network structure and the number of individuals present in a given population before creating the model. These analyses have shown that epidemic models can accurately model the spread of news and rumors on a network of Twitter users [21]. The ability to model information diffusion using epidemic models is, in part, made possible by the potential for a rapid spreading of information on Twitter [22]. Epidemics are often intense and episodic, leading to a sharp spike in the number of infected individuals followed by a sharp decline [13]. This short-lived, intense activity on Twitter will be explored in more detail throughout Chapter 4.

[...] of the graph, respectively [1]. Edges can be thought of as the connections between vertices on a graph.

Induced subgraphs will also be created and utilized when making inference about the whole network. Induced subgraphs are subgraphs that are constructed by selecting a set of vertices and then adding all edges associated with those vertices to the subgraph. Formally, an induced subgraph of $G = (V, E)$ is a subgraph $G' = (V', E')$, where $V' \subseteq V$ is the selected set of vertices and $E' \subseteq E$ is the subset of edges found on $G$ that are associated with $V'$ [1]. Later in the chapter, subgraphs, and the sampling methods by which they are obtained, will be discussed and proposed as models for creating prototypic, scaled-down networks.
The Twitter network, and its graph representation, will often be compared to a random graph to highlight the differences between topology and its role in epidemic processes. Given this, the concept of a random graph must be introduced.
Formally, a random graph, often referred to as an Erdos-Renyi graph, is defined through a collection $\mathcal{G}_{N_v, N_e}$ of all graphs $G$ with $|V| = N_v$ and $|E| = N_e$, and assigns probability $P(G) = \binom{N}{N_e}^{-1}$ to each $G \in \mathcal{G}_{N_v, N_e}$, where $N = \binom{N_v}{2}$ is the total number of unique vertex pairs [1]. Random graphs rarely have dense clusters or high-degree nodes.
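As a small illustration, a random graph with a fixed number of vertices and edges can be drawn by choosing the edges uniformly from all possible vertex pairs, so that every graph with those counts is equally likely. The sketch below assumes an undirected graph without self-loops, and the function name `random_graph` is hypothetical. It enumerates all pairs, so it is only practical for small graphs.

```python
import random
from itertools import combinations

def random_graph(n_vertices, n_edges, rng=None):
    """Draw n_edges distinct vertex pairs uniformly at random.

    Vertices are labeled 0 .. n_vertices - 1. Every graph with the
    requested vertex and edge counts has the same chance of being drawn.
    """
    rng = rng or random.Random(0)
    all_pairs = list(combinations(range(n_vertices), 2))
    return rng.sample(all_pairs, n_edges)
```

Because every pair is equally likely to be connected, no vertex is systematically favored, which is why such graphs rarely contain the hubs seen in the Twitter network.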
Twitter users and the connections between users can be thought of as a mathematical graph with nodes representing users and the links between users as edges.
Given a graph representation of a complex system like Twitter users, it is valuable to explore the characteristics and structural properties of this graph. With clearly defined edges on a particular graph G, it is important to understand what relationship these edges represent. Are they unidirectional or bidirectional processes?
This distinction is essential for modeling information diffusion on social networks since the relationship between users dictates which way information can flow on the network. Directed graphs are comprised of unidirectional edges. A graph G for which edges E have a distinct ordering, i.e., {u, v} is distinct from {v, u}, is a directed graph [1].
It is common to express vertex adjacencies, i.e., connections between the vertices of a graph, in the form of a matrix. Matrices prove to be useful for computation and data analysis. An adjacency matrix for a directed graph is an $N_v \times N_v$ binary matrix $A$ that represents all vertex connections on a graph, with entries:

$$A_{ij} = \begin{cases} 1, & \text{if } \{i, j\} \in E, \\ 0, & \text{otherwise.} \end{cases}$$

A variation of the adjacency matrix is an adjacency list. The adjacency list is often used when the graph is particularly large and/or the adjacency matrix is very sparse, i.e., contains a large proportion of zero entries [1]. An adjacency list is an $N_e \times 2$ matrix in which each row represents a connection between the two nodes in the row. Adjacency lists will be used throughout the analysis due to the very large size and sparsity of the Twitter graph.
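As a small illustration of the two representations, the sketch below builds an adjacency matrix from an edge list in Python. The toy edge list and the helper names `adjacency_matrix` and `sparsity` are hypothetical and are not taken from the analysis itself.

```python
# Hypothetical toy data: a directed graph on 4 vertices stored as an
# N_e x 2 edge list, i.e., the adjacency-list form described in the text.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
n_vertices = 4


def adjacency_matrix(edge_list, n):
    """Return the n x n binary matrix with A[i][j] = 1 iff edge (i, j) exists."""
    matrix = [[0] * n for _ in range(n)]
    for source, target in edge_list:
        matrix[source][target] = 1
    return matrix


def sparsity(matrix):
    """Return the proportion of zero entries in a matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(row.count(0) for row in matrix)
    return zeros / total


A = adjacency_matrix(edges, n_vertices)
print(A)
print(sparsity(A))  # 12 of the 16 entries are zero
```

Even on this toy graph the edge list stores 4 rows while the matrix stores 16 entries; the gap grows quadratically with the number of vertices, which is why the list form is preferred for a graph of this size.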
An important characteristic of any given vertex, $v$, in the network, i.e., $v \in V$, is the number of edges connecting it to the rest of the vertices. This measure of connectedness is formally referred to as the degree of a given vertex, $d_v$ [1]. For a graph, $G$, let $f_d$ be defined as the proportion of vertices $v \in V$ with a degree value $d_v = d$ [1]. Given that, $\{f_d\}_{d \geq 0}$ is referred to as the degree distribution of graph $G$ [1]. Graphically, the degree distribution of $G$ can be represented with a histogram.
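The definition of $f_d$ translates directly into a few lines of Python. This is a minimal sketch that assumes edges are treated as undirected and that vertices are labeled 0 to n - 1; the function name `degree_distribution` and the toy edge list are hypothetical.

```python
from collections import Counter


def degree_distribution(edge_list, n):
    """Return {d: f_d}, the proportion of the n vertices that have degree d.

    Assumes vertices are labeled 0..n-1 and that every edge contributes
    one unit of degree to each of its two endpoints.
    """
    degree = [0] * n
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree)
    return {d: count / n for d, count in sorted(counts.items())}


# Hypothetical toy graph: vertex 4 is isolated, so f_0 is nonzero.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
print(degree_distribution(toy_edges, 5))
```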
Computing a degree distribution is a common starting point for characterizing a network, since the distribution of connections between the vertices often indicates to which family of graphs a particular network belongs [1]. It provides a very useful summary of the connectivity of the graph, which is crucial when determining how to model or predict a particular process over that graph [1]. As highlighted in Chapter 2, it is common for social networks to possess a power-law degree distribution. A power-law degree distribution can be expressed as a power function in which the frequency of nodes with a given degree, $f_d$, sharply decreases as degree, $d$, increases:

$$f_d \propto d^{-\alpha},$$

where $\alpha$ is an exponent to be estimated, which determines the relationship between a degree value and the likelihood that a node will possess that degree value.
It is common to apply a log transformation to both degree and frequency to assess how closely the degree distribution follows a power law. Taking logarithms of both sides of the power-law relationship shows that the log-transformed degree distribution should approach a straight line. As seen in Equation 2, there is a linear relationship between $\log(d)$ and $\log(f_d)$:

$$\log(f_d) \sim C - \alpha \log(d), \tag{2}$$

where $C$ represents an arbitrary constant.
A number of regression-based approaches have been utilized to estimate the rate, $\alpha$, at which frequency decreases as a function of degree, but these approaches have come under scrutiny for being ad hoc [1]. These methods are not advisable due to the disproportionate variability in the data at high degrees. A more mathematically rigorous estimator that is routinely used when dealing with power-law distributions is the Hill estimator [1]. The power-law exponent, $\hat{\alpha}_k$, estimated using the Hill estimator, can be computed as follows:

$$\hat{\alpha}_k = 1 + \hat{\gamma}_k^{-1}, \qquad \hat{\gamma}_k = \frac{1}{k} \sum_{i=0}^{k-1} \log \frac{d_{(N_v - i)}}{d_{(N_v - k)}},$$

where $d_{(1)} \leq \dots \leq d_{(N_v)}$ are the sorted vertex degrees [1]. This estimate of $\alpha$ is computed for a large range of values of $k$ with the hope that the function $\hat{\gamma}_k^{-1}$ stabilizes, so that $\hat{\alpha}_k$ settles at a particular value. A plot of $\hat{\alpha}_k$ against $k$, known as a Hill plot, can be used to find a stable value for $\alpha$. An estimate of $\alpha$ may then be used in sampling from a power-law distribution when attempting to create prototypic networks for further analysis.
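A minimal Python sketch of the Hill estimator follows. It assumes the standard form in which $\hat{\gamma}_k$ is the mean log-ratio of the $k$ largest degrees to the next-largest degree and $\hat{\alpha}_k = 1 + 1/\hat{\gamma}_k$. The names `hill_alpha` and `hill_plot_points`, and the synthetic degree values, are hypothetical and serve only to illustrate the computation.

```python
import math


def hill_alpha(degrees, k):
    """Hill estimate of the power-law exponent from the k largest degrees."""
    d = sorted(degrees)  # d[0] <= ... <= d[n - 1]
    n = len(d)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of vertices")
    threshold = d[n - 1 - k]  # d_(N_v - k) in 1-based order-statistic notation
    if threshold <= 0:
        raise ValueError("the threshold degree must be positive")
    gamma = sum(math.log(d[n - 1 - i] / threshold) for i in range(k)) / k
    if gamma == 0:
        raise ValueError("the k largest degrees are all tied with the threshold")
    return 1.0 + 1.0 / gamma


def hill_plot_points(degrees, ks):
    """Return (k, alpha_k) pairs, i.e., the points of a Hill plot."""
    return [(k, hill_alpha(degrees, k)) for k in ks]


# Hypothetical synthetic data: evenly spaced quantiles of a continuous
# power law with exponent 2.4, standing in for observed vertex degrees.
alpha_true = 2.4
n = 5000
synthetic = [((j + 0.5) / n) ** (-1.0 / (alpha_true - 1.0)) for j in range(n)]

for k, alpha_k in hill_plot_points(synthetic, [50, 200, 500, 1000]):
    print(k, round(alpha_k, 3))
```

Plotting the returned pairs gives the Hill plot; on real degree data the estimate is typically read off the range of $k$ over which the curve is flat.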
Another useful network characteristic is graph density, which is defined as the ratio of the number of actual edges to the maximum number of possible edges [2]. Graph density can range from zero to one, with values close to zero indicating a very sparse graph and values close to one indicating a very dense graph. It is important to understand how sparse or dense a graph is when dealing with very large graphs, and graph density is particularly useful in understanding graph topology. Numerically, the density of a subgraph $H$ can be expressed as:

$$\mathrm{den}(H) = \frac{|E_H|}{|V_H| (|V_H| - 1) / 2},$$

where $E_H$ is the set of edges and $V_H$ is the set of vertices of the subgraph [1].
Note that defining density in terms of a subgraph $H$ allows density to be computed at a range of scales, from the entire graph, $G$, down to the neighborhood of a single vertex, $v$ [1].
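The flexibility of a subgraph-based density can be sketched as follows, assuming a simple undirected graph so that the maximum number of edges on $n$ vertices is $n(n-1)/2$. The helper names `density` and `subgraph_density` and the toy edge list are hypothetical.

```python
def density(n_vertices, n_edges):
    """Ratio of actual edges to the n(n-1)/2 possible undirected edges."""
    if n_vertices < 2:
        return 0.0
    return n_edges / (n_vertices * (n_vertices - 1) / 2)


def subgraph_density(edge_list, vertex_subset):
    """Density of the subgraph induced by vertex_subset.

    Edges are treated as undirected; duplicates and self-loops are ignored.
    """
    subset = set(vertex_subset)
    internal = {
        frozenset((u, v))
        for u, v in edge_list
        if u in subset and v in subset and u != v
    }
    return density(len(subset), len(internal))


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
print(subgraph_density(toy_edges, {0, 1, 2, 3}))  # whole graph
print(subgraph_density(toy_edges, {0, 1, 2}))     # the triangle only
```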
The clustering coefficient of a graph is an important network characteristic which is related to density calculations. Two types of clustering coefficients, global and local, are typically used when describing network characteristics. Clustering coefficients can be thought of as the average of a set of densities that were calculated for every vertex of a graph.
The clustering coefficient can also be defined geometrically in terms of the number of triangles in the graph. In graph theory, a triangle is a complete subgraph of order three, and a connected triple is a subgraph with three vertices connected by two edges [1]. The local clustering coefficient, $\mathrm{cl}(v)$, can be expressed as:

$$\mathrm{cl}(v) = \frac{\tau_\Delta(v)}{\tau_3(v)},$$

where $\tau_\Delta(v)$ represents the number of triangles of $G$ in which $v$ falls and $\tau_3(v)$ represents the number of connected triples in which both edges are incident to $v$ [1]. The number of connected triples for a given vertex can also be thought of as the binomial coefficient $\tau_3(v) = \binom{d_v}{2}$, which represents the number of possible ways to choose two edges from all the edges adjacent to $v$.
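The triangle-and-triple view of local clustering can be sketched in Python as below. The sketch assumes an undirected graph and, as a labeled convention rather than something stated in the text, returns 0 for vertices with fewer than two neighbors, where the ratio is otherwise undefined. The names `build_neighbors` and `local_clustering` are hypothetical.

```python
from itertools import combinations


def build_neighbors(edge_list):
    """Return {vertex: set of neighbors} for an undirected edge list."""
    neighbors = {}
    for u, v in edge_list:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def local_clustering(neighbors, v):
    """Triangles through v divided by connected triples centered at v."""
    d = len(neighbors[v])
    if d < 2:
        return 0.0  # assumed convention: no triples means zero clustering
    triples = d * (d - 1) // 2  # ways to choose two of v's edges
    triangles = sum(
        1 for a, b in combinations(neighbors[v], 2) if b in neighbors[a]
    )
    return triangles / triples


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3.
nb = build_neighbors([(0, 1), (0, 2), (1, 2), (2, 3)])
for vertex in sorted(nb):
    print(vertex, local_clustering(nb, vertex))
```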
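The global clustering coefficient can be computed as a weighted average of all local clustering coefficients:

$$\mathrm{cl}_T(G) = \frac{\sum_{v} \tau_3(v)\, \mathrm{cl}(v)}{\sum_{v} \tau_3(v)},$$

where the sums run over all vertices with $\tau_3(v) > 0$ [1].

The assortativity of a graph, $r$, measures the extent to which connected vertices share similar characteristics. Formally, the assortativity of a graph is defined as:

$$r = \frac{\sum_{x, y} x y \left( f_{xy} - f_{x+} f_{+y} \right)}{\sigma_x \sigma_y},$$

where $f_{xy}$ is the proportion of edges that join a vertex with characteristic value $x$ to a vertex with characteristic value $y$, $f_{x+}$ and $f_{+y}$ are the corresponding marginal proportions, and $\sigma_x$ and $\sigma_y$ are the standard deviations of those marginal distributions [1]. Here, the characteristics that will be analyzed are the degree of a given vertex and the degree of the vertex's neighbors. This relationship sheds light on the similarities or differences between Twitter users that are connected with one another.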
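For degree as the characteristic of interest, assortativity reduces to the Pearson correlation between the degrees found at the two ends of each edge. The sketch below assumes an undirected graph and counts every edge in both orientations; the function name `degree_assortativity` and the toy graphs are hypothetical.

```python
import math


def degree_assortativity(edge_list):
    """Pearson correlation of the degrees at the two ends of each edge."""
    degree = {}
    for u, v in edge_list:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Count each undirected edge in both orientations so x and y are symmetric.
    xs, ys = [], []
    for u, v in edge_list:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("assortativity is undefined when all degrees are equal")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy graphs: a star (hub joined only to leaves) and a short path.
star = [(0, 1), (0, 2), (0, 3)]
path = [(0, 1), (1, 2), (2, 3)]
print(degree_assortativity(star))
print(degree_assortativity(path))
```

A star is the extreme disassortative case, since every edge joins the highest-degree vertex to a lowest-degree vertex. Negative values of $r$ on a follower network point in the same direction: high-degree accounts tend to be connected to low-degree accounts.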

Social Network Sampling Algorithms
The sheer size of the Twitter network makes analysis computationally expensive, and often prohibitive in available software packages. Given this, considerable effort is dedicated to creating network graphs that are characteristic of the social Twitter network, but much smaller. The smaller graphs generated are easier to analyze and visualize. These graphs, however, should possess similar network topology characteristics, e.g. degree distribution, average degree, graph density, clustering coefficient, and assortativity, in an attempt to accurately render a graph that is significantly scaled down, but still representative of the original network.
In order to determine if a sampling method has produced a reasonably similar and representative subgraph of the entire network, it is necessary to compute characteristics of the whole network that will serve as a benchmark by which to measure all sampled or generated subgraphs.
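In what follows, six sampling techniques are introduced in an attempt to determine which method of sampling best captures the characteristics of the larger social network. The techniques are: (1) induced subgraph sampling, (2) incident subgraph sampling, (3) snowball sampling, (4) sampling from a power-law degree distribution, (5) Barabasi-Albert preferential attachment modeling, and (6) Metropolis-Hastings Random Walk (MHRW) sampling.

To evaluate how similar a subgraph's characteristic distribution is to that of the original graph, a normalized mean square error (NMSE) will be used [3]. This metric is utilized to compare the characteristic distributions, e.g., local density, local clustering coefficient, and node degree, of the sampled graph to those of the original graph.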
Let $\theta_k$ be the fraction of nodes in the original graph that have a characteristic value less than or equal to $k$, and let $\hat{\theta}_k$ be the fraction of nodes in a subgraph that have a characteristic value less than or equal to $k$. The normalized mean square error (NMSE) is defined as:

$$\mathrm{NMSE}(k) = \frac{\sqrt{E\left[ (\hat{\theta}_k - \theta_k)^2 \right]}}{\theta_k}.$$

Next, each of the six outlined sampling algorithms will be described.
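The comparison can be sketched in Python as below. The sketch assumes a form of the NMSE that is common in the graph sampling literature, namely the root of the mean squared difference between the sampled and true cumulative fractions, divided by the true fraction; the exact normalization used in [3] may differ. The function names and toy values are hypothetical.

```python
import math


def fraction_at_most(values, k):
    """Fraction of nodes whose characteristic value is <= k."""
    return sum(1 for value in values if value <= k) / len(values)


def nmse(original_values, sampled_value_sets, k):
    """Assumed form: sqrt(mean((theta_hat_k - theta_k)^2)) / theta_k.

    original_values: one characteristic value per node of the full graph.
    sampled_value_sets: one list of node values per sampled subgraph.
    """
    theta = fraction_at_most(original_values, k)
    if theta == 0:
        raise ValueError("theta_k is zero, so the NMSE is undefined at this k")
    squared_errors = [
        (fraction_at_most(sample, k) - theta) ** 2
        for sample in sampled_value_sets
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors)) / theta


# Hypothetical toy values, e.g., node degrees of a graph and of two subgraphs.
original = [1, 2, 3, 4]
samples = [[1, 2], [3, 4]]
print(nmse(original, samples, k=2))
```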
The first method, induced subgraph sampling, randomly samples vertices and creates the corresponding induced subgraph. An induced subgraph consists of all sampled vertices and all edges of $G$ that join pairs of those vertices. Specifically, $n$ vertices are randomly chosen from $V$, yielding a subset $V_H = \{v_1, \dots, v_n\}$. Then, all edges $(v_i, v_j) \in E$ with $v_i, v_j \in V_H$ are added to the graph to connect the vertices [1]. This method of sampling is usually applied in social network research to determine the contact network between randomly selected people in the network [1].
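Induced subgraph sampling is short enough to sketch directly. The following Python function is an illustration under the assumption that the graph is stored as an edge list; the name `induced_subgraph_sample` and the `seed` parameter are hypothetical.

```python
import random


def induced_subgraph_sample(edge_list, vertices, n, seed=None):
    """Sample n vertices uniformly and keep edges with both ends sampled."""
    rng = random.Random(seed)
    sampled = set(rng.sample(sorted(vertices), n))
    kept = [(u, v) for u, v in edge_list if u in sampled and v in sampled]
    return sampled, kept


# Hypothetical toy graph on 6 vertices.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
toy_vertices = range(6)
sampled_vertices, sampled_edges = induced_subgraph_sample(
    toy_edges, toy_vertices, n=3, seed=1
)
print(sampled_vertices, sampled_edges)
```

Because only edges with both endpoints in the sample survive, the expected number of retained edges shrinks roughly with the square of the sampling fraction, which is consistent with the sparse subgraphs this method tends to produce on very large networks.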
The second method, incident subgraph sampling, involves randomly selecting $n$ edges from the edge set $E$ of the graph $G$ to create a subset of edges $E_H = \{e_1, \dots, e_n\}$, and then creating a subset of vertices, $V_H$, that comprises the endpoints of all edges within $E_H$. Another way to conceptualize the subset of vertices taken during incident subgraph sampling is to identify all the unique vertices that appear in the edges of $E_H$. This sampling method differs most noticeably from induced subgraph sampling in that the probability of a vertex being selected as part of the subgraph is not uniform across all vertices in the original graph. In fact, in incident subgraph sampling, the probability that a vertex is selected is proportional to its degree. Vertices with larger degrees are incident to more edges and, thus, are more likely to be selected. It is important to note that although this method is slightly better at identifying high-degree nodes, each sampled edge contributes only one of the many edges of a high-degree vertex. This proves to be problematic for a network as large as the Twitter network because the likelihood of the subgraph becoming dense enough to be representative of the whole graph is extremely low, as illustrated in Chapter 4.

The third method, snowball sampling, randomly selects a set of seed vertices and then adds the neighbors of those vertices to the sample. For example, starting from a seed vertex $v$, a two-stage snowball sample would include $v$, all of $v$'s friends, and all of the friends of friends of $v$. It is clear that this method of sampling will capture local clustering since, by construction, it creates a cluster around every randomly selected vertex. However, when dealing with a very large network, the probability that two randomly chosen vertices will have common friends, thus linking the two clusters, is very low. This problem of disjoint clusters is illustrated in Section 4.2.
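Both edge-based and neighborhood-based sampling can be sketched in a few lines of Python. The function names, the `waves` parameter, and the toy path graph are hypothetical; in particular, the number of snowball stages used in the analysis is not assumed here and is left as a parameter.

```python
import random


def incident_subgraph_sample(edge_list, n, seed=None):
    """Sample n edges uniformly; the vertex set is the set of their endpoints."""
    rng = random.Random(seed)
    edges = rng.sample(list(edge_list), n)
    vertices = {endpoint for edge in edges for endpoint in edge}
    return vertices, edges


def snowball_sample(neighbors, seeds, waves=1):
    """Return the seeds plus every vertex within `waves` hops of a seed."""
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(waves):
        reached = set()
        for v in frontier:
            reached.update(neighbors.get(v, ()))
        frontier = reached - sampled  # only vertices not seen before
        sampled |= frontier
    return sampled


# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
path_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(incident_subgraph_sample(path_edges, n=2, seed=3))
print(snowball_sample(path_neighbors, seeds=[0], waves=2))
```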
Sampling from a power-law degree distribution involves constructing a new graph, $H$, using the unique degree values of nodes in the original network in conjunction with a power-law exponent, $\alpha$, obtained via the Hill estimator (see Section 3.1). A set of vertices is created by sampling degree values from the newly created degree distribution, which takes the form:

$$P(d) \propto d^{-\alpha}.$$

The Barabasi-Albert preferential attachment model grows a graph by iteratively adding vertices, with each new vertex more likely to connect to existing high-degree vertices. Intuitively, Barabasi-Albert preferential attachment models mimic the process by which the Twitter network of followers is created. When new users join the social network, they are offered a list of popular accounts and celebrities (highly followed Twitter accounts) that might be of interest to them. Given that Twitter has also become a growing source of news and entertainment for many people, it is not unreasonable to presume that people opt to follow popular Twitter accounts of news outlets and celebrities, as opposed to randomly choosing Twitter accounts to follow. These types of graphs have a degree distribution that closely follows a power-law distribution: a large percentage of nodes have a very small number of connections, while a small number of nodes have an incredibly large number of connections. These heavily connected nodes, sometimes referred to as super-spreaders in epidemic literature, play a crucial role in the severity of the epidemic spread. Because new users prefer to connect with existing, popular users, a preferential attachment model might be better suited to replicating a Twitter network.
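The two generative approaches can be sketched as follows. Both functions are illustrations under stated assumptions: degree values are drawn with probability proportional to $d^{-\alpha}$ over the unique positive degree values, and the preferential attachment sketch attaches each new vertex to exactly one existing vertex, a simplification that the text does not specify. The function names are hypothetical.

```python
import random


def sample_power_law_degrees(unique_degrees, alpha, n, seed=None):
    """Draw n degree values with probability proportional to d ** -alpha.

    Assumes every value in unique_degrees is a positive integer.
    """
    rng = random.Random(seed)
    weights = [d ** (-alpha) for d in unique_degrees]
    return rng.choices(unique_degrees, weights=weights, k=n)


def preferential_attachment_edges(n, seed=None):
    """Grow a graph to n vertices, one vertex and one edge at a time.

    Each new vertex links to an existing vertex chosen with probability
    proportional to that vertex's current degree (assumed: one edge per
    new vertex).
    """
    rng = random.Random(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]  # each vertex appears once per unit of degree
    for new_vertex in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new_vertex, target))
        endpoints.extend([new_vertex, target])
    return edges


degrees = sample_power_law_degrees([1, 2, 5, 10, 100], alpha=2.4, n=1000, seed=7)
print(sum(1 for d in degrees if d == 1) / len(degrees))  # share of degree-1 draws

ba_edges = preferential_attachment_edges(200, seed=7)
print(len(ba_edges))
```

Sampling from the list of edge endpoints is what makes attachment proportional to degree: a vertex with degree $d$ appears $d$ times in that list.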
The last method used to sample from the social network is a Metropolis-Hastings Random Walk (MHRW) algorithm, a Markov chain Monte Carlo approach that randomly walks along the graph from one vertex to another if certain conditions are met. Formally, the algorithm can be described as follows. First, a vertex, $v$, is randomly sampled from the entire graph and set as the initial vertex.
One of the edges of vertex $v$ is then randomly selected and followed to the corresponding neighbor vertex, $w$. Two proposal functions, $Q(v)$ and $Q(w)$, are computed that correspond to the degrees of nodes $v$ and $w$, respectively. Next, $Q(v)$ is divided by $Q(w)$ and compared to a randomly generated number, $p$, from a continuous uniform distribution defined on the interval $[0, 1]$. If the ratio $Q(v)/Q(w)$ is greater than $p$, the new vertex, $w$, is added to the subset of vertices and the process starts over with vertex $w$ as the starting point.
Otherwise, another neighbor of $v$ is randomly sampled. This process continues until a prespecified algorithm cost is met or until a specified vertex count is met. For this project, the algorithm is run until a specified vertex count is met that is similar in size to subgraphs obtained by the other sampling methods. Algorithm 1 illustrates pseudo-code of the MHRW algorithm that is used to sample the subgraph $G^*$ from the entire social network $G$.

Algorithm 1: Metropolis-Hastings Random Walk Algorithm.
Data: Social Network Graph, G
Result: Subgraph, G*, using Metropolis-Hastings Random Walk
Initialization;
Randomly select a vertex, v, from G;
while the specified vertex count is not met do
    Random neighbor, w, of v is selected;
    Let p be a random sample from runif(0,1);
    if Q(v)/Q(w) > p then
        Crawl to w; w becomes initial node;
    else
        Randomly sample another neighbor from v;
    end
end
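Algorithm 1 can be sketched in Python as shown below. This is an illustration rather than the implementation used in the analysis: it assumes an undirected graph stored as a dictionary of neighbor lists, takes $Q$ to be the vertex degree as described in the text, and adds a hypothetical `max_steps` safeguard so the loop always terminates.

```python
import random


def mhrw_sample(neighbors, target_size, seed=None, max_steps=100_000):
    """Metropolis-Hastings random walk over an undirected graph.

    neighbors: {vertex: list of neighbors}, assumed symmetric so that
    every proposed vertex has at least one neighbor.
    Returns the visited vertices and the edges the walk moved along.
    """
    rng = random.Random(seed)
    # Start from a random vertex that has at least one neighbor.
    current = rng.choice(sorted(v for v in neighbors if neighbors[v]))
    visited = {current}
    walked_edges = []
    for _ in range(max_steps):
        if len(visited) >= target_size:
            break
        proposal = rng.choice(neighbors[current])
        p = rng.random()  # uniform on [0, 1)
        # Accept when Q(v) / Q(w) > p, with Q taken to be the degree.
        if len(neighbors[current]) / len(neighbors[proposal]) > p:
            walked_edges.append((current, proposal))
            current = proposal
            visited.add(current)
        # Otherwise stay at the current vertex and propose another neighbor.
    return visited, walked_edges


# Hypothetical toy graph: a 6-cycle with one chord between vertices 0 and 3.
toy = {0: [1, 5, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3, 5], 5: [4, 0]}
vertices, edges = mhrw_sample(toy, target_size=4, seed=1)
print(vertices, edges)
```

Moves toward lower-degree vertices are always accepted because the ratio exceeds one, while moves toward higher-degree vertices are accepted only some of the time. This is what offsets the tendency of a plain random walk to oversample high-degree vertices.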

Information Diffusion and Epidemic Modeling
The information diffusion process on a social network can be viewed as a Susceptible-Infected (SI) model, a type of epidemiological model in which each node in the network can become infected and then remains infected for the duration of the process. If an informed user is considered infected, other Twitter users become infected when they retweet another user's story.
The important characteristic of the SI model is that once a user becomes infected, that user is considered infected for the entire duration of the epidemic process. The SI-type model is chosen over other epidemic models because the information diffusion model is concerned with how much of the network has become aware of the news of the discovery, not the frequency with which certain users retweet, mention, or reply to other users. Other epidemic models take into account the possibility of recovering from the infection or becoming susceptible again after an initial infection. Although interesting, these models in their current form are not applicable to the information diffusion modeling at hand. Modified, network-based epidemiological models will be used to account for the nonrandom topology of the Twitter network.
Traditional SI epidemic models, expressed by Equations 9 and 10, consist of two probabilities: the probability of a new infection, and the probability that all individuals remain in the same state that they were in during the previous time period (either infected or susceptible):

$$P\left( N_S(t + \delta t) = s - 1,\; N_I(t + \delta t) = i + 1 \mid N_S(t) = s,\; N_I(t) = i \right) \approx \beta s i \, \delta t, \tag{9}$$

$$P\left( N_S(t + \delta t) = s,\; N_I(t + \delta t) = i \mid N_S(t) = s,\; N_I(t) = i \right) \approx 1 - \beta s i \, \delta t. \tag{10}$$

Here, $N_S(t)$ represents the number of susceptible individuals, $N_I(t)$ represents the number of infected individuals, $\beta$ is the transmission rate, and $\delta t$ is the infinitesimal change in time. This model does not take network structure into account. In fact, it assumes that the population is homogeneous and that each infected person is equally likely to infect any of the susceptible individuals. In terms of network structure, this would be referred to as a complete graph, or a graph in which every node has an edge connected to all other nodes.
An important metric in epidemic modeling is the basic reproduction number, $R_0$. The basic reproduction number is defined as the number of infections expected to be generated by a single infected person [1]. This number is very important in determining whether or not an epidemic will occur because it provides an estimate for the average number of people a given infected person will subsequently infect. If $R_0 < 1$, then as time increases the infection will die out because each infected person, on average, infects fewer than one new person. If $R_0 > 1$, an epidemic is likely to occur because each infected person, on average, infects more than one person.
It is important to note that these epidemic predictions based on $R_0$ apply to homogeneous populations. Later in this chapter, the basic reproduction number will be discussed in a network-based framework.
Given what is known about the topology of the social network, the traditional model represented in Equations 9 and 10 would not be suitable for a network-based information diffusion process. The model must be modified to restrict each node to have the potential to infect only its neighbors.
Equation 13 shows the network-based epidemic process that will be modeled on the sampled subgraph (see [1] for more details). Let $X(t)$ represent a stochastic process that forms a continuous-time Markov chain. This process is comprised of state vectors $\mathbf{x}$ whose elements are either 1 or 0, with 1 being infected and 0 being susceptible. The state vectors and the numbers that comprise them represent the state of each node over time. The coefficient $\beta$ is the transmission rate that represents the probability that the neighbor of an infected vertex becomes infected, and $M_i(\mathbf{x})$ represents the number of neighbors of vertex $i$ that are infected at time $t$:

$$P\left( X(t + \delta t) = \mathbf{x}' \mid X(t) = \mathbf{x} \right) \approx \begin{cases} \beta M_i(\mathbf{x}) \, \delta t, & \text{if } x_i = 0 \text{ and } x_i' = 1, \\ 1 - \sum_{j : x_j = 0} \beta M_j(\mathbf{x}) \, \delta t, & \text{if } \mathbf{x}' = \mathbf{x}, \end{cases} \tag{13}$$

where $\mathbf{x}'$ is a successive change of state from $\mathbf{x}$ involving only one element at a time.
The basic reproduction number for an SI epidemic process on a graph depends on the degree distribution through the ratio of its moments:

$$R_0 \propto \frac{E(d^2)}{E(d)} - 1,$$

where $\{f_d\}$ is the degree distribution and $E(d)$ and $E(d^2)$ are the first and second moments of $f_d$, respectively [1]. While this reproduction number produces reasonable estimates for processes on random graphs, it is shown in [1] that $R_0$ increases dramatically for graphs that have heterogeneous degree distributions.
This results from the fact that $E(d^2) \gg E(d)$ for such graphs, since the second moment is often extremely large due to the long tail of a power-law degree distribution. This fact illustrates that epidemics are much more likely to occur on power-law networks because there are very high-degree nodes that have the potential to infect a large number of other nodes.
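The effect of degree heterogeneity on the moments can be seen in a small numerical sketch. The two degree sequences below are hypothetical and share the same mean degree; the helper name `moment_ratio` is also hypothetical.

```python
def moment_ratio(degrees):
    """Return E(d^2) / E(d) for a list of vertex degrees."""
    n = len(degrees)
    first_moment = sum(degrees) / n
    second_moment = sum(d * d for d in degrees) / n
    return second_moment / first_moment


# Two hypothetical degree sequences with the same mean degree of 4.
homogeneous = [4] * 10           # every vertex has degree 4
heterogeneous = [1] * 9 + [31]   # nine low-degree vertices and one hub

print(moment_ratio(homogeneous))
print(moment_ratio(heterogeneous))
```

Although both sequences have the same first moment, the single hub raises the ratio from 4 to 24.25, which illustrates how a long-tailed degree distribution inflates any quantity driven by the second moment.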
The network-based epidemic process is modeled on the sampled subgraph by randomly selecting one vertex and applying the transmission rate to determine the number of neighbors that will become infected on the first iteration. Next, all newly infected vertices spread the infection to their respective neighbors according to the transmission rate.
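A discrete-time version of this process can be sketched in Python as below. The sketch makes two assumptions that the text does not state explicitly: the transmission rate is applied independently to each susceptible neighbor of each infected vertex at every iteration, and every infected vertex (not only the newly infected ones) keeps spreading, in keeping with the SI assumption that infection is permanent. The function name `simulate_si` is hypothetical.

```python
import random


def simulate_si(neighbors, beta, n_iterations, seed=None):
    """Run a discrete-time SI process from one random seed vertex.

    neighbors: {vertex: list of neighbors}.
    Returns the proportion of infected vertices after each iteration,
    starting with the initial proportion.
    """
    rng = random.Random(seed)
    vertices = sorted(neighbors)
    infected = {rng.choice(vertices)}
    proportions = [len(infected) / len(vertices)]
    for _ in range(n_iterations):
        newly_infected = set()
        for v in infected:
            for w in neighbors[v]:
                # Each susceptible neighbor is infected with probability beta.
                if w not in infected and rng.random() < beta:
                    newly_infected.add(w)
        infected |= newly_infected
        proportions.append(len(infected) / len(vertices))
    return proportions


# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(simulate_si(path, beta=0.5, n_iterations=7, seed=0))
```

The returned list is the simulated counterpart of an observed awareness curve, so it can be compared point by point with the proportion of users informed at each time step.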
A root mean square error (RMSE) metric is used to compare epidemic models with varied transmission rates over various types of graphs. The proportion of the Twitter network that is aware of the news is compared to the estimated proportion for each day. Transmission rates can vary from zero to one, and the minimum RMSE value identifies the transmission rate that best fits the actual spread of information on a particular graph type.
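The search for the best-fitting transmission rate can be sketched as a simple grid search. In the sketch below, `toy_curve` is a hypothetical stand-in for an epidemic simulation on a sampled graph, and the "observed" series is generated from it purely for illustration; neither reflects the actual Twitter data.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def best_beta(observed, candidates, simulate):
    """Return the candidate rate with the smallest RMSE, plus all scores.

    simulate: a function mapping a transmission rate to a predicted
    series of informed proportions with the same length as `observed`.
    """
    scores = {beta: rmse(observed, simulate(beta)) for beta in candidates}
    return min(scores, key=scores.get), scores


def toy_curve(beta, n_steps=8):
    """Hypothetical stand-in for a simulated proportion-informed curve."""
    return [1 - (1 - beta) ** t for t in range(n_steps)]


observed = toy_curve(0.3)  # pretend this is the observed awareness curve
beta_hat, scores = best_beta(observed, [0.1, 0.3, 0.5], toy_curve)
print(beta_hat, scores)
```

With a stochastic simulator, one option is to average the predicted curve over several runs per candidate rate before computing the RMSE, so that the selected rate is less sensitive to a single random realization.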
The basic graph theory, graph characteristics, sampling techniques, and information diffusion processes introduced in this chapter will be applied to the Twitter social network in Chapter 4 in an attempt to explore how information diffuses across an online social network.

Results
Formally, network graphs and the characteristics that are important when summarizing topological structure have been introduced in Chapter 3. In this chapter, the network characteristics will be presented in application to the Twitter network.
First, the node degree distribution analysis will be discussed in conjunction with a Hill estimator and its applications to subnetwork sampling from a power-law degree distribution. The corresponding local clustering coefficient and local density of the full network graph will also be examined. Next, six sampling techniques will be used and their corresponding subgraph visualizations and characteristic summaries including: clustering coefficients, average degree, assortativity, and density will be studied for each sampling method. Finally, a network-based information diffusion model run on the sampled graphs will be analyzed and compared to the observed spread of information on the social network, i.e., the spread of the news of the discovery of the Higgs particle on Twitter.

Topology Characterization of Social Network
Recall that the data used in this thesis was originally collected by De Domenico et al. [1]. The dataset consists of four large interaction networks related to the spread of the news of the scientific discovery in July 2012, namely the social, retweet, mention, and reply networks. In this project, the main focus will be on the social network and the retweet network. The social network is the largest one, containing roughly 450,000 nodes with 14,000,000 edges, and represents the network of Twitter followers engaged in the dynamic spread of this news. The retweet network, with 425,000 nodes and 733,00 edges, represents all the retweets made during the spreading period.

Given the heavy-tailed nature of the degree distribution, it is very difficult to discern anything about the tail from a histogram because the first bin is so large relative to the other bins. Instead, the log-transformed degree distribution is used to understand the extent to which the distribution follows a power law.
The log-transformed degree distribution follows a relatively straight line, which suggests that the node degree distribution follows a power law (see Figure 3). The smaller-degree vertices tend to arch upwards and deviate slightly from a perfectly straight line, indicating that the low-degree portion of the distribution departs from a power-law shape more than the high-degree portion does.
However, once the degree values exceed one hundred, the distribution closely follows a straight line. This emergence of a linear trend in the log-transformed degree distribution indicates that the distribution follows a power law. Recall that the power-law degree distribution requires estimation of one parameter, $\alpha$, the rate at which the function decreases. As mentioned in Chapter 3, the Hill estimator is used to estimate this exponent to better understand the nature of the degree distribution.

Figure 4 shows that the estimated exponent, $\alpha$, quickly stabilizes at roughly 2.4 as $k$, the number of order statistics used by the Hill estimator, increases. This value of 2.4 will be used later to construct a network subgraph with a power-law degree distribution similar to the degree distribution of the entire social network.
The distribution of the local clustering coefficients, illustrated in Figure 5, indicates that a large proportion of the vertices have a low clustering coefficient. For users who follow a celebrity account, it is unlikely that they would also follow strangers who follow the same account. The small uptick at the end of the tail is likely attributable to small components of two or three users that are disjoint from the largest connected component, but connected to each other. These groups likely consist of users who were informed through another medium and decided to share the information with their close friends on Twitter.

The average degree of a vertex's neighbors indicates how connected a given vertex's neighbors are to the rest of the graph. Figure 7 illustrates that when a vertex's degree is relatively low, the variation in the average degree of its neighbors is very high. This suggests that low-degree vertices are as likely to connect to low-degree vertices as they are to very high-degree vertices. However, as the degree values increase, the average neighbor degree converges to roughly 125 with less variance, which may be explained by the small number of higher-degree nodes. Figure 7 visually supports a negative correlation between the degree of a vertex and the degree of its neighbors which, in turn, complements the negative assortativity of -0.135 (Table 1) for the entire observed Twitter network.

Social Network Sampling and Inference
Sampling the large social network is done to obtain representative subgraphs that are small enough to be used in subsequent modeling of information diffusion processes. As mentioned in Chapter 1 and Chapter 3, the network of Twitter followers under consideration is too large to be analyzed using currently available software packages and, therefore, must be sampled for further modeling.
The following sampling methods will be used to sample the original social network and will be compared in terms of their efficiency at producing subgraphs with a structure similar to the original social network: induced subgraph sampling, incident subgraph sampling, snowball sampling, power-law degree distribution sampling, Barabasi-Albert preferential attachment modeling, and Metropolis-Hastings Random Walk (MHRW) sampling.

The induced subgraph sampling method produces graphs that are sparse and disconnected. A clustering of vertices takes shape in the center of the graph, but there are only a few clusters and they consist of a relatively small number of nodes compared to the entire subgraph (see Figure 8). This small amount of clustering is expected since a very large number of the nodes available for sampling have a low degree. Intuitively, the likelihood that two nodes randomly sampled from roughly half a million nodes are connected or have common neighbors is low.
Thus, it is not surprising that the degree distribution is heavily weighted towards a degree of zero because the likelihood that a given node from the sampled subset is connected to another node in the subset is also low. While the shape of the distribution is ostensibly similar to the original graph's degree distribution, the network graph visualization indicates that the very large proportion of zero-degree nodes does not yield a subgraph that is representative of the original graph.
All sampling methods are random in nature. In order to assess the performance of the proposed sampling methods, one hundred graphs were randomly generated using each sampling method. The tables corresponding to each sampling method display averages and confidence intervals of the graph characteristics over all one hundred graphs. Table 2 shows that induced subgraph sampling produces graphs with very low clustering coefficients, negative assortativity, and a density that is very close to zero, on average. Density and assortativity values for these graphs are reasonably close to the characteristics of the original Twitter network, but the clustering coefficient is significantly lower. Since clustering characteristics are very important when modeling information diffusion, a sampling method that produces graphs with such dissimilar clustering coefficients can be problematic. Thus, other sampling methods will need to be employed in an attempt to obtain a more representative social network.

Table 2: Induced Subgraph Characteristics. Estimates obtained from averaging the graph characteristics of 100 induced subgraphs sampled from the original social network.
The incident subgraph sampling method randomly selects edges from the roughly fourteen million edges present in the entire social graph. Once a subset of edges is selected, it collects all nodes that comprise the edges in the subset.
Finally, it produces a graph of the edges and nodes from their respective subsets. Figure 9 suggests that almost all of the edges in the graph are disjoint and connect only to a few vertices. A lack of clustering, highlighted by the average clustering coefficient summarized in Table 3, is even more apparent in this sampling technique since it is highly unlikely to sample edges that share common vertices. This is mostly because of the extremely large number of edges that are all equally likely to be selected. Also, one particular feature that is present in all power-law networks but absent in the subgraph obtained using incident sampling is the existence of highly connected nodes, sometimes referred to as super-spreaders in epidemiological literature. These vertices play a crucial role in the spread of information on a social media graph because of their ability to infect a very large number of other users. The absence of these types of vertices means that another method, better suited to adequately sampling the social network, is required. Table 3 shows that the average connectivity of the nodes in incident subgraphs is very low, 0.29. This value is not a good estimate because it implies that a large number of the users in the social network have no connections to any other users.
The global clustering coefficients for the sampled graphs are all very near to zero, and the assortativity values are also very close to zero. These ranges of values for one hundred sampled subgraphs are far removed from the true values of the original social network.

Table 3: Incident Subgraph Characteristics. Estimates obtained from averaging the graph characteristics of 100 incident subgraphs sampled from the original social network.
The snowball sampling technique, which randomly selects a subset of vertices from the original graph and includes all neighboring vertices of the selected vertices, creates a graph that is in some ways an improvement over previously discussed methods. Because all the neighbors of the randomly selected vertices are included, the clustering of this graph will be much higher than that of previously discussed methods, since every selected vertex is sampled together with its cluster of neighbors. The disadvantage of snowball sampling a graph is that the likelihood that these neighborhood clusters have connections to other randomly selected clusters is very low. The chances of connections between clusters are higher than the chances of clustering in induced and incident subgraph sampling methods, but they are still very low compared to the original graph.
In terms of obtained graph characteristics, the average global clustering coefficient of the snowball sampled graphs is significantly greater than the clustering coefficient of the original graph (Table 4). Intuitively, this is expected because every seed vertex in the snowball subgraph is sampled together with all of its neighbors. The average of all of these local clustering coefficients will produce a global clustering coefficient that is higher than the original graph's coefficient. As shown in Figure 10, the clusters of vertices surrounding all the randomly selected seed vertices are quite dense. However, all of these clusters are disjoint, which is not representative of the original social network.
Unlike network graphs obtained using the snowball sampling algorithm, the original social network is almost entirely connected and the largest connected component is only a few vertices smaller than the entire set of vertices in the whole graph (Table 1). The problem of disjoint clusters contributes to the higher than desired clustering coefficient (Table 4). This, in turn, would prove to be problematic when running information diffusion models on the subgraph because the information will not be able to disseminate past the cluster in which the infection began. This will contradict the dynamics of the actual information spread on the almost fully connected original social network. The redeeming aspect of a snowball sampling algorithm applied to a power-law network is that it is much more likely, as shown in Figure 10, to capture super-spreader nodes from the original graph. The snowball subgraph possesses a node with a degree of roughly 1,200. As mentioned earlier, these nodes are crucial when modeling information diffusion.

Sampling degree values from a power-law distribution that is similar in shape to the degree distribution of the entire social graph requires the degree sequence values and the exponent estimate, $\alpha$, by which frequency decreases as a function of degree. As shown by the Hill plot (Figure 4), the exponent estimate stabilizes at $\alpha = 2.4$, and this value will be used when creating the power-law distribution from which degree values will be sampled. The unique degree values from the full social graph form the degree sequence, and each value is raised to the power of -2.4.
This means that as degree increases, the probability of selecting that degree value will decrease. The subgraph that is sampled from this distribution produces level clustering ( Figure 11) and the degree distribution is still exponentially decreasing as a function of degree. This graph modeling method has a fixed number of vertices, N V , and a varying number of edges, N E .
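The degree-sampling step described above can be illustrated with a minimal sketch. It assumes that each unique degree value k is weighted by k^(-α) with α = 2.4 and that the weights are normalized into selection probabilities. The function names and the exclusion of zero-degree values are assumptions made for illustration, and the step that wires the sampled degrees into a graph is not shown.

```python
import random


def selection_probabilities(unique_degrees, alpha=2.4):
    """Map each positive degree k to a probability proportional to k**(-alpha)."""
    degrees = sorted(k for k in unique_degrees if k > 0)  # k = 0 has no finite weight
    weights = [k ** (-alpha) for k in degrees]
    total = sum(weights)
    return {k: w / total for k, w in zip(degrees, weights)}


def sample_power_law_degrees(unique_degrees, n_vertices, alpha=2.4, seed=None):
    """Draw one degree value per vertex, favoring small degrees."""
    rng = random.Random(seed)
    probs = selection_probabilities(unique_degrees, alpha)
    degrees = list(probs)
    weights = [probs[k] for k in degrees]
    return rng.choices(degrees, weights=weights, k=n_vertices)
```

Under these assumptions, a degree of 1 is selected far more often than a degree of 8, which mirrors the statement that the probability of selecting a degree value decreases as degree increases.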
The average degree value for the power-law networks, 1.99, is significantly lower than the average degree value of the original social network, 65. The average of the global clustering coefficients is very near zero and significantly different from the global clustering coefficient of the social network. The assortativity confidence interval for the sampled graphs contains zero, indicating that there is no significant assortative mixing in these types of graphs.
Table 5: Power-Law Sampling Characteristics. Estimates obtained from averaging the graph characteristics of 100 power-law graphs.
The Barabasi-Albert preferential attachment network modeling produced the graph visualized in Figure 12. The left panel of Figure 12 shows a graph with large, dense clusters around a small number of nodes. Recall that this method creates graphs by iteratively adding one node to the graph until a specified vertex count is reached. The newly introduced nodes are more likely to connect to higher degree nodes. This preferential attachment model produces graphs that possess a small number of highly connected nodes. While this is certainly a characteristic of the original network, the clustering in the rest of the graph is very low and connections between the nodes that attach to the preferred nodes are almost nonexistent. These characteristics contribute to the graph's very low clustering coefficient, which is near zero. Although this method was able to construct the very high degree nodes, the very low clustering coefficient prompts further analysis of other sampling techniques to determine if another method is preferable.
Table 6: Barabasi-Albert Sampling Characteristics. Estimates obtained from averaging the graph characteristics of 100 Barabasi-Albert preferential attachment graphs.
The Barabasi-Albert preferential attachment model requires a predetermined number of vertices, N_V, and the resulting number of edges, N_E, will be N_V − 1. This is true because at every iteration one new node is added and it makes one connection to an existing node, but the first node introduced has no existing node to which it can connect. This results in a graph with an edge count that is one less than the vertex count.
The MHRW sampling method, described in Algorithm 1, produces graphs by analyzing the ratio of a node's degree to the degree of one of its randomly selected neighbors. If this ratio is larger than a value randomly selected from a continuous uniform distribution on [0,1], the algorithm includes the neighboring node and edge, and then restarts the selection process from the newly selected node.
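The acceptance rule described above can be sketched as follows. This is a minimal illustration and not a transcription of Algorithm 1. The adjacency-list representation, the function name `mhrw_sample`, the stopping rule based on a target node count, and the `max_steps` safeguard are all assumptions introduced here. The graph is assumed to be undirected, so every neighbor has a degree of at least one.

```python
import random


def mhrw_sample(adjacency, start, n_nodes, seed=None, max_steps=100_000):
    """Walk the graph, accepting a move to a random neighbor when
    deg(current) / deg(neighbor) exceeds a Uniform[0, 1) draw."""
    rng = random.Random(seed)
    current = start
    nodes = {start}
    edges = set()
    for _ in range(max_steps):
        if len(nodes) >= n_nodes:
            break
        neighbors = adjacency[current]
        if not neighbors:  # isolated start node: nothing to sample
            break
        candidate = rng.choice(neighbors)
        ratio = len(adjacency[current]) / len(adjacency[candidate])
        if ratio > rng.random():
            # Accepted: keep the neighbor and the traversed edge,
            # then continue the walk from the neighbor.
            nodes.add(candidate)
            edges.add(frozenset((current, candidate)))
            current = candidate
    return nodes, edges
```

Because moves toward lower-degree neighbors are always accepted while moves toward higher-degree neighbors are accepted only some of the time, the walk is not dominated by hubs, and every sampled node is attached to the rest of the sample by at least one edge.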
The graph network illustrated in Figure 13 is a product of a MHRW sampling method applied to the original social network.
The normalized mean square error statistic is used to compare distributions using the difference in the proportion of nodes with a value less than or equal to a given value for both distributions. As the characteristic value increases, the difference between the two proportions is expected to approach zero because both proportions approach 1. Given this, the closer a given curve is to the x-axis, the better the fit. Another way to think about this is that if the original degree distribution were compared to itself, the difference would be zero for every degree value and would produce a line that runs along the x-axis.
As seen in Figure 14,
Figure 14: Normalized mean square error (NMSE) of sample graph degree distributions as a function of degree index, k. NMSE is computed by taking the squared difference between the proportion of nodes in a sampled graph with a degree less than or equal to k and the proportion of nodes in the original graph with a degree less than or equal to k.
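The comparison described in the caption can be sketched in a few lines. The sketch computes, for each threshold k, the squared difference between the two cumulative proportions. The normalization step is not spelled out in the text, so it is left out here, and the function names are assumptions made for illustration.

```python
from bisect import bisect_right


def empirical_cdf(values, thresholds):
    """Proportion of values less than or equal to each threshold k."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, k) / n for k in thresholds]


def squared_cdf_difference(sample_values, original_values, thresholds):
    """Pointwise squared gap between the sample and original proportions."""
    sample_cdf = empirical_cdf(sample_values, thresholds)
    original_cdf = empirical_cdf(original_values, thresholds)
    return [(s - o) ** 2 for s, o in zip(sample_cdf, original_cdf)]
```

Comparing a distribution to itself returns zero at every k, and any two distributions agree at the largest k because both proportions equal 1 there. The same function applies unchanged to local clustering coefficients or local densities if the thresholds are taken on [0, 1].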
A NMSE analysis is also performed on the distributions of the local clustering coefficients to determine which sampling method produced a distribution of local clustering coefficients most similar to that of the original network. Figure 15 shows the MHRW and snowball sampling methods noticeably outperforming the other sampling methods. The NMSE plot for the snowball sampling method performs very well for small clustering coefficient values, but then rises above the rest of the plots when the clustering coefficient approaches .2. The MHRW sampling method produces a NMSE plot that performs slightly worse than the snowball method for local clustering coefficients between 0 and .1, but performs better once the clustering coefficient is beyond .1.
Figure 15: Normalized mean square error (NMSE) of sample graph local clustering coefficient distributions as a function of all possible clustering coefficient values, i.e., zero to one, and denoted as k. NMSE is computed by taking the squared difference between the proportion of nodes in a sampled graph with a clustering coefficient less than or equal to k and the proportion of nodes in the original graph with a clustering coefficient less than or equal to k.
Figure 16: Normalized mean square error (NMSE) of sample graph local density distributions as a function of all possible density values, i.e., zero to one, and denoted as k. NMSE is computed by taking the squared difference between the proportion of nodes in a sampled graph with a density less than or equal to k and the proportion of nodes in the original graph with a density less than or equal to k.
A NMSE analysis of the density distributions (Figure 16) was also performed. The distribution of the local density of a graph's nodes is compared to the local density distribution of the original social network. Induced subgraph sampling performs very poorly, relative to the other sampling methods, when attempting to preserve the density distribution of the original network. The reason for this poor fit is that a large percentage of the nodes have a degree of zero and are not connected to any part of the graph. Snowball sampling performs better than incident, but is still noticeably worse than the other four methods. The performance of the other four methods, namely Barabasi-Albert, Induced, Power-Law, and MHRW, is comparable.
Given the analyses performed in this section, the three graph models on which information diffusion models will be run are the MHRW graph, the Barabasi-Albert preferential attachment graph, and the power-law graph. Only these three models provided enough similarity to the original Twitter network in terms of topological characteristics. Of these three, the MHRW graphs can be considered the best because they also possess clustering coefficients and an assortativity that are very similar to those of the original network.

Modeling the News of the Higgs-Boson Discovery on Twitter
The discovery of the Higgs boson was a monumental event in scientific history, and such news can be expected to spread through a social network of people in the scientific community. The results below illustrate how the spread of this news across Twitter evolved over the course of a week. The three types of activity, namely retweeting, mentioning, and replying, are plotted separately in Figure 17.
The results focus on the intensity of activity over time and how these actions on Twitter spread throughout the social network. Overall, Figure 17 shows the frequency with which each action occurred over the course of the week. For all three types of activity, it is clear that there is a surge in activity in the middle of the week with periods of relatively low activity at the beginning and end of the week. It is important to note that while all three actions have a very similar distribution, the number of replies is much smaller than that of mentions or retweets.
Until this point of the results section, the analyses of the network characteristics and the analyses of the temporal dynamics of the information spread have been separated. Moving forward, the two will be analyzed in conjunction to understand how the information diffuses over the social network. Figure 19 shows the proportion of users that have become aware, through one of the three actions, of the scientific discovery. Over the first few days, the news does not reach a large proportion of the network, but in the middle of the time period the proportion of people aware of the discovery sharply increases. The sigmoidal shape of this curve is similar to the proportion of people who become infected in a SI epidemic model. This similarity further supports the use of an epidemic model when attempting to quantify the scope of information diffusion on the social network.
Figure 19: Proportion of Users Discovering The Higgs-Boson News Over Time as a Function of Time in Days.
As discussed in Chapter 3, the transmission rate, β, is an important factor when modeling information diffusion as an epidemic process. Susceptible-Infected, network-based epidemic models were run on the MHRW, the Barabasi-Albert, and the power-law sampled graphs with the transmission rate ranging from zero to one. The minimum of each MSE curve is the transmission rate that produces the model that best fits the actual spread of information across the social network. For example, Figure 20 shows the MSE curve produced for information diffusion models on a MHRW graph for transmission rates ranging from zero to one. The minimum was reached at β = .056. A transmission rate of .056 indicates that at every iteration of the process, every neighbor of an informed node has a 5.6 percent chance of retweeting the information of their informed neighbor.
Figure 20: RMSE of SI models as a function of the transmission rate, β, for three graphs generated from sampling techniques. The minimum of each line represents the transmission rate at which a SI model on a particular graph best fits the actual proportion of Twitter users becoming aware of the Higgs boson news (Figure 19). The MHRW RMSE has the smallest minimum value of the three RMSE plots and has the best fit to the actual proportion of users becoming aware over time. MHRW minimum at β = .056.
Figure 21: SI epidemic processes run across three of the sampled graph types. The black line is the actual information diffusion process across the entire Twitter network. This network is used to compute the mean square error for each process by comparing the proportion of informed users at each timestamp. The processes on the Barabasi-Albert and power-law models have a transmission rate of 1. The transmission rate for the process on the MHRW graph is .056.
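The fitting procedure described above can be sketched as a network-based SI simulation followed by a grid search over β. This is a minimal illustration under stated assumptions: a single seed node, synchronous updates in which each informed node independently informs each uninformed neighbor with probability β per iteration, and averaging over several runs before computing the mean square error. The function names, the number of runs, and the adjacency-list representation are hypothetical choices, not details taken from the analysis.

```python
import random


def run_si(adjacency, seed_node, beta, n_steps, rng):
    """Return the proportion of informed nodes after each iteration."""
    informed = {seed_node}
    proportions = []
    for _ in range(n_steps):
        newly_informed = set()
        for node in informed:
            for neighbor in adjacency[node]:
                # Each informed-uninformed contact transmits with probability beta.
                if neighbor not in informed and rng.random() < beta:
                    newly_informed.add(neighbor)
        informed |= newly_informed
        proportions.append(len(informed) / len(adjacency))
    return proportions


def mse(simulated, observed):
    """Mean square error between two curves of equal length."""
    return sum((s - o) ** 2 for s, o in zip(simulated, observed)) / len(observed)


def best_beta(adjacency, seed_node, observed, betas, n_runs=20, seed=0):
    """Score each candidate beta by the MSE of its average simulated curve."""
    rng = random.Random(seed)
    scores = {}
    for beta in betas:
        runs = [run_si(adjacency, seed_node, beta, len(observed), rng)
                for _ in range(n_runs)]
        mean_curve = [sum(step) / n_runs for step in zip(*runs)]
        scores[beta] = mse(mean_curve, observed)
    return min(scores, key=scores.get), scores
```

In this sketch, `observed` would hold the daily proportion of informed users from Figure 19, and `betas` would be a grid of candidate transmission rates between zero and one. The returned scores trace out an error curve of the kind shown in Figure 20.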
Next, a SI epidemic model process was run on each type of graph using the transmission rate at which that graph type had its minimum on the MSE plot: β = .056 for MHRW, and β = 1 for Barabasi-Albert and power-law sampling. The diffusion models are depicted in Figure 21 and superimposed on the original information spread (black curve). The SI model on the MHRW graph most closely fits the actual spread of information on the Twitter network. The processes on the power-law and Barabasi-Albert graphs take longer than seven iterations to begin to spread to significant portions of the graph. If the number of iterations were extended beyond seven, the processes on these two graph types would look similar in shape to the original spread on the Twitter network. For example, Figure 22 illustrates epidemic processes run on a Barabasi-Albert preferential attachment graph and a power-law graph. These processes show that the increase in the proportion of informed users is similar to that of the process on the MHRW graph, but that a larger number of iterations is required. The sigmoid, or S-shaped, growth curve is common to processes on all three graphs, but the Barabasi-Albert and power-law graph processes do not begin to increase until the seven-day period of Twitter news spread has already passed.
Figure 22: Total number of iterations for epidemic processes run across the Barabasi-Albert and power-law graphs with β = .4. The number of iterations visualized was extended from 7 to the total number required to reach the entire graph to illustrate that the processes still conform to a shape similar to the MHRW and actual information diffusion curves (a sigmoid curve), but take more iterations to complete.
In summary, this chapter details the results of sampling algorithms and information diffusion processes. The six sampling algorithms were evaluated by comparing various sample graph characteristics to those of the original Twitter graph. The information diffusion processes were evaluated by comparing the actual proportion of informed users over time to the proportions estimated in epidemic models. In the next chapter, these results are discussed and topics for future work and continued analysis are presented.
Characterization of the Twitter network, related to the spread of the Higgs boson news, has revealed information about the structure of an online social network. Specifically, it has been found that the degree distribution closely follows a power law function. This fact indicates that most users have a relatively small number of connections to other users, but that there are a relatively small number of users that possess an extremely large number of connections to other users.

These nodes are most likely celebrity accounts belonging to prominent scientists or authority figures in the quantum physics world, or possibly official accounts of organizations involved in the discovery, e.g., CERN. The negative assortativity value for the Twitter network also supports this idea, since more popular users connect to a large number of nodes that are less connected.
The six sampling algorithms used to construct prototypic networks have offered a wide range of graph topologies. Specifically, incident and induced sampling methods have failed to produce networks with enough connectivity between sampled nodes. Snowball sampling has provided graphs with clustering coefficients higher than that of the observed Twitter network. While an improvement over incident and induced subgraph sampling, the dense clusters of nodes produced by snowball sampling are disjoint from one another, which presents obvious problems when attempting to model epidemic processes. Barabasi-Albert preferential attachment and power-law graph models have created graphs that were more representative, based on NMSE characteristic analysis, than the previous three methods. However, the MHRW algorithm has managed to produce the most representative prototypic network of the six sampling algorithms, both in terms of connectivity and cohesion of the nodes.
The network-based SI epidemic process has been modeled over the MHRW graph, the Barabasi-Albert preferential attachment graph, and the power-law graph. The process that best fit the actual spread of information over time is the epidemic model on the MHRW graph with β = .056. Evidently, the MHRW graph was the only sampled graph on which the process exhibited a sharp increase in the proportion of informed nodes similar to the actual information diffusion over the seven-day period. This is attributed to the characteristic similarities between the sample graph and the actual graph, namely the clustering coefficient, the assortativity, and the density.

Future Work
It is worth noting that the epidemic process modeled in this thesis only accounts for retweet behavior, but not the mentioning or replying behavior of users. Replies are more of a conversation between users and do not necessarily represent the spreading of information from one user to another. The reply network may, however, be used to predict links between users to identify if users are more likely to respond to the tweets of people with whom they have common interests or friends.
Users have the ability to mention another Twitter user in a posting. This can also be thought of as a means of spreading information to other Twitter users. However, the method by which the information is spread is fundamentally different from retweeting because users do not decide to be mentioned in the same way that they decide to retweet another user's tweet. Modeling this behavior in conjunction with the retweeting behavior would involve creating a new stochastic process that incorporates two transmission rates and two neighborhoods.
where M_i(x) is another type of neighborhood that can represent the group of people likely to be mentioned by an informed person. This type of neighborhood could be determined using community detection methods that identify communities of nodes [1]. These communities could serve as a likely subset of nodes that could be mentioned by a particular user.
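One way such a process might look is sketched below. This is purely illustrative and does not reproduce any expression from the thesis. It assumes one iteration in which an informed user can inform followers through retweeting at one rate and members of a mention neighborhood at a second rate. The names `two_channel_step`, `beta_retweet`, and `beta_mention`, and the dictionary-based neighborhoods, are hypothetical.

```python
import random


def two_channel_step(informed, retweet_nbrs, mention_nbrs,
                     beta_retweet, beta_mention, rng):
    """Advance the process by one iteration over both neighborhoods."""
    newly_informed = set()
    for node in informed:
        # Channel 1: neighbors who may retweet the informed user.
        for neighbor in retweet_nbrs.get(node, ()):
            if neighbor not in informed and rng.random() < beta_retweet:
                newly_informed.add(neighbor)
        # Channel 2: users the informed user may mention (hypothetical M_i).
        for neighbor in mention_nbrs.get(node, ()):
            if neighbor not in informed and rng.random() < beta_mention:
                newly_informed.add(neighbor)
    return informed | newly_informed
```

Keeping the two channels separate makes it possible to fit each rate independently and to replace the mention neighborhood with the output of a community detection method without changing the retweet logic.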
Varying transmission rates over time could also be incorporated into the epidemic model. The transmission rate, β, was a fixed parameter in this particular analysis, but varying the transmission rate by day could produce a better fitting model.
Information diffusion models on Twitter could also be applied to news that is not scientific, e.g., political stories, celebrity gossip, sports news, etc. [2]. Twitter offers a very wide array of news stories, and the dynamics of activity surrounding these types of stories, as well as the people who discuss them, may differ greatly from those surrounding a scientific discovery.