Epidemiology of Browser-Based Malware

The presence of personal financial data, intellectual property, and classified documents on university computer systems makes them particularly attractive to hackers, yet these systems are often not well prepared for attacks. The University of Rhode Island (URI) is one of the few institutions collecting network traffic data (NetFlow) for inference and analysis of normal and potentially malicious activity. This research focuses on web-based traffic with a client-server architecture and adopts simple probability-based transmission models to explore the vulnerability of the URI web network to anticipated threats. Because the URI firewall captures only traffic into and out of URI, the internal, unobserved traffic must be modeled. Relying on a set of intuitive assumptions, we simulate the spread of infection on the dynamic bipartite graph inferred from observed external and modeled unobserved internal web-browsing traffic, and we evaluate the susceptibility of URI nodes to threats initiated by random clients and by clients from specific countries. Overall, the results suggest higher rates of infection for client nodes compared to servers, with maximum rates achieved when infection is initiated randomly. Remarkably, very similar rates are observed when infection is initiated from 100 different clients from each of the selected countries (e.g., China, Germany, UK) or from the one most active node from Denmark. Interestingly, the daily analysis over a three-month period reveals that simulated infection rates that are inconsistent with the intensity of the traffic, together with patterns in network characteristics that depend on how nodes are related in the network, such as assortativity and the global clustering coefficient, may indicate the presence of compromised node activity and possible intrusion.


INTRODUCTION
The storage of student and faculty personal financial data, intellectual property, and some classified government documents on the computer systems of academic institutions makes them particularly attractive to hackers [5,6]. Open networks, expansive volumes of data, scientific research results, and the flexibility of public access expose university computer systems to cyber threats that, unfortunately, come with consequences. For example, in May 2017, a strain of ransomware called 'WannaCry' spread around the world, walloping millions of targets, including UK universities [10].
University College London (UCL) reported that the malware very likely spread passively from a 'compromised' website in the university system [14]. In July 2015, Harvard University announced a data breach that affected as many as eight of its colleges and administrative offices. At about the same time, the networks of six Japanese universities came under simultaneous cyber-attacks [9]. In March 2016, a breach in the library of Concordia University, Canada potentially impacted anyone who had used the affected computers in the previous year. Most recent cyber-attacks are web-based attacks. While there has been some attention paid to the problem of web-malware spread on institutional networks [7], very little research has been done to collect and analyze the network flow data of a university computer system. This type of analysis could be a valuable tool to understand the communication patterns of web-browsing participants (i.e., clients and servers) in such a system, to learn the mechanisms by which an epidemic spreads, to model the future course of epidemics in the context of existing threats on graphs with non-random structure, and possibly to alert a university's IT staff to a potential intrusion. To model the unobserved internal traffic, we build on an approach for generating random bipartite graphs [25]. This approach takes as input the node degree sequence for both layers and randomly generates a bipartite graph respecting those distributions. We adapt this approach to incorporate the overall external activity of URI servers and clients (i.e., the strength distribution) and the intensity of traffic over time, thereby modeling a dynamic bipartite graph.
To simulate malicious activity that can propagate from clients to servers and from servers to clients in a dynamic manner, we combine the observed external traffic and the modeled internal traffic to construct a dynamic bipartite network, which serves as the basis for an SI propagation model similar to the one described in [13]. We use the proposed simulation approach to evaluate the susceptibility of URI nodes to threats initiated by random clients and by clients from the specific countries with the most vigorous communication with URI (e.g., China, UK). We perform simulations varying the sets of parameters, the number of iterations, and the observation periods, and we employ parallel computing techniques to speed up the simulation process.
A central theme of this study includes the following goals: b. Understand the pattern of the fraction of infected nodes over time to predict the possibility of intrusion.

RELATED WORK
Network Flow data are records that represent aggregated traffic between two hosts.
The information saved in a network flow record includes the IP addresses and port numbers of the source and destination, the protocol type of the traffic, the volume of traffic sent, and various other attributes. The data is collected at a granularity that is optimal for tools that aim to enhance network security or provide network situational awareness [16]. General properties of network traffic have been studied intensely for many years [12,13,14,15,16,18]. The majority of these traffic analysis studies have focused on the packet level, IP flows, protocol information, and end-to-end behavior for the detection of anomalies. Virginia Tech (Blacksburg, VA) collected network flow data to perform research on malware propagation, but that research was based on a ring-based flow model involving packet and flow data [18]. IP-flow-level clustering of anomalies with similar behavior [13] was performed by researchers at the University of Wisconsin to show that anomalies can be exposed effectively when aggregated with a large amount of additional traffic. In [15], analyses based on the numbers of IP flows, bytes, and packets were employed to detect anomalies.
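As a concrete illustration, a single aggregated flow record of the kind described above might be represented as follows; the field names and values are hypothetical, chosen only to mirror the attributes listed in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    """One aggregated network flow record (illustrative fields only)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str     # e.g., "TCP" or "UDP"
    bytes_sent: int   # volume of traffic sent
    packets: int
    timestamp: float  # start of the flow, seconds since epoch

# A single hypothetical HTTP flow from a client to a web server:
rec = FlowRecord("131.128.1.10", 52311, "93.184.216.34", 80,
                 "TCP", 12840, 17, 1392220800.0)
```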
Rather than becoming overwhelmed by trying to examine each packet that traverses the network, in our study we look at higher-level trends of traffic flow across the network. These trends can reveal interesting patterns and provide useful information that may otherwise be "lost in the noise" of raw packet traces. Several analytical papers have presented work on visualization tools that can depict a wide range of information about the characteristics of an entire network on a single screen [16,17]. Though this study does identify network characteristics, our focus is mainly on evaluating the fraction of infected nodes over time using simulated epidemic spread on a bipartite network graph.
Epidemic modeling on graphs has been an area of intense interest among researchers working on network-based dynamic process models. Epidemic modeling is concerned with three primary issues: (i) understanding the mechanisms by which epidemics spread, (ii) predicting the future course of epidemics, and (iii) achieving an ability to control the spread of epidemics [23]. Below we provide a brief overview of results for a traditional epidemiological model, followed by analogous models that have emerged in the literature on network-based extensions.
Traditional epidemiological models are based on the assumption of population-wide random mixing; that is, each individual has a small and equal chance of coming into contact with any other individual. In practice, however, each individual has a finite set of contacts to whom they can pass infection. The ensemble of all such contacts forms a 'mixing network'. Models that incorporate network structure avoid the random-mixing assumption by assigning to each individual a finite set of permanent contacts to whom they can transmit infection and from whom they can be infected [24].
The most commonly used classes of continuous-time epidemic models are the susceptible-infected (SI) and susceptible-infected-removed (SIR) models. In the SIR model, a population of N individuals is divided into three states: susceptible (S), infective (I), and removed (R). In this context, "removed" means individuals who are either recovered from the disease and immune to further infection, or dead [19]. The model states that, at any given time t, a new infective will emerge from among the susceptibles (due to contact with and infection by one of the infected individuals) with instantaneous probability proportional to the product of the number of susceptibles s and the number of infected i. Similarly, infected individuals recover with instantaneous probability proportional to i. These probabilities are scaled by the parameters β and γ, usually referred to as the infection and recovery rates, respectively. The product form for the probability with which infectives emerge corresponds to an assumption of 'homogeneous mixing' among members of the population, which asserts that the population is (i) homogeneous and (ii) well mixed, in the sense that any pair of individuals is equally likely to interact.

The underlying assumption of homogeneous mixing is admittedly simple and, for many epidemic processes, too poor an approximation to reality. As a result, interest has turned increasingly towards 'structured population' models, in which the assumed contact patterns take into account some structure within the population of interest [19,23]. Models introduced in this area include independent household models, two-level mixing models, random network models, and social clustering models. The end effect of all of these models is, in one way or another, to impose restrictions on the contact structure within the population. Often it is convenient to represent this structure as a graph G = (V, E), where the vertices i ∈ V represent elements of the population and edges {i, j} ∈ E indicate contact between elements i and j. The contact implies the possibility of infection.
The lack of an edge between vertices indicates that no infection is possible between the two [23].
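The homogeneous-mixing SI/SIR dynamics described above are commonly summarized by the following system of differential equations, where s, i, and r denote the numbers of susceptible, infective, and removed individuals and β and γ are the infection and recovery rates:

```latex
\frac{ds}{dt} = -\beta\, s\, i, \qquad
\frac{di}{dt} = \beta\, s\, i - \gamma\, i, \qquad
\frac{dr}{dt} = \gamma\, i, \qquad s + i + r = N .
```

Setting γ = 0 recovers the pure SI model used later in this study, in which infected nodes never recover.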
Web-based communication networks are built on a client-server architecture and follow a bipartite graph structure, with two sets of nodes and edges existing only between nodes of different types. Epidemic behavior usually shows a phase transition in the parameters of the model: a sudden transition from a regime without epidemics to one with epidemics. Many of the really interesting cases of epidemic spreading take place on networks that have more structure, such as bipartite networks [19]. The study [21] represents the spread of sexually transmitted diseases in heterosexual populations and shows that the bipartite nature of the network must be taken into account to model the behavior of the epidemic threshold. Specifically, Gomez-Gardenes et al. demonstrate that the inclusion of the bipartite structure can strongly affect the epidemic outbreak and can lead to an increase of the epidemic threshold. The results also point out that the larger the population, the greater the gap between the predicted epidemic thresholds.
Another study [22] models vector-borne diseases, for which transmission occurs exclusively between vectors and hosts, on a bipartite network. The study states that the spreading of the disease strongly depends on the degree distributions of the two classes of nodes, and suggests that the approach is generalizable to other models. Modeling the epidemics of malware within networks in close to real time, however, still remains a fundamentally open task due to diverse networks and constantly changing attack patterns [18]. The above-mentioned studies serve as effective foundations for building an epidemiological model based on a bipartite network.

CLIENT-SERVER ARCHITECTURE
This study focuses on the analysis of web-based traffic using a client-server architecture. In this architecture: (a) clients are personal computers on which users run applications; (b) servers are powerful machines that provide multiple clients with data and services upon browser-generated requests. There is a fundamental difference in how clients and servers get infected [7].
Clients typically become infected when they visit a compromised site.
Depending on the infection type, the injected malware frequently enables an attacker to gain remote control over the compromised computer system and can be used to steal sensitive information, for example, personal documents, email passwords, and banking credentials. A compromised client, unaware of its infection, is able to transmit infections to multiple servers by means of web pages stored on these servers and accessed by the client.
Servers get infected when malicious content is injected into the websites they host through web-server security vulnerabilities in the operating system or installed software, user-contributed content (e.g., blogs, uploads), advertising (images, banners), and third-party content (widgets, scripts). Once infected, a server becomes a store of websites of which some portion is infected with malware.
Once a client or server is infected, the adversaries can even take control over the personal computer or server network. The keystrokes and other confidential transactions on the compromised system are at risk of being observed by remote adversaries. The sophistication of adversaries has increased over time, and exploits are becoming increasingly complicated and difficult to analyze [7].

GRAPH-BASED REPRESENTATION
As network flow data is relational in nature, it can be represented with a graph. In the static network represented in Figure 1, all the clients from the UK, India, and NYC, US are connected to URI server S1 without the time component taken into consideration. The static bipartite graph representation will be used to characterize the daily traffic in terms of graph structure. The dynamic network graph takes time into consideration: though clients C1 from the UK and C2 from India are both connected to server S1, they are represented separately at the different times t1 and t3. We use the dynamic graph to simulate the network and virus propagation in this paper. We can see the activity of nodes dropping during the weekends and rising back during mid-week, which provides some insight into the expected patterns of traffic on the university network.
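The distinction between the static and dynamic views can be sketched in a few lines of Python; the flow tuples below are hypothetical stand-ins for NetFlow records.

```python
from collections import defaultdict

# Hypothetical flows as (client, server, timestamp) tuples.
flows = [
    ("C1_UK", "S1", 1.0),   # t1
    ("C2_IN", "S1", 2.0),   # t2
    ("C1_UK", "S2", 3.0),   # t3: same client, later contact
]

# Static bipartite graph: collapse time, keep only who-talks-to-whom.
static_edges = {(c, s) for c, s, _ in flows}

# Dynamic graph: keep every timed contact, sorted by time, so the same
# client-server pair can appear at several distinct times.
dynamic_edges = sorted(flows, key=lambda f: f[2])

# Adjacency for the static bipartite view.
adj = defaultdict(set)
for c, s in static_edges:
    adj[c].add(s)
    adj[s].add(c)
```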

NETWORK CHARACTERISTICS
Examining the simulated data through a bipartite network identifies some network characteristics that are useful for understanding the distribution of nodes in the network and that eventually influence the infection spread on the network. Graph partitioning methods are useful precisely because these characteristics will often be unobserved [3]. The presence of high-risk nodes can be quantified through two network topology features: degree assortativity and clustering coefficient. The assortativity of a bipartite graph (r) is the correlation between the degrees of connected network nodes.

In general, r lies between −1 and 1.

Projection of Bipartite Graph
Given a bipartite graph with vertex sets V1 and V2 and edge set E, a graph G1 may be defined on V1 by assigning an edge to any pair of vertices in V1 that both have edges in E to at least one common vertex in V2. Similarly, a graph G2 may be defined on V2. Each of these graphs is called a projection onto its corresponding vertex subset [3]. If nodes 'a' and 'b' share at least one common destination, they are connected in the bipartite network projection.
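The projection rule just described can be sketched as follows; the client and server labels are hypothetical.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical bipartite edges: clients (V1) connect to servers (V2).
edges = [("a", "S1"), ("b", "S1"), ("b", "S2"), ("c", "S2")]

servers_of = defaultdict(set)
for client, server in edges:
    servers_of[client].add(server)

# Project onto V1: connect two clients iff they share >= 1 server.
projection = {
    (u, v)
    for u, v in combinations(sorted(servers_of), 2)
    if servers_of[u] & servers_of[v]
}
# 'a' and 'b' share S1; 'b' and 'c' share S2; 'a' and 'c' share nothing.
```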
In Figure 4, an example of a small bipartite graph with clients and servers is presented in the left panel, and its two one-mode projections in the right panel. The projection is used to compute some network analysis measures such as the clustering coefficient. The clustering coefficient of a projection graph is a measure of the degree to which nodes in the graph tend to cluster together. The value of the coefficient lies between 0 and 1 [3]. If the network is highly clustered, with a coefficient close to 1, the network forms more connected communities that tend to connect to the same nodes with high-density ties. When nodes form such a community, all of them, irrespective of their degree, are susceptible to infection. The global clustering coefficient is defined as

cl(G) = 3 × (number of triangles) / (number of connected triples),

where a triangle is a complete subgraph on three vertices and a connected triple is a connected subgraph consisting of three vertices and two edges. Each triangle contains three connected triples, explaining the factor of three in the formula. Intuitively, measuring the frequency with which connected triples 'close' to form triangles provides some indication of the extent to which edges are 'clustered' in the graph. The clustering coefficients have typically been found to be quite large in real-world networks [3].

To model the unobserved internal traffic, we sample:
1. with replacement, a server S̃ from the set of unique, active URI servers, proportionally to the strength of flows observed in the external traffic for S̃;
2. with replacement, a client C̃ from the set of unique, active URI clients, proportionally to the strength of flows observed in the external traffic for C̃;
3. without replacement, a timestamp t̃ from the set of timestamps recorded in the external traffic.

To ensure the uniqueness of the t̃'s, we add 0.5 seconds to each selected time. The sets of all unique selected servers and clients form the sets S̃ and C̃, respectively. Note that the proposed approach produces a dynamic bipartite graph that preserves important properties of the observed external graph structure. The size of the internal network is generated as a specified percentage of the size of the external network, where size is the number of data flows in the network. We simulate internal networks with three different sizes, 10%, 25%, and 50% of the size of the external network, and refer to each internal network by its relative size. In order to maintain consistent results, we first build the 50% internal network and form the 25% internal network from it; similarly, the 10% internal network is formed from the 25% internal network. Based on the understanding of how a university network is typically used, we expect to observe more external web traffic than internal traffic. This assumption, however, may not be valid for other organizations, such as the banking sector, where external communication is limited or restricted.
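The sampling steps above can be sketched as follows; the strengths, timestamps, and the function name are hypothetical illustrations, not the study's implementation.

```python
import random

random.seed(7)

# Hypothetical external-traffic strengths (number of observed flows)
# for active URI servers and clients, and recorded timestamps.
server_strength = {"S1": 50, "S2": 30, "S3": 20}
client_strength = {"C1": 60, "C2": 25, "C3": 15}
timestamps = [10.0, 12.5, 14.0, 15.5, 20.0]

def sample_internal_flows(n):
    """Sample n internal flows: servers and clients with replacement,
    proportionally to strength; timestamps without replacement."""
    servers = random.choices(list(server_strength),
                             weights=list(server_strength.values()), k=n)
    clients = random.choices(list(client_strength),
                             weights=list(client_strength.values()), k=n)
    times = random.sample(timestamps, k=n)   # without replacement
    flows, seen = [], set()
    for s, c, t in zip(servers, clients, times):
        while t in seen:
            t += 0.5        # shift by 0.5 s to keep timestamps unique
        seen.add(t)
        flows.append((c, s, t))
    return flows

internal = sample_internal_flows(3)
```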

Figure 5 depicts how the external network, with URI and non-URI nodes, and the internal network, with URI nodes, are combined and sorted based on the time variable (t).

EPIDEMIC MODELING
In this section, we describe the epidemic model by assuming a set of clients C_i, i ∈ {1, 2, …, N_C}, and a set of servers S_j, j ∈ {1, 2, …, N_S}, with the corresponding probabilities of infection and susceptibility:

α_S(j) = P(S_j infected), β_S(j) = P(S_j susceptible),
α_C(i) = P(C_i infected), β_C(i) = P(C_i susceptible),

and the transmission probabilities computed as follows:

p_CS(i, j) = P(C_i → S_j) = α_C(i) × β_S(j) × I{C_i infected},
p_SC(j, i) = P(S_j → C_i) = α_S(j) × β_C(i) × I{S_j infected}.
Then the fraction of infected servers and clients at time t is defined as

f_S(t) = Ni_S(t) / N_S and f_C(t) = Ni_C(t) / N_C,

where Ni_S(t) and Ni_C(t) are the numbers of infected servers and clients, respectively. Similarly, for URI nodes,

f_Suri(t) = Ni_Suri(t) / N_Suri and f_Curi(t) = Ni_Curi(t) / N_Curi,

where Ni_Suri(t) and Ni_Curi(t) are the numbers of infected URI servers and clients, respectively.
In what follows in Section 4, we adopt the outlined probability-based transmission model. In Figure 6, infection is initially introduced into the network from clients C1 and C2; the infection spreads to server S3 because its level of infection is less than the P value. Server S1 is not infected because its level of infection is higher than the P value. In the final data flow, the infected server S3 infects client C3, whose level of infection is lower than the P value. Thus, infection spread and propagation are analyzed through clients' direct connections to servers and their indirect connections to other clients via shared servers.
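One way to realize this threshold-based SI propagation on a time-sorted flow list is sketched below; the flows, the seed nodes, and the per-node "level of infection" draw are illustrative assumptions, not the exact implementation used in the study.

```python
import random

random.seed(42)

P = 0.5  # transmission probability threshold

# Hypothetical time-sorted flows (client, server, time).
flows = [("C1", "S1", 1.0), ("C2", "S3", 2.0),
         ("C3", "S3", 3.0), ("C3", "S1", 4.0)]

nodes = {n for c, s, _ in flows for n in (c, s)}
# Each node draws a fixed "level of infection" in [0, 1); a susceptible
# node becomes infected on contact with an infected node iff level < P.
level = {n: random.random() for n in sorted(nodes)}

infected = {"C1", "C2"}   # infection seeded at two clients
for c, s, _ in flows:     # flows are processed in time order
    if c in infected and s not in infected and level[s] < P:
        infected.add(s)
    elif s in infected and c not in infected and level[c] < P:
        infected.add(c)

clients = {c for c, s, _ in flows}
frac_clients = len(infected & clients) / len(clients)
```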

PARALLEL COMPUTING
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time using multiple processors [1]. As we consider more than one probability of infection (P = 0.1, 0.3, 0.5) in this study, we use parallel computing to compute the infection propagation for each value of P.
This saves computation time. In Figure 7, the graph presents the computation time taken using different numbers of cores. The sequential computation with the function lapply takes less time than a for loop. The best result of 9.093 minutes is achieved using 4 cores on an i5 quad-core computer system. Nearly a 600% speed-up is achieved using parallel computing methods and packages in R such as doParallel.
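The study parallelizes over P values in R with doParallel; an analogous per-P parallelization can be sketched in Python with the standard library. The run_simulation body is a deterministic placeholder standing in for one full propagation run.

```python
from multiprocessing import Pool

P_VALUES = [0.1, 0.3, 0.5]

def run_simulation(p):
    """Placeholder for one full propagation run at threshold p;
    returns (p, simulated fraction of infected nodes)."""
    return p, round(0.6 * p + 0.1, 3)   # deterministic stand-in

if __name__ == "__main__":
    # The runs are independent, so they parallelize trivially:
    # one worker per P value, mirroring the per-P setup described above.
    with Pool(processes=len(P_VALUES)) as pool:
        results = dict(pool.map(run_simulation, P_VALUES))
    print(results)
```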

List of References
[1] Gordon, Max, "How to go parallel in R," G-Forge, Feb. 2015.

CHAPTER 4 PERFORMANCE EVALUATION
We address three types of results in this section: Graph-Based Characterization, Propagation of Infection, and Effects of Time.

GRAPH-BASED CHARACTERIZATION
This section starts with the structural characteristics of internal, external and combined networks formed from the network flow data over one day, Wednesday, 02-12-2014. The choice of the day is based on the high volume of traffic expected in the middle of the week during regular school time. Using the data, we form three networks (internal, external and combined) and compute structural graph characteristics (see Table 2).
The number of unique clients and servers in each network type (e.g., external and combined) indicates how many nodes of each type are active, while the number of data flows determines the size of the network and the total number of connections. The strength of clients and servers determines connectivity in terms of the average number of connections observed/modeled for clients and servers. In the case of the internal network, the strength of servers is higher than that of clients, since more clients connect to fewer servers. In the external and combined networks (see Table 2), the strength of servers and clients is similar, as there are almost the same numbers of servers and clients. The presence of high-risk nodes can be quantified through two network topology characteristics: degree assortativity and clustering coefficient. The degree assortativity measures the likelihood that nodes preferentially connect to other nodes with similar degrees. The negative degree assortativity of all the networks, particularly the internal networks, suggests a high chance of more popular nodes connecting to less active nodes.
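Degree assortativity as used here, the Pearson correlation of the degrees at the two ends of each edge, can be computed directly; the toy star-shaped network below is hypothetical and yields a negative r, the disassortative shape just described.

```python
from collections import Counter

# Hypothetical star-like traffic: one popular server and several
# low-degree clients.
edges = [("C1", "S1"), ("C2", "S1"), ("C3", "S1"),
         ("C4", "S1"), ("C4", "S2")]

deg = Counter()
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

# Degree assortativity r: Pearson correlation of the degrees at the
# two ends of each edge, counting every edge in both directions.
xs = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
ys = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
var_x = sum((x - mx) ** 2 for x in xs)
var_y = sum((y - my) ** 2 for y in ys)
r = cov / (var_x * var_y) ** 0.5
# r is negative: high-degree nodes attach mostly to low-degree nodes.
```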
The values of the clustering coefficient obtained from the projection graphs, above 0.5 and close to 1, indicate the presence of clustered communities of clients that share the common servers they connect to, and clustered communities of servers that tend to be connected to by the same clients. Overall, these results suggest that all the nodes in the network contribute to the propagation of infection to some extent. The log-log plot of the node degree distribution for URI clients and URI servers (Figure 8) shows a heavy-tailed distribution.

PROPAGATION OF INFECTION
In this section, we describe several experiments that explore the rate of infection of nodes to understand infection spread on the networks. To conduct these experiments, we consider the following networks: (1) only the external network derived from network flow data, (2) only the internal network with different percentages of total flows (10%, 25%, and 50%), and (3) the combined network with the external network and the 10% internal network. The rate of infection is calculated for each transmission probability (P) value, set equal to 0.1, 0.3, and 0.5. We initiated the infection with 100 randomly selected clients from the list of: (1) all unique clients, (2) unique URI clients, (3) unique clients from Canada, (4) unique clients from China, (5) unique clients from India, and (6) unique clients from the UK. To address a special case of unusual activity coming from Denmark, we also initiate the infection from a single node.
In the case of the external network, URI clients are not connected to URI servers directly, resulting in some unrealistic zero rates of infection (Table 4). When we initially infect the network with 100 clients from each of the five countries mentioned in Table 4, the rates of infection for all types of nodes are almost equal, with expected variability of less than 5%. We started the experiment with the hypothesis that initiating infection with 100 clients from countries with a history of attacks would target more important URI nodes and promote the spread of infection, whereas initiating infection with 100 random clients would not target any particular node of interest. The results summarized in Table 4 for India, China, and the UK clearly do not support this hypothesis. Overall, the results demonstrate higher rates of infection for client nodes compared to servers, with maximum rates achieved when infection is initiated from random nodes. At the same time, the results show very similar rates when infection is initiated from 100 different clients from each of the selected countries (e.g., China, UK) and from the one most active node from Denmark. In further experiments, we plan to investigate the infection rates when propagation starts with the most/least active clients/servers and also to vary the number/proportion of nodes to start with. We have also conducted experiments on the combined network by infecting 100 randomly selected servers from the list of: (1) all unique servers and (2) unique URI servers sampled from the total number of servers. Table 5 presents the resulting rates of infection. When we compare the results in Tables 4 and 5, the rates of infection on one fixed internal network estimated over multiple iterations with p = 0.1 showed less than 1% variability for servers and clients and less than 2% variability for URI servers and URI clients, respectively. The mean and standard deviation of the rates of infection on the combined network are presented in Table 6.
We observed comparable variability when conducting the analysis on the combined network with the external network and a variable 10% internal network over multiple iterations. We have also performed the analysis on combined networks with (1) the external and 25% internal network and (2) the external and 50% internal network; however, we did not see any abnormality in those results.

EFFECTS OF TIME
In this section, we compute rates of infection over a period of time. In the quest to understand how rates of infection change over time, we conduct experiments using the 90-day data. The graphs in Figure 9 depict the analysis of the dataset over this period; however, we would need the real internal communication traffic in order to analyze the behavior and vulnerability of URI clients. To understand further how the daily rates of infection change over time after an initial infection, we conduct experiments using the data collected over the ninety-day period between February and May 2014. Figure 3 demonstrates the average activity of network nodes summarized separately for URI clients and servers, and Figure 10 presents the rates of infection estimated daily over the ninety-day period. By comparing Figure 3 and Figure 10, one can notice that, up until the middle of March, the estimated rates of infection followed a temporal weekly pattern somewhat consistent with the intensity of the traffic. For example, the fall in traffic intensity between 03/10/2014 and 03/14/2014, which can be explained by the spring break week at the university, can also be observed in the estimated rates of infection. During the time period between 4/22/2014 and 5/6/2014, however, the estimated rates depart from the pattern of the traffic collected during this period. Remarkably, the intensity of node activity after the spring break and before the attack did not indicate any suspicious pattern; at the same time, the rates of infection for URI servers show a clear departure from the expected behavior (Figure 10). This particular observation has led us to the following hypothesis: simulated infection rates that are not consistent with the intensity of the flow traffic may indicate the presence of compromised node activity and possible intrusion.
The dependency that caused the abnormality could be hidden in certain characteristics of the dynamic network, which need to be explored further (see Section 4.4). We have analyzed the average fraction of infected nodes per day of the week over the 90 days, and Figure 11 depicts the results. The weekends show less activity, and the weekdays, especially mid-week, show the maximum rates of infection, with Thursdays the highest. On average, the rates of infection are proportional to the intensity of traffic, and the grand averages indicate that URI clients are the most prone to infection at 31%, followed by URI servers at 24%, overall clients at 18%, and servers at 15%. We have further analyzed infection propagation per week over the 90-day period. This analysis helps us to understand the infection spread when infection propagation continues in the network throughout the week. The results demonstrate rates of infection of URI clients higher than those of URI servers. In Figure 10, we can see normal activity with respect to the rates of infection of servers, clients, and URI clients, whereas the rate of infection of URI servers was abnormally high between 3/17/2014 and 4/20/2014. Similar results can be seen in Figure 12 for the weekly propagation analysis.

NETWORK CHARACTERISTICS OVER TIME
In this section, we focus on the network characteristics of the internal traffic (10%) and the combined network per day over the 90-day period. Figure 13 depicts the unique numbers of URI clients and URI servers in the internal traffic, and Figure 14 presents the unique numbers of clients, servers, URI clients, and URI servers in the combined network. The pattern of the number of nodes over the period is consistent with the intensity of the flow traffic.

Degree of nodes: The degrees of URI clients and URI servers in the internal traffic (Figure 15) and the degrees of clients and servers in the combined network (Figure 16) show a pattern similar to that of the intensity of the flow traffic, which makes it hard to detect abnormality and dependency. Our initial findings show that the internal traffic preserves the node degree and time patterns.

Assortativity: Assortativity is the preference of nodes to attach to others that are similar in the network. Though the specific measure of similarity may vary, network theorists often examine assortativity in terms of node degree [1]. We can notice that the degree distributions of clients and servers (Figures 15 and 16) followed a temporal weekly pattern consistent with the intensity of the traffic. But the assortativity pattern (Figures 17 and 18) in the internal and combined networks shows high variability and an interesting pattern that is not consistent with the intensity of the flow traffic.
Technological and biological networks typically show disassortative mixing, where high-degree nodes tend to attach to low-degree nodes [1]. Our initial findings confirm that our university network is disassortative in nature, which can be explained by the expected selective communication behavior pattern and the heavy-tailed degree distribution of nodes (Figure 8). But Figure 18 clearly depicts positive values of assortativity, making the network random or assortative. Assortativity of zero indicates close-to-random connectivity, which is unusual for a university network.
Positive assortativity is even more unusual, as it would imply communication only between popular URI servers and very active clients. Remarkably, the intensity of node activity after the spring break and before the attack did not indicate any suspicious pattern; at the same time, the network structure alters and the pattern of assortativity shows a clear departure from the expected behavior (Figure 18). This particular observation may indicate the presence of compromised node activity and a promising future direction for predicting possible intrusion.

Clustering coefficient of nodes: The clustering coefficient is another property that depends on how the nodes are related in the network and is based on the projection of the network. The graphs in Figures 19 and 20 depict the global and local clustering coefficients of the internal and combined networks, respectively. The local clustering coefficient of nodes shows a pattern similar to that of the intensity of the flow traffic, but the global clustering coefficient of servers shows an interesting pattern that is not consistent with the traffic flow intensity. This may indicate compromised node activity and needs to be further investigated.
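The global and local clustering coefficients discussed above can be computed directly on a projection graph; the small adjacency structure below is hypothetical.

```python
from itertools import combinations

# Hypothetical projection graph as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def global_clustering(adj):
    """3 x (number of triangles) / (number of connected triples)."""
    triangles = triples = 0
    for v, nbrs in adj.items():
        for u, w in combinations(nbrs, 2):
            triples += 1            # path u - v - w centered at v
            if w in adj[u]:
                triangles += 1      # the triple closes into a triangle
    # Each triangle is counted once per corner, i.e., three times,
    # which is exactly the factor of three in the formula.
    return triangles / triples if triples else 0.0

def local_clustering(adj, v):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)
```

For this toy graph, the single triangle a-b-c out of five connected triples gives a global coefficient of 0.6, while node d, with only one neighbor, has a local coefficient of 0.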