Network Data Analysis of Word Graphs With Applications to Authorship Attribution

Network data analysis is an emerging area of study that applies quantitative analysis to complex data from a variety of application fields. Methods used in network data analysis enable visualization of relational data in the form of graphs and also yield descriptive characteristics and predictive graph models. This thesis shows that a representation of text as a word graph produces the well documented feature sets used in authorship attribution tasks such as the word frequency model and the part-of-speech (POS) bigram model. This thesis applies nominal assortativity of parts of speech, a network data characteristic of word graphs, to the problem of authorship attribution and shows how these features are produced from a word graph model. Specifically, it is shown that the nominal assortative mixture of parts of speech, a statistic that measures the tendency of words of the same POS in a word network to be connected by an edge, produces a feature set that can be used to predict authorship. These results are compared to the POS bigram model, a highly accurate authorship attribution model, and show that the nominal assortativity model is competitive. Analysis of these models along with word graph characteristics provides insights into the English language. Particularly, analysis of the nominal assortative mixture of parts of speech reveals regular structural properties of English grammar.

Figure caption: Clustering coefficient for five different authors over all samples. The x axis is the number of vertices; the y axis is the clustering coefficient. Black points are true transitivity and red points are the lower bounds for 1000 random graphs of the same in/out-degree distribution. The vertical bands, left to right, correspond to sample sizes of 125, 250, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, and 10000 words.

1 Introduction

Attempts to quantitatively evaluate writing date as early as the 19th century, with studies on Shakespeare's plays [3]. By the mid 20th century, Bayesian statistical analysis was applied to a small set of common words to speculate over the authorship of the Federalist Papers [3] [4]. The problem of authorship attribution falls into the domain of natural language processing (NLP) and includes uncovering plagiarism, determining ghostwriting and pen names, and speculating over the authorship of unsigned Supreme Court decisions or anonymous blogs [5] [6]. As is the case with most data applications in the 21st century, there is a wealth of written data: large online text resources, blogs, community message boards such as Twitter, Reddit, and Facebook, and traditional print sources such as newspapers and books. Not surprisingly, there is a multitude of models designed to classify authorship. These models are often quite complex, both theoretically and computationally, and do not produce straightforward descriptions of language. The network analysis techniques outlined in this paper are straightforward and provide interesting descriptions of the English language. The purpose of this article is to replicate a part-of-speech bigram model that has been used with prior success at authorship prediction [6] [7] and to show that it is part of a larger word-network model [8]. From this network model, network data analysis allows us to observe structural regularities of English grammar.
The part-of-speech (POS) bigram model is described as frequencies of pairs of consecutive parts of speech in a sentence. For instance, the sentence "the dog ate" comprises two POS bigrams: a determiner and noun pair, and a noun and verb pair. There are 36 parts of speech identified by the Penn Treebank. The Cartesian product of these 36 parts of speech with themselves forms the POS bigram feature set.
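The counting step above can be sketched in a few lines. This is an illustration, not the paper's implementation: the tagged input is assumed to come from an external POS tagger (the paper uses the Stanford tagger), and the function name is ours.

```python
from collections import Counter

def pos_bigrams(tagged_sentence):
    """Count consecutive part-of-speech pairs within a single sentence."""
    tags = [tag for _, tag in tagged_sentence]
    return Counter(zip(tags, tags[1:]))

# "the dog ate" yields one DT-NN pair and one NN-VBD pair.
tagged = [("the", "DT"), ("dog", "NN"), ("ate", "VBD")]
bigram_counts = pos_bigrams(tagged)
```

Summing such counters over all sentences of a sample gives the sample's POS bigram frequencies.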
One desirable trait of the POS bigram model is its reduced feature space. In contrast to the 36 parts of speech, there are hundreds of thousands of English words. While modern technology can manage large volumes of data faster than ever before, and machine learning techniques such as random forests, support vector machines (SVMs), and neural networks can handle numerous dimensions efficiently, the high dimensionality of language data still requires transformation or reduction of the data [6] [7] [9]. The POS bigram model enables speedy analysis of the English language through a reduced yet informative feature space.
2 Literature Review

To compensate for high dimensionality it is common to target a single characteristic of language for inclusion in a feature set, ignoring others. Broadly speaking, there are two types of authorship attribution models: those that model the text's content and those that model its structure [6] [7]. Content models encompass topical information such as word frequencies. Stylometric models capture the structure of a language, such as the frequency of POS bigrams in a sentence.
Diederich et al. [6] applied support vector machines to the two categories of models mentioned above in order to evaluate their authorship prediction performance. Hirst and Feiguina, on the other hand, found that a POS bigram model could accurately discern between Charlotte and Emily Brontë, sisters whose writing is known to be difficult to tell apart [7]. We present a model that offers a more complete representation of POS bigrams than that of Diederich et al., adding the observation that some authors, though a minority of those represented in the sample, cluster visibly by network characteristic when plotted [13]. Research on word graphs has sought to discover predictive features using network data analysis but did not include the part of speech as an attribute of a vertex. Lahiri and Mihalcea showed that although descriptive characteristics such as transitivity, clustering coefficient, and density are not significant predictors on their own, they may be beneficial when included alongside other features such as word frequencies [14].
Amancio et al. also concluded that word network characteristics could be used in conjunction with more traditional approaches, examining additional network characteristics such as shortest path, betweenness, and intermittency, but did not achieve remarkable accuracy with any feature set created by a combination of 15 network characteristics [15] [16]. Mihalcea and Radev measured degree assortativity, which measures the tendency for vertices of the same degree to be connected by an edge [17], as opposed to nominal assortativity, which compares the attributes of vertices. Marinho, Hirst, and Amancio report the results from various authorship attribution techniques, all with varying degrees of success [12]. The problem with comparing accuracy results across different experiments is the lack of control over experimental design. Different trials include different numbers of authors and different text sizes, in addition to different scoring techniques. The bigram model proposed by Hirst and Feiguina achieved very high accuracy, but on only two authors [7] [12].
Foster et al. recognized the strongly negative degree assortativity when applying the Pearson correlation, but also kept their discussion limited to degree-degree assortativity [18]. These previous studies did not examine nominal assortativity of parts of speech. We find that while not as powerful as the POS bigram model, the POS assortative mixture model is competitive at authorship prediction.
More to the purpose of this article, we offer a meaningful description of the English language in a way that other discussions of authorship attribution regularly fail to produce. Neural networks are especially criticized for being "black box" models because neuron weights hidden in multiple layers do not naturally correspond to language features. Comparatively, while it may be natural to count word frequencies, the feature set by itself does not offer intuition about language.
Zipf discovered that word frequency is inversely proportional to frequency rank [19] [6] [20], an empirical law [21] observable by plotting word frequencies in sorted order. In simplified mathematical terms, the sum of the relative frequencies follows the harmonic series [21]. Zipf's law is an approximation; not all corpora follow identical word frequency distributions. In general, however, the approximation holds across languages, including English and Chinese [22]. With the goal of continuing statistical insight into language, network data analysis provides an avenue for further exploration of linguistic relationships.
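The rank-frequency plot described above can be computed directly. A minimal sketch (our own helper, not from the paper):

```python
from collections import Counter

def rank_frequency(words):
    """Word frequencies in sorted order, paired with their 1-based rank."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))

words = "the cat and the dog and the bird".split()
pairs = rank_frequency(words)
# Under Zipf's approximation, frequency is roughly proportional to
# 1/rank, so rank * frequency stays roughly constant down the list.
```

Plotting these pairs on log-log axes for a real corpus makes the approximately linear Zipf relationship visible.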
By measuring the tendency for the same parts of speech to collocate, the nominal assortative mixture model offers a glimpse into the characteristics of English stylometry. Assortative mixture captures fundamental language characteristics, such as which same-POS pairs do and do not regularly occur in writing. For instance, we find that the determiner-noun pair is a disassortative bigram that occurs often in the English language. These frequent disassortative pairs contribute to an overall distribution of English parts of speech that is disassortative. However, there are parts of speech that do exhibit assortative qualities, such as adjective-adjective pairs. As a feature set the nominal coefficients distinguish stylistic preferences between authors, yet regularities across authors reveal a grammar "signature" for the English language.
The rest of this paper is outlined as follows: We begin with a definition of the authorship classification problem in Section 3.1 and a description of the data in Section 3.2. In Section 3.3 we describe the network model. In Section 3.4 we show how the POS bigram model is constructed from the graph object, and explain how to calculate the nominal assortativity coefficients from POS bigram data. In Section 4.1 we visualize the data, describe network characteristics such as graph density, and discuss the relationship between nominal assortativity and degree assortativity for word graphs. In Section 4.2 we report the model testing results for authorship prediction and conclude with a brief discussion in Section 5.

3 Methodology

3.1 Authorship Attribution Problem Formulation
Authorship attribution is a classification problem. The authors of texts are the classes, and their related works are represented as feature sets. The task is to learn a labeling function from labeled training data that accurately discerns authorship. While the first step, feature construction, is described in Section 3.4, the second step is formally presented as follows. Given a universe X of n written works by m authors, where k_i is the number of works written by author A_i and a_ij is the j-th such work, each label a_i corresponds to an author labeled 1 to m. Labels are applied to a sample S of X such that each data point is labeled with a single author, producing the training dataset D. Since the target labeling function f(x) is not truly known and can be applied only in retrospect, the learned function f̂(x) is an approximation of the original function f(x) [23].

3.2 Data
The data set analyzed included 5 authors chosen from a subset of the Gutenberg data set made available by the University of Michigan [24]. The authors were Jerome Klapka Jerome (1859-1927), Thomas Hardy (1840-1928), Sir Arthur Conan Doyle (1859-1930), Jane Austen (1775-1817), and Nathaniel Hawthorne (1804-1864). For each experiment, each author was represented by 30 fixed-length excerpts (between 125 and 10000 words) taken from their larger written works. Each sample was manually pre-processed to remove the author's name, chapter titles, chapter numbers, subtitles, author's notes, editor's notes, and extraneous syntax such as brackets and asterisks. The main purpose of cleaning the data was to avoid speech tagger errors and remove extraneous information.

3.3 Word-Network Model
A word-network model is a directed graph G = (V, E) with a set V of vertices representing unique words and a set E of edges, where elements of E are ordered pairs (u, v), or bigrams, of distinct words u, v ∈ V appearing consecutively within sentences in a sample text. The direction of an edge is consistent with the order in which the two words occur within each sentence, but edges do not span from a word that ends a sentence to one that begins the next. Each edge represents a unique word bigram, and its weight corresponds to the bigram's frequency. The degree d_v of a vertex v in a word graph G counts the number of edges (bigrams) in E incident upon v. By computing the weighted out-degree of each vertex in G, one can construct the word frequency model discussed in the introduction. Each vertex in a word graph is attributed with its part of speech. By reducing the word graph G to a POS graph G_p = (V_p, E_p), where the set V_p of vertices represents unique POSs and the set E_p of directed edges represents unique POS bigrams, we can sum edge weights to produce the POS bigram frequency model.
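The construction above can be sketched with plain mappings. This is an illustrative stand-in, not the paper's code; a graph library such as igraph or networkx would hold the same information in a graph object.

```python
from collections import defaultdict

def build_word_graph(tagged_sentences):
    """Directed word graph: pos[word] holds the vertex's part-of-speech
    attribute, and edges[(u, v)] holds the frequency (weight) of the
    word bigram u -> v. Bigrams never cross sentence boundaries."""
    pos = {}
    edges = defaultdict(int)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pos[word] = tag
        for (u, _), (v, _) in zip(sentence, sentence[1:]):
            edges[(u, v)] += 1
    return pos, edges

sentences = [[("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
              ("fox", "NN"), ("jumped", "VBD"), ("over", "IN"),
              ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]]
pos, edges = build_word_graph(sentences)

# Weighted out-degree recovers the word frequency model (up to
# sentence-final words, which contribute no outgoing edge).
out_degree = defaultdict(int)
for (u, _), w in edges.items():
    out_degree[u] += w
```

Summing edge weights after mapping each endpoint to its POS yields the POS bigram frequency model in the same way.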
The directed graph in Figure 1 represents the following sentences:
• The quick brown fox jumped over the lazy dog.
• A fox jumped over Sir Walters the lazy dog.

The representation of text as a word graph produces well documented feature sets such as the word frequency model and the part-of-speech bigram model used in authorship attribution tasks (see Table 1 and Table 2). Additionally, this graph representation allows application of various network data analysis methods, such as the reporting of network characteristics including degree distribution, graph density, and nominal assortativity.

3.4 Feature Sets
In this section we take a closer look at word graph analysis and the part-of-speech bigram model as tools for feature set selection for authorship attribution, and at the outlook they provide on the structure of English grammar.

Part of Speech Bigrams
POS bigrams represent adjacency between two consecutive parts of speech, as described in Section 3.3. POS bigram frequencies are derived from the word graph representation in Table 2. The feature set includes 34 of the 36 Penn Treebank parts of speech (excluding symbols and list item markers); Table 1 gives the word bigram matrix of the two example sentences. Hence, the resulting feature space is a 1156 (34 x 34) element vector where each element is an ordered pair of sequential parts of speech.
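Flattening the bigram counts into that fixed-length vector can be sketched as follows; the three-tag tagset here is a placeholder for the 34 Penn Treebank tags used in the paper.

```python
def pos_bigram_vector(bigram_counts, tagset):
    """Flatten POS bigram frequencies into a len(tagset)**2 element
    vector, one slot per ordered pair of parts of speech."""
    index = {tag: i for i, tag in enumerate(tagset)}
    vector = [0] * (len(tagset) ** 2)
    for (a, b), count in bigram_counts.items():
        vector[index[a] * len(tagset) + index[b]] = count
    return vector

# Three tags stand in for the 34 Penn Treebank tags, which give the
# 1156 (34 x 34) element feature space described in the text.
tagset = ["DT", "JJ", "NN"]
vector = pos_bigram_vector({("DT", "NN"): 1, ("DT", "JJ"): 2}, tagset)
```

Each writing sample then contributes one such vector as a training instance for the classifiers in Section 4.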
Hirst and Feiguina used Cass, a partial parser, to tag parts of speech in short texts and construct POS bigrams. The choice of partial parsing was a compromise between quick computation and complete parsing. While not as accurate as complete parsing, it was "accurate enough" [7].
This paper uses the POS tagger from the Stanford Natural Language Processing Group. It achieves high accuracy and is very fast even on large documents [25].
The accuracy of the Stanford POS tagger enables more complete parsing of syntactic labels compared to the partial parser used by Hirst and Feiguina. We do not feel it is necessary to compare Hirst and Feiguina's POS bigram model to our own, however, since our results support their conclusion that the POS bigram model distinguishes authorship on small samples. Instead, we expect that more accurate and detailed parsing will improve results.
After tagging parts of speech for each sample, we represent the samples as word graphs and produce the feature set from the POS bigram frequencies. We apply classification tools, including random forests and support vector machines, for authorship classification. The results are summarized in Section 4.

Assortative Mixture of Parts of Speech
A word-graph model can be summarized by characteristics including degree distribution, density, and assortativity. Nominal assortativity is a vector of coefficients ranging between -1 and 1, each of which measures the tendency for graph vertices with the same attribute to share an edge. Positive coefficients indicate assortative mixing; negative coefficients indicate disassortative mixing. The assortativity coefficient is analogous to the Pearson correlation coefficient [11] [26].
The attribute being measured in this paper is the part of speech. For parts of speech in a word graph, a positive assortativity coefficient suggests words of the same POS occur sequentially, while negative assortativity suggests they do not.
The assortativity coefficient is calculated for each of the 34 POSs to generate a feature set of 34 elements. For each POS i, a nominal assortativity coefficient r_i can be computed from the POS mixing matrix as

r_i = (f_ii - f_ir f_ic) / sqrt( f_ir (1 - f_ir) f_ic (1 - f_ic) ),    (1)

where f_ii is the fraction of edges in a graph G that join a vertex in the ith category to a vertex in the same (ith) category, and f_ir and f_ic are the marginal row and column sums respectively (see [10] [11] for more details); this form is the Pearson correlation of the edge endpoint indicators for category i. For the results in this paper, the assortativity coefficient was calculated using word graph objects with directed edges.
As an illustration, we computed nominal assortativity coefficients for the toy sentences represented as a directed word graph and visualized in Figure 1. Specifically, applying Equation (1) to the values in Table 2 reveals that the parts of speech DT, NN, VBD, and IN all have assortativity coefficients of -1, while JJ and NNP have coefficients of -0.202 and 0.333 respectively.
Here the parts of speech constitute a feature set of six elements. This example supports our more general finding for larger data samples that parts of speech possess assortative (or disassortative) properties.
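A per-POS coefficient of this kind can be sketched from the edge fractions f_ii, f_ir, and f_ic described above. The normalization below is the Pearson correlation of the endpoint indicator variables, one possible realization; library routines (e.g. igraph's nominal assortativity) use related but not identical normalizations, so magnitudes may differ.

```python
import math
from collections import defaultdict

def pos_assortativity(edges, pos):
    """Per-POS nominal assortativity from edge fractions: f_ii is the
    fraction of edges joining POS i to POS i, f_ir and f_ic are the
    marginal row (source) and column (target) sums. The coefficient is
    the Pearson correlation of the endpoint indicators for category i."""
    total = sum(edges.values())
    f_ii = defaultdict(float)
    f_ir = defaultdict(float)
    f_ic = defaultdict(float)
    for (u, v), weight in edges.items():
        a, b = pos[u], pos[v]
        f_ir[a] += weight / total
        f_ic[b] += weight / total
        if a == b:
            f_ii[a] += weight / total
    coefficients = {}
    for i in set(f_ir) | set(f_ic):
        denom = math.sqrt(f_ir[i] * (1 - f_ir[i]) * f_ic[i] * (1 - f_ic[i]))
        if denom > 0:
            coefficients[i] = (f_ii[i] - f_ir[i] * f_ic[i]) / denom
    return coefficients

# Toy graph with only same-POS edges, so each POS is fully assortative.
edges = {("alpha", "beta"): 1, ("gamma", "delta"): 1}
pos = {"alpha": "X", "beta": "X", "gamma": "Y", "delta": "Y"}
r = pos_assortativity(edges, pos)
```

Computing this dictionary per writing sample, restricted to the 34 Penn Treebank tags, yields the assortative mixture feature set.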

Motifs
Motifs are defined as "small subgraphs occurring far more frequently in a given network than in comparable random graphs" [11]. To Marinho, Hirst, and Amancio, "the topology of a complex network is characterized by the number of motifs found on its structure" [12]. Marinho et al. examined word graph motifs on three vertices (see [27] [11] for more information on motifs; Figure 2 shows the 13 possible motifs on three vertices) and applied motif frequencies to the problem of authorship attribution [12]. They observed across languages that "languages possess an intrinsic structure, which divides words into categories," where "words from one category (e.g. prepositions) tend to be with others from different categories (e.g. nouns or articles)" [12]. Marinho's conjecture that word category collocation is disassortative is supported by the results in this paper.

Clustering Coefficient
If the three vertex motifs described in [12] [2] are non-random, thus constituting motifs, one indication would be that network characteristics generated by random graphs differ significantly from the characteristics of the true network.
One characteristic that applies to three vertex motifs is the clustering coefficient.
The clustering coefficient, also known as transitivity, measures the proportion of connected triples that are closed into triangles [28] [11]. The ratio between these two generic motif classes (triangles, motifs 7-13 in Figure 2, vs. connected non-triangles, motifs 1-6 in Figure 2) describes which has the larger presence. A graph exhibiting high transitivity has a high proportion of triangles compared to open connected triples, and vice versa.
To test the assumption that the presence or absence of these motifs is not random, the transitivity of every sample of 1000 POS or more is compared to the transitivity of 1000 random graphs generated with the same in-degree and out-degree sequence as the sample (for details see the degree.sequence.game random graph generator in iGraph [11]).
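The comparison can be sketched in plain Python. Both helpers are illustrative stand-ins: `transitivity` computes the global clustering coefficient on the undirected projection, and `rewire` is a simple double-edge-swap randomizer that preserves every vertex's in- and out-degree, standing in for igraph's degree.sequence.game generator (which the paper actually uses).

```python
import random

def transitivity(edges):
    """Global clustering coefficient of the undirected projection:
    (closed connected triples) / (all connected triples)."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    closed = triples = 0
    for v, nbrs in adj.items():
        ns = sorted(nbrs)
        triples += len(ns) * (len(ns) - 1) // 2
        for i in range(len(ns)):
            for j in range(i + 1, len(ns)):
                if ns[j] in adj[ns[i]]:
                    closed += 1  # each triangle counted once per corner
    return closed / triples if triples else 0.0

def rewire(edges, swaps, rng):
    """Degree-preserving randomization by double edge swaps: replacing
    (a, b) and (c, d) with (a, d) and (c, b) keeps every vertex's
    in-degree and out-degree unchanged."""
    edges = list(edges)
    present = set(edges)
    for _ in range(swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) == 4 and (a, d) not in present and (c, b) not in present:
            present -= {(a, b), (c, d)}
            present |= {(a, d), (c, b)}
            edges[i], edges[j] = (a, d), (c, b)
    return edges

triangle = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
randomized = rewire(triangle, swaps=100, rng=random.Random(0))
```

Repeating the rewiring many times and recording the transitivity of each randomized graph gives the reference range against which the true sample transitivity is compared.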
The transitivity of word graphs for writing samples of at least 1000 POS is modeled using linear regression and visualized in Figure 8. In Equation (2), the dependent variable y is the clustering coefficient of a word graph from a sample of at least 1000 POS. The independent variable x_11 is the number of vertices of the word graph representing a single sample. The discrete categories of sample size (1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000) are encoded as dummy variables x_1 through x_10, with x_1 (the 1000 POS category) as the baseline. The interaction between the number of vertices x_11 and the dummy variables produces 10 separate regression lines, one for each category of fixed sample size:

y = β_0 + β_1 x_2 + β_2 x_3 + ... + β_9 x_10 + β_10 x_11 + β_11 x_2 x_11 + β_12 x_3 x_11 + ... + β_19 x_10 x_11 + ε    (2)

In Equation (2), β_0 is the intercept, β_1 through β_9 are the coefficients of the dummy variables x_2 through x_10, β_10 is the slope coefficient of x_11, and β_11 through β_19 are the coefficients of the interaction terms; ε is the error term. Regression was applied under the assumptions that there is a linear relationship between y and x_11, that the distribution of the variables is multivariate normal, that there is no collinearity between the predictors, and that the errors are evenly distributed along the regression line.

4 Results

4.1 Visualizations And Descriptive Analysis
In this section we demonstrate how the written text samples of the selected authors listed in Section 3.2 can be visualized and characterized using word graphs described in Section 3.3. Focusing on nominal assortativity coefficients of parts of speech, we compare different authors and explore if their individual preferences for POS usage elucidate their writing structure. Next, we apply the POS bigram and POS assortative mixture models outlined in Section 3.4 to the problem of authorship attribution using the data described in Section 3.2. We compare these models in terms of predictive accuracy for authorship of various writing sample sizes.
We begin by constructing a word graph for a selected writing sample; its descriptive network characteristics are reported in Table 3. We will discuss these characteristics later in this section.
Turning our attention to nominal assortativity, we computed POS nominal assortativity coefficients for each author and visualized them in Figure 3. Each colored bar in Figure 3 corresponds to an author and shows the range of the assortativity coefficient for a POS. Positive values indicate an author's preference for selectively linking the same POS, while negative values indicate a preference for disassortative relationships between same-POS bigram pairs.
While the magnitude of each coefficient does not reliably measure the magnitude of assortativity, the sign of the coefficient does distinguish between assortative and disassortative preferences [26] [29]. Comparing different authors, it appears that individual preferences for POS usage differentiate writing style, apart from one category whose highly assortative values are an artifact. These findings seem indicative of English language structure in general. Consistency across authors suggests an assortative "signature" for English grammar.
As opposed to nominal assortativity, which produces an assortativity coefficient for each distinct attribute of a vertex (the part of speech in the case of word graphs), degree assortativity is a single coefficient that measures the likelihood for vertices of fixed degree to attach to other vertices of the same degree (see [11]). For word graphs, degree assortativity is negative, reflecting the fact that the most used parts of speech (nouns, verbs, prepositions, and determiners) are disassortative by type (see Figure 6). Degree disassortativity also reflects the fact that there are, for instance, many determiner-to-noun transitions in English (compare the French articles le and la), with a few hub determiners connected to a diverse set of smaller-degree nouns (see Figure 5). Table 3 gives descriptive statistics of a single writing sample of size 3000 POS. Recall that a vertex in a word graph represents a unique word, so the total number of vertices in a word graph is the total number of unique words used by a particular author within a single writing sample. Since the number of vertices (unique words) determines the number of possible unique word bigram edges, it follows that the more unique words an author uses within a span of 3000 POS, the less dense the graph. This is a consequence of Zipf's observation that word frequency is inversely proportional to rank. In order for a new vertex to increase word graph density, it must form enough unique edges with other vertices to exceed the ratio of edges to possible edges in the graph without the new vertex. However, by Zipf's law, a new word should contribute fewer edges to the rest of the graph than previous words: by virtue of having a low frequency rank (a new word has occurred once) and of occurring later in the sample, it is used less frequently within a fixed sample size.
This is supported by the information in Table 3. Austen, who uses the fewest unique words, has the densest graph, while Hawthorne has the least dense graph because of his broader vocabulary. The same correlation does not occur with degree assortativity, however: Austen and Jerome have approximately the same degree assortativity although their numbers of vertices differ substantially.
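The density argument above can be made quantitative. For a simple directed graph with n vertices and e edges, a new vertex raises density only if it brings k new edges with (e + k) / ((n + 1) n) > e / (n (n - 1)), which simplifies to k > 2e / (n - 1): roughly the current average total degree. The helper names below are ours.

```python
def density(n_vertices, n_edges):
    """Density of a simple directed graph without self-loops."""
    return n_edges / (n_vertices * (n_vertices - 1))

def new_edges_needed(n_vertices, n_edges):
    """Minimum number of new edges a new vertex must bring to raise
    density: k > 2e / (n - 1). A freshly used word typically adds far
    fewer bigram edges than that, so by Zipf's law each additional
    unique word tends to lower density."""
    return 2 * n_edges / (n_vertices - 1)

# With 100 vertices and 500 edges the threshold is about 10 edges, so
# a new vertex with a single edge lowers density.
threshold = new_edges_needed(100, 500)
```

This is why the authors with the broadest vocabularies, like Hawthorne, produce the least dense graphs in Table 3.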

4.2 Model Testing
In this section we apply the POS bigram and POS nominal assortativity models to authorship prediction. For each of the five authors, we use 30 written text samples of fixed length to derive the two models following the procedures described above. We evaluate and compare these models in terms of accuracy for various text sizes using support vector machines (SVMs) and random forests via 10-fold cross validation.
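The evaluation loop can be sketched as follows. This is a dependency-free illustration, not the paper's pipeline: a trivial nearest-centroid classifier stands in for the SVMs and random forests actually used (with scikit-learn, `cross_val_score` with `SVC` or `RandomForestClassifier` would fill the same role), and the synthetic data is a placeholder for the real feature vectors.

```python
import random
from collections import defaultdict

def centroids(train):
    """Mean feature vector per author label."""
    grouped = defaultdict(list)
    for features, label in train:
        grouped[label].append(features)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}

def predict(cents, features):
    """Nearest-centroid stand-in for an SVM or random forest."""
    return min(cents, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(cents[label], features)))

def ten_fold_accuracy(data, k=10, seed=0):
    """Mean accuracy over k cross-validation folds."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        cents = centroids(train)
        scores.append(sum(predict(cents, x) == y for x, y in test) / len(test))
    return sum(scores) / k

# Synthetic, well-separated stand-in "feature vectors" for two authors.
data = ([([0.0 + i, 0.0], "austen") for i in range(20)]
        + [([50.0 + i, 50.0], "hardy") for i in range(20)])
accuracy = ten_fold_accuracy(data)
```

The real experiments substitute the 150 POS bigram or assortativity vectors (30 per author) for the synthetic data.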
The results reported in Table 4 suggest the POS bigram model with support vector machines is highly accurate, confirming the power of the POS bigram model.
Even on small data sets of only 125 words the POS bigram model performed well.
The assortative mixture model, on the other hand, performed best using random forests but could not achieve the near perfect accuracy of the POS bigram model on larger text samples. The 90% confidence intervals were calculated using the bootstrap method on the 10-fold test statistic for the best performing classifier for each model. While significantly less predictive, the assortative mixture model does perform competitively, especially for larger text sizes.
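A percentile bootstrap of the kind described can be sketched directly; the fold scores below are hypothetical placeholders, not values from Table 4.

```python
import random

def bootstrap_ci(scores, level=0.90, n_boot=10000, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    cross-validated accuracy scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(scores) for _ in scores) / len(scores)
        for _ in range(n_boot))
    lower = means[int(n_boot * (1 - level) / 2)]
    upper = means[int(n_boot * (1 + level) / 2) - 1]
    return lower, upper

# Hypothetical 10-fold accuracies for one model and text size.
fold_scores = [0.9, 1.0, 0.8, 0.9, 1.0, 0.9, 0.8, 1.0, 0.9, 0.9]
low, high = bootstrap_ci(fold_scores)
```

Resampling the ten fold scores with replacement and taking percentile bounds of the resampled means gives the reported 90% intervals.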
The assortativity model was also applied using a single layer neural network; however, the results were mediocre compared to support vector machines and random forests. While a more complicated neural network may perform better, the pursuit of such a network is not within the scope of this paper.
With five authors of 30 samples each, the sample space for each experiment was not very large. This introduces the possibility of overfitting. The POS bigram model using support vector machines was highly accurate; however, a large feature space with few instances of test targets (only approximately three samples per author in each cross validated test fold) may be the reason for the exceptionally high accuracy.
Using fewer folds for validation or including more samples per author might remedy the potential for overfitting.

Figure 7. The variable importance plot ranks the most important parts of speech from the nominal assortativity feature set when trained on random forests.

Variable Importance of Features
The variable importance plot in Figure 7 ranks the most predictive features when using random forests. For word graphs, the nominal assortativity of DT, CC, MD, and NN were the most important features. It should be noted that different iterations of the random forest algorithm will produce different rankings; in general, however, the most used parts of speech were among the most important variables, and the ranking in Figure 7 reflects that distribution.
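A model-agnostic way to sketch the same idea is permutation importance: the drop in accuracy when one feature column is shuffled. This stands in for the impurity-based importances reported by random forest implementations (e.g. scikit-learn's `feature_importances_`); like them, it varies between runs. All names below are ours.

```python
import random

def permutation_importance(predict, X, y, seed=0):
    """Accuracy drop per feature when that feature's column is
    shuffled; larger drops indicate more important features."""
    rng = random.Random(seed)
    base = sum(predict(x) == t for x, t in zip(X, y)) / len(y)
    drops = []
    for j in range(len(X[0])):
        column = [x[j] for x in X]
        rng.shuffle(column)
        shuffled = [x[:j] + [c] + x[j + 1:] for x, c in zip(X, column)]
        acc = sum(predict(x) == t for x, t in zip(shuffled, y)) / len(y)
        drops.append(base - acc)
    return drops

# Feature 0 decides the label; feature 1 is noise and should score 0.
X = [[i % 2, i] for i in range(20)]
y = [x[0] for x in X]
drops = permutation_importance(lambda x: x[0], X, y)
```

Ranking the drops, highest first, reproduces the shape of a variable importance plot like Figure 7.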

Using Assortativity to Compare Word Networks
Caution is required when using the assortativity coefficient. Van der Hofstad and Litvak showed, using a synthetic graph technique as well as real-world network data, that for disassortative networks the magnitude of the correlation coefficient (also known as assortative mixture) decreases as the network increases in size. The assortativity calculation does give the correct sign of the coefficient (assortative vs disassortative); however, the inconsistent magnitude means the assortativity coefficient is not a good measurement for comparison between graphs of different size [26] [29]. Table 5 shows the results of the assortativity model on varied sample sizes between 9000 and 14000 words. While the model still exceeds 90% accuracy, compared to tests on fixed sample sizes the model suffers considerably (see Table 4). The improved results when controlling for sample size reinforce the conclusions of van der Hofstad and Litvak. To observe the behavior of the nominal assortativity coefficient more closely, the charts in Figures 11 through 13 plot the nominal assortativity coefficient against the size of the word graph for a particular author. For smaller word networks the assortativity coefficient appears highly varied. As the size of the word network increases, the assortativity coefficient converges to tight bounds. For DT and NN the assortativity coefficient exhibits this behavior for all five authors. The same behavior is observed for JJ, VB, IN, and CC for Hawthorne. It appears that for all parts of speech the assortativity coefficient is more varied for smaller networks and converges as the size of the network increases.

Motifs
As an example test of the assumption that three vertex motifs are not random, Table 6 gives the transitivity of the 5 writing samples from Table 3, as well as the transitivity range for 1000 random graphs with the same degree distribution as the corresponding writing sample. In all five cases the true clustering coefficient of the writing sample is below the range for the 1000 random graphs. This indicates that the clustering coefficient of word graphs is not random and is consistently lower than for random graphs; the absence of triangles is not random for word networks. The low transitivity of word graphs is consistent with Milo et al., who found that three vertex motifs that were not triangles had a higher significance of occurring than three vertex motifs that formed triangles [2]. The charts in Figure 9 plot the clustering coefficient against vertex count for all samples of each of the five authors. For large enough word graphs (about 1000 POS cumulative, or roughly 500 vertices) the clustering coefficient is consistently below the lower bound for the transitivity of random graphs. For smaller networks the coefficient shows high variance; as the network size increases, the coefficient falls within regular bounds for each of the five authors.
The charts in Figure 9 show clustering of transitivity values for word graphs with similar numbers of vertices. In Figure 8, vertical strips of points constitute samples of a fixed number of POS. The downward slope of the bands suggests that, for samples of fixed size, an increase in the number of vertices results in lower transitivity. On a more macro scale, however, for large enough word graphs the range of the clustering coefficient between bands is consistent. This suggests that while an increase in word graph size results in lower transitivity for fixed sample sizes, transitivity is range bound for most of an author's writing. Graph size does appear to have a macro effect on transitivity in that the more unique words an author uses, the lower the transitivity. Hawthorne, with the largest word networks, produced the lowest clustering coefficient values, while Austen, with the smallest word networks, produced the largest values.
Since the absence of triangles is significant for word graphs, it is worth asking which triangles, by part of speech, do appear often. Given a sample of size 3000 POS, the triangle composed of vertex types DT, IN, and NN occurs more often than any other single triangle for each of the five authors. Compared to most other POS triangles, whose frequencies are less than five in most cases, the DT, IN, NN triangle occurred tens of times more often.
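A triangle census of this kind can be sketched on the undirected projection of a word graph; the helper and toy data below are ours, not the paper's code.

```python
from collections import Counter
from itertools import combinations

def pos_triangle_census(edges, pos):
    """Count triangles in the undirected projection of a word graph,
    keyed by the sorted POS triple of the three vertices."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    census = Counter()
    seen = set()
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj.get(v, set()):
                triangle = frozenset((u, v, w))
                if len(triangle) == 3 and triangle not in seen:
                    seen.add(triangle)
                    census[tuple(sorted(pos[x] for x in triangle))] += 1
    return census

# Toy graph containing one DT-IN-NN triangle.
edges = [("the", "dog"), ("dog", "over"), ("over", "the"), ("dog", "ran")]
pos = {"the": "DT", "dog": "NN", "over": "IN", "ran": "VBD"}
census = pos_triangle_census(edges, pos)
```

Applied to a full 3000 POS sample, the most frequent key of such a census is the (DT, IN, NN) triple reported above.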

Future Research
Future research could explore the relationships among natural languages (and language families) using nominal assortativity, similar to the comparative work done on motifs by Milo et al., who found that three vertex motifs occurred at similar rates for different languages such as French and Japanese.

Figure 11. Nominal assortativity of DT plotted against graph size for five different authors. The x axis is the number of words in the sample, the left (black) axis is the number of vertices, and the right (red) axis is the assortativity coefficient.

Figure 12. Nominal assortativity of NN plotted against graph size for five different authors. The x axis is the number of words in the sample, the left (black) axis is the number of vertices, and the right (red) axis is the assortativity coefficient.

Figure 13. Nominal assortativity of JJ, VB, IN, CC plotted against graph size for Hawthorne. The x axis is the number of words in the sample, the left (black) axis is the number of vertices, and the right (red) axis is the assortativity coefficient.
Representation of text as relational data provides many advantages. The information contained in a word graph produces several well documented feature sets used in authorship attribution tasks including the part of speech bigram model examined in this article. Since these feature sets have shown success in authorship attribution tasks, it is worthwhile to analyze these models for insights into the English language. Network data analysis provides an avenue to explore language in this way.
When computed for different authors, nominal assortativity by parts of speech appears to distinguish the individual preferences of authors for part of speech usage, revealing aspects of authorial style. Assortative regularities across authors reveal a grammar "signature" for the English language that exhibits mostly disassortative properties but permits some assortative relationships. These disassortative properties span the layered components of language. At the phonetic level, different speech sounds are put together to form syllables and make distinct words. The same principle describes using different letters to write words. At the grammar level, words of different parts of speech collocate to create sentences.
The combination of differing components enables the structured use of sound and meaning.