Assortative mixture of English parts of speech

Document Type

Conference Proceeding

Date of Original Version



Network data analysis is an emerging area of study that applies quantitative analysis to complex data from a variety of application fields. Methods used in network data analysis enable visualization of relational data in the form of graphs and also yield descriptive characteristics and predictive graph models. This paper presents an application of network data analysis to the authorship attribution problem. Specifically, we show how representing text as a word graph produces the well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech (POS) bigram model. Analysis of these models, along with word graph characteristics, provides insights into the English language. In particular, analysis of the nominal assortative mixture of parts of speech, a statistic that measures the tendency of words of the same POS in the word network to be connected by an edge, reveals regular structural properties of English grammar.
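To make the statistic concrete, the following is a minimal sketch of how a word graph, POS bigram counts, and a nominal assortativity coefficient could be computed. It rests on assumptions not stated in the abstract: edges join words that appear next to each other in the text, the graph is undirected and unweighted, each word type keeps a single POS tag (the last one seen), and the coefficient follows Newman's standard definition for categorical attributes. The function names and the toy tagged sentence are hypothetical, and the paper's own construction may differ.

```python
from collections import Counter


def pos_bigrams(tagged_tokens):
    """Count adjacent POS tag pairs (the POS bigram feature set)."""
    return Counter(
        (t1, t2) for (_, t1), (_, t2) in zip(tagged_tokens, tagged_tokens[1:])
    )


def word_graph(tagged_tokens):
    """Build an undirected word graph from adjacency in the text.

    Assumption: one POS tag per word type (the last tag seen wins).
    Returns (edges, pos_of), where edges is a set of 2-element frozensets.
    """
    pos_of = {word: tag for word, tag in tagged_tokens}
    edges = set()
    for (w1, _), (w2, _) in zip(tagged_tokens, tagged_tokens[1:]):
        if w1 != w2:  # skip self-loops
            edges.add(frozenset((w1, w2)))
    return edges, pos_of


def nominal_assortativity(edges, pos_of):
    """Newman's assortativity coefficient for a categorical node attribute.

    r = (sum_i e_ii - sum_i a_i^2) / (1 - sum_i a_i^2), where e_ij is the
    fraction of edge ends joining category i to category j and a_i is the
    fraction of edge ends attached to category i.
    """
    if not edges:
        raise ValueError("graph has no edges")
    counts = Counter()
    for edge in edges:
        u, v = tuple(edge)
        # Count each undirected edge in both directions (symmetric matrix).
        counts[(pos_of[u], pos_of[v])] += 1
        counts[(pos_of[v], pos_of[u])] += 1
    total = sum(counts.values())
    tags = {tag for pair in counts for tag in pair}
    trace = sum(counts[(tag, tag)] for tag in tags) / total
    a = {
        tag: sum(c for (src, _), c in counts.items() if src == tag) / total
        for tag in tags
    }
    sum_a2 = sum(value ** 2 for value in a.values())
    if sum_a2 == 1:
        raise ValueError("coefficient undefined when all nodes share one tag")
    return (trace - sum_a2) / (1 - sum_a2)


# Toy example with hand-assigned tags (hypothetical data).
tagged = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"),
]
edges, pos_of = word_graph(tagged)
r = nominal_assortativity(edges, pos_of)
print(pos_bigrams(tagged).most_common(1), round(r, 3))
```

In this toy sentence no edge joins two words of the same POS, so the coefficient is negative (disassortative). Positive values would indicate that same-POS words tend to be connected.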

Publication Title

Studies in Computational Intelligence