Date of Award

2018

Degree Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Statistics

First Advisor

Noah Daniels

Abstract

Network data analysis is an emerging area of study that applies quantitative analysis to complex data from a variety of application fields. Its methods enable visualization of relational data as graphs and yield descriptive characteristics and predictive graph models. This thesis shows that representing text as a word graph produces the well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech (POS) bigram model. It then applies the nominal assortativity of parts of speech, a network data characteristic of word graphs, to the problem of authorship attribution and shows how these features are produced from a word graph model. Specifically, it shows that the nominal assortative mixture of parts of speech, a statistic that measures the tendency of words of the same POS in a word network to be connected by an edge, produces a feature set that can be used to predict authorship. Compared with the POS bigram model, a highly accurate authorship attribution model, the nominal assortativity model is competitive. Analysis of these models, along with word graph characteristics, provides insights into the English language. In particular, analysis of the nominal assortative mixture of parts of speech reveals regular structural properties of English grammar.
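As a rough illustration of the statistic named in the abstract, the sketch below builds a small word graph and computes a nominal assortativity coefficient over POS labels. It is not taken from the thesis. It assumes an undirected graph whose nodes are distinct words, with an edge between two words that are adjacent in the text, and it assumes the standard coefficient r = (sum_i e_ii - sum_i a_i^2) / (1 - sum_i a_i^2), where e_ij is the fraction of edge ends joining POS i to POS j and a_i is the fraction of edge ends attached to POS i. The function names, the hand-tagged sentence, and the rule of keeping only the first tag seen for each word are all hypothetical simplifications.

```python
from collections import defaultdict


def build_word_graph(tagged_tokens):
    """Build an undirected word graph from (word, pos) pairs.

    Assumption: nodes are distinct lowercased words, and an edge joins
    two words that appear next to each other in the text.
    """
    pos_of = {}
    edges = set()
    prev = None
    for word, pos in tagged_tokens:
        word = word.lower()
        # Simplification: a word keeps the first POS tag it is seen with.
        pos_of.setdefault(word, pos)
        if prev is not None and prev != word:
            edges.add(frozenset((prev, word)))
        prev = word
    return edges, pos_of


def nominal_assortativity(edges, pos_of):
    """Nominal assortativity of POS labels over an undirected edge set."""
    mixing = defaultdict(float)
    for edge in edges:
        u, v = tuple(edge)
        # Count each undirected edge in both directions so the
        # mixing matrix is symmetric.
        mixing[(pos_of[u], pos_of[v])] += 1
        mixing[(pos_of[v], pos_of[u])] += 1

    total = sum(mixing.values())
    if total == 0:
        raise ValueError("graph has no edges")

    ends_per_pos = defaultdict(float)  # a_i: share of edge ends at POS i
    same_pos = 0.0                     # sum_i e_ii
    for (i, j), count in mixing.items():
        share = count / total
        ends_per_pos[i] += share
        if i == j:
            same_pos += share

    expected = sum(a * a for a in ends_per_pos.values())
    if expected == 1.0:
        raise ValueError("coefficient is undefined when all nodes share one POS")
    return (same_pos - expected) / (1.0 - expected)


# Hand-tagged toy sentence (tags are illustrative, not tagger output).
sentence = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"),
]
edges, pos_of = build_word_graph(sentence)
print(nominal_assortativity(edges, pos_of))
```

Under these assumptions, a value near 1 means edges mostly join words of the same POS, and a negative value means edges mostly join words of different POS. The toy sentence has no same-POS edges, so its coefficient is negative.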
