Machine Learning for the Automated Identification of Cyberbullying and Cyberharassment

Cyberbullying and cyberharassment are a growing problem that is straining the resources of human moderation teams, and affected teens, unable to escape the harassment, face an increased risk of suicide. By utilizing n-grams and support vector machines, this research was able to classify YouTube comments with an overall accuracy of 81.8%. This increased to 83.9% when retraining was used to add the misclassified comments to the training set. To accomplish this, the LibSVM implementation of the support vector machine algorithm was run with a balanced training set of 350 comments, the 7% of length-3 n-grams with the highest entropy, and a polynomial kernel with a C error factor of 1, a degree of 2, and a coef0 of 1. The 350 comments were also trimmed with a k-nearest neighbor algorithm where k was set to 4% of the training set size. With the algorithm designed to be heavily multi-threaded and capable of running across multiple servers, the system achieved that accuracy while classifying 3 comments per second on consumer-grade hardware over Wi-Fi.


Introduction
The purpose of this research is to develop a technique to automatically identify cyberbullying, cyberharassment and other prohibited speech. This research will implement an algorithm using existing machine learning techniques that will be able to identify cyberbullying in a single sample. With retraining, the algorithm must be able to adapt as laws about cyberbullying are changed. With this research, major social networking sites, such as Facebook, would be able to automatically identify harmful comments, relieving some of the stress on moderators.

Background
Cyberbullying is a growing phenomenon that is plaguing today's youth and is increasing at an alarming rate. As technology advances and becomes prevalent in more facets of our lives, the potential for bullies to reach into a teen's life increases, causing additional hardship leading to depression and, in some cases, even suicide.
The Cyberbullying Research Center's research [1] showed that in 2013 about one in four teens had been the victim of cyberbullying and one in six teens was involved in the bullying. Their research also shows, in every study, that cyberbullying is on the rise. Extrapolating from the studies, they estimate that 2.2 million teens were cyberbullied nationwide in 2011, up from an estimated 1.9 million in 2009. This number is expected to increase as both teens and adults continue to have an increased online presence.
One of the major problems with cyberbullying is the difficulty parents have in spotting and identifying the bullying. Research shows that only one in ten teens will ever report it to an adult [2]. This lack of reporting is compounded by the fact that there are no physical signs that cyberbullying is occurring; thus, without manually monitoring all of the child's on-line interactions, it can easily go unnoticed. Even if a parent does recognize that some communication could be construed as cyberbullying, they often do not know the relevant rules and regulations needed to stop it effectively.
Cyberbullying is a major issue because many teens have committed suicide due to its pressures. This can be evidenced by the death of a 14-year-old girl, Rebecca Sedwick, which resulted in the arrest of her 12- and 14-year-old classmates [3]. In 2011 and the first four months of 2012 alone, there were 18 cases of suicide linked to cyberbullying in the US, UK, Australia, and Canada, compared with 23 cases identified across the entire period from 2003 to 2010 [4].
Cyberbullying takes place in numerous different locations and as such cannot be monitored with just one application. For example, one project discussed below attempts to identify and report cyberbullying taking place on Facebook; while that is a great idea, it needs to be expanded to include other sites such as Twitter or text messaging. In 2012 and 2013 alone there were nine suicides linked to the social network Ask.fm [5]. As the number of these social media outlets increases, so too will the avenues for cyberbullying.
The current method of dealing with the problem is to use human moderators and administrators to remove offending comments and ban repeat offenders. However, on most sites, the number of comments far outweighs the ability of moderators to read and approve every comment. Thus, in order to combat this, moderators typically rely on users to flag or report offending comments. This means that users have already seen and been affected by the cyberbullying, at which point it is too late to remove it. This algorithm will improve the situation by automatically flagging offending comments, at which point a human moderator could approve or deny them without the intended victims ever having to read them.

Goals
These are the goals to be achieved in order to complete this dissertation.

Relevant Laws
Now that we have a sociological definition of cyberbullying, a legal definition needs to be constructed so that rules can be devised and applied to the comments for classification. In order to inform the classification in a manner that infringes as little as possible on the First Amendment, the laws at both the state level [3] and federal level [4] were used to create a definition that mirrors what is considered a federal or state crime.

U.S. Code
The U.S. Code comprises the laws that govern all American citizens across the country regardless of state. Within the U.S. Code there are two federal titles that concern cyberbullying, Title 18 Crimes and Criminal Procedure [5] and Title 47 Telecommunications [6], each with several relevant subsections that will be explained below.

Section 1470 Transfer of obscene material to minors
This section applies to anyone who knowingly transfers obscene matter to an individual under the age of 16, "or attempts to do so..." [7] This means that if you use a commercial communications method (such as YouTube) to transfer obscene material to someone you know to be under the age of 16, then you are in violation of federal law.

Section 1514 Civil action to restrain harassment of a victim or witness
The only important piece of this section is the definition of harassment [8]: an act or course of conduct directed at a person that causes substantial emotional distress and serves no legitimate purpose.
Section 2261A Stalking
This section is an important one in this research because many of the elements of cyberbullying can be found under the stalking laws [9].

Rhode Island General Law
Outside of federal law, each individual state has its own laws that govern citizens present or doing business within that state. While this does make it more difficult to come up with a one-size-fits-all definition that meets all national laws, the laws of the State of Rhode Island will be used, as that is the jurisdiction in which this research was conducted.

Title 11 Criminal Offenses
Within the Rhode Island general law there is only one relevant title, Title 11 on Criminal Offenses [12].
Section 11-42-2 Extortion and Blackmail
Most of this section is outside the scope of cyberbullying; however, one small piece does apply: anyone who maliciously threatens any injury to the reputation of another is in violation of state law. [13]

Section 11-52-4.2 Cyberstalking and Cyberharassment
This section reinforces that the laws governing physical conduct, such as the laws against stalking, also apply to any communication transmitted over an electronic device. [14]

Section 11-59-1 Definitions
The definitions in this section confirm the definitions of both "harasses" and "course of conduct" found in federal law. [15] The state definition of "harass" does include additional elements, such as the intent to seriously alarm, annoy, or bother the person.

Section 11-59-2 Stalking
In the Rhode Island general law, the law against stalking is very straightforward: it covers any person who harasses another person. [16]

BullyBlocker
A similar project being worked on is BullyBlocker from Arizona State University [17]. Its primary purpose is "to exploit social media data and, based off of a model built on previous research findings in areas of traditional and cyberbullying in adolescents, to then identify an instance of cyberbullying and notify the parents." To do this they have designed a calculation they call the Bullying Rank: using calculated warning signs and vulnerability, they compute a risk ranking that places the child in either a low, moderate, or severe risk category. Using these categories, parents are notified via e-mail and can provide feedback in order to improve the identification.
While future work integrating machine learning has been mentioned, the problem is that all of the research focuses solely on Facebook and requires multiple warning signs to identify whether a message was bullying. The solution proposed here should be able to use machine learning to identify bullying in a single message and flag it as such.

Hate Speech Detection
A subset of the problem has already been addressed by Columbia University, where researchers utilized support vector machines to classify hate speech [18]. They define hate speech as "any communication that disparages a person or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristic." Thus, while not all hate speech is cyberbullying (some hate speech has made it into the common vernacular and is at this point so common as to not be considered bullying), there is an overlap between the two.
When classifying anti-Semitic speech they were able to achieve an accuracy of 94% and a precision of 68%. They used a sample set of 1,000 paragraphs and determined whether each contained hate speech by having three different annotators classify whether it was anti-Semitic. To process this data, they used a template-based strategy that applied various positive and negative templates to the paragraph and kept count of how many occurrences were found, which they called the logodds. Overall they managed to create an algorithm that equaled the performance of their annotators.
There are several differences between that research and the research being done in this dissertation. First, their research only classified anti-Semitic speech, which is a small subset of the harassment that occurs; while they achieve greater accuracy within the context of one specific type of harassment, that accuracy does not necessarily generalize to other forms. Second, they worked by the paragraph, while many cyberbullying comments online are small statements, as sites such as Twitter [19] restrict posts to 140 characters.

League of Legends Player Reform System
In the online game League of Legends, homophobic, racist, or sexist language is used by up to 5% of the player base. [21] In an effort to combat this, Riot Games created an automated player reform system. This system allows them to apply bans, such as two-week or permanent bans, for homophobia, racism, sexism, death threats, or abuse. However, they still rely on the offended player to flag the comments, at which point the automated system determines if the flagging is correct and then applies the appropriate ban. [22] After several months of using the system, and utilizing several million games' worth of data, they expanded it to handle more complex behaviors, such as determining if a player's character was intentionally feeding the enemy (deliberately allowing the enemy to kill your character, thus making them more powerful). [23]

While this system does appear to be robust and was created with a large amount of data and training, the major downside is that it is a commercial enterprise that has designed the system for use solely within its own product. None of the research that has gone into this project has been published or peer-reviewed, and thus the system acts as a closed box. This prevents others from taking the research and applying it to their own situations. The lack of peer review also means that the black box could be malfunctioning and we would never know, as they are unlikely to self-report statistics on false-positive bans.

Twitch AutoMod
Starting in December of 2016, Twitch, an online video game streaming service owned by Amazon, rolled out a new tool called AutoMod [24]. This is machine-learning-based automatic moderation software designed to hold back messages for moderator approval. It looks for and filters based on identity language, sexually explicit language, aggressive language, and profanity. Moderators can also set what level of filtering they wish to use, which will ignore some of the filtered types depending on which of the four tiers is chosen. [25] Unlike the research done in this dissertation, AutoMod seems to utilize a dictionary that must be maintained by the developers at Twitch. Beyond this, none of the research Amazon put into the tool has been published; because of that, as of the writing of this dissertation, it is unknown what sort of accuracy is possible with this product, as well as how much machine learning plays into the tool versus simply using the dictionary and pattern matching to flag comments with certain words or combinations of words.
This research goes beyond the Twitch AutoMod tool in better matching the actions of moderators. Instead of utilizing training based on the moderators, AutoMod's developers have arbitrarily designed four security levels that the moderators can choose from; what each of these levels has been trained on is unclear and must have been selected by the developers. The goal of this research is instead to utilize the existing moderators and to attempt to simply match their moderation patterns, without regard to "types" of speech such as race, religion, gender, orientation, or disability, which the first tier of AutoMod handles.

Machine Learning
Machine learning dates back to the 1950s [26], and a variety of algorithms now exist to do everything from language translation [27] to financial trading [28]. Of the different possible machine learning techniques, the two best possibilities for classifying cyberbullying are neural networks and support vector machines.
Perceptrons and, later, neural networks are based on the use of neurons: one in the case of a perceptron and groups of them in a neural network. Each of these neurons receives n input signals along with n associated weights telling the neuron how to evaluate those inputs. After processing those inputs, it passes the result through a transfer function that returns either a +1, if the solution was positive, or a -1, if it was negative. The training of these perceptrons involves updating the associated weights until the n inputs on each of the training samples correctly resolve to either a positive or negative result. There is a problem, however; while perceptrons and neural networks will find a solution, that solution is not guaranteed to be the optimum decision surface. The algorithm simply stops once it tests a solution that is found to be correct. [29] Another problem is that the created model is extremely complex, having multiple inputs and outputs and various weights on all of them. This makes it difficult to understand how it is arriving at an answer and to influence that process toward a better answer. [30]

Decision trees are much easier for humans to understand, as they are simple statements that you can easily follow to a conclusion. While they are simple to follow, that does not mean that they cannot become quite complex. At each node on the tree there is a test and, depending on the outcome of the data on that test, you go down to a certain leaf that may contain the solution or another test. These algorithms tend to be fast learners with good accuracy and as such are used in fields such as medical diagnosis. The downside to decision trees is that they have some limitations, such as the inability to express all first-order logic, as well as the fact that duplication of tests can occur on the tree, leading to much larger trees than necessary.
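The perceptron training loop described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical one-dimensional toy data, not the classifier used in this research: weights are nudged after each misclassification, and training stops at the first error-free pass, which is why the resulting surface is not necessarily optimal.

```python
# Minimal perceptron sketch: a +1/-1 transfer function and a training loop
# that updates the weights until every training sample is classified
# correctly, then stops at that first consistent (not optimal) solution.

def predict(weights, bias, x):
    """Transfer function: +1 if the weighted sum is positive, else -1."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else -1

def train(samples, labels, epochs=100, lr=0.1):
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            if predict(weights, bias, x) != y:
                # Nudge the weights toward the misclassified example.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
                errors += 1
        if errors == 0:   # stop at the first solution that fits the data
            break
    return weights, bias

# Linearly separable toy data: feature > 0.5 means class +1.
X = [[0.9], [0.8], [0.1], [0.2]]
y = [1, 1, -1, -1]
w, b = train(X, y)
print([predict(w, b, x) for x in X])  # → [1, 1, -1, -1]
```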
[31] Support vector machines are dual maximum margin classifiers, which means that the algorithm ensures that the decision surface it generates is equidistant from both sets of data, which is assumed to be the optimal placement. In order to properly classify as many different types of data sets as possible, support vector machines also contain a kernel, which can be changed to several different formulas to better fit the data. Another main advantage is that, because support vector machines generate an optimum solution, they will always return a unique solution for the given data, unlike neural networks, which will just give a solution though a better one may exist. [29] Similar to neural networks, however, the models from a support vector machine are not recognizable by a human as a solution to the problem. [30] In this research, support vector machines will be utilized as the machine learning algorithm, as was done in the Columbia University research. Unlike the perceptron, the support vector machine is a maximum margin classifier, so while a perceptron will arrive at an answer, that answer is not guaranteed to be the best one, as the algorithm stops once any solution is found. Furthermore, although the initial training of a support vector machine can take some time, the speed of classifying subsequent comments against the decision surface is considerably faster than with decision trees, as there is only one comparison.
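As a sketch of how such a support vector machine might be configured, the following uses scikit-learn's SVC class, which wraps the same LibSVM library used in this research, with the polynomial kernel parameters reported in the abstract (C = 1, degree 2, coef0 = 1). The tiny feature vectors here are hypothetical stand-ins for real n-gram frequency vectors, not data from the actual corpus.

```python
from sklearn.svm import SVC  # scikit-learn's SVC wraps LibSVM

# Hypothetical 3-feature n-gram frequency vectors:
# label +1 = not bullying, -1 = bullying (toy data, not the real corpus).
X_train = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],
           [0.1, 0.9, 0.8], [0.0, 0.7, 0.9]]
y_train = [1, 1, -1, -1]

# Polynomial kernel with the parameters described above:
# a C error factor of 1, a degree of 2, and a coef0 of 1.
clf = SVC(kernel="poly", C=1, degree=2, coef0=1)
clf.fit(X_train, y_train)

# Classify a new comment vector that resembles the +1 cluster.
print(clf.predict([[0.85, 0.15, 0.05]]))
```

Once the model is trained, classifying a new comment is a single evaluation against the stored decision surface, which is what makes per-comment classification fast relative to retraining.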

Natural Language Processing
Computational linguistics is the field of computer science that deals with the processing of language and has been an active field of research since the 1950s [32].
One of the first researchers to look into the field was Alan Turing, who created the Turing test to identify the point at which a computer could be considered intelligent [33]. Since its inception, the field has been used for a variety of purposes, from the translation of one language to another in software such as Google Translate [34], to the processing of spoken language into text with software such as Dragon [35].
This research builds on work from throughout the field of computational linguistics in order to process small statements of natural language into a form usable by the machine learning algorithm previously mentioned. The most useful model for this research is the n-gram model, in which the frequency of words is used to classify types of speech [32].

N-Grams
An n-gram is quite simply a word, called a token, or a group of tokens, that can be used to build a statistical language model [32]. N-grams are used in many different types of language processing, from speech recognition, to ensure that any unclear words are guessed correctly, to machine translation, to help select the most accurate translation.
N-grams can be computed for multiple lengths, typically denoted by replacing the N with the length number. Take the following example phrase: The quick brown fox jumps over the lazy dog.
This phrase contains nine words, one capital letter, and one punctuation character. For the purposes of n-grams, capitalization is ignored and punctuation and spaces are removed. So the 1-grams are the, quick, brown, fox, jumps, over, the, lazy, dog. The 2-grams are the|quick, quick|brown, brown|fox, fox|jumps, jumps|over, over|the, the|lazy, lazy|dog. This would continue for the 3-grams all the way up through the 9-grams, at which point there would be no further difference with this phrase. Once the n-grams are separated, they are used to calculate a frequency of occurrence, since machine learning algorithms work on numbers. This is done differently depending on the language processing in question but in general takes the form of:

frequency(n-gram) = count(n-gram) / total number of n-grams

So, looking again at the phrase, in the 1-grams the token "the" has a 22.2% occurrence while all the other words have an 11.1% occurrence.
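The tokenization and frequency calculation above can be sketched as follows; the implementation details (lowercasing, punctuation stripping, the "a|b" token separator) mirror the example in the text rather than any particular library.

```python
import string
from collections import Counter

def ngrams(text, n):
    """Lowercase the text, strip punctuation, and join n consecutive words."""
    words = text.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    return ["|".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "The quick brown fox jumps over the lazy dog."

unigrams = ngrams(phrase, 1)   # 9 tokens: the, quick, ..., dog
bigrams = ngrams(phrase, 2)    # 8 tokens: the|quick, ..., lazy|dog
freq = Counter(unigrams)

print(round(freq["the"] / len(unigrams), 3))    # → 0.222 (2 of 9)
print(round(freq["quick"] / len(unigrams), 3))  # → 0.111 (1 of 9)
print(bigrams[0], bigrams[-1])                  # → the|quick lazy|dog
```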

Information Theory
Information theory is the study of how information is encoded into bits, either for storage or transmission. It began in 1948 with a paper called "A Mathematical Theory of Communication," published by Claude Shannon in the Bell System Technical Journal. In this paper, Shannon identified the mathematical limit to how fast information can be transmitted without error. He also described how all information can be encoded in bits that can then be compressed and transmitted, which is considered the beginning of the digital age [36]. The most important pieces of information theory used in this dissertation are the calculations of entropy and information gain.

Entropy and Information Gain
In the field of information theory, entropy (H) is a measure of how much information is held in a symbol, such as a bit or, in our case, an n-gram, given its probability [37]. Using this measure, the n-grams can be pruned down to only those that provide solid information, instead of utilizing all of the n-grams, many of which would just increase the complexity of the machine learning without aiding in separating the classes. This allows us to eliminate low-information n-grams, such as the token |a|, which would not be an indicator of bullying, while prioritizing the features that are the best indicators of the classes.
In order to calculate the entropy, we first calculate the probability of each class with or without a certain feature. So for each n-gram we need the number of positive comments with the n-gram, the number of negative comments with the n-gram, the total number of comments with the n-gram, and then the same three statistics for comments without the n-gram. First, using the counts for comments with the feature, for each class i we calculate:

p_i = (number of class-i comments with the feature) / (total number of comments with the feature)

With both p_+ and p_- calculated for the comments with the feature, the entropy of a comment having that feature can be calculated with:

H(P) = sum_i p_i * log2(1 / p_i)

Fixing up the fraction, this simplifies to:

H(P) = -sum_i p_i * log2(p_i)

Since we are only using two classes, that finally becomes:

H(P) = -p_+ * log2(p_+) - p_- * log2(p_-)

This is all repeated by calculating p_+ and p_- on the comments that do not have the feature. Those two H(P) values are then combined into a weighted average [38]. This is done using the total number of comments (T), the total number with the feature (TW), and the total number without the feature (TWO):

H_children = (TW / T) * H(with feature) + (TWO / T) * H(without feature)

The last piece needed before the information gain can be calculated is the amount of information stored in the parent before splitting on this feature. This is the entropy of the class distribution over all T comments, using the total number of positive comments (P) and negative comments (N):

H(parent) = -(P / T) * log2(P / T) - (N / T) * log2(N / T)

And finally we can then calculate the information gain from the attribute:

IG = H(parent) - H_children

When this is large, it indicates that the n-gram in question is a good candidate to separate the classes; n-grams with a gain near zero contribute little information. For this research, the n-grams are sorted by their information gain in descending order, and a certain percentage of those top values (decided in testing) is taken.
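The entropy and information-gain calculations above can be sketched as follows, with hypothetical comment counts standing in for the real corpus statistics.

```python
from math import log2

def entropy(pos, neg):
    """Two-class entropy H = -p+ log2(p+) - p- log2(p-), with 0*log2(0) = 0."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def information_gain(pos_with, neg_with, pos_without, neg_without):
    """Information gain from splitting the comments on one n-gram feature."""
    tw = pos_with + neg_with           # total comments with the feature (TW)
    two = pos_without + neg_without    # total comments without it (TWO)
    t = tw + two                       # total comments (T)
    parent = entropy(pos_with + pos_without, neg_with + neg_without)
    children = (tw / t) * entropy(pos_with, neg_with) \
        + (two / t) * entropy(pos_without, neg_without)
    return parent - children

# A feature found mostly in one class separates the classes well...
print(round(information_gain(40, 5, 10, 45), 3))   # → 0.397
# ...while a feature split evenly across classes carries no information.
print(round(information_gain(25, 25, 25, 25), 3))  # → 0.0
```

Ranking every n-gram by this value and keeping the top percentage is exactly the pruning step described above.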

Methodology
There are several distinct steps to the methodology employed in this research.
First, comments need to be gathered for use in the research. Next, the definition of cyberbullying needs to be clearly laid out in order to classify the comments.
After that, several different programs need to be utilized in order to process the comments and then to train and test the support vector machine.

Gathering Data
The first step in the research was to gather both bullying and non-bullying comments to both train and test the machine learning algorithm in its ability to detect cyberbullying. In order to collect enough comments, a web crawler was utilized to harvest comments from Twitter and YouTube. This web crawler was designed to grab all of the comments from the videos found in the YouTube playlist Popular Right Now by #PopularOnYouTube [1], where the top 200 videos at a given time are displayed. The web crawler was designed in such a way that no user information was recorded, and any usernames in the replies were stripped out.
These comments were grabbed while the videos were still popular, which helps ensure that the comments present at the time the site was crawled had not yet been fully moderated, a process which would have removed the offending comments needed for the research.
YouTube was chosen for the web crawling for several reasons. First, the site has a variety of comments, users, and viewpoints, resulting in arguments that can get heated. Next, it allows users to post with usernames that are completely removed from their non-internet identities. This allows users to face no consequences for anything they might say, outside of having their comments moderated or their account shut down. Another key factor was that the site has no differentiation between public and private comments. This means that any comment made by a user that has not been removed by a moderator is visible and public to every other user. Facebook and Twitter, on the other hand, allow users to specify who can view comments, and as such there is a greater expectation of privacy on those sites.
After crawling for only two hours, across two different days, over 118,000 comments were recorded into the database. This amount provides a large selection of comments for the classification and will be pared down as needed in later steps.
In order to ensure that the algorithm would be capable of handling comments regardless of site, a web crawler was used to grab comments from Twitter as well.
In order to ensure that the data gathered was as random as possible, comments were crawled from the Twitter public sample stream, which represents 1% of all Twitter messages posted while the web crawler is running. The only filter used was to restrict that 1% to English-language tweets, ensuring that it would be possible to determine whether they are bullying. After running the web crawler for just a few hours, over 72,000 tweets and re-tweets were collected.

Classifying Cyberbullying
The next phase of the research was to determine the criteria that would be used to mark comments as cyberbullying for the purposes of the research. To that end, two different methodologies were chosen to show the ability of the algorithm to conform to the terms of service of various sites, allowing it to be used more generically. The first method involved the use of the Rhode Island General Laws and the U.S. Federal Code, while the second method involved the use of a theoretical website's policies on moderating comments. The algorithm was first trained on the legal method to assess its accuracy and then, with that complete, was retrained first on the terms-of-service method and then on a combination of both methods to determine how well it can be retrained.
In a courtroom, the legal method would be decided by twelve jurors who would need to agree that something is obscene. On the internet, it is typically one to several moderators who are given full authority to remove comments they find harmful to the ecosystem of the site. Because of this, all of the comments were classified by the researcher using a set of standard rules against which each comment was weighed. Every effort was made to keep the rules objective and to remove as much subjectivity from the methodology as possible. Because this tool is designed as a website filter and not a replacement for a legal jury, this methodology simply shows the ability of the algorithm to replicate the moderation techniques of small moderation teams.

Legal Method
The less restrictive of the two definitions follows both Rhode Island general law and the U.S. Code governing federal law. Unlike the field of science, the field of law is not a strictly defined medium. Most laws are written in such a way that they need to be interpreted by individual lawyers and judges rather than being strictly defined. Even defined terms such as "obscene" are decided using tests such as the three-pronged Miller test [2]:
1. Whether the average person, applying contemporary adult community standards, finds that the matter, taken as a whole, appeals to prurient interests (i.e., an erotic, lascivious, abnormal, unhealthy, degrading, shameful, or morbid interest in nudity, sex, or excretion);
2. Whether the average person, applying contemporary community standards, finds that the matter depicts or describes sexual conduct in a patently offensive way; and
3. Whether a reasonable person finds that the matter, taken as a whole, lacks serious literary, artistic, political, or scientific value.
As the quote shows, much of the law has to do with what an average person would find to be the case. In a court of law, this average is established by forming a 12-member jury of random citizens taken from the local area. However, there is a problem with this approach: when it comes to the internet, what community standard should be applied? In the Supreme Court case Miller v. California [3], the opinion of the majority, written by Justice Burger, is that there should be no national community standard and that obscenity should be decided at the community level. However, as the years have progressed, it has been shown that not having a national standard is beginning to cause issues.
Recently the Santa Clara University School of Law reviewed the Miller test in light of a new circuit split over applying the test to the internet [4]. The review pointed out that even after the Miller case was decided, several other Supreme Court cases dealing with obscenity have shown that the Miller test is not sufficient. In each case, however, the majority opinion has been that a national standard cannot be created and a community standard must be used. The major issue is which community to apply it to. In Sable Communications v. FCC, the courts ruled that Sable Communications, a "dial-a-porn" business, must meet local obscenity laws and "may be forced to incur some cost in developing and implementing a system of screening the locale of incoming calls." This will not work in the day and age of the internet, however, and currently the Ninth and Eleventh Circuits are split on the issue.
Because of the split and the lack of a firm standard, this research simply treated all comments that are sexual in nature as obscene. This is because the sites utilized in the research are designed to be used by both minors and adults and do not contain nor allow adult content. While this is of course stricter than the First Amendment would allow as a legal standard, this tool is not designed to replace the legal jury system and is only intended to be run on a private company's website, where the company can decide which communication it deems obscene based upon its community.
Each of the comments were analyzed for the following to conform with all of the laws: 1. Is the comment sexual in nature?
2. Is the comment intended to seriously alarm, annoy, or bother the person?
3. And does the comment serve a legitimate purpose?
By using those three tests, all of the state and federal laws are satisfied. See table 1 for four examples of comments that were classified and the reasoning as to why.

Table 1: Example comment classifications under the legal method
Comment | Class | Reasoning
… | +1 | Not bullying: while it fails step two, it is legitimate, and the subject of the harassment is not present in the conversation
We'll see this lil shit on Ellen | -1 | Bullying: the subject of the comment is the person who posted the video and is likely to read the comment and be bothered by it
Please show this to the narrow-minded right wing fucks! | -1 | While this comes close to serving a purpose, the likelihood of a right-wing reader being upset by it outweighs the need for this comment to exist
Butterface bitch. I bet if she didn't have those big ass tits the comments here would be she's ugly and everything lol. | -1 | Fails every step of the test

Terms of Service Method
The second method of determining if a comment is cyberbullying is to use the terms of service that govern a social media site to determine what they do or do not want on their site. These terms of service are enforced by the moderation team, and the goal of this research is to be able to implement these terms of service in an automated fashion. For the purposes of determining what rules would be followed, the harassment policy of Reddit was used, which states:

We do not tolerate the harassment of people on our site, nor do we tolerate communities dedicated to fostering harassing behavior. Harassment on Reddit is defined as systematic and/or continued actions to torment or demean someone in a way that would make a reasonable person conclude that Reddit is not a safe platform to express their ideas or participate in the conversation, or fear for their safety or the safety of those around them. Being annoying, vote brigading, or participating in a heated argument is not harassment, but following an individual or group of users, online or off, to the point where they no longer feel that it's safe to post online or are in fear of their real life safety is.
A major difference between this and the legal method is that, because the terms of service are not designed as a legal document, the spirit of the rules is more important than the strict wording, and moderators are given more leeway in deciding what is unwanted on the site. For each subreddit, the creator can define who they wish to have as moderators, and those people have the ability to suppress or remove any comments that violate not only Reddit's terms of service, but also whatever other rules they create for that sub-forum.
One of the goals is to allow for retraining to handle new situations as the laws and culture change. In order to ensure a sufficient difference between the legal method and the terms of service method, the comments that were used in the legal method were reclassified as if they were posted on a fictional site that simply had the rule: Content is prohibited if it discusses politics and/or religion. This will ensure a substantial difference as most of the political talk in the comments did not rise to the level where it could be deemed illegal.
Finally, after testing just the terms of service method, the combination of both the legal method and the terms of service method will be run, simulating a more realistic portrayal of the terms of service actually used on websites. For both the terms of service method and this method, the only test run is to confirm that by simply retraining, the algorithm is capable of generating a useful model without the need to re-optimize all of the parameters.

Support Vector Machine Model
For the model to reach its optimum potential there are a number of different parameters that must be optimized. The writers of LIBSVM [6] recommend that optimization begin with knowledge of the data set, followed by a grid search over the relevant parameters. This, however, only covers the parameters directly built into the model. For this research, there are additional parameters which will each be optimized individually.
Before any parameters can be optimized, the range needs to be established on each of the parameters to ensure that the testing covers all necessary values.
The first parameter is the number of comments that are used for testing purposes.
While a balanced data set will be used to ensure that there is no bias introduced to one class or the other, we still need to determine the optimum number of comments to ensure that we are always getting a reasonable training set while also keeping complexity, and thus training time, low. For this reason, we will test from 50 comments (25 bullying) to 500 comments (250 bullying) in steps of 50.
Next, we need to test which of the four possible SVM kernels will perform best on the data. For the initial tests, we will be using the linear kernel, which is the fastest kernel and contains the smallest number of additional parameters to optimize. Once we have optimum values for the other parameters, we will also test the Polynomial and the Radial Basis Function (RBF) kernels. We will not be testing the Sigmoid kernel because of research done at the National Taiwan University. They have shown that not only is the kernel not positive semi-definite (it will not find a solution for all valid values of its parameters), but it also does not perform better than the RBF kernel in general, since it was designed to mimic the function of neural networks within an SVM [7].
With any of the three kernels, one of two error weighting parameters, C and ν, must be used. C is the first soft margin parameter that sets the cost of an error to allow potentially mislabeled data to exist across the boundary. It can take any positive value and, due to its functionality, an exponential grid search is the best method to find the optimum [8]. To this end, C takes the form of 2^k, with k ranging from -5 to 15. ν is a newer soft margin parameter that replaces C in order to reduce the range of allowable values from all positive numbers to between 0 and 1. ν can be shown to have the same optimal solution set as C [9], and as such does not completely replace it, but in certain circumstances one may perform better than the other, and so both will be tested. Because ν can be any number between 0 and 1, the test will begin at 0.1 and go to 0.9 in 0.1 increments.
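The two search grids described above can be generated directly. This is a minimal sketch; the research itself feeds these values into LIBSVM rather than generating them in Python.

```python
# Exponential grid for C: 2^k with k from -5 to 15, as recommended for
# soft-margin SVM searches.
c_grid = [2.0 ** k for k in range(-5, 16)]

# Linear grid for nu: 0.1 to 0.9 in 0.1 increments (nu must lie in (0, 1)).
nu_grid = [round(0.1 * i, 1) for i in range(1, 10)]
```

A grid search then simply trains and cross-validates one model per grid value and keeps the best performer.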
With the polynomial or RBF kernel there is an additional gamma parameter, which could be found through grid searching. Alternatively, research has shown that gamma can be estimated mathematically; to do so, a C# package called the Accord.NET Framework was used [10].
The final free parameters are the degree and Coef0, which are only used in the polynomial kernel. The degree will be constrained from 1 to 10 unless 10 is found to be optimal, at which point we will expand the range. Coef0, on the other hand, is only useful in special data sets and in general can be left at 0. To ensure this isn't one of the special cases, we will test with a value of 0 and 1.
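For reference, the LIBSVM polynomial kernel has the form (gamma * <x, y> + Coef0)^degree. The sketch below assumes that form and illustrates one reason Coef0 can matter: with Coef0 = 0 the kernel is homogeneous, so orthogonal inputs always map to 0 regardless of the degree.

```python
def poly_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    """LIBSVM-style polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

# With coef0 = 0 the kernel is homogeneous: orthogonal vectors always map
# to 0, which is one way a nonzero coef0 can change behaviour on some data.
```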
The next step in training the SVM is to determine what attributes are going to be used for the data. With even just 200 comments, the number of unique 6-grams was over 19,000; because of this, the generation of the training file was taking in excess of 2 hours, while each comment was taking one minute just to generate the data file to be passed into the SVM. In order to reduce this time without substantially reducing the accuracy of the model, the entropy of the n-grams was calculated.
This introduces two more parameters to be optimized in the testing. The first is what length of n-gram provides the best accuracy while minimizing the amount of time required. The second is what percentage of the n-grams with an entropy greater than 0 should be utilized. For the first, we will test n-gram lengths from 1 to 10, while for the second we will test 1% to 10% in 1% increments and, if 10% proves optimal, increase in 5% increments until performance degrades. In each case, the n-grams used are marked as being used in the training set and are then calculated for each message. This marking ensures that the same n-grams will be used for the testing comments later and, because the n-grams are then sorted by a unique identifier, ensures that there will be no discrepancies between the data sets.
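The n-gram generation and entropy ranking can be sketched as follows. The exact entropy formula used in the research is not reproduced in the text, so this sketch uses plain information gain (the entropy reduction from splitting the training comments on the presence of an n-gram), with each comment represented as a set of n-grams.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of length n from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(comments, labels, gram):
    """Entropy reduction from splitting comments on the presence of one n-gram.

    `comments` is a list of sets of n-grams, `labels` the parallel class labels.
    A sketch only; the exact entropy formula used in the research may differ.
    """
    parent = entropy(labels)
    with_g = [l for c, l in zip(comments, labels) if gram in c]
    without = [l for c, l in zip(comments, labels) if gram not in c]
    child = sum(len(part) / len(labels) * entropy(part)
                for part in (with_g, without) if part)
    return parent - child
```

Ranking every n-gram by this score and keeping the top percentage of those with a score above 0 reproduces the selection step described above.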
With all of the data being passed in, there is the potential that the support vector machine may not perform well due to the number of comments that are not on the decision surface but may still be influencing the model. In an effort to reduce that, as well as the time that the cross validation will take, the research is borrowing an experimental function from Leandro Costa at URI, who is developing a method to reduce the training set [11] built upon the K-Nearest Neighbors (KNN) algorithm [12]. This method calculates the distance of each comment from every other comment and then uses that to try to determine whether it is near the decision surface. This adds another parameter, as the method takes a count of how many closest points to examine when testing whether a comment is close to the decision surface. A low number here will greatly trim the data to only those points directly next to the decision surface. This was tested from 10% up to 100% in 10% intervals.
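The actual reduction method [11] is experimental and its details are not given here; the sketch below shows the general idea of KNN-based trimming under the assumption that a point is kept only when its k nearest neighbours include at least one opposite-class point (i.e., it is likely near the decision surface).

```python
def knn_trim(points, labels, k):
    """Keep only training points likely near the decision surface.

    A point is kept when its k nearest neighbours (Euclidean) include at
    least one opposite-class point. A sketch of the general idea only; the
    actual reduction method used in the research may differ.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    kept = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: dist2(p, points[j]))[:k]
        if any(labels[j] != lab for j in neighbours):
            kept.append(i)
    return kept
```

On two well-separated 1-D clusters, only the points facing the other class survive the trim.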
After all of these individual parameters have been tuned based on the best performing linear kernel, the next step was to perform a grid search on the three possible kernels and all of their parameters. While in the first step only the linear kernel was utilized to kick-start the process, here all of the possibilities were run to ensure the optimum was found.
In an effort to help distinguish good performing parameters, the full testing set of data is run through each created model and the training time and testing time are recorded along with the results of the test. Next, all first run tests were duplicated 10 times and all this data was graphed to help give a visual representation to the performance.
Finally, the average standard deviation was calculated and used to find the required sample size (figure 1) to ensure that number of runs were enough [13].
For this calculation, the 95% confidence level was used which results in a z-score of 1.96, and a margin of error of 5%. If this resulted in more runs being required, they were conducted as appropriate.
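Figure 1 is not reproduced in this text, but the standard sample size formula n = (z * sigma / E)^2 is consistent with the numbers reported in the findings (for example, a standard deviation of 7.37 yielding 8.35 required runs):

```python
def required_runs(std_dev, z=1.96, margin=5.0):
    """Required sample size n = (z * sigma / E)^2.

    The margin of error E is expressed in the same units as the standard
    deviation (here, accuracy percentage points); z = 1.96 corresponds to
    the 95% confidence level used throughout this research.
    """
    return (z * std_dev / margin) ** 2

# Reproduces the figure reported for the initial grid search:
print(round(required_runs(7.37), 2))  # 8.35
```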
Once the full run-through of the initial parameters was completed, a second run-through was done with the optimums found on the first attempt to ensure that the starting parameters did not influence the results. Instead of doing 10 runs on each test, the number that was calculated on the first set of runs was used instead.
With those done a new sample size calculation was done to ensure that it is still a proper result. The primary motivation for doing this second run through was due to the internal stratification that LibSVM utilizes on their implementation of cross-validation.
When an attempt was made to calculate the 95% confidence interval utilizing bootstrapping, the numbers returned were often higher than the original number. For this reason, there was not a high level of confidence that one pass on the variables would be sufficient to get the optimum model parameters.
The final test of the system will be to calculate the throughput of the system.
Because the units of the standard deviation in this test will be in milliseconds instead of an accuracy, the coefficient of variation will be computed instead, since it removes the units [14]. The coefficient of variation is then used as shown in figure 2 to calculate the sample size [15]. The same z-score of 1.96 and margin of error of 5% were used as in the standard deviation method.
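Figure 2 is likewise not reproduced; assuming the standard unitless form n = (z * CV / e)^2, with e a relative margin of error, the calculation looks like:

```python
def required_runs_cov(cov, z=1.96, rel_margin=0.05):
    """Sample size from the coefficient of variation (std dev / mean):
    n = (z * CV / e)^2, with e a relative margin of error (5% here).

    The exact formula in figure 2 is not reproduced in the text; this is
    the standard unitless form and is an assumption.
    """
    return (z * cov / rel_margin) ** 2
```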

System Design
The system designed for the dissertation is broken into four separate sections.
The first section is the website which facilitates access to the rest of the pieces.
The second section is the database where all of the data is stored and worked on by the other sections. Third is the service where the data is actually processed, manipulated and finally passed into the Support Vector Machine. Finally is the Support Vector Machine itself where the model is trained and the data is classified.
As you can see in figure 3, the overall system is complex. This diagram represents the ideal system in an actual website setup. The system used by the research contains identical functionality but at a smaller scale, combining all servers and databases onto one virtual machine. In the following sections each piece will be broken down to make it clearer what is happening.

Website
In order to show that this algorithm is fully functional regardless of the website, the website was written separately from the algorithm and simply utilized SQL stored procedures to function. Because of this, it can be shown that any web language capable of interacting with a Microsoft SQL Server database would have the same functionality.
The first site is designed to show some of the potential functionality that can be built into a website if required. This site is written in ASP.NET, utilizing the C# language, and uses a very utilitarian visual style. It allows the user to submit a comment for analysis. After uploading the comment, it waits for the comment to be processed and then analyzed, letting the user know as each step happens. Finally, once the analysis has been completed, it lets the user know whether the posted comment is considered bullying or not.
Another simple website was created in C# to allow for easy classification of data. This site selects a random comment from the database and then provides two buttons, one for Bullying and one for Non-Bullying. This site is the principal way in which the data was classified. After the initial classification, the site was also used to reclassify the comments for the other tests required.
Finally, a website was setup that captures the current statistics to get an accurate count of the throughput of the system. This website shows how many comments are waiting to be processed and analyzed, and shows the average time required to handle a single comment.
In a fully implemented system, the website component would function as shown in figure 4. In this ideal system, there are three possible classes of user.
There are the posters who are adding comments to the site, the readers who are permanent. This will ensure, that while some users will still flag it for human review when they know it is bad, the majority should just move on to either toning down the message of their comment or not commenting altogether.
The system, as shown, is not what was fully implemented because in the case of this research there are no readers. Thus, the first website was setup to mimic the functionality of the posters and the second site is similar to that of the moderators.

Database
For this research two different databases were utilized. Both were run on SQL Server 2012 (11.0.2100). The first database was used to store the data that was collected from both YouTube and Twitter. This data was stored in plain text with no processing done to it, outside of the stripping of user identification information, as well as storing the original source and holding the manual classification for any of the users.
The second database is the more important and is where all of the data is held for the processing and analysis sections. This database contains three tables. The first is the Comment table, which contains both the original comment as well as a cleaned version that has had white space and symbols removed. It also stores comment-specific information such as the percent of the comment that is capitalized, the manual classification value, whether it has been processed or analyzed yet, when that processing occurred, and by which thread.
The second table is the NGram table, which contains the list of every n-gram that was found in any processed comment, up to the 10-grams. That means that it contains every grouping from single words all the way through ten consecutive words.
The other important feature of the NGram table is that it stores if the n-gram is new, as in, it was added from a comment after the SVM was last trained, as well as the entropy of the n-gram which can be used to decide which to use as attributes. This information is important because any n-gram that was added after the training run is a feature that has not been utilized by any of the training set and as such cannot be a feature of any classification set without retraining. Finally, there are a number of secondary tables that were used to store temporary testing sets as well as results from each of the analysis runs. These are used only for the purposes of optimizing and testing this dissertation and would not be present in a product utilizing this algorithm.
Along with the three tables, there are also some stored procedures that are used to handle tasks such as locking a record to a certain thread or selecting all unprocessed comments etc. This is done to ensure consistency across threads and to ensure data consistency is kept at every stage. These stored procedures are, for the most part, dependent on the individual implementation, but there are two that will need to be in any implementation. The first calculates the entropy of the n-grams in regards to the comments that are being utilized to train the SVM. The second procedure clears out the training set and then randomly selects an equal number of both classes and sets up their features for the machine learning.
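The second procedure is implemented as a T-SQL stored procedure in the actual system; a Python sketch of the balanced random selection it performs might look like the following (illustrative only):

```python
import random

def balanced_training_set(comment_ids_by_class, size, seed=None):
    """Randomly draw an equal number of comments from each class.

    `comment_ids_by_class` maps a class label (-1 / +1) to a list of comment
    ids, mirroring the stored procedure that builds the balanced training
    set. The original is a T-SQL stored procedure; this sketch is
    illustrative only.
    """
    rng = random.Random(seed)
    per_class = size // len(comment_ids_by_class)
    selected = []
    for label, ids in comment_ids_by_class.items():
        selected.extend(rng.sample(ids, per_class))
    return selected
```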

Services
One of the goals of the algorithm is to allow for as much multithreading as possible to ensure that the algorithm can scale as needed to handle large websites such as Facebook. To this end, the processing and analyzing of the comments were implemented as services that run on the server. The main workload is spread across two different services that can run concurrently to spread the work out as much as possible. In fact, as shown in figure 6, there is no reason these services could not be set up to run on multiple server clusters to handle as many comments as needed.
As shown in the diagram, the first step is the web server, where the new comments come in prior to processing and the comments marked safe are hosted for the readers. The incoming comments are first placed into a database where they can undergo the initial processing. At the database level, the comments have all of their non-alphanumeric characters, along with all spaces, stripped out and replaced by |. Multiple | are then combined so that there is a single | at the start and end of the message as well as in between each unigram.
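Assuming the stripping rule described above (the text does not say whether case is also normalised), a minimal sketch of the database-level cleaning step is:

```python
import re

def clean_comment(text):
    """Replace every run of non-alphanumeric characters (including spaces)
    with a single '|' and delimit the whole message, as done at the
    database level before n-gram extraction."""
    cleaned = re.sub(r'[^A-Za-z0-9]+', '|', text)
    return '|' + cleaned.strip('|') + '|'
```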
The first service that is utilized by the comments is the Processor. As soon as a new comment is inserted into the database, the processor is run against it in order to calculate the different stats that are needed for the classification, such as the percent capitalization and the creation of the n-grams needed. In order to speed up later steps, the percent of the comment each n-gram represents is also calculated at this step and stored in a separate table. This ensures that everything that is needed for the SVM classification is already handled so that the only thing the analysis service needs to handle is the classification itself. Because the processor only handles a comment at a time per thread, it is easy to increase not only the number of threads available, but to also scale this across multiple servers since the results do not rely on any factors external to the single comment.
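Two of the per-comment statistics computed by the Processor can be sketched as below. The percent capitalization follows directly from the text; the exact normalisation used for the per-n-gram percentages is not specified, so the second function is an assumption (share of the comment's n-gram slots of that length).

```python
def percent_capitalized(original_text):
    """Percent of alphabetic characters that are upper case, one of the
    comment-level statistics stored by the Processor service."""
    letters = [c for c in original_text if c.isalpha()]
    return 100.0 * sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def ngram_share(tokens, gram):
    """Percent of the comment's n-gram slots of this length occupied by
    `gram`. A sketch of the per-comment n-gram weight computed at
    processing time; the exact normalisation used is not specified."""
    n = len(gram)
    slots = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 100.0 * slots.count(tuple(gram)) / len(slots) if slots else 0.0
```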
The other service is the one that handles the SVM classification. While the training of the SVM is a time consuming task that must be completed at the startup of the service, once the SVM is trained, the service is setup to allow multiple comments to be classified at the same time with proper distributed multithreading.
This SVM model can either be recreated at each service startup on every server, or after the first service is brought online and trained, the SVM model can be stored to a file which is then utilized on startup of the other servers to reduce training time and allow for a more efficient spinup of additional servers during peak times.
Finally, after the analysis is complete, the result is stored in either a new database or, in the case of this small scale research, as a simple flag in the comment

Support Vector Machine
Because the purpose of the research was to use an existing machine learning algorithm in a new and novel way, the project used a wrapper called LibSVMsharp [16], which calls a C++ DLL implementation of LIBSVM [6], fully implementing Support Vector Machines within the .NET language. This allowed the research to focus on the training and testing of the model rather than on reimplementing the existing Support Vector Machine algorithm, which would have introduced both additional complexity and an added vector for bugs.
There were several changes made to the DLLs utilized by the wrapper, however, to optimize the functionality of the LIBSVM implementation. All of the changes were taken from the LIBSVM FAQ page in order to better parallelize the LIBSVM DLL and allow it to better utilize a multicore system for training and predicting.
This DLL was based on the 3.21 version of the LIBSVM code and LibSVMSharp was rebuilt using this code as well. Both the DLL and the Wrapper were also compiled in 64-bit instead of their normal 32-bit version in order to accommodate the large data sets required for the research.

CHAPTER 4 Findings
This section begins with the findings from the first two optimization runs where each of the parameters are set to their optimum value in turn. For this research, the optimum is a balance between the highest accuracy and the best performance. With the multiple parameters optimized, it then moves on to the average accuracy against all of the classified data, first using the same legal method that the parameters were optimized with, and then with the terms of service method and the overall method. After that, it tests the Twitter comments to ensure that it works across sites. Finally, it will test two methods of retraining to see which performs better followed by a performance test.
All graphs in this section will show the average cross validated accuracy (Avg. Accuracy) and the average weighted real accuracy (Avg. Real Accuracy). The trend lines and confidence bands, as calculated by Tableau [1], are also shown for each of the values to help illustrate what is going on with the data.

Preliminary Findings
These findings show the results of the optimization steps taken on the support vector machine. All tests were run in a virtual machine running Windows Server 2012 R2 and were run on a desktop with an Intel i7 5930k processor overclocked to 4.2 GHz, 32 GB of Crucial DDR4-2400 RAM and a Samsung 850 EVO SSD. 6 Cores and 27.9 GB of RAM were assigned to the virtual machine.

Initial Run
This is the first run-through on each of the parameters to get an initial best case.

Initial Grid Search
The first grid search is used to choose the initial C or ν value on the linear kernel so that we could begin with a decent starting point. The linear kernel is used in the initial run because it has the smallest number of input parameters.
For the other parameters this run used 100 comments as the training set, 10% of 1-grams, and a knn level of 10%.
On this test, the full 10 runs were completed and resulted in an average standard deviation of 7.37. This means that the minimum number of runs necessary to be sure of the results was 8.35.
In order to properly visualize the data, the C and ν sets were split into two different graphs so that their individual effects could be seen. The first graph, shown in figure 7, illustrates the effect of varying the C parameter on the linear kernel. In this graph, you can see that the best the cross validation achieves is 55.96% accuracy with a C value of 2. However, the slope is flat on the cross validated accuracy and close to logarithmic on the other accuracy. This shows that the best possible C value is between 2 and 7, since the weighted real accuracy begins dropping slightly.
The second graph in figure 8 shows the effect of varying the ν parameter on the linear kernel. In this graph, you can see that the best the cross validation achieves is 56.48% accuracy with a ν value of 0.7. In these tests, the real accuracy was a relatively flat slope that turned down at high ν values.
As figures 9 and 10 show, the ν SVC method with a ν value of 0.7 is the parameter with the highest cross validated accuracy. Even though ν = 0.7 has a higher deviation than the next best, C = 2, its average is pulled down by a few bad runs; it has four runs over 60% versus only two for C = 2.

Number of Comments
With the initial linear kernel grid search done, the next parameter to isolate was the number of comments needed for the training set. From the last test, we are using the ν SVC linear kernel with a ν value of 0.7, 10% of length 1 n-grams, and a knn level of 10%. To ensure that one class does not overwhelm the other, the training set is balanced so in each case half is pulled from each class.
On this test the full 10 runs were completed and resulted in an average standard deviation of 3.31. This means that the minimum number of runs necessary to be sure of the results was 1.68.
As figure 11 shows, increasing the number of comments improves the accuracy of the model with a logarithmic curve. Under 150 comments the accuracy quickly drops off making small training set sizes too inaccurate even though they will perform faster. With these graphs, the best size was determined to be 300 comments since it maintained the best ratio of accuracy and real accuracy vs the training and testing time. In an ideal world, where time is not a factor, this should be set to the maximum size possible, but since time is always a factor, a size must be chosen that will allow the throughput required while giving acceptable accuracy.

N-gram Length
With the number of comments optimized, the next phase was to find the best n-gram length. This test used 300 comments as the training set, the ν SVC linear kernel with a ν value of 0.7, 10% of the n-grams at the various lengths, and a knn level of 10%.
On this test 9 runs were completed and resulted in an average standard deviation of 2.87. This means that the minimum number of runs necessary to be sure of the results was 1.26. There are only 9 runs because during the analysis it was discovered that something happened during run 3 that caused the training time to increase to greater than 2,000 seconds, and so it was excluded from the analysis. The accuracy does not change significantly with an increase of n-gram length. However, the time required to build the training and testing set does increase exponentially, as seen in figure 18. For this reason the 6-gram was chosen, as after that point there is some improvement, but not enough to justify the exponential increase in disk I/O time.

N-gram Percent
Now that the 6-gram has been chosen, it is time to figure out the optimum percent of those 6-grams to utilize. Continuing from the last test, this test used 300 comments as the training set, the ν SVC linear kernel with a ν value of 0.7, a length up to 6-grams, and a knn level of 10%.
On this test the full 10 runs were completed and resulted in an average standard deviation of 3.20. This means that the minimum number of runs necessary to be sure of the results was 1.58. Figure 19 shows that, like the n-gram length, the cross validated accuracy does not alter much regardless of the n-gram percent. As the n-grams are sorted based on the amount of information they bring to the machine learning, it wasn't surprising that increasing the percent of them that were taken had little effect on the overall result. Increasing the percent taken also increased the disk I/O access time exponentially, as seen in figure 22, so for that reason 4% was taken as the optimum. Because of the exponential nature of the disk I/O time, only up to 10% was tested, as the diminishing returns already were not worth continuing and the trends did not point to any improvement.

KNN Level
With all of the n-gram parameters locked down, the next step was to identify the optimal KNN trimming level.

Grid Search
The grid search test is designed to optimize all of the parameters across the three possible kernels that were used and to test all of the different possible parameter combinations used by each of them. For this test a 300 comment training set, 4% of the n-grams up to length 6 and a KNN level of 5% were used.
On this test, 3 runs were completed and resulted in an average standard deviation of 3.78. This means that the minimum number of runs necessary to be sure of the results was 2.20. On this test the minimum cutoff was used because each of these runs took an average of 2 weeks from start to finish. Due to the large amount of data found in the polynomial kernel run, the Coef0 parameter was split into two charts, and the high degree and low C values were excluded from the charts since they were low performing, as is shown in the graphs.  In an effort to simplify the data and to give us a better idea of what is going on, first the parameter Coef0 was analyzed to see which of the two states was better.
As figure 32 shows, a Coef0 of 1 outperforms a Coef0 of 0. In fact, it goes even further than this. If you check the charts like tables 10 and 11, comparing any of the rows, it can be seen that the Coef0 = 1 chart always outperforms the Coef0 = 0 chart.
After isolating the Coef0 value, the next parameter to choose is the degree, although this one is not as clear cut. Figure 33 shows that the cross validated accuracy starts highest at a degree of 1 and slopes downwards after that, while the other three accuracies trend up to a degree of 3 and then slope downwards. Because of this, the next step is to try to narrow down the C and ν values with a degree of less than or equal to 3. While the degree of 1 performs poorly in this graph, it should be noted that this is due to some poor performance on the C and ν choices that will become apparent later.
If you compare figures 34 and 35 to figures 30 and 31, all of the accuracies have improved, so this is on the right track. Now, within the C graph, it is again obvious that a C below 1 (a negative exponent k) performs poorly. In the future tests, we will remove those C values, as they were consistently poor across all kernels and all other parameters. In this case a C of 2 was chosen as the best, as the cross validated accuracy sloped downwards after that point while the other accuracies were fairly flat.
Looking at the ν graph, again it is obvious that after a ν of 0.7 the accuracies dropped quickly. Because that was also consistent across the kernels, the future tests will only go up to a ν value of 0.7. In this case, it appears that a ν value of 0.4 is the best balance of cross validated accuracy to real accuracy.
Finally, the degree can be reexamined now that the poor performers of C and ν have been removed from the data. As figures 36 and 37 show, while the degree of 3 was the best when bad data is included, the degree of 1 actually outperforms the others on good data. With this, we get a best-case accuracy on C of 67.60 and 77.02, while the best case ν has an accuracy of 66.82 and 77.10.
Because the cross validated accuracy of the best C was almost 1% better while the real accuracy was only 0.08% worse, the C method was chosen to go forward with. When compared against the linear kernel, which was the best before this, it again out-performs the cross validated accuracy by over 1% and again only loses on the real accuracy by 0.23%.

Entropy Test
The final test of the first run is to see what effect modifying the entropy formula has. Early on in the research, the formula was being calculated incorrectly by replacing the total entropy of the parent, H(Parent), with a hard-coded 1. This results in a much different ordering of the n-grams, as can be seen in tables 16 and 17. More interestingly, this error was also producing significantly higher accuracies in the preliminary testing. Therefore this test will confirm which method gets the best results. This run was done with a 300 comment training set, a length up to 6-grams, an n-gram percent of 4%, a KNN level of 5%, and the polynomial kernel using a C of 2, a degree of 1, and a Coef0 of 1.
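The exact surrounding formula is not given in the text, but the description implies a score that is divided either by H(Parent) or by a hard-coded 1. Assuming H(Parent) varies per n-gram, the two denominators can produce different orderings, as this toy example (with made-up gain and entropy values) shows:

```python
# Two hypothetical n-grams: (information gain, per-gram parent entropy).
# Both values are invented purely to illustrate the reordering effect.
grams = {"g1": (0.30, 0.90), "g2": (0.25, 0.40)}

# Hard-coded 1 in the denominator: ordering follows the raw gain.
raw = sorted(grams, key=lambda g: -grams[g][0])

# Dividing by H(Parent) instead: the ordering can flip.
norm = sorted(grams, key=lambda g: -grams[g][0] / grams[g][1])
```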
On this test, 4 runs were completed and resulted in an average standard deviation of 3.70. This means that the minimum number of runs necessary to be sure of the results was 2.10. As can be seen in table 18, there was no need to graph this, as the hard-coded 1 value significantly outperforms the H(Parent) method.

Second Run
With the first full run of parameter optimization completed, it is time to move onto the second run in order to check how much the data changed based on the other parameters.

Number of Comments
The first parameter to be re-optimized is the number of comments used in the training set. Again, this will test between 50 and 500 comments and will use a length up to 6-grams, an n-gram percent of 4%, a KNN level of 5%, and the polynomial kernel using a C of 2, a degree of 1, and a Coef0 of 1. Figure 38 shows that the 300 comments used before performs well, but 350 represents a substantial bump in this case. Again, were the comments to increase further, the accuracy would as well, but the disk access time would increase exponentially and the accuracy gains after 350 do not justify that.

N-gram Length
With the training set size re-optimized, the next parameter is the n-gram length. This was retested with 350 comments, an n-gram percent of 4%, a KNN level of 5%, and the polynomial kernel using a C of 2, a degree of 1, and a Coef0 of 1.
On this test, 2 runs were completed and resulted in an average standard deviation of 1.80. This means that the minimum number of runs necessary to be sure of the results was 0.50.
As can be seen in figure 39, until the length 3 n-gram the accuracies are low, but after 3 there is no significant increase in the accuracies. This is most notable when comparing length 3 to the previously used length of 6, where both the cross validated and real accuracies are now higher at 3. Thus, going forward, the lower length of 3 will be used.
Table 21. N-gram Percent Run 2

N-gram Percent
With the new n-gram length chosen, the next step is to reevaluate the percent of n-grams we are taking. This was retested with 350 comments, a length up to 3-grams, a KNN level of 5%, and the polynomial kernel using a C of 2, a degree of 1, and a Coef0 of 1.
On this test, 2 runs were completed and resulted in an average standard deviation of 1.56. This means that the minimum number of runs necessary to be sure of the results was 0.37.
As figure 40 shows, at a low percent the accuracies are poor, but the gains after that are not dramatic. In the case of this data, 7% was the best performer and was used for the rest of the tests.

KNN Level
After the n-grams are optimized, the next parameter was the KNN Level.
This was retested with 350 comments, a length up to 3-grams, an n-gram percent of 7%, and the polynomial kernel using a C of 2, a degree of 1, and a Coef0 of 1.
On this test, 2 runs were completed and resulted in an average standard deviation of 1.03. This means that the minimum number of runs necessary to be sure of the results was 0.16.
As happened in the first run-through, the KNN level eventually levels out, which can be seen in figure 41. The best case prior to the leveling out is 4%, which managed to outdo the eventual level rate of 81.71.
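The KNN trimming step itself is not reproduced verbatim in this chapter, so the following is a minimal sketch of one plausible implementation: drop any training comment whose k nearest neighbours (k taken as the KNN level fraction of the set size) mostly disagree with its own label. The Euclidean metric and the majority-agreement rule are assumptions.

```python
import numpy as np

def knn_trim(vectors, labels, knn_level=0.04):
    """Trim a training set with a KNN filter (illustrative sketch).

    vectors: (n, d) float array of n-gram feature vectors
    labels:  (n,) array of +1 / -1 class labels
    knn_level: k as a fraction of the training set size
    """
    n = len(vectors)
    k = max(1, int(round(n * knn_level)))
    keep = []
    for i in range(n):
        # Euclidean distance to every other point (assumed metric).
        d = np.linalg.norm(vectors - vectors[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        neighbours = np.argsort(d)[:k]
        # Keep the point only if its neighbourhood mostly agrees with it.
        if np.sum(labels[neighbours] == labels[i]) >= k / 2:
            keep.append(i)
    return vectors[keep], labels[keep]
```

The effect is to remove mislabeled or outlying comments before the SVM is trained, which is why the accuracy levels out once k is large enough to smooth over local noise.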

Grid Search
Finally, it is time to rerun the grid search to see how much has changed with the new optimizations. For this run we retested with 350 comments, a length up to 3-grams, an n-gram percent of 7%, and a KNN level of 4%. Because of the results from the last run, the range on C has been restricted to 0 to 7, and the range of ν was narrowed as well.
For the RBF kernel, figure 44 shows a C of 2 being the best performer. Even though a C of 3 has a slightly higher cross validated accuracy, the real accuracy drops considerably. This results in a best case of 79.09 and 80.55. Like with the linear kernel, figure 45 shows that a ν of 0.2 is the best choice. This has accuracies of 79.08 and 80.80.
This time, because of the data being restricted to fewer parameters, the first graph that was analyzed for the polynomial kernel was the degree graph shown in figure 46. It is clear in this case that the best degree this time around was 2.
Taking that, figures 47 and 48 were restricted to showing only data at degree 2.
This showed that C peaked at 1 and ν peaked at 0.3. This results in a best case for C of 80.66 and 81.15, while ν managed 80.52 and 80.96. Picking the best option from the grid search is one of the most subjective tasks of the dissertation; however, either the linear ν or the polynomial C options would be a good selection. In this case, the choice used from here on out is the polynomial C, because the additional flexibility of the polynomial kernel may aid on differing sets of data.
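The research drove LibSVM directly; scikit-learn's `SVC` wraps the same LibSVM library, so a comparable grid search over the narrowed polynomial-kernel ranges might be sketched as follows. The data here is random and purely illustrative, standing in for the n-gram feature vectors and banned/allowed labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative feature matrix and labels; in the research these would be
# the n-gram feature vectors and the banned/allowed classes.
rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

# Search the polynomial kernel over C and degree with Coef0 fixed at 1,
# mirroring the ranges narrowed down in the second run.
grid = GridSearchCV(
    SVC(kernel="poly", coef0=1),
    param_grid={"C": range(1, 8), "degree": [1, 2, 3]},
    cv=5,                      # 5-fold cross-validated accuracy
)
grid.fit(X, y)
print(grid.best_params_)       # e.g. {'C': 1, 'degree': 2}
```

As in the chapter, the cross-validated score that `GridSearchCV` optimizes should still be sanity-checked against a held-out "real" accuracy, since the two can disagree on which parameter combination is best.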

Testing the Model
This section will test what the best case accuracies are given different groups of comments. In each case, at least 5 runs of the data will be used in order to ensure a good average, but the standard deviation will still be calculated to check if more than 5 are required. For each of these tests, the parameters used were 350 comments, 7% of n-grams up to a length of 3, a KNN level of 4%, and the polynomial kernel with a C of 1, a degree of 2, and a Coef0 of 1.
Along with the cross-validated accuracy and the real accuracy that were reported on all other figures, when testing the model the .632+ bootstrap values were also calculated [2]. In this method, 200 training sets were randomly generated from the training set with replacement, and then, after training the model on those comments, the remaining comments that were not randomly selected were used as the testing set. The error from this calculation, Err_boot, is then averaged with the training error from the original training set, err, to produce a range of error estimations:

    Err_.632 = 0.368 * err + 0.632 * Err_boot    (11)

It is from these estimations that the 95% confidence intervals were calculated.
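Equation 11 can be sketched in code. The `model_factory` helper and the array layout are assumptions for illustration, not the dissertation's implementation; the out-of-bag comments of each resample serve as its testing set, as described above.

```python
import numpy as np

def err632(model_factory, X, y, n_boot=200, seed=0):
    """Estimate the .632 bootstrap error (Eq. 11):
    Err_.632 = 0.368 * err + 0.632 * Err_boot
    where err is the training (resubstitution) error and Err_boot is the
    average error on out-of-bag samples across n_boot resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)

    # Training error of a model fit on the whole original set.
    model = model_factory().fit(X, y)
    err_train = np.mean(model.predict(X) != y)

    oob_errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # comments never drawn
        if len(oob) == 0:
            continue
        m = model_factory().fit(X[idx], y[idx])
        oob_errors.append(np.mean(m.predict(X[oob]) != y[oob]))
    err_boot = np.mean(oob_errors)

    return 0.368 * err_train + 0.632 * err_boot
```

The 0.632 weight reflects that a bootstrap resample of size n contains, on average, about 63.2% of the distinct original samples, so the out-of-bag error is pessimistic and the training error optimistic; the weighted blend sits between the two.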

Legal Method
With all of the optimization finished, it is time to get a good reading on the maximum accuracy that can be gained on the legal method that tuned the original optimization.
After 5 runs, the standard deviation is 0.96 which means that 0.14 runs were needed for a good result.
As both the data in table 29 and figure 49 show, the cross validated accuracy is above 80%. The real accuracy also agrees with that, although there is a slight bias towards correctly identifying the banned comments.

Terms of Service Method
After getting the average accuracy of 81.76 on the legal method, all of the comments used were reclassified according to whether they contained politics or religion. This will test how accurate the optimized parameters can be on training sets that are significantly different from the set used to optimize them.
After 5 runs, the standard deviation is 3.59 which means that 1.98 runs were needed for a good result.
The terms of service method performed lower than the legal method, as seen in table 30 and figure 50. The cross validated accuracy hovers around 70% in this case.

Overall Method
With the accuracy of the legal and the terms of service method found, the last test for the YouTube comments is to see how well the model handles an integrated method combining both of the prior methods. For this test, any comment that was marked as restricted in either of the prior tests is now restricted in this one.
After 5 runs, the standard deviation is 2.19 which means that 0.74 runs were needed for a good result.
Like the terms of service method, the overall method does perform worse than the legal method that all of the parameters were optimized on.

Twitter Comments
The Twitter comments were classified in the same manner as the YouTube comments using the legal method. The average cross-validated accuracy is 83.76%, which is slightly higher than the YouTube comments and much higher than the terms of service or overall methods. The real accuracy is even more skewed towards the negative classes, however.

Retraining
With the maximum accuracies established, two different retraining methods were tested to see which performs better at increasing the accuracy. For the first, the comments that were incorrectly classified were added to the training set, allowing the set to grow. For the second, the training set was held at 350 comments, but was built out of the comments most often misclassified. For both of these tests, the original set of YouTube comments was used with the legal method classification scheme.

Adding Comments Retraining
For the first method of retraining, after each run all comments that were misclassified were marked by changing their training value from a 1 to a 2 or a -1 to a -2. The next time the training set was built, all of the -2 values were added to the training set and then an equivalent number of 2 values were added as well.
This was done because there were far fewer bullying comments than normal ones, so this was the only way to keep the set balanced. The process was repeated until no more bullying comments were added to the set.
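The adding-comments step above can be sketched as follows; the function and variable names are illustrative, not the system's own code, and the model is abstracted to a `classify` callable returning +1 (allowed) or -1 (banned).

```python
import random

def grow_training_set(train, pool, classify):
    """One retraining step of the adding-comments method (sketch).

    train: list of (comment, label) currently in the training set
    pool:  list of (comment, label) candidates not yet in the set
    classify: model trained on `train`, returning +1 or -1
    """
    missed = [(c, lbl) for c, lbl in pool if classify(c) != lbl]
    banned = [m for m in missed if m[1] == -1]
    allowed = [m for m in missed if m[1] == 1]
    # Balance: match every added banned comment with an allowed one,
    # since banned comments are the scarcer class.
    allowed = random.sample(allowed, min(len(allowed), len(banned)))
    added = banned + allowed
    train.extend(added)
    for m in added:
        pool.remove(m)
    return len(banned)   # zero means retraining has converged
```

Repeating this step until it returns zero matches the stopping rule described above: retraining ends once no more misclassified bullying comments remain to add.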
As figure 33 shows, at each step the accuracy improved, with the sole exception being the fourth run. As this result was not entirely surprising given the earlier observation that simply using more comments resulted in a higher accuracy, 10 additional runs were computed utilizing the same number of comments, but chosen at random instead of pulling in misclassified comments. This shows that the method of adding misclassified comments to the training set does outperform simply utilizing more comments.
While not exactly the same method used here, this strategy is similar to the boosting algorithm created at the University of Ottawa to handle imbalanced data sets [3]. In their case, rather than keeping the data sets balanced, they put all of the training data in and then modified the weights on the minority class so that the misclassified points had more of an effect on the final model. This would begin with a model that was heavily skewed towards the majority class, with most, if not all, of the minority class data being misclassified. Then, as the weights on the minority class were raised, the model would approach the optimum. In our case, rather than modifying the weights, we simply add the outlying points that may not have been addressed by the existing model while still maintaining a balanced set.

Priority Comments Retraining
For this method of retraining, the number of comments was fixed at the 350 total chosen during the parameter optimization. Each time a run was completed, the TrainValue was incremented by one if the classification did not match the training value. The new training set was then built by randomly selecting comments, but always taking the highest (most misclassified) training values first. Figure 35 clearly shows that this method did not work as expected. The average standard deviation was 13.64, which required 28.57 runs. This method did reveal one weakness of the system, in which some comments could be misclassified even when always in the training set. The worst case was the comment "fucking brilliant", which out of 30 runs was misclassified 23 times as bullying even though it is not.

Dual Core Speed
The purpose of this test is to establish the average speed that can be achieved by a dual core computer. A dual core is used instead of a single core because both the processing and analyzing programs were designed to create a thread for one less than the total number of cores available. This is to ensure that there remains processing power available both for the database and for the primary thread of the programs.
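The one-thread-less-than-cores design might be sketched like this; the `classify` placeholder stands in for the real n-gram + SVM pipeline, and the worker-pool structure is an assumption about how the services could be arranged.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Reserve one core for the database and the program's primary thread,
# matching the design of the processing and analyzing services.
workers = max(1, (os.cpu_count() or 2) - 1)

def classify(comment):
    # Placeholder for the real n-gram extraction + SVM classification.
    return len(comment) % 2 == 0

comments = ["first comment", "second comment", "third"]
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(classify, comments))
```

On a dual core machine this yields a single worker thread, which is why the dual core tests below effectively measure single-threaded throughput plus coordination overhead.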

Processing Speed
The processing run on the 15,000 comments took 6 hours, 9 seconds and 200 ms to complete or 1,440 ms per comment. The standard deviation of the processing was 4,217.33 ms which equates to a coefficient of variance of 294.10%. This means a minimum of 13,291 comments were required. As seen in table 36, the average was 1,434 ms per comment but it was 1,440 ms per comment overall showing that there is some overhead processing involved between the processing of a comment and the start of the next. The median time was only 586 ms which points to a few outliers skewing the data, so the 95% confidence level was also calculated and shown in the table. This reduced the average to 1,073 ms.
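The statistics reported in these speed tests can be sketched as follows. The outlier-trimmed "95% confidence" average is computed here as the mean of the central 95% of samples, which is one plausible reading of the method and not confirmed by the source.

```python
import numpy as np

def timing_stats(times_ms):
    """Summary statistics used in the speed tests (illustrative).

    Returns the mean, median, coefficient of variance (as a percent),
    and the mean of the central 95% of samples.
    """
    t = np.asarray(times_ms, dtype=float)
    mean, median = t.mean(), np.median(t)
    sd = t.std(ddof=1)                       # sample standard deviation
    cv = 100.0 * sd / mean                   # coefficient of variance, %
    lo, hi = np.percentile(t, [2.5, 97.5])
    trimmed_mean = t[(t >= lo) & (t <= hi)].mean()
    return mean, median, cv, trimmed_mean
```

A median far below the mean, as seen with the 586 ms median against the 1,434 ms average, is exactly the signature of a few large outliers that the trimmed average is meant to discount.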

Analyzing Speed
The analysis run took 8 minutes, 31 seconds and 960 ms to complete or 34 ms per comment. The standard deviation of the analysis was 2.58 ms which equates to a coefficient of variance of 8.90%. This means only 13 comments were required.
As seen in table 37 the average is 29 ms while the median is 30 ms which shows how consistent the data is. Even at the 95% confidence level the average remains 29 ms.

Multi-Core Speed
This test will show how well the algorithm scales as the number of cores increases. As mentioned in the last test, the services are designed to create one less thread than the number of cores available in the system. For this test, the system will have a hexacore processor assigned to it in VMware.

Processing Speed
This processing run took 1 hour, 46 minutes, 41 seconds and 103 ms or 426 ms per comment. This is 2.38 times faster than the dual core method while using 3 times the number of cores. However, table 38 shows that the average processing time each comment takes has actually increased and both the 95% confidence level average and the median agree. However, because it is now able to handle 5 comments at the same time it actually reduces the effective average to 426 ms per comment. The standard deviation is also reduced to 5,579.81 ms which means a 262.83% coefficient of variance. This means that 10,615 comments were required.

Analyzing Speed
The analysis run took 6 minutes, 16 seconds and 823 ms to complete which is only 0.36 times faster than the dual core method. Again table 39 shows that the average and median time per comment increased while the per comment time was reduced to 25 ms. The standard deviation increased to 15.42 ms which is a coefficient of variance of 37.62%. This means only 218 comments were required.

Multi-Computer Speed
This final speed test is designed to show how well the system design scales when additional computers are added. This will allow for servers to be brought online to scale the throughput required, either based on expected workload or in response to a sudden increase in the frequency of comments. Note that in these tests, the two additional computers utilized are much lower powered laptops than the primary machine that has been used in all other tests. So, while an increase in performance is expected, it will not be linear.
Table 41. Multi-Computer Analysis Stats

Processing Speed
This processing run took 1 hour, 15 minutes, 18 seconds and 976 ms, or 301 ms per comment. This is only 0.42 times faster than the multi-core method, while table 40 again shows an increased average time per comment. The standard deviation was 11,800.82 ms, which is a coefficient of variance of 266.14% and requires 10,885 comments.

Analyzing Speed
Finally, the analysis run took 7 minutes, 22 seconds and 750 ms or 29 ms per comment. This is actually 0.18 times slower than the multi-core method. Table 41 shows that again the average and median are higher, but in this case the average over time is increased as well. The standard deviation is 47.91 ms with a coefficient of variance of 71.51% with a minimum of 786 comments required.

Analysis of Goals
This section will analyze each of the goals to show that the research was able to meet them.

Legal Definition
While the analysis of the laws regarding cyberbullying showed them to be highly subjective and designed with juries making the final determination, the research was able to narrow them down to some simple rules that could be deployed with a minimal subjective requirement. The first test is whether the comment is sexual in nature; while this is stricter than the law requires, these comments appear on sites where minors are present. The second test is whether the comment was intended to seriously alarm, annoy, or bother the subject. The final test is whether the comment serves a legitimate purpose. This third rule is the most subjective of the three, but is still simple enough to work for this research.
One thing the creation of this definition showed was that at this time there are no specific laws for cyberbullying on its own. Instead, it falls under the broader laws for harassment. For this reason, although the research was primarily aimed at combating the increasing cyberbullying, the end point proved successful against a much broader range of restricted speech.

Distinguish Cyberbullying
After optimizing all of the parameters, the optimum was found to be a training set with 350 comments, 7% of n-grams up to length 3, a KNN level of 4%, and the polynomial kernel with a C of 1, a degree of 2, and a Coef0 of 1. The system was capable of identifying cyberbullying 81.8% of the time. This means even if a human moderator has to check all of the misclassified comments manually as users flag them as incorrect, it would still drastically cut down on their workload. This will allow fewer moderators to handle an increased load of commenters without having to sacrifice the safety of the users.
Switching the moderation from matching the legal definition of cyberharassment to a method based on a terms of service decreased the overall accuracy to 73.4%, and using both methods resulted in an accuracy of 69.9%. Utilizing the legal method on the comments taken from Twitter resulted in an accuracy of 83.8%.

False Positives and Negatives
From the comments that were gathered off of YouTube, less than 10% were classified as cyberbullying. This means even if the algorithm marked all comments as negative (not cyberbullying), it would have achieved a 93.4% overall accuracy.
In practice, however, the algorithm generally had the negative class accuracy within 10% of the positive class due to the balanced training file.

Allow Retraining
After testing several different retraining strategies, it was shown that the best way to retrain is to add all of the misclassified negative comments and then balance the class with misclassified positive comments. While this will cause the training set to grow beyond the optimized 350 comments, the added time in generating the training set will be more than made up for with the increased accuracy. In testing, just 6 iterations increased the accuracy by more than 10%.

Speed and Parallel Operation
In 2016, Twitter averaged around 6,000 comments per second [1]. With a dual core processor and a single thread, the algorithm was only able to process 2 comments every 3 seconds. When the system was scaled up to a hexa-core processor, that same system was able to handle 2 comments per second. Adding on 2 additional laptops brought the final speed to 3 comments per second. In total, the 3 comments per second represented approximately 18 logical processors. Some of the poor scaling in these tests is due to the three computers communicating over Wi-Fi and everything utilizing a single SSD, leading to multiple points of bottlenecking. Thus, since even popular YouTube channels can afford 36 core, 72 logical core servers for rendering [2], assuming linear scaling puts such a server at handling 12 comments per second, even without accounting for RAID disks and running on a host OS instead of a VM. This means that without finding additional optimization angles, it could be assumed that 500 of those 36 core Xeon servers may be able to handle the 6,000 comments per second. Given that, as of 2010, technical presentations put Facebook as having over 60,000 servers [3], that is not out of the realm of feasibility for a company expecting to handle thousands of comments per second.

Future Work
While the scalability of the system was tested as part of the research, it was only done on enthusiast consumer grade hardware. The first test that should be performed is to test the performance of the system on a high speed dedicated Xeon system, properly set up for handling high I/O databases. This will allow for a better estimation of the number of comments per second a computer can scale to. With that, a second server should be added with a 10 Gb fiber connection to see how well it continues to scale across data center hardware.
The accuracy estimations in the research are all based on the classification of a single researcher, and while it does show the method is able to accurately classify in the same method as a single moderator, it does not necessarily correlate to how well it will do in an actual system. Thus, the algorithm should be run in tandem on a large system side-by-side with a moderation team, and utilizing retraining,