VOTING NEAREST NEIGHBORS: SVM CONSTRAINTS SELECTION ALGORITHM BASED ON K-NEAREST NEIGHBORS

Ninety percent of the world's data today was generated over the last two years, driven by the speed at which information is created on the Internet and the low price of storage and sensors. This new paradigm is what we call Big Data. One of the biggest challenges in the field of Machine Learning today is how established algorithms perform on Big Data. The sheer size of these datasets can make it infeasible to use known algorithms to create a decision surface in a reasonable time. Support Vector Machines is one of the algorithms that experience a steep increase in runtime when creating a decision surface for Big Data, a fact that led to the decline of its use for classification on these types of datasets. This dissertation introduces Voting Nearest Neighbors, a new preprocessing algorithm that assists Support Vector Machines in dealing with Big Data by creating a voting system based on k-nearest neighbors. The algorithm selects points close to the border between classes that have a higher chance of being used by a Support Vector Machine as Support Vectors, while removing outliers that would negatively impact the margin created. Only these points are used in the training of the Support Vector Machine, allowing it to create a decision surface in a reasonable time. In order to guarantee good performance in a reasonable time, the algorithm is implemented in parallel using CUDA on the GPU. The technique was successfully tested against 5 datasets that cover a broad range of sizes, from Iris, containing just 150 points, to the Air Pressure System Failure and Operational Data for Scania Trucks dataset, which has 60,000 points, with an encouraging reduction in runtime for Big Data datasets and an impressive performance when used to classify imbalanced datasets.

ACKNOWLEDGMENTS I thank my advisor Dr. Lutz Hamel for all the support over the past years, for helping me come up with a first topic idea and for helping me start again when we found that the topic had already been done.  While the current set of algorithms for Support Vector Machines, such as SMO [2], is accepted as a good way of finding decision surfaces, they are usually infeasible when working on big datasets, as the time it takes to train a classifier grows with dataset size. But big datasets are becoming an intrinsic part of machine learning, with data being created at speeds never seen before, generating bigger and bigger datasets. When those datasets need to be classified, Support Vector Machines end up being put aside for other techniques better suited to deal with Big Data.
The goal of this project is to create a preprocessing algorithm that will generate a much smaller subset of the original dataset, containing points with a higher chance of being support vectors, while at the same time trying to eliminate potential outliers. This new subset will be used to train a Support Vector Machine instead of the full dataset, creating the decision surface faster while achieving an accuracy comparable to that of a Support Vector Machine trained on the full dataset.

Significance of the Study
Support Vector Machine (SVM) is a powerful machine learning technique, but it can be very compute-intensive, especially when working with big datasets. This happens because the most widely used algorithm behind SVMs, Sequential Minimal Optimization (SMO), has to break the SVM problem into smaller sub-problems, and the bigger the dataset, the more sub-problems the algorithm has to take into consideration, causing the runtime to go up significantly.
To avoid this escalation, many SVM variants use different algorithms to find the border. Some train small localized SVMs whenever a new query is received [3]; others enclose each class in a polytope and try to find the support vectors by looking at the points on the border of each polytope [4,5].
But instead of creating new algorithms, it is possible to increase the speed of the SMO (and of the SVM) by preprocessing the training data, trying to find data points on the border between classes and using just this subset as the SVM training data [6,7,8,9]. These techniques are more interesting because they allow us to use established and optimized SVM implementations already in existence, independently of the size of the datasets.

Purpose of the Study
This project tries to improve on the techniques that increase the speed of SVMs by finding the points on the border between classes, more specifically the KNN-SVM [7] and KNNFilter techniques. Both are successful preprocessing algorithms that look at a dataset and find a smaller subset for faster SVM training based on its k-Nearest Neighbors (kNN). These techniques will be explained in full in Section 2.2, but in layman's terms they work as follows.
To find the points on the border between classes, the KNN-SVM divides the data by class and makes each point find its k-Nearest Neighbors from the other class. All the k-Nearest Neighbors found are selected to create a new, smaller subset that will be used in the training of an SVM.
The hypothesis is that the information generated by the k-Nearest Neighbors is being underutilized by these algorithms. Figure 1a shows an example run of KNN-SVM where the algorithm selects the 3-Nearest Neighbors; in that example most of the points end up selected. What if, instead of adding every point to the subset, the algorithm kept a score of how many times a point is one of the k-Nearest Neighbors of a point of the other class? The result would look like the example found in Figure 1b, where we can see that points near the border got a higher number of votes than the ones further back, with the exception of the red outlier, which received as many votes as the points on the real border between classes. Now, instead of using all points that received a vote, the algorithm can select the minimum and maximum number of votes needed for a point to be selected for the new border, giving the user more control over the border to be selected. In Figure 1c we see the new set selected to be processed by the SVM, generated by voting for the 3 Nearest Neighbors and selecting points with a minimum of 4 votes and no maximum number of votes.
This is an example of Voting Nearest Neighbors (VNN-SVM), the algorithm proposed in this dissertation. The hypothesis is that, by using the k-Nearest Neighbors to cast votes instead of immediately selecting the points, the algorithm will be able to select fewer and better points for our candidate border.
This is expected as the points right in the border between classes should be the ones receiving the majority of the votes.
The number of votes received by each point could also be used to further improve the selection of the possible Support Vectors. In [10] and [11], the authors show that pruning a significant portion of Support Vectors of an SVM can be done without a significant loss of accuracy, sometimes even achieving a greater generalization of the decision surface. If the votes can be used in this way to find outliers they could be removed from the selection to create a more generalized decision surface.

Goals
To be considered successful the algorithm will need to achieve the following goals: • Select a smaller set of points for SVM training: The algorithm has to be able to select a subset of the full data that can be used to successfully train an SVM.
• Remove points far from the margin between classes: As the number of votes grows, more points far from the margin between classes will start to receive votes. These points will have fewer votes than the ones close to the border and may not be as important for the SVM, so they can be safely removed from the selected subset. The algorithm should be able to remove those points based on the number of votes they received.
• Remove outliers based on how many votes they received: Outliers are points that are far from members of their class and most of the time end up closer to members of other classes, making them very hard to classify. When applying the proposed algorithm the outliers will receive a great number of votes, so the algorithm should be able to remove those points based on the number of votes they received.
• Be able to achieve a better generalization of the SVM decision surface by changing which points are selected: By selecting fewer points for training and removing outliers the algorithm should select a subset that should be less susceptible to overfitting when training the SVM. The decision surfaces created that way may have a better generalization by selecting fewer points as support vectors.
• Run in a reasonable time: The algorithm was created to save time on Big Data datasets, so it should run fast, and its runtime plus that of the subsequent SVM should be less than the time taken to run the SVM using the complete dataset.

Organization
This thesis is structured as follows: Chapter 2, Background: this chapter contains all definitions and background needed to understand the problem; it also highlights the algorithms that were used as a base for the creation of Voting Nearest Neighbors. Chapter 4, Results: this chapter presents the datasets to be analyzed, the metrics used in the analyses and the results.
Chapter 5, Conclusion: this chapter reviews the performance of the algorithm given the goals described in Section 1.4, as well as discussing possible future work derived from this work.
In Figure 2 we can see an example of a margin created by an SVM; the three highlighted points are being used as support vectors (2 red and 1 blue). They are the only points needed to define the margin, the maximum distance between classes (represented by the green parallel lines in the figure).
In the middle of the margin and parallel to it is the decision surface. The SVM will find this decision surface based only on the support vectors.
In our example, when new points are classified, if the point is above the decision surface it will be classified as red; if it is below the decision surface it will be classified as blue. The only uncertainty is when the point is exactly on the decision surface, meaning it could be of either class; to solve that problem the different implementations of SVMs will usually hardcode that any point in this situation is always classified as one specific class.
In an SVM the Support Vectors are found as a result of the dual maximum margin optimization equation (1), where the algorithm will find the best values for the Lagrange Multipliers (α) for every point of the training set, subject to the constraints in (2), where: • φ is the Maximum Margin Lagrangian dual that will maximize the border; • α is the set of Lagrange multipliers. The i-th point of the dataset has its own Lagrange multiplier α_i; these variables are the ones modified by the optimization in order to find the support vectors; • y_i is the class label of the i-th element of the training data; in SVMs the labels are either 1 or -1; • κ is the kernel function, one of a set of functions used by SVMs to calculate the distance between points of the dataset. In Figure 2 the function used is the dot product; the reason we have a set of functions for this part will be explained later in this section; • x_i is the vector that represents the i-th point of the dataset with all its values; • l is the number of elements in the training data.
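For reference, the standard form of this dual and its constraints (reconstructed here from the definitions above; the original equation numbers (1) and (2) refer to it) is:

```latex
\max_{\alpha}\; \phi(\alpha) \;=\; \sum_{i=1}^{l} \alpha_i
  \;-\; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
        \alpha_i \alpha_j \, y_i y_j \, \kappa(x_i, x_j)
\quad \text{subject to} \quad
\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \alpha_i \ge 0 .
```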
When maximizing the equation subject to these constraints we find that the points that make up the margin will have α_i > 0, while points outside the margin will have α_i = 0. The points with α_i > 0 are the support vectors, and with them we can finally create the decision surface. As said before, the decision surface can be written as w · x = b, and using κ, α, y and x (where x_sv+ is one support vector from the positive class in the set of available support vectors) we can use equations (3) and (4) to find w and b respectively.
Now, to classify a new point, the SVM needs to find where that point lies in relation to the decision surface. This is done using equation (5); substituting w and b in (5) we get equation (6), which gives us the classification based on our support vectors.
As mentioned before, all points on one side of the decision surface will be classified as one class and all points on the other side will be classified as the other. This is represented by the sign function in equation (6), where all positive values are classified as the positive class and all negative values as the negative class.
In equations (1), (4) and (6), κ refers to a kernel function. In a simple linear classifier the kernel function is nothing more than the dot product, calculating the distance between x_i and x_j. However, with a kernel function we can use an algorithm originally designed to find linear classifiers on non-linear problems, using what is commonly referred to as the kernel trick. The kernel trick consists in changing the dimension of the data before calculating the dot product and creating the decision surface. In Figure 3 the first plot has the original dataset in 2 dimensions; there is no way to divide the data as it is with a simple line. But in the second plot a 3rd dimension was added based on the 2 original dimensions (in this case: z_i = x_i^2 + y_i^2). Now the inner circle has lower z values than the outer circle, making it easy to create a plane that will correctly classify all the data, as seen in the 3-dimensional plot of Figure 4.
Figure 4: The kernel trick decision surface [2]
The tricky part of the kernel trick is that we don't need to store the dataset with the new dimension or even calculate what that dimension would be. The kernel function implicitly calculates the dot product in this higher dimension given just the original points. In our example, instead of creating a new dataset with the z dimension, we just use the kernel (7). The decision surface created by the algorithm is still a hyperplane that defines a linear classifier, but in kernel space; to the user it will look and behave like a non-linear classifier in feature space, as shown in the second plot of Figure 4, with the decision surface being the green circle dividing the classes.
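This equivalence can be checked directly in code. The sketch below (plain Python; the kernel written here is a hypothetical one matching the z = x^2 + y^2 mapping described above, not a reproduction of equation (7)) shows that the kernel evaluated on the original 2-D points equals the dot product of the explicitly mapped 3-D points:

```python
def phi(p):
    # explicit feature map: add z = x^2 + y^2 as a third dimension
    x, y = p
    return (x, y, x * x + y * y)

def kernel(a, b):
    # implicit version: the same 3-D dot product computed from the
    # original 2-D points only (hypothetical kernel for this mapping)
    return a[0] * b[0] + a[1] * b[1] + (a[0] ** 2 + a[1] ** 2) * (b[0] ** 2 + b[1] ** 2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a, b = (1.0, 2.0), (-0.5, 3.0)
# the kernel never builds phi(a) or phi(b), yet agrees with their dot product
assert abs(kernel(a, b) - dot(phi(a), phi(b))) < 1e-12
```

The point of the trick is visible in the assertion: `kernel` touches only the two original coordinates of each point, yet returns the dot product in the higher-dimensional space.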
Kernels like (7) are of limited use in real life, as they are restricted to two-dimensional datasets, but there exist more generalized kernels, like the Polynomial, Gaussian and Sigmoid kernels, used in everyday applications. The study and creation of new kernels is an area of research by itself.
The way the SVM has been described so far works only when the data is perfectly separable, as a maximum margin classifier wouldn't be able to create a margin if points of different classes are mixed together, because there would be no way to separate the classes. To deal with this problem the SVM changes to a soft margin classifier, shown in equation (8); this change allows the SVM to accept misclassified points and points inside the margin in order to produce the greatest margin possible under the new parameters.
Subject to the constraints: in equation (8) there is no change to the Maximum Margin Lagrangian, but to the constraints, where the variable C (Cost) is added. The Cost allows the addition of slack variables (points misclassified or inside the margin) to the SVM and keeps track of how much error is being introduced. The relationship between C and the margin is as follows: • a large C creates an SVM with a small margin (more closely related to the original SVM); • a small C creates an SVM with a larger margin that will accept more slack variables in it.
Now points misclassified or inside the margin will have α = C, points exactly on the margin will have C > α > 0, and points far from the decision surface will keep α = 0. So, by manipulating the value of C, the SVM can admit more slack variables and may create a more generalized decision surface.

The k-Nearest Neighbors Algorithm
k-Nearest Neighbors (kNN ) [3] is a machine learning technique that classifies a point based on its k-nearest known data points in the training data. It is a type of instance-based learning, where all computation is deferred until classification.
The algorithm is very simple and can be described as follows: 1. Training Phase: Store all known points of the training data and their respective labels.

Classification phase:
To classify a new entry e: (a) Calculate the distance between e and every point in the training data using the distance function D.
(b) Find the k closest points in the training data.
(c) Classify the new point e as the most frequent class among the k points.
In Figure 5, we show a simple example of kNN. The algorithm will classify the small black circle in the middle of the concentric rings as a blue square if k =3.
However, this classification would be changed to a red diamond if k =5.
The distance function D can be specified by the user when creating the classifier. The most common distance functions used are the Euclidean distance, when working with continuous variables, and Hamming distance, when working with discrete values.
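The classification phase above can be sketched in a few lines of Python (a minimal illustration, not the implementation used in this work; squared Euclidean distance stands in for the distance function D):

```python
from collections import Counter

def knn_classify(train, labels, e, k, dist=None):
    """Classify point e by majority vote among its k nearest training points."""
    if dist is None:
        # squared Euclidean distance; ranking is the same as Euclidean
        dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # (a) distance from e to every training point, (b) keep the k closest
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], e))[:k]
    # (c) most frequent class among the k neighbours
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

train = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
labels = ["blue", "blue", "blue", "red", "red"]
print(knn_classify(train, labels, (0.4, 0.4), 3))  # -> "blue"
print(knn_classify(train, labels, (5.4, 5.0), 3))  # -> "red"
```

Passing a different `dist` function reproduces the choice of D discussed above (e.g. Hamming distance for discrete attributes).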

Big Data
With the advance of storage space, sensors and the Internet, data is being generated in incredible amounts every day, and collecting it has become something anyone can do. It is increasingly common to find new datasets containing dozens of gigabytes or even terabytes of data.
This area of massive datasets is now part of what we call Big Data. Because of the speed at which data is generated and at which storage capacity increases, the definition of Big Data has changed a lot since its conception. For this study I will use Doug Laney's 3 V's definition of Big Data [4]. In his paper he used 3 keywords to describe Big Data: Volume The quantity of data. Big Data will have big databases. The size at which a dataset can be considered Big Data depends on the context; what is Big Data for a personal computer might not be for a company server.
Variety The nature of the data; how diverse the data is. Companies like Facebook and Youtube deal with large amounts of text, images and videos. The way to work with and organize each has to be well decided.
Velocity The speed in which data is generated. How fast the database is growing.
Small studies could grow in batches when data is collected back from the field, social networks can have tens of thousands of new entries every second.
When working with Big Data some of the most common algorithms and techniques can become impractical due to the computation time required or memory used. For that reason new algorithms have to be created.

CUDA
CUDA is a parallel computing platform and programming model created by NVIDIA in 2006 that uses the Graphics Processing Unit (GPU). The goal of the platform is to enable the use of the computational power of thousands of specialized computing cores for generic problems. Because of the specific architecture behind GPUs, programs written in CUDA have to take different precautions from normal parallel programming to achieve maximum speedup.
The CUDA toolkit released by NVIDIA is a free solution for CUDA programming, with built-in Visual Studio integration and a diverse number of precompiled libraries that use the GPU to its fullest.
When programming in CUDA you write code for both the CPU and the GPU; usually the CPU code handles memory management and I/O, while the GPU code is where most of the work takes place. The GPU code written by the user (not from a library) is referred to as a CUDA kernel.

Previous Work
The proposed algorithm will modify and improve on the following previous works.

A fast training algorithm for support vector machines based on K nearest neighbors (KNN-SVM )
The KNN-SVM algorithm [5] focuses on preprocessing the data to find the border vector. The border vector refers to the data points on the boundary between classes. This is achieved by adding to a subset all data points of a class that are among the kNN of any data point of the other class. The algorithm goes as follows: Step 1: Divide the training set A into positive set A+ = {x+1, x+2, ..., x+n1} and negative set A− = {x−1, x−2, ..., x−n2}, where n1, n2 are the number of positive and negative examples of the training data respectively. Select parameter k and kernel function κ.
Step 2: Calculate the distance matrix D = (d_ij) of size n1 × n2 from each data point of A+ to all data points of A−. Sort the elements of each row of D from small to large and extract the first k columns to get a new matrix D1. Then find the corresponding column index j of each element of D1 in matrix D and obtain the corresponding elements of A−, which form the border vector S− of A−.
Step 3: Calculate the distance matrix D = (d_ij) of size n2 × n1 from each data point of A− to all data points of A+. Sort the elements of each row of D from small to large and extract the first k columns to get a new matrix D1. Then find the corresponding column index j of each element of D1 in matrix D and obtain the corresponding elements of A+, which form the border vector S+ of A+.
Step 4: The final border vector set of the two class samples is A' = S+ ∪ S−. Step 5: Train the SVM by substituting the border set A' for the training set A, obtain the support vector set, then construct an optimal separating hyperplane. Steps 1 through 4 are the preprocessing of the data to find the border vector, with step 5 being the use of any SVM classifier. The ability to correctly create an optimal hyperplane will depend on the choice of k relative to the complexity of the data to be analyzed. A small k on a complex dataset can miss important data points of the border vector, while a big k on an easy dataset can result in a border vector full of irrelevant points.
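As an illustration, steps 1 through 4 can be sketched as follows (plain Python with squared Euclidean distance in the original feature space; a simplified sketch, not the paper's implementation):

```python
def knn_svm_border(A_pos, A_neg, k):
    """Sketch of KNN-SVM border-vector selection (steps 1-4)."""
    d2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def nearest_opposite(src, dst):
        # union, over every src point, of the indexes of its k nearest dst points
        border = set()
        for p in src:
            border.update(sorted(range(len(dst)), key=lambda j: d2(p, dst[j]))[:k])
        return [dst[j] for j in sorted(border)]

    S_neg = nearest_opposite(A_pos, A_neg)  # step 2: border vector of A-
    S_pos = nearest_opposite(A_neg, A_pos)  # step 3: border vector of A+
    return S_pos + S_neg                    # step 4: A' = S+ U S-

A_pos = [(0, 0), (1, 1), (2, 2)]
A_neg = [(5, 5), (6, 6), (10, 10)]
print(knn_svm_border(A_pos, A_neg, k=1))  # -> [(2, 2), (5, 5)]
```

Only the two facing points survive the pruning here; step 5 would then train any off-the-shelf SVM on this reduced set.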
The original paper didn't discuss how to support multi-class datasets, so for tests that needed it the one-vs-one method was used. That means the algorithm was run once for each pair of classes, which guarantees that the border vector will contain the most relevant points.
To analyze the ability of the KNN-SVM to select the border vector without changing the shape of the data the algorithm was tested on artificial datasets with well defined shapes as shown in Figure 6.
The tests showed that the KNN-SVM was successful in selecting the border vector without compromising the shape of the border between classes and any SVM used to create a decision hyperplane would have similar performance using either the original training data or the border vector.
The algorithm was then tested on 3 real datasets, two of them from the UCI repository, the last from the epil R library. All tests were run in RStudio using the SVM contained in the e1071 library. The algorithms were compared based on SVM training size, number of support vectors and accuracy over the full dataset. The results in Table 1 show that the average size of the border vector was 30% of the original dataset, and the SVMs trained used on average 70% of the original support vectors, all without any significant loss of performance.

SVM Constraint Discovery using kNN applied to the Identification of Cyberbullying (KNNFilter )
Before finding the KNN-SVM, I implemented my own version of a kNN method of border selection. Unlike KNN-SVM, this algorithm looks for all neighbors independently of class and decides whether the point analyzed is a good candidate based on them. The KNNFilter algorithm works as follows: Step 1: Create an empty list SVC that will hold the Support Vector Candidates.
Repeat steps 2 through 6 for every point i in the dataset.
Step 2: Create a vector (DistVector ) to hold the distance between i and every other point in the training data calculated using the distance function D.
Step 3: Append a new row with the indexes of the other data points creating a new matrix DistMatrix.
Step 4: Sort DistMatrix based on the distance row. Now one row holds the distances from i to all other points of the dataset, in increasing order, and the other row holds the corresponding index of each point in the original dataset.
Step 5: Select the k -nearest neighbors of i by using the first k elements of the index row.
Step 6: Compare the class of those k neighbors to the class of data point i. If any neighbor has a different class than i, then i is added to SVC.
Step 7: Train the SVM by substituting the border set SVC for the training set A, obtain the support vector set, then construct an optimal separating hyperplane.
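Steps 1 through 6 can be sketched as below (plain Python with squared Euclidean distance; a simplified sketch of the idea, not the original R implementation):

```python
def knn_filter(data, labels, k):
    """Sketch of KNNFilter: keep point i if any of its k nearest
    neighbours over the whole dataset has a different class."""
    d2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    svc = []                                              # step 1
    for i, p in enumerate(data):
        others = [j for j in range(len(data)) if j != i]
        others.sort(key=lambda j: d2(p, data[j]))         # steps 2-4
        neighbours = others[:k]                           # step 5
        if any(labels[j] != labels[i] for j in neighbours):  # step 6
            svc.append(i)
    return svc

data = [(0, 0), (1, 0), (1.5, 0), (3, 0)]
labels = [0, 0, 1, 1]
print(knn_filter(data, labels, k=1))  # -> [1, 2], the two facing border points
```

Note how, unlike KNN-SVM, the neighbour search here ignores class labels until the final comparison, which is what drives the larger k values discussed below.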
The main difference between KNNFilter and KNN-SVM is the value of k needed to find similar borders. Because KNN-SVM looks for points only in the other class, small values of k are enough to create a good border independent of how the classes are distributed. For KNNFilter the k value will always be higher than for KNN-SVM. KNNFilter is also more dependent on how the classes are distributed: classes with a big overlap need a small k, while with well-defined classes the k can get so big that most of the points end up selected for training.
KNNFilter had similar success on the same tests that were applied to KNN-SVM. The artificial dataset tests shown in Figure 7 demonstrate KNNFilter's capability of maintaining the border shape (the last two datasets needed a k so big that almost no pruning happened and were omitted).
KNNFilter was also tested with the same datasets as KNN-SVM, with very similar outcomes. Its results are shown in Table 2.

A Fast Incremental Learning Algorithm for SVM Based on K Nearest Neighbors (KNN-ISVM )
The KNN-ISVM algorithm [6] introduces the concept of incremental training: the training data can now be divided into several incremental steps, each step functioning like the previous algorithm but adding new points to the previous selection. Suppose there is a training dataset A and an incremental training dataset B, and assume that they satisfy A ∩ B = ∅.
Steps 1 through 5: These are identical to those of the KNN-SVM algorithm.
Step 6 Add the incremental training sample set B, let A = A ∪ B, then return to step 1.
Repeat steps 1-6 for each batch of incremental samples.
One important change between KNN-SVM and KNN-ISVM is how they calculate the distance matrix. In the previous paper the distance was measured in the original feature space, while in the latter all distances are calculated in kernel feature space using equation (9): d(x1, x2)^2 = ||Φ(x1) − Φ(x2)||^2 = κ(x1, x1) − 2κ(x1, x2) + κ(x2, x2), where κ(x1, x2) is the kernel function of the high-dimensional feature space and Φ(x) is a non-linear map of vector x.
This change guarantees that the kNN pruning will work in the same feature space as the SVM applied in step 5, but more tests are needed to see how the change of feature space impacts the selected border vectors. This algorithm is the only one not tested, but it is expected to have an accuracy comparable to KNN-SVM.
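The kernel-space distance expands entirely in terms of κ, so it never needs Φ(x) explicitly. A minimal sketch (with the linear kernel, where it must reduce to the ordinary squared distance):

```python
def kernel_distance_sq(x1, x2, kappa):
    """Squared distance between Phi(x1) and Phi(x2) in kernel feature
    space, computed only through kernel evaluations (equation (9))."""
    return kappa(x1, x1) - 2.0 * kappa(x1, x2) + kappa(x2, x2)

# sanity check: with the linear kernel (plain dot product) this is
# just the ordinary squared Euclidean distance
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
x1, x2 = (1.0, 2.0), (4.0, 6.0)
assert abs(kernel_distance_sq(x1, x2, dot) - 25.0) < 1e-12  # 3^2 + 4^2
```

Swapping `dot` for a Gaussian or polynomial kernel gives the distances KNN-ISVM would use without ever materializing the high-dimensional map.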
The Voting Nearest Neighbors algorithm (VNN-SVM) improves on the KNN-SVM technique by adding a voting system: now, instead of just adding any possible data point selected via kNN to the subset, each selection increments the number of votes for a specific data point. After all votes are cast, the algorithm selects the best candidates for possible margin members using the number of votes cast on each data point. The hypothesis is that data points with extremely few votes aren't relevant to the border, and data points with extremely many votes might be the result of outliers and don't need to be added to the subset. The algorithm is the following:
Step 1: Divide the training set A into positive set A+ = {x+1, x+2, x+3, ..., x+n1} and negative set A− = {x−1, x−2, x−3, ..., x−n2}, where n1, n2 are the number of positive and negative examples of the training data respectively. Select the parameter k, which represents the number of votes each point will cast, the kernel to be used by the SVM, and lowerBound and upperBound; these two will be used to find the minimum and maximum number of votes a point will need to be used in the new border;
Step 2: Calculate the total number of votes, TV, cast on each set, with TV− = n1 * k and TV+ = n2 * k;
Step 3: Calculate the lower bound cut lbc* with lbc* = TV* * lowerBound/100, and the upper bound cut ubc* with ubc* = TV* * upperBound/100, for both positive and negative sets;
Step 4: Create two extra vectors Votes+ and Votes−, of length n1 and n2, and initialize all their elements to 0; these will hold the votes received by sets A+ and A− respectively;
Step 5: Calculate the distance matrix D = (d_ij) of size n1 × n2 from each data point of A+ to all data points of A−;
Step 6: Copy and transpose the matrix D into matrix D'; this matrix will store the distance from each data point of A− to all data points of A+;
Step 7: Sort each row of D while keeping track of the original column index of each distance. These indexes correspond to the points in the negative set on which the votes will be cast;
Step 8: Sort each row of D' while keeping track of the original column index of each distance. These indexes correspond to the points in the positive set on which the votes will be cast;
Step 9: Use the indexes of the first k distances of each row of D to cast the votes. This is done with Votes−[index] = 1 + Votes−[index];
Step 10: Use the indexes of the first k distances of each row of D' to cast the votes. This is done with Votes+[index] = 1 + Votes+[index];
Step 11: Sort both Votes* vectors while keeping track of their indexes and create a running sum array RS* for each;
Step 12: Find the minimum number of votes, minVotes*, for both vectors by looking at the running sum vectors: if RS*[x − 1] < lbc* and RS*[x] > lbc*, then minVotes* is the number of votes of the point at position x;
Step 13: Find the maximum number of votes, maxVotes*, for both vectors the same way: if RS*[x − 1] < ubc* and RS*[x] > ubc*, then maxVotes* is the number of votes of the point at position x;
Step 14: Select all points in the positive set that received between minVotes+ and maxVotes+ votes and add them to the new training set A';
Step 15: Select all points in the negative set that received between minVotes− and maxVotes− votes and add them to the new training set A'.
This algorithm can also be modified to implement the incremental aspect of KNN-ISVM, for when the data comes in batches or when the data is too large to be analyzed in one pass.
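The voting core of the algorithm can be sketched as below (steps 5-10 plus a simplified selection; here the vote bounds minVotes/maxVotes are passed in directly rather than derived from lowerBound/upperBound via the running sums of steps 11-13, so this is a sketch of the idea, not the full algorithm):

```python
def vnn_select(A_pos, A_neg, k, min_votes, max_votes):
    """Simplified VNN-SVM sketch: every point casts one vote on each of
    its k nearest neighbours in the opposite class; only points whose
    vote count lies in [min_votes, max_votes] are kept."""
    d2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def cast_votes(src, dst):
        votes = [0] * len(dst)
        for p in src:  # steps 5-10: k nearest opposite-class points get a vote
            for j in sorted(range(len(dst)), key=lambda j: d2(p, dst[j]))[:k]:
                votes[j] += 1
        return votes

    v_neg = cast_votes(A_pos, A_neg)   # positives vote on negatives
    v_pos = cast_votes(A_neg, A_pos)   # negatives vote on positives
    keep = lambda pts, v: [p for p, n in zip(pts, v) if min_votes <= n <= max_votes]
    return keep(A_pos, v_pos), keep(A_neg, v_neg)

A_pos = [(0, 0), (1, 0), (2, 0)]
A_neg = [(3, 0), (4, 0), (9, 9)]
print(vnn_select(A_pos, A_neg, k=1, min_votes=1, max_votes=3))
# -> ([(2, 0)], [(3, 0)]): only the two facing border points collect votes
```

The far-away negative point (9, 9) receives no votes and is dropped, while a lower `max_votes` would additionally trim a point that attracts an outlier-like concentration of votes.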
The goal of the voting system is to remove low-voted points, which are probably near outliers or not really close to the border, and to remove high-voted points, which may be outliers inside the other class's area of influence. The removal of these outliers may make the hyperplane created by the SVM more generalized, as it won't have to take into consideration the "pull" created by these points on the soft margin. In the example, the highlighted areas contain outliers; they represent the highest voted data points. Next to them we find the lowest voted data points; these are the points selected by the outliers, as each outlier votes for k low-relevance data points. All of those would be selected by the previous algorithms, but the proposed algorithm should be able to skip those points.
Apart from the special case of outliers, the voting system will also help in minimizing the size of the training data. In Figure 9 we have another example of the voting system (stopping just after step 10 of the algorithm) applied to an XOR dataset.

Implementation
Neither the KNN-SVM [1] nor the KNN-ISVM [2] paper was specific about how its algorithm was implemented, something to be expected given the short scope of a paper. When testing those algorithms I implemented them in R for fast prototyping and better visualization of the results. In those tests, when dealing with a large amount of data, the kNN would sometimes take more time than the SVM using the full dataset, which partially defeats the purpose of pruning points for a faster classification. The discussion of the implementation will be broken up into the different sections that make up the algorithm: Distance Matrix Calculation, KNN, Interval Calculation and Support Vector Candidates Selection.

Distance Matrix
One of the first decisions made when calculating the distances is that there is no need to use the standard Euclidean distance shown in equation 10; it is just as effective to calculate the square of the distance and compare those values. This way there is no need to compute the square root of every distance, and it does not impact the algorithm because if x < y then √x < √y, so the relative ordering of the points stays the same.
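As a quick illustration (not part of the VNN-SVM code itself), a few lines of Python confirm that ranking neighbors by squared distance gives the same ordering as ranking by true Euclidean distance:

```python
import math

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (6.0, 8.0)]
query = (0.0, 0.0)

def dist_sq(a, b):
    # squared Euclidean distance: no square root needed
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def dist(a, b):
    return math.sqrt(dist_sq(a, b))

order_sq = sorted(range(len(points)), key=lambda i: dist_sq(query, points[i]))
order_true = sorted(range(len(points)), key=lambda i: dist(query, points[i]))
# both orderings agree, because sqrt is monotonically increasing
```

On the GPU this saves one square root per matrix entry, which adds up when the distance matrix has millions of elements.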
The distance matrix calculation is one of the more computationally intensive elements of VNN-SVM. But the challenge of calculating the distance matrix on GPU was already studied by [3,4].
On the CPU, to calculate the distance between points x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) you would use equation 10. And if you needed the distance from every element of a set of points X to every element of a set of points Y, you would iterate over all elements of X and Y, saving each result in an entry of the distance matrix.
Although using equation 10 would be a perfectly normal way to compute the distances, it is not well suited for GPU programming; even with each core computing the sum for a single element, the constant memory accesses would erase most of the speed gain. So the distance calculation was changed to use only matrix operations instead of computing each element individually. This is done with the following equation: d(x, y)² = ‖x‖² + ‖y‖² − 2x′y, where ‖.‖ is the Euclidean norm, x′ is the transpose of x and x′y is the dot product between x and y. These Euclidean norms can be calculated easily with very simple CUDA kernels or with specific libraries like Thrust. I wrote both versions, but the one using the Thrust library outperforms the plain CUDA kernels and was used for all results.
The dot product is more interesting, as it is by far the most computationally expensive part of the calculation. VNN-SVM has to calculate the distances between the two matrices containing the positive and the negative cases, A+ and A− respectively. The dot product of points x_i and y_j can be found simply by computing the matrix multiplication M = A+ A−′ and selecting the element M_ij.
Writing specific kernels in CUDA for matrix multiplication is a hard task when trying to maximize GPU use, so VNN-SVM uses CUBLAS for matrix multiplication. CUBLAS is an implementation of the Basic Linear Algebra Subprograms (BLAS) library for CUDA. This library, originally created for FORTRAN, implements well-known linear algebra algorithms that make use of the full processing power of the system they are designed for.
The VNN-SVM distance matrix computation on GPU follows these steps: Step 1: Divide the training set A into a positive set A+ = {x+1, x+2, x+3, ..., x+n1} and a negative set A− = {x−1, x−2, x−3, ..., x−n2}, where n1 and n2 are the number of positive and negative examples in the training data respectively.
Step 2: Calculate the vectors S+ and S−, where S*i is the squared Euclidean norm of the i-th point of the corresponding set; Step 3: Create the matrix C(n1×n2), where Cij = S+i + S−j; Step 4: Use CUBLAS to calculate M = −2 * A−A+′ + C. On step 4 the order of A− and A+ is inverted because CUBLAS uses column-major input order; the change yields the resulting distance matrix in the correct row-major order.
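The steps above can be sketched on the CPU with NumPy; this is a reference sketch of the computation, not the CUDA/CUBLAS code itself (which must also account for the column-major layout mentioned above):

```python
import numpy as np

def squared_distance_matrix(A_pos, A_neg):
    # Step 2: squared Euclidean norm of every point in each set
    S_pos = np.sum(A_pos ** 2, axis=1)            # length n1
    S_neg = np.sum(A_neg ** 2, axis=1)            # length n2
    # Step 3: C[i, j] = S_pos[i] + S_neg[j]
    C = S_pos[:, None] + S_neg[None, :]           # shape (n1, n2)
    # Step 4: M = -2 * (A_pos . A_neg^T) + C gives the squared distances
    return -2.0 * (A_pos @ A_neg.T) + C

A_pos = np.array([[0.0, 0.0], [1.0, 1.0]])
A_neg = np.array([[3.0, 4.0]])
M = squared_distance_matrix(A_pos, A_neg)
# M[0, 0] is the squared distance from (0, 0) to (3, 4), i.e. 25
```

The single matrix product replaces the nested loop over all point pairs, which is exactly what makes the CUBLAS GEMM call so effective here.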

K-Nearest Neighbors -KNN
The distance matrix M contains the distances from all points of A+ to A−, with Mij corresponding to the distance between A+i and A−j. The kNNs can be found if the distance matrix elements (Mij) are used as the keys and the matrix indexes as the values of a sort by key. This is done not only for the convenience of reusing the same code, but to use CUDA to its fullest, as the speedup gained from CUDA comes from coalesced memory reads.
The Thrust library has an implementation of sort by key that is used by the algorithm. Thrust is tuned to work with CUDA vectors, not matrices, so its sorting algorithm had to be applied to every individual row. The sort by key will sort each row of M and Mᵀ while at the same time changing the order of a second vector that holds the values 1 to j for A+ and 1 to i for A−. Those second vectors form the new matrices Index+ and Index−, containing the indexes of the kNNs of the elements of A+ and A− respectively.
With Index + and Index − created the last step of the kNN is casting the votes. The votes will be saved in 2 different vectors V + and V − of sizes i and j respectively, both initialized with zeros. To tally the votes a small CUDA kernel was written, each thread of the kernel will look at an index value iv in the Index * vector and add 1 to V * [iv].
To assure a correct result, atomic operators were used in the kernel; this way, if multiple threads are adding to the same element, they have to wait in a queue. This is the only part of the code outside the libraries that has any thread contention. The maximum contention can be estimated by looking at k and at j, the number of elements in the other class: when all elements vote for the same k points, the kernel will have k queues of j threads.
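A CPU sketch of the sorting-and-voting step may help; `cast_votes` below is a hypothetical stand-in that mirrors, for one class, what the Thrust sort by key plus the atomic voting kernel compute:

```python
import numpy as np

def cast_votes(M, k):
    # one vote counter per point of the other class (the matrix columns)
    votes = np.zeros(M.shape[1], dtype=int)
    for row in M:
        nearest = np.argsort(row)[:k]   # indexes of the k smallest distances
        for idx in nearest:
            votes[idx] += 1             # the CUDA kernel does this with an atomic add
    return votes

M = np.array([[4.0, 1.0, 9.0],
              [1.0, 4.0, 9.0]])
votes = cast_votes(M, k=2)
# columns 0 and 1 each collect 2 votes; column 2 is never among the 2 nearest
```

The same function applied to Mᵀ produces the votes for the opposite class.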
On the original KNN-SVM algorithm this would be the last step: all points with votes would be selected as Support Vector Candidates, with no need to count the votes, just to find which points received any.

Interval Calculation
Now with the votes vectors V + and V − filled, the algorithm will calculate the minimum and maximum amount of votes a point needs to be selected as an SVC.
Four new vectors are created: two index vectors I+ and I− used to hold the indexes of each point of V+ and V−, and two running sum vectors RS+ and RS− that will be used to find the interval.
The Thrust sort by key algorithm is used again to sort V+ and V− while also sorting the two new index vectors, so V+ and V− end up ordered from least to most voted points while I+ and I− contain the corresponding indexes of the points in the original dataset.
After that the algorithm will use the inclusive scan method of the Thrust library to calculate the running sum of each of the votes vectors V + and V − and save in RS + and RS − respectively. The last element of each RS * vector will contain the total number of votes cast by its respective V * vector.
Two more variables are needed to calculate the interval, lbd (lower bound) and ubd (upper bound); these variables are integers between 0 and 100 with 0 ≤ lbd < ubd ≤ 100. They represent the percentages of the total number of votes that will be used when selecting the Support Vector Candidates.
With all vectors populated and lbd and ubd selected, the algorithm can now find the minimum (minVotes) and maximum (maxVotes) number of votes needed for a point to be a possible SVC, for both the positive and negative sets. For each element x of the RS vector the following is done in parallel: Step 1: Get the total number of votes by selecting the last element of RS* and save it in TV*.
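A hedged CPU reference of the whole interval calculation follows, assuming the thresholds are the vote counts at the positions where the running sum first reaches lbc = TV * lbd/100 and ubc = TV * ubd/100 (the exact tie handling at the bounds is my assumption, not spelled out in the text):

```python
import numpy as np

def vote_interval(votes, lbd, ubd):
    sorted_votes = np.sort(np.asarray(votes))      # least to most voted
    RS = np.cumsum(sorted_votes)                   # inclusive scan (running sum)
    total = RS[-1]                                 # Step 1: total votes TV
    lbc = total * lbd / 100.0
    ubc = total * ubd / 100.0
    # the threshold is the vote count where RS first reaches each bound
    min_votes = sorted_votes[np.searchsorted(RS, lbc)]
    max_votes = sorted_votes[np.searchsorted(RS, ubc)]
    return int(min_votes), int(max_votes)

lo, hi = vote_interval([0, 1, 2, 3, 4], lbd=20, ubd=80)
# with 10 total votes, lbc = 2 and ubc = 8
```

The GPU version checks the RS*[x − 1] / RS*[x] crossing condition with one thread per element instead of the binary search used here.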

Support Vector Candidates Selection
With minVotes and maxVotes found for each set, the last step is the final selection of Support Vector Candidates. To do that in one parallel pass, two more boolean vectors are created, SVC+ and SVC−, with all values initialized to false.
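A minimal sketch of that final selection, with `select_svc` as a hypothetical stand-in for the CUDA kernel (the real implementation runs one thread per point writing into the preallocated false array):

```python
import numpy as np

def select_svc(votes, min_votes, max_votes):
    votes = np.asarray(votes)
    svc = np.zeros(len(votes), dtype=bool)              # all false initially
    svc[(votes >= min_votes) & (votes <= max_votes)] = True
    return svc

mask = select_svc([0, 3, 7, 12, 5], min_votes=2, max_votes=8)
# only the points with 3, 7 and 5 votes are kept as candidates
```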

Computation Overview
This section will go through an example computation of VNN-SVM on the Iris dataset, showing partial results at important steps of the algorithm for better understanding.
Because Iris has 3 different classes, the algorithm has to be applied to all possible pairs. All pairs containing Setosa are perfectly separable and therefore uninteresting, so we will overview the computation of the Virginica/Versicolor pair, as it is more interesting. More information about the dataset can be found in 4.2.1, but for the purposes of this section we need to know that both Virginica and Versicolor have 50 entries with 4 attributes each and that the pair is not linearly separable, with outliers in both classes.
I will not count as part of this algorithm the breaking of the original dataset into its smaller subsets containing only one class as this can be done at the same time as the usual data preparation (cleaning, normalization, etc).
So first the user needs to select a k, lowerBound (lbd) and upperBound (ubd).
In this example we will use k = 10 , lbd = 20 and ubd = 80, aiming to cast enough votes to find a border while removing some of the lower and higher voted points.
Before any calculations are made the first transfers of data from CPU to GPU will take place, that is when all points of both classes are copied to GPU memory.
This has to occur because the GPU and CPU don't share the same memory, so all information used by the GPU needs to be copied there first. Data created by the GPU doesn't need to be copied, but its space needs to be allocated by the algorithm before use. These memory copies were omitted from the description of the algorithm for brevity, but they all take computational time, affect the runtime of the algorithm, and are taken into consideration in section 4.5.
After copying the data to GPU it can finally start by calculating the Distance Matrix as shown in 3.2.1. To do that the algorithm will allocate the needed memory space for the distance matrix and other variables used in equation 11 and calculate the matrix M .
In our case the calculated distance matrix M has size 50 × 50. By looking at a row of M you find the distances between that element of Versicolor and all elements of Virginica. The algorithm then makes a copy of M and transposes it, calling it Mᵀ, so the rows of Mᵀ reflect the distances from an element of Virginica to all elements of Versicolor.
The reason one of the matrices has to be transposed is to get the best possible performance on the next step, where the algorithm sorts those rows to find the nearest neighbors of each element. Both M and Mᵀ are sorted row-wise while at the same time changing the position of the elements of an index vector to reflect which element each distance corresponds to; fortunately, Thrust contains a sort-by-key function made for exactly these cases.
A small example of this step can be found in figure 10; the first row corresponds to the distances between one point of one class and the first 8 points of a different class. Figure 11 shows the results of the sort-by-key algorithm, with the distances sorted and the index vector reordered accordingly. On the original kNN-SVM algorithm that would be the last step, and the points with no votes would be removed.
But the VNN-SVM takes one step further by removing points with very few or too many votes. To do this the algorithm uses the lowerBound and upperBound variables (20 and 80 respectively) selected at the beginning of the run, where the lower bound will be 500 * 20/100 = 100 votes and the upper bound will be 500 * 80/100 = 400 votes. With these values figured out, the algorithm selects all points whose running sum is between lowerBound and upperBound. Figure 13 shows graphically the selection for both Iris-Versicolor and Iris-Virginica.

Voting Nearest Neighbors 2 Pass(VNN-SVM 2 Pass)
The lowerBound and upperBound variables added to the Support Vector Candidate selection in VNN-SVM give the user the ability to be greedy in the candidate selection; for example, with the pair lowerBound = 75 and upperBound = 100, the algorithm picks just a small number of very highly voted points. But most of the greedy approaches selected SVCs that created appalling borders. Analyzing those results, we found that the selected points were composed mostly of outliers and of elements that would have been inside the margin had the SVM used the full data.
This result meant that a greedy selection won't always work for picking useful SVCs, but it is very useful for finding outliers or elements that are hard to classify and could lead to overfitting if used in the SVM. I decided to use this to implement a new version of the VNN that can be greedier about how many points it selects as SVCs by running the selection twice.
The first pass is used to remove those hard-to-classify points. This way the algorithm can do a second pass that is greedier than normal while still having a good chance of getting an acceptable margin. I called this algorithm Voting Nearest Neighbors 2 Pass (VNN-SVM 2 Pass). A run of this 2-pass version goes like this: Step 1: Select k1, lbd1 and ubd1; the bound selection should aim to be very greedy so it will select the elements very close to the border.
Step 2: Run a version of VNN like the one described in 3.1 on dataset A and find the subset SV C 1 .
Step 3: Instead of creating A′ by selecting all elements of SVC1, create A′ by selecting all elements of A except the ones appearing in SVC1; in mathematical notation, A′ = A \ SVC1. Step 4: Select k2, lbd2 and ubd2.
Step 5: Run VNN exactly as described in 3.1 on dataset A′ and find the subset SVC2.
Step 6: Create A′′ by selecting all elements of SVC2 as the possible support vector candidates.
Step 7: Find the separating hyperplane using SVM over A′′.
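The steps above can be summarized in a short sketch, assuming a `vnn` helper that performs one single-pass selection as in section 3.1 (the helper name and signature are placeholders, not the dissertation's actual code):

```python
def vnn_two_pass(A, vnn, k1, lbd1, ubd1, k2, lbd2, ubd2):
    # Pass 1 (greedy): flags outliers / hard-to-classify points
    svc1 = vnn(A, k1, lbd1, ubd1)
    # Keep everything EXCEPT the points flagged by pass 1 (A' = A \ SVC1)
    A_prime = [x for i, x in enumerate(A) if i not in svc1]
    # Pass 2: the usual selection, run on the cleaned data
    svc2 = vnn(A_prime, k2, lbd2, ubd2)
    return [A_prime[i] for i in svc2]

# exercising the control flow with a trivial stand-in that always
# flags the first point of whatever set it is given
demo = vnn_two_pass([10, 20, 30], lambda A, k, lbd, ubd: {0},
                    k1=1, lbd1=75, ubd1=100, k2=1, lbd2=0, ubd2=75)
# demo == [20]: 10 was removed in pass 1, then 20 was selected in pass 2
```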

Time complexity
One of the main goals of the VNN-SVM is to speed up the training of SVMs when working with Big Data. To do so, it is imperative that the VNN-SVM itself runs as fast as possible, which is why the algorithm was implemented in parallel using CUDA.
Two types of code are used in this implementation. The first is the user-created kernels: code written just for this algorithm and specific to this implementation, i.e., the CUDA kernels. The second is library code, specifically CUBLAS and Thrust: generic prebuilt code distributed with CUDA and highly optimized to maximize GPU use.
When analyzing the code we have perfect knowledge of the time complexity of the CUDA kernels created for this project, but we depend on the library documentation for the CUBLAS [5] and Thrust [6] functions.
The analysis will be divided in the same way the implementation was discussed.

Distance Matrix
As seen in section 3.2.1 the distance matrix is the result of the equation (12).
This is done in 3 steps, calculating the Euclidean Norm squared, creating the matrix C and finally using CUBLAS to calculate M = −2 * A − A + + C.
The first two steps are done using CUDA kernels with complexities of O(n) and O(n²) respectively. The matrix multiplication and addition are done using CUBLAS; based on the documentation, the algorithm used for this operation is specific to the system it runs on but is always based on the General Matrix Multiply (GEMM) [7], which has a complexity of O(n³) but is optimized to minimize memory accesses.

K-Nearest Neighbors
The steps of the kNN calculation are the transpose of the distance matrix, the sorting of the distances and the casting of the votes on the k neighbors.
The transpose uses a CUBLAS function but is straightforward, with a complexity of O(n). The sort is done using sort by key from the Thrust library; the algorithm implemented by that library is Radix Sort, with a complexity of O(b * n), where b is the number of bits required to represent the largest element of the array. But the sort needs to be applied to all points in the dataset, raising the complexity to O(b * n²). The voting uses a very simple kernel with a complexity of O(n).

Interval Calculation
The steps that make up the interval calculation are the sorting of the votes, the running sum, and finding the interval.
As stated before sorting has a complexity of O(b * n) and this time it will be done just once, keeping the complexity as it is. The running sum is also using the Thrust library, more specifically the inclusive scan function, this is also a straightforward implementation that has a complexity of O(n). Finding the Interval doesn't use any libraries and its kernel has a complexity of O(n).

Support Vector Candidates Selection
The last step of the VNN-SVM consists of doing the final selection, this is also done with a simple kernel that has a complexity of O(n).

Parallelism
The time complexity analysis of a CUDA algorithm is different from that of a normal algorithm because it is tied not only to the algorithm being used, but also to how parallel its execution can be.
Table 3: Evolution of processor units on NVidia graphics cards
If you have a problem that is being solved with an O(n²) algorithm that can run fully in parallel, then you can compute it at a speed comparable to an O(n) algorithm if you can run it on n CUDA cores at the same time.
We can see the evolution of CUDA cores in table 3: in less than 10 years the number of cores grew almost tenfold, with the accompanying architecture also improving overall performance. Nevertheless, these numbers are not enough when compared with the size of Big Data, where 5000 points constitutes a small problem. So, for now, commercial-grade hardware will not have enough cores to change the time complexity from O(n²) to O(n). But it can make your O(n²) algorithm more than 1000 times faster by running it on as many cores as possible.
The VNN-SVM was programmed to do as much of the computation in parallel as possible. The only part of the code with contention is the casting of the votes, where, to compute the right tally, a queue is created any time more than one core wants to cast a vote for the same point. This means that all other time complexities are sped up by a factor of c, where c is the number of cores available on the hardware being used. Taking that into consideration, we can express the time complexity of the implemented VNN-SVM, in terms of the input size n and the number of CUDA cores c, as O(n³/c).

Methodology
The VNN-SVM has 3 variables that define a run of the algorithm: k, the number of votes each entry casts on the other class; lbd, the percentage of the total votes that sets the minimum number of votes a point must receive to be selected as a possible SVC; and ubd, the percentage of the total votes that sets the maximum number of votes used to select possible SVCs.
When discussing the results it will be easier to reference the selected bounds as a pair; because of that, the notation lbd ubd will be used henceforth to denote a specific pair.
A grid search will be done to analyze the impact of k, lbd and ubd and determine their best values. I decided to handpick the lbd ubd pairs to cover the most interesting cases. From very greedy bounds, ones that have a small range and larger values like the pair 75 100, to more conservative bounds, ones with a large range or that includes low voted points like the pair 0 75.
To study which values give the best results, each test will be analyzed using these 4 metrics: 1. Size of the border vector: the number of SVCs selected by the Voting Nearest Neighbors; we want to train the SVM using as few points as possible to make it fast. 2. Run-time: the time taken by the Voting Nearest Neighbors to create and select a new border vector and to run the SVM. I divided the total runtime of VNN-SVM into these 2 items:
• VNNRuntime: The total time taken by the VNN-SVM excluding the time it takes to train the SVM ; • SVMRunTime: The time it takes to run the SVM on the SVC dataset.
The TotalRuntime measured in testing is the sum of VNNRuntime and SVMRunTime; it will be compared to the time it takes to run the SVM using the full training data, and to be considered successful it should always be smaller than the original SVM's. This runtime will be used to calculate the speedup.
3. Accuracy: the final accuracy of the SVM trained. A small loss in accuracy is acceptable, but the main goal is an accuracy as good as or better than the original SVM's. This accuracy will be calculated by testing the margin created by the SVM trained on the selected SVCs against the original dataset and against validation datasets when available. 4. Control: a normal SVM will be trained with the full datasets; these will be used as the control to which the VNN-SVM will be compared.
All tests done with VNN-SVM will also be performed with the original KNN-SVM algorithm by running VNN-SVM with the 0 100 pair. Because it uses the same algorithm, I don't expect a big difference in runtime between the original KNN-SVM and VNN-SVM, but it will be interesting to see how much the bound variables can change the SVC selection.
To show that the selection of the SVCs is indeed helping to create the best SVM possible, I will compare its results against a random test. This test consists of training SVMs using the same number of points as the number of SVCs of that particular run, but with the points sampled randomly. I will collect the average accuracy of these tests as well as the best and worst individual accuracies of the random SVMs.

Datasets
The VNN-SVM algorithm was built to be generic enough to be used on any type of dataset, finding adequate subsets that produce margins as good as those obtained from all the data, but the added computation might make it irrelevant on smaller datasets. For that reason I chose 5 different datasets, from very small to very large, so the effect and performance of the algorithm could be studied on a broad range of datasets. Here are the descriptions of the datasets used:

Iris Dataset
The Iris dataset is one of the most recognizable machine learning datasets in existence, often used to compare the performance of different techniques. It is a biology dataset first published in 1936 by Ronald Fisher in his paper The use of multiple measurements in taxonomic problems and reproduced many times since. The version used here was downloaded from the UCI machine learning repository. The Iris dataset consists of 150 points equally divided into 3 classes representing 3 different iris species: Iris-Setosa, Iris-Versicolor and Iris-Virginica. Each point has 4 different attributes: 2 refer to petal size and 2 to sepal size.
Although this dataset is not representative of the type of datasets the algorithm was designed for, it is nevertheless interesting to test how the algorithm performs on a well-known dataset.
Because this dataset is so small it is expected that running VNN-SVM might take more time than just running an SVM with the full dataset. So this test will focus more on the accuracy of the new pruned dataset versus the full dataset.

Wisconsin Breast Cancer Dataset
The Wisconsin Breast Cancer Dataset [2] is also well known in machine learning. It is another biology dataset that consists of 569 points divided into 2 classes, 357 benign tumors and 212 malignant tumors. Each point has 30 attributes, all of them real values related to the cell nucleus.
As with the Iris dataset, the Wisconsin Breast Cancer dataset is not representative of the type of data the VNN-SVM was designed to help with, but it is a step between the small Iris dataset and datasets approaching big data, and it provides another point at which to test the algorithm.

Gisette Dataset
This is the first of the datasets that I will treat as big data. The Gisette dataset [3] is an example of the handwritten digit recognition problem. It contains 6000 points divided equally into 2 classes: 3000 points representing the digit '4' and 3000 points representing the digit '9'. This dataset was selected because of an interesting characteristic: it has not only a large number of data points but also a large number of attributes, 5000 to be precise.
This large number of attributes is not natural, as the dataset was tailored for a feature selection challenge. Distractor features with no predictive power were added, pixels were sampled at random from the region containing the information necessary to disambiguate 4 from 9, and higher-order features were created as products of these pixels to plunge the problem into a higher-dimensional feature space.
To test the performance of the VNN-SVM algorithm when dealing with high dimensionality problems no feature selection was done to this dataset as I wanted to test it as it is.

Kepler Exoplanet Dataset
The Kepler Exoplanet Dataset [4] was created by NASA and is operated by the California Institute of Technology. It is an online astronomical catalog collating information on exoplanets and their host stars. There is information on 9564 objects divided into 3 classes: 2283 confirmed exoplanets, 2158 exoplanet candidates and 4544 false positives. It contains over 150 attributes divided into 3 main types: • Exoplanet attributes: such as orbital parameters and masses; • Host star attributes: such as temperatures, positions and magnitudes; • Discovery attributes: such as published radial velocity curves, photometric light curves, images, and spectra.
Of those attributes I selected 51: 35 with information on the exoplanet and 16 with information on the host star. I removed all categorical attributes to keep the SVM simple; for the numerical attributes, I kept the ones that describe the exoplanet or star in layman's terms, like mass, temperature, orbit, distance, etc.
This will be a good example of how the algorithm will perform when the datasets have more than 2 classes.

Air Pressure system (APS ) Failure and Operational Data for Scania Trucks Dataset
The Air Pressure system Failure and Operational Data for Scania Trucks Dataset [5] was created by the manufacturer Scania AB for the Industrial Challenge 2016 at The 15th International Symposium on Intelligent Data Analysis (IDA).
The dataset contains 60000 elements describing system failures. More interestingly, the dataset is very unbalanced, with 1000 elements belonging to the positive class, where the failure is related to a specific component of the APS, and 59000 elements belonging to the negative class, where the failure is not related to the APS. Each element has 170 attributes, but their names and descriptions were anonymized for proprietary reasons before the data was released for the challenge.
This is the biggest dataset tested and is where I expect the results of VNN-SVM to be most expressive.

Data preprocessing and algorithm modification
The preparation of data before its use is an intrinsic part of the process of data classification and its results can have profound impact on the classifier final results. This is true for most machine learning algorithms and as such should be taken into consideration when proposing any new technique. This section will overview the effect and propose ways to deal with Missing Values, Data Normalization and Unbalanced Data when working with VNN-SVM.

Missing Values
Missing values are a very common occurrence in data science; they arise when one or more variables have no value stored for an observation. This problem can occur at any step of the data collection process and has many causes: faulty sensors, problems in transmission, no response to a survey, illegible handwriting, etc.
Because of the intrinsic importance of the distance function (explained in section 3.2.1) to the VNN-SVM, any missing value makes the calculation impossible. To deal with this problem there are 2 recommended approaches: Partial Deletion and Imputation.
• Partial Deletion: Partial Deletion is the act of removing just the entries that have missing data. The more common case of partial deletion is having to remove just the specific entries that are missing one or more values, but if the problem that generated the missing values were specific to a single column on multiple entries then it is better to just remove the attribute missing several values and keep all entries. The Breast Cancer dataset being used is an already preprocessed version of a bigger dataset that originally had 699 entries, but was reduced to 569 by removing points with missing values.
• Imputation: Sometimes removing all entries with missing values loses too many points. The APS dataset is an example of this case: most of its points are missing at least one attribute value, and the missing values are well distributed across many different attributes, so removing a few attributes will not help. For cases like this the only solution is to replace the empty values with new ones; this technique is called imputation.
The imputation solution selected for VNN-SVM is to assign the mean of an attribute to all points missing that value; this way no bias is created on the specific attribute. This was the solution used on the APS dataset.
When using any of these techniques it is important to apply the same changes to points you need to classify in the future. To do so it is necessary to know which columns were removed in the case of partial deletion, and to keep a record of the mean of every attribute in case imputation is needed for a new point.
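A small sketch of the mean-imputation strategy, keeping the column means so future points can be imputed identically (a reference sketch, not the dissertation's preprocessing code):

```python
import numpy as np

def fit_impute(X):
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)           # per-attribute means, ignoring NaNs
    filled = np.where(np.isnan(X), col_means, X)
    return filled, col_means                    # keep col_means for future points

X = [[1.0, float("nan")],
     [3.0, 4.0]]
filled, means = fit_impute(X)
# the missing value is replaced by 4.0, the mean of its column
```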

Data Normalization
Any machine learning technique that uses distance between points to do the classification, like kNN and linear SVM s, assumes that the range of the variables are the same or at least close to each other. This is because if you have a variable with a bigger range than the rest of the other variables, then that big range variable will have a disproportional impact on the distance between points and consequently it will also have a big impact on classification.
Data Normalization is the mechanism that changes the ranges of all variables to the same size. The naive way of doing this is to map the minimum and maximum values of each attribute to −1 and 1 respectively and scale all values in between into this new range. The problem with that solution is that if just one value is skewed far outside the normal range of the attribute, the naive normalization will crowd most points into a narrow part of the range, with only the outlier sitting at the maximum or minimum and no points near it, recreating the problem of mismatched ranges.
For that reason I recommend that Soft Normalization be used. With this technique the majority of the points are arranged into a common range, but any outlier, while scaled, will continue to be an outlier. Soft normalization is done as follows: Step 1: Given the dataset A with n1 rows and n2 columns; Step 2: For each column cj in A; Step 3: Calculate the mean (µj) of cj; Step 4: Calculate the standard deviation (σj) of cj, where σj² = Σi (xij − µj)² / (n1 − 1); Step 5: Substitute each point x of cj following the formula x′ = (x − µj)/σj. It is important to note that the user needs to save the mean and standard deviation of each column, because this normalization must also be applied, using these original values, to all points that later need to be classified; this is done by repeating Step 5 on all new points.
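The steps amount to the familiar z-score transform; here is a CPU sketch, assuming Step 5 uses the standard x′ = (x − µ)/σ substitution:

```python
import numpy as np

def soft_normalize(A):
    A = np.array(A, dtype=float)
    mu = A.mean(axis=0)                 # Step 3: column means
    sigma = A.std(axis=0, ddof=1)       # Step 4: sample standard deviation (n1 - 1)
    return (A - mu) / sigma, mu, sigma  # keep mu and sigma for future points

A = [[1.0], [2.0], [3.0]]
normalized, mu, sigma = soft_normalize(A)
# mu = [2.0], sigma = [1.0], normalized column = [-1, 0, 1]
```

New points are normalized with the saved mu and sigma, never with statistics recomputed from the new data.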
Both the APS and Kepler Exoplanet datasets had their variables normalized, as at least one attribute contained literally astronomical numbers sitting next to much smaller ones, like the orbital period, which is measured in mere days.

Unbalanced data
Unbalanced data is a common occurrence in data classification and happens when one class has many more instances than another. With these datasets it is very easy to build classifiers with good accuracy that, in the end, just classify all incoming data as the majority class. Because of that, in these cases the classification of the minority class is usually more important than the overall classification.
In our tests the Breast Cancer, Kepler Exoplanet and APS datasets all had unbalanced class sizes, but only the APS could be considered a truly unbalanced dataset, as the ratio between the positive and negative classes was 1/59, while in the other datasets the worst case was around 1/2.
Because of this extreme difference between majority and minority class I decided to change the VNN-SVM for these types of datasets to guarantee that the minority classes would be well represented in the selection of the SVC s.
To do so, the algorithm was changed so that no votes are cast for the minority class, and by default the full minority class is selected as SVCs.
The algorithm still casts votes for the majority class as usual to select its possible SVCs, which are then added to all points of the minority class to form the new dataset used to find the border.
With this change the margin to be created will be more favorable to fairly represent the minority class as the user can select k, lbd and ubd to make the SVC more balanced.
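The modification described above can be sketched as follows. The function name and inputs are mine, as is the reading of the lbd/ubd bounds as percentages of the maximum vote count, and the assumption (carried over from the kNN-SVM baseline) that only majority points with at least one vote are candidates:

```python
import numpy as np

def select_svc_unbalanced(maj_votes, n_minority, lbd, ubd):
    """Sketch of the unbalanced-data modification.

    maj_votes : vote count received by each majority-class point.
    lbd, ubd  : bounds as percentages of the maximum vote count
                (my reading of the "lbd ubd" pairs).
    Every minority point (indices 0 .. n_minority-1) is kept by
    default, with no voting; majority points are kept only if they
    were voted for and their count falls inside the bounds.
    """
    top = maj_votes.max()
    lo, hi = top * lbd / 100.0, top * ubd / 100.0
    voted = maj_votes >= 1  # assumption: zero-vote points are never candidates
    kept_majority = np.where(voted & (maj_votes >= lo) & (maj_votes <= hi))[0]
    kept_minority = np.arange(n_minority)  # full minority class
    return kept_minority, kept_majority

# Hypothetical run: 5 majority points, 2 minority points, bounds 0 50.
mins, majs = select_svc_unbalanced(np.array([0, 1, 5, 10, 3]),
                                   n_minority=2, lbd=0, ubd=50)
```

Tuning lbd and ubd then controls how many majority points join the full minority class in the training subset.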
For the tests on the APS dataset not only the overall accuracy will be reported; the accuracies on the minority and majority classes will also be analyzed individually.

Multiclass data
The algorithm handles classification with more than 2 classes by using the One-vs-One approach, meaning that the full algorithm is applied pairwise between all classes present in the dataset. This guarantees that elements of all classes are selected for the SVC subset.
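A minimal sketch of this One-vs-One wiring, with the pairwise voting and selection step abstracted behind a callback (the names are mine, for illustration only):

```python
from itertools import combinations

def one_vs_one_selection(labels, select_pairwise):
    """Apply a pairwise SVC-selection routine between every pair of
    classes, as in the One-vs-One scheme, and take the union so that
    every class contributes points to the final SVC subset.

    select_pairwise(a, b) -> set of selected point identifiers for
    the class pair (a, b); it stands in for the full VNN voting step.
    """
    selected = set()
    for a, b in combinations(sorted(set(labels)), 2):
        selected |= select_pairwise(a, b)
    return selected
```

With three classes this runs the selection on the pairs (0, 1), (0, 2) and (1, 2) and merges the results.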

Resources Used
The VNN-SVM and all its auxiliary functions were written and compiled

Results
This section will go over the results of the VNN-SVM algorithm on the datasets described in section 4.2. The results are shown in tables that may contain all or some of the following columns as needed:
• VNNAcc: Accuracy of VNN-SVM ;
• RandAcc: Average accuracy of random SVM s created by sampling the same number of points as the number of SVC s selected on this run;
• №SVC: Number of SVC s selected by the VNN-SVM ;
• №SV VNN: Number of support vectors of the SVM created using the SVC s;
• №SV Rand: Average number of support vectors of the random SVM s;
• PositiveAcc: Accuracy of VNN-SVM on the positive class (minority class);
• NegativeAcc: Accuracy of VNN-SVM on the negative class (majority class);
• PositiveAccRand: Average accuracy of the 20 random SVM s on the positive class (minority class);
• NegativeAccRand: Average accuracy of the 20 random SVM s on the negative class (majority class).
Depending on the dataset, the tables may also contain some of the following metrics:
• VNN-SVM Accuracy: Accuracy of VNN-SVM on the full training data;
• Rand Accuracy: Average accuracy of the random SVM s on the full training data;
• Min Rand Accuracy: Minimum accuracy achieved by the random SVM s;
• Max Rand Accuracy: Maximum accuracy achieved by the random SVM s;
• Validation Accuracy: Accuracy of VNN-SVM on the validation data;
• Rand Validation Accuracy: Average accuracy of the random SVM s on the validation data;
• № SVC: Number of SVC s selected by the VNN-SVM ;
• № SV VNN: Number of support vectors of the SVM created using the SVC s;
• № SV Rand: Average number of support vectors of the random SVM s.
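For illustration, the random baseline described above might be computed as follows, here using scikit-learn's SVC in place of the dissertation's own SVM setup; the kernel settings and toy data are placeholders, not the ones used for any particular dataset:

```python
import numpy as np
from sklearn.svm import SVC

def random_baseline_accuracy(X, y, n_svc, n_runs=20, seed=0):
    """Train n_runs SVMs, each on a random sample of n_svc points
    (the same size as the SVC subset), and report the average, min
    and max accuracy on the full training data."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=n_svc, replace=False)
        clf = SVC(kernel="linear", C=0.1).fit(X[idx], y[idx])
        accs.append(clf.score(X, y))
    return float(np.mean(accs)), float(np.min(accs)), float(np.max(accs))

# Toy data: two well-separated clusters of 20 points each; sampling
# 25 of the 40 points guarantees both classes appear in every run.
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 5.0)])
y = np.array([0] * 20 + [1] * 20)
mean_acc, min_acc, max_acc = random_baseline_accuracy(X, y, n_svc=25)
```

Reporting the minimum and maximum alongside the average shows how much the random sampling can swing, which is what the Min/Max Rand Accuracy columns capture.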

Iris Dataset
Results of an SVM created using the full Iris dataset are displayed in Table 4. The border was found using a linear kernel with cost = 0.1. Of the 150 points that make up the Iris dataset, 68 are used as support vectors to create a margin with 97.3% accuracy. Table 5 has the results of VNN-SVM when k = 10 using a linear kernel with cost = 0.1.
The first thing to notice in these results is that the random tests have a big range between minimum and maximum accuracy in almost all cases, except when the number of SVC s is greater than 69. From that we gather that by sampling around half of the dataset you have a good chance of getting accuracies above 90%. But even in these cases the border created by VNN-SVM achieved a better accuracy than the average of all random cases.
The VNN-SVM was successful in selecting viable SVC s to train the SVM, in most cases having its best performance on the more conservative bound selections like the original kNN-SVM, 0 25, 10 50, 10 75, 10 90 and 10 100. More interesting are the results for 0 50 and 0 75 which, by removing the points with the most votes, increased the accuracy to 98%, better than the original SVM. This means that VNN-SVM was able to create a more generic decision surface using fewer points than the full dataset.
In this run the very greedy bound pairs also did a good job of creating the decision surface, as seen with the results of 75 100, 80 100 and 90 100, with accuracies all over 90% using just 20 and 21 points. Interestingly enough, the 50 100 and 66 100 bounds were where the algorithm performed the worst, with accuracies as low as the worst random accuracies. If we look at the number of SVC s selected, we see that the bound pairs 66 100 and 50 100 added 3 and 10 more points, respectively, than the good cases that come after them. With that we can infer that those extra points were enough to shift the margin to that less desirable configuration.
This is reinforced when you look at the number of support vectors, which rose from 17 in the good surfaces to 19 and 24 in the bad ones. But even with the success of the greedy boundaries in this run with k = 10, the greedy approach is not really recommended, as it had an inconsistent performance with the other ks tested, as shown in Table 6 (greedy bounds selection performance). The time it takes to run the SVM went from 0.026 seconds using the full dataset to just 0.0021 seconds using the subset of SVC s, as seen in Figure 14a, a speedup of 12x. But unfortunately, as expected, the TotalRuntime of the VNN-SVM was much greater than just running the SVM with the full dataset. In Figure 14b we can see the comparison between runtimes, with the VNN-SVM taking 0.58 seconds to run by itself. This happens because CUDA adds a big overhead to computation, especially when transferring data between CPU and GPU and vice-versa.

Wisconsin Breast Cancer Dataset
Results of an SVM created using the full Wisconsin Breast Cancer Dataset are displayed in Table 7; the border was found using a linear kernel. From these results we can infer some facts about the dataset. We know that the data is not perfectly divisible, as even an SVM created using the full dataset does not reach 100% accuracy. But the classes' "shape" must be well defined for most points, because a good accuracy can be achieved with an SVM created using just a few random points for training.
These facts led to the development of the VNN-SVM 2 pass, where, by removing the border in a greedy first pass and voting on the remaining points in the second pass, the algorithm can select few but good points that define the shape of the classes and find a smaller, more generalized margin.
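A rough sketch of the 2 pass idea, with the kNN voting step abstracted behind a callback; the function names are mine, and the treatment of the bounds as percentages of the maximum vote count is my reading of the "lbd ubd" notation:

```python
import numpy as np

def vnn_two_pass(vote_fn, n_points, pass1_bounds, pass2_bounds):
    """Sketch of the VNN-SVM 2 pass variant.

    vote_fn(active) -> vote counts for the points whose indices are
    in `active` (stands in for the kNN voting step). Pass 1 removes
    the points whose counts fall inside pass1_bounds (stripping the
    border); pass 2 re-votes on the survivors and keeps the points
    inside pass2_bounds.
    """
    def in_bounds(votes, lbd, ubd):
        top = votes.max()
        # assumption: only points with at least one vote are candidates
        return (votes >= top * lbd / 100) & (votes <= top * ubd / 100) & (votes >= 1)

    active = np.arange(n_points)
    votes1 = vote_fn(active)
    active = active[~in_bounds(votes1, *pass1_bounds)]   # pass 1: remove
    votes2 = vote_fn(active)
    return active[in_bounds(votes2, *pass2_bounds)]      # pass 2: select

# Hypothetical vote table: point i received i votes in each pass.
table = np.array([0, 1, 2, 3, 4, 5])
picked = vnn_two_pass(lambda active: table[active], 6, (80, 100), (0, 100))
```

Here the first pass strips the most voted (border) points and the second pass selects from what remains, mirroring the two stages described above.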

Gisette Dataset
Results of an SVM created using the full Gisette Dataset are displayed in Table 10. The border was found using a linear kernel with cost = 0.1.
In Table 11 we find the results for VNN-SVM with k = 2. We can observe that the best accuracies occurred when using conservative bounds, with over 97% accuracy, but at the cost of selecting a large share of the points. Because of that I decided to test this dataset with the 2 pass algorithm.
As pointed out above, even small ks can select more than half of the dataset depending on the bounds, so I decided to keep k = 1 on the first pass so it would not remove too much of the dataset. To control what would be removed I varied the bounds; these were the bounds tested:
• 0 100: This test tries to remove enough of the margin that the second pass selects from the general population of each class; this is the test that removes the most points.
• 0 75: This test tries to remove enough of the margin that the second pass selects from the general population of each class, while leaving some of the most voted points in place to see if the accuracy increases when those points are taken into consideration for the border;
• 30 85: This is a middle ground between the previous 2, leaving the least and most voted points in place. This is the test that removes the least amount of points in the first pass.

Kepler Dataset
The Kepler dataset is the first not to use a linear kernel; the best SVM was created using a radial kernel with γ = 0.3 and cost = 4.1, and the results of this SVM are found in Table 16. One point that makes the Kepler dataset different from the ones tested before is that the SVM uses 7029 points as support vectors, almost 80% of the dataset.
It is hard to expect that VNN-SVM will be able to remove as many points as it has for the previous datasets, but it should be able to select the best points. Reducing the number of points to just 62% while keeping a good accuracy is an impressive feat, but it was not enough to save time on the SVM, as seen in Figure 17a. The disparity between these results and the ones for the Gisette dataset is staggering, but can be explained: when comparing the SVM s we see that Gisette took 3 times longer to find the margin, because Gisette is the bigger dataset. The VNN-SVM was successful in selecting useful SVC s, as shown in Table 20.
We can see that the conservative bounds achieved an accuracy of over 90%, with the exception of the 0 25 bound, by selecting between 4000 and 4500 points as possible SVC s.
The new runtime looks better, but it still wasn't enough to save time. In Figure 18a we see that the SVM with the 2 class dataset takes about the same time as the 3 class dataset, and the SVMRuntime of kNN-SVM was 5.9 seconds, a little more than half of the time achieved when using the SVC s selected for 3 classes. The VNNRuntime was 83.37 seconds, about half of the VNNRuntime needed when using the 3 class dataset.

APS Failure and Operational Data for Scania Trucks Dataset
Results of an SVM created using the full APS dataset are displayed in Table 21. The border was found using a radial kernel with γ = 0.5 and cost = 1. But this accuracy came with a high cost in runtime, with one run of the SVM taking almost 15 minutes. Notice also that to create such a decision surface the algorithm selected 8170 support vectors, meaning that approximately 13% of the points are necessary to represent the border. In a dataset as big as the APS this means that the SVM created will take a substantial amount of space in memory and time to classify new points. When trying a more greedy selection of bounds the algorithm had its worst performance yet, but that was expected, as the more greedy approaches inverted the unbalance of the dataset in favor of the minority, having in some cases fewer than a hundred negative points for the thousand points of the positive class. So the SVM s created with those points ended up classifying almost everything as positive.
But the best outcome of these tests was without a doubt the runtime; the small changes made to the algorithm for unbalanced data explain this difference.

Conclusion
The objective of this dissertation has been the development and study of a data preprocessing method that selects the subset of points in a dataset with the best chance of being used as Support Vectors by a Support Vector Machine, while at the same time removing any possible outliers, in such a way that this new subset can be used to create an SVM faster than when using the whole data.
That culminated in the creation of the Voting Nearest Neighbors algorithm (VNN-SVM), an algorithm in which each point of every class of a dataset casts votes on members of a different class as possible Support Vector Candidates, and these votes determine which points are selected for the new subset.
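The voting step can be sketched as follows, assuming Euclidean distance and plain NumPy in place of the parallel CUDA implementation (the function name is mine):

```python
import numpy as np

def vnn_votes(class_a, class_b, k):
    """Every point of class A finds its k nearest neighbors in
    class B (Euclidean distance) and casts one vote for each of
    them. The returned array holds the vote count of every class-B
    point; bounds on these counts then decide which points become
    Support Vector Candidates."""
    votes = np.zeros(len(class_b), dtype=int)
    for p in class_a:
        d = np.linalg.norm(class_b - p, axis=1)  # distances from p to all of B
        for j in np.argsort(d)[:k]:              # indices of the k nearest
            votes[j] += 1
    return votes

# Tiny hypothetical example: each class-A point votes for its single
# nearest class-B neighbor; the far-away B point gets no votes.
a = np.array([[0.0, 0.0], [10.0, 10.0]])
b = np.array([[1.0, 0.0], [9.0, 10.0], [50.0, 50.0]])
votes_b = vnn_votes(a, b, k=1)
```

Points of B near the border between the classes accumulate votes, while outliers such as the third point receive none, which is exactly what makes the vote counts useful for selection.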

Goals Revisited
In this section the five goals listed in section 1.4 will be analyzed based on the results from the previous chapter.

First Goal
"Select a smaller set of the points for SVM training" By all accounts the algorithm was successful in doing so. In our tests the subset of support vector candidates selected by the VNN-SVM algorithm used between 5% and 65% of the original points, depending on the dataset tested, promoting significant speedups for the SVM training.

Second Goal
"Remove points far from the margin between classes" The VNN-SVM algorithm was designed as an improvement over the kNN-SVM, where all points with one or more votes were selected for training. This goal was created to test one of these improvements: the capacity of VNN-SVM to further prune the selected subset by removing points with very few votes.
While the proposed algorithm does have the ability to remove those points, the tests showed that they are needed more than expected to correctly classify the data: the decision surfaces created with sets where the lower bound was 20 or greater usually resulted in SVM s with significantly lower accuracies, if not ones completely missing what the "shape" of each class was supposed to be.
As seen in section 4., the least voted points carry the information needed to store the "shape" of the class being voted on.
For that reason, even though it is possible to remove the least voted points using VNN-SVM, the better accuracy achieved by the margins created when keeping those points outweighs any performance gain we could get from running the SVM s faster without them.
The VNN-SVM 2 pass algorithm was somewhat able to remove low voted points in its second pass while maintaining a good accuracy, because the first pass made the algorithm select just the points inside the class that represent its "shape". But that algorithm was not able to do so for all datasets tested, just for the linearly divisible ones like the Wisconsin Breast Cancer and Gisette datasets.

Third Goal
"Remove outliers based on how many votes they received " The pruning of outliers and highly voted points was much more successful than the removal of low voted points. When selecting an upper bound of 50 or higher while keeping a lower bound of 0, the VNN-SVM algorithm selected SVC sets with hundreds fewer points than the ones selected by kNN-SVM, with little to no drop in accuracy. In some cases, by removing those points, the decision surface achieved a better accuracy than the original.
The lower number of SVC s propagated to a smaller number of Support Vectors used to create the SVM decision surfaces, meaning faster times to evaluate new points and a smaller space taken in memory to hold the SVM s created.

Fourth Goal
"Be able to achieve a better generalization of the SVM decision surface by changing which points are selected " I knew this goal would be hard to achieve and very dependent on the datasets selected. When testing the kNN-SVM, some of the decision surfaces created were able to increase the overall accuracy very slightly, so one of the goals of the VNN-SVM was to see if the same could be replicated or improved when using the voting method.
The tests performed on the Iris dataset showed an accuracy increase on the 0 50 and 0 75 pairs going from 97.333% achieved by the full dataset to 98%, but I knew that these cases were more likely to occur on the small datasets as each point selected has more impact on the decision surface.
I was pleased to see that there is a chance of the same happening even when working with larger datasets. In section 4.5.4, discussing the results of the Kepler dataset, we can see both the kNN-SVM and VNN-SVM finding better decision surfaces, as shown in Table 18, and the same can be seen on the APS dataset. Although the difference between the accuracies is very small, it shows that the algorithm is capable of creating more generalized margins in specific cases.

Fifth Goal
"Run in a reasonable time"
This goal is the one that really defines the usefulness of the VNN-SVM when compared to just running the SVM and the reason why it was crucial to implement the algorithm in parallel for maximum speed.
When we look at the results of the 5 datasets studied it is easy to see why this preprocessing should be aimed at large datasets. The overhead added by the VNN-SVM to small datasets made the total runtime almost 140 times longer on the Wisconsin Breast Cancer dataset and 20 times longer on the Iris dataset. So if our goal is only to speed up training, this will not work for datasets of this size.
As datasets get larger this picture starts to change. Comparing the times of Kepler, Gisette and APS, we see the first still taking more time when running VNN-SVM and the last 2 having speedups of 1.3 times and 4.3 times, respectively, by finding the SVC s and calculating the decision surface using only that subset.
Somewhere between the size and complexity of the Kepler and Gisette datasets is the line where the VNN-SVM starts to save more time than it uses to create the SVC subset.
But the biggest time saving is without doubt when the method is used on unbalanced datasets, as shown with the APS case. Because of the way the algorithm was modified, the runtime of the VNN-SVM will be closer to the runtime on a dataset the size of the minority class, saving the maximum amount of time.

Future Work
There were some ideas that I wanted to test in this thesis but had to cut because of time constraints. Future work could investigate some of the following ideas: • Test the voting method with the SVM kernels instead of the nor-

Final Verdict
Overall the VNN-SVM algorithm performed as expected, being able to select, based on the number of votes, a subset of points that could be used to create decision surfaces with Support Vector Machines.
One of the more interesting outcomes of the tests was how dependent the VNN-SVM was on its low voted points to achieve high accuracies. When starting the project, these were points I was sure would be easy to remove from the SVC subset, but as more tests were done I started to realize how much they influenced the achievement of the best accuracy possible. That said, the tests discussed in section 4.5 focused on results with accuracy as close to the original as possible; if the user is willing to trade some accuracy for speed, smaller SVC subsets can definitely be achieved by raising the lbd of the run. Appendix A has more tests where we can see the effect of k and lbd on the subset selection. The VNN-SVM 2 pass variant presented in this paper also produced interesting results, being able to prune many more points than the normal algorithm, but not achieving a consistent accuracy gain on all datasets tested.
And although the VNN-SVM could not save time on small datasets, I don't think it should be written off as a possible preprocessing method for them. As seen with the Iris dataset, by running the VNN-SVM and removing certain highly voted points, it achieved a better accuracy than the original dataset.
The same holds for the Wisconsin Breast Cancer dataset: looking at Table A.14 we can find that a VNN-SVM with k = 10 and bound 0 75 achieved a better accuracy than when using the complete original dataset.
Because both the SVM and VNN-SVM run so fast on those datasets, it could be worthwhile to experiment with VNN-SVM by varying k, lbd and ubd to see if you can get a better decision surface using the same data. It is interesting to notice that because this decision surface uses fewer points than the original, it represents a more generalized answer to the problem being classified.
It is clear that the size and complexity of the dataset play a great part in the question of speedup, but the datasets tested here were on the smaller side of Big Data, with 2 of them between 5 and 10 thousand points. So when applied to the great majority of Big Data problems the performance should be closer to the results found for the APS dataset.
The future also looks promising for GPU parallel programs, as both the CUDA SDK and GPU architectures are improving at impressive speeds, with new generations released every year since this study began back in 2016. This is encouraging, as the performance of GPU programs is expected to rise with every generation, as well as the hardware becoming more optimized. So solutions like this could run even better for much bigger datasets.

List of References
• VNNAcc: Accuracy of VNN-SVM ;
• RandAcc: Average accuracy of random SVM s created by sampling the same number of points as the number of SVC s selected on this run. On an unbalanced dataset the same rule as the VNN-SVM was applied: selecting all elements of the minority class and sampling the rest to create the SVM ;
• nSVC: Number of SVC s selected by the VNN-SVM ;
• nSV VNN: Number of support vectors of the SVM created using the SVC s;
• nSV Rand: Average number of support vectors of the random SVM s;
• PositiveAcc: Accuracy of VNN-SVM on the positive class (minority class);
• NegativeAcc: Accuracy of VNN-SVM on the negative class (majority class);
• PositiveAccRand: Average accuracy of the 20 random SVM s on the positive class (minority class);
• NegativeAccRand: Average accuracy of the 20 random SVM s on the negative class (majority class);

APPENDIX B Glossary
• Classifier -In machine learning a classifier is an algorithm that predicts the class or label of points based on its input variables.
• Coalescing Memory Reading -Coalescing Memory Reading is the concept in GPU programming that sequential threads will use sequential sections of memory. GPU architectures use this concept to speed up computation: when receiving a memory access request, instead of loading just the memory requested, the hardware sends to the thread managers a group of memory locations consisting of the one requested and the ones close to it, so that other threads can use those extra memory points for their own computation.
• Continuous Variables -Continuous Variables are variables that can take an infinite number of possible values. In practical terms, continuous variables are any variables that can receive any real number.
• Decision Surface -Decision Surface or Decision Boundary is an n-dimensional hyperplane that divides the feature space between two classes. Points are classified as belonging to a class depending on which side of the decision surface they reside.
• Discrete Variables -Discrete Variables are variables that can take only a finite number of possible values.
• Euclidean Distance -Is the common method of calculating the distance between two points: the length of the straight line connecting them in Euclidean space.
• Feature Space -In Machine Learning, feature space is the n-dimension space where the variables of a dataset live, where n is the number of attributes of the dataset.
• Greatest Margin Classifier -A greatest margin classifier is an algorithm that finds the region in feature space where the distance between the classes is the greatest and uses it to create a decision hyperplane placed in the middle of the margin, classifying every point on one side of the margin as one class and all points on the other side as the other class.
• Hamming Distance -Is a common method of comparing discrete variables, where every possible value of the variable receives a bit of information corresponding to that value and the distance between points is calculated by how many bits differ.
• Kernel Space -On SVM s, kernel space refers to the n-dimensional space where the distance calculation takes place in the SVM algorithm, where n is greater than the number of attributes of the dataset. Although the data is never explicitly converted to kernel space, the decision surface created by the SVM is a plane in that n-dimensional kernel space.
• Linear Classifiers -In machine learning a linear classifier is an algorithm that classifies points based on a linear combination of their features. Even though they are called linear, they can exist in n-dimensional space as hyperplanes, achieving the same results.
• Margin -In machine learning the margin is the distance between the decision surface and the closest data points of each class.
• Non-linear Problems -In machine learning a non-linear problem is a problem that can't be classified correctly using a linear classifier.
• Outliers -In machine learning outliers are points that are separated from other members of the same class by a great distance, not sharing the same characteristics as points of their class. Because of this distance these points are usually hard to classify, and sometimes they may lie closer to members of other classes, hindering the creation of a classifier.
• Overfitting -Overfitting is a common modeling error that happens when the created model fits the training data too closely, often learning noise or outliers that don't represent the overall form of the data.
• Polytope -Polytope is an n-dimensional geometric object. In the context of the text it refers to a geometric object that encloses a subset of points to be classified.
• Test/Training Data -When creating a classifier it is usual to divide the available data into 2 sets, a bigger one for training and a smaller one for testing.
This way you can train your classifier with the training data and, with the remaining points, see how the classifier performs on points never seen, to check for overfitting or bad parameters in your training.
• Time Complexity -In computer science, time complexity is the computational complexity that describes the amount of time it takes to run an algorithm. In this dissertation we deal with the Big O notation of the code implemented, indicating the order of the function that describes its growth rate.
• XOR -XOR, or Exclusive Or, is a basic logic operation that receives 2 inputs and, based on their signs, outputs 0 (if the inputs have the same sign) or 1 (if the inputs have different signs). Applied to a 2-dimensional dataset, this operation divides the data into 4 quadrants that are impossible to classify with a normal linear classifier, so it is often used to create data for testing classification algorithms.
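As a small illustration of the quadrant labeling described above (treating zero as positive is an arbitrary choice of mine):

```python
def xor_label(x, y):
    """XOR labeling: 0 if the two inputs share a sign, 1 if the
    signs differ."""
    return 0 if (x >= 0) == (y >= 0) else 1

# The four quadrants of a 2-D dataset labeled this way alternate
# classes, so no single straight line can separate the two classes.
points = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
labels = [xor_label(x, y) for x, y in points]  # -> [0, 1, 0, 1]
```

Adjacent quadrants always carry opposite labels, which is exactly what defeats a linear classifier.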