Date of Award

2019

Degree Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Statistics

First Advisor

Lutz Hamel

Abstract

Ninety percent of the world's data today was generated over the last two years, driven by the speed at which information is created on the Internet and by the low prices of storage and sensors. This new paradigm is what we call Big Data.

One of the biggest challenges in the field of Machine Learning today is how established algorithms perform on Big Data. The sheer size of these datasets can make it infeasible to use known algorithms to create a decision surface in a reasonable time.

Support Vector Machines are among the algorithms that experience a steep increase in runtime when creating a decision surface for Big Data. This has led to a decline in their use for classification on these types of datasets.

This dissertation introduces Voting Nearest Neighbors, a new preprocessing algorithm that assists Support Vector Machines in dealing with Big Data by creating a voting system based on k-nearest neighbors. The algorithm selects points close to the border between classes, which have a higher chance of being used by a Support Vector Machine as Support Vectors, while removing outliers that would negatively impact the resulting margin. Only these points are used to train the Support Vector Machine, allowing it to create a decision surface in a reasonable time. To guarantee good performance in a reasonable time, the algorithm is implemented in parallel using CUDA on a GPU.
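
As a rough illustration of the idea of voting with k-nearest neighbors, the following minimal Python sketch keeps a point only when some, but not all, of its k nearest neighbors belong to a different class. The function name `select_border_points`, the parameters `min_votes` and `max_votes`, and the voting rule itself are assumptions made for illustration; they are not taken from the dissertation, and the sketch runs serially on the CPU rather than in parallel with CUDA.

```python
import math


def select_border_points(points, labels, k=3, min_votes=1, max_votes=None):
    """Hypothetical k-NN voting filter (illustrative, not the dissertation's exact rule).

    Returns the indices of points whose k nearest neighbors include at least
    `min_votes` and at most `max_votes` points of a different class.
    """
    if max_votes is None:
        # Assumption: a point whose k neighbors ALL disagree with it is an outlier.
        max_votes = k - 1
    kept = []
    for i, p in enumerate(points):
        # Brute-force distances from p to every other point, O(n^2) overall.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        # Each neighbor with a different label casts one vote for "near the border".
        votes = sum(1 for j in neighbors if labels[j] != labels[i])
        if min_votes <= votes <= max_votes:
            kept.append(i)
    return kept
```

Under these assumptions, points deep inside a class receive no votes and are dropped, while a stray point surrounded entirely by the other class receives k votes and is dropped as an outlier. The returned indices would then select the reduced training set passed to the Support Vector Machine.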

The technique was successfully tested on five datasets that cover a broad range of sizes, from Iris, containing just 150 points, to the Air Pressure System Failure and Operational Data for Scania Trucks dataset, which has 60,000 points. The results show an encouraging reduction in runtime for Big Data datasets and impressive performance when classifying imbalanced datasets.
