Date of Award
Master of Science in Computer Science
Computer Science and Statistics
Dimensionality reduction algorithms are a commonly used solution to create a visual summary of high dimensional data in a way that makes identification of patterns and trends easier. Algorithms that are used to visualize data as 2 or 3 dimensional plots are popular options, even more so due to clustering and manifold learning. There already exist many tools, both linear and nonlinear, that are used in visualizing high dimensional data, three of the most popular being PCA, t-SNE and UMAP. PCA has low memory requirements and is efficient in low dimensions, t-SNE captures much of the local structure of high dimensional data while also revealing factors like presence of clusters, and UMAP has no computational restrictions on embedding dimension.
Despite each of their respective advantages, all three of these tools have noticeable drawbacks. t-SNE and UMAP both have hyperparameters which require tuning to get visualizations of any value. PCA cannot recover nonlinear structure, so there can be significant loss of the global structure when applying that algorithm to data. These drawbacks prompt the development of new (mostly nonlinear) tools for visualizing high dimensional data. The reason for which we would want to visualize high dimensional data in the first place is because humans are incapable of seeing in more than three dimensions. Reducing the dimension of high dimensional data enables us not only to view the data, but to notice patterns and easier detect anomalous data points.
Manifold learning is one approach to getting a simplified low dimensional version of higher dimensional data. This machine learning tool is used in the visualization of high dimensional data by describing these datasets as low dimensional manifolds embedded into higher dimensional space. Clustering is a machine learning approach that groups together individual data points in a way that provides value. Clustering simplifies a large high dimensional dataset by showing clusters, or organized groups of data points, rather than all the data points individually. Hierarchical clustering applies this principle by first organizing datasets into one large cluster, and then recursively dividing the current cluster(s) until a specific criteria is met that finds the optimal “level” of this process, or the optimal clusters which represent the dataset.
Clustering algorithms are usually more effective in lower dimensions due to the “curse of dimensionality”, or the issues which arise when analyzing high dimensional data that do not occur in lower dimensions. For this reason, if we want to apply clustering algorithms to high dimensional data, we will be required to use dimensionality reduction first. This is a reason for which we would use manifold learning in tandem with hierarchical clustering, as it reduces the dimension of the data first to maximize the effectiveness of clustering.
When manifold learning and hierarchical clustering are used in unison, the result is a set of clusters from a dataset brought down to a lower dimension through manifold learning. These clusters, when taken from the manifold, are then able to be visualized easily in graph form. In this study, we will develop a tool to visualize high dimensional data by using hierarchical clustering and manifold learning together, but without actually reducing the dimension. Instead of using dimensionality reduction traditionally, we will visualize low dimensional summaries of high dimensional data. The summaries inferred from the data will give information about the manifold, such as connectedness between different parts of the manifold and how this connectedness changes through different stages of the hierarchical clustering algorithm. These summaries will also give factors indicating the presence of possible anomalous data points.
To create and access these summaries, we will use Pyclam, the Python implementation of CLAM (Clustered Learning of Approximate Manifolds). CLAM is an existing dimensionality reduction tool that uses manifold learning and hierarchical clustering, and made primarily for anomaly detection. From the manifolds produced by CLAM, we will be able to access all the necessary properties needed to infer graphs. These graphs will be returned in our implementation in the form of a DOT file, a file format read by various software to produce a graphical representation. After we are able to produce working DOT files, we will use a visualization tool of our own design, implemented in Rust, to read these DOT files and display these graphs in a force-directed layout.
Amani, Ali, "DATA DIMENSIONALITY REDUCTION THROUGH CLUSTER TREES AND MANIFOLD LEARNING" (2021). Open Access Master's Theses. Paper 1941.