PERFORMANCE COMPARISON OF SELF-ORGANIZING MAPS BASED ON DIFFERENT AUTOENCODERS

The Autoencoder (AE) is a kind of artificial neural network that is widely used for dimensionality reduction and feature extraction in unsupervised learning tasks. Analogously, the Self-Organizing Map (SOM) is an unsupervised learning algorithm that represents high-dimensional data on a 2D grid map, thus achieving dimensionality reduction. Recent work has shown performance improvements from combining AEs with SOMs. The purpose of this research is to determine which variations of AEs work best and whether the selection of AEs is data-dependent. Five types of AEs are implemented in this research; three different data sets are used for training; map embedding accuracy and estimated topographic accuracy are used for measuring model quality. Overall, this research shows that nearly all AEs improve SOM performance, raising embedding accuracy and making the training process more efficient. The Convolutional Autoencoder (ConvAE) shows outstanding performance on the image data set, the Denoising Autoencoder (DAE) works well with noisy real-world data, and the Contractive Autoencoder (CAE) performs excellently on the synthetic data set. Therefore, we can see that the selection of AEs depends on the properties of the data.


Introduction
The Autoencoder (AE) is a kind of artificial neural network. It is an unsupervised learning algorithm that is mainly used for feature extraction and dimensionality reduction [1]. It consists of an encoder and a decoder; the decoder aims to reconstruct the original input data from the hidden-layer representation produced by the encoder. The architecture of an AE is shown in Figure 1.
The Self-Organizing Map (SOM), proposed by T. Kohonen [2], is another approach to dimensionality reduction, which displays clustering results for high-dimensional input data on a 2D grid map. In recent research, combining AEs with SOMs has shown some promise in improving the performance of regular SOMs [3]. The Deep Neural Map (DNM) [4] model proposed in 2018 achieved this combination and gave excellent performance in high-dimensional data visualization.
However, there are many different kinds of AEs, and knowing which one works best is an open question. Performance comparison of different AEs could help one find more appropriate AEs for a data set, hence improving the performance of the underlying SOMs.
In fields such as genomic data clustering [5] [6] and cluster analysis of massive astronomical data [7] [8], the SOM is a good approach since it not only accomplishes the clustering task but also provides an accessible, visible clustering representation. However, because both genomic data and astronomical data are high-dimensional, the SOM takes a long time to train on them. The AE is an excellent method to bring the data to a lower dimensionality while keeping the intrinsic structure of the data. Hence, a SOM in conjunction with an AE can save training time. This project can help select an appropriate AE for a data set to reduce data dimensionality, thus reducing the computing time of the SOM. The quality of the resulting maps is evaluated with the SOM quality measures proposed by Hamel [9], which are based on map embedding accuracy and estimated topographic accuracy.
The remaining chapters of this thesis are organized as follows: Chapter 2 reviews the Self-Organizing Map and the autoencoder variants studied here, Chapter 3 describes the experiment design and implementation, Chapter 4 presents the results, and Chapter 5 concludes and outlines future work.

Self-Organizing Map
A kind of artificial neural network created by Teuvo Kohonen [2], the Self-Organizing Map (SOM) is an unsupervised learning algorithm that is mainly used for the visualization of high-dimensional data. Usually, it produces a two-dimensional lattice of nodes (called a map) to represent the high-dimensional input data while preserving the topological relationships of the input [2], and it is therefore used for dimensionality reduction. The convergence of the SOM algorithm has been proved by Y. Cheng [10]; the model converges after a reasonably long run of iterations [2].
The basic SOM algorithm can be summarized as follows [11]:
1) Selection step: initialize each node's weight vector randomly, then select a training data vector from the input space.
2) Competitive step: find the best matching neuron based on the Euclidean distance between the data vector and the neurons,

$c = \arg\min_i \| x - m_i \|$, (1)

where $m_i$ is a neuron indexed by $i$ and $c$ denotes the index of the best matching neuron on the map.
3) Update step: update the winning neuron's neighborhood using the rule

$m_i \leftarrow m_i + \eta \, h(c, i) \, (x - m_i)$, (2)

where $\eta (x - m_i)$ denotes the difference between the neuron and the training instance scaled by the learning rate $\eta$ ($0 < \eta < 1$), and $h(c, i)$ denotes the neighborhood function

$h(c, i) = \begin{cases} 1 & \text{if } i \in \Gamma(c), \\ 0 & \text{otherwise}, \end{cases}$ (3)

where $\Gamma(c)$ is the neighborhood of the best matching neuron $c$.
Repeat from the selection step for $t$ iterations until the model converges. For a large high-dimensional data set, $t$ can be a large number; however, the basic iterative SOM does not perform well over such long training runs [11].
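To make the three steps concrete, the following is a minimal sketch of the basic stochastic training loop in Python/NumPy. It assumes a simple bubble neighborhood and a fixed learning rate and radius; all names and parameter values are illustrative, not the implementation used later in this thesis.

import numpy as np

def train_som(X, rows, cols, n_iter, eta=0.3, radius=2):
    # 1) Selection step (initialization): random weight vectors, one per map node.
    n, dim = X.shape
    rng = np.random.default_rng(0)
    neurons = rng.random((rows * cols, dim))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for _ in range(n_iter):
        x = X[rng.integers(n)]                         # select a training vector
        # 2) Competitive step: BMU = neuron with the smallest distance to x.
        bmu = np.argmin(np.linalg.norm(neurons - x, axis=1))
        # 3) Update step: move every neuron in the BMU's grid neighborhood toward x.
        in_hood = np.abs(grid - grid[bmu]).max(axis=1) <= radius
        neurons[in_hood] += eta * (x - neurons[in_hood])
    return neurons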

Vectorized SOM Training
Vectorized SOM training (VSOM), proposed by L. Hamel [11], is an efficient implementation of stochastic training for SOMs, which replaces all iterative constructs with vector and matrix operations. It is a single-threaded algorithm, providing substantial performance increases over the basic SOM algorithm (up to 60 times faster) [11]. Because R does not support multi-threading well, VSOM is well suited as a replacement for iterative stochastic training of SOMs in R [11]. The VSOM implementation is available in the R-based POPSOM package [12].
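The full VSOM formulation is given in [11]; the sketch below illustrates only its central idea, namely replacing the per-instance best-matching-unit search with a single matrix computation (a simplified illustration, not the POPSOM code).

import numpy as np

def batch_bmus(X, neurons):
    # Squared Euclidean distances for all (instance, neuron) pairs at once,
    # via ||x||^2 - 2 x.m + ||m||^2, with no explicit loops.
    d2 = (np.sum(X ** 2, axis=1, keepdims=True)
          - 2.0 * X @ neurons.T
          + np.sum(neurons ** 2, axis=1))
    return np.argmin(d2, axis=1)  # BMU index for every training instance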

Autoencoder
The origin of the autoencoder (AE) is not clear, and the terminology has changed over time. J. Schmidhuber [13] indicates that perhaps the first work to study the potential benefits of unsupervised-learning-based pre-training was published by Dana H. Ballard [14] in 1987, which proposed unsupervised AE hierarchies. Based on the information provided in [15], I summarize the AE as follows.

An AE is a kind of artificial neural network that is mainly used for feature extraction and dimensionality reduction. It is composed of two parts, an encoder and a decoder, which together aim to reconstruct the original input. The encoder maps the input into a hidden-layer representation (called the code), and the decoder then reconstructs the input from that representation.
An autoencoder could be undercomplete or overcomplete. The one with code dimension less than the input dimension is called undercomplete, while the one with code dimension greater than the input dimension is called overcomplete.
Regularization can prevent an overcomplete autoencoder from simply copying the input to the output without learning anything useful [15]; regularized variants include the sparse autoencoder, the denoising autoencoder, and the contractive autoencoder.

Sparse Autoencoder
In 1997, Olshausen and Field [16] indicated that sparse coding with an overcomplete basis set leads to interesting interactions among the code elements, because sparsification weeds out those basis functions not needed to describe a given image structure. Hence, sparse coding is a good candidate for data sets whose inputs contain substantial noise [17].
A sparse autoencoder (SAE) is a kind of overcomplete autoencoder that includes more hidden nodes than input nodes, but only a small number of the hidden nodes are activated at once [18]. The training criterion of an SAE involves a sparsity penalty $\Omega(h)$ on the code layer $h$, in addition to the reconstruction error $L$; the objective function is as follows [15]:

$L(x, g(f(x))) + \Omega(h)$, (4)

where $f(x)$ denotes the encoder output, $g(h)$ denotes the decoder output, and $h = (h_1, h_2, \ldots, h_n) = f(x)$. The sparsity penalty $\Omega(h)$ can be formulated in different ways; one approach is to apply an L1 regularization term to the activations, scaled by a tuning hyperparameter $\lambda$ [15]:

$\Omega(h) = \lambda \sum_i |h_i|$. (5)

In 2013, an autoencoder with linear activation functions called the K-Sparse Autoencoder [19] was proposed, in which only the k highest activities are kept in the hidden layer. It achieves high speed in the encoding stage and is well suited to large problem sizes [19].

Denoising Autoencoder
Differently from the SAE, which adds a penalty to the loss function, the denoising autoencoder (DAE) achieves a useful representation by changing the reconstruction error term of the loss function [15]. The DAE takes corrupted input data and is trained to predict the original, uncorrupted data as output [15]; therefore, the input and output of a DAE are no longer the same. Figure 2 shows the architecture of a DAE (reproduced from [20]): the initial input $x$ is corrupted into $\tilde{x}$ by a stochastic mapping $\tilde{x} \sim q_D(\tilde{x} \mid x)$; the encoder then maps it to a hidden representation $h = f(\tilde{x})$, from which the reconstruction $z = g'(h)$ is produced, and the reconstruction error is measured by the loss $L(x, z)$ [21]. In order to make the reconstruction $z$ as close as possible to the clean input $x$, the parameters of $f$ and $g'$ are trained to minimize the average reconstruction error over the training set [21]. Note that the corruption process $q_D(\tilde{x} \mid x)$ can be of any type, such as Gaussian noise, masking noise, or salt-and-pepper noise [21].
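As a concrete illustration, the following sketch shows a Gaussian corruption step $x \to \tilde{x}$ of the kind described above, using the noise factor 0.5 adopted later in this thesis; the function name and the clipping range are illustrative assumptions.

import numpy as np

def corrupt(X, noise_factor=0.5, seed=0):
    # Stochastic mapping x -> x~: add scaled Gaussian noise to the clean input.
    rng = np.random.default_rng(seed)
    X_noisy = X + noise_factor * rng.normal(size=X.shape)
    return np.clip(X_noisy, 0.0, 1.0)  # keep values in the original [0, 1] range

# A DAE is then fit on (corrupt(X), X): noisy inputs, clean reconstruction targets.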

Contractive Autoencoder
The contractive autoencoder (CAE) aims to resist perturbations of the input and is encouraged to contract the input neighborhood to a smaller output neighborhood [15].
The CAE adds a regularizer penalty $\|J_f(x)\|_F^2$ (the squared Frobenius norm of the Jacobian matrix $J_f(x)$) to the reconstruction cost function to encourage robustness of the representation $f(x)$ [22]:

$\|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2$, (6)

where $h$ is the hidden representation; the penalty is the sum of squares of all partial derivatives of the extracted features $h(x)$ with respect to the input $x$ [22]. Similar to the SAE, the objective function of the CAE has the following form:

$L(x, g(f(x))) + \lambda \|J_f(x)\|_F^2$. (7)

Comparing CAEs with DAEs, we can see that CAEs encourage robustness of the representation $f(x)$, whereas DAEs encourage robustness of the reconstruction $g(f(x))$ [22].
In 2014, Alain and Bengio [23] showed that, in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized perturbations of the input, whereas CAEs make the feature extraction function resist infinitesimal perturbations of the input [15].

Convolutional Autoencoder
Different from basic autoencoders, a convolutional autoencoder (ConvAE) is built with convolutional layers rather than fully connected layers, and hence it is efficient for image data sets. To exploit the spatial structure of images, the convolutional autoencoder is defined as follows [24]:

$h = f(x) = \sigma(x * W)$, $\quad y = g(h) = \sigma(h * \tilde{W})$, (8)

where $f(x)$ denotes the encoder output, $g(h)$ denotes the decoder output, the input $x$ and the embedded code $h$ are matrices or tensors, $\sigma$ is the activation function, and $*$ is the convolution operator. The objective is to minimize the mean squared error between the input and the output over all samples [24]:

$E = \frac{1}{n} \sum_{i=1}^{n} \| x_i - g(f(x_i)) \|_2^2$. (9)

In recent research, a Fully Convolutional Autoencoder (FCAE) [25], proposed in 2017, can be trained in an end-to-end manner. It is composed of convolution (de-convolution) layers and pooling (un-pooling) layers, with batch normalization layers added to each of the convolution-type layers. Different from traditional ConvAEs, the FCAE avoids the tedious and time-consuming layer-wise pretraining stage [25].

The Deep Neural Map (DNM) model designed by Mehran Pesteie, Purang Abolmaesumi, and Robert R. Rohling [4] in 2018 gives excellent performance in high-dimensional data visualization; it uses SOM models in conjunction with deep convolutional AEs, as shown in Figure 3. The results show that the DNM successfully separated each class of input data and mapped it to a particular position on the lattice [4]. D. Rajashekar [3] proposed an Autoencoder-based Self-Organizing Map (AESOM) framework, which uses an AE with two hidden layers. It shows improvements in data representation, improves detection rates from the encoding, and reduces the feature space of the input [3].

Experiment Design
In this research, the experiment is mainly divided into two parts: 1) implement the AEs and SOMs (build five AEs with Keras [26] in the TensorFlow [27] library and implement the SOMs with the R-based POPSOM library [12]), and 2) evaluate the performance. In this chapter, I will introduce the evaluation methods and the implementation process in detail.

Model Structure
Based on the DNM model, the overall model structure is shown in Figure 4.

Data Set Selection
In this project, the task of the AEs is to reduce dimensionality and extract features, and the task of the SOMs is to cluster the input data. For this purpose, the ideal data set is one with high dimensionality and well-defined class labels. To compare and evaluate the performance of AEs in conjunction with SOMs in various circumstances, three different types of data sets (synthetic, real-world, image) were selected.
1) The 'dim064' data set [28] [29] is a 64-dimensional synthetic data set with 1,024 observations that are well separated into 16 Gaussian clusters (Figure 5). I split the data set with a test ratio of 0.4, namely 60% of the data for training (614 instances) and 40% for testing (410 instances).
2) The 'Landsat Satellite' data set [30] is a real-world data set consisting of 36-dimensional multi-spectral values of satellite image pixel neighborhoods, labeled with 6 soil classes.
3) The 'MNIST' database [31] is a large database of handwritten digits that is widely used for machine learning. It consists of 70,000 (60,000 for training, 10,000 for testing) grey-scale images of handwritten digits ('0'-'9') of size 28 by 28 pixels. I selected 10,000 examples from the training set and 2,000 examples from the test set to make a subset of the MNIST database as my third data set.
I converted each original image into a 28-by-28 2D array and scaled the value of each cell to between 0 and 1; each cell represents a single pixel of the image.
Before being fed to the AEs (except the ConvAEs), the 2D array was flattened into a 1D array; hence the dimensionality of the data set is 784 (28 by 28).
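The preprocessing just described can be sketched as follows, assuming the Keras MNIST loader; variable names are illustrative.

import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, _), (x_test, _) = mnist.load_data()     # 28 x 28 grey-scale images
x_train = x_train.astype("float32") / 255.0       # scale each pixel into [0, 1]
x_flat = x_train.reshape(len(x_train), 28 * 28)   # flatten to 784-D for non-conv AEs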

Performance Evaluation of AEs
The evaluation of the AEs is based on the loss (reconstruction error). I plot the loss curves on training data and validation data for each type of AE and compare their mean and minimum values. For the image data set, I also plot the original input images and the decoded images to show the reconstruction results visually.
Additionally, the evaluation results of the SOMs also indicate the quality of the AEs, namely whether the encoders extract useful features.

Performance Evaluation of SOMs
Within this research, the evaluation of the SOMs is based on the SOM quality measures presented by L. Hamel [9], an efficient statistical approach that measures both the embedding and the topological quality of a SOM.

1) Embedding Accuracy
The motivation for the map embedding accuracy is that [9] 'a SOM is completely embedded if its neurons appear to be drawn from the same distribution as the training instances.' A feature is embedded if its mean and variance are adequately modeled by the neurons of the SOM. The embedding accuracy $ea$ for $d$ features is defined as

$ea = \frac{1}{d} \sum_{i=1}^{d} \rho_i$, (10)

where

$\rho_i = \begin{cases} 1 & \text{if feature } i \text{ is embedded,} \\ 0 & \text{otherwise.} \end{cases}$ (11)

A map is fully embedded if the embedding accuracy equals 1.
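One way to realize the per-feature embedding test is sketched below, assuming the two-sample t-test for means and F-test for variances that the measure in [9] builds on; the significance level and the details are illustrative, not the exact POPSOM implementation.

import numpy as np
from scipy import stats

def embedding_accuracy(X, neurons, alpha=0.05):
    embedded = 0
    for j in range(X.shape[1]):
        # Feature j is embedded if we cannot reject equality of means and variances.
        _, p_mean = stats.ttest_ind(X[:, j], neurons[:, j], equal_var=False)
        f = np.var(X[:, j], ddof=1) / np.var(neurons[:, j], ddof=1)
        dfx, dfm = len(X) - 1, len(neurons) - 1
        p_var = 2 * min(stats.f.cdf(f, dfx, dfm), 1 - stats.f.cdf(f, dfx, dfm))
        embedded += int(p_mean > alpha and p_var > alpha)   # rho_j
    return embedded / X.shape[1]                            # ea = (1/d) * sum(rho_j)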

2) Estimated Topographic Accuracy
The topographic error [32] is perhaps the simplest measure of the topological quality of a map. It is defined as

$te = \frac{1}{n} \sum_{i=1}^{n} \operatorname{err}(x_i)$, (12)

where

$\operatorname{err}(x_i) = \begin{cases} 1 & \text{if } bmu(x_i) \text{ and } 2bmu(x_i) \text{ are not neighbors,} \\ 0 & \text{otherwise,} \end{cases}$

$n$ is the number of training instances, $x_i$ denotes the $i$th training vector, and $bmu(x_i)$ and $2bmu(x_i)$ are the best matching unit and the second-best matching unit for $x_i$ on the map. The estimated topographic accuracy [9] can be defined as

$ta' = 1 - \frac{1}{s} \sum_{j=1}^{s} \operatorname{err}(x_j)$, (13)

where $s$ is the size of a sample of the training data. L. Hamel [9] indicated that accurate values of $ta'$ can be obtained with very small samples, so this estimate is more efficient to compute than the conventional topographic accuracy ($1 - te$). We say a map is fully organized if the topographic accuracy is close to 1.
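The estimated topographic accuracy can be sketched as follows: draw a sample of size s and, for each sampled instance, check whether its best and second-best matching units are adjacent on the lattice. The neighborhood test (lattice distance at most 1) is an illustrative choice.

import numpy as np

def topographic_accuracy(X, neurons, grid, s=100, seed=0):
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(s, len(X)), replace=False)]
    errors = 0
    for x in sample:
        order = np.argsort(np.linalg.norm(neurons - x, axis=1))
        bmu, second = grid[order[0]], grid[order[1]]
        if np.abs(bmu - second).max() > 1:   # BMU and 2nd BMU are not neighbors
            errors += 1
    return 1.0 - errors / len(sample)        # ta' = 1 - estimated topographic error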

3) Convergence Accuracy
Convergence accuracy is a SOM quality assessment implemented in the R-based POPSOM package [12] [33]. It is defined as

$cvg = k_1 \, ea + k_2 \, ta'$, with $k_1 + k_2 = 1$, (14)

i.e., the convergence accuracy is a linear combination of the embedding accuracy $ea$ and the estimated topographic accuracy $ta'$, which indicates the model performance from the perspectives of both the training data set and the map neurons. It is the primary approach used to evaluate and compare the quality of SOMs in this research.

Implementation
The five types of AEs were implemented in Python with the TensorFlow Keras framework. The SOMs were built in R with the POPSOM package.

Basic AE
I implemented a single fully-connected layer as the encoder and another as the decoder. The parameters of the basic AE for each data set are shown in Table 1, Table 2, and Table 3; the architectures of the basic AE for the 'dim064', 'Landsat Satellite', and 'MNIST' data sets are shown in Figure 7, Figure 8, and Figure 9, respectively.
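A minimal sketch of this basic AE in Keras follows: one dense encoder layer and one dense decoder layer. The layer sizes, activations, optimizer, and loss are placeholders standing in for the per-data-set values in Tables 1-3.

from tensorflow.keras import layers, models

def build_basic_ae(input_dim, code_dim):
    inputs = layers.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu")(inputs)        # encoder
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)   # decoder
    ae = models.Model(inputs, outputs)
    encoder = models.Model(inputs, code)  # used later to feed encoded data to the SOM
    ae.compile(optimizer="adam", loss="binary_crossentropy")
    return ae, encoder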

SAE
The SAE adds an L1 regularizer to the encoded layer on top of the basic AE. Both the parameters (Table 1, Table 2, Table 3) and architectures (Figure 7, Figure 8, Figure 9) of the SAE for each data set are the same as those of the basic AE.
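The change relative to the basic AE sketch above is a single argument on the code layer, as in the following sketch; the penalty weight 1e-5 is an illustrative value.

from tensorflow.keras import layers, models, regularizers

def build_sparse_ae(input_dim, code_dim, l1=1e-5):
    inputs = layers.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="relu",
                        activity_regularizer=regularizers.l1(l1))(inputs)  # Omega(h)
    outputs = layers.Dense(input_dim, activation="sigmoid")(code)
    ae = models.Model(inputs, outputs)
    ae.compile(optimizer="adam", loss="binary_crossentropy")
    return ae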

CAE
The CAE uses the same parameters (Table 1, Table 2, Table 3) and architecture (Figure 7, Figure 8, Figure 9) as the basic AE as well, except that a different loss function is applied. Following the objective function of the CAE (Equation 7), I implemented a distinct loss function by expanding Equation 6 for a sigmoid code layer as

$\|J_f(x)\|_F^2 = \sum_j \left( h_j (1 - h_j) \right)^2 \sum_i W_{ji}^2$, (15)

and then translated the equation into Python code to obtain a contractive loss function [34].
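A sketch of one way to implement this penalty for a sigmoid code layer is shown below, adding Equation 15 to a plain MSE reconstruction loss via Keras's add_loss; the penalty weight lam is illustrative, and this is not necessarily identical to the code derived from [34].

import tensorflow as tf
from tensorflow.keras import layers, models

def build_contractive_ae(input_dim, code_dim, lam=1e-4):
    inputs = layers.Input(shape=(input_dim,))
    enc = layers.Dense(code_dim, activation="sigmoid", name="encoded")
    h = enc(inputs)                                       # h = sigmoid(W x + b)
    outputs = layers.Dense(input_dim, activation="sigmoid")(h)
    ae = models.Model(inputs, outputs)
    # Equation 15: sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
    w_sq = tf.reduce_sum(tf.square(tf.transpose(enc.kernel)), axis=1)
    penalty = lam * tf.reduce_mean(
        tf.reduce_sum(tf.square(h * (1.0 - h)) * w_sq, axis=1))
    ae.add_loss(penalty)                                  # added to the MSE below
    ae.compile(optimizer="adam", loss="mse")
    return ae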

DAE
I set the noise factor to 0.5 to create the noisy input. For the 'dim064' and 'Landsat Satellite' data sets, both the encoder and the decoder are still single fully-connected layers, and the parameters are the same as before. For the subset of the MNIST data set, I implemented a Denoising Convolutional Autoencoder (DCAE); the architecture is shown in Figure 10. Before feeding the data to the network, I reshaped each input to size 28 × 28 × 1.
The encoder consists of three 2D convolutional layers followed by downsampling (max-pooling) layers (pooling size 2 × 2) and a flatten layer (encoded layer). The first two convolutional layers have 32 filters and the third one has 4 filters of size 3 × 3. The output of the encoded layer is 64 dimensional.
The decoder consists of four 2D convolutional layers followed by three upsampling layers (size 2 × 2), the last convolutional layer is the decoded layer. The first convolutional layer has 8 filters, the following two convolutional layers have 32 filters, and the decoded layer has 1 filter of size 3 × 3.
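Putting the two descriptions together, a Keras sketch of this DCAE is given below. The filter counts, pooling, and up-sampling follow the text above; the padding choices (including one 'valid' convolution that shrinks 16 × 16 back to 14 × 14) are assumptions made so that the shapes work out to a 64-D code and a 28 × 28 output.

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                    # 14 x 14 x 32
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                    # 7 x 7 x 32
x = layers.Conv2D(4, (3, 3), activation="relu", padding="same")(x)
x = layers.MaxPooling2D((2, 2), padding="same")(x)                    # 4 x 4 x 4
code = layers.Flatten(name="encoded")(x)                              # 64-D code

x = layers.Reshape((4, 4, 4))(code)
x = layers.Conv2D(8, (3, 3), activation="relu", padding="same")(x)
x = layers.UpSampling2D((2, 2))(x)                                    # 8 x 8 x 8
x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = layers.UpSampling2D((2, 2))(x)                                    # 16 x 16 x 32
x = layers.Conv2D(32, (3, 3), activation="relu", padding="valid")(x)  # 14 x 14 x 32
x = layers.UpSampling2D((2, 2))(x)                                    # 28 x 28 x 32
outputs = layers.Conv2D(1, (3, 3), activation="sigmoid", padding="same")(x)

dcae = models.Model(inputs, outputs)
dcae.compile(optimizer="adam", loss="binary_crossentropy")
# DCAE: fit on (noisy images, clean images); plain ConvAE: fit on (clean, clean).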

ConvAE
For the 'dim064' and 'Landsat Satellite' data sets, I used 1D convolutional layers, 1D max-pooling layers, and 1D up-sampling layers to build the models. The architectures of the convolutional AEs for these two data sets are shown in Figure 11 and Figure 12. For the subset of the MNIST data set, the architecture of the ConvAE is the same as that of the DCAE; the only difference is that the original data, rather than the noisy data, are fed to the network.

SOM
Before feeding the SOM with the encoded data extracted by the five AEs, I drop the columns that consist entirely of zeros, because they carry no information for the clustering task. For the 'dim064' data set, I implemented a 20 × 15 map with 300 neurons in total. For the 'Landsat Satellite' data set, I implemented a 40 × 35 map with 1,400 neurons in total. For the subset of the 'MNIST' data set, I implemented a 40 × 40 map with 1,600 neurons in total.
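The column-dropping step can be sketched as follows, assuming an `encoder` model such as the one in the basic AE sketch above; the names are illustrative.

import numpy as np

encoded = encoder.predict(x_train)       # code vectors from a trained encoder
keep = ~np.all(encoded == 0, axis=0)     # keep columns with at least one nonzero
som_input = encoded[:, keep]             # input handed to the SOM (in R/POPSOM)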

Results
In this chapter, I will use the abbreviations shown in Table 4 to represent each model.

Loss of AEs
After 200 epochs, the training loss and validation loss of each model are shown in Figure 13. All the models were trained well. For the DAE and the ConvAE, the generalization of the models could not be improved further, as the validation loss saturated after approximately 150 epochs.
Figure 13. Training loss and validation loss of five models in 'dim064'

SOM Models Results
I trained the SOM models from 10 to 400,000 iterations (10; 100; 1,000; 10,000; 50,000; 100,000; 200,000; 400,000), repeating each setting 5 times, and plotted the convergence accuracy, embedding accuracy, and estimated topographic accuracy of each model, shown in Figure 14 and Figure 15. I scaled the x-axis (iterations) logarithmically (base 2). ConvAE_SOM also shows good embedding accuracy after 100,000 iterations, but its estimated topographic accuracy varies around 0.88 after 100 iterations and could not be improved further.
By comparison, CAE_SOM is the best, followed by ConvAE_SOM. This indicates that the data are insensitive to small perturbations, so the CAE best captures their intrinsic structure. Except for the basic AE_SOM, using encoded data yields better results than using the original data. Moreover, the encoding brings the data to a lower-dimensional representation, which makes computing the SOM more efficient.
Overall, all these models perform quite well on this data set. The reason could be that synthetic data sets have a very good underlying clustering structure: each feature in the data is equally important, the categories are evenly distributed across the data set, and little noise is present in the data. Thus, it is much easier for the SOM to learn the actual distribution of the training data even without encoding.

Clustering Result Representation
The starburst representation of the model (Figure 16) gives a visible clustering result with class labels. Clusters are identified by a light color (yellow) and cluster boundaries by darker colors (red) [11]. The starburst lines help identify the center of each cluster: all nodes are connected to their centroid node [33]. I plot the heat maps to confirm that the quantities above (embedding accuracy, estimated topographic accuracy, convergence accuracy), when they meet certain criteria, provide a good indication that the SOM has learned the underlying structure.
Since the CAE_SOM model achieved the best result, I implemented a 20 × 15 CAE_SOM and compared it with a SOM trained on the unencoded data. I trained the models for 200,000 iterations and output the starburst representations of the clusters, shown in Figure 16 and Figure 17.
Visibly, both maps separate the data into 14 clusters, with two classes (labels 6 and 7) mis-clustered, and the clusters are located similarly on the two maps. Overall, the clustering structure is almost the same, and CAE_SOM shows an excellent clustering result. Therefore, the encoded data has a similar structure to the original data, and both structures are successfully discovered by the SOM. It also suggests that we can trust the encoded data as input to the SOM.

Loss of AEs
As seen in Figure 18, all the models were trained well after 200 epochs. For the DAE, the generalization of the model could not be improved further, as the validation loss saturated after approximately 175 epochs.
Figure 18. Training loss and validation loss of five models in 'Landsat Satellite'

SOM Models Results
I trained the SOM models from 10 to 400,000 iterations (10; 100; 1,000; 10,000; 50,000; 100,000; 200,000; 300,000; 400,000), repeating each setting 5 times, and plotted the convergence accuracy, embedding accuracy, and estimated topographic accuracy of each model, shown in Figure 19 and Figure 20. I scaled the x-axis logarithmically (base 2). Comparing the starburst representations in Figure 21 and Figure 22, the number of identified clusters is almost the same and the visible starburst lines span in a similar way, which shows that the clustering structure is nearly the same. Therefore, the encoded data has a similar structure to the original data, and both structures are successfully discovered by the SOM. It also suggests that we can trust the encoded data as input to the SOM.

Loss of AEs
I plotted the loss and the visible reconstruction results of each model, shown in Figure 23 through Figure 27.

SOM Models Results
Similarly, I plot the convergence accuracy, embedding accuracy, and estimated topographic accuracy of each model (Figure 28, Figure 29), with the x-axis scaled logarithmically (base 2). Compared with the starburst representation shown in Figure 31, the ConvAE_SOM shows a clustering structure close to that of the SOM with unencoded data, because the number of identified clusters is almost the same and the visible starburst lines span similarly. This indicates that the encoded data has a similar structure to the original data, and both structures are successfully discovered by the SOM. It also suggests that we can trust the encoded data as input to the SOM.

Conclusion
The objective of this research was to answer the following two questions: 1) for a given data set, what kind of AE performs best in improving the performance of the underlying SOM, and 2) whether the selection of AEs in conjunction with SOMs is data-dependent. According to the experimental results, nearly all AEs improve the performance of the SOM to some degree. They also bring the original data to a lower-dimensional representation, which makes the training process more efficient. The CAE performed excellently on the synthetic data set, the ConvAE showed outstanding performance on the image data set, and the DAE worked well with noisy real-world data. The SAE did not show good results on the three chosen data sets, which may be because the data do not have a sparse structure. Hence, the selection of an AE depends on the properties of the data, and selecting an appropriate AE based on the features of a data set can help the SOM obtain a better clustering result.
Interestingly, many embedding accuracy curves show a peak after a certain number of iterations. This could arise because the neurons start to learn finer-scale cluster structure, causing the embedding accuracy to drop slightly; I suspect it rises again once the finer scale is adequately learned. In the limit, each neuron is a cluster by itself and the embedding accuracy approaches 1.

Future Work
Firstly, it is worth studying when the peak value of embedding accuracy appears, which may help in choosing an appropriate number of training iterations. For now, we can see that the embedding accuracy oscillates after the peak, but the long-term trend is unknown. Training the models for many more iterations in a future study would help uncover how the embedding accuracy varies and reveal the relationship between the peak value and the number of training iterations.
Secondly, it would be worthwhile to compare SOM performance using encoded data of different dimensionalities as input. In this research, I encoded the original data into only one dimensionality per data set. Testing different encoding dimensionalities to see whether the degree of compression affects the SOM clustering result could yield further insights.
Additionally, there are other variations of AEs, such as the variational autoencoder and the stacked autoencoder, which could be examined in future work.