Implementation of Self-Organizing Maps with Python

As members of the Artificial Neural Network family, Self-Organizing Maps (SOMs) have been well researched since the 1980s and have been implemented in C, Fortran, R [1] and Python [2]. Python is an efficient high-level language that has been widely used in the machine learning field for years, but most of the SOM-related packages written in Python only perform model construction and visualization. The POPSOM package, written in R, goes beyond model construction and visualization: it can evaluate the model's quality with statistical methods and plot the marginal probability distributions of the neurons. In order to give Python users the POPSOM package's advantages, it is important to migrate POPSOM to Python. This study shows the details of this implementation. There are three major tasks: 1) migrate the POPSOM package from R to Python; 2) refactor the source code from a procedural programming paradigm to an object-oriented programming paradigm; 3) improve the package by adding normalization options to the model construction function. In addition to constructing the model in Python, Fortran is also embedded in this project to accelerate model construction significantly. With the final program complete, it is necessary to guarantee its correctness. The best way to achieve this goal is to compare the output of the Python-based program to the output generated by the R-based program. For the model construction function, the SOM algorithm initializes the weight vectors of the neurons randomly at the very beginning and then selects the input vectors randomly during training. Due to these two random factors, one cannot expect the same input (data set) to result in exactly the same output (neurons).
Instead, to give evidence that the Python program is working properly, two solutions have been proposed and applied in this project: 1) measuring the average difference of vectors between the two sets of neurons generated by the R and Python functions respectively; 2) measuring the ratio of the variances and the difference of the features' means for the two sets of neurons. Besides model construction, model visualization and the other functions that take neurons as their input should return the same results when fed the same input (neurons). The details of the above verification are presented in the following chapters.


Introduction
Dimensionality reduction has been an important topic within the data analysis community for some time. Several solutions have been proposed by researchers, one of which is Principal Component Analysis (PCA), a statistical procedure based on orthogonal transformation. It has been used as a tool in exploratory data analysis and the creation of predictive models. In the 1980s, another approach to dimensionality reduction was proposed by T. Kohonen [3], known as Self-Organizing Maps (SOMs), a type of neural network for the visualization of high-dimensional data. Typically, the SOM graphic represents [4] the high-dimensional input data with a 2-D grid map. This type of map preserves the topology and neighborhood relationships of the input space [5]. Additionally, as indicated by [3], the convergence of the model is guaranteed after a certain number of iterations.
The SOM algorithm has been implemented in C, R, Fortran and Python [6]-[8].
To date, there are more than 100 packages available on GitHub. Although the number of packages is sufficient and continually increasing, the functionalities provided by these packages are quite similar. Most of them focus only on model construction and model visualization; few touch on the aspect of evaluating the model's quality. POPSOM, an R package [8] developed and maintained by Dr. Lutz Hamel and his former students, not only provides model construction and model visualization like the other packages, but also provides a set of functions for evaluating the model's quality and visualizing the marginal probability distribution of each feature. The purpose of this project is to migrate the POPSOM package from R to Python so that researchers in the Python community may utilize it in their research.
A Self-Organizing Map (SOM) is a specific type of Artificial Neural Network whose purpose is to reduce the dimension of the input space. The resulting map is a graphical representation easily interpreted by the end user [4]. From a practical point of view, the SOM's program package should include at least the following three main functions: 1) model construction, 2) model evaluation, and 3) model visualization.
As the most important part of the SOM, the model construction algorithm was proposed by [3]. The basic idea of this algorithm is described in the following two major steps: 1) Initialize the weight vectors (or neurons) randomly.
2) Update the weight vectors over a certain number of iterations using the following formula:

m_i(t + 1) = m_i(t) + h_ci(t)[x(t) - m_i(t)]
Secondly, the model evaluation function is designed to help users determine the appropriateness of the model after each training. Many quality measures have been proposed to evaluate the quality of the resulting map [4]. Most of them either focus on only one aspect of a SOM or are computationally expensive [9]. In 2017, Dr. Lutz Hamel proposed an efficient statistical approach [9] to measure both the map embedding accuracy (or convergence) and the estimated topographic accuracy of the model. This approach has since been implemented in the R-based POPSOM package [8].
Most packages available on GitHub only represent the resulting maps as heat maps, while the R-based POPSOM package provides users with three kinds of graphical reports: 1) the significance of each feature with respect to the self-organizing map model (Figure 1), 2) the starburst representation of the SOM model, and 3) the marginal probability distribution of each feature.
As a kind of pre-processing method, standard normalization is not necessary for model training, but it may improve the map embedding accuracy [3] within the SOM algorithm. The R-based POPSOM package has been developed and maintained since 2013. Migrating all of the functionality of the package from R to Python has been the basic goal of this project. Besides preserving all the functionality of the R-based package, the following improvements have also been made: 1) refactoring the procedural programming paradigm to an object-oriented programming paradigm, 2) addition of normalization as an optional argument for model initialization (or instantiation), and 3) embedding Fortran as another option for model training.
Finally, the Fisher/Anderson Iris data set [10] and the Wheat Seed data set [11] from the UCI machine learning repository were used to evaluate the correctness of the Python-based package. The reasoning is that if the same input data is fed into both the R-based and Python-based packages and the Python-based package is working correctly, then both packages will return the same outcome. Since the SOM algorithm initializes the weight vectors randomly [3] at the beginning of model training and selects vectors randomly during training, even the same input data set will return a different outcome (neurons) for each training run by one package. Thus, it is not feasible to evaluate the correctness of the Python program by measuring the difference of the two sets of neurons directly. In order to achieve this goal, two statistical measurements have been proposed and applied in this project: 1) measuring the average difference of vectors between the two sets of neurons, which should be close to 0 at the end of training if the two sets of neurons are drawn from the same input data space; 2) evaluating the ratio of the variances and the difference of the features' means for both sets of neurons.

Self-Organizing Maps
Since the dawn of the data era, more and more efficient data analysis technologies have been researched, proposed and applied at a very fast pace, especially tools for statistical analysis of high-dimensional data (data with multiple features). Self-Organizing Maps (SOMs), proposed by [3], are considered effective tools for the visualization of high-dimensional data [3]. The SOM algorithm compresses the information to produce a similarity graph while preserving the topological relationships of the input data space. The convergence of the SOM has been previously discussed and guaranteed [3].
The basic SOM model construction algorithm can be interpreted as follows: 1) Create and initialize a matrix (weight vectors) randomly to hold the neurons. If the matrix can be initialized with order and roughly complies with the input density function, the map will converge quickly [3]; 2) Read the input data space. For each observation (instance), use the optimum fit approach, based on the Euclidean distance, to find the neuron which best matches this observation. Let x denote the training vector from the observation and m_i denote a single neuron in the matrix. Update that neuron to resemble the observation using the following equation:

m_i(t + 1) = m_i(t) + h_ci(t)[x(t) - m_i(t)]

m_i(t): the weight vector before the neuron is updated.
m_i(t + 1): the weight vector after the neuron is updated.
x(t): the training vector from the observation.
h_ci(t): the neighborhood function (a smoothing kernel defined over the lattice points), defined through the following equation:

h_ci(t) = α(t) if i ∈ N_c(t), and 0 otherwise

N_c(t): the neighborhood set, which decreases with time.
α(t): the learning-rate factor, which can be linear, exponential or inversely proportional. It is a monotonically decreasing function of time (t).
3) Update the immediate neighborhood of that neuron accordingly (Figure 4). As proposed by Cheng [12], after running this algorithm with a sufficient number of iterations, the map will ultimately converge. However, it is difficult for users to determine how many iterations are sufficient. A more practical measure is evaluating the map's quality, which can help users determine the optimal number of iterations.
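The three steps above can be sketched in a few lines of NumPy (a minimal teaching illustration, not the POPSOM implementation; the grid size, linear learning-rate schedule, and step-function neighborhood kernel are assumptions):

```python
import numpy as np

def train_som(data, xdim=10, ydim=10, iters=1000, alpha0=0.3, seed=0):
    """Minimal SOM training loop: random initialization, BMU search,
    and a step-function neighborhood update (a sketch, not POPSOM)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    neurons = rng.uniform(data.min(0), data.max(0), size=(xdim * ydim, d))
    # 2-D grid coordinate of each neuron, used by the neighborhood function
    coords = np.array([(i % xdim, i // xdim) for i in range(xdim * ydim)])
    radius0 = max(xdim, ydim) / 2.0
    for t in range(iters):
        x = data[rng.integers(n)]                   # random training vector x(t)
        c = np.argmin(((neurons - x) ** 2).sum(1))  # best-matching unit (Euclidean)
        frac = t / iters
        alpha = alpha0 * (1.0 - frac)               # decreasing learning rate α(t)
        radius = max(radius0 * (1.0 - frac), 1.0)   # shrinking neighborhood N_c(t)
        in_hood = np.abs(coords - coords[c]).sum(1) <= radius
        # m(t+1) = m(t) + h(t) * [x(t) - m(t)]
        neurons[in_hood] += alpha * (x - neurons[in_hood])
    return neurons

neurons = train_som(np.random.default_rng(1).random((50, 3)))
print(neurons.shape)  # (100, 3)
```

Because each update moves a neuron along a convex combination toward a training vector, the trained neurons remain within the per-feature range of the input data.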

Evaluation of the Quality of the Map
It is necessary to ensure that the model obtained from training is well converged and reliable. In other words, the quality of the SOM needs to be measured before any further operation, such as visualization, is employed. Recently, many different quality measures for SOMs have been proposed and argued over [13], [14]. However, most of them either measure only one aspect of a SOM or are computationally expensive; some have both of these drawbacks [9]. Based on map embedding accuracy and estimated topographic accuracy, Dr. Hamel proposed a population-based [15], computationally efficient statistical approach [9] to evaluate the quality of a SOM model. This approach is based on two populations (one from the training data set and the other from the neurons of the map) and evaluates the quality of a SOM by the measure (or magnitude) of the convergence index, which is a linear combination of the map embedding accuracy (convergence) and the estimated topographic accuracy.
The map embedding accuracy is derived from a theorem by Yin and Allison [12] stating that, in the limit, the neurons of a SOM will converge on the probability distribution of the training data [12].
The computational complexity of the embedding accuracy is O(d(n + m)), where n is the number of observations in the training data, m is the number of neurons, and d is the number of features in the training data. Without any exponential term, this indicates the computation is efficient in most cases (where d ≪ n and m ≪ n). Although the embedding accuracy measures the same thing as the quantization error, it confers the advantage of indicating when there is statistically no difference between the two populations (training data and neurons).
Topographic error [9] can be defined as:

te = (1/n) Σ_{i=1..n} err(x_i), where err(x_i) = 1 if bmu(x_i) and 2bmu(x_i) are not neighbors on the map, and 0 otherwise

where n is the number of observations in the training data, x_i is the ith observation in the training data, and bmu(x_i) and 2bmu(x_i) (bmu stands for the best matching unit) represent the best-matching and second best-matching units for the training vector x_i.
Accordingly, the topographic accuracy can be defined as:

ta = 1 - te

Computing the topographic accuracy is a time-consuming task, especially for a large data set. To make this computation more efficient and practical, Dr. Hamel proposed utilizing a sample s of the training data, a smaller subset of all the training data, to estimate the topographic error.
The estimated topographic accuracy [9] is then defined over the sample s:

ta' = 1 - te(s)

The values of the map embedding accuracy and the estimated topographic accuracy are numbers between 0 and 1. If the value is equal to 1, then one can interpret that the map has converged well or is fully organized. Dr. Hamel proposed the convergence index, defined as:

cix = 0.5 * ea + 0.5 * ta'

which is a linear combination of the map embedding accuracy (ea) and the estimated topographic accuracy (ta') used to evaluate the quality of a SOM model. This approach has been implemented in the R-based POPSOM package [8].
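The estimated topographic accuracy and convergence index described above can be sketched as follows (a simplified illustration: the immediate-neighbor test and the equal 0.5 weights follow the definitions above, while the embedding accuracy is assumed to be computed elsewhere and passed in):

```python
import numpy as np

def estimated_topographic_accuracy(sample, neurons, coords):
    """Fraction of sampled training vectors whose best- and second-best-
    matching units are immediate neighbors on the 2-D grid."""
    hits = 0
    for x in sample:
        d = ((neurons - x) ** 2).sum(1)
        bmu, bmu2 = np.argsort(d)[:2]   # best and second-best matching units
        if np.abs(coords[bmu] - coords[bmu2]).max() <= 1:
            hits += 1                   # the two units are grid neighbors
    return hits / len(sample)

def convergence_index(embedding_accuracy, topographic_accuracy):
    # cix = 0.5 * ea + 0.5 * ta'
    return 0.5 * embedding_accuracy + 0.5 * topographic_accuracy

coords = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])   # a tiny 2 x 2 map
ta = estimated_topographic_accuracy(np.array([[0.1, 0.1]]),
                                    coords.astype(float), coords)
print(convergence_index(0.8, ta))  # 0.9
```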

R-based POPSOM Package
The R-based POPSOM package [8] has been developed and maintained by Dr. Lutz Hamel and his former students. map.build is the entry point of the POPSOM package. The input of this function is the training data (or input space) in the form of a dataframe [16]. Each row of the training data is an unlabeled training observation (or instance), and each column presents a feature of the observation. After a round of training, this function generates an object called "map" with the following structure:

map.topo
Reports the estimated topographic accuracy.

map.projection
Generates a table with the association of the labels with map coordinates.
Table 2. Description of functions in the R-based POPSOM package.

Other Python-based SOMs packages
As of the writing of this paper, there are 113 Python-based self-organizing-map-related repositories available on GitHub. Beyond the packages listed above, the others available are either not up to date, not popular within the user community, or both. All 6 of the aforementioned most popular packages provide model construction and visualization. The "somber" package is the only one that has a function to measure the quality of the SOM model using topographic accuracy. However, that approach is very time consuming [9].
None of the packages include map embedding accuracy. It should be mentioned that the source code of all of these packages is organized in an object-oriented programming paradigm.

Migration of POPSOM Package from R to Python
Migrating the source code of the POPSOM package from R to Python is the first step in implementing the SOM in Python. The goal of migration is to preserve all the functionalities during the entire process. There are three kinds of objects that need to be taken into account: 1) naming rules, 2) mathematical and statistical functions, and 3) data manipulation functions.

Naming Rules
In R, the period (.) is allowed as part of a variable's or function's name, which is unique to the R language. In Python, periods within the names of variables or functions need to be changed to another acceptable character such as the underscore (_).
Both the left arrow sign (←) and the equals sign (=) are acceptable assignment operators in R. The left arrow sign, which is not an acceptable assignment operator in Python, is used widely within the R-based POPSOM package. Thus, each of these left arrow signs must be substituted by the equals sign in Python.
Finally, both R and Python are case-sensitive languages. Generally, the reserved words in Python [2] are lower case except "True", "False", and "None", which capitalize their first letter. All letters of the corresponding three reserved words are upper case in R [16].

R       Python
NULL    None
TRUE    True
FALSE   False
Table 4. Three reserved words in R and Python.

Mathematical and Statistical Functions
Beyond the basic arithmetic operations addition (+), subtraction (-), multiplication (×), and division (÷), other mathematical and statistical operators in R and Python are coded differently. In R, most of the mathematical and statistical operators use either built-in functions [18] or a combination of operators (which start and end with the percentage sign (%)). In Python, most of these functions are supported by a third-party library such as math library or numpy library or both.

Functions          R            Python
Integer Division   x %/% y      x // y
Ceiling            ceiling(x)   numpy.ceil(x)
Table 5. Examples of mathematical and statistical functions in R and Python.
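The Python side of these mappings can be checked directly (R equivalents shown only in comments):

```python
import math
import numpy as np

# R: 7 %/% 2 (integer division); Python uses the // operator
print(7 // 2)          # 3

# R: ceiling(2.1); Python uses math.ceil or numpy.ceil
print(math.ceil(2.1))  # 3
print(np.ceil(2.1))    # 3.0 (numpy returns a float)
```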

Data Manipulation Functions
Both R and Python provide powerful data manipulation functions for data analysis and research. R uses built-in functions to manipulate data, whereas Python is powered by third-party libraries (e.g. numpy).

Functions            R          Python
Sorting Data         sort(x)    numpy.sort(x)
Replicate Elements   rep(x,n)   numpy.linspace(x,x,n)
Table 6. Examples of data manipulation functions in R and Python.
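The replication mapping in Table 6 can be verified directly; note that numpy.full is an equivalent and arguably more direct alternative to numpy.linspace(x, x, n):

```python
import numpy as np

# R: rep(5, 3) -> c(5, 5, 5)
a = np.linspace(5, 5, 3)  # the mapping used in the migration
b = np.full(3, 5.0)       # equivalent, more direct
print(a)  # [5. 5. 5.]
```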

Rewriting Functions
Although most of the mathematical and statistical functions in the R-based POPSOM package have counterparts in Python or third-party Python libraries, there are three specific statistical functions in the R-based POPSOM package that are not available in any Python built-in or third-party library. Hence, these functions needed to be constructed in Python from scratch.

Variety of T-test
"t.test" [19], which performs a variety of t-tests, has been applied in the R-based POPSOM package to test the difference between the means of two data sets. One of the data sets comes from an input data sample, while the other one comes from the map. In Python, t-tests are implemented by utilizing the mean function (returns arithmetic mean along specific axis) from the numpy [20] library and DescrStatsW (returns descriptive statistics and tests with weights) and CompareMeans (returns a class for two sample comparison) from the statemodels [21] library. The source code of this function is presented in the appendix.

F-test
"var.test" [22] performs an F-test to test the ratio of variances of two data sets from an input data space and the map respectively in R. An F-test is implemented in Python with the help of the variance function (returns the sample variance of data) from the statistics library [23] and ppf (percent point function) from the scipy [24] library.

Kernel Smoother for Irregular 2-D Data
"smooth.2d" [25] is utilized to approximate the Nadaraya-Watson kernel smoother for irregular 2-D data. This function is implemented in Python by utilizing the ecu-lidean_distances function from the sklearn [26] library and the fft (Discrete Fourier Transform) function from the numpy [20] library.

Programming Paradigm Refactoring
The formal object-oriented programming concept was introduced in the mid-1960s [27]. Many modern programming languages are multi-paradigm languages that support the object-oriented programming paradigm, and Python is one of them. In the R-based POPSOM package, "map" is defined as a class, and all the hyper-parameters, input arguments, and neurons are member variables of the class. However, all the methods are independent functions that take "map" as one of their arguments. In order to build a purely object-oriented package (converting all the independent methods into member methods of the class), the Python-based POPSOM package was refactored from a procedural programming paradigm to an object-oriented programming paradigm (Figure 5). After refactoring, the whole package is defined as a class. All of the hyper-parameters, input arguments, and neurons are still member variables of the class, the same as in the R-based package.
Figure 5. Comparison of the source code appearance of a procedural programming paradigm and an object-oriented programming paradigm.
All independent methods become member methods of the "map" class. The Python reserved method "__init__" is used as the constructor for the instance, and the map.build name was changed to fit, as fit is the conventional function name for training a model (or instance) within the Python community.
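The refactored structure can be sketched as follows (a structural sketch only: the argument names mirror the discussion here, the training body is a stand-in, and the real POPSOM signatures are not reproduced):

```python
import numpy as np

class Map:
    """Object-oriented sketch of the refactored package: hyper-parameters
    become member variables and training becomes the member method fit()."""

    def __init__(self, xdim=10, ydim=10, train=1000, norm=False):
        # former map.build arguments become constructor arguments
        self.xdim, self.ydim = xdim, ydim
        self.train = train
        self.norm = norm           # normalization switch (see the next section)
        self.neurons = None

    def fit(self, data, labels):
        # labels are not used in training, only for labeling the map afterwards
        self.data = np.asarray(data, dtype=float)
        self.labels = labels
        if self.norm:
            self.data = (self.data - self.data.mean(0)) / self.data.std(0)
        self.neurons = self._train(self.data)
        return self

    def _train(self, data):
        # stand-in for the Kohonen update loop
        rng = np.random.default_rng(0)
        return rng.uniform(data.min(0), data.max(0),
                           (self.xdim * self.ydim, data.shape[1]))

m = Map(xdim=5, ydim=4).fit([[1.0, 2.0], [3.0, 4.0]], ["a", "b"])
print(m.neurons.shape)  # (20, 2)
```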

Normalization
Unlike PCA (Principal Component Analysis) [28], normalization is not necessary in the SOM algorithm, but it may improve numerical accuracy as proposed by [3]. A good rule of thumb is for the end user to utilize the significance function to graphically report the significance of each feature in order to decide whether or not the original input data need to be normalized before training.
The normalization method was developed and reserved in the R-based POPSOM package, but works independently. In the Python-based package, the normalization method is not only implemented, but also applied to model training by adding one more argument (option) to the model initialization function ("__init__").
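The effect of the normalization option can be illustrated with a standard z-score transform (an assumption about the exact normalization used; consult the package source for the method actually applied):

```python
import numpy as np

def normalize(data):
    """Z-score each feature: zero mean and unit variance per column."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
z = normalize(x)
print(np.allclose(z.mean(axis=0), 0), np.allclose(z.std(axis=0), 1))  # True True
```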

Embed Fortran for Training
There are three programming languages utilized for model training in the R-based POPSOM package: C, R and Fortran. As a kind of imperative programming language, Fortran is especially suited to numeric computation and scientific computing [28].
Hence, in addition to Python, one more training algorithm is also implemented in Fortran in this project.
Several solutions for embedding Fortran in Python have been proposed and discussed; two of them are highly recommended. The first is to write an extension module and then import it into Python using the import command. The suffixes of extension modules differ between Windows (pyd) and Unix-like operating systems (so). The second is calling a shared-library subroutine directly from Python using the ctypes module, which requires the code to be wrapped as a shared library. After comparison (more success stories have been reported for it), the first solution is utilized in this project. After successful installation, there are two more tasks: 1) add the MinGW-64 bin path to the system path, and 2) create a configuration file (named "distutils.cfg") to connect Python with MinGW-64. F2PY is a third-party Python library [29] which enables a Python script to call a compiled Fortran extension module. F2PY is part of the numpy library.
Since numpy is already installed, there is no need to install F2PY for this project.

Install GFortran
GFortran (or GNU Fortran) is the abbreviation for the GNU Fortran Compiler. It is used to compile a source file (.f90) to an object file (extension module).

Compile the Fortran-extension module.
The standard Python build system numpy.distutils supports compiling Fortran extensions (.f90 file to .pyd file). A small Python program (Figure 7) named "build.py" has been created to generate the extension module.
Figure 7. Source code of the "build.py" program.
Next, execute the command: python build.py build. It will generate a file named "vsom.pyd" which can be loaded directly.
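A minimal "build.py" along the lines described might look like the following (a sketch: it assumes the Fortran source is named "vsom.f90" and that a Fortran compiler such as GFortran is installed; note that numpy.distutils is deprecated in recent NumPy versions):

```python
# build.py -- compile vsom.f90 into an importable extension module
# usage: python build.py build
from numpy.distutils.core import Extension, setup

setup(
    name="vsom",
    ext_modules=[Extension(name="vsom", sources=["vsom.f90"])],
)
```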

Test the extension file
Execute the following command in the Python prompt:

>>> import vsom

If there is no error message, then the extension module has been loaded successfully.
Table 9. Speed comparison of Python versus Fortran (as the number of iterations increases exponentially).
Based on the above speed comparisons, it is obvious that Fortran is much more efficient than Python for numerical computations. Table 10 reports the processing time (in seconds) for training the Iris flower data set [10] and the Wheat Seed data set [11] until the maps fully converged, using Python and Fortran respectively.

Experiment Design
Since this project is inspired by and based on the R-based POPSOM package, the entire implementation of the Python-based package can be divided into migrating the R-based package to Python and refactoring the source code from a procedural programming paradigm to an object-oriented programming paradigm. After the Python-based package was complete, it was determined that the best way to evaluate its correctness and quality was to compare the outcome of each function with the outcome from the R-based package. Most functions in the package run a non-random algorithm; hence, it is expected that the same input will generate the same outcome, such as reporting the significance of each feature and plotting the marginal probability distributions of neurons and input data. On the other hand, some algorithms involve random factors, in particular the model-training algorithm.

Data Set Selection
As a kind of unsupervised learning algorithm, the major task of Self-Organizing Maps is clustering the input data. Thus, the labels of the observations are not necessary for the algorithm, but they help the end user interpret the map. The ideal data for this project are intuitive and easily interpreted and clustered (or categorized) by human beings (although professional knowledge may be required in some cases). The data should have at least three-dimensional measurements (two-dimensional data can be presented by a 2-D map without any learning). To evaluate the quality and capability of the Python-based package, two data sets with different magnitudes of measurements and observations were selected for the experiment.
The Iris flower data set [10] (sometimes called Fisher's or Anderson's data set), introduced by Ronald Fisher in the 1930s, has been widely used as a "toy"/test data set within the machine learning and statistics communities. There are 150 observations (or instances) categorized into three species distributed evenly within the data set.
This data set has four measurements (or attributes): the sepal length, the sepal width, the petal length and the petal width, all of which are measured on the same scale (in centimeters). The Iris data set is embedded in R (Figure 10) and can be accessed directly.
Figure 10. Accessing the Iris data set in R.
The Iris data set is also embedded in the scikit-learn Python library. Before accessing this data set in Python, the sklearn library needs to be imported at the very beginning of the source code. In order to represent it as a data frame (Figure 11), which is friendlier to the end user, the pandas library should also be imported.
Figure 11. Accessing and representing the Iris data set as a data frame in Python.
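The loading steps just described can be written as follows (the column names come from scikit-learn's feature_names attribute; appending the species column is an optional convenience):

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = [iris.target_names[t] for t in iris.target]
print(df.shape)  # (150, 5)
```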
In addition to the Iris data set, the Wheat Seed data set [11] was selected from the UCI machine learning repository to evaluate the Python-based package. This data set of grain measurements, which was obtained from the real world, contains 210 observations clustered into 3 species (Kama, Rosa and Canadian), with 70 elements each.
There are 7 measurements of the main geometric features, obtained by X-ray technique: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. All of them are scaled in either millimeters or square millimeters.
The original Wheat Seed data are stored in either a csv file or a plain text file.

Initialize the Model (instantiate the Model)
In the R-based POPSOM package, there is no independent function for initializing the model.
norm: switch to apply normalization to the input data space (default: False)
Table 11. Description of the "__init__" function's arguments.
Most of the arguments are easy to understand and set up. For the "train" argument, which indicates the number of iterations, there is no good rule of thumb. Fewer iterations will result in insufficient convergence, while more iterations will incur unnecessary computational expense. Examining the quality of the map after each training is the best way to determine the optimal number of iterations. For the Iris data set, 1000 iterations return a convergence index better than 0.9 in most instances, which is acceptable for this project.

Fit the data
The R-based package merges initialization and data fitting into one function, called map.build, while the Python-based package has an independent data fitting function: fit. The fit function has only two arguments, data and labels. These labels are different from the ones in supervised algorithms: in the SOM algorithm, labels are not involved in the training process; they are only used for labeling the grid of the map after training.
Figure 15. Example of fitting the Iris data and labels to the SOM model.

Report the Significance of Each Feature
The significance of each feature can be reported in the form of either a vector (Figure 16), with graphics set to False, or a graph (Figure 17), with graphics set to True (the default).
Figure 16. Reporting the significance of each feature by vector for the Iris data.
Figure 17. Graphically reporting the significance of each feature for the Iris data.

Report the map convergence index
The convergence index is a linear combination of the map embedding accuracy and the estimated topographic accuracy (Figures 18-21). It is a criterion for evaluating the quality of the map [4].

Report the Map Embedding Accuracy
Report the map embedding accuracy using either the ks-test or the variance and mean tests (Figures 22 and 23).

Report the Estimated Topographic Accuracy
The estimated topographic accuracy is a part of the convergence index, but it can also be reported independently in this project (Figures 24-27). As discussed in [9], evaluating a SOM's topographic accuracy by using random samples instead of all available input data is a reliable and computationally efficient statistical approach.

Neuron
This function returns the content of the neuron at the given coordinates (Figure 33).

Initialize the Model (instantiate the Model)
Since there are 210 observations in the Wheat Seed data set, a larger (15 × 10) map is used to represent the model (Figure 34).

Fit the data
The raw data of the Wheat Seed data set are stored in a text file without a header. The data was loaded from the text file and the attribute names were inserted manually (Figure 35).
Figure 35. Fitting the Wheat Seed data and labels to the model.
Figure 36. Reporting the significance of each feature by vector for the Wheat Seed data.
Figure 37. Graphically reporting the significance of each feature for the Wheat Seed data.

Projection
Figure 52. Reporting the location of each observation on the map.

Evaluating the Correctness of the Python-based Package
The R-based POPSOM package has been developed and verified, and this Python-based package was derived from it. Thus, the best way to evaluate the correctness of the Python-based package is to measure the distance between the results of the R-based and Python-based packages. Three sophisticated functions from the package were utilized to demonstrate this comparison: model training, starburst representation, and visualization of the marginal probability distributions.

Evaluating the Model Training Function
The R-based package uses map.build to train the model, while the Python-based package uses the __init__ and fit functions to complete the same task. There are two random factors within the algorithm: 1) random initialization of the neurons at the beginning and 2) random selection of vectors from the input data space during training. Due to these two random factors, it is not feasible to expect that the same training data will generate exactly the same neurons from both the R and Python functions. In order to evaluate the correctness of the Python program based on the two sets of neurons generated, three statistical approaches have been proposed and applied in this project.
1) Measuring the average difference of vectors between the two sets of neurons. As can be clearly seen, the average difference descends to 0.2 at the end of the training. This result fulfills the expectation.
2) Measuring the ratio of the variances of the two sets of neurons [9], [30]. 3) Measuring the difference of the features' means. These results show that 0 falls within each confidence interval obtained by applying the above formulas. This indicates that there is no statistical difference in the means; the means of the two sets of neurons are the same.
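The first of the three measures can be sketched as follows (an illustration; r_neurons and py_neurons are hypothetical stand-ins for the neuron matrices produced by the R-based and Python-based packages):

```python
import numpy as np

def average_vector_difference(r_neurons, py_neurons):
    """Mean absolute element-wise difference between two neuron matrices
    of the same shape (same grid, same features)."""
    return np.mean(np.abs(np.asarray(r_neurons) - np.asarray(py_neurons)))

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # stand-in for the R neurons
b = np.array([[1.1, 2.1], [2.9, 3.9]])   # stand-in for the Python neurons
print(round(average_vector_difference(a, b), 6))  # 0.1
```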
Based on the above statistical analysis, it is apparent that each feature in the two sets of neurons shares the same distribution and the same mean, and the average difference between them is close to 0. Since all the criteria are fulfilled, this evidence supports the hypothesis that the Python package works in the same way as the R package. It is also clear by visual comparison of the starburst representations that the heat map and connected components generated by R and Python are exactly the same.

Evaluating the Density Plot Function
Plotting the density of the training data overlaid with the density of the neurons for the same feature makes it easy for the user to judge the quality of the map. Producing the same density representation with both programs (R-based and Python-based) is the best way to reveal any differences between the two. The following are side-by-side displays of the density plots from both the R-based and Python-based packages for each feature of the Iris data set (Figure 59). These four groups of density plots are evidence that there is no difference between the R-based and Python-based packages for this function, which also indicates that the Python-based package works properly, as the R-based package has been validated as reliable and consistent.
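A density comparison of this kind can be sketched with a Gaussian kernel density estimate. The snippet below is illustrative only, using synthetic stand-ins for one training feature and the corresponding neuron weights rather than the package's own plotting code; the distribution parameters are assumptions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # headless backend for scripted use
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Stand-ins: one feature of the training data and the corresponding
# column of the trained neurons.
data_feature   = rng.normal(5.8, 0.8, size=150)   # e.g. a sepal-length-like feature
neuron_feature = rng.normal(5.8, 0.6, size=225)   # e.g. a 15x15 map's weights

xs = np.linspace(data_feature.min() - 1, data_feature.max() + 1, 200)
data_density   = gaussian_kde(data_feature)(xs)
neuron_density = gaussian_kde(neuron_feature)(xs)

# Overlay the two densities; a close match between the curves suggests
# the map models the feature's distribution well.
fig, ax = plt.subplots()
ax.plot(xs, data_density, label="training data")
ax.plot(xs, neuron_density, label="neurons")
ax.set_xlabel("feature value")
ax.set_ylabel("density")
ax.legend()
fig.savefig("density_comparison.png")
```

Running the same overlay from both the R and Python outputs and comparing the curves visually is the check described above.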

Submit the Python-based Package to Public Repository
With the purpose of benefiting Python users with the SOM algorithm, the R-based POPSOM package has been distributed as free software, so that everyone can use, modify, and redistribute it under the terms of the GNU General Public License. Submitting the Python-based package to the PyPI community will allow Python users to utilize these findings in their research.
In addition, the Python-based package can be published on GitHub as open-source software to increase its exposure, just as the R-based POPSOM package has been made available there for R users.

Using Animation to Simulate the Formation of the Model
Currently, what is obtained from the POPSOM package is a graphical report (the starburst representation) of a SOM model. Although the algorithm is not difficult to understand, end users without a basic quantitative background may still view it as a black box and wonder what happens during training. Presenting the process of the heat map's formation may give the user insight into the internal mechanisms and make the box more transparent. This, in turn, gives the end user evidence that the algorithm is not a magic trick, but rather a reliable, predictable and replicable process.
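One way such an animation could be sketched is shown below. This is not part of the POPSOM package: it trains a toy SOM on random data, takes a u-matrix-like heat snapshot every few hundred steps, and replays the snapshots with matplotlib's FuncAnimation. All dimensions, learning parameters, and function names are assumptions for the illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

rng = np.random.default_rng(2)
xdim, ydim, nfeat = 10, 10, 3
data = rng.random((500, nfeat))
neurons = rng.random((xdim * ydim, nfeat))

# Grid coordinates, used for both the neighborhood function and snapshots.
gx, gy = np.meshgrid(np.arange(xdim), np.arange(ydim), indexing="ij")
grid = np.column_stack([gx.ravel(), gy.ravel()])

def heat_snapshot(neurons):
    """Mean distance from each neuron to its 4-connected grid neighbors
    (a crude u-matrix stand-in used only for the animation frames)."""
    heat = np.zeros(xdim * ydim)
    for i in range(xdim * ydim):
        neigh = np.where(np.abs(grid - grid[i]).sum(axis=1) == 1)[0]
        heat[i] = np.linalg.norm(neurons[neigh] - neurons[i], axis=1).mean()
    return heat.reshape(xdim, ydim)

snapshots = []
steps, eta, sigma = 3000, 0.1, 2.0
for t in range(steps):
    x = data[rng.integers(len(data))]
    bmu = np.argmin(((neurons - x) ** 2).sum(axis=1))   # best matching unit
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)          # squared grid distance
    h = np.exp(-d2 / (2 * sigma ** 2))                  # Gaussian neighborhood
    neurons += eta * h[:, None] * (x - neurons)
    if t % 300 == 0:
        snapshots.append(heat_snapshot(neurons))

fig, ax = plt.subplots()
im = ax.imshow(snapshots[0], cmap="viridis")

def update(frame):
    im.set_data(snapshots[frame])
    ax.set_title(f"training step {frame * 300}")
    return (im,)

anim = FuncAnimation(fig, update, frames=len(snapshots), interval=300)
# anim.save("som_training.gif", writer="pillow")  # uncomment to write a GIF
```

Replaying the frames shows the heat map gradually organizing from noise into distinct low-distance regions, which is exactly the kind of transparency described above.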
    """
    parameters:
    - data - a dataframe where each row contains an unlabeled training instance
    - labels - a vector or dataframe with one label for each observation in data
    """

    def convergence(self, conf_int, k, verb, ks):
        """
        convergence -- the convergence index of a map
        parameters:
        - conf_int - the confidence interval of the quality assessment (default 95%)
        - k - the number of samples used for the estimated topographic accuracy computation
        - verb - if true, reports the two convergence components separately;
          otherwise reports the linear combination of the two
        - ks - a switch, true for ks-test, false for standard var and means test
        return:
        - the convergence index
        """
        if ks:
            embed = self.embed_ks(conf_int, verb=False)
        else:
            embed = self.embed_vm(conf_int, verb=False)
        topo_ = self.topo(k, conf_int, verb=False, interval=False)
        if verb:
            return {"embed": embed, "topo": topo_}
        else:
            return 0.5 * embed + 0.5 * topo_

    def starburst(self, explicit=False, smoothing=2, merge_clusters=True, merge_range=.25):
        """
        starburst -- compute and display the starburst representation of clusters
        parameters:
        - explicit - controls the shape of the connected components
        - smoothing - controls the smoothing level of the umat (NULL, 0, >0)
        - merge_clusters - a switch that controls if the starburst clusters are merged together
        - merge_range - a range that is used as a percentage of a certain distance in the code
          to determine whether components are closer to their centroids or centroids are
          closer to each other
        """
        umat = self.compute_umat(smoothing=smoothing)
        self.plot_heat(umat, explicit=explicit, comp=True,
                       merge=merge_clusters, merge_range=merge_range)

    def compute_umat(self, smoothing=None):
        """
        compute_umat -- compute the unified distance matrix
        parameters:
        - smoothing - either NULL, 0, or a positive floating point value controlling
          the smoothing of the umat representation
        return:
        - a matrix with the same x-y dims as the original map containing the umat values
        """
        d = euclidean_distances(self.neurons, self.neurons)
        umat = self.compute_heat(d, smoothing)
        return umat

    def compute_heat(self, d, smoothing=None):
        """
        compute_heat -- compute a heat value map representation of the given distance matrix
        parameters:
        - d - a distance matrix computed via the 'dist' function
        - smoothing - either NULL, 0, or a positive floating point value controlling
          the smoothing of the umat representation
        return:
        - a matrix with the same x-y dims as the original map containing the heat
        """

    def plot_heat(self, heat, explicit, comp, merge, merge_range):
        """
        plot_heat -- plot a heat map based on a 'map'; this plot also contains the
        connected components of the map based on the landscape of the heat map
        parameters:
        - heat - a 2D heat map of the map returned by 'map'
        - explicit - controls the shape of the connected components
        - comp - controls whether we plot the connected components on the heat map
        - merge - controls whether we merge the starbursts together
        - merge_range - a range that is used as a percentage of a certain distance in
          the code to determine whether components are closer to their centroids or
          centroids are closer to each other
        """

        # return a list of unique centroid positions
        return {"position_x": xlist, "position_y": ylist}

    def distance_from_centroids(self, centroids, unique_centroids, heat):
        """
        distance_from_centroids -- a function to get the average distance from
        centroid by cluster
        parameters:
        - centroids - a matrix of the centroid locations in the map
        - unique_centroids - a list of unique centroid locations
        - heat - a unified distance matrix
        """

    def cluster_spread(self, x, y, umat, centroids):
        """
        cluster_spread -- function to calculate the average distance in one cluster
        given one centroid
        parameters:
        - x - x position of a unique centroid
        - y - y position of a unique centroid
        - umat - a unified distance matrix
        - centroids - a matrix of the centroid locations in the map
        """
        centroid_x = x
        centroid_y = y
        sum = 0
        elements = 0
        xdim = self.xdim
        ydim = self.ydim
        centroid_weight = umat[centroid_x, centroid_y]
        for xi in range(xdim

    def list_clusters(self, centroids, unique_centroids, umat):
        """
        list_clusters -- a function to get the clusters as a list of lists
        """
        # get the clusters associated with a unique centroid and store it in a list
        cluster_list.append(self.list_from_centroid(cx, cy, centroids, umat))
        return cluster_list

    def list_from_centroid(self, x, y, centroids, umat):
        """
        list_from_centroid -- a function to get all cluster elements associated
        with one centroid
        parameters:
        - x - the x position of a centroid
        - y - the y position of a centroid
        - centroids - a matrix of the centroid locations in the map
        - umat - a unified distance matrix
        """

    def combine_decision(self, within_cluster_dist, distance_between_clusters, range):
        """
        combine_decision -- a function that produces a boolean matrix representing
        which clusters should be combined
        parameters:
        - within_cluster_dist - a list of the distances from centroid to cluster
          elements for all centroids
        - distance_between_clusters - a list of the average pairwise distance
          between clusters
        - range - the distance where the clusters are merged together
        """

        - interval - a switch that controls whether the confidence interval is computed
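The idea behind compute_umat in the listing can be illustrated in a standalone form. The sketch below is a simplified version, not the package's exact implementation: the heat value of each neuron is taken as the mean distance to its immediate (4-connected) grid neighbors, with no smoothing; the function name and grid layout are assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def simple_umat(neurons, xdim, ydim):
    """Simplified unified distance matrix: each cell holds the mean
    distance from that neuron to its 4-connected grid neighbors."""
    d = euclidean_distances(neurons, neurons)   # all pairwise neuron distances
    umat = np.zeros((xdim, ydim))
    for x in range(xdim):
        for y in range(ydim):
            i = x * ydim + y                    # row-major neuron index
            neigh = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
            idx = [nx * ydim + ny for nx, ny in neigh
                   if 0 <= nx < xdim and 0 <= ny < ydim]
            umat[x, y] = d[i, idx].mean()
    return umat

rng = np.random.default_rng(3)
umat = simple_umat(rng.random((25, 4)), 5, 5)   # a 5x5 map with 4 features
```

High values in the resulting matrix mark cluster boundaries, which is what the starburst representation's heat map visualizes.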