Computer Science and Statistics
self-organizing maps; artificial intelligence; data mining
A self-organizing map (SOM) is a type of artificial neural network with applications in a wide variety of fields and disciplines. The SOM algorithm uses unsupervised learning to produce a low-dimensional representation of high-dimensional data. It does so by 'fitting' a grid of nodes to a data set over a fixed number of iterations. With each iteration, the nodes of the map are adjusted so that they more closely resemble the data points. The low dimensionality of the resulting map means that it can be presented graphically and interpreted more intuitively by humans. However, it is still essential to evaluate the 'quality' of a map to ensure that the model is indeed representative of the underlying data.
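The fitting procedure described above can be sketched as follows. This is a minimal illustration of a generic online SOM update, not the implementation used in this work: the function name `train_som`, the linearly decaying learning rate, the Gaussian neighbourhood, and all default parameter values are assumptions chosen for brevity.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def train_som(data, rows, cols, iterations, lr0=0.5, seed=0):
    """Fit a rows x cols grid of nodes to `data` (a list of equal-length tuples).

    Returns a dict mapping grid coordinates (r, c) to a weight vector.
    The decay schedules below are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise every node at a randomly chosen data point.
    nodes = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = t / iterations
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        x = rng.choice(data)                        # one random training sample
        # Best-matching unit: the node whose weights are closest to the sample.
        bmu = min(nodes, key=lambda k: distance(nodes[k], x))
        for (r, c), w in nodes.items():
            grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # Gaussian weighting
            # Pull the node towards the sample, scaled by lr and h.
            for i in range(dim):
                w[i] += lr * h * (x[i] - w[i])
    return nodes
```

Note that `iterations` must be chosen before training starts, which is the limitation discussed in the rest of this abstract.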
There are several quality measures used with self-organizing maps. The traditional method is to compute the quantization error of the map by summing the distances between each data point and its closest node, with smaller values representing a better fit. However, it has been shown that the quantization error approaches zero as the number of training iterations is increased, so minimizing it can result in overfitting. For this reason, we choose to use a population-based convergence criterion for map evaluation. This method treats the nodes of the SOM and the points in the data set as two populations and uses a statistical test to determine whether they appear to be drawn from the same probability distribution. If they do, we say the model has converged and has a 'good fit.' If not, the map requires more training. With both evaluation methods, and in SOM construction in general, the quality of the model is directly related to the predetermined number of training iterations.
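As an illustration of the traditional measure, the sketch below computes a quantization error for a set of node weight vectors. It assumes the common formulation in which each data point is matched to its nearest node and the distances are averaged rather than summed (the two differ only by a factor of the number of data points). The function name `quantization_error` is hypothetical.

```python
import math


def quantization_error(data, nodes):
    """Mean Euclidean distance from each data point to its nearest node.

    `data` and `nodes` are sequences of equal-length numeric vectors.
    Averaging (instead of summing) is an assumption made for this sketch.
    """
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    total = 0.0
    for x in data:
        # Distance to the best-matching (closest) node for this point.
        total += min(distance(x, w) for w in nodes)
    return total / len(data)
```

A map with one node placed exactly on every data point would score zero on this measure, which is why a low quantization error alone does not show that the map generalizes.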
We propose that the efficiency of the algorithm would be improved by eliminating the current guess-and-check practice of arbitrarily picking the number of iterations used to train a map. This is done by combining the processes of construction and evaluation. That is, the amount of training necessary for a good fit can be determined more accurately by periodically evaluating the quality of the model. We automate this procedure using both fixed- and variable-length training windows. The convergence of the map is calculated after each window, and construction is halted when the change between windows has stagnated or the target value has been reached. This removes the need to know in advance how much training is required for an accurate model. We attempt to validate this hypothesis by analyzing SOMs built on both synthetic and real-world data sets using the traditional algorithm and our convergence window algorithm.
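The control flow of the windowed procedure can be sketched as below. This is a hedged outline of the fixed-length window case only, not the authors' implementation: `train_window` and `evaluate` are hypothetical callables standing in for "train the map for one window" and "compute the convergence measure", and the thresholds `target`, `min_change`, and the safety cap `max_windows` are assumed values.

```python
def train_until_converged(model, train_window, evaluate,
                          target=0.95, min_change=1e-3, max_windows=100):
    """Train in windows, stopping on target, stagnation, or the window cap.

    train_window(model) -> model   trains for one window (hypothetical hook)
    evaluate(model)     -> float   convergence score, higher is better
    Returns (model, score_history, stop_reason).
    """
    history = []
    for _ in range(max_windows):
        model = train_window(model)
        score = evaluate(model)
        previous = history[-1] if history else None
        history.append(score)
        if score >= target:
            return model, history, "target"
        # Stagnation: the score barely moved since the previous window.
        if previous is not None and abs(score - previous) < min_change:
            return model, history, "stagnated"
    return model, history, "max_windows"


# Toy usage: the "model" is just an error value that halves every window.
final, scores, reason = train_until_converged(
    1.0, train_window=lambda e: e * 0.5, evaluate=lambda e: 1.0 - e)
```

In the toy run the scores are 0.5, 0.75, 0.875, 0.9375, and 0.96875, so training stops after five windows because the target is reached. A variable-length variant could change the window size between calls to `train_window`; that detail is not sketched here.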