An Implementation of Deep Belief Networks Using Restricted Boltzmann Machines in Clojure

In a work that ultimately heralded a resurgence of deep learning as a viable and successful machine learning approach, Dr. Geoffrey Hinton described a fast learning algorithm for Deep Belief Networks [1]. This study explores that result and the underlying models and assumptions that power it. The result of the study is the completion of a Clojure library (deebn) implementing Deep Belief Networks, Deep Neural Networks, and Restricted Boltzmann Machines. deebn is capable of generating a predictive or classification model based on varying input parameters and datasets, and is available to a wide audience of Clojure users via Clojars, the community repository for Clojure libraries. These capabilities were not present in a native Clojure library at the outset of this study. deebn performs quite well on the reference MNIST dataset with no dataset modification or hyperparameter tuning, achieving a best error rate of 2.00% in early tests. https://clojars.org/deebn


Introduction
Machine learning is an extensive field of research; this study explores one corner of it, focusing on research conducted by Dr. Geoffrey Hinton.
His most recent work with Deep Belief Networks, and the work by other luminaries like Yoshua Bengio, Yann LeCun, and Andrew Ng have helped to usher in a new era of renewed interest in deep networks.
In light of the initial Deep Belief Network introduced in Hinton, Osindero, and Teh [1] and Hinton and Salakhutdinov [9], pioneering work has used these models to produce winning entries for various machine learning competitions, problems, and datasets [13], [14].

Deep Learning Models
Given the theory covered in the previous chapter, we now outline the baseline model and algorithm described in Hinton and Salakhutdinov [9] and Hinton, Osindero, and Teh [1], along with improvements described in Hinton [10] for RBMs and in Nielsen [15] for neural networks.

Probabilistic Graphical Models
Probabilistic graphical models provide a concise and information-dense method to communicate different properties of a joint probability distribution, as well as efficient methods for inference and sampling. The layout of a graphical model efficiently and explicitly codifies independence between random variables and allows for powerful inference in large joint distributions.

Conditional Independence
Two random variables are conditionally independent given a third if and only if their conditional joint distribution is equal to the product of their marginal conditional distributions. In other terms, for random variables X, Y, Z:

P(X, Y | Z) = P(X | Z) P(Y | Z)

This facilitates factorization over large joint distributions, and is essential to making operations on these large distributions tractable.
The graphical model codifies random variables as nodes in the graph, and conditional independence as a lack of edge between two nodes.
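As a quick numeric illustration (a Python sketch, not part of deebn; the distributions here are invented), a joint distribution constructed as p(z) p(x|z) p(y|z) satisfies the factorization above for every assignment:

```python
import itertools

# Invented conditional tables over binary variables; indexed as table[z][value].
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

def p_joint(x, y, z):
    # joint built so that X and Y are conditionally independent given Z
    return p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]

# Check P(X, Y | Z) = P(X | Z) P(Y | Z) for every assignment.
for x, y, z in itertools.product([0, 1], repeat=3):
    p_xy_given_z = p_joint(x, y, z) / p_z[z]
    factored = p_x_given_z[z][x] * p_y_given_z[z][y]
    assert abs(p_xy_given_z - factored) < 1e-12
```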

Inference
One of the primary uses of probabilistic graphical models (and the primary use in this study) is probabilistic inference. In the general case, there are nodes whose state is known, called "visible" nodes, and there are nodes whose state is sought, known as "hidden" nodes. Given a joint distribution where the random variables are split into the visible set x_v and the hidden set x_h, the state of the hidden variables is inferred from the visible variables by conditioning [17]:

p(x_h | x_v) = p(x_v, x_h) / p(x_v)

Markov Random Fields
Markov Random Fields (MRF), or undirected graphical models, form a model of conditional independence that may be more appropriate for modeling problems that inhabit a spatial or relational domain.
To model conditional independence in an MRF, the concept of graph separation is used: for sets of nodes A, B, and C, B separates A and C iff there is no longer a path from A to C after B has been removed from the graph. Graph separation maps directly to conditional independence: if B separates A and C, then A is conditionally independent of C given B. This is known as the global Markov property. Two examples of nodes that separate two joint distributions in the graph are shown in Figure 2.1.
A Markov blanket is the smallest set of nodes that renders a node conditionally independent of the rest of the graph. In an MRF, the Markov blanket for a node is its set of neighbors, and this is known as the local Markov property. As an extension of the local Markov property, it's simple to discern that any two nodes that are not connected in the graph structure are conditionally independent given the rest of the graph. This is known as the pairwise Markov property. In Figure 2.1, either of the nodes c or e separates any of the nodes a, b, d from f, g.

Hammersley-Clifford Theorem
The Hammersley-Clifford theorem [17] provides a general factorization of a joint distribution modeled with an MRF. It defines non-negative potential functions, or factors, ψ_c for each maximal clique c in the set of all maximal cliques C. The joint distribution can then be written:

p(x) = (1/Z) Π_{c∈C} ψ_c(x_c)

where Z is the partition function, given by Z = Σ_x Π_{c∈C} ψ_c(x_c), which acts as a normalization constant to ensure the resulting distribution sums to 1.
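The factorization can be made concrete with a toy example (an illustrative Python sketch; the chain MRF and potential values are invented, not taken from the thesis):

```python
import itertools

# Hypothetical 3-node chain MRF a - b - c with maximal cliques {a, b} and {b, c}.
# Potentials are arbitrary positive numbers, as the theorem requires.
psi_ab = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}
psi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.5, (1, 1): 0.5}

def unnormalized(a, b, c):
    # product of clique potentials
    return psi_ab[(a, b)] * psi_bc[(b, c)]

# Partition function: sum of the unnormalized product over all states.
Z = sum(unnormalized(a, b, c)
        for a, b, c in itertools.product([0, 1], repeat=3))

def p(a, b, c):
    return unnormalized(a, b, c) / Z

# Normalization check: the distribution sums to 1.
total = sum(p(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-12
```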

Energy-Based Models
Energy-based models differ from the models reviewed so far in that they associate each configuration of the variables of interest with a scalar energy level.
Borrowing from statistical physics, there are many ways to define a potential function; the only limitation is that potential functions must be non-negative. One particular distribution from statistical physics, the Gibbs distribution, is used to build a training algorithm for Boltzmann machines.
The Gibbs distribution takes the form [18]:

p(x) = (1/Z) e^(−E(x))

where E(x) = −Σ_{c∈C} ln ψ_c(x_c) is referred to as the energy function, and ψ_c represents the potential function for a particular maximal clique c. Substituting this energy into the exponent recovers exactly the Hammersley-Clifford factorization p(x) = (1/Z) Π_{c∈C} ψ_c(x_c).

Learning a Markov Random Field
It's typically not possible to find the optimal parameters for an MRF directly using something like maximum likelihood estimation (due to the partition function), so approximation methods like gradient ascent/descent are used instead. An equivalent method to maximizing the likelihood of the model distribution is minimizing the distance between the data distribution and the model distribution in terms of the Kullback-Leibler (KL) divergence [8]. The KL divergence equivalency will become important during the discussion of learning rules for Restricted Boltzmann Machines.
However, it's still intractable to sample over a Boltzmann machine that allows arbitrary connections between any nodes in the graph, including between nodes that are both hidden or both visible. A key advance to making this a tractable problem was to restrict the connectivity between nodes in the same set of visible or hidden nodes. This leads to Restricted Boltzmann Machines, discussed in section 3.1.

Artificial Neural Networks
The field of artificial neural network research grew from pursuing an understanding of the brain and its learning process. This started with approximations of a single neuron, and progressed to the deep neural networks that are winning modern machine learning competitions.

(Figure: a network with a single input neuron, a single hidden layer with 10 hidden nodes and a single bias node, and a single output node with its own bias node. This trivial network has learned to produce the square root of an input number.)
There are a few key building blocks to reach this point: the nodes themselves, how they relate to neurons in the human brain (and why they lend their name to artificial neural networks), the structure of the network graph, and a method to train the network. McCulloch and Pitts were the first to model the human neuron in a form that modern-day practitioners will recognize [23]. They postulated that the output of a neuron could be computed as a function of its inputs, multiplied by weights, and subjected to some form of a threshold function:

y = I(Σ_i w_i x_i > θ)

where θ is some threshold, and I is the indicator function. This gives a model with a binary output, which could be used as a simple binary classifier. Frank Rosenblatt's perceptron algorithm [24] was an initial attempt at training the model that McCulloch and Pitts postulated. The result was a type of linear classifier that could model simple relationships, but a problem like modeling an exclusive OR gate was beyond the reach of a perceptron [25].
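The McCulloch-Pitts unit can be sketched in a few lines (an illustrative Python sketch, not deebn code; the weights and threshold are invented to realize an AND gate, one of the linearly separable functions such a unit can represent):

```python
def mcculloch_pitts(inputs, weights, theta):
    # Output 1 iff the weighted sum of inputs exceeds the threshold theta;
    # this is the indicator function applied to sum_i(w_i * x_i) > theta.
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > theta else 0

# With unit weights and theta = 1.5, the unit computes a logical AND:
# only when both inputs are active does the sum (2.0) clear the threshold.
assert mcculloch_pitts([1, 1], [1.0, 1.0], 1.5) == 1
assert mcculloch_pitts([1, 0], [1.0, 1.0], 1.5) == 0
```

No choice of weights and threshold makes this single unit compute XOR, which is exactly the limitation Minsky and Papert identified for the perceptron.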

Activation Functions
The original models proposed in [23] and [24] used a step activation function, with the resulting binary output. However, in order to model more complex functions, a non-linear activation function needs to be used [15]. Two functions are commonly used as activation functions: the logistic activation function and the hyperbolic tangent activation function. Both have two important qualities needed for the backpropagation algorithm described in subsection 2.4.4: they are sigmoid functions (an "S" shape, clamped between some maximum and minimum value), and they are continuously differentiable for all real values.

Figure 2.3: A single perceptron
In this model, there are inputs (nominally from some form of input vector, or sampled from a data distribution in probability terms), weights to influence the effect that any particular input has on the output of the model, a bias, and the threshold function to determine the model's output.

Logistic Activation Function
The logistic function has the form σ(x) = 1 / (1 + e^(−x)) and produces a clean sigmoidal output between 0 and 1, as seen in Figure 2.4. Its first derivative is σ′(x) = σ(x)(1 − σ(x)), which is conveniently expressed in terms of the function's own output.

Hyperbolic Tangent Activation Function
The hyperbolic tangent function is defined as tanh(x) = (1 − e^(−2x)) / (1 + e^(−2x)), and produces a sigmoidal output between −1 and 1, as seen in the corresponding figure.

To cast the learning process as an optimization problem, it's necessary to use some sort of cost or loss function to easily determine how well the model is performing for a given data vector [15].
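Both sigmoid activation functions, and the derivative property that backpropagation relies on, can be checked numerically (a Python sketch, not deebn code; the test point x = 0.3 is arbitrary):

```python
import math

def sigma(x):
    # logistic activation function
    return 1.0 / (1.0 + math.exp(-x))

def dsigma(x):
    # first derivative: sigma(x) * (1 - sigma(x))
    return sigma(x) * (1.0 - sigma(x))

x = 0.3

# Finite-difference check of the derivative formula.
h = 1e-6
numeric = (sigma(x + h) - sigma(x - h)) / (2 * h)
assert abs(numeric - dsigma(x)) < 1e-8

# tanh is a rescaled, recentered logistic: tanh(x) = 2*sigma(2x) - 1,
# which is why both share the same sigmoidal shape.
assert abs(math.tanh(x) - (2 * sigma(2 * x) - 1)) < 1e-12
```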

Quadratic Error
The quadratic or mean squared error cost function is likely the simplest way to measure error:

C = (1/2n) Σ_x (y(x) − ŷ(x))²

where n is the number of observations in the batch, y is the expected output, and ŷ is the model's prediction. This provides a measure of error that is always positive, and increases as the model's prediction is "more wrong": the greater the difference between the prediction and the expected value, the greater the cost.
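A direct transcription of the cost (an illustrative Python sketch for scalar outputs; the 1/(2n) convention follows Nielsen [15], whose factor of one half simplifies the derivative):

```python
def quadratic_cost(y, y_hat):
    # Mean squared error over a batch of n scalar observations,
    # with Nielsen's conventional factor of 1/2.
    n = len(y)
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / (2 * n)

# The cost is zero for a perfect prediction and grows with the error.
assert quadratic_cost([1.0], [1.0]) == 0.0
assert quadratic_cost([1.0], [0.9]) < quadratic_cost([1.0], [0.5])
```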

Backpropagation
With a non-linear activation function and a cost function, the remaining piece to build a learning model is an algorithm to modify the free parameters of the model.
Rosenblatt's perceptron was limited to classifying datasets that were linearly separable. In order to get over this hurdle, multiple layers of perceptrons were proposed, and it was later found that with minor assumptions about the activation function, a multi-layer perceptron with at least one hidden layer was a universal approximator [26], [27].
Multi-layer perceptrons were powerful models, but until the backpropagation algorithm gained attention in [6], there was no feasible method to train them. The backpropagation algorithm provided a novel method to propagate the errors present in the output neurons back to the preceding layers, as well as a method to adjust the weights of the network towards a more accurate output. In 1958, Selfridge defined a deep network with multiple layers of feature detectors, called a "Pandemonium" [28]. The Pandemonium had tightly defined layers of feature detectors, which allowed the model to extract progressively more complicated patterns out of the data. Selfridge described a very specific model, but a learning algorithm for a generalized version of these multiple layers of feature detectors was sought for decades after its introduction. Hinton [12] outlines five methods that could be used to learn multilayer networks, including backpropagation and using a generative model, both of which are used in this study.
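As a minimal end-to-end illustration (a numpy sketch with invented layer sizes and learning rate, not the deebn implementation), a one-hidden-layer network trained by backpropagation with the quadratic cost can attack XOR, the very function that defeated the single perceptron:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)  # input -> hidden (4 units)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)  # hidden -> output
lr = 1.0

first_loss = None
for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    loss = float(((out - y) ** 2).mean())
    if first_loss is None:
        first_loss = loss
    # backward pass: propagate the output error toward the earlier layer,
    # using the derivative sigma * (1 - sigma) at each unit
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

assert loss < first_loss  # training reduced the quadratic error
```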

Restricted Boltzmann Machines
Restricting the connections between nodes in a Boltzmann machine to only those between a hidden and a visible node gives rise to the Restricted Boltzmann machine (RBM).

Training a Restricted Boltzmann Machine
Training an RBM follows the same principles regardless of its intended use.
The energy of a particular state of stochastic binary visible units (indexed by i) and hidden units (indexed by j) is:

E(v, h; θ) = −Σ_i a_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i w_ij h_j

where θ = (a, b, w) denotes the model parameters: a and b are the visible and hidden unit biases, respectively, and w is the weight matrix connecting the two layers.
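The energy function translates directly to code (an illustrative numpy sketch with invented dimensions; deebn itself expresses these operations through core.matrix in Clojure):

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    # E(v, h) = -a.v - b.h - v.W.h for binary visible v and hidden h
    return -np.dot(a, v) - np.dot(b, h) - np.dot(v, W @ h)

rng = np.random.default_rng(0)
n_vis, n_hid = 4, 3
a = rng.normal(size=n_vis)            # visible biases
b = rng.normal(size=n_hid)            # hidden biases
W = rng.normal(size=(n_vis, n_hid))   # weights between the two layers
v = rng.integers(0, 2, size=n_vis)    # a random binary visible state
h = rng.integers(0, 2, size=n_hid)    # a random binary hidden state
e = rbm_energy(v, h, a, b, W)
assert np.isfinite(e)
```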
The connection restriction inherent in an RBM greatly simplifies the Gibbs sampling used in both learning and model generation. Since the conditional distribution of each layer factorizes completely given the other layer, Gibbs sampling for an entire hidden or visible layer can be done in parallel.
To find the gradient of the log-likelihood for training, we can first look at the derivative of the log-likelihood of a single training sample v with respect to the weight w_ij. Averaged over the training set, we find the often-seen rule:

∂ log p(v) / ∂w_ij = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model

Unfortunately, computing the model expectation exactly is intractable, so samples are obtained using Gibbs sampling instead. Each sample takes less time, but the approach is still slow because the Markov chain must be run until it reaches its stationary distribution. A breakthrough in speeding up this learning process was outlined by Hinton and termed "contrastive divergence".

Contrastive Divergence
Calculating the log-likelihood gradient of a Restricted Boltzmann Machine exactly is typically not feasible due to the partition function, so approximations are used instead. Hinton introduced contrastive divergence (CD) [8] as a method to get approximate samples without a large number of Gibbs sampling steps.
Hinton found that maximizing the log-likelihood over the data distribution is equivalent to minimizing the Kullback-Leibler divergence between the data distribution and the equilibrium distribution of the model after Gibbs sampling.
The general idea behind CD is that even just a few steps of the Markov chain will provide a direction for the gradient in the state space for the Markov chain, and provide the training algorithm with the appropriate correction to the gradient. Running the chain for an infinite number of steps would provide us with the exact correction for the model parameters, but this is obviously intractable as well.
Typically CD is run for one full step of Gibbs sampling: the visible units are initialized with a sample from the training data (v^0), h^0 is sampled from p(h | v^0), and a reconstruction v^1 is then sampled from p(v | h^0). The gradient of the log-likelihood for v^0 is then approximated by [18]:

∂ log p(v^0) / ∂w_ij ≈ ⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^1

Even with this approximation and the variance that it introduces into the learning process, empirical results show that this is an effective and efficient learning algorithm.
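One CD-1 update can be sketched as follows (an illustrative numpy sketch, not the deebn implementation; it uses hidden probabilities rather than sampled states for the final statistics, a common variance-reduction practice, and all sizes and the learning rate are invented):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr, rng):
    # positive phase: sample h0 ~ p(h | v0)
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one Gibbs step back to the visible layer and up again
    pv1 = sigmoid(a + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # CD-1 gradient estimate: <v h>^0 - <v h>^1
    dW = np.outer(v0, ph0) - np.outer(v1, ph1)
    return W + lr * dW, a + lr * (v0 - v1), b + lr * (ph0 - ph1)

rng = np.random.default_rng(1)
W = rng.normal(scale=0.01, size=(6, 4))  # 6 visible, 4 hidden units
a, b = np.zeros(6), np.zeros(4)
v0 = rng.integers(0, 2, size=6).astype(float)
W, a, b = cd1_update(v0, W, a, b, 0.1, rng)
assert W.shape == (6, 4)
```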

Deep Belief Networks
A Deep Belief Network (DBN) stacks multiple layers of stochastic hidden units on top of a visible layer. Inference in a densely connected belief net is complicated by the phenomenon of "explaining away": the prior probability of two independent causes occurring together is incredibly small, so one cause occurring tends to "explain away" the evidence for the other cause node.
As a result of explaining away, calculating the posterior in any densely connected network is intractable, and except for a few special cases, approximation is the only alternative. Gibbs sampling can be used to get an approximate posterior sample, but this is a lengthy process. To work around these limitations, "complementary" priors are introduced [1]. Complementary priors are added as an extra hidden layer with correlations opposite those of the next layer.
The result is a posterior that factorizes exactly, and gives rise to a tractable method for calculating the posterior.
Given complementary priors, the template for building a greedy layer-by-layer training algorithm starts to take shape. One can construct an infinite logistic belief net with tied weights, as outlined in Figure 3.3, to generate complementary priors for each hidden layer.
With this model in mind, it's possible to compute the derivative of the log probability of the data, because the complementary prior at each layer ensures that the posterior distribution really is factorial.
Since we can sample from the true posterior, we can compute the derivatives of the log probability of the data. Let us start by computing the derivative for a generative weight, w^00_ij, from a unit j in layer H^0 to unit i in layer V^0 (see Figure 3.3). In a logistic belief net, the maximum likelihood learning rule for a single data vector, v^0, is:

∂ log p(v^0) / ∂w^00_ij = ⟨h^0_j (v^0_i − v̂^0_i)⟩

where ⟨·⟩ denotes an average over the sampled states and v̂^0_i is the probability that unit i would be turned on if the visible vector were stochastically reconstructed from the sampled hidden states. Summing across all the layers in the infinite net, the vast majority of the terms cancel out, and we're left with the difference between the state of the visible units clamped to a particular data vector and the resting state of the Markov chain after repeated Gibbs sampling:

∂ log p(v^0) / ∂w_ij = ⟨v^0_i h^0_j⟩ − ⟨v^∞_i h^∞_j⟩

Unfortunately, this guarantee does not hold with the approximate learning method of contrastive divergence. Empirically, contrastive divergence still works quite well, and is used in this study.

Deep Neural Networks
The benefits of pre-training a deep neural network (DNN) have been covered extensively [33], [34], and the models we've discussed so far can be modified slightly to build a traditional feedforward neural network.
Given a DBN that has been trained on a particular dataset, we can add an additional linear or logistic regression layer to the top of the model, train it based on the output of the dataset propagated to the top layer of the DBN, and then use traditional backpropagation designed for neural networks, including regularizing the parameter weights. In testing, this method gave the best results on the test dataset.

Training a Better Restricted Boltzmann Machine
Hinton released practical guidance for training RBMs [10], and several of those recommendations are used in deebn. Among them, using momentum when updating parameters is a useful tool to avoid local minima during training. In a model using momentum, the gradient for each update affects the current velocity of a parameter instead of the parameter itself. The velocity also decays over time, at a rate determined by a hyperparameter α, to prevent excessive and counterproductive parameter updates. The velocity is calculated as follows:

velocity_t = α · velocity_{t−1} − ε ∂E/∂w

where ε is the learning rate; the parameter update is then Δw = velocity_t.
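The momentum rule can be sketched on a toy quadratic objective (an illustrative Python sketch; the objective f(w) = w², the learning rate, and the α value are invented, not deebn defaults):

```python
def momentum_step(param, velocity, grad, lr=0.1, alpha=0.9):
    # The gradient updates the velocity (decayed by alpha),
    # and the velocity updates the parameter.
    velocity = alpha * velocity - lr * grad
    return param + velocity, velocity

w, v = 5.0, 0.0
for _ in range(300):
    grad = 2 * w              # gradient of the toy objective f(w) = w**2
    w, v = momentum_step(w, v, grad)

# The parameter spirals in toward the minimum at w = 0.
assert abs(w) < 1e-3
```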

Monitoring the Energy Gap for an Early Stop
Ideally, the number of training epochs is eliminated as a hyperparameter by having some measure that determines when model performance stops improving and starts to degrade. In an RBM, one of the more reliable indicators is the free energy gap between a set of observations in the training data and a validation set. An increase in the gap means that the model is likely starting to overfit, and it's time to stop training.
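For a binary-unit RBM the free energy has a closed form, F(v) = −Σ_i a_i v_i − Σ_j log(1 + e^(b_j + Σ_i v_i w_ij)), which makes the gap cheap to monitor (an illustrative numpy sketch with invented dimensions and random data, not the deebn implementation):

```python
import numpy as np

def free_energy(v, a, b, W):
    # F(v) = -a.v - sum_j log(1 + exp(b_j + (v W)_j)) for a binary RBM
    x = b + v @ W
    return -np.dot(a, v) - np.sum(np.logaddexp(0.0, x))  # stable log(1+e^x)

def free_energy_gap(train, valid, a, b, W):
    # Gap: mean free energy of validation data minus that of training data.
    # A growing gap suggests the model is starting to overfit.
    ft = np.mean([free_energy(v, a, b, W) for v in train])
    fv = np.mean([free_energy(v, a, b, W) for v in valid])
    return fv - ft

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(5, 3))
a, b = np.zeros(5), np.zeros(3)
train = rng.integers(0, 2, size=(10, 5)).astype(float)
valid = rng.integers(0, 2, size=(10, 5)).astype(float)
gap = free_energy_gap(train, valid, a, b, W)
assert np.isfinite(gap)
```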

Training a Better Neural Network
There are a few optimizations to the backpropagation algorithm covered in Nielsen [15] that are used for fine-tuning the DNN. A better error/loss function is used to improve learning at the saturation extremes of the activation functions, and weight decay is used to prevent weights from growing too large.

Cross-Entropy Error
The quadratic error function discussed in section 2.4.3 works well as a cost function in most cases, but there are certain edge cases where it performs quite poorly. If the output for a node is very wrong, the rate at which the output is corrected is quite slow [15]. This effectively prolongs the learning process, and may result in learning stopping in a local minimum instead of closer to the global minimum. The solution to this problem is a cost function that does not have this attribute.
The cross-entropy error provides a smoother learning-rate curve over various node output values:

C = −(1/n) Σ_x Σ_j [ y_j ln a_j + (1 − y_j) ln(1 − a_j) ]

where the outer sum runs over the n input data vectors x, the inner sum runs over the output nodes j, y is the target output value, and a is the network's actual output.
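A direct transcription (an illustrative Python sketch, not deebn code) shows the property that motivates the switch: a confidently wrong output is penalized far more heavily than a mildly wrong one, so its gradient does not vanish:

```python
import math

def cross_entropy_cost(y, a):
    # C = -(1/n) * sum over examples x and output nodes j of
    #     y_j * ln(a_j) + (1 - y_j) * ln(1 - a_j)
    n = len(y)
    total = 0.0
    for y_row, a_row in zip(y, a):
        for yj, aj in zip(y_row, a_row):
            total += yj * math.log(aj) + (1 - yj) * math.log(1 - aj)
    return -total / n

# The further the output is from the target of 1.0, the steeper the cost.
assert cross_entropy_cost([[1.0]], [[0.999]]) < cross_entropy_cost([[1.0]], [[0.5]])
assert cross_entropy_cost([[1.0]], [[0.5]]) < cross_entropy_cost([[1.0]], [[0.001]])
```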

L2 Regularization
L2 regularization, or weight decay, continually shrinks weights to ensure that they don't grow too large, and also serves as a crude sparsity method to decrease noise and isolate learned feature detectors. An additional term is added to the cross-entropy cost function to penalize weights that have grown large:

C = C₀ + (λ / 2n) Σ_w w²

where C₀ is the unregularized cost and λ is the regularization parameter, used to control just how quickly the weights decay. The number of data points n is used to ensure that the rate of weight decay is independent of the current batch size. Changing the size of λ can shift the priority of the minimization from the original cost function (better modeling the distribution of the dataset) to ensuring that the weights stay small. As is implied by the name of this type of regularization, the modified cost function does not take the biases into account.
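Differentiating the regularized cost shows why this is called weight decay: each gradient step multiplies the weight by (1 − ε λ / n) before applying the usual data-gradient term (an illustrative Python sketch; the constants are invented):

```python
def decayed_update(w, grad, lr=0.5, lam=0.1, n=100):
    # Gradient step on C = C0 + (lam / 2n) * w**2:
    # the weight is first shrunk by the factor (1 - lr*lam/n),
    # then moved against the data gradient as usual.
    return (1 - lr * lam / n) * w - lr * grad

w = 2.0
for _ in range(1000):
    w = decayed_update(w, grad=0.0)   # zero data gradient: pure decay

# The weight shrinks toward zero but never jumps past it.
assert 0.0 < w < 2.0
```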

Deep Boltzmann Machines
Salakhutdinov and Hinton [35] introduced an efficient learning algorithm for the Deep Boltzmann Machine (DBM): a model that is very similar to a DBN but is fully undirected. This requires some modifications to the learning and generative processes seen in DBNs, but the result is a model that seems to perform better than a DBN.

Deep Autoencoders
Hinton and Salakhutdinov [9] also described a method for pre-training deep autoencoders: a model, akin to principal components analysis, that "encodes" the data into a small number of features and then "decodes" those features back into the original data. Figure 3.4, reproduced from the original paper, diagrams the process: an "encoder" network of stacked RBMs is pre-trained layer by layer to transform the high-dimensional data into a low-dimensional code, and a mirrored "decoder" network recovers the data from the code. The two networks are then unrolled into a single deep autoencoder, and the whole system is fine-tuned with standard backpropagation by minimizing the discrepancy between the original data and its reconstruction. Without this pre-training phase, deep autoencoders are difficult to optimize: with large initial weights they typically find poor local minima, and with small initial weights the gradients in the early layers are tiny.

deebn
The library that contains the functionality described in this thesis is named deebn, after Geoffrey Hinton's recommendation to differentiate a Deep Belief Network from a dynamic Bayes net: his recommendation was to call a Deep Belief Network a "DeeBN" and a dynamic Bayes net a "DyBN" [17]. The library is the core result of this study, and has been released to the public in its first iteration. It is available for direct download as a Java ARchive (JAR), or as a dependency via the dependency management tools commonly used in the Clojure ecosystem.

vectorz-clj
The vectorz-clj library is a thin wrapper around the vectorz Java library, which aims to implement fast and accurate matrix operations in pure Java.
vectorz-clj implements the interfaces defined in core.matrix using space- and computation-efficient implementations defined in the vectorz library, resulting in high-level, declarative matrix operations with desirable memory and CPU usage.

Clojure Records
Clojure offers a record type that acts much like a map (or dictionary in other languages), but also allows for type-based dispatch implemented in Clojure protocols. This allows for generic functions like train and test that can be implemented specifically for different record types. Based on their varying structure and intended use, there are a number of records defined for use in deebn. The type-based dispatch is also more efficient than using reflection or arbitrary dispatch based on a map value.

Restricted Boltzmann Machines
There are two types of RBM used in deebn: a model designed only for unsupervised training as part of a DBN, and a model that can be used by itself for classification. Even though a single RBM used for classification is sub-optimal, it is available for use. Parameter updates during training are averaged over mini-batches; this may at first seem to give worse results, but it instead reduces the variance in the resulting parameter updates.

Creating a Restricted Boltzmann Machine
There are a number of hyperparameters that the user is responsible for providing, but default values are available, as seen in the accompanying hyperparameter table.

Learning a Deep Neural Network
The top layer of weights is pre-trained on its own with the backpropagation algorithm, since it is the only layer initialized with random weights. This allows backpropagation over the entire network to start with weights that are close to ideal, rather than working from outputs that are effectively random.
The learning algorithm for a DNN is less space-efficient than that of the RBM, as the output of each of the units in each layer needs to be retained for the backpropagation algorithm outlined in Section 2.4.4. Memory usage isn't much of an issue with smaller batch sizes, and it grows only linearly with an increase in batch size.
In its current iteration, there is no early stopping implemented for the backpropagation training; instead, the number of epochs is specified by the user, along with the other hyperparameters outlined in Table 4.

Future Work
The deebn library is usable in its current state, and could be integrated into a machine learning pipeline in a number of different application settings.
What follows are a few things that would either increase its target audience, or increase the usability or functionality of the library.

Java Interoperability
Clojure and the deebn library run on the Java Virtual Machine, but using the library in its current form from Java is either sub-optimal or, in some cases, impossible. There is no concrete way to measure just how many Java developers there are in industry, but most attempts place Java somewhere in the top three languages [40]. Clojure has facilities to make exposing a Java-friendly API a fairly straightforward process, and doing so would enable any Java developer to integrate deebn into their machine learning pipeline.

Visualization
While not strictly a requirement for machine learning or classification, the ability to visualize various aspects of the model during its learning process is valuable in determining optimal hyperparameters for learning. There are multiple examples in [31] and [10] that illustrate the utility of visualizing parts of the model to gain insight into what the model "sees" at various stages of learning.
Motivation and methods behind using model visualizations to debug and optimize a model are outlined in Yosinski and Lipson [41]. The paper outlines four methods to troubleshoot training progress for an RBM, as well as a timeline for expected measurement progress throughout training.

Using Different Matrix Libraries
In its current implementation, deebn exclusively uses the vectorz backing library for matrix operations. There are already a handful of libraries that implement the core.matrix API, including clatrix, which takes advantage of BLAS (Basic Linear Algebra Subprograms) libraries. Instead of being the sole option, vectorz could serve as a default selection that the user overrides with either a custom backing implementation or one more suitable to their use case. The core.matrix API is general enough that this need could be filled by a library that takes advantage of the GPU, or even of distributed computing over a cluster.

Persistent Contrastive Divergence
Persistent Contrastive Divergence (PCD) [42] provides another method to approximate the likelihood gradient: rather than restarting the negative-phase Markov chain at each training vector, a persistent chain is maintained across parameter updates. This method has been experimentally shown to provide better results, and should increase the accuracy of the resulting models (see Figure 6.1).
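The difference from CD-1 is small in code (an illustrative numpy sketch of the idea, not a deebn feature; sizes, learning rate, and the single persistent chain are invented, and real implementations typically maintain one chain per mini-batch element):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v0, v_persist, W, a, b, lr, rng):
    # positive phase from the data, as in CD
    ph_data = sigmoid(b + v0 @ W)
    # negative phase: advance the PERSISTENT chain one Gibbs step;
    # unlike CD, the chain is not reset to the data vector
    ph = sigmoid(b + v_persist @ W)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(a + W @ h)
    v_persist = (rng.random(pv.shape) < pv).astype(float)
    ph_model = sigmoid(b + v_persist @ W)
    W = W + lr * (np.outer(v0, ph_data) - np.outer(v_persist, ph_model))
    return W, v_persist

rng = np.random.default_rng(3)
W = rng.normal(scale=0.01, size=(6, 4))
a, b = np.zeros(6), np.zeros(4)
v0 = rng.integers(0, 2, size=6).astype(float)
v_persist = rng.integers(0, 2, size=6).astype(float)  # chain state survives updates
for _ in range(10):
    W, v_persist = pcd_update(v0, v_persist, W, a, b, 0.05, rng)
assert W.shape == (6, 4)
```

Because the parameters change only slightly between updates, the persistent chain stays close to the model's stationary distribution, giving better negative-phase samples than a chain restarted at the data.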
(Figure 6.1: results reproduced from [42] on the MNIST tasks, where CD-1 and CD-10 refer to 1- and 10-step contrastive divergence, and MF CD refers to mean field contrastive divergence.)

Mutation and Performance
deebn is currently implemented using the immutable data structures provided by core.matrix. This results in code that is easy to read and simple to reason about. Unfortunately, a high price is paid in memory consumption and computation time, with many interim objects created during parameter update phases. A first pass at moving to the mutating operations that core.matrix provides should not only reduce memory overhead but also speed up overall computation at run time.

A Stepping Stone
As is the case with most advances in any field, the Deep Belief Network as Geoffrey Hinton described it is no longer state-of-the-art among deep learning models. It has since been surpassed, and even experimentally found to be suboptimal compared to other alternatives [43]. Even so, it sparked a recent renaissance of deep learning, and pushed the envelope of learning methods.