DEEP GENERATIVE MODEL FOR MULTI-CLASS IMBALANCED LEARNING

Learning from imbalanced data has drawn growing attention in the machine learning and data mining areas. An imbalanced distribution degrades the performance of many machine learning algorithms, especially those that require large amounts of data. To reduce the influence of skewed data distributions on discriminative models, various synthetic oversampling methods have been proposed to generate extra samples for data balance. However, most classic oversampling algorithms, such as the Synthetic Minority Over-sampling Technique (SMOTE) and the Adaptive Synthetic Sampling Approach (ADASYN), were developed to balance the data distribution of low dimensional data in a binary feature space, which limits their application to high dimensional multi-class data. To address this deficiency of current imbalanced learning methods, this thesis proposes a deep generative model based multi-class imbalanced learning algorithm. Both a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN) are implemented as data generators for creating high dimensional image data. Besides, we design an Extended Nearest Neighbor (ENN) based selection process to add the most relevant samples to the original imbalanced database and further improve classification performance. Based on our experiments on two data sets and comparisons with traditional oversampling algorithms, we demonstrate the effectiveness and robustness of our model.


Introduction
Learning from imbalanced data has drawn significant attention nowadays, owing to the pervasive skewed data distributions in numerous databases.
Imbalance is a situation where the number of observations belonging to one class is significantly lower than the number from the other classes. The skewed distribution leads to poor performance when applying conventional machine learning methods, owing to the underrepresented features learned from the minority classes.
Furthermore, machine learning algorithms are usually designed to improve accuracy by avoiding erroneous predictions, so most of these algorithms ignore the data distribution among different classes. When facing an imbalanced data set, standard algorithms like K Nearest Neighbors [1] and Decision Trees [2] tend to treat minority samples as noise and hence produce a strong bias towards the majority class.
It is truly worthwhile to explore effective imbalanced learning methods, because imbalanced data is prevalent in many industrial application areas where anomaly detection is critical, such as electricity pilferage, fraudulent bank transactions, and identification of rare diseases [3]. In the cyber security area, recognizing patterns in imbalanced data plays a crucial role in data analysis. For instance, when detecting cyber attacks in a large network, an unusual pattern takes only a relatively small percentage of the total data but plays a crucial role in computer intrusion detection [4]. In the financial engineering area, it is important to detect fraudulent activities, such as credit card, insurance, and insider trading frauds, among a large number of transactions [5].
In order to improve classification performance on imbalanced data sets, two families of methodologies have been proposed by data mining researchers. Published imbalanced learning solutions can be categorized as algorithm-level and data-level algorithms [6]. At the algorithm level, the classifier itself is modified to bias towards the minority class without changing the original data, as in cost-sensitive learning and recognition-based learning [7]. Cost-sensitive learning assigns corresponding weights to the costs of different kinds of misclassification; the goal of this type of learning is to minimize the total cost [8]. At the data level, oversampling and undersampling methods are applied to create or delete samples to achieve a balanced data distribution.
The classic synthetic oversampling methods achieve state-of-the-art performance when dealing with imbalanced data. However, those methods are designed only for low dimensional feature-space samples in binary classification scenarios and can hardly cope with high dimensional data samples, like images, audio signals and time series. For multi-class scenarios, Wang [9] studied the challenges of multi-class imbalanced problems and investigated the generalization abilities of several ensemble solutions, including their recently proposed algorithm AdaBoost.NC. Zong [10] proposed a weighted extreme learning machine (ELM) method to deal with multiclass imbalanced data, which can also be generalized to cost-sensitive learning. Li [11] proposed a Boosting weighted extreme learning machine to solve the weight selection problems of weighted ELM, which can also be used in multiclass imbalanced scenarios. However, all those methods designed for multiclass imbalanced data focus only on balancing the data distribution in feature space, and they were tested on simple data sets like UCI [12] and KEEL [13].
The emerging research surge of deep generative models gives us inspiration for an alternative imbalanced learning method to deal with more complicated imbalanced data. Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN), two of the most popular models for learning data distributions in an unsupervised way, have already achieved success in generating a variety of complex data, including handwritten digits, faces, house numbers and CIFAR images [14]. Fig. 1 shows real images and the images generated by GAN.
In this thesis, we explore the possibilities of applying these two generative models in imbalanced learning areas. We choose image data as the input of generative models, and apply Extended Nearest Neighbor (ENN) method to select synthetic candidates for the minority class, and compare the generation results with the traditional synthetic oversampling methods on several different evaluation metrics.
This thesis consists of five chapters, arranged as follows: Chapter 1 provides some basic knowledge of the traditional imbalanced learning problem and puts forward some potential drawbacks of classical approaches; besides, we give some insights into using deep generative models like GAN and VAE on imbalanced data sets. Chapter 2 offers the details of the background of classical imbalanced learning problems and compares the advantages and disadvantages of traditional oversampling methods. Chapter 3 states the mathematical foundations of two of the most famous deep generative models, GAN and VAE, and proposes a novel ENN based selection method to find more suitable samples from the model outputs for imbalanced learning. Chapter 4 provides the performance and analysis of experiments on two different data sets to prove the effectiveness of our proposed method. The last chapter summarizes the whole thesis and discusses possible future work.

At the data level, the goal is to re-balance the data distribution by resampling the database, including oversampling the instances of the minority class and undersampling the majority class.
Among all the resampling methods, oversampling methods balance the data set by increasing the number of minority samples, while undersampling methods try to reduce the number of majority samples to keep balance [4]. Random oversampling adds minority samples by randomly replicating existing minority members, which to some degree improves the learning process but does not provide any additional information to the training set. Besides, random oversampling may cause overfitting of machine learning models. Compared with random oversampling, random undersampling even loses some training information, which may have a negative effect on the learning process [5]. Fig. 2 and Fig. 3 show the details of these two methods.

Synthetic Oversampling Methods
In order to provide more information to the training data, synthetic oversampling methods create new samples to balance the data set and achieve better learning performance. The Synthetic Minority Over-sampling Technique (SMOTE) adds synthetic samples to the minority class [6]. SMOTE generates samples according to the similarities among existing minority instances. For a specific feature sample x_i in a feature set S, SMOTE finds the K nearest neighbors of x_i in the feature space. To generate a synthetic sample, one of these K nearest neighbors, x̂_i, is randomly selected; then the difference between the two samples is computed, and the multiplication of that difference with a random number δ ∈ [0, 1] is added to the original feature instance [7]:
x_new = x_i + δ × (x̂_i − x_i).
Fig. 4 shows the process of SMOTE. Compared with random oversampling, SMOTE reduces the overfitting problem to a certain degree and enlarges the minority data in a way that benefits the learning process. However, SMOTE also has its disadvantages, such as generalization and variance issues [8].
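As a concrete illustration, the interpolation step described above can be sketched in pure Python. This is a minimal sketch, not an optimized implementation; the function name `smote` and the brute-force neighbor search are our own choices:

```python
import random

def smote(minority, n_new, k=5, seed=0):
    # Minimal SMOTE sketch: each synthetic point lies on the segment
    # between a minority sample x_i and one of its k nearest minority
    # neighbors (brute-force Euclidean search).
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: dist2(x, p))[:k]
        nb = rng.choice(neighbors)
        delta = rng.random()           # random number in [0, 1]
        synthetic.append([xi + delta * (ni - xi)
                          for xi, ni in zip(x, nb)])
    return synthetic
```

Because every synthetic point is a convex combination of two minority samples, it always falls inside the convex hull of the minority class.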
Inspired by the SMOTE algorithm, Han et al. proposed Borderline-SMOTE methods to generate synthetic samples on the borderline between two classes for better classification results [9]. The idea resulted from the fact that the samples close to the borderline are more significant for classification, considering that most machine learning algorithms try to learn the borderline between each class.
The Borderline-SMOTE1 algorithm first finds every minority example's k nearest neighbors in the entire training set [9]. For every sample p_i, let k′ be the number of majority samples among p_i's k nearest neighbors. If k′ = k, all the k nearest neighbors are majority samples, which means this sample can be regarded as noise. If k/2 ≤ k′ < k, more majority samples than minority ones are found among the k neighbors, so we put p_i in the DANGER set. If 0 ≤ k′ < k/2, p_i is not considered an endangered sample. The DANGER set can thus be defined as
DANGER = {p′_1, p′_2, ..., p′_{d_n}}, 0 ≤ d_n ≤ n,
where d_n is the number of endangered samples and n is the size of the training set. At last, for each p_i in DANGER, we calculate the differences dif_j between p_i and its s nearest minority neighbors, and generate s new samples by
synthetic_j = p_i + r_j × dif_j, j = 1, ..., s,
where r_j is a random number between 0 and 1.
Different from Borderline-SMOTE1, Borderline-SMOTE2 not only creates synthetic examples from the nearest minority neighbors of each sample in the DANGER set, but also does so from its nearest majority neighbors. Besides, to make the synthetic examples closer to the minority class, a random number between 0 and 0.5 is multiplied by the difference between the endangered sample and its neighbors [9].
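The noise / DANGER / safe partition used by both Borderline-SMOTE variants reduces to counting majority neighbors. A minimal sketch (the function name and its argument layout are our own; the caller supplies the labels of a sample's k nearest neighbors):

```python
def categorize(neighbor_labels, minority_label):
    # k' = number of majority samples among the k nearest neighbors.
    k = len(neighbor_labels)
    k_maj = sum(1 for lbl in neighbor_labels if lbl != minority_label)
    if k_maj == k:
        return "noise"    # all neighbors are majority: treated as noise
    if k_maj >= k / 2:
        return "danger"   # borderline: goes into the DANGER set
    return "safe"         # mostly minority neighbors: not endangered
```

Only samples in the DANGER set are used as seeds for synthetic generation.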
The procedure of ADASYN is described as below:
Step.1 To begin with, we assume there is an imbalanced data set D_imb with m samples (x_i, y_i), where i = 1, ..., m, and x_i and y_i are data samples and labels respectively. Besides, we define the number of majority samples as m_l and the number of minority samples as m_s. Therefore, m_s + m_l = m and m_s < m_l [10].
Step.2 Calculate the degree of class imbalance: d = m_s / m_l, where d ∈ (0, 1].
Step.3 Define a threshold d_th for the maximum tolerable imbalance degree. If d < d_th:
(a) Calculate the exact number of samples that need to be generated for the minority class:
G = (m_l − m_s) × β, (5)
where β ∈ [0, 1] is a number defined to specify the balance degree after sample generation. Usually, β is set to 1, since in most situations a completely balanced data set is desired.
(b) For each sample x_i in the minority class, find its K nearest neighbors based on the Euclidean distance, and define the ratio
γ_i = Δ_i / K, i = 1, ..., m_s,
where Δ_i is the number of majority samples among the K nearest neighbors of x_i; therefore γ_i ∈ [0, 1].
(c) Normalize the ratios: γ̂_i = γ_i / Σ_{i=1}^{m_s} γ_i, so that Σ_i γ̂_i = 1.
(d) Find the number of synthetic examples which need to be generated for each minority example x_i: g_i = γ̂_i × G, where G is from Eq. 5.
(e) Generate g_i samples for each minority sample x_i iteratively, based on the SMOTE algorithm:
s_i = x_i + (x_zi − x_i) × δ,
where x_zi is a randomly chosen minority sample among the K nearest neighbors of x_i, and δ ∈ [0, 1] is a random number.
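Steps (a)–(d) above, which decide how many samples g_i to generate per minority point, can be sketched as follows. This is an illustrative helper (the name `adasyn_counts` is ours, and rounding g_i to integers is one possible choice):

```python
def adasyn_counts(minority_maj_neighbors, m_l, m_s, K, beta=1.0):
    # minority_maj_neighbors[i] holds Δ_i: the number of majority
    # samples among the K nearest neighbors of minority sample x_i.
    G = (m_l - m_s) * beta                        # total samples to generate
    gamma = [d / K for d in minority_maj_neighbors]   # ratios γ_i
    s = sum(gamma)
    gamma_hat = [g / s for g in gamma]            # normalized: Σ γ̂_i = 1
    return [round(g * G) for g in gamma_hat]      # g_i per minority sample
```

Minority samples with more majority neighbors (harder to learn) receive proportionally more synthetic samples, which is the adaptive part of ADASYN.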

Limitation of Conventional Oversampling Methods
The vast majority of existing oversampling methods were proposed to cope with the two-class scenario in a low dimensional feature space, which limits their application to generating high-dimensional samples in the raw data space.
All the synthetic sampling methods and techniques mentioned before are designed for imbalanced data sets in two-class scenarios. To apply those algorithms in multiclass situations, researchers utilize class decomposition to convert a multiclass problem into a set of binary-class subproblems [11]. Given a data set with N classes (N > 2), a common decomposition plan is to treat one class as the minority and to combine all the rest classes together as the majority. This approach is also referred to as one-against-all (OAA) [12].
Besides the limitation regarding multiclass scenarios, the introduced synthetic oversampling methods seldom show superiority when the target samples contain more complex or high-dimensional features, such as image or audio samples. According to the research results of Blagus et al., when dealing with high-dimensional class-imbalanced data, SMOTE does not reduce the classification bias towards the majority class for most classifiers, and is even less effective than random undersampling [13]. Furthermore, as the number of dimensions grows, the Euclidean distance becomes a meaningless metric for measuring the similarity between samples [14]: in high dimensions, the distance between a target sample and its nearest neighbor becomes almost indistinguishable from the distance to its furthest neighbor.
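This loss of distance contrast is easy to observe numerically. The sketch below is our own toy experiment (not from the thesis): it draws random points in the unit hypercube and compares the nearest and furthest distances from the origin, a ratio that approaches 1 as the dimension grows:

```python
import math
import random

def nn_contrast(dim, n=200, seed=0):
    # Ratio of nearest to furthest distance from the origin for n
    # uniformly random points in [0, 1]^dim. Values near 1 mean the
    # distances are nearly indistinguishable (the metric loses contrast).
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [math.sqrt(sum(c * c for c in p)) for p in pts]
    return min(dists) / max(dists)
```

In 2 dimensions the ratio is typically far below 1, while in several hundred dimensions it is close to 1, illustrating why Euclidean-distance-based oversampling degrades on raw image pixels.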
Furthermore, the existing oversampling methods were proposed to deal with imbalanced learning problems in a feature space where feature vectors are presented in a fixed and regularized pattern. For imbalanced data sets whose features are hard to extract in a regularized way (e.g. image data), those synthetic oversampling methods fail to generate new samples that are similar to, but different from, the original ones.
Therefore, it is necessary for researchers to explore new approaches to generate samples that balance the skewed distribution of high-dimensional data. Generative modeling is a wide area of unsupervised learning methods which attempts to learn the underlying data distribution of the original data set [2]. A generative model captures the joint probability P(x, y) of the input data and labels, which can be used to generate new data samples similar to existing ones. For example, considering images as input data, each sample (image) has thousands of dimensions (pixels), and the generative model's job is to capture the dependencies between pixels, e.g., that pixels close to each other may form a recognizable object [1]. However, this is not enough for imbalanced learning, where we want to generate samples similar to those already in a database, but not exactly the same. Mathematically, we want to attain a distribution P which is as close as possible to the original data distribution P_ori and from which we can draw new samples.
Training this kind of generative model has been a big challenge for several decades, resulting from three serious drawbacks. First, strict assumptions on the original data may be required to achieve reliable results. Second, severe approximations of the data structure easily lead to a local optimum [3].
Third, algorithms like Markov Chain Monte Carlo [4] make the training process very computationally expensive.

Variational Autoencoder Based Data Generation
More recently, powerful function approximators like neural networks have provided a more reliable way to train generative models through backpropagation. The Variational Autoencoder is one of the most widely implemented deep generative models.
Unlike traditional generative models, which either require strong assumptions about the structure of the data or rely on computationally expensive inference procedures, VAE only makes weak assumptions on the data, and the training procedure is fast via backpropagation [1]. Following the variational autoencoder literature [1], we consider the following latent variable model for the data X.
VAE models the probability of the data X through a latent variable z, which can then be used to generate new data samples similar to existing ones:
P(X) = ∫ P(X | z) P(z) dz. (10)
In Eq. (10), P(X | z) is almost zero for most values of z; therefore, those z contribute little to the estimate of P(X). The main goal of the VAE is to sample those values of z which are very likely to have generated the data X. Then, P(X) is computed from those z based on Eq. (10).
For the purpose of attaining those z values, we need a new function Q(z | X) which outputs a distribution over the z values that are likely to produce data X. The Kullback-Leibler (KL) divergence between Q(z | X) and P(z | X) is shown as
D[Q(z | X) || P(z | X)] = E_{z∼Q}[log Q(z | X) − log P(z | X)]. (11)
Then, the marginal likelihood of each sample X can be written as Eq. (12):
log P(X) = D[Q(z | X) || P(z | X)] + L(X), with L(X) = −D[Q(z | X) || P(z)] + E_{z∼Q}[log P(X | z)], (12)
where the first term on the right hand side is the KL divergence between the approximate and the true posterior distribution, and the second term L(X) is the variational lower bound on the marginal likelihood of sample X. Eq. (12) can be rewritten as
log P(X) − D[Q(z | X) || P(z | X)] = −D[Q(z | X) || P(z)] + E_{z∼Q}[log P(X | z)]. (13)
Eq. (13) is the core of VAE. The left hand side is the quantity we want to maximize, while the right hand side is the quantity we can optimize via gradient descent. Our goal is to maximize the marginal likelihood log P(X) and minimize the KL divergence D[Q(z | X) || P(z | X)]. By minimizing the KL divergence, we are pushing the approximate posterior Q(z | X) to match the true posterior P(z | X). The architecture of VAE is shown in Fig. 6, where P and Q are implemented by neural networks. The architecture of VAE is similar to that of an autoencoder: Q encodes data X into the latent variable z, and P decodes z and reconstructs X.
During the training process, the right hand side of Eq. (13) is maximized by gradient descent. Its first term is the KL divergence between the approximate posterior and the prior distribution. The approximate posterior is often chosen as Q(z | X) = N(z; µ(X), Σ(X)), where µ(X) and Σ(X) are arbitrary deterministic functions whose distribution parameters ϑ are learned from data [3]. In addition, Σ is constrained to be a diagonal matrix. For the prior distribution, a commonly used prior over the latent variable is the centered isotropic multivariate Gaussian P(z) = N(0, I). By choosing these multivariate Gaussian distributions, it becomes very easy to compute the KL divergence between Q(z | X) and P(z) as
D[N(µ(X), Σ(X)) || N(0, I)] = (1/2) (tr(Σ(X)) + µ(X)ᵀµ(X) − k − log det Σ(X)),
where k is the dimension of the distribution. The second term on the right side of Eq. (13) is an expected reconstruction log-likelihood. Here, we take one sample of z and use log P(X | z) to approximate E_{z∼Q}[log P(X | z)]. In general, P(X | z) is taken as a multivariate Bernoulli distribution so that
log P(X | z) = Σ_i [X_i log y_i + (1 − X_i) log(1 − y_i)],
where y denotes the decoder output. The forward pass of Fig. 6 works fine, but we cannot backpropagate the error through the layer which samples from Q(z | X): this layer has no gradient because the sampling operation is not differentiable. The "reparameterization trick" is proposed to solve this problem by moving the sampling operation to the input layer. The architecture of the modified VAE is shown in Fig. 7. Given the mean and covariance of Q(z | X), i.e., µ(X) and Σ(X), we first sample ε from N(0, I); then z = µ(X) + Σ^{1/2}(X) * ε. With this modification, we can backpropagate the error from the decoder to the encoder [3].
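The reparameterization trick itself is a one-line computation. A minimal sketch in plain Python, assuming a diagonal Σ so that `sigma` holds the per-dimension standard deviations Σ^{1/2} (the function name is ours):

```python
import random

def sample_latent(mu, sigma, seed=None):
    # Reparameterization trick: z = µ(X) + Σ^(1/2)(X) * ε with ε ~ N(0, I).
    # The noise ε is drawn outside the network, so gradients can flow
    # through µ and Σ during backpropagation.
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]
```

With zero standard deviation the sample collapses to the mean, which makes the deterministic path through µ explicit.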
The overall algorithm of VAE based synthetic data generation [3] proceeds as follows: for each training iteration, randomly select a minibatch of data, run the encoder and decoder forward, and update the weights θ by minimizing the loss function derived from Eq. (13) via gradient descent. After training, new samples are generated by feeding values of z ∼ N(0, I) into the decoder.

Generative Adversarial Network Based Data Generation
Among all the recently proposed generative models, the Generative Adversarial Network, or GAN in short, is another outstanding and successful framework for generating image samples [5]. A GAN consists of two networks: a generative net G that is trained to acquire the knowledge of the data distribution, and a discriminative net D which aims to distinguish the generated samples from real ones [6]. The basic structure of a generative adversarial network is shown in Fig. 8. The generator's distribution P_G over data x is modeled as a differentiable function G(z; θ_g), which can be implemented by a neural network with parameters θ_g and input noise variables z. The discriminator D(x; θ_d), also implemented by a neural network with parameters θ_d, takes a sample x as input and outputs a single scalar representing the probability that x came from the data rather than from P_G. The training goal of GAN is to learn a generator distribution P_G(x) that matches the real data distribution P_data(x) [5]. To achieve this goal, a minimax value function is proposed as follows:
min_G max_D V(D, G) = E_{x∼P_data(x)}[log D(x)] + E_{z∼P_z(z)}[log(1 − D(G(z)))].
In the training process, the discriminator aims to make D(G(z)) approach 0 and D(x) approach 1, while the generator attempts to make D(G(z)) approach 1 to maximize the probability of the discriminator making a mistake. This network structure corresponds to a minimax two-player game [6].
Concretely, if the batch size of noise variables and training samples is m, we update the discriminator by ascending its stochastic gradient
∇_{θ_d} (1/m) Σ_{i=1}^{m} [log D(x^(i)) + log(1 − D(G(z^(i))))],
and update the generator by descending its stochastic gradient
∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i)))).
After proper training, the ideal solution to this minimax game is a Nash equilibrium [7]. The overall training procedure is shown in Algorithm 2.

Algorithm 2 Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, k, is a hyperparameter.
1: for number of training iterations do
2:   for k steps do
3:     Sample a minibatch of m noise samples z^(1), ..., z^(m) from the noise prior p_g(z).
4:     Sample a minibatch of m examples x^(1), ..., x^(m) from the data distribution.
5:     Update the discriminator by ascending its stochastic gradient.
6:   end for
7:   Sample a minibatch of m noise samples z^(1), ..., z^(m) from the noise prior p_g(z).
8:   Update the generator by descending its stochastic gradient.
9: end for
The gradient-based updates can use any standard gradient-based learning rule.

To adapt to our situation, we can implement a Deep Convolutional Generative Adversarial Network (DCGAN) [8], which applies convolutional layers in the construction of the discriminator and generator. DCGAN has proven to be a powerful image generative model, which we test for sample generation on imbalanced image data.
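For intuition, the two objectives in the minimax game can be evaluated directly from the discriminator's outputs. A minimal sketch (the helper names are ours; both quantities are written as losses to minimize):

```python
import math

def discriminator_loss(d_real, d_fake):
    # Negative of E[log D(x)] + E[log(1 - D(G(z)))]: the discriminator
    # ascends the value function, i.e. minimizes this quantity.
    return -(sum(math.log(p) for p in d_real) / len(d_real) +
             sum(math.log(1.0 - p) for p in d_fake) / len(d_fake))

def generator_loss(d_fake):
    # E[log(1 - D(G(z)))]: the generator descends this quantity.
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
```

At the equilibrium, where D outputs 1/2 everywhere, the discriminator loss equals log 4 ≈ 1.386, matching the known optimal value of the game.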

Extended Nearest Neighbor Based Selection for Borderline Samples
Based on our previous research result [3], published at the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), the outputs of generative models (VAE) can be used together with the original data to improve classification performance. However, not all of the generated instances are of the same significance in the classification process.
From the analysis in [9] and [10], most existing classifiers attempt to learn the borderline between classes as accurately as possible during training, which makes the examples close to the borderline more likely to be misclassified than ones far from it, and thus more significant for classification.
Enlightened by the idea of synthesizing borderline samples, we propose a novel method to select borderline samples from generative model outputs based on the Extended Nearest Neighbor (ENN) method [11]. Through this selection, only the generated minority samples that are close to the class boundaries are appended to the original imbalanced data, in order to improve the performance of imbalanced learning.

Compared with further improved versions of the K-nearest neighbor (KNN) method [12][13][14][15], the Extended Nearest Neighbor (ENN) method predicts input patterns on the basis of the maximum intra-class coherence increment [11]. ENN takes into consideration not only the nearest neighbors of the test sample, but also the samples that regard the test sample as their nearest neighbor. This two-way communication style provides access to the integral distribution of the training data, and thus outperforms other related pattern recognition algorithms [11].
The essential definition of ENN is the generalized class-wise statistic T_i. In a two-class classification scenario, the generalized class-wise statistic T_i for class i is defined as follows:
T_i = (1 / (n_i × k)) Σ_{x∈S_i} Σ_{r=1}^{k} I_r(x, S), i = 1, 2,
where S_1 and S_2 represent the samples in class 1 and class 2, respectively, x denotes one single sample in S = S_1 ∪ S_2, n_i is the number of samples in S_i, and k is the number of nearest neighbors to search in the prediction process. The indicator function I_r(x, S) specifies whether both the sample x and its r-th nearest neighbor belong to the same class, defined as follows:
I_r(x, S) = 1 if x and NN_r(x, S) belong to the same class, and 0 otherwise,
where NN_r(x, S) represents the r-th nearest neighbor of x in S. A large T_i suggests that samples in S_i are close together and their nearest neighbors are dominated by samples of the same class, whereas a small T_i indicates that samples in S_i have excessive nearest neighbors from other classes [11]. Accordingly, T_i can be used to characterize the data distribution across multiple classes.
Therefore, the intra-class coherence is defined as the sum of the generalized class-wise statistics over all classes. To classify an unknown sample Z in a multiclass situation, Z is assigned respectively to class 1, class 2, ..., and class m, to obtain m² generalized class-wise statistics T_i^j:
T_i^j = (1 / (n_i × k)) Σ_{x∈S_{i,j}} Σ_{r=1}^{k} I_r(x, S ∪ {Z}),
where n_i is the size of S_{i,j}, and S_{i,j} is defined as S_{i,j} = S_i ∪ {Z} if i = j, and S_{i,j} = S_i otherwise. The ENN classifier predicts Z's membership according to the following target function:
f_ENN = argmax_{j∈{1,...,m}} Σ_{i=1}^{m} T_i^j.
For computational convenience in practical applications, [11] recommended an equivalent target function f_ENN.V1 to replace f_ENN, expressed in terms of the following quantities: k, the defined number of nearest neighbors for prediction; n_i, the number of training samples for class i; k_i, the number of the nearest neighbors of the test sample Z from class i; Δn_i^j, the change of the k-nearest-neighbor membership for class i when assigning the test sample Z to class j; and T_i, the generalized class-wise statistic of the original class i without the introduction of Z.
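The generalized class-wise statistic T_i can be computed directly from its definition. A brute-force sketch (the function name is ours; Euclidean distance, no tie handling):

```python
def class_statistic(samples, labels, cls, k):
    # T_i: the fraction of the k-nearest-neighbor relations of class-i
    # samples that stay within class i, averaged over the class.
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    idx_i = [n for n, lbl in enumerate(labels) if lbl == cls]
    hits = 0
    for n in idx_i:
        order = sorted((m for m in range(len(samples)) if m != n),
                       key=lambda m: dist2(samples[n], samples[m]))
        hits += sum(1 for m in order[:k] if labels[m] == cls)
    return hits / (len(idx_i) * k)
```

For two well-separated clusters T_i is 1 at k = 1, and it drops as neighbors from the other class enter each sample's k-neighborhood.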
Since f_ENN.V1 and f_ENN are equivalent, we can use f_ENN.V1 to predict the class membership of every sample generated by our model. To select those samples near the class boundaries, we propose a criterion for the selection: a chosen sample must meet the following two requirements: 1) The generated sample must be classified as its original class based on ENN.
2) The generated sample must be neighbored by at least one sample from another class, based on the number of nearest neighbors k defined in Eq. 24.
Every qualified sample selected from the output of the generative models is added to the original data set until the skewed data distribution is balanced. If not enough samples have neighbors from other classes, we apply only requirement 1) to select samples, so that the data set can still be balanced. The complete algorithm structure is shown as Algorithm 3.
Compared with directly using the generated images from our deep generative models, the ENN based selected samples provide more useful information on the boundaries between different categories, and therefore lead to better classification performance.
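The two selection requirements reduce to a simple predicate per generated sample. A sketch with names of our own choosing, where `pred_label` stands for the ENN prediction and `neighbor_labels` for the classes of the sample's k nearest neighbors:

```python
def qualifies(pred_label, target_label, neighbor_labels):
    # Requirement 1: the classifier predicts the sample's original class.
    if pred_label != target_label:
        return False
    # Requirement 2: at least one of the k nearest neighbors belongs
    # to a different class, i.e. the sample sits near a boundary.
    return any(lbl != target_label for lbl in neighbor_labels)
```

A generated sample passing both checks is appended to the training set; if too few samples pass, the second check is dropped, as described above.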

Algorithm 3 Deep Generative Model Based Minority Class Data Generation.
Require: Training ENN with the target imbalanced data set
1: for t_thre = 1 : N_thre do
2:   Generate a class-n sample x_gen and classify x_gen based on ENN.
3:   Name the classification result y_gen.
4:   Attain the vector k_n.
5:   if y_gen == n and k_n contains more than one non-zero element then
6:     Add x_gen to the original data set, N_gen++.
7:   end if
8: end for
9: If not enough qualified samples are found, relax the selection to requirement 1) only: whenever y_gen == n, add x_gen to the original data set, N_gen++.

CHAPTER 4 Simulations and Experiments
To prove the effectiveness of the proposed model, we use image data as the high-dimensional input of two deep generative models. The parameters of the two models, VAE and GAN, are illustrated in this chapter. Furthermore, we apply ENN based selection to choose the most relevant samples for classification.
Generated pictures of the classical synthetic oversampling methods and our proposed models are also displayed. Finally, we use five different evaluation metrics to compare the classification performance of the different approaches.

Variational Autoencoder based generative model
First, we implement VAE for the image generation. The structure of VAE here is a little different from the vanilla one: we applied convolutional layers [1] here to extract and restore sufficient features of original data samples.
The encoder of VAE consists of six layers: three convolutional layers, one dropout layer, one flatten layer and one dense layer. The kernel size of the convolutional layers is 5 * 5 with a stride size of 2 * 2. Both same padding and valid padding are applied here. The encoder structure is shown in Table 1.
The decoder consists of five layers: one reshape layer and four deconvolutional layers. The deconvolutional layer, also called transposed convolutional layer, is implemented to map the input vector back to the image pixel space. The network structure of the decoder is shown in Table 2.
The discriminator consists of six layers: two convolutional layers, one flatten layer and three dense layers. The first convolutional layer has a kernel size of 4 * 4 with a stride size of 2 * 2 and 64 feature maps as output. The second convolutional layer has a kernel size of 4 * 4 with a stride size of 2 * 2 and 128 feature maps as output. Both of these layers apply Leaky ReLU as the activation function.
The last dense layer has an output size of 1, which represents the probability that the generated image sample is real. The discriminator structure is shown in Table 4.
In our implementation, we trained the GAN for 1000 epochs, with a learning rate of 0.001 for the discriminator and 0.004 for the generator, in order to keep the discriminator in an optimum state and let the generator learn the distribution steadily.

Experiment on MNIST Data set
To test the effectiveness of our proposed method, we trained our model on the MNIST data set and compared the performance with other oversampling methods. Since the original MNIST data set is balanced, with 6000 samples per class for labels 0 to 9, we made some modifications to make it imbalanced: we choose digits 0 to 4 as the minority classes and pick 300 samples from these classes. Then, we pick 3000 samples from labels 5 to 9 as the majority classes. We divide this imbalanced data set evenly into three folds to prepare for 3-fold cross validation.
To make the distribution balanced again, we apply VAE to train on the minority class samples. New samples are created by feeding values of z ∼ N(0, I) into the decoder. To attain suitable samples for the learning process, we apply ENN to select borderline samples. Traditional synthetic sampling methods, by contrast, struggle with high-dimensional data: according to [2], since the Euclidean distance is unsuitable for measuring the similarity between high-dimensional samples, synthetic sampling methods based on the Euclidean distance will generate unreasonable images.
To evaluate the performance of our proposed generative structure, we choose convolutional neural networks (CNN) [3] as our basic classifier to do classification.
We applied two convolutional layers in our classifier. The first layer has a kernel size of 3 * 3, a stride size of 1 * 1, and 32 feature maps; the second layer has the same kernel and stride size as Layer 1, with 64 feature maps.
Then we applied 2 * 2 max pooling followed by a 0.25 dropout rate.
After that, we use a 128-dimension dense layer with a 0.5 dropout rate before the final softmax layer. Fig. 9 shows the structure of our classifier. The structure of our CNN remains the same for all other databases and all the methods.

Evaluation Metrics
We further quantitatively compare the performance of our proposed method with the synthetic sampling methods on the MNIST test set. Considering that accuracy alone may not be sufficient for evaluating the performance of imbalanced learning algorithms [4], we instead apply a set of assessment metrics: precision, recall, specificity, F1 score, and G mean.

1) Precision:
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP denotes true positives and FP denotes false positives.

2) Recall:
$$\text{Recall} = \frac{TP}{TP + FN}$$
where FN denotes false negatives.

3) Specificity:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
where TN denotes true negatives.

4) F1 score:
$$F_\beta = \frac{(1+\beta^2)\cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
where β is a weight coefficient that adjusts the significance of recall (usually β = 1).

5) G-mean:
$$\text{G-mean} = \sqrt{\text{Recall} \times \text{Specificity}}$$
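Computed from the confusion counts of a single one-versus-all split, these metrics can be implemented directly (a minimal sketch, not the thesis's evaluation code):

```python
import math

def imbalance_metrics(tp, fp, fn, tn, beta=1.0):
    """Precision, recall, specificity, F-beta, and G-mean from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_beta = ((1 + beta**2) * precision * recall
              / (beta**2 * precision + recall))
    g_mean = math.sqrt(recall * specificity)
    return precision, recall, specificity, f_beta, g_mean
```

For example, `imbalance_metrics(8, 2, 2, 8)` gives 0.8 for every metric.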
Since the above metrics are designed for the two-class setting, the one-versus-all technique [5] is used to compute average values of these metrics over all the classes in multi-class scenarios.

For the second data set, the imbalanced data set is built from randomly chosen samples and then divided into three parts of the same size for cross validation. The VAE and GAN parameters for this database are the same as those used for MNIST, but with 1000 training epochs instead. Besides, we use exactly the same classifier as in Fig. 9. Fig. 11 shows selected generated images from both models. To obtain suitable samples for the learning process, we apply ENN to select borderline samples.
The number of nearest neighbors k_enn is set to 10. For each class, 1600 additional samples are generated. Again, traditional sampling methods are applied to this database for comparison.
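A simple borderline-selection criterion in the spirit of the ENN step above might look like the following (the exact ENN statistic used in the thesis may differ; the mixed-neighborhood thresholds here are illustrative):

```python
import numpy as np

def select_borderline(candidates, data, labels, target_label, k=10):
    """Keep generated samples whose k nearest original neighbors are of
    mixed class, i.e. samples lying near the classification boundary."""
    keep = []
    for x in candidates:
        d = np.linalg.norm(data - x, axis=1)   # distances to all originals
        nn = np.argsort(d)[:k]                  # k nearest neighbors
        same = np.mean(labels[nn] == target_label)
        if 0.2 <= same <= 0.8:                  # mixed neighborhood -> borderline
            keep.append(x)
    return np.array(keep)

# Toy 1-D example: class 0 occupies [0, 9], class 1 occupies [10, 19].
data = np.arange(20, dtype=float).reshape(-1, 1)
labels = np.array([0] * 10 + [1] * 10)
cands = np.array([[9.5], [0.0]])   # one borderline sample, one interior sample
selected = select_borderline(cands, data, labels, target_label=0, k=10)
```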

Performance Analysis
Based on our experimental results, the generation results of VAE and GAN differ in two aspects: (1) compared with the GAN, the VAE-based model tends to produce blurry images but achieves better classification performance; (2) compared with the VAE, the GAN-based model tends to produce clearer, sharper images but is difficult to train and prone to collapse.
According to [6], VAEs are easier to train, robust to hyperparameter choices, and yield interpretable latent variables that map the input to a lower-dimensional space. The limitation of the VAE is that the approximation of the posterior is usually oversimplified, because we cannot parametrize distributions much more complex than a Gaussian.
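The Gaussian restriction mentioned above enters through the VAE's KL regularizer, which has a closed form precisely because both the approximate posterior and the prior are Gaussian (standard VAE formula, not specific to this thesis):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the VAE latent regularizer.
    The closed form exists only because both distributions are Gaussian."""
    mu, log_var = np.asarray(mu, float), np.asarray(log_var, float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

# The KL term vanishes only when the approximate posterior equals the prior.
print(kl_to_standard_normal([0.0], [0.0]))  # -> 0.0
```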
On the other hand, GANs have the advantage of generating clearer pictures because of the adversarial mechanism, which forces the generative network to produce samples that are hard for the discriminator to distinguish from real ones. Hence, the lack of adversarial training may be the reason why the VAE generates blurry images.
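The adversarial mechanism can be made concrete with the standard GAN losses (a generic sketch; this is not the thesis's training code):

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy between discriminator outputs p and labels y."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Discriminator scores for a batch: it should push real -> 1 and fake -> 0,
# while the generator tries to make the discriminator output 1 on fakes.
d_real = np.array([0.9, 0.8])   # scores on real images
d_fake = np.array([0.2, 0.1])   # scores on generated images

d_loss = bce(d_real, np.ones(2)) + bce(d_fake, np.zeros(2))
g_loss = bce(d_fake, np.ones(2))
# When fakes are easily spotted (as here), the generator's loss is large,
# pushing it toward harder-to-distinguish, i.e. sharper, samples.
```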
However, besides being trickier to train than VAEs, GANs may ignore some patterns of the original distribution during training, since their ultimate goal is to satisfy the objectives of the generator and the discriminator. Fig. 12 shows the difference between the VAE and the GAN in data generation, where the gray lines show the original data distribution and the colored ones show the generated data distribution. From a probabilistic point of view, the GAN tends to generate samples that follow a specific pattern, while the VAE may generate more diverse samples, some of which may lie outside the range of the original data distribution.

Conclusion
Based on our previous analysis and model simulations, we conclude that deep generative models such as the VAE and the GAN can be used as image generators to compensate for skewed data distributions: they produce clearer, more meaningful samples and perform better than traditional feature-space oversampling methods. The ENN-based selection process further optimizes the generation process, making classification boundaries easier to find.
From the generated samples and simulation results, we can see the difference between the VAE and the GAN: the VAE tends to generate slightly blurrier images compared with the sharp, clear images generated by the GAN. However, the VAE better captures the distribution of the original database and therefore tends to yield better classification results.
Besides, we also tested our generative model on more complicated data sets such as CIFAR. However, it is hard to obtain clear and meaningful generation results there, which limits its application to imbalanced learning on complex images (such as scenery or animals). To the best of our knowledge, our proposed model works well on simply structured image data sets, such as MNIST, NIST19, and Fashion-MNIST.
Multi-class imbalanced learning on complex data has long been a challenge for data mining and machine learning research. Several published multi-class imbalanced learning methods focus only on feature-level data, which limits their application to real-world raw data samples. This thesis proposed a deep generative model based imbalanced learning method for image data and provided insights on how to deal with skewed data distributions using state-of-the-art deep learning techniques. However, limited by the current generation abilities of GANs and VAEs, our contribution is only a small step toward the goal of data-space imbalanced learning. We believe that the rise of deep learning, and especially of new deep generative models, will bring new breakthroughs on this topic.

A.1 Data sets and Computational Resources
To verify the effectiveness of the potential solutions, we consider three different image data sets: (1) MNIST database. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST [1]. The digits have been size-normalized and centered in a fixed-size image.
(2) NIST19 database. The NIST19 database is a handwritten English character database. It publishes handprinted sample forms from 3,600 writers, 810,000 character images isolated from their forms, ground-truth classifications for those images, reference forms for further data collection, and software utilities for image management and handling.
Since the training and generation processes of deep generative models cost a significant amount of computation time, we plan to run the programs on the GPU-supported computers in our CISA lab.
Our algorithm is implemented in Python using TensorFlow, an open-source deep learning framework developed by Google Brain [2].