Deep Learning of Human Apparent Age for the Detection of Sexually Exploitative Imagery of Children

Over the last decade, advancements in deep learning and computer vision have led to a tremendous growth in performance at the tasks of automated human age estimation and nudity detection. Modern machine learning models can predict with startling accuracy whether an image contains nudity or a minor. When used in conjunction, these technological advancements can be used to identify new instances of child pornography without ever coming into contact with the illicit material during model training. In this thesis, a label distribution learning framework for modeling human apparent age is proposed. Instead of directly modeling a person’s biological age, we use as the ground truth a probability distribution over the guesses of a sample of humans asked how old that person looks. This allows us to better capture the subjective nature of a person’s age and advance state-of-the-art performance at the task of apparent age estimation. Next, we introduce a framework to automatically identify Sexually Exploitative Imagery of Children (SEIC) in both images and video. It is a synthesis of our original age estimation models and Yahoo!’s open-sourced nudity detection model, OpenNSFW. Deep learning models are used to identify the presence of a minor or nudity in any given image or video. The performance of this approach is evaluated on several widely used age estimation and nudity detection datasets. Additionally, preliminary tests were conducted with the help of a local law enforcement agency on a private dataset of SEIC taken from real-world cases.

Table caption: Mean Absolute Error (MAE) on APPA-Real for three different protocols, P0, P1, and P2. For each protocol, four models are trained: 101-way classification, histogram label distribution, normal label distribution, and KDE label distribution. Each model is trained 10 times with the same set of random seeds, and the mean and standard deviation of the MAE are reported for both apparent and biological age estimation. We achieve an MAE of 3.688 with P2 and Kernel Density Estimation, surpassing previous results for APPA-Real reported in [1].
Table caption: Speculative results for SEIC image detection by combining nudity and minor detection. A threshold of 0.402 is used to consider an image NSFW and an age cutoff of 14 is used. The number of false positives from the RedLight dataset is also pro- [...]
[...] has reviewed more than 236 million images and videos, and law enforcement has identified more than 14,500 child victims [2]. As of 2017, NCMEC has sent more than 209,000 notifications to service providers regarding publicly accessible websites (URLs) on which suspected child sexual abuse images appeared.
During the course of an investigation of suspected child pornography, a computer forensics specialist typically spends many hours looking at hundreds of thousands of images and videos. The seizure and further analysis of a suspect's computer and data is tedious and error-prone, invades the privacy of the suspect, wears on the investigator, and demands time that the investigator could be using to address the backlog of cases they likely face. Automating the process of searching images and videos on seized media would drastically reduce the amount of time that investigators have to spend looking at the images, reduce time spent looking through irrelevant non-pornographic photographs, and allow investigators to concentrate on other aspects of the case.
We propose using recent computer vision and machine learning advances to automate the process of identifying Sexually Exploitative Imagery of Children (SEIC) in order to significantly decrease the amount of time law enforcement agents spend on child pornography investigations.
Traditional machine learning methods for this task rely on manual feature engineering and tend to generate many false positives, serving only as a coarse filter for suspected material. When a large number of files are present on the suspect's hard drive, the agent reviewing the case may become overwhelmed with imagery falsely flagged as being SEIC, especially when many pornographic images are present. This is because pornographic content is difficult to distinguish from SEIC using traditional techniques since the notion of age is not explicitly modeled. To address this issue, we propose fusing the predictions of more accurate deep learning models for nudity detection with our recent work in apparent age estimation to classify SEIC material as a synthesis of nudity and minor detection.
The development of computational methods for age estimation from human face images has been one of the most challenging problems within the field of facial analysis [3]. In addition to common difficulties in facial analysis, such as pose and illumination, age estimation proves a significant challenge due to the subjective nature of the problem. The process of aging is unique to every individual and is influenced by their genetics, diet, occupation, and hobbies [4]. This implies that two individuals with the same biological age can have quite different appearances.
Moreover, large annotated databases are difficult to collect, especially for individuals at the lower and upper ends of the spectrum of human ages.
The exact way in which we approach age estimation has been an active area of research. It may be posed as a classification task, where labels are discrete (age groups ranging several years or just a single year), as a regression task, where labels are continuous (in years), or as a hybrid task using both classification and regression methods.
Age estimation may be considered from either of two different representations: biological age estimation or apparent age estimation. In biological age estimation, the actual age of a human subject is predicted while in apparent age estimation the label is the aggregation of a group of guesses made by human labelers. This aggregation is usually the arithmetic mean of the collection of guesses.
Apparent age estimation is a more recent topic, which has received increasing attention [1,5,6] as a result of the deep learning revolution and two apparent age estimation competitions run by ChaLearn in 2015 and 2016 [7]. State-of-the-art results are already beating the human reference. By focusing on the apparent age of individuals, the hope is to alleviate the subjectivity underlying biological age estimation tasks, since human guesses are expected to agree more closely on how old a subject looks.
The goal of Chapter 3 is to approach the apparent age estimation problem under the framework of label distribution learning (LDL) [8]. Unlike classic single-label or multi-label classification, in which instances are assigned to a single label or to multiple labels, the aim of LDL is to assign instances to label distributions, i.e., vectors containing the probabilities of the instance having each label. Our motivation is essentially to find better ways to model the label ambiguity underlying the apparent age estimation problem. Furthermore, APPA-Real [1] has recently been made available. This dataset provides a large number of face images labeled with real and apparent age annotations. APPA-Real contains 7.6k face images with nearly 300k associated human guesses.
We propose an end-to-end framework, based on convolutional neural networks, to learn distributions of apparent age labels. Given an input image of a human face x, we want the model to produce a discretized probability distribution vector where each value at index k represents the probability of x being k years old. In order to evaluate our framework, we conduct experiments with the APPA-Real dataset because it provides a number of human age guesses for each image.
Overall, the contributions of Chapter 3 include:
• A novel end-to-end framework, based on learning label distributions, that leverages the availability of human guesses in the APPA-Real dataset for modeling the apparent age estimation problem;
• Better performance than state-of-the-art methods on apparent age estimation using the APPA-Real dataset. We improve the mean absolute error to 3.688 years;
• Empirical evidence that pre-training using label distributions yields higher performing models regardless of the target task.
Next, in Chapter 4, we introduce a framework for the automatic detection of SEIC videos and images using convolutional neural networks (CNNs). Given some seized hard drive or another source of digital media, all video and image files are located. The probability of each containing child pornography is estimated using our age estimation model [9]. Additionally, we calculate nudity scores using Yahoo!'s publicly available OpenNSFW nudity detection network [10].
Since we are explicitly incorporating the notion of age into our model, we are better able to capture the subtlety between pornography and SEIC. We can also provide more interpretable analysis for law enforcement agents. The number of faces found, the age of each person, and overall nudity detection score for each image will be presented to the user. Images may then be flagged as SEIC with arbitrary precision by tweaking the estimated age required for being a minor and nudity detection score.
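The flagging rule described above can be sketched as a simple conjunction of two thresholds. This is only an illustrative sketch: the `ImageReport` structure, the function name, and the choice of strict versus non-strict comparisons are assumptions rather than the thesis implementation, and the default values are placeholders that a reviewing agent would tune.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageReport:
    """Hypothetical per-image summary presented to the reviewing agent."""
    nudity_score: float          # nudity detection score in [0, 1]
    estimated_ages: List[float]  # one apparent-age estimate per detected face


def flag_for_review(report: ImageReport,
                    nudity_threshold: float = 0.402,
                    age_cutoff: float = 14.0) -> bool:
    """Flag an image when it is scored as nude AND at least one detected
    face is estimated to be younger than the age cutoff.

    Both thresholds are tunable; raising the nudity threshold or lowering
    the age cutoff makes the filter stricter (fewer images flagged).
    """
    has_minor = any(age < age_cutoff for age in report.estimated_ages)
    return report.nudity_score >= nudity_threshold and has_minor
```

Because the two scores are reported separately, an agent can see why an image was flagged and adjust either threshold independently.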
Videos are slightly more complex to classify. Videos are split up into a series of frames and treated as a set of images. Similar characteristics may be reported to the user at a per-frame resolution. A final machine learning model reports the probability of the video containing illicit content, and classification may still be performed to arbitrary precision by specifying some threshold required for video flagging. The agent reviewing the material will then be able to quantify exactly how much nudity and how many children must be detected in a video before flagging it for review.
This novel approach results in a framework which may be as fine or coarse a filter as the agent specifies, in a way previous approaches cannot. It can even distinguish between challenging examples of pornographic and SEIC videos with 89% accuracy using the default thresholds. Since our approach relies on automatic representation learning through the use of convolutional neural networks, those involved in this work never had to be directly exposed to pornographic or SEIC content. Additionally, in contrast to most other works on SEIC content detection, we rigorously evaluate our models on a series of challenging datasets to analyze their performance before presenting results on data collected from real law enforcement cases.
Overall, the contributions of Chapter 4 include:
• A novel framework for the automatic detection of child pornography in videos and images
• A rigorous analysis of the performance of the nudity detection and age estimation models on ethnically diverse and challenging pornographic and non-pornographic images and videos
• Validation of our approach through empirical evidence that treating video classification as a per-frame image classification task with prediction aggregation achieves competitive results at pornography detection on the NPDI video dataset
• A rare evaluation of our framework for child pornography detection on a real-world dataset collected by local law enforcement agents from 20 real-world cases
This thesis concludes in Chapter 5 with a brief summary and a discussion of future work. Additional figures highlighting learned probability distributions are given in the Appendix.
In this chapter, the background literature for deep learning, age estimation, label distribution learning, and pornography detection, which forms the basis for our methodology, is summarized. Currently available techniques for the automatic detection of child pornography in images are also described.

Age Estimation
Lately the automatic estimation of age from facial images of humans has received great and increasing interest [11]. Age estimation can be posed in either of two ways. The question, "How old is the person in this photograph?" can be interpreted as either trying to determine the biological age of the subject, or as observing the "apparent" age. A computer vision algorithm can be created with the intention of satisfying either form of this question. Most of the early work utilizing feature engineering was focused on biological age.
Related work for the automatic estimation of biological age from facial images includes methods that employ hand-crafted features to represent age patterns, e.g., local binary patterns, histogram of oriented gradients (HOG), and biologically inspired features [12]. Given a set of hand-crafted features, the problem of facial age estimation can be modeled as a classification task for discrete age intervals [13], as a regression task for direct age estimation [11], or as a fusion of both tasks. A more complete review of existing age estimation methods is presented in Liu et al. [14].

Biological Age Estimation
The vast majority of existing computational methods focus on the prediction of biological age. This problem has been one of the most challenging in facial analysis. After relying mainly on human inspection of craniofacial features, later studies incorporated a plethora of computer vision and machine learning techniques, first to extract features from images and then to pose the problem either as a classification or as a regression task. An extensive review of different approaches and datasets is presented in [15].
With recent developments in deep neural networks, where manual feature design is no longer required, current age estimation methods have shown impressive performance. In [3], a discussion of human accuracy at predicting biological age on the FG-NET dataset is given. Through aggregation by outlier removal and the arithmetic averaging of ten votes from human labelers, a Mean Absolute Error (MAE) of 4.7 is given as the human error on the dataset. It is interesting to contrast this with results from 2015 on this dataset by the deep learning based model DEX [6], which achieves an MAE of 3.09, surpassing human performance by this metric.
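For reference, the MAE metric used throughout is the average absolute difference between predicted and ground-truth ages. A minimal sketch follows; the function name and the sample values are illustrative only.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - ground truth| over all examples, in years."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Three illustrative predictions versus their ground-truth ages:
# errors are 2, 0 and 5 years, so the MAE is 7/3 years.
example_mae = mean_absolute_error([30, 25, 40], [28, 25, 45])
```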
More recently, state-of-the-art results in biological age estimation have been obtained by casting age estimation as an ordinal classification problem. In [16], the authors proposed training a single Convolutional Neural Network (CNN) with many binary predictors. For $l$ potential age classes, each output neuron $k \in \{1, 2, \ldots, l-1\}$ predicts the probability of example $x$ being older than age $k$. The predictions are then aggregated via Eq. 1, where $\hat{y}$ is the final predicted label and $f_k(x) \in \{0, 1\}$ is the output of the $k$-th classifying neuron given input image $x$. They jointly optimized the $l-1$ binary classifiers over the cross-entropy loss function.

$$\hat{y} = 1 + \sum_{k=1}^{l-1} f_k(x) \qquad (1)$$

The key idea here is recognizing that there is some ordinal relationship in the set of ages to predict. By jointly optimizing several binary classifiers, the authors force a network to extract a set of features to detect whether or not a person is older than age $k$. They then combine the predictions of each simple binary classifier to create a more accurate final prediction.
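The aggregation step can be illustrated with a short sketch. It assumes the common ordinal-regression form in which each binary output is thresholded and the predicted rank is one plus the number of positive outputs; the function name and the 0.5 threshold are assumptions, not details taken from [16].

```python
def aggregate_ordinal(probabilities, threshold=0.5):
    """Combine l-1 binary 'older than k' predictions into one rank in 1..l.

    probabilities[k-1] is the estimated probability that the subject is
    older than rank k, for k = 1, ..., l-1.
    """
    binary_outputs = [1 if p > threshold else 0 for p in probabilities]
    return 1 + sum(binary_outputs)


# With l = 6 classes there are 5 binary predictors. Three of them fire,
# so the predicted rank is 1 + 3 = 4.
predicted_rank = aggregate_ordinal([0.9, 0.8, 0.6, 0.3, 0.1])
```

Note that the sum is well defined even when the binary outputs are not monotonically decreasing, which is one reason this aggregation is convenient in practice.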

Apparent Age Estimation
Apparent age estimation can be considered a relatively new topic in facial analysis. Most age estimation datasets lack apparent age labels and are only suitable for biological age estimation. More recently, researchers have begun modeling the aging process by using the apparent age of a person to train their models. The motivation is that, in reality, people have an apparent age that may differ perceptually from their chronological age. Since feed-forward neural networks learn by adjusting their weights in response to their own prediction error, penalizing the model on a face that does not visually appear to belong to its labeled age group may be counterproductive. The quality of a machine learning model in this context depends on the availability of a large dataset of face images annotated with apparent age labels, such as the one introduced in the ChaLearn Looking At People Apparent Age Estimation competitions run in recent years [17,18].
An overview of the most popular methods for apparent age estimation can be found in the 2015 and 2016 ChaLearn LAP competitions [7]. Both competitions are relevant because they propelled research in apparent age estimation by providing the first dataset annotated with human guesses. Each image in these datasets is annotated with a mean age and a corresponding standard deviation of human guesses.
One of the greatest difficulties in age estimation is how to pose the objective function. The winner of the ChaLearn LAP 2015 competition, whose model was named DeepEXpectation (DEX) [6], utilized a VGG-16 based CNN architecture and performed experiments with posing the problem as a regression or classification with groups of varying sizes. Treating age estimation as a 101-way classification and optimizing the network using the standard cross-entropy loss function (softmax) produced the best results. The key to their success was computing the softmax expected value. This formulation takes advantage of the assumption that when the model misclassifies an image, it is likely to predict an age closer to the ground-truth age. It is important to note that this implicit ordinal relationship is not exploited during the training phase. Additionally, they fine-tuned the network on a crawled dataset of 0.5 million celebrity pictures, collected from IMDB (Internet Movie Database) and Wikipedia. To the best of our knowledge, this is currently the largest annotated, and publicly available, dataset for biological age prediction (IMDB-WIKI).
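The softmax expected value can be sketched as follows: instead of taking the most probable of the 101 classes, the prediction is the probability-weighted average of all ages. This is a minimal illustration with hypothetical function names, not the DEX implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_age(logits):
    """Expected value of the softmax distribution over ages 0, 1, ..., len-1."""
    probabilities = softmax(logits)
    return sum(age * p for age, p in enumerate(probabilities))


# A network torn equally between ages 30 and 32 predicts 31 rather than
# arbitrarily committing to one of the two classes.
logits = [-1000.0] * 101
logits[30] = 0.0
logits[32] = 0.0
blended = expected_age(logits)
```

The expectation turns a discrete classifier into a continuous estimator, which is where the implicit ordinal structure of the labels is used at prediction time.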
The APPA-Real dataset [19] contains a set of human face images with accompanying biological ages and human age guesses. The age guesses were collected via crowdsourcing, with an average of 38 human guesses per image. With such a rich group of labels available for each image, more sophisticated learning techniques can be used that exploit this idea of subjectivity and the ordered nature of ages. Such is the aim of label distribution learning (LDL), which seeks to assign input instances to entire label distributions, i.e., vectors where each element contains the probability of the instance having each label.
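One simple way to turn a set of human guesses into a label distribution is a normalized histogram. The sketch below is illustrative only: the function name, the rounding of guesses to whole years, and the clamping to the range 0 to 100 are assumptions.

```python
def histogram_distribution(guesses, num_labels=101):
    """Normalized histogram of age guesses over labels 0..num_labels-1."""
    if not guesses:
        raise ValueError("at least one guess is required")
    counts = [0] * num_labels
    for guess in guesses:
        # Round to the nearest year and clamp into the valid label range.
        age = min(max(int(round(guess)), 0), num_labels - 1)
        counts[age] += 1
    return [count / len(guesses) for count in counts]


# Five hypothetical guesses for one face image.
distribution = histogram_distribution([24, 25, 25, 26, 30])
```

Unlike a single mean age, the resulting vector keeps the disagreement between labelers, including outlying guesses such as the 30 above.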

Deep Learning
Recent advances in machine learning algorithms, together with the availability of large online datasets and GPU (Graphics Processing Unit) technology, have paved the way to tackling problems once considered impossible, particularly in the field of computer vision. Convolutional neural networks (CNNs) have achieved remarkable levels of accuracy for a variety of tasks, such as the automated detection of faces [20], nudity [21], and human age [16], as well as the traditional computer vision problem of image classification [22]. More spectacular examples include a CNN trained on data mined from geotagged images to determine where a photo was taken just by processing its pixels [23], and artistic style transfer from one image to another [24].
CNNs, a particular type of feedforward neural network, take advantage of the grid-like structure and spatial locality of images to accomplish these tasks. CNNs can not only perform traditional machine learning tasks such as classification or regression, but also learn complex feature hierarchies directly from raw pixels, eliminating the need for manual feature engineering. This ability to automatically learn rich features from the data, as opposed to relying on hand-crafted feature design, is key to their success. In practice, CNNs may prove difficult to train due to the massive number of parameters that must be learned. The large capacity of the network demands many annotated training samples, and even with the general abundance of labeled data found online, researchers often have trouble locating large, specific datasets, although transfer learning may alleviate this issue [25].

Convolutional Neural Networks
Let $\mathcal{X}$ denote some feature space and $\mathcal{Y}$ denote a set of labels. Given some dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}$, where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and each $(x^{(i)}, y^{(i)})$ pair forms a single labeled training example, the goal of supervised learning is to find some function $g$ which best approximates the mapping $\mathcal{X} \to \mathcal{Y}$. This mapping is typically learned by minimizing some loss function $L$ with respect to the parameters of $g$ using stochastic gradient descent. A loss function typically takes the form given as Eqn. 2, where $N$ is the number of training examples and $\theta$ are the parameters of the model.

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L_i\big(g(x^{(i)}; \theta),\, y^{(i)}\big) \qquad (2)$$
One of the most popular loss functions used for classification tasks is the cross-entropy loss function, given for the $i$-th training example as Eqn. 3, where the notation $g_j$ denotes the $j$-th element of the vector of class scores produced by our model $g$ and $g_{y_i}$ denotes the score of the vector element corresponding to the ground truth class.

$$L_i = -\log\left(\frac{e^{g_{y_i}}}{\sum_j e^{g_j}}\right) \qquad (3)$$
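As a concrete illustration, the per-example cross-entropy loss and its average over a batch can be computed directly from raw class scores. This is a minimal sketch with hypothetical function names; it uses the log-sum-exp trick for numerical stability, which is an implementation choice and not something stated in the text.

```python
import math


def cross_entropy(scores, true_class):
    """Softmax cross-entropy for one example: -log softmax(scores)[true_class]."""
    largest = max(scores)
    log_sum_exp = largest + math.log(sum(math.exp(s - largest) for s in scores))
    return log_sum_exp - scores[true_class]


def mean_loss(batch_scores, labels):
    """Average the per-example losses over all training examples."""
    losses = [cross_entropy(s, y) for s, y in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


# A confident correct prediction yields a small loss, while an equally
# confident wrong prediction yields a large one.
confident_right = cross_entropy([10.0, 0.0, 0.0], 0)
confident_wrong = cross_entropy([0.0, 10.0, 0.0], 0)
```

The loss is bounded below by 0 and approaches it as the score of the correct class dominates the others.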
When computed, the value of a loss function is a scalar representing how happy we are with the model's predictions for the training set. The loss function must be bounded below, typically at 0, which represents a perfect matching of training examples to their ground truth labels. If our model g were to make mistakes in mapping input instances x to their corresponding labels y, the output of our loss function would increase; preferably this increase would be proportional to the severity of the mistake, but this is not always the case with generic loss functions as we will see in Section 3.1.
A wide range of machine learning models exist. Some types of models are better at particular tasks than others. Over the last decade, the increasing computational power offered by Graphics Processing Units (GPUs) has given researchers the ability to train classes of models with millions of parameters in a reasonable amount of time. This has led to the dominant performance of convolutional neural networks, a specific type of feedforward neural network, at various computer vision tasks such as image classification. In this thesis, we are concerned with such tasks; therefore, we may further restrict the input space to $\mathcal{X} = \mathbb{R}^{W \times H \times C}$. Each instance $x \in \mathcal{X}$ is an image represented by a tensor of pixel values of dimension $W \times H \times C$: the width, height, and number of color channels of the image, respectively.
In convolutional neural networks (CNNs) [26], the prediction function g can be thought of as the composition of a linear sequence of mathematical operations organized into a computational graph. Each operation is more commonly referred to as a layer in the neural network. There are many types of layers, with the convolutional layer being the hallmark of the CNN. In a convolutional layer, an N × N × K "filter" is slid across the L × M × K input to the layer. At each spatial location a dot product is performed between the tensor of weights which make up the filter and the portion of the input the filter is currently sliding over. The weights in this filter are learnable parameters of the network. Convolutional layers have two key properties that make them especially good at tasks where the input has a grid-like structure, such as an image: sparse interaction and parameter sharing [27].
Sparse interaction: In typical feedforward neural networks, the output of each layer is connected to each hidden node in the next layer. For M inputs and N outputs, this results in a matrix multiplication with a runtime of O(M × N). When each output is instead connected to only K inputs, as with a convolutional filter, the complexity of the operation is reduced to O(K × N), since not every input and output neuron pair is connected by a unique weight. (Here M, N, and K denote the number of inputs, outputs, and connections per output.) In most applications K is typically orders of magnitude smaller than M, so the savings in computational cost, memory, and model capacity gained by using convolutional layers are significant.
Parameter Sharing: Each convolutional filter is slid across all spatial locations of the input layer. For each of these positions, the same matrix of weights is used. This results in a set of weights that are tied together; different elements of the input do not get unique weights. They share the same value. This results in a network that needs to learn the set of weights corresponding to a filter only once for it to be applied to the entire input. If we wanted the same function that is represented by these weights in a traditional neural network the same weights would have to be learned many times over in the much larger parameter space, and we would end up with many redundant representations of the same function.
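Both properties can be seen in a small single-channel example. The sketch below slides one filter over a 2D input (implemented as cross-correlation, as most deep learning libraries do) and compares the number of weights with that of a fully connected layer producing an output of the same size. The function names and the toy values are illustrative assumptions.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D input with no padding and stride 1."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            # Dot product between the shared kernel weights and the
            # patch of the input currently under the kernel.
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(k_h) for j in range(k_w))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def dense_weight_count(in_h, in_w, out_h, out_w):
    """Weights needed if every input were connected to every output."""
    return in_h * in_w * out_h * out_w


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]              # 4 shared weights in total
feature_map = conv2d_valid(image, kernel)
dense_weights = dense_weight_count(3, 3, 2, 2)
```

Here the convolution reuses the same 4 weights at every position, whereas a fully connected layer mapping the 3 × 3 input to a 2 × 2 output would need 36 separate weights.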
More intuitively, the role of the convolutional layer is to learn some intermediary feature representation of the input. For a given image, a convolutional filter will learn to detect visual patterns that are useful for the given task. The first few layers of convolutional filters typically learn simple edge detectors and Gabor filters, while those deeper in the network learn more complex, abstract representations of the input, interpreted as linear combinations of the preceding layer. For a CNN trained to classify images as containing cats or not, some convolutional filters may learn that combinations of lines making up circles are useful features. A filter deeper in the network may learn that particular orientations of circles are useful for determining if eyes are present in an image. The network in its entirety will combine the "detections" of all such filters to make some final prediction, interpreted by the human as whether or not the image is classified as "cat".
Other types of layers present in CNNs include fully connected layers, nonlinearities, pooling, batch normalization, and dropout.
A fully connected layer is simply the matrix multiplication of every input element with its corresponding hidden weight. If there are N input neurons connected to a hidden layer of size M, there will be N × M weights. Non-linearity layers are the same as in typical fully connected networks, with ReLU(x) = max(0, x) being the activation function of choice [28]. Non-linearities are necessary because, without them, a series of fully connected layers is equivalent to a single linear transformation, and the network devolves into a single matrix multiplication.
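The collapse of stacked linear layers can be checked numerically. In this small sketch (hypothetical weights, bias terms omitted for brevity), two fully connected layers applied in sequence give exactly the same result as the single layer whose weight matrix is their product, while inserting a ReLU between them breaks the equivalence.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, value) for value in vector]


w1 = [[1.0, -2.0], [0.5, 1.0]]
w2 = [[1.0, 1.0], [-1.0, 2.0]]
x = [1.0, 1.0]

stacked = matvec(w2, matvec(w1, x))          # two linear layers in sequence
collapsed = matvec(matmul(w2, w1), x)        # one equivalent linear layer
with_relu = matvec(w2, relu(matvec(w1, x)))  # non-linearity in between
```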
Pooling layers replace the output of a convolutional layer with a summary statistic of the activations that are close together spatially. Pooling helps to make intermediary representations invariant to translation and reduces the dimensionality of the input. Batch normalization [29] normalizes the input to a layer to zero mean and unit variance over each mini-batch and then applies learned scaling and shifting parameters α and β. This operation helps alleviate difficulties with properly initializing neural networks, and was instrumental in training very deep CNNs.
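A one-dimensional sketch of both operations follows. It is a simplification under stated assumptions: the function names are hypothetical, pooling uses non-overlapping windows with the maximum as the summary statistic, and batch normalization is shown for a single feature at training time only, with `alpha` and `beta` standing in for the learned scale and shift.

```python
import math


def max_pool_1d(values, size=2):
    """Summarize each non-overlapping window by its maximum activation."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def batch_norm(batch, alpha=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [alpha * (v - mean) / math.sqrt(variance + eps) + beta
            for v in batch]


pooled = max_pool_1d([1, 3, 2, 5, 4, 0])     # halves the length
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

With `alpha = 1` and `beta = 0` the output has approximately zero mean and unit variance; learning other values lets the network undo the normalization where that is useful.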
[...] define a label distribution as a vector of real numbers, in which values P(y) ∈ [0, 1] represent the degree to which the corresponding label y describes an instance. All values in this vector sum to 1, i.e., they form a distribution over the set of labels. A number of methods have been proposed to address this task. In [31,8,32], the proposed LDL methods are based on the maximum entropy model [33]. As distributions from the exponential family arise as natural solutions to the maximum entropy problem, the generality of the solution is restricted. A different group of LDL approaches aims to extend existing machine learning algorithms to deal with learning distributions.
More recently, in [37], CNNs are proposed as an end-to-end learning framework that minimizes the Kullback-Leibler divergence between the predicted and ground-truth discrete label distributions. As ground-truth label distributions are not available in most existing datasets, the authors generated discrete label distributions under proper assumptions. For example, in the age estimation context, the authors labeled each image with a label distribution generated from a normal probability density function. This density function is a natural choice given that, for each image, a mean µ and standard deviation σ are available in the training dataset. When the standard deviation is not available, the value of σ = 2 is arbitrarily chosen. Our work does not assume that human guesses for apparent age estimation are normally distributed, but rather, shows that non-parametric distributions outperform the assumption of a normal distribution.
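The normal-distribution labeling and the Kullback-Leibler objective described above can be sketched as follows. This is an illustrative reconstruction, not the code of [37]: the function names are hypothetical, the density is evaluated at integer ages 0 to 100 and renormalized, and a small `eps` guards against taking the logarithm of a zero prediction.

```python
import math


def normal_label_distribution(mu, sigma=2.0, num_labels=101):
    """Discretize a normal density at ages 0..num_labels-1 and renormalize."""
    weights = [math.exp(-0.5 * ((k - mu) / sigma) ** 2)
               for k in range(num_labels)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) between two discrete label distributions."""
    total = 0.0
    for t, p in zip(target, predicted):
        if t > 0.0:
            log_p = math.log(p) if p > 0.0 else math.log(eps)
            total += t * (math.log(t) - log_p)
    return total


target = normal_label_distribution(mu=30.0)           # sigma defaults to 2
close_guess = normal_label_distribution(mu=31.0)
far_guess = normal_label_distribution(mu=40.0)
loss_close = kl_divergence(target, close_guess)
loss_far = kl_divergence(target, far_guess)
```

The divergence is zero for a perfect match and grows as the predicted distribution moves away from the target, so predictions that are off by one year are penalized far less than predictions that are off by ten.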

Pornography Detection
In this section, we outline recent work in both the automated detection of pornography and SEIC.

Automated Pornography Detection
Currently, many popular websites such as YouTube and Facebook augment their automated content flagging systems with human laborers who review and moderate User Generated Content [38]. The human laborers they employ often quit within months of being hired and are usually not trained or prepared to deal with the trauma of seeing so many deplorable videos [39]. There are clear incentives for tech companies to produce an accurate automated content moderator but, so far, major web companies have not come up with an effective algorithm to automatically remove unwanted content without human intervention, or at least not one that has been disclosed publicly.
Most of the current research on automatic pornography detection considers the application of machine learning techniques to still images [40]. [...] proportions, to train machine learning models for pornography classification [43]. This approach worked well for detecting adult pornography, where many instances of video are studio grade, but failed to generalize to lower-resolution and improperly filmed videos.
Not all digital media containing large amounts of body exposure is considered pornography. Pictures and videos of beach scenes, sports games, and people in revealing clothing fall into a difficult-to-categorize group of images where a large amount of skin is exposed in a non-sexual context. Traditional computer vision algorithms for pornography detection that use skin detection features to detect nudity often fail to capture this nuance [44,45,43]. Ulges et al. improve upon this by using color visual words instead of skin detection to identify this content [46]. More recently, convolutional neural networks have been shown to achieve great success at pornographic image and video classification [47,48,49,21], particularly on the NPDI video pornography dataset [50]. This dataset consists of three categories (pornography, non-porn easy, and non-porn hard) and serves as a challenging, modern benchmark for pornography detection.

Automated SEIC Detection
Microsoft's PhotoDNA technology [51], a resizing resistant image hashing algorithm, is available as a free service and helps stop child exploitation images from being shared online. The algorithm works by comparing the hash value of a suspected image to the hash values contained in a database of identified SEIC images. When an image is flagged as containing SEIC, the service provides the capability to report the illegal content to the NCMEC and appropriate law enforcement agencies. Unfortunately, it is limited to detecting images already cataloged in their database and cannot detect SEIC video at all; once a new instance of illicit material is discovered it takes time to be verified and make it into the database.
Significant effort has gone into automating the detection of new instances of SEIC. In [52], traditional visual and audio features were used to classify SEIC images and videos, while [53] used CNNs. Both train their models and present results on real SEIC data through collaboration with local law enforcement agents.
The work of Sae-Bae [44], which does not make use of CNNs, takes a different approach, similar to the one proposed in this thesis: it poses SEIC detection as a hybrid task of nudity detection followed by age estimation, but relies on manual feature engineering.
A rigorous study on the difficulty of SEIC image classification was given in [54].
Five law enforcement agents were tasked with identifying illicit material with the intention of analyzing the challenges faced when categorizing these images. In order to be categorized as SEIC, the image had to be identified as containing a minor and indecency.
The agents reported that it was difficult to identify the age of the victims in the photographs when there was a discrepancy between bodily and facial features, when the victim was "staged" to alter their appearance (makeup or jewelry was applied that a child does not typically wear), when there was an absence of secondary sex characteristics, and finally because of the natural variation in sexual development and variability across ethnic groups. Children in the developmental stages of early and late childhood (≤ 10 years old) were easy for law enforcement to identify.
Indecency was hard to identify when the offender was absent from the image, the child had positive facial expressions, the context of the image was ambiguous, and the image was taken in a public area. Indecency was easy to identify when there was evident sexual activity between an adult and a child, the victim was in obvious distress, the image was taken in a sexual context, or when the background of the image was suspicious.
This analysis motivates our hybrid approach to SEIC content detection. To be successful at detecting SEIC content, our models must learn how to detect both age and indecency. Models trained for pornography detection that try to generalize to SEIC classification have no concept of age and will likely fail to distinguish normal pornography from SEIC [52]. In contrast to other approaches, we designed a series of experiments inspired by the recent study of agents tasked with identifying SEIC content to analyze exactly how robust our framework is in situations that humans find difficult.

Apparent Age Estimation
In this chapter we define the Label Distribution Learning (LDL) problem [8], our proposed solution, and an overview of our network architecture and training.

Label Distribution Learning
Let $X = \mathbb{R}^{w \times h \times c}$ denote the input space of images, where w, h and c are the width, height, and number of channels of the input instance. Let $Y = \{y_1, y_2, \ldots, y_l\}$ denote the ordered set of labels. An LDL problem is defined as learning the mapping function $f : x \to d$ between an input instance $x \in X$ and its label distribution $d$, given a training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. Moreover, considering the age estimation problem, given an input image x with a discrete label distribution d, we interpret each value $d_i$ as the probability of x being i years old. In the context of this work, the total number of labels is l = 101 with ages ranging from 0 to 100.
When training LDL models with neural networks, specifically CNNs, the Kullback-Leibler (KL) divergence is used as the loss function. The KL divergence can be seen as a measure of the dissimilarity between the ground-truth discrete label distribution and the predicted label distribution. We seek to optimize the loss function given in Eq. 4, where x is an input image, f(x) is the log-normalized probability vector the model produces, and d is the ground-truth label distribution. When performing inference, we follow [6] in taking the expected value of the output distribution to get our final prediction ŷ = E[exp(f(x))] over the ages [0, 1, . . . , 100].
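The loss and inference step described above can be sketched in a few lines. This is a minimal illustration, assuming the standard discrete form KL(d || p) = Σ d_i (log d_i − log p_i) with p = exp(f(x)); the helper names are hypothetical and plain Python lists stand in for tensors, so it is not the training code used in this work.

```python
import math

def log_softmax(logits):
    """Log-normalized probability vector f(x) from raw scores (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_divergence(d, log_p):
    """KL(d || p) between a ground-truth label distribution d and log-probabilities log_p.

    Terms with d_i = 0 contribute nothing, following the convention 0 * log 0 = 0.
    """
    return sum(di * (math.log(di) - lp) for di, lp in zip(d, log_p) if di > 0.0)

def expected_age(log_p):
    """Predicted age: expected value of the output distribution over ages 0..len-1."""
    return sum(age * math.exp(lp) for age, lp in enumerate(log_p))

# Usage: an uninformative model (equal scores for all 101 ages).
log_p = log_softmax([0.0] * 101)
one_hot_20 = [1.0 if age == 20 else 0.0 for age in range(101)]
loss = kl_divergence(one_hot_20, log_p)
prediction = expected_age(log_p)
```

Taking the expectation rather than the arg max lets probability mass on neighbouring ages move the prediction, which is the behaviour the text motivates for ordered labels.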
Consider the context of training CNNs for single label classification using the cross-entropy loss, where each class is an age in years. It is worth noting that iterative gradient updates will induce changes in the model's weights, ig- Furthermore, as we can see in the sample images from the APPA-Real dataset shown in Figure 1, the age of some people can be extremely difficult to predict.
Consider the middle image in the same figure: the apparent age guesses vary widely from 14 to 29 years old. If we limit ourselves to training an apparent age classifier on this image for the subject's mean apparent age of 20, we will cause the model to make more drastic changes in its weights for a "misclassification" into the adjacent age of 19, even though by visual inspection this error seems perfectly reasonable. When we train on label distributions, we alter the ground-truth label to include some adjacent classes.
Alternatively, if a model is trained for regression and the Mean Squared Error is used as the loss function, an insight can be developed for a different type of error. Suppose for some image x the model predicts ŷ = 4 when in reality y = 0. Intuitively, predicting a newborn baby as being 4 years old is inherently a worse error than predicting a 31 year old adult as being 35, yet both incur the same squared error. By transforming the ground truth into label distributions, we can expect the apparent age curve for an infant to be much narrower than the curve for a thirty-five year old. Empirically, in [6], the authors show that training a model for age regression performs much worse than classification and suggest that the large gradients produced by errors prevent the network from converging.

Modeling Human Guesses
One of the advantages of APPA-Real is that each face image in the dataset is labeled with a number of human guesses (approximately 38 per image). This implies that the ground-truth label distribution d, although not directly available, can be easily generated under certain assumptions. For each image in the training set, a ground-truth label distribution is created from all its corresponding apparent age guesses, using the following approaches:

Histogram Distributions
The label distribution is the relative frequency of each guessed age, i.e. the count of guesses for that age normalized by the total number of guesses.
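As a small sketch of the histogram construction, assuming guesses are whole years and that any guess outside 0 to 100 is clamped to the nearest label (the clamping and the function name are assumptions made for illustration):

```python
from collections import Counter

def histogram_distribution(guesses, num_labels=101):
    """Relative frequency of each guessed age over the labels 0..num_labels-1."""
    if not guesses:
        raise ValueError("at least one guess is required")
    # Assumption: guesses are whole years; out-of-range values are clamped.
    clamped = [min(max(int(g), 0), num_labels - 1) for g in guesses]
    counts = Counter(clamped)
    total = len(clamped)
    return [counts.get(age, 0) / total for age in range(num_labels)]

# Usage: four guesses for one image.
d = histogram_distribution([20, 20, 21, 19])
```

Ages that nobody guessed receive exactly zero probability, which is the main difference from the smoothed distributions described next.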

Normal Distributions
The mean apparent age and standard deviation are calculated from the apparent age guesses. Then, a normal probability density function parametrized by the mean and standard deviation is used to generate the label distribution, at intervals {0, 1, . . . , 100}. This type of distribution is most similar to other work done in the field [31,37], and assumes the "subjectivity" in the apparent age of a person is normally distributed.
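A minimal sketch of this construction follows. Two details are assumptions rather than statements from the text: the density values sampled at 0 to 100 are renormalized so that they sum to one, and the standard deviation is floored at a small value so that images whose guesses all agree do not produce a degenerate curve. The population standard deviation is used here; the sample standard deviation would be an equally plausible choice.

```python
import math
import statistics

def normal_distribution(guesses, num_labels=101, min_std=0.5):
    """Discretized normal label distribution from per-image age guesses."""
    mu = statistics.mean(guesses)
    # Assumption: floor the spread so identical guesses still give a valid curve.
    sigma = max(statistics.pstdev(guesses), min_std)
    # The constant 1 / (sigma * sqrt(2 * pi)) cancels in the normalization below.
    density = [math.exp(-0.5 * ((age - mu) / sigma) ** 2) for age in range(num_labels)]
    total = sum(density)
    return [v / total for v in density]

# Usage: guesses centred on 20 years.
d = normal_distribution([18, 20, 22])
```

A wider spread of guesses yields a flatter curve, which is how the "subjectivity" of an image ends up encoded in its training target.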

Kernel Density Estimation
A probability density function of the apparent age guesses is estimated using kernel density estimation in conjunction with a Gaussian kernel for smoothing. In this approach, bandwidth selection is critical and contributes to the interpretability of the final model. Selecting a small bandwidth makes the resulting density "tighter", while a larger bandwidth stretches the density function out over the age guesses. Figure 1 shows examples of images with densities estimated with different bandwidths. A reasonable bandwidth h for a random variable X of length n can be calculated using the formula described in [55], given as Eq. 5, where Var denotes the variance and IQR denotes the interquartile range.

$$h = \frac{0.9\, m}{n^{1/5}}, \quad \text{with } m = \min\left(\sqrt{\mathrm{Var}(X)},\ \frac{\mathrm{IQR}(X)}{1.349}\right) \tag{5}$$
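The effect of the bandwidth can be illustrated with a short sketch. It assumes the common rule of thumb h = 0.9 m n^(-1/5) with m = min(standard deviation, IQR / 1.349), a Gaussian kernel evaluated at the integer ages, renormalization over 0 to 100, and a lower bound on h to avoid a zero bandwidth when many guesses coincide. The function names and the lower bound are illustrative assumptions.

```python
import math
import statistics

def rule_of_thumb_bandwidth(xs):
    """Bandwidth h = 0.9 * m / n**(1/5), m = min(std, IQR / 1.349). Needs len(xs) >= 2."""
    n = len(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    m = min(statistics.stdev(xs), (q3 - q1) / 1.349)
    return 0.9 * m / n ** 0.2

def kde_distribution(guesses, bandwidth=None, num_labels=101, min_bandwidth=0.5):
    """Gaussian KDE of the guesses, sampled at ages 0..num_labels-1 and normalized."""
    h = rule_of_thumb_bandwidth(guesses) if bandwidth is None else bandwidth
    h = max(h, min_bandwidth)  # assumption: guard against a zero bandwidth
    density = [
        sum(math.exp(-0.5 * ((age - g) / h) ** 2) for g in guesses)
        for age in range(num_labels)
    ]
    total = sum(density)
    return [v / total for v in density]

# Usage: guesses ranging from 14 to 29, as in the example discussed earlier.
guesses = [14, 18, 20, 20, 22, 29]
tight = kde_distribution(guesses, bandwidth=1.0)
smooth = kde_distribution(guesses, bandwidth=5.0)
auto = kde_distribution(guesses)
```

With the small bandwidth the peak around 20 is sharper, while the large bandwidth spreads mass toward the outlying guesses of 14 and 29.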

Evaluation Measure
To evaluate the performance of all models, the Mean Absolute Error (MAE) is used. It is the most commonly used performance metric in the age estimation literature, and is defined for N images in Eq. 6 as the average absolute difference between the predicted age ŷ_i and the ground truth label y_i. In this work, results are reported on both the apparent and biological MAE. The former uses the arithmetic mean of the apparent age guesses on an image as the ground truth label, while the latter uses the biological age. As the predicted output of our model is a label distribution, we define the predicted age as the expected value of the predicted label distribution.
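For concreteness, the metric can be written as a short function. The example values below are made up for illustration and are not results from this work.

```python
def mean_absolute_error(predictions, targets):
    """Average absolute difference between predicted ages and ground-truth labels."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)

# Usage: the same predictions scored against two different ground truths.
predicted = [21.0, 34.5, 8.0]
apparent_means = [20.0, 36.0, 7.0]   # hypothetical mean apparent ages
biological_ages = [23.0, 33.0, 6.0]  # hypothetical biological ages
apparent_mae = mean_absolute_error(predicted, apparent_means)
biological_mae = mean_absolute_error(predicted, biological_ages)
```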

Transfer Learning
In [56], the authors explored transferability of existing CNN models for age and gender classification. Experimental results show significant gains when comparing transferred CNNs with a baseline model. Furthermore, in [57], state-of-the-art performance is achieved for facial expression recognition by using transfer learning from the popular ImageNet dataset.
As large, high-quality datasets for apparent age estimation are scarce, in this work we conduct a thorough exploration of the impact of transfer learning on learning label distributions. Our motivation is to transfer low-level feature representations from large datasets, such as ImageNet and IMDB-Wiki, and to explore how using a consistent loss function across pre-training and fine-tuning may benefit the performance of the network on the final task.

Apparent Age Estimation Experiments
In this section we first provide a detailed description of our experiments, including the datasets, network architecture, transfer learning procedures, hyperparameter selection, and learning rate schedules. Then, a summary of the results and findings is presented followed by discussion and analysis.

Datasets
To the best of our knowledge, IMDB-Wiki [6] is the largest publicly available Likewise, all images where the face of the subject could not be located were also removed. This pre-processing was done with the meta-information provided by the maintainers of the dataset. We used 90% of the remaining images for training and 10% for validation, that is, 165,970 and 18,442 images respectively.
In [1], the APPA-Real dataset was introduced with the objective of gathering a large, robust set of human apparent age predictions on images of people "in the wild". The authors relied on crowd-sourcing to label each of the 7,591 images.
The dataset is divided into three folds, containing respectively 4,113 images for training, 1,500 for validation, and 1,978 for testing. The pictures collected belong to about 7,000 different individuals taken in varying conditions of lighting and image quality, which makes the dataset more representative of the real world.
Experimental results are presented supporting the benefits of training age estimation models on a "wide" dataset [56]. In APPA-Real, each image is labeled with both biological and apparent age labels. On average, each image contains apparent age guesses given by 38 different people. A biological MAE, which refers to the average difference between the biological age and the predicted age, of 4.573 was aggregated from the human labelers. This means that when we put together the predictions of all the crowd sourced respondents, the "wisdom of the crowd" is wrong by 4.573 years on average. APPA-Real is a unique dataset that provides both real and apparent age labels required for LDL.

Training Details
For all pre-training and fine-tuning tasks, we implement the same preprocessing procedure described in [6], which includes a face detector that performs face rotation to up-frontal position, and performs a crop of the face with 40% margin to maintain background context in the image. The IMDB-Wiki and APPA-Real datasets provide pre-processed images of this form online.
As for the network architecture, we used the implementation of dense convolutional networks (DenseNets) [58] provided by PyTorch. We chose a DenseNet based architecture because of its reduced number of parameters, computational efficiency, and training times. We specifically chose the 161-layer architecture because of its superior performance at the ImageNet challenge. Figure 2 shows the steps taken for performing inference on an input image, which involve pre-processing the image followed by a feedforward pass on the trained DenseNet.
All of our models are trained with stochastic gradient descent, a Nesterov momentum of 0.9, a batch size of 64, and weight decay of 1e-5. These parameters remained consistent throughout all pre-training and fine-tuning procedures. The learning rate schedule varied depending on the loss function being used.
When pre-training on IMDB-Wiki, the initial learning rate is set to 0.01 and 1.0 for the cross-entropy and KL-divergence losses respectively. In both cases, the All experiments are repeated 10 times per task using the same seeds appropriately. During pre-training we use uniform seeds for weight initialization, estimating the order of training examples, and data augmentation.

Protocols
In order to organize our experiments we define the following protocols:
P 0 : Model is pre-trained on ImageNet and fine-tuned using the APPA-Real dataset.
For each of the protocols described above, we fine-tuned the models with APPA-Real using 4 different target tasks:
• 101-way classification with cross-entropy loss;
• Histogram label distribution with KL-divergence loss;
• Normal label distribution with KL-divergence loss; and
• Kernel density estimation label distribution with KL-divergence loss.

Results and Analysis
Performance on IMDB-Wiki is not commonly reported in the literature, as it is a dataset used for pre-training only. The biological MAE on the validation set is reported in Table 1  Even though the performance of the label distribution model is worse in terms of MAE, it is suspected that the model has learned a better feature representation from which more accurate models may be fine-tuned because of the explicit ordinal relationship introduced when training for label distributions. It is important to note that as of now only one other work has been published using the APPA-Real dataset, which we compare our results to in Table 2. Our best performing model, a label distribution learning model trained on kernel density estimation distributions, surpasses the previous state-of-the-art apparent age estimation model reported in [1]. Interestingly, our model trained for apparent age estimation matches the performance of the Real DEX model trained for biological age estimation at predicting the biological age. In Figure 3 and Table 3, we report our final results on the APPA-Real dataset. Figure 3 shows the tight variance in our model's performance at apparent age estimations while Table 3 also includes the standard deviation and biological MAE.  Figure 4, as the peaks of the predicted distribution happen to be close together even though nothing in  we must assume some normal distribution with fixed standard deviation.
The significantly improved performance of label distribution models in P 0 and P 1 over 101-way classification suggests that it is not just pre-training with the same loss function that is responsible for the improved performance of the models; those fine-tuned using label distribution learning perform better in all cases.
In order to get additional insights, Figures 4

CHAPTER 4 Automated Detection of SEIC
In this chapter a framework for SEIC detection is introduced, as are the techniques used for image and video processing and inference. The various deep learning models used are described in detail.

Age Estimation and Label Distribution Learning
By taking advantage of the per-image collection of human guesses available in the APPA-REAL dataset, a unique normal distribution may be fit to model the age of each image as described in [9]. The model will then be trained to predict each image's unique discretized distribution using the Kullback-Leibler (KL) divergence.
Optimizing our model against a distribution of age labels explicitly takes advantage of the ordered nature of aging in a way that the cross-entropy loss cannot. Since a unique distribution can be specifically fit to each image, the subjectivity or difficulty of the image is captured as the width of the image's label distribution.
The intuition we are hoping to achieve from modeling these distributions is that the tighter the distribution gets the easier an age is to guess.
For more details on label distribution learning, see Section 3.1.

Integrating Deep Learning Models
The framework has been developed with the goal of a unified data pipeline in mind. A basic diagram of the approach is provided as Fig. 6. Highlighted in gray are the applications in which deep learning is employed to achieve the specified task. Tasks that are vertically aligned can be performed in parallel.
In the current implementation, each task is run sequentially (to optimize GPU memory availability) and log files containing relevant information move between By moving batches of images through the pipeline, the application would be able to process media in real-time and work with live video.
Each part of the application is independent of the others' operation. Processing starts with the file crawler, which given a base directory, identifies every image and video file on the system. Next, videos are sent to the frame extractor to separate each video into a series of discrete frames to be processed as images.
Currently each video is sampled at an arbitrary rate of 1 frame per second and saved to the file system as a set of images. While greater accuracy could likely be achieved with a higher frame rate or by taking advantage of spatio-temporal features present in videos, such analysis is excluded due to concerns for the increased computational cost of such techniques and to preserve the simplicity of our models. A more intelligent frame extraction technique which makes multiple passes on a video to collect more frames from difficult to analyze portions of a video could easily be fit into the architecture of the framework and would likely yield improvements in performance as well.
Additionally, our framework relies on nudity detection, face detection, and age estimation to identify illicit SEIC material. Large scale datasets for facial age estimation in videos do not yet exist; building them remains an open area of research.
Yahoo's OpenNSFW CNN serves as the base of the pornography detection module [10]. The model achieves satisfactory performance at identifying pornography in images containing SEIC without any fine-tuning, which we evaluate using the publicly available weights. It accepts images as input and outputs the probability of the image containing NSFW content. Preprocessing for this CNN includes converting to an RGB color format, and resizing the image to be 256x256 using a bilinear interpolation method. format, resized to 640x640 using a bilinear interpolation method, and each color channel has the mean color value of ImageNet subtracted from itself.
The age estimation module is based on our previous work in [9]. As input it accepts human facial images, and outputs a probability distribution across ages.
In order to determine if an image contains a minor, we simply take the expected value of the distribution and check whether the predicted value is less than the threshold for adulthood. To prepare input images, they are converted to an RGB color format, resized to 256x256 with bilinear interpolation, center cropped to a size of 224x224, and each channel is standardized to ImageNet's mean and standard deviation.

Performing Inference
When performing inference on images, the predictions of the age estimator and the NSFW detector are combined. If a minor is detected and there is NSFW content in an image, then it is flagged as suspected child pornography.
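The combination rule can be sketched as follows. The default cutoffs (an age cutoff of 14 and an NSFW threshold of 0.402) are the values reported with the image detection results elsewhere in this document; the function names, the record layout, and the choice of a non-strict comparison on the NSFW score are assumptions made for illustration.

```python
def expected_age(distribution):
    """Expected value of a predicted age distribution over ages 0..len-1."""
    return sum(age * p for age, p in enumerate(distribution))

def flag_image(face_age_distributions, nsfw_score, age_cutoff=14, nsfw_threshold=0.402):
    """Combine per-face age predictions with the image-level NSFW score.

    The image is flagged for human review only when at least one detected face is
    predicted to be younger than the cutoff and the NSFW score reaches the threshold.
    """
    ages = [expected_age(d) for d in face_age_distributions]
    minor_detected = any(age < age_cutoff for age in ages)
    nsfw_detected = nsfw_score >= nsfw_threshold
    return {
        "ages": ages,
        "nsfw_score": nsfw_score,
        "flagged": minor_detected and nsfw_detected,
    }
```

Returning the individual ages and the score alongside the decision mirrors the logging step, so a reviewer can see why an image was or was not flagged.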
The ages of each person detected in the image and the NSFW score are logged to file. For videos, inference is performed on each extracted frame.
Results for each frame are then aggregated for final video classification. A video is not flagged as illicit material if only a single frame contains contraband because the application may produce many false positives. As the number of frames sampled in a video grows, so would the likelihood of a false positive using this criterion. We therefore propose training a support vector machine (SVM) for classifying a video as containing child pornography.
This meta-classifier is trained on a normalized histogram feature set consisting of the output predictions of both the nudity detection and age estimation models.
In addition to these discretized age and NSFW group bins, the number of faces, children, frames with nudity, total frames, proportion of frames with nudity, and proportion of faces containing a child are used as features. Two categorical features were also added to the model based on whether or not 30 NSFW frames or children were found in the video. The value 30 was chosen because, due to our frame sampling rate, it would imply that at least 30 seconds of nudity or children appeared in the video.
Another benefit to posing the final video classification this way is that the SVM will be able to provide a confidence value in the classification of the video.
This value may be tuned to the user's preference of whether it is better to identify more examples of illicit material at the expense of increased false positives or to serve as a more restrictive filter, only presenting the most probable instances to the law enforcement agent using the tool.
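The per-video feature set described above can be assembled roughly as follows. The bin edges are hypothetical, since the text does not state how the age and NSFW scores are discretized, and the function names are illustrative; the resulting vector would then be passed to an SVM implementation of the reader's choice.

```python
# Hypothetical bin edges; the text does not specify the discretization.
AGE_EDGES = [0, 10, 14, 18, 25, 40, 101]
NSFW_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def normalized_histogram(values, edges):
    """Fraction of values in each [edges[i], edges[i+1]) bin; last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def video_features(frame_nsfw_scores, face_ages, age_cutoff=14,
                   nsfw_threshold=0.402, min_count=30):
    """Feature vector for one video from per-frame NSFW scores and per-face ages."""
    total_frames = len(frame_nsfw_scores)
    nsfw_frames = sum(s >= nsfw_threshold for s in frame_nsfw_scores)
    faces = len(face_ages)
    children = sum(a < age_cutoff for a in face_ages)
    return (
        normalized_histogram(face_ages, AGE_EDGES)
        + normalized_histogram(frame_nsfw_scores, NSFW_EDGES)
        + [
            faces,
            children,
            nsfw_frames,
            total_frames,
            nsfw_frames / total_frames if total_frames else 0.0,
            children / faces if faces else 0.0,
            float(nsfw_frames >= min_count),  # at least 30 NSFW frames
            float(children >= min_count),     # at least 30 child faces
        ]
    )
```

At a sampling rate of 1 frame per second, the two final indicator features correspond to roughly 30 seconds of flagged content, which matches the reasoning given for the value 30.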

Training Details
Models were evaluated locally on an Ubuntu 14.04 server configured with 4x Nvidia Titan X Pascals, 2x Intel Xeon CPU E5-2620v4, and 64 GB of memory. All CNNs were trained and run for inference on the GPUs. OpenNSFW and the S3FD face detector used the Caffe deep learning framework while the age estimation model runs on PyTorch. Yahoo!'s OpenNSFW nudity detection model [10] was initialized with ImageNet weights and then trained on a private dataset of both NSFW and SFW images. The S3FD face detector [20] uses a VGG-16 architecture and is fine-tuned from ImageNet onto the WIDER FACE face detection dataset [60]. We used these two models as provided by the authors on their respective GitHub pages with their provided weights. Our age estimation model uses a DenseNet-161 architecture pre-trained on ImageNet, fine-tuned onto a collection of 130,000 images of actors and actresses crawled from the publicly available IMDB-Wiki dataset [61] for the task of label distribution learning. Since apparent age guesses are not available for this dataset, the distribution was parameterized by a fixed standard deviation of 3 and a mean equivalent to the age of the subject as a proxy. Finally, our model is fine-tuned onto APPA-REAL [19] for the task of apparent age estimation using normal label distributions parameterized by the mean and standard deviation of the per-image human guesses.
Analysis of the runtime of the models deployed at the local law enforcement agency is also provided. This workstation is less powerful than the one the models were trained on, and only has 1 Nvidia Quadro P4000. In total, 1,956 videos were split into 1,044,577 frames. 123,200 images were detected. It took the file crawler 11 minutes to identify all files (single thread), 3 hours and 4 minutes for the frame extractor to extract the frames at a rate of 1 per second (using a single thread), Significant improvements in runtime could be made by altering the face detector and OpenNSFW models to support batch sizes larger than 1, and additional improvements could be made by parallelizing the file crawler, frame extractor, and face cropper. In the future, we plan to implement these improvements in addition to exploring options for reducing the complexity of the CNNs used in order to speed up inference time.
Keep in mind that the CNNs used for the face detector, nudity detector, and age estimator were not fine-tuned for the task of SEIC content detection or trained on video frames at all. In the next section, we present empirical results that demonstrate the generalization abilities of these models in challenging and diverse real world environments. The age estimation model is shown to generalize to challenging images of adults and children and to the never-before-seen space of NSFW images, and the variance of the model's predictions on video frames is analyzed.
The OpenNSFW model is also validated on the RedLight dataset. To the best of our knowledge, all reported results come from proper "test" sets, that is, data that the model has never been exposed to during training. Our uncertainty stems from the possibility that OpenNSFW's private training dataset may have included some of our test data, although this is unlikely.
As mentioned in Section 4.3, a meta-classifier was trained on the age and nudity score predictions of each frame of every video. When evaluating results for video classification, a different protocol is used since some amount of model training is performed. In all cases, the data is split into two groups. 75% of the data is used to perform a 5-fold cross validation before the model is retrained on that whole subset of data to be evaluated on a held out 25% test set. Confusion matrices, plots, and the reported true positive rates come from the test set.
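The evaluation protocol for the meta-classifier can be sketched as an index split. This is an unstratified illustration with a fixed seed, both of which are assumptions, as the text does not say whether the split was stratified by class; the function name is hypothetical.

```python
import random

def split_protocol(n_items, test_fraction=0.25, n_folds=5, seed=0):
    """Hold out a test set, then build cross-validation folds on the remainder.

    Returns (development indices, test indices, list of (train, validation) pairs).
    After cross-validation the model would be retrained on all development indices
    and evaluated once on the held-out test indices.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_fraction)
    test, dev = indices[:n_test], indices[n_test:]
    folds = [dev[i::n_folds] for i in range(n_folds)]
    cv_pairs = []
    for k in range(n_folds):
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        cv_pairs.append((train, folds[k]))
    return dev, test, cv_pairs

# Usage: 100 videos -> 75 for cross-validation and retraining, 25 held out.
dev, test, cv_pairs = split_protocol(100)
```

Keeping the test indices out of every fold is what allows the reported confusion matrices and true positive rates to be read as held-out performance.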

Robustness of NSFW and Age Models
In this section we establish the generalization capabilities of our age estimation and nudity detection models on NSFW and SFW content, different ethnic groups, challenging images, and on videos.

Evaluation of Age Estimation and Nudity Detection Models on SFW and NSFW Images
Results for minor classification on the APPA-Real dataset's test fold are presented in Fig. 8. When the age estimator is given a properly extracted image of a face, it is quite good at correctly identifying adults as adults yielding false positive rates of only 1.11% and 0.78% for cutoff ages of 14 and 18 respectively.
Next, the performance of the age estimator on all the NSFW material included in the RedLight dataset is analyzed. Since all the data was collected from legal pornography websites, it is assumed that there are no minors present in any of the images, i.e. a perfect model would label every face as being an adult. The true positive rate for adult classification is given in Table 4.
Table 4: Binary classification of faces extracted from NSFW images into adults and minors using a cutoff age of 14. The accuracy of the model degrades significantly here in comparison to APPA-Real, but the model still generalizes well to this far more uncontrolled environment.
number of images in the dataset. The face detector may not have detected a face in every image, or in some instances multiple faces were detected and analyzed.
A detailed description of each of the categories shown as rows in Table 4 is as follows. Partial nudity includes suggestive imagery such as lingerie models and people in bathing suits, some of which are provocatively posed but without exposed It is suspected that the feature space of pornographic images is too different from "natural" images, and the quality of the age estimation model is suffering.
The number of false positives (minors) detected in RedLight may be significantly reduced by fine-tuning the model on such content, but that would create a serious issue with class imbalance as no labeled faces of children in that context are available. A more feasible approach would be to fine-tune the nudity detection model directly on SEIC material to discriminate SEIC vs. non-SEIC content, as was done in other recent works, so there would be no concerns about gathering the detailed facial labels required to fine-tune the age estimator.
The adult actresses found in this type of material are also typically biased towards the younger side as many popular categories of pornography predominantly feature young looking adults. Making the distinction between these adult actresses and actual minors is extremely difficult, even for trained professionals such as the law enforcement agents interviewed in [54]. One participant of this study is reported as saying "I was starting to struggle around teenage years, because some I thought, oh, they could be 12 or they could be 18." Another participant said "The ones that are difficult are when there's sort of . . . well, its the age, isn't it, whether you are looking at them thinking, well, are you 15 or are you a young-looking 18-year-old, or are you an old-looking 15-year-old, and it's that area that's difficult" [54].
Table 7: Binary classification of ethnic faces extracted from NSFW images into adults and minors using a cutoff age of 14.
In Table 7, some bias is observed in detecting the age of two different ethnic-
Table 8: Binary classification of extremely challenging images of children (+) and adults (-) using a cutoff age of 14.
through our framework; the results for age estimation are presented in Table 8.
The key observation from this experiment is that the age estimation model relies

Evaluation of Age Estimation Models on Videos
Our framework operates under the assumption that if our models work well on images then they will also perform well at inference on video frames, even though neither of our models was trained on video frames at all. This is a problem because the quality of video frames is generally much worse than that of images.
Additionally, when a face is detected in a video the quality or resolution of the cropped image may be too degraded to get an accurate estimation of age, especially when the face is blown up to the required input size for our model (224x224). In the future, performance may be increased substantially by fine-tuning both the NSFW and age estimation modules on individual frames from videos.

Classification of SEIC Videos and Images
Even though deep learning models perform well individually on RedLight and other legal media, our analysis must include the target domain of child pornography detection. In this section results on SEIC image classification, NSFW video detection, and SEIC video detection are presented in addition to a discussion on the advantages of our framework from an interpretability standpoint and practical considerations.

NSFW Video Detection
In 97.9 ± 0.7
Table 10: Performance on the task of pornographic video detection using the meta-classifier compared to other deep learning methods on the same dataset.
Figure 10: Sample frames drawn from the NPDI dataset. The top row displays frames from the non-porn hard subset of data. The bottom row shows samples from some non-porn easy videos. Even though the content of the non-porn easy videos may be easy to distinguish from pornography, it may still be subject to challenges such as blur and poor quality.

SEIC vs. NPDI Videos
Finally, the main contribution of this work is presented: results for classification of child pornography videos by training an SVM on age and nudity features, referred to as our meta-classifier. The NPDI video pornography classification dataset is used as a baseline to compare our model against. The NPDI dataset contains both pornographic and non-pornographic videos. The benign videos are separated into two categories: non-porn hard and non-porn easy. Examples from the publicly available NPDI dataset are given in Fig. 10. All results in this section use a sensitivity threshold (probability) of 0.50 for classification.
In Table 11  It is important to reiterate that neither the age estimation model nor the nudity detection model has ever been exposed to frames/videos from NPDI or the SEIC dataset. By posing the final classification in this way, the base models can be trained and tested locally on easy-to-obtain pornography and aging datasets.
Careful evaluation and analysis of ethnic bias, challenging examples, and common mistakes may be performed that is not possible on the target dataset due to its grotesque and restricted nature.
Nonetheless, as shown in the confusion matrix presented as Fig. 11a  Another surprising observation may be made from this plot. Many SEIC videos are found to contain no faces of children, but a large amount of nudity.
Pornographic videos do not have this property; almost all of these videos have between 5% and 20% of the faces mistakenly classified as under 18. It seems that the meta-classifier learned this strange relationship in our datasets and was still able to distinguish between the two groups successfully. On the other hand, if a tool were to be deployed to monitor images and videos passing through a network or another high traffic application where even a 1% false positive rate is not acceptable, these options could be made very restrictive and only flag the most suspicious content for human review, i.e., images where nudity and a child's face are detected with high probability or videos where most of the frames are identified similarly. This level of interpretability may be used to generate a report with many options for filtering and sorting for the user. Perhaps an agent wants to sort by the probability of SEIC, or only review content with faces and nudity (irrespective of the presence of a detected child).
Another avenue, unexplored so far in this work, is file system analysis. Due to the structured nature of the file system on a seized computer, it seems natural to assume that directories adjacent to those containing SEIC content are also likely

Practical Limitations and Considerations
While the results can be seen as impressive, this framework should only be used as a filter to flag suspected instances of digital media for further human review.
The law enforcement agent in the middle cannot be removed from the process of tagging SEIC content, and the agent should be aware that while the age estimation model has been rigorously validated in many different scenarios, it may still fail for non-obvious reasons. Some SEIC content will go undetected on a suspect's hard drive, and some borderline material will be falsely flagged as SEIC. However, it is often the case that suspects in child pornography cases possess many gigabytes of SEIC. The age estimation model works best on images where faces are present and of reasonable quality. In the end, it must be up to the human agent to decide whether or not a particular file should be used for prosecution or identified as SEIC content.
As is evident from the discussion between law enforcement agents regarding the classification of this type of material, there is also a large amount of nuance to be captured that we cannot explicitly model: "But in terms of any sort of sexual nature, actually in the image, there isn't any. Because it is actually a family photograph (Agent 1). Yeah, and in fact, despite the fact that they're naked, it's completely unisexual . . . It's not . . . there's nothing about it that's indecent (Agent 3)." [54] In this particular scenario, our model would almost certainly flag the image in question as SEIC. The human agents must be kept in the loop to rectify these types of mistakes. The true ability of an automated tool to cut down on the subjectivity and personal bias exhibited by law enforcement agents, as called for in [54], would have to be further analyzed before replacing humans is even considered. If this is validated, however, there is great potential to further reduce the emotional damage and time burden imposed on the law enforcement personnel who manually identify this material.

CHAPTER 5 Conclusion
In this thesis we presented a study of convolutional neural networks trained on label distributions for the task of predicting the apparent age of face images. We provided analysis on different ways to pose the objective function for this task, and showed empirically that label distributions are the most intuitive solution. Models pre-trained using label distributions learn better underlying feature representations even when the true distribution is not available. We also achieved a new state-of-the-art mean absolute error in apparent age estimation on the APPA-Real dataset.
Our best model implements a KL-divergence loss function and defines the ground-truth label of each image as a nonparametric density function estimated from its corresponding human guesses.
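As an illustration of this formulation, the sketch below builds a discrete label distribution over 101 age classes from a set of human guesses using a Gaussian kernel density estimate, and evaluates the KL divergence between that target and a predicted distribution. The Gaussian kernel, the fixed `bandwidth` value, and the function names are assumptions made for the example; the bandwidth selection used in our experiments is not reproduced here.

```python
import math
from typing import List, Sequence

NUM_AGES = 101  # age classes 0, 1, ..., 100


def kde_label_distribution(guesses: Sequence[float],
                           bandwidth: float = 2.0) -> List[float]:
    """Discretized Gaussian KDE over age classes, normalized to sum to 1."""
    if not guesses:
        raise ValueError("at least one human guess is required")
    density = [
        sum(math.exp(-0.5 * ((age - g) / bandwidth) ** 2) for g in guesses)
        for age in range(NUM_AGES)
    ]
    total = sum(density)
    if total == 0.0:
        raise ValueError("guesses lie too far outside the 0-100 age range")
    return [d / total for d in density]


def kl_divergence(target: Sequence[float],
                  predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(target || predicted); eps guards against log of zero."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0.0
    )
```

During training, `predicted` would be the softmax output of the network for the image, and the loss would be averaged over a mini-batch.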
Additionally, we presented a framework for the automated detection of Sexually Exploitative Imagery of Children (SEIC) in images and video by utilizing a fusion of age estimation and nudity detection model predictions. A rigorous analysis of the models' performance at age estimation and nudity detection was conducted, including experiments on ethnically diverse and challenging data. We examined the variability of modern age estimation models by utilizing the unique nature of the GLAMOUR YouTube videos. Finally, our framework was validated on a real world dataset consisting of 1,109 videos and 84,619 images of SEIC content from 20 cases collected by local law enforcement agents. We achieve 97% accuracy in SEIC video classification (SEIC versus easy non-porn), 94.5% accuracy in NSFW video classification (NSFW versus non-porn), 89% accuracy in SEIC versus NSFW video classification, and finally, 84% accuracy in SEIC versus all of NPDI video classification.
In the future, we hope to expand upon our work by analyzing the interpretability of our predicted label distributions and by applying unsupervised learning to exploit the massive amount of unlabeled face-containing data. It has recently been shown that fine-tuning the nudity detection module on both NPDI video frames and the target illicit material results in a significant boost in performance [64] (a 9 percentage point boost in accuracy over the OpenNSFW model in their experiments). Additionally, models taking advantage of spatio-temporal features have been shown to improve results in pornographic video detection as well [21]. The SEIC video detection procedure could be extended to have the frame extractor make several passes over a video, extracting additional frames to better inform the final classifier if the nature of the video is ambiguous.
Perhaps most exciting, however, is the prospect of investigating minor detection and general apparent age estimation in videos. By utilizing the GLAMOUR YouTube videos, we hope to soon release a framework for evaluating age estimation models in videos. Preliminary work in this direction produced an analysis of the variability and sensitivity of our age estimation models' predictions in videos.
This served as inspiration for a new semi-supervised learning algorithm to take advantage of the massive amount of unlabeled facial data available in videos.
The proposed algorithm will train a model with age-labeled facial images jointly with unlabeled frames of videos. Suppose we have a video of the popular American actor Nicolas Cage. This video will be split up into a series of frames, and a face detector will be run to extract all faces. Next, a facial verification model will be used to identify all facial images of Nicolas Cage. All other faces will be thrown out. We are then left with a large set of frames all containing the same person throughout the video. Such a dataset already exists with millions of unlabeled faces organized by human subject.
This algorithm will exploit the assumption that a person will not age throughout a video. If a person's age does not change, then our model should predict the same age consistently for every extracted facial image. We can therefore enforce a consistency loss that penalizes the model for predicting different ages for the same person. The model will then be jointly optimized with the labeled image data to train a model which should be far more robust to spatial rotation of the face, closed eyes, and other phenomena found in videos. The predictions of our model should also be much more consistent for the same person.
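One possible form of such a consistency term is sketched below: the expected age is computed from each frame's predicted distribution, and the variance of those expected ages across frames of the same person is used as the penalty. The choice of variance as the penalty, the `weight` hyperparameter, and the function names are assumptions for illustration, since the proposed algorithm has not yet been finalized.

```python
from typing import List, Sequence


def expected_age(dist: Sequence[float]) -> float:
    """Expected age under a predicted distribution over classes 0..N-1."""
    return sum(age * p for age, p in enumerate(dist))


def consistency_loss(frame_dists: Sequence[Sequence[float]]) -> float:
    """Variance of expected ages across frames showing the same person.

    The value is zero when every frame yields the same expected age.
    """
    ages: List[float] = [expected_age(d) for d in frame_dists]
    if len(ages) < 2:
        return 0.0
    mean = sum(ages) / len(ages)
    return sum((a - mean) ** 2 for a in ages) / len(ages)


def joint_loss(supervised_loss: float,
               frame_dists: Sequence[Sequence[float]],
               weight: float = 0.1) -> float:
    """Labeled-image loss plus the weighted consistency penalty."""
    return supervised_loss + weight * consistency_loss(frame_dists)
```

Here `supervised_loss` stands for the loss on the age-labeled images (for example, a KL-divergence loss), and `frame_dists` holds the predicted distributions for the unlabeled frames of a single identity.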
The overarching goal of this future work is to improve the age estimation models and further tackle the problem of SEIC detection. It is the belief of the author that, based on the results shown thus far, this is an extremely important real world problem that is ready to be solved.