PEDA 376K: A Novel Dataset for Deep-Learning Based Porn-Detectors

Document Type

Conference Proceeding

Date of Original Version



The rapid expansion of the digital world introduces complex challenges within the forensic and security domains. In particular, the wide availability of online pornographic media is a major problem for applications that seek to prevent exposure to inappropriate or undesired audiences, or that aim to automate the detection of illegal behavior. There is a thin veil separating the definitions of pornographic and non-pornographic media, making it difficult, even for humans, to agree on a consistent interpretation. Most of the available APIs for detecting NSFW (not-safe-for-work) media cannot clearly infer whether a file contains pornographic content. In general, given an input file, these APIs return a set of probability scores, leaving the responsibility for the final binary decision to the user. What is more, NSFW APIs do not publicly share their training datasets. Aiming to mitigate these issues, we introduce a novel dataset of images, the Pornographic and Explicit Dataset 376K (PEDA 376K), which was labeled using well-defined criteria to aid the development of machine-learning models for detecting whether an image is pornographic. We also trained decision trees to transform the probabilistic output of standard APIs into binary decisions. We conducted experiments with two datasets, PEDA 376K and RedLight, and found that when the APIs are optimized with decision trees, their average accuracies increase. Finally, we propose a deep-learning architecture trained directly on the PEDA 376K dataset. When comparing this model against state-of-the-art models and their corresponding optimized outputs, it outperforms five existing neural architectures, reaching an overall accuracy of 99.2%.
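The abstract's second contribution, binarizing the probability scores returned by NSFW APIs with a learned decision tree, can be sketched as follows. This is not the paper's code: the score categories, the simulated data, and the tree depth are all illustrative assumptions, using scikit-learn's `DecisionTreeClassifier` to stand in for the trees the authors trained.

```python
# Hypothetical sketch (not from the paper): turning probabilistic
# NSFW-API output into a binary pornographic / non-pornographic decision
# with a shallow decision tree, as the abstract describes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Simulated per-image API scores over three assumed categories,
# e.g. [porn, sexy, neutral]; labels: 1 = pornographic, 0 = not.
n = 1000
labels = rng.integers(0, 2, size=n)
scores = np.where(
    labels[:, None] == 1,
    rng.dirichlet([6, 2, 1], size=n),  # porn-heavy score profiles
    rng.dirichlet([1, 2, 6], size=n),  # neutral-heavy score profiles
)

# A shallow tree learns thresholds over the score vector, replacing a
# hand-picked cutoff with a data-driven binary decision rule.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(scores, labels)

accuracy = tree.score(scores, labels)
```

In practice the tree would be fit on scores from a real API paired with the dataset's ground-truth labels, and evaluated on held-out images rather than on the training scores as done here.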

Publication Title

Proceedings of the International Joint Conference on Neural Networks