Towards A Public Dataset/Benchmark for ML-Sec

Richard Harang, Ethan Rudd

While machine learning for information security (ML-Sec) has taken off in recent years, the field is arguably still in its infancy compared with other areas of applied machine learning such as computer vision and natural language processing (NLP). One reason for the size, traction, and research progress of these other fields has been the presence of large-scale benchmark datasets and common evaluation protocols, e.g., ImageNet and the associated ILSVRC competitions/benchmarks. While many private threat intelligence/vendor response aggregation feeds are available, there are at present few public data sources in the industry that reflect realistic commercial ML-Sec use cases. We argue that this is detrimental to the progress of the industry, as it offers no common benchmark for assessing trained classifiers, and especially detrimental to academic researchers, for whom the resources and infrastructure required to leverage commercial threat feeds and obtain realistic datasets are often a barrier to entry into ML-Sec compared with other areas of applied machine learning.

To this end, we have created an ML-Sec dataset for public release. The first part of this talk will announce the dataset and give an overview of its design, design rationale, and benchmark classifier performance. With respect to design, first and foremost, we release (nearly) raw binary files of both malware and benignware rather than samples already pre-processed into features, allowing researchers to experiment realistically with both feature extraction and model construction. Second, we ensure that the dataset is sufficiently difficult to leave room for performance improvements, even with realistic baseline classifiers; i.e., performance is not saturated at release. Third, we ensure that the dataset is large and representative enough that classifiers' performance will, with high probability, retain its relative rank order even in the presence of much larger training sets. Finally, we provide a variety of metadata for heterogeneous applications, including but not limited to malware detection and malware type tagging/classification, enabling applications richer than simple binary classification.

The second part of this talk will discuss challenges encountered during the dataset release, including hosting, legal/licensing considerations, and security concerns, and will offer suggestions for groups wishing to release similar datasets. The purpose of this part of the presentation is to provide paths forward for other ML-Sec research groups to create similar industry/academic benchmarks and datasets; we hope to expedite this process by flagging core challenges.