David Krisiloff
FireEye
Measure Twice, Quarantine Once: A Tale of Malware Labeling over Time (pdf, video)
Cybersecurity uses crowdsourcing for a variety of tasks, from spam detection to security bug bounties. For anti-virus, VirusTotal provides a crowdsourcing platform that aggregates results from more than 70 antivirus (AV) scanners, making it a tempting source of labels for training machine-learning-based AV. However, VirusTotal differs from other crowdsourcing models in several ways. Unlike the contributors in most crowdsourcing settings, AV scanners reliably improve over time: new AV engine versions incorporate new malware signatures that, on average, improve detection performance. Furthermore, VirusTotal detections are public, producing a feedback loop in which AV scanners can learn from other AV scanners. VirusTotal runs each AV engine against every newly submitted file. It also allows users to rescan an old file with the latest AV engines, but limits the number of files that can be rescanned per day. This environment raises several questions. How do we assign malware labels from noisy VirusTotal reports? When should a file be rescanned to take advantage of AV updates? How should rescans be prioritized?
Using a set of historical VirusTotal reports, we examine the temporal dynamics of virus detections and discuss a variety of models for producing labels from the reports. Changes in AV detections over time are generally predictable using machine learning models. This makes it possible to anticipate which files are most likely to change their labels over time, regardless of the function used to combine the crowdsourced detections into labels. We present optimal strategies for rescanning files on VirusTotal to build improved data sets. Ultimately, our models produce more accurate labels faster than passively waiting for AV vendors on VirusTotal to come to a consensus.
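
To make the labeling and rescan questions concrete, here is a minimal sketch in Python. It is not the method presented in the talk. It assumes a simple detection-count threshold as the label function and a hypothetical per-file score, `p_change`, standing in for a model's predicted probability that the file's label changes on rescan. The `Report` type, the threshold of 5, and the function names are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """One file's aggregated scan result (hypothetical structure)."""
    sha256: str
    detections: int   # number of AV engines flagging the file
    engines: int      # number of AV engines that scanned the file
    p_change: float   # assumed model output: probability the label changes on rescan


def threshold_label(report: Report, min_detections: int = 5) -> str:
    """Label a file malicious if at least min_detections engines flag it.

    The threshold is an assumption; any other combining function could be used.
    """
    return "malicious" if report.detections >= min_detections else "benign"


def pick_rescans(reports: list[Report], daily_budget: int) -> list[str]:
    """Spend a limited daily rescan budget on the files most likely to change label."""
    ranked = sorted(reports, key=lambda r: r.p_change, reverse=True)
    return [r.sha256 for r in ranked[:daily_budget]]
```

With a daily budget of `k`, `pick_rescans(reports, k)` returns the hashes of the `k` files whose labels the assumed model considers least stable, which is one simple way to prioritize rescans under a quota.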