Phil Roth

Elastic

EMBER Improvements (slides, code, video)

Endgame released an update to the EMBER dataset that includes updated features and a new set of PE files from 2018.

We used a new process to select the 2018 PE files included in the dataset. Our aim was a testing set that is more difficult for a machine learning algorithm to classify than the original EMBER 2017 set. We also added steps that eliminate the worst outliers and cut down on duplicates in the feature space.
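To make the idea of cutting down duplicates in the feature space concrete, here is a minimal sketch. It is not the selection process used for EMBER 2018; the function name `drop_near_duplicates`, the rounding-based comparison, and the `decimals` parameter are all assumptions chosen for illustration.

```python
def drop_near_duplicates(features, decimals=3):
    """Return indices of samples to keep, dropping repeated feature vectors.

    Assumption for illustration: two samples count as duplicates when their
    feature vectors are equal after rounding to `decimals` places. The real
    EMBER selection criteria are not described here.
    """
    seen = set()
    kept = []
    for index, row in enumerate(features):
        # Round each value so that tiny numeric differences do not hide a duplicate.
        key = tuple(round(value, decimals) for value in row)
        if key not in seen:
            seen.add(key)
            kept.append(index)
    return kept


# Toy feature vectors: the first two differ only far below the rounding precision.
samples = [
    [0.10, 1.0],
    [0.1000001, 1.0],
    [0.50, 2.0],
]
kept = drop_near_duplicates(samples)
print(kept)
```

Keeping the first occurrence of each rounded vector is the simplest possible policy. A real pipeline might instead measure distances between vectors or balance the labels of the samples it keeps.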

The expanded feature set includes corrections to ordinal import calculations, new features that allow the EMBER classifier to be compared to the Adobe Malware Classifier, and an updated version of LIEF. We also recalculated the features for the EMBER 2017 samples and released them, which meant versioning the feature calculation separately from the sample selection.

I’ll talk about the motivations behind all the changes, what research this expansion enables, and the potential dangers in joining the EMBER 2017 and 2018 samples into a single analysis. I’ll also show the results of some of the different classifiers we’ve trained on EMBER 2018 samples.