Derek Everett, Edward Raff, and James Holt
Hamm-Grams: Mining Common Regular Expressions via Locality Sensitive Hashing (pdf, video)
n-grams have proven to be simple and efficient features for many domains in machine learning, but these features are intrinsically brittle: a change to any one of the n tokens produces a different n-gram. We develop Hamm-grams (h-grams), a new alternative to n-grams that allows wildcard tokens. We apply the method to malware detection with static features, where common patterns of bytes can only be represented by expressions that include wildcards. We devise an efficient algorithm for finding common h-grams using a new locality-sensitive hash. We then demonstrate the power of h-gram features in tasks important for malware classification and analysis.
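To make the idea of an n-gram with wildcard tokens concrete, the sketch below merges two equal-length byte n-grams into a pattern that keeps the positions where they agree and places a wildcard where they differ, then counts how often that pattern occurs in a byte string. This is only an illustration of the feature representation, not the paper's mining algorithm or its locality-sensitive hash; the names `WILDCARD`, `merge_to_pattern`, `matches`, and `count_matches`, as well as the example bytes, are hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical marker meaning "any byte may appear at this position".
WILDCARD = None

Pattern = Tuple[Optional[int], ...]


def merge_to_pattern(a: bytes, b: bytes, max_wildcards: int) -> Optional[Pattern]:
    """Merge two equal-length n-grams into a wildcard pattern.

    Positions where the n-grams agree keep the byte value; positions where
    they differ become wildcards. Returns None when the Hamming distance
    between the n-grams exceeds max_wildcards.
    """
    if len(a) != len(b):
        raise ValueError("n-grams must have the same length")
    pattern = tuple(x if x == y else WILDCARD for x, y in zip(a, b))
    if sum(p is WILDCARD for p in pattern) > max_wildcards:
        return None
    return pattern


def matches(pattern: Pattern, window: bytes) -> bool:
    """True if every non-wildcard position in pattern equals the window byte."""
    return len(pattern) == len(window) and all(
        p is WILDCARD or p == w for p, w in zip(pattern, window)
    )


def count_matches(pattern: Pattern, data: bytes) -> int:
    """Count the sliding windows of data that match the pattern."""
    n = len(pattern)
    return sum(matches(pattern, data[i:i + n]) for i in range(len(data) - n + 1))


# Two 4-byte n-grams that differ only in the third byte (illustrative values).
gram_a = b"\x8b\x45\x08\x50"
gram_b = b"\x8b\x45\x0c\x50"
pattern = merge_to_pattern(gram_a, gram_b, max_wildcards=1)

# A third variant of the third byte still matches the wildcard pattern.
sample = b"\x8b\x45\x08\x50\x00\x8b\x45\x10\x50"
occurrences = count_matches(pattern, sample)
```

A plain 4-gram feature would treat `gram_a`, `gram_b`, and the variant in `sample` as three unrelated features, whereas the single wildcard pattern covers all of them. That is the brittleness the abstract describes.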