AVScan2Vec: Feature Learning on Antivirus Scan Data for Production-Scale Malware Corpora

RJ Joyce
Discovery, Research, and Experimental Analysis of Malware (DREAM) Lab

Joint work with Tirth Patel, Dr. Charles Nicholas, and Dr. Edward Raff

12:00 pm (noon) – 1:00 pm
Friday, April 14, 2023
Remotely via WebEx: https://umbc.webex.com/meet/sherman
Recording of Talk


We introduce AVScan2Vec, a sequence-to-sequence autoencoder that ingests antivirus (AV) scan data, extracts semantic meaning, and produces meaningful feature vectors for malware. AVScan2Vec bypasses several limitations of prior malware feature-extraction methods while showing noteworthy improvement on several relevant ML tasks. Our implementation of AVScan2Vec in combination with Dynamic Continuous Indexing is especially potent, enabling 10-nearest-neighbor lookup queries in ~16ms on a dataset containing over 7 million malware samples.

Automation has become increasingly vital to malware analysis because manual effort is slow and costly. Improving malware feature extraction has therefore been a significant research focus, with the aim of improving common tasks such as classification, clustering, and nearest-neighbor lookup of malware. Many approaches rely on features that can only be obtained through prolonged analysis; given the enormous quantity and variety of malware, applying these techniques to a production-size malware corpus would be infeasible. Other, more scalable feature-extraction methods are hindered by static obfuscation, restricted to a single file format, and/or limited in their capacity to identify higher-level malware features. Our work explores the under-recognized potential of AV scan data, which is relatively cheap to acquire and contains rich features.

About the Speaker:

RJ Joyce (joyce8@umbc.edu) is a PhD student at UMBC under the supervision of Dr. Charles Nicholas and Dr. Edward Raff. Presently, RJ works as a data scientist at Booz Allen Hamilton performing research at the intersection of malware analysis and machine learning. RJ is also a visiting lecturer at UMBC and is teaching the Principles of Computer Security course this semester.


Alan T. Sherman, sherman@umbc.edu

Upcoming CDL Meetings:

  • April 28, Roberto Yus (UMBC), Privacy
  • May 5, CSEE Research Day (ECS Atrium)
  • May 12, Kia-Won-Tia von Wrex (UMBC), Cyberdawgs

Support for this event was provided in part by the National Science Foundation under SFS grant DGE-1753681.

The UMBC Cyber Defense Lab meets biweekly on Fridays, 12-1 pm. All meetings are open to the public.