“Malware Data Science: Attack Detection and Attribution” (MDS) is a book every information security professional should consider reading due to the rapid growth and variation of malware and the increasing reliance upon data science to defend information systems. Known malware executables have expanded from 1 million in 2008 to more than 700 million in 2018. Intrusion Detection Systems (IDS) are moving away from purely signature-based detection, as code packing, encryption, dynamic linking, and obfuscation push security toward tools that apply heuristic methods supported by data science. This article is a summary and a review, but my primary goal is to encourage the reader to read the book and complete the activities. If you do, I am sure that your security toolkit will be better equipped.
Overview of Malware Data Science
MDS identifies Data Science as a growing set of algorithmic tools that allow us to understand and make predictions about data using statistics, mathematics, and artful statistical data visualizations. While these terms may imply a difficult read, authors Joshua Saxe (Chief Data Scientist at Sophos) and Hillary Sanders (Infrastructure Data Science Team Lead at Sophos) equip the reader for upcoming concepts well, building upon key concepts with Python code examples and walking through the code to reinforce learning. At points they identify additional resources or refer to prior chapters in a way that both supports the reader and encourages further study.
The code is downloadable from a site dedicated to MDS. Executing the code as you read helps to learn the concepts. I found working directly with the code itself to be surprisingly encouraging and even fun. Of course, some of the code is malware obtained from VirusTotal or Kaspersky Labs. That code is de-fanged with some flipped bits, but the code should be treated with due care in VirtualBox. The text offers a provisioned VirtualBox download.
MDS is broken into four parts. The first part, three chapters long, covers the basics of reverse engineering and supports concepts discussed in later chapters. The next part addresses ways to extract threat intelligence from malware and then create visualizations of relationships between malware families and sources. The third part consists of four chapters that are dedicated to machine learning-based malware detection and lead toward building your own machine learning tools. These tools can help with integrating data visualization into your day-to-day workflow. In the last part, MDS enters an advanced area of machine learning called deep learning. After covering the basics, it explains how to implement deep learning-based malware detection systems in Python using open source tools. This part concludes with some pathways to becoming a data scientist and some qualities that can help someone succeed in the field.
Coverage of the first three chapters, “Basic Static Malware Analysis”, “Beyond Basic Static Analysis: x86 Disassembly” and “A Brief Introduction to Dynamic Analysis” will be combined in the Reverse Engineering portion of this review. The remaining chapters are covered individually.
Reverse engineering is a method of determining how something was assembled by disassembling a finished product, examining its parts, and considering how and why they fit together. Reverse engineering leads to a better understanding of how malware binaries provide attackers ways to hide and continue an attack on an infected machine. In this section, the book introduces static analysis techniques, then illustrates their application with real-world analysis. The Microsoft Windows Portable Executable (PE) format is broken down by header and section function to explain .exe and .dll files. For example, the .idata section contains the Import Address Table and reveals the library calls a program makes, which may show the malware’s high-level functionality. Later, images and strings embedded in a malware binary are analyzed. The reader can run commands to support learning the concepts. The Bash command ‘$ strings filepath | less’ prints all the strings in a file to the terminal. This can show relevant actions like DOWNLOAD or GET, or reveal server names, information that might expose key functions or traits of the malware or its author.
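To make the string-extraction idea concrete, here is a minimal Python sketch that approximates what the strings command does: it pulls runs of printable ASCII characters out of raw bytes. The function name, the minimum length of 4, and the sample bytes are my own assumptions for illustration, not code from the book.

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII characters at least min_len long."""
    # \x20-\x7e covers the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]

# Hypothetical byte blob standing in for the contents of a binary file.
blob = b"\x00\x01GET /update.bin\x00\xff\x10ab\x00DOWNLOAD\x00"
print(extract_strings(blob))  # ['GET /update.bin', 'DOWNLOAD']
```

On a real file you would read the bytes with open(path, "rb").read() and pass them in. The short run "ab" is dropped because it falls below the minimum length.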
The section moves on to look at x86 disassembly. Disassembly is the process of translating a malware binary’s machine code into valid x86 assembly language. The book applies linear disassembly to a contiguous sequence of bytes in a Portable Executable (PE) file and notes that perfect disassembly in the face of deliberate obfuscation is an unsolved problem in computer science.
Before disassembling a file, MDS walks through the basics of x86 assembly language, including the function and structure of general-purpose, stack and control flow registers, and assembly language instructions. These may be a review for some but are key concepts in programming and information security. MDS is a good starting point for study if this information is new to you. From an ethical hacker perspective, you will need to know this information to move forward in pen testing. Mastering the details is necessary for creative applications. In this context I have read, “Those who say, ‘I understand the general principles, I don’t want to bother learning the details’ are deluding themselves.” [1] I suggest you look into the details of memory management and assembly language.
MDS continues its consideration of disassembly by looking at some factors that limit static analysis. Packing is a process by which malware writers compress, encrypt, or otherwise change the code so it appears benign. It is only when the code unpacks itself at runtime that its true nature is revealed. Resource obfuscation obscures the nature of graphical images and strings by, for example, adding 1 to every byte, then subtracting 1 from every byte at runtime. Anti-disassembly techniques include branching to a memory location that disassemblers will identify as a different instruction. Finally, dynamically loaded data involves fetching data and code from an external server at runtime, perhaps decryption keys that are used in the code’s execution.
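The add-one obfuscation described above is simple enough to sketch in a few lines of Python. This is only an illustration of the idea under my own assumptions (wrap-around at 256 and a made-up URL), not code taken from a real sample or from the book.

```python
def obfuscate(data: bytes) -> bytes:
    """Add 1 to every byte, wrapping 255 around to 0."""
    return bytes((b + 1) % 256 for b in data)

def deobfuscate(data: bytes) -> bytes:
    """Undo obfuscate() by subtracting 1 from every byte."""
    return bytes((b - 1) % 256 for b in data)

# Hypothetical string a malware author might want to hide from static tools.
secret = b"http://example.test/payload"
hidden = obfuscate(secret)

print(hidden)               # no longer readable as a URL
print(deobfuscate(hidden))  # the original bytes, recovered at "runtime"
```

Because the stored bytes no longer contain the text "http", a plain strings search over the file would miss the URL even though the program can trivially recover it when it runs.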
This section concludes with a brief introduction to Dynamic Analysis. This involves running the malware in a safe location called a sandbox to examine how it behaves. Dynamic Analysis is useful for reverse engineering and data science to verify what a given sample does. Malware can be categorized based upon common traits. Typical malware behavior includes modifying the file system, modifying the Windows registry, loading device drivers, or network actions like resolving domain names. MDS walks through some dynamic analysis using online resources then closes by addressing some limitations of Dynamic Analysis.
Chapter 4 – Identifying Attack Campaigns Using Malware Networks
Network Analysis Theory relates to extracting threat intelligence from malware. In this section, MDS identifies tools and methods necessary to perform shared attribute analysis to help identify the social connections between malware samples. It also shows how to create visual malware networks using Python and open source tools like NetworkX. Shared attributes include embedded IP addresses, hostnames, strings of printable characters, graphics or similar artifacts. A connection may be based upon the fact that samples call back to the same hostname and IP addresses, indicating that they came from the same attackers.
A network is a collection of connected objects called nodes, and the connections between nodes are called edges. A network can be complex and have attributes that attach to either nodes or edges. One common edge attribute is weight, with greater weight indicating a stronger connection between samples. A bipartite network is one whose nodes can be divided into two partitions, where neither partition contains internal connections. An example of a bipartite network would have one partition of domain names and another of malware samples, where none of the domain names connect directly with each other, and none of the malware samples connect directly with each other, but visualization techniques demonstrate the network’s interrelationships.
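As a rough sketch of how a bipartite sample-to-domain network can be turned into weighted sample-to-sample edges, consider the plain-Python example below. The sample names, domains, and the project_samples helper are hypothetical; the book itself builds its networks with NetworkX, which this sketch deliberately avoids so that it has no dependencies.

```python
from itertools import combinations

# Hypothetical bipartite data: each malware sample maps to the domains it calls.
callbacks = {
    "sample_a": {"evil.example", "cdn.example"},
    "sample_b": {"evil.example"},
    "sample_c": {"cdn.example", "evil.example"},
    "sample_d": {"other.example"},
}

def project_samples(callbacks):
    """Link two samples when they share a domain; weight = number shared."""
    edges = {}
    for left, right in combinations(sorted(callbacks), 2):
        shared = callbacks[left] & callbacks[right]
        if shared:
            edges[(left, right)] = len(shared)
    return edges

for (left, right), weight in project_samples(callbacks).items():
    print(left, right, weight)
```

Here sample_a and sample_c end up with the heaviest edge because they share two domains, while sample_d stays disconnected.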
The major challenge in doing network visualization is network layout, which is the process of deciding where to render each node in a network within a two- or three-dimensional coordinate space. The conventional technique is to space them so their visual distance to one another is proportional to the shortest-path distance between them in the network. Shortest path is in the same sense as a router hop, with two hops being two inches and three hops three inches. Distortion problems are reduced with force-directed layout algorithms based on physical simulations of spring-like forces and magnetism.
Part of what makes Malware Data Science so interesting is its use of actual Advanced Persistent Threat samples from two different attack campaigns to support visualizing malware networks. Visualization makes themes apparent, and through programming you can manipulate node shape, node and edge color, and text labels to more clearly communicate relationships or functions.
Chapter 5 – Shared Code Analysis
Shared Code Analysis is a process of comparing two malware samples by estimating the percentage of precompilation source code they share. This differs from shared attribute analysis. This process considers features like a set of strings in static analysis, or features based on dynamic run logs in a dynamic analysis, to create a bag of features. Two bags of features can be compared as in a Venn diagram for code sharing analysis. Sequencing of events can be incorporated in what is called an N-gram. An N-gram is a sub-sequence of events that has a certain length, N, of some larger sequence of events. N-grams are created by iterating over a sequence and recording the sub-sequence from the event at index i to the event at index i + N - 1. The sequence (1,2,3,4,5,6,7) is translated into five different sub-sequences of length 3: (1,2,3), (2,3,4), (3,4,5), (4,5,6), (5,6,7). For example, malware analysis would extract N-grams of sequential API calls that a malware sample made, then represent them as a bag of features for comparison with another malware sample’s N-grams.
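The N-gram construction can be written as a short Python function. This is a minimal sketch of the sliding-window idea described above; the function name and the API call names in the second example are my own illustrative choices.

```python
def ngrams(sequence, n):
    """Return every contiguous sub-sequence of length n, in order."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

print(ngrams([1, 2, 3, 4, 5, 6, 7], 3))
# [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7)]

# Hypothetical sequence of API calls recorded during a dynamic run.
calls = ["CreateFile", "WriteFile", "CloseHandle", "RegSetValue"]
bag_of_features = set(ngrams(calls, 2))
print(bag_of_features)
```

Converting the list of N-grams to a set gives the bag of features used for comparison, since duplicates and ordering between N-grams no longer matter at that point.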
The Jaccard Index is used to quantify similarities between two bags of features. This yields a normalized value that can be placed on a common scale ranging from 0 (no code sharing) to 1 (samples share 100 percent of their code). False positives and false negatives can be identified by comparing several malware families to each other in a similarity matrix. Comparing one family with itself should display a match. If it does not, it is a false negative. If it displays a match with another family, it is a false positive.
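The Jaccard index itself is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; treating two empty sets as having similarity 0 is my own choice for the edge case, and the feature strings are made up.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0  # assumption: two empty bags are treated as dissimilar
    return len(a & b) / len(a | b)

# Hypothetical bags of string features from two samples.
sample_one = {"GET", "DOWNLOAD", "cmd.exe"}
sample_two = {"DOWNLOAD", "cmd.exe", "regedit"}
print(jaccard(sample_one, sample_two))  # 0.5
```

Two of the four distinct features are shared, so the index is 0.5. Comparing a bag with itself gives 1.0, which is the behavior the similarity matrix relies on when a family is compared with itself.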
This approach can then be used with bags of features including Instruction Sequence, Strings, Import Address Table, Dynamic API Call or other identified similarities. Malware families may show more false positives or more false negatives when being analyzed by various attributes.
After walking through the process, the authors supply code to parse arguments and extract features from PE files then create a code-sharing strings-based graph in Python.
Then it is time to scale up. Computing a similarity matrix with Jaccard indices over a data set of n samples requires (n^2 - n) / 2 comparisons. For four samples this is (4^2 - 4) / 2 = (16 - 4) / 2 = 6, which seems manageable, but a data set of 10,000 requires 49,995,000 computations. To reduce computations, a technique called minhash accepts some error in return for fewer computations. This is a trade-off that should be familiar to anyone who has worked with statistics, but the process may be new to you. The features are hashed with k hash functions, then only the minimum value of each hash function computed over all the features is retained. The approximate Jaccard index between two samples is then the fraction of their minhashes that match, with an expected error of about 1 / √k. With k set to 256, the estimate is off by about 6% on average.
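The following is a minimal sketch of minhash under stated assumptions. I simulate the k hash functions by salting SHA-1 from the standard library with a counter, which is a stand-in of my own and not the hash function the book uses; the feature strings are also invented.

```python
import hashlib

def minhash(features, k=256):
    """Return k minimum hash values, one per salted hash function."""
    sketch = []
    for seed in range(k):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash
        sketch.append(min(
            int.from_bytes(hashlib.sha1(salt + f.encode()).digest()[:8], "big")
            for f in features
        ))
    return sketch

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions where the two minhash arrays agree."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)

# Hypothetical feature sets with a true Jaccard index of 50 / 150.
set_a = {f"str{i}" for i in range(100)}
set_b = {f"str{i}" for i in range(50, 150)}

print(estimate_jaccard(minhash(set_a), minhash(set_b)))  # close to 0.33
print((10_000 ** 2 - 10_000) // 2)  # 49995000 pairwise comparisons
```

The payoff is that each sample is reduced to a fixed-size array of k numbers, so comparing two samples costs the same no matter how many features each one has.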
With the background explained, it is time to build a Malware Similarity Search System, looping minhashes and all. A technique called inverted indexing allows samples to be stored based on their sketch values instead of an ID. For each sample that shares a sketch with the query sample, we compute its approximate Jaccard index using its minhashes. The most similar samples are reported to the user. At that point you can run the similarity search system: Load samples into the database, Comment on a sample, Search for samples and print them in order of similarity, and Wipe, which clears the database of all records.
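To illustrate the inverted index, here is a small sketch in which a "sketch" is assumed to be a group of consecutive minhashes together with its position in the array. That definition, the in-memory dictionary standing in for a database, and the tiny minhash arrays are all my own simplifications rather than the book's implementation.

```python
from collections import defaultdict

def sketches(minhashes, group_size=2):
    """Group consecutive minhashes into (position, values) sketch keys."""
    return [
        (start, tuple(minhashes[start:start + group_size]))
        for start in range(0, len(minhashes), group_size)
    ]

def build_index(samples):
    """Map every sketch key to the set of sample IDs that contain it."""
    index = defaultdict(set)
    for sample_id, minhashes in samples.items():
        for key in sketches(minhashes):
            index[key].add(sample_id)
    return index

def candidates(index, query_minhashes):
    """Return the IDs of samples sharing at least one sketch with the query."""
    found = set()
    for key in sketches(query_minhashes):
        found |= index.get(key, set())
    return found

# Hypothetical, very short minhash arrays for three stored samples.
samples = {"a": [3, 7, 1, 9], "b": [3, 7, 2, 8], "c": [5, 6, 4, 0]}
index = build_index(samples)
print(sorted(candidates(index, [3, 7, 1, 0])))  # ['a', 'b']
```

Only the candidates returned by the lookup need a full minhash comparison, which is what keeps a search from touching every stored sample.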
Chapter 6 – Understanding Machine Learning–Based Malware Detectors
This section introduces a process for developing your own detection tools at a high level and explains the big ideas behind machine learning including feature spaces, decision boundaries, training data, underfitting and overfitting. Building a machine learning tool involves collecting examples of malware and benignware, extracting features from each group to represent the example as an array of numbers for training, training the machine to recognize malware using the extracted features, and testing the machine to see how well the detection system works.
A feature space is the geometrical space defined by the features you’ve selected, and a decision boundary is a geometrical structure running through this space so that binaries on one side of the boundary are classified as malware and binaries on the other side as benignware.
Overfitting and underfitting are important concepts in machine learning in that good machine learning avoids both and captures the general trend without getting distracted by outliers. Underfit models ignore outliers but fail to capture the general trend. Overfit models get distracted by outliers in ways that do not reflect the general trend. Both result in poor accuracy on new binaries.
The major types of learning algorithms covered are logistic regression, k-nearest neighbors, decision trees and random forest algorithms. MDS walks through each approach and explains the value of a given analysis in a given context.
Logistic regression is the method that utilizes feature space and decision boundaries to separate malware from benignware. It uses an iterative, calculus-based approach called gradient descent to adjust the decision boundary to maximize the probability that any one data point is on the correct side of the boundary. The positive aspect of this approach is that the results are easy to interpret, but the model often fails when the data is more complex.
K-nearest neighbors is an algorithm based upon the idea that if a binary in the feature space is close to other binaries that are malicious, it too is malicious. It makes the decision based upon the nature of the majority of neighbors. K represents the number of neighbors you think should be included in the sample. A distance function identifies the distance between a new unknown binary’s features and the samples in the training set. The most common distance function is Euclidean distance which is the length of the shortest path between two points in our feature space. K-nearest neighbors can produce a much more complex decision boundary than logistic regression, and both overfitting and underfitting can be controlled by changing the value of k. This is a good algorithm to use where features do not map cleanly to suspiciousness, but closeness to malicious samples is a strong indicator of maliciousness.
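A k-nearest neighbors classifier is small enough to sketch directly. The version below uses Euclidean distance and a majority vote; the two-dimensional feature values and labels are invented for illustration, and a real detector would use many more features and a library implementation.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Label the query by majority vote among its k closest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set: 1 = malware, 0 = benignware.
train = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1), (0.85, 0.7)]
labels = [1, 1, 0, 0, 1]

print(knn_predict(train, labels, (0.7, 0.8)))    # 1
print(knn_predict(train, labels, (0.15, 0.15)))  # 0
```

Raising k smooths the decision boundary and lowering it lets the boundary follow individual samples more closely, which is the overfitting and underfitting control the chapter describes.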
Decision trees are like a game of twenty questions to determine if a binary is malware. The most determinant feature is the first decision point, called the root node. The second most determinant feature is the next question. With each decision one branch’s probability of malware increases and the other branch’s probability decreases. You can stop asking questions at a selected depth, or the tree can grow until you are certain about the nature of a binary based upon the structure of the tree. There is less chance the tree will overfit the training data if you keep it small. Allowing the tree to grow to maximum size corrects for underfitting the training data. The downside is that decision trees often do not result in very accurate models.
Random Forest uses hundreds or thousands of decision trees for malware detection, with each tree trained differently to have a slightly different perspective on the data. For each tree that is built, only a few features are considered out of the entire feature pool. Then the nature of a new binary is determined based upon how many trees voted yes.
Chapter 7 – Evaluating Malware Detection Systems
There are four possible outcomes in a detection system. A binary is truly identified as malicious, falsely identified as malicious, truly identified as benign, or falsely identified as benign. True positives correctly identify malware as malware, and precision is the detection system’s true positives / (true positives + false positives) when tested against a set of binaries. The base rate is the percentage of the data that has the quality we are looking for, in this case being malware. Reducing false positives is a critical element of deploying a useful detection system in an enterprise, because otherwise its precision is much too low to be of value.
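A short worked example shows why the base rate matters so much for precision. The detection rates and base rates below are numbers I picked for illustration, and expected_precision is a hypothetical helper that applies the precision formula to rates instead of raw counts.

```python
def precision(true_positives, false_positives):
    """Share of alerts that are real malware."""
    return true_positives / (true_positives + false_positives)

def expected_precision(tpr, fpr, base_rate):
    """Precision implied by a true positive rate, false positive rate, and base rate."""
    tp = tpr * base_rate        # fraction of all binaries that are caught malware
    fp = fpr * (1 - base_rate)  # fraction of all binaries that are flagged benignware
    return tp / (tp + fp)

print(precision(90, 10))  # 0.9

# Same detector (90% detection, 1% false positives), two different environments.
print(round(expected_precision(0.9, 0.01, 0.5), 3))   # 0.989 when half the files are malware
print(round(expected_precision(0.9, 0.01, 0.01), 3))  # 0.476 when 1% of the files are malware
```

The detector has not changed between the last two lines, yet more than half of its alerts become false alarms once malware is rare, which is typically the situation on an enterprise network.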
Chapter 8 – Building Machine Learning Detectors
At this point the authors apply all the prior instruction to help you build a machine learning malware detector using scikit-learn, an open source machine learning package, called sklearn for short. Sklearn requires training data in vector form. Vectors are arrays of numbers where each index in the array corresponds to a single feature of the training example software binary. If the two features considered were compression and encryption, the vector (1,0) would mean the sample was compressed but was not encrypted. Sklearn also requires a label vector, which contains one number per training example, indicating whether the example is malware or benignware. For example, if we passed three training examples with the label vector (0,1,0), we would be saying the first example was benign, the second malware, and the third benign. By convention, machine learning engineers use a capital X variable to represent the training data and a lowercase y variable to represent the labels. With this knowledge, some sample Python code, some instruction, and sklearn, it is time for the reader to train their own malware detector, and then evaluate its performance! Remember, keep those false positives low.
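The X and y convention can be shown without installing anything. In this sketch the feature names, the to_vector helper, and the three training examples are hypothetical; the resulting lists have the shape that scikit-learn estimators accept through their fit(X, y) method.

```python
# Hypothetical feature names; each becomes one index in the feature vector.
FEATURES = ["is_compressed", "is_encrypted"]

def to_vector(sample):
    """Turn a dict of boolean observations into a list of 0/1 numbers."""
    return [1 if sample.get(name) else 0 for name in FEATURES]

# Each training example pairs its observations with a label (1 = malware).
examples = [
    ({"is_compressed": True, "is_encrypted": False}, 0),
    ({"is_compressed": True, "is_encrypted": True}, 1),
    ({"is_compressed": False, "is_encrypted": False}, 0),
]

X = [to_vector(observations) for observations, _ in examples]  # training data
y = [label for _, label in examples]                           # label vector

print(X)  # [[1, 0], [1, 1], [0, 0]]
print(y)  # [0, 1, 0]
```

The first row of X is the (1,0) vector from the text, compressed but not encrypted, and y marks only the second example as malware.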
Chapter 9 – Visualizing Malware Trends
Visualization helps identify the types of malware in a dataset, trends in datasets, and the effectiveness of detection systems. MDS shows how to use a Python data analysis package called pandas, and the Python data visualization packages seaborn and matplotlib, to prepare data and create visualizations. These visualizations can be manipulated with filters or by selecting attributes important to your organization.
And Now It’s Time for… Chapter 10 – Deep Learning Basics
Deep learning is a form of many-layered processing units that function like Linux piped commands. Each layer’s output is used as the next layer’s input. Perhaps that analogy is not organic enough. Each processing unit is called a neuron, and the model architecture is called a neural network, or a deep neural network when there are many layers. A neuron consists of inputs x1, x2, and x3, each of which is multiplied by a weight value. The weighted inputs are then added to form a weighted sum, to which a bias value is added. The weights and bias are modified during training to optimize the model. Then an activation function is applied to the weighted sum plus the bias value. This is done to apply a nonlinear transformation to the weighted sum, which is a linear transformation of the neuron’s input data.
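A single neuron is only a few lines of Python. This sketch assumes ReLU as the activation function, which is one common choice among several, and uses made-up input, weight, and bias values.

```python
def relu(z):
    """Rectified linear unit: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs, plus the bias, passed through the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)

# Hypothetical values for x1, x2, x3 and their weights.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]

print(neuron(inputs, weights, bias=1.0))   # 0.25
print(neuron(inputs, weights, bias=-1.0))  # 0.0
```

The weighted sum here is -0.75, so a bias of 1.0 gives 0.25, while a bias of -1.0 pushes the total below zero and the activation clamps the output to 0.0. That clamping is the nonlinear step.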
You may be thinking, that sounds pretty deep, but the deep part of deep learning is the number of layers of neurons that are involved in any one determination. Real neural networks often have thousands of neurons and millions of connections. The number of optimizable parameters in a basic neural network is the number of edges connecting an input to a neuron plus the number of neurons. Neural networks can be used for feature extraction, which could be applied to anything from facial recognition to malware detection.
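The parameter count described above can be checked with a quick calculation. This sketch assumes a fully connected feed-forward network, where every unit in one layer connects to every neuron in the next and each neuron has one bias; the layer sizes are arbitrary.

```python
def count_parameters(layer_sizes):
    """Count weights (one per edge) plus biases (one per neuron after the input layer)."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

# 3 inputs -> 4 hidden neurons -> 1 output neuron.
print(count_parameters([3, 4, 1]))  # 21
```

The 21 comes from 12 + 4 = 16 edges plus 5 neurons. Widening the hidden layer to 1,000 neurons already gives 5,001 parameters, which hints at how real networks reach millions.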
Chapter 11 – Building a Neural Network Malware Detector with Keras
A neural network is trained by feeding the network an observation input, then examining the output, and adjusting the parameters as necessary to bring the output closer to the known value. The iterative process is known as gradient descent, and the number of computations is reduced by a process called backpropagation. Backpropagation allows for the calculation of partial derivatives along computational graphs like a neural network.
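To give a feel for the training loop, here is a toy gradient descent on a single sigmoid neuron with one input. It is a sketch under my own assumptions (squared-error loss, a fixed learning rate, four invented data points) and not the Keras workflow the book teaches. With only one neuron, backpropagation reduces to one application of the chain rule.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, data):
    """Average squared error of the neuron over the data set."""
    return sum(0.5 * (sigmoid(w * x + b) - t) ** 2 for x, t in data) / len(data)

def train(data, lr=0.5, epochs=200):
    """Adjust one weight and one bias by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, target in data:
            pred = sigmoid(w * x + b)
            # Chain rule: d(loss)/d(pred) * d(pred)/d(weighted sum).
            delta = (pred - target) * pred * (1.0 - pred)
            grad_w += delta * x
            grad_b += delta
        # Step against the average gradient.
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b

# Hypothetical one-feature data: small values are benign (0), large are malicious (1).
data = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]
w, b = train(data)
print(mean_loss(0.0, 0.0, data), mean_loss(w, b, data))  # loss before and after
```

The second number printed should be smaller than the first, showing the output moving closer to the known values with each pass over the data.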
There are several types of neural networks, and MDS addresses several of them and how they can be applied in a cybersecurity context. There are Feed-Forward, Convolutional, Autoencoder, Generative Adversarial, Recurrent, and ResNet Neural Networks. I have listed some types, but for the context, math, and visualizations, you will have to read the book! When you do, you will also be able to build a neural network with Keras, the Python Deep Learning Library, train the model with the half million benign and malicious HTML files available for use, and validate its accuracy with sklearn.
“Malware Data Science: Attack Detection and Attribution” closes with Chapter 12 – Becoming a Data Scientist, which covers not only what it takes to become a data scientist but also includes a nice section on “A Day in the Life” to give a better understanding of whether this is a path for you. If so, there are some additional sections with experiences that the authors share and where to go from here. There’s also an “Appendix: An Overview of Datasets and Tools” that does a wonderful job of hand-holding the reader through the datasets from each chapter of the book. Again, there’s no need to go through this section in detail, but rest assured it is a treasure trove of knowledge.
Ending the book with such great additional thoughts and materials really takes this book beyond normal technical tomes. It doesn’t just introduce the concepts but also offers a career path. Even if you are primarily interested in exactly what the hot topic of data science has to do with Information Security, reading MDS and applying the provided code along with the data is a valuable use of time. It gave me new insights and inspired me to learn even more.
Let’s not pretend that the whole field of data science, machine learning, neural networks and dataviz is not highly technical and dense. It clearly is. However, in gradually getting the reader through a certain progression of ideas, the authors made it less daunting. So much so, that I went through the material faster than initially anticipated, grasped the concepts, and was left wanting more. I went through the book in a couple of weeks but plan to return to further develop my skills with the newly acquired tools that Josh and Hillary added to my security toolkit. So even if this doesn’t lead to a change of course in my own career, it nonetheless opened my eyes to the wonders of what this technology and the rigorous use of proper technique can accomplish.
1. Bryant, R. & O’Hallaron, D. (2015). Computer Systems: A Programmer’s Perspective, 3rd Ed. Pearson.
Mike Green is an Information Security Professional who serves as a Consultant, Educator & Attorney (Member of the California Bar). He is currently completing his Masters in Cybersecurity at Marymount University. He earned his J.D. & Masters in Educational Leadership from George Mason University. Having passed the CISSP this year, he is seeking new challenges in the field.