Aqib Rashid, Colin Ife
|
September 4, 2024

How Glasswall’s experts are tackling ‘concept drift’ in machine learning for malware detection

Machine learning (ML) systems are trained on vast amounts of data, enabling them to make accurate predictions on future, unseen inputs. In dynamic environments like malware detection in cybersecurity, constantly evolving threats pose significant challenges that can hinder model performance. A primary cause of this degradation is a phenomenon known as ‘concept drift’: as malware evolves, it grows less like the original training data, shifting the underlying data distributions on which the model depends.

At Glasswall, our experts are leading groundbreaking research to develop innovative ML-based malware detection systems. In this article series, we'll explore the complex challenges of applying machine learning to cybersecurity, focusing on concept drift and its impact on malware detection. We'll also highlight Glasswall's innovative research and solutions to overcome these obstacles.

What is concept drift?

In most domains, machine learning models are trained under the assumption that the data distribution remains fixed. This fundamental premise underpins many of the techniques and algorithms used in model development and training.  

For example, consider a machine learning model designed to detect malware. Initially trained on an extensive dataset of malware samples, the model might perform exceptionally well.  

However, over time, the methods used to develop and distribute malware evolve. Additionally, malware authors continually adapt and enhance their techniques to evade detection. Consequently, what once was obvious malware starts to resemble legitimate software to the model, and vice versa. The model, once highly accurate, now tends to make mistakes. This highlights the impact of the fixed data distribution assumption on model performance and illustrates the problem of concept drift, where there is a statistical change in the relation between the model’s inputs and outputs.


Source: Cyber security and beyond: Detecting malware and concept drift in AI-based sensor data streams using statistical techniques

Concept drift occurs in different patterns, such as gradually, suddenly, or recurringly. Even state-of-the-art models, which can cost billions of dollars to develop, are not immune to this issue. For instance, ChatGPT acknowledges the potential for errors, particularly due to its training data having a cut-off date.


Caption: ChatGPT response on 16th July 2024  

Drift can take multiple forms

Other types of drift can also occur, with machine learning practitioners often using these terms interchangeably to describe model decay (i.e., diminishing predictive capability). For example, model drift refers to the degradation in a model’s prediction capability (and not necessarily due to changes in learnt relationships). This can occur due to several reasons, such as concept drift or data drift.

Meanwhile, data drift refers to the changes in distributions of input data. This means that features which the model has learnt for classification look different than expected. At first, it may seem like concept drift, but there is a subtle difference. For example, a model may have learnt that most malware is traditionally distributed through email attachments and downloadable files from websites. However, over time, more malware is distributed through mobile apps. Since the model has encountered fewer mobile-based examples, it struggles to distinguish between malware and legitimate apps, which have different characteristics. Hence, data drift occurs: a shift in the distribution of input features, such as “platform type” or “file size.” This shift can lead to a decline in the model’s performance. However, the fundamental concept of what constitutes malware hasn’t changed in this case.

In contrast, following this example, concept drift would be a new malware technique that significantly increases the incidence of mobile-based threats by exploiting new vulnerabilities in mobile apps. This increase happens regardless of the overall distribution of platform types, illustrating concept drift without data drift.

In summary, concept drift can be mathematically expressed as a change in the conditional probability P(Y∣X), indicating a shift in the relationship between input features and the target variable. On the other hand, data drift is expressed as a change in the marginal probability P(X), reflecting alterations in the distribution of input features over time.
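To make the distinction concrete, below is a minimal sketch (illustrative only, not Glasswall's production tooling) of how each kind of drift might be monitored: a two-sample Kolmogorov-Smirnov test on an individual input feature approximates a check on the marginal P(X), while a rolling accuracy check on freshly labelled samples serves as an indirect check on P(Y|X). The feature choice, thresholds, and synthetic numbers are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

def data_drift_detected(train_feature, live_feature, alpha=0.05):
    """Data drift: test whether the marginal distribution P(X) of a
    single feature has shifted, via a two-sample Kolmogorov-Smirnov test."""
    _, p_value = ks_2samp(train_feature, live_feature)
    return p_value < alpha  # True => distributions likely differ

def concept_drift_suspected(y_true, y_pred, baseline_accuracy, tolerance=0.05):
    """Concept drift (indirect): if accuracy on freshly labelled samples
    falls well below the training-time baseline, the relationship P(Y|X)
    the model learnt has likely changed."""
    live_accuracy = np.mean(np.asarray(y_true) == np.asarray(y_pred))
    return live_accuracy < baseline_accuracy - tolerance

# Illustrative usage with synthetic "file size" features:
rng = np.random.default_rng(0)
train_sizes = rng.normal(loc=500, scale=50, size=1000)  # sizes seen at training time
live_sizes = rng.normal(loc=650, scale=80, size=1000)   # the live distribution has moved
print(data_drift_detected(train_sizes, live_sizes))     # True
```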

Broadly speaking, both data drift and concept drift tend to be used synonymously by the machine learning community.  

Why is concept drift a key challenge in malware detection?

Concept drift poses a significant cybersecurity challenge due to the ever-changing tactics and attack vectors used by criminals. In cybersecurity, user behaviours are highly variable, influenced by new technologies, applications, and internet usage trends. Additionally, attackers constantly develop new malware and tactics to exploit vulnerabilities, leading to significant and frequent shifts in the data patterns that detection models rely on.

In fact, concept drift significantly impacts model performance in malware detection. Industry whitepapers reveal that models unable to adapt to new malware types experience a rapid decline in detection accuracy, starting as soon as two months after deployment.

Caption: Machine learning model detection rate degradation over time. Sourced from “Machine Learning for Malware Detection” by Kaspersky

Hence, machine learning practitioners must take various measures to address this issue. These efforts require substantial investment in both time and resources to maintain effective end-to-end security solutions. The constant need for vigilance and adaptation emphasizes the importance of proactive measures in cybersecurity, highlighting the ongoing challenge of managing concept drift in malware detection.

How Glasswall is tackling concept drift

Given the detrimental effect of concept drift on AI/ML systems, especially in malware detection systems, as outlined above, it is fair to question whether concept drift can truly be prevented. Through rigorous experimentation, iteration, and systems design, rooted in sound engineering principles, MLOps best practices, and our own unique capabilities as a leading provider of Content Disarm and Reconstruction (CDR) solutions, Glasswall has determined several strategies to mitigate the effects of concept drift in file-based malware detection systems, a few of which are discussed below.

Continuous data collection and management

The first step in preventing drift in any ML task, including malware detection, is investing in robust data collection and management systems. In the file security domain especially, it is essential to maintain a steady stream of both malicious and non-malicious files from diverse sources. This ensures that the data used to train and test models is up-to-date, structurally comprehensive, and representative of real-world user content and malicious attack vectors.

This approach can be broken down into several key elements:

Sample collection

The effectiveness of a machine learning model depends on the quality and quantity of the data it is trained on, rather than the complexity of its architecture. This principle, known as Data-Centric AI, is strongly advocated by Glasswall’s experts to ensure high-performing AI/ML systems. This approach also helps mitigate data and concept drift, maintaining performance over time.

For this reason, Glasswall leverages multiple vantage points on the web for continuous file sample collection.

Label evaluation

Beyond collecting samples to expand structural coverage and distributional representation, there is also the need to assign appropriate labels to each sample. To this end, Glasswall employs a multi-faceted approach to determining and assigning labels to the samples (i.e., whether the files are malicious or not) and ensuring the quality of those assignments.

Deferring discussion of those exact methods to a later post, an important further step for mitigating concept drift at this stage is the regular review and maintenance of these assigned labels. Threat intelligence services, for example, often revise their maliciousness verdicts for entities of interest (IPs, domains, files), whether correcting a false negative (previously deemed benign, now determined malicious) or a false positive (previously deemed malicious, now determined benign).

Whatever the label assignment method used, regularly reviewing the accuracy of the labels helps to ensure that a correct, functional relationship between input features and output labels is embedded within the models.
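As a simple illustration of what such a review loop might look like in code, the sketch below re-queries an external verdict source and records any label flips. `fetch_latest_verdict` is a hypothetical placeholder for whichever threat intelligence service is consulted, not a real API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    sha256: str
    malicious: bool  # current label: True = malicious, False = benign

def fetch_latest_verdict(sha256: str) -> bool:
    """Hypothetical placeholder for a threat intelligence lookup."""
    raise NotImplementedError

def refresh_labels(samples: list[Sample]) -> list[Sample]:
    """Re-check each sample against the latest external verdict and
    correct any flipped labels (false positives or false negatives),
    returning the samples whose labels changed."""
    flipped = []
    for sample in samples:
        latest = fetch_latest_verdict(sample.sha256)
        if latest != sample.malicious:
            sample.malicious = latest
            flipped.append(sample)
    return flipped
```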

Once these files are collected and the labels are assigned, we regularly assess the structural coverage these files provide, as depicted by their embeddings in the feature space, using the power of deep CDR analysis.


Caption: 2D t-SNE feature space projections for different file formats. Blue markers denote benign files, while red markers denote malicious ones.
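For readers curious how such projections are generated, below is a minimal sketch using scikit-learn's t-SNE implementation. It assumes a pre-computed feature matrix `X` (one row of CDR-derived features per file) and a binary label vector `y`, both of which stand in for Glasswall's actual feature pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_space(X, y):
    """Project high-dimensional file features to 2D with t-SNE and
    colour points by label (blue = benign, red = malicious)."""
    y = np.asarray(y)
    coords = TSNE(n_components=2, random_state=42).fit_transform(X)
    plt.scatter(coords[y == 0, 0], coords[y == 0, 1], c="blue", s=5, label="benign")
    plt.scatter(coords[y == 1, 0], coords[y == 1, 1], c="red", s=5, label="malicious")
    plt.legend()
    plt.show()
```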

Of course, with any data collection exercise, there is an abundance of known and unknown biases that may be introduced into the dataset. These may arise from the vantage points used, the quality of the label assignment method, or how the class and data strata are distributed before being served into later stages of the learning pipeline. These biases, and how Glasswall handles them, may be discussed further in a future blog post.

Infrastructure and automation

Developing the necessary systems to refresh and expand one’s dataset at will is an exceptionally helpful and fundamental tool in mitigating data drift, which hinges on the diversity of input features (or, in the unique case of CDR, document structures). Incorporating a data management system such as a database, data warehouse, or data lake allows one to scale training operations up or down as required, all while maintaining an organized and queryable repository of input features.

Ensuring synchronization with the latest feature sets is essential to combat data drift, whether due to changes in how structural artifacts are coded in the CDR engine or how they are decomposed into complex features. Without a feature synchronization strategy during the training phase, models risk becoming outdated as subtle changes in data representation occur, leading to degraded performance during inference.


Caption: Feature space projections of an MS Word (Binary) dataset, D, before and after a change in feature definitions, leading to data drift. Here, T1 < T2, where T denotes time.  
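One lightweight way to guard against such silent feature-definition changes is to version the feature schema and refuse to score when it no longer matches what the model was trained on. The sketch below is an assumption-laden illustration; the hashing scheme and function names are not Glasswall's actual implementation.

```python
import hashlib
import json

def schema_fingerprint(feature_names: list[str]) -> str:
    """Deterministic hash of the ordered feature definitions."""
    return hashlib.sha256(json.dumps(feature_names).encode()).hexdigest()

def assert_synchronised(model_fingerprint: str, live_features: list[str]) -> None:
    """Fail fast if the live feature set differs from the one the model
    was trained on, rather than silently scoring drifted representations."""
    if schema_fingerprint(live_features) != model_fingerprint:
        raise RuntimeError("Feature schema mismatch: re-map features or re-train")
```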

Finally, automating this process brings significant efficiency gains; without automation, the time, labor, and costs involved can accumulate very quickly.

If implemented correctly, automated data collection infrastructure and synchronization processes help ensure that the data distribution used to train and test the models remains representative, comprehensive, and relevant.

Online monitoring and re-training

Extending the continuous data collection paradigm, online monitoring and re-training is a go-to MLOps best practice for tackling concept drift, with a principal focus on maintaining model performance. Methods to achieve this include monitoring live AI/ML deployments (such as an API endpoint or other vantage point), identifying a criterion for the maximum acceptable performance degradation, and triggering re-training of the model once this criterion is met.
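A minimal version of that trigger logic might look like the sketch below, where the rolling window size and maximum acceptable accuracy drop are illustrative assumptions rather than recommended values.

```python
from collections import deque

class DriftMonitor:
    """Track rolling accuracy at a live endpoint and signal when
    degradation exceeds the acceptable threshold, triggering re-training."""

    def __init__(self, baseline_accuracy: float, max_drop: float = 0.05,
                 window: int = 500):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, was_correct: bool) -> bool:
        """Record one prediction outcome; return True when re-training
        should be triggered."""
        self.outcomes.append(int(was_correct))
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        rolling_accuracy = sum(self.outcomes) / len(self.outcomes)
        return rolling_accuracy < self.baseline - self.max_drop
```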

The key difference between this method and continuous data collection (described above) lies in its reactive nature: re-training is only triggered once production model performance falls outside acceptable regions. In contrast, continuously collecting and re-training on new data is an inherently proactive method for ensuring peak model performance. That said, both methods may certainly be leveraged in tandem.

Another difference introduced through the online monitoring method lies in the often-used inclusion of a feedback loop between the vantage point where live monitoring takes place and the dataset used to train and test the model. This feedback loop typically results in previously unobserved, misclassified samples being collected and fed back into the re-training and evaluation process. This presents obvious benefits: the ability to identify weak points in one's models through further analysis of these misclassified samples, as well as to iterate on one's models through rapid re-training.
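In its simplest form, that feedback loop routes misclassified samples, tagged with the context in which they were observed, back into the candidate re-training pool, as in this sketch (the queue and field names are placeholders):

```python
def log_feedback(sample_id: str, predicted: bool, actual: bool,
                 vantage_point: str, retraining_queue: list) -> None:
    """Feed previously unobserved, misclassified samples back into the
    re-training pool; tagging the vantage point supports the bias
    analysis discussed below."""
    if predicted != actual:
        retraining_queue.append({
            "sample_id": sample_id,
            "label": actual,
            "source": vantage_point,  # which deployment observed the sample
        })
```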

However, when applying these feedback loops, one must always consider the types of traffic that may be incident on the vantage point, and what biases, if any, these may introduce into one's dataset and models. Using spam detection as an example: if the vantage point for such a detection model is deployed at a particular e-mail provider or network, and that provider or network is more or less likely to be exposed to a particular type of spam, any ensuing model improvements will only cater to that subset of the overall spam population.

For this reason, Glasswall advocates a combined approach of passively and actively collecting samples for model re-training.

How does Glasswall use Machine Learning?

We are recognized for our patented zero-trust CDR (Content Disarm and Reconstruction) technology, which effectively eliminates malware threats from business files and documents while preventing accidental data leaks by removing hidden data that could expose sensitive information. Our customers rely on the security and confidence that CDR provides.

Currently, we are developing machine learning-based malware detection solutions capable of identifying threats across a wide range of file types. Built on CDR, these solutions deeply analyze files for indicators of compromise, uncovering threats beyond the surface layer. This innovation will enhance Glasswall's threat intelligence services, providing actionable insights into why files are flagged as malicious, and ultimately contributing to global protection. Stay tuned for more updates in early 2025.

Conclusion

Concept drift is a formidable challenge in maintaining the performance and reliability of machine learning models, especially in dynamic and high-stakes environments like cybersecurity. At Glasswall, we are committed to understanding and addressing concept drift to enhance the adaptability and effectiveness of our ML-driven cybersecurity solutions. Our state-of-the-art research and development efforts are focused on creating high-quality models that keep pace with the rapidly evolving landscape of cyber threats, ensuring robust protection for our users.

By staying at the forefront of machine learning research, Glasswall continues to innovate and provide state-of-the-art solutions to the ever-changing challenges in cybersecurity.

Book a demo

Talk to us about our industry-leading CDR solutions
