IRONSCALES Engineering: Developing a Machine Learning Model to Identify Phishing Emails

In this post we will discuss the methodology and workflow of our ML team and walk through a case study of deploying a real machine learning model at scale. IRONSCALES is a cybersecurity startup that protects mailboxes from phishing attacks. Our product detects phishing attacks in real time using machine learning and can automatically remove emails from the end-user’s mailbox.

For this task we built a machine learning classifier that calculates the phishing probability of an email. The model input consists of features and attributes of a specific email, and the desired output is “phishing” or “not phishing”. End-to-end development is not as simple as training on data and saving the model to a binary file. Data collection, validation and deployment are often outside the scope of ML tutorials and articles. By writing about the process end to end, we hope this article provides insights that can be applied in other domains of machine learning and threat detection.

There are a few distinct phases of ML development:

  • Data exploration
  • Training
  • Validation
  • Serving in production at scale
  • Monitoring

Development is an iterative process, and models often go through several rounds of training, validation, monitoring and retraining.

Data Exploration

Data exploration is the first step of the end-to-end machine learning workflow. Typically, we pull a sample of emails, load it into Pandas, write scripts for wrangling and visualization, and then save the result for further exploration or training.

Selecting a data set

At IRONSCALES, we process billions of emails. Handling data at this scale is a daunting task and there are several challenges. Firstly, it is impractical to process, sort, and load this much data in tools like Pandas or Excel. Distributed technologies such as Dask or Spark could handle the volume, but they add complexity and, in many cases, are not well suited to this type of exploratory analysis. In addition, a typical sample of emails is heavily imbalanced against phishing. That is to say, most emails are not phishing, so reviewing emails at random is a slow way to find phishing examples.

In this case we can reduce the size of our dataset by looking at emails that contain at least one anomaly. An anomaly is an email attribute that is unexpected or deviates from the expected value. Even though we do not collect and track all anomalies, and some phishing emails can remain undetected and excluded from our dataset, reducing the scope of our search to anomalous emails will reduce the imbalance and size of the dataset to make it more manageable. By itself, an anomaly is not enough to determine if an email is phishing, but we now have a starting point for further exploration.

Data Analysis

Since we already have a storage system where it is possible to easily access anomalous emails, the first step for this task is to pull a distribution of interesting emails from storage and explore it in Pandas. Imagine we have a table with rows of emails and columns of attributes.

From here we can start looking at different sub-samples and try to understand patterns in the data. While running analysis on this data, it became clear that specific senders were sending out many phishing emails. It was also apparent that many of these senders had common attributes. To investigate, we want to pivot the table and aggregate attributes on the sender value.

By pivoting the table, we now have sender as the primary index and columns that aggregate the corresponding attributes from all that sender’s emails contained in the sample.
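As a minimal sketch of this pivot, the aggregation can be expressed with a Pandas groupby. The column names and values here (anomaly_count, has_link, recipient) are hypothetical stand-ins, not the actual attributes we track.

```python
import pandas as pd

# Hypothetical sample: one row per email, one column per attribute.
emails = pd.DataFrame(
    {
        "sender": ["a@x.com", "a@x.com", "b@y.com", "b@y.com", "b@y.com"],
        "anomaly_count": [2, 3, 0, 1, 0],
        "has_link": [1, 1, 0, 1, 0],
        "recipient": ["u1", "u2", "u1", "u1", "u3"],
    }
)

# Aggregate per sender: sender becomes the index and each column
# summarizes all of that sender's emails in the sample.
senders = emails.groupby("sender").agg(
    email_count=("anomaly_count", "size"),
    mean_anomalies=("anomaly_count", "mean"),
    link_ratio=("has_link", "mean"),
    unique_recipients=("recipient", "nunique"),
)
print(senders)
```

Each row of the resulting frame describes one sender, which is the shape a per-sender classifier needs as input.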

 

At this point, it is possible to display this cluster of phishing senders in 3D feature space.

In this image blue dots represent legitimate senders and yellow and green are phishing senders. Chart 1 clearly shows that there is a concentrated cluster of phishing senders in the middle, surrounded by blue legitimate senders. It’s not an exact relationship that is easy to determine with a rule or heuristic-based business logic, but it is a good candidate for training with machine learning.
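A chart of this kind can be produced with matplotlib. The sketch below uses synthetic data (a tight cluster inside a wider cloud) purely to illustrate the plotting calls; it does not reproduce our real features or the original chart.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data: a concentrated phishing cluster
# surrounded by a wider spread of legitimate senders.
legit = rng.normal(loc=0.0, scale=3.0, size=(200, 3))
phish = rng.normal(loc=0.0, scale=0.5, size=(40, 3))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(legit[:, 0], legit[:, 1], legit[:, 2], c="blue", label="legitimate")
ax.scatter(phish[:, 0], phish[:, 1], phish[:, 2], c="gold", label="phishing")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_zlabel("feature 3")
ax.legend()
# Call fig.savefig("senders.png") to write the chart to disk.
```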

In this distribution of emails, we can observe three types of email senders:

  • notifications
  • spam
  • phishing

The goal of this model is to identify phishing senders specifically and prevent their emails from reaching the end user’s inbox. Notifications and spam should both be ignored.

Now we save the data to CSV and open a new notebook for training.

Classification Task

We extracted various features and fit a model to the data. Since the data is tabular in nature, Random Forest is a good candidate model for prediction.
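A minimal training sketch with scikit-learn might look like the following. The data is generated with make_classification as a stand-in for the per-sender feature table, and the hyperparameters are illustrative assumptions rather than the values used in production.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sender feature table (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
# Stratify so the train and test splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```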

The results look pretty good.

Note that for this model we prioritize precision over recall. In other words, false negatives (classifying phishing as legitimate) are acceptable for the model, but false positives (classifying legitimate emails as phishing) are not. This is because the scope of this model is to detect a specific type of phishing. There are other models detecting other types of phishing, so as long as we don’t accidentally delete a legitimate email from the customer’s mailbox, we can improve the system by introducing other classifiers that detect other phishing emails. Recall can be improved by adding more models, but precision is only as good as the weighted average precision of all models combined.
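To make the trade-off concrete, here is a small worked example with made-up labels and probabilities. Raising the decision threshold removes the false positives at the cost of missing more phishing. Returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall(y_true, probs, threshold):
    """Return (precision, recall) when predicting phishing for probs >= threshold."""
    tp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, probs) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, probs) if p < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels (1 = phishing) and model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.95, 0.90, 0.60, 0.55, 0.20, 0.10, 0.40, 0.70]

low = precision_recall(y_true, probs, 0.5)   # (0.6, 0.75)
high = precision_recall(y_true, probs, 0.8)  # (1.0, 0.5)
print(low, high)
```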

Improving features

With experimentation it is possible to add new features and improve the model’s classification metrics. RandomForestClassifier exposes a feature_importances_ attribute that can be used to rank features by importance.
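A short sketch of ranking features this way is shown below. The feature names and data are synthetic placeholders; only the feature_importances_ attribute is taken from scikit-learn itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names and synthetic data (3 informative, 3 noise features).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each name with its impurity-based importance and sort descending.
ranked = sorted(
    zip(feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Features that consistently rank near the bottom are candidates for removal, while a new feature that ranks highly is worth keeping for the next training round.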

Cross Validation

Now that the model has been trained and optimized, the next step is cross validation and probability calibration. For this we can use CalibratedClassifierCV, which fits the classifier on cross-validation folds and calibrates the output of the predict_proba method. This matters because the raw probabilities produced by Random Forest models tend to be skewed.
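The following is a minimal sketch of wrapping a Random Forest in CalibratedClassifierCV. The data is synthetic, and the choice of isotonic calibration with 5 folds is an assumption for illustration, not necessarily the configuration we used.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the sender feature table.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

base = RandomForestClassifier(n_estimators=100, random_state=0)
# Each fold trains a copy of the forest and fits a calibrator on the held-out part.
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Column 1 holds the calibrated probability of the positive (phishing) class.
phishing_proba = calibrated.predict_proba(X_test)[:, 1]
print(phishing_proba[:5])
```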

Deploy Silently

OK, now we have a model and we are ready to deploy it into production. To know how it performs on live data, we need to integrate it into our email inspection flow and make silent predictions. Let’s discuss the inspection flow at a high level.

To serve the model, we need to aggregate the anomalies for each sender. Luckily, we have the anomaly table and after creating a new index on sender and date, it can be read quickly.

There are four steps in our initial deployment:

  1. query the anomaly table for rows with the sender
  2. aggregate and flatten the features extracted from the query result and format them for prediction, e.g. [1, 3, 0.43, 0.87]
  3. run the prediction
  4. if the predicted phishing probability is above the threshold, write a new entry in the silent phishing table
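The four steps above can be sketched with an in-memory SQLite database. Everything here is a hypothetical stand-in: the table and column names, the threshold, the three aggregate features, and stub_predict_proba, which replaces the trained model so the example is self-contained.

```python
import sqlite3
from datetime import date, timedelta

WINDOW_DAYS = 30  # assumed aggregation window
THRESHOLD = 0.9   # hypothetical threshold for a silent entry

def stub_predict_proba(features):
    """Stand-in for the trained model; returns a made-up phishing probability."""
    email_count, anomaly_total, anomaly_mean = features
    return min(1.0, anomaly_mean / 4.0)

def inspect_sender(conn, sender, today, predict_proba=stub_predict_proba):
    since = (today - timedelta(days=WINDOW_DAYS)).isoformat()
    # 1. Query the anomaly table for this sender's rows inside the window.
    rows = conn.execute(
        "SELECT anomaly_count FROM anomalies WHERE sender = ? AND day >= ?",
        (sender, since),
    ).fetchall()
    if not rows:
        return None
    # 2. Aggregate and flatten into a feature vector.
    counts = [row[0] for row in rows]
    features = [len(counts), sum(counts), sum(counts) / len(counts)]
    # 3. Run the prediction.
    proba = predict_proba(features)
    # 4. Record the prediction silently if it is above the threshold.
    if proba > THRESHOLD:
        conn.execute(
            "INSERT INTO silent_phishing (sender, day, proba) VALUES (?, ?, ?)",
            (sender, today.isoformat(), proba),
        )
    return proba

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anomalies (sender TEXT, day TEXT, anomaly_count INTEGER)")
conn.execute("CREATE INDEX idx_sender_day ON anomalies (sender, day)")
conn.execute("CREATE TABLE silent_phishing (sender TEXT, day TEXT, proba REAL)")
conn.executemany(
    "INSERT INTO anomalies VALUES (?, ?, ?)",
    [
        ("bad@x.com", "2021-06-20", 4),
        ("bad@x.com", "2021-06-25", 4),
        ("bad@x.com", "2021-01-01", 0),  # outside the window, ignored
        ("ok@y.com", "2021-06-28", 1),
    ],
)

today = date(2021, 6, 30)
bad_proba = inspect_sender(conn, "bad@x.com", today)
ok_proba = inspect_sender(conn, "ok@y.com", today)
silent_rows = conn.execute("SELECT sender FROM silent_phishing").fetchall()
print(bad_proba, ok_proba, silent_rows)
```

Because the prediction is only written to a separate table, nothing in the live inspection flow changes for the end user while the model is being evaluated.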

It is important to note that the time window used to aggregate features at prediction time should match the window used for the training data. For example, if 30 days of data were pulled from the anomalies table for training, then 30 days should be used for predictions. If a 90-day window were used instead, count-based input values could be up to 3 times larger than those in the training set, which can cause serious issues at inference time.

Finally, before deploying the model into a production system it is best practice to write unit tests and validate code quality. Even a silent model can cause exceptions in the inspection flow, so it is worth writing a few basic tests. Firstly, the simplest is to pass pre-generated values to the prediction function and check that there are no easily noticeable exceptions in the logic and that type and value errors are handled correctly. Secondly, there should be at least one integration test in which data is inserted into a test database and then the entire inspection flow is called for a specific email. Thirdly, it is important to test and benchmark the runtime performance of the classifier. If the classifier requires a lot of resources or takes time to read from disk, it may require a different architecture.
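As a sketch of the first kind of test, the example below checks a hypothetical predict_phishing wrapper against pre-generated values. The wrapper, its four-feature input, and its validation rules are assumptions made for illustration; the model itself is replaced by a stub.

```python
import unittest

def predict_phishing(features, model_predict=lambda f: 0.5):
    """Hypothetical wrapper: validate the input, then call the model."""
    if not isinstance(features, (list, tuple)) or len(features) != 4:
        raise ValueError("expected a flat sequence of 4 features")
    if not all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in features
    ):
        raise TypeError("features must be numeric")
    proba = float(model_predict(list(features)))
    if not 0.0 <= proba <= 1.0:
        raise ValueError("model returned a probability outside [0, 1]")
    return proba

class PredictPhishingTests(unittest.TestCase):
    def test_valid_input_returns_probability(self):
        self.assertEqual(predict_phishing([1, 3, 0.43, 0.87]), 0.5)

    def test_wrong_length_raises_value_error(self):
        with self.assertRaises(ValueError):
            predict_phishing([1, 3])

    def test_non_numeric_raises_type_error(self):
        with self.assertRaises(TypeError):
            predict_phishing([1, "3", 0.43, 0.87])

# Run the suite programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PredictPhishingTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```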

Analysis of Silent Predictions

After deploying the model, we waited a few days and reviewed the silent predictions table to determine the accuracy of the model.

After reviewing the predictions, we noticed that the model was classifying a lot of spam emails as phishing. This was because the sender activity for spam and phishing is often identical, i.e., the same sender was sending both spam and phishing emails and the only difference was the content of the email. After running some calculations, we found the precision to be only around 80%. Unfortunately, this model was not mature enough to deploy in active mode.

Training NLP Model for Ensemble Inference

To mature this prediction further, we need to add a second model, trained on the email body text, that separates spam from phishing emails even when they come from the same sender. We will create a simple language model for this second classification task. For the training data set, we use the emails from the silent predictions. The idea is that after the Random Forest model makes a prediction on an email and determines the sender is phishing, a second prediction from the language model on the email text recognizes spammy language and cancels the first prediction.
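A simple text classifier of this kind can be sketched as a TF-IDF vectorizer feeding an SGDClassifier. The tiny corpus below is invented for illustration, and the modified_huber loss is chosen here only because it makes predict_proba available; the real training set and settings may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Invented examples; real training data would be the reviewed silent predictions.
texts = [
    "Limited time offer, buy discount watches now",
    "Win a free cruise, click to claim your prize",
    "Huge sale this weekend, unsubscribe here",
    "Your account is locked, verify your password immediately",
    "Invoice attached, sign in to review the payment",
    "Security alert: confirm your credentials to avoid suspension",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = phishing

text_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    SGDClassifier(loss="modified_huber", random_state=0),
)
text_model.fit(texts, labels)

# Column 1 holds the probability of the spam class.
spam_proba = text_model.predict_proba(
    ["Claim your free prize now", "Verify your password to unlock your account"]
)[:, 1]
print(spam_proba)
```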

Now we have a robust ensemble model.

The ensemble function returns a phishing probability of zero if the email is spam. It is also useful to save the original prediction for future analysis, retraining, and monitoring purposes. Since the attributes of the sender will change with each subsequent email, the sender stats are returned as well.
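A minimal sketch of such an ensemble function follows. The function name, the dictionary keys, the spam threshold, and the stub models passed in as callables are all hypothetical; they only illustrate the behaviour described above.

```python
def ensemble_predict(sender_features, body_text, sender_model, spam_model,
                     spam_threshold=0.5):
    """Combine the sender model and the text model into one prediction."""
    sender_proba = sender_model(sender_features)
    spam_proba = spam_model(body_text)
    is_spam = spam_proba >= spam_threshold
    return {
        # Spam cancels the sender-based prediction.
        "phishing_proba": 0.0 if is_spam else sender_proba,
        # Keep the original score for analysis, retraining, and monitoring.
        "original_proba": sender_proba,
        "spam_proba": spam_proba,
        # Sender attributes change with every email, so snapshot them.
        "sender_stats": list(sender_features),
    }

# Stub models standing in for the Random Forest and the text classifier.
spam_result = ensemble_predict(
    [1, 3, 0.43, 0.87], "buy discount watches",
    sender_model=lambda f: 0.97, spam_model=lambda t: 0.9,
)
phish_result = ensemble_predict(
    [1, 3, 0.43, 0.87], "verify your password",
    sender_model=lambda f: 0.97, spam_model=lambda t: 0.1,
)
print(spam_result["phishing_proba"], phish_result["phishing_proba"])
```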

Active Deployment

After a few more days of silent evaluation, the measured precision of the new sender anomaly ensemble model was above 99%. The next phase involves adding it to the list of models that run in the active detection flow. If any of these models individually predicts an email to be phishing above a certain threshold, then the email is escalated. Only minor code changes are required since the model has already been running in silent mode for a few weeks.
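The escalation rule can be sketched in a few lines. The model names and per-model thresholds below are hypothetical.

```python
def should_escalate(predictions, thresholds):
    """Escalate if any single model exceeds its own threshold."""
    return any(proba > thresholds[name] for name, proba in predictions.items())

# Hypothetical per-model thresholds.
thresholds = {"sender_anomaly": 0.99, "other_model": 0.90}

escalated = should_escalate(
    {"sender_anomaly": 0.995, "other_model": 0.20}, thresholds
)
ignored = should_escalate(
    {"sender_anomaly": 0.50, "other_model": 0.20}, thresholds
)
print(escalated, ignored)
```

Because one confident model is enough to escalate, every model added to this list needs to meet the precision bar on its own.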

Monitoring

After we deploy a model into production it is important to monitor its effectiveness over time. Machine learning models do not last forever; they require maintenance and retraining, and eventually they are deprecated and removed. Over time, the input data can change. Perhaps spammers will use new text templates. Phishers can use new sending patterns to avoid detection. Changes to other parts of the inspection flow will probably change the distribution of anomalous emails saved, and there will be drift in the input data.

While we currently lack a robust monitoring system, this is a major area of attention. One task is to log predictions and probabilities over time and display them with monitoring tools like Datadog, where analysts can easily view them in real time. Another task is to monitor drift of input values.
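One simple way to watch for drift in an input value is to compare recent values against a baseline from training time. The sketch below measures how far the recent mean has moved, in units of the baseline standard deviation. The metric, the sample values, and any alert level applied to the score are assumptions for illustration, not a description of a deployed system.

```python
import statistics

def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    recent_mu = statistics.mean(recent)
    if sigma == 0:
        return 0.0 if recent_mu == mu else float("inf")
    return abs(recent_mu - mu) / sigma

# Made-up values of one input feature (e.g. emails per sender).
baseline = [10, 12, 11, 9, 10, 11, 12, 10]  # observed at training time
stable = [11, 10, 12, 10]                   # recent values, similar to baseline
shifted = [30, 28, 33, 31]                  # recent values after a change upstream

stable_score = drift_score(baseline, stable)
shifted_score = drift_score(baseline, shifted)
print(round(stable_score, 2), round(shifted_score, 2))
```

A score that stays high for a feature over several days suggests the model is seeing inputs unlike its training data and may need retraining.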

Next Steps

Finally, phishing emails are escalated in the production system with high precision! Our model is deployed and our customers are happy. The development of this model took a few months in total, from the time anomalous emails were pulled for exploration until the first email was removed from the end user’s mailbox. A lot was learned during this time, and a lot can still be improved.

This model uses a novel approach toward classification. While most previous approaches try to classify an email based on content and metadata, this is the first model at IRONSCALES where an email can be classified as phishing based on sending patterns alone. If a sender is classified as phishing for one email, wouldn’t all recent emails from that sender be suspect? Sender classification opens an exciting new area for exploration in the future with collection of new categories of relationship and activity data.

Another interesting insight is the effect of spam on phishing classification. Our observations from building the model indicate that classifying what isn’t phishing is useful for identifying what is. Understanding and classifying spam is another area for improving phishing detection accuracy.

Conclusion

In this article we covered several topics related to the development, deployment and monitoring of a machine learning model:

1. Data exploration with Pandas

2. Training the model on tabular data with RandomForestClassifier

3. Deploying the model at scale

4. Training a model on text data with SGDClassifier

5. Monitoring production code

First, for data exploration, we selected a dataset, examined it in a Pandas DataFrame and created some charts and visualizations. Then we trained the model on tabular data using RandomForestClassifier. We validated, calibrated and then deployed our model in a production system. After realizing that the classification score needed some improvements, we added a text-based model and redeployed. Finally, our model is actively making predictions and we are monitoring performance for regressions.

Congratulations — you made it to the end! If you are interested in building machine learning models and distributed systems to help protect companies from advanced email attacks, IRONSCALES is hiring!

To learn more about IRONSCALES’ award-winning anti-phishing solution, please sign up for a demo today at https://IRONSCALES.com/get-a-demo/.