Detect suspicious IP addresses with the Amazon SageMaker IP Insights algorithm

Today, we are announcing the new IP Insights algorithm for Amazon SageMaker. IP Insights is an unsupervised learning algorithm for detecting anomalous behavior and usage patterns of IP addresses. In this blog post, we introduce the problem of identifying fraudulent behavior using IP addresses, describe the Amazon SageMaker IP Insights algorithm, demonstrate how you can use it in a real-world application, and share some of our results using it internally.

Fighting malicious activity

Malicious activities often involve an account takeover — unauthorized access to online resources, such as access to online banking accounts, admin consoles, and social networking or webmail accounts. Takeover attempts typically use stolen, lost, or leaked credentials, and unauthorized access is likely to originate from an IP address that is not typical to the account (for example, from the hacker’s computer rather than from the user’s).

A common defense for preventing account takeovers is to flag cases when online resources are accessed by an IP address that hasn’t been seen before. Flagged interactions can be blocked, or users can be challenged to provide additional forms of authentication (such as responding to an SMS). However, most users regularly access online resources from IP addresses they have never used before. Therefore, the “flag new IPs” method yields unreasonably high false positive rates and results in a poor customer experience.

While users regularly access online resources from new IP addresses, choosing a new IP address is not completely random. Several latent factors influence the allocation, such as traveling habits of users and IP assignment strategies of internet service providers. Explicitly enumerating all of these latent factors is generally intractable. However, by looking at access patterns of an online resource, it’s possible to predict whether a new IP address is an expected event or an anomaly. The Amazon SageMaker IP Insights algorithm is designed precisely to do that.

The Amazon SageMaker IP Insights algorithm

The Amazon SageMaker IP Insights algorithm uses statistical modeling and neural networks to capture associations between online resources (for example, online bank accounts) and IPv4 addresses. Under the hood, the algorithm learns vector representations for the online resources and IP addresses where each point is close together if they have been used together. The algorithm itself can learn and incorporate many of the latent factors without requiring us to explicitly model them.

The training procedure starts by randomly assigning each possible IP address and resource to a random point. An online resource is any opaque string identifier (such as a user ID, UUID, etc.). At its core, the algorithm iteratively pushes the points representing IP addresses and resources together if they are associated with each other in the training data, and it pulls them away from each other if they are not associated.

Due to the special neural network architecture, which uses the structure of IPv4 addresses, the algorithm models the behavior of IP addresses. It can compute accurate vector representations, even if they were not seen before in the training data.

The Amazon SageMaker IP Insights Algorithm can be used to analyze access logs and make predictions about whether an access attempt (such as a login event or an online transaction) is suspicious based on the IP address and a user’s access history. This is even the case when an IP address has not been seen before.

Hands-on example: Detecting suspicious login attempts to a web application

In this section, we’ll show you how the Amazon SageMaker IP Insights algorithm can be used to identify suspicious login events to a web application. For more information or to try it out yourself, try the example notebook here.

We are going to focus on an account takeover scenario where an attacker tries to log in to a user’s account with stolen credentials. Such malicious login attempts often originate from unusual IP addresses. Therefore, we can identify them by using the Amazon SageMaker IP Insights algorithm. First, we’ll show you how to prepare your dataset and train the model, then we’ll show how you can call the trained model from your application to act on insights.

Preparing the dataset

The Amazon SageMaker IP Insights algorithm can be applied to any situation where you have data linking a resource (such as user account) and an IP address. In many cases this might come directly from your application or web server logs, application database, or data warehouse. The first step is exporting your data to Amazon S3 in headerless CSV files that contain two fields (EntityId, IpAddress). The <EntityID> can be any string identifier for a resource, and the <IpAddress> should be in IPv4 dot notation. For example, your dataset should look like this:

Entity1,10.0.0.1
Entity2,192.168.0.100
.
.
.
Entity2,10.0.0.1

To see how our model performs, we split the dataset into a training and test set. The algorithm makes predictions using the test set to evaluate how accurately it can identify valid and invalid access attempts. Typically you will want to use several consecutive days of the dataset for training, and then the subsequent days for the test set.

It’s a best practice to use data over a longer period of time (at least days to weeks) and to regularly refresh your model by retraining with new data. Similarly, the algorithm performs better if the training dataset is shuffled when you create it.

Training the model

We train the model on Amazon SageMaker using the IP Insights algorithm. There are a few hyperparameters (configuration for the algorithm) that we can tweak to improve performance: vector_dim is the dimension of the latent space that both IP addresses and accounts are represented; num_entity_vectors is the number of distinct vector representations that the algorithm maintains for accounts. The mapping from an account to a vector is determined by a hash function, so num_entity_vectors should be set larger than the total number of unique accounts to minimize the adverse effects of hash collisions. Finally, shuffled_negative_sampling_rate and random_negative_sampling_rate specify how many negative samples are generated for each record of the training data by randomly picking an IP address from the current mini batch and by randomly generating IP address, respectively. A detailed explanation of the model hyperparameters is provided here.

After we set the training job parameters and the model hyperparameters, we start training the Amazon SageMaker IP Insights model as follows:

role = get_execution_role()
sess = sage.Session()
image = 'xxxxxxx.dkr.ecr.yyyy.amazonaws.com/ipinsights:latest'

input_data = {
    'train': sage.session.s3_input('s3://my_train_data', content_type='text/csv'),
}

model = sage.estimator.Estimator(image, 
                                 role, 
                                 train_instance_count=1, 
                                 train_instance_type='ml.p3.2xlarge',
                                 output_path='s3://{}/output'.format(sess.default_bucket()),
                                 sagemaker_session=sess)
                                 
model.set_hyperparameters(epochs='25', 
                          mini_batch_size='1000', 
                          learning_rate='0.001', 
                          vector_dim='128', 
                          num_entity_vectors='1000000',
                          shuffled_negative_sampling_rate='2',
                          random_negative_sampling_rate='1',
                          num_ip_encoder_layers='1')
model.fit(input_data)

Identifying suspicious logins

After the training is completed, we deploy the model to an endpoint for online inference:

from sagemaker.predictor import csv_serializer, json_deserializer

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge'
)

From your application code, you can now invoke the model. Since Amazon SageMaker is a managed service this can be done from many different languages including Java, Python, etc.

Python

predictor.serializer = csv_serializer
predictor.accept = 'application/json'
predictor.deserializer = json_deserializer

predictor.predict(dataset)

Java 8

String dataCSV = String.join(",", entityId, ipv4Address);
ByteBuffer buf = ByteBuffer.wrap(dataCSV.getBytes());

InvokeEndpointRequest invokeEndpointRequest = new InvokeEndpointRequest();
invokeEndpointRequest.setBody(buf);
invokeEndpointRequest.setEndpointName(endpointName);
invokeEndpointRequest.setContentType("text/csv");
invokeEndpointRequest.setAccept("application/json");

AmazonSageMakerRuntime amazonSageMaker = AmazonSageMakerRuntimeClientBuilder.defaultClient();
InvokeEndpointResult invokeEndpointResult = amazonSageMaker.invokeEndpoint(invokeEndpointRequest);

Evaluating model performance

Now that we have the model deployed, we want to validate that it can distinguish between authorized login events and suspicious or fraudulent attempts. We do that by comparing the scores the model gives for legitimate login events in the test dataset with those of the negatively sampled random events. To generate negative events, we pick a login event from test dataset, keep the account the same and replace the IP address with a randomly generated IP address. This way, a negative event somewhat represents a malicious login attempt, since it is a record of a known account being accessed from an unknown IP address.

As we can see, the Amazon SageMaker IP Insights model gives much higher scores to malicious events, and there is a clear separation between the two distributions.

Tweaking model performance and threshold

Now that we can see the range of scores for legitimate and malicious events, we can make a better choice about the threshold we chose and the actions we should take. If we used the model’s score to trigger an additional authentication challenge, such as sending one-time code to a mobile phone or displaying security questions, a good choice of threshold value would be around 0. This allows for most malicious login attempts to face additional authentication challenges. More legitimate traffic will be flagged, but only a small fraction of legitimate users would be bothered by that. On the other hand, if we triggered a manual investigation based on these scores, then we would choose a threshold value around 10. This would correspond to an operating point with a much lower false positive rate. That is, although some malicious events would be missed, the ones selected for manual investigation would be much more likely to be malicious.

Results and baseline comparison

When designing the algorithm, we evaluated its performance on an internal dataset of user logins. In this section, we compare its performance to existing methods that are used to detect suspicious logins. First we compare it to two variations of the “flag new IP” method mentioned earlier:

IP Table Method: In this method, a login event is considered malicious if the account has never used the IP address during training period.
Subnet Table Method: This method is a more relaxed version of the previous method. Here, a login event is considered malicious if the account has never used an IP address from the same /24 subnet during the training period.

While being simple, these methods are quite effective and often achieve close to 100% true positive rate because an attacker’s IP address is highly likely to be different than the IP addresses that the victim uses. However, as we will see, they suffer from high false positive rates because legitimate users sometime log in from IP addresses that they have not used during the training period. One of the main contributions of the Amazon SageMaker IP Insights algorithm is to reduce the high false positive rate by associating accounts with more likely IP addresses, even if they have never been used before.

To compare Amazon SageMaker IP Insights with the baselines, we created a labelled test case where we artificially inject 1% malicious traffic into a dataset of legitimate traffic. We then score each event in the dataset using both methods.

We observe in these Receiver Operating Characteristics (ROC) curves that both baseline methods reach 100% true positive rate (TPR) with around 20% false positive rate (FPR). The Amazon SageMaker IP Insights model, on the other hand, achieves 100% true positive rate at a much lower false positive rate, around 10%. In addition, the baseline models are rigid and their only operating point is TPR=100% and FPR~20%. On the other hand, the Amazon SageMaker IP Insights model can be configured to operate at lower FPR values by adjusting the threshold. As we discussed earlier, lower FPR is especially useful when high-scoring events trigger a manual investigation.

Conclusion

In this post, we introduced the problem of malicious login attempts. We demonstrated how the Amazon SageMaker IP Insights model can be used to identify suspicious login events, and we showed that the Amazon SageMaker IP Insights model performs significantly better than baseline methods. Furthermore, now that the IP Insights model is on Amazon SageMaker, it can be used with Amazon SageMaker Automatic Model Tuning for you to achieve even better performance.

About the authors

Jared Katzman is a Software Engineer in the AWS AI Labs organization. They are interested in researching ways we can use machine learning and technology for social good. In their spare time, they run a mentorship program for LGBTQ+ students interested in technology.

Baris Coskun is a Senior Applied Scientist in the AWS External Security Services, where he leads a team of scientists working on machine learning and information security.

Acknowledgements

We would like to thank Jakub Zablocki, Jianbo Liu, and Zak Jost from AWS Payments & Fraud Team for their valuable inputs on the research of this project, as well as Eric Kim and Pranav Garg from Amazon AI, for their early contributions.

Artificial Intelligence