Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library
Fraudulent users and malicious accounts can result in billions of dollars in lost revenue annually for businesses. Although many businesses use rule-based filters to prevent malicious activity in their systems, these filters are often brittle and may not capture the full range of malicious behavior.
However, some solutions, such as graph techniques, are especially suited for detecting fraudsters and malicious users. Fraudsters can evolve their behavior to fool rule-based systems or simple feature-based models, but it’s difficult to fake the graph structure and relationships between users and other entities captured in transaction or interaction logs. Graph neural networks (GNNs) combine information from the graph structure with attributes of users or transactions to learn meaningful representations that can distinguish malicious users and events from legitimate ones.
This post shows how to use Amazon SageMaker and Deep Graph Library (DGL) to train GNN models and detect malicious users or fraudulent transactions. Businesses looking for a fully-managed AWS AI service for fraud detection can also use Amazon Fraud Detector, which makes it easy to identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud.
In this blog post, we focus on the data preprocessing and model training with Amazon SageMaker. To train the GNN model, you must first construct a heterogeneous graph using information from transaction tables or access logs. A heterogeneous graph is one that contains different types of nodes and edges. In the case where nodes represent users or transactions, the nodes can have several kinds of distinct relationships with other users and possibly other entities, such as device identifiers, institutions, applications, IP addresses and so on.
Some examples of use cases that fit under this include:
- A financial network where users transact with other users and specific financial institutions or applications
- A gaming network where users interact with other users but also with distinct games or devices
- A social network where users can have different types of links to other users
The following diagram illustrates a heterogeneous financial transaction network.
GNNs can incorporate user features like demographic information or transaction features like activity frequency. In other words, you can enrich the heterogeneous graph representation with features for nodes and edges as metadata. After the node and relations in the heterogeneous graph are established, with their associated features, you can train a GNN model to learn to classify different nodes as malicious or legitimate, using both the node or edge features as well as the graph structure. The model training is set up in a semi-supervised manner—you have a subset of nodes in the graph already labeled as fraudulent or legitimate. You use this labeled subset as a training signal to learn the parameters of the GNN. The trained GNN model can then predict the labels for the remaining unlabeled nodes in the graph.
To get started, you can use the full solution architecture that uses Amazon SageMaker to run the processing jobs and training jobs. You can trigger the Amazon SageMaker jobs automatically with AWS Lambda functions that respond to Amazon Simple Storage Service (Amazon S3) put events, or manually by running cells in an example Amazon SageMaker notebook. The following diagram is a visual depiction of the architecture.
Data preprocessing for fraud detection with GNNs
In this section, we show how to preprocess an example dataset and identify the relations that will make up the heterogeneous graph.
For this use case, we use the IEEE-CIS fraud dataset to benchmark the modeling approach. This is an anonymized dataset that contains 500 thousand transactions between users. The dataset has two main tables:
- Transactions table – Contains information about transactions or interactions between users
- Identity table – Contains information about access logs, device, and network information for users performing transactions
You use a subset of these transactions with their labels as a supervision signal for the model training. For the transactions in the test dataset, their labels are masked during training. The task is to predict which masked transactions are fraudulent and which are not.
The following code example gets the data and uploads it to an S3 bucket that Amazon SageMaker uses to access the dataset during preprocessing and training (run this in a Jupyter notebook cell):
Despite the efforts of fraudsters to mask their behavior, fraudulent or malicious activities often have telltale signs like high out-degree or activity aggregation in the graph structure. The following sections show how to perform feature extraction and graph construction to allow the GNN models to take advantage of these patterns to predict fraud.
Feature extraction consists of performing numerical encoding on categorical features and some transformation of numerical columns. For example, the transaction amounts are logarithmically transformed to indicate the relative magnitude of the amounts, and categorical attributes can be converted to numerical form by performing one hot encoding. For each transaction, the feature vector contains attributes from the transaction tables with information about the time delta between previous transactions, name and addresses matches, and match counts.
Constructing the graph
To construct the full interaction graph, split the relational information in the data into edge lists for each relation type. Each edge list is a bipartite graph between transaction nodes and other entity types. These entity types each constitute an identifying attribute about the transaction. For example, you can have an entity type for the kind of card (debit or credit) used in the transaction, the IP address of the device the transaction was completed with, and the device ID or operating system of the device used. The entity types used for graph construction consist of all the attributes in the identity table and a subset of attributes in the transactions table, like credit card information or email domain. The heterogeneous graph is constructed with the set of per relation type edge lists and the feature matrix for the nodes.
Using Amazon SageMaker Processing
You can execute the data preprocessing and feature extraction step using Amazon SageMaker Processing. Amazon SageMaker Processing is a feature of Amazon SageMaker that lets you run preprocessing and postprocessing workloads on fully managed infrastructure. For more information, see Process Data and Evaluate Models.
First define a container for the Amazon SageMaker Processing job to use. This container should contain all the dependencies that the data preprocessing script requires. Because the data preprocessing here only depends on the pandas library, you can have a minimal Dockerfile to define the container. See the following code:
You can build the container and push the built container to an Amazon Elastic Container Registry (Amazon ECR) repository by entering the following of code:
When the data preprocessing container is ready, you can create an Amazon SageMaker
ScriptProcessor that sets up a Processing job environment using the preprocessing container. You can then use the
ScriptProcessor to run a Python script, which has the data preprocessing implementation, in the environment defined by the container. The Processing job terminates when the Python script execution is complete and the preprocessed data has been saved back to Amazon S3. This process is completely managed by Amazon SageMaker. When running the
ScriptProcessor, you have the option of passing in arguments to the data preprocessing script. Specify what columns in the transaction table should be considered as identity columns and what columns are categorical features. All other columns are assumed to be numerical features. See the following code:
The following code example shows the outputs of the Amazon SageMaker Processing job stored in Amazon S3:
All the relation edgelist files represent the different kinds of edges used to construct the heterogenous graph during training. Features.csv contains the final transformed features of the transaction nodes, and tags.csv contains the labels of the nodes used as the training supervision signal. Test.csv contains the
TransactionID data to use as a test dataset to evaluate the performance of the model. The labels for these nodes are masked during training.
GNN model training
Now you can use Deep Graph Library (DGL) to create the graph and define a GNN model, and use Amazon SageMaker to launch the infrastructure to train the GNN. Specifically, a relational graph convolutional neural network model can be used to learn embeddings for the nodes in the heterogeneous graph, and a fully connected layer for the final node classification.
To train the GNN, you need to define a few hyperparameters that are fixed before the training process, such as the kind of graph you’re constructing, the class of GNN models you’re using, the network architecture, and the optimizer and optimization parameters. See the following code:
The preceding code shows a few of the hyperparameters. For more information about all the hyperparameters and their default values, see estimator_fns.py in the GitHub repo.
Model training with Amazon SageMaker
With the hyperparameters defined, you can now kick off the training job. The training job uses DGL, with MXNet as the backend deep learning framework, to define and train the GNN. Amazon SageMaker makes it easy to train GNN models with the framework estimators, which have the deep learning framework environments already set up. For more information about training GNNs with DGL on Amazon SageMaker, see Train a Deep Graph Network.
You can now create an Amazon SageMaker MXNet estimator and pass in the model training script, hyperparameters, and the number and type of training instances you want. You can then call
fit on the estimator and pass in the training data location in Amazon S3. See the following code:
After training the GNN, the model learns to distinguish legitimate transactions from fraudulent ones. The training job produces a pred.csv file, which contains the model’s predictions for the transactions in test.csv. The ROC curve depicts the relationship between the true positive rate and the false positive rate at various thresholds, and the Area Under the Curve (AUC) can be used as an evaluation metric. The following graph shows that the GNN model we trained outperforms both fully connected feed forward networks and gradient boosted trees that use the features but don’t fully take advantage of the graph structure.
In this post, we showed how to construct a heterogeneous graph from user transactions and activity and use that graph and other collected features to train a GNN model to predict which transactions are fraudulent. This post also showed how to use DGL and Amazon SageMaker to define and train a GNN that achieves high performance on this task. For more information about the full implementation of the project and other GNN models for the task, see the GitHub repo.
Additionally, we showed how to perform data processing to extract useful features and relations from raw transaction data logs using Amazon SageMaker Processing. You can get started with the project by deploying the provided CloudFormation template and passing in your own dataset to detect malicious users and fraudulent transactions in your data.
About the Author
Soji Adeshina is a Machine Learning Developer who works on developing deep learning based solutions for AWS customers. Currently, he’s working on graph learning with applications in financial services and advertising but he also has a background in computer vision and recommender systems. In his spare time, he likes to cook and read philosophical texts.