AWS Machine Learning Blog

Detect financial transaction fraud using a Graph Neural Network with Amazon SageMaker

Fraud plagues many online businesses and costs them billions of dollars each year. Financial fraud, counterfeit reviews, bot attacks, account takeovers, and spam are all examples of online fraud and malicious behaviors.

Although many businesses take steps to combat online fraud, existing approaches can have severe limitations. First, many existing methods aren’t sophisticated or flexible enough to detect the whole spectrum of fraudulent or suspicious online behaviors. Second, fraudsters can evolve and adapt to deceive simple rule-based or feature-based methods. For instance, fraudsters can create multiple coordinated accounts to avoid triggering limits on individual accounts.

However, if we construct a full interaction graph encompassing not only single transaction data but also account information, historical activities, and more, it’s more difficult for fraudsters to conceal their behavior. For example, accounts that are often connected to other fraud-related nodes may indicate guilt by association. We can also combine weak signals from individual nodes to derive stronger signals about that node’s activity. Fraud detection with graphs is effective because we can detect patterns such as node aggregation, which may occur when a particular user starts to connect with many other users or entities, and activity aggregation, which may occur when a large number of suspicious accounts begin to act in tandem.

In this post, we show you how to quickly deploy a financial transaction fraud detection solution with Graph Neural Networks (GNNs) using Amazon SageMaker JumpStart.

Alternatively, if you are looking for a fully managed service to build customized fraud detection models without writing code, we recommend checking out Amazon Fraud Detector. Amazon Fraud Detector enables customers with no machine learning experience to automate building fraud detection models customized for their data, leveraging more than 20 years of fraud detection expertise from Amazon Web Services (AWS) and Amazon.com.

Benefits of Graph Neural Networks

To illustrate why a Graph Neural Network is a great fit for online transaction fraud detection, let’s look at the following example heterogeneous graph constructed from a sample dataset of typical financial transaction data.

An example heterogeneous graph

A heterogeneous graph contains different types of nodes and edges, which in turn tend to have different types of attributes that are designed to capture characteristics of each node and edge type.

The sample dataset contains not only features of each transaction, such as the purchased product type and the transaction amount, but also multiple identity columns that can be used to identify relations between transactions. That information can be used to construct the graph input for Relational Graph Convolutional Networks (R-GCNs). In the preceding example graph, the node types correspond to categorical columns in the sample dataset such as card number, card type, and email domain.
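To make the graph structure concrete, the following is a minimal sketch of how such a heterogeneous graph can be assembled with the Deep Graph Library (DGL) from bipartite edge lists that link each transaction to the entities it references. The node type names, relation names, and IDs are hypothetical and only illustrate the shape of the data structure; the solution's preprocessing script builds the real edge lists for you.

    import dgl

    # Hypothetical edge lists: transaction i is connected to the card and
    # email-domain nodes it references (IDs index into each node type)
    graph_data = {
        ("transaction", "uses_card", "card_no"): ([0, 1, 2], [0, 0, 1]),
        ("card_no", "card_used_by", "transaction"): ([0, 0, 1], [0, 1, 2]),
        ("transaction", "has_domain", "email_domain"): ([0, 1, 2], [0, 1, 1]),
        ("email_domain", "domain_of", "transaction"): ([0, 1, 1], [0, 1, 2]),
    }
    g = dgl.heterograph(graph_data)
    print(g)  # prints the node types, edge types, and their counts

Because transactions that share a card or email domain become connected through these relation types, suspicious activity on one node can propagate signal to its neighbors.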

GNNs use all of this constructed information to learn a hidden representation (embedding) for each transaction, which is then fed into a linear classification layer to determine whether the transaction is fraudulent.

The solution shown in this post uses Amazon SageMaker and the Deep Graph Library (DGL) to construct a heterogeneous graph from tabular data and train an R-GCNs model to identify fraudulent transactions.

Solution overview

At a high level, the solution trains a graph neural network that accepts an interaction graph, as well as some features about the users, in order to classify those users as potentially fraudulent or not. This approach ensures that we detect signals present in the user attributes or features, as well as in the connectivity structure and interaction behavior of the users.

This solution employs the following algorithms:

  • R-GCNs, a state-of-the-art GNN model for heterogeneous graph input
  • SageMaker XGBoost, which we use as the baseline model to compare performances

By default, this solution uses synthetic datasets that are created to mimic typical examples of financial transactions datasets. We demonstrate how to use your own labeled dataset later in this post.

The outputs of the solution are as follows:

  • An R-GCNs model trained on the input datasets.
  • An XGBoost model trained on the input datasets.
  • Predictions of the probability for each transaction being fraudulent. If the estimated probability of a transaction is over a threshold, it’s classified as fraudulent.
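As a minimal illustration of that last output, converting predicted probabilities into fraud labels is a simple thresholding step; the probability values and the 0.5 cutoff below are placeholders only.

    import numpy as np

    probs = np.array([0.02, 0.91, 0.45, 0.78])  # hypothetical predicted fraud probabilities
    threshold = 0.5                             # placeholder decision threshold
    is_fraud = probs > threshold                # array([False,  True, False,  True])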

In this solution, we focus on the SageMaker components, which include two main parts:

The following diagram illustrates the solution architecture.

Architecture diagram

Prerequisites

To try out the solution in your own account, make sure that you have the following in place:

When the Studio instance is ready, you can launch Studio and access JumpStart. JumpStart solutions are not available in SageMaker notebook instances, and you can’t access them through SageMaker APIs or the AWS Command Line Interface (AWS CLI).

Launch the solution

To launch the solution, complete the following steps:

  1. Open JumpStart by using the JumpStart launcher in the Get Started section or by choosing the JumpStart icon in the left sidebar.
  2. Under Solutions, choose Fraud Detection in Financial Transactions to open the solution in another Studio tab.
    Launch the solution in SageMaker JumpStart
  3. In the solution tab, choose Launch to launch the solution.
    Launch the solution
    The solution resources are provisioned and another tab opens showing the deployment progress. When the deployment is finished, an Open Notebook button appears.
  4. Choose Open Notebook to open the solution notebook in Studio.
    Open notebook

Explore the default dataset

The default dataset used in this solution is a synthetic dataset created to mimic the typical financial transaction datasets that many companies have. The dataset consists of two tables:

  • Transactions – Records transactions and metadata about transactions between two users. Examples of columns include the product code for the transaction, features of the card used for the transaction, and a column indicating whether the corresponding transaction is fraudulent.
  • Identity – Contains identity information about the users performing transactions. Examples of columns include the device type and device IDs used.

The two tables can be joined together using the unique identifier column TransactionID. The following screenshot shows the first five observations of the Transactions dataset.

Sample dataset transactions table

The following screenshot shows the first five observations of the Identity dataset.

Sample dataset identity table

The following screenshot shows the joined dataset.

Sample dataset joined table
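If you want to reproduce this join yourself, it amounts to a standard merge on TransactionID. The file names below are placeholders for wherever the two tables are stored.

    import pandas as pd

    transactions = pd.read_csv("transactions.csv")  # placeholder path
    identity = pd.read_csv("identity.csv")          # placeholder path

    # A left join keeps every transaction, including those without identity records
    joined = transactions.merge(identity, on="TransactionID", how="left")
    print(joined.head())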

Besides the unique identifier column (TransactionID) to identify each transaction, there are two types of predicting columns and one target column:

  • Identity columns – These contain identity information related to a transaction, including card_no, card_type, email_domain, IpAddress, PhoneNo, and DeviceID
  • Categorical or numerical columns – These describe the features of each transaction, including ProductCD and TransactionAmt
  • Target column – The isFraud column indicates whether the corresponding transaction is fraudulent or not

The goal is to fully utilize the information in the predicting columns to classify each transaction (each row in the table) as either fraudulent or not.

Upload the raw data to Amazon S3

The solution notebook contains code that downloads the default synthetic datasets and uploads them to the input Amazon Simple Storage Service (Amazon S3) bucket provisioned by the solution.

To use your own labeled datasets, before running the code cell in the Upload raw data to S3 notebook section, edit the value of the variable raw_data_location so that it points to the location of your own input data.

Code lines that specify data location
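For example, if your labeled data already resides in your own S3 bucket, the change is a single assignment; the bucket and prefix below are placeholders.

    # Point the solution at your own input data instead of the default synthetic dataset
    raw_data_location = "s3://<your-bucket>/<your-prefix>/raw-data"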

In the Data Visualization notebook section, you can run the code cells to visualize and explore the input data as tables.

If you’re using your own datasets with different data file names or table columns, remember to also update the data exploration code accordingly.

Data preprocessing and feature engineering

The solution provides a data preprocessing and feature engineering Python script data-preprocessing/graph_data_preprocessor.py. This script serves as a general processing framework to convert a relational table to heterogeneous graph edge lists based on the column types of the relational table. Some of the data transformation and feature engineering techniques include:

  • Performing numerical encoding for categorical variables and a logarithmic transformation for the transaction amount (see the sketch after this list)
  • Constructing graph edge lists between transactions and other entities for the various relation types
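The following sketch illustrates both steps for the transaction table. The column names match the default dataset, but the exact logic lives in graph_data_preprocessor.py and may differ in detail.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("transactions.csv")  # placeholder path

    # Logarithmic transformation of the transaction amount to reduce skew
    df["TransactionAmt"] = np.log1p(df["TransactionAmt"])

    # Numerical encoding of a categorical variable such as the product code
    df["ProductCD"] = df["ProductCD"].astype("category").cat.codes

    # Edge list between transactions and one entity type (card numbers):
    # each non-null row becomes an edge (TransactionID, card_no)
    card_edges = df[["TransactionID", "card_no"]].dropna()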

All the columns in the relational table are classified into one of the following three types for data transformation:

  • Identity columns – Columns that contain identity information related to a user or a transaction, such as IP address, phone number, and device identifiers. These column types become node types in the heterogeneous graph, and the entries in these columns become the nodes.
  • Categorical columns – Columns that correspond to categorical features such as a user’s age group, or whether or not a provided address matches an address on file. The entries in these columns undergo numerical feature transformation and are used as node attributes in the heterogeneous graph.
  • Numerical columns – Columns that correspond to numerical features such as how many times a user has tried a transaction. The entries here are also used as node attributes in the heterogeneous graph. The script assumes that all columns in the tables that aren’t identity columns or categorical columns are numerical columns.

The names of the identity columns and categorical columns need to be provided as command line arguments when running the Python script (--id-cols for identity column names and --cat-cols for category column names).

Code lines that specify Python script command line arguments
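As an illustration, the arguments passed to the preprocessing script for the default dataset might look like the following. The column names match the default dataset, and the exact argument formatting is an assumption; check the notebook cell for the authoritative values.

    # Hypothetical argument list handed to the job that runs
    # data-preprocessing/graph_data_preprocessor.py
    script_args = [
        "--id-cols", "card_no,card_type,email_domain,IpAddress,PhoneNo,DeviceID",
        "--cat-cols", "ProductCD",
    ]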

If you’re using your own data and your data is in the same format as the default synthetic dataset but with different column names, you simply need to adapt the Python arguments in the notebook code cell according to your dataset’s column names. However, if your data is in a different format, you need to modify the following section in the data-preprocessing/graph_data_preprocessor.py Python script.

Code lines that handle data format

We divide the dataset into training (70% of the entire data), validation (20%), and test datasets (10%). The validation dataset is used for hyperparameter optimization (HPO) to select the optimal set of hyperparameters. The test dataset is used for the final evaluation to compare various models. If you need to adjust these ratios, you can use the command line arguments --train-data-ratio and --valid-data-ratio when running the preprocessing Python script.

When the preprocessing job is complete, we have a set of bipartite edge lists between transactions and different device ID types (assuming we’re using the default dataset), as well as the features, labels, and a set of transactions to validate our graph model performance. You can find the transformed data in the S3 bucket created by the solution, under the dgl-fraud-detection/preprocessed-data folder.

Preprocessed data in S3

Train an XGBoost baseline model with HPO

Before diving into training a graph neural network with the DGL, we first train an XGBoost model with HPO as the baseline on the transaction table data.

  1. Read the data from features_xgboost.csv and upload it to Amazon S3 for training the baseline model. This CSV file was generated by the data preprocessing and feature engineering job in the previous step. Only the categorical columns ProductCD and card_type, and the numerical column TransactionAmt, are included.
  2. Create an XGBoost estimator with the SageMaker XGBoost algorithm container.
  3. Create and fit an XGBoost estimator with HPO:
    1. Specify dynamic hyperparameters we want to tune and their searching ranges.
    2. Define optimization objectives in terms of metrics and objective type.
    3. Create hyperparameter tuning jobs to train the model.
  4. Deploy the endpoint of the best tuning job and make predictions with the baseline model.
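A condensed sketch of steps 2–4 follows. It uses standard SageMaker Python SDK constructs, but the hyperparameter ranges, objective metric, instance types, and S3 paths are illustrative placeholders rather than the solution notebook's exact values.

    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput
    from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()

    # SageMaker-managed XGBoost algorithm container
    container = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

    xgb = Estimator(
        image_uri=container,
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://<your-bucket>/xgboost-output",                     # placeholder
        hyperparameters={"objective": "binary:logistic", "num_round": 100},  # illustrative
    )

    # Dynamic hyperparameters to tune and their search ranges (illustrative)
    ranges = {
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    }

    tuner = HyperparameterTuner(
        estimator=xgb,
        objective_metric_name="validation:auc",
        objective_type="Maximize",
        hyperparameter_ranges=ranges,
        max_jobs=10,
        max_parallel_jobs=2,
    )

    tuner.fit({
        "train": TrainingInput("s3://<your-bucket>/train/", content_type="text/csv"),
        "validation": TrainingInput("s3://<your-bucket>/validation/", content_type="text/csv"),
    })

    # Deploy the best model from the tuning job and make predictions against it
    predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")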

Train the Graph Neural Network using the DGL with HPO

Graph Neural Networks work by learning representations for the nodes or edges of a graph that are well suited to a downstream task. We can model the fraud detection problem as a node classification task, where the goal of the GNN is to learn how to use information from the topology of the sub-graph around each transaction node to transform the node’s features into a representation space where the node can be easily classified as fraudulent or not.

Specifically, we use a Relational Graph Convolutional Networks (R-GCNs) model on a heterogeneous graph because we have nodes and edges of different types.
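Conceptually, an R-GCN layer applies a separate graph convolution for each relation type and then aggregates the per-relation messages at every node. The following is a minimal sketch using DGL's HeteroGraphConv, written with PyTorch syntax for brevity; it is not the solution's actual model, which is implemented with MXNet in the dgl_fraud_detection package. The sketch assumes every node type carries an in_feats-dimensional feature vector (entity nodes can use learnable embeddings).

    import torch
    import torch.nn as nn
    import dgl.nn as dglnn

    class SimpleRGCN(nn.Module):
        def __init__(self, in_feats, hidden_feats, n_classes, rel_names):
            super().__init__()
            # One GraphConv per relation type; per-relation messages are summed at each node
            self.conv1 = dglnn.HeteroGraphConv(
                {rel: dglnn.GraphConv(in_feats, hidden_feats) for rel in rel_names},
                aggregate="sum",
            )
            self.conv2 = dglnn.HeteroGraphConv(
                {rel: dglnn.GraphConv(hidden_feats, hidden_feats) for rel in rel_names},
                aggregate="sum",
            )
            # Linear classification layer on top of the transaction embeddings
            self.classify = nn.Linear(hidden_feats, n_classes)

        def forward(self, graph, node_feats):
            h = self.conv1(graph, node_feats)
            h = {ntype: torch.relu(feat) for ntype, feat in h.items()}
            h = self.conv2(graph, h)
            # Classify only the transaction nodes as fraudulent or not
            return self.classify(h["transaction"])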

  1. Define hyperparameters to determine properties such as the class of GNN models, the network architecture, the optimizer, and optimization parameters.
  2. Create and train the R-GCNs model.

For this post, we use the DGL, with MXNet as the backend deep learning framework. We create a SageMaker MXNet estimator and pass in our model training script (sagemaker_graph_fraud_detection/dgl_fraud_detection/train_dgl_mxnet_entry_point.py), the hyperparameters, and the number and type of training instances we want to use. When the training is complete, the trained model and the prediction results on the test data are uploaded to Amazon S3.
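A sketch of that estimator setup follows. The framework version, instance type, channel name, and hyperparameter names are placeholders; the authoritative values are set in the solution notebook and the training script itself.

    import sagemaker
    from sagemaker.mxnet import MXNet

    role = sagemaker.get_execution_role()

    estimator = MXNet(
        entry_point="train_dgl_mxnet_entry_point.py",
        source_dir="sagemaker_graph_fraud_detection/dgl_fraud_detection",
        role=role,
        instance_count=1,
        instance_type="ml.g4dn.xlarge",   # placeholder instance type
        framework_version="1.9.0",        # placeholder MXNet version
        py_version="py38",
        hyperparameters={                 # illustrative hyperparameter names
            "n-layers": 2,
            "n-hidden": 16,
            "n-epochs": 100,
            "lr": 0.01,
        },
        output_path="s3://<your-bucket>/dgl-output",  # placeholder
    )

    estimator.fit({"train": "s3://<your-bucket>/dgl-fraud-detection/preprocessed-data/"})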

  1. Optionally, you can inspect the prediction results and compare the model metrics with the baseline XGBoost model.
  2. Create and fit a SageMaker estimator using the DGL with HPO:
    1. Specify dynamic hyperparameters we want to tune and their searching ranges.
    2. Define optimization objectives in terms of metrics and objective type.
    3. Create hyperparameter tuning jobs to train the model.
  3. Read the prediction output for the test dataset from the best tuning job.

Clean up

When you’re finished with this solution, make sure that you delete all unwanted AWS resources to avoid incurring unintended charges. In the Delete solution section on your solution tab, choose Delete all resources to delete resources automatically created when launching this solution.

Clean up in SageMaker JumpStart

Alternatively, you can use AWS CloudFormation to delete all standard resources automatically created by the solution and notebook. To use this approach, on the AWS CloudFormation console, find the CloudFormation stack whose description contains sagemaker-graph-fraud-detection, and delete it. This is a parent stack; deleting this stack automatically deletes the nested stacks.

Clean up from AWS CloudFormation

With either approach, you still need to manually delete any extra resources that you may have created in this notebook. Some examples include extra S3 buckets (in addition to the solution’s default bucket), extra SageMaker endpoints (using a custom name), and extra Amazon Elastic Container Registry (Amazon ECR) repositories.

Conclusion

In this post, we discussed the business problem caused by online transaction fraud, the issues in traditional fraud detection approaches, and why a GNN is a good fit for solving this business problem. We showed you how to build an end-to-end solution for detecting fraud in financial transactions using a GNN with SageMaker and a JumpStart solution. We also showed how to use your own dataset, how to use SageMaker algorithm containers, and how to use HPO to automate hyperparameter tuning and select the best tuning job for making predictions.

To learn more about this JumpStart solution, check out the solution’s GitHub repository.


About the Authors

Xiaoli Shen is a Solutions Architect and Machine Learning Technical Field Community (TFC) member at Amazon Web Services. She’s focused on helping customers architect on the cloud and leverage AWS services to derive business value. Prior to joining AWS, she was a senior full-stack engineer building large-scale data-intensive distributed systems on the cloud. Outside of work she’s passionate about volunteering in technical communities, traveling the world, and making music.

Dr. Xin Huang is an applied scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering.

Vedant Jain is a Sr. AI/ML Specialist Solutions Architect, helping customers derive value out of the Machine Learning ecosystem at AWS. Prior to joining AWS, Vedant held ML/Data Science Specialty positions at various companies such as Databricks, Hortonworks (now Cloudera) & JP Morgan Chase. Outside of his work, Vedant is passionate about making music, using Science to lead a meaningful life & exploring delicious vegetarian cuisine from around the world.