AWS Database Blog
Build fraud detection systems using AWS Entity Resolution and Amazon Neptune Analytics
Financial institutions such as banks, payment processors, and online merchants face significant challenges in detecting and preventing fraud and financial crimes. Entity resolution and graph algorithms can be combined to support fraud detection use cases such as Card Not Present (CNP) fraud detection. A CNP transaction occurs when a credit or debit card payment is processed without the physical card being presented to the merchant, typically during online, telephone, or mail-order purchases. These transactions carry higher fraud risks because merchants can’t physically verify the card or the cardholder’s identity, making them particularly vulnerable to fraudulent usage.
Entity resolution services such as AWS Entity Resolution identify links between entities using shared attributes. Amazon Neptune Analytics, a memory-optimized graph database engine for analytics, enhances CNP fraud detection by enabling graph analysis of complex relationships between customers, transactions, and fraud patterns. When entities are resolved and matched, they create connections that can be stored and queried using graph database structures. Furthermore, graph databases’ built-in support for graph algorithms including community detection enables efficient exploration of entity networks, making it straightforward to discover hidden patterns and indirect connections between resolved entities. This combined approach facilitates fraud detection by quickly traversing relationships and identifying unusual patterns.
In this post, we show how you can use graph algorithms to analyze the results of AWS Entity Resolution and related transactions for the CNP use case. We use several AWS services, including Neptune Analytics, AWS Entity Resolution, Amazon SageMaker notebooks, and Amazon Simple Storage Service (Amazon S3).
Solution overview
AWS Entity Resolution ingests customer data from various sources, standardizing and matching records to create a single view of the customer with a persistent identifier. The persistent customer identifier, customer attributes, and transactions are then loaded into Neptune Analytics as vertices, and relationships between each entity form the edges of the graph. Amazon SageMaker AI notebooks hosted in the Amazon Neptune Workbench provide the environment for investigators to assess the data. For more details, see Accessing the graph.
The following diagram illustrates the solution architecture.
The workflow consists of the following steps:
- Source customer data flows into AWS Entity Resolution for matching and standardization.
- Resolved entities are placed into the output S3 bucket.
- Entities and transactions are transformed in Neptune Workbench and loaded into Neptune Analytics as graph elements. Users then run queries in Neptune Workbench to modify the graph in memory and perform graph analytics.
Data model
The graph data model consists of several node types and edge relationships designed to represent customer and transaction data. The nodes include Group (containing persistent identifiers from AWS Entity Resolution), Email, Customer (with source system customer information like name and date of birth), Credit Card Account, Address, and Phone. These nodes are connected through relationships such as HAS_CUSTOMER (linking Group to Customer nodes with confidence scores), HAS_ACCOUNT (linking Customer to Credit Card Account nodes), HAS_PHONE, HAS_ADDRESS, and HAS_EMAIL. The model is further enhanced with transaction-related nodes like CnpCreditCardTxInit, CreditCardTx, and CnpInitFail, which are connected through relationships that track the flow of CNP transactions and their outcomes.

The following diagram illustrates the graph data model.
The following table lists the nodes and edges in more detail.
| Type | Label | Description | Properties |
| --- | --- | --- | --- |
| Node | Group | The persistent identifier generated by AWS Entity Resolution for resolved entities | MatchId |
| Node | Email | Contains the email address identifiers for the entities ingested into AWS Entity Resolution | Email |
| Node | Customer | Contains the financial institution’s customer identifier and relevant customer information such as name fields and date of birth | |
| Node | CreditCardAccount | Contains details regarding the customer’s linked account identifiers | AccountNumber |
| Node | Address | Contains the input address fields that were ingested by AWS Entity Resolution | Address |
| Node | Phone | Contains the customer’s known phone numbers that were provided during onboarding | PhoneNumber |
| Node | CnpCreditCardTxInit | Captures when a transaction was initiated without a physical card, such as online shopping | |
| Node | CreditCardTx | Represents a successful transaction without a physical card | |
| Node | CnpInitFail | Captures the failure reason for why a card-not-present transaction was rejected | |
| Edge | HAS_CNP_TX_INIT | An edge from the CreditCardAccount node to the CnpCreditCardTxInit node | |
| Edge | HAS_CNP_TX_INIT | An edge from the CnpCreditCardTxInit node to the CreditCardTx node | |
| Edge | HAS_FAIL | An edge from the CnpCreditCardTxInit node to the CnpInitFail node | |
| Edge | HAS_CUSTOMER | A relationship from the Group nodes to the associated Customer nodes; contains the confidence scores generated by the AWS Entity Resolution machine learning workflow | ConfidenceScore |
| Edge | HAS_ACCOUNT | A relationship from the Customer nodes to the customer’s linked accounts | |
| Edge | HAS_PHONE | A relationship from the Customer nodes to the customer’s known phone numbers | |
| Edge | HAS_ADDRESS | A relationship from the Customer nodes to the customer’s known addresses | |
| Edge | HAS_EMAIL | A relationship from the Customer nodes to the customer’s known email addresses | |
Prerequisites
You will incur costs on your account for the AWS services used in the example. Although AWS offers the AWS Free Tier for some services such as SageMaker AI, review the pricing pages for AWS Entity Resolution and Amazon Neptune Analytics before proceeding. For help with estimating costs, refer to the AWS Pricing Calculator.
To follow along with this post, you must have the following resources:
- An AWS account
- AWS Command Line Interface (AWS CLI) installed
- An S3 bucket for intermediate data
- An AWS Glue crawler and AWS Glue table to retrieve the schema for AWS Entity Resolution
- One or more AWS Identity and Access Management (IAM) roles with permissions to run the examples
- A SageMaker execution role with permissions to read and write to your S3 bucket
- A Neptune notebook configured for Neptune Analytics
Prepare AWS Entity Resolution ML workflow
AWS Entity Resolution provides multiple workflow options to identify potential duplicate entities. Machine learning (ML) workflows use an AWS-provided ML model that can handle variations in matching fields such as Name, Address, Email Address, Phone, and Date of Birth. The workflow output includes a confidence score, which indicates how likely it is that the entities within the same match group are duplicates, based on the trained model. Alternatively, you can use rule-based workflows to configure matching logic based on business rules.
Select a dataset such as the Freely Extensible Biomedical Record Linkage (FEBRL) dataset, or create an example dataset with at least three of the five matching fields that the ML workflow can use. For this post, we used the Faker Python library to create mock matching fields (address, date_of_birth, email, firstname, lastname, full_name, and middle) to perform entity resolution matching.
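The following is a minimal sketch of that generation step using Faker. The record count, duplicate rate, and output file name are illustrative; adjust them for your own tests.

```python
import csv
import random

from faker import Faker

fake = Faker()
records = []
for i in range(1000):
    first, middle, last = fake.first_name(), fake.first_name(), fake.last_name()
    record = {
        "id": f"cust-{i:05d}",
        "firstname": first,
        "middle": middle,
        "lastname": last,
        "full_name": f"{first} {middle} {last}",
        "address": fake.address().replace("\n", ", "),
        "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
        "email": fake.email(),
    }
    records.append(record)
    # Occasionally add a near-duplicate (same name, address, and date of birth but a
    # different email) so the ML workflow has matches to find.
    if random.random() < 0.2:
        records.append(dict(record, id=f"cust-{i:05d}-dup", email=fake.email()))

with open("mock_customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
```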
After you have created the dataset, loaded it into an S3 bucket where AWS Entity Resolution workflows have read permissions, and crawled the data using AWS Glue, you must define a schema mapping. The schema mapping informs AWS Entity Resolution workflows how to interpret the source fields for matching workflows.
The following screenshot illustrates an example schema mapping.
After you create the schema, use the AWS CLI to create and run an ML workflow. Refer to Creating a machine learning-based matching workflow for steps to create and execute an ML workflow using the AWS Management Console.
First, create a JSON file named ml-workflow.json in the current directory and add the following contents (replace the placeholders with your own values):
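The following is a sketch of what ml-workflow.json can look like, based on the CreateMatchingWorkflow input shape. The AWS Glue table ARN, schema mapping name, output bucket, output attribute names, and IAM role ARN are placeholders you must replace to match your environment.

```json
{
    "workflowName": "<your ml workflow name>",
    "inputSourceConfig": [
        {
            "inputSourceARN": "arn:aws:glue:<region>:<account-id>:table/<glue-database>/<glue-table>",
            "schemaName": "<your schema mapping name>"
        }
    ],
    "outputSourceConfig": [
        {
            "outputS3Path": "s3://<your output bucket>/entity-resolution-output/",
            "output": [
                {"name": "id"},
                {"name": "full_name"},
                {"name": "address"},
                {"name": "email"},
                {"name": "date_of_birth"}
            ]
        }
    ],
    "resolutionTechniques": {
        "resolutionType": "ML_MATCHING"
    },
    "roleArn": "arn:aws:iam::<account-id>:role/<entity-resolution-role>"
}
```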
After you create the JSON file, run the following command to create the workflow:
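For example, using the file you just created (the --cli-input-json option passes the full request as JSON):

```bash
aws entityresolution create-matching-workflow \
    --cli-input-json file://ml-workflow.json \
    --region <region>
```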
Run the following command to execute the workflow:

    aws entityresolution start-matching-job --region <region> --workflow-name <your ml workflow name>
Transform data output
When the ML workflow is complete, the output can be transformed into a valid Neptune data format and ingested into Neptune Analytics. The following code snippets show examples of how to transform the AWS Entity Resolution output into the openCypher bulk load format.
Use the following code in your Neptune Notebook to create nodes from the output of AWS Entity Resolution. Replace the placeholders with your own values:
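The exact output columns depend on your schema mapping; the snippet below is a minimal sketch that assumes the mock field names generated earlier and a MatchID column in the AWS Entity Resolution output, and it uses pandas to write node files with the openCypher bulk load headers (:ID, :LABEL, and typed property columns).

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Read the AWS Entity Resolution ML workflow output (CSV) from Amazon S3.
# Replace the bucket and key with the location written by your matching job.
obj = s3.get_object(Bucket="<your output bucket>", Key="<entity-resolution-output-file>.csv")
er_output = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Group nodes: one node per persistent MatchID, in the openCypher bulk load format.
# Prefix the node ID so it cannot collide with customer IDs.
group_nodes = er_output[["MatchID"]].drop_duplicates().rename(columns={"MatchID": "matchId:String"})
group_nodes.insert(0, ":ID", "group-" + group_nodes["matchId:String"].astype(str))
group_nodes[":LABEL"] = "Group"
group_nodes.to_csv("group_nodes.csv", index=False)

# Customer nodes: one node per source record, carrying the source attributes as properties.
customer_nodes = er_output.rename(columns={
    "id": ":ID",
    "firstname": "firstName:String",
    "lastname": "lastName:String",
    "date_of_birth": "dateOfBirth:String",
})[[":ID", "firstName:String", "lastName:String", "dateOfBirth:String"]]
customer_nodes[":LABEL"] = "Customer"
customer_nodes.to_csv("customer_nodes.csv", index=False)

# Repeat the same pattern for the Email, Phone, Address, and CreditCardAccount nodes.
```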
Use the following code to create edges from the output of AWS Entity Resolution:
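Continuing in the same notebook, the following sketch builds the HAS_CUSTOMER edges (Group to Customer) in the openCypher bulk load edge format. The ConfidenceLevel column name is an assumption; check it against the columns in your workflow output.

```python
# Edges from Group nodes to Customer nodes, carrying the match confidence score.
has_customer_edges = pd.DataFrame({
    ":START_ID": "group-" + er_output["MatchID"].astype(str),
    ":END_ID": er_output["id"],
    ":TYPE": "HAS_CUSTOMER",
    "confidenceScore:Double": er_output["ConfidenceLevel"],
})
has_customer_edges.to_csv("has_customer_edges.csv", index=False)

# Repeat the same pattern for HAS_ACCOUNT, HAS_EMAIL, HAS_PHONE, and HAS_ADDRESS edges.
```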
Load additional datasets
We also want to supplement the mock customer data with some generated transactions that simulate a customer transaction workflow. Use the following code to load the data into an S3 bucket where the Neptune Loader has permissions to read from (replace the placeholders with your own values):
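The following is a sketch of how the supplemental transactions might be generated and staged. The account IDs, volumes, and failure codes are illustrative and should line up with the CreditCardAccount nodes you created above.

```python
import csv
import random

import boto3

random.seed(7)

# Hypothetical transaction generator: for each credit card account, create a few
# card-not-present initiation events that either succeed (CreditCardTx) or fail
# (CnpInitFail) with a reason code of 1, 2, or 3.
account_ids = [f"acct-{i:04d}" for i in range(100)]
nodes, edges = [], []
for acct in account_ids:
    for j in range(random.randint(1, 5)):
        init_id = f"init-{acct}-{j}"
        nodes.append({":ID": init_id, ":LABEL": "CnpCreditCardTxInit"})
        edges.append({":START_ID": acct, ":END_ID": init_id, ":TYPE": "HAS_CNP_TX_INIT"})
        if random.random() < 0.7:
            nodes.append({":ID": f"tx-{acct}-{j}", ":LABEL": "CreditCardTx"})
            edges.append({":START_ID": init_id, ":END_ID": f"tx-{acct}-{j}", ":TYPE": "HAS_CNP_TX_INIT"})
        else:
            nodes.append({":ID": f"fail-{acct}-{j}", ":LABEL": "CnpInitFail",
                          "reasonCode:Int": random.choice([1, 2, 3])})
            edges.append({":START_ID": init_id, ":END_ID": f"fail-{acct}-{j}", ":TYPE": "HAS_FAIL"})

def write_csv(filename, rows):
    # Use the union of keys so optional property columns (such as reasonCode) are kept.
    fieldnames = sorted({key for row in rows for key in row})
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

write_csv("cnp_transaction_nodes.csv", nodes)
write_csv("cnp_transaction_edges.csv", edges)

# Stage the node and edge files under a prefix the Neptune Analytics graph can read.
s3 = boto3.client("s3")
bucket = "<your load bucket>"
for local_file in ["group_nodes.csv", "customer_nodes.csv", "has_customer_edges.csv",
                   "cnp_transaction_nodes.csv", "cnp_transaction_edges.csv"]:
    s3.upload_file(local_file, bucket, f"neptune-load/{local_file}")
```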
Neptune Analytics supports multiple mechanisms to load data into the in-memory graph, including loading from an existing Neptune cluster and from Amazon S3. For this example, we read data from Amazon S3 using a batch load, where the caller’s IAM role has the appropriate permissions:
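In a Neptune notebook cell, the batch load can be invoked with the neptune.load procedure. The parameters shown here (source, format, region, failOnError) are the commonly used ones; check the Neptune Analytics documentation for the full set, and make sure the graph's IAM role can read the S3 prefix.

```
%%oc
CALL neptune.load({
  source: "s3://<your load bucket>/neptune-load/",
  format: "opencypher",
  region: "<region>",
  failOnError: true
})
```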
Analyze output with Neptune Analytics
Using Neptune Analytics through Neptune Workbench, you can run powerful graph algorithms like Louvain and weakly connected components to identify clusters of potentially fraudulent activities and analyze relationships between resolved entities. For example, you can quickly identify clusters of CNP transaction failures, analyze the number of shared personally identifiable information (PII) elements between different entities, and assess risk based on the number of known bad actors in the graph, making it an effective tool for detecting sophisticated fraud patterns.
The Louvain algorithm is a community detection algorithm that helps identify clusters or communities within a graph by optimizing modularity, which measures the density of connections within communities compared to connections between communities. In Neptune Analytics, the Louvain algorithm can support the discovery of natural groupings in data, such as finding customer segments or detecting fraudulent clusters of accounts that are working together.
The queries in this post illustrate the use of Louvain and weakly connected components, though results will vary based on your specific dataset characteristics. In Neptune Analytics, the CALL keyword invokes the algorithms, and the mutate variants write computed results back to the graph as new properties on nodes or edges. We use the Neptune notebook’s visualization tools to showcase query results. More advanced visualization tools are also available, such as graph-explorer, an open source tool that you can use to browse graph data without writing queries.
First, let’s find clusters within the graph for transactions associated with CnpInitFail. We want to persist these cluster assignments in a CNPFailCommunity property:
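A sketch of this step: the mutate variant of Louvain writes the community ID back to the graph. The edgeLabels filter shown restricts clustering to the transaction-related relationships and is an assumption; confirm the exact parameter names supported by the algorithm in the Neptune Analytics documentation.

```
%%oc
CALL neptune.algo.louvain.mutate({
  writeProperty: "CNPFailCommunity",
  edgeLabels: ["HAS_ACCOUNT", "HAS_CNP_TX_INIT", "HAS_FAIL"]
})
```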
In addition to discovering clusters associated with CnpInitFail, we want to analyze the AWS Entity Resolution results. Although AWS Entity Resolution has already created match groups, we can use the weakly connected components algorithm to generate additional clusters in which resolved customers might share at least one matching attribute:
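A similar sketch for weakly connected components, again assuming an edgeLabels filter so that only identity-related relationships (not transaction edges) determine the components:

```
%%oc
CALL neptune.algo.wcc.mutate({
  writeProperty: "wccCommunity",
  edgeLabels: ["HAS_CUSTOMER", "HAS_EMAIL", "HAS_PHONE", "HAS_ADDRESS"]
})
```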
Now that our clusters have been generated, let’s find the largest cluster of transactions generated by Louvain:
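For example, assuming the CNPFailCommunity property written above:

```
%%oc
MATCH (t:CreditCardTx)
RETURN t.CNPFailCommunity AS community, count(t) AS txCount
ORDER BY txCount DESC
LIMIT 1
```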
We can take this largest cluster of transactions and retrieve the AWS Entity Resolution attributes that belong to the same weakly connected components cluster:
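One way to express that join, using a placeholder for the community returned by the previous query:

```
%%oc
// Replace <communityId> with the community returned by the previous query.
MATCH (grp:Group)-[:HAS_CUSTOMER]->(cust:Customer)-[:HAS_ACCOUNT]->(:CreditCardAccount)
      -[:HAS_CNP_TX_INIT]->(:CnpCreditCardTxInit)-[:HAS_CNP_TX_INIT]->(tx:CreditCardTx)
WHERE tx.CNPFailCommunity = <communityId>
MATCH (cust)-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]->(attr)
RETURN cust.wccCommunity AS wccCluster, grp.matchId AS matchId,
       collect(DISTINCT attr) AS attributes
ORDER BY wccCluster
```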
Let’s drill down further into CnpInitFail, where we select a failure code to assess risk. Assume that only three failure codes (1, 2, and 3) are generated in the preceding transaction code, where failure code 3 is the riskiest. We want to see whether multiple AWS Entity Resolution resolved entities are associated with these failures:
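A sketch of this query; the reasonCode property name and the value 3 follow the hypothetical failure codes generated earlier:

```
%%oc
MATCH (grp:Group)-[:HAS_CUSTOMER]->(cust:Customer)-[:HAS_ACCOUNT]->(:CreditCardAccount)
      -[:HAS_CNP_TX_INIT]->(:CnpCreditCardTxInit)-[:HAS_FAIL]->(fail:CnpInitFail)
WHERE fail.reasonCode = 3
RETURN cust.wccCommunity AS cluster,
       count(DISTINCT grp) AS resolvedEntities,
       collect(DISTINCT grp.matchId) AS matchIds
ORDER BY resolvedEntities DESC
```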
Given the group with the largest number of resolved entities (distinct AWS Entity Resolution match IDs), we want to count the shared PII elements to determine whether these distinct groups represent two separate bad actors or a single bad actor:
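A sketch of this comparison; replace the match IDs with the values returned by the previous query:

```
%%oc
MATCH (g1:Group)-[:HAS_CUSTOMER]->(c1:Customer)
      -[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]->(pii)
      <-[:HAS_EMAIL|HAS_PHONE|HAS_ADDRESS]-(c2:Customer)<-[:HAS_CUSTOMER]-(g2:Group)
WHERE g1.matchId IN ["<match id 1>", "<match id 2>"]
  AND g2.matchId IN ["<match id 1>", "<match id 2>"]
  AND NOT g1 = g2   // comment out this line to also visualize the unshared attributes
RETURN g1.matchId AS groupA, g2.matchId AS groupB,
       count(DISTINCT pii) AS sharedPiiElements
```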
For visualization purposes (comment out the WHERE NOT clause), we can see two distinct Group nodes and four parties. Each group has an email address, phone, and address that are not shared, but the two groups also share the same phone number and at least one address. This suggests that the two resolved entities might be related.
We can also perform risk analytics based on the number of known bad actors in the graph. The following query analyzes groups of resolved entities by calculating the ratio of bad actors to total customers within each match group, ordered by the percentage of bad actors per cluster. This analysis helps identify which groups have the highest concentration of customers associated with high-risk CNP transaction failures (for example, reason code 3), providing investigators with a risk-based metric to prioritize their investigations.
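A sketch of this risk metric, again using the hypothetical reasonCode property and treating code 3 as the high-risk failure:

```
%%oc
MATCH (grp:Group)-[:HAS_CUSTOMER]->(cust:Customer)
OPTIONAL MATCH (cust)-[:HAS_ACCOUNT]->(:CreditCardAccount)-[:HAS_CNP_TX_INIT]->(:CnpCreditCardTxInit)
              -[:HAS_FAIL]->(fail:CnpInitFail {reasonCode: 3})
WITH grp, cust, count(fail) AS failCount
WITH grp, count(cust) AS totalCustomers,
     sum(CASE WHEN failCount > 0 THEN 1 ELSE 0 END) AS badActors
RETURN grp.matchId AS matchId, totalCustomers, badActors,
       toFloat(badActors) / totalCustomers AS badActorPercentage
ORDER BY badActorPercentage DESC
```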
For example, based on the preceding output, the first two results have a higher percentage of bad actors in the cluster (50%) than the third cluster (30%). However, there are only two total customers within groups 1 and 2, which may be an important consideration when prioritizing investigations.
Clean up
When you’re done, clean up your resources to stop incurring costs. You can do so using the AWS CLI or the respective service consoles. For instructions, refer to AWS Entity Resolution, AWS Glue Tables, AWS Glue Crawlers, Neptune Notebooks, or Neptune Analytics documentation. You can also delete S3 buckets or objects created as part of this exercise.
Conclusion
The combination of AWS Entity Resolution and Neptune Analytics provides a powerful solution for financial institutions to detect and prevent CNP fraud. By using AWS Entity Resolution to match and standardize customer records and Neptune Analytics to analyze the resulting graph, organizations can quickly identify suspicious patterns and relationships between entities. The solution in this post demonstrated how you can transform resolved entities into a graph structure, perform advanced analytics using Neptune Analytics, and visualize complex relationships between customers, accounts, and transactions. The integration with Neptune Workbench helps investigators efficiently analyze clusters of potentially fraudulent activities and assess the relationships between resolved entities.
To learn more about using AWS Entity Resolution or Neptune Analytics, contact your AWS account team or visit the AWS Entity Resolution and Amazon Neptune User Guides.





