Classify sensitive data in your environment using Amazon Macie

June 15, 2020: This blog is out of date. Please refer here for the updated info: https://aws.amazon.com/blogs/aws/new-enhanced-amazon-macie-now-available/

In this post, I’ll show you how to create a sample dataset for Amazon Macie, and how you can use Amazon Macie to implement data-centric compliance and security analytics in your Amazon S3 environment. I’ll also dive into the different kinds of credentials, document types, and PII detections supported by Macie. First, I’ll walk through creating a “getting started” sample set of artificial, generated data that you can use to test Macie capabilities and start building your own policies and alerts.

Create a realistic data sample set in S3

I’ll use amazon-macie-activity-generator, which we call “AMG” for short, a sample application developed by AWS that generates realistic content and access activity in your test account. AMG uses AWS CloudFormation, AWS Lambda, and Python’s excellent Faker library to create a data set with artificial (but realistic) data classifications and access patterns to help test some of the features and capabilities of Macie. AMG is released under Amazon Software License 1.0. We’ll accept pull requests on our GitHub repository and monitor any issues that are opened so we can try to fix bugs and consider new feature requests.

The following diagram shows a high level architecture overview of the components that will be created in your AWS account for AMG. For additional detail about these components and their relationships, review the CloudFormation setup script.

Architectural components created by the CloudFormation template

Depending on the data types specified in your JSON configuration template (details below), AMG will periodically generate artificial documents for the specified S3 target with a PutObject action. By default, the CloudFormation stack uses a configuration file that instructs AMG to create a new, private S3 bucket that can only be accessed by authorized AWS users/roles in the same account as the bucket. All the S3 objects with fake data in this bucket have a private ACL and inherit the bucket’s access control configuration. All generated objects feature the header shown in the example below. AMG supports all fake data providers offered by Faker (https://faker.readthedocs.io/en/latest/index.html), as well as a few of AMG’s own custom fake data providers requested by our customers: aws_creds, slack_creds, github_creds, facebook_creds, linux_shadow, rsa, linux_passwd, dsa, ec, pgp, cert, itin, swift_code, and cve.

# Sample Report - No identification of actual persons or places is
# intended or should be inferred

 74323 Julie Field
 Lake Joshuamouth, OR 30055-3905
 1-196-191-4438x974
 53001 Paul Union
 New John, HI 94740
 Mastercard
 Amanda Wells
 5135725008183484 09/26
 CVV: 550
 354-70-6172
 242 George Plaza
 East Lawrencefurt, VA 37287-7620
 GB73WAUS0628038988364
 587 Silva Village
 Pearsonburgh, NM 11616-7231
 LDNM1948227117807
 American Express
 Brett Garza
 347965534580275 05/20
 CID: 4758

 599.335.2742
 JCB 15 digit
 Michael Arias
 210069190253121 03/27
 CVC: 861
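To make the shape of these generated files concrete, here is a minimal sketch that builds a similar report using only the Python standard library. It is not AMG's code: AMG relies on Faker providers, whereas the helper names (`luhn_check_digit`, `luhn_valid`, `fake_card_number`, `build_report`), the fixed name list, and the "51" card prefix below are hypothetical stand-ins chosen for illustration.

```python
import random

# Header text copied from the sample report above.
HEADER = (
    "# Sample Report - No identification of actual persons or places is\n"
    "# intended or should be inferred\n"
)

# Hypothetical stand-in for a Faker name provider.
NAMES = ["Amanda Wells", "Brett Garza", "Michael Arias"]


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes partial + digit pass the Luhn checksum."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` sits next to the
    # check digit, so it is the first one to be doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Check a full card number (including its check digit) with Luhn."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def fake_card_number(rng: random.Random, prefix: str = "51", length: int = 16) -> str:
    """Build an artificial card number with a valid Luhn check digit."""
    filler = length - 1 - len(prefix)
    body = prefix + "".join(str(rng.randint(0, 9)) for _ in range(filler))
    return body + luhn_check_digit(body)


def build_report(rng: random.Random, records: int = 2) -> str:
    """Assemble a report: the header, then a few artificial card records."""
    lines = [HEADER]
    for _ in range(records):
        lines.append(" Mastercard")
        lines.append(f" {rng.choice(NAMES)}")
        expiry = f"{rng.randint(1, 12):02d}/{rng.randint(27, 35)}"
        lines.append(f" {fake_card_number(rng)} {expiry}")
        lines.append(f" CVV: {rng.randint(0, 999):03d}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_report(random.Random(0)))
```

Seeding `random.Random` keeps the output reproducible, which is convenient when you want the same test objects across runs.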

Create your amazon-macie-activity-generator CloudFormation stack

You can deploy AMG in your AWS account by using either of these methods:

Use the CloudFormation Template: https://s3.amazonaws.com/amazon-macie-activity-generator-us-east-1-fb58a9df3468/CloudFormationTemplate.yml
Or use this One-click CloudFormation launch stack.

Follow these steps:

Log in to the AWS Console in a region supported by Amazon Macie, which currently includes US East (N. Virginia) and US West (Oregon).
Select the One-click CloudFormation launch stack, or launch CloudFormation using the template above.
Read our terms, select the Acknowledgement box, and then select Create.
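If you prefer to script the launch rather than click through the console, the steps above could be approximated with boto3 as in the sketch below. The stack name, the default region, and the `CAPABILITY_IAM` acknowledgement are assumptions on my part (the template may require a different capability), and the helper names `build_create_stack_args` and `launch` are hypothetical.

```python
# Template URL as given in this post.
TEMPLATE_URL = (
    "https://s3.amazonaws.com/"
    "amazon-macie-activity-generator-us-east-1-fb58a9df3468/"
    "CloudFormationTemplate.yml"
)


def build_create_stack_args(stack_name: str) -> dict:
    """Build keyword arguments for CloudFormation's CreateStack call."""
    return {
        "StackName": stack_name,
        "TemplateURL": TEMPLATE_URL,
        # Assumption: the template creates IAM resources, which is what the
        # console's acknowledgement checkbox corresponds to.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def launch(stack_name: str = "amazon-macie-activity-generator",
           region: str = "us-east-1") -> str:
    """Create the stack and wait for it to finish; returns the stack ID."""
    # Imported lazily so the helper above can be used without boto3 installed.
    import boto3

    cfn = boto3.client("cloudformation", region_name=region)
    stack_id = cfn.create_stack(**build_create_stack_args(stack_name))["StackId"]
    # Block until creation completes (raises if the stack fails to create).
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_id
```

Calling `launch()` requires AWS credentials with permission to create the stack's resources, and the same terms you would acknowledge in the console still apply.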

Creating the data takes a few minutes, and you can periodically refresh CloudWatch to track progress.

Add the new sample data to Macie

Now, I’ll log into the Macie console and add the newly created sample data buckets for analysis by Macie.

Note: If you don’t explicitly specify a bucket for S3 targets in CloudFormation, AMG will use the S3 bucket that’s created by default for the stack, which will be printed out in the CloudFormation stack’s output.
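To read the stack's outputs without opening the console, something like the following sketch may help. It flattens the response of CloudFormation's DescribeStacks call into a dictionary. The function names are hypothetical, and because this post does not state the output key that holds the bucket name, the sketch simply returns every output.

```python
def stack_outputs(describe_stacks_response: dict) -> dict:
    """Flatten the first stack's Outputs list into {OutputKey: OutputValue}."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        return {}
    outputs = stacks[0].get("Outputs", [])
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}


def fetch_outputs(stack_name: str, region: str = "us-east-1") -> dict:
    """Call DescribeStacks for one stack and return its outputs."""
    # Imported lazily so stack_outputs() can be used without boto3 installed.
    import boto3

    cfn = boto3.client("cloudformation", region_name=region)
    return stack_outputs(cfn.describe_stacks(StackName=stack_name))
```

For example, `fetch_outputs("amazon-macie-activity-generator")` would list the outputs of a stack created under that (assumed) name.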

To add buckets for data classification, follow these steps:

Log in to Amazon Macie.
Select Integrations, and then select Services.
Select your account, and then select Details from the Amazon S3 card.
Select your newly created buckets for Full classification, including existing data.

For additional details on configuring Macie, refer to our getting started documentation.

Macie classifies all historical and newly created data in the buckets created by AMG, and the data will be available in the Macie console as it’s classified. Typically, you can expect the data in the sample set to be classified within 60 minutes of the time it was selected for analysis.

Classifying objects with Macie

To see the objects in your test sample set, in Macie, open the Research tab, and then select the S3 Objects index. We’ll use the regular expression search capability in Macie to find any objects written to buckets that start with “amazon-macie-activity-generator-defaults3bucket”. To search for this, type the following text into the Macie search box and select the magnifying glass icon.

filesystem_metadata.bucket:/amazon-macie-activity-generator-defaults3bucket.*/
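To check which bucket names a query like this would cover, you can try the pattern locally. The sketch below assumes Macie's search follows standard Lucene regular expression semantics, where the pattern must match the whole term, so it uses `re.fullmatch` rather than `re.search`. The bucket names in the usage note are made up.

```python
import re

# Pattern taken from between the slashes of the Macie query above.
BUCKET_PATTERN = re.compile(r"amazon-macie-activity-generator-defaults3bucket.*")


def matches_query(bucket_name: str) -> bool:
    """Return True if the bucket name matches the whole pattern."""
    # Lucene regular expressions are anchored at both ends, so fullmatch
    # is a closer analogue than search.
    return BUCKET_PATTERN.fullmatch(bucket_name) is not None
```

With this helper, a name such as `amazon-macie-activity-generator-defaults3bucket-example1` matches, while `my-other-bucket` or a name that merely contains the prefix somewhere in the middle does not.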

Research tab regex results

From here, you can see a nice breakdown of the kinds of objects that have been classified by Macie, as well as the object-specific details. You can also create an advanced search using Lucene Query Syntax, and save it as an alert to be matched against any newly created data.

Objects classified by Macie

Analyzing accesses to your test data

In addition to classifying data, Macie tracks all control plane and data plane accesses to your content using CloudTrail. To see accesses to your generated environment (created periodically by AMG to mimic user activity), on the Macie navigation bar, select Research, select the CloudTrail data index, and then use the following search to identify our generated role activity:

sessionName.key:/amazon-macie-activity-generator-LambdaFunction-.*/

From this search, you can dive into the user activity (IAM users, assumed roles, federated users, and so on), which is summarized in 5-minute aggregations (user sessions). For example, in the screenshot you can see that one of our AMG-generated users listed objects one time (ListObjects) and wrote 56 objects to S3 (PutObject) during a 5-minute period.
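The idea of a 5-minute aggregation can be illustrated with a few lines of Python. This is only a sketch of the concept, not Macie's implementation: the event tuples, the session name, and the function names are hypothetical, and windows are aligned to multiples of five minutes since the Unix epoch.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone

WINDOW_SECONDS = 5 * 60


def window_start(ts: datetime) -> datetime:
    """Round a timezone-aware timestamp down to its 5-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def summarize(events):
    """Count event names per (session, window).

    events: iterable of (timestamp, session_name, event_name) tuples.
    """
    sessions = defaultdict(Counter)
    for ts, session, event_name in events:
        sessions[(session, window_start(ts))][event_name] += 1
    return sessions


def sample_events():
    """Made-up activity: 1 ListObjects and 56 PutObject calls in one window,
    plus one more PutObject that falls into the next window."""
    session = "amazon-macie-activity-generator-LambdaFunction-EXAMPLE"
    base = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    events = [(base + timedelta(seconds=10), session, "ListObjects")]
    events += [(base + timedelta(seconds=5 * i), session, "PutObject") for i in range(56)]
    events.append((base + timedelta(seconds=360), session, "PutObject"))
    return events


if __name__ == "__main__":
    for (session, start), counts in sorted(summarize(sample_events()).items()):
        print(session, start.isoformat(), dict(counts))
```

Grouping by both session and window is what lets a single role show up as several separate user sessions over time.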

Summary of user activity

Macie alerts

Macie features both predictive (machine learning-based) and basic (rule-based) alerts, including alerts on unencrypted credentials being uploaded to S3 (because this activity might not follow compliance best practices), risky activity such as data exfiltration, and user-defined alerts that are based on saved searches. To see alerts that have been generated based on AMG’s activity, on the Macie navigation bar, select Alerts.

AMG will continue to run, periodically uploading content to the specified S3 buckets. To stop AMG, delete the AMG CloudFormation stack and its associated resources.

What are the costs?

Macie has a free tier that lets you analyze up to 1 GB of content per month at no cost. By default, AMG writes approximately 10 MB of objects to Amazon S3 per day. Running continuously, it generates about 310 MB of content per month (10 MB/day x 31 days), which stays below the free tier. Any data use above 1 GB is billed at the Macie public price of $5/GB. For more detail, see the Macie pricing documentation.
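The arithmetic above is easy to rerun if you change how much data AMG writes. The sketch below uses the figures quoted in this post (1 GB free tier, $5/GB) and makes two assumptions of its own: 1 GB is treated as 1,024 MB, and AMG's output is the only content Macie analyzes in the account. The function name is hypothetical.

```python
FREE_TIER_GB = 1.0   # monthly free tier quoted in this post
PRICE_PER_GB = 5.0   # USD per GB above the free tier, as quoted in this post
MB_PER_GB = 1024     # assumption: binary megabytes per gigabyte


def monthly_classification_cost(mb_per_day: float, days: int = 31) -> float:
    """Estimate the monthly classification charge in USD."""
    total_gb = mb_per_day * days / MB_PER_GB
    billable_gb = max(0.0, total_gb - FREE_TIER_GB)
    return billable_gb * PRICE_PER_GB


if __name__ == "__main__":
    # Default AMG rate: 10 MB/day for 31 days stays inside the free tier.
    print(monthly_classification_cost(10))
    # A hypothetical heavier configuration that would exceed it.
    print(round(monthly_classification_cost(100), 2))
```

At the default rate the estimate is zero, and charges only start once daily output exceeds roughly 33 MB over a 31-day month.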

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon Macie forum or contact AWS Support.