AWS Database Blog

Build a centralized audit data collection for Amazon RDS for PostgreSQL using Amazon S3 and Amazon Athena

Database auditing is one of the key compliance requirements that organizations need to meet. You might be required to capture, store, and retain audit data for the long term, and you also need to meet your organization’s information security regulations and standards.

In this post, we show you how to capture audit data from an Amazon Relational Database Service (Amazon RDS) for PostgreSQL database and store it in Amazon Simple Storage Service (Amazon S3). We also show you how to process the audit data using AWS Glue and query it with Amazon Athena. This solution makes it easier to process and query audit data, and it can free up database resources that would otherwise be used to store and process that data.

Solution overview

The following diagram shows the solution architecture, which uses Amazon RDS for PostgreSQL, Amazon CloudWatch Logs, AWS Lambda, Amazon Kinesis Data Firehose, Amazon S3, AWS Glue, and Amazon Athena to analyze PostgreSQL audit files.

The high-level steps to implement this solution are as follows:

  1. Create AWS Identity and Access Management (IAM) roles.
  2. Create a Lambda function to decompress and transform the audit log records.
  3. Create an S3 bucket for storing the files generated by Kinesis Data Firehose.
  4. Enable Amazon RDS to write to CloudWatch Logs.
  5. Create a Firehose delivery stream.
  6. Create a subscription filter.
  7. Set up an AWS Glue database, crawler, and table.
  8. Run Athena queries to identify database performance issues.

Prerequisites

To follow along with this post, you must have the following prerequisites:

  - An AWS account with permissions to create the resources used in this solution
  - An RDS for PostgreSQL instance with the pgAudit extension enabled, so that audit records are written to the PostgreSQL log

Because this solution involves setting up and using AWS resources, it will incur costs in your account. Refer to AWS Pricing for more information. We strongly recommend that you set this up in a non-production instance and run end-to-end validations before you implement this solution in a production environment.
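
If pgAudit is not yet enabled on your instance, the following is a minimal boto3 sketch of configuring it through a custom DB parameter group. The parameter group name and the pgaudit.log setting are assumptions; adjust them for your environment and associate the parameter group with your instance.

import boto3

rds = boto3.client('rds', region_name='us-east-2')

# Both parameters take effect on the next reboot of the instance;
# the parameter group name below is a hypothetical example
rds.modify_db_parameter_group(
    DBParameterGroupName='pgaudit-postgres',
    Parameters=[
        {
            'ParameterName': 'shared_preload_libraries',
            'ParameterValue': 'pgaudit',
            'ApplyMethod': 'pending-reboot',
        },
        {
            'ParameterName': 'pgaudit.log',
            'ParameterValue': 'all',
            'ApplyMethod': 'pending-reboot',
        },
    ],
)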

Create IAM roles

You can create your IAM roles using the IAM console, the AWS Command Line Interface (AWS CLI), the AWS Tools for PowerShell, or the IAM API. For more information, refer to Creating IAM roles.

You need three roles: CWLtoKinesisFirehoseRole, FirehosetoS3Role, and CWLtofirehose-lambda-exec-role. The CWLtoKinesisFirehoseRole allows CloudWatch Logs to stream data to Kinesis Data Firehose. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "firehose:DescribeDeliveryStream",
                "firehose:PutRecord",
                "firehose:StartDeliveryStreamEncryption",
                "firehose:CreateDeliveryStream",
                "firehose:PutRecordBatch",
                "firehose:ListDeliveryStreams",
                "firehose:StopDeliveryStreamEncryption",
                "firehose:ListTagsForDeliveryStream",
                "firehose:UpdateDestination"
            ],
            "Resource": [
                "arn:aws:firehose:us-east-2:111111111:deliverystream/DB-CWL-KinesisDataFirehose"
            ]
        }
    ]
}
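
In addition to this permissions policy, CWLtoKinesisFirehoseRole needs a trust policy that allows CloudWatch Logs to assume it; similarly, FirehosetoS3Role trusts firehose.amazonaws.com and CWLtofirehose-lambda-exec-role trusts lambda.amazonaws.com. The following is a minimal boto3 sketch for the first role; the account ID, Region, and delivery stream name are placeholders.

import json

import boto3

iam = boto3.client('iam')

# Trust policy that lets the CloudWatch Logs service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "logs.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam.create_role(
    RoleName='CWLtoKinesisFirehoseRole',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach an inline policy (abbreviated here to the put actions used
# by the subscription filter; the full action list is shown above)
iam.put_role_policy(
    RoleName='CWLtoKinesisFirehoseRole',
    PolicyName='CWLtoKinesisFirehosePolicy',
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
                "Resource": "arn:aws:firehose:us-east-2:111111111:deliverystream/DB-CWL-KinesisDataFirehose"
            }
        ]
    }),
)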

The FirehosetoS3Role allows Kinesis Data Firehose to write to Amazon S3. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::pgaudit2",
                "arn:aws:s3:::pgaudit2/*"
            ]
        }
    ]
}

The CWLtofirehose-lambda-exec-role is the Lambda execution role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-2:111111111:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-2:111111111:log-group:/aws/lambda/CWLtofirehose:*"
            ]
        }
    ]
}

Create a Lambda function

We create the Lambda function to decompress and transform the log records and return them to Kinesis Data Firehose, which delivers them to Amazon S3. Then we add a layer to the function for the logger. To create the function on the Lambda console, complete the following steps (if you prefer to create the function programmatically, see the sketch after these steps):

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose Create function.
  3. Name your function CWLtofirehose.
  4. Choose the runtime Python 3.10.
  5. Choose the existing role CWLtofirehose-lambda-exec-role.
  6. Enter the following code and create your function:
    import base64
    import datetime
    import gzip
    import json
    import re
    
    from aws_lambda_powertools import Logger
    
    # Powertools logger writes structured log records to CloudWatch Logs
    cwlogger = Logger()
    
    def csvValues(message):
        # Collapse repeated whitespace, then parse the pgAudit log line into
        # the timestamp, client address, user@database, and the AUDIT payload
        standardizedMessage = re.sub(r'(\s)+', ' ', message, flags=re.MULTILINE)
        res = re.match(r'^(.+ .+) UTC\:(.+)\(.+\)\:(.+)\:\[.+ AUDIT\:] (.+)', standardizedMessage)
        dt = datetime.datetime.strptime(res.group(1), '%Y-%m-%d %H:%M:%S')
        
        # The AUDIT payload is a CSV record; the eighth field onward contains
        # the statement text, which is quoted when it includes commas
        tail = res.group(4).split(',', 7)[7]
        if tail.startswith('"'):
            query = tail.split('"')[1]
        else:
            query = tail.split(',')[0]
    
        return dict(zip(
            ['year', 'month', 'day', 'timestamp', 'datetime', 'ipaddress', 'userlogged', 'query'],
            [dt.year, dt.month, dt.day, int(dt.timestamp()), res.group(1), res.group(2), res.group(3), query],
        ))
        
    def lambda_handler(event, context):
        output = []
        cwlogger.info(event)
        for record in event['records']:
            cwlogger.info(record['data'])
            # CloudWatch Logs delivers records to Kinesis Data Firehose
            # base64-encoded and gzip-compressed; decode and decompress them,
            # then strip escape characters before parsing the JSON payload
            compressed_payload = base64.b64decode(record['data'])
            uncompressed_payload = gzip.decompress(compressed_payload)
            cwlogger.info(uncompressed_payload)
            decoded = uncompressed_payload.decode('utf-8')
            stripped = re.sub(r"\\", "", decoded)
    
            payload = json.loads(stripped)
            cwlogger.info(payload['logEvents'])
            data = [
                csvValues(le['message'])
                for le in payload['logEvents']
            ]
            cwlogger.info(data)
            str_data = "\n".join([json.dumps(d) for d in data])
            # Return each record in the format Kinesis Data Firehose expects:
            # the original recordId, a result of 'Ok', and base64-encoded data
            output_record = {
                'recordId': record['recordId'],
                'result': 'Ok',
                'data': base64.b64encode(str_data.encode('utf-8')).decode('utf-8')
            }
            cwlogger.info(output_record)
            output.append(output_record)
    
        cwlogger.info('Successfully processed {} records.'.format(len(event['records'])))
    
        return {'records': output}
    
  7. On the Lambda console, choose Layers in the navigation pane.
  8. Choose Add layer.
  9. For Layer source, select AWS layers.
  10. For AWS layers, choose the layer AWSLambdaPowertoolsPythonV2.
  11. Choose Add.
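
If you prefer to create the function with the AWS SDK instead of the console, the following is a minimal boto3 sketch. The deployment package name, account ID, and Powertools layer ARN and version are assumptions; check the Powertools for AWS Lambda (Python) documentation for the layer ARN in your Region.

import boto3

lambda_client = boto3.client('lambda', region_name='us-east-2')

# The zip file is assumed to contain the code above in lambda_function.py
with open('cwltofirehose.zip', 'rb') as f:
    code_bytes = f.read()

lambda_client.create_function(
    FunctionName='CWLtofirehose',
    Runtime='python3.10',
    Role='arn:aws:iam::111111111:role/CWLtofirehose-lambda-exec-role',
    Handler='lambda_function.lambda_handler',
    Code={'ZipFile': code_bytes},
    Timeout=120,
    Layers=[
        # Placeholder layer ARN; the Region and version vary
        'arn:aws:lambda:us-east-2:017000801446:layer:AWSLambdaPowertoolsPythonV2:46',
    ],
)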

Create an S3 bucket

Next, we create a bucket to store the audit files generated by Kinesis Data Firehose. For instructions on creating your S3 bucket, refer to Creating a bucket.
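
If you prefer the AWS SDK, the following is a minimal boto3 sketch; the bucket name and Region are the examples used in this post.

import boto3

s3 = boto3.client('s3', region_name='us-east-2')

# Outside us-east-1, the Region must be passed as a location constraint
s3.create_bucket(
    Bucket='pgaudit2',
    CreateBucketConfiguration={'LocationConstraint': 'us-east-2'},
)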

Enable Amazon RDS to write to CloudWatch Logs

To enable Amazon RDS to publish PostgreSQL logs to CloudWatch Logs, complete the following steps (you can also enable log exports programmatically, as shown after these steps):

  1. On the Amazon RDS console, choose Databases in the navigation pane.
  2. Choose the instance that you want to publish logs to CloudWatch for, then choose Modify.
  3. In the Log exports section, select the log types that you want to publish.
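
The following is a minimal boto3 sketch of enabling log exports programmatically; the instance identifier is the example used later in this post.

import boto3

rds = boto3.client('rds', region_name='us-east-2')

# Publish the PostgreSQL and upgrade logs to CloudWatch Logs
rds.modify_db_instance(
    DBInstanceIdentifier='rds-pg-labs',
    CloudwatchLogsExportConfiguration={
        'EnableLogTypes': ['postgresql', 'upgrade'],
    },
    ApplyImmediately=True,
)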

Create a Firehose delivery stream

Create the Firehose delivery stream with the following steps (a boto3 sketch follows these steps):

  1. On the Kinesis Data Firehose console, choose Create a delivery stream.
  2. Choose your source and destination.
  3. For Delivery stream name, enter a name.
  4. For Data transformation, select Enabled.
  5. For AWS Lambda function, enter your function ARN.
  6. For Buffer size, enter your preferred buffer size for your function.
  7. For Buffer interval, enter your preferred buffer interval for your function.
  8. For Record format conversion, select Disabled.
  9. For S3 bucket, enter the name of your S3 bucket.
  10. For Dynamic partitioning, select Disabled.
  11. For S3 bucket prefix, enter an optional prefix.
  12. Under S3 buffer hints, choose your preferred buffer and interval.
  13. For Compression for data records, select Disabled.
  14. For Encryption for data records, select Disabled.
  15. For Source record backup in Amazon S3, select Disabled.
  16. Expand Advanced settings and for Amazon CloudWatch error logging, select Enabled.
  17. Under Permissions, select Choose an existing role.
  18. Choose the role FirehosetoS3Role.
  19. Choose Create delivery stream.
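
The following is a minimal boto3 sketch of the same configuration; the ARNs, bucket name, and buffering values are assumptions based on the examples in this post.

import boto3

firehose = boto3.client('firehose', region_name='us-east-2')

firehose.create_delivery_stream(
    DeliveryStreamName='DB-CWL-KinesisDataFirehose',
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::111111111:role/FirehosetoS3Role',
        'BucketARN': 'arn:aws:s3:::pgaudit2',
        'BufferingHints': {'SizeInMBs': 1, 'IntervalInSeconds': 60},
        'CompressionFormat': 'UNCOMPRESSED',
        # Invoke the CWLtofirehose function to transform each batch of records
        'ProcessingConfiguration': {
            'Enabled': True,
            'Processors': [
                {
                    'Type': 'Lambda',
                    'Parameters': [
                        {
                            'ParameterName': 'LambdaArn',
                            'ParameterValue': 'arn:aws:lambda:us-east-2:111111111:function:CWLtofirehose',
                        }
                    ],
                }
            ],
        },
    },
)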

To view your log group, choose Log groups in the navigation pane of the CloudWatch console. The log group for your DB instance contains the PostgreSQL logs with the audit records; for example, /aws/rds/instance/rds-pg-labs/postgresql.

Create a subscription filter

To create your Kinesis Data Firehose subscription filter, complete the following steps (a programmatic alternative follows these steps):

  1. On the CloudWatch console, open the log group for your instance and navigate to the Subscription filters tab.
  2. On the Create menu, choose Create Kinesis Firehose subscription filter.
  3. For Destination account, select Current account.
  4. For Kinesis Firehose delivery stream, enter the name of your delivery stream.
  5. For Select an existing role, choose the role CWLtoKinesisFirehoseRole.
  6. For Log format, choose Other.
  7. For Subscription filter pattern, enter AUDIT.
  8. For Subscription filter name, enter a name.
  9. For Log event messages, enter your log data.
  10. Choose Start streaming.
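
The following is a minimal boto3 sketch of creating the same subscription filter; the log group name, filter name, and account ID are placeholders based on the examples in this post.

import boto3

logs = boto3.client('logs', region_name='us-east-2')

# Forward only log events that contain the pgAudit AUDIT keyword
logs.put_subscription_filter(
    logGroupName='/aws/rds/instance/rds-pg-labs/postgresql',
    filterName='pgaudit-to-firehose',
    filterPattern='AUDIT',
    destinationArn='arn:aws:firehose:us-east-2:111111111:deliverystream/DB-CWL-KinesisDataFirehose',
    roleArn='arn:aws:iam::111111111:role/CWLtoKinesisFirehoseRole',
)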

Set up an AWS Glue database, crawler, and table

You should now have your CloudWatch Logs subscription filter configured and audit data flowing to your S3 bucket. In this step, we configure the AWS Glue database, crawler, table, and table partitions.

  1. On the AWS Glue console, choose Add database.
  2. Enter a name for your database, such as pgaudit_db.
    Now that we have our AWS Glue database in place, we set up the crawler.
  3. In the navigation pane, choose Crawlers.
  4. Choose Add crawler.
  5. Enter a name for your crawler, such as pgaudit_crawler.
  6. Choose Next.
  7. For Add a data store, choose S3.
  8. For Include path, enter the S3 path to the folders or files that you want to crawl.
  9. Choose Next.
  10. For Add another data store, choose No.
  11. For Choose IAM role, choose an existing role with the necessary permissions or let AWS Glue create a role for you.
  12. In the Create a schedule for this crawler section, for Frequency, choose Daily.
    You can also choose to run the crawler on demand.
  13. Enter your start hour and minute information.
  14. Choose Next.

For more information about configuring crawlers, see Crawler Properties.

When this crawler runs, it automatically creates an AWS Glue table. It also creates the table schema and partitions based on the folder structure in the S3 bucket.
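
If you prefer to set this up programmatically, the following is a minimal boto3 sketch; the crawler role, S3 path, and schedule are assumptions.

import boto3

glue = boto3.client('glue', region_name='us-east-2')

# Create the database that holds the audit table
glue.create_database(DatabaseInput={'Name': 'pgaudit_db'})

# Crawl the audit files that Kinesis Data Firehose writes to the S3 bucket
glue.create_crawler(
    Name='pgaudit_crawler',
    Role='arn:aws:iam::111111111:role/GlueCrawlerRole',  # hypothetical role
    DatabaseName='pgaudit_db',
    Targets={'S3Targets': [{'Path': 's3://pgaudit2/'}]},
    Schedule='cron(0 0 * * ? *)',  # daily at 00:00 UTC
)

glue.start_crawler(Name='pgaudit_crawler')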

Run Athena queries to identify database performance issues

Now that we have created a database and table using AWS Glue, we’re ready to analyze the audit data and find useful insights about our database. We use Athena to run SQL queries to understand how the database is used and to identify underlying issues, if any. Athena is a serverless, interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL, and you pay only for the queries that you run; for more information, refer to Amazon Athena pricing. With Athena, you don’t have to set up or manage any servers or data warehouses. You just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.

To test our solution, run the following query:

SELECT * FROM "pgaudit_db"."audit_db1";


Run the following query to list only the SELECT statements captured in the audit data:

SELECT * FROM "pgaudit_db"."audit_db1" where query like 'select%';
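
You can also run queries programmatically, for example to see which database users issue the most statements. The following is a minimal boto3 sketch; the query output location is an assumption, and the table name may differ depending on the S3 prefix your crawler used.

import time

import boto3

athena = boto3.client('athena', region_name='us-east-2')

query = """
SELECT userlogged, count(*) AS statements
FROM "pgaudit_db"."audit_db1"
GROUP BY userlogged
ORDER BY statements DESC
"""

# Start the query and wait for it to finish
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'pgaudit_db'},
    ResultConfiguration={'OutputLocation': 's3://pgaudit-athena-results/'},  # hypothetical results bucket
)
query_id = execution['QueryExecutionId']

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Print the result rows (the first row contains the column headers)
if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    for row in rows[1:]:
        print([col.get('VarCharValue') for col in row['Data']])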

Clean up

To avoid ongoing costs, delete the resources you created as part of this post.

  1. Delete the IAM roles.
  2. Delete the Lambda function.
  3. Delete the S3 bucket.
  4. Delete the RDS for PostgreSQL instance.
  5. Delete the Firehose delivery stream.
  6. Delete the AWS Glue database.

Conclusion

In this post, we demonstrated how you can capture and store audit data from RDS for PostgreSQL databases and store it in Amazon S3, process it using AWS Glue, and query it using Athena. This solution can help you generate reports for auditing. This solution also works for Amazon Aurora PostgreSQL-Compatible Edition. In part 2 of this post, we walk you through setting up auditing for MySQL and MariaDB.

If you have any comments or questions about this post, share them in the comments.


About the Authors


Kavita Vellala is a Senior Database Consultant with AWS and brings broad experience across database technologies. Kavita has worked on database engines such as Oracle, SQL Server, PostgreSQL, MySQL, Couchbase, and Amazon Redshift. At AWS, she helps empower customers with their digital transformation and accelerate the migration of their database workloads to the AWS Cloud. She enjoys adapting innovative AI and ML technologies to help companies solve new problems, and to solve old problems more efficiently and effectively.

Sharath Lingareddy is a Senior Database Architect with the Professional Services team at Amazon Web Services. He has provided solutions using Oracle, PostgreSQL, MySQL, DynamoDB, Amazon RDS, and Aurora. His focus area is homogeneous and heterogeneous migrations of on-premises databases to Amazon RDS and Aurora PostgreSQL.