AWS Big Data Blog

Analyze Database Audit Logs for Security and Compliance Using Amazon Redshift Spectrum

With the increased adoption of cloud services, organizations are moving their critical workloads to AWS. Some of these workloads store, process, and analyze sensitive data that must be audited to satisfy security and compliance requirements. Auditors most commonly ask questions such as: who logged in to the system and when, who queried which sensitive data and when, and when did a user last modify or update their credentials?

By default, Amazon Redshift logs all information related to user connections, user modifications, and user activity on the database. However, to manage disk space efficiently, these log tables are retained for only 2–5 days, depending on log usage and available disk space. To retain the log data for a longer period of time, enable database audit logging. After it’s enabled, Amazon Redshift periodically pushes the data to a configured S3 bucket.
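
Before audit logging is enabled, this information is available only in the cluster’s system tables, which is where the short retention window applies. As a quick illustration of what is available in the database itself (a minimal sketch using the standard STL_CONNECTION_LOG system table), the following query lists recent connection events:

-- Recent connection events recorded in the database system tables
-- (retained only for a few days, hence the need for S3 audit logging)
SELECT event
,recordtime
,remotehost
,username
  FROM stl_connection_log
 ORDER BY recordtime DESC
 LIMIT 20;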

Amazon Redshift Spectrum is a recently released feature that enables querying and joining data stored in Amazon S3 with Amazon Redshift tables. With Redshift Spectrum, you can retrieve the audit data stored in S3 to answer all security and compliance–related questions. Redshift Spectrum can also combine the datasets from the tables in the database with the datasets stored in S3. It supports files in Parquet, textfile (csv, pipe delimited, tsv), sequence file, and RC file format. It also supports different compression types like gzip, snappy, and bz2.

In this post, I demonstrate how to query the Amazon Redshift audit data logged to S3 to answer the common security and compliance questions described above.

Walkthrough

You set up the following resources:

  • Amazon Redshift cluster and parameter group
  • IAM role and policies to give Redshift Spectrum access to Amazon S3
  • Redshift Spectrum external tables

Prerequisites

  • Create an AWS account
  • Configure the AWS CLI to access your AWS account
  • Get access to a query tool compatible with Amazon Redshift
  • Create an S3 bucket

Cluster requirements

The Amazon Redshift cluster must:

  • Be in the same region as the S3 bucket storing the audit log files.
  • Be version 1.0.1294 or later.
  • Have read bucket and put object permissions on the S3 bucket configured for logging.
  • Have an IAM role attached that includes at least the AmazonS3ReadOnlyAccess and AmazonAthenaFullAccess managed policies.

Set up Amazon Redshift

Create a new parameter group to enable user activity logging:

aws redshift create-cluster-parameter-group --parameter-group-name rs10-enable-log --parameter-group-family Redshift-1.0 --description "Enable Audit Logging"
aws redshift modify-cluster-parameter-group --parameter-group-name rs10-enable-log --parameters '[{"ParameterName":"enable_user_activity_logging","ParameterValue":"true"}]'

Create the Amazon Redshift cluster using the new parameter group:

aws redshift create-cluster --node-type dc1.large --cluster-type single-node --cluster-parameter-group-name rs10-enable-log --master-username <Username> --master-user-password <Password> --cluster-identifier <ClusterName>

Wait for the cluster to become available. When it is, enable audit logging:

aws redshift enable-logging --cluster-identifier <ClusterName> --bucket-name <bucketName>

Set up Redshift Spectrum

To set up Redshift Spectrum, create an IAM role and policies, an external database, and external tables.

IAM role and policies

Create an IAM role that Amazon Redshift can assume to access the S3 bucket using Redshift Spectrum:

aws iam create-role --role-name RedshiftSpectrumRole --assume-role-policy-document file://RedshiftAssumeRole.json

RedshiftAssumeRole.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Attach the AmazonS3ReadOnlyAccess and AmazonAthenaFullAccess managed policies to the role so that Amazon Redshift can access S3 using Redshift Spectrum:

aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess --role-name RedshiftSpectrumRole
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess --role-name RedshiftSpectrumRole

Add the role that you just created to the Amazon Redshift cluster. Replace the account number and cluster name with your own values.

aws redshift modify-cluster-iam-roles --cluster-identifier <ClusterName> --add-iam-roles arn:aws:iam::<accountNumber>:role/RedshiftSpectrumRole

At this point, Redshift Spectrum is fully configured to access S3 from the Amazon Redshift cluster.

External database and schema

While logged in to the Amazon Redshift database, set up an external database and schema that support creating external tables so that you can query data stored in S3.

An external database created in Amazon Redshift is stored in the Amazon Athena data catalog and can be accessed directly from Athena. It maps a database, whose metadata lives either in the Athena data catalog or in a user-specified Hive metastore, to a local schema inside the Amazon Redshift cluster. This mapping allows table definitions to be shared between Athena and Amazon Redshift, or even across different systems such as Amazon Redshift, Apache Spark, and Apache Hive.

To query the audit log data, create an external database and schema in Amazon Redshift. Update the account number, role name, and region name in the following DDL before running it.

CREATE external SCHEMA auditLogSchema 
FROM data catalog 
DATABASE 'auditLogdb' 
iam_role 'arn:aws:iam::<AccountNumber>:role/<RoleName>'
REGION '<regionName>'
CREATE external DATABASE if not exists;

The region parameter specifies the region of the Athena data catalog. By default, this is the same region as the Amazon Redshift cluster.
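
To confirm that the external schema and its database mapping were created, you can query the SVV_EXTERNAL_SCHEMAS system view (a quick verification step, not part of the original setup):

-- Verify the external schema registered above
SELECT schemaname
,databasename
  FROM svv_external_schemas
 WHERE schemaname = 'auditlogschema';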

Create external tables

Redshift Spectrum supports querying S3 data by creating external tables. External tables can be created in Amazon Redshift, Athena, or the Hive metastore. External tables are read-only. Currently, data in S3 files cannot be modified using this feature.

An external table is always referenced by prefixing the table name with the schema name.

Create three tables to query three different audit log files:

  • User DDL: Captures changes made to user definitions, including any users created or deleted.
  • User connection: Logs all successful and unsuccessful logon attempts.
  • User activity: Captures all queries executed by users.

The user DDL and user connection logs are pipe-delimited text files, both compressed with the gzip utility. The user activity log is free-flow text, with each query delimited by a newline; it is also gzip-compressed.

Update the S3 bucket name, account number, and region in the location clauses of the following queries to match your own environment.

CREATE external TABLE auditlogschema.userlog
(
	userid INTEGER,
	username CHAR(50),
	oldusername CHAR(50),
	action CHAR(10),
	usecreatedb INTEGER,
	usesuper INTEGER,
	usecatupd INTEGER,
	valuntil TIMESTAMP,
	pid INTEGER,
	xid BIGINT,
	recordtime VARCHAR(200)
)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://<bucketName>/AWSLogs/<accountNumber>/Redshift/<regionName>/YYYY/MM/DD/';


CREATE external TABLE auditlogschema.connectionlog
(
	event CHAR(50) ,
	recordtime VARCHAR(200) ,
	remotehost CHAR(32) ,
	remoteport CHAR(32) ,
	pid INTEGER ,
	dbname CHAR(50) ,
	username CHAR(50) ,
	authmethod CHAR(32) ,
	duration BIGINT ,
	sslversion CHAR(50) ,
	sslcipher CHAR(128) ,
	mtu INTEGER ,
	sslcompression CHAR(64) ,
	sslexpansion CHAR(64) ,
	iamauthguid CHAR(36) 
)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://<bucketName>/AWSLogs/<accountNumber>/Redshift/<regionName>/YYYY/MM/DD/';


CREATE external TABLE auditlogschema.activitylog
(
logtext VARCHAR(20000)
)
row format delimited
lines terminated by '\n'
stored as textfile
location 's3://<bucketName>/AWSLogs/<accountNumber>/Redshift/<regionName>/YYYY/MM/DD/';
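
After running the three DDL statements, you can verify that the external tables are registered and pointing at the expected S3 locations by querying the SVV_EXTERNAL_TABLES system view (again, a quick sanity check rather than a required step):

-- Confirm the three external tables and their S3 locations
SELECT schemaname
,tablename
,location
  FROM svv_external_tables
 WHERE schemaname = 'auditlogschema';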

Demonstrate the use case

In order to demonstrate the use cases for this post, create a user “guest,” log in as that user, and create a table, “person.” Then, query the table to generate some user activity.

Log in as admin user and create a new user “guest”:

CREATE USER guest PASSWORD 'Temp1234';

Log in as user “guest” and run the following statements:

CREATE TABLE person (id int, name varchar(50),address VARCHAR(500));

INSERT INTO person VALUES(1,'Sam','7 Avonworth sq, Columbia, MD');
INSERT INTO person VALUES(2,'John','10125 Main St, Edison, NJ');
INSERT INTO person VALUES(3,'Jake','33w 7th st, NY, NY');
INSERT INTO person VALUES(4,'Josh','21025 Stanford Sq, Stanford, CT');
INSERT INTO person VALUES(5,'James','909 Trafalgar sq, Elkton, MD');

While you are still logged in as guest, run some ad hoc queries on the person table to generate the activity log:

SELECT * 
  FROM person;

SELECT * 
  FROM person 
 WHERE name='John';

In the next sections, I demonstrate one use case for each type of log generated in the S3 bucket.

User Definition Log

The following query identifies whether the user ‘guest’ definition was altered during the day.

SELECT username
,action
,recordtime 
  FROM auditlogschema.userlog 
 WHERE action in ('create','alter','drop','rename') 
   AND username='guest';

User Connection Log

The following query identifies the remote host from which the user ‘guest’ logged on, the logon time, and so on.

SELECT event
,recordtime
,remotehost 
  FROM auditlogschema.connectionlog 
 WHERE length(username) >0 
   AND username='guest' 
ORDER BY recordtime;
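
As a related example (a sketch that goes beyond the original walkthrough), you could aggregate the connection log to see how often each user connected from each remote host; filtering on the event column would narrow this to a specific event type:

-- Connection events per user and remote host
SELECT username
,remotehost
,count(*) AS events
  FROM auditlogschema.connectionlog
 WHERE length(username) > 0
 GROUP BY username, remotehost
 ORDER BY events DESC;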

User Activity Log

The following query pulls information on when the person table was accessed and by which users.

SELECT pu.usename
	,substring(logtext,2,strpos(logtext,'UTC')+1) AS UTCTime
	,substring(logtext,strpos(logtext,'LOG:')+5) AS query
  FROM auditlogschema.activitylog al, pg_user pu
 WHERE logtext like '%xid=%'
   AND logtext not like '%SELECT 1%'
   AND logtext not like '%SET %'
   AND logtext like '%from person%'
   AND substring(substring(logtext,strpos(logtext,'userid=')+7),1,strpos(substring(logtext,strpos(logtext,'userid=')+7),' ')) = pu.usesysid;
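
Along the same lines, a simple aggregate (again a sketch, not part of the original post) shows how many logged statements touched the person table, which can be useful when reporting how often a sensitive table was accessed:

-- Number of logged statements that reference the person table
SELECT count(*) AS person_queries
  FROM auditlogschema.activitylog
 WHERE logtext like '%from person%'
   AND logtext not like '%SELECT 1%';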

Summary

In this post, I discussed how Redshift Spectrum, a new addition to Amazon Redshift, helps you query audit log data stored in S3 to answer security and compliance questions with ease.

Amazon Redshift Spectrum is currently available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) Regions.

To learn more about Amazon Redshift Spectrum, see the Amazon Redshift documentation. You can also visit the Get Started page to learn more about the new feature.

If you have any questions or suggestions, please comment below.


Additional Reading

Learn how to improve query performance by orders of magnitude in the Amazon Redshift Engineering’s Advanced Table Design Playbook.


About the Author

Sandeep Kariro is a Consultant for AWS Professional Services. He is a seasoned specialist in data center migrations to AWS. Sandeep also provides data-centric design solutions optimized for cloud deployments, keeping security, compliance, and operations as top design principles. He acts as a liaison between customers and AWS product teams, requesting new product features and communicating top customer asks. He loves traveling and has visited several countries around the globe in the last decade.