How do I analyze my audit logs using Amazon Redshift Spectrum?

Last updated: 2020-08-19

I want to analyze my audit logs using Amazon Redshift Spectrum. How do I query the audit logs?

Short description

Before you begin to use Redshift Spectrum, be sure to complete the following tasks:

1.    Enable your audit logs.

Note: It might take some time for your audit logs to appear in your Amazon Simple Storage Service (Amazon S3) bucket.

2.    Create an AWS Identity and Access Management (IAM) role.

3.    Associate the IAM role to your Amazon Redshift cluster.

To query your audit logs in Redshift Spectrum, create external tables and configure them to point to a common folder (used by your files). Then, use the hidden $path column and regular expression functions in views to generate the rows for your analysis.

Resolution

To query your audit logs in Redshift Spectrum, perform the following steps:

1.    Create an external schema:

create external schema s_audit_logs 
from data catalog 
database 'audit_logs' 
iam_role 'arn:aws:iam::your_account_number:role/role_name' 
create external database if not exists;

Replace your_account_number with your AWS account number. For role_name, specify the IAM role that's attached to your Amazon Redshift cluster.
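As an optional check, you can confirm that the schema was registered by querying the SVV_EXTERNAL_SCHEMAS system view on your cluster:

-- Confirm that the external schema and its Data Catalog database exist
SELECT schemaname, databasename
FROM svv_external_schemas
WHERE schemaname = 's_audit_logs';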

2.    Create the external tables:

create external table s_audit_logs.user_activity_log(
    logrecord varchar(max)
)
 ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '|' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket_name/logs/AWSLogs/your_account_id/redshift/region';

In this example, you create a user activity log table. Replace bucket_name, your_account_id, and region with your actual bucket name, account ID, and AWS Region.

3.    Create a connection log table:

CREATE EXTERNAL TABLE s_audit_logs.connections_log(
  event varchar(60),
  recordtime varchar(60),
  remotehost varchar(60),
  remoteport varchar(60),
  pid int,
  dbname varchar(60),
  username varchar(60),
  authmethod varchar(60),
  duration int,
  sslversion varchar(60),
  sslcipher varchar(150),
  mtu int,
  sslcompression varchar(70),
  sslexpansion varchar(70),
  iamauthguid varchar(50),
  application_name varchar(300))
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '|' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket_name/logs/AWSLogs/your_account_id/redshift/region';

Replace bucket_name, your_account_id, and region with your actual bucket name, account ID, and AWS Region.
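To verify that both external tables are registered and point to the expected Amazon S3 location, one approach is to query the SVV_EXTERNAL_TABLES system view:

-- List the external tables registered under the audit log schema
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 's_audit_logs';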

4.    Create a local schema to view the audit logs:

create schema audit_logs_views;

5.    Create views in a database (using the WITH NO SCHEMA BINDING option) to access the external tables:

CREATE VIEW audit_logs_views.v_connections_log AS
select *
FROM s_audit_logs.connections_log
WHERE "$path" like '%connectionlog%'
with no schema binding;

The hidden $path column restricts the files that are returned to those matching the connectionlog entries.
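For example, after the view is created you can summarize connection activity with an aggregate query such as the following (a sketch that assumes your audit logs have already populated the Amazon S3 location):

-- Count connection log events by type and remote host
SELECT event, remotehost, count(*) AS occurrences
FROM audit_logs_views.v_connections_log
GROUP BY event, remotehost
ORDER BY occurrences DESC;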

In the following example, the hidden $path column and regular expression functions are used to restrict the files that are returned for v_useractivitylog:

CREATE or REPLACE VIEW audit_logs_views.v_useractivitylog AS
SELECT    logrecord,
          substring(logrecord,2,24) as recordtime,
          replace(regexp_substr(logrecord,'db=[^" "]*'),'db=','') as db,
          replace(regexp_substr(logrecord,'user=[^" "]*'),'user=','') AS user,
          replace(regexp_substr(logrecord, 'pid=[^" "]*'),'pid=','') AS pid,
          replace(regexp_substr(logrecord, 'userid=[^" "]*'),'userid=','') AS userid,
          replace(regexp_substr(logrecord, 'xid=[^" "]*'),'xid=','') AS xid,
          replace(regexp_substr(logrecord, '][^*]*'),']','') AS query
   FROM s_audit_logs.user_activity_log
   WHERE "$path" like '%useractivitylog%'
   with no schema binding;

The files returned match the useractivitylog entries.

Note: There is a limitation related to multi-row queries in user activity logs. Therefore, it's a best practice to query the logrecord column directly.
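For example, the following sketch searches the raw logrecord column for statements of interest ('create table' is a placeholder search term, not part of the original example):

-- Because multi-row SQL statements can span log entries,
-- search the raw logrecord column directly
SELECT logrecord
FROM audit_logs_views.v_useractivitylog
WHERE logrecord LIKE '%create table%'
ORDER BY recordtime DESC
LIMIT 100;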