How do I use Amazon Athena to analyze my Amazon S3 server access logs?

I want to query Amazon Simple Storage Service (Amazon S3) server access logs in Amazon Athena.

Resolution

Amazon S3 stores server access logs as objects in an S3 bucket. Follow these steps to use Athena to analyze and query S3 server access logs.

  1. Turn on server access logging for your S3 bucket, if you haven't already. Note the values for Target bucket and Target prefix. You need both to specify the Amazon S3 location in an Athena query.
    Note: For log object key format, choose date-based partitioning to speed up analytics and query applications.

  2. Open the Amazon Athena console.
    Note: Before you run your first query, you might need to set up a query result location in Amazon S3.

  3. In the Query editor, run a DDL statement to create a database.
    Note: It's a best practice to create the database in the same AWS Region as your S3 bucket.

    CREATE DATABASE s3_access_logs_db
  4. Create a table schema in the database. In the following example, the STRING and BIGINT data type values are the access log properties. You can query these properties in Athena. For LOCATION, enter the S3 bucket and prefix path from Step 1. Make sure to include a forward slash (/) at the end of the prefix (for example, s3://doc-example-bucket/prefix/). If you don't use a prefix, then include a forward slash (/) at the end of the bucket name, for example, s3://doc-example-bucket/.

    CREATE EXTERNAL TABLE s3_access_logs_db.mybucket_logs(
     `bucketowner` STRING, 
     `bucket_name` STRING, 
     `requestdatetime` STRING, 
     `remoteip` STRING, 
     `requester` STRING, 
     `requestid` STRING, 
     `operation` STRING, 
     `key` STRING, 
     `request_uri` STRING, 
     `httpstatus` STRING, 
     `errorcode` STRING, 
     `bytessent` BIGINT, 
     `objectsize` BIGINT, 
     `totaltime` STRING, 
     `turnaroundtime` STRING, 
     `referrer` STRING, 
     `useragent` STRING, 
     `versionid` STRING, 
     `hostid` STRING, 
     `sigv` STRING, 
     `ciphersuite` STRING, 
     `authtype` STRING, 
     `endpoint` STRING, 
     `tlsversion` STRING,
     `accesspointarn` STRING,
     `aclrequired` STRING)
     PARTITIONED BY (
       `timestamp` string)
    ROW FORMAT SERDE 
     'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES ( 
     'input.regex'='([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$') 
    STORED AS INPUTFORMAT 
     'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
     'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
     's3://bucket-name/prefix-name/account-id/region/source-bucket-name/'
     TBLPROPERTIES (
      'projection.enabled'='true', 
      'projection.timestamp.format'='yyyy/MM/dd', 
      'projection.timestamp.interval'='1', 
      'projection.timestamp.interval.unit'='DAYS', 
      'projection.timestamp.range'='2024/01/01,NOW', 
      'projection.timestamp.type'='date', 
      'storage.location.template'='s3://bucket-name/prefix-name/account-id/region/source-bucket-name/${timestamp}')
  5. In the left pane, under Tables, choose the ellipsis next to the table name, and then choose Preview table. If you see data from the server access logs in the Results window, then you successfully created the Athena table. The results include values for fields such as bucketowner, bucket_name, and requestdatetime.

You can now query the Amazon S3 server access logs.
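
For example, the following query is roughly equivalent to what Preview table runs: it returns a small sample of rows to confirm that the table parses your logs correctly. It reuses the example database and table names from the earlier steps:

SELECT bucket_name, requestdatetime, remoteip, operation, key, httpstatus
FROM s3_access_logs_db.mybucket_logs
LIMIT 10;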

Example queries

To find the request for a deleted object, use the following query:

SELECT * FROM s3_access_logs_db.mybucket_logs WHERE key = 'images/picture.jpg' AND operation like '%DELETE%';

To show Amazon S3 request IDs for requests that resulted in 403 Access Denied errors, use the following query:

SELECT requestdatetime, requester, operation, requestid, hostid FROM s3_access_logs_db.mybucket_logs
WHERE httpstatus = '403';
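
Note: Because this query doesn't filter on the timestamp partition, Athena scans every log object in the table. On buckets with a large volume of logs, you might want to limit the scan to a date range, as the other examples in this article do:

SELECT requestdatetime, requester, operation, requestid, hostid FROM s3_access_logs_db.mybucket_logs
WHERE httpstatus = '403'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'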

To find Amazon S3 request IDs for HTTP 5xx errors in a specific time period with the key and error code, use the following query:

SELECT requestdatetime, key, httpstatus, errorcode, requestid, hostid FROM s3_access_logs_db.mybucket_logs WHERE httpstatus like '5%'
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To show who deleted an object and when, including the timestamp, IP address, and AWS Identity and Access Management (IAM) role, use the following query:

SELECT requestdatetime, remoteip, requester, key FROM s3_access_logs_db.mybucket_logs 
WHERE key = 'images/picture.jpg' 
AND operation like '%DELETE%';

To show all operations that were performed by an IAM user, use the following query:

SELECT * FROM s3_access_logs_db.mybucket_logs 
WHERE requester='arn:aws:iam::123456789123:user/user_name';

To show all operations that were performed on an object in a specific time period, use the following query:

SELECT * FROM s3_access_logs_db.mybucket_logs WHERE key = 'prefix/images/picture.jpg' AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To show how much data is transferred to an IP address for a specific time period, use the following query:

SELECT coalesce(SUM(bytessent), 0) AS bytessentTOTAL 
FROM s3_access_logs_db.mybucket_logs
WHERE remoteip = '1.2.3.4' AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To show all expire operations that were performed by lifecycle rules in a specific time period, use the following query:

SELECT * FROM s3_access_logs_db.mybucket_logs WHERE operation = 'S3.EXPIRE.OBJECT' 
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To count the number of objects that expired in a specific time period, use the following query:

SELECT count(*) as ExpireCount FROM s3_access_logs_db.mybucket_logs
WHERE operation = 'S3.EXPIRE.OBJECT' 
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To show all transition operations that were performed by lifecycle rules in a specific time period, use the following query:

SELECT * FROM s3_access_logs_db.mybucket_logs
WHERE operation like 'S3.TRANSITION%' 
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To show all requesters grouped by Signature Version, use the following query:

SELECT requester, Sigv, Count(Sigv) as SigCount
FROM s3_access_logs_db.mybucket_logs
GROUP BY requester, Sigv;

To show all anonymous requesters that made requests in a specific time period, use the following query:

SELECT bucket_name, requester, remoteip, key, httpstatus, errorcode, requestdatetime
FROM s3_access_logs_db.mybucket_logs
WHERE requester IS NULL AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To show all requesters that sent PUT object requests in a specific time period, use the following query:

SELECT bucket_name, requester, remoteip, key, httpstatus, errorcode, requestdatetime
FROM s3_access_logs_db.mybucket_logs
WHERE operation='REST.PUT.OBJECT' 
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To show all requesters that sent GET object requests in a specific time period, use the following query:

SELECT bucket_name, requester, remoteip, key, httpstatus, errorcode, requestdatetime
FROM s3_access_logs_db.mybucket_logs
WHERE operation='REST.GET.OBJECT' 
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'

To show all requesters ordered by highest turnaround time in a specific time period, use the following query:

SELECT * FROM s3_access_logs_db.mybucket_logs
WHERE NOT turnaroundtime='-' 
AND timestamp
BETWEEN '2024/01/29'
AND '2024/01/30'
ORDER BY CAST(turnaroundtime AS INT) DESC;

Note: Because the number of log objects in your S3 bucket increases over time, your queries process more data over time. It's a best practice to create a lifecycle policy for your server access logs bucket. Configure the lifecycle policy to periodically remove log files. This reduces the amount of data that Athena analyzes for each query.
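
To gauge how quickly your logs grow before you choose a retention period, you can count log records per day. The following query is a sketch that reuses the example table and timestamp partition from the earlier steps:

SELECT timestamp, COUNT(*) AS log_record_count
FROM s3_access_logs_db.mybucket_logs
WHERE timestamp BETWEEN '2024/01/01' AND '2024/01/31'
GROUP BY timestamp
ORDER BY timestamp;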

Related information

Analyzing Amazon S3 server access logs using Amazon OpenSearch Service

Amazon S3 server access log format

Querying AWS service logs

AWS OFFICIAL | Updated 2 months ago

Comments

Please consider adding some guidance on the potentially large size of the S3 bucket and the cost associated with querying large amounts of data. We followed this article and ran into an issue where Athena always either timed out or hit a rate limit. It turned out that our bucket contained over 8 TB worth of logs. It doesn't make sense to query the whole bucket unless the use case is to dig into years-old logs, and even then it's probably best to mention that large datasets need to be partitioned first, for example with an AWS Glue crawler. We ended up copying the days of data we were interested in to a new bucket and used Athena to query the new bucket.

CheeZ (AWS) replied 10 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS Moderator replied 10 months ago

I recommend updating the Athena DDL, as server access logs may now contain two additional fields (Access Point ARN and aclRequired):

https://docs.aws.amazon.com/AmazonS3/latest/userguide/LogFormat.html#log-record-fields

The link for "Amazon S3 server access log format" under Related information also points to an older URL (it redirects, but it might as well be updated).

replied 5 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS Moderator replied 5 months ago