How do I troubleshoot timeout issues when I query CloudTrail data using Athena?

Last updated: 2021-05-26

When I use Amazon Athena to query my AWS CloudTrail data, my queries take a long time to run or they time out.

Resolution

CloudTrail logs can grow in size over time even if you partition the CloudTrail table to reduce the run time of the queries. Queries against a highly partitioned table have a higher planning time and do not complete quickly.

To resolve the timeout issue, you can manually create a CloudTrail table using partition projection. This allows Athena to dynamically calculate the value of CloudTrail tables instead of scanning through a large list of partitions. With partition projection, you don't need to manage partitions because partition values and locations are calculated from configuration.

To create a CloudTrail table partitioned by timestamp with partition projection, see Creating the table for CloudTrail logs in Athena using partition projection.

To create a CloudTrail table for multiple accounts that's partitioned by year, month, and day with partition projection, use a command similar to the following:

CREATE EXTERNAL TABLE ctrail_pp_ymd (
eventversion STRING,
useridentity STRUCT<
               type:STRING,
               principalid:STRING,
               arn:STRING,
               accountid:STRING,
               invokedby:STRING,
               accesskeyid:STRING,
               userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
               mfaauthenticated:STRING,
               creationdate:STRING>,
sessionissuer:STRUCT<
               type:STRING,
               principalId:STRING,
               arn:STRING,
               accountId:STRING,
               userName:STRING>>>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
               ARN:STRING,
               accountId:STRING,
               type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
PARTITIONED BY (account string, region string, year string, month string, day string)
ROW FORMAT SERDE
  'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT
  'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://doc_example_bucket/AWSLogs/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.day.type'='integer',
  'projection.day.range'='01,31',
  'projection.day.digits'='2',
  'projection.month.type'='integer',
  'projection.month.range'='01,12',
  'projection.month.digits'='2',
  'projection.region.type'='enum',
  'projection.region.values'='us-east-1,us-east-2,us-west-2',
  'projection.year.type'='integer',
  'projection.year.range'='2015,2021',
  'projection.account.type'='enum', 
  'projection.account.values'='111122223334444,5555666677778888',  
  'storage.location.template'='s3://doc_example_bucket/AWSLogs/${account}/CloudTrail/${region}/${year}/${month}/${day}'
)

Be sure to replace the following in the preceding query:

  • ctrail_pp_ymd with the name of the CloudTrail table
  • doc_example_bucket with the name of the Amazon Simple Storage Service (Amazon S3) bucket where you want to create the CloudTrail table
  • 1111222233334444 and 5555666677778888 with the account IDs of accounts that you want to create the CloudTrail table for
  • us-east-1,us-east-2,us-west-2 with the Region that you want to create the CloudTrail table for
  • Table attributes and properties based on your use case
  • Projection ranges based on your use case (for example, if your CloudTrail data is available only from year 2018, then replace the projection range for partition column year with '2018,2021')

To create a CloudTrail table for multiple accounts under the same organization , use a command similar to the following:

CREATE EXTERNAL TABLE ctrail_pp_ymd_org (
eventversion STRING,
useridentity STRUCT<
               type:STRING,
               principalid:STRING,
               arn:STRING,
               accountid:STRING,
               invokedby:STRING,
               accesskeyid:STRING,
               userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
               mfaauthenticated:STRING,
               creationdate:STRING>,
sessionissuer:STRUCT<
               type:STRING,
               principalId:STRING,
               arn:STRING,
               accountId:STRING,
               userName:STRING>>>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
               ARN:STRING,
               accountId:STRING,
               type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
PARTITIONED BY (account string, region string, year string, month string, day string)
ROW FORMAT SERDE
  'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT
  'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://doc_example_bucket/AWSLogs/doc_example_orgID/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.day.type'='integer',
  'projection.day.range'='01,31',
  'projection.day.digits'='2',
  'projection.month.type'='integer',
  'projection.month.range'='01,12',
  'projection.month.digits'='2',
  'projection.region.type'='enum',
  'projection.region.values'='us-east-1,us-east-2,us-west-2',
  'projection.year.type'='integer',
  'projection.year.range'='2010,2100',
  'projection.account.type'='enum', 
  'projection.account.values'='111122223334444,5555666677778888',  
  'storage.location.template'='s3://doc_example_bucket/AWSLogs/doc_example_orgID/${account}/CloudTrail/${region}/${year}/${month}/${day}'
)

Note: If you need to query the CloudTrail data before year 2010, then be sure to update the year range in projection.year.range.

Be sure to replace the following in the above query:

  • ctrail_pp_ymd_org with the name of the CloudTrail table
  • doc_example_bucket with the name of the Amazon S3 bucket where you want to create the CloudTrail table
  • doc_example_orgID with your organization ID
  • 1111222233334444 and 5555666677778888 with the account IDs of accounts that you want to create the CloudTrail table for
  • us-east-1, us-east-2, and us-west-2 with the Regions where you want to create the CloudTrail table
  • Table attributes and properties based on your use case
  • Projection ranges based on your use case (for example, if your CloudTrail data is available only from year 2018, then replace the projection range for partition column year with '2018,2021')

When you run your queries, be sure to include restrictive conditions on the partition columns in your queries. This allows Athena to scan less data and speeds up query processing.

For example, you can run a command similar to the following to find out which user made the GetObject request to the S3 bucket. The table in this query uses the year, month, and day partitioning strategy.

Note: Be sure to have event logging activated for Amazon S3 in CloudTrail.

SELECT useridentity.arn, eventtime  FROM "ctrail_pp_ymd"
where eventname = 'GetObject'
and year = '2021'
and month = '05'
and region = 'us-east-1'
and cast(json_extract(requestparameters, '$.bucketName')as varchar) ='doc_example_bucket'

Be sure to replace the following in the above query:

  • ctrail_pp_ymd with the name of the CloudTrail table
  • doc_example_bucket with the name of the S3 bucket where you want to create the CloudTrail table
  • Restrictive conditions based on your use case

If you have timeout issues even after implementing the above steps, then you can request a service quota increase.


Did this article help?


Do you need billing or technical support?