AWS Storage Blog

Monitor Amazon S3 activity using S3 server access logs and Pandas in Python

Monitoring and controlling access to data is often essential for security, cost optimization, and compliance. For these reasons, customers want to know what data of theirs is being accessed, when it is being accessed, and who is accessing it. As the amount of data grows, it becomes more challenging to understand access patterns and data usage at a granular level, even more so with large numbers of users.

With trillions of objects stored, Amazon S3 buckets hold a tremendous amount of data. For S3 users, S3 server access logging is a feature that they can use to monitor requests made to their Amazon S3 buckets. These logs can be used to track activity for a variety of use cases, including data access patterns, lifecycle and management activity, security events, and more.

In this blog, I show you how to use Pandas in Python to analyze Amazon S3 server access logs for a couple of common customer use cases: monitoring a static website and monitoring S3 Lifecycle activity. With the outlined solution, you can simplify and strengthen your data monitoring and control at scale, helping you optimize for cost, security, compliance, and more.

Note:

For this blog, I used Jupyter Notebooks to perform my data analysis steps. However, you can also use these steps to create scripts that can generate reports or create datasets which can be used by other Python libraries, such as Seaborn for data visualization. You can find instructions for installing Jupyter Notebooks here.

In addition, I read the S3 server access logs directly from my Amazon S3 bucket. However, if you have a large logging bucket, you may want to consider using AWS Glue to convert the S3 server access log objects before analyzing them with Pandas. To access the logs stored in an S3 bucket, your computer needs to have AWS credentials configured. You can do this through the AWS CLI, or with an IAM role attached to an EC2 instance.
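
If you want to confirm that boto3 can find your credentials before reading any logs, you can run a quick check against AWS STS. This is an optional sketch; it assumes your default credential chain (environment variables, the AWS CLI configuration, or an instance role) is already set up:

import boto3

# Print the AWS account ID that the configured credentials belong to
sts_client = boto3.client('sts')
print(sts_client.get_caller_identity()['Account'])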

Enabling S3 server access logging

To use Amazon S3 server access logs, first enable server access logging on each bucket that you want to monitor. When you enable server access logging on a bucket, you must specify a bucket for log delivery. It is a recommended best practice to have your S3 server access logs delivered to a bucket other than the bucket being monitored, to prevent the creation of logs about logs. Since server access logs must be delivered to a bucket in the same Region as the bucket being monitored, a good strategy would be to create a dedicated bucket for server access logging in each AWS Region. You can then create prefixes in the logging bucket that match the names of buckets you want to monitor, and configure server access logging for each monitored bucket to deliver their logs to the matching prefix in the logging bucket. You can find more information on enabling S3 server access logging in the documentation.
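
If you prefer to configure this programmatically instead of through the console, the boto3 sketch below shows the general shape of the call. The bucket names and prefix are placeholders for illustration, and the target logging bucket must already grant the S3 log delivery service permission to write to it:

import boto3

s3_client = boto3.client('s3')

# Placeholder bucket names; the target bucket must allow S3 log delivery
s3_client.put_bucket_logging(
    Bucket='my-monitored-bucket',
    BucketLoggingStatus={
        'LoggingEnabled': {
            'TargetBucket': 'demo-access-logs-bucket',
            'TargetPrefix': 'my-monitored-bucket/'
        }
    }
)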

In order to centralize your S3 server access logs, you can use S3 Cross-Region Replication on your logging buckets. This can help to consolidate your logs in each Region to a central bucket, either in the same account, or in a centralized logging account. You can find more information on S3 Cross-Region Replication here.
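
As a rough sketch of what that replication configuration can look like with boto3, the call below uses placeholder bucket, account, and role names. Both buckets need versioning enabled, and the IAM role must allow Amazon S3 to replicate objects on your behalf:

import boto3

s3_client = boto3.client('s3')

# Placeholder names; versioning must be enabled on the source and destination buckets
s3_client.put_bucket_replication(
    Bucket='demo-access-logs-bucket',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::111122223333:role/s3-replication-role',
        'Rules': [
            {
                'ID': 'replicate-access-logs',
                'Status': 'Enabled',
                'Priority': 1,
                'Filter': {},
                'DeleteMarkerReplication': {'Status': 'Disabled'},
                'Destination': {'Bucket': 'arn:aws:s3:::central-logging-bucket'}
            }
        ]
    }
)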

Dependencies and setup

To begin, install the boto3 and Pandas libraries. Because the examples in this post use Pandas to read log objects directly from s3:// paths, you also need the s3fs package, which Pandas uses for S3 file access. You can install these by running the following commands:

pip install boto3
pip install pandas
pip install s3fs

Once installed, you can import the required libraries. I used the os library for working with local files when writing data out, boto3 for listing the S3 server access logs in the S3 logging bucket, and Pandas for analyzing the data.

import os
import boto3
import pandas as pd

Next, I set a variable with the name of my S3 logging bucket, and created an Amazon S3 client using the boto3 library.

bucket = 'demo-access-logs-bucket'
s3_client = boto3.client('s3')

Preparing Amazon S3 server access logs for analysis

For this example, I created an Amazon S3 static website and configured S3 server access logging for the static website bucket. The prefix that the logs are delivered to in my logging bucket matches the name of the static website bucket. Before reading the S3 server access log objects from my logging bucket, I first list the objects in the logging bucket under the prefix for my static website bucket, extracting each log object's key from the “Contents” section of the paginated response. Each log object key is then added to a list called log_objects.

log_objects = []

paginator = s3_client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket = bucket, Prefix = 'demo-cf-origin')
for each in result:
    key_list = each['Contents']
    for key in key_list:
        log_objects.append(key['Key'])

Next, I create an empty list called log_data. I use a for loop to read each log object in the log_objects list from Amazon S3 and append it to the log_data list, applying the S3 server access log field names as column headings. I then use Pandas to concatenate the log_data list into a single data frame, called df. Last, I call the info method on the data frame to check its structure.

log_data = []
for log_key in log_objects:
    log_data.append(pd.read_csv('s3://' + bucket + '/' + log_key, sep = " ", names=['Bucket_Owner', 'Bucket', 'Time', 'Time_Offset', 'Remote_IP', 'Requester_ARN/Canonical_ID',
               'Request_ID',
               'Operation', 'Key', 'Request_URI', 'HTTP_status', 'Error_Code', 'Bytes_Sent', 'Object_Size',
               'Total_Time',
               'Turn_Around_Time', 'Referrer', 'User_Agent', 'Version_Id', 'Host_Id', 'Signature_Version',
               'Cipher_Suite',
               'Authentication_Type', 'Host_Header', 'TLS_version'],
        usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]))

df = pd.concat(log_data)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3883 entries, 0 to 0
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Bucket_Owner                3883 non-null   object
 1   Bucket                      3883 non-null   object
 2   Time                        3883 non-null   object
 3   Time_Offset                 3883 non-null   object
 4   Remote_IP                   3883 non-null   object
 5   Requester_ARN/Canonical_ID  3883 non-null   object
 6   Request_ID                  3883 non-null   object
 7   Operation                   3883 non-null   object
 8   Key                         3883 non-null   object
 9   Request_URI                 3883 non-null   object
 10  HTTP_status                 3883 non-null   object
 11  Error_Code                  3883 non-null   object
 12  Bytes_Sent                  3883 non-null   object
 13  Object_Size                 3883 non-null   object
 14  Total_Time                  3883 non-null   object
 15  Turn_Around_Time            3883 non-null   object
 16  Referrer                    3883 non-null   object
 17  User_Agent                  3883 non-null   object
 18  Version_Id                  3883 non-null   object
 19  Host_Id                     3883 non-null   object
 20  Signature_Version           3883 non-null   object
 21  Cipher_Suite                3883 non-null   object
 22  Authentication_Type         3883 non-null   object
 23  Host_Header                 3883 non-null   object
 24  TLS_version                 3883 non-null   object
dtypes: object(25)
memory usage: 788.7+ KB

Now that we have our S3 server access logs read into a Pandas data frame, we can start analyzing our S3 activity.
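
Because every column was read in as a string (dtype object), you may also want to coerce a few columns before querying. The optional sketch below, based on the log format shown above, converts the numeric columns and builds a timezone-aware timestamp from the bracketed Time and Time_Offset fields:

# Optional cleanup: 'coerce' turns non-numeric values such as '-' into NaN instead of raising an error
for col in ['HTTP_status', 'Bytes_Sent', 'Object_Size', 'Total_Time']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Time looks like '[14/Dec/2021:10:50:38' and Time_Offset looks like '+0000]'
df['Timestamp'] = pd.to_datetime(
    df['Time'].str.strip('[') + ' ' + df['Time_Offset'].str.strip(']'),
    format='%d/%b/%Y:%H:%M:%S %z'
)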

Monitoring a static website

For an S3 static website, it may be useful to see which objects are being requested, where requests are coming from, how many and what kinds of errors your visitors encounter, and which content is returning those errors. For these exercises, I use the value_counts function to count how many times various S3 requests appear in the logs.

  1. See which objects are most frequently accessed

df[(df['Operation'] == 'REST.GET.OBJECT')]['Key'].value_counts()

Output:

favicon.ico              161
index.html                99
puppy.jpg                 39
goldfish.jpg              26
kitten.jpg                18
s3-sal-pandas-02.html      1
Name: Key, dtype: int64

  2. Create a graph showing your most frequently accessed objects

top_five_objects = df[(df['Operation'] == 'REST.GET.OBJECT')]['Key'].value_counts().nlargest(5)
top_five_objects.plot.pie(label='')

Output:

<AxesSubplot:>

(Pie chart of the most frequently accessed objects)

  3. See the response codes for your bucket

response_codes = df['HTTP_status'].value_counts()
response_codes.plot.bar()

Output:

<AxesSubplot:>

(Bar chart of HTTP response codes for the bucket)

  4. See which IP addresses are downloading objects from your static website

df[(df['Operation'] == 'REST.GET.OBJECT')]['Remote_IP'].value_counts()

Output:

130.176.137.70     11
130.176.137.131    10
130.176.137.139     9
130.176.137.134     8
130.176.137.143     7
                   ..
64.252.73.141       1
64.252.73.213       1
130.176.137.75      1
70.132.33.155       1
64.252.118.85       1
Name: Remote_IP, Length: 115, dtype: int64

  5. See which IP addresses are getting access denied errors

df[(df['HTTP_status'] == 403)]['Remote_IP'].value_counts()

Output:

130.176.137.70     7
130.176.137.98     5
130.176.137.131    5
130.176.137.134    5
130.176.137.143    5
                  ..
64.252.73.192      1
70.132.33.100      1
64.252.122.201     1
70.132.33.135      1
130.176.137.68     1
Name: Remote_IP, Length: 79, dtype: int64

  6. Find out which Amazon S3 keys are returning access denied

df[(df['HTTP_status'] == 403)]['Key'].value_counts()

Output:

favicon.ico    161
Name: Key, dtype: int64

Monitor Amazon S3 Lifecycle management

A frequent question that customers have is how they can tell whether their S3 Lifecycle rules are working. S3 server access logging includes information on activity performed by S3 Lifecycle processing, including object expirations and object transitions.

For this exercise, I created a new data frame for logs stored under a different prefix in the same centralized logging bucket. This time the prefix matches the name of an S3 bucket that has lifecycle rules enabled. I go through the same steps to create the data frame as I did for the S3 static website bucket.

lifecycle_log_objects = []

paginator = s3_client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket = bucket, Prefix = 'demo-lifecycle')
for each in result:
    key_list = each['Contents']
    for key in key_list:
        lifecycle_log_objects.append(key['Key'])

lifecycle_log_data = []
for lifecycle_log in lifecycle_log_objects:
    lifecycle_log_data.append(pd.read_csv('s3://' + bucket + '/' + lifecycle_log, sep = " ", names=['Bucket_Owner', 'Bucket', 'Time', 'Time_Offset', 'Remote_IP', 'Requester_ARN/Canonical_ID',
               'Request_ID',
               'Operation', 'Key', 'Request_URI', 'HTTP_status', 'Error_Code', 'Bytes_Sent', 'Object_Size',
               'Total_Time',
               'Turn_Around_Time', 'Referrer', 'User_Agent', 'Version_Id', 'Host_Id', 'Signature_Version',
               'Cipher_Suite',
               'Authentication_Type', 'Host_Header', 'TLS_version'],
        usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]))

lifecycle_df = pd.concat(lifecycle_log_data)
lifecycle_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4609 entries, 0 to 0
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Bucket_Owner                4609 non-null   object
 1   Bucket                      4609 non-null   object
 2   Time                        4609 non-null   object
 3   Time_Offset                 4609 non-null   object
 4   Remote_IP                   4609 non-null   object
 5   Requester_ARN/Canonical_ID  4609 non-null   object
 6   Request_ID                  4609 non-null   object
 7   Operation                   4609 non-null   object
 8   Key                         4609 non-null   object
 9   Request_URI                 4609 non-null   object
 10  HTTP_status                 4609 non-null   object
 11  Error_Code                  4609 non-null   object
 12  Bytes_Sent                  4609 non-null   object
 13  Object_Size                 4609 non-null   object
 14  Total_Time                  4609 non-null   object
 15  Turn_Around_Time            4609 non-null   object
 16  Referrer                    4609 non-null   object
 17  User_Agent                  4609 non-null   object
 18  Version_Id                  4526 non-null   object
 19  Host_Id                     4609 non-null   object
 20  Signature_Version           4609 non-null   object
 21  Cipher_Suite                4609 non-null   object
 22  Authentication_Type         4609 non-null   object
 23  Host_Header                 4609 non-null   object
 24  TLS_version                 4609 non-null   object
dtypes: object(25)
memory usage: 936.2+ KB

  1. Get a count of lifecycle operations performed

For my test, I uploaded 40 objects to three different prefixes in my Amazon S3 bucket and applied S3 Lifecycle rules, based on prefix name, to either expire the objects or transition them to S3 Glacier Deep Archive (note that the server access logs record transitions to S3 Glacier Deep Archive as a separate operation from transitions to other S3 storage classes). I later uploaded additional objects to the expiration prefix to provide additional examples.

lifecycle_df[(lifecycle_df['Requester_ARN/Canonical_ID'] == 'AmazonS3')]['Operation'].value_counts()

Output:

S3.EXPIRE.OBJECT            180
S3.CREATE.DELETEMARKER       46
S3.TRANSITION_GDA.OBJECT     45
S3.TRANSITION.OBJECT         41
Name: Operation, dtype: int64

  2. Get a list of objects that have been expired and the date they were expired

You can also use this to generate reports on object transitions by changing the operation value in the filter. You can find a list of the S3 Lifecycle operations that appear in server access logs in the documentation. For this example, I joined the Time and Time_Offset columns into a single Date column.

lifecycle_df['Date'] = lifecycle_df[['Time', 'Time_Offset']].agg(' '.join, axis=1)
lifecycle_df[(lifecycle_df['Operation'] == 'S3.EXPIRE.OBJECT')][['Key', 'Date']]

Output:

Key Date
0 folder2/test001.txt [14/Dec/2021:10:50:38 +0000]
1 folder2/test002.txt [14/Dec/2021:10:50:38 +0000]
0 expire/ [01/Jul/2021:18:59:38 +0000]
1 expire/test21.txt [01/Jul/2021:18:59:39 +0000]
2 expire/test12.txt [01/Jul/2021:18:59:39 +0000]
... ... ...
41 expiration/test39.txt [04/Nov/2021:18:47:52 +0000]
42 expiration/test43.txt [04/Nov/2021:18:47:52 +0000]
43 expiration/test41.txt [04/Nov/2021:18:47:52 +0000]
44 expiration/test44.txt [04/Nov/2021:18:47:53 +0000]
45 expiration/test45.txt [04/Nov/2021:18:47:53 +0000]

180 rows × 2 columns

  3. Get a list of objects that were expired on a specific day

lifecycle_df['Date'] = lifecycle_df[['Time', 'Time_Offset']].agg(' '.join, axis=1)
lifecycle_df[(lifecycle_df['Operation'] == 'S3.EXPIRE.OBJECT') & (lifecycle_df['Date'].str.contains('01/Jul/2021'))][['Key', 'Date']]

Output:

Key Date
0 expire/ [01/Jul/2021:18:59:38 +0000]
1 expire/test21.txt [01/Jul/2021:18:59:39 +0000]
2 expire/test12.txt [01/Jul/2021:18:59:39 +0000]
3 expire/test39.txt [01/Jul/2021:18:59:39 +0000]
4 expire/test17.txt [01/Jul/2021:18:59:39 +0000]
5 expire/test32.txt [01/Jul/2021:18:59:39 +0000]
6 expire/test26.txt [01/Jul/2021:18:59:39 +0000]
7 expire/test10.txt [01/Jul/2021:18:59:39 +0000]
8 expire/test34.txt [01/Jul/2021:18:59:39 +0000]
9 expire/test27.txt [01/Jul/2021:18:59:39 +0000]
10 expire/test19.txt [01/Jul/2021:18:59:39 +0000]
11 expire/test29.txt [01/Jul/2021:18:59:39 +0000]
12 expire/test36.txt [01/Jul/2021:18:59:39 +0000]
13 expire/test15.txt [01/Jul/2021:18:59:39 +0000]
14 expire/test20.txt [01/Jul/2021:18:59:39 +0000]
15 expire/test14.txt [01/Jul/2021:18:59:39 +0000]
16 expire/test33.txt [01/Jul/2021:18:59:39 +0000]
17 expire/test07.txt [01/Jul/2021:18:59:39 +0000]
18 expire/test02.txt [01/Jul/2021:18:59:39 +0000]
19 expire/test22.txt [01/Jul/2021:18:59:39 +0000]
20 expire/test38.txt [01/Jul/2021:18:59:39 +0000]
21 expire/test06.txt [01/Jul/2021:18:59:39 +0000]
22 expire/test03.txt [01/Jul/2021:18:59:39 +0000]
23 expire/test37.txt [01/Jul/2021:18:59:39 +0000]
24 expire/test04.txt [01/Jul/2021:18:59:39 +0000]
25 expire/test23.txt [01/Jul/2021:18:59:39 +0000]
26 expire/test25.txt [01/Jul/2021:18:59:39 +0000]
27 expire/test13.txt [01/Jul/2021:18:59:39 +0000]
28 expire/test01.txt [01/Jul/2021:18:59:39 +0000]
29 expire/test30.txt [01/Jul/2021:18:59:39 +0000]
30 expire/test28.txt [01/Jul/2021:18:59:39 +0000]
31 expire/test16.txt [01/Jul/2021:18:59:39 +0000]
32 expire/test18.txt [01/Jul/2021:18:59:39 +0000]
33 expire/test24.txt [01/Jul/2021:18:59:39 +0000]
34 expire/test11.txt [01/Jul/2021:18:59:39 +0000]
35 expire/test40.txt [01/Jul/2021:18:59:39 +0000]
36 expire/test05.txt [01/Jul/2021:18:59:39 +0000]
37 expire/test08.txt [01/Jul/2021:18:59:39 +0000]
38 expire/test35.txt [01/Jul/2021:18:59:39 +0000]
39 expire/test31.txt [01/Jul/2021:18:59:39 +0000]
40 expire/test09.txt [01/Jul/2021:18:59:39 +0000]

  4. Write a list of expired object keys to a file

expired_object_keys = lifecycle_df[(lifecycle_df['Operation'] == 'S3.EXPIRE.OBJECT')]['Key']
with open('expired_objects_list.csv', 'w') as f:
    for key in expired_object_keys:
        f.write("%s\n" % key)

  5. Get the UTC timestamp when a specific key was expired

Tip: You can find the same information for deletions, transitions, and other operations by changing the Operation value.

lifecycle_df['Date'] = lifecycle_df[['Time', 'Time_Offset']].agg(' '.join, axis=1)
expirations = lifecycle_df[(lifecycle_df['Operation'] == 'S3.EXPIRE.OBJECT')]
expirations[(expirations['Key'] == 'expiration/test25.txt')][['Key','Date']]

Output:

Key Date
26 expiration/test25.txt [07/Aug/2021:00:34:36 +0000]
30 expiration/test25.txt [03/Nov/2021:21:16:20 +0000]
37 expiration/test25.txt [04/Nov/2021:18:47:51 +0000]

Cleaning up

While there is no additional cost for S3 server access logging, you are billed for the cost of log storage and the S3 requests for delivering the logs to your logging bucket. To stop S3 server access logging, you can go to the Properties tab of any bucket that you enabled logging on, and click the Edit button on the Server access logging panel. In the edit window, select Disabled and then click Save changes. You can also delete the S3 server access logs from your log delivery bucket so that you do not incur any additional storage charges.
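
If you enabled server access logging with boto3, you can also turn it off the same way. Passing an empty BucketLoggingStatus disables logging for the bucket (the bucket name below is a placeholder):

import boto3

s3_client = boto3.client('s3')

# An empty BucketLoggingStatus turns server access logging off for this bucket
s3_client.put_bucket_logging(
    Bucket='my-monitored-bucket',
    BucketLoggingStatus={}
)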

Conclusion

In this blog post, I showed you how to monitor Amazon S3 activity and usage at a granular level using S3 server access logs and Pandas in Python. I gave examples of creating a Pandas data frame from the S3 server access log data, and of monitoring activity for an S3 static website as well as S3 Lifecycle management activity. Using this method of analysis can give you important insights into S3 usage, S3 management activity, and other aspects of your S3 buckets.

The preceding examples are by no means the limit of what you can do with S3 server access logs and Pandas. You can adapt these code examples to a variety of additional use cases, including security monitoring, billing, or whatever else you can think of. You can find a full listing of all of the columns included in S3 server access logs in the documentation.
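
As one example of a security-oriented query, the sketch below reuses the df data frame from earlier to surface anonymous activity; in S3 server access logs, the requester field is recorded as '-' when a request is made without authentication:

# Keys requested anonymously, and the IP addresses making those requests
anonymous_requests = df[df['Requester_ARN/Canonical_ID'] == '-']
anonymous_requests['Key'].value_counts()
anonymous_requests['Remote_IP'].value_counts()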

Thanks for reading this blog post on monitoring your S3 activity using Python Pandas and S3 server access logs. If you have any comments or questions, don’t hesitate to leave them in the comments section.