AWS Storage Blog
Monitor Amazon S3 activity using S3 server access logs and Pandas in Python
Monitoring and controlling access to data is often essential for security, cost optimization, and compliance. For these reasons, customers want to know what data of theirs is being accessed, when it is being accessed, and who is accessing it. As the amount of data grows, it becomes more challenging to understand access patterns and data usage at a granular level, even more so with large numbers of users.
With trillions of objects stored, Amazon S3 buckets hold a tremendous amount of data. Amazon S3 server access logging is a feature that S3 users can use to monitor requests made to their buckets. These logs can be used to track activity for a variety of use cases, including data access patterns, lifecycle and management activity, security events, and more.
In this blog, I show you how to use Pandas in Python to analyze Amazon S3 server access logs for a couple of common customer use cases: monitoring a static website and monitoring S3 Lifecycle activity. With the outlined solution, you can simplify and strengthen your data monitoring and control at scale to optimize for cost, security, compliance, and more.
Note:
For this blog, I used Jupyter Notebooks to perform my data analysis steps. However, you can also use these steps to create scripts that can generate reports or create datasets which can be used by other Python libraries, such as Seaborn for data visualization. You can find instructions for installing Jupyter Notebooks here.
In addition, I read the S3 server access logs directly from my Amazon S3 bucket. However, if you have a large logging bucket, you may want to consider using AWS Glue to convert the S3 server access log objects before analyzing them with Pandas. In order to access the logs stored in an S3 bucket, your computer needs to have AWS credentials configured. You can do this through the AWS CLI, or with an IAM role attached to an EC2 instance.
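As a note on credentials: the boto3 client created later in this post picks up the default credential chain automatically. If your credentials live in a named AWS CLI profile instead, one option is an explicit session; a minimal sketch (the profile name is a placeholder):

import boto3

# Create the S3 client from a named profile instead of the default credentials
session = boto3.Session(profile_name='log-analysis')   # placeholder profile name
s3_client = session.client('s3')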
Enabling S3 server access logging
To use Amazon S3 server access logs, first enable server access logging on each bucket that you want to monitor. When you enable server access logging on a bucket, you must specify a bucket for log delivery. It is a recommended best practice to have your S3 server access logs delivered to a bucket other than the bucket being monitored, to prevent the creation of logs about logs. Since server access logs must be delivered to a bucket in the same Region as the bucket being monitored, a good strategy would be to create a dedicated bucket for server access logging in each AWS Region. You can then create prefixes in the logging bucket that match the names of buckets you want to monitor, and configure server access logging for each monitored bucket to deliver their logs to the matching prefix in the logging bucket. You can find more information on enabling S3 server access logging in the documentation.
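If you prefer to enable logging programmatically, here is a minimal boto3 sketch. The bucket names follow the examples later in this post, and it assumes the logging bucket already exists in the same Region and already grants Amazon S3 permission to deliver logs to it:

import boto3

s3_client = boto3.client('s3')

# Enable server access logging on a monitored bucket, delivering logs to a prefix
# named after that bucket in the dedicated logging bucket
s3_client.put_bucket_logging(
    Bucket='demo-cf-origin',                             # placeholder: the bucket to monitor
    BucketLoggingStatus={
        'LoggingEnabled': {
            'TargetBucket': 'demo-access-logs-bucket',   # the logging bucket used in this post
            'TargetPrefix': 'demo-cf-origin/'            # prefix matching the monitored bucket name
        }
    }
)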
In order to centralize your S3 server access logs, you can use S3 Cross-Region Replication on your logging buckets. This can help to consolidate your logs in each Region to a central bucket, either in the same account, or in a centralized logging account. You can find more information on S3 Cross-Region Replication here.
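If you want to script this as well, here is a hedged sketch of a replication rule on a Regional logging bucket. It is not a complete setup (Cross-Region Replication also requires versioning on both buckets and an IAM role that Amazon S3 can assume for replication), and the bucket names and role ARN are placeholders:

import boto3

s3_client = boto3.client('s3')

# Replicate everything in a Regional logging bucket to a central logging bucket
s3_client.put_bucket_replication(
    Bucket='demo-access-logs-bucket-us-west-2',          # placeholder Regional logging bucket
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::111122223333:role/s3-log-replication-role',   # placeholder role ARN
        'Rules': [{
            'ID': 'replicate-server-access-logs',
            'Priority': 1,
            'Filter': {},                                # empty filter: apply to all objects
            'Status': 'Enabled',
            'DeleteMarkerReplication': {'Status': 'Disabled'},
            'Destination': {'Bucket': 'arn:aws:s3:::demo-central-logging-bucket'}   # placeholder central bucket
        }]
    }
)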
Dependencies and setup
To begin, you must install the boto3 and Pandas libraries. You can do this by running the following commands:
pip install boto3
pip install pandas
Once installed, you can import the required libraries. I used the os library for working with local files, boto3 for listing the S3 server access logs in the S3 logging bucket, and Pandas for reading and analyzing the log data.
import os
import boto3
import pandas as pd
Next, I set a parameter for my S3 logging bucket, and created an Amazon S3 client using the boto3 library.
bucket = 'demo-access-logs-bucket'
s3_client = boto3.client('s3')
Preparing Amazon S3 server access logs for analysis
For this example, I created an Amazon S3 static website, and configured the S3 server access logs for the static website bucket. The prefix that the logs are delivered to in my logging bucket matches the name of the static website bucket. Before reading the S3 server access log objects from my logging bucket, I first list the objects in the logging bucket under the prefix for my static website bucket, extracting each log object key from the “Contents” section of each page of the response. Each log object key is then added to a list called log_objects.
log_objects = []
paginator = s3_client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket=bucket, Prefix='demo-cf-origin')

# Collect the key of every access log object delivered under the static website prefix
for each in result:
    key_list = each['Contents']
    for key in key_list:
        log_objects.append(key['Key'])
Next, I create an empty list called log_data. I use a for loop to read each log object in the log_objects list from Amazon S3 and append the result to the log_data list, assigning the S3 server access log field names as column headings. I then use Pandas to concatenate the log_data list into a single data frame, called df. Last, I call info() on the data frame to see its structure.
log_data = []

# Read each log object directly from Amazon S3 and parse it into a data frame,
# assigning the S3 server access log field names as column headings
for log_key in log_objects:
    log_data.append(pd.read_csv('s3://' + bucket + '/' + log_key, sep=' ',
                                names=['Bucket_Owner', 'Bucket', 'Time', 'Time_Offset', 'Remote_IP',
                                       'Requester_ARN/Canonical_ID', 'Request_ID', 'Operation', 'Key',
                                       'Request_URI', 'HTTP_status', 'Error_Code', 'Bytes_Sent',
                                       'Object_Size', 'Total_Time', 'Turn_Around_Time', 'Referrer',
                                       'User_Agent', 'Version_Id', 'Host_Id', 'Signature_Version',
                                       'Cipher_Suite', 'Authentication_Type', 'Host_Header', 'TLS_version'],
                                usecols=range(25)))
df = pd.concat(log_data)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3883 entries, 0 to 0
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bucket_Owner 3883 non-null object
1 Bucket 3883 non-null object
2 Time 3883 non-null object
3 Time_Offset 3883 non-null object
4 Remote_IP 3883 non-null object
5 Requester_ARN/Canonical_ID 3883 non-null object
6 Request_ID 3883 non-null object
7 Operation 3883 non-null object
8 Key 3883 non-null object
9 Request_URI 3883 non-null object
10 HTTP_status 3883 non-null object
11 Error_Code 3883 non-null object
12 Bytes_Sent 3883 non-null object
13 Object_Size 3883 non-null object
14 Total_Time 3883 non-null object
15 Turn_Around_Time 3883 non-null object
16 Referrer 3883 non-null object
17 User_Agent 3883 non-null object
18 Version_Id 3883 non-null object
19 Host_Id 3883 non-null object
20 Signature_Version 3883 non-null object
21 Cipher_Suite 3883 non-null object
22 Authentication_Type 3883 non-null object
23 Host_Header 3883 non-null object
24 TLS_version 3883 non-null object
dtypes: object(25)
memory usage: 788.7+ KB
Now that we have our S3 server access logs read into a Pandas data frame, we can start analyzing our S3 activity.
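Note that the Time and Time_Offset fields are plain strings. If you want to filter or sort by time, here is a minimal sketch that parses them into a timezone-aware datetime column, assuming the standard bracketed access log time format:

# Combine the bracketed time fields and parse them into a timezone-aware datetime column
df['Timestamp'] = pd.to_datetime(df['Time'] + ' ' + df['Time_Offset'],
                                 format='[%d/%b/%Y:%H:%M:%S %z]')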
Monitoring a static website
For an S3 static website, it may be useful to see which objects are being requested, where requests are coming from, how many errors your visitors are getting, what kind of errors your visitors are getting, and which content is returning those errors. For these exercises, I use the value_counts
function to get information on the number of times various S3 requests appear in the logs.
- See which objects are most frequently accessed
df[(df['Operation'] == 'REST.GET.OBJECT')]['Key'].value_counts()
Output:
favicon.ico 161
index.html 99
puppy.jpg 39
goldfish.jpg 26
kitten.jpg 18
s3-sal-pandas-02.html 1
Name: Key, dtype: int64
- Create a graph showing your most frequently accessed objects
top_five_objects = df[(df['Operation'] == 'REST.GET.OBJECT')]['Key'].value_counts().nlargest(5)
top_five_objects.plot.pie(label='')
Output:
<AxesSubplot:>
- See the response codes for your bucket
response_codes = df['HTTP_status'].value_counts()
response_codes.plot.bar()
Output:
<AxesSubplot:>
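If you run these steps as a script rather than in a notebook, you can save the charts to image files for reporting; a minimal sketch using the response-code bar chart above (the file name is arbitrary):

import matplotlib.pyplot as plt

# Render the bar chart and save it to a local PNG instead of displaying it inline
ax = response_codes.plot.bar()
ax.figure.savefig('response_codes.png', bbox_inches='tight')
plt.close(ax.figure)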
- See which IP addresses are downloading objects from your static website
df[(df['Operation'] == 'REST.GET.OBJECT')]['Remote_IP'].value_counts()
Output:
130.176.137.70 11
130.176.137.131 10
130.176.137.139 9
130.176.137.134 8
130.176.137.143 7
..
64.252.73.141 1
64.252.73.213 1
130.176.137.75 1
70.132.33.155 1
64.252.118.85 1
Name: Remote_IP, Length: 115, dtype: int64
- See which IP addresses are getting access denied errors
df[(df['HTTP_status'] == 403)]['Remote_IP'].value_counts()
Output:
130.176.137.70 7
130.176.137.98 5
130.176.137.131 5
130.176.137.134 5
130.176.137.143 5
..
64.252.73.192 1
70.132.33.100 1
64.252.122.201 1
70.132.33.135 1
130.176.137.68 1
Name: Remote_IP, Length: 79, dtype: int64
- Find out which Amazon S3 keys are returning access denied
df[(df['HTTP_status'] == 403)]['Key'].value_counts()
Output:
favicon.ico 161
Name: Key, dtype: int64
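You can also combine these views. For example, here is a hedged sketch that counts error responses per key and status code; pd.to_numeric is used because the concatenated HTTP_status column has an object dtype:

# Count 4xx/5xx responses per key and status code to spot problem content
errors = df[pd.to_numeric(df['HTTP_status'], errors='coerce') >= 400]
errors.groupby(['Key', 'HTTP_status']).size().sort_values(ascending=False).head(10)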
Monitoring Amazon S3 Lifecycle management
A frequent question that customers have is how they can tell whether their S3 Lifecycle rules are working. S3 server access logging includes information on activity performed by S3 Lifecycle processing, including object expirations and object transitions.
For this exercise, I created a new data frame for logs stored under a different prefix in the same centralized logging bucket. This time the prefix matches the name of an S3 bucket that has lifecycle rules enabled. I go through the same steps to create the data frame as I did for the S3 static website bucket.
lifecycle_log_objects = []
paginator = s3_client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket=bucket, Prefix='demo-lifecycle')

# Collect the key of every access log object delivered under the lifecycle bucket's prefix
for each in result:
    key_list = each['Contents']
    for key in key_list:
        lifecycle_log_objects.append(key['Key'])
lifecycle_log_data = []

# Read each lifecycle log object from Amazon S3, using the same column headings as before
for lifecycle_log in lifecycle_log_objects:
    lifecycle_log_data.append(pd.read_csv('s3://' + bucket + '/' + lifecycle_log, sep=' ',
                                          names=['Bucket_Owner', 'Bucket', 'Time', 'Time_Offset', 'Remote_IP',
                                                 'Requester_ARN/Canonical_ID', 'Request_ID', 'Operation', 'Key',
                                                 'Request_URI', 'HTTP_status', 'Error_Code', 'Bytes_Sent',
                                                 'Object_Size', 'Total_Time', 'Turn_Around_Time', 'Referrer',
                                                 'User_Agent', 'Version_Id', 'Host_Id', 'Signature_Version',
                                                 'Cipher_Suite', 'Authentication_Type', 'Host_Header', 'TLS_version'],
                                          usecols=range(25)))
lifecycle_df = pd.concat(lifecycle_log_data)
lifecycle_df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4609 entries, 0 to 0
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bucket_Owner 4609 non-null object
1 Bucket 4609 non-null object
2 Time 4609 non-null object
3 Time_Offset 4609 non-null object
4 Remote_IP 4609 non-null object
5 Requester_ARN/Canonical_ID 4609 non-null object
6 Request_ID 4609 non-null object
7 Operation 4609 non-null object
8 Key 4609 non-null object
9 Request_URI 4609 non-null object
10 HTTP_status 4609 non-null object
11 Error_Code 4609 non-null object
12 Bytes_Sent 4609 non-null object
13 Object_Size 4609 non-null object
14 Total_Time 4609 non-null object
15 Turn_Around_Time 4609 non-null object
16 Referrer 4609 non-null object
17 User_Agent 4609 non-null object
18 Version_Id 4526 non-null object
19 Host_Id 4609 non-null object
20 Signature_Version 4609 non-null object
21 Cipher_Suite 4609 non-null object
22 Authentication_Type 4609 non-null object
23 Host_Header 4609 non-null object
24 TLS_version 4609 non-null object
dtypes: object(25)
memory usage: 936.2+ KB
- Get a count of lifecycle operations performed
For my test, I uploaded 40 objects to three different prefixes in my Amazon S3 bucket and applied S3 Lifecycle rules, keyed on prefix, that expire objects or transition them to other storage classes, including S3 Glacier Deep Archive (note that the logged S3 Lifecycle operations differentiate transitions to S3 Glacier Deep Archive from transitions to other S3 storage classes). I later uploaded additional objects to the expiration prefix to provide additional examples. A hedged sketch of a similar Lifecycle configuration follows the output below.
lifecycle_df[(lifecycle_df['Requester_ARN/Canonical_ID'] == 'AmazonS3')]['Operation'].value_counts()
Output:
S3.EXPIRE.OBJECT 180
S3.CREATE.DELETEMARKER 46
S3.TRANSITION_GDA.OBJECT 45
S3.TRANSITION.OBJECT 41
Name: Operation, dtype: int64
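If you want to script a comparable setup, here is a hedged sketch of Lifecycle rules using boto3; the bucket name, prefixes, rule IDs, and day counts are placeholders rather than the exact configuration used for this test:

import boto3

s3_client = boto3.client('s3')

# Hedged sketch: one rule expires objects under one prefix, another transitions
# objects under a second prefix to S3 Glacier Deep Archive
s3_client.put_bucket_lifecycle_configuration(
    Bucket='demo-lifecycle',                            # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire-objects',
                'Filter': {'Prefix': 'expiration/'},
                'Status': 'Enabled',
                'Expiration': {'Days': 1}
            },
            {
                'ID': 'archive-objects',
                'Filter': {'Prefix': 'archive/'},       # hypothetical prefix
                'Status': 'Enabled',
                'Transitions': [{'Days': 1, 'StorageClass': 'DEEP_ARCHIVE'}]
            }
        ]
    }
)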
- Get a list of objects that have been expired and the date they were expired
You can also use this to generate reports on object transitions by changing the operation name in the filter (a short example follows the output below). You can find a list of API operations for S3 Lifecycle in the documentation. For this example, I joined the Time and Time_Offset columns into a single Date column.
lifecycle_df['Date'] = lifecycle_df[['Time', 'Time_Offset']].agg(' '.join, axis=1)
lifecycle_df[(lifecycle_df['Operation'] == 'S3.EXPIRE.OBJECT')][['Key', 'Date']]
Output:
|   | Key | Date |
| --- | --- | --- |
0 | folder2/test001.txt | [14/Dec/2021:10:50:38 +0000] |
1 | folder2/test002.txt | [14/Dec/2021:10:50:38 +0000] |
0 | expire/ | [01/Jul/2021:18:59:38 +0000] |
1 | expire/test21.txt | [01/Jul/2021:18:59:39 +0000] |
2 | expire/test12.txt | [01/Jul/2021:18:59:39 +0000] |
… | … | … |
41 | expiration/test39.txt | [04/Nov/2021:18:47:52 +0000] |
42 | expiration/test43.txt | [04/Nov/2021:18:47:52 +0000] |
43 | expiration/test41.txt | [04/Nov/2021:18:47:52 +0000] |
44 | expiration/test44.txt | [04/Nov/2021:18:47:53 +0000] |
45 | expiration/test45.txt | [04/Nov/2021:18:47:53 +0000] |
180 rows × 2 columns
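For example, to produce the same report for transitions to S3 Glacier Deep Archive, filter on the operation name that appeared in the value_counts output earlier:

# Same report, filtered on the Deep Archive transition operation instead of expiration
lifecycle_df[(lifecycle_df['Operation'] == 'S3.TRANSITION_GDA.OBJECT')][['Key', 'Date']]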
- Get a list of objects that were expired on a specific day
lifecycle_df['Date'] = lifecycle_df[['Time', 'Time_Offset']].agg(' '.join, axis=1)
lifecycle_df[(lifecycle_df['Operation'] == 'S3.EXPIRE.OBJECT') & (lifecycle_df['Date'].str.contains('01/Jul/2021'))][['Key', 'Date']]
Output:
|   | Key | Date |
| --- | --- | --- |
0 | expire/ | [01/Jul/2021:18:59:38 +0000] |
1 | expire/test21.txt | [01/Jul/2021:18:59:39 +0000] |
2 | expire/test12.txt | [01/Jul/2021:18:59:39 +0000] |
3 | expire/test39.txt | [01/Jul/2021:18:59:39 +0000] |
4 | expire/test17.txt | [01/Jul/2021:18:59:39 +0000] |
5 | expire/test32.txt | [01/Jul/2021:18:59:39 +0000] |
6 | expire/test26.txt | [01/Jul/2021:18:59:39 +0000] |
7 | expire/test10.txt | [01/Jul/2021:18:59:39 +0000] |
8 | expire/test34.txt | [01/Jul/2021:18:59:39 +0000] |
9 | expire/test27.txt | [01/Jul/2021:18:59:39 +0000] |
10 | expire/test19.txt | [01/Jul/2021:18:59:39 +0000] |
11 | expire/test29.txt | [01/Jul/2021:18:59:39 +0000] |
12 | expire/test36.txt | [01/Jul/2021:18:59:39 +0000] |
13 | expire/test15.txt | [01/Jul/2021:18:59:39 +0000] |
14 | expire/test20.txt | [01/Jul/2021:18:59:39 +0000] |
15 | expire/test14.txt | [01/Jul/2021:18:59:39 +0000] |
16 | expire/test33.txt | [01/Jul/2021:18:59:39 +0000] |
17 | expire/test07.txt | [01/Jul/2021:18:59:39 +0000] |
18 | expire/test02.txt | [01/Jul/2021:18:59:39 +0000] |
19 | expire/test22.txt | [01/Jul/2021:18:59:39 +0000] |
20 | expire/test38.txt | [01/Jul/2021:18:59:39 +0000] |
21 | expire/test06.txt | [01/Jul/2021:18:59:39 +0000] |
22 | expire/test03.txt | [01/Jul/2021:18:59:39 +0000] |
23 | expire/test37.txt | [01/Jul/2021:18:59:39 +0000] |
24 | expire/test04.txt | [01/Jul/2021:18:59:39 +0000] |
25 | expire/test23.txt | [01/Jul/2021:18:59:39 +0000] |
26 | expire/test25.txt | [01/Jul/2021:18:59:39 +0000] |
27 | expire/test13.txt | [01/Jul/2021:18:59:39 +0000] |
28 | expire/test01.txt | [01/Jul/2021:18:59:39 +0000] |
29 | expire/test30.txt | [01/Jul/2021:18:59:39 +0000] |
30 | expire/test28.txt | [01/Jul/2021:18:59:39 +0000] |
31 | expire/test16.txt | [01/Jul/2021:18:59:39 +0000] |
32 | expire/test18.txt | [01/Jul/2021:18:59:39 +0000] |
33 | expire/test24.txt | [01/Jul/2021:18:59:39 +0000] |
34 | expire/test11.txt | [01/Jul/2021:18:59:39 +0000] |
35 | expire/test40.txt | [01/Jul/2021:18:59:39 +0000] |
36 | expire/test05.txt | [01/Jul/2021:18:59:39 +0000] |
37 | expire/test08.txt | [01/Jul/2021:18:59:39 +0000] |
38 | expire/test35.txt | [01/Jul/2021:18:59:39 +0000] |
39 | expire/test31.txt | [01/Jul/2021:18:59:39 +0000] |
40 | expire/test09.txt | [01/Jul/2021:18:59:39 +0000] |
- Write a list of expired object keys to a file.
expired_object_keys = lifecycle_df[(lifecycle_df['Operation'] == 'S3.EXPIRE.OBJECT')]['Key']

# Write one expired object key per line to a local file
with open('expired_objects_list.csv', 'w') as f:
    for key in expired_object_keys:
        f.write("%s\n" % key)
- Get the UTC timestamp when a specific key was expired.
Tip: You can find the same information for deletions, transitions, and other operations by changing the Operation value.
lifecycle_df['Date'] = lifecycle_df[['Time', 'Time_Offset']].agg(' '.join, axis=1)
expirations = lifecycle_df[(lifecycle_df['Operation'] == 'S3.EXPIRE.OBJECT')]
expirations[(expirations['Key'] == 'expiration/test25.txt')][['Key','Date']]
Output:
|   | Key | Date |
| --- | --- | --- |
26 | expiration/test25.txt | [07/Aug/2021:00:34:36 +0000] |
30 | expiration/test25.txt | [03/Nov/2021:21:16:20 +0000] |
37 | expiration/test25.txt | [04/Nov/2021:18:47:51 +0000] |
Cleaning up
While there is no additional cost for S3 server access logging, you are billed for the cost of log storage and the S3 requests for delivering the logs to your logging bucket. To stop S3 server access logging, you can go to the Properties tab of any bucket that you enabled logging on, and click the Edit button on the Server access logging panel. In the edit window, select Disabled and then click Save changes. You can also delete the S3 server access logs from your log delivery bucket so that you do not incur any additional storage charges.
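If you prefer to script the cleanup, you can also disable server access logging with boto3 by sending an empty logging status; a minimal sketch (the bucket name is a placeholder for each bucket you enabled logging on):

import boto3

s3_client = boto3.client('s3')

# An empty BucketLoggingStatus disables server access logging for the bucket
s3_client.put_bucket_logging(
    Bucket='demo-cf-origin',          # placeholder: the monitored bucket
    BucketLoggingStatus={}
)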
Conclusion
In this blog post, I showed you how to monitor Amazon S3 activity and usage at a granular level, using S3 server access logs and Pandas in Python. I also gave examples of creating a Pandas data frame from the S3 server access log data, monitoring activity for S3 static websites, and monitoring S3 Lifecycle management activity. Using this method of analysis can give you important insights into S3 usage activity, S3 management activity, and other aspects of your S3 buckets.
The preceding examples are by no means the limit of what you can do with S3 server access logs and Pandas. You can modify these code examples to meet your needs for a variety of additional use cases, including security monitoring, billing, or whatever else you can think of. You can find a full listing of all of the columns included in S3 server access logs in the documentation.
Thanks for reading this blog post on monitoring your S3 activity using Python Pandas and S3 server access logs. If you have any comments or questions, don’t hesitate to leave them in the comments section.