Why does the EMRFS metadata table for my Amazon EMR cluster have so many records?

Last updated: 2020-11-17

The EMRFS metadata table for my Amazon EMR cluster keeps growing. Why is this happening and how can I stop it?

Short description

When you use EMRFS to delete a file or directory in Amazon Simple Storage Service (Amazon S3), EMRFS adds a delete marker to the corresponding record in the metadata table. However, EMRFS doesn't remove the record itself. Because the metadata table is stored in Amazon DynamoDB, the number of records in that table grows over time. This can cause the following issues:

  • Amazon S3 read/write operations from the EMR cluster might fail because of throttling on the metadata table.
  • The EMRFS sync command takes a long time to complete.

To purge the deleted records, enable Time to Live (TTL) on the metadata table and specify deletionTTL as the TTL attribute. Then, populate the attribute to find and remove records for deleted objects.

Note: TTL doesn't apply to objects that you delete directly in Amazon S3. TTL applies only to objects that you delete with EMRFS.

Resolution

1.    In the DynamoDB console, enable TTL on the metadata table (named EmrFSMetadata by default). For TTL attribute, enter deletionTTL. Alternatively, run the following AWS Command Line Interface (AWS CLI) command, replacing example-EmrFSMetadata with your table name:

$ aws dynamodb update-time-to-live --table-name example-EmrFSMetadata --time-to-live-specification "Enabled=true, AttributeName=deletionTTL"

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
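
To confirm that TTL is active before you continue, you can inspect the output of aws dynamodb describe-time-to-live for the same table. The following is a minimal sketch that parses that JSON output. The function name ttl_enabled_on and the sample string are illustrative assumptions, not part of EMRFS or the AWS CLI. The status can report ENABLING for some time after you run the update command.

```python
import json

def ttl_enabled_on(describe_output: str, attribute: str = "deletionTTL") -> bool:
    """Return True if the describe-time-to-live JSON reports TTL enabled on `attribute`."""
    description = json.loads(describe_output).get("TimeToLiveDescription", {})
    return (
        description.get("TimeToLiveStatus") == "ENABLED"
        and description.get("AttributeName") == attribute
    )

# Sample text shaped like the response from:
#   aws dynamodb describe-time-to-live --table-name example-EmrFSMetadata
sample = (
    '{"TimeToLiveDescription": '
    '{"TimeToLiveStatus": "ENABLED", "AttributeName": "deletionTTL"}}'
)
print(ttl_enabled_on(sample))  # True
```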

2.    To populate the deletionTTL attribute on the table, run the emrfs populate-ttl command. This command checks each record in the metadata table. If a record has a delete marker, EMRFS sets the deletionTTL attribute so that the record expires 24 hours later (the default expiration time). DynamoDB TTL then removes the expired record.

$ emrfs populate-ttl
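
The following is a conceptual sketch of what this step does, using an in-memory list in place of the DynamoDB table. It is not the EMRFS implementation. The "deleted" key is a hypothetical stand-in for the real delete marker, and the sketch assumes that the expiration is counted from the time the command runs. The one detail that is not an assumption is the value format: DynamoDB TTL expects a Unix epoch timestamp in seconds.

```python
import time

DEFAULT_EXPIRATION_SECONDS = 24 * 60 * 60  # 24-hour default described in this article

def populate_ttl(records, now=None, expiration_seconds=DEFAULT_EXPIRATION_SECONDS):
    """Set deletionTTL on each record that has a delete marker and no TTL yet.

    `records` is a list of dicts. The "deleted" key is a hypothetical
    placeholder for the real delete marker in the metadata table.
    Returns the number of records that were updated.
    """
    now = int(time.time()) if now is None else now
    updated = 0
    for record in records:
        if record.get("deleted") and "deletionTTL" not in record:
            # Expiry time as a Unix epoch timestamp in seconds.
            record["deletionTTL"] = now + expiration_seconds
            updated += 1
    return updated

records = [
    {"path": "s3://example-bucket/a", "deleted": True},
    {"path": "s3://example-bucket/b", "deleted": False},
]
print(populate_ttl(records, now=1_600_000_000))  # 1
print(records[0]["deletionTTL"])                 # 1600086400
```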

3.    The populate-ttl command finds records for files that already have delete markers. To automatically remove records for files that you delete in the future, open emrfs-site.xml and then set fs.s3.consistent.metadata.delete.ttl.enabled to true.
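
Because emrfs-site.xml uses the Hadoop configuration layout, each setting is a property element with a name and a value inside the configuration root. The following sketch renders one such element. The helper name hadoop_property is an assumption for illustration. Paste the result inside the existing configuration element rather than replacing the file.

```python
import xml.etree.ElementTree as ET

def hadoop_property(name: str, value: str) -> str:
    """Render one <property> element in the Hadoop configuration XML layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(prop, encoding="unicode")

# Produces a single-line <property> element with <name> and <value> children.
print(hadoop_property("fs.s3.consistent.metadata.delete.ttl.enabled", "true"))
```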

4.    (Optional) To change the expiration time (24 hours by default), set the fs.s3.consistent.metadata.delete.ttl.expiration.seconds property in emrfs-site.xml. For example, to set the expiration time to 2 hours:

"fs.s3.consistent.metadata.delete.ttl.expiration.seconds":"7200"

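The property takes a whole number of seconds, written as a string. To avoid arithmetic mistakes for longer periods, you can derive the value from a duration. The following is a small sketch. The helper name expiration_seconds is an assumption for illustration.

```python
from datetime import timedelta

def expiration_seconds(**duration) -> str:
    """Convert a duration such as hours=2 into the string value for the property."""
    return str(int(timedelta(**duration).total_seconds()))

print(expiration_seconds(hours=2))  # 7200
print(expiration_seconds(days=1))   # 86400, the 24-hour default
```
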
To set all of these properties on new clusters, supply a configuration like this when you create a cluster:

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent.retryPeriodSeconds": "10",
      "fs.s3.consistent": "true",
      "fs.s3.consistent.retryCount": "5",
      "fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
      "fs.s3.consistent.metadata.delete.ttl.enabled": "true",
      "fs.s3.consistent.metadata.delete.ttl.expiration.seconds": "7200"
    }
  }
]
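
If you create clusters from scripts, you can generate this configuration instead of editing it by hand. The following sketch builds the TTL-related part of the configuration with a custom table name and expiration time. The function name emrfs_ttl_configuration is an assumption, and the retry properties are omitted for brevity. All property values must be strings.

```python
import json

def emrfs_ttl_configuration(table_name="EmrFSMetadata", expiration_seconds=7200):
    """Build an emrfs-site classification that enables TTL for deleted records."""
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.consistent": "true",
                "fs.s3.consistent.metadata.tableName": table_name,
                "fs.s3.consistent.metadata.delete.ttl.enabled": "true",
                "fs.s3.consistent.metadata.delete.ttl.expiration.seconds": str(
                    expiration_seconds
                ),
            },
        }
    ]

config_json = json.dumps(emrfs_ttl_configuration(), indent=2)
print(config_json)
```

You can save the printed JSON to a file and pass it to aws emr create-cluster with the --configurations option, for example --configurations file://emrfs-ttl.json (the file name is a placeholder).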
