Apache Hadoop provides the following filesystem clients for reading from and writing to Amazon S3:
- S3N (URI scheme: s3n) - A native filesystem for reading and writing regular files on S3. S3N allows Hadoop to access files on S3 that were written with other tools, and conversely, other tools can access files written to S3N using Hadoop. S3N is stable and widely used, but it is not being updated with any new features. S3N requires a suitable version of the jets3t JAR on the classpath.
- S3A (URI scheme: s3a) - Hadoop’s successor to the S3N filesystem. S3A uses Amazon’s libraries to interact with S3, supports accessing files larger than 5 GB, and provides performance enhancements and other improvements. S3A is backward compatible with S3N: in Apache Hadoop, any object accessible from an s3n:// URL should also be accessible from S3A simply by replacing the URL scheme.
Amazon EMR does not currently support use of the Apache Hadoop S3A file system.
- S3 (URI scheme: s3) - Apache Hadoop implementation of a block-based filesystem backed by S3. Apache Hadoop has deprecated use of this filesystem as of May 2016.
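The backward compatibility noted above is purely at the URI level: only the scheme changes, while the bucket and key stay the same. A minimal sketch of that substitution in Python (the bucket and key names are hypothetical, for illustration only):

```python
from urllib.parse import urlparse, urlunparse

def to_s3a(uri: str) -> str:
    """Rewrite an s3n:// URI to the equivalent s3a:// URI.

    Only the scheme changes; the bucket (netloc) and key (path)
    are preserved, which is why both filesystems can address the
    same objects in Apache Hadoop.
    """
    parts = urlparse(uri)
    if parts.scheme != "s3n":
        raise ValueError(f"expected an s3n:// URI, got {parts.scheme}://")
    return urlunparse(parts._replace(scheme="s3a"))

# Hypothetical bucket and key.
print(to_s3a("s3n://my-bucket/logs/2016/05/part-00000"))
# s3a://my-bucket/logs/2016/05/part-00000
```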
Amazon EMR uses the s3 URI scheme in the EMR documentation. Which of these three URI schemes should I use with EMR?
Because of the differences between the Apache Hadoop S3 filesystems and the Amazon EMR S3 filesystem, it is not always clear which URI scheme and filesystem to use with Amazon EMR.
For Amazon EMR, both the s3:// and s3n:// URIs are associated with the EMR filesystem and are functionally interchangeable in the context of Amazon EMR. For consistency, however, it is recommended to use the s3:// URI with Amazon EMR.
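Because s3:// and s3n:// are interchangeable on EMR but s3:// is the recommended form, job paths can be normalized before submission. A small sketch, assuming you only want to rewrite s3n:// URIs and leave every other scheme (hdfs://, file://, and so on) untouched; the paths shown are hypothetical:

```python
def normalize_emr_uri(uri: str) -> str:
    """On Amazon EMR, s3:// and s3n:// both resolve to the EMR
    filesystem, so rewrite s3n:// to the recommended s3:// form.
    URIs with any other scheme are returned unchanged."""
    prefix = "s3n://"
    if uri.startswith(prefix):
        return "s3://" + uri[len(prefix):]
    return uri

# Hypothetical paths.
print(normalize_emr_uri("s3n://my-bucket/input/"))  # s3://my-bucket/input/
print(normalize_emr_uri("hdfs:///tmp/staging"))     # hdfs:///tmp/staging
```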
S3, S3N, S3A, Hadoop file system, HDFS, EMRFS