AWS Database Blog

Reading Amazon S3 Data from Oracle on Amazon EC2

When you work with AWS services, Amazon S3 is often the first choice for storing text data files. In the past, to access S3 data from Oracle, you'd first download the files and then perform extract, transform, and load (ETL) to load the data into the database.

This approach has two drawbacks: downloading the files and performing ETL takes time, and loading the S3 data into the Oracle database wastes space.

Oracle offers an external table feature in version 11g. With this feature, you can keep data outside the database, in a file system that is locally available to your database instance. In contrast, S3 bucket storage isn’t locally available to your database instance.

In the past, to make external tables work with S3 data, we'd copy the files from S3 to storage attached to the database instance. Recently, however, AWS introduced the file gateway feature for AWS Storage Gateway. This feature makes S3 buckets available locally to an Amazon EC2 instance as NFS mounts. By using a file gateway with an Oracle external table, we can read S3 data directly from Oracle. This approach saves ETL time, avoids file movement, and saves database space.

Let's walk through an example of this setup. The end result is a service that mimics Amazon Redshift Spectrum. For data, we'll work with the public on-time performance data for all US airline flights from 1987 to 2008, found on the Statistical Computing and Statistical Graphics website.

You can upload all the data (the on-time and supplemental data files) to an S3 bucket. In this example, we've uploaded it to the jvs-publicdata S3 bucket.

Let's create a Storage Gateway file share for the jvs-publicdata S3 bucket. To learn more about setting up and configuring AWS Storage Gateway, read this blog post by Jeff Barr.

Next, log in to the EC2 instance that runs the Oracle database and mount the NFS share as shown following.

# mkdir /airline_data
# mount -t nfs -o nolock 10.0.2.195:/jvs-publicdata /airline_data

# ls -ltr /airline_data
total 282
-rw-rw-rw-. 1 nfsnobody nfsnobody 244438 May 17 16:06 airports.csv
-rw-rw-rw-. 1 nfsnobody nfsnobody  43758 May 17 16:06 carriers.csv
drwxrwxrwx. 1 nfsnobody nfsnobody      0 May 17 16:06 ontime

These airline data files are stored directly in S3, so they don't have assigned Unix users. Instead, nfsnobody appears as the Unix user, because it's the NFS file share default for Storage Gateway. You can read about NFS file share default values in the Storage Gateway documentation.

As a next step, log in to the Oracle database as the database user who needs to read the S3 data, and create database directories at the SQL prompt as shown.

SQL> CREATE OR REPLACE DIRECTORY airline_dim_dir AS '/airline_data/';

Directory created.

SQL> CREATE OR REPLACE DIRECTORY airline_fact_dir AS '/airline_data/ontime';

Directory created.

You need to create a separate Oracle database directory for each level of the directory path that you want to access, because a directory object doesn't give access to its subdirectories. If the user lacks the privilege to create directories, ask your DBA to grant the following.

SQL> grant create any directory to [user];
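
If a DBA creates these directory objects for you instead, the schema that reads the S3 data also needs privileges on them. Here's a quick sketch, reusing the [user] placeholder; WRITE is needed only if you let ORACLE_LOADER write its log and bad files to the directory.

SQL> grant read, write on directory airline_dim_dir to [user];
SQL> grant read, write on directory airline_fact_dir to [user];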

Now, let’s go ahead and create external tables in Oracle.

CREATE TABLE airline (
  airline_code      VARCHAR2(5),
  airline_name      VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY airline_dim_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LRTRIM
    MISSING FIELD VALUES ARE NULL
    (
     airline_code      CHAR(5),
     airline_name      CHAR(100)
    )
  )
  LOCATION ('carriers.csv')
)
PARALLEL 5
REJECT LIMIT UNLIMITED;

Because each text file contains headers, we skip the header row with the SKIP 1 clause.
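
To confirm that the external table parses carriers.csv as expected, a quick sanity check like the following works (output omitted here).

-- Spot-check a few rows and the row count read straight from the file in S3
SELECT airline_code, airline_name FROM airline WHERE ROWNUM <= 5;

SELECT COUNT(*) FROM airline;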

CREATE TABLE ONTIME (
  flight_year number(5),
  flight_Month number(5),
  DayofMonth number(5),
  DayOfWeek number(5),
  DepTime  number(5),
  CRSDepTime number(5),
  ArrTime number(5),
  CRSArrTime number(5),
  UniqueCarrier varchar2(5),
  FlightNum number(5),
  TailNum varchar2(8),
  ActualElapsedTime number(5),
  CRSElapsedTime number(5),
  AirTime number(5),
  ArrDelay number(5),
  DepDelay number(5),
  Origin varchar2(3),
  Dest varchar2(3),
  Distance number(5),
  TaxiIn number(5),
  TaxiOut number(5),
  Cancelled number(5),
  CancellationCode varchar2(1),
  Diverted varchar2(1),
  CarrierDelay number(5),
  WeatherDelay number(5),
  NASDelay number(5),
  SecurityDelay number(5),
  LateAircraftDelay number(5)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY airline_fact_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
    NULLIF = "NA"
    (
    flight_Year ,
    flight_Month ,
    DayofMonth ,
    DayOfWeek ,
    DepTime  ,
    CRSDepTime ,
    ArrTime ,
    CRSArrTime ,
    UniqueCarrier char(5),
    FlightNum ,
    TailNum char(8),
    ActualElapsedTime ,
    CRSElapsedTime ,
    AirTime ,
    ArrDelay ,
    DepDelay ,
    Origin ,
    Dest char(3),
    Distance ,
    TaxiIn ,
    TaxiOut ,
    Cancelled ,
    CancellationCode ,
    Diverted ,
    CarrierDelay ,
    WeatherDelay ,
    NASDelay ,
    SecurityDelay ,
    LateAircraftDelay 
    )
  )
  LOCATION ('1987.csv')
)
PARALLEL 5
REJECT LIMIT UNLIMITED;
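
The LOCATION clause here points at only the 1987 file. Because the ontime folder holds one .csv file per year, you can point the same external table at more of the data by listing additional files, as in the following sketch (the extra file names assume the dataset's year-based naming).

-- Read several yearly files through the same external table
ALTER TABLE ontime LOCATION ('1987.csv', '1988.csv', '1989.csv');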

The ontime airline .csv files contain "NA" for null values. We use NULLIF at the table level to replace the text "NA" with null values. This enhancement is available in Oracle 12c. If you're using Oracle 11g, define those columns as CHAR or VARCHAR2 instead, and transform the data later while loading it into another table.
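
For example, here is a minimal sketch of that 11g-style approach. It assumes a staging external table ontime_stage that declares the delay columns as VARCHAR2 and a regular table ontime_clean with NUMBER columns; both names are hypothetical.

-- Convert the text "NA" to NULL while loading into a regular table
INSERT INTO ontime_clean (flight_year, flight_month, arrdelay, depdelay)
SELECT flight_year,
       flight_month,
       TO_NUMBER(NULLIF(arrdelay, 'NA')),
       TO_NUMBER(NULLIF(depdelay, 'NA'))
FROM   ontime_stage;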

Now, run the following SQL queries to read directly from S3.

select * from ontime;

select * from airline;
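
Beyond these simple scans, you can run richer queries directly against the data sitting in S3. For example, the following sketch joins the fact and dimension external tables to report the average arrival delay (in minutes) by carrier.

-- Average arrival delay by carrier, read directly from S3
SELECT a.airline_name,
       ROUND(AVG(o.arrdelay), 2) AS avg_arrival_delay
FROM   ontime o
       JOIN airline a ON a.airline_code = o.uniquecarrier
GROUP  BY a.airline_name
ORDER  BY avg_arrival_delay DESC;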

The performance of the read operations depends on several factors. These factors include the size and number of files, NFS performance, whether the S3 bucket is located in the same AWS Region as the Storage Gateway EC2 instance, and whether caching occurs at the Storage Gateway level.

The sizes of the database server EC2 instance and the Storage Gateway EC2 instance also influence network performance, because each EC2 instance type provides a defined amount of network bandwidth. Choose the correct EC2 instance sizes to optimize read operations on large data files. Read more about optimizing Storage Gateway performance in the AWS documentation.

As we've shown, you can use a file gateway in AWS Storage Gateway to avoid ETL loads from S3 into an Oracle database, saving both load time and database space.