AWS Big Data Blog

Process and analyze highly nested and large XML files using AWS Glue and Amazon Athena

July 2025: This post was reviewed for accuracy.

In today’s digital age, data is at the heart of every organization’s success, and XML is one of the most commonly used formats for exchanging it. XML files are used in many industries, including finance, healthcare, and government. Analyzing them helps organizations gain insights into their data, make better decisions, and improve their operations. It also helps with data integration: because many applications and systems use XML as a standard data format, organizations can integrate data from different sources and keep it consistent across their systems. However, XML files contain semi-structured, highly nested data, which makes the information difficult to access and analyze, especially when a file is large and has a complex, highly nested schema.

XML files are well-suited for applications, but they may not be optimal for analytics engines. In order to enhance query performance and enable easy access in downstream analytics engines such as Amazon Athena, it’s crucial to preprocess XML files into a columnar format like Parquet. This transformation allows for improved efficiency and usability in analytics workflows. In this post, we show how to process XML data using AWS Glue and Athena.

Solution overview

In this post, we use an AWS Glue crawler and the AWS Glue visual editor to define the table structure for your XML files. This approach provides a user-friendly interface and is particularly suitable for individuals who prefer a graphical approach to managing their data.

The process converts XML files into Apache Parquet format, making them readily available for querying using Athena. This transformation enhances the processing speed and accessibility of your XML data, enabling you to derive valuable insights with ease.

Prerequisites

Before you begin this tutorial, complete the following prerequisites:

  1. Download the XML files technique1.xml and technique2.xml.
  2. Upload the files to an Amazon Simple Storage Service (Amazon S3) bucket. You can upload them to the same S3 bucket in different folders or to different S3 buckets.
  3. Create an AWS Identity and Access Management (IAM) role for your ETL job or notebook as instructed in Set up IAM permissions for AWS Glue Studio.
  4. Add an inline policy to your role with the iam:PassRole action:
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": ["iam:PassRole"],
        "Effect": "Allow",
        "Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
        "Condition": {
          "StringLike": {
            "iam:PassedToService": ["glue.amazonaws.com"]
          }
        }
      }
    ]
  }
  5. Add a permissions policy to the role with access to your S3 bucket.
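
The inline policy with the iam:PassRole action can also be assembled and checked in code before it is attached. The following is a minimal sketch, assuming Python and, for the optional attach step, the boto3 library. The policy name GluePassRole is a hypothetical label, and the role name is whatever you chose in step 3.

```python
import json

# The iam:PassRole inline policy from the prerequisites, as a Python dict.
PASS_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["iam:PassRole"],
            "Effect": "Allow",
            "Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
            "Condition": {
                "StringLike": {"iam:PassedToService": ["glue.amazonaws.com"]}
            },
        }
    ],
}


def policy_json(policy=PASS_ROLE_POLICY):
    """Serialize the policy document to the JSON string that IAM expects."""
    return json.dumps(policy, indent=2)


def attach_inline_policy(role_name, policy_name="GluePassRole"):
    """Attach the policy to an existing role.

    policy_name is a hypothetical label, not a name used in this post.
    """
    import boto3  # imported lazily so policy_json() works without AWS access

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=policy_json(),
    )
```

Serializing the dict with json.dumps guarantees that the document sent to IAM is well-formed JSON, which is easy to get wrong when the policy is pasted by hand.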

Now that we’re done with the prerequisites, let’s move on to implementing the solution.

Use an AWS Glue crawler and the visual editor

The following diagram illustrates the simple architecture that you can use to implement the solution.

Processing and Analyzing XML file using AWS Glue and Amazon Athena

To analyze XML files stored in Amazon S3 using AWS Glue and Athena, we complete the following high-level steps:

  1. Create an AWS Glue crawler to extract XML metadata and create a table in the AWS Glue Data Catalog.
  2. Process and transform XML data into a format (like Parquet) suitable for Athena using an AWS Glue extract, transform, and load (ETL) job.
  3. Set up and run an AWS Glue job via the AWS Glue console or the AWS Command Line Interface (AWS CLI).
  4. Use the processed data (in Parquet format) with Athena tables, enabling SQL queries.
  5. Use the user-friendly interface in Athena to analyze the XML data with SQL queries on your data stored in Amazon S3.

The AWS Glue crawler extracts XML file metadata. You can choose the default AWS Glue classifier for general-purpose XML classification. It automatically detects XML data structure and schema, which is useful for common formats.

This solution also uses a custom XML classifier. It’s designed for specific XML schemas or formats, allowing precise metadata extraction. This is ideal for non-standard XML formats or when you need detailed control over classification. A custom classifier ensures only necessary metadata is extracted, simplifying downstream processing and analysis tasks. This approach optimizes the use of your XML files.

The following screenshot shows an example of an XML file with tags.
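
As a rough illustration of how a classifier's row tag (configured in the next section) relates XML elements to table rows, the following sketch uses only the Python standard library. It is not the AWS Glue classifier itself: the sample XML is made up, with element names borrowed from the query later in this post, and the conversion to nested dictionaries only approximates how nested elements end up as nested struct columns.

```python
import xml.etree.ElementTree as ET

# Made-up sample; element names mirror those queried later in this post.
SAMPLE_XML = """
<records>
  <metadata>
    <idinfo>
      <citation>
        <citeinfo>
          <origin>Example Survey</origin>
          <title>Example Map</title>
        </citeinfo>
      </citation>
    </idinfo>
  </metadata>
  <metadata>
    <idinfo>
      <citation>
        <citeinfo>
          <origin>Another Survey</origin>
          <title>Another Map</title>
        </citeinfo>
      </citation>
    </idinfo>
  </metadata>
</records>
"""


def element_to_struct(element):
    """Convert an element to nested dicts; leaf elements become their text.

    Repeated sibling tags would need a list; that case is omitted for brevity.
    """
    children = list(element)
    if not children:
        return (element.text or "").strip()
    return {child.tag: element_to_struct(child) for child in children}


def rows_for_tag(xml_text, row_tag):
    """Return one nested record per element whose tag equals row_tag."""
    root = ET.fromstring(xml_text)
    return [element_to_struct(el) for el in root.iter(row_tag)]


rows = rows_for_tag(SAMPLE_XML, "metadata")
print(len(rows))  # 2
print(rows[0]["idinfo"]["citation"]["citeinfo"]["origin"])  # Example Survey
```

Choosing a different row tag changes what counts as a row: with citeinfo as the row tag, each record would contain only origin and title.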

Create a custom classifier

In this step, you create a custom AWS Glue classifier to extract metadata from an XML file. Complete the following steps:

  1. On the AWS Glue console, under Crawlers in the navigation pane, choose Classifiers.
  2. Choose Add classifier.
  3. Select XML as the classifier type.
  4. Enter a name for the classifier, such as blog-glue-xml-contact.
  5. For Row tag, enter the name of the root tag that contains the metadata (for example, metadata).
  6. Choose Create.
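
The same classifier can be created without the console. The following is a sketch under the assumption that you use Python with boto3; it calls the AWS Glue create_classifier operation with the name and row tag from the steps above. The helper function names are illustrative only.

```python
def xml_classifier_request(name="blog-glue-xml-contact", row_tag="metadata"):
    """Build the XMLClassifier argument for the Glue create_classifier call."""
    return {"Classification": "xml", "Name": name, "RowTag": row_tag}


def create_xml_classifier(glue_client=None, **kwargs):
    """Create the custom XML classifier and return the request that was sent.

    Pass an existing Glue client, or one is created with boto3.
    """
    if glue_client is None:
        import boto3  # imported lazily so the request builder works offline

        glue_client = boto3.client("glue")
    request = xml_classifier_request(**kwargs)
    glue_client.create_classifier(XMLClassifier=request)
    return request
```

Calling create_xml_classifier() with no arguments reproduces the console steps; pass row_tag to target a different element.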

Create an AWS Glue crawler to crawl the XML file

In this section, you create an AWS Glue crawler to extract metadata from the XML file using the custom classifier created in the previous step.

Create a database

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Choose Add database.
  3. Enter a name, such as blog_glue_xml.
  4. Choose Create database.

Create a crawler

Complete the following steps to create your first crawler:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. On the Set crawler properties page, provide a name for the new crawler (such as blog-glue-parquet), then choose Next.
  4. On the Choose data sources and classifiers page, select Not Yet under Data source configuration.
  5. Choose Add a data store.
  6. For S3 path, browse to s3://${BUCKET_NAME}/input/geologicalsurvey/.

Make sure you pick the XML folder rather than the file inside the folder.

  7. Leave the rest of the options as default and choose Add an S3 data source.
  8. Expand Custom classifiers – optional, choose blog-glue-xml-contact, then choose Next and keep the rest of the options as default.
  9. Choose your IAM role or choose Create new IAM role, add the suffix glue-xml-contact (for example, AWSGlueServiceNotebookRoleBlog), and choose Next.
  10. On the Set output and scheduling page, under Output configuration, choose blog_glue_xml for Target database.
  11. Enter console_ as the prefix added to tables (optional) and under Crawler schedule, keep the frequency set to On demand.
  12. Choose Next.
  13. Review all the parameters and choose Create crawler.
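
The database and crawler settings above map onto the AWS Glue create_database and create_crawler operations. The following is a minimal sketch, assuming Python with boto3 and a Glue client passed in by the caller; the bucket name in the usage note is a hypothetical placeholder for ${BUCKET_NAME}.

```python
def crawler_request(bucket_name, role_name):
    """Build create_crawler arguments matching the console steps.

    No Schedule key is set, which leaves the crawler on demand.
    """
    return {
        "Name": "blog-glue-parquet",
        "Role": role_name,
        "DatabaseName": "blog_glue_xml",
        "Targets": {
            # Point at the folder, not at an individual XML file.
            "S3Targets": [{"Path": f"s3://{bucket_name}/input/geologicalsurvey/"}]
        },
        "Classifiers": ["blog-glue-xml-contact"],
        "TablePrefix": "console_",
    }


def create_database_and_crawler(glue_client, bucket_name, role_name):
    """Create the blog_glue_xml database, then the XML crawler."""
    glue_client.create_database(DatabaseInput={"Name": "blog_glue_xml"})
    glue_client.create_crawler(**crawler_request(bucket_name, role_name))
```

For example, create_database_and_crawler(boto3.client("glue"), "my-example-bucket", "AWSGlueServiceNotebookRoleBlog") would set up both resources, provided the role and bucket already exist.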

Run the crawler

After you create the crawler, complete the following steps to run it:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Open the crawler you created and choose Run.

The crawler will take 1–2 minutes to complete.

  3. When the crawler is complete, choose Databases in the navigation pane.
  4. Choose the database you created and choose the table name to see the schema extracted by the crawler.
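
Running the crawler and waiting for it to finish can also be scripted. The following sketch assumes Python with a boto3 Glue client supplied by the caller; it starts the crawler and polls get_crawler until the state returns to READY. The polling interval and timeout are arbitrary choices, not values from this post.

```python
import time


def run_crawler_and_wait(glue_client, name, poll_seconds=15,
                         timeout_seconds=900, sleep=time.sleep):
    """Start a crawler and block until it is READY again.

    Returns the status of the last crawl (for example, SUCCEEDED or FAILED).
    The sleep argument exists so the wait can be replaced in tests.
    """
    glue_client.start_crawler(Name=name)
    waited = 0
    while waited <= timeout_seconds:
        crawler = glue_client.get_crawler(Name=name)["Crawler"]
        # A crawler moves through RUNNING and STOPPING before READY.
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status")
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"Crawler {name} did not finish in {timeout_seconds} seconds")
```

With the crawler from the previous section, run_crawler_and_wait(boto3.client("glue"), "blog-glue-parquet") should return once the 1–2 minute crawl is done.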

Create an AWS Glue job to convert the XML to Parquet format

In this step, you create an AWS Glue Studio job to convert the XML file into a Parquet file. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Under Create job, select Visual with a blank canvas.
  3. Choose Create.
  4. Rename the job to blog_glue_xml_job.

Now you have a blank AWS Glue Studio visual job editor. On the top of the editor are the tabs for different views.

  5. Choose the Script tab to see an empty shell of the AWS Glue ETL script.

As we add new steps in the visual editor, the script will be updated automatically.

  6. Choose the Job details tab to see all the job configurations.
  7. For IAM role, choose AWSGlueServiceNotebookRoleBlog.
  8. For Glue version, choose Glue 4.0 – Support Spark 3.3, Scala 2, Python 3.
  9. Set Requested number of workers to 2.
  10. Set Number of retries to 0.
  11. Choose the Visual tab to go back to the visual editor.
  12. On the Source drop-down menu, choose AWS Glue Data Catalog.
  13. On the Data source properties – Data Catalog tab, provide the following information:
    1. For Database, choose blog_glue_xml.
    2. For Table, choose the table that starts with the name console_ that the crawler created (for example, console_geologicalsurvey).
  14. On the Node properties tab, provide the following information:
    1. Change Name to geologicalsurvey dataset.
    2. Choose Action and the transformation Change Schema (Apply Mapping).
    3. Choose Node properties and change the name of the transform from Change Schema (Apply Mapping) to ApplyMapping.
    4. On the Target menu, choose S3.
  15. On the Data target properties – S3 tab, provide the following information:
    1. For Format, select Parquet.
    2. For Compression Type, select Uncompressed.
    3. For S3 source type, select S3 location.
    4. For S3 URL, enter s3://${BUCKET_NAME}/output/parquet/.
    5. Choose Node Properties and change the name to Output.
  16. Choose Save to save the job.
  17. Choose Run to run the job.
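
For orientation, the following is a hand-written sketch of what an equivalent AWS Glue ETL script could look like. It is not the script that the visual editor generates. It assumes a custom job parameter named --BUCKET_NAME (not part of this post) and leaves out the Change Schema (Apply Mapping) node, because its column mappings depend on the schema the crawler inferred for your files.

```python
import sys


def parquet_sink(bucket_name):
    """Options for the S3 target: uncompressed Parquet under output/parquet/."""
    return {
        "connection_type": "s3",
        "format": "parquet",
        "connection_options": {"path": f"s3://{bucket_name}/output/parquet/"},
        "format_options": {"compression": "uncompressed"},
    }


def main():
    # These libraries are available inside an AWS Glue job, not on a laptop.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # BUCKET_NAME is an assumed custom job parameter (--BUCKET_NAME).
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "BUCKET_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source node: the table created by the first crawler.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="blog_glue_xml",
        table_name="console_geologicalsurvey",
        transformation_ctx="geologicalsurvey_dataset",
    )

    # Target node: write the records to Amazon S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        transformation_ctx="Output",
        **parquet_sink(args["BUCKET_NAME"]),
    )
    job.commit()


# AWS Glue passes --JOB_NAME to job scripts, so main() runs only inside a job.
if any(arg.startswith("--JOB_NAME") for arg in sys.argv):
    main()
```

Keeping the sink options in a small function makes the output path and compression setting easy to check without starting a Spark session.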

The following screenshot shows the job in the visual editor.

Create an AWS Glue crawler to crawl the Parquet file

In this step, you create an AWS Glue crawler to extract metadata from the Parquet file you created using an AWS Glue Studio job. This time, you use the default classifier. Complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. On the Set crawler properties page, provide a name for the new crawler, such as blog-glue-parquet-contact, then choose Next.
  4. On the Choose data sources and classifiers page, select Not Yet for Data source configuration.
  5. Choose Add a data store.
  6. For S3 path, browse to s3://${BUCKET_NAME}/output/parquet/.

Make sure you pick the parquet folder rather than the file inside the folder.

  7. Choose the IAM role you created in the prerequisites section or choose Create new IAM role (for example, AWSGlueServiceNotebookRoleBlog), and choose Next.
  8. On the Set output and scheduling page, under Output configuration, choose blog_glue_xml for Target database.
  9. Enter parquet_ as the prefix added to tables (optional) and under Crawler schedule, keep the frequency set to On demand.
  10. Choose Next.
  11. Review all the parameters and choose Create crawler.

Now you can run the crawler, which takes 1–2 minutes to complete.

You can preview the newly created schema for the Parquet file in the AWS Glue Data Catalog, which is similar to the schema of the XML file.
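
To compare the two schemas without the console, you could read both table definitions from the Data Catalog. The following sketch assumes Python with a boto3 Glue client supplied by the caller, and uses the table names console_geologicalsurvey and parquet_parquet that appear in this post; your crawler may produce different names.

```python
def table_columns(glue_client, database, table):
    """Return {column name: type string} for a Data Catalog table."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    columns = response["Table"]["StorageDescriptor"]["Columns"]
    return {column["Name"]: column["Type"] for column in columns}


def schema_diff(left, right):
    """Return columns that are missing on one side or have different types."""
    names = sorted(set(left) | set(right))
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }


def compare_xml_and_parquet(glue_client, database="blog_glue_xml"):
    """Diff the XML table against the Parquet table in the same database."""
    xml_schema = table_columns(glue_client, database, "console_geologicalsurvey")
    parquet_schema = table_columns(glue_client, database, "parquet_parquet")
    return schema_diff(xml_schema, parquet_schema)
```

An empty result means both tables expose the same columns with the same types; any entry shows the type on each side, with None for a column that is absent.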

We now possess data that is suitable for use with Athena. In the next section, we perform data queries using Athena.

Query the Parquet file using Athena

Athena doesn’t support querying the XML file format, which is why you converted the XML file into Parquet. Parquet allows more efficient querying, and you can use dot notation to query complex types and nested structures.

The following example code uses dot notation to query nested data:

SELECT 
    idinfo.citation.citeinfo.origin,
    idinfo.citation.citeinfo.pubdate,
    idinfo.citation.citeinfo.title,
    idinfo.citation.citeinfo.geoform,
    idinfo.citation.citeinfo.pubinfo.pubplace,
    idinfo.citation.citeinfo.pubinfo.publish,
    idinfo.citation.citeinfo.onlink,
    idinfo.descript.abstract,
    idinfo.descript.purpose,
    idinfo.descript.supplinf,
    dataqual.attracc.attraccr, 
    dataqual.logic,
    dataqual.complete,
    dataqual.posacc.horizpa.horizpar,
    dataqual.posacc.vertacc.vertaccr,
    dataqual.lineage.procstep.procdate,
    dataqual.lineage.procstep.procdesc
FROM "blog_glue_xml"."parquet_parquet" limit 10;
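
A query like the one above can also be built and submitted programmatically. The following is a minimal sketch, assuming Python with a boto3 Athena client supplied by the caller. The output_location argument is an S3 path for query results that you must provide; it is an assumption of this sketch and not something configured earlier in this post.

```python
import time


def build_nested_query(paths, database, table, limit=10):
    """Build a SELECT that reads nested fields with dot notation."""
    columns = ",\n    ".join(paths)
    return f'SELECT\n    {columns}\nFROM "{database}"."{table}"\nLIMIT {int(limit)}'


def run_query(athena_client, sql, output_location, poll_seconds=2, sleep=time.sleep):
    """Submit a query, wait for it to finish, and return the result rows."""
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(status.get("StateChangeReason", status["State"]))
    # Only the first page of results is returned; the first row holds headers.
    return athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

For example, build_nested_query(["idinfo.descript.abstract", "dataqual.logic"], "blog_glue_xml", "parquet_parquet") produces a two-column version of the query shown above.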

Clean up

In this post, we created an IAM role and tables in the AWS Glue Data Catalog. We also uploaded some files to an S3 bucket. To clean up these objects, complete the following steps:

  1. On the IAM console, delete the role you created.
  2. Navigate to the AWS Glue Data Catalog and delete the tables you created.
  3. On the Amazon S3 console, navigate to the bucket you used and delete the folders created for this post (input and output).
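
The AWS Glue resources created along the way (crawlers, job, classifier, and database) can be removed in one pass. The following sketch assumes Python with a boto3 Glue client supplied by the caller, and it assumes you used the example names from this post; adjust them if you chose others. It does not touch the IAM role or the S3 objects.

```python
def cleanup_glue(glue_client,
                 database="blog_glue_xml",
                 crawlers=("blog-glue-parquet", "blog-glue-parquet-contact"),
                 job="blog_glue_xml_job",
                 classifier="blog-glue-xml-contact"):
    """Delete the Glue resources from this walkthrough, crawlers first."""
    for name in crawlers:
        glue_client.delete_crawler(Name=name)
    glue_client.delete_job(JobName=job)
    # The classifier is removed after the crawler that referenced it.
    glue_client.delete_classifier(Name=classifier)
    # Deleting the database also removes the tables registered in it.
    glue_client.delete_database(Name=database)
```

Each call raises an error if the named resource does not exist, so run it once against an account where the walkthrough was completed.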

Conclusion

This solution demonstrates an efficient approach to XML data management using AWS Glue. The integration of AWS Glue crawlers with visual editing capabilities provides a streamlined method for defining and managing XML data structures. By leveraging the AWS Glue visual interface, users can easily create table definitions, transform data formats, and prepare XML content for analysis.

Key Benefits:

  1. Simplified metadata extraction through custom classifiers
  2. Automated table structure definition
  3. Visual workflow management
  4. Efficient data transformation to analytics-ready formats
  5. Seamless integration with AWS analytics services

This architecture enables teams to focus on data insights rather than infrastructure management, making XML data processing accessible and efficient within the AWS ecosystem.


About the Authors

Navnit Shukla serves as an AWS Specialist Solution Architect with a focus on Analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled “Data Wrangling on AWS.”

Patrick Muller works as a Senior Data Lab Architect at AWS. His main responsibility is to assist customers in turning their ideas into a production-ready data product. In his free time, Patrick enjoys playing soccer, watching movies, and traveling.

Amogh Gaikwad is a Senior Solutions Developer at Amazon Web Services. He helps global customers build and deploy AI/ML solutions on AWS. His work is mainly focused on computer vision and natural language processing, and on helping customers optimize their AI/ML workloads for sustainability. Amogh received his master’s in Computer Science, specializing in Machine Learning.

Sheela Sonone is a Senior Resident Architect at AWS. She helps AWS customers make informed choices and tradeoffs about accelerating their data, analytics, and AI/ML workloads and implementations. In her spare time, she enjoys spending time with her family – usually on tennis courts.

Santosh Gantaram is a Senior Technical Account Manager at AWS, specializing in Data Analytics. He helps enterprise customers optimize their data-driven solutions by providing strategic guidance, best practices, and technical expertise in AWS analytics services. As a trusted advisor, he enables organizations to accelerate their cloud and data analytics initiatives. In his spare time, he enjoys spending time with his family and exploring new hiking trails.


Audit History

Last reviewed and updated in July 2025 by Santosh Gantaram | Sr. TAM