AWS Big Data Blog
AWS Lake Formation 2023 year in review
AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2022, we talked about the enhancements we had done to these services. We continue to listen to customer stories and work backwards to incorporate their thoughts in our products. In this post, we are happy to summarize the results of our hard work in 2023 to improve and simplify data governance for customers.
We announced our new features and capabilities during AWS re:Invent 2023, as is our custom every year. The following are re:Invent 2023 talks showcasing Lake Formation and Data Catalog capabilities:
- What’s new in AWS Lake Formation – This session recaps new capabilities and how you can get the most out of Lake Formation. The session also highlights Duke Energy’s journey with Lake Formation and the AWS Glue Data Catalog.
- Easily and securely prepare, share, and query data – This session shows how you can use Lake Formation and the AWS Glue Data Catalog to share data without copying, transform and prepare data without coding, and query data.
- Curate your data at scale – This session shows how solutions like AWS Glue, AWS Glue Data Quality, and Lake Formation can help you manage your best sources and find sensitive information.
We group the new capabilities into four categories:
- Discover and secure
- Connect with data sharing
- Scale and optimize
- Audit and monitor
Let’s dive deeper and discuss the new capabilities introduced in 2023.
Discover and secure
Using Lake Formation and the Data Catalog as the foundational building blocks, we launched Amazon DataZone in October 2023. DataZone is a data management service that makes it faster and more straightforward for you to catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources. The publishing and subscription workflows of DataZone enhance collaboration between various roles in your organization and speed up the time to derive business insights from your data. You can enhance the technical metadata of the Data Catalog using AI-powered assistants into business metadata of DataZone, making it more easily discoverable. DataZone automatically manages the permissions of your shared data in the DataZone projects. To learn more about DataZone, refer to the User Guide. Bienvenue dans DataZone!
AWS Glue crawlers classify data to determine the format, schema, and associated properties of the raw data, group data into tables or partitions, and write metadata to the Data Catalog. In 2023, we released several updates to AWS Glue crawlers. We added the ability to bring your custom versions of JDBC drivers in crawlers to extract data schemas from your data sources and populate the Data Catalog. To optimize partition retrieval and improve query performance, we added the feature for crawlers to automatically add partition indexes for newly discovered tables. We also integrated crawlers with Lake Formation, supporting centralized permissions for in-account and cross-account crawling of S3 data lakes. These are some much sought-after improvements that simplify your metadata discovery using crawlers. Crawlers, salut!
We have also seen a tremendous rise in the usage of open table formats (OTFs) like Linux Foundation Delta Lake, Apache Iceberg, and Apache Hudi. To support these popular OTFs, we added support to natively crawl these three table formats into the Data Catalog. Furthermore, we worked with other AWS analytics services, such as Amazon EMR, to enable Lake Formation fine-grained permissions on all the three open table formats. We encourage you to explore which features of Lake Formation are supported for OTF tables. Bien intégré!
As the data sources and types increase over time, you are bound to have nested data types in your data lake sooner or later. To bring data governance to these datasets without flattening them, Lake Formation added support for fine-grained access controls on nested data types and columns. We also added support for Lake Formation fine-grained access controls while running Apache Hive jobs on Amazon EMR on EC2 and on Amazon EMR Studio. With Amazon EMR Serverless, fine-grained access control with Lake Formation is now available in preview. Connecté les points!
At AWS, we work very closely with our customers to understand their experience. We came to understand that onboarding to Lake Formation from AWS Identity and Access Management (IAM) based permissions for Amazon S3 and the AWS Glue Data Catalog could be streamlined. We realized that your use cases need more flexibility in data governance. With the hybrid access mode in Lake Formation, we introduced selective addition of Lake Formation permissions for some users and databases, without interrupting other users and workloads. You can define a catalog table in hybrid mode and grant access to new users like data analysts and data scientists using Lake Formation while your production extract, transform, and load (ETL) pipelines continue to use their existing IAM-based permissions. Double victoire!
Let’s talk about identity management. You can use IAM principals, Amazon Quicksight users and groups, and external accounts and IAM principals in external accounts to grant access to Data Catalog resources in Lake Formation. What about your corporate identities? Do you need to create and maintain multiple IAM roles and map them to various corporate identities? You could see the IAM role that accessed the table, but how could you find out which user accessed it? To answer these questions, Lake Formation integrated with AWS IAM Identity Center and added the feature for trusted identity propagation. With this, you can grant fine-grained access permissions to the identities from your organization’s existing identity provider. Other AWS analytics services also support the user identity to be propagated. Your auditors can now see that the user john@anycompany.com, for example, had accessed the table managed by Lake Formation permissions using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Intégration facile!
Now you don’t have to worry about moving the data or copying the Data Catalog to another AWS Region to use the AWS services for data governance. We have expanded and made Lake Formation available in all Regions in 2023. Et voila!
Connect with data sharing
Lake Formation provides a straightforward way to share Data Catalog objects like databases and tables with internal and external users. This mechanism empowers organizations with quick and secure access to data and speeds up their business decision-making. Let’s review the new features and enhancements made in 2023 under this theme.
The AWS Glue Data Catalog is the central and foundational component of data governance for both Lake Formation and DataZone. In 2023, we extended the Data Catalog through federation to integrate with external Apache Hive metastores and Redshift datashares. We also made available the connector code, which you can customize to connect the Data Catalog with additional Apache Hive-compatible metastores. These integrations pave the way to get more metadata into the Data Catalog, and allow fine-grained access controls and sharing of these resources across AWS accounts effortlessly with Lake Formation permissions. We also added support to access the Data Catalog table of one Region from other Regions using cross-Region resource links. This enhancement simplifies many use cases to avoid metadata duplication.
With the AWS CloudTrail Lake federation feature, you can discover, analyze, join, and share CloudTrail Lake data with other data sources in Data Catalog. For CloudTrail Lake, fine-grained access controls and querying and visualizing capabilities are available through Athena.
We further extended the Data Catalog capabilities to support uniform views across your data lake. You can create views using different SQL dialects and query from Athena, Redshift Spectrum, and Amazon EMR. This allows you to maintain permissions at the view level and not share the individual tables. The Data Catalog views feature is available in preview, announced at re:Invent 2023.
Scale and optimize
As SQL queries get more complex with the data changes over time or has multiple joins, a cost-based optimizer (CBO) can drive optimizations in the query plan and lead to faster performance, based on statistics of the data in the tables. In 2023, we added support for column-level statistics for tables in the Data Catalog. Customers are already seeing query performance improvements in Athena and Redshift Spectrum, with table column statistics turned on. Suivez les chiffres!
Tag-based access control removes the need to update your policies every time a new resource is added to the data lake. Instead, data lake administrators create Lake Formation Tags (LF-Tags) to tag Data Catalog objects and grant access based on these LF-Tags to users and groups. In 2023, we added support for LF-Tag delegation, where data lake administrators can give permissions to data stewards and other users to manage LF-Tags without the need for administrator privileges. LF-Tag democratization!
Apache Iceberg format uses metadata to keep track of the data files that make up the table. Changes to tables, like inserts or updates, result in new data files being created. As the number of data files for a table grows, the queries using that table can become less efficient. To improve query performance on the Iceberg table, you need to reduce the number of data files by compacting the smaller change capture files into bigger files. Users typically create and run scripts to perform optimization of these Iceberg table files in their own servers or through AWS Glue ETL. To alleviate this complex maintenance of Iceberg tables, customers approached us for a better solution. We introduced the feature for automatic compaction of Apache Iceberg tables in the Data Catalog. After you turn on automatic compaction, the Data Catalog automatically manages the metadata of the table and gives you an always-optimized Amazon S3 layout for your Iceberg tables. To learn more, check out Optimizing Iceberg tables. Automatique!
Audit and monitor
Knowing who has access to what data is a critical component of data governance. Auditors need to validate that the right metadata and data permissions are set in Lake Formation and the Data Catalog. Data lake administrators have full access to permissions and metadata, and can grant access to the data itself. To provide auditors with an option to search and review metadata permissions without granting them access to make changes to permissions, we introduced the read-only administrator role in Lake Formation. This role allows you to audit the catalog metadata and Lake Formation permissions and LF-Tags while restricting it from making any changes to them.
Conclusion
We had an amazing 2023, developing product enhancements to help you simplify and enhance your data governance using Lake Formation and Data Catalog. We invite you to try these new features. The following is a list of our launch posts for reference:
- Data Catalog and crawler features:
- AWS Glue crawlers support cross-account crawling to support data mesh architecture
- Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes
- Introducing native Delta Lake table support with AWS Glue crawlers
- Introducing AWS Glue crawler and create table support for Apache Iceberg format
- Introducing Apache Hudi support with AWS Glue crawlers
- Enhance query performance using AWS Glue Data Catalog column-level statistics
- AWS Glue Data Catalog now supports automatic compaction of Apache Iceberg tables
- Lake Formation features:
- Amazon DataZone Now Generally Available – Collaborate on Data Projects across Organizational Boundaries
- Query your Apache Hive metastore with AWS Lake Formation permissions
- Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation
- Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation
- Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation
- Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies
- Decentralize LF-tag management with AWS Lake Formation
- Use IAM runtime roles with Amazon EMR Studio Workspaces and AWS Lake Formation for cross-account fine-grained access control
We will continue to innovate on behalf of our customers in 2024. Please share your thoughts, use cases, and feedback for our product improvements in the comments section or through your AWS account teams. We wish you a happy and prosperous 2024. Bonne année!
About the authors
Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.
Leon Stigter is a Senior Technical Product Manager with AWS Lake Formation. Leon’s focus is on helping developers build data lakes faster, with seamless connectivity to analytical tools, to transform data into game-changing insights. Leon is interested in data and serverless technologies, and enjoys exploring different cities on his mission to taste cheesecake everywhere he goes.