Aggregate, Curate, Extend: How To Build an Enterprise Data Foundation

In this episode...

We catch up with Mai-Lan Tomsen Bukovec, VP of Technology at AWS, as she reveals three transformative approaches to enterprise data management: aggregate, curate, and extend. Drawing from her extensive experience leading AWS data services, Mai-Lan shares how organizations can build flexible, scalable data foundations that enable both innovation and governance. Join Mai-Lan as she discusses the intricacies of data infrastructure modernization with AWS Enterprise Strategist Tom Soderstrom. Together they explore how modern data infrastructure can accommodate rapid technological changes while maintaining security and compliance. This essential discussion provides leaders with practical insights for data-driven business transformation, from federating data ownership to implementing strategic data platform modernization that adapts to evolving business needs.

Transcript of the conversation

Featuring Mai-Lan Tomsen Bukovec, VP of Technology, AWS, and Tom Soderstrom, Enterprise Strategist, AWS

Tom Soderstrom:
My name is Tom Soderstrom. I'm an enterprise strategist at AWS and this is hosted by AWS, and we talk to executives.

And today, we have the great honor of talking with Mai-Lan Tomsen Bukovec, who is both a technology leader and a people leader.

And in particular, you are a leader of the things that are growing the most, which is data and analytics. As VP of technology, what do you focus on?

Mai-Lan Tomsen Bukovec:
Well, Tom, it's a pleasure to be here. Thank you for having me.

As Tom said, I run AWS services that are basically up and down the data stack. So if you think about the bottom of the data stack, that's storage, Amazon S3 and file, and then the analytics services like Amazon Redshift, and the streaming capabilities between them. And so I've been here with AWS since 2010, so it's been a minute.

Tom Soderstrom:
Good. Now, and with millions and millions of customers, it's pretty big data. So I looked up some statistics and if you look at data, of course you know this, but it's grown 800% from 2015 to 2024.

It's 138 zettabytes by now. Who would think we're so comfortable with zettabytes? And it's growing to 400 zettabytes by 2028.

Could you look at the history of AWS and data, and then perhaps talk about some of the top trends that are coming through 2030?

Mai-Lan Tomsen Bukovec:
Yeah. Well, it's super interesting. I mean, I think, Tom, if you remember the conversations that we used to have about data back in, I don't know, 2010, 2012, a lot of people used to talk about the data explosion, which is this exponential growth that you're talking about in terms of the world's data.

But I think people have stopped talking about that as much because it is in fact the new normal. These rapid data growth rates are driven by everything from sensors to consumer behavior. And I feel like the data strategists of the world have moved on from, "Oh, what are we going to do about the explosion of data growth?" and more into, "How do we take advantage of using it in the right way?"

And if you think about the evolution of where we are now with this convergence of analytics and AI and data, really all that started a few years ago. And so if you go back in time to 2006, that was the launch of Amazon S3, the first AWS service that changed the economics of data storage, which is one of the reasons why that data growth is so manageable now: you have cloud storage out there to help you manage the costs of all that data growth and then use it.

But by 2000 and... I would say 12, the concept of being able to do big data analytics had really started to mainstream.

And that was a combination of both the cost structure of cloud storage with S3, making it possible to keep all that data and do something with it. And then the capabilities of MapReduce with first Apache Hadoop and Hive. And then from there, a whole ecosystem that included Iceberg and Databricks and EMR and Redshift. That world of analytics really took off around 2012 to 2015.

And so you think about that story arc, the growth of data, the growth of technology to take advantage of analytics on that data. And you think about the next steps from there, which is both the changing of cloud storage like S3 where we introduced new capabilities like strong consistency, so that S3 could be used more easily with all these MapReduce technologies, which are basically file system based.

But you think about the other technology shifts that happened, so many of them were driven by smart developers at customers. And a lot of our data strategists of the world are thinking about, "How do I use OTFs, open table formats like Iceberg right now?" And that started with some developers at Netflix and Apple in 2017.

And if you think about that timing, 2017, your engineers of the world had grown up on a world of cloud storage plus analytics, open source and managed, and those were the inventors of Iceberg.

In 2018, they contributed it to the Apache Software Foundation. By 2020, this open table format of Iceberg was so popular, it was a top-level project in Apache. And then the world started to look at that and say, "How can I take advantage of that capability and change my analytics?"

And that is why in 2022, so many of our data lakes started to shift over into using these open table formats. And the arc continued with AI. And so when those first capable models came out in late 2022, guess what? They were trained on data, and they were trained on data that was often stored in cloud storage like S3.

And customers were then in 2023, 2024, using RAG as a technique to take their own data and cover the gap in knowledge, the personalized knowledge of their business, their tone, their data, and bring that to the capabilities of these general purpose models. And there you have it. Where we are today.

When these data lakes that are so big and the petabytes and the exabytes of storage, they're using these open table formats and they're using AI to... their data, to personalize AI but also AI to transform their data. It's amazing.

Tom Soderstrom:
So you came up with three patterns that I thought were really interesting. I wonder if you could summarize them. You call them: aggregate, curate, and extend. And I think our audience, that's practical advice for what they can do.

Mai-Lan Tomsen Bukovec:
Well, let me first say that these three patterns are really based on observations from hundreds of conversations of what AWS customers do at scale. Okay.

The fundamental premise of being able to take advantage of the cloud was being able to bring together data of different data types into the cloud. And that is the aggregation model.

And Tom, you remember the old days where you have your own data centers and you're buying these data solutions where the compute and the data are all tied together. So if you buy one thing for images, you have another thing that you're using for video, you have another thing you're using for file-

Tom Soderstrom:
Well, that brings back memories.

Mai-Lan Tomsen Bukovec:
Right? And each of those different solutions meant that the incredibly smart engineers in your organization weren't able to take advantage of these different data types. They had to work within the silo of the capabilities that were tied to that integrated, that vertically integrated solution.

And so one of the really remarkable things about the cloud is that you're bringing together all these different data types and you're aggregating them together in a data estate, if you will, and you're federating the ownership of that.

And so in this aggregation model, it works for so many customers because you have a federated data ownership where the data is being sent in through sensors, it's being sent in through applications, it's being backed up.

And then you have federated ownership where different departments, whether it's your fraud department, your marketing department, they can build all the different applications they want to build on this aggregated data store.

So when customers come to the cloud, they move to this aggregated data model, and they find that the shared dataset that you can then take and customize for your business needs works so well with culture too.

Tom Soderstrom:
Yeah. And technology without people using it is useless.

Mai-Lan Tomsen Bukovec:
That's right.

Tom Soderstrom:
So yes. So that's the aggregate. So you're federating. Then your next pattern was curate.

Mai-Lan Tomsen Bukovec:
Well, so we have plenty of customers that stick with the aggregation model because they really like that federated data ownership piece. But we have other customers who say, "Well, okay, I can keep on doing that, and if I keep on doing that, I'm going to have to apply some standards."

And this is how aggregation at scale really works, where your chief data officer, your CTO, your CIO, the role that you used to play as well, will come in and say, "Okay, everybody can put whatever data you want into your data lake as it were. But if you're going to use tabular data, which are basically numbers and text, you have to use the Parquet data type."

So they standardize on a data type or they standardize on a table format like Iceberg. And that is one way that a rapidly growing aggregation model gets a little bit of order in place and makes it easier for everybody to play by the rules as it were, about what kind of data goes in and how do you manage your data.

But we have a lot of customers who say, "Okay, I want to take some data sets and I want those data sets to be used by both AI and analytics. But that data set has to be very well governed. It has to have no PII in it. It has to be the subset of data that I want my auditors to really focus on because those are the data sets that are used for external applications or for sensitive operations."

And what customers do is they call those data products, and those are highly curated, which is a curate data pattern, subsets of data from the aggregation data lake. Okay. Now what's really interesting, Tom, is that... You use the word pattern, I use the word pattern. You can mix and match these patterns.

And so your fraud department can always have access to all of the different data sets and the aggregation model because they need that. They need the raw data to really do what they need to do with their models.

But you can say, "Well, maybe my marketing department, maybe they don't have access to all of the raw data. Maybe the marketing department applications use the subscriber data set." And the subscriber data set is based on the promise of the cloud. It's different modalities of a subscriber, it's their audio customer care, it's their transaction records.

But the applications that use a curated data set are always working with a data set that you know is clean of personal information, and you can update the data set according to all the new data that you bring to bear to it, and you can restrict access to that data set to a subset of your applications. And so the super nice thing about the aggregation model and the curation model is you can mix and match depending on what you need for your business.
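
As an illustration of the curate pattern described here, the following is a minimal Python sketch, not an AWS API. The field names in `PII_FIELDS`, the `ACCESS_POLICY` mapping, and the helper functions are all hypothetical, chosen only to show a data product being derived from raw records, checked for personal information, and restricted to a subset of applications.

```python
# Hypothetical field names treated as personal information (PII).
PII_FIELDS = {"name", "email", "phone"}

# Hypothetical access policy: data product name -> applications allowed to read it.
ACCESS_POLICY = {"subscriber": {"marketing_app"}}


def curate(raw_records, pii_fields=PII_FIELDS):
    """Build a data product: copy each raw record without its PII fields."""
    return [
        {key: value for key, value in record.items() if key not in pii_fields}
        for record in raw_records
    ]


def contains_pii(records, pii_fields=PII_FIELDS):
    """Audit check: True if any record still carries a PII field."""
    return any(record.keys() & pii_fields for record in records)


def read_product(name, app, products, policy=ACCESS_POLICY):
    """Return a data product only to applications the policy allows."""
    if app not in policy.get(name, set()):
        raise PermissionError(f"{app} may not read data product {name!r}")
    return products[name]


# Raw data stays in the aggregated store; the curated product is a clean subset.
raw = [{"subscriber_id": 1, "email": "a@example.com", "plan": "basic"}]
products = {"subscriber": curate(raw)}
print(read_product("subscriber", "marketing_app", products))
```

In this sketch the raw records are left untouched, so a team that needs the aggregated data can keep using it while other applications see only the curated product.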

Tom Soderstrom:
It makes sense. And the thing that I think a lot of people miss, at least at the executive level, this is not a bunch of people sitting and curating and doing all that. It can be programmatically handled so that it fits a policy so it can now be audited and compliant, which is more and more important.

So I thought it was interesting with your extend, you are actually centralizing more with bigger outcome. Maybe you want to talk about the... You're a leader of leaders. Talk about the culture. What can executives learn from what you've seen customers do?

Mai-Lan Tomsen Bukovec:
Well, I find that when I talk to the CDOs and the CIOs and the CTOs, they understand the culture of their organization very well.

And they know when it's too much of a lift to go into one direction or another. And one of the great things about working with AWS is because we start with this concept of a building block, it actually ends up being very evolvable as a technology base, these building blocks.

And that is the same concept with this aggregation, curation, and then what you mentioned extend. Okay. Because when you use aggregation, as I mentioned, you can use aggregation as a data pattern for one set of your business where it's best suited to what they need to do in their culture.

And then you can use curation for another. But let's say you know how fast things are evolving and you want to put a data service on top of your curated data products and you want to govern the usage of the data even more. One thing is, it is a heavier lift. You have to build an API, you have to build a data service, you have to manage it and its security.

But what it lets you do is... under the hood of your data service, it lets you experiment with all these new technologies and take advantage of them, start to use them. And the agentic workflows that we're seeing where these models are getting more and more capable with this concept of memory and being able to have long-running workflows, that technology or that technique is evolving so quickly now that if you decide to go down the path of extend, which is where you build a data service and you use agentic infrastructure under the hood, in our AWS world, you can mix and match that as well.

So you can have aggregation for all of your business needs that need the raw data. Then you can have curation for applications that want to use a few data products that are already clean, but you want to give more control to those developers.

And then where you have developers where you're like, "Okay, I just want to have... You use one API set, you can give them access to just one API set." But in the world of AWS, you don't have to re-architect, you don't have to change your data schema, you don't have to do all the things that you had to do in the old world.

You can mix and match your patterns depending on what your business needs. And you can give control into your business units for what kind of clients they use, or you can give control over what programming language because they're just going to go use your API anyway. It gives you a lot of flexibility to pivot and make changes based on all these new technologies.
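
To make the extend pattern concrete, here is a minimal Python sketch under stated assumptions. The `SubscriberDataService` class, its method names, and the in-memory backends are hypothetical and stand in for whatever data service and storage an organization actually builds. The point it illustrates is that callers use one API while the implementation under the hood can be swapped.

```python
class SubscriberDataService:
    """Hypothetical data service: one API for callers, replaceable backend."""

    def __init__(self, backend):
        # backend is any callable mapping a subscriber id to a record or None.
        self._backend = backend
        # Every call is recorded so usage of the data can be governed and audited.
        self.audit_log = []

    def get_subscriber(self, caller, subscriber_id):
        """Look up one subscriber record on behalf of a named caller."""
        self.audit_log.append((caller, subscriber_id))
        record = self._backend(subscriber_id)
        if record is None:
            raise KeyError(subscriber_id)
        return dict(record)  # return a copy so callers cannot mutate the store

    def swap_backend(self, backend):
        """Change the implementation under the hood without changing the API."""
        self._backend = backend


# First backend: a plain dictionary standing in for a curated data product.
store_v1 = {1: {"plan": "basic"}}
service = SubscriberDataService(store_v1.get)
before = service.get_subscriber("marketing_app", 1)

# Later backend: a different implementation, same API for every caller.
service.swap_backend(lambda sid: {"plan": "premium"} if sid == 1 else None)
after = service.get_subscriber("marketing_app", 1)
```

The caller code is identical before and after the swap, which is the flexibility the extend pattern trades against the extra work of building and securing the service.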

Tom Soderstrom:
I think that is so important. So many executives have asked me, "When am I done? When can I just stop developing and just start using it?" And the key here is they don't paint themselves into a corner.

Mai-Lan Tomsen Bukovec:
We're never done, Tom. We're never done. I will also say when I talk to CTOs and they go back to their business owners that they are working with and they go back with a message of, "Here's your data. You can choose any of these patterns. I recommend that you go with curate, but if later you want to move to this API, you can." That is a conversation.

It's not top-down dictating how do you use data, which is a very hard thing to do because your business owners actually know their businesses well too, their business units. And so if you can have this dialogue between the best of both worlds, the power and the flexibility of your evolving data, being able to take advantage of these new technologies, but to do it in a safe and governed way, and then you give the choice to your business owners to choose, "Okay, what analytics client do you want? How do you want to actually work with that data?" that is actually the heart of AWS. It's choice, it's flexibility, but it's really unlocking the innovation that we know our customers do every day. And we continue just to be inspired by that.

Tom Soderstrom:
So where do you think... With all this, what are the biggest challenges in the next five years and the biggest opportunities at the executive level and their staff?

Mai-Lan Tomsen Bukovec:
Well, I think the first one is to embrace change. And that is hard-

Tom Soderstrom:
Yes it is.

Mai-Lan Tomsen Bukovec:
... for many people. I mean, if you are a CTO, a CIO, a CEO, you embrace change all the time. It's part of your life. It's what you do every day. And in fact, often you'll go into meetings with your own teams, and part of your job is to help your teams embrace change in the right way that fits for your company.

And I think that the first challenge is not just how you personally embrace change, but how do you help guide your organization to understand the changes and to explore the spirit of possibility? And sometimes that is as basic as rolling out something like a prompt library.

And a prompt library is just basically the questions that you ask AI to get answers that you need to do in the course of your job. And it is an incredibly demystifying thing where if you can just give a set of questions and you can train your workforce on how to ask the right questions to AI, they will get so much more back from their AI interactions, and it will make it less scary. It'll make it quicker to get onboarded, and then the natural curiosity of human beings will take over. I really believe that.
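
A prompt library like the one described here can be as simple as a shared set of named question templates. The sketch below is a minimal illustration in Python using only the standard library. The template names and wording are hypothetical examples, not prompts from AWS or from any particular organization.

```python
from string import Template

# Hypothetical shared prompts; each one is a reusable question with blanks to fill in.
PROMPT_LIBRARY = {
    "summarize_ticket": Template(
        "Summarize this customer ticket in $max_sentences sentences:\n$ticket"
    ),
    "draft_reply": Template(
        "Draft a $tone reply to this customer message:\n$message"
    ),
}


def render_prompt(name, **fields):
    """Fill in a library prompt; raises KeyError if the name or a field is missing."""
    return PROMPT_LIBRARY[name].substitute(**fields)


prompt = render_prompt("draft_reply", tone="friendly", message="Where is my order?")
print(prompt)
```

Keeping the templates in one place means a team can review and improve the questions once and have everyone benefit.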

And at that point, you've taken this, "How can I embrace change? How can I as a leader embrace change? But how can I understand how I can help my organization embrace change?" I think that's going to be one of the top challenges for every engineering executive.

The second part of that, Tom, is, "How do I do that in a way that is pragmatic for my business?" Which is, "How can I take this lovely and interesting new technology of AI, how can I put it to work for the economics of what I do?"

Tom Soderstrom:
That's right.

Mai-Lan Tomsen Bukovec:
And in that I say, go find workflows, operations, processes in your business where the speed of AI assistance can help create a monetary value for your business, Tom.

Tom Soderstrom:
Absolutely. It's all about the business case.

Mai-Lan Tomsen Bukovec:
If you think about how modernization projects start now, often they are an initiative started by a CIO that says, "We need to modernize. We're going to go to the cloud, we're going to do all this stuff and it's going to save us money." But the application ownership is often sitting over with the business unit.

Tom Soderstrom:
That's right.

Mai-Lan Tomsen Bukovec:
But you have to go to that business unit and say, "I would like to do this. Can you do this?" And the answer you're going to get sometimes, not all the time, sometimes is a business unit's going to be like, "You are number 42 in my priority"-

Tom Soderstrom:
That's right.

Mai-Lan Tomsen Bukovec:
... "list. I have all these new features I want to do. And what you're asking me to do is just basically a port, it's a migration, and that's 42 and slipping as we speak." And so with AI, what you're able to do with Q Developer is you're able to go into an application experience and the IT generalist is able to kick off a migration because in the AI, there's an understanding of both the source of the application, which is all the source code with Windows framework and the destination, which is .NET Core, and it can do the migration for you.

And if it's not able to do the migration, it'll give you a to-do list. And it's a completely different conversation because then your central IT function can go to your business owner and say, "Well, I did most of this. There's two things left. Do you want to do that? That can be done in a week."

And Signiant Group, which is a UK financial company, a FinTech provider, they used this capability. They took a Windows migration project that was going to take them eight months to do, and they were able to do it in a few days. Okay?

And the most important thing for them is that they said they didn't have to reprioritize their application developers' time. That was the most important thing. It wasn't just shortening the time to migrate, which, oh by the way, has that financial benefit of avoiding licensing costs that much faster, but it was a new way to work.

Tom Soderstrom:
We see that a lot. If you can speed it up, you reduce risk because if it takes too long, the key people get pulled off.

Mai-Lan Tomsen Bukovec:
That's right.

Tom Soderstrom:
So I think that's a great insight also for the developers because you can now build in the security and the compliance-

Mai-Lan Tomsen Bukovec:
That's right.

Tom Soderstrom:
... while you're building it because we see a lot of delay between something is ready to go live until it actually goes live because it has to become compliant. So that's another agentic AI opportunity, I think.

Mai-Lan Tomsen Bukovec:
Yeah, there's so much. I mean, all these workflows. You think about the workflows of today and you think about how an agent can help, and especially with the new capabilities of these agents to themselves use tools or themselves use APIs as an example, it's just... The next couple of years is going to be fascinating.

Tom Soderstrom:
If you were going to give the executives three things to focus on, three golden nuggets from Mai-Lan, what would they be?

Mai-Lan Tomsen Bukovec:
Well, I think the first thing I would do is always start with your data. I mean, Tom, you know this from talking to so many customers yourself, the customers that move fastest into the new world. And to be honest, the new world can be using traditional ML. It can be using our latest generation of these AI models, these super capable AI models. It can be agentic workflows, you name it.

Whatever that new world is, it is based on data. And you see it all the time. People say it, "Data is your differentiator. I need to customize and differentiate my... How do I do that?" You do it with your data.

Tom Soderstrom:
Yeah.

Mai-Lan Tomsen Bukovec:
And for so many customers, if they are in the process of modernizing their data platform by moving it into the cloud with that first step of aggregation, just do it faster. Because the world is moving so fast now with all these different ways to interact with your data, that if you don't take that first fundamental step up... I just think the whole world is going to move so quickly here that you don't want to be left behind.

Tom Soderstrom:
I agree. The risk is to do nothing.

Mai-Lan Tomsen Bukovec:
The risk is to do nothing or do it too slow, Tom. So that first one is: modernize your data platform. Okay?

The second one is: be relentlessly curious yourself. In these times of great pivots... And we're in the middle of one right now. We had one with cloud, we're in the middle of one right now with AI. And in these moments of great pivots, the organization looks to its leaders to decide-

Tom Soderstrom:
That's right.

Mai-Lan Tomsen Bukovec:
... how do I approach this new world, this new arena? And if the leadership... And I mean that from the top through the organization, sometimes there is something that you might have seen called the frozen middle where-

Tom Soderstrom:
Oh, yeah.

Mai-Lan Tomsen Bukovec:
... the leader of the organization is very appreciative and excited and sees the possibility. And then the developers in the organization are like, "This is actually really cool."

And then you have this middle which is frozen because they don't know how to move forward with the new technology and they know they can't go back. So they're stuck. And so if you think about the leadership of your organization, how are you going to embrace change? What are you going to do? And a lot of it is to think about how do you roll out benefits for these new technologies in a way that shows immediate value to the organization.

Tom Soderstrom:
So important.

Mai-Lan Tomsen Bukovec:
So important. And that's why I actually get very excited about those back of the house applications, the ones that help improve the productivity of a business and maybe are not the new customer experience, but I'll tell you, they're powering how your teams work.

Tom Soderstrom:
The third?

Mai-Lan Tomsen Bukovec:
I think the third one is a little bit more of a personal thing. And for me it's related to the two, but it's... We're not done with this massive wave of innovation. And so my third thing is really for the individual executive, which is: what are you doing in the time of your day to know where this world is going?

Because, Tom, I think we're just getting started. All these capable models and the agentic infrastructures and the latest things that are happening, you just see these new announcements and these new discoveries happening every three to six months. So what do we do as organizational leaders to create headspace in our own time, in our own day, to make sure that we're able to absorb and look ahead and do those first things that I was talking about? Because sometimes we forget about that. And in a time of rapidly evolving change, our own capability to listen and learn and expand the product boundaries of whatever we're building, it's directly related to our ability to also absorb change and to lead.

Tom Soderstrom:
Thank you very much, Mai-Lan. It's always a pleasure.

Mai-Lan Tomsen Bukovec:
Always a pleasure, Tom.
