AWS Cloud Enterprise Strategy Blog

Your AI is Only as Good as Your Data

Generative AI is undoubtedly one of the most transformative and disruptive technologies of our time. These powerful models can generate human-like text, images, code, and more in ways that seem almost miraculous. However, behind the awe-inspiring outputs lies an even more incredible foundation – the massive datasets and robust data operations required to make generative AI possible.

Although generative AI models often dominate the headlines and discussions, they merely represent the visible fraction of a much larger data iceberg. The real driving force behind these innovations lies in the extensive volume of meticulously curated training data. This data serves as the engine that enables the models to comprehend, learn, and ultimately generate new content with human-like capabilities. Just as an iceberg’s surface pales in comparison to its vast underwater bulk, your organization’s data assets and infrastructure serve as the indispensable foundation supporting any ambitions in generative AI.

As a leader, it’s critical to recognize that data – its quality, diversity, governance, and operational pipelines – will make or break your generative AI initiatives. World-class generative AI is simply not possible without world-class data. Investing in robust data practices is not an optional nice-to-have, but a core requisite for unlocking generative AI’s full potential while mitigating risks. Mastering the data iceberg is the key to riding the generative AI wave successfully.

Data’s Make-or-Break Role

A recent study conducted by Thomas H. Davenport, Randy Bean, and Richard Wang in association with AWS (https://d1.awsstatic.com/psc-digital/2023/gc-600/cdo-agenda-2024/cdo-agenda-2024.pdf) found that 93% of Chief Data Officers agree that data strategy is crucial for getting value from generative AI; however, 57% admit they have not yet built the necessary strategy.

The risks of falling behind are immense. Organizations that don’t cultivate broad, clean, well-curated data assets will find themselves at a severe disadvantage as generative AI capabilities become table stakes across industries. Those that have established robust data practices, talent, and infrastructure will be able to develop highly capable generative AI systems that can automate tasks, augment workforce skills, and unlock new business models.

From Data Landfill to Data Product Mindset

To use your data as a strategic asset, you need to reshape how your organization views and manages data. Merely collecting and storing data is no longer enough. To truly harness the power of data as a competitive differentiator, organizations must adopt a transformative approach – one that treats data as a product and fosters a culture of responsible, ethical, and transparent data management.

I recommend these principles to help guide your data strategy:

  1. Treat Data as a Product: In today’s data-driven landscape, it’s imperative to treat data as a product rather than merely a byproduct of operations. This entails adopting practices akin to those used for physical products, such as establishing versioning protocols, dedicated resources, and clear governance structures. Furthermore, developing feature roadmaps for data aligns its evolution with business objectives, ensuring that it remains relevant and valuable over time. A minimal sketch of a versioned data product descriptor appears after this list.
  2. Curate Diverse Datasets: Diverse datasets serve as the foundation for building fair and inclusive AI systems that counteract harmful biases. Organizations must ensure the foundation models they use were built on diverse datasets and, where appropriate, proactively curate datasets that encompass a broad range of demographics and experiences. When your AI targets a broad audience, such as a customer service chatbot handling various accents and dialects, diversity in the dataset is crucial. Similarly, using AI to generate content across multiple domains requires data from varied industries to produce high-quality outputs. Negative user feedback or poor performance in certain segments also signals a need for more representative data. Compliance with ethical guidelines and legal requirements likewise demands diverse datasets to ensure fairness and avoid legal issues. Complex tasks like language translation or image generation require nuanced and varied data to produce sophisticated, contextually appropriate results. To build a truly diverse dataset, include demographic, linguistic, contextual, temporal, content, and behavioral diversity, ensuring your AI serves all users effectively and equitably. This proactive approach helps mitigate algorithmic biases and ensures that AI systems accurately reflect the diversity of all stakeholders. Embracing diversity strengthens the resilience and precision of AI models by capturing a more comprehensive array of insights across various populations, and it reframes diversity as an asset rather than a constraint. By embracing diverse data, organizations empower the development of ethical and socially responsible AI that fully unleashes the potential of their data resources.
  3. Govern by Enabling, Not Restricting: Effective data governance strikes a careful balance between protecting data assets and enabling their productive use. Too often, organizations err on the side of excessive restriction, implementing draconian processes and policies that strangle data access and stifle innovation. Rather than hindering stakeholders with bottlenecks and red tape, a modern data governance approach governs by enabling, not restricting. This involves streamlining data access protocols with self-service capabilities, automating oversight and compliance checks, and providing clear guidelines that educate rather than intimidate. The goal is to make data as universally accessible as possible while still maintaining proper security, privacy, and regulatory adherence. Modern organizations start by asking “Why wouldn’t I share this?” rather than “Why would I?” By adopting this approach, data governance becomes a catalyst for innovation and collaboration rather than a hindrance.
  4. Documentation that Empowers: Comprehensive yet accessible documentation is crucial for responsible development and deployment. Simply inundating practitioners with dense technical details often does more to obfuscate than elucidate. Instead, documentation should empower stakeholders by covering key information in a concise, relevant manner. For generative AI, this includes clear annotation guidelines that codify the training data’s scope, attributes, and limitations. Transparent documentation of data sourcing and preprocessing pipelines enables deeper understanding of the data’s characteristics and potential biases. Model cards that outline an AI system’s intended use cases, performance benchmarks, and known limitations prevent misuse or overreliance; a sketch of a machine-readable model card appears after this list.
  5. Ensure Data Quality: We have all heard “garbage in, garbage out,” but it has never been truer than with generative AI and large language models. These powerful models rely entirely on the quality of their training data, and any flaws or inconsistencies can severely impact their performance and outputs. Poor quality data containing errors, missing values, or inconsistencies can cause generative AI models to produce nonsensical outputs, hallucinate, or exhibit significant deficiencies. A good data quality practice focuses on implementing 1/ robust data validation pipelines to automatically detect anomalies, outliers, drift, and violations of domain integrity constraints, 2/ human review processes to identify nuanced errors that may evade automated checks, and 3/ continuous monitoring, profiling, and bias testing mechanisms throughout the data lifecycle. A simple validation sketch appears after this list.
  6. Respect Privacy, Consent, and Confidentiality: Protecting user privacy, obtaining proper consent, and maintaining data confidentiality are non-negotiable ethical obligations when developing generative AI systems. These powerful models learn from and recreate real-world data like text, images, and audio – which inherently incorporates personal information and intellectual property. As such, organizations must implement robust mechanisms to adhere to all relevant data privacy regulations like GDPR, CCPA, and HIPAA. This involves comprehensive de-identification and anonymization of any personal or sensitive data used for model training. Robust access controls, encryption, and continuous monitoring must comprehensively secure confidential information against misuse or unauthorized exposure. Finally, you must proactively build trust by prioritizing privacy from initial design rather than retroactively addressing violations after harm occurs. Cutting corners on privacy, consent, and confidentiality poses unacceptable risks – both legal and reputational. A basic de-identification sketch appears after this list.
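To make “data as a product” (principle 1) concrete, here is a minimal sketch of what a versioned data product descriptor might look like. The structure and field names are illustrative assumptions, not a standard; in practice you would adapt something like this to your own data catalog or data contract tooling.

```python
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    """Illustrative metadata for treating a dataset as a versioned product."""
    name: str                  # stable product identifier
    version: str               # semantic version, bumped on schema or content changes
    owner: str                 # accountable team -- the "dedicated resource"
    description: str           # what the data represents and its intended uses
    schema: dict = field(default_factory=dict)          # column name -> type
    quality_checks: list = field(default_factory=list)  # named validation rules
    consumers: list = field(default_factory=list)       # known downstream users

# Hypothetical example: registering a training dataset as a product
interactions = DataProductDescriptor(
    name="customer-interactions",
    version="2.1.0",
    owner="data-platform-team",
    description="De-identified support transcripts curated for model training.",
    schema={"transcript": "string", "channel": "string", "timestamp": "datetime"},
    quality_checks=["no_missing_transcripts", "valid_channel_values"],
    consumers=["genai-chatbot-training", "support-analytics"],
)
```

The point is not these specific fields but the discipline: a named owner, an explicit version, and a declared set of consumers turn an anonymous table into a product with a roadmap.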
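For principle 4, a model card can be kept machine-readable and stored alongside the model artifact it describes. This is a sketch only; the keys below are assumptions inspired by the widely used model card concept, and the model name and metrics are made up for illustration.

```python
# A minimal, machine-readable model card sketch (all values are illustrative).
model_card = {
    "model_name": "support-assistant-v3",  # hypothetical model
    "intended_use": "Drafting replies to routine customer support questions.",
    "out_of_scope": ["medical advice", "legal advice"],
    "training_data": {
        "sources": ["customer-interactions v2.1.0"],  # ties back to the data product
        "known_limitations": "Under-represents non-English transcripts.",
    },
    "evaluation": {
        "benchmarks": {"helpfulness_score": 0.87},  # placeholder metric
        "segments_tested": ["en-US", "en-GB"],
    },
    "known_risks": ["May hallucinate order details; requires human review."],
}
```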
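Principle 5’s validation pipelines can start small. The sketch below assumes a pandas DataFrame with made-up column names and thresholds, and shows the shape of automated checks for missing values, domain integrity, and crude drift against a stored reference profile; production pipelines typically layer dedicated tooling on top of ideas like these.

```python
import pandas as pd

def validate_batch(df: pd.DataFrame, reference_mean: float) -> list[str]:
    """Return human-readable data quality issues found in a batch (illustrative)."""
    issues = []

    # 1/ Missing values: flag any column above an assumed 1% tolerance.
    for col, frac in df.isna().mean().items():
        if frac > 0.01:
            issues.append(f"{col}: {frac:.1%} missing values")

    # 2/ Domain integrity: values must come from an allowed set (hypothetical column).
    allowed_channels = {"chat", "email", "phone"}
    unexpected = set(df["channel"].dropna()) - allowed_channels
    if unexpected:
        issues.append(f"channel: unexpected values {sorted(unexpected)}")

    # 3/ Drift: compare a summary statistic to a reference profile (assumed 20% band).
    current_mean = df["response_time_sec"].mean()
    if abs(current_mean - reference_mean) > 0.2 * reference_mean:
        issues.append(f"response_time_sec: mean drifted to {current_mean:.1f}")

    return issues
```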
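And for principle 6, de-identification often begins with simple redaction and pseudonymization before data ever reaches a training pipeline. This standard-library sketch is a starting point only: the regular expressions are deliberately simplistic assumptions, and production systems rely on vetted PII-detection services rather than hand-rolled patterns.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def pseudonymize(user_id: str, salt: str) -> str:
    """Derive a stable, non-reversible token so records stay joinable without raw IDs."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

print(redact_pii("Reach me at jane@example.com or +1 (555) 010-1234."))
# -> Reach me at [EMAIL] or [PHONE].
print(pseudonymize("user-8842", salt="rotate-this-salt"))
```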

To gain traction, I suggest branding the data assets that adhere to these principles as “Compliant Data” or “Trusted Data Assets.” Branding your data assets also elevates their perceived value within the organization. Just as a well-known brand commands premium prices in the marketplace, data assets branded as high-quality products are seen as trustworthy drivers of informed decision-making and strategic initiatives.

Let Value Be Your Guide – Ensure that Your Data is Relevant

The sheer volume of data available can be overwhelming, but not all data is equally valuable. Embrace a value-driven approach by focusing your efforts on curating the data that directly aligns with your specific use cases and objectives. Don’t try to boil the ocean; instead, identify the data sources that hold the greatest potential to drive meaningful outcomes for your generative AI initiatives. Just as with any product, your data must address specific needs and pain points. Engage with your stakeholders, understand their requirements, and build datasets that align with their objectives. Avoid the trap of gathering data without a clear purpose or consumer in mind.

The Road Ahead

Embarking on a generative AI journey is a transformative endeavor, and a well-designed data strategy is the foundation upon which success is built. By prioritizing data relevance, building flexible architectures, embracing unstructured data, aligning data management with generative AI workflows, ensuring data quality, fortifying security and access controls, leveraging crowdsourcing and expertise, and investing in data engineering talent, you can unlock the full potential of this revolutionary technology.

Treating data as a strategic product and competency will require plenty of executive leadership, cross-functional collaboration, and organizational change management. But as generative AI becomes a core business capability, the organizations that have put the effort into improving their data and how they use it will find themselves able to unlock the power of this exciting new technology. Your future AI may dazzle the world, but it will be your data assets that truly enable it to shine.

Tom Godden

Tom Godden is an Enterprise Strategist and Evangelist at Amazon Web Services (AWS). Prior to AWS, Tom was the Chief Information Officer for Foundation Medicine, where he helped build the world's leading FDA-regulated cancer genomics diagnostic, research, and patient outcomes platform to improve outcomes and inform next-generation precision medicine. Previously, Tom held multiple senior technology leadership roles at Wolters Kluwer in Alphen aan den Rijn, the Netherlands, and has over 17 years of experience in the healthcare and life sciences industry. Tom has a Bachelor's degree from Arizona State University.