
Incorporating responsible AI into generative AI project prioritization

Over the past two years, companies have seen an increasing need for a project prioritization methodology for generative AI. There is no shortage of generative AI use cases to consider; the challenge is weighing the business value of a large number of potential projects against their cost, level of effort, and other concerns. Compared to other domains, generative AI introduces new concerns: hallucination, agents that make incorrect decisions and then act on them through tool calls to downstream systems, and a rapidly changing regulatory landscape. In this post, we describe how to incorporate responsible AI practices into a prioritization method so that you can systematically address these types of concerns.

Responsible AI overview

The AWS Well-Architected Framework defines responsible AI as “the practice of designing, developing, and using AI technology with the goal of maximizing benefits and minimizing risks.” The AWS responsible AI framework begins by defining eight dimensions of responsible AI: fairness, explainability, privacy and security, safety, controllability, veracity and robustness, governance, and transparency. At key points in the development lifecycle, a generative AI team should consider the possible harms or risks for each dimension (both inherent and residual risks), implement risk mitigations, and monitor risks on an ongoing basis.

Responsible AI applies across the entire development lifecycle and should be considered during initial project prioritization. That’s especially true for generative AI projects, where there are novel types of risks to consider and mitigations might not be as well understood or researched. Considering responsible AI up front gives a more accurate picture of project risk and mitigation level of effort, and it reduces the chance of costly rework if risks are uncovered later in the development lifecycle. Beyond projects delayed by rework, unmitigated concerns might also harm customer trust, cause representational harm, or fail to meet regulatory requirements.

Generative AI prioritization

While most companies have their own prioritization methods, here we’ll demonstrate how to use the weighted shortest job first (WSJF) method from the Scaled Agile Framework (SAFe). WSJF assigns a priority using this formula:

Priority = (cost of delay) / (job size)

The cost of delay is a measure of business value. It includes the direct value (for example, additional revenue or cost savings), the timeliness (is shipping this project worth much more today than a year from now?), and the adjacent opportunities (would delivering this project open up other opportunities down the road?).

The job size captures the level of effort to deliver the project. That normally includes direct development costs and the cost of any infrastructure or software you need. The job size is also where you can include the results of the initial responsible AI risk assessment and the expected mitigations. For example, if the initial assessment uncovers three risks that require mitigation, you include the development cost of those mitigations in the job size. You can also qualitatively assess that a project with ten high-priority risks is more complex than a project with only two high-priority risks.
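To make the formula concrete, here is a minimal sketch of the WSJF calculation in Python. The `ProjectScores` structure and its field names are hypothetical conveniences for this post, not part of SAFe itself; the sketch assumes every component has already been scored on a common scale (such as 1–5).

```python
from dataclasses import dataclass

@dataclass
class ProjectScores:
    # Cost-of-delay components, each scored on a common scale (for example, 1-5)
    direct_value: int            # additional revenue or cost savings
    timeliness: int              # how much value is lost by delaying the project
    adjacent_opportunities: int  # follow-on opportunities the project unlocks
    # Level of effort, including any responsible AI mitigation work
    job_size: int

def wsjf_priority(scores: ProjectScores) -> float:
    """Priority = (cost of delay) / (job size)."""
    cost_of_delay = (
        scores.direct_value + scores.timeliness + scores.adjacent_opportunities
    )
    return cost_of_delay / scores.job_size
```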

Example scenario

Now, let’s walk through a prioritization exercise that compares two generative AI projects. The first project uses a large language model (LLM) to generate product descriptions. A marketing team will use this application to automatically create product descriptions that go into the online product catalog website. The second project uses a text-to-image model to generate new visuals for advertising campaigns and the product catalog. The marketing team will use this application to more quickly create customized brand assets.

First pass prioritization

First, we’ll go through the prioritization method without considering responsible AI, assigning a score of 1–5 for each part of the WSJF formula. The specific scoring scheme varies by organization: some companies prefer t-shirt sizing (S, M, L, and XL), others prefer a 1–5 score, and others use a more granular scale. A 1–5 score is a common and straightforward way to start. For example, the direct value scores can be defined as:

1 = no direct value

2 = 20% improvement in KPI (time to create high-quality descriptions)

3 = 40% improvement in KPI

4 = 80% improvement in KPI

5 = 100% or more improvement in KPI
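If you want the rubric to be repeatable across teams, you can encode it in a small helper. The function name and exact cutoffs below are illustrative assumptions that simply mirror the example bands above:

```python
def direct_value_score(kpi_improvement_pct: float) -> int:
    """Map a projected KPI improvement (in percent) to a 1-5 direct value score,
    following the example bands above."""
    if kpi_improvement_pct >= 100:
        return 5
    if kpi_improvement_pct >= 80:
        return 4
    if kpi_improvement_pct >= 40:
        return 3
    if kpi_improvement_pct >= 20:
        return 2
    return 1
```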

| | Project 1: Automated product descriptions (scored from 1–5) | Project 2: Creating visual brand assets (scored from 1–5) |
| --- | --- | --- |
| Direct value | 3: Helps the marketing team create higher-quality descriptions more quickly | 3: Helps the marketing team create higher-quality assets more quickly |
| Timeliness | 2: Not particularly urgent | 4: New ad campaign planned this quarter; without this project, the team cannot create enough brand assets without hiring a new agency to supplement the team |
| Adjacent opportunities | 2: Might be able to reuse for similar scenarios | 3: Experience gained in image generation will build competence for future projects |
| Job size | 2: Basic, well-known pattern | 2: Basic, well-known pattern |
| Score | (3 + 2 + 2) / 2 = 3.5 | (3 + 4 + 3) / 2 = 5 |
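Using the hypothetical `wsjf_priority` sketch from earlier, the first-pass scores in the table can be reproduced as follows:

```python
project_1 = ProjectScores(direct_value=3, timeliness=2, adjacent_opportunities=2, job_size=2)
project_2 = ProjectScores(direct_value=3, timeliness=4, adjacent_opportunities=3, job_size=2)

print(wsjf_priority(project_1))  # (3 + 2 + 2) / 2 = 3.5
print(wsjf_priority(project_2))  # (3 + 4 + 3) / 2 = 5.0
```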

At first glance, it looks like Project 2 is more compelling. Intuitively that makes sense—it takes people a lot longer to make high-quality visuals than to create textual product descriptions.

Risk assessment

Now let’s go through a risk assessment for each project. The following table summarizes the outcome of a risk assessment along each of the AWS responsible AI dimensions, with a t-shirt size (S, M, L, or XL) severity level for each risk. The table also includes suggested mitigations.

| | Project 1: Automated product descriptions | Project 2: Creating visual brand assets |
| --- | --- | --- |
| Fairness | L: Are descriptions appropriate in terms of gender and demographics? Mitigate using guardrails. | L: Images must not portray particular demographics in a biased way. Mitigate using human and automated checks. |
| Explainability | No risks identified. | No risks identified. |
| Privacy and security | L: Some product information is proprietary and cannot be listed on a public site. Mitigate using data governance. | L: Model must not be trained on any images that contain proprietary information. Mitigate using data governance. |
| Safety | M: Language must be age-appropriate and not cover offensive topics. Mitigate using guardrails. | L: Images must not contain adult content or images of drugs, alcohol, or weapons. Mitigate using guardrails. |
| Controllability | S: Need to track customer feedback on the descriptions. Mitigate using customer feedback collection. | L: Do images align to our brand guidelines? Mitigate using human and automated checks. |
| Veracity and robustness | M: Will the system hallucinate and imply product capabilities that aren’t real? Mitigate using guardrails. | L: Are images realistic enough to avoid uncanny valley effects? Mitigate using human and automated checks. |
| Governance | M: Prefer LLM providers that offer copyright indemnification. Mitigate using LLM provider selection. | L: Require copyright indemnification and image source attribution. Mitigate using model provider selection. |
| Transparency | S: Disclose that descriptions are AI generated. | S: Disclose that images are AI generated. |

The risks and mitigations are use-case specific. The preceding table is for illustrative purposes only.
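One way to carry the risk assessment into the prioritization step is to record each identified risk as a small data record and derive a rough mitigation-effort signal from the severities. The `Risk` class, severity weights, and strings below are illustrative assumptions for this sketch, not an AWS-defined schema:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    dimension: str   # responsible AI dimension, for example "fairness" or "safety"
    severity: str    # t-shirt size: "S", "M", "L", or "XL"
    mitigation: str  # planned mitigation, for example "guardrails" or "human review"

# Illustrative weights: larger risks imply more mitigation effort
SEVERITY_WEIGHT = {"S": 1, "M": 2, "L": 3, "XL": 5}

def mitigation_effort(risks: list[Risk]) -> int:
    """Sum the severity weights as a rough proxy for the extra work the
    mitigations add to the job size."""
    return sum(SEVERITY_WEIGHT[risk.severity] for risk in risks)

# Risks for Project 2 from the table above
project_2_risks = [
    Risk("fairness", "L", "human and automated checks"),
    Risk("privacy and security", "L", "data governance"),
    Risk("safety", "L", "guardrails"),
    Risk("controllability", "L", "human and automated checks"),
    Risk("veracity and robustness", "L", "human and automated checks"),
    Risk("governance", "L", "model provider selection"),
    Risk("transparency", "S", "disclosure"),
]

print(mitigation_effort(project_2_risks))  # 19 for Project 2
```

A higher total doesn’t translate mechanically into a job size score, but it gives the team a consistent signal when revising the job size in the second pass.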

Second pass prioritization

How does the risk assessment affect the prioritization?

| | Project 1: Automated product descriptions (scored from 1–5) | Project 2: Creating visual brand assets (scored from 1–5) |
| --- | --- | --- |
| Job size | 3: Basic, well-known pattern; requires fairly standard guardrails, governance, and feedback collection | 5: Basic, well-known pattern, but requires advanced image guardrails with human oversight and a more expensive commercial model; a research spike is needed |
| Score | (3 + 2 + 2) / 3 = 2.3 | (3 + 4 + 3) / 5 = 2 |
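Continuing the hypothetical `wsjf_priority` sketch, only the job sizes change in the second pass, because the cost of delay is unaffected by the mitigation work:

```python
project_1 = ProjectScores(direct_value=3, timeliness=2, adjacent_opportunities=2, job_size=3)
project_2 = ProjectScores(direct_value=3, timeliness=4, adjacent_opportunities=3, job_size=5)

print(round(wsjf_priority(project_1), 1))  # (3 + 2 + 2) / 3 ≈ 2.3
print(wsjf_priority(project_2))            # (3 + 4 + 3) / 5 = 2.0
```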

Now it looks like Project 1 is the better project to start with. Intuitively, once you consider responsible AI, that makes sense. Poorly crafted or offensive images are more noticeable and have a larger impact than a poorly phrased product description. And the guardrails available for maintaining image safety are less mature than the equivalent guardrails for text, particularly in ambiguous cases like adhering to brand guidelines. In fact, an image guardrail system might require training a monitoring model or having people spot-check some percentage of the output. You might need to dedicate a small science team to study this problem first.

Conclusion

In this post, you saw how to include responsible AI considerations in a generative AI project prioritization method, and how conducting a responsible AI risk assessment during the initial prioritization phase can change the outcome by uncovering a substantial amount of mitigation work. Moving forward, develop your own responsible AI policy and start adopting responsible AI practices for your generative AI projects. You can find additional details and resources at Transform responsible AI from theory into practice.


About the author

Randy DeFauw is a Sr. Principal Solutions Architect at AWS. He has over 20 years of experience in technology, starting with his university work on autonomous vehicles. He has worked with and for customers ranging from startups to Fortune 50 companies, launching Big Data and Machine Learning applications. He holds an MSEE and an MBA, serves as a board advisor to K-12 STEM education initiatives, and has spoken at leading conferences including Strata and GlueCon. He is the co-author of the books SageMaker Best Practices and Generative AI Cloud Solutions. Randy currently acts as a technical advisor to AWS’ director of technology in North America.