Introduction

This AWS Fundamentals Course is designed to teach you the core concepts you need to work effectively within AWS.

When you are first starting out, AWS can seem overwhelming. A cloud-native paradigm of building infrastructure can be a radical departure from the traditional on-premises way of doing things. And whether this is your first time working with infrastructure or you've been tuning Linux kernels for the last decade, it can be hard to know where to start with AWS's selection of over 175 services.

The AWS Fundamentals Course is designed to help you get started with AWS regardless of your experience. In this course, we will teach you about the five pillars of AWS, mental models to use when thinking about the cloud, and key concepts that will be applicable across any service you end up using.

 

Structure

The AWS Fundamentals Course is divided into five modules. Each module follows the same format:

  • Intro: A short description of the pillar we will be focusing on
  • Mental Model: A guiding mental model to help you understand the concepts introduced in each pillar
  • Concepts: Key concepts covering broad foundational topics for each pillar
  • Conclusion: Summary of what we discussed
  • Further Reading: Additional links and resources
 

The Five Pillars

The Five Pillars covered in the AWS Fundamentals Course come from the AWS Well-Architected Framework. The Well-Architected Framework is the distillation of over a decade of experience building scalable applications on the cloud.

The Five Pillars consist of the following areas:

  1. Security
  2. Performance Efficiency
  3. Reliability
  4. Operational Excellence
  5. Cost Optimization
 

Security

The security pillar focuses on how to secure your infrastructure in the cloud. Security and compliance are a shared responsibility between AWS and the customer. In this shared responsibility model, AWS is responsible for the security of the cloud. This includes the physical infrastructure, software, and networking capabilities of AWS cloud services. The customer is responsible for security in the cloud. This includes the configuration of specific cloud services, the application software, and the management of sensitive data.

 

Mental Model

When thinking about security in the cloud, it is useful to adopt the model of zero trust.

In this model, all application components and services are considered discrete and potentially malicious entities. This includes the underlying network fabric, any agents that have access to your resources, and the software that runs inside your service.

 

Concepts

When we think of security in terms of zero trust, it means we need to apply security measures at all levels of our system. The following are three important concepts involved in securing systems with zero trust in the cloud:

  1. Identity and Access Management (IAM)
  2. Network Security
  3. Data Encryption

 

  • Identity and Access Management (IAM)

    IAM is the service responsible for tracking identities and access in a system. It is managed on AWS through the aptly named IAM service. Access is managed using IAM policies which enforce access boundaries for agents within AWS. There are three fundamental components to an IAM policy:

    • the PRINCIPAL(s) specifies WHO permissions are given to
    • the ACTION(s) specifies WHAT is being performed
    • the RESOURCE(s) specifies WHICH properties are being accessed

    Applying the zero trust model to IAM means adopting the principle of least privilege. This means that every agent should only have the minimal permissions necessary to accomplish their function.

    An IAM policy can be applied to an AWS principal or an AWS resource. Policies that are associated with a principal are known as identity-based policies. Policies that are associated with a resource are known as resource-based policies. Note that only some services (e.g., S3, KMS, SES) support resource-based policies. The sketch below shows a resource-based policy containing all three components.
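
    To make this concrete, here is a minimal boto3 sketch (not taken from the course) of a resource-based S3 bucket policy that grants a single principal read-only access to a single bucket. The account ID, role name, and bucket name are hypothetical placeholders:

    ```python
    import json

    import boto3

    # Resource-based policy with the three fundamental components.
    # All identifiers below are hypothetical examples.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # PRINCIPAL: WHO the permission is given to
                "Principal": {"AWS": "arn:aws:iam::111122223333:role/ReportReader"},
                # ACTION: WHAT is being performed
                "Action": ["s3:GetObject"],
                # RESOURCE: WHICH properties are being accessed
                "Resource": ["arn:aws:s3:::example-reports-bucket/*"],
            }
        ],
    }

    # Attach the policy directly to the bucket (making it a resource-based policy)
    s3 = boto3.client("s3")
    s3.put_bucket_policy(Bucket="example-reports-bucket", Policy=json.dumps(policy))
    ```

    Granting only s3:GetObject on a single bucket, rather than s3:* on all resources, is the principle of least privilege in action.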

    Whether a principal has the permission to perform an action for a particular resource depends on whether the principal's identity-based policy allows them to do so and whether the resource's resource-based policy (if it exists) does not forbid them to do so.

    Note that this is a major simplification of the IAM permission model. There are many additional policy types that affect whether access can be granted. These can include permission boundaries, organization service control policies, access control lists, and session policies. These additional policy types go beyond the scope of this course. More details about them can be found in the Further Reading section of this module.

    Takeaways

    • IAM policies declare the access boundaries for entities within AWS
    • IAM policies consist of PRINCIPALS, ACTIONS, and RESOURCES
    • IAM policies can be used to enforce the principle of least privilege
    • IAM has many policy types - identity-based and resource-based are two examples
    • IAM evaluates access by taking into account all policy types that apply to a given request
     

    Further Reading

  • Network Security

    Network security involves any system, configuration or process which safeguards the access and usability of the network and network-accessible resources. AWS provides a wide breadth of features to secure your network, both at the network level and the resource level.

    A zero trust approach to network security involves a defense in depth approach that applies security controls at all layers of your network (as opposed to just the outermost layer).

    Network Level Security

    The fundamental network-level primitive in AWS is the Amazon Virtual Private Cloud (VPC). This is a logical network which you define and can provision resources into.

    The following are some components that make up the VPC:

    • Subnets: a range of IP addresses within your VPC
    • Route tables: a set of rules that determine where traffic is directed
    • Internet gateway: a component that allows communication between resources inside your VPC and the internet

    To safeguard traffic in your VPC, you can divide your resources into public-facing resources and internal resources. To reduce the attack surface, you can use a proxy service like the Application Load Balancer (ALB) to handle all internet-facing traffic. All internal services like servers and databases can then be provisioned inside internal subnets that are cut off from direct public internet access.
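
    To make this concrete, here is a minimal boto3 sketch of such a layout. The CIDR ranges are hypothetical examples, and a production setup would also spread subnets across multiple Availability Zones:

    ```python
    import boto3

    ec2 = boto3.client("ec2")

    # Logical network for our resources (CIDR range is a hypothetical example)
    vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
    vpc_id = vpc["Vpc"]["VpcId"]

    # A public subnet for internet-facing resources (e.g., a load balancer)
    # and a private subnet for internal resources (e.g., databases)
    public_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
    private_subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24")

    # An internet gateway plus a route table entry gives the public subnet
    # a path to the internet; the private subnet gets no such route
    igw = ec2.create_internet_gateway()
    igw_id = igw["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

    route_table = ec2.create_route_table(VpcId=vpc_id)
    rtb_id = route_table["RouteTable"]["RouteTableId"]
    ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
    ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=public_subnet["Subnet"]["SubnetId"])
    ```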

    In addition to VPCs, you can also use AWS Web Application Firewall (WAF) to further restrict traffic into your network.

    Resource Level Security

    Individual AWS resources also have network security controls that you can configure. The most common control is the security group. Security groups are virtual firewalls you can use to control traffic flowing into and out of your resource. Use security groups to allow traffic to your instance only from specific ports and trusted sources. You can attach security groups to resources like EC2 instances, RDS instances, and Lambda functions.
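
    As an illustration, here is a hedged boto3 sketch that creates a security group and allows inbound HTTPS only from a trusted source. The VPC and security group IDs are hypothetical placeholders:

    ```python
    import boto3

    ec2 = boto3.client("ec2")

    # Virtual firewall scoped to a VPC (the VPC ID is a hypothetical placeholder)
    sg = ec2.create_security_group(
        GroupName="web-tier",
        Description="Allow HTTPS from the load balancer only",
        VpcId="vpc-0123456789abcdef0",
    )

    # Only allow inbound HTTPS (port 443), and only from a trusted source:
    # here, the security group of a (hypothetical) load balancer
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[
            {
                "IpProtocol": "tcp",
                "FromPort": 443,
                "ToPort": 443,
                "UserIdGroupPairs": [{"GroupId": "sg-0aaaaaaaaaaaaaaaa"}],
            }
        ],
    )
    ```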

    Takeaways

    • Network security involves mechanisms designed to safeguard the access and usability of the network and network-accessible resources
    • A zero trust approach to network security involves implementing defense in depth at all layers of your network
    • VPCs and WAFs allow you to apply security measures at the network level
    • Security groups allow you to apply security measures at the resource level

    Further Reading

  • Data Encryption

    Data encryption is the process of encoding information in such a way that it is unintelligible to any third party that does not possess the key necessary to decipher the data.

    Adopting a zero trust model for data means encrypting our data everywhere, both in transit and at rest.

    Encryption in Transit

    Encryption in transit involves encrypting the data as it travels between systems. All storage and database services within AWS provide HTTPS endpoints that support the encryption of data in transit. AWS also offers network services that can help enforce encryption in transit for your own services. For example, you can use the Application Load Balancer (ALB) to enforce a connection over HTTPS to your endpoints.

    Encryption at Rest

    Encryption at rest involves encrypting the data within systems. All AWS storage and database services support encryption at rest. Most of these services have encryption turned on by default. There is no additional charge for encryption and negligible performance overhead for encrypting your data.

    Most storage and database services also integrate directly with the AWS Key Management Service (KMS). This is a central key management service that gives you the ability to create Customer Managed Keys (CMKs) to encrypt your data.

    Using a CMK provides you with additional functionality beyond encryption. This includes the ability to use your own custom key store, the ability to create an audit trail for the encrypted resource through integrations with AWS CloudTrail, and enforcement of automatic key rotation.
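
    A minimal boto3 sketch of this setup, assuming a hypothetical bucket name, might look as follows. It creates a CMK, makes it the bucket's default encryption key, and enables automatic key rotation:

    ```python
    import boto3

    kms = boto3.client("kms")
    s3 = boto3.client("s3")

    # Create a customer managed key (CMK) so we control the encryption key
    key = kms.create_key(Description="CMK for example-reports-bucket")
    key_id = key["KeyMetadata"]["KeyId"]

    # Enforce encryption at rest by default: every new object in the bucket
    # (name hypothetical) is encrypted server-side with our CMK
    s3.put_bucket_encryption(
        Bucket="example-reports-bucket",
        ServerSideEncryptionConfiguration={
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": key_id,
                    }
                }
            ]
        },
    )

    # Opt in to automatic key rotation for the CMK
    kms.enable_key_rotation(KeyId=key_id)
    ```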

    Takeaways

    • Encryption is the process of encoding information in such a way that only parties with the correct key can decipher the information
    • Secure data by encrypting it in transit and at rest
    • All storage and database services on AWS provide encryption at rest and in transit
    • You can use AWS networking services like the ALB to enforce encryption in transit for your own services
    • You can use a CMK to unlock advanced functionality like creating audit trails, using your own custom keys, and automatic key rotation

    Further Reading


Conclusion

In this module, you have learned about the security pillar of AWS. You have learned about the mental model of zero trust. You have learned about IAM and the principle of least privilege. You have learned about AWS network security and the principle of defense in depth. You have learned about data encryption and applying it both in transit and at rest.

 

Further Reading

Performance Efficiency

The performance efficiency pillar focuses on how you can run services efficiently and scalably in the cloud. While the cloud gives you the means to handle any amount of traffic, it requires that you choose and configure your services with scale in mind.

 

Mental Model

When thinking about performance efficiency in the cloud, it is useful to think of your services as cattle, not pets.

In the on-premises model of doing things, servers were expensive and often manually deployed and configured. It could take weeks before a server was actually delivered and physically plugged into your data center. Because of this, servers were treated like pets - each one was unique and required a lot of maintenance. Some of them even had names.

The cloud way of thinking about servers is as cattle. Servers are commodity resources that can be automatically provisioned in seconds. No single server should be essential to the operation of the service.

 

Concepts

Thinking of servers as cattle gives us many performance-related benefits. In the "pet model" of managing servers, it was quite common to use the same type of server (or even the same server) for multiple workloads - it was too much of a hassle to order and provision different machines. In the "cattle model," provisioning is cheap and quick, which gives us the freedom to select the server type that most closely matches our workload.

The "cattle model" also makes it easy for us to scale our service. Because every server is interchangeable and quick to deploy, we can quickly scale our capacity by adding more servers.

We will focus on the following two concepts for performance efficiency:

  1. Selection
  2. Scaling
 
  • Selection

    Selection on AWS is the ability to choose the service that most closely aligns with your workload. AWS has the broadest selection of services, with over 175 services spread across over two dozen categories. Achieving performance through selection means being able to choose the right tool for the job.

    A typical workload requires selecting services from some or all of the four main service categories in AWS: Compute, Storage, Database, and Network.

    • Compute deals with the service that will process your data (e.g., virtual machine)
    • Storage deals with the static storage of data (e.g., object store)
    • Database deals with organized storage of data (e.g., relational database)
    • Network deals with how your data moves around (e.g., content delivery network)

    In this module, we will go over making the right selection in the first three categories. Please refer to the Further Reading section for guides on choosing among different networking options.

    Regardless of what category of service you are choosing from, there are three things you need to consider.

    1. Type of Service
    2. Degree of Management
    3. Configuration

    Type of Service

    When you are making a selection in a category, AWS provides you with many options for the type of service that you can use. The type is unique to each category.

    When selecting a compute service, decide if you want VM-based, container-based, or serverless-based compute.

    • VM-based compute has the most familiar mental model for most people but can be more expensive and require more maintenance
    • Container-based compute enables a finer division of your workload and can scale quickly, but comes with additional configuration and orchestration complexity
    • Serverless-based compute abstracts away most of the management and scaling complexities, but has hard system limitations and requires adopting new toolchains and processes

    When selecting a storage service, decide if you need a file store, a block store, an object store, or an archival store.

    • Block storage services like EBS are great for persisting data from a single EC2 instance
    • File systems like EFS are great for giving multiple clients access to the same data
    • Object stores like S3 are great for big blobs of data that need to be accessed by any number of clients
    • Archival stores like S3 Glacier are great for large amounts of data that need to be accessed only infrequently

    When selecting a database service, decide if you need a relational database, a non-relational database, a data warehouse solution, or a data indexing and searching solution.

    • Relational databases let you have joins and ACID properties but have an upper limit on performance and data storage
    • Non-relational databases have more flexible schemas and can scale to much higher limits than their relational counterparts but generally lack joins and full ACID capabilities
    • Data warehouse solutions enable large-scale analytics through quick access to petabytes of structured data
    • Data indexing and searching solutions let you index and search through data from a wide variety of sources
     

    Degree of Management

    Once you've decided on a type of service, you need to further narrow down to a specific service - AWS sometimes provides multiple offerings across specific service types. The primary difference between various AWS services of the same type can be found in their degree of management.

    When you are using compute services and decide on the VM type, you need to choose between EC2, Lightsail, and Elastic Beanstalk. Using EC2 directly gives you the most control but is also the least managed, whereas choosing Lightsail trades some customization for a much more managed experience. Elastic Beanstalk sits somewhere in between - it gives you an opinionated framework for your service architecture but allows customization through configuration.

    When you are using storage services, selecting a service is easier as there tends to be just one service for any given type (e.g., S3 for object store, EFS for file store).

    When you are using database services and decide on the relational type, you need to choose between RDS and Aurora. RDS gives you more control over the underlying database and is available for most relational databases. Aurora only works with certain versions of MySQL and PostgreSQL but takes care of managing the underlying storage and has built-in clustering support.

    At the end of the day, the choice of a specific service depends largely on your familiarity with the underlying technology and your preference for a more or less managed experience.

     

    Configuration

    Once you've decided on a service, you will need to further decide how you want to configure it. The configuration depends on the specific performance characteristics you wish to achieve, which differ for each service category.

    When evaluating the performance characteristics of compute services, a good place to start is looking at your memory and compute requirements:

    • If you are using VM-based compute, memory and CPU are affected by the size of your instance (e.g., t3.small vs t3.xlarge) and the instance family (e.g., r3.small vs c5.small)
    • If you are using container-based compute, memory and CPU can be set individually
    • If you are using serverless-based compute, only the memory can be directly set - the amount of CPU (as well as other system resources) scales linearly with the amount of memory allocated

    Note that depending on your workload, there might be additional resource constraints you care about, such as network capacity and the availability of certain resources like instance storage.
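
    To illustrate where these configuration knobs live, here is a hedged boto3 sketch; the function name and AMI ID are hypothetical placeholders:

    ```python
    import boto3

    # Serverless compute: memory is the only directly configurable resource;
    # CPU (and other resources) scale with it. Function name is hypothetical.
    aws_lambda = boto3.client("lambda")
    aws_lambda.update_function_configuration(
        FunctionName="report-generator",
        MemorySize=1024,  # MB
    )

    # VM-based compute: memory and CPU follow from the instance family and size
    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
        InstanceType="r5.large",          # memory-optimized family, "large" size
        MinCount=1,
        MaxCount=1,
    )
    ```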

    When evaluating the performance characteristics of storage services, consider your latency, throughput, and IOPS requirements:

    • If you are using a block storage service
      • Latency is affected by the selection of the volume type (e.g., solid-state drive vs. hard disk drive)
      • Throughput is proportional to volume size for most volume types
      • IOPS capacity is proportional to volume size for most volume types
    • If you are using a file system service
      • Latency and IOPS are affected by your choice of performance modes
      • Throughput is affected by your choice of using provisioned throughput
    • If you are using an object store
      • Latency is affected by the geographic distance to the bucket endpoint
      • Throughput is affected by the use of throughput optimized APIs such as multipart upload
      • IOPS is not configurable
    • If you are using an archival store
      • Latency is affected by the geographic distance to the bucket endpoint and choice of retrieval method
      • Throughput is affected by the use of throughput optimized APIs such as multipart upload
      • IOPS is not configurable

    When evaluating the performance characteristics of database services, consider your resource requirements (e.g., CPU, memory, storage):

    • If you are using a relational database
      • Resource capabilities are determined by your choice of EC2 instance
    • If you are using a non-relational database like DynamoDB
      • Resource capabilities are determined by the read and write capacity you provision
    • If you are using a data warehouse solution like Redshift
      • Resource capabilities are determined by your choice of underlying EC2 instance
    • If you are using a data indexing and searching solution like the Elasticsearch Service
      • Resource capabilities are determined by your choice of EC2 instance

     

    Takeaways

    • AWS has a lot of services and many ways to achieve your outcome
    • Implementing a workload on AWS involves selecting services across the compute, storage, database and network categories
    • Within each category, you can select the right type of service based on your use case
    • Within each type, you can select the specific service based on your desired degree of management
    • Within each service, you can select the specific configuration based on the specific performance characteristics you want to achieve
     

    Further Reading

  • Scaling

    While choosing the right service is key to getting started, choosing how it scales is important to continued performance.

    AWS has two primary means of scaling:

    1. Vertical Scaling
    2. Horizontal Scaling


    Vertical Scaling

    Vertical scaling involves upgrading your underlying compute to a bigger instance type. For example, say you are running a t3.small instance. Vertically scaling this instance might be upgrading it to a t3.large.

    Vertical scaling is typically easier to implement as you can do it without having to cluster your service. The disadvantage is that you run into a much lower upper limit (the maximum size of a single instance) than with horizontal scaling. It also represents a single point of failure because disruption to your instance can result in your service being completely unavailable.

     

    Horizontal Scaling

    Horizontal scaling involves increasing the number of underlying instances. For example, say you are running a t3.small instance. Horizontally scaling this instance would involve provisioning two additional t3.small instances.

    Horizontal scaling involves more overhead on the implementation side. This is because you need a proxy service to route traffic to your fleet of instances. You also need to perform health checks to take bad instances out of the routing pool, as well as choose a specific routing algorithm that is optimal for your workload. In exchange, you end up with a service that is much more resilient and can scale to far higher limits than its vertically scaled counterpart.
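
    The following boto3 sketch contrasts the two approaches. The instance ID and Auto Scaling group name are hypothetical placeholders, and vertical scaling as shown requires a stop/start cycle (i.e., downtime):

    ```python
    import boto3

    ec2 = boto3.client("ec2")
    autoscaling = boto3.client("autoscaling")

    # Vertical scaling: stop the instance, change its type, start it again
    # (instance ID is a hypothetical placeholder)
    instance_id = "i-0123456789abcdef0"
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "t3.large"})
    ec2.start_instances(InstanceIds=[instance_id])

    # Horizontal scaling: grow an existing Auto Scaling group (name hypothetical)
    # from one t3.small to three; a load balancer routes across the fleet
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="web-fleet",
        DesiredCapacity=3,
    )
    ```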


    Takeaways

    • Scaling vertically is simpler operationally but represents an availability risk and has lower limits
    • Scaling horizontally requires more overhead but comes with much better reliability and much higher limits
     

    Further Reading

Conclusion

In this module, you have learned about the performance efficiency pillar of AWS. You have learned about the mental model of treating your servers as cattle instead of pets. You have learned how to choose the right service as well as its configuration based on your performance goals. You have learned about scaling services and the tradeoffs between vertical and horizontal scaling.

 

Further Reading

Reliability

The reliability pillar focuses on how you can build services that are resilient to both service and infrastructure disruptions. Much like with performance efficiency, while the cloud gives you the means to build resilient services that can withstand disruption, it requires that you architect your services with reliability in mind.


Mental Model

When thinking about reliability in the cloud, it is useful to think in terms of blast radius. You can think of blast radius as the maximum impact that might be sustained in the event of a system failure. To build reliable systems, you want to minimize the blast radius of any individual component.

 

Concepts

When you think in terms of blast radius, failure is no longer a question of if but a matter of when. To deal with failure when it happens, the following techniques can be used to limit the blast radius:

  1. Fault Isolation
  2. Limits


  • Fault Isolation

    Fault isolation limits the blast radius of an incident by using redundant independent components separated through fault isolation zones. Fault isolation zones contain the impact of any failures to the area within the zone.

    AWS has fault isolation zones at three levels:

    1. Resource and Request
    2. Availability Zone
    3. Region

     

    Resource and Request

    AWS services partition all resources and requests on a given dimension, like the resource ID. These partitions are referred to as cells. Cells are designed to be independent and to contain failures within a single cell. Behind the scenes, AWS uses techniques like shuffle sharding to limit the blast radius. All of this happens transparently every time you make a request or create a resource and requires no additional action on your part.


    Availability Zone

    An AWS Availability Zone (AZ) is a completely independent facility with dedicated power, services, and networking capabilities. AZs are geographically separated from one another to avoid correlated failures from environmental hazards such as fires and floods.

    Fault isolation is achieved at the AZ level by deploying redundant instances of your service through multiple AZs. Doing so means that a power event in one AZ will not affect your traffic in another AZ.

    And a note about latency - despite being geographically separate, AZs are located close enough to each other that the network latency between AZs is minimal. This makes it possible for certain features like synchronous data replication to work between AZs.
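
    As a concrete example of AZ-level fault isolation, here is a hedged boto3 sketch of a Multi-AZ RDS deployment, which relies on exactly this kind of synchronous cross-AZ replication. All identifiers and credentials are hypothetical placeholders:

    ```python
    import boto3

    rds = boto3.client("rds")

    # Multi-AZ deployment: RDS provisions a standby replica in a different AZ,
    # keeps it in sync with synchronous replication, and fails over automatically.
    # Identifier and credentials below are hypothetical placeholders.
    rds.create_db_instance(
        DBInstanceIdentifier="orders-db",
        Engine="postgres",
        DBInstanceClass="db.t3.medium",
        AllocatedStorage=100,
        MasterUsername="dbadmin",
        MasterUserPassword="change-me-please",
        MultiAZ=True,
    )
    ```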


    Region

    An AWS region provides the ultimate isolation. Each region is a completely autonomous geographic area composed of two or more AZs. Fault isolation is achieved at the region level by deploying redundant copies of your services across different AWS regions (this is exactly what AWS does with its own services).

    Consider deploying to multiple regions if you require very high levels of availability. Note that operating a service across multiple regions has significant overhead because of the absence of shared infrastructure between regions. There are services and features that can help you with multi-region buildouts. For example, you can use Route 53, AWS's scalable DNS service, to route traffic between different regions. You can also use features like DynamoDB Global Tables and S3 Cross-Region Replication to replicate your data across regions.


    Takeaways

    • Use fault isolation zones to limit the blast radius of service or infrastructure disruptions
    • Fault isolation at the resource and request level is built into the design of every AWS service - this requires no additional actions on your part
    • Fault isolation at the AZ level is achieved by deploying your services across multiple AZs - this can be done with minimal latency impact
    • Fault isolation at the region level is achieved by deploying your services across multiple regions - this requires significant operational overhead


    Further Reading

  • Limits

    Limits are constraints that can be applied to protect your services from excessive load. They are an effective means of limiting the blast radius from both external (e.g., DDoS attack) and internal (e.g., software misconfiguration) incidents.

    AWS services have service-specific limits on a per-account per-region basis. These limits are also known as service quotas. These are maximum values for certain resources, actions, and items in your AWS account.

    There are two types of limits:

    • soft limits which can be increased by requesting an increase from AWS
    • hard limits that cannot be increased

    Each service has different limits. To track your limits and request increases, you can use the Service Quotas service.

    It is important to monitor service limits and know when you are approaching yours to avoid service disruption. Some limits, like the number of concurrent Lambda executions, can be tracked via CloudWatch. Others, like the number of EC2 instances, need to be tracked manually or via scripts. You can use the AWS Trusted Advisor service to track your limits if you have a Business Support or Enterprise Support plan. Open-source tools like awslimitchecker can also be used to automate the process.
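
    For example, the following boto3 sketch lists the quotas for a service and requests an increase on one of them. The quota code shown is a hypothetical placeholder - look up real codes with list_service_quotas or in the Service Quotas console:

    ```python
    import boto3

    quotas = boto3.client("service-quotas")

    # Review current quotas for a service and whether each is a soft limit
    for quota in quotas.list_service_quotas(ServiceCode="lambda")["Quotas"]:
        print(quota["QuotaName"], quota["Value"], "adjustable:", quota["Adjustable"])

    # Request an increase for a soft limit
    quotas.request_service_quota_increase(
        ServiceCode="lambda",
        QuotaCode="L-XXXXXXXX",  # hypothetical quota code
        DesiredValue=5000.0,
    )
    ```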


    Takeaways

    • Limits are constraints that can be applied to protect a service from excessive load
    • AWS service limits can be tracked and managed using the Service Quotas service
    • There are soft limits, which can be increased, and hard limits, which cannot
    • Monitor limits for services that you are using and plan your limit increases accordingly to avoid service disruption


    Further Reading

Conclusion

In this module, you have learned about the reliability pillar of AWS. You have learned about the mental model of thinking in terms of blast radius. You have learned about using fault isolation zones to limit blast radius. You have learned about service limits and how to increase yours to avoid service disruption.

 

Further Reading

Operational Excellence

The operational excellence pillar focuses on how you can continuously improve your ability to run systems, create better procedures, and gain insights.


Mental Model

When thinking about operational excellence in the cloud, it is useful to think of it in terms of automation.

Human error is the primary cause of defects and operational incidents. The more operations that can be automated, the less chance there is for human error.

In addition to preventing error, automation helps you continuously improve your internal processes. Automated processes promote a set of repeatable best practices that can be applied across your entire organization.


Concepts

When you think of operations as automation, you want to focus your efforts in the areas that currently require the most manual work and might have the biggest consequence for error. You'll also want to have a process in place to track, analyze, and improve your operational efforts.

We will focus on the following two concepts for operational excellence:

  1. Infrastructure as Code
  2. Observability


  • Infrastructure as Code

    Infrastructure as code (IaC) is the process of managing your infrastructure through machine-readable configuration files. IaC is the foundation that allows for the automation of your infrastructure.

    Instead of manually provisioning services, you create templates that describe the resources you want. The IaC platform then takes care of provisioning and configuring the resources on your behalf.

    IaC gives you a declarative and automated way of provisioning infrastructure. It allows you to apply the same tools (e.g., git) and processes (e.g., code review) for your infrastructure as you already do for your code.

    IaC on AWS has traditionally been implemented using the CloudFormation service. CloudFormation requires declaring your resources using JSON or YAML. If these configuration languages aren't your cup of tea, AWS also provides the Cloud Development Kit (CDK) which allows you to author CloudFormation templates using native programming languages like JavaScript, Python, and Java.
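
    As an illustration (a minimal sketch, not the course's own example, assuming CDK v2 is installed via pip install aws-cdk-lib constructs), the following CDK app in Python declares a single encrypted, versioned S3 bucket. Running cdk deploy would synthesize a CloudFormation template and provision the bucket on your behalf:

    ```python
    from aws_cdk import App, RemovalPolicy, Stack
    from aws_cdk import aws_s3 as s3
    from constructs import Construct


    class StorageStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)

            # Declare the desired resource; CDK/CloudFormation provisions it
            s3.Bucket(
                self,
                "ReportsBucket",
                versioned=True,
                encryption=s3.BucketEncryption.S3_MANAGED,
                removal_policy=RemovalPolicy.DESTROY,
            )


    app = App()
    StorageStack(app, "StorageStack")
    app.synth()
    ```

    Because this template is just code, it can be versioned in git and code reviewed like any other change.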

     

    Takeaways

    • IaC is the process of managing infrastructure through machine-readable configuration files
    • IaC is a declarative and automated way of provisioning infrastructure
    • You can apply the same tools and processes to your infrastructure as you do to your code
    • Use services like CloudFormation and CDK to implement IaC on AWS
     

    Further Reading

  • Observability

    Observability is the process of measuring the internal state of your system. This is usually done to optimize it to some desired end state.

    When it comes to operational excellence, you can't improve what you don't measure. Building a solid observability foundation gives you the ability to track the impact of your automation and continuously improve it.

    Implementing observability involves the following steps:

    1. Collection
    2. Analytics
    3. Action


    Collection

    Collection is the process of aggregating all metrics necessary when assessing the state of your system.

    These metrics can fall into the following categories:

    • Infrastructure-level metrics
      • These metrics are emitted automatically by AWS services and collected by the CloudWatch service
      • Some services also emit structured logs which can be enabled and collected through CloudWatch Logs
    • Application-level metrics
      • These metrics are generated by your software and can be collected by CloudWatch Custom Metrics (see the sketch after this list)
      • Software logs can be stored using CloudWatch Logs or uploaded to S3
    • Account-level metrics
      • These metrics are logged by your AWS account and can be collected by the CloudTrail service


    Analytics

    To analyze your collected metrics, you can use one of the many database and analytics solutions provided by AWS.

    Choosing the right one depends on your use case:

    • To analyze logs stored in CloudWatch Logs, consider using CloudWatch Logs Insights, a service that lets you interactively search and analyze your CloudWatch log data
    • To analyze logs stored in S3, consider using Athena, a serverless query service
    • To analyze structured data, consider using RDS, a managed relational database service
    • To analyze large amounts of structured data, consider using Redshift, a managed petabyte-scale data warehouse service
    • To analyze log-based data, consider using the Elasticsearch Service, a managed version of Elasticsearch, the popular open-source analytics engine

     

    Action

    After you have collected and analyzed your metrics, you can use them to achieve a particular outcome or process.

    The following are examples of outcomes and processes:

    • Monitoring & alarming
      • You can use CloudWatch Alarms to notify you when a system has breached the safety threshold for a particular metric (see the sketch after this list)
      • This alarm can set off either a manual or automated mitigation
    • Dashboards
      • You can create dashboards of your metrics using CloudWatch Dashboards
      • You can use these dashboards to track and improve service performance over time
    • Data-driven decisions
      • You can track performance and business KPIs to make data-driven product decisions
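
    As a sketch of monitoring and alarming, the following boto3 snippet alarms on the custom metric published earlier. The SNS topic ARN is a hypothetical placeholder for the manual or automated mitigation hook:

    ```python
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when the custom metric from the Collection sketch breaches a
    # threshold; the SNS topic ARN below is a hypothetical placeholder
    cloudwatch.put_metric_alarm(
        AlarmName="reports-generated-low",
        Namespace="ReportService",
        MetricName="ReportsGenerated",
        Statistic="Sum",
        Period=300,               # evaluate over 5-minute windows
        EvaluationPeriods=3,      # require 3 consecutive breaching windows
        Threshold=1.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:111122223333:oncall-topic"],
    )
    ```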

     

    Takeaways

    • Observability is the process of measuring the internal state of your system to achieve some desired end state
    • Observability consists of collecting, analyzing, and taking action on metrics
    • You can collect metrics at the infrastructure, application, and account level
    • You can analyze metrics through services like CloudWatch Logs Insights, Athena, the Elasticsearch Service, RDS, and Redshift
    • You can act on your metrics by setting up monitoring and alarms, building dashboards, and tracking performance and business KPIs

     

    Further Reading

Conclusion

In this module, you have learned about the pillar of operational excellence. You have learned about the mental model of thinking about operations as automation. You have learned about IaC and how it can be used to provision your services automatically using the same tools and processes that you currently use for code. You have learned about observability and how to collect, analyze, and act on metrics to continuously improve your operational efforts.

 

Further Reading

Cost Optimization

The cost optimization pillar helps you achieve business outcomes while minimizing costs.


Mental Model

When thinking about cost optimization in the cloud, it is useful to think of cloud spend in terms of OpEx instead of CapEx. OpEx is an ongoing pay-as-you-go model whereas CapEx is a one-time purchase model.

Traditional IT costs in on-premises data centers have been mostly CapEx. You pay for all your capacity up front, regardless of whether you end up using it. Purchasing new servers could be a lengthy process that involved getting sign-off from multiple parties. This is because CapEx costs were often significant and mistakes costly. After you made a purchase, the actual servers could still take weeks to arrive.

In AWS, your costs are OpEx. You pay on an ongoing basis for the capacity that you use. Provisioning new servers can be done in real-time by engineering without the need for a lengthy approval process. This is because OpEx costs are much smaller and can be reversed if requirements change. Because you only pay for what you use, any excess capacity can simply be stopped and terminated. When you do decide to use a service, provisioning can be done in the order of seconds and minutes.


Concepts

Going from a CapEx model to an OpEx model fundamentally changes your approach to costing your infrastructure. Instead of large upfront fixed costs, you think in terms of small, ongoing variable expenses.

This pay-as-you-go model introduces the following changes to your cost optimization process:

  1. Pay For Use
  2. Cost Optimization Lifecycle


  • Pay For Use

    AWS services have a pay for use model where you only pay for the capacity that you use. The following are four common ways to optimize your cloud spend when you pay for use:

    1. Right Sizing
    2. Serverless
    3. Reservations
    4. Spot Instances


    Right Sizing

    Right sizing refers to matching the service provisioning and configuration to your workload. For EC2-based services, this means picking the right instance size and family. If your compute resources are mostly idle, consider using a smaller EC2 instance. If your workload requires a lot of a specific system resource, consider switching to an instance family optimized for that resource.

    To help with this process, you can use AWS Compute Optimizer to get optimal EC2 sizing suggestions based on past system metrics.
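
    For example, a hedged boto3 sketch of pulling those suggestions (assuming Compute Optimizer has been opted into for the account) might look like:

    ```python
    import boto3

    optimizer = boto3.client("compute-optimizer")

    # Pull right-sizing recommendations derived from past utilization metrics
    response = optimizer.get_ec2_instance_recommendations()
    for rec in response["instanceRecommendations"]:
        current = rec["currentInstanceType"]
        suggested = rec["recommendationOptions"][0]["instanceType"]
        print(f"{current} -> {suggested} ({rec['finding']})")
    ```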


    Serverless

    When you use serverless technologies like Lambda, you only pay for what you use. If your Lambda function is not executing, you are not charged. Serverless is the ultimate example of pay for use. When your use case permits, choosing serverless can be the most cost-effective way of building your service.


    Reservations

    Requesting reservations means committing to paying for a certain amount of capacity in exchange for a significant discount. For EC2, this can result in as much as a 72% discount for your compute.

    To make reservations for your compute, you can use Savings Plans. You sign up for a 1- or 3-year term and commit to using a specific amount of compute in exchange for savings across EC2, Fargate, and Lambda.

    Note that reservations are not unique to EC2 - you can also reserve capacity for other services like RDS, DynamoDB, and CloudFront.


    Spot Instances

    EC2 Spot Instances let you take advantage of unused EC2 capacity to run instances at up to a 90% discount when compared to on-demand prices. This can result in huge savings for your fault-tolerant workloads.

    The tradeoff when using a spot instance is that EC2 can reclaim the capacity at any moment. Your application will get a two-minute termination notice before this happens.


    Takeaways

    • AWS services are pay for use - you are charged only for the capacity that you use
    • You can right size your instances to save money on services that don't match your workload
    • You can use serverless technologies to ensure you only pay when customers use your service
    • You can use reservations to get discounts in exchange for an upfront commitment
    • You can use spot instances to get discounts running fault-tolerant workloads


    Further Reading

  • Cost Optimization Lifecycle

    The cost optimization lifecycle is the continuous process of improving your cloud spend over time.

    It involves the following three-step workflow:

    1. Review
    2. Track
    3. Optimize


    Review

    Before you can optimize your cloud spend, you need to first understand where it's coming from.

    AWS Cost Explorer helps you visualize and review your cloud spend over time. You can break down spend across different facets like service and category. Use the Cost Explorer to get both a high-level overview as well as detailed reports about your bill.

    If you require more fine-grained data, you can get hourly line items using the AWS Cost & Usage Report.


    Track

    Once you have an overview of your overall cloud spend, you can start grouping it along dimensions that you care about. This is done using Cost Allocation Tags, which need to be activated for each tag key you want to track (see the sketch after the list below).

    The following are common examples of tag categories:

    • Application ID – Identifies resources that are related to a specific application for easy tracking of spend change and turn-off at the end of projects.
    • Automation Opt-In/Opt-Out – Indicates whether a resource should be included in an automated activity such as starting, stopping, or resizing instances.
    • Cost Center/Business Unit – Identifies the cost center or business unit associated with a resource, typically for cost allocation and tracking.
    • Owner – Identifies who is responsible for the resource. This is typically the technical owner. If needed, you can add a separate business owner tag. You can specify the owner as an email address. Using an email address supports automated notifications to both the technical and business owners as required (e.g., if the resource is a candidate for elasticity or right sizing).
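
    Here is a minimal boto3 sketch of tagging a resource along these dimensions. The instance ID and tag values are hypothetical placeholders, and the tag keys still need to be activated as cost allocation tags in the Billing console before they show up in cost reports:

    ```python
    import boto3

    ec2 = boto3.client("ec2")

    # Tag a resource along the dimensions we want to track spend by
    # (instance ID and tag values are hypothetical)
    ec2.create_tags(
        Resources=["i-0123456789abcdef0"],
        Tags=[
            {"Key": "application-id", "Value": "report-service"},
            {"Key": "cost-center", "Value": "analytics"},
            {"Key": "owner", "Value": "owner@example.com"},
        ],
    )
    ```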

    In addition to tags, you can use AWS Budgets to create budget goals. Using Budgets, you can monitor your spend across the dimensions that you care about. Budgets both track and create forecasts for your current AWS spend according to the filters you've set in place.
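
    A hedged boto3 sketch of creating such a budget, filtered by one of the cost allocation tags above (the account ID, amount, and tag value are hypothetical placeholders):

    ```python
    import boto3

    budgets = boto3.client("budgets")

    # A monthly cost budget scoped by a cost allocation tag; the
    # "user:key$value" filter format is how tag filters are expressed here
    budgets.create_budget(
        AccountId="111122223333",
        Budget={
            "BudgetName": "report-service-monthly",
            "BudgetType": "COST",
            "TimeUnit": "MONTHLY",
            "BudgetLimit": {"Amount": "500", "Unit": "USD"},
            "CostFilters": {"TagKeyValue": ["user:application-id$report-service"]},
        },
    )
    ```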


    Optimize

    After you have reviewed and tracked your spend, you can then optimize it. Optimizing your spend involves implementing the techniques we talked about in Pay For Use. This optimization is usually done as part of an overarching budget goal.

    The following are examples of optimization goals:

    • Percentage of EC2 instances covered by a Savings Plan - your organization should aim for 80%-90% coverage
    • Percentage of idle resources - the definition of idle and exact percentage will vary depending on your business


    Takeaways

    • The cost optimization lifecycle is a continuous process to improve your cloud spend over time
    • The cost optimization lifecycle consists of reviewing, tracking, and optimizing your spend
    • Reviewing your spend involves using tools like Cost Explorer and the Cost & Usage Report to understand where your spend is coming from
    • Tracking your spend involves the use of cost allocation tags and budgets to filter the data along dimensions relevant to your business
    • Optimizing your spend involves using techniques from the previous section as part of an overarching budget goal


    Further Reading

Conclusion

In this module, you have learned about the pillar of cost optimization. You have learned about applying an OpEx-focused model for your cloud spend. You have learned about cost optimization techniques like right sizing, serverless, reservations, and spot instances. You have learned about reviewing, tracking, and optimizing your budget using services like the Cost Explorer, tags, and budgets.

 

Further Reading

Congratulations!

You have now completed the AWS Fundamentals Course. In this course, you have learned the following:

  • The Five Pillars of the AWS Well-Architected Framework
  • Important mental models that represent a cloud-native way of thinking about the five pillars
  • Key concepts within each of the five pillars

At this point, you have learned the fundamentals of building secure, performance efficient, reliable, operationally excellent, and cost-optimized services in the cloud. While we have only scratched the surface of what there is to know, you now have a solid starting point for the rest of your AWS journey. Now that you have completed the AWS Fundamentals Course, go ahead and apply what you've learned to build your next great service on AWS.