Why Even the Pros Need a Cloud Management Solution to Optimize Cost and Performance
By Alana Fitts, CloudCheckr Director of Sales Strategy
At CloudCheckr, we don’t presently have—and never have had—a data center. This frees up capital, manpower, and real estate to focus on developing world-class software.
We are “all in” on Amazon Web Services (AWS) and use the AWS Cloud to host our software-as-a-service (SaaS) offering. We’ve made this decision because of AWS’ leadership, innovation in developing new services, and the value of pay-as-you-go pricing. We can scale our environment footprint efficiently, and myriad AWS services help us continually improve and optimize our deployments.
But what does it mean to be “all in” on AWS?
CloudCheckr manages billions of dollars’ worth of customer cloud spend, and the platform itself runs on multiple Amazon Elastic Compute Cloud (Amazon EC2) instances running Microsoft Windows Server.
These instances run C# code and talk to SQL Server hosted on Amazon Relational Database Service (Amazon RDS). Our platform leverages cross-account roles to make Application Programming Interface (API) calls to customers’ AWS accounts and collect data, which is then stored on Amazon RDS instances.
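To illustrate the cross-account role pattern, here is a minimal sketch of the kind of trust policy a customer account might attach to such a role. It is not CloudCheckr’s actual configuration: `build_trust_policy` is a hypothetical helper, and the account ID and external ID are placeholder values.

```python
import json


def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Return an IAM trust policy that lets another account assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to assume the role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # An external ID guards against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = build_trust_policy("111122223333", "example-external-id")
    print(json.dumps(policy, indent=2))
```

The permissions the role grants are defined separately, in a permissions policy attached to the same role.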
CloudCheckr operates with more than 550 terabytes of allocated storage and more than 30,000 active database connections each day. For data that can be represented in eXtensible Markup Language (XML) and does not need the query performance of a relational database, we use Amazon Simple Storage Service (Amazon S3) buckets for low overhead and lower cost.
Many XML files include pre-calculated data sources so that page load times are shorter when customers execute queries. In addition, we’re constantly evaluating and testing new AWS services and add-ons, both to improve our customers’ experience and to meet internal efficiency and performance goals.
All of this complexity means we need a cloud management solution. Just like our customers, we use CloudCheckr.
What is a Cloud Management Tool?
A Cloud Management Tool (CMT) should optimize costs, ensure security and compliance, track inventory, and monitor for changes, among other functions and features that assist a cloud-native company.
A Cloud Management Platform (CMP), meanwhile, should ideally support multiple cloud vendors, and provide a single pane of glass to manage all of your cloud resources.
Choosing a CMP that’s available as software-as-a-service (SaaS) lets you get started immediately, with no overhead and nothing to install, including no agents. However, if you operate in a secure environment, you may want an Amazon Machine Image (AMI) version of your chosen CMP so that you can fulfill data sovereignty requirements.
Why Did We Choose to Use Our Own Tools?
We have achieved AWS Competencies in Cloud Management Tools, Security, and Government, and it’s been several years since we began using our own tools to manage our AWS environments. We are continuously incorporating different platform features across departments, but we weren’t always so wise to the value of our own CMTs.
There’s a story of the shoemaker whose family has no shoes. It’s easy for vendors to get so caught up in the needs of their customers that they forget to tend to their own.
When CloudCheckr was a young startup, we admittedly had to make a deliberate effort to focus on leveraging all our platform has to offer. The results were as significant for CloudCheckr as they were for our clients.
Figure 1 – The Cost Savings Report displays ways to optimize cloud spend, including eliminating idle and unused resources.
With a little attention, we identified opportunities to save hundreds of thousands of dollars and many hours of operational time. We redirected those resources into growing our business, adding more employees, and developing more features for the platform overall. We discovered that we could improve our security, cost, availability, and usage if we just followed our own best practices.
Now, we pair our CMP with complementary offerings: Slack for employee communication and notifications; Datadog to track memory metrics on our instances; JIRA as our support and engineering ticketing system; and Xacta for our ongoing compliance documentation management.
Coupled with our pre-built integrations, CloudCheckr offers hundreds of API calls that enable users to seamlessly push and pull data into other business intelligence (BI) systems and cloud enablement tools.
With that as the backdrop, here’s the story of how we make sure CloudCheckr employees don’t “go shoeless” by proactively and thoroughly managing our AWS environments using the CloudCheckr Cloud Management Platform, which is available on AWS Marketplace.
CloudCheckr for Cost
Early in our AWS journey, we implemented a tag strategy to support the ability to manage resource access control, cost tracking, automation, and organization across environments.
Like many cloud-native companies, we require that all resources have a cost center tag. CloudCheckr’s Tagging Rules feature allows users to configure all rules applicable to their deployments, including a rule that resources not be spun up without tags. Once the rules are in place, CloudCheckr pushes alerts for improperly tagged resources and whenever new tags are detected, which helps combat accidental misspellings and manage the incorporation of new keys and values.
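The tagging checks described here can be sketched in a few lines. This is an illustration only, not how the Tagging Rules feature is implemented: the tag keys, the similarity cutoff, and the `check_tags` helper are all assumptions.

```python
from difflib import get_close_matches

# Hypothetical rule set: one required key and a list of approved keys.
REQUIRED_TAG_KEYS = {"CostCenter"}
KNOWN_TAG_KEYS = {"CostCenter", "Environment", "Owner"}


def check_tags(resource_id, tags):
    """Return a list of human-readable findings for one resource's tags."""
    findings = []
    present = set(tags)
    for key in sorted(REQUIRED_TAG_KEYS - present):
        findings.append(f"{resource_id}: missing required tag '{key}'")
    for key in sorted(present - KNOWN_TAG_KEYS):
        # A near match to an approved key suggests a typo rather than a new key.
        close = get_close_matches(key, sorted(KNOWN_TAG_KEYS), n=1, cutoff=0.8)
        if close:
            findings.append(
                f"{resource_id}: tag '{key}' may be a misspelling of '{close[0]}'"
            )
        else:
            findings.append(f"{resource_id}: new tag key '{key}' detected")
    return findings


if __name__ == "__main__":
    for finding in check_tags("i-0abc", {"CostCentre": "eng", "Project": "x"}):
        print(finding)
```

Separating "likely misspelling" from "new key" matters because the two call for different responses: one is a correction, the other a decision about whether to adopt the key.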
Years ago, CloudCheckr’s DevOps lead built several key Advanced Grouping reports, which are saved within the platform and emailed to our database administrators on a weekly cadence to track costs by tag.
The team is also responsible for assessing our environment against the CloudCheckr Cost Changes report, which gives a by-description analysis of changes in the use of resources. Recently, our engineering team had an architecture conversation that was inspired by a trend identified through these custom reports.
Advanced Grouping allows us to group costs by a number of parameters, most notably by resource and then usage type, to drill into which are costing us the most money. Due to the nature of our agentless architecture, CloudCheckr is constantly polling cost and security logs to understand the latest changes to a customer’s cloud. This means the CloudCheckr application needs to talk to Amazon S3 buckets, which generates a lot of “get” and “put” requests.
After identifying the magnitude of associated spend generated by these API requests, we were able to have architecture conversations about ways to group those requests so we can save more money.
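The idea of grouping costs by resource and then by usage type can be shown with a small sketch. The line items, resource names, and usage-type labels below are made-up sample data, and the two helper functions are hypothetical. They are not part of the Advanced Grouping feature.

```python
from collections import defaultdict

# Hypothetical billing line items: (resource_id, usage_type, cost_usd).
LINE_ITEMS = [
    ("bucket-logs", "Requests-Tier1", 120.0),
    ("bucket-logs", "Requests-Tier2", 45.5),
    ("bucket-logs", "TimedStorage-ByteHrs", 30.0),
    ("db-primary", "InstanceUsage", 300.0),
    ("bucket-logs", "Requests-Tier1", 80.0),
]


def group_costs(items):
    """Sum cost by resource, then by usage type within each resource."""
    grouped = defaultdict(lambda: defaultdict(float))
    for resource, usage_type, cost in items:
        grouped[resource][usage_type] += cost
    return {resource: dict(usage) for resource, usage in grouped.items()}


def top_drivers(grouped, n=3):
    """Return the n most expensive (resource, usage_type, cost) entries."""
    flat = [
        (resource, usage_type, cost)
        for resource, usage in grouped.items()
        for usage_type, cost in usage.items()
    ]
    return sorted(flat, key=lambda row: row[2], reverse=True)[:n]


if __name__ == "__main__":
    for row in top_drivers(group_costs(LINE_ITEMS)):
        print(row)
```

Ranking the flattened entries is what surfaces request charges on a single bucket as a cost driver worth an architecture conversation.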
Our DevOps engineers use CloudCheckr’s utilization features to assess Amazon EC2 and Amazon RDS usage over a period of time, prior to making purchase decisions. We have many instances that perform essentially the same function across our architecture, but have very different workloads based on customer and job type.
As shown in Figure 2, CloudCheckr’s Amazon RDS heat maps are helpful in identifying where we might be under or over-provisioned, especially when filtered by tag keys.
Figure 2 – CloudCheckr can graph heat maps that show the level of utilization for Amazon EC2, Amazon RDS and other services.
Because of the number of individual Amazon EC2 and Amazon RDS resources in CloudCheckr’s consolidated billing structure, we are able to take advantage of huge savings through purchasing reserved capacity for both AWS services.
CloudCheckr provides reserved purchase recommendations across multiple cloud services, including Amazon EC2, Amazon RDS, Amazon ElastiCache, Amazon Redshift, and Amazon DynamoDB. We can base recommendations on 30, 60, 90, or 180 days of historical utilization data.
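To show why the lookback window matters, here is a simplified sketch of a reservation recommendation based on historical utilization. It assumes a single instance type and flat hourly prices, and the rates in the usage example are invented. CloudCheckr’s actual recommendation logic is not described in this post and is certainly more involved.

```python
def recommend_reservations(hourly_counts, on_demand_rate, reserved_rate):
    """Pick the reservation count that minimizes cost over the lookback window.

    hourly_counts: number of running instances observed in each hour.
    on_demand_rate, reserved_rate: hypothetical effective hourly prices.
    Returns (reservation_count, total_cost_for_window).
    """
    hours = len(hourly_counts)
    best_count, best_cost = 0, float("inf")
    for reserved in range(max(hourly_counts, default=0) + 1):
        # Reserved capacity is paid for every hour, whether used or not.
        reserved_cost = reserved * reserved_rate * hours
        # Anything above the reserved count runs at the on-demand rate.
        overflow_hours = sum(max(0, count - reserved) for count in hourly_counts)
        total = reserved_cost + overflow_hours * on_demand_rate
        if total < best_cost:
            best_count, best_cost = reserved, total
    return best_count, best_cost


if __name__ == "__main__":
    # One day of sample data: a steady baseline with a short peak.
    sample = [2] * 20 + [5] * 4
    print(recommend_reservations(sample, on_demand_rate=0.10, reserved_rate=0.06))
```

With this sample, covering the steady baseline is cheaper than covering the peak, which is the usual shape of the trade-off.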
We manage a complex portfolio of reserved instances (RIs), so our DevOps team has configured Slack alerts to be sent when an RI is being underutilized so that we can better leverage all of our paid-for capacity. In addition, we have alerts set up for any time an RI is nearing expiration, is purchased, or if the price drops for an identical reservation within 30 days of purchase, in which case we could return and repurchase at the lower price.
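The alert conditions for underutilization and approaching expiration can be expressed as simple threshold checks. The sketch below uses a hypothetical reservation record with `id`, `utilization`, and `expires` fields, and the 80 percent and 30-day thresholds are assumptions, not CloudCheckr defaults.

```python
from datetime import date


def reservation_alerts(reservation, today, utilization_floor=0.8, expiry_window_days=30):
    """Return alert strings for one reservation record (hypothetical schema)."""
    alerts = []
    if reservation["utilization"] < utilization_floor:
        alerts.append(
            f"{reservation['id']}: underutilized at {reservation['utilization']:.0%}"
        )
    days_left = (reservation["expires"] - today).days
    # Only alert on reservations that are still active but close to expiring.
    if 0 <= days_left <= expiry_window_days:
        alerts.append(f"{reservation['id']}: expires in {days_left} days")
    return alerts


if __name__ == "__main__":
    record = {"id": "ri-1", "utilization": 0.5, "expires": date(2024, 1, 20)}
    for message in reservation_alerts(record, today=date(2024, 1, 10)):
        print(message)
```

The resulting strings could then be posted to whichever notification channel a team prefers.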
CloudCheckr for Security and Compliance
CloudCheckr ingests all security-related AWS data streams, including AWS CloudTrail, Amazon CloudWatch Logs, VPC Flow Logs, and AWS Config reports, to give you a comprehensive view of your security posture.
Out of the box, our platform includes more than 30 pre-built alerts, based on industry best practices for environment security, that are triggered by activity logged in CloudTrail. Our alerts can be sent via a number of notification methods, including email, PagerDuty, Slack, ServiceNow, Syslog, Amazon Simple Notification Service (SNS), Splunk, and more. We can also tie alerts to AWS Lambda functions in order to automate responses.
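As a rough illustration of tying an alert to an automated response, here is a minimal AWS Lambda handler. It assumes the alert arrives through an Amazon SNS subscription. The `Records[].Sns.Message` envelope is the standard SNS-to-Lambda event shape, but the alert payload fields (`alert_type`, `resource_id`) and the action names are hypothetical.

```python
import json


def lambda_handler(event, context):
    """Decide on a response for each alert delivered through Amazon SNS."""
    actions = []
    for record in event.get("Records", []):
        # SNS delivers the published payload as a string in Sns.Message.
        alert = json.loads(record["Sns"]["Message"])
        # 'alert_type' and 'resource_id' are assumed field names.
        if alert.get("alert_type") == "root_account_usage":
            action = "notify_security"
        else:
            action = "log_only"
        actions.append({"action": action, "resource": alert.get("resource_id")})
    return {"actions": actions}


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {
                "Sns": {
                    "Message": json.dumps(
                        {"alert_type": "root_account_usage", "resource_id": "root"}
                    )
                }
            }
        ]
    }
    print(lambda_handler(sample_event, None))
```

A real handler would carry out the chosen action rather than return it. Returning the decision keeps the sketch easy to test locally.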
These features, combined with more than 300 other security and availability-related best practice checks, enable CloudCheckr users to fulfill compliance and audit requirements while monitoring for resource usage and public accessibility. Without a tool to highlight actionable information and give visibility to relevant stakeholders, important details could be buried in gigabytes of log information.
Our DevOps team uses alerts and notifications for both production and development environments. We have a number of CloudCheckr’s pre-built notifications configured, including for IAM Access Key Creation, Password Policy Changes, Root Account Usage, and New IP Addresses or Event Types Detected in CloudTrail.
We have a dedicated Slack channel for CloudTrail-based alerts. It lets our team announce planned changes that might otherwise trigger false positives, so that any genuinely anomalous behavior stands out.
CloudCheckr’s security alerts help us to identify human error, such as when a resource is erroneously placed into the wrong security group, or into the default VPC, making it publicly accessible when it should not be. When resources are misconfigured at launch, they often need to be fully recreated.
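The two kinds of misconfiguration mentioned here can be expressed as simple checks. The sketch below works on a simplified, hypothetical schema rather than raw AWS API output, and `find_exposure_risks` is an illustrative helper, not a CloudCheckr function.

```python
def find_exposure_risks(instance, security_groups, default_vpc_id):
    """Flag an instance in the default VPC or behind an open ingress rule.

    instance: {"vpc_id": str, "security_group_ids": [str, ...]}
    security_groups: {group_id: [{"cidr": str, "port": int}, ...]}
    """
    risks = []
    if instance["vpc_id"] == default_vpc_id:
        risks.append("launched in default VPC")
    for group_id in instance["security_group_ids"]:
        for rule in security_groups.get(group_id, []):
            # 0.0.0.0/0 and ::/0 match every IPv4 and IPv6 address respectively.
            if rule["cidr"] in ("0.0.0.0/0", "::/0"):
                risks.append(f"{group_id} allows port {rule['port']} from anywhere")
    return risks


if __name__ == "__main__":
    groups = {
        "sg-open": [{"cidr": "0.0.0.0/0", "port": 22}],
        "sg-internal": [{"cidr": "10.0.0.0/16", "port": 1433}],
    }
    server = {"vpc_id": "vpc-default", "security_group_ids": ["sg-open", "sg-internal"]}
    print(find_exposure_risks(server, groups, default_vpc_id="vpc-default"))
```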
Detecting these essential security vulnerabilities in near real-time helps us route tasks within CloudCheckr’s DevOps management structure to optimize operations and avoid wasted capacity.
Figure 3 – CloudCheckr offers more than 550 Best Practice Checks, categorized by security, cost, usage, and availability.
Recently, we were able to rule out a potential security incident after our Blacklisted IP Addresses alert indicated that a potentially malicious IP address was making API calls to AWS Key Management Service (AWS KMS).
The calls originated from São Paulo, and because our application is hosted and accessed primarily in the US East (N. Virginia), Europe (Frankfurt), and Asia Pacific (Sydney) regions, the attempt looked suspicious. CloudCheckr allowed us to investigate the access attempt and determine that, in this case, there was no actual threat.
Our DevOps lead began his diagnostic by using AWS Common Searches for CloudTrail, searched by IAM user, and drilled into other activity and normal behavior for that specific user. Ultimately, we discovered the access was a normal pattern, but the origination point had changed. We triangulated the detail and were able to rule out any foul play.
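A blocklist check of this kind can be sketched against CloudTrail-style records. The `eventSource` and `sourceIPAddress` field names follow the CloudTrail record format. The blocklist itself uses a documentation-only address range, and `flag_blocked_calls` is a hypothetical helper, not the implementation of the Blacklisted IP Addresses alert.

```python
import ipaddress

# A documentation-only range (RFC 5737) stands in for a real blocklist.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def flag_blocked_calls(events, event_source="kms.amazonaws.com"):
    """Return events for one service whose caller IP is on the blocklist."""
    flagged = []
    for event in events:
        if event.get("eventSource") != event_source:
            continue
        try:
            caller = ipaddress.ip_address(event.get("sourceIPAddress", ""))
        except ValueError:
            # Calls made by AWS services appear as DNS names, not IP addresses.
            continue
        if any(caller in network for network in BLOCKED_NETWORKS):
            flagged.append(event)
    return flagged


if __name__ == "__main__":
    sample = [
        {"eventSource": "kms.amazonaws.com", "sourceIPAddress": "203.0.113.7",
         "eventName": "Decrypt"},
        {"eventSource": "kms.amazonaws.com", "sourceIPAddress": "198.51.100.4",
         "eventName": "Decrypt"},
    ]
    for hit in flag_blocked_calls(sample):
        print(hit["eventName"], hit["sourceIPAddress"])
```

A match is only a starting point. As the investigation above shows, the next step is to compare the flagged call against that user’s normal activity.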
For documented exceptions, such as when performing penetration tests on our environment, CloudCheckr’s Total Compliance functionality lets us add comments on particular events to denote reasons for non-compliance and build our audit logs. Total Compliance presents both a point-in-time score for 35 regulations and historical graphs showing the organization’s progress towards completeness. This functionality will prove essential during our yearly SOC 2 audit.
Our InfoSec team also tracks our Perimeter Assessment report to check in periodically on the high-level public accessibility of various application elements. In particular, we are focused on maintaining appropriately limited Amazon S3 bucket permissions, and CloudCheckr has dozens of checks and reports that attest to the accessibility of such resources.
At CloudCheckr, we know that cloud management is not one-size-fits-all. Because of the varied security and compliance concerns of our customer-end users, the extent of the IAM policy permissions can also vary by customer. Our basic policy is read-only and fully interactive, allowing users to see the purpose of each individual permission, including which reports are contingent on it.
For organizations willing to add write permissions to their policy, we do offer advanced automation features, including self-healing for security vulnerabilities, RI rebalancing, and right-sizing.
To foster efficient collaboration between teams, organizations need to maintain complete visibility across their cloud infrastructure. Beyond basic data aggregation capabilities, internal finance, IT, and security teams need deeper, more actionable insights about the dynamic components across their environments. CloudCheckr provides cost optimization, full security and compliance monitoring, and automation all in one unified platform.
CloudCheckr helps customers to manage several billion dollars of yearly AWS spend by providing all of the necessary tools to reduce costs, optimize service usage, and monitor for security vulnerabilities. With each bi-weekly release, we continue to deliver new and improved functionality to fill our ever-expanding cloud management toolkit.
We build for our customers, and we build from experience.
The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.
CloudCheckr – APN Partner Spotlight
CloudCheckr is an AWS Competency Partner. They provide a unified cost and security automation platform that gives you visibility, insight, and automation for your AWS environment.