AWS Partner Network (APN) Blog
How to Turn Archive Data into Actionable Insights with Cohesity and AWS
By Edwin Galang, Cloud Solutions Architect at Cohesity
By Girish Chanchlani, Partner Solutions Architect at AWS
It’s no secret that data is growing at a staggering rate, and the challenge for enterprises is how to manage this growth in a cost-effective manner. According to IDC’s Data Age 2025 whitepaper, published in late 2018, data will grow to 175 ZB by 2025, with annual growth of over 60 percent.
In addition to managing the data, CIOs are looking for ways to get insights out of the data so their organizations can create actionable outcomes. This includes using data as an input for building machine learning (ML) models and for security analysis to identify vulnerabilities.
Splunk’s The State of Dark Data whitepaper surveyed more than 1,300 business leaders worldwide and found that 80 percent of respondents believe data in their environment is “very” or “extremely” valuable to their organization’s overall success, while respondents estimated that 55 percent of their organization’s data is dark. Dark data is data that is not used in any way to derive insights or make decisions, so its potential remains untapped.
In this post, we will describe how to use the CloudArchive Direct feature of Cohesity’s DataPlatform with Amazon Web Services (AWS) analytics services to drive insights into customers’ Network Attached Storage (NAS) data.
Cohesity is an AWS Advanced Technology Partner with the AWS Storage Competency that is redefining data management to lower total cost of ownership (TCO) while simplifying the way businesses manage and protect their data.
Cohesity Data Management Solution
Cohesity’s DataPlatform provides enterprises a modern software-defined data management solution for data protection, file and object services, dev/test, and disaster recovery.
The solution is deployed as a scale-out cluster on-premises and on AWS, giving enterprises the ability to back up on-premises and AWS workloads. For NAS data, CloudArchive Direct lets enterprises back up data directly to Amazon Simple Storage Service (Amazon S3).
Figure 1 – Cohesity’s comprehensive software-defined data platform.
CloudArchive Direct Overview
CloudArchive Direct enables enterprises to centrally manage their NAS backup data from the Cohesity Dashboard, and reduces their Cohesity cluster storage consumption by storing NAS data directly in Amazon S3.
The Cohesity cluster stores only metadata and index data, allowing users to search the data and recover volumes, file systems, files, and folders directly from S3.
The NAS data can be stored in its native format, or it can be compressed and deduplicated for additional storage efficiencies.
After CloudArchive Direct has archived the NAS data to Amazon S3, enterprises can use AWS analytics services like AWS Glue, Amazon Athena, and Amazon QuickSight to analyze and provide insights into their data. These services provide enterprises the ability to easily tap into their dark data and derive meaningful business outcomes from it.
Figure 2 – CloudArchive Direct to AWS.
Analyzing Archive Data
Let’s walk through an example of using CloudArchive Direct with AWS Glue and Amazon Athena to extract data from a NAS backup stored on Amazon S3 and analyze it.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. This prepared data can be analyzed by Amazon Athena, which is an interactive query service for analyzing data on S3 using standard SQL.
We’ll then use Amazon QuickSight, a business analytics service, to visualize the results in an easy-to-understand graphical format.
In our example, a company has a NAS file share with 500,000+ files that multiple departments use to store their data. The data includes archived emails that the legal department has saved to the file share, as well as historical data the biology department has stored from monitoring rivers in their area.
The IT team needs to maintain a backup of the NAS data and keep an offsite backup for seven years to meet their business continuity and regulatory requirements. With CloudArchive Direct, they can meet both requirements by backing up the data directly to Amazon S3.
In one of the folders, the biology team has archived five years of river data in CSV files. Each CSV file contains a year’s worth of data with multiple parameters about the river. The biology team wants to aggregate the data together and analyze it for a new project.
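For intuition, the aggregation the biology team wants amounts to stacking the yearly files into one table. The following is a minimal local sketch of that idea, not part of the Cohesity or AWS workflow itself. The column names (date, stage_ft, discharge_cfs) and the sample values are hypothetical, and every file is assumed to share the same header row.

```python
import csv
import io


def stack_yearly_csvs(files):
    """Stack CSV texts keyed by year into one list of row dicts.

    `files` maps a year (int) to the CSV text for that year. Every file is
    assumed to share the same header row.
    """
    rows = []
    for year in sorted(files):
        reader = csv.DictReader(io.StringIO(files[year]))
        for row in reader:
            row["year"] = year  # tag each row with the year of its source file
            rows.append(row)
    return rows


# Hypothetical sample data; real column names and values will differ.
sample = {
    2018: "date,stage_ft,discharge_cfs\n2018-01-01,3.2,410\n2018-01-02,3.4,455\n",
    2017: "date,stage_ft,discharge_cfs\n2017-01-01,2.9,380\n",
}

combined = stack_yearly_csvs(sample)
print(len(combined), combined[0]["date"])
```

Doing this by hand for every new analysis is the kind of manual work the steps below aim to avoid.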
Step 1: Configure Cohesity CloudArchive Direct to archive NAS Data to Amazon S3
- In the Cohesity Dashboard, configure the Amazon S3 target where the NAS data will be stored.
- Add the NAS source to the Cohesity cluster.
- Configure a Policy that specifies the data retention and S3 target for the data on the NAS source. A policy is a reusable set of settings that define how and when objects are protected, replicated and archived.
- Create a Protection Group to back up the NAS data. A Protection Group is a backup job that runs on a schedule, based on an associated policy, to back up data from the NAS sources and store it on Amazon S3.
Figure 3 – Cohesity CloudArchive Direct Protection Group.
- In the Protection Group configuration screen shown in Figure 3 above, the CloudArchive Direct feature is enabled with the option to retain the data in native format on Amazon S3.
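Once a Protection Group run has completed with the native format option, one way to confirm what landed in the bucket is to list its objects. The sketch below uses the boto3 S3 client; the bucket name, prefixes, and key layout are hypothetical, and it assumes the archived files appear as ordinary objects whose keys end in their original file extension.

```python
def csv_keys(keys, prefix=""):
    """Return the keys under `prefix` that look like CSV files."""
    return sorted(
        k for k in keys
        if k.startswith(prefix) and k.lower().endswith(".csv")
    )


def list_bucket_keys(bucket, prefix=""):
    """List object keys in an S3 bucket (requires boto3 and AWS credentials)."""
    import boto3  # imported here so csv_keys can be used without AWS access

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Hypothetical keys, standing in for the result of
# list_bucket_keys("example-nas-archive", "biology/").
example = [
    "biology/river/2017.csv",
    "biology/river/2018.CSV",
    "biology/river/readme.txt",
    "legal/mail/archive.pst",
]
print(csv_keys(example, prefix="biology/"))
```

Only the CSV objects under the chosen prefix are returned, which is also the S3 path an AWS Glue crawler would later be pointed at.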
Step 2: Analyzing Data with AWS Glue and Amazon Athena
- In the AWS console, configure an AWS Glue crawler to connect to the S3 bucket. Run the crawler to create a table in a database in the AWS Glue Data Catalog.
- After the crawler has completed, we can view the data in Amazon Athena. In the following example, the crawler has created a table whose schema was inferred from the CSV files, and Athena queries the data in place on S3.
Figure 4 – River data in Amazon Athena.
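The console steps above can also be scripted. The sketch below builds the arguments for the boto3 Glue `create_crawler` call and an Athena SQL query, then shows how they might be submitted. The crawler name, IAM role ARN, database, bucket paths, table name, and column names (sample_year, stage_ft, discharge_cfs) are all hypothetical; the post does not state the actual schema.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def yearly_average_sql(table):
    """Build a query averaging stage and discharge per year.

    The column names are hypothetical placeholders.
    """
    return (
        "SELECT sample_year, AVG(stage_ft) AS avg_stage, "
        "AVG(discharge_cfs) AS avg_discharge "
        f'FROM "{table}" GROUP BY sample_year ORDER BY sample_year'
    )


def start_crawl(request):
    """Create and start the crawler (requires boto3 and AWS credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


def submit_query(sql, database, output_location):
    """Submit the query to Athena once the crawler has finished."""
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical values; nothing below contacts AWS.
request = crawler_request(
    "river-crawler",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "river_db",
    "s3://example-nas-archive/biology/river/",
)
sql = yearly_average_sql("river")
print(request["Targets"])
print(sql)
```

Calling `start_crawl(request)` and, after the crawl finishes, `submit_query(sql, "river_db", "s3://example-athena-results/")` would run the same workflow as the console steps, under the naming assumptions above.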
Step 3: Visualizing River Data with Amazon QuickSight
Once the crawler has cataloged the CSV files as a single table, Amazon QuickSight can be used to quickly create graphs and charts of the historical data.
Figure 5 – Amazon QuickSight displaying graph of river stage and discharge data over a five-year period.
In this example, the biology team was able to quickly create a graph comparing river stage and river discharge data from the last five years. Previously, this process was tedious and error-prone, which led to delays in analyzing the data.
Each time new data was collected or a new analysis was needed, a biology team member had to manually import the data into a spreadsheet and share it with the team. Because only one team member could edit the spreadsheet at a time, the analysis slowed down, and the spreadsheet occasionally became corrupted and had to be recreated.
Also, when multiple team members needed to work on the spreadsheet, they would create multiple copies. This caused confusion about which copy was the latest, and the extra copies consumed more space on the file share.
By using the AWS analytics solution, the biology team was able to:
- Easily analyze the data in real time and share the results with the team.
- Automate the process using AWS Glue and Amazon Athena and reduce errors and delays.
- Quickly create graphs in minutes versus days or weeks with Amazon QuickSight.
In this post, we shared how Cohesity’s CloudArchive Direct feature helps customers back up on-premises NAS storage systems to Amazon S3. Once the data is in S3, it can be analyzed by AWS analytics services to provide customers insight into their data for creating actionable business outcomes.
This lets IT teams meet their backup requirements and gives them a way to analyze that data easily and quickly.
This solution can be extended with services such as Amazon Macie for discovering sensitive data, Amazon Rekognition for image and video analysis, and Amazon SageMaker for building, training, and deploying machine learning models, as well as for similar data analysis use cases.
For more information about DataPlatform and CloudArchive Direct, please visit cohesity.com.