AWS Partner Network (APN) Blog

VMware Greenplum on AWS: Parallel Postgres for Enterprise Analytics at Scale

Editor’s note: This post was updated in June 2022 to reflect the most current information.

By Ivan Bishop, Partner Solutions Architect, ISV Migrations at AWS
By Sue Ulintz Mosovich, Product Marketing Manager – VMware Greenplum


Are you thinking of deploying VMware Tanzu Greenplum on Amazon Web Services (AWS)? When considering their data strategy, many customers want to shift responsibility for infrastructure to AWS.

With an on-premises deployment, it takes time to provision floor space in a data center, run power cables and fiber, and ensure adequate cooling. Then, you must acquire the hardware, provision IP addresses, install and harden the operating system (OS) across multiple machines, and finally address monitoring and security. Only then can you install and configure VMware Greenplum on your on-premises infrastructure.

With VMware Greenplum on AWS, deployments are completely automated and finish in less than a couple of hours, not days or weeks. In fact, the barrier to entry is low enough that business units may deploy production-ready clusters on a self-service basis, without IT involvement.

VMware Tanzu Greenplum is a commercial, fully featured massively parallel processing (MPP) data warehouse platform powered by the open source VMware Greenplum database. It provides powerful and rapid analytics on petabyte-scale data volumes, and is available on AWS Marketplace.

VMware and AWS have worked together to make deployment and ongoing operations of VMware Greenplum easy and painless. Speed, ease of management, and security are some of the key reasons we see enterprises shifting VMware Greenplum to AWS.

In this post, we focus on leveraging VMware Greenplum (parallel Postgres) for enterprise-scale analytics. We discuss deployment, updates, security, and speed.

VMware Greenplum is an AWS Partner with Competencies in DevOps and Containers that helps the world’s largest companies transform the way they build and run software.

Performance Benefits

Customers use VMware Greenplum because it’s fast. You can run your query and the results will come back moments later. VMware Greenplum on AWS is optimized for performance and can be even faster than a comparably configured on-premises deployment.

To achieve the best performance, we tuned both VMware Greenplum and AWS resources in the following ways:

  • All VMware Greenplum nodes are placed into Auto Scaling Groups to boost resiliency.
  • Instance types for all nodes are selected for an ideal balance of price and performance.
  • Data volumes use Throughput Optimized HDD (st1) disks.

The CloudFormation template for VMware Greenplum uses AWS placement groups to minimize latency by placing nodes in close physical proximity.
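
To make the tuning choices above concrete, here is a minimal sketch of how an Auto Scaling Group, a cluster placement group, and st1 data volumes might be expressed as a CloudFormation-style template built in Python. It is not the template shipped with the AWS Marketplace listing: the logical IDs, the function name `build_template`, the device name, and all parameter values are assumptions made for illustration, and the fragment omits the AMI, IAM, and security settings a real deployment needs.

```python
import json


def build_template(segment_count, instance_type, data_volume_gib, subnet_ids):
    """Return a minimal CloudFormation-style template as a Python dict.

    Hypothetical sketch: logical IDs and values are illustrative only,
    and the fragment is not deployable as-is (no ImageId, IAM, etc.).
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # A cluster placement group packs instances close together
            # to reduce network latency between nodes.
            "ClusterPlacementGroup": {
                "Type": "AWS::EC2::PlacementGroup",
                "Properties": {"Strategy": "cluster"},
            },
            "SegmentLaunchTemplate": {
                "Type": "AWS::EC2::LaunchTemplate",
                "Properties": {
                    "LaunchTemplateData": {
                        "InstanceType": instance_type,
                        "BlockDeviceMappings": [
                            {
                                "DeviceName": "/dev/sdf",  # assumed device name
                                "Ebs": {
                                    "VolumeType": "st1",  # throughput optimized HDD
                                    "VolumeSize": data_volume_gib,
                                },
                            }
                        ],
                    }
                },
            },
            # A fixed-size group: a failed node is replaced, not scaled out.
            "SegmentGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": str(segment_count),
                    "MaxSize": str(segment_count),
                    "PlacementGroup": {"Ref": "ClusterPlacementGroup"},
                    "VPCZoneIdentifier": list(subnet_ids),
                    "LaunchTemplate": {
                        "LaunchTemplateId": {"Ref": "SegmentLaunchTemplate"},
                        "Version": {
                            "Fn::GetAtt": [
                                "SegmentLaunchTemplate",
                                "LatestVersionNumber",
                            ]
                        },
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    template = build_template(4, "r5.4xlarge", 2000, ["subnet-0example"])
    print(json.dumps(template, indent=2))
```

Setting the minimum and maximum size to the same value reflects the way the post describes Auto Scaling Groups being used here: for resiliency through node replacement rather than for elastic scaling.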

We gauge performance on AWS using the same open source utilities that are used for on-premises deployments: gpcheckperf and the TPC-DS benchmark. We also factor in the documented AWS specs for each virtual machine (VM) and disk type.

The best way to deploy VMware Greenplum is via AWS Marketplace. Follow the documentation, and your deployment will be up and running quickly, in just a few hours.

Figure 1 – Typical VMware Greenplum deployment on AWS.

Node Replacement

VMware Greenplum nodes are deployed on AWS using Auto Scaling Groups. The group automatically provisions the specified number of nodes, and if a node fails for any reason, it terminates the failed node and replaces it with a new one.

Figure 2 – Failed VMware Greenplum nodes replaced by Auto Scaling Group.

For data availability, VMware Greenplum uses mirroring, a concept similar to HDFS replication (two copies of the data). When a node fails, the Coordinator (mdw) node “promotes” the Mirror Segment to act as a Primary. After the new node comes online, the self-healing mechanism goes to work, executing the commands needed to restore the system to its fully functional state.

To ensure that user queries operate as normal during Segment recovery, the pgBouncer connection pooler pauses queries before Segments are rebalanced. This ensures that queries stay in the queue during Segment recovery.
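
The promotion and rebalancing flow described above can be illustrated with a toy model. This is only a sketch of the idea, assuming each Segment has exactly two copies on two different hosts; the `Segment` class and the `fail_host` and `rebalance` functions are hypothetical names and do not reflect how VMware Greenplum implements failover internally.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    content_id: int
    primary_host: str       # host currently serving queries for this segment
    mirror_host: str        # host currently holding the second copy
    preferred_primary: str  # host that should be primary when all is healthy


def fail_host(segments, failed_host):
    """Promote the mirror of every segment whose primary ran on failed_host."""
    promoted = []
    for seg in segments:
        if seg.primary_host == failed_host:
            # The mirror takes over as primary; the failed host's copy
            # becomes the (temporarily unavailable) mirror.
            seg.primary_host, seg.mirror_host = seg.mirror_host, seg.primary_host
            promoted.append(seg.content_id)
    return promoted


def rebalance(segments):
    """After recovery, return each segment to its preferred primary host."""
    moved = []
    for seg in segments:
        if seg.primary_host != seg.preferred_primary:
            seg.primary_host, seg.mirror_host = seg.mirror_host, seg.primary_host
            moved.append(seg.content_id)
    return moved


if __name__ == "__main__":
    cluster = [
        Segment(0, "sdw1", "sdw2", "sdw1"),
        Segment(1, "sdw2", "sdw3", "sdw2"),
        Segment(2, "sdw3", "sdw1", "sdw3"),
    ]
    print("promoted:", fail_host(cluster, "sdw2"))
    print("rebalanced:", rebalance(cluster))
```

In this model, queries would be paused just before `rebalance` swaps the roles back, which corresponds to the step the post attributes to the pgBouncer connection pooler.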

Single Coordinator Node Replacement

In an on-premises deployment of VMware Greenplum, a Standby Coordinator node is recommended. This node is mostly idle; it’s there in case the Coordinator node fails, ensuring continuity if and when the Coordinator node is replaced.

Thanks to self-healing on AWS, the Standby Coordinator process has been moved to the first Segment node as part of the automated AWS install process. Scripts within the Amazon Machine Image (AMI) assign roles to the nodes in the Auto Scaling Group. If the Coordinator node fails, the Standby Coordinator is temporarily promoted to Coordinator and then demoted back to Standby Coordinator. This is all done automatically.

Figure 3 – The Coordinator node (mdw) distributes queries via the network interconnect to Segment nodes (sdw1-3).

The Coordinator node is the entry point to the VMware Greenplum database system, accepting client connections and SQL queries, and distributing work to the segment instances (sdwN). When a user connects to the database via the VMware Greenplum Coordinator and issues a query, processes are created in each segment database to handle the work of that query.

By carefully matching AWS instance types and storage usage, customers can optimize their AWS consumption and VMware Greenplum license spend while preserving or increasing performance.

Disk Snapshots

Amazon Elastic Block Store (Amazon EBS) volumes have a snapshot feature that is useful in backing up an EBS volume to Amazon Simple Storage Service (Amazon S3). EBS snapshots are stored in Amazon S3, but not in a user-visible bucket.

VMware Greenplum on AWS includes the gpsnap utility. This automates the execution of EBS snapshots in parallel for your entire cluster.

Figure 4 – Making a gpsnap backup for a future possible restore.

Each disk gets a snapshot and is tagged so that gpsnap can be used to restore the snapshots to the correct nodes and mounts.
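
As a rough illustration of the pattern (one tagged snapshot per disk, started in parallel), the sketch below uses the `create_snapshot` call of the boto3 EC2 client. It is not how gpsnap is implemented: the function name `snapshot_volumes`, the tag keys, and the shape of the `volumes` list are all assumptions, and `ec2_client` is simply expected to behave like `boto3.client("ec2")`.

```python
from concurrent.futures import ThreadPoolExecutor


def snapshot_volumes(ec2_client, volumes, backup_id):
    """Start one EBS snapshot per volume in parallel.

    `volumes` is a list of dicts with the hypothetical keys
    "volume_id", "host", and "mount". Returns {volume_id: snapshot_id}.
    """

    def start(vol):
        response = ec2_client.create_snapshot(
            VolumeId=vol["volume_id"],
            Description=f"backup {backup_id} {vol['host']}:{vol['mount']}",
            # Tags record where each snapshot belongs, so a restore can
            # map it back to the right node and mount point.
            TagSpecifications=[
                {
                    "ResourceType": "snapshot",
                    "Tags": [
                        {"Key": "backup_id", "Value": backup_id},
                        {"Key": "host", "Value": vol["host"]},
                        {"Key": "mount", "Value": vol["mount"]},
                    ],
                }
            ],
        )
        return vol["volume_id"], response["SnapshotId"]

    # One worker per volume so all snapshot requests are issued together.
    with ThreadPoolExecutor(max_workers=max(1, len(volumes))) as pool:
        return dict(pool.map(start, volumes))
```

With real credentials, this could be called as `snapshot_volumes(boto3.client("ec2"), volumes, "backup-001")`. The call returns as soon as AWS has accepted each request, while the snapshots themselves continue to complete in the background.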

A backup can be created with gpsnap on AWS extremely quickly—typical execution times are around one minute. Snapshot performance is completely dependent on AWS, and VMware Greenplum waits until all of the disk snapshots are in the “pending” status before a database restart process kicks off.

The snapshots then have to complete, and that performance depends on how full the disks are and if there are prior snapshots. The gpcronsnap utility automates the scheduled execution of backups and is pre-configured to execute weekly.

Disaster Recovery

A major advantage of deploying VMware Greenplum on AWS is the ability to use EBS snapshots for disaster recovery (DR).

Figure 5 – With VMware Greenplum, gpsnap data can be copied across AWS regions.

The aforementioned gpsnap utility can copy a snapshot from one region to another. You can then restore it to a new cluster when needed in a different region.

This is a cost-effective, on-demand DR solution. You don’t need to add the cost and complexity of a second cluster.
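
For readers curious about what a cross-region copy looks like at the AWS API level, here is a minimal sketch using the boto3 EC2 client's `copy_snapshot` call. It is an illustration only and not a description of gpsnap's internals; the function name `copy_snapshots_to_region` is hypothetical, and `dest_ec2_client` is assumed to behave like a boto3 EC2 client created for the destination region.

```python
def copy_snapshots_to_region(dest_ec2_client, source_region, snapshot_ids):
    """Copy each snapshot into the region that dest_ec2_client is bound to.

    The copy is requested from the destination side, so the client must
    be created for the DR region. Returns {source_id: copied_id}.
    """
    copies = {}
    for snapshot_id in snapshot_ids:
        response = dest_ec2_client.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"DR copy of {snapshot_id} from {source_region}",
        )
        copies[snapshot_id] = response["SnapshotId"]
    return copies
```

A call might look like `copy_snapshots_to_region(boto3.client("ec2", region_name="us-west-2"), "us-east-1", ids)`, where both region names are placeholders. Any tags needed to map snapshots back to nodes and mounts would have to be applied to the copies as well, since user-defined tags are not carried over by the copy itself.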

Upgrading VMware Greenplum

Another cloud-only utility for VMware Greenplum is gprelease, which automates the upgrade of VMware Greenplum on AWS. It also upgrades optional packages, like MADlib for VMware Greenplum, VMware Greenplum Command Center, and PostGIS for VMware Greenplum.

The gpcronrelease utility runs weekly and will notify you when a new version is available. Even the cloud tools such as gpsnap and gprelease are upgraded with gprelease.

Automated Maintenance

Customers will enjoy peak performance for VMware Greenplum by following a few proven best practices, like analyzing, vacuuming, and reindexing.

All of these practices are combined in the gpmaintain utility, which automates many of the administrative tasks needed in a production database. The gpcronmaintain utility automates scheduled maintenance and can be easily configured to run more or less frequently.
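
The three practices named above map onto standard SQL maintenance commands. The sketch below generates those statements for a list of tables; it is an assumption-laden illustration rather than a description of what gpmaintain actually runs, and the function names `quote_ident` and `maintenance_statements` are hypothetical.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def maintenance_statements(tables, reindex=False):
    """Build VACUUM / ANALYZE (and optionally REINDEX) statements.

    `tables` is an iterable of (schema, table) pairs.
    """
    statements = []
    for schema, table in tables:
        qualified = f"{quote_ident(schema)}.{quote_ident(table)}"
        # Reclaim space from deleted or updated rows.
        statements.append(f"VACUUM {qualified};")
        # Refresh planner statistics so queries get good plans.
        statements.append(f"ANALYZE {qualified};")
        if reindex:
            # Rebuild indexes that may have become bloated.
            statements.append(f"REINDEX TABLE {qualified};")
    return statements


if __name__ == "__main__":
    for sql in maintenance_statements([("public", "sales")], reindex=True):
        print(sql)
```

Reindexing is left optional in this sketch because rebuilding indexes is typically heavier than vacuuming or analyzing, so a schedule might reasonably run it less often.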

Optional Installs

During the initial deployment of VMware Greenplum on AWS, many optional components are available, such as:

  • Data Science Python and Data Science R are collections of data science-related modules that can be used with VMware Greenplum Database through the PL/Python and PL/R languages.
  • VMware Greenplum Command Center (GPCC) is a web-based application for monitoring and managing VMware Greenplum clusters. GPCC works with data collected by agents running on the segment hosts and saved to the gpperfmon database.
  • MADlib is an open-source library for scalable in-database analytics. With the MADlib extension, you can use MADlib functionality in a VMware Greenplum database.
  • PostGIS is a spatial database extension that allows GIS objects to be stored in a VMware Greenplum database.

After the deployment has been completed, you can use the gpoptional utility to install or reinstall any of these components and further customize the deployment. Additional details for the gpoptional utility can be found in the release notes.

Figure 6 – Optional components available through the gpoptional utility.

Simply run gpoptional to see the optional installation options. The tool is also used in conjunction with gprelease to upgrade or reinstall optional packages that are already installed. phpPgAdmin and Command Center are installed automatically with every deployment, while MADlib, Data Science Python, Data Science R, PL/R, PostGIS, PL/Container, Backup/Restore, PXF, and gpcopy can optionally be installed with gpoptional.

Web-Based phpPgAdmin

VMware Greenplum on AWS also includes phpPgAdmin, a web-based SQL tool. Business users, developers, and administrators use phpPgAdmin to perform ad hoc queries and browse schemas. It’s a handy utility for many common scenarios.

A self-signed SSL certificate is created during the deployment, so that traffic from your browser to the cluster is encrypted.

Figure 7 – Self-signed or commercial SSL certificates encrypt connections between clients and VMware Greenplum.

In the figure above, you see a VMware Greenplum SSL connection encrypting a query using a self-signed certificate.
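
For SQL clients other than the browser, encryption is usually requested through the connection string. The sketch below builds a libpq-style connection string with the standard `sslmode` and `sslrootcert` parameters. It assumes the cluster accepts SSL connections from database clients, which the deployment documentation should confirm; the function name `build_dsn` and the host and database names are hypothetical.

```python
def build_dsn(host, dbname, user, port=5432, ca_cert_path=None):
    """Build a libpq-style connection string that requests SSL.

    Without a CA file, "require" encrypts the connection but does not
    verify the server certificate, which suits a self-signed setup.
    With a CA file, "verify-ca" also checks the certificate chain.
    Values containing spaces would need quoting, omitted here.
    """
    params = {
        "host": host,
        "port": str(port),
        "dbname": dbname,
        "user": user,
    }
    if ca_cert_path:
        params["sslmode"] = "verify-ca"
        params["sslrootcert"] = ca_cert_path
    else:
        params["sslmode"] = "require"
    return " ".join(f"{key}={value}" for key, value in params.items())


if __name__ == "__main__":
    print(build_dsn("mdw.example.internal", "analytics", "gpadmin"))
```

A string like this can be passed to any libpq-based client or driver. Moving from a self-signed to a commercial certificate would then mainly be a matter of supplying the CA file and tightening the mode.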

Security in Review

Security is paramount, so VMware Greenplum has worked with AWS to incorporate a number of best practices. These capabilities are designed to reduce your risk and ensure compliance with common enterprise requirements.

The VMware Greenplum AMI is regularly reviewed and scanned for vulnerabilities. The AWS CloudFormation template is also reviewed by AWS Solutions Architects, offering additional protection.

VMware Greenplum also protects your credentials by disabling SSH password authentication and using SSH keys instead. It uses MD5-encrypted password authentication as well, and has disabled root and password file logins.
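
To show what MD5 password authentication means in practice, the sketch below computes a password hash in the format PostgreSQL uses for MD5-stored passwords: the literal prefix `md5` followed by the MD5 hex digest of the password concatenated with the user name. That VMware Greenplum stores passwords in exactly this format is an assumption based on its Postgres heritage, and the function name `postgres_md5_password` is hypothetical.

```python
import hashlib


def postgres_md5_password(username, password):
    """Return a PostgreSQL-style MD5 password hash.

    Format: "md5" + md5(password + username) as 32 hex characters.
    The user name acts as a salt, so two users with the same password
    end up with different stored values.
    """
    digest = hashlib.md5((password + username).encode("utf-8")).hexdigest()
    return "md5" + digest


if __name__ == "__main__":
    print(postgres_md5_password("gpadmin", "example-password"))
```

The point of the scheme is that the clear-text password is never stored in the catalog.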

Want data encryption at rest? It’s available via Amazon EBS encryption. As an added bonus, your snapshots are automatically encrypted if the source EBS volume is encrypted.

Lastly, all VMware Greenplum deployments can be created in a dedicated Amazon Virtual Private Cloud (VPC) to ensure network isolation and easier management of security rules.

You can add partner products when data stored in VMware Tanzu Greenplum needs to meet higher data security requirements. To comply with regulations and standards such as HIPAA, PCI, FIPS, and GDPR, apply encryption, set external control policies, and use data masking.

Learn more about VMware Greenplum’s security access control.

Summary

In this post, we discussed why running VMware Greenplum on AWS is a compelling option for enterprise-scale analytics.

You can choose to run VMware Greenplum on AWS to simplify deployments compared with a traditional on-premises solution. Depending on the requirement, customers can opt for either bring-your-own-license (suitable for long-term deployments) or pay-as-you-go (suitable for short-term deployments and proofs of concept).

By right-sizing the selected instance types during a highly automated CloudFormation execution, VMware Greenplum on AWS achieves performance comparable to, or greater than, an on-premises deployment. TPC-DS benchmark data helps align performance with instance pricing.

The AWS-deployed environment scales and “self heals” using Auto Scaling Groups, while day-to-day backups and disaster recovery (even across AWS regions) are possible by leveraging Amazon EBS snapshots combined with the VMware Greenplum gpsnap tool.

Upgrading VMware Greenplum is simplified by the cloud-only gprelease tool, while the core data science and other in-database analytics tools can be readily installed or reinstalled with the gpoptional utility and upgraded with gprelease.

Furthermore, optional installs provide a highly customized, customer-centric data science environment. The phpPgAdmin tool provides easy access to VMware Greenplum databases to run queries and perform schema analysis, over SSL if needed.

VMware Tanzu Greenplum works closely with AWS to deploy and maintain a secure operating environment, and AWS Marketplace makes it simple for even small business groups to deploy VMware Greenplum on AWS.

You can learn more about VMware Greenplum in the eBook Data Warehousing with VMware Greenplum, Second Edition, and feel free to contact us at tanzu-aws@vmware.com.


VMware Tanzu Greenplum – AWS Partner Spotlight

VMware Greenplum is an AWS Partner that provides powerful and rapid analytics on petabyte scale data volumes.

Contact VMware Greenplum | Partner Overview | AWS Marketplace