Overview
Screenshots: Spark master web UI (cluster overview, showing the live standalone master and its registered worker); Spark worker detail; completed Spark application.
This is a repackaged open source software product wherein additional charges apply for cloudimg support services.
Overview
Apache Spark is the open source unified analytics engine for large-scale data processing. This AMI delivers Spark 4 fully installed and configured as a single-node standalone cluster on EC2, giving data engineers and small teams a working analytics engine without cluster orchestration overhead, Kubernetes dependencies, or managed-service lock-in. Run your first spark-submit job or start a PySpark session within minutes of launch - not hours of configuration.
Why This AMI Over Alternatives
Unlike managed services that add cluster spin-up latency and per-cluster pricing, this image gives you full root access to a production-grade Spark installation on a single instance you control. There is no vendor lock-in beyond EC2, no orchestration layer to manage, and no multi-node complexity for workloads that fit a single powerful instance. For dev/test environments, proof-of-concept pipelines, or cost-sensitive production workloads, you get predictable compute costs with expert support included.
Application Stack
- Apache Spark 4 with the standalone cluster manager
- Spark master and Spark worker running as systemd services under a dedicated unprivileged spark user (automatic restart on failure, no external orchestrator needed)
- Java 17 providing the JVM runtime for every Spark process
- Python 3 installed for PySpark
- spark-submit, spark-sql, spark-shell, and pyspark CLI tools ready to use immediately
AWS Integrations
This Spark image works with core AWS data services:
- Amazon S3 - Read and write data directly from S3 buckets for scalable, durable storage of input datasets, intermediate results, and output files. Use S3 as your data lake layer without managing HDFS.
- Amazon EBS - The dedicated data volume leverages EBS for independently resizable, encrypted storage. Enable EBS encryption to protect Spark worker data, SQL warehouse contents, and daemon logs at rest.
- AWS IAM - Attach IAM instance profiles to control access to S3 buckets, DynamoDB tables, and other AWS resources without embedding credentials in your Spark jobs.
Ready to Use
The Spark distribution, configuration, systemd units, and standalone cluster are all in place at boot. The master web UI is served on port 8080, showing the cluster state, workers, and every running or completed application. Submit your first job with spark-submit or start an interactive PySpark session immediately.
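As a quick check that the cluster is up, the same state shown in the master web UI can be read programmatically. The sketch below assumes the standalone master exposes its state as JSON at `/json/` on the web UI port, and that the payload carries `status`, `url`, `workers`, `activeapps`, and `completedapps` fields; verify both against your Spark build. The function names are illustrative, not part of the image.

```python
import json
from urllib.request import urlopen


def fetch_master_state(host, port=8080, timeout=5.0):
    """Fetch the master's cluster state as a dict.

    Assumes the master web UI serves JSON at /json/ (an assumption to verify).
    """
    with urlopen(f"http://{host}:{port}/json/", timeout=timeout) as resp:
        return json.load(resp)


def summarize(state):
    """Reduce the master state to the headline numbers from the UI overview."""
    workers = state.get("workers", [])
    return {
        "status": state.get("status"),
        "master_url": state.get("url"),
        # Count only workers the master currently reports as ALIVE.
        "alive_workers": sum(1 for w in workers if w.get("state") == "ALIVE"),
        "active_apps": len(state.get("activeapps", [])),
        "completed_apps": len(state.get("completedapps", [])),
    }
```

From a machine allowed to reach port 8080, `summarize(fetch_master_state("<instance-public-ip>"))` should report one alive worker on a healthy single-node cluster.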
Security and Hardening
- Spark processes run under a dedicated unprivileged user - not root
- No passwords or shared credentials baked into the image
- Supports EBS encryption at rest for the data volume
- Recommended deployment behind a security group restricting port 8080 (master UI) and port 7077 (master RPC) to trusted CIDR ranges only
- cloudimg support can assist with enabling TLS for master-worker communication and configuring Spark authentication
- On first boot, a one-shot service writes a non-secret information file and marks itself complete - no secrets are generated or stored
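The security group recommendation above can be expressed as a small rule builder. This is a minimal sketch, not the vendor's tooling: it produces entries in the `IpPermissions` shape accepted by the EC2 `AuthorizeSecurityGroupIngress` API, covers IPv4 ranges only, and rejects `0.0.0.0/0` so the two Spark ports are never opened to the whole internet by accident. The function name and the rule descriptions are hypothetical.

```python
import ipaddress

# Ports named in the hardening notes above.
SPARK_PORTS = {"master web UI": 8080, "master RPC": 7077}


def build_ingress_rules(trusted_cidrs):
    """Return EC2 IpPermissions entries opening the Spark ports to trusted IPv4 CIDRs."""
    cidrs = []
    for cidr in trusted_cidrs:
        # Raises ValueError for malformed CIDRs or ones with host bits set.
        network = ipaddress.IPv4Network(cidr)
        if network.prefixlen == 0:
            raise ValueError("refusing to open Spark ports to 0.0.0.0/0")
        cidrs.append(str(network))
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [
                {"CidrIp": cidr, "Description": f"Spark {label}"} for cidr in cidrs
            ],
        }
        for label, port in SPARK_PORTS.items()
    ]
```

With boto3 installed and credentials configured, the result can be passed as `IpPermissions` to the EC2 client's `authorize_security_group_ingress` call together with the `GroupId` of your own security group.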
Dedicated Data Volume
A separate, independently resizable EBS data volume holds the Spark worker work directory, the SQL warehouse, and daemon logs. This prevents disk-full failures during large shuffles or extended job runs by keeping cluster data isolated from the operating system disk. Resize the volume as your workloads grow without reprovisioning the instance.
Example Use Case: Nightly ETL Pipeline Development
A data engineer prototyping a nightly ETL pipeline reads raw CSV or Parquet files from an S3 bucket, transforms them with PySpark, and writes cleansed output back to S3. The single-node cluster handles datasets up to hundreds of gigabytes on memory-optimized instances. Once validated, the same spark-submit scripts can be promoted to a multi-node cluster with minimal changes.
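A pipeline of this kind might look like the following PySpark sketch. Several things here are assumptions rather than facts about the image: the bucket layout (`raw/<date>/` and `clean/<date>/`) is hypothetical, the cleansing step (drop incomplete and duplicate rows) is only a stand-in for real transformations, and `s3a://` access requires the Hadoop S3A connector on the Spark classpath plus an instance profile that grants access to the bucket.

```python
def build_paths(bucket, run_date):
    """Return (input, output) S3A URIs for one nightly run (hypothetical layout)."""
    return (
        f"s3a://{bucket}/raw/{run_date}/",
        f"s3a://{bucket}/clean/{run_date}/",
    )


def run_etl(spark, input_path, output_path):
    """Read raw CSV, drop incomplete and duplicate rows, write Parquet."""
    raw = spark.read.option("header", "true").csv(input_path)
    cleaned = raw.dropna(how="any").dropDuplicates()
    cleaned.write.mode("overwrite").parquet(output_path)


def main(bucket, run_date):
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    # The master URL is left to spark-submit (--master) rather than hard-coded.
    spark = SparkSession.builder.appName("nightly-etl-prototype").getOrCreate()
    try:
        input_path, output_path = build_paths(bucket, run_date)
        run_etl(spark, input_path, output_path)
    finally:
        spark.stop()
```

To run it under spark-submit, end the script with a call such as `main(sys.argv[1], sys.argv[2])` and pass the bucket name and run date as arguments. Because the master is not hard-coded, the same script can later be pointed at a multi-node cluster.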
Additional Use Cases
- Large-scale batch data processing and ETL
- Interactive analytics and ad hoc SQL over large datasets
- Data engineering pipeline development and testing
- Machine learning feature engineering
- Proof-of-concept clusters before scaling to multi-node deployments
Getting Started
Book a free consultation with cloudimg engineers to discuss your Spark deployment requirements, architecture review, or guided pilot setup. Our team can help you select the right instance type, configure security groups, enable encryption, and optimize job performance for your specific workload.
All product and company names are trademarks or registered trademarks of their respective holders. Use of them does not imply any affiliation with or endorsement by them.
Highlights
- Apache Spark 4 launches as a fully configured single-node standalone cluster with the master and worker supervised by systemd. If a process fails, systemd restarts it automatically - no external orchestrator, no Kubernetes dependency, and no multi-node complexity. Unlike managed services with cluster spin-up overhead, your analytics engine is accepting jobs within minutes of instance launch, giving data engineers immediate productivity.
- Java 17 and Python 3 are pre-installed and version-matched so spark-submit, spark-sql, spark-shell, and PySpark all work immediately with zero version reconciliation. This eliminates the hours typically spent resolving JVM and Python compatibility issues on a fresh install. The dedicated EBS data volume keeps shuffle data and logs separate from the OS disk, preventing disk-full failures during large jobs and supporting EBS encryption at rest.
- 24/7 technical support from cloudimg engineers with a one-hour average response for critical issues. Get expert help with Spark deployment, cluster configuration, job tuning, enabling TLS and authentication, and performance optimization. Unlike community-only support on free AMIs, you have a dedicated team ready to assist with production issues around the clock.
Details
Pricing
Free trial
- ...
| Dimension | Description | Cost/hour |
|---|---|---|
| m5.xlarge (Recommended) | m5.xlarge | $0.12 |
| t2.micro | t2.micro instance type | $0.04 |
| t3.micro | t3.micro instance type | $0.04 |
| m8azn.metal-24xl | m8azn.metal-24xl instance type | $0.24 |
| g4dn.4xlarge | g4dn.4xlarge instance type | $0.24 |
| c8i-flex.12xlarge | c8i-flex.12xlarge instance type | $0.24 |
| c7i-flex.xlarge | c7i-flex.xlarge instance type | $0.12 |
| m8a.16xlarge | m8a.16xlarge instance type | $0.24 |
| c5d.12xlarge | c5d.12xlarge instance type | $0.24 |
| t3a.nano | t3a.nano instance type | $0.00 |
Vendor refund policy
Refunds available on request.
Delivery details
64-bit (x86) Amazon Machine Image (AMI)
Amazon Machine Image (AMI)
An AMI is a virtual image that provides the information required to launch an instance. Amazon EC2 (Elastic Compute Cloud) instances are virtual servers on which you can run your applications and workloads, offering varying combinations of CPU, memory, storage, and networking resources. You can launch as many instances from as many different AMIs as you need.
Version release notes
Initial release of Apache Spark 4 as a single-node standalone cluster.
Additional details
Usage instructions
Connect via SSH on port 22 as the default login user for your operating system variant (the user guide lists it per variant). The Spark master and worker start automatically under systemd. Browse to http://<instance-public-ip>:8080/ to open the Spark master web UI. To run a job, SSH in, switch to the spark user with 'sudo -iu spark', source the environment with 'source ~/setEnv.sh', then use spark-submit or pyspark. The standalone cluster ships with Spark authentication disabled; the user guide explains how to enable a shared secret before exposing the RPC port beyond the instance.
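Once logged in as the spark user with the environment sourced, job submission can be scripted. The sketch below only assembles a spark-submit command line; `--master` and `--name` are standard spark-submit options, while the helper name and the example script path are hypothetical. The master URL is taken as a parameter because it depends on the hostname the master bound to; it is shown at the top of the master web UI and normally has the form `spark://<host>:7077`.

```python
import shlex


def spark_submit_command(script, master_url, app_name=None, app_args=()):
    """Build the argument list for submitting a script to the standalone master."""
    cmd = ["spark-submit", "--master", master_url]
    if app_name:
        cmd += ["--name", app_name]
    cmd.append(script)
    # Everything after the script path is passed through to the application.
    cmd.extend(app_args)
    return cmd


# Print a copy-pasteable command for a hypothetical job script.
print(shlex.join(spark_submit_command(
    "etl_job.py", "spark://<host>:7077", app_name="demo", app_args=["2024-01-31"]
)))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the job from Python; building the command as a list avoids shell quoting problems with application arguments.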
Support
Vendor support
cloudimg provides 24/7 technical support for this Apache Spark product by email and live chat.
Contact
Email: support@cloudimg.co.uk
Live chat: Available around the clock
Response Times
Critical issues receive a one-hour average response time. Our engineers assist with deployment, configuration, updates, performance tuning, job optimization, enabling authentication and TLS, and general troubleshooting.
What We Help With
- Initial deployment and launch guidance
- Spark cluster configuration and tuning
- Security hardening (authentication, TLS, security group recommendations)
- Job performance optimization
- Troubleshooting errors and failures
- Guidance on instance type selection for your workload
- Assistance with S3 integration and IAM configuration
Getting Started
Contact us for a free consultation to discuss your Spark deployment requirements, review your architecture, or request a guided pilot setup. We can help you select the right EC2 instance family based on your data volume and processing needs.
Refunds
If you experience issues with the product, contact our support team and we will work to resolve them. For refund requests, please reach out via email with your subscription details.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.
