Overview
Cartesia is the leading Voice AI foundation model research and development company powering the next generation of Voice AI applications. The team pioneered State Space Models during their PhDs at Stanford and commercialized the architecture in real-time speech synthesis.
Highlights
- Sonic's support for 40+ languages with accent localization and multilingual voices reaches customers around the world.
- Full control over emotional expressiveness, speed, volume and more, all at 2-4x lower latencies than alternatives.
- Achieve accurate pronunciation for complex phone numbers, addresses, and IDs on every invocation.
Details
Pricing
| Dimension | Description | Cost |
|---|---|---|
| ml.m5.4xlarge Inference (Batch) Recommended | Model inference on the ml.m5.4xlarge instance type, batch mode | $0.001/host/hour |
| inference.count.m.i.c Inference Pricing | inference.count.m.i.c Inference Pricing | $0.037/request |
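As a rough illustration of the arithmetic behind the two rates above, the following Python sketch adds a per-request charge to a per-host-hour batch charge. The listing does not say how the two dimensions combine on a bill, so treating them as additive is an assumption, and `estimate_software_cost` is a hypothetical helper name. Any other charges are outside the scope of this sketch.

```python
from decimal import Decimal

# Rates copied from the pricing table above.
BATCH_RATE_PER_HOST_HOUR = Decimal("0.001")  # ml.m5.4xlarge, batch mode
RATE_PER_REQUEST = Decimal("0.037")          # inference.count.m.i.c


def estimate_software_cost(requests: int, hosts: int = 0, hours: int = 0) -> Decimal:
    """Estimate listed software charges in USD.

    Assumption: the per-request and per-host-hour charges are additive.
    The listing does not confirm this.
    """
    if requests < 0 or hosts < 0 or hours < 0:
        raise ValueError("requests, hosts, and hours must be non-negative")
    request_cost = RATE_PER_REQUEST * requests
    batch_cost = BATCH_RATE_PER_HOST_HOUR * hosts * hours
    return request_cost + batch_cost


if __name__ == "__main__":
    # Hypothetical workload: 1,000 requests plus 2 batch hosts for 5 hours.
    print(estimate_software_cost(requests=1000, hosts=2, hours=5))
```

Decimal is used instead of float so that small per-unit rates do not accumulate rounding error.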
Vendor refund policy
Refunds are not allowed.
Legal
Vendor terms and conditions
Content disclaimer
Delivery details
Amazon SageMaker model
An Amazon SageMaker model package is a pre-trained machine learning model ready to use without additional training. Use the model package to create a model on Amazon SageMaker for real-time inference or batch processing. Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
Version release notes
Updated API
Additional details
Inputs
Summary
The response-streaming endpoint takes a JSON object as input. The object specifies the transcript, voice, language, and output format for the generation.
Input data descriptions
The following table describes supported input data fields for real-time inference and batch transform.
| Field name | Description | Constraints | Required |
|---|---|---|---|
| context_id | A unique ID provided by the client to identify the request. It can be any string value and helps with tracking or debugging. | - | Yes |
| transcript | The text that will be converted into speech. You can include additional controls (e.g., emotion, speed, volume) as supported by Sonic 3 models: https://docs.cartesia.ai/build-with-cartesia/sonic-3/volume-speed-emotion | - | Yes |
| language | The language code of the transcript text. Supported codes include: en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | - | Yes |
| output_format | Must match the raw option from the Cartesia TTS SSE API: https://docs.cartesia.ai/api-reference/tts/sse#body-output-format. Only raw is supported. | - | Yes |
| voice | Matches the voice field from the Cartesia TTS SSE API: https://docs.cartesia.ai/api-reference/tts/sse#body-voice. Only mode = id is supported. Example: { "mode": "id", "id": "voice_123" } | - | Yes |
| generation_config | Optional configuration object matching the API schema: https://docs.cartesia.ai/api-reference/tts/sse#body-generation-config | - | No |
| add_timestamps | Whether to include word-level timestamps in the output: https://docs.cartesia.ai/api-reference/tts/sse#body-add-timestamps | - | No |
| add_phoneme_timestamps | Whether to include phoneme-level timestamps in the output: https://docs.cartesia.ai/api-reference/tts/sse#body-add-phoneme-timestamps | - | No |
| use_normalized_timestamps | Whether timestamps should be normalized (0–1 range): https://docs.cartesia.ai/api-reference/tts/sse#body-use-normalized-timestamps | - | No |
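To make the field list above concrete, here is a minimal Python sketch that assembles and checks a request body before it is sent. The helper name `build_tts_request` is hypothetical, and the keys inside `output_format` (`container`, `encoding`, `sample_rate`) are an assumption about the raw option in the linked Cartesia API reference; verify them there before use. The voice ID `voice_123` is the placeholder from the table, not a real voice.

```python
import json

# Language codes listed in the input table above.
SUPPORTED_LANGUAGES = set(
    "en fr de es pt zh ja hi it ko nl pl ru sv tr tl bg ro ar cs el fi hr ms "
    "sk da ta uk hu no vi bn th he ka id te gu kn ml mr pa".split()
)


def build_tts_request(context_id, transcript, language, voice_id,
                      add_timestamps=None):
    """Return a JSON string containing the required input fields."""
    if not context_id or not transcript:
        raise ValueError("context_id and transcript are required")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")

    payload = {
        "context_id": context_id,
        "transcript": transcript,
        "language": language,
        # Assumed shape of the "raw" output format; only raw is supported.
        "output_format": {
            "container": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": 44100,
        },
        # Only mode = id is supported, per the table above.
        "voice": {"mode": "id", "id": voice_id},
    }
    # Optional fields are included only when the caller sets them.
    if add_timestamps is not None:
        payload["add_timestamps"] = bool(add_timestamps)
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_tts_request("req-001", "Hello from Sonic.", "en", "voice_123"))
```

The resulting string could then be passed as the request body to a deployed endpoint, for example through the SageMaker runtime client's `invoke_endpoint_with_response_stream` call in boto3 with a JSON content type; the endpoint name would be whatever you chose at deployment.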
Support
Vendor support
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.