Amazon Polly turns text into lifelike speech using deep learning. Polly supports dozens of languages and offers multiple voice engines at different price points. The cost difference between voice types is significant: Neural voices cost 4x as much as Standard, and Long-Form voices run at 25x the Standard rate. Understanding which voice engine your application actually needs is the single biggest lever for controlling Polly costs.
TL;DR: Standard voices cost $4.00 per million characters, Neural voices cost $16.00 per million, and Long-Form voices cost $100.00 per million. The free tier includes 5M Standard characters and 1M Neural characters per month for 12 months. Use Standard voices for IVR and notifications, and reserve Neural for customer-facing audio.
Voice Engine Pricing
Amazon Polly offers four distinct voice engine tiers, each targeting different use cases and quality levels.
| Voice Engine | Cost per 1M Characters | Best For | Quality Level |
|---|---|---|---|
| Standard | $4.00 | IVR, alerts, internal tools | Good |
| Neural | $16.00 | Podcasts, e-learning, apps | Near-human |
| Long-Form | $100.00 | Audiobooks, long articles | Highest naturalness |
| Generative | $100.00 | Conversational, expressive | Most expressive |
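As a rough illustration, the per-engine rates in the table can be turned into a small estimator. This is a minimal sketch: the rates are copied from the table above, while the dictionary keys and the `monthly_cost` helper are hypothetical names chosen for this example, not part of any AWS SDK.

```python
from decimal import Decimal

# Rates in USD per 1M characters, copied from the pricing table above.
# The lowercase engine keys are illustrative labels, not AWS identifiers.
RATE_PER_MILLION = {
    "standard": Decimal("4.00"),
    "neural": Decimal("16.00"),
    "long-form": Decimal("100.00"),
    "generative": Decimal("100.00"),
}


def monthly_cost(engine: str, characters: int) -> Decimal:
    """Return the estimated cost in USD for `characters` billed characters."""
    rate = RATE_PER_MILLION[engine]
    cost = rate * characters / Decimal(1_000_000)
    # Round to whole cents for display.
    return cost.quantize(Decimal("0.01"))


if __name__ == "__main__":
    for engine in RATE_PER_MILLION:
        print(engine, monthly_cost(engine, 2_000_000))
```

`Decimal` is used instead of floats so that the results match the dollar figures in the tables exactly. The sketch ignores the free tier and any per-request minimums.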
Speech Marks Pricing
| Feature | Cost per 1M Characters |
|---|---|
| Speech Marks (Standard) | $4.00 |
| Speech Marks (Neural) | $16.00 |
Speech Marks provide metadata about speech timing — word boundaries, sentence boundaries, viseme data for lip-sync, and SSML marks. They are billed separately from audio generation, so requesting both audio and speech marks for the same text doubles your character count.
Free Tier Details
AWS Polly includes a generous free tier for the first 12 months after account creation.
| Voice Engine | Free Characters per Month | Duration |
|---|---|---|
| Standard | 5 million | 12 months |
| Neural | 1 million | 12 months |
| Long-Form | Not included | N/A |
| Generative | Not included | N/A |
At average speaking rates, 5 million Standard characters translates to roughly 80-100 hours of generated audio per month. That is more than enough for prototyping and low-volume production use.
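The allowance and the audio-hours estimate can be checked with a few lines of arithmetic. The sketch below assumes roughly 900 characters per minute of speech (about 150 words per minute at around 6 characters per word, spaces included). That figure is an assumption for illustration, not an AWS number, and both function names are hypothetical.

```python
# Monthly free-tier allowances from the table above (first 12 months only).
FREE_CHARACTERS = {"standard": 5_000_000, "neural": 1_000_000}

# Assumption: about 900 characters per minute of generated speech.
CHARS_PER_MINUTE = 900


def billable_characters(engine: str, characters: int) -> int:
    """Characters left to pay for after subtracting the free allowance."""
    allowance = FREE_CHARACTERS.get(engine, 0)  # Long-Form/Generative: none
    return max(0, characters - allowance)


def estimated_audio_hours(characters: int) -> float:
    """Very rough audio duration in hours for a given character count."""
    return characters / CHARS_PER_MINUTE / 60


if __name__ == "__main__":
    print(billable_characters("neural", 5_000_000))
    print(round(estimated_audio_hours(5_000_000), 1))
```

Actual duration depends on the voice, language, punctuation, and any speaking-rate adjustments, so treat the hours figure as an order-of-magnitude check only.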
How Characters Are Counted
Polly bills based on the number of characters processed, including SSML tags.
- Plain text: Every character counts, including spaces and punctuation
- SSML input: The SSML markup tags themselves are counted as characters
- Minimum charge: Each API request has a minimum charge of 100 characters
- Billing granularity: Billed in increments of individual characters (no rounding up to blocks)
When using SSML for pronunciation control, emphasis, or pauses, your effective character count increases. A sentence with SSML tags can be 2-3x longer than the plain text equivalent.
SSML Impact Example
| Input Type | Raw Text Length | Billed Characters |
|---|---|---|
| Plain text | 1,000 chars | 1,000 |
| SSML with pauses | 1,000 chars | ~1,400 |
| SSML with phonemes | 1,000 chars | ~2,200 |
| Heavy SSML formatting | 1,000 chars | ~3,000 |
Real-World Cost Examples
| Use Case | Voice Engine | Monthly Volume | Monthly Cost |
|---|---|---|---|
| IVR phone system | Standard | 2M characters | $8.00 |
| E-learning platform | Neural | 5M characters | $80.00 |
| Podcast generation | Neural | 10M characters | $160.00 |
| Audiobook production | Long-Form | 3M characters | $300.00 |
| Mobile app narration | Neural | 500K characters | $8.00 |
| Accessibility (screen reader) | Standard | 20M characters | $80.00 |
| News article reader | Generative | 8M characters | $800.00 |
Full Cost Breakdown: E-Learning Platform
A typical e-learning platform generating course narration with Neural voices:
| Component | Monthly Volume | Cost |
|---|---|---|
| Neural voice generation | 5M characters | $80.00 |
| Speech Marks (for captions) | 5M characters | $80.00 |
| S3 storage for audio | 50 GB | $1.15 |
| CloudFront delivery | 200 GB | $17.00 |
| Total | | $178.15 |
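The breakdown above can be reproduced with a short calculation, which also makes it easy to see what dropping Speech Marks would save. This is a sketch under stated assumptions: the Neural rate comes from the pricing table, while the S3 and CloudFront per-GB rates are back-calculated from the rows above ($1.15 / 50 GB and $17.00 / 200 GB) and will differ by region and usage tier. The function name is hypothetical.

```python
from decimal import Decimal

# Neural rate from the pricing table (USD per 1M characters).
NEURAL_PER_MILLION = Decimal("16.00")
# Assumed per-GB rates, implied by the breakdown table; real rates vary.
S3_PER_GB = Decimal("0.023")
CLOUDFRONT_PER_GB = Decimal("0.085")


def elearning_breakdown(characters: int, storage_gb: int, transfer_gb: int,
                        speech_marks: bool = True) -> dict:
    """Return per-component monthly costs in USD, plus a total."""
    millions = Decimal(characters) / Decimal(1_000_000)
    voice = NEURAL_PER_MILLION * millions
    costs = {
        "voice": voice,
        # Speech Marks are billed as a second pass over the same text.
        "speech_marks": voice if speech_marks else Decimal("0"),
        "s3_storage": S3_PER_GB * storage_gb,
        "cloudfront": CLOUDFRONT_PER_GB * transfer_gb,
    }
    costs["total"] = sum(costs.values(), Decimal("0"))
    return {name: amount.quantize(Decimal("0.01"))
            for name, amount in costs.items()}


if __name__ == "__main__":
    print(elearning_breakdown(5_000_000, 50, 200))
    print(elearning_breakdown(5_000_000, 50, 200, speech_marks=False)["total"])
```

In this scenario the Polly charges dominate the total, so the voice engine and the decision to request Speech Marks matter far more than storage or delivery.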
Brand Voices
Brand Voices let you create a custom Neural voice exclusive to your organization. Pricing has two components:
| Component | Cost |
|---|---|
| Voice creation | Custom pricing (contact AWS) |
| Per-character usage | $100.00 per 1M characters |
Brand Voice creation requires working with the AWS team and providing voice training data. The per-character usage rate matches Long-Form pricing. This option makes sense only for large enterprises with millions of characters of monthly output where brand consistency matters.
Supported Output Formats
Polly supports multiple audio formats at no pricing difference:
| Format | Use Case | File Size (relative) |
|---|---|---|
| MP3 | Web, mobile apps | 1x (baseline) |
| OGG (Vorbis) | Web streaming | 0.8x |
| PCM | Telephony, processing | 10x |
| JSON (Speech Marks) | Lip-sync, captions | Metadata only |
Choosing compressed formats like MP3 or OGG reduces your storage and data transfer costs without affecting Polly billing.
Cost Optimization Tips
- Use Standard voices where Neural is not required: Standard voices cost 75% less and work well for alerts, IVR menus, and internal tools where top-tier naturalness is unnecessary.
- Cache generated audio aggressively: If the same text is spoken repeatedly (greetings, menu options, common responses), generate once and store in S3. A single cached phrase eliminates all future Polly charges for that content.
- Minimize SSML markup: SSML tags count as billed characters. Use SSML only when pronunciation or timing control adds measurable value. Plain text with proper punctuation often produces acceptable results.
- Batch short requests to reach the 100-character minimum: Every API call bills at least 100 characters. Avoid sending single words or very short phrases as separate requests.
- Match the API to the text length: Use the SynthesizeSpeech API for text under 3,000 characters and the StartSpeechSynthesisTask API for longer content. The async task API stores results directly in S3, avoiding the need to handle large responses in your application.
- Evaluate Long-Form vs Neural carefully: At $100 vs $16 per million characters, Long-Form voices cost 6.25x as much. Run A/B tests to determine whether your users notice the quality difference for your specific content type.
- Pre-generate content during off-peak hours: Polly has no time-based pricing differences, but generating audio ahead of time means each piece of content is synthesized once rather than on every real-time request.
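The caching tip can be sketched as a lookup keyed on a hash of the text, voice, and engine. This is a minimal sketch, not a production design: the `cache` argument is any dict-like store standing in for S3, and `synthesize` is a hypothetical callable that wraps whatever Polly call your application makes. Both are assumptions for illustration, as is the `polly-cache/` key prefix.

```python
import hashlib
from typing import Callable, MutableMapping


def cache_key(text: str, voice_id: str, engine: str) -> str:
    """Build a deterministic key so identical requests map to one object."""
    raw = f"{engine}|{voice_id}|{text}".encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    return f"polly-cache/{engine}/{voice_id}/{digest}.mp3"


def get_or_synthesize(
    text: str,
    voice_id: str,
    engine: str,
    cache: MutableMapping[str, bytes],
    synthesize: Callable[[str, str, str], bytes],
) -> bytes:
    """Return cached audio if present; otherwise synthesize once and store."""
    key = cache_key(text, voice_id, engine)
    if key not in cache:
        # Only this branch incurs a synthesis charge.
        cache[key] = synthesize(text, voice_id, engine)
    return cache[key]


if __name__ == "__main__":
    def fake_synthesize(text: str, voice_id: str, engine: str) -> bytes:
        # Stand-in for a real Polly call; returns placeholder bytes.
        return f"{engine}:{voice_id}:{text}".encode("utf-8")

    store: dict = {}
    get_or_synthesize("Welcome back", "Joanna", "standard", store, fake_synthesize)
    get_or_synthesize("Welcome back", "Joanna", "standard", store, fake_synthesize)
    print(len(store))
```

Including the engine and voice in the key matters because the same text produces different audio, at a different price, for each combination. A real deployment would also need to fold output format and any SSML variations into the key.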
Polly vs Alternatives
| Service | Cost per 1M Characters | Voice Quality | Languages |
|---|---|---|---|
| AWS Polly (Standard) | $4.00 | Good | 60+ |
| AWS Polly (Neural) | $16.00 | Excellent | 30+ |
| Google Cloud TTS (Standard) | $4.00 | Good | 50+ |
| Google Cloud TTS (Neural) | $16.00 | Excellent | 40+ |
| Azure Speech (Neural) | $16.00 | Excellent | 100+ |
| ElevenLabs | ~$24.00 | Superior | 29 |
Polly's competitive advantage is deep AWS integration — direct S3 output, IAM authentication, CloudFront distribution, and Lambda triggers for event-driven generation.
FAQ
How does Polly count characters for billing?
Polly counts every character in your request, including spaces, punctuation, and SSML tags. Each API request has a 100-character minimum, so a 50-character request is billed as 100. For plain text, every character you send counts, including spaces and punctuation. For SSML, the full XML markup is included in the character count.
Can I use the free tier for production workloads?
Yes. The Polly free tier (5M Standard characters, 1M Neural characters per month) applies to production use with no restrictions. After 12 months, you pay the regular per-character rates for each voice engine. Many small applications stay within the free tier during their first year.
When should I use Long-Form instead of Neural voices?
Long-Form voices are optimized for narrating content longer than a paragraph — they maintain natural prosody, pacing, and expression across extended passages. Use them for audiobooks, long articles, and educational content where listeners will hear minutes of continuous speech. For short-form content like alerts, UI feedback, or brief responses, Neural voices are indistinguishable in quality at 84% lower cost.
Lower Your Polly Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your Polly text-to-speech costs. Through group buying power, Wring negotiates better rates so you pay less per million characters.
