Fine-tuning adapts a foundation model to your specific domain, style, or task by training on your own examples. Instead of relying on lengthy system prompts and few-shot examples, a fine-tuned model internalizes your requirements and produces consistent outputs with shorter prompts — saving tokens and improving quality for specialized use cases.
TL;DR: Bedrock supports fine-tuning for select models (Titan, Llama, Cohere). Fine-tuning costs $8 per model unit-hour for training. You need at least 50 training examples in JSONL format (500-5,000 recommended). Fine-tuning is worth it when: (1) prompt engineering has plateaued in quality, (2) you need consistent output format/style, or (3) token costs from large prompts exceed training costs. For most use cases, start with prompt engineering and Knowledge Bases before investing in fine-tuning.
When to Fine-Tune vs Other Approaches
| Approach | Best For | Cost | Setup Time |
|---|---|---|---|
| Prompt engineering | Quick iteration, general tasks | Token costs only | Minutes |
| Few-shot prompting | Format consistency with examples | Higher token costs | Minutes |
| Knowledge Bases (RAG) | Grounding in your data | Vector store + tokens | Hours |
| Fine-tuning | Domain specialization, style adaptation | Training + inference | Days |
| Continued pre-training | Teaching domain knowledge | Highest | Days-weeks |
Fine-Tune When:
- Prompt engineering quality has plateaued and you need better performance
- Your system prompt exceeds 2,000 tokens (a fine-tuned model internalizes those instructions, so every request carries far fewer tokens)
- You need consistent output format across thousands of requests
- Domain-specific language or terminology must be used precisely
- You have 100+ high-quality labeled examples
Don't Fine-Tune When:
- You need the model to access current or changing information (use RAG instead)
- You have fewer than 50 training examples
- The task is general-purpose and base models already handle it well (e.g., summarization, translation)
- Prompt engineering produces acceptable quality
Supported Models
| Model | Fine-Tuning | Continued Pre-Training |
|---|---|---|
| Amazon Titan Text Express | Yes | Yes |
| Amazon Titan Text Lite | Yes | Yes |
| Meta Llama 3.1 8B | Yes | No |
| Meta Llama 3.1 70B | Yes | No |
| Cohere Command | Yes | No |
Note: Claude models are not available for fine-tuning on Bedrock. For Claude customization, use prompt engineering, Knowledge Bases, or Anthropic's direct fine-tuning program. For self-hosted alternatives, see SageMaker.
Training Data Format
Training data must be in JSONL format, stored in S3:
```jsonl
{"prompt": "Classify this support ticket: My order hasn't arrived", "completion": "category: shipping, priority: medium, sentiment: frustrated"}
{"prompt": "Classify this support ticket: The app keeps crashing", "completion": "category: technical, priority: high, sentiment: frustrated"}
```
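Before uploading, it's worth validating the file locally so a malformed line doesn't fail the training job. A minimal check, assuming the prompt/completion schema shown above:

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}

def validate_jsonl(lines):
    """Check that each line is a JSON object with exactly the
    prompt/completion keys. Returns (valid_examples, errors)."""
    examples, errors = [], []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f"line {i}: invalid JSON ({e.msg})")
            continue
        if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
            errors.append(f"line {i}: expected keys {sorted(REQUIRED_KEYS)}")
            continue
        examples.append(obj)
    return examples, errors
```

Run it over the file's lines before the S3 upload; any entry in `errors` should be fixed first.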
Data Requirements
| Requirement | Details |
|---|---|
| Format | JSONL (one example per line) |
| Minimum examples | 50 (recommended: 500-5,000) |
| Maximum examples | Model-dependent (typically 10,000-100,000) |
| Validation split | 20% holdout recommended |
| Quality | Consistent, correct, representative of desired behavior |
Data Preparation Tips
- Quality over quantity: 500 perfect examples outperform 5,000 noisy ones
- Cover edge cases: Include examples of tricky inputs and expected handling
- Consistent format: All completions should follow the exact same output structure
- Balanced classes: For classification, balance examples across categories
- Remove duplicates: Duplicate examples cause overfitting without improving quality
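Two of these tips, deduplication and class balance, can be enforced with a quick pass over the data. A sketch, assuming completions begin with a `category: <name>` field as in the training-data example above:

```python
from collections import Counter

def dedupe_and_count(examples):
    """Drop exact duplicate prompt/completion pairs and count examples
    per category, assuming completions start with "category: <name>"."""
    seen, unique = set(), []
    for ex in examples:
        key = (ex["prompt"], ex["completion"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    counts = Counter(
        ex["completion"].split(",")[0].removeprefix("category:").strip()
        for ex in unique
    )
    return unique, counts
```

Heavily skewed counts are a signal to collect more examples for the underrepresented categories before training.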
Pricing
Fine-Tuning Training
| Component | Cost |
|---|---|
| Training | $8.00 per model unit-hour |
| Typical job | 2-8 hours depending on data size and model |
A typical fine-tuning job with 1,000 examples on Titan Text Express takes approximately 2-4 hours, costing $16-32.
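That estimate is simple arithmetic on the rate above, easy to adapt to your own job length and model unit count:

```python
def training_cost(hours, model_units=1, rate_per_unit_hour=8.00):
    """Estimated Bedrock fine-tuning cost at $8.00 per model unit-hour."""
    return hours * model_units * rate_per_unit_hour

# Low and high ends of the 2-4 hour range for a single model unit:
print(training_cost(2), training_cost(4))
```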
Custom Model Inference
| Component | Cost |
|---|---|
| Provisioned Throughput (required) | Model-dependent; commitment required |
| On-demand inference | Not available: you must purchase Provisioned Throughput to use a fine-tuned model |
Important: Fine-tuned models on Bedrock require Provisioned Throughput for inference — there's no pay-per-token option. This means you need sustained usage to justify the fixed cost.
Continued Pre-Training
| Component | Cost |
|---|---|
| Training | $6.00 per model unit-hour |
| Typical job | 4-24 hours depending on corpus size |
Fine-Tuning Process
Step 1: Prepare Training Data
Create JSONL file with prompt-completion pairs. Upload to S3. Create a validation set (20% holdout).
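The split and file writing can be scripted; the 20% holdout matches the recommendation above, and the file names and bucket path are illustrative:

```python
import json
import random

def split_examples(examples, validation_frac=0.2, seed=42):
    """Shuffle and split into (training, validation) sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_frac)
    return shuffled[n_val:], shuffled[:n_val]

def write_jsonl(path, examples):
    """One JSON object per line, the format Bedrock expects."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

# After writing the files, upload them to your bucket, e.g.:
#   aws s3 cp train.jsonl s3://my-training-bucket/fine-tuning/train.jsonl
```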
Step 2: Configure Training Job
Set in the Bedrock console or API:
- Base model: The foundation model to fine-tune
- Training data: S3 location of JSONL file
- Validation data: S3 location of holdout set
- Hyperparameters: Epochs, learning rate, batch size
- Output: S3 location for model artifacts
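Via the API, these settings map onto boto3's `create_model_customization_job` call. A sketch of the request: the bucket paths, role ARN, and job name are placeholders, and hyperparameter names vary by base model (the Titan-style names and values here are illustrative, not tuned recommendations):

```python
def customization_job_params(job_name, base_model_id, role_arn, bucket):
    """Assemble the request for bedrock.create_model_customization_job."""
    return {
        "jobName": job_name,
        "customModelName": f"{job_name}-model",
        "roleArn": role_arn,  # IAM role Bedrock assumes to read/write S3
        "baseModelIdentifier": base_model_id,
        "trainingDataConfig": {"s3Uri": f"s3://{bucket}/fine-tuning/train.jsonl"},
        "validationDataConfig": {
            "validators": [{"s3Uri": f"s3://{bucket}/fine-tuning/validation.jsonl"}]
        },
        "outputDataConfig": {"s3Uri": f"s3://{bucket}/fine-tuning/output/"},
        "hyperParameters": {"epochCount": "3", "learningRate": "0.00001", "batchSize": "1"},
    }

# With boto3 (placeholder account ID, role, and bucket):
# bedrock = boto3.client("bedrock")
# job = bedrock.create_model_customization_job(
#     **customization_job_params(
#         "ticket-classifier-v1",
#         "amazon.titan-text-express-v1",
#         "arn:aws:iam::123456789012:role/BedrockFineTuneRole",
#         "my-training-bucket",
#     )
# )
```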
Step 3: Monitor Training
Track training metrics:
- Training loss: Should decrease consistently
- Validation loss: Should decrease and then plateau (not increase)
- If validation loss increases: Training is overfitting — reduce epochs or add more data
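The overfitting check can be automated once you have the per-epoch validation losses (Bedrock writes metrics files to the job's S3 output location). A small helper, where `patience` is how many non-improving epochs to tolerate:

```python
def best_checkpoint(validation_losses, patience=1):
    """Return the index of the lowest validation loss if at least
    `patience` later epochs failed to improve (overfitting has begun),
    or None if the loss is still improving."""
    best = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return best if len(validation_losses) - 1 - best >= patience else None

# Overall job status is polled with the control-plane client:
# status = bedrock.get_model_customization_job(jobIdentifier=job_arn)["status"]
# (e.g. "InProgress", "Completed", "Failed")
```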
Step 4: Evaluate
Compare fine-tuned model against the base model on a held-out test set:
- Run identical prompts through both models
- Score outputs on accuracy, format compliance, and quality
- Verify that fine-tuning improved the target metric
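Format compliance is the easiest of these metrics to script. A sketch that scores outputs against the ticket-classification structure used in the training examples earlier (the regex is specific to that assumed format):

```python
import re

# Expected completion structure from the training data shown earlier:
#   "category: <x>, priority: <x>, sentiment: <x>"
PATTERN = re.compile(r"^category: \w+, priority: \w+, sentiment: \w+$")

def format_compliance(outputs):
    """Fraction of outputs matching the expected structure. Run the same
    test prompts through the base and fine-tuned models and compare."""
    if not outputs:
        return 0.0
    return sum(bool(PATTERN.match(o.strip())) for o in outputs) / len(outputs)
```

A well-tuned model should score at or near 1.0; base models with few-shot prompts typically drift more often.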
Step 5: Deploy
Purchase Provisioned Throughput for the fine-tuned model and integrate into your application using the custom model ID.
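In boto3 this is a `create_provisioned_model_throughput` call followed by invoking through the provisioned model ARN. A sketch with placeholder names; the request body shape shown for invocation is Titan-style and varies by model family:

```python
def provisioned_throughput_params(name, custom_model_arn, model_units=1):
    """Assemble the request for bedrock.create_provisioned_model_throughput.
    commitmentDuration (e.g. "OneMonth") can be added for committed terms."""
    return {
        "provisionedModelName": name,
        "modelId": custom_model_arn,
        "modelUnits": model_units,
    }

# bedrock = boto3.client("bedrock")
# pt = bedrock.create_provisioned_model_throughput(
#     **provisioned_throughput_params("ticket-classifier-pt", custom_model_arn)
# )
# runtime = boto3.client("bedrock-runtime")
# out = runtime.invoke_model(
#     modelId=pt["provisionedModelArn"],
#     body=json.dumps({"inputText": "Classify this support ticket: My order is late"}),
# )
```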
Hyperparameter Guidance
| Parameter | Default | Recommendation |
|---|---|---|
| Epochs | 5 | Start with 3-5, increase if validation loss still decreasing |
| Learning rate | Model-dependent | Use default unless you see instability |
| Batch size | Model-dependent | Larger = faster training, potentially less stable |
Key rule: Monitor validation loss. If it starts increasing while training loss decreases, you're overfitting. Stop and use the checkpoint from the lowest validation loss.
Related Guides
- AWS Bedrock Pricing Guide
- AWS Bedrock LLM Models Guide
- AWS SageMaker Cost Optimization Guide
- AWS Bedrock vs SageMaker
FAQ
Is fine-tuning worth the Provisioned Throughput requirement?
Only if your inference volume is high enough to justify reserved capacity. For sporadic usage, prompt engineering with few-shot examples is more cost-effective. Fine-tuning makes financial sense when daily inference costs already exceed the Provisioned Throughput minimum.
How many examples do I need for good results?
For classification tasks: 50-200 examples per class. For generation tasks: 500-2,000 examples. For complex reasoning: 2,000-5,000 examples. More data generally improves quality, but diminishing returns set in above 5,000 examples for most tasks.
Can I fine-tune a fine-tuned model (iterative fine-tuning)?
Not directly on Bedrock. Each fine-tuning job starts from the base foundation model. To iterate, add new examples to your training data and run a new fine-tuning job from the base model.
Lower Your Bedrock Fine-Tuning Costs with Wring
Wring helps you access AWS credits and volume discounts to lower your Bedrock fine-tuning costs. Through group buying power, Wring negotiates better rates so you pay less per training hour.
