Local Models Overview
Note
This document is written from Caret's perspective. It follows the Caret v3.38.1 merge; where Caret-specific policies exist (supported local runtimes, authentication/routing, model restrictions), they are marked in the body with a <Note>.
Running Models Locally with Caret
Run Caret completely offline with genuinely capable models on your own hardware. No API costs, no data leaving your machine, no internet dependency.
Local models have reached a turning point where they're now practical for real development work. This guide covers everything you need to know about running Caret with local models.
Quick Start
- Check your hardware - 32GB+ RAM minimum
- Choose your runtime - LM Studio or Ollama
- Download Qwen3 Coder 30B - The recommended model
- Configure settings - Enable compact prompts, set max context
- Start coding - Completely offline
Hardware Requirements
Your RAM determines which models you can run effectively:
| RAM | Recommended Model | Quantization | Performance Level |
|---|---|---|---|
| 32GB | Qwen3 Coder 30B | 4-bit | Entry-level local coding |
| 64GB | Qwen3 Coder 30B | 8-bit | Full Caret features |
| 128GB+ | GLM-4.5-Air | 4-bit | Cloud-competitive performance |
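If you're not sure which tier you fall into, a quick way to check is to read total system RAM and compare it against the table above. A minimal Python sketch (POSIX-only, so Linux/macOS; the tier thresholds simply mirror the table and are guidance, not hard Caret requirements):

```python
import os

def total_ram_gb() -> float:
    """Total physical RAM in GiB via POSIX sysconf (Linux/macOS)."""
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    page_count = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * page_count / (1024 ** 3)

ram = total_ram_gb()
# Thresholds mirror the table above; they are guidance, not hard limits.
if ram >= 128:
    suggestion = "GLM-4.5-Air, 4-bit (cloud-competitive performance)"
elif ram >= 64:
    suggestion = "Qwen3 Coder 30B, 8-bit (full Caret features)"
elif ram >= 32:
    suggestion = "Qwen3 Coder 30B, 4-bit (entry-level local coding)"
else:
    suggestion = "below the 32GB minimum for local models"

print(f"Total RAM: {ram:.0f} GiB -> {suggestion}")
```

Windows users can read the same number from Task Manager and match it against the table by hand.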
Recommended Models
Primary Recommendation: Qwen3 Coder 30B
After extensive testing, Qwen3 Coder 30B is the most reliable model under 70B parameters for Caret:
- 256K native context window - Handle entire repositories
- Strong tool-use capabilities - Reliable command execution
- Repository-scale understanding - Maintains context across files
- Proven reliability - Consistent outputs with Caret's tool format
Download sizes:
- 4-bit: ~17GB (recommended for 32GB RAM)
- 8-bit: ~32GB (recommended for 64GB RAM)
- 16-bit: ~60GB (requires 128GB+ RAM)
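These sizes follow directly from parameter count times bits per weight. A rough sketch of the arithmetic (a lower-bound estimate; real GGUF/MLX files add a few GB of metadata and higher-precision tensors):

```python
def approx_quantized_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough size of a quantized model: parameters x bits per weight / 8."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

for bits in (4, 8, 16):
    size = approx_quantized_size_gb(30, bits)  # Qwen3 Coder 30B
    print(f"{bits:>2}-bit: ~{size:.0f} GB before metadata/overhead")
# Prints ~14, ~28, and ~56 GB, in line with the ~17/~32/~60 GB downloads above.
```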
Why Not Smaller Models?
Most models under 30B parameters (7B-20B) fail with Caret because they:
- Produce broken tool-use outputs
- Refuse to execute commands
- Can't maintain conversation context
- Struggle with complex coding tasks
Runtime Options
LM Studio
- Pros: User-friendly GUI, easy model management, built-in server
- Cons: Memory overhead from UI, limited to single model at a time
- Best for: Desktop users who want simplicity
- Setup Guide →
Ollama
- Pros: Command-line based, lower memory overhead, scriptable
- Cons: Requires terminal comfort, manual model management
- Best for: Power users and server deployments
- Setup Guide →
Critical Configuration
Required Settings
In Caret:
- ✅ Enable "Use Compact Prompt" - Reduces prompt size by 90%
- ✅ Set appropriate model in settings
- ✅ Configure Base URL to match your server
In LM Studio:
- Context Length: 262144 (maximum)
- KV Cache Quantization: OFF (critical for proper function)
- Flash Attention: ON (if available on your hardware)
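Once the LM Studio server is running with these settings, it's worth confirming it responds before pointing Caret at it. LM Studio exposes an OpenAI-compatible API, so a standard-library Python check is enough; the sketch below assumes the default port 1234 and uses whichever model is currently loaded:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address

# 1. List loaded models; the id returned is what you enter in Caret's settings.
with urllib.request.urlopen(f"{BASE_URL}/models") as resp:
    model_ids = [m["id"] for m in json.load(resp)["data"]]
print("Loaded models:", model_ids)

# 2. Send a tiny chat completion to confirm generation works end to end.
payload = {
    "model": model_ids[0],
    "messages": [{"role": "user", "content": "Reply with one word: ready"}],
    "max_tokens": 10,
}
request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```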
In Ollama:
- Set context window: num_ctx 262144
- Enable flash attention if supported
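Ollama applies num_ctx per request unless it is baked into a Modelfile (PARAMETER num_ctx 262144), so it's worth verifying the value actually reaches the server. A minimal sketch against Ollama's REST API, assuming the default port 11434 and a placeholder model tag you'd replace with the one you pulled:

```python
import json
import urllib.request

MODEL = "qwen3-coder:30b"  # placeholder: use the tag you actually pulled

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Reply with one word: ready"}],
    "stream": False,
    "options": {"num_ctx": 262144},  # per-request context window
}
request = urllib.request.Request(
    "http://localhost:11434/api/chat",  # Ollama's default endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.load(resp)["message"]["content"])
```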
Understanding Quantization
Quantization reduces model precision to fit on consumer hardware:
| Type | Size Reduction | Quality | Use Case |
|---|---|---|---|
| 4-bit | ~75% | Good | Most coding tasks, limited RAM |
| 8-bit | ~50% | Better | Professional work, more nuance |
| 16-bit | None | Best | Maximum quality, requires high RAM |
Model Formats
GGUF (Universal)
- Works on all platforms (Windows, Linux, Mac)
- Extensive quantization options
- Broader tool compatibility
- Recommended for most users
MLX (Mac only)
- Optimized for Apple Silicon (M1/M2/M3)
- Leverages Metal and AMX acceleration
- Faster inference on Mac
- Requires macOS 13+
Performance Expectations
What's Normal
- Initial load time: 10-30 seconds for model warmup
- Token generation: 5-20 tokens/second on consumer hardware
- Context processing: Slower with large codebases
- Memory usage: Close to your quantization size
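The generation rates above are easy to measure on your own machine: Ollama reports token counts and timings with each response. A small benchmark sketch, assuming the default Ollama port and a placeholder model tag (the same idea works against LM Studio's OpenAI-compatible endpoint if you time the request yourself):

```python
import json
import urllib.request

payload = {
    "model": "qwen3-coder:30b",  # placeholder: use your pulled model tag
    "prompt": "Write a Python function that reverses a string.",
    "stream": False,
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    stats = json.load(resp)

# Ollama returns generation stats alongside the text (durations in nanoseconds).
tokens = stats["eval_count"]
seconds = stats["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tokens/sec")
```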
Performance Tips
- Use compact prompts - Essential for local inference
- Limit context when possible - Start with smaller windows
- Choose right quantization - Balance quality vs speed
- Close other applications - Free up RAM for the model
- Use SSD storage - Faster model loading
Use Case Comparison
When to Use Local Models
✅ Perfect for:
- Offline development environments
- Privacy-sensitive projects
- Learning without API costs
- Unlimited experimentation
- Air-gapped environments
- Cost-conscious development
When to Use Cloud Models
☁️ Better for:
- Very large codebases (>256K tokens)
- Multi-hour refactoring sessions
- Teams needing consistent performance
- Latest model capabilities
- Time-critical projects
Troubleshooting
Common Issues & Solutions
"Shell integration unavailable"
- Switch to bash in Caret Settings → Terminal → Default Terminal Profile
- Resolves 90% of terminal integration problems
"No connection could be made"
- Verify server is running (LM Studio or Ollama)
- Check Base URL matches server address
- Ensure no firewall blocking connection
- Default ports: LM Studio (1234), Ollama (11434)
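When the cause isn't obvious, a quick port probe tells you whether anything is listening at all, which separates "server not running" from firewall or Base URL problems. A small sketch that checks both default ports (adjust if you changed them):

```python
import socket

# Default local endpoints for each runtime; adjust if you changed the Base URL.
servers = {"LM Studio": ("localhost", 1234), "Ollama": ("localhost", 11434)}

for name, (host, port) in servers.items():
    try:
        # A plain TCP connect distinguishes "server not running" from URL typos.
        with socket.create_connection((host, port), timeout=2):
            print(f"{name}: reachable on {host}:{port}")
    except OSError as error:
        print(f"{name}: not reachable on {host}:{port} ({error})")
```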
Slow or incomplete responses
- Normal for local models (5-20 tokens/sec typical)
- Try smaller quantization (4-bit instead of 8-bit)
- Enable compact prompts if not already
- Reduce context window size
Model confusion or errors
- Verify KV Cache Quantization is OFF (LM Studio)
- Ensure compact prompts enabled
- Check context length set to maximum
- Confirm sufficient RAM for quantization
Performance Optimization
For faster inference:
- Use 4-bit quantization
- Enable Flash Attention
- Reduce context window if not needed
- Close unnecessary applications
- Use NVMe SSD for model storage
For better quality:
- Use 8-bit or higher quantization
- Maximize context window
- Ensure adequate cooling
- Allocate maximum RAM to model
Advanced Configuration
Multi-GPU Setup
If you have multiple GPUs, you can split model layers:
- LM Studio: Automatic GPU detection
- Ollama: Set the num_gpu parameter
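With Ollama, num_gpu controls how many model layers are offloaded to the GPU(s) and can be passed per request in the same way as num_ctx in the earlier sketch. A compact example (the model tag and layer count are placeholder assumptions to tune for your hardware):

```python
import json
import urllib.request

payload = {
    "model": "qwen3-coder:30b",   # placeholder: use your pulled model tag
    "prompt": "Reply with one word: ready",
    "stream": False,
    "options": {
        "num_gpu": 40,            # layers offloaded to GPU(s); tune for your VRAM
        "num_ctx": 262144,        # keep the large context window alongside it
    },
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.load(resp)["response"])
```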
Custom Models
While Qwen3 Coder 30B is recommended, you can experiment with:
- DeepSeek Coder V2
- Codestral 22B
- StarCoder2 15B
Note: These may require additional configuration and testing.
Community & Support
- Discord: Join our community for real-time help
- Reddit: r/caret for discussions
- GitHub: Report issues
Next Steps
Ready to get started? Choose your path:
LM Studio Setup
User-friendly GUI approach with detailed configuration guide
Ollama Setup
Command-line setup for power users and automation
Summary
Local models with Caret are now genuinely practical. While they won't match top-tier cloud APIs in speed, they offer complete privacy, zero costs, and offline capability. With proper configuration and the right hardware, Qwen3 Coder 30B can handle most coding tasks effectively.
The key is proper setup: adequate RAM, correct configuration, and realistic expectations. Follow this guide, and you'll have a capable coding assistant running entirely on your hardware.