Local Models Overview
Note
This document is written from Caret's perspective. It follows the Caret v3.38.1 merge; where Caret-specific policies exist (supported local runtimes, authentication/routing, model restrictions), they are marked in the body with a <Note>.
Running Models Locally with Caret
Run Caret completely offline with genuinely capable models on your own hardware. No API costs, no data leaving your machine, no internet dependency.
Local models have reached a turning point where they're now practical for real development work. This guide covers everything you need to know about running Caret with local models.
Quick Start
- Check your hardware - 32GB+ RAM minimum
- Choose your runtime - LM Studio or Ollama
- Download Qwen3 Coder 30B - The recommended model
- Configure settings - Enable compact prompts, set max context
- Start coding - Completely offline
Hardware Requirements
Your RAM determines which models you can run effectively:
| RAM | Recommended Model | Quantization | Performance Level |
|---|---|---|---|
| 32GB | Qwen3 Coder 30B | 4-bit | Entry-level local coding |
| 64GB | Qwen3 Coder 30B | 8-bit | Full Caret features |
| 128GB+ | GLM-4.5-Air | 4-bit | Cloud-competitive performance |
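If you're not sure which tier you fall into, a quick way to check is to read total system RAM and compare it against the table above. A minimal Python sketch (POSIX-only, so Linux/macOS; the tier thresholds simply mirror the table and are guidance, not hard Caret requirements):

```python
import os

def total_ram_gb() -> float:
    """Total physical RAM in GiB via POSIX sysconf (Linux/macOS)."""
    page_size = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page
    page_count = os.sysconf("SC_PHYS_PAGES")  # number of physical pages
    return page_size * page_count / (1024 ** 3)

ram = total_ram_gb()
# Thresholds mirror the table above; they are guidance, not hard limits.
if ram >= 128:
    suggestion = "GLM-4.5-Air, 4-bit (cloud-competitive performance)"
elif ram >= 64:
    suggestion = "Qwen3 Coder 30B, 8-bit (full Caret features)"
elif ram >= 32:
    suggestion = "Qwen3 Coder 30B, 4-bit (entry-level local coding)"
else:
    suggestion = "below the 32GB minimum for local models"

print(f"Total RAM: {ram:.0f} GiB -> {suggestion}")
```

Windows users can read the same number from Task Manager and match it against the table by hand.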
Recommended Models
Primary Recommendation: Qwen3 Coder 30B
After extensive testing, Qwen3 Coder 30B is the most reliable model under 70B parameters for Caret:
- 256K native context window - Handle entire repositories
- Strong tool-use capabilities - Reliable command execution
- Repository-scale understanding - Maintains context across files
- Proven reliability - Consistent outputs with Caret's tool format
Download sizes:
- 4-bit: ~17GB (recommended for 32GB RAM)
- 8-bit: ~32GB (recommended for 64GB RAM)
- 16-bit: ~60GB (requires 128GB+ RAM)
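These sizes follow directly from parameter count times bits per weight. A rough sketch of the arithmetic (a lower-bound estimate; real GGUF/MLX files add a few GB of metadata and higher-precision tensors):

```python
def approx_quantized_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough size of a quantized model: parameters x bits per weight / 8."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

for bits in (4, 8, 16):
    size = approx_quantized_size_gb(30, bits)  # Qwen3 Coder 30B
    print(f"{bits:>2}-bit: ~{size:.0f} GB before metadata/overhead")
# Prints ~14, ~28, and ~56 GB, in line with the ~17/~32/~60 GB downloads above.
```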
Why Not Smaller Models?
Most models under 30B parameters (7B-20B) fail with Caret because they:
- Produce broken tool-use outputs
- Refuse to execute commands
- Can't maintain conversation context
- Struggle with complex coding tasks
Runtime Options
LM Studio
- Pros: User-friendly GUI, easy model management, built-in server
- Cons: Memory overhead from UI, limited to single model at a time
- Best for: Desktop users who want simplicity
- Setup Guide →
Ollama
- Pros: Command-line based, lower memory overhead, scriptable
- Cons: Requires terminal comfort, manual model management
- Best for: Power users and server deployments
- Setup Guide →
Critical Configuration
Required Settings
In Caret:
- ✅ Enable "Use Compact Prompt" - Reduces prompt size by 90%
- ✅ Set appropriate model in settings
- ✅ Configure Base URL to match your server
In LM Studio:
- Context Length: 262144 (maximum)
- KV Cache Quantization: OFF (critical for proper function)
- Flash Attention: ON (if available on your hardware)
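Once the LM Studio server is running with these settings, it's worth confirming it responds before pointing Caret at it. LM Studio exposes an OpenAI-compatible API, so a standard-library Python check is enough; the sketch below assumes the default port 1234 and uses whichever model is currently loaded:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address

# 1. List loaded models; the id returned is what you enter in Caret's settings.
with urllib.request.urlopen(f"{BASE_URL}/models") as resp:
    model_ids = [m["id"] for m in json.load(resp)["data"]]
print("Loaded models:", model_ids)

# 2. Send a tiny chat completion to confirm generation works end to end.
payload = {
    "model": model_ids[0],
    "messages": [{"role": "user", "content": "Reply with one word: ready"}],
    "max_tokens": 10,
}
request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```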
In Ollama:
- Set context window: num_ctx 262144
- Enable flash attention if supported
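Ollama applies num_ctx per request unless it is baked into a Modelfile (PARAMETER num_ctx 262144), so it's worth verifying the value actually reaches the server. A minimal sketch against Ollama's REST API, assuming the default port 11434 and a placeholder model tag you'd replace with the one you pulled:

```python
import json
import urllib.request

MODEL = "qwen3-coder:30b"  # placeholder: use the tag you actually pulled

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Reply with one word: ready"}],
    "stream": False,
    "options": {"num_ctx": 262144},  # per-request context window
}
request = urllib.request.Request(
    "http://localhost:11434/api/chat",  # Ollama's default endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.load(resp)["message"]["content"])
```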
Understanding Quantization
Quantization reduces model precision to fit on consumer hardware:
| Type | Size Reduction | Quality | Use Case |
|---|---|---|---|
| 4-bit | ~75% | Good | Most coding tasks, limited RAM |
| 8-bit | ~50% | Better | Professional work, more nuance |
| 16-bit | None | Best | Maximum quality, requires high RAM |
Model Formats
GGUF (Universal)
- Works on all platforms (Windows, Linux, Mac)
- Extensive quantization options
- Broader tool compatibility
- Recommended for most users
MLX (Mac only)
- Optimized for Apple Silicon (M1/M2/M3)
- Leverages Metal and AMX acceleration
- Faster inference on Mac
- Requires macOS 13+
Performance Expectations
What's Normal
- Initial load time: 10-30 seconds for model warmup
- Token generation: 5-20 tokens/second on consumer hardware
- Context processing: Slower with large codebases
- Memory usage: Close to your quantization size
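The generation rates above are easy to measure on your own machine: Ollama reports token counts and timings with each response. A small benchmark sketch, assuming the default Ollama port and a placeholder model tag (the same idea works against LM Studio's OpenAI-compatible endpoint if you time the request yourself):

```python
import json
import urllib.request

payload = {
    "model": "qwen3-coder:30b",  # placeholder: use your pulled model tag
    "prompt": "Write a Python function that reverses a string.",
    "stream": False,
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",  # Ollama's default endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    stats = json.load(resp)

# Ollama returns generation stats alongside the text (durations in nanoseconds).
tokens = stats["eval_count"]
seconds = stats["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tokens/sec")
```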
Performance Tips
- Use compact prompts - Essential for local inference
- Limit context when possible - Start with smaller windows
- Choose right quantization - Balance quality vs speed
- Close other applications - Free up RAM for the model
- Use SSD storage - Faster model loading
Use Case Comparison
When to Use Local Models
✅ Perfect for:
- Offline development environments
- Privacy-sensitive projects
- Learning without API costs
- Unlimited experimentation
- Air-gapped environments
- Cost-conscious development
When to Use Cloud Models
☁️ Better for:
- Very large codebases (>256K tokens)
- Multi-hour refactoring sessions
- Teams needing consistent performance
- Latest model capabilities
- Time-critical projects
Troubleshooting
Common Issues & Solutions
"Shell integration unavailable"
- Switch to bash in Caret Settings → Terminal → Default Terminal Profile
- Resolves 90% of terminal integration problems
"No connection could be made"
- Verify server is running (LM Studio or Ollama)
- Check Base URL matches server address
- Ensure no firewall blocking connection
- Default ports: LM Studio (1234), Ollama (11434)
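When the cause isn't obvious, a quick port probe tells you whether anything is listening at all, which separates "server not running" from firewall or Base URL problems. A small sketch that checks both default ports (adjust if you changed them):

```python
import socket

# Default local endpoints for each runtime; adjust if you changed the Base URL.
servers = {"LM Studio": ("localhost", 1234), "Ollama": ("localhost", 11434)}

for name, (host, port) in servers.items():
    try:
        # A plain TCP connect distinguishes "server not running" from URL typos.
        with socket.create_connection((host, port), timeout=2):
            print(f"{name}: reachable on {host}:{port}")
    except OSError as error:
        print(f"{name}: not reachable on {host}:{port} ({error})")
```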
Slow or incomplete responses
- Normal for local models (5-20 tokens/sec typical)
- Try smaller quantization (4-bit instead of 8-bit)
- Enable compact prompts if not already
- Reduce context window size
Model confusion or errors
- Verify KV Cache Quantization is OFF (LM Studio)
- Ensure compact prompts enabled
- Check context length set to maximum
- Confirm sufficient RAM for quantization
Performance Optimization
For faster inference:
- Use 4-bit quantization
- Enable Flash Attention
- Reduce context window if not needed
- Close unnecessary applications
- Use NVMe SSD for model storage
For better quality:
- Use 8-bit or higher quantization
- Maximize context window
- Ensure adequate cooling
- Allocate maximum RAM to model
Advanced Configuration
Multi-GPU Setup
If you have multiple GPUs, you can split model layers:
- LM Studio: Automatic GPU detection
- Ollama: Set the num_gpu parameter
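With Ollama, num_gpu controls how many model layers are offloaded to the GPU(s) and can be passed per request in the same way as num_ctx in the earlier sketch. A compact example (the model tag and layer count are placeholder assumptions to tune for your hardware):

```python
import json
import urllib.request

payload = {
    "model": "qwen3-coder:30b",   # placeholder: use your pulled model tag
    "prompt": "Reply with one word: ready",
    "stream": False,
    "options": {
        "num_gpu": 40,            # layers offloaded to GPU(s); tune for your VRAM
        "num_ctx": 262144,        # keep the large context window alongside it
    },
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.load(resp)["response"])
```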
Custom Models
While Qwen3 Coder 30B is recommended, you can experiment with:
- DeepSeek Coder V2
- Codestral 22B
- StarCoder2 15B
Note: These may require additional configuration and testing.
Community & Support
- Discord: Join our community for real-time help
- Reddit: r/caret for discussions
- GitHub: Report issues
Next Steps
Ready to get started? Choose your path:
LM Studio Setup
User-friendly GUI approach with detailed configuration guide
Ollama Setup
Command-line setup for power users and automation
Summary
Local models with Caret are now genuinely practical. While they won't match top-tier cloud APIs in speed, they offer complete privacy, zero costs, and offline capability. With proper configuration and the right hardware, Qwen3 Coder 30B can handle most coding tasks effectively.
The key is proper setup: adequate RAM, correct configuration, and realistic expectations. Follow this guide, and you'll have a capable coding assistant running entirely on your hardware.