Model Formats
GGUF Format
GGUF (GPT-Generated Unified Format) is the standard quantized format for LM Studio. It offers:
- Optimized inference on consumer hardware
- Reduced memory footprint compared to full precision models
- Cross-platform compatibility
- Efficient loading and execution
Quantization Levels
| Level | Bits | Size Reduction vs FP16 (approx.) | Quality Loss | Use Case |
|---|---|---|---|---|
| Q2_K | 2-bit | ~85% | High | Resource-constrained environments |
| Q3_K_M | 3-bit | ~80% | Moderate-High | Mobile, limited RAM |
| Q3_K_S | 3-bit | ~80% | Moderate-High | Smaller variant |
| Q3_K_L | 3-bit | ~80% | Moderate-High | Larger variant |
| Q4_K_M | 4-bit | ~70% | Moderate | Best quality-to-size ratio |
| Q4_K_S | 4-bit | ~70% | Moderate | Smaller variant |
| Q4_K_L | 4-bit | ~70% | Moderate | Larger variant |
| Q5_K_M | 5-bit | ~65% | Low | High quality output |
| Q5_K_S | 5-bit | ~65% | Low | Smaller variant |
| Q5_K_L | 5-bit | ~65% | Low | Larger variant |
| Q6_K | 6-bit | ~60% | Very Low | Near-lossless quality |
| Q8_0 | 8-bit | ~50% | Minimal | Highest quality available |
Recommendation: Start with Q4_K_M for the best balance of quality and performance.
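As a rough rule of thumb, file size (and therefore RAM needed to load the model) scales with parameter count times bits per weight. A sketch of that arithmetic; the bits-per-weight figures are approximate effective values for llama.cpp K-quants (which mix precisions internally), not exact specifications:

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These are ballpark figures; real files add metadata and per-tensor overhead.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Rough model file size in GB for a given parameter count and quant level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

# A 7B model at Q4_K_M lands around 4-5 GB, matching typical downloads:
print(round(approx_size_gb(7, "Q4_K_M"), 1))
```

Add 1-2 GB on top of the file size for the KV cache and runtime overhead when budgeting RAM.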
Finding Models
Primary Sources
- Hugging Face (huggingface.co/models): Largest model repository, filter by GGUF format
- TheBloke: Excellent GGUF conversions, high-quality quantization
- Ollama Library: Pre-optimized models, easy to install
- Model Creator Pages: Original model repositories on Hugging Face
Search Filters on Hugging Face
- Format: Filter by “GGUF”
- Task: Text Generation, Chat, Instruction-following
- Language: English (or your preferred language)
- Size: Check model card for RAM requirements
Popular Model Families to Find
- Llama 2/3: Versatile, well-documented, excellent chat quality
- Mistral: Efficient, good reasoning, smaller context window
- Phi: Extremely small, fast inference, good for laptops
- Gemma: Google’s model, good performance-to-size ratio
- Qwen: Strong reasoning, multilingual support
Hardware Requirements
Apple Silicon Unified Memory
LM Studio leverages Apple Silicon’s unified memory architecture for efficient model loading:
- M1/M2/M3/M4: Can load larger models due to unified memory
- Unified Memory Advantage: Both CPU and GPU can access same memory pool
- Metal GPU Acceleration: Automatic GPU offloading on Apple Silicon
RAM vs Model Size Guide
| Available RAM | Recommended Model Size | Example Models |
|---|---|---|
| 8 GB | 3-7B parameters (Q4) | Phi-3, Mistral-7B (Q4_K_M) |
| 16 GB | 7-13B parameters (Q4) | Mistral-7B, Llama-7B (Q5), Neural-Chat |
| 24 GB | 13-34B parameters (Q4) | Mistral-7B (Q8), Llama-13B (Q6), Dolphin-2.6-Mixtral |
| 32+ GB | 34-70B parameters (Q4) | Llama-34B, Mixtral-8x7B (Q4), Yi-34B |
Context Length Impact
- Short context (2K-4K): Minimal RAM impact, fastest inference
- Medium context (4K-8K): Moderate overhead, recommended for most uses
- Long context (8K-32K): Significant RAM usage, enables longer conversations
- Extended context (32K+): Requires substantial RAM, use with smaller models
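The RAM impact above is dominated by the KV cache, which grows roughly linearly with context length. A back-of-the-envelope estimator; the model shape numbers below are illustrative, loosely based on a Llama-2-7B-style architecture:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB: one K and one V tensor per layer,
    stored at fp16 (2 bytes per element) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
for ctx in (2048, 8192, 32768):
    print(ctx, round(kv_cache_gb(32, 32, 128, ctx), 2))
```

Models using grouped-query attention (fewer KV heads) shrink this substantially, which is why some 7B models handle long contexts more gracefully than others.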
Metal GPU Acceleration
- Automatically enabled on Apple Silicon
- Offloads computation to GPU, reducing CPU load
- Adjust GPU layers in settings for optimal performance
- Benefits more pronounced with larger models
Chat Configuration
System Prompt
The system prompt defines the AI’s behavior and role:
```
You are a helpful, harmless, and honest assistant.
Provide clear, concise answers and ask clarifying questions when needed.
```
Sampling Parameters
| Parameter | Range | Purpose | Recommended |
|---|---|---|---|
| Temperature | 0.0 - 2.0 | Randomness of responses | 0.7 (balanced), 0.3 (focused), 1.0+ (creative) |
| Top P | 0.0 - 1.0 | Diversity of token selection | 0.95 (natural), 0.5 (focused) |
| Top K | 1 - 100+ | Number of top tokens to consider | 40-50 (good default) |
| Repeat Penalty | 1.0 - 2.0 | Penalizes repeated tokens | 1.1 (mild), 1.2 (strong) |
| Frequency Penalty | 0.0 - 2.0 | Reduces token frequency | 0.0 (disabled), 0.1-0.3 (mild) |
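As a mental model of how these parameters interact, here is a toy sketch (not LM Studio's actual implementation) of temperature, top-k, and top-p applied to a handful of candidate tokens:

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=40, top_p=0.95, rng=None):
    """Toy illustration of temperature + top-k + top-p (nucleus) sampling
    over a dict of {token: logit}. Real engines do this on tensors."""
    rng = rng or random.Random(0)
    # Temperature rescales logits: <1 sharpens the distribution, >1 flattens it.
    scaled = {t: l / temperature for t, l in logits.items()}
    # Top-k: keep only the k highest-logit tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the kept tokens (subtract max for numerical stability).
    m = max(l for _, l in kept)
    exps = [(t, math.exp(l - m)) for t, l in kept]
    total = sum(e for _, e in exps)
    probs = sorted(((t, e / total) for t, e in exps),
                   key=lambda kv: kv[1], reverse=True)
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    nucleus, mass = [], 0.0
    for t, p in probs:
        nucleus.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize the nucleus and draw one token.
    z = sum(p for _, p in nucleus)
    r, acc = rng.random() * z, 0.0
    for t, p in nucleus:
        acc += p
        if acc >= r:
            return t
    return nucleus[-1][0]

# Low temperature makes the top token dominate almost completely:
print(sample_token({"the": 5.0, "a": 3.0, "cat": 1.0}, temperature=0.3))
```

Note that top-k and top-p compound: a tight top-p can shrink the candidate pool well below top-k, which is why a low temperature plus default top-p behaves almost greedily.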
Context Length
- Default: 2048 tokens (usually fine for most chats)
- Extended: 4096-8192 (for longer conversations)
- Maximum: Check model card, don’t exceed available VRAM
Chat Templates
Most modern GGUF models include chat templates automatically. If needed:
- Go to Model Settings
- Select appropriate chat format: ChatML, Llama, Mistral, etc.
- Test conversation to ensure formatting is correct
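For reference, ChatML (one of the formats listed above) wraps each message in `<|im_start|>`/`<|im_end|>` markers. A minimal sketch of what the template produces, which LM Studio normally applies for you when the model's template is set correctly:

```python
def to_chatml(messages):
    """Render a message list in ChatML: <|im_start|>role\ncontent<|im_end|>."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    # A trailing assistant header cues the model to generate its reply.
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

If replies come back with stray `<|im_end|>` tokens or the model talks to itself, the selected template likely does not match the model's training format.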
Local API Server
Starting the Server
- Open Server tab in LM Studio
- Select a loaded model
- Click Start Server
- Default endpoint: http://localhost:1234
- OpenAI-compatible API available at http://localhost:1234/v1
Default Port Configuration
- Port: 1234 (can be changed in settings)
- CORS: Enabled for local requests
- Authentication: Not required for local use
cURL Example
```shell
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
```
Python Example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

response = client.chat.completions.create(
    model="model-name",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)

print(response.choices[0].message.content)
```
Node.js Example
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

const completion = await client.chat.completions.create({
  model: "model-name",
  messages: [{ role: "user", content: "Hello!" }],
  temperature: 0.7,
});

console.log(completion.choices[0].message.content);
```
Model Families
Llama Series (Meta)
- Best for: General-purpose tasks, instruction-following, coding
- Characteristics: Well-documented, large community, diverse sizes (7B-70B)
- Recommended: Llama 2-7B (Q4_K_M) or Llama 3-8B for balanced performance
- Strengths: Strong reasoning, good instruction adherence, extensive fine-tunes
Mistral (Mistral AI)
- Best for: Efficient inference, creative writing, roleplay
- Characteristics: Small (7B), fast, good quality
- Recommended: Mistral-7B-Instruct (Q4_K_M) for most users
- Strengths: Excellent speed-to-quality ratio, multi-language support
Phi (Microsoft)
- Best for: Lightweight applications, resource-constrained environments
- Characteristics: Extremely small (3B-14B), surprisingly capable
- Recommended: Phi-3 (Q4_K_M) for laptops and lower-end machines
- Strengths: Fast, efficient, surprisingly capable for size
Gemma (Google)
- Best for: General-purpose, well-balanced performance
- Characteristics: Available in 2B and 7B sizes with strong base model quality
- Recommended: Gemma-7B-Instruct (Q4_K_M)
- Strengths: Clean training data, good instruction following, safe by default
Qwen (Alibaba)
- Best for: Multilingual support, strong reasoning, coding
- Characteristics: Excellent performance on benchmarks, strong Chinese support
- Recommended: Qwen-7B-Chat (Q4_K_M) for multilingual needs
- Strengths: Advanced reasoning, good coding ability, better non-English support
Performance Optimization
Batch Size Tuning
- Batch Size 1: Lower memory, slower for multiple requests
- Batch Size 4-8: Better throughput for API server
- Higher Batch Sizes: Increase VRAM usage, adjust based on available memory
- Recommendation: Start with batch size 4, adjust up if VRAM available
GPU Offloading (Metal/CUDA)
- Metal (Apple Silicon): Automatically enabled, manual layer adjustment available
- GPU Layers: Increase to offload more computation to GPU (if VRAM allows)
- Monitor: Watch VRAM usage, reduce if system becomes unstable
- Impact: Significant speed improvement, especially for larger models
Context Length Tradeoffs
- Longer context = More VRAM usage: The KV cache grows roughly linearly with context length, and attention compute grows even faster
- Sweet spot: 4K tokens for most conversations, 8K for extended discussions
- Short context: Use 2K for fastest inference on limited hardware
- Benchmark: Test with your target context size before production use
Prompt Caching
- Concept: Reuse computation from repeated prompts
- Benefit: Faster responses for system prompts, repeated context
- Enable: Check model/server settings for prompt cache options
- Use Case: Ideal for chat interfaces with consistent system prompts
Multi-Model Workflows
Switching Models Per Task
- Reasoning Tasks: Use Llama or Qwen (stronger logic)
- Creative Writing: Use Mistral or specialized fiction models
- Code Generation: Use Llama, Phi, or Code Llama
- Speed-critical: Use Phi or Mistral-7B
- Quality-focused: Use larger models (13B+, Q5+)
Comparison Testing Workflow
- Load Model A, have conversation, take notes on quality
- Load Model B in new chat session
- Ask same questions to both models
- Compare outputs for tone, accuracy, speed
- Document preferred model for future use
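The workflow above can be partly scripted against the local server. A sketch that builds identical request bodies for two hypothetical model names (`model-a` and `model-b` are placeholders); send each body with your HTTP client of choice and compare the logged answers side by side:

```python
import json

# Hypothetical test questions; use prompts representative of your real workload.
QUESTIONS = [
    "Explain recursion in one paragraph.",
    "Write a haiku about the sea.",
]

def build_request(model: str, question: str) -> str:
    """JSON body for POST http://localhost:1234/v1/chat/completions.
    Identical settings per model keep the comparison fair."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
        "seed": 42,  # fix the seed where supported so reruns are comparable
    })

# One request body per (model, question) pair.
requests_by_model = {
    model: [build_request(model, q) for q in QUESTIONS]
    for model in ("model-a", "model-b")
}
print(len(requests_by_model["model-a"]))  # 2 request bodies per model
```

Keeping temperature and (where supported) the seed fixed removes sampling noise from the comparison, so differences you see are more likely real model differences.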
Running Multiple Models Simultaneously
- Limitation: Each model loaded consumes VRAM
- Workaround: Use Server API with one model, switch via API calls
- Practical: Load one large model OR multiple small models (e.g., two Phi-3 models)
- Monitor: Watch VRAM usage to avoid system crashes
Integration with Other Tools
Continue.dev (IDE Autocomplete)
- Install Continue extension in VS Code
- Configure LM Studio server endpoint: http://localhost:1234
- Set model in Continue settings
- Trigger autocomplete with Continue’s keyboard shortcut
Cursor (AI Code Editor)
- Go to Cursor Settings > Features > Models
- Add custom model with base URL: http://localhost:1234
- Select LM Studio model
- Use Cursor’s AI features with local model
Open WebUI (Chat Interface)
- Install Open WebUI (Docker recommended)
- Add connection: http://localhost:1234/v1
- Select LM Studio model
- Full web-based chat interface
AnythingLLM (Knowledge Base RAG)
- Configure custom OpenAI provider
- Base URL: http://localhost:1234/v1
- Model: Select LM Studio model
- Add documents for RAG retrieval
Aider (AI Pair Programmer)
- Install aider: `pip install aider-chat`
- Configure: `aider --model openai/local --openai-api-base http://localhost:1234/v1`
- Start conversation with codebase context
Tips & Tricks
Managing Disk Space
- Check model size: ~4-8GB for 7B models (Q4), ~13-16GB for 13B models
- Move models folder: Edit settings to use external SSD
- Symbolic links: Link to external storage with `ln -s /Volumes/ExternalDrive/models ~/.lmstudio/models`
- Delete unused: Remove quantization variants you don’t use
Pre-loading Models
- Startup: Set default model in settings to auto-load on launch
- Warm-up: First request after load may be slow, expect 1-3 second delay
- Server mode: Load model in Server tab, leave running for API requests
- Switching: Use ⌘L to quickly load different models between chats
Conversation Export
- In Chat tab, click export icon (usually arrow/save symbol)
- Choose format: Markdown, JSON, or plain text
- Save to desired location
- Markdown format preserves formatting and is most readable
Prompt Templates
- Create `.txt` or `.md` files with favorite prompts
- Paste into system prompt or chat when needed
- Customize with project-specific instructions
- Build library: coding, writing, analysis, brainstorming templates
Performance Monitoring
- Activity Monitor: Watch CPU/Memory in Activity Monitor during inference
- Temperature: Monitor Mac temperature, reduce GPU layers if overheating
- First Run: Expect slower speed as model loads from disk
- Subsequent Runs: Much faster as model stays in memory