Model Formats
GGUF Format
GGUF (GPT-Generated Unified Format) is the standard quantized format for LM Studio. It offers:
- Optimized inference on consumer hardware
- Reduced memory footprint compared to full precision models
- Cross-platform compatibility
- Efficient loading and execution
Quantization Levels
| Level | Bits | Size Reduction vs FP16 (approx.) | Quality Loss | Use Case |
|---|---|---|---|---|
| Q2_K | 2-bit | ~85% | High | Resource-constrained environments |
| Q3_K_M | 3-bit | ~80% | Moderate-High | Mobile, limited RAM |
| Q3_K_S | 3-bit | ~80% | Moderate-High | Smaller variant |
| Q3_K_L | 3-bit | ~80% | Moderate-High | Larger variant |
| Q4_K_M | 4-bit | ~70% | Moderate | Best quality-to-size ratio |
| Q4_K_S | 4-bit | ~70% | Moderate | Smaller variant |
| Q4_K_L | 4-bit | ~70% | Moderate | Larger variant |
| Q5_K_M | 5-bit | ~65% | Low | High quality output |
| Q5_K_S | 5-bit | ~65% | Low | Smaller variant |
| Q5_K_L | 5-bit | ~65% | Low | Larger variant |
| Q6_K | 6-bit | ~60% | Very Low | Near-lossless quality |
| Q8_0 | 8-bit | ~50% | Minimal | Highest quality available |
Recommendation: Start with Q4_K_M for the best balance of quality and performance.
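As a rough rule of thumb, file size (and therefore RAM needed to load the model) scales with parameter count times bits per weight. A sketch of that arithmetic; the bits-per-weight figures are approximate effective values for llama.cpp K-quants (which mix precisions internally), not exact specifications:

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These are ballpark figures; real files add metadata and per-tensor overhead.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

def approx_size_gb(params_billions: float, quant: str) -> float:
    """Rough model file size in GB for a given parameter count and quant level."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

# A 7B model at Q4_K_M lands around 4-5 GB, matching typical downloads:
print(round(approx_size_gb(7, "Q4_K_M"), 1))
```

Add 1-2 GB on top of the file size for the KV cache and runtime overhead when budgeting RAM.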
Finding Models
Primary Sources
- Hugging Face (huggingface.co/models): Largest model repository, filter by GGUF format
- TheBloke: Excellent GGUF conversions, high-quality quantization
- Ollama Library: Pre-optimized models, easy to install
- Model Creator Pages: Original model repositories on Hugging Face
Search Filters on Hugging Face
- Format: Filter by “GGUF”
- Task: Text Generation, Chat, Instruction-following
- Language: English (or your preferred language)
- Size: Check model card for RAM requirements
Popular Model Families to Find
- Llama 2/3: Versatile, well-documented, excellent chat quality
- Mistral: Efficient, good reasoning, smaller context window
- Phi: Extremely small, fast inference, good for laptops
- Gemma: Google’s model, good performance-to-size ratio
- Qwen: Strong reasoning, multilingual support
Hardware Requirements
Apple Silicon Unified Memory
LM Studio leverages Apple Silicon’s unified memory architecture for efficient model loading:
- M1/M2/M3/M4: Can load larger models due to unified memory
- Unified Memory Advantage: Both CPU and GPU can access same memory pool
- Metal GPU Acceleration: Automatic GPU offloading on Apple Silicon
RAM vs Model Size Guide
| Available RAM | Recommended Model Size | Example Models |
|---|---|---|
| 8 GB | 3-7B parameters (Q4) | Phi-3, Mistral-7B (Q4_K_M) |
| 16 GB | 7-13B parameters (Q4) | Mistral-7B, Llama-7B (Q5), Neural-Chat |
| 24 GB | 13-34B parameters (Q4) | Mistral-7B (Q8), Llama-13B (Q6), Dolphin-2.6-Mixtral |
| 32+ GB | 34-70B parameters (Q4) | Llama-34B, Mixtral-8x7B (Q4), Yi-34B |
Context Length Impact
- Short context (2K-4K): Minimal RAM impact, fastest inference
- Medium context (4K-8K): Moderate overhead, recommended for most uses
- Long context (8K-32K): Significant RAM usage, enables longer conversations
- Extended context (32K+): Requires substantial RAM, use with smaller models
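The RAM impact above is dominated by the KV cache, which grows roughly linearly with context length. A back-of-the-envelope estimator; the model shape numbers below are illustrative, loosely based on a Llama-2-7B-style architecture:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB: one K and one V tensor per layer,
    stored at fp16 (2 bytes per element) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
for ctx in (2048, 8192, 32768):
    print(ctx, round(kv_cache_gb(32, 32, 128, ctx), 2))
```

Models using grouped-query attention (fewer KV heads) shrink this substantially, which is why some 7B models handle long contexts more gracefully than others.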
Metal GPU Acceleration
- Automatically enabled on Apple Silicon
- Offloads computation to GPU, reducing CPU load
- Adjust GPU layers in settings for optimal performance
- Benefits more pronounced with larger models
Chat Configuration
System Prompt
The system prompt defines the AI’s behavior and role:
```
You are a helpful, harmless, and honest assistant.
Provide clear, concise answers and ask clarifying questions when needed.
```
Sampling Parameters
| Parameter | Range | Purpose | Recommended |
|---|---|---|---|
| Temperature | 0.0 - 2.0 | Randomness of responses | 0.7 (balanced), 0.3 (focused), 1.0+ (creative) |
| Top P | 0.0 - 1.0 | Diversity of token selection | 0.95 (natural), 0.5 (focused) |
| Top K | 1 - 100+ | Number of top tokens to consider | 40-50 (good default) |
| Repeat Penalty | 1.0 - 2.0 | Penalizes repeated tokens | 1.1 (mild), 1.2 (strong) |
| Frequency Penalty | 0.0 - 2.0 | Reduces token frequency | 0.0 (disabled), 0.1-0.3 (mild) |
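As a mental model of how these parameters interact, here is a toy sketch (not LM Studio's actual implementation) of temperature, top-k, and top-p applied to a handful of candidate tokens:

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=40, top_p=0.95, rng=None):
    """Toy illustration of temperature + top-k + top-p (nucleus) sampling
    over a dict of {token: logit}. Real engines do this on tensors."""
    rng = rng or random.Random(0)
    # Temperature rescales logits: <1 sharpens the distribution, >1 flattens it.
    scaled = {t: l / temperature for t, l in logits.items()}
    # Top-k: keep only the k highest-logit tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the kept tokens (subtract max for numerical stability).
    m = max(l for _, l in kept)
    exps = [(t, math.exp(l - m)) for t, l in kept]
    total = sum(e for _, e in exps)
    probs = sorted(((t, e / total) for t, e in exps),
                   key=lambda kv: kv[1], reverse=True)
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    nucleus, mass = [], 0.0
    for t, p in probs:
        nucleus.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize the nucleus and draw one token.
    z = sum(p for _, p in nucleus)
    r, acc = rng.random() * z, 0.0
    for t, p in nucleus:
        acc += p
        if acc >= r:
            return t
    return nucleus[-1][0]

# Low temperature makes the top token dominate almost completely:
print(sample_token({"the": 5.0, "a": 3.0, "cat": 1.0}, temperature=0.3))
```

Note that top-k and top-p compound: a tight top-p can shrink the candidate pool well below top-k, which is why a low temperature plus default top-p behaves almost greedily.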
Context Length
- Default: 2048 tokens (usually fine for most chats)
- Extended: 4096-8192 (for longer conversations)
- Maximum: Check model card, don’t exceed available VRAM
Chat Templates
Most modern GGUF models include chat templates automatically. If needed:
- Go to Model Settings
- Select appropriate chat format: ChatML, Llama, Mistral, etc.
- Test conversation to ensure formatting is correct
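For reference, ChatML (one of the formats listed above) wraps each message in `<|im_start|>`/`<|im_end|>` markers. A minimal sketch of what the template produces, which LM Studio normally applies for you when the model's template is set correctly:

```python
def to_chatml(messages):
    """Render a message list in ChatML: <|im_start|>role\ncontent<|im_end|>."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    # A trailing assistant header cues the model to generate its reply.
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

If replies come back with stray `<|im_end|>` tokens or the model talks to itself, the selected template likely does not match the model's training format.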
Local API Server
Starting the Server
- Open Server tab in LM Studio
- Select a loaded model
- Click Start Server
- Default endpoint: http://localhost:1234
- OpenAI-compatible API available at http://localhost:1234/v1
Default Port Configuration
- Port: 1234 (can be changed in settings)
- CORS: Enabled for local requests
- Authentication: Not required for local use
cURL Example
```shell
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
```
Python Example
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

response = client.chat.completions.create(
    model="model-name",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
)

print(response.choices[0].message.content)
```
Node.js Example
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

const completion = await client.chat.completions.create({
  model: "model-name",
  messages: [{ role: "user", content: "Hello!" }],
  temperature: 0.7,
});

console.log(completion.choices[0].message.content);
```
Model Families
Llama Series (Meta)
- Best for: General-purpose tasks, instruction-following, coding
- Characteristics: Well-documented, large community, diverse sizes (7B-70B)
- Recommended: Llama 2-7B (Q4_K_M) or Llama 3-8B for balanced performance
- Strengths: Strong reasoning, good instruction adherence, extensive fine-tunes
Mistral (Mistral AI)
- Best for: Efficient inference, creative writing, roleplay
- Characteristics: Small (7B), fast, good quality
- Recommended: Mistral-7B-Instruct (Q4_K_M) for most users
- Strengths: Excellent speed-to-quality ratio, multi-language support
Phi (Microsoft)
- Best for: Lightweight applications, resource-constrained environments
- Characteristics: Extremely small (3B-14B), surprisingly capable
- Recommended: Phi-3 (Q4_K_M) for laptops and lower-end machines
- Strengths: Fast, efficient, surprisingly capable for size
Gemma (Google)
- Best for: General-purpose, well-balanced performance
- Characteristics: Available in 2B and 7B sizes with strong base model quality
- Recommended: Gemma-7B-Instruct (Q4_K_M)
- Strengths: Clean training data, good instruction following, safe by default
Qwen (Alibaba)
- Best for: Multilingual support, strong reasoning, coding
- Characteristics: Excellent performance on benchmarks, strong Chinese support
- Recommended: Qwen-7B-Chat (Q4_K_M) for multilingual needs
- Strengths: Advanced reasoning, good coding ability, better non-English support
Performance Optimization
Batch Size Tuning
- Batch Size 1: Lower memory, slower for multiple requests
- Batch Size 4-8: Better throughput for API server
- Higher Batch Sizes: Increase VRAM usage, adjust based on available memory
- Recommendation: Start with batch size 4, adjust up if VRAM available
GPU Offloading (Metal/CUDA)
- Metal (Apple Silicon): Automatically enabled, manual layer adjustment available
- GPU Layers: Increase to offload more computation to GPU (if VRAM allows)
- Monitor: Watch VRAM usage, reduce if system becomes unstable
- Impact: Significant speed improvement, especially for larger models
Context Length Tradeoffs
- Longer context = More VRAM usage: The KV cache grows roughly linearly with context length, and attention compute grows even faster
- Sweet spot: 4K tokens for most conversations, 8K for extended discussions
- Short context: Use 2K for fastest inference on limited hardware
- Benchmark: Test with your target context size before production use
Prompt Caching
- Concept: Reuse computation from repeated prompts
- Benefit: Faster responses for system prompts, repeated context
- Enable: Check model/server settings for prompt cache options
- Use Case: Ideal for chat interfaces with consistent system prompts
Multi-Model Workflows
Switching Models Per Task
- Reasoning Tasks: Use Llama or Qwen (stronger logic)
- Creative Writing: Use Mistral or specialized fiction models
- Code Generation: Use Llama, Phi, or Code Llama
- Speed-critical: Use Phi or Mistral-7B
- Quality-focused: Use larger models (13B+, Q5+)
Comparison Testing Workflow
- Load Model A, have conversation, take notes on quality
- Load Model B in new chat session
- Ask same questions to both models
- Compare outputs for tone, accuracy, speed
- Document preferred model for future use
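The workflow above can be partly scripted against the local server. A sketch that builds identical request bodies for two hypothetical model names (`model-a` and `model-b` are placeholders); send each body with your HTTP client of choice and compare the logged answers side by side:

```python
import json

# Hypothetical test questions; use prompts representative of your real workload.
QUESTIONS = [
    "Explain recursion in one paragraph.",
    "Write a haiku about the sea.",
]

def build_request(model: str, question: str) -> str:
    """JSON body for POST http://localhost:1234/v1/chat/completions.
    Identical settings per model keep the comparison fair."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
        "seed": 42,  # fix the seed where supported so reruns are comparable
    })

# One request body per (model, question) pair.
requests_by_model = {
    model: [build_request(model, q) for q in QUESTIONS]
    for model in ("model-a", "model-b")
}
print(len(requests_by_model["model-a"]))  # 2 request bodies per model
```

Keeping temperature and (where supported) the seed fixed removes sampling noise from the comparison, so differences you see are more likely real model differences.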
Running Multiple Models Simultaneously
- Limitation: Each model loaded consumes VRAM
- Workaround: Use Server API with one model, switch via API calls
- Practical: Load one large model OR multiple small models (e.g., two Phi-3 models)
- Monitor: Watch VRAM usage to avoid system crashes
Integration with Other Tools
Continue.dev (IDE Autocomplete)
- Install Continue extension in VS Code
- Configure LM Studio server endpoint: http://localhost:1234
- Set model in Continue settings
- Trigger autocomplete with Continue’s keyboard shortcut
Cursor (AI Code Editor)
- Go to Cursor Settings > Features > Models
- Add custom model with base URL: http://localhost:1234
- Select LM Studio model
- Use Cursor’s AI features with local model
Open WebUI (Chat Interface)
- Install Open WebUI (Docker recommended)
- Add connection: http://localhost:1234/v1
- Select LM Studio model
- Full web-based chat interface
AnythingLLM (Knowledge Base RAG)
- Configure custom OpenAI provider
- Base URL: http://localhost:1234/v1
- Model: Select LM Studio model
- Add documents for RAG retrieval
Aider (AI Pair Programmer)
- Install aider: `pip install aider-chat`
- Configure: `aider --model openai/local --openai-api-base http://localhost:1234/v1`
- Start conversation with codebase context
Tips & Tricks
Managing Disk Space
- Check model size: ~4-8GB for 7B models (Q4), ~13-16GB for 13B models
- Move models folder: Edit settings to use external SSD
- Symbolic links: Link to external storage with `ln -s /Volumes/ExternalDrive/models ~/.lmstudio/models`
- Delete unused: Remove quantization variants you don’t use
Pre-loading Models
- Startup: Set default model in settings to auto-load on launch
- Warm-up: First request after load may be slow, expect 1-3 second delay
- Server mode: Load model in Server tab, leave running for API requests
- Switching: Use ⌘L to quickly load different models between chats
Conversation Export
- In Chat tab, click export icon (usually arrow/save symbol)
- Choose format: Markdown, JSON, or plain text
- Save to desired location
- Markdown format preserves formatting and is most readable
Prompt Templates
- Create `.txt` or `.md` files with favorite prompts
- Paste into system prompt or chat when needed
- Customize with project-specific instructions
- Build library: coding, writing, analysis, brainstorming templates
Performance Monitoring
- Activity Monitor: Watch CPU/Memory in Activity Monitor during inference
- Temperature: Monitor Mac temperature, reduce GPU layers if overheating
- First Run: Expect slower speed as model loads from disk
- Subsequent Runs: Much faster as model stays in memory