LM Studio Cheatsheet
A quick reference for LM Studio, a desktop app for running LLMs locally.

Model Formats

GGUF Format

GGUF (GPT-Generated Unified Format) is the standard quantized format for LM Studio. It offers:

  • Optimized inference on consumer hardware
  • Reduced memory footprint compared to full precision models
  • Cross-platform compatibility
  • Efficient loading and execution

Quantization Levels

Level Bits Size Reduction Quality Loss Use Case
Q2_K 2-bit 90% High Resource-constrained environments
Q3_K_M 3-bit 85% Moderate-High Mobile, limited RAM
Q3_K_S 3-bit 85% Moderate-High Smaller variant
Q3_K_L 3-bit 85% Moderate-High Larger variant
Q4_K_M 4-bit 75% Moderate Best quality-to-size ratio
Q4_K_S 4-bit 75% Moderate Smaller variant
Q4_K_L 4-bit 75% Moderate Larger variant
Q5_K_M 5-bit 60% Low High quality output
Q5_K_S 5-bit 60% Low Smaller variant
Q5_K_L 5-bit 60% Low Larger variant
Q6_K 6-bit 50% Very Low Near-lossless quality
Q8_0 8-bit 25% Minimal Highest quality available

Recommendation: Start with Q4_K_M for the best balance of quality and performance.
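As a rough rule of thumb, a quantized GGUF file weighs in at about parameter count × effective bits per weight ÷ 8 bytes; K-quants land a bit above their nominal bit width because some tensors stay at higher precision. A minimal sketch (the effective bits-per-weight values here are approximations for illustration, not published figures):

```python
# Rough GGUF file-size estimate. The effective bits-per-weight numbers
# are approximations: real files vary by model architecture and quant.
EFFECTIVE_BPW = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

def quantized_size_gb(params_billion: float, quant: str) -> float:
    """Estimated GGUF file size in GB for a given quantization level."""
    bits = EFFECTIVE_BPW[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

# A 7B model at Q4_K_M comes out around 4 GB, matching typical downloads.
print(f"{quantized_size_gb(7, 'Q4_K_M'):.1f} GB")
```

This is only the file size; runtime memory also needs room for the KV cache and activation buffers.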

Finding Models

Primary Sources

  • Hugging Face (huggingface.co/models): Largest model repository, filter by GGUF format
  • TheBloke: Excellent GGUF conversions, high-quality quantization
  • Ollama Library: Pre-optimized models, easy to install
  • Model Creator Pages: Original model repositories on Hugging Face

Search Filters on Hugging Face

  • Format: Filter by “GGUF”
  • Task: Text Generation, Chat, Instruction-following
  • Language: English (or your preferred language)
  • Size: Check model card for RAM requirements

Popular Model Picks

  • Llama 2/3: Versatile, well-documented, excellent chat quality
  • Mistral: Efficient, good reasoning, smaller context window
  • Phi: Extremely small, fast inference, good for laptops
  • Gemma: Google’s model, good performance-to-size ratio
  • Qwen: Strong reasoning, multilingual support

Hardware Requirements

Apple Silicon Unified Memory

LM Studio leverages Apple Silicon’s unified memory architecture for efficient model loading:

  • M1/M2/M3/M4: Can load larger models due to unified memory
  • Unified Memory Advantage: Both CPU and GPU can access same memory pool
  • Metal GPU Acceleration: Automatic GPU offloading on Apple Silicon

RAM vs Model Size Guide

Available RAM Recommended Model Size Example Models
8 GB 3-7B parameters (Q4) Phi-3, Mistral-7B (Q4_K_M)
16 GB 7-13B parameters (Q4) Mistral-7B, Llama-7B (Q5), Neural-Chat
24 GB 13-34B parameters (Q4) Mistral-7B (Q8), Llama-13B (Q6), Dolphin-2.6-Mixtral
32+ GB 34-70B parameters (Q4) Llama-34B, Mixtral-8x7B (Q4), Yi-34B
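A quick way to sanity-check the table above: a model roughly fits when its quantized file size, inflated a little for runtime buffers, plus headroom for the OS and other apps stays under available RAM. A hypothetical helper (the 1.2× load factor and 4 GB headroom are assumptions for illustration, not LM Studio figures):

```python
def fits_in_ram(model_size_gb: float, ram_gb: float,
                headroom_gb: float = 4.0, load_factor: float = 1.2) -> bool:
    """Rough check: quantized file size, inflated for runtime buffers,
    plus OS/app headroom must fit within available RAM."""
    return model_size_gb * load_factor + headroom_gb <= ram_gb

# A ~4 GB Q4 7B model on a 16 GB machine: comfortable.
print(fits_in_ram(4.0, 16.0))
# A ~20 GB Q4 34B model on the same machine: not a fit.
print(fits_in_ram(20.0, 16.0))
```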

Context Length Impact

  • Short context (2K-4K): Minimal RAM impact, fastest inference
  • Medium context (4K-8K): Moderate overhead, recommended for most uses
  • Long context (8K-32K): Significant RAM usage, enables longer conversations
  • Extended context (32K+): Requires substantial RAM, use with smaller models
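Most of the context-length cost above is the KV cache: every token in the context stores one key and one value vector per layer. A sketch of the standard estimate, using Llama-2-7B-like dimensions as an assumed example (32 layers, 32 KV heads, head size 128, fp16 cache); models with grouped-query attention use fewer KV heads and shrink this considerably:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) * layers * KV heads * head dim
    * context length * element size (2 bytes for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Llama-2-7B-like model at 4K context: 2 GiB of cache on top of weights.
gib = kv_cache_bytes(32, 32, 128, 4096) / 2**30
print(f"{gib:.1f} GiB")
```

Note the growth is linear in context length: doubling the context doubles the cache.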

Metal GPU Acceleration

  • Automatically enabled on Apple Silicon
  • Offloads computation to GPU, reducing CPU load
  • Adjust GPU layers in settings for optimal performance
  • Benefits more pronounced with larger models

Chat Configuration

System Prompt

The system prompt defines the AI’s behavior and role:

You are a helpful, harmless, and honest assistant. 
Provide clear, concise answers and ask clarifying questions when needed.

Sampling Parameters

Parameter Range Purpose Recommended
Temperature 0.0 - 2.0 Randomness of responses 0.7 (balanced), 0.3 (focused), 1.0+ (creative)
Top P 0.0 - 1.0 Diversity of token selection 0.95 (natural), 0.5 (focused)
Top K 1 - 100+ Number of top tokens to consider 40-50 (good default)
Repeat Penalty 1.0 - 2.0 Penalizes repeated tokens 1.1 (mild), 1.2 (strong)
Frequency Penalty 0.0 - 2.0 Reduces token frequency 0.0 (disabled), 0.1-0.3 (mild)
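The knobs above map directly onto fields of an OpenAI-style /v1/chat/completions request body. A sketch that just builds the JSON payload, with defaults mirroring the Recommended column (top_k and repeat_penalty are llama.cpp-style extensions that OpenAI-compatible local servers generally accept; send the payload with any HTTP client):

```python
import json

def chat_payload(model: str, user_msg: str, *, temperature: float = 0.7,
                 top_p: float = 0.95, top_k: int = 40,
                 repeat_penalty: float = 1.1) -> str:
    """Build an OpenAI-compatible chat request body carrying the
    sampling parameters from the table above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,            # llama.cpp-style extension field
        "repeat_penalty": repeat_penalty,  # likewise an extension
    }
    return json.dumps(body)

payload = chat_payload("model-name", "Hello!")
```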

Context Length

  • Default: 2048 tokens (usually fine for most chats)
  • Extended: 4096-8192 (for longer conversations)
  • Maximum: Check model card, don’t exceed available VRAM

Chat Templates

Most modern GGUF models include chat templates automatically. If needed:

  1. Go to Model Settings
  2. Select appropriate chat format: ChatML, Llama, Mistral, etc.
  3. Test conversation to ensure formatting is correct

Local API Server

Starting the Server

  1. Open Server tab in LM Studio
  2. Select a loaded model
  3. Click Start Server
  4. Default endpoint: http://localhost:1234
  5. OpenAI-compatible API available at http://localhost:1234/v1

Default Port Configuration

  • Port: 1234 (can be changed in settings)
  • CORS: Enabled for local requests
  • Authentication: Optional (can be disabled)

cURL Example

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

Python Example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio"
)

response = client.chat.completions.create(
    model="model-name",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)

print(response.choices[0].message.content)

Node.js Example

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

const completion = await client.chat.completions.create({
  model: "model-name",
  messages: [{ role: "user", content: "Hello!" }],
  temperature: 0.7,
});

console.log(completion.choices[0].message.content);

Model Families

Llama Series (Meta)

  • Best for: General-purpose tasks, instruction-following, coding
  • Characteristics: Well-documented, large community, diverse sizes (7B-70B)
  • Recommended: Llama 2-7B (Q4_K_M) or Llama 3-8B for balanced performance
  • Strengths: Strong reasoning, good instruction adherence, extensive fine-tunes

Mistral (Mistral AI)

  • Best for: Efficient inference, creative writing, roleplay
  • Characteristics: Small (7B), fast, good quality
  • Recommended: Mistral-7B-Instruct (Q4_K_M) for most users
  • Strengths: Excellent speed-to-quality ratio, multi-language support

Phi (Microsoft)

  • Best for: Lightweight applications, resource-constrained environments
  • Characteristics: Extremely small (3B-14B), surprisingly capable
  • Recommended: Phi-3 (Q4_K_M) for laptops and lower-end machines
  • Strengths: Fast, efficient, surprisingly capable for size

Gemma (Google)

  • Best for: General-purpose, well-balanced performance
  • Characteristics: Available in 2B and 7B sizes with strong base-model quality
  • Recommended: Gemma-7B-Instruct (Q4_K_M)
  • Strengths: Clean training data, good instruction following, safe by default

Qwen (Alibaba)

  • Best for: Multilingual support, strong reasoning, coding
  • Characteristics: Excellent performance on benchmarks, strong Chinese support
  • Recommended: Qwen-7B-Chat (Q4_K_M) for multilingual needs
  • Strengths: Advanced reasoning, good coding ability, better non-English support

Performance Optimization

Batch Size Tuning

  • Batch Size 1: Lower memory, slower for multiple requests
  • Batch Size 4-8: Better throughput for API server
  • Higher Batch Sizes: Increase VRAM usage, adjust based on available memory
  • Recommendation: Start with batch size 4, adjust up if VRAM available

GPU Offloading (Metal/CUDA)

  • Metal (Apple Silicon): Automatically enabled, manual layer adjustment available
  • GPU Layers: Increase to offload more computation to GPU (if VRAM allows)
  • Monitor: Watch VRAM usage, reduce if system becomes unstable
  • Impact: Significant speed improvement, especially for larger models

Context Length Tradeoffs

  • Longer context = more VRAM usage: KV-cache memory grows roughly linearly with context length, and attention compute grows quadratically
  • Sweet spot: 4K tokens for most conversations, 8K for extended discussions
  • Short context: Use 2K for fastest inference on limited hardware
  • Benchmark: Test with your target context size before production use

Prompt Caching

  • Concept: Reuse computation from repeated prompts
  • Benefit: Faster responses for system prompts, repeated context
  • Enable: Check model/server settings for prompt cache options
  • Use Case: Ideal for chat interfaces with consistent system prompts

Multi-Model Workflows

Switching Models Per Task

  1. Reasoning Tasks: Use Llama or Qwen (stronger logic)
  2. Creative Writing: Use Mistral or specialized fiction models
  3. Code Generation: Use Llama, Phi, or Code Llama
  4. Speed-critical: Use Phi or Mistral-7B
  5. Quality-focused: Use larger models (13B+, Q5+)

Comparison Testing Workflow

  1. Load Model A, have conversation, take notes on quality
  2. Load Model B in new chat session
  3. Ask same questions to both models
  4. Compare outputs for tone, accuracy, speed
  5. Document preferred model for future use
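The workflow above is easy to script against the local server: put the same questions to each model and collect the answers side by side. A hedged sketch; the `ask` callable is injected so the harness stays self-contained (wire it to the OpenAI client from the API section to run it for real):

```python
def compare_models(models, questions, ask):
    """Run the same questions past each model.

    `ask(model, question)` should return the model's answer, e.g. a thin
    wrapper around client.chat.completions.create() pointed at
    http://localhost:1234/v1. Returns {question: {model: answer}}
    for side-by-side review."""
    results = {}
    for q in questions:
        results[q] = {m: ask(m, q) for m in models}
    return results

# Stubbed usage; replace the lambda with a real API call.
table = compare_models(
    ["phi-3", "mistral-7b"],
    ["What is a GGUF file?"],
    lambda model, q: f"<{model} answer to: {q}>",
)
```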

Running Multiple Models Simultaneously

  • Limitation: Each model loaded consumes VRAM
  • Workaround: Use Server API with one model, switch via API calls
  • Practical: Load one large model OR multiple small models (e.g., two Phi-3 models)
  • Monitor: Watch VRAM usage to avoid system crashes

Integration with Other Tools

Continue.dev (IDE Autocomplete)

  1. Install Continue extension in VS Code
  2. Configure LM Studio server endpoint: http://localhost:1234
  3. Set model in Continue settings
  4. Trigger autocomplete as you type (check the Continue extension's keybindings for the manual-trigger shortcut)

Cursor (AI Code Editor)

  1. Go to Cursor Settings > Features > Models
  2. Add custom model: http://localhost:1234
  3. Select LM Studio model
  4. Use Cursor’s AI features with local model

Open WebUI (Chat Interface)

  1. Install Open WebUI (Docker recommended)
  2. Add connection: http://localhost:1234/v1
  3. Select LM Studio model
  4. Full web-based chat interface

AnythingLLM (Knowledge Base RAG)

  1. Configure custom OpenAI provider
  2. Base URL: http://localhost:1234/v1
  3. Model: Select LM Studio model
  4. Add documents for RAG retrieval

Aider (AI Pair Programmer)

  1. Install aider: pip install aider-chat
  2. Configure: aider --model openai/local --openai-api-base http://localhost:1234/v1
  3. Start conversation with codebase context

Tips & Tricks

Managing Disk Space

  • Check model size: ~4-8GB for 7B models (Q4), ~13-16GB for 13B models
  • Move models folder: Edit settings to use external SSD
  • Symbolic links: Link to external storage: ln -s /Volumes/ExternalDrive/models ~/.lmstudio/models
  • Delete unused: Remove quantization variants you don’t use
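To see which downloads are eating disk, you can total each model folder with the standard library; a small sketch (the models path below is the default location mentioned above; adjust it for your install):

```python
import os

def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9

# Report each model folder under the default LM Studio models directory.
models_dir = os.path.expanduser("~/.lmstudio/models")
if os.path.isdir(models_dir):
    for entry in sorted(os.listdir(models_dir)):
        full = os.path.join(models_dir, entry)
        if os.path.isdir(full):
            print(f"{dir_size_gb(full):6.1f} GB  {entry}")
```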

Pre-loading Models

  • Startup: Set default model in settings to auto-load on launch
  • Warm-up: First request after load may be slow, expect 1-3 second delay
  • Server mode: Load model in Server tab, leave running for API requests
  • Switching: Use ⌘L to quickly load different models between chats

Conversation Export

  1. In Chat tab, click export icon (usually arrow/save symbol)
  2. Choose format: Markdown, JSON, or plain text
  3. Save to desired location
  4. Markdown format preserves formatting and is most readable

Prompt Templates

  1. Create .txt or .md files with favorite prompts
  2. Paste into system prompt or chat when needed
  3. Customize with project-specific instructions
  4. Build library: coding, writing, analysis, brainstorming templates

Performance Monitoring

  • Activity Monitor: Watch CPU/Memory in Activity Monitor during inference
  • Temperature: Monitor Mac temperature, reduce GPU layers if overheating
  • First Run: Expect slower speed as model loads from disk
  • Subsequent Runs: Much faster as model stays in memory

LM Studio Shortcuts

Main Tabs & Navigation

Chat Interface

Shortcut Action
⌘N New chat session
⌘⌫ Clear current chat
⌘↩ Send message
⌘⇧↩ Insert newline (soft wrap message)
⌘. Stop generation
⌘⌥↩ Regenerate last response
⌘A (in input) Select all text in input field

Model Management

Shortcut Action
⌘F Search available models
⌘D Download selected model
⌘L Load/unload model
⌘⌫ Delete model from disk
↩ (in models list) Load selected model
⌘I View model information

Server Controls

Shortcut Action
⌘⇧S Start/stop local server
⌘⇧C Copy server URL to clipboard
⌘⇧P Show server port settings

Text & Input

Shortcut Action
⌘C Copy selected text/response
⌘V Paste text
⌘Z Undo last action
⌘⇧Z Redo last action
⌘⇧C Copy code block
⌥← Jump to previous word
⌥→ Jump to next word

Window Management

Shortcut Action
⌘+ Zoom in
⌘- Zoom out
⌘0 Reset zoom level
⌘F Toggle fullscreen
⌘M Minimize window
⌘W Close current tab

Settings & Display

Shortcut Action
⌘, Open preferences/settings
⌘⇧D Toggle dark/light mode
⌘⇧T Open theme selector

Notes

  • LM Studio uses standard macOS shortcuts for common operations like copy (⌘C), paste (⌘V), and undo (⌘Z)
  • Tab navigation shortcuts (⌘⌥1/2/3) may vary depending on your LM Studio version
  • Server shortcuts require the Server tab to be accessible
  • Press and hold modifier keys while clicking to access context menus with additional options
  • Custom shortcuts can be configured in LM Studio preferences