Overview
Run powerful open-source language models locally for development, testing, or air-gapped deployments. This guide covers Ollama, vLLM, LM Studio, and other local model solutions with the MCP Server.
Local models provide privacy, no API costs, and offline capabilities - perfect for development and sensitive data scenarios.
Why Local Models?
Privacy & Security
No data leaves your infrastructure
GDPR/HIPAA compliance
Air-gapped deployments
Full data control
Cost Savings
No API usage costs
No rate limits
Predictable infrastructure costs
Scale without per-token fees
Performance
Low latency (no network calls)
Customizable hardware
Model fine-tuning
Offline capability
Flexibility
Any open-source model
Custom fine-tuned models
Experiment freely
Version control models
Ollama Setup
Best for: Easy setup, development, quick testing
Install Ollama
# Download from ollama.com
# Or use Homebrew
brew install ollama
# Start Ollama service
ollama serve
Pull Models
# Popular models
ollama pull llama3.2:3b # Meta's Llama 3.2 (3B params, fast)
ollama pull mistral:7b # Mistral 7B (balanced)
ollama pull qwen2.5:14b # Qwen 2.5 (14B params, powerful)
ollama pull codellama:13b # Code-specialized
# List installed models
ollama list
# Model info
ollama show llama3.2:3b
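To confirm the service is reachable and the pulled models are visible from code, here is a minimal sketch against Ollama's REST API (the /api/tags endpoint lists locally installed models; this assumes the default port and the requests library):
import requests

# /api/tags returns the models available to this Ollama instance
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
installed = [m["name"] for m in resp.json().get("models", [])]
print("Installed models:", installed)

if not any(name.startswith("llama3.2:3b") for name in installed):
    print("llama3.2:3b is missing; run: ollama pull llama3.2:3b")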
Configure MCP Server
# .env
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
LLM_MODEL_NAME=llama3.2:3b
Test
from mcp_server_langgraph.llm.factory import LLMFactory
llm = LLMFactory(
    provider="ollama",
    model_name="llama3.2:3b",
    base_url="http://localhost:11434"
)

response = await llm.ainvoke("What is machine learning?")
print(response.content)
Ollama Configuration
from langchain_community.llms import Ollama
llm = Ollama(
    model="llama3.2:3b",
    base_url="http://localhost:11434",
    # Generation parameters
    temperature=0.7,
    num_predict=2048,   # Max tokens
    top_k=40,
    top_p=0.9,
    repeat_penalty=1.1,
    # Performance
    num_ctx=4096,       # Context window
    num_thread=8,       # CPU threads
    num_gpu=1,          # Use GPU
)
Available Models
| Model | Size | Use Case | Memory Required |
|-------|------|----------|-----------------|
| llama3.2:3b | 3B | Fast chat, general | 4GB |
| mistral:7b | 7B | Balanced performance | 8GB |
| qwen2.5:14b | 14B | Complex reasoning | 16GB |
| codellama:13b | 13B | Code generation | 16GB |
| llama3.1:70b | 70B | Maximum intelligence | 64GB |
vLLM Setup
Best for: High throughput, production deployments
Install vLLM
# Requires CUDA GPU
uv pip install vllm
# Or with Docker
docker pull vllm/vllm-openai:latest
Start vLLM Server
# Start server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--port 8000 \
--tensor-parallel-size 2
# Or with Docker
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.2-3B-Instruct
Configure MCP Server
# .env
LLM_PROVIDER=openai  # vLLM is OpenAI-compatible
OPENAI_API_BASE=http://localhost:8000/v1
OPENAI_API_KEY=EMPTY
LLM_MODEL_NAME=meta-llama/Llama-3.2-3B-Instruct
Test
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    model="meta-llama/Llama-3.2-3B-Instruct"
)

response = await llm.ainvoke("Explain quantum computing")
print(response.content)
vLLM Features
# Enable speculative decoding (faster inference)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B \
  --num-speculative-tokens 5

# Multi-GPU support
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4  # Use 4 GPUs

# Quantization (reduce memory)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --quantization awq  # or gptq, bitsandbytes
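To verify the server is up and serving the model you passed to --model, here is a quick sketch against the OpenAI-compatible endpoints using the openai client (any placeholder API key works with a local vLLM server):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# /v1/models lists whatever was loaded at startup
for model in client.models.list().data:
    print("Serving:", model.id)

# Simple round-trip through /v1/chat/completions
reply = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(reply.choices[0].message.content)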
LM Studio
Best for: Desktop GUI, beginners, Windows users
Install LM Studio
Download from https://lmstudio.ai
Install for your platform (Windows/Mac/Linux)
Launch LM Studio
Download Models
Click “Discover” tab
Search for models (e.g., “Llama 3.2”)
Click download
Wait for download to complete
Start Local Server
Click “Local Server” tab
Select model from dropdown
Click “Start Server”
Note the server URL (default: http://localhost:1234)
Configure MCP Server
# .env
LLM_PROVIDER=openai
OPENAI_API_BASE=http://localhost:1234/v1
OPENAI_API_KEY=lm-studio
LLM_MODEL_NAME=local-model
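A quick test, analogous to the Ollama and vLLM sections above; LM Studio's local server speaks the OpenAI API, and the model name depends on what you loaded in the GUI ("local-model" here is just a placeholder):
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
    model="local-model"  # replace with the model identifier shown by LM Studio
)

response = await llm.ainvoke("Explain what a local LLM server does in one sentence.")
print(response.content)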
Production Deployment
Kubernetes with Ollama
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: mcp-server-langgraph
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            memory: "8Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "16Gi"
            cpu: "8000m"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: models
          mountPath: /root/.ollama
        livenessProbe:
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: mcp-server-langgraph
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: mcp-server-langgraph
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
Load Models via Init Container
spec:
  initContainers:
  - name: pull-models
    image: ollama/ollama:latest
    command:
    - sh
    - -c
    - |
      ollama serve &
      sleep 10
      ollama pull llama3.2:3b
      ollama pull mistral:7b
      pkill ollama
    volumeMounts:
    - name: models
      mountPath: /root/.ollama
MCP Server with Local LLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server-langgraph
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-server-langgraph
  template:
    metadata:
      labels:
        app: mcp-server-langgraph
    spec:
      containers:
      - name: agent
        image: mcp-server-langgraph:latest
        env:
        - name: LLM_PROVIDER
          value: "ollama"
        - name: OLLAMA_BASE_URL
          value: "http://ollama:11434"
        - name: LLM_MODEL_NAME
          value: "llama3.2:3b"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "4000m"
Model Selection
By Task Type
def select_local_model(task_type: str) -> str:
    """Select optimal local model for task"""
    models = {
        "chat": "llama3.2:3b",           # Fast conversation
        "code": "codellama:13b",         # Code generation
        "reasoning": "qwen2.5:14b",      # Complex analysis
        "creative": "mistral:7b",        # Creative writing
        "summarization": "llama3.2:3b",  # Fast summaries
    }
    return models.get(task_type, "llama3.2:3b")

# Use dynamically
llm = LLMFactory(
    provider="ollama",
    model_name=select_local_model("code")
)
By Hardware
import psutil
import GPUtil

def select_model_by_hardware():
    """Select model based on available resources"""
    # Check GPU
    try:
        gpus = GPUtil.getGPUs()
        if gpus and gpus[0].memoryTotal > 16000:   # >16GB VRAM
            return "qwen2.5:14b"
        elif gpus and gpus[0].memoryTotal > 8000:  # >8GB VRAM
            return "mistral:7b"
    except Exception:
        pass

    # Check RAM
    ram_gb = psutil.virtual_memory().total / (1024 ** 3)
    if ram_gb > 16:
        return "mistral:7b"
    elif ram_gb > 8:
        return "llama3.2:3b"
    else:
        return "llama3.2:1b"  # Smallest model

model = select_model_by_hardware()
GPU Acceleration
# Ollama automatically uses GPU if available
ollama run llama3.2:3b

# Check GPU usage
nvidia-smi

# Force CPU-only
CUDA_VISIBLE_DEVICES="" ollama serve

# Use specific GPU
CUDA_VISIBLE_DEVICES=1 ollama serve
Quantization
# Ollama models are pre-quantized
# Pull different quantization levels:
ollama pull llama3.2:3b-q4_0  # 4-bit (smallest, fastest)
ollama pull llama3.2:3b-q5_0  # 5-bit (balanced)
ollama pull llama3.2:3b-q8_0  # 8-bit (larger, better quality)
ollama pull llama3.2:3b       # Default (usually q4)

# Trade-off: smaller = faster but lower quality
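If you want to choose a quantization level programmatically, here is a rough sketch that keys off available RAM using the tags listed above (the thresholds are arbitrary starting points, and exact tag availability varies by model on the Ollama registry):
import psutil

def pick_quantization(base: str = "llama3.2:3b") -> str:
    """Suggest a quantization tag based on currently available memory."""
    ram_gb = psutil.virtual_memory().available / (1024 ** 3)
    if ram_gb > 16:
        return f"{base}-q8_0"  # best quality if memory allows
    elif ram_gb > 8:
        return f"{base}-q5_0"  # balanced
    return f"{base}-q4_0"      # smallest footprint

print("Suggested tag:", pick_quantization())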
Concurrent Requests
# Ollama handles concurrent requests
import asyncio

queries = [
    "What is Python?",
    "Explain async/await",
    "What is FastAPI?"
]

# Process in parallel
async def process_batch(queries: list):
    tasks = [llm.ainvoke(q) for q in queries]
    results = await asyncio.gather(*tasks)
    return results

results = await process_batch(queries)
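An unbounded gather can overwhelm a single local instance; a small variation caps in-flight requests with a semaphore (the limit of 4 is an arbitrary starting point to tune against your hardware):
import asyncio

async def process_batch_limited(queries: list, max_concurrent: int = 4):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(query: str):
        async with semaphore:  # at most max_concurrent requests in flight
            return await llm.ainvoke(query)

    return await asyncio.gather(*(run_one(q) for q in queries))

results = await process_batch_limited(queries)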
Context Caching
# Reuse conversation context across requests by resending prior messages.
# The server is stateless between calls, so the client carries the history.
# Assumes a chat-style model client (e.g., ChatOpenAI pointed at a local server).
from langchain_core.messages import HumanMessage

history = []

# First request
history.append(HumanMessage(content="Explain machine learning"))
response1 = await llm.ainvoke(history)
history.append(response1)

# Second request (the model sees the earlier exchange)
history.append(HumanMessage(content="Give me an example"))
response2 = await llm.ainvoke(history)
Fine-Tuning
Create Modelfile
# Modelfile
FROM llama3.2:3b

# Set custom system prompt
SYSTEM """You are a helpful AI assistant specializing in software engineering.
Always provide code examples when relevant."""

# Set parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "Human:"
PARAMETER stop "Assistant:"
Build Custom Model
# Create model
ollama create my-custom-model -f Modelfile

# Test
ollama run my-custom-model "Write a Python function"

# Use in MCP Server (.env)
LLM_MODEL_NAME=my-custom-model
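The custom model is then addressed by name like any other model; a brief sketch reusing the Ollama wrapper shown earlier:
from langchain_community.llms import Ollama

llm = Ollama(model="my-custom-model", base_url="http://localhost:11434")
response = await llm.ainvoke("Write a Python function that reverses a string")
print(response)  # the Ollama LLM wrapper returns a plain string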
Fine-Tune with LoRA
# Use libraries like axolotl or unsloth
uv pip install unsloth

# Fine-tune script
python fine_tune.py \
  --base_model llama3.2:3b \
  --dataset my-training-data.json \
  --output_dir ./fine-tuned-model

# Import to Ollama
ollama create my-fine-tuned -f Modelfile.finetune
Monitoring
Resource Usage
import psutil
import GPUtil

def monitor_resources():
    """Monitor CPU, RAM, and GPU usage"""
    # CPU
    cpu_percent = psutil.cpu_percent(interval=1)

    # RAM
    ram = psutil.virtual_memory()
    ram_percent = ram.percent

    # GPU
    try:
        gpus = GPUtil.getGPUs()
        if gpus:
            gpu_percent = gpus[0].load * 100
            gpu_memory = gpus[0].memoryUsed
        else:
            gpu_percent = 0
            gpu_memory = 0
    except Exception:
        gpu_percent = 0
        gpu_memory = 0

    return {
        "cpu_percent": cpu_percent,
        "ram_percent": ram_percent,
        "gpu_percent": gpu_percent,
        "gpu_memory_mb": gpu_memory
    }

# Log metrics
metrics = monitor_resources()
logger.info("Resource usage", **metrics)
from prometheus_client import Histogram, Counter

# Define metrics
local_llm_latency = Histogram(
    'local_llm_latency_seconds',
    'Local LLM inference latency',
    ['model']
)

local_llm_tokens = Counter(
    'local_llm_tokens_total',
    'Total tokens processed',
    ['model', 'type']
)

# Track
@local_llm_latency.labels(model="llama3.2:3b").time()
async def call_local_llm(query: str):
    response = await llm.ainvoke(query)

    # Estimate tokens (approximate)
    input_tokens = len(query.split())
    output_tokens = len(response.content.split())

    local_llm_tokens.labels(model="llama3.2:3b", type="input").inc(input_tokens)
    local_llm_tokens.labels(model="llama3.2:3b", type="output").inc(output_tokens)

    return response
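To make these metrics scrapeable, the standard prometheus_client HTTP exporter can be started alongside the agent (the port here is arbitrary and would normally come from your service configuration):
from prometheus_client import start_http_server

# Serves the registered metrics on http://localhost:9100/metrics
start_http_server(9100)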
Troubleshooting
Error: CUDA out of memory or system freeze

Solutions:
# Use smaller model
ollama pull llama3.2:1b  # 1B instead of 3B

# Use quantized model
ollama pull llama3.2:3b-q4_0  # 4-bit quantization

# Reduce context window (in Modelfile):
PARAMETER num_ctx 2048  # Smaller context

# Clear GPU memory
docker restart ollama
Slow inference
Causes: CPU-only inference, large model, no optimization

Solutions:
# Check GPU usage
nvidia-smi

# Use GPU
# Ensure CUDA is installed and the GPU is detected

# Use smaller/quantized model
ollama pull llama3.2:3b-q4_0

# Optimize parameters (in Modelfile)
PARAMETER num_thread 8  # More CPU threads
Error: model 'llama3.2:3b' not found

Solutions:
# Pull model
ollama pull llama3.2:3b

# List installed models
ollama list

# Verify model name (case-sensitive)
ollama show llama3.2:3b
Error: Connection refused to localhost:11434

Solutions:
# Check Ollama is running
ps aux | grep ollama

# Start Ollama
ollama serve

# Or as a service
sudo systemctl start ollama

# Check port
netstat -tuln | grep 11434

# Test connection
curl http://localhost:11434
Best Practices
Minimum:
CPU: 4 cores
RAM: 8GB
Storage: 50GB

Recommended:
CPU: 8+ cores
RAM: 16GB+
GPU: NVIDIA with 8GB+ VRAM
Storage: 100GB+ SSD

Production:
CPU: 16+ cores
RAM: 32GB+
GPU: NVIDIA A100/H100
Storage: 500GB+ NVMe SSD
Development: Small models (1-3B params)
Testing: Medium models (7B params)
Production: Based on use case (7-70B)
Air-gapped: Pre-download all needed models
Run Ollama in an isolated environment
Don't expose the Ollama port publicly
Use authentication if remote access is needed
Validate all inputs
Monitor resource usage
Horizontal: Multiple Ollama instances
Vertical: Larger GPU, more RAM
Load balancing across instances (see the sketch below)
Model caching on shared storage
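A minimal sketch of the load-balancing idea: client-side round-robin across several Ollama instances. The endpoint list is hypothetical and would come from your deployment (e.g., a headless Service or static configuration):
import itertools
from langchain_community.llms import Ollama

OLLAMA_ENDPOINTS = [
    "http://ollama-0:11434",
    "http://ollama-1:11434",
    "http://ollama-2:11434",
]
_endpoints = itertools.cycle(OLLAMA_ENDPOINTS)

def next_llm(model: str = "llama3.2:3b") -> Ollama:
    """Return a client bound to the next instance in the rotation."""
    return Ollama(model=model, base_url=next(_endpoints))

response = await next_llm().ainvoke("What is Kubernetes?")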
Next Steps
Local LLMs Ready: Run powerful open-source models with complete privacy and control!