# VRAM Requirements for GPT-OSS Models via Ollama
## Models Overview
| Model | Total Parameters | Active Parameters (MoE) | Memory Usage (MXFP4) | Recommended VRAM |
|-------|-----------------|------------------------|---------------------|------------------|
| **gpt-oss:20B** | ~21 B | ~3.6 B | ~13–14 GB | **≥ 16 GB** |
| **gpt-oss:120B** | ~117–118 B | ~5.1 B | ~60–65 GB | **≥ 80 GB** (or ≥60 GB with tuning) |
**Note:** These figures apply to the **MXFP4-quantized** builds typically distributed for Ollama, llama.cpp, vLLM, etc.; a rough back-of-envelope check follows.
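As a sanity check on the table, a weight-memory estimate can be derived from the parameter counts alone. The sketch below is a rough approximation, not an official sizing tool: the 4.5 bits-per-weight figure is an assumption that blends MXFP4 (~4.25 bits) expert weights with the higher-precision attention and embedding tensors, and it ignores the KV cache and runtime buffers.

```python
# Back-of-envelope weight-memory estimate for MXFP4-quantized MoE checkpoints.
# bits_per_weight = 4.5 is an assumption: expert weights are ~4.25 bits in
# MXFP4, while attention/embedding tensors stay in BF16, pulling the average up.
def estimate_weight_gb(total_params_billions: float,
                       bits_per_weight: float = 4.5) -> float:
    """Weight memory only; KV cache and runtime buffers come on top."""
    return total_params_billions * bits_per_weight / 8  # billions of bytes = GB

if __name__ == "__main__":
    print(f"gpt-oss:20b  -> ~{estimate_weight_gb(21):.1f} GB of weights")   # ~11.8 GB
    print(f"gpt-oss:120b -> ~{estimate_weight_gb(117):.1f} GB of weights")  # ~65.8 GB
```

With a few gigabytes added for the KV cache and runtime buffers, the 20B estimate lands in the ~13–14 GB range reported above, and the 120B estimate tracks the ~60–65 GB figure.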
## 🔍 Details
### gpt-oss:20B
- Runs comfortably on a **16 GB GPU**.
- Benchmarks report **~13–14 GB** VRAM usage in MXFP4.
- Ideal for most high-end consumer GPUs; a minimal request sketch follows this list.
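Here is a minimal sketch of querying the 20B model through Ollama's local REST API from Python. It assumes an Ollama daemon running on the default port (11434) and that `ollama pull gpt-oss:20b` has already been run; the prompt text is just a placeholder.

```python
import requests  # pip install requests

# Send a single, non-streaming generation request to the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",          # must already be pulled locally
        "prompt": "Explain MXFP4 quantization in one sentence.",
        "stream": False,                 # return one JSON object instead of a stream
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])           # the generated text
```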
### gpt-oss:120B
- Requires **~60–65 GB VRAM** for smooth performance in MXFP4 quantization.
- Runs best on **≥ 80 GB GPUs** (e.g., A100 80GB, AMD Instinct, H100).
- Can work on **60 GB VRAM setups**, but only with careful tuning and performance trade-offs; see the sketch after this list.
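If you are at the lower end of that range, the usual levers in Ollama are a smaller context window (`num_ctx`) and partial GPU offload (`num_gpu`, the number of layers placed on the GPU). The sketch below is illustrative only: the specific values are placeholders to tune for your hardware, not recommended settings.

```python
import requests  # pip install requests

# Illustrative request for gpt-oss:120b on a ~60 GB card: shrink the KV cache
# via a smaller context window and keep some layers on the CPU. The numbers
# below are placeholders, not tuned recommendations.
options = {
    "num_ctx": 8192,  # smaller context -> smaller KV cache
    "num_gpu": 30,    # layers offloaded to the GPU; the rest run on the CPU
}
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:120b",
        "prompt": "Hello",
        "options": options,
        "stream": False,
    },
    timeout=1800,
)
resp.raise_for_status()
print(resp.json()["response"])
```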
## ⚠️ Additional Notes
- **Unquantized (BF16/FP16)** weights take 16 bits each versus roughly 4–4.5 bits effective in MXFP4, so expect a **3–4×** larger memory footprint.
- Running **on CPU or using swap** is possible but extremely **slow** and **not practical** for real-time inference.
- Context length (up to **128k tokens**) adds to memory usage through the KV cache; a rough estimate is sketched after this list.
- MoE (Mixture of Experts) routing activates only a few experts per token, which mainly reduces compute; Ollama and llama.cpp can additionally offload part of the weights to CPU RAM to lower the VRAM footprint, at a cost in speed.
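To make the context-length point concrete, the KV cache grows linearly with the number of tokens kept in context. The estimator below uses illustrative architecture numbers (layer count, KV heads, head dimension) that are placeholders rather than the published gpt-oss configuration; substitute the real values for an accurate figure.

```python
# Rough KV-cache size estimate: 2 (keys + values) x layers x KV heads x head
# dim x bytes per element x context length. The default architecture numbers
# are illustrative placeholders, not the published gpt-oss configuration.
def kv_cache_gb(context_len: int,
                n_layers: int = 36,
                n_kv_heads: int = 8,
                head_dim: int = 64,
                bytes_per_elem: int = 2) -> float:  # 2 bytes ~ FP16/BF16 cache
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```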
## ✅ Recommendations
### Use gpt-oss:20B if you have:
- A consumer GPU such as an RTX 3090/4090 (24 GB)
- An Apple M-series machine with 16–32 GB of shared memory
- A need for efficient, low-footprint inference
### Use gpt-oss:120B only if you have:
- An **80 GB+** GPU, or
- A **60–65 GB** VRAM setup combined with the tuning described above
- In practice this means enterprise-level hardware (A100, H100, etc.)
## 🔗 References
- Ollama GPT-OSS:120B
- HuggingFace: GPT-OSS Blog
- Simon Willison's GPT-OSS Notes
- AMD Blog on GPT-OSS Inference
- Reddit: Ollama VRAM Experiences