
# VRAM Requirements for GPT-OSS Models via Ollama

## Models Overview

| Model | Total Parameters | Active Parameters (MoE) | Memory Usage (MXFP4) | Recommended VRAM |
|-------|-----------------|------------------------|---------------------|------------------|
| **gpt-oss:20B** | ~21 B | ~3.6 B | ~13–14 GB | **≥ 16 GB** |
| **gpt-oss:120B** | ~117–118 B | ~5.1 B | ~60–65 GB | **≥ 80 GB** (or ≥ 60 GB with tuning) |

**Note:** These values are for the **quantized MXFP4** versions typically used in Ollama, llama.cpp, vLLM, etc.

## 🔍 Details

### gpt-oss:20B
- Runs comfortably on a **16 GB GPU**.
- Benchmarks report **~13–14 GB** of VRAM usage in MXFP4.
- Ideal for most high-end consumer GPUs.

### gpt-oss:120B
- Requires **~60–65 GB of VRAM** for smooth performance with MXFP4 quantization.
- Runs best on **≥ 80 GB GPUs** (e.g., NVIDIA A100 80 GB, H100, AMD Instinct).
- Can work on **60 GB setups**, but only with careful tuning and performance trade-offs.

## ⚠️ Additional Notes
- **Unquantized (FP16)** versions require **much more VRAM** (up to **2× more**).
- Running **on CPU or with swap** is possible but extremely **slow** and **not practical** for real-time inference.
- Context length (up to **128k tokens**) also affects memory usage, since the KV cache grows with the context (a rough estimate is sketched after the references below).
- MoE (Mixture of Experts) routing activates only a small subset of experts per token, which cuts compute and memory bandwidth; the full expert weights still have to be loaded, although Ollama and llama.cpp can offload part of the model to CPU RAM to reduce the VRAM load.

## ✅ Recommendations

### Use gpt-oss:20B if you have:
- A consumer GPU like the RTX 3090/4090 (24 GB)
- An Apple M-series machine with 16–32 GB of shared memory
- A need for efficient inference

### Use gpt-oss:120B only if you have:
- An 80 GB+ GPU
- 60–65 GB of VRAM plus careful optimization
- Enterprise-level hardware (A100, H100, etc.)

## 🔗 References
- Ollama: GPT-OSS:120B
- HuggingFace: GPT-OSS Blog
- Simon Willison's GPT-OSS Notes
- AMD Blog on GPT-OSS Inference
- Reddit: Ollama VRAM Experiences
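## 📐 Back-of-the-Envelope Estimate

The numbers in the table can be sanity-checked from the parameter counts: MXFP4 stores weights at roughly 4–4.5 bits each, and the KV cache grows linearly with context length. The sketch below is only a rough estimate, not a measurement; the bits-per-weight value, the overhead factor, and the KV-cache dimensions (layers, heads, head size) are illustrative assumptions rather than published gpt-oss specs.

```python
# Back-of-the-envelope VRAM estimate for a quantized MoE model.
# All constants marked "assumed" are illustrative, not official model specs.

GB = 1024 ** 3

def weight_memory_gb(total_params: float, bits_per_weight: float = 4.25) -> float:
    """Weights only: params * bits / 8, ignoring runtime buffers."""
    return total_params * bits_per_weight / 8 / GB

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value / GB

# gpt-oss:20B, ~21 B parameters in MXFP4 (~4.25 bits/weight, assumed)
weights = weight_memory_gb(21e9)

# Assumed KV-cache shape for illustration only (not the real gpt-oss config):
# 24 layers, 8 KV heads, head_dim 64, 8k-token context, FP16 cache.
kv = kv_cache_gb(n_layers=24, n_kv_heads=8, head_dim=64,
                 context_len=8192, bytes_per_value=2)

overhead = 1.15  # assumed ~15% for activations, buffers, fragmentation
print(f"weights ≈ {weights:.1f} GB, KV cache ≈ {kv:.2f} GB, "
      f"total ≈ {(weights + kv) * overhead:.1f} GB")
```

This lands a little below the ~13–14 GB reported above, which is expected: not every tensor is stored at 4 bits, and real runtimes add their own buffers. The useful takeaway is the shape of the budget, with weights dominating and the KV cache scaling with context.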
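## 🚀 Quick Smoke Test via Ollama

If the model fits your hardware, the quickest check is to pull it and send one request to the local Ollama server over its standard HTTP API on port 11434. This is a minimal sketch: the tag `gpt-oss:20b` follows the Ollama library naming (verify with `ollama list`), and the prompt and `num_ctx` value are arbitrary choices you should adjust.

```python
# Minimal smoke test against a local Ollama server (default port 11434).
# Assumes you have already run: `ollama pull gpt-oss:20b`
import json
import urllib.request

payload = {
    "model": "gpt-oss:20b",          # use "gpt-oss:120b" on big-VRAM hardware
    "prompt": "In one sentence, what is MXFP4 quantization?",
    "stream": False,
    "options": {"num_ctx": 8192},    # a smaller context keeps the KV cache modest
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])
```

While the request runs, `ollama ps` (or `nvidia-smi` on NVIDIA hardware) shows how much memory the loaded model actually claims, which is the number that matters for the table above.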
0 points by raj · 2 months ago
