New in llama.cpp: Model Management
https://huggingface.co/blog/ggml-org/model-management-in-llamacpp

The llama.cpp server now features a "router mode" for dynamic model management, letting users load, unload, and switch between multiple models without restarting the server. The server automatically discovers GGUF models from its cache or a specified directory and loads them on demand when first requested. It uses a multi-process architecture for stability and applies a least-recently-used (LRU) eviction policy to manage memory once the maximum number of loaded models is reached. Users can specify which model to use in their API requests, manually load or unload models, and configure settings globally or per model.
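The post doesn't reproduce the exact request shape here, but since the llama.cpp server speaks the OpenAI-compatible chat API, per-request model selection presumably looks like the minimal sketch below. It assumes a router-mode server listening on localhost:8080; the model name is hypothetical, standing in for whatever GGUF the router has discovered.

```python
# Minimal sketch: per-request model selection against a llama.cpp
# server in router mode. Assumes the server listens on localhost:8080
# and exposes the OpenAI-compatible chat endpoint; the model name
# below is a hypothetical placeholder.
import json
import urllib.request

payload = {
    # Per the blog post, the router loads this model on demand if it
    # isn't resident yet, evicting the least-recently-used model when
    # the configured load limit is reached.
    "model": "qwen2.5-7b-instruct-q4_k_m",  # hypothetical GGUF name
    "messages": [
        {"role": "user", "content": "Say hello in one sentence."}
    ],
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

Switching models is then just a matter of changing the "model" field between requests; the LRU policy decides which resident model to evict when the cap is exceeded.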