KTransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Key Features

Advanced Optimization

Enhanced kernel optimizations and placement/parallelism strategies for improved performance.

Flexible Integration

Compatible with the 🤗 Transformers interface, and exposes RESTful APIs compliant with OpenAI and Ollama (see the usage sketch after this list).

Resource Efficient

Run powerful models like DeepSeek-Coder-V3 using only 14GB VRAM and 382GB DRAM.
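Below is a minimal sketch of calling a locally running KTransformers server through its OpenAI-compatible RESTful API. The base URL, port, and model identifier are illustrative assumptions, not values confirmed by this README; substitute whatever your local server actually exposes.

```python
# Minimal sketch: query an OpenAI-compatible local endpoint.
# The base_url, port, and model name below are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10002/v1",  # assumed local KTransformers endpoint
    api_key="not-needed",                  # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="DeepSeek-V3",  # assumed model identifier on the local server
    messages=[{"role": "user", "content": "Write a quicksort function in Python."}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```

Because the API follows the OpenAI schema, existing clients and tooling that speak that protocol can be pointed at the local server by changing only the base URL.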

Latest Updates

Feb 15, 2025

KTransformers V0.2.1 Released

Longer context (from 4K to 8K on 24GB VRAM) and slightly faster speed (+15%, up to 16 tokens/s).

Feb 10, 2025

DeepSeek-R1 and V3 Support

Support for single-GPU (24GB VRAM) and multi-GPU setups with 382GB DRAM, delivering up to a 3~28x speedup.