Release history
Text Generation Web UI releases
The original local LLM interface. Text, vision, tool-calling, training, and more. 100% offline.
All releases
22 shown
- Use cuda13.1 build if `nvidia-smi` reports CUDA Version >= 13.1; otherwise use cuda12.4
- ik_llama.cpp offers new quant types – choose based on preference
- Portable builds now support Windows, Linux, macOS with specific GPU/ROCm/CPU variants
- Redesigned chat composer: taller input area with paperclip and action buttons pinned to bottom (Gemini/DeepSeek style)
- Smooth scroll animation when sending a new message
- Electron improvements – persist window bounds, add --no-electron flag, disable spellcheck in chat input
Full changelog
Changes
- Redesigned chat composer: Taller input area with the paperclip and message-action buttons pinned to the bottom, similar to Gemini and DeepSeek.
- Smooth scroll animation when sending a new message: Inspired by Gemini's chat UI.
- Electron improvements:
- Persist window bounds and maximize state across launches.
- Add a
--no-electronflag to skip the desktop window and use the web UI in the browser instead. - Disable spellcheck in the chat input.
- API: Add support for list-format content in tool and assistant messages.
- Add more space below the last chat/chat-instruct message so its action buttons have breathing room.
Bug fixes
- Fix speculative decoding broken by upstream llama.cpp arg renames (#7541).
- Fix truncation length reverting after model load on UI reload (#7540).
- Don't clear the chat input when sending a message with no model loaded (#7542).
- Electron:
- Fix big character picture failing to load (#7540).
- Fix
--listenmode in the launcher. - Fix missing log colors on Windows.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/commit/68380ae11b564af67196afc70f10c99dbb532fa9
- Update ik_llama.cpp to https://github.com/ikawrakow/ik_llama.cpp/commit/9a26522af234f8db079ae3735f35ab6c20fe2c66
Portable builds
TextGen is now a desktop app for local LLMs. Download, unzip, double-click.
[!NOTE]
NVIDIA GPU: Ifnvidia-smireports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (891 MB) | Download (1.23 GB) |
| NVIDIA (CUDA 13.1) | Download (817 MB) | Download (1.33 GB) |
| AMD/Intel (Vulkan) | Download (336 MB) | — |
| AMD (ROCm 7.2) | Download (604 MB) | — |
| CPU only | Download (319 MB) | Download (334 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (848 MB) | Download (1.20 GB) |
| NVIDIA (CUDA 13.1) | Download (803 MB) | Download (1.33 GB) |
| AMD/Intel (Vulkan) | Download (324 MB) | — |
| AMD (ROCm 7.2) | Download (396 MB) | — |
| CPU only | Download (307 MB) | Download (334 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (271 MB) |
| Intel (x86_64) | Download (283 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the
user_datafolder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
textgen-4.6/
textgen-4.7/
user_data/ <-- shared by both installs
- When upgrading portable builds, replace the entire executable with the new `textgen`/`textgen.bat` and adjust any scripts that called the old start script.
- Existing configurations using `--row-split` must be migrated to `--split-mode tensor` (or another supported mode).
- Portable build selection now requires CUDA version check: use `cuda13.1` builds if `nvidia-smi` reports CUDA ≥ 13.1, otherwise use `cuda12.4`.
- Run `textgen` / `textgen.bat` executables instead of previous start scripts; old script invocation no longer works.
- Flag `--row-split` removed, replaced by new `--split-mode` flag with a `tensor` option for multi‑GPU inference.
- Native desktop builds bundle Electron and launch as native windows (run via `textgen`/`textgen.bat`).
- Tensor parallelism support added via `--split-mode tensor` in llama.cpp, boosting multi‑GPU performance by >60%.
- UI overhaul: Inter font default, Lucide SVG icons for actions, segmented chat mode control, redesigned input card, flat underline tab indicator, hairline sidebar handles.
Full changelog
Changes
- Native desktop app: Portable builds now bundle Electron and open as a native window. Run
textgen/textgen.batinstead of the previous start scripts. Pass--listenor--nowebuito skip the window and run the server directly. - Major UI overhaul:
- Replace Noto Sans with Inter as the default font.
- Replace emoji refresh/save/delete buttons with Lucide SVG icons.
- Turn the chat mode selector (chat / chat-instruct / instruct) into a 3-button segmented control.
- Redesign the chat input as a single rounded card with a circular accent-colored send button.
- Use a flat underline for the active tab indicator.
- Replace the sidebar toggle buttons with 3px hairline handles on desktop.
- Tensor parallelism for llama.cpp: New
--split-modeflag (replacing--row-split) with atensoroption that can make multi-GPU inference 60%+ faster. On the ik_llama.cpp backend,tensorandrowfall back tograph. - Replace DuckDuckGo HTML scraping in the web search tool with the ddgs library, which is more robust against DuckDuckGo's bot blocking.
- Add support for standalone
.jinja/.jinja2instruction template files in the UI, in addition to the existing.yamlformat (#7517).
Bug fixes
- Fix Stop button being ignored during tool call approval, and not interrupting between tool turns in multi-turn tool loops.
- Fix race condition in the ExLlamaV3 backend that could affect concurrent API requests.
- Fix extension settings not saving for extensions inside
user_data/extensions(#7525).
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/commit/09294365a9d7a2b786584d59525b034622c3ed81
- Update ik_llama.cpp to https://github.com/ikawrakow/ik_llama.cpp/commit/9f1deefa7128889fd8a947964f04262bfa724b84
- Update transformers to 5.6
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
[!NOTE]
NVIDIA GPU: Ifnvidia-smireports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (891 MB) | Download (1.23 GB) |
| NVIDIA (CUDA 13.1) | Download (816 MB) | Download (1.33 GB) |
| AMD/Intel (Vulkan) | Download (336 MB) | — |
| AMD (ROCm 7.2) | Download (604 MB) | — |
| CPU only | Download (318 MB) | Download (334 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (848 MB) | Download (1.20 GB) |
| NVIDIA (CUDA 13.1) | Download (803 MB) | Download (1.32 GB) |
| AMD/Intel (Vulkan) | Download (324 MB) | — |
| AMD (ROCm 7.2) | Download (395 MB) | — |
| CPU only | Download (306 MB) | Download (334 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (271 MB) |
| Intel (x86_64) | Download (283 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the
user_datafolder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
textgen-4.6/
textgen-4.7/
user_data/ <-- shared by both installs
- --listen or --nowebui can be used to skip the native window and run the server directly.
- Portable builds require CUDA 13.1 when `nvidia-smi` reports CUDA Version ≥ 13.1; otherwise use the CUDA 12.4 build.
- Removed previous start script invocation; run `textgen` / `textgen.bat` to launch the Electron‑wrapped native window.
- Tensor parallelism flag `--split-mode tensor` for up to 60%+ faster multi‑GPU inference with llama.cpp.
- Major UI overhaul: default font Inter, Lucide SVG icons, segmented chat mode control, redesigned input card, flat tab indicator, hairline sidebar handles.
Full changelog
Changes
- Native desktop app: Portable builds now bundle Electron and open as a native window. Run
textgen/textgen.batinstead of the previous start scripts. Pass--listenor--nowebuito skip the window and run the server directly. - Major UI overhaul:
- Replace Noto Sans with Inter as the default font.
- Replace emoji refresh/save/delete buttons with Lucide SVG icons.
- Turn the chat mode selector (chat / chat-instruct / instruct) into a 3-button segmented control.
- Redesign the chat input as a single rounded card with a circular accent-colored send button.
- Use a flat underline for the active tab indicator.
- Replace the sidebar toggle buttons with 3px hairline handles on desktop.
- Tensor parallelism for llama.cpp: New
--split-modeflag (replacing--row-split) with atensoroption that can make multi-GPU inference 60%+ faster. On the ik_llama.cpp backend,tensorandrowfall back tograph. - Replace DuckDuckGo HTML scraping in the web search tool with the ddgs library, which is more robust against DuckDuckGo's bot blocking.
- Add support for standalone
.jinja/.jinja2instruction template files in the UI, in addition to the existing.yamlformat (#7517).
Bug fixes
- Fix Stop button being ignored during tool call approval, and not interrupting between tool turns in multi-turn tool loops.
- Fix race condition in the ExLlamaV3 backend that could affect concurrent API requests.
- Fix extension settings not saving for extensions inside
user_data/extensions(#7525).
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/commit/09294365a9d7a2b786584d59525b034622c3ed81
- Update ik_llama.cpp to https://github.com/ikawrakow/ik_llama.cpp/commit/9f1deefa7128889fd8a947964f04262bfa724b84
- Update transformers to 5.6
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
[!NOTE]
NVIDIA GPU: Ifnvidia-smireports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (891 MB) | Download (1.23 GB) |
| NVIDIA (CUDA 13.1) | Download (816 MB) | Download (1.33 GB) |
| AMD/Intel (Vulkan) | Download (336 MB) | — |
| AMD (ROCm 7.2) | Download (604 MB) | — |
| CPU only | Download (318 MB) | Download (334 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (848 MB) | Download (1.20 GB) |
| NVIDIA (CUDA 13.1) | Download (803 MB) | Download (1.32 GB) |
| AMD/Intel (Vulkan) | Download (324 MB) | — |
| AMD (ROCm 7.2) | Download (395 MB) | — |
| CPU only | Download (306 MB) | Download (334 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (271 MB) |
| Intel (x86_64) | Download (283 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the
user_datafolder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
textgen-4.6/
textgen-4.7/
user_data/ <-- shared by both installs
- Native portable builds bundle Electron; launch via `textgen`/`textgen.bat`, optional `--listen` / `--nowebui` flags.
- UI redesign: Inter font, Lucide SVG icons, segmented chat mode control, rounded input card with accent send button, flat underline active tab indicator, hairline sidebar handles.
- Tensor parallelism via new `--split-mode tensor` flag for up to 60%+ multi‑GPU speedup on llama.cpp (fallback to graph on ik_llama.cpp).
Full changelog
Changes
- Native desktop app: Portable builds now bundle Electron and open as a native window. Run
textgen/textgen.batinstead of the previous start scripts. Pass--listenor--nowebuito skip the window and run the server directly. - Major UI overhaul:
- Replace Noto Sans with Inter as the default font.
- Replace emoji refresh/save/delete buttons with Lucide SVG icons.
- Turn the chat mode selector (chat / chat-instruct / instruct) into a 3-button segmented control.
- Redesign the chat input as a single rounded card with a circular accent-colored send button.
- Use a flat underline for the active tab indicator.
- Replace the sidebar toggle buttons with 3px hairline handles on desktop.
- Tensor parallelism for llama.cpp: New
--split-modeflag (replacing--row-split) with atensoroption that can make multi-GPU inference 60%+ faster. On the ik_llama.cpp backend,tensorandrowfall back tograph. - Replace DuckDuckGo HTML scraping in the web search tool with the ddgs library, which is more robust against DuckDuckGo's bot blocking.
- Add support for standalone
.jinja/.jinja2instruction template files in the UI, in addition to the existing.yamlformat (#7517).
Bug fixes
- Fix Stop button being ignored during tool call approval, and not interrupting between tool turns in multi-turn tool loops.
- Fix race condition in the ExLlamaV3 backend that could affect concurrent API requests.
- Fix extension settings not saving for extensions inside
user_data/extensions(#7525).
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/commit/09294365a9d7a2b786584d59525b034622c3ed81
- Update ik_llama.cpp to https://github.com/ikawrakow/ik_llama.cpp/commit/9f1deefa7128889fd8a947964f04262bfa724b84
- Update transformers to 5.6
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
[!NOTE]
NVIDIA GPU: Ifnvidia-smireports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (891 MB) | Download (1.23 GB) |
| NVIDIA (CUDA 13.1) | Download (816 MB) | Download (1.33 GB) |
| AMD/Intel (Vulkan) | Download (336 MB) | — |
| AMD (ROCm 7.2) | Download (604 MB) | — |
| CPU only | Download (318 MB) | Download (334 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (848 MB) | Download (1.20 GB) |
| NVIDIA (CUDA 13.1) | Download (803 MB) | Download (1.32 GB) |
| AMD/Intel (Vulkan) | Download (324 MB) | — |
| AMD (ROCm 7.2) | Download (395 MB) | — |
| CPU only | Download (306 MB) | Download (334 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (271 MB) |
| Intel (x86_64) | Download (283 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the
user_datafolder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
textgen-4.6/
textgen-4.7/
user_data/ <-- shared by both installs
Minor fixes and improvements.
Changelog
Updated to v4.7.1
https://github.com/oobabooga/textgen/releases/tag/v4.7.1
- SSRF vulnerabilities fixed in URL fetching with backslash and userinfo rejection
- Tool call confirmation UI with approve/reject buttons
- Stdio MCP server support via mcp.json
- preserve_thinking chat template parameter
Full changelog
Changes
- Tool call confirmation: Add inline approve/reject/always-approve buttons that appear before each tool call is executed. Enable via the new "Confirm tool calls" checkbox in the Chat tab.
- Stdio MCP server support: In addition to HTTP MCP servers, you can now configure local subprocess-based MCP servers via
user_data/mcp.json, using the same format as Claude Desktop and Cursor. [Tutorial] preserve_thinkingchat template parameter: New UI checkbox and--preserve-thinkingCLI flag to control whether thinking blocks from prior turns are kept in the context.- UI: Sidebars overhaul: Sidebars now toggle independently and persist their state on page refresh. Default visibility adapts to viewport width.
- llama.cpp: Pass
--draft-min 48by default for draftless speculative decoding. - Only show the "Reasoning effort" and "Enable thinking" controls for models whose chat template actually uses them.
- Cache MCP tool discovery to avoid re-querying servers on each generation.
- Add model download branch handling in download_model_wrapper (#7506). Thanks, @Th-Underscore.
- UI: Improve border colors in light theme, fix code block copy button colors and centering, fix code block scrollbar flash during page load, improve past chats menu spacing.
Security
- Fix SSRF vulnerabilities in URL fetching: add backslash and userinfo rejection, validate every redirect hop.
Bug fixes
- Fix Gemma 4 thinking tags not hidden after tool calls (#7509).
- Fix GPT-OSS channel tokens leaking in UI after tool calls.
- Fix Slider preprocess not handling None from cleared number input. 🆕 - v4.6.1.
- llama.cpp: Fix multimodal by using server's random media marker. 🆕 - v4.6.1.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/commit/6217b49583432f55014c2a0551f453d42b300530
- Update ik_llama.cpp to https://github.com/ikawrakow/ik_llama.cpp/commit/286ce324baed17c95faec77792eaa6bdb1c7a5f5
- Update ExLlamaV3 to 0.0.30
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
[!NOTE]
NVIDIA GPU: Ifnvidia-smireports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (766 MB) | Download (1.1 GB) |
| NVIDIA (CUDA 13.1) | Download (686 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (196 MB) | — |
| AMD (ROCm 7.2) | Download (499 MB) | — |
| CPU only | Download (178 MB) | Download (194 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (747 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (696 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (208 MB) | — |
| AMD (ROCm 7.2) | Download (307 MB) | — |
| CPU only | Download (190 MB) | Download (217 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (156 MB) |
| Intel (x86_64) | Download (162 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the
user_datafolder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/ <-- shared by both installs
- SSRF vulnerabilities fixed in URL fetching with backslash and userinfo rejection
- Tool call confirmation UI with approve/reject buttons
- Stdio MCP server support via mcp.json
- preserve_thinking chat template parameter
Full changelog
Changes
- Tool call confirmation: Add inline approve/reject/always-approve buttons that appear before each tool call is executed. Enable via the new "Confirm tool calls" checkbox in the Chat tab.
- Stdio MCP server support: In addition to HTTP MCP servers, you can now configure local subprocess-based MCP servers via
user_data/mcp.json, using the same format as Claude Desktop and Cursor. [Tutorial] preserve_thinkingchat template parameter: New UI checkbox and--preserve-thinkingCLI flag to control whether thinking blocks from prior turns are kept in the context.- UI: Sidebars overhaul: Sidebars now toggle independently and persist their state on page refresh. Default visibility adapts to viewport width.
- llama.cpp: Pass
--draft-min 48by default for draftless speculative decoding. - Only show the "Reasoning effort" and "Enable thinking" controls for models whose chat template actually uses them.
- Cache MCP tool discovery to avoid re-querying servers on each generation.
- Add model download branch handling in download_model_wrapper (#7506). Thanks, @Th-Underscore.
- UI: Improve border colors in light theme, fix code block copy button colors and centering, fix code block scrollbar flash during page load, improve past chats menu spacing.
- Pre-load MCP stdio server tools at startup. 🆕 - v4.6.1.
Security
- Fix SSRF vulnerabilities in URL fetching: add backslash and userinfo rejection, validate every redirect hop.
Bug fixes
- Fix Gemma 4 thinking tags not hidden after tool calls (#7509).
- Fix GPT-OSS channel tokens leaking in UI after tool calls.
- Fix Slider preprocess not handling None from cleared number input. 🆕 - v4.6.1.
- llama.cpp: Fix multimodal by using server's random media marker. 🆕 - v4.6.1.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/commit/6217b49583432f55014c2a0551f453d42b300530
- Update ik_llama.cpp to https://github.com/ikawrakow/ik_llama.cpp/commit/286ce324baed17c95faec77792eaa6bdb1c7a5f5
- Update ExLlamaV3 to 0.0.30
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
[!NOTE]
NVIDIA GPU: Ifnvidia-smireports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (766 MB) | Download (1.1 GB) |
| NVIDIA (CUDA 13.1) | Download (686 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (196 MB) | — |
| AMD (ROCm 7.2) | Download (499 MB) | — |
| CPU only | Download (178 MB) | Download (194 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (747 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (696 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (208 MB) | — |
| AMD (ROCm 7.2) | Download (307 MB) | — |
| CPU only | Download (190 MB) | Download (217 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (156 MB) |
| Intel (x86_64) | Download (162 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the
user_datafolder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/ <-- shared by both installs
- SSRF vulnerability fixes in URL fetching
- Tool call confirmation with inline buttons
- Stdio MCP server support via subprocess
- Preserve thinking chat template parameter
Full changelog
Changes
- Tool call confirmation: Add inline approve/reject/always-approve buttons that appear before each tool call is executed. Enable via the new "Confirm tool calls" checkbox in the Chat tab.
- Stdio MCP server support: In addition to HTTP MCP servers, you can now configure local subprocess-based MCP servers via
user_data/mcp.json, using the same format as Claude Desktop and Cursor. [Tutorial] preserve_thinkingchat template parameter: New UI checkbox and--preserve-thinkingCLI flag to control whether thinking blocks from prior turns are kept in the context.- UI: Sidebars overhaul: Sidebars now toggle independently and persist their state on page refresh. Default visibility adapts to viewport width.
- llama.cpp: Pass
--draft-min 48by default for draftless speculative decoding. - Only show the "Reasoning effort" and "Enable thinking" controls for models whose chat template actually uses them.
- Cache MCP tool discovery to avoid re-querying servers on each generation.
- Add model download branch handling in download_model_wrapper (#7506). Thanks, @Th-Underscore.
- UI: Improve border colors in light theme, fix code block copy button colors and centering, fix code block scrollbar flash during page load, improve past chats menu spacing.
Security
- Fix SSRF vulnerabilities in URL fetching: add backslash and userinfo rejection, validate every redirect hop.
Bug fixes
- Fix Gemma 4 thinking tags not hidden after tool calls (#7509).
- Fix GPT-OSS channel tokens leaking in UI after tool calls.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/commit/6217b49583432f55014c2a0551f453d42b300530
- Update ik_llama.cpp to https://github.com/ikawrakow/ik_llama.cpp/commit/286ce324baed17c95faec77792eaa6bdb1c7a5f5
- Update ExLlamaV3 to 0.0.30
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
[!NOTE]
NVIDIA GPU: Ifnvidia-smireports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (766 MB) | Download (1.1 GB) |
| NVIDIA (CUDA 13.1) | Download (686 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (196 MB) | — |
| AMD (ROCm 7.2) | Download (499 MB) | — |
| CPU only | Download (178 MB) | Download (194 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (747 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (696 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (208 MB) | — |
| AMD (ROCm 7.2) | Download (307 MB) | — |
| CPU only | Download (190 MB) | Download (217 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (156 MB) |
| Intel (x86_64) | Download (162 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the
user_datafolder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/ <-- shared by both installs
- Project renamed to TextGen; GitHub URL changed to github.com/oobabooga/textgen
- Reduce VRAM peak in prompt logprobs forward pass
- UI: Add sky-blue color for quoted text in light mode
- Logits display improvements
- Logits display improvements
- Add sky-blue color for quoted text in light mode
- Reduced VRAM peak in prompt logprobs forward pass
- GitHub repository URL changed to github.com/oobabooga/textgen
- Logits display improvements
- Light mode color enhancements
- To update a portable install, download and extract the latest version and replace the 'user_data' folder with your existing one.
- The 'cpu-moe' checkbox has been moved to extra flags.
- For NVIDIA GPUs, use the cuda13.1 build if nvidia-smi reports CUDA Version >= 13.1, otherwise use cuda12.4.
- Removed the deprecated 'settings' parameter from the model load endpoint in the API.
- MCP server support for using remote servers in the Chat tab
- Windows + ROCm portable builds added
- API responses for image generation now include embedded generation metadata
- Removed obsolete models/config.yaml forcing instruction template detection from model metadata
- Fixed ACL bypass via case-insensitive path matching on Windows/macOS
- Added server-side validation for Dropdown, Radio, and CheckboxGroup
- Sanitized filenames in prompt file operations to mitigate CWE-22
- Gemma 4 tool-calling support for API and UI
- ik_llama.cpp backend with advanced KV quantization and MoE optimizations
- API completions now accept echo and logprobs parameters returning per-token log probabilities
- Removed `models/config.yaml` and switched instruction template detection to model metadata, breaking custom template logic.
- Renamed internal log field from 'truncation length' to 'context length', affecting parsers expecting the old term.
- Fixed ACL bypass in Gradio due to case-insensitive path matching on Windows and macOS.
- Added server-side validation for Dropdown, Radio, and CheckboxGroup components.
- Blocked SSRF in superbooga extensions by validating URLs to prevent private/internal network access.
- Gemma 4 model now supports full tool-calling in UI and API.
- ik_llama.cpp backend adds KV cache rotation quantizations and MoE optimizations.
- API completions now accept echo and logprobs parameters with token-level probabilities.
- Fixed ACL bypass via case‑insensitive path matching on Windows and macOS
- Added server‑side validation for Gradio dropdown, radio, and checkbox group controls
- Mitigated SSRF in superbooga extensions by blocking private/internal network requests
- Gemma‑4 model integration with full tool‑calling support
- ik_llama.cpp backend with new quantizations and MoE optimizations
- API now supports echo, logprobs, and token‑level logprob IDs
Updated to v4.3.1! https://github.com/oobabooga/text-generation-webui/releases/tag/v4.3.1
- Removed the `higher_rank_limit` training parameter, changing VRAM usage and potentially affecting training stability.
- Deleted 52 legacy instruction templates, requiring updates to any custom prompts that used them.
- Anthropic-compatible `/v1/messages` endpoint supporting system messages, tool use, image inputs, and thinking blocks.
- Literal flag parsing in `--extra-flags` now accepts complex flag strings in a single argument.
- Portable builds now use a stripped Python distribution, reducing download size.
- Removed legacy Qwen3-Thinking and Qwen3-No-Thinking presets, affecting users who selected them.
- Changed default `ctx-size` to auto (0) when `--gpu-layers -1`, altering expected context size behavior.
- Added UI tool-calling support with collapsible accordion view for each function call.
- Introduced incognito chat mode for temporary, RAM-only sessions.
- Refactored reasoning extraction into a modular component supporting multiple model formats.
Updated to v4.1.1! https://github.com/oobabooga/text-generation-webui/releases/tag/v4.1.1
- Custom UI framework reduces startup latency and streaming overhead
- Parallel API requests increase llama.cpp throughput
- Training pipeline now supports chat data formats and applies LoRA to all linear layers
Chat messages now display cleaner tables and separators. Fixed model loading failures when EOS tokens are disabled and resolved symbolic-link errors in llama-cpp binaries. Updated llama.cpp and bitsandbytes, improving reliability of portable builds.