πŸ€– AI News Summary
2026-05-09 22:10 GMT+8 Β· summary_2026-05-09_22-10.md

Focused AI/dev subreddit roundup.

Full site: https://kkklobsterfarming.github.io/ai-news-summary-site/

What changed since last run


r/openai

  • No non-pinned/newsworthy posts fetched after filtering.

r/LocalLLaMA

  1. Qwen 35B-A3B is very usable with 12GB of VRAM

    • Hardware: RTX 3060 12GB, 32GB DDR4-3200, Windows, CUDA 13.x. Model: `Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf`. The model is a 35B MoE, so `-ncmoe` matters a lot: a lower `-ncmoe` keeps more MoE blocks on the GPU.
    • Timestamp: 2026-05-09 05:22 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is mostly positive. Top reactions focus on ngl these numbers are way better than i expected for a 35b moe on 12gb. kinda wanna try the mtp setup on my 4060 now | With similar setup 12gb vram 3070 context is fine up to 90k minimum, have yet to try higher. I…
    • Author: /u/jwestra
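For readers wanting to reproduce this on their own 12GB card, a launch along these lines is the usual shape; the flag values below are illustrative assumptions, not the OP's exact config (the post only names `-ncmoe`):

```shell
# Hypothetical sketch, not the OP's exact command. -ngl 99 offloads all
# dense layers to the GPU; --n-cpu-moe N (aka -ncmoe) keeps the expert
# (MoE) weights of the last N layers on the CPU. Lower N = more experts
# on GPU = faster, until VRAM runs out.
llama-server \
  --model Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
  -ngl 99 \
  --n-cpu-moe 24 \
  --ctx-size 32768
```

Tuning loop: start with a high `--n-cpu-moe`, lower it until VRAM is nearly full, and keep the last value that doesn't OOM.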
  2. Testing MiMo-V2.5-IQ3_S with 1'048'576 context

    • [Image: Testing MiMo-V2.5-IQ3_S with 1'048'576 context] llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 1048576 --threads 16 --host 127.0.0.1 --no-mmap…
    • Timestamp: 2026-05-09 17:10 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is split between critical and joking. Top reactions focus on Yes it is fast, but I found this IQ3_S quant to be kinda bad: In a few tests that I did it got stuck into reasoning loop. | Note that temp 1.0 is the recommended official temperature. Setting it to…
    • Author: /u/LegacyRemaster
  3. Got MTP + TurboQuant running β€” Qwen3.6-27B – 80+ t/s at 262K context on a single RTX 4090

    • I’ve been messing around trying to get MTP working alongside TBQ4_0 (TurboQuant’s lossless 4.25 bpv KV cache) on Qwen3.6-27B for my own use. After a day of vibecoding, I think I may have gotten something viable.
    • Timestamp: 2026-05-09 05:15 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is mostly positive. Top reactions focus on Current implementations of TBQ are not nearly lossless, that’s why they are not merged into mainline llama.cpp AFAIK | Exactly, it was not working as the google’s paper described. Overall sentiment β€” post: positive;…
    • Author: /u/indrasmirror
  4. 80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

    • Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec with 80%+ draft…
    • Timestamp: 2026-05-09 19:57 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is mostly positive. Top reactions focus on Hey OP thank you so much for this. I have an underutilized 5070ti and I’m going to try this out. Hopefully this weekend. | Btw did you try DeepSeekV4? I’m kinda curious for this model too.. Overall sentiment β€” post:…
    • Author: /u/janvitos
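To put the "80%+ draft acceptance" figure in context, the expected speedup from speculative/MTP decoding can be estimated from the per-token acceptance rate. A back-of-the-envelope sketch, assuming independent acceptances and a negligible-cost draft (a simplification; real draft overhead lowers the gain):

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens accepted per target-model forward pass for
    speculative decoding with draft length k and per-token acceptance
    probability p (assumed i.i.d.): (1 - p^(k+1)) / (1 - p)."""
    if p >= 1.0:
        return k + 1.0
    return (1.0 - p ** (k + 1)) / (1.0 - p)

# At the ~80% acceptance the post reports, a 4-token draft yields
# roughly 3.4 tokens per verification pass instead of 1.
print(round(expected_tokens_per_pass(0.8, 4), 2))  # β†’ 3.36
```

This is why acceptance rate, not raw draft speed, dominates the observed tok/sec.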
  5. Tribute to April’s LLM releases

    • [Image: Tribute to April’s LLM releases] April 2026 was a turning point for local LLMs.
    • Timestamp: 2026-05-09 07:48 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is mostly positive. Top reactions focus on Great concept and execution, felt like a techno GLaDOS. But my biggest takeaway from this video is… is that how you guys pronounce GGUF? | never ask r/llocallama how they pronounce gguf. Overall…
    • Author: /u/Everlier

r/llmdevs

  1. Deterministic execution analysis for multi-step LLM workflows (open source)
    • [Image: Deterministic execution analysis for multi-step LLM workflows (open source)] X-Ray is a deterministic execution-analysis engine for multi-step LLM workflows. It evaluates execution structure rather than output quality.
    • Timestamp: 2026-05-09 00:35 GMT+8
    • Author: /u/velorynintel
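The post doesn't show X-Ray's internals, but as an illustration of what "evaluates execution structure rather than output quality" can mean, here is a minimal, hypothetical checker that validates a workflow trace against declared step dependencies (the names and shape are assumptions, not X-Ray's API):

```python
def check_trace(deps: dict[str, set[str]], trace: list[str]) -> list[str]:
    """Deterministically validate a multi-step run: every step must appear
    after all of its declared dependencies. Returns a list of violations
    (empty list = structurally valid). Output quality is never inspected."""
    seen: set[str] = set()
    violations = []
    for step in trace:
        missing = deps.get(step, set()) - seen
        if missing:
            violations.append(f"{step} ran before {sorted(missing)}")
        seen.add(step)
    return violations

deps = {"retrieve": set(), "rerank": {"retrieve"}, "answer": {"rerank"}}
print(check_trace(deps, ["retrieve", "rerank", "answer"]))  # []
print(check_trace(deps, ["retrieve", "answer", "rerank"]))  # ["answer ran before ['rerank']"]
```

The appeal of structure-only checks is that they are deterministic and need no judge model, which fits the "deterministic execution analysis" framing.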

r/OpenWebUI

  1. Never worked

    • I’m using Open WebUI via Docker and Ollama. I want to use it as a gamemaster to run TTRPGs, and I want to upload a world bible into a knowledge base.
    • Timestamp: 2026-05-09 11:35 GMT+8
    • Community: Community reaction (heuristic-fallback): Top reactions focus on Could you share some examples? Like conversation history, screenshots, etc? By the way which model were you using? | Thanks, yeah I’d thought about using silly tavern but that looked more complicated to set up campaigns itself and so I’d went with…
    • Author: /u/zerocool647
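A common culprit with large lore documents in a knowledge base is chunking: if a retrieval hit lands mid-paragraph, the model gets context-free fragments. Pre-splitting the world bible into overlapping chunks before upload can help; a sketch of the idea (the sizes are assumptions, and this is generic pre-processing, not an Open WebUI API):

```python
def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping chunks so each retrieval
    hit carries enough surrounding lore to stand on its own."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share context
    return chunks

bible = "lore " * 1000          # stand-in for a world bible
parts = chunk_text(bible)
```

Splitting on chapter or section boundaries instead of fixed character counts usually works even better for setting documents.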
  2. Open WebUI v0.9.3 (and v0.9.4) is out β€” massive performance wins, message editing finally fixed

    • The big stuff πŸš€ Massive performance improvements to loading - Chat history maps now load from normalized message records, slashing…
    • Timestamp: 2026-05-09 16:23 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is split between critical and positive. Top reactions focus on New version coming out in a few mins - notes fixed | This happened the last few releases with different big features. Maybe they need a pre-release that a few heavy users use to catch huge…
    • Author: /u/ClassicMain
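The release notes don't show the schema, but "normalized message records" typically means storing each message once in a flat map keyed by id and rebuilding a thread by walking parent links, instead of loading one deeply nested history object. A sketch of that pattern (field names are assumptions, not Open WebUI's actual schema):

```python
# Flat store: one record per message, keyed by id (the "normalized" shape).
messages = {
    "m1": {"parent": None, "role": "user", "text": "hi"},
    "m2": {"parent": "m1", "role": "assistant", "text": "hello"},
    "m3": {"parent": "m2", "role": "user", "text": "edit me"},
}

def thread(leaf_id: str) -> list[str]:
    """Rebuild one branch of the chat by following parent links from the
    leaf. Editing a message touches only its own record; no nested tree
    has to be rewritten, which is why edits get cheaper under this shape."""
    out = []
    cur = leaf_id
    while cur is not None:
        out.append(messages[cur]["text"])
        cur = messages[cur]["parent"]
    return out[::-1]

print(thread("m3"))  # β†’ ['hi', 'hello', 'edit me']
```

This also explains the "message editing finally fixed" half of the headline: edits become single-record updates.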

r/selfhosted

  1. Technitium now supports Single Sign-on with OIDC
    • So I am just reading the release notes before updating my technitium instance and then there it was, OIDC support!!! I haven’t seen it mentioned here yet, but it has been a blessing.
    • Timestamp: 2026-05-09 04:53 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is split between critical and skeptical. Top reactions focus on Expand the replies to this comment to learn how AI was used in this post/project. | OIDC support on infra tools is such a nice quality-of-life upgrade. Anything that reduces one-off local…
    • Author: /u/Harry_Butz

r/ClaudeAI

  1. I gave claude the worst prompt but it still made something cool
    • [Image: I gave claude the worst prompt but it still made something cool] I was trying to search for “video game roguelike with medieval fantasy themes” but with the world’s worst prompt (which the sub has roasted me for. Thanks, you guys)…
    • Timestamp: 2026-05-09 20:50 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is mostly positive. Top reactions focus on That’s impressive. Claude has so much potential for game dev. I’m working on 2 games right now and making full use of Claude. I think… | I think the zero-to-one aspect of Claude is always so fascinating….
    • Author: /u/Financial-Coffee-380

r/ClaudeCode

  1. What are your best HARNESS/PLUGINS/SKILLS/MCPS you use with Claude Code?

    • I’ve been using Claude Code for a while, and I just realized I’m just getting started. Got to know about /compact just yesterday.
    • Timestamp: 2026-05-09 18:59 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is split between joking and concerned. Top reactions focus on You definitely don’t want to do what everyone does and put everything in md files or have 20+ skills. As this will drive up token… | i made my own little harness using claude code, to work the…
    • Author: /u/AssociationSure6273
  2. Which token optimizer would you recommend ?

    • I’ve found a few compression-tool projects that do the job, but I’m not sure whether they’re complementary or can be used in parallel. Most claim the same RTK: “CLI proxy that reduces LLM token consumption by 60-90% on common dev commands.”…
    • Timestamp: 2026-05-09 20:51 GMT+8
    • Community: Community reaction (heuristic-fallback): The comment section is split between positive and skeptical. Top reactions focus on I would not stack all of these at first. You’ll save tokens and lose debuggability. Repomix is useful when you need a one-time β€œread this… | I think you have to actually measure it in your own…
    • Author: /u/zakblacki
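The "measure it in your own setup" advice from the comments is easy to act on: run the same command output through each proxy and compare token counts. A minimal sketch using a whitespace split as a stand-in tokenizer (a real measurement would use your model's actual tokenizer, e.g. tiktoken, and your real dev commands):

```python
def approx_tokens(text: str) -> int:
    """Crude proxy: whitespace tokens. Swap in a real tokenizer to measure."""
    return len(text.split())

def savings(before: str, after: str) -> float:
    """Fraction of tokens a compression proxy removed from some output."""
    b = approx_tokens(before)
    return (b - approx_tokens(after)) / b if b else 0.0

# Toy data: verbose test output vs a hand-compressed summary of it.
raw = "PASS test_a\nPASS test_b\nFAIL test_c expected 1 got 2\n" * 50
compressed = "100 passed, 50 failed; first failure: test_c expected 1 got 2"
print(f"{savings(raw, compressed):.0%}")  # β†’ 98% for this toy data
```

Measuring per command also shows where the commenter's debuggability warning bites: the biggest savings come from exactly the verbose output you need when something breaks.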

r/Codex

  • No non-pinned/newsworthy posts fetched after filtering.

Generated 2026-05-09 22:10 GMT+8 | Next update in 2 hours