
Engineering · 2025

Light Thinker

The premise is that small local models can sustain autonomous, open-ended reasoning if the surrounding system, not the model, carries the hard parts. The "Lightweight Thinking System" runs indefinitely: it explores topics, deepens its understanding, and discovers new questions on its own, treating user messages as high-priority interrupts rather than as the trigger that starts it. To keep the small model's job simple, the model never emits JSON or makes routing decisions. It writes natural language with section markers, and a centralized backend parses those markers and decides what happens next.
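As an illustration of that division of labour, here is a minimal sketch of the parsing side. The write-up does not specify the marker syntax, so the `## NAME` line format and the section names `THOUGHTS` and `SEARCH` are assumptions made up for this example, as is the function name `parse_sections`.

```python
import re

# Hypothetical marker format: a line such as "## THOUGHTS" opens a section.
# The source only says the model writes "section markers"; the exact syntax
# and the section names used here are assumptions for illustration.
MARKER = re.compile(r"^##\s*([A-Z_]+)\s*$")


def parse_sections(text: str) -> dict[str, str]:
    """Split free-form model output into {section_name: body}."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        match = MARKER.match(line)
        if match:
            # A marker line starts (or reopens) a section.
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
        # Text before the first marker is ignored rather than guessed at.
    return {name: "\n".join(body).strip() for name, body in sections.items()}


if __name__ == "__main__":
    output = (
        "## THOUGHTS\n"
        "Tidal locking seems relevant here.\n"
        "## SEARCH\n"
        "tidal locking exoplanets\n"
    )
    print(parse_sections(output))
```

The point of the design is that a malformed or missing marker degrades to "no such section" instead of a JSON parse failure, which is easier for a small model to get right.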

The architecture has three tiers. Grade 1 is a persistent interface proxy, one per agent: it receives Telegram webhooks, routes messages to the correct room, attributes each message to a human or another bot, and drops it into a per-room thought journal. Grade 2 is the ephemeral orchestrator: spawned for a single execution loop, it reasons about the current topic and then terminates. The backend reads its output and decides whether to spawn a search or read worker, report to the user, mark a topic done, enqueue discovered topics, or simply continue thinking. Grade 3 workers are atomic and can recurse up to a bounded depth, handling web search, PDF download-and-extract with OCR fallback, summary fetching, and code or analysis tasks. Each generation passes a summary "baton" to its successor and then dies, so context stays clean and token cost stays low.
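The baton hand-off and the bounded worker recursion can be sketched roughly as follows. Everything named here is hypothetical: the `Baton` fields, the `think` callable standing in for a model call, the `max_generations` cap (the real system is described as running indefinitely), and the depth bound of 2 are assumptions, not details taken from the project.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Baton:
    """The only state one generation hands to the next (assumed shape)."""
    topic: str
    summary: str
    generation: int


def run_generations(
    topic: str,
    think: Callable[[str, str], str],
    max_generations: int,
) -> Baton:
    """Run short-lived orchestrators in sequence, carrying only a summary."""
    baton = Baton(topic=topic, summary="", generation=0)
    for _ in range(max_generations):
        # Each call sees the topic and the previous summary, nothing else,
        # so prompt size does not grow with the number of generations.
        new_summary = think(baton.topic, baton.summary)
        baton = Baton(baton.topic, new_summary, baton.generation + 1)
    return baton


MAX_WORKER_DEPTH = 2  # assumed bound; the source only says depth is bounded


def run_worker(task: str, depth: int = 0) -> list[str]:
    """Worker that may spawn a follow-up worker, never past the depth bound."""
    if depth > MAX_WORKER_DEPTH:
        return []
    results = [f"depth {depth}: {task}"]
    # Stand-in for "this task produced one follow-up task".
    results += run_worker(f"follow-up of {task}", depth + 1)
    return results


if __name__ == "__main__":
    # Fake model call: appends a note to whatever summary it was given.
    final = run_generations("tides", lambda t, s: (s + " noted").strip(), 3)
    print(final)
    print(run_worker("find sources"))
```

Because the baton is frozen and rebuilt on every step, a generation cannot leak extra state to its successor by accident.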

Several mechanisms make long autonomous runs survivable, all implemented in the backend rather than asked of the model. Consensus inference runs a critical prompt several times with different seeds and takes the majority answer. Summary compaction periodically merges recent summaries into one clean consolidated state to prevent drift over long chains. Stall detection watches for repetition, empty outputs, and stuck delegation loops, then injects nudges or forces a topic transition. Because multiple agents can share a group chat, anti-loop protection gives bot messages lower priority than human ones and enforces per-room cooldowns and hourly rate limits so two agents cannot spiral into an infinite exchange. A security layer blocks requests to private or reserved addresses and logs outbound traffic, and all data is scoped per agent-and-room pair for isolation.
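Two of these mechanisms are small enough to sketch with the Python standard library alone: the majority vote behind consensus inference and the address check behind the security layer. The function names, the `ask` callable standing in for a seeded model call, and the answer normalisation are assumptions for illustration. The project's actual rules may differ.

```python
import ipaddress
from collections import Counter
from typing import Callable


def consensus(ask: Callable[[str, int], str], prompt: str, seeds: list[int]) -> str:
    """Ask the same prompt once per seed and return the most common answer."""
    if not seeds:
        raise ValueError("at least one seed is required")
    # Normalise lightly so "Yes" and "yes " count as the same vote.
    answers = [ask(prompt, seed).strip().lower() for seed in seeds]
    # On a tie, Counter keeps first-seen order, so the earliest answer wins.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def is_blocked_address(host_ip: str) -> bool:
    """Return True for addresses an outbound fetch should refuse."""
    try:
        addr = ipaddress.ip_address(host_ip)
    except ValueError:
        return True  # not a literal IP address: refuse rather than guess
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


if __name__ == "__main__":
    # Fake model: answers "no" only when the seed is a multiple of three.
    fake = lambda prompt, seed: "YES" if seed % 3 else "no"
    print(consensus(fake, "Is this topic exhausted?", [1, 2, 3]))
    for ip in ("10.0.0.5", "127.0.0.1", "8.8.8.8"):
        print(ip, is_blocked_address(ip))
```

A check like this only helps if it runs on the address a hostname actually resolves to, and again after any redirect. Checking the URL string alone would miss a public name that points at a private address.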

The project is implemented through its planned operational phases: Telegram routing, local-model integration, the full reasoning loop, recursive workers, security hardening, and scheduled data cleanup. It comes with an integration and unit test suite and a monitoring dashboard for agents, rooms, journals, and live status. Honest context: it targets local-first single-node deployment, depends on a locally running model server and PostgreSQL plus PDF and OCR tooling, and its own notes list production deployment and multi-node scaling as remaining steps. Reasoning quality is fundamentally bounded by the small models it is designed around. The framework's contribution is the scaffolding that makes them usable autonomously, not the model itself.
