LLM knowledge wiki — Junghwan Park

The project addresses a recurring problem with note-taking over research literature: documents are too coarse to retrieve precisely, and free-text tags drift so the same concept gets different names across domains. llmwiki's answer is to make the passage — about one paragraph of a source — the unit of knowledge. Every passage page holds an "original" block (a short, citation-safe quote) and a "gist" block (a faithful paraphrase), so each claim is traceable to its source while remaining searchable as a clean semantic unit. Lists, tables, and figure captions count as single passages; equation-only paragraphs fold into their explanatory neighbor.
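As a rough illustration of the passage unit described above, the following sketch models a passage page and the segmentation rule. The names (`Passage`, `segment`, the block kinds) are hypothetical and not taken from llmwiki itself. The choice to fold an equation into the preceding passage when one exists, and otherwise into the following one, is an assumption about what "explanatory neighbor" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """One passage page (field names are illustrative assumptions)."""
    passage_id: str  # hypothetical identifier scheme
    source_id: str   # the document this passage was taken from
    original: str    # short, citation-safe quote
    gist: str        # faithful paraphrase; the searchable semantic unit


def segment(blocks):
    """Group (kind, text) blocks into passage texts.

    kind is one of "paragraph", "list", "table", "caption", "equation".
    Lists, tables, and captions each become a single passage.
    Equation-only blocks are folded into a neighbor: the previous
    passage if there is one, else the next one (an assumption).
    """
    passages = []
    pending = []  # equations seen before any explanatory text
    for kind, text in blocks:
        if kind == "equation":
            if passages:
                passages[-1] = passages[-1] + "\n" + text
            else:
                pending.append(text)
        else:
            if pending:
                text = "\n".join(pending + [text])
                pending = []
            passages.append(text)
    if pending:  # document contained only equations
        passages.append("\n".join(pending))
    return passages
```

Keeping `original` and `gist` as separate fields mirrors the stated goal: the quote preserves traceability, while the paraphrase is what gets searched.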

The system is organized as a strict four-layer data flow with explicit transformations between layers: raw holds immutable source text and metadata (downloaders only — no LLM output ever lands here); screening holds triage, exclusion lists, and structured local-model filtering output; reading holds canonical deep-read prose notes from a single designated reader model; and wiki holds the finished, schema-conformant pages. Pages come in five kinds — documents (a thin table of contents over their passages), passages (the core unit), entities, concepts, and synthesis — and entity, concept, and synthesis pages assemble knowledge by citing passages, so every higher-level assertion remains traceable down to a quoted source.
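The layer ordering and the traceability rule can be sketched as follows. This is a minimal model under stated assumptions: the enum and function names are hypothetical, "strict" is read as each transformation moving exactly one layer forward, and a page counts as traceable if it cites at least one existing passage page.

```python
from enum import Enum


class Layer(Enum):
    RAW = "raw"              # immutable source text and metadata
    SCREENING = "screening"  # triage, exclusion lists, filtering output
    READING = "reading"      # canonical deep-read prose notes
    WIKI = "wiki"            # finished, schema-conformant pages


class PageKind(Enum):
    DOCUMENT = "document"
    PASSAGE = "passage"
    ENTITY = "entity"
    CONCEPT = "concept"
    SYNTHESIS = "synthesis"


LAYER_ORDER = list(Layer)  # Enum iteration follows definition order
CITING_KINDS = {PageKind.ENTITY, PageKind.CONCEPT, PageKind.SYNTHESIS}


def next_layer(layer):
    """Return the layer a transformation may write into, or None after wiki."""
    idx = LAYER_ORDER.index(layer)
    return LAYER_ORDER[idx + 1] if idx + 1 < len(LAYER_ORDER) else None


def is_allowed_write(layer, llm_generated):
    """No LLM output may ever land in the raw layer."""
    return not (layer is Layer.RAW and llm_generated)


def untraceable_pages(pages):
    """Return ids of entity/concept/synthesis pages citing no known passage.

    pages maps page_id -> (PageKind, list of cited page ids).
    """
    passage_ids = {pid for pid, (kind, _) in pages.items()
                   if kind is PageKind.PASSAGE}
    return sorted(
        pid for pid, (kind, cites) in pages.items()
        if kind in CITING_KINDS and not (set(cites) & passage_ids)
    )
```

A check like `untraceable_pages` could run as a lint step before pages are accepted into the wiki layer, though whether llmwiki enforces the rule this way is not stated.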

Two cross-cutting systems give the wiki its coherence. An embeddings sidecar mirrors the wiki tree (managed only through an indexer tool, marked binary in git to suppress diff noise) and embeds each passage's gist; cosine similarity drives both natural-language search and link recommendation, with a deliberately conservative rule that similar-passage links are proposed but only added after user approval. An auto-managed ontology — a node-type catalog, a relation catalog, and a canonical-terms dictionary mapping aliases to canonical IDs — cuts across all domains, so a concept like JITAI has one canonical identifier everywhere it appears, even though domains are otherwise physically separated by directory and namespace.
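The two mechanisms above can be sketched in a few lines. Everything here is illustrative: the function names, the 0.8 similarity threshold, the in-memory dict standing in for the embeddings sidecar, and the `concept:jitai` identifier format are assumptions, not details of the actual indexer or ontology files.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec, gist_vecs, top_k=3):
    """Rank passages by similarity of their gist embedding to the query."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in gist_vecs.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, ties by id
    return [(pid, score) for score, pid in scored[:top_k]]


def propose_links(gist_vecs, threshold=0.8):
    """Propose similar-passage pairs; nothing is written at this stage."""
    ids = sorted(gist_vecs)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine(gist_vecs[a], gist_vecs[b]) >= threshold
    ]


def apply_links(proposals, approved, links):
    """Add only the proposals the user has explicitly approved."""
    for pair in proposals:
        if pair in approved:
            links.add(pair)
    return links


# Hypothetical canonical-terms dictionary: alias -> canonical ID.
CANONICAL_TERMS = {
    "jitai": "concept:jitai",
    "just-in-time adaptive intervention": "concept:jitai",
}


def canonical_id(term, terms):
    """Resolve an alias to its canonical ID, or None if unknown."""
    return terms.get(term.strip().lower())
```

Separating `propose_links` from `apply_links` keeps the conservative rule visible in the code path: a high similarity score alone never creates a link.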

The design philosophy is explicit about division of labor and guardrails: humans curate and question, agents write and tidy, and certain moves (changing the ontology's structure, large reorganizations, cross-domain links) require user sign-off rather than autonomous action. One caveat on scope: this is a private, single-user knowledge base with an evolving schema, and its operating manual is written for the agents that maintain it rather than as a public product. The artifact to evaluate is therefore the architecture and discipline of the system, not a fixed corpus.