Autonomous data scientist

The problem it targets is research integrity under automation. If an agent is allowed to download data, run analyses, and report findings on its own, the obvious failure modes are p-hacking, holdout leakage, hypothesizing after results are known, and ungrounded claims. This project's answer is to encode the discipline of pre-registered, holdout-validated science as an enforced pipeline rather than a guideline. The domain is fixed: mobile health, just-in-time adaptive interventions, ecological momentary assessment, and wearables, seeded from roughly two hundred paper and note nodes in a graph store.

The architecture is a thin conductor driving ten ordered phases: pick a topic, check dataset eligibility, pre-register, fetch data, split a holdout, explore, lock the analysis, validate, critique, and synthesize. State persists across sessions in an atomically written state file plus a lockfile with stale-lock reclamation, so a run resumes exactly where it stopped.

Several safeguards are enforced in code rather than stated as policy. A dedicated eligibility phase reads codebooks and license terms before any hypothesis is sealed, without opening analysis rows. The hypothesis is pre-registered as a hashed document with an estimand, confounders, a causal diagram, an identification strategy, and an explicit split plan. Row-level random splitting is forbidden for repeated-measures data, because rows from the same participant would land on both sides of the split; the plan defaults to participant or time-block splits instead. Fetching runs inside a subprocess egress jail that only permits allow-listed hosts. The validation runner enforces script-hash and holdout-hash matching plus a single-run ledger, so the held-out split is opened exactly once, and only by the frozen, committed analysis script. The critique phase refuses to pass unless it produces threats in every required category: confounding, power, data quality, multiple testing, and leakage.
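
The hash-gated, single-run validation described above can be illustrated with a minimal sketch. This is not the project's actual runner. The function names, the `prereg` dictionary keys, and the one-ledger-file-per-hypothesis layout are all assumptions made for illustration. The idea is only that the runner recomputes both hashes, compares them to the pre-registered values, and then claims the run by creating a ledger file exclusively, so a second attempt fails.

```python
import hashlib
import json
from pathlib import Path


class ValidationRefused(Exception):
    """Raised when a validation run is not permitted."""


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def authorize_validation(script: Path, holdout: Path, prereg: dict, ledger: Path) -> dict:
    """Check both hashes against the pre-registration, then claim the single run.

    `prereg` is a hypothetical record assumed to hold the keys
    'id', 'script_sha256', and 'holdout_sha256'.
    """
    if sha256_file(script) != prereg["script_sha256"]:
        raise ValidationRefused("analysis script differs from the locked version")
    if sha256_file(holdout) != prereg["holdout_sha256"]:
        raise ValidationRefused("holdout file differs from the registered split")

    entry = {"prereg_id": prereg["id"], "script_sha256": prereg["script_sha256"]}
    try:
        # Mode "x" creates the file exclusively and fails if it already exists,
        # so only the first caller can record a run for this hypothesis.
        with open(ledger, "x", encoding="utf-8") as fh:
            json.dump(entry, fh)
    except FileExistsError:
        raise ValidationRefused("holdout has already been opened once") from None
    return entry
```

In this sketch the hashes are checked before the ledger is written, so a refused attempt with a tampered script does not consume the single permitted run.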

The data layer is concrete: curated public sources such as PhysioNet and Zenodo under permissive licenses, with a catalogue of clinically substantive datasets carrying participant fields, feature definitions, and pre-written hypothesis seeds and analysis templates. Analysis is CPU-only because the host GPU is reserved, and storage is bounded by a hard disk cap with least-recently-used recycling. Each completed cycle emits a Synthesis node linked by edges to the Hypothesis, Experiment, Result, and Critique nodes and to relevant prior literature.
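
One way a hard disk cap with least-recently-used recycling might work is sketched below. The function name `recycle_lru` and the use of file modification time as the recency signal are assumptions, not details from the project. A real implementation could instead record last use in its own index, since access times are often unreliable on mounted filesystems.

```python
from pathlib import Path


def recycle_lru(cache_dir: Path, cap_bytes: int) -> list[Path]:
    """Delete the least recently used files until the directory fits under the cap.

    Recency is approximated by modification time (an assumption of this sketch).
    Returns the removed paths, oldest first.
    """
    entries = []
    for path in cache_dir.rglob("*"):
        if path.is_file():
            stat = path.stat()
            entries.append((stat.st_mtime, stat.st_size, path))

    total = sum(size for _, size, _ in entries)
    removed = []
    # Walk from the oldest file to the newest, stopping once under the cap.
    for _, size, path in sorted(entries, key=lambda e: e[0]):
        if total <= cap_bytes:
            break
        path.unlink()
        total -= size
        removed.append(path)
    return removed
```

Calling this after each fetch would keep the cache under the cap while preserving the most recently touched datasets.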

For context, the system has reached a meaningful but bounded milestone: a real autonomous loop completed sixteen research cycles end-to-end against live dataset downloads and a real graph store, with LLM-driven topic selection, exploration, and critique, before being paused by an external rate limit. Findings to date have been modest and include null results, which is the expected and even healthy output of a pipeline whose whole point is to make false positives hard to manufacture. The design spec itself was hardened after an adversarial review, which led to the addition of the eligibility phase, the process-level jail, and hash-gated validation.