How it works

Two frequency tables, a max log-ratio, and a threshold calibrated on your own code.

argot is deliberately simple. There is no neural network at scoring time — the model is two token-frequency distributions and a maximum log-likelihood ratio. That’s the whole idea, and it’s why argot fits in seconds and scores in milliseconds, entirely on CPU.

The mental model

A regex catches what you can write down. A type checker catches what you can prove. argot catches what your team has implicitly agreed on by repetition — naming patterns, error-handling shapes, control-flow idioms, the difference between response.raise_for_status() and if response.status_code >= 400: raise.

It builds two distributions:

the repo distribution: how tokens are used across your codebase’s history, and
the generic baseline: how tokens are used across a broad open-source corpus, bundled with argot.

A hunk is suspicious when at least one of its tokens is far more likely under the generic baseline than under your repo. High surprise means “this looks like generic open-source code, not code from here.”
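To make the two distributions concrete, here is a minimal sketch of turning token counts into log-probability tables. It is not argot's implementation: the helper name `log_prob_table`, the add-alpha smoothing, and the toy token lists are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def log_prob_table(tokens, vocab, alpha=1.0):
    """Map each vocab token to a smoothed log-probability.

    Add-alpha smoothing is an assumption here; it keeps unseen tokens
    from getting probability zero (and a log of minus infinity).
    """
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    return {t: math.log((counts[t] + alpha) / total) for t in vocab}


# Toy corpora: the "repo" prefers raise_for_status, the "baseline" prefers status_code.
repo_tokens = ["raise_for_status", "raise_for_status", "response", "response", "status_code"]
baseline_tokens = ["status_code", "status_code", "status_code", "response", "response", "raise_for_status"]
vocab = set(repo_tokens) | set(baseline_tokens)

log_p_repo = log_prob_table(repo_tokens, vocab)
log_p_baseline = log_prob_table(baseline_tokens, vocab)

for token in sorted(vocab):
    ratio = log_p_baseline[token] - log_p_repo[token]
    print(f"{token:18s} log-ratio = {ratio:+.3f}")
```

In this toy data `status_code` gets a positive log-ratio (more typical of the baseline than of the repo) and `raise_for_status` a negative one, which matches the intuition described above.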

Two phases

The pipeline splits into fit (run once per repo, and after major refactors) and check (run on every diff).

Run extract → train → calibrate once; the calibrated threshold feeds every check.

Fit

extract: walks git log, slices each commit into hunks, and tokenizes every hunk and its surrounding context with a language-aware tree-sitter tokenizer. Output: .argot/dataset.jsonl.
train: counts BPE tokens across the repo’s non-test source files (the repo distribution) and loads the bundled generic baseline. Data-dominant files (locale tables, fixtures, generated code) are excluded so they don’t pollute the distribution.
calibrate: samples representative top-level functions and classes from your repo, scores them, and sets the threshold to the maximum score over those “normal” hunks. Multi-language repos get one threshold per language.

argot fit runs all three for you and writes .argot/scorer-config.json.
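The calibrate step can be sketched in a few lines. This is an illustration under stated assumptions, not argot's code: the function name `calibrate`, the `(language, score)` input shape, and the sample scores are hypothetical.

```python
from collections import defaultdict


def calibrate(scored_samples):
    """Return one threshold per language: the max score among sampled 'normal' hunks.

    scored_samples is an iterable of (language, score) pairs (a hypothetical shape).
    """
    thresholds = defaultdict(lambda: float("-inf"))
    for language, score in scored_samples:
        thresholds[language] = max(thresholds[language], score)
    return dict(thresholds)


# Made-up scores for sampled functions and classes from the repo itself.
samples = [
    ("python", 1.2),
    ("python", 2.7),
    ("typescript", 0.9),
    ("python", 2.1),
    ("typescript", 1.4),
]
thresholds = calibrate(samples)
print(thresholds)
```

Because a hunk is flagged only when its score exceeds the threshold, taking the maximum means none of the sampled "normal" hunks would be flagged by the threshold they produced.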

Check

For each changed hunk, argot runs a short pipeline:

Typicality filter — skip hunks that are structurally data-dominant (mostly literals) or live in a data-dominant file. The n-gram model would only see noise there.
Import checker — if a hunk imports a module that’s foreign to the repo’s own first-party import set, flag it immediately (reason: import).
BPE scorer — compute the max-surprise score, adjusted by a small per-callee penalty, and flag the hunk if the adjusted score exceeds the calibrated threshold.
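The three stages above can be read as a short chain of early returns. The sketch below is a hypothetical rendering of that control flow: the `Hunk` fields, the `max_literal_fraction` cutoff, the `"score"` reason string, and the way the per-callee adjustment is passed in as a plain number are all assumptions. Only the stage order and the `import` reason come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Hunk:
    tokens: list               # tokens of the changed hunk, comments already blanked
    imports: set               # module names imported inside the hunk
    literal_fraction: float    # share of the hunk that is literals (hypothetical measure)
    in_data_file: bool = False


def check_hunk(hunk, *, known_imports, score_fn, threshold,
               callee_adjustment=0.0, max_literal_fraction=0.8):
    # 1. Typicality filter: data-dominant hunks or files are skipped, not scored.
    if hunk.in_data_file or hunk.literal_fraction > max_literal_fraction:
        return ("skipped", "data-dominant")
    # 2. Import checker: any module outside the repo's known set is flagged at once.
    if hunk.imports - known_imports:
        return ("flagged", "import")
    # 3. Scorer: compare the adjusted max-surprise score with the calibrated threshold.
    adjusted = score_fn(hunk.tokens) + callee_adjustment
    if adjusted > threshold:
        return ("flagged", "score")
    return ("ok", None)


def toy_score(tokens):
    # Stand-in for the real max-surprise score, for demonstration only.
    return 3.0 if "status_code" in tokens else 0.5


hunk = Hunk(tokens=["response", "status_code"], imports=set(), literal_fraction=0.1)
result = check_hunk(hunk, known_imports={"requests"}, score_fn=toy_score, threshold=2.0)
print(result)
```

Ordering the cheap structural checks first means the scorer only runs on hunks where a token-frequency signal is meaningful.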

The math, in one line:

$$
\text{surprise}(t) = \log P_{\text{baseline}}(t) - \log P_{\text{repo}}(t),
\qquad
\text{score}(\text{hunk}) = \max_{t \,\in\, \text{tokens}(\text{hunk})} \text{surprise}(t)
$$

A high score means at least one token is far more common in generic code than in this repo — a reliable signal of foreign style. Comments and docstrings are blanked before scoring, so natural language doesn’t inflate the signal.
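As a worked example of the max log-ratio, the sketch below scores two toy hunks against hand-written probability tables. The probabilities, token names, and function names are illustrative assumptions, not values from argot or its bundled baseline.

```python
import math

# Hand-picked toy probabilities (assumptions for illustration only).
log_p_baseline = {
    "response": math.log(0.30),
    "raise_for_status": math.log(0.05),
    "status_code": math.log(0.20),
}
log_p_repo = {
    "response": math.log(0.30),
    "raise_for_status": math.log(0.20),
    "status_code": math.log(0.02),
}


def surprise(token):
    """Log-likelihood ratio: positive when the token is more typical of the baseline."""
    return log_p_baseline[token] - log_p_repo[token]


def score(tokens):
    """Max surprise over the hunk's tokens; an empty hunk scores minus infinity."""
    return max((surprise(t) for t in tokens), default=float("-inf"))


house_style = ["response", "raise_for_status"]
foreign_style = ["response", "status_code"]

print(f"house style:   {score(house_style):+.3f}")
print(f"foreign style: {score(foreign_style):+.3f}")
```

Taking the maximum rather than the mean means a single strongly foreign token is enough to raise the score, regardless of how many ordinary tokens surround it.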

For the full scoring model — the call-receiver penalty, file clustering, and the per-corpus auto-detect probe — see The scoring model.