How it works
Two frequency tables, a max log-ratio, and a threshold calibrated on your own code.
argot is deliberately simple. There is no neural network at scoring time — the model is two token-frequency distributions and a maximum log-likelihood ratio. That’s the whole idea, and it’s why argot fits in seconds and scores in milliseconds, entirely on CPU.
The mental model
A regex catches what you can write down. A type checker catches what you can prove. argot catches what your team has implicitly agreed on by repetition — naming patterns, error-handling shapes, control-flow idioms, the difference between response.raise_for_status() and if response.status_code >= 400: raise.
It builds two distributions:
- the repo distribution — how tokens are used across your codebase’s history, and
- the generic baseline — a broad open-source corpus baseline bundled with argot.
A hunk is suspicious when at least one of its tokens is far more likely under the generic baseline than under your repo. High surprise means “this looks like generic open-source code, not code from here.”
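The idea can be sketched in a few lines of Python. This is a minimal illustration, not argot's implementation: whitespace splitting stands in for the BPE tokenizer, add-one smoothing is an assumption made here so unseen tokens never get probability zero, and the names distribution and max_surprise are hypothetical.

```python
import math
from collections import Counter


def distribution(tokens):
    """Return a function giving the add-one-smoothed probability of a token.

    Add-one smoothing is an assumption of this sketch, not a documented
    detail of argot.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def prob(token):
        return (counts[token] + 1) / (total + vocab)

    return prob


def max_surprise(hunk_tokens, p_repo, p_generic):
    """Largest log-ratio log(P_generic / P_repo) over the hunk's tokens."""
    return max(math.log(p_generic(t) / p_repo(t)) for t in hunk_tokens)


# Toy corpora: whitespace tokens stand in for BPE tokens.
repo = "response . raise_for_status ( ) response . raise_for_status ( )".split()
generic = "if response . status_code >= 400 : raise response . json ( )".split()

p_repo, p_generic = distribution(repo), distribution(generic)

native = "response . raise_for_status ( )".split()
foreign = "if response . status_code >= 400 : raise".split()

native_score = max_surprise(native, p_repo, p_generic)
foreign_score = max_surprise(foreign, p_repo, p_generic)
```

In this toy setup the hunk written in the repo's own idiom scores below zero, while the explicit status-code check scores above it, because tokens such as if and status_code appear in the generic corpus but never in the repo.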
Two phases
The pipeline splits into fit (run once per repo, and after major refactors) and check (run on every diff).
extract → train → calibrate once; the calibrated threshold feeds every check.
Fit
- extract — walks git log, slices each commit into hunks, and tokenizes every hunk and its surrounding context with a language-aware tree-sitter tokenizer. Output: .argot/dataset.jsonl.
- train — counts BPE tokens across the repo’s non-test source files (the repo distribution) and loads the bundled generic baseline. Data-dominant files (locale tables, fixtures, generated code) are excluded so they don’t pollute the distribution.
- calibrate — samples representative top-level functions and classes from your repo, scores them, and sets the threshold to the maximum score over those “normal” hunks. Multi-language repos get one threshold per language.
argot fit runs all three for you and writes .argot/scorer-config.json.
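The calibrate rule (threshold equals the maximum score over sampled "normal" hunks, one per language) can be pictured with a short sketch. The function name calibrate and the (language, score) input shape are hypothetical, and the scores are made-up numbers used only to show the rule.

```python
from collections import defaultdict


def calibrate(scored_samples):
    """Map each language to the maximum score seen on its sampled hunks.

    scored_samples: iterable of (language, score) pairs for hunks sampled
    from the repo itself, which are treated as "normal" by construction.
    """
    thresholds = defaultdict(lambda: float("-inf"))
    for language, score in scored_samples:
        thresholds[language] = max(thresholds[language], score)
    return dict(thresholds)


# Illustrative scores only; real values would come from the scorer.
samples = [("python", 1.2), ("python", 3.4), ("typescript", 2.1), ("python", 0.7)]
thresholds = calibrate(samples)
```

Because each threshold is the maximum over the repo's own sampled code, none of those sampled scores exceeds it; a new hunk has to score higher than everything sampled to be flagged.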
Check
For each changed hunk, argot runs a short pipeline:
- Typicality filter — skip hunks that are structurally data-dominant (mostly literals) or live in a data-dominant file. The n-gram model would only see noise there.
- Import checker — if a hunk imports a module that’s foreign to the repo’s own first-party import set, flag it immediately (reason: import).
- BPE scorer — compute the max-surprise score, adjusted by a small per-callee penalty, and flag the hunk if the adjusted score exceeds the calibrated threshold.
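The order of the three stages can be sketched as a single function. Everything here is a simplified assumption for illustration: the Hunk fields, the literal_cutoff value, the "bpe" reason string, and the way the penalty is added to the score are all hypothetical. Only the stage order and the import reason come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hunk:
    # Hypothetical, simplified view of a changed hunk.
    literal_fraction: float  # share of tokens that are literals
    in_data_file: bool       # lives in a data-dominant file
    imports: frozenset       # modules the hunk imports
    raw_score: float         # max-surprise score from the scorer
    penalty: float = 0.0     # per-callee adjustment (form assumed)


def check(hunk, known_imports, threshold, literal_cutoff=0.8) -> Optional[str]:
    """Return a flag reason, or None if the hunk is skipped or passes."""
    # 1. Typicality filter: data-dominant hunks are skipped, never flagged.
    if hunk.in_data_file or hunk.literal_fraction >= literal_cutoff:
        return None
    # 2. Import checker: any module outside the known set flags immediately.
    if hunk.imports - known_imports:
        return "import"
    # 3. BPE scorer: compare the adjusted score with the threshold.
    if hunk.raw_score + hunk.penalty > threshold:
        return "bpe"  # hypothetical reason label
    return None


known = frozenset({"requests", "json"})
data = Hunk(0.95, False, frozenset({"numpy"}), 9.0)
foreign_import = Hunk(0.1, False, frozenset({"numpy"}), 0.0)
odd_style = Hunk(0.1, False, frozenset({"requests"}), 4.0, penalty=0.5)
normal = Hunk(0.1, False, frozenset({"json"}), 2.0)

results = [check(h, known, threshold=3.4)
           for h in (data, foreign_import, odd_style, normal)]
```

Note that the data-dominant hunk is skipped even though it has both a foreign import and a high raw score: the filter runs first, so later stages never see it.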
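The math, in one line:
$$\text{score}(h) = \max_{t \in h} \left[ \log P_{\text{generic}}(t) - \log P_{\text{repo}}(t) \right]$$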
A high score means at least one token is far more common in generic code than in this repo — a reliable signal of foreign style. Comments and docstrings are blanked before scoring, so natural language doesn’t inflate the signal.
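Blanking comments before scoring might look like the sketch below. It is an assumption-laden stand-in: it handles only Python # comments (not docstrings), uses the standard-library tokenize module rather than tree-sitter, and blank_comments is a hypothetical name.

```python
import io
import tokenize


def blank_comments(source):
    """Replace every # comment with spaces, keeping line and column layout."""
    lines = source.splitlines(keepends=True)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            end_col = tok.end[1]  # comments never span lines
            line = lines[row - 1]
            lines[row - 1] = line[:col] + " " * (end_col - col) + line[end_col:]
    return "".join(lines)


source = "x = 1  # set x\n# note\ny = 2\n"
blanked = blank_comments(source)
```

Replacing comments with spaces instead of deleting them keeps every remaining token at its original line and column, which is convenient if flagged hunks need to be mapped back to the diff.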
For the full scoring model — the call-receiver penalty, file clustering, and the per-corpus auto-detect probe — see The scoring model.