Languages

Tree-sitter tokenization, supported languages, and per-language calibration for monorepos.

argot is language-agnostic by design. All language-specific logic — import extraction, callee extraction, prose masking, sampleable-range enumeration — is encapsulated in LanguageAdapter implementations. Nothing in the scorer hardcodes a framework, a library, or a corpus.
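To make the adapter boundary concrete, here is a minimal sketch of what such an interface could look like. The method names, the `extensions` attribute, and the `ToyPythonAdapter` class are all hypothetical and are not taken from argot's source. The toy adapter uses Python's standard `ast` module rather than tree-sitter so the example stays self-contained, and it covers only two of the four responsibilities listed above.

```python
from __future__ import annotations

import ast
from typing import Protocol


class LanguageAdapter(Protocol):
    """Hypothetical adapter shape; names are illustrative, not argot's API."""

    extensions: tuple[str, ...]

    def extract_imports(self, source: str) -> set[str]: ...

    def extract_callees(self, source: str) -> set[str]: ...


class ToyPythonAdapter:
    """Toy adapter built on the stdlib `ast` module (an assumption made
    for brevity; the real tool tokenizes with tree-sitter)."""

    extensions = (".py",)

    def extract_imports(self, source: str) -> set[str]:
        # Collect module names from `import x` and `from x import y`.
        names: set[str] = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                names.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module)
        return names

    def extract_callees(self, source: str) -> set[str]:
        # Collect the simple name of every called function or method.
        names: set[str] = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Call):
                func = node.func
                if isinstance(func, ast.Name):
                    names.add(func.id)
                elif isinstance(func, ast.Attribute):
                    names.add(func.attr)
        return names
```

Under this sketch, a scorer would only ever call the protocol methods, so supporting another language would mean supplying another class with the same shape.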

Tree-sitter tokenization

Hunks are tokenized with tree-sitter, an incremental, error-tolerant parser. Two properties matter here:

Supported out of the box

Language                  Status
Python                    Supported
TypeScript / JavaScript   Supported

Adding a language means writing a new adapter; the model and pipeline don’t change.

Per-language calibration

With a single threshold, a mixed Python + TypeScript monorepo would calibrate against a joint distribution dominated by whichever language has broader token diversity. argot instead emits one threshold per language present in the repo, and check dispatches each hunk by file extension.

So a TypeScript hunk is judged against the repo’s TypeScript voice, and a Python hunk against its Python voice — no cross-language bleed. This is automatic; there’s nothing to configure.
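The calibrate-then-dispatch idea can be sketched in a few lines. Everything here is an assumption made for illustration: the extension map, the function names, the use of a high quantile as the threshold, and the convention that a higher score means a more unusual hunk. None of it is taken from argot's implementation.

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical extension-to-language map.
EXTENSION_TO_LANGUAGE = {".py": "python", ".ts": "typescript", ".tsx": "typescript"}


def language_for(path: str) -> str | None:
    """Pick a language from the file extension, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())


def _quantile(values: list[float], q: float) -> float:
    """Simple empirical quantile: the value at index floor(q * n), clamped."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def calibrate(scored: list[tuple[str, float]], q: float = 0.95) -> dict[str, float]:
    """Compute one threshold per language from that language's scores only."""
    by_language: dict[str, list[float]] = defaultdict(list)
    for path, score in scored:
        language = language_for(path)
        if language is not None:
            by_language[language].append(score)
    return {lang: _quantile(scores, q) for lang, scores in by_language.items()}


def is_flagged(path: str, score: float, thresholds: dict[str, float]) -> bool | None:
    """Compare a hunk's score with its own language's threshold.

    Returns None when the file's language has no threshold.
    Assumes a higher score means a more unusual hunk.
    """
    language = language_for(path)
    if language is None or language not in thresholds:
        return None
    return score > thresholds[language]
```

In this sketch, the same score can be flagged in one language and pass in the other, because each hunk is only ever compared with scores from files of its own language.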

What gets excluded

The token model only makes sense over code, so a few things are kept out of both training and scoring: