Languages
Tree-sitter tokenization, supported languages, and per-language calibration for monorepos.
argot is language-agnostic by design. All language-specific logic — import extraction, callee
extraction, prose masking, sampleable-range enumeration — is encapsulated in LanguageAdapter
implementations. Nothing in the scorer hardcodes a framework, a library, or a corpus.
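As a rough illustration of that boundary, the four responsibilities could be expressed as a structural interface. This is a sketch only: the method names, the `TextRange` type, and the regex-based `ToyPythonAdapter` are assumptions made for the example, not argot's actual API, and a real adapter would work from the tree-sitter parse rather than regular expressions.

```python
import re
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass(frozen=True)
class TextRange:
    """Half-open character range [start, end) within a source string."""
    start: int
    end: int


class LanguageAdapter(Protocol):
    """Hypothetical shape of an adapter; names are illustrative."""

    extensions: Tuple[str, ...]

    def extract_imports(self, source: str) -> List[str]: ...
    def extract_callees(self, source: str) -> List[str]: ...
    def mask_prose(self, source: str) -> str: ...
    def sampleable_ranges(self, source: str) -> List[TextRange]: ...


class ToyPythonAdapter:
    """Deliberately naive stand-in that satisfies the protocol above."""

    extensions = (".py",)

    def extract_imports(self, source: str) -> List[str]:
        # Module name following `import` or `from` at the start of a line.
        return re.findall(r"^\s*(?:import|from)\s+([\w.]+)", source, flags=re.MULTILINE)

    def extract_callees(self, source: str) -> List[str]:
        # Dotted name directly before an opening parenthesis.
        return re.findall(r"([\w.]+)\s*\(", source)

    def mask_prose(self, source: str) -> str:
        # Blank `#` comments with spaces so character offsets are preserved.
        # (Naive: it would also blank a `#` inside a string literal.)
        return re.sub(r"#[^\n]*", lambda m: " " * len(m.group()), source)

    def sampleable_ranges(self, source: str) -> List[TextRange]:
        # Treat the whole non-empty source as a single sampleable range.
        return [TextRange(0, len(source))] if source.strip() else []


# The scorer side would depend only on the protocol, never on a concrete language.
adapter: LanguageAdapter = ToyPythonAdapter()
```

The point of the shape is that everything language-specific sits behind those four methods, so code that consumes an adapter never needs to branch on the language.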
Tree-sitter tokenization
Hunks are tokenized with tree-sitter, an incremental, error-tolerant parser. Two properties matter here:
- It works on partial, syntactically invalid fragments — essential, because a diff hunk is almost always a mid-block slice that doesn’t parse on its own.
- It gives a single uniform interface for every supported language, so adding a language is a matter of writing an adapter, not a new pipeline.
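To see why error tolerance matters, it helps to try a strict parser on a typical hunk. The sketch below uses Python's standard `ast` module purely as a contrast (it is not part of argot's pipeline, and the hunk text is made up): a mid-block slice is rejected outright, which is the failure mode an error-tolerant parser avoids.

```python
import ast

# A made-up hunk: the tail of a loop body plus the enclosing function's return.
# It starts mid-block, so the leading indentation has no enclosing statement.
HUNK = (
    "        total += item.price\n"
    "    return total\n"
)


def parses_standalone(fragment: str) -> bool:
    """Return True if a strict parser accepts the fragment as a whole module."""
    try:
        ast.parse(fragment)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True
```

Here `parses_standalone(HUNK)` returns `False` because of the unexpected indent, even though every line in the hunk is valid code in its original context.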
Supported out of the box
| Language | Status |
|---|---|
| Python | Supported |
| TypeScript / JavaScript | Supported |
Adding another language means writing a new adapter; the model and pipeline don’t change.
Per-language calibration
With a single threshold, a mixed Python + TypeScript monorepo would calibrate against a joint
distribution dominated by whichever language has the broader token diversity. argot instead emits one
threshold per language present in the repo, and check dispatches each hunk to its language's threshold by file extension.
So a TypeScript hunk is judged against the repo’s TypeScript voice, and a Python hunk against its Python voice — no cross-language bleed. This is automatic; there’s nothing to configure.
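The dispatch step can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the extension map, the grouping of JavaScript with TypeScript under one threshold, the threshold values, and the convention that a higher score means more surprising are all hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

# Hypothetical extension map; JS and TS are assumed to share one calibration.
EXTENSION_TO_LANGUAGE: Dict[str, str] = {
    ".py": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".js": "typescript",
    ".jsx": "typescript",
}


def language_for(path: str) -> Optional[str]:
    """Map a file path to a language key, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())


def is_flagged(path: str, score: float, thresholds: Dict[str, float]) -> Optional[bool]:
    """Compare a hunk's score against the threshold for its own language.

    Returns None when the file has no calibrated language, so the caller
    can skip it rather than fall back to another language's threshold.
    """
    language = language_for(path)
    if language is None or language not in thresholds:
        return None
    return score > thresholds[language]


# Illustrative per-language thresholds, as if emitted by calibration.
EXAMPLE_THRESHOLDS = {"python": 4.2, "typescript": 5.1}
```

With these example values, the same score of 4.8 is flagged for `svc/main.py` but not for `src/app.ts`, which is the behavior a single joint threshold could not express.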
What gets excluded
The token model only makes sense over code, so a few things are kept out of both training and scoring:
- Data-dominant files — modules that are ≥80% top-level array/object literals (locale tables, fixtures, generated lookups). The same structural predicate runs at fit and check time.
- Comments and docstrings — blanked before scoring, so prose doesn’t inflate the surprise signal.
- Test files and conventional directories — skipped today by a placeholder default; this is planned to become user-configurable rules as part of the suppression surface.
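The data-dominant predicate from the first bullet could look something like the sketch below. It is an approximation under assumptions: it handles Python modules only, uses the standard `ast` module rather than tree-sitter, and measures the 80% share by non-blank lines, which may differ from how argot actually measures it. The function names are hypothetical.

```python
import ast

# Literal node types treated as "data" when they are the value of a top-level assignment.
LITERAL_NODES = (ast.List, ast.Dict, ast.Tuple, ast.Set)


def data_fraction(source: str) -> float:
    """Share of non-blank lines that belong to top-level literal assignments."""
    nonblank = [bool(line.strip()) for line in source.splitlines()]
    total = sum(nonblank)
    if total == 0:
        return 0.0

    data_lines = 0
    for stmt in ast.parse(source).body:  # top-level statements only
        value = getattr(stmt, "value", None)
        if isinstance(stmt, (ast.Assign, ast.AnnAssign)) and isinstance(value, LITERAL_NODES):
            # lineno/end_lineno are 1-based and inclusive.
            data_lines += sum(nonblank[stmt.lineno - 1 : stmt.end_lineno])
    return data_lines / total


def is_data_dominant(source: str, cutoff: float = 0.8) -> bool:
    """Apply the same structural predicate at both fit and check time."""
    return data_fraction(source) >= cutoff
```

Because the predicate is a pure function of the file's structure, calling the same function in both phases keeps training and scoring consistent about which files are excluded.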