The scoring model

The BPE log-ratio, the call-receiver penalty, file clustering, and per-corpus auto-detect.

This page is the deep end — the full scoring model. For the plain-English version, start with How it works.

The whole score, in one line — a BPE surprise term plus a call-receiver penalty:

$$\text{score}(\text{hunk}) = \underbrace{\max_{t \in \text{tokens}(\text{hunk})} \log \frac{P_{\text{baseline}}(t)}{P_{\text{repo}}(t)}}_{\text{BPE surprise}} + \underbrace{\min\Big(\sum_{c} w(c),\ \text{cap}\Big)}_{\text{call-receiver penalty}}$$

BPE surprise

argot tokenizes each hunk with the UnixCoder BPE tokenizer; only the vocabulary is used, not the neural network. It then compares two smoothed distributions, the repo ($A$) and the generic baseline ($B$):

$$P_{A}(t) = \frac{\text{count}_{A}(t)}{\text{total}_{A}} + \varepsilon \qquad P_{B}(t) = \frac{\text{count}_{B}(t)}{\text{total}_{B}} + \varepsilon$$

$$\text{surprise}(t) = \log P_{B}(t) - \log P_{A}(t)$$

$$\text{score}(\text{hunk}) = \max_{t \in \text{tokens}(\text{hunk})} \text{surprise}(t)$$

A high score means at least one token is far more common in generic open-source code than in this repo. Prose lines (comments, docstrings) are blanked before scoring so natural language doesn’t inflate the signal.
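The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not argot's implementation: the token lists stand in for real UnixCoder BPE output, the helper names are hypothetical, and the value of $ε$ is an assumption because this page does not state it.

```python
import math
from collections import Counter

EPSILON = 1e-7  # assumed smoothing constant; the page does not give its value


def smoothed_prob(counts: Counter, token: str, eps: float = EPSILON) -> float:
    """P(t) = count(t) / total + eps."""
    total = sum(counts.values())
    return counts[token] / total + eps


def surprise(token: str, repo: Counter, baseline: Counter) -> float:
    """log P_B(t) - log P_A(t): positive when the token is rarer in the repo."""
    return math.log(smoothed_prob(baseline, token)) - math.log(smoothed_prob(repo, token))


def bpe_score(hunk_tokens: list[str], repo: Counter, baseline: Counter) -> float:
    """Max surprise over the hunk's tokens.

    Assumption: an empty hunk scores -inf, so this term can never flag it.
    """
    return max((surprise(t, repo, baseline) for t in hunk_tokens), default=float("-inf"))


# Toy counts: "eval" is common in the baseline but never appears in the repo.
repo_counts = Counter({"self": 50, "get": 30, "requests": 20})
baseline_counts = Counter({"self": 40, "get": 30, "eval": 30})
score = bpe_score(["self", "eval"], repo_counts, baseline_counts)
```

In this toy case the single token `eval` determines the score, because the max picks out the one token that the repo never uses. The common token `self` contributes a negative surprise and is ignored.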

The call-receiver penalty

The raw BPE score is adjusted by a small per-callee penalty over the hunk’s distinct dotted callees $c$:

$$\text{adjusted} = \text{bpe} + \min\Big(\sum_{c} w(c),\ \text{cap}\Big)$$

The weight $w(c)$ depends on how the callee relates to the repo and to its file’s cluster:

$$w(c) = \begin{cases}
\alpha + r & \text{$c$ unattested, but its root is known (req.send vs req.get)} \\
\alpha & \text{$c$ and its root are both unattested} \\
\beta & \text{$c$ is attested, but absent from its file’s cluster} \\
\beta & \text{$c$ is cluster-rare, in $\leq \tau$ cluster files (auto-detected)}
\end{cases}$$

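Shipping config: $α = 2.0$, $r = 2.0$, $β = 5.0$, $\text{cap} = 5.0$, $K = 8$, $τ = 2$. Here $K$ is the number of file clusters and $τ$ is the file-count threshold for the cluster-rare rule.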

The cluster-conditional term targets context-dependent breaks — a known callee showing up in a file kind it never belongs to (think Math.random inside a deterministic faker provider, even though Math.random exists elsewhere in the repo’s tests).
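The penalty rules can be sketched as a small function using the shipping constants. Several details here are assumptions rather than documented behavior: the function and argument names are hypothetical, a callee that matches no rule is given weight 0, and $α + r$ is assigned to the known-root case by following the order in which the cases are listed.

```python
ALPHA, R, BETA, CAP, TAU = 2.0, 2.0, 5.0, 5.0, 2  # shipping config from this page


def root(callee: str) -> str:
    """'req.send' -> 'req'."""
    return callee.split(".", 1)[0]


def weight(callee, repo_callees, cluster_file_counts, cluster_rare_enabled=True):
    """Weight of one callee.

    repo_callees: set of callees attested anywhere in the repo.
    cluster_file_counts: callee -> number of files in this file's cluster using it.
    """
    if callee not in repo_callees:
        repo_roots = {root(c) for c in repo_callees}
        # Assumption: the known-root case carries the extra r term.
        return ALPHA + R if root(callee) in repo_roots else ALPHA
    n_files = cluster_file_counts.get(callee, 0)
    if n_files == 0:
        return BETA  # attested in the repo, absent from this cluster
    if cluster_rare_enabled and n_files <= TAU:
        return BETA  # cluster-rare (only when auto-detect keeps the rule on)
    return 0.0  # assumption: ordinary callees add nothing


def penalty(callees, repo_callees, cluster_file_counts, cluster_rare_enabled=True):
    """Capped sum of weights over the hunk's distinct callees."""
    total = sum(
        weight(c, repo_callees, cluster_file_counts, cluster_rare_enabled)
        for c in set(callees)
    )
    return min(total, CAP)


# Math.random is attested in the repo, but no file in the faker cluster uses it.
repo_callees = {"Math.random", "faker.name", "req.get"}
faker_cluster = {"faker.name": 12}
```

With these toy inputs, `penalty(["Math.random"], repo_callees, faker_cluster)` yields the full $β$, while a hunk that only calls `faker.name` adds nothing to its BPE score.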

File clustering

At fit time, every non-data-dominant source file is reduced to its callee bag (the set of dotted call expressions, via tree-sitter), encoded as a 128-perm MinHash signature, and clustered into K = 8 groups with KMeans. Each cluster’s attested set is the union of its files’ callees. At score time a hunk’s file is mapped to its cluster (or the Jaccard-nearest one if it’s new). The clustering is derived purely from callee statistics — no path patterns, no per-corpus heuristics.
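As a rough illustration of two of these steps, the sketch below builds a 128-value MinHash signature from a callee bag and maps a new file to its Jaccard-nearest cluster. It uses only the standard library and is not argot's code: the salted BLAKE2 hash family, the function names, and the choice to compare a new file against each cluster's attested set with exact Jaccard similarity are all assumptions. The KMeans step is omitted.

```python
import hashlib

NUM_PERM = 128  # signature length stated on this page


def minhash_signature(callee_bag: set[str], num_perm: int = NUM_PERM) -> list[int]:
    """One minimum per salted hash function; similar bags share many minima."""
    if not callee_bag:
        raise ValueError("callee bag must be non-empty")
    signature = []
    for i in range(num_perm):
        salt = i.to_bytes(8, "little")  # a different salt simulates a different permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(c.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for c in callee_bag
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_cluster(callee_bag: set[str], attested: dict[int, set[str]]) -> int:
    """Assumed fallback for a new file: the cluster with the most similar attested set."""
    return max(attested, key=lambda k: jaccard(callee_bag, attested[k]))


attested_sets = {
    0: {"faker.name", "faker.email", "random.choice"},
    1: {"req.get", "req.post", "json.loads"},
}
new_file_cluster = nearest_cluster({"req.get", "json.dumps"}, attested_sets)
```

The signature lets two files be compared in fixed time regardless of how many callees they contain, which is what makes it a convenient fixed-width input for a clustering step.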

Calibration and auto-detect

Calibration samples up to 500 representative top-level functions and classes, scores them, and sets the BPE threshold to the max score over those normal hunks. Because calibration hunks come from files already in the corpus, their callees are subsets of their cluster’s attested set, so calibration scores are invariant under both $α$ and $β$. The threshold is set against raw BPE; the penalty exists only to push genuinely anomalous hunks past it at score time.

Calibration also runs a per-corpus auto-detect probe: it loads ~1000 diff hunks and measures the fire rate of the cluster-rare rule. If the rule fires on < 5% of hunks it’s informative and stays enabled; otherwise it’s disabled to avoid Zipf-tail false-positive floods. This is what keeps the same config honest across very different repos.
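Both calibration decisions reduce to simple aggregates, sketched below. The function names are hypothetical, the inputs are assumed to be precomputed (raw BPE scores for the calibration hunks, and one boolean per probe hunk saying whether the cluster-rare rule fired), and disabling the rule when the probe is empty is an assumption.

```python
FIRE_RATE_LIMIT = 0.05  # the 5% cut-off described above


def calibrate_threshold(calibration_scores: list[float]) -> float:
    """The threshold is the max raw BPE score over the sampled normal hunks."""
    return max(calibration_scores)


def cluster_rare_enabled(rule_fired: list[bool], limit: float = FIRE_RATE_LIMIT) -> bool:
    """Keep the cluster-rare rule only if it fires on fewer than `limit` of probe hunks."""
    if not rule_fired:
        return False  # assumption: with no probe data, leave the rule off
    fire_rate = sum(rule_fired) / len(rule_fired)
    return fire_rate < limit


threshold = calibrate_threshold([3.1, 4.7, 2.0])
quiet_repo = cluster_rare_enabled([True] * 30 + [False] * 970)    # fires on 3% of hunks
noisy_repo = cluster_rare_enabled([True] * 200 + [False] * 800)   # fires on 20% of hunks
```

A repo where the rule fires on 3% of probe hunks keeps it enabled. One where it fires on 20% drops it, since most of those firings would be ordinary long-tail callees rather than real anomalies.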

Reason attribution

A hunk is flagged if the import checker fires (foreign import) or the adjusted BPE score exceeds the threshold. The reason is call_receiver when the penalty pushed a below-threshold BPE over the line, and bpe when raw BPE already crossed it. Scores and reasons are always included in the output.
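The attribution logic can be sketched as a single function. The `call_receiver` and `bpe` reasons come from this page; the `import` reason label, its precedence over the other two, and the function signature are assumptions made for illustration.

```python
from typing import Optional


def attribute(
    bpe: float, penalty: float, threshold: float, foreign_import: bool
) -> tuple[bool, Optional[str]]:
    """Return (flagged, reason) for one hunk."""
    if foreign_import:
        return True, "import"  # hypothetical label; the page names only the other two
    if bpe > threshold:
        return True, "bpe"  # raw BPE already crossed the threshold
    if bpe + penalty > threshold:
        return True, "call_receiver"  # the penalty pushed it over the line
    return False, None


flag_bpe = attribute(bpe=6.0, penalty=0.0, threshold=5.0, foreign_import=False)
flag_penalty = attribute(bpe=3.0, penalty=4.0, threshold=5.0, foreign_import=False)
not_flagged = attribute(bpe=3.0, penalty=1.0, threshold=5.0, foreign_import=False)
```

The comparison is strict because the threshold is itself the max calibration score, so a hunk that merely ties the highest-scoring normal hunk is not flagged.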