Benchmark¶

RCG ships a labeled dataset and a rcg benchmark harness so the detectors' quality is measurable and reproducible — not just asserted.

rcg benchmark benchmarks/dataset.jsonl --embedder hashing --judge mock --semantic

On the 62-pair labeled set (deterministic config — mock judge, lexical embedder):

pass	precision	recall	F1
syntactic	1.000	0.500	0.667
combined (lexical embedder + mock judge)	0.867	0.500	0.634

By category, the syntactic pass scores 1.000 / 1.000 on approval-stance and 1.000 / 0.800 on modality — perfect precision, and the approval-stance logic that kills false positives works exactly as intended.

An honest finding

Non-keyword semantic conflicts (rules that disagree without sharing words) need the Anthropic judge — the deterministic mock judge is the bottleneck there, not the embedder. A real (sentence-transformers) embedder widens the semantic pass's candidate recall (0.269 → 0.462) at a precision cost; converting those to true positives needs --judge anthropic. We publish this rather than claim a recall lift the mock judge masks.

Reproduce the real-embeddings + real-judge numbers:

pip install 'rule-coherence-graph[embeddings]'
export ANTHROPIC_API_KEY=sk-...
rcg benchmark benchmarks/dataset.jsonl --embedder sentence-transformers --judge anthropic --semantic

Full tables, every config, and the complete reading: benchmarks/RESULTS.md on GitHub.

Caveats

The dataset is small and synthetic/illustrative, and the default embedder is lexical — install the [embeddings] extra for real semantic recall. Treat these numbers as a regression signal and a starting point, not a leaderboard.