Deterministic Scoring: Why Reproducibility Matters in Risk Infrastructure

Executive Summary

Risk scoring without reproducibility is risk theater. When a desk sizes a position because the score said 67, and a month later compliance asks why, the only acceptable answer is "given these inputs and these version stamps, the scoring engine returns exactly 67 every time we run it." SettleRisk enforces this through closed-form scoring, append-only snapshots, and four version stamps on every score. This post explains the contract and shows how to use it in a TCA workflow.

Core Concept

A scoring engine is deterministic when, holding inputs and version stamps fixed, it returns identical outputs on every invocation. SettleRisk satisfies this by:

Closed-form arithmetic. No statistical model inference at runtime. Driver points combine via published formulas.
Version stamps. Every score carries four stamps:

| Stamp | Meaning | |-------|---------| | heuristics_version | The closed-form formula version | | stat_model_version | Always "none" in v1 | | llm_extractor_version | The prompt + schema version used in extraction | | driver_taxonomy_version | The driver registry version |
Canonicalized inputs. Rules text is normalized (\r\n -> \n, no whitespace collapse) before any character offsets are computed.
Append-only snapshots. Score history is never updated in place. A new computation creates a new snapshot with a new timestamp.

The contract: holding the four version stamps and the canonicalized rules text constant, the score is byte-identical.

Worked Example

A reproducibility audit looks like this in Python:

from settlerisk import SettleRiskClient
import json

client = SettleRiskClient(api_key="sk-...")

# Pull current score
current = client.get_risk_score("polymarket", "0xreplay")
print(f"Current score: {current.aggregate_risk_score}")
print(f"Versions: {current.version.model_dump()}")

# Snapshot
snapshot = {
    "rules_text": current.rules_text,
    "version": current.version.model_dump(),
}

# A week later, re-evaluate with same inputs and same version stamps
replay = client.evaluate_rules(
    platform="polymarket",
    rules_text=snapshot["rules_text"],
    pin_versions=snapshot["version"],
)
print(f"Replay score: {replay.aggregate_risk_score}")
# Replay score: 67  (identical to current)

assert current.aggregate_risk_score == replay.aggregate_risk_score
for d_c, d_r in zip(current.drivers, replay.drivers):
    assert d_c.points_contribution == d_r.points_contribution
    assert d_c.evidence.start_char == d_r.evidence.start_char
    assert d_c.evidence.end_char == d_r.evidence.end_char

In TypeScript:

import { SettleRiskClient } from "settlerisk";

const client = new SettleRiskClient({ apiKey: "sk-..." });
const current = await client.getRiskScore("polymarket", "0xreplay");

const replay = await client.evaluateRules({
  platform: "polymarket",
  rulesText: current.rulesText,
  pinVersions: current.version,
});

console.assert(current.aggregateRiskScore === replay.aggregateRiskScore);
for (let i = 0; i < current.drivers.length; i++) {
  console.assert(
    current.drivers[i].pointsContribution === replay.drivers[i].pointsContribution
  );
}

If pin_versions is omitted, the engine uses the latest versions, which may produce a different result if the taxonomy or heuristics have evolved. The reproducibility guarantee only holds when versions are pinned explicitly.

Implementation Notes

Persist all four version stamps with every traded position. A trade ticket without version stamps is unauditable. Persist them in the same row as the sized position.

Replay uses evaluate-rules, not risk-score. The GET endpoints serve from latest cached snapshots, which may have evolved past the version you originally used. For replay, always use the synchronous /v1/evaluate-rules POST with pin_versions.

Snapshots are append-only. SettleRisk never overwrites a historical score. Every recomputation creates a new row. This means you can pull historical scores at any point in time via the snapshot index.

| Use case | Endpoint | |----------|----------| | Latest score for a market | GET /v1/markets/.../risk-score | | Historical score at point in time | GET /v1/markets/.../risk-score?at=2026-04-01T00:00:00Z | | Replay with pinned versions | POST /v1/evaluate-rules with pin_versions | | Score snapshot history | GET /v1/markets/.../score-snapshots |

Canonicalization matters. Two rules-text strings that look identical but differ in line-endings (\r\n vs \n) will produce different SHA-256 digests. Canonicalize before hashing.

Failure Modes

1. Pinning only some versions. Pinning heuristics_version but not driver_taxonomy_version still allows the score to drift. Pin all four.

2. Using GET endpoints for replay. GETs serve from latest snapshot. They are correct for current-state queries; they are wrong for replay.

3. Hashing un-canonicalized text. A \r\n line in the input that was canonicalized to \n during scoring will hash differently. Use the canonicalization function (or the SHA-256 returned by the API) as your audit anchor.

4. Skipping version logging in compliance workflows. "We use SettleRisk" is not an audit answer. "We used SettleRisk heuristics_version 1.0.0 with driver_taxonomy_version 1.2.0 on rules_sha256 abc..." is.

5. Assuming version bumps don't change scores. They can. Always re-run with new versions in shadow mode before flipping production logic.

Checklist

[ ] Persist all four version stamps with every position
[ ] Use evaluate-rules (POST) for replay, not GET
[ ] Canonicalize rules text before hashing
[ ] Shadow-run new versions before adopting
[ ] Pull score-snapshot history for any disputed trade
[ ] Document your version-pinning policy in compliance

Sources + Further Reading

SettleRisk methodology — full reproducibility contract
Driver attribution post — the driver registry
Reproducible Research in Computational Science (Peng 2011)
Versioning Schemes for Scientific Software (Knepper et al. 2020)

Free key at /signup — pull versioned scores from day one.