Confidence Calibration (Shared Sigmoid-of-Linear Combination)

Overview

Confidence calibration is the shared scoring layer that every autonomous navigation technique uses to convert a typed diagnostics dataclass into a calibrated \([0, 1]\) confidence on its NavTechniqueResult. Each technique declares a YAML spec — a constant baseline, a list of linear terms keyed by diagnostic-attribute name, optional hard-zero gates, and an optional post-sigmoid clamp — and the shared evaluator applies that spec uniformly. Centralising the math means a config-load validation pass can verify every spec at startup and adding a new technique requires no new scoring code.

Theory

Every autonomous technique’s confidence formula has the same shape:

\[z = \alpha_{0} + \sum_{i} \alpha_{i} \, \mathrm{normalize}_{i}(x_{i}), \qquad c = \sigma(z),\]

where \(x_{i}\) is the raw value of the i-th diagnostic attribute on the technique’s result, \(\mathrm{normalize}_{i}\) applies a per-term offset / divisor / cap transformation, \(\alpha_{0}\) and \(\alpha_{i}\) are configured coefficients, and \(\sigma\) is the logistic sigmoid.

Per-term normalisation

The normalisation transformation applied to each raw value is

\[\begin{split}\mathrm{normalize}(x) = \begin{cases} \mathrm{clip}\!\left(\dfrac{x - o}{d},\; 0,\; \mathrm{cap}\right) & \text{when a cap is set} \\[6pt] \dfrac{x - o}{d} & \text{otherwise} \end{cases}\end{split}\]

where \(o\) is the optional offset (default zero), \(d\) is the divisor (default one, required non-zero), and the cap clamps the post-scale value to \([0, \mathrm{cap}]\) when present. The cap, when set, must lie in \([0, 1]\).

The offset shifts the term’s “responsive interval” along the raw axis (subtracting the offset moves the threshold for \(\mathrm{normalize}(x) = 0\)); the divisor sets how quickly the term saturates as the raw value grows; the cap stops a runaway raw value from dominating the sigmoid argument.
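As a minimal sketch of the transformation above, the offset, divisor, and optional cap can be written as a single function. The name normalize_term and its keyword arguments are illustrative assumptions, not the module's actual API.

```python
from typing import Optional


def normalize_term(raw: float, offset: float = 0.0, divisor: float = 1.0,
                   cap: Optional[float] = None) -> float:
    """Apply the offset / divisor / cap transformation to one raw value."""
    if divisor == 0.0:
        raise ValueError("divisor must be non-zero")
    scaled = (raw - offset) / divisor
    if cap is None:
        # No cap: the scaled value passes through unclipped (it may be negative).
        return scaled
    # Cap set: clip the scaled value to [0, cap].
    return min(max(scaled, 0.0), cap)
```

For instance, normalize_term(120.0, divisor=100.0, cap=1.0) scales the raw value to 1.2 and then clips it to 1.0, whereas the same call without a cap lets the 1.2 through.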

Sigmoid combination

The summed argument \(z\) is fed into the numerically-stable logistic sigmoid

\[\begin{split}\sigma(z) = \begin{cases} \dfrac{1}{1 + e^{-z}} & z \ge 0 \\[6pt] \dfrac{e^{z}}{1 + e^{z}} & z < 0 \end{cases}\end{split}\]

so the formula stays well-defined for arbitrarily large positive or negative arguments.
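The two-branch form can be sketched in a few lines of Python. The function name stable_sigmoid is an assumption made for illustration; the point is that neither branch ever exponentiates a large positive number, which in Python would make math.exp raise OverflowError.

```python
import math


def stable_sigmoid(z: float) -> float:
    """Logistic sigmoid that only ever calls exp() on a non-positive argument."""
    if z >= 0.0:
        # exp(-z) lies in (0, 1], so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # z < 0: exp(z) lies in (0, 1), so it cannot overflow either.
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

With this split, very negative arguments underflow harmlessly to 0.0 and very positive ones saturate at 1.0 instead of raising.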

Hard-zero gates

Each spec may declare a mapping from diagnostic-attribute name to a demanded boolean. Before the sigmoid is evaluated, the evaluator checks each entry: if the truthiness of the attribute on the diagnostics object matches the demanded value (truthy when the spec demands True, falsy when it demands False), the gate fires and the calibrated confidence is forced to zero, regardless of the linear-combination sum. Hard-zero gates are how techniques surface their structural failure modes (the converged offset sits on the search-window edge; the M-estimator fit was spurious; the per-feature reliability gate dropped every input).
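A gate check of this kind might look like the following sketch. The helper name first_fired_gate and the SimpleNamespace stand-in for a diagnostics dataclass are hypothetical; only the matching rule (attribute truthiness equals the demanded boolean) comes from the description above.

```python
from types import SimpleNamespace
from typing import Any, Mapping, Optional


def first_fired_gate(diagnostics: Any,
                     hard_zero_if: Mapping[str, bool]) -> Optional[str]:
    """Return the name of the first gate that fires, or None if none do."""
    for name, demanded in hard_zero_if.items():
        if not hasattr(diagnostics, name):
            raise ValueError(f"diagnostics has no attribute {name!r}")
        # A gate fires when the attribute's truthiness matches the demand.
        if bool(getattr(diagnostics, name)) == demanded:
            return name
    return None


# Hypothetical diagnostics object standing in for a technique's dataclass.
diag = SimpleNamespace(at_edge=False, spurious=True)
fired = first_fired_gate(diag, {"at_edge": True, "spurious": True})
```

Here fired is "spurious": the at_edge gate demands True but the attribute is False, so only the second gate matches. A caller would force the confidence to zero whenever the result is not None.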

Post-sigmoid hard cap

After the sigmoid evaluates, an optional hard_cap in \([0, 1]\) clamps the result from above. This is the right place to encode an algorithmic ceiling that does not depend on the formula’s input — for example, a brightness-weighted-centroid technique whose output is intrinsically less informative than a limb fit caps its post-sigmoid confidence at 0.4 even when every term saturates.

Per-term breakdown

The evaluator can return a per-term contribution trace alongside the calibrated confidence. The trace records, for each term, the raw attribute value, the normalised value, the alpha, and the resulting alpha-times-normalised contribution to the sigmoid argument. Logging this trace at INFO when the confidence falls at or below a low threshold gives an operator a one-line diagnostic of which term (or which hard-zero gate) drove the result down.

Restrictions and assumptions

Term divisors must be non-zero; the dataclass constructor rejects a zero divisor at config-load time.
Caps, when set, must lie in \([0, 1]\).
Every term’s feature name and every hard-zero key must reference an attribute the diagnostics object actually carries. The orchestrator’s startup-time validate_registered_confidence_specs() walk catches unknown attribute names before any image is processed; if a YAML spec references an unknown field the process fails fast.
The framework does not track units. The offset / divisor / cap transformation only yields a dimensionless normalised value if the YAML offset and divisor are given in the same units as the raw diagnostic value.
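The first two restrictions lend themselves to constructor-time checks. The sketch below uses a hypothetical TermSketch dataclass, which is not the real ConfidenceTerm, to show one way such validation could be expressed with __post_init__.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TermSketch:
    """Hypothetical stand-in for a confidence term; not the real ConfidenceTerm."""
    feature: str
    alpha: float
    offset: float = 0.0
    divisor: float = 1.0
    cap_at: Optional[float] = None

    def __post_init__(self) -> None:
        # Reject a zero divisor up front so evaluation can never divide by zero.
        if self.divisor == 0.0:
            raise ValueError(f"{self.feature}: divisor must be non-zero")
        # A cap, when present, must lie in the closed interval [0, 1].
        if self.cap_at is not None and not 0.0 <= self.cap_at <= 1.0:
            raise ValueError(f"{self.feature}: cap_at must lie in [0, 1]")
```

Failing in the constructor means a bad YAML value is reported when the config is loaded rather than when the first image is scored.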

Sources of uncertainty

The calibrated confidence is the output of a deterministic functional form; there is no stochastic component. What it does capture is the empirical relationship between the documented diagnostics and the per-image fit quality, as encoded by the per-technique YAML coefficients. What it does not capture is the diagnostic’s own measurement uncertainty (if a technique misreports its DT residual, the confidence formula will trust the misreport), nor any cross-technique consistency (the ensemble combine handles that).

Configuration

Confidence calibration is a consumer of YAML, not a producer. Every technique’s confidence spec lives under techniques.<TechniqueName> in src/nav/config_files/config_510_techniques.yaml alongside its tuning block. The spec shape is:

alpha0 — float (dimensionless). Baseline contribution to the sigmoid argument. Negative values pull the sigmoid below 0.5 by default; positive values push it above.
terms — list of mappings. Each entry has:
- feature — str, the diagnostic-attribute name. Must exist on the technique’s diagnostics dataclass and appear in the technique’s confidence_attributes allow-list.
- alpha — float (dimensionless). Linear coefficient applied after normalisation.
- offset — float, default 0.0. Subtracted from the raw value before division. Same units as the raw value.
- divisor — float, default 1.0. Divides after offset; must be non-zero. Same units as the raw value.
- cap_at — float in \([0, 1]\) or null, default null. Optional upper bound on the normalised value. When set, the normalised value is clipped to \([0, \mathrm{cap\_at}]\): negative values become 0 and values above cap_at become cap_at.
hard_zero_if — mapping of str to bool, default empty. Keys must reference attributes the diagnostics object carries (or live on the technique’s adapter object). When the attribute matches the demanded boolean, the calibrated confidence is forced to zero.
hard_cap — float in \([0, 1]\) or null, default null. Post-sigmoid clamp.

This module exposes no module-level numeric constants of its own; the spec values come from YAML and the runtime constructors validate them.
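To make the spec shape concrete, the sketch below shows what a parsed spec block might look like as a Python mapping and how the documented defaults could be filled in. The values are assembled from the BodyLimbNav examples in this document and may not match the real file; the helper apply_spec_defaults is hypothetical, and treating alpha0 and terms as required keys is an assumption.

```python
from typing import Any, Dict, Mapping

# A parsed spec block as it might appear after YAML loading (illustrative values).
raw_spec: Dict[str, Any] = {
    "alpha0": -1.0,
    "terms": [
        {"feature": "visible_limb_arc_fraction", "alpha": 3.0},
        {"feature": "dt_fit_rms_px", "alpha": -1.5},
        {"feature": "visible_arc_px", "alpha": 0.4,
         "divisor": 100.0, "cap_at": 1.0},
    ],
    "hard_zero_if": {"at_edge": True, "spurious": True},
}


def apply_spec_defaults(raw: Mapping[str, Any]) -> Dict[str, Any]:
    """Fill in the documented defaults for omitted keys (hypothetical helper)."""
    terms = [
        {
            "feature": t["feature"],
            "alpha": float(t["alpha"]),
            "offset": float(t.get("offset", 0.0)),    # default 0.0
            "divisor": float(t.get("divisor", 1.0)),  # default 1.0
            "cap_at": t.get("cap_at"),                # default null
        }
        for t in raw["terms"]
    ]
    return {
        "alpha0": float(raw["alpha0"]),
        "terms": terms,
        "hard_zero_if": dict(raw.get("hard_zero_if", {})),  # default empty
        "hard_cap": raw.get("hard_cap"),                    # default null
    }


spec = apply_spec_defaults(raw_spec)
```

After this call the first two terms carry offset 0.0, divisor 1.0, and no cap, and spec["hard_cap"] is None because the block did not set one.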

Implementation

Source files:

src/nav/nav_technique/confidence.py — the ConfidenceSpec, ConfidenceTerm, ConfidenceTermContribution, and ConfidenceBreakdown dataclasses plus the evaluate_sigmoid_combination() evaluator.
src/nav/nav_technique/confidence_config.py — YAML-to-ConfidenceSpec loader used by Config at startup.
src/nav/nav_technique/nav_technique.py — validate_registered_confidence_specs() and log_confidence_breakdown(), the orchestrator-side validation and logging helpers.

Public surface (autodocumented at nav.nav_technique):

ConfidenceSpec — the per-technique formula. Fields:
- alpha0 — sigmoid-argument baseline.
- terms — tuple of ConfidenceTerm linear contributions.
- hard_zero_if — short-circuit map.
- hard_cap — optional post-sigmoid clamp.
ConfidenceTerm — one linear term. Fields:
- feature — diagnostic-attribute name.
- alpha — coefficient.
- offset — pre-scale offset.
- divisor — scale divisor, applied after the offset.
- cap_at — optional post-scale upper bound.
ConfidenceTermContribution — one term’s contribution trace. Fields: feature, raw, normalized, alpha, contribution.
ConfidenceBreakdown — full evaluation trace. Fields: confidence, sigmoid_arg, alpha0, terms, hard_zero, hard_cap_applied.
evaluate_sigmoid_combination() — the evaluator. Returns the calibrated confidence, or a (confidence, ConfidenceBreakdown) pair when return_breakdown=True.

Call path traced through evaluate_sigmoid_combination():

Walk the spec’s hard_zero_if. For each entry, fetch the named attribute off the diagnostics object (raising ValueError when missing) and compare against the demanded boolean. If any condition holds, short-circuit with a 0.0 confidence (and a hard-zero-tagged ConfidenceBreakdown when the caller asked for one).
Initialise the sigmoid argument with alpha0.
For each term in terms, fetch the named attribute, apply the offset / divisor / cap normalisation, multiply by the alpha, and accumulate the contribution. Record the per-term contribution in a ConfidenceTermContribution when a breakdown was requested.
Pass the accumulated argument through the numerically-stable logistic sigmoid.
Apply hard_cap when set; record whether the cap fired in the breakdown.
Return the calibrated confidence (and the breakdown when requested).
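The call path above can be sketched end to end as follows. This is a simplified illustration under stated assumptions: evaluate_sketch is a hypothetical name, terms are plain mappings rather than ConfidenceTerm objects, the breakdown is a dict rather than a ConfidenceBreakdown, and the function always returns a pair. The blob_snr attribute is invented for the usage example.

```python
import math
from types import SimpleNamespace
from typing import Any, Dict, Mapping, Optional, Sequence, Tuple


def _fetch(diagnostics: Any, name: str) -> Any:
    """Fetch an attribute, raising ValueError when it is missing."""
    if not hasattr(diagnostics, name):
        raise ValueError(f"diagnostics has no attribute {name!r}")
    return getattr(diagnostics, name)


def _sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def evaluate_sketch(
    diagnostics: Any,
    alpha0: float,
    terms: Sequence[Mapping[str, Any]],
    hard_zero_if: Optional[Mapping[str, bool]] = None,
    hard_cap: Optional[float] = None,
) -> Tuple[float, Dict[str, Any]]:
    """Return (confidence, breakdown) following the steps listed above."""
    breakdown: Dict[str, Any] = {
        "confidence": 0.0, "sigmoid_arg": None, "alpha0": alpha0,
        "terms": [], "hard_zero": None, "hard_cap_applied": False,
    }
    # Hard-zero gates short-circuit before any arithmetic.
    for name, demanded in (hard_zero_if or {}).items():
        if bool(_fetch(diagnostics, name)) == demanded:
            breakdown["hard_zero"] = name
            return 0.0, breakdown
    # Start from the baseline, then accumulate alpha * normalised per term.
    z = alpha0
    for term in terms:
        raw = float(_fetch(diagnostics, term["feature"]))
        normalized = (raw - term.get("offset", 0.0)) / term.get("divisor", 1.0)
        cap = term.get("cap_at")
        if cap is not None:
            normalized = min(max(normalized, 0.0), cap)
        contribution = term["alpha"] * normalized
        z += contribution
        breakdown["terms"].append({
            "feature": term["feature"], "raw": raw, "normalized": normalized,
            "alpha": term["alpha"], "contribution": contribution,
        })
    # Squash through the sigmoid, then apply the optional ceiling.
    confidence = _sigmoid(z)
    if hard_cap is not None and confidence > hard_cap:
        confidence = hard_cap
        breakdown["hard_cap_applied"] = True
    breakdown["sigmoid_arg"] = z
    breakdown["confidence"] = confidence
    return confidence, breakdown


# Hypothetical diagnostics object with a single saturating feature.
diag = SimpleNamespace(blob_snr=50.0, at_edge=False)
conf, trace = evaluate_sketch(
    diag, alpha0=-1.0,
    terms=[{"feature": "blob_snr", "alpha": 2.0, "divisor": 10.0, "cap_at": 1.0}],
    hard_zero_if={"at_edge": True}, hard_cap=0.4,
)
```

In this usage the term saturates at its cap, the sigmoid argument is 1.0, and the post-sigmoid ceiling pulls the result down to 0.4 with hard_cap_applied set in the trace. Checking the gates first means a structurally failed fit never pays for, or is confused by, the per-term arithmetic.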

The orchestrator-side helpers are:

validate_registered_confidence_specs() — invoked at config-load time. Walks every registered NavTechnique whose confidence_spec was loaded and verifies that every term’s feature and every hard_zero_if key appears in the technique’s confidence_attributes allow-list. Raises ValueError on the first unknown name.
log_confidence_breakdown() — emits the breakdown at DEBUG always, and also at INFO when the calibrated confidence falls at or below a low_threshold (default 0.1). This is what surfaces “alpha=-1.5 dt_fit_rms_px=8.7 contribution=-13.05 drove confidence to zero” in the per-image log.
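The two helpers can be approximated with the sketch below. The function names validate_spec_names and log_breakdown_sketch, their signatures, and the log message layout are all assumptions made for illustration; only the behaviour (fail on the first unknown name, DEBUG always, INFO at or below the threshold) follows the descriptions above.

```python
import logging
from typing import Any, Iterable, Mapping


def validate_spec_names(technique_name: str, spec: Mapping[str, Any],
                        allowed: Iterable[str]) -> None:
    """Raise ValueError on the first name missing from the allow-list."""
    allowed_set = set(allowed)
    names = [t["feature"] for t in spec.get("terms", [])]
    names += list(spec.get("hard_zero_if", {}))
    for name in names:
        if name not in allowed_set:
            raise ValueError(
                f"{technique_name}: confidence spec references "
                f"unknown attribute {name!r}"
            )


def log_breakdown_sketch(logger: logging.Logger, confidence: float,
                         summary: str, low_threshold: float = 0.1) -> None:
    """Log at DEBUG always, and also at INFO for low confidences."""
    logger.debug("confidence=%.3f %s", confidence, summary)
    if confidence <= low_threshold:
        logger.info("low confidence=%.3f %s", confidence, summary)
```

Naming the technique in the error message keeps a startup failure actionable when many specs are loaded from the same file.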

Examples

Sigmoid-of-linear illustration. Take alpha0 = -1.0 and a single term with alpha 3.0, the default offset and divisor, and no cap, applied to a diagnostic whose raw value is 0.7. The sigmoid argument is \(-1.0 + 3.0 \cdot 0.7 = 1.1\) and the calibrated confidence is \(\sigma(1.1) \approx 0.750\). Holding the formula fixed and raising the diagnostic to 0.9 pushes the argument to 1.7 and the confidence to approximately 0.846; lowering it to 0.3 drops the argument to -0.1 and the confidence to approximately 0.475.

Hard-zero override. The BodyLimbNav spec declares hard_zero_if: {at_edge: true, spurious: true}. When a fit converges with at_edge true (the offset hit the search-window boundary), the linear combination is irrelevant — the calibrated confidence is 0.0 regardless of how the dt-fit RMS or visible-arc terms scored. The breakdown returned in this case carries a hard_zero='at_edge' annotation so the operator log line explains the zero.

Post-sigmoid hard cap. The BodyBlobNav spec declares hard_cap: 0.4. Even when every term saturates (large blob, high SNR, multi-blob composite), the calibrated confidence cannot exceed 0.4 — a brightness-weighted centroid is structurally weaker evidence than a limb fit and the cap encodes that fact independently of the per-term coefficients.

Validation at startup. If the YAML spec for a technique declares feature: dt_fit_rms_px for a star technique whose StarRefineDiagnostics does not carry dt_fit_rms_px, the validate_registered_confidence_specs() walk fails with a ValueError naming the technique class and the unknown attribute, before any image is processed. The same check fires for unknown hard_zero_if keys.

Worked breakdown. A converged BodyLimbNav fit reports visible_limb_arc_fraction 0.85, dt_fit_rms_px 0.4 px, and visible_arc_px 120 px. The spec has alpha0 = -1.0 and three terms: visible_limb_arc_fraction with alpha 3.0, dt_fit_rms_px with alpha -1.5, and visible_arc_px with divisor 100, cap 1, and alpha 0.4. The visible-arc term normalises to 120 / 100 = 1.2, which the cap clips to 1.0. The sigmoid argument is \(-1.0 + 3.0 \cdot 0.85 - 1.5 \cdot 0.4 + 0.4 \cdot 1.0 = 1.35\), the sigmoid evaluates to approximately 0.794, and the technique reports a calibrated confidence of about 0.79. When log_confidence_breakdown() fires, every term’s raw / normalised / contribution numbers appear in the per-image log so an operator can trace which diagnostic carried the score.
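The arithmetic of the worked breakdown can be reproduced with a short script. This is a standalone sketch that hard-codes the numbers quoted above and assumes no hard-zero gate fired; it does not call the real evaluator, and the printed layout is illustrative rather than the actual log format.

```python
import math

alpha0 = -1.0
# (feature, raw value, alpha, divisor, cap) taken from the worked example.
terms = [
    ("visible_limb_arc_fraction", 0.85, 3.0, 1.0, None),
    ("dt_fit_rms_px", 0.4, -1.5, 1.0, None),
    ("visible_arc_px", 120.0, 0.4, 100.0, 1.0),
]

z = alpha0
for feature, raw, alpha, divisor, cap in terms:
    normalized = raw / divisor  # every offset is zero in this example
    if cap is not None:
        normalized = min(max(normalized, 0.0), cap)
    contribution = alpha * normalized
    z += contribution
    print(f"{feature}: raw={raw} normalized={normalized:.3f} "
          f"contribution={contribution:+.3f}")

# z is positive here, so the direct form of the sigmoid is safe.
confidence = 1.0 / (1.0 + math.exp(-z))
print(f"sigmoid_arg={z:.2f} confidence={confidence:.3f}")
```

The final line prints sigmoid_arg=1.35 confidence=0.794, matching the figures in the text; the per-term lines show that the visible-limb-arc term contributes +2.55 and dominates the score.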