Image Library
The image library is the operator-curated regression suite at
tests/integration/image_library/. It serves three roles at once:
Regression cohort: every shipping commit re-navigates every library image and asserts the orchestrator’s reported offset stays within the per-image tolerance budget. Drift is caught at commit time.
Calibration cohort: the per-technique confidence formulas in config_510_techniques.yaml are tuned against the library so that the orchestrator’s reported confidence_tier matches the operator-assigned tier on every image. Without the library there is no objective signal to tune the confidence formulas against.
Coverage map: the scene-class taxonomy (body_full_fov, ring_only_curved, star_dominated, below_resolution_body, high_phase_terminator, etc.) makes coverage gaps visible at a glance; a regime with zero entries is a regime nobody has hand-calibrated, and the orchestrator’s behaviour there is unverified.
Each entry is a YAML sidecar that records:
the image’s mission / camera / filter combo,
an opaque pds3:// URL resolved through PDS3_HOLDINGS_DIR,
the operator-verified ground-truth offset and its 1-sigma uncertainty,
the expected status, confidence_tier, and primary technique,
and the set of techniques that must (or must not) run on the scene.
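How a pds3:// URL is mapped onto the local holdings is not spelled out in this section, so the following is only a sketch under one plausible assumption: the part after the scheme is a path relative to PDS3_HOLDINGS_DIR. The helper name `resolve_pds3_url` is hypothetical and is not part of the test stack.

```python
import os
from pathlib import Path
from typing import Optional

PDS3_SCHEME = "pds3://"


def resolve_pds3_url(url: str, holdings_dir: Optional[str] = None) -> Path:
    """Map a pds3:// URL to a local path (ASSUMED rule: scheme-relative join)."""
    if holdings_dir is None:
        # Fall back to the environment variable named in the text.
        holdings_dir = os.environ.get("PDS3_HOLDINGS_DIR")
    if not holdings_dir:
        raise RuntimeError("PDS3_HOLDINGS_DIR is not set")
    if not url.startswith(PDS3_SCHEME):
        raise ValueError(f"not a pds3:// URL: {url!r}")
    # Everything after the scheme is treated as a path below the holdings root.
    return Path(holdings_dir) / url[len(PDS3_SCHEME):]
```

Keeping the URL opaque in the sidecar means the same entry works on any machine, as long as the holdings root is supplied at run time.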
Two test layers consume it: a fast structural-invariants test
(tests/integration/test_image_library.py) and a slow per-image
regression test (tests/integration/test_autonomous_nav.py),
gated by the integration pytest marker.
Layout
The directory tree IS the registry — there is no manifest.yaml:
tests/integration/image_library/
    images/
        body_mostly_offscreen/
            W1521598221_1_CALIB.yaml
        ring_only_curved/
            N1601122100_1.yaml
        ...
Adding a sidecar to the right scene-class directory enrolls it in CI
automatically; removing a sidecar stops its tests. The set of
scene-class subdirectories is checked against the master list in
tests.integration.sidecar.DECLARED_SCENE_CLASSES, so
typos like body_overflow vs body_partial_overflow fail loudly
at collection time.
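The directory check can be pictured roughly as follows. This is a minimal sketch, not the project's collection hook: `ILLUSTRATIVE_SCENE_CLASSES` is a hand-copied subset standing in for DECLARED_SCENE_CLASSES, and `undeclared_scene_dirs` is a hypothetical helper.

```python
from pathlib import Path

# Illustrative subset only; the real master list lives in
# tests.integration.sidecar.DECLARED_SCENE_CLASSES.
ILLUSTRATIVE_SCENE_CLASSES = frozenset(
    {"body_full_fov", "body_partial_overflow", "ring_only_curved"}
)


def undeclared_scene_dirs(images_root: Path, declared=ILLUSTRATIVE_SCENE_CLASSES):
    """Return the names of subdirectories that are not declared scene classes."""
    return sorted(
        entry.name
        for entry in images_root.iterdir()
        if entry.is_dir() and entry.name not in declared
    )
```

A test built on this would assert that the returned list is empty, so a misspelled directory such as body_overflow is reported by name.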
Sidecar schema (schema_version 1)
schema_version: 1
mission: COISS                 # COISS | VGISS | GOSSI | NHLORRI
camera: WAC                    # NAC | WAC | SSI | NA | WA | LORRI
image_id: W1521598221_1_CALIB
image_datetime_utc: '2006-04-26T08:32:14.123Z'
                               # UTC ISO 8601; from et_to_utc(obs.midtime)
exposure_time_sec: 0.46        # seconds; from obs.texp
filter_combo: 'CL+VIO'         # canonicalized: filters sorted, '+'-joined
image_url: 'pds3://volumes/COISS_2xxx/COISS_2021/.../W1521598221_1_CALIB.IMG'
scene_tags: [body_mostly_offscreen, rhea]
                               # First tag is the primary class;
                               # must match the directory name.
ground_truth:
  offset_dv_px: 12.5
  offset_du_px: -3.25
  offset_uncertainty_px: 1.0   # 1-sigma; the test's tolerance budget
  source: operator_verified
  operator: rfrench
  verified_date: 2026-04-28
  ui_version: 'rms-nav 0.1.dev0'
  notes: |
    Hand-verified limb fit, no rings in the FOV.
expected:
  status: success              # success | failed | conflicted
  confidence_tier: high        # high | medium | low | failed
  primary_technique: BodyLimbNav
  techniques_must_run: [BodyLimbNav]
  techniques_must_skip: [StarFieldFromCatalogNav]
The full validator lives in tests.integration.sidecar; malformed
fields raise SidecarValidationError
at collection time.
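A few of the invariants stated in the schema comments can be expressed directly. The sketch below is not the validator in tests.integration.sidecar; `canonical_filter_combo` and `sidecar_problems` are hypothetical helpers operating on an already-parsed dict, and plain lexicographic sorting of filter names is an assumption about what "sorted" means.

```python
ALLOWED_STATUS = {"success", "failed", "conflicted"}
ALLOWED_TIERS = {"high", "medium", "low", "failed"}


def canonical_filter_combo(filters):
    """Sort filter names and join them with '+', e.g. ['VIO', 'CL'] -> 'CL+VIO'."""
    return "+".join(sorted(filters))


def sidecar_problems(sidecar, directory_name):
    """Return human-readable problems for a parsed sidecar dict (empty if none)."""
    problems = []
    tags = sidecar.get("scene_tags") or []
    # The first scene tag is the primary class and must match the directory.
    if not tags or tags[0] != directory_name:
        problems.append("primary scene tag does not match directory name")
    combo = sidecar.get("filter_combo", "")
    if combo != canonical_filter_combo(combo.split("+")):
        problems.append("filter_combo is not canonicalized")
    expected = sidecar.get("expected", {})
    if expected.get("status") not in ALLOWED_STATUS:
        problems.append("expected.status is not an allowed value")
    if expected.get("confidence_tier") not in ALLOWED_TIERS:
        problems.append("expected.confidence_tier is not an allowed value")
    return problems
```

Returning a list rather than raising on the first problem lets one run report every issue in a sidecar; the real validator raises SidecarValidationError instead.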
The calibration process
The library is the project’s calibration substrate. This section describes the end-to-end calibration loop that the library participates in.
Why the library is the right substrate
The autonomous navigation pipeline produces three pieces of output per
image: an offset, a covariance, and a calibrated confidence rank
(high / medium / low / failed). The first two are
testable against ground truth; the third is defined by ground truth
— a “high”-confidence verdict means “in the regime where the operator
hand-verified the offset to ~1 px and the technique reported
self-consistent diagnostics, the answer is right ≥ 95 % of the time.”
That definition has no meaning without a cohort of images on which the
operator actually hand-verified the offset. The library is that cohort.
The same cohort drives:
Per-technique confidence-formula coefficients. Each NavTechnique carries a small set of calibrated coefficients in techniques.<TechniqueName>.confidence in config_510_techniques.yaml. The coefficients map per-technique diagnostics (correlation peak height, NCC margin, fit residuals, …) into a [0, 1] confidence score. The coefficients are tuned offline so that, across the library, the score correlates monotonically with per-image correctness.
Per-instrument photometric thresholds. Per-instrument mag_offset, noise.sigma_floor, image_quality_thresholds, and source_image_filter defaults are sanity-checked against library images for each instrument; values that fail to reproduce the hand-verified offset on a representative cohort are revisited. See Config and Static Data for the citation discipline.
Per-technique runtime tunables. Spurious-detection thresholds, at-edge tolerances, minimum arc lengths, ring-edge detectability cutoffs: every numeric knob in config_510_techniques.yaml was picked because it was the value that made the library pass while staying conservative on regimes outside the library’s coverage.
The calibration loop
A change to the navigation pipeline that affects per-image output goes through this loop:
Land the change behind the regression suite. The change is developed on a branch. The author runs pytest -m integration -k <relevant_image_ids> against $PDS3_HOLDINGS_DIR to see which library entries it shifts. A pure refactor should shift nothing; a tuning change shifts a known cohort.
Inspect the diff per image. For each shifted image, the author re-runs nav_offset --manual <image> and visually verifies that the recomputed offset still overlays the limb / star field / ring edge. When the recomputed offset is better than the operator-stored ground truth (for example, a fix produces a sub-pixel offset where the prior implementation reported 1.5 px), the operator-stored ground truth is updated in the same PR: a fresh Save as Library Entry... from the manual nav dialog rewrites the sidecar’s ground_truth block, and operator, verified_date, and ui_version get refreshed automatically.
Update the regression baselines. Once the per-image ground-truth review is complete, the author runs python -m tests.integration.update_baselines --image-id <ids> to refresh the byte-stable baseline JSONs for the images that shifted. The diff that update emits goes into the same PR as the code change.
Re-tune calibrated confidence if needed. If the change shifts the per-image confidence score (not just the offset), the author re-runs the offline confidence-tuning script. (Automated tooling under tests/integration/calibrate_confidence.py is reserved as a slot pending a wider library cohort; the workflow uses ad-hoc per-technique tuning.) The per-technique confidence coefficient block in config_510_techniques.yaml updates accordingly.
Reviewer sign-off. PRs that touch any of (a) library sidecars, (b) baseline JSONs, or (c) per-technique coefficients require a reviewer to manually open at least one shifted image’s summary PNG and verify the overlay still tracks the data. This is the “operator-in-the-loop” gate the calibration substrate cannot replace.
Coverage taxonomy
The directory names under tests/integration/image_library/images/
constitute the scene-class taxonomy. Each name encodes one
information-bearing regime the orchestrator must handle. The shipping
classes (mirrored in
tests.integration.sidecar.DECLARED_SCENE_CLASSES) include:
Body geometry: body_full_fov, body_partial_overflow, body_mostly_offscreen, body_overlapping, below_resolution_body, multi_body, high_phase_terminator.
Ring geometry: ring_only_curved, ring_only_straight, ring_with_body, ring_below_resolution (annulus regime).
Star regimes: one_bright_star_no_body, star_dominated, star_field_with_body, star_field_with_ring.
Failure regimes: empty_fov, saturated, mostly_dropouts, cosmic_ray_dense.
A regime with zero library entries is a regime where the orchestrator’s behaviour is unverified. When a code path is added — say, a new technique — the contributor adds at least one library image per regime that exercises the path. The Extending the System chapter’s checklist enumerates which regimes each subsystem needs.
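Because the directory tree is the registry, a coverage report reduces to counting sidecars per scene-class directory. The following is a minimal sketch under that assumption; `coverage_counts` and `uncovered_classes` are hypothetical helpers, and the caller supplies the list of declared classes.

```python
from pathlib import Path


def coverage_counts(images_root: Path, declared):
    """Count <class>/<image_id>.yaml sidecars for every declared scene class."""
    counts = {name: 0 for name in declared}
    for sidecar in images_root.glob("*/*.yaml"):
        scene_class = sidecar.parent.name
        if scene_class in counts:
            counts[scene_class] += 1
    return counts


def uncovered_classes(counts):
    """Scene classes with zero entries, i.e. regimes nobody has verified."""
    return sorted(name for name, n in counts.items() if n == 0)
```

Seeding the counts from the declared list, rather than from the directories found on disk, is what makes an empty or missing regime show up as an explicit zero.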
Confidence-tier semantics
The four tiers are calibrated to operator expectations:
high: operator-verified offset to ≤ 1 px on a sharp-feature image; orchestrator reports a sharp NCC peak and tight covariance; multiple techniques agree. Bound: the per-image error is below the operator-supplied offset_uncertainty_px essentially every time.
medium: operator-verified offset to ≤ 2 px on a soft-feature / star-poor image, or an image where one technique dominates without cross-technique corroboration.
low: operator-verified offset to ≤ 4 px on a degenerate-geometry scene (rank-1 ring fits, partial limb arcs without curvature). The pipeline reports the offset and the operator decides whether to trust it; the bundle annotates the data label accordingly.
failed: no usable techniques fired, or every technique reported spurious / at-edge. No offset is reported.
The library’s expected.confidence_tier field is the calibration
target, so a tier mismatch on regression always fails, with no
slack. The per-axis offset budget is
offset_uncertainty_px plus 0.5 px of slack.
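The two pass criteria can be written out as a small check. This is a sketch of the rule as stated here, not the assertion code in test_autonomous_nav.py; the function names are hypothetical.

```python
SLACK_PX = 0.5  # fixed slack added to the per-image uncertainty


def offset_within_budget(actual_dv, actual_du, truth_dv, truth_du, uncertainty_px):
    """True when each axis is within offset_uncertainty_px + 0.5 px of truth."""
    budget = uncertainty_px + SLACK_PX
    return (
        abs(actual_dv - truth_dv) <= budget
        and abs(actual_du - truth_du) <= budget
    )


def regression_passes(actual, ground_truth, actual_tier, expected_tier):
    """Tier must match exactly; the offset is checked per axis against the budget."""
    if actual_tier != expected_tier:
        return False  # no slack on the confidence tier
    return offset_within_budget(
        actual[0],
        actual[1],
        ground_truth["offset_dv_px"],
        ground_truth["offset_du_px"],
        ground_truth["offset_uncertainty_px"],
    )
```

With the example sidecar values (12.5, -3.25, uncertainty 1.0) the budget is 1.5 px per axis, so a reported dv of 13.9 passes and 14.1 does not. The axes are checked independently rather than as a radial distance.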
Per-instrument calibration anchors
Each per-instrument config (config_4N0_inst_*.yaml) includes a
small set of calibration-anchor library images that exercise the
instrument’s per-image quirks. Examples:
COISS NAC: at least one calibrated-IF image, one CALIB-stage image, one heavily-saturated image (to exercise the saturation mask), one cosmic-ray-dense image.
VGISS NA / WA: at least one image with the known per-camera geometric distortion residual (the per-instrument geometric_distortion correction in the per-instrument config was tuned against this image).
GOSSI / NHLORRI: at least one calibrated image and one raw image to cross-check the calibration=False default in the LORRI loader.
These anchors are the minimum coverage; the broader per-regime cohort above is what catches drift. When a per-instrument calibration value changes, every per-instrument anchor must still pass.
Adding a new entry
The recommended path is the manual-navigation dialog’s Save as Library Entry… button:
Run the manual-nav dialog on the candidate image with the nav_offset [args] --manual CLI flag, where [args] are the selection / dataset / config flags that pin the run down to a single image (e.g. dataset id, an image-list file, --config for a non-default bundle).
Pick the offset by hand (or accept the Auto result).
Click Save as Library Entry…. A file-save dialog suggests <image_id>.yaml as the filename; point it at the right scene-class directory under tests/integration/image_library/images/<class>/. The dialog also drops a companion <image_id>.png next to the YAML showing the red-image / green-model overlay at the chosen (dv, du); it’s an orientation aid for future reviewers and is not consumed by any test.
Open the saved YAML and replace every TODO_REPLACE_* placeholder (scene_tags, primary_technique, notes, etc.).
Re-run pytest tests/integration/test_image_library.py to check the schema; iterate until it passes.
With PDS3_HOLDINGS_DIR set, run pytest tests/integration/test_autonomous_nav.py -k <image_id> to check the offset assertion against the live orchestrator.
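Before running the schema test it can help to confirm that no placeholder was left behind in the saved YAML. The snippet below is a small convenience sketch, not part of the project tooling; it relies only on the TODO_REPLACE_ prefix, and the specific placeholder names used in the usage example are made up.

```python
import re

# Matches the TODO_REPLACE_ prefix followed by any identifier characters.
PLACEHOLDER = re.compile(r"TODO_REPLACE_\w*")


def find_placeholders(sidecar_text):
    """Return (line_number, placeholder) pairs for every leftover placeholder."""
    hits = []
    for lineno, line in enumerate(sidecar_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Hypothetical placeholder names, for illustration only.
    sample = "scene_tags: [TODO_REPLACE_SCENE_CLASS]\nmission: COISS\n"
    for lineno, name in find_placeholders(sample):
        print(f"line {lineno}: {name}")
```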
Tolerance regimes
| Source | Typical uncertainty (px) | Use when |
|---|---|---|
| | 1.0 | majority of cases |
| | 2.0 | high-phase / soft-edge / faint-star |
The CI test tolerance is offset_uncertainty_px + 0.5 px slack on
each axis. confidence_tier mismatches always fail (no slack — tier
is part of the calibration target).
Regression baselines
In addition to the per-sidecar tolerance test, a separate baseline
layer records the exact rounded (offset_dv_px, offset_du_px,
confidence) triple in
tests/integration/baselines/<image_id>.json. Comparison is
exact-equal on rounded values (offset to 4 decimals, confidence
to 3); the baseline schema deliberately omits
pipeline_run_iso8601 because that is the only provenance field that
is not byte-identical between identical runs.
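The rounding rule and the exact-equality comparison can be illustrated with a stand-in class. This is a sketch, not tests.integration.baseline.Baseline: the class name `BaselineSketch`, the constructor `from_values`, the assumption that the stored keys are image_id plus the rounded triple, and the indent setting are all illustrative.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BaselineSketch:
    image_id: str
    offset_dv_px: float
    offset_du_px: float
    confidence: float

    @classmethod
    def from_values(cls, image_id, dv, du, confidence):
        # Offsets rounded to 4 decimals, confidence to 3; no timestamp is stored.
        return cls(image_id, round(dv, 4), round(du, 4), round(confidence, 3))

    def to_json(self):
        # Sorted keys and a trailing newline keep the file byte-stable;
        # indent=2 is an assumption made for readability.
        return json.dumps(asdict(self), indent=2, sort_keys=True) + "\n"
```

Two runs whose raw outputs differ only below the rounding precision then compare equal and serialize to identical bytes, which is what lets the comparison be exact rather than tolerance-based.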
What checks the baselines
Two tests under tests/integration/test_baselines.py:
test_every_baseline_cites_a_sidecar: runs in the fast suite (no holdings needed). Asserts that every baselines/<image_id>.json has a matching sidecar at image_library/images/*/<image_id>.yaml, and that the file’s stem matches the baseline’s image_id field. Catches the common drift where a sidecar is renamed or deleted but its baseline lingers.
test_regression_baseline_exact_match: gated by the integration pytest marker and skipped when PDS3_HOLDINGS_DIR is unset. Parametrized one case per (baseline, sidecar) pair; runs the orchestrator against the real holdings, calls tests.integration.baseline.Baseline.from_run() to round the fresh outputs, and asserts actual == expected (exact equality on all four keys). The failure message tells the operator to update the JSON in the same PR if the diff is intended.
Plus a handful of round-trip / serialisation unit tests on
tests.integration.baseline.Baseline.from_run() and
tests.integration.baseline.Baseline.to_json() that pin the rounding
rule and confirm byte-stable JSON (sorted keys, trailing newline).
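The orphan-baseline invariant needs only the filesystem, which is why it can run without holdings. A rough sketch of such a check follows; `orphan_baseline_problems` is a hypothetical helper, not the code in test_baselines.py.

```python
import json
from pathlib import Path


def orphan_baseline_problems(baselines_dir: Path, images_root: Path):
    """List baselines lacking a sidecar or whose image_id differs from the stem."""
    problems = []
    for baseline in sorted(baselines_dir.glob("*.json")):
        stem = baseline.stem
        # A sidecar may live in any scene-class directory.
        if not any(images_root.glob(f"*/{stem}.yaml")):
            problems.append(f"{baseline.name}: no matching sidecar")
        recorded = json.loads(baseline.read_text()).get("image_id")
        if recorded != stem:
            problems.append(
                f"{baseline.name}: image_id {recorded!r} does not match file stem"
            )
    return problems
```

A test would assert that the returned list is empty, and the messages name the offending file so a stale baseline is easy to locate.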
How a baseline is created or updated
Use the developer tool at
tests/integration/update_baselines.py. It is intentionally
not packaged as a user-facing CLI — invoke it from a project checkout
as a Python module so the test stack imports resolve naturally. It
refuses to run without PDS3_HOLDINGS_DIR set.
python -m tests.integration.update_baselines --image-id <image_id> # one image
python -m tests.integration.update_baselines --image-id A --image-id B # hand-picked batch
python -m tests.integration.update_baselines --all # every sidecar
python -m tests.integration.update_baselines --all --dry-run # preview only
For each image the tool runs
navigate_image_files() against the live
holdings, rounds the result via
tests.integration.baseline.Baseline.from_run(), compares against
any existing baseline, and reports one of:
CREATE: no on-disk baseline; new file written.
UPDATE: baseline drifted; old → new diff printed and the file overwritten.
UNCHANGED: bytes match; file untouched (mtime preserved).
FAILED: orchestrator returned no offset, or the requested --image-id matched no sidecar.
The exit code is 0 when every selected image succeeded (whether the
result was CREATE, UPDATE, or UNCHANGED), 1 when at least one FAILED
line was emitted, and 2 on argument-parsing errors or when
PDS3_HOLDINGS_DIR is unset.
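The per-image status and the process exit code follow from simple rules. The sketch below models them with two hypothetical functions; it is not the code in update_baselines.py, and the byte-comparison inputs are an assumption about how "bytes match" is decided.

```python
from typing import Iterable, Optional


def classify(existing: Optional[bytes], fresh: Optional[bytes]) -> str:
    """Pick the report line for one image.

    existing: bytes of the on-disk baseline, or None if there is no file.
    fresh:    bytes of the newly computed baseline, or None if no offset
              was returned (or the image id matched no sidecar).
    """
    if fresh is None:
        return "FAILED"
    if existing is None:
        return "CREATE"
    return "UNCHANGED" if existing == fresh else "UPDATE"


def exit_code(statuses: Iterable[str], holdings_dir_set: bool = True) -> int:
    """0 if all images succeeded, 1 on any FAILED, 2 if holdings are unset."""
    if not holdings_dir_set:
        return 2
    return 1 if any(s == "FAILED" for s in statuses) else 0
```

Treating UPDATE as a success keeps the tool usable in scripts: a drifted baseline is an expected outcome to review on the PR, not an error.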
Sidecars must land first — the
test_every_baseline_cites_a_sidecar invariant refuses orphan
baselines. Baseline updates always require explicit operator review
on the PR; the CLI is the mechanical step, but the human review of
the resulting diff (does the new offset still overlay the limb? does
the new confidence still match expected.confidence_tier?) is what
keeps the regression layer trustworthy.