# CLI

## Install

```shell
uv tool install ordeal    # global, `ordeal` on PATH
uvx ordeal explore        # ephemeral, no install
uv run ordeal explore     # inside project venv
```
## Commands
### ordeal audit

Measure your existing tests vs what ordeal auto-scan achieves — verified numbers, not estimates:

```shell
ordeal audit myapp.scoring --test-dir tests/
ordeal audit myapp.scoring myapp.pipeline -t tests/ --max-examples 50
```
Output:
```
ordeal audit
  myapp.scoring
    current:  33 tests | 343 lines | 98% coverage [verified]
    migrated: 12 tests | 130 lines | 96% coverage [verified]
    saving:   64% fewer tests | 62% less code | same coverage
    mined:    deterministic(compute, normalize), output in [0, 1](compute)
    mutation: 14/18 (78%)
    suggest:
      - L42 in compute(): test when x < 0
      - L67 in normalize(): test that ValueError is raised
```
Every number is [verified] (measured via coverage.py JSON, cross-checked for consistency) or FAILED: reason. Mined properties are grouped by kind. The mutation score shows how many code mutations the mined properties catch — if it's below 100%, the surviving mutants reveal property gaps.
The "migrated" column shows what a real ordeal test file looks like: fuzz() for crash safety plus explicitly mined properties (bounds, determinism, type checks). It generates the test file a developer would write after adopting ordeal.
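As a rough illustration, a migrated file boils down to a handful of property assertions. The sketch below is hand-written, not ordeal's actual generated output, and `compute` is a hypothetical stand-in function:

```python
import math

# Hand-written sketch, NOT ordeal's generated output. `compute` is a
# hypothetical scoring function used only to illustrate the shape of
# the assertions (bounds, determinism, type checks).
def compute(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))  # stand-in: a sigmoid score

def test_compute_properties():
    for x in [-5.0, -1.0, 0.0, 0.5, 3.0]:
        y = compute(x)
        assert 0.0 <= y <= 1.0       # mined bound: output in [0, 1]
        assert compute(x) == y       # mined property: deterministic
        assert isinstance(y, float)  # mined type invariant

test_compute_properties()
```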
Use --show-generated to inspect the generated test, or --save-generated to save it and use it directly:
```shell
ordeal audit myapp.scoring --show-generated                     # print generated test
ordeal audit myapp.scoring --save-generated test_migrated.py    # save to file
```
| Flag | Default | Description |
|---|---|---|
| `modules` | required | Module paths to audit (positional, one or more) |
| `--test-dir, -t` | `tests` | Directory containing existing tests |
| `--max-examples` | 20 | Hypothesis examples per function |
| `--show-generated` | off | Print the generated test file |
| `--save-generated` | — | Save generated test to this path |
### ordeal mine
Discover properties of a function or all public functions in a module. Prints what mine() finds — type invariants, algebraic laws, bounds, monotonicity, length relationships — with confidence levels.
```shell
ordeal mine myapp.scoring.compute            # single function
ordeal mine myapp.scoring                    # all public functions
ordeal mine myapp.scoring.compute -n 1000    # more examples = tighter confidence
```
Output:
```
mine(compute): 500 examples
  ALWAYS output type is float (500/500)
  ALWAYS deterministic (50/50)
  ALWAYS output in [0, 1] (500/500)
  ALWAYS observed range [0.0, 0.9987] (500/500)
  ALWAYS monotonically non-decreasing (499/499)
  n/a: commutative, associative
```
Use this to understand a function before writing tests. The ALWAYS properties are candidates for assertions; the n/a list shows what doesn't apply. result.not_checked (visible in the Python API) lists what mine() structurally cannot verify — those are the tests you write manually.
| Flag | Default | Description |
|---|---|---|
| `target` | required | Dotted path: `mymod.func` or `mymod` (positional) |
| `--max-examples, -n` | 500 | Examples to sample |
### ordeal mine-pair
Discover relational properties between two functions: roundtrip (g(f(x)) == x), reverse roundtrip (f(g(x)) == x), and commutative composition (f(g(x)) == g(f(x))).
```shell
ordeal mine-pair myapp.encode myapp.decode           # roundtrip?
ordeal mine-pair myapp.serialize myapp.parse -n 500  # more examples
```
Output:
```
mine(encode <-> decode): 200 examples
  ALWAYS roundtrip decode(encode(x)) == x (48/48)
  ALWAYS roundtrip encode(decode(x)) == x (45/45)
  52% commutative composition (26/50)
```
Use this when you have function pairs that should be inverses (encode/decode, serialize/parse, compress/decompress) or that should commute.
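A concrete example of the kind of pair mine-pair would flag as inverses; the `encode`/`decode` functions here are hypothetical, not part of ordeal:

```python
# Hypothetical encode/decode pair (not part of ordeal) showing the
# relational properties mine-pair probes for.
def encode(s: str) -> bytes:
    return s.encode("utf-8")

def decode(b: bytes) -> str:
    return b.decode("utf-8")

# Roundtrip g(f(x)) == x: decoding what you encoded returns the input.
for s in ["", "héllo", "a b c"]:
    assert decode(encode(s)) == s

# Reverse roundtrip f(g(x)) == x: holds here for any valid UTF-8 bytes.
for b in [b"", b"abc"]:
    assert encode(decode(b)) == b
```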
| Flag | Default | Description |
|---|---|---|
| `f` | required | First function (positional) |
| `g` | required | Second function (positional) |
| `--max-examples, -n` | 200 | Examples to sample |
### ordeal benchmark
Measure how parallel exploration scales on your machine and test class. Runs the Explorer at N=1, 2, 4, 8... workers, measures throughput, and fits the Universal Scaling Law (USL):
```shell
ordeal benchmark                    # uses ordeal.toml, first [[tests]] entry
ordeal benchmark -c ci.toml         # custom config
ordeal benchmark --max-workers 16   # test up to 16 workers
ordeal benchmark --time 30          # 30s per trial (default: 10s)
ordeal benchmark --metric edges     # fit on edges/sec instead of runs/sec
```
Output:

```
Scaling Analysis (Universal Scaling Law)
  sigma (contention): 0.080755
  kappa (coherence):  0.005578
  Regime: usl
  Optimal workers: 13.4
  Peak throughput: 7.64x
  Diagnosis:
    Contention (sigma): 8.1% serialized fraction.
    Coherence (kappa): 0.005578 cross-worker sync cost.
```
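For intuition, the standard USL form predicts throughput at N workers as N / (1 + σ(N−1) + κN(N−1)). This is a sketch under the usual formulation; ordeal's internal fit may use a different variant, so its numbers need not match exactly:

```python
import math

# Standard USL form (Gunther). Sketch only: ordeal's internal fit may
# use a different variant, so the numbers need not match the report.
def usl_speedup(n: float, sigma: float, kappa: float) -> float:
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

sigma, kappa = 0.080755, 0.005578        # coefficients from the report above
n_opt = math.sqrt((1 - sigma) / kappa)   # analytic optimum of this curve
print(f"optimal workers ~ {n_opt:.1f}")
print(f"speedup there   ~ {usl_speedup(n_opt, sigma, kappa):.2f}x")
```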
| Flag | Default | Description |
|---|---|---|
| `--config, -c` | `ordeal.toml` | Config file |
| `--max-workers` | CPU count | Maximum workers to test |
| `--time` | 10 | Seconds per trial |
| `--metric` | `runs` | `"runs"` (runs/sec) or `"edges"` (edges/sec) |
### ordeal explore
Your main command for deep exploration. Reads ordeal.toml, loads each ChaosTest class, and runs coverage-guided exploration with fault injection, energy scheduling, and swarm mode.
Use for: pre-commit validation, pre-release exploration runs, CI pipelines, and finding deep bugs that unit tests miss.
```shell
ordeal explore                  # reads ordeal.toml
ordeal explore -c ci.toml       # custom config
ordeal explore -v               # live progress
ordeal explore --max-time 300   # override time
ordeal explore --seed 99        # override seed
ordeal explore --no-shrink      # skip failure minimization
ordeal explore -w 4             # 4 parallel workers
```
The --workers / -w flag runs exploration across multiple processes. Each worker gets a unique seed for independent state-space exploration. Results are aggregated: runs/steps are summed, edges are unioned for true unique count. Use --workers $(nproc) for full CPU utilization.
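The aggregation described above can be sketched in a few lines; the dicts here are illustrative stand-ins, not ordeal's internal data structures:

```python
# Illustrative sketch of result aggregation; these dicts stand in for
# whatever ordeal's workers actually report back.
workers = [
    {"runs": 1200, "steps": 48000, "edges": {1, 2, 3, 7}},
    {"runs": 1150, "steps": 46100, "edges": {2, 3, 5, 9}},
]

total_runs = sum(w["runs"] for w in workers)                # runs are summed
total_steps = sum(w["steps"] for w in workers)              # steps are summed
unique_edges = set().union(*(w["edges"] for w in workers))  # edges are unioned

print(total_runs, total_steps, len(unique_edges))  # 2350 94100 6
```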
### ordeal replay
Reproduce a failure from a saved trace. The trace file contains the exact sequence of rules and fault toggles that triggered the failure, so replaying it re-executes the same steps.
Use for: triaging a CI failure, sharing a reproducible bug with a colleague, verifying that a fix actually resolves the issue.
```shell
ordeal replay .ordeal/traces/fail-run-42.json        # reproduce
ordeal replay --shrink trace.json                    # minimize
ordeal replay --shrink trace.json -o minimal.json    # save minimized
```
The --shrink flag runs delta-debugging to remove unnecessary steps from the trace. Use it when: the trace is too long to understand, or you want the minimal sequence of operations that reproduces the failure. The shrunk trace is often 5-10x shorter than the original.
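The idea behind shrinking can be sketched as a greedy loop that drops steps while the failure still reproduces. This is a simplified one-at-a-time variant with a toy failure predicate, not ordeal's shrinker (real delta-debugging like ddmin removes chunks at a time):

```python
from typing import Callable, List

def shrink(trace: List[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily remove steps that aren't needed to reproduce the failure."""
    assert fails(trace), "trace must reproduce the failure to start"
    i = 0
    while i < len(trace):
        candidate = trace[:i] + trace[i + 1:]  # try the trace without step i
        if fails(candidate):
            trace = candidate  # step i was unnecessary; drop it
        else:
            i += 1             # step i is required; keep it and move on
    return trace

# Toy failure: reproduces whenever both "open" and "write" are present.
fails = lambda t: "open" in t and "write" in t
print(shrink(["open", "read", "noop", "write", "close"], fails))
# -> ['open', 'write']
```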
## Workflows

### Local development
Quick exploration with live progress. Run this before committing to catch obvious issues:
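For example, a 30-second pass with live progress, built from the flags documented under `ordeal explore`:

```shell
ordeal explore -v --max-time 30
```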
The -v flag prints a progress line showing runs, steps, edges discovered, and failures found. Thirty seconds is enough to catch most shallow bugs.
### CI pipeline
Longer exploration with a dedicated config, JSON report, and a nonzero exit code on failure:
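One plausible shape for that CI step, assuming a `ci.toml` at the repo root:

```shell
ordeal explore -c ci.toml    # nonzero exit code on any failure
```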
Where ci.toml might set max_time = 120, report.format = "json", and report.output = "ordeal-report.json". The exit code is 1 if any failure is found, so your CI script can gate on it directly.
### Bug triage
When a CI run or colleague reports a failure trace:
```shell
ordeal replay trace.json                           # confirm it reproduces
ordeal replay --shrink trace.json -o minimal.json  # minimize it
```
The shrunk trace gives you the shortest sequence of operations that triggers the bug. Read through the steps: which rules ran, which faults were active, and where the exception occurred.
### Reproducibility
Fix the seed for deterministic exploration. The same seed produces the same sequence of rule interleavings and fault schedules:
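For example:

```shell
ordeal explore --seed 42    # same seed, same interleavings
```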
Useful for: bisecting changes (did this commit introduce the failure?), comparing exploration runs across branches, and ensuring consistent CI behavior.
## pytest integration

ordeal also works as a pytest plugin, auto-registered when ordeal is installed. No configuration is needed: pytest picks it up automatically via the pytest11 entry point.
### How --chaos works

```shell
pytest --chaos                     # enable chaos mode
pytest --chaos --chaos-seed 42     # reproducible seed
pytest --chaos --buggify-prob 0.2  # higher fault probability
```
When you pass --chaos, three things happen:
- **PropertyTracker activates:** all `always()`, `sometimes()`, `reachable()`, and `unreachable()` calls start recording hits and results instead of being no-ops.
- **`buggify()` activates:** every `buggify()` call in your code has a chance of returning `True` (default 10%, controlled by `--buggify-prob`).
- **Chaos-only tests run:** tests marked with `@pytest.mark.chaos` are collected instead of skipped.
Without --chaos, your test suite runs normally. buggify() returns False, assertions are no-ops, and chaos-marked tests are skipped.
### @pytest.mark.chaos
Mark tests that should only run under chaos mode. These are skipped without the --chaos flag, so your normal CI runs are not affected:
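A minimal example; the test name and body here are hypothetical:

```python
import pytest

@pytest.mark.chaos
def test_survives_fault_storm():
    # Hypothetical chaos-only test: collected only under `pytest --chaos`.
    ...
```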
This is useful for tests that are slow (because they explore fault interleavings), flaky by design (because faults cause nondeterminism), or only meaningful under fault injection.
### The property report
When --chaos is active, ordeal prints a property report at the end of the test run. It shows every tracked property, its type, hit count, and pass/fail status:
```
--- Ordeal Property Results ---
PASS cache hit (sometimes: 47 hits)
PASS no data loss (always: 312 hits)
FAIL stale read (sometimes: never true in 200 hits)
1/3 properties FAILED
```
`always` properties pass if they held every time they were evaluated. `sometimes` properties pass if they held at least once. `reachable` properties pass if the code path was reached. `unreachable` properties pass if it was never reached.
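Those pass rules are easy to state in code. This is an illustrative reimplementation, not ordeal's actual tracker:

```python
# Illustrative only: not ordeal's implementation. `results` holds the
# boolean outcome of each evaluation of a property.
def passes(kind: str, results: list) -> bool:
    if kind == "always":
        return all(results)       # must hold every time it was evaluated
    if kind == "sometimes":
        return any(results)       # must hold at least once
    if kind == "reachable":
        return len(results) > 0   # the code path must be hit
    if kind == "unreachable":
        return len(results) == 0  # the code path must never be hit
    raise ValueError(f"unknown property kind: {kind}")

assert passes("always", [True] * 312)
assert not passes("sometimes", [False] * 200)  # "never true in 200 hits"
```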
### chaos_enabled fixture

For tests that need chaos in a specific scope without requiring the global --chaos flag:

```python
def test_something(chaos_enabled):
    # buggify() is active, PropertyTracker is recording
    result = my_function()
    assert result is not None
```
The fixture activates buggify and the PropertyTracker for the duration of the test, then restores the previous state.
### Pytest patterns
Pattern 1: Separate chaos tests from unit tests. Keep chaos tests in their own directory so you can run them independently:
```
tests/
├── unit/      # fast, deterministic — always run
│   └── test_scoring.py
├── chaos/     # slower, exploratory — run with --chaos
│   └── test_scoring_chaos.py
└── conftest.py
```
Pattern 2: Use chaos_enabled for targeted chaos in unit tests. You don't need --chaos for everything. Use the fixture when a specific test needs fault injection:
```python
def test_retry_logic(chaos_enabled):
    """This test specifically checks retry behavior under buggify."""
    from ordeal.buggify import buggify
    # buggify() is now active — it will sometimes return True
    result = service_with_retries.call()
    assert result is not None  # should succeed despite faults
```
Pattern 3: Combine @pytest.mark.chaos with ChaosTest.TestCase. ChaosTest classes work with or without --chaos, but marking them ensures they're skipped in fast CI runs:
```python
import pytest
from ordeal import ChaosTest, rule, always

@pytest.mark.chaos
class ScoreServiceChaos(ChaosTest):
    faults = [...]

    @rule()
    def score(self): ...

TestScoreServiceChaos = ScoreServiceChaos.TestCase
```
Pattern 4: Auto-scan via ordeal.toml. When you add [[scan]] entries to ordeal.toml, pytest auto-discovers and runs them. No test files needed:
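A sketch of what such an entry might look like; the key name below is hypothetical, so check ordeal's config reference for the real schema:

```toml
[[scan]]
module = "myapp.scoring"   # hypothetical key name, illustrative only
```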
Each public function in the module becomes a test item. Functions without type hints are skipped unless fixtures are provided in the TOML.
Pattern 5: Different buggify probabilities for different environments.
```shell
pytest --chaos --buggify-prob 0.05  # gentle: 5% fault rate (local dev)
pytest --chaos --buggify-prob 0.1   # moderate: 10% (default, CI)
pytest --chaos --buggify-prob 0.3   # aggressive: 30% (pre-release stress)
```
Higher probability = more faults per run = finds more bugs but also more noise. Start gentle, increase as your error handling matures.
## Exit codes
ordeal explore returns 0 on success (no failures found) and 1 if any failure is found or if there is a configuration error. Use this directly in CI scripts:
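Since the exit code already encodes pass/fail, gating can be as simple as:

```shell
ordeal explore -c ci.toml || { echo "ordeal found failures"; exit 1; }
```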
ordeal replay returns 0 if the failure did not reproduce (which can happen if the code has changed) and 1 if the failure reproduced.