API Reference

In plain English

This is your lookup table -- when you know what you want to do, find the exact function here. Each section maps to a concept you can learn more about in the guides. Whether you're adding ordeal to an existing test suite or starting fresh, the signatures and examples below give you everything you need to wire things up.

Complete public API with signatures, parameters, and usage.

Discovery

Find everything ordeal offers — programmatically

catalog() returns every fault, invariant, assertion, strategy, and integration ordeal has, with names, signatures, and neutral discovery metadata derived from the live runtime surface. AI assistants and scripts can use it to learn what exists, what it applies to, what inputs it expects, what it produces, and where to look next without reading files. When new features are added to ordeal, they appear in the catalog automatically.

from ordeal import catalog

c = catalog()
c["faults"]        # failure injectors
c["invariants"]    # composable checks
c["assertions"]    # property assertions
c["strategies"]    # adversarial data generation helpers
c["integrations"]  # API and fuzzing adapters

# Each entry has: name, qualname, signature, doc, capability,
# applies_to, inputs, outputs, examples, learn_more
for fault in c["faults"]:
    print(f"{fault['name']}{fault['signature']}")
    print(f"  {fault['capability']}")
    print(f"  applies_to: {fault['applies_to']}")
    print(f"  outputs: {fault['outputs']}")
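A catalog "derived from the live runtime surface" can be pictured with plain introspection. The sketch below is not ordeal's implementation: `describe` is a hypothetical helper, and it walks the standard-library `json` module only so the example runs anywhere. It shows how entries with `name`, `qualname`, `signature`, and `doc` fields could be built without reading source files.

```python
import inspect
import json


def describe(module) -> list[dict]:
    """Hypothetical helper: list the public functions of a module."""
    entries = []
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private helpers
        doc = inspect.getdoc(obj) or ""
        entries.append({
            "name": name,
            "qualname": f"{module.__name__}.{obj.__qualname__}",
            "signature": str(inspect.signature(obj)),
            "doc": doc.splitlines()[0] if doc else "",
        })
    return entries


entries = describe(json)
for entry in entries:
    print(f"{entry['name']}{entry['signature']}")
```

Because the entries are computed from the imported module, a function added to the module shows up on the next call with no registry to update.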

Core

Stateful chaos testing

ChaosTest is the foundation of ordeal. You define rules (things your system does), faults (things that go wrong), and invariants (things that must stay true). Ordeal then explores thousands of interleavings automatically, finding the exact sequence of operations and failures that breaks your system.

ChaosTest

from ordeal import ChaosTest

Base class for stateful chaos tests. Extends Hypothesis's RuleBasedStateMachine.

Class attributes:

| Attribute | Type | Default | Description |
|---|---|---|---|
| faults | list[Fault] | [] | Faults to inject during testing |
| swarm | bool | False | Random fault subsets per run |

Methods:

| Method | Returns | Description |
|---|---|---|
| active_faults | list[Fault] | Property: currently active faults |
| teardown() | None | Deactivate all faults, clean up |

from ordeal import ChaosTest, rule
from ordeal.faults import timing

class MyServiceChaos(ChaosTest):
    faults = [timing.timeout("myapp.api.call")]
    swarm = True

    @rule()
    def do_something(self):
        ...

# Expose the generated TestCase so pytest can collect it
TestMyServiceChaos = MyServiceChaos.TestCase

Hypothesis re-exports

These are re-exported from hypothesis.stateful for convenience:

from ordeal import rule, invariant, initialize, precondition, Bundle

| Import | Description |
|---|---|
| rule(**kwargs) | Declare a test rule (decorator) |
| invariant() | Declare an invariant check (decorator) |
| initialize(**kwargs) | Declare an initialization rule (decorator) |
| precondition(condition) | Gate a rule on current state (decorator) |
| Bundle(name) | Named collection for data flow between rules |

auto_configure

auto_configure(
    buggify_probability: float = 0.1,
    seed: int | None = None,
) -> None

Enable chaos testing programmatically. Alternative to --chaos flag.

from ordeal import auto_configure
auto_configure(buggify_probability=0.2, seed=42)

Assertions

The key insight

Assertions are how you tell ordeal what "correct" means. always and unreachable catch violations the instant they happen. sometimes and reachable are checked at the end of the session -- they verify that something good happened at least once across all your test runs. All four live in ordeal/assertions.py.

from ordeal import always, sometimes, reachable, unreachable

Thread safety: The PropertyTracker is fully lock-guarded and safe for free-threaded Python 3.13+. All access to active and _properties is synchronized.

always

always(
    condition: bool,
    name: str,
    *,
    mute: bool = False,
    **details: Any,
) -> None

Assert condition is True every time. Raises AssertionError immediately on violation — whether or not --chaos is active. Violations are never silent by default.

Pass mute=True to record the violation without raising. The violation still shows in the property report — tracked, not hidden. Use when a known issue is too loud and you need to focus on something else.

always(result >= 0, "result is non-negative")
always(not math.isnan(score), "score is never NaN", value=score)
always(response.ok, "API healthy", mute=True)  # known flaky, tracked not fatal

sometimes

sometimes(
    condition: bool | Callable[[], bool],
    name: str,
    *,
    attempts: int | None = None,
    **details: Any,
) -> None

Assert condition is True at least once across the session. Deferred — checked at session end via PropertyTracker.

If condition is callable and attempts is set, sometimes polls the callable up to attempts times, so it can also be used standalone.

sometimes(cache_hit, "cache is exercised")
sometimes(lambda: service.ready(), "service starts", attempts=10)

reachable

reachable(
    name: str,
    **details: Any,
) -> None

Record that a code path executed. Deferred — must be hit at least once by session end.

except TimeoutError:
    reachable("timeout-handling-path")
    handle_timeout()

unreachable

unreachable(
    name: str,
    *,
    mute: bool = False,
    **details: Any,
) -> None

Assert code path never executes. Raises AssertionError immediately — whether or not --chaos is active. Violations are never silent by default. Pass mute=True to record without raising.

if data is None and not error_occurred:
    unreachable("data-lost-silently")

PropertyTracker

from ordeal.assertions import tracker

Global singleton. Accumulates property results across runs.

| Member | Returns | Description |
|---|---|---|
| reset() | None | Clear all tracked properties |
| record(name, prop_type, condition, details) | None | Record a property result |
| record_hit(name, prop_type) | None | Record a hit without condition |
| results | list[Property] | All tracked properties |
| failures | list[Property] | Only failed properties |

Property

from ordeal.assertions import Property

| Attribute | Type | Description |
|---|---|---|
| name | str | Property name |
| type | str | "always", "sometimes", "reachable", "unreachable" |
| hits | int | Times evaluated |
| passes | int | Times condition was True |
| failures | int | Times condition was False |
| first_failure_details | dict \| None | Details from first failure |
| passed | bool | Whether property passed (per type semantics) |
| summary | str | One-line "PASS ..." or "FAIL ..." |
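
The difference between immediate and deferred checks, and what "passed (per type semantics)" means, can be illustrated with a small stand-alone tracker. This is a sketch of the semantics described in this section, not ordeal's code: `MiniTracker` and `Prop` are hypothetical names, and only the always and sometimes types are modelled.

```python
from dataclasses import dataclass


@dataclass
class Prop:
    """Hypothetical stand-in for a tracked property."""
    name: str
    type: str  # "always" or "sometimes"
    hits: int = 0
    passes: int = 0

    @property
    def passed(self) -> bool:
        if self.type == "always":
            return self.hits == self.passes  # every evaluation held
        return self.passes > 0  # "sometimes": held at least once


class MiniTracker:
    def __init__(self) -> None:
        self.props: dict[str, Prop] = {}

    def record(self, name: str, type_: str, condition: bool) -> None:
        prop = self.props.setdefault(name, Prop(name, type_))
        prop.hits += 1
        prop.passes += bool(condition)

    def always(self, condition: bool, name: str) -> None:
        self.record(name, "always", condition)
        if not condition:
            raise AssertionError(f"always violated: {name}")  # immediate

    def sometimes(self, condition: bool, name: str) -> None:
        self.record(name, "sometimes", condition)  # judged at session end

    def failures(self) -> list[Prop]:
        return [p for p in self.props.values() if not p.passed]


t = MiniTracker()
t.always(1 + 1 == 2, "arithmetic holds")
for hit in (False, False, True):
    t.sometimes(hit, "cache is exercised")
t.sometimes(False, "retry path taken")
print([p.name for p in t.failures()])  # ['retry path taken']
```

A sometimes property that is false on most evaluations still passes as long as it held once, which is why it can only be judged after the session ends.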

Buggify

Inline faults for production code

Buggify lets you embed fault injection points directly in your application code. In production, buggify() is a no-op with negligible overhead. During chaos testing, it fires with configurable probability, letting you simulate failures exactly where they'd happen in real life -- inside your own functions, not just at external boundaries.

from ordeal.buggify import buggify, buggify_value, activate, deactivate, set_seed, is_active

buggify

buggify(probability: float | None = None) -> bool

Returns True during chaos testing with configurable probability. No-op when inactive (negligible overhead).

import random
import time

if buggify():
    raise ConnectionError("simulated failure")

if buggify(0.5):  # 50% chance when active
    time.sleep(random.random())

buggify_value

buggify_value(normal: _T, faulty: _T, probability: float | None = None) -> _T

Returns faulty during chaos testing, normal otherwise.

return buggify_value(computed_result, float('nan'))
return buggify_value(response, TimeoutError("simulated"), 0.3)

activate / deactivate / set_seed / is_active

activate(probability: float = 0.1) -> None     # enable for current thread
deactivate() -> None                             # disable for current thread
set_seed(seed: int) -> None                      # seed RNG for reproducibility
is_active() -> bool                              # check if enabled
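
How a gate with these four controls might fit together is sketched below using only the standard library. It is an illustration under assumptions, not ordeal's implementation: activation is stored per thread as the comments above describe, but the single module-level RNG is a guess made for brevity.

```python
import random
import threading

_state = threading.local()  # per-thread activation flag and probability
_rng = random.Random()      # assumption: one shared RNG


def activate(probability: float = 0.1) -> None:
    _state.probability = probability


def deactivate() -> None:
    _state.probability = None


def set_seed(seed: int) -> None:
    _rng.seed(seed)


def is_active() -> bool:
    return getattr(_state, "probability", None) is not None


def buggify(probability=None) -> bool:
    if not is_active():
        return False  # inactive: one attribute lookup, then exit
    p = _state.probability if probability is None else probability
    return _rng.random() < p


inactive_result = buggify()  # False: the gate never fires while inactive
activate(0.5)
set_seed(42)
first = [buggify() for _ in range(20)]
set_seed(42)
second = [buggify() for _ in range(20)]
deactivate()
print(inactive_result, first == second)  # False True
```

Re-seeding reproduces the same sequence of decisions, which is what makes a failing chaos run repeatable.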

Faults

Think of it this way

Faults are how you simulate real-world failures -- timeouts, disk errors, network issues, corrupted data. You pick a target function by its dotted path (like "myapp.db.query"), and ordeal replaces it with a faulty version when the fault is active. When deactivated, the original function comes back. The base classes live in ordeal/faults/__init__.py, with specialized faults in io.py, numerical.py, timing.py, network.py, and concurrency.py.

Base classes

from ordeal.faults import Fault, PatchFault, LambdaFault

Thread safety: The active flag and activate/deactivate transitions are lock-guarded. intermittent_crash and jitter call counters are also lock-protected. Deep-copying faults creates fresh locks (for checkpoint serialization). Safe for free-threaded Python 3.13+.

Fault (ABC):

| Member | Description |
|---|---|
| activate() | Enable fault injection |
| deactivate() | Disable fault injection |
| reset() | Deactivate and clear state |
| name: str | Human-readable name |
| active: bool | Whether currently active |
| with fault: | Context manager: activates on enter, deactivates on exit |

PatchFault:

PatchFault(
    target: str,                                    # dotted path: "myapp.api.call"
    wrapper_fn: Callable[[Callable], Callable],     # receives original, returns replacement
    name: str | None = None,
)

Resolves target to a function, replaces it with wrapper_fn(original) when active, restores on deactivation. Lazy resolution (resolved on first activation).
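
The patch-and-restore cycle can be sketched with the standard library alone. `MiniPatchFault` below is a hypothetical, simplified stand-in (it handles only module-level functions and has no locking), and it targets `json.dumps` purely so the example is runnable.

```python
import importlib
import json
from typing import Callable


class MiniPatchFault:
    """Hypothetical sketch: patch on activate, restore on deactivate."""

    def __init__(self, target: str, wrapper_fn: Callable[[Callable], Callable]):
        self.target = target
        self.wrapper_fn = wrapper_fn
        self.active = False
        self._owner = None  # resolved lazily on first activation
        self._attr = None
        self._original = None

    def _resolve(self) -> None:
        module_path, _, self._attr = self.target.rpartition(".")
        self._owner = importlib.import_module(module_path)
        self._original = getattr(self._owner, self._attr)

    def activate(self) -> None:
        if self._owner is None:
            self._resolve()
        setattr(self._owner, self._attr, self.wrapper_fn(self._original))
        self.active = True

    def deactivate(self) -> None:
        if self.active:
            setattr(self._owner, self._attr, self._original)
            self.active = False

    def __enter__(self):
        self.activate()
        return self

    def __exit__(self, *exc) -> None:
        self.deactivate()


def always_fail(original):
    def replacement(*args, **kwargs):
        raise IOError("Simulated I/O error")
    return replacement


with MiniPatchFault("json.dumps", always_fail):
    try:
        json.dumps({"a": 1})
        outcome = "no error"
    except IOError:
        outcome = "fault fired"

restored = json.dumps({"a": 1})
print(outcome, restored)  # fault fired {"a": 1}
```

One consequence of patching by attribute: code that bound the function earlier with `from json import dumps` keeps its own reference and is not affected, so the target path should name the module where the function is looked up.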

LambdaFault:

LambdaFault(
    name: str,
    on_activate: Callable[[], None],
    on_deactivate: Callable[[], None],
)

I/O faults

from ordeal.faults import io

| Function | Signature | Description |
|---|---|---|
| error_on_call | (target: str, error: type = IOError, message: str = "Simulated I/O error") -> PatchFault | Target raises error on every call |
| return_empty | (target: str) -> PatchFault | Target returns None |
| corrupt_output | (target: str) -> PatchFault | Target returns random bytes (same length) |
| truncate_output | (target: str, fraction: float = 0.5) -> PatchFault | Target output truncated to fraction |
| disk_full | () -> Fault | Global: writes fail with OSError(ENOSPC) |
| permission_denied | () -> Fault | Global: opens fail with PermissionError |
| subprocess_timeout | (target: str) -> PatchFault | subprocess.run raises TimeoutExpired when command matches target |
| corrupt_stdout | (target: str) -> PatchFault | subprocess.run returns garbled stdout when command matches target |
| subprocess_delay | (target: str, *, delay: float = 1.0) -> PatchFault | Adds delay to subprocess.run when command matches target |

# In ChaosTest: the nemesis toggles these automatically
faults = [
    io.error_on_call("myapp.storage.save", IOError, "disk unreachable"),
    io.corrupt_output("myapp.cache.read"),
    io.subprocess_timeout("cargo run"),
    io.disk_full(),
]

# As context manager — scoped activation in regular tests
with io.subprocess_timeout("cargo run"):
    result = run_kernel()

Numerical faults

from ordeal.faults import numerical

| Function | Signature | Description |
|---|---|---|
| nan_injection | (target: str) -> PatchFault | Numeric output becomes NaN |
| inf_injection | (target: str) -> PatchFault | Numeric output becomes Inf |
| wrong_shape | (target: str, expected: tuple, actual: tuple) -> PatchFault | Returns array with wrong shape |
| dtype_drift | (target: str, kind: str = "str") -> PatchFault | Coerces numeric output into string/int/bool/object leaves |
| partial_batch | (target: str, fraction: float = 0.5, min_items: int = 1) -> PatchFault | Truncates batch-like output on the first axis |
| feature_order_drift | (target: str, shift: int = 1) -> PatchFault | Rotates feature order without changing outer shape |
| missing_feature | (target: str, key: str \| None = None, *, fill: object = ...) -> PatchFault | Drops one feature key or replaces it with a fill value |
| corrupted_floats | (corrupt_type: str = "nan") -> Fault | Standalone corrupt float source; use fault.value() |

faults = [
    numerical.nan_injection("myapp.model.predict"),
    numerical.partial_batch("myapp.model.predict", fraction=0.5),
    numerical.missing_feature("myapp.features.fetch", "country"),
    numerical.wrong_shape("myapp.embed", (1, 512), (1, 256)),
]

Timing faults

from ordeal.faults import timing

| Function | Signature | Description |
|---|---|---|
| timeout | (target: str, delay: float = 30.0, error: type = TimeoutError) -> PatchFault | Target raises instantly (no real sleep) |
| slow | (target: str, delay: float = 1.0, mode: str = "simulate") -> PatchFault | Add delay; "simulate" = instant, "real" = actual sleep |
| intermittent_crash | (target: str, every_n: int = 3, error: type = RuntimeError) -> Fault | Crash every Nth call; resets on reset() |
| jitter | (target: str, magnitude: float = 0.01) -> Fault | Add deterministic numeric jitter to return value |

faults = [
    timing.timeout("myapp.api.call"),
    timing.intermittent_crash("myapp.worker.process", every_n=5),
    timing.jitter("myapp.sensor.read", magnitude=0.001),
]

Network faults

from ordeal.faults import network

For any code making HTTP/API calls. Simulates real-world network failures without requiring network access.

| Function | Signature | Description |
|---|---|---|
| http_error | (target: str, status_code: int = 500, message: str = "Internal Server Error") -> PatchFault | Raise HTTPFaultError with status code and fake response |
| connection_reset | (target: str) -> PatchFault | Raise ConnectionError |
| rate_limited | (target: str, retry_after: float = 30.0) -> PatchFault | Raise HTTP 429 with Retry-After header |
| auth_failure | (target: str, status_code: int = 401) -> PatchFault | Raise HTTP 401/403 |
| dns_failure | (target: str) -> PatchFault | Raise OSError (simulated DNS resolution failure) |
| partial_response | (target: str, fraction: float = 0.5) -> PatchFault | Truncate response to fraction of content |
| intermittent_http_error | (target: str, every_n: int = 3, status_code: int = 503, message: str = "Service Unavailable") -> Fault | HTTP error every Nth call; resets on reset() |

faults = [
    network.http_error("myapp.client.post", status_code=503),
    network.rate_limited("myapp.client.get", retry_after=60),
    network.connection_reset("myapp.client.post"),
    network.dns_failure("myapp.client.resolve"),
]

HTTPFaultError carries .status_code and a duck-typed .response object compatible with requests/httpx patterns.

Concurrency faults

from ordeal.faults import concurrency

For testing thread-safety, resource contention, and concurrent access patterns.

| Function | Signature | Description |
|---|---|---|
| contended_call | (target: str, contention: float = 0.05, mode: str = "simulate") -> PatchFault | Wrap target with a shared lock; simulates resource contention |
| delayed_release | (target: str, delay: float = 0.5, mode: str = "simulate") -> PatchFault | Add delay after target returns (simulates slow cleanup) |
| thread_boundary | (target: str, timeout: float = 5.0) -> Fault | Execute target on a background thread (finds thread-local state bugs) |
| stale_state | (obj: Any, attr: str, stale_value: Any) -> Fault | When active, set obj.attr = stale_value; restore on deactivation |

faults = [
    concurrency.contended_call("myapp.pool.acquire", contention=0.1),
    concurrency.thread_boundary("myapp.cache.get"),
    concurrency.stale_state(my_service, "config", old_config),
]

Explorer

Coverage-guided exploration

The Explorer is ordeal's autopilot. Point it at a ChaosTest, and it runs thousands of rule/fault combinations, tracking which code paths each run reaches. Runs that discover new edges get higher energy, so the explorer automatically focuses on the most productive directions. Use it when manual test cases can't cover the combinatorial space of faults and operations.

from ordeal.explore import Explorer, ExplorationResult, Failure, ProgressSnapshot, CoverageCollector, Checkpoint

Explorer

Explorer(
    test_class: type,                           # ChaosTest subclass
    *,
    target_modules: list[str] | None = None,    # modules to track for coverage
    seed: int = 42,
    max_checkpoints: int = 256,
    checkpoint_prob: float = 0.4,               # probability of starting from checkpoint
    checkpoint_strategy: str = "energy",        # "energy", "uniform", "recent"
    fault_toggle_prob: float = 0.3,
    record_traces: bool = False,
    workers: int = 1,                           # 0 = auto (os.cpu_count())
    share_edges: bool = True,                   # shared-memory edge bitmap for workers
    share_checkpoints: bool = True,             # shared checkpoint ring for workers
    mutation_targets: list[str] | None = None,
    seed_mutation_prob: float | None = None,
    seed_mutation_respect_strategies: bool = False,
    ngram: int = 2,
    corpus_dir: str | Path | None = ".ordeal/seeds",
    rule_swarm: bool = False,
)

explorer.run(
    *,
    max_time: float = 60.0,
    max_runs: int | None = None,
    steps_per_run: int = 50,
    shrink: bool = True,
    max_shrink_time: float = 30.0,
    progress: Callable[[ProgressSnapshot], None] | None = None,
    resume_from: str | Path | None = None,    # resume from saved state
    save_state_to: str | Path | None = None,  # save state on completion
) -> ExplorationResult

| Method | Returns | Description |
|---|---|---|
| save_state(path) | None | Save checkpoint corpus, edges, and RNG state to a pickle file for later resumption |
| load_state(path) | dict | Restore saved state; returns counters (total_edges, checkpoints) |

explorer = Explorer(
    MyServiceChaos,
    target_modules=["myapp"],
    checkpoint_strategy="energy",
)
result = explorer.run(max_time=120, steps_per_run=100)
print(result.summary())

# Resume a previous run:
result = explorer.run(
    max_time=120,
    resume_from=".ordeal/state.pkl",
    save_state_to=".ordeal/state.pkl",
)
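
The idea that productive runs "get higher energy" can be sketched as weighted sampling over a checkpoint corpus. Everything below is an illustrative assumption rather than ordeal's algorithm: the `Ckpt` class, the `reward` update rule with its `boost` and `decay` constants, and the reading of "recent" as "most recently saved" are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Ckpt:
    """Hypothetical checkpoint record."""
    run_id: int
    energy: float = 1.0


def pick(corpus, rng, strategy="energy"):
    if strategy == "uniform":
        return rng.choice(corpus)
    if strategy == "recent":
        return corpus[-1]  # assumed meaning: newest checkpoint
    weights = [c.energy for c in corpus]
    return rng.choices(corpus, weights=weights, k=1)[0]


def reward(checkpoint, new_edges, boost=0.5, decay=0.9):
    # Assumed update rule: grow on discovery, decay otherwise.
    if new_edges:
        checkpoint.energy += boost * new_edges
    else:
        checkpoint.energy *= decay


rng = random.Random(42)
corpus = [Ckpt(0), Ckpt(1), Ckpt(2)]
reward(corpus[1], new_edges=20)  # energy rises to 11.0
reward(corpus[2], new_edges=0)   # energy decays to 0.9

counts = {c.run_id: 0 for c in corpus}
for _ in range(1000):
    counts[pick(corpus, rng).run_id] += 1
print(max(counts, key=counts.get))  # 1
```

With these weights the productive checkpoint is chosen roughly 85% of the time, while the others still get occasional turns, so exploration does not collapse onto a single path.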

ExplorationResult

| Attribute | Type | Description |
|---|---|---|
| total_runs | int | Runs completed |
| total_steps | int | Total steps across all runs |
| unique_edges | int | Unique control-flow edges discovered |
| checkpoints_saved | int | Checkpoints in corpus |
| failures | list[Failure] | Failures found |
| duration_seconds | float | Wall-clock time |
| edge_log | list[tuple[int, int]] | (run_id, cumulative_edges) |
| traces | list[Trace] | Recorded traces (if record_traces=True) |
| summary() | str | Human-readable report |

Failure

| Attribute | Type | Description |
|---|---|---|
| error | Exception | The exception raised |
| step | int | Step number when failure occurred |
| run_id | int | Run that found this failure |
| active_faults | list[str] | Faults active at failure time |
| rule_log | list[str] | Sequence of rules/faults leading to failure |
| trace | Trace \| None | Full trace for replay |

ProgressSnapshot

| Attribute | Type | Description |
|---|---|---|
| elapsed | float | Seconds since start |
| total_runs | int | Runs completed |
| total_steps | int | Steps completed |
| unique_edges | int | Edges discovered |
| checkpoints | int | Checkpoints saved |
| failures | int | Failures found |
| runs_per_second | float | Throughput |

CoverageCollector

CoverageCollector(target_paths: list[str])

| Method | Returns | Description |
|---|---|---|
| start() | None | Begin collecting edge coverage via sys.settrace |
| stop() | frozenset[int] | Stop and return observed edges |
| snapshot() | frozenset[int] | Current edges without stopping |

Trace

from ordeal.trace import Trace, TraceStep, TraceFailure, replay, shrink

Trace

| Attribute | Type | Description |
|---|---|---|
| run_id | int | Run identifier |
| seed | int | RNG seed |
| test_class | str | "module.path:ClassName" |
| from_checkpoint | int \| None | Checkpoint run_id, or None if fresh |
| steps | list[TraceStep] | Ordered steps |
| failure | TraceFailure \| None | Failure info if applicable |
| edges_discovered | int | New edges found |
| duration | float | Run duration |

| Method | Returns | Description |
|---|---|---|
| to_dict() | dict | JSON-serializable dict |
| save(path) | None | Write to JSON file (use .json.gz extension for gzip compression) |
| Trace.from_dict(data) | Trace | Reconstruct from dict |
| Trace.load(path) | Trace | Load from JSON file (auto-detects .gz compression) |

TraceStep

| Attribute | Type | Description |
|---|---|---|
| kind | str | "rule" or "fault_toggle" |
| name | str | Rule name or "+fault" / "-fault" |
| params | dict | Parameters drawn for this step |
| active_faults | list[str] | Faults active after this step (populated on fault_toggle steps; empty on rule steps, where it must be derived from the toggle sequence) |
| edge_count | int | Cumulative edges at this step |
| timestamp_offset | float | Time since run start |

replay

replay(
    trace: Trace,
    test_class: type | None = None,     # auto-resolved from trace.test_class if None
) -> Exception | None

Replay a trace step-by-step. Returns the exception if it reproduces, None otherwise.

shrink

shrink(
    trace: Trace,
    test_class: type | None = None,
    *,
    max_time: float = 30.0,
) -> Trace

Shrink a failing trace to the minimal reproducing sequence. Three phases: delta debugging, step elimination, fault simplification.
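
Of the three phases, step elimination is the easiest to picture: repeatedly drop a step and keep the shorter trace if the failure still reproduces. The sketch below is a generic greedy reducer over a list of step names, not ordeal's shrinker. `shrink_steps` and the toy `fails` oracle are hypothetical; in practice the oracle would be a replay of the candidate trace.

```python
from typing import Callable, Sequence


def shrink_steps(steps: Sequence[str], fails: Callable[[list], bool]) -> list:
    """Greedy step elimination: drop any step whose removal keeps the failure."""
    current = list(steps)
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate  # still fails without step i
                changed = True
            else:
                i += 1  # step i is needed, keep it
    return current


def fails(steps: list) -> bool:
    # Toy oracle: the bug needs "+timeout" toggled on before a later "write".
    if "+timeout" not in steps or "write" not in steps:
        return False
    last_write = len(steps) - 1 - steps[::-1].index("write")
    return steps.index("+timeout") < last_write


trace = ["read", "+timeout", "read", "-disk_full", "write", "read"]
print(shrink_steps(trace, fails))  # ['+timeout', 'write']
```

The result is 1-minimal with respect to the oracle: removing any single remaining step makes the failure disappear.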

generate_tests

generate_tests(
    traces: list[Trace],
    *,
    class_path: str | None = None,
) -> str

Convert exploration traces into standalone pytest test functions. Each generated test replays the exact rule/fault sequence — failures become regression tests, deep paths become coverage tests.

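from pathlib import Path

from ordeal.explore import Explorer
from ordeal.trace import generate_tests

# record_traces is an Explorer constructor option, not a run() argument
explorer = Explorer(MyServiceChaos, record_traces=True)
result = explorer.run(max_time=60)
test_source = generate_tests(result.traces)
Path("tests/test_generated.py").write_text(test_source)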

Or from the CLI: ordeal explore --generate-tests tests/test_generated.py


QuickCheck

Boundary-biased property testing

QuickCheck gives you property-based testing with a twist: instead of purely random inputs, it biases toward boundary values -- zeros, empty strings, max-size lists, powers of two. These are the values most likely to trigger off-by-one errors and edge-case bugs. Just add type hints to your test function and @quickcheck handles the rest.

from ordeal.quickcheck import quickcheck, strategy_for_type, biased

quickcheck

import hypothesis.strategies as st

@quickcheck
def test_fn(x: int, y: str) -> None:
    ...

@quickcheck(max_examples=500)
def test_fn(x: float) -> None:
    ...

@quickcheck(x=st.integers(min_value=0))  # override specific parameter
def test_fn(x: int, y: str) -> None:
    ...

Decorator. Infers strategies from type hints, runs as property test with max_examples=100 (default).

strategy_for_type

strategy_for_type(tp: type, *, _depth: int = 0) -> st.SearchStrategy

Derive a boundary-biased strategy from a type hint. Results are cached by (tp, _depth). Handles: int, float, str, bool, bytes, None, list[T], dict[K, V], tuple, set, Union, Optional, dataclass, and Pydantic BaseModel (v2+ — derives strategies from model_fields with constraint support: ge/le/gt/lt, min_length/max_length). Recursion depth limited to 5.

biased

Namespace of boundary-biased strategies:

biased.integers(min_value=None, max_value=None) -> SearchStrategy[int]
biased.floats(min_value=None, max_value=None, *, allow_nan=False, allow_infinity=False) -> SearchStrategy[float]
biased.strings(min_size=0, max_size=100) -> SearchStrategy[str]
biased.bytes_(min_size=0, max_size=100) -> SearchStrategy[bytes]
biased.lists(elements, min_size=0, max_size=50) -> SearchStrategy[list]

Biased toward boundary values: 0, -1, +1, empty, max-length, powers of 2, range endpoints.
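
What "boundary-biased" means for integers can be shown without Hypothesis. The generator below is a stand-alone sketch, not the strategy ordeal builds: `boundary_values`, `biased_int`, and the 50/50 mix between boundary picks and uniform draws are assumptions made for illustration.

```python
import random


def boundary_values(lo: int, hi: int) -> list[int]:
    """Candidates near zero, the range endpoints, and powers of two."""
    candidates = {0, -1, 1, lo, hi, lo + 1, hi - 1}
    power = 1
    while power <= hi:
        candidates.update({power - 1, power, power + 1})
        power *= 2
    return sorted(v for v in candidates if lo <= v <= hi)


def biased_int(rng: random.Random, lo: int, hi: int, boundary_prob: float = 0.5) -> int:
    # Assumed mix: half the draws come from the boundary set.
    if rng.random() < boundary_prob:
        return rng.choice(boundary_values(lo, hi))
    return rng.randint(lo, hi)


print(boundary_values(-4, 9))
# [-4, -3, -1, 0, 1, 2, 3, 4, 5, 7, 8, 9]

rng = random.Random(0)
samples = [biased_int(rng, -1000, 1000) for _ in range(200)]
```

Over a wide range such as [-1000, 1000], a uniform draw almost never lands on 0 or an endpoint. Mixing in the boundary set makes those inputs routine.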


Invariants

Composable correctness checks

Invariants are reusable validation rules you can compose with &. Instead of writing ad-hoc assertions in every test, define what "valid output" means once -- finite & bounded(0, 1) -- and apply it everywhere. Reach for these when you have numeric outputs that must satisfy mathematical properties like boundedness, monotonicity, or normalization.

from ordeal.invariants import (
    Invariant, no_nan, no_inf, finite, bounded, monotonic,
    unique, non_empty, unit_normalized, orthonormal, symmetric,
    positive_semi_definite, rank_bounded, mean_bounded, variance_bounded,
)

Invariant

Invariant(name: str, check_fn: Callable[..., None])

| Method | Description |
|---|---|
| __call__(value, *, name=None) | Run check, raise AssertionError on violation |
| __and__(other) | Compose: (a & b)(x) checks both |

Built-in invariants

| Invariant | Signature | Description |
|---|---|---|
| no_nan | singleton | Reject NaN in scalars, sequences, numpy arrays |
| no_inf | singleton | Reject Inf/-Inf |
| finite | singleton | no_nan & no_inf |
| bounded | (lo: float, hi: float) | All values in [lo, hi] |
| monotonic | (*, strict: bool = False) | Non-decreasing (or strictly increasing) |
| unique | (*, key: Callable \| None = None) | No duplicates (optionally by key) |
| non_empty | () | Not empty/falsy |
| unit_normalized | (*, tol: float = 1e-6) | Row vectors have L2 norm ~1.0 |
| orthonormal | (*, tol: float = 1e-6) | Rows form orthonormal set |
| symmetric | (*, tol: float = 1e-6) | Matrix equals its transpose |
| positive_semi_definite | (*, tol: float = 1e-6) | All eigenvalues >= -tol |
| rank_bounded | (min_rank=0, max_rank=None) | Matrix rank in range |
| mean_bounded | (lo: float, hi: float) | Mean in [lo, hi] |
| variance_bounded | (lo: float, hi: float) | Variance in [lo, hi] |

valid_score = finite & bounded(0, 1)
valid_score(model_output)

valid_embedding = unit_normalized() & bounded(-1, 1)
valid_embedding(embedding_matrix)
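
Composition with & needs very little machinery. The stand-alone sketch below mirrors the shape of the `Invariant(name, check_fn)` constructor documented above, but `MiniInvariant` is a hypothetical reimplementation that handles only flat sequences of floats.

```python
import math
from typing import Callable


class MiniInvariant:
    """Hypothetical sketch of a composable check."""

    def __init__(self, name: str, check_fn: Callable[[object], None]):
        self.name = name
        self.check_fn = check_fn

    def __call__(self, value) -> None:
        self.check_fn(value)

    def __and__(self, other: "MiniInvariant") -> "MiniInvariant":
        def both(value):
            self(value)   # left check runs first
            other(value)  # then the right check
        return MiniInvariant(f"{self.name} & {other.name}", both)


def _check_finite(values) -> None:
    for v in values:
        if not math.isfinite(v):
            raise AssertionError(f"non-finite value: {v!r}")


def bounded(lo: float, hi: float) -> MiniInvariant:
    def check(values) -> None:
        for v in values:
            if not lo <= v <= hi:
                raise AssertionError(f"{v!r} outside [{lo}, {hi}]")
    return MiniInvariant(f"bounded({lo}, {hi})", check)


finite = MiniInvariant("finite", _check_finite)
valid_score = finite & bounded(0, 1)

valid_score([0.0, 0.25, 1.0])  # passes silently
try:
    valid_score([0.5, float("nan")])
    caught = None
except AssertionError as exc:
    caught = str(exc)
print(valid_score.name, "|", caught)
# finite & bounded(0, 1) | non-finite value: nan
```

Because the composed object is itself an invariant, chains such as `a & b & c` work without extra code.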

Simulate

Deterministic time and filesystem

Clock and FileSystem replace real time and real disk with in-memory, deterministic versions. Tests that use Clock run instantly regardless of how many hours of simulated time pass. Tests that use FileSystem can inject corruption, permission errors, and disk-full conditions without touching actual files. Use these when your code depends on time or I/O and you need tests that are fast and reproducible.

from ordeal.simulate import Clock, FileSystem

Clock

Clock(start: float = 0.0)

| Method | Returns | Description |
|---|---|---|
| time() | float | Current simulated time |
| sleep(seconds) | None | Advance by seconds (instant) |
| advance(seconds) | None | Advance, firing timers whose deadline passed |
| set_timer(delay, callback) | int | Schedule callback; returns timer ID |
| pending_timers | int | Property: unfired timer count |
| patch() | context manager | Patch time.time() and time.sleep() |

import time

clock = Clock()
clock.set_timer(10.0, lambda: print("fired"))
clock.advance(15.0)  # timer fires at t=10

with clock.patch():
    time.sleep(3600)  # instant
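
The timer behaviour (callbacks fire in deadline order as simulated time advances, with no real waiting) can be sketched with a heap. `MiniClock` is a hypothetical stand-in, not ordeal's Clock. In particular, setting the current time to each timer's deadline before calling its callback is an assumption about what callbacks should observe.

```python
import heapq
import itertools


class MiniClock:
    """Hypothetical simulated clock with one-shot timers."""

    def __init__(self, start: float = 0.0):
        self._now = start
        self._timers = []  # heap of (deadline, timer_id, callback)
        self._ids = itertools.count(1)

    def time(self) -> float:
        return self._now

    def set_timer(self, delay: float, callback) -> int:
        timer_id = next(self._ids)  # unique id also breaks deadline ties
        heapq.heappush(self._timers, (self._now + delay, timer_id, callback))
        return timer_id

    @property
    def pending_timers(self) -> int:
        return len(self._timers)

    def advance(self, seconds: float) -> None:
        target = self._now + seconds
        while self._timers and self._timers[0][0] <= target:
            deadline, _, callback = heapq.heappop(self._timers)
            self._now = deadline  # callback sees its own deadline as "now"
            callback()
        self._now = target


clock = MiniClock()
fired = []
clock.set_timer(10.0, lambda: fired.append(clock.time()))
clock.set_timer(20.0, lambda: fired.append(clock.time()))
clock.advance(15.0)
print(fired, clock.time(), clock.pending_timers)  # [10.0] 15.0 1
```

No call ever blocks, so a test can advance through hours of simulated time in microseconds and get the same firing order on every run.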

FileSystem

FileSystem()

| Method | Signature | Description |
|---|---|---|
| write(path, data) | (str, str \| bytes) -> None | Write data, respecting faults |
| read(path) | (str) -> bytes | Read raw bytes, respecting faults |
| read_text(path, encoding="utf-8") | (str, str) -> str | Read decoded string |
| exists(path) | (str) -> bool | True if path exists (no "missing" fault) |
| delete(path) | (str) -> None | Remove path |
| list_dir(prefix="/") | (str) -> list[str] | Paths starting with prefix |
| inject_fault(path, fault) | (str, str) -> None | Inject: "corrupt", "missing", "readonly", "full" |
| clear_fault(path) | (str) -> None | Remove fault on path |
| clear_all_faults() | () -> None | Remove all faults |
| reset() | () -> None | Remove all files and faults |
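
Per-path fault injection on an in-memory store might look like the sketch below. `MiniFS` is hypothetical: the exception types raised for each fault and the bit-flip corruption model are assumptions, since the table above does not specify them.

```python
import errno


class MiniFS:
    """Hypothetical in-memory store with per-path fault injection."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}
        self.faults: dict[str, str] = {}

    def inject_fault(self, path: str, fault: str) -> None:
        self.faults[path] = fault

    def clear_fault(self, path: str) -> None:
        self.faults.pop(path, None)

    def write(self, path: str, data) -> None:
        fault = self.faults.get(path)
        if fault == "readonly":
            raise PermissionError(errno.EACCES, "Permission denied", path)
        if fault == "full":
            raise OSError(errno.ENOSPC, "No space left on device", path)
        self.files[path] = data.encode() if isinstance(data, str) else bytes(data)

    def read(self, path: str) -> bytes:
        fault = self.faults.get(path)
        if fault == "missing" or path not in self.files:
            raise FileNotFoundError(errno.ENOENT, "No such file", path)
        data = self.files[path]
        if fault == "corrupt":
            return bytes(b ^ 0xFF for b in data)  # assumed model: flip every bit
        return data


fs = MiniFS()
fs.write("/data/state.bin", b"\x00\x0f")
fs.inject_fault("/data/state.bin", "corrupt")
corrupted = fs.read("/data/state.bin")
fs.clear_fault("/data/state.bin")
clean = fs.read("/data/state.bin")

fs.inject_fault("/data/log.txt", "full")
try:
    fs.write("/data/log.txt", "hello")
    write_errno = None
except OSError as exc:
    write_errno = exc.errno
print(corrupted, clean, write_errno == errno.ENOSPC)
```

The stored bytes are never modified by a fault, so clearing the fault restores normal behaviour without rewriting the file.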

Mutations

Test quality validation

Mutation testing answers a hard question: are your tests actually checking behavior, or just checking that the code runs? It makes small changes to your source code (swapping + to -, replacing returns with None) and checks whether your tests notice. A high kill score means your tests are specific. Surviving mutants point you to exactly where your assertions are too weak.

from ordeal.mutations import mutate_function_and_test, mutate_and_test, validate_mined_properties, generate_mutants, MutationResult, Mutant

validate_mined_properties

validate_mined_properties(
    target: str,                                # dotted path: "myapp.scoring.compute"
    max_examples: int = 100,                    # examples for mine()
    operators: list[str] | None = None,         # None = all operators
    *,
    preset: Literal["essential", "standard", "thorough"] | None = None,
    mine_result: MineResult | None = None,
    validation_mode: Literal["fast", "deep"] = "fast",
) -> MutationResult

Mine properties of target, then mutate it and check the properties catch the mutations. Bridges mine() and mutation testing. Surviving mutants reveal properties too weak to detect real bugs. validation_mode="fast" replays mined inputs against each mutant; validation_mode="deep" keeps that replay check and then re-runs mine() on each mutant. Used automatically by ordeal audit.

mutate_function_and_test

mutate_function_and_test(
    target: str,                                # dotted path: "myapp.scoring.compute"
    test_fn: Callable[[], None],                # test to run against each mutant
    operators: list[str] | None = None,         # None = all operators
    *,
    workers: int = 1,                           # parallel workers (1 = sequential)
) -> MutationResult

Mutate a single function via PatchFault. This is safer than module-level mutation and is the recommended entry point. Set workers > 1 for parallel mutant testing: each mutant is independent, giving near-linear speedup.

mutate_and_test

mutate_and_test(
    target: str,                                # module path: "myapp.scoring"
    test_fn: Callable[[], None],
    operators: list[str] | None = None,
    *,
    workers: int = 1,                           # parallel workers (1 = sequential)
) -> MutationResult

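Mutate an entire module and swap the mutated version into sys.modules. This only works if tests reach functions through the module (import myapp.scoring, then myapp.scoring.compute(...)). A name bound earlier with from myapp.scoring import compute still points at the original function, so the mutant goes unnoticed.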

generate_mutants

generate_mutants(
    source: str,                                # source code string
    operators: list[str] | None = None,
) -> list[tuple[Mutant, ast.Module]]

Generate all possible mutants from source. Returns list of (Mutant, modified_ast).

MutationResult

| Attribute | Type | Description |
|---|---|---|
| target | str | What was mutated |
| mutants | list[Mutant] | All generated mutants |
| total | int | Total mutants |
| killed | int | Mutants caught by tests |
| survived | list[Mutant] | Mutants tests missed |
| score | float | Kill ratio (1.0 = all caught) |
| summary() | str | Human-readable report |

Mutant

| Attribute | Type | Description |
|---|---|---|
| operator | str | "arithmetic", "comparison", "negate", "return_none", "boundary", "constant", "delete" |
| description | str | What changed: "+ -> -" |
| line | int | Source line |
| col | int | Source column |
| killed | bool | Whether test caught it |
| error | str \| None | Compilation error if mutant was invalid |
| location | str | "L42:8" |

Available operators: arithmetic, comparison, negate, return_none, boundary, constant, delete
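
To make "killed" and "survived" concrete, here is a stand-alone sketch of a single arithmetic mutation built with the standard `ast` module. It does not use ordeal's operators or its Mutant class. `SwapAddToSub`, `load`, `is_killed`, and both toy tests are hypothetical.

```python
import ast

SOURCE = """
def total(price, tax):
    return price + tax
"""


class SwapAddToSub(ast.NodeTransformer):
    """One arithmetic mutation: every '+' becomes '-'."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def load(tree):
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["total"]


def is_killed(fn, test) -> bool:
    try:
        test(fn)
    except AssertionError:
        return True  # the test noticed the change
    return False


def weak_test(fn):
    if fn(5, 0) != 5:  # tax of 0 cannot tell + from -
        raise AssertionError("wrong total")


def strong_test(fn):
    if fn(5, 2) != 7:
        raise AssertionError("wrong total")


original = load(ast.parse(SOURCE))
mutated_tree = ast.fix_missing_locations(SwapAddToSub().visit(ast.parse(SOURCE)))
mutant = load(mutated_tree)
print(is_killed(mutant, weak_test), is_killed(mutant, strong_test))  # False True
```

The weak test passes against both the original and the mutant, so the mutant survives. This is the signal that the test's inputs or assertions are not specific enough.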

mutation_faults

mutation_faults(
    target: str,                    # dotted path: "myapp.scoring.compute"
    operators: list[str] | None = None,
) -> list[tuple[Mutant, PatchFault]]

Generate PatchFault objects for each mutant. When activated, each fault replaces the target function with a mutated version. Use with ChaosTest to let the nemesis toggle mutations during exploration.

from ordeal.mutations import mutation_faults
faults = [mf for _, mf in mutation_faults("myapp.scoring.compute")]

Auto

from ordeal.auto import scan_module, fuzz, chaos_for, register_fixture

scan_module

scan_module(
    module: str | ModuleType,
    *,
    max_examples: int = 50,
    check_return_type: bool = True,
    fixtures: dict[str, SearchStrategy] | None = None,
    security_focus: bool = False,
    shell_injection_check: bool = False,
) -> ScanResult

Smoke-test every public function. Generates random inputs from type hints, checks: no crash, return type matches.

result = scan_module("myapp.scoring")
assert result.passed
print(result.summary())

security_focus=True keeps the same API but biases scan toward trust-boundary sinks such as import loading, deserialization, filesystem writes, and checkpoint/IPC handling. shell_injection_check=True adds a static metacharacter-to-shell-sink oracle that runs before the target is executed.

fuzz

fuzz(
    fn: Any,
    *,
    max_examples: int = 1000,
    check_return_type: bool = False,
    **fixtures: SearchStrategy | Any,
) -> FuzzResult

Deep-fuzz a single function.

result = fuzz(myapp.scoring.compute, model=model_strategy)
assert result.passed

chaos_for

chaos_for(
    module: str | ModuleType,
    *,
    fixtures: dict[str, SearchStrategy] | None = None,
    invariants: list[Invariant] | None = None,
    faults: list[Fault] | None = None,
    max_examples: int = 50,
    stateful_step_count: int = 30,
) -> type

Auto-generate a ChaosTest from a module's public API. Each function becomes a @rule.

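from ordeal.auto import chaos_for
from ordeal.faults import timing
from ordeal.invariants import finite, bounded

TestScoring = chaos_for(
    "myapp.scoring",
    invariants=[finite, bounded(0, 1)],
    faults=[timing.timeout("myapp.scoring.predict")],
)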

register_fixture

register_fixture(name: str, strategy: SearchStrategy) -> None

Register a named fixture for auto-scan. Highest priority after explicit fixtures.

ScanResult

| Attribute | Type | Description |
|---|---|---|
| module | str | Module tested |
| functions | list[FunctionResult] | Per-function results |
| skipped | list[tuple[str, str]] | (name, reason) for skipped functions |
| passed | bool | All functions passed |
| total | int | Functions tested |
| failed | int | Failures |
| summary() | str | Human-readable report |

FuzzResult

Attribute Type Description
function str Function tested
examples int Examples run
failures list[Exception] Exceptions found
passed bool No failures
summary() str Human-readable report

Strategies

from ordeal.strategies import corrupted_bytes, adversarial_strings, nan_floats, edge_integers, mixed_types

Strategy Signature Description
corrupted_bytes (min_size=0, max_size=1024) Edge-case bytes: empty, all-zero, all-0xFF
adversarial_strings (min_size=0, max_size=256) SQL injection, XSS, path traversal, null bytes
nan_floats () NaN, Inf, -Inf, subnormals, boundaries
edge_integers (bits=64) 0, +/-1, min/max for N bits
mixed_types () None, bool, int, float, str, bytes, lists, dicts

from hypothesis import given
from ordeal.strategies import adversarial_strings

@given(s=adversarial_strings())
def test_parser_doesnt_crash(s):
    parse(s)  # should never raise an unhandled exception

Audit

from ordeal.audit import audit, audit_report, ModuleAudit

audit

audit(
    module: str,                    # dotted path: "myapp.scoring"
    *,
    test_dir: str = "tests",       # directory containing existing tests
    max_examples: int = 20,        # Hypothesis examples per function
    workers: int = 1,              # parallel mutation-validation workers
    validation_mode: Literal["fast", "deep"] = "fast",
) -> ModuleAudit

Audit a single module: measure existing test coverage vs ordeal-migrated tests. Every number in the result is either [verified] or FAILED: reason — the audit never silently returns 0%. validation_mode="fast" replays mined inputs against mutants. validation_mode="deep" keeps that replay check and then re-mines each mutant.

Coverage is measured via coverage.py JSON reports (stable schema), not terminal parsing. Results are cross-checked for consistency. Generated test files are saved to .ordeal/test_<module>_migrated.py.

audit_report

audit_report(
    modules: list[str],
    *,
    test_dir: str = "tests",
    max_examples: int = 20,
    workers: int = 1,
    validation_mode: Literal["fast", "deep"] = "fast",
) -> str

Audit multiple modules and produce a formatted summary report. Every number is labeled [verified] or FAILED.

ModuleAudit

Attribute Type Description
module str Module path
current_test_count int Existing test count
current_test_lines int Lines of existing test code
current_coverage CoverageMeasurement Coverage from existing tests (with status)
migrated_test_count int Tests in generated migrated file
migrated_lines int Lines in generated migrated file
migrated_coverage CoverageMeasurement Coverage from migrated tests (with status)
mined_properties list[str] Properties with Wilson CI bounds
gap_functions list[str] Functions needing fixtures
suggestions list[str] Actionable suggestions for uncovered lines
mutation_score str e.g. "8/10 (80%)" — how many mutations mined properties catch
validation_mode Literal["fast", "deep"] Whether audit used replay or deep re-mining for mutation validation
not_checked list[str] Known unknowns — what ordeal structurally cannot verify
warnings list[str] Every problem visible here
generated_test str Full generated test file content
coverage_preserved bool True if migrated >= current - 2% (False if either failed)
summary() str Human-readable report with [verified]/FAILED labels

CoverageMeasurement

Every coverage number carries its epistemic status.

from ordeal.audit import CoverageMeasurement, Status

Attribute Type Description
status Status VERIFIED or FAILED
result CoverageResult | None Structured data if verified
error str | None Explanation if failed
percent float Coverage %, or 0.0 if failed
missing_lines frozenset[int] Uncovered lines, or empty if failed

CoverageResult

from ordeal.audit import CoverageResult

Attribute Type Description
percent float Coverage percentage
total_statements int Total source statements
missing_count int Number of uncovered statements
missing_lines frozenset[int] Uncovered line numbers
source str How measured (e.g. "coverage.py JSON")

wilson_lower

wilson_lower(successes: int, total: int, z: float = 1.96) -> float

Lower bound of the Wilson score confidence interval. For mined properties: 500/500 at the default z = 1.96 (95% CI) gives a lower bound of ~0.992, which supports "holds at least ~99.2% of the time", not "always holds."
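
For intuition, the bound can be reproduced by hand. The sketch below is a standalone implementation of the textbook Wilson lower bound, not ordeal's source; the name wilson_lower_sketch is hypothetical and only mirrors the documented signature.

```python
import math


def wilson_lower_sketch(successes: int, total: int, z: float = 1.96) -> float:
    """Textbook Wilson score lower bound (illustrative, not ordeal's code)."""
    if total == 0:
        return 0.0  # no evidence yet, so the bound is vacuous
    p = successes / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


# With zero failures the bound simplifies to total / (total + z**2).
all_pass = wilson_lower_sketch(500, 500)
some_fail = wilson_lower_sketch(490, 500)
print(f"500/500 -> {all_pass:.4f}")
print(f"490/500 -> {some_fail:.4f}")
```

The bound stays below 1.0 even when every example passes, which is why a mined property is reported with a confidence level rather than as a proof.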


Diff

from ordeal.diff import diff, DiffResult, Mismatch

Differential testing — compare two implementations on the same random inputs.

diff

diff(
    fn_a: Callable,                             # reference function
    fn_b: Callable,                             # function to compare
    *,
    max_examples: int = 100,
    rtol: float | None = None,                  # relative tolerance
    atol: float | None = None,                  # absolute tolerance
    compare: Callable[[Any, Any], bool] | None = None,  # custom comparator
    **fixtures: SearchStrategy | Any,
) -> DiffResult

Compare two functions for equivalence. Infers strategies from fn_a's type hints. Both functions must accept the same parameters.

# Exact comparison
result = diff(score_v1, score_v2)
assert result.equivalent

# Floating-point tolerance
result = diff(compute_old, compute_new, rtol=1e-6)

# Custom comparator
result = diff(fn_a, fn_b, compare=lambda a, b: a.status == b.status)

DiffResult

Attribute Type Description
function_a str Name of reference function
function_b str Name of compared function
total int Examples tested
mismatches list[Mismatch] Inputs where outputs differed
equivalent bool True if no mismatches
summary() str Human-readable report

Mismatch

Attribute Type Description
args dict Input arguments that caused divergence
output_a Any Output from fn_a
output_b Any Output from fn_b

Scaling

from ordeal.scaling import usl, amdahl, optimal_n, peak_throughput, fit_usl, analyze, benchmark

Universal Scaling Law (USL) and Amdahl's Law for predicting parallel exploration performance.

usl

usl(n: float, sigma: float, kappa: float) -> float

C(N) = N / [1 + sigma*(N-1) + kappa*N*(N-1)]. Returns relative throughput (C(1) = 1).

  • sigma: contention coefficient — fraction of serialized work
  • kappa: coherence coefficient — cross-worker sync cost (grows quadratically)

amdahl / optimal_n / peak_throughput

amdahl(n: float, sigma: float) -> float          # USL with kappa=0
optimal_n(sigma: float, kappa: float) -> float    # worker count at peak throughput
peak_throughput(sigma: float, kappa: float) -> float
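
As a rough illustration of the formula above, the following sketch evaluates the USL directly and locates its peak. The function names are hypothetical, and the closed form for the optimum, sqrt((1 - sigma) / kappa), is the standard USL result; whether ordeal computes it exactly this way is an assumption.

```python
import math


def usl_sketch(n: float, sigma: float, kappa: float) -> float:
    """Relative throughput C(N) under the Universal Scaling Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def optimal_n_sketch(sigma: float, kappa: float) -> float:
    """Worker count that maximizes C(N); unbounded when kappa is 0."""
    if kappa <= 0:
        return math.inf
    return math.sqrt((1 - sigma) / kappa)


sigma, kappa = 0.05, 0.001
n_star = optimal_n_sketch(sigma, kappa)
peak = usl_sketch(n_star, sigma, kappa)
print(f"peak at N = {n_star:.1f}, throughput = {peak:.1f}x")
```

Setting kappa to 0 reduces the model to Amdahl's Law, where throughput approaches 1 / sigma and never declines.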

fit_usl

fit_usl(measurements: list[tuple[int | float, float]]) -> tuple[float, float]

Fit sigma and kappa from (N, throughput) pairs via least squares. Requires >= 3 data points.
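
One common way to do such a fit is to linearize the model: N / C(N) - 1 = sigma*(N-1) + kappa*N*(N-1), then solve an ordinary two-parameter least-squares problem. The sketch below follows that approach under the assumption that one measurement is taken at N = 1 for normalization; it is not ordeal's implementation, and fit_usl_sketch is a hypothetical name.

```python
def fit_usl_sketch(measurements):
    """Least-squares fit of (sigma, kappa) from (N, throughput) pairs.

    Assumes one measurement at N = 1 so throughput can be normalized
    to C(1) = 1. Illustrative only.
    """
    base = next(x for n, x in measurements if n == 1)
    s11 = s12 = s22 = b1 = b2 = 0.0
    for n, x in measurements:
        c = x / base                 # relative throughput C(N)
        y = n / c - 1                # linearized left-hand side
        u, v = n - 1, n * (n - 1)    # regressors for sigma and kappa
        s11 += u * u
        s12 += u * v
        s22 += v * v
        b1 += u * y
        b2 += v * y
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("need at least two distinct worker counts above 1")
    sigma = (b1 * s22 - b2 * s12) / det   # solve the 2x2 normal equations
    kappa = (s11 * b2 - s12 * b1) / det
    return sigma, kappa


def _usl(n, sigma, kappa):
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


# Synthetic measurements generated from known coefficients.
data = [(n, 120.0 * _usl(n, 0.05, 0.001)) for n in (1, 2, 4, 8)]
sigma_hat, kappa_hat = fit_usl_sketch(data)
print(sigma_hat, kappa_hat)
```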

analyze

analyze(measurements: list[tuple[int | float, float]]) -> ScalingAnalysis

Fit USL and return full analysis with diagnosis.

benchmark

benchmark(
    test_class: type | None = None,
    *,
    target_modules: list[str] | None = None,
    max_workers: int | None = None,       # default: CPU count
    time_per_trial: float = 10.0,
    seed: int = 42,
    steps_per_run: int = 50,
    metric: str = "runs",                 # "runs" or "edges"
    mutate_targets: list[str] | None = None,
    repeats: int = 5,
    workers: int = 1,
    preset: str | None = "standard",
    filter_equivalent: bool = True,
    test_filter: str | None = None,
) -> ScalingAnalysis | MutationBenchmarkSuite

Benchmarks exploration at N = 1, 2, 4, ... workers, measures throughput, and fits USL parameters automatically. When mutate_targets=[...] is provided, it instead benchmarks mutation latency in fresh subprocesses and reports median wall time plus per-phase timings.

from ordeal.scaling import benchmark
analysis = benchmark(MyServiceChaos, target_modules=["myapp"])
print(analysis.summary())

from ordeal.scaling import benchmark
suite = benchmark(
    mutate_targets=["tests._mutation_bench_target.tiny_add"],
    repeats=5,
    preset="standard",
)
print(suite.summary())

benchmark_perf_contract

from ordeal.scaling import benchmark_perf_contract

suite = benchmark_perf_contract("ordeal.perf.toml")
print(suite.summary())

Run a checked-in perf/quality contract. Supports import latency, audit latency, mutation latency, and audit_compare cases that fail when one audit validation mode falls too far behind another on mutation score.

When used from the CLI, --output-json PATH writes a stable artifact with passed, cases, failures, and per-case timing/score details so agents can consume the result without parsing text.

scales_linearly

from ordeal.scaling import scales_linearly

@scales_linearly(n_range=(1, 8), max_kappa=0.01, max_sigma=0.3)
def process_batch(items):
    ...

Decorator: assert that a function scales linearly with concurrency. Runs the function with increasing worker counts, fits the USL model, and fails if contention or coherence exceed thresholds.

Parameter Type Default Description
n_range tuple[int, int] (1, 8) (min_workers, max_workers) to test
max_kappa float 0.01 Fail if coherence exceeds this (quadratic overhead)
max_sigma float 0.3 Fail if contention exceeds this (serial bottleneck)
samples int 3 Number of worker counts to test between min and max
time_per_sample float 2.0 Seconds to run at each worker count

Raises AssertionError with diagnostics when thresholds are exceeded. Works as a bare decorator (@scales_linearly) or with parameters (@scales_linearly(max_kappa=0.005)).

ScalingAnalysis

Attribute Type Description
sigma float Contention coefficient
kappa float Coherence coefficient
n_optimal float Worker count at peak throughput
peak float Maximum achievable throughput multiplier
regime str "linear", "amdahl", or "usl"
efficiency(n) float Parallel efficiency C(N)/N at N workers
throughput(n) float Predicted relative throughput at N workers
summary() str Human-readable report with diagnosis

Mine

from ordeal.mine import mine, mine_pair, MineResult, MinedProperty

mine

mine(
    fn: Callable,
    *,
    max_examples: int = 500,
    **fixtures: SearchStrategy | Any,
) -> MineResult

Discover likely properties of a function by running it many times with random inputs and observing patterns in outputs.

Properties checked: type consistency, never None, no NaN, non-negative, bounded [0,1], never empty, deterministic, idempotent, involution (f(f(x)) == x), commutative (f(a,b) == f(b,a)), associative (f(f(a,b),c) == f(a,f(b,c))), observed range, monotonicity (per numeric input parameter), and length relationships (len(output) == len(input)). Float comparisons use math.isclose (rel_tol=1e-9, abs_tol=1e-12) so rounding noise doesn't cause false negatives.

result = mine(myapp.scoring.compute, max_examples=500)
for p in result.universal:
    print(p)
# ALWAYS  output type is float (500/500)
# ALWAYS  deterministic (50/50)
# ALWAYS  output in [0, 1] (500/500)

mine_pair

mine_pair(
    f: Callable,
    g: Callable,
    *,
    max_examples: int = 200,
    **fixtures: SearchStrategy | Any,
) -> MineResult

Discover relational properties between two functions. Checks roundtrip (g(f(x)) == x), reverse roundtrip (f(g(x)) == x), and commutative composition (f(g(x)) == g(f(x))). Strategies are inferred from f's signature.

result = mine_pair(encode, decode)
# roundtrip decode(encode(x)) == x: ALWAYS
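
The roundtrip check itself is simple to picture. The following standalone sketch counts how often g(f(x)) == x holds over random inputs, using the standard library's base64 functions as the pair; check_roundtrip is a hypothetical helper, not part of ordeal, and it uses plain random sampling where ordeal uses Hypothesis strategies.

```python
import base64
import random


def check_roundtrip(f, g, inputs):
    """Count how often g(f(x)) == x; return (holds, total, counterexample)."""
    holds, counterexample = 0, None
    for x in inputs:
        if g(f(x)) == x:
            holds += 1
        elif counterexample is None:
            counterexample = x  # keep the first failing input
    return holds, len(inputs), counterexample


rng = random.Random(0)
samples = [
    bytes(rng.randrange(256) for _ in range(rng.randint(0, 32)))
    for _ in range(200)
]
holds, total, bad = check_roundtrip(base64.b64encode, base64.b64decode, samples)
print(f"roundtrip held on {holds}/{total} inputs")
```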

MineResult

Results are separated into three categories: checked and applicable, checked but not relevant, and structurally impossible to check.

Attribute Type Description
function str Function name
examples int Examples run
properties list[MinedProperty] Checked and applicable (total > 0)
not_applicable list[str] Checked but not relevant (e.g. "bounded [0,1]" for string output)
not_checked list[str] Structural limitations — things mine() cannot verify
universal list[MinedProperty] Properties that held on every example
likely list[MinedProperty] Properties with >= 95% confidence
summary() str Human-readable report

STRUCTURAL_LIMITATIONS

from ordeal.mine import STRUCTURAL_LIMITATIONS

Things mine() fundamentally cannot discover from random sampling — these require domain knowledge:

  • Output value correctness (fuzz checks crash safety, not behavior)
  • Cross-function consistency (e.g., batch == map of individual)
  • Domain-specific invariants (e.g., weighted sum, refusal detection)
  • Error handling for intentionally invalid inputs
  • Performance and resource usage
  • Concurrency and thread safety
  • State mutation and side effects

MinedProperty

Attribute Type Description
name str Property description
holds int Times property held
total int Times property was checked
counterexample dict | None First counterexample if not universal
confidence float holds / total
universal bool True if held on every example

validate_mined_properties

from ordeal.mutations import validate_mined_properties

validate_mined_properties(
    target: str,                    # dotted path: "myapp.scoring.compute"
    max_examples: int = 100,
    operators: list[str] | None = None,
    *,
    preset: Literal["essential", "standard", "thorough"] | None = None,
    mine_result: MineResult | None = None,
    validation_mode: Literal["fast", "deep"] = "fast",
) -> MutationResult

Mine properties of target, then mutate the code and check whether the mined properties catch the mutations. Surviving mutants reveal properties that are too weak. validation_mode="fast" replays mined inputs against mutants. validation_mode="deep" keeps that replay check and then re-runs mine() for each mutant. Used by ordeal audit to report mutation scores.


Metamorphic

from ordeal.metamorphic import Relation, RelationSet, metamorphic

Metamorphic testing checks relationships between outputs rather than exact values. Define a relation that transforms input and checks how outputs relate, then apply it as a decorator.

Relation

Relation(
    name: str,                                              # human-readable label
    transform: Callable[[tuple], tuple],                    # transform input args
    check: Callable[[Any, Any], bool],                      # (original_out, transformed_out) -> bool
)

Compose with +: (relation_a + relation_b) checks both.

metamorphic

@metamorphic(*relations: Relation | RelationSet, max_examples: int = 100)
def test_fn(x: int, y: int):
    return x + y

Decorator. For each Hypothesis-generated input, runs the function on original and transformed inputs, then asserts the relation's check holds. Strategies inferred from type hints.

commutative = Relation(
    "commutative",
    transform=lambda args: (args[1], args[0]),
    check=lambda a, b: a == b,
)

negate_involution = Relation(
    "negate is involution",
    transform=lambda args: (-args[0],),
    check=lambda a, b: abs(a + b) < 1e-6,
)

@metamorphic(commutative)
def test_add(x: int, y: int):
    return x + y

@metamorphic(negate_involution)
def test_negate(x: float):
    return -x

Config

from ordeal.config import load_config, OrdealConfig, ExplorerConfig, TestConfig, ReportConfig, ScanConfig

load_config

load_config(path: str | Path = "ordeal.toml") -> OrdealConfig

Load and validate an ordeal.toml. Raises FileNotFoundError if missing, ConfigError on invalid keys/types.

OrdealConfig

Attribute Type Default
explorer ExplorerConfig see below
tests list[TestConfig] []
scan list[ScanConfig] []
report ReportConfig see below

ExplorerConfig

Attribute Type Default
target_modules list[str] []
max_time float 60.0
max_runs int | None None
seed int 42
max_checkpoints int 256
checkpoint_prob float 0.4
checkpoint_strategy str "energy"
steps_per_run int 50
fault_toggle_prob float 0.3
workers int 1
seed_mutation_respect_strategies bool False

TestConfig

Attribute Type Required
class_path str Yes
steps_per_run int | None No
swarm bool | None No

resolve() -> type — import and return the ChaosTest class.

ReportConfig

Attribute Type Default
format str "text"
output str "ordeal-report.json"
traces bool False
traces_dir str ".ordeal/traces"
verbose bool False

ScanConfig

Attribute Type Default
module str required
max_examples int 50
fixtures dict[str, str] {}

Exploration State

Unified view of what ordeal knows about your code

Every tool (mine, mutate, scan, chaos) explores one dimension of the state space. ExplorationState accumulates their results into a single, persistent, queryable picture. AI assistants read this to understand what's been explored, what's missing, and how confident the results are.

from ordeal.state import explore, ExplorationState
from ordeal.state import explore_mine, explore_scan, explore_mutate, explore_chaos

explore

explore(
    module: str,
    *,
    state: ExplorationState | None = None,  # resume from previous
    time_limit: float | None = None,
    workers: int = 1,                       # parallel mutation testing
    max_examples: int = 50,                 # input space sampling depth
    seed: int = 42,
    patch_io: bool = False,                 # deterministic file/network/subprocess I/O
) -> ExplorationState

Runs all exploration strategies in sequence: mine → scan → mutate → chaos. Each step enriches the shared ExplorationState. Scales with workers (mutation parallelism) and max_examples (input sampling depth). Resume from a previous state to accumulate confidence across sessions. Set patch_io=True to run the pipeline inside the deterministic supervisor's file/network/subprocess substrate.

Individual steps (explore_mine, explore_scan, explore_mutate, explore_chaos) are available for finer control.

ExplorationState

Attribute Type Description
module str Module being explored
functions dict[str, FunctionState] Per-function exploration state
skipped list[tuple[str, str]] Functions skipped during mining with reasons
refreshed list[str] Functions invalidated because source changed
confidence float Aggregate confidence [0, 1] across all functions
frontier dict[str, list[str]] Per-function gaps — what's unexplored
findings list[str] Bugs and anomalies found
finding_details list[dict] Structured findings for reports and agent handoff
exploration_time float Wall-clock time accumulated across runs
supervisor_info dict[str, Any] Reproduction info: seed, transitions, states, scheduler/subprocess data
summary() str Human-readable exploration report
to_dict() dict JSON-friendly state payload for persistence and agents
to_json() str Serialize for persistence across sessions
from_json(data) ExplorationState Deserialize
refresh() list[str] Invalidate stale function results after source changes

FunctionState

Attribute Type Description
mined bool Whether mine() has been run
properties list[dict] Discovered properties with confidence
property_violations list[str] Suspicious discovered properties summarized as findings
property_violation_details list[dict] Structured property-finding details
mutation_score float | None Kill ratio from mutation testing
survived_mutants int Mutants that survived the current test suite
killed_mutants int Mutants killed by the current test suite
hardened bool Whether extra tests have been verified against survivors
hardened_kills int Additional survivors closed by hardening
crash_free bool | None Whether random inputs crashed
scan_error str | None Crash/error text from scan_module()
failing_args dict[str, Any] | None Shrunk failing arguments from scan/fuzz
chaos_tested bool Whether chaos testing has been run
faults_tested list[str] Fault names exercised during chaos testing
edges_discovered int Unique code paths reached
saturated bool True when more mining won't find new paths
confidence float Per-function confidence [0, 1]
frontier list[str] What's unexplored for this function

Agent Schema

from ordeal.agent_schema import (
    AgentArtifact,
    AgentEnvelope,
    AgentFinding,
    build_agent_envelope,
)

Stable JSON envelope used by CLI --json output and other machine consumers.

AgentFinding

Attribute Type Description
kind str Finding class such as crash, mutation, property, or blocked
summary str One-line human-readable statement
confidence float | None Optional confidence score
target str | None Dotted path or module the finding applies to
location str | None Optional file/line or symbolic location
details dict[str, Any] Machine-readable structured payload
to_dict() dict JSON-friendly representation

AgentArtifact

Attribute Type Description
kind str Artifact type such as report, regression, trace, or index
uri str Path or URI to the artifact
description str | None Short human-readable explanation
metadata dict[str, Any] Extra machine-readable metadata
to_dict() dict JSON-friendly representation

AgentEnvelope

Attribute Type Description
schema_version str Stable envelope schema version
tool str Producing command or subsystem (scan, mine, mutate, ...)
target str Primary module/function/trace target
status str Overall status such as ok, issue_found, or blocked
summary str High-signal one-line summary
recommended_action str Best next action for the consumer
suggested_commands list[str] Follow-up shell commands
suggested_test_file str | None Suggested regression test path
confidence float | None Optional confidence score
confidence_basis list[str] Short reasons behind the confidence value
blocking_reason str | None Why execution was blocked, if applicable
findings list[AgentFinding] Structured findings
artifacts list[AgentArtifact] Produced or referenced artifacts
raw_details dict[str, Any] Tool-specific payload not normalized into top-level fields
to_dict() dict Stable machine-readable dict
to_json() str Deterministically sorted JSON

build_agent_envelope

build_agent_envelope(
    *,
    tool: str,
    target: str,
    status: str,
    summary: str,
    recommended_action: str = "",
    suggested_commands: Sequence[str] = (),
    suggested_test_file: str | None = None,
    confidence: float | None = None,
    confidence_basis: Sequence[str] = (),
    blocking_reason: str | None = None,
    findings: Sequence[AgentFinding | Mapping[str, Any]] = (),
    artifacts: Sequence[AgentArtifact | Mapping[str, Any]] = (),
    raw_details: Mapping[str, Any] | None = None,
    schema_version: str = "1.0",
) -> AgentEnvelope

Normalize mixed finding/artifact inputs into a stable AgentEnvelope.


Deterministic Supervisor

Control non-determinism for reproducible exploration

Execution is non-deterministic: RNG state, time, subprocess timing, and interleavings all vary between runs. The same code can produce different behavior. DeterministicSupervisor fixes this by seeding every entropy source, replacing time with a deterministic clock, and optionally running subprocesses and cooperative tasks against a seed-driven scheduler. Same seed = same execution. Different seeds = different exploration trajectories.
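
The core idea, that a seed fully determines the interleaving, can be shown with a toy scheduler. This is a minimal sketch using generators and random.Random, not the supervisor's actual scheduler; run_interleaved is a hypothetical name.

```python
import random


def run_interleaved(seed, tasks):
    """Step generator tasks in an order chosen by a seeded RNG."""
    rng = random.Random(seed)
    runnable = {name: make() for name, make in tasks.items()}
    trace = []
    while runnable:
        name = rng.choice(sorted(runnable))  # sorted for a stable choice set
        try:
            trace.append((name, next(runnable[name])))
        except StopIteration:
            del runnable[name]  # task finished
    return trace


def worker():
    yield "step-1"
    yield "step-2"


tasks = {"a": worker, "b": worker}
first = run_interleaved(42, tasks)
second = run_interleaved(42, tasks)
print(first == second)
```

Replaying with the same seed reproduces the same trace, which is what makes a failing interleaving debuggable.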

from ordeal.supervisor import DeterministicSupervisor, StateTree, StateNode

DeterministicSupervisor

import subprocess

with DeterministicSupervisor(seed=42) as sup:
    # random, buggify, numpy all seeded
    # time.time() and time.sleep() are deterministic
    result = my_function()
    sup.log_transition("called my_function", state_hash=hash(result))

with DeterministicSupervisor(seed=42, patch_io=True) as sup:
    sup.register_subprocess(["worker", "--once"], stdout="ok\n", delay=2.0)
    out = subprocess.check_output(["worker", "--once"], text=True)
    assert out == "ok\n"

with DeterministicSupervisor(seed=42) as sup:
    def worker(name):
        yield sup.yield_now()
        yield sup.sleep(1.0)
        return name

    sup.spawn("a", worker, "a")
    sup.spawn("b", worker, "b")
    results = sup.run_until_idle()

Method Description
log_transition(action, state_hash=) Record a state transition
spawn(name, task, *args, **kwargs) Register a cooperative task with the deterministic scheduler
yield_now() Yield control back to the scheduler
sleep(seconds) Suspend the running task for simulated time
run_until_idle(max_steps=None) Run cooperative tasks until completion or a step limit
register_subprocess(command, stdout=, stderr=, returncode=, delay=, match=) Register deterministic subprocess.run / check_output / Popen behavior
clear_subprocesses() Remove registered deterministic subprocesses
fork(new_seed=) Create a new supervisor from current state with different seed
state Current state hash
trajectory List of Transition objects
visited_states All states visited
task_results Completed cooperative task results keyed by name
pending_tasks Cooperative tasks that are still blocked or runnable
reproduction_info() Dict with seed, patch_io, subprocess count, scheduler steps, hash seed, steps — everything needed to replay
summary() Human-readable trajectory report

StateTree

Navigable exploration tree with checkpoint and rollback. Each node is a checkpointed state; edges are actions taken. The AI can checkpoint, explore a branch, roll back, and try a different branch.

tree = StateTree()
tree.checkpoint(state_id=0, snapshot=my_state)
tree.checkpoint(state_id=1, parent=0, action="deposit(50)", snapshot=new_state)

old = tree.rollback(0)  # returns deepcopy of checkpointed state
tree.checkpoint(state_id=2, parent=0, action="withdraw(50)", snapshot=other_state)

Method Description
checkpoint(state_id, snapshot=, parent=, action=, edges=, seed=) Save a state as a tree node
rollback(state_id) Return deepcopy of a previous checkpoint
frontier() Nodes that can be explored further
leaves() Deepest explored states
path_to(state_id) Sequence of actions from root to a node
summary() Visual tree structure
to_json() Serialize tree (without snapshots)
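
To make the checkpoint and rollback semantics concrete, here is a toy tree that stores deep copies of snapshots and walks parent links to rebuild the action path. MiniStateTree is a hypothetical stand-in written for illustration; it does not reflect how StateTree is implemented.

```python
import copy


class MiniStateTree:
    """Toy checkpoint tree: deep-copied snapshots plus parent/action links."""

    def __init__(self):
        self._snapshots = {}
        self._links = {}

    def checkpoint(self, state_id, snapshot, parent=None, action=None):
        self._snapshots[state_id] = copy.deepcopy(snapshot)
        self._links[state_id] = (parent, action)

    def rollback(self, state_id):
        # Return a copy so callers cannot corrupt the stored checkpoint.
        return copy.deepcopy(self._snapshots[state_id])

    def path_to(self, state_id):
        actions = []
        while state_id is not None:
            parent, action = self._links[state_id]
            if action is not None:
                actions.append(action)
            state_id = parent
        return actions[::-1]


tree = MiniStateTree()
tree.checkpoint(0, {"balance": 100})
tree.checkpoint(1, {"balance": 150}, parent=0, action="deposit(50)")
restored = tree.rollback(0)
restored["balance"] = 0      # mutating the copy...
again = tree.rollback(0)     # ...leaves the stored checkpoint intact
print(again, tree.path_to(1))
```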

CMPLOG

Crack guarded branches that random testing can't reach

When code has if x == 42 and mode == "admin", random testing will almost never generate those exact values. CMPLOG parses the function's AST, extracts literal values from comparisons, and injects them into Hypothesis strategies. This is the Python equivalent of AFL++'s CMPLOG/RedQueen technique.

from ordeal.cmplog import extract_comparison_values, enhance_strategies

extract_comparison_values

extract_comparison_values(fn: Callable) -> dict[str, list[Any]]

Returns {"param_name": [literal_values]} extracted from ==, !=, in, >=, etc. in the function source.
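
The extraction step can be approximated with the standard ast module. The sketch below takes source text rather than a callable and handles only direct name-versus-literal comparisons; comparison_literals is a hypothetical helper and makes no claim to match ordeal's implementation.

```python
import ast


def comparison_literals(source: str) -> dict:
    """Map each compared variable name to the literals it is compared against."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left, *node.comparators]
        names = [o.id for o in operands if isinstance(o, ast.Name)]
        values = []
        for o in operands:
            if isinstance(o, ast.Constant):
                values.append(o.value)
            elif isinstance(o, (ast.Tuple, ast.List, ast.Set)):
                # membership tests such as `x in (1, 2, 3)`
                values.extend(e.value for e in o.elts if isinstance(e, ast.Constant))
        for name in names:
            bucket = found.setdefault(name, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return found


SOURCE = '''
def authorize(x, mode):
    if x == 42 and mode == "admin":
        return True
    return mode in ("root", "ops")
'''
literals = comparison_literals(SOURCE)
print(literals)
```

Feeding these literals into the input generator is what lets a fuzzer pass guards like x == 42 that uniform random sampling would almost never hit.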

enhance_strategies

enhance_strategies(
    strategies: dict[str, SearchStrategy],
    fn: Callable,
) -> dict[str, SearchStrategy]

Merges extracted magic values into Hypothesis strategies. The enhanced strategy generates branch-cracking values alongside random exploration.

Automatically wired into mine() — no manual usage needed. Available for custom fuzzing loops.


Mutagen

AFL's bit-flip loop for Python values

Real fuzzers don't generate random inputs from scratch — they mutate known-good inputs. mutagen applies type-aware perturbation to Python values: bit-flips for ints, mantissa perturbation for floats, character swaps for strings. Combined with coverage feedback, mutations that reach new code paths become seeds for further mutation.

from ordeal.mutagen import mutate_value, mutate_inputs

mutate_value

mutate_value(value: Any, rng: random.Random, intensity: float = 0.3) -> Any

Mutate a single value. Type-aware: ints get bit-flips and arithmetic perturbation, floats get mantissa perturbation and special values (NaN, Inf), strings get character swaps and boundary strings, lists/dicts get element mutation.
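
For the integer case, a single bit-flip looks roughly like this. It is a minimal sketch of one mutation operator, assuming a 64-bit window; flip_random_bit and its bits parameter are hypothetical and not part of ordeal.mutagen.

```python
import random


def flip_random_bit(value: int, rng: random.Random, bits: int = 64) -> int:
    """Flip one randomly chosen bit among the low `bits` bits of an int."""
    return value ^ (1 << rng.randrange(bits))


rng = random.Random(7)
seed_input = 1000
mutants = [flip_random_bit(seed_input, rng) for _ in range(5)]
print(mutants)
```

Each mutant differs from the seed input in exactly one bit, so it stays close to a known-good value while still probing boundaries such as sign and overflow.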

mutate_inputs

mutate_inputs(
    inputs: dict[str, Any],
    rng: random.Random,
    intensity: float = 0.3,
) -> dict[str, Any]

Mutate a full kwargs dict (like those in MineResult.collected_inputs). Returns a new dict with mutated values. Keys are preserved.

Automatically wired into mine() Phase 2 — after Hypothesis sampling, productive inputs are mutated to explore nearby state space. Available for custom fuzzing loops.


Cross-Function Mining

Discover relationships between functions automatically

Single-function mining finds properties like "output >= 0". Cross-function mining finds relationships like "decode(encode(x)) == x" — roundtrips, composition commutativity, output equivalence. Tests all compatible function pairs automatically.

from ordeal.mine import mine_module, MineModuleResult, CrossFunctionProperty

mine_module

mine_module(
    module: str | ModuleType,
    *,
    max_examples: int = 30,
    mine_per_function: bool = True,
) -> MineModuleResult

Discovers per-function properties (via mine()) and cross-function relationships for all compatible pairs.

CrossFunctionProperty

Attribute Type Description
function_a str First function
function_b str Second function
relation str "roundtrip", "commutative_composition", or "equivalent"
confidence float Fraction of inputs where the relation held
holds int Number of inputs where it held
total int Number of inputs tested
counterexample dict | None One failing input if relation doesn't hold universally

Grammar Strategies

Syntax-valid inputs reach deeper code

Random bytes and strings get rejected at the parser level — they never reach the business logic that actually has bugs. Grammar-aware strategies generate syntactically valid inputs (JSON, SQL, URLs, etc.) that pass parsing and exercise the code paths that matter. This is the Python equivalent of libFuzzer's structure-aware custom mutators.

from ordeal.grammar import json_strategy, sql_strategy, url_strategy
from ordeal.grammar import email_strategy, csv_strategy, xml_strategy
from ordeal.grammar import path_strategy, regex_strategy, structured_strategy

Each returns a hypothesis.strategies.SearchStrategy. Use with @given, @quickcheck, mine(), or any Hypothesis-based tool.

Strategy What it generates Key parameters
json_strategy(schema=, max_depth=3) Valid JSON values (objects, arrays, primitives) schema dict constrains structure
sql_strategy(dialect=, tables=) Valid SELECT/INSERT/UPDATE/DELETE tables dict of {name: [columns]}
url_strategy(schemes=) Valid URLs with paths, query params, fragments schemes list (default: http, https, ftp)
email_strategy() Valid email addresses
path_strategy() Valid Unix and Windows file paths
csv_strategy(columns=, rows=) Valid CSV with headers columns list of names
xml_strategy(tag=, max_depth=2) Well-formed XML with elements and attributes tag root element name
regex_strategy(pattern) Strings matching a regex pattern regex string
structured_strategy(example) Values structurally similar to the example Any Python value

# Generate valid JSON for API testing
from hypothesis import given
from ordeal.grammar import json_strategy

@given(payload=json_strategy({"type": "object"}))
def test_api(payload):
    response = my_api.post(payload)
    assert response.status_code < 500

# Generate valid SQL for query testing
from ordeal.grammar import sql_strategy
@given(query=sql_strategy(tables={"users": ["id", "name", "email"]}))
def test_query_parser(query):
    parsed = parse_sql(query)
    assert parsed is not None

# Infer strategy from an example
from ordeal.grammar import structured_strategy
example = {"name": "Alice", "scores": [95, 87, 92], "active": True}
@given(data=structured_strategy(example))
def test_process(data):
    result = process_record(data)
    assert result is not None

Equivalence Detection

Not all surviving mutants are test gaps

Equivalent mutants are code changes that don't change behavior — they always survive mutation testing, inflating the "test gap" count and wasting developer time. Detecting them is one of the hardest problems in mutation testing. ordeal provides three complementary approaches: structural (fast), statistical (medium), and formal (slow, definitive).

from ordeal.equivalence import (
    structural_equivalence,
    statistical_equivalence,
    prove_equivalent,
    classify_mutant,
    filter_equivalent_mutants,
    EquivalenceResult,
)

Three approaches, layered fast → slow

Structural — AST comparison after normalization. Catches trivially equivalent mutants (e.g., reordering commutative operations). Fast but conservative.

Statistical — Run both versions on random inputs, compare outputs. Uses Wilson score confidence interval to bound equivalence probability. Medium speed, probabilistic.

Formal — Z3 SMT solver encodes both functions and checks semantic identity. Definitive proof but slow. Optional: pip install z3-solver.
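
The structural layer can be pictured as comparing parsed ASTs rather than raw text. The sketch below shows only that comparison, without any normalization pass such as reordering commutative operands; same_ast is a hypothetical helper, not ordeal's structural_equivalence.

```python
import ast


def same_ast(source_a: str, source_b: str) -> bool:
    """True when two snippets parse to identical ASTs (layout and comments ignored)."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


original = "def f(x):\n    return x + 1\n"
reformatted = "def f(x):\n    return (x+1)  # same expression\n"
mutant = "def f(x):\n    return x - 1\n"

print(same_ast(original, reformatted))
print(same_ast(original, mutant))
```

A match proves equivalence cheaply; a mismatch proves nothing, which is why the slower statistical and formal layers exist.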

classify_mutant

classify_mutant(
    original_fn: Callable,
    mutant_fn: Callable,
    original_source: str,
    mutant_source: str,
    *,
    max_seconds: float = 5,
) -> EquivalenceResult

Runs all three methods in order (structural → statistical → formal). Returns the first definitive result.
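The "first definitive result" pattern can be sketched with the standard library alone. Everything below is illustrative: `SketchResult` is a stand-in that mirrors a subset of the documented EquivalenceResult attributes, the check functions are hypothetical simplifications, and the formal (Z3) layer is omitted.

```python
import ast
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SketchResult:
    """Stand-in for EquivalenceResult (subset of the documented attributes)."""
    equivalent: Optional[bool]
    confidence: float
    method: str
    counterexample: Optional[dict] = None


def structural_check(original_source: str, mutant_source: str) -> Optional[SketchResult]:
    # ast.dump omits comments, layout, and redundant parentheses,
    # so formatting-only changes compare equal.
    same = ast.dump(ast.parse(original_source)) == ast.dump(ast.parse(mutant_source))
    return SketchResult(True, 1.0, "structural") if same else None


def statistical_check(
    original_fn: Callable, mutant_fn: Callable, samples: int = 200, seed: int = 0
) -> Optional[SketchResult]:
    rng = random.Random(seed)
    for _ in range(samples):
        x = rng.randint(-1000, 1000)  # hypothetical input range
        if original_fn(x) != mutant_fn(x):
            return SketchResult(False, 1.0, "statistical", {"x": x})
    return None  # in this sketch, agreement alone is not treated as definitive


def classify_sketch(original_fn, mutant_fn, original_source, mutant_source) -> SketchResult:
    checks = (
        lambda: structural_check(original_source, mutant_source),
        lambda: statistical_check(original_fn, mutant_fn),
    )
    for check in checks:  # cheapest layer first
        result = check()
        if result is not None:
            return result  # stop at the first definitive answer
    return SketchResult(None, 0.0, "inconclusive")


src_a = "def f(x):\n    return x + 1\n"
src_b = "def f(x):\n    # same logic, different layout\n    return (x + 1)\n"
src_c = "def f(x):\n    return x - 1\n"

same_layout = classify_sketch(lambda x: x + 1, lambda x: x + 1, src_a, src_b)
different = classify_sketch(lambda x: x + 1, lambda x: x - 1, src_a, src_c)
```

Ordering the layers from cheapest to most expensive means most mutants are settled before the slow checks run. How the real classify_mutant spends its max_seconds budget across layers is not specified here.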

EquivalenceResult

| Attribute | Type | Description |
| --- | --- | --- |
| equivalent | bool \| None | True = equivalent, False = different, None = inconclusive |
| confidence | float | 1.0 for proven, 0.0-1.0 for statistical |
| method | str | "structural", "statistical", "formal", or "inconclusive" |
| counterexample | dict \| None | Input where outputs differ (if not equivalent) |
| time_seconds | float | Time taken for the analysis |

filter_equivalent_mutants

filter_equivalent_mutants(
    target: str,
    mutant_pairs: list[MutantPair],
    *,
    methods: tuple[str, ...] = ("structural", "statistical"),
) -> list[MutantPair]

Drop-in replacement for the existing equivalence filter in mutation testing. Uses the layered approach: structural first (fast), then statistical, optionally formal.


N-gram Coverage

Path context finds deeper bugs

Single-edge coverage (the default AFL model) tracks individual transitions: A→B. But the same edge reached via different paths can expose different bugs. N-gram coverage tracks sequences of N edges as a single hash: at ngram=2, the path X→A→B is different from Y→A→B. This captures deeper patterns in control flow without the full overhead of path-sensitive analysis.
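The difference between single-edge and 2-gram coverage can be shown with a small standalone sketch. `ngram_coverage` is a hypothetical helper, not part of ordeal. It keeps each N-edge window as a tuple for readability, where a collector as described above would reduce it to a single hash.

```python
from collections import deque
from typing import Iterable, Optional, Set, Tuple

Edge = Tuple[str, str]


def ngram_coverage(blocks: Iterable[str], ngram: int = 1) -> Set[Tuple[Edge, ...]]:
    """Collect every window of `ngram` consecutive edges along an executed path."""
    window: deque = deque(maxlen=ngram)  # oldest edge drops out automatically
    seen: Set[Tuple[Edge, ...]] = set()
    prev: Optional[str] = None
    for block in blocks:
        if prev is not None:
            window.append((prev, block))
            if len(window) == ngram:
                seen.add(tuple(window))
        prev = block
    return seen


path_x = ["X", "A", "B"]
path_y = ["Y", "A", "B"]

# ngram=1: both paths share the edge A -> B.
shared_1 = ngram_coverage(path_x, 1) & ngram_coverage(path_y, 1)

# ngram=2: the predecessor edge is part of the key, so nothing is shared.
shared_2 = ngram_coverage(path_x, 2) & ngram_coverage(path_y, 2)
```

Here `shared_1` contains the A→B edge while `shared_2` is empty: at ngram=2 the second path counts as new coverage even though every individual edge target was already seen.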

The Explorer's CoverageCollector supports configurable N-gram depth:

from ordeal.explore import CoverageCollector

# Default: single-edge (backward compatible)
collector = CoverageCollector(["myapp"], ngram=1)

# 2-gram: captures one level of path context
collector = CoverageCollector(["myapp"], ngram=2)

Configure via ordeal.toml:

[explorer]
ngram = 2  # path-context depth (default: 1)

| N-gram | What it captures | Overhead | Best for |
| --- | --- | --- | --- |
| 1 | Single edge transitions | Lowest | Quick exploration |
| 2 | Edge + one predecessor | Low | Most codebases (recommended) |
| 3+ | Deeper path context | Medium | Complex state machines |