Configuration¶

All exploration settings live in ordeal.toml. Copy from ordeal.toml.example and edit.

Why TOML¶

ordeal uses a single ordeal.toml file because configuration should be data, not code.

TOML is human-readable, machine-parseable, and version-controllable. You can review changes in a diff, generate the file from a script, or have an AI agent produce it after scanning your codebase. There is no Python import machinery involved, no subclassing, no registration -- just a flat file that describes what to explore, how long to run, and where to report.

One file, checked into your repo, that anyone (or anything) can read and modify.

Schema¶

`[explorer]`¶

Key	Type	Default	Description
`target_modules`	`list[str]`	`[]`	Modules to track for edge coverage
`max_time`	`float`	`60`	Wall-clock time limit (seconds)
`max_runs`	`int?`	`null`	Run count limit (null = time-only)
`seed`	`int`	`42`	RNG seed
`max_checkpoints`	`int`	`256`	Checkpoint corpus size
`checkpoint_prob`	`float`	`0.4`	Probability of starting from checkpoint
`checkpoint_strategy`	`str`	`"energy"`	`"energy"`, `"uniform"`, or `"recent"`
`steps_per_run`	`int`	`50`	Max rule steps per run
`fault_toggle_prob`	`float`	`0.3`	Nemesis action probability per step
`workers`	`int`	`1`	Parallel workers (0 = auto: `os.cpu_count()`)

`[[tests]]`¶

Key	Type	Required	Description
`class`	`str`	Yes	`"module.path:ClassName"`
`steps_per_run`	`int?`	No	Override per test
`swarm`	`bool?`	No	Override swarm mode

`[report]`¶

Key	Type	Default	Description
`format`	`str`	`"text"`	`"text"`, `"json"`, or `"both"`
`output`	`str`	`"ordeal-report.json"`	JSON report path
`traces`	`bool`	`false`	Save full traces for replay
`traces_dir`	`str`	`".ordeal/traces"`	Trace output directory
`verbose`	`bool`	`false`	Live progress to stderr

`[[scan]]`¶

Declare modules for auto-scan testing. The pytest plugin auto-collects these and runs scan_module() on each.

Key	Type	Default	Description
`module`	`str`	required	Dotted module path to scan
`max_examples`	`int`	`50`	Hypothesis examples per function
`fixtures`	`dict`	`{}`	Strategy overrides for untyped parameters

[[scan]]
module = "myapp.scoring"
max_examples = 100

[[scan]]
module = "myapp.pipeline"
fixtures = { model = "sampled_from(['gpt-4', 'claude'])" }

When you run pytest --chaos, ordeal auto-discovers these entries and smoke-tests every public function in each module. Functions without type hints are skipped unless fixtures are provided.

Tuning guide¶

The defaults are reasonable for a first run. Once you have something working, the parameters below are the ones worth adjusting.

`target_modules`¶

Controls which Python modules the explorer instruments for edge coverage. The explorer uses AFL-style edge hashing via sys.settrace -- it only tracks control-flow transitions in the modules you list here.

Start with your main application module (e.g., ["myapp"]). Add more as you want broader coverage. Submodules are included automatically: "myapp" covers myapp.api, myapp.db, and so on.

Too many modules means tracing overhead slows down each run, reducing the number of runs per second. Too few modules means the explorer is blind to coverage in code you care about, so it cannot checkpoint effectively. If you are unsure, start narrow and widen after looking at the edge count in the report.

`max_time`¶

Wall-clock time limit for the entire exploration run. The explorer loops until this limit is reached (or max_runs is hit, if set).

60s -- good for local development and quick feedback.
300s -- good for CI. Catches most shallow and medium-depth bugs.
3600s+ -- pre-release or nightly runs. Longer runs find deeper bugs because the explorer has more time to branch from rare checkpoints.

The relationship is roughly logarithmic: doubling the time does not double the bugs found, but it does explore states that shorter runs never reach. Start short, increase as your confidence requirements grow.

`checkpoint_prob`¶

The probability that a new exploration run starts from a saved checkpoint rather than from a fresh machine state. This controls the balance between depth and diversity.

0.4 (default) -- a good balance. 40% of runs branch from interesting prior states, 60% start fresh.
0.6 - 0.8 -- deep-state exploration. Use this for systems with deep state machines where bugs hide behind many prerequisite steps.
0.1 - 0.2 -- high diversity. Use this early on, or for systems where bugs tend to appear in the first few steps regardless of prior state.

If you see the edge count plateauing quickly, try increasing this value to let the explorer dig deeper from known interesting states.

`checkpoint_strategy`¶

How the explorer picks which checkpoint to branch from when it does use a checkpoint.

"energy" (default) -- checkpoints that led to new coverage discoveries get higher energy and are selected more often. Energy decays over time (decay factor 0.95, minimum 0.01), so stale checkpoints gradually lose priority. This works well for most systems.
"uniform" -- pick a checkpoint at random with equal probability. Try this if energy scheduling seems stuck on a small cluster of checkpoints.
"recent" -- favor recently created checkpoints with linearly increasing weights. Good for systems where newer states matter more than older ones, such as systems with monotonically growing state.

`steps_per_run`¶

The maximum number of rule steps (including fault toggles) in a single exploration run. Each run picks a random number of steps between 1 and this value.

50 (default) -- good for most services and typical ChaosTest classes.
100 - 200 -- for systems with deep state machines where the interesting behavior requires many sequential operations (e.g., a database that needs a series of writes before a compaction triggers).
20 - 30 -- for fast iteration. Shorter runs complete faster, so the explorer gets more runs per second and more chances to try different checkpoint branches.

Higher values mean each individual run takes longer. Lower values mean more runs, but each run explores less deeply. If your ChaosTest rules are expensive (e.g., they call external services), lean toward fewer steps.

`fault_toggle_prob`¶

The probability that any given step is a fault toggle (nemesis action) rather than a regular rule execution. When a fault toggle fires, the explorer randomly activates or deactivates one of the registered faults.

0.3 (default) -- roughly 30% of steps are fault toggles. This gives a good mix of normal operation and fault injection.
0.5 - 0.7 -- highly chaotic. Faults flip on and off frequently, testing rapid recovery and cascading failures.
0.05 - 0.15 -- long fault-free windows with occasional disruptions. Better for testing sustained degraded operation rather than rapid fault cycling.

`seed`¶

The RNG seed for the entire exploration. Same seed + same code = same exploration path.

Set this explicitly in CI for reproducibility. When a failure is found, the seed (along with the trace) lets you replay the exact same sequence. Use different seeds across parallel CI jobs to explore different paths.

`max_checkpoints`¶

The maximum number of checkpoints kept in the corpus. When the limit is reached, the lowest-energy checkpoint is evicted (under the "energy" strategy) or a random one is removed.

256 is generous for most use cases. Increase it if you have long runs where many distinct interesting states accumulate. Decrease it if checkpoint deepcopy is expensive for your machine state.

`workers`¶

Number of parallel worker processes. Default 1 (sequential). Each worker gets a unique seed (base + i*7919) and explores independently.

Set to the number of available CPU cores for maximum throughput. In CI, match your runner's core count. Locally, leave headroom for other work (e.g., workers = 6 on an 8-core machine).

Workers don't share checkpoints or coverage — they explore independently and results are aggregated at the end (runs summed, edges unioned). This means some edge discovery may overlap. Use ordeal.scaling.benchmark() to measure actual efficiency for your test.

[explorer]
workers = 4    # 4 parallel processes

Examples¶

Local development (quick iteration)¶

Fast feedback during development. Short runs, text output, no traces.

[explorer]
target_modules = ["myapp"]
max_time = 30
steps_per_run = 30

[[tests]]
class = "tests.test_chaos:MyServiceChaos"

CI¶

Longer runs, fixed seed for reproducibility, JSON report for tooling.

[explorer]
target_modules = ["myapp"]
max_time = 300
seed = 0

[[tests]]
class = "tests.chaos.test_api:APIChaos"

[[tests]]
class = "tests.chaos.test_scoring:ScoringChaos"

[report]
format = "json"
output = "ordeal-report.json"

Pre-release validation¶

Thorough exploration. Long time budget, deep state exploration, full traces saved for replay.

[explorer]
target_modules = ["myapp", "myapp.db", "myapp.cache"]
max_time = 3600
seed = 0
checkpoint_prob = 0.7
steps_per_run = 150
max_checkpoints = 512

[[tests]]
class = "tests.chaos.test_api:APIChaos"
steps_per_run = 200

[[tests]]
class = "tests.chaos.test_scoring:ScoringChaos"

[[tests]]
class = "tests.chaos.test_persistence:PersistenceChaos"
steps_per_run = 200

[report]
format = "both"
output = "ordeal-report.json"
traces = true
traces_dir = ".ordeal/traces"
verbose = true

Multi-service with per-test overrides¶

Multiple ChaosTest classes with different tuning per test. The API test gets more steps because it has a deeper state machine. The cache test uses swarm mode to randomize which faults are active.

[explorer]
target_modules = ["ordering", "inventory", "payments"]
max_time = 600
seed = 7

[[tests]]
class = "tests.chaos.test_ordering:OrderingChaos"
steps_per_run = 100

[[tests]]
class = "tests.chaos.test_inventory:InventoryChaos"
steps_per_run = 50
swarm = true

[[tests]]
class = "tests.chaos.test_payments:PaymentsChaos"
steps_per_run = 80

[report]
format = "both"
output = "ordeal-report.json"
verbose = true

Loading from Python¶

from ordeal.config import load_config

cfg = load_config()               # reads ./ordeal.toml
cfg = load_config("ci.toml")     # custom path
cfg.explorer.max_time             # 60.0
cfg.tests[0].resolve()           # imports the class

The load_config function validates the TOML against the schema and raises ConfigError with a clear message if any key is unknown or any value is out of range.

For AI agents¶

ordeal.toml is designed to be generated programmatically. The format is intentionally flat and predictable -- no inheritance, no imports, no conditional logic.

A typical workflow for an AI agent:

Scan the codebase for ChaosTest subclasses.
Identify their module paths and class names.
Determine which application modules should be traced for coverage.
Generate an ordeal.toml with reasonable defaults.

Here is what a generated config might look like after scanning a project:

# Auto-generated by agent. Target: myapp (3 ChaosTest classes found).

[explorer]
target_modules = ["myapp"]
max_time = 300
seed = 0

[[tests]]
class = "tests.chaos.test_api:APIChaos"

[[tests]]
class = "tests.chaos.test_db:DatabaseChaos"

[[tests]]
class = "tests.chaos.test_cache:CacheChaos"

[report]
format = "json"
output = "ordeal-report.json"

The class path format is always "module.path:ClassName" -- the same format used by Python entry points. An agent can derive this from any ChaosTest subclass it finds in the test tree.

No Python code is needed to produce or consume this file. A shell script, a CI step, or an LLM can generate it from a template.

Configuration¶

Why TOML¶

Schema¶

[explorer]¶

[[tests]]¶

[report]¶

[[scan]]¶

Tuning guide¶

target_modules¶

max_time¶

checkpoint_prob¶

checkpoint_strategy¶

steps_per_run¶

fault_toggle_prob¶

seed¶

max_checkpoints¶

workers¶