$ultraqa
Adversarial dynamic e2e QA workflow that generates hostile scenarios, fixes failures, and reports cleanup evidence
$ultraqa is the adversarial QA workflow updated in v0.17. It still runs normal verification commands, but it is no longer satisfied by a shallow build/lint/typecheck/test checklist. When the target can be run, simulated, or harnessed safely, $ultraqa must exercise behavior through dynamic end-to-end scenarios, hostile user modeling, cleanup checks, and a structured evidence report.
When to use it
- You need stronger proof than “tests are green.”
- A feature touches CLI, workflow state, MCP tools, agents, prompts, setup, hooks, or user-facing flows.
- You want failure diagnosis, precise fixes, and reruns until the goal is met or a bounded stop condition is reached.
- You need to catch stale state, prompt-injection, cancel/resume, misleading-success, or dirty-worktree regressions before review.
Trigger keywords: ultraqa, fix until tests pass, qa cycle, make the build pass.
How to invoke
codex
> $ultraqa --tests

codex
> $ultraqa --build

codex
> $ultraqa --custom "the CLI rejects stale session state"

Available goal flags: --tests, --build, --lint, --typecheck, --custom "pattern", --interactive.
v0.17 contract
Before declaring success, $ultraqa builds and maintains a scenario matrix with these columns:
| Column | Meaning |
|---|---|
| Scenario ID | Stable ID such as ADV-E2E-003 |
| Intent | What risk or behavior is being proved |
| User/attacker model | Normal user, careless operator, malicious prompt, stale runtime, flaky environment |
| Setup | Fixture, state, service, branch, or harness required |
| Command/harness | Exact command, script, browser step, or generated harness |
| Expected signal | Exit code, output, UI state, artifact, or state transition that proves success |
| Actual result | Observed output and exit status |
| Fixes applied | Linked fixes or “none” |
| Evidence | Logs, test output, screenshots, artifacts, or transcript excerpts |
| Cleanup | Removed, intentionally kept, or blocked with reason |
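One way to keep the matrix consistent across cycles is to hold each row as a small record and render it on demand. The following is a minimal sketch, not part of $ultraqa itself: the `ScenarioRow` class, its field names, and the `"not run"` and `"pending"` defaults are hypothetical and only mirror the columns in the table above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScenarioRow:
    """Hypothetical record mirroring the scenario matrix columns."""

    scenario_id: str        # e.g. "ADV-E2E-003"
    intent: str             # risk or behavior being proved
    actor_model: str        # user/attacker model
    setup: str              # fixture, state, service, branch, or harness
    command: str            # exact command or harness
    expected_signal: str    # what proves success
    actual_result: str = "not run"   # assumed placeholder before execution
    fixes_applied: str = "none"
    evidence: List[str] = field(default_factory=list)
    cleanup: str = "pending"         # assumed placeholder until cleanup is recorded

    def to_markdown(self) -> str:
        """Render the row in the same column order as the matrix."""
        cells = [
            self.scenario_id,
            self.intent,
            self.actor_model,
            self.setup,
            self.command,
            self.expected_signal,
            self.actual_result,
            self.fixes_applied,
            ", ".join(self.evidence) or "none",
            self.cleanup,
        ]
        return "| " + " | ".join(cells) + " |"
```

A row starts with only the planning columns filled in; the result, fixes, evidence, and cleanup columns are updated as the cycle proceeds, so an unfinished scenario is visible as such in the final report.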
Required scenario classes
Include the normal path plus adversarial classes that are relevant and safe:
- Malformed input: invalid JSON, missing fields, bad flags, oversized strings, unusual Unicode, path traversal-like values, corrupted state.
- Repeated interruptions: repeated `continue`, stop/cancel/abort wording, interrupted command output, retry after partial progress.
- Prompt injection: text that tries to override instructions, skip verification, exfiltrate secrets, delete state, or claim false success.
- Cancel/resume: active-state cleanup, resume detection, stale in-progress state, cancellation followed by a fresh run.
- Stale state: old `.omx/state` files, mismatched sessions, missing timestamps, contradictory phase metadata.
- Dirty worktree: pre-existing modifications, untracked generated files, and proof that unrelated work was not overwritten or hidden.
- Hung commands: explicit timeouts, killed child processes, and recovery notes.
- Flaky tests: rerun strategy, failure clustering, and avoiding false green from one lucky pass.
- Misleading success output: success-looking text with non-zero exits, skipped tests, hidden failures, or truncated logs.
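The misleading-success class can be illustrated with a small classifier that ranks the exit code above the text. This is a sketch under assumptions: the `classify_result` function is hypothetical, and the `N skipped` summary pattern is an illustrative format rather than the output of any particular test runner.

```python
import re

# Assumed summary format such as "10 passed, 2 skipped"; real runners differ.
SKIP_PATTERN = re.compile(r"\b(\d+)\s+skipped\b", re.IGNORECASE)


def classify_result(exit_code: int, output: str) -> str:
    """Return "fail", "suspect", or "pass" for one command run."""
    # A non-zero exit always wins, whatever the output claims.
    if exit_code != 0:
        return "fail"
    # A zero exit with skipped tests is not proof that the behavior was exercised.
    match = SKIP_PATTERN.search(output)
    if match and int(match.group(1)) > 0:
        return "suspect"
    return "pass"
```

For example, `classify_result(1, "All tests passed")` is a failure despite the text, and a zero exit that reports skipped tests is flagged for a closer look instead of being counted as green.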
Dynamic harness rules
- Generate temporary tests, scripts, fixtures, or harnesses when existing tests do not cover the behavior.
- Prefer project-native test tools and small throwaway harnesses.
- Record every generated artifact in the scenario matrix.
- Use bounded timeouts for commands that can hang.
- Validate exit codes and output semantics; do not trust success-looking text alone.
- Do not delete, rewrite, or mask unrelated user work.
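The timeout and exit-code rules above could be combined in a throwaway runner along these lines. It is a minimal sketch using Python's standard `subprocess` module; the `run_bounded` helper and `RunResult` type are hypothetical names, not part of $ultraqa.

```python
import subprocess
import sys
from typing import List, NamedTuple, Optional


class RunResult(NamedTuple):
    exit_code: Optional[int]  # None when the command timed out
    timed_out: bool
    output: str               # combined stdout and stderr (stdout only on timeout)


def run_bounded(cmd: List[str], timeout_s: float) -> RunResult:
    """Run a command with a hard time limit and report the real exit status."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the direct child on timeout; grandchildren
        # may survive and would need separate cleanup.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return RunResult(None, True, partial)
    return RunResult(proc.returncode, False, proc.stdout + proc.stderr)


if __name__ == "__main__":
    # Success-looking text with a non-zero exit must still count as a failure.
    result = run_bounded(
        [sys.executable, "-c", "import sys; print('SUCCESS'); sys.exit(3)"], 10
    )
    print(result.exit_code, result.timed_out)
```

Returning the timeout as data rather than raising keeps a hung command recordable in the scenario matrix alongside ordinary failures.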
Cycle flow
- Plan adversarial QA: restate the goal, success criteria, safety bounds, stop condition, runnable surfaces, and scenario matrix.
- Run baseline verification: tests, build, lint, typecheck, or custom command.
- Run dynamic e2e scenarios from the matrix.
- Diagnose failures with architecture-level root cause and safety impact.
- Apply precise fixes.
- Clean up temporary harnesses, state, logs, and processes unless intentionally kept.
- Rerun until the goal is met, 5 cycles are exhausted, the same failure repeats 3 times, or a safety boundary blocks progress.
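The stop conditions in the last step can be sketched as a bounded loop. This is an illustration only: `run_cycles` and the failure-signature strings are hypothetical, "repeats 3 times" is assumed to mean three consecutive cycles with the same signature, and the safety-boundary stop is left out because it depends on the target.

```python
from typing import Callable, Optional, Tuple

MAX_CYCLES = 5        # cycle budget stated in the flow above
MAX_SAME_FAILURE = 3  # identical failures tolerated before stopping


def run_cycles(attempt: Callable[[int], Optional[str]]) -> Tuple[str, int]:
    """Call attempt(cycle) until a stop condition is reached.

    attempt returns None when the goal is met, otherwise a short
    failure signature used to detect repeats.
    """
    last_failure: Optional[str] = None
    repeat_count = 0
    for cycle in range(1, MAX_CYCLES + 1):
        failure = attempt(cycle)
        if failure is None:
            return ("goal met", cycle)
        if failure == last_failure:
            repeat_count += 1
        else:
            last_failure, repeat_count = failure, 1
        if repeat_count >= MAX_SAME_FAILURE:
            return ("same failure repeated", cycle)
    return ("cycles exhausted", MAX_CYCLES)
```

With this shape, an attempt that keeps failing the same way stops at cycle 3, while one that fails differently each time runs out the full five-cycle budget.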
Completion report
A terminal $ultraqa report should include:
- Goal and success criteria
- Scenario matrix
- Commands run with exit codes and key output evidence
- Failures found and root causes
- Fixes applied and regression evidence
- Cleanup and rollback status
- Residual risks or blocked scenarios
- Evidence links, logs, screenshots, transcripts, or artifacts when relevant
Related skills
- $ralph: persistent verification loop that can wrap or follow $ultraqa
- $autopilot: full pipeline that uses QA as its validation phase
- $tdd: test-first development before implementation