$ultraqa

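A dynamic e2e QA workflow that builds adversarial scenarios, fixes failures, and reports cleanup evidence.

$ultraqa was strengthened into an adversarial QA workflow in v0.17. It still runs the usual verification commands, but it no longer declares completion just because build/lint/typecheck/test are green. If the target behavior can be safely run, simulated, or put under a harness, it must also perform dynamic end-to-end scenarios, hostile user modeling, cleanup checks, and a structured evidence report.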

When to use it

When you need stronger evidence than “the tests passed”
When the feature touches the CLI, workflow state, an MCP tool, an agent, a prompt, setup, a hook, or a user-facing flow
When you want to diagnose failures, fix them precisely, and rerun until the goal is met or a bounded stop condition is reached
When you want to catch stale state, prompt injection, cancel/resume, misleading success, and dirty worktree regressions before review

Trigger keywords: ultraqa, fix until tests pass, qa cycle, make the build pass.

Invocation

codex
> $ultraqa --tests

codex
> $ultraqa --build

codex
> $ultraqa --custom "the CLI rejects stale session state"

Available goal flags: --tests, --build, --lint, --typecheck, --custom "pattern", --interactive.

v0.17 contract

Before declaring success, $ultraqa builds and maintains a scenario matrix with the following columns.

Column	Meaning
Scenario ID	A stable ID such as `ADV-E2E-003`
Intent	Which risk or behavior the scenario proves
User/attacker model	normal user, careless operator, malicious prompt, stale runtime, flaky environment
Setup	Required fixture, state, service, branch, harness
Command/harness	The exact command, script, browser step, or generated harness
Expected signal	The exit code, output, UI state, artifact, or state transition that proves success
Actual result	Observed output and exit status
Fixes applied	Linked fix, or `none`
Evidence	log, test output, screenshot, artifact, transcript excerpt
Cleanup	Removed, intentionally retained, or blocked with a stated reason
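
One way to keep a matrix row machine-checkable is a small record type. The sketch below is illustrative only: `ScenarioRow`, its field names, and `is_complete` are hypothetical and not part of $ultraqa; the fields simply mirror the columns above, and the example values are placeholders.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ScenarioRow:
    """Hypothetical record mirroring the scenario matrix columns."""
    scenario_id: str
    intent: str
    actor_model: str
    setup: str
    command: str
    expected_signal: str
    actual_result: str = ""
    fixes_applied: str = "none"
    evidence: list[str] = field(default_factory=list)
    cleanup: str = ""

    def is_complete(self) -> bool:
        """A row counts as complete only when every column is filled in."""
        return all(bool(getattr(self, f.name)) for f in fields(self))


row = ScenarioRow(
    scenario_id="ADV-E2E-003",
    intent="CLI rejects stale session state",
    actor_model="stale runtime",
    setup="fixture with an outdated state file",
    command="(exact command goes here)",
    expected_signal="non-zero exit and a clear error message",
)
# Actual result, evidence, and cleanup are still empty, so the row is incomplete.
print(row.is_complete())
```

Treating an empty column as "not done" makes it harder to declare success while the actual result, evidence, or cleanup status is still missing.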

Required scenario classes

Alongside the happy path, include the adversarial classes that are relevant and safe.

Malformed input: invalid JSON, missing field, bad flag, oversized string, unusual Unicode, path traversal-like value, corrupted state.
Repeated interruption: repeated continue, stop/cancel/abort phrasing, cut-off command output, retry after partial progress.
Prompt injection: attempts at instruction override, verification skip, secret exfiltration, state deletion, or false success claims.
Cancel/resume: active-state cleanup, resume detection, stale in-progress state, fresh run after cancellation.
Stale state: old .omx/state files, mismatched session, missing timestamp, contradictory phase metadata.
Dirty worktree: evidence that existing modifications, untracked generated files, and unrelated work were not overwritten or hidden.
Hung command: explicit timeout, killed child process, recovery note.
Flaky test: rerun strategy, failure clustering, guarding against a false green from a single lucky pass.
Misleading success output: output that looks successful despite a non-zero exit, skipped test, hidden failure, or truncated log.
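
As an illustration of the flaky-test class, a rerun strategy can refuse to treat a single pass as green. This is a minimal sketch under assumptions: `classify_reruns` and the run count of 5 are hypothetical choices, not $ultraqa behavior, and `sometimes_fails` is a deterministic stand-in for a real flaky check.

```python
from typing import Callable


def classify_reruns(check: Callable[[], bool], runs: int = 5) -> str:
    """Run `check` several times and classify the outcome.

    Returns "pass" only if every run passed, "fail" if every run failed,
    and "flaky" for any mix, so a single lucky pass never counts as green.
    """
    results = [check() for _ in range(runs)]
    if all(results):
        return "pass"
    if not any(results):
        return "fail"
    return "flaky"


# Deterministic stand-in for a flaky check: fails on every third call.
calls = {"n": 0}


def sometimes_fails() -> bool:
    calls["n"] += 1
    return calls["n"] % 3 != 0


print(classify_reruns(sometimes_fails))  # flaky
```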

Dynamic harness rules

If existing tests do not cover the behavior, create a temporary test, script, fixture, or harness.
Prefer project-native test tools and small throwaway harnesses.
Record every generated artifact in the scenario matrix.
Put a bounded timeout on any command that could hang.
Verify exit code and output semantics together; do not trust success-looking text alone.
Do not delete, rewrite, or hide unrelated user work.
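
The timeout and exit-code rules can be sketched as a small throwaway harness. This is an assumption-laden example, not $ultraqa's implementation: `run_checked`, its parameters, and the result dictionary are hypothetical, and the two sample commands are synthetic stand-ins for a misleading and a hung command.

```python
import subprocess
import sys


def run_checked(cmd: list[str], timeout_s: float, must_contain: str) -> dict:
    """Run `cmd` with a hard timeout and judge exit code and output together."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and waits for the direct child before raising.
        return {"ok": False, "exit_code": None, "reason": "timeout"}
    if proc.returncode != 0:
        return {"ok": False, "exit_code": proc.returncode, "reason": "non-zero exit"}
    if must_contain not in proc.stdout:
        return {"ok": False, "exit_code": 0, "reason": "expected signal missing"}
    return {"ok": True, "exit_code": 0, "reason": "exit code and output agree"}


# A command that prints a success-looking message but exits with status 3.
misleading = [sys.executable, "-c", "print('All tests passed'); raise SystemExit(3)"]
print(run_checked(misleading, timeout_s=10, must_contain="All tests passed"))

# A command that would hang for 30 seconds without the timeout.
hung = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_checked(hung, timeout_s=0.5, must_contain=""))
```

Checking the exit code before the output text means a "passed" banner cannot mask a failing status, and the timeout result gives a concrete recovery note for the matrix.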

Cycle flow

Plan adversarial QA: define the goal, success criteria, safety bounds, stop condition, runnable surface, and scenario matrix.
Run baseline verification: tests, build, lint, typecheck, custom command.
Run the dynamic e2e scenarios in the matrix.
Diagnose failures in terms of architecture-level root cause and safety impact.
Apply a precise fix.
Clean up temporary harnesses, state, logs, and processes unless they are intentionally retained.
Rerun until one of the following holds: the goal is met, 5 cycles are exhausted, the same failure repeats 3 times, or a safety boundary blocks further progress.
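
The stop conditions of the cycle can be sketched as a bounded loop. This is a simplified illustration under assumptions: `run_cycles` and the `run_cycle` callback are hypothetical, "same failure 3 times" is interpreted here as three consecutive identical failure signatures, and the safety-boundary stop is not modeled.

```python
from typing import Callable, Optional

MAX_CYCLES = 5        # cycle budget from the flow above
MAX_SAME_FAILURE = 3  # assumed to mean three consecutive identical failures


def run_cycles(run_cycle: Callable[[int], Optional[str]]) -> str:
    """Call `run_cycle(n)` until a stop condition is met.

    `run_cycle` is a hypothetical callback that returns None when the goal
    is met, or a short failure signature string otherwise.
    """
    last_failure = None
    repeat_count = 0
    for cycle in range(1, MAX_CYCLES + 1):
        failure = run_cycle(cycle)
        if failure is None:
            return f"goal met in cycle {cycle}"
        repeat_count = repeat_count + 1 if failure == last_failure else 1
        last_failure = failure
        if repeat_count >= MAX_SAME_FAILURE:
            return f"stopped: same failure {MAX_SAME_FAILURE} times ({failure})"
    return f"stopped: {MAX_CYCLES} cycles exhausted"


print(run_cycles(lambda n: None if n == 2 else "lint error"))  # goal met in cycle 2
print(run_cycles(lambda n: "timeout in e2e"))  # stopped: same failure 3 times (timeout in e2e)
print(run_cycles(lambda n: f"failure {n}"))  # stopped: 5 cycles exhausted
```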

Completion report

The final $ultraqa report must include the following.

Goal and success criteria
Scenario matrix
List of commands with exit codes and key evidence
Failures found and their root causes
Fixes applied and regression evidence
Cleanup and rollback status
Residual risk or blocked scenarios
Evidence links, logs, screenshots, transcripts, and artifacts where needed