Real-World Evaluation

Prince of Space is validated against two large, well-known Java codebases to catch formatting and idempotency regressions that synthetic tests miss.

Target codebases

Project	Size	Why
Guava (`google/guava`)	~3,200 `.java` files	Heavy generics, functional patterns, complex enums. Native line limit is 100 chars (google-java-format), so our wrapping engine is stressed heavily.
Spring Framework (`spring-projects/spring-framework`)	~9,200 `.java` files	Dense annotation stacks (`@Component`, `@Bean`, `@Conditional`), extensive Javadoc, varied interface hierarchies.

Latest results

Evaluated on 2026-04-15 with lineLength=120, wrapStyle=BALANCED (w120-balanced).

Project	Files	Parse errors	Idempotency failures	Over-long lines	Time
Guava @ `ce39d2b`	3,221	0	0	59	415s
Spring Framework @ `1787d3e`	9,198	0	0	235	161s

Zero parse errors and zero idempotency failures across 12,419 files.

Over-long lines are informational warnings — they occur in constructs that have no safe wrap point (very long string literals, generated data files, deeply nested generic signatures). See the full report eval-results/2026-04-17.md for per-file details.

eval-results/2026-04-17.md aggregates runs across line lengths and wrap styles, confirming zero parse errors and zero idempotency failures for each configuration shown there.
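As a rough illustration of what the over-long-lines column measures, the sketch below counts the lines in one file that exceed a given width. The name `count_overlong` is hypothetical and not part of the harness. It also counts every line, whereas the harness reports only non-comment lines, and `awk`'s `length` may count bytes rather than characters depending on the implementation and locale.

```shell
# Hypothetical helper (not part of the eval harness): print how many lines
# in FILE are longer than MAX.
count_overlong() {
  max=$1
  file=$2
  # n + 0 forces a numeric "0" when no line matched.
  awk -v max="$max" 'length($0) > max { n++ } END { print n + 0 }' "$file"
}
```

For example, `count_overlong 120 path/to/Some.java` prints the number of lines wider than 120 in that file.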

Running the eval harness

1. Clone targets (one-time, outside the repo)

git clone --depth=1 https://github.com/google/guava /tmp/eval/guava
git clone --depth=1 https://github.com/spring-projects/spring-framework /tmp/eval/spring-framework

2. Run

Set one line length and wrap style per invocation (CI sets these from the workflow matrix):

export PRINCE_EVAL_ROOTS=/tmp/eval/guava,/tmp/eval/spring-framework
export PRINCE_EVAL_LINE_LENGTH=120
export PRINCE_EVAL_WRAP_STYLE=BALANCED
export PRINCE_EVAL_REPORT_DIR="$(pwd)/docs/eval-results"
./gradlew :core:evalTest

PRINCE_EVAL_WRAP_STYLE is case-insensitive (BALANCED, balanced, etc.).

The test is skipped when PRINCE_EVAL_ROOTS is unset. It scans .java files while skipping common build and generated paths (build/, .gradle/, .git/, generated/, generated-sources/).
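To preview roughly which files such a scan would pick up, the skip rules can be approximated with `find` and `-prune`. This is only a sketch: `list_java_sources` is a hypothetical name, and matching the skipped paths by directory name at any depth is an assumption about how the harness applies them.

```shell
# Hypothetical helper (not the harness's actual scanner): list .java files
# under ROOT, pruning any directory whose name is one of the skipped paths.
list_java_sources() {
  find "$1" \
    \( -type d \( -name build -o -name .gradle -o -name .git \
       -o -name generated -o -name generated-sources \) -prune \) \
    -o \( -type f -name '*.java' -print \)
}
```

Piping the output through `wc -l` (for example `list_java_sources /tmp/eval/guava | wc -l`) gives a file count to compare with the Files column of a report.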

Setting the optional PRINCE_EVAL_REPORT_SLUG (ASCII alphanumerics, -, _; max 64 chars) writes docs/eval-results/<date>-<slug>.md instead of <date>.md, so parallel corpus runs keep separate files (this is what CI uses).
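The naming rule can be sketched as a small shell function. `report_name` is a hypothetical helper, not something the harness exposes, and rejecting an invalid slug with an error is an assumption, since the behaviour for invalid slugs is not described here.

```shell
# Hypothetical helper mirroring the documented slug rules.
# Usage: report_name DATE [SLUG]
report_name() {
  date_part=$1
  slug=${2:-}
  if [ -z "$slug" ]; then
    # No slug: plain <date>.md
    printf '%s.md\n' "$date_part"
    return 0
  fi
  # Only ASCII alphanumerics, '-' and '_' are allowed.
  # (Run under LC_ALL=C if the locale's ranges might include other letters.)
  case $slug in
    *[!A-Za-z0-9_-]*)
      echo "invalid slug: $slug" >&2
      return 1
      ;;
  esac
  # At most 64 characters.
  if [ "${#slug}" -gt 64 ]; then
    echo "slug too long: ${#slug} characters" >&2
    return 1
  fi
  printf '%s-%s.md\n' "$date_part" "$slug"
}
```

With these rules, `report_name "$(date +%F)" guava` yields the `<date>-guava.md` name used when `PRINCE_EVAL_REPORT_SLUG=guava`.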

3. Review

cat docs/eval-results/$(date +%F).md
# or, when using PRINCE_EVAL_REPORT_SLUG=guava:
# cat docs/eval-results/$(date +%F)-guava.md

Re-running on the same day overwrites the existing report: without a slug, <date>.md is replaced; with a slug, only the report for that slug is replaced.

Checked-in corpus (always on)

ExamplesCorpusFormatTest walks examples/outputs/** and examples/inputs/** on every ./gradlew :core:test run and asserts every .java file satisfies format(format(x)) == format(x).
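The idempotency property can be illustrated outside the test suite with any formatter that reads from stdin and writes to stdout. The sketch below is not how ExamplesCorpusFormatTest is implemented: `is_idempotent` and the two stand-in formatters are hypothetical and exist only to show the `format(format(x)) == format(x)` comparison.

```shell
# Hypothetical check: FORMATTER is the name of a command or shell function
# that reads source on stdin and writes formatted source to stdout.
is_idempotent() {
  formatter=$1
  file=$2
  once=$("$formatter" < "$file") || return 2
  twice=$(printf '%s\n' "$once" | "$formatter") || return 2
  # Command substitution drops trailing newlines, so this sketch ignores
  # differences that consist only of trailing blank lines.
  [ "$once" = "$twice" ]
}

# Stand-in formatters for demonstration only.
squeeze_spaces() { tr -s ' '; }        # idempotent: a second pass changes nothing
append_semicolon() { sed 's/$/;/'; }   # not idempotent: every pass adds a ';'
```

Here `is_idempotent squeeze_spaces File.java` succeeds, while `is_idempotent append_semicolon File.java` returns a non-zero status because the second pass changes the output again.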

Failure policy

Condition	Test outcome
Any parse error	Fails — all paths and messages printed
Any idempotency failure	Fails — all paths printed
Over-long non-comment lines	Warning only — printed to stdout; test passes
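The policy in the table reduces to a simple decision over three counters. The following sketch uses a hypothetical function name and output wording, and is meant only to make the pass, warn, and fail cases concrete.

```shell
# Hypothetical summary function: maps the three counters from a report row
# to the documented outcome. Output wording is illustrative only.
eval_outcome() {
  parse_errors=$1
  idempotency_failures=$2
  overlong_lines=$3
  # Parse errors and idempotency failures both fail the run.
  if [ "$parse_errors" -gt 0 ] || [ "$idempotency_failures" -gt 0 ]; then
    echo "FAIL"
    return 1
  fi
  # Over-long lines only produce a warning; the run still passes.
  if [ "$overlong_lines" -gt 0 ]; then
    echo "PASS (warning: $overlong_lines over-long lines)"
  else
    echo "PASS"
  fi
}
```

Applied to the Guava row above, `eval_outcome 0 0 59` prints `PASS (warning: 59 over-long lines)` and returns success.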

Release gating

The eval is mandatory for releases. The release workflow runs external-eval as a matrix (Spring Framework and Guava × three lineLength values × three wrap styles); the publish job declares needs: external-eval, so a failed leg blocks publishing even on dry runs. See RELEASING — External eval gate on GitHub for recovery steps.

A lighter external-eval-smoke matrix (lineLength=120, wrapStyle=BALANCED only) also runs on every push and pull request for fast feedback; the full matrix runs weekly and on demand via workflow_dispatch. See .github/workflows/external-eval.yml.

Config permutations

The full eval runs one job per combination of lineLength ∈ {80, 100, 120} and wrapStyle ∈ {WIDE, BALANCED, NARROW}. Reports label each run as w<lineLength>-<wrapStyle> (e.g. w120-balanced).
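The nine combinations and their labels can be enumerated with two nested loops. This is a sketch with a hypothetical function name; lowercasing the wrap style in the label is inferred from the `w120-balanced` example rather than taken from the harness source.

```shell
# Prints one label per configuration in the full matrix (3 x 3 = 9 lines).
matrix_labels() {
  for len in 80 100 120; do
    for style in WIDE BALANCED NARROW; do
      # Lowercase the style to match labels such as w120-balanced.
      lower=$(printf '%s' "$style" | tr '[:upper:]' '[:lower:]')
      printf 'w%s-%s\n' "$len" "$lower"
    done
  done
}
```

To reproduce the full matrix locally, the same two loops can set `PRINCE_EVAL_LINE_LENGTH` and `PRINCE_EVAL_WRAP_STYLE` for each pair before invoking `./gradlew :core:evalTest`.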