Real-World Evaluation
Prince of Space is validated against two large, well-known Java codebases to catch formatting and idempotency regressions that synthetic tests miss.
Target codebases
| Project | Size | Why |
|---|---|---|
| Guava (google/guava) | ~3,200 .java files | Heavy generics, functional patterns, complex enums. Native line limit is 100 chars (google-java-format), so our wrapping engine is stressed heavily. |
| Spring Framework (spring-projects/spring-framework) | ~9,200 .java files | Dense annotation stacks (@Component, @Bean, @Conditional), extensive Javadoc, varied interface hierarchies. |
Latest results
Evaluated on 2026-04-15 with lineLength=120, wrapStyle=BALANCED (w120-balanced).
| Project | Files | Parse errors | Idempotency failures | Over-long lines | Time |
|---|---|---|---|---|---|
| Guava @ ce39d2b | 3,221 | 0 | 0 | 59 | 415s |
| Spring Framework @ 1787d3e | 9,198 | 0 | 0 | 235 | 161s |
Zero parse errors and zero idempotency failures across 12,419 files.
Over-long lines are informational warnings. They occur in constructs that have no safe wrap point (very long string literals, generated data files, deeply nested generic signatures).
The full report, eval-results/2026-04-17.md, lists per-file details and aggregates runs across line lengths and wrap styles, confirming zero parse errors and zero idempotency failures for each configuration shown there.
Running the eval harness
1. Clone targets (one-time, outside the repo)
git clone --depth=1 https://github.com/google/guava /tmp/eval/guava
git clone --depth=1 https://github.com/spring-projects/spring-framework /tmp/eval/spring-framework
2. Run
Set one line length and wrap style per invocation (CI sets these from the workflow matrix):
export PRINCE_EVAL_ROOTS=/tmp/eval/guava,/tmp/eval/spring-framework
export PRINCE_EVAL_LINE_LENGTH=120
export PRINCE_EVAL_WRAP_STYLE=BALANCED
export PRINCE_EVAL_REPORT_DIR="$(pwd)/docs/eval-results"
./gradlew :core:evalTest
PRINCE_EVAL_WRAP_STYLE is case-insensitive (BALANCED, balanced, etc.).
The test is skipped when PRINCE_EVAL_ROOTS is unset. It scans .java files while skipping common build and generated paths (build/, .gradle/, .git/, generated/, generated-sources/).
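The scan described above can be sketched as a file-tree walk that prunes the listed directories. This is a minimal illustration, not the harness's actual code: the class name `EvalFileScanner` and its method are hypothetical, and matching directories by exact name is an assumption.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the real harness may walk the tree differently.
final class EvalFileScanner {
    // Directory names taken from the skip list in the text above.
    private static final Set<String> SKIPPED_DIRS =
            Set.of("build", ".gradle", ".git", "generated", "generated-sources");

    /** Returns all .java files under root, sorted, ignoring skipped directories. */
    static List<Path> findJavaFiles(Path root) throws IOException {
        List<Path> found = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                Path name = dir.getFileName();
                // Never prune the root itself, even if it happens to be named "build".
                if (!dir.equals(root) && name != null && SKIPPED_DIRS.contains(name.toString())) {
                    return FileVisitResult.SKIP_SUBTREE;
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (file.getFileName().toString().endsWith(".java")) {
                    found.add(file);
                }
                return FileVisitResult.CONTINUE;
            }
        });
        Collections.sort(found);
        return found;
    }
}
```

Pruning with `SKIP_SUBTREE` avoids descending into large build or VCS directories at all, which matters on corpora of this size.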
Setting the optional PRINCE_EVAL_REPORT_SLUG (ASCII alphanumerics, -, _, max 64 chars) writes the report to
docs/eval-results/<date>-<slug>.md instead of <date>.md, so parallel corpus runs keep
separate files (this is what CI uses).
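As a rough sketch of the naming rule, the slug constraint and the resulting file name could be expressed as follows. The class `EvalReportName` is hypothetical, and both the one-character minimum and the choice to reject (rather than sanitize) an invalid slug are assumptions; the text only states the allowed characters and the 64-character maximum.

```java
import java.time.LocalDate;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the report naming rule; not the harness's actual code.
final class EvalReportName {
    // ASCII alphanumerics, '-' and '_', at most 64 characters (minimum of 1 is assumed).
    private static final Pattern SLUG = Pattern.compile("[A-Za-z0-9_-]{1,64}");

    static boolean isValidSlug(String slug) {
        return slug != null && SLUG.matcher(slug).matches();
    }

    /** Builds "<date>.md" when no slug is given, otherwise "<date>-<slug>.md". */
    static String reportFileName(LocalDate date, String slug) {
        if (slug == null || slug.isEmpty()) {
            return date + ".md"; // LocalDate prints as ISO yyyy-MM-dd
        }
        if (!isValidSlug(slug)) {
            // Assumed behavior: reject rather than silently rewrite the slug.
            throw new IllegalArgumentException("Invalid report slug: " + slug);
        }
        return date + "-" + slug + ".md";
    }
}
```

Restricting the slug to this character set keeps it safe to embed in a file name, since it cannot contain path separators or `..` segments.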
3. Review
cat docs/eval-results/$(date +%F).md
# or, when using PRINCE_EVAL_REPORT_SLUG=guava:
# cat docs/eval-results/$(date +%F)-guava.md
Without a slug, a re-run on the same day overwrites that day's report. With a slug, a same-day re-run overwrites only the report for that slug.
Checked-in corpus (always on)
ExamplesCorpusFormatTest walks examples/outputs/** and examples/inputs/** on every ./gradlew :core:test run and asserts every .java file satisfies format(format(x)) == format(x).
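The idempotency property format(format(x)) == format(x) can be illustrated with a small generic check. The sketch below does not use Prince of Space's API: `IdempotencyCheck` is a hypothetical class, and the formatter is passed in as a plain function, with a trailing-whitespace stripper standing in for a real formatter.

```java
import java.util.function.UnaryOperator;

// Hypothetical illustration of the idempotency assertion; not the project's test code.
final class IdempotencyCheck {
    /** True when formatting the already-formatted output changes nothing. */
    static boolean isIdempotent(UnaryOperator<String> format, String source) {
        String once = format.apply(source);
        String twice = format.apply(once);
        return twice.equals(once);
    }

    // Stand-in formatter: removes trailing spaces and tabs from every line.
    static String stripTrailingWhitespace(String source) {
        return source.replaceAll("(?m)[ \\t]+$", "");
    }
}
```

For example, `IdempotencyCheck.isIdempotent(IdempotencyCheck::stripTrailingWhitespace, "int x = 1;   \n")` holds, while a function that appends a newline on every call (`s -> s + "\n"`) fails the check because its second pass changes the output again.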
Failure policy
| Condition | Test outcome |
|---|---|
| Any parse error | Fails — all paths and messages printed |
| Any idempotency failure | Fails — all paths printed |
| Over-long non-comment lines | Warning only — printed to stdout; test passes |
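The policy in the table above amounts to a small decision function. The following is a sketch only: `EvalOutcome`, its `Verdict` values, and the use of string lists for the findings are assumptions made for illustration, not the harness's real types.

```java
import java.util.List;

// Hypothetical model of the failure policy; names are illustrative only.
final class EvalOutcome {
    enum Verdict { PASS, PASS_WITH_WARNINGS, FAIL }

    static Verdict evaluate(
            List<String> parseErrors,
            List<String> idempotencyFailures,
            List<String> overLongLines) {
        // Any parse error or idempotency failure fails the test.
        if (!parseErrors.isEmpty() || !idempotencyFailures.isEmpty()) {
            return Verdict.FAIL;
        }
        // Over-long lines are reported but never fail the test.
        return overLongLines.isEmpty() ? Verdict.PASS : Verdict.PASS_WITH_WARNINGS;
    }
}
```

Under this model, the results reported above (zero parse errors, zero idempotency failures, some over-long lines) map to `PASS_WITH_WARNINGS`.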
Release gating
The eval is mandatory for releases. The release workflow runs external-eval
as a matrix (Spring Framework and Guava × three lineLength values × three wrap styles);
the publish job declares needs: external-eval, so a
failed leg blocks publishing even on dry runs. See
RELEASING — External eval gate
on GitHub for recovery steps.
A lighter external-eval-smoke matrix (lineLength=120, wrapStyle=BALANCED only) also runs on every
push and pull request for fast feedback; the full matrix runs weekly and
on-demand via workflow_dispatch. See .github/workflows/external-eval.yml.
Config permutations
The full eval runs one job per combination of lineLength ∈ {80, 100, 120} and
wrapStyle ∈ {WIDE, BALANCED, NARROW}. Reports label each run as w<lineLength>-<wrapStyle> (e.g. w120-balanced).
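To make the matrix concrete, the nine combinations and their labels can be enumerated as below. `EvalMatrix` and its members are hypothetical names for this sketch; only the value sets and the `w<lineLength>-<wrapStyle>` label format come from the text. The case-insensitive parser mirrors the behavior described for PRINCE_EVAL_WRAP_STYLE, but how the harness actually parses it is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical enumeration of the eval matrix; not taken from the project's sources.
final class EvalMatrix {
    enum WrapStyle { WIDE, BALANCED, NARROW }

    static final int[] LINE_LENGTHS = {80, 100, 120};

    /** Builds a label such as "w120-balanced". */
    static String label(int lineLength, WrapStyle style) {
        return "w" + lineLength + "-" + style.name().toLowerCase(Locale.ROOT);
    }

    /** Accepts "BALANCED", "balanced", "Balanced", and so on. */
    static WrapStyle parseWrapStyle(String raw) {
        return WrapStyle.valueOf(raw.toUpperCase(Locale.ROOT));
    }

    /** One label per (lineLength, wrapStyle) combination, nine in total. */
    static List<String> allLabels() {
        List<String> labels = new ArrayList<>();
        for (int lineLength : LINE_LENGTHS) {
            for (WrapStyle style : WrapStyle.values()) {
                labels.add(label(lineLength, style));
            }
        }
        return labels;
    }
}
```

Using `Locale.ROOT` for the case conversions keeps the labels and parsing stable regardless of the machine's default locale.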