SDBS

SSCCS Large-scale Document Build System

Author

Affiliation

SSCCS Foundation

Published

July 13, 2026

Abstract

A technical account of the document build infrastructure underlying the SSCCS documentation suite. The system manages a growing corpus of technical documents through a multi-layered pipeline combining static-site generation with custom build orchestration, automated structural validation, and generated navigational content. This report describes the architecture and the dual-parallel build model that together make build time a function of render-target count rather than document count.

The Scale Problem

Technical documentation exhibits a distinctive growth characteristic: the corpus expands monotonically with each research direction, specification, and design record. Over a multi-year horizon, a project of this scope accumulates thousands of documents across multiple natural languages, output formats, and audience levels.

Conventional static-site management breaks down at this scale through several well-understood failure modes:

Link rot: moved or renamed files leave dangling references
Include drift: shared templates change but individual pages are not updated
Format inconsistency: new pages omit required metadata or cross-format navigation controls
Build-time growth: sequential rebuilds scale linearly with document count
Cross-reference decay: references between documents become stale without centralized validation

The build system described here was designed to make these failure modes structurally impossible rather than depending on human discipline or manual review processes.

Architectural Principles

The build system rests on three architectural principles that together enable management at scale. The diagram below previews the two-level parallelism that underlies the build architecture; the sections that follow detail each principle in turn.

Figure 1: Two-level hierarchy: logical parallelism within each unit, physical parallelism across units, unified by recursive merge

Principle 1: Idempotent Pre-Build Validation

All structural checks (link validity, include presence, path correctness) are performed before the build step, in a dedicated validation pass. Each check is idempotent: running it repeatedly on an unchanged corpus produces the same result. A check mode reports issues without modifying files and exits with code 1 when any issue is found, so it can gate a CI/CD pipeline.
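A minimal sketch of what such a check mode could look like, assuming Markdown-style links and `.md` sources. The names `find_broken_links` and `check` and the link pattern are illustrative assumptions, not the actual SDBS implementation.

```python
import re
from pathlib import Path

# Hypothetical pattern: matches Markdown-style links such as [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)[^)]*\)")


def find_broken_links(root: Path) -> list[tuple[Path, str]]:
    """Return (file, target) pairs whose relative target does not exist.

    Read-only: nothing under root is modified, so repeated runs on an
    unchanged tree return the same list.
    """
    broken = []
    for doc in sorted(root.rglob("*.md")):
        for target in LINK_RE.findall(doc.read_text(encoding="utf-8")):
            if "://" in target or target.startswith("mailto:"):
                continue  # external links are out of scope for this sketch
            if not (doc.parent / target).exists():
                broken.append((doc, target))
    return broken


def check(root: Path) -> int:
    """Check mode: print each issue and return the process exit code."""
    issues = find_broken_links(root)
    for doc, target in issues:
        print(f"{doc}: broken link -> {target}")
    return 1 if issues else 0
```

A CI job could pass the return value of `check` to `sys.exit`, so that any reported issue fails the job.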

Principle 2: Target-Isolated Parallel Rendering

The document corpus is partitioned into independent render targets. Each target renders into a physically isolated copy of the entire source tree, eliminating cross-target contention for shared resources. Outputs are merged into a unified deployment directory after all targets complete.
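The isolation step can be sketched as follows. The `Renderer` callable is a hypothetical stand-in for the real renderer invocation, and the sketch assumes the output root lives outside the source tree.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

# Hypothetical renderer signature: (isolated source copy, target name, output dir).
Renderer = Callable[[Path, str, Path], None]


def render_isolated(source: Path, target: str, out_root: Path, render: Renderer) -> Path:
    """Render one target inside its own private copy of the source tree."""
    out_dir = out_root / target
    out_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as scratch:
        workdir = Path(scratch) / "src"
        shutil.copytree(source, workdir)  # physical isolation: no shared files
        render(workdir, target, out_dir)
    return out_dir


def render_all(source: Path, targets: list[str], out_root: Path,
               render: Renderer, workers: int = 4) -> list[Path]:
    """Render every target concurrently; results keep the order of `targets`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(render_isolated, source, t, out_root, render)
                   for t in targets]
        return [f.result() for f in futures]
```

Threads are sufficient here on the assumption that the real renderer runs as an external process. Scratch files written by one target never appear in the original tree or in another target's copy.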

The same partitioning extends naturally to physical distribution across runner nodes, with zero additional coordination. The merge function is commutative: merging in any order produces the same result, eliminating the need for distributed locks, consensus, or shared state. Partial outputs move through any transport layer – shared filesystem, object storage, or CI artifact system – requiring only that remote output be accessible as a local directory.
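To illustrate why a commutative merge needs no coordination, the sketch below models each partial output as a manifest mapping relative paths to content digests. This model and the `merge` function are simplifying assumptions for illustration. Here a path that appears with two different contents is treated as an error.

```python
from functools import reduce

# Hypothetical model: relative output path -> content digest.
Manifest = dict[str, str]


def merge(a: Manifest, b: Manifest) -> Manifest:
    """Union of two partial outputs; a shared path must carry identical content."""
    merged = dict(a)
    for path, digest in b.items():
        if merged.setdefault(path, digest) != digest:
            raise ValueError(f"conflicting content for {path}")
    return merged


runner_1 = {"en/index.html": "d1", "site_libs/app.js": "d9"}
runner_2 = {"ko/index.html": "d2", "site_libs/app.js": "d9"}
runner_3 = {"pdf/report.pdf": "d3"}

# Any arrival order yields the same deployment manifest.
forward = reduce(merge, [runner_1, runner_2, runner_3], {})
backward = reduce(merge, [runner_3, runner_2, runner_1], {})
```

Because the result is a plain set union, `forward == backward` holds, and merging a manifest with itself leaves it unchanged.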

The commutative merge makes the architecture resilient by construction:

Failure	Effect	Mitigation
Runner fails mid-build	Its targets missing	Rerun on another runner; merge is idempotent
Single target fails	Only that target absent	Rebuild only that target; merge accepts partial results
Network partition	Cannot upload output	Other runners continue; gap detected at merge time

The same build command and merge function support four deployment topologies by changing only the target assignment:

Topology	Description
Single runner	All targets render with logical parallelism; merge is a trivial no-op
CI runner pool	Each runner picks a target subset from a build queue, uploads artifacts; final job merges
Dedicated cluster	Fixed target groups per node; shared filesystem eliminates transport overhead
Hybrid cloud	Baseline on fixed infrastructure; burst spills to ephemeral cloud instances
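Since the topologies differ only in target assignment, that assignment can be sketched as a pure function. The static round-robin below is an assumed simplification; a CI runner pool as described above would instead pull targets from a queue. The function name and target names are hypothetical.

```python
def assign_targets(targets: list[str], runner_count: int) -> list[list[str]]:
    """Deterministically split targets across runners (round-robin over sorted names)."""
    if runner_count < 1:
        raise ValueError("runner_count must be at least 1")
    ordered = sorted(targets)
    return [ordered[i::runner_count] for i in range(runner_count)]


single = assign_targets(["pdf", "html", "llms"], 1)  # one runner gets every target
pool = assign_targets(["pdf", "html", "llms"], 2)    # two runners share the work
```

With one runner the function returns a single group containing every target, which corresponds to the single-runner topology.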

Figure 2: Target-isolated parallel rendering: each target copies the source tree, renders independently, then merges

Principle 3: Automatic Structural Healing

Common structural defects (broken relative paths, missing includes, incorrect file extensions in links) are detected and corrected automatically by a chain of resolvers. Each resolver addresses one class of defect independently and can be enabled or disabled without affecting the others.
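A resolver chain of this shape might be sketched as below. The two defect classes and all names are hypothetical examples chosen for illustration, not the resolvers SDBS actually ships.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Resolver:
    name: str
    fix: Callable[[str], str]  # takes document text, returns corrected text
    enabled: bool = True


def normalise_path_separators(text: str) -> str:
    """Hypothetical defect class: backslashes inside link targets."""
    return re.sub(r"\]\(([^)]*)\)",
                  lambda m: "](" + m.group(1).replace("\\", "/") + ")",
                  text)


def fix_link_extensions(text: str) -> str:
    """Hypothetical defect class: links to .md sources that should point at .html."""
    return re.sub(r"\]\(([^)\s]+)\.md\)", r"](\1.html)", text)


def run_chain(text: str, chain: list[Resolver]) -> str:
    """Apply each enabled resolver in order; disabled ones are skipped."""
    for resolver in chain:
        if resolver.enabled:
            text = resolver.fix(text)
    return text


chain = [
    Resolver("separators", normalise_path_separators),
    Resolver("extensions", fix_link_extensions),
]
healed = run_chain("[a](sub\\page.md)", chain)
```

Each resolver only touches its own defect class, so disabling one leaves the behaviour of the others unchanged, and running the chain on already-healed text returns it unmodified.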

Figure 3: Pre-build, render, and post-render pipeline stages

Each stage is idempotent: running it repeatedly on an unchanged corpus produces identical output. All fixes are validated before any writes occur, and a dedicated check mode returns exit code 1 on any unresolved issue.

The build system is renderer-agnostic. Quarto is the current engine, but the pipeline – target isolation, resolver chain, commutative merge – applies to any static-site generator that produces a directory tree. Replacing the renderer requires no changes outside the single build command invocation.

Resolver Chain Architecture

The structural validation layer is organised as a sequential chain of independent resolvers, each handling one class of defect:

Figure 4: Sequential resolver chain, each stage handling one defect class

Each resolver operates independently, with its own scope configuration inherited from a three-tier model:

Global exclusions: build artifacts, version-control directories
Project-level exclusions: files excluded from the build pipeline
Local exclusions: per-resolver patterns for special cases

This ensures that the project configuration remains the single source of truth for exclusions while allowing individual resolvers to tighten scope without modifying global settings.
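The three-tier model can be illustrated with a small sketch. The pattern lists and function names below are assumptions made for the example; the real configuration keys are not specified in this report.

```python
from fnmatch import fnmatchcase

GLOBAL_EXCLUDES = ["_site/*", ".git/*"]  # hypothetical: build artifacts, VCS directories
PROJECT_EXCLUDES = ["drafts/*"]          # hypothetical project-level exclusions


def effective_excludes(local: tuple[str, ...] = ()) -> list[str]:
    """A resolver inherits global and project patterns and may only add its own."""
    return [*GLOBAL_EXCLUDES, *PROJECT_EXCLUDES, *local]


def in_scope(path: str, local: tuple[str, ...] = ()) -> bool:
    """True if no exclusion pattern from any tier matches the relative path."""
    return not any(fnmatchcase(path, pattern) for pattern in effective_excludes(local))
```

Because local patterns are appended rather than substituted, an individual resolver can narrow its scope but cannot re-include a path that the global or project tier excludes.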

Scaling Properties

The resolver chain operates by scanning file content rather than parsing. Total scan cost grows linearly with aggregate file size, with small constant factors, and is independent of both the number of render targets and the complexity of the document graph. A full pass over the current corpus completes in under four seconds.

Build time is determined by the number of render targets, not the number of documents:

Figure 5: Build time is a function of target count, invariant to document count

Since each render target operates in a physically isolated copy of the source tree, targets can be distributed across any number of runners without architectural changes. A single merge step combines the independently generated output directories using content-level merging for shared assets (search indexes, site maps) and union semantics for everything else.
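A directory-level merge with these two behaviours might look like the following sketch. It assumes the shared asset is a JSON array named `search.json` and that non-shared files are either disjoint or identical across targets. Both are assumptions made for illustration.

```python
import json
import shutil
from pathlib import Path

# Assumption: shared assets that are JSON arrays and must be merged by content.
SHARED_JSON_LISTS = {"search.json"}


def merge_into(deploy: Path, partial: Path) -> None:
    """Merge one target's output directory into the deployment directory."""
    for src in sorted(p for p in partial.rglob("*") if p.is_file()):
        rel = src.relative_to(partial)
        dst = deploy / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        if rel.name in SHARED_JSON_LISTS and dst.exists():
            entries = (json.loads(dst.read_text(encoding="utf-8"))
                       + json.loads(src.read_text(encoding="utf-8")))
            # Deduplicate and sort so the result does not depend on merge order.
            unique = {json.dumps(entry, sort_keys=True) for entry in entries}
            merged = [json.loads(item) for item in sorted(unique)]
            dst.write_text(json.dumps(merged), encoding="utf-8")
        else:
            shutil.copy2(src, dst)  # union semantics for everything else
```

Sorting and deduplicating the combined index entries is what keeps the content-level merge order-independent and safe to repeat.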

Figure 6: Target isolation enables distribution without architectural change

Dynamic Content Generation

Figure 7: SDBS artifacts: from source documents to published outputs

Beyond structural validation, the platform generates navigational and cross-format content at render time based on each document’s metadata and the project configuration:

Cross-format links (PDF, machine-readable text) are generated automatically, with the link target resolved from project settings
Format variant grouping consolidates multiple alternatives under a single heading, avoiding redundant labels
Recent-document listings are generated from version-control history, ensuring the front page always reflects current activity without manual updates

Figure 8: Format link resolution follows a deterministic chain from project config to per-document override
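A resolution chain of this kind could be sketched as below, assuming a per-document value takes precedence over the project-level default. The `format-links` key, the `stem` field, and the template syntax are hypothetical names introduced for the example.

```python
from typing import Optional


def resolve_format_link(fmt: str, project_config: dict, doc_meta: dict) -> Optional[str]:
    """Resolve the link for one output format: document override, then project default."""
    override = doc_meta.get("format-links", {}).get(fmt)
    if override is not None:
        return override
    template = project_config.get("format-links", {}).get(fmt)
    if template is None:
        return None  # format not offered for this document
    return template.format(stem=doc_meta["stem"])


project = {"format-links": {"pdf": "/pdf/{stem}.pdf", "llms": "/_llms/{stem}.llms.md"}}
plain_doc = {"stem": "sdbs"}
custom_doc = {"stem": "sdbs", "format-links": {"pdf": "https://example.org/sdbs.pdf"}}
```

The same inputs always produce the same link, which is the sense in which the chain is deterministic.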

LLMs Standard Compliance

The build system generates machine-readable document summaries in the llms.txt format, producing both a global index and per-document .llms.md files. This pipeline runs as a post-render step, extracting content from the merged output and organising it into a hierarchical structure consumable by LLM tooling.

Figure 9: LLMs files are extracted from the build output and indexed for downstream consumption

The resulting _llms/ directory serves two purposes:

Agent knowledge base: a separate artifact uploaded for LLM agent consumption, providing a structured snapshot of all published documentation
Website extension: merged back into the deployment directory so each HTML page can include a link to its .llms.md counterpart
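The global index could be assembled roughly as follows. This sketch follows the general shape of the llms.txt proposal (a title, a blockquote summary, then sections of annotated links). The function name, the single `Docs` section, and the record fields are assumptions.

```python
def build_llms_index(title: str, summary: str, docs: list[dict]) -> str:
    """Build a global index that links to each per-document .llms.md file."""
    lines = [f"# {title}", "", f"> {summary}", "", "## Docs", ""]
    for doc in sorted(docs, key=lambda d: d["path"]):  # stable, order-independent output
        lines.append(f"- [{doc['title']}]({doc['path']}.llms.md): {doc['description']}")
    return "\n".join(lines) + "\n"


index = build_llms_index(
    "SSCCS Documentation",
    "Structured snapshot of published documents.",
    [
        {"title": "SDBS", "path": "docs/sdbs", "description": "Document build system"},
        {"title": "Charter", "path": "docs/charter", "description": "Foundational charter"},
    ],
)
```

Sorting by path keeps the index byte-identical across runs on unchanged input, in keeping with the idempotence of the other pipeline stages.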

This integration means the build system can serve as the data-ingestion layer for any downstream knowledge-graph (KG) or agentic system: the same pipeline that validates and renders documents also produces the structured corpus that feeds semantic applications.

CI/CD Integration

The full pipeline is containerised and executed in a continuous-deployment workflow:

Figure 10: CI/CD pipeline: validation, build, post-process, deploy

Containerisation ensures environment parity between local development and CI. All pipeline stages are declared in a single configuration file, making the workflow self-documenting and reproducible.

Comparison with Conventional Approaches

Concern	Conventional approach	This build system
Link validation	Manual or external CI checker	Pre-build automated scan, exits 1 on any defect
Include management	Manual copy-paste	Automatic insertion of missing includes
Format navigation	Hand-written links per document	Auto-generated from configuration
Build scaling	Sequential per document	Target-parallel, O(targets) not O(documents)
Cross-reference correctness	Manual upkeep	Automatic extension correction
Structural robustness	Relies on author discipline	Automated healing of common defects
Agent/LLM feed	Separate manual pipeline	Integrated extraction at build time
CI quality gate	Sparse or absent	Comprehensive check, exit code 1

Conclusion

The build system demonstrates that static-site management at scale is achievable through a layered automation architecture combining structural validation with dual-parallel build execution. The key insight is not any individual tool or technique but the composition of idempotent passes that collectively guarantee integrity regardless of corpus size.

The architecture is renderer-agnostic, containerised, and configuration-driven. These properties make it a candidate for extraction into a general-purpose open-source tool, applicable to any static-site corpus that has outgrown manual management. A standalone release would package the resolver chain, the dual-parallel build model, and the LLMs pipeline into a single command that can be dropped into any documentation repository.

The system exhibits:

Document-count independence in build time
Target-count linearity in resource requirements
Physical partitionability for distributed execution
Self-healing for common structural defects
Configuration-driven behaviour with a single source of truth
CI-native integration with pre-build validation and post-render processing

Whitepaper: PDF / HTML DOI: 10.5281/zenodo.18759106 via CERN/Zenodo, indexed by OpenAIRE. Licensed under CC BY-NC-ND 4.0.
Official repository: GitHub. Authenticated via GPG: BCCB196BADF50C99. Licensed under Apache 2.0.
Governed by the Foundational Charter and Statute of the SSCCS Foundation (in formation).
Provenance: Human-in-Command, AI-assisted. Aligns with ISO/IEC JTC 1/SC 42 and C2PA-certified. Full intellectual responsibility with author(s).