SDBS

SSCCS Large-scale Document Build System

Author
Affiliation

SSCCS Foundation

Published

May 30, 2026

Abstract

A technical account of the document build infrastructure underlying the SSCCS documentation suite. The system manages a growing corpus of technical documents through a multi-layered pipeline combining static-site generation with custom build orchestration, automated structural validation, and automated content generation. This report describes the architecture and the dual-parallel build model that together make build time depend on the number of render targets rather than on the number of documents.

The Scale Problem

Technical documentation has a distinctive growth pattern: the corpus expands monotonically with each research direction, specification, and design record. Over a multi-year horizon, a project of this scope accumulates thousands of documents across multiple natural languages, output formats, and audience levels.

Conventional static-site management breaks down at this scale through several well-understood failure modes:

  • Link rot: moved or renamed files leave dangling references
  • Include drift: shared templates change but individual pages are not updated
  • Format inconsistency: new pages omit required metadata or cross-format navigation controls
  • Build-time growth: sequential rebuilds scale linearly with document count
  • Cross-reference decay: references between documents become stale without centralized validation

The build system described here was designed to make these failure modes structurally impossible rather than depending on human discipline or manual review processes.

Architectural Principles

The build system rests on three architectural principles that together enable management at scale. The diagram below previews the two-level parallelism that underlies the build architecture; the sections that follow detail each principle in turn.

Figure 1: Two-level hierarchy: logical parallelism within each unit, physical parallelism across units, unified by recursive merge

Principle 1: Idempotent Pre-Build Validation

All structural checks (link validity, include presence, path correctness) are performed before the build step, in a dedicated validation pass. Each check is idempotent: running it repeatedly on an unchanged corpus produces the same result. A check mode reports issues without modifying files and exits with code 1 when any are found, so it can gate a CI/CD pipeline.
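
A minimal sketch of such a check mode, assuming Markdown-style relative links in .qmd sources. The names find_broken_links and check, and the link pattern itself, are illustrative assumptions rather than the system's actual code:

```python
import re
from pathlib import Path

# Assumed link syntax: [text](relative/path.qmd), optionally with a #fragment.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)(?:#[^)]*)?\)")


def find_broken_links(root: Path) -> list[tuple[str, str]]:
    """Return sorted (source file, link target) pairs whose target is missing.

    The scan is read-only, so repeated runs on an unchanged corpus
    return the same list.
    """
    broken = []
    for doc in sorted(root.rglob("*.qmd")):
        for target in LINK_RE.findall(doc.read_text(encoding="utf-8")):
            if "://" in target or target.startswith("mailto:"):
                continue  # external links are out of scope for this sketch
            if not (doc.parent / target).exists():
                broken.append((doc.relative_to(root).as_posix(), target))
    return sorted(broken)


def check(root: Path) -> int:
    """Check mode: print each issue and return the process exit code."""
    broken = find_broken_links(root)
    for source, target in broken:
        print(f"{source}: broken link -> {target}")
    return 1 if broken else 0
```

A CI job would pass the return value of check to sys.exit, so that any reported issue fails the pipeline.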

Principle 2: Target-Isolated Parallel Rendering

The document corpus is partitioned into independent render targets. Each target renders into a physically isolated copy of the entire source tree, eliminating cross-target contention for shared resources. Outputs are merged into a unified deployment directory after all targets complete.
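
The copy, render, and merge sequence can be sketched as follows. This is an illustration under stated assumptions: render_stub stands in for the real renderer invocation, targets are assumed to be top-level directory names, and the merge shown is a plain directory union.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def render_stub(workdir: Path, target: str, out_dir: Path) -> None:
    """Stand-in for the renderer; a real build would invoke it here."""
    dest = out_dir / target
    dest.mkdir(parents=True, exist_ok=True)
    for src in sorted((workdir / target).glob("*.txt")):
        html = "<p>" + src.read_text(encoding="utf-8") + "</p>"
        (dest / (src.stem + ".html")).write_text(html, encoding="utf-8")


def render_target(source: Path, target: str, scratch: Path) -> Path:
    """Render one target inside its own private copy of the source tree."""
    workdir = scratch / target / "src"
    out_dir = scratch / target / "out"
    shutil.copytree(source, workdir)  # physical isolation: full copy per target
    render_stub(workdir, target, out_dir)
    return out_dir


def build(source: Path, targets: list[str], deploy: Path) -> None:
    """Render all targets in parallel, then merge the outputs into deploy."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp)
        with ThreadPoolExecutor() as pool:
            outputs = list(
                pool.map(lambda t: render_target(source, t, scratch), targets)
            )
        for out_dir in outputs:
            shutil.copytree(out_dir, deploy, dirs_exist_ok=True)
```

Because no target ever writes into another target's working copy, the worker pool needs no locking; the only shared step is the final merge.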

The same partitioning extends naturally to physical distribution across runner nodes, with zero additional coordination. The merge function is commutative: merging in any order produces the same result, eliminating the need for distributed locks, consensus, or shared state. Partial outputs move through any transport layer – shared filesystem, object storage, or CI artifact system – requiring only that remote output be accessible as a local directory.
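
To make the order-independence concrete, the sketch below models each partial output as a mapping from relative path to content and merges by union. The function names and the conflict rule (identical paths must carry identical content) are assumptions made for illustration:

```python
from functools import reduce


def merge(a: dict[str, str], b: dict[str, str]) -> dict[str, str]:
    """Union of two partial outputs, modelled as {relative path: content}.

    Targets are assumed to write disjoint paths; the same path with
    different content is reported as an error rather than resolved by order.
    """
    merged = dict(a)
    for path, content in b.items():
        if path in merged and merged[path] != content:
            raise ValueError(f"conflicting output for {path}")
        merged[path] = content
    return merged


def merge_all(parts: list[dict[str, str]]) -> dict[str, str]:
    """Fold any number of partial outputs into one deployment mapping."""
    return reduce(merge, parts, {})
```

Under this model merge(a, b) equals merge(b, a), and merging an output with itself changes nothing. Folding more than two parts in arbitrary order also relies on associativity, which a conflict-free union provides.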

The commutative merge makes the architecture resilient by construction:

Failure | Effect | Mitigation
Runner fails mid-build | Its targets missing | Rerun on another runner; merge is idempotent
Single target fails | Only that target absent | Rebuild only that target; merge accepts partial results
Network partition | Cannot upload output | Other runners continue; gap detected at merge time

The same build command and merge function support four deployment topologies by changing only the target assignment:

Topology | Description
Single runner | All targets render with logical parallelism; merge is a trivial no-op
CI runner pool | Each runner picks a target subset from a build queue, uploads artifacts; final job merges
Dedicated cluster | Fixed target groups per node; shared filesystem eliminates transport overhead
Hybrid cloud | Baseline on fixed infrastructure; burst spills to ephemeral cloud instances

Figure 2: Target-isolated parallel rendering: each target copies the source tree, renders independently, then merges

Principle 3: Automatic Structural Healing

Common structural defects (broken relative paths, missing includes, incorrect file extensions in links) are detected and corrected automatically by a chain of resolvers. Each resolver addresses one class of defect independently and can be enabled or disabled without affecting the others.
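
One way to picture the chain is as an ordered set of text-to-text functions that can be switched off individually. The two resolvers below are hypothetical examples (the .md to .qmd rewrite and the _footer.qmd include are assumptions, not the system's actual rules):

```python
import re
from typing import Callable

Resolver = Callable[[str], str]


def fix_link_extensions(text: str) -> str:
    """Hypothetical resolver: rewrite link targets ending in .md to .qmd."""
    return re.sub(r"\(([^)\s]+)\.md\)", r"(\1.qmd)", text)


def ensure_include(text: str) -> str:
    """Hypothetical resolver: append a required include if it is missing."""
    include = "{{< include _footer.qmd >}}"
    if include in text:
        return text
    return text.rstrip("\n") + "\n\n" + include + "\n"


# Ordered chain; each entry addresses exactly one class of defect.
CHAIN: dict[str, Resolver] = {
    "link-extensions": fix_link_extensions,
    "includes": ensure_include,
}


def heal(text: str, disabled: frozenset[str] = frozenset()) -> str:
    """Apply every enabled resolver in chain order."""
    for name, resolver in CHAIN.items():
        if name not in disabled:
            text = resolver(text)
    return text
```

Each resolver leaves already-correct text unchanged, so running heal twice gives the same result as running it once.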

Figure 3: Pre-build, render, and post-render pipeline stages

Each stage is idempotent: running it repeatedly on an unchanged corpus produces identical output. All fixes are validated before any writes occur, and a dedicated check mode returns exit code 1 on any unresolved issue.

The build system is renderer-agnostic. Quarto is the current engine, but the pipeline – target isolation, resolver chain, commutative merge – applies to any static-site generator that produces a directory tree. Replacing the renderer requires no changes outside the single build command invocation.

Resolver Chain Architecture

The structural validation layer is organised as a sequential chain of independent resolvers, each handling one class of defect:

Figure 4: Sequential resolver chain, each stage handling one defect class

Each resolver operates independently, with its own scope configuration inherited from a three-tier model:

  • Global exclusions: build artifacts, version-control directories
  • Project-level exclusions: files excluded from the build pipeline
  • Local exclusions: per-resolver patterns for special cases

This ensures that the project configuration remains the single source of truth for exclusions while allowing individual resolvers to tighten scope without modifying global settings.
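
The three tiers can be sketched as pattern lists that only ever accumulate. The concrete patterns and the Scope class are illustrative assumptions:

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Tier 1: global exclusions (build artifacts, version-control directories).
GLOBAL_EXCLUDES = ("_site/*", ".git/*")
# Tier 2: project-level exclusions; this value is a hypothetical setting.
PROJECT_EXCLUDES = ("drafts/*",)


@dataclass(frozen=True)
class Scope:
    """Per-resolver scope: inherits tiers 1 and 2, may add local patterns."""

    local_excludes: tuple[str, ...] = ()

    def includes(self, path: str) -> bool:
        patterns = GLOBAL_EXCLUDES + PROJECT_EXCLUDES + self.local_excludes
        return not any(fnmatchcase(path, p) for p in patterns)
```

A resolver constructed with Scope(local_excludes=("legacy/*",)) skips legacy files, while a resolver using the default Scope() still processes them. Neither can re-include a path excluded at a higher tier.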

Scaling Properties

The resolver chain scans file content rather than parsing it. Total scan cost therefore grows linearly with aggregate file size, with small constants, and is independent of both the number of render targets and the complexity of the document graph. A full pass over the current corpus completes in under four seconds.

Build time is determined by the number of render targets, not the number of documents:

Figure 5: Build time is a function of target count, invariant to document count

Since each render target operates in a physically isolated copy of the source tree, targets can be distributed across any number of runners without architectural changes. A single merge step combines the independently generated output directories using content-level merging for shared assets (search indexes, site maps) and union semantics for everything else.
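
For shared assets, a plain file union is not enough, because every target emits its own copy. The sketch below shows one possible content-level merge for a JSON search index; the schema (a list of entries keyed by an "href" field) and the function name are assumptions:

```python
import json


def merge_search_indexes(parts: list[str]) -> str:
    """Merge JSON search indexes, each a list of entries, by content.

    Entries are keyed by "href" (an assumed schema) and the result is
    sorted, so the output does not depend on the order of the inputs.
    """
    by_href: dict[str, dict] = {}
    for raw in parts:
        for entry in json.loads(raw):
            href = entry["href"]
            if href in by_href and by_href[href] != entry:
                raise ValueError(f"conflicting index entry for {href}")
            by_href[href] = entry
    merged = [by_href[href] for href in sorted(by_href)]
    return json.dumps(merged, sort_keys=True)
```

Sorting the merged entries and the keys within each entry keeps the merged file byte-identical regardless of which runner's output arrives first.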

Figure 6: Target isolation enables distribution without architectural change

Dynamic Content Generation

Figure 7: SDBS artifacts: from source documents to published outputs

Beyond structural validation, the platform generates navigational and cross-format content at render time based on each document’s metadata and the project configuration:

  • Cross-format links (PDF, machine-readable text) are generated automatically, with the link target resolved from project settings
  • Format variant grouping consolidates multiple alternatives under a single heading, avoiding redundant labels
  • Recent-document listings are generated from version-control history, ensuring the front page always reflects current activity without manual updates

Figure 8: Format link resolution follows a deterministic chain from project config to per-document override
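
One possible reading of that resolution chain, as a sketch: a per-document override wins, otherwise a project-level template is filled in, otherwise no link is produced. The keys link_overrides and link_templates and the {stem} placeholder are hypothetical names introduced for this example:

```python
from typing import Optional


def resolve_format_link(
    doc_path: str, fmt: str, project: dict, doc_meta: dict
) -> Optional[str]:
    """Resolve the link target for one output format of one document.

    Assumed order: per-document override, then project-level template,
    then None (no link is generated for this format).
    """
    override = doc_meta.get("link_overrides", {}).get(fmt)
    if override is not None:
        return override
    template = project.get("link_templates", {}).get(fmt)
    if template is not None:
        stem = doc_path.rsplit(".", 1)[0]  # drop the source extension
        return template.format(stem=stem)
    return None
```

Because the function reads only its arguments, the same document and configuration always resolve to the same link.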

LLMs Standard Compliance

The build system generates machine-readable document summaries in the llms.txt format, producing both a global index and per-document .llms.md files. This pipeline runs as a post-render step, extracting content from the merged output and organising it into a hierarchical structure consumable by LLM tooling.

Figure 9: LLMs files are extracted from the build output and indexed for downstream consumption
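
A minimal sketch of how the global index could be assembled, following the general llms.txt layout of a title, a quoted summary, and sectioned link lists. The record fields ("section", "title", "path") and the function name are assumptions:

```python
def build_llms_index(site_title: str, summary: str, docs: list[dict]) -> str:
    """Render a global llms.txt-style index from per-document records.

    Each record is assumed to carry "section", "title", and "path" keys,
    where "path" points at the document's .llms.md file.
    """
    lines = [f"# {site_title}", "", f"> {summary}", ""]
    sections: dict[str, list[dict]] = {}
    # Sort first so the output does not depend on discovery order.
    for doc in sorted(docs, key=lambda d: (d["section"], d["path"])):
        sections.setdefault(doc["section"], []).append(doc)
    for section, entries in sections.items():
        lines.append(f"## {section}")
        lines.append("")
        lines.extend(f"- [{d['title']}]({d['path']})" for d in entries)
        lines.append("")
    return "\n".join(lines)
```

Sorting the records before grouping keeps the index stable across builds, in line with the idempotence requirement placed on the other pipeline stages.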

The resulting _llms/ directory serves two purposes:

  • Agent knowledge base: a separate artifact uploaded for LLM agent consumption, providing a structured snapshot of all published documentation
  • Website extension: merged back into the deployment directory so each HTML page can include a link to its .llms.md counterpart

This integration makes the build system the data-ingestion layer for any downstream knowledge graph (KG) or agentic system: the same pipeline that validates and renders documents also produces the structured corpus that feeds semantic applications.

CI/CD Integration

The full pipeline is containerised and executed in a continuous-deployment workflow:

Figure 10: CI/CD pipeline: validation, build, post-process, deploy

Containerisation ensures environment parity between local development and CI. All pipeline stages are declared in a single configuration file, making the workflow self-documenting and reproducible.

Comparison with Conventional Approaches

Concern | Conventional approach | This build system
Link validation | Manual or external CI checker | Pre-build automated scan, exits 1 on any defect
Include management | Manual copy-paste | Automatic insertion of missing includes
Format navigation | Hand-written links per document | Auto-generated from configuration
Build scaling | Sequential per document | Target-parallel, O(targets) not O(documents)
Cross-reference correctness | Manual upkeep | Automatic extension correction
Structural robustness | Relies on author discipline | Automated healing of common defects
Agent/LLM feed | Separate manual pipeline | Integrated extraction at build time
CI quality gate | Sparse or absent | Comprehensive check, exit code 1

Conclusion

The build system demonstrates that static-site management at scale is achievable through a layered automation architecture combining structural validation with dual-parallel build execution. The key insight is not any individual tool or technique but the composition of idempotent passes that collectively guarantee integrity regardless of corpus size.

The architecture is renderer-agnostic, containerised, and configuration-driven. These properties make it a candidate for extraction into a general-purpose open-source tool, applicable to any static-site corpus that has outgrown manual management. A standalone release would package the resolver chain, the dual-parallel build model, and the LLMs pipeline into a single command that can be dropped into any documentation repository.

The system exhibits:

  • Document-count independence in build time
  • Target-count linearity in resource requirements
  • Physical partitionability for distributed execution
  • Self-healing for common structural defects
  • Configuration-driven behaviour with a single source of truth
  • CI-native integration with pre-build validation and post-render processing