Evolving the Harness: How a Creative-Writing Product Tunes Itself with Genetic Algorithms
In February, LangChain pushed their coding agent from outside the Top 30 to rank 5 on Terminal Bench 2.0. Same model: GPT-5.2-Codex, held fixed throughout. The harness around the model changed. The score moved 13.7 points. Vivek Trivedy describes this as harness engineering: the discipline of building systems around a language model rather than changing the model itself. The result is now well established for coding agents: Codex, Claude Code, deepagents. What it looks like in a creative-writing product is less explored. This is what we built at Novelyst.
What harness engineering actually is
Agent = Model + Harness. Mitchell Hashimoto's framing puts it simply: anytime an agent makes a mistake, you engineer a solution so it never makes that mistake again. The harness is everything that isn't the model: system prompts, tools, middleware, observability, the surrounding code. Anthropic's context engineering essay widens the frame by treating context as a finite resource with an attention budget. ThoughtWorks' Birgitta Böckeler splits it further into guides that anticipate behavior before it happens and sensors that observe and self-correct after.
Models are spiky; harnesses make them useful. Tuning the harness is cheap and fast, and the gains compound. That is exactly the kind of lever a small team can pull.
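
The guide and sensor split described above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited sources: run_harness, Guide, Sensor, and the retry-with-complaints loop are all hypothetical names and choices made for the example, and the model is any callable that maps a prompt to text.

```python
from typing import Callable, List, Optional

# Hypothetical types for illustration only.
Guide = Callable[[str], str]             # shapes the prompt before the model call
Sensor = Callable[[str], Optional[str]]  # returns a complaint, or None if the output passes


def run_harness(
    prompt: str,
    model: Callable[[str], str],
    guides: List[Guide],
    sensors: List[Sensor],
    max_attempts: int = 3,
) -> str:
    """Apply guides up front, then let sensors check each output and trigger a retry."""
    for guide in guides:
        prompt = guide(prompt)

    output = ""
    for _ in range(max_attempts):
        output = model(prompt)
        complaints = [c for s in sensors if (c := s(output)) is not None]
        if not complaints:
            return output
        # Feed the sensors' complaints back so the next attempt can self-correct.
        prompt = prompt + "\nFix: " + "; ".join(complaints)
    # Out of attempts: return the last output rather than failing the request.
    return output
```

In this sketch the guide acts before the call and the sensor acts after it, so the model itself never changes; only the code around it does.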

The Novelyst harness, briefly
Novelyst is an AI-assisted novel-writing platform. The autofill that helps a writer anchor a story is wrapped in a harness recognisable from the literature: middleware, retrieval-augmented context, genre-aware validation, structured traces, an evaluator in shadow mode against a literary rubric. None of this is novel on its own. It is the LangChain and ThoughtWorks playbook, ported to creative writing. The novel part is what sits on top.
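
As one reading of "an evaluator in shadow mode", the sketch below scores each output and records the score in a trace without letting the evaluator change or block what the writer sees. The article does not describe Novelyst's implementation; Trace, run_with_shadow_eval, and the evaluator signature are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Hypothetical structured trace: one record per generation."""
    prompt: str
    output: str
    shadow_scores: Dict[str, float] = field(default_factory=dict)


def run_with_shadow_eval(
    prompt: str,
    generate: Callable[[str], str],
    evaluators: Dict[str, Callable[[str], float]],
    traces: List[Trace],
) -> str:
    """Generate text, score it on the side, and return the text untouched."""
    output = generate(prompt)
    trace = Trace(prompt=prompt, output=output)
    for name, evaluate in evaluators.items():
        try:
            trace.shadow_scores[name] = evaluate(output)
        except Exception:
            # A failing shadow evaluator must never reach the writer;
            # record a missing score and carry on.
            trace.shadow_scores[name] = float("nan")
    traces.append(trace)
    return output
```

The design point is that the scores accumulate in the traces for later analysis while the user-facing path stays exactly as it was.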
The novel move: a genetic algorithm over the harness
The harness has many knobs that interact. Hand-tuning works for one configuration. A/B testing scales to a few. Neither covers the space that opens up once those knobs are treated as a structured configuration evolving together.
A genetic algorithm can.
The critical move is evolving the config, not the prompt text. Earlier work in this space evolves prompt strings; the search space is unbounded and the resulting prompts are hard to audit. Evolving a structured configuration keeps the search bounded and auditable: every change maps to a config-file edit a person could review. The closest research analog is Meta-Harness, which searches over harness code at the research benchmark layer. Putting the same idea into a production creative-writing product, with selection pressure driven by real user behavior rather than a synthetic benchmark, is what makes this useful.
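
To make "evolving the config, not the prompt text" concrete, here is a minimal genetic-algorithm sketch over a small structured configuration. It is an illustration under stated assumptions, not Novelyst's system: the knob names and values in SEARCH_SPACE are invented, the fitness function is supplied by the caller (in production it would come from evaluator scores and user behavior), and the selection scheme is plain elitism with uniform crossover and per-knob mutation.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical knobs and allowed values; the article does not list the real ones.
SEARCH_SPACE: Dict[str, list] = {
    "retrieval_top_k": [2, 4, 8, 16],
    "temperature": [0.3, 0.5, 0.7, 0.9],
    "genre_validation": ["off", "warn", "strict"],
    "rubric_weight": [0.0, 0.25, 0.5, 1.0],
}

Config = Dict[str, object]


def random_config(rng: random.Random) -> Config:
    """Sample one value per knob from the bounded search space."""
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}


def mutate(cfg: Config, rng: random.Random, rate: float = 0.25) -> Config:
    """Resample each knob independently with probability `rate`."""
    child = dict(cfg)
    for key, choices in SEARCH_SPACE.items():
        if rng.random() < rate:
            child[key] = rng.choice(choices)
    return child


def crossover(a: Config, b: Config, rng: random.Random) -> Config:
    """Uniform crossover: each knob comes from one parent or the other."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SEARCH_SPACE}


def diff(old: Config, new: Config) -> Dict[str, Tuple[object, object]]:
    """The reviewable change set: which knobs moved, and from what to what."""
    return {k: (old[k], new[k]) for k in SEARCH_SPACE if old[k] != new[k]}


def evolve(
    fitness: Callable[[Config], float],
    generations: int = 20,
    pop_size: int = 12,
    elite: int = 4,
    seed: int = 0,
) -> Config:
    """Keep the top `elite` configs each generation and breed the rest from them."""
    rng = random.Random(seed)
    population: List[Config] = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:elite]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return max(population, key=fitness)
```

Because every individual is a dictionary over a fixed set of knobs, the output of diff between the current config and a candidate reads like a config-file edit, which is the auditability property that prompt-string evolution lacks.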

What changes with this approach
The harness improves on its own. The product team does not need to predict which shape works best for which genre; the data answers that question over time. As models change, as new failure modes surface, as the writer base shifts, the harness evolves toward what works without requiring a manual tuning pass each time.
The interesting part is what this looks like at low traffic. Most production GA systems assume large user populations and long data-accumulation windows. Novelyst is a small product at early launch. The engineering that lets evolutionary search produce useful signal under those constraints is the part most worth sharing.
The portable insight for AI labs
The pattern generalizes beyond creative writing. Any production AI system with a structured harness, a usable evaluation signal, and a way to observe user response can apply it. The substrate is generic.
There is also a collaboration here that AI labs are well positioned to take up. Better evaluators come from labs. Production behavior data comes from products. Evolutionary search over harness configuration sits between them and can close a loop neither side closes alone.