How We Scaled Git Context’s Analysis Pipeline with Batching, Caching, and Dependency Fixes
The Pipeline That Almost Broke Under Its Own Weight
A few months ago, Git Context’s analysis pipeline was a ticking time bomb. It worked fine on small repos, but throw a 50k-commit monorepo at it and everything slowed to a crawl—or worse, timed out mid-analysis. The root cause? A fragile chain of synchronous operations, race conditions in dependency resolution, and zero caching. We were reprocessing the same files over and over, fetching full diffs when we only needed a few paths, and letting UI state dictate when and how analysis ran. It wasn’t sustainable.
The breaking point came when we tried to run detector logic across long-lived branches. The pipeline would stall, lose context between commits, and sometimes return partial or inconsistent results. We knew we had to decouple data flow from the UI, enforce strict execution order, and introduce smart optimizations—or we’d never scale.
Batching, Caching, and the End of N+1 Hell
Our first move was to refactor the entire execution model around batched processing. Instead of analyzing commits one by one with individual Git calls, we grouped them into chunks based on branch topology and file path relevance. This reduced round trips to Git by over 80% in large histories.
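To make the grouping step concrete, here is a minimal sketch of chunking commits into batches. It is a simplification under stated assumptions: the `CommitRef` shape, the `branch` field, and the `maxBatchSize` parameter are hypothetical, and the split rule shown (start a new batch when the branch changes or the batch is full) stands in for the richer topology and path-relevance grouping described above.

```typescript
// Hypothetical commit shape for illustration only.
interface CommitRef {
  hash: string;
  branch: string; // assumed: the branch this commit was discovered on
}

// Split an ordered commit list into batches. A new batch starts when the
// current one is full or when the branch changes, so every batch holds
// consecutive commits from a single branch.
function batchCommits(commits: CommitRef[], maxBatchSize: number): CommitRef[][] {
  if (maxBatchSize < 1) {
    throw new Error("maxBatchSize must be at least 1");
  }
  const batches: CommitRef[][] = [];
  let current: CommitRef[] = [];
  let currentBranch: string | undefined;

  for (const commit of commits) {
    const full = current.length === maxBatchSize;
    const branchChanged = current.length > 0 && commit.branch !== currentBranch;
    if (full || branchChanged) {
      batches.push(current);
      current = [];
    }
    current.push(commit);
    currentBranch = commit.branch;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```

Each batch can then be handed to a single worker, which is where the reduction in Git round trips would come from.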
But batching alone wasn’t enough. We noticed the same files, especially shared config or utility modules, were being parsed repeatedly across commits. So we introduced an in-memory caching layer keyed on file path + commit hash. If a file hadn’t changed since the last analyzed commit, we reused the cached AST and metadata. This paid off most in monorepos with slow-moving core packages, where the same parsed files are reused across long stretches of history.
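A cache with that keying might look roughly like the sketch below. The class name `FileCache` and its `has`/`get`/`set` methods mirror how the cache is used in this post's batch resolver, but the internals are assumptions: the `maxEntries` cap and the oldest-first eviction are illustrative choices, not a description of the production layer.

```typescript
// Minimal in-memory cache keyed on (file path, commit hash).
// The size cap and eviction policy are assumptions for illustration.
class FileCache<T> {
  private readonly entries = new Map<string, T>();
  private readonly maxEntries: number;

  constructor(maxEntries: number = 10000) {
    this.maxEntries = maxEntries;
  }

  // Commit hashes contain no ":", so putting the hash first keeps keys unambiguous.
  private static keyFor(path: string, commit: string): string {
    return `${commit}:${path}`;
  }

  has(path: string, commit: string): boolean {
    return this.entries.has(FileCache.keyFor(path, commit));
  }

  get(path: string, commit: string): T | undefined {
    return this.entries.get(FileCache.keyFor(path, commit));
  }

  set(path: string, commit: string, value: T): void {
    const key = FileCache.keyFor(path, commit);
    // Delete first so a rewritten key moves to the end of insertion order.
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // A Map iterates in insertion order, so the first key is the oldest write.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
      }
    }
  }
}
```

Bounding the cache matters on long histories, where an unbounded map of parsed results could undo the memory savings the cache is meant to support.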
We also moved away from a flat, linear processing queue to a timeline chain model. Now, each analysis job maintains a rolling context window, carrying forward known state (like detected frameworks or project structure) unless a relevant file changes. This cut redundant work and made the output far more consistent across runs.
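The rolling context idea can be sketched as a walk over the commit timeline that recomputes state only when a relevant file changes. All names here (`TimelineCommit`, `ProjectContext`, `walkTimeline`, and the two callbacks) are hypothetical, and the sketch assumes a linear, oldest-first commit list rather than a full branch graph.

```typescript
// Hypothetical shapes for illustration.
interface TimelineCommit {
  hash: string;
  changedPaths: string[];
}

interface ProjectContext {
  frameworks: string[];
  computedAt: string; // hash of the commit where this context was last recomputed
}

// Walk commits oldest-first, carrying the previous context forward unless
// the commit touches a path that could invalidate it.
function walkTimeline(
  commits: TimelineCommit[],
  affectsContext: (path: string) => boolean,
  detectContext: (commit: TimelineCommit) => ProjectContext,
): Map<string, ProjectContext> {
  const contextByCommit = new Map<string, ProjectContext>();
  let current: ProjectContext | undefined;

  for (const commit of commits) {
    const invalidated = commit.changedPaths.some((path) => affectsContext(path));
    if (current === undefined || invalidated) {
      current = detectContext(commit); // the expensive step, run only when needed
    }
    contextByCommit.set(commit.hash, current);
  }
  return contextByCommit;
}
```

Because unchanged commits share the same context object, downstream detectors see identical state for them, which is one plausible source of the consistency gain between runs.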
Here’s a simplified version of how we structured the batch resolver:
async function processCommitBatch(batch: Commit[], cache: FileCache) {
  const results = [];
  for (const commit of batch) {
    // Files touched between this commit and its parent
    const changedFiles = await git.diff(commit.parent, commit.hash);
    const relevantFiles = applyPathFilter(changedFiles); // Only analyze what matters
    for (const file of relevantFiles) {
      if (cache.has(file.path, file.commit)) {
        // Cache hit: reuse the stored detection result
        results.push(cache.get(file.path, file.commit));
      } else {
        // Cache miss: parse, run detectors, and store the result for later commits
        const ast = parse(file.content);
        const detection = runDetectors(ast);
        cache.set(file.path, file.commit, detection);
        results.push(detection);
      }
    }
  }
  return results;
}
Smarter Analysis Through Path Filtering and Dependency Chaining
One of the quietest but most impactful changes was introducing path-based filtering at the pipeline entry point. Instead of analyzing every changed file in every commit, we now let detectors declare which globs they care about (e.g., **/package.json, *.ts). The pipeline pre-filters diffs before dispatching work. This means a CSS change won’t trigger full TypeScript semantic analysis—and that adds up fast in large histories.
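A pre-filter of this kind could be sketched as follows. This is an illustration, not the pipeline's actual matcher: the `PathScopedDetector` shape and the function names are hypothetical, the glob support is limited to `**`, `*`, and `?`, and the sketch assumes that a pattern without a slash (such as `*.ts`) is matched against the file's basename so that it applies at any depth.

```typescript
// Hypothetical detector declaration: an id plus the globs it cares about.
interface PathScopedDetector {
  id: string;
  globs: string[];
}

// Translate a small glob subset into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?"; // zero or more leading directories
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*"; // anything, including slashes
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // anything within one path segment
      i += 1;
    } else if (glob.charAt(i) === "?") {
      source += "[^/]"; // one character within a segment
      i += 1;
    } else {
      // Escape regex metacharacters in literal text.
      source += glob.charAt(i).replace(/[.+^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Assumption: slash-free patterns match the basename, others the full path.
function matchesGlob(path: string, glob: string): boolean {
  const target = glob.includes("/") ? path : path.slice(path.lastIndexOf("/") + 1);
  return globToRegExp(glob).test(target);
}

// Map each detector id to the changed paths it should receive.
// Detectors with no matching paths are omitted, so they never run.
function dispatchByPath(
  changedPaths: string[],
  detectors: PathScopedDetector[],
): Map<string, string[]> {
  const work = new Map<string, string[]>();
  for (const detector of detectors) {
    const matched = changedPaths.filter((path) =>
      detector.globs.some((glob) => matchesGlob(path, glob)),
    );
    if (matched.length > 0) {
      work.set(detector.id, matched);
    }
  }
  return work;
}
```

With a change set containing only a stylesheet, a detector that declares `*.ts` receives no work at all, which is the behavior the paragraph above describes.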
We also fixed long-standing dependency issues in the analysis chain. Previously, detectors ran in an unpredictable order, and some depended on side effects from others without explicit linkage. We rewrote the orchestrator to enforce a DAG-based execution model, where each detector declares its inputs and outputs. This eliminated race conditions and made the pipeline’s behavior predictable and testable.
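One way to derive a safe execution order from declared inputs and outputs is a topological sort. The sketch below uses Kahn's algorithm; the `DetectorSpec` shape, the artifact-name convention, and `planExecution` are assumptions made for illustration, not the orchestrator's real interface.

```typescript
// Hypothetical declaration: each detector names the artifacts it reads and writes.
interface DetectorSpec {
  id: string;
  inputs: string[];
  outputs: string[];
}

// Return detector ids in an order where every producer runs before its consumers.
// Throws on duplicate producers, missing producers, or dependency cycles.
function planExecution(detectors: DetectorSpec[]): string[] {
  // Map each artifact to the single detector that produces it.
  const producerOf = new Map<string, string>();
  for (const detector of detectors) {
    for (const output of detector.outputs) {
      if (producerOf.has(output)) {
        throw new Error(`"${output}" is produced by more than one detector`);
      }
      producerOf.set(output, detector.id);
    }
  }

  // Build producer -> consumer edges and count unmet dependencies per detector.
  const dependents = new Map<string, string[]>();
  const pending = new Map<string, number>();
  for (const detector of detectors) {
    dependents.set(detector.id, []);
    pending.set(detector.id, 0);
  }
  for (const detector of detectors) {
    const upstream = new Set<string>();
    for (const input of detector.inputs) {
      const producer = producerOf.get(input);
      if (producer === undefined) {
        throw new Error(`no detector produces "${input}" (needed by ${detector.id})`);
      }
      upstream.add(producer);
    }
    upstream.forEach((producer) => {
      (dependents.get(producer) ?? []).push(detector.id);
      pending.set(detector.id, (pending.get(detector.id) ?? 0) + 1);
    });
  }

  // Kahn's algorithm: repeatedly schedule detectors with no unmet dependencies.
  const ready = detectors.filter((d) => pending.get(d.id) === 0).map((d) => d.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const remaining = (pending.get(next) ?? 0) - 1;
      pending.set(next, remaining);
      if (remaining === 0) {
        ready.push(next);
      }
    }
  }
  if (order.length !== detectors.length) {
    throw new Error("detector dependencies contain a cycle");
  }
  return order;
}
```

Because the order is computed from declarations rather than registration order, a missing or circular dependency fails at planning time instead of surfacing later as an intermittent race.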
The final piece was decoupling data management from the UI. We introduced a client-side database (think: a lightweight, queryable store for commit metadata, file states, and detection results). Now, the UI reacts to data updates instead of driving them. This not only improved responsiveness but also enabled background analysis and resume-on-disconnect behavior.
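The inversion described here, where the pipeline writes and the UI only listens, can be illustrated with a tiny observable store. This is a sketch under assumptions: `ResultStore`, its method names, and the use of commit hashes as keys are hypothetical, and a real client-side database would add persistence and querying that this in-memory version omits.

```typescript
// Listener signature: called with the key and record of every write.
type StoreListener<T> = (key: string, record: T) => void;

// Minimal in-memory store: the analysis pipeline calls put(), the UI subscribes.
class ResultStore<T> {
  private readonly records = new Map<string, T>();
  private readonly listeners = new Set<StoreListener<T>>();

  // Write a record and notify every current subscriber.
  put(key: string, record: T): void {
    this.records.set(key, record);
    this.listeners.forEach((listener) => listener(key, record));
  }

  get(key: string): T | undefined {
    return this.records.get(key);
  }

  // Register a listener; the returned function unsubscribes it.
  subscribe(listener: StoreListener<T>): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Keys with no stored record yet, e.g. commits still to analyze after a reconnect.
  missing(keys: string[]): string[] {
    return keys.filter((key) => !this.records.has(key));
  }
}
```

Since the store knows which keys already have results, a resumed session can ask for `missing(...)` and analyze only those commits instead of starting over.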
The result? A pipeline that’s 5–8x faster on large repos, uses 60% less memory, and produces consistent, reliable output. More importantly, it’s now extensible—we can add new detectors and analysis passes without fear of breaking the whole chain.
This refactor wasn’t just about speed. It was about building a foundation that can handle the complexity of real-world Git histories—branch merges, rebases, sparse checkouts—without falling over. And now, with batching, caching, and smarter dependency management in place, we’re finally ready to scale.