Building a Resilient Task Requeue Mechanism in GhostGraph: Recovering Orphaned Pipeline Jobs

The Problem: Ghost Tasks in the Machine

We’ve all seen it — a job goes into a queue, the logs go quiet, and suddenly you’ve got a silent failure with no retries, no errors, just… nothing. In GhostGraph, our pipeline engine for recipe-driven data workflows, we started noticing tasks stuck in a 'pending' state with no active worker tracking them. No crash logs. No timeouts. Just orphaned.

As we shifted from Redis lists to Redis Streams for better scalability and observability, this edge case became more pronounced. A worker would pull a job, die mid-process (maybe from a node restart or OOM kill), and leave behind a 'pending' entry that no one was monitoring. Redis Streams' consumer groups are great for durability: every delivered-but-unacknowledged message stays in the pending entries list (PEL). But nothing auto-heals stalled jobs. Without intervention, these orphaned tasks would stay frozen, breaking data consistency and delaying downstream pipelines.

We needed a way to detect these ghosts and safely return them to the stream for reprocessing — without duplicates, without race conditions, and without disrupting active workers.

Designing the Requeue Endpoint: Querying Stale, Not Dead

The solution started with visibility. Redis Streams provide the XPENDING command, which lets you inspect pending entries — their delivery count, last delivery time, and owning consumer. That’s gold for recovery logic.

We built a new endpoint: POST /recipe-pipeline/requeue. It doesn’t blindly requeue everything. Instead, it applies smart filtering:

  • Only consider jobs idle for more than a threshold (configurable, default 5 minutes)
  • Filter by consumer group and stream name (we use recipe_pipeline:group)
  • Respect max delivery attempts to avoid infinite loops

Here’s the core logic:

# In redis-py, the IDLE-filtered form of XPENDING is exposed as
# xpending_range (plain xpending only returns a summary)
pending_tasks = await redis.xpending_range(
    "recipe_pipeline",
    "recipe_pipeline:group",
    min="-",        # scan the whole pending range
    max="+",
    count=100,
    idle=300_000,   # only entries idle for >= 5 minutes (in ms)
)

For each pending task, we extract the stream ID and fetch the full message with XRANGE, querying the exact ID as both start and end. Then, instead of blindly re-adding it, we XADD it back to the main stream with a MAXLEN cap to preserve stream health, and XACK the original entry so it leaves the pending list and can't be delivered twice.
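As a rough sketch of that fetch step (the helper name is ours, and we assume a redis.asyncio client):

```python
async def fetch_full_message(redis, stream: str, entry_id: str):
    """Fetch the payload of a pending entry by its exact stream ID.

    XRANGE with start == end returns at most the one matching entry;
    an empty result means the entry was trimmed out of the stream.
    (`redis` is assumed to be a redis.asyncio.Redis client.)
    """
    rows = await redis.xrange(stream, min=entry_id, max=entry_id)
    if not rows:
        return None  # entry no longer exists in the stream
    _entry_id, fields = rows[0]
    return fields
```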

But here’s the kicker: we don’t just requeue the raw message. We enrich it with metadata:

  • requeued_at: timestamp for observability
  • requeue_count: increment to track recovery attempts
  • next_offset: critical for order preservation (more on this in a sec)
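A minimal sketch of the enrichment plus the re-append step (function names are ours; we assume a redis.asyncio client and stream fields stored as flat string maps):

```python
import time

def enrich_for_requeue(fields: dict, next_offset: int) -> dict:
    """Copy the message fields and attach the recovery metadata above."""
    enriched = dict(fields)
    enriched["requeued_at"] = str(int(time.time() * 1000))  # ms timestamp
    enriched["requeue_count"] = str(int(fields.get("requeue_count", "0")) + 1)
    enriched["next_offset"] = str(next_offset)
    return enriched

async def requeue_entry(redis, stream, group, entry_id, fields, next_offset):
    # Re-append with a MAXLEN cap so recovery can't grow the stream
    # unbounded, then XACK the old entry so it leaves the pending list.
    await redis.xadd(stream, enrich_for_requeue(fields, next_offset),
                     maxlen=100_000, approximate=True)
    await redis.xack(stream, group, entry_id)
```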

We also made sure the endpoint uses request.app.state.redis — a lesson learned the hard way when an early version failed silently due to incorrect Redis instance binding. That fix was small but vital.

Security-wise, the endpoint is admin-only, protected by JWT and rate-limited. You don’t want just anyone reviving pipeline jobs.

Preserving Order with next_offset: No Skipped Steps

One of GhostGraph’s core guarantees is ordered processing within a recipe pipeline. Tasks are chained: step 2 must run after step 1. But if you requeue a job without context, you risk breaking that sequence.

Our pipelines use a logical offset system — each task has an offset field indicating its position in the workflow. When a job is requeued, we preserve its original offset, but we also set next_offset in the requeued message. This tells the worker: "when you finish, the next expected task should be this one."

Why? Because sometimes, a stalled job causes downstream tasks to wait indefinitely. By injecting next_offset, we allow the pipeline supervisor to detect gaps and trigger gap-filling logic — like fast-forwarding or alerting.
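The gap check itself reduces to simple arithmetic over offsets; a sketch (the function name and supervisor wiring are ours):

```python
def missing_offsets(last_completed: int, incoming: int) -> list[int]:
    """Offsets still expected between the last completed step and an
    incoming task's offset; an empty list means there is no gap."""
    return list(range(last_completed + 1, incoming))
```

For example, if step 1 has finished and a task arrives carrying offset 4, offsets 2 and 3 are missing, and the supervisor can fast-forward or alert.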

This pattern turned out to be essential for maintaining idempotency and continuity. We’re not just recovering work — we’re recovering context.

The result? A pipeline that’s not just fast, but resilient. Operators can now run /requeue manually during incident response, or automate it via health checks. We’ve reduced stuck-job incidents by over 90% since deploying this.

Building reliable distributed systems isn’t about preventing failures — it’s about designing for them. In GhostGraph, we’re treating failure as a feature, not a bug. And now, when tasks go quiet, we don’t panic. We requeue.
