
Building a Resilient Task Requeue Mechanism in GhostGraph: Recovering Orphaned Pipeline Jobs

The Problem: Ghost Tasks in the Machine

We’ve all seen it — a job goes into a queue, the logs go quiet, and suddenly you’ve got a silent failure: no retries, no errors, just… nothing. In GhostGraph, my pipeline engine for recipe-driven data workflows, I started noticing tasks stuck in a 'pending' state with no active worker tracking them. No crash logs. No timeouts. Just orphaned.

As I shifted from Redis lists to Redis Streams for better scalability and observability, this edge case became more pronounced. A worker would pull a job, die mid-process (maybe from a node restart or OOM kill), and leave behind a 'pending' entry that no one was monitoring. Redis Streams’ consumer-group model, with its pending entries list, is great for durability, but it doesn’t auto-heal stalled jobs. Without intervention, these orphaned tasks would stay frozen — breaking data consistency and delaying downstream pipelines.

I needed a way to detect these ghosts and safely return them to the stream for reprocessing — without duplicates, without race conditions, and without disrupting active workers.

Designing the Requeue Endpoint: Querying Stale, Not Dead

The solution started with visibility. Redis Streams provide the XPENDING command, which lets you inspect pending entries — their delivery count, last delivery time, and owning consumer. That’s gold for recovery logic.

I built a new endpoint: POST /recipe-pipeline/requeue. It doesn’t blindly requeue everything. Instead, it applies smart filtering:

  • Only consider jobs idle for more than a threshold (configurable, default 5 minutes)
  • Filter by consumer group and stream name (I use recipe_pipeline:group)
  • Respect max delivery attempts to avoid infinite loops
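
That filter reduces to a tiny predicate over the entries XPENDING returns. A minimal sketch, assuming redis-py's xpending_range dict shape; the max-delivery default of 3 is an illustrative value, not GhostGraph's actual setting:

```python
def should_requeue(entry: dict, min_idle_ms: int = 300_000, max_deliveries: int = 3) -> bool:
    """Stale-not-dead check for one pending entry.

    `entry` is shaped like redis-py's xpending_range output:
    {"message_id": ..., "consumer": ..., "time_since_delivered": ms, "times_delivered": n}
    """
    return (
        entry["time_since_delivered"] >= min_idle_ms  # idle long enough to be suspect
        and entry["times_delivered"] < max_deliveries  # avoid infinite requeue loops
    )
```

Keeping this as a pure function makes the threshold logic trivially testable, separate from any Redis I/O.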

Here’s the core logic:

pending_tasks = await redis.xpending_range(
    "recipe_pipeline",
    "recipe_pipeline:group",
    min="-",
    max="+",
    count=100,
    idle=300_000,  # only entries idle for 5+ minutes (ms)
)

For each pending task, I extract the stream ID and fetch the full message body with XRANGE bounded to that ID (XREAD would only return entries after it). Then, rather than blindly appending it, I XADD it back to the main stream with a MAXLEN trim strategy to preserve stream health while queuing it for redelivery.

But here’s the kicker: I don’t just requeue the raw message. I enrich it with metadata:

  • requeued_at: timestamp for observability
  • requeue_count: increment to track recovery attempts
  • next_offset: critical for order preservation (more on this in a sec)
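
Putting fetch, enrichment, and re-append together, the per-task recovery step looks roughly like this. A sketch assuming redis-py's asyncio client; enrich_payload and requeue_task are illustrative names, and the MAXLEN of 10,000 is an arbitrary example:

```python
import time


def enrich_payload(fields: dict, next_offset: int) -> dict:
    """Copy the original message fields and stamp recovery metadata onto them."""
    enriched = dict(fields)
    enriched["requeued_at"] = str(int(time.time() * 1000))  # ms timestamp for observability
    enriched["requeue_count"] = str(int(fields.get("requeue_count", 0)) + 1)  # track attempts
    enriched["next_offset"] = str(next_offset)  # expected successor step in the pipeline
    return enriched


async def requeue_task(redis, stream: str, group: str, entry_id: str, next_offset: int):
    # Fetch the full message body for the stalled entry by its exact ID.
    entries = await redis.xrange(stream, min=entry_id, max=entry_id)
    if not entries:
        return None  # entry was already trimmed or deleted
    _, fields = entries[0]
    # Append an enriched copy, trimming approximately to keep the stream healthy.
    new_id = await redis.xadd(
        stream, enrich_payload(fields, next_offset), maxlen=10_000, approximate=True
    )
    # Ack the old pending entry so the same work can't be claimed twice.
    await redis.xack(stream, group, entry_id)
    return new_id
```

Acking the original entry after the fresh XADD is what keeps the operation duplicate-free: the old ID leaves the pending entries list only once its replacement is safely in the stream.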

I also made sure the endpoint uses request.app.state.redis — a lesson learned the hard way when an early version failed silently due to incorrect Redis instance binding. That fix was small but vital.

Security-wise, the endpoint is admin-only, protected by JWT and rate-limited. You don’t want just anyone reviving pipeline jobs.

Preserving Order with next_offset: No Skipped Steps

One of GhostGraph’s core guarantees is ordered processing within a recipe pipeline. Tasks are chained: step 2 must run after step 1. But if you requeue a job without context, you risk breaking that sequence.

My pipelines use a logical offset system — each task has an offset field indicating its position in the workflow. When a job is requeued, I preserve its original offset, but I also set next_offset in the requeued message. This tells the worker: "when you finish, the next expected task should be this one."

Why? Because sometimes, a stalled job causes downstream tasks to wait indefinitely. By injecting next_offset, I allow the pipeline supervisor to detect gaps and trigger gap-filling logic — like fast-forwarding or alerting.
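
The supervisor side of that check can be sketched as a pure function (illustrative, not GhostGraph's actual supervisor code): given the offsets already completed and a requeued message's next_offset, list what should have run but didn't.

```python
def detect_gaps(processed: list[int], next_offset: int) -> list[int]:
    """Return offsets that should precede next_offset but are missing from `processed`."""
    seen = set(processed)
    return [offset for offset in range(next_offset) if offset not in seen]
```

A non-empty result is the trigger for gap-filling logic: fast-forward past optional steps, or alert an operator.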

This pattern turned out to be essential for maintaining idempotency and continuity. I’m not just recovering work — I’m recovering context.

The result? A pipeline that’s not just fast, but resilient. Operators can now run /requeue manually during incident response, or automate it via health checks. I’ve reduced stuck-job incidents by over 90% since deploying this.

Building reliable distributed systems isn’t about preventing failures — it’s about designing for them. In GhostGraph, I’m treating failure as a feature, not a bug. And now, when tasks go quiet, I don’t panic. I requeue.
