Migrating Job State Management from Redis to Postgres: Why I Centralized Crawler Jobs in a Single Source of Truth
The Problem with Redis: Speed at the Cost of Visibility
When I first built the Vultr Crawler, Redis was the obvious choice for job claiming. It’s fast, supports atomic operations like LPOP and SREM, and handles high-throughput workloads with ease. Each worker would claim a job item by popping it from a Redis list (or removing it from a set), and I’d track basic state in side keys. For a while, it worked great.
But as my crawl volume grew and compliance requirements tightened, cracks started to show. The biggest issue? I couldn’t reliably audit or recover job state. If a worker crashed mid-processing, the job was gone from Redis—but was it actually started? Was it retried? Without durable logs or foreign key relationships, reconstructing the chain of events during incidents became a forensic nightmare.
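The failure mode is easy to reproduce in miniature. This pure-Python sketch is a stand-in for the actual Redis calls (a `deque` plays the role of the list behind LPOP); it shows why a crash after the pop leaves nothing to recover:

```python
from collections import deque

# Stand-in for the Redis list: LPOP maps to popleft().
queue = deque(["job-1", "job-2", "job-3"])

def claim_and_process(crash=False):
    """Claim a job the old way: the pop IS the claim."""
    if not queue:
        return None
    item = queue.popleft()  # atomic, fast... and irreversible
    if crash:
        # Worker dies here: the item is gone from the queue, and no
        # durable record says who had it or how far it got.
        return None
    return item

claim_and_process(crash=True)  # "job-1" vanishes
remaining = list(queue)        # job-1 is unaccounted for
```

Nothing in the surviving state distinguishes "job-1 was completed" from "job-1 was lost mid-flight", which is exactly the audit gap described above.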
Worse, I had no built-in way to enforce referential integrity between job items and their parent crawl jobs. Orphaned items, duplicate processing, and race conditions during retries became more frequent. I was trading long-term operability for short-term performance.
Migrating to Postgres: A Single Source of Truth
This month, I completed the migration of job item claiming and management from Redis to Postgres—a shift that redefined how I think about state in my crawler.
The core of the new system is a job_items table with the following schema:
CREATE TABLE job_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES crawl_jobs(id),
    payload JSONB NOT NULL,
    state TEXT NOT NULL DEFAULT 'pending'
        CHECK (state IN ('pending', 'claimed', 'processing', 'completed', 'failed')),
    claimed_by TEXT, -- worker identifier
    claimed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_job_items_job_id_state ON job_items(job_id, state);
CREATE INDEX idx_job_items_claimed_by ON job_items(claimed_by) WHERE state = 'claimed';
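The five states in the CHECK constraint form a small state machine. Here is a sketch of the transition rules enforced application-side; the state names mirror the schema, but the exact transition set (in particular the `failed -> pending` retry edge) is my assumption about a typical claim lifecycle, not something the table itself dictates:

```python
# Legal transitions between the states in the CHECK constraint.
# 'failed' -> 'pending' models a retry; everything else is one-way.
TRANSITIONS = {
    "pending":    {"claimed"},
    "claimed":    {"processing", "failed"},
    "processing": {"completed", "failed"},
    "failed":     {"pending"},   # retry path (assumption)
    "completed":  set(),         # terminal
}

def can_transition(old, new):
    """Return True if a row may move from state `old` to state `new`."""
    return new in TRANSITIONS.get(old, set())
```

Checking transitions in one place like this makes illegal jumps (say, `completed` back to `pending`) a code-review conversation rather than a production surprise.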
Workers now claim jobs using a transactional SELECT FOR UPDATE SKIP LOCKED pattern:
BEGIN;
SELECT id, payload FROM job_items
WHERE job_id = $1 AND state = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- If a row was returned, mark it claimed ($3 = the id from the SELECT):
UPDATE job_items
SET state = 'claimed', claimed_by = $2, claimed_at = NOW(), updated_at = NOW()
WHERE id = $3;
COMMIT;
This ensures that only one worker can claim a job item at a time, even under high concurrency. The entire operation is atomic, and because it’s in Postgres, every state change is durable and queryable.
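The "exactly one claimant per item" property is the whole point of SKIP LOCKED. The real guarantee comes from Postgres row locks, but the invariant can be checked in miniature with threads, where a mutex stands in for the row lock (everything here — the dict of rows, the worker count — is an illustrative simulation, not my production code):

```python
import threading

# In-memory stand-in for job_items rows: id -> state.
items = {f"item-{i}": "pending" for i in range(100)}
lock = threading.Lock()  # plays the role of Postgres row locking
claims = []              # (worker_id, item_id) pairs

def claim_one(worker_id):
    # Mirrors SELECT ... FOR UPDATE SKIP LOCKED + UPDATE as one
    # critical section: find a pending row, mark it claimed.
    with lock:
        for item_id, state in items.items():
            if state == "pending":
                items[item_id] = "claimed"
                claims.append((worker_id, item_id))
                return item_id
    return None  # nothing left to claim

# 8 workers race for 100 items, 25 attempts each (200 attempts total).
threads = [
    threading.Thread(target=lambda w=w: [claim_one(w) for _ in range(25)])
    for w in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the race, every item is claimed exactly once: 100 claims, no duplicates, regardless of how the threads interleave.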
I also added triggers to log state transitions to a separate job_item_events table, giving me full auditability. Need to know when a job failed and who was running it? It’s all there.
Trade-offs and Performance: Was It Worth It?
I’ll be real: Postgres is not Redis. Raw throughput is lower, and SELECT FOR UPDATE has more overhead than a simple LPOP. I expected a performance hit—but the reality was more nuanced.
My benchmarks showed a 15–20% increase in job claim latency under peak load. But in practice, this didn’t bottleneck the overall crawl pipeline. Why? Because job claiming was never my critical path. Network I/O, HTML parsing, and rate limiting were the real bottlenecks.
More importantly, the trade-off paid off in reliability and debuggability. I’ve eliminated entire classes of race conditions and now have a single system of record for job state. Operational incidents that used to take hours to debug now take minutes.
I also gained unexpected benefits: easier backups, integration with my existing observability stack (via logical replication to my analytics DB), and the ability to run complex ad-hoc queries across job metadata.
Looking ahead, I’m exploring a hybrid model—using Redis as a read-through cache for pending job counts and worker load balancing, while keeping Postgres as the authoritative source. This gives me the best of both: durability where it matters, speed where it doesn’t.
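The hybrid model is essentially a read-through cache in front of an authoritative count. A minimal sketch, with a plain attribute and a stubbed `count_pending_in_db` standing in for Redis and the Postgres query (the TTL, class, and function names are my own illustrative choices, not part of the actual system):

```python
import time

def count_pending_in_db():
    # Stub for: SELECT count(*) FROM job_items WHERE state = 'pending'
    return 42

class ReadThroughCounter:
    """Serve pending-job counts from cache; fall back to the source of truth."""

    def __init__(self, source, ttl_seconds=5.0):
        self.source = source
        self.ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0
        self.db_hits = 0  # instrumentation: how often we hit Postgres

    def get(self):
        now = time.monotonic()
        if self._value is None or now - self._fetched_at > self.ttl:
            self._value = self.source()  # authoritative read (Postgres)
            self._fetched_at = now
            self.db_hits += 1
        return self._value               # fast path (Redis in the real setup)

counter = ReadThroughCounter(count_pending_in_db, ttl_seconds=60)
first, second = counter.get(), counter.get()
```

A slightly stale count is harmless for load balancing, which is exactly why this is a safe place to reintroduce a cache: correctness-critical claims stay in Postgres, and only the cheap-to-be-wrong reads move to the fast path.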
Migrating job state out of Redis wasn’t just a technical refactor—it was a shift in philosophy. I’m no longer optimizing for raw speed at the expense of clarity. Sometimes, the right database is the one you already trust with your data.