
From Direct Queries to Clean Repositories: Refactoring a Python Scraper’s Database Layer

The Messy Reality of Direct Database Access

If you’ve ever written a scraper, you’ve probably been here: a script that pulls data from a page, massages it, and jams it straight into a database using raw SQL queries. That’s exactly where the Vultr Scraper started. Early versions used asyncpg directly in scraper logic—hand-rolled INSERT statements, ad-hoc queries, and zero abstraction. It worked. Until it didn’t.

The first red flags showed up during maintenance. A schema change meant hunting down every .execute() call. Testing became a nightmare: mocking raw queries produced brittle tests that passed while masking real integration issues. And new features? Forget about reusing data access logic; each one duplicated the same error-prone SQL by hand.

Worse, our domain logic was tangled with database details. Did the scraper really need to know about ON CONFLICT clauses or connection pooling? No. It needed to save pricing data—not manage SQL syntax. We were violating one of the oldest rules in software: separate concerns.
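To make the problem concrete, here is a hypothetical sketch of what that "before" state tends to look like. The table, column names, and function are illustrative assumptions, not the actual scraper code:

```python
# Hypothetical sketch of the "before" state: persistence welded to scraper code.
# Table and column names here are illustrative assumptions.
async def save_prices(conn, rows):
    # The scraper has to know upsert mechanics (ON CONFLICT) and SQL syntax,
    # even though its real job is just "save this pricing data".
    for region, plan_type, price_usd in rows:
        await conn.execute(
            """
            INSERT INTO pricing (region, plan_type, price_usd)
            VALUES ($1, $2, $3)
            ON CONFLICT (region, plan_type)
            DO UPDATE SET price_usd = EXCLUDED.price_usd
            """,
            region, plan_type, price_usd,
        )
```

Every caller of a function like this is coupled to the schema, the driver's placeholder syntax, and the conflict strategy at once.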

Building a Repository Layer That Scales

The fix? Introduce a proper repository pattern—domain-driven, type-safe, and testable. The goal wasn’t just abstraction for abstraction’s sake. It was about making the codebase predictable, extendable, and safe to change.

I started by defining clear interfaces using Python protocols:

from typing import List, Optional, Protocol

# PricingEntry and ScrapeMetadata are the scraper's domain models.
class PricingRepository(Protocol):
    async def upsert_pricing_data(self, data: List[PricingEntry]) -> None: ...
    async def get_latest_scrape(self) -> Optional[ScrapeMetadata]: ...

These contracts forced us to think in terms of what we needed, not how. Then came concrete implementations—PostgresPricingRepository—that handled the SQL details behind the scenes. Type hints ensured callers couldn’t pass junk data. Dependency injection (via simple constructor args) made it trivial to swap implementations in tests.
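As a rough sketch of how this fits together (the names PricingRepository and PostgresPricingRepository come from the post, but the internals, the PricingEntry fields, and the in-memory test double are my assumptions; the protocol is restated here for completeness):

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable

@dataclass
class PricingEntry:
    region: str
    plan_type: str
    price_usd: float

@runtime_checkable
class PricingRepository(Protocol):
    async def upsert_pricing_data(self, data: List[PricingEntry]) -> None: ...

class PostgresPricingRepository:
    """Real implementation: SQL details stay behind this boundary."""

    def __init__(self, pool):
        # Constructor injection: any asyncpg-style pool works, which is
        # what makes swapping implementations in tests trivial.
        self._pool = pool

    async def upsert_pricing_data(self, data: List[PricingEntry]) -> None:
        async with self._pool.acquire() as conn:
            for e in data:
                await conn.execute(
                    "INSERT INTO pricing (region, plan_type, price_usd) "
                    "VALUES ($1, $2, $3) "
                    "ON CONFLICT (region, plan_type) "
                    "DO UPDATE SET price_usd = EXCLUDED.price_usd",
                    e.region, e.plan_type, e.price_usd,
                )

class InMemoryPricingRepository:
    """Drop-in test double satisfying the same protocol."""

    def __init__(self) -> None:
        self.saved: List[PricingEntry] = []

    async def upsert_pricing_data(self, data: List[PricingEntry]) -> None:
        self.saved.extend(data)
```

Because both classes satisfy the same protocol, callers depend only on the interface; the Postgres version and the in-memory version are interchangeable.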

The real win? Composition. Instead of scattering queries across modules, we grouped them by domain: PricingRepository, ScrapeLogRepository, ProviderMetadataRepository. Each had a single responsibility and lived in vultr_scraper/repositories/. The scraper itself became lean—just coordination and parsing. The database layer became reusable, documented, and boring in the best way.
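A minimal sketch of what "lean" means here, with the fetch and parse steps injected as callables (the class and parameter names are illustrative assumptions, not the actual scraper API):

```python
# Hypothetical shape of the scraper after the refactor: it coordinates
# fetching, parsing, and delegating persistence to a repository.
class PricingScraper:
    def __init__(self, fetch, parse, pricing_repo):
        # fetch and parse are injected callables; pricing_repo is anything
        # satisfying the PricingRepository protocol.
        self._fetch = fetch
        self._parse = parse
        self._repo = pricing_repo

    async def run(self) -> int:
        raw = await self._fetch()
        entries = self._parse(raw)
        await self._repo.upsert_pricing_data(entries)
        return len(entries)
```

No SQL, no connection handling, no conflict clauses: the scraper reads like a description of the workflow.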

Testing the Abstraction, Not Just the Code

A clean design is useless without confidence. So I doubled down on testing—specifically integration tests that validate the real behavior of these repositories against a live Postgres instance.

Using Docker and a Makefile, I spun up a local test database with the exact schema used in production. Then, I built data factories with factory_boy to generate realistic PricingEntry and ScrapeMetadata instances:

class PricingEntryFactory(AsyncSQLAlchemyModelFactory):
    class Meta:
        model = PricingEntry

    # Sensible defaults; tests override only the fields they care about.
    region = "sgp"
    plan_type = "cloud"
    price_usd = 5.0
    scrape_id = 1

These factories powered integration tests that verified upserts, conflict resolution, and data retrieval—no mocks, no fakes. Just real roundtrips. The test suite now runs in CI, catching breaking changes before they deploy.

I also added a local development environment with a docker-compose.yml setup, so new contributors could run the full stack—scraper, database, Redis—without touching their host machine. The Makefile tied it all together: make test, make up, make shell. Developer experience improved overnight.
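A minimal sketch of that glue, assuming a docker-compose.yml that defines app, db (Postgres), and redis services (the target bodies and service names are assumptions; only the target names come from the post):

```make
up:            ## start the full local stack in the background
	docker compose up -d

test: up       ## run the suite inside the app container
	docker compose run --rm app pytest

shell: up      ## open a shell in the app container
	docker compose run --rm app bash
```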

This wasn’t just a refactor. It was a shift in mindset. We stopped treating the database as a dumping ground and started treating it as a collaborator in a well-structured system. The 18 commits leading up to this weren’t just lines of code—they were guardrails for the future.

Now, when we add a new provider or change a schema, we do it once—in the repository—and trust that the rest of the system won’t break. That’s the kind of code I want to maintain. And ship.
