A cloud-based platform was built to automate the collection, processing, and maintenance of cruise data across multiple providers.
Cruise Data Automation
Custom scouts collect sailing schedules, pricing information, cabin details, and deck plans from cruise providers. Depending on the target website, a scout uses lightweight HTTP requests, standard browser automation (Playwright/Selenium), or an anti-detect browser framework (nodriver). To get past headless-detection scripts and anti-bot systems, browser-based scouts run inside Docker containers where Xvfb (X Virtual Framebuffer) provides a virtual display for the browser.
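The selection between the three collection methods, and the virtual display setup, can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the names `FetchStrategy`, `choose_strategy`, `xvfb_command`, and `browser_env`, the two site flags, and the display number `:99` are all assumptions made for the example.

```python
import os
from enum import Enum


class FetchStrategy(Enum):
    HTTP = "http"             # lightweight HTTP requests
    BROWSER = "browser"       # Playwright / Selenium
    ANTI_DETECT = "nodriver"  # anti-detect browser framework


def choose_strategy(needs_js: bool, has_bot_protection: bool) -> FetchStrategy:
    """Pick the cheapest strategy likely to work (hypothetical rule)."""
    if has_bot_protection:
        return FetchStrategy.ANTI_DETECT
    if needs_js:
        return FetchStrategy.BROWSER
    return FetchStrategy.HTTP


def xvfb_command(display: str = ":99", screen: str = "1920x1080x24") -> list[str]:
    """Build the Xvfb invocation that creates the virtual display."""
    return ["Xvfb", display, "-screen", "0", screen]


def browser_env(display: str = ":99") -> dict[str, str]:
    """Environment for a browser process attached to the virtual display."""
    env = dict(os.environ)
    env["DISPLAY"] = display
    return env
```

Inside the container, something like `subprocess.Popen(xvfb_command())` would start the display, after which the browser is launched with `env=browser_env()` so that it renders to Xvfb instead of running headless.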
PDF Parsing & OCR Cabin Extraction
For lines such as Windstar Cruises, cabin structures are extracted directly from PDF deck plans. The pipeline uses PyMuPDF to extract page images, passes them to an OCR engine (easyocr or pytesseract) to read cabin labels, and validates the output against database manifests to account for skips or non-cabin layouts.
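A rough sketch of this flow, using PyMuPDF and pytesseract, might look like the following. The cabin label pattern (three- or four-digit numbers), the function names, and the shape of the validation report are assumptions for illustration; the real label formats and manifest schema are not described here.

```python
import io
import re

# Hypothetical label format: three- or four-digit cabin numbers.
CABIN_LABEL = re.compile(r"\b\d{3,4}\b")


def ocr_pdf_pages(pdf_path: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to a PNG and OCR it.

    Third-party imports are kept local so the pure helpers below
    can be used without PyMuPDF or Tesseract installed.
    """
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(image))
    return texts


def extract_labels(ocr_text: str) -> set[str]:
    """Collect everything in the OCR text that looks like a cabin label."""
    return set(CABIN_LABEL.findall(ocr_text))


def validate_labels(found: set[str], manifest: set[str]) -> dict[str, set[str]]:
    """Compare OCR results with the cabins the database expects."""
    return {"missing": manifest - found, "unexpected": found - manifest}
```

A non-empty `missing` set points to labels the OCR pass skipped, while `unexpected` flags text that matched the pattern but is not a known cabin, for example numbers printed on a non-cabin layout.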
Dual-Backend Image Pipeline
Collected images (itineraries, covers, and daily galleries) pass through a dedicated media pipeline. To avoid redundant downloads and storage fees, images are deduplicated in memory by URL before downloading. The custom dual-backend uploader (GCSUploader) then uploads the processed assets to Cloudflare R2 through its S3-compatible API. Native Google Cloud Storage (GCS) is retained as a fallback and is used specifically for Virgin Voyages (vvy), whose upstream assets are managed on GCS.
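The URL deduplication and backend routing described above could be sketched like this. The helper names and the `"r2"`/`"gcs"` return values are assumptions for the example and are not taken from GCSUploader itself; the R2 client uses boto3 with the endpoint format that Cloudflare documents for its S3-compatible API. The GCS fallback path is not modeled.

```python
from typing import Iterable


def dedupe_by_url(urls: Iterable[str]) -> list[str]:
    """Return each URL once, preserving first-seen order."""
    return list(dict.fromkeys(urls))


def select_backend(cruise_line: str) -> str:
    """Route Virgin Voyages (vvy) to GCS and everything else to R2."""
    return "gcs" if cruise_line == "vvy" else "r2"


def make_r2_client(account_id: str, access_key: str, secret_key: str):
    """Create an S3 client pointed at Cloudflare R2 (boto3 imported lazily)."""
    import boto3

    return boto3.client(
        "s3",
        endpoint_url=f"https://{account_id}.r2.cloudflarestorage.com",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name="auto",
    )
```

Deduplicating before any download means each unique image costs one fetch and one stored object, no matter how many itineraries or galleries reference it.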
Two-Tier Deduplication & PostgreSQL Upserts
Data consistency is managed at two levels:
In-Memory Asset Deduplication: Itinerary and cover image maps are grouped before downloading so that duplicate images are not uploaded.
Database Upserting (ON CONFLICT): Records are written to the PostgreSQL database (Boylston database) using an ON CONFLICT (cruise_line, sailing_id) DO UPDATE pattern that updates an existing sailing only when its attributes differ from the incoming values, preventing duplicate rows and avoiding redundant database writes.
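One way to express the conditional upsert is a statement whose DO UPDATE clause is guarded by a row comparison with IS DISTINCT FROM, so unchanged rows are left alone. The sketch below builds such a statement; the table name `sailings` and the data columns `ship_name` and `departure_date` are hypothetical, and only the conflict key (cruise_line, sailing_id) comes from the description above.

```python
def build_upsert(table: str, key_cols: list[str], data_cols: list[str]) -> str:
    """Build a PostgreSQL upsert that only rewrites rows whose data changed.

    Identifiers are interpolated directly, so they must be trusted
    constants; values are bound later through named placeholders.
    """
    all_cols = key_cols + data_cols
    placeholders = ", ".join(f"%({c})s" for c in all_cols)
    assignments = ", ".join(f"{c} = EXCLUDED.{c}" for c in data_cols)
    current = ", ".join(f"{table}.{c}" for c in data_cols)
    incoming = ", ".join(f"EXCLUDED.{c}" for c in data_cols)
    return (
        f"INSERT INTO {table} ({', '.join(all_cols)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE "
        f"SET {assignments} "
        f"WHERE ({current}) IS DISTINCT FROM ({incoming})"
    )


# Hypothetical table and data columns, for illustration only.
UPSERT_SAILING = build_upsert(
    "sailings", ["cruise_line", "sailing_id"], ["ship_name", "departure_date"]
)
```

With a driver that accepts named placeholders in this style (psycopg, for instance), the statement would be run as `cursor.execute(UPSERT_SAILING, row)` where `row` is a dict keyed by column name. IS DISTINCT FROM treats NULL as a comparable value, so a change between NULL and a real value still triggers the update.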
Durable Pricing Scout Runtime
While Google Cloud Tasks dispatches catalog-scraping workflows, pricing updates are orchestrated by a persistent, database-backed Scout Runtime. This framework manages pricing-only updates in queued batches with cadences customized per provider (e.g., 24-hour cycles for high-volume lines like Royal Caribbean vs. 168-hour cycles for niche lines), ensuring up-to-date pricing with minimal compute overhead.
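The per-provider cadence check can be illustrated with a small scheduling sketch. The provider keys, the 168-hour default, the in-memory `last_runs` mapping, and the batch size are assumptions; the actual runtime keeps this state in the database rather than in a dict.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical cadence table: hours between pricing refreshes.
CADENCE_HOURS = {"royal_caribbean": 24}
DEFAULT_CADENCE_HOURS = 168  # assumed default for niche lines


def is_due(provider: str, last_run: Optional[datetime], now: datetime) -> bool:
    """A provider is due if it never ran or its cadence has elapsed."""
    if last_run is None:
        return True
    hours = CADENCE_HOURS.get(provider, DEFAULT_CADENCE_HOURS)
    return now - last_run >= timedelta(hours=hours)


def next_batch(
    last_runs: dict[str, Optional[datetime]], now: datetime, batch_size: int = 10
) -> list[str]:
    """Return due providers, never-run first, then least recently run."""
    oldest = datetime.min.replace(tzinfo=timezone.utc)
    due = [p for p, ts in last_runs.items() if is_due(p, ts, now)]
    due.sort(key=lambda p: last_runs[p] or oldest)
    return due[:batch_size]
```

Because the last-run timestamps are persisted, a restarted worker can recompute the same batch instead of losing the schedule, which is the property that makes the queue durable.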
Secure DB Connections
Workloads deployed to Google Cloud Run connect securely to the database using the Google Cloud SQL Python Connector with IAM database authentication enabled, removing the need to manage static credentials or rotate passwords manually.
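A minimal connection sketch with the Cloud SQL Python Connector and IAM authentication is shown below. The pg8000 driver choice and the helper names are assumptions; the instance connection name, service account, and database name are supplied by the caller rather than taken from the platform.

```python
def iam_db_user(service_account_email: str) -> str:
    """Cloud SQL for PostgreSQL names service-account IAM users
    without the '.gserviceaccount.com' suffix."""
    return service_account_email.removesuffix(".gserviceaccount.com")


def open_connection(instance: str, service_account_email: str, db_name: str):
    """Open an IAM-authenticated connection; no password is passed.

    `instance` uses the "project:region:instance" form. The connector
    import is local so the helper above works without the package.
    """
    from google.cloud.sql.connector import Connector

    connector = Connector()
    conn = connector.connect(
        instance,
        "pg8000",
        user=iam_db_user(service_account_email),
        db=db_name,
        enable_iam_auth=True,
    )
    # The caller should close both conn and connector when finished.
    return connector, conn
```

On Cloud Run the connector picks up the service's identity through Application Default Credentials, so the only secret-like value in the configuration is the instance connection name.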
Key Deliverables
- Built automated data collection workflows for 60+ cruise providers.
- Implemented HTTP scraping, Playwright/Selenium, and nodriver browser automation in Python.
- Implemented OCR-based cabin and deck plan extraction from PDF documents.
- Developed a dual-backend image pipeline using Cloudflare R2 and Google Cloud Storage.
- Created in-memory image deduplication and PostgreSQL-level sailing upserts.
- Implemented a durable scheduled pricing queue and database writer.
- Deployed the serverless platform using Google Cloud Run, Google Cloud Tasks, and Docker.