Automating Cruise Data Collection Across 60+ Cruise Providers

End-to-end automated cruise data platform with real-time collection, processing, and updates.

Infographic showing an automated cruise data collection platform that gathers sailing schedules, pricing, cabin details, deck plans, and images from 60+ cruise providers using Python, Playwright, Selenium, Cloud Run, PostgreSQL, and Cloudflare R2 with a fully automated 24/7 processing workflow.

Overview

Playbook Travel

Travel & Cruise Technology

Challenge

Managing cruise schedules, pricing, and cabin data across 60+ providers while maintaining accuracy and consistency.

Solution

Built an automated cloud platform that collects, processes, and updates cruise data without manual effort.

Python, Flask, Google Cloud Run, Cloudflare, Google Cloud Storage, Docker

Cruise Providers

60+

Automated sailing and cruise data collection across more than sixty cruise brands.

Automated Workflow

100%

Eliminated manual collection and processing through a fully automated cloud pipeline.

Cloud Processing

24/7

Enabled continuous data collection, processing, and database updates.

Client

Playbook Travel operates Cruise Scouts, a platform that aggregates sailing schedules, pricing information, cabin details, and deck plans from cruise operators around the world.

The business relies on accurate and timely cruise data to power customer-facing experiences and internal operations. As the number of supported cruise providers grew, maintaining that data manually became increasingly difficult and time-consuming.

Challenge

Collecting cruise data sounds straightforward until dozens of cruise providers are involved. Each website presents information differently. Some expose structured data while others require browser automation to access sailing information.

Key Issues

Supporting more than 60 cruise providers through a single platform.
Handling both simple scraping and complex browser-driven websites with anti-scraping defenses (e.g. Cloudflare).
Processing large volumes of images and sailing itineraries.
Preventing duplicate records from entering the database.
Extracting cabin labels and deck orientation from complex PDF deck plan documents.
Keeping cruise inventory updated without constant manual oversight.

The client needed a solution that could reliably collect, process, and maintain cruise data at scale while remaining easy to manage as the platform continued to grow.

Solution

A cloud-based platform was built to automate the collection, processing, and maintenance of cruise data across multiple providers.

Cruise Data Automation

Custom scouts collect sailing schedules, pricing information, cabin details, and deck plans from cruise providers. Depending on the target website, the system uses lightweight HTTP requests, standard browser automation (Playwright/Selenium), or anti-detect browser frameworks (nodriver). To bypass headless-detection scripts and anti-bot systems, the platform executes automation inside Docker containers using Xvfb (X Virtual Framebuffer) to run browsers within a virtual display.
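
The choice between the three collection modes can be sketched as a small dispatcher. This is an illustrative sketch only: `ProviderProfile`, `FetchStrategy`, `choose_strategy`, and the two boolean flags are hypothetical names, not taken from the platform's code, and the real selection logic may weigh other factors.

```python
from dataclasses import dataclass
from enum import Enum


class FetchStrategy(Enum):
    HTTP = "http"              # lightweight HTTP requests
    BROWSER = "browser"        # Playwright or Selenium
    ANTI_DETECT = "nodriver"   # nodriver running under a virtual display


@dataclass(frozen=True)
class ProviderProfile:
    # Hypothetical per-provider settings, assumed for illustration.
    code: str
    needs_javascript: bool = False    # content is rendered client-side
    has_bot_protection: bool = False  # e.g. a Cloudflare challenge page


def choose_strategy(profile: ProviderProfile) -> FetchStrategy:
    """Pick the cheapest strategy that can still reach the provider's data."""
    if profile.has_bot_protection:
        return FetchStrategy.ANTI_DETECT
    if profile.needs_javascript:
        return FetchStrategy.BROWSER
    return FetchStrategy.HTTP
```

Ordering the checks from most to least demanding keeps plain HTTP as the default, so a browser is only started for providers that need one. Inside a container, one common way to give a headed browser a virtual display is to launch the process under `xvfb-run`; whether the platform does exactly that is not stated in the source.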

PDF Parsing & OCR Cabin Extraction

For lines like Windstar Cruises, cabin structures are extracted directly from PDF deck plans. The pipeline utilizes PyMuPDF to extract page images, passes them to OCR engines (easyocr / pytesseract) to read cabin labels, and validates the output against database manifests to account for skips or non-cabin layouts.
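
A minimal sketch of the render, OCR, and validate steps follows. It assumes PyMuPDF, pytesseract, and Pillow are installed. The function names, the three-to-four-digit cabin label pattern, and the manifest being a set of label strings are all assumptions for illustration; the actual label formats and manifest schema are not given in the source.

```python
import io
import re

# Assumed label format: three or four digits, e.g. "101" or "1204".
CABIN_LABEL = re.compile(r"\b\d{3,4}\b")


def ocr_deck_plan(pdf_path: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to an image and return the OCR text per page."""
    # Imported lazily so validate_cabins() works without these packages.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pixmap = page.get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pixmap.tobytes("png")))
            texts.append(pytesseract.image_to_string(image))
    return texts


def validate_cabins(ocr_text: str, manifest: set[str]) -> dict[str, set[str]]:
    """Compare OCR-detected labels with the cabins expected for a deck."""
    found = set(CABIN_LABEL.findall(ocr_text))
    return {
        "confirmed": found & manifest,   # read by OCR and present in the manifest
        "missing": manifest - found,     # expected but not read (misreads, skipped numbers)
        "unexpected": found - manifest,  # read but not a known cabin
    }
```

Keeping validation separate from OCR means a misread such as `1O4` shows up as a missing cabin rather than silently entering the database.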

Dual-Backend Image Pipeline

Collected images (itineraries, covers, and daily galleries) are processed using a specialized media pipeline. To minimize storage fees, images are deduplicated in-memory by URL before downloading. Once processed, the custom dual-backend storage uploader (GCSUploader) uploads assets to Cloudflare R2 via an S3-compatible API. Native Google Cloud Storage (GCS) is retained as a fallback and specifically for Virgin Voyages (vvy), where upstream assets are managed on GCS.
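
The two decisions described above, dropping repeated URLs before any download and picking a storage backend per provider, might look like the sketch below. `dedupe_urls`, `choose_backend`, `GCS_ONLY_PROVIDERS`, and the `r2_available` flag are hypothetical names; only the `vvy` code and the R2-first, GCS-fallback behaviour come from the description.

```python
from typing import Iterable

# Providers whose assets stay on Google Cloud Storage (from the description: Virgin Voyages).
GCS_ONLY_PROVIDERS = frozenset({"vvy"})


def dedupe_urls(urls: Iterable[str]) -> list[str]:
    """Drop empty and repeated URLs, keeping the first occurrence in order."""
    cleaned = (u.strip() for u in urls if u and u.strip())
    return list(dict.fromkeys(cleaned))


def choose_backend(provider_code: str, r2_available: bool = True) -> str:
    """Return "r2" by default, or "gcs" for GCS-only providers and as a fallback."""
    if provider_code in GCS_ONLY_PROVIDERS or not r2_available:
        return "gcs"
    return "r2"
```

This sketch treats two URLs as duplicates only when the strings match exactly. Because the check runs before downloading, each repeated URL costs neither bandwidth nor a stored object.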

Two-Tier Deduplication & PostgreSQL Upserts

Data consistency is managed at two levels:

In-Memory Asset Deduplication: Grouping itinerary and cover maps before downloading to prevent uploading duplicate images.

Database Upserting (ON CONFLICT): Writing records to the PostgreSQL database (Boylston database) using an ON CONFLICT (cruise_line, sailing_id) DO UPDATE pattern that updates an existing sailing only when its attributes have changed, preventing duplicate rows and avoiding redundant database writes.
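
One way to express the upsert is to generate the statement from the key and value columns, with a WHERE clause built on IS DISTINCT FROM so unchanged rows are left untouched. This is a sketch under assumptions: the `sailings` table name and the `price` and `departure_date` columns are hypothetical, and only the `(cruise_line, sailing_id)` conflict key comes from the description. Placeholders use the `%(name)s` style accepted by psycopg.

```python
from typing import Sequence


def build_sailing_upsert(
    table: str, key_cols: Sequence[str], value_cols: Sequence[str]
) -> str:
    """Build a PostgreSQL upsert that only rewrites rows whose values changed.

    Identifiers are interpolated directly, so they must be trusted constants,
    never user input. Row values are passed separately as query parameters.
    """
    columns = [*key_cols, *value_cols]
    placeholders = ", ".join(f"%({c})s" for c in columns)
    assignments = ", ".join(f"{c} = EXCLUDED.{c}" for c in value_cols)
    # IS DISTINCT FROM treats NULL as a comparable value, unlike <>.
    changed = " OR ".join(
        f"{table}.{c} IS DISTINCT FROM EXCLUDED.{c}" for c in value_cols
    )
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_cols)}) "
        f"DO UPDATE SET {assignments} "
        f"WHERE {changed}"
    )


UPSERT_SQL = build_sailing_upsert(
    "sailings",                      # hypothetical table name
    ["cruise_line", "sailing_id"],   # conflict key from the description
    ["price", "departure_date"],     # hypothetical value columns
)
```

With a psycopg cursor, the statement would be run as `cursor.execute(UPSERT_SQL, row)`, where `row` is a dict keyed by column name. The conflict target requires a unique constraint or index on `(cruise_line, sailing_id)`.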

Durable Pricing Scout Runtime

While Google Cloud Tasks dispatches catalog-scraping workflows, pricing updates are orchestrated by a persistent, database-backed Scout Runtime. This framework manages pricing-only updates in queued batches with cadences customized per provider (e.g., 24-hour cycles for high-volume lines like Royal Caribbean vs. 168-hour cycles for niche lines), ensuring up-to-date pricing with minimal compute overhead.

Secure DB Connections

Workloads deployed to Google Cloud Run connect securely to the database using the Google Cloud SQL Python Connector with IAM database authentication enabled, removing the need to manage static credentials or rotate passwords manually.
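
A connection of this kind might be opened as sketched below, assuming the `cloud-sql-python-connector` package with the `pg8000` driver. The function names and all argument values are placeholders. For Cloud SQL for PostgreSQL, the IAM database user for a service account is its email without the `.gserviceaccount.com` suffix, which the first helper derives.

```python
GSA_SUFFIX = ".gserviceaccount.com"


def iam_db_user(service_account_email: str) -> str:
    """Return the PostgreSQL IAM username for a service account email."""
    if service_account_email.endswith(GSA_SUFFIX):
        return service_account_email[: -len(GSA_SUFFIX)]
    return service_account_email


def open_connection(instance_connection_name: str, service_account_email: str, database: str):
    """Open an IAM-authenticated connection; the caller closes both objects."""
    # Imported lazily so iam_db_user() is usable without the connector installed.
    from google.cloud.sql.connector import Connector

    connector = Connector()
    conn = connector.connect(
        instance_connection_name,  # "project:region:instance"
        "pg8000",
        user=iam_db_user(service_account_email),
        db=database,
        enable_iam_auth=True,      # no password: a short-lived IAM token is used
    )
    return connector, conn
```

No password appears anywhere in the call, so there is no static credential to store or rotate. The Cloud Run service account still needs IAM permission to connect to the instance and a matching database user.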

Key Deliverables

Built automated data collection workflows for 60+ cruise providers.
Implemented HTTP scraping, Playwright/Selenium, and nodriver browser automation in Python.
Implemented OCR-based cabin and deck plan extraction from PDF documents.
Developed a dual-backend image pipeline utilizing Cloudflare R2 and Google Cloud Storage.
Created in-memory image deduplication and PostgreSQL-level sailing upserts.
Implemented a durable scheduled pricing queue and database writer.
Deployed the serverless platform using Google Cloud Run, Google Cloud Tasks, and Docker.

Tools Used

Python
Flask
Playwright
Google Cloud Run
Google Cloud Tasks
Google Cloud Storage
Cloudflare
PostgreSQL
Docker

Results

Expanded Cruise Coverage

Automated the collection of sailing schedules, pricing information, cabin details, and deck plans across more than 60 cruise providers, creating a centralized source of cruise inventory data.

End-to-End Automation

Built a fully automated pipeline that handles data collection, image processing, deduplication, and database updates without manual intervention.

Reliable Cloud Infrastructure

Enabled continuous operation through Google Cloud Run and Google Cloud Tasks, allowing the platform to process and update cruise data at scale.

Duplicate-Free Data Management

Combined URL-based image deduplication with PostgreSQL upserts and validation workflows to keep sailing records accurate and free of duplicates.

Centralized Asset Management

Stored and managed cruise images through Cloudflare R2, creating a consistent workflow for asset delivery across the platform.

Impact

The project transformed a complex and fragmented data collection process into a centralized platform capable of operating at scale.

Business Impact

Reduced the operational effort required to maintain cruise inventory.
Improved consistency across sailing, pricing, and cabin information.
Eliminated duplicate records through automated validation workflows.
Increased reliability through automated processing and cloud infrastructure.
Simplified onboarding of additional cruise providers.

Long-Term Value

The platform provides a scalable foundation for continued growth. By automating data collection, image management, and downstream processing, Playbook Travel can expand cruise coverage without significantly increasing operational overhead. The architecture also supports future integrations, additional providers, and higher data volumes while maintaining consistency and reliability.
