Overview

Automated daily scraping system for Singapore job listings (JobStreet) and real estate rentals (99.co). Collects thousands of listings per day, enriches them with company data, and stores structured data for economic research.

Technical Stack

  • Scraping: Python 3.12, Camoufox (anti-detection browser), BrowserForge, BeautifulSoup4
  • Infrastructure: AWS Lambda, S3, SNS, ECR, Docker
  • Data: Pandas, NDJSON format
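NDJSON (one JSON object per line) pairs naturally with Pandas for append-only scrape output. A minimal sketch of the round trip — field names here are illustrative, not the actual schema:

```python
import io
import pandas as pd

# Illustrative records; the real pipeline stores scraped listing fields.
records = [
    {"job_id": "j1", "title": "Data Analyst", "salary": 4500},
    {"job_id": "j2", "title": "Economist", "salary": 6000},
]
df = pd.DataFrame(records)

# Write NDJSON: orient="records" + lines=True emits one JSON object per line.
buf = io.StringIO()
df.to_json(buf, orient="records", lines=True)

# Read it straight back into a DataFrame.
buf.seek(0)
round_trip = pd.read_json(buf, orient="records", lines=True)
```

Because each line is independent, daily files can be concatenated or streamed without re-parsing the whole dataset.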

Architecture

A 4-stage pipeline that separates concerns:

  1. Card Scraper - Scrapes job listing cards from search results, extracts 14 metadata fields
  2. Cleaner/Deduplicator - Fixes datetime edge cases, deduplicates by job ID
  3. Detail Enricher - Visits each job URL to extract full descriptions, company UEN, employee count, industry data
  4. Summary Reporter - Aggregates daily statistics and sends email notifications via SNS
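Stage 2 can be sketched as a single pass that normalizes timestamps and keeps the freshest record per job ID. This is a hypothetical sketch, assuming naive timestamps should be treated as UTC and that the last scrape wins; field names are assumptions:

```python
from datetime import datetime, timezone

def clean_and_dedupe(records):
    """Normalize datetimes, then keep one record per job_id (latest scrape)."""
    seen = {}
    for rec in records:
        # Datetime edge case: make naive timestamps timezone-aware (assume UTC)
        # so comparisons between records never raise.
        ts = rec.get("scraped_at")
        if isinstance(ts, datetime) and ts.tzinfo is None:
            rec["scraped_at"] = ts.replace(tzinfo=timezone.utc)
        prev = seen.get(rec["job_id"])
        if prev is None or rec["scraped_at"] > prev["scraped_at"]:
            seen[rec["job_id"]] = rec
    return list(seen.values())

jobs = [
    {"job_id": "a1", "scraped_at": datetime(2024, 5, 1, 8, 0)},           # naive
    {"job_id": "a1", "scraped_at": datetime(2024, 5, 2, 8, 0, tzinfo=timezone.utc)},
    {"job_id": "b2", "scraped_at": datetime(2024, 5, 1, 9, 0)},
]
deduped = clean_and_dedupe(jobs)  # a1 kept once (May 2 copy), b2 kept
```

Keeping the latest record per ID also resolves duplicates produced by overlapping scrape windows (see Resilience below).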

Engineering Challenges

  • Anti-bot detection: Used Camoufox with realistic fingerprints and humanized interactions
  • Lambda constraints: Worked around the missing /dev/shm and the read-only filesystem (only /tmp is writable), and fit within the 15-minute execution limit via chunked processing
  • Resilience: Auto-restart on timeouts, graceful 404 handling, deduplication across overlapping scrape windows
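The chunked-processing idea can be sketched as a time-budgeted loop: stop before Lambda's hard limit and hand the remainder to the next invocation. Function and parameter names here are hypothetical; the `clock` parameter exists only to make the cutoff testable:

```python
import time

# Leave headroom below the hard 15-minute Lambda timeout.
TIME_BUDGET_S = 13 * 60

def process_with_budget(urls, process_one, budget_s=TIME_BUDGET_S,
                        clock=time.monotonic):
    """Process URLs until the budget is spent; return (results, leftovers)."""
    start = clock()
    done = []
    for i, url in enumerate(urls):
        if clock() - start > budget_s:
            # Out of time: the caller persists urls[i:] (e.g. to S3)
            # so the next invocation resumes where this one stopped.
            return done, urls[i:]
        done.append(process_one(url))
    return done, []

# Demonstrate the cutoff with a fake clock that advances one "second" per call.
ticks = iter(range(100))
fake_clock = lambda: next(ticks)
done, remaining = process_with_budget(
    ["u1", "u2", "u3"], lambda u: u.upper(), budget_s=1, clock=fake_clock
)
```

The budget check runs before each item, so a slow page never pushes the run past the deadline by more than one item's processing time.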

Output

Daily volume of 1,000-3,000 job listings, each including job metadata, salary, location, the company's registration number (UEN), and the full job description.
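One output listing serializes to a single NDJSON line. The field names and values below are illustrative assumptions, not the actual schema:

```python
import json

# Hypothetical record shape for one enriched listing.
record = {
    "job_id": "12345678",
    "title": "Research Economist",
    "salary_min": 6000,
    "salary_max": 8000,
    "currency": "SGD",
    "location": "Singapore",
    "company": "Example Pte Ltd",
    "uen": "201912345A",          # company registration number
    "description": "Full job description text...",
}

line = json.dumps(record)   # exactly one line of the daily NDJSON file
parsed = json.loads(line)   # consumers recover the same structure
```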