Overview

Automated daily scraping system for Singapore job listings (JobStreet) and real estate rentals (99.co). Collects thousands of listings per day, enriches them with company data, and stores structured data for economic research.

Technical Stack

  • Scraping: Python 3.12, Camoufox (anti-detection browser), BrowserForge, BeautifulSoup4
  • Infrastructure: AWS Lambda, S3, SNS, ECR, Docker
  • Data: Pandas, NDJSON format
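NDJSON (one JSON object per line) pairs naturally with Pandas for append-only scrape output. A minimal sketch of the round trip — field names here are illustrative, not the actual schema:

```python
import io
import pandas as pd

# Illustrative records; the real pipeline stores scraped listing fields.
records = [
    {"job_id": "j1", "title": "Data Analyst", "salary": 4500},
    {"job_id": "j2", "title": "Economist", "salary": 6000},
]
df = pd.DataFrame(records)

# Write NDJSON: orient="records" + lines=True emits one JSON object per line.
buf = io.StringIO()
df.to_json(buf, orient="records", lines=True)

# Read it straight back into a DataFrame.
buf.seek(0)
round_trip = pd.read_json(buf, orient="records", lines=True)
```

Because each line is independent, daily files can be concatenated or streamed without re-parsing the whole dataset.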

Architecture

A 4-stage pipeline that separates concerns:

  1. Card Scraper - Scrapes job listing cards from search results, extracts 14 metadata fields
  2. Cleaner/Deduplicator - Fixes datetime edge cases, deduplicates by job ID
  3. Detail Enricher - Visits each job URL to extract full descriptions, company UEN, employee count, industry data
  4. Summary Reporter - Aggregates daily statistics and sends email notifications via SNS
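Stage 2 can be sketched as a single pass that normalizes timestamps and keeps the freshest record per job ID. This is a hypothetical sketch, assuming naive timestamps should be treated as UTC and that the last scrape wins; field names are assumptions:

```python
from datetime import datetime, timezone

def clean_and_dedupe(records):
    """Normalize datetimes, then keep one record per job_id (latest scrape)."""
    seen = {}
    for rec in records:
        # Datetime edge case: make naive timestamps timezone-aware (assume UTC)
        # so comparisons between records never raise.
        ts = rec.get("scraped_at")
        if isinstance(ts, datetime) and ts.tzinfo is None:
            rec["scraped_at"] = ts.replace(tzinfo=timezone.utc)
        prev = seen.get(rec["job_id"])
        if prev is None or rec["scraped_at"] > prev["scraped_at"]:
            seen[rec["job_id"]] = rec
    return list(seen.values())

jobs = [
    {"job_id": "a1", "scraped_at": datetime(2024, 5, 1, 8, 0)},           # naive
    {"job_id": "a1", "scraped_at": datetime(2024, 5, 2, 8, 0, tzinfo=timezone.utc)},
    {"job_id": "b2", "scraped_at": datetime(2024, 5, 1, 9, 0)},
]
deduped = clean_and_dedupe(jobs)  # a1 kept once (May 2 copy), b2 kept
```

Keeping the latest record per ID also resolves duplicates produced by overlapping scrape windows (see Resilience below).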

Engineering Challenges

  • Anti-bot detection: Used Camoufox with realistic fingerprints and humanized interactions
  • Lambda constraints: Worked around the missing /dev/shm and the read-only filesystem (only /tmp is writable), and fit within the 15-minute execution limit via chunked processing
  • Resilience: Auto-restart on timeouts, graceful 404 handling, deduplication across overlapping scrape windows
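The chunked-processing idea can be sketched as a time-budgeted loop: stop before Lambda's hard limit and hand the remainder to the next invocation. Function and parameter names here are hypothetical; the `clock` parameter exists only to make the cutoff testable:

```python
import time

# Leave headroom below the hard 15-minute Lambda timeout.
TIME_BUDGET_S = 13 * 60

def process_with_budget(urls, process_one, budget_s=TIME_BUDGET_S,
                        clock=time.monotonic):
    """Process URLs until the budget is spent; return (results, leftovers)."""
    start = clock()
    done = []
    for i, url in enumerate(urls):
        if clock() - start > budget_s:
            # Out of time: the caller persists urls[i:] (e.g. to S3)
            # so the next invocation resumes where this one stopped.
            return done, urls[i:]
        done.append(process_one(url))
    return done, []

# Demonstrate the cutoff with a fake clock that advances one "second" per call.
ticks = iter(range(100))
fake_clock = lambda: next(ticks)
done, remaining = process_with_budget(
    ["u1", "u2", "u3"], lambda u: u.upper(), budget_s=1, clock=fake_clock
)
```

The budget check runs before each item, so a slow page never pushes the run past the deadline by more than one item's processing time.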

Output

Daily volume of 1,000-3,000 job listings, each including job metadata, salary, location, the company's registration number (UEN), and the full job description.
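One output listing serializes to a single NDJSON line. The field names and values below are illustrative assumptions, not the actual schema:

```python
import json

# Hypothetical record shape for one enriched listing.
record = {
    "job_id": "12345678",
    "title": "Research Economist",
    "salary_min": 6000,
    "salary_max": 8000,
    "currency": "SGD",
    "location": "Singapore",
    "company": "Example Pte Ltd",
    "uen": "201912345A",          # company registration number
    "description": "Full job description text...",
}

line = json.dumps(record)   # exactly one line of the daily NDJSON file
parsed = json.loads(line)   # consumers recover the same structure
```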