At BeautyFeeds.io, we track a growing number of beauty product SKUs across a wide range of ecommerce websites, including global retailers like Sephora, Ulta, Mecca, and Shopee. Each site has its own structure, behavior, and data challenges.
Managing a distributed scraper system by hand, or with standalone scripts, quickly stops scaling. That's why we use Apache Airflow to orchestrate our data collection workflows: it gives us the control and visibility needed for consistent, reliable scraping without exposing our core scraping infrastructure.
In this post, we’ll walk through how we use Airflow to support recurring, site-specific scraping logic while maintaining modularity and reliability.
Why We Use Apache Airflow
- Declarative DAGs: Each retailer gets its own DAG (Directed Acyclic Graph) with its own scraping logic and interval.
- Built-in Scheduling: Scrapers run on a cadence that matches how often each site changes, from weekly to daily.
- Retries + Alerts: Failed tasks retry automatically with backoff, and alerts notify our team in real time (see the sketch below).
- UI Monitoring: Airflow’s dashboard helps us monitor and troubleshoot runs easily.
Airflow's own documentation covers its scheduling capabilities in more depth.
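To make this concrete, here is a minimal sketch of how those defaults can be wired up, assuming Airflow 2.x. The DAG id, owner, schedule, and `notify_on_failure` callback are illustrative placeholders rather than our production configuration.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Hypothetical alert hook: the real callback would post to an internal channel.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed: {context.get('exception')}")


default_args = {
    "owner": "beautyfeeds",
    "retries": 3,                              # automatic retries on failure
    "retry_delay": timedelta(minutes=5),       # initial wait between attempts
    "retry_exponential_backoff": True,         # back off further on repeated failures
    "on_failure_callback": notify_on_failure,  # real-time alerting
}

with DAG(
    dag_id="sephora_scraper",                  # one DAG per retailer
    default_args=default_args,
    schedule_interval="@daily",                # high-change site, refreshed daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["scraping", "sephora"],
) as dag:
    # Placeholder task; the real DAG chains the fetch/scrape/parse steps described below.
    heartbeat = PythonOperator(
        task_id="heartbeat",
        python_callable=lambda: print("sephora_scraper DAG parsed and scheduled"),
    )
```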
How We Define Scraper Flows
Each DAG corresponds to a target site. A typical flow includes three core steps:
- Fetch Product URLs
- Scrape Product Pages
- Parse + Save to Storage
Modeling each site as its own DAG keeps flows isolated and gives us task-level failure handling; a minimal sketch of this pattern follows.
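Assuming Airflow 2.x and its TaskFlow API, a per-site DAG for these three steps might look roughly like this. The function bodies are hypothetical stand-ins for our actual scraping and storage code.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    dag_id="ulta_product_flow",
    schedule_interval="@weekly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def ulta_product_flow():
    @task
    def fetch_product_urls() -> list[str]:
        # Step 1: collect the product URLs to visit (e.g. from category pages or a sitemap).
        return ["https://example.com/product/1", "https://example.com/product/2"]

    @task
    def scrape_product_pages(urls: list[str]) -> list[str]:
        # Step 2: download raw HTML for each URL (actual fetching logic omitted).
        return [f"<html>placeholder for {u}</html>" for u in urls]

    @task
    def parse_and_save(pages: list[str]) -> None:
        # Step 3: extract structured fields and persist them (storage call omitted).
        print(f"parsed and saved {len(pages)} pages")

    # Chain the three steps; TaskFlow passes data between them via XCom.
    parse_and_save(scrape_product_pages(fetch_product_urls()))


ulta_product_flow()
```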
Scheduling by Category or Product Type
We use flexible rules for crawl frequency:
- High-demand or seasonal categories are refreshed more often
- Lower-priority products are scheduled monthly
- Dynamic scheduling is supported using Airflow Variables
This keeps our crawl efficient without overloading any single site; the sketch below shows the Variables-based approach.
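For example, a DAG's schedule can be read from an Airflow Variable at parse time, so crawl frequency can be adjusted from the Airflow UI without redeploying code. The Variable key and DAG below are illustrative, assuming a recent Airflow 2.x release.

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.empty import EmptyOperator

# Read the crawl cadence from an Airflow Variable so it can be changed in the UI
# (e.g. "@daily" during a sale, "@monthly" for low-priority categories).
# The key name is hypothetical; the default keeps the DAG valid if it isn't set.
crawl_schedule = Variable.get("mecca_crawl_schedule", default_var="@weekly")

with DAG(
    dag_id="mecca_scraper",
    schedule_interval=crawl_schedule,
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Placeholder; the real DAG runs the fetch/scrape/parse chain shown earlier.
    placeholder = EmptyOperator(task_id="crawl_placeholder")
```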
Where the Data Goes
Once structured, the data is pushed to three destinations (a brief export sketch follows the list):
- Cloud storage for historical access
- Internal databases for search & analysis
- XLSX/API output for clients via BeautyFeeds.io
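As a rough illustration of the export side, not our production loader, a parsed batch could be written to XLSX and archived to cloud storage along these lines. The pandas/boto3 choices, bucket name, and paths are all assumptions.

```python
import boto3
import pandas as pd


def save_outputs(rows: list[dict], xlsx_path: str, s3_bucket: str, s3_key: str) -> None:
    df = pd.DataFrame(rows)

    # Client-facing XLSX export (requires an Excel writer such as openpyxl).
    df.to_excel(xlsx_path, index=False)

    # Historical copy in cloud storage (S3 used here purely as an example).
    boto3.client("s3").upload_file(xlsx_path, s3_bucket, s3_key)


save_outputs(
    rows=[{"sku": "123456", "brand": "ExampleBrand", "price": 19.99}],
    xlsx_path="/tmp/products.xlsx",
    s3_bucket="beautyfeeds-archive",            # hypothetical bucket
    s3_key="sephora/2024-01-01/products.xlsx",  # hypothetical key layout
)
```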
Key Takeaways
- Isolated DAGs = stable pipelines per site
- Automatic retries + alerting reduce manual oversight
- Smart scheduling gives us control over crawl frequency
Wrapping Up
Apache Airflow has helped us move from manually managed scripts to a flexible, observable, and automated scraping pipeline.
If you’re working on large-scale product data extraction or ecommerce intelligence, structured orchestration is essential.
Want to see what kind of data we deliver? Explore BeautyFeeds.io to view our structured beauty product datasets and use cases.