Member-only story
Apache Airflow on Docker for Complete Beginners
Airflow — it’s not just a word Data Scientists use when they fart. It’s a powerful open source tool originally created by Airbnb to design, schedule, and monitor ETL jobs. But what exactly does that mean, and why is the community so excited about it?
Background: OLTP vs. OLAP, Analytics Needs, and Warehouses
Every company starts out with some group of tables and databases that are operation critical. These might be an orders table, a users table, and an items table if you’re an e-commerce company: your production application uses those tables as a backend for your day-to-day operations. This is what we call OLTP, or Online Transaction Processing. A new user signs up and a row gets added to the users table — mostly insert, update, or delete operations.
As companies mature (this point is getting earlier and earlier these days), they’ll want to start running analytics. How many users do we have? How have our order counts been growing over time? What are our most popular items? These are more complex questions and will tend to require aggregation (sum, average, maximum) as well as a few joins to other tables. We call this OLAP, or Online Analytical Processing.
The most important differences between OLTP and OLAP operations is what their priorities are: