What do ETL pipelines in FANG look like?

I'm a SQL guy, so ETL (Extract, Transform, Load) for me is a bunch of SSIS packages with FTP batch scripts, T-SQL SPs, and overnight Agent jobs. But the 'modern' pipeline is probably different. Any pointers or links to engineering blogs on what ETL looks like at your company? #interview #facebook #amazon #google

Google google Aug 6, 2020

Less raw SQL, more Python.

Snapchat bigTCenrgy Aug 6, 2020

Apache Airflow
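
For readers coming from SSIS packages and Agent jobs: an Airflow pipeline is a Python file that declares tasks and wires them into a DAG, which the scheduler then runs on a cron-like schedule. A minimal sketch, with hypothetical DAG and task names (the operator import path is the Airflow 1.x one current as of this thread):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    """Pull source data, e.g. from an API or an OLTP replica."""
    ...

def transform():
    """Apply business logic; in practice often a Spark or SQL step."""
    ...

def load():
    """Write the result into the warehouse."""
    ...

with DAG(
    dag_id="nightly_etl",            # hypothetical name
    start_date=datetime(2020, 8, 1),
    schedule_interval="@daily",      # the Agent-job equivalent
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies replace job-step ordering in an Agent job.
    t_extract >> t_transform >> t_load
```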

New chibofh Aug 6, 2020

Non-FANG here. Our DW team runs Tableau + Oracle and MSSQL; my team runs Kafka + KSQL for newer stuff. Initially I implemented a solution with Python for transformation; we are now using streaming for most newer processes.
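
For a sense of what that streaming transform looks like in code: the poster's stack is KSQL, but a rough Python analogue using the kafka-python client (my substitution, not their setup; topic names and the transform are made up) would process each record as it arrives rather than in a nightly batch:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders_raw",                                   # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for message in consumer:
    record = message.value
    # Transform step: the kind of logic a T-SQL SP would apply in batch.
    record["amount_usd"] = round(record["amount_cents"] / 100, 2)
    producer.send("orders_clean", record)           # hypothetical sink topic
```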

Fannie Mae ShoMeDMony OP Aug 6, 2020

When you say streaming, does that mean data constantly flows from OLTP to OLAP with just some ETL in the mix, without batching it up nightly?

New chibofh Aug 6, 2020

Yes that is correct. The DW team does nightly batching on their stuff.

Facebook checkcall Aug 6, 2020

Facebook has its own Airflow equivalent, built with integrations to Hive/Spark and Presto, and highly optimized for our datasets, which have tens of billions of rows per day. Many of our tables have 1T rows per day; some are orders of magnitude bigger. I work on a team very closely related to ^

Fannie Mae ShoMeDMony OP Aug 6, 2020

Hi checkcall, thanks for the response. Are you a Data Engineer at FB? Can I DM you with a couple of follow-up questions, if you have time? I'm looking for a bit of help with study material and mock interviews (and I can pay for your time if you're interested).

Facebook checkcall Aug 7, 2020

I’m not a DE, sure you can DM

Stripe scheme Aug 6, 2020

Airflow + Spark with Scala/Spark SQL
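
The poster's stack is Scala, but the same extract-SQL-transform-load step is easy to sketch in PySpark (the path, table, and column names here are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily_rollup").getOrCreate()

# Extract: read one day's partition of raw events (hypothetical S3 path).
events = spark.read.parquet("s3://bucket/events/ds=2020-08-06/")
events.createOrReplaceTempView("events")

# Transform: plain SQL, which is why SQL folks tend to ramp up fast here.
daily = spark.sql("""
    SELECT user_id, COUNT(*) AS event_count
    FROM events
    GROUP BY user_id
""")

# Load: write the aggregate back out, partitioned by date.
daily.write.mode("overwrite").parquet("s3://bucket/daily_rollup/ds=2020-08-06/")
```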

Capital One uGob61 Aug 6, 2020

More live streaming data: Kafka, plus very large data warehouses with storage and compute separated, each independently elastic.

Fannie Mae ShoMeDMony OP Aug 6, 2020

What is your process for updating data in the DW without locking up reads? How often do you do it - with or without downtime?

PLAYSTUDIOS tomcat14 Aug 6, 2020

New workplace: PySpark + Airflow. Current workplace: AWS Kinesis, S3, Python, and Snowflake; some old pipelines still run on Talend. FB has its own custom ETL tool built out - a DE at FB needs knowledge of Presto + Python at a minimum. Apple has Spark. Some teams at Amazon use Spark, Redshift, Glue, and Athena. Search for ETL pipelines built using Python/Spark on YouTube/Google - you'll find great videos/Medium articles.
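
Since Presto + Python comes up repeatedly in this thread, here is a minimal sketch of querying Presto from Python with the open-source presto-python-client package (host, catalog, and table names are hypothetical, and FB's internal tooling presumably differs):

```python
# pip install presto-python-client
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator host
    port=8080,
    user="etl_user",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# The kind of aggregate a DE might run over a daily partition.
cur.execute("""
    SELECT user_id, COUNT(*) AS event_count
    FROM events
    WHERE ds = '2020-08-06'
    GROUP BY user_id
""")
for user_id, event_count in cur.fetchall():
    print(user_id, event_count)
```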