Airflow - any help is appreciated

Happy Monday! Hello, I am an analytics engineer working mostly with semiconductor test data and financial data for the BU, automating a few manual tasks, and creating Power BI dashboards. I mostly use Python and SQL. I report to a director, and everyone else around me is a manager who is an expert in electrical engineering but has no knowledge of IT.

Example project: we get test data daily from offshore teams in Korea and China, which they dump into SharePoint. I created a Python ETL to extract the data, transform it, and load it into SQL Server, and the table is then connected to Power BI. There are almost 12 ETLs in total for this project, and I am using Windows Task Scheduler to run them daily. I am now working on different small projects, mostly with flat-file data and SQL Server data, and the number of ETLs has reached almost 25. Sometimes they fail because of issues with the incoming data and other….

Skyworks has ADF infra, so I created a POC of 1 ETL and demonstrated it to my director. He is happy with it, but IT has a policy that does not allow electrical engineering teams to use it. So I explored open source batch ETL orchestration tools and started looking into Airflow. I completed Marc Lamberti's Udemy course, watched many YouTube tutorials to understand it, and am also following the Harenslak data pipeline book, but they mostly use a Linux-based OS and everything works perfectly for them.

I am using Windows, so I installed Docker, then the Airflow image, and am trying to run DAGs. I was able to create my first DAG, but the main issues I am stuck on are installing external Python packages on the Docker image, getting SQL Server table data into the DAG environment, … I tried many YouTube videos, Medium articles, and Stack Overflow questions, but everywhere they explain only basic things, like simple methods without any parameters and nothing on transformations.

I will be very thankful if someone can guide me on how to:

1. Install additional Python packages on the Airflow image in Docker on Windows
2. Install drivers to connect to databases

Thank you very much for reading.

TC: 115k (pls don’t bang me on this TC, that’s the highest Skyworks can pay for my level and also I am very happy with the people I work with)

#Airflow #data engineer #dataanalytics

Komodo Health abcpq Mar 27, 2023

Installing Airflow on a Windows machine is a pain. Is it not possible for you to get a Linux machine? It seems like you are using Azure, so you can easily start a new VM with a Linux instance. Also, a managed Airflow version is now available in Azure, which is OK to test as a POC.

Skyworks Solutions ikigai_357 OP Mar 27, 2023

That’s the problem: I don’t belong to the IT team. I am in an electrical engineering team, so IT will not allow me to use ADF. Also, I am not using any Azure services. Even the POC I did in my personal account, to demonstrate that we are capable of developing scripts, but we got a straight no.

Sana Biotechnology wYSx00 Mar 27, 2023

So escalate. Who gives a shit what IT thinks. They will do as they are told.

NVIDIA rtxavomni Mar 27, 2023

I’ve done this on Linux. For Python packages, check your Python home location, or wherever it’s picking Python packages up from, and install them there. Or, if you’re using docker-compose for everything, then that is probably better.
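To see where Python is picking packages up from, something like the sketch below can be run with the same interpreter Airflow uses (inside the container, if you are on Docker). It only uses the standard library; the function name `python_locations` is made up for illustration.

```python
import sys
import sysconfig


def python_locations():
    """Collect the interpreter path and the places it looks for packages."""
    return {
        # Path of the Python binary that is actually running this script.
        "executable": sys.executable,
        # Default directory where pip installs pure-Python packages.
        "site_packages": sysconfig.get_paths()["purelib"],
        # Full import search path, in lookup order.
        "search_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in python_locations().items():
        print(f"{key}: {value}")
```

If the `site_packages` directory printed here is not where your packages were installed, they went into a different Python than the one running your DAGs.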

Skyworks Solutions ikigai_357 OP Mar 27, 2023

Yes, I am using docker compose.

Skyworks Solutions ikigai_357 OP Mar 27, 2023

Something like this

NVIDIA rtxavomni Mar 27, 2023

Put all your bindings in a requirements file and rebuild the image. I’m guessing you have access to the outside internet from the VPN or box you’re using.
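After the rebuild, a quick check run inside the container can confirm the packages actually made it into the image. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and `pandas` and `pyodbc` are placeholder names standing in for whatever your requirements file lists.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            # Raises PackageNotFoundError if the distribution is not installed.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Placeholder list: replace with the entries from your requirements file.
    wanted = ["pandas", "pyodbc"]
    print("missing:", missing_packages(wanted))
```

An empty list means every package in `wanted` is installed for that interpreter. Anything listed as missing was not picked up by the rebuild.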

Skyworks Solutions ikigai_357 OP Mar 27, 2023

Sure will try