When they say "building data pipelines", does that mean taking the metadata from websites, like how many people clicked on the site or how many people stayed on the website for X number of minutes, and storing that data in a database? And then Spark is used to manage the large amounts of data that's being gathered, Airflow is used to control the timing of when that data is collected, and cloud services like AWS/Azure/Google Cloud are where you store it? How do you get that metadata? Isn't that stuff already being stored when the website is created, like in a database already? Because when they create the website they have to store people's info somewhere to have accounts, right?
Try googling "data pipelines" and reading about it from any company that has a tech blog. Or Youtubing it if you prefer video.
Even within the field of data engineering there are many roles, but you did get some of them correct. Being a data engineer means you could be doing anything from:
- Real-time data collection and processing to insert into DBs, and creating data lakes/warehouses
- Setting up data streams to connect producers and consumers
- Writing transformations and other computations on the data
- Setting up ETL jobs, staging and production tables, and managing clusters for your data lake
- Collecting and vending metrics, and monitoring those metrics

As for the metadata stuff, it is not usually stored when the website is created. User clickstreams are not automatically inserted into a database, because the raw events would not be very useful without adequate processing first. A data engineer or other backend eng usually has to define DB schemas for the data, do data cleanup and sanitizing, maybe transformations, etc., so that it actually has significant value to whoever is querying and using it. They have to set up low-latency methods (data pipelines) of getting data to other people around the company. They have to set up monitoring on their pipelines to make sure data volume and quality are as expected. And they usually have to make sure the people consuming their data can use it well and are not running into problems with data quality or processing. (A sketch of that cleanup step is below.)
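To make the cleanup/sanitizing step concrete, here is a minimal Python sketch, assuming a made-up raw clickstream event; the field names (user_id, page_url, event_ts) and the output schema are hypothetical, not any particular company's format:

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Hypothetical raw clickstream event shape as it might arrive from a collector.
REQUIRED_FIELDS = {"user_id", "page_url", "event_ts"}

def clean_event(raw: dict) -> dict | None:
    """Validate and normalize one raw click event; return None to drop it."""
    if not REQUIRED_FIELDS.issubset(raw):
        return None  # incomplete event, drop it
    url = urlparse(raw["page_url"])
    if not url.netloc:
        return None  # malformed URL, drop it
    try:
        # Assume the collector sends epoch milliseconds.
        ts = datetime.fromtimestamp(int(raw["event_ts"]) / 1000, tz=timezone.utc)
    except (ValueError, OSError):
        return None  # unparseable timestamp, drop it
    # Normalized record matching a (hypothetical) warehouse table schema.
    return {
        "user_id": str(raw["user_id"]),
        "domain": url.netloc,
        "path": url.path or "/",
        "event_time": ts.isoformat(),
    }

if __name__ == "__main__":
    raw = {"user_id": 42, "page_url": "https://example.com/pricing",
           "event_ts": "1700000000000"}
    print(clean_event(raw))
```

In a real pipeline a function like this would run inside a Spark job or stream processor over millions of events, with the dropped-event rate itself becoming one of the monitored data-quality metrics mentioned above.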
This. Also, if you’re dealing with big data, building the pipeline is more challenging because you have to tune various settings (e.g. memory) if you use distributed computing platforms.
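For example, on Spark a lot of that tuning happens through session config. A minimal sketch; the values below are placeholders that depend entirely on your cluster size and workload, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative memory/parallelism tuning for a Spark job; every number here
# is a placeholder you would adjust per cluster and workload.
spark = (
    SparkSession.builder
    .appName("clickstream-etl")
    .config("spark.executor.memory", "8g")          # per-executor heap
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.sql.shuffle.partitions", "400")  # shuffle parallelism
    .config("spark.memory.fraction", "0.6")         # execution/storage split
    .getOrCreate()
)
```

Getting these wrong is how you end up with out-of-memory executors or a job that shuffles forever, which is a big part of why "big data" pipelines are harder than the single-machine version.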
It’s not just metadata. Usually all of the company’s data is pipelined to a data warehouse or data lake. These stores/environments are optimized for analytics, unlike the transactional databases that services typically use. The service is using that transactional database to “run the website,” as you were saying in your post. These analytics stores are used to generate dashboards and to do statistical analysis for the business, e.g. how well is my A/B test variant performing. Maybe for other things too, like machine learning.
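To make the A/B test example concrete, here is a toy version using DuckDB as a stand-in for the warehouse; the tables (experiment_assignments, events) and the 'checkout_completed' event name are made up for illustration:

```python
import duckdb  # stand-in for a real warehouse engine in this sketch

con = duckdb.connect()
# Hypothetical tables: who saw which variant, and what events they fired.
con.execute("""
    CREATE TABLE experiment_assignments AS
    SELECT * FROM (VALUES (1, 'control'), (2, 'control'),
                          (3, 'treatment'), (4, 'treatment'))
        AS t(user_id, variant)
""")
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM (VALUES (2, 'checkout_completed'),
                          (3, 'checkout_completed'),
                          (4, 'checkout_completed'))
        AS t(user_id, event_name)
""")

# The kind of wide aggregate an analytics store is optimized for,
# and that a transactional DB serving live traffic is not.
print(con.execute("""
    SELECT a.variant,
           COUNT(DISTINCT a.user_id) AS users,
           COUNT(DISTINCT e.user_id) AS converted,
           1.0 * COUNT(DISTINCT e.user_id)
               / COUNT(DISTINCT a.user_id) AS conversion_rate
    FROM experiment_assignments a
    LEFT JOIN events e
      ON e.user_id = a.user_id AND e.event_name = 'checkout_completed'
    GROUP BY a.variant
    ORDER BY a.variant
""").fetchall())
```

The same query pattern against the production OLTP database would compete with live traffic, which is exactly why the data gets pipelined out to a separate analytics store first.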
Your post sounds like someone threw a buzzword at you and now you have to explain it to someone else. Here is what you need to read up on:
- 3-tier web apps
- OAuth 2.0 workflow and identity providers
- Ad pixels

That reading will give you an idea of how website data is handled. (A toy pixel endpoint is sketched below.)
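On ad pixels specifically, since that answers OP's "how do you get that metadata" question: a tracking pixel is just a tiny image request whose real job is to log the hit. A minimal server-side sketch with Flask; the endpoint path, query parameters, and logged fields are all made up:

```python
import logging
from flask import Flask, request, Response

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# Smallest valid transparent 1x1 GIF, returned so the <img> tag loads.
PIXEL_GIF = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00!"
    b"\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00"
    b"\x02\x02D\x01\x00;"
)

@app.route("/pixel.gif")
def pixel():
    # This log line is the actual product: it becomes a raw clickstream
    # event that a pipeline later cleans and loads into the warehouse.
    logging.info(
        "pixel hit: page=%s referrer=%s ua=%s",
        request.args.get("page"),
        request.referrer,
        request.user_agent.string,
    )
    return Response(PIXEL_GIF, mimetype="image/gif")

# Embedded on a page as (hypothetical host):
# <img src="https://tracker.example.com/pixel.gif?page=/pricing"
#      width="1" height="1">
```

So no, this stuff is not "already in the database" from building the site; someone has to deliberately instrument the pages and then build the pipeline that turns these logs into usable tables.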
Not a question for Blind, but can't blame you, you're from KPMG.
Confirmed boomer
How condescending you both are!!