Hi community,

Context: In my current company we have a data pipeline (one of our biggest), which in short works like this:
1. We get raw JSON events from Kafka dumped into S3.
2. A daily batch job (Airflow) picks up the raw JSONs from S3 and applies our data-parser logic.
3. The parser logic is simply a service written in Python where we explicitly define which attributes we want from the raw JSON, and those attributes are extracted accordingly. Nested attributes are taken care of as well.
4. After parsing, the filtered JSON is loaded into a dataframe, converted to CSV/Parquet, and dumped into another S3 folder.
5. The processed CSV is then loaded into tables, which are used for analytics, ML models, etc.

Problem:
1. Today, for every new event we generate and get from Kafka/S3 (raw event), we have to write parser logic from scratch if the event structure is different. For small changes to an event we can update the attributes we want to parse in the code itself.
2. After that we have to deploy the service changes (regardless of whether the change to an event's parser is small or big), which again takes time.
3. If someone from a different team publishes an event to Kafka and has to write the parser logic, they have to learn our codebase and understand how to write a parser, which can be a time-consuming learning curve.

Solution (what I can think of):
1. Could we have a UI where we abstract all this code away (making it language agnostic)? The engineer would simply select the attributes they want from the JSON (including nested attributes), the batch job would pick up that selection, apply the parser logic chosen in the UI, and run the rest of the pipeline as is. (A rough sketch of what such a config-driven parser could look like is below.)
2. This would help us avoid tons of parser code, avoid or minimize deployments for smaller changes, and as our data volume grows I think it would be better to abstract things, make them language agnostic, and streamline the process. There are probably other benefits as well.

Other thoughts:
1. Is there a smarter way of doing this?
2. Do we have any open-source alternatives here?
3. Any good engineering blogs that have covered a similar scenario?
4. How do big companies handle humongous volumes of data and parse the relevant stuff into their tables, data lake, etc.?
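To make point 1 of the solution concrete, here is a rough sketch of the kind of config-driven parser I have in mind: the only per-event input is a list of dotted attribute paths (which a UI could emit), and one generic function does the extraction. All names, paths, and the sample event are made up for illustration.

```python
import json
from functools import reduce

import pandas as pd

# The "parser config" an engineer would select in a UI (or commit as a file):
# one dotted path per attribute to keep, nested attributes included.
SELECTED_PATHS = ["event_id", "user.id", "user.geo.country", "payload.amount"]

def pluck(record: dict, dotted_path: str):
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    return reduce(
        lambda node, key: node.get(key) if isinstance(node, dict) else None,
        dotted_path.split("."),
        record,
    )

def parse_events(raw_lines, paths=SELECTED_PATHS) -> pd.DataFrame:
    """Turn newline-delimited raw JSON events into a flat DataFrame of selected attributes."""
    rows = []
    for line in raw_lines:
        event = json.loads(line)
        rows.append({path: pluck(event, path) for path in paths})
    return pd.DataFrame(rows)

if __name__ == "__main__":
    sample = ['{"event_id": "e1", "user": {"id": 7, "geo": {"country": "IN"}}, "payload": {"amount": 42.5}}']
    df = parse_events(sample)
    df.to_parquet("parsed_events.parquet")  # or to_csv(), then upload to the processed S3 prefix
```

The idea is that adding or changing attributes for an event becomes a config change rather than new parser code plus a deployment.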
Why are you not using Spark to do this? Amazon has lots of internal tooling. Look at Cradle; you can use spark-sql to do this with minimal effort.
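For reference, something along these lines is usually all it takes with Spark. Bucket paths and field names below are placeholders, not your actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-event-parsing").getOrCreate()

# Spark infers the nested schema from the raw JSON dump.
raw = spark.read.json("s3://raw-events-bucket/dt=2024-01-01/")
raw.createOrReplaceTempView("raw_events")

# Nested attribute selection is just dotted column references in SQL.
parsed = spark.sql("""
    SELECT event_id,
           user.id          AS user_id,
           user.geo.country AS country,
           payload.amount   AS amount
    FROM raw_events
""")

parsed.write.mode("overwrite").parquet("s3://processed-events-bucket/dt=2024-01-01/")
```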
OP might have missed this step. I'm pretty sure they have some Spark job for aggregation.
Nope, that's the issue: the entire distributed processing is written in Dask (I understand it's old, but it's legacy code). We are in the process of migrating, but that will take time.
We have somewhat of a similar process, but with a lot more sources and different parsing logic. What I would suggest is lookup tables or schema files. You should have one main parser per source, or a couple if the sources are completely different. The parser reads the schema/fields from the lookup table, so all you have to do is pass the name of the variation and the script looks up the fields for that variation...
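Roughly the shape of it (everything here is illustrative; in practice the lookup lives in an actual table or schema file, not a dict):

```python
# A single parser keyed by the variation name, with the field lists living in
# a lookup table. Names and paths are made up.
FIELD_LOOKUP = {
    "checkout_event": ["event_id", "user.id", "cart.total"],
    "login_event":    ["event_id", "user.id", "device.os"],
}

def fields_for(variation: str) -> list[str]:
    """Return the attribute paths the main parser should extract for a variation."""
    try:
        return FIELD_LOOKUP[variation]
    except KeyError:
        raise ValueError(f"no fields registered for variation {variation!r}")

# The batch job only ever passes the variation name; adding a new event type
# means adding an entry above, not writing a new parser.
paths = fields_for("checkout_event")
print(paths)  # ['event_id', 'user.id', 'cart.total']
```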
Dm if you need more info...
We have a similar use case reading data from elasticsearch -> s3 -> redshift. When we need to load a new ES index or add fields/nested fields, all we have to do is add the relevant info in the lookup tables...
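So onboarding a new index or a new nested field ends up being a metadata change rather than a code change. A made-up example of what such lookup rows might contain:

```python
# Illustrative only: the kind of rows we add to the lookup table when a new ES
# index or a new (possibly nested) field needs to flow through to Redshift.
# Table and column names are invented.
NEW_LOOKUP_ROWS = [
    # (source_index, target_table,       source_field_path, target_column)
    ("orders_v2",    "analytics.orders", "order.id",        "order_id"),
    ("orders_v2",    "analytics.orders", "order.buyer.id",  "buyer_id"),
    ("orders_v2",    "analytics.orders", "order.total.usd", "total_usd"),
]

for row in NEW_LOOKUP_ROWS:
    print(row)  # in reality these would be INSERTed into the lookup table
```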
Ain't reading all that
Ok. But I would really appreciate your input if you have solved the above problem in your experience, because I'm stuck and trying to figure out a way forward.
fr tho XD