Hi community,

Context: In my current company we have a data pipeline (one of our biggest), which in short works like this:
1. We get raw JSON events from Kafka dumped into S3.
2. A daily batch job (Airflow) picks up the raw JSONs from S3 and applies our data-parser logic.
3. The parser logic is simply a service written in Python where we explicitly define which attributes we want from the raw JSON, and those attributes are extracted accordingly. Nested attributes are taken care of as well.
4. After parsing, the filtered JSON is loaded into a dataframe, converted to CSV/Parquet, and dumped into another S3 folder.
5. The processed CSV is then loaded into tables, which are used for analytics, ML models, etc.

Problem:
1. Today, for every new event we generate and get from Kafka/S3 (raw event), we have to write parser logic from scratch if the event structure is different. For small changes to an event we can update the attributes we want to parse in the code itself.
2. After that we have to deploy the service changes (regardless of whether the change to an event's parser is small or big), which again takes time.
3. If someone from a different team publishes an event to Kafka and has to write the parser logic, they have to learn our codebase and understand how to write a parser, which can be a time-consuming learning curve.

Solution (what I can think of):
1. Could we have a UI where we abstract all this code away (making it language agnostic)? The engineer would simply select the attributes they want from the JSON (including nested attributes), the batch job would pick up that selection, apply the parser logic chosen in the UI, and run the rest of the pipeline as is. (A rough sketch of what such a config-driven parser could look like is below.)
2. This would help us avoid tons of parser code, avoid or minimize deployments for smaller changes, and as our data volume grows I think it would be better to abstract things, make them language agnostic, and streamline the process. There are probably other benefits as well.

Other thoughts:
1. Is there a smarter way of doing this?
2. Do we have any open-source alternatives here?
3. Any good engineering blogs that have covered a similar scenario?
4. How do big companies handle humongous volumes of data and parse the relevant stuff into their tables, data lake, etc.?
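To make point 1 of the solution concrete, here is a rough sketch of the kind of config-driven parser I have in mind: the only per-event input is a list of dotted attribute paths (which a UI could emit), and one generic function does the extraction. All names, paths, and the sample event are made up for illustration.

```python
import json
from functools import reduce

import pandas as pd

# The "parser config" an engineer would select in a UI (or commit as a file):
# one dotted path per attribute to keep, nested attributes included.
SELECTED_PATHS = ["event_id", "user.id", "user.geo.country", "payload.amount"]

def pluck(record: dict, dotted_path: str):
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    return reduce(
        lambda node, key: node.get(key) if isinstance(node, dict) else None,
        dotted_path.split("."),
        record,
    )

def parse_events(raw_lines, paths=SELECTED_PATHS) -> pd.DataFrame:
    """Turn newline-delimited raw JSON events into a flat DataFrame of selected attributes."""
    rows = []
    for line in raw_lines:
        event = json.loads(line)
        rows.append({path: pluck(event, path) for path in paths})
    return pd.DataFrame(rows)

if __name__ == "__main__":
    sample = ['{"event_id": "e1", "user": {"id": 7, "geo": {"country": "IN"}}, "payload": {"amount": 42.5}}']
    df = parse_events(sample)
    df.to_parquet("parsed_events.parquet")  # or to_csv(), then upload to the processed S3 prefix
```

The idea is that adding or changing attributes for an event becomes a config change rather than new parser code plus a deployment.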
Why are you not using Spark to do this? Amazon has lots of internal tooling. Look at Cradle; you can use spark-sql to do this with minimal effort.
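For reference, something along these lines is usually all it takes with Spark. Bucket paths and field names below are placeholders, not your actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-event-parsing").getOrCreate()

# Spark infers the nested schema from the raw JSON dump.
raw = spark.read.json("s3://raw-events-bucket/dt=2024-01-01/")
raw.createOrReplaceTempView("raw_events")

# Nested attribute selection is just dotted column references in SQL.
parsed = spark.sql("""
    SELECT event_id,
           user.id          AS user_id,
           user.geo.country AS country,
           payload.amount   AS amount
    FROM raw_events
""")

parsed.write.mode("overwrite").parquet("s3://processed-events-bucket/dt=2024-01-01/")
```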
OP might have missed this step. I'm pretty sure they have some Spark job for aggregation.
Nope, that's the issue: the entire distributed processing is written in Dask (I understand it's old, but it's legacy code). We are in the process of migrating, but that will take time.
We have somewhat of a similar process, but with a lot more sources and different parsing logic. What I would suggest is lookup tables or schema files. You should have one main parser per source, or a couple if the sources are completely different. The parser reads the schema/fields from the lookup table, so all you have to do is pass the name of the variation and the script looks up the fields for that variation...
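Roughly the shape of it (everything here is illustrative; in practice the lookup lives in an actual table or schema file, not a dict):

```python
# A single parser keyed by the variation name, with the field lists living in
# a lookup table. Names and paths are made up.
FIELD_LOOKUP = {
    "checkout_event": ["event_id", "user.id", "cart.total"],
    "login_event":    ["event_id", "user.id", "device.os"],
}

def fields_for(variation: str) -> list[str]:
    """Return the attribute paths the main parser should extract for a variation."""
    try:
        return FIELD_LOOKUP[variation]
    except KeyError:
        raise ValueError(f"no fields registered for variation {variation!r}")

# The batch job only ever passes the variation name; adding a new event type
# means adding an entry above, not writing a new parser.
paths = fields_for("checkout_event")
print(paths)  # ['event_id', 'user.id', 'cart.total']
```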
Dm if you need more info...
We have a similar use case reading data from elasticsearch -> s3 -> redshift. When we need to load a new ES index or add fields/nested fields, all we have to do is add the relevant info in the lookup tables...
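So onboarding a new index or a new nested field ends up being a metadata change rather than a code change. A made-up example of what such lookup rows might contain:

```python
# Illustrative only: the kind of rows we add to the lookup table when a new ES
# index or a new (possibly nested) field needs to flow through to Redshift.
# Table and column names are invented.
NEW_LOOKUP_ROWS = [
    # (source_index, target_table,       source_field_path, target_column)
    ("orders_v2",    "analytics.orders", "order.id",        "order_id"),
    ("orders_v2",    "analytics.orders", "order.buyer.id",  "buyer_id"),
    ("orders_v2",    "analytics.orders", "order.total.usd", "total_usd"),
]

for row in NEW_LOOKUP_ROWS:
    print(row)  # in reality these would be INSERTed into the lookup table
```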
Ain't reading all that
Ok. But I would really appreciate your input if you have solved the above problem in your experience, because I'm stuck and trying to figure out a way forward.
fr tho XD