Hi ML/DS/DE/MLOps friends. I am building my team to do ML and some supporting DE work, and I'm also evaluating the best platform for ML tooling, orchestration, monitoring, etc. Between Databricks and Dataiku, which one do you suggest? Please comment with your reasons. #databricks #Dataiku #Sagemaker
Did you consider AzureML as well? I'd lean more towards Databricks, but their orchestration is data-centric. I like AzureML's jobs and experimentation tools better.
Our company uses AWS, and I have had positive experiences with Azure ML Studio before.
So what's bad about Dataiku? I know Spark is kind of embedded in Databricks, but if we are not dealing with data lakes, don't we benefit from the AI-centric full-cycle tool in Dataiku?
Doing good AI requires good data and processes. Databricks will enable you to build your data stack for the long term.
You mean Dataiku doesn't provide that?
If you want a fully code-centric, Spark-based platform, Databricks is a great choice. They are a strong technology with a very loyal fanbase made up of data scientists and data engineers. If upskilling, reuse, collaboration amongst your team, and interpretability with your business stakeholders are important to you - maybe look more towards Dataiku. But hey - totally biased here 😄
I am a Dataiku employee, so take what I have to say with a grain of salt; I drank the Dataiku Kool-Aid long ago. Comparing DB to Dataiku is, in my mind, comparing apples and oranges. Why? Dataiku started its life as, and continues to be, one platform for the entire lifecycle, with the option to go both code and click regardless of your storage layer or compute engine. While most people are still trying to build their data pipeline, Dataiku users have already connected to their data, prepped it, and are now doing feature engineering and more interesting work. At the same time, you can still code in the platform through built-in notebooks or through external IDEs (PyCharm, Sublime, VSCode - my fav - and RStudio). Once you have built your model, you can take it somewhere else or let Dataiku help get it into the hands of your users through Dataiku Apps, webapps, RMarkdown, REST APIs, or just schedule it to do something for someone.

What do you give up? Well, you are on a platform; to make use of that platform you will have some coupling. "But I want to be able to move away from any platform at any time." You can avoid most coupling inside of Dataiku, but you will lose out on some of the functionality. That said, the speed at which you will be able to get stuff done should more than make up for it. Also, with Dataiku you bring your own storage and compute to the platform; we don't dictate those. You wanna push SQL and JARs to Snowflake? Sure. You wanna connect and make use of EKS for your work? Dataiku will help with that. Maybe you have some weird SQL Server or Oracle still hanging around - gotcha covered.

Funny story: we started life on the edge node of Hadoop clusters (yes, you can use Scala, PySpark, SparkR, Python, Hive, Impala, R, and Julia with Dataiku). Then a funny thing happened: both K8s and Snowflake came along and everyone decided to ditch their Hadoop clusters. Dataiku users just changed the storage and compute they were attached to. No code refactoring necessary, because Dataiku abstracts the compute and storage away from the projects themselves (there's a small sketch of what that looks like at the end of this comment).

That said, the experience you get with DB will be simpler in terms of managing the compute layer, because DB owns the compute layer. Not even the clouds can make cluster management as easy as DB. Dataiku tries to abstract the storage and compute complexity away from the end user, but there is work that needs to be done up front. DB makes it dirt simple to get a cluster going and running quickly.

From my perspective, DB pitches itself as, and continues to be, a storage and compute layer for DS/AI/ML. That's not the same as a platform for AI/ML. DB compares themselves to Snowflake, but nobody would mistake Snowflake for an AI/ML platform. DB emphasizes the use of their clusters and the Delta format for storage. The orgs that like DB will be very code-heavy and more build-versus-buy. DB is a fantastic engine, but you will need to build everything around it. You will have every choice, but you will also be responsible for everything.

As you can see by this poll, most people would choose DB. Personally, I think that has a lot to do with name recognition; that's 6,000 Microsoft reps around the world spamming DB into their accounts (sorry, I doubt most people here even know Dataiku). Easiest thing to do: we have a 2-week trial and you can deploy an AMI into your tenant. Try it and see how it works for you now, and again 18 months from now when hopefully you have three dozen or so projects in production. Do the same for DB. But that's just my opinion; I could be wrong.
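To make the abstraction point a little more concrete: a Python recipe in Dataiku reads and writes datasets by name rather than by connection string, so swapping the underlying storage or compute doesn't touch the code. A minimal sketch (the dataset and column names here are made up for illustration):

```python
# Minimal sketch of a Dataiku Python recipe. Dataset names below are
# hypothetical; the storage backend behind each dataset (Snowflake, S3,
# SQL Server, ...) is resolved from the dataset's connection settings.
import dataiku

# Read the input dataset into a pandas DataFrame; no storage-specific code.
customers = dataiku.Dataset("customers_prepared")
df = customers.get_dataframe()

# A simple feature engineering step.
df["tenure_years"] = df["tenure_days"] / 365.0

# Write the result back to an output dataset, again by name only.
features = dataiku.Dataset("customer_features")
features.write_with_dataframe(df)
```

If the team later moves that input from Hadoop to Snowflake, the recipe above stays the same; only the dataset's connection changes.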
Thank you for sharing your POV. I have actually decided to go with Dataiku, based on the fact that we care more about the AI/ML side than the DE side. I'll probably form my own opinion 6 months from now and write about it somewhere, as I agree that DB is more popular largely because many people don't even know Dataiku.
Make sure to use https://academy.dataiku.com - that's where we have all our free online learning materials, from the simple getting-started content to the most advanced topics. Also, while it's small and still growing, the Dataiku Community has some really good folks on it, and of course a large part of our DS and dev teams watch it and answer questions. Good luck, and I'd love to hear your feedback on what we do well and how we can improve.
Not quite 6 months yet, but we are getting there. So OP, how is it going with Dataiku?
We haven’t started yet!
Did you see any advantages of Dataiku over Databricks? Anything you learnt while scoping them out over the last few months?
Is there anyone who has used Dataiku that would suggest Databricks over it?
Yes, use Databricks. Many people are familiar with Spark in Scala or PySpark. Managing a large Databricks instance with many users is difficult, so make sure you set up all the extra features like SCIM provisioning.
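On the SCIM point: that's about syncing users and groups from your identity provider (Azure AD, Okta, etc.) instead of managing them by hand. Normally you configure it from the IdP side, but if you want to see what's happening under the hood, the workspace SCIM REST API looks roughly like this (the host, token, and user below are placeholders, not real values):

```python
# Rough sketch of creating a user via the Databricks workspace SCIM API.
# Host, token, and user are placeholders; in practice provisioning is
# usually driven by your identity provider, not called by hand like this.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder; keep real tokens in a secrets manager

payload = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
    "userName": "new.analyst@example.com",  # hypothetical user
    "entitlements": [{"value": "allow-cluster-create"}],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/Users",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/scim+json",
    },
    json=payload,
)
resp.raise_for_status()
print(resp.json().get("id"))  # the new user's workspace ID
```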
Not sure I understood your point.