Databricks in San Francisco offers the Databricks Lakehouse Platform (formerly the Unified Analytics Platform), a data science platform and Apache Spark cluster manager. The Databricks Unified Data Service aims to provide a reliable and scalable platform for data pipelines, data lakes, and data platforms. Users can manage full data journey, to ingest, process, store, and expose data throughout an organization. Its Data Science Workspace is a collaborative environment for practitioners to run…
$0.07
Per DBU
dbt
Score 9.5 out of 10
N/A
dbt is an SQL development environment, developed by Fishtown Analytics, now known as dbt Labs. The vendor states that with dbt, analysts take ownership of the entire analytics engineering workflow, from writing data transformation code to deployment and documentation. dbt Core is distributed under the Apache 2.0 license, and paid Teams and Enterprise editions are available.
If you need a managed big data megastore, which has native integration with highly optimized Apache Spark Engine and native integration with MLflow, go for Databricks Lakehouse Platform. The Databricks Lakehouse Platform is a breeze to use and analytics capabilities are supported out of the box. You will find it a bit difficult to manage code in notebooks but you will get used to it soon.
If you can load your data first into your warehouse, dbt is excellent. It does the T(ransformation) part of ELT brilliantly but does not do the E(xtract) or L(oad) part. If you know SQL or your development team knows SQL, it's a framework and extension around that. So, it's easy to learn and easy to hire people with that technical skill (as opposed to specific Informatica, SnapLogic, etc. experience). dbt uses plain text files and integrates with GitHub. You can easily see the changes made between versions. In GUI-based UIs it was always hard to tell what someone had changed. Each "model" is essentially a "SELECT" statement. You never need to do a "CREATE TABLE" or "CREATE VIEW" - it's all done for you, leaving you to work on the business logic. Instead of saying "FROM specific_db.schema.table" you indicate "FROM ref('my_other_model')". It creates an internal dependency diagram you can view in a DAG. When you deploy, the dependencies work like magic in your various environments. They also have great documentation, an active slack community, training, and support. I like the enhancements they have been making and I believe they are headed in a good direction.
Connect my local code in Visual code to my Databricks Lakehouse Platform cluster so I can run the code on the cluster. The old databricks-connect approach has many bugs and is hard to set up. The new Databricks Lakehouse Platform extension on Visual Code, doesn't allow the developers to debug their code line by line (only we can run the code).
Maybe have a specific Databricks Lakehouse Platform IDE that can be used by Databricks Lakehouse Platform users to develop locally.
Visualization in MLFLOW experiment can be enhanced
Slow load times of the dbt cloud environment (they're working on it via a new UI though)
More out-of-the-box solutions for managing procedures, functions, etc would be nice to have, but honestly, it's pretty easy to figure out how to adapt dbt macros
Because it is an amazing platform for designing experiments and delivering a deep dive analysis that requires execution of highly complex queries, as well as it allows to share the information and insights across the company with their shared workspaces, while keeping it secured.
in terms of graph generation and interaction it could improve their UI and UX
One of the best customer and technology support that I have ever experienced in my career. You pay for what you get and you get the Rolls Royce. It reminds me of the customer support of SAS in the 2000s when the tools were reaching some limits and their engineer wanted to know more about what we were doing, long before "data science" was even a name. Databricks truly embraces the partnership with their customer and help them on any given challenge.
Databricks has a much better edge than Synapse in hundred different ways. Databricks has Photon engine, faster available release in cloud and databricks does not run on Open source spark version so better optimization, better performance and better agility and all kind of performance boost can be achieved in Databricks rather Open source synapse spark
Most ETL pipeline products have a T layer, but dbt just does it better. The transformation is on steroids compared to the others. Also, just allows much more Adhoc solutions for very specific projects. Those ETL tools are probably better on the T part if you don't need too many transforms - also dbt is pretty much free dependent on how you work it, also extremely scalable.
The problem with this tool and all other ones that are at the top of the industry, it's so expensive that soon as another one will be on the market and deliver the same or different value, it will be catastrophic for them. So you get the fact that they are cashing every dime right now like SAS or Hadoop once did. Now, look at them
Again, another level of professional services, this is not their biggest strength but this is the cherry on top. I couldn't think about any other professional services like this one. Now I'm talking about meaningful services that really help out our project and delivery.