Databricks for modern-day ETL
January 31, 2019
Score 9 out of 10
Vetted Review
Verified User
Overall Satisfaction with Databricks Unified Analytics Platform
Data from APIs is streamed into our One Lake environment, which is built on Amazon S3.
Once the raw data lands in S3, we use Databricks to write Spark SQL queries and PySpark jobs that process it into relational tables and views.
Those views are then used by our data scientists and modelers to generate business value in many places, such as building new models, creating new audit files, and producing exports.
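As an illustration of that processing step, the flattening of raw API records into relational-style rows can be sketched in plain Python (record shape and field names are hypothetical; the real jobs run as PySpark/Spark SQL on a Databricks cluster):

```python
import json

# Hypothetical raw API records as they might land in S3, one JSON object per line.
RAW_LINES = [
    '{"user": {"id": 1, "name": "Ann"}, "event": "click", "ts": "2019-01-31T10:00:00"}',
    '{"user": {"id": 2, "name": "Bob"}, "event": "view", "ts": "2019-01-31T10:01:00"}',
]

def flatten(line):
    """Flatten one nested JSON record into a flat, relational-style row."""
    rec = json.loads(line)
    return {
        "user_id": rec["user"]["id"],
        "user_name": rec["user"]["name"],
        "event": rec["event"],
        "ts": rec["ts"],
    }

rows = [flatten(line) for line in RAW_LINES]
```

In a production notebook this logic would instead use PySpark (for example `spark.read.json(...)` followed by column selections), with the resulting DataFrame registered as a table or view for downstream consumers.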
- Process raw data in our One Lake (S3) environment into relational tables and views
- Share notebooks with our business analysts so they can run the queries and generate value from the data
- Try out PySpark and Spark SQL queries on raw data before using them in our Spark jobs
- Make modern-day ETL operations easy, with access mechanisms for different sets of customers
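The kind of exploratory SQL we try in a notebook before promoting it to a Spark job can be sketched with Python's built-in sqlite3 as a stand-in (table and column names are hypothetical; in Databricks this would be `spark.sql(...)` against a registered view):

```python
import sqlite3

# In-memory stand-in for a processed relational table in the lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "click"), (1, "view"), (2, "click")],
)

# Exploratory aggregate: events per user, the shape of query analysts later reuse.
result = conn.execute(
    "SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
# result → [(1, 2), (2, 1)]
```

Once a query like this looks right interactively, the same SQL runs unchanged in a scheduled Spark job over the full dataset.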
- Databricks should come with a fine-grained access control mechanism: once tables or views are created, access should be restrictable to certain tables or columns based on the logged-in user
- Improved graphing and dashboarding from within Databricks
- Better integration with AWS would make it easier to code jobs in Databricks and run them on AWS EMR through better DevOps pipelines
- ROI for us has been tremendous; time to market for processing raw data in our big data infrastructure has been very fast.
- Non-engineers can easily use Databricks, which helps our business customers.
- Thousands of different data combinations can easily be joined and used by our data teams.
Databricks was picked over other competitors. Its closest competition in our organization was H2O.ai, and in our internal research Databricks came out ahead on ROI and time to market.
We could have used AWS products, but Databricks notebooks and the ability to launch clusters directly from a notebook were seen as very helpful for non-technical users.