Apache Spark - your go-to technology for distributed data processing
May 03, 2021

Surendranatha Reddy Chappidi | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User

Overall Satisfaction with Apache Spark

  • We use Apache Spark in our Digital - Data teams to build data products and help business teams make data-driven decisions.
  • We use Apache Spark to source data from different source systems, process it, and store it in the data lake.
  • Once the data is in the data lake, we use Spark for data cleansing and data transformation as per business requirements
  • Once the data is transformed, we insert it into the final target layer in the data warehouse.
  • Spark is very fast compared to other frameworks because it works in cluster mode and uses distributed processing and computation frameworks internally
  • Robust and fault tolerant
  • Open source
  • Can source data from multiple data sources
  • No Dataset API support in the Python version of Spark
  • The Apache Spark job run UI could show more meaningful information
  • Spark errors could provide more meaningful information when a job fails
  • Distributed processing and computing
  • Processing different data source formats
  • Fault tolerant and robust
  • Business leaders are able to make data-driven decisions
  • Business users are able to access data in near real time now. Before using Spark, they had to wait at least 24 hours for data to be available
  • The business is able to come up with new product ideas
  • Apache Spark works in distributed mode using a cluster
  • Informatica and Datastage cannot scale horizontally
  • We can write custom code in Spark, whereas in Datastage and Informatica we can only choose from the features already provided.
  • Apache Spark is open source and free, whereas we need to buy licenses for Datastage and Informatica

Do you think Apache Spark delivers good value for the price?

Yes

Are you happy with Apache Spark's feature set?

Yes

Did Apache Spark live up to sales and marketing promises?

Yes

Did implementation of Apache Spark go as expected?

Yes

Would you buy Apache Spark again?

Yes

Azure Data Factory, Databricks Lakehouse Platform (Unified Analytics Platform), Cloudera Distribution Hadoop (CDH)
Specific scenarios where Apache Spark is well suited:
1. real-time processing of streaming data
2. processing unstructured data, semi-structured data, and structured data from multiple sources
3. avoiding vendor lock-in and cloud platform lock-in while developing products