Apache Spark is still a valid DE tool
December 28, 2024

Anonymous | TrustRadius Reviewer
Score 9 out of 10
Vetted Review
Verified User

Overall Satisfaction with Apache Spark

We use Apache Spark daily as the main computation engine for updating most of our critical and non-critical data pipelines. We mostly work with batch processing, but we also use Spark Streaming in some cases. The scope covers all analysis pipelines, machine learning datasets, and several operational use cases.

Pros

  • Parallel processing
  • Configurability
  • Usage with other tools

Cons

  • More ready-to-use solutions for tweaking the Apache Spark configs
  • Reduce the need to create UDFs in PySpark by implementing more transformations natively

Return on Investment

  • Increased data literacy and adherence to best data engineering practices across the organization
  • Increased ability for data analysts to access their data quickly and reliably, better supporting data-driven decisions
  • Decreased costs due to better parallelization of resources

If the team looking to use Apache Spark is not used to debugging and tuning job settings to get the most out of each job, it can be frustrating. However, the documentation and the online community can help resolve most issues. Moreover, it is highly configurable and integrates with different tools (e.g., it can be used by dbt Core), which increases the number of scenarios where it can be used.
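
To give an idea of what tuning job settings can involve, here is a minimal sketch that derives a starting set of Spark properties from the size of the worker nodes. It is not taken from the review: the ClusterSpec class, both function names, and the sizing rules (5 cores per executor, 1 core and 1 GB left per node, roughly 10% of memory kept for overhead, 2 shuffle partitions per core) are assumptions for illustration, based on commonly quoted rules of thumb. Only the property names (spark.executor.instances, spark.executor.cores, spark.executor.memory, spark.sql.shuffle.partitions) are real Spark settings.

```python
from dataclasses import dataclass


@dataclass
class ClusterSpec:
    """Hypothetical description of the worker nodes (not a Spark API)."""
    nodes: int
    cores_per_node: int
    memory_gb_per_node: int


def suggest_spark_conf(cluster: ClusterSpec, cores_per_executor: int = 5) -> dict:
    """Return a rough starting configuration; the rules here are assumptions."""
    # Leave one core and 1 GB per node for the OS and cluster daemons.
    usable_cores = max(1, cluster.cores_per_node - 1)
    usable_memory_gb = max(1, cluster.memory_gb_per_node - 1)

    # Never ask for more cores per executor than a node can offer.
    cores = min(cores_per_executor, usable_cores)
    executors_per_node = max(1, usable_cores // cores)

    # Keep one executor slot free for the driver.
    total_executors = max(1, executors_per_node * cluster.nodes - 1)

    # Split node memory across executors and keep about 10% for overhead.
    memory_per_executor_gb = usable_memory_gb // executors_per_node
    heap_gb = max(1, int(memory_per_executor_gb * 0.9))

    # Assumed starting point: two shuffle partitions per executor core.
    shuffle_partitions = total_executors * cores * 2

    return {
        "spark.executor.instances": str(total_executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{heap_gb}g",
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }


def to_submit_flags(conf: dict) -> list:
    """Format the properties as --conf arguments for spark-submit."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    example = ClusterSpec(nodes=10, cores_per_node=16, memory_gb_per_node=64)
    for flag in to_submit_flags(suggest_spark_conf(example)):
        print(flag)
```

Values like these are only a first guess. In practice they would be adjusted per job after looking at the Spark UI, which is where the time and experience mentioned above come in.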

Do you think Apache Spark delivers good value for the price?

Yes

Are you happy with Apache Spark's feature set?

Yes

Did Apache Spark live up to sales and marketing promises?

I wasn't involved with the selection/purchase process

Did implementation of Apache Spark go as expected?

Yes

Would you buy Apache Spark again?

Yes

dbt, Amazon S3 (Simple Storage Service), Amazon EMR (Elastic MapReduce)

Based on my personal experience, Apache Spark is great when you need highly parallelized jobs and have the time and resources to adapt the configuration for each job. For this reason, I would not recommend it for companies that lack a strong group of data engineers who can support other data roles in processing their data.
