Apache Spark - Simple Syntax, Huge Data Handling, Best Optimization, Parallel processing
October 26, 2017


Kamesh Emani | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User

Overall Satisfaction with Apache Spark

We previously used a database and the Pentaho ETL tool to perform data transformations as per project requirements, but as time passed our data kept growing day by day and we ran into a lot of optimization problems working this way. We then decided to implement a Hadoop cluster in our company and deployed an 8-node cluster with the Cloudera distribution. After that we started using Apache Spark to create applications for Student Course Enrollment data and run them in parallel across multiple processors.

It is used by one department, but the data covers the students and professors of the whole organization.

It addresses the problem of assigning classrooms for specific times in the week, based on student course enrollment and the schedules of the professors teaching the courses.
This is just one aspect of the application. There are various other data transformation requirements for different departments across the organization.
  • Spark is written in Scala, a functional programming language that is easy to use (Spark also offers Java, Python, and R APIs). The syntax is simple and human readable.
  • It can run transformations on huge data in parallel across the nodes of a cluster. It automatically optimizes the processing to produce output efficiently in less time.
  • It also provides a machine learning API for data science applications, and Spark SQL for fast queries during data analysis.
  • I also use Zeppelin, a web-based notebook tool for running queries quickly, which is very helpful for BI people who want to visualize query outputs.
  • Data visualization.
  • Still waiting for web development support, so that small apps can be built with Spark as the backbone middleware and HDFS as the file system for data retrieval.
  • The available transformations and actions are limited, so the API must be modified to support more features.
  • Optimization at its best (Super Fast).
  • Handles huge data with simple syntax, whereas other programming languages take a hell of a lot of coding.
  • Best for parallel computing applications.
Apache Pig and Apache Hive provide most of the things Spark provides, but Apache Spark has more features, like actions and transformations, which are easy to code. Spark also leaves room for optimization, as we can choose the driver program and manipulate the DAG (Directed Acyclic Graph).
Python can also be used for data transformations, but it requires a lot more coding than Spark and is also much slower.
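To make the transformations, actions, and Spark SQL mentioned in this review a bit more concrete, here is a minimal sketch in Scala. It is not our production code: the object name `EnrollmentCounts`, the column names `student_id` and `course_id`, and the sample rows are all invented for illustration, and it assumes Spark 2.x or later is on the classpath and runs in local mode rather than on a real cluster.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: object name, column names, and data are made up for illustration.
object EnrollmentCounts {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process on all cores; a real cluster would use its own master URL.
    val spark = SparkSession.builder()
      .appName("EnrollmentCounts")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny in-memory stand-in for student course enrollment data.
    val enrollments = Seq(
      ("s1", "CS101"),
      ("s2", "CS101"),
      ("s3", "MATH200")
    ).toDF("student_id", "course_id")

    // Transformation: this only adds a step to the plan (the DAG); nothing is computed yet.
    val perCourse = enrollments.groupBy("course_id").count()

    // Prints the physical plan Spark built for the transformation above.
    perCourse.explain()

    // Action: this is what actually triggers the parallel computation.
    perCourse.show()

    // The same aggregation expressed through Spark SQL on a temporary view.
    enrollments.createOrReplaceTempView("enrollments")
    spark.sql(
      "SELECT course_id, COUNT(*) AS enrolled FROM enrollments GROUP BY course_id"
    ).show()

    spark.stop()
  }
}
```

The point of the sketch is that `groupBy` and `count` on a grouped DataFrame only describe the work, while `show` is the action that makes Spark run it. On a cluster the same code would be split across the nodes without changes to the logic.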
For large data
For best optimization
For parallel processing
For machine learning on huge data, because presently available machine learning software such as RapidMiner is limited in the data size it can handle, whereas Spark is not