Hadoop for processing big data
Hung Vu | TrustRadius Reviewer
June 07, 2019

Score 9 out of 10
Vetted Review
Verified User

Modules Used

  • Hadoop Distributed File System
  • Hadoop MapReduce

Overall Satisfaction with Hadoop

We use Hadoop as the main framework to store and process our data, which includes very large log files from multiple applications, user behavior data, and movie metadata. We deploy our Hadoop cluster on AWS. Hadoop lets us store this data effectively across multiple machines in a consistent pattern, and we use it for business functions such as analyzing user behavior, checking application performance and stability, and profiling user taste based on movie type, user age, and time.
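To give a concrete sense of the kind of job we run, here is a minimal MapReduce sketch in Java that counts log events per application. The log format (application name as the first tab-separated field) and the class names are assumptions made for this example, not our actual schema.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Counts log lines per application. Assumes (hypothetically) that each log
// line starts with the application name as its first tab-separated field.
public class LogCount {

    public static class LogMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text app = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                app.set(fields[0]);      // application name
                context.write(app, ONE); // emit (app, 1) for each log line
            }
        }
    }

    public static class LogReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // total log lines for this application
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```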
Pros

  • Stores very large data files across multiple machines with high availability.
  • Processes large data files quickly and produces correct results.
  • A Hadoop cluster is easy to install and configure.
  • The MapReduce framework is simple to understand.

Cons

  • It is not suitable for real-time processing.
  • Data stored in Hadoop should follow a consistent pattern so that MapReduce can process it.
  • Community and support are quite limited.
Return on Investment

  • Provides reliable distributed storage for storing and retrieving data.
  • Reduces cost by utilizing spot and other low-cost VM instances in AWS.
  • Shortens development time.
We considered using a relational database (Oracle Database) with Java applications to process our data, but ended up with Hadoop even though it was almost new to us. It proved to be the correct choice: we only needed a little time to get started, and it saves on both license and EC2 costs because we can configure DataNodes as on-demand or spot instances (see the sketch below). It also delivers high performance and is easy to implement, since MapReduce functions are quite simple.
Other software used: Amazon Elastic Compute Cloud (EC2), MySQL, AWS Lambda
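One practical note on the spot-instance setup mentioned above: spot nodes can be reclaimed at any time, so HDFS block replication is what keeps the data available. Below is a hedged sketch of setting the replication factor from a Java client; dfs.replication is a real HDFS property, but the value and the file path here are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: set the client-side default replication factor so that losing a
// spot-instance DataNode does not lose data. Assumes fs.defaultFS already
// points at the HDFS cluster; the path below is a hypothetical example.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3); // keep 3 copies of each block

        FileSystem fs = FileSystem.get(conf);
        // Files created through this handle use the configured replication.
        fs.create(new Path("/logs/app-events.log")).close();
        fs.close();
    }
}
```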
Hadoop is best suited to analytics and summary scenarios where the input data is very large but follows a consistent pattern, such as analyzing a very big log file or processing very big files in parallel to extract information. We would not use Hadoop for real-time processing, for small input data, or for cases that need complex data relationships.
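To make the parallel-processing point concrete, here is a sketch of the driver that would configure and submit the mapper and reducer from the earlier example; the class names and input/output paths are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that wires the mapper and reducer together and submits the job.
public class LogCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log count per application");
        job.setJarByClass(LogCountDriver.class);
        job.setMapperClass(LogCount.LogMapper.class);
        job.setCombinerClass(LogCount.LogReducer.class); // safe: sums are associative
        job.setReducerClass(LogCount.LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS log directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar and submitted with hadoop jar, the job's input is split across DataNodes by HDFS and the framework runs one map task per split in parallel, which is exactly the "very big files, same pattern" scenario described above.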