Hadoop - You Can Tame the Elephant
Updated August 19, 2015

Michael Reynolds | TrustRadius Reviewer
Score 10 out of 10
Vetted Review
Verified User

Software Version

CDH4

Modules Used

  • Hadoop Distributed File System
  • Hadoop MapReduce

Overall Satisfaction with Hadoop

Hadoop is gradually replacing the company-wide MySQL data warehouse. Sqoop imports the data from MySQL, and Impala is progressively becoming the data source for all queries. Eventually, MySQL will be phased out, and all data will go directly into Hadoop. Our tests have shown that queries run through Impala are much faster than those run against MySQL.
  • The built-in data block redundancy helps keep the data safe. Hadoop also distributes storage, processing, and memory across the cluster, so large amounts of data can be handled in less time than with a typical database system.
  • There are numerous ways to get at the data. The basic way is through the Java-based API, by submitting MapReduce jobs written in Java. Hive works well for quick queries written in SQL, which are automatically submitted as MapReduce jobs.
  • The web-based interface is great for monitoring and administering the cluster, because it can potentially be done from anywhere.
  • Impala is a very fast alternative to Hive. Unlike Hive, which submits queries as MapReduce jobs, Impala provides immediate access to the data.
  • If you are not familiar with Java and the operating system Hadoop rides on, such as Linux, and have trouble with submitted MapReduce jobs, the error messages can seem cryptic, and it can be challenging to track down the source of the problem.
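As a rough illustration of what a MapReduce job does, here is a minimal plain-Java sketch of the map, shuffle, and reduce steps, using a word count. It deliberately avoids the Hadoop API so that it runs on its own. The class and method names (WordCountSketch, map, shuffle, reduce, run) are made up for this example. A real job would be written against Hadoop's own classes and submitted to the cluster, where these phases run in parallel across many nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map -> shuffle -> reduce flow.
// It does NOT use the Hadoop API; every name here is hypothetical.
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in sequence over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the word counts in alphabetical order.
        System.out.println(run(Arrays.asList("Hadoop stores data", "Hadoop processes data")));
    }
}
```

Hive generates jobs of this general shape from a SQL query, which is why it is often the quicker route for ad hoc questions.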
Hadoop is still young and evolving. It has a lot of potential, and there are many uses for it that have yet to be discovered.
  • Because Hadoop is open source, the cost is basically limited to the hardware. However, organizations with large clusters might want to invest in support services from companies like Cloudera or Hortonworks.
Hadoop is designed for huge data sets, where it can save a lot of time reading and processing data. However, the NameNode, which allocates the data blocks, is a single point of failure. Without a proper backup, or another NameNode ready to kick in, the file system can instantly become useless. There are typically two ways to ensure the integrity of the NameNode.

One way is to have a Secondary NameNode, which periodically creates a copy of the file system image file. The process is called a "checkpoint". In the event of a failure of the Primary NameNode, the Secondary NameNode can be manually configured as the Primary NameNode. The need for manual intervention can cause delays and potentially other problems.

The second method is to use a Standby NameNode. In this scenario, the same checkpoints are performed. However, in the event of a Primary NameNode failure, the Standby NameNode immediately takes the place of the Primary, preventing a disruption in service. This method requires additional services to be installed for it to operate.
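The difference between the two approaches can be sketched with a small toy model in plain Java. This is only an illustration of the recovery paths described above, not real HDFS behavior or any Hadoop API. The class NameNodeFailoverSketch and its methods are hypothetical, and the model ignores details such as how the file system image is actually transferred.

```java
// Toy model only: it mimics the two recovery paths and uses no Hadoop classes.
final class NameNodeFailoverSketch {

    enum BackupKind { SECONDARY, STANDBY }

    private final BackupKind backupKind;
    private boolean primaryAlive = true;
    private boolean backupActive = false;
    private int checkpointsTaken = 0;

    NameNodeFailoverSketch(BackupKind backupKind) {
        this.backupKind = backupKind;
    }

    // Periodic checkpoint: the backup node saves a copy of the file system image.
    void checkpoint() {
        if (primaryAlive) {
            checkpointsTaken++;
        }
    }

    // The Primary fails. A Standby takes over at once; a Secondary waits for an operator.
    void failPrimary() {
        primaryAlive = false;
        if (backupKind == BackupKind.STANDBY) {
            backupActive = true;
        }
    }

    // Manual step that is only needed in the Secondary NameNode setup.
    void promoteBackupManually() {
        if (!primaryAlive) {
            backupActive = true;
        }
    }

    boolean isFileSystemAvailable() {
        return primaryAlive || backupActive;
    }

    int checkpointsTaken() {
        return checkpointsTaken;
    }

    public static void main(String[] args) {
        NameNodeFailoverSketch secondary = new NameNodeFailoverSketch(BackupKind.SECONDARY);
        secondary.checkpoint();
        secondary.failPrimary();
        System.out.println("Secondary, before manual step: " + secondary.isFileSystemAvailable());
        secondary.promoteBackupManually();
        System.out.println("Secondary, after manual step: " + secondary.isFileSystemAvailable());

        NameNodeFailoverSketch standby = new NameNodeFailoverSketch(BackupKind.STANDBY);
        standby.checkpoint();
        standby.failPrimary();
        System.out.println("Standby, right after failure: " + standby.isFileSystemAvailable());
    }
}
```

In this model the Secondary setup stays unavailable until someone calls promoteBackupManually, which stands in for the manual reconfiguration, while the Standby setup keeps serving without intervention.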

Hadoop Training