Hadoop - You Can Tame the Elephant
Overall Satisfaction with Hadoop
Hadoop is slowly taking the place of the company-wide MySQL data warehouse. Sqoop is used to import the data from MySQL, and Impala is gradually becoming the data source for all queries. Eventually, MySQL will be phased out, and all data will go directly into Hadoop. Tests have shown that queries run through Impala are much faster than those run against MySQL.
Pros
- The built-in data block redundancy helps keep the data safe. Hadoop also distributes storage, processing, and memory across the cluster, so it can work through large amounts of data in less time than a typical database system.
- There are numerous ways to get at the data. The most basic is the Java-based API, which lets you write and submit MapReduce jobs in Java. Hive works well for quick queries: you write them in SQL (HiveQL), and Hive automatically submits them as MapReduce jobs.
- The web-based interface is great for monitoring and administering the cluster, because it can potentially be done from anywhere.
- Impala is a very fast alternative to Hive. Unlike Hive, which submits queries as MapReduce jobs, Impala runs queries with its own execution engine and skips the MapReduce job startup, so results come back much sooner.
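
For readers who have not written a MapReduce job, the model behind the Java API and Hive mentioned above can be sketched in plain Java. This is only an illustration of the map, shuffle, and reduce phases running in memory on one machine; it does not use the Hadoop API, and the class name `WordCountSketch` and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the MapReduce model (hypothetical names, no Hadoop dependency).
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key (sorted, as a TreeMap).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run the three phases in sequence over all input lines.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hadoop stores data", "hive queries data");
        // prints {data=2, hadoop=1, hive=1, queries=1, stores=1}
        System.out.println(countWords(lines));
    }
}
```

On a real cluster, Hadoop runs the map and reduce steps in parallel on the nodes that hold the data blocks and handles the shuffle itself. A Hive query ends up as jobs of this general shape without you having to write the Java.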
Cons
- If you are not familiar with Java and with the operating system Hadoop runs on (such as Linux), the error messages from failed MapReduce jobs can seem cryptic, and it can be challenging to track down the source of the problem.
- Because Hadoop is open source, the cost is basically limited to the hardware. However, organizations with large clusters may find they want to pay for support services from companies like Cloudera or Hortonworks.