TrustRadius: an HG Insights company

Apache Hadoop

Score 7.5 out of 10

270 Reviews and Ratings

What is Apache Hadoop?

Hadoop is open source software from Apache that supports distributed processing and data storage. It is popular for its scalability, reliability, and ability to run on commodity hardware.

Open source Hadoop: smart choice, smart price

Use Cases and Deployment Scope

We use Apache Hadoop to handle data that arrives continuously, in real time, from devices in different geographical locations across the globe. We then run Spark jobs and notebooks to ingest and process the data and load it into other external systems for further processing.

Pros

  • Its ability to handle large volumes of data is what makes Hadoop a go-to open source product.
  • Its open source nature makes it quite configurable.
  • Its community support is superb.

Cons

  • Its setup is quite complex and requires good knowledge of the product.
  • Fine-tuning its configuration requires in-depth knowledge of the product.
  • Its logging could be improved.

Return on Investment

  • Being open source makes it a popular choice for handling large chunks of data.
  • It was free earlier and is now licensed, but the enterprise edition is a fine-tuned version that makes it easier for new users and administrators to use.
  • Our investment is worth every single penny.
  • The initial cost is higher because you might need to hire administrators to set up the cluster and make it scalable. Once that is done, it's pretty easy.

Alternatives Considered

Amazon EMR (Elastic MapReduce)

Other Software Used

Amazon EMR (Elastic MapReduce), Amazon RDS on VMware, Amazon EventBridge, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Apache Kafka, Google Compute Engine, Apache Airflow

Hadoop: A Robust Big Data Platform

Pros

  • Integrates with RStudio, so most statistical algorithms can be deployed.
  • Handles Big Data tasks such as storage, information retrieval, and data manipulation.
  • Repetitive tasks like data wrangling, processing, and cleaning are more efficient in Hadoop because processing times are faster.

Cons

  • Hadoop requires capable hardware, such as a minimum of 8 GB of memory and an i5 processor. Sometimes the hardware becomes a hindrance.
  • If we could connect Hadoop to Salesforce, it would be a tremendous feature, as most CRM data comes from that channel.
  • It would be good to have some geocoding features for anyone who wants to do spatial data analysis using latitudes and longitudes.

Return on Investment

  • Positive: it is powerful, and it allows you to manage your data on a very big scale.
  • Negative: since it's computationally expensive, the laptops had to be upgraded, which was pretty heavy on finances.
  • Positive: it also has given us the power to make data-driven decisions anytime and anywhere.

Alternatives Considered

Apache Spark

Other Software Used

Tableau Desktop, RStudio, Apache Spark

Good tool for unstructured data

Pros

  • Apache Hadoop has made managing large amounts of data quite easy.
  • The system includes a file system known as HDFS (Hadoop Distributed File System), which stores the data that components and programs work on.
  • Parallel processing is also a strong aspect of Apache Hadoop.
  • It offers interesting and reliable features and functions.
  • Apache Hadoop also stores very large data files across machines with high availability.
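The parallel processing praised above presumably refers to Hadoop's MapReduce model. As a rough illustration only, here is a minimal single-machine sketch of the map, shuffle, and reduce phases for a word count. The function names are hypothetical and this is not the Hadoop API; on a real cluster each input split would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(splits):
    # Assumption: splits are processed one after another here for clarity;
    # a cluster would run the map calls in parallel on separate nodes.
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Because each split is mapped independently, adding machines lets more splits be processed at the same time, which is where the scalability comes from.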

Cons

  • I personally feel that Apache Hadoop is slower than other interactive querying platforms. Queries can sometimes take hours, which is frustrating and discouraging.
  • Also, Apache Hadoop has so many modules that it takes a long time to learn them all. Beyond that, optimization is somewhat of a challenge.

Most Important Features

  • Data sourcing is excellent.
  • Efficient customer support.
  • Reliable customization of functionalities.
  • Spark integration.
  • Workload processing.

Return on Investment

  • Apache Hadoop can handle even very large amounts of data for business-level purposes.
  • HDFS also stores data files across machines by splitting them into large blocks and distributing those blocks across nodes.
  • It plays a great role in the growth of our organization.
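To make the block distribution mentioned above concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks with each block assigned to several nodes. It assumes the common defaults of a 128 MB block size and a replication factor of 3. The `plan_blocks` helper and the round-robin placement are hypothetical simplifications; real HDFS uses a rack-aware placement policy.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed default block size (128 MB)
REPLICATION = 3                 # assumed default replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_bytes, replica_nodes) tuples."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # The last block holds only the remaining bytes.
        size = min(block_size, file_size - i * block_size)
        # Toy round-robin placement, not the real HDFS policy.
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, size, replicas))
    return plan


if __name__ == "__main__":
    for block in plan_blocks(300 * 1024 * 1024, ["n1", "n2", "n3", "n4"]):
        print(block)
```

In this sketch a 300 MB file becomes three blocks, the last one smaller than the others, and every block lives on three different nodes, so losing a single machine does not lose data.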

Alternatives Considered

Azure Data Lake Storage

Other Software Used

Red Hat Ansible Automation Platform, Oracle Java Cloud, QuickBooks Desktop Pro

Great enterprise tool for handling large data

Pros

  • The various modules are sometimes pretty challenging to learn, but at the same time they make Hadoop easy to implement and run.
  • Hadoop includes a well-designed file system, the Hadoop Distributed File System, which handles the data for all components and programs.
  • Hadoop is also very easy to install, which is a great plus, since a tricky installation process can make users lose interest.
  • Customer support is quick.

Cons

  • As much as I appreciate Hadoop, it has certain cons as well. I think Hadoop should work on its interactive querying, which in my opinion is quite slow compared to other players in the market.
  • Apart from that, Hadoop has so many modules that it becomes difficult and time-consuming to learn and master all of them.

Most Important Features

  • Data distribution.
  • Machine scaling.
  • Cloud processing.
  • Data management.

Return on Investment

  • Hadoop has many advantages. First, it has made the management and processing of extremely large data very easy and has simplified the lives of many people, including me.
  • Hadoop is quite interesting due to its new and improved features plus innovative functions.

Other Software Used

Delphix, Fortinet FortiGate, OneLogin, McAfee Endpoint Security

Apache Hadoop Can Save on the Headaches

Pros

  • Handles large amounts of unstructured data well, for business level purposes
  • Is a good catchall because of this design, i.e. what does not fit into our vertical tables fits here.
  • Decent for large ETL pipelines and logging free-for-alls because of this, also.

Cons

  • Many, many modules, and because it is Apache open source, it takes time to learn
  • Integration between the disparate pieces is not always seamless, nor are all the pieces required.
  • Optimization can be challenging (see PSTL design)

Return on Investment

  • Positive, as we have saved money on hardware (and software costs) as data scaling has increased in the last several years.
  • Positive, as I said earlier, since the design of Hadoop allows for a natural split of the dataflows and less data to be "shoved into" the vertical data stack. This saves money and is naturally more efficient.
  • Negative, in that we need expertise to manage the Hadoop data stacks due to the learning curve.

Other Software Used

Adobe Spark, Apache Kafka, Oracle Java SE, Red Hat Ansible Automation Platform, Google Cloud Dataflow