Open source Hadoop: smart choice, smart price
Use Cases and Deployment Scope
We use Apache Hadoop to handle data that arrives continuously, in real time, from devices in different geographical locations across the globe. We then run Spark jobs and notebooks to ingest and process the data, and load it into other external systems for further processing.
Pros
- Its ability to handle very large volumes of data is what makes Hadoop a go-to open source product
- Its open source nature makes it quite configurable
- Its community support is superb.
Cons
- Its setup is quite complex and requires good knowledge of the product
- Fine-tuning its configuration requires in-depth knowledge of the product
- Its logging could be improved
Return on Investment
- Being open source makes it a popular choice for handling large chunks of data
- It was free earlier and is now licensed, but the enterprise edition is a fine-tuned version that is easier for new users and administrators to use
- Our investment is worth every single penny.
- The initial cost is higher, as you might need to hire administrators to set up the cluster and make it scalable. But once that is done, it's pretty easy
Alternatives Considered
Amazon EMR (Elastic MapReduce)
Other Software Used
Amazon EMR (Elastic MapReduce), Amazon RDS on VMware, Amazon EventBridge, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Apache Kafka, Google Compute Engine, Apache Airflow



