
Reviews (1-25 of 35)
January 16, 2021
[Apache Hadoop] is being used (mostly) as intended: for managing large, unstructured data from our data flows, including logging and report extract, transform, and load. We are using it at a medium scale in an on-prem server delivery with Cloudera as the management platform. While I firmly believe Cloudera makes it a bit easier to manage, it obfuscates issues at times.
- Handles large amounts of unstructured data well, for business level purposes
- Is a good catchall because of this design, i.e. what does not fit into our vertical tables fits here.
- Decent for large ETL pipelines and logging free-for-alls because of this, also.
- Many, many modules and, because it is Apache open source, it takes time to learn
- Integration is not always seamless between the disparate pieces nor are all the pieces required.
- Optimization can be challenging (see PSTL design)
September 21, 2020
It's used organization-wide for older data that's not used as frequently. We use Teradata to warehouse our more recent data, but for data we don't access as often, it's migrated to Hadoop. It addresses the problem of securely storing data without paying the fortune that most warehouses charge for premium cloud storage.
- Accessible
- Inexpensive
- User friendly
- Much slower than more premium platforms
- Doesn't connect with other data warehouses
- Not mainstream -- somewhat more of a "hacky" solution
September 19, 2020
We are using it within my department to process large sets of data that can't be processed in a timely fashion on a single computer or node. The various modules provided with Hadoop make it easy for us to implement map-reduce and perform parallel processing on large sets of data. We have approximately 40TB of data that we run various algorithms against as we try to use the data to solve business problems and prevent fraudulent transactions.
- Map-reduce
- Parallel processing
- Handles node failures
- HDFS: distributed file system
- More connectors
- Query optimization
- Job scheduling
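The map-reduce pattern this reviewer describes can be illustrated with a minimal, single-process sketch. It is not Hadoop's API: the record format `(account, amount)`, the function names, and the sample data are all hypothetical, and a real cluster would run the map and reduce steps on different nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Emit (key, value) pairs for one input record (hypothetical format)."""
    account, amount = record
    yield account, amount

def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def run_job(partitions):
    # On a cluster each partition would be mapped on a different node;
    # this sketch simply loops over them in one process.
    mapped = chain.from_iterable(
        map_phase(record) for part in partitions for record in part
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Hypothetical transaction data split into two partitions.
partitions = [
    [("acct-1", 120.0), ("acct-2", 40.0)],
    [("acct-1", 80.0), ("acct-3", 15.5)],
]
totals = run_job(partitions)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across machines, which is what makes the model suit data sets too large for a single node.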
December 07, 2019
We leverage Hadoop for several of our Tier 1 applications. We use Hadoop for our enterprise data lake where all of the data that our company takes in from our members is stored and then a lot of our applications use that as their master datasource. We also leverage Hadoop clusters for Paxata and Data Scientist analytics workloads. Basically, anything that requires a scale-out approach, we put on Hadoop.
- Scale.
- Stability.
- Reliability.
- There are a lot of Hadoop-specific services and applications under the hood that you have to learn how to administer across the Hadoop cluster.
- Enterprise-class support doesn't live up to other third-party vendors.
We use Hadoop as the main framework to store and process our data, including very large log files from multiple applications, user behavior data, and movie metadata. We deploy our Hadoop cluster in AWS. Hadoop allows us to effectively store this data on multiple machines with a consistent pattern, and we use it for business functions such as analyzing user behavior, checking application performance and stability, and modeling user taste based on the specified movie type, user age, and time.
- Allows storing of very big data files on multiple machines with high availability.
- Effectively processes large data files at high speed with correct results.
- Easy to install and configure a Hadoop cluster.
- Map Reduce framework is simple to understand.
- It is not suitable for real-time processing.
- Data stored in Hadoop should follow the same pattern in order to be processed by MapReduce.
- Community and support are quite limited.
It is being used at our Fortune 500 clients. It is great for storage, but it is not well understood by the business. The challenge is that it requires very sophisticated data scientists to use properly and in parallel, but the data scientists turn the data on its head, causing IT execution issues. This has forced IT to restructure data in a denormalized form so the business users can actually be productive. This is a big trend in organizations.
- Great for inexpensive storage, when originally introduced.
- Distributed processing
- Industry standard
- Network fabric needs to be more sophisticated.
- Need centralized storage.
- The three copies of data should have been in the original design, not years later.
- Consider deploying Spectrum Scale in these environments.
December 22, 2018
Hadoop is being used to solve big data modeling problems in our firm. The corporate analytics team uses Hadoop to perform functions like data manipulation, information retrieval, data mapping, and statistical modeling. The business problem which it solves is the limitation of CSV/Excel files to handle more than a million rows. Hadoop allows you to process big data and also has connectivity with platforms like R Studio where you can deploy mathematical models.
- Capability to collaborate with R Studio. Most of the statistical algorithms can be deployed.
- Handling Big Data issues like storage, information retrieval, data manipulation, etc.
- Redundant tasks like data wrangling, data processing, and cleaning are more efficient in Hadoop as the processing times are faster.
- Hadoop requires intensive computational platforms like a minimum of 8GB memory and i5 processor. Sometimes the hardware does become a hindrance.
- If we can connect Hadoop to Salesforce, it would be a tremendous functionality as most CRM data comes from that channel.
- It will be good to have some Geo Coding features if someone wants to opt for spatial data analysis using latitudes and longitudes.
March 28, 2018
- Used for Massive data collection, storage, and analytics
- Used for MapReduce processes, Hive tables, Spark job input, and for backing up data
- Storing Retail Catalog & Session data to enable omnichannel experience for customers, and a 360-degree customer insight
- Having a consistent data store that can be integrated across other platforms, and have one single source of truth.
- HDFS is reliable and solid, and in my experience with it, there are very few problems using it
- Enterprise support from different vendors makes it easier to 'sell' inside an enterprise
- It provides High Scalability and Redundancy
- Horizontal scaling and distributed architecture
- Less of an organizational support system. Bugs need to be fixed, and outside help takes a long time to push updates
- Not for small data sets
- Data security needs to be ramped up
- The NameNode has no replication, so a NameNode failure takes a lot of time to recover from
It is used extensively in our organization for data storage, data backup, and machine learning analytics. Managing vast amounts of data has become quite easy since the arrival of the Hadoop environment. Our department is on the verge of moving towards Spark instead of MapReduce, but for now, Hadoop is being used extensively for MapReduce purposes.
- Hadoop Distributed Systems is reliable.
- High scalability
- Open source, low cost, large communities
- Compatibility with Windows Systems
- Security needs more focus
- Hadoop lacks real-time processing
December 13, 2017
Currently, there are two directorates using Hadoop for processing a vast amount of data from various data sources in my organization. Hadoop helps us tackle our problem of maintaining and processing a huge amount of data efficiently. High availability, scalability and cost efficiency are the main considerations for implementing Hadoop as one of the core solutions in our big-data infrastructure.
- Scalability is one of the main reasons we decided to use Hadoop. Storage and processing power can be seamlessly increased by simply adding more nodes.
- Replication on Hadoop's distributed file system (HDFS) ensures robustness of data being stored which ensures high-availability of data.
- Using commodity hardware as a node in a Hadoop cluster can reduce cost and eliminates dependency on particular proprietary technology.
- User and access management are still challenging to implement in Hadoop, deploying a kerberized secured cluster is quite a challenge itself.
- Multiple application versioning on a single cluster would be a nice to have feature.
- Processing a large number of small files also becomes a problem on a very large cluster with hundreds of nodes.
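The small-files concern in the last point can be made concrete with a back-of-the-envelope sketch. The NameNode tracks every file and every block as an in-memory object, so many small files cost far more metadata than a few large ones holding the same data. The 128 MiB block size and the figure of roughly 150 bytes of heap per object are assumptions used for illustration only, not measured values, and the helper names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumed HDFS block size (128 MiB)
BYTES_PER_OBJECT = 150       # assumed rule-of-thumb NameNode heap per object

def namenode_objects(file_sizes):
    """Count one object per file plus one per block it occupies."""
    blocks = sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)
    return len(file_sizes) + blocks

def estimated_heap_bytes(file_sizes):
    """Rough NameNode heap needed to track the given files."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT

one_tib = 1024**4
# The same 1 TiB stored as block-sized files versus 1 MiB files.
large_files = [BLOCK_SIZE] * (one_tib // BLOCK_SIZE)
small_files = [1024**2] * (one_tib // 1024**2)

ratio = estimated_heap_bytes(small_files) / estimated_heap_bytes(large_files)
```

Under these assumptions the 1 MiB layout needs 128 times as many metadata objects as the block-sized layout for identical data, which is why packing small files into larger containers is a common mitigation.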
[It was used] as a proof of concept to analyze a huge amount of data. We were building a product to analyze huge data and eventually sell that product to a utility.
- Highly Scalable Architecture
- Low cost
- Can be used in a Cloud Environment
- Can be run on commodity Hardware
- Open Source
- It's open source, but companies like Hortonworks and Cloudera provide enterprise support
- Lots of scripting still needed
- Some tools in the Hadoop ecosystem overlap
Hadoop is used to build a data lake where all enterprise data for my entire company can be stored. With data centralization and standardization we use it to build analytical solutions for our company. There are many other uses for the data - for example monitoring performance via KPIs, etc.
- Massive data processing
- Fault tolerance
- Speed to market
- Data visualization
- Data history
- Random access
We needed a robust/redundant system to run multiple simultaneous jobs for our ETL pipeline. This required distributed storage space, integration with Windows AD user accounts, and the ability to expand when needed with little to no downtime.
We are using Cloudera 5.6 to orchestrate the install (along with Puppet) and manage the Hadoop cluster.
- The distributed replicated HDFS filesystem allows for fault tolerance and the ability to use low cost JBOD arrays for data storage.
- YARN with MapReduce2 gives us a job slot scheduler to fully utilize available compute resources while providing HA and resource management.
- The Hadoop ecosystem allows for the use of many different technologies all using the same compute resources, so that your Spark, Samza, Camus, Pig, and Oozie jobs can happily co-exist on the same infrastructure.
- Without Cloudera as a management interface, the Hadoop components are much harder to manage to ensure consistency across a cluster.
- Calculating hardware resources against job slots/resource management can be quite an exercise in finding that "sweet spot" for your applications. A more transparent way of figuring this out would be welcome.
- A lot of the roles and management pieces are written in Java, which from an administration perspective can have their own issues with garbage collection and memory management.
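The "sweet spot" calculation mentioned above usually comes down to which resource runs out first on a node. The sketch below is a simplified illustration, not the scheduler's actual algorithm: the function name, the parameters, and the example node sizes are all hypothetical.

```python
def containers_per_node(node_mem_mb, node_vcores,
                        container_mem_mb, container_vcores,
                        reserved_mem_mb=0):
    """Estimate how many equal-sized containers fit on one worker node.

    reserved_mem_mb is memory held back for the OS and daemons
    (an assumed figure chosen by the administrator).
    """
    usable_mem = node_mem_mb - reserved_mem_mb
    by_memory = usable_mem // container_mem_mb
    by_cpu = node_vcores // container_vcores
    # The scarcer resource decides how many containers can run.
    return max(0, min(by_memory, by_cpu))

# Hypothetical node: 64 GiB RAM, 16 vcores, 8 GiB reserved.
memory_bound = containers_per_node(65536, 16, 4096, 1, reserved_mem_mb=8192)
cpu_bound = containers_per_node(65536, 16, 2048, 1, reserved_mem_mb=8192)
```

With 4 GiB containers this hypothetical node is memory-bound and leaves two cores idle; with 2 GiB containers it becomes CPU-bound and leaves memory idle. Tuning is the search for a container size where neither resource is stranded for the actual workload.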
May 26, 2016
Hadoop is not used as a norm in my organization. I just use it personally to complete my job faster. It is implemented in the research computing cluster to be used by faculty and students. It completes jobs faster by parallelizing the tasks using MapReduce framework. This gives me considerable speed in the tasks I perform.
- Provides a reliable distributed storage to store and retrieve data. I am able to store data without having to worry that a node failing might cause the loss of data.
- Parallelizes the task with MapReduce and helps complete the task faster. The ease of use of MapReduce makes it possible to write code in a simple way to make it run on different slaves in the cluster.
- With the massive user base, it is not hard to find documentation or help relating to any problem in the area. Therefore, I rarely had any instances where I had to look for a solution for a really long time.
- I would have hoped for a simpler interface if possible, so that the initial effort required would have been much less. I often see others who are starting to use Hadoop finding it hard to learn.
- I'm not sure if it is a problem with the organization and the modules they provide, but sometimes I wish there were more modules available to be used.
May 25, 2016
The company I worked at used Hadoop clusters for processing huge datasets. They had several nodes for both production and pre-production. It allowed distributed processing of data across several clusters with an easy-to-use software model. It is used by the Systems and IT department at my company.
- HDFS provides a very robust and fast data storage system.
- Hadoop works well with generic "commodity" hardware negating the need for expensive enterprise grade hardware.
- It is mostly unaffected by system and hardware failures of nodes and is self-sustained.
- While its open source nature provides a lot of benefits, there are multiple stability issues that arise due to it.
- Limited support for interactive analytics.
December 01, 2015
I have used Hadoop for building business feeds for a telecom client. The major purpose for using Hadoop was to tackle the problem of gaining insights into the ever-growing volume of business data. We leveraged the MapReduce programming model to churn more than 30 gigabytes of data per day into actionable and aggregated data, which was further leveraged by campaign teams to design and shape marketing and by product teams to envision new customer experiences.
- Hadoop is an excellent framework for building distributed, fault-tolerant data processing systems which leverage HDFS, which is optimized for high-throughput data access rather than low-latency access.
- Hadoop Map reduce is a powerful programming model and can be leveraged directly either via use of Java programming language or by data flow languages like Apache Pig.
- Hadoop has a rich ecosystem of companion tools which enable easy integration for ingesting large amounts of data efficiently from various sources. For example, Apache Flume can act as a data bus which can use HDFS as a sink and integrates effectively with disparate data sources.
- Hadoop can also be leveraged to build complex data processing and machine learning workflows, due to availability of Apache Mahout, which uses the map reduce model of Hadoop to run complex algorithms.
- Hadoop is a batch oriented processing framework, it lacks real time or stream processing.
- Hadoop's HDFS file system is not a POSIX compliant file system and does not work well with small files, especially smaller than the default block size.
- Hadoop cannot be used for running interactive jobs or analytics.
February 16, 2016
My present company uses Hadoop and associated technology to create a data pipeline using open source tools. We also consult for projects which could potentially use Hadoop, and I work as a consultant for HDP. We actively help in installation and setup of Hadoop clusters.
- Hadoop is open source, and with a wide community already present, it is much easier to use for individuals, startups and MNCs alike.
- Hadoop works well on commodity hardware, and that makes it easier to avoid pricey clusters.
- Hadoop takes parallel programming to the next level and makes processing multiple terabytes (even petabytes) of data easier.
- While Hadoop MR parallelizes jobs involving Big Data, it is slow for smaller data sets
- OLAP (analytics) is easier; however, OLTP (transactions) is a problem in most cases.
- People using Hadoop have to keep in mind that small proof of concepts may not scale as expected.
February 13, 2016
I have been working with Hadoop since last year. It is very user friendly. Hadoop was used by the data center management team. It allows distributed processing of huge data sets across clusters of computers using simple programming models.
- It is robust in the sense that any big data applications will continue to run even when individual servers fail.
- Enormous data can be easily sorted.
- It can be improved in terms of security.
- Since it is open source, stability issues must be improved.
December 04, 2015
We utilize Hadoop primarily as a large data staging area for disparate corporate data. Select data is aggregated and moved downstream to a more formal data warehouse. Some data analytics is also performed directly against the Hadoop stored data. The direct analytics is done primarily with Apache Spark utilizing Scala and Python.
- No requirement for schema on write.
- Ability to scale to massive amounts of data.
- Open platform provides multiple options and customizations to fit your exact needs.
- The platform is still maturing and can be confusing to research and use. Basic tasks can still be manual and are not always user friendly.
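The "no schema on write" point above means raw records are stored untouched and a schema is applied only when the data is read. A minimal sketch of that idea follows, using plain JSON lines rather than any Hadoop API. The field names, the sample records, and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw records are stored as-is; nothing validates them on write.
raw_lines = [
    '{"user": "a", "amount": "12.50", "country": "US"}',
    '{"user": "b", "amount": "3"}',
    '{"user": "c", "amount": "7.25", "extra": "ignored"}',
]

# The schema lives with the reader, not with the stored data.
schema = {"user": str, "amount": float, "country": str}

def read_with_schema(lines, schema):
    """Project each raw record onto the schema at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }

rows = list(read_with_schema(raw_lines, schema))
```

A different reader can apply a different schema to the same stored lines, which is the flexibility the reviewer is pointing at. The trade-off is that malformed data is only discovered at read time.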
December 01, 2015
Hadoop is used by the data center management team. Hadoop processes the metric data pushed by virtual machines. Hadoop's output is served to the analytics engine, and respective actions are taken to maintain even load on machines.
- Processing huge data sets.
- Concurrent processing.
- Performance increases with distribution of data across multiple machines.
- Better handling of unstructured data.
- Data nodes and processing nodes
- Make Hadoop lightweight.
- Installation is very difficult. Make it more user friendly.
- Introduce a feature that works with continuous integration.
December 01, 2015
I have been using Hadoop for 2 years and I really find it very useful, especially working with bigger datasets. I have used Hadoop and Mahout for my project to analyze and learn different patterns from Yelp Dataset. It was really very easy and user friendly to use.
- Scalability. Hadoop is really useful when you are dealing with a bigger system and you want to make your system scalable.
- Reliable. Very reliable.
- Fast, Fast Fast!!! Hadoop really works very fast, even with bigger datasets.
- Development tools are not that easy to use.
- Learning curve can be reduced. As of now, some skill is a must to use Hadoop.
- Security. In today's world, security is of prime importance. Hadoop could be made more secure to use.
November 17, 2015
I have been using Hadoop for the last 12 months and really find it effective while dealing with large amounts of data. I have used Hadoop jointly with Apache Mahout for building a recommendation system and got amazing results. It was fast, reliable and easy to manage.
- Fast. Prior to working with Hadoop I had many performance based issues where our system was very slow and took time. But after using Hadoop the performance was significantly increased.
- Fault tolerant. The HDFS (Hadoop distributed file system) is a good platform for working with large data sets and makes the system fault tolerant.
- Scalable. As Hadoop can deal with structured and unstructured data it makes the system scalable.
- Security. As it has to deal with a large data set it can be vulnerable to malicious data.
- Less performance with smaller data. Doesn't provide effective results if the data is very small.
- Requires a skilled person to handle the system.
We are using it for retail data ETL processing. This is going to be used across the whole organization. It allows terabytes of data to be processed faster and at scale.
- Processes big volumes of data faster by using parallelism.
- No schema required. Hadoop can process any type of data.
- Hadoop is horizontally scalable.
- Hadoop is free.
- Development tools are not that friendly.
- Hard to find Hadoop resources.
April 29, 2015
Hadoop is used for storing and analyzing log data (logs from warehouse loads or other data processing) as well as storing and retrieving financial data from JD Edwards. It's also planned to be used for archival. Hadoop is used by several departments within our organization. Currently, we are paying a lot of money for hosting historical data, and we plan to move that to Hadoop, reducing our storage costs. Also, we got much better performance out of our Hadoop cluster for processing a large amount of financial data. So, in that sense, Hadoop addressed multiple business problems for us.
- Hadoop stores and processes unstructured data such as web access logs or logs of data processing very well
- Hadoop can be effectively used for archiving, providing a very economical, fast, flexible, scalable and reliable way to store data
- Hadoop can be used to store and process a very large amount of data very fast
- Security is a piece that's missing from Hadoop - you have to supplement security using Kerberos etc.
- Hadoop is not easy to learn - there are various modules with little or no documentation
- Because Hadoop is open source, testing, quality control and version control are very difficult
August 19, 2015
Hadoop is slowly taking the place of the company-wide MySQL data warehouse. Sqoop is being used to import the data from MySQL. Impala is gradually being used as the new data source for all queries. Eventually, MySQL will be phased out, and all data will go directly into Hadoop. Tests have shown that the queries run from Impala are much faster than those from MySQL.
- The built-in data block redundancy helps ensure that the data is safe. Hadoop also distributes the storage, processing, and memory, to work with large amounts of data in a shorter period of time, compared to a typical database system.
- There are numerous ways to get at the data. The basic way is via the Java-based API, by submitting MapReduce jobs in Java. Hive works well for quick queries, using SQL, which are automatically submitted as MapReduce Jobs.
- The web-based interface is great for monitoring and administering the cluster, because it can potentially be done from anywhere.
- Impala is a very fast alternative to Hive. Unlike Hive, which submits queries as MapReduce jobs, Impala provides immediate access to the data.
- If you are not familiar with Java and the operating system Hadoop rides on, such as Linux, and have trouble with submitted MapReduce jobs, the error messages can seem cryptic, and it can be challenging to track down the source of the problem.
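The data block redundancy mentioned in the first point of this list can be sketched as follows. This is a deliberately simplified model: replicas are placed round-robin on distinct nodes, whereas real HDFS placement also considers racks. The node names, function names, and the replication factor of 3 used here are illustrative assumptions.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return {
        block for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    }

nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical five-node cluster
placement = place_blocks(10, nodes)

after_two_failures = readable_blocks(placement, {"n1", "n2"})
after_three_failures = readable_blocks(placement, {"n1", "n2", "n3"})
```

With three replicas on distinct nodes, any two simultaneous node failures leave every block readable. In this toy placement, data is lost only when all three holders of the same block fail together, as happens to blocks 0 and 5 in the second case.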
What is Hadoop?
Hadoop is an open source software from Apache, supporting distributed processing and data storage. Hadoop is popular for its scalability, reliability, and functionality available across commoditized hardware.
Categories: Hadoop-Related
Hadoop Integrations
Sematext Infrastructure Monitoring (formerly Sematext SPM)
Hadoop Pricing
- Free Trial Available? No
- Free or Freemium Version Available? Yes
- Premium Consulting/Integration Services Available? No
- Entry-level set up fee? No
Hadoop Technical Details
Operating Systems: Unspecified
Mobile Application: No