Likelihood to Recommend
Sqoop is great for sending data between a JDBC-compliant database and a Hadoop environment. Sqoop is built for those who need a few simple CLI options to import a selection of database tables into Hadoop, do large dataset analysis that could not commonly be done with that database system due to resource constraints, then export the results back into that database (or another). Sqoop falls short when there needs to be some extra, customized processing between database extract and Hadoop loading, in which case Apache Spark's JDBC utilities might be preferred.
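The import, analyze, export round trip described above can be sketched as a pair of Sqoop 1 command lines. This is only an illustration, not the reviewer's actual setup: the JDBC URL, user name, password file, table names, and HDFS paths are hypothetical placeholders, and the Python helpers simply assemble the arguments using standard `sqoop import` and `sqoop export` options.

```python
import shlex


def sqoop_import_cmd(jdbc_url, username, password_file, table, target_dir, mappers=4):
    """Build the argument list for importing one database table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC connection string of the source database
        "--username", username,
        "--password-file", password_file,   # file holding the password, kept off the command line
        "--table", table,                   # source table to import
        "--target-dir", target_dir,         # HDFS directory that receives the rows
        "--num-mappers", str(mappers),      # number of parallel map tasks
    ]


def sqoop_export_cmd(jdbc_url, username, password_file, table, export_dir):
    """Build the argument list for exporting HDFS results back into a database table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,                   # destination table, which must already exist
        "--export-dir", export_dir,         # HDFS directory holding the analysis output
    ]


if __name__ == "__main__":
    # All values below are made-up placeholders.
    url = "jdbc:mysql://db.example.com/sales"
    print(shlex.join(sqoop_import_cmd(url, "etl_user", "/user/etl/.pw", "orders", "/data/raw/orders")))
    print(shlex.join(sqoop_export_cmd(url, "etl_user", "/user/etl/.pw", "order_summary", "/data/out/order_summary")))
```

The printed commands could be pasted into a shell on a host with Sqoop installed, or the lists could be passed to `subprocess.run`. Anything that needs custom processing between the extract and the load is outside what these two commands cover.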
Azure Data Lake is an absolutely essential piece of a modern data and analytics platform. Over the past 2 years, our usage of Azure Data Lake as a reporting source has continued to grow and far exceeds more traditional sources like MS SQL, Oracle, etc.
Pros
Provides generalized JDBC extensions to migrate data between most database systems
Generates Java classes upon reading database records for use in other code utilizing Hadoop's client libraries
Allows for both import and export features

Setting up an Azure Data Lake Storage account and container is quite easy
Access from anywhere and easy maintenance
Integration with the Azure Data Factory service for end-to-end pipelines is pretty easy
Can store any form of data (structured, unstructured, semi-structured) in a faster manner

Cons
Sqoop2 development seems to have stalled. I have set it up outside of a Cloudera CDH installation, and I actually prefer its "Sqoop Server" model over the CLI-only client that is Sqoop1. This works especially well in a microservices environment, where there would be only one place to maintain the JDBC drivers to use for Sqoop.

study for the certifications also to have them as a reference for work when you have any questions about applying a configuration to the equipment. The Internet interface is simple and easy to use. Capacity is good and it's good that HP continues to innovate with this technology.

Alternatives Considered
Sqoop comes preinstalled on the major Hadoop vendor distributions as the recommended product to import data from relational databases. The ability to extend it with additional JDBC drivers makes it very flexible for the environment it is installed within. Spark also has a useful JDBC reader, can manipulate data in more ways than Sqoop, and can upload to many systems other than Hadoop. Kafka Connect JDBC is more for streaming database updates using tools such as Oracle GoldenGate or Debezium. StreamSets and Apache NiFi both provide a more "flow-based programming" approach to graphically laying out connectors between various systems, including JDBC and Hadoop.
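To illustrate the point about Spark's JDBC reader handling processing that Sqoop does not, here is a minimal PySpark-style sketch. It is an assumption-laden example rather than anything taken from the review: the helper names, connection details, filter condition, and output path are hypothetical, and `spark` is assumed to be an existing `SparkSession` with the relevant JDBC driver on its classpath. The option keys (`url`, `dbtable`, `user`, `password`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are the ones Spark's JDBC data source documents.

```python
def jdbc_read_options(url, table, user, password,
                      partition_column=None, lower=None, upper=None, num_partitions=None):
    """Assemble options for Spark's JDBC data source (hypothetical helper)."""
    opts = {"url": url, "dbtable": table, "user": user, "password": password}
    if partition_column is not None:
        # Spark expects the partitioning options to be supplied together.
        if lower is None or upper is None or num_partitions is None:
            raise ValueError("lower, upper and num_partitions are required with partition_column")
        opts.update({
            "partitionColumn": partition_column,  # numeric, date, or timestamp column to split on
            "lowerBound": str(lower),
            "upperBound": str(upper),
            "numPartitions": str(num_partitions),  # also caps concurrent JDBC connections
        })
    return opts


def read_filter_write(spark, opts, condition, out_path):
    """Read a table over JDBC, filter it in Spark, and write Parquet files."""
    df = spark.read.format("jdbc").options(**opts).load()
    # The custom processing step: any DataFrame transformation could go here.
    df.filter(condition).write.mode("overwrite").parquet(out_path)
```

A call might look like `read_filter_write(spark, jdbc_read_options(url, "orders", user, pw, "order_id", 1, 1000000, 8), "status = 'SHIPPED'", "/data/out/shipped")`, with every value in that call being a placeholder.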
Azure Data Lake Storage, from a functionality perspective, is a much easier solution to work with. Its implementation from Amazon EMR went smoothly, and continued usage is definitely better. However, Amazon EMR was significantly cheaper overall once the high transaction fees and the cost of storage due to growth were factored in. Both have their advantages and disadvantages, but the functionality of Azure Data Lake Storage outweighed its cost.
Return on Investment
When combined with Cloudera's HUE, it can enable non-technical users to easily import relational data into Hadoop.
Being able to manipulate large datasets in Hadoop, and then load them into a type of "materialized view" in an external database system, has yielded great insights into the Hadoop data lake without continuously running large batch jobs.
Sqoop isn't very user-friendly for those uncomfortable with a CLI.

Instead of having separate pools of storage for data, we are now operating on a single-layer platform, which has cut down on time spent maintaining those separate pools.
We have had more of an ROI with the scalability, as we are able to control costs of storage when need be.
We are able to operate in a more streamlined way, as we are able to stay within the Azure suite of products and integrate seamlessly with the rest of the applications in our cloud-based infrastructure.