Likelihood to Recommend
Apache Spark Streaming is a tool that we have been using for almost a year, and it is excellent at managing batch processing. It is user-friendly. With it, we can process even massive data sets in fractions of a second. Its pricing is another plus point. Its only demerit is its in-memory processing, which occupies a large amount of memory.
For nearly all batch and streaming scenarios we design StreamSets pipelines. A few well-suited, tried-out use cases are below: 1. JDBC to ADLS data transfer based on source refresh frequency. 2. Kafka to GCS. 3. Kafka to Azure Event Hub. 4. HDFS to ADLS data transfer. 5. Schema generation to generate Avro. The easy-to-design canvas, job scheduling, fragment creation and reuse, and the wide range of built-in stages make it an even more favorable tool for me to design data engineering pipelines.
Pros
It is amazing at solving complicated transformation logic. It is straightforward to program. It is a very quick tool. It processes large data sets within a fraction of a second.
An easy-to-use canvas for creating data engineering pipelines. A wide range of available stages, i.e., sources, processors, executors, and destinations. Supports both batch and streaming pipelines. Scheduling is far easier than with cron. Integration with key vaults for fetching secrets.
Cons
There must be more documentation. It is a profoundly complex tool. Its in-memory processing consumes massive memory.
Monitoring and visualization could be improved a lot (e.g., to monitor a job and see what happened with a data transfer 7 days ago). The logging mechanism could be simplified (logs can be filtered by "ERROR", "DEBUG", "ALL", etc., but it still takes some time to get familiar with them). Auto-scaling for heavy load transfers is lacking (transferring more than 5 million records from JDBC to an ADLS destination as Avro files takes a long time). The concept of global variables is missing and should be added.
Alternatives Considered
Apache Spark Streaming stands above the other big-data transformation tools because of its processing speed, which was quite slow in the alternative we considered, as it took up a lot of our time in data processing. Spark also integrates comfortably with Jupyter-like notebook environments, and the combination of Spark with Jupyter and Python further enhances the speed.
StreamSets is a one-stop solution for designing data engineering pipelines and doesn't require deep programming knowledge. It's so user-friendly that anyone on the team can contribute to the idea of a pipeline design. With the alternative, one has to be proficient in programming to use its various components like Hive, HDFS, Kafka, etc., but in StreamSets all these stages are built in and ready to use with minor configuration.
Return on Investment
A cost- and time-effective tool for our business. We can integrate with Jupyter conveniently. Its high-speed data processing has proved beneficial for us.
Simplified and improved the overall data ingestion and integration process. Support for various heterogeneous source systems like RDBMS, Kafka, Salesforce, and Key Vault. A secure, easy-to-launch integration tool.
Screenshots