Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The Kafka event streaming platform is used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
N/A
GoCD
Score 8.0 out of 10
N/A
GoCD, from ThoughtWorks in Chicago, is an application lifecycle management and development tool.
Apache Kafka is well-suited for most data-streaming use cases. Amazon Kinesis and Azure EventHubs, unless you have a specific use case where using those cloud PaAS for your data lakes, once set up well, Apache Kafka will take care of everything else in the background. Azure EventHubs, is good for cross-cloud use cases, and Amazon Kinesis - I have no real-world experience. But I believe it is the same.
Previously, our team used Jenkins. However, since it's a shared deployment resource we don't have admin access. We tried GoCD as it's open source and we really like. We set up our deployment pipeline to run whenever codes are merged to master, run the unit test and revert back if it doesn't pass. Once it's deployed to the staging environment, we can simply do 1-click to deploy the appropriate version to production. We use this to deploy to an on-prem server and also AWS. Some deployment pipelines use custom Powershell script for.Net application, some others use Bash script to execute the docker push and cloud formation template to build elastic beanstalk.
Really easy to configure. I've used other message brokers such as RabbitMQ and compared to them, Kafka's configurations are very easy to understand and tweak.
Very scalable: easily configured to run on multiple nodes allowing for ease of parallelism (assuming your queues/topics don't have to be consumed in the exact same order the messages were delivered)
Not exactly a feature, but I trust Kafka will be around for at least another decade because active development has continued to be strong and there's a lot of financial backing from Confluent and LinkedIn, and probably many other companies who are using it (which, anecdotally, is many).
Pipeline-as-Code works really well. All our pipelines are defined in yml files, which are checked into SCM.
The ability to link multiple pipelines together is really cool. Later pipelines can declare a dependency to pick up the build artifacts of earlier ones.
Agents definition is really great. We can define multiple different kinds of environments to best suit our diverse build systems.
Sometimes it becomes difficult to monitor our Kafka deployments. We've been able to overcome it largely using AWS MSK, a managed service for Apache Kafka, but a separate monitoring dashboard would have been great.
Simplify the process for local deployment of Kafka and provide a user interface to get visibility into the different topics and the messages being processed.
Learning curve around creation of broker and topics could be simplified
Apache Kafka is highly recommended to develop loosely coupled, real-time processing applications. Also, Apache Kafka provides property based configuration. Producer, Consumer and broker contain their own separate property file
Support for Apache Kafka (if willing to pay) is available from Confluent that includes the same time that created Kafka at Linkedin so they know this software in and out. Moreover, Apache Kafka is well known and best practices documents and deployment scenarios are easily available for download. For example, from eBay, Linkedin, Uber, and NYTimes.
I used other messaging/queue solutions that are a lot more basic than Confluent Kafka, as well as another solution that is no longer in the market called Xively, which was bought and "buried" by Google. In comparison, these solutions offer way fewer functionalities and respond to other needs.
GoCD is easier to setup, but harder to customize at runtime. There's no way to trigger a pipeline with custom parameters.
Jenkins is more flexible at runtime. You can define multiple user-provided parameters so when user needs to trigger a build, there's a form for him/her to input the parameters.
Positive: Get a quick and reliable pub/sub model implemented - data across components flows easily.
Positive: it's scalable so we can develop small and scale for real-world scenarios
Negative: it's easy to get into a confusing situation if you are not experienced yet or something strange has happened (rare, but it does). Troubleshooting such situations can take time and effort.
Settings.xml need to be backed up periodically. It contains all the settings for your pipelines! We accidentally deleted before and we have to restore and re-create several missing pipelines
More straight forward use of API and allows filtering e.g., pull all pipelines triggered after this date