NiFi vs StreamSets: A Comparative Analysis

The main thing to understand is that NiFi was built to do one important thing very well: data flow management. Its design is based on the concept of flow-based programming, which you can read about at https://en.wikipedia.org/wiki/Flow-based_programming.

Many systems already produce data, such as sensors and other IoT devices. Many systems focus on data processing, such as Apache Storm, Spark, and Flink. And many systems store data, such as HDFS and relational databases. NiFi focuses exclusively on connecting these systems and on providing the user experience and core features that this task requires.

Some of the key features and architectural choices that make this effective include:

Interactive Command and Control

Someone connecting systems needs to interact quickly and effectively with the continuous streams of data in front of them. The NiFi user interface allows exactly this: as data comes in, you can add functions to work on it, fork copies of the data to try new approaches, adjust parameters on the running flow, review recent and historical statistics, and consult the built-in documentation. In contrast, almost all other systems follow a design-and-deploy model, in which you make a series of changes and then deploy them. That model is fine and can be intuitive, but for data flow management it means you cannot make a change and immediately see its effect. That immediate feedback is what lets you build new flows quickly and correct or improve the handling of existing data streams safely.

Data Provenance

A distinctive feature of NiFi is its ability to record detailed tracking of where your data came from, what was done to it, where it was sent, and when each of these steps happened in the flow. This matters for effective data flow management for several reasons, but for those in the early stages of exploring and working on a project, the most important benefit is debugging flexibility. You can set up your flows and let everything run, and then use provenance to prove that the flow did exactly what you wanted. If something did not happen as you expected, you can correct the flow, replay the object, and repeat until the result is right.
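The debug loop described above (record what happened, inspect it, fix the step, replay) can be sketched with a toy event log. This is only an illustration under stated assumptions, not NiFi's provenance repository or its API: `ProvenanceLog`, `run_step`, and `replay` are hypothetical names, and the event type strings are used loosely as labels.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ProvenanceEvent:
    seq: int          # order in which the event was recorded
    flowfile_id: str  # which object the event is about
    event_type: str   # label such as "RECEIVE" or "CONTENT_MODIFIED"
    component: str    # name of the step that produced the event
    content: bytes    # content as it looked when the step started


class ProvenanceLog:
    """Hypothetical append-only log of what happened to each object."""

    def __init__(self) -> None:
        self._events: List[ProvenanceEvent] = []
        self._seq = itertools.count(1)

    def record(self, flowfile_id: str, event_type: str,
               component: str, content: bytes) -> ProvenanceEvent:
        event = ProvenanceEvent(next(self._seq), flowfile_id,
                                event_type, component, content)
        self._events.append(event)
        return event

    def lineage(self, flowfile_id: str) -> List[ProvenanceEvent]:
        # All events for one object, in the order they were recorded.
        return [e for e in self._events if e.flowfile_id == flowfile_id]


def run_step(log: ProvenanceLog, flowfile_id: str, component: str,
             transform: Callable[[bytes], bytes], content: bytes) -> bytes:
    # Capture the input before transforming it, so it can be replayed later.
    log.record(flowfile_id, "CONTENT_MODIFIED", component, content)
    return transform(content)


def replay(log: ProvenanceLog, event: ProvenanceEvent,
           transform: Callable[[bytes], bytes]) -> bytes:
    # Re-run a (possibly corrected) transform on the captured input.
    log.record(event.flowfile_id, "REPLAY", event.component, event.content)
    return transform(event.content)


log = ProvenanceLog()
log.record("ff-1", "RECEIVE", "ingest", b"hello")

# First attempt: the step was meant to uppercase, but it lowercases.
out = run_step(log, "ff-1", "uppercase", bytes.lower, b"hello")

# Inspect the lineage, fix the step, and replay the same input.
last_event = log.lineage("ff-1")[-1]
fixed_out = replay(log, last_event, bytes.upper)
```

The point of the sketch is that the log keeps the input of each step, so a corrected step can be re-run on exactly the data that exposed the problem.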

Specially Designed Data Repositories

Out of the box, NiFi performs very well even on modest hardware or in virtual environments. This comes from the design of the flowfile repository and the content repository, which gives us high performance together with the transactional semantics we want as data passes through the flow. The flowfile repository is a simple write-ahead log implementation, and the content repository provides immutable, versioned content storage. This in turn means that we can "copy" data just by adding a new pointer (without actually copying bytes), and we can transform data just by reading from the original and writing a new version. Both operations are very efficient. Combine this with the provenance features I mentioned, and you get a powerful platform.

Another important thing to understand here is that you can't always dictate things like the size of the data involved. The NiFi API was created with this fact in mind, so it allows processors to receive, transform, and send data without having to load entire objects into memory. These repositories also mean that in most flows, most processors don't touch the content at all. When they do, you can easily see in the NiFi user interface how many bytes were actually read or written, which gives you useful information for tuning and monitoring your flows. This design also means that NiFi can naturally support back pressure and load shedding, both of which are critical features for a data flow management system.
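The ideas in this section can be illustrated with a small model: immutable content addressed by a claim id, a "copy" that only adds a pointer, a transform that reads the original chunk by chunk and writes a new version, and a bounded queue that refuses new work when full. This is a sketch under assumptions, not NiFi's implementation: `ContentRepository`, `FlowFile`, `clone`, `transform`, and `Connection` are hypothetical names, and an in-memory dict stands in for on-disk storage.

```python
import io
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Optional


class ContentRepository:
    """Toy store: each claim id maps to bytes that are never modified."""

    def __init__(self) -> None:
        self._claims: Dict[int, bytes] = {}
        self._next_id = 0
        self.bytes_written = 0

    def write(self, chunks: Iterable[bytes]) -> int:
        buf = io.BytesIO()  # stands in for a file on disk
        for chunk in chunks:
            buf.write(chunk)
        claim_id = self._next_id
        self._next_id += 1
        self._claims[claim_id] = buf.getvalue()
        self.bytes_written += len(self._claims[claim_id])
        return claim_id

    def read(self, claim_id: int, chunk_size: int = 4) -> Iterator[bytes]:
        data = self._claims[claim_id]
        for i in range(0, len(data), chunk_size):
            yield data[i:i + chunk_size]


@dataclass
class FlowFile:
    attributes: Dict[str, str]
    claim_id: int  # pointer into the content repository


def clone(ff: FlowFile) -> FlowFile:
    # A copy is a new pointer to the same claim; no content bytes are written.
    return FlowFile(dict(ff.attributes), ff.claim_id)


def transform(repo: ContentRepository, ff: FlowFile,
              fn: Callable[[bytes], bytes]) -> FlowFile:
    # fn sees one chunk at a time; the original claim is left untouched.
    new_claim = repo.write(fn(chunk) for chunk in repo.read(ff.claim_id))
    return FlowFile(dict(ff.attributes), new_claim)


class Connection:
    """Bounded queue: offer() fails when full, so the producer must wait."""

    def __init__(self, max_queued: int) -> None:
        self._queue: deque = deque()
        self.max_queued = max_queued

    def offer(self, ff: FlowFile) -> bool:
        if len(self._queue) >= self.max_queued:
            return False  # back pressure: nothing is accepted or dropped here
        self._queue.append(ff)
        return True

    def poll(self) -> Optional[FlowFile]:
        return self._queue.popleft() if self._queue else None


repo = ContentRepository()
original = FlowFile({"filename": "a.txt"}, repo.write([b"hello ", b"world"]))
copy = clone(original)                           # pointer only
upper = transform(repo, original, bytes.upper)   # new version, original kept

conn = Connection(max_queued=2)
accepted = [conn.offer(f) for f in (original, copy, upper)]
```

In this model the clone shares its claim with the original, so the only bytes written are the original content and the transformed version. The third `offer` is refused because the queue is already at its limit, which is the signal an upstream producer would use to slow down.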

People from StreamSets have previously said that NiFi is file-oriented. I'm not sure what difference there is, in general terms, between a file, a record, a tuple, an object, or a message. Once data is in the flow, it is a "thing that needs to be managed and delivered," and that is what NiFi does. Whether you have many small, fast-moving things or a few large ones, and whether they come from a live audio stream on the Internet or from a file on your hard drive, it doesn't matter. Once the data is in the flow, it's time to manage it and deliver it.

StreamSets also noted that NiFi is schema-less. It is entirely true that NiFi doesn't force data out of its original form into some special NiFi format, and so we never need to convert it back into another format for onward delivery. Doing so would be unfortunate, because even the most trivial cases would then pay a performance cost for the conversion.

Mykhailo Makhno
