NiFi watchers review

Damiano Giampaoli edited this page Apr 20, 2017 · 4 revisions

This section investigates how well NiFi watcher processors can meet the requirements of GIS raster data.

The typical NiFi processors associated with the watcher role are those whose names start with the prefix Get.

GetAzureEventHub, GetCouchbaseKey, GetDynamoDB, GetFile, GetFTP, GetHBase, GetHDFS, GetHDFSEvents, GetHDFSSequenceFile, GetHTMLElement, GetHTTP, GetIgniteCache, GetJMSQueue, GetJMSTopic, GetKafka, GetMongo, GetSFTP, GetSNMP, GetSolr, GetSplunk, GetSQS, GetTwitter

Let's read the GetFile processor description:

This processor obtains FlowFiles from a local directory. NiFi will need at least read permissions on the files it will pull; otherwise it will ignore them.

The main issue is that NiFi is built around its content repository (see the references section below): the content of every flowfile must be stored there. This design is a poor fit for raster data because it forces needless copies of very large files.

Ideally, the flowfile would propagate only the absolute filesystem path of the file to process. The flowfile payload is therefore of little use for our GIS use cases, which can rely mostly on flowfile attributes.

Some other watcher implementations share the same constraints:

  • they are focused on flowfile generation
  • by design, flowfile content can only be located inside the content repository

They can be usable anyway. Take the GetKafka processor description:

This Processor polls Apache Kafka for data. When a message is received from Kafka, this Processor emits a FlowFile where the content of the FlowFile is the value of the Kafka message. If the message has a key associated with it, an attribute named kafka.key will be added to the FlowFile, with the value being the UTF-8 Encoded value of the Message’s Key.

Suppose our Kafka messages announce that a new file is available for processing, stored either on the local filesystem or in an S3 bucket. One key in the message can indicate whether the file is on the filesystem or in S3, and another key can hold the path or URL. All of this information can be propagated through NiFi flowfile attributes, while the actual imagery (which is hard to manage as a plain flowfile stored in the content repository) stays outside NiFi.
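As a rough illustration of the message layout described above, here is a minimal Python sketch that turns a message value into attribute-style key/value pairs. The JSON encoding, the field names `storage` and `location`, and the attribute names `raster.storage` and `raster.location` are all assumptions made for this example; the page does not define a schema.

```python
import json

# Hypothetical schema: the wiki does not define field or attribute names.
STORAGE_TYPES = {"fs", "s3"}


def message_to_attributes(payload: bytes) -> dict:
    """Map a Kafka message value (JSON bytes) to flowfile-style attributes."""
    message = json.loads(payload.decode("utf-8"))

    # First key: where the raster lives (local filesystem or S3).
    storage = message["storage"].lower()
    if storage not in STORAGE_TYPES:
        raise ValueError(f"unsupported storage type: {storage!r}")

    # Second key: the path or URL of the raster itself.
    location = message["location"]
    if storage == "fs" and not location.startswith("/"):
        raise ValueError("filesystem locations must be absolute paths")
    if storage == "s3" and not location.startswith("s3://"):
        raise ValueError("S3 locations must be s3:// URLs")

    # Only these small strings travel with the flowfile, not the imagery.
    return {"raster.storage": storage, "raster.location": location}


if __name__ == "__main__":
    example = b'{"storage": "s3", "location": "s3://imagery/scene-001.tif"}'
    print(message_to_attributes(example))
```

Inside a NiFi flow the same extraction would more likely be done by a processor such as EvaluateJsonPath writing to flowfile attributes, rather than by custom code; the sketch only shows how little data needs to move.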

Conclusions

The NiFi components that act as watchers are the processors whose names start with the prefix Get.

Some of them, such as GetKafka, can be used in a flow, with some adjustment of the processor configuration, without the double copy of raster data. The drawback is that the flow designer must add extra steps to the flow to remove the temporary data produced during execution.

Other processors, such as GetFile, which could be useful for monitoring the filesystem, need improvement because NiFi imports the raster file into the content repository and uses it as the flowfile content. A workaround is to add a further step (besides the cleanup step described for Kafka) based on txt files that contain the absolute path of the raster file to process; the payload propagated is then only the path, not the image itself.
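The txt-file workaround could look roughly like the following Python sketch, which validates the content of such a pointer file. The one-path-per-file convention, the POSIX path assumption, and the function name `read_pointer` are hypothetical choices for illustration only.

```python
from pathlib import PurePosixPath


def read_pointer(content: str) -> str:
    """Return the absolute raster path stored in a pointer file's text.

    Assumes (hypothetically) one POSIX path per pointer file.
    """
    # Ignore blank lines and surrounding whitespace.
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError("a pointer file must contain exactly one path")

    path = PurePosixPath(lines[0])
    if not path.is_absolute():
        raise ValueError(f"not an absolute path: {lines[0]!r}")
    return str(path)


if __name__ == "__main__":
    # The pointer file is a few bytes; the raster it names is never copied.
    print(read_pointer("/data/rasters/scene-001.tif\n"))
```

With this convention the watcher picks up only the tiny txt file, so the content repository stores a path of a few bytes while the raster stays where it is.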

Content repository references

What is Content Repository Archiving?

How to Know in which directory data stores after the process is completed in Nifi

Does Nifi always create a Content Repository on Disk?
