Use Case 1 - Reading and filtering a stream of syslog data

You have been tasked with filtering a noisy stream of syslog events which are available in a Kafka topic. The goal is to identify critical events and write them to another Kafka topic. Visit the Cloudera YouTube channel for a video walkthrough of this use case.

use-case-1_overview.png

1.1 Open ReadyFlow & start Test Session

  1. On the CDP Public Cloud Home Page, navigate to DataFlow
  2. Navigate to the ReadyFlow Gallery
  3. Explore the ReadyFlow Gallery

Info:

The ReadyFlow Gallery is where you can find out-of-box templates for common data movement use cases. You can directly create deployments from a ReadyFlow or create new drafts and modify the processing logic according to your needs before deploying.


  1. Search for the “Kafka filter to Kafka” ReadyFlow.
  2. Click on “Create New Draft” to open the ReadyFlow in the Designer
  3. Select the only available workspace and give your draft a name
  4. Click Create. You will be forwarded to the Designer
  5. Start a Test Session by either clicking on the start a test session link in the banner or going to Flow Options and selecting Start in the Test Session section.
  6. In the Test Session creation wizard, keep the defaults for NiFi Runtime Version (latest), Inbound Connections (unchecked) and Custom NAR Configuration (unchecked). Click Start Test Session. Notice how the status at the top now says “Initializing Test Session”.

Note: Test Session initialization should take about 5-10 minutes.


Info:

Test Sessions provision infrastructure on the fly and allow you to start and stop individual processors and send data through your flow. By running data through processors step-by-step and using the data viewer as needed, you’re able to validate your processing logic during development in an iterative way without having to treat your entire data flow as one deployable unit.


1.2 Modifying the flow to read syslog data

The flow consists of three processors and looks very promising for our use case. The first processor reads data from a Kafka topic, the second processor allows us to filter the events before the third processor writes the filtered events to another Kafka topic. All we have to do now to reach our goal is to customize its configuration to our use case.

1. Provide values for predefined parameters

  • a. Navigate to Flow Options → Parameters

  • b. Select all parameters that show No value set and provide the following values.

    Note: The parameter values that need to be copied from the Trial Manager homepage are found by selecting Manage Trial in the upper right corner and then selecting Configurations. See screenshots below.

| Name | Value |
| --- | --- |
| CDP Workload User | srv_nifi-kafka-ingest |
| CDP Workload User Password | <Copy the value for 'nifi-kafka-ingest-password' from Trial Manager homepage> |
| Data Input Format | AVRO |
| Data Output Format | JSON |
| Kafka Broker Endpoint | <Comma-separated list of Kafka Broker addresses. Copy the value for 'kafka_broker' from Trial Manager homepage> |
| Kafka Consumer Group ID | cdf |
| Kafka Destination Topic | syslog_critical |
| Kafka Producer ID | cdf |
| Kafka Source Topic | syslog_avro |
| Schema Name | syslog |
| Schema Registry Hostname | <Hostname of Schema Registry service. Copy the value for 'schema_registry_host_name' from Trial Manager homepage> |

manage-trial.png

trial-configurations.png

  • c. Click Apply Changes to save the parameter values

2. Start Controller Services

  • a. Once your test session has started, navigate to Flow Options → Services

  • b. Select the CDP_Schema_Registry service and click the Enable Service and Referencing Components action

    service-and-referencing-components.png

    Note: If clicking Enable Service and Referencing Components does not cause CDP_Schema_Registry and other services to enable, perform the following steps:

    1. Click Enable.

      enable-service.png

    2. The CDP_Schema_Registry service should be enabled. Click Disable.

      disable-service.png

    3. Click Enable Service and Referencing Components.

  • c. Make sure all services have been enabled

    enable-services-kafka-filter-to-kafka.png

Navigate back to the Flow Designer canvas by clicking on the Back To Flow Designer link at the top of the screen.

3. Start the ConsumeFromKafka processor using the right-click action menu or the Start button in the configuration drawer, if it hasn’t been started automatically.

consume-from-kafka-processor.png

4. Stop the Filter Events processor, if it is running, using the right-click action menu or the Stop button in the configuration drawer.

filter-events-processor-running.png

5. Stop the WriteToKafka processor if it is running

write-to-kafka-processor.png

After starting the ConsumeFromKafka processor and stopping the remaining processors, you should see events starting to queue up in the success_ConsumeFromKafka-FilterEvents connection.

6. Verify data being consumed from Kafka

  • a. Right-click on the success_ConsumeFromKafka-FilterEvents connection and select List Queue

    success_ConsumeFromKafka-connection.png


Info:

The List Queue interface shows you all flow files that are being queued in this connection. Click on a flow file to see its metadata in the form of attributes. In our use case, the attributes tell us a lot about the Kafka source from which we are consuming the data. Attributes change depending on the source you’re working with and can also be used to store additional metadata that you generate in your flow.


list-queue-kafka-filter-to-kafka.png

  • b. Select any flow file in the queue and click the book icon to open it in the Data Viewer

    book-icon-data-viewer.png


Info:

The Data Viewer displays the content of the selected flow file and shows you the events that we have received from Kafka. It automatically detects the data format - in this case JSON - and presents it in a human-readable format.


  • c. Scroll through the content and note how we are receiving syslog events with varying severity.

    data-viewer-varying-severity.png

7. Define filter rule to filter out low severity events

  • a. Return to the Flow Designer by closing the Data Viewer tab and clicking Back To Flow Designer in the List Queue screen.

  • b. Select the Filter Events processor on the canvas. We will be using a QueryRecord processor to filter out low severity events. The QueryRecord processor is very flexible and can run several filtering or routing rules at once.

    filter-events-processor-stopped.png

  • c. In the configuration drawer, scroll down until you see the filtered_events property. We are going to use this property to filter out the low severity events. Click on the menu at the end of the row and select Go To Parameter.

    go-to-parameter.png

  • d. Once you are on the Parameters view, you can see that this parameter already has a SQL query assigned to it. However, it currently does not do any filtering; rather, it passes all events through. (Syslog severity semantics and an equivalent filter sketch follow this step.)

  • e. Modify the parameter value to:

    SELECT * FROM FLOWFILE WHERE severity <= 2
    

    filter-rule-high-severity.png

  • f. Click Apply Changes to update the parameter value. Return back to the Flow Designer.

  • g. Select the Filter Events processor on the canvas. Start the Filter Events processor using the right-click menu or the Start icon in the configuration drawer.
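For context, syslog severity values follow RFC 5424, where lower numbers are more severe: 0 = Emergency, 1 = Alert, 2 = Critical, 3 = Error, 4 = Warning, 5 = Notice, 6 = Informational, 7 = Debug. The query above therefore keeps only Emergency, Alert and Critical events. Below is a minimal Python sketch of the same filter logic applied to JSON records; only the severity field is taken from the query, while the other field names and sample values are illustrative and may differ from the actual syslog schema.

```python
# Minimal sketch: the same "severity <= 2" filter expressed in plain Python.
# Only the 'severity' field comes from the QueryRecord SQL above; the other
# fields and values are illustrative and may not match the real syslog schema.
import json

sample_events = [
    {"severity": 1, "hostname": "host-a", "message": "disk failure"},    # Alert
    {"severity": 2, "hostname": "host-b", "message": "service down"},    # Critical
    {"severity": 6, "hostname": "host-c", "message": "user logged in"},  # Informational
]

# Equivalent of: SELECT * FROM FLOWFILE WHERE severity <= 2
critical_events = [event for event in sample_events if event["severity"] <= 2]

print(json.dumps(critical_events, indent=2))  # only the Alert and Critical events remain
```

Because each dynamic property on QueryRecord becomes its own outgoing relationship, you could also add a second property (for example, one with `WHERE severity > 2`) to route low severity events to a different connection instead of dropping them.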

8. Verify that the filter rule works

  • a. After starting the Filter Events processor, flow files will start queueing up in the filtered_events-FilterEvents-WriteToKafka connection

    filter-events-processor-queue.png

  • b. Right-click the filtered_events-FilterEvents-WriteToKafka connection and select List Queue.

  • c. Select a few random flow files and open them in the Data Viewer to verify that only events with severity <=2 are present.

    data-viewer-high-severity.png

  • d. Navigate back to the Flow Designer canvas.

9. Write the filtered events to the Kafka alerts topic

Now all that is left is to start the WriteToKafka processor to write our filtered high-severity events to the syslog_critical Kafka topic. (A sketch for verifying the output outside of NiFi follows this step.)

  • a. Select the WriteToKafka processor and explore its properties in the configuration drawer.
  • b. Note how we are plugging in many of our parameters to configure this processor. Values for properties like Kafka Brokers, Topic Name, Username, Password and the Record Writer have all been parameterized and use the values that we provided in the very beginning.
  • c. Start the WriteToKafka processor using the right-click menu or the Start icon in the configuration drawer.
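If you would like to double-check the output outside of the NiFi UI, here is a minimal consumer sketch using the confluent-kafka Python client. It reuses the Kafka Broker Endpoint, workload user and password parameters from earlier and assumes the brokers accept SASL_SSL with the PLAIN mechanism; adjust the security settings (and the per-message JSON parsing) to whatever your cluster and record writer actually produce.

```python
# Verification sketch (not part of the ReadyFlow): read a few records from the
# syslog_critical topic and confirm that every event has severity <= 2.
# Assumes `pip install confluent-kafka`, SASL_SSL/PLAIN authentication with the
# CDP workload user, and one JSON object per Kafka message; adjust as needed.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "<kafka_broker>",       # the Kafka Broker Endpoint parameter value
    "group.id": "verify-critical-events",        # separate group so the flow's offsets are untouched
    "auto.offset.reset": "earliest",
    "security.protocol": "SASL_SSL",             # assumption: match your cluster's security settings
    "sasl.mechanism": "PLAIN",
    "sasl.username": "srv_nifi-kafka-ingest",
    "sasl.password": "<nifi-kafka-ingest-password>",
})
consumer.subscribe(["syslog_critical"])

try:
    for _ in range(10):                          # peek at up to 10 records
        msg = consumer.poll(timeout=5.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        assert event["severity"] <= 2, event     # only critical-or-worse events should be here
        print(event)
finally:
    consumer.close()
```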

Congratulations! You have successfully customized this ReadyFlow and achieved your goal of sending critical alerts to a dedicated topic! Now that you are done with developing your flow, it is time to deploy it in production!

1.3 Publishing your flow to the catalog

1. Stop the Test Session

  • a. Click the toggle next to Active Test Session to stop your Test Session

    active-test-session.png

  • b. Click “End” in the dialog to confirm. The Test Session is now stopping and allocated resources are being released

    ending-test-session.png

2. Publish your modified flow to the Catalog

  • a. Open the “Flow Options” menu at the top

  • b. Click “Publish” to make your modified flow available in the Catalog

  • c. Append your username to the Flow Name and provide a Flow Description. Click Publish.

    publish-a-new-flow.png

  • d. You are now redirected to your published flow definition in the Catalog.

    published-flow-definition.png


Info:

The Catalog is the central repository for all your deployable flow definitions. From here you can create auto-scaling deployments from any version or create new drafts and update your flow processing logic to create new versions of your flow.


1.4 Creating an auto-scaling flow deployment

1. Start the Deployment Wizard

Click Deploy, select the available environment from the drop down menu and click Continue to start the Deployment Wizard.

new-deployment-kafka.png

2. Complete the Deployment Wizard

The Deployment Wizard guides you through a six-step process to create a flow deployment. Throughout the six steps you will choose the NiFi configuration of your flow, provide parameters and define KPIs. At the end of the process, you are able to generate a CLI command to automate future deployments.

  • a. Provide a name such as Critical Syslogs - Prod to indicate the use case and that you are deploying a production flow. Click Next.

    deployment-overview.png

  • b. The NiFi Configuration screen allows you to customize the runtime that will execute your flow. You have the opportunity to pick from various released NiFi versions.

    Change the NiFi version to a Previous Version and make sure Automatically start flow upon successful deployment is checked.

    Click Next.

  • c. The Parameters step is where you provide values for all the parameters that you defined in your flow. In this example, you should recognize many of the prefilled values from the previous exercise - including the Filter Rule and our Kafka Source and Kafka Destination Topics.

    To advance, you have to provide values for all parameters. Select the No Value option to only display parameters without default values.

    no-value-checkbox.png

    You should now only see one parameter - the CDP Workload User Password parameter which is sensitive. Sensitive parameter values are removed when you publish a flow to the catalog to make sure passwords don’t leak.

    Provide the CDP Workload User Password (the value for 'nifi-kafka-ingest-password' from the Trial Manager homepage) and click Next to continue.

    cdp-workload-user-password-value.png

  • d. The Sizing & Scaling step lets you choose the resources that you want to allocate for this deployment. You can choose from several node configurations and turn on Auto-Scaling.

    Let’s choose the Extra Small Node Size and turn on Auto-Scaling from 1-3 nodes.

    Check the Flow Metrics Scaling option that appears after Auto-Scaling has been enabled. In addition to CPU based scaling, Flow Metrics Scaling uses back pressure prediction based on data in your flow to make intelligent scaling choices.

    sizing-and-scaling.png

    Click Next.

  • e. The Key Performance Indicators (KPI) step allows you to monitor flow performance. You can create KPIs for overall flow performance metrics or in-depth processor or connection metrics.

    Add the following KPI

    • KPI Scope: Entire Flow

    • Metric to Track: Data Out

    • Alerts:

      • Trigger alert when metric is less than: 1 MB/sec
      • Alert will be triggered when the metric is outside the boundary(s) for: 1 Minute

      kpi-entire-flow.png

    Add the following KPI

    • KPI Scope: Processor

    • Processor Name: ConsumeFromKafka

    • Metric to Track: Bytes Received

    • Alerts:

      • Trigger alert when metric is less than: 512 KBytes
      • Alert will be triggered when the metric is outside the boundary(s) for: 30 seconds

      kpi-processor.png

    Review the KPIs and click Next.

    kpi-list.png

  • f. In the Review page, review your deployment details.

  • g. Click Deploy to initiate the flow deployment!

  • h. You are redirected to the Deployment Dashboard where you can monitor the progress of your deployment.

    deploying.png

    Note: Creating the deployment may take up to 15 minutes. Only the latest NiFi version is cached on the nodes. Any other version (like the previous version selected earlier for this deployment) has to be downloaded.

  • i. Congratulations! Your flow deployment has been created and is already processing Syslog events!

    deployed.png

1.5 Monitoring your flow deployment

  1. Notice how the dashboard shows you the data rates at which a deployment currently receives and sends data. The data is also visualized in a graph that shows the two metrics over time.

  2. Change the Metrics Window setting at the top right. You can visualize up to 24 hours of data.

  3. Click on the Critical Syslogs - Prod deployment. The side panel opens and shows more detail about the deployment. On the KPIs tab it will show information about the KPIs that you created when deploying the flow.

    Using the two KPIs, Bytes Received and Data Out, we can observe that our flow is filtering out data as expected, since it reads more than it sends out.

    kpi-metrics.png

  4. Switch to the System Metrics tab where you can observe the current CPU utilization rate for the deployment. Our flow is not doing a lot of heavy transformation, so CPU utilization shouldn’t be too high.

  5. Close the side panel by clicking anywhere on the Dashboard.

  6. Notice how your Critical Syslogs - Prod deployment shows Concerning Health status. Hover over the warning icon and click View Details.

    concerning-health.png

  7. You will be redirected to the Alerts tab of the deployment. Here you get an overview of active and past alerts and events. Expand the Active Alert to learn more about its cause.

    active-alert.png

    After expanding the alert, it is clear that it is caused by a KPI threshold breach for sending less than 1 MB/s to external systems, as defined earlier when you created the deployment. (The trigger semantics are sketched below.)
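To make the trigger condition concrete, the sketch below models the alert semantics you configured: the alert fires once the metric has stayed below its threshold for the full configured duration. This is a simplified illustration of the described behavior, not how DataFlow evaluates KPIs internally.

```python
# Simplified model of the "Data Out" KPI alert configured earlier:
# less than 1 MB/sec sustained for a full minute triggers the alert.
# This is an illustration only, not CDF's actual evaluation logic.

THRESHOLD_MB_PER_SEC = 1.0
BREACH_DURATION_SEC = 60

def alert_triggered(samples_mb_per_sec, sample_interval_sec=10):
    """Return True if the metric stayed below the threshold for the full duration."""
    consecutive_breach_sec = 0
    for value in samples_mb_per_sec:
        if value < THRESHOLD_MB_PER_SEC:
            consecutive_breach_sec += sample_interval_sec
            if consecutive_breach_sec >= BREACH_DURATION_SEC:
                return True
        else:
            consecutive_breach_sec = 0   # a sample back above the threshold resets the window
    return False

# Our filtered flow sends out far less than it reads in, so Data Out sits well
# below 1 MB/sec and the alert fires -- which is why the deployment reports
# "Concerning Health" even though it is behaving exactly as intended.
print(alert_triggered([0.2, 0.3, 0.1, 0.2, 0.3, 0.2]))  # True after ~60 seconds below threshold
```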

1.6 Managing your flow deployment

  1. Click on the Critical Syslogs - Prod deployment in the Dashboard. In the side panel, click Manage Deployment at the top right.

    manage-deployment.png

  2. You are now redirected to the Deployment Manager. The Deployment Manager allows you to reconfigure the deployment: modify KPIs, change the number of NiFi nodes, turn auto-scaling on or off, or update parameter values.

    deployment-manager.png

  3. Explore the NiFi UI for your deployment. Click the Actions menu and click on View in NiFi.

    view-in-nifi.png

  4. You are redirected to the NiFi cluster running the flow deployment. You can use this view for in-depth troubleshooting. Users can have read-only or read/write permissions to the flow deployment. Double-click on the NiFi process group to see the deployment flow.

    nifi-kafka.png

  5. Go back to the Deployment Manager, expand the Actions menu, and click on Change NiFi Runtime Version.

    change-nifi-version.png

  6. Looks like we’re running an older version of NiFi. Let’s upgrade to the latest! Select the latest version from the dropdown and click “Update” (if you chose the latest NiFi version when creating the deployment, you can use the Update action to downgrade to the previous NiFi version too).

    You are redirected to the Dashboard and can monitor the upgrade process. Upgrading NiFi versions is as simple as that! One click!

    update-nifi-version2.png

    Note: The NiFi version update may take up to 5-10 minutes.

Congratulations, you have completed the first use case! While the NiFi version of your deployment is being upgraded, feel free to proceed to the next use case!


Info:

The completed flow definition for this use case can be found in the Catalog under the name "Use Case 1 - Cloudera - Filter critical syslog events". If you run into issues during your flow development, it may be helpful to select this flow in the Catalog and select "Create New Draft" to open it in the Flow Designer to compare to your own.