Events don't show on Middleware Provider Timeline #15756

Closed
gbaufake opened this issue Aug 8, 2017 · 23 comments

Comments

@gbaufake

gbaufake commented Aug 8, 2017

Description

Hello,

I've been trying to see Middleware Provider events on the Timeline, but none appear, although everything seems fine on the Hawkular Services side: group triggers and group members exist, events are being generated, and the connection between MIQ and the provider seems fine as well.

Environment

  • ManageIQ Docker instance using docker:latest
  • Hawkular Services instance using hawkular-services:latest

Samples of Logs

  • Event Sample :

[ { "eventType": "EVENT", "tenantId": "hawkular", "id": "MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1-1502156325254-4c581377-1c55-4474-8f72-4d1f31f56d1f", "ctime": 1502156325254, "dataSource": "_none_", "dataId": "MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1", "category": "TRIGGER", "text": "Test-Alert1-Instance1", "context": { "dataId.hm.prefix": "hm_g_", "dataId.hm.type": "gauge", "miq.alert_profiles": "22", "resource_path": "/t;hawkular/f;7402c000-6df6-46ae-9e79-9b4f71aa0ce4/r;EAP7-Standalone~~" }, "tags": { "miq.event_type": "hawkular_alert", "miq.resource_type": "MiddlewareServer" }, "trigger": { "tenantId": "hawkular", "id": "MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1", "name": "Test-Alert1-Instance1 for EAP7-Standalone", "description": "Test-Alert1-Instance1", "type": "MEMBER", "eventType": "EVENT", "eventCategory": null, "eventText": null, "severity": "MEDIUM", "context": { "dataId.hm.prefix": "hm_g_", "dataId.hm.type": "gauge", "miq.alert_profiles": "22", "resource_path": "/t;hawkular/f;7402c000-6df6-46ae-9e79-9b4f71aa0ce4/r;EAP7-Standalone~~" }, "tags": { "miq.event_type": "hawkular_alert", "miq.resource_type": "MiddlewareServer" }, "autoDisable": false, "autoEnable": false, "autoResolve": false, "autoResolveAlerts": true, "autoResolveMatch": "ALL", "dataIdMap": { "WildFly Memory Metrics~Heap Max": "hm_g_MI~R~[7402c000-6df6-46ae-9e79-9b4f71aa0ce4/EAP7-Standalone~~]~MT~WildFly Memory Metrics~Heap Max", "WildFly Memory Metrics~Heap Used": "hm_g_MI~R~[7402c000-6df6-46ae-9e79-9b4f71aa0ce4/EAP7-Standalone~~]~MT~WildFly Memory Metrics~Heap Used" }, "memberOf": "MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28", "enabled": true, "firingMatch": "ANY", "source": "_none_" }, "dampening": { "tenantId": "hawkular", "triggerId": 
"MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1", "triggerMode": "FIRING", "type": "STRICT", "evalTrueSetting": 1, "evalTotalSetting": 1, "evalTimeSetting": 0, "dampeningId": "hawkular-MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1-FIRING" }, "evalSets": [ [ { "evalTimestamp": 1502156325254, "dataTimestamp": 1502156342001, "type": "COMPARE", "condition": { "tenantId": "hawkular", "triggerId": "MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1", "triggerMode": "FIRING", "type": "COMPARE", "conditionSetSize": 2, "conditionSetIndex": 1, "conditionId": "hawkular-MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1-FIRING-2-1", "dataId": "hm_g_MI~R~[7402c000-6df6-46ae-9e79-9b4f71aa0ce4/EAP7-Standalone~~]~MT~WildFly Memory Metrics~Heap Used", "operator": "GT", "data2Id": "hm_g_MI~R~[7402c000-6df6-46ae-9e79-9b4f71aa0ce4/EAP7-Standalone~~]~MT~WildFly Memory Metrics~Heap Max", "data2Multiplier": 0.2 }, "value1": 365691568, "value2": 1366294528 }, { "evalTimestamp": 1502156325254, "dataTimestamp": 1502156342001, "type": "COMPARE", "condition": { "tenantId": "hawkular", "triggerId": "MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1", "triggerMode": "FIRING", "type": "COMPARE", "conditionSetSize": 2, "conditionSetIndex": 2, "conditionId": "hawkular-MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-ems-1d05c128-af7f-409d-8855-4ec373930bc6-alert-28-1-FIRING-2-2", "dataId": "hm_g_MI~R~[7402c000-6df6-46ae-9e79-9b4f71aa0ce4/EAP7-Standalone~~]~MT~WildFly Memory Metrics~Heap Used", "operator": "LT", "data2Id": "hm_g_MI~R~[7402c000-6df6-46ae-9e79-9b4f71aa0ce4/EAP7-Standalone~~]~MT~WildFly Memory Metrics~Heap Max", "data2Multiplier": 0.15 }, "value1": 365691568, "value2": 1366294528 } ] ] } ]
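To make the sample event easier to read, here is a minimal sketch of how its two COMPARE conditions evaluate. It assumes that a COMPARE condition means "value1 operator (data2Multiplier * value2)" and that firingMatch "ANY" fires when at least one condition is true; the method names are hypothetical and this is not the Hawkular Alerts implementation.

```ruby
# Values copied from the evalSets of the sample event above.
HEAP_USED = 365_691_568   # value1 (Heap Used)
HEAP_MAX  = 1_366_294_528 # value2 (Heap Max)

# Assumed semantics: value1 <operator> (multiplier * value2).
def compare_condition_true?(value1, operator, value2, multiplier)
  threshold = multiplier * value2
  case operator
  when 'GT' then value1 > threshold
  when 'LT' then value1 < threshold
  else raise ArgumentError, "unsupported operator: #{operator}"
  end
end

# The two conditions of the sample trigger: GT with 0.2 and LT with 0.15.
def sample_condition_results
  [
    compare_condition_true?(HEAP_USED, 'GT', HEAP_MAX, 0.2),
    compare_condition_true?(HEAP_USED, 'LT', HEAP_MAX, 0.15)
  ]
end

# Assumed meaning of firingMatch "ANY": one true condition is enough.
def sample_trigger_fires?
  sample_condition_results.any?
end

puts sample_condition_results.inspect
puts sample_trigger_fires?
```

Under these assumptions the first condition holds (heap used is above 20% of heap max) and the second does not, which is consistent with an event having been generated.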

MIQ Server Provider with Data:

image

MIQ Provider Timeline with no events:
image

@gbaufake gbaufake changed the title Events is not being shown on Middleware Provider Timeline Events don't show on Middleware Provider Timeline Aug 8, 2017
@gbaufake
Author

gbaufake commented Aug 8, 2017

@israel-hdez, @jshaughn

The comment above is a "summary" of what we saw on the test run.
Best Regards,
Guilherme Baufaker Rêgo

@abonas
Member

abonas commented Aug 9, 2017

@miq-bot add_label providers/hawkular

@abonas
Member

abonas commented Aug 9, 2017

@miq-bot add_label events

@abonas
Member

abonas commented Aug 10, 2017

@cfcosta please investigate
if needed, the issue can be moved to the hawkular provider repo
@miq-bot assign @cfcosta

@cfcosta

cfcosta commented Aug 10, 2017

@abonas on it.

@gbaufake
Author

@cfcosta
I've found this on the middleware log:
[2017-08-10T22:32:54.200273 #906:2ae4f7cc0388] WARN -- : EMS_1(Hawkular::EventCatcher::Stream) Error capturing events com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: myCassandra/172.17.0.2:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [myCassandra/172.17.0.2] Pool is busy (no available connection and the queue has reached its max size 9182)))
[----] W, [2017-08-10T22:35:04.296563 #906:2ae4f7cc0388] WARN -- : EMS_1(Hawkular::EventCatcher::Stream) Error capturing events Timed out reading data from server

@jshaughn
Contributor

I would be surprised if there was an actual issue in the event catcher, it has not changed in a while. I also doubt a problem in the timeline, which also has not changed to my knowledge. If you have confirmed that the events are generated in hAlerts, and tagged properly (they looked fine when we looked at this earlier this week) then the problem lies in the physical fetch from hAlerts. Either the catcher is not running or there is a problem like the one you posted above, which does not look good. That issue is due to many concurrent requests to Cassandra, which seems odd, because I don't see how our query would cause that problem. It seems almost like there is something else hitting cassandra and causing load. Can you examine the Cassandra load in some way?

By peeking into the postgres db you could look at the ems_events table and see whether the events are there, but it seems more like the fetch is failing. If you have a way to look into the DB you could eliminate the timeline as the issue.

@israel-hdez
Member

israel-hdez commented Aug 23, 2017

@cfcosta @jshaughn @gbaufake
I went back to this because I was on something else that, to some extent, depends on this. I could find several problems:

  • First, look at this line of code in event catcher. It uses ::Hawkular::Alerts::Alert to partition events, but it should be ::Hawkular::Alerts::Event. This is my fault. Sorry 😞 Once this is fixed, events will be stored in miq database but they still won't appear in timeline.
  • Events won't appear in the timeline because they are created in Hawkular with the tag miq.event_type == hawkular_alert (see source here). However, that event type is not listed in our settings.yml file, which explains why it's ignored when querying the timeline. I found this comment in the event catcher, and it makes me think that hiding those events/alerts was on purpose. But I don't know the exact background on that.
  • At least for servers in domain mode, member triggers are being created against metrics like hm_g_MI~R~[master.Unnamed+Domain/Local~/host=master/server=server-one]~MT~WildFly Memory Metrics~Heap Max. Notice the + character in Unnamed+Domain. This means that member triggers are being associated with metrics that don't exist, so events will never be raised in these cases. This is also a bug, but I didn't trace the code to find where it is.
  • And, finally, the event catcher is using EmsEvent to store events. I read the code of that class a little, and my understanding is that EmsEvent is for logging events in the timeline. This means that the event catcher is just recording hidden timeline events and completely ignoring the configuration in the alert. Probably, the event catcher should be doing something else to process the events, but I don't know what.

So, right now I don't know if this is an issue. I think we need more background on what the event catcher should be doing with the raised events. But right now, it would be accumulating them in the MIQ database, completely hidden from the user, and doing nothing else with those events 👎
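To illustrate the first point, here is a small sketch of why partitioning on the wrong class silently drops everything. The FakeHawkular classes are stand-ins for ::Hawkular::Alerts::Alert and ::Hawkular::Alerts::Event, and split_events is a hypothetical helper; the real event catcher code may be shaped differently.

```ruby
# Stand-in classes; the real ones live in the Hawkular client gem.
module FakeHawkular
  module Alerts
    Alert = Struct.new(:id)
    Event = Struct.new(:id)
  end
end

# Hypothetical helper: split fetched items into [matching, rest] by class.
def split_events(fetched, klass)
  fetched.partition { |item| item.is_a?(klass) }
end

# The events endpoint returns Event objects, not Alert objects.
fetched = [
  FakeHawkular::Alerts::Event.new('event-1'),
  FakeHawkular::Alerts::Event.new('event-2')
]

wrong, = split_events(fetched, FakeHawkular::Alerts::Alert)
right, = split_events(fetched, FakeHawkular::Alerts::Event)

puts "matched with Alert: #{wrong.size}"
puts "matched with Event: #{right.size}"
```

With the Alert class nothing matches, so nothing would be stored; with the Event class every fetched item is kept.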

@cfcosta

cfcosta commented Aug 23, 2017

@israel-hdez oh, perfect. Of the points you found, I found the first one only, so it's really good that you did manage to find other problems. I agree on what you said, after taking a look at the code.

@abonas what should we do after this then? It seems like it is broken, but it also seems like the breakage was on purpose. Not sure how to proceed on this.

@abonas
Member

abonas commented Aug 24, 2017

@jshaughn @lucasponce could you perhaps shed some light on the above? Perhaps you have a bit more background on @israel-hdez's findings?

@lucasponce
Contributor

@jshaughn can confirm when he is back from PTO, but I remember that this was on purpose for several reasons:

  • We use Hawkular Events internally and not all events should be shown. For example, only a MiQAlert event should be shown in the timeline, not the Hawkular Events that triggered it internally.
  • Also, there were other Hawkular events that we potentially want to show in the timeline. For this logic, events are tagged and filtered in a configurable way.

But I hope Jay can confirm this.

Also, regarding the BusyPoolException in the background, there is a JIRA, https://issues.jboss.org/browse/HWKALERTS-275, where it is being fixed.
Not sure if it is related to this issue or just a side comment as part of the investigation.

Another idea: to really confirm whether this feature is broken, I would define a MiQ Alert and indicate that we want to "raise a MiQ Event" and "show in timeline". At least that was the proper way to show MiQ Alerts on the timeline, which I think is what the original description is referring to.

@lucasponce
Contributor

Also, another side thought. From MiQ, the Hawkular Alerting definitions are defined in a simple way.
I remember triggers have autoDisable == off, which means that a trigger will always generate alerts for the same issue.
I.e. if a machine is down, every time you get a "down" datum, you will receive an alert.
With autoDisable == on for a trigger, if a machine is down and an alert is generated, no more alerts will be generated until the trigger is enabled again, or until some resolve rules are defined.

This comment is not really related to the issue, but it is linked with the concern about the number of events pulled from Hawkular and stored in MIQ; this can be a way to configure that from the backend.
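The difference between the two autoDisable settings can be shown with a toy model. ToyTrigger below is purely illustrative and only mirrors the behaviour described in this comment; it is not how Hawkular Alerting is implemented, and resolve rules are left out.

```ruby
# Toy model of a trigger; every name here is hypothetical.
class ToyTrigger
  attr_reader :alerts

  def initialize(auto_disable:)
    @auto_disable = auto_disable
    @enabled = true
    @alerts = []
  end

  def enabled?
    @enabled
  end

  # Re-enable the trigger manually.
  def enable!
    @enabled = true
  end

  # Feed one datum; `down` is true when the machine is reported as down.
  def process(down)
    return unless @enabled && down

    @alerts << 'machine down'
    # With autoDisable on, the trigger switches itself off after firing.
    @enabled = false if @auto_disable
  end
end

noisy = ToyTrigger.new(auto_disable: false)
quiet = ToyTrigger.new(auto_disable: true)
3.times do
  noisy.process(true)
  quiet.process(true)
end

puts "autoDisable off: #{noisy.alerts.size} alerts"
puts "autoDisable on:  #{quiet.alerts.size} alerts"
```

In this model, three "down" data produce three alerts with autoDisable off and a single alert with autoDisable on, until the trigger is enabled again.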

@gbaufake
Author

gbaufake commented Aug 24, 2017

Although https://issues.jboss.org/browse/HWKALERTS-275 and this issue were found using similar test cases, they exist independently.

Assuming https://issues.jboss.org/browse/HWKALERTS-275 is corrected, it should not be a problem for MIQ.

The test case used to find the present issue:

  1. Install MIQ latest, EAP server, Hawkular Services

  2. Log in to the ManageIQ instance

  3. Add the Hawkular Provider on the ManageIQ instance
    image

  4. Define an Alert on the ManageIQ instance with conditions
    image

  5. Create an Alert Profile
    image

  6. Assign the Alert to the Alert Profile
    image

  7. Assign the Server to the Alert Profile
    image

  8. Wait for the Hawkular Agent to collect metrics and send them to Hawkular Metrics

  9. Query Events on the Events endpoint (/hawkular/alerts/events?tags=miq.event_type|*)

    • It should have some events related to the Alert defined
    image

  10. Check the MIQ Timeline for events
    image

Considering @israel-hdez's findings, I think this needs more investigation on the MIQ side to check whether the events are stored in the MIQ database or whether some kind of filter is preventing them from showing on the timeline.
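For anyone reproducing the events query from the steps above, here is a minimal sketch that builds the request URL with Ruby's standard library. The host and port are placeholders, events_uri is a hypothetical helper, and an actual request would also need whatever credentials and tenant header the Hawkular Services instance expects.

```ruby
require 'uri'

# Hypothetical helper: build the events query URL for a given tags filter.
def events_uri(base_url, tags)
  uri = URI.join(base_url, '/hawkular/alerts/events')
  # encode_www_form percent-encodes characters such as "|" in the filter.
  uri.query = URI.encode_www_form(tags: tags)
  uri
end

# Placeholder host; replace with the real Hawkular Services address.
uri = events_uri('http://hawkular.example.com:8080', 'miq.event_type|*')
puts uri
```

Building the query with encode_www_form avoids having to hand-escape the tag filter when it is pasted into scripts.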

@lucasponce
Contributor

Did you check, in the MiQ form, the options to "raise" or "show" events in the timeline?
From what you describe, it sounds like a regression in the MiQ code (I am not sure).

@lucasponce
Contributor

Also, take into consideration that steps 9 and 10 are not directly connected: MiQ needs to process and filter Hawkular Events according to the MiQ Alert definitions.

@gbaufake
Author

@lucasponce I updated some pictures on my last comment.
The alert has the option "Show in timeline" marked.

@lucasponce
Contributor

@gbaufake I think you also need to check "Send a Management Event" to see them in the "Management Event" filter.
Also, AFAIK the Application filter was the default place for these events.

Can you check in the miq tables if there is some miq_alert generated ?

@israel-hdez
Member

@lucasponce I tried checking the boxes "Show on Timeline" and "Send a Management Event". The last one asks for a name. But the only thing it does is log events in the event_streams table using hawkular_alert instead of the name I typed. It doesn't do anything with the configs, and the events still don't appear in the timeline. But I still haven't fixed the bug about the wrong metric ID (domain mode case), and I had to feed the metric with artificial data. So, I'm not sure if the bug about the metric ID is preventing it from working correctly.

@jshaughn
Contributor

I've talked with @israel-hdez about this and, aside from the issue introduced recently due to the use of ::Hawkular::Alerts::Alert, the event catcher should be fine. It is important to understand that events tagged with miq.event_type: 'hawkular_alert' should not be shown on the timeline but are part of the internal implementation of "live alerting" in our provider. They are fetched by the event catcher but are then used to create actual MIQ Alerts. It is the MIQ alert that should be shown on the timeline. If I recall correctly, these are not shown in the 'Application' event group, but I could be wrong. I believe they show up under a different event group filter. Edgar is now looking to see if something broke in the code that converts Hawkular events into MIQ alerts.

@israel-hdez
Member

israel-hdez commented Aug 28, 2017

It's broken here:

if target.class.name == 'MiddlewareServer'

because target is the server object and, because of STI, target.class.name will be the full class name ManageIQ::Providers::Hawkular::MiddlewareManager::MiddlewareServer. It would be better to use the is_a? method.
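A self-contained sketch of the difference, using plain Ruby classes as stand-ins for the real models (no ActiveRecord involved; the base class ::MiddlewareServer and the two helper methods are assumptions made for illustration):

```ruby
# Stand-in for the base model.
class MiddlewareServer; end

# Stand-in for the provider-specific subclass that STI would instantiate.
module ManageIQ
  module Providers
    module Hawkular
      class MiddlewareManager
        class MiddlewareServer < ::MiddlewareServer; end
      end
    end
  end
end

# The comparison quoted above: compares against the short class name.
def server_by_name?(target)
  target.class.name == 'MiddlewareServer'
end

# The suggested alternative: checks the class hierarchy instead.
def server_by_type?(target)
  target.is_a?(::MiddlewareServer)
end

target = ManageIQ::Providers::Hawkular::MiddlewareManager::MiddlewareServer.new

puts target.class.name        # the fully qualified name, not "MiddlewareServer"
puts server_by_name?(target)  # false
puts server_by_type?(target)  # true
```

The is_a? check keeps working for any subclass, so it does not depend on how the provider namespaces its models.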

It's also broken here:

if event_id.start_with?("MiQ-#{alert_id}") && event.middleware_server_id == id

because alert IDs can now be different.
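Using the event id from the sample in the issue description, the sketch below shows why the prefix check no longer matches. It assumes that the number after "-alert-" is the MIQ alert id; alert_id_from is a hypothetical alternative for illustration only, not the fix that was eventually merged.

```ruby
# Event id copied from the sample event in the issue description.
SAMPLE_EVENT_ID = 'MiQ-region-2ecdd959-4b31-45c1-bc5a-afc9ca8e9fcf-' \
                  'ems-1d05c128-af7f-409d-8855-4ec373930bc6-' \
                  'alert-28-1-1502156325254-4c581377-1c55-4474-8f72-4d1f31f56d1f'

# The check quoted above: expects the id to start with "MiQ-<alert_id>".
def old_prefix_match?(event_id, alert_id)
  event_id.start_with?("MiQ-#{alert_id}")
end

# Hypothetical alternative: pull the number that follows "-alert-".
def alert_id_from(event_id)
  event_id[/-alert-(\d+)/, 1]&.to_i
end

puts old_prefix_match?(SAMPLE_EVENT_ID, 28)  # false
puts alert_id_from(SAMPLE_EVENT_ID)          # 28
```

The id now begins with "MiQ-region-...", so a literal "MiQ-28" prefix can never match it.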

Aaahh! 😱 It's broken in a lot of places.

@israel-hdez
Member

@miq-bot assign israel-hdez

@israel-hdez
Member

@gbaufake now that everything is merged, you may want to test when the next build is available.

@gbaufake
Author

gbaufake commented Oct 9, 2017

After a session with @israel-hdez, we verified the corrections and it is working under d32b663!
