What is the best way to be informed of errors that may have occurred during parsing? #361

jlittle-ptc · 2024-06-10T11:54:39Z

jlittle-ptc
Jun 10, 2024

While GCToolkit has generally handled any file I've thrown at it, there have been a couple of times where I've needed to do a bit of troubleshooting to figure out what was going on.

The first step I've generally taken is to use the -Dmicrosoft.debug flag to capture "not implemented" messages in the log, which usually indicate either a possible problem with the regex (#352) or that the parser doesn't know the pattern in question.

However, I have not figured out how I might be able to capture this information such that my application can react programmatically. For example, having the missed lines/events logged into a data structure that can be queried to determine if there were any issues parsing the file.

Is there a way to capture this sort of information during the Aggregation process, or is it something that would need to be added to the parser?

Some of the use cases I can envision:

Event couldn't be parsed as it is malformed or unknown to the parser
No events parsed from a normal text file, indicating bad input (NullPointerException when processing file without GC Events #359)
Event timestamps out of sequence (I think I've seen "Time Traveling" messages in the test suites)

kcpeppe · 2024-06-10T16:37:33Z

kcpeppe
Jun 10, 2024
Maintainer

Short story: currently, log lines don't make it past the parsers. The reasoning is historical, as there were a couple of cases that needed to be handled.

First, in pre-unified logging there were approximately 60 different flags that would affect the format of the file. Each flag or combination of flags has a different effect on the format of the log file. Accommodating all flags, let alone all combinations of flags, is simply a nightmare. This is especially true when arbitrary changes were frequently introduced with each release. Additionally, GC logs might be collected from stdout and consequently be mixed with output from other sources.

With unified logging, the logs can contain all kinds of data that isn't GC/memory related and, if the output is collected on stdout, it can be mixed with output from other sources. Thus, "select what we recognize and ignore/log everything else" was the best solution I could come up with at the time. That the data source (GC log) is separate from the parsing, which is separate from the aggregation/views, helped me cope with changes in the logs that had no bearing on the data that was to be viewed. This is why log lines don't make it to the Aggregator.

I have added some code to enhance debugging support. I'm always happy if people are willing to contribute more. As for testing, GCToolKit can certainly use more.

0 replies

dsgrieve · 2024-06-10T17:17:42Z

dsgrieve
Jun 10, 2024
Maintainer

I think the issue, @kcpeppe, is that any log lines not parsed are just swallowed by the parser with some log message. There are also logged lines that are swallowed as "not interesting". I think @jlittle-ptc is saying that there should be a way of passing those unparsed lines on for further handling.

1 reply

jlittle-ptc Jun 10, 2024
Author

This is correct. For my specific use cases, I have two concerns:

What log lines aren't being parsed, so I can look towards adding that functionality?
What errors occurred during parsing so that I can report that the displayed results may not be accurate?

(I'd also like to know which lines are skipped as not-interesting, more for my own knowledge than anything else.)

If the lines can't currently make it past the parser, the parser would likely need to populate this information into the JVM object so the application can use it. It could be as simple as "counts" of non-parseable categories for production use, with an option to turn on debugging to get full details written out.
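The "counts" idea above could look roughly like the following. This is only a sketch under stated assumptions: `ParseDiagnostics`, its `Category` values, and the `keepDetails` switch are hypothetical names invented for illustration and are not part of GCToolkit. It keeps cheap per-category counters for production use and stores the offending lines only when details are switched on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in GCToolkit.
final class ParseDiagnostics {

    // Categories loosely based on the use cases listed in this discussion.
    enum Category { UNPARSED, NOT_INTERESTING, OUT_OF_SEQUENCE }

    private final Map<Category, Integer> counts = new EnumMap<>(Category.class);
    private final List<String> details = new ArrayList<>();
    private final boolean keepDetails;

    ParseDiagnostics(boolean keepDetails) {
        this.keepDetails = keepDetails;
    }

    // A parser would call this whenever it skips or gives up on a line.
    void record(Category category, String line) {
        counts.merge(category, 1, Integer::sum);
        if (keepDetails) {
            details.add(category + ": " + line);
        }
    }

    int count(Category category) {
        return counts.getOrDefault(category, 0);
    }

    // True when something was recorded that could make the results inaccurate.
    boolean hasProblems() {
        return count(Category.UNPARSED) > 0 || count(Category.OUT_OF_SEQUENCE) > 0;
    }

    // Empty unless the object was created with keepDetails = true.
    List<String> details() {
        return Collections.unmodifiableList(details);
    }
}
```

An application could then check `hasProblems()` after parsing and warn that the displayed results may not be accurate, without paying for full detail capture unless debugging is enabled.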

The key thing is that my application is able to react to problems that the library encountered. Right now, there doesn't seem to be anything I can catch or any objects I can inspect after parsing a file to make sure that the file parsed correctly and that the data points I've generated are accurate.

kcpeppe · 2024-06-10T18:36:59Z

kcpeppe
Jun 10, 2024
Maintainer

The challenge is, if the parser doesn't recognize some input, it's hard to say if that input has negatively impacted the analysis or not. All it knows is that it encountered something that it doesn't know how to deal with. Whether that "something" is meaningful is a question that would need to be answered by inspection.

My current thinking is that a summary of the error information could end up in the Diary. The diary is injected into GCLogParser. The JavaVirtualMachine class has access to the diary, as does the end user client. That would create a path to the information.

Another option would be to create a new event (ParsingErrorEvent) that would carry error information with it. One could aggregate that. Of the two, I think I prefer the former option. The error complicates the event hierarchy, which isn't a reason to not create the new event; it's just something that needs to be considered. Also, if an error is pub'ed, I'm not sure what you'd do with it in an Aggregator, so suggestions are welcome.
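To make the second option concrete, here is a rough sketch of what an error-carrying event and a consumer for it might look like. `ParsingErrorEvent` is only a name proposed in this thread; its fields, and the `ErrorCollector` class standing in for an Aggregator, are assumptions made for illustration and do not reflect GCToolkit's event hierarchy or Aggregator API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the fields below are invented for illustration.
final class ParsingErrorEvent {
    private final long lineNumber;
    private final String line;
    private final String reason;

    ParsingErrorEvent(long lineNumber, String line, String reason) {
        this.lineNumber = lineNumber;
        this.line = line;
        this.reason = reason;
    }

    long lineNumber() { return lineNumber; }
    String line() { return line; }
    String reason() { return reason; }
}

// Invented stand-in for an Aggregator that only collects error events.
final class ErrorCollector {
    private final List<ParsingErrorEvent> errors = new ArrayList<>();

    void receive(ParsingErrorEvent event) {
        errors.add(event);
    }

    // The client can use this to flag that displayed results may be incomplete.
    boolean resultsSuspect() {
        return !errors.isEmpty();
    }

    String summary() {
        if (errors.isEmpty()) {
            return "no parsing errors";
        }
        return errors.size() + " line(s) could not be parsed; first at line "
                + errors.get(0).lineNumber();
    }
}
```

One possible answer to "what would you do with it in an Aggregator" is exactly this: nothing beyond collecting, so that the client can attach a warning to whatever the other aggregations produce.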

As an FYI, notYetImplemented() was intended to be used only during development for debugging purposes.

0 replies

kcpeppe · 2024-06-10T18:45:53Z

kcpeppe
Jun 10, 2024
Maintainer

Here is an idea. With unified logging, one could inspect the tags if the tags are being included in the decorator set. If there are no tags/decorators in the line, then it's likely output from something else. If there are tags, they can be checked to see if they are in the supported set. If the tags are in the supported set and the line still wasn't parsed, then you'd mark the calculations as suspect. The open question is what to do when the decorators don't include the tags.
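A rough sketch of that triage for a line the parser failed to match is shown below. It rests on several assumptions: the class and enum names are hypothetical, the `SUPPORTED` set is an illustrative subset rather than GCToolkit's actual list, and the tags decorator (when present) is taken to be the last bracketed decorator at the start of the line. Because a missing tags decorator cannot be told apart from unsupported tags this way, both land in the same bucket, which mirrors the open question above.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch for unified logging lines only; not GCToolkit code.
final class UnparsedLineTriage {

    enum Kind { FOREIGN, SUPPORTED_TAGS, UNSUPPORTED_OR_NO_TAGS }

    // Illustrative subset only, not GCToolkit's actual supported tag list.
    private static final Set<String> SUPPORTED =
            Set.of("gc", "start", "heap", "phases", "cpu", "safepoint");

    // Matches consecutive [..] decorators anchored at the start of the line.
    private static final Pattern DECORATOR = Pattern.compile("\\G\\[([^\\]]*)\\]");

    static Kind classify(String line) {
        Matcher m = DECORATOR.matcher(line);
        String last = null;
        while (m.find()) {
            last = m.group(1); // assumption: tags, if present, come last
        }
        if (last == null) {
            return Kind.FOREIGN; // no decorators: likely output from something else
        }
        boolean allSupported = Arrays.stream(last.split(","))
                .map(String::trim) // tags may be padded, e.g. "gc,start    "
                .allMatch(SUPPORTED::contains);
        return allSupported ? Kind.SUPPORTED_TAGS : Kind.UNSUPPORTED_OR_NO_TAGS;
    }
}
```

An unparsed line classified as `SUPPORTED_TAGS` is the case where the calculations would be marked as suspect; `FOREIGN` lines could be dropped quietly.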

3 replies

dsgrieve Jun 10, 2024
Maintainer

wrong discussion?

kcpeppe Jun 10, 2024
Maintainer

is my answer that disjoint? :-)

kcpeppe Jun 10, 2024
Maintainer

ahh, I see. What I'm looking for is a way to filter out the lines that should be missed as opposed to lines that should be captured

dsgrieve · 2024-06-10T18:52:14Z

dsgrieve
Jun 10, 2024
Maintainer

Another option might be to chain another Channel into a parser. If Channel#consume returned a boolean whether the message was "published", then the parser could move on to the next Channel in the chain. Something like that. By "publish", I mean that it created an event or whatever "success" means to the Channel.
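The chaining idea might be sketched as follows. Note the assumptions: `LineChannel` and `ChannelChain` are hypothetical names, and the boolean return from `consume` is the proposal made in this comment, not the current behavior of GCToolkit's `Channel`. The chain offers each line to the channels in order and stops at the first one that reports success.

```java
import java.util.List;

// Hypothetical interface: "true" means the channel published something
// for this line, whatever success means to that channel.
interface LineChannel {
    boolean consume(String line);
}

// Hypothetical chain that hands a line to each channel until one accepts it.
final class ChannelChain {
    private final List<LineChannel> channels;

    ChannelChain(List<LineChannel> channels) {
        this.channels = List.copyOf(channels);
    }

    // Returns the index of the channel that accepted the line, or -1 if none did.
    int dispatch(String line) {
        for (int i = 0; i < channels.size(); i++) {
            if (channels.get(i).consume(line)) {
                return i;
            }
        }
        return -1;
    }
}
```

With a catch-all channel at the end of the chain (for example one that appends the line to a list of unparsed lines and returns `true`), unparsed input ends up somewhere the application can inspect instead of only in a log file.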

3 replies

kcpeppe Jun 10, 2024
Maintainer

This is an interesting way of getting unparsed lines to somewhere other than a file. That said, it's not uncommon that GCToolKit encounters log lines that it can't parse and there are many reasons for this other than we missed adding this particular rule for this expected line. The difficulty is that the parser won't know if the line is relevant to some calculation or not.

As an aside, somehow getting rid of unwanted lines prior to parsing would be a good thing. This is because parsing (as it is currently implemented) is a run-to-failure process, meaning a line it can't parse is the most expensive line to deal with. If opening up an alternative to logging helps people with their problem, I'm all for it. I'm just a bit sceptical that one will be able to reliably tell if a calculation has been corrupted because of a log line that can't be parsed. But, maybe we add it and see what happens???

jlittle-ptc Jun 10, 2024
Author

I'm still new with vertx, so I don't know if the verticles are sequential or it's a free-for-all on the event-bus, but this sounds like it might be like a coin sorting bank to use an analogy. If the first verticle can't parse it (known gc formats), then it goes to the next (suspected GC format based on heuristics?) and the next (Likely garbage).

I could see something like that working out...

kcpeppe Jun 10, 2024
Maintainer

I would set up a 3rd channel and pub the unparseable line on that channel. The other option is to create an error event and publish that on the event channel.
Vertx has a couple of modes to manage consumption of messages. The mode we use is deliver to all listeners. The listeners in this case are asymmetric.

dsgrieve · 2024-06-11T15:47:39Z

dsgrieve
Jun 11, 2024
Maintainer

With respect to Vertx, we abstract that away with com.microsoft.gctoolkit.message.*

0 replies
