
Merge-on-Red: Implemented YAML log reader alongside the XML ones #11807

Closed
wants to merge 1 commit into from

Conversation

ivdiazsa
Member

@ivdiazsa ivdiazsa commented Dec 1, 2022

Follow-up to issue #11559. Now that the Helix machines have the PyYAML dependency installed, we can proceed with the next stage: processing the YAML test logs.

For context, our current test logs are written in XML format. The problem with this is that if there's a fatal hang, freeze, or crash, the log ends up incomplete and therefore unreadable later on. This motivated adding another log in YAML format: even if it's "incomplete", it remains perfectly readable thanks to YAML's lack of closing indicators.
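
A small illustration of that point (the log shape below is hypothetical, not the actual Helix log schema): a block-style YAML document that gets cut off at a line boundary still parses, while the equivalent truncated XML does not.

import xml.etree.ElementTree as ET
import yaml  # PyYAML

# Both "logs" below were cut off mid-write, as if the test process had crashed.
truncated_yaml = """\
tests:
  - name: Test.Foo
    result: Pass
  - name: Test.Bar
"""
truncated_xml = "<tests><test name='Test.Foo' result='Pass'/><test name='Test.Bar'"

# YAML: still loads; we simply see the entries written so far.
print(yaml.safe_load(truncated_yaml)["tests"])

# XML: unreadable once truncated.
try:
    ET.fromstring(truncated_xml)
except ET.ParseError as err:
    print("XML parse failed:", err)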

For further information, the full project can be found at the following links:

The goal of this PR is to update Helix's scripts so that they can process the YAML logs alongside their existing XML counterparts.
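
Conceptually, the reporter change boils down to picking a reader by log type. A minimal sketch with hypothetical function names (the real Helix scripts are organized differently):

import os

def read_xml_results(path):
    # Stand-in for the existing XML reader (hypothetical name).
    return iter(())

def read_yaml_results(path):
    # Stand-in for the new YAML reader added by this PR (hypothetical name).
    return iter(())

def read_results(log_path):
    # Dispatch on the log file's extension.
    ext = os.path.splitext(log_path)[1].lower()
    if ext in (".yml", ".yaml"):
        return read_yaml_results(log_path)
    if ext == ".xml":
        return read_xml_results(log_path)
    raise ValueError("Unsupported test log format: " + log_path)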

@MattGal
Member

MattGal commented Dec 1, 2022

Rats. There's a problem here: since you're doing this in the reporter, it has to work inside Docker containers (log of failure).

I tried to make this an explicit dependency of the Helix client scripts (so it would be forcibly installed inside Docker scenarios), but it broke on some specific SLES 12 dependencies, so I settled for just installing it on the Helix clients themselves. I'm not sure how to unblock you here; sorry for the inconvenience. I will discuss with my team and let you know if I have any useful ideas.

@MattGal
Member

MattGal commented Dec 2, 2022

I apologize for any of your time I've wasted here, but at this point I don't see any reasonable way forward for this feature to work in Helix. Specifically, the Helix client .WHL is authored in such a way that it is compatible, without modification, with literally every macOS, Linux, and Windows machine we run on. This is made more complex by the fact that .NET Core supports some old, long-supported OSes such as SLES, and that changing out the version of Python 3 / pip on these machines can often break core operating system functionality and/or require building Python from source.

As such, the work I did end up doing was to make sure some version of PyYAML is on every Helix image, but this doesn't apply to the Docker scenarios. In that case, the dependencies of the Helix WHL are installed at the beginning of any Docker work item (where we try to pre-install them before running the work item), but expressing this dependency in our WHL prevents the WHL from working on SLES 12 and possibly SLES 15. These distros contain distutils versions of these packages which cannot be updated or downgraded via pip, and which break our installer if this dependency is expressed.

We walk a precarious line trying to keep a set of dependencies that can run the Helix client on all of our OSes and architectures, which is often frustrating for folks who want the latest and greatest packages / Python features. If and when we remove support for SLES, it would be possible to try putting this dependency back into the Helix client WHL and see if it works again. I'll also be the first to admit I'm still a relative newcomer to Python; if you think there's a reasonable way to rectify this problem, I'm willing to try other approaches and will do my best to see if we can make something work.
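
One option, purely as a sketch of an idea rather than anything Helix or this PR does today: treat PyYAML as an optional import in the reporter, so machines that have it use the YAML log and Docker scenarios quietly fall back to the XML one. The helper below and its parameters are hypothetical.

# Sketch: treat PyYAML as optional so the reporter still loads where it isn't installed.
try:
    import yaml
    HAVE_PYYAML = True
except ImportError:
    yaml = None
    HAVE_PYYAML = False

def choose_log(xml_path, yaml_path):
    # Hypothetical helper: prefer the YAML log only when PyYAML is importable.
    if HAVE_PYYAML and yaml_path is not None:
        return yaml_path
    return xml_path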

skip_reason = test.find("reason")
res = TestResult(name, u'yaml', type_name, method, time, result, exception_type,
                 failure_message, stack_trace, skip_reason, attachments)
yield res

I think we discussed that we need some sort of terminal detection, to make sure that the tests actually finished and didn't just abort in the middle. Presumably it can go at the end here, and just be something like

if not contents.get("completed"):
    yield TestResult("TEST CRASH"...)

I'm not sure what a good name for the fake test is, since you'd presumably want it to be different for different work items (so that you don't just end up with a bunch of "TEST CRASH" tests with no way to tell which actual test execution they're bound to).
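
A rough sketch of what that check could look like, folding the work item name into the synthetic result so crashes from different work items stay distinguishable. The "completed" key, the work_item_name parameter, and the specific argument values passed to TestResult are all assumptions, not the real schema:

def yield_crash_result_if_incomplete(contents, work_item_name):
    # Sketch only: emits one synthetic failing result when the parsed YAML log
    # never recorded completion (the key name and result values are assumed).
    if not contents.get("completed"):
        crash_name = work_item_name + ".TEST CRASH"  # unique per work item
        yield TestResult(crash_name, u'yaml', work_item_name, 'TEST CRASH', 0,
                         'Fail', 'TestCrash',
                         'YAML log never recorded completion; the run likely '
                         'crashed or hung.', None, None, [])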

@markwilkie
Member

Thanks @MattGal - I was sorta afraid this might happen. @ivdiazsa - we're hoping to work on a common/shared approach to testing, which this likely falls under.

@ivdiazsa
Member Author

ivdiazsa commented Dec 2, 2022

@MattGal This is a very unfortunate turn of events indeed. I really appreciate the detailed explanation of all that's going on nonetheless, Matt. I can't think of a solution off the top of my head right now, so we will have to discuss this problem with the whole Merge-on-Red team. We'll be in touch in the near future.

@agocke
Member

agocke commented Dec 5, 2022

I'm confused. Why does the YAML parsing logic need to run on the runners themselves? Don't we just need the parsing to run after the run has finished, in order to find the Build Analysis failures? Can't that be done in the orchestration layer?

@alexperovich
Member

The test result parsing happens on the same machine the tests run on, after the test run. This means whatever Python packages it needs must be installed there. Changing where the parsing happens would be risky feature work, but doable.

@agocke
Member

agocke commented Dec 5, 2022

Yeah, I guess I would have expected it to be done on a different machine for exactly this reason -- you don't want to be burdened with the machine requirements.

Moreover, if the runner is super slow for whatever reason I'd imagine you don't want to do compute-heavy work on it.

@ChadNedzlek
Member

The main problem is scale. Having to spin up another machine to send the results to would dramatically increase the cost of Helix. We already have a machine that can do the work, with the file already available. Moving that parsing to another machine means we need another machine, and the intermediate file needs to get transferred there somehow. Versioning becomes complicated as well: right now, since it's all running on the same machine with the same scripts, there are no N/N-1 versioning problems.

@markwilkie
Member

To reiterate, it's my belief that there's a solution here, but it likely involves another architectural layer, not a bolt-on to Helix.

@agocke
Member

agocke commented Dec 6, 2022

It sounds like a pretty big architectural change. We can think about it, but in the meantime I'm OK with standardizing on the xUnit XML format. Our test watchdog can be responsible for writing or fixing up the file if necessary.
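
A rough sketch of what "fixing up the file" could mean in practice. The element and attribute names follow the common xUnit v2 results layout, but the exact schema, file-path handling, and crash naming here are assumptions rather than what the watchdog actually does:

import xml.etree.ElementTree as ET

def fix_up_results(xml_path, work_item_name):
    # If the results file parses, leave it alone.
    try:
        ET.parse(xml_path)
        return
    except (ET.ParseError, FileNotFoundError):
        pass

    # Otherwise write a minimal well-formed document recording the crash so
    # downstream analysis still has something to read.
    assemblies = ET.Element("assemblies")
    assembly = ET.SubElement(assemblies, "assembly", name=work_item_name, failed="1")
    collection = ET.SubElement(assembly, "collection", name="WorkItemExecutionCheck")
    test = ET.SubElement(collection, "test",
                         name=work_item_name + ".WorkItemExecutionCheck",
                         result="Fail")
    failure = ET.SubElement(test, "failure")
    ET.SubElement(failure, "message").text = \
        "Results file was missing or truncated; the test run likely crashed."
    ET.ElementTree(assemblies).write(xml_path, encoding="utf-8", xml_declaration=True)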

@ivdiazsa ivdiazsa closed this May 3, 2023