
Add schema v4.4 with issue transitions #78

Draft · wants to merge 7 commits into main

Conversation

@spbnick commented May 21, 2024

Add schema v4.4 with support for describing issue presence transitions.

The transitions point out the range of revisions along the change history where an issue has appeared in or disappeared from the code. Note that although there are no explicit revision objects in the I/O schema, they exist as aggregated objects in the ORM and the dashboards.

Transitions are generally expected to be substantiated by issue incidents (which can be inferred from the revision and the issue linked from the transition). However, that's not strictly necessary, as long as the submitter makes it clear what happened in the transition and/or issue description (or perhaps in the linked report), and makes sure it really did happen.

Transitions can be "updated", i.e. new versions of them submitted, when e.g. they're located more precisely (the revision range was narrowed), or when they need to point to another issue.

@spbnick commented May 21, 2024

OK, here's what I came up with so far, @hardboprobot.

I decided to go with a much lighter coupling between regressions/recoveries and actual test artifacts, making the presence of the latter optional, and instead pointing to the boundary revisions directly. This is an optimization after all, so we might as well make it easier to query.

I don't have any more time today to try expressing your stats with this, but I should be able to do that tomorrow.

Please post the cases you think are the most difficult and/or important here, if you can. And I'll try to work them out.

Thanks!

@r-c-n commented May 22, 2024

According to the schema descriptions, issues, incidents and transitions are meant to be generated in the source CI systems and then submitted to KCIDB. Have you thought about the potential pros and cons of this approach vs having the CI systems submit only the results and having KCIDB generate issues, incidents and transitions from them?

Here are some ideas:

CI systems providing issues, incidents and transitions:
pros:

  • Simpler on the KCIDB side: no logic to implement here
  • These concepts are surely easier to track and detect individually in each CI system than on the aggregated data in KCIDB

cons:

  • May lead to inconsistent data across different CI systems
  • Possible object duplication if many sources detect the same issue, for instance (the data would still be segregated by origin, though)
  • Stricter entry barrier for CI systems to become KCIDB data providers
  • How to manage, detect and trust that the provided data is correct?
    Note: some of these cons could be reduced if there were a common set of tools / libraries to perform the detection and generation of these objects that all KCIDB-compliant CI systems could use

CI systems providing only results and KCIDB detecting and creating issues, incidents and transitions:
pros:

  • Consistency: all objects created in a single place rather than by independent data sources
  • The produced objects may be platform-agnostic -> unification of results

cons:

  • Harder to deal with transition detection unless we apply certain restrictions during the detection (for instance, restricting the search to a single origin / branch)

I think the creation of issues depends only on the info provided by the test results, and thus it can be platform and CI-agnostic -> it can be done by an independent and standalone process (we can work on a tool for this), but identification of transitions is probably easier to do if we factor in the data origin, i.e. I think it's easier to identify a transition looking at localized results on a per-CI basis.

What I mean is that maybe we can tackle these problems separately because they're different in nature, even if they will be related in the end. But first I think we need to clarify if these will be provided by the CI origins or created by KCIDB itself.

"The Git repository branch from which the commit "
"with the base source code was checked out."
},
"git_commit_generation": {

This is one of those concepts that may break once we consider branches that are frequently rebased (if we're using this to compare two commits, that is). "Later in history" is really hard to tell in those cases.

@spbnick

Yep, this won't support that.

@tales-aparecida commented May 22, 2024

In reply to #78 (comment)

I think the creation of issues depends only on the info provided by the test results, and thus it can be platform and CI-agnostic -> it can be done by an independent and standalone process (we can work on a tool for this), but identification of transitions is probably easier to do if we factor in the data origin, i.e. I think it's easier to identify a transition looking at localized results on a per-CI basis.

Based on my experience with CKI, where most Issues can be identified by a set of regexes over the logs generated by a single test + metadata about the build and tree, I can see how this might be true...

Just so I can understand this better, in that scenario you are proposing, we would only automatically create issues when processing test runs about trees, right? Patches that are still being reviewed have the potential to introduce issues, but they don't need to be tracked until they land IMO.

I'm really asking you this because, from my perspective, the CKI workflow cares more about test runs on patches (MRs), so we're still relying on human judgement to categorize whether the patch caused the FAIL or not, e.g. comparing the context of the failure with the lines of code that changed; comparing the outcome of the same tests in builds from other architectures and configs, that might've run on different hosts; manually checking ticket trackers for other trees; and even asking people if they had seen that before. Once we are sure the patch is not responsible for the FAIL, we create a new issue (which points to a ticket tracker with someone assigned to fix it ASAP), otherwise the developer should update their code and start a new test run.

EDIT: added "+ metadata about the build and tree"

@tales-aparecida

Now that I see the solidification of the idea of transition, I'm struggling to see how we'd go about dealing with flaky bugs in it. I would like to think that they would just be ignored until someone is sure the bugs are gone, but I can also see them being "toggled" with a pair of on-and-off transitions whenever they show up and inevitably disappear, if we're not careful.

I would like to see some examples or mock ups of the data that would be stored using this if we have time to discuss this further before merging.

@spbnick commented May 22, 2024

According to the schema descriptions, issues, incidents and transitions are meant to be generated in the source CI systems and then submitted to KCIDB. Have you thought about the potential pros and cons of this approach vs having the CI systems submit only the results and having KCIDB generate issues, incidents and transitions from them?

My aim is to support both cases, as needed. Have KCIDB do its own triaging and reporting on submitted results, as much as it can. However, if a CI system is equipped with more advanced machinery for detecting particular issues or regressions, let them do it until KCIDB catches up.

For our particular target of having your report generation work in KCIDB, my aim is to have the detection of conditions and generation of the appropriate objects just hardcoded inside KCIDB for the start. That would make sure we have enough data reported in Maestro results for processing, and verify that we can express the outputs.

Next step would be adding condition fields to issues, and building a generic machinery for doing the same thing, but executing the matching expressed in the database, corresponding to what you need, and what we were hardcoding before.

@spbnick commented May 22, 2024

Now that I see the solidification of the idea of transition, I'm struggling to see how we'd go about dealing with flaky bugs in it. I would like to think that they would just be ignored until someone is sure the bugs are gone, but I can also see them being "toggled" with a pair of on-and-off transitions whenever they show up and inevitably disappear, if we're not careful.

Yes, this is a very valid concern. It is orthogonal to the concept of regressions/transitions, though. It's a question of how well we can trust a test result. We would have to have a layer which employs a measure of stability and statistical analysis to make this more robust. I don't know how to do this exactly yet. But small steps. Let's figure this out first.

I would like to see some examples or mock ups of the data that would be stored using this if we have time to discuss this further before merging.

Yes, that's what I would like to do, first with the reports that the bot already sends to the #general channel. And we're far from merging this; in addition to that, we'll have to run it by the submitters after we agree on it, and before merging.

@r-c-n commented May 23, 2024

Just so I can understand this better, in that scenario you are proposing, we would only automatically create issues when processing test runs about trees, right? Patches that are still being reviewed have the potential to introduce issues, but they don't need to be tracked until they land IMO.

I'm really asking you this because, from my perspective, the CKI workflow cares more about test runs on patches (MRs), so we're still relying on human judgement to categorize whether the patch caused the FAIL or not, e.g. comparing the context of the failure with the lines of code that changed; comparing the outcome of the same tests in builds from other architectures and configs, that might've run on different hosts; manually checking ticket trackers for other trees; and even asking people if they had seen that before. Once we are sure the patch is not responsible for the FAIL, we create a new issue (which points to a ticket tracker with someone assigned to fix it ASAP), otherwise the developer should update their code and start a new test run.

I wasn't making any distinctions between tests on trees vs tests on patches. As long as there's a test output, the processing of the output and the parameters can generate information about the result, so the origin of the changes should be irrelevant for this process.

Having a human eye on it is still a necessary step because of all the reasons you listed, but I think we can make it easier for the people checking those details if a tool already pre-digests some of that automatically, as a sort of pre-classification. Many of those steps can be at least partially automated.

And, strictly speaking, if a patch was applied over a commit that passed for a particular test and that test now fails, to me the key point is not whether the cause of the failure was the patch or not, but that the patch uncovered a bug. Now, the bug may be in the patch, in different code unrelated to the patch, or in the test code itself. All those things are valuable outcomes in the sense that they lead to a bug investigation, regardless of whether it was the patch that was buggy or not. But these are just two different valid ways to look at the same situation.

@r-c-n commented May 23, 2024

Some regression report examples:

note: each report contains a link to the KernelCI node where you can see the full node structure. In some cases, the node may have been updated after the report is created.

@r-c-n commented May 23, 2024

Now that I see the solidification of the idea of transition, I'm struggling to see how we'd go about dealing with flaky bugs in it. I would like to think that they would just be ignored until someone is sure the bugs are gone, but I can also see them being "toggled" with a pair of on-and-off transitions whenever they show up and inevitably disappear, if we're not careful.

Back when we worked on the KernelCI regression tracker last year, we based our processing on the low-level data provided by KernelCI (whose concept of a "regression" was analogous to the concept of a "transition" we're talking about here), and it was the higher-level processing done by the tool that checked whether a test looked flaky, based on the frequency of fail-pass-fail transitions (using a very rudimentary rule of thumb; a rough SQL sketch of such a check follows the list below). This works as a process, and it leads to valuable added information that can be used to:

  • Create higher-level "regression" objects after applying a band-pass filter to weed out unwanted results (flaky tests, false positives, etc)
  • Report flaky tests to lab maintainers and test authors, depending on the cause of the flakiness
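
For illustration, a rule of thumb like that could look roughly like this against the transitions table proposed here (just a sketch; the seven-day window and the threshold of 3 are made-up numbers, and joining on the commit hash alone ignores re-checkouts):

/* Branches where the issue toggled suspiciously often in the last week */
SELECT
    checkouts.git_repository_url,
    checkouts.git_repository_branch,
    transitions.issue_id,
    COUNT(*) AS transitions_last_week
FROM transitions
INNER JOIN checkouts
    ON checkouts.git_commit_hash = transitions.revision_after_git_commit_hash
WHERE transitions._timestamp > now() - interval '7 days'
GROUP BY 1, 2, 3
HAVING COUNT(*) >= 3  /* arbitrary rule-of-thumb threshold */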

@spbnick commented May 23, 2024

  • Create higher-level "regression" objects after applying a band-pass filter to weed out unwanted results (flaky tests, false positives, etc)
  • Report flaky tests to lab maintainers and test authors, depending on the cause of the flakiness

Yeaaaah, that sounds good. I wonder if we should allow incidents to refer to transitions, so that we can create issues referring to transitions 🤔

@spbnick commented May 28, 2024

OK, I'll try answering on a case-by-case basis.

Some regression report examples:

note: each report contains a link to the KernelCI node where you can see the full node structure. In some cases, the node may have been updated after the report is created.

In the proposed model this case is represented by a transition with appearance set to true. It references an issue using the issue_id and issue_version fields. The issue describes the problem in a free-form comment (we don't have matching fields implemented yet), something like "Test storage.WriteZeroPerf failing". Both of these would be created by the hard-coded post-processing code for the start.

The issue would have to have a particular stable ID based on e.g. the test name and the status (the basic conditions we're watching for), so we can avoid creating different ones every time the test fails. We would need to think more about the transition ID, so we could find and update it if more data (e.g. bisection results) comes in.

The transition- and issue-generating code would also have to create the incidents connecting the issue to the particular tests that "created" the transition (both PASSing and FAILing), so we can find them. At least until we figure out something better (which could happen as we try to implement this).
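
To make this concrete, here's a rough sketch of the kind of rows that post-processing could produce (the ID format, the origin, the placeholder values, and the minimal set of issue columns shown are all made up for illustration, not a decided scheme):

/* A hypothetical issue with a stable ID derived from the test name and status */
INSERT INTO issues (id, version, origin, comment)
VALUES ('maestro:storage.WriteZeroPerf:FAIL', 1, 'maestro',
        'Test storage.WriteZeroPerf failing');

/* A hypothetical transition marking where the issue appeared on a branch */
INSERT INTO transitions (id, version, origin, issue_id, issue_version, appearance,
                         revision_before_git_commit_hash, revision_before_patchset_hash,
                         revision_after_git_commit_hash, revision_after_patchset_hash,
                         comment)
VALUES ('maestro:storage.WriteZeroPerf:FAIL:appeared', 1, 'maestro',
        'maestro:storage.WriteZeroPerf:FAIL', 1, true,
        'last passing commit hash', '',
        'first failing commit hash', '',
        'storage.WriteZeroPerf started failing');

/* Hypothetical incidents connecting the issue to the boundary test runs */
INSERT INTO incidents (id, origin, issue_id, issue_version, test_id, present)
VALUES ('maestro:incident:passing-boundary', 'maestro',
        'maestro:storage.WriteZeroPerf:FAIL', 1, 'ID of the last PASSing test', false),
       ('maestro:incident:failing-boundary', 'maestro',
        'maestro:storage.WriteZeroPerf:FAIL', 1, 'ID of the first FAILing test', true);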

Below is the description of the regression you linked. I'm commenting on each piece of data separately, in the context of KCIDB displaying the issue against the whole database, with its ID and version in hand. We will also likely need to have the same display in the context of a particular tree/repo/branch.

storage.WriteZeroPerf (tast-perf-long-duration-x86-intel)

This would be the issue's description.

  • KernelCI node: 66414c2bc3305e852749cce5

Not applicable.

  • Status: ACTIVE

In the context of the whole database, ACTIVE would mean that there's at least one tree/repo/branch which doesn't have a confirmed (tested) fix. That won't be very useful, probably, as it takes a long time to propagate (and test) the fix, and will never happen for removed/abandoned branches. The tree/repo/branch-context view would contain a more useful status. Still, it needs to be here for completeness.

So, for the whole database we can do an SQL query listing all the branches that have the latest transition of this issue appearing. Something like this (obviously untested):

/* Filter branches where issue appeared (regression in progress) */
SELECT git_repository_url, git_repository_branch
FROM (
    /* Get the type of the latest (generation-wise) transition of the issue in question for each distinct branch */
    SELECT DISTINCT ON (checkouts.git_repository_url, checkouts.git_repository_branch)
        checkouts.git_repository_url,
        checkouts.git_repository_branch,
        transitions.appearance
    FROM checkouts
    INNER JOIN transitions
    ON
        transitions.issue_id = 'ID of the issue of interest' AND
        transitions.issue_version = 'Version of the issue' AND
        transitions.revision_after_git_commit_hash = checkouts.git_commit_hash AND
        transitions.revision_after_patchset_hash = checkouts.patchset_hash
        /* (we gotta do better with revision IDs, and I have a plan, but it'll have to wait) */
    WHERE
        transitions._timestamp > 'our cut-off time' AND
        checkouts.git_commit_generation IS NOT NULL
    ORDER BY
        checkouts.git_repository_url,
        checkouts.git_repository_branch,
        checkouts.git_commit_generation DESC
) AS branch_appearance
WHERE appearance

Actually, this same thing could be done by looking at incidents alone instead, but we would have to follow the links from the tests they're referencing up to checkouts, and then order them by the generation. So referencing revisions in transitions helps here.

For a tree, this would be very similar, as we would need to know if any of its branches have the issue appearing last. But we can just as well get the list of those branches again:

/* Filter branches where issue appeared (regression in progress) */
SELECT git_repository_url, git_repository_branch
FROM (
    /* Get the type of the latest (generation-wise) transition of the issue in question for each distinct branch */
    SELECT DISTINCT ON (checkouts.git_repository_url, checkouts.git_repository_branch)
        checkouts.git_repository_url,
        checkouts.git_repository_branch,
        transitions.appearance
    FROM checkouts
    INNER JOIN transitions
    ON
        transitions.issue_id = 'ID of the issue of interest' AND
        transitions.issue_version = 'Version of the issue' AND
        transitions.revision_after_git_commit_hash = checkouts.git_commit_hash AND
        transitions.revision_after_patchset_hash = checkouts.patchset_hash
        /* (we gotta do better with revision IDs, and I have a plan, but it'll have to wait) */
    WHERE
        transitions._timestamp > 'our cut-off time' AND
        checkouts.git_commit_generation IS NOT NULL AND
        checkouts.tree_name = 'tree of interest'
    ORDER BY
        checkouts.git_repository_url,
        checkouts.git_repository_branch,
        checkouts.git_commit_generation DESC
) AS branch_appearance
WHERE appearance

This would be similar for a repository.

And, finally, for a particular branch we would simply need to know if the latest issue transition was it disappearing:

/* Get the type of the latest (generation-wise) transition of the issue in question in the specified branch */
SELECT transitions.appearance
FROM checkouts
INNER JOIN transitions
ON
    transitions.issue_id = 'ID of the issue of interest' AND
    transitions.issue_version = 'Version of the issue' AND
    transitions.revision_after_git_commit_hash = checkouts.git_commit_hash AND
    transitions.revision_after_patchset_hash = checkouts.patchset_hash
    /* (we gotta do better with revision IDs, and I have a plan, but it'll have to wait) */
WHERE
    transitions._timestamp > 'our cut-off time' AND
    checkouts.git_commit_generation IS NOT NULL AND
    checkouts.git_repository_url = 'repository of interest' AND
    checkouts.git_repository_branch = 'branch of interest'
ORDER BY checkouts.git_commit_generation DESC
LIMIT 1

  • Introduced in: 2024-05-12T23:09:31.805000

For the whole database context this would simply be the minimum start_time across all tests referenced by incidents marking the issue as present. We don't need to reference transitions for that.

For e.g. a tree we can go issue->incident->test->build->checkout to get the minimal test start_time with this issue for this tree.
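
Roughly like this (untested, and assuming the tests -> builds -> checkouts links are tests.build_id and builds.checkout_id, and that the tree field is checkouts.tree_name):

/* Whole-database context: earliest test start_time among tests
   whose incidents mark the issue as present */
SELECT MIN(tests.start_time) AS introduced_at
FROM incidents
INNER JOIN tests ON tests.id = incidents.test_id
WHERE
    incidents.issue_id = 'ID of the issue of interest' AND
    incidents.issue_version = 'Version of the issue' AND
    incidents.present;

/* Tree context: the same, going issue -> incident -> test -> build -> checkout */
SELECT MIN(tests.start_time) AS introduced_at
FROM incidents
INNER JOIN tests ON tests.id = incidents.test_id
INNER JOIN builds ON builds.id = tests.build_id
INNER JOIN checkouts ON checkouts.id = builds.checkout_id
WHERE
    incidents.issue_id = 'ID of the issue of interest' AND
    incidents.issue_version = 'Version of the issue' AND
    incidents.present AND
    checkouts.tree_name = 'tree of interest';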

  • Previous successful run: 66413aa2c3305e852749a573

Again, for the whole database, it would simply be the maximum start_time across all tests referenced by incidents marking the issue as absent.

  • Tree: mainline

For the whole database this would be the list of branches (or trees, with the corresponding adjustment) from the query above.

  • Branch: master

Well, same thing, more-or-less.

  • Commit: a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6 (v6.9)

I'm not sure which commit this is, but if it's a latest/earliest breaking/fixing commit, we can get that, by using the approach in the query above.

  • Arch : x86_64

This should be a list of architectures of the builds of the tests referenced by all incidents with the issue appearing.

  • Platform : dell-latitude-5300-8145U-arcada

Similar thing. Only we don't have the schema for the "platform" formalized and agreed upon in KCIDB yet. And we would only need to get it from the tests, not builds.

  • Config: cros://chromeos-6.6/x86_64/chromeos-intel-pineview.flavour.config+x86-board+CONFIG_MODULE_COMPRESS=n+CONFIG_MODULE_COMPRESS_NONE=y

Same thing as architecture.

  • Compiler: gcc-10

Same thing as architecture.

  • Logs:
    • lava_log

Um, which log would that be? Earliest? Latest?

This is mostly as above. We would get the INACTIVE status from getting no branches with the appearing issue, out of that first SQL query. We can get a single "fixing" commit for the whole database. Maybe we could have an expanded view, which shows status per tree/repo/branch. Still, for an issue in the context of the whole database we won't see these often.

Regardless, to get the fixing commits we would need to query all the latest transitions of the issue for each tree/repo/branch, pick the ones with appearance = false, and get the commit info from the linked revision after. This could also work by selecting latest incidents and filtering the ones with the issue missing.

This is similar to the above, except KCIDB needs to have the necessary data to detect that, and set the harness field in the issue's culprit object (the above-discussed issues would have the code culprit set instead).

To list all failed runs after the first detection we need to just list all incidents for the issue, and follow their references to tests. If we want to limit those by a tree/repo/branch, we would need to involve checkouts and their git_commit_generation field.
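
E.g. (another untested sketch):

/* All test runs where the issue was found present, earliest first */
SELECT tests.*
FROM incidents
INNER JOIN tests ON tests.id = incidents.test_id
WHERE
    incidents.issue_id = 'ID of the issue of interest' AND
    incidents.issue_version = 'Version of the issue' AND
    incidents.present
ORDER BY tests.start_time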


Overall this seems doable from the POV of the dashboard and SQL. The incident objects really make things simpler and allow the dashboard to be completely unaware of how issues are matched. I have conflicting feelings about transitions. In many ways they feel rudimentary, as a lot of stuff can be done with just incidents, but they do provide the shortcut to revisions. Let's get on with the implementation and things will get clearer.

There seems to be a great benefit in just always dealing with "incidents" instead of tests directly. I'm still thinking perhaps automagically created "virtual incidents" could help, but I don't want to attempt that now. For the start, let's write our transition-detection code, and make it create incidents explicitly for transition boundaries (so the dashboard can find the responsible test results).

Speaking of that code, I think it could go under kcidb/monitor and be triggered via a pub-sub message, either directly from the kcidb_spool_notifications function in main.py (after it is renamed to something more suitable), or by a separate "Cloud Function" triggered from its own subscription to the "updates" message queue.

That code can just go straight to the SQL for the start (after extracting the relevant object IDs from the "patterns" we pass around), or we could try to build the necessary ORM mappings first. Depending on how hard the latter would be.

But first, we would have to implement preliminary support for the schema in the three database drivers, and try to build on that to see if it works, roughly, and then get the schema update past the submitting CI system maintainers.

What do you think, @hardboprobot, everyone else?

@spbnick commented May 28, 2024

Regarding the actual transition-detecting code, it would need access to the branch history.

First, it would need to see if the object is applicable, e.g. whether its checkout has git_commit_generation set; likely more checks will be needed, at least for the start.

Then it would need to get the history along the object's branch. It will have to have its own idea of the criteria for those objects, but overall the timeline can be derived by joining with checkouts with the same git_repository_url and git_repository_branch (note: not tree, as those can have multiple branches, AFAICT), and sorted by git_commit_generation. It would likely need the objects both before and after the present one.

The criteria can be as simple as all tests with the same path, or the builds for the same architecture. But note that whatever the criteria are, it would always be possible to get conflicting results for the same revision. We'll have to think about what to do with that. But one step at a time.
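
For instance, with "all tests with the same path" as the criteria, the branch timeline could be fetched with something like this (a sketch, assuming the usual tests.build_id / builds.checkout_id links):

/* History of results for one test path along one branch,
   ordered by commit generation */
SELECT
    checkouts.git_commit_generation,
    checkouts.git_commit_hash,
    tests.id,
    tests.status,
    tests.start_time
FROM tests
INNER JOIN builds ON builds.id = tests.build_id
INNER JOIN checkouts ON checkouts.id = builds.checkout_id
WHERE
    tests.path = 'path of the test of interest' AND
    checkouts.git_repository_url = 'repository of interest' AND
    checkouts.git_repository_branch = 'branch of interest' AND
    checkouts.git_commit_generation IS NOT NULL
ORDER BY checkouts.git_commit_generation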

@r-c-n commented May 29, 2024

Thanks for the detailed analysis, @spbnick

In general, I think these ideas try to approach some of the problems from the perspective that the same issue may happen in different trees/branches and different setups, so a high-level analysis of test results across the DB will provide more information about it.

There's this other lower-level approach (legacy KernelCI / Maestro) that holds that to study a test result over time you need to fix some of the search parameters (tree/branch, hardware, kernel config, etc) because any of these variables may influence the test result and that, by leaving the commit as the only moving part, you can tell if the failure was introduced by it or not. So, under this approach, two similar issues may be different if they happen in different hardware targets, for instance.

I think the reality is that both perspectives are right: there are test results failing across repos and targets that point to the same issue, and there are also tests that fail under very specific circumstances (hardware, kernel config) and it's useful to inspect which of those variables may have introduced the failure (narrowing down) rather than expanding the search space.

So the difficult part may be how to mix these two strategies. And I think we can apply them in different stages, as they are clearly on different abstraction levels and have different goals. We could first do the low-level analysis to remove noise from results and narrow down the search variables, and then we can apply the high-level approach to the end results of the low-level one. Basically, move the regression processing that KernelCI is doing into KCIDB and add the broader-scope analysis on top of it.

In order for the high-level approach to be practical, though, I think we first need a way to characterize and identify issues, which is what I'm working on now.

But let's take it bit by bit:

In the context of the whole database, ACTIVE would mean that there's at least one tree/repo/branch which doesn't have a confirmed (tested) fix. That won't be very useful, probably, as it takes a long time to propagate (and test) the fix, and will never happen for removed/abandoned branches. The tree/repo/branch-context view would contain a more useful status. Still, it needs to be here for completeness.

Here we hit the problem I mentioned in KCIDB design and development ideas, that in order to do this across different trees/branches the issues need to be sufficiently specific. Something like what you proposed there: "Test storage.WriteZeroPerf failing" may be useful for a human to know what the issue is about but isn't specific enough to tell two issues apart. In this scenario, the same test may be failing in two different branches/repos for different reasons.

I'd say for now we can either:

  • assume we have a way to profile issues (I'm working on that) and move on.
  • simplify the detection of FAIL->PASS transitions by restricting the search by the tree/branch (what KernelCI currently does)

Introduced in: 2024-05-12T23:09:31.805000

For the whole database context this would simply be the minimum start_time across all tests referenced by incidents marking the issue as present. We don't need to reference transitions for that.

I'm not completely sure of that. Let's suppose an example: a test failed last month causing a transition (PASS->FAIL) identified by a specific issue, and an incident (present). Two weeks ago the test was detected to pass, so we get a new transition (FAIL->PASS) and another incident (absent). Btw, please, correct me if I'm wrong about the use of incidents, I think I see some overlapping of functionality now between concepts but I'm not sure.
Now, the same test fails again today, the culprit is a different commit or whatever other condition, but the issue seems to be the same. So we get another transition (PASS->FAIL) and another incident (present) pointing to the same issue.

If we search for the minimum start_time across tests referenced by incidents marking the issue as present we'll get the result from last month, but that's unrelated to the one that happened today.

This example happens all the time in unstable tests in KernelCI, due to poorly programmed tests, setup problems, faulty hardware, etc. Only the frequency is much higher (multiple PASS->FAIL->PASS transitions in the same week).


Commit: a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6 (v6.9)

I'm not sure which commit this is, but if it's a latest/earliest breaking/fixing commit, we can get that, by using the approach in the query above.

This would be the commit that made the test fail, ie. the earliest breaking commit.

  • Logs:
    • lava_log

Um, which log would that be? Earliest? Latest?

This should be the log from the earliest failure.


For the rest of the attributes, you mention how to get the results applied to the whole database. I think before this we should start from the primitive data as in the examples, where we have a very zoomed-in view of a test failure. And then, when we have a way to characterize the failure we can zoom-out and search for the same issue across the whole DB.


Overall this seems doable from the POV of the dashboard and SQL. The incident objects really make things simpler and allow the dashboard to be completely unaware of how issues are matched. I have conflicting feelings about transitions. In many ways they feel rudimentary, as a lot of stuff can be done with just incidents, but they do provide the shortcut to revisions. Let's get on with the implementation and things will get clearer.

Maybe another way to interpret the roles of transitions and incidents is this separation between low-level and high-level processing that I'm proposing rather than trying to conflate them into the same solution space. That is, transitions may be meant to provide the solution to one problem (rudimentary detection of breaking points) and incidents provide a solution to a different problem (relationship between test failures and issues). Linking the information provided by these two may be a way to bring low-level information to the high-level problem.

The next three weeks I'll be available to work on anything related to this, so let me know if I can help or assist you in any way.

@spbnick commented May 30, 2024

Thank you for another detailed and thoughtful response, @hardboprobot!

In order for the high-level approach to be practical, though, I think we first need a way to characterize and identify issues, which is what I'm working on now.

I think we're looking at the problem from two somewhat different angles (which is great!). I'm coming from the environment where we have an army of QE maintaining specific tests and handling failures by manually creating issues which match the corresponding results as closely as possible. They know the test, know the environment and can do that.

These issues can generally be applied to the full swath of trees we're testing. In addition, they enjoy the limitation of mostly testing the latest commits (although those can be commits to quite old branches). Whenever there's a tree-specific problem, it's possible to add a corresponding condition to the issue to limit detection. There's constant collaboration and negotiation on the issue conditions.

If I understand you right, you would like to have automatic identification and scoping of issues. This is great and I fully support these efforts. And I think that we would still need to have the same base schema to implement the low level, as you say. Plus, as we already agreed before, it would be great if a human can intervene and correct the automatically-generated issue description in any case. So, I think we can get on with implementing what we have so far (or at least attempting to). This will help us figure out the foundation for the automatic detection, and having the former more defined will help us think about the latter.

But let's take it bit by bit:

In the context of the whole database, ACTIVE would mean that there's at least one tree/repo/branch which doesn't have a confirmed (tested) fix. That won't be very useful, probably, as it takes a long time to propagate (and test) the fix, and will never happen for removed/abandoned branches. The tree/repo/branch-context view would contain a more useful status. Still, it needs to be here for completeness.

Here we hit the problem I mentioned in KCIDB design and development ideas, that in order to do this across different trees/branches the issues need to be sufficiently specific. Something like what you proposed there: "Test storage.WriteZeroPerf failing" may be useful for a human to know what the issue is about but isn't specific enough to tell two issues apart. In this scenario, the same test may be failing in two different branches/repos for different reasons.

Yes, that "Test storage.WriteZeroPerf failing" is absolutely the simplest, wildcard case. And it's coming from the requests from maintainers to only send them "new" failures. It's crude, but even just that could make KCIDB reports and dashboards more useful, and potentially attract new users, and to give us feedback. And a simple case is good to start the implementation with.

I'd say for now we can either:

  • assume we have a way to profile issues (I'm working on that) and move on.

I think this is a good idea.

  • simplify the detection of FAIL->PASS transitions by restricting the search by the tree/branch (what KernelCI currently does)

Let's also keep this in mind and see how things go with the implementation and with the actual data, and tackle things as they come.

Introduced in: 2024-05-12T23:09:31.805000

For the whole database context this would simply be the minimum start_time across all tests referenced by incidents marking the issue as present. We don't need to reference transitions for that.

I'm not completely sure of that. Let's suppose an example: a test failed last month causing a transition (PASS->FAIL) identified by a specific issue, and an incident (present). Two weeks ago the test was detected to pass, so we get a new transition (FAIL->PASS) and another incident (absent). Btw, please, correct me if I'm wrong about the use of incidents, I think I see some overlapping of functionality now between concepts but I'm not sure. Now, the same test fails again today, the culprit is a different commit or whatever other condition, but the issue seems to be the same. So we get another transition (PASS->FAIL) and another incident (present) pointing to the same issue.

If we search for the minimum start_time across tests referenced by incidents marking the issue as present we'll get the result from last month, but that's unrelated to the one that happened today.

This example happens all the time in unstable tests in KernelCI, due to poorly programmed tests, setup problems, faulty hardware, etc. Only the frequency is much higher (multiple PASS->FAIL->PASS transitions in the same week).

Ah, of course. Just the timestamp didn't give me enough information on what it is about. We can look for the earliest commit with the issue instead here.

  • Logs:

    • lava_log

Um, which log would that be? Earliest? Latest?

This should be the log from the earliest failure.

Right. Shouldn't be a problem.

For the rest of the attributes, you mention how to get the results applied to the whole database. I think before this we should start from the primitive data as in the examples, where we have a very zoomed-in view of a test failure. And then, when we have a way to characterize the failure we can zoom-out and search for the same issue across the whole DB.

Well, sure. Although having one more dashboard for the overall database doesn't hurt anyone and is easy to implement in Grafana. Notifications are another thing.

Overall this seems doable from the POV of the dashboard and SQL. The incident objects really make things simpler and allow the dashboard to be completely unaware of how issues are matched. I have conflicting feelings about transitions. In many ways they feel rudimentary, as a lot of stuff can be done with just incidents, but they do provide the shortcut to revisions. Let's get on with the implementation and things will get clearer.

Maybe another way to interpret the roles of transitions and incidents is this separation between low-level and high-level processing that I'm proposing rather than trying to conflate them into the same solution space. That is, transitions may be meant to provide the solution to one problem (rudimentary detection of breaking points) and incidents provide a solution to a different problem (relationship between test failures and issues). Linking the information provided by these two may be a way to bring low-level information to the high-level problem.

I agree.

The next three weeks I'll be available to work on anything related to this, so let me know if I can help or assist you in any way.

Great! I'll start updating the database drivers ASAP. Just need to finish jq.py rebasing. Then we would be able to start reimplementing the issue detection in KCIDB, in a separate deployment. That's where I will need your help. When we see the things are going the right way, we'll send the schema proposal to the submitters.

In parallel, if you feel we'll need to look at the logs for our issue detection soon, we'll need to improve artifact caching, to increase speed and save on incoming traffic costs. At the moment the caching logic is rather trivial, and it only downloads every 256th file (to save costs, again, for the testing period). We'll need to figure out a storage policy so that data is retired to cheaper storage automatically, and make sure we cache/archive all the files we need. You could also help with that, unless we find someone else.

@spbnick commented Jun 2, 2024

I finished rebasing our fork of jq.py, and will be able to start the driver work on Monday.

@spbnick commented Jun 4, 2024

One thing this schema doesn't specify is how to "remove" a transition. There needs to be a certain value of a field or combination of fields that says "this transition is invalid". Missing fields mean "unknown", so that is not available.

@spbnick commented Jun 4, 2024

Perhaps we could permit empty issue_id (distinct from missing issue_id)?
Speaking of which, I'm wondering how to make an issue itself invalid.

@r-c-n commented Jun 4, 2024

Great! I'll start updating the database drivers ASAP. Just need to finish jq.py rebasing. Then we would be able to start reimplementing the issue detection in KCIDB, in a separate deployment. That's where I will need your help. When we see the things are going the right way, we'll send the schema proposal to the submitters.

That's a good plan. Ping me if you have a clear idea of where to start or already have some tasks that you want to offload, or we can have a quick meeting or Slack chat about it.

One thing this schema doesn't specify is how to "remove" a transition. There needs to be a certain value of a field or combination of fields that says "this transition is invalid".

Sorry, I may have forgotten part of the context, but I don't recall talking about this concept of a "removed" or "invalid" transition. Which case do you want to model with that?

@spbnick commented Jun 4, 2024

That's a good plan. Ping me if you have a clear idea of where to start or already have some tasks that you want to offload, or we can have a quick meeting or Slack chat about it.

Great! Let me finish the basic database support and post it for your review. And then we take it from there.

Sorry, I may have forgotten part of the context, but I don't recall talking about this concept of a "removed" or "invalid" transition. Which case do you want to model with that?

We didn't discuss it, but it's very likely that a transition we "find" might turn out to be wrong and we might need to "remove" it. Same with issues. We cannot actually delete stuff from KCIDB, so we need a way to invalidate issues, incidents, and transitions.

@spbnick commented Jun 4, 2024

Also, do you think we'll need to look inside log files soon? If yes, then we need to improve the artifact cache.

@r-c-n commented Jun 4, 2024

We didn't discuss it, but it's very likely that a transition we "find" might turn out to be wrong and we might need to "remove" it. Same with issues. We cannot actually delete stuff from KCIDB, so we need a way to invalidate issues, incidents, and transitions.

Ah right. I leave the implementation decision to you, then. If the "empty" is clear enough, that'll work. If not, would a single extra boolean per object impact the DB size too much?

Also, do you think we'll need to look inside log files soon? If yes, then we need to improve the artifact cache.

For now I'm using KernelCI's logs to add coverage to the tool. In this stage I don't need full access to every log, I'm just picking as many failures as I can find (currently kernel build failures). Once I have this in a more or less useful state we can see if it's worth it to integrate it into KCIDB, as a tool or a lib, and then we can process the logs of test failures as they arrive and keep the extracted data (hopefully as an issue) without necessarily changing the caching strategy.

@spbnick commented Jun 4, 2024

OK, I think an issue could be invalidated by clearing all "culprit" boolean fields. And then an incident or a transition could be invalidated by supplying empty issue IDs. I'll change the schema accordingly.
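
For illustration, the invalidating submissions could then end up as rows roughly like these (a sketch only; the IDs, the exact culprit column names, the placeholder issue_version, and whether "cleared" means false or NULL are all still to be decided):

/* Invalidate an issue: a new version with all culprit booleans cleared */
INSERT INTO issues (id, version, origin, culprit_code, culprit_tool, culprit_harness, comment)
VALUES ('maestro:storage.WriteZeroPerf:FAIL', 2, 'maestro',
        false, false, false, 'Invalidated: matched the wrong failure');

/* Invalidate a transition: a new version pointing at an empty issue ID */
INSERT INTO transitions (id, version, origin, issue_id, issue_version, comment)
VALUES ('maestro:storage.WriteZeroPerf:FAIL:appeared', 2, 'maestro',
        '', 0,  /* placeholder version for the empty issue reference */
        'Invalidated: the revision range was wrong');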

@spbnick commented Jun 4, 2024

Ah, incidents don't have versions, and they don't need to be invalidated, but issues do.

spbnick force-pushed the add_schema_v4_4 branch from f56770c to 6f404d6 on June 4, 2024 10:53
spbnick added 3 commits June 5, 2024 16:30
Add support for describing issue presence transitions.
spbnick force-pushed the add_schema_v4_4 branch from 6f404d6 to dfaa97e on June 5, 2024 13:30
@spbnick commented Jun 5, 2024

Getting there. DB support just passed all local tests. Tests against a deployment are next.

@spbnick commented Jun 6, 2024

Alright, preliminary support for this schema is up in a PR: kernelci/kcidb#529

@hardboprobot, please review!

@r-c-n commented Jun 21, 2024

Hi @spbnick, I've been doing some experiments and playing around with the schema proposal, and the cloudy ideas are starting to take a clearer shape. I'll share my impressions here (sorry, long comment ahead).

It looks to me like the central concept in this schema is the "Issue", but we also need to consider that some users may be primarily interested in something other than issues, and that some queries might not revolve around issues but rather start from the opposite direction (from build or test results rather than from issues and incidents).

[Diagram: current_scheme]

So what I'm doing is going through typical use cases one by one and evaluating whether the concepts and relationships in the schema are expressive and flexible enough to satisfy them.

The first use case I'm studying is the processing of a test failure result upon arrival, which covers most of the concepts in the schema and will sprawl into other use cases:

Assuming that a new failing test result was just detected, the steps to follow could be:

  1. Identify issue: automatically or manually
    Does the issue exist already? NO -> Create it
  2. Create a new incident to link the test run to the identified issue
  3. Does the test run create a PASS -> FAIL transition?
    • Search for previous test runs on the same tree/branch, platform, kernel config, etc.
    • Get the previous PASS result
    • Create a transition with appearance = TRUE and link it to the test result and the appropriate issue

Problem: a transition is only related to an issue in the schema, but it should also be linked to the test runs that define it (at least the most recent one, the one that creates the transition)

The only information that a transition holds about the breaking point is the commit hash pair that contains the transition, but in which repo / branch?

                                 Table "public.transitions"
             Column              |           Type           | Collation | Nullable | Default 
---------------------------------+--------------------------+-----------+----------+---------
 _timestamp                      | timestamp with time zone |           |          | 
 id                              | text                     |           | not null | 
 version                         | integer                  |           | not null | 
 origin                          | text                     |           | not null | 
 issue_id                        | text                     |           | not null | 
 issue_version                   | integer                  |           | not null | 
 appearance                      | boolean                  |           |          | 
 revision_before_git_commit_hash | text                     |           |          | 
 revision_before_patchset_hash   | text                     |           |          | 
 revision_after_git_commit_hash  | text                     |           |          | 
 revision_after_patchset_hash    | text                     |           |          | 
 comment                         | text                     |           |          | 
 misc                            | jsonb                    |           |          |

As an example, consider the use case: "The user wants to know when an issue was introduced"

From a specific issue, it's trivial to find the associated transitions, but all the information we can get from them is a commit/patchset hash pair. The issue may have happened in more than one repo, hardware target, etc.

Additionally, from the issue we can get all the related incidents, and from them we can get to the test runs that show the issue, with its related checkout and environment info, but we can't relate this to the info we got from the transition.

Possible solution: link "transitions" to the related test runs / builds.
Substitute revision_before_* and revision_after_* with test_id_before / test_id_after and build_id_before / build_id_after. But this is cumbersome because a transition is either for a test or for a build, but not for both. Possible solution to this: split transitions into "Test transitions" and "Build transitions" and make the reference ids not null.

BUT!

We already have a relationship that links test and build objects with issues: incidents. So maybe we can encode the information of a transition in an incident and discard the "Transition" objects altogether.

                         Table "public.incidents"
    Column     |           Type           | Collation | Nullable | Default 
---------------+--------------------------+-----------+----------+---------
 _timestamp    | timestamp with time zone |           |          | 
 id            | text                     |           | not null | 
 origin        | text                     |           | not null | 
 issue_id      | text                     |           | not null | 
 issue_version | integer                  |           | not null | 
 build_id      | text                     |           |          | 
 test_id       | text                     |           |          | 
 present       | boolean                  |           |          | 
 comment       | text                     |           |          | 
 misc          | jsonb                    |           |          |
 transition    | boolean                  |           |          |  <---- new field
 prev_build_id | text                     |           |          |  <---- new field
 prev_test_id  | text                     |           |          |  <---- new field

With these new fields this should provide the same information as the transition entity, since a transition is now equivalent to an incident with transition == true.

Additionally, this removes a small redundancy that we had with the incidents.present and transition.appearance fields: they meant more or less the same thing, but applied to different objects. If we encode transitions as incidents, we can just use the present field to mean both things:

  • test run shows an issue when the previous test run didn't:
    new incident with issue_id = (issue), test_id = (this_test), transition = true, prev_test_id = (test_without_issue), present = true

  • test run doesn't show an issue that was present in the previous run:
    new incident with issue_id = (issue), test_id = (this_test), transition = true, prev_test_id = (test_with_issue), present = false

The rest of the combinations are still possible, for example:

  • test run shows an issue that was already present:
    new incident with issue_id = (issue), test_id = (this_test), transition = false, present = true

Queries we can run on this:

  • For a known issue:

    1. Find test runs / builds that show it: yes, joining tests / builds tables through the incidents table
    2. Find transitions: trivial, searching for incidents with transition == true
    3. Find all the commits that introduced it in all known repos: from 1), then get the checkout info
  • For a test run or build:

    1. Find related issues: trivial, through the incidents table
    2. Check if it introduced or fixed a regression: search incidents with transition = true and present = true / false, respectively
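
For illustration, the "find transitions" and "introduced or fixed a regression" queries above would boil down to something like this against the extended incidents table sketched earlier (untested):

/* Transitions of a known issue */
SELECT *
FROM incidents
WHERE
    issue_id = 'ID of the issue of interest' AND
    issue_version = 'Version of the issue' AND
    transition;

/* Did this test run introduce (present = true) or fix (present = false) a regression? */
SELECT present
FROM incidents
WHERE
    test_id = 'ID of the test run of interest' AND
    transition;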

Additionally, since an incident will be either a test or a build incident, but not both, I'd split "incidents" into "test incidents" and "build incidents":

[Diagram: proposed_scheme]

but this isn't a hard requirement, I think.

What do you think? Does it make sense to you? Are there any features that you think we could be missing with the change?

@spbnick commented Jun 27, 2024

OK, this comment is long, so I gave up trying to address specific points separately, and will just dump everything I have to say about it, so far.

I think it will be more productive to exchange shorter comments more often in the future, and perhaps have meetings to just talk over stuff.

The "issue extractor" (my name for the automatic issue creator you're working on for now) can only know if the issue in the result already exists by parsing the result first.

The question is whether, once extracted, it can describe the issue in some other way (at all), and whether that other way would allow something else to identify it faster. E.g. regular expressions or shell globs. If not, or if the difference is not significant, then we don't have to do that, and can simply feed every result to the extractor, and let it maintain its own issue IDs and versions. The issues it creates wouldn't be used by a generic triager to find more incidents; only the extractor itself would be submitting them. In this case it would be similar to just another CI system submitting issues and incidents.

If it can generate something else to match the issue with, e.g. a regex or (better) a glob, which would run faster, then the only thing the issue extractor should do is (re)submit the issues. The triager (the thing which matches generic issues to results) should submit the incidents, and take care of not submitting more than necessary.

Now, we can't possibly triage every result with every issue, we'll have to pick our fights, prioritize which issues are most important to triage, and have a cut off time before generating the first round of notifications about the results. We can continue (e.g. opportunistically) triaging afterwards, and uncover more correlation, but we gotta call "triaging results are good enough" at some point.

Problem: a transition is only related to an issue in the schema, but it should also be linked to the test runs that define it (at least the most recent one, the one that creates the transition)

The test runs can be found by querying tests linked to the issue's incidents (and the transition's revision) and picking the earliest result. Also, there's no guarantee that it will be the earliest test that created the transition; they can arrive in any order. And it's really not that important if we get exactly the earliest test for that, or exactly the test that triggered the creation of the transition: it's the issue and the revision identification that matters. We can always sort the incident tests by start_time for a particular revision.
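
Roughly (untested, with placeholder values, and assuming the usual tests.build_id / builds.checkout_id links):

/* Tests behind the issue's incidents at the transition's "after" revision,
   earliest first */
SELECT tests.id, tests.start_time
FROM incidents
INNER JOIN tests ON tests.id = incidents.test_id
INNER JOIN builds ON builds.id = tests.build_id
INNER JOIN checkouts ON checkouts.id = builds.checkout_id
WHERE
    incidents.issue_id = 'ID of the issue of interest' AND
    incidents.issue_version = 'Version of the issue' AND
    checkouts.git_commit_hash = 'revision_after commit hash of the transition' AND
    checkouts.patchset_hash = 'revision_after patchset hash of the transition'
ORDER BY tests.start_time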

Then, there's really no answer to the question "in which branch an issue first appeared", in general, given enough time has passed to disseminate the commit among the trees.

However, we can find where we checked out that commit from first (time-wise), which could be good enough. And the solution improves with increased sample rate and sample width (checking more branches more often).

In theory, we could refer to the "origin" test/build result which "produced" the issue from the issue itself. Like have a special field for that. However, if the issue extractor submits multiple issues with this field different, the outcome will be nondeterministic and the whole point of having this data will be missed.

That could be used in manually-created issues, though.

Possible solution: link "transitions" to the related test runs / builds.
Substitute revision_before_* and revision_after_* with test_id_before / test_id_after and build_id_before / build_id_after. But this is cumbersome because a transition is either for a test or for a build, but not for both. Possible solution to this: split transitions into "Test transitions" and "Build transitions" and make the reference ids not null.

I started designing the schema with this idea. However, I then changed it to the present schema. First, this would need considerably more joins for the major use-cases of showing which issues a branch has active, and showing the issue history. Especially for tests. Second, sometimes we might have no test results for particular revisions, but nevertheless might know that the issue has (dis)appeared there, due to out-of-system testing, or just a human investigation. If we insist on requiring incidents we would need to either fake them, or wait for a test to reproduce them, which might take a while for difficult-to-reproduce issues, and all this time the transition would be either pointing to a wrong commit or to nothing at all. It also could be that there's no test reproducing a certain issue, but it could be good to still record where it's introduced/fixed, thus augmenting a bug tracker.

Also, we had some discussion on how to integrate KCIDB with regzbot data this week, and we can surface its information as issues and transitions without incidents, which would already help us display them, where necessary. The regzbot-collected issues have a good reputation in the community, and it would be good to have them in the dashboards.

In general I find coupling gaps like this useful, as they give us more flexibility and the system more viability.

Regarding linking both builds and tests from incidents, in theory (a bit far-fetched, but still) it's possible for an issue to manifest both in a build and in a test. E.g. a build can warn us about something, and then a test can actually trip on it. I don't think we need to worry much about both references being present there, and it makes things simpler. I.e. by keeping a single type of transitions. Eventually we might want to unify build and test results, like you guys did in Maestro, and then the problem (such as it is) will go away.

We already have a relationship that links test and build objects with issues: incidents. So maybe we can encode the information of a transition in an incident and discard the "Transition" objects altogether.

With these new fields this should provide the same information as the transition entity, since a transition is now equivalent to an incident with transition == true.

Additionally, this removes a small redundancy that we had with the incidents.present and transition.appearance fields: they meant more or less the same thing, but applied to different objects.

I get where you're coming from. And yes, it's good to not have redundancy and unnecessary object types. We don't really have to have "transitions", and we can run SQL queries to figure out where an issue first or last had a positive incident without them. However, we'd like to have them as an optimization, and to express issue (dis)appearance without actual incidents. And optimizations often bring redundancy, unfortunately.

I think putting the previous build/test ID into incidents makes them accommodate two layers, which can be kept separate and implemented by different parties. Not many CI systems would be able to provide those fields, and we cannot edit submitted CI data (we would have to use our own origin), so we cannot add those fields for them ourselves. OTOH, we can definitely build transition detection based on the pure incidents they submit, as well as cleanly separate issue triaging from regression detection in our own processing. Additionally, the previous build/test ID in an incident could very quickly become outdated, e.g. once we do bisections and add more builds/tests between the ones the incident knows about, and we don't have a way to update incidents by themselves (and I don't think we should), only when issues are updated.

Anyway, I spent long enough trying to make this response coherent, while getting distracted by all manner of things 😁

Let's exchange our thoughts and ideas more often, so we don't have to struggle replying, and please correct me if I got anything wrong, or I don't make sense 😁

@spbnick commented Aug 2, 2024

I spent some time thinking about "git commit generation" which we included here, and I think that we don't really need it. I think we should just use the checkout's start_time. The commit generation that we wanted to introduce would be essentially that anyway. I investigated using git rev-list --count HEAD, but it would still not allow us to compare history for force-push branches. For that, timestamp works better, actually. So I think we can drop it from the design, and work with start_time until and if we get the graph database.
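
E.g. the branch-history ordering in the earlier queries would then become something like this (same caveats as before, untested):

/* Latest transition of the issue on a branch, ordered by checkout start_time
   instead of git_commit_generation */
SELECT transitions.appearance
FROM checkouts
INNER JOIN transitions
ON
    transitions.issue_id = 'ID of the issue of interest' AND
    transitions.issue_version = 'Version of the issue' AND
    transitions.revision_after_git_commit_hash = checkouts.git_commit_hash AND
    transitions.revision_after_patchset_hash = checkouts.patchset_hash
WHERE
    checkouts.git_repository_url = 'repository of interest' AND
    checkouts.git_repository_branch = 'branch of interest'
ORDER BY checkouts.start_time DESC
LIMIT 1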

@spbnick commented Aug 2, 2024

But we would need a field specifying that this was the tip of the branch at checkout time, because we can have checkouts for previous commits too.

spbnick changed the title from "Add schema v4.4" to "Add schema v4.4 with issue transitions" on Aug 5, 2024