Score "Investigate" progress as part of the overall metric #49

Closed
jgraham opened this issue Jan 21, 2022 · 8 comments
Labels: meta Process and/or repo issues

@jgraham
Contributor

jgraham commented Jan 21, 2022

Mozilla are concerned that basing the score purely on the test pass rate for the accepted proposals doesn't provide any visible reward for progress on the areas that were marked as "Investigate". Although these are not at the point where we can define a set of tests for implementors to target, several of the "Investigate" areas are places where we see a lot of compatibility problems in practice, and completing the pre-implementation work required to address those problems will pay dividends in terms of improving the experience of the web for end users and authors. Therefore that work should be recognised alongside implementation work in terms of improving the Interop 2022 score.

Given this, we propose the following:

  • For each area marked as "Investigate" we define a set of outcomes that will represent measurable progress toward addressing the compat problems in that area, for example, writing spec text for a feature, or creating a viable plan to align existing implementations.
  • We dedicate some percentage of the overall score to progress on these goals. Unlike the existing areas, these will be scored manually and the score will be shared across all implementations.

For example, we might decide that 85% of the total score comes from the test pass rate, and 15% of the score comes from progress on the "Investigate" areas. If we completed two thirds of the investigate work, but failed to make progress on the final third, that would mean the most that any one implementation could score on the overall metric would be 95%.
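
As a rough sketch of how the numbers above could combine (this is illustrative only; the 85/15 weights and the names are assumptions, not an agreed formula):

```python
# Illustrative sketch of the proposed split: per-browser test pass rate plus a
# shared "Investigate" component. The 85/15 weights are the example from above,
# not a decided value.
TEST_WEIGHT = 0.85
INVESTIGATE_WEIGHT = 0.15

def overall_score(test_pass_rate: float, investigate_progress: float) -> float:
    """Both inputs are fractions in [0, 1]; the result is a percentage.

    investigate_progress is shared across all implementations, so every
    browser gets the same contribution from it.
    """
    return 100 * (TEST_WEIGHT * test_pass_rate
                  + INVESTIGATE_WEIGHT * investigate_progress)

# A browser passing every test while two thirds of the investigation work is
# complete tops out at 95%, as described above.
print(round(overall_score(1.0, 2 / 3), 1))  # 95.0
```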

Concretely, I think the areas we marked as "Investigate" that should form part of this proposal are:

I've excluded AVIF because, as I understand it, the scope of "Investigate" there wasn't about the state of the spec or tests, and accent-color because the concern there seemed to be that there weren't actually any failing tests.

In terms of scoring I don't think that all these areas necessarily need to be given equal weight, or that each "Investigate" area has to be exactly equal to an implementation area.

@foolip
Member

foolip commented Jan 24, 2022

I think this makes sense, and 15% is a reasonable portion of the score.

I do think spelling out the scoring of these areas ahead of time might be challenging, in particular for Pointer Events, where the issue was that the existing failures didn't seem representative of the problems web developers actually face. Do we know in any detail what does need to be tested?

layerX and layerY is a much smaller area than the others, so if there's a way to write a tentative test for it that would fail in all browsers, and keep it in the webcompat bucket, I think that would be preferable.

@jensimmons what do you think of this proposal, in particular as regards #41?

@chrishtr
Contributor

> I think this makes sense, and 15% is a reasonable portion of the score.

I also think it's ok, and good to pay attention to these research tasks in addition to fixing bugs.

My suggestion is to start with a 0 score in this area for all three browser implementations and then increase it at the end of each quarter if progress has been made, via a consensus decision.

I think each research group can and should define its own agenda and ways of organizing; it's not necessary for the whole group to pre-agree on these plans. The group can then review the results at the end of the quarter and give a reasonable score.

@jensimmons
Contributor

jensimmons commented Jan 27, 2022

This makes sense to me for Viewports. Perhaps we make the (few) automated tests 50%, and the manual testing that we plan and carry out the other 50%? We won't be able to repeat the manual testing over and over, however, so I'm open to debate on this.

@jensimmons
Contributor

I have a serious problem with reopening and re-litigating decisions already made by changing the process after the fact.

@gsnedders
Member

At this stage, we have a number of concerns about substantially changing the scoring approach, especially when we are so close to the proposed announcement date. This isn’t to say we don’t think there’s substantial value in documenting progress on the things marked as investigate, but I think we’d need a much more concrete proposal as to how the scoring would affect the overall scores.

It’s unclear how the Investigate items will be scored, but presuming that gets figured out, it’s still very unclear how those scores will impact the overall score for each browser. Is the proposal that the resulting score be applied to each browser’s total in an equal fashion? Or that browsers will be able to earn more points than others by participating in the standards and testing work needed?

If the proposal is to have the investigation scoring apply equally to all browsers, it’s not clear it provides much value aside from making it harder for everyone to reach 100%; it lacks the competitive pressure that the metric otherwise provides.

If the proposal is to have investigation scoring apply differently to each browser, based on some sort of measure of participation or contribution, then much more detail would be needed as to how this would be measured.

In either case, we are dubious that consensus about how each investigation item is scored can be reached in time for the announcement date, and we very strongly want any discussions about scoring to be concluded by that date.

Our preference, as far as 2022 is concerned—and we can develop further proposals over the coming year for how we want to score investigate items in future—is to list the investigation as a separate item on the dashboard, aside from the browsers and their scores. In that regard, we’re much less concerned about ongoing discussion about the scoring of the items, provided we believe we can reach consensus within the first quarter of the year.

@jgraham
Contributor Author

jgraham commented Feb 3, 2022

As a point of clarity, the proposal is to have a uniform score that applies to all implementations.

Investigation work intrinsically requires collaboration, so it makes sense to score it in a way that rewards everyone for collective progress. That does come with the tradeoff that you can't use the score in a "competitive" sense against other implementations, but you can use it to demonstrate that you've made good on a commitment to improving the web ecosystem as a whole. Although competitive pressure is certainly one way that the web makes progress, it's not the only way, and scoring an interoperability metric entirely on the basis of what specific implementations accomplish in terms of landing new feature work misses the bigger picture.

@foolip
Member

foolip commented Feb 23, 2022

This is implemented in https://staging.wpt.fyi/interop-2022 now, in all places except one: in the summary graph I forgot to scale the score to 90%. Leaving this issue open as a reminder of that.
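
For reference, a minimal sketch of that scaling, assuming a 90/10 split between the per-browser test score and the shared investigation score (the function and variable names here are illustrative, not the dashboard's actual code):

```python
# Hypothetical illustration of the summary-graph scaling: the per-browser test
# pass rate contributes at most 90 points, and the shared investigation score
# (0-10) supplies the remainder.
def summary_score(test_pass_rate: float, investigation_points: float) -> float:
    """test_pass_rate is a fraction in [0, 1]; investigation_points is 0-10."""
    return 90 * test_pass_rate + investigation_points
```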

@foolip
Member

foolip commented Aug 18, 2022

I fixed the summary graph at some point, closing this as fixed.

@foolip foolip closed this as completed Aug 18, 2022
@gsnedders gsnedders added the meta Process and/or repo issues label Sep 16, 2022