Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename variables to match the configuration and introduce elementScore concept #92

Open
0x4007 opened this issue Aug 20, 2024 · 11 comments · May be fixed by #121
Open

Rename variables to match the configuration and introduce elementScore concept #92

0x4007 opened this issue Aug 20, 2024 · 11 comments · May be fixed by #121

Comments

@0x4007
Copy link
Member

0x4007 commented Aug 20, 2024

I was reviewing this action run and its a lot more clear to me when looking at the clean JSON logs exactly whats going on here. I realized that the scoring is implemented incorrectly. The primary crediting strategy is supposed to be based on the amount of elements. The word counter is a separate scoring mechanism. This was added as an after thought in version one, but it seems that you prioritized word count over element count.

The problem is that the original emphasis on element count allows us to easily target and credit special and useful elements such as <a>, <img>, and <code>. Helpful comments generally have links, images and code samples.

  • With the following example, I would expect only one <p>
  • The comment word count is 7
  • We also need the ability to ignore words inside of specific tags. For example, <code>

Now you are keeping track of word count PER element which is more complex than it needs to be. Remember, the word count scoring strategy was added as an after thought for version one. Version one simply counts all the words in the comment (but also ignores words in specific elements, like code.) at the end of all the complex calculations.

Aside from the ignore capability, it doesn't care which element contains the words.

{
  "id": 2014495969,
  "content": "Wouldn't solve scenarios requiring keys or credentials",
  "url": "https://github.com/ubiquity/cloudflare-deploy-action/issues/6#issuecomment-2014495969",
  "type": "ISSUE_ASSIGNEE",
  "score": {
    "formatting": {
      "content": {
        "p": {
          "count": 7, // should be changed to `wordCount` and you should correctly count the amount of `p` tags to be `elementCount`
          "score": 1
        }
      },
      "wordValue": 0,
      "formattingMultiplier": 0
    },
    "reward": 0,
    "relevance": 0.3
  }
}
  • I would expect our p scoring to be 0 but our word scoring to be 0.1 for a grand total of 0.7.
  • Because this is the assignee writing on the issue (not the pull) we can provide credit; so the multiplier should not be 0.
  • Perhaps we could do a 4x multiplier, which yields 2.8 but then multiply by relevance to yield a sum of 0.84 for the comment.

Source

@0x4007
Copy link
Member Author

0x4007 commented Aug 20, 2024

Originally created at #83

@0x4007
Copy link
Member Author

0x4007 commented Aug 20, 2024

Hope that you can prioritize this so that I can start fine tuning the incentives!

@gentlementlegen
Copy link
Member

I think that Wouldn't solve scenarios requiring keys or credentials would be 8 not 7 because "Wouldn't = Would not" which is what the current regex would do but that's adjustable.

We also need the ability to ignore words inside of specific tags. For example, <code>

Is it safe to assume that a html element with a score of 0 would not have its content counting for the word total count? And if the html looks like

<code>
  <p>some content</p>
</code>

then the </p> wouldn't be counted for the final word count?

@0x4007
Copy link
Member Author

0x4007 commented Sep 16, 2024

Yes as I recall in the old implementation if it came across an "ignored" tag then it would count the total amount of words inside and then subtract from the net score.

@0x4007
Copy link
Member Author

0x4007 commented Sep 16, 2024

In normal front-end code I would express this logic as follows:

const codes = document.querySelectorAll(`code`);
let excludedWordCount = 0;
for (const code of codes){
  excludedWordCount += getWordCount(code);
}

// ...

const netWordScore = totalWordCount - excludedWordCount;
return netWordScore;

function getWordCount(element) {
  return element.textContent.split(" ").length;
}

But ultimately it appears that we will scrap this word counting logic all together pretty soon. The more robust way to credit for productive contributions is via semantic understanding of the corpus. This will be handled by embeddings most likely.

However I do want to retain the tag/element counter because there is definitely a high correlation with high quality comments and sample images, links, and code.

So definitely need to emphasize counting tags as per the original design. The word counting is an afterthought and will be removed when we figure out a good strategy with embeddings.

@gentlementlegen
Copy link
Member

gentlementlegen commented Sep 18, 2024

@0x4007 I think what is missing from your exemple is the handling of the list of regexes, so the result should actually look like

{
  "id": 2014495969,
  "content": "Wouldn't solve scenarios requiring keys or credentials",
  "url": "https://github.com/ubiquity/cloudflare-deploy-action/issues/6#issuecomment-2014495969",
  "type": "ISSUE_ASSIGNEE",
  "score": {
    "formatting": {
      "content": {
        "p": {
          "regex": {
            "\\b\\w+\\b": {
              "wordCount": 8,
              "wordValue": 0.1
            }
          },
          "score": 1,
          "elementCount": 1
        }
      },
      "multiplier": 1
    },
    "reward": 0.54,
    "relevance": 0.3
  }
}

and the result would be:
((wordCount * wordValue) + (score * elementCount) * multiplier) * relevance
((8 * 0.1) + (1 * 1) * 1) * 0.3 = 0.54
would that be correct? Or would you apply the relevance only to the wordCount * wordValue?

@0x4007
Copy link
Member Author

0x4007 commented Sep 18, 2024

Relevance should only apply to word count.

Tag count should remain unaffected. For example every image should guarantee $5

@gentlementlegen
Copy link
Member

@0x4007 Here is a first shot at it, let me know if that is what you expected: Meniole#14 (comment)

If that's ok, I'll fix the tests and put the PR ready for review.

@0x4007
Copy link
Member Author

0x4007 commented Sep 19, 2024

It's really hard to say from those results. It's always been difficult to understand the results based on the results table unfortunately

@gentlementlegen
Copy link
Member

The applied formula is

$$\sum_{i=0}^{n} \left( \sum_{j=0}^{n} \left(\text{wordCount}^{exponent} \times \text{wordValue} \times \text{relevance}\right) + \left(\text{score} \times \text{elementCount}\right) \right) \times multiplier + \text{task.reward} = \text{total}$$

So for the aforementioned example you described, the following happened:

$$((8^{0.85} \times 0.2 \times 0.3)+1*1)*1 = 1.351$$

So relevance is only applied to the words, and the elements have a fixed value not altered by the relevance.

@0x4007
Copy link
Member Author

0x4007 commented Sep 19, 2024

So relevance is only applied to the words, and the elements have a fixed value not altered by the relevance.

Sounds good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants