Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Unstable search issue behavior #178

Merged
merged 3 commits into from
Aug 11, 2024

Conversation

snaka
Copy link
Contributor

@snaka snaka commented Aug 2, 2024

Issue

GitHub's Isseu search API may fail due to a query character count validation error.

GET https://api.github.com/search/issues?q=... (snip) ... : 422 Validation Failed [{Resource:Search Field:q Code:invalid Message:The search is longer than 256 characters.}]

How to reproduce problem

For example, with the current calculation method, API validation will generate an error if the following conditions are met

  • The repository name is extremely short
  • Many commits since the last release

Reproduced repository: https://github.com/snaka/a

How to fix it

I think there are two problems with the current logic

  • Wrong count targets.
    • The count of query characters includes qualifiers.
  • Inconsistency between the API delimiter counting method and the implementation.
    • The SHA delimiter is assumed to be a single space character, but the API counts delimiters as approximately three characters (note that there is no mention of this in the API documentation and it is a guess based on trial and error).

Wrong count targets

According to the GitHub API specification,
The structure of a query consists of one or more KEYWORDs and one or more QUALIFIERs, as follows.

KEYWORD_1 KEYWORD_2 QUALIFIER_1 QUALIFIER_2

The character limit of query does not apply to QUALIFIERs, so the part excluding the QUALIFIERs must be validated.

See also: https://docs.github.com/en/rest/search/search?apiVersion=2022-11-28#limitations-on-query-length

Inconsistency of delimiter counting method

The following scripts have been used to try switching the contents of the keywords variable,
I have speculated that the count as the number of characters per delimiter space may be counting as three characters instead of one.

#!/bin/bash

# GitHub API endpoint
url="https://api.github.com/search/issues"

# search query
qualifiers="repo:snaka/a is:pr is:closed"
keywords="a3dbce50efe8b2133b284157329cfae236ffc154ae2ce4d1f9571a330629635150d269e597c8765b591fbacc2876ecb195db6032a13ba8c8dfdce965c1e99b3dbccbab22911dc13da15916e3d8808eb384bccdaa6b1172badbb216965c1e99b3dbccbab22911dc13da15916e3d8808eb384bccdaa6b1172badbb21691234 1" # valid
# keywords="a3dbce50efe8b2133b284157329cfae236ffc154ae2ce4d1f9571a330629635150d269e597c8765b591fbacc2876ecb195db6032a13ba8c8dfdce965c1e99b3dbccbab22911dc13da15916e3d8808eb384bccdaa6b1172badbb216965c1e99b3dbccbab22911dc13da15916e3d8808eb384bccdaa6b1172badbb21691234 12" # invalid
# keywords="a3dbce50efe8b2133b284157329cfae236ffc154ae2ce4d1f9571a330629635150d269e597c8765b591fbacc2876ecb195db6032a13ba8c8dfdce965c1e99b3dbccbab22911dc13da15916e3d8808eb384bccdaa6b1172badbb216965c1e99b3dbccbab22911dc13da15916e3d8808eb384bccdaa6b1172badbb216912341234" # valid
encoded_kw=$(printf "$keywords" | jq -sRr @uri)

echo "keywords (${#keywords}): $keywords"
echo "encoded keywords (${#encoded_kw}): $encoded_kw"
echo "---------------------------------"

query="$qualifiers $keywords"

# URL encode
encoded_query=$(printf "$query" | jq -sRr @uri)

# run curl command
curl -H "Accept: application/vnd.github.v3+json" \
  -H "Authorization: token $GH_TOKEN" \
  "$url?q=$encoded_query"

Associated with the number 3 is the specification that space characters can be represented by the expression %20.

The API documentation is not clear on this, but as long as the number of characters is counted assuming the whitespace character is %20, the API validation does not seem to cause an error.

According to the GitHub API specification,
The structure of a query consists of one or more KEYWORDs and one
one more QUALIFIERs, as follows.

```
KEYWORD_1 KEYWORD_2 QUALIFIER_1 QUALIFIER_2
```

The character limit of query does not apply to QUALIFIERs, so the
part excluding the QUALIFIERs must be validated.

See-also: https://docs.github.com/en/rest/search/search?apiVersion=2022-11-28#limitations-on-query-length
@snaka snaka marked this pull request as ready for review August 2, 2024 15:26
tagpr.go Outdated
// Also, from the results of the experiment, it is possible that when counting
// the number of characters in the keyword part, one space character is counted
// as three characters (possibly '%20').
if len(strings.Join(keywords, "%20") + "%20" + sha) >= 256 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected something like this

len(keywords) * (3+len(sha)) >= 256 or something close to this.

While I understand the logic of your code, I'm not confident about how this code could be maintained.

Any extra special character added later might break the way to count and may lead to errors.

I would have kept the old code, and check the length of the URL encoded strings

So instead of checking len(query)+1+len(sha) >= 256

I would concat a tmpQuery := query + " " + sha

Then use if len(url.QueryEscape(tmpQuery)) >= 256

If the condition is true, the code is unchanged.

If condition is false, you set

query = tmpQuery

Copy link
Contributor Author

@snaka snaka Aug 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback.

I was actually wondering if I should use the url package here.

I will explain my concerns later, but first, let me comment on your suggested fix.

url.QueryEscape() escapes spaces to + (not %20).
To achieve the expected result, we should use url.PathEscape() here.

https://go.dev/play/p/0Xk56v95-Ol?v=goprev

However, I wasn't confident that url.PathEscape() was the best choice here.
There are two reasons for this:

  1. The awkwardness of applying PathEscape to query escaping.
  2. The fact that the GitHub API specification does not disclose how the number of characters in this part is counted (this is determined by black-box testing, but there is no certainty).
    • In other words, I can't determine if the GitHub API's query character count specification matches the PathEscape specification

Considering these points, do you still think it's better to use the url package? I'd like to here your opinion on this.

To be honest, I think it does't matter if we use the url package here or not.

Also, I have limited experience with Go language, so I appreciate your feedback like this.

Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ccoVeille
May I ask your opinion about the above?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't receive a notification 🤦‍♂️

I understand it's quite a mess to cope with this.

I thought it was something that was used in URL encoding.

The solution you published today is simpler and safer.

I hope you won't face more trouble.

The changes might come with higher number of request, so maybe the rate limiting will be experienced more often.

Maybe you could/should wait a bit more between calls

@snaka snaka changed the title fix: Unstable search behaviour fix: Unstable search issue behavior Aug 3, 2024
@Songmu
Copy link
Owner

Songmu commented Aug 7, 2024

Thanks. There is no need to aim for the last possible length (256), and it would be better to do a character count that considers the possibility of spaces being URL-escaped to three characters.

query := queryBase
func buildChunkSearchIssuesQuery(qualifiers string, shasStr string) (chunkQueries []string) {
// array of SHAs
keywords := make([]string, 0, 25)
Copy link
Contributor Author

@snaka snaka Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 To clarify that the character limit applies only to the part of the string that makes up the query, excluding the qualifier, the variable name has been changed.

@snaka snaka requested a review from ccoVeille August 7, 2024 12:09
@snaka
Copy link
Contributor Author

snaka commented Aug 7, 2024

@Songmu @ccoVeille

The following changes were made:

  • Removed escaping of spaces, which was assumed to be guessing.
  • Instead, the maximum number of characters was given more leeway.
    • 7 x (25+1) = 200

tagpr.go Outdated
Comment on lines 624 to 629
// However, although not explicitly stated in the documentation, the space separating
// keywords is counted as one or more characters, so it is possible to exceed 256
// characters if the text is filled to the very limit of 256 characters.
// For this reason, the maximum number of chars in the KEYWORD section is limited here to 200.
tempKeywords := append(keywords, sha)
if len(strings.Join(tempKeywords, " ")) >= 200 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a constant and put this explanation aside.

It would help to understand the comment is about the value and not the algorithm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it, how about this?

@snaka snaka requested a review from ccoVeille August 7, 2024 12:54
@Songmu
Copy link
Owner

Songmu commented Aug 11, 2024

OK; let's give it a try.

@Songmu Songmu merged commit 9138386 into Songmu:main Aug 11, 2024
3 checks passed
@Songmu Songmu added the minor label Aug 11, 2024
@snaka snaka deleted the fix-unstable-search-api-behavior branch August 13, 2024 23:31
@snaka
Copy link
Contributor Author

snaka commented Aug 15, 2024

I checked with the repository where the problem had previously appeared, and the problem has been resolved. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants