What is the current behavior?
The robots.txt file is re-parsed for every request, but those files can be big. Today Google only reads the first 500 KB and ignores the rest.
What is the expected behavior?
Maybe the crawler could keep the parsed robots.txt, up to N instances. That would allow a strong cache hit rate without letting the cache grow forever (see the sketch below).
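A bounded LRU cache would fit this: parse once per host, reuse the result on later requests, and evict the least recently used entry once the limit is reached. Here is a minimal TypeScript sketch of that idea; `RobotsTxtCache`, `ParsedRobots`, and the injected `parse` function are hypothetical names for illustration, not part of the crawler's actual API.

```ts
// Hypothetical shape of a parsed robots.txt; the real parser's type would be used instead.
type ParsedRobots = { isAllowed(path: string, userAgent: string): boolean };

class RobotsTxtCache {
  // Map preserves insertion order, so re-inserting on access yields LRU order.
  private entries = new Map<string, ParsedRobots>();

  constructor(
    private parse: (body: string) => ParsedRobots,
    private maxEntries = 100, // keep at most N parsed instances
  ) {}

  get(origin: string, body: string): ParsedRobots {
    const hit = this.entries.get(origin);
    if (hit) {
      // Cache hit: refresh recency and skip re-parsing entirely.
      this.entries.delete(origin);
      this.entries.set(origin, hit);
      return hit;
    }

    const parsed = this.parse(body);
    this.entries.set(origin, parsed);

    // Evict the least recently used entry so the cache cannot grow forever.
    if (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value as string;
      this.entries.delete(oldestKey);
    }
    return parsed;
  }
}
```

Keying by origin means one parsed instance per host regardless of how many URLs on that host are crawled, so even a 1 MB robots.txt is only parsed once until it is evicted.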
What is the motivation / use case for changing the behavior?
Although I didn't manage to find the robots.txt again, I have already seen files that were easily > 1 MB.
Overall performance could take a serious hit if such a file were re-parsed for every single request.