Requesting support for new CKAN User-Agent strings on kerbaltek.com #75

Closed
HebaruSan opened this issue Nov 29, 2021 · 26 comments


@HebaruSan

HebaruSan commented Nov 29, 2021

Hi @Ezriilc!

We'd like to update the User-Agent strings that some CKAN bots and utilities use (see KSP-CKAN/CKAN#3490, KSP-CKAN/xKAN-meta_testing#84, and KSP-SpaceDock/SpaceDock#436), and we are aware this would break HyperEdit and Graphotron:

image

Could you please update your site to treat all three of these as CKAN? The old one is not being removed, just supplemented:

  • Mozilla/4.0 (compatible; CKAN)
  • Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/xKAN-meta_testing)
  • Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/NetKAN-Infra)

If you have any questions, I'll do my best to answer them. Thanks!

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

Hi, @HebaruSan!

Thanks for coming here to tell me about this. I really appreciate it.

I've added the new agents. Please test it when you can.

@HebaruSan
Author

Hi @Ezriilc,

Thanks for the quick response! The special _IamCKAN URLs seem to be broken now for all useragents:

$ curl.exe --fail -O --user-agent 'Mozilla/4.0 (compatible; CKAN)' https://www.kerbaltek.com/_IamCKAN_Gimme_hyperedit_
curl: (22) The requested URL returned error: 404

$ curl.exe --fail -O --user-agent 'Mozilla/5.0 (compatible; Netkanbot/1.0; +https://github.com/KSP-CKAN/xKAN-meta_testing)' https://www.kerbaltek.com/_IamCKAN_Gimme_hyperedit_
curl: (22) The requested URL returned error: 403

$ curl.exe --fail -O --user-agent 'Mozilla/5.0 (compatible; Netkanbot/1.0; +https://github.com/KSP-CKAN/NetKAN-Infra)' https://www.kerbaltek.com/_IamCKAN_Gimme_hyperedit_
curl: (22) The requested URL returned error: 403

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

DRAT! I thought that may have been too easy. Working...

@HebaruSan
Author

HebaruSan commented Nov 29, 2021

FYI, we are still chatting about the possibility of tweaking the strings I gave a little; @DasSkelett wants to add CKAN somewhere, and I agree. I will post an update once we reach a final decision. Checking for a browser family of Netkanbot would probably be the most flexible way to cover all the newer possibilities.
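
As a rough illustration of that suggestion (not the site's actual code, which isn't shown in this thread), a check along these lines would accept the legacy string exactly and match the newer bots by their `Netkanbot` product token. The function and constant names are hypothetical:

```python
import re

# Legacy client string, matched exactly as listed in the issue description.
LEGACY_CKAN_UA = "Mozilla/4.0 (compatible; CKAN)"

# Newer bots are matched by product token rather than by full string, so a
# version bump or a different "+https://..." info URL needs no server change.
NETKANBOT_RE = re.compile(r"\(compatible; Netkanbot/\d+(?:\.\d+)*;")


def is_ckan_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent looks like the CKAN client or a Netkanbot."""
    if user_agent == LEGACY_CKAN_UA:
        return True
    return NETKANBOT_RE.search(user_agent) is not None
```

Matching on the token keeps the allow-list short, at the cost of accepting any client that chooses to send a `Netkanbot/<version>` string.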

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

> DRAT! I thought that may have been too easy. Working...

Sorry for that. I THINK I have it fixed now.

> FYI, we are still chatting about the possibility of tweaking the strings I gave a little; @DasSkelett wants to add CKAN somewhere, and I agree. I will post an update once we reach a final decision. Checking for a browser family of Netkanbot would probably be the most flexible way to cover all the newer possibilities.

I've modified my code to allow easy changes and additions to both CKAN and NETKAN user agents. Feel free to update me whenever a change is made.

@HebaruSan
Author

Thanks! The description now has the latest strings (added CKAN; to the middle of the parenthesized part).

The _IamCKAN links seem to be going through an HTTP to HTTPS redirection and then failing when I try it with netkan.exe:

879 [1] INFO CKAN.Net (null) - http://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_ redirected to https://kerbaltek.com/_IamCKAN_Gimme_graphotron_
1516 [1] INFO CKAN.Net (null) - https://kerbaltek.com/_IamCKAN_Gimme_graphotron_ redirected to https://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_
2156 [1] FATAL CKAN.NetKAN.Program (null) - The remote server returned an error: (403) Forbidden.

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

> Thanks! The description now has the latest strings (added CKAN; to the middle of the parenthesized part).
>
> The _IamCKAN links seem to be going through an HTTP to HTTPS redirection and then failing when I try it with netkan.exe:
>
> 879 [1] INFO CKAN.Net (null) - http://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_ redirected to https://kerbaltek.com/_IamCKAN_Gimme_graphotron_
> 1516 [1] INFO CKAN.Net (null) - https://kerbaltek.com/_IamCKAN_Gimme_graphotron_ redirected to https://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_
> 2156 [1] FATAL CKAN.NetKAN.Program (null) - The remote server returned an error: (403) Forbidden.

Sorry, but isn't the 'http(s)' part at your end?

I'll have to take up this fix a bit later today.

@HebaruSan
Author

As far as I can tell, no, the server is doing that:

$ curl.exe --user-agent 'Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/NetKAN-Infra)' http://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://kerbaltek.com/_IamCKAN_Gimme_graphotron_">here</a>.</p>
</body></html>
$ curl.exe --user-agent 'Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/NetKAN-Infra)' https://kerbaltek.com/_IamCKAN_Gimme_graphotron_

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_">here</a>.</p>
</body></html>
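
For anyone wanting to reproduce the hop-by-hop behaviour without reading HTML bodies, here is a minimal sketch that follows `Location` headers one request at a time and records the status of each hop. `fetch_once` and `trace_redirects` are hypothetical helper names, not part of CKAN or netkan.exe:

```python
import http.client
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url: str, user_agent: str) -> Tuple[int, Optional[str]]:
    """Send a single GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(
    url: str,
    fetch: Callable[[str], Tuple[int, Optional[str]]],
    max_hops: int = 5,
) -> List[Tuple[str, int]]:
    """Follow redirects manually, returning (url, status) for every request made."""
    hops = []
    for _ in range(max_hops + 1):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops


# Example call (performs real network requests):
# ua = "Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/NetKAN-Infra)"
# trace_redirects("http://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_",
#                 lambda u: fetch_once(u, ua))
```

Passing the fetch function in as a parameter keeps the tracing logic separate from the network call, so the same loop can be exercised against canned responses.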

@HebaruSan
Author

The currently live version of the bot (using the old useragent, working before I submitted this) is also failing with a 404.

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

> As far as I can tell, no, the server is doing that:
>
> $ curl.exe --user-agent 'Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/NetKAN-Infra)' http://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_
>
> <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
> <html><head>
> <title>301 Moved Permanently</title>
> </head><body>
> <h1>Moved Permanently</h1>
> <p>The document has moved <a href="https://kerbaltek.com/_IamCKAN_Gimme_graphotron_">here</a>.</p>
> </body></html>
> $ curl.exe --user-agent 'Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/NetKAN-Infra)' https://kerbaltek.com/_IamCKAN_Gimme_graphotron_
>
> <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
> <html><head>
> <title>302 Found</title>
> </head><body>
> <h1>Found</h1>
> <p>The document has moved <a href="https://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_">here</a>.</p>
> </body></html>

I meant that NETKAN/CKAN should be calling https, and not http. I can't control how they call the URL.

However, you should not be getting the 403/404 errors, so I'll get on that later today. Sorry!

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

> I meant that NETKAN/CKAN should be calling https, and not http. I can't control how they call the URL.

I just thought that that's probably in my .version file or some such. Later...

@HebaruSan
Author

Oh, the mods' .netkan files start us out on http://, but that's the same as it was yesterday. We can update that eventually, but for now I want to limit the number of variables we're changing until things are working again.

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

My code is checking the IP address of NETKAN. Is that a problem?

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

Also, I've been separating things based on the CKAN app vs the NETKAN bot. Is that sensible?

I'm ready for you to do a test again, when you can. Thanks.

@HebaruSan
Author

> My code is checking the IP address of NETKAN. Is that a problem?

Oh, maybe that's why I still get a 403 response when testing the new strings from my own computer. Which IP addresses are you allowing? The parts of the bot will run from inside the current AWS containers and (less often) some GitHub Action containers, so I would want to make sure neither of those is blocked.

The old string seems to be working again, though, so that's good. 👍

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

> My code is checking the IP address of NETKAN. Is that a problem?
>
> Oh, maybe that's why I still get a 403 response when testing the new strings from my own computer. Which IP addresses are you allowing? The parts of the bot will run from inside the current AWS containers and (less often) some GitHub Action containers, so I would want to make sure neither of those is blocked.
>
> The old string seems to be working again, though, so that's good. 👍

If you'd like to give me a list of confirmed IP addresses to allow, I can do that.

However, I don't understand why the new strings aren't being approved, but the old one is. They're all looked at the same way.

@HebaruSan
Author

Hmm, apparently that's not recommended:

https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#ip-addresses

> Since there are so many IP address ranges for GitHub-hosted runners, we do not recommend that you use these as allow-lists for your internal resources.

So I guess the simple answer to "Is that a problem?" was "Yes."

> However, I don't understand why the new strings aren't being approved, but the old is. They're all looked at the same way.

Hmm, I'm not going to be able to point out what's causing it without seeing your code and server setup, but the first useragent listed in the description currently works, and the others return 403 Forbidden.

@Ezriilc
Owner

Ezriilc commented Nov 29, 2021

> So I guess the simple answer to "Is that a problem?" was "Yes."

Drat. That is a bit of a security problem, but not a big one. I've disabled that check.

> Hmm, I'm not going to be able to point out what's causing it without seeing your code and server setup, but the first useragent listed in the description currently works, and the others return 403 Forbidden.

Yep, that one's on me. I'm sorta thinking out loud to let you know where I am. Working...

@Ezriilc
Owner

Ezriilc commented Nov 30, 2021

> Yep, that one's on me. I'm sorta thinking out loud to let you know where I am. Working...

I THINK I have the 404 errors fixed. Please test when you can.

@HebaruSan
Author

Hi, thanks for the response. I'm still seeing the same errors.

$ netkan.exe --net-useragent 'Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/xKAN-meta_testing)' NetKAN/HyperEdit.netkan --verbose
...
838 [1] INFO CKAN.NetKAN.Transformers.HttpTransformer (null) - Executing HTTP transformation with #/ckan/http/http://www.kerbaltek.com/_IamCKAN_Gimme_hyperedit_#cachebuster1
1027 [1] INFO CKAN.Net (null) - http://www.kerbaltek.com/_IamCKAN_Gimme_hyperedit_#cachebuster1 redirected to https://kerbaltek.com/_IamCKAN_Gimme_hyperedit_
1933 [1] INFO CKAN.Net (null) - https://kerbaltek.com/_IamCKAN_Gimme_hyperedit_ redirected to https://www.kerbaltek.com/_IamCKAN_Gimme_hyperedit_
2756 [1] FATAL CKAN.NetKAN.Program (null) - The remote server returned an error: (404) Not Found.

$ netkan.exe --net-useragent 'Mozilla/5.0 (compatible; Netkanbot/1.0; CKAN; +https://github.com/KSP-CKAN/xKAN-meta_testing)' NetKAN/Graphotron.netkan --verbose
...
742 [1] INFO CKAN.NetKAN.Transformers.HttpTransformer (null) - Executing HTTP transformation with #/ckan/http/http://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_
861 [1] INFO CKAN.Net (null) - http://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_ redirected to https://kerbaltek.com/_IamCKAN_Gimme_graphotron_
2008 [1] INFO CKAN.Net (null) - https://kerbaltek.com/_IamCKAN_Gimme_graphotron_ redirected to https://www.kerbaltek.com/_IamCKAN_Gimme_graphotron_
3519 [1] FATAL CKAN.NetKAN.Program (null) - The remote server returned an error: (404) Not Found.

The old useragent still works.

@Ezriilc
Owner

Ezriilc commented Nov 30, 2021

> Hi, thanks for the response. I'm still seeing the same errors.

I'm very sorry for all the trouble, and I'm grateful for all your feedback and efforts to help me.

I've made some changes, and if you could please try again, it will help me to figure out what's wrong.

THANKS!

@HebaruSan
Author

HebaruSan commented Nov 30, 2021

It's no problem at all, I am as familiar with the change-test-fix cycle as anyone. 😀

With the latest changes, it works for me with any useragent; after I tried the ones we plan to use, I tested with "Something else" and that worked, too. This would be fine for us, but I'm guessing you'll want to limit it more than that.

@HebaruSan
Author

HebaruSan commented Nov 30, 2021

Hold on that for a moment, I need to double check whether the latter tests used a cached copy of the download, forgot about that before...

OK, confirmed that the "Something else" useragent is able to retrieve the file without caching.

@Ezriilc
Owner

Ezriilc commented Nov 30, 2021

> Hold on that for a moment, I need to double check whether the latter tests used a cached copy of the download, forgot about that before...

I was wondering if caching might be the issue. Shall I put it back to see if that's it?

@HebaruSan
Author

I don't know what you'd be putting back, but I've confirmed that client-side caching isn't the cause of what I'm seeing.

@HebaruSan
Author

We've switched over to the new strings and everything seems to be working, so we can consider this resolved for now. Thanks for your help!
