Changes to how Redgifs temp auth tokens are handled #940

Closed


remghoost

Hello.

I'm fairly new to programming (only about a year of python so far).
I'm not entirely comfortable with pull requests yet and I apologize if some aspects are incorrect.
Should I be attempting to merge into the development branch....?
Constructive criticism is appreciated.

I also apologize that my request is wordy.
I prefer being thorough.

This pull request closes #939.


Foreword

I have read both ARCHITECTURE.md and CONTRIBUTING.md.
I believe this pull request fits in line with both of them, with one caveat.

This line declares that bdfr should be stateless, and my pull request technically is not.

https://github.com/aliparlakci/bulk-downloader-for-reddit/blob/7676ff06a3b0c31eb2af7e691412cd63e862deea/docs/ARCHITECTURE.md?plain=1#L7

Currently, temporary auth tokens are pulled from the Redgifs API every time a link is checked.

https://github.com/aliparlakci/bulk-downloader-for-reddit/blob/7676ff06a3b0c31eb2af7e691412cd63e862deea/bdfr/site_downloaders/redgifs.py#L41-L43

I'm not sure when Redgifs changed their rate limiting (as downloads were working fine a few weeks ago), but they did.

My changes cache the temporary token using os.path.join(tempfile.gettempdir(), "redgifs_token.txt") and retrieve/save the token in three different instances.

  1. If redgifs_token.txt does not exist.
  2. If a 401 error occurs, meaning the token is invalid.
  3. If redgifs_token.txt exists and is a working auth_token.

These tokens are good for 24 hours, as stated in the Redgifs API docs.
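The three cases above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the helper names and the `fetch` callable (standing in for whatever requests a token from the API) are assumptions, and only the file location is taken from the description.

```python
import os
import tempfile
from typing import Callable, Optional

# Same location the description mentions; everything else here is illustrative.
TOKEN_FILE_PATH = os.path.join(tempfile.gettempdir(), "redgifs_token.txt")


def load_token(path: str = TOKEN_FILE_PATH) -> Optional[str]:
    """Return the cached token, or None if the file is missing or empty."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None


def save_token(token: str, path: str = TOKEN_FILE_PATH) -> None:
    """Overwrite the cache file with a freshly fetched token."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(token)


def get_token(fetch: Callable[[], str], force_refresh: bool = False,
              path: str = TOKEN_FILE_PATH) -> str:
    """Reuse the cached token unless it is missing or the caller saw a 401."""
    token = None if force_refresh else load_token(path)
    if token is None:
        token = fetch()  # hypothetical callable that asks the API for a token
        save_token(token, path)
    return token
```

Passing `force_refresh=True` covers the 401 case; the other two cases fall out of whether the file exists.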


Primary changes

I will give bullet points on what I changed in each script.
I apologize; for some reason I cannot comment on specific lines below. Still learning.

bdfr/__main__.py

  • Adjusted __version__ import. I received an error related to being unable to find __version__ when attempting to use pip install -e ., so the import was adjusted. I can remove this if it's not a problem on your end. This was just required to get development working on my end.

bdfr/downloader.py

  • Importing redgifs via bdfr.site_downloaders. I don't like doing this, but I couldn't figure out a way to only import/grab redgifs when necessary. You'd probably have a better solution to this.
  • Added error handling specifically for Redgifs related 401 errors. I'm using str(e).startswith("Server responded with 401"). It's lame, but it was yelling at me when I tried e.status_code. No clue why. This works though.
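One way to avoid matching on the message text is to have the exception carry the status code itself. The sketch below assumes a hypothetical `DownloadHTTPError` class; it is not bdfr's actual exception hierarchy, and `getattr` is used so that exceptions without the attribute do not raise a second error.

```python
from typing import Optional


class DownloadHTTPError(Exception):
    """Hypothetical exception that keeps the HTTP status code alongside the message."""

    def __init__(self, status_code: int, url: str):
        super().__init__(f"Server responded with {status_code} at {url}")
        self.status_code = status_code
        self.url = url


def status_of(error: Exception) -> Optional[int]:
    """Return the status code if the exception has one, else None."""
    return getattr(error, "status_code", None)


def needs_new_token(error: Exception) -> bool:
    """A 401 means the cached token was rejected and should be replaced."""
    return status_of(error) == 401
```

Keeping the URL on the exception also gives the caller what it needs to retry the same link later.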

bdfr/site_downloaders/base_downloader.py

  • Added error handling for 429 errors. It prints Too many requests, Try again in a little while

bdfr/site_downloaders/redgifs.py

  • Importing os and tempfile, both necessary for creating/reading redgifs_token.txt.
  • Added TOKEN_FILE_PATH to take care of caching the auth_token.
  • Added three functions: _load_token(), _save_token(), and _get_token(). These handle pretty much exactly what their names state. I tried to cut out the fluff and only include what was necessary. I'm sure more could be trimmed off though.

Expected output, tests, and considerations

Expected output

If redgifs_token.txt is not found or a 401 error is received, redgifs.py will create the file and save the token to it.
I added formatting via print statements to let the user know what was going on.

-=-=-=-=-=-=-=-=-=-=-=-
Redgifs API token file not found, retrieving new token
Attempting to retrieve new temporary Redgifs API token
Writing Redgifs temporary API token to C:\Users\USERNAME\AppData\Local\Temp\redgifs_token.txt
Success!

New temporary token is: eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJpc3MiOiJhdXRoLXNlcnZpY2UiLCJpYXQiOjE3MDcwODU0NzgsImF1ZCI6Imh0dHBzOi8vYXBpLnJlZGdpZnMuY29tIiwiYXpwIjoiMTgyM2MzMWY3ZDMtNzQ1YS02NTg5LTAwMDUtZDhlOGZlMGE0NGMyIiwiZXhwIjoxNzA3MTcxODc4LCJzdWIsdaiOiJjbGllbnQvMTgyM2MzMWY3ZDMtNzQ1YS02NTg5LTAwMDUtZDhlOGZlMGE0NGMyIiwic2NvcGVzIjoicmVhZCIsInZhbGlkX2FkZHIiOiIxMDcuMTgyLjEzOS40MiIsInZhbGlkX2FnZW50IjoicHl0aG9uLXJlcXVlc3RzLzIuMzEuMCIsInJhdGUiOi0xLCJodHRwczovL3JlZGdpZnMuY29tL3Nlc3Npb24taWQiOiI2R3RFcVhFVnp0R0xvUlpBZzZaYTdTIn0.O3nK9dlnC_AmbmUF8dWPN3M5RYVA2wMDiQC0MGSZirJyesoE2YCvYik4aNF3lRyo73DjYdSCAKNSP3rVhis7Ji4bQazG3XI5bi9yvlXWgdaIZAljwIEzuqheValt_NpIlBbqHWvgJtdl_oOgyh5O4mNaE84ndRO4GD1FAUm5KawteI8eun-fd7ryY3mZhRWe9uDG3s2Mrdyg6r1EhwO4iiD05hHf66tL-GAlqAjPUCAdqLTI8Mj3Arn7kLnAcRlZfP2VRY9cy_gt-NKo7BmEgHsXCZz4h8W6qhlYzVhqVcAvGT1vAJkoZsXGXpj8jHs22Vdr3biLQWK38WUHdUWlKg
-=-=-=-=-=-=-=-=-=-=-=-

You could remove the print statements if you'd like but I think they look nice....

Redgifs temporary auth tokens are anonymous, read only, and only valid for 24 hours. I don't think there's any issue saving/displaying them. Though, I could be mistaken on that...?

I could remove the print statement that displays the token if you're not comfortable with it.
I made them compact in the code as well, so they're not too intrusive.

-=-

If redgifs_token.txt is found, but is incorrect (giving a 401 error code), the output is similar.

BUT, the first attempted download does not go through.
I could not figure out how to fix this. If you could, I would greatly appreciate it.
This is the only bug that I could find.
This does not crash the script.

-=-

There is no output if the token is correct.

Tests

Command:

pytest -m "not reddit and not authenticated"

My changes failed on 27 tests, but a fresh git clone of the main branch failed on all of these tests as well.
Many of the failures take place due to 404 errors with the provided links.

Considerations

The Redgifs test fails as well, but due to the bug mentioned above.
One link will hit a 401 error before it notices and grabs the token.
I'm not sure how to adjust for that.

Once per day, the first download will fail.
But the current implementation is entirely broken due to the API changes.
The lesser of two evils, I suppose...? It could probably be fixed, but I am currently out of brain juice. lol.

This was only tested on a Windows 10 machine. I'm not sure how Linux/Mac would handle the os.path.join(tempfile.gettempdir(), "redgifs_token.txt"). It should work though.

I could also put in code to remove the redgifs_token.txt after use (not entirely sure how though), if you wanted it to be truly stateless.
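A rough sketch of what that cleanup could look like, using only the standard library. The function name is made up for illustration, and registering it with `atexit` is just one option.

```python
import atexit
import contextlib
import os
import tempfile

TOKEN_FILE_PATH = os.path.join(tempfile.gettempdir(), "redgifs_token.txt")


def remove_token_file(path: str = TOKEN_FILE_PATH) -> None:
    """Delete the cached token; do nothing if it is already gone."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)


# Run the cleanup when the interpreter exits normally.
atexit.register(remove_token_file)
```

The trade-off is that deleting the file on exit means every new run fetches a token again, so the cache would only help within a single run.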

@altdc

altdc commented Feb 5, 2024

With respect to maintaining statelessness, I think a good compromise could be to instead cache the token in memory. Yes, bdfr would still be fetching a new token on each startup, but it would still result in far fewer requests than it is currently making fetching the token every time.
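As a rough illustration of the in-memory approach, a small cache object could hold the token and refetch it once it gets too old. This is a sketch under assumptions: `TokenCache`, its `fetch` callable, and the TTL value are all hypothetical and not taken from bdfr.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Keeps one token in memory and refetches it once it is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch  # hypothetical callable that requests a token
        self._ttl = ttl_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        expired = self._token is None or self._clock() - self._fetched_at >= self._ttl
        if expired:
            self._token = self._fetch()
            self._fetched_at = self._clock()
        return self._token

    def invalidate(self) -> None:
        """Drop the token, for example after the server answers 401."""
        self._token = None
```

Nothing is written to disk, so the process stays stateless between runs while still making one token request per run instead of one per link.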

@Serene-Arc Serene-Arc changed the base branch from master to development February 5, 2024 06:35
@remghoost
Author

I did consider caching the token in memory, but I didn't want to go around adjusting primary scripts that didn't really have much to do with Redgifs downloading. Especially for my first PR on this project. And since the Redgifs downloader is stateless, I couldn't quite figure out a way to do that solely in redgifs.py.

It would be a cleaner solution though. It would be good to look into.

This seems to work quite decently though. I've tested downloads on a few other accounts and it seemed to prevent the rate limiting.

Still not the biggest fan of my implementation though, since it skips the link that throws the 401 error.
I tried to pass the url of the 401 request back and forth to attempt a redownload, but I don't know quite enough about python yet to be successful in that attempt.

@altdc

altdc commented Feb 5, 2024

Makes sense. I can confirm these changes also work on my machine (latest macOS, Python 3.9). I was able to download about 10 profiles in a row with many Redgifs links, when previously a single profile was failing.

redgifs.Redgifs._get_token(redgifs.Redgifs, redgifs_url)

# Attempted redownload of link that hit the 401 error. Don't think it works though.
redgifs.Redgifs._get_link(redgifs_url)
Author

Here's the line that attempts to re-run the download with the skipped link from the 401 error.

I'm guessing the chain resolves somewhere else and the link doesn't get passed correctly.
Might be worth moving the entire error checking code into redgifs.py and just handle it over there.
Would be easier to pass the url and re-run the download function.

@remghoost
Author

remghoost commented Feb 5, 2024

Glad it works! I was a bit worried about the temp directory stuff, but I assumed that Python's standard library had it covered.

You can replicate the 401 error skipping the first link by altering the redgifs_token.txt file, forcing the code to grab a new token. You'll see that it doesn't pick up

If we could figure out a way to pass that link back over and run it again with the new token, there would be no downsides to this change.
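One possible shape for that retry, sketched with placeholder pieces: `Unauthorized` stands in for whatever error a 401 actually raises, and `download` and `get_token` are hypothetical callables rather than bdfr functions. The idea is to keep the URL in scope so the same link is attempted again with the fresh token.

```python
from typing import Callable


class Unauthorized(Exception):
    """Hypothetical stand-in for whatever error a 401 response raises."""


def download_with_retry(url: str,
                        download: Callable[[str, str], bytes],
                        get_token: Callable[[bool], str]) -> bytes:
    """Try once with the cached token; on a 401, refresh the token and retry the same URL."""
    token = get_token(False)
    try:
        return download(url, token)
    except Unauthorized:
        token = get_token(True)  # force a fresh token
        return download(url, token)  # a second 401 propagates to the caller
```

Retrying only once keeps a permanently rejected token from causing an endless loop.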

@Serene-Arc
Owner

Would you prefer me to make comments on things to change or pull and implement those changes myself?

@Soulsuck24
Contributor

#769

This appears to be based off the master branch? The dev branch should cache the auth tokens already unless things have changed since then.

I can't say for sure though as my major usage has become almost nil since the imgur and reddit changes.

@Serene-Arc
Owner

I can't say for sure though as my major usage has become almost nil since the imgur and reddit changes.

As has mine. @remghoost can you confirm if this is still a problem with the development code?

@remghoost
Author

@Serene-Arc You can pull and edit the changes if you want. That's fine by me.

I haven't tested the changes on the development branch yet, but I can in a few hours.

And if it's true that the development branch already caches the token, then we don't really need to incorporate my changes...

I wasn't exactly aware of the state of the repo when I started working on the project.

@Serene-Arc
Owner

All good. Note that when you pull the development branch, your own tests will work if you provide an updated token and secret from Reddit. No doubt there are some tests that will fail anyway (such is the life of testing a web scraper) but you'll get an idea of whether your changes are needed.

@remghoost
Author

tl;dr - I'll close this pull request since it was already fixed in the development branch. My solution is not necessary to pull.

@Serene-Arc

So after looking through the development branch changes, it seems like token caching does already exist:

https://github.com/aliparlakci/bulk-downloader-for-reddit/blob/257d44408d3c23014c6f05c3024c03b1bba2c9c8/bdfr/site_downloaders/redgifs.py#L24-L28

It caches the token for 22.8 hours, which falls in line with the 24 hour limit on tokens.
So it was solved on the development branch already.

The main difference is that it caches the token in memory instead of a file, which has its pros and cons.
If someone was downloading a bunch of smaller reddit accounts (with only a handful of redgifs links) they might run into a problem, but it's highly unlikely.

I like my solution a bit more, but the development branch solution is far more elegant. Learned something new from it too. Super neat.

I'll close this pull request since the problem was technically not with the script, but with the inability to update the pip package.
hm.


Also, I still pick up an error on the development branch when installed via pip install -e .

Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python310\Scripts\bdfr.exe\__main__.py", line 4, in <module>
  File "C:\git\__pull-requests\bulk-downloader-for-reddit\bdfr\__main__.py", line 10, in <module>
    from bdfr.archiver import Archiver
  File "C:\git\__pull-requests\bulk-downloader-for-reddit\bdfr\archiver.py", line 20, in <module>
    from bdfr.connector import RedditConnector
  File "C:\git\__pull-requests\bulk-downloader-for-reddit\bdfr\connector.py", line 25, in <module>
    from bdfr import __version__
ImportError: cannot import name '__version__' from 'bdfr' (unknown location)

I fixed it on my end by changing from bdfr import __version__ to from bdfr.__init__ import __version__ in both __main__.py and connector.py.

Not sure why that's happening (as it doesn't seem to happen when installed via pip). Might be something wonky with my python install or perhaps a byproduct of installing a package as editable. Yet another puzzle to solve on my end. haha. Figured I'd mention it anyways.

@remghoost remghoost closed this Feb 9, 2024
@Serene-Arc
Owner

I was actually going to suggest that you change to use a caching and memoisation method instead of a temp file so it worked out fine.

Successfully merging this pull request may close these issues.

[BUG] Redgifs 429 Rate Limiting on Temporary Token Auth API