
Fix large list truncation #127

Conversation


@hithomasmorelli hithomasmorelli commented Aug 15, 2020

Fixes #126

  • Include tests for bug fix and new functionality
  • Updated documentation for new feature (n/a)

@hithomasmorelli hithomasmorelli changed the title from "Fix track list truncation" to "Fix large list truncation" Aug 15, 2020

hithomasmorelli commented Aug 16, 2020

@browniebroke I've fixed the bug and updated the tests (to check for item truncation where applicable), but I'm having trouble with the cassettes: tox -e lint,docs,py38 is failing on pytest because new requests are being made (to test for truncation), and I'm getting multiple vcr.errors.CannotOverwriteExistingCassetteException errors.

How should I solve this? Apologies for my lack of knowledge; to say I'm rusty with vcrpy would be an understatement, as I've never used it before!

@hithomasmorelli
Contributor Author

OK, so I think I've fixed the issue of not being able to overwrite existing cassettes (by first running pytest --vcr-record new_episodes), but I've got a different sort of issue now:

I added an extra test to TestPlaylist.test_get_fans to check that the list of fans isn't being truncated (other than at 100, where the API fans endpoint has a firm limit). However, that generated a cassette file containing the IDs and usernames of 100 fans of a playlist I picked at random.

I feel a bit conflicted about having this username data in this repository: the Deezer website doesn't provide any way to view this data; it seems to be available only via the API. Additionally, while Deezer editors might reasonably expect their track/album/artist/playlist details to be extracted from the API (given their roles), the 100 fans that the API spits out haven't had an opportunity to consent to their IDs & usernames being stored in a random repository.

Because of this, I propose that, before I commit & push the changes, I manually edit the TestPlaylist.test_get_fans cassette file to blank out the IDs & usernames of the new users being retrieved. This shouldn't affect the truncation test as it's just checking for the length of the list.
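As an alternative to hand-editing, vcrpy's before_record_response hook could probably do this scrubbing automatically at record time. A rough, untested sketch, assuming the fans payload is the usual Deezer {"data": [...]} shape:

import json

import vcr

def scrub_fans(response):
    # Blank out user IDs and names before the response is written to a cassette.
    # Sketch of an automatic alternative to manually editing the cassette file.
    try:
        payload = json.loads(response["body"]["string"])
    except (ValueError, TypeError):
        return response  # not JSON, leave untouched
    for item in payload.get("data", []):
        if item.get("type") == "user":
            item["id"] = 0
            item["name"] = "anonymised"
    response["body"]["string"] = json.dumps(payload).encode()
    return response

my_vcr = vcr.VCR(before_record_response=scrub_fans)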
@browniebroke is this OK with you?

@browniebroke
Owner

Hey, thanks for taking some time on this. Yes, sounds reasonable to anonymise the data before pushing it. 👍

@hithomasmorelli
Contributor Author

Brilliant, will push now

Only follows "next" URLs:
 - when they are present in the API response,
 - when the query is for "tracks", "fans", "albums", "artists" or "playlists", and
 - when a limit has not been specified in kwargs (if a limit is specified, Deezer seems to honour it, even when it is over their default limit); a rough sketch of this check follows below.
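In rough terms, the guard described by those bullet points might look like this (a sketch of the described behaviour, not the exact diff; the names relation, kwargs and json follow the surrounding discussion):

PAGINATED_RELATIONS = {"tracks", "fans", "albums", "artists", "playlists"}

def should_follow_next(relation, kwargs, json):
    # Follow "next" URLs only when the API offers a next page, the relation
    # is one of the list-like relations above, and the caller hasn't capped
    # the result with an explicit limit.
    return (
        "next" in json
        and relation in PAGINATED_RELATIONS
        and "limit" not in kwargs
    )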
@hithomasmorelli hithomasmorelli marked this pull request as ready for review August 17, 2020 09:53

@browniebroke browniebroke left a comment


Looks good overall, I'm just wondering if this feature could trigger a large amount of hidden API calls. Did you hit any issues with that?

deezer/client.py Outdated
Comment on lines 192 to 196
relation == "tracks"
or relation == "fans"
or relation == "albums"
or relation == "artists"
or relation == "playlists"
browniebroke (Owner)

Why add these restrictions on specific relations? Could we do it whenever there is a next in the response, regardless of the relation being requested?

@hithomasmorelli hithomasmorelli Aug 17, 2020

I remember testing it without that and something going horribly wrong, but that was before the `and "limit" not in kwargs` check got inserted. I'll do some experimenting and see if I can reproduce the same issue; otherwise I'll take them out.

@hithomasmorelli hithomasmorelli Aug 17, 2020

Still can't remember exactly why I left that in, but I think it had something to do with endpoints such as (and possibly only) artist/<artist>/top. I suppose I thought it was a different question whether artists' top songs should be listed beyond the API's default limit (without specifying a higher limit).

The benefit of following next URLs when requesting (e.g.) playlist tracks is fairly clear-cut in my opinion: if you ask for the tracks of a playlist, you expect all the tracks from that playlist to be returned. It's less clear whether all of an artist's top 100 songs (the API maximum) should be returned by default when making that call.

EDIT: I suppose in that case I could change it to specifically exclude relation == "top", but I didn't want to do that in case it breaks anything that is either (1) added to the API in the future or (2) not yet implemented by this library, and which also returns next URLs that perhaps shouldn't be followed by default.

deezer/client.py Outdated
Comment on lines 202 to 205
while "next" in new_json:
response = self.session.get(new_json["next"])
new_json = response.json()
json["data"].extend(new_json["data"])
browniebroke (Owner)

This is bypassing the error handling that we have above. What happens if we get an error in the middle? Could we get throttled by Deezer in this section?

hithomasmorelli (Contributor Author)

I'll fix this. What do you suggest happens if we get throttled? My justification for fixing this in the first place was that "if you ask for <playlist>'s tracks, you want all the tracks", but what should happen if we get cut off halfway through (or whenever)?

.gitignore Outdated (resolved)

hithomasmorelli commented Aug 17, 2020

> Looks good overall, I'm just wondering if this feature could trigger a large amount of hidden API calls. Did you hit any issues with that?

I did think about that, but I'm not sure if there is a better (feasible) solution. I ran into this issue because I was requesting tracks from a large playlist and they weren't all appearing, which seemed contrary to the design of the library. The Deezer API's preferred way of doing things is seemingly to follow next URLs (I couldn't find any docs on these, it's just what the API does), so it seems to me like the current best solution is to do this. Yes, it will lead to more hidden API calls, but the alternative (not providing all items in response to a request for them) doesn't seem logical imo.

Just to say: when making changes to the tests, I did originally have tests where 1000-1500 songs were requested, and I don't remember hitting any rate limits (the API does have some, but I can't find them formally documented, just this 2015 answer from someone who claims to be a Deezer developer). I changed those tests to reduce the size of the cassettes. Arguably, if the API throws a rate limit for following its own next URLs in an attempt to get a complete list, that's a fault of the API itself 🙃

To avoid duplicate code, I combined the requesting + error handling of the original API call, and, if applicable, any subsequent `next` API calls. However, the consequence of this is that the code probably makes no sense to the untrained eye. Hence, the absolute barrage of comments.
(I promise you, the tests *do* pass!)

sourcery-ai bot commented Aug 17, 2020

Sourcery Code Quality Report (beta)

❌  Merging this PR will decrease code quality in the affected files by 0.04 out of 10.

Quality metrics          Before    After     Change
Complexity               0.33      0.41      0.08 🔴
Method Length            27.24     28.46     1.22 🔴
Quality                  9.30      9.26      -0.04 🔴

Other metrics            Before    After     Change
Lines                    1035      1116      81

Changed files            Quality Before    Quality After    Quality Change
deezer/client.py         9.08              8.96             -0.12 🔴
deezer/resources.py      9.59              9.59             0.00
tests/test_client.py     9.33              9.33             0.00
tests/test_resources.py  8.94              8.85             -0.09 🔴

Here are some functions in these files that still need a tune-up:

File Function Complexity Length Overall Recommendation

Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Let us know what you think of it by mentioning @sourcery-ai in a comment.


hithomasmorelli commented Aug 17, 2020

I've removed .vscode from .gitignore, and added error handling to the next requests.

To avoid duplicate code, I merged the next requests functionality with the original request functionality. The only consequence of this is that the code is a pain to read unaided - hence the barrage of comments.

If you know a better way to code this, feel free to [suggest an] edit. I am definitely not claiming this is the best way of doing things - it's just the only one that came to mind!
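For illustration, the merged approach is roughly along these lines (a sketch only, with hypothetical names, not the exact code in the PR):

def _fetch_paginated(self, url, params):
    # Run the first request and any subsequent "next" pages through the
    # same request + error-handling path, accumulating the items as we go.
    data = []
    while url:
        response = self.session.get(url, params=params)
        json_data = response.json()
        if "error" in json_data:
            # Hypothetical helper: reuse the client's normal error handling
            # for every page, not just the first one.
            self._raise_for_error(json_data["error"])
        data.extend(json_data.get("data", []))
        url = json_data.get("next")  # absolute URL returned by the Deezer API
        params = None  # "next" URLs already embed the query parameters
    return data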

Not quite sure why the bandit lint check failed; running bandit -r deezer on my machine results in 0 issues. EDIT: turns out it was an internal error.

@browniebroke browniebroke added the enhancement (New feature or request) label Aug 18, 2020
@hithomasmorelli
Contributor Author

Just discovered while working on #134 that Resource.iter_relation does this in a more managed way (so as not to use loads of memory at once). Entirely my fault that I hadn't seen this sooner - I was (wrongly) focusing on the get_* methods and not the iter_* ones.

index = 0
while 1:
    items = self.get_relation(relation, index=index, **kwargs)
    yield from items
    if len(items) == 0:
        break
    index += len(items)

I'm still of the opinion that if you ask for the tracks in a playlist (in list form), you expect all the tracks from that playlist. But now the code changes basically render any memory usage reduction in Resource.iter_relation useless, as Client.get_object will return everything at once anyway.

Hmm. I'm a bit stumped now as to what's the best way to implement this. Any help from anyone would be greatly appreciated!

@browniebroke
Owner

I haven't got around to reviewing this again properly yet, but yes, the iter_... methods are basically doing a similar thing in a different way.

Do you think we could stick to the iter_... constructs for now? That would probably make the change fit better in the current design.

We might want to switch the get_... methods to have better automatic handling of pagination, but that seems like a bigger, potentially breaking change. One option would be to switch them to do what iter_... are doing right now, and essentially merge the two together.

I recently used PyGithub and I liked how they handle pagination, it works out of the box and transparently goes through the pages while also letting you get a specific page or the total count of elements in the result set.
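Something along these lines, perhaps (a very rough sketch of a PyGithub-style lazy paginated list, just to illustrate the idea; the class and attribute names are hypothetical, not part of deezer-python):

class PaginatedList:
    def __init__(self, client, url, params=None):
        self._client = client
        self._params = params or {}
        self._elements = []
        self._next_url = url
        self.total = None  # filled from the API's "total" field once a page is fetched

    def _fetch_next_page(self):
        response = self._client.session.get(self._next_url, params=self._params)
        payload = response.json()
        self._elements.extend(payload.get("data", []))
        self.total = payload.get("total", self.total)
        self._next_url = payload.get("next")  # absent once the last page is reached
        self._params = {}  # "next" URLs already carry their query string

    def __iter__(self):
        # Transparently walk every page, fetching lazily as the caller iterates.
        index = 0
        while True:
            while index < len(self._elements):
                yield self._elements[index]
                index += 1
            if not self._next_url:
                return
            self._fetch_next_page()

The get_... and iter_... methods could then both return something like this, and callers could still ask for the total count or a specific page when they need it.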

@hithomasmorelli
Contributor Author

hithomasmorelli commented Aug 19, 2020 via email

@browniebroke
Owner

I think it could be closed, yes. I'll write an issue to rework the pagination; the fact that you missed the iter_... APIs is a sign that it could be improved.

@browniebroke
Owner

Opened #137

Labels: enhancement (New feature or request)

Successfully merging this pull request may close: Deezer API doesn't return all items in large collections