
Fix large list truncation #127

Conversation


@hithomasmorelli hithomasmorelli commented Aug 15, 2020

Fixes #126

  • Include tests for bug fix and new functionality
  • Updated documentation for new feature (n/a)

@hithomasmorelli hithomasmorelli changed the title from "Fix track list truncation" to "Fix large list truncation" Aug 15, 2020

hithomasmorelli commented Aug 16, 2020

@browniebroke I've fixed the bug and updated the tests (to check for item truncation where applicable), but I'm having trouble with the cassettes: tox -e lint,docs,py38 is failing on pytest because new requests are being made (to test for truncation), and I'm getting multiple vcr.errors.CannotOverwriteExistingCassetteException errors.

How should I solve this? Apologies for my lack of knowledge; to say I'm rusty with vcrpy would be an understatement, as I've never used it before!

@hithomasmorelli
Contributor Author

OK, so I think I've fixed the issue of not being able to overwrite existing cassettes (by first running pytest --vcr-record new_episodes), but I've got a different sort of issue now:

I added an extra test to TestPlaylist.test_get_fans to check that the list of fans isn't being truncated (other than at 100, where the API fans endpoint has a firm limit). However, that generated a cassette file containing the IDs and usernames of 100 fans of a playlist I picked at random.

I feel a bit conflicted about having this username data in this repository: the Deezer website doesn't provide any way to view this data; it seems to be available only via the API. Additionally, while Deezer editors might reasonably expect their track/album/artist/playlist details to be extracted from the API (given their roles), the 100 fans that the API spits out haven't had an opportunity to consent to their IDs & usernames being stored in a random repository.

Because of this, I propose that, before I commit & push the changes, I manually edit the TestPlaylist.test_get_fans cassette file to blank out the IDs & usernames of the new users being retrieved. This shouldn't affect the truncation test as it's just checking for the length of the list.
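As an alternative to hand-editing, vcrpy's before_record_response hook could probably do this scrubbing automatically at record time. A rough, untested sketch, assuming the fans payload is the usual Deezer {"data": [...]} shape:

import json

import vcr

def scrub_fans(response):
    # Blank out user IDs and names before the response is written to a cassette.
    # Sketch of an automatic alternative to manually editing the cassette file.
    try:
        payload = json.loads(response["body"]["string"])
    except (ValueError, TypeError):
        return response  # not JSON, leave untouched
    for item in payload.get("data", []):
        if item.get("type") == "user":
            item["id"] = 0
            item["name"] = "anonymised"
    response["body"]["string"] = json.dumps(payload).encode()
    return response

my_vcr = vcr.VCR(before_record_response=scrub_fans)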
@browniebroke is this OK with you?

@browniebroke
Owner

Hey, thanks for taking some time on this. Yes, sounds reasonable to anonymise the data before pushing it. 👍

@hithomasmorelli
Contributor Author

Brilliant, will push now

Only follows "next" URLs:
 - when they are present in the API response,
 - when the query is for "tracks", "fans", "albums", "artists" or "playlists", and
 - when a limit has not been specified in kwargs (if a limit is specified, Deezer seems to honour it, even when it is over their default limit); a rough sketch of this check follows below.
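In rough terms, the guard described by those bullet points might look like this (a sketch of the described behaviour, not the exact diff; the names relation, kwargs and json follow the surrounding discussion):

PAGINATED_RELATIONS = {"tracks", "fans", "albums", "artists", "playlists"}

def should_follow_next(relation, kwargs, json):
    # Follow "next" URLs only when the API offers a next page, the relation
    # is one of the list-like relations above, and the caller hasn't capped
    # the result with an explicit limit.
    return (
        "next" in json
        and relation in PAGINATED_RELATIONS
        and "limit" not in kwargs
    )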
@hithomasmorelli hithomasmorelli marked this pull request as ready for review August 17, 2020 09:53

@browniebroke browniebroke left a comment


Looks good overall, I'm just wondering if this feature could trigger a large amount of hidden API calls. Did you hit any issues with that?

deezer/client.py Outdated
Comment on lines 192 to 196
relation == "tracks"
or relation == "fans"
or relation == "albums"
or relation == "artists"
or relation == "playlists"
browniebroke (Owner)

Why add these restrictions on specific relations? Could we do it whenever there is a next in the response, regardless of the relation being requested?

@hithomasmorelli hithomasmorelli Aug 17, 2020

I remember testing it without that and something going horribly wrong, but that was before the `and "limit" not in kwargs` check got inserted. I'll do some experimenting and see if I can reproduce the same issue; otherwise I'll take them out.

@hithomasmorelli hithomasmorelli Aug 17, 2020

Still can't remember exactly why I left that in, but I think it had something to do with endpoints such as (and possibly only) artist/<artist>/top. I suppose I thought it was a different question whether artists' top songs should be listed beyond the API's default limit (without specifying a higher limit).

The benefit of following next URLs when requesting (e.g.) playlist tracks is fairly clear-cut in my opinion: if you ask for the tracks of a playlist, you expect all the tracks from that playlist to be returned. It's less clear whether all of an artist's top 100 songs (the API maximum) should be returned by default when making that call.

EDIT: I suppose in that case I could change it to specifically exclude relation == "top", but I didn't want to do that in case it breaks anything that is either (1) added to the API in the future or (2) not yet implemented by this library, and which also returns next URLs that perhaps shouldn't be followed by default.

deezer/client.py Outdated
Comment on lines 202 to 205
while "next" in new_json:
response = self.session.get(new_json["next"])
new_json = response.json()
json["data"].extend(new_json["data"])
browniebroke (Owner)

This is bypassing the error handling that we have above. What happens if we get an error in the middle? Could we get throttled by Deezer in this section?

hithomasmorelli (Contributor Author)

I'll fix this. What do you suggest happens if we get throttled? My justification for fixing this in the first place was that "if you ask for <playlist>'s tracks, you want all the tracks", but what should happen if we get cut off halfway through (or whenever)?

.gitignore Outdated (resolved)

hithomasmorelli commented Aug 17, 2020

> Looks good overall, I'm just wondering if this feature could trigger a large amount of hidden API calls. Did you hit any issues with that?

I did think about that, but I'm not sure if there is a better (feasible) solution. I ran into this issue because I was requesting tracks from a large playlist and they weren't all appearing, which seemed contrary to the design of the library. The Deezer API's preferred way of doing things is seemingly to follow next URLs (I couldn't find any docs on these, it's just what the API does), so it seems to me like the current best solution is to do this. Yes, it will lead to more hidden API calls, but the alternative (not providing all items in response to a request for them) doesn't seem logical imo.

Just to say: when making changes to the tests, I did originally have tests where 1000-1500 songs were requested, and I don't remember hitting any rate limits (the API does have some, but I can't find them formally documented, just this 2015 answer from someone who claims to be a Deezer developer). I changed those tests to reduce the size of the cassettes. Arguably, if the API throws a rate limit for following its own next URLs in an attempt to get a complete list, that's a fault of the API itself 🙃

To avoid duplicate code, I combined the requesting + error handling of the original API call, and, if applicable, any subsequent `next` API calls. However, the consequence of this is that the code probably makes no sense to the untrained eye. Hence, the absolute barrage of comments.
(I promise you, the tests *do* pass!)

sourcery-ai bot commented Aug 17, 2020

Sourcery Code Quality Report (beta)

❌  Merging this PR will decrease code quality in the affected files by 0.04 out of 10.

Quality metrics          Before    After     Change
Complexity               0.33      0.41      0.08 🔴
Method Length            27.24     28.46     1.22 🔴
Quality                  9.30      9.26      -0.04 🔴

Other metrics            Before    After     Change
Lines                    1035      1116      81

Changed files            Quality Before    Quality After    Quality Change
deezer/client.py         9.08              8.96             -0.12 🔴
deezer/resources.py      9.59              9.59             0.00
tests/test_client.py     9.33              9.33             0.00
tests/test_resources.py  8.94              8.85             -0.09 🔴

Here are some functions in these files that still need a tune-up:

File Function Complexity Length Overall Recommendation

Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Let us know what you think of it by mentioning @sourcery-ai in a comment.


hithomasmorelli commented Aug 17, 2020

I've removed .vscode from .gitignore, and added error handling to the next requests.

To avoid duplicate code, I merged the next requests functionality with the original request functionality. The only consequence of this is that the code is a pain to read unaided - hence the barrage of comments.

If you know a better way to code this, feel free to [suggest an] edit. I am definitely not claiming this is the best way of doing things - it's just the only one that came to mind!
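For illustration, the merged approach is roughly along these lines (a sketch only, with hypothetical names, not the exact code in the PR):

def _fetch_paginated(self, url, params):
    # Run the first request and any subsequent "next" pages through the
    # same request + error-handling path, accumulating the items as we go.
    data = []
    while url:
        response = self.session.get(url, params=params)
        json_data = response.json()
        if "error" in json_data:
            # Hypothetical helper: reuse the client's normal error handling
            # for every page, not just the first one.
            self._raise_for_error(json_data["error"])
        data.extend(json_data.get("data", []))
        url = json_data.get("next")  # absolute URL returned by the Deezer API
        params = None  # "next" URLs already embed the query parameters
    return data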

Not quite sure why the bandit lint check failed; running bandit -r deezer on my machine results in 0 issues. EDIT: turns out it was an internal error.

@browniebroke browniebroke added the enhancement (New feature or request) label Aug 18, 2020
@hithomasmorelli
Contributor Author

Just discovered while working on #134 that Resource.iter_relation does this in a more managed way (so as not to use loads of memory at once). Entirely my fault that I hadn't seen this sooner - I was (wrongly) focusing on the get_* methods and not the iter_* ones.

index = 0
while 1:
    items = self.get_relation(relation, index=index, **kwargs)
    yield from items
    if len(items) == 0:
        break
    index += len(items)

I'm still of the opinion that if you ask for the tracks in a playlist (in list form), you expect all the tracks from that playlist. But now the code changes basically render any memory usage reduction in Resource.iter_relation useless, as Client.get_object will return everything at once anyway.

Hmm. I'm a bit stumped now as to what's the best way to implement this. Any help from anyone would be greatly appreciated!

@browniebroke
Owner

I haven't got around to reviewing this again properly yet, but yes, the iter_... methods are basically doing a similar thing in a different way.

Do you think we could stick to the iter_... constructs for now? That would probably make the change fit better in the current design.

We might want to switch the get_... methods to have better automatic handling of pagination, but that seems like a bigger, potentially breaking change. One option would be to switch them to do what iter_... are doing right now, and essentially merge the two together.

I recently used PyGithub and I liked how they handle pagination, it works out of the box and transparently goes through the pages while also letting you get a specific page or the total count of elements in the result set.
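Something along these lines, perhaps (a very rough sketch of a PyGithub-style lazy paginated list, just to illustrate the idea; the class and attribute names are hypothetical, not part of deezer-python):

class PaginatedList:
    def __init__(self, client, url, params=None):
        self._client = client
        self._params = params or {}
        self._elements = []
        self._next_url = url
        self.total = None  # filled from the API's "total" field once a page is fetched

    def _fetch_next_page(self):
        response = self._client.session.get(self._next_url, params=self._params)
        payload = response.json()
        self._elements.extend(payload.get("data", []))
        self.total = payload.get("total", self.total)
        self._next_url = payload.get("next")  # absent once the last page is reached
        self._params = {}  # "next" URLs already carry their query string

    def __iter__(self):
        # Transparently walk every page, fetching lazily as the caller iterates.
        index = 0
        while True:
            while index < len(self._elements):
                yield self._elements[index]
                index += 1
            if not self._next_url:
                return
            self._fetch_next_page()

The get_... and iter_... methods could then both return something like this, and callers could still ask for the total count or a specific page when they need it.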

@hithomasmorelli
Contributor Author

hithomasmorelli commented Aug 19, 2020 via email

@browniebroke
Owner

I think it could be closed, yes. I'll write an issue to rework the pagination; the fact that you missed the iter_... APIs is a sign that it could be improved.

@browniebroke
Owner

Opened #137

Labels: enhancement (New feature or request)

Successfully merging this pull request may close: Deezer API doesn't return all items in large collections