
Feature/stream decode #612

Merged · 6 commits · Nov 22, 2021

Conversation

MariusWirtz
Collaborator

Continuation of #606

Pulling large data sets resulted in extremely high memory usage because response.json() builds a dict of the entire dataset, which inflates memory to roughly 10x the size of the raw JSON. Added some (rudimentary) code that decodes the JSON directly into CSV by streaming with ijson.

Based on 5e89a15
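
For illustration, here is a minimal sketch of the streaming idea (not the PR's actual code; stream_cells_to_csv is a made-up helper, and response is assumed to be a requests response opened with stream=True so that response.raw can be read incrementally):

        import csv
        import ijson

        def stream_cells_to_csv(response, csv_path):
            # ijson.parse yields (prefix, event, value) triples while reading
            # the raw byte stream, so memory stays flat regardless of payload size
            with open(csv_path, 'w', newline='') as file:
                writer = csv.writer(file)
                writer.writerow(['Ordinal', 'Value'])
                ordinal = None
                for prefix, event, value in ijson.parse(response.raw):
                    if prefix == 'Cells.item.Ordinal':
                        ordinal = value
                    elif prefix == 'Cells.item.Value':
                        writer.writerow([ordinal, value])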

@MariusWirtz
Collaborator Author

Hi @raeldor,

Excellent work! Thank you.
I have already made a few changes regarding style and formatting while I'm still trying to wrap my head around the curious ijson library.

At this moment, two test cases fail.

So we need to adjust the code to make it work for queries with only column selections and for queries that include attributes.
If you have an idea of what to adjust, feel free to open another PR into this branch. I will also look into it over the next few days whenever I find the time.

@rkvinoth
Contributor

@raeldor The idea is wonderful! I didn't know such things existed. Thank you for bringing it in.
@MariusWirtz Does this affect the performance in any way other than saving RAM?

@MariusWirtz
Collaborator Author

Hi @rkvinoth,

I would like to share my findings regarding the memory footprint and performance of the new approach.

When querying a dataset of 1.5M cells, the memory of the Python process didn't exceed 700 MB, while with the old code it went as high as 5 GB.

In terms of performance, the new approach is slightly slower.
For 1.5M cells, the new code is roughly 3% slower. For 150K cells, it is roughly 5% slower.

Based on that data, should we perhaps make the iterative JSON parsing optional?
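
A hypothetical sketch of how such a flag could surface (the method and helper names are illustrative, not the final API):

        def execute_mdx_csv(self, mdx, iterative_json_parsing=False):
            if iterative_json_parsing:
                # stream-decode with ijson: flat memory footprint, at the
                # cost of the roughly 3-5% slowdown measured above
                return self._build_csv_by_streaming(mdx)
            # default path: response.json() into a dict, then build the CSV
            return self._build_csv_from_dict(mdx)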

@rkvinoth
Contributor

@MariusWirtz Thank you for the performance test.

I agree that the new parsing method should be made optional. Not all systems are memory-constrained, and most of us parallelize the query.


@rclapp
Collaborator

rclapp commented Nov 4, 2021

Do we have a path to merge this branch eventually?

@MariusWirtz
Collaborator Author

Do we have a path to merge this branch eventually?

I need to fix the code so that it doesn't break the two tests. Then we merge. I plan to get to it next week at the earliest. If someone wants to go ahead and fix it, feel free to create a Pull Request based on the latest commit.

raeldor and others added 3 commits November 12, 2021 14:00
…N to CSV by streaming rather than using dict to reduce memory usage on large data sets.
MariusWirtz force-pushed the feature/stream-decode branch 2 times, most recently from d445ef6 to e41e372, on November 12, 2021 19:59
MariusWirtz force-pushed the feature/stream-decode branch from e41e372 to 9276ed4 on November 12, 2021 20:03
MariusWirtz force-pushed the feature/stream-decode branch from 3f7c12c to a4a321b on November 19, 2021 17:53
@MariusWirtz
Collaborator Author

Unfortunately, I haven't been able to make include_attributes and iterative_json_parsing work together.
Instead, the extract_cellset_dataframe function now errors out if both options are used together:

        if iterative_json_parsing and include_attributes:
            raise ValueError("Iterative JSON parsing must not be used together with include_attributes")
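
A call that combines both flags therefore fails fast (hypothetical invocation; the remaining parameters of extract_cellset_dataframe are omitted here):

        try:
            extract_cellset_dataframe(cellset_id, iterative_json_parsing=True, include_attributes=True)
        except ValueError as error:
            print(error)  # Iterative JSON parsing must not be used together with include_attributes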

@MariusWirtz
Collaborator Author

@raeldor
If you still follow this thread: as the original author, please approve this merge request, and I will then merge the changes.

@rclapp
Collaborator

rclapp commented Nov 22, 2021

Unfortunately, I haven't been able to make include_attributes and iterative_json_parsing work together. Instead, the extract_cellset_dataframe function now errors out if both options are used together:

        if iterative_json_parsing and include_attributes:
            raise ValueError("Iterative JSON parsing must not be used together with include_attributes")

Could you provide some more detail on what the trouble seems to be?

@MariusWirtz
Collaborator Author

Thanks for asking. Perhaps we will figure this out together if we chat about it.

For the ijson library, you need to define "prefixes of interest". These prefixes then trigger events as you loop through the JSON iteratively.
Currently we look for these:

        prefixes_of_interest = ['Cells.item.Value', 'Axes.item.Tuples.item.Members.item.Name',
                                'Cells.item.Ordinal', 'Axes.item.Tuples.item.Ordinal', 'Cube.Dimensions.item.Name',
                                'Axes.item.Ordinal']
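
To see how these prefixes map onto events, here is a tiny self-contained demonstration against a made-up cellset fragment:

        import io
        import ijson

        sample = b'{"Cells": [{"Ordinal": 0, "Value": 42.0}]}'
        for prefix, event, value in ijson.parse(io.BytesIO(sample)):
            print(prefix, event, value)
        # among the printed events:
        #   Cells.item.Ordinal number 0
        #   Cells.item.Value number 42.0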

To catch the attributes, we would need to add prefixes for (potential) attributes, such as:

        prefixes_of_interest.append('Axes.item.Tuples.item.Members.item.Attributes.Color')
        prefixes_of_interest.append('Axes.item.Tuples.item.Members.item.Attributes.Size')
        prefixes_of_interest.append('Axes.item.Tuples.item.Members.item.Attributes.Manager')

and catch the events, kinda like this:

            elif (prefix, event) == ('Axes.item.Tuples.item.Members.item.Attributes.Color', 'string'):
                attribute_name = "Color"
                attribute_value = value  # e.g., 'red', 'green'

I managed to find the attribute-name and attribute-value pairs in the JSON, but I failed to consume them and integrate them into the CSV, not to mention getting the CSV header line right.

I will push my current state into a WIP PR. Feedback and contributions are very welcome. I would like to get this to work; it just seemed a hard nut to crack, so I decided to move on for the moment.
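
For what it's worth, one conceivable way to consume arbitrary name/value pairs without enumerating every attribute up front (a rough sketch, not the PR's code; record_attribute and response_stream are placeholders): react to the 'map_key' event on the Attributes map itself, then pair it with the scalar event that follows.

        import ijson

        attributes_prefix = 'Axes.item.Tuples.item.Members.item.Attributes'
        pending_name = None

        for prefix, event, value in ijson.parse(response_stream):
            if (prefix, event) == (attributes_prefix, 'map_key'):
                pending_name = value                   # e.g. 'Color'
            elif pending_name and prefix == attributes_prefix + '.' + pending_name:
                record_attribute(pending_name, value)  # e.g. ('Color', 'red')
                pending_name = None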
