
Feature/stream decode #612

Merged · 6 commits · Nov 22, 2021

Conversation

MariusWirtz
Collaborator

Continuation of #606

Pulling large data sets resulted in extremely high memory usage because response.json() builds a dict of the entire dataset, which inflates memory to roughly 10x the size of the raw JSON. Added some (rudimentary) code that decodes the JSON directly into CSV by streaming with ijson.

Based on 5e89a15
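
For illustration, here is a minimal sketch of the streaming idea (not the PR's actual code; stream_cells_to_csv is a made-up helper, and response is assumed to be a requests response opened with stream=True so that response.raw can be read incrementally):

        import csv
        import ijson

        def stream_cells_to_csv(response, csv_path):
            # ijson.parse yields (prefix, event, value) triples while reading
            # the raw byte stream, so memory stays flat regardless of payload size
            with open(csv_path, 'w', newline='') as file:
                writer = csv.writer(file)
                writer.writerow(['Ordinal', 'Value'])
                ordinal = None
                for prefix, event, value in ijson.parse(response.raw):
                    if prefix == 'Cells.item.Ordinal':
                        ordinal = value
                    elif prefix == 'Cells.item.Value':
                        writer.writerow([ordinal, value])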

@MariusWirtz
Collaborator Author

Hi @raeldor,

Excellent work! Thank you.
I have already made a few changes regarding style and formatting while I'm still trying to wrap my head around the curious ijson library.

At this moment, two test cases fail.

So we need to adjust the code to make it work for queries with only column selections and for queries that include attributes.
If you have an idea of what to adjust, feel free to open another PR into this branch. I will also look into it over the next few days whenever I find the time.

@rkvinoth
Contributor

@raeldor The idea is wonderful! I didn't know such things existed. Thank you for bringing it in.
@MariusWirtz Does this affect the performance in any way other than saving RAM?

@MariusWirtz
Collaborator Author

Hi @rkvinoth,

I would like to share my findings regarding the memory footprint and performance of the new approach.

When querying a dataset of 1.5M cells, the memory of the Python process didn't exceed 700 MB, while with the old code it went as high as 5 GB.

In terms of performance, the new approach is slightly slower.
For 1.5M cells, the new code is roughly 3% slower. For 150K cells, it is roughly 5% slower.

Based on that data, should we perhaps make the iterative JSON parsing optional?
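
A hypothetical sketch of how such a flag could surface (the method and helper names are illustrative, not the final API):

        def execute_mdx_csv(self, mdx, iterative_json_parsing=False):
            if iterative_json_parsing:
                # stream-decode with ijson: flat memory footprint, at the
                # cost of the roughly 3-5% slowdown measured above
                return self._build_csv_by_streaming(mdx)
            # default path: response.json() into a dict, then build the CSV
            return self._build_csv_from_dict(mdx)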

@rkvinoth
Contributor

@MariusWirtz Thank you for the performance test.

I agree that the new parsing method should be made optional. Not all systems are memory-constrained, and most of us parallelize the query.


@rclapp
Collaborator

rclapp commented Nov 4, 2021

Do we have a path to merge this branch eventually?

@MariusWirtz
Collaborator Author

Do we have a path to merge this branch eventually?

I need to fix the code so that it doesn't break the two tests. Then we merge. I plan to get to it next week at the earliest. If someone wants to go ahead and fix it, feel free to create a Pull Request based on the latest commit.

raeldor and others added 3 commits November 12, 2021 14:00
…N to CSV by streaming rather than using dict to reduce memory usage on large data sets.
MariusWirtz force-pushed the feature/stream-decode branch 2 times, most recently from d445ef6 to e41e372, on November 12, 2021 19:59
MariusWirtz force-pushed the feature/stream-decode branch from e41e372 to 9276ed4 on November 12, 2021 20:03
MariusWirtz force-pushed the feature/stream-decode branch from 3f7c12c to a4a321b on November 19, 2021 17:53
@MariusWirtz
Collaborator Author

Unfortunately, I haven't been able to make include_attributes and iterative_json_parsing work together.
Instead, the extract_cellset_dataframe function now errors out if both options are used together:

        if iterative_json_parsing and include_attributes:
            raise ValueError("Iterative JSON parsing must not be used together with include_attributes")
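
A call that combines both flags therefore fails fast (hypothetical invocation; the remaining parameters of extract_cellset_dataframe are omitted here):

        try:
            extract_cellset_dataframe(cellset_id, iterative_json_parsing=True, include_attributes=True)
        except ValueError as error:
            print(error)  # Iterative JSON parsing must not be used together with include_attributes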

@MariusWirtz
Collaborator Author

@raeldor
If you still follow this thread: as the original author, please approve this merge request, and I will then merge the changes.

@rclapp
Collaborator

rclapp commented Nov 22, 2021

Unfortunately, I haven't been able to make include_attributes and iterative_json_parsing work together. Instead, the extract_cellset_dataframe function now errors out if both options are used together:

        if iterative_json_parsing and include_attributes:
            raise ValueError("Iterative JSON parsing must not be used together with include_attributes")

Could you provide some more detail on what the trouble seems to be?

@MariusWirtz
Collaborator Author

Thanks for asking. Perhaps we will figure this out together if we chat about it.

For the ijson library, you need to define "prefixes of interest". These prefixes then trigger events as you loop through the JSON iteratively.
Currently we look for these:

        prefixes_of_interest = ['Cells.item.Value', 'Axes.item.Tuples.item.Members.item.Name',
                                'Cells.item.Ordinal', 'Axes.item.Tuples.item.Ordinal', 'Cube.Dimensions.item.Name',
                                'Axes.item.Ordinal']
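
To see how these prefixes map onto events, here is a tiny self-contained demonstration against a made-up cellset fragment:

        import io
        import ijson

        sample = b'{"Cells": [{"Ordinal": 0, "Value": 42.0}]}'
        for prefix, event, value in ijson.parse(io.BytesIO(sample)):
            print(prefix, event, value)
        # among the printed events:
        #   Cells.item.Ordinal number 0
        #   Cells.item.Value number 42.0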

To catch the attributes, we would need to add prefixes for (potential) attributes, such as:

        prefixes_of_interest.append('Axes.item.Tuples.item.Members.item.Attributes.Color')
        prefixes_of_interest.append('Axes.item.Tuples.item.Members.item.Attributes.Size')
        prefixes_of_interest.append('Axes.item.Tuples.item.Members.item.Attributes.Manager')

and catch the events, kinda like this:

            elif (prefix, event) == ('Axes.item.Tuples.item.Members.item.Attributes.Color', 'string'):
                attribute_name = "Color"
                attribute_value = value  # e.g., 'red', 'green'

I managed to find the attribute-name and attribute-value pairs in the JSON, but I failed to consume them and integrate them into the CSV, not to mention getting the CSV header line right.

I will push my current state into a WIP PR. Feedback and contributions are very welcome. I would like to get this to work; it just seemed a hard nut to crack, so I decided to move on for the moment.
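
For what it's worth, one conceivable way to consume arbitrary name/value pairs without enumerating every attribute up front (a rough sketch, not the PR's code; record_attribute and response_stream are placeholders): react to the 'map_key' event on the Attributes map itself, then pair it with the scalar event that follows.

        import ijson

        attributes_prefix = 'Axes.item.Tuples.item.Members.item.Attributes'
        pending_name = None

        for prefix, event, value in ijson.parse(response_stream):
            if (prefix, event) == (attributes_prefix, 'map_key'):
                pending_name = value                   # e.g. 'Color'
            elif pending_name and prefix == attributes_prefix + '.' + pending_name:
                record_attribute(pending_name, value)  # e.g. ('Color', 'red')
                pending_name = None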
