Replies: 6 comments
-
Hi @alronlam
-
Hi @alronlam, the Ookla dataset 'indonesia-ookla-2020-q1-fixed.csv' is missing as well.
-
Oh hi @butchtm, the link to the GDrive folder for these files is in the top-most part of the notebook.
-
Additional detail: I think I also ran into this RAM issue when aligning with the raw Ookla dataset (https://registry.opendata.aws/speedtest-global-performance/) while trying to use the latest fixed-line data from Ookla.
My workaround was to use an older, filtered version of the data that covered Indonesia only (the raw data covers the whole world). So one principle here is that we should always filter the feature datasets as much as we can before aligning them to the AOIs, to avoid such issues. In the case of HRSL, though, the data is already for Indonesia alone, so I'm not sure what else we can do to make it work for such big datasets (some kind of parallel processing?). Or in these cases, are we forced to use other tools like BQ?
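The "filter before aligning" principle can be sketched with plain pandas: stream a large point CSV in chunks and keep only rows inside the AOI's bounding box before any spatial join is attempted, so the full global dataset never sits in RAM at once. The column names (`latitude`, `longitude`, `avg_d_kbps`) and the Indonesia bounding box below are illustrative assumptions, not the real Ookla schema:

```python
import io
import pandas as pd

# Stand-in for the global Ookla CSV (assumption: the real file exposes
# tile centroids with similar lat/lon columns).
raw_csv = io.StringIO(
    "latitude,longitude,avg_d_kbps\n"
    "-6.2,106.8,25000\n"   # Jakarta  -> inside Indonesia bbox
    "40.7,-74.0,90000\n"   # New York -> outside
    "-8.65,115.2,18000\n"  # Bali     -> inside
)

# Rough Indonesia bounding box (illustrative values).
MIN_LAT, MAX_LAT = -11.0, 6.1
MIN_LON, MAX_LON = 95.0, 141.0

# Stream the file in chunks so the whole CSV never loads into memory.
filtered_chunks = []
for chunk in pd.read_csv(raw_csv, chunksize=1):
    mask = (
        chunk["latitude"].between(MIN_LAT, MAX_LAT)
        & chunk["longitude"].between(MIN_LON, MAX_LON)
    )
    filtered_chunks.append(chunk[mask])

indonesia_only = pd.concat(filtered_chunks, ignore_index=True)
print(len(indonesia_only))  # only the 2 Indonesian rows survive
```

Only the filtered subset then needs to be aligned to the AOIs, which is what the older Indonesia-only Ookla file effectively did ahead of time.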
-
Hi @alronlam, I'm trying to see if I can just convert the HRSL data (a 1.8 GB CSV file) to a GeoJSON file and load it as such, but even that crashes Colab. Colab might not be ideal for working with production-sized datasets; it's better suited for learning/exploring the modules.
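One way to avoid crashing on the conversion step itself is to stream it: read the CSV in chunks and write GeoJSON features out incrementally, instead of materializing the whole GeoDataFrame in memory first. A minimal sketch, assuming the HRSL CSV has `latitude`/`longitude`/`population` columns (the real schema may differ):

```python
import io
import json

import pandas as pd

# Stand-in for the 1.8 GB HRSL CSV (column names are assumptions).
csv_data = io.StringIO(
    "latitude,longitude,population\n"
    "-6.2,106.8,1200\n"
    "-7.8,110.4,800\n"
)

# Stand-in for open("hrsl.geojson", "w"); features are appended one at a
# time, so peak memory stays at roughly one chunk regardless of file size.
out = io.StringIO()
out.write('{"type": "FeatureCollection", "features": [\n')
first = True
for chunk in pd.read_csv(csv_data, chunksize=1):
    for row in chunk.itertuples(index=False):
        feature = {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # Cast numpy scalars to plain Python types for json.dumps.
                "coordinates": [float(row.longitude), float(row.latitude)],
            },
            "properties": {"population": int(row.population)},
        }
        if not first:
            out.write(",\n")
        out.write(json.dumps(feature))
        first = False
out.write("\n]}\n")

geojson = json.loads(out.getvalue())
print(len(geojson["features"]))  # 2
```

The trade-off is that the resulting GeoJSON still has to be loaded by whatever consumes it, so this only moves the memory problem unless the downstream step can also read lazily.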
-
Low prio; converting this into a discussion.
-
Colab notebook for testing:
https://colab.research.google.com/drive/147HWUgaBztsZuBPrI_HTckBrz_vl9l1l#scrollTo=wvLenjgDUgod
Scenario
Error
![gw_vzs_hrsl](https://user-images.githubusercontent.com/1049495/177068371-6908fbc4-7061-440f-8a25-4c99e45696ec.PNG)
Colab crashes due to exceeding the RAM limit.
Just creating this issue to check if there are straightforward ways to optimize. Otherwise, are there workarounds for handling such vector datasets that are relatively large?
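One generic workaround when the full alignment blows past RAM is batching: process the AOIs in slices so only one slice's join result is in memory at a time, then concatenate the per-batch aggregates. The sketch below is schematic, with a plain `merge` standing in for the real spatial join and toy tables standing in for the AOIs and HRSL features:

```python
import pandas as pd

# Toy stand-ins: AOIs and a large feature table keyed by a shared zone id.
aois = pd.DataFrame({"zone": range(10),
                     "name": [f"aoi_{i}" for i in range(10)]})
features = pd.DataFrame({"zone": [i % 10 for i in range(100)],
                         "value": range(100)})

BATCH = 3
results = []
for start in range(0, len(aois), BATCH):
    batch = aois.iloc[start:start + BATCH]
    # Stand-in for the real AOI alignment / spatial join, run on this
    # batch only so the intermediate join never covers all AOIs at once.
    joined = batch.merge(features, on="zone", how="left")
    results.append(joined.groupby("zone", as_index=False)["value"].sum())

aligned = pd.concat(results, ignore_index=True)
print(len(aligned))  # one aggregated row per AOI zone
```

The same loop structure also parallelizes naturally (one batch per worker), which may be one answer to the "some kind of parallel processing?" question above, at the cost of re-reading or re-filtering the feature data per batch.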