Pre-load some spatial datasets #894
Spoke with @jorvis about using Google Filestore as a test space, since we previously discussed having to move our datasets off the VM for performance reasons. Google Buckets would also work, but Filestore would be easier to integrate with our current codebases that use filepaths. Google Filestore overview -> https://cloud.google.com/filestore/docs/overview
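For reference, here is a minimal sketch of what that integration could look like, assuming the Filestore share is NFS-mounted at something like `/mnt/filestore` (the mount point and dataset layout below are hypothetical, not decided). Since a mounted share is just a filesystem path, the existing anndata-based loading code wouldn't need to change:

```python
from pathlib import Path

import anndata as ad

# Hypothetical mount point for the Filestore NFS share; in practice this
# would be whatever path the share ends up mounted at on the VM.
FILESTORE_ROOT = Path("/mnt/filestore/datasets")


def load_spatial_dataset(dataset_id: str) -> ad.AnnData:
    """Read a spatial dataset's h5ad by plain filepath, exactly as we do
    today for datasets stored on the VM's local disk."""
    h5ad_path = FILESTORE_ROOT / f"{dataset_id}.h5ad"
    if not h5ad_path.exists():
        raise FileNotFoundError(f"No h5ad for dataset {dataset_id} at {h5ad_path}")
    return ad.read_h5ad(h5ad_path)
```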
Haven't provisioned anything yet, but here's what I'm thinking for the Filestore instance. Prices for us-east1 are actually about $50 more than I listed, because the examples cite us-central1.
So in a way, I don't think this would be terribly cost-efficient until we actually put data onto the Filestore to use. It would be a pretty inefficient use of resources to pay $260/month for me to test one spatial dataset until things were working and we could add more. More info -> https://cloud.google.com/filestore/docs/service-tiers
The other option would be to use one of Google's block-storage systems instead of their file-storage system (which I described above). I read up on the differences, and the biggest one is that with file storage the management happens on Google's side, whereas with block storage you receive the raw disk and then configure and manage the filesystem yourself on the server.

Block storage (like Hyperdisk) also seems much cheaper than the file-storage options I quoted above: 1 TB of Hyperdisk Balanced provisioned space is $90/month. I believe you mount the disk to the VM just like in the other cases, but I need to read up more to get a feel for the flow of things. They also charge extra monthly if we exceed the baseline 3,000 IOPS and 140 MBps throughput that is included. I can see us probably going over 3,000 IOPS in a month (~$31), but maybe not the 140 MBps (a rough way to sanity-check the throughput side is sketched below). https://cloud.google.com/compute/disks-image-pricing#disk

Found this flowchart that may answer some questions as well -> https://cloud.google.com/static/architecture/images/storage-advisor.svg Based on the flowchart, it seems like zonal Filestore would be the best candidate, but I wouldn't rule out zonal Persistent Disk or Hyperdisk Balanced due to potentially better costs, if the integration and flow work out. There is also this "which to choose" graphic -> https://cloud.google.com/blog/topics/developers-practitioners/map-storage-options-google-cloud
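As a rough way to sanity-check whether we would actually push past that 140 MBps baseline, here is a small sketch that times a sequential read of one of our h5ads from a mounted disk and reports effective throughput (the example path is a placeholder):

```python
import os
import time


def sequential_read_throughput(path: str, chunk_size: int = 16 * 1024 * 1024) -> float:
    """Time a full sequential read of `path` and return throughput in MB/s.

    This is a crude benchmark: it measures a single read of one file (and the
    OS page cache will inflate repeat runs), not sustained IOPS, but it gives
    a ballpark for whether reading our spatial h5ads approaches the disk's
    baseline throughput.
    """
    size_bytes = os.path.getsize(path)
    start = time.monotonic()
    with open(path, "rb") as fh:
        while fh.read(chunk_size):
            pass
    elapsed = time.monotonic() - start
    return (size_bytes / elapsed) / 1_000_000


# Example (path is hypothetical):
# print(sequential_read_throughput("/mnt/disks/spatial/datasets/example.h5ad"))
```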
Using traditional Google bucket (object) storage is also an option, and it's even cheaper (~$20/TB per month). We would have to use FUSE to mount the bucket to our VMs, though -> https://cloud.google.com/storage/docs/gcsfuse-mount I think from a strict requirements perspective, we do not NEED file-based access with respect to datasets; generally, with the exception of saved analyses, all h5ads are stored in a flat location. But performance would take a hit, so we would need to enable caching to keep reads reasonably fast.
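For the bucket option, here is a minimal sketch of what the caching side could look like without gcsfuse, using the google-cloud-storage client to pull an h5ad down to a local cache directory before reading it (the bucket name, object layout, and cache path are all hypothetical):

```python
from pathlib import Path

import anndata as ad
from google.cloud import storage

# Hypothetical bucket and local cache location.
BUCKET_NAME = "example-spatial-datasets"
CACHE_DIR = Path("/tmp/h5ad_cache")


def load_h5ad_from_bucket(dataset_id: str) -> ad.AnnData:
    """Download the dataset's h5ad from object storage on first use,
    then serve subsequent reads from the local cached copy."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local_path = CACHE_DIR / f"{dataset_id}.h5ad"
    if not local_path.exists():
        client = storage.Client()
        blob = client.bucket(BUCKET_NAME).blob(f"{dataset_id}.h5ad")
        blob.download_to_filename(str(local_path))
    return ad.read_h5ad(local_path)
```

With gcsfuse instead, the existing filepath-based code would work unchanged, but read latency would depend on how the FUSE mount's caching is configured.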
Sure, we could do this after the uploader step is created (#892), but I feel that it would be better to just pre-load some spatial datasets our own way. One reason is that we can test the other tools being developed (#890) without having the uploader as a blocker. Another reason is that we can establish ahead of time the ready-to-go format and stored file structure that the uploader's output should go into.