tar_terra_rast and terra objects won't work with cloud storage #112

Open
Aariq opened this issue Nov 1, 2024 · 7 comments

Comments

@Aariq
Collaborator

Aariq commented Nov 1, 2024

There may be a problem with using cloud storage. When a SpatRaster is stored with repository = "aws", for example, and a user tries to load it with tar_read() or tar_load(), I think the file is downloaded from AWS into tempdir()/_targets/scratch, read in using the read function stored in format, and then deleted from the scratch directory once the target is loaded into memory. As a result, the SpatRaster object makes it into memory, but the file it points to is gone.

I've confirmed this behavior with an S3 bucket hosted on Jetstream2, and I can share the credentials for it privately if you'd like to test and confirm this.

I can't find an argument or option in targets that overrides this behavior, although I would have expected memory = "persistent" to maybe do something here.

library(targets)
tar_dir({
  tar_script({
    library(targets)
    library(geotargets)
    tar_option_set(
      repository = "aws",
      resources = tar_resources(
        aws = tar_resources_aws(
          bucket = "test123456",
          prefix = "targets_test",
          endpoint = "https://js2.jetstream-cloud.org:8001"
        )
      )
    )
    list(
      tar_target(
        file,
        system.file("ex/elev.tif", package = "terra"),
        format = "file", 
        repository = "local"
      ),
      tar_terra_rast(
        rast_example,
        terra::rast(file)
      )
    )
  })
  tar_make()
  tar_load(rast_example)
  # Path of the file backing the SpatRaster, then whether that file still exists
  terra::sources(rast_example)
  fs::file_exists(terra::sources(rast_example))
})
#> ▶ dispatched target file
#> ● completed target file [0.003 seconds, 7.994 kilobytes]
#> ▶ dispatched target rast_example
#> ● completed target rast_example [0.008 seconds, 8.523 kilobytes]
#> ▶ ended pipeline [1.063 seconds]
#> /private/var/folders/wr/by_lst2d2fngf67mknmgf4340000gn/T/RtmpVmmt9h/_targets/scratch/rast_example1674d43dc896a 
#> FALSE
Aariq changed the title from "geotargets and cloud storage" to "tar_terra_rast and terra objects won't work with cloud storage" on Nov 1, 2024
@Aariq
Collaborator Author

Aariq commented Nov 1, 2024

This might not be something we can fix, but we should document it as a limitation and possibly open a discussion in the targets repo.

@mdsumner

mdsumner commented Nov 4, 2024

Is the rast() call made with the prefix '/vsicurl/' or '/vsis3/'? If it isn't, GDAL will download the whole file behind the scenes, which sounds like what you're seeing. (I can't run your example above: it fails for me and I don't understand how to unpack it.)

@Aariq
Collaborator Author

Aariq commented Nov 4, 2024

Sorry, this was a bit of a "note-to-self" left on a Friday. You'll need some credentials (which I could share privately) for that example to work.

The rast() call is not being made with /vsicurl/ or /vsis3/. As I understand it, targets handles cloud storage like this: when you run tar_read()/tar_load(), or when a downstream target takes a cloud-stored target as a dependency, the target file is downloaded from the cloud into a scratch directory, read in using the read function stored in tar_format() (in this case, just rast(path)), and then apparently deleted from the scratch directory.

I don't know if it's possible to customize this behavior, but you're right that in this particular case it would make sense to be able to use /vsis3/. I think there is likely something useful in this discussion about the new content addressable storage options in targets, in particular this comment (ropensci/targets#1232 (reply in thread)) and tar_repository_cas() (https://github.com/ropensci/targets/blob/main/R/tar_repository_cas.R)
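The sequence described above can be reproduced locally without any cloud credentials. This is only a minimal sketch: it assumes terra is installed, and it uses a temporary copy of terra's bundled elev.tif plus unlink() as stand-ins for the scratch download and cleanup that targets is thought to perform. It does not call targets at all.

```r
library(terra)

# Stand-in for the file that targets downloads into its scratch directory.
scratch <- tempfile(fileext = ".tif")
file.copy(system.file("ex/elev.tif", package = "terra"), scratch)

# Same read function as described above: the SpatRaster only points at the file.
r <- terra::rast(scratch)

# Stand-in for the scratch file being deleted after the target is loaded.
unlink(scratch)

terra::inMemory(r)              # whether the cell values are held in memory
file.exists(terra::sources(r))  # whether the backing file still exists
```

If both calls report FALSE, the object is in the same state as rast_example in the original report: a valid-looking SpatRaster whose data source no longer exists.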

@Aariq
Collaborator Author

Aariq commented Nov 4, 2024

This might just be a bug in targets. It seems reasonable for the scratch/ directory to persist for at least the duration of the interactive R session.

@Aariq
Collaborator Author

Aariq commented Nov 4, 2024

Opened discussion here: ropensci/targets#1367

@Aariq
Collaborator Author

Aariq commented Nov 5, 2024

Changing the read function for tar_terra_rast() from function(path) terra::rast(path) to function(path) terra::rast(path) + 0 forces the SpatRaster to be held in memory instead of pointing at a file, and the targets then work with cloud storage. I'm not sure what the downsides are, other than that in-memory SpatRasters appear to lose info about file blocksize (#99) and that they obviously have to fit in memory.
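A small local check of the + 0 trick, as a sketch only. It assumes terra is installed; read_in_memory() is a hypothetical name for illustration, not the actual geotargets read function, and the temporary file again stands in for the targets scratch download. For rasters too large for memory, terra may write the result of the arithmetic to a temporary file of its own, so this only illustrates the small-raster case.

```r
library(terra)

# Hypothetical replacement read function: the arithmetic forces terra to
# read the cell values instead of keeping a pointer to the file.
read_in_memory <- function(path) {
  terra::rast(path) + 0
}

# Stand-in for the file that targets downloads into its scratch directory.
scratch <- tempfile(fileext = ".tif")
file.copy(system.file("ex/elev.tif", package = "terra"), scratch)

r <- read_in_memory(scratch)
unlink(scratch)  # stand-in for the scratch file being deleted

terra::inMemory(r)                      # whether the values are held in memory
terra::sources(r)                       # source file, empty for in-memory data
terra::global(r, "mean", na.rm = TRUE)  # values can still be read
```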

@Aariq
Collaborator Author

Aariq commented Nov 20, 2024

Before adding some kind of "force_memory" option, the workaround I'd like to try to document as a vignette is to use the relatively new content addressable storage functionality in targets with a custom download function that doesn't actually download the raster, but instead returns a /vsis3/bucket/key style path that terra::rast() can read.
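A rough sketch of what building such a path could look like. vsis3_path() is a hypothetical helper, and the key layout (prefix/objects/target name) is an assumption about where targets puts the object in the bucket. The commented-out lines are untested here: they rely on GDAL's AWS_S3_ENDPOINT and AWS_VIRTUAL_HOSTING configuration options for a non-AWS endpoint and on valid credentials being available in the environment. How to wire this into tar_repository_cas() is left open.

```r
# Hypothetical helper: build a GDAL /vsis3/ path from a bucket and an object key.
vsis3_path <- function(bucket, key) {
  paste0("/vsis3/", bucket, "/", key)
}

# Bucket and prefix are taken from the example above; the rest of the key
# is an assumed layout.
path <- vsis3_path("test123456", "targets_test/objects/rast_example")
print(path)

# With credentials set (e.g. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), GDAL
# could then read the object in place instead of from a scratch copy:
# Sys.setenv(
#   AWS_S3_ENDPOINT = "js2.jetstream-cloud.org:8001",
#   AWS_VIRTUAL_HOSTING = "FALSE"
# )
# r <- terra::rast(path)
```

Because the SpatRaster would then point at the bucket object and not at a scratch file, nothing would need to be held in memory or kept on disk locally.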
