Valkyrie::StorageAdapter::File - insert a higher abstraction to support cloud providers? #954

dchandekstark · 2024-05-02T15:20:29Z

I am working on a storage adapter for S3 (in a somewhat different fashion from https://github.com/stkenny/valkyrie-storage-s3), and I find that Valkyrie::StorageAdapter::File feels awkward in a couple of spots. I wonder whether we could insert a higher level abstraction above it to support find_by in a more "natural" way for cloud providers.

First, #disk_path IMO extends the responsibility of V::SA::F too far, and StreamFile feels too opinionated (and appears to create but not clean up temp files?). Likewise rewind, close, and read seem bound to Ruby's IO class -- nothing wrong with that as such, but if you're dealing with a cloud resource, it's not really "on disk" and it's not really an IO either. What I might suggest for a higher level abstraction is a class that minimally implements #each to yield chunks of the file data; I could also see #stream to get an IO. #size is probably fine too, fwiw.

Here's a partial illustration, where the io used to instantiate the class is an instance of Aws::S3::Object

def each
  io.get do |chunk, _headers|
    yield chunk
  end
end

# @return [IO]
def stream
   io.get.body
end

Interested in any and all feedback. Thanks!

The text was updated successfully, but these errors were encountered:

tpendragon · 2024-05-02T17:31:33Z

@dchandekstark Do you have a reference library that handles file abstraction the way you're proposing? We've leaned on IO because it's how folks expect to handle files, and on #disk_path because it's often just so necessary to be able to get that file locally for derivative processing.

I'm interested in StreamFiles not cleaning up, that wasn't the intention if it's true. A brief look at the code does look like that's the case though - folks would have to clear out that tmp dir occasionally.

tpendragon · 2024-05-02T17:32:14Z

@dchandekstark Have you looked at valkyrie-shrine for S3 support? That's largely what we've done.

dchandekstark · 2024-05-02T17:42:31Z

@dchandekstark Have you looked at valkyrie-shrine for S3 support? That's largely what we've done.

@tpendragon I haven't. At this stage, I'm looking for the "simplest thing that could possibly work"(tm), but I'll definitely check it out. I actually contemplated trying to use ActiveStorage in a more or less similar way, but it felt over-engineered -- like I should just switch to using ActiveStorage instead of trying to integrate it into Valkyrie as another storage adapter.

tpendragon · 2024-05-02T17:43:59Z

@dchandekstark It's pretty simple. In figgy we have this:

https://github.com/pulibrary/figgy/blob/c575619bfeb42a35ed0545644b21034214f62a75/config/initializers/valkyrie.rb#L129-L145

then that adapter works just like the disk one.

dchandekstark · 2024-05-02T17:57:29Z

@dchandekstark Do you have a reference library that handles file abstraction the way you're proposing? We've leaned on IO because it's how folks expect to handle files, and on #disk_path because it's often just so necessary to be able to get that file locally for derivative processing.

I'm thinking of something along the lines of the Rack spec? -- not that I'm well versed in it -- but at least the low-level notion of an enumerable.

I get the disk_path thing, but if stored file doesn't have a path natively, then I think it's the responsibility of the party that wants a path to download the data. Where and how that needs to happen can vary, and I don't really want to embed the logic into the adapter.

tpendragon · 2024-05-02T18:45:48Z

Where and how that needs to happen can vary, and I don't really want to embed the logic into the adapter.

I get that, but the motivation is that I want to suddenly be able to decide "Actually I don't want to be on disk, I want to be in S3" and have that be as easy and painless a transition as possible.

dchandekstark · 2024-05-02T18:57:26Z

I get that, but the motivation is that I want to suddenly be able to decide "Actually I don't want to be on disk, I want to be in S3" and have that be as easy and painless a transition as possible.

That's fair. I do wish the temp file stuff in StreamFile could be addressed, but I'm not sure how without impacting disk_path in a way that undermines the purpose of it. It's sort of indeterminate how long the path is supposed to persist.

tpendragon · 2024-05-02T19:07:01Z

We could use Tempfile maybe, or disk_path could accept a block and clean up after itself when the block ends 🤔

dchandekstark · 2024-05-02T19:17:32Z

We could use Tempfile maybe, or disk_path could accept a block and clean up after itself when the block ends 🤔

Tempfile + optional block would work I think.

dchandekstark · 2024-05-02T21:25:51Z

@tpendragon One more consideration on disk_path: I want to use x-sendfile to have my web server stream files that are actually on locally mounted file systems. In that case, I don't want to inadvertently download the S3 file to disk and then send it. Is there a workaround?

cgalarza · 2024-05-02T21:46:54Z

Hello! I thought I would chime in because I recently ran into a problem where I was using disk_path and had assume that it was a temp file that would get cleaned up on its own, but it didn't and then my server ran out of space, etc etc. I think part of my issue occurred because we deploying in a containerized environment and can't depend on using a tmp directory that a server might cleanup (but I am far from an expert on this).

I ended up fixing the problem by creating a method just like @tpendragon described. It accepts a block and cleans up once the block ends. At the time spent some time reading the TempFile documentation and it seemed like it is really encouraging manually closing/removing a TempFile. Another possible option I considered was creating two methods, one to create a temp file and another one to close it.

My suggestion would be keep disk_path as is, but add a method to explicitly close the file and add an additional method to take a block.

Just my two cents.

tpendragon · 2024-05-02T21:58:06Z

@tpendragon One more consideration on disk_path: I want to use x-sendfile to have my web server stream files that are actually on locally mounted file systems. In that case, I don't want to inadvertently download the S3 file to disk and then send it. Is there a workaround?

That's interesting! We do this for disk files and it does fail for our cloud ones. I think you could either proxy through nginx or apache or whatever for these, or I've considered adding a download_url to StreamFile that no-ops - I think Shrine does something similar?

tpendragon · 2024-05-02T21:58:44Z

I ended up fixing the problem by creating a method just like @tpendragon described. It accepts a block and cleans up once the block ends. At the time spent some time reading the TempFile documentation and it seemed like it is really encouraging manually closing/removing a TempFile. Another possible option I considered was creating two methods, one to create a temp file and another one to close it.

Ahhh this is great, good work @cgalarza !

tpendragon · 2024-05-02T22:01:57Z

I think I'm gonna look at adding a block to disk_path and make StreamFile clean up using it, and make close on StreamFile ensure any local tmp file is deleted.

tpendragon · 2024-05-02T22:10:39Z

@dchandekstark @cgalarza How's #955 look?

tpendragon · 2024-05-02T22:15:19Z

@dchandekstark Years ago I converted Figgy to cloud only as an experiment and I remember doing a thing for Downloading - pulibrary/figgy@main...figgy_in_the_cloud might be useful to you?

(We've since decided to keep files local, but it was a good experiment in how flexible Valkyrie really is 😀 )

dchandekstark · 2024-05-02T23:28:24Z

@dchandekstark Years ago I converted Figgy to cloud only as an experiment and I remember doing a thing for Downloading - pulibrary/figgy@main...figgy_in_the_cloud might be useful to you?

Yes. It also illustrates (in the respond_to? tests) that path is not always (easily) available.

dchandekstark · 2024-05-02T23:33:38Z

@dchandekstark @cgalarza How's #955 look?

@tpendragon Seems like a good start! I do confess a little concern that explicit temp file naming is potentially unsafe (e.g., what happens when separate threads/processes act on the same file?).

dchandekstark · 2024-05-02T23:42:40Z

@tpendragon I feel that I should add to this thread that I appreciate what Valkyrie provides with respect to storage. It's simple, yet effective. I can see the attraction to Shrine and even ActiveStorage, but you don't always need or want that much extra functionality associated with your storage capability. With that said, I thought I would try to fit a cloud provider (in this case S3) into the Valkyrie framework more or less directly and adhere to the storage adapter contracts.

tpendragon · 2024-05-03T00:39:04Z

I'm all for more options! Looking forward to it!

dchandekstark added the question label May 2, 2024

tpendragon mentioned this issue May 2, 2024

Add cleanup functionality to #disk_path #955

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Valkyrie::StorageAdapter::File - insert a higher abstraction to support cloud providers? #954

Valkyrie::StorageAdapter::File - insert a higher abstraction to support cloud providers? #954

dchandekstark commented May 2, 2024

tpendragon commented May 2, 2024

tpendragon commented May 2, 2024

dchandekstark commented May 2, 2024

tpendragon commented May 2, 2024

dchandekstark commented May 2, 2024

tpendragon commented May 2, 2024

dchandekstark commented May 2, 2024

tpendragon commented May 2, 2024

dchandekstark commented May 2, 2024

dchandekstark commented May 2, 2024

cgalarza commented May 2, 2024

tpendragon commented May 2, 2024 •

edited

Loading

tpendragon commented May 2, 2024

tpendragon commented May 2, 2024

tpendragon commented May 2, 2024

tpendragon commented May 2, 2024 •

edited

Loading

dchandekstark commented May 2, 2024 •

edited

Loading

dchandekstark commented May 2, 2024

dchandekstark commented May 2, 2024

tpendragon commented May 3, 2024

Valkyrie::StorageAdapter::File - insert a higher abstraction to support cloud providers? #954

Valkyrie::StorageAdapter::File - insert a higher abstraction to support cloud providers? #954

Comments

dchandekstark commented May 2, 2024

tpendragon commented May 2, 2024

tpendragon commented May 2, 2024

dchandekstark commented May 2, 2024

tpendragon commented May 2, 2024

dchandekstark commented May 2, 2024

tpendragon commented May 2, 2024

dchandekstark commented May 2, 2024

tpendragon commented May 2, 2024

dchandekstark commented May 2, 2024

dchandekstark commented May 2, 2024

cgalarza commented May 2, 2024

tpendragon commented May 2, 2024 • edited Loading

tpendragon commented May 2, 2024

tpendragon commented May 2, 2024

tpendragon commented May 2, 2024

tpendragon commented May 2, 2024 • edited Loading

dchandekstark commented May 2, 2024 • edited Loading

dchandekstark commented May 2, 2024

dchandekstark commented May 2, 2024

tpendragon commented May 3, 2024

tpendragon commented May 2, 2024 •

edited

Loading

tpendragon commented May 2, 2024 •

edited

Loading

dchandekstark commented May 2, 2024 •

edited

Loading