File size or timestamp of blob? #65

Open
iamwilhelm opened this issue Aug 21, 2013 · 13 comments

@iamwilhelm (Contributor)

Yesterday, while walking my repo with gitteh and listing out the directories, I noticed that neither the entries nor the blobs contain the file size or the timestamp of the blob. I don't remember whether they were present in the pre-0.17 versions of gitteh.

My question is: is the omission a matter of "we haven't gotten around to it yet", or "it's hard because of reasons X, Y, and Z"?

@samcday (Contributor) commented Aug 25, 2013

@iamwilhelm When I first read this I assumed I had just missed it in the rewrite; then I realized it's (kinda) already there.

For starters, getting a blob back actually gives you the data, so getting the file size of a blob is as simple as taking the .length of that data.
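Roughly like this (a sketch from memory; the getBlob call and its callback shape are assumptions for illustration, so check the actual API):

# Hypothetical getBlob call: load the blob a tree entry points at
# and take the length of its data buffer.
repo.getBlob entry.id, (err, blob) ->
  throw err if err
  console.log "#{entry.name}: #{blob.data.length} bytes"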

Timestamp is a little trickier: there's no notion of a timestamp in tree or blob objects. You'd get the timestamp by finding the newest commit that modified the given file and using its commit date. That's a great example of the kind of higher-level functionality I'd love to see gitteh start offering.
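In the meantime, one way to approximate that timestamp without any new gitteh API is to shell out to git itself. A sketch (the git log flags are standard git; repoPath and filePath are placeholders):

{exec} = require 'child_process'

# Timestamp of the newest commit touching filePath; %ct prints the
# committer date as a unix epoch.
blobTimestamp = (repoPath, filePath, cb) ->
  exec "git log -1 --format=%ct -- #{filePath}", cwd: repoPath, (err, stdout) ->
    return cb err if err
    cb null, new Date(1000 * parseInt(stdout.trim(), 10))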

@iamwilhelm (Contributor Author)

Right now I'm able to get the file size with a hack: when I get the tree entries, I load the blob for each entry in succession and take the length of the blob data. However, I don't know how memory-efficient that is, since I'd be loading the full blob data for every entry. Even though I only keep the blob data in one scope, so it gets garbage-collected later, it doesn't seem terribly efficient, and I wonder whether this would be a problem for larger repos, or repos with large files.
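The hack looks roughly like this (getBlob stands in for whatever gitteh call returns the blob data; I'm using the async library to load one blob at a time):

async = require 'async'

sizes = {}
# Load blobs one at a time so only a single data buffer is in scope at once.
async.eachSeries tree.entries, (entry, done) ->
  repo.getBlob entry.id, (err, blob) ->   # assumed API, for illustration
    return done err if err
    sizes[entry.name] = blob.data.length
    done()
, (err) ->
  console.log sizes unless err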

For the timestamp, because it's tricky, I had hoped gitteh would take care of it. But just as with file size, I'm wondering whether the timestamp and blob size should even be included in the entry. What do you think?

If you (and others) think it's a good idea, then I'd be happy to try writing the code to make that happen (with guidance, of course).

@samcday (Contributor) commented Aug 25, 2013

@iamwilhelm It'd be great if you could dig further and look at whether libgit2 itself supports what you're after. I'm a little rusty on Git internals, having not dug into them for a while, but I'm fairly certain there's no such metadata stored for Git objects at all.

For instance, a particular Blob entry that represents your README.md at some particular point in time (attached to a tree, attached to a commit) could either be a "loose" object in .git/objects or it could be in a packfile. Either way, there's no higher level metadata stored elsewhere that says "hey this README.md file is X bytes large". There is some kind of object size field in the Packfile index, but that's for the git object itself, so it's counting the size of the object header.

So what I'm trying to say here is I don't think there is a way to determine the size of a blob without loading that blob into memory anyway. Please feel free to prove me wrong though! :)

@iamwilhelm (Contributor Author)

@samcday I'll try taking a look at libgit2. So if that's the case, do you think it makes sense to cache the timestamp and the file size in a tree entry?

@mildsunrise (Contributor)

So as you said, looking up the size would require loading the blob pointed to by the tree entry into memory.

Once that's done, you can easily query the size of the blob's contents via blob_rawsize or similar. Maybe we should expose that function on its own.

@mildsunrise (Contributor)

That implies that if the blob is 150 MB, you'd have to allocate and load those 150 MB in memory just to know its size. :-/

@mildsunrise (Contributor)

> There is some kind of object size field in the Packfile index, but that's for the git object itself, so it's counting the size of the object header.

Exactly. I don't think it's feasible to use that.

@mildsunrise (Contributor)

@iamwilhelm

> @samcday I'll try taking a look at libgit2. So if that's the case, do you think it makes sense to cache the timestamp and the file size in a tree entry?

Personally, I don't, and you'd have problems doing that. I'd just keep a hash mapping blob ID -> size, which would be effective and simple.

@iamwilhelm (Contributor Author)

Like @jmendeth said, it's blob_rawsize, though it seems you need a reference to the blob in the first place.

The purpose of this is simply to list the size and timestamp when showing a user their repo. I figured that, if it made sense, the timestamp and file size would be included in the entry whenever the entry refers to a blob. However, I'm not clear on the design or architecture, so I thought I'd ask whether it'd make sense to have them in the entry. If it doesn't, it's not the end of the world.

@mildsunrise (Contributor)

> Like @jmendeth said, it's blob_rawsize, though it seems you need a reference to the blob in the first place.
>
> The purpose of this is simply to list the size and timestamp when showing a user their repo. I figured that, if it made sense, the timestamp and file size would be included in the entry whenever the entry refers to a blob.

A blob is usually reused many, many times in a repo, so IMO it's a lot better if, instead of modifying the tree (which would create another tree, forcing you to massively rebase), you just keep a table associating every blob with its size. You'd only need to load each blob once! Example (please make this async):

# Cache mapping blob ID -> size, so each blob is loaded at most once.
sizeHash = {}

getBlobSize = (repo, blobId) ->
  if blobId of sizeHash
    sizeHash[blobId]
  else
    # repo.blob(...).size() stands in for whatever call loads the blob
    sizeHash[blobId] = repo.blob(blobId).size()

Then, when iterating over the tree entries, something like:

for entry in tree.entries
  name = entry.name
  id = entry.id
  size = getBlobSize repo, id
  # do something cool with these values

Lots of blobs to manage? No worries, Redis is your hero!
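For instance, a sketch of the same cache on top of the node redis client (repo.blob(blobId).size() is the same made-up API as in the sketch above):

redis = require 'redis'
client = redis.createClient()

# Same idea as sizeHash, but shared across processes and restarts.
getBlobSize = (repo, blobId, cb) ->
  client.get "blobsize:#{blobId}", (err, cached) ->
    return cb err if err
    return cb null, parseInt(cached, 10) if cached?
    size = repo.blob(blobId).size()   # made-up API, as above
    client.set "blobsize:#{blobId}", size, (err) ->
      cb err, size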

@mildsunrise (Contributor)

@iamwilhelm Think of the LICENSE file in this repo. It hasn't been modified in years. That means that the same blob has been used hundreds of times. Instead of having to load and store the blob size for each of the 400 trees... you just store it once in the hash. 🎉

@iamwilhelm (Contributor Author)

@jmendeth Wait, so are you suggesting the blob size hash as something to implement inside of gitteh? Or something that I'd do as a user of gitteh, outside of gitteh?

@mildsunrise (Contributor)

You should implement that caching in your application, using code like the above. It isn't functionality that Git or libgit2 offers; it's just caching you do to improve performance.

PS: In theory you can use a single hash for all the repos you have, since blob IDs are SHA-1 hashes of the content and therefore identical across repos.
