This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Data Package Bundling (and maybe compression) #132

Closed
sabas opened this issue Jun 18, 2014 · 26 comments

Comments

@sabas

sabas commented Jun 18, 2014

Updated: 2016-11-17

We want a way to "bundle" a data package into a single file for transmission. In addition it may be compressed at the same time.

Note also that individual resources can be compressed in themselves - see #290

Desired Features

  • Widely supported in client systems
  • Ability to access data within the bundle easily and without downloading the entire bundle (e.g. to stream resources from the bundle)

Original Description

As other packaging types use compression for distributing each package (JAR is a ZIP archive), there should be a section proposing a way to deal with compressed data packages.

@rufuspollock
Contributor

@sabas do you have a specific suggestion? I think you are right this is useful.

/cc @paulfitz

@sabas
Author

sabas commented May 26, 2015

I was thinking of a specification that would describe how to interpret a zipped package on the fly, in the same way a JAR is executed by Java.
So I would expect it to cover:

  • the compression algorithm: gzip?
  • which files are needed for correct decompression or reading on the fly (like the zcat and similar cli tools)
  • how to compress the datapackage
  • which file extension or MIME type to use
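As a rough illustration of "reading on the fly" in the zcat sense, here is a minimal sketch that streams CSV rows out of a gzip-compressed resource without unpacking it to disk. The function name `iter_gzipped_csv` and the sample data are hypothetical, and gzip is only one of the candidate algorithms raised here, not a decided choice.

```python
import csv
import gzip
import io


def iter_gzipped_csv(raw_stream):
    """Yield CSV rows from a gzip-compressed byte stream, decompressing lazily."""
    # gzip.open accepts an already-open binary file object as well as a path.
    with gzip.open(raw_stream, mode="rt", encoding="utf-8", newline="") as text:
        yield from csv.reader(text)


# Hypothetical sample data standing in for a compressed resource file.
payload = gzip.compress(b"year,gdp\n2013,100\n2014,105\n")
rows = list(iter_gzipped_csv(io.BytesIO(payload)))
```

The same function works on any binary stream (a file, a socket wrapper, an HTTP response body), which is the property the "on the fly" requirement is after.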

@rufuspollock
Contributor

@sabas i think this makes a lot of sense. Do you want to start speccing something out?

@sabas
Author

sabas commented Jun 1, 2015

See #198

@rufuspollock
Contributor

There was a lot of discussion in the PR. The PR basically suggested tar + gzip. Subsequent discussion in the PR suggested reviewing existing best practice more and using zip. Main excerpts:

@mfenner wrote:

In the spirit of keeping things simple I wouldn't provide two options (.dp and .dpz). And in the spirit of not reinventing the wheel I would look at https://researchobject.github.io/specifications/bundle/, which uses Universal Container Format (UCF). Or, for a software packaging example, see Chrome extensions: https://developer.chrome.com/extensions/packaging

Excerpt from Research Object bundle spec:

A UCF container is based on the ZIP compression file format [ZIP], enforcing additional restrictions. The most important restrictions are:

  • Reserved filenames in the root directory: mimetype and META-INF
  • Filenames must be encoded in UTF-8
  • Compression must be Uncompressed or Flate
  • may use Zip64 extensions, but should only do so when required
  • The first file must be the uncompressed mimetype and without any extra attributes

UCF says about mimetype:

The first file in the Zip container must be a file with the ASCII name of mimetype, which holds the MIME type for the Zip container (application/epub+zip as an ASCII string; no padding, white-space, or case change).
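To make the UCF "mimetype first, uncompressed, no extra attributes" rule concrete, here is a minimal sketch using Python's standard `zipfile` module. The helper names and the MIME type string are hypothetical; no MIME type for bundled data packages is defined in this thread. Note that `infolist()` follows central directory order, which normally (but not necessarily) matches the physical order of entries in the file.

```python
import io
import zipfile


def build_bundle(mimetype, files):
    """Write a ZIP whose first entry is an uncompressed 'mimetype' file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry is written first and stored without compression.
        archive.writestr(
            zipfile.ZipInfo("mimetype"), mimetype, compress_type=zipfile.ZIP_STORED
        )
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def first_entry_is_stored_mimetype(zip_bytes):
    """Return the declared MIME type if the first entry follows the rule, else None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        entries = archive.infolist()
        if not entries:
            return None
        first = entries[0]
        if first.filename != "mimetype":
            return None
        if first.compress_type != zipfile.ZIP_STORED or first.extra:
            return None
        return archive.read(first).decode("ascii")


# Hypothetical MIME type, used only for illustration.
EXAMPLE_TYPE = "application/x-example-datapackage+zip"
bundle = build_bundle(EXAMPLE_TYPE, {"datapackage.json": "{}"})
declared = first_entry_is_stored_mimetype(bundle)
```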

@tfmorris wrote:

I'll second @mfenner's suggestion to exhaust all possible existing alternatives before defining a new format. If you are forced to define something new, I'd strongly consider using zip instead of tar, since every other container format in the world from JAR to EPUB to Research Object Bundle has settled on it. There's an old overview of a bunch of the zip-based formats here: http://broadcast.oreilly.com/2009/01/packaging-formats-of-famous-ap.html

@rufuspollock
Contributor

@mfenner would you be interested in taking a bit of editorship here? You were a strong proponent of introducing this (and I'm +1 too). In addition, this should be very simple and short spec to write once we decide what to do.

@mfenner

mfenner commented Nov 19, 2015

Let me think about how to approach this.

@rufuspollock
Contributor

@mfenner any further thoughts? /cc @danfowler

I am increasingly thinking that "bundling" a data package into one file (compressed) is an important use case and would love your suggestions here.

@mfenner

mfenner commented Feb 3, 2016

@rgrp sorry for not following up on this. I want standard zip compression, but haven't found the time to spec out the details.

Bundling a data package into one file is an important use case for me.

@pwalsh
Member

pwalsh commented Feb 3, 2016

For reference (although not directly related to a spec for compression) we went ahead and added zip support to the recently upgraded Python lib for DataPackage, based on very clear use cases in the CKAN integration, and, in general, that it is sensible and reasonable :). @vitorbaptista developed and led on that initiative.

For reference:

@rufuspollock
Contributor

@mfenner i imagine this can be super simple. Would you be able to start a draft and drop it in an issue here?

@vitorbaptista useful to get outline of what you did.

@vitorbaptista
Contributor

The requirements for my ZIP file loading were to be able to load both ZIPs that follow the pattern:

./datapackage.json
./data/resource.csv

and also

./my-datapackage/datapackage.json
./my-datapackage/data/resource.csv

This is because we wanted to support the ZIP files generated by GitHub (i.e. https://github.com/datasets/gdp/archive/master.zip), which have all contents inside a folder.

The actual code checks that the ZIP file has one and only one datapackage.json file and loads it. All paths in the data package are then relative to that datapackage.json. This allows any folder structure inside the ZIP file, as long as there is a single datapackage.json. It was easier to code this way 👍
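The loading behaviour described here can be sketched roughly as follows. This is not the code from the Python library, only a minimal approximation using the standard `zipfile` module; the function names and the sample archive contents are hypothetical.

```python
import io
import json
import posixpath
import zipfile


def load_descriptor(archive):
    """Find the single datapackage.json; return (its directory, parsed descriptor)."""
    matches = [
        name for name in archive.namelist()
        if posixpath.basename(name) == "datapackage.json"
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one datapackage.json, found {len(matches)}")
    descriptor = json.loads(archive.read(matches[0]).decode("utf-8"))
    return posixpath.dirname(matches[0]), descriptor


def open_resource(archive, base_dir, resource_path):
    """Open a resource whose path is relative to the datapackage.json location."""
    member = posixpath.normpath(posixpath.join(base_dir, resource_path))
    return archive.open(member)


# Hypothetical archive laid out like a GitHub-generated ZIP (one wrapping folder).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as out:
    out.writestr("gdp-master/datapackage.json",
                 json.dumps({"resources": [{"path": "data/gdp.csv"}]}))
    out.writestr("gdp-master/data/gdp.csv", "year,gdp\n2014,105\n")

with zipfile.ZipFile(buffer) as archive:
    base_dir, descriptor = load_descriptor(archive)
    with open_resource(archive, base_dir, descriptor["resources"][0]["path"]) as fh:
        first_line = fh.readline()
```

Because `archive.open` reads one member at a time, a single resource can be read without extracting the rest of the bundle.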

@mfenner

mfenner commented Feb 4, 2016

+1 Makes a lot of sense.

@demyanrogozhin

I just hope you are aware of the ZIP filename encoding problems: http://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/

A lot of users still stick to windows-1251 (Cyrillic) or SHIFT_JIS (Japanese).

Maybe it would be a good idea to pick an archive format that doesn't have such a design flaw (if such a format exists)?

@tfmorris

tfmorris commented Feb 5, 2016 via email

@rufuspollock
Contributor

@mfenner are you happy to draft a mini-spec here? I imagine it could be just a few paragraphs saying e.g.

  • We use zip
  • datapackage.json must be at "base" of the zip
  • any issues about "referencing" within the zip
  • zip file naming conventions (if any)

@vitorbaptista
Contributor

I wouldn't limit it to datapackage.json only at the base of the zip for the reasons I mentioned before (#132 (comment)). I would suggest we either:

  1. Support datapackage.json either at the base of the zip or in a top-level folder;
  2. Support datapackage.json only at a top-level folder (i.e. the contents of the ZIP must be inside a single folder);
  3. Support datapackage.json anywhere inside the ZIP.

I would suggest we follow the 3rd option, as it's both easier to code and to explain.
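To compare the three options side by side, here is a small sketch that checks a list of archive member names against each one. The function name is hypothetical, and two assumptions are baked in: every option requires exactly one datapackage.json, and member names are already normalized (forward slashes, no leading "./").

```python
import posixpath


def satisfies_option(names, option):
    """Check archive member names against location option 1, 2 or 3."""
    found = [n for n in names if posixpath.basename(n) == "datapackage.json"]
    if len(found) != 1:  # assumption: exactly one descriptor in every option
        return False
    depth = found[0].count("/")  # 0 = base of the zip, 1 = top-level folder
    if option == 1:
        return depth <= 1
    if option == 2:
        # All contents must sit inside one single top-level folder.
        top_level = {n.split("/", 1)[0] for n in names}
        return depth == 1 and len(top_level) == 1
    if option == 3:
        return True
    raise ValueError(f"unknown option: {option}")


at_base = ["datapackage.json", "data/resource.csv"]
in_folder = ["my-datapackage/datapackage.json", "my-datapackage/data/resource.csv"]
nested = ["a/b/datapackage.json", "a/b/data/resource.csv"]

results = {
    "at_base": [satisfies_option(at_base, o) for o in (1, 2, 3)],
    "in_folder": [satisfies_option(in_folder, o) for o in (1, 2, 3)],
    "nested": [satisfies_option(nested, o) for o in (1, 2, 3)],
}
```

Both layouts from the earlier comment pass option 1, only the GitHub-style layout passes option 2, and anything with a single descriptor passes option 3.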

@amercader
Member

I think it is better to be explicit in this case and limit the options for people. A single datapackage.json at the base of the zip or in a top-level folder is easy enough to understand and to code, so my vote goes to option 1.

@sabas
Author

sabas commented Feb 5, 2016

Option 1 would enforce the rules used by the datasets data packages.

@demyanrogozhin

@tfmorris I propose 7zip, as it's open source and provides a better compression ratio and UTF-8 file names.

Although 2008 was a long time ago, the i18n problems in filesystems are the same: a ZIP file created on a PC with a Korean locale and containing Korean filenames will be unreadable gibberish after unzipping on a PC with a different locale.
ZIP allows different encodings for filenames, but doesn't record the original locale.
It's less about the format than about the tools, but the problem still exists.
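One way to detect the situation described here is to look at general purpose bit 11 of each ZIP entry, which marks the file name as UTF-8. The sketch below, using Python's standard `zipfile` module, flags entries whose names contain non-ASCII characters but do not carry that bit, meaning their original encoding is unknown. The function name is hypothetical; Python itself sets the bit when it writes non-ASCII names, so archives it produces should not be flagged.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name and comment are UTF-8


def names_with_unknown_encoding(infos):
    """Return names that are non-ASCII yet not declared as UTF-8."""
    risky = []
    for info in infos:
        declared_utf8 = bool(info.flag_bits & UTF8_FLAG)
        # Without the flag, Python decodes names as cp437, so any byte >= 0x80
        # shows up as a non-ASCII character and the real encoding is a guess.
        if not declared_utf8 and not info.filename.isascii():
            risky.append(info.filename)
    return risky


# An archive written by Python: the Korean name gets the UTF-8 flag.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as out:
    out.writestr("데이터.csv", "a,b\n")
    out.writestr("data.csv", "a,b\n")

with zipfile.ZipFile(buffer) as archive:
    risky_names = names_with_unknown_encoding(archive.infolist())
```

A reader could refuse such archives, or a writer could be required to set the flag, which would sidestep the locale problem while staying on ZIP.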

@danfowler
Contributor

For reference, BagIt's serialization specification work doesn't actually mandate a given format, just rules for (de)serializing behavior:

Several rules govern the serialization of a bag and apply equally to all types of archive files

https://tools.ietf.org/html/draft-kunze-bagit-13#section-4

@pwalsh
Member

pwalsh commented Jul 12, 2016

@mfenner are you still interested in working on a mini spec for this?

@rufuspollock
Contributor

Having read the BagIt approach I think they got it pretty much right.

My only question would be about step 3: we could instead create the archive from within the data package directory, so that datapackage.json is at the root of the archive file. However, my guess is that the BagIt creators thought about this.

Next steps:

  • Create a data-package-identifier draft
  • Port the BagIt approach in there with appropriate tweaking (fulsomely acknowledging BagIt)
  • Publish - suggest this is an extension rather than a core spec

Serialization

   In some scenarios, it may be convenient to serialize the bag's
   filesystem hierarchy (i.e., the base directory) into a single-file
   archive format such as TAR or ZIP (the serialization) and then later
   deserialize the serialization to recreate the filesystem hierarchy.
   Several rules govern the serialization of a bag and apply equally to
   all types of archive files:

   1.  The top-level directory of a serialization MUST contain only one
       bag.

   2.  The serialization SHOULD have the same name as the bag's base
       directory, but MUST have an extension added to identify the
       format.  For example, the receiver of "mybag.tar.gz" expects the
       corresponding base directory to be created as "mybag".

   3.  A bag MUST NOT be serialized from within its base directory, but
       from the parent of the base directory (where the base directory
       appears as an entry).  Thus, after a bag is deserialized in an
       empty directory, a listing of that directory shows exactly one
       entry.  For example, deserializing "mybag.zip" in an empty
       directory causes the creation of the base directory "mybag" and,
       beneath "mybag", the creation of all payload and tag files.

   4.  The deserialization of a bag MUST produce a single base directory
       bag with the top-level structure as described in this
       specification without requiring any additional un-archiving step.
       For example, after one un-archiving step it would be an error for
       the "data/" directory to appear as "data.tar.gz".  TAR and ZIP
       files may appear inside the payload beneath the "data/"
       directory, where they would be treated as any other payload file.

   When serializing a bag, care must be taken to ensure that the archive
   format's restrictions on file naming, such as allowable characters,
   length, or character encoding, will support the requirements of the
   systems on which it will be used.  See Section 7.2.
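Rules 2 and 3 of the BagIt excerpt (name the archive after the base directory, and serialize from the parent so the base directory appears as the single top-level entry) can be sketched with Python's standard `shutil.make_archive`. The function name `serialize_package` and the sample directory layout are hypothetical, and ZIP is used only as one possible format.

```python
import os
import shutil
import tempfile
import zipfile


def serialize_package(package_dir, out_dir):
    """Archive package_dir from its parent so it is the only top-level entry."""
    package_dir = os.path.abspath(package_dir)
    parent, name = os.path.split(package_dir)
    # root_dir is the parent; base_dir is the package folder inside it.
    return shutil.make_archive(
        os.path.join(out_dir, name), "zip", root_dir=parent, base_dir=name
    )


with tempfile.TemporaryDirectory() as work:
    # Hypothetical package layout: mybag/datapackage.json, mybag/data/resource.csv
    package = os.path.join(work, "src", "mybag")
    os.makedirs(os.path.join(package, "data"))
    with open(os.path.join(package, "datapackage.json"), "w") as fh:
        fh.write("{}")
    with open(os.path.join(package, "data", "resource.csv"), "w") as fh:
        fh.write("a,b\n")

    out_dir = os.path.join(work, "out")
    os.makedirs(out_dir)
    archive_path = serialize_package(package, out_dir)
    archive_name = os.path.basename(archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        top_level = {n.split("/", 1)[0] for n in archive.namelist()}
```

Serializing from inside the package directory instead (so that datapackage.json sits at the archive root) would only require passing the package directory as `root_dir` and omitting `base_dir`.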

@rufuspollock rufuspollock changed the title from "Data package compression" to "Data Package Bundling (and maybe compression)" Nov 17, 2016
@rufuspollock rufuspollock self-assigned this Nov 17, 2016
@pwalsh
Member

pwalsh commented Dec 21, 2016

@rufuspollock will you work on some wording for this? Maybe better in here until I finish on #337

@rufuspollock
Contributor

@pwalsh yes - note this is a patterns item at this stage. It won't be part of the spec atm i think.

@lowrece12

tar + zstd are great for this purpose.

Zstd is superior to gzip/zlib.

Tools exist and are available under a permissive license (BSD).

C:\msys64\usr\bin\bsdtar.exe -a -cf - --format pax <files> -C . | zstd.exe - -19 -o R:\data.tar.zst

related topic #290 (comment)
