Skip to content

File metadata extraction tool and Ruby library

Notifications You must be signed in to change notification settings

AlexParamonov/metadata

 
 

Repository files navigation

Thanks
------

  Konrad Meyer for his patient testing and bug reports.
  Darren Kirby for the heads-up on wmainfo's ASF-parsing capabilities
  (along with being the author of wmainfo-rb and flacinfo-rb.)


Description
-----------

  This package `Metadata' comes with a library called `metadata' and
  a small program called `mdh'.

  The library probes files for their metadata (e.g. jpeg dimensions
  and camera make, mp3 artist, pdf text and word count) and returns the
  metadata as a Hash. All strings in the metadata are converted to UTF-8.

  The `mdh'-program can print out file metadata as YAML and package the
  metadata with the file.

  The metadata hash follows the shared file metadata spec naming, with some
  additional fields, see list at the end of this file (Appendix A.)

  For details on the MDH file format, see the end of this file (Appendix B.)


Usage
-----

  # print out metadata for myfile.jpg
  mdh myfile.jpg

  # create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg
  mdh -c myfile.jpg

  # print out the metadata header from an MDH file
  mdh -e -p myfile.jpg.mdh

  # strip out the metadata header from an MDH file and write the actual file
  # to myfile.jpg
  mdh -e myfile.jpg.mdh

  # include file path, filename, md5sum and sha1sum in the metadata header
  mdh --path --name -m -s myfile.jpg

  # guess title for document (first line that starts with a capital letter)
  mdh --guess-title foo.ps

  # guess title, abstract and metadata for document
  mdh --guess-metadata foo.ps

  # don't include document text (File.Content) in the metadata
  mdh --no-text foo.ps

  # query CiteSeer with the document title, add possible results to metadata
  mdh --citeseer foo.ps

  # query DBLP with the document title, add possible results to metadata
  mdh --dblp foo.ps

  # If you have an unknown CS document, this might help identify it:
  mdh --guess-metadata --dblp --citeseer foo.ps

  # print out the list of options
  mdh -h

  irb> require 'metadata'
  irb> Metadata.extract('myfile.jpg')
  irb> Metadata.extract_text('myfile.pdf')
  irb> Pathname.new("myfile.jpg").metadata


List of supported formats
-------------------------

  Audio:
    Whatever you manage to make mplayer play.
    Plus special handlers for FLAC, m4a, ape, musepack, wavepack and wma.

    Successfully tested with:
      mp3, flac, ogg, wav, ra, m4a, wma

    Should also work:
      wv, mpc, ape


  Video:
    Whatever you manage to make mplayer play.

    Successfully tested with:
      wmv, mov, divx, xvid, flv, ogg, mpg, mkv


  Images:
    Should handle pretty much anything.
    I.e. anything handled by ExifTool, ImageMagick, Imlib2 or dcraw.

    Successfully tested with:
      Web formats:
        jpeg, png, gif, svg
      Camera raws:
        nef, dng, crw, pef, orf, arw, raf, cr2
      Image editor state dumps:
        psd, xcf
      The rest:
        tga, tif, bmp, xpm, ppm, pcx


  Documents:
    Successfully tested with:
      Web formats:
        html, txt
      Print formats:
        pdf, ps, ps.gz
      OO formats:
        sxi, odp
      MS formats:
        doc, ppt, xls

    - I'm using unoconv to convert OO & MS docs to temp PDFs for the text &
      dimensions extraction, so those bits of data are missing. MSOffice docs
      are missing dimensions for the same reason. Here's a way to get them:
      ( first, get Thumbnailer: http://github.com/kig/thumbnailer/tree/master )
      $ thumbnailer -s 1 -k foo.odp /tmp/foo.jpg
      $ mdh foo.odp
      $ rm foo.odp-temp.pdf /tmp/foo.jpg


  Others:
    - BitTorrent .torrent files
    - Archive contents: tar.gz, zip
    - Whatever `extract' outputs and I am handling


  Formats that yield very little metadata:
    ai

  Formats that don't yield usable metadata:
    chm, sis, rb, rar, ttf

  Formats that fail mimetype guessing:
    exr



Requirements
------------

  * Ruby 1.9

  * Tons of metadata extraction programs and libs.
    This package has many dependencies since there is no single universal
    metadata header format that all files use. Blame resource forks, filename
    extensions, bags of bytes and mimetypes.

    List of gems:
      flacinfo-rb
      wmainfo-rb
      MP4Info
      id3lib-ruby
      apetag
      text
      hpricot
      ruby-mp3info

    List of Debian packages:
      dcraw
      libmagick9-dev
      extract
      libimage-exiftool-perl
      poppler-utils
      mplayer
      html2text
      imagemagick
      unhtml
      pstotext
      antiword
      catdoc
      shared-mime-info

    Gem use next system tools:
      md5sum
      sha1sum
      pdftotext
      lynx (lynx-cur)
      wc
      identify
      pstotext
      zcat
      antiword
      catdoc
      catppt
      xls2csv
      mplayer
      exiftool
      dcraw
      pdfinfo
      file
      unzip
      extract

  * You do want to install the latest versions of dcraw and
    shared-mime-info to be able to handle camera raw images.
    http://cybercom.net/~dcoffin/dcraw/
    http://freedesktop.org/wiki/Software/shared-mime-info

  * Python + chardet library
    http://chardet.feedparser.org/


Install
-------

  De-compress archive and enter its top directory.
  Then type:

   ($ su)
    # ruby setup.rb

  These simple step installs this program under the default
  location of Ruby libraries.  You can also install files into
  your favorite directory by supplying setup.rb some options.
  Try "ruby setup.rb --help".


Appendix A: Metadata fields
--------------------------------------

  This list contains the metadata fields output by Metadata and mdh.
  The list follows the shared file metadata spec for the most part.
  http://wiki.freedesktop.org/wiki/Specifications/shared-filemetadata-spec

  field name              | field type
  ----------------------------------------------------------------------
  Archive.Contents          array of pathnames

  Audio.Band                string
  Audio.Composer            string
  Audio.Conductor           string
  Audio.Copyright           string (copyright message)
  Audio.Grouping            string
  Audio.Image               base64-encoded binary string (embedded image data)
  Audio.InterpretedBy       string
  Audio.Lyricist            string
  Audio.Publisher           string
  Audio.RemixedBy           string
  Audio.Subtitle            string
  Audio.Tempo               integer
  Audio.VariableBitrate     boolean
  Audio.Writer              string
  Audio.Publicationright    string
  Audio.File                string
  Audio.EAN/UPC             string
  Audio.ISBN                string
  Audio.Catalog             string
  Audio.LC                  string
  Audio.Media               string
  Audio.Index               string
  Audio.Related             string
  Audio.ISRC                string
  Audio.Abstract            string
  Audio.Language            string
  Audio.Bibliography        string
  Audio.Introplay           string
  Audio.Dummy               string
  Audio.DebutAlbum          string
  Audio.RecordDate          string
  Audio.RecordLocation      string
  v-- ORIGINAL FIELDS USED --v
  Audio.Title               string
  Audio.Artist              string
  Audio.Album               string
  Audio.AlbumArtist         string
  Audio.AlbumTrackCount     integer
  Audio.TrackNo             integer
  Audio.DiscNo              integer
  Audio.Performer           string
  Audio.Duration            float
  Audio.ReleaseDate         datetime
  Audio.Comment             string
  Audio.Genre               string
  Audio.Codec               string
  Audio.Samplerate          integer
  Audio.Bitrate             float
  Audio.Channels            integer
  Audio.Lyrics              string

  Doc.Album                 string
  Doc.Artist                string
  Doc.Charset               string
  Doc.Description           string
  Doc.Genre                 string
  Doc.Language              string
  Doc.ModifyDate            date
  Doc.PageSizeName          string (A4, A5, letter, ...)
  Doc.RevisionHistory       array of strings
  Doc.ParagraphCount        integer
  Doc.LineCount             integer
  Doc.CharacterCount        integer
  Doc.LastSavedBy           string
  Doc.Keywords              array of strings
  Doc.Template              string
  Doc.Publisher             string
  Doc.PublicationName       string
  Doc.PublicationPages      string
  Doc.Citations             array of {href=>a, title=>b, rest=>c} hashes
  Doc.Contributor           string
  Doc.CiteSeerIdentifier    string
  Doc.CiteSeerURL           string
  Doc.Published             datetime
  Doc.Source                string
  Doc.DBLPIdentifier        string
  Doc.CrossRef              string (BibTex crossref)
  Doc.BibSource             string (BibTex source)
  Doc.BibTexType            string (BibTex type: article, inbook, ...)
  Doc.ACMCategories         array of strings
  v-- ORIGINAL FIELDS USED --v
  Doc.Title                 string
  Doc.Subject               string
  Doc.Author                string
  Doc.PageCount             integer
  Doc.WordCount             integer
  Doc.Created               datetime

  File.Software             string (software used to create the file)
  File.MD5Sum               string (md5sum of file's contents)
  File.SHA1Sum              string (sha1sum of file's contents)
  v-- ORIGINAL FIELDS USED --v
  File.Name                 string (basename of the file)
  File.Path                 string (dirname of the file)
  File.Format               string (mime type, inode/directory for dirs)
  File.Size                 integer
  File.Content              string
  File.Modified             string

  Image.DateCreated         date
  Image.DateTimeCreated     date
  Image.DateTimeOriginal    date
  Image.DimensionUnit       string (px, mm, pt, ...)
  Image.Editor              string
  Image.EXIF                string (exiftool output)
  Image.FrameCount          integer
  Image.LayerCount          integer
  Image.Modified            date
  Image.OriginatingProgram  string
  Image.ComponentCount      integer
  Image.ColorMode           string (e.g. RGB)
  Image.ColorSpace          string (e.g. sRGB)
  v-- ORIGINAL FIELDS USED --v
  Image.Height              float
  Image.Width               float
  Image.Title               string
  Image.Date                datetime
  Image.Creator             string
  Image.Description         string
  Image.Software            string
  Image.CameraMake          string
  Image.CameraModel         string
  Image.ExposureProgram     string
  Image.ExposureTime        float
  Image.Fnumber             float
  Image.Flash               boolean
  Image.FocalLength         float
  Image.ISOSpeed            float
  Image.MeteringMode        string
  Image.WhiteBalance        string
  Image.Copyright           string

  Location.Latitude         float
  Location.Longitude        float

  Video.Album               string
  Video.Artist              string
  Video.Bitrate             integer
  Video.Codec               string
  Video.Comment             string
  Video.Duration            float
  Video.Framerate           float (frames per second)
  Video.Genre               string
  Video.ReleaseDate         date
  Video.Title               string
  Video.TrackNo             integer
  Video.Demuxer             string

  BitTorrent.Name           string
  BitTorrent.Files          array of { 'path' => string,
                                       'length' => integer,
                                       'md5sum' => string }
  BitTorrent.Length         integer (size of single-file torrents)
  BitTorrent.MD5Sum         string (md5sum for single-file torrents)
  BitTorrent.PieceCount     integer
  BitTorrent.PieceLength    integer (length of a single piece
  BitTorrent.Comment        string
  BitTorrent.Announce       string (announce url)
  BitTorrent.AnnounceList   array of arrays of strings
  BitTorrent.Nodes          array of [hostname, port] -arrays



Appendix B: The MDH file format
-------------------------------

  MDH files are built as follows:

  bytes | content
  ---------------
    3   | "MDH"  - MDH file format identifier
    1   | "\x01" - MDH file format version number
    4   | Long, network byte order - the size of the metadata struct in bytes
   var  | YAML   - The MDH metadata struct
   var  | The actual file contents

  All string fields in the metadata are UTF-8.


License
-------

  Ruby's


Ilmari Heikkinen <ilmari.heikkinen gmail com>

About

File metadata extraction tool and Ruby library

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Ruby 100.0%