Thanks
------
Konrad Meyer for his patient testing and bug reports.
Darren Kirby for the heads-up on wmainfo's ASF-parsing capabilities
(along with being the author of wmainfo-rb and flacinfo-rb.)
Description
-----------
This package `Metadata' comes with a library called `metadata' and
a small program called `mdh'.
The library probes files for their metadata (e.g. jpeg dimensions
and camera make, mp3 artist, pdf text and word count) and returns the
metadata as a Hash. All strings in the metadata are converted to UTF-8.
The `mdh' program can print out file metadata as YAML and package the
metadata with the file.
The metadata hash follows the shared file metadata spec naming, with some
additional fields; see the list at the end of this file (Appendix A).
For details on the MDH file format, see Appendix B at the end of this file.
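The UTF-8 conversion mentioned above can be pictured with a small sketch.
This is not the library's actual code: `to_utf8' is a hypothetical helper,
and the source charset is assumed to be known already (for example from a
detector such as chardet, which is listed under Requirements).

```ruby
# Hypothetical helper, not part of the documented metadata API.
# Labels the raw bytes with the charset they were written in, then
# transcodes them to UTF-8, replacing anything that cannot be converted.
def to_utf8(raw, charset)
  raw.dup.force_encoding(charset).encode(
    'UTF-8', invalid: :replace, undef: :replace, replace: '?'
  )
end

latin1 = "caf\xE9" # the bytes of "café" in ISO-8859-1
utf8 = to_utf8(latin1, 'ISO-8859-1')
puts utf8.encoding
puts utf8.valid_encoding?
```

The input is duplicated first so the caller's string keeps its original
encoding label.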
Usage
-----
# print out metadata for myfile.jpg
mdh myfile.jpg
# create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg
mdh -c myfile.jpg
# print out the metadata header from an MDH file
mdh -e -p myfile.jpg.mdh
# strip out the metadata header from an MDH file and write the actual file
# to myfile.jpg
mdh -e myfile.jpg.mdh
# include file path, filename, md5sum and sha1sum in the metadata header
mdh --path --name -m -s myfile.jpg
# guess title for document (first line that starts with a capital letter)
mdh --guess-title foo.ps
# guess title, abstract and metadata for document
mdh --guess-metadata foo.ps
# don't include document text (File.Content) in the metadata
mdh --no-text foo.ps
# query CiteSeer with the document title, add possible results to metadata
mdh --citeseer foo.ps
# query DBLP with the document title, add possible results to metadata
mdh --dblp foo.ps
# If you have an unknown CS document, this might help identify it:
mdh --guess-metadata --dblp --citeseer foo.ps
# print out the list of options
mdh -h
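The --guess-title heuristic described above (first line that starts with a
capital letter) could look roughly like this. It is a sketch only:
`guess_title' is a hypothetical name, and stripping surrounding whitespace
and matching only ASCII capitals are assumptions, not documented behaviour.

```ruby
# Hypothetical helper illustrating the --guess-title heuristic:
# return the first line of the text that starts with a capital letter,
# or nil if there is no such line.
def guess_title(text)
  text.each_line.map(&:strip).find { |line| line =~ /\A[A-Z]/ }
end

sample = "  \npage 1 of 12\nOn the Design of Metadata Headers\nsecond line\n"
puts guess_title(sample)
```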
irb> require 'metadata'
irb> Metadata.extract('myfile.jpg')
irb> Metadata.extract_text('myfile.pdf')
irb> Pathname.new("myfile.jpg").metadata
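Since the library returns a plain Hash, the result can be handled with the
usual Hash methods. The sketch below uses a hand-written hash with
illustrative values instead of calling Metadata.extract, and it assumes the
keys are strings named as in Appendix A (the README does not say whether
keys are strings or symbols).

```ruby
require 'yaml'

# Illustrative values only; a real hash would come from Metadata.extract.
metadata = {
  'File.Name'        => 'myfile.jpg',
  'File.Format'      => 'image/jpeg',
  'Image.Width'      => 800.0,
  'Image.Height'     => 600.0,
  'Image.CameraMake' => 'ExampleCam'
}

# Group the fields by the namespace before the first dot (File, Image, ...).
by_namespace = metadata.group_by { |key, _value| key.split('.', 2).first }
by_namespace.each do |namespace, fields|
  puts "#{namespace}: #{fields.map(&:first).join(', ')}"
end

# Serialize the hash as YAML, the format that mdh prints.
puts metadata.to_yaml
```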
List of supported formats
-------------------------
Audio:
Whatever you manage to make mplayer play.
Plus special handlers for FLAC, m4a, ape, musepack, wavpack and wma.
Successfully tested with:
mp3, flac, ogg, wav, ra, m4a, wma
Should also work:
wv, mpc, ape
Video:
Whatever you manage to make mplayer play.
Successfully tested with:
wmv, mov, divx, xvid, flv, ogg, mpg, mkv
Images:
Should handle pretty much anything.
I.e. anything handled by ExifTool, ImageMagick, Imlib2 or dcraw.
Successfully tested with:
Web formats:
jpeg, png, gif, svg
Camera raws:
nef, dng, crw, pef, orf, arw, raf, cr2
Image editor state dumps:
psd, xcf
The rest:
tga, tif, bmp, xpm, ppm, pcx
Documents:
Successfully tested with:
Web formats:
html, txt
Print formats:
pdf, ps, ps.gz
OO formats:
sxi, odp
MS formats:
doc, ppt, xls
- I'm using unoconv to convert OO & MS docs to temp PDFs for the text &
dimensions extraction, so those bits of data are missing. MSOffice docs
are missing dimensions for the same reason. Here's a way to get them:
( first, get Thumbnailer: http://github.com/kig/thumbnailer/tree/master )
$ thumbnailer -s 1 -k foo.odp /tmp/foo.jpg
$ mdh foo.odp
$ rm foo.odp-temp.pdf /tmp/foo.jpg
Others:
- BitTorrent .torrent files
- Archive contents: tar.gz, zip
- Whatever `extract' outputs, as far as the library handles it
Formats that yield very little metadata:
ai
Formats that don't yield usable metadata:
chm, sis, rb, rar, ttf
Formats that fail mimetype guessing:
exr
Requirements
------------
* Ruby 1.9
* Tons of metadata extraction programs and libs.
This package has many dependencies since there is no single universal
metadata header format that all files use. Blame resource forks, filename
extensions, bags of bytes and mimetypes.
List of gems:
flacinfo-rb
wmainfo-rb
MP4Info
id3lib-ruby
apetag
text
hpricot
ruby-mp3info
List of Debian packages:
dcraw
libmagick9-dev
extract
libimage-exiftool-perl
poppler-utils
mplayer
html2text
imagemagick
unhtml
pstotext
antiword
catdoc
shared-mime-info
The gem uses the following system tools:
md5sum
sha1sum
pdftotext
lynx (lynx-cur)
wc
identify
pstotext
zcat
antiword
catdoc
catppt
xls2csv
mplayer
exiftool
dcraw
pdfinfo
file
unzip
extract
* You should install the latest versions of dcraw and
shared-mime-info to be able to handle camera raw images.
http://cybercom.net/~dcoffin/dcraw/
http://freedesktop.org/wiki/Software/shared-mime-info
* Python + chardet library
http://chardet.feedparser.org/
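With this many external programs it can help to check which ones are
actually installed. The following is a sketch, not something shipped with
the package: `find_tool' is a hypothetical helper that searches a
PATH-style string, and the tool names are a subset of the list above.

```ruby
# Hypothetical helper: look for an executable called `name` in the
# directories of a PATH-style string. Returns the full path, or nil.
def find_tool(name, path = ENV['PATH'].to_s)
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# A few of the system tools named in the list above.
tools = %w[md5sum sha1sum pdftotext exiftool dcraw mplayer unzip]
missing = tools.reject { |tool| find_tool(tool) }
puts missing.empty? ? 'all tools found' : "missing: #{missing.join(', ')}"
```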
Install
-------
Decompress the archive and enter its top directory.
Then type:
($ su)
# ruby setup.rb
These simple steps install this program under the default
location for Ruby libraries. You can also install files into
a directory of your choice by passing options to setup.rb.
Try "ruby setup.rb --help".
Appendix A: Metadata fields
--------------------------------------
This list contains the metadata fields output by Metadata and mdh.
The list follows the shared file metadata spec for the most part.
http://wiki.freedesktop.org/wiki/Specifications/shared-filemetadata-spec
field name | field type
----------------------------------------------------------------------
Archive.Contents array of pathnames
Audio.Band string
Audio.Composer string
Audio.Conductor string
Audio.Copyright string (copyright message)
Audio.Grouping string
Audio.Image base64-encoded binary string (embedded image data)
Audio.InterpretedBy string
Audio.Lyricist string
Audio.Publisher string
Audio.RemixedBy string
Audio.Subtitle string
Audio.Tempo integer
Audio.VariableBitrate boolean
Audio.Writer string
Audio.Publicationright string
Audio.File string
Audio.EAN/UPC string
Audio.ISBN string
Audio.Catalog string
Audio.LC string
Audio.Media string
Audio.Index string
Audio.Related string
Audio.ISRC string
Audio.Abstract string
Audio.Language string
Audio.Bibliography string
Audio.Introplay string
Audio.Dummy string
Audio.DebutAlbum string
Audio.RecordDate string
Audio.RecordLocation string
v-- ORIGINAL FIELDS USED --v
Audio.Title string
Audio.Artist string
Audio.Album string
Audio.AlbumArtist string
Audio.AlbumTrackCount integer
Audio.TrackNo integer
Audio.DiscNo integer
Audio.Performer string
Audio.Duration float
Audio.ReleaseDate datetime
Audio.Comment string
Audio.Genre string
Audio.Codec string
Audio.Samplerate integer
Audio.Bitrate float
Audio.Channels integer
Audio.Lyrics string
Doc.Album string
Doc.Artist string
Doc.Charset string
Doc.Description string
Doc.Genre string
Doc.Language string
Doc.ModifyDate date
Doc.PageSizeName string (A4, A5, letter, ...)
Doc.RevisionHistory array of strings
Doc.ParagraphCount integer
Doc.LineCount integer
Doc.CharacterCount integer
Doc.LastSavedBy string
Doc.Keywords array of strings
Doc.Template string
Doc.Publisher string
Doc.PublicationName string
Doc.PublicationPages string
Doc.Citations array of {href=>a, title=>b, rest=>c} hashes
Doc.Contributor string
Doc.CiteSeerIdentifier string
Doc.CiteSeerURL string
Doc.Published datetime
Doc.Source string
Doc.DBLPIdentifier string
Doc.CrossRef string (BibTex crossref)
Doc.BibSource string (BibTex source)
Doc.BibTexType string (BibTex type: article, inbook, ...)
Doc.ACMCategories array of strings
v-- ORIGINAL FIELDS USED --v
Doc.Title string
Doc.Subject string
Doc.Author string
Doc.PageCount integer
Doc.WordCount integer
Doc.Created datetime
File.Software string (software used to create the file)
File.MD5Sum string (md5sum of file's contents)
File.SHA1Sum string (sha1sum of file's contents)
v-- ORIGINAL FIELDS USED --v
File.Name string (basename of the file)
File.Path string (dirname of the file)
File.Format string (mime type, inode/directory for dirs)
File.Size integer
File.Content string
File.Modified string
Image.DateCreated date
Image.DateTimeCreated date
Image.DateTimeOriginal date
Image.DimensionUnit string (px, mm, pt, ...)
Image.Editor string
Image.EXIF string (exiftool output)
Image.FrameCount integer
Image.LayerCount integer
Image.Modified date
Image.OriginatingProgram string
Image.ComponentCount integer
Image.ColorMode string (e.g. RGB)
Image.ColorSpace string (e.g. sRGB)
v-- ORIGINAL FIELDS USED --v
Image.Height float
Image.Width float
Image.Title string
Image.Date datetime
Image.Creator string
Image.Description string
Image.Software string
Image.CameraMake string
Image.CameraModel string
Image.ExposureProgram string
Image.ExposureTime float
Image.Fnumber float
Image.Flash boolean
Image.FocalLength float
Image.ISOSpeed float
Image.MeteringMode string
Image.WhiteBalance string
Image.Copyright string
Location.Latitude float
Location.Longitude float
Video.Album string
Video.Artist string
Video.Bitrate integer
Video.Codec string
Video.Comment string
Video.Duration float
Video.Framerate float (frames per second)
Video.Genre string
Video.ReleaseDate date
Video.Title string
Video.TrackNo integer
Video.Demuxer string
BitTorrent.Name string
BitTorrent.Files array of { 'path' => string,
'length' => integer,
'md5sum' => string }
BitTorrent.Length integer (size of single-file torrents)
BitTorrent.MD5Sum string (md5sum for single-file torrents)
BitTorrent.PieceCount integer
BitTorrent.PieceLength integer (length of a single piece)
BitTorrent.Comment string
BitTorrent.Announce string (announce url)
BitTorrent.AnnounceList array of arrays of strings
BitTorrent.Nodes array of [hostname, port] arrays
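To give a feel for the field naming, here is a sketch that fills in a few
of the File.* fields from the list above using only the Ruby standard
library. `basic_file_fields' is a hypothetical helper and is not how the
library itself computes these values (the requirements list suggests it
uses md5sum and sha1sum).

```ruby
require 'digest/md5'
require 'digest/sha1'
require 'tempfile'

# Hypothetical helper: build some File.* fields for one file.
# Field names follow the list above; string keys are an assumption.
def basic_file_fields(filename)
  {
    'File.Name'    => File.basename(filename),
    'File.Path'    => File.dirname(filename),
    'File.Size'    => File.size(filename),
    'File.MD5Sum'  => Digest::MD5.file(filename).hexdigest,
    'File.SHA1Sum' => Digest::SHA1.file(filename).hexdigest
  }
end

# Try it on a throwaway file containing the five bytes "hello".
Tempfile.open('sample') do |file|
  file.write('hello')
  file.flush
  basic_file_fields(file.path).each { |field, value| puts "#{field}: #{value}" }
end
```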
Appendix B: The MDH file format
-------------------------------
MDH files are built as follows:
bytes | content
---------------
3 | "MDH" - MDH file format identifier
1 | "\x01" - MDH file format version number
4 | Long, network byte order - the size of the metadata struct in bytes
var | YAML - The MDH metadata struct
var | The actual file contents
All string fields in the metadata are UTF-8.
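The layout above can be written and read back with a few lines of Ruby.
This is a sketch under stated assumptions, not the code mdh uses:
`mdh_pack' and `mdh_unpack' are hypothetical names, the 4-byte size is
treated as an unsigned 32-bit big-endian integer (pack format 'N'), and
the size is taken to be the byte length of the YAML text.

```ruby
require 'yaml'
require 'stringio'

# Build an MDH byte string: "MDH", version byte 1, 4-byte big-endian
# header size, the YAML header, then the original file contents.
def mdh_pack(metadata, contents)
  header = metadata.to_yaml.force_encoding('BINARY')
  ['MDH', 1, header.bytesize].pack('a3CN') + header +
    contents.dup.force_encoding('BINARY')
end

# Split an MDH byte string into [metadata hash, file contents].
def mdh_unpack(data)
  io = StringIO.new(data.dup.force_encoding('BINARY'))
  magic, version, size = io.read(8).to_s.unpack('a3CN')
  raise ArgumentError, 'not an MDH file' unless magic == 'MDH'
  raise ArgumentError, "unsupported MDH version #{version}" unless version == 1
  header = io.read(size).force_encoding('UTF-8') # string fields are UTF-8
  [YAML.load(header), io.read]
end

packed = mdh_pack({ 'File.Name' => 'hello.txt', 'File.Size' => 5 }, 'hello')
metadata, contents = mdh_unpack(packed)
puts metadata['File.Name']
puts contents
```

Keeping the size field in front of the YAML lets a reader skip straight to
the file contents without parsing the header.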
License
-------
Ruby's
Ilmari Heikkinen <ilmari.heikkinen gmail com>