Explore other ways to identify file types #175

prjemian · 2022-02-18T20:10:37Z

When serving a directory of files, there may exist valid data files that lack a feature in the file name (such as a file extension) to identify the type of file. For example, there is no common file extension for SPEC data files and some users are accustomed to omitting a file extension. As shown in #174, the file extension may be too complicated to examine or not one of the recognized values. The .dat and .txt extensions are also used for various types of data files, including CSV.

We need some programmatic technique to identify the type of file, similar to the UNIX file command. Python examples include is_spec_file(filename) and isNeXusFile(filename).

Such routines could be called with unrecognized files.
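As a rough illustration of such routines, here is a minimal sketch that looks only at the first bytes of a file. It assumes that a SPEC data file begins with a "#F " header line and that an HDF5 file (a NeXus candidate) carries its signature at offset 0. Both assumptions are simplifications: a real NeXus check would also need to inspect the HDF5 contents, and these functions are illustrative, not the ones from any existing package.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def read_head(filename, nbytes=512):
    """Return up to nbytes from the start of the file, or b"" if unreadable."""
    try:
        with open(filename, "rb") as f:
            return f.read(nbytes)
    except OSError:
        return b""


def is_hdf5_file(filename):
    """True if the file starts with the HDF5 signature.

    Assumption: no HDF5 user block, so the signature sits at offset 0.
    """
    return read_head(filename, len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def is_spec_file(filename):
    """True if the file starts with a SPEC-style '#F ' header line.

    Assumption: the file header has not been stripped from the data file.
    """
    return read_head(filename).startswith(b"#F ")
```

Neither check depends on the file name, so both work for files with no extension at all.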


prjemian · 2022-02-18T20:25:09Z

The additional technique could be inserted into this block:

tiled/tiled/adapters/files.py

Lines 648 to 654 in 3d269d4

    
    if ext in mimetypes_by_file_ext:
        mimetype = mimetypes_by_file_ext[ext]
    else:
        # Use the Python's built-in facility for guessing mimetype
        # from file extension. This loads data about mimetypes from
        # the operating system the first time it is used.
        mimetype, _ = mimetypes.guess_type(path)

It might be better to search known mimetypes first, since identification by file content is the more expensive operation. Associate each identified file type with an ad hoc, unique mimetype.
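A sketch of that ordering, assuming the cheap extension lookups run first and the file is opened only when they come up empty. The CONTENT_SNIFFERS table, the guess_mimetype name, the way ext is computed, and the application/x-hdf5 mimetype are all illustrative assumptions, not tiled code.

```python
import mimetypes
from pathlib import Path

# Hypothetical table: (test on the leading bytes, ad hoc mimetype).
CONTENT_SNIFFERS = [
    (lambda head: head.startswith(b"\x89HDF\r\n\x1a\n"), "application/x-hdf5"),
    (lambda head: head.startswith(b"#F "), "text/x-specfile"),
]


def guess_mimetype(path, mimetypes_by_file_ext):
    """Try cheap extension lookups first; read the file only as a last resort."""
    ext = "".join(Path(path).suffixes)  # assumption: full suffix, e.g. ".tar.gz"
    if ext in mimetypes_by_file_ext:
        return mimetypes_by_file_ext[ext]
    mimetype, _ = mimetypes.guess_type(path)
    if mimetype is not None:
        return mimetype
    # Both extension-based lookups failed, so pay for one small read.
    try:
        with open(path, "rb") as f:
            head = f.read(512)
    except OSError:
        return None
    for matches, sniffed in CONTENT_SNIFFERS:
        if matches(head):
            return sniffed
    return None
```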

danielballan · 2022-02-18T21:07:27Z

I support this. I intentionally kept it simple to start, looking at file extension only, but I agree it's time to enable more sophisticated techniques.

I propose to add a configuration setting:

# config.yml
...
mimetype_detection_hook: my_custom_module:my_sniffer

which would enable you and anyone to experiment with this outside the tiled package like this:

# my_custom_module.py

def my_sniffer(filepath):
    ...
    return "..."

The function may inspect the filename and, if it needs to, open the file and read as many bytes as it wants to. The return value should be a MIME type, either a registered one like text/csv or a custom one like text/x-specfile.

This would override the code you excerpted above, so it would be in total control over how types were determined. It could decide whether to copy the mimetype search approach as a first pass or to overrule it.
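For illustration, a "module:function" reference like the one in the config above could be resolved along these lines. This is a generic sketch using importlib; resolve_hook is a hypothetical name and this is not necessarily how tiled loads the hook.

```python
import importlib


def resolve_hook(reference):
    """Import 'package.module:attribute' and return the named callable."""
    module_name, _, attribute = reference.partition(":")
    if not module_name or not attribute:
        raise ValueError(f"expected 'module:function', got {reference!r}")
    hook = getattr(importlib.import_module(module_name), attribute)
    if not callable(hook):
        raise TypeError(f"{reference!r} is not callable")
    return hook
```

With my_custom_module importable, resolve_hook("my_custom_module:my_sniffer") would return the my_sniffer function, ready to be called with a file path.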

If people developed "sniffers" that prove to be generally useful, we can always move them into tiled proper at some later point. Either way, I think it will be important to enable people who deploy tiled to customize the sniffer behavior like this on their own.

What do you think?

prjemian · 2022-02-18T21:18:42Z

That seems very general. I like it.

danielballan · 2022-08-04T17:28:38Z

@prjemian This is now implemented in v0.1.0a67 and documented at https://blueskyproject.io/tiled/how-to/read-custom-formats.html.

Let me know if you get a chance to try it out on SPEC or NeXus.

prjemian · 2022-08-28T16:37:43Z

Starting to look at this now. Case 2 is the most likely scenario since our data files may have extensions. Yet an extension cannot be trusted to be informative when it is overloaded for various data formats: .dat could be SPEC, CSV, binary, ..., and .h5 could be NeXus, Data Exchange, or other.

The interface is called for each file:

# custom.py

def detect_mimetype(filepath, mimetype):
    if mimetype is None:
        # If we are here, detection based on file extension came up empty.
        ...
        mimetype = "text/csv"
    return mimetype
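A possible variant for the overloaded-extension case: a sketch in which the incoming guess is kept unless it is missing or the extension is one we do not trust. The OVERLOADED_EXTENSIONS set, the byte signatures ("#F " for SPEC, the HDF5 signature at offset 0), and the application/x-hdf5 mimetype are assumptions made for illustration.

```python
# custom.py
from pathlib import Path

# Extensions whose meaning is ambiguous, so the incoming guess is not trusted.
OVERLOADED_EXTENSIONS = {".dat", ".txt", ".h5"}


def detect_mimetype(filepath, mimetype):
    """Keep the extension-based guess unless it is missing or untrustworthy."""
    ext = Path(filepath).suffix.lower()
    if mimetype is not None and ext not in OVERLOADED_EXTENSIONS:
        return mimetype
    try:
        with open(filepath, "rb") as f:
            head = f.read(512)
    except OSError:
        return mimetype
    if head.startswith(b"\x89HDF\r\n\x1a\n"):
        return "application/x-hdf5"
    if head.startswith(b"#F "):
        return "text/x-specfile"
    # Nothing recognized: fall back to whatever was passed in.
    return mimetype
```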

While this could become time-expensive when repeating over a directory structure with many similar files (a typical pattern), it could be optimized. One optimization (in the custom handler) could be a sense of recognition that files in a directory likely follow a pattern, such as any combination of these rules:

all files in this directory are [this known format], regardless of naming style
NeXus/HDF5 area detector files in this directory have .h5 extension
SPEC files in this directory have .dat extension
custom NeXus/HDF5 files in this directory have .hdf5, .nx, or .nxs extension
file starts with recognized pattern

Even if that handling is better suited to a class, the optimizing class would be called from the detect_mimetype() function. Seems straightforward.
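A minimal sketch of such an optimizing class, assuming that files sharing a directory and an extension share a format, so only the first one needs to be opened. DirectoryPatternCache and sniff_content are hypothetical names, and the shortcut trades accuracy for speed: a directory with mixed .dat content would be misclassified.

```python
# custom.py
from pathlib import Path


def sniff_content(filepath):
    """Expensive step: open the file and look at its first bytes."""
    try:
        with open(filepath, "rb") as f:
            head = f.read(512)
    except OSError:
        return None
    if head.startswith(b"#F "):  # assumption: SPEC file header line
        return "text/x-specfile"
    return None


class DirectoryPatternCache:
    """Remember the sniffed type per (directory, extension) pair."""

    def __init__(self, sniff):
        self._sniff = sniff  # callable: filepath -> mimetype or None
        self._known = {}     # (directory, extension) -> mimetype or None

    def lookup(self, filepath):
        path = Path(filepath)
        key = (str(path.parent), path.suffix.lower())
        if key not in self._known:
            self._known[key] = self._sniff(filepath)
        return self._known[key]


_cache = DirectoryPatternCache(sniff_content)


def detect_mimetype(filepath, mimetype):
    if mimetype is None:
        mimetype = _cache.lookup(filepath)
    return mimetype
```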

prjemian · 2022-08-28T16:42:41Z

Another optimization:

directory contains a file that provides the mime type mapping
the mapping file could be created manually or by a previous run
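One way this could look, as a sketch: a JSON file in each directory that maps file name patterns to mime types. The file name .mimetypes.json and the pattern syntax (shell-style globs via fnmatch) are hypothetical choices, not an existing tiled convention.

```python
import fnmatch
import json
from pathlib import Path

MAPPING_FILENAME = ".mimetypes.json"  # hypothetical name


def load_mapping(directory):
    """Return the {pattern: mimetype} mapping for a directory, or {} if none."""
    try:
        data = json.loads((Path(directory) / MAPPING_FILENAME).read_text())
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def mimetype_from_mapping(filepath):
    """Return the mimetype of the first matching pattern, or None."""
    path = Path(filepath)
    for pattern, mimetype in load_mapping(path.parent).items():
        if fnmatch.fnmatch(path.name, pattern):
            return mimetype
    return None
```

A mapping file containing {"*.dat": "text/x-specfile"} would then settle every .dat file in that directory without opening any of them.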

danielballan · 2022-08-29T10:37:41Z

This aligns with two optimizations I have been working on:

Stash the detection results (and metadata) in an index file so that the detection only has to happen once for each new file, not repeatedly on every tiled server startup.
Enable the user to explicitly index (or would a better term be “register”?) certain files or directories with a new command like tiled register. This would give the user the opportunity to provide additional guidance on how to handle those specific files or directories, perhaps tiled register dir/ --ext .h5=application/x-nexus. That may be easier for users than going through trial-and-error to guide an automated detection scheme.
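To make the first idea concrete, here is a rough sketch of a detection index, assuming a JSON file keyed by path and invalidated when a file's size or modification time changes. DetectionIndex is a hypothetical class for discussion, not the design being worked on in tiled.

```python
import json
import os


class DetectionIndex:
    """Persist detection results so each unchanged file is sniffed only once."""

    def __init__(self, index_path):
        self.index_path = index_path
        try:
            with open(index_path) as f:
                self.entries = json.load(f)
        except (OSError, ValueError):
            self.entries = {}  # no usable index yet

    def detect(self, filepath, detector):
        key = os.fspath(filepath)
        stat = os.stat(key)
        fingerprint = [stat.st_size, stat.st_mtime_ns]
        entry = self.entries.get(key)
        if entry is None or entry["fingerprint"] != fingerprint:
            # New or modified file: run the (expensive) detector again.
            entry = {"fingerprint": fingerprint, "mimetype": detector(key)}
            self.entries[key] = entry
        return entry["mimetype"]

    def save(self):
        with open(self.index_path, "w") as f:
            json.dump(self.entries, f)
```

Calling save() at shutdown and constructing a new DetectionIndex at startup would let a restarted server skip detection for files that have not changed.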

prjemian · 2022-08-29T16:10:28Z

The local mapping may provide more flexibility. Our directories tend to have mixed content such that an ignore setting would be good for Python, SPEC macro, MATLAB procedures, Igor Pro procedures, text, markdown, ... But then, this is just another aspect of a custom handler.

Unless you have some specifics in mind, let's work up some custom handlers and compare.

danielballan · 2022-08-29T18:11:02Z

Sounds good, let’s!

prjemian added the enhancement New feature or request label Feb 18, 2022

This was referenced Jul 18, 2022

Wildcard for mimetypes_by_file_ext? #255

Closed

Add mimetype_detection_hook. #259

Merged

danielballan closed this as completed in #259 Jul 27, 2022
