Explore other ways to identify file types #175

Closed
prjemian opened this issue Feb 18, 2022 · 9 comments · Fixed by #259
Labels
enhancement New feature or request

Comments

@prjemian
Contributor

When serving a directory of files, there may exist valid data files that lack a feature in the file name (such as a file extension) to identify the type of file. For example, there is no common file extension for SPEC data files and some users are accustomed to omitting a file extension. As shown in #174, the file extension may be too complicated to examine or not one of the recognized values. The .dat and .txt extensions are also used for various types of data files, including CSV.

We need some programmatic technique to identify the type of file, similar to the UNIX file command. Python examples include is_spec_file(filename) and isNeXusFile(filename).

Such routines could be called with unrecognized files.
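As a rough illustration of what such content-based checks might look like, here is a minimal sketch. The function names looks_like_spec and looks_like_hdf5 are hypothetical (they are not the routines named above). The sketch assumes a SPEC data file begins with a "#F " control line, and that the HDF5 signature sits at byte 0 (no user block). Note that the signature only shows the file is HDF5; confirming NeXus would require opening the file and inspecting its contents (for example, NX_class attributes), which is not attempted here.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # 8-byte HDF5 format signature


def looks_like_hdf5(filename):
    """Return True if the file starts with the HDF5 signature.

    Assumption: the file has no user block, so the signature is at byte 0.
    """
    with open(filename, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def looks_like_spec(filename):
    """Return True if the first line looks like a SPEC '#F' control line.

    Assumption: SPEC data files start with '#F <name>'. Real files may
    deviate, so treat this as a heuristic only.
    """
    with open(filename, "rb") as f:
        first_line = f.readline(1024)  # read at most 1024 bytes
    return first_line.startswith(b"#F ")
```

Both checks read only a few bytes, so they stay cheap even when called on every unrecognized file in a directory.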

@prjemian prjemian added the enhancement New feature or request label Feb 18, 2022
@prjemian
Contributor Author

The additional technique could be inserted into this block:

if ext in mimetypes_by_file_ext:
    mimetype = mimetypes_by_file_ext[ext]
else:
    # Use the Python's built-in facility for guessing mimetype
    # from file extension. This loads data about mimetypes from
    # the operating system the first time it is used.
    mimetype, _ = mimetypes.guess_type(path)

It might be better to search known mimetypes first, since identification by file content is the more expensive operation. Associate each identified file type with an ad hoc, unique mimetype.
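To make the proposed ordering concrete, here is a minimal sketch, not the actual tiled code. The function resolve_mimetype and its sniffers parameter are hypothetical, and computing ext with Path.suffix is an assumption about how the excerpt above obtains it. Cheap lookups by extension run first; the content-based sniffers run only if both come up empty.

```python
import mimetypes
from pathlib import Path


def resolve_mimetype(path, mimetypes_by_file_ext, sniffers=()):
    """Return a MIME type for path, or None if nothing matches.

    Order: known extensions, then Python's mimetypes module, then
    content-based sniffers (hypothetical callables: path -> str or None).
    """
    ext = Path(path).suffix
    if ext in mimetypes_by_file_ext:
        return mimetypes_by_file_ext[ext]
    mimetype, _ = mimetypes.guess_type(str(path))
    if mimetype is not None:
        return mimetype
    # Only now pay the cost of opening and reading the file.
    for sniff in sniffers:
        mimetype = sniff(path)
        if mimetype is not None:
            return mimetype
    return None
```

The drawback of this ordering is that an overloaded extension such as .dat is never sniffed if it already maps to a type, so a deployment may prefer to let the content check overrule the extension.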

@danielballan
Member

I support this. I intentionally kept it simple to start, looking at file extension only, but I agree it's time to enable more sophisticated techniques.

I propose to add a configuration setting:

# config.yml
...
mimetype_detection_hook: my_custom_module:my_sniffer

which would enable you and anyone to experiment with this outside the tiled package like this:

# my_custom_module.py

def my_sniffer(filepath):
    ...
    return "..."

The function may inspect the filename and, if it needs to, open the file and read as many bytes as it wants to. The return value should be a MIME type, either a registered one like text/csv or a custom one like text/x-specfile.

This would override the code you excerpted above, so it would be in total control over how types were determined. It could decide whether to copy the mimetype search approach as a first pass or to overrule it.
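For readers wondering how a "module:function" string from a config file can be turned into a callable, here is one common approach, sketched with the standard library. This is an assumption about the mechanism, not a description of how tiled actually loads the hook; load_hook is a hypothetical name.

```python
import importlib


def load_hook(spec):
    """Resolve a 'package.module:attribute' string to the named object."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

With the config shown above, load_hook("my_custom_module:my_sniffer") would return the my_sniffer function, provided my_custom_module is importable.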

If people developed "sniffers" that prove to be generally useful, we can always move them into tiled proper at some later point. Either way, I think it will be important to enable people who deploy tiled to customize the sniffer behavior like this on their own.

What do you think?

@prjemian
Contributor Author

That seems very general. I like it.

@danielballan
Member

@prjemian This is now implemented in v0.1.0a67 and documented at https://blueskyproject.io/tiled/how-to/read-custom-formats.html.

Let me know if you get a chance to try it out on SPEC or NeXus.

@prjemian
Contributor Author

Starting to look at this now. Case 2 is the most likely scenario since our data files may have extensions. Yet the extension cannot be trusted to be informative when it is overloaded for various data formats (for example, .dat could be SPEC, CSV, binary, ...; .h5 could be NeXus, Data Exchange, or other).

The interface is called for each file:

# custom.py

def detect_mimetype(filepath, mimetype):
    if mimetype is None:
        # If we are here, detection based on file extension came up empty.
        ...
        mimetype = "text/csv"
    return mimetype

While this could become expensive when iterating over a directory structure with many similar files (a typical pattern), it could be optimized. One optimization (in the custom handler) would be to recognize that files in a directory likely follow a pattern, such as any combination of these rules:

  • all files in this directory are [this known format], regardless of naming style
  • NeXus/HDF5 area detector files in this directory have .h5 extension
  • SPEC files in this directory have .dat extension
  • custom NeXus/HDF5 files in this directory have .hdf5, .nx, or .nxs extension
  • file starts with recognized pattern

Even if that handling is better suited to a class, the optimizing class would be called from the detect_mimetype() function. Seems straightforward.
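A minimal sketch of that arrangement follows. DirectoryPatternCache and sniff_first_bytes are hypothetical names, and the cache assumes that files sharing a directory and an extension share a format, which will be wrong for mixed directories. Only the detect_mimetype(filepath, mimetype) signature comes from the interface above; the "#F " test for SPEC and the application/x-hdf5 label are assumptions.

```python
import os

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def sniff_first_bytes(filepath):
    """Guess a MIME type from the first bytes of the file, or return None."""
    with open(filepath, "rb") as f:
        head = f.read(8)
    if head == HDF5_SIGNATURE:
        return "application/x-hdf5"  # custom, unregistered type
    if head.startswith(b"#F "):  # assumed SPEC header convention
        return "text/x-specfile"
    return None


class DirectoryPatternCache:
    """Remember the sniffed type for each (directory, extension) pair."""

    def __init__(self, sniff):
        self._sniff = sniff  # callable: filepath -> MIME type or None
        self._known = {}

    def lookup(self, filepath):
        directory, name = os.path.split(os.path.abspath(filepath))
        key = (directory, os.path.splitext(name)[1])
        if key not in self._known:
            # First file of this kind in this directory: read its content.
            self._known[key] = self._sniff(filepath)
        return self._known[key]


_cache = DirectoryPatternCache(sniff_first_bytes)


def detect_mimetype(filepath, mimetype):
    """Prefer the sniffed type; fall back to the extension-based guess."""
    sniffed = _cache.lookup(filepath)
    return sniffed if sniffed is not None else mimetype
```

Only the first file for each directory and extension is opened; later files with the same key reuse the stored answer, which is where the time savings would come from.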

@prjemian
Contributor Author

Another optimization:

  • directory contains a file that provides the mime type mapping
  • the mapping file could be created manually or by a previous run
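One way this could look, as a sketch only: the file name .mimetypes.json and its layout (a JSON object mapping file names to MIME types) are made up for illustration, as are load_mapping and save_mapping. Only the detect_mimetype(filepath, mimetype) signature comes from the interface above.

```python
import json
import os

MAPPING_FILENAME = ".mimetypes.json"  # hypothetical name and format


def load_mapping(directory):
    """Read the per-directory mapping, or return {} if there is none."""
    path = os.path.join(directory, MAPPING_FILENAME)
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_mapping(directory, mapping):
    """Write the mapping so a later run (or a person) can reuse or edit it."""
    path = os.path.join(directory, MAPPING_FILENAME)
    with open(path, "w") as f:
        json.dump(mapping, f, indent=2, sort_keys=True)


def detect_mimetype(filepath, mimetype):
    """Use the mapping entry for this file if present, else keep the guess."""
    directory, name = os.path.split(os.path.abspath(filepath))
    return load_mapping(directory).get(name, mimetype)
```

As written, the mapping file is re-read on every call; a real handler would probably cache it per directory.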

@danielballan
Member

danielballan commented Aug 29, 2022

This aligns with two optimizations I have been working on:

  • Stash the detection results (and metadata) in an index file so that the detection only has to happen once for each new file, not repeatedly on every tiled server startup.
  • Enable the user to explicitly index (or would a better term be “register”?) certain files or directories with a new command like tiled register. This would give the user the opportunity to provide additional guidance on how to handle those specific files or directories, perhaps tiled register dir/ --ext .h5=application/x-nexus. That may be easier for users than going through trial-and-error to guide an automated detection scheme.
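To illustrate the kind of guidance an option like --ext .h5=application/x-nexus would carry, here is a small sketch that turns such strings into an extension-to-mimetype mapping. The command and option are only proposed above, so this is purely hypothetical: parse_ext_overrides is an invented helper and does not reflect any existing tiled CLI.

```python
def parse_ext_overrides(items):
    """Turn ['.h5=application/x-nexus', ...] into {'.h5': 'application/x-nexus'}."""
    overrides = {}
    for item in items:
        ext, sep, mimetype = item.partition("=")
        if not sep or not ext.startswith(".") or not mimetype:
            raise ValueError(f"expected '.ext=mimetype', got {item!r}")
        overrides[ext] = mimetype
    return overrides
```

The resulting dictionary has the same shape as mimetypes_by_file_ext in the excerpt earlier in the thread, so per-directory overrides could be consulted before the general lookup.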

@prjemian
Contributor Author

The local mapping may provide more flexibility. Our directories tend to have mixed content such that an ignore setting would be good for Python, SPEC macro, MatLab procedures, IgorPro procedures, text, markdown, ... But then, this is just another aspect of a custom handler.

Unless you have some specifics in mind, let's work up some custom handlers and compare.

@danielballan
Member

Sounds good, let’s!
