Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add mimetype_detection_hook. #259

Merged
merged 9 commits into from
Jul 27, 2022

Conversation

danielballan
Copy link
Member

@danielballan danielballan commented Jul 21, 2022

@J-avery32 Would you check out this branch and give it a try?

pip install git+https://github.com/danielballan/tiled@mimetype-detection-hook

You need to place a Python script next to your config.yml file. It can be named anything:

# custom.py
def detect_mimetype(path):
    ...

That function has access to the file path and it can do anything it wants to decide what the mimetype is, including opening the file.

To tell DirectoryAdapter to use it, add this to the config.yml:

...
- path: /
  tree: files
  args:
    directory: ...
    mimetype_detection_hook: custom:detect_mimetype

If that function returns None, then the DirectoryAdapter will fall back to its usual method of looking at file extensions.

By default, Tiled will strip everything after the first . from the name. That is, thing.csv will appear as just thing. We do this to avoid "leaking" to the user details about how data happens to be stored, so that it can change over time without breaking our contract with the user.

In your case, dropping everything after the . may not be what you want. You can add:

...
- path: /
  tree: files
  args:
    directory: ...
    mimetype_detection_hook: custom:detect_mimetype
    key_from_filename: tiled.adapters.files:identity

Or write your own function in custom.py for transforming the filename into whatever you want to name this node in Tiled:

...
- path: /
  tree: files
  args:
    directory: ...
    mimetype_detection_hook: custom:detect_mimetype
    key_from_filename: custom:key_from_filename

If this works for you we'll add documentation and merge.

Closes #255
Closes #175

@J-avery32
Copy link

Yes I'll try it right now

@J-avery32
Copy link

Not sure if this is asking too much, but is there a way to set precedence? For example, making it so that this function will only be used after tiled has exhausted its default mimetypes and the mimetypes_by_file_ext option?

@J-avery32
Copy link

Ideally that order could go in any direction.

@J-avery32
Copy link

This way I don't have to return None for every other file extension I might encounter in my folder.

@danielballan
Copy link
Member Author

Definitely not asking too much. :-)

I did give precedence some thought, but I am not 100% I got it right. My thinking was that the function go first so it can get in before the file extension detector might misidentify something.

I had in mind that it would look for a certain naming pattern, perhaps using regex in your case, and then just do nothing if it sees nothing it recognizes, implicitly returning None. The example in my unit test in this PR works that way: isn’t filed that don’t match what it is looking for are effectively ignored, and no extra code is required to handle that.

But if for any reason you need to run the file extension detection first, you can copy that part of the tiled code directly inside your custom function, at the top.

@J-avery32
Copy link

Perhaps instead of copying the code would it be possible to wrap the other checker in a function that I can import?

@J-avery32
Copy link

Though this is picky on my part, and I can do it any of the other ways.

@J-avery32
Copy link

How would I access the variable mimetypes_by_file_ext in my function? Is there a way to import this?

@danielballan
Copy link
Member Author

danielballan commented Jul 21, 2022

You'd have to duplicate it. Alternatively, we could provide it as an argument to the function, i.e.

def detect_mimetype(path, mimetypes_by_file_ext):
    ...

That occurred to me but seemed a little...overcooked. I think it depends on how frequently we need to do the mimetype check first. What does you function look like without the mimetype check?

@J-avery32
Copy link

J-avery32 commented Jul 21, 2022

Without a mimetype check I would probably use a regex. But I'm not sure how guaranteed it is that all the files will have the numbers as their extension. How does duplicating it look like? Do you mean parsing the yaml within the detect_mimetype function?

@J-avery32
Copy link

Also I am getting this error:

Traceback (most recent call last):

  File "/home/j/programming/work/tiled_als/venv/bin/tiled", line 8, in <module>
    sys.exit(main())

  File "/home/j/programming/work/tiled_als/venv/lib/python3.8/site-packages/tiled/commandline/main.py", line 613, in serve_config
    kwargs = construct_build_app_kwargs(parsed_config, source_filepath=config_path)

  File "/home/j/programming/work/tiled_als/venv/lib/python3.8/site-packages/tiled/config.py", line 122, in construct_build_app_kwargs
    tree = obj(**item["args"])

  File "/home/j/programming/work/tiled_als/venv/lib/python3.8/site-packages/tiled/adapters/files.py", line 330, in from_directory
    reader_factory = _reader_factory_for_file(

  File "/home/j/programming/work/tiled_als/venv/lib/python3.8/site-packages/tiled/adapters/files.py", line 677, in _reader_factory_for_file
    mimetype = mimetype_detection_hook(path)

TypeError: 'str' object is not callable

It seems that the mimetype_detection_hook is a string and not the actual function.

@danielballan
Copy link
Member Author

Oops, yes I missed something important there. Please stand by.

@danielballan
Copy link
Member Author

Fix for str issue pushed above.

I meant hard-coding the dictionary of custom mimetypes in the definition of the function, duplicating whatever is in the config. Not ideal, but possibly better than the alternative of overly-magical complexity in the precedence rules.

Concrete use cases will be very helpful in landing on a good design. Does your directory mix standard files like TIFF or CSV with these unusually-named ones?

@J-avery32
Copy link

Yes, I have a csv file mixed in with them.

@J-avery32
Copy link

It would not be too hard to parse the config file and then duplicate the mimetypes in that way.

@J-avery32
Copy link

Thanks, it seems to be working from the python client, however for some reason the browser is now only returning 404s for the browse UI. Not sure if it's just on my end though.

@danielballan
Copy link
Member Author

The distributions on PyPI include the pre-built React app. When you install from GitHub you need to build the React app yourself: see web-frontend/README.md.

@J-avery32
Copy link

Everything seems to work!

@danielballan
Copy link
Member Author

I thought of a new possibility that seems a better balance of flexibility and simplicity. My goals are:

  • avoid duplicated code
  • avoid duplicated effort
  • make it possible to override extension-based detection sometimes
  • avoid having multiple hooks or anything too hard to explain

What if the file extension detection runs first and then calls your hook with two parameters: the path and the mimetype. If it doesn’t match, the mimetype with be None. Therefore, if you only care about files that do not match based on the extension you can do

def detect_mimetype(path, mimetype):
    if mimetype is None:

But, in other situations, it is still possible to override the mimetype detected based on ext if it was wrong.

@J-avery32
Copy link

That looks perfect. Trying it out now.

@J-avery32
Copy link

Oh wait you haven't implemented it yet lol.

@danielballan
Copy link
Member Author

Haha, yeah. Implemented and pushed now. If it works for you, I'll update the documentation and merge.

Wrote it out while it's fresh in my brain...now on to the weekend! 🌴

@J-avery32
Copy link

Works for me! Not sure why the tests are failing though.

@danielballan
Copy link
Member Author

Looks like an unrelated issue started failing the unit tests. I've been meaning to adjust that anyway; fix pushed.

@danielballan
Copy link
Member Author

@J-avery32 The documentation you used is sort of a "case study" that covers a specific situation, and it gets into some details that might not be immediately needed by all users.

I tried writing more entry-level documentation here: https://github.com/bluesky/tiled/pull/259/files#diff-2a446760a1636bbd9aae3291ce4a430b145cda1ec4fe01f33e1a5e829a8f0b2a

Do you think it is understandable and useful?

@danielballan
Copy link
Member Author

@J-avery32
Copy link

Yes it is readable to me.

@danielballan danielballan merged commit 34ede8a into bluesky:main Jul 27, 2022
@danielballan danielballan deleted the mimetype-detection-hook branch July 27, 2022 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wildcard for mimetypes_by_file_ext? Explore other ways to identify file types
2 participants