Some output still over two CSV rows for multiple matches on a format #75
Comments
Does yours look something like example.xlsx? There looks to be a minor bug here with the repetition of the pronom namespace, but otherwise it is outputting as expected. The change in the change notes was to give a separate set of columns for each identifier, but not for each identification. I.e. the old output would have had even more rows and just the first 11 columns; the new output gives additional columns for the two additional identifiers (tika and freedesktop), with rows that total the maximum identifications from any single identifier, plus padding for the other identifiers where they give fewer results. Does that make sense? Otherwise, in multiple identification scenarios it would be impossible to know ahead of time how many columns would be in the CSV. I had a thought in the shower today that it might be worth adding a -nomulti (no multiple identifications) flag to roy. This would force identifiers to return only a single result; when they encounter multiple identifications they would return UNKNOWN instead, plus a descriptive warning giving the possible matches. Would this help in your use case?
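To make the row/padding layout described above concrete, here is a minimal Python sketch. It is not siegfried's code: the function name `pad_rows`, the identifier names used as keys, and the sample format IDs are all made up for illustration. It only shows the idea of one column per identifier, with as many rows as the longest result list and blank padding elsewhere.

```python
from itertools import zip_longest


def pad_rows(path, ids_by_identifier):
    """Build CSV-style rows for one file (hypothetical layout, not sf's real one).

    ids_by_identifier maps an identifier name to its list of identifications.
    The number of rows equals the longest list; shorter lists are padded with "".
    """
    columns = list(ids_by_identifier.values())
    return [[path, *cells] for cells in zip_longest(*columns, fillvalue="")]


# Illustrative data only: two PRONOM matches, one match each for the others.
rows = pad_rows(
    "example.doc",
    {
        "pronom": ["fmt/40", "fmt/111"],
        "tika": ["application/msword"],
        "freedesktop": ["application/msword"],
    },
)
for row in rows:
    print(row)
```

The column count stays fixed whatever the number of matches, which is the property the comment above is after; only the row count varies per file.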
Hi Richard, Looking at your reply, it seems it wasn't a bug. But it does seem like you found one! I like -nomulti as it normalizes the shape of the CSV output if we're using CSV. Could the label be 'MULTI' or 'MANY' or similar instead of 'UNKNOWN'? I just spoke with the team here, and it sounds promising, so if it's put into the dev branch I'd output some samples and get them tested too. I'll be trying to work with your YAML output for that slight potential of multiple identifications that I've been pushing SF to return. I think the YAML is a bit more intuitive, but I would like to offer CSV analysis too. It'd be good to hear @timothyryanwalsh's thoughts too, as he's working with the CSV output already.
Could definitely have a new MULTIPLE return value instead of UNKNOWN - like that! Rather than introducing a At the moment, multiple identifications are only returned where you get multiple IDs with same priority/weight. This is actually quite rare (esp. for PRONOM which has a fairly complete set of priority relations between formats). The effect of Introducing a The effect of the original proposal would be:
A simpler approach might be just to rename
The advantage of the first approach is that it retains current behaviour (item (2) in the first list), but at the cost of complexity. Would users realistically want all those choices, or are the two latter choices the only ones users would really care about?
Trying to work through this, the latter two behaviours seem to be the most intuitive to me. I don't think the flag name needs to be changed. -nopriority works well, it's understandable. I'm a bit torn on any change, as the benefit seems to be for the SF CSV, where having a file's details across two rows is ever so slightly more difficult to handle: one has to filter path for unique values first, and then display all identification results. I think that's where the real benefit comes in: removing the repetition on rows and making all the data available in a single field as a warning. Does that help your thinking too?
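The extra handling mentioned above (collect the unique paths first, then show every identification for each) can be sketched in a few lines of Python. This is only an illustration: the column names `filename` and `id` and the sample rows are assumptions, not a statement of what sf's CSV actually contains.

```python
import csv
import io
from collections import defaultdict


def group_by_path(csv_text, path_column="filename"):
    """Group CSV rows by file path so multi-row files can be viewed together.

    path_column is an assumed header name; adjust it to the real CSV.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row[path_column]].append(row)
    return dict(grouped)


# Illustrative CSV: example.doc is spread over two rows.
sample = (
    "filename,id\n"
    "example.doc,fmt/40\n"
    "example.doc,fmt/111\n"
    "other.pdf,fmt/18\n"
)
grouped = group_by_path(sample)
for path, rows in grouped.items():
    print(path, [r["id"] for r in rows])
```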
Hi Richard and Ross! Jumping in a bit late here (still working through all the vacation email). I'm in favor of Richard's last suggestion to rename -nopriority to -multi and simplify to the two rules. It seems like the simplest solution, and like Richard I am also struggling to think of a practical use case for when having both flags would be desirable. I'm also on board with Ross' last comment: removing repetition seems key for processing the CSV outputs, and if multiple identification with equal weight consistently throws a warning and a value of MULTIPLE, it'll be easy to flag these for review.
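If the output did end up with one row per file and a MULTIPLE value as discussed, flagging those rows for review could look roughly like the Python sketch below. The MULTIPLE marker is the value proposed in this thread; the column names (`filename`, `id`, `warning`) and the sample data are hypothetical.

```python
import csv
import io


def rows_needing_review(csv_text, id_column="id", marker="MULTIPLE"):
    """Return rows whose identification equals the marker value.

    id_column and marker are assumptions based on this discussion,
    not a confirmed part of any released CSV format.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[id_column] == marker]


# Illustrative CSV: one normal identification, one flagged row.
sample = (
    "filename,id,warning\n"
    "a.doc,fmt/40,\n"
    "b.bin,MULTIPLE,possible matches listed here\n"
)
flagged = rows_needing_review(sample)
for row in flagged:
    print(row["filename"], row["warning"])
```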
Thx for chiming in, Tim. I'll aim to get a small 1.5.1 release out in the next few weeks and will include this change.
sf 1.6.0 has a new -multi flag. It has 5 levels, -multi 0 through to -multi 4:
Hi Richard,
Example XSL snippet/document that I have that's still outputting to CSV over two rows. I'm using a sig with PRONOM, Tika, and FreeDesktop, and -nopriority.
That seems to be contrary to the change log, but I'm not sure what the desired CSV behavior is meant to be.
Ross