Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NPI-3294 - Updates to file properties extraction to allow broader use, addition of file name vs file contents check functionality #21

Merged
merged 19 commits into from
Aug 2, 2024

Conversation

treefern
Copy link
Collaborator

@treefern treefern commented May 7, 2024

This PR splits out file properties extraction code from filename prediction code, to allow it to also be used in other checks (e.g. to check whether the timespan that a filename implies, matches what is found in the file).

…ediction code, to allow it to also be used in file checks
@treefern treefern requested a review from ronaldmaj May 7, 2024 05:35
@treefern treefern self-assigned this May 7, 2024
ronaldmaj
ronaldmaj previously approved these changes May 7, 2024
return generate_IGS_long_filename(**name_properties)


def determine_file_props(file_path: pathlib.Path, defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be: 'determine_file_properties`? Just to follow the general convention of expanding out all abbreviations

@@ -722,6 +753,35 @@ def determine_name_props_from_filename(filename: str) -> Dict[str, Any]:
}


def warn_on_unexpected_filename(input_file: pathlib.Path) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A docstring would be good here

f"didn't match expected: '{expected_file_name}'. "
"Contents may be incorrect and lead to failures."
)
return False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking about the name of the function warn_on..., should this return as True when it does find an issue? (i.e. when it should warn)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would concur and further suggest that warning is not really the responsibility of the library code. Detecting differences is probably in scope but warning is something that should be done in application code (like the ops comparison system).
Given that I'm wondering if this file should either just compare two names (at which point it's relatively simple and maybe could disappear?) or maybe tell you exactly which name properties differ between what's detected and what the current name is (if we can extract properties from the filename)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. Detecting which bit differed is relatively easy given the properties dict...
Would you surface that information through an exception, if not a log message?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just return the information in some format (maybe a possible empty list of items that disagree). This is library code and so in my head these tests are equivalent to specialised versions of == and you wouldn't ever expect a comparison operator to log or throw an exception, it's just making a comparison and then what to do about that comparison is really up to application code, in our case scripts in ginan or in the operations code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restructured to provide a discrepancy checker function here in gnssanalysis. That is now leveraged by a wrapper function in the other code which logs errors in that context. Makes it more generic, and also more streamlined.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it is not leveraged in other codebases at present. So for now the new and changed functionality appears unused.



def check_file_timespan_as_claimed(input_file: pathlib.Path) -> bool:
try:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring

return generate_IGS_long_filename(**name_properties)


def determine_file_props(file_path: pathlib.Path, defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaults and overrides can and probably should default to empty dictionaries (this is true for the parent function as well). To avoid mutable-default issues it probably then needs a defaults = None if defaults is None: defaults = {} idiom.
In hindsight the type can probably just be a Optional[Mapping[str, Any]] and I can't remember whether that Any can be tightened up further.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

return False # Assume it's bad, as we can't verify it's good.
claimed_timespan:datetime.timedelta = determine_name_props_from_filename(input_file.name)["timespan"]

if claimed_timespan != actual_timespan:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above, I feel slightly uncomfortable around the logging messages in library code and the assumptions made for unsupported files. These have the feeling of being domain/application specific. If you then remove that logging and catching, this function boils down to a slightly nicer return determine_file_props(input_file)["timespan"] == determine_name_props_from_filename(input_file.name)["timespan"] and I'm not sure whether that's justifying its existence?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, see above comment.

@treefern treefern changed the title NPI-3294 - Minor restructure of file properties extraction to allow broader reuse NPI-3294 - Updates to file properties extraction to allow broader use, addition of file name vs file contents check functionality May 7, 2024
treefern added 2 commits May 8, 2024 05:38
… properties function. Make sampling rate calculation consistent by disabling duplicated logic on one code path.
@ronaldmaj
Copy link
Collaborator

This PR hasn't been touched in a while. @treefern it looks like you've resolved all of @jashlearn 's concerns so I would just resolved the conflicts that now exist between this branch and main and we can probably push this through as well

@@ -7,7 +7,7 @@

# The collections.abc (rather than typing) versions don't support subscripting until 3.9
# from collections import Iterable
from typing import Iterable
from typing import Iterable, Mapping
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line, and the one below are "from typing", can be merged

Copy link
Collaborator

@ronaldmaj ronaldmaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the looks of it, the main change is a refactor determine_name_props_from_filename() to determine_properties_from_filename() and splitting out the determine_file_name() function to include a new function called determine_properties_from_contents()

There are a couple functions introduced that aren't used anywhere as well.
@treefern If you've tested this and all seems to work as before, it might be good to write up unittests for functions touched by this PR

@treefern
Copy link
Collaborator Author

treefern commented Aug 2, 2024

Initial set of unit tests added. One undiscovered bug was unearthed in the process, and fixed.

It would be good to review what values should be returned by these functions when a definitive answer can't be found.

Copy link
Collaborator

@ronaldmaj ronaldmaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is in a good enough state to push through. We'll need to sort out the values for the unit-tests but we can do that on the next iteration


derived_filename = filenames.determine_file_name(test_sp3_file)

# Computed at time of wrting. Seems valid, but FIL and EXP are a bit odd.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me that it is coming out as FIL because those are the first three characters of the input filename: file1.sp3. Generally the first three characters are for the AC

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Computed at time of wrting. Seems valid, but FIL and EXP are a bit odd.
# Computed at time of writing. Note that Analysis Centre is derived from the first three chars
# of the filename, which is why we ended up with `FIL` here.

derived_filename = filenames.determine_file_name(test_sp3_file)

# Computed at time of wrting. Seems valid, but FIL and EXP are a bit odd.
expected_filename = "FIL0EXP_20242010000_05M_05M_ORB.SP3"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EXP I am not sure of - perhaps because there is no further info in the filename, it assumes an experimental file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together so quickly! It looks good enough to me. We can work out the proper values for the unit-tests in the next iteration of PRs for this

@ronaldmaj ronaldmaj merged commit ff25255 into main Aug 2, 2024
1 check passed
@ronaldmaj ronaldmaj deleted the NPI-3294-split-out-file-properties-extraction branch August 2, 2024 12:28
@treefern
Copy link
Collaborator Author

treefern commented Aug 5, 2024

Multiple issues identified after merging this. These should have no impact as the relevant code wasn't yet utilised. New PR to follow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants