Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] IO - Change origin attribute when not find on system #6555

Merged
merged 7 commits into from
Oct 6, 2023

Conversation

PrimozGodec
Copy link
Contributor

@PrimozGodec PrimozGodec commented Aug 28, 2023

Issue

For example, files that Orange loads (with the File widget) may contain paths in one or more columns—for example, the path to images for image analysis. One way to store paths may be to keep them as origin prefixes in attributes of attribute and the last part of the path as column values. Paths (origin + column values) are absolute paths. When the table is transferred to another computer, the path may not be valid anymore.

Description of changes

If the dataset's author provides files besides the CSV file, they may be discovered, and the origin can be fixed.

Additionally, I replaced StringIO reader inputs with actual files in the test. The code should not be adapted to cases only possible in tests.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov
Copy link

codecov bot commented Aug 28, 2023

Codecov Report

Merging #6555 (ad5dc9a) into master (0a9c762) will increase coverage by 0.00%.
Report is 50 commits behind head on master.
The diff coverage is 96.55%.

❗ Current head ad5dc9a differs from pull request most recent head 72ccefe. Consider uploading reports for the commit 72ccefe to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6555   +/-   ##
=======================================
  Coverage   87.78%   87.78%           
=======================================
  Files         321      321           
  Lines       69420    69445   +25     
=======================================
+ Hits        60938    60962   +24     
- Misses       8482     8483    +1     

@PrimozGodec PrimozGodec force-pushed the io-update-origin branch 2 times, most recently from 1764fde to e81adb2 Compare August 29, 2023 08:59
@PrimozGodec PrimozGodec changed the title IO - Change origin attribute when not find on system [ENH] IO - Change origin attribute when not find on system Aug 29, 2023
@PrimozGodec PrimozGodec force-pushed the io-update-origin branch 9 times, most recently from b189e71 to ad5dc9a Compare August 30, 2023 14:38
@PrimozGodec
Copy link
Contributor Author

/rebase

@PrimozGodec PrimozGodec marked this pull request as ready for review August 31, 2023 09:26
@janezd janezd self-assigned this Sep 1, 2023
Copy link
Contributor

@janezd janezd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'm not very happy about this guessing, but something has to be done and I have no better idea. It would probably do the work.

Orange/data/io_util.py Outdated Show resolved Hide resolved

# all column paths in lookup dirs
for ld in lookup_dirs:
if all(os.path.exists(os.path.join(ld, v)) for v in table.get_column(attr)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we would skip unknown values here, e.g. by adding if v?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. I also added the test case for it.

Orange/data/io_util.py Outdated Show resolved Hide resolved
Orange/data/io_util.py Outdated Show resolved Hide resolved
file_dir = os.path.dirname(file_path)
parent_dir = os.path.dirname(file_dir)
# if file_dir already root file_dir == parent_dir
lookup_dirs = tuple({file_dir, parent_dir})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to look into file_dir first, and only then into parent_dir? Sets are unordered.

If you want to keep it short, use tuple({file_dir: 0, parent_dir: 0}). :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Fixed

parent_dir = os.path.dirname(file_dir)
# if file_dir already root file_dir == parent_dir
lookup_dirs = tuple({file_dir, parent_dir})
for attr in table.domain:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just table.domain.metas?

Not only because of efficiency; I'm never sure whether a loop over domain includes metas or not. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# if file_dir already root file_dir == parent_dir
lookup_dirs = tuple({file_dir, parent_dir})
for attr in table.domain:
if "origin" in attr.attributes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add if attr.is_string as precaution. Is someone uses origin for something else this could prevent some false positives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@janezd janezd removed their assignment Sep 8, 2023
@PrimozGodec
Copy link
Contributor Author

Tests fail because of xgboost release. Fixed in #6570

@PrimozGodec
Copy link
Contributor Author

/rebase

@janezd janezd self-assigned this Sep 22, 2023
@janezd janezd merged commit ab0c155 into biolab:master Oct 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants