Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make a default data directory #303

Merged
merged 1 commit into from
Jan 9, 2024

Conversation

greglucas
Copy link
Collaborator

Change Summary

Overview

This adds a global data directory to maintain all files underneath that is user-settable via an environment variable or command line argument. The purpose is to start moving towards the same logical folder structure we are using in AWS, but locally so all our file management is consistent. I ended up touching more than I wanted to here to try and add some tests and bring the current tests into pathlib parity. I'm happy to try and break some of this out if it is too large for a review now.

New Dependencies

n/a

New Files

n/a

Deleted Files

n/a

Updated Files

Many updates to various path handling functions.

Testing

I added a few subprocess calls to verify that instantiating via environment variable works as expected.

# Eg. imap_module_directory = /usr/local/lib/python3.11/site-packages/imap_processing
imap_module_directory = Path(__file__).parent

instruments = [
INSTRUMENTS = [
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed these to a typical GLOBAL caps pattern since they seemed like globals. I can change these back if this is just noise.

imap_processing/cdf/utils.py Show resolved Hide resolved
pyproject.toml Outdated
[project.scripts]
imap_processing = "imap_processing.run_processing"
[tool.poetry.scripts]
imap_cli = "imap_processing.run_processing:main"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like poetry doesn't respect project.scripts so need to change this over to tool.poetry.scripts. @laspsandoval you changed this in one of your previous PRs that you then declined. Did you have a different better name for this? I couldn't find it immediately, so happy to hear your thoughts on the naming here.

imap_processing is not great because it clashes with our package/directory name so when you are in the current directory it tries to run that folder instead of the installed script.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between, Laura's PR and this PR, you may have merge conflict depending on who merge first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not good with names. But I agree about choosing a good name for the script file. imap_processing or cli is not good. one is confusing with package and other is too generic. since we are using imap_cli, we could change script file name to be that too unless others have good name suggestion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. It will be an easy rebase, so I'm happy to resolve after hers goes in.

Agreed on moving things around further! Maybe do that as a follow-up PR when someone with good naming skills comes in :)

@greglucas greglucas force-pushed the data-dir branch 4 times, most recently from 0e8d45e to 134c61b Compare January 3, 2024 23:54
Copy link
Contributor

@tech3371 tech3371 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me. To double check, DATA_DIR is where data is downloaded to when we are using download API, right?

@greglucas
Copy link
Collaborator Author

It looks good to me. To double check, DATA_DIR is where data is downloaded to when we are using download API, right?

Yes, you could set it to wherever you want, and if you wanted to set it to different places that is fine too.

@tech3371 Thoughts on whether this should be the place we write files explicitly to? i.e. this comment: #303 (comment)

@greglucas greglucas force-pushed the data-dir branch 2 times, most recently from 217423e to 417b846 Compare January 5, 2024 17:19
@greglucas greglucas requested a review from tech3371 January 5, 2024 17:19
@greglucas
Copy link
Collaborator Author

I apologize for the churn/expansion here, this is significantly updated since you reviewed @tech3371.

  • Use a config dictionary, so it is a mutable global object, otherwise updating previous imports from other modules wouldn't have been updated globally. (changing an item in imap_processing.config will show up to someone else referencing the imap_processing.config object, but previously if you would have imported imap_processing.DATA_DIR pointing to a string, that wouldn't have received an update if somewhere else changed that value)
  • Remove as a filename attribute and option to pass into write_cdf()
  • Default to the directory structure the project has agreed to
  • Create a global test fixture that sets the data directory to a temporary path so no one has to worry about the scope of their test files being written.

Copy link
Contributor

@tech3371 tech3371 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple minor questions and comment. It looks good to me still though.

imap_processing/__init__.py Show resolved Hide resolved
imap_processing/cdf/utils.py Show resolved Hide resolved
imap_processing/cdf/utils.py Show resolved Hide resolved
pyproject.toml Outdated
[project.scripts]
imap_processing = "imap_processing.run_processing"
[tool.poetry.scripts]
imap_cli = "imap_processing.run_processing:main"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between, Laura's PR and this PR, you may have merge conflict depending on who merge first.

pyproject.toml Outdated
[project.scripts]
imap_processing = "imap_processing.run_processing"
[tool.poetry.scripts]
imap_cli = "imap_processing.run_processing:main"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not good with names. But I agree about choosing a good name for the script file. imap_processing or cli is not good. one is confusing with package and other is too generic. since we are using imap_cli, we could change script file name to be that too unless others have good name suggestion.

@greglucas greglucas force-pushed the data-dir branch 2 times, most recently from b7dde16 to bc3116a Compare January 9, 2024 02:20
@greglucas greglucas requested a review from bourque January 9, 2024 02:50
Copy link
Collaborator

@bourque bourque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, I like the changes. This will keep us more consistent/organized. I just have a few non-blocking suggestions.

data: xr.Dataset, description: str = "", mode: str = "", directory: str = ""
data: xr.Dataset,
description: str = "",
directory: Optional[Path] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also update the Parameters section of the docstring to reflect that directory is now a Path object instead of the str? (Looks like that whole section is not formatted correctly too)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I updated all of this to be the Numpy docstring style. I think we can use the type-hints to automagically generate these if we wanted to go that route as well and avoid having to manually document these...

Comment on lines +7 to +13
@pytest.fixture(autouse=True)
def _set_global_data_dir(tmp_path):
"""Set the global data directory to a temporary directory."""
_original_data_dir = imap_processing.config["DATA_DIR"]
imap_processing.config["DATA_DIR"] = tmp_path
yield
imap_processing.config["DATA_DIR"] = _original_data_dir
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a neat way to do this!

def test_write_cdf(tmp_path):
# Set up a fake dataset
# lots of requirements on attributes, so depend on SWE for now
ds = xr.Dataset(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ds = xr.Dataset(
dataset = xr.Dataset(

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this and fname to spell it out. But, I think there is an argument to be made for keeping it ds for xarray dataset, df for pandas dataframe, np for numpy etc... Their documentation uses ds pretty universally, so is it better to stick to that or the spelled out name?
https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am annoyed that they chose to use ds instead of dataset, but I am totally fine with keeping it ds, fname, etc. if that is basically a 'standard' for the particular library.

imap_processing/tests/cdf/test_utils.py Outdated Show resolved Hide resolved
This adds the ability to set the data directory for reading
and writing data files to an explicit location of the user's choosing
through command line or environment variables.
@greglucas greglucas merged commit 4e87e04 into IMAP-Science-Operations-Center:dev Jan 9, 2024
17 checks passed
@greglucas greglucas deleted the data-dir branch January 9, 2024 23:22
laspsandoval pushed a commit to laspsandoval/imap_processing that referenced this pull request Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants