Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lazy loading of raw data for local files #48

Merged
merged 5 commits into from
May 13, 2024

Conversation

marcoross
Copy link
Contributor

No description provided.

@marcoross marcoross linked an issue May 2, 2024 that may be closed by this pull request
@marcoross marcoross requested a review from hofaflo May 2, 2024 11:38
CHANGELOG.md Outdated Show resolved Hide resolved
@cbrnr
Copy link
Contributor

cbrnr commented May 2, 2024

This looks really good! Not entirely related, but I was wondering if there is some public attribute that reflects the status of the data (i.e. loaded vs. not loaded). Sometimes, it would also be useful to know the total size of the Edf object in memory. I don't think this is currently possible, right?

Also, I assume loaded vs. not loaded applies to all signals that are part of an Edf objects? That is, either the entire object is loaded or not (as opposed to some selected signals). Correct?

Finally, and this is now least related to this PR (so let me know if you would like me to open a new issue), what is the public way to access the actual data? There's no such thing as Edf.data, and Edf.get_signal() requires a label. Sorry if I'm missing something obvious!

edfio/edf.py Outdated Show resolved Hide resolved
edfio/edf_signal.py Outdated Show resolved Hide resolved
edfio/edf_signal.py Outdated Show resolved Hide resolved
CHANGELOG.md Show resolved Hide resolved
edfio/edf.py Outdated Show resolved Hide resolved
edfio/edf.py Outdated Show resolved Hide resolved
@hofaflo
Copy link
Contributor

hofaflo commented May 2, 2024

This looks really good! Not entirely related, but I was wondering if there is some public attribute that reflects the status of the data (i.e. loaded vs. not loaded).

Yes, this could be useful, I'm not sure how to best show it though, since (with this PR) lazy loading is done on a per-signal basis 🤔

Sometimes, it would also be useful to know the total size of the Edf object in memory. I don't think this is currently possible, right?

Currently not, no... maybe this could be combined with your first suggestion, i.e. display the size of the loaded signals and the total size on disk? E.g. <Edf 20 signals 0 annotations 50/100 MB loaded>

Also, I assume loaded vs. not loaded applies to all signals that are part of an Edf objects? That is, either the entire object is loaded or not (as opposed to some selected signals). Correct?

With the implementation suggested in this PR, it could be anything in between as well.

Finally, and this is now least related to this PR (so let me know if you would like me to open a new issue), what is the public way to access the actual data? There's no such thing as Edf.data, and Edf.get_signal() requires a label. Sorry if I'm missing something obvious!

Right, currently there's nothing like that. While this could be nice for recordings with uniform sampling frequencies (just return a 2d-array), I'm not sure how to best treat those with differing ones (a list of arrays? fail? ...?). What use case are you thinking about for this? (-> a new issue would be nice here, yes)

@cbrnr
Copy link
Contributor

cbrnr commented May 2, 2024

I've created a new issue to discuss accessing the data array in #49.

Regarding the other points, I think it would be nice not to overcomplicate the API. So you are saying it is currently possible to have some signals loaded in memory and some not, because each signal is treated as a separate EdfSignal (which is basically a memory-mapped NumPy array)? I'm thinking that for users, it might be useful to know how much memory their entire Edf object currently uses.

On the other hand, when loading an EDF file, either no signals or all signals are loaded in memory, so there is no way to influence individual signals being loaded with read_edf(). I guess the only way to do load individual signals (from a lazy object) is to access a signal, but then the question is which methods trigger/require loading the data?

@hofaflo
Copy link
Contributor

hofaflo commented May 3, 2024

[...] each signal is treated as a separate EdfSignal (which is basically a memory-mapped NumPy array)?

Exactly, with this PR the array stays memory mapped until the data is accessed for the first time.

I'm thinking that for users, it might be useful to know how much memory their entire Edf object currently uses.

Definitely! Is the suggestion for the extended repr in my above comment more or less what you're thinking about here?

[...] which methods trigger/require loading the data?

With the currently suggested implementation, this would be

  • EdfSignal.data
  • EdfSignal.digital_data
  • EdfSignal.update_data()
  • Edf.update_data_record_duration() (all signals)
  • Edf.write() (all signals)
  • Edf.annotations (all annotation signals)
  • Edf.starttime (timekeeping annotation signal)

EDIT: slicing operations would also require loading the data (thanks @cbrnr!):

  • Edf.slice_between_annotations() (all signals)
  • Edf.slice_between_seconds() (all signals)

@cbrnr
Copy link
Contributor

cbrnr commented May 5, 2024

Definitely! Is the suggestion for the extended repr in my above comment more or less what you're thinking about here?

Yes, this would be useful! I'd also expose this in an attribute for convenience. Plus, each underlying EdfSignal should also show its memory consumption in its repr (and attribute).

With the currently suggested implementation, this would be
...

What about slicing between seconds or annotations? In any case, if the current memory consumption is available, users will always be able to find out which operation loads the data!

@hofaflo
Copy link
Contributor

hofaflo commented May 5, 2024

Yes, this would be useful! I'd also expose this in an attribute for convenience. Plus, each underlying EdfSignal should also show its memory consumption in its repr (and attribute).

👍 Feel free to open a PR for that, ideally once this one is merged!

What about slicing between seconds or annotations?

Right, thanks for pointing this out! I'll edit the above list. For non-annotation signals this could even be done without loading the data (as long as the desired slice only contains complete datarecords), though that would complicate the implementation a bit.

@cbrnr
Copy link
Contributor

cbrnr commented May 7, 2024

Another question that I had was that if read_edf() loads the signal into memory, it actually loads the digital data, right? So I'm not sure how surprising it will be that the physical data consumes four times as much?

What I'm trying to say is that this is becoming a bit complicated already, given that the original intention of lazy loading was that people sometimes work with EDF headers only, so they don't need the data. Maybe a simpler option would be to add a separate function read_edf_header() instead?

@marcoross marcoross requested a review from hofaflo May 13, 2024 09:58
@marcoross marcoross self-assigned this May 13, 2024
@hofaflo
Copy link
Contributor

hofaflo commented May 13, 2024

@cbrnr, let's move the discussion about memory consumption information into a new issue!

@cbrnr
Copy link
Contributor

cbrnr commented May 13, 2024

Sure! I just thought that maybe the current implementation is not even necessary because it adds too much complexity...

@hofaflo
Copy link
Contributor

hofaflo commented May 13, 2024

For the original use case described in #47 you're definitely right :D However, it's a nice way to also speed up working with only a (small) subset of signals.

@marcoross marcoross merged commit 70e9de8 into main May 13, 2024
9 checks passed
@marcoross marcoross deleted the 47-lazy-load-raw-data branch May 13, 2024 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lazy load raw data
3 participants