OpenCV bug. Ensuring None for edges #1
Conversation
* separate out methods to write to NWB per subject
* add the option to provide a config_dict instead of a .yml file path
* scrap using config dict
* Update dlc2nwb/utils.py (Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>)
* Update dlc2nwb/utils.py (Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>)

Co-authored-by: Heberto Mayorquin <h.mayorquin@gmail.com>
Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>
Hi @CodyCBakerPhD - Do you have time to review? With a stamp of approval from the Catalyst team, I'll PR to the DeepLabCut repo.
@CBroz1 Thanks for bringing this to our attention - I don't think we've seen this issue before on any of our files. Would you be able to share the file that causes it so we can add it to the testing repo? Also, are you sure that it would always be limited to only the last 3 frames and not some arbitrary number of frames from the end - or, even worse, randomly chosen frames in a given movie?
Also looping @h-mayorquin in on this, who might have some other ideas (and will also soon be improving the DLC2NWB writing procedures related to timestamps vs. rate).
Of course. I've uploaded the set of related files to Google Drive here. The video itself was truncated by [...]
That's a good point. With my first pass, it was always 3 frames. With this MVP example I just uploaded, it was only the last one. My most recent commit looks for zeros after the first frame and warns the user about the percentage of timestamps that will be interpolated.
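For reference, a minimal sketch of the kind of check described (the helper name and warning text here are hypothetical, not the actual commit):

```python
import warnings

import numpy as np

def warn_on_zero_timestamps(timestamps):
    """Warn about zero-valued timestamps occurring after the first frame (hypothetical helper)."""
    timestamps = np.asarray(timestamps)
    zero_mask = timestamps[1:] == 0  # frame 0 legitimately has timestamp 0
    if zero_mask.any():
        pct = 100.0 * zero_mask.sum() / timestamps.size
        warnings.warn(f"{pct:.2f}% of timestamps are zero and will be interpolated.")
```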
Sounds good, thanks for bringing this to our attention as well.
Thanks for sharing - I've requested access and will add them to our behavioral testing data repo for future integration into the testing suites.
Go for it. You may be interested, however, in some version of our DLC test data (download instructions here). Future versions of [...]
The general test for whether something fishy is going on with the data is whether the timestamps are monotonically increasing. I am less sure about doing interpolation. I tested this with a toy example, and for 2.5% zeros at the end the result is a flat function at the end (see the script below). (Maybe this is what you intended? Maybe the data is not realistic? I did not think deeply about this.) Finally, @bendichter opened some related issue recently but then closed it.

Script:

```python
import numpy as np
import matplotlib.pyplot as plt

num_timestamps = int(1e6)
timestamps = np.arange(num_timestamps, dtype=float)

percentage_of_zeros_at_end = 0.025
frame_where_zeros_start = int(num_timestamps * (1.0 - percentage_of_zeros_at_end))
timestamps[frame_where_zeros_start:] = 0

original_timestamps = np.copy(timestamps)

# replace 0s after the first frame with nan (frame 0 legitimately has timestamp 0)
timestamps[1:][timestamps[1:] == 0] = np.nan
timestamps[np.isnan(timestamps)] = np.interp(  # interpolate the nans
    np.isnan(timestamps).nonzero()[0],     # indices of nans to replace
    (~np.isnan(timestamps)).nonzero()[0],  # indices of good timestamps
    timestamps[~np.isnan(timestamps)],     # good timestamps
)

plt.figure(figsize=(8, 6), dpi=80)
plt.plot(original_timestamps, label="original")
plt.plot(timestamps, label="after interpolation")
plt.legend()
plt.show()
```
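The flat tail comes from `np.interp` clamping queries beyond the last known point to the boundary value `fp[-1]`, and the monotonicity test mentioned above can be written as `np.all(np.diff(timestamps) > 0)`. A quick demonstration of the clamping (an illustrative aside, not part of the original script):

```python
import numpy as np

xp = np.array([0.0, 1.0, 2.0])     # indices of known-good timestamps
fp = np.array([10.0, 11.0, 12.0])  # the known-good timestamp values
# Queries past xp[-1] return fp[-1], which produces the flat tail:
print(np.interp([3.0, 4.0, 5.0], xp, fp))  # -> [12. 12. 12.]
```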
My original implementation was to sum the average difference and timepoint N-1, but I changed it, wanting to future-proof against 0s elsewhere (not just in the last few frames). I can revert to that.
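A hypothetical reconstruction of that original approach from the description (not the actual code): replace each trailing zero with the previous timestamp plus the average inter-frame difference.

```python
import numpy as np

def fill_trailing_zeros(timestamps):
    """Hypothetical sketch: set each zero timestamp (after frame 0) to the
    previous timestamp plus the average inter-frame difference."""
    timestamps = np.asarray(timestamps, dtype=float)
    good = timestamps != 0
    good[0] = True  # frame 0 legitimately has timestamp 0
    avg_diff = np.mean(np.diff(timestamps[good]))
    for idx in np.flatnonzero(~good):
        timestamps[idx] = timestamps[idx - 1] + avg_diff
    return timestamps
```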
I feel that we should just leave np.nan values instead of trying to do imputation in this specific library. It seems like a big commitment to patch OpenCV's faulty reading algorithm locally. Personally, I would not know how to choose between different strategies of data imputation. Both of your approaches seem good to me for the case of a small number of zeros at the end.
I took from my initial conversation with @bendichter that 0s would break readability for downstream NWB tools. Is that also the case with `np.nan` values?
That's a good point. Let's see what they say.
In NWB, any [...] As such, this is technically against the NWB best practice for timestamps, but again, we've never encountered this question, so we might have to discuss it. I can think of some past cases where the acquisition system (not video-related, but ephys) detected that some individual frames throughout the experiment were to be excluded, which is kind of similar - but in that case we just excluded those frames from the data.

Just thinking aloud below:

This seems fundamentally different in that the frames themselves are well-behaved and contain data, and, importantly, you know that the timing of each frame is bounded. If you had a single frame with a timestamp of [...]

Maybe the best course of action would be to have this kind of interpolation be an optional flag?
OK, we discussed it a bit more ourselves - what we would recommend for NWB writing purposes is an optional flag (just an example; the names can change):

```python
if test_if_timestamps_are_regular(timestamps, decimal_tolerance):  # excluding, of course, the NaN ones at the end
    # don't use timestamps at all and just use starting_time + rate instead
    ...
# otherwise
elif infer_timestamps:
    # do that interpolation thing, but in a way that avoids the plateau effect Heberto pointed out above,
    # i.e., just set the missing timestamps to frame_idx / sampling_rate or something similar
    ...
else:
    # remove the data for the frames that correspond to missing timing information
    ...
```

The majority of use cases I would expect to fall under the regularity condition (depending, of course, on the source format/codec/external files with exact timing info, as well as how stringent the threshold is set). For significantly irregular series with this issue, NWB would ask that you not include timestamps that are not strictly ascending or that contain interspersed NaNs.

@CBroz1 What are your thoughts? Would an optional flag like that make sense to present to a user (or devs such as ourselves)?
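As a rough illustration of that recommendation, here is a runnable sketch; every name in it (`prepare_timing`, `infer_timestamps`, `decimal_tolerance`) is a placeholder rather than an actual DLC2NWB or NWB API:

```python
import numpy as np

def prepare_timing(timestamps, infer_timestamps=True, decimal_tolerance=4):
    """Hypothetical sketch of the optional-flag logic described above."""
    timestamps = np.asarray(timestamps, dtype=float)
    bad = np.isnan(timestamps)
    diffs = np.diff(timestamps[~bad])
    period = diffs.mean()
    if np.allclose(diffs, period, atol=10.0 ** -decimal_tolerance):
        # Regular sampling: write starting_time + rate instead of explicit timestamps.
        return {"starting_time": float(timestamps[~bad][0]), "rate": 1.0 / period}
    if infer_timestamps:
        # Interpolate without the plateau: place missing frames at frame_idx * period.
        timestamps[bad] = np.flatnonzero(bad) * period
        return {"timestamps": timestamps}
    # Otherwise, drop the frames that lack timing information.
    return {"timestamps": timestamps[~bad], "keep_mask": ~bad}
```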
@CodyCBakerPhD Sure, we can leave it to user choice. My C++ is not strong enough to know for sure, but I think it might be the case that, depending on the reader, OpenCV may already return best-guess or error-prone timestamps. A couple of examples across many readers: [...] I'm not sure it makes much sense to treat what we're getting back as ground truth when it could come from any number of sources with any number of underlying codecs/calculation strategies. Therefore, I'm inclined to have the default behavior be inference. I've pushed a commit along these lines and added the resulting NWB files [...]
Yeah, we've noticed that as well when perusing OpenCV - many of the formats/codecs seem to force interpolated/regular timestamps, but I think we're hypothesizing that it is 'possible' that there exists at least one reader that supports variable timing information (we'd love to find an example file with this encoded, but usually variable timing info is in a separate file like a [...])
Fair point, I don't feel that strongly either way since this is such an edge case. LGTM now, @h-mayorquin any final thoughts?
```diff
@@ -37,6 +37,24 @@ def get_movie_timestamps(movie_file, VARIABILITYBOUND=1000):
         "Variability of timestamps suspiciously small. See: https://github.com/DeepLabCut/DLC2NWB/issues/1"
     )

+    if any(timestamps[1:] == 0):
```
Only two small suggestions then:

- I think the imputation should be a function of its own that takes timestamps and returns timestamps.
- Some suggestions for your code - not very important, as your comments are clear enough, but maybe you will find them useful:

```python
bad_timestamps_mask = np.isnan(timestamps)
bad_timestamps_indexes = np.argwhere(bad_timestamps_mask)[:, 0]
estimated_sampling_rate = np.mean(np.diff(timestamps[~bad_timestamps_mask]))
inferred_timestamps = bad_timestamps_indexes * estimated_sampling_rate
timestamps[bad_timestamps_mask] = inferred_timestamps
```
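Wrapped as the standalone function suggested in the first point, that could look like the sketch below (the function name is hypothetical, and `estimated_sampling_rate` is renamed here because the mean of the diffs is really the frame interval, not the rate):

```python
import numpy as np

def infer_missing_timestamps(timestamps):
    """Replace NaN timestamps with values inferred from the mean frame interval."""
    bad_timestamps_mask = np.isnan(timestamps)
    bad_timestamps_indexes = np.argwhere(bad_timestamps_mask)[:, 0]
    estimated_frame_interval = np.mean(np.diff(timestamps[~bad_timestamps_mask]))
    timestamps[bad_timestamps_mask] = bad_timestamps_indexes * estimated_frame_interval
    return timestamps
```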
Thanks for the suggestions! I can see how the way you've written it is cleaner and doesn't need commenting. It's not clear to me why this block should be a separate function. What's your threshold there?
To me, it is just part of the general arguments for modularization. Of all those arguments, I think two are strong here (the usual one of re-usability probably does not apply within this repository):

- If somebody wants to change the specific operation of inference/imputation, they know exactly what to modify and where. Making it a function makes explicit that this piece of code is not coupled to the rest of the function. The specific imputation routine inside the function could change, but its role would still be "inferring bad or missing timestamps". It is a good abstraction in that way.
- If someone is taking a first glance at how the function `get_movie_timestamps` works, all the details of the specific imputation algorithm are noise at that first level. Having them in a function makes the big picture more evident. There are several things going on here: 1) you build a cv reader, 2) you extract the data, 3) you check the timestamps' variability, 4) you infer wrong/missing timestamps. Having one or more of those steps codified as a function makes this big picture clearer, in my opinion (see the sketch below).

All that said, those are heuristics, and I am mostly following my intuition/internal model of what people will find easier to understand, read, and maintain in the future. Here is an article that argues against indirection in some cases, which delineates where this model breaks down:
https://matthewrocklin.com/blog/work/2019/06/23/avoid-indirection
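For illustration only, a rough sketch of that four-step shape (`extract_timestamps` is a hypothetical helper, and the actual repository code may differ):

```python
import cv2
import numpy as np

def extract_timestamps(reader):
    """Pull per-frame timestamps (in ms) from an OpenCV reader (hypothetical helper)."""
    timestamps = []
    while reader.grab():  # advance frame by frame without decoding
        timestamps.append(reader.get(cv2.CAP_PROP_POS_MSEC))
    return np.array(timestamps)

def get_movie_timestamps(movie_file):
    reader = cv2.VideoCapture(movie_file)    # 1) build a cv reader
    timestamps = extract_timestamps(reader)  # 2) extract the data
    reader.release()
    # 3) check the timestamps' variability (e.g., warn if suspiciously uniform)
    # 4) infer wrong/missing timestamps, e.g., with infer_missing_timestamps above
    return timestamps
```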
Thanks for all the hard work and good luck with the other PR.
Looks good to me as well. Let's see what the maintainers say.
@CodyCBakerPhD - Lmk if there's anything else I should do before opening the PR on the DeepLabCut fork.
@CBroz1 Go for it! Thanks for all the work on this 😀
I recommend fetching from upstream before review