
OpenCV bug. Ensuring None for edges #1

Closed · wants to merge 7 commits

Conversation

@CBroz1 commented Jul 27, 2022

I recommend fetching from upstream before review.

  1. Addresses an OpenCV bug that yields timestamps of 0 on the last 3 frames
  2. Revises the code to pass an empty array to edges

Saksham20 and others added 3 commits July 27, 2022 10:49
* separate out methods to write to nwb per subject

* add the option to provide a config_dict instead of .yml file path

* scrapped using config dict

* Update dlc2nwb/utils.py

Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>

* Update dlc2nwb/utils.py

Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>

Co-authored-by: Heberto Mayorquin <h.mayorquin@gmail.com>
Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>
@CBroz1 (Author) commented Aug 11, 2022

Hi @CodyCBakerPhD - Do you have time to review? With a stamp of approval from the Catalyst team, I'll open a PR to the DeepLabCut repo.

@CodyCBakerPhD (Member) commented

@CBroz1 Thanks for bringing this to our attention - I don't think we've seen this issue before on any of our files. Would you be able to share the file that causes it so we can add it to the testing repo?

Also, are you sure it would always be limited to only the last 3 frames and not some arbitrary number of frames from the end, or even worse, randomly chosen frames from a given movie?

@CodyCBakerPhD (Member) commented

Also looping in @h-mayorquin on this; he might have some other ideas (and will also soon be improving the DLC2NWB writing procedures related to timestamps vs. rate).

@CBroz1 (Author) commented Aug 12, 2022

> @CBroz1 Thanks for bringing this to our attention - I don't think we've seen this issue before on any of our files. Would you be able to share the file that causes it so we can add it to the testing repo?

Of course. I've uploaded the set of related files to Google Drive here. The video itself was truncated by ffmpeg - specifically this process.

> Also, are you sure it would always be limited to only the last 3 frames and not some arbitrary number of frames from the end, or even worse, randomly chosen frames from a given movie?

That's a good point. On my first pass, it was always 3 frames. With the MVP example I just uploaded, it was only the last one. My most recent commit looks for zeros after the first frame and warns the user about the percentage of timestamps that will be interpolated.
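For illustration, a minimal sketch of that check (the toy data, exact warning text, and variable names here are assumptions, not the commit itself):

import warnings
import numpy as np

# toy per-frame times as read back from OpenCV; trailing zeros mimic the bug
timestamps = np.array([0.0, 0.04, 0.08, 0.12, 0.0, 0.0])

zero_mask = timestamps[1:] == 0  # frame 0 legitimately starts at t=0
if np.any(zero_mask):
    percent_bad = 100 * np.sum(zero_mask) / len(timestamps)
    warnings.warn(
        f"{percent_bad:.2f}% of timestamps are zero after the first frame "
        "and will be interpolated."
    )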

@CodyCBakerPhD (Member) commented

> My most recent commit looks for zeros after the first frame and warns the user about the percentage of timestamps that will be interpolated.

Sounds good, thanks for bringing this to our attention as well.

> Of course. I've uploaded the set of related files to Google Drive here.

Thanks for sharing - I've requested access and will add them to our behavioral testing data repo for future integration into the testing suites.

@CBroz1 (Author) commented Aug 15, 2022

> I've requested access and will add them to our behavioral testing data repo for future integration into the testing suites.

Go for it. You may be interested, however, in some version of our DLC test data (download instructions here). Future versions of DLC2NWB may include importing training data (see ndx-pose PR#9). In the drive folder, you'll find only the items required for the present PR's use case.

@h-mayorquin commented Aug 16, 2022

> @CBroz1 Thanks for bringing this to our attention - I don't think we've seen this issue before on any of our files. Would you be able to share the file that causes it so we can add it to the testing repo?
>
> Also, are you sure it would always be limited to only the last 3 frames and not some arbitrary number of frames from the end, or even worse, randomly chosen frames from a given movie?

The general test for whether something fishy is going on with the data is whether the timestamps are not monotonically increasing. I am less sure about doing interpolation. I tested this with a toy example: with 2.5% zeros at the end, interpolation produces a flat function over those final frames (see script at the end):

[Figure: original vs. interpolated timestamps; the interpolated series plateaus over the final 2.5% of frames]

(maybe this is what you intended? maybe the data is not realistic? I did not think deeply about this).
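For reference, the monotonicity test itself is tiny; a sketch, assuming a 1-D numpy array of timestamps:

import numpy as np

def timestamps_are_monotonic(timestamps):
    # strictly increasing <=> every consecutive difference is positive
    return bool(np.all(np.diff(timestamps) > 0))

print(timestamps_are_monotonic(np.array([0.0, 0.04, 0.08])))  # True
print(timestamps_are_monotonic(np.array([0.0, 0.04, 0.0])))   # False: trailing zero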

Finally, @bendichter recently opened a related issue but then closed it:
DeepLabCut#12
EDIT: It was closed because it is an OpenCV issue.

Script:

import numpy as np
import matplotlib.pyplot as plt

num_timestamps = int(1e6)
timestamps = np.arange(num_timestamps, dtype=float)
percentage_of_zeros_at_end = 0.025  # a fraction, i.e. 2.5%
frame_where_zeros_start = int(num_timestamps * (1.0 - percentage_of_zeros_at_end))
timestamps[frame_where_zeros_start:] = 0
original_timestamps = np.copy(timestamps)

# note: timestamps[0] == 0, so the first frame is also treated as missing below
timestamps[timestamps == 0] = np.nan            # replace 0s with nan
timestamps[np.isnan(timestamps)] = np.interp(   # interpolate nans
    np.isnan(timestamps).nonzero()[0],          # nan idx to replace
    (~np.isnan(timestamps)).nonzero()[0],       # good idx to keep
    timestamps[~np.isnan(timestamps)],          # good timestamps
)

plt.figure(figsize=(8, 6), dpi=80)
plt.plot(original_timestamps, label="original")
plt.plot(timestamps, label="after interpolation")
plt.legend()
plt.show()

@CBroz1 (Author) commented Aug 16, 2022

My original implementation summed the average inter-frame difference and timepoint N-1, but I changed it to future-proof against 0s elsewhere (not just the last few). I can revert to that avg_diff + previous strategy if we think that's best.
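A sketch of that avg_diff + previous strategy, assuming the zeros sit at the end of the array (the function name and toy data are illustrative, not the actual commit):

import numpy as np

def fill_zero_timestamps(timestamps):
    """Replace zero timestamps (after frame 0) with previous stamp + mean interval."""
    timestamps = np.asarray(timestamps, dtype=float).copy()
    good = timestamps != 0
    good[0] = True  # frame 0 legitimately starts at t=0
    avg_diff = np.mean(np.diff(timestamps[good]))
    for i in range(1, len(timestamps)):
        if timestamps[i] == 0:
            timestamps[i] = timestamps[i - 1] + avg_diff
    return timestamps

print(fill_zero_timestamps([0.0, 0.04, 0.08, 0.0, 0.0]))
# [0.   0.04 0.08 0.12 0.16]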

@h-mayorquin commented

> My original implementation summed the average inter-frame difference and timepoint N-1, but I changed it to future-proof against 0s elsewhere (not just the last few). I can revert to that avg_diff + previous strategy if we think that's best.

I feel that we should just leave the np.nan values instead of trying to do imputation in this specific library. It seems like a big commitment to patch OpenCV's faulty reading algorithm locally.

Personally, I would not know how to choose between different strategies of data imputation. Both of your approaches seem good to me for the case of a few zeros at the end.

@CBroz1 (Author) commented Aug 16, 2022

I took from my initial conversation with @bendichter that 0s would break readability for downstream NWB tools. Is that also the case with np.nan, @CodyCBakerPhD?

@h-mayorquin commented

> I took from my initial conversation with @bendichter that 0s would break readability for downstream NWB tools. Is that also the case with np.nan, @CodyCBakerPhD?

That's a good point. Let's see what they say.

@CodyCBakerPhD (Member) commented

In NWB, any timestamps field ought to be strictly increasing. We have in the past set starting_time (when paired with rate for regular series) to NaN to indicate an unknown starting time, but I don't think we've dealt with cases where a small number of individual timestamps in an array are unknown. I think setting them to NaN might have the same disruptive effect in visualizations and processing, which often 'query' a given time point in the vector to align or display data between various objects.

As such, this is technically against the NWB best practice for timestamps, but again, we've never encountered this question, so we might have to discuss it.

I can think of some past cases where the acquisition system (not video-related, but ephys) detected that some individual frames throughout the experiment were to be excluded, which is kind of similar - but we just excluded the frames from the data in that case.

Just thinking aloud below

But this seems fundamentally different in that the frames themselves are well-behaved and contain data, and importantly, you know that the timing of the frame is bounded. If you had a single frame with a timestamp of 0 wedged in between two otherwise known timestamps, I would think it would make sense to just average the timing to occur at the midpoint as a 'best guess'; very little, if anything, would be harmed by doing so. It might technically introduce a small amount of error if the system truly uses some particular design of irregular frame capture, but that is not very common.
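A one-line sketch of that midpoint 'best guess' (illustrative; i is the index of the single frame whose timestamp is unknown):

# average the two known neighbors to place the unknown frame at the midpoint
timestamps[i] = (timestamps[i - 1] + timestamps[i + 1]) / 2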

Maybe the best course of action would be to have this kind of interpolation be an optional flag?

@CodyCBakerPhD (Member) commented

OK, we discussed it a bit more ourselves - what we would recommend for NWB writing purposes is an optional flag (just an example: infer_timestamps: bool = False) that behaves as follows:

if test_if_timestamps_are_regular(timestamps, decimal_tolerance):  # excluding, of course, the NaN ones at the end
    # don't use timestamps at all and just set starting_time + rate instead

# otherwise
if infer_timestamps:
    # do that interpolation thing, but in a way that avoids the plateau effect Heberto pointed out above
    # i.e., just set the missing timestamps to frame_idx / sampling_rate or something similar
else:
    # remove the data for the frames that correspond to missing timing information
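As a rough sketch, the regularity test referenced above might look like the following; the name and decimal_tolerance parameter come from the pseudocode, while the implementation itself is only an assumption:

import numpy as np

def test_if_timestamps_are_regular(timestamps, decimal_tolerance=4):
    """True if all inter-frame intervals agree up to `decimal_tolerance` decimals."""
    good = timestamps[~np.isnan(timestamps)]  # exclude the NaN ones at the end
    intervals = np.round(np.diff(good), decimals=decimal_tolerance)
    return np.unique(intervals).size == 1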

I would expect the majority of use cases to fall under the regularity condition (depending, of course, on the source format/codec/external files with exact timing info, as well as how stringent the decimal_tolerance threshold is set), so the entire thing is a non-issue there.

For significantly irregular series with this issue, NWB would ask that you only include timestamps that are (i) strictly ascending and (ii) free of interspersed NaNs - but it is best to leave it entirely as the user's choice whether to (a) make up timing information that is roughly consistent with the average sampling frequency (interpolation) or (b) simply drop the data from consideration since the exact timing is unknown. I'd also vote for (b) as the default behavior since, if it is a small number of affected frames, not much will ultimately be lost.

@CBroz1 What are your thoughts? Would an optional flag like that make sense to present to a user (or devs such as ourselves)?

@CBroz1 (Author) commented Aug 17, 2022

> @CBroz1 What are your thoughts? Would an optional flag like that make sense to present to a user (or devs such as ourselves)?

@CodyCBakerPhD Sure, we can leave it to user choice.

My C++ is not strong enough to know for sure, but I think it might be the case that, depending on the reader, OpenCV already returns best-guess or error-prone timestamps; there are a couple of examples of this across OpenCV's many readers.

I'm not sure it makes much sense to treat what we're getting back as ground truth when it could come from any number of sources with any number of underlying codecs/calculation strategies. Therefore, I'm inclined to have the default behavior be an inference. I've pushed a commit along these lines and added the resulting NWB files (infer and exclude_nans) to the same drive folder.

@CodyCBakerPhD (Member) commented

> My C++ is not strong enough to know for sure, but I think it might be the case that, depending on the reader, OpenCV already returns best-guess or error-prone timestamps; there are a couple of examples of this across OpenCV's many readers.

Yeah, we've noticed that as well when perusing OpenCV - many of the formats/codecs seem to force interpolated/regular timestamps, but we're hypothesizing that it is 'possible' that there exists at least one reader that supports variable timing information (we would love to find an example file with this encoded, but usually variable timing info is in a separate file, like a .csv adjacent to the file OpenCV reads).

> Therefore, I'm inclined to have the default behavior be an inference.

Fair point, I don't feel that strongly either way since this is such an edge case.

LGTM now. @h-mayorquin, any final thoughts?

@@ -37,6 +37,24 @@ def get_movie_timestamps(movie_file, VARIABILITYBOUND=1000):
"Variability of timestamps suspiciously small. See: https://github.com/DeepLabCut/DLC2NWB/issues/1"
)

if any(timestamps[1:] == 0):
@h-mayorquin commented Aug 17, 2022

Only two small suggestions, then:

  1. I think the imputation should be a function of its own that takes timestamps and returns timestamps.
  2. Some suggestions for your code; not very important, as your comments are clear enough, but maybe you'll find them useful:
bad_timestamps_mask = np.isnan(timestamps)
bad_timestamps_indexes = np.flatnonzero(bad_timestamps_mask)
# mean inter-frame interval estimated from the good timestamps
# (note: diffs that span a gap slightly over-estimate the interval)
estimated_frame_interval = np.mean(np.diff(timestamps[~bad_timestamps_mask]))
# assumes timestamps start at t=0 for frame 0
inferred_timestamps = bad_timestamps_indexes * estimated_frame_interval

timestamps[bad_timestamps_mask] = inferred_timestamps

@CBroz1 (Author) commented

Thanks for the suggestions! I can see how the way you've written it is cleaner and doesn't need comments. It's not clear to me why this block should be a separate function, though. What's your threshold there?

@h-mayorquin commented

To me it is just part of the general arguments for modularization. Of all those arguments, I think two are strong here (the usual one of re-usability probably does not apply within this repository):

  1. If somebody wants to change the specific operation of inference/imputation, they know exactly what to modify and where. Making it a function makes the decoupling of this piece of code from the rest of the function explicit. The specific imputation routine inside the function could change, but its role would still be "inferring bad or missing timestamps". It is a good abstraction in that way.
  2. If someone is taking a first glance at how get_movie_timestamps works, all the details of the specific imputation algorithm are noise at that level. Having them in a function makes the big picture more evident. There are several things going on here: 1) you build a cv reader, 2) you extract the data, 3) you check the timestamps' variability, and 4) you infer wrong/missing timestamps. Codifying one or more of those steps as functions makes that big picture clearer, in my opinion (see the sketch after this list).
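To make that concrete, here is a hypothetical sketch of such a factoring using OpenCV's VideoCapture API; the helper name infer_bad_timestamps and the imputation inside it are assumptions, not this PR's code:

import cv2
import numpy as np

def infer_bad_timestamps(timestamps):
    """Step 4: infer wrong/missing timestamps; the imputation strategy lives only here."""
    bad = np.isnan(timestamps) | (timestamps == 0)
    bad[0] = False  # frame 0 legitimately starts at t=0
    interval = np.mean(np.diff(timestamps[~bad]))  # mean good inter-frame interval
    timestamps[bad] = np.flatnonzero(bad) * interval
    return timestamps

def get_movie_timestamps(movie_file):
    reader = cv2.VideoCapture(movie_file)  # step 1: build a cv reader
    timestamps = []
    while True:  # step 2: extract per-frame times
        success, _frame = reader.read()
        if not success:
            break
        timestamps.append(reader.get(cv2.CAP_PROP_POS_MSEC) / 1000.0)
    reader.release()
    timestamps = np.asarray(timestamps)
    # step 3: a variability check on np.diff(timestamps) would go here
    return infer_bad_timestamps(timestamps)  # step 4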

All that said, those are heuristics, and I am mostly following my intuition / internal model of what people will find easier to understand, read, and maintain in the future. Here is an article that argues against indirection in some cases and delineates where this model breaks down:

https://matthewrocklin.com/blog/work/2019/06/23/avoid-indirection

Thanks for all the hard work and good luck with the other PR.

@h-mayorquin left a comment

Looks good to me as well. Let's see what the maintainers say.

@CBroz1 (Author) commented Aug 18, 2022

@CodyCBakerPhD - Let me know if there's anything else I should do before opening a PR on the DeepLabCut fork.

@CodyCBakerPhD (Member) commented

@CBroz1 Go for it! Thanks for all the work on this 😀
