
OpenCV bug. Ensuring None for edges #1

Closed · wants to merge 7 commits

Conversation

@CBroz1 commented Jul 27, 2022

I recommend fetching from upstream before review.

  1. Addresses an OpenCV bug that yields timestamps of 0 on the last 3 frames
  2. Revises the code to pass an empty array to edges

Saksham20 and others added 3 commits July 27, 2022 10:49
* separate out methods to write to nwb per subject

* add the option to provide a config_dict instead of .yml file path

* scrapped using config dict

* Update dlc2nwb/utils.py

Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>

* Update dlc2nwb/utils.py

Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>

Co-authored-by: Heberto Mayorquin <h.mayorquin@gmail.com>
Co-authored-by: Jessy Lauer <30733203+jeylau@users.noreply.github.com>
@CBroz1 (Author) commented Aug 11, 2022

Hi @CodyCBakerPhD - Do you have time to review? With a stamp of approval from the Catalyst team, I'll open a PR to the DeepLabCut repo.

@CodyCBakerPhD (Member) commented

@CBroz1 Thanks for bringing this to our attention - I don't think we've seen this issue before on any of our files. Would you be able to share the file that causes it so we can add it to the testing repo?

Also, are you sure it would always be limited to only the last 3 frames and not some arbitrary number of frames from the end, or even worse, randomly chosen frames from a given movie?

@CodyCBakerPhD (Member) commented

Also looping in @h-mayorquin on this; he might have some other ideas (and will also soon be improving the DLC2NWB writing procedures related to timestamps vs. rate).

@CBroz1 (Author) commented Aug 12, 2022

> @CBroz1 Thanks for bringing this to our attention - I don't think we've seen this issue before on any of our files. Would you be able to share the file that causes it so we can add it to the testing repo?

Of course. I've uploaded the set of related files to Google Drive here. The video itself was truncated by ffmpeg - specifically this process.

> Also, are you sure it would always be limited to only the last 3 frames and not some arbitrary number of frames from the end, or even worse, randomly chosen frames from a given movie?

That's a good point. On my first pass, it was always 3 frames. With the MVP example I just uploaded, it was only the last one. My most recent commit looks for zeros after the first frame and warns the user about the percentage of timestamps that will be interpolated.
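For illustration, a minimal sketch of that check (the toy data, exact warning text, and variable names here are assumptions, not the commit itself):

import warnings
import numpy as np

# toy per-frame times as read back from OpenCV; trailing zeros mimic the bug
timestamps = np.array([0.0, 0.04, 0.08, 0.12, 0.0, 0.0])

zero_mask = timestamps[1:] == 0  # frame 0 legitimately starts at t=0
if np.any(zero_mask):
    percent_bad = 100 * np.sum(zero_mask) / len(timestamps)
    warnings.warn(
        f"{percent_bad:.2f}% of timestamps are zero after the first frame "
        "and will be interpolated."
    )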

@CodyCBakerPhD (Member) commented

> My most recent commit looks for zeros after the first frame and warns the user about the percentage of timestamps that will be interpolated.

Sounds good, thanks for bringing this to our attention as well.

> Of course. I've uploaded the set of related files to Google Drive here.

Thanks for sharing - I've requested access and will add them to our behavioral testing data repo for future integration into the testing suites.

@CBroz1 (Author) commented Aug 15, 2022

> I've requested access and will add them to our behavioral testing data repo for future integration into the testing suites.

Go for it. You may be interested, however, in some version of our DLC test data (download instructions here). Future versions of DLC2NWB may include importing training data (see ndx-pose PR#9). In the drive folder, you'll find only the items required for the present PR's use case.

@h-mayorquin commented Aug 16, 2022

> @CBroz1 Thanks for bringing this to our attention - I don't think we've seen this issue before on any of our files. Would you be able to share the file that causes it so we can add it to the testing repo?
>
> Also, are you sure it would always be limited to only the last 3 frames and not some arbitrary number of frames from the end, or even worse, randomly chosen frames from a given movie?

The general test for whether something fishy is going on with the data is whether the timestamps are not monotonically increasing. I am less sure about doing interpolation. I tested this with a toy example: with 2.5% zeros at the end, interpolation produces a flat function over those final frames (see script at the end):

[Figure: original vs. interpolated timestamps; the interpolated series plateaus over the final 2.5% of frames]

(maybe this is what you intended? maybe the data is not realistic? I did not think deeply about this).
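For reference, the monotonicity test itself is tiny; a sketch, assuming a 1-D numpy array of timestamps:

import numpy as np

def timestamps_are_monotonic(timestamps):
    # strictly increasing <=> every consecutive difference is positive
    return bool(np.all(np.diff(timestamps) > 0))

print(timestamps_are_monotonic(np.array([0.0, 0.04, 0.08])))  # True
print(timestamps_are_monotonic(np.array([0.0, 0.04, 0.0])))   # False: trailing zero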

Finally, @bendichter recently opened a related issue but then closed it:
DeepLabCut#12
EDIT: It was closed because it is an OpenCV issue.

Script:

import numpy as np
import matplotlib.pyplot as plt

num_timestamps = int(1e6)
timestamps = np.arange(num_timestamps, dtype=float)
percentage_of_zeros_at_end = 0.025  # a fraction, i.e. 2.5%
frame_where_zeros_start = int(num_timestamps * (1.0 - percentage_of_zeros_at_end))
timestamps[frame_where_zeros_start:] = 0
original_timestamps = np.copy(timestamps)

# note: timestamps[0] == 0, so the first frame is also treated as missing below
timestamps[timestamps == 0] = np.nan            # replace 0s with nan
timestamps[np.isnan(timestamps)] = np.interp(   # interpolate nans
    np.isnan(timestamps).nonzero()[0],          # nan idx to replace
    (~np.isnan(timestamps)).nonzero()[0],       # good idx to keep
    timestamps[~np.isnan(timestamps)],          # good timestamps
)

plt.figure(figsize=(8, 6), dpi=80)
plt.plot(original_timestamps, label="original")
plt.plot(timestamps, label="after interpolation")
plt.legend()
plt.show()

@CBroz1 (Author) commented Aug 16, 2022

My original implementation summed the average inter-frame difference and timepoint N-1, but I changed it to future-proof against 0s elsewhere (not just the last few). I can revert to that avg_diff + previous strategy if we think that's best.
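A sketch of that avg_diff + previous strategy, assuming the zeros sit at the end of the array (the function name and toy data are illustrative, not the actual commit):

import numpy as np

def fill_zero_timestamps(timestamps):
    """Replace zero timestamps (after frame 0) with previous stamp + mean interval."""
    timestamps = np.asarray(timestamps, dtype=float).copy()
    good = timestamps != 0
    good[0] = True  # frame 0 legitimately starts at t=0
    avg_diff = np.mean(np.diff(timestamps[good]))
    for i in range(1, len(timestamps)):
        if timestamps[i] == 0:
            timestamps[i] = timestamps[i - 1] + avg_diff
    return timestamps

print(fill_zero_timestamps([0.0, 0.04, 0.08, 0.0, 0.0]))
# [0.   0.04 0.08 0.12 0.16]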

@h-mayorquin commented

> My original implementation summed the average inter-frame difference and timepoint N-1, but I changed it to future-proof against 0s elsewhere (not just the last few). I can revert to that avg_diff + previous strategy if we think that's best.

I feel that we should just leave the np.nan values instead of trying to do imputation in this specific library. It seems like a big commitment to patch OpenCV's faulty reading algorithm locally.

Personally, I would not know how to choose between different strategies of data imputation. Both of your approaches seem good to me for the case of a few zeros at the end.

@CBroz1 (Author) commented Aug 16, 2022

I took from my initial conversation with @bendichter that 0s would break readability for downstream NWB tools. Is that also the case with np.nan, @CodyCBakerPhD?

@h-mayorquin commented

> I took from my initial conversation with @bendichter that 0s would break readability for downstream NWB tools. Is that also the case with np.nan, @CodyCBakerPhD?

That's a good point. Let's see what they say.

@CodyCBakerPhD (Member) commented

In NWB, any timestamps field ought to be strictly increasing. We have in the past set starting_time (when paired with rate for regular series) to NaN to indicate an unknown starting time, but I don't think we've dealt with cases where a small number of individual timestamps in an array are unknown. I think setting them to NaN might have the same disruptive effect in visualizations and processing, which often 'query' a given time point in the vector to align or display data between various objects.

As such, this is technically against the NWB best practice for timestamps, but again, we've never encountered this question, so we might have to discuss it.

I can think of some past cases where the acquisition system (not video-related, but ephys) detected that some individual frames throughout the experiment were to be excluded, which is kind of similar - but we just excluded the frames from the data in that case.

Just thinking aloud below

But this seems fundamentally different in that the frames themselves are well-behaved and contain data, and importantly, you know that the timing of the frame is bounded. If you had a single frame with a timestamp of 0 wedged in between two otherwise known timestamps, I would think it would make sense to just average the timing to occur at the midpoint as a 'best guess'; very little, if anything, would be harmed by doing so. It might technically introduce a small amount of error if the system truly uses some particular design of irregular frame capture, but that is not very common.
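A one-line sketch of that midpoint 'best guess' (illustrative; i is the index of the single frame whose timestamp is unknown):

# average the two known neighbors to place the unknown frame at the midpoint
timestamps[i] = (timestamps[i - 1] + timestamps[i + 1]) / 2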

Maybe the best course of action would be to have this kind of interpolation be an optional flag?

@CodyCBakerPhD (Member) commented

OK, we discussed it a bit more ourselves - what we would recommend for NWB writing purposes is an optional flag (just an example: infer_timestamps: bool = False) that behaves as follows:

if test_if_timestamps_are_regular(timestamps, decimal_tolerance):  # excluding, of course, the NaN ones at the end
    # don't use timestamps at all and just set starting_time + rate instead

# otherwise
if infer_timestamps:
    # do that interpolation thing, but in a way that avoids the plateau effect Heberto pointed out above
    # i.e., just set the missing timestamps to frame_idx / sampling_rate or something similar
else:
    # remove the data for the frames that correspond to missing timing information
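As a rough sketch, the regularity test referenced above might look like the following; the name and decimal_tolerance parameter come from the pseudocode, while the implementation itself is only an assumption:

import numpy as np

def test_if_timestamps_are_regular(timestamps, decimal_tolerance=4):
    """True if all inter-frame intervals agree up to `decimal_tolerance` decimals."""
    good = timestamps[~np.isnan(timestamps)]  # exclude the NaN ones at the end
    intervals = np.round(np.diff(good), decimals=decimal_tolerance)
    return np.unique(intervals).size == 1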

I would expect the majority of use cases to fall under the regularity condition (depending, of course, on the source format/codec/external files with exact timing info, as well as how stringent the decimal_tolerance threshold is set), so the entire thing is a non-issue there.

For significantly irregular series with this issue, NWB would ask that you only include timestamps that are (i) strictly ascending and (ii) free of interspersed NaNs - but it is best to leave it entirely as the user's choice whether to (a) make up timing information that is roughly consistent with the average sampling frequency (interpolation) or (b) simply drop the data from consideration since the exact timing is unknown. I'd also vote for (b) as the default behavior since, if it is a small number of affected frames, not much will ultimately be lost.

@CBroz1 What are your thoughts? Would an optional flag like that make sense to present to a user (or devs such as ourselves)?

@CBroz1 (Author) commented Aug 17, 2022

> @CBroz1 What are your thoughts? Would an optional flag like that make sense to present to a user (or devs such as ourselves)?

@CodyCBakerPhD Sure, we can leave it to user choice.

My C++ is not strong enough to know for sure, but I think it might be the case that, depending on the reader, OpenCV already returns best-guess or error-prone timestamps; there are a couple of examples of this across OpenCV's many readers.

I'm not sure it makes much sense to treat what we're getting back as ground truth when it could come from any number of sources with any number of underlying codecs/calculation strategies. Therefore, I'm inclined to have the default behavior be an inference. I've pushed a commit along these lines and added the resulting NWB files (infer and exclude_nans) to the same drive folder.

@CodyCBakerPhD (Member) commented

> My C++ is not strong enough to know for sure, but I think it might be the case that, depending on the reader, OpenCV already returns best-guess or error-prone timestamps; there are a couple of examples of this across OpenCV's many readers.

Yeah, we've noticed that as well when perusing OpenCV - many of the formats/codecs seem to force interpolated/regular timestamps, but we're hypothesizing that it is 'possible' that there exists at least one reader that supports variable timing information (we would love to find an example file with this encoded, but usually variable timing info is in a separate file, like a .csv adjacent to the file OpenCV reads).

> Therefore, I'm inclined to have the default behavior be an inference.

Fair point, I don't feel that strongly either way since this is such an edge case.

LGTM now. @h-mayorquin, any final thoughts?

@@ -37,6 +37,24 @@ def get_movie_timestamps(movie_file, VARIABILITYBOUND=1000):
"Variability of timestamps suspiciously small. See: https://github.com/DeepLabCut/DLC2NWB/issues/1"
)

if any(timestamps[1:] == 0):
@h-mayorquin commented Aug 17, 2022

Only two small suggestions, then:

  1. I think the imputation should be a function of its own that takes timestamps and returns timestamps.
  2. Some suggestions for your code; not very important, as your comments are clear enough, but maybe you'll find them useful:
bad_timestamps_mask = np.isnan(timestamps)
bad_timestamps_indexes = np.flatnonzero(bad_timestamps_mask)
# mean inter-frame interval estimated from the good timestamps
# (note: diffs that span a gap slightly over-estimate the interval)
estimated_frame_interval = np.mean(np.diff(timestamps[~bad_timestamps_mask]))
# assumes timestamps start at t=0 for frame 0
inferred_timestamps = bad_timestamps_indexes * estimated_frame_interval

timestamps[bad_timestamps_mask] = inferred_timestamps

@CBroz1 (Author) commented

Thanks for the suggestions! I can see how the way you've written it is cleaner and doesn't need comments. It's not clear to me why this block should be a separate function, though. What's your threshold there?

@h-mayorquin commented

To me it is just part of the general arguments for modularization. Of all those arguments, I think two are strong here (the usual one of re-usability probably does not apply within this repository):

  1. If somebody wants to change the specific operation of inference/imputation, they know exactly what to modify and where. Making it a function makes the decoupling of this piece of code from the rest of the function explicit. The specific imputation routine inside the function could change, but its role would still be "inferring bad or missing timestamps". It is a good abstraction in that way.
  2. If someone is taking a first glance at how get_movie_timestamps works, all the details of the specific imputation algorithm are noise at that level. Having them in a function makes the big picture more evident. There are several things going on here: 1) you build a cv reader, 2) you extract the data, 3) you check the timestamps' variability, and 4) you infer wrong/missing timestamps. Codifying one or more of those steps as functions makes that big picture clearer, in my opinion (see the sketch after this list).
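To make that concrete, here is a hypothetical sketch of such a factoring using OpenCV's VideoCapture API; the helper name infer_bad_timestamps and the imputation inside it are assumptions, not this PR's code:

import cv2
import numpy as np

def infer_bad_timestamps(timestamps):
    """Step 4: infer wrong/missing timestamps; the imputation strategy lives only here."""
    bad = np.isnan(timestamps) | (timestamps == 0)
    bad[0] = False  # frame 0 legitimately starts at t=0
    interval = np.mean(np.diff(timestamps[~bad]))  # mean good inter-frame interval
    timestamps[bad] = np.flatnonzero(bad) * interval
    return timestamps

def get_movie_timestamps(movie_file):
    reader = cv2.VideoCapture(movie_file)  # step 1: build a cv reader
    timestamps = []
    while True:  # step 2: extract per-frame times
        success, _frame = reader.read()
        if not success:
            break
        timestamps.append(reader.get(cv2.CAP_PROP_POS_MSEC) / 1000.0)
    reader.release()
    timestamps = np.asarray(timestamps)
    # step 3: a variability check on np.diff(timestamps) would go here
    return infer_bad_timestamps(timestamps)  # step 4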

All that said, those are heuristics, and I am mostly following my intuition / internal model of what people will find easier to understand, read, and maintain in the future. Here is an article that argues against indirection in some cases and delineates where this model breaks down:

https://matthewrocklin.com/blog/work/2019/06/23/avoid-indirection

Thanks for all the hard work and good luck with the other PR.

@h-mayorquin left a comment

Looks good to me as well. Let's see what the maintainers say.

@CBroz1 (Author) commented Aug 18, 2022

@CodyCBakerPhD - Let me know if there's anything else I should do before opening a PR on the DeepLabCut fork.

@CodyCBakerPhD (Member) commented

@CBroz1 Go for it! Thanks for all the work on this 😀
