fix(alignment-filling-start-end): fix the bug that may generate non-monotonic timecode when filling Nan #793

Closed

@fseasy fseasy commented Apr 28, 2024

Problem

Timecodes become non-monotonic when a segment has both an invalid (NaN) start and an invalid end.

See the following example.

Raw data from alignment model:

seg   start  end
seg1  NaN    NaN
seg2  1      3
seg3  4      9

Filled result:

seg   start  end
seg1  1      3
seg2  1      3
seg3  4      9

The end of seg1 (3) is greater than the start of seg2 (1), so the timecodes are non-monotonic.

Why

aligned_subsegments["start"] = interpolate_nans(aligned_subsegments["start"], method=interpolate_method)
aligned_subsegments["end"] = interpolate_nans(aligned_subsegments["end"], method=interpolate_method)

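The current code fills the start and end columns independently. In the example, seg1's NaN start is back-filled from seg2's start (1) and its NaN end from seg2's end (3), so seg1 ends up as a copy of seg2's span.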

Fix

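Interpolate them jointly instead: flatten the (start, end) pairs into one sequence in time order, fill it, then split it back into the two columns. This yields a monotonic result: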

seq_timecode_vals = aligned_subsegments[["start", "end"]].values.ravel("C")
filled_seq_timecodes = interpolate_nans(pd.Series(seq_timecode_vals), method=interpolate_method)
aligned_subsegments["start"] = filled_seq_timecodes.iloc[::2].values
aligned_subsegments["end"] = filled_seq_timecodes.iloc[1::2].values
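The interleaving that the fix relies on can be seen in isolation. The following is a minimal sketch using NumPy only; the back-fill step is a simplified stand-in for `interpolate_nans` that handles only leading NaNs, and the names `pairs`, `flat`, and `filled` are illustrative rather than taken from the codebase.

```python
import numpy as np

# One row per segment: [start, end]; the first segment has no valid timecode.
pairs = np.array([[np.nan, np.nan], [1.0, 3.0], [4.0, 9.0]])

# Row-major ("C") ravel interleaves the columns in time order:
# start0, end0, start1, end1, start2, end2
flat = pairs.ravel("C")

# Simplified stand-in for interpolate_nans (assumption: only leading NaNs
# occur): back-fill them with the first valid value in the sequence.
valid = ~np.isnan(flat)
first_valid_idx = int(np.argmax(valid))
filled = flat.copy()
filled[:first_valid_idx] = flat[first_valid_idx]

# De-interleave: even positions are starts, odd positions are ends.
starts = filled[::2]
ends = filled[1::2]

print(starts.tolist())  # [1.0, 1.0, 4.0]
print(ends.tolist())    # [1.0, 3.0, 9.0]
```

Because the filled sequence is a single time-ordered series, a missing start and its missing end both receive the same neighbouring value, rather than one value from each column.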

Test Code

import pandas as pd
datas = [
    [None, None],
    [1, 3],
    [4, 9]
]
records = [
    {
        "start": p[0],
        "end": p[1],
        "words": f"{p}",
    }
    for p in datas
]

# copied from utils so this script runs standalone
# (note: pandas needs SciPy installed for method='nearest')
def interpolate_nans(x, method='nearest'):
    if x.notnull().sum() > 1:
        return x.interpolate(method=method).ffill().bfill()
    else:
        return x.ffill().bfill()

# ORIGINAL RESULT
interpolate_method = "nearest"
aligned_subsegments = pd.DataFrame.from_records(records)
## -> old logic
aligned_subsegments["start"] = interpolate_nans(aligned_subsegments["start"], method=interpolate_method)
aligned_subsegments["end"] = interpolate_nans(aligned_subsegments["end"], method=interpolate_method)
print(aligned_subsegments)

#>    start  end         words
#> 0    1.0  3.0  [None, None] # not monotonic
#> 1    1.0  3.0        [1, 3]
#> 2    4.0  9.0        [4, 9]

## New result
interpolate_method = "nearest"
aligned_subsegments = pd.DataFrame.from_records(records)
## new logic
seq_timecode_vals = aligned_subsegments[["start", "end"]].values.ravel("C")
filled_seq_timecodes = interpolate_nans(pd.Series(seq_timecode_vals), method=interpolate_method)
aligned_subsegments["start"] = filled_seq_timecodes.iloc[::2].values
aligned_subsegments["end"] = filled_seq_timecodes.iloc[1::2].values
print(aligned_subsegments)

#>    start  end         words
#> 0    1.0  1.0  [None, None] # fixed
#> 1    1.0  3.0        [1, 3]
#> 2    4.0  9.0        [4, 9]
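To check the two printed results mechanically rather than by eye, a small helper can test whether the flattened start/end sequence ever decreases. This is a sketch only: `is_monotonic` is a hypothetical helper, not part of the project, and the two lists are copied by hand from the outputs above.

```python
from itertools import chain


def is_monotonic(segments):
    """Return True if start0, end0, start1, end1, ... never decreases."""
    flat = list(chain.from_iterable(segments))
    return all(a <= b for a, b in zip(flat, flat[1:]))


# (start, end) pairs copied from the two printed DataFrames.
old_result = [(1.0, 3.0), (1.0, 3.0), (4.0, 9.0)]
new_result = [(1.0, 1.0), (1.0, 3.0), (4.0, 9.0)]

print(is_monotonic(old_result))  # False: seg1 ends at 3.0, seg2 starts at 1.0
print(is_monotonic(new_result))  # True
```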

Final Notes: Why do we get NaN for a whole segment?

See this sentence example:

1. Quickly put the water to the table

The punc model wrongly splits it into 2 sentences:

1.  => Oops, this segment gets NaN for both start and end
Quickly put the water to the table

@fseasy changed the title from "fix(alignment-filling-start-end): fix the bug that may generate non-monotonic timecode" to "fix(alignment-filling-start-end): fix the bug that may generate non-monotonic timecode when filling Nan" on Apr 28, 2024
@fseasy closed this by deleting the head repository on Nov 2, 2024