
Tables aren't redefined for re-runs of UDF apply #536

Open
robbieculkin opened this issue Mar 9, 2021 · 5 comments

Comments

@robbieculkin

Description of the bug

As part of iterative development in a Jupyter environment, apply may be re-run several times; the developer might, for example, update candidates or create a new labeling function.
When this happens, the corresponding Postgres table is cleared but not dropped. As a result, the table's definition cannot change to accommodate the updated parameters for apply.
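A schematic illustration of the clear-vs-drop distinction the report describes (using sqlite3 in place of Postgres for portability; the real Fonduer label table layout differs, so this is only a sketch of the general problem):

```python
import sqlite3

# Hypothetical miniature "label" table with one column per labeling function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE label (candidate_id INTEGER, lf_a INTEGER)")

# "Clearing" removes rows but keeps the old column definition:
conn.execute("DELETE FROM label")
cols = [c[1] for c in conn.execute("PRAGMA table_info(label)")]
print(cols)  # ['candidate_id', 'lf_a'] -- no room for a newly added LF

# "Dropping" lets the table be recreated with an updated definition:
conn.execute("DROP TABLE label")
conn.execute(
    "CREATE TABLE label (candidate_id INTEGER, lf_a INTEGER, lf_new INTEGER)"
)
cols = [c[1] for c in conn.execute("PRAGMA table_info(label)")]
print(cols)  # ['candidate_id', 'lf_a', 'lf_new']
```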

To Reproduce

Steps to reproduce the behavior:

  1. Run the max_storage_temp_tutorial notebook in fonduer-tutorials, up to and including the Labeling Functions section.
  2. Add a new LF; it doesn't need to do anything in particular (it could return ABSTAIN every time). Add it to the stg_temp_lfs list.
  3. Re-run the remainder of cells in the section.
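Step 2 above can be as minimal as the following hypothetical labeling function (ABSTAIN is the constant used in the fonduer tutorials, with value -1):

```python
# Labeling function that never votes on any candidate.
ABSTAIN = -1

def lf_always_abstain(c):
    # Abstains on every candidate, regardless of its contents.
    return ABSTAIN

# Appended to the tutorial's LF list before re-running the Labeler, e.g.:
# stg_temp_lfs = [..., lf_always_abstain]
```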

Upon calling LFAnalysis, the following exception is thrown:

ValueError: Number of LFs (7) and number of LF matrix columns (6) are different

Expected behavior

Underlying tables for a re-run of a UDF apply method should be dropped, not merely cleared, so that their definitions can change.

Error Logs/Screenshots

Full stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-62-e005feee6300> in <module>
      5 sorted_lfs = sorted(lfs, key=lambda lf: lf.name)
      6 
----> 7 LFAnalysis(L=L_train[0], lfs=sorted_lfs).lf_summary(Y=L_gold_train[0].reshape(-1))

~/.venv/lib/python3.7/site-packages/snorkel/labeling/analysis.py in __init__(self, L, lfs)
     44             if len(lfs) != self._L_sparse.shape[1]:
     45                 raise ValueError(
---> 46                     f"Number of LFs ({len(lfs)}) and number of "
     47                     f"LF matrix columns ({self._L_sparse.shape[1]}) are different"
     48                 )

ValueError: Number of LFs (7) and number of LF matrix columns (6) are different

Environment (please complete the following information)

  • OS: Ubuntu 18.04
  • PostgreSQL Version: 12.1
  • Poppler Utils Version: 0.71.0-5
  • Fonduer Version: 0.8.3

Additional context

#263 (comment) advises restarting Python, but this does not appear to solve the problem.

@senwu
Collaborator

senwu commented Mar 14, 2021

Hi @robbieculkin,

Thanks for your question.

I think this is a problem on the Snorkel side, since Snorkel assumes every labeling function applies to at least one sample in the dataset; in other words, a labeling function cannot always return ABSTAIN.

In Fonduer, we save labeling function outputs in a sparse format: we store each labeling function's name as a key along with its non-ABSTAIN outputs, so if a function always returns ABSTAIN, Fonduer won't save any results for it. We then send the labeling function names and outputs to Snorkel to calculate the weak labels, which causes your error when some labeling function always returns ABSTAIN.
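The mismatch Sen describes can be illustrated with plain NumPy (a sketch, not Fonduer's actual storage code): if ABSTAIN votes are simply not stored, an LF that only abstains contributes no entries, so its column never appears in the reconstructed matrix.

```python
import numpy as np

ABSTAIN = -1

# Three LFs are defined, but one of them only ever abstains.
lf_names = ["lf_a", "lf_b", "lf_always_abstain"]

# Sparse storage: (candidate_index, lf_name) -> label.
# ABSTAIN votes are omitted entirely, so lf_always_abstain leaves no trace.
votes = {
    (0, "lf_a"): 1,
    (1, "lf_b"): 0,
}

# Reconstruct a dense matrix from the stored (non-ABSTAIN) votes only.
stored_lfs = sorted({name for (_, name) in votes})
L = np.full((2, len(stored_lfs)), ABSTAIN)
for (i, name), label in votes.items():
    L[i, stored_lfs.index(name)] = label

# The matrix has one column per LF that voted at least once...
print(L.shape[1])     # 2
# ...but the analysis is handed the full LF list, hence the mismatch
# reported by LFAnalysis: "Number of LFs (3) and number of LF matrix
# columns (2) are different".
print(len(lf_names))  # 3
```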

FYI: Fonduer updates labeling function outputs incrementally, meaning it only updates or adds results (it won't clear existing results by default). If you want to clear all existing results, you can call the clear() function.
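A sketch of that from-scratch workflow as a hypothetical helper. It assumes the Fonduer 0.8.x Labeler, whose apply method takes a clear flag and a list of LF lists (one per candidate class); check the Labeler API for the exact signature before relying on this.

```python
def relabel_from_scratch(labeler, docs, lfs):
    """Wipe stale label results and re-apply an updated LF list.

    Hypothetical helper: assumes ``labeler`` is a fonduer Labeler whose
    ``apply`` accepts ``clear=True`` to drop previous results instead of
    updating them incrementally, and ``lfs`` wrapped in a list (one LF
    list per candidate class).
    """
    labeler.apply(docs=docs, lfs=[lfs], train=True, clear=True)
```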

Thanks,
Sen

@robbieculkin
Author

Hi Sen, thanks for the information.

Maybe it's an edge case, but I can imagine scenarios (like mine) with small training sets or very specific labeling functions that might result in only ABSTAIN answers.
Is this an issue worth raising with Snorkel?

Thanks,
Robbie

@senwu
Collaborator

senwu commented Mar 30, 2021

Hi @robbieculkin,

I am not sure whether Snorkel can handle that or not. Let us think of a way to solve this issue on our side.

Sen

@robbieculkin
Author

Thanks @senwu, I really appreciate your team's support.

@senwu
Collaborator

senwu commented Mar 30, 2021

@robbieculkin Sorry for the late response. We will fix this asap.
