
Tables aren't redefined for re-runs of UDF apply #536

Open
robbieculkin opened this issue Mar 9, 2021 · 5 comments

Comments

@robbieculkin

Description of the bug

As part of iterative development in a Jupyter environment, apply may be re-run several times; the developer might, for example, update candidates or create a new labeling function.
When this happens, the corresponding Postgres table is cleared but not dropped. As a result, the table's definition cannot change to accommodate the updated parameters for apply.
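A schematic illustration of the clear-vs-drop distinction the report describes (using sqlite3 in place of Postgres for portability; the real Fonduer label table layout differs, so this is only a sketch of the general problem):

```python
import sqlite3

# Hypothetical miniature "label" table with one column per labeling function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE label (candidate_id INTEGER, lf_a INTEGER)")

# "Clearing" removes rows but keeps the old column definition:
conn.execute("DELETE FROM label")
cols = [c[1] for c in conn.execute("PRAGMA table_info(label)")]
print(cols)  # ['candidate_id', 'lf_a'] -- no room for a newly added LF

# "Dropping" lets the table be recreated with an updated definition:
conn.execute("DROP TABLE label")
conn.execute(
    "CREATE TABLE label (candidate_id INTEGER, lf_a INTEGER, lf_new INTEGER)"
)
cols = [c[1] for c in conn.execute("PRAGMA table_info(label)")]
print(cols)  # ['candidate_id', 'lf_a', 'lf_new']
```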

To Reproduce

Steps to reproduce the behavior:

  1. Run the max_storage_temp_tutorial notebook in fonduer-tutorials, up to and including the Labeling Functions section.
  2. Add a new LF; it doesn't need to do anything in particular (it could return ABSTAIN every time). Add it to the stg_temp_lfs list.
  3. Re-run the remainder of cells in the section.
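Step 2 above can be as minimal as the following hypothetical labeling function (ABSTAIN is the constant used in the fonduer tutorials, with value -1):

```python
# Labeling function that never votes on any candidate.
ABSTAIN = -1

def lf_always_abstain(c):
    # Abstains on every candidate, regardless of its contents.
    return ABSTAIN

# Appended to the tutorial's LF list before re-running the Labeler, e.g.:
# stg_temp_lfs = [..., lf_always_abstain]
```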

Upon calling LFAnalysis, the following exception is thrown:

ValueError: Number of LFs (7) and number of LF matrix columns (6) are different

Expected behavior

Underlying tables for a re-run of a UDF apply method should be dropped, not merely cleared, so that their definitions can change.

Error Logs/Screenshots

Full stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-62-e005feee6300> in <module>
      5 sorted_lfs = sorted(lfs, key=lambda lf: lf.name)
      6 
----> 7 LFAnalysis(L=L_train[0], lfs=sorted_lfs).lf_summary(Y=L_gold_train[0].reshape(-1))

~/.venv/lib/python3.7/site-packages/snorkel/labeling/analysis.py in __init__(self, L, lfs)
     44             if len(lfs) != self._L_sparse.shape[1]:
     45                 raise ValueError(
---> 46                     f"Number of LFs ({len(lfs)}) and number of "
     47                     f"LF matrix columns ({self._L_sparse.shape[1]}) are different"
     48                 )

ValueError: Number of LFs (7) and number of LF matrix columns (6) are different

Environment (please complete the following information)

  • OS: Ubuntu 18.04
  • PostgreSQL Version: 12.1
  • Poppler Utils Version: 0.71.0-5
  • Fonduer Version: 0.8.3

Additional context

#263 (comment) advises restarting Python, but this does not appear to solve the problem.

@senwu
Collaborator

senwu commented Mar 14, 2021

Hi @robbieculkin,

Thanks for your question.

I think this is a problem on the Snorkel side, since Snorkel assumes every labeling function applies to at least one sample in the dataset; in other words, a labeling function cannot always return ABSTAIN.

In Fonduer, we save labeling function outputs in a sparse format: we store each labeling function's name as a key along with its non-ABSTAIN outputs, so if a function always returns ABSTAIN, Fonduer won't save any results for it. We then send the labeling function names and outputs to Snorkel to calculate the weak labels, which causes your error when some labeling function always returns ABSTAIN.
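The mismatch Sen describes can be illustrated with plain NumPy (a sketch, not Fonduer's actual storage code): if ABSTAIN votes are simply not stored, an LF that only abstains contributes no entries, so its column never appears in the reconstructed matrix.

```python
import numpy as np

ABSTAIN = -1

# Three LFs are defined, but one of them only ever abstains.
lf_names = ["lf_a", "lf_b", "lf_always_abstain"]

# Sparse storage: (candidate_index, lf_name) -> label.
# ABSTAIN votes are omitted entirely, so lf_always_abstain leaves no trace.
votes = {
    (0, "lf_a"): 1,
    (1, "lf_b"): 0,
}

# Reconstruct a dense matrix from the stored (non-ABSTAIN) votes only.
stored_lfs = sorted({name for (_, name) in votes})
L = np.full((2, len(stored_lfs)), ABSTAIN)
for (i, name), label in votes.items():
    L[i, stored_lfs.index(name)] = label

# The matrix has one column per LF that voted at least once...
print(L.shape[1])     # 2
# ...but the analysis is handed the full LF list, hence the mismatch
# reported by LFAnalysis: "Number of LFs (3) and number of LF matrix
# columns (2) are different".
print(len(lf_names))  # 3
```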

FYI: Fonduer updates labeling function outputs incrementally, meaning it only updates or adds results (it won't clear existing results by default). If you want to clear all existing results, you can call the clear() function.
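A sketch of that from-scratch workflow as a hypothetical helper. It assumes the Fonduer 0.8.x Labeler, whose apply method takes a clear flag and a list of LF lists (one per candidate class); check the Labeler API for the exact signature before relying on this.

```python
def relabel_from_scratch(labeler, docs, lfs):
    """Wipe stale label results and re-apply an updated LF list.

    Hypothetical helper: assumes ``labeler`` is a fonduer Labeler whose
    ``apply`` accepts ``clear=True`` to drop previous results instead of
    updating them incrementally, and ``lfs`` wrapped in a list (one LF
    list per candidate class).
    """
    labeler.apply(docs=docs, lfs=[lfs], train=True, clear=True)
```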

Thanks,
Sen

@robbieculkin
Author

Hi Sen, thanks for the information.

Maybe it's an edge case, but I can imagine scenarios (like mine) with small training sets or very specific labeling functions that might result in only ABSTAIN answers.
Is this an issue worth raising with Snorkel?

Thanks,
Robbie

@senwu
Collaborator

senwu commented Mar 30, 2021

Hi @robbieculkin,

I am not sure whether Snorkel can handle that or not. Let us think of a way to solve this issue on our side.

Sen

@robbieculkin
Author

Thanks @senwu, I really appreciate your team's support.

@senwu
Collaborator

senwu commented Mar 30, 2021

@robbieculkin Sorry for the late response. We will fix this asap.
