
Issue with custom dataset after updating to flair 0.11 #2722

Closed
stefanobranco opened this issue Apr 11, 2022 · 5 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments


stefanobranco commented Apr 11, 2022

Describe the bug
We're using Flair to perform named entity recognition, identifying specific parts of a document that form a citation to another document. Our dataset consists of space-separated tokens and labels, like this:

Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC
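For reference, here is a minimal standalone sketch of how such a space-separated column file could be parsed (plain Python, not Flair's actual ColumnCorpus reader; the function name and blank-line/sentence handling are illustrative assumptions):

```python
def parse_column_file(text):
    """Parse space-separated token/label lines into sentences.

    Blank lines separate sentences; "-DOCSTART-" lines mark document
    boundaries. This is a hypothetical standalone parser, not Flair's
    ColumnCorpus implementation.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "-DOCSTART-":
            if current:
                sentences.append(current)
                current = []
            continue
        # split on the LAST space, so tokens like "1/2010" stay intact
        token, label = line.rsplit(" ", 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

sample = """Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC"""
print(parse_column_file(sample))
```

Splitting on the last space rather than the first is what keeps multi-character tokens such as "1/2010" together with their label.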

I'm reading in the dataset with a ColumnCorpus like this:

from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define columns
columns = {0: 'text', 1: 'ner'}

# this is the folder in which train, test and dev files reside
data_folder = '../Data/Flair/Regex_Tagging_Full'

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              in_memory=True,
                              document_separator_token="-DOCSTART-")

From my understanding, the expected corpus would be something like this:

[screenshot: expected corpus with intact labels]

This is also the way it worked in 0.10. However, ever since upgrading to 0.11, our dataset is being ripped apart, and our labels are cut down in odd ways (it looks like the first two characters are replaced with a forward slash?):

[screenshot: corpus with truncated labels]

I understand the labeling logic has been refactored, but I assume this change in behaviour is unintended; or is there a setting for labels that I'm missing?

This also doesn't seem to be just a display issue in the dataset, since it causes an entirely incorrect label dictionary to be created, full of broken labels.

@stefanobranco stefanobranco added the bug Something isn't working label Apr 11, 2022
alanakbik (Collaborator) commented:

Hello @stefanobranco, I am not seeing this behavior with your snippet. I get the following printout:

Sentence: "Vgl. Rundschreiben RAB 1/2010 Rz 8." → ["RAB"/PARTA, "1/2010"/YEAR, "Rz"/MISC, "8."/MISC]

and if I do:

for entity in corpus.train[0].get_labels('ner'):
    print(entity)

I get:

Token[2]: "RAB" → PARTA (1.0)
Token[3]: "1/2010" → YEAR (1.0)
Token[4]: "Rz" → MISC (1.0)
Token[5]: "8." → MISC (1.0)

So it seems that everything is working as it should.

stefanobranco (Author) commented:

Hi @alanakbik! Thanks for the feedback. I completely uninstalled the flair package and then reinstalled it, and now I can no longer reproduce the problem either. It seems something must have gone wrong during the update on my end. Sorry for the confusion!

stefanobranco (Author) commented Apr 12, 2022

Hey @alanakbik! Sorry to dig this out again, but it turns out the issue is not quite resolved after all, and I think I have figured out the root cause. We use document separator tokens to mark document boundaries. The problem appears only if the training file starts with such a separator token:

-DOCSTART-

Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC

This happens regardless of whether the document_separator_token value is set. Is it incorrect to have a document separator token right at the start in the first place? It seemed sensible to me, since it's called "-DOCSTART-" in all the examples, but I suppose functionally it might only need to appear after the first document.
I'm not even sure this is a bug, but since the behaviour differed in 0.10, I figured it's still worth looking into.
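As a stop-gap while on an affected version, the file could be preprocessed to drop a leading separator before loading (a hedged sketch; the helper name is made up and this is not part of Flair's API):

```python
def strip_leading_docstart(lines, separator="-DOCSTART-"):
    """Remove a document-separator token (and any surrounding blank
    lines) if it appears before any real content, so the corpus file
    no longer starts with a separator."""
    out = list(lines)
    while out and out[0].strip() in ("", separator):
        out.pop(0)
    return out

lines = ["-DOCSTART-", "", "Vgl. O", "Rundschreiben O"]
print(strip_leading_docstart(lines))  # ['Vgl. O', 'Rundschreiben O']
```

Since "-DOCSTART-" only separates documents, dropping a copy that precedes all content should not change which sentences belong to which document.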

alanakbik (Collaborator) commented:

@stefanobranco just merged a PR that should make span detection more robust and hopefully cover your case (DOCSTART as first sentence).

patrickjae added a commit to showheroes/flair that referenced this issue May 18, 2022
* flairNLPGH-2722: make span detection more robust

stale bot commented Sep 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 9, 2022
@stale stale bot closed this as completed Nov 1, 2022