
Issue with custom dataset after updating to flair 0.11 #2722

Closed
stefanobranco opened this issue Apr 11, 2022 · 5 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments


stefanobranco commented Apr 11, 2022

Describe the bug
We're using Flair to perform named entity recognition, identifying specific parts of a document that form a citation to another document. Our dataset consists of space-separated tokens and labels, like this:

Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC
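For reference, here is a minimal standalone sketch of how such a space-separated column file could be parsed (plain Python, not Flair's actual ColumnCorpus reader; the function name and blank-line/sentence handling are illustrative assumptions):

```python
def parse_column_file(text):
    """Parse space-separated token/label lines into sentences.

    Blank lines separate sentences; "-DOCSTART-" lines mark document
    boundaries. This is a hypothetical standalone parser, not Flair's
    ColumnCorpus implementation.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "-DOCSTART-":
            if current:
                sentences.append(current)
                current = []
            continue
        # split on the LAST space, so tokens like "1/2010" stay intact
        token, label = line.rsplit(" ", 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

sample = """Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC"""
print(parse_column_file(sample))
```

Splitting on the last space rather than the first is what keeps multi-character tokens such as "1/2010" together with their label.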

I'm reading in the dataset with a ColumnCorpus like this:

from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define columns
columns = {0: 'text', 1: 'ner'}

# this is the folder in which train, test and dev files reside
data_folder = '../Data/Flair/Regex_Tagging_Full'

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              in_memory=True,
                              document_separator_token="-DOCSTART-")

From my understanding, the expected corpus would be something like this:

[screenshot: expected corpus with intact labels]

This is also the way it worked in 0.10. However, ever since upgrading to 0.11, our dataset is being ripped apart, and our labels are cut down in odd ways (it looks like the first two characters are replaced with a forward slash?):

[screenshot: corpus with truncated labels]

I understand the labeling logic has been refactored, but I assume this change in behaviour is unintended; or is there a setting for labels that I'm missing?

This also doesn't seem to be just a display issue in the dataset, since it causes an entirely incorrect label dictionary to be created, full of broken labels.

@stefanobranco stefanobranco added the bug Something isn't working label Apr 11, 2022
alanakbik (Collaborator) commented:

Hello @stefanobranco, I am not seeing this behavior with your snippet. I get the following printout:

Sentence: "Vgl. Rundschreiben RAB 1/2010 Rz 8." → ["RAB"/PARTA, "1/2010"/YEAR, "Rz"/MISC, "8."/MISC]

and if I do:

for entity in corpus.train[0].get_labels('ner'):
    print(entity)

I get:

Token[2]: "RAB" → PARTA (1.0)
Token[3]: "1/2010" → YEAR (1.0)
Token[4]: "Rz" → MISC (1.0)
Token[5]: "8." → MISC (1.0)

So it seems that everything is working as it should.

stefanobranco (Author) commented:

Hi @alanakbik! Thanks for the feedback. I completely uninstalled the flair package and then reinstalled it, and now I can no longer reproduce the problem either. It seems something must have gone wrong during the update on my end. Sorry for the confusion!

stefanobranco (Author) commented Apr 12, 2022

Hey @alanakbik! Sorry to dig this out again, but it turns out the issue is not quite resolved after all, and I think I have figured out the root cause. We use document separator tokens to mark document boundaries. The problem appears only if the training file starts with such a separator token:

-DOCSTART-

Vgl. O
Rundschreiben O
RAB PARTA
1/2010 YEAR
Rz MISC
8. MISC

This happens regardless of whether the document_separator_token value is set. Is it incorrect to have a document separator token right at the start in the first place? It seemed sensible to me, since it's called "-DOCSTART-" in all the examples, but I suppose functionally it might only need to appear after the first document.
I'm not even sure this is a bug, but since the behaviour differed in 0.10, I figured it's still worth looking into.
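As a stop-gap while on an affected version, the file could be preprocessed to drop a leading separator before loading (a hedged sketch; the helper name is made up and this is not part of Flair's API):

```python
def strip_leading_docstart(lines, separator="-DOCSTART-"):
    """Remove a document-separator token (and any surrounding blank
    lines) if it appears before any real content, so the corpus file
    no longer starts with a separator."""
    out = list(lines)
    while out and out[0].strip() in ("", separator):
        out.pop(0)
    return out

lines = ["-DOCSTART-", "", "Vgl. O", "Rundschreiben O"]
print(strip_leading_docstart(lines))  # ['Vgl. O', 'Rundschreiben O']
```

Since "-DOCSTART-" only separates documents, dropping a copy that precedes all content should not change which sentences belong to which document.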

alanakbik (Collaborator) commented:

@stefanobranco just merged a PR that should make span detection more robust and hopefully cover your case (DOCSTART as first sentence).

patrickjae added a commit to showheroes/flair that referenced this issue May 18, 2022
* flairNLPGH-2722: make span detection more robust

stale bot commented Sep 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 9, 2022
@stale stale bot closed this as completed Nov 1, 2022