TSV to STAM conversion #1

Closed
proycon opened this issue Mar 25, 2023 · 3 comments


proycon commented Mar 25, 2023

Implement TSV (possibly also CSV but let's keep it simple) to STAM conversion.

proycon added the enhancement (New feature or request) label Mar 25, 2023
proycon self-assigned this Mar 25, 2023

proycon commented Mar 26, 2023

Make the columns configurable, also for STAM to TSV conversion.


proycon commented May 25, 2023

I have started implementing this. The idea is to have a flexible and powerful method of
ingesting tabular stand-off annotation data into STAM and, if needed,
either automatically aligning it with a text file (i.e. computing offsets if they are not
explicitly provided) or even reconstructing the text file from scratch.

Users should be able to provide simple TSV data like:

Text	pos
Hello	interjection
world	noun

Here Text is a recognized column and pos is not, so pos translates to a
DataKey (pos) in an AnnotationSet (undefined here). When the data is loaded against an existing
resource file (like the one below), the offsets are computed automatically:

Hello world!
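
As a rough illustration of the automatic alignment described above, here is a minimal Python sketch. It is not the actual implementation: the function name `align_rows` and the forward-search strategy are assumptions, and end offsets are assumed to be exclusive. It relies on the rows being sequential, so each search starts where the previous match ended.

```python
import csv
import io


def align_rows(tsv_data: str, text: str) -> list[dict]:
    """Attach character offsets to each row by searching forward in the text.

    Hypothetical helper: assumes rows are sequential and that every
    value in the Text column occurs verbatim in the resource text.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_data), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    cursor = 0  # never search before the end of the previous match
    aligned = []
    for row in reader:
        begin = text.find(row["Text"], cursor)
        if begin == -1:
            raise ValueError(
                f"{row['Text']!r} not found at or after offset {cursor}"
            )
        end = begin + len(row["Text"])  # exclusive end offset (assumption)
        aligned.append({**row, "BeginOffset": begin, "EndOffset": end})
        cursor = end
    return aligned


rows = align_rows(
    "Text\tpos\nHello\tinterjection\nworld\tnoun\n", "Hello world!"
)
for row in rows:
    print(row["Text"], row["BeginOffset"], row["EndOffset"], row["pos"])
```

Under these assumptions, Hello is aligned to offsets 0 to 5 and world to 6 to 11; the trailing exclamation mark simply stays unannotated.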

Alternatively, this text (without the exclamation mark) can be reconstructed from
the input data (with a space as the output delimiter). Note that text
input doesn't need to be constrained to words/tokens. Reconstruction and
alignment both assume the input rows are sequential. If rows are explicitly
marked as not sequential (via some parameter), we can fall back on a tagging
mechanism that simply tags all found matches (e.g. at natural word boundaries).
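
To make reconstruction and the tagging fallback more concrete, here is a small Python sketch. Both function names (`reconstruct`, `tag_all`) are hypothetical, and the word-boundary rule used for tagging (no word character directly before or after a match) is just one possible reading of "natural word boundaries".

```python
import re


def reconstruct(
    fragments: list[str], delimiter: str = " "
) -> tuple[str, list[tuple[int, int]]]:
    """Join sequential fragments and return the text plus (begin, end) offsets."""
    parts, offsets, cursor = [], [], 0
    for i, fragment in enumerate(fragments):
        if i > 0:
            # The delimiter goes between fragments, never before the first one.
            parts.append(delimiter)
            cursor += len(delimiter)
        offsets.append((cursor, cursor + len(fragment)))
        parts.append(fragment)
        cursor += len(fragment)
    return "".join(parts), offsets


def tag_all(text: str, fragment: str) -> list[tuple[int, int]]:
    """Find every occurrence of fragment that is not part of a larger word."""
    pattern = re.compile(r"(?<!\w)" + re.escape(fragment) + r"(?!\w)")
    return [match.span() for match in pattern.finditer(text)]


text, offsets = reconstruct(["Hello", "world"])
print(text, offsets)

# Non-sequential fallback: tag all matches, skipping "otherworldly".
print(tag_all("a world within a world, otherworldly", "world"))
```

Reconstruction yields the offsets as a by-product, so no separate alignment pass is needed in that mode.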

The above illustrates the more complex case I want to support, where the input data
is incomplete. When more predefined columns are used, parsing can be much
simpler and no alignment or reconstruction is needed in the first place:

Text	BeginOffset	EndOffset	pos
Hello	0	5	interjection
world	6	11	noun
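
With explicit offsets, parsing reduces to reading the columns and optionally validating the Text column against the resource. The Python sketch below is illustrative only: the `RECOGNIZED` set and the function `parse_with_offsets` are hypothetical names, end offsets are assumed to be exclusive (so world in "Hello world!" spans 6 to 11), and every unrecognized column is treated as a custom key/value pair.

```python
import csv
import io

# Hypothetical set of predefined column names; anything else is custom data.
RECOGNIZED = {"Text", "BeginOffset", "EndOffset"}


def parse_with_offsets(tsv_data: str, text: str) -> list[dict]:
    """Parse rows that carry explicit offsets and validate them against text."""
    reader = csv.DictReader(
        io.StringIO(tsv_data), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    annotations = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        begin, end = int(row["BeginOffset"]), int(row["EndOffset"])
        # Text validation: the slice must equal the Text column exactly.
        if text[begin:end] != row["Text"]:
            raise ValueError(
                f"line {line_no}: expected {row['Text']!r} at {begin}:{end}, "
                f"found {text[begin:end]!r}"
            )
        data = {key: value for key, value in row.items() if key not in RECOGNIZED}
        annotations.append({"begin": begin, "end": end, "data": data})
    return annotations


tsv = (
    "Text\tBeginOffset\tEndOffset\tpos\n"
    "Hello\t0\t5\tinterjection\n"
    "world\t6\t11\tnoun\n"
)
annotations = parse_with_offsets(tsv, "Hello world!")
print(annotations)
```

The validation step catches off-by-one mistakes in the offsets early: a row claiming that world ends at offset 10 would be rejected, because the slice 6 to 10 only covers "worl".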

proycon added a commit that referenced this issue May 29, 2023
proycon added a commit that referenced this issue May 30, 2023
This includes text validation and support for custom columns.
Other parse modes are to be implemented still.
proycon added the ready (This has been implemented but not released yet) label Jun 5, 2023

proycon commented Jun 5, 2023

This is mostly implemented now.
