Adding the tokenizer contents to a FoLiA doc #14
Comments
This is already built-in functionality. You can just request FoLiA output from python-ucto using |
I have a Python dictionary that holds elements of a real-life dictionary, i.e. headword and body. So I am constructing the FoLiA doc on the fly. I initiated the tokenizer with and then the script goes like:
How shall I add the sentences and tokens from the tokenizer to the Term element?
|
Ah ok, you're feeding parts to the tokenizer on the fly, that probably doesn't combine well with
Yes, if you want to do it on-the-fly then there's no shortcut unfortunately.
Iterate over |
OK, I see. I tried that earlier but got stuck since I am not sure how to safely
This might be a lot of overhead, and I might indeed be better off creating the doc first and then running ucto over it. By the way, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: #13 (comment) |
The
Similarly, there is a
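Off the top of my head (untested so there may be mistakes), take:

    # `body` is the FoLiA element the sentences are added to
    sentence = None  # set once, before the loop, so it persists across tokens
    for token in tokenizer:
        # open a new sentence at each sentence start (or if none is open yet)
        if sentence is None or token.isbeginofsentence():
            sentence = body.add(folia.Sentence)
        word = sentence.add(folia.Word, str(token), space=not token.nospace())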
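The grouping logic in that loop can be tried out without ucto or FoLiA installed. The following is only a sketch: `StubToken`, `group_into_sentences` and `render` are hypothetical names made up for illustration, and `StubToken` merely imitates the two token methods used above (`isbeginofsentence()` and `nospace()`). It is not the real python-ucto class, and plain lists stand in for `folia.Sentence` and `folia.Word`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class StubToken:
    """Hypothetical stand-in for a tokenizer token (not the real ucto class)."""
    text: str
    begin_of_sentence: bool = False
    no_space: bool = False

    def isbeginofsentence(self) -> bool:
        return self.begin_of_sentence

    def nospace(self) -> bool:
        return self.no_space

    def __str__(self) -> str:
        return self.text


def group_into_sentences(tokens: Iterable[StubToken]) -> List[List[Tuple[str, bool]]]:
    """Group a flat token stream into sentences of (text, space_after) pairs."""
    sentences: List[List[Tuple[str, bool]]] = []
    current = None  # kept outside the loop so it survives across tokens
    for token in tokens:
        # Start a new sentence at each sentence boundary, or if none is open yet.
        if current is None or token.isbeginofsentence():
            current = []
            sentences.append(current)
        current.append((str(token), not token.nospace()))
    return sentences


def render(sentence: List[Tuple[str, bool]]) -> str:
    """Rebuild the sentence text from the tokens and their space flags."""
    return "".join(text + (" " if space else "") for text, space in sentence).strip()
```

In a real script, the place where a new list is opened would correspond to adding a sentence to the parent element, and the append would correspond to adding a word to that sentence.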
Ah right! Sorry, missed that one, will take a look! |
I get 'Token' object has no attribute 'newparagraph' :-o |
Right, it should be |
I wonder if there is a straightforward way to add from the tokenizer the sentences and their token content to build a new folia doc.
It is not clear to me how to do that with the add method: is one supposed to recursively access sentences and tokens from the tokenizer that yields Token types, and subsequently render the token contents by scripting (e.g. accessing a token class and then specifying it for a folia.Word annotation), or is there a direct way to add the tokenizer content structure to the FoLiA doc? Or is python-ucto not meant to be used for that, and one should rather first create a FoLiA doc with untokenized content and run CLI ucto on it?