Adding the tokenizer contents to a FoLiA doc #14

Open
pirolen opened this issue Mar 28, 2023 · 7 comments

pirolen commented Mar 28, 2023

I wonder if there is a straightforward way to take the sentences and their tokens from the tokenizer and add them to a new FoLiA doc.
It is not clear to me how to do that with the add method: is one supposed to recursively access sentences and tokens from the tokenizer that yields Token types, and subsequently render the token contents by scripting (e.g. accessing a token class and then specifying it for a folia.Word annotation), or is there a direct way to add the tokenizer content structure to the FoLiA doc?

Or is python-ucto not meant to be used for that, and one should rather first create a folia doc with untokenized content and run CLI ucto on it?

Owner

proycon commented Mar 28, 2023

This is already built-in functionality. You can just request FoLiA output from python-ucto using foliaoutput=True, see the example in the README.

@proycon proycon self-assigned this Mar 28, 2023
Author

pirolen commented Mar 28, 2023

I have a Python dictionary that holds the elements of a real-life dictionary, i.e. headword and body.
While iterating over the Python dict, I aim to add its tokenized content to a folia.Entry element, so that I end up with a structure like the one below (created by CLI ucto).

So I am constructing the FoLiA doc on the fly.

I initiated the tokenizer with
tokenizer = ucto.Tokenizer("tokconfig-generic", foliaoutput=True)

and then the script goes like:

ft = doc_out.add(folia.Text)
for k, v in entrydict.items():
    ctr += 1
    # Add new entry and ID
    ent = ft.add(folia.Entry, id='e'+str(ctr))
    try:
        # Process and add `Term` content
        fterm = ent.add(folia.Term)
        # Create and access tokenised data from ucto tokenizer
        tokenizer.process(k.strip())
        fterm.add(???)

How shall I add the sentences and tokens from the tokenizer to the Term element?


   <entry xml:id="e2">
      <term xml:id="e2.term.1">
        <s xml:id="e2.term.1.s.1">
          <w xml:id="e2.term.1.s.1.w.1" class="WORD">
            <t>Ab</t>
          </w>
        </s>
      </term>
      <def xml:id="e2.def.1">
        <p xml:id="e2.def.1.p.1">
          <s xml:id="e2.def.1.p.1.s.1">
            <w xml:id="e2.def.1.p.1.s.1.w.1" class="WORD">
              <t>apud</t>
            </w>
            <w xml:id="e2.def.1.p.1.s.1.w.2" class="WORD">
              <t>Hebraeos</t>
            </w>
            <w xml:id="e2.def.1.p.1.s.1.w.3" class="WORD">
              <t>dicitur</t>
...

Owner

proycon commented Mar 28, 2023

Ah ok, you're feeding parts to the tokenizer on the fly, that probably doesn't combine well with foliaoutput=True indeed, as that produces entire documents for the input. You're on the right track:

is one supposed to recursively access sentences and tokens from the tokenizer that yields Token types, and subsequently render the token contents by scripting (e.g. accessing a token class and then specifying it for a folia.Word annotation)

Yes, if you want to do it on-the-fly then there's no shortcut unfortunately.

How shall I add the sentences and tokens from the tokenizer to the Term element?

Iterate over tokenizer and call fterm.add()

Author

pirolen commented Mar 28, 2023

OK, I see. I tried that earlier but got stuck since I am not sure how to safely

  • access sentence starts/ends
  • harmonize Tokens (which the tokenizer holds) with folia.Word annotation (which the folia doc expects to be added to sentences). E.g. one needs to access a token class and then specify it for a folia.Word annotation. I would be indebted for some example code.

This might be a lot of overhead, and I might indeed be better off creating the doc first and then running ucto over it.

Btw, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: #13 (comment)

Owner

proycon commented Apr 3, 2023

  • access sentence starts/ends

The Token instance has two methods to determine if it is at the start/end of a sentence:

  • token.isbeginofsentence()
  • token.isendofsentence()

Similarly, there is a token.newparagraph() (token starts a new paragraph) and a token.nospace() (token is NOT followed by a space).
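
A minimal sketch of how these boundary flags could be used to group tokens into sentences. `StubToken` is a hypothetical stand-in that only mimics the `isbeginofsentence()`, `isendofsentence()` and `nospace()` methods named above, so the example runs without python-ucto; with the real library you would presumably iterate over the tokenizer instead of a hand-built list.

```python
from dataclasses import dataclass


@dataclass
class StubToken:
    """Hypothetical stand-in for a python-ucto Token (not the real class)."""
    text: str
    begin: bool = False   # first token of a sentence
    end: bool = False     # last token of a sentence
    glued: bool = False   # True when NOT followed by a space

    def __str__(self):
        return self.text

    def isbeginofsentence(self):
        return self.begin

    def isendofsentence(self):
        return self.end

    def nospace(self):
        return self.glued


def group_sentences(tokens):
    """Collect tokens into one list per sentence, using the boundary flags."""
    sentences, current = [], []
    for token in tokens:
        if token.isbeginofsentence() and current:
            # A new sentence starts before the previous one was closed.
            sentences.append(current)
            current = []
        current.append(token)
        if token.isendofsentence():
            sentences.append(current)
            current = []
    if current:  # trailing tokens without an explicit sentence end
        sentences.append(current)
    return sentences


def detokenize(sentence):
    """Rebuild the surface text, honouring nospace()."""
    parts = []
    for token in sentence:
        parts.append(str(token))
        if not token.nospace():
            parts.append(" ")
    return "".join(parts).strip()


tokens = [
    StubToken("Ab", begin=True),
    StubToken("dicitur", glued=True),
    StubToken(".", end=True),
    StubToken("Pater", begin=True, end=True),
]
sentences = group_sentences(tokens)
print([detokenize(s) for s in sentences])
```

Checking both the begin and the end flag is a defensive choice here: either one alone is enough to close a sentence, so a missing flag on one side does not merge two sentences.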

  • harmonize Tokens (which the tokenizer holds) with folia.Word annotation (which the folia doc expects to be added to sentences). E.g. one needs to access a token class and then specify it for a folia.Word annotation. I would be indebted for some example code.

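Off the top of my head (untested, so there may be mistakes): take body to be the FoLiA structure element to which you want to add sentences and tokens (some subclass of folia.AbstractStructureElement):

# Keep the current sentence across iterations; resetting it inside the
# loop would leave it as None for every token that does not start a sentence.
sentence = None
for token in tokenizer:
    if sentence is None or token.isbeginofsentence():
        sentence = body.add(folia.Sentence)

    # space=False marks a token that is not followed by a space
    word = sentence.add(folia.Word, str(token), space=not token.nospace())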
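
To illustrate the target shape from earlier in the thread (`<s>` elements containing `<w class="...">` with a `<t>` child), here is a rough sketch that uses only the standard library. It is not the FoLiA library API and it generates neither `xml:id` attributes nor namespaces. `StubToken` is a hypothetical stand-in, and its `tokentype` field is an assumption about where the token class would come from; check the python-ucto README for the actual attribute.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class StubToken:
    """Hypothetical stand-in for a tokenizer token (not python-ucto's class)."""
    text: str
    tokentype: str = "WORD"   # assumed name for the token class
    begin: bool = False       # first token of a sentence

    def isbeginofsentence(self):
        return self.begin


def add_tokens(parent, tokens):
    """Append <s>/<w>/<t> elements for the tokens under `parent`."""
    sentence = None
    for token in tokens:
        # Open a new <s> at each sentence start (or if none is open yet).
        if sentence is None or token.isbeginofsentence():
            sentence = ET.SubElement(parent, "s")
        # Carry the token class over to the word's class attribute.
        word = ET.SubElement(sentence, "w", {"class": token.tokentype})
        ET.SubElement(word, "t").text = token.text
    return parent


term = add_tokens(ET.Element("term"), [
    StubToken("apud", begin=True),
    StubToken("Hebraeos"),
    StubToken("Ab", begin=True),
])
print(ET.tostring(term, encoding="unicode"))
```

The same loop shape should carry over to the FoLiA objects: the sentence container is kept across iterations, and each token contributes one word carrying its class and text.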

Btw, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: #13 (comment)

Ah right! Sorry, missed that one, will take a look!

Author

pirolen commented Apr 3, 2023

I get 'Token' object has no attribute 'newparagraph' :-o

Owner

proycon commented Apr 3, 2023

Right, it should be isnewparagraph().
