Adding the tokenizer contents to a FoLiA doc #14

pirolen · 2023-03-28T11:24:51Z

I wonder if there is a straightforward way to take the sentences and their tokens from the tokenizer and add them to a new FoLiA doc.
It is not clear to me how to do that with the add method: is one supposed to recursively access sentences and tokens from the tokenizer that yields Token types, and subsequently render the token contents by scripting (e.g. accessing a token class and then specifying it for a folia.Word annotation), or is there a direct way to add the tokenizer content structure to the FoLiA doc?

Or is python-ucto not meant to be used for that, and one should rather first create a folia doc with untokenized content and run CLI ucto on it?

proycon · 2023-03-28T12:31:45Z

This is already built-in functionality. You can just request FoLiA output from python-ucto using foliaoutput=True, see the example in the README.

pirolen · 2023-03-28T12:51:38Z

I have a Python dictionary that holds elements of a real-life dictionary, i.e. headword and body.
While iterating over the Python dict, I aim to add its tokenized content to a folia.entry element, so that I end up with a structure like below (created by CLI ucto).

So I am constructing the folia doc on the fly.

I initiated the tokenizer with
tokenizer = ucto.Tokenizer("tokconfig-generic", foliaoutput=True)

and then the script goes like:

ft = doc_out.add(folia.Text)
for k, v in entrydict.items():
    ctr += 1
    ''' Add new entry and ID '''
    ent = ft.add(folia.Entry, id='e'+str(ctr))
    try:
        ''' Process and add `Term` content  '''
        fterm = ent.add(folia.Term)
        ''' Create and access tokenised data from ucto tokenizer '''
        tokenizer.process(k.strip())
        fterm.add(???)

How shall I add the sentences and tokens from the tokenizer to the Term element?

   <entry xml:id="e2">
      <term xml:id="e2.term.1">
        <s xml:id="e2.term.1.s.1">
          <w xml:id="e2.term.1.s.1.w.1" class="WORD">
            <t>Ab</t>
          </w>
        </s>
      </term>
      <def xml:id="e2.def.1">
        <p xml:id="e2.def.1.p.1">
          <s xml:id="e2.def.1.p.1.s.1">
            <w xml:id="e2.def.1.p.1.s.1.w.1" class="WORD">
              <t>apud</t>
            </w>
            <w xml:id="e2.def.1.p.1.s.1.w.2" class="WORD">
              <t>Hebraeos</t>
            </w>
            <w xml:id="e2.def.1.p.1.s.1.w.3" class="WORD">
              <t>dicitur</t>
...

proycon · 2023-03-28T13:29:40Z

Ah ok, you're feeding parts to the tokenizer on the fly, that probably doesn't combine well with foliaoutput=True indeed, as that produces entire documents for the input. You're on the right track:

is one supposed to recursively access sentences and tokens from the tokenizer that yields Token types, and subsequently render the token contents by scripting (e.g. accessing a token class and then specifying it for a folia.Word annotation)

Yes, if you want to do it on-the-fly then there's no shortcut unfortunately.

How shall I add the sentences and tokens from the tokenizer to the Term element?

Iterate over tokenizer and call fterm.add()

pirolen · 2023-03-28T13:37:34Z

OK, I see. I tried that earlier but got stuck, since I am not sure how to safely:

- access sentence starts/ends
- harmonize Tokens (which the tokenizer holds) with folia.Word annotation (which the folia doc expects to be added to sentences). E.g. one needs to access a token class and then specify it for a folia.Word annotation.

I would be indebted for some example code.

This might be a lot of overhead and I might be better off indeed to create the doc first and then run ucto over it.

Btw, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: #13 (comment)

proycon · 2023-04-03T10:39:26Z

access sentence starts/ends

The Token instance has two methods to determine if it is at the start/end of a sentence:

token.isbeginofsentence()
token.isendofsentence()

Similarly, there is a token.newparagraph() (token starts a new paragraph) and a token.nospace() (token is NOT followed by a space).
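A minimal sketch of how these flags might be combined to rebuild one string per sentence. StubToken is a hypothetical stand-in, not part of python-ucto: it only mirrors the method names mentioned above, so the logic can be tried without a tokenizer. With the real library one would iterate over the tokenizer instead of a list.

```python
from dataclasses import dataclass


@dataclass
class StubToken:
    """Hypothetical stand-in for ucto.Token (assumption, for illustration only)."""
    text: str
    end: bool = False        # last token of a sentence
    no_space: bool = False   # NOT followed by a space

    def __str__(self):
        return self.text

    def isendofsentence(self):
        return self.end

    def nospace(self):
        return self.no_space


def sentences_as_text(tokens):
    """Join tokens into sentence strings using the per-token flags."""
    sentences, parts = [], []
    for token in tokens:
        parts.append(str(token))
        if token.isendofsentence():
            # close the current sentence and start collecting the next one
            sentences.append("".join(parts))
            parts = []
        elif not token.nospace():
            parts.append(" ")
    if parts:
        # flush trailing tokens that never got an end-of-sentence flag
        sentences.append("".join(parts).rstrip())
    return sentences


tokens = [
    StubToken("apud"),
    StubToken("Hebraeos"),
    StubToken("dicitur", no_space=True),
    StubToken(".", end=True),
    StubToken("Ab", end=True),
]
print(sentences_as_text(tokens))
```

The same two checks (end of sentence, trailing space) are what decide where a folia.Sentence closes and what value the space attribute of a folia.Word gets.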

harmonize Tokens (which the tokenizer holds) with folia.Word annotation (which the folia doc expects to be added to sentences). E.g. one needs to access a token class and then specify it for a folia.Word annotation. I would be indebted for some example code.

Off the top of my head (untested, so there may be mistakes), take body to be the FoLiA structure where you want to add sentences and tokens (some subclass of folia.AbstractStructureElement):

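sentence = None  # initialised once, so it persists across tokens
for token in tokenizer:
    if token.isbeginofsentence() or sentence is None:
        # open a new sentence inside the parent structure
        sentence = body.add(folia.Sentence)

    # add the token text as a word; space=False when no space follows it
    word = sentence.add(folia.Word, str(token), space=not token.nospace())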

Btw, in another ticket I made some notes about not being able to specify the textclass when passing foliaoutput=True: #13 (comment)

Ah right! Sorry, missed that one, will take a look!

pirolen · 2023-04-03T14:16:09Z

I get 'Token' object has no attribute 'newparagraph' :-o

proycon · 2023-04-03T16:35:13Z

Right, it should be isnewparagraph().
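With the corrected method name, paragraph and sentence boundaries could be tracked together. Below is a rough sketch that groups tokens into nested lists (paragraph > sentence > word), mirroring the p > s > w nesting in the target XML. StubToken is again a hypothetical stand-in for ucto.Token, and whether the very first token of an input reports isnewparagraph() is an assumption the code does not rely on.

```python
class StubToken:
    """Hypothetical stand-in for ucto.Token (assumption, for illustration only)."""

    def __init__(self, text, begin=False, new_paragraph=False):
        self.text = text
        self.begin = begin
        self.new_paragraph = new_paragraph

    def __str__(self):
        return self.text

    def isbeginofsentence(self):
        return self.begin

    def isnewparagraph(self):
        return self.new_paragraph


def group_tokens(tokens):
    """Return a list of paragraphs, each a list of sentences, each a list of words."""
    paragraphs = []
    for token in tokens:
        # open a paragraph on the flag, or if none is open yet
        if token.isnewparagraph() or not paragraphs:
            paragraphs.append([])
        # open a sentence on the flag, or if the paragraph is still empty
        if token.isbeginofsentence() or not paragraphs[-1]:
            paragraphs[-1].append([])
        paragraphs[-1][-1].append(str(token))
    return paragraphs


tokens = [
    StubToken("A", begin=True, new_paragraph=True),
    StubToken("B"),
    StubToken("C", begin=True),
    StubToken("D", begin=True, new_paragraph=True),
    StubToken("E"),
]
print(group_tokens(tokens))
```

In a FoLiA-building loop, each appended list would correspond to an add() call on the parent element (a paragraph on the body, a sentence on the paragraph), with words added to the current sentence.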

proycon self-assigned this Mar 28, 2023

proycon added the question label Mar 28, 2023
