output sentences to file #918

FuzzyNZ · 2022-01-07T06:04:14Z

FuzzyNZ
Jan 7, 2022

I would like like to be able to print the output from the following command to new lines. At the moment I can only get the output as a single array. Is there a way to output this as a list of sentences? I would like be able to number each sentence. The current output would mean that I would have to split the sentences in AWK or SED, which seems counter-intuitive.

print([sentence.text for sentence in doc.sentences]

output_example.txt

Thankyou.

Answered by AngledLuffa

Jan 7, 2022

for idx, sentence in enumerate(doc.sentences):
  print(idx, sentence.text)

View full answer

AngledLuffa · 2022-01-07T06:58:27Z

AngledLuffa
Jan 7, 2022
Maintainer

for idx, sentence in enumerate(doc.sentences):
  print(idx, sentence.text)

2 replies

FuzzyNZ Jan 7, 2022
Author

legendary. spent the whole day on that one (and only had to sub .word for .text). first day in the core though - so thanks heaps.

FuzzyNZ Jan 7, 2022
Author

It's so close, but when I write the output to file it compresses it into a chunk. I tried adding output_file.write(sentence.text sep='\n') but it throws an error: TypeError: write() takes no keyword arguments. New to Python and new to Stanford Core today so please bear with me - it's amazing.

AngledLuffa · 2022-01-07T17:07:08Z

AngledLuffa
Jan 7, 2022
Maintainer

separate command: output_file.write("\n")

…

On Fri, Jan 7, 2022 at 2:42 AM SteveBrodie ***@***.***> wrote: It's so close, but when I write the output to file it compresses it into a chunk. I tried adding output_file.write(sentence.text sep='\n') but it throws an error: TypeError: write() takes no keyword arguments. New to Python and new to Stanford Core today so please bear with me - it's amazing. — Reply to this email directly, view it on GitHub <#918 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2AYWN6UK32WSRENMDZ623UU27SVANCNFSM5LOCATYA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you commented.Message ID: ***@***.***>

1 reply

FuzzyNZ Jan 7, 2022
Author

TokensRegex next.

AngledLuffa · 2022-01-07T22:34:25Z

AngledLuffa
Jan 7, 2022
Maintainer

Bit more clarification on what you need would be helpful...

1 reply

FuzzyNZ Jan 7, 2022
Author

My apologies. That's where I'll be heading next with the code. I'm working with a repository of fictional texts to hopefully run some link detection/prediction using your Stanford SNAP library.

FuzzyNZ · 2022-02-04T22:55:04Z

FuzzyNZ
Feb 4, 2022
Author

Happy New Year. Found some time to get back to this. I'm running NER on a file and trying to output to a file as before. I have the NER results printing but not saving to file. Keep getting error: AttributeError: 'str' object has no attribute 'ent'

nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
filename = "waiter_chunks"
with open(filename, 'r') as f:
doc = nlp(f.read())

with open("output/names","w") as waiter_names:
for sent in ([f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents]):
waiter_names.write(sent.text)
waiter_names.write("\n")

0 replies

AngledLuffa · 2022-02-05T16:25:29Z

AngledLuffa
Feb 5, 2022
Maintainer

Suggestion: use three backticks (backwards apostrophes) to format code. It will look like this:

in this block:
  indentation will be preserved

waiter_names is a weird name for an output file. People will usually name it output_file or fout or something else evocative of what it is. Anyway, I think the problem is waiter_names.write(sent.text) - you've already made it a string at that point, so you just need to do write(sent) instead. For example, I can do this:

import stanza

nlp = stanza.Pipeline("en", processors="tokenize, ner")
text = "John Bauer works at Stanford"
doc = nlp(text)

for sent in ([f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents]):
    print(sent)

--->

entity: John Bauer      type: PERSON
entity: Stanford        type: ORG

0 replies

FuzzyNZ · 2022-02-05T23:37:05Z

FuzzyNZ
Feb 5, 2022
Author

Thanks again for your help. I see now - I was complicating the output. waiter_names is a strange name but is evocative when taken in the context of the source file - it's is a selection of fiction that I grep'd for "waiter" and now I am creating a list of the characters names from within the results to see what characters were involved. Hope that makes some kind of sense. Is there way to filter the entity's before outputting them? For example I only need the PERSON filed for this task. Thanks.

0 replies

AngledLuffa · 2022-02-06T04:10:44Z

AngledLuffa
Feb 6, 2022
Maintainer

ent.type == PERSON should work

0 replies

FuzzyNZ · 2022-02-07T00:59:40Z

FuzzyNZ
Feb 7, 2022
Author

Awesome - I have all of that running now. Now I have a bash script that I'm trying to embed the stanza python script in. The bash script creates an array of filenames (books) and loops over them with simple linux commands (grep | sort | uniq) and outputs the results to separate files using the filename variable in the naming. Here is the script so far. Lines 20-22 are where I'd need to feed the bash variable in as a filename and then pipe the output to grep somehow. I'm not even sure if this is doable so I thought I'd ask first before trying to figure it out. No idea how to run the loop in Python with those variables in that way sorry.
entityStanza.txt

https://trstringer.com/python-in-shell-script/

1 reply

AngledLuffa Feb 7, 2022
Maintainer

Personally I'm here to make sure the stanza issues are smoothed out, but that's not to say someone else can't chime in... if there's something specific about stanza which isn't working, happy to help

FuzzyNZ · 2022-02-07T03:33:20Z

FuzzyNZ
Feb 7, 2022
Author

Ahh good point. Thanks!

1 reply

rahonalab Mar 18, 2022

Hello, this is probably related to the discussion.
I am using the CoNLL.write_doc2coll() method from the dev branch to write down CoNLL to documents, but I see that no sent_id and text (i.e., comments) are written. I have tried to print the sentence.comments from a parsed doc, but I see it's empty. Is there an option to pass to the nlp pipeline to fill the comments?
Thank you !
Luigi

AngledLuffa · 2022-03-19T01:22:56Z

AngledLuffa
Mar 19, 2022
Maintainer

This is certainly something we could do ourselves, but there are a couple questions: for #text, what to do with sentences with newlines? remove the newlines? for #sent_id, is there some pattern to follow, or just count 1, 2, 3, ... The easiest thing to do would be if you put the exact comments you want on the sentences before calling output

…

On Fri, Mar 18, 2022 at 3:58 AM Luigi Talamo ***@***.***> wrote: Hello, this is probably related to the discussion. I am using the CoNLL.write_doc2coll() method from the dev branch to write down CoNLL to documents, but I see that no sent_id and text (i.e., comments) are written. I have tried to print the sentence.comments from a parsed doc, but I see it's empty. Is there an option to pass to the nlp pipeline to fill the comments? Thank you ! Luigi — Reply to this email directly, view it on GitHub <#918 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2AYWIQRXKQ53UOTSAZLXLVAROWVANCNFSM5LOCATYA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you commented.Message ID: ***@***.***>

0 replies

rahonalab · 2022-03-24T10:25:59Z

rahonalab
Mar 24, 2022

I think that the output provided by the UDPipe parser may be a correct pattern, so something like
#sent_id = docname_1,2,3
#text = text without newlines

While reading the text line by line, I have tried to put the enumerate(line) as sent_id and the line as text before parsing, but the result is non-consistent, as several times the line is split into more than one sentence... So, the comment has definitely to be taken from the NLP pipeline.

0 replies

AngledLuffa · 2022-03-24T22:36:20Z

AngledLuffa
Mar 24, 2022
Maintainer

# text is easy enough for us to add

i assume the conll tests will throw some error, but the changelist i just posted should do it once those tests are fixed

0 replies

rahonalab · 2022-03-31T08:26:44Z

rahonalab
Mar 31, 2022

so, the requested feature is already in the dev branch?

0 replies

AngledLuffa · 2022-03-31T08:34:13Z

AngledLuffa
Mar 31, 2022
Maintainer

For # text it is, yes

I haven't figured out a great way to have it index sentence id. I mean, obviously they could just count 1, 2, 3, etc, but most of the datasets have some kind of prefix or format on the ID

0 replies

rahonalab · 2022-03-31T09:11:08Z

rahonalab
Mar 31, 2022

ok thank you , I have tested it and now it prints the text in comment.

yes, UD treebanks have for instance the pattern:

acronym of the treebank-1,2,3

e.g., # sent_id = isst_tanl-2 for the UD Italian ISST treebank

Documents are also introduced by a newdoc attribute:

#newdoc = tanl

which for instance can correspond to the filename, or other attributes.

1 reply

rahonalab Apr 1, 2022

I have just discovered that sentence_id is already in the sentence Document, under the ._id attribute.
So I have modified the doc2conll() function as follows:

def doc2conll(doc):
    """ Convert a Document object to a list of list of strings
    Each sentence is represented by a list of strings: first the comments, then the converted tokens
    """
    doc_conll = []
    for sentence in doc.sentences:
        sent_id = "# sent_id = "+NEWDOC+"-"+str(sentence._id)
        sent_conll = list()
        sent_conll.append(sent_id)
        sent_conll.append("\n".join(sentence.comments))
        for token_dict in sentence.to_dict():
            token_conll = convert_token_dict(token_dict)
            sent_conll.append("\t".join(token_conll))
        doc_conll.append(sent_conll)
    return doc_conll

where NEWDOC is a global variable set to the filename.
Sorry for my code, I am a linguist :-)

AngledLuffa · 2022-04-08T05:59:53Z

AngledLuffa
Apr 8, 2022
Maintainer

# sent_id: #995

0 replies

output sentences to file #918

FuzzyNZ Jan 7, 2022

Replies: 16 comments · 7 replies

AngledLuffa Jan 7, 2022 Maintainer

FuzzyNZ Jan 7, 2022 Author

FuzzyNZ Jan 7, 2022 Author

AngledLuffa Jan 7, 2022 Maintainer

FuzzyNZ Jan 7, 2022 Author

AngledLuffa Jan 7, 2022 Maintainer

FuzzyNZ Jan 7, 2022 Author

FuzzyNZ Feb 4, 2022 Author

AngledLuffa Feb 5, 2022 Maintainer

FuzzyNZ Feb 5, 2022 Author

AngledLuffa Feb 6, 2022 Maintainer

FuzzyNZ Feb 7, 2022 Author

AngledLuffa Feb 7, 2022 Maintainer

FuzzyNZ Feb 7, 2022 Author

rahonalab Mar 18, 2022

AngledLuffa Mar 19, 2022 Maintainer

rahonalab Mar 24, 2022

AngledLuffa Mar 24, 2022 Maintainer

rahonalab Mar 31, 2022

AngledLuffa Mar 31, 2022 Maintainer

rahonalab Mar 31, 2022

rahonalab Apr 1, 2022

AngledLuffa Apr 8, 2022 Maintainer

FuzzyNZ
Jan 7, 2022

Replies: 16 comments 7 replies

AngledLuffa
Jan 7, 2022
Maintainer

FuzzyNZ Jan 7, 2022
Author

FuzzyNZ Jan 7, 2022
Author

AngledLuffa
Jan 7, 2022
Maintainer

FuzzyNZ Jan 7, 2022
Author

AngledLuffa
Jan 7, 2022
Maintainer

FuzzyNZ Jan 7, 2022
Author

FuzzyNZ
Feb 4, 2022
Author

AngledLuffa
Feb 5, 2022
Maintainer

FuzzyNZ
Feb 5, 2022
Author

AngledLuffa
Feb 6, 2022
Maintainer

FuzzyNZ
Feb 7, 2022
Author

AngledLuffa Feb 7, 2022
Maintainer

FuzzyNZ
Feb 7, 2022
Author

AngledLuffa
Mar 19, 2022
Maintainer

rahonalab
Mar 24, 2022

AngledLuffa
Mar 24, 2022
Maintainer

rahonalab
Mar 31, 2022

AngledLuffa
Mar 31, 2022
Maintainer

rahonalab
Mar 31, 2022

AngledLuffa
Apr 8, 2022
Maintainer