Context for ngrams? #171

DTchebotarev · 2018-02-28T22:39:18Z

Is it possible to add context to ngram extraction?

For example, currently running

list(textacy.Doc('I like green eggs and ham.').to_terms_list(ngrams=3,as_strings=True))

returns a list

['-PRON- like green', 'like green egg', 'egg and ham']

But I would ideally like to have the option to specify something like

list(textacy.Doc('I like green eggs and ham.').to_terms_list(ngrams=3,as_strings=True, left_pad=True, right_pad=True))

and have it return something along the lines of

['<s2> <s1> -PRON-', '<s1> -PRON- like', '-PRON- like green', 'like green egg', 'egg and ham', 'and ham </s1>', 'ham </s1> </s2>']

I don't think this is possible in textacy currently, so I guess this is a feature request.

Also any ideas for a workaround are greatly appreciated :)


bdewilde · 2018-03-01T14:17:25Z

Hi @DTchebotarev , this is not currently a feature, but I appreciate that padding sequences is a common task in deep learning. I've been dragging my feet on getting DL models into textacy, but when I do, I'd expect to include useful adjacent functionality like this as well.

jnothman · 2019-07-04T07:39:43Z

Padding sequences is common outside deep learning too. It gives an n-gram more context (e.g. it marks the n-gram as text-initial).
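As a possible workaround for the padding described above, here is a minimal sketch in plain Python. It is not part of textacy: `padded_ngrams` and its `left_pad` / `right_pad` parameters are hypothetical names borrowed from the request, and the `<sN>` / `</sN>` markers just follow the example in the original post. It works on a list of already-lemmatized token strings and does no stop-word filtering, so it will return more n-grams than `to_terms_list()` does.

```python
from typing import Iterator, List, Sequence


def padded_ngrams(
    tokens: Sequence[str],
    n: int,
    left_pad: bool = True,
    right_pad: bool = True,
) -> Iterator[str]:
    """Yield space-joined n-grams over `tokens`, optionally padded at both ends.

    Hypothetical helper, not a textacy API. Padding markers are numbered by
    their distance from the text: <s1> sits next to the first token, </s1>
    next to the last one.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    # n - 1 markers on each side, so every real token can appear in every
    # position of an n-gram.
    left: List[str] = [f"<s{i}>" for i in range(n - 1, 0, -1)] if left_pad else []
    right: List[str] = [f"</s{i}>" for i in range(1, n)] if right_pad else []
    padded = left + list(tokens) + right
    for start in range(len(padded) - n + 1):
        yield " ".join(padded[start:start + n])


# Lemmas taken from the example in the original post.
lemmas = ["-PRON-", "like", "green", "egg", "and", "ham"]
trigrams = list(padded_ngrams(lemmas, 3))
```

With real documents, the token strings could come from something like `[tok.lemma_ for tok in doc]` on a spaCy doc, after whatever filtering you need.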

bdewilde · 2019-07-04T15:09:15Z

I recently implemented something like this in a keyterm extraction algorithm:

textacy/textacy/keyterms.py

Lines 247 to 251 in 794be59

    
    for sent_idx, sent in enumerate(doc.sents):
        padding = [None] * window_size
        sent_padded = itertoolz.concatv(padding, sent, padding)
        for window in itertoolz.sliding_window(1 + (2 * window_size), sent_padded):
            lwords, word, rwords = window[:window_size], window[window_size], window[window_size + 1:]

Unlike extract.ngrams(), this method produces Tuple[Token] rather than Span objects, so it doesn't work in the context of to_terms_list(). But maybe it's helpful.
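For anyone who wants to try the same padding idea outside textacy, here is a rough standalone sketch that uses only the standard library. It assumes plain strings instead of spaCy tokens, and `context_windows` is a hypothetical name, not a function in textacy. It mirrors the approach in the snippet above: pad with `None`, then slide a window of width `1 + 2 * window_size`.

```python
from itertools import chain
from typing import Iterator, List, Optional, Sequence, Tuple

Window = Tuple[List[Optional[str]], Optional[str], List[Optional[str]]]


def context_windows(words: Sequence[str], window_size: int) -> Iterator[Window]:
    """Yield (left context, word, right context) for each word in `words`.

    Hypothetical helper. Positions that fall outside the sequence are
    filled with None, so edge words still get full-width contexts.
    """
    padding = [None] * window_size
    padded = list(chain(padding, words, padding))
    width = 1 + 2 * window_size
    for start in range(len(padded) - width + 1):
        window = padded[start:start + width]
        # The center element is the word itself; the rest is its context.
        yield window[:window_size], window[window_size], window[window_size + 1:]


windows = list(context_windows(["green", "eggs", "and", "ham"], window_size=2))
```

Each word gets exactly one window, and the `None` entries show where the sequence starts and ends.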

bdewilde added the enhancement label Mar 1, 2018
