Discussion: discard "gensim.summarization"? #2592
Comments
+1 on that. IIRC, the algo is actually OK / standard, but the technical execution (engineering, design) was poor. One of the (several) modules in Gensim I'd be scared to use myself, and consequently never did. Discussions go on the mailing list though, why did you open it here? |
Opened this here because this seemed to me more like a committer-level discussion regarding quality/standards/policies. Also, it'd ideally yield tangible issue-like followup steps, if there was agreement, for which the issue could then record the motivating reasoning & decisions. That's a bit like the prior GH-issue to discuss when/whether Python2-support should be dropped, or the GH-issue asking whether issues themselves should auto-close after deadlines. It's essentially a "feature request" in reverse: a "de-feature request". But happy to discuss there instead or also, as appropriate. I've generally not been too impressed with "extractive summarization" – it seems to only be useful when the original text was already well authored, in a hierarchical & expository "reference" style. There, extractive summarization has a fair chance of finding the inherently-summarizing sentences/passages the author already included. (Elsewhere, it stumbles hard – as on some of the winding-plot-narratives that some of the tutorial code for this feature has inexplicably chosen to highlight.) So to the extent TextRank or some other extractive method survives, it'd be helpful to more specifically set expectations. For example, get the name of algorithm ( And, docs/tutorials could highlight some kinds of texts on which it works well, and others where it doesn't. (One potential evaluative method, for a method that's not order-dependent in its choice of sentences: shuffle all the sentences in a Wikipedia article together, run the algorithm, consider those algorithms that choose more sentences from the article's actual 'summary' section-number-0 to be better.) 
From what I've read of TextRank, it seems its method of calculating sentence-to-sentence similarity (and thus the edges on its sentence-to-sentence graph) could be pluggable, and methods based on average-of-word-vectors, or doc-vectors, or WMD-similarity might work quite well compared to the current code (which if I've read right just checks nearly-exact-word overlap). |
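The pluggable-similarity idea in the comment above could be sketched roughly as follows. This is a minimal illustration, not the removed Gensim code: the names `rank_sentences`, `extract_top_sentences`, `overlap_similarity`, and `jaccard_similarity` are hypothetical, the tokenizer is a crude assumption, and the word-overlap measure only approximates the one described for TextRank. Any function that maps two token lists to a non-negative number (for example, a cosine over averaged word vectors) could be passed in instead.

```python
import math
import re
from typing import Callable, List, Sequence

Tokens = Sequence[str]
Similarity = Callable[[Tokens, Tokens], float]


def simple_tokenize(sentence: str) -> List[str]:
    """Crude lowercase word tokenizer (an assumption; swap in a better one)."""
    return re.findall(r"\w+", sentence.lower())


def overlap_similarity(a: Tokens, b: Tokens) -> float:
    """Shared-word count normalized by log sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # very short sentences are ignored to avoid a zero denominator
    shared = len(set(a) & set(b))
    return shared / (math.log(len(a)) + math.log(len(b)))


def jaccard_similarity(a: Tokens, b: Tokens) -> float:
    """An alternative plug-in: Jaccard overlap of the two vocabularies."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def rank_sentences(sentences: List[Tokens],
                   similarity: Similarity = overlap_similarity,
                   damping: float = 0.85,
                   iterations: int = 50) -> List[float]:
    """Weighted PageRank-style scores over a sentence-similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Negative similarities are clamped to zero (no edge).
            w = max(0.0, similarity(sentences[i], sentences[j]))
            weights[i][j] = weights[j][i] = w
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def extract_top_sentences(sentences: List[str],
                          top_n: int = 2,
                          tokenize: Callable[[str], List[str]] = simple_tokenize,
                          similarity: Similarity = overlap_similarity) -> List[str]:
    """Return the top_n highest-scoring sentences, in original order."""
    scores = rank_sentences([tokenize(s) for s in sentences], similarity)
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```

Because both the tokenizer and the similarity are ordinary callables, trying a different edge-weighting scheme is a one-argument change, e.g. `extract_top_sentences(sentences, similarity=jaccard_similarity)`.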
+1 for deprecation and eventual removal. Perhaps this is something we should do in the next major release? |
There are still a lot of places on the web that recommend using gensim.summarization, so this was not super helpful. |
@fredzannarbor It'd be helpful if you let those places know they now need to make some other better recommendation! |
Do you have any recommendation for BM25? There is a tutorial that I want to replicate for my use case, and it still uses BM25. |
If a tutorial/approach worked well with the older Gensim version, you can always choose to install & use that older version, for example in an isolated, project-specific virtual environment. Only if you also need closely-integrated later-version features or fixes would there be any complications. (And, if you really like some of the removed code, & are sure it meets your needs, you can always copy the source code into your own project, adapting names/prerequisites lightly as necessary. Just remember that the choice to remove things has usually been driven by an assessment that the code had limitations that made it hard to officially support, often including no one active in the project with the knowledge/interest to answer questions or investigate issues.) |
:( Would having more maintainers help in a decision like this? |
Yes, if you need a text split by sentences, using a project that has well-maintained code for doing that is wise. That's what Gensim itself would want to do, if any of its current algorithms needed to split text into sentences. (In general, they don't.) The prior code for this in But also, it was about 2 lines of crude regex-based string splitting. If that's all you need, it's easy to copy. See: |
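For readers who only need the kind of crude regex-based splitting described above, a sketch might look like the following. This is not the removed Gensim code; the pattern and the name `split_sentences` are assumptions chosen for illustration, and a well-maintained tokenization library will handle far more cases.

```python
import re

# Split on whitespace that follows ., ! or ? and precedes an uppercase letter,
# a digit, or an opening quote. Deliberately naive.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])")


def split_sentences(text: str) -> list:
    """Return the non-empty, stripped pieces between sentence boundaries."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]
```

The weakness of this approach shows up quickly: abbreviations such as "Dr." are treated as sentence ends, which is one reason a dedicated sentence tokenizer is usually the wiser choice.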
Although I agree with the removal of the Is there any suitable replacement for |
+1 on including BM25 in Gensim. We'll just need to vet the code better. But I don't expect it will be a problem with your code. |
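For anyone who needs BM25 scoring in the meantime, the standard Okapi BM25 formula is short enough to sketch in user code. The class below is a hypothetical illustration (`SimpleBM25` is not a Gensim API, and this is not the removed module's code); it assumes pre-tokenized documents, the common defaults `k1=1.5` and `b=0.75`, and the non-negative "+1 inside the log" IDF variant.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


class SimpleBM25:
    """Minimal Okapi BM25 scorer over a pre-tokenized corpus (illustrative only)."""

    def __init__(self, corpus: Sequence[Sequence[str]], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.doc_freqs: List[Counter] = [Counter(doc) for doc in corpus]
        self.doc_lens: List[int] = [len(doc) for doc in corpus]
        n_docs = len(corpus)
        self.avgdl = sum(self.doc_lens) / n_docs if n_docs else 0.0
        # Document frequency: in how many documents each term appears.
        df: Counter = Counter()
        for freqs in self.doc_freqs:
            df.update(freqs.keys())
        self.idf: Dict[str, float] = {
            term: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in df.items()
        }

    def score(self, query: Sequence[str], index: int) -> float:
        """BM25 score of one document (by position in the corpus) for a query."""
        freqs = self.doc_freqs[index]
        if not freqs:
            return 0.0  # empty document; also avoids dividing by a zero avgdl
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[index] / self.avgdl)
        total = 0.0
        for term in query:
            f = freqs.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def get_scores(self, query: Sequence[str]) -> List[float]:
        """Scores for every document in the corpus, in corpus order."""
        return [self.score(query, i) for i in range(len(self.doc_freqs))]
```

Usage is a matter of tokenizing consistently on both sides: build `SimpleBM25(tokenized_docs)` once, then call `get_scores(tokenized_query)` and sort document indices by score.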
How can one now accomplish summarization with gensim? |
There's no summarization functionality in current versions. You could try a 3.x version, & if the results work well for you, keep using that old version, or copy its source-code into your project. If you want state-of-the-art summarization – including potentially abstractive (paraphrasing) summarization not just a crude selection of some subset of guessed-important sentences that the previous Gensim extractive summarization provided – and have sufficient resources, you could look at newer, deeper large language models, like BERT/etc. |
Artificial intelligence has been evolving rapidly, and we can enhance the functionality of a simple open-source library both algorithmically and with a database-based approach. Since it hasn't been explicitly stated that summarization must be algorithm-based, I would like to request bringing back this idea. What is your perspective on starting a pull request for this thought? We could add a warning during the development phase to ensure that it doesn't consume people's time until satisfactory results are achieved. |
Do you mean a PR to restore exactly the old code? I think that'd be silly - it was bad code, poorly maintained, without any public examples of it providing good results, that as far as I could tell wasted the time of most people who tried it. A mere documentation or code-comment or even printed-to-console warning that the code is likely to disappoint people doesn't, in my experience, provide enough discouragement. They're still tempted by the label, or misleading old examples online – & thus waste their time, & ours. But still, if people really want it, maybe they have one of the rare tasks where this technique has good results. (I've seen people report this, but never seen any working demo of this code, on even contrived/cherry-picked data, showing useful results.) In that case, they can fetch the code out of the old versions. It's easy to get, it's not that long, it's open-source. As mentioned in the initial 2019 discussion, if someone wanted to make a more-generalizable and more-maintainable implementation of the 'TextRank' algorithm on which this gensim-summarization was based, that might have a case. With pluggable word/sentence tokenization, & pluggable/configurable sentence-centrality-ranking options, this kind of early extractive text summarization algorithm might still be useful against some well-written texts, or interesting didactically about the limits of summarization capabilities before deep neural networks. But here in 2023+, even an excellent & flexible implementation of TextRank-style, sentence-excerpts summarization will be far worse than what's cheap & easy with modern LLMs. |
A couple of thoughts about this.
1. “Cheap and easy” is not free. Useful to have free summarization
built into the package.
2. Extractive summarization is an important alternative because you can
rely on the words in the summary being the same as the original source.
For some applications that’s essential.
3. Modern LLMs still struggle with context window size. It’s crucial
to have at least one tool that can summarize very long documents as a
whole, ideally not constrained by memory size.
---
Fred Zimmerman, Publisher
Nimble Books LLC
The AI Lab for Book-Lovers <http://NimbleBooks.com>
|
You didn't clarify whether your proposal is to bring back the old code, but your allusion to 'free' suggests that might be what you are suggesting.
There was never any truly 'free' summarization in the past, nor is any possible in the future. The prior code was low quality. Users wasted time & effort, which is not free, trying to get it to work. Maintainers faced questions from frustrated users, which impose costs even when the answer is, "no help is available". (And, with compact & open-source LLMs, those options are potentially as close to 'free' as anything else.)
I am unfamiliar with applications where using the exact same words is essential. Can you provide some links to representative applications where that's better than high-quality abstractive summarization? As I've mentioned, I've never seen any texts on which the old code delivered good results. (Our own demo notebook showed only poor/nonsense results.) If you know of cases where this has been shown to work well, can you provide links? To the extent someone really wanted to retrieve the "most representative" verbatim sentences from a longer work, as a sort of IR task, I suspect that applying other algorithms better-supported in Gensim – LDA, average-of-word-vectors, WMD, Doc2Vec, etc – would select better excerpts than the prior crude If such selection-of-verbatim-excerpts is a real need driving your request, I suggest trying some of those other algorithms. But also, if you have any published or private evaluations showing the old
A tool that could effectively summarize arbitrarily long documents would be useful! I've seen no evidence the old code could serve as that tool. Among its other substandard aspects: it required entire documents in memory, and its analysis required a massive expansion in memory use. Even after the fixes in #2298, it was reported to fail with a If you think you've found an extractive-summarization technique that could outcompete an LLM due to an LLM's window-size limitations, I'd want to see some credible evaluations demonstrating that, including that it outperforms the simplest plausible LLM workaround: summarize acceptably-sized chunks, concatenate those summaries, repeat. It doesn't seem likely to me that any extractive approach would be competitive, but I'd enjoy being surprised if that can be shown! |
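The workaround named above (summarize acceptably-sized chunks, concatenate those summaries, repeat) might be wrapped in a single function along these lines. This is only a sketch under stated assumptions: `summarize_long` and `chunk_words` are hypothetical names, chunking is by whitespace-separated word count rather than by model tokens, and `summarize_chunk` stands in for whatever model or service call the user supplies.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text: str,
                   summarize_chunk: Callable[[str], str],
                   max_words: int = 500,
                   max_rounds: int = 10) -> str:
    """Repeatedly summarize chunks and join the results until the text fits.

    summarize_chunk is any user-supplied callable (e.g. an LLM call) that
    returns a shorter version of its input.
    """
    for _ in range(max_rounds):
        if len(text.split()) <= max_words:
            break
        chunks = chunk_words(text, max_words)
        text = " ".join(summarize_chunk(chunk) for chunk in chunks)
    # Final pass over the (now hopefully small enough) text. If max_rounds was
    # exhausted because the summarizer never shrank its input, this call may
    # still receive more than max_words words.
    return summarize_chunk(text)
```

The `max_rounds` guard is there because nothing forces a supplied summarizer to actually shorten its input; without it, a misbehaving callable could loop forever.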
Seems from the tone and amount of feedback that you really don’t want to do
this, which is fine. I don’t use Gensim anymore - I stopped when you
dropped the summarization tool.
1. “Free to me” is a very important consideration for users. Up to you
as a developer whether you want to bear the cost.
2. Law - citing cases. History - citing documents. Many situations
where paraphrasing is unacceptable, this should be obvious.
3. Concatenation and recursion are more cumbersome than a single
function or command line call, which is what I am looking for. As I noted,
extraction from large documents is sometimes preferable to abstraction.
|
The old code is still there, free to use if it works well for your needs. (You can install older versions of Gensim on request, or copy & paste the relevant source code into your projects.) And I'm still interested for actual viewable examples where it worked well – I've still never seen one. I sympathize if your historic use might be too private/proprietary to share details, but in the absence of any public examples of this particular code working well, it's hard to justify any cost of maintenance/user-frustration. By my understanding, quoting (to support a specific point) is very different than summarization. And trusting the old code's excerpts to reflect the original faithfully would be unwise - its technique couldn't be sure if a sentence were a quote of arguments the main document was refuting, or holding up to ridicule. And my other point remains: the other (stronger, better documented, test-case-covered, better-coded, easier-to-demonstrate) similarity-algorithms can likely find representative excerpts, to quote verbatim if that is necessary, even better than the very fragile/crude/underpowered/inefficient/idiosyncratic gensim.summarization did. Anyone who needs such functionality should try them in that role. Simple concatenation & recursion can easily be bundled in a single function call in user code. The claim "LLMs can't do this - unless you put their operations into a simple loop of a few lines of code" isn't really the same as "LLMs can't do this". |
I agree with your planned way forward. There are better alternatives than
gensim.oldsummarization. I will only observe that in my spot testing,
gensim’s old summarizer did pretty well at pulling out 20 significant
sentences from book-length manuscripts. I was happy with the results, but
that is scarcely scientific.
|
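That's helpful to know, even as anecdotal spot testing. Can you say any more about these texts' sizes in words or sentences, and their domain/style? (EG, were they fiction/non-fiction, academic/popular/governmental, etc?) I ask because I'm still curious where oldsummarization was providing value – none of our documentation/demo/tutorial examples showed good results, and it may be possible to match/exceed its value with a few dozen lines of other code using better-supported remaining algorithms (& more-standard tokenization functions/libraries). So the sort of "single function or command line call" functionality you'd like might still be possible, if there were a few more hints about what reference set of texts, & baseline performance, were worth optimizing around. |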
I was usually editing nonfiction books on history written for enthusiast
audiences, so, a good amount of proper nouns, foreign language, and
technical terms.
|
In the course of considering the list question at https://groups.google.com/d/msg/gensim/v24RI3-oUq0/NYlPpif1AQAJ, I took a slightly-deeper look at gensim.summarization than before.
From that look, my opinion is that its presence is more likely to waste people's time than help them. It's fairly rudimentary functionality, but spread across many files, with its own non-configurable regex-based word- and sentence-tokenization, with a lot of hard-to-follow steps. None of the doc/tutorial examples show impressive results.
I even find it hard to imagine anyone getting satisfactory results from this approach, so I expect most people's interaction with this code is: (1) "I need summarization – and cool, gensim has a summarization feature!" (2) View its docs/tutorial and try on some real data. (3) "This is nowhere near what I need nor is it customizable/fixable enough to be tweaked into service." (4) They look for something else entirely.
I'd suggest marking the whole module 'deprecated' with an eye towards eventual removal. And, if summarization is an important thing to truly support, soliciting someone to work-up a better algorithm or implementation, one that can actually demo some useful results in a tutorial/demo, and that also mixes well with other corpus-format/tokenization practices in gensim. (It might even be TextRank-based – but with configurable tokenization & sentence-similarity/graph-building steps.)