An automatic summarizer that uses an extractive summarization technique to produce a summary of n sentences.
create_frequency_table() removes punctuation and unnecessary symbols, then tokenizes the text into words with the NLTK word tokenizer. After this preprocessing, it builds a dictionary-based frequency table that maps each word to the number of times it occurs in the text. The process covers every word in the corpus, and the completed dictionary of words and their frequencies is returned.
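A minimal sketch of the frequency table step follows. It is not the notebook's code: a regular expression stands in for punctuation removal and the NLTK word tokenizer so the example runs without downloading tokenizer data, and lowercasing the text is an assumption.

```python
import re
from collections import Counter

def create_frequency_table(text):
    # Assumption: lowercase the text and keep runs of letters, digits, and
    # apostrophes. This replaces punctuation stripping + NLTK tokenization.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Count how often each word occurs and return a plain dictionary.
    return dict(Counter(words))

freq = create_frequency_table("The cat sat. The cat ran!")
print(freq)
```

Swapping the regular expression for the NLTK word tokenizer would change only the line that builds `words`.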
score_sentences() uses a dictionary to store each sentence together with its score. A sentence's score is the sum of the scores of the words it contains, that is, Sentence Score = Σ(word score). For each word in a sentence, the function looks the word up in the frequency table and adds its frequency to the running sentence score; the total is then assigned to that sentence in the sentence score dictionary.
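The scoring rule can be sketched as below. The regex tokenizer and the choice to score unknown words as 0 are assumptions made for illustration; the sample frequency table is invented.

```python
import re

def score_sentences(sentences, freq_table):
    scores = {}
    for sentence in sentences:
        # Assumption: same simple tokenization as the frequency table sketch.
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        # Sentence score = sum of the frequencies of its words;
        # words missing from the table contribute 0.
        scores[sentence] = sum(freq_table.get(w, 0) for w in words)
    return scores

freq_table = {"the": 2, "cat": 2, "sat": 1, "ran": 1}
sentences = ["The cat sat.", "The cat ran!", "Dogs bark."]
scores = score_sentences(sentences, freq_table)
print(scores)
```

Because the score is a plain sum, longer sentences tend to score higher than shorter ones.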
find_high_score() also uses a dictionary, this time to store the highest-scoring sentences. It takes three parameters: the set of sentences, the sentence scores, and the number of sentences the user wants in the summary. It sorts the sentences by score in descending order using sorted(), stores the n highest-scoring sentences in the dictionary, and returns it.
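One way the selection step might look is sketched here. The exact key function and the handling of ties are assumptions; `sorted()` is stable, so sentences with equal scores keep their original relative order.

```python
def find_high_score(sentences, scores, n):
    # Rank sentences by score, highest first.
    ranked = sorted(sentences, key=lambda s: scores.get(s, 0), reverse=True)
    # Keep only the n best, stored with their scores.
    return {s: scores.get(s, 0) for s in ranked[:n]}

sentences = ["a", "b", "c"]
scores = {"a": 1, "b": 5, "c": 3}
top = find_high_score(sentences, scores, 2)
print(top)
```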
gen_summary() arranges the summary in the order in which the sentences occur in the text, for the user's convenience. The highest-ranked sentences from the previous function are matched against the text, ordered accordingly, and joined together to form the summary. The number of sentences in the summary equals n, which is supplied by the user.
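The reordering step could be sketched as follows. Joining with a single space and truncating to n sentences are assumptions; the sample data is invented.

```python
def gen_summary(sentences, top_sentences, n):
    # Walk the text in its original order and keep only the selected sentences.
    ordered = [s for s in sentences if s in top_sentences]
    # Assumption: sentences are joined with a single space.
    return " ".join(ordered[:n])

sentences = ["A first.", "B second.", "C third."]
top_sentences = {"C third.": 9, "A first.": 7}
summary = gen_summary(sentences, top_sentences, 2)
print(summary)
```

Even though "C third." has the higher score, it appears after "A first." in the summary because that is its position in the text.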
- Read the first 1000000 lines of the corpus.
- Call create_frequency_table() to preprocess the text and compute the word frequencies. A wordFrequency table is returned.
- Call score_sentences() to score the individual sentences. The set of sentences from the corpus and the word frequencies are passed to the function. A dictionary of sentences and their individual scores is returned.
- Call find_high_score() to find the highest-ranked sentences. It takes the sentences, the sentence scores, and the number of sentences (n), and returns a dictionary of the n highest-scoring sentences, where n is the user input.
- Call gen_summary() to create the summary of n sentences. Its parameters are the set of sentences, the result from step 4, and the number of sentences the user wants in the summary; it returns the summary with the sentences in their order of occurrence in the text.
- Change the path of the corpus file in Cell 6 (the last cell), line 1 [ with open("path") as...]
- Execute each cell one by one from the top.
- The last cell will ask "how many sentence summary do you need?". Enter the required number (example: 6000).
- The result should appear right below that dialogue box.
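For orientation, the driver steps listed earlier (read the corpus, build the frequency table, score the sentences, pick the top n, restore text order) can be condensed into one hedged sketch. It is not the notebook's code: `summarize`, `tokenize`, and `MAX_LINES` are hypothetical names, one sentence per line is assumed, and a regex replaces the NLTK tokenizer.

```python
import io
import re
from collections import Counter
from itertools import islice

MAX_LINES = 1_000_000  # read at most the first 1000000 lines of the corpus

def tokenize(text):
    # Assumption: simple regex tokenizer instead of NLTK.
    return re.findall(r"[a-z0-9']+", text.lower())

def summarize(lines, n):
    # Assumption: each non-empty line is one sentence; duplicates are dropped.
    stripped = (ln.strip() for ln in islice(lines, MAX_LINES))
    sentences = list(dict.fromkeys(s for s in stripped if s))
    # Word frequency table over the whole corpus.
    freq = Counter(w for s in sentences for w in tokenize(s))
    # Sentence score = sum of word frequencies.
    scores = {s: sum(freq[w] for w in tokenize(s)) for s in sentences}
    # n highest-scoring sentences.
    top = set(sorted(scores, key=scores.get, reverse=True)[:n])
    # Re-emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)

corpus = io.StringIO("Cats chase mice.\nDogs bark.\nCats and dogs chase cats.\n")
summary = summarize(corpus, 2)
print(summary)
```

With a real corpus, the `io.StringIO` object would be replaced by an open file handle, since both yield lines when iterated.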