Documents Summary in Python

Summarizing a Microsoft Word document takes several steps: extracting the text, preprocessing it, and selecting the sentences that form the summary. Python provides libraries that help with each of these tasks.

Here's a general approach:

✅ Install Libraries

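You'll need the docx2txt library to extract text from Word documents, the nltk library for tokenization and stop words, and numpy and networkx for the ranking step used in the summarizer (recent networkx releases also rely on scipy for PageRank). Install them using pip:

pip install docx2txt nltk numpy networkx scipy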

✅ Extract Text from Word Document

Use the docx2txt library to extract text from the Word document:

import docx2txt

# Replace 'document.docx' with your file path
text = docx2txt.process("document.docx")
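
The extraction step can be wrapped in a small helper that fails with a clear message when the path is wrong. This is a minimal sketch, not part of docx2txt itself: the function name extract_docx_text is hypothetical, and the suffix check assumes you only want to accept .docx files (docx2txt reads the zipped .docx format, not the legacy binary .doc format).

```python
from pathlib import Path


def extract_docx_text(path):
    """Return the plain text of a .docx file, or raise a clear error.

    Hypothetical helper for illustration; only docx2txt.process is a
    real library call here.
    """
    file_path = Path(path)
    # Reject anything that is not a .docx file before touching the disk.
    if file_path.suffix.lower() != ".docx":
        raise ValueError(f"Expected a .docx file, got: {file_path.name}")
    if not file_path.is_file():
        raise FileNotFoundError(f"No such file: {file_path}")
    # Imported lazily so the checks above work even without docx2txt installed.
    import docx2txt
    return docx2txt.process(str(file_path))
```

Calling extract_docx_text("document.docx") then returns the same string as the direct docx2txt.process call above.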

✅ Preprocess Text

Before summarization, it's a good idea to preprocess the extracted text. You can use the nltk library to tokenize the text into sentences and words:

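import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time download of tokenizer data ("punkt_tab" on newer NLTK, "punkt" on older)
nltk.download("punkt")
nltk.download("punkt_tab")
sentences = sent_tokenize(text)
words = word_tokenize(text)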
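
Text extracted from a Word file may contain tabs, non-breaking spaces, and runs of blank lines, so it can help to tidy the whitespace before tokenizing. The sketch below uses only the standard library; the function name normalize_whitespace is hypothetical, and how much cleanup you need depends on your documents.

```python
import re


def normalize_whitespace(text):
    """Collapse stray whitespace in extracted text (illustrative helper)."""
    # Replace tabs and non-breaking spaces with plain spaces.
    text = text.replace("\t", " ").replace("\u00a0", " ")
    # Collapse runs of two or more spaces into one.
    text = re.sub(r" {2,}", " ", text)
    # Strip each line, drop the empty ones, and rejoin with single newlines.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


raw = "Title\n\n\n  First\tsentence.  Second one.\n\n"
clean = normalize_whitespace(raw)
```

You could then pass the cleaned string to sent_tokenize and word_tokenize in place of the raw text.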

✅ Text Summarization

To create a summary, you can use various techniques. One common approach is extractive summarization, which selects the most important sentences from the text to form the summary. The example below is a basic implementation of the TextRank algorithm: nltk supplies the stop-word list and a cosine distance helper, numpy holds the sentence similarity matrix, and networkx runs PageRank over it:

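import networkx as nx
import nltk
import numpy as np
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("stopwords")  # one-time download of the stop-word lists

def sentence_similarity(sent1, sent2, stopwords=None):
    """Cosine similarity between the word-count vectors of two sentences."""
    if stopwords is None:
        stopwords = []
    # Tokenize each sentence string into lowercase words, skipping stop words
    words1 = [w.lower() for w in word_tokenize(sent1) if w.lower() not in stopwords]
    words2 = [w.lower() for w in word_tokenize(sent2) if w.lower() not in stopwords]
    if not words1 or not words2:
        return 0.0  # avoid dividing by zero on an all-zero vector
    all_words = list(set(words1 + words2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    for w in words1:
        vector1[all_words.index(w)] += 1
    for w in words2:
        vector2[all_words.index(w)] += 1
    return 1 - cosine_distance(vector1, vector2)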

def build_similarity_matrix(sentences, stop_words):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j], stop_words)
    return similarity_matrix

def generate_summary(text, top_n=5):
    stop_words = stopwords.words('english')
    sentences = sent_tokenize(text)
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    # Rank sentences by PageRank score, highest first
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    # Slicing keeps this safe when the text has fewer than top_n sentences
    return " ".join(s for _, s in ranked_sentences[:top_n])

# Generate a summary with the top 5 sentences
summary = generate_summary(text, top_n=5)

The code above defines functions that compute sentence similarity, build a similarity matrix, and generate a summary using the TextRank algorithm.
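
To see what the two building blocks do, here is a dependency-free sketch of the same idea on three tiny, already tokenized sentences. It is an illustration under stated assumptions, not the nltk or networkx implementation: the names cosine_similarity and rank_nodes are hypothetical, and the ranking is a simplified power-iteration PageRank that ignores how networkx handles sentences with no connections.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bags of words (lists of tokens)."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_nodes(weights, damping=0.85, iterations=50):
    """Simplified power-iteration PageRank on a square weight matrix."""
    n = len(weights)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(
                scores[j] * weights[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


sentences = [
    ["python", "reads", "word", "documents"],
    ["python", "summarizes", "word", "documents"],
    ["cats", "sleep"],
]
# Pairwise similarity, with zeros on the diagonal as in build_similarity_matrix.
weights = [
    [0.0 if i == j else cosine_similarity(a, b) for j, b in enumerate(sentences)]
    for i, a in enumerate(sentences)
]
scores = rank_nodes(weights)
```

The first two sentences share three of their four words, so their similarity is 0.75 and they reinforce each other's score, while the unrelated third sentence keeps only the baseline score. That is the intuition behind picking the top-ranked sentences as the summary.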

Please note that this is a simplified example, and there are more advanced methods and libraries available for text summarization.

Additionally, the quality of the summary might vary depending on the complexity of the document and the chosen summarization technique.

