Documents Summary in Python
Analyzing the content of a Microsoft Word document, extracting text, and creating a summary involves a multi-step process. Python provides libraries that can help you achieve these tasks.
Here's a general approach
✅ Install Libraries
You'll need the docx2txt
library to extract text from Word documents and the nltk
library for text summarization. Install them using pip
:
pip install python-docx nltk
✅ Extract Text from Word Document
Use the docx2txt
library to extract text from the Word document:
import docx2txt
# Replace 'document.docx' with your file path
text = docx2txt.process("document.docx")
✅ Preprocess Text
Before summarization, it's a good idea to preprocess the extracted text. You can use the nltk
library to tokenize the text into sentences and words:
from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize(text)
words = word_tokenize(text)
✅ Text Summarization
To create a summary, you can use various techniques. One common approach is extractive summarization, where important sentences from the text are selected to form the summary. The nltk
library offers a basic implementation of the TextRank algorithm for this purpose:
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
def sentence_similarity(sent1, sent2, stopwords=None):
if stopwords is None:
stopwords = []
words1 = [w.lower() for w in sent1]
words2 = [w.lower() for w in sent2]
all_words = list(set(words1 + words2))
vector1 = [0] * len(all_words)
vector2 = [0] * len(all_words)
for w in words1:
if w not in stopwords:
vector1[all_words.index(w)] += 1
for w in words2:
if w not in stopwords:
vector2[all_words.index(w)] += 1
return 1 - cosine_distance(vector1, vector2)
def build_similarity_matrix(sentences, stop_words):
similarity_matrix = np.zeros((len(sentences), len(sentences)))
for i in range(len(sentences)):
for j in range(len(sentences)):
if i != j:
similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j], stop_words)
return similarity_matrix
def generate_summary(text, top_n=5):
stop_words = stopwords.words('english')
sentences = sent_tokenize(text)
sentence_similarity_martix = build_similarity_matrix(sentences, stop_words)
sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_martix)
scores = nx.pagerank(sentence_similarity_graph)
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
summary = ""
for i in range(top_n):
summary += " " + ranked_sentences[i][1]
return summary
# Generate a summary with the top 5 sentences
summary = generate_summary(text, top_n=5)
The above code defines functions to calculate sentence similarity and generate a summary using TextRank Algorithm.
Please note that this is a simplified example, and there are more advanced methods and libraries available for text summarization.
Additionally, the quality of the summary might vary depending on the complexity of the document and the chosen summarization technique.
Resources
- 👉 Generate Django Apps using
Rocket Generator
- 👉 Join the Community and chat with the
support
team