How to work with .doc or .docx files in Python
Textual data is a very important part of any modern, or even semi-modern, system that relies on language processing with machine learning. Real-life data can be very messy. Sometimes you might get data in a simple, well-organised .csv or .json file. Trust me, if your data arrives in a proper format at this stage you are lucky: it has probably saved you hours of mundane pre-processing and data filtering. This blog will also look at why a file format is necessary in the first place, and at the benefits and problems that come with it. If you are just here for the tutorial, skip to the end.
Why is a file format necessary?
Imagine a scenario where all you get is a long series of 0's and 1's. Using only these bits, you have to tell someone how a file should look: what should be bold, what should be italic, which text uses which font, and what the size or alignment of a particular paragraph is. Fitting all of this into a single format is a hard task. On top of this, imagine that everyone has a different way of encoding these features in binary. Suddenly, no one can interpret what anyone else has written, and even knowing where to look in the binary file becomes very hard.
One well-known answer to this problem came from the research world. In the early 1990s, Tim Berners-Lee at CERN wanted researchers to be able to share documents that could be read on any machine, which was very difficult at the time because there was no standardisation. The result was a simple, standardised way to mark up textual data, which over time gained a small set of formatting features such as bold, italic, underline, text size, colour, alignment and fonts. The format was gradually adopted by other software working on similar problems and turned out to be a great success. Today, we lovingly call it HTML.
Solving such a seemingly simple problem required a fairly complicated system. Over time, HTML also turned out to be unsuitable for the majority of heavily formatted text applications. This is when word processors such as MS Word and OpenOffice Writer were gaining momentum. Both introduced their own formats for storing richly formatted text (.doc and .odt respectively).
You can read a lot about how word processors use the .doc format. Data has to be stored according to a very strict specification, described in a huge 400+ page document that covers the .doc format and its usage. Some really smart people used this document to create Python libraries that can read these files. The .doc format was later superseded by the .docx format, used from 2007 onwards, which allows additional formatting options to be stored in the same file.
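For starters, we will read from and write to a .docx file using the python-docx library, which is imported under the name docx. Install the library using pip, Python's default package manager. Note that the package to install is python-docx; the package named docx on PyPI is a different, older project.
pip install python-docx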
Let us now open a file using its path. Both absolute and relative paths work here; relative paths are resolved against the directory you are currently working in.
import docx

# Path to your Word file (absolute, or relative to the working directory)
doc = docx.Document("my_word_file.docx")
Once you create an object of the Document class using the file path, you can access all the paragraphs in the document via the paragraphs attribute. An empty line is also read as a paragraph by the Document. Let's fetch all the paragraphs from my_word_file.docx and then display the total number of paragraphs in the document:
all_paras = doc.paragraphs
print(len(all_paras))
Now we'll loop over the paragraphs in my_word_file.docx and print each one, followed by a separator:
for para in all_paras:
    print(para.text)
    print("-------")

-------
Introduction
-------

-------
Welcome to atomicloops.com
-------

-------
This website contains some pretty interesting stuff, Hang on! for career opportunities, feel free to drop an email at atomicloops.com
-------
The above output shows all of the paragraphs in the Word file.
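A run in a Word document is a contiguous sequence of characters that share the same formatting, such as font size, typeface, and font style. Every paragraph is made up of one or more runs, which you can access through the runs attribute of a single paragraph: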
# doc.paragraphs is a list, so pick one paragraph by index (any valid index works)
single_para = doc.paragraphs[0]
for run in single_para.runs:
    print(run.text)
Writing MS Word Files with Python-Docx Module
In the previous section, you saw how to read MS Word files in Python using the python-docx module. In this section, you will see how to write MS Word files via the python-docx module.
To write MS Word files, you have to create an object of the Document class with an empty constructor, that is, without passing a file name.
mydoc = docx.Document()
To write paragraphs, you can use the add_paragraph() method of the Document class object. Once you have added a paragraph, you will need to call the save() method on the Document class object. The path of the file to which you want to write is passed as a parameter to the save() method. If the file doesn't already exist, a new file will be created; if it does exist, it is overwritten with the current contents of the Document object rather than appended to.
The following script writes a simple paragraph to a newly created MS Word file named "my_written_file.docx".
mydoc.add_paragraph("This is first paragraph of a MS Word file.")
mydoc.save("my_written_file.docx")
Once you execute the above script, you should see a new file "my_written_file.docx" in the directory that you specified in the save() method. Inside the file, you should see one paragraph which reads "This is first paragraph of a MS Word file."
Let's add another paragraph to the my_written_file.docx:
mydoc.add_paragraph("This is the second paragraph of a MS Word file.")
mydoc.save("my_written_file.docx")
Because mydoc still holds the first paragraph in memory, saving again overwrites my_written_file.docx with a document that contains both paragraphs, with the second one placed after the first.
Hope this blog gave you a good idea of how to interact with Word files in Python. Stay tuned for more amazing stuff from Atomic Loops!