Working with PDF Files in Python

We are all familiar with PDF files. PDF stands for Portable Document Format, and its file extension is “.pdf”. We use PDFs in our daily lives, knowingly or unknowingly: books and certificates downloaded from the internet are mostly in PDF format, and since the COVID outbreak much of our studying has moved online, where notes and assignments are usually shared as PDFs. You have learnt how to deal with text files and binary files, but dealing with PDFs is not the same. You might think I am fooling around, because a PDF looks perfectly normal in a viewer, but the file itself is stored in a binary format that looks like gibberish if you open it as plain text. There are different ways to deal with PDF files. In this blog, we will work with PDFs in Python and learn some of the operations we can perform on them using a third-party module, PyPDF2.
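To see the “gibberish” for yourself, you can open a file in binary mode and look at its first few bytes. The sketch below is only an illustration: the helper name looks_like_pdf is my own, and it relies on the fact that a well-formed PDF begins with the marker %PDF- followed by a version number. The demo writes small stand-in files, so it runs even without a real PDF on disk.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker."""
    # PDFs must be opened in binary mode; the header is raw bytes, not text.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demo with stand-in files so the sketch runs without a real PDF on disk.
with tempfile.TemporaryDirectory() as tmp:
    fake_pdf = os.path.join(tmp, "sample.pdf")
    with open(fake_pdf, "wb") as f:
        # Header bytes only; this is not a complete, viewable PDF.
        f.write(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
    plain_txt = os.path.join(tmp, "notes.txt")
    with open(plain_txt, "wb") as f:
        f.write(b"just some text")

    pdf_check = looks_like_pdf(fake_pdf)
    txt_check = looks_like_pdf(plain_txt)

print(pdf_check, txt_check)  # prints: True False
```

The bytes after the header are typical of what you find inside a PDF: they are not readable text, which is why we need a library to make sense of the file.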

Operations we are going to perform:

  1. Extracting text from a PDF.
  2. Merging PDF files.
  3. Rotating a page of a PDF file.

What is PyPDF2?

PyPDF2 is a Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from them.

Before we work with PDFs, make sure you have PyPDF2 installed on your system. If you don't, install it by running the command below in your terminal.

pip install PyPDF2
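
PyPDF2 renamed several classes and methods between major versions (for example, PdfFileReader became PdfReader), so it is worth confirming which version pip installed. Here is a small sketch for that; the helper name installed_version is my own, and it assumes Python 3.8 or newer, where importlib.metadata is part of the standard library.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed; run: pip install PyPDF2")
else:
    print("PyPDF2 version:", version)
```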

1. Extracting text from PDF

Python File.py

import PyPDF2

# Open the PDF in binary read mode; the with block closes the file for us
with open('C++ by E Balagurusamy.pdf', 'rb') as pdfFileObj:
    read_pdf = PyPDF2.PdfReader(pdfFileObj)
    number_of_pages = len(read_pdf.pages)  # total number of pages
    page = read_pdf.pages[0]               # pages are zero-indexed, so this is the first page
    page_content = page.extract_text()
    print(page_content)

Let’s understand how each step works:

  1. First, we import PyPDF2.
  2. Then we open the PDF in binary read mode ('rb') and keep the file object in pdfFileObj.
  3. Next, we pass that file object to PyPDF2.PdfReader to create read_pdf, which reads the PDF. (PdfReader is the name in PyPDF2 3.x; older releases called it PdfFileReader, and that name no longer works in 3.x.)
  4. len(read_pdf.pages) gives the number of pages in the PDF.
  5. read_pdf.pages[0] returns one particular page of the PDF. Indexing starts at 0, so this is the first page.
  6. extract_text() extracts the text from that page. The result depends on how the PDF was produced; a scanned page, for example, may yield no text at all.
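
The example above reads only the first page, even though it also counts the pages. Here is a minimal sketch of pulling the text out of every page instead. It assumes the PyPDF2 3.x names (PdfReader, reader.pages, extract_text); the helper names extract_all_text and extract_pdf_text are my own, and the file name in the usage comment is just a placeholder.

```python
def extract_all_text(pages):
    """Join the text of every page; `pages` is any iterable of objects with extract_text()."""
    texts = []
    for page in pages:
        # Fall back to an empty string if a page yields no text (e.g. a scanned page).
        texts.append(page.extract_text() or "")
    return "\n".join(texts)


def extract_pdf_text(path):
    """Read the PDF at `path` and return the text of all of its pages."""
    # Imported here so extract_all_text can be tried without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x naming

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Extract while the file is still open, since pages are read on demand.
        return extract_all_text(reader.pages)


# Usage (the file name is a placeholder):
# print(extract_pdf_text("C++ by E Balagurusamy.pdf"))
```

Keeping the loop in its own function makes it easy to try out without a real PDF, because it only needs objects that have an extract_text() method.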

2. Merging PDF Files

Python File.py

import PyPDF2

pdfs = ['C++ by E Balagurusamy.pdf', 'at_your_age.pdf']
pdfmerger = PyPDF2.PdfMerger()
for pdf in pdfs:
    pdfmerger.append(pdf)  # adds every page of this file to the end
with open('combined.pdf', 'wb') as f:
    pdfmerger.write(f)
pdfmerger.close()  # release the input files

Let’s understand how each step works:

  1. First, we import PyPDF2.
  2. Then we create a list of the PDFs that we want to merge.

Note: I haven’t specified paths for the PDF files because they are in the same directory as my Python file. If your PDF files are somewhere else, give the path along with the file name.

  3. Next, we create an object pdfmerger of the class PdfMerger (called PdfFileMerger in PyPDF2 releases before 3.x).
  4. After that, we append each PDF to the merger, just as we would append an element to a list.
  5. Finally, we open a new file in binary write mode ('wb') and write the merged data into it.
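
If your PDFs live in another folder, one way to handle the paths from the note above is to collect them with pathlib rather than typing each one. This is only a sketch: the helper names collect_pdfs and merge_pdfs are my own, the folder and output names in the usage comment are placeholders, and the merging part assumes the PyPDF2 3.x class name PdfMerger.

```python
from pathlib import Path


def collect_pdfs(directory):
    """Return the .pdf files directly inside `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def merge_pdfs(pdf_paths, output_path):
    """Append each PDF in order and write the result to `output_path`."""
    # Imported here so collect_pdfs can be tried without PyPDF2 installed.
    from PyPDF2 import PdfMerger  # assumes PyPDF2 3.x naming

    merger = PdfMerger()
    for path in pdf_paths:
        merger.append(str(path))  # full path, so the files can live anywhere
    with open(output_path, "wb") as out_file:
        merger.write(out_file)
    merger.close()


# Usage (the folder and output names are placeholders):
# merge_pdfs(collect_pdfs("my_pdfs"), "combined.pdf")
```

Writing the output outside the source folder is a sensible habit here; otherwise a second run would pick up the previous combined file as one of the inputs.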

3. Rotating a Page of PDF File

Python File.py

import PyPDF2

with open('at_your_age.pdf', 'rb') as pdfFileObj:
    read_pdf = PyPDF2.PdfReader(pdfFileObj)
    page = read_pdf.pages[5]  # index 5 is the sixth page
    page.rotate(90)           # rotate 90 degrees clockwise
    pdf_writer = PyPDF2.PdfWriter()
    pdf_writer.add_page(page)
    # Write while the source file is still open
    with open('Python_Tutorial_rotated.pdf', 'wb') as pdf_file_rotated:
        pdf_writer.write(pdf_file_rotated)

Let’s understand how each step works:

  1. First, we import PyPDF2.
  2. Then we open the PDF in binary read mode and pass the file object to PyPDF2.PdfReader to create read_pdf.
  3. read_pdf.pages[5] fetches the page at index 5, which is the sixth page because indexing starts at 0, and we store it in a variable named page.
  4. page.rotate(90) rotates that page 90 degrees clockwise; the angle must be a multiple of 90. (Older PyPDF2 releases called this method rotateClockwise.)
  5. Then we create a pdf_writer object, which lets us write pages to a new PDF, and add the rotated page to it.
  6. Finally, we open a new PDF file in binary write mode and write the rotated page into it.
  7. You can also rotate the whole PDF by looping over read_pdf.pages with a for loop and rotating each page.
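
The last step, rotating the whole file with a for loop, could look roughly like this. It is a sketch that assumes the PyPDF2 3.x names (PdfReader, PdfWriter, page.rotate, add_page); the function names check_angle and rotate_pdf are my own, and the file names in the usage comment are placeholders.

```python
def check_angle(angle):
    """Return `angle` if it is a multiple of 90, otherwise raise ValueError."""
    if angle % 90 != 0:
        raise ValueError("rotation angle must be a multiple of 90 degrees")
    return angle


def rotate_pdf(input_path, output_path, angle=90):
    """Rotate every page of `input_path` clockwise by `angle` and save a copy."""
    # Imported here so check_angle can be tried without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 3.x naming

    check_angle(angle)
    writer = PdfWriter()
    with open(input_path, "rb") as in_file:
        reader = PdfReader(in_file)
        for page in reader.pages:  # the for loop visits each page in turn
            page.rotate(angle)     # positive angles turn the page clockwise
            writer.add_page(page)
        # Write while the source file is still open, since pages are read on demand.
        with open(output_path, "wb") as out_file:
            writer.write(out_file)


# Usage (both file names are placeholders):
# rotate_pdf("at_your_age.pdf", "at_your_age_rotated.pdf", angle=90)
```

Checking the angle before touching any file means a typo such as 45 fails early with a clear message instead of partway through the loop.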

Conclusion

So, we have learned how to work with PDF files in Python. I hope this blog is useful to you and helps you understand each step, and that all your doubts are cleared. Thank you.