## Remove watermark with PyPDF2

June 19, 2018

Watermarks in a PDF file can be very annoying, and they are usually hard to remove. With the Python package PyPDF2, it can be done after some research.

PyPDF2 can analyze the source PDF file and identify all of its elements, some of which may correspond to the watermark. Note that a watermark can be composed of several objects.

You can analyze the structure of the PDF file with PyPDF2 to identify the watermark objects.
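As a rough sketch of that kind of analysis, the snippet below counts how often each transform matrix (the `cm` operator) appears across all pages. It assumes the PyPDF2 1.x API (`PdfFileReader`, `PyPDF2.pdf.ContentStream`); newer releases renamed or moved these classes. The helper names `collect_matrices` and `matrix_frequencies` are made up for illustration.

```python
from collections import Counter


def _op_name(operator):
    """Return the operator as str; PyPDF2 1.x yields bytes on Python 3."""
    if isinstance(operator, bytes):
        return operator.decode("latin-1")
    return str(operator)


def collect_matrices(operations):
    """Return the operands of every `cm` operation as tuples of floats."""
    matrices = []
    for operands, operator in operations:
        if _op_name(operator) == "cm" and len(operands) == 6:
            matrices.append(tuple(round(float(v), 4) for v in operands))
    return matrices


def matrix_frequencies(pdf_path):
    """Count how often each `cm` matrix occurs over all pages of a PDF."""
    # Assumption: PyPDF2 1.x module layout and method names.
    from PyPDF2 import PdfFileReader
    from PyPDF2.pdf import ContentStream

    counts = Counter()
    with open(pdf_path, "rb") as fh:
        reader = PdfFileReader(fh)
        for index in range(reader.getNumPages()):
            contents = reader.getPage(index).getContents()
            if contents is None:
                continue  # page without a content stream
            stream = ContentStream(contents, reader)
            counts.update(collect_matrices(stream.operations))
    return counts
```

A matrix that shows up with the same values on every page is a reasonable candidate for the watermark, although page layout transforms can repeat in the same way.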

Some very simple watermarks may be just plain text, but others can be very complex. With the help of Inkscape, you can find the specific signature of the watermark quite easily: import several pages into Inkscape and use Edit -> XML Editor to view the structure of the pages.
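If you save an imported page as SVG, the same inspection can be scripted instead of clicking through the XML editor. The sketch below lists every element that carries a `matrix(...)` transform, using only the standard library. The function name `svg_matrices` is hypothetical, and Inkscape may rescale or flip coordinates on import, so treat the numbers as hints rather than exact PDF values.

```python
import re
import xml.etree.ElementTree as ET

_MATRIX_RE = re.compile(r"matrix\(([^)]*)\)")


def svg_matrices(svg_text):
    """Return (tag, matrix) pairs for elements with a matrix(...) transform."""
    found = []
    for elem in ET.fromstring(svg_text).iter():
        match = _MATRIX_RE.search(elem.get("transform", ""))
        if not match:
            continue
        parts = re.split(r"[,\s]+", match.group(1).strip())
        values = tuple(float(p) for p in parts if p)
        tag = elem.tag.split("}")[-1]  # drop the XML namespace prefix
        found.append((tag, values))
    return found
```

Running this over several pages and comparing the results shows which transforms repeat from page to page.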

Some SVG nodes may have a transform matrix; use it to identify the corresponding PyPDF2 objects. Setting the matrix of these objects to empty will make the watermark disappear!
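One possible reading of "setting the matrix to empty" is to clear the operands of the matching `cm` operations in each page's content stream. The sketch below does that under the assumption of the PyPDF2 1.x API; `blank_matching_matrices`, `remove_watermark`, and the `target` parameter are illustrative names, and the right `target` values depend on your own file.

```python
def blank_matching_matrices(operations, target, tol=1e-3):
    """Empty the operands of each `cm` matching `target`; return the hit count."""
    if len(target) != 6:
        raise ValueError("target must hold six numbers")
    hits = 0
    for operands, operator in operations:
        name = operator.decode("latin-1") if isinstance(operator, bytes) else str(operator)
        if name != "cm" or len(operands) != 6:
            continue
        if all(abs(float(a) - float(b)) <= tol for a, b in zip(operands, target)):
            del operands[:]  # mutate in place so the stream sees the change
            hits += 1
    return hits


def remove_watermark(src_path, dst_path, target):
    """Rewrite a PDF with the matching matrices blanked on every page."""
    # Assumption: PyPDF2 1.x module layout and method names.
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from PyPDF2.generic import NameObject
    from PyPDF2.pdf import ContentStream

    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for index in range(reader.getNumPages()):
            page = reader.getPage(index)
            contents = page.getContents()
            if contents is not None:
                stream = ContentStream(contents, reader)
                if blank_matching_matrices(stream.operations, target):
                    # Replace the page contents with the edited stream.
                    page[NameObject("/Contents")] = stream
            writer.addPage(page)
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

Check the output in more than one viewer, since a watermark built from several objects may need more than one `target`.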

If you read the PDF file into a PdfFileReader, modify every page, and then add the pages to a PdfFileWriter, you may lose the bookmark information. A better choice is to extend the PdfFileMerger class, because it automatically handles the bookmarks and other information. Some may suggest using the cloneDocumentFromReader method, but it may generate blank pages.
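A minimal sketch of the PdfFileMerger route might look like this. It relies on PyPDF2 1.x internals as an assumption: that the merger keeps its pages in `self.pages`, and that each entry exposes `pagedata` (the page) and `src` (its reader). The names `matches`, `make_merger`, `WatermarkStrippingMerger`, and `strip` are hypothetical.

```python
def matches(operands, target, tol=1e-3):
    """True when `operands` holds six numbers equal to `target` within `tol`."""
    return (
        len(operands) == 6
        and len(target) == 6
        and all(abs(float(a) - float(b)) <= tol for a, b in zip(operands, target))
    )


def make_merger(target):
    """Return a PdfFileMerger subclass instance that can blank `target` matrices."""
    # Assumption: PyPDF2 1.x module layout and merger internals.
    from PyPDF2 import PdfFileMerger
    from PyPDF2.generic import NameObject
    from PyPDF2.pdf import ContentStream

    class WatermarkStrippingMerger(PdfFileMerger):
        def strip(self):
            """Edit every merged page in place before writing."""
            for merged in self.pages:
                page = merged.pagedata
                contents = page.getContents()
                if contents is None:
                    continue
                stream = ContentStream(contents, merged.src)
                changed = False
                for operands, operator in stream.operations:
                    if operator in (b"cm", "cm") and matches(operands, target):
                        del operands[:]
                        changed = True
                if changed:
                    page[NameObject("/Contents")] = stream

    return WatermarkStrippingMerger()


# Possible usage (file names and matrix values are placeholders):
# merger = make_merger((0.5, 0, 0, 0.5, 10, 20))
# with open("in.pdf", "rb") as src:
#     merger.append(src)  # bookmarks are imported by default
# merger.strip()
# with open("out.pdf", "wb") as dst:
#     merger.write(dst)
```

Because the merger imports the bookmarks when the file is appended, only the page contents need to be touched here.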