From cd8f1a55b1b48dea9a277a4cfa6b1ad56440b5ea Mon Sep 17 00:00:00 2001
From: jvoisin
Date: Tue, 3 Apr 2018 21:45:05 +0200
Subject: [PATCH] Add a note about why we do clean PDF in a completely overkill way

---
 doc/implementation_notes.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/doc/implementation_notes.md b/doc/implementation_notes.md
index bc83671..b385659 100644
--- a/doc/implementation_notes.md
+++ b/doc/implementation_notes.md
@@ -25,6 +25,10 @@ handle PDF. But apparently, people are ok with
 [pdf redact tools](https://github.com/firstlookmedia/pdf-redact-tools), that
 simply transform the PDF into images. So this is what's MAT2 is doing too.
 
+Of course, it would be possible to detect images in a PDF file and process them
+with MAT2, but since a PDF can contain a lot of things, like images, videos,
+JavaScript, other PDFs, blobs, … this is the easiest and safest way to clean them.
+
 Images handling
 ---------------
 