mirror of https://github.com/kanzure/pdfparanoia.git synced 2024-10-31 22:58:52 +01:00

51 Commits

Author SHA1 Message Date
Bryan Bishop
cafb7330b6 bump version to 0.0.14 2013-07-09 01:35:23 -05:00
Bryan Bishop
13d388e7ee Merge pull request #31 from delinquentme/master
Make pdfparanoia install on systems without pdfminer.
2013-07-08 23:35:52 -07:00
Carl Crott
2275565fb2 fixed self-referential package install and cleaned out __init__.py 2013-07-08 22:48:45 -07:00
Bryan Bishop
388d2b289e README: write russian intro 2013-05-21 14:02:33 -05:00
Bryan Bishop
d54f6e826c Merge pull request #27 from DonnchaC/rsc
Check PDF is from the RSC before cleaning.
2013-05-13 13:06:23 -07:00
Donncha O'Cearbhaill
c673d77ec6 Check PDF is from the RSC before cleaning 2013-05-13 21:01:52 +01:00
Bryan Bishop
404e3577e0 Merge pull request #26 from DonnchaC/rsc
Watermark removal for Royal Society of Chemistry.
2013-05-13 12:52:40 -07:00
Donncha O'Cearbhaill
18140d838d Adding support for PDFs from pubs.rsc.org 2013-05-13 20:28:35 +01:00
Bryan Bishop
9d26a0aa01 Merge pull request #25 from semorrison/master
comparediffs, a tool to download, scrub, and compare PDFs
2013-05-02 11:39:25 -07:00
Scott Morrison
2ec1ca21a6 hash bang 2013-05-02 23:29:07 +10:00
Scott Morrison
11b59bd544 fixing README 2013-05-02 23:27:49 +10:00
Scott Morrison
702f2e2895 adding README for tests/diff/ 2013-05-02 23:25:10 +10:00
Scott Morrison
74649a1a05 merging urls.denied back into urls 2013-05-02 23:18:11 +10:00
Scott Morrison
27ad746861 adding a few more URLs for testing 2013-05-02 23:16:15 +10:00
Scott Morrison
2fb3783dea comparediffs seems to be working nicely 2013-05-02 22:45:34 +10:00
Scott Morrison
54b6ab070a initial attempt at pairwise diff testing 2013-05-02 20:06:18 +10:00
Bryan Bishop
6abfe2a380 Merge pull request #23 from cathalgarvey/master
Updated terminal script to use argparse.
2013-03-24 23:12:57 -07:00
Cathal Garvey
db514ff744 Fixed a few bugs so reading from stdin now works. Involves a potentially costly recast of file contents as StringIO. 2013-03-21 23:49:03 +00:00
Cathal Garvey
95e92420c9 Modified the "pdfparanoia" script in bin/ so it uses Argparse and the "with" context statement.
As python 2.6 was already commented as a potential environment, there seemed little
reason to not use Argparse rather than a sys.argv popping system; argparse offers
automatically generated usage documentation and can offer useful errors when input
is incorrect.

The "with" context statement is also highly excellent and should be used wherever
legacy support for old-timers using 2.6 is not needed.
2013-03-21 23:37:34 +00:00
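
Editor's sketch for the commit above: a minimal illustration of the two ideas it names, argparse for argument handling and the "with" statement for file handling. This is not the actual bin/pdfparanoia script; the positional argument, the -o/--output flag, and the function names build_parser, copy_file and main are all assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser. The argument names are illustrative, not the project's."""
    parser = argparse.ArgumentParser(
        description="Copy a PDF from one path to another (sketch only)."
    )
    parser.add_argument("infile", help="path of the PDF to read")
    parser.add_argument("-o", "--output", required=True, help="path to write")
    return parser


def copy_file(infile, outfile):
    """Copy bytes using "with", so both files are closed even on error."""
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        dst.write(src.read())


def main(argv=None):
    # argparse prints usage text and exits with an error on bad input,
    # which is the benefit the commit message points to.
    args = build_parser().parse_args(argv)
    copy_file(args.infile, args.output)
    return 0
```

Calling main(["paper.pdf", "-o", "copy.pdf"]) would copy the file; calling it with no arguments makes argparse print a usage message and exit.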
Bryan Bishop
0d1da12f71 README: increase explaining 2013-02-14 03:43:23 -06:00
Bryan Bishop
ee483ab986 Merge pull request #21 from zooko/verbose-option
Verbosity argument.
2013-02-14 01:39:19 -08:00
Zooko O'Whielacronx
503b8aead5 add -v -v mode which prints out the details (potentially sensitive, potentially bulky)
remove spie, which appears to do nothing
2013-02-13 21:08:49 +00:00
Zooko O'Whielacronx
9204b2e17e fix up verbose printouts, don't print out large data 2013-02-13 20:56:33 +00:00
Zooko O'Whielacronx
56cc7719da add a "--verbose" option that writes to stderr if it finds anything to omit
Also cleaned up some flakes noticed by pyflakes, and make the scrub() be @classmethod instead of @staticmethod so I could use the class for the verbose output.

caveats:

* there are no unit tests of this patch
* now your logs of your stderr have potentially sensitive information in them
* the implementation of arg parsing is very low-tech; (a *good* way to do arg parsing is the "argparse" module)
2013-02-13 19:58:47 +00:00
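
Editor's sketch for the commit above: one way a scrub() classmethod can use its class for verbose output on stderr, which a staticmethod cannot do because it has no reference to the class. The class ExamplePlugin, its marker attribute, and the log parameter are hypothetical and exist only for this illustration; the real plugins may look quite different.

```python
import sys


class ExamplePlugin(object):
    """Hypothetical plugin. The marker below is invented for illustration."""

    marker = b"Downloaded by"

    @classmethod
    def scrub(cls, content, verbose=False, log=None):
        """Drop lines containing the marker and optionally report on stderr."""
        log = sys.stderr if log is None else log
        kept = []
        for line in content.split(b"\n"):
            if cls.marker in line:
                if verbose:
                    # cls is available because this is a classmethod.
                    log.write("%s: omitting %d bytes\n" % (cls.__name__, len(line)))
                continue
            kept.append(line)
        return b"\n".join(kept)
```

As the commit's caveats note, anything written this way ends up in stderr logs, so the sketch reports only the size of the omitted line rather than its content.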
Bryan Bishop
caed396870 SPIE watermark removal
This is slightly broken because the SPIE plugin removes more than just
watermarks. For some reason it seems to also remove images and large
blocks of text from the paper. However, the object that is being removed
is tiny. In the unit testing sample, the removed object is pdf stream
55.

For now, SPIE is partially disabled until this is fixed. The problem
does not originate from the other plugins.

fixes #20
2013-02-11 23:52:59 -06:00
Bryan Bishop
9d7fd1dbb6 README: add command-line usage 2013-02-10 01:29:58 -06:00
Bryan Bishop
775b927b42 pdfparanoia command-line interface 2013-02-09 09:44:48 -06:00
Bryan Bishop
5c8a194445 deflation tool to help with debugging
The deflate function expands some of the FlateDecode streams in a pdf
file. The output of the deflate function is not always correct and it is
very buggy. Still, this is a useful tool to poke around in foreign pdfs
under investigation.
2013-02-07 20:51:10 -06:00
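
Editor's sketch for the commit above: a rough illustration of expanding FlateDecode streams with Python's zlib module. It is not the project's deflate function; the name inflate_streams and the regular expression are assumptions. Like the tool the commit describes, it is a debugging aid only: it ignores stream dictionaries, /Length values and other filters, and will miss or mangle streams in many real PDFs.

```python
import re
import zlib

# Everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed bodies of the streams zlib can decode."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # decompressobj tolerates the line break left before "endstream".
            results.append(zlib.decompressobj().decompress(body))
        except zlib.error:
            # Not zlib data (another filter, or an uncompressed stream).
            continue
    return results
```

Printing each returned body is usually enough to spot watermark text inside an unfamiliar PDF.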
Bryan Bishop
e108a43e26 make eraser handle more pdf formats 2013-02-07 03:56:18 -06:00
Bryan Bishop
f3d8475c79 better import formatting for core.py 2013-02-07 03:55:57 -06:00
Bryan Bishop
25195f9b11 add swap files to make clean 2013-02-06 17:39:42 -06:00
Bryan Bishop
11abd551d7 add certain pdfs to .gitignore 2013-02-06 17:34:38 -06:00
Bryan Bishop
b7b5a4ef65 jstor watermark removal
fixes #1
2013-02-06 17:33:00 -06:00
Bryan Bishop
47bc734318 replace_object_with - alternative removal method
Some publishers generate pdfs with the watermarks inside the text of a
page, in which case the object needs to be replaced. This deflates the
object and uses plaintext instead. While this increases the size of the
pdf, it is also effective for removing watermarks from the stream.
2013-02-06 17:27:12 -06:00
Bryan Bishop
57b6aba099 create requirements.txt 2013-02-06 00:03:48 -06:00
Bryan Bishop
4f0208963d remove pdfquery from requirements 2013-02-06 00:03:33 -06:00
Bryan Bishop
2711fc174b README: minor wording change 2013-02-05 23:44:00 -06:00
Bryan Bishop
8eb8797eeb support pdf formats with whitespace line endings
JSTOR pdfs have whitespace at the end of each line in their pdfs. Though
their watermarks are not yet removable, this supports parsing their
files in the future or any other publisher that does similar things.

see #1
2013-02-05 19:07:28 -06:00
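
Editor's sketch for the commit above: one way to recognise PDF object headers even when a line carries trailing spaces or tabs before the line break. The pattern and the name find_object_headers are assumptions for illustration, not the parsing code in this repository.

```python
import re

# "12 0 obj", optionally followed by spaces or tabs before the line break.
OBJ_HEADER_RE = re.compile(
    rb"^(\d+)[ \t]+(\d+)[ \t]+obj[ \t]*\r?$", re.MULTILINE
)


def find_object_headers(pdf_bytes):
    """Return (object number, generation number) pairs found in the bytes."""
    return [
        (int(m.group(1)), int(m.group(2)))
        for m in OBJ_HEADER_RE.finditer(pdf_bytes)
    ]
```

A stricter pattern that expects "obj" to be followed immediately by the line break would skip every object in a file with padded lines.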
Bryan Bishop
bc89bc5335 clean repo before uploading to pypi 2013-02-05 17:24:47 -06:00
Bryan Bishop
30c6e30891 version bump to 0.0.9 2013-02-05 17:21:58 -06:00
Bryan Bishop
f78aad78ef AIP: better false-positives check 2013-02-05 17:20:11 -06:00
Bryan Bishop
d276954bfa IEEE: remove print statement (oops) 2013-02-05 17:19:37 -06:00
Bryan Bishop
6ef9a19ff6 README: update changelog 2013-02-05 04:56:18 -06:00
Bryan Bishop
14f1439c76 ieee watermark removal 2013-02-05 04:49:56 -06:00
Bryan Bishop
0adec6c74e better plugin support 2013-02-05 04:34:45 -06:00
Bryan Bishop
aa3e188ef4 more setup.py madness 2013-02-05 04:29:16 -06:00
Bryan Bishop
302d2ff2e5 include plugins in package 2013-02-05 04:20:26 -06:00
Bryan Bishop
99285252fc include README.md via MANIFEST.in 2013-02-05 04:17:05 -06:00
Bryan Bishop
011c10c5c4 more setup.py magic 2013-02-05 04:14:08 -06:00
Bryan Bishop
99c88643f7 fix README packaging 2013-02-05 03:56:07 -06:00