47 Commits

Author SHA1 Message Date
jvoisin 2ba38dd2a1 Bump mypy typing coverage 2018-10-12 14:32:09 +02:00
jvoisin 0d25b18d26 Improve both the typing and the comments 2018-10-05 17:07:58 +02:00
jvoisin 8e98593b02 Trash word/people.xml in office files 2018-10-04 16:28:20 +02:00
jvoisin 5a5c642a46 Don't break office files for MS Office
We didn't take the whitelist into account while
removing dangling files from [Content_Types].xml
2018-10-03 16:38:05 +02:00
jvoisin 1b356b8c6f Improve mat2's cli reliability
- Replace some class members by instance members
- Don't thread the cleaning process anymore for now
2018-10-03 15:22:36 +02:00
jvoisin c67bbafb2c Use [Content_Types].xml to improve MS Office coverage 2018-10-02 11:55:42 -07:00
georg 5b606f939d fix typo 2018-10-02 16:01:24 +00:00
jvoisin 652b8e519f Files processed via MAT2 are now accepted without warnings by MS Office 2018-10-01 12:25:37 -07:00
jvoisin 81a3881aa4 Please mypy 2018-09-30 19:55:17 +02:00
jvoisin e342671ead Remove dangling references in MS Office's [Content_Types].xml 2018-09-30 19:53:18 +02:00
jvoisin 719cdf20fa Second pass of minor formatting 2018-09-24 20:15:07 +02:00
jvoisin 2e243355f5 Fix some minor formatting issues 2018-09-24 19:50:24 +02:00
jvoisin 174d4a0ac0 Implement rsid stripping for office files
MS Office XML rsid is a "unique identifier used to track the editing session
when the physical character representing this section mark was last formatted."

See the following links for details:
- https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
- https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/.
2018-09-24 18:03:59 +02:00
jvoisin fbcf68c280 Lexicographical sort on xml attributes for office files
In XML, the order of the attributes shouldn't be meaningful,
however, MS Office sorts attributes for a given XML tag
differently than LibreOffice.
2018-09-24 17:45:09 +02:00
jvoisin e3d817f57e Split office and archives 2018-09-06 11:34:14 +02:00
Daniel Kahn Gillmor f3cef319b9 Unknown Members: make policy use an Enum
Closes #60

Note: this changeset also ensures that clean.cleaned.docx is cleaned
up after the pytest run is over.
2018-09-05 18:59:33 -04:00
jvoisin 072ee1814d Remove defusedxml support and document why 2018-09-05 18:41:08 +02:00
jvoisin 46bb1b83ea Improve the previous commit 2018-09-05 17:26:09 +02:00
Daniel Kahn Gillmor 1d7e374e5b office: try all members, even when one fails
the end result will be the same -- an abort -- but the user will get
to see all the warnings for a particular file, instead of getting them
one at a time.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor 915dc634c4 document all unknown/unhandlable files even on abort
This makes it easy to get a list of all files that mat2 doesn't know
how to handle, without having to choose -u keep or -u omit.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor 4192a2daa3 office: create policy for what to do about unknown members
previously, encountering an unknown member meant that any parser of
this type would abort.

now, the user can set parser.unknown_member_policy to either 'omit' or
'keep' if they don't want the current action of 'abort'

note that this causes pylint to complain about branching depth for
remove_all() because of the nuanced error-handling.  I've disabled
this check.
2018-09-04 16:13:33 -04:00
jvoisin 7877ba0da5 Fix a minor formatting issue 2018-09-01 14:16:55 +02:00
dkg e2634f7a50 Logging cleanup 2018-09-01 05:14:32 -07:00
jvoisin 942859601d Improve the code's documentation 2018-07-19 23:10:27 +02:00
jvoisin 565cb66d14 Minor simplification in how we're handling xml for office files 2018-07-19 22:55:08 +02:00
jvoisin 5a7c7f35f7 Remove `print` from libmat, and use the `logging` module instead
This should close #28
2018-07-10 21:30:38 +02:00
jvoisin 080d6769ca Make pylint even happier 2018-07-09 01:11:44 +02:00
jvoisin 8c21006e6c Fix some pep8 issues spotted by pyflakes 2018-07-08 22:40:36 +02:00
jvoisin f49aa5cab7 Achieve 100% coverage! 2018-07-08 22:27:37 +02:00
jvoisin ad3e7ccee8 Bump coverage for office files and fix some related crashes 2018-07-08 21:35:45 +02:00
jvoisin ca01484126 Silence a stupid mypy warning 2018-07-08 17:12:17 +02:00
jvoisin f9bc022c96 Add defusedxml as an (optional) way to prevent XML-based attacks
Those attacks are DoS-only.
2018-07-08 17:07:26 +02:00
jvoisin 85455a4419 Fix a mistake in office file revisions handling 2018-07-07 18:05:54 +02:00
jvoisin 893f58554a Slightly improve the formatting of the code thanks to pyflakes3 2018-07-02 00:22:05 +02:00
jvoisin bee56a57ce Remove docx revisions 2018-07-01 23:16:14 +02:00
jvoisin 02f7605ac1 MAT2 is now cleaning revisions from odt files! 2018-07-01 21:09:20 +02:00
jvoisin 80fc4ffb40 Remove the thumbnails from libreoffice files 2018-07-01 17:29:05 +02:00
jvoisin 177184ac67 Massively simplify how we're cleaning office files 2018-06-27 21:48:46 +02:00
jvoisin 5b38bd7ccd Improve the reliability of the office parser 2018-06-21 23:18:59 +02:00
jvoisin 846a261465 Fix some linter warnings 2018-06-21 23:07:21 +02:00
jvoisin 09e748fa4c Refactor how office files are handled
- xml files are no longer considered harmless
- Factorization of the `remove_all` method for office files
- Explicit whitelists are used
- Blacklists are used to skip files completely
  - Non-blacklisted files are _still cleaned_
  - Unsupported files are still triggering an error
2018-06-21 23:02:41 +02:00
jvoisin a89dae054a Minor simplification of the office-related code 2018-06-21 21:24:53 +02:00
jvoisin 545887af98 Minor code simplification 2018-06-10 20:20:32 +02:00
jvoisin 7dad77a785 Make the parsing of office format's metadata more robust 2018-06-10 20:20:00 +02:00
jvoisin 8c7979aae3 Add some tests for non-supported embedded fileformats 2018-06-10 20:19:35 +02:00
jvoisin 6a1b0b31f0 Add more typing and use mypy in the CI 2018-06-04 23:20:30 +02:00
jvoisin 38fae60b8b Rename some files to simplify packaging
- the `src` folder is now `libmat2`
- the `main.py` script is now `mat2.py`
2018-05-18 23:52:40 +02:00
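
Commits 174d4a0ac0 (rsid stripping) and fbcf68c280 (lexicographical sort on XML attributes) both describe normalizing the XML inside office files. The sketch below illustrates that idea only; it is not the code from those commits. The function names `normalize_attributes` and `_local_name` are hypothetical, and the assumption that every attribute whose local name starts with "rsid" can be dropped is ours, not the log's. It uses only the standard library's `xml.etree.ElementTree`, and it does not touch `rsid` elements, only attributes.

```python
import xml.etree.ElementTree as ET


def _local_name(qualified: str) -> str:
    """Return an attribute name without its '{namespace}' prefix."""
    return qualified.rsplit("}", 1)[-1]


def normalize_attributes(root: ET.Element) -> None:
    """Drop rsid* attributes and re-insert the rest in sorted order.

    Assumption (not from the commit log): an attribute is treated as
    an rsid if its local name starts with "rsid", case-insensitively.
    Sorting is done on the fully qualified '{namespace}name' key.
    """
    for element in root.iter():
        kept = {
            key: value
            for key, value in element.attrib.items()
            if not _local_name(key).lower().startswith("rsid")
        }
        # Rebuild the attribute dict so its insertion order is sorted;
        # ElementTree serializes attributes in that order on Python 3.8+.
        element.attrib.clear()
        for key in sorted(kept):
            element.set(key, kept[key])
```

Sorting on the qualified key keeps the result independent of which namespace prefixes a given producer chose, which is the point of making MS Office and LibreOffice output comparable.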