248 Commits

Author SHA1 Message Date
jvoisin 0170f0e37e Improve a bit the comments in the code
This is related to the previous commit
2019-09-01 13:52:02 +02:00
jvoisin 0cf0541ad9 Remove nsid fields from MSOffice documents
nsids are random identifiers, usually used to ease merging
between documents, and can trivially be used for fingerprinting.
2019-09-01 13:52:02 +02:00
jvoisin 0c75cd15dc Remove a mypy workaround to bump coverage back to 100% 2019-07-22 23:28:51 +02:00
jvoisin 5280b6c2b3 Add a test for svg namespace 2019-07-22 23:21:06 +02:00
georg 8bb2826f7a CI: Add job to run codespell, a spell checking software 2019-07-22 13:31:40 -07:00
jvoisin 5c33b290ae Fix mypy 2019-07-20 16:05:55 +02:00
jvoisin dc5603eb1d Please mypy 2019-07-13 23:25:44 +02:00
jvoisin 4999209f9c Add support for svg 2019-07-13 21:26:05 +02:00
jvoisin bdd5581033 Compress cleaned zip archives by default 2019-07-13 15:04:43 +02:00
jvoisin 47f9cb33bf Please mypy 2019-07-13 15:03:40 +02:00
jvoisin 35d550d229 Use memoization for the _*_path() functions
This shouldn't make a big difference in the CLI/extension
usage, but might improve the performance of long-running
instances, or for people misusing the API.
2019-05-16 00:31:40 +02:00
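A minimal sketch of what memoizing a path-lookup helper can look like. It assumes `functools.lru_cache` and a hypothetical `find_tool_path` helper; neither name is taken from the commit, and the real `_*_path()` functions may be written differently.

```python
import functools
import shutil
from typing import Optional


@functools.lru_cache(maxsize=None)
def find_tool_path(name: str) -> Optional[str]:
    """Hypothetical helper: resolve an executable on PATH once, then reuse the result."""
    return shutil.which(name)


first = find_tool_path("exiftool")   # walks PATH
second = find_tool_path("exiftool")  # served from the cache, no second lookup
```

A one-shot CLI run calls such a helper only a handful of times, which is why the gain mostly shows up in long-running processes.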
jvoisin aa52a5c91c Please mypy wrt. the last two commits 2019-05-14 00:50:17 +02:00
Antoine Tenart f19f6ed8b6 Rework the dependency checks to distinguish required/optional ones
Rework the dependencies definition to include a 'required' flag, which
is passed by the check_dependencies helper to the callers, so that they
can distinguish between required and optional dependencies.

This helps in two ways:
- The unit test for the dependencies was now failing when an optional
  one was missing, due to a previous rework.
- Mat2's --check-dependencies was referring to "required dependencies"
  and was misleading for the user as some of them could be optional.

Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2019-05-13 23:35:26 +02:00
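The idea of a per-dependency 'required' flag can be sketched as follows. The names `DEPENDENCIES` and `check_dependencies` come from the commit messages in this log, but the dictionary layout, the module names, and the `missing_required` helper are assumptions made purely for illustration.

```python
import importlib.util
from typing import Dict, List

# Hypothetical table: the entries and the dict layout are assumptions.
DEPENDENCIES: Dict[str, Dict[str, object]] = {
    "JSON (stdlib stand-in)": {"module": "json", "required": True},
    "Optional extra": {"module": "module_that_is_not_installed_xyz", "required": False},
}


def check_dependencies() -> Dict[str, Dict[str, bool]]:
    """Report, for each dependency, whether it was found and whether it is required."""
    report = {}
    for name, spec in DEPENDENCIES.items():
        found = importlib.util.find_spec(str(spec["module"])) is not None
        report[name] = {"found": found, "required": bool(spec["required"])}
    return report


def missing_required(report: Dict[str, Dict[str, bool]]) -> List[str]:
    """Only a missing *required* dependency should count as a failure."""
    return [name for name, status in report.items()
            if status["required"] and not status["found"]]


report = check_dependencies()
```

With this shape, a test suite can fail on `missing_required(report)` while a missing optional dependency is merely reported.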
jvoisin 97abafdc58 Minor code cleanup 2019-05-09 09:41:05 +02:00
fuzzy 7e031c9757 typo 2019-05-03 02:39:15 -07:00
jvoisin 9516990693 Add some verification for "dangerous" tarfiles 2019-05-01 17:55:35 +02:00
jvoisin a7ebb587e1 Handle weird permissions in tar archives 2019-04-27 22:48:40 +02:00
jvoisin 14a4cddb8b Improve the display of tarfile's members mtime 2019-04-27 21:15:06 +02:00
jvoisin 8e41b098d6 Add support for compressed tar files 2019-04-27 06:03:09 -07:00
jvoisin 82cc822a1d Add tar archive support 2019-04-27 04:05:36 -07:00
jvoisin 05f429b197 Add support for xhtml files 2019-04-14 20:36:33 +02:00
jvoisin 1e325c5b5b Please mypy
Apparently, mypy isn't able (yet?) to deal
with variables that are changing their types
at runtime.

Python is wonderful.
2019-03-30 10:33:16 +01:00
Antoine Tenart d454ef5b8e libmat2: fix dependency checks for cmd line utilities
The command line checks for command line utilities are done by trying to
access the executables and by throwing an exception when not found. This
led to:
- The mat2 cmd line --check-dependencies option failing.
- The ffmpeg unit tests failing when ffmpeg isn't installed (even though
  it's an optional dependency).

This patch fixes it.

Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2019-03-29 19:29:28 +01:00
Antoine Tenart c824a68dd8 libmat2: reshape the dependencies list
Invert the keys and values in DEPENDENCIES. It seems more natural to use
the key as a key in check_dependencies(), and the value as the value.
This also helps in preparing for reworking the check_dependencies()
helper.

Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2019-03-29 19:29:28 +01:00
jvoisin b8c92fec09 Fix the testsuite 2019-03-23 00:41:23 +01:00
Antoine Tenart 0e3c2c9b1b libmat2: audio: not all id3 types have a text attribute
Not all id3 types have a text attribute (such as mutagen.id3.APIC or
mutagen.id3.UFID). This causes the get_meta helper to crash when
trying to access the text attribute of an object which does not have it.
Fix it by checking that the text attribute is available before accessing
it.

Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2019-03-23 00:32:44 +01:00
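The guard described above can be sketched without mutagen. `TextFrame` and `BinaryFrame` are hypothetical stand-ins for ID3 frame objects with and without a `text` attribute, and `collect_text_meta` is an illustrative name, not the project's `get_meta`.

```python
from typing import Dict, Iterable, List


class TextFrame:
    """Stand-in for an ID3 frame that carries a list of text values."""
    def __init__(self, name: str, text: List[str]) -> None:
        self.name = name
        self.text = text


class BinaryFrame:
    """Stand-in for a frame such as an embedded picture: no `text` attribute."""
    def __init__(self, name: str, data: bytes) -> None:
        self.name = name
        self.data = data


def collect_text_meta(frames: Iterable[object]) -> Dict[str, str]:
    """Collect textual metadata, skipping frames that have no `text` attribute."""
    meta = {}
    for frame in frames:
        if not hasattr(frame, "text"):
            continue  # accessing frame.text here would raise AttributeError
        meta[frame.name] = ", ".join(str(value) for value in frame.text)
    return meta


result = collect_text_meta([TextFrame("TIT2", ["Song"]), BinaryFrame("APIC", b"\x89PNG")])
```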
Brolf 5ac91cd4f9 Refactor {black,white}list into {block,allow}list
Closes #96
2019-03-05 23:13:42 +00:00
georg c3f097a82b fix typo 2019-03-01 22:00:23 +00:00
jvoisin 55214206b5 Improve the previous commit
- More tests
- More documentation
- Minor code cleanup
2019-02-27 23:53:07 +01:00
jvoisin 73d2966e8c Improve epub support 2019-02-27 23:04:38 +01:00
jvoisin eb2e702f37 Document the previous commit 2019-02-25 15:37:44 +01:00
jvoisin 545dccc352 In archive-based formats, the `mimetype` file comes first
This should improve epub compatibility,
along with other formats as a side-effect
2019-02-24 23:32:32 +01:00
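One way to write an archive with the `mimetype` member first is to sort the member names with a key that ranks `mimetype` ahead of everything else. This is a sketch using the standard `zipfile` module; `member_sort_key` and `rewrite_archive` are hypothetical names, and the project's own archive code may work differently.

```python
import io
import zipfile
from typing import Tuple


def member_sort_key(name: str) -> Tuple[bool, str]:
    """Rank `mimetype` first (False sorts before True), then sort by name."""
    return (name != "mimetype", name)


def rewrite_archive(src: bytes) -> bytes:
    """Copy a zip archive, writing its members in the order given by the key above."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src)) as zin, zipfile.ZipFile(out, "w") as zout:
        for name in sorted(zin.namelist(), key=member_sort_key):
            zout.writestr(name, zin.read(name))
    return out.getvalue()


# Build a small in-memory archive whose members are deliberately out of order.
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    for member in ("b.xml", "mimetype", "a.xml"):
        archive.writestr(member, b"data")

with zipfile.ZipFile(io.BytesIO(rewrite_archive(source.getvalue()))) as rewritten:
    ordered_names = rewritten.namelist()
```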
jvoisin 524bae5972 <title> is also an html metadata 2019-02-23 20:47:26 +01:00
jvoisin c757a9b7ef Fix a bug in css cleaning
It's not mandatory to actually have a comment inside
comment delimiters, like `/**/`.
2019-02-23 20:21:11 +01:00
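An empty comment is an easy case to miss in a regular expression. The sketch below is not the project's CSS code; it only illustrates, with a hypothetical `strip_css_comments` helper, how a pattern requiring at least one character between the delimiters fails on `/**/` while a pattern allowing zero characters does not.

```python
import re

# `.+?` demands at least one character between the delimiters, so `/**/` slips through.
STRICT_COMMENT = re.compile(r"/\*.+?\*/", re.DOTALL)
# `.*?` also accepts an empty comment body.
LENIENT_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_css_comments(css: str) -> str:
    """Hypothetical helper: drop every CSS comment, including empty ones."""
    return LENIENT_COMMENT.sub("", css)


strict_match = STRICT_COMMENT.fullmatch("/**/")
lenient_match = LENIENT_COMMENT.fullmatch("/**/")
cleaned = strip_css_comments("a{}/**/b{}/* note */")
```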
jvoisin 02ff21b158 Implement epub support 2019-02-20 16:28:11 -08:00
jvoisin a81b7658a8 Make the mandatory metadata warning generic
This should close #95.
2019-02-10 21:46:13 +01:00
jvoisin 6e63e03b86 Streamline a bit the previous commit 2019-02-09 15:23:16 +01:00
Poncho a71488d459 bind mount /etc/ld.so.cache to the sandbox
without /etc/ld.so.cache available in the sandbox, tests fail on gentoo with:
/usr/bin/ffmpeg: error while loading shared libraries: libstdc++.so.6:
    cannot open shared object file: No such file or directory
2019-02-09 09:49:51 +01:00
jvoisin 6ef6aaa222 Improve a bit get_meta for libreoffice files 2019-02-08 23:23:56 +01:00
jvoisin 6cc034e81b Add support for html files 2019-02-08 23:05:18 +01:00
jvoisin e1dd439fc8 Use of the archive refactoring for the office documents too 2019-02-07 22:19:37 +01:00
jvoisin b9a62d798a Refactor a bit office get_meta handling
This should make it easier to get more metadata from
archive-based file formats.
2019-02-04 00:31:26 +01:00
jvoisin 433609f8ea Implement .gif support 2019-02-03 21:01:58 +01:00
intrigeri e8c1bb0e3c Whenever possible, use bwrap for subprocesses
This should close #90
2019-02-03 19:18:41 +01:00
jvoisin 8e84ba547a Add support for wmv 2019-02-02 19:19:36 +01:00
jvoisin 04bb8c8ccf Add mp4 support 2018-10-28 07:41:04 -07:00
jvoisin 3a070b0ab7 Add support for zip files 2018-10-25 11:56:46 +02:00
jvoisin 283e5e5787 Improve archive-based parser's robustness against corrupted embedded files 2018-10-25 11:56:12 +02:00
jvoisin 513d897ea0 Implement get_meta() for archives 2018-10-25 11:29:50 +02:00
jvoisin 5a9dc388ad Minor refactorisation of how we're checking for exiftool's presence 2018-10-25 11:05:06 +02:00
jvoisin fe885babee Implement lightweight cleaning for jpg 2018-10-24 19:35:07 +02:00
jvoisin 9a81b3adfd Improve type annotation coverage 2018-10-23 16:32:28 +02:00
jvoisin f1a071d460 Implement lightweight cleaning for png and tiff 2018-10-23 16:22:11 +02:00
jvoisin 38df679a88 Optimize the handling of problematic files 2018-10-23 13:49:58 +02:00
jvoisin 44f267a596 Improve problematic filenames support 2018-10-22 16:56:05 +02:00
jvoisin 83389a63e9 Test mat2's reliability wrt. corrupted video files 2018-10-22 13:42:04 +02:00
jvoisin e70ea811c9 Implement support for .avi files, via ffmpeg
- This commit introduces optional dependencies (namely ffmpeg):
  mat2 will spit a warning when trying to process an .avi file
  if ffmpeg isn't installed.
- Since metadata are obtained via exiftool, this commit
  also refactors a bit our exiftool wrapper.
2018-10-22 12:58:01 +02:00
jvoisin 2ba38dd2a1 Bump mypy typing coverage 2018-10-12 14:32:09 +02:00
jvoisin b832a59414 Refactor lightweight mode implementation 2018-10-12 11:49:24 +02:00
jvoisin b9dbd12ef9 Implement recursive metadata for FLAC files
Since FLAC files can contain covers, it makes sense
to parse their metadata
2018-10-11 19:52:47 +02:00
jvoisin b2e153b69c Delete pictures of FLAC files 2018-10-11 18:15:11 +02:00
jvoisin 0d25b18d26 Improve both the typing and the comments 2018-10-05 17:07:58 +02:00
jvoisin d0f3534eff Hide unsupported extensions in `mat2 -l` 2018-10-05 12:43:21 +02:00
jvoisin 8e98593b02 Trash word/people.xml in office files 2018-10-04 16:28:20 +02:00
georg 34fbd633fd libmat2: fix shebang
Relates 0a2a398c9c
2018-10-03 18:38:28 +00:00
jvoisin 5a5c642a46 Don't break office files for MS Office
We didn't take the whitelist into account while
removing dangling files from [Content_types].xml
2018-10-03 16:38:05 +02:00
jvoisin 1b356b8c6f Improve mat2's cli reliability
- Replace some class members by instance members
- Don't thread the cleaning process anymore for now
2018-10-03 15:22:36 +02:00
jvoisin c67bbafb2c Use [Content_Types].xml to improve MS Office coverage 2018-10-02 11:55:42 -07:00
georg 5b606f939d fix typo 2018-10-02 16:01:24 +00:00
jvoisin 652b8e519f Files processed via MAT2 are now accepted without warnings by MS Office 2018-10-01 12:25:37 -07:00
jvoisin 81a3881aa4 Please mypy 2018-09-30 19:55:17 +02:00
jvoisin e342671ead Remove dangling references in MS Office's [Content_types].xml 2018-09-30 19:53:18 +02:00
jvoisin 719cdf20fa Second pass of minor formatting 2018-09-24 20:15:07 +02:00
jvoisin 2e243355f5 Fix some minor formatting issues 2018-09-24 19:50:24 +02:00
jvoisin 174d4a0ac0 Implement rsid stripping for office files
MS Office XML rsid is a "unique identifier used to track the editing session
when the physical character representing this section mark was last formatted."

See the following links for details:
- https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
- https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/.
2018-09-24 18:03:59 +02:00
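Stripping rsid attributes from WordprocessingML can be sketched with the standard library's ElementTree. The `strip_rsid_attributes` helper is hypothetical and handles attributes only; it does not claim to mirror how this commit does it.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def strip_rsid_attributes(root: ET.Element) -> int:
    """Delete every attribute whose local name starts with 'rsid'; return the count."""
    removed = 0
    for elem in root.iter():
        for key in list(elem.attrib):
            # ElementTree stores namespaced names as '{uri}local'.
            local_name = key.rsplit("}", 1)[-1]
            if local_name.lower().startswith("rsid"):
                del elem.attrib[key]
                removed += 1
    return removed


document = ET.fromstring(
    '<w:p xmlns:w="%s" w:rsidR="00A1B2C3" w:rsidRDefault="00D4E5F6">'
    '<w:r w:rsidRPr="00112233"><w:t xml:space="preserve">hi</w:t></w:r></w:p>' % W_NS
)
removed_count = strip_rsid_attributes(document)
```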
jvoisin fbcf68c280 Lexicographical sort on xml attributes for office files
In XML, the order of the attributes shouldn't be meaningful,
however, MS Office sorts attributes for a given XML tag
differently than LibreOffice.
2018-09-24 17:45:09 +02:00
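Normalising attribute order so that output no longer depends on the producing office suite can be sketched as below. It relies on ElementTree keeping attributes in an insertion-ordered dict; `sort_attributes` is a hypothetical name, not the project's function.

```python
import xml.etree.ElementTree as ET


def sort_attributes(root: ET.Element) -> None:
    """Re-insert each element's attributes in lexicographic order of their names."""
    for elem in root.iter():
        ordered = sorted(elem.attrib.items())
        elem.attrib.clear()
        elem.attrib.update(ordered)


tree_root = ET.fromstring('<a z="1" b="2"><c y="3" x="4"/></a>')
sort_attributes(tree_root)
root_order = list(tree_root.attrib)
child_order = list(tree_root[0].attrib)
```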
jvoisin a1a06d023e Insert archive members in lexicographic order 2018-09-18 22:44:21 +02:00
jvoisin 5cf94bd256 Bump coverage back to 100% 2018-09-12 14:54:54 +02:00
jvoisin de65f4f4d4 Improve the resilience of MAT2 wrt. corrupted PNG 2018-09-09 19:09:05 +02:00
jvoisin 9fe6f1023b Make pylint happy 2018-09-06 11:36:04 +02:00
jvoisin e3d817f57e Split office and archives 2018-09-06 11:34:14 +02:00
jvoisin 120b204988 Change a bit the previous commit 2018-09-06 11:13:11 +02:00
Daniel Kahn Gillmor f3cef319b9 Unknown Members: make policy use an Enum
Closes #60

Note: this changeset also ensures that clean.cleaned.docx is removed
after the pytest run is over.
2018-09-05 18:59:33 -04:00
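An Enum-based policy for unknown archive members, with the three behaviours named in this log ('abort', 'omit', 'keep'), might look like the sketch below. The class layout and the `select_members` helper are assumptions for illustration, not a copy of the project's code.

```python
import enum
from typing import Iterable, List, Set


class UnknownMemberPolicy(enum.Enum):
    ABORT = "abort"
    OMIT = "omit"
    KEEP = "keep"


def select_members(members: Iterable[str], known: Set[str],
                   policy: UnknownMemberPolicy) -> List[str]:
    """Return the members to write out, applying the policy to unknown ones."""
    selected = []
    for name in members:
        if name in known or policy is UnknownMemberPolicy.KEEP:
            selected.append(name)
        elif policy is UnknownMemberPolicy.ABORT:
            raise ValueError("unknown member: %s" % name)
        # OMIT: the unknown member is silently dropped.
    return selected


members = ["content.xml", "weird.bin"]
known = {"content.xml"}
kept = select_members(members, known, UnknownMemberPolicy("keep"))
omitted = select_members(members, known, UnknownMemberPolicy("omit"))
```

Using an Enum instead of bare strings means a misspelt policy such as `UnknownMemberPolicy("kep")` fails immediately with a `ValueError`.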
jvoisin 072ee1814d Remove defusedxml support and document why 2018-09-05 18:41:08 +02:00
jvoisin 46bb1b83ea Improve the previous commit 2018-09-05 17:26:09 +02:00
Daniel Kahn Gillmor 1d7e374e5b office: try all members, even when one fails
the end result will be the same -- an abort -- but the user will get
to see all the warnings for a particular file, instead of getting them
one at a time.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor 915dc634c4 document all unknown/unhandlable files even on abort
This makes it easy to get a list of all files that mat2 doesn't know
how to handle, without having to choose -u keep or -u omit.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor 4192a2daa3 office: create policy for what to do about unknown members
previously, encountering an unknown member meant that any parser of
this type would abort.

now, the user can set parser.unknown_member_policy to either 'omit' or
'keep' if they don't want the current action of 'abort'

note that this causes pylint to complain about branching depth for
remove_all() because of the nuanced error-handling.  I've disabled
this check.
2018-09-04 16:13:33 -04:00
jvoisin 907fc591cc Bump the coverage back to 100% 2018-09-01 16:58:34 +02:00
Daniel Kahn Gillmor 3e2890eb9e three minor spelling fixes 2018-09-01 06:47:22 -07:00
jvoisin 91e80527fc Add archlinux to the CI 2018-09-01 15:41:22 +02:00
jvoisin 7877ba0da5 Fix a minor formatting issue 2018-09-01 14:16:55 +02:00
dkg e2634f7a50 Logging cleanup 2018-09-01 05:14:32 -07:00
jvoisin 1c72448e58 Improve the detection of unsupported extensions in uppercase 2018-08-23 21:28:37 +02:00
Antoine Tenart f068621628 libmat2: images: fix handling of .JPG files
Pixbuf only supports .jpeg files, not .jpg, so libmat2 looks for such an
extension and converts it if necessary. As this check is case sensitive,
processing .JPG files does not work.

Fixes #47.

Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2018-08-23 20:43:27 +02:00
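The case-sensitivity problem described above comes down to comparing the extension without lowering it first. A minimal sketch, with a hypothetical `normalized_image_name` helper:

```python
import os


def normalized_image_name(path: str) -> str:
    """Rewrite a '.jpg' extension, in any letter case, to '.jpeg'."""
    root, ext = os.path.splitext(path)
    if ext.lower() == ".jpg":  # comparing `ext` directly would miss '.JPG'
        return root + ".jpeg"
    return path


upper_case = normalized_image_name("holiday.JPG")
untouched = normalized_image_name("diagram.png")
```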
georg 71b1ced842 AbstractParser: Fix typos 2018-07-21 00:46:48 +00:00
jvoisin 942859601d Improve the code's documentation 2018-07-19 23:10:27 +02:00
jvoisin 565cb66d14 Minor simplification in how we're handling xml for office files 2018-07-19 22:55:08 +02:00
jvoisin 84d50f97c0 Add a check for a missed dependency in `./mat2 -c` 2018-07-15 17:00:01 +02:00
jvoisin 5a7c7f35f7 Remove `print` from libmat, and use the `logging` module instead
This should close #28
2018-07-10 21:30:38 +02:00
jvoisin d5861e4653 Implement a check for dependencies in mat2
Example use:

```
$ mat2 -c
Dependencies required for MAT2 0.1.3:
- Cairo: yes
- Exiftool: yes
- GdkPixbuf from PyGobject: yes
- Mutagen: yes
- Poppler from PyGobject: yes
- PyGobject: yes
```

This should close #35
2018-07-10 21:24:26 +02:00
jvoisin 080d6769ca Make pylint even happier 2018-07-09 01:11:44 +02:00
jvoisin 8c21006e6c Fix some pep8 issues spotted by pyflakes 2018-07-08 22:40:36 +02:00
jvoisin f49aa5cab7 Achieve 100% coverage! 2018-07-08 22:27:37 +02:00
jvoisin ad3e7ccee8 Bump coverage for office files and fix some related crashes 2018-07-08 21:35:45 +02:00
jvoisin ca01484126 Silence a mypy's stupid warning 2018-07-08 17:12:17 +02:00
jvoisin f9bc022c96 Add defusedxml as an (optional) way to prevent XML-based attacks
Those attacks are DoS-only.
2018-07-08 17:07:26 +02:00
jvoisin 72e1fda18d Remove a leftover print 2018-07-08 15:19:18 +02:00
jvoisin 3cd4f9111f Bump coverage for torrent handling 2018-07-08 15:13:03 +02:00
jvoisin b5fcddd6a6 Simplify how torrent files are handled
- Rework the testsuite wrt. torrent
- fail at parser's instantiation on corrupted torrent,
  instead of during `get_meta` or `remove_all` call
2018-07-08 13:49:11 +02:00
jvoisin 7ea362d908 Bump the coverage for pdf 2018-07-07 18:12:33 +02:00
jvoisin 85455a4419 Fix a mistake in office file revisions handling 2018-07-07 18:05:54 +02:00
jvoisin 3d80f97524 Simplify BMP handling 2018-07-06 00:49:17 +02:00
jvoisin 53271495f7 Add support for .txt files 2018-07-06 00:42:09 +02:00
jvoisin 893f58554a Improve a bit the formatting of the code thanks to pyflakes3 2018-07-02 00:22:05 +02:00
jvoisin bee56a57ce Remove docx revisions 2018-07-01 23:16:14 +02:00
jvoisin 02f7605ac1 MAT2 is now cleaning revisions from odt files! 2018-07-01 21:09:20 +02:00
jvoisin 80fc4ffb40 Remove the thumbnails from libreoffice files 2018-07-01 17:29:05 +02:00
jvoisin 177184ac67 Massively simplify how we're cleaning office files 2018-06-27 21:48:46 +02:00
jvoisin f44769df41 Ensure Poppler's minimal version
We're using methods that aren't available in Poppler
below 0.46, so we're checking for this upon import.

This commit is based on ideas from @LogicalDash ♥
2018-06-24 22:40:57 +02:00
jvoisin 74f2d50433 Split the testsuite a bit and add more tests 2018-06-22 21:16:55 +02:00
jvoisin b4ef0c9622 Improve reliability against corrupted image files 2018-06-22 20:38:29 +02:00
jvoisin 5b38bd7ccd Improve the reliability of the office parser 2018-06-21 23:18:59 +02:00
jvoisin 846a261465 Fix some linter warnings 2018-06-21 23:07:21 +02:00
jvoisin 09e748fa4c Refactor how office files are handled
- xml files are no longer considered harmless
- Factorization of the `remove_all` method for office files
- Explicit whitelists are used
- Blacklists are used to skip files completely
  - Non-blacklisted files are _still cleaned_
  - Unsupported files are still triggering an error
2018-06-21 23:02:41 +02:00
jvoisin a89dae054a Minor simplification of the office-related code 2018-06-21 21:24:53 +02:00
Antoine Tenart cce5de82e5 libmat2: harmless: add the text/xml mime type
Fedora defines the 'text/xml' mime type for xml files. Adds this mime
type to the harmless parser.

Fixes #36.

Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2018-06-12 21:34:47 +02:00
Antoine Tenart 484e26dd9c libmat2: audio: add the audio/x-flac mime type
The FLAC parser looks for the 'audio/flac' mime type, but Fedora
defines 'audio/x-flac' in /etc/mime.types for FLAC files. Add this mime
type to the audio parser.

Fixes #36.

Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2018-06-12 21:34:47 +02:00
jvoisin 545887af98 Minor code simplification 2018-06-10 20:20:32 +02:00
jvoisin 7dad77a785 Make the parsing of office format's metadata more robust 2018-06-10 20:20:00 +02:00
jvoisin 8c7979aae3 Add some tests for non-supported embedded fileformats 2018-06-10 20:19:35 +02:00
jvoisin 87bdcd1a95 Improve a bit our coverage wrt. torrent files handling 2018-06-10 00:56:55 +02:00
jvoisin 3c56fa3237 Improve a bit the performances wrt. image's metadata display 2018-06-10 00:43:38 +02:00
jvoisin 9c7aa34f50 Bump a bit the coverage 2018-06-10 00:43:25 +02:00
jvoisin e81ce6cd1a Fix and add a test for explicitly non-supported formats 2018-06-10 00:28:43 +02:00
jvoisin 633654376a Improve a bit parsers autoloading 2018-06-10 00:28:26 +02:00
jvoisin aa42b905d5 Speed up a bit the processing of get_meta for images with a "regular" name 2018-06-08 23:30:12 +02:00
jvoisin e86e8e3c23 Improve the code to handle problematic filenames 2018-06-08 17:34:53 +02:00
jvoisin 6a832a4104 Prevent exiftool-based parameter-injection 2018-06-06 23:50:25 +02:00
jvoisin 8368de7fa7 Sort the output of `mat2 -l` 2018-06-04 23:32:13 +02:00
jvoisin 6a1b0b31f0 Add more typing and use mypy in the CI 2018-06-04 23:20:30 +02:00
jvoisin 4ebf9754f8 Import the dynamic import system
The dynamic import should now work when MAT2 is
installed system-wide, either via the distribution's
packaging system, or via pip.
2018-06-04 20:53:21 +02:00
jvoisin d1392de6f5 Make pyflakes happier 2018-06-04 20:43:28 +02:00
totallylegit 183667a7f9 Improve a bit the typing, again 2018-06-04 20:39:27 +02:00
totallylegit 8143b63ee3 Improve a return type annotation 2018-06-04 20:29:41 +02:00
jvoisin 38fae60b8b Rename some files to simplify packaging
- the `src` folder is now `libmat2`
- the `main.py` script is now `mat2.py`
2018-05-18 23:52:40 +02:00
jvoisin 12e2330ca6 Remove some useless files 2018-03-19 00:04:00 +01:00
jvoisin df3c27d79d Improve the testsuite 2018-03-18 21:42:12 +01:00