jvoisin
97abafdc58
Minor code cleanup
2019-05-09 09:41:05 +02:00
fuzzy
7e031c9757
typo
2019-05-03 02:39:15 -07:00
jvoisin
9516990693
Add some verification for "dangerous" tarfiles
2019-05-01 17:55:35 +02:00
jvoisin
a7ebb587e1
Handle weird permissions in tar archives
2019-04-27 22:48:40 +02:00
jvoisin
14a4cddb8b
Improve the display of tarfile's members mtime
2019-04-27 21:15:06 +02:00
jvoisin
8e41b098d6
Add support for compressed tar files
2019-04-27 06:03:09 -07:00
jvoisin
82cc822a1d
Add tar archive support
2019-04-27 04:05:36 -07:00
jvoisin
05f429b197
Add support for xhtml files
2019-04-14 20:36:33 +02:00
jvoisin
1e325c5b5b
Please mypy
...
Apparently, mypy isn't able (yet?) to deal
with variables that are changing their types
at runtime.
Python is wonderful.
2019-03-30 10:33:16 +01:00
Antoine Tenart
d454ef5b8e
libmat2: fix dependency checks for cmd line utilities
...
The command line checks for command line utilities are done by trying to
access the executables and by throwing an exception when not found. This
led to:
- The mat2 cmd line --check-dependencies option failing.
- The ffmpeg unit tests failing when ffmpeg isn't installed (even though
it's an optional dependency).
This patch fixes it.
Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2019-03-29 19:29:28 +01:00
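The behaviour described in the commit above can be sketched as follows: report whether each executable is available instead of raising when one is missing. This is only an illustration. `CMD_DEPENDENCIES` and `check_cmd_dependencies` are hypothetical names, not libmat2's actual API, and `shutil.which` is used here as a stand-in for whatever lookup the project performs.

```python
import shutil
from typing import Dict

# Hypothetical mapping: human-readable name -> executable to look up.
CMD_DEPENDENCIES: Dict[str, str] = {
    "Exiftool": "exiftool",
    "Ffmpeg": "ffmpeg",
}


def check_cmd_dependencies(deps: Dict[str, str]) -> Dict[str, bool]:
    """Report which executables are found on PATH, without raising."""
    return {name: shutil.which(exe) is not None for name, exe in deps.items()}


if __name__ == "__main__":
    # Print one line per dependency, sorted by name.
    for name, found in sorted(check_cmd_dependencies(CMD_DEPENDENCIES).items()):
        print("- %s: %s" % (name, "yes" if found else "no"))
```

Returning a plain mapping lets an optional dependency such as ffmpeg be reported as missing without aborting the whole check.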
Antoine Tenart
c824a68dd8
libmat2: reshape the dependencies list
...
Invert the keys and values in DEPENDENCIES. It seems more natural to use
the key as a key in check_dependencies(), and the value as the value.
This also helps prepare for reworking the check_dependencies()
helper.
Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2019-03-29 19:29:28 +01:00
jvoisin
b8c92fec09
Fix the testsuite
2019-03-23 00:41:23 +01:00
Antoine Tenart
0e3c2c9b1b
libmat2: audio: not all id3 types have a text attribute
...
Not all id3 types have a text attribute (such as mutagen.id3.APIC or
mutagen.id3.UFID). This causes the get_meta helper to crash when
trying to access the text attribute of an object which does not have it.
Fix it by checking that the text attribute is available before accessing
it.
Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2019-03-23 00:32:44 +01:00
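A minimal sketch of the guard described in the commit above, assuming a plain `hasattr` check. `TextFrame`, `BinaryFrame` and `get_text_meta` are hypothetical stand-ins written for this example; they are not mutagen classes and not the actual libmat2 code.

```python
from typing import Dict, List


class TextFrame:
    """Stand-in for an ID3 frame that carries a `text` attribute."""

    def __init__(self, text: List[str]) -> None:
        self.text = text


class BinaryFrame:
    """Stand-in for a frame such as APIC or UFID, which has no `text`."""

    def __init__(self, data: bytes) -> None:
        self.data = data


def get_text_meta(tags: Dict[str, object]) -> Dict[str, str]:
    """Collect only the frames that expose a `text` attribute."""
    meta: Dict[str, str] = {}
    for key, frame in tags.items():
        if not hasattr(frame, "text"):
            continue  # skip frames without text instead of crashing
        meta[key] = ", ".join(str(item) for item in frame.text)
    return meta
```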
Brolf
5ac91cd4f9
Refactor {black,white}list into {block,allow}list
...
Closes #96
2019-03-05 23:13:42 +00:00
georg
c3f097a82b
fix typo
2019-03-01 22:00:23 +00:00
jvoisin
55214206b5
Improve the previous commit
...
- More tests
- More documentation
- Minor code cleanup
2019-02-27 23:53:07 +01:00
jvoisin
73d2966e8c
Improve epub support
2019-02-27 23:04:38 +01:00
jvoisin
eb2e702f37
Document the previous commit
2019-02-25 15:37:44 +01:00
jvoisin
545dccc352
In archive-based formats, the mimetype file comes first
...
This should improve epub compatibility,
along with other formats as a side-effect
2019-02-24 23:32:32 +01:00
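The ordering rule from the commit above can be sketched with a sort key. This assumes the remaining members are ordered lexicographically, as an earlier commit in this log describes; `member_sort_key` and `order_members` are hypothetical names used only for illustration.

```python
from typing import List, Tuple


def member_sort_key(name: str) -> Tuple[int, str]:
    """Sort key placing 'mimetype' first, the rest lexicographically."""
    return (0 if name == "mimetype" else 1, name)


def order_members(names: List[str]) -> List[str]:
    """Return archive member names in the order they would be written."""
    return sorted(names, key=member_sort_key)
```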
jvoisin
524bae5972
<title> is also an html metadata
2019-02-23 20:47:26 +01:00
jvoisin
c757a9b7ef
Fix a bug in css cleaning
...
It's not mandatory to actually have a comment inside
comment delimiter, like `/**/`.
2019-02-23 20:21:11 +01:00
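As an illustration of the case mentioned in the commit above, a non-greedy pattern that allows zero characters between the delimiters also matches an empty comment such as `/**/`. This is a sketch under that assumption, not the project's actual regular expression, and it does not account for comment-like sequences inside CSS strings.

```python
import re

# `.*?` may match zero characters, so the empty comment `/**/` is covered.
# re.DOTALL lets a comment span several lines.
CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_css_comments(css: str) -> str:
    """Remove every /* ... */ comment, including empty ones."""
    return CSS_COMMENT.sub("", css)
```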
jvoisin
02ff21b158
Implement epub support
2019-02-20 16:28:11 -08:00
jvoisin
a81b7658a8
Make the mandatory metadata warning generic
...
This should close #95.
2019-02-10 21:46:13 +01:00
jvoisin
6e63e03b86
Streamline a bit the previous commit
2019-02-09 15:23:16 +01:00
Poncho
a71488d459
bind mount /etc/ld.so.cache to the sandbox
...
without /etc/ld.so.cache available in the sandbox, tests fail on gentoo with:
/usr/bin/ffmpeg: error while loading shared libraries: libstdc++.so.6:
cannot open shared object file: No such file or directory
2019-02-09 09:49:51 +01:00
jvoisin
6ef6aaa222
Improve a bit get_meta for libreoffice files
2019-02-08 23:23:56 +01:00
jvoisin
6cc034e81b
Add support for html files
2019-02-08 23:05:18 +01:00
jvoisin
e1dd439fc8
Make use of the archive refactoring for office documents too
2019-02-07 22:19:37 +01:00
jvoisin
b9a62d798a
Refactor a bit office get_meta handling
...
This should make it easier to get more metadata from
archive-based file formats.
2019-02-04 00:31:26 +01:00
jvoisin
433609f8ea
Implement .gif support
2019-02-03 21:01:58 +01:00
intrigeri
e8c1bb0e3c
Whenever possible, use bwrap for subprocesses
...
This should close #90
2019-02-03 19:18:41 +01:00
jvoisin
8e84ba547a
Add support for wmv
2019-02-02 19:19:36 +01:00
jvoisin
04bb8c8ccf
Add mp4 support
2018-10-28 07:41:04 -07:00
jvoisin
3a070b0ab7
Add support for zip files
2018-10-25 11:56:46 +02:00
jvoisin
283e5e5787
Improve archive-based parser's robustness against corrupted embedded files
2018-10-25 11:56:12 +02:00
jvoisin
513d897ea0
Implement get_meta() for archives
2018-10-25 11:29:50 +02:00
jvoisin
5a9dc388ad
Minor refactorisation of how we're checking for exiftool's presence
2018-10-25 11:05:06 +02:00
jvoisin
fe885babee
Implement lightweight cleaning for jpg
2018-10-24 19:35:07 +02:00
jvoisin
9a81b3adfd
Improve type annotation coverage
2018-10-23 16:32:28 +02:00
jvoisin
f1a071d460
Implement lightweight cleaning for png and tiff
2018-10-23 16:22:11 +02:00
jvoisin
38df679a88
Optimize the handling of problematic files
2018-10-23 13:49:58 +02:00
jvoisin
44f267a596
Improve problematic filenames support
2018-10-22 16:56:05 +02:00
jvoisin
83389a63e9
Test mat2's reliability wrt. corrupted video files
2018-10-22 13:42:04 +02:00
jvoisin
e70ea811c9
Implement support for .avi files, via ffmpeg
...
- This commit introduces optional dependencies (namely ffmpeg):
mat2 will spit a warning when trying to process an .avi file
if ffmpeg isn't installed.
- Since metadata are obtained via exiftool, this commit
also refactors our exiftool wrapper a bit.
2018-10-22 12:58:01 +02:00
jvoisin
2ba38dd2a1
Bump mypy typing coverage
2018-10-12 14:32:09 +02:00
jvoisin
b832a59414
Refactor lightweight mode implementation
2018-10-12 11:49:24 +02:00
jvoisin
b9dbd12ef9
Implement recursive metadata for FLAC files
...
Since FLAC files can contain covers, it makes sense
to parse their metadata
2018-10-11 19:52:47 +02:00
jvoisin
b2e153b69c
Delete pictures of FLAC files
2018-10-11 18:15:11 +02:00
jvoisin
0d25b18d26
Improve both the typing and the comments
2018-10-05 17:07:58 +02:00
jvoisin
d0f3534eff
Hide unsupported extensions in mat2 -l
2018-10-05 12:43:21 +02:00
jvoisin
8e98593b02
Trash word/people.xml in office files
2018-10-04 16:28:20 +02:00
georg
34fbd633fd
libmat2: fix shebang
...
Relates 0a2a398c9c
2018-10-03 18:38:28 +00:00
jvoisin
5a5c642a46
Don't break office files for MS Office
...
We didn't take the whitelist into account while
removing dangling files from [Content_Types].xml
2018-10-03 16:38:05 +02:00
jvoisin
1b356b8c6f
Improve mat2's cli reliability
...
- Replace some class members by instance members
- Don't thread the cleaning process anymore for now
2018-10-03 15:22:36 +02:00
jvoisin
c67bbafb2c
Use [Content_Types].xml to improve MS Office coverage
2018-10-02 11:55:42 -07:00
georg
5b606f939d
fix typo
2018-10-02 16:01:24 +00:00
jvoisin
652b8e519f
Files processed via MAT2 are now accepted without warnings by MS Office
2018-10-01 12:25:37 -07:00
jvoisin
81a3881aa4
Please mypy
2018-09-30 19:55:17 +02:00
jvoisin
e342671ead
Remove dangling references in MS Office's [Content_Types].xml
2018-09-30 19:53:18 +02:00
jvoisin
719cdf20fa
Second pass of minor formatting
2018-09-24 20:15:07 +02:00
jvoisin
2e243355f5
Fix some minor formatting issues
2018-09-24 19:50:24 +02:00
jvoisin
174d4a0ac0
Implement rsid stripping for office files
...
MS Office XML rsid is a "unique identifier used to track the editing session
when the physical character representing this section mark was last formatted."
See the following links for details:
- https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
- https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/
2018-09-24 18:03:59 +02:00
jvoisin
fbcf68c280
Lexicographical sort on xml attributes for office files
...
In XML, the order of the attributes shouldn't be meaningful,
however, MS Office sorts attributes for a given XML tag
differently than LibreOffice.
2018-09-24 17:45:09 +02:00
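One way to picture the normalisation described in the commit above is to rewrite each element's attributes in sorted order. The sketch below uses the standard library's `xml.etree.ElementTree`; `sort_attributes` is a hypothetical helper, not the libmat2 implementation.

```python
import xml.etree.ElementTree as ET


def sort_attributes(root: ET.Element) -> None:
    """Reorder the attributes of every element lexicographically, in place."""
    for elem in root.iter():
        items = sorted(elem.attrib.items())
        elem.attrib.clear()
        elem.attrib.update(items)  # dicts keep insertion order
```

After this pass, two documents that differ only in attribute order carry their attributes in the same order, which removes one way of telling which office suite produced the file.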
jvoisin
a1a06d023e
Insert archive members in lexicographic order
2018-09-18 22:44:21 +02:00
jvoisin
5cf94bd256
Bump coverage back to 100%
2018-09-12 14:54:54 +02:00
jvoisin
de65f4f4d4
Improve the resilience of MAT2 wrt. corrupted PNG
2018-09-09 19:09:05 +02:00
jvoisin
9fe6f1023b
Make pylint happy
2018-09-06 11:36:04 +02:00
jvoisin
e3d817f57e
Split office and archives
2018-09-06 11:34:14 +02:00
jvoisin
120b204988
Change a bit the previous commit
2018-09-06 11:13:11 +02:00
Daniel Kahn Gillmor
f3cef319b9
Unknown Members: make policy use an Enum
...
Closes #60
Note: this changeset also ensures that clean.cleaned.docx is cleaned
up after the pytest run is over.
2018-09-05 18:59:33 -04:00
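A sketch of what an Enum-based policy could look like. The three values are the ones named in the neighbouring commit messages ('abort', 'omit', 'keep'); the class name `UnknownMemberPolicy` and the helper `parse_policy` are assumptions made for this example.

```python
import enum


class UnknownMemberPolicy(enum.Enum):
    """What to do with an archive member that no parser can handle."""

    ABORT = "abort"
    OMIT = "omit"
    KEEP = "keep"


def parse_policy(value: str) -> UnknownMemberPolicy:
    """Convert a user-supplied string; raises ValueError if it is unknown."""
    return UnknownMemberPolicy(value)
```

Compared with bare strings, an Enum rejects a misspelled policy at the point where it is set rather than later, during cleaning.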
jvoisin
072ee1814d
Remove defusedxml support and document why
2018-09-05 18:41:08 +02:00
jvoisin
46bb1b83ea
Improve the previous commit
2018-09-05 17:26:09 +02:00
Daniel Kahn Gillmor
1d7e374e5b
office: try all members, even when one fails
...
the end result will be the same -- an abort -- but the user will get
to see all the warnings for a particular file, instead of getting them
one at a time.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor
915dc634c4
document all unknown/unhandlable files even on abort
...
This makes it easy to get a list of all files that mat2 doesn't know
how to handle, without having to choose -u keep or -u omit.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor
4192a2daa3
office: create policy for what to do about unknown members
...
Previously, encountering an unknown member meant that any parser of
this type would abort.
Now, the user can set parser.unknown_member_policy to either 'omit' or
'keep' if they don't want the current action of 'abort'.
Note that this causes pylint to complain about branching depth for
remove_all() because of the nuanced error-handling. I've disabled
this check.
2018-09-04 16:13:33 -04:00
jvoisin
907fc591cc
Bump the coverage back to 100%
2018-09-01 16:58:34 +02:00
Daniel Kahn Gillmor
3e2890eb9e
three minor spelling fixes
2018-09-01 06:47:22 -07:00
jvoisin
91e80527fc
Add archlinux to the CI
2018-09-01 15:41:22 +02:00
jvoisin
7877ba0da5
Fix a minor formatting issue
2018-09-01 14:16:55 +02:00
dkg
e2634f7a50
Logging cleanup
2018-09-01 05:14:32 -07:00
jvoisin
1c72448e58
Improve the detection of unsupported extensions in uppercase
2018-08-23 21:28:37 +02:00
Antoine Tenart
f068621628
libmat2: images: fix handling of .JPG files
...
Pixbuf only supports .jpeg files, not .jpg, so libmat2 looks for such an
extension and converts it if necessary. As this check is case sensitive,
processing .JPG files does not work.
Fixes #47.
Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2018-08-23 20:43:27 +02:00
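The case-insensitive check described in the commit above can be sketched as follows. `normalize_jpg_extension` is a hypothetical helper for illustration and does not reproduce the libmat2 code.

```python
import os


def normalize_jpg_extension(path: str) -> str:
    """Return `path` with a .jpg extension (in any case) rewritten to .jpeg."""
    root, ext = os.path.splitext(path)
    if ext.lower() == ".jpg":  # compare case-insensitively
        return root + ".jpeg"
    return path
```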
georg
71b1ced842
AbstractParser: Fix typos
2018-07-21 00:46:48 +00:00
jvoisin
942859601d
Improve the code's documentation
2018-07-19 23:10:27 +02:00
jvoisin
565cb66d14
Minor simplification in how we're handling xml for office files
2018-07-19 22:55:08 +02:00
jvoisin
84d50f97c0
Add a check for a missed dependency in ./mat2 -c
2018-07-15 17:00:01 +02:00
jvoisin
5a7c7f35f7
Remove print from libmat, and use the logging module instead
...
This should close #28
2018-07-10 21:30:38 +02:00
jvoisin
d5861e4653
Implement a check for dependencies in mat2
...
Example use:
```
$ mat2 -c
Dependencies required for MAT2 0.1.3:
- Cairo: yes
- Exiftool: yes
- GdkPixbuf from PyGobject: yes
- Mutagen: yes
- Poppler from PyGobject: yes
- PyGobject: yes
```
This should close #35
2018-07-10 21:24:26 +02:00
jvoisin
080d6769ca
Make pylint even happier
2018-07-09 01:11:44 +02:00
jvoisin
8c21006e6c
Fix some pep8 issues spotted by pyflakes
2018-07-08 22:40:36 +02:00
jvoisin
f49aa5cab7
Achieve 100% coverage!
2018-07-08 22:27:37 +02:00
jvoisin
ad3e7ccee8
Bump coverage for office files and fix some related crashes
2018-07-08 21:35:45 +02:00
jvoisin
ca01484126
Silence a stupid mypy warning
2018-07-08 17:12:17 +02:00
jvoisin
f9bc022c96
Add defusedxml as an (optional) way to prevent XML-based attacks
...
Those attacks are DoS-only.
2018-07-08 17:07:26 +02:00
jvoisin
72e1fda18d
Remove a leftover print
2018-07-08 15:19:18 +02:00
jvoisin
3cd4f9111f
Bump coverage for torrent handling
2018-07-08 15:13:03 +02:00
jvoisin
b5fcddd6a6
Simplify how torrent files are handled
...
- Rework the testsuite wrt. torrent
- fail at parser's instantiation on corrupted torrent,
instead of during `get_meta` or `remove_all` call
2018-07-08 13:49:11 +02:00
jvoisin
7ea362d908
Bump the coverage for pdf
2018-07-07 18:12:33 +02:00
jvoisin
85455a4419
Fix a mistake in office file revisions handling
2018-07-07 18:05:54 +02:00
jvoisin
3d80f97524
Simplify BMP handling
2018-07-06 00:49:17 +02:00
jvoisin
53271495f7
Add support for .txt files
2018-07-06 00:42:09 +02:00
jvoisin
893f58554a
Improve a bit the formatting of the code thanks to pyflakes3
2018-07-02 00:22:05 +02:00
jvoisin
bee56a57ce
Remove docx revisions
2018-07-01 23:16:14 +02:00
jvoisin
02f7605ac1
MAT2 is now cleaning revisions from odt files!
2018-07-01 21:09:20 +02:00
jvoisin
80fc4ffb40
Remove the thumbnails from libreoffice files
2018-07-01 17:29:05 +02:00
jvoisin
177184ac67
Massively simplify how we're cleaning office files
2018-06-27 21:48:46 +02:00
jvoisin
f44769df41
Ensure Poppler's minimal version
...
We're using methods that aren't available in Poppler
below 0.46, so we're checking for this upon import.
This commit is based on ideas from @LogicalDash ♥
2018-06-24 22:40:57 +02:00
jvoisin
74f2d50433
Split the testsuite a bit and add more tests
2018-06-22 21:16:55 +02:00
jvoisin
b4ef0c9622
Improve reliability against corrupted image files
2018-06-22 20:38:29 +02:00
jvoisin
5b38bd7ccd
Improve the reliability of the office parser
2018-06-21 23:18:59 +02:00
jvoisin
846a261465
Fix some linter warnings
2018-06-21 23:07:21 +02:00
jvoisin
09e748fa4c
Refactor how office files are handled
...
- xml files are no longer considered harmless
- Factorization of the `remove_all` method for office files
- Explicit whitelists are used
- Blacklists are used to skip files completely
- Non-blacklisted files are _still cleaned_
- Unsupported files are still triggering an error
2018-06-21 23:02:41 +02:00
jvoisin
a89dae054a
Minor simplification of the office-related code
2018-06-21 21:24:53 +02:00
Antoine Tenart
cce5de82e5
libmat2: harmless: add the text/xml mime type
...
Fedora defines the 'text/xml' mime type for xml files. Adds this mime
type to the harmless parser.
Fixes #36.
Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2018-06-12 21:34:47 +02:00
Antoine Tenart
484e26dd9c
libmat2: audio: add the audio/x-flac mime type
...
The FLAC parser looks for the 'audio/flac' mime type, but Fedora
defines 'audio/x-flac' in /etc/mime.types for FLAC files. Add this mime
type to the audio parser.
Fixes #36.
Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2018-06-12 21:34:47 +02:00
jvoisin
545887af98
Minor code simplification
2018-06-10 20:20:32 +02:00
jvoisin
7dad77a785
Make the parsing of office format's metadata more robust
2018-06-10 20:20:00 +02:00
jvoisin
8c7979aae3
Add some tests for non-supported embedded fileformats
2018-06-10 20:19:35 +02:00
jvoisin
87bdcd1a95
Improve a bit our coverage wrt. torrent files handling
2018-06-10 00:56:55 +02:00
jvoisin
3c56fa3237
Improve a bit the performances wrt. image's metadata display
2018-06-10 00:43:38 +02:00
jvoisin
9c7aa34f50
Bump a bit the coverage
2018-06-10 00:43:25 +02:00
jvoisin
e81ce6cd1a
Fix and add a test for explicitly non-supported formats
2018-06-10 00:28:43 +02:00
jvoisin
633654376a
Improve a bit parsers autoloading
2018-06-10 00:28:26 +02:00
jvoisin
aa42b905d5
Speed up a bit the processing of get_meta for images with a "regular" name
2018-06-08 23:30:12 +02:00
jvoisin
e86e8e3c23
Improve the code to handle problematic filenames
2018-06-08 17:34:53 +02:00
jvoisin
6a832a4104
Prevent exiftool-based parameter-injection
2018-06-06 23:50:25 +02:00
jvoisin
8368de7fa7
Sort the output of mat2 -l
2018-06-04 23:32:13 +02:00
jvoisin
6a1b0b31f0
Add more typing and use mypy in the CI
2018-06-04 23:20:30 +02:00
jvoisin
4ebf9754f8
Import the dynamic import system
...
The dynamic import should now work when MAT2 is
installed system-wide, either via the distribution's
packaging system, or via pip.
2018-06-04 20:53:21 +02:00
jvoisin
d1392de6f5
Make pyflakes happier
2018-06-04 20:43:28 +02:00
totallylegit
183667a7f9
Improve a bit the typing, again
2018-06-04 20:39:27 +02:00
totallylegit
8143b63ee3
Improve a return type annotation
2018-06-04 20:29:41 +02:00
jvoisin
38fae60b8b
Rename some files to simplify packaging
...
- the `src` folder is now `libmat2`
- the `main.py` script is now `mat2.py`
2018-05-18 23:52:40 +02:00
jvoisin
12e2330ca6
Remove some useless files
2018-03-19 00:04:00 +01:00
jvoisin
df3c27d79d
Improve the testsuite
2018-03-18 21:42:12 +01:00