jvoisin
283e5e5787
Improve archive-based parser's robustness against corrupted embedded files
2018-10-25 11:56:12 +02:00
jvoisin
513d897ea0
Implement get_meta() for archives
2018-10-25 11:29:50 +02:00
jvoisin
5a9dc388ad
Minor refactorisation of how we're checking for exiftool's presence
2018-10-25 11:05:06 +02:00
jvoisin
fe885babee
Implement lightweight cleaning for jpg
2018-10-24 19:35:07 +02:00
jvoisin
9a81b3adfd
Improve type annotation coverage
2018-10-23 16:32:28 +02:00
jvoisin
f1a071d460
Implement lightweight cleaning for png and tiff
2018-10-23 16:22:11 +02:00
jvoisin
38df679a88
Optimize the handling of problematic files
2018-10-23 13:49:58 +02:00
jvoisin
44f267a596
Improve problematic filenames support
2018-10-22 16:56:05 +02:00
jvoisin
83389a63e9
Test mat2's reliability wrt. corrupted video files
2018-10-22 13:42:04 +02:00
jvoisin
e70ea811c9
Implement support for .avi files, via ffmpeg
...
- This commit introduces optional dependencies (namely ffmpeg):
mat2 will spit a warning when trying to process an .avi file
if ffmpeg isn't installed.
- Since metadata are obtained via exiftool, this commit
also refactors our exiftool wrapper a bit.
2018-10-22 12:58:01 +02:00
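A minimal sketch of the optional-dependency check described in the commit above, assuming the tool is looked up on PATH. The function name `find_optional_tool` and the warning text are hypothetical and are not taken from mat2's code.

```python
import logging
import shutil
from typing import Optional


def find_optional_tool(name: str = 'ffmpeg') -> Optional[str]:
    """Return the full path of an optional external tool, or None.

    Hypothetical helper: a missing tool only triggers a warning,
    so formats that do not need it keep working.
    """
    path = shutil.which(name)
    if path is None:
        logging.warning("%s is not installed, files that need it can't be processed", name)
    return path
```

A caller would check the return value before handing an .avi file to the external tool, and skip the file when it is None.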
jvoisin
2ba38dd2a1
Bump mypy typing coverage
2018-10-12 14:32:09 +02:00
jvoisin
b832a59414
Refactor lightweight mode implementation
2018-10-12 11:49:24 +02:00
jvoisin
b9dbd12ef9
Implement recursive metadata for FLAC files
...
Since FLAC files can contain covers, it makes sense
to parse their metadata
2018-10-11 19:52:47 +02:00
jvoisin
b2e153b69c
Delete pictures of FLAC files
2018-10-11 18:15:11 +02:00
jvoisin
0d25b18d26
Improve both the typing and the comments
2018-10-05 17:07:58 +02:00
jvoisin
d0f3534eff
Hide unsupported extensions in mat2 -l
2018-10-05 12:43:21 +02:00
jvoisin
8e98593b02
Trash word/people.xml in office files
2018-10-04 16:28:20 +02:00
georg
34fbd633fd
libmat2: fix shebang
...
Relates 0a2a398c9c
2018-10-03 18:38:28 +00:00
jvoisin
5a5c642a46
Don't break office files for MS Office
...
We didn't take the whitelist into account while
removing dangling files from [Content_Types].xml
2018-10-03 16:38:05 +02:00
jvoisin
1b356b8c6f
Improve mat2's cli reliability
...
- Replace some class members by instance members
- Don't thread the cleaning process anymore for now
2018-10-03 15:22:36 +02:00
jvoisin
c67bbafb2c
Use [Content_Types].xml to improve MS Office coverage
2018-10-02 11:55:42 -07:00
georg
5b606f939d
fix typo
2018-10-02 16:01:24 +00:00
jvoisin
652b8e519f
Files processed via MAT2 are now accepted without warnings by MS Office
2018-10-01 12:25:37 -07:00
jvoisin
81a3881aa4
Please mypy
2018-09-30 19:55:17 +02:00
jvoisin
e342671ead
Remove dangling references in MS Office's [Content_Types].xml
2018-09-30 19:53:18 +02:00
jvoisin
719cdf20fa
Second pass of minor formatting
2018-09-24 20:15:07 +02:00
jvoisin
2e243355f5
Fix some minor formatting issues
2018-09-24 19:50:24 +02:00
jvoisin
174d4a0ac0
Implement rsid stripping for office files
...
MS Office XML rsid is a "unique identifier used to track the editing session
when the physical character representing this section mark was last formatted."
See the following links for details:
- https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
- https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/ .
2018-09-24 18:03:59 +02:00
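One way rsid stripping could look, sketched with the standard library's ElementTree. This assumes that dropping every attribute whose local name starts with `rsid` is enough. The function name is hypothetical, and the commit itself may handle more cases (for example dedicated rsid elements).

```python
import xml.etree.ElementTree as ET


def strip_rsid_attributes(root: ET.Element) -> ET.Element:
    """Remove every attribute whose local name starts with 'rsid'.

    Assumption: only attributes are handled in this sketch.
    """
    for element in root.iter():
        # Copy the keys first, since we delete while looping.
        for key in list(element.attrib):
            # ElementTree stores namespaced names as '{namespace}local'.
            local_name = key.rsplit('}', 1)[-1]
            if local_name.startswith('rsid'):
                del element.attrib[key]
    return root
```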
jvoisin
fbcf68c280
Lexicographical sort on xml attributes for office files
...
In XML, the order of the attributes shouldn't be meaningful,
however, MS Office sorts attributes for a given XML tag
differently than LibreOffice.
2018-09-24 17:45:09 +02:00
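A short sketch of lexicographic attribute sorting, again using ElementTree. The function name is hypothetical. The idea is only that rewriting each tag's attributes in sorted order yields the same serialization whichever office suite produced the file.

```python
import xml.etree.ElementTree as ET


def sort_attributes(root: ET.Element) -> ET.Element:
    """Re-insert the attributes of every element in sorted order."""
    for element in root.iter():
        sorted_items = sorted(element.attrib.items())
        element.attrib.clear()
        element.attrib.update(sorted_items)
    return root
```

Older Python versions already sort attributes when serializing. Newer ones keep insertion order, which is why the attributes are explicitly re-inserted here.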
jvoisin
a1a06d023e
Insert archive members in lexicographic order
2018-09-18 22:44:21 +02:00
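A sketch of writing archive members in lexicographic order with the standard `zipfile` module, assuming a zip-based container and working on in-memory bytes. The helper name `repack_sorted` is hypothetical. A real cleaner would also normalize per-member metadata such as timestamps, which this sketch leaves out.

```python
import io
import zipfile


def repack_sorted(archive_bytes: bytes) -> bytes:
    """Copy a zip archive, inserting its members sorted by name."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zin, \
            zipfile.ZipFile(output, 'w') as zout:
        for info in sorted(zin.infolist(), key=lambda i: i.filename):
            zout.writestr(info.filename, zin.read(info))
    return output.getvalue()
```

Sorting removes the original insertion order, which could otherwise hint at the tool that created the archive.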
jvoisin
5cf94bd256
Bump coverage back to 100%
2018-09-12 14:54:54 +02:00
jvoisin
de65f4f4d4
Improve the resilience of MAT2 wrt. corrupted PNG
2018-09-09 19:09:05 +02:00
jvoisin
9fe6f1023b
Make pylint happy
2018-09-06 11:36:04 +02:00
jvoisin
e3d817f57e
Split office and archives
2018-09-06 11:34:14 +02:00
jvoisin
120b204988
Change the previous commit a bit
2018-09-06 11:13:11 +02:00
Daniel Kahn Gillmor
f3cef319b9
Unknown Members: make policy use an Enum
...
Closes #60
Note: this changeset also ensures that clean.cleaned.docx is cleaned
up after the pytest run is over.
2018-09-05 18:59:33 -04:00
jvoisin
072ee1814d
Remove defusedxml support and document why
2018-09-05 18:41:08 +02:00
jvoisin
46bb1b83ea
Improve the previous commit
2018-09-05 17:26:09 +02:00
Daniel Kahn Gillmor
1d7e374e5b
office: try all members, even when one fails
...
the end result will be the same -- an abort -- but the user will get
to see all the warnings for a particular file, instead of getting them
one at a time.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor
915dc634c4
document all unknown/unhandlable files even on abort
...
This makes it easy to get a list of all files that mat2 doesn't know
how to handle, without having to choose -u keep or -u omit.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor
4192a2daa3
office: create policy for what to do about unknown members
...
previously, encountering an unknown member meant that any parser of
this type would abort.
now, the user can set parser.unknown_member_policy to either 'omit' or
'keep' if they don't want the current action of 'abort'
note that this causes pylint to complain about branching depth for
remove_all() because of the nuanced error-handling. I've disabled
this check.
2018-09-04 16:13:33 -04:00
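The policy described above could be sketched as follows. The three values 'abort', 'omit' and 'keep' come from the commit message. The enum, the function `select_members` and the choice of `ValueError` are hypothetical illustrations, not mat2's API. The sketch also inspects every member before aborting, so all warnings for a file are reported at once.

```python
import enum
import logging
from typing import Iterable, List, Set


class UnknownMemberPolicy(enum.Enum):
    ABORT = 'abort'
    OMIT = 'omit'
    KEEP = 'keep'


def select_members(names: Iterable[str], known: Set[str],
                   policy: UnknownMemberPolicy) -> List[str]:
    """Return the members to keep, or raise ValueError on abort."""
    kept = []
    must_abort = False
    for name in names:
        if name in known:
            kept.append(name)
        elif policy is UnknownMemberPolicy.KEEP:
            logging.warning("keeping unknown member %s", name)
            kept.append(name)
        elif policy is UnknownMemberPolicy.OMIT:
            logging.warning("omitting unknown member %s", name)
        else:
            # Keep going, so that every unknown member gets reported.
            logging.error("unknown member %s", name)
            must_abort = True
    if must_abort:
        raise ValueError("archive contains unknown members")
    return kept
```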
jvoisin
907fc591cc
Bump the coverage back to 100%
2018-09-01 16:58:34 +02:00
Daniel Kahn Gillmor
3e2890eb9e
three minor spelling fixes
2018-09-01 06:47:22 -07:00
jvoisin
91e80527fc
Add archlinux to the CI
2018-09-01 15:41:22 +02:00
jvoisin
7877ba0da5
Fix a minor formatting issue
2018-09-01 14:16:55 +02:00
dkg
e2634f7a50
Logging cleanup
2018-09-01 05:14:32 -07:00
jvoisin
1c72448e58
Improve the detection of unsupported extensions in uppercase
2018-08-23 21:28:37 +02:00
Antoine Tenart
f068621628
libmat2: images: fix handling of .JPG files
...
Pixbuf only supports .jpeg files, not .jpg, so libmat2 looks for such an
extension and converts it if necessary. As this check is case sensitive,
processing .JPG files does not work.
Fixes #47.
Signed-off-by: Antoine Tenart <antoine.tenart@ack.tf>
2018-08-23 20:43:27 +02:00
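A minimal sketch of the fix described above, assuming the extension is lowercased before the .jpg to .jpeg conversion. The function name is hypothetical.

```python
import os


def normalized_extension(filename: str) -> str:
    """Return the lowercased extension, mapping '.jpg' to '.jpeg'."""
    _, extension = os.path.splitext(filename)
    extension = extension.lower()
    if extension == '.jpg':
        return '.jpeg'
    return extension
```

With the comparison done on the lowercased value, `PHOTO.JPG` and `photo.jpg` take the same code path.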
georg
71b1ced842
AbstractParser: Fix typos
2018-07-21 00:46:48 +00:00
jvoisin
942859601d
Improve the code's documentation
2018-07-19 23:10:27 +02:00