1
0
mirror of synced 2024-11-25 18:54:22 +01:00
Commit Graph

116 Commits

Author SHA1 Message Date
jvoisin
524bae5972 <title> is also an html metadata 2019-02-23 20:47:26 +01:00
jvoisin
c757a9b7ef Fix a bug in css cleaning
It's not mandatory to actually have a comment inside
comment delimiter, like `/**/`.
2019-02-23 20:21:11 +01:00
jvoisin
02ff21b158 Implement epub support 2019-02-20 16:28:11 -08:00
jvoisin
a81b7658a8 Make the mandatory metadata warning generic
This should close #95.
2019-02-10 21:46:13 +01:00
jvoisin
6e63e03b86 Streamline a bit the previous commit 2019-02-09 15:23:16 +01:00
Poncho
a71488d459 bind mount /etc/ld.so.cache to the sandbox
without /etc/ld.so.cache available in the sandbox, tests fail on gentoo with:
/usr/bin/ffmpeg: error while loading shared libraries: libstdc++.so.6:
    cannot open shared object file: No such file or directory
2019-02-09 09:49:51 +01:00
jvoisin
6ef6aaa222 Improve a bit get_meta for libreoffice files 2019-02-08 23:23:56 +01:00
jvoisin
6cc034e81b Add support for html files 2019-02-08 23:05:18 +01:00
jvoisin
e1dd439fc8 Use of the archive refactoring for the office documents too 2019-02-07 22:19:37 +01:00
jvoisin
b9a62d798a Refactor a bit office get_meta handling
This should make easier to get more metadata from
archive-based file formats.
2019-02-04 00:31:26 +01:00
jvoisin
433609f8ea Implement .gif support 2019-02-03 21:01:58 +01:00
intrigeri
e8c1bb0e3c Whenever possible, use bwrap for subprocesses
This should closes  #90
2019-02-03 19:18:41 +01:00
jvoisin
8e84ba547a Add support for wmv 2019-02-02 19:19:36 +01:00
jvoisin
04bb8c8ccf Add mp4 support 2018-10-28 07:41:04 -07:00
jvoisin
3a070b0ab7 Add support for zip files 2018-10-25 11:56:46 +02:00
jvoisin
283e5e5787 Improve archive-based parser's robustness against corrupted embedded files 2018-10-25 11:56:12 +02:00
jvoisin
513d897ea0 Implement get_meta() for archives 2018-10-25 11:29:50 +02:00
jvoisin
5a9dc388ad Minor refactorisation of how we're checking for exiftool's presence 2018-10-25 11:05:06 +02:00
jvoisin
fe885babee Implement lightweight cleaning for jpg 2018-10-24 19:35:07 +02:00
jvoisin
9a81b3adfd Improve type annotation coverage 2018-10-23 16:32:28 +02:00
jvoisin
f1a071d460 Implement lightweight cleaning for png and tiff 2018-10-23 16:22:11 +02:00
jvoisin
38df679a88 Optimize the handling of problematic files 2018-10-23 13:49:58 +02:00
jvoisin
44f267a596 Improve problematic filenames support 2018-10-22 16:56:05 +02:00
jvoisin
83389a63e9 Test mat2's reliability wrt. corrupted video files 2018-10-22 13:42:04 +02:00
jvoisin
e70ea811c9 Implement support for .avi files, via ffmpeg
- This commit introduces optional dependencies (namely ffmpeg):
  mat2 will spit a warning when trying to process an .avi file
  if ffmpeg isn't installed.
- Since metadata are obtained via exiftool, this commit
  also refactors a bit our exfitool wrapper.
2018-10-22 12:58:01 +02:00
jvoisin
2ba38dd2a1 Bump mypy typing coverage 2018-10-12 14:32:09 +02:00
jvoisin
b832a59414 Refactor lightweight mode implementation 2018-10-12 11:49:24 +02:00
jvoisin
b9dbd12ef9 Implement recursive metadata for FLAC files
Since FLAC files can contain covers, it makes sense
to parse their metadata
2018-10-11 19:52:47 +02:00
jvoisin
b2e153b69c Delete pictures of FLAC files 2018-10-11 18:15:11 +02:00
jvoisin
0d25b18d26 Improve both the typing and the comments 2018-10-05 17:07:58 +02:00
jvoisin
d0f3534eff Hide unsupported extensions in mat2 -l 2018-10-05 12:43:21 +02:00
jvoisin
8e98593b02 Trash word/people.xml in office files 2018-10-04 16:28:20 +02:00
georg
34fbd633fd
libmat2: fix shebang
Relates 0a2a398c9c
2018-10-03 18:38:28 +00:00
jvoisin
5a5c642a46 Don't break office files for MS Office
We didn't take the whitelist into account while
removing dangling files from [Content_types].xml
2018-10-03 16:38:05 +02:00
jvoisin
1b356b8c6f Improve mat2's cli reliability
- Replace some class members by instance members
- Don't thread the cleaning process anymore for now
2018-10-03 15:22:36 +02:00
jvoisin
c67bbafb2c Use [Content_Types].xml to improve MS Office coverage 2018-10-02 11:55:42 -07:00
georg
5b606f939d
fix typo 2018-10-02 16:01:24 +00:00
jvoisin
652b8e519f Files processed via MAT2 are now accepted without warnings by MS Office 2018-10-01 12:25:37 -07:00
jvoisin
81a3881aa4 Please mypy 2018-09-30 19:55:17 +02:00
jvoisin
e342671ead Remove dangling references in MS Office's [Content_types].xml 2018-09-30 19:53:18 +02:00
jvoisin
719cdf20fa Second pass of minor formatting 2018-09-24 20:15:07 +02:00
jvoisin
2e243355f5 Fix some minor formatting issues 2018-09-24 19:50:24 +02:00
jvoisin
174d4a0ac0 Implement rsid stripping for office files
MS Office XML rsid is a "unique identifier used to track the editing session
when the physical character representing this section mark was last formatted."

See the following links for details:
- https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
- https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/.
2018-09-24 18:03:59 +02:00
jvoisin
fbcf68c280 Lexicographical sort on xml attributes for office files
In XML, the order of the attributes shouldn't be meaningful,
however, MS Office sorts attributes for a given XML tag
differently than LibreOffice.
2018-09-24 17:45:09 +02:00
jvoisin
a1a06d023e Insert archive members in lexicographic order 2018-09-18 22:44:21 +02:00
jvoisin
5cf94bd256 Bump coverage back to 100% 2018-09-12 14:54:54 +02:00
jvoisin
de65f4f4d4 Improve the resilience of MAT2 wrt. corrupted PNG 2018-09-09 19:09:05 +02:00
jvoisin
9fe6f1023b Make pylint happy 2018-09-06 11:36:04 +02:00
jvoisin
e3d817f57e Split office and archives 2018-09-06 11:34:14 +02:00
jvoisin
120b204988 Change a bit the previous commit 2018-09-06 11:13:11 +02:00