jvoisin
d8b68ef68e
Improve a bit Microsoft Word support
2020-05-17 16:53:36 +02:00
jvoisin
c8dc020dc5
Improve xlsx support
2020-04-06 20:47:32 +02:00
jvoisin
599909a760
Improve xlsx support
2020-04-02 20:58:10 +02:00
jvoisin
d7a03d907b
Vastly improve ppt compatibility
2020-03-08 14:06:27 +01:00
jvoisin
a23dc001cd
Improve compatibility with MS Office of cleaned ppt
2020-03-07 14:34:07 +01:00
jvoisin
f93df85d03
Improve a bit ppt support
2020-03-07 05:22:36 -08:00
jvoisin
e5b1068ed6
Improve a bit the support of ppt files
2020-03-07 12:49:45 +01:00
jvoisin
e4114af3b5
Improve a bit ppt support
2019-11-30 11:38:22 +01:00
jvoisin
d56f83bed1
Improve a bit odt handling
2019-11-30 10:25:24 +01:00
jvoisin
655c19d17d
Improve a bit the support for ppt files
2019-10-17 23:02:17 +02:00
jvoisin
0170f0e37e
Improve a bit the comments in the code
...
This is related to the previous commit
2019-09-01 13:52:02 +02:00
jvoisin
0cf0541ad9
Remove nsid fields from MSOffice documents
...
nsids are random identifiers, usually used to ease merging
between documents, and can trivially be used for fingerprinting.
2019-09-01 13:52:02 +02:00
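
A minimal sketch of what removing nsid fields could look like, assuming they appear as `w:nsid` child elements in a WordprocessingML part such as the numbering definitions. The function name `remove_nsid` and the sample document are hypothetical; this is not the code from the commit.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def remove_nsid(root: ET.Element) -> int:
    """Drop every w:nsid element under root and return how many were removed."""
    removed = 0
    # Materialise the traversal first, so the tree is not mutated mid-iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == f"{{{W}}}nsid":
                parent.remove(child)
                removed += 1
    return removed


# Hypothetical sample input, for illustration only.
sample = (
    f'<w:numbering xmlns:w="{W}">'
    '<w:abstractNum w:abstractNumId="0">'
    '<w:nsid w:val="5A3C1F2E"/>'
    '<w:multiLevelType w:val="hybridMultilevel"/>'
    '</w:abstractNum>'
    '</w:numbering>'
)
root = ET.fromstring(sample)
removed = remove_nsid(root)
```

Only the identifier element is dropped; sibling elements that carry actual formatting stay in place.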
jvoisin
82cc822a1d
Add tar archive support
2019-04-27 04:05:36 -07:00
Brolf
5ac91cd4f9
Refactor {black,white}list into {block,allow}list
...
Closes #96
2019-03-05 23:13:42 +00:00
jvoisin
6ef6aaa222
Improve a bit get_meta for libreoffice files
2019-02-08 23:23:56 +01:00
jvoisin
e1dd439fc8
Use the archive refactoring for the office documents too
2019-02-07 22:19:37 +01:00
jvoisin
b9a62d798a
Refactor a bit office get_meta handling
...
This should make it easier to get more metadata from
archive-based file formats.
2019-02-04 00:31:26 +01:00
intrigeri
e8c1bb0e3c
Whenever possible, use bwrap for subprocesses
...
This should close #90
2019-02-03 19:18:41 +01:00
jvoisin
513d897ea0
Implement get_meta() for archives
2018-10-25 11:29:50 +02:00
jvoisin
2ba38dd2a1
Bump mypy typing coverage
2018-10-12 14:32:09 +02:00
jvoisin
0d25b18d26
Improve both the typing and the comments
2018-10-05 17:07:58 +02:00
jvoisin
8e98593b02
Trash word/people.xml in office files
2018-10-04 16:28:20 +02:00
jvoisin
5a5c642a46
Don't break office files for MS Office
...
We didn't take the whitelist into account while
removing dangling files from [Content_Types].xml
2018-10-03 16:38:05 +02:00
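
A rough sketch of the idea, assuming dangling entries are `Override` elements in [Content_Types].xml whose `PartName` points at a member that is no longer in the archive. The function name `drop_dangling_overrides` and the `kept_members` parameter are hypothetical; `kept_members` is meant to include any whitelisted files, so that their entries survive.

```python
import xml.etree.ElementTree as ET

# Namespace of the OPC content-types part.
CT = "http://schemas.openxmlformats.org/package/2006/content-types"


def drop_dangling_overrides(xml_text: str, kept_members: set) -> ET.Element:
    """Remove Override entries whose target is not among kept_members."""
    root = ET.fromstring(xml_text)
    # findall() returns a list, so removing while looping is safe.
    for override in root.findall(f"{{{CT}}}Override"):
        part = override.get("PartName", "").lstrip("/")
        if part not in kept_members:
            root.remove(override)
    return root


# Hypothetical sample input, for illustration only.
sample = (
    f'<Types xmlns="{CT}">'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/xml"/>'
    '<Override PartName="/docProps/app.xml" ContentType="application/xml"/>'
    '</Types>'
)
cleaned = drop_dangling_overrides(sample, {"word/document.xml"})
remaining = [o.get("PartName") for o in cleaned.findall(f"{{{CT}}}Override")]
```

Here `docProps/app.xml` is treated as removed from the archive, so only its entry is dropped.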
jvoisin
1b356b8c6f
Improve mat2's cli reliability
...
- Replace some class members by instance members
- Don't thread the cleaning process anymore for now
2018-10-03 15:22:36 +02:00
jvoisin
c67bbafb2c
Use [Content_Types].xml to improve MS Office coverage
2018-10-02 11:55:42 -07:00
georg
5b606f939d
fix typo
2018-10-02 16:01:24 +00:00
jvoisin
652b8e519f
Files processed via MAT2 are now accepted without warnings by MS Office
2018-10-01 12:25:37 -07:00
jvoisin
81a3881aa4
Please mypy
2018-09-30 19:55:17 +02:00
jvoisin
e342671ead
Remove dangling references in MS Office's [Content_Types].xml
2018-09-30 19:53:18 +02:00
jvoisin
719cdf20fa
Second pass of minor formatting
2018-09-24 20:15:07 +02:00
jvoisin
2e243355f5
Fix some minor formatting issues
2018-09-24 19:50:24 +02:00
jvoisin
174d4a0ac0
Implement rsid stripping for office files
...
MS Office XML rsid is a "unique identifier used to track the editing session
when the physical character representing this section mark was last formatted."
See the following links for details:
- https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
- https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/
2018-09-24 18:03:59 +02:00
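
A minimal sketch of rsid stripping, assuming the identifiers are carried as attributes whose local name starts with `rsid` (for example `w:rsidR` or `w:rsidRPr`). The function name `strip_rsid` and the sample paragraph are hypothetical; this is not the code from the commit.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def strip_rsid(root: ET.Element) -> ET.Element:
    """Delete every attribute whose local name starts with 'rsid'."""
    for element in root.iter():
        # Copy the keys, since the attribute dict is modified in the loop.
        for key in list(element.attrib):
            local_name = key.rsplit("}", 1)[-1]  # strip the '{namespace}' prefix
            if local_name.startswith("rsid"):
                del element.attrib[key]
    return root


# Hypothetical sample input, for illustration only.
sample = (
    f'<w:p xmlns:w="{W}" w:rsidR="00A1B2C3" w:rsidRDefault="00D4E5F6">'
    '<w:r w:rsidRPr="00112233"><w:t>hello</w:t></w:r>'
    '</w:p>'
)
doc = strip_rsid(ET.fromstring(sample))
```

The text content is untouched; only the session-tracking attributes go away.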
jvoisin
fbcf68c280
Lexicographical sort on xml attributes for office files
...
In XML, the order of the attributes shouldn't be meaningful,
however, MS Office sorts attributes for a given XML tag
differently than LibreOffice.
2018-09-24 17:45:09 +02:00
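
One way to get a stable attribute order, sketched with the standard library's ElementTree. The function name `sort_attributes` and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET


def sort_attributes(root: ET.Element) -> ET.Element:
    """Re-insert every element's attributes in lexicographical order."""
    for element in root.iter():
        ordered = sorted(element.attrib.items())
        element.attrib.clear()
        element.attrib.update(ordered)
    return root


# Hypothetical sample input, for illustration only.
doc = ET.fromstring('<p z="1" a="2"><r m="3" b="4"/></p>')
sort_attributes(doc)
serialized = ET.tostring(doc, encoding="unicode")
```

With the order normalised, the same document saved by two different office suites serialises its attributes identically.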
jvoisin
e3d817f57e
Split office and archives
2018-09-06 11:34:14 +02:00
Daniel Kahn Gillmor
f3cef319b9
Unknown Members: make policy use an Enum
...
Closes #60
Note: this changeset also ensures that clean.cleaned.docx is cleaned
up after the pytest run is over.
2018-09-05 18:59:33 -04:00
jvoisin
072ee1814d
Remove defusedxml support and document why
2018-09-05 18:41:08 +02:00
jvoisin
46bb1b83ea
Improve the previous commit
2018-09-05 17:26:09 +02:00
Daniel Kahn Gillmor
1d7e374e5b
office: try all members, even when one fails
...
the end result will be the same -- an abort -- but the user will get
to see all the warnings for a particular file, instead of getting them
one at a time.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor
915dc634c4
document all unknown/unhandlable files even on abort
...
This makes it easy to get a list of all files that mat2 doesn't know
how to handle, without having to choose -u keep or -u omit.
2018-09-04 18:28:04 -04:00
Daniel Kahn Gillmor
4192a2daa3
office: create policy for what to do about unknown members
...
Previously, encountering an unknown member meant that any parser of
this type would abort.
Now, the user can set parser.unknown_member_policy to either 'omit' or
'keep' if they don't want the current action of 'abort'.
Note that this causes pylint to complain about branching depth for
remove_all() because of the nuanced error-handling. I've disabled
this check.
2018-09-04 16:13:33 -04:00
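
A small sketch of how the three policies could behave, written with an Enum for readability. The names `UnknownMemberPolicy` and `select_members` and the member lists are assumptions for illustration, not the project's actual API. Under 'abort', every unknown member is still reported before giving up.

```python
import enum
import logging
from typing import Iterable, List, Optional, Set


class UnknownMemberPolicy(enum.Enum):
    ABORT = "abort"
    OMIT = "omit"
    KEEP = "keep"


def select_members(members: Iterable[str], known: Set[str],
                   policy: UnknownMemberPolicy) -> Optional[List[str]]:
    """Return the members to write out, or None if the policy says abort."""
    kept = []
    abort = False
    for name in members:
        if name in known or policy is UnknownMemberPolicy.KEEP:
            kept.append(name)
        elif policy is UnknownMemberPolicy.OMIT:
            continue  # silently drop the unknown member
        else:
            # Keep going so that all unknown members get reported.
            logging.warning("Unknown member: %s", name)
            abort = True
    return None if abort else kept


# Hypothetical member names, for illustration only.
members = ["word/document.xml", "customXml/item1.xml"]
known = {"word/document.xml"}
kept_with_keep = select_members(members, known, UnknownMemberPolicy.KEEP)
kept_with_omit = select_members(members, known, UnknownMemberPolicy.OMIT)
kept_with_abort = select_members(members, known, UnknownMemberPolicy.ABORT)
```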
jvoisin
7877ba0da5
Fix a minor formatting issue
2018-09-01 14:16:55 +02:00
dkg
e2634f7a50
Logging cleanup
2018-09-01 05:14:32 -07:00
jvoisin
942859601d
Improve the code's documentation
2018-07-19 23:10:27 +02:00
jvoisin
565cb66d14
Minor simplification in how we're handling xml for office files
2018-07-19 22:55:08 +02:00
jvoisin
5a7c7f35f7
Remove print from libmat, and use the logging module instead
...
This should close #28
2018-07-10 21:30:38 +02:00
jvoisin
080d6769ca
Make pylint even happier
2018-07-09 01:11:44 +02:00
jvoisin
8c21006e6c
Fix some pep8 issues spotted by pyflakes
2018-07-08 22:40:36 +02:00
jvoisin
f49aa5cab7
Achieve 100% coverage!
2018-07-08 22:27:37 +02:00
jvoisin
ad3e7ccee8
Bump coverage for office files and fix some related crashes
2018-07-08 21:35:45 +02:00
jvoisin
ca01484126
Silence a stupid mypy warning
2018-07-08 17:12:17 +02:00