Jason Smalls
|
8c26020f67
|
Add more files to ignore for MSOffice documents
|
2023-07-11 21:38:22 +02:00 |
|
jvoisin
|
1b9608aecf
|
Use proper type annotations instead of comments
|
2023-05-03 22:28:02 +02:00 |
|
jvoisin
|
3cb3f58084
|
Another typing pass
|
2023-01-28 17:22:26 +01:00 |
|
jvoisin
|
39fb254e01
|
Fix the type annotations
|
2023-01-28 15:57:20 +00:00 |
|
jvoisin
|
62a45c29df
|
Improve xlsx support
|
2022-12-25 18:05:13 +01:00 |
|
jvoisin
|
180ea24e5a
|
Remove pyflakes
It's borderline useless compared to mypy and pylint
|
2022-11-21 19:57:38 +01:00 |
|
jvoisin
|
cc5be8608b
|
Simplify the typing annotations
|
2022-08-28 22:29:06 +02:00 |
|
jvoisin
|
3378f3ab8c
|
Please pylint by iterating on dict directly, instead of calling .keys()
|
2021-12-26 15:23:26 +01:00 |
|
jvoisin
|
0b094b594b
|
Improve xlsx support
This should close #156
|
2021-07-14 23:34:02 +02:00 |
|
jvoisin
|
bf0c777cb9
|
Improve support for xlsx files
|
2021-05-20 18:16:28 +02:00 |
|
jvoisin
|
d00ca800b2
|
Keep sharedStrings.xml when processing MSOffice sheets
|
2021-03-14 14:41:40 +01:00 |
|
jvoisin
|
8b42b28b70
|
Don't keep [trash] files when processing MS Office files
|
2021-03-14 14:35:29 +01:00 |
|
jvoisin
|
148bcbba52
|
Bump coverage
|
2020-11-13 17:27:23 +01:00 |
|
jvoisin
|
b84f73c5c3
|
Handle multiple namespaces in MSOffice's content types
|
2020-11-06 15:29:42 +01:00 |
|
jvoisin
|
96e639dfd3
|
Fix a regexp for xlsx files
This should slightly increase compatibility with Excel files
|
2020-11-06 15:26:30 +01:00 |
|
jvoisin
|
d8b68ef68e
|
Improve a bit Microsoft Word support
|
2020-05-17 16:53:36 +02:00 |
|
jvoisin
|
c8dc020dc5
|
Improve xlsx support
|
2020-04-06 20:47:32 +02:00 |
|
jvoisin
|
599909a760
|
Improve xlsx support
|
2020-04-02 20:58:10 +02:00 |
|
jvoisin
|
d7a03d907b
|
Vastly improve ppt compatibility
|
2020-03-08 14:06:27 +01:00 |
|
jvoisin
|
a23dc001cd
|
Improve compatibility with MS Office of cleaned ppt
|
2020-03-07 14:34:07 +01:00 |
|
jvoisin
|
f93df85d03
|
Improve a bit ppt support
|
2020-03-07 05:22:36 -08:00 |
|
jvoisin
|
e5b1068ed6
|
Improve a bit the support of ppt files
|
2020-03-07 12:49:45 +01:00 |
|
jvoisin
|
e4114af3b5
|
Improve a bit ppt support
|
2019-11-30 11:38:22 +01:00 |
|
jvoisin
|
d56f83bed1
|
Improve a bit odt handling
|
2019-11-30 10:25:24 +01:00 |
|
jvoisin
|
655c19d17d
|
Improve a bit the support for ppt files
|
2019-10-17 23:02:17 +02:00 |
|
jvoisin
|
0170f0e37e
|
Improve a bit the comments in the code
This is related to the previous commit
|
2019-09-01 13:52:02 +02:00 |
|
jvoisin
|
0cf0541ad9
|
Remove nsid fields from MSOffice documents
nsids are random identifiers, usually used to ease merging
between documents, and can trivially be used for fingerprinting.
|
2019-09-01 13:52:02 +02:00 |
|
jvoisin
|
82cc822a1d
|
Add tar archive support
|
2019-04-27 04:05:36 -07:00 |
|
Brolf
|
5ac91cd4f9
|
Refactor {black,white}list into {block,allow}list
Closes #96
|
2019-03-05 23:13:42 +00:00 |
|
jvoisin
|
6ef6aaa222
|
Improve a bit get_meta for libreoffice files
|
2019-02-08 23:23:56 +01:00 |
|
jvoisin
|
e1dd439fc8
|
Make use of the archive refactoring for the office documents too
|
2019-02-07 22:19:37 +01:00 |
|
jvoisin
|
b9a62d798a
|
Refactor a bit office get_meta handling
This should make it easier to get more metadata from
archive-based file formats.
|
2019-02-04 00:31:26 +01:00 |
|
intrigeri
|
e8c1bb0e3c
|
Whenever possible, use bwrap for subprocesses
This should close #90
|
2019-02-03 19:18:41 +01:00 |
|
jvoisin
|
513d897ea0
|
Implement get_meta() for archives
|
2018-10-25 11:29:50 +02:00 |
|
jvoisin
|
2ba38dd2a1
|
Bump mypy typing coverage
|
2018-10-12 14:32:09 +02:00 |
|
jvoisin
|
0d25b18d26
|
Improve both the typing and the comments
|
2018-10-05 17:07:58 +02:00 |
|
jvoisin
|
8e98593b02
|
Trash word/people.xml in office files
|
2018-10-04 16:28:20 +02:00 |
|
jvoisin
|
5a5c642a46
|
Don't break office files for MS Office
We didn't take the whitelist into account while
removing dangling files from [Content_Types].xml
|
2018-10-03 16:38:05 +02:00 |
|
jvoisin
|
1b356b8c6f
|
Improve mat2's cli reliability
- Replace some class members by instance members
- Don't thread the cleaning process anymore for now
|
2018-10-03 15:22:36 +02:00 |
|
jvoisin
|
c67bbafb2c
|
Use [Content_Types].xml to improve MS Office coverage
|
2018-10-02 11:55:42 -07:00 |
|
georg
|
5b606f939d
|
fix typo
|
2018-10-02 16:01:24 +00:00 |
|
jvoisin
|
652b8e519f
|
Files processed via MAT2 are now accepted without warnings by MS Office
|
2018-10-01 12:25:37 -07:00 |
|
jvoisin
|
81a3881aa4
|
Please mypy
|
2018-09-30 19:55:17 +02:00 |
|
jvoisin
|
e342671ead
|
Remove dangling references in MS Office's [Content_Types].xml
|
2018-09-30 19:53:18 +02:00 |
|
jvoisin
|
719cdf20fa
|
Second pass of minor formatting
|
2018-09-24 20:15:07 +02:00 |
|
jvoisin
|
2e243355f5
|
Fix some minor formatting issues
|
2018-09-24 19:50:24 +02:00 |
|
jvoisin
|
174d4a0ac0
|
Implement rsid stripping for office files
MS Office XML rsid is a "unique identifier used to track the editing session
when the physical character representing this section mark was last formatted."
See the following links for details:
- https://msdn.microsoft.com/en-us/library/office/documentformat.openxml.wordprocessing.previoussectionproperties.rsidrpr.aspx
- https://blogs.msdn.microsoft.com/brian_jones/2006/12/11/whats-up-with-all-those-rsids/.
|
2018-09-24 18:03:59 +02:00 |
|
jvoisin
|
fbcf68c280
|
Lexicographical sort on xml attributes for office files
In XML, the order of the attributes shouldn't be meaningful,
however, MS Office sorts attributes for a given XML tag
differently than LibreOffice.
|
2018-09-24 17:45:09 +02:00 |
|
jvoisin
|
e3d817f57e
|
Split office and archives
|
2018-09-06 11:34:14 +02:00 |
|
Daniel Kahn Gillmor
|
f3cef319b9
|
Unknown Members: make policy use an Enum
Closes #60
Note: this changeset also ensures that clean.cleaned.docx is cleaned
up after the pytest run is over.
|
2018-09-05 18:59:33 -04:00 |
|