Use CLSID for Office 97-2003 detection #243

Merged on Feb 3, 2022 (2 commits)
16 changes: 16 additions & 0 deletions README.md
@@ -57,6 +57,22 @@ using magic numbers is slow, inaccurate, and non-standard. Most of the times
protocols have methods for specifying such metadata; e.g., `Content-Type` header
in HTTP and SMTP.

## FAQ
Q: My file is in the list of [supported MIME types](supported_mimes.md) but
it is not correctly detected. What should I do?

A: Some file formats (often Microsoft Office documents) keep their signatures
towards the end of the file. Try increasing the number of bytes used for detection
with:
```go
mimetype.SetLimit(1024*1024) // Set limit to 1MB.
// or
mimetype.SetLimit(0) // No limit, whole file content used.
mimetype.DetectFile("file.doc")
```
If increasing the limit does not help, please
[open an issue](https://github.com/gabriel-vasile/mimetype/issues/new?assignees=&labels=&template=mismatched-mime-type-detected.md&title=).

## Structure
**mimetype** uses a hierarchical structure to keep the MIME type detection logic.
This reduces the number of calls needed for detecting the file type. The reason
26 changes: 18 additions & 8 deletions internal/magic/ms_office.go
@@ -78,14 +78,24 @@ func Aaf(raw []byte, limit uint32) bool {
 }

 // Doc matches a Microsoft Word 97-2003 file.
-//
-// BUG(gabriel-vasile): Doc should look for subheaders like Ppt and Xls does.
-//
-// Ole is a container for Doc, Ppt, Pub and Xls.
-// Right now, when an Ole file is detected, it is considered to be a Doc file
-// if the checks for Ppt, Pub and Xls failed.
-func Doc(raw []byte, limit uint32) bool {
-	return true
+// See: https://github.com/decalage2/oletools/blob/412ee36ae45e70f42123e835871bac956d958461/oletools/common/clsid.py
+func Doc(raw []byte, _ uint32) bool {
+	clsids := [][]byte{
+		// Microsoft Word 97-2003 Document (Word.Document.8)
+		{0x06, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46},
+		// Microsoft Word 6.0-7.0 Document (Word.Document.6)
+		{0x00, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46},
+		// Microsoft Word Picture (Word.Picture.8)
+		{0x07, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46},
+	}
+
+	for _, clsid := range clsids {
+		if matchOleClsid(raw, clsid) {
+			return true
+		}
+	}
+
+	return false
 }

 // Ppt matches a Microsoft PowerPoint 97-2003 file or a PowerPoint 95 presentation.
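
Reviewer note: `matchOleClsid` is called by the new `Doc` but its body is not part of this diff. The following is only a minimal sketch of how such a check could work, assuming the standard Compound File Binary header layout (major version at byte 26, first directory sector number at byte 48, root entry CLSID at byte 80 of the first directory entry). The names `rootClsidMatches` and `sampleOLE` are hypothetical and do not come from the repository, and the sketch does not follow FAT chains; it only looks at the first directory sector.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// CLSIDs copied from the diff above.
var (
	wordDoc8 = []byte{0x06, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46}
	wordDoc6 = []byte{0x00, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46}
)

// rootClsidMatches is a hypothetical stand-in for matchOleClsid. It reports
// whether the root storage entry of an OLE compound file carries clsid.
func rootClsidMatches(raw, clsid []byte) bool {
	const headerLen = 512
	if len(raw) < headerLen {
		return false
	}
	// Major version 3 uses 512-byte sectors, version 4 uses 4096-byte sectors.
	sectorLen := uint64(512)
	if binary.LittleEndian.Uint16(raw[26:28]) == 4 {
		sectorLen = 4096
	}
	// Sector number of the first directory sector.
	firstDirSector := uint64(binary.LittleEndian.Uint32(raw[48:52]))
	// Sector n starts at (n+1)*sectorLen because the header occupies the
	// first sector. The root entry is the first directory entry and its
	// CLSID field sits 80 bytes into that entry.
	start := (firstDirSector+1)*sectorLen + 80
	end := start + 16
	if end > uint64(len(raw)) {
		return false
	}
	return bytes.Equal(raw[start:end], clsid)
}

// sampleOLE builds a synthetic 1 KiB buffer that has just enough of a
// version 3 compound file header for rootClsidMatches to locate clsid.
func sampleOLE(clsid []byte) []byte {
	buf := make([]byte, 1024)
	copy(buf, []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}) // OLE signature
	binary.LittleEndian.PutUint16(buf[26:], 3)                        // major version 3
	binary.LittleEndian.PutUint32(buf[48:], 0)                        // directory starts in sector 0
	copy(buf[512+80:], clsid)                                         // root entry CLSID
	return buf
}

func main() {
	buf := sampleOLE(wordDoc8)
	fmt.Println(rootClsidMatches(buf, wordDoc8)) // true
	fmt.Println(rootClsidMatches(buf, wordDoc6)) // false
}
```

If the real helper works along these lines, the root CLSID sits shortly after the header whenever the directory stream starts in an early sector. That would let `Doc` make a positive match instead of acting as the fallback for any OLE file that is not Ppt, Pub or Xls.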
1 change: 0 additions & 1 deletion mimetype_test.go
@@ -49,7 +49,6 @@ var files = map[string]string{
 	"dcm.dcm": "application/dicom",
 	"deb.deb": "application/vnd.debian.binary-package",
 	"djvu.djvu": "image/vnd.djvu",
-	"doc.1.doc": "application/msword",
 	"doc.doc": "application/msword",
 	"docx.1.docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
 	"docx.docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
Binary file removed testdata/doc.1.doc
Binary file not shown.
Binary file modified testdata/doc.doc
Binary file not shown.