PDF Parsing in yextend
As of version 1.5, when yextend encounters a PDF file (whether inside an archive or not), it natively scans that file against the given Yara ruleset in 2 ways:
- the target PDF's binary data set
- the raw text (extracted out of the binary data set) from the target PDF
This is useful when one wants to scan a PDF for structural issues (e.g. embedded malware) and, at the same time, scan the actual text inside the same document (e.g. keyword search). Yextend performs both of these checks automatically.
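To make the distinction concrete, here is a minimal Python sketch of why a scan of the raw bytes and a scan of the extracted text can disagree. It is not how yextend extracts text. The "toy PDF" below is a hypothetical stand-in (a header plus one zlib-compressed stream), and the helper names `build_toy_pdf`, `extract_toy_text` and `scan` are invented for illustration. It relies only on the fact that PDF content is commonly stored in compressed streams.

```python
import zlib

PDF_MAGIC = b"%PDF"   # the bytes 25 50 44 46
KEYWORD = b"lorem"


def build_toy_pdf(text: bytes) -> bytes:
    """Hypothetical stand-in for a PDF: a header plus one compressed stream."""
    return b"%PDF-1.4\nstream\n" + zlib.compress(text) + b"\nendstream\n%%EOF\n"


def extract_toy_text(data: bytes) -> bytes:
    """Decompress the single stream of the toy format (not a real PDF parser)."""
    start = data.index(b"stream\n") + len(b"stream\n")
    end = data.rindex(b"\nendstream")
    return zlib.decompress(data[start:end])


def scan(data: bytes) -> dict:
    """Two toy checks: magic bytes at offset 0 and a case-insensitive keyword count."""
    return {
        "magic_at_0": data.startswith(PDF_MAGIC),
        "keyword_hits": data.lower().count(KEYWORD),
    }


toy = build_toy_pdf(b"Lorem ipsum dolor sit amet. " * 4)
raw_result = scan(toy)                     # pass 1: raw binary data
text_result = scan(extract_toy_text(toy))  # pass 2: extracted text
print("raw: ", raw_result)
print("text:", text_result)
```

The raw pass sees the file structure (the magic bytes) but not the keyword, because the text is compressed. The text pass sees the keyword but not the structure. Running both passes covers both kinds of rule.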
The following example illustrates the difference.
We will use 2 target files:
- lipsum.txt.pdf - a simple PDF with some text
- lipsum.pdf.zip - a zip file that holds lipsum.pdf.tar.gz. Inside that gzipped tarball is lipsum.txt.pdf
For our example ruleset we will use 'lorem_pdf.yara', found in the 'test_rulesets' directory. It contains 2 very simple rules.
Note: we will ignore rule warnings here (the -w flag).
dre@debian:~/software/yextend$ yara -w -s test_rulesets/lorem_pdf.yara test_files/lipsum.pdf.zip
dre@debian:~/software/yextend$
As expected, we got zero hits on that run: plain Yara scans the bytes of the zip file as-is and does not unpack it.
dre@debian:~/software/yextend$ yara -w -s test_rulesets/lorem_pdf.yara test_files/lipsum.txt.pdf
invalid_trailer_structure test_files/lipsum.txt.pdf
0x0:$magic: 25 50 44 46
dre@debian:~/software/yextend$
In this case plain Yara got a hit against the PDF's raw data but no hit against the text content inside the PDF.
dre@debian:~/software/yextend$ ./yextend test_rulesets/lorem_pdf.yara test_files/lipsum.txt.pdf
===============================ALPHA===================================
File Name: test_files/lipsum.txt.pdf
File Size: 40450
File Signature (MD5): ec650a3a287603d350718b74716aee1c
=======================================================================
Yara Result(s): invalid_trailer_structure:[author=Glenn Edwards (@hiddenillusion),version=0.1,weight=1,detected offsets=0x0:$magic,hit_count=1]
Scan Type: Yara Scan (PDF - Raw data)
Parent File Name: test_files/lipsum.txt.pdf
Child File Name: test_files/lipsum.txt.pdf
File Signature (MD5): ec650a3a287603d350718b74716aee1c
Yara Result(s): LOREM_FILE_BODY:[type=PDF body text (lorem),detected offsets=0x3b:$lipsum_pdf_body_lorem-0x429:$lipsum_pdf_body_lorem-0x1250:$lipsum_pdf_body_lorem-0x12f6:$lipsum_pdf_body_lorem-0x1374:$lipsum_pdf_body_lorem-0x166e:$lipsum_pdf_body_lorem-0x1b6f:$lipsum_pdf_body_lorem-0x1eb7:$lipsum_pdf_body_lorem-0x2149:$lipsum_pdf_body_lorem-0x282b:$lipsum_pdf_body_lorem-0x2a8a:$lipsum_pdf_body_lorem-0x2f1d:$lipsum_pdf_body_lorem-0x301e:$lipsum_pdf_body_lorem-0x305f:$lipsum_pdf_body_lorem-0x3324:$lipsum_pdf_body_lorem-0x3653:$lipsum_pdf_body_lorem-0x38c9:$lipsum_pdf_body_lorem-0x3ac9:$lipsum_pdf_body_lorem-0x41a5:$lipsum_pdf_body_lorem-0x41d8:$lipsum_pdf_body_lorem-0x44df:$lipsum_pdf_body_lorem-0x5654:$lipsum_pdf_body_lorem-0x6647:$lipsum_pdf_body_lorem-0x6727:$lipsum_pdf_body_lorem-0x6939:$lipsum_pdf_body_lorem-0x721a:$lipsum_pdf_body_lorem,hit_count=26]
Scan Type: Yara Scan (PDF - Extracted text)
Parent File Name: test_files/lipsum.txt.pdf
Child File Name: test_files/lipsum.txt.pdf
File Signature (MD5): 126a551fd3801cb33c8dbacfc04ba75f
===============================OMEGA===================================
dre@debian:~/software/yextend$
In this case yextend reports the same hit as the plain Yara run on the raw PDF data, plus a second rule hit (LOREM_FILE_BODY, with 26 matches, each at a separate offset) against the text extracted from the PDF.
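When post-processing this output, the 'detected offsets' field can be split into individual matches. The following Python sketch assumes the format seen in the output above (hex offset, a colon, the string identifier, entries joined by '-'). The function name `parse_detected_offsets` is hypothetical and is not part of yextend.

```python
import re
from typing import List, Tuple

# One entry looks like "0x3b:$lipsum_pdf_body_lorem"; entries are joined by "-".
# Yara string identifiers contain only letters, digits and underscores,
# so the "-" separator cannot be confused with part of an identifier.
ENTRY_RE = re.compile(r"(0x[0-9a-fA-F]+):(\$\w*)")


def parse_detected_offsets(field: str) -> List[Tuple[int, str]]:
    """Return (offset, identifier) pairs from a 'detected offsets' value."""
    return [(int(offset, 16), ident) for offset, ident in ENTRY_RE.findall(field)]


sample = ("0x3b:$lipsum_pdf_body_lorem-0x429:$lipsum_pdf_body_lorem"
          "-0x1250:$lipsum_pdf_body_lorem")
hits = parse_detected_offsets(sample)
for offset, ident in hits:
    print(f"{ident} at byte {offset} ({offset:#x})")
```

The number of parsed entries should equal the reported hit_count, which makes for a simple sanity check on the parsed output.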
dre@debian:~/software/yextend$ ./yextend test_rulesets/lorem_pdf.yara test_files/lipsum.pdf.zip
===============================ALPHA===================================
File Name: test_files/lipsum.pdf.zip
File Size: 34943
File Signature (MD5): 9a709033aa3a59fd16a39a024d3e9c8a
=======================================================================
Yara Result(s): invalid_trailer_structure:[author=Glenn Edwards (@hiddenillusion),version=0.1,weight=1,detected offsets=0x0:$magic,hit_count=1]
Scan Type: Yara Scan (PDF - Raw data) inside GNU tar format file
Parent File Name: lipsum.pdf.tar
Child File Name: lipsum.txt.pdf
File Signature (MD5): ec650a3a287603d350718b74716aee1c
Yara Result(s): LOREM_FILE_BODY:[type=PDF body text (lorem),detected offsets=0x3b:$lipsum_pdf_body_lorem-0x429:$lipsum_pdf_body_lorem-0x1250:$lipsum_pdf_body_lorem-0x12f6:$lipsum_pdf_body_lorem-0x1374:$lipsum_pdf_body_lorem-0x166e:$lipsum_pdf_body_lorem-0x1b6f:$lipsum_pdf_body_lorem-0x1eb7:$lipsum_pdf_body_lorem-0x2149:$lipsum_pdf_body_lorem-0x282b:$lipsum_pdf_body_lorem-0x2a8a:$lipsum_pdf_body_lorem-0x2f1d:$lipsum_pdf_body_lorem-0x301e:$lipsum_pdf_body_lorem-0x305f:$lipsum_pdf_body_lorem-0x3324:$lipsum_pdf_body_lorem-0x3653:$lipsum_pdf_body_lorem-0x38c9:$lipsum_pdf_body_lorem-0x3ac9:$lipsum_pdf_body_lorem-0x41a5:$lipsum_pdf_body_lorem-0x41d8:$lipsum_pdf_body_lorem-0x44df:$lipsum_pdf_body_lorem-0x5654:$lipsum_pdf_body_lorem-0x6647:$lipsum_pdf_body_lorem-0x6727:$lipsum_pdf_body_lorem-0x6939:$lipsum_pdf_body_lorem-0x721a:$lipsum_pdf_body_lorem,hit_count=26]
Scan Type: Yara Scan (PDF - Extracted text) inside GNU tar format file
Parent File Name: lipsum.pdf.tar
Child File Name: lipsum.txt.pdf
File Signature (MD5): 126a551fd3801cb33c8dbacfc04ba75f
===============================OMEGA===================================
dre@debian:~/software/yextend$
In this case we see the same results as when we scanned the PDF directly, since yextend unpacked the archive and then scanned its contents as if they had been passed in directly. Take note of the differences in the 'Parent File Name' values, which reflect that the PDF was extracted from inside an archive this time.
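The unpack-then-scan idea can be sketched in Python with the standard library. This is an illustration only, not yextend's implementation: the function names `walk` and `build_sample` are hypothetical, the container types are detected by their magic bytes, and the sample archive is built in memory with a placeholder payload rather than the real test file.

```python
import gzip
import io
import tarfile
import zipfile


def walk(name: str, data: bytes, parent: str = ""):
    """Yield (parent_name, child_name, payload) for each innermost file."""
    if data[:4] == b"PK\x03\x04":              # zip local file header
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield from walk(info.filename, zf.read(info), name)
    elif data[:2] == b"\x1f\x8b":              # gzip magic
        inner = name[:-3] if name.endswith(".gz") else name
        yield from walk(inner, gzip.decompress(data), name)
    elif data[257:262] == b"ustar":            # tar magic at offset 257
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
            for member in tf.getmembers():
                if member.isfile():
                    payload = tf.extractfile(member).read()
                    yield from walk(member.name, payload, name)
    else:
        # Not a container we recognize: this is a leaf to hand to the scanner.
        yield (parent or name, name, data)


def build_sample(pdf_bytes: bytes) -> bytes:
    """Build zip -> tar.gz -> file in memory, mirroring the layout described above."""
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
        info = tarfile.TarInfo("lipsum.txt.pdf")
        info.size = len(pdf_bytes)
        tf.addfile(info, io.BytesIO(pdf_bytes))
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w") as zf:
        zf.writestr("lipsum.pdf.tar.gz", tar_buf.getvalue())
    return zip_buf.getvalue()


placeholder = b"%PDF-1.4 placeholder payload"   # stands in for the real PDF
results = list(walk("lipsum.pdf.zip", build_sample(placeholder)))
for parent_name, child_name, payload in results:
    print(parent_name, child_name, len(payload))
```

Because the gzip layer is peeled off before the tar layer, the immediate parent of the PDF in this sketch is the tarball name without the '.gz' suffix, which is consistent with the 'Parent File Name' shown in the output above.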