PDF Parsing in yextend

Dre edited this page Aug 29, 2016 · 5 revisions

As of version 1.5, when yextend encounters a PDF file (whether inside an archive or not), it natively scans 2 representations of that file against the given Yara ruleset:

  • the target PDF's binary data set
  • the raw text (extracted out of the binary data set) from the target PDF

This is useful when one wants to scan a PDF for structural issues (e.g. embedded malware) and, in the same run, scan the actual text inside the same document (e.g. keyword search). Yextend performs both of these checks automatically.
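To make the two views concrete, here is a minimal Python sketch. It is not yextend's implementation: the toy document, the `extract_streams` helper, and the byte patterns are all assumptions made for illustration, and plain substring checks stand in for Yara rules. It relies on one commonly seen property of PDFs, namely that content streams are often stored deflate-compressed, so body text may not be visible in the raw bytes. Whether that applies to any particular PDF is an assumption.

```python
import re
import zlib

# Toy stand-in for a PDF (hypothetical, not a valid PDF file):
# a header plus one deflate-compressed content stream.
BODY = b"BT (Lorem ipsum dolor sit amet) Tj ET\n" * 20
TOY_PDF = b"%PDF-1.4\nstream\n" + zlib.compress(BODY) + b"\nendstream\n%%EOF\n"

STREAM_RE = re.compile(rb"stream\n(.*?)\nendstream", re.DOTALL)


def extract_streams(raw: bytes) -> bytes:
    """Inflate every stream found in the toy document and join the results."""
    parts = []
    for match in STREAM_RE.finditer(raw):
        try:
            parts.append(zlib.decompress(match.group(1)))
        except zlib.error:
            parts.append(match.group(1))  # not compressed: keep the bytes as-is
    return b"\n".join(parts)


def scan_two_views(raw: bytes, patterns: dict) -> dict:
    """Check every pattern against both the raw bytes and the extracted view."""
    views = {
        "PDF - Raw data": raw,
        "PDF - Extracted text": extract_streams(raw),
    }
    return {
        label: sorted(name for name, needle in patterns.items() if needle in data)
        for label, data in views.items()
    }


# Substring patterns standing in for Yara strings (assumed names).
PATTERNS = {"magic": b"%PDF", "lorem": b"Lorem"}
results = scan_two_views(TOY_PDF, PATTERNS)

for label, names in results.items():
    print(label, "->", names)
```

In this toy, the header pattern is only found in the raw view and the keyword is only found in the extracted view, which is why running both scans is worthwhile. Real text extraction needs a proper PDF parser.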

The following example outlines the differences between a plain Yara run and a yextend run.

We will use 2 target files to explain the point here:

  • lipsum.txt.pdf - a simple PDF with some text
  • lipsum.pdf.zip - a zip file that holds lipsum.pdf.tar.gz. Inside that gzipped tarball is lipsum.txt.pdf

For our example ruleset we will use 'lorem_pdf.yara', found in directory 'test_rulesets'. It has 2 very simple rules in it.

Note: we will ignore rule warnings here (hence the -w flag in the plain Yara runs).

Case 1: Yara run against the zip file 'lipsum.pdf.zip'

dre@debian:~/software/yextend$ yara -w -s test_rulesets/lorem_pdf.yara test_files/lipsum.pdf.zip 
dre@debian:~/software/yextend$

As expected we got zero hits on that run.

Case 2: Yara run against the pdf file 'lipsum.txt.pdf'

dre@debian:~/software/yextend$ yara -w -s test_rulesets/lorem_pdf.yara test_files/lipsum.txt.pdf 
invalid_trailer_structure test_files/lipsum.txt.pdf
0x0:$magic: 25 50 44 46 
dre@debian:~/software/yextend$

In this case we got a hit against the PDF's raw binary data but no hit against the actual text content inside the PDF.

Case 3: Yextend run against the pdf file 'lipsum.txt.pdf'

dre@debian:~/software/yextend$ ./yextend test_rulesets/lorem_pdf.yara test_files/lipsum.txt.pdf 

===============================ALPHA===================================
File Name: test_files/lipsum.txt.pdf
File Size: 40450
File Signature (MD5): ec650a3a287603d350718b74716aee1c

=======================================================================

Yara Result(s): invalid_trailer_structure:[author=Glenn Edwards (@hiddenillusion),version=0.1,weight=1,detected offsets=0x0:$magic,hit_count=1]
Scan Type: Yara Scan (PDF - Raw data)
Parent File Name: test_files/lipsum.txt.pdf
Child File Name: test_files/lipsum.txt.pdf
File Signature (MD5): ec650a3a287603d350718b74716aee1c


Yara Result(s): LOREM_FILE_BODY:[type=PDF body text (lorem),detected offsets=0x3b:$lipsum_pdf_body_lorem-0x429:$lipsum_pdf_body_lorem-0x1250:$lipsum_pdf_body_lorem-0x12f6:$lipsum_pdf_body_lorem-0x1374:$lipsum_pdf_body_lorem-0x166e:$lipsum_pdf_body_lorem-0x1b6f:$lipsum_pdf_body_lorem-0x1eb7:$lipsum_pdf_body_lorem-0x2149:$lipsum_pdf_body_lorem-0x282b:$lipsum_pdf_body_lorem-0x2a8a:$lipsum_pdf_body_lorem-0x2f1d:$lipsum_pdf_body_lorem-0x301e:$lipsum_pdf_body_lorem-0x305f:$lipsum_pdf_body_lorem-0x3324:$lipsum_pdf_body_lorem-0x3653:$lipsum_pdf_body_lorem-0x38c9:$lipsum_pdf_body_lorem-0x3ac9:$lipsum_pdf_body_lorem-0x41a5:$lipsum_pdf_body_lorem-0x41d8:$lipsum_pdf_body_lorem-0x44df:$lipsum_pdf_body_lorem-0x5654:$lipsum_pdf_body_lorem-0x6647:$lipsum_pdf_body_lorem-0x6727:$lipsum_pdf_body_lorem-0x6939:$lipsum_pdf_body_lorem-0x721a:$lipsum_pdf_body_lorem,hit_count=26]
Scan Type: Yara Scan (PDF - Extracted text)
Parent File Name: test_files/lipsum.txt.pdf
Child File Name: test_files/lipsum.txt.pdf
File Signature (MD5): 126a551fd3801cb33c8dbacfc04ba75f


===============================OMEGA===================================
dre@debian:~/software/yextend$ 

In this case we got the same hit as the plain Yara run against the raw PDF data, but we also got a second rule hit, matched 26 times (each at a separate offset), against the actual text extracted from inside the PDF.
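If these results need to be post-processed, the 'Yara Result(s)' lines can be picked apart with a short script. The following Python sketch is based only on the output format visible in the runs on this page. The `YaraHit` type and the `parse_result_line` function are hypothetical names, not part of yextend, and the parser assumes that offsets are separated by '-' and that each offset is paired with its string identifier by ':'.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class YaraHit(NamedTuple):
    rule: str
    offsets: List[Tuple[int, str]]  # (byte offset, Yara string identifier)
    hit_count: Optional[int]


RESULT_RE = re.compile(r"^Yara Result\(s\): (?P<rule>[^:]+):\[(?P<body>.*)\]$")


def parse_result_line(line: str) -> YaraHit:
    """Parse one 'Yara Result(s)' line as printed in the runs above."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        raise ValueError("not a 'Yara Result(s)' line")
    body = match.group("body")

    offsets = []
    offsets_match = re.search(r"detected offsets=([^,\]]*)", body)
    if offsets_match and offsets_match.group(1):
        for item in offsets_match.group(1).split("-"):
            offset, _, string_id = item.partition(":")
            offsets.append((int(offset, 16), string_id))

    count_match = re.search(r"hit_count=(\d+)", body)
    hit_count = int(count_match.group(1)) if count_match else None
    return YaraHit(match.group("rule"), offsets, hit_count)


# Copied verbatim from the raw-data result in Case 3.
SAMPLE = (
    "Yara Result(s): invalid_trailer_structure:[author=Glenn Edwards "
    "(@hiddenillusion),version=0.1,weight=1,detected offsets=0x0:$magic,hit_count=1]"
)

hit = parse_result_line(SAMPLE)
print(hit.rule, hit.offsets, hit.hit_count)
```

Applied to the LOREM_FILE_BODY line, the same function should return one (offset, identifier) pair per match, so the length of `offsets` can be checked against `hit_count`.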

Case 4: Yextend run against the zip file 'lipsum.pdf.zip'

dre@debian:~/software/yextend$ ./yextend test_rulesets/lorem_pdf.yara test_files/lipsum.pdf.zip 

===============================ALPHA===================================
File Name: test_files/lipsum.pdf.zip
File Size: 34943
File Signature (MD5): 9a709033aa3a59fd16a39a024d3e9c8a

=======================================================================

Yara Result(s): invalid_trailer_structure:[author=Glenn Edwards (@hiddenillusion),version=0.1,weight=1,detected offsets=0x0:$magic,hit_count=1]
Scan Type: Yara Scan (PDF - Raw data) inside GNU tar format file
Parent File Name: lipsum.pdf.tar
Child File Name: lipsum.txt.pdf
File Signature (MD5): ec650a3a287603d350718b74716aee1c


Yara Result(s): LOREM_FILE_BODY:[type=PDF body text (lorem),detected offsets=0x3b:$lipsum_pdf_body_lorem-0x429:$lipsum_pdf_body_lorem-0x1250:$lipsum_pdf_body_lorem-0x12f6:$lipsum_pdf_body_lorem-0x1374:$lipsum_pdf_body_lorem-0x166e:$lipsum_pdf_body_lorem-0x1b6f:$lipsum_pdf_body_lorem-0x1eb7:$lipsum_pdf_body_lorem-0x2149:$lipsum_pdf_body_lorem-0x282b:$lipsum_pdf_body_lorem-0x2a8a:$lipsum_pdf_body_lorem-0x2f1d:$lipsum_pdf_body_lorem-0x301e:$lipsum_pdf_body_lorem-0x305f:$lipsum_pdf_body_lorem-0x3324:$lipsum_pdf_body_lorem-0x3653:$lipsum_pdf_body_lorem-0x38c9:$lipsum_pdf_body_lorem-0x3ac9:$lipsum_pdf_body_lorem-0x41a5:$lipsum_pdf_body_lorem-0x41d8:$lipsum_pdf_body_lorem-0x44df:$lipsum_pdf_body_lorem-0x5654:$lipsum_pdf_body_lorem-0x6647:$lipsum_pdf_body_lorem-0x6727:$lipsum_pdf_body_lorem-0x6939:$lipsum_pdf_body_lorem-0x721a:$lipsum_pdf_body_lorem,hit_count=26]
Scan Type: Yara Scan (PDF - Extracted text) inside GNU tar format file
Parent File Name: lipsum.pdf.tar
Child File Name: lipsum.txt.pdf
File Signature (MD5): 126a551fd3801cb33c8dbacfc04ba75f


===============================OMEGA===================================
dre@debian:~/software/yextend$ 

In this case we see the same results as when we scanned the PDF directly, since yextend extracted the contents of the archive and then scanned them as if they had been passed in directly. Take note of the differences in the 'Parent File Name' values (and the 'inside GNU tar format file' suffix on each 'Scan Type'), since this time the data was compressed inside of an archive.
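The archive handling in Case 1 and Case 4 can be illustrated with a small Python sketch that uses only the standard library. It is not how yextend is implemented: `build_nested_sample` and `walk` are hypothetical helpers, the payload is a toy byte string rather than a real PDF, and file types are guessed from magic bytes in a deliberately simplified way. It builds the same zip, gzip, tar nesting in memory, shows that the keyword is not visible in the outer zip's bytes, and then unwraps each layer while tracking parent and child names.

```python
import gzip
import io
import tarfile
import zipfile

PAYLOAD = b"%PDF-1.4\nLorem ipsum dolor sit amet\n%%EOF\n"  # toy stand-in


def build_nested_sample(payload: bytes) -> bytes:
    """Build zip -> gzip -> tar -> payload, mirroring lipsum.pdf.zip."""
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        info = tarfile.TarInfo(name="lipsum.txt.pdf")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("lipsum.pdf.tar.gz", gzip.compress(tar_buf.getvalue()))
    return zip_buf.getvalue()


def walk(name: str, data: bytes, parent: str = ""):
    """Yield (parent name, child name, bytes) for every non-archive file."""
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        inner = name[:-3] if name.endswith(".gz") else name
        yield from walk(inner, gzip.decompress(data), parent=name)
        return
    if data[:4] == b"PK\x03\x04":  # zip local file header
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for member in zf.namelist():
                yield from walk(member, zf.read(member), parent=name)
        return
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
            members = [
                (m.name, tar.extractfile(m).read())
                for m in tar.getmembers()
                if m.isfile()
            ]
    except tarfile.ReadError:
        # Not an archive: report it as a leaf. A file scanned directly
        # is reported as its own parent, as in Case 3.
        yield (parent or name, name, data)
        return
    for member_name, member_data in members:
        yield from walk(member_name, member_data, parent=name)


zip_bytes = build_nested_sample(PAYLOAD)
raw_hit = b"Lorem" in zip_bytes  # what a scan of the zip's own bytes sees
leaves = list(walk("lipsum.pdf.zip", zip_bytes))

print("keyword visible in raw zip bytes:", raw_hit)
for parent_name, child_name, content in leaves:
    print(parent_name, "->", child_name, "| keyword found:", b"Lorem" in content)
```

The leaf is reported with the tar layer as its parent because the gzip layer is unwrapped first, which lines up with the 'Parent File Name: lipsum.pdf.tar' value in the yextend output.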