unicode does not show correctly #86

Closed
adals opened this issue Nov 16, 2019 · 12 comments

@adals

adals commented Nov 16, 2019

#[macro_use]
extern crate lopdf;
use lopdf::content::{Content, Operation};
use lopdf::{Document, Object, Stream};

fn main() {
	let mut doc = Document::with_version("1.5");
	let pages_id = doc.new_object_id();
	let font_id = doc.add_object(dictionary! {
		"Type" => "Font",
		"Subtype" => "Type1",
		"BaseFont" => "Courier",
	});
	let resources_id = doc.add_object(dictionary! {
		"Font" => dictionary! {
			"F1" => font_id,
		},
	});
	let content = Content {
		operations: vec![
			Operation::new("BT", vec![]),
			Operation::new("Tf", vec!["F1".into(), 48.into()]),
			Operation::new("Td", vec![100.into(), 600.into()]),
			
			//change text to unicode (arabic)
			Operation::new("Tj", vec![Object::string_literal("مرحبا بالعالم!")]),
			Operation::new("ET", vec![]),
		],
	};
	let content_id = doc.add_object(Stream::new(dictionary! {}, content.encode().unwrap()));
	let page_id = doc.add_object(dictionary! {
		"Type" => "Page",
		"Parent" => pages_id,
		"Contents" => content_id,
	});
	let pages = dictionary! {
		"Type" => "Pages",
		"Kids" => vec![page_id.into()],
		"Count" => 1,
		"Resources" => resources_id,
		"MediaBox" => vec![0.into(), 0.into(), 595.into(), 842.into()],
	};
	doc.objects.insert(pages_id, Object::Dictionary(pages));
	let catalog_id = doc.add_object(dictionary! {
		"Type" => "Catalog",
		"Pages" => pages_id,
	});
	doc.trailer.set("Root", catalog_id);
	doc.compress();
	doc.save("example.pdf").unwrap();
}


and the result is 


<img width="1029" alt="Screen Shot 2019-11-16 at 1 36 12 PM" src="https://user-images.githubusercontent.com/169691/68992001-40876680-0876-11ea-8c05-a1f20fbd824a.png">
@LHRchina

Any update on this problem?

@arifd

arifd commented Apr 12, 2023

@fold-squirrel would you happen to have any wisdom here?

I was able to get code points up to 255 (i.e. the Latin-1 range, often called extended ASCII) to display with this hack:

let hex_text = text.chars().map(|c| c as u8).collect::<Vec<_>>();
let contents = Object::String(hex_text, StringFormat::Hexadecimal);

supplying that into an object that looks like this

let highlight = dictionary! {
    "Contents" => contents,
    "AP" => dictionary! {
        "N" => appearance_id
    },
    "Border" => vec![
        0.into(),
        0.into(),
        1.into(),
    ],
    "C" => color,
    "CA" => 1,
    "F" => 4,
    "P" => page_id,
    "QuadPoints" => vec![
        (left).into(),
        (top).into(),
        //
        (right).into(),
        (top).into(),
        //
        (left).into(),
        (bottom).into(),
        //
        (right).into(),
        (bottom).into(),
    ],
    "Rect" => vec![
        (left).into(),
        (bottom).into(),
        (right).into(),
        (top).into(),
    ],
    "Type" => "Annot",
    "Subtype" => "Highlight",
}

Not sure why that doesn't work when plugging it into the "Tj" operation of @adals's example.

But also, I'm not sure why I can't get code points above 255 to work either.

I looked at a highlight produced by an official Adobe PDF. If you enter only ASCII inside the annotation, it produces an object similar to mine but with a string_literal. If you add any non-ASCII symbols, it becomes a hex string, which appears to have the BOM prefix b"\xfe\xff\x00". I have tried all combinations of adding or not adding the BOM, and also

let hex_text = text.chars().flat_map(|c| (c as u32).to_le_bytes()).collect::<Vec<_>>();

but that works worse than the text.chars().map(|c| c as u8).collect::<Vec<_>>() version.

I think I read somewhere that the character encoding needs to be UCS-2, which is maybe why as u8 works but (c as u32).to_le_bytes() does not (because that is not technically UCS-2)?

Do you have any insight on this matter? Thanks!!
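
As a minimal sketch of why the two attempts above behave differently (standard library only; the function names are hypothetical and not part of lopdf): `c as u8` keeps only the low 8 bits of each code point, so anything above U+00FF silently becomes a different byte, while `(c as u32).to_le_bytes()` emits four little-endian bytes per character rather than two-byte UTF-16 code units. The third function shows the two-byte big-endian form for comparison, which suggests that the `\x00` seen after `\xfe\xff` is probably just the high byte of the first character.

```rust
/// The "hack" from the comment: one byte per char, keeping only the low 8 bits.
fn truncate_to_u8(text: &str) -> Vec<u8> {
    text.chars().map(|c| c as u8).collect()
}

/// The second attempt: four little-endian bytes per char (effectively UTF-32LE).
fn utf32_le_bytes(text: &str) -> Vec<u8> {
    text.chars().flat_map(|c| (c as u32).to_le_bytes()).collect()
}

/// Two big-endian bytes per UTF-16 code unit (no BOM is added here).
fn utf16_be_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16().flat_map(u16::to_be_bytes).collect()
}
```

For the Arabic letter U+0645, `truncate_to_u8` yields the single byte `0x45` (ASCII `E`), `utf32_le_bytes` yields `45 06 00 00`, and `utf16_be_bytes` yields `06 45`.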

@fold-squirrel

The only way I know to insert Unicode into a PDF is to embed a font and reference it using a hex string. That gives you 255 Unicode characters to work with. I only need 2 Unicode characters in my tool, so I didn't look further. I read that you can extend the 255 limit through some kind of font inheritance.

I also read that changing the PDF encoding to UTF-16 can give you access to Unicode characters, but I never got it to work.

When I wanted to insert Unicode, I simply opened a word processor and typed in all the characters I needed. Then, using mutool from MuPDF, I extracted the embedded font subset along with its cmap and hard-coded them in my tool. I looked at using the allsorts font library to automate that process, but at that point I lost interest.

@arifd

arifd commented Apr 12, 2023

Wow, thanks for the swift and helpful answer.

How does one change the PDF encoding to UTF-16? Or maybe just add an extra encoding, because in my case I want to take an existing PDF and just add extra annotations to it.

Perhaps then it might be as simple as "hello".encode_utf16().flat_map(u16::to_le_bytes).collect::<Vec<_>>()
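
A small sketch of what the little-endian variant would produce, assuming the reader expects big-endian UTF-16 (the UTF-16BE encoding named later in this thread). It uses only the standard library, and both function names are hypothetical. The second function decodes bytes the way such a reader would, so the effect of a byte-order mismatch is visible.

```rust
/// Encode as UTF-16 little-endian, as proposed in the comment above.
fn utf16_le(text: &str) -> Vec<u8> {
    text.encode_utf16().flat_map(u16::to_le_bytes).collect()
}

/// Decode bytes the way a reader expecting big-endian UTF-16 would.
/// A trailing odd byte is ignored; invalid code units become U+FFFD.
fn read_as_utf16_be(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}
```

With these definitions, `read_as_utf16_be(&utf16_le("hello"))` comes back as the code points U+6800 U+6500 U+6C00 U+6C00 U+6F00 rather than `hello`, because every byte pair is swapped.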

@fold-squirrel

The thing is, I'm not sure how, but you made me interested in trying again, so I'll give it a shot later today. I would also like to know whether you're embedding a font file or using one of the 14 standard fonts that don't require embedding.

@arifd

arifd commented Apr 12, 2023

Well I very much appreciate it! I'm trying to recreate this feature:

simple_with_arabic_annot.pdf

Notice I took an existing document and the word "Zombies" has been annotated. (interestingly in Chrome the text is garbled)

I'm not sure I have control over the font being used.

@fold-squirrel

simple_with_arabic_annot.pdf

So that's what annotate means. I thought it was more like an arrow pointing to a word. I haven't looked at how annotations work in PDF before.

The font used (if you don't set it) will be the last font the document used which will create problems especially if it's a subset font, the previous font colour is also inherited.

@fold-squirrel

fold-squirrel commented Apr 13, 2023

So after some googling, it seems that embedding a font is not required for annotations. I found this example from the PDF Association: pdf 2.0 utf-8 annotation. In that example they put a BOM (U+FEFF) before the text in object 2 0, and after that they write UTF-8 text. I'm not sure if this works for PDF versions before 2.0, but my guess is not.
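
To make the byte layout concrete, here is a minimal sketch (standard library only, hypothetical function names) of what "a BOM followed by UTF-8 text" looks like. When U+FEFF is itself encoded as UTF-8 it becomes the three bytes `EF BB BF`, which is different from the two bytes `FE FF` used for UTF-16BE.

```rust
/// Return the UTF-8 encoding of U+FEFF (the byte order mark).
fn utf8_bom() -> Vec<u8> {
    let mut buf = [0u8; 4];
    '\u{FEFF}'.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Prefix `text` with the UTF-8 encoded BOM, then append the text unchanged.
fn utf8_with_bom(text: &str) -> Vec<u8> {
    let mut bytes = utf8_bom();
    bytes.extend_from_slice(text.as_bytes());
    bytes
}
```

Whether a given viewer accepts such a string is a separate question that this sketch does not address.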

@fold-squirrel

The font used (if you don't set it) will be the last font the document used which will create problems especially if it's a subset font, the previous font color is also inherited.

I was wrong about this part. It's only true for content that is rendered in the PDF page, not for annotations. It seems the PDF reader is the one that chooses the font for annotations.

@arifd

arifd commented Apr 13, 2023

pdf 2.0 utf-8 annotation

Interesting. Chrome and Edge both show garbage, and the native PDF viewer in Ubuntu 21.04 just crashes :)

I also tried encoding the text with this https://docs.rs/ucs2/0.3.2/ucs2/fn.encode.html but that did not appear to work either.

I forgot to mention that my hack (text.chars().map(|c| c as u8).collect::<Vec<_>>()) causes Safari, macOS Preview, and Samsung Notes to all refuse to open the popup at all.

@fold-squirrel

I tried it on Firefox and it worked fine. Acrobat should also render it fine.

@Heinenen
Collaborator

Heinenen commented Aug 4, 2024

The relevant part of the PDF 1.7 specification on this topic is section 7.9.2.2, text strings.

[Image: excerpt taken from PDF 2.0, section 7.9.2, String object types]

There are three ways to encode text strings: PDFDocEncoding, UTF-16BE, UTF-8.
Note that UTF-8 is only available in PDF2.0.
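
As a rough sketch of how the three encodings can be told apart from the leading bytes, assuming the markers are the ones pushed in the snippets in this comment (254, 255 for UTF-16BE and 239, 187, 191 for UTF-8) and that anything else is treated as PDFDocEncoding. The enum and function names are hypothetical and not part of lopdf.

```rust
/// The three text string encodings listed above.
#[derive(Debug, PartialEq, Eq)]
enum TextStringEncoding {
    Utf16Be,
    Utf8,
    PdfDocEncoding,
}

/// Guess the encoding of a text string from its leading marker bytes.
fn detect_text_string_encoding(bytes: &[u8]) -> TextStringEncoding {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        // two-byte BOM: the rest is UTF-16BE
        TextStringEncoding::Utf16Be
    } else if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        // three-byte marker: the rest is UTF-8 (PDF 2.0 only)
        TextStringEncoding::Utf8
    } else {
        // no marker: fall back to PDFDocEncoding
        TextStringEncoding::PdfDocEncoding
    }
}
```

A plain ASCII string such as `hello` carries no marker, so it falls through to the PDFDocEncoding case.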

So to create a validly encoded string as described in the linked section, one could use the following code:

// recommended for better compatibility
// UTF-16BE encoding
let text = "مرحبا بالعالم!";
// start with the two BOM bytes (0xFE 0xFF, i.e. 254 and 255)
let mut bytes16: Vec<u8> = vec![0xFE, 0xFF];
// encode the actual content as UTF-16BE: two big-endian bytes per UTF-16 code unit
// (not 100% sure this is the best way to do it, please correct me if this is wrong)
bytes16.extend(text.encode_utf16().flat_map(u16::to_be_bytes));

// NOT RECOMMENDED, ONLY IN PDF 2.0
// the only reader I got this working with is Firefox, not even Adobe worked
// UTF-8
let text = "مرحبا بالعالم!";
// start with the three bytes that mark UTF-8 encoding (0xEF 0xBB 0xBF, i.e. 239, 187, 191)
let mut bytes8: Vec<u8> = vec![0xEF, 0xBB, 0xBF];
// append the content as-is (no conversion needed because Rust strings are already UTF-8)
bytes8.extend_from_slice(text.as_bytes());
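
For inspecting the result, the sketch below builds the UTF-16BE bytes with their BOM and renders them as a PDF hexadecimal string (hex digit pairs between angle brackets). It uses only the standard library and the helper names are hypothetical. How lopdf itself serializes `StringFormat::Hexadecimal`, for instance the case of the digits, is not checked here.

```rust
use std::fmt::Write;

/// BOM (FE FF) followed by the text as big-endian UTF-16 code units.
fn utf16be_with_bom(text: &str) -> Vec<u8> {
    let mut bytes: Vec<u8> = vec![0xFE, 0xFF];
    bytes.extend(text.encode_utf16().flat_map(u16::to_be_bytes));
    bytes
}

/// Render bytes as a PDF hexadecimal string literal, e.g. `<FEFF0041>`.
fn to_pdf_hex_string(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2 + 2);
    out.push('<');
    for b in bytes {
        // two uppercase hex digits per byte
        write!(out, "{:02X}", b).expect("writing to a String cannot fail");
    }
    out.push('>');
    out
}
```

Characters outside the Basic Multilingual Plane come out as a surrogate pair, so a single character such as U+1F600 occupies four bytes after the BOM.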

The following code should produce the following PDF: annotation.pdf

Full working code example
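// Imports assumed by this snippet (it is the body of `main`):
// use lopdf::content::{Content, Operation};
// use lopdf::{dictionary, Document, Object, Stream, StringFormat};
let mut doc = Document::with_version("1.7");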
// doc.reference_table = Xref::new(0, super::XrefType::CrossReferenceTable);
let pages_id = doc.new_object_id();
let font_id = doc.add_object(dictionary! {
    "Type" => "Font",
    "Subtype" => "Type1",
    "BaseFont" => "Courier",
});
let resources_id = doc.add_object(dictionary! {
    "Font" => dictionary! {
        "F1" => font_id,
    },
});
let str = "مرحبا بالعالم!";
let mut bytes = vec![];
bytes.push(254u8);
bytes.push(255u8);
bytes.append(
    &mut str
        .encode_utf16()
        .map(|it| u16::to_be_bytes(it))
        .flatten()
        .collect::<Vec<u8>>(),
);

let content = Content {
    operations: vec![
        Operation::new("BT", vec![]),
        Operation::new("Tf", vec!["F1".into(), 48.into()]),
        Operation::new("Td", vec![100.into(), 600.into()]),
        Operation::new("Tj", vec![Object::string_literal("hello")]),
        Operation::new("ET", vec![]),
    ],
};
let content_id = doc.add_object(Stream::new(dictionary! {}, content.encode().unwrap()));

let page_id = doc.add_object(dictionary! {
    "Type" => "Page",
    "Parent" => pages_id,
    "Contents" => content_id,
    // hardcoded id, please don't copy this yourself :D
    "Annots" => vec![(6u32, 0u16).into()],
});

let highlight = dictionary! {
    "Type" => "Annot",
    "Subtype" => "Text",
    "Contents" => Object::String(bytes, StringFormat::Hexadecimal),
    "CA" => 1,
    "F" => 4,
    "P" => page_id,
    "Rect" => vec![
        50.into(),  // ll_x
        680.into(), // ll_y
        100.into(), // ur_x
        700.into(), // ur_y
    ],
};
let _highlight_id = doc.add_object(highlight);

let pages = dictionary! {
    "Type" => "Pages",
    "Kids" => vec![page_id.into()],
    "Count" => 1,
    "Resources" => resources_id,
    "MediaBox" => vec![0.into(), 0.into(), 595.into(), 842.into()],
};

doc.objects.insert(pages_id, Object::Dictionary(pages));
let catalog_id = doc.add_object(dictionary! {
    "Type" => "Catalog",
    "Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);
doc.save("example.pdf").unwrap();

NOTE: Although this works for annotations, this does NOT work in "normal text"/content stream (using the Tj operator). I haven't read that part of the spec yet, so I can't really comment on why exactly that is.
