unicode does not show correctly #86
Comments
Any update on this problem?
@fold-squirrel would you happen to have any wisdom here? I was able to get code points up to 255 (i.e. the Latin-1 range) to display with this hack:
let hex_text = text.chars().map(|c| c as u8).collect::<Vec<_>>();
let contents = Object::String(hex_text, StringFormat::Hexadecimal);
supplying that into an object that looks like this:
let highlight = dictionary! {
"Contents" => contents,
"AP" => dictionary!{
"N" => appearance_id
},
"Border" => vec![
0.into(),
0.into(),
1.into(),
],
"C" => color,
"CA" => 1,
"F" => 4,
"P" => page_id,
"QuadPoints" => vec![
(left).into(),
(top).into(),
//
(right).into(),
(top).into(),
//
(left).into(),
(bottom).into(),
//
(right).into(),
(bottom).into(),
],
"Rect" => vec![
(left).into(),
(bottom).into(),
(right).into(),
(top).into(),
],
"Type" => "Annot",
"Subtype" => "Highlight",
}
Not sure why that doesn't work when plugging it into the "Tj" operation of @adals example. But also, I'm not sure why I can't get code points above 255 to work either. I looked at a highlight produced by an official Adobe PDF: if you enter ASCII inside the annotation, it produces an object similar to mine but with a string_literal, but if you add any non-ASCII symbols, it becomes HexText, which appears to have the BOM prefix
let hex_text = text.chars().flat_map(|c| (c as u32).to_le_bytes()).collect::<Vec<_>>();
but that works worse than the
I think I read somewhere that the character encoding needs to be
Do you have any insight on this matter? Thanks!!
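To make the difference between these attempts concrete, here is a small std-only sketch that prints the bytes each approach produces for a short string. The function names are made up for illustration, and the third variant assumes the target is UTF-16BE prefixed with the byte order mark FE FF, which is what the BOM-prefixed hex string in the Adobe-produced file suggests.

```rust
/// The `c as u8` hack: keeps only the low byte of each code point,
/// so anything above U+00FF is silently truncated.
fn low_byte_hack(text: &str) -> Vec<u8> {
    text.chars().map(|c| c as u8).collect()
}

/// The `u32` little-endian attempt: four bytes per character, no BOM.
fn u32_le_attempt(text: &str) -> Vec<u8> {
    text.chars().flat_map(|c| (c as u32).to_le_bytes()).collect()
}

/// Assumed target: BOM (FE FF) followed by UTF-16 code units, big-endian.
fn utf16be_with_bom(text: &str) -> Vec<u8> {
    let mut bytes = vec![0xFE, 0xFF];
    bytes.extend(text.encode_utf16().flat_map(u16::to_be_bytes));
    bytes
}

fn main() {
    // U+00E9 fits in one byte, U+20AC does not.
    let text = "é€";
    println!("low byte hack: {:02X?}", low_byte_hack(text));
    println!("u32 LE:        {:02X?}", u32_le_attempt(text));
    println!("UTF-16BE+BOM:  {:02X?}", utf16be_with_bom(text));
}
```

With the low-byte hack, the euro sign (U+20AC) collapses to the single byte AC, which would explain why only code points up to 255 survive. The `u32` variant emits four little-endian bytes per character and no BOM, so a reader has no way to recognise it as Unicode.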
The only way I know how to insert unicode into a pdf is to embed a font and reference it using a hex string. That gives you 255 unicode characters to work with; I only need 2 unicode characters in my tool, so I didn't look further. I read that you can extend the 255 limit through some kind of font inheritance. I also read that changing the pdf encoding to utf-16 can give you access to unicode characters, but I never got it to work. When I wanted to insert unicode, I simply opened a word processor and typed in all the characters I needed; then, using mutool from mupdf, I would extract the embedded font subset along with its cmap and hard-code them in my tool. I looked at using the allsorts font library to automate that process, but at that point I lost interest.
Wow, thanks for the swift and helpful answer. How does one change the pdf encoding to UTF-16? Or maybe just add an extra encoding, because in my case I want to take an existing PDF and just add extra annotations to it. Perhaps then it might be as simple as
The thing is, I'm not sure how, but you made me interested in trying again, so I'll give it a shot later today. I would also like to know if you're embedding a font file or using one of the 14 standard fonts that don't require embedding.
Well, I very much appreciate it! I'm trying to recreate this feature: notice I took an existing document and the word "Zombies" has been annotated. (Interestingly, in Chrome the text is garbled.) I'm not sure I have control over the font being used.
So that's what annotate means. I thought it was more like an arrow pointing to a word; I haven't looked at how annotations work in pdf before. The font used (if you don't set it) will be the last font the document used, which will create problems, especially if it's a subset font. The previous font colour is also inherited.
So after some googling, it seems that embedding a font is not required for annotations. I found this example from the pdf-association: pdf 2.0 utf-8 annotation. In that example they put a BOM
I was wrong about this part. It's only true for content that is rendered in the pdf page, not for annotations; it seems the pdf reader is the one that chooses the font for annotations.
Interesting, Chrome and Edge both show garbage and the native pdf viewer in Ubuntu 21.04 just crashes :) I also tried encoding the text with this https://docs.rs/ucs2/0.3.2/ucs2/fn.encode.html but that did not appear to work either. I forgot to mention that my hack (
I tried it on Firefox and it worked fine; Acrobat should also render it fine.
The relevant part of the PDF 1.7 specification on this topic is section 7.9.2.2, on text strings.
There are three ways to encode text strings: PDFDocEncoding, UTF-16BE, and UTF-8. To create a valid encoded string as described in that section, one could use the following code:
// recommended for better compatibility
// UTF-16BE encoding
let str = "مرحبا بالعالم!";
let mut bytes16 = vec![];
// push the two BOM bytes
bytes16.push(254u8);
bytes16.push(255u8);
// encode the actual content as UTF-16BE
// (not 100% sure this is the best way to do it, please correct me if this is wrong)
bytes16.extend(str.encode_utf16().flat_map(u16::to_be_bytes));

// NOT RECOMMENDED, ONLY IN PDF 2.0
// the only reader I got this working with is Firefox, not even Adobe worked
// UTF-8
let str = "مرحبا بالعالم!";
let mut bytes8 = vec![];
// push the three bytes needed to mark UTF-8 encoding
bytes8.push(239u8);
bytes8.push(187u8);
bytes8.push(191u8);
// encode the actual content as UTF-8 (no conversion needed because Rust strings are already UTF-8)
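bytes8.extend_from_slice(str.as_bytes());
The code below should produce this PDF: annotation.pdf
Full working code example
// imports below assume the code is built against the lopdf crate
use lopdf::content::{Content, Operation};
use lopdf::{dictionary, Document, Object, Stream, StringFormat};

let mut doc = Document::with_version("1.7");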
// doc.reference_table = Xref::new(0, super::XrefType::CrossReferenceTable);
let pages_id = doc.new_object_id();
let font_id = doc.add_object(dictionary! {
"Type" => "Font",
"Subtype" => "Type1",
"BaseFont" => "Courier",
});
let resources_id = doc.add_object(dictionary! {
"Font" => dictionary! {
"F1" => font_id,
},
});
let str = "مرحبا بالعالم!";
let mut bytes = vec![];
bytes.push(254u8);
bytes.push(255u8);
bytes.append(
&mut str
.encode_utf16()
.map(|it| u16::to_be_bytes(it))
.flatten()
.collect::<Vec<u8>>(),
);
let content = Content {
operations: vec![
Operation::new("BT", vec![]),
Operation::new("Tf", vec!["F1".into(), 48.into()]),
Operation::new("Td", vec![100.into(), 600.into()]),
Operation::new("Tj", vec![Object::string_literal("hello")]),
Operation::new("ET", vec![]),
],
};
let content_id = doc.add_object(Stream::new(dictionary! {}, content.encode().unwrap()));
let page_id = doc.add_object(dictionary! {
"Type" => "Page",
"Parent" => pages_id,
"Contents" => content_id,
// hardcoded id, please don't copy this yourself :D
"Annots" => vec![(6u32, 0u16).into()],
});
let highlight = dictionary! {
"Type" => "Annot",
"Subtype" => "Text",
"Contents" => Object::String(bytes, StringFormat::Hexadecimal),
"CA" => 1,
"F" => 4,
"P" => page_id,
"Rect" => vec![
50.into(), // ll_x
680.into(), // ll_y
100.into(), // ur_x
700.into(), // ur_y
],
};
let _highlight_id = doc.add_object(highlight);
let pages = dictionary! {
"Type" => "Pages",
"Kids" => vec![page_id.into()],
"Count" => 1,
"Resources" => resources_id,
"MediaBox" => vec![0.into(), 0.into(), 595.into(), 842.into()],
};
doc.objects.insert(pages_id, Object::Dictionary(pages));
let catalog_id = doc.add_object(dictionary! {
"Type" => "Catalog",
"Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);
doc.save("example.pdf").unwrap();
NOTE: Although this works for annotations, it does NOT work for "normal text" in a content stream (using the Tj operator). I haven't read that part of the spec yet, so I can't really comment on why exactly that is.
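The encoding step can be pulled out of the example and checked on its own. The following is a std-only sketch, not part of any library: the names `TextEncoding`, `detect_encoding`, `encode_utf16be`, `decode_utf16be` and `to_hex` are hypothetical helpers. It assumes the three markers described above (FE FF for UTF-16BE, EF BB BF for UTF-8, no marker for PDFDocEncoding) and round-trips the Arabic sample string.

```rust
/// Encodings a PDF text string can use, told apart by their leading bytes.
#[derive(Debug, PartialEq)]
enum TextEncoding {
    Utf16Be,        // starts with FE FF
    Utf8,           // starts with EF BB BF (PDF 2.0 only)
    PdfDocEncoding, // no marker
}

fn detect_encoding(bytes: &[u8]) -> TextEncoding {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        TextEncoding::Utf16Be
    } else if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        TextEncoding::Utf8
    } else {
        TextEncoding::PdfDocEncoding
    }
}

/// BOM followed by the UTF-16 code units in big-endian order.
fn encode_utf16be(text: &str) -> Vec<u8> {
    let mut bytes = vec![0xFE, 0xFF];
    bytes.extend(text.encode_utf16().flat_map(u16::to_be_bytes));
    bytes
}

/// Inverse of `encode_utf16be`; returns None on a missing BOM,
/// an odd number of payload bytes, or invalid UTF-16.
fn decode_utf16be(bytes: &[u8]) -> Option<String> {
    if !bytes.starts_with(&[0xFE, 0xFF]) {
        return None;
    }
    let body = &bytes[2..];
    if body.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = body
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

/// Uppercase hex digits, as they would appear between `<` and `>` in a PDF hex string.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02X}", b)).collect()
}

fn main() {
    let text = "مرحبا بالعالم!";
    let bytes = encode_utf16be(text);
    assert_eq!(detect_encoding(&bytes), TextEncoding::Utf16Be);
    assert_eq!(decode_utf16be(&bytes).as_deref(), Some(text));
    println!("<{}>", to_hex(&bytes));
}
```

The bytes returned by `encode_utf16be` are the same ones the example above passes as `Object::String(bytes, StringFormat::Hexadecimal)`. As far as I understand the spec, the Tj case differs because the string operands of text-showing operators are character codes looked up through the current font's encoding rather than text strings, so a BOM-prefixed UTF-16 string is not interpreted as Unicode there.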