unicode does not show correctly #86

adals · 2019-11-16T10:40:00Z

#[macro_use]
extern crate lopdf;
use lopdf::content::{Content, Operation};
use lopdf::{Document, Object, Stream};

fn main() {
	let mut doc = Document::with_version("1.5");
	let pages_id = doc.new_object_id();
	let font_id = doc.add_object(dictionary! {
		"Type" => "Font",
		"Subtype" => "Type1",
		"BaseFont" => "Courier",
	});
	let resources_id = doc.add_object(dictionary! {
		"Font" => dictionary! {
			"F1" => font_id,
		},
	});
	let content = Content {
		operations: vec![
			Operation::new("BT", vec![]),
			Operation::new("Tf", vec!["F1".into(), 48.into()]),
			Operation::new("Td", vec![100.into(), 600.into()]),
			
			//change text to unicode (arabic)
			Operation::new("Tj", vec![Object::string_literal("مرحبا بالعالم!")]),
			Operation::new("ET", vec![]),
		],
	};
	let content_id = doc.add_object(Stream::new(dictionary! {}, content.encode().unwrap()));
	let page_id = doc.add_object(dictionary! {
		"Type" => "Page",
		"Parent" => pages_id,
		"Contents" => content_id,
	});
	let pages = dictionary! {
		"Type" => "Pages",
		"Kids" => vec![page_id.into()],
		"Count" => 1,
		"Resources" => resources_id,
		"MediaBox" => vec![0.into(), 0.into(), 595.into(), 842.into()],
	};
	doc.objects.insert(pages_id, Object::Dictionary(pages));
	let catalog_id = doc.add_object(dictionary! {
		"Type" => "Catalog",
		"Pages" => pages_id,
	});
	doc.trailer.set("Root", catalog_id);
	doc.compress();
	doc.save("example.pdf").unwrap();
}


and the result is 


<img width="1029" alt="Screen Shot 2019-11-16 at 1 36 12 PM" src="https://user-images.githubusercontent.com/169691/68992001-40876680-0876-11ea-8c05-a1f20fbd824a.png">

LHRchina · 2022-12-27T07:04:02Z

any update for this problem?

arifd · 2023-04-12T14:32:30Z

@fold-squirrel would you happen to have any wisdom here?

I was able to get code points up to 255 (i.e. the Latin-1 range, not just ASCII) to display with this hack:

let hex_text = text.chars().map(|c| c as u8).collect::<Vec<_>>();
let contents = Object::String(hex_text, StringFormat::Hexadecimal);

supplying that into an object that looks like this

let highlight = dictionary! {
    "Contents" => contents,
    "AP" => dictionary! {
        "N" => appearance_id
    },
    "Border" => vec![
        0.into(),
        0.into(),
        1.into(),
    ],
    "C" => color,
    "CA" => 1,
    "F" => 4,
    "P" => page_id,
    "QuadPoints" => vec![
        (left).into(),
        (top).into(),
        //
        (right).into(),
        (top).into(),
        //
        (left).into(),
        (bottom).into(),
        //
        (right).into(),
        (bottom).into(),
    ],
    "Rect" => vec![
        (left).into(),
        (bottom).into(),
        (right).into(),
        (top).into(),
    ],
    "Type" => "Annot",
    "Subtype" => "Highlight",
};

Not sure why that doesn't work when plugging it into the "Tj" operation of @adals's example.

But also, I'm not sure why I can't get codepoints above 255 to work either.

I looked at a highlight produced by an official Adobe PDF. If you enter only ASCII inside the annotation, it produces an object similar to mine but with a string_literal. If you add any non-ASCII symbols, it becomes a hex string, which appears to have the BOM prefix b"\xfe\xff\x00". I have tried all combinations of adding or not adding the BOM, and also

let hex_text = text.chars().flat_map(|c| (c as u32).to_le_bytes()).collect::<Vec<_>>();

but that works worse than the text.chars().map(|c| c as u8).collect::<Vec<_>>() version.

I think I read somewhere that the character encoding needs to be UCS-2, which is maybe why as u8 works but (c as u32).to_le_bytes() does not (because that is not technically UCS-2)?

Do you have any insight on this matter? Thanks!!
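
For reference, here is a small standalone sketch (standard library only, no lopdf) of what the conversions discussed in this comment produce for a single Arabic letter. The function names are made up for illustration. It suggests why `c as u8` only looks right up to 255: the cast keeps just the low byte, which equals the code point only for U+0000..U+00FF. The `u32` little-endian variant yields four bytes per character in the wrong byte order, rather than the two-byte big-endian units that UTF-16BE produces.

```rust
/// The hack from the comment: keeps only the low byte of each code point.
fn truncate_to_u8(text: &str) -> Vec<u8> {
    text.chars().map(|c| c as u8).collect()
}

/// The second attempt: four little-endian bytes per code point.
fn u32_le_bytes(text: &str) -> Vec<u8> {
    text.chars().flat_map(|c| (c as u32).to_le_bytes()).collect()
}

/// Big-endian UTF-16 code units, without a byte order mark.
fn utf16_be_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16().flat_map(u16::to_be_bytes).collect()
}

fn main() {
    let meem = "م"; // U+0645
    println!("{:02X?}", truncate_to_u8(meem)); // low byte only: 45
    println!("{:02X?}", u32_le_bytes(meem));   // 45 06 00 00
    println!("{:02X?}", utf16_be_bytes(meem)); // 06 45
}
```

With the truncating cast, U+0645 collapses to 0x45, which is the letter "E", so anything above 255 is silently replaced by an unrelated character.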

fold-squirrel · 2023-04-12T15:11:42Z

The only way I know to insert Unicode into a PDF is to embed a font and reference it using a hex string. That gives you 255 Unicode characters to work with. I only need 2 Unicode characters in my tool, so I didn't look further. I read that you can extend the 255 limit through some kind of font inheritance.

I also read that changing the pdf encoding to utf-16 can give you access to unicode characters but I never got it to work.

When I wanted to insert Unicode, I simply opened a word processor and typed in all the characters I needed. Then, using mutool from mupdf, I extracted the embedded font subset along with its cmap and hard-coded them in my tool. I looked at using the allsorts font library to automate that process, but at that point I lost interest.

arifd · 2023-04-12T15:30:23Z

Wow, thanks for the swift and helpful answer.

How does one change the PDF encoding to UTF-16? Or maybe just add an extra encoding, because in my case I want to take an existing PDF and just add extra annotations to it.

Perhaps then it might be as simple as "hello".encode_utf16().flat_map(u16::to_le_bytes).collect::<Vec<_>>()

fold-squirrel · 2023-04-12T15:45:06Z

The thing is, I'm not sure how, but you made me interested in trying again, so I'll give it a shot later today. I would also like to know if you're embedding a font file or using one of the 14 standard fonts that don't require embedding.

arifd · 2023-04-12T16:03:49Z

Well I very much appreciate it! I'm trying to recreate this feature:

simple_with_arabic_annot.pdf

Notice I took an existing document and the word "Zombies" has been annotated. (interestingly in Chrome the text is garbled)

I'm not sure I have control over the font being used.

fold-squirrel · 2023-04-12T16:22:22Z

simple_with_arabic_annot.pdf

So that's what annotate means. I thought it was more like an arrow pointing to a word. I haven't looked at how annotations work in PDF before.

The font used (if you don't set it) will be the last font the document used which will create problems especially if it's a subset font, the previous font colour is also inherited.

fold-squirrel · 2023-04-13T05:48:48Z

So after some googling, it seems that embedding a font is not required for annotations. I found this example from the pdf-association: pdf 2.0 utf-8 annotation. In that example they put a BOM (U+FEFF) before the text in object 2 0, and after that they write UTF-8 text. I'm not sure if this works for PDF versions before 2.0, but my guess is not.

fold-squirrel · 2023-04-13T05:51:27Z

The font used (if you don't set it) will be the last font the document used which will create problems especially if it's a subset font, the previous font color is also inherited.

I was wrong about this part. It's only true for content that is rendered in the PDF page, not for annotations. It seems the PDF reader is the one that chooses the font for annotations.

arifd · 2023-04-13T12:08:27Z

pdf 2.0 utf-8 annotation

Interesting, Chrome and Edge both show garbage and the native PDF viewer in Ubuntu 21.04 just crashes :)

I also tried encoding the text with this https://docs.rs/ucs2/0.3.2/ucs2/fn.encode.html but that did not appear to work either.

I forgot to mention that my hack (text.chars().map(|c| c as u8).collect::<Vec<_>>()) causes Safari, OSXPreview, and Samsung notes, all to refuse to open the popup at all.

fold-squirrel · 2023-04-13T13:28:18Z

I tried it on Firefox and it worked fine, acrobat should also render fine.

Heinenen · 2024-08-04T11:03:35Z

The relevant part of the PDF 1.7 specification on this topic is section 7.9.2.2, "Text string type".

(Image taken from PDF2.0 section 7.9.2 String object types)

There are three ways to encode text strings: PDFDocEncoding, UTF-16BE, UTF-8.
Note that UTF-8 is only available in PDF2.0.
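
Seen from the reading side, a viewer decides how to interpret a text string by looking at its first bytes. Below is a minimal standalone sketch of that check, using only the standard library. The `TextEncoding` enum and both function names are hypothetical and are not lopdf's API. Decoding PDFDocEncoding needs a lookup table that is left out here.

```rust
#[derive(Debug, PartialEq)]
enum TextEncoding {
    Utf16Be,
    Utf8,
    PdfDocEncoding,
}

/// Picks the encoding from the leading marker bytes, if any.
fn detect_text_encoding(bytes: &[u8]) -> TextEncoding {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        TextEncoding::Utf16Be
    } else if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        TextEncoding::Utf8
    } else {
        TextEncoding::PdfDocEncoding
    }
}

/// Decodes the two Unicode forms; returns None for PDFDocEncoding
/// (not handled in this sketch) and for malformed input.
fn decode_unicode_text(bytes: &[u8]) -> Option<String> {
    match detect_text_encoding(bytes) {
        TextEncoding::Utf16Be => {
            // Skip the 2-byte BOM, then read big-endian 16-bit units.
            let units: Vec<u16> = bytes[2..]
                .chunks_exact(2)
                .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
                .collect();
            String::from_utf16(&units).ok()
        }
        // Skip the 3-byte UTF-8 marker.
        TextEncoding::Utf8 => String::from_utf8(bytes[3..].to_vec()).ok(),
        TextEncoding::PdfDocEncoding => None,
    }
}

fn main() {
    let utf16: [u8; 4] = [0xFE, 0xFF, 0x06, 0x45];
    println!("{:?}", detect_text_encoding(&utf16));
    println!("{:?}", decode_unicode_text(&utf16));
}
```

A string without either marker falls through to PDFDocEncoding, which is why plain ASCII annotations work without any prefix.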

So to create a validly encoded string as described in the linked section, one could use the following code:

// recommended for better compatibility
// UTF-16BE encoding
let text = "مرحبا بالعالم!";
// start with the two BOM bytes (0xFE 0xFF)
let mut bytes16: Vec<u8> = vec![0xFE, 0xFF];
// encode the actual content as UTF-16BE
// (not 100% sure this is the best way to do it, please correct me if this is wrong)
bytes16.extend(text.encode_utf16().flat_map(u16::to_be_bytes));

// NOT RECOMMENDED, ONLY IN PDF2.0
// the only reader I got this working with is Firefox, not even Adobe worked
// UTF-8
let text = "مرحبا بالعالم!";
// start with the three bytes needed to mark UTF-8 encoding (0xEF 0xBB 0xBF)
let mut bytes8: Vec<u8> = vec![0xEF, 0xBB, 0xBF];
// append the actual content as UTF-8 (no conversion needed because Rust strings are already UTF-8)
bytes8.extend_from_slice(text.as_bytes());
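
To see what such a string looks like inside the PDF file, the UTF-16BE bytes can be printed the way a hexadecimal string object is written, between angle brackets. This is a standalone sketch with hypothetical helper names (`pdf_utf16be_text_string`, `to_pdf_hex_string`). It does not call lopdf and only mirrors the on-disk form of a hex string.

```rust
/// BOM (0xFE 0xFF) followed by the text as big-endian UTF-16 code units.
fn pdf_utf16be_text_string(text: &str) -> Vec<u8> {
    let mut bytes: Vec<u8> = vec![0xFE, 0xFF];
    bytes.extend(text.encode_utf16().flat_map(u16::to_be_bytes));
    bytes
}

/// Formats bytes as a PDF hexadecimal string, e.g. <FEFF0041>.
fn to_pdf_hex_string(bytes: &[u8]) -> String {
    let hex: String = bytes.iter().map(|b| format!("{:02X}", b)).collect();
    format!("<{}>", hex)
}

fn main() {
    let bytes = pdf_utf16be_text_string("مرحبا بالعالم!");
    println!("{}", to_pdf_hex_string(&bytes));
}
```

For Latin text the byte right after the BOM is 0x00 (the high byte of the first code unit), which may be where the `\xfe\xff\x00` prefix observed earlier in this thread comes from. Characters outside the Basic Multilingual Plane become surrogate pairs, which `encode_utf16` produces automatically.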

The following code should produce the following PDF: annotation.pdf

Full working code example

let mut doc = Document::with_version("1.7");
// doc.reference_table = Xref::new(0, super::XrefType::CrossReferenceTable);
let pages_id = doc.new_object_id();
let font_id = doc.add_object(dictionary! {
    "Type" => "Font",
    "Subtype" => "Type1",
    "BaseFont" => "Courier",
});
let resources_id = doc.add_object(dictionary! {
    "Font" => dictionary! {
        "F1" => font_id,
    },
});
let str = "مرحبا بالعالم!";
let mut bytes = vec![];
bytes.push(254u8);
bytes.push(255u8);
bytes.append(
    &mut str
        .encode_utf16()
        .map(|it| u16::to_be_bytes(it))
        .flatten()
        .collect::<Vec<u8>>(),
);

let content = Content {
    operations: vec![
        Operation::new("BT", vec![]),
        Operation::new("Tf", vec!["F1".into(), 48.into()]),
        Operation::new("Td", vec![100.into(), 600.into()]),
        Operation::new("Tj", vec![Object::string_literal("hello")]),
        Operation::new("ET", vec![]),
    ],
};
let content_id = doc.add_object(Stream::new(dictionary! {}, content.encode().unwrap()));

let page_id = doc.add_object(dictionary! {
    "Type" => "Page",
    "Parent" => pages_id,
    "Contents" => content_id,
    // hardcoded id, please don't copy this yourself :D
    "Annots" => vec![(6u32, 0u16).into()],
});

let highlight = dictionary! {
    "Type" => "Annot",
    "Subtype" => "Text",
    "Contents" => Object::String(bytes, StringFormat::Hexadecimal),
    "CA" => 1,
    "F" => 4,
    "P" => page_id,
    "Rect" => vec![
        50.into(),  // ll_x
        680.into(), // ll_y
        100.into(), // ur_x
        700.into(), // ur_y
    ],
};
let _highlight_id = doc.add_object(highlight);

let pages = dictionary! {
    "Type" => "Pages",
    "Kids" => vec![page_id.into()],
    "Count" => 1,
    "Resources" => resources_id,
    "MediaBox" => vec![0.into(), 0.into(), 595.into(), 842.into()],
};

doc.objects.insert(pages_id, Object::Dictionary(pages));
let catalog_id = doc.add_object(dictionary! {
    "Type" => "Catalog",
    "Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);
doc.save("example.pdf").unwrap();

NOTE: Although this works for annotations, this does NOT work in "normal text"/content stream (using the Tj operator). I haven't read that part of the spec yet, so I can't really comment on why exactly that is.

JohnAZoidberg mentioned this issue Nov 10, 2020

Implement decoding of Unicode characters #125

Closed

This was referenced Aug 9, 2024

Chinese characters cannot be displayed correctly #146

Open

Implement encoding and decoding of text strings (PDF1.7 section 7.9.2.2) #297

Merged

Heinenen closed this as completed Aug 13, 2024
