Limitation of the tool on modified Word documents #2

Closed
catalan-adobe opened this issue Sep 21, 2023 · 1 comment · Fixed by #14

Comments

@catalan-adobe

Context

For the adobe.com migration to Franklin, I received a request to bulk-modify a set of existing Word documents by adding a missing variant label to some blocks.
I first looked at existing libraries for grep/replace actions (such as https://github.com/nguyenthenguyen/docx), but I quickly got blocked by the "WordprocessingML fragmentation behaviour" (best explained here: https://github.com/lukasjarosch/go-docx#overview).
Long story short, when you edit a Word document there are many circumstances in which the text gets arbitrarily fragmented in the XML structure.

Steps to reproduce

Example:

1. Create a new Word document

Type in "I now now ..."
Save it
Extract the XML:

[...]
	      <w:r>
	        <w:rPr>
	          <w:lang w:val="de-CH"/>
	        </w:rPr>
	        <w:t>I now now …</w:t>
	      </w:r>
[...]

All good!

2. Edit the text

Modify the text to "I now know ..." (just add a "k")
Save it
Extract the XML:

[...]
	      <w:r>
	        <w:rPr>
	          <w:lang w:val="de-CH"/>
	        </w:rPr>
	        <w:t xml:space="preserve">I now </w:t>
	      </w:r>
	      <w:r w:rsidR="00D600D1">
	        <w:rPr>
	          <w:lang w:val="de-CH"/>
	        </w:rPr>
	        <w:t>k</w:t>
	      </w:r>
	      <w:r>
	        <w:rPr>
	          <w:lang w:val="de-CH"/>
	        </w:rPr>
	        <w:t>now …</w:t>
	      </w:r>
[...]

From that point on, a grep for "know" no longer matches, because the word is now split across two runs: "k" sits in one w:t element and "now …" in the next.
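The failure is easy to reproduce outside Word. The following is a minimal Python sketch, assuming a simplified copy of the fragment above (run properties omitted, wrapped in a w:p element); the names FRAGMENT and visible_text are made up for illustration and are not part of the tool.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified stand-in for the edited paragraph (assumption: w:rPr omitted).
FRAGMENT = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">I now </w:t></w:r>'
    '<w:r w:rsidR="00D600D1"><w:t>k</w:t></w:r>'
    '<w:r><w:t>now …</w:t></w:r>'
    '</w:p>'
)


def visible_text(xml: str) -> str:
    """Concatenate the content of every w:t element, in document order."""
    root = ET.fromstring(xml)
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))


# A plain substring search on the raw XML misses the word...
found_in_raw_xml = "know" in FRAGMENT
# ...while the same search on the joined run text finds it.
found_in_text = "know" in visible_text(FRAGMENT)

print(found_in_raw_xml, found_in_text)
```

Searching the joined text is enough to detect a match, but writing a replacement back requires knowing which run each character came from.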

Solution?

I don't see any simple solution. I checked Word and could not find any command to simplify or remove such fragmentation.
I think we should at minimum communicate this limitation, as it can significantly affect bulk editing operations: you cannot be sure that every occurrence was found and replaced.

@bosschaert
Contributor

Hi @catalan-adobe ! Thanks for filing the issue.

Yes, the tool works at the XML level, and sometimes a word is split over multiple XML tags, in which case search and replace doesn't find it. I think it's only an issue with replace, not with replace-links.
So yes, if you're using replace to replace text, you need to double-check the result for now. I think it's potentially possible to fix this by mapping the text to a single string for the search and then mapping the replacement back to the original XML tags, but that's not trivial.
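To make the idea concrete, here is a minimal Python sketch of that mapping, under stated assumptions: the text of each w:t element has already been extracted into a list of strings, and the replacement is simply placed in the fragment where the match starts. The function name replace_across_fragments is hypothetical, and this is not necessarily how the tool implements it.

```python
from typing import List


def replace_across_fragments(
    fragments: List[str], search: str, replacement: str
) -> List[str]:
    """Replace `search` even when it spans several text fragments.

    Returns a list with the same number of fragments, so each result
    can be written back into the XML tag it came from.
    """
    if not search:
        return list(fragments)

    joined = "".join(fragments)

    # Collect the start offsets of all non-overlapping matches.
    matches = []
    i = joined.find(search)
    while i != -1:
        matches.append(i)
        i = joined.find(search, i + len(search))

    result = []
    f_start = 0
    for frag in fragments:
        f_end = f_start + len(frag)
        out = []
        cursor = f_start
        for m in matches:
            m_end = m + len(search)
            if m_end <= f_start or m >= f_end:
                continue  # this match does not touch the fragment
            if m > cursor:
                out.append(joined[cursor:m])  # text before the match
            if f_start <= m:
                out.append(replacement)  # match starts in this fragment
            cursor = min(m_end, f_end)  # skip the matched characters
        out.append(joined[cursor:f_end])  # text after the last match
        result.append("".join(out))
        f_start = f_end
    return result


print(replace_across_fragments(["I now ", "k", "now …"], "know", "understand"))
```

Keeping one output string per input fragment means the run structure and formatting of the document stay untouched; only the text inside the existing tags changes. Putting the whole replacement in the first affected fragment is a simplification chosen for this sketch.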

bosschaert added a commit that referenced this issue Jan 15, 2024
Replace will now work across multiple tags in the .xml file that
contains the source of the .docx file. The replacement is spread across
those tags on output.
Additional unit tests added.

Fixes #2 and #11