An ugly yet simple script that gets the job done: it converts `docx` to markdown using pandoc.
I have a multi-level folder that contains many doc/docx files mixed in with other file types.
And I don't want to fire up a word processor every time I need to see what is in a doc/docx file.
So I know I have to convert these Word files to a text-based format. Yes, markdown, of course, the most widely used and simplest markup format of humanity :3. And yes, pandoc, a great tool for converting one file format to another.
- A shell that supports text substitution.
- GNU `sed` (`-i` for in-place replacement). Ships with most Linux distros.
- LibreOffice.
- Perl (> `v5.x.x`), which comes with the Unicode module.
- Pandoc, of course.
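A quick way to see whether the tools listed above are installed is sketched below. It is not part of the script: `missing_commands` is a made-up helper name, and the check only looks for the commands on `PATH` (it does not verify that `sed` is the GNU flavour).

```shell
# Print each given command name that cannot be found on PATH.
# (Hypothetical helper, not taken from the original script.)
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

missing=$(missing_commands sed perl pandoc)

# The LibreOffice launcher may be installed as libreoffice or as soffice,
# so it only counts as missing when neither name is found.
if [ -n "$(missing_commands libreoffice)" ] && [ -n "$(missing_commands soffice)" ]; then
  missing="$missing libreoffice"
fi

if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing >&2
fi
```

Nothing is printed when every tool is found.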
- Walks through the folder and all of its sub-folders, however deep.
- Checks every file for a `doc` or `docx` extension:
  - If the file name contains any whitespace, replaces it with a hyphen (`-`). I hate whitespace in paths!
  - If the file name contains any Latin-based character that is not ASCII (I mean Unicode), converts it to a US-ASCII character.
    - What about non-Latin characters, such as CJK? Eh, I haven't figured that out yet.
  - Lower-cases the file name. Yes, paths with no spaces and all lower case. I love it.
  - If the file is a `doc`, converts it to `docx` with the `doc2docx` function. This stage calls `libreoffice` (or `soffice`) directly; you can also use unoconv as an alternative. Then calls the `docx2md` function.
    - Removes the converted `doc` file.
  - Creates one media folder per `docx` file to hold every media file (jpeg, png, ...) extracted from the `docx` being converted. The folder is named `${the-docx-file-name}-media`.
  - Calls the `docx2md` function, which converts the `docx` file to an `md` file and rewrites the absolute paths of any extracted media files as relative paths.
    - Removes the converted `docx` file.
- Exits the loop.
Unfortunately, you have to delete any empty media folders manually.
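
If doing that by hand gets tedious, something like the following should clean them up. It is a sketch that is not part of the script: `remove_empty_media_dirs` is a made-up name, and `-empty` and `-delete` are GNU/BSD `find` extensions rather than POSIX.

```shell
# Delete every empty directory whose name ends in -media below the given
# folder (default: the current one). Non-empty media folders are left alone.
remove_empty_media_dirs() {
  find "${1:-.}" -type d -name '*-media' -empty -delete
}

# Usage, from the top of the converted tree:
# remove_empty_media_dirs .
```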
That's it.
I don't have much time to play with pandoc or Pandoc templates, although I think the output could be nicer.