This program takes the messy auto-generated HTML that Google Docs produces when you download a file in HTML format and turns it into clean, usable HTML. I built it mainly for my tech blog, but I decided to make it public as others may find it helpful. That being said, the program cannot clean all HTML elements yet (such as tables and videos). More about what it can clean is below.
In an effort to make this program modular and expandable, I refactored the tiny script into a package (gdtch) that uses a mixin design pattern so that (in theory) it's easy to add more features! If you find this project useful, consider contributing to it!
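As a rough illustration of the mixin idea (not the package's actual source), each group of cleaning steps can live in its own small class, and one main class inherits from all of them and owns the shared list. Every class and method name below is made up for the example, and the elements are plain strings to keep it short.

```python
class TextMixin:
    """Hypothetical mixin: steps that tidy text."""

    def strip_whitespace(self):
        self.elements = [e.strip() for e in self.elements]


class EmptyMixin:
    """Hypothetical mixin: steps that drop elements."""

    def drop_empty(self):
        self.elements = [e for e in self.elements if e]


class MiniCleaner(TextMixin, EmptyMixin):
    """Owns the shared state; every mixin method works on self.elements."""

    def __init__(self, elements):
        self.elements = list(elements)


cleaner = MiniCleaner(["  <p>Hi</p> ", "   ", "<h1>Title</h1>"])
cleaner.strip_whitespace()
cleaner.drop_empty()
print(cleaner.elements)
```

Adding a feature then means writing one more mixin and adding it to the base class list.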
First, you need to import the main class:
from gdtch import Cleaner
Then you need to create an instance of the object. The only argument it takes is the path to the HTML file, the one you downloaded from Google Docs. To download a Google Document as an HTML file, go to File > Download > Web Page. If you use auto-detect markdown mode, you may get errors (I haven't tested with the Google Docs auto-detect markdown preference yet).
cleaner = Cleaner('something_like_this/example.html')
When creating an instance, the Cleaner object will read the HTML file and put every child element in the body into a list called self.elements. The HTML elements are put into a Python list because this makes future cleaning operations easier. It also allows other methods to easily iterate through the list, add elements above and below, and join elements together.
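To show why a flat list is convenient, here is a small sketch that pulls the children of the body into a list and inserts a new element between two of them. It uses the standard library's xml.etree on well-formed markup purely for illustration; that choice is an assumption for the sketch and not how the package parses files.

```python
import xml.etree.ElementTree as ET

doc = "<html><body><h1>Title</h1><p>One</p><p>Two</p></body></html>"

# Every direct child of <body> becomes one slot in the list.
body = ET.fromstring(doc).find("body")
elements = list(body)

# With a plain list, adding an element above another is just insert().
elements.insert(1, ET.Element("hr"))

tags = [el.tag for el in elements]
print(tags)
```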
Currently, this project supports the following types of Google Document styling. The list is limited for now since these are the only things that I use for my website, but hopefully, by open-sourcing this, the feature set will improve. Below are the things that are currently supported:
To use the program, you can pick from the following methods to choose which parts of the HTML to clean. As more features are added, you can clean more!
Remove the top part of the document. I like to make an outline on the top part of my article documents and then divide the main article from the outline with a horizontal line.
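A minimal sketch of that idea, assuming the elements are plain strings and the divider is the first hr element (the function name and the string representation are illustrative only):

```python
def remove_top(elements, element_break="hr"):
    """Drop everything up to and including the first divider element."""
    for index, element in enumerate(elements):
        if element.startswith(f"<{element_break}"):
            return elements[index + 1:]
    # No divider found: leave the document untouched.
    return elements


article = remove_top(["<p>Outline</p>", "<hr>", "<h1>Title</h1>", "<p>Body</p>"])
print(article)
```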
Get rid of the Google Docs generated attributes. This method actually calls three more methods:
- self.remove_junk_attrs(exclude={'a', 'img'})
- self.clean_a_tag_attrs()
- self.clean_img_tag_attrs()
a tags and img tags are special because we want to keep the src attribute of the image element and parse the a tag attributes to extract the clean URL for the href attribute.
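For the href part, exported links typically point at a google.com/url redirect that carries the real address in its q parameter. The sketch below shows one way to pull the clean URL back out with the standard library; the function name is hypothetical and the package may do this differently.

```python
from urllib.parse import parse_qs, urlparse


def clean_google_redirect(href: str) -> str:
    """Return the real target of a Google redirect link, else the href as-is."""
    parsed = urlparse(href)
    if parsed.hostname in ("google.com", "www.google.com") and parsed.path == "/url":
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href


wrapped = "https://www.google.com/url?q=https://jacobpadilla.com/articles&sa=D&source=editors"
print(clean_google_redirect(wrapped))
```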
Add one or more classes to an element.
Add a target attribute to all of the outgoing links. For example, if I want to open all links that aren't to jacobpadilla.com, I could set origin = "jacobpadilla.com" and target to "_blank".
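The check for what counts as outgoing could look something like the sketch below. It treats relative links, fragments, and subdomains of the origin as internal; those rules are assumptions for the example, and is_outgoing is not a function from the package.

```python
from urllib.parse import urlparse


def is_outgoing(href: str, origin: str) -> bool:
    """True when the link points at a host other than the origin."""
    host = urlparse(href).hostname
    if host is None:
        # Relative paths and fragments stay on the same site.
        return False
    return host != origin and not host.endswith("." + origin)


links = ["https://jacobpadilla.com/articles", "https://example.com", "#intro"]
outgoing = [link for link in links if is_outgoing(link, "jacobpadilla.com")]
print(outgoing)
```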
Remove the pesky span HTML elements that Google Docs puts everywhere.
Remove extra space around the text. If elements such as an a tag are inside of a p tag, the text inside of the a tag will still be encoded! This method also makes the quotes curly and removes extra white space between the text and the end of the p tag.
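As a rough idea of the quote and whitespace handling, the sketch below strips the surrounding space and curls straight quotes with a simple rule: a quote at the start or after whitespace or an opening bracket opens, anything else closes. Both the rule and the function names are assumptions for illustration, not the package's logic.

```python
import re


def curl_quotes(text: str) -> str:
    """Turn straight quotes into curly ones using a simple position rule."""
    text = re.sub(r'(^|[\s(\[{])"', r'\1“', text)  # opening double quotes
    text = text.replace('"', '”')                   # the rest are closing
    text = re.sub(r"(^|[\s(\[{])'", r"\1‘", text)  # opening single quotes
    text = text.replace("'", "’")                   # closing quotes, apostrophes
    return text


def tidy(text: str) -> str:
    """Strip surrounding whitespace, then curl the quotes."""
    return curl_quotes(text.strip())


print(tidy('  He said "it\'s fine"  '))
```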
I generally have a table of contents on my articles, and to make those, you need to set the table of contents links to a fragment of the article. This is what I use this method for. It takes the text inside the h tags, replaces the spaces with dashes, makes all of the text lowercase, and then sets that value as the id of the h tag. If two or more of the headers are the same, you can set add_incrementing_number to True.
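The id generation described above can be sketched as follows. How the package numbers repeated headers is not stated, so the suffix format here (a dash and a counter on the second and later duplicates) is an assumption, as is the function name.

```python
def header_ids(headings, add_incrementing_number=False):
    """Build an id for each heading: lowercase, spaces replaced with dashes."""
    seen = {}
    ids = []
    for text in headings:
        slug = text.strip().lower().replace(" ", "-")
        count = seen.get(slug, 0)
        seen[slug] = count + 1
        # Assumed numbering scheme: only repeats get a suffix.
        if add_incrementing_number and count:
            slug = f"{slug}-{count}"
        ids.append(slug)
    return ids


print(header_ids(["Getting Started", "Example", "Example"], add_incrementing_number=True))
```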
Replace text between two backticks with code elements. See the image above for more info.
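A minimal sketch of the substitution, assuming a regular expression over the element's text (the package may walk the element tree instead):

```python
import re


def inline_code(text: str) -> str:
    """Wrap each backtick-delimited run in a code element."""
    return re.sub(r"`([^`]+)`", r"<code>\1</code>", text)


print(inline_code("Call `print()` and then `exit()`."))
```

A lone, unpaired backtick does not match the pattern and is left as it is.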
I use Highlight.js to add colors to the multiline code blocks on my website. This method will transform a code block into a single element (it takes up one slot in the self.elements Python list). Abiding by the Highlight.js documentation, this method wraps the code block in pre and code elements and adds the following classes to the code element: language-[YOUR LANGUAGE] hljs.
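The wrapping step might look roughly like this. How the package finds the start and end of a code block in the document is not covered here, and wrap_code_block is a made-up name; the sketch only shows the resulting markup.

```python
from html import escape


def wrap_code_block(lines, language):
    """Join the code lines into one pre/code element with Highlight.js classes."""
    body = escape("\n".join(lines), quote=False)  # keep < and & safe inside HTML
    return f'<pre><code class="language-{language} hljs">{body}</code></pre>'


block = wrap_code_block(["if a < b:", "    print(a)"], "python")
print(block)
```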
Turn the curly quotes inside of the code blocks into straight ones so that Highlight.js works.
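One simple way to do this is a translation table, sketched here with a hypothetical helper name:

```python
# Map the four common curly quote characters to their straight versions.
STRAIGHTEN = str.maketrans({
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
})


def straighten_quotes(code: str) -> str:
    return code.translate(STRAIGHTEN)


print(straighten_quotes("print(\u201chi\u201d)"))
```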
Remove empty tags. Google Docs will make an empty p for things like blank lines.
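A small sketch of the filtering, assuming string elements and treating a tag as empty when it has an opening and closing tag with only whitespace in between. Elements without a closing tag, such as img, are kept. The pattern and function name are illustrative.

```python
import re

# Matches e.g. <p></p> or <p class="c1"> </p>, but not <p>Hi</p> or <img ...>.
EMPTY_TAG = re.compile(r"<(\w+)[^>]*>\s*</\1>")


def remove_empty(elements):
    return [el for el in elements if not EMPTY_TAG.fullmatch(el)]


kept = remove_empty(["<p></p>", "<p>Hi</p>", '<img src="a.jpg">', '<p class="c1"> </p>'])
print(kept)
```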
If you have images in your Google document, when you download the HTML, they will be stored in a directory called images and the src of the HTML img elements will be `images/image[number].jpg`. This is not always what you want, hence this method.
Pass in a template and let this method update all of the img tag sources.
The template is a Python string. The original source is stored in a variable called original that you can use in the template.
Example: articles/{original} will output `articles/images/image[number].jpg`
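The template behaves like an ordinary Python format string with one named field. A sketch of that substitution, with a hypothetical helper name:

```python
def apply_template(path_template: str, original: str) -> str:
    """Fill the {original} placeholder with the image's current src."""
    return path_template.format(original=original)


print(apply_template("articles/{original}", "images/image1.jpg"))
```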
Allows for bold text. Google Docs uses CSS to make certain text bold. This method will look for the specific styles and replace the span tag (which has the bold styling) with a strong tag. This method must be called before cleaner.clean_element_attributes() as it will remove all of the attributes that are used for the CSS.
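To make the ordering constraint concrete, here is a sketch that assumes the export marks bold text with a class whose CSS rule sets font-weight to 700. That detail, the regular expressions, and the function names are all assumptions for the example; nested spans are not handled.

```python
import re


def bold_classes(css: str) -> set:
    """Collect class names whose rule contains font-weight:700."""
    return set(re.findall(r"\.(\w+)\{[^}]*font-weight:\s*700", css))


def add_strong(html: str, css: str) -> str:
    bold = bold_classes(css)

    def swap(match):
        classes = match.group(1).split()
        if bold.intersection(classes):
            return f"<strong>{match.group(2)}</strong>"
        return match.group(0)  # not bold: leave the span alone

    return re.sub(r'<span class="([^"]*)">(.*?)</span>', swap, html)


css = ".c1{color:#000}.c2{font-weight:700}"
html = '<p><span class="c1">plain </span><span class="c1 c2">bold</span></p>'
print(add_strong(html, css))
```

Once the class attributes are stripped, there is nothing left to match against the CSS, which is why this step has to run first.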
Add an element above another. This method will add an lxml.html.HtmlElement one slot above another element in the self.elements list. I use this to add a br tag above my img tags because I never added margin-top to the images on my website :)
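With a list of elements, the operation boils down to emitting the extra element right before each match. The sketch below uses strings instead of lxml elements to stay dependency-free, and add_above is a made-up name.

```python
import re


def add_above(elements, tag_type, new):
    """Return a new list with `new` placed one slot above every matching tag."""
    opening = re.compile(rf"<{re.escape(tag_type)}[\s/>]")
    result = []
    for element in elements:
        if opening.match(element):
            result.append(new)
        result.append(element)
    return result


page = add_above(["<p>Text</p>", '<img src="a.jpg">'], "img", "<br>")
print(page)
```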
Uses the heading tags to create a more descriptive image name. It also adds a random number to the image names to ensure that a client's browser doesn't use a cached image on a page update.
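A sketch of how such a name could be built from the nearest heading. The slug rules, the four-digit range, and the function name are assumptions; the package's actual naming scheme may differ.

```python
import random
import re


def unique_image_name(heading: str, extension: str) -> str:
    """Build a descriptive file name plus a random number to defeat caching."""
    slug = re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")
    return f"{slug}-{random.randint(1000, 9999)}.{extension}"


name = unique_image_name("How Mixins Work", "jpg")
print(name)  # the number changes on every run
```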
Write all of the lines in self.elements to a new HTML file if file_path is set; if it's not set, return the content as a string. This will format the elements with indentation so that, in my opinion, it's easier to read.
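The write-or-return behavior can be sketched like this. The indentation here is a single fixed prefix per element, which is a simplification and an assumption; the standalone function and its indent parameter are hypothetical.

```python
from pathlib import Path


def pretty_save(elements, file_path=None, indent="    "):
    """Write the elements to file_path, or return them as a string if unset."""
    content = "\n".join(indent + element for element in elements) + "\n"
    if file_path is None:
        return content
    Path(file_path).write_text(content, encoding="utf-8")
    return None


print(pretty_save(["<h1>Title</h1>", "<p>Body</p>"]))
```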
A basic configuration would be something such as:
from gdtch import Cleaner
HTML_FILE = ''
cleaner = Cleaner(HTML_FILE)
cleaner.clean_element_attributes()
cleaner.remove_span_tags()
cleaner.clean_text()
cleaner.remove_empty_tags()
cleaner.pretty_save(file_path='./clean_html.html')
This is the configuration that I use for the articles on my website:
from gdtch import Cleaner
HTML_FILE = ''
cleaner = Cleaner(HTML_FILE)
cleaner.remove_top_of_document(element_break='hr')
cleaner.add_strong_tags()
cleaner.clean_element_attributes()
cleaner.add_class_to_element(element='a', class_attr='blue-link')
cleaner.add_target_to_outgoing_links(origin="jacobpadilla.com", target='_blank')
cleaner.remove_span_tags()
cleaner.clean_text()
cleaner.generate_header_id_attributes()
cleaner.insert_inline_code()
cleaner.insert_highlightjs_code_blocks()
cleaner.remove_empty_tags()
cleaner.give_images_unique_names()
cleaner.alter_image_attributes(path_template='articles/example/{original}')
cleaner.add_element_above_tag_type(type='img', add='<br>')
cleaner.pretty_save(file_path='./clean_html.html')
REQUIRES PYTHON 3.12 to use all of the features. The only part that needs this version is the alter_image_attributes method due to the itertools.batched function.
Clone the repo (and star it)
$ git clone https://github.com/jpjacobpadilla/Google-Docs-To-Clean-HTML.git
Make a Python environment and pip install the package in editable mode so that you can easily make changes to the source code.
$ python -m venv venv
$ source venv/bin/activate
$ pip install -e Google-Docs-To-Clean-HTML/
Make a Python file, like the example ones in the example directory, and then run it!
Contributions are welcome! If you have a suggestion or an issue, please use the issue tracker to let us know.
You can also contact me here.