google_takeout_parser/split_html at master · purarue/google_takeout_parser


This is for splitting the old HTML format; you shouldn't use it for new exports.

This directory contains a Go script that splits an HTML file into smaller chunks, so it's possible to parse on machines with limited memory.

In particular, I had issues with Termux on my phone: parsing the ~100MB takeout HTML files loads the whole file into memory, and my terminal would crash when it ran out of memory.

So, this script splits the HTML files into lots of smaller chunks, like:

MyActivity-001.html
MyActivity-002.html
MyActivity-003.html
MyActivity-004.html
MyActivity-005.html
MyActivity-006.html
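
As a rough illustration of the splitting idea (not the actual code in split_html_activity.go), the sketch below groups a list of cells into fixed-size chunks and derives zero-padded file names like the ones above. The helpers `chunkCells` and `chunkName` are hypothetical, and the sketch assumes the cells have already been extracted from the HTML as strings.

```go
package main

import "fmt"

// chunkCells groups cells into slices of at most count items each.
// Hypothetical helper: the real script may split differently.
func chunkCells(cells []string, count int) [][]string {
	if count <= 0 {
		return nil
	}
	var chunks [][]string
	for start := 0; start < len(cells); start += count {
		end := start + count
		if end > len(cells) {
			end = len(cells)
		}
		chunks = append(chunks, cells[start:end])
	}
	return chunks
}

// chunkName builds a zero-padded output name such as MyActivity-001.html.
func chunkName(base string, index int) string {
	return fmt.Sprintf("%s-%03d.html", base, index)
}

func main() {
	// Five placeholder cells split two per file yield three chunks.
	cells := []string{"a", "b", "c", "d", "e"}
	for i, chunk := range chunkCells(cells, 2) {
		fmt.Println(chunkName("MyActivity", i+1), len(chunk))
	}
}
```

Because each chunk holds a bounded number of cells, a parser only ever needs to load one small file at a time.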

To build: go build -o split_html

Usage: ./split_html [options] input
  -count int
    	how many cells to split into each file (default 1000)
  -output string
    	output directory. if not specified, will use the directory of the input file
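
For readers curious how options like these are typically wired up in Go, here is a minimal sketch using the standard `flag` package. It is an assumption about the general approach, not the script's actual source; the `options` struct and `parseArgs` function are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

// options holds the parsed command-line settings (hypothetical struct).
type options struct {
	count  int
	output string
	input  string
}

// parseArgs parses args (without the program name) into options.
func parseArgs(args []string) (options, error) {
	fs := flag.NewFlagSet("split_html", flag.ContinueOnError)
	count := fs.Int("count", 1000, "how many cells to split into each file")
	output := fs.String("output", "", "output directory. if not specified, will use the directory of the input file")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	if fs.NArg() != 1 {
		return options{}, fmt.Errorf("expected exactly one input file, got %d", fs.NArg())
	}
	opts := options{count: *count, output: *output, input: fs.Arg(0)}
	if opts.output == "" {
		// Fall back to the directory that contains the input file.
		opts.output = filepath.Dir(opts.input)
	}
	return opts, nil
}

func main() {
	opts, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("count=%d output=%s input=%s\n", opts.count, opts.output, opts.input)
}
```

Note that the standard `flag` package stops parsing at the first non-flag argument, which is consistent with the options coming before the input in the usage line.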

Then, use it against any large files that you have problems parsing:

./split_html ~/data/takeout/something/MyActivity/YouTube/MyActivity.html
# move the original file somewhere else
mv ~/data/takeout/something/MyActivity/YouTube/MyActivity.html /tmp
# test parsing to make sure they still work
google_takeout_parser merge -a summary ~/data/takeout/something

This splits the 100MB+ HTML files into dozens of small files of about 700K each.

I personally created copies of all of my HTML exports, and did:

find ~/Downloads/takeout/ -name 'MyActivity.html' -exec ./split_html "{}" \;
find ~/Downloads/takeout/ -name 'MyActivity.html' -delete

I then used google_takeout_parser merge -a summary to compare the new and old outputs before removing the old files.
