Get your cloud data on your terms.
Cloud Scraper is an open source tool that lets you download your data from cloud services on a regular basis to limit your exposure to failure or data loss. It's an easy-to-use command line tool that runs on any platform.
I saw a very enlightening post on academic Mastodon, and, though I'm unable to find it now, it made a very good point. The gist was that we should be wary of cloud services, and we should back up any data of ours that they hold regularly.
One line that stuck with me was the lecturer asking her students (with her own hand up), "Hands up, who has backed up their Google Docs in the last 24 hours?"
I certainly wouldn't do this if I had to remember to do it. I also think that if I had a hackable pluggable solution to do this automatically, I might start to find some cute new things to do with the data that I generate.
Crucially, anything I decide to do with my data will be on my hardware and my terms.
The motive to action for me is that I've been using Google Keep more often recently for journaling & whatnot. I think it's becoming quite useful to me. I also think that Google could pull the plug on it at any time, and I'd lose all of my data. I don't want that to happen.
I'm writing it in Rust because this is a personal project and I like Rust.
The project is at the ideation/solution design stage; there is no working implementation at the time of writing.
Figure 1 shows the flow of data in the proposed solution. It uses Google as an example, but the same flow applies to any cloud service.
The plan is to introduce modules for each cloud service, data store or data transformation. These each have configuration controllable by the user.
The modules are to be implemented as nodes that pass events to each other.
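As a loose sketch of what that event-passing design could look like in Rust (the `Event` and `Node` names below are illustrative, not taken from any actual code):

```rust
// Illustrative sketch only: these names (Event, Node, StdoutSink) are not
// taken from the real codebase.

/// An event passed between nodes, e.g. "a document changed upstream".
#[derive(Clone, Debug)]
struct Event {
    source: String,
    payload: String,
}

/// A module in the pipeline: a cloud-service source, a transformation,
/// or a data store sink.
trait Node {
    /// Handle an incoming event and emit zero or more events downstream.
    fn on_event(&mut self, event: Event) -> Vec<Event>;
}

/// A trivial sink that writes payloads to stdout.
struct StdoutSink;

impl Node for StdoutSink {
    fn on_event(&mut self, event: Event) -> Vec<Event> {
        println!("[{}] {}", event.source, event.payload);
        Vec::new()
    }
}

fn main() {
    let mut sink = StdoutSink;
    sink.on_event(Event {
        source: "google_docs".into(),
        payload: "doc-123 updated".into(),
    });
}
```

Each concrete module would implement the same trait, so sources, transformations and stores can be wired together through whatever configuration the user supplies.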
I intend to implement Google Docs and Tasks initially, because these are the services I think pose the biggest risk for me. By risk, I mean the combination of the probability and the impact of my tasks and documents becoming unavailable.
Originally, I wanted to do this with Keep, but that API is restricted to enterprise users, and building a scraper that uses headless web pages to pull the information is a large task for a first implementation.
You'll need the `openssl` development library. You probably already have it installed if you're doing any other development in a systems programming language; if not, you'll need to install it.
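On Debian or Ubuntu, for example, the OpenSSL development headers are typically provided by the libssl-dev package, and Rust crates that link against OpenSSL usually also want pkg-config:
# Example for Debian/Ubuntu; package names differ on other platforms
sudo apt-get install libssl-dev pkg-config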
Beyond that, it's a pretty normal Rust project.
cargo build # build
cargo test # run tests
Coverage uses `llvm-tools-preview` and `grcov`. You can install `llvm-tools-preview` with `rustup component add llvm-tools-preview`, and `grcov` with `cargo install grcov`.
You can measure test coverage with the following commands:
CARGO_INCREMENTAL=0 RUSTFLAGS='-Cinstrument-coverage' \
LLVM_PROFILE_FILE='cargo-test-%p-%m.profraw' cargo test
grcov . --binary-path ./target/debug/deps/ -s . -t html --branch --ignore-not-existing \
--ignore '../*' --ignore "/*" -o target/html
Cloud Scraper is configured with a YAML file. You can generate your own by running the CLI wizard:
cargo run config
You can either use the default `config.yaml`, as above, or specify your own file name with a command line argument:
cargo run config -- -c my-config.yaml
You can then run the service with the configuration file using the same argument:
cargo run -- -c my-config.yaml
The service only requires a `config.yaml` file if you're going to configure a domain. The minimum contents are:
email: your@email.address
But this is only required if you add a domain configuration:
domain_config:
  tls_config:
    builder_contacts: [ "an@other.email", "your@email.address" ]
    poll_attempts: 10
    poll_interval_seconds: 10
  url: https://your.domain.com
email: your@email.address
If a domain is configured, the service will check for a root certificate; if a valid one is not found, it will request one from Let's Encrypt. The poll_attempts and poll_interval_seconds parameters control how the service retries when retrieving a certificate.
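For illustration, a retry loop of that general shape might look like this in Rust. This is only a guess at how those two parameters are used, not the project's actual certificate code, and `request_certificate` is a stand-in:

```rust
use std::{thread, time::Duration};

// Illustrative stand-in for the real Let's Encrypt request.
fn request_certificate() -> Result<(), String> {
    Err("certificate not issued yet".to_string())
}

/// Retry certificate retrieval up to `poll_attempts` times,
/// sleeping `poll_interval_seconds` between attempts.
fn poll_for_certificate(poll_attempts: u32, poll_interval_seconds: u64) -> Result<(), String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=poll_attempts {
        match request_certificate() {
            Ok(()) => return Ok(()),
            Err(e) => {
                eprintln!("attempt {attempt}/{poll_attempts} failed: {e}");
                last_err = e;
                thread::sleep(Duration::from_secs(poll_interval_seconds));
            }
        }
    }
    Err(last_err)
}

fn main() {
    // Mirrors the example configuration: 10 attempts, 10 seconds apart.
    if let Err(e) = poll_for_certificate(10, 10) {
        eprintln!("giving up: {e}");
    }
}
```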
By default, the service listens on port 80 unless you configure TLS. You can change the port in the configuration file by including it in your url:
domain_config:
  url: http://localhost:1234
Or by command line argument.
cargo run -- -p 1234
First, you should set a root password. There is only one password, and no user account, because this system is single-tenant by design.
cargo run root-password
You can then run the service.
cargo run
You can get log information by setting the `RUST_LOG` environment variable.
RUST_LOG=debug cargo run
Of course you can also run the binary directly.
./target/debug/cloud_scraper
Linux usually doesn't let a non-root user bind to privileged ports like 80 or 443. You can use the following command to allow this.
# If needed, install setcap, for example in Ubuntu: sudo apt-get install libcap2-bin
sudo setcap cap_net_bind_service=+ep ./target/debug/cloud_scraper
Remember that after running the above, the permission applies to the `cloud_scraper` binary, not to `cargo`. Using `cargo run` will not work with the above permission.