Using IPFS in the OSM infrastructure #388

Closed
RubenKelevra opened this issue Mar 30, 2020 · 25 comments


@RubenKelevra

Hey guys, hope you're all doing fine in the current situation & as a long-time mapper I'd like to thank you for all the work you put into this project! :)

Interplanetary Filesystem

IPFS is a network protocol that allows exchanging data efficiently in a worldwide mesh network. Content is addressed by a Content-ID (CID) - by default a SHA-256 hash - which ensures that the content wasn't altered.

All interconnections are dynamically established and terminated, based on the requests you make to the daemon and on queries in the global Distributed Hash Table (DHT), which is used to resolve Content-IDs to peers, and peers to IP addresses and ports.
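
To illustrate the content-addressing idea, here is a minimal Python sketch. A real CID is a multihash with version and codec prefixes, but the core principle - "the address is a hash of the data itself" - can be shown with plain SHA-256:

```python
import hashlib

def content_address(data: bytes) -> str:
    # Stand-in for a CID: the hex SHA-256 of the raw bytes.
    # (Real CIDs add multihash/codec/version prefixes on top of this.)
    return hashlib.sha256(data).hexdigest()

original = b"planet-200330.osm.pbf contents ..."
tampered = b"planet-200330.osm.pbf contents, one byte flipped"

print(content_address(original))  # the address the publisher announces
print(content_address(tampered))  # completely different -> the download is rejected
```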

Storage concept

There are multiple data types, but the most interesting one for this use case is UnixFS (files and folders). The Content-ID of a folder is immutable and thus ensures that all data inside the folder can be verified after it has been received.

IPFS has a built-in 'name system' (IPNS) that lets you assign a static ID (the public key of an RSA or ed25519 key) and point it at changing content. This way you can switch a link from one folder version to a different folder version atomically. On a web gateway, the static IDs are accessed through /ipns/ and the content IDs through /ipfs/.
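
As a rough sketch of that publish flow (the directory path is hypothetical; `ipfs add -r` and `ipfs name publish` are the standard go-ipfs commands for this):

```python
import subprocess

def publish_folder(path: str) -> str:
    # Add the folder recursively; -Q prints only the root CID.
    out = subprocess.run(["ipfs", "add", "-r", "-Q", path],
                         capture_output=True, text=True, check=True)
    root_cid = out.stdout.strip()
    # Repoint this node's IPNS name at the new folder version (an atomic switch).
    subprocess.run(["ipfs", "name", "publish", f"/ipfs/{root_cid}"], check=True)
    return root_cid

cid = publish_folder("planet/2020-03-30")   # hypothetical directory
print(f"https://ipfs.io/ipfs/{cid}")        # immutable link to exactly this version
```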

An example page via a CID-Link on a Gateway

Software for end-users

But you don't need a gateway to access such URLs: there are browser plugins (for Firefox and Chrome) which can resolve and access them directly, desktop clients (for Windows / Linux / macOS), and a wget replacement which uses IPFS directly to fetch the URL.

Backwards compatibility

You can offer a webpage which is accessible via HTTP(S) and IPFS at the same time. The browser plugins automatically detect whether a webpage has a DNSLink entry and will switch to IPFS. All IPFS project pages, for example, are stored on an IPFS cluster, served by a regular web server, and can also be fetched by the browser plugins.

On the website itself, you can link a URL to one of the web gateways, allowing users with regular browsers to access the data without having to install anything.

If the link points to a folder, it looks like this dataset.
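
As an illustration of how those fallback links could look (the CID and the DNSLink domain are placeholders; only the standard /ipfs/ and /ipns/ gateway paths are assumed):

```python
EXAMPLE_CID = "QmExampleCid..."   # hypothetical
GATEWAY = "https://ipfs.io"

# Immutable link to one specific planet file version:
immutable_link = f"{GATEWAY}/ipfs/{EXAMPLE_CID}/planet-latest.osm.pbf"

# Mutable link via DNSLink (domain is hypothetical); the DNSLink itself is just
# a TXT record, e.g.:
#   _dnslink.planet.example.org.  TXT  "dnslink=/ipfs/QmExampleCid..."
dnslink_link = f"{GATEWAY}/ipns/planet.example.org"

print(immutable_link)
print(dnslink_link)
```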

Cluster

IPFS alone does not guarantee data replication; everything is just stored locally for other clients to access. To achieve data replication, you need the cluster daemon. It maintains a set of elements (pins) and lets you add or remove them. Each element can be tagged with an expiry time (after which it will be removed automatically), a minimum replication factor, and a maximum replication factor.

The maximum sets the number of copies created on add, while a drop below the minimum automatically triggers additional replication of the data.
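
For illustration, pinning one file with those settings could look roughly like this (the CID and the values are placeholders; the flag names should be double-checked against `ipfs-cluster-ctl pin add --help` for the release in use):

```python
import subprocess

PLANET_CID = "QmPlanetFileCid..."  # hypothetical

subprocess.run([
    "ipfs-cluster-ctl", "pin", "add",
    "--name", "planet-200330.osm.pbf",
    "--replication-min", "3",   # re-replicate if the number of copies drops below this
    "--replication-max", "8",   # number of copies created when the pin is added
    "--expire-in", "2160h",     # remove the pin automatically after ~90 days
    PLANET_CID,
], check=True)
```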

Altering the cluster-configuration

A cluster can dynamically grow or shrink without any configuration changes, and new data is preferably allocated to the peers with the most free space. This way every new peer extends the available storage in the cluster.

Write access to the cluster is defined in the cluster configuration file (a JSON file), which lists the public keys that are allowed to alter the set of elements.

Adding cluster members

Following a cluster is very simple: anyone with a locally running IPFS daemon can start a cluster follower, which reads the cluster configuration file and communicates with the local IPFS daemon to do the necessary replication.

These public collaboration clusters have been available since the last release of IPFS Cluster, and some of them are listed here:

https://collab.ipfscluster.io/
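
For a mirror operator, joining such a cluster boils down to two commands; a minimal sketch (the cluster name and configuration URL are hypothetical; the init/run subcommands are the documented follower workflow):

```python
import subprocess

CLUSTER_NAME = "osm-planet"                           # hypothetical
CONFIG_URL = "https://example.org/osm-planet.json"    # hypothetical

# One-time setup: fetch the cluster configuration.
subprocess.run(["ipfs-cluster-follow", CLUSTER_NAME, "init", CONFIG_URL], check=True)

# Long-running: follow the pinset and replicate data via the local IPFS daemon.
subprocess.run(["ipfs-cluster-follow", CLUSTER_NAME, "run"], check=True)
```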

Server Outages

Server outages are not an issue. The cluster has no 'master' that is necessary for operation. Nodes with write access can go completely offline while the data is still available.

Outages of third-party servers might trigger additional copies of the data, if necessary, to guarantee availability inside the cluster.

If a cluster server comes back online, it will receive the full delta of the cluster metadata, catch up, and continue operation automatically.

Data integrity

All data is checked for integrity block by block (default block size: max 256 KiB) via its SHA-256 sum, according to the CID (and its metadata).
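
Conceptually, the block-level check works like the sketch below: the file is split into chunks and every chunk is addressed by its own hash, so corruption is detected (and can be re-fetched) per block. Real IPFS links the block hashes into a Merkle DAG rather than a flat list.

```python
import hashlib

BLOCK_SIZE = 256 * 1024  # default maximum chunk size

def block_hashes(path: str):
    """Return one SHA-256 digest per 256 KiB block of the file."""
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK_SIZE):
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes
```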

Tamper resistance

The data held on the mirrors cannot be tampered with, since IPFS would simply reject it because of the wrong checksum. Nobody without your keys can write to the cluster, and nobody without your keys can alter the IPNS entry.

Community aspect

IPFS allows easy read access to the files on the mirrors, but also allows everyone in the community to set up a cluster follower without having to list an additional URL on a wiki page that needs to be cleaned up when some of the servers are no longer available, etc.

Disaster recovery

Private key for Cluster-Write-Access lost

If the write key of a cluster is lost, a new cluster has to be created. This requires a daemon restart with a new configuration file and a refetch of the cluster metadata on all cluster followers. Data integrity is unaffected, since the data stays online and yields the same CIDs on re-import.

This can be mitigated by an alternative write key which is securely stored in a backup location.

Complete data loss on all (project) servers

Since there are third-party servers, data integrity won't be affected. Regarding write access, see above.

Data-integrity issues on a cluster server

On the affected server, the data store needs to be verified. All data with errors will be removed and refetched by the cluster follower.

If the databases are affected too, IPFS can be wiped, and the ipfs-cluster-follower can be wiped as well (in both cases the private key doesn't need to be preserved).

The follower and IPFS can then be restarted, will pull the full metadata history again, and then receive any newly written data.

If the follower identity is maintained (the private key isn't wiped), the cluster follower will fetch its share of the replicated data again.

Data loss on the whole cluster

If some data is completely lost on the cluster, it can be restored by adding the same data on any IPFS node. An offline backup, for example, can thus restore the data on the cluster.

Data transfer speed

Netflix did a great job improving the IPFS component that organizes the data transfers - Bitswap. There's a blog post about that. The improvements will be part of the next major release, due within the next month.

Archiving via IPFS-cluster

Since IPFS allows everyone to replicate the data easily and thus provide redundancy, it might be an interesting solution for your backups as well - in a second cluster installation.

A third party outside of the main team could hold the write access to this archiving/backup cluster. The main team adds all backup files to IPFS, and the third party pins the CID of the backup folder in the cluster.

If files from the mirror cluster should be archived, they can simply be added to the backup cluster via their Content-ID and are transferred automatically.
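
Sketched out, the archiving step is the same two-command pattern as for the mirror cluster (paths and names are placeholders):

```python
import subprocess

# Run by the main team: add the backup folder to IPFS and note the root CID.
out = subprocess.run(["ipfs", "add", "-r", "-Q", "/backups/2020-04-01"],  # hypothetical path
                     capture_output=True, text=True, check=True)
backup_cid = out.stdout.strip()

# Run by the third party holding the write key of the backup cluster:
subprocess.run(["ipfs-cluster-ctl", "pin", "add",
                "--name", "backup-2020-04-01", backup_cid], check=True)
```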

@tomhughes
Member

So the summary is that all our mirrors will need to run special software and all our clients will need to use special software to be able to download?

Can you see why that probably isn't workable?

@bertrandfalguiere

I think @RubenKelevra forgot to mention another major perk: the first users fetching maps will fetch them from your server. The following users will also fetch them from the other users already having it, with no additional action from you.
This would drastically reduce your server load, while improving service availability.

@tomhughes
Member

I'm sure it's absolutely brilliant in theory but it relies on a network effect with all the bootstrapping issues that creates for early adopters.

Our mirrors, and mirror sites in general, don't run it - they run on rsync.

Likewise our users don't have IPFS clients, they have HTTP clients.

@mmd-osm

mmd-osm commented Mar 30, 2020

The following users will also fetch them from the other users already having it, with no additional action from you.
This would drastically reduce your server load, while improving service availability.

Maybe you haven’t noticed, but that was exactly the reason people set up a BitTorrent feed to make planet files available via P2P. It’s already available today.

@pnorman
Collaborator

pnorman commented Mar 30, 2020

I'm not sure what this would gain us that torrents don't. I know one weakness of torrents is they're harder for users to work with than a link they can feed to curl, but IPFS certainly doesn't help with that problem.

@jimpick

jimpick commented Mar 30, 2020

Funny to see this issue now ... I just spent the weekend experimenting with IPFS + tile-based mapping for a web-based side project. (I used to work on the IPFS team at Protocol Labs, I'm still there but on another team)

IPFS has been built from the start with strong HTTP support - every node has a gateway built in. There are also a number of public gateways, anybody can run one. Web apps can use http assets from a public gateway ... if you are using a browser that supports IPFS natively, or are using the IPFS Companion browser extension, those public gateway requests will be handled by a local IPFS node instead of via http. It's really nice.

This is old, but the demos still work: ipfs/notes#142

Being able to fetch subsets of tilesets using content addressable identifiers is a complete game-changer.

The next major version of go-ipfs is going to have some gigantic performance improvements, so keep an eye out for that.

@jimpick

jimpick commented Mar 30, 2020

@pnorman Torrents are great. IPFS builds on top of the core ideas pioneered by BitTorrent ... the biggest innovation is content addressability - every directory, file and chunk gets an immutable id generated by hashing the content. There's a DHT which simplifies content discovery vs. having to deal with trackers.

@mmd-osm

mmd-osm commented Mar 31, 2020

Can you maybe comment a bit on what you mean by "IPFS is not production-ready yet", as mentioned in one of your repos? https://github.com/ipfs/go-ipfs#security-issues:

The IPFS protocol and its implementations are still in heavy development. This means that there may be problems in our protocols, or there may be mistakes in our implementations. And -- though IPFS is not production-ready yet -- many people are already running nodes in their machines.

This sounds like this is something to consider in 2-3 years, maybe?

@tomhughes
Member

They've been going for five years already and have apparently obtained only negligible traction...

@autonome

Hi! I'm another person who works on IPFS. IPFS isn't done yet, but still runs well for many use-cases, and lots of folks are shipping IPFS today. Usage is growing in massive and sometimes surprising ways. But ultimately that doesn't inherently matter to you - IPFS either meets the needs you have now or doesn't. I'm happy to see someone brought it up here, and we're happy to answer questions if you're interested.

Separately, we do get pokes like this about IPFS support for OSM. So if there's a modular way to support it for your users that are interested in having it, we're happy to support that effort. We're talking with PeerMaps about a potential grant for adding an IPFS endpoint, and as you can see in @jimpick's comments above, experiments are happening. So maybe experimentation and support will eventually happen either way, outside of the core infra, to meet the needs of the folks who want it.

@RubenKelevra
Author

Hey @tomhughes, thanks for your thought on that:

Our mirrors, and mirror sites in general, don't run it - they run on rsync.

Well, actually they run a web server and rsync - which would be pretty analogous to running an IPFS client and a cluster follower.

The main issue with the current approach is the static location addressing:

It relies on the users themselves to make sure they pick a server that serves the latest version of the file, and to do the integrity checking after downloading the checksum file from the main mirror.

In practice, probably nobody checks the integrity against the checksum file from the main mirror. This opens up the risk that users download a file placed on a malicious server which exploits a bug in the PBF reader to execute code.

Rsync also allows direct access to the local filesystem, requiring chroot environments, which can be prone to configuration errors.

Additionally, there are no atomic updates with rsync: a user might download a file which is being rewritten at that very moment - leaving them with the old file but the new md5 sum file.

The wiki states:

You should first check for the existence of the .md5 file before trying to download actual data (which may sometimes be in a transient state while a mirror is being synchronized with a recent dump)

MD5 is also far from safe - especially on files as large as the planet files, it's easy to spoof an md5 sum.
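
For comparison, this is roughly the manual check the current setup expects from users - and which, in practice, hardly anyone performs (file names are placeholders):

```python
import hashlib

def md5sum(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# .md5 files are in "checksum  filename" format.
expected = open("planet-latest.osm.pbf.md5").read().split()[0]
if md5sum("planet-latest.osm.pbf") != expected:
    raise SystemExit("checksum mismatch: mirror out of sync or file tampered with")
```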

Likewise our users don't have IPFS clients, they have HTTP clients.

That's what the gateways are for. You can simply link to the gateway address of a file.

If the user has a browser plugin installed, that's automatically detected and the file is fetched via IPFS directly; if not, the browser will use the gateway. Easy as that.

@tomhughes
Member

The point is that it doesn't really matter how easy it might be for our mirrors to add IPFS because we don't actually control them and they're not going to be interested in doing something special for one upstream when they are probably mirroring hundreds of sites.

@mmd-osm

mmd-osm commented Mar 31, 2020

Just linking to some earlier work on positioning IPFS with OSM, in case anyone cares. Quite likely that I missed some more.

@RubenKelevra
Author

RubenKelevra commented Apr 1, 2020

The point is that it doesn't really matter how easy it might be for our mirrors to add IPFS because we don't actually control them and they're not going to be interested in doing something special for one upstream when they are probably mirroring hundreds of sites.

@tomhughes yeah, I know that.

But on the other hand there's no reason why we can't use both systems in parallel:

All current mirrors can stay and work just as they are; we add the files to the cluster and exchange the links on the main mirror for IPFS links.

This way we reduce the traffic on the main mirror every time someone decides to follow the IPFS cluster or download a file via IPFS, while the current system still handles most of the traffic.

We can write a guide on how to migrate an existing mirror to IPFS for anyone running a mirror just for the OSM project.

We also don't have to duplicate any files on this server or change anything in the current setup - IPFS can handle files that are accessed via an HTTP URL. They can also be added directly from the filesystem - if you'd like to run the writing cluster node on the same server.

So the script which currently writes new files would just be extended by a few lines of code, which add the newly stored file (via its HTTP URL or its local path) to the IPFS cluster and update the link afterwards.
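
A rough sketch of those extra lines (paths are placeholders; whether to add the file from the local filesystem, as here, or via its HTTP URL is a deployment choice):

```python
import subprocess

def publish_to_cluster(local_path: str) -> str:
    # Add the freshly written planet file to the local IPFS node (-Q prints only the CID).
    out = subprocess.run(["ipfs", "add", "-Q", local_path],
                         capture_output=True, text=True, check=True)
    cid = out.stdout.strip()
    # Pin it in the collaborative cluster so followers start replicating it.
    subprocess.run(["ipfs-cluster-ctl", "pin", "add", "--name", local_path, cid],
                   check=True)
    return cid

cid = publish_to_cluster("/data/planet/planet-200401.osm.pbf")  # hypothetical path
print(f"https://ipfs.io/ipfs/{cid}")  # the link to swap in on the download page
```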

@RubenKelevra
Author

RubenKelevra commented Apr 1, 2020

Also, some great recent news: Opera added built-in support for IPFS :)

http://ipfs.io/ipns/blog.ipfs.io/2020-03-30-ipfs-in-opera-for-android/

@tomhughes
Member

Yes I saw that. I'm sure all five of their users will find it useful.

@RubenKelevra
Author

RubenKelevra commented Apr 2, 2020

@tomhughes are we talking about the same app - which got more than 3 million votes, more than 100 million installations and 4.6 stars avg? 🤔

@mmd-osm

mmd-osm commented Apr 2, 2020

How is IPFS support in Opera relevant for the discussion here? What would be the use case for an Android user to download one of our 50GB+ sized planet files?

@mmd-osm

mmd-osm commented Apr 15, 2020

It seems that some people are working on a grant to bring a project called PeerMaps to IPFS - ipfs/devgrants#41. At first glance this seems to be about providing tiled raw data (not to be confused with vector tiles) on a weekly basis, and it doesn't require any specific support from the OSM infrastructure side.

For this reason I'd suggest closing this issue for the time being. In case there's something more specific to discuss, you can always reopen it. Currently, I don't see any actionable item here.

@pnorman
Collaborator

pnorman commented Apr 18, 2020

Our third-party mirrors run with rsync and a http server, and our users download the planet file with curl or wget. We do not have control over these, nor can we change it.

For peer to peer downloads, we're working on torrent files. These suffer from the problem that you can't use them without installing additional software, but at least that additional software is common.

@pnorman closed this as completed Apr 18, 2020
@RubenKelevra
Author

For peer to peer downloads, we're working on torrent files. These suffer from the problem that you can't use them without installing additional software, but at least that additional software is common.

Well, you don't need additional software to download files stored in IPFS; you can just use any of the available HTTP gateways, for example the one provided by Cloudflare or the one run by the project itself (gateway.ipfs.io).

The advantage of using a local gateway is just that you can serve the data to others again after the download - but a public gateway will basically do the same, for a short period of time.
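
To make the "no additional software" point concrete, here is a minimal fallback fetch using only an ordinary HTTP client (the CID is a placeholder):

```python
import urllib.request

CID = "QmPlanetFileCid..."  # hypothetical
gateways = ["https://gateway.ipfs.io", "https://cloudflare-ipfs.com"]

for gw in gateways:
    try:
        urllib.request.urlretrieve(f"{gw}/ipfs/{CID}", "planet.osm.pbf")
        break  # the first gateway that answers is enough
    except OSError:
        continue  # try the next gateway
```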

Our third-party mirrors run with rsync and a http server, and our users download the planet file with curl or wget. We do not have control over these, nor can we change it.

As I said, I wouldn't want to remove anything from the current infrastructure, but just replace the links on the main server with links to the IPFS-gateway.

When the user has IPFS, great: the client will be used automatically. If the user doesn't have IPFS, no problem: the download will happen over HTTPS from the gateway.

Anybody interested in providing bandwidth for the main mirror can just join the cluster and fetch a copy.

This way we take traffic off the main mirror, while still offering links that point to the latest data stored on the main mirror.

@mmd-osm

mmd-osm commented Apr 21, 2020

just replace the links on the main server with links to the IPFS-gateway

So this will not really work unless someone stores those files in ipfs and we have some reliable means to check for their existence.

Linking to a single point of failure (which linking to a third party http gateway effectively means for most users) doesn’t seem like a good idea, either tbh.

@RubenKelevra
Author

just replace the links on the main server with links to the IPFS-gateway

So this will not really work unless someone stores those files in ipfs and we have some reliable means to check for their existence.

That's why I recommend IPFS-Cluster. You can add the files to a collab-cluster and everyone interested in helping you with storage and bandwidth can join the cluster.

Linking to a single point of failure (which linking to a third party http gateway effectively means for most users) doesn’t seem like a good idea, either tbh.

I wouldn't consider Cloudflare a single point of failure, but you could just add like 2 different links, one to ipfs.io and one to Cloudflare. The user can just choose.

@thibaultmol

@pnorman I don't think this issue should be closed; I think there are still a lot of things to discuss.

I would much prefer the OSMF to spend more money on microgrants than on server infrastructure (of course the server infrastructure needs to be upgraded as well, but if some of the work can be offloaded to a volunteer network via IPFS, that might help reduce costs).

@autonome

We are funding a grant project for prep/ingest tools to make OSM data more amenable to P2P distribution and use by decentralized applications. You can follow along here: ipfs/devgrants#59
