Using IPFS in the OSM infrastructure #388
Comments
So the summary is that all our mirrors will need to run special software and all our clients will need to use special software to be able to download? Can you see why that probably isn't workable? |
I think @RubenKelevra forgot to mention another major perk: the first users fetching maps will fetch them from your server. The following users will also fetch them from the other users who already have them, with no additional action from you. |
I'm sure it's absolutely brilliant in theory but it relies on a network effect with all the bootstrapping issues that creates for early adopters. Our mirrors, and mirror sites in general, don't run it - they run on rsync. Likewise our users don't have IPFS clients, they have HTTP clients. |
Maybe you haven’t noticed, that was exactly the reason why people set up a BitTorrent feed to make planet files available via p2p. It’s already available today. |
I'm not sure what this would gain us that torrents don't. I know one weakness of torrents is they're harder for users to work with than a link they can feed to curl, but IPFS certainly doesn't help with that problem. |
Funny to see this issue now ... I just spent the weekend experimenting with IPFS + tile-based mapping for a web-based side project. (I used to work on the IPFS team at Protocol Labs, I'm still there but on another team) IPFS has been built from the start with strong HTTP support - every node has a gateway built in. There are also a number of public gateways, anybody can run one. Web apps can use http assets from a public gateway ... if you are using a browser that supports IPFS natively, or are using the IPFS Companion browser extension, those public gateway requests will be handled by a local IPFS node instead of via http. It's really nice. This is old, but the demos still work: ipfs/notes#142 Being able to fetch subsets of tilesets using content addressable identifiers is a complete game-changer. The next major version of go-ipfs is going to have some gigantic performance improvements, so keep an eye out for that. |
@pnorman Torrents are great. IPFS builds on top of the core ideas pioneered by BitTorrent ... the biggest innovation is content addressability - every directory, file and chunk gets an immutable id generated by hashing the content. There's a DHT which simplifies content discovery vs. having to deal with trackers. |
Can you maybe comment a bit what you mean by "IPFS is not production-ready yet" as it is mentioned in one of your repos? https://github.com/ipfs/go-ipfs#security-issues: The IPFS protocol and its implementations are still in heavy development. This means that there may be problems in our protocols, or there may be mistakes in our implementations. And -- though IPFS is not production-ready yet -- many people are already running nodes in their machines. This sounds like this is something to consider in 2-3 years, maybe? |
They've been going for five years already and have apparently obtained only negligible traction... |
Hi! I'm another person who works on IPFS. IPFS isn't done yet, but still runs well for many use-cases, and lots of folks are shipping IPFS today. Usage is growing in massive and sometimes surprising ways. But ultimately that doesn't inherently matter to you - IPFS either meets the needs you have now or doesn't. I'm happy to see someone brought it up here, and we're happy to answer questions if you're interested. Separately, we do get pokes like this about IPFS support for OSM. So if there's a modular way to support it for your users that are interested in having it, we're happy to support that effort. We're talking with PeerMaps about a potential grant for adding an IPFS endpoint, and as you can see in @jimpick's comments above, experiments are happening. So maybe experimentation and support will eventually happen either way, outside of the core infra, to meet the needs of the folks who want it. |
Hey @tomhughes, thanks for your thoughts on that:
Well, actually they run a web server and rsync - which is pretty analogous to running an IPFS client and a cluster follower. The main issue with the current approach is static location addressing: it relies on the user to make sure the chosen server serves the latest version of the file and to do the integrity check themselves, after downloading the checksum file from the main mirror. In practice, probably nobody verifies downloads against the checksum file from the main mirror. That opens the door to a malicious mirror serving a crafted file that exploits a bug in the user's PBF reader to execute code. Rsync also gives direct access to the local filesystem, requiring chroot environments that can be prone to configuration errors. Additionally, rsync updates are not atomic: a user might download a file while it is being rewritten, leaving them with the old file but the new md5 sum file. The wiki states:
MD5 is also far from safe - especially for files as large as the planet files; collision attacks against MD5 are practical, so a matching MD5 sum is no longer a strong integrity guarantee.
That's what the gateways are for. You can simply link to a file via a gateway address. If the user has a browser plugin installed, the link is automatically detected and fetched via IPFS directly; if not, the browser just downloads from the gateway. Easy as that. |
The point is that it doesn't really matter how easy it might be for our mirrors to add IPFS because we don't actually control them and they're not going to be interested in doing something special for one upstream when they are probably mirroring hundreds of sites. |
Just linking to some earlier work on positioning IPFS with OSM, in case anyone cares. Quite likely that I missed some more. |
@tomhughes yeah, I know that. But on the other hand there's no reason why we can't use both systems in parallel: all current mirrors can stay and work exactly as they do now, we add the files to the cluster and exchange the links on the main mirror for IPFS links. This way we reduce the traffic on the main mirror every time someone decides to follow the IPFS cluster or to download a file via IPFS, while the current system still handles most of the traffic. We can write a guide on how to migrate an existing mirror to IPFS for anyone running a mirror just for the OSM project. We also don't have to duplicate any files on this server or change anything about the current setup - IPFS can handle files that are accessed via an HTTP URL, and they can also be added directly from the filesystem if you'd like to run the writing cluster node on the same server. So the script which currently writes new files would just be extended by a few lines of code that add the newly stored file via its HTTP URL to the IPFS cluster and update the link afterwards. |
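As a rough sketch of such a script extension (assuming the default go-ipfs HTTP API on port 5001, a hypothetical planet file path and pin name - none of this reflects OSM's actual setup):

```python
# Hedged sketch: hand a freshly written planet file to the local IPFS node and
# ask the cluster to replicate it. Paths, port and pin name are assumptions.
import subprocess
import requests

IPFS_API = "http://127.0.0.1:5001/api/v0"           # default go-ipfs API address
PLANET_FILE = "/data/planet/planet-latest.osm.pbf"  # hypothetical output path

def add_to_ipfs(path: str) -> str:
    """Add a file to the local IPFS node and return its CID."""
    with open(path, "rb") as f:
        resp = requests.post(f"{IPFS_API}/add", files={"file": f})
    resp.raise_for_status()
    return resp.json()["Hash"]

def pin_on_cluster(cid: str, name: str) -> None:
    """Ask the cluster to replicate the CID (needs a peer with write access)."""
    subprocess.run(["ipfs-cluster-ctl", "pin", "add", "--name", name, cid], check=True)

if __name__ == "__main__":
    cid = add_to_ipfs(PLANET_FILE)
    pin_on_cluster(cid, "planet-latest")
    print(f"published /ipfs/{cid}")
```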
Also, great recent news: Opera added built-in support for IPFS :) http://ipfs.io/ipns/blog.ipfs.io/2020-03-30-ipfs-in-opera-for-android/ |
Yes I saw that. I'm sure all five of their users will find it useful. |
@tomhughes are we talking about the same app - which got more than 3 million votes, more than 100 million installations and 4.6 stars avg? 🤔 |
How is IPFS support in Opera relevant for the discussion here? What would be the use case for an Android user to download one of our 50GB+ sized planet files? |
It seems that some people are working on a grant to bring a project called PeerMaps to IPFS - ipfs/devgrants#41. At first glance this seems to be about providing tiled raw data (not to be confused with vector tiles) on a weekly basis, and doesn't require any specific support from the OSM infrastructure side. For this reason I'd suggest closing this issue for the time being. In case there's something more specific to be discussed, you can always reopen. Currently, I don't see any actionable item here. |
Our third-party mirrors run rsync and an HTTP server, and our users download the planet file with curl or wget. We do not control these, nor can we change them. For peer-to-peer downloads, we're working on torrent files. These suffer from the problem that you can't use them without installing additional software, but at least that additional software is common. |
Well, you don't need additional software to download files stored in IPFS, you can just use any of the available HTTP gateways, for example the one provided by Cloudflare or the one run by the project itself. The only advantage of using a local gateway is that you can serve the data to others again after you've downloaded it - though a public gateway will also do that for a short time.
As I said, I wouldn't want to remove anything from the current infrastructure, just replace the links on the main server with links to the IPFS gateway. If the user has IPFS, great: the client is used automatically. If the user doesn't have IPFS, no issue: the download is done via HTTPS from the gateway. Anybody interested in providing bandwidth for the main mirror can just join the cluster and fetch a copy. This way we get the traffic off the main mirror while still offering links that point at the latest data stored on the main mirror. |
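For users without any IPFS software, the HTTP fallback really is just a plain download. A minimal sketch of that path (the CID is a placeholder, not a real planet file):

```python
# Hedged sketch: plain-HTTP download of IPFS content through public gateways.
import shutil
import requests

CID = "QmExampleExampleExampleExampleExampleExample"   # placeholder CID
GATEWAYS = [
    "https://ipfs.io/ipfs/",
    "https://cloudflare-ipfs.com/ipfs/",
]

def download(cid: str, out_path: str) -> None:
    """Try each public gateway in turn until one serves the content."""
    for gw in GATEWAYS:
        try:
            with requests.get(gw + cid, stream=True, timeout=60) as resp:
                resp.raise_for_status()
                with open(out_path, "wb") as f:
                    shutil.copyfileobj(resp.raw, f)
            return
        except requests.RequestException:
            continue  # fall through to the next gateway
    raise RuntimeError("no gateway could serve the content")

download(CID, "planet-latest.osm.pbf")
```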
So this will not really work unless someone stores those files in ipfs and we have some reliable means to check for their existence. Linking to a single point of failure (which linking to a third party http gateway effectively means for most users) doesn’t seem like a good idea, either tbh. |
That's why I recommend IPFS-Cluster. You can add the files to a collab-cluster and everyone interested in helping you with storage and bandwidth can join the cluster.
I wouldn't consider Cloudflare a single point of failure, but you could just add two different links, one to ipfs.io and one to Cloudflare. The user can choose. |
@pnorman I don't think this issue should be closed, I think there are still a lot of things to discuss. I would much prefer if the OSMF spent more money on microgrants than on server infrastructure (of course the server infrastructure needs to be upgraded as well, but if some of the work can be offloaded to a volunteer network via IPFS, that might help reduce costs) |
We are funding a grant project for prep/ingest tools to make OSM data more amenable to p2p distribution and use by decentralized applications. You can follow along here: ipfs/devgrants#59 |
Hey guys, hope you're all doing fine in the current situation. As a long-time mapper, I'd like to thank you for all the work you put into this project! :)
Interplanetary Filesystem
IPFS is a network protocol that allows exchanging data efficiently in a worldwide mesh network. Content is addressed by a Content ID (CID) - by default a SHA-256 hash - which ensures that the content wasn't altered.
All interconnections are dynamically established and terminated, based on the requests to the daemon and the queries in the global Distributed Hash Table - which is used to resolve Content IDs to peers, and peers to IP addresses and ports.
Storage concept
There are multiple data types, but the most interesting for you is UnixFS (files and folders). A Content-ID of a folder is immutable and thus ensures that all data inside a folder can be verified after receiving it.
IPFS has a built-in 'name system' (IPNS) that allows assigning a static ID (the public key of an RSA or ed25519 key pair) and pointing it at changing content. This way you can switch a link from one folder version to another atomically. The static IDs are accessed through /ipns/ and the content IDs through /ipfs/ on a web gateway.
An example page via a CID-Link on a Gateway
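As a minimal sketch of how the two path types fit together (assuming a locally running go-ipfs daemon with its default API port; the IPNS name is a placeholder):

```python
# Hedged sketch: resolve an IPNS name to its current CID, then fetch the content
# through the local daemon's HTTP API (default port 5001).
import requests

IPFS_API = "http://127.0.0.1:5001/api/v0"
IPNS_NAME = "/ipns/k51qzi5uqu5dexample"    # placeholder IPNS key

# /api/v0/name/resolve follows the mutable pointer to the current immutable CID ...
resolved = requests.post(f"{IPFS_API}/name/resolve", params={"arg": IPNS_NAME})
resolved.raise_for_status()
ipfs_path = resolved.json()["Path"]        # e.g. "/ipfs/Qm..."

# ... and /api/v0/cat fetches (and verifies) the content behind that CID.
content = requests.post(f"{IPFS_API}/cat", params={"arg": ipfs_path})
content.raise_for_status()
print(f"{ipfs_path}: {len(content.content)} bytes")
```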
Software for end-users
But you don't need a gateway to access such URLs: there are browser plugins (for Firefox and Chrome) which can resolve and access them directly, desktop clients (for Windows / Linux / macOS), and there's a wget replacement which uses IPFS directly to access the URL.
Backwards compatibility
You can offer a webpage which is accessible via HTTP(S) and IPFS at the same time. The browser plugins automatically detect whether a webpage has a DNSLink entry and will switch to IPFS. All IPFS project pages, for example, are stored on an IPFS cluster and served by a regular web server, and can also be fetched via the browser plugins.
On the website itself, you can link a URL to one of the web gateways, allowing users with regular browsers to access the data without having to install anything.
If the link points towards a folder it looks like this dataset.
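A DNSLink entry is just a TXT record on the `_dnslink` subdomain. A small sketch of how one could check for it (using Cloudflare's public DNS-over-HTTPS JSON API; the domain queried is ipfs.io, which publishes such a record):

```python
# Hedged sketch: read the _dnslink TXT record of a domain to see whether it
# points into IPFS (value of the form "dnslink=/ipfs/..." or "dnslink=/ipns/...").
import requests

def dnslink(domain: str):
    resp = requests.get(
        "https://cloudflare-dns.com/dns-query",
        params={"name": f"_dnslink.{domain}", "type": "TXT"},
        headers={"accept": "application/dns-json"},
    )
    resp.raise_for_status()
    for answer in resp.json().get("Answer", []):
        value = answer["data"].strip('"')
        if value.startswith("dnslink="):
            return value[len("dnslink="):]
    return None

print(dnslink("ipfs.io"))
```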
Cluster
IPFS alone does not guarantee data replication; everything is just stored locally for other clients to access. To achieve data replication, you need the cluster daemon. It maintains a set of elements (pins) and lets you add or remove them. Each element can be tagged with an expiry time (after which it is removed automatically), a minimum and a maximum replication factor.
The maximum sets the number of copies created on add, while a drop below the minimum automatically triggers additional replication of the data.
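A sketch of what pinning with those parameters could look like from a peer with write access (the CID and name are placeholders, and the flag names follow recent ipfs-cluster-ctl releases - verify them with `ipfs-cluster-ctl pin add --help` before relying on this):

```python
# Hedged sketch: pin a CID on the cluster with replication bounds and an expiry.
import subprocess

CID = "QmExampleExampleExampleExampleExampleExample"   # placeholder CID

subprocess.run(
    [
        "ipfs-cluster-ctl", "pin", "add",
        "--name", "planet-200101",      # hypothetical label
        "--replication-min", "3",       # re-replicate if copies drop below 3
        "--replication-max", "10",      # create at most 10 copies on add
        "--expire-in", "2160h",         # drop the pin automatically after ~90 days
        CID,
    ],
    check=True,
)
```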
Altering the cluster-configuration
A cluster can dynamically grow or shrink without any configuration changes, and new data is preferentially allocated to the peers with the most free space. This way every new peer extends the storage available to the cluster.
Write access to the cluster is defined in the cluster configuration file (a JSON file), which lists the public keys that are allowed to alter the set of elements.
Adding cluster members
Following a cluster is very simple: anyone with a locally running IPFS daemon can start a cluster follower, which reads the cluster configuration file and communicates with the local IPFS daemon to do the necessary replication.
Such public collaboration clusters have been available since the last release of IPFS Cluster, and some of them are listed here:
https://collab.ipfscluster.io/
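For illustration, joining such a cluster boils down to two commands wrapped here in a small script (the cluster name and configuration URL are placeholders for whatever an OSM cluster would publish):

```python
# Hedged sketch: join a collaborative cluster as a follower.
import subprocess

CLUSTER_NAME = "osm-planet"                                 # hypothetical name
CONFIG_URL = "https://example.org/osm-planet/service.json"  # hypothetical config URL

# One-time initialisation from the published configuration ...
subprocess.run(["ipfs-cluster-follow", CLUSTER_NAME, "init", CONFIG_URL], check=True)

# ... then run the follower; it talks to the local IPFS daemon and starts
# replicating whatever the cluster's trusted peers have pinned.
subprocess.run(["ipfs-cluster-follow", CLUSTER_NAME, "run"], check=True)
```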
Server Outages
Server outages are not an issue: the cluster has no 'master' that is required for operation. Nodes with write access can go completely offline while the data stays available.
Server outages of third parties might trigger additional copies of the data, if necessary, to guarantee availability inside the cluster.
If a server of the cluster comes back online, it receives the full delta of the cluster metadata, catches up and continues operation automatically.
Data integrity
All data is checked for integrity block by block (default maximum block size 256 KiB) via its SHA-256 sum according to the CID (and its metadata).
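To make that concrete, here is a hedged sketch of what the daemon effectively does for a block addressed by a CIDv0: fetch the raw bytes and check them against the multihash encoded in the CID. It needs the third-party `base58` package, and the CID is a placeholder.

```python
# Hedged sketch: verify a raw block against its CIDv0 (base58-encoded sha2-256 multihash).
import hashlib
import base58
import requests

IPFS_API = "http://127.0.0.1:5001/api/v0"
BLOCK_CID = "QmExampleExampleExampleExampleExampleExample"   # placeholder CIDv0

# Fetch the raw block bytes from the local daemon.
resp = requests.post(f"{IPFS_API}/block/get", params={"arg": BLOCK_CID})
resp.raise_for_status()
block = resp.content

# A CIDv0 is the base58 encoding of a multihash: 0x12 (sha2-256), 0x20 (32 bytes), digest.
multihash = base58.b58decode(BLOCK_CID)
assert multihash[0] == 0x12 and multihash[1] == 0x20, "not a sha2-256 CIDv0"
assert multihash[2:] == hashlib.sha256(block).digest(), "block does not match its CID"
```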
Tamper resistance
The data held on the mirrors cannot be tampered with, since IPFS would simply reject altered data because of the mismatching checksum. Nobody without your keys can write to the cluster, and nobody without your keys can alter the IPFS name system entry.
Community aspect
IPFS not only allows easy read access to the files on the mirrors, it also allows everyone in the community to set up a cluster follower without having to list an additional URL on a wiki page that then needs to be cleaned up once some of the servers are no longer available.
Disaster recovery
Private key for Cluster-Write-Access lost
If the write key of a cluster is lost, a new cluster has to be created. This requires a daemon restart with a new configuration file and a refetch of the cluster metadata by all cluster followers. Data integrity is unaffected, since the data stays online and keeps the same CIDs on a re-import.
This can be mitigated with an alternative write key that is securely stored in a backup location.
Complete data loss on all (project) servers
Since there are third-party servers, data integrity won't be affected. Regarding write access, see above.
Data-integrity issues on a cluster server
On this server, the data store needs to be verified. All data with errors will be removed and refetched by the cluster-follower.
If the internal databases are affected too, both IPFS and the ipfs-cluster-follower can be wiped (their private keys don't need to be preserved).
The follower and IPFS can then be restarted and will pull the full metadata history again, then receive any newly written data.
If the follower identity is maintained (the private key isn't wiped), the cluster follower will fetch its part of the replication again.
Data loss on the whole cluster
If some data is completely lost on the cluster, it can be restored by adding the same data again on any IPFS node. So an offline backup, for example, can be used to restore the data on the cluster.
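A hedged sketch of that restore step (the backup path and expected CID are placeholders; re-adding only reproduces the same CID if the add parameters - chunker, CID version, raw-leaves setting - match the original import):

```python
# Hedged sketch: re-add a file from an offline backup and confirm it reproduces
# the CID the cluster still has pinned, so replication can resume.
import requests

IPFS_API = "http://127.0.0.1:5001/api/v0"
BACKUP_FILE = "/backup/planet-200101.osm.pbf"                  # hypothetical path
EXPECTED_CID = "QmExampleExampleExampleExampleExampleExample"  # CID still pinned by the cluster

with open(BACKUP_FILE, "rb") as f:
    resp = requests.post(f"{IPFS_API}/add", files={"file": f})
resp.raise_for_status()
cid = resp.json()["Hash"]

assert cid == EXPECTED_CID, "add parameters differ from the original import"
print(f"restored /ipfs/{cid}; the cluster will re-replicate it automatically")
```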
Data transfer speed
Netflix did a great job improving Bitswap, the IPFS component which organizes the data transfers. There's a blog post about that. It will be part of the next major release, which is expected within the next month.
Archiving via IPFS-cluster
Since IPFS allows everyone to replicate the data easily and provide redundancy this way, it might be an interesting solution for your backups as well - as a second cluster installation.
A third party outside of the main team could hold the write access to this archiving/backup cluster. The main team adds all backup files to IPFS, and the third party adds the CID of the backup folder to the cluster pinset.
If files from the mirror cluster should be archived, they can simply be added to the backup cluster by their Content ID and will be transferred automatically.