Bisq Network Monitor Revisited #62

Closed
25 of 33 tasks
freimair opened this issue Dec 12, 2018 · 29 comments · Fixed by bisq-network/bisq#2432

@freimair

freimair commented Dec 12, 2018

This is a Bisq Network proposal. Please familiarize yourself with the submission and review process.

Abstract: Tor and P2P network issues do and will affect the performance and acceptance of Bisq, and a monitoring system greatly assists in finding their cause. In practice, the current monitoring system still leaves us with a lot of tedious guesswork. I propose a fresh monitoring solution that is properly designed for the task at hand (unlike the current one, which had to be built in a hurry). The solution features monitoring node(s) in the P2P network that gather metrics, while an external service takes care of history and presentation. The presentation can be suitable for developers and users alike, provide detailed insight into Bisq's network layer, let us grasp the value of Bisq to the world, and prepare Bisq for the future.

Introduction

Bisq is getting bigger and bigger, and thus issues in Bisq's network layer appear more frequently. In late 2017, for example, the network simply did not perform (bisq-network/bisq#1172). In an attempt to understand why, @ManfredKarrer created a monitoring tool in a hurry. However, while it clearly showed the situation, the new monitor did not help much in understanding the cause, let alone foster strategies to prevent such a situation in the future. Eventually, the network magically recovered. Since then, Bisq has grown even bigger.

Challenge

An ideal monitoring solution has to serve multiple purposes. First of all, it should support developers in finding fixes to pending issues (bisq-network/bisq#1241, bisq-network/bisq#1299). Second, it should support developers in analyzing and understanding the network. A better understanding of the network lets us anticipate upcoming issues and maybe stop them in their tracks before they become an actual issue. Third, numerical performance values let us evaluate the effectiveness of network tweaks more clearly and make informed decisions on whether to keep the tweaks or not. Furthermore, numerical performance values can be fed to some sort of attack detection mechanism which triggers countermeasures on demand. Leaving the realm of development, a historical display of numerical performance values allows users to get an idea of why their offer is taking so long to be published and maybe pick a time when the network is less busy (bisq-network/bisq#1575). And last but not least, the collection of metrics lets people get an idea of the value Bisq brings to the world.

The current monitoring solution, unfortunately, is very limited (because it was created in a hurry in order to get hold of actual pending issues). First of all, there is no historical data: a dev cannot correlate historical data with other sources (Tor metrics, for example) in order to conclude whether or not it is Bisq's fault when the network suffers from performance loss. Second, the current monitoring solution provides neither Tor-only performance values nor network load values. The only value available is a roundtrip time metric which might indicate performance loss, but does not say whether it is caused by poor Tor performance, by a high network load, or by congestion caused by a way-too-high network load, even though the latter should have been visible before congestion actually kicked in. Third and last, the data presentation is not suitable for people other than developers. The statistics site is static and has to be manually refreshed to get up-to-date data, there is no historical data, and the metrics displayed are too cryptic to be understood by the average (Bisq-affine) Joe. All in all, the current monitoring solution leaves us guessing what Bisq's network layer looks like inside, whether Tor is blocking our requests due to its DoS protection, or whether our optimizations really optimize things. Hence, the current monitoring solution does not come near the ideal solution sketched above.

Proposal

It is time to create a proper monitoring solution (bisq-network/bisq#1361). From a technical point of view, the shiny new monitoring solution of course has to conform to the usual bullsh*t bingo: it has to be clean, modular, extensible, easy to deploy, low maintenance effort, use existing solutions where possible, fast initial time to market, etc.

Having that in mind, I propose

  1. starting fresh and designing the monitoring solution from the ground up. A clean start does not trick us into reusing approaches just because they are already there, without thinking about their usefulness and effectiveness. Furthermore, by discarding old code we do not pull deprecated and/or dead code into a shiny new project, especially since the existing monitoring solution has been created in a hurry.
  2. making the new monitoring solution easy to deploy and operate. I had my share of application servers, tomcats and IT departments - thus, I suggest that instead of wasting time fighting these, we invest a little more development time in order to create a simple executable which everyone who is able to run the Bisq client is able to run.
  3. making the monitor highly modular and configurable so an operator can easily pick a (sub)set of metrics he wants to run. Furthermore, a simple Java-properties-based configuration should be used to control how these metrics behave.
  4. focusing on extensibility. We might have a good idea of which metrics we need right now. However, there certainly are metrics we do not think of right now. Such future metrics have to be addable without rewriting the whole monitor thing. Furthermore, a new-to-Bisq developer should find her way around the monitoring tool quite easily. I am not saying that the actual metrics themselves have to be super clean, just that a developer can add a set of new metrics without having to spend a lot of time solving riddles.

Following these proposals, we IMHO should be able to create a monitoring solution which properly allows for a deeper look inside Bisq's network layer while not being outdated by tomorrow.
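To make the Java-properties-based configuration idea a bit more concrete, here is a minimal sketch of how per-metric settings could be read. The key names (`<Metric>.enabled`, `<Metric>.run.interval`), the metric name `TorStartupTime`, the class name `MetricConfig` and the 600 second default are assumptions made for illustration only; they are not taken from an existing implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical per-metric settings read from a Java properties file. */
final class MetricConfig {
    final String name;
    final boolean enabled;
    final long intervalSeconds;

    MetricConfig(String name, boolean enabled, long intervalSeconds) {
        this.name = name;
        this.enabled = enabled;
        this.intervalSeconds = intervalSeconds;
    }

    /**
     * Reads "<name>.enabled" and "<name>.run.interval" (assumed key names).
     * A metric without an "enabled" entry is treated as disabled;
     * a missing interval falls back to an assumed default of 600 seconds.
     */
    static MetricConfig from(Properties props, String name) {
        boolean enabled = Boolean.parseBoolean(
                props.getProperty(name + ".enabled", "false").trim());
        long interval = Long.parseLong(
                props.getProperty(name + ".run.interval", "600").trim());
        return new MetricConfig(name, enabled, interval);
    }

    /** Parses properties from a string; a real node would load a file instead. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(
                "TorStartupTime.enabled=true\nTorStartupTime.run.interval=300\n");
        MetricConfig cfg = from(props, "TorStartupTime");
        System.out.println(cfg.name + " enabled=" + cfg.enabled
                + " interval=" + cfg.intervalSeconds + "s");
    }
}
```

With a layout like this, an operator picks a subset of metrics simply by setting the `enabled` entries, and a new metric only needs its own key prefix.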

Please find a big picture of the proposed monitoring solution in the illustration below. There are two main components to the monitoring solution. First, Monitoring Nodes are inserted into Bisq's P2P network. These nodes only gather data; they do not keep a historical record of the data. Second, Monitoring Services accept data from the Monitoring Node(s), keep a historical record and visualize the data. Offline discussions yielded Prometheus to be the monitoring service to be used. Please note that the underlying Tor network is part of every connection in the illustration.

[Illustration: monitor-bigpicture]

EDIT: Prometheus turned out to be not suitable for our purpose, at least if we do not want to overload the network. Prometheus is meant to monitor system resources multiple times a minute; if we do that with the Bisq network, or even only Tor, we might run into network overload and DoS countermeasures, respectively. However, there is another great open source tool out there: Graphite. It does the same thing as Prometheus, except it does NOT scrape the data but waits for data to arrive (and features timestamps). However, as the chances of running into the limits of such a tool are quite high, I rearranged the proposal to allow each Monitor Node to publish its findings to its own set of Monitoring Services. For example, for system information, use Prometheus as a Monitoring Service, for benchmarking the Tor and P2P networks use Graphite, and for event tracking use -InsertYourFavoriteMonitoringService-.
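As an illustration of the push model, the following sketch formats one sample in Graphite's plaintext protocol (`<metric path> <value> <timestamp in epoch seconds>`, one sample per line) and writes it to a carbon receiver. The class name `GraphiteLine` and the metric path are made up for this example, and the sketch assumes a plain TCP connection; how the connection would be routed through Tor is left out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Minimal sketch of pushing one sample via Graphite's plaintext protocol. */
final class GraphiteLine {

    /** Formats one sample as "<path> <value> <epochSeconds>\n". */
    static String format(String path, long value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    /** Opens a TCP connection to the given carbon receiver and writes one line. */
    static void push(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            out.write(line.getBytes(StandardCharsets.UTF_8));
            out.flush();
        }
    }

    public static void main(String[] args) {
        // Hypothetical metric path; the value is a startup time in milliseconds.
        long now = System.currentTimeMillis() / 1000L;
        System.out.print(format("bisq.monitor.TorStartupTime", 4200L, now));
    }
}
```

Because the node decides when to send, the measurement interval is fully under the node's control, and each sample carries its own timestamp.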

Monitor Node(s)

Please find an architectural overview of a Monitor Node in the illustration below. A Monitor Node uses a Scheduler as its central component. The Scheduler executes Metrics and supplies them with their share of Configuration. The minimal Configuration for each Metric contains whether the Metric is enabled and, if yes, at which intervals the Metric is to be run. The collected data is published to a suitable Monitoring Service (the original draft offered it as a Prometheus job exporter, maybe with an added Pushgateway, via a Tor Hidden Service). The Monitor Node is to be run as a simple executable from the command line (and thus can easily be turned into a system service). Furthermore, on Linux systems, the Monitor Node can be instructed to reload its configuration and react to changes (enable/disable metrics, change intervals) by a kill -USR1 signal without the need for restarting the executable (since we expect to run multiple Tor binaries, and restarting the whole thing would take an awful lot of time while at the same time losing running average data).

[Illustration: monitor-architecture]
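A rough sketch of such a Scheduler, assuming each Metric can be modelled as a `Runnable` and the Configuration boils down to a map from metric name to interval in milliseconds. The class and method names are hypothetical. Calling `configure` again with a new map is what a configuration reload could do; the signal handling itself is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical Scheduler: runs registered metrics at their configured intervals. */
final class MetricScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private final Map<String, Runnable> metrics = new HashMap<>();
    private final Map<String, ScheduledFuture<?>> running = new HashMap<>();

    /** Makes a metric known to the scheduler; it does not start running yet. */
    synchronized void register(String name, Runnable metric) {
        metrics.put(name, metric);
    }

    /**
     * Applies a configuration (metric name to interval in milliseconds).
     * All currently running metrics are stopped first, so metrics that are
     * absent from the map end up disabled. Unknown names are ignored.
     */
    synchronized void configure(Map<String, Long> intervalsMillis) {
        for (ScheduledFuture<?> future : running.values()) {
            future.cancel(false);
        }
        running.clear();
        for (Map.Entry<String, Long> entry : intervalsMillis.entrySet()) {
            Runnable metric = metrics.get(entry.getKey());
            if (metric != null) {
                running.put(entry.getKey(), executor.scheduleAtFixedRate(
                        metric, 0, entry.getValue(), TimeUnit.MILLISECONDS));
            }
        }
    }

    /** Number of metrics currently scheduled. */
    synchronized int runningCount() {
        return running.size();
    }

    /** Stops the executor and with it all scheduled metrics. */
    void shutdown() {
        executor.shutdownNow();
    }
}
```

Since the executor and the registered metric objects survive a call to `configure`, any state a metric keeps (a running average, for example) would not be lost on reload, which is the point of avoiding a full restart.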

Monitoring Service

The Monitoring Service collects the data provided by the Monitor Node(s), keeps a historical record and by some means also provides a GUI. For starters, this service is to be provided by the open-source monitoring solution Graphite (originally Prometheus). It is well-maintained and active, takes a few minutes to set up and handles recording and displaying data quite nicely.

Implementation Details

Please note the priority list and/or timeline below on how I propose to get the proposed monitoring system up and running. Each release is meant to be put into production. The Babysteps release is meant to complement the existing monitoring tool, as it primarily adds Tor metrics. As some of the work is already done, I believe that we can provide Babysteps in January already. The Showing Off release is then meant to supersede Manfred's monitor, as by then the new monitor includes all metrics provided by Manfred's monitor. The Settled release then focuses on making the value of Bisq to the world somehow visible.

  • create basic infrastructure for a monitoring node
  • create Graphite instance
  • benchmark Tor startup
  • benchmark Hidden Service startup
  • benchmark Tor roundtrip time to torproject.org's hidden service
  • release: Babysteps
  • p2p RTT (using ping/pong)
  • p2p network load (messages per timeslot)
  • p2p network load histogram (messagetype per timeslot)
  • estimate the number of open offers (snapshot)
  • release: Showing off
  • Monitoring Service whitelisting
  • exchange rates (integrate work of @HarryMacfinned)
  • fee estimation (integrate work of @HarryMacfinned)
  • p2p network load (include Refresh/Remove messages)
  • p2p network load histogram (include Refresh/Remove messages)
  • message number diffs between Seed Nodes per message type
  • open offers per market (teaser for what is possible)
  • notifications on alarms per email/slack/...
  • fix >6h empty data
  • create Readme
    • running a Monitor Node
    • running a Monitor Service
  • release: Settled
  • release: yourReleaseNameHere

Future:

Aftermath

The proposed monitoring solution should pave the way for Bisq's future quite a bit. With the solution, we are able to understand Bisq's network layer better and thus take better care of it.

Please feel free to suggest further Metrics and how they fit in the list above. Please also feel free to raise any questions and concerns! It is a rather big project and more minds usually perform better in creating a more complete picture.

@initCCG

initCCG commented Dec 12, 2018

How will this affect privacy and anonymity of Bisq network trading?
Is this going to help chainalysis and surveillance companies monitor Bisq trading?

@freimair
Author

@initCCG I kind of expected your question. And it is a valid one! Let me try to explain and clarify.

How will this affect privacy and anonymity of Bisq network trading?

The proposed monitoring solution does not expose any data which is not already exposed or at least, can be extracted by a skilled programmer by hacking the client. Therefore, I do not expect the monitoring to have any effect on the privacy and anonymity of Bisq network trading. Note that the monitoring does not aim at pinpointing and tracking single trades, our interest lies only in getting an idea of the network status.

Is this going to help chainalysis and surveillance companies monitor Bisq trading?

A skilled developer can always place her own modified Bisq client into the network, as there is no access restriction in place and the trading protocol specifications are publicly available (at least in the form of source code). However, having Tor and its hidden services as a backend, such eavesdropping cannot link any trades to real people (except if you are doing fiat trades of course, then you have to reveal your bank account details to your trading partner). So companies interested in surveilling Bisq can and will do that independently of whether or not a monitoring solution is available.

As a side note:
In general (@ManfredKarrer please correct me if I am wrong), Bisq is designed to be a truly decentralized trading platform for cryptocurrencies, i.e. foreclosing a central single point of failure, a single attack surface, and the possibility of the operators committing serious fraud (as we have seen in numerous cases in recent times). If Bisq were to be focused on privacy and security only, a whole lot would have to be changed, and I (having spent many years doing research in the field of data privacy and security) doubt that we could make Bisq bulletproof.

@ManfredKarrer
Contributor

Great thanks @freimair for the proposal!
One thing we discussed and which might be good to add is that the monitoring might use different sources for the data. The P2P network node is the main source, but some data like the number of users online might be difficult to get from that node. An easy solution could be to use data from the price nodes, as those get contacted each minute by any live node and we use a unique connection ID (I think the onion address of the requester is visible as well). So if we collect that data from all the price nodes, we get a pretty good picture of the number of live nodes.

Another aspect is how much we want to integrate the monitoring with other non P2P services as the price node (exchange rates, fee estimation - where historical data would be helpful), Bitcoin nodes (not sure if there is much needed besides checking if they are healthy), webpage, forum, download page, ... @chirhonul started in that direction and probably it's better not to merge both approaches too closely to keep it more flexible, but maybe good to stay well connected...

@ManfredKarrer
Contributor

ManfredKarrer commented Dec 12, 2018

@initCCG Florian answered most already so I just want to add a few things:
The P2P network monitoring has no privileged access to anything. The data is either public, so anyone could observe and collect it, or it is private (encrypted P2P messages). There are a few areas like the above-mentioned price nodes which can be used to collect data that is not publicly accessible, but we are 100% open source and the available data is either an onion address (that is public in the offer anyway when you make an offer, as otherwise the taker could not connect to you) or a UID which is only used to have a connection uniquely identified.

The concept of Bisq has its limitations regarding privacy, as @freimair mentioned, like the revealing of your name to the trade peer for doing a fiat transfer. For altcoins the privacy level is much higher, but there, too, one should be aware of the risks of chain analysis on the blockchain level and other conceptual limitations (using the same onion address for many trades, ...).

This attempt here is to improve and protect the P2P network. We have not moved to data collection business ;-).

@ManfredKarrer
Contributor

ManfredKarrer commented Dec 12, 2018

Ah and one more technical issue:
There is support for a GRPC API in place. It is not in master yet but I can look up the branch. It is just the skeleton to run Bisq in a headless mode with an API server and client. The work from @blabno and @mrosseel also uses an API (HTTP REST) and the same integration infrastructure.
@cbeams was very interested to add GRPC.

@blabno

blabno commented Dec 13, 2018

Is monitoring service going to connect to the monitoring node over regular HTTP or WebSocket?
If anybody could connect to monitoring node then it's easy to run DoS against the node.

I've recently refactored HTTP API to run on embedded Jetty, so I guess that would be something handy and aligned with point 2 of the proposal.

@freimair
Author

Price nodes

So what you are suggesting is to somehow merge the Monitor Node with the price node and thus, enable Prometheus to fetch data directly from the price node? (i.e. exchange rates, fee estimation, number of live users, ...)
We can of course integrate the metric into the price node and only expose numbers (i.e. estimated number of active peers) to Prometheus.
Having said that, I second @ManfredKarrer's point about flexibility. I prefer to keep monitoring separated from other infrastructure components (such as the price nodes). That way, we can be sure not to expose any data that should not be exposed. Furthermore, opening an additional publicly accessible API on such a critical part of the infrastructure enlarges the attack surface considerably. Considering the floodfill characteristic of the P2P network, we might be able to estimate the network size without changing the price node. Furthermore, if a monitor node queries the price node as any other Bisq client does, we can keep that separated as well.

Is monitoring service going to connect to the monitoring node over regular HTTP or WebSocket?

@blabno Prometheus uses plain HTTP to fetch a simple text document. The Metrics and measurements are encoded (i.e. human readable) into those documents by the monitor node. So yes, DoS is possible, however, as we know the Prometheus server (IP for example) we could do whitelisting. I added that feature to the Settled release as I am not sure if DoS against monitoring is that much of an issue.

There is support for a GRPC API in place.

How would we use RPC for monitoring?

how much we want to integrate the monitoring with other non P2P service

That is exactly the point. The Prometheus server does not care what data is to be collected. Why not reuse the Prometheus solution for the P2P monitoring as well. Low maintenance, easy to find, good overview, etc. Iff there is an issue with separation of concerns, we can always fire up a second Prometheus instance.

@ripcurlx

@freimair Sounds great! Would it be possible to use this setup also for user/onion address based monitoring? The reason why I'm asking is, that I wanted (bisq-network/analytics#3) and still want to have a basic monitoring (app started -> offer created -> trade completed) on a kind of user basis to be able to say if we have a problem within the client and if a certain version improved our KPIs. If we just monitor the network it should be possible to gather this kind of information without adding anything to the client, correct?

@ghost

ghost commented Dec 14, 2018

@ManfredKarrer wrote:

Another aspect is how much we want to integrate the monitoring with other non P2P services as the price node (exchange rates, fee estimation - where historical data would be helpful)

I'm working on making fee historical data available online in realtime (Let's say I'll whip myself for january).

Another point I was thinking on:
When launching Bisq from the CLI, quite a lot of interesting information is delivered, either in the CLI or in the logfile.
Would it be interesting to monitor Bisq CLI output, and/or the logfile ?
(It could be something quite simple, eg a croned awk program watching for some critical keywords.)
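
The same idea as the cron'ed awk program, sketched in Java for illustration: scan log lines and keep those that contain one of a set of critical keywords. The class name, the keywords and the sample lines are made up; they are not taken from actual Bisq log output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: picks out log lines that contain a critical keyword. */
final class LogKeywordScan {

    /** Returns every line that contains at least one keyword, in original order. */
    static List<String> matches(List<String> lines, List<String> keywords) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            for (String keyword : keywords) {
                if (line.contains(keyword)) {
                    hits.add(line);
                    break; // report each line once, even if several keywords match
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for a logfile.
        List<String> lines = Arrays.asList(
                "INFO started",
                "ERROR timeout while connecting",
                "INFO ok");
        for (String hit : matches(lines, Arrays.asList("ERROR", "WARN"))) {
            System.out.println(hit);
        }
    }
}
```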

@freimair
Author

freimair commented Dec 14, 2018

@ripcurlx You are probably more confident in the Bisq trading protocol than I am, so, if the protocol allows anyone to gather this kind of data, then yes.

@HarryMacfinned I thought about integrating monitoring metrics into the Bisq client itself. However, fetching the data without using the P2P network is kinda difficult, and with using the P2P network, the network is (unnecessarily) burdened by non-critical monitoring data. The only option is to make the Bisq client send the metrics to a monitoring service, but in that case, we would have to ask the user if she wants to send "anonymous usage data" - a thing that I personally deny as a reflex. Having the Bisq users in mind, not providing the choice is not an option, and if the choice is there, I fear that the data we get is only a very small portion of the overall status.
Hence, I decided to propose dedicated monitoring nodes and just rely on the things available on the network anyhow.

@ghost

ghost commented Dec 14, 2018

@freimair ,
Sorry for being imprecise. My idea was not at all to use all or any user's Bisq client, but only some Bisq client specially reserved for such a task. eg, I have a Bisq running always for several purpose, and I may also use it to monitor its output.
But first I wanted to know if this is useful or not. It's probably more a question for @ManfredKarrer .

@freimair
Author

@HarryMacfinned that is exactly what the Monitor Node in the proposal is meant to be.

Btw you said

I'm working on making fee historical data available online in realtime (Let's say I'll whip myself for january).

can we have an offline discussion (slack?) about how you do that?

@ghost

ghost commented Dec 14, 2018

can we have an offline discussion (slack?) about how you do that?

of course, I ping you on slack

@freimair
Author

I just found that Prometheus is not that suitable for our purpose. I will try Graphite and Metrics as java lib. That results in pretty much the same architecture with the difference, that the Monitor Node pushes findings to the Monitoring Service instead of getting scraped.

@freimair
Author

Please be informed that I just updated the proposal.

The update was necessary because Prometheus is not suitable for our purpose and turned out to be a showstopper.

Changes:

  • remove Prometheus as single Monitoring Service
  • insert Graphite as a Monitoring Service for getting the monitoring up and running
  • enhance the architecture to allow for different Monitoring Services (use Prometheus for system resources, Graphite for network monitoring, ? for event tracking for example)

@ManfredKarrer
Contributor

@freimair @ripcurlx
Regarding tracking of user behaviour/trades:
It is not possible to track that, as it is not public P2P network data. In a trade, the peers communicate directly (encrypted) and the rest of the P2P network does not know about that. Only the public floodfill messages like offers can be used for metrics.

The trade statistics data do not contain any identifying data about the trade/traders but only price, amount, date, ...

To introduce some tracking of user behaviour might be possible in a privacy-protecting way, but it is a bit tricky and a project of its own which would require a well-designed proposal. I think it cannot and should not be considered as part of that proposal, but it might get added later to the presentation layer of the monitoring.

We still can and should have performance metrics for direct messages, with 2 dummy nodes sending test messages and measuring the RTT.
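
A minimal sketch of how such an RTT measurement could be taken, assuming the exchange of one test message and its reply between the two dummy nodes can be wrapped in a blocking `Runnable` (the actual messaging is not shown). The class and method names are hypothetical.

```java
/** Hypothetical helper for timing a ping/pong exchange between two test nodes. */
final class RoundTripTimer {

    /** Runs one blocking exchange and returns the elapsed time in milliseconds. */
    static long measureMillis(Runnable exchange) {
        long start = System.nanoTime();
        exchange.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /** Arithmetic mean of the collected samples, in milliseconds. */
    static double average(long[] samplesMillis) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long sum = 0;
        for (long sample : samplesMillis) {
            sum += sample;
        }
        return (double) sum / samplesMillis.length;
    }
}
```

`System.nanoTime()` is used because it is monotonic, so the measurement is not disturbed by wall-clock adjustments.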

@ManfredKarrer
Contributor

Just an additional note: With user behaviour tracking we mean to get more information how users use the app so we get more info for improving usability. If we implement that it will protect privacy and if non identifying data are shared it will require user acceptance.

@freimair
Author

@ManfredKarrer I updated the proposal accordingly (by adding a "?")...

@freimair
Author

Babysteps is almost ready

screenshot from 2018-12-28 17-57-17

@freimair
Author

freimair commented Dec 30, 2018

Babysteps live service: https://monitor.bisq.network or http://vgp5y2qrkifh7foh.onion

@freimair
Author

Showing off is coming up

screenshot from 2019-01-29 14-41-56

@devinbileck
Member

@freimair Is it possible to change the displayed timezone? I would like to display times in my local timezone rather than UTC.

@freimair
Author

freimair commented Feb 3, 2019

@devinbileck I believe I just changed it - although it has been a pain in the behind.

@devinbileck
Member

Yep. Thanks!

@freimair
Author

Settled is coming up! However, the only thing left in the list of features is the

Notification

However, there is some decision making required.

Options

  • Email: Quite an evergreen. However, in order to make that work, we have to do some configuration.
    • First, we need a sender email address (e.g. monitoring@bisq.network)
    • Second, we need an SMTP service which accepts the aforementioned sender email address and does that in a way that common spam filters do not consider a message as junk!
    • Third, of course, we need credentials for that SMTP service so we can configure Grafana appropriately
    • Furthermore, we need the recipient addresses of all which should be notified of an alarm. However, mapping the recipients to the appropriate alarm has to be done manually (at least that is what I have learned by now). And that brings up some facts:
      • A seed node operator has to state an email address (which does compromise her identity)
      • Grafana knows these email addresses
      • Mating a faulty seed node alarm with the appropriate email address has to be done manually.
  • Slack: As I am quite unfamiliar with Slack and there have been discussions on whether to switch to another platform, I am not certain that this is the best option. However, it is possible, provided that it can be done with our Slack plan.
  • others: Grafana brings a whole lot more options for alert notifications. Most notably IMHO is the Webhook option. However, I am not quite sure how to use it in order to make things easy.

Quick solution

All in all, I have a feeling that the Slack option is the most suitable and versatile. The monitor reports to Slack in case any alarm triggers (e.g. a seed node failed). Slack takes care of the email notification to those who subscribed.

  • On the plus side, the seed node operators have an additional layer between them and their seed node address, and one cannot easily determine which seed node is operated by whom. More generally, an independent (probably even subscriber-controlled) email notification platform seems likely to scale much better than reconfiguring Grafana for each and every alarm.
  • The downside is that subscribers get notified even if the alarm is not meant for them. (Is there a notification filter on Slack?)
    Make or break: does our Slack plan support incoming webhooks?
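
For reference, a Slack incoming webhook is fed by an HTTP POST with a JSON body that carries a `text` field. Grafana's Slack notifier takes care of this by itself, so the sketch below is only meant to show what such a call boils down to. The class name, the alarm text and the idea of calling the webhook from Java are illustrative assumptions; the webhook URL has to be supplied by whoever administers the Slack workspace.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that posts an alarm text to a Slack incoming webhook. */
final class SlackAlert {

    /** Builds the JSON body {"text":"..."} with minimal escaping of the message. */
    static String payload(String message) {
        StringBuilder sb = new StringBuilder("{\"text\":\"");
        for (char c : message.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append("\"}").toString();
    }

    /** Sends the message to the given webhook URL and returns the HTTP status code. */
    static int post(String webhookUrl, String message) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(webhookUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload(message).getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Only prints the body; no request is sent without a real webhook URL.
        System.out.println(payload("seed node \"example\" failed"));
    }
}
```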

What do you think?

@ripcurlx

I would also go for Slack as a first step. You can set notification options for each channel (all new messages, Just mentions, Nothing).

Yes, we already use incoming webhooks for a couple of cases: https://bisq.slack.com/apps/A0F7XDUAZ-incoming-webhooks?next_id=0 (not sure if you have the permission to see them)

@freimair
Author

I just configured Grafana to report to Slack, i.e. to our bisq-monitor channel

lets see how this works out

@ManfredKarrer
Contributor

* A seed node operator has to state an email address (which does compromise her identity)

They can use an anonymous email if that is a concern.

* [Webhook](http://docs.grafana.org/alerting/notifications/#webhook) option

Ah maybe we could hook that into Github? Could be an alternative to email/slack... not sure though...

Yes, I agree Slack might be the easiest option. RocketChat is set up but some final work is missing. I assume it will support notification tools similar to Slack's.

@ManfredKarrer
Contributor

I think the Slack notification can be configured as a PM to a user. I had it in the old monitor but it was deactivated. @mrosseel also worked with it.


6 participants