Endorse htlc and local reputation #2716
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2716 +/- ##
==========================================
+ Coverage 85.82% 85.86% +0.04%
==========================================
Files 216 218 +2
Lines 18126 18209 +83
Branches 771 749 -22
==========================================
+ Hits 15556 15636 +80
- Misses 2570 2573 +3
We must allow node operators to run plain blip-0004 without our experimental reputation recorder. It also serves as a stop-gap in case there is an exploitable bug in the recorder. We also refactor to introduce more types and documentation, without changing the reputation algorithm itself, and fix a few issues mentioned in the comments on the main PR.
Thanks, it's now clearer to me how we assign reputation to our peers. I've made a few comments on the code itself, some of which (the easy ones) I fixed in #2893.
The reputation algorithm itself looks good to me, let's try it out and see what results we get in practice and during simulations.
However, I don't think the way we interact with the reputation recorder makes the most sense.
You are storing a relay attempt as soon as we start relaying, before we know whether we actually send HTLCs out or not. This leads to the weird CancelRelay command and an inconsistency between channel relay and trampoline relay. In the trampoline case, if we can't find a route or can't send outgoing HTLCs, we will treat this as a failure, which is incorrect. This can probably even be used to skew our reputation algorithm. It's also pretty invasive, especially in the NodeRelay component...
It seems to me that it would make more sense if we implemented the following flow:
- Once we start relaying (ChannelRelay/NodeRelay), we obtain the confidence value with GetConfidence and include it in CMD_ADD_HTLC.
- At that point, we DON'T update the reputation to take this payment into account, because we don't know yet if it will be relayed.
- In Channel.scala, when we actually send an outgoing UpdateAddHtlc, we emit an OutgoingHtlcAdded event to the event stream that contains the outgoing HTLC and its Origin.Hot.
- In Channel.scala, when an outgoing HTLC is failed or fulfilled, we emit an OutgoingHtlcFailed/OutgoingHtlcFulfilled event to the event stream.
- The reputation recorder listens to those events, and updates the internal reputation state accordingly.
- We don't use the relayId but rather the outgoing channel_id and htlc_id, combined with the origin, to group HTLCs.
- For trampoline payments, since the reputation recorder has the Origin information, it can wait for all outgoing HTLCs to be settled to correctly account for the fees / timestamps.
I believe this better matches what we're trying to accomplish: the only thing the reputation recorder actually needs to know to update reputation is when outgoing HTLCs are sent and when they're settled.
It also provides more accurate relay data to ensure we're updating the reputation correctly, and has much less impact on the ChannelRelay
/ NodeRelay
actors (which should simplify testing).
Can you try that, or let me know if you think that it wouldn't be as good as the currently implemented flow?
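The event-driven flow described above can be sketched in plain Scala. All names here (HtlcKey, OutgoingHtlcAdded, OutgoingHtlcFulfilled, OutgoingHtlcFailed, ReputationRecorder's shape) are hypothetical placeholders for discussion, not eclair's actual API, and the fee-credit scoring is a deliberate simplification:

```scala
final case class HtlcKey(channelId: String, htlcId: Long)

sealed trait HtlcEvent
final case class OutgoingHtlcAdded(key: HtlcKey, incomingPeer: String, feeMsat: Long) extends HtlcEvent
final case class OutgoingHtlcFulfilled(key: HtlcKey) extends HtlcEvent
final case class OutgoingHtlcFailed(key: HtlcKey) extends HtlcEvent

// The recorder only needs to know when outgoing HTLCs are sent and when they
// are settled: it keys pending HTLCs by (channel_id, htlc_id) and credits the
// incoming peer's fee-weighted score on fulfillment.
final class ReputationRecorder {
  private var pending = Map.empty[HtlcKey, (String, Long)]
  var scores = Map.empty[String, Long]

  def onEvent(event: HtlcEvent): Unit = event match {
    case OutgoingHtlcAdded(key, peer, fee) =>
      pending = pending.updated(key, (peer, fee))
    case OutgoingHtlcFulfilled(key) =>
      pending.get(key).foreach { case (peer, fee) =>
        scores = scores.updated(peer, scores.getOrElse(peer, 0L) + fee)
      }
      pending = pending - key
    case OutgoingHtlcFailed(key) =>
      pending = pending - key
  }
}
```

Note that nothing is recorded at GetConfidence time: reputation state only changes when an outgoing HTLC is actually added and later settled, which is the crux of the proposal.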
// Pending HTLCs are counted as failed, and because they could potentially stay pending for a very long time, the
// following multiplier is applied.
Thanks, this is much clearer. I'm wondering how you came up with this default value though? Can you explain it and provide some guidance on what node operators should take into account when they change that default value?
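To make the question concrete, here is one way such a multiplier can enter a confidence score. Both the formula and the multiplier value (10.0) are assumptions for illustration, not eclair's actual implementation or default:

```scala
// A pending HTLC is counted as failed, but weighted pendingMultiplier times
// more heavily than settled traffic, because it may stay pending for a very
// long time while consuming channel resources. Illustrative formula only.
def confidence(successFeesMsat: Double, pendingFeesMsat: Double, pendingMultiplier: Double): Double =
  successFeesMsat / (successFeesMsat + pendingMultiplier * pendingFeesMsat)
```

With a multiplier of 10, a peer with 1000 msat of successfully relayed fees and 100 msat pending drops to a confidence of 0.5, so the multiplier controls how aggressively long-pending HTLCs erode reputation.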
*/
def record(relayId: UUID, isSuccess: Boolean, feeOverride: Option[MilliSatoshi] = None, now: TimestampMilli = TimestampMilli.now()): Reputation = {
  val d = decay(now)
  var p = pending.getOrElse(relayId, Pending(MilliSatoshi(0), now))
I don't think it makes sense to record something if we can't find the HTLC in the pending map, does it? Are there cases where we won't find the HTLC in the map? I've changed that in #2893, let me know if that's correct.
@@ -124,8 +131,6 @@ class ChannelRelay private(nodeParams: NodeParams,
  private case class PreviouslyTried(channelId: ByteVector32, failure: RES_ADD_FAILED[ChannelException])

  def relay(previousFailures: Seq[PreviouslyTried]): Behavior[Command] = {
    Behaviors.receiveMessagePartial {
Being in a Behaviors.receiveMessagePartial is important, because that's how we correctly get MDC fields into the logging context. If you remove this, the log lines won't contain the relayId and other important MDC fields. Restored in #2893.
@@ -360,7 +361,8 @@ class Setup(val datadir: File,
  offerManager = system.spawn(Behaviors.supervise(OfferManager(nodeParams, router, paymentTimeout = 1 minute)).onFailure(typed.SupervisorStrategy.resume), name = "offer-manager")
  paymentHandler = system.actorOf(SimpleSupervisor.props(PaymentHandler.props(nodeParams, register, offerManager), "payment-handler", SupervisorStrategy.Resume))
  triggerer = system.spawn(Behaviors.supervise(AsyncPaymentTriggerer()).onFailure(typed.SupervisorStrategy.resume), name = "async-payment-triggerer")
  relayer = system.actorOf(SimpleSupervisor.props(Relayer.props(nodeParams, router, register, paymentHandler, triggerer, Some(postRestartCleanUpInitialized)), "relayer", SupervisorStrategy.Resume))
  reputationRecorder = system.spawn(Behaviors.supervise(ReputationRecorder(nodeParams.relayParams.peerReputationConfig, Map.empty)).onFailure(typed.SupervisorStrategy.resume), name = "reputation-recorder")
We must make this optional and let node operators disable it entirely if they'd like to run the standard blip-0004 relay. I've done this in #2893.
case class Confidence(value: Double)

def apply(reputationConfig: ReputationConfig, reputations: Map[(PublicKey, Int), Reputation]): Behavior[Command] = {
I see why you're keeping a per-incoming-endorsement reputation, but aren't you afraid that this will split the data points and we'll end up with some endorsement values for which we don't have enough entries to make a proper decision?
Why can't you instead maintain a single Reputation instance per incoming peer, but take their endorsement value into account in the scoring? That would be more robust, wouldn't it?
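One way to read that suggestion is a single per-peer score where the incoming endorsement (0..7) scales the weight of each data point, rather than keeping separate Reputation instances per (peer, endorsement) pair. The PeerReputation type and the weighting scheme below are assumptions for illustration, not a proposal for eclair's exact code:

```scala
// Hypothetical single-score alternative: every relay outcome contributes to
// one per-peer score, weighted by fee and by the sender's endorsement level,
// so no endorsement bucket is ever starved of data points.
final case class PeerReputation(weightedSuccess: Double = 0.0, weightedTotal: Double = 0.0) {
  def record(endorsement: Int, feeMsat: Double, isSuccess: Boolean): PeerReputation = {
    // Assumed weighting: endorsement 7 counts fully, endorsement 0 counts 1/8.
    val weight = feeMsat * (1 + endorsement) / 8.0
    PeerReputation(weightedSuccess + (if (isSuccess) weight else 0.0), weightedTotal + weight)
  }
  def score: Double = if (weightedTotal == 0.0) 0.0 else weightedSuccess / weightedTotal
}
```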
// Jamming protection
// Must be the last checks so that they can be ignored for shadow deployment.
for ((amountMsat, i) <- incomingHtlcs.toSeq.map(_.amountMsat).sorted.zipWithIndex) {
  if ((amountMsat.toLong < 1) || (math.log(amountMsat.toLong.toDouble) * params.localParams.maxAcceptedHtlcs / math.log(params.localParams.maxHtlcValueInFlightMsat.toLong.toDouble / params.localParams.maxAcceptedHtlcs) < i)) {
    return Left(TooManySmallHtlcs(params.channelId, number = i + 1, below = amountMsat))
  }
}
We can't have jamming protection in receiveAdd: an error here would trigger a force-close. That is only appropriate when our peer violates channel constraints they should be aware of. This isn't the case for jamming, where we apply a local (potentially dynamic) heuristic: we should accept those HTLCs and fail them instead of relaying them, but not raise an error in receiveAdd.
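For reference, the small-HTLC check quoted above can be restated as a standalone function: the i-th smallest incoming HTLC (0-based) must satisfy log(amount) * maxAcceptedHtlcs / log(maxHtlcValueInFlight / maxAcceptedHtlcs) >= i, so each additional in-flight HTLC must clear an exponentially growing amount threshold. This is a sketch for discussion and the parameter values in the usage note are illustrative, not eclair defaults:

```scala
// Returns true when the sorted set of incoming HTLC amounts (msat) violates
// the logarithmic small-HTLC budget derived from maxAcceptedHtlcs and
// maxHtlcValueInFlightMsat.
def tooManySmallHtlcs(amountsMsat: Seq[Long], maxAcceptedHtlcs: Int, maxHtlcValueInFlightMsat: Long): Boolean =
  amountsMsat.sorted.zipWithIndex.exists { case (amountMsat, i) =>
    amountMsat < 1 ||
      math.log(amountMsat.toDouble) * maxAcceptedHtlcs / math.log(maxHtlcValueInFlightMsat.toDouble / maxAcceptedHtlcs) < i
  }
```

With maxAcceptedHtlcs = 30 and 1,000,000,000 msat in flight, a single 1 msat HTLC is accepted (index 0) but a second one is rejected, while ten 100,000,000 msat HTLCs all pass, which is the intended asymmetry between dust spam and normal traffic.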
val canSendAdds = active.map(_.canSendAdd(add.amountMsat, params, changes1, feerates, feeConf, cmd.confidence))
// Log only for jamming protection.
canSendAdds.collectFirst {
  case Left(f: TooManySmallHtlcs) =>
    log.info("TooManySmallHtlcs: {} outgoing HTLCs are below {}", f.number, f.below)
    Metrics.dropHtlc(f, Tags.Directions.Outgoing)
  case Left(f: ConfidenceTooLow) =>
    log.info("ConfidenceTooLow: confidence is {}% while channel is {}% full", (100 * f.confidence).toInt, (100 * f.occupancy).toInt)
    Metrics.dropHtlc(f, Tags.Directions.Outgoing)
}
canSendAdds.flatMap { // TODO: We ignore jamming protection, delete this flatMap to activate jamming protection.
  case Left(_: TooManySmallHtlcs) | Left(_: ConfidenceTooLow) => None
  case x => Some(x)
}
  .collectFirst { case Left(f) =>
    Metrics.dropHtlc(f, Tags.Directions.Outgoing)
    Left(f)
  }
This seems overly complex: can't you keep sendAdd unchanged, and in canSendAdd, instead of returning TooManySmallHtlcs or ConfidenceTooLow, simply write the log line and increment the metric?
This will duplicate the log/metric when we have pending splices, but pending splices won't happen frequently enough to skew the data.
Or even better, since HTLCs are the same in all active commitments, you can remove the changes to canSendAdd, and here in sendAdd, when canSendAdd returns a success, evaluate the jamming conditions and log and increment the metric if needed.
I've done this in #2898; however, I don't like it because it just duplicates a lot of canSendAdd's code. I believe that jamming checks, like the other checks, should be done in canSendAdd. This piece of code may look overly complex, but the idea is to get rid of it when we deploy jamming protection, at which point TooManySmallHtlcs and ConfidenceTooLow will just be treated like any other error.
The reason for the weird
But you're not doing this for trampoline relay, so that race can already be exploited anyway? I don't think this matters much in practice though, because:
I've tried doing that in #2897.
It seems to me that solving this would require adding more complexity than this refactoring was removing.
- change default parameters and explain them better
- remove checks on incoming HTLCs
It seems to me that we're trying to make trampoline fit into a box where it doesn't actually fit. One important aspect of trampoline is that the sender does not choose the outgoing channels and does not choose the fees: they allocate a total fee budget for the trampoline node, which tries to relay within that budget. The trampoline node will ensure that it earns at least its usual channel routing fees, otherwise it won't relay the payment. If the trampoline node is well connected, or the sender over-allocated fees, the trampoline node earns more than its usual routing fees: but I'm not sure that this extra fee should count in the reputation?

So I think we could handle trampoline relays in a simplified way that gets rid of those issues, by using the channel routing fees instead of trying to take the extra trampoline fees into account: when sending an outgoing HTLC with a trampoline origin, the fees we allocate to it in the reputation algorithm should just be this outgoing channel's routing fees (which can be included in the relay event, since we have access to our channel update in the channel data).

If we want to give a bonus reputation when the payment succeeds and we earned more fees than our channel routing fees, that should be easy to do as well, by splitting the extra fee between all the outgoing channels. But I'm not sure we should do this, because we can't really match an outgoing HTLC to a specific incoming channel 1-to-1, so it's probably better to just count our channel routing fees. Do you think that model would make sense, or am I missing something?
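The fee accounting proposed above can be sketched with some illustrative arithmetic. The function name and the numbers are assumptions for discussion, not eclair code: each outgoing channel is attributed its standard routing fee, and any extra trampoline fee is optionally split evenly as a bonus.

```scala
// For a trampoline relay: reputation counts each outgoing channel's routing
// fee; the surplus over the summed routing fees (if any) can be split evenly
// across outgoing channels as an optional bonus.
def reputationFees(totalFeeEarnedMsat: Long, channelRoutingFeesMsat: Seq[Long], splitBonus: Boolean): Seq[Long] = {
  val extra = math.max(0L, totalFeeEarnedMsat - channelRoutingFeesMsat.sum)
  if (splitBonus) channelRoutingFeesMsat.map(_ + extra / channelRoutingFeesMsat.size)
  else channelRoutingFeesMsat
}
```

For example, a relay earning 5000 msat over two outgoing channels whose routing fees are 1000 msat each would count 1000 msat per channel without the bonus, or 2500 msat per channel with the 3000 msat surplus split evenly.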
It seems like a good solution indeed, I'll try it.
We do not yet drop HTLCs; the purpose is to collect data first. We add an endorsement value to UpdateAddHtlc. This follows blip-0004: experimental endorsement signaling in update_add_htlc (lightning/blips#27).