
[SIP] Authentication based multi-user-single-port #130

Open
madeye opened this issue Sep 26, 2018 · 42 comments

@madeye (Contributor) commented Sep 26, 2018

Background

Previous discussions suggest performing authentication (SIP004) on the first chunk with each candidate key, identifying the user by the key that succeeds.

Implementation Consideration

Performing GCM/Poly1305 on the first chunk should be very fast. It's expected that even a naive implementation would support thousands of users without any notable overhead.

Still, we can cache the successful key for each source IP, which would save most of the computation. To prevent potential DDoS attacks, an IP that fails authentication too many times should be blocked.

Since this SIP doesn't involve any protocol change, only the server code needs to be modified. The only limitation is that AEAD ciphers are required.
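
A minimal sketch of the trial-authentication step, in Go. It assumes AES-256-GCM with the SIP004 subkey derivation (HKDF-SHA1 with info "ss-subkey"); the User type and function names are illustrative, not taken from any implementation mentioned here:

```go
// Package sketch: identify the user by trial-authenticating the first AEAD chunk.
package sketch

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha1"
	"errors"
	"io"

	"golang.org/x/crypto/hkdf"
)

// User is a hypothetical record for one access key.
type User struct {
	Name      string
	MasterKey []byte // 32 bytes for AES-256-GCM
}

// subkey derives the per-session subkey from a master key and the connection salt.
func subkey(masterKey, salt []byte) ([]byte, error) {
	key := make([]byte, len(masterKey))
	r := hkdf.New(sha1.New, masterKey, salt, []byte("ss-subkey"))
	if _, err := io.ReadFull(r, key); err != nil {
		return nil, err
	}
	return key, nil
}

// identifyUser tries to authenticate the first encrypted chunk (the 2-byte
// length header plus its 16-byte tag) with every candidate key and returns
// the user whose key verifies it.
func identifyUser(users []User, salt, firstChunk []byte) (*User, error) {
	nonce := make([]byte, 12) // the first chunk uses the all-zero nonce
	for i := range users {
		sk, err := subkey(users[i].MasterKey, salt)
		if err != nil {
			continue
		}
		block, err := aes.NewCipher(sk)
		if err != nil {
			continue
		}
		aead, err := cipher.NewGCM(block)
		if err != nil {
			continue
		}
		if _, err := aead.Open(nil, nonce, firstChunk, nil); err == nil {
			return &users[i], nil // authentication succeeded: this is the user
		}
	}
	return nil, errors.New("no key authenticates the first chunk")
}
```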

Example

Jigsaw implemented a go-ss2-based server here: https://github.com/Jigsaw-Code/outline-ss-server. An early report shows that it works quite well with 100 users: #128 (comment)

@Mygod (Contributor) commented Sep 26, 2018

Have you considered the possibility that NAT might mess with your cache? Namely, if two clients behind the same NAT router try to connect to the same server with different credentials, god bless you because they have the same source IP address to the server.

@kimw (Contributor) commented Sep 26, 2018

> Have you considered the possibility that NAT might mess with your cache? Namely, if two clients behind the same NAT router try to connect to the same server with different credentials, god bless you because they have the same source IP address to the server.

Maybe that's what we call THE COST :)

Things cannot be perfect. It depends on a BALANCE.

  1. Don't support many users on a single port (I mean really many, e.g. 100 users) => multiple ports have to be opened <= that's abnormal server-side behavior.

  2. Many users sharing a single port, oh yes, it's cool!

    And, it looks kind of "clean" from the server side. The operators of SSPs (shadowsocks service providers) will owe you a beer.

    And * 2, ss-manager could then, maybe, be retired.

That's just a personal comment. This SIP needs more balancing in any case.

@kimw (Contributor) commented Sep 26, 2018

More words:

If a shadowsocks server supports only a handful of users, that's abnormal behavior too.

--

Following on this idea, maybe a later SIP should be about exchanging shadowsocks access within a kind of circle (known friends? trusted servers?).

@Mygod (Contributor) commented Sep 26, 2018

Hmm, if you're okay with the COST of users one day complaining to you that it's not working because of the NAT vs. cache issue.

I suggest either not taking the cache approach, or using another protocol that already supports multi-user, like v2mess (I haven't looked at the protocol yet, but it seems to support this use case).

Different people prefer different balances between things. I don't think Shadowsocks is intended to cover every kind of balance you wish.

@riobard (Contributor) commented Sep 26, 2018

Hmmm… I think if we're gonna officially support multiuser per port, we might as well address the problem cleanly? #54 is still open ^_^

@riobard (Contributor) commented Sep 26, 2018

But I agree this hack is neat in that it does not require any changes in the clients. 👍

@Mygod (Contributor) commented Sep 26, 2018

Also, I should point out that the problem I mentioned might occur more frequently than you imagine, thanks to the exhausted IPv4 pool and widely deployed CGN. It's likely that one will run into such frustration despite having taken precautions.

@riobard (Contributor) commented Sep 26, 2018

CGN is a major concern. We might need to run some tests to determine the rough size of NAT pools used by ISPs doing massive CGN.

@madeye (Contributor, Author) commented Sep 26, 2018

NAT should not be a problem, as long as not all of the users are behind the same NAT address.

Say five users are behind the same NAT IP address; then at most five keys are cached for that IP.

@madeye (Contributor, Author) commented Sep 26, 2018

This SIP just suggests a kind of multi-user-single-port solution for shadowsocks without modifying the protocol.

But as mentioned by @Mygod, shadowsocks is not designed for this purpose.

I listed this SIP here since it's already implemented in third-party software. If anyone is interested in it as well, please follow this SIP and apply the suggested optimizations.

@riobard (Contributor) commented Sep 27, 2018

My worry is that people will eventually abuse this hack to run commercial services. It's not gonna scale well when users are mostly behind CGN with a small pool of public IPs, e.g. mobile networks in China.

@Mygod (Contributor) commented Sep 27, 2018

CGN also applies to ADSL connections. One also shouldn't forget NAT routers in enterprises, schools, etc. A good way to combat this is to enlarge the cache size and always do a fallback lookup.

@madeye (Contributor, Author) commented Sep 27, 2018

A fallback lookup is always needed. Even if a key is cached, authentication is still required; if authentication fails, a fallback lookup is performed.

I don't expect millions of users on one single port. A reasonable assumption is thousands of users per server, hundreds per port.

And of course, it cannot scale for commercial usage.
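
Roughly how the cache-plus-fallback flow could look, as a Go sketch. The KeyCache type, Find signature, and verify callback are hypothetical stand-ins for the trial-authentication step, not code from any implementation in this thread:

```go
// Package sketch: per-IP key cache with mandatory authentication and fallback lookup.
package sketch

import "sync"

// KeyCache remembers which key indexes recently authenticated from each source IP.
// Several clients can share one NAT address, so a small list is kept per IP.
type KeyCache struct {
	mu    sync.Mutex
	byIP  map[string][]int
	limit int // max cached keys per IP
}

func NewKeyCache(limit int) *KeyCache {
	return &KeyCache{byIP: make(map[string][]int), limit: limit}
}

// Find authenticates a new connection. Cached keys for the source IP are tried
// first, but authentication is never skipped; on a miss it falls back to a scan
// over all keys. verify is a stand-in for the trial-authentication step.
// It returns the index of the matching key, or -1.
func (c *KeyCache) Find(ip string, numKeys int, verify func(keyIndex int) bool) int {
	c.mu.Lock()
	cached := append([]int(nil), c.byIP[ip]...)
	c.mu.Unlock()

	tried := make(map[int]bool, len(cached))
	for _, k := range cached {
		tried[k] = true
		if verify(k) {
			c.remember(ip, k)
			return k
		}
	}
	for k := 0; k < numKeys; k++ { // fallback: full lookup
		if !tried[k] && verify(k) {
			c.remember(ip, k)
			return k
		}
	}
	return -1
}

// remember records a successful key for the IP, evicting the oldest entry if full.
func (c *KeyCache) remember(ip string, k int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := c.byIP[ip]
	for _, v := range keys {
		if v == k {
			return
		}
	}
	if len(keys) >= c.limit {
		keys = keys[1:]
	}
	c.byIP[ip] = append(keys, k)
}
```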

@celeron533

In some places, the ISP may do NAT for an entire neighborhood, which may include 10,000 end users, by assigning IP addresses with the 100.64 prefix. It is also a kind of NAT.

https://tools.ietf.org/html/rfc6598

IANA Considerations

IANA has recorded the allocation of an IPv4 /10 for use as Shared Address Space.

The Shared Address Space address range is 100.64.0.0/10.

@riobard (Contributor) commented Sep 27, 2018

@celeron533 This is CGN mentioned above.

@shinku721 commented Oct 13, 2018

Hmm, why not use an ElGamal-like method to identify users?

@Mygod (Contributor) commented Oct 13, 2018 via email

@fortuna (Contributor) commented Nov 29, 2018

FYI, Outline Servers have all been migrated to outline-ss-server this week. They don't yet use the single port feature, but we intend to enable it in a few weeks, after I implement the IP->cipher cache.

We can roll that out gradually and see how it performs in the wild. In my own tests, the added latency for 100 users without any optimization on a crappy $5 VPS can be significant (tens of milliseconds), but it can vary wildly, and I believe the optimizations will help significantly. Also, outline-ss-server has Prometheus metrics, so we will be able to expose latency metrics and admins will be able to monitor that.

BTW, outline-ss-server still allows for multiple ports, and you can have multiple keys per port and multiple ports per key. You can always start a new port if one becomes overloaded. One nice feature is that you can do that without creating a new process for each port or stopping the running one.

@fortuna (Contributor) commented Nov 29, 2018

It's worth mentioning that the single-port feature has some very good motivation:

  • It makes it a lot easier and safer to configure your server firewall. No need to open all the ports.
  • It allows all servers to run on ports 443, 80, or any other usually unblocked port. We found multiple cases of users not being able to use Outline on strict networks that don't allow traffic to high port numbers, or allow only a small subset of ports.
  • It allows Outline Servers to run in a Docker container without needing --net=host (you can expose the single port instead).
  • In the future, we'll be able to run the Outline Server management API and the Shadowsocks service on the same port, by falling back to the HTTPS management API if all keys fail. This will make the servers even harder to detect (you'll get a standard 404).

@fortuna (Contributor) commented Dec 12, 2018

I now have a benchmark for my single-port implementation:
Jigsaw-Code/outline-ss-server#7

These are the results on a $5 Frankfurt DigitalOcean machine that is idle:

BenchmarkTCPFindCipher 	    1000	   1304879 ns/op	 2015027 B/op	    3107 allocs/op
BenchmarkUDPUnpack     	    3000	    615077 ns/op	  115427 B/op	    1801 allocs/op

That's 1.3 ms to go over 100 ciphers for a TCP connection, and 0.6 ms for a UDP datagram. That will probably be worse under load, but it gives an idea of the kind of latency we'd be adding.

There's 2MB of allocations for one TCP connection. I believe that can be significantly reduced by sharing buffers, but it gets a little tricky with the code structure and different ciphers needing different sizes of buffers (I guess I need to find the max buffer size).

@riobard (Contributor) commented Dec 12, 2018

@fortuna That's a lot of allocs/op. Is that normal?

@fortuna (Contributor) commented Dec 13, 2018

PR Jigsaw-Code/outline-ss-server#8 makes the TCP performance on par with UDP. We no longer allocate so much memory:

BenchmarkTCPFindCipher-12    	    1000	   1349922 ns/op	  125278 B/op	    1705 allocs/op
BenchmarkUDPUnpack-12        	    2000	    881121 ns/op	  125030 B/op	    1701 allocs/op

The ~2MB of allocations were because I was allocating a buffer for an entire encrypted chunk (~16KB) for each of the 100 ciphers I tried. Now I allocate only one buffer shared by all ciphers.

As for the number of allocations, it's just that I'm doing the operation 100 times. For 1 cipher only I get these numbers:

BenchmarkTCPFindCipher-12    	   30000	     52329 ns/op	    1408 B/op	      22 allocs/op
BenchmarkUDPUnpack-12        	  200000	      8989 ns/op	    1266 B/op	      18 allocs/op
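
The buffer-sharing fix can be illustrated roughly like this (a Go sketch, not the actual outline-ss-server code; it just shows one reusable destination buffer passed to every trial AEAD Open):

```go
// Package sketch: reuse one destination buffer across all trial decryptions.
package sketch

import "crypto/cipher"

// tryCiphers attempts to open the same ciphertext with each candidate AEAD,
// decrypting into a single shared buffer instead of allocating one per cipher.
func tryCiphers(aeads []cipher.AEAD, nonce, ciphertext []byte) (int, []byte) {
	buf := make([]byte, 0, len(ciphertext)) // one allocation for all trials
	for i, aead := range aeads {
		if pt, err := aead.Open(buf[:0], nonce, ciphertext, nil); err == nil {
			return i, pt
		}
	}
	return -1, nil
}
```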

@fortuna (Contributor) commented Dec 13, 2018

With the new findAccessKey optimization, the allocations and CPU are dominated by the low-level crypto, so I'm not sure there's much room to improve there:
[profiler screenshot omitted]

This is without the IP -> cipher cache. I'm trying to make the cipher finding as efficient as possible, to reduce the need for the cache.

@fortuna (Contributor) commented Dec 19, 2018

FYI, I've added an optimization to outline-ss-server that keeps the most recently used cipher at the front of the list. This way the time to find the cipher is proportional to the number of ciphers actually in use, rather than the total number of ciphers.

Furthermore, I've added the shadowsocks_time_to_cipher_ms metric that will tell you the 50th, 90th and 99th percentile times to find the cipher for each access key.

This should be enough to inform us whether the performance is good enough. It would be great if people here gave it a try and reported back. The latest binary with the changes is v1.0.3 and can be found in the releases:
https://github.com/Jigsaw-Code/outline-ss-server/releases
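
The move-to-front heuristic could look roughly like this (a Go sketch with hypothetical types and a verify callback, not the outline-ss-server implementation):

```go
// Package sketch: move-to-front ordering of candidate ciphers.
package sketch

import "sync"

// entry is a hypothetical record for one access key's cipher.
type entry struct {
	Name string
}

// CipherList keeps entries in most-recently-used order.
type CipherList struct {
	mu      sync.Mutex
	entries []*entry
}

// Find scans the list in order and moves the matching entry to the front, so
// keys that are actively in use are found after only a few trials. verify is a
// stand-in for the trial-decryption step. (A real server would avoid holding
// the lock during trial decryption.)
func (l *CipherList) Find(verify func(*entry) bool) *entry {
	l.mu.Lock()
	defer l.mu.Unlock()
	for i, e := range l.entries {
		if verify(e) {
			copy(l.entries[1:i+1], l.entries[:i]) // shift earlier entries back by one
			l.entries[0] = e                      // move the hit to the front
			return e
		}
	}
	return nil
}
```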

@fortuna (Contributor) commented Apr 4, 2019

Update: Outline has been running servers with multi-user support on a single port for a few months now. Some organizations have 300 keys on a server, with over 100 active on any given day. Median latency due to cipher finding is around 10ms and CPU usage is minimal (bandwidth is the bottleneck).

At the 90th percentile you see cases here and there close to 1 second, but that's not common and may be due to other factors such as a burst in CPU usage (maybe expensive Prometheus queries).

Has anyone here tried the single port feature? How was your experience?

@madeye (Contributor, Author) commented Apr 5, 2019

Average 10ms latency looks too slow to me.

Assuming 300 users and the worst case of 300 authentications performed for each connection, a single authentication takes 33 µs. That means more than 33k cycles on a 1 GHz CPU, which is too long for authenticating a small packet.

Can you elaborate more about the measurement of latency?

@Mygod (Contributor) commented Apr 5, 2019

2998 light-kilometers might or might not be acceptable depending on the use case, e.g. it's probably not acceptable for game streaming but probably OK for downloading/video streaming. 😄

@fortuna (Contributor) commented Apr 5, 2019

This site says that 20ms is excellent RTT. So 10ms shouldn't be perceptible.

Also, this is latency added per connection, not per packet.

@Mygod (Contributor) commented Apr 5, 2019

How about UDP connections/packets (which are mostly used in latency-sensitive applications)?

@fortuna (Contributor) commented Apr 12, 2019

I have a benchmark above: #130 (comment)

UDP takes about 9 microseconds per cipher.

@Mygod (Contributor) commented Apr 13, 2019

@fortuna Sorry, I meant to ask whether the added latency for UDP connections is per-connection or per-packet.

@fortuna (Contributor) commented Apr 13, 2019 via email

@Mygod (Contributor) commented Apr 13, 2019

I think it would be more appropriate to optimize for UDP connections (I think there are UDP lookup caches in the libev implementation).

@fortuna (Contributor) commented Apr 13, 2019 via email

@Mygod (Contributor) commented Apr 19, 2019

@fortuna Is it technically possible to do a cache for UDP packets as well?

@fortuna (Contributor) commented Aug 2, 2019

Update: @bemasc has merged Jigsaw-Code/outline-ss-server#25, which adds a new optimization to the cipher finding. We now associate a "last client IP" with each cipher. When a new request arrives, we look up the ciphers whose last IP matches the client IP and try them first, before trying the prioritized list.

If a cipher is accessed by a single IP, it will always be tried first.
If a cipher is accessed by multiple IPs simultaneously, it's likely to stay at the front of the priority list.

With the optimization, any extra latency will be almost gone for almost everyone, even if there are hundreds of active access keys.

@Mygod, both the heuristic of pushing used ciphers to the front of the list and the new one are applied to TCP and UDP.
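
Roughly, the combined heuristics could look like this (a Go sketch with hypothetical types and a verify callback, not the actual code from Jigsaw-Code/outline-ss-server#25):

```go
// Package sketch: try keys last seen from the client's IP before the MRU list.
package sketch

import (
	"net/netip"
	"sync"
)

// key is a hypothetical record for one access key.
type key struct {
	Name   string
	lastIP netip.Addr // last source address that authenticated with this key
}

// KeyList keeps keys in most-recently-used order.
type KeyList struct {
	mu   sync.Mutex
	keys []*key
}

// Find first tries keys whose remembered IP matches the client, then the rest
// of the prioritized list, and records the client IP on the key that verifies.
func (l *KeyList) Find(client netip.Addr, verify func(*key) bool) *key {
	l.mu.Lock()
	defer l.mu.Unlock()

	order := make([]*key, 0, len(l.keys))
	for _, k := range l.keys { // pass 1: keys last seen from this IP
		if k.lastIP == client {
			order = append(order, k)
		}
	}
	for _, k := range l.keys { // pass 2: everything else, in priority order
		if k.lastIP != client {
			order = append(order, k)
		}
	}
	for _, k := range order {
		if verify(k) {
			k.lastIP = client
			return k
		}
	}
	return nil
}
```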

@riobard (Contributor) commented Aug 2, 2019

@fortuna Neat! Almost two orders of magnitude latency reduction in the common case! I'm really surprised by how far you guys have pushed forward without changing the protocol 👍

@Ehco1996 commented Nov 9, 2020

I also implemented multi-user on one port, using Python asyncio.

The core idea is to use a DB order field to find the right user.

The code is here:

https://github.com/Ehco1996/aioshadowsocks/blob/052c472422955c4ade7d0e375c8d093231aff1a9/shadowsocks/mdb/models.py#L157

@ghost commented Nov 9, 2020

We can use the same technique to eliminate the need for encryption method selection: the server tries both AES-256-GCM and ChaCha20-Poly1305 with the same password (they have the same tag size and salt size, and thus exactly the same packet layout), and the client chooses whichever is fastest on its platform.

Removing encryption selection might be too radical for us (and shortsighted: with this selector we've effectively introduced a new protocol), but it's still an option for other shadowsocks-like protocols.
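
A rough Go illustration of the idea, assuming the 32-byte session subkey has already been derived from the shared password and salt; helper names are hypothetical:

```go
// Package sketch: accept either AES-256-GCM or ChaCha20-Poly1305 for one password.
package sketch

import (
	"crypto/aes"
	"crypto/cipher"

	"golang.org/x/crypto/chacha20poly1305"
)

// candidateAEADs builds both AEADs from the same 32-byte session subkey; both
// ciphers use a 32-byte key and a 16-byte tag, so the wire layout is identical.
func candidateAEADs(subkey []byte) ([]cipher.AEAD, error) {
	block, err := aes.NewCipher(subkey)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	chacha, err := chacha20poly1305.New(subkey)
	if err != nil {
		return nil, err
	}
	return []cipher.AEAD{gcm, chacha}, nil
}

// openWithEither returns the plaintext from whichever cipher authenticates.
func openWithEither(aeads []cipher.AEAD, nonce, ciphertext []byte) ([]byte, bool) {
	for _, aead := range aeads {
		if pt, err := aead.Open(nil, nonce, ciphertext, nil); err == nil {
			return pt, true
		}
	}
	return nil, false
}
```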

@lzm0 commented Nov 19, 2021

This may be a stupid question, but what prevents us from using a HashSet for cipher lookup?

@fortuna (Contributor) commented Nov 19, 2021

@lzm0 There's no ID in the Shadowsocks protocol that can be mapped to the credentials to use, so there's no key to look up. That's why we need trial decryption.

@database64128 (Contributor)

Shadowsocks 2022 (#196) has a protocol extension that brings native multi-user-single-port support without trial decryption: https://github.com/Shadowsocks-NET/shadowsocks-specs/blob/main/2022-2-shadowsocks-2022-extensible-identity-headers.md
