Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use node buffers and decode full lines #198

Merged
merged 2 commits into from
Apr 3, 2019
Merged

Use node buffers and decode full lines #198

merged 2 commits into from
Apr 3, 2019

Conversation

xPaw
Copy link
Collaborator

@xPaw xPaw commented Mar 6, 2019

This is a change I noticed in #142 but I rewrote it myself (after looking at that PR, it doesn't appear to actually implement the buffering correctly)

This change will use buffers and will split it into lines from said buffer. iconv.decode is now called on a single IRC line instead of the whole data buffer passed from the socket (which I only assume could cause troubles on partial data).

I also no longer keep an array of lines and directly emit each split and decoded line. There's no need to keep an array because split is no longer used (it doesn't exist on Buffer either). This should lower memory consumption.

I also reset incoming_buffer when calling connect() instead of constructor, which should clear out any remaining data if the socket dies.

I tested it with ZNC playback where it feels a lot of data, and it works fine.

@xPaw xPaw requested a review from prawnsalad March 6, 2019 19:20
@prawnsalad
Copy link
Member

The partial data has been a concern since this was put in place although I've never seen it reported that it's ever an issue. Happy to hear a better to handle it though.

This part is something that's been mentioned by a few people over the years though: this.incoming_buffer.indexOf('\n', startIndex). Some UTF8 (and most likely other encoding) characters do include the \n byte within them which would cause the splitting to occur half way through a character.

@xPaw
Copy link
Collaborator Author

xPaw commented Mar 6, 2019

do include the \n byte within them

That could be a problem on master as well though, right? EDIT: It probably won't be, as it decodes first.

I could search for \r\n, but that will break on networks that only send \n.

@prawnsalad
Copy link
Member

Right, decoding happens first so searching \n will always specifically match a \n character. Also correct on point 2.

@xPaw
Copy link
Collaborator Author

xPaw commented Mar 6, 2019

I'm not sure how to solve this then. Ideas?

@prawnsalad
Copy link
Member

The potential partial data issue? It's been a concern for several years and I'm not sure of a way around it despite a few more knowledgable encoding people looking over it over the years. Unless someone can mention a work around or reports of the potential issue actually occurring, then I can only leave it as it is.

@xPaw
Copy link
Collaborator Author

xPaw commented Mar 6, 2019

I asked in #ircv3, people over there just seem to split on \n over the raw buffer as well. I looked at hexchat code and didn't see anything interesting over there either.

Merge then?

@prawnsalad
Copy link
Member

I really can't see any benefit with this PR, it's just switching a potential partial data issue with a slightly more likely potential issue with character splitting. There hasn't been any reports of the current implementation breaking either.

Am i missing something from the #ircv3 discussion yesterday?

@xPaw
Copy link
Collaborator Author

xPaw commented Mar 7, 2019

From what I gathered, I don't see the issue you are describing could ever arise.

If a client does happen to send such encoded data that happens to have 0x0A (\n) in there somewhere, the server would already split on that, and such a message would never reach other clients (without the server re-encoding the message at least).

Besides, what multibyte encoding can send 0x0A where it doesn't mean a new line? This can't happen in UTF-8.

@xPaw
Copy link
Collaborator Author

xPaw commented Mar 7, 2019

For reference why it can't happen in utf8: https://softwareengineering.stackexchange.com/questions/262227/why-does-utf-8-waste-several-bits-in-its-encoding

To add to that, it seems that Kiwi/irc-framework is alone to decode data from socket first. Other clients that I looked at all search for the new line character first.

EDIT: I looked at ZNC's code, and their ReadLine function also searches for new line character before doing any conversions. https://github.com/jimloco/Csocket/blob/e8d9e0bb248c521c2c7fa01e1c6a116d929c41b4/Csocket.cc#L2489-L2536

@prawnsalad
Copy link
Member

After discussing and getting this explained clearer with some others that know way more about encoding than I, this seems fine to me now. Merging

@prawnsalad prawnsalad merged commit 1b9263e into master Apr 3, 2019
@xPaw xPaw deleted the xpaw/buffers branch July 13, 2019 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants