Overload of DetectFromBytes(byte[] bytes, int offset, int len) #106
I also considered adding an overload of |
Hello! You can revert and refactor like this: UTF-unknown/src/UTF-unknown.csproj, line 19 in d52af8d,
and this: UTF-unknown/src/DetectionDetail.cs, lines 97 to 101 in d52af8d.
I think you can also use netstandard and add netcoreapp3_0. |
I tried to add Span APIs conditionally, but I ran into some serious issues.
Summary
Writing maintainable code for Span APIs requires dropping old targets (.NET Framework 4.0, .NET Standard 1.0) and adding a new target (.NET Framework 4.5).
Details
Supporting Span conditionally results in duplicated core logic
A Span can be created from an array, but not the reverse.
// Plain byte[] implementation
void Write(byte[] buf, int offset, int count)
{
    for (int i = offset; i < offset + count; i++)
    {
        byte b = buf[i];
        /* Do Some Work */
    }
}
// With Span: the byte[] overload forwards to the Span overload
void Write(byte[] buf, int offset, int count)
{
    Write(buf.AsSpan(offset, count));
}
void Write(ReadOnlySpan<byte> span)
{
    for (int i = 0; i < span.Length; i++)
    {
        byte b = span[i];
        /* Do Some Work */
    }
}
However, since System.Memory does not support the .NET Framework 4.0 and .NET Standard 1.0 targets, I cannot use that tactic. Instead, I had to duplicate the logic, which greatly increases the maintenance burden. That is why I did not add the Span API.
#if !NET40 && !NETSTANDARD1_0
#define SPAN_SUPPORT
#endif
void Write(byte[] buf, int offset, int count)
{
    for (int i = offset; i < offset + count; i++)
    {
        byte b = buf[i];
        /* Do Some Work */
    }
}
#if SPAN_SUPPORT
// Same core logic, duplicated for the Span overload
void Write(ReadOnlySpan<byte> span)
{
    for (int i = 0; i < span.Length; i++)
    {
        byte b = span[i];
        /* Do Some Work */
    }
}
#endif
I could implement the Span API by forwarding to the byte[] API, but that defeats the purpose of Span: it introduces an extra copy and lowers performance, so the code becomes pointless.
#if !NET40 && !NETSTANDARD1_0
#define SPAN_SUPPORT
#endif
void Write(byte[] buf, int offset, int count)
{
    for (int i = offset; i < offset + count; i++)
    {
        byte b = buf[i];
        /* Do Some Work */
    }
}
#if SPAN_SUPPORT
void Write(ReadOnlySpan<byte> span)
{
    // ToArray() copies the span into a new array
    byte[] copy = span.ToArray();
    Write(copy, 0, copy.Length);
}
#endif
So there is no perfect solution as far as I know. We have to choose between three options:
Need to add one more target: .NET Framework 4.5
As far as I know, .NET Framework applications always prefer a library target built for .NET Framework. This means a .NET Framework 4.8 application (with Span) will pick the .NET Framework 4.0 target (without Span), not the .NET Standard 2.0 target (with Span). (Here "with Span" means the API is accessible with the help of a NuGet package.) To solve this, we have to add a .NET Framework 4.5 target so that the latest .NET Framework applications have access to Span. Are you okay with adding more targets?
EDIT: Sorry for editing, I found an error in the sample code... |
The Span API issue is quite complicated, so I suggest discussing the Span APIs in a separate issue. I think it is better to concentrate on reviewing the
I made several new commits, please review. |
 }
 }
 }

- private static string FindCharSetByBom(byte[] buf, int len)
+ private static string FindCharSetByBom(byte[] buf, int offset, int len)
Could you give an example of when this is necessary? As far as I know, the magic number is inserted first.
The new method overload has an assumption: 'the actual data starts at buf[offset]'. It is useful when the actual data was loaded into the middle of the byte array.
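The assumption above can be sketched as follows. This is only an illustration, not the library's actual `FindCharSetByBom`: the class name `BomSketch`, the helper `FindBomAt`, and the returned charset strings are hypothetical, and only three BOMs are covered.

```csharp
using System;

static class BomSketch
{
    // Hypothetical helper: returns a charset name if a BOM starts at buf[offset],
    // otherwise null. Simplified on purpose: UTF-32 and other BOMs are not handled.
    public static string FindBomAt(byte[] buf, int offset, int len)
    {
        if (len >= 3 && buf[offset] == 0xEF && buf[offset + 1] == 0xBB && buf[offset + 2] == 0xBF)
            return "utf-8";
        if (len >= 2 && buf[offset] == 0xFE && buf[offset + 1] == 0xFF)
            return "utf-16be";
        if (len >= 2 && buf[offset] == 0xFF && buf[offset + 1] == 0xFE)
            return "utf-16le";
        return null;
    }
}
```

With data loaded into the middle of a larger buffer, for example two bytes of padding followed by a UTF-8 BOM, `FindBomAt(buf, 2, buf.Length - 2)` finds the BOM while a search from `buf[0]` does not.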
src/CharsetDetector.cs
Outdated
 {
     // other than 0xa0, if every other character is ascii, the page is ascii
-    if ((buf[i] & 0x80) != 0 && buf[i] != 0xA0)
+    if ((buf[offset + i] & 0x80) != 0 && buf[offset + i] != 0xA0)
Could you clarify why this needs to add the offset again?
It was written under the same assumption as (new) line 371.
But the index i already starts at offset
Oh, I made a mistake. I will patch it ASAP.
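The mistake discussed here is applying the offset twice: once in the loop's start value and again in the index expression. A minimal sketch of the two consistent ways to write such a loop follows. The class and method names are hypothetical, and the high-bit test merely stands in for the real per-byte work.

```csharp
static class OffsetLoopSketch
{
    // Form 1: i is an absolute index into buf, so buf[i] is used directly.
    public static int CountHighBytesAbsolute(byte[] buf, int offset, int len)
    {
        int count = 0;
        for (int i = offset; i < offset + len; i++)
        {
            if ((buf[i] & 0x80) != 0) count++;
        }
        return count;
    }

    // Form 2: i is relative to offset, so buf[offset + i] is used.
    public static int CountHighBytesRelative(byte[] buf, int offset, int len)
    {
        int count = 0;
        for (int i = 0; i < len; i++)
        {
            if ((buf[offset + i] & 0x80) != 0) count++;
        }
        return count;
    }
}
```

Mixing the two forms (starting `i` at `offset` and also indexing `buf[offset + i]`) reads the wrong bytes and can run past the end of the array.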
src/CharsetDetector.cs
Outdated
@@ -326,46 +357,46 @@ private void FindInputState(byte[] buf, int len)
 else
 {
     if (InputState == InputState.PureASCII &&
-        (buf[i] == 0x1B || (buf[i] == 0x7B && _lastChar == 0x7E)))
+        (buf[offset + i] == 0x1B || (buf[offset + i] == 0x7B && _lastChar == 0x7E)))
And this offset
src/CharsetDetector.cs
Outdated
 {
     // found escape character or HZ "~{"
     InputState = InputState.EscASCII;
     _escCharsetProber = _escCharsetProber ?? GetNewProbers();
 }
- _lastChar = buf[i];
+ _lastChar = buf[offset + i];
And this offset
Could you add a test for this case (so that future test runs catch it)? |
src/CharsetDetector.cs
Outdated
 }
 if (bytes.Length - offset < len)
 {
     throw new ArgumentOutOfRangeException(nameof(len));
This exception is similar to the previous one, but the two cases are different
My intention was to find the best exception without introducing additional strings.
.NET Framework's FileStream.Write implementation uses ArgumentException:
if (array.Length - offset < count)
    throw new ArgumentException(Environment.GetResourceString("Argument_InvalidOffLen"));
Should we take that code as a reference and use throw new ArgumentException("Invalid offset and length")?
It seems to me that ArgumentOutOfRangeException looks fine here, but @304NotModified will tell you what message to write, since I am only helping with the review.
For example, here is a message in mscorlib.
Also, offset + len > bytes.Length looks better, because offset may be larger than bytes.Length.
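The range checks discussed in this thread could look like the sketch below. It is not the code that was merged: the class name, the method name, and the message text are hypothetical, and the choice of exception types follows the FileStream.Write pattern quoted above.

```csharp
using System;

static class RangeCheckSketch
{
    // Hypothetical validation helper for a (bytes, offset, len) triple.
    public static void ValidateRange(byte[] bytes, int offset, int len)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));
        if (offset < 0) throw new ArgumentOutOfRangeException(nameof(offset));
        if (len < 0) throw new ArgumentOutOfRangeException(nameof(len));

        // The subtraction form cannot overflow once offset is known to be non-negative,
        // and it also rejects an offset larger than bytes.Length
        // (the left side becomes negative while len is at least 0).
        if (bytes.Length - offset < len)
            throw new ArgumentException("Invalid offset and length.");
    }
}
```

One design note: `offset + len > bytes.Length` reads more naturally, but the sum can overflow `int` for very large arguments, which is a likely reason the FileStream code uses the subtraction form.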
It seems to me that it is not worthwhile to detect the BOM at the offset, so the changes should be different, because the Feed(byte[] buf, int offset, int len) function searches for the BOM. |
Yes. We can also use this service https://nugettoolsdev.azurewebsites.net :)
I think it will look good since it already started in another PR |
/cc @304NotModified |
As far as I know, designating
if ((count <= 8) && (buffer != _buffer))
{
    int byteCount = count;
    while (--byteCount >= 0)
        _buffer[_position + byteCount] = buffer[offset + byteCount];
}
else
    Buffer.InternalBlockCopy(buffer, offset, _buffer, _position, count);
FileStream.Write(byte[] buffer, int offset, int count):
if (_writePos > 0) {
    int numBytes = _bufferSize - _writePos; // space left in buffer
    if (numBytes > 0) {
        if (numBytes > count)
            numBytes = count;
        Buffer.InternalBlockCopy(array, offset, _buffer, _writePos, numBytes);
        _writePos += numBytes;
        if (count==numBytes) return;
        offset += numBytes;
        count -= numBytes;
    }
We can see that the actual copy is done by Buffer.InternalBlockCopy, starting at the given offset. However, if you still think my explanation is not enough, I will revert the code to always search for the BOM from buf[0]. |
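The same offset semantics can be observed from the public API without reading the internals. The sketch below uses `MemoryStream.Write`; the wrapper class and method names are hypothetical.

```csharp
using System.IO;

static class StreamOffsetSketch
{
    // Writes buffer[offset .. offset + count) into a fresh MemoryStream
    // and returns what the stream actually received.
    public static byte[] WriteSlice(byte[] buffer, int offset, int count)
    {
        using (var ms = new MemoryStream())
        {
            ms.Write(buffer, offset, count);
            return ms.ToArray();
        }
    }
}
```

For the input `{ 1, 2, 3, 4, 5 }` with offset 1 and count 3, the stream contains `{ 2, 3, 4 }`: the stream treats `buffer[offset]` as the first byte of the data, which is the convention the new overload follows.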
The error I mentioned in that sentence was in the sample code in the post, not in the committed code.
Original Sample Code with Error
#if NET40 || NETSTANDARD1.0
#define SPAN_SUPPORT
#endif
Fixed Sample Code
#if !NET40 && !NETSTANDARD1_0
#define SPAN_SUPPORT
#endif |
I fixed the issue of adding the |
It seems to me that you could look at it more simply and treat it as an opportunity to go through the loop and unify the code. At least that is how I looked at it before :)
Thank you for the clarification! I hadn’t thought about this before. |
Thanks! I misunderstood you |
I fixed the issue in commit 58c009. Please review. |
I think this PR needs a couple of changes before merging:
|
Added BOM offset info to the docs of DetectFromBytes(byte[], int, int)
Improve exception message of DetectFromBytes(byte[], int, int)
I reflected your advice in the code. |
It looks good. Thank you! :) |
Thanks for the PR! I'm fine with merging this one, but I have one question. The detection of the BOM in the "middle" of a byte array could be an issue for some use cases. Would it be better to make searching for the BOM configurable (for example, by adding a bool parameter)? |
I think adding another overload having explicit BOM search location parameter would also be a good solution. |
👍. That's for another pull request? (After merging this one) |
merged! Thanks! |
I added an overload of CharsetDetector.DetectFromBytes() that takes offset and length parameters. CharsetDetector.DetectFromBytes(byte[] bytes) does not support checking a subset of the byte array, the way Stream does in Read() and Write(). The new overload enables such an operation.