Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overload of DetectFromBytes(byte[] bytes, int offset, int len) #106

Merged
merged 7 commits into from
Jan 26, 2020

Conversation

ied206
Copy link
Contributor

@ied206 ied206 commented Jan 5, 2020

I added an overload of CharsetDetector.DetectFromBytes(), by providing the offset and length parameters.

CharsetDetector.DetectFromBytes(byte[] bytes) does not support checking a subset of the byte array, like how Stream does in Read() and Write(). The new overload enables such operation.

@ied206
Copy link
Contributor Author

ied206 commented Jan 5, 2020

I also considered adding an overload of ReadOnlySpan<byte> by utilizing System.Memory nuget package, but did not add the code. UtfUnknown targets .Net Framework 4.0, which is not supported by the System.Memory.

@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 5, 2020

Hello!

You can reverts and refactor like this

<ItemGroup Condition=" '$(TargetFramework)' == 'netstandard1.3' Or '$(TargetFramework)' == 'netstandard2.0' ">

and this

#if NETSTANDARD && !NETSTANDARD1_0 || NETCOREAPP3_0
return CodePagesEncodingProvider.Instance.GetEncoding(encodingName);
#else
return null;
#endif

i think so you can also using netstandard and adding netcoreapp3_0

@ied206
Copy link
Contributor Author

ied206 commented Jan 5, 2020

I have tried to add Span APIs conditionally, but I faced some serious issues.

Summary

Writing a maintainable code for Span APIs requires dropping old targets (.Net Framework 4.0, .Net Standard 1.0) and adding a new target (.Net Framework 4.5).
If we do not compromise with targets, the code becomes unmaintainable due to the duplication.

Details

Supporting Span conditionally results duplicated core logics

Span can be spawned from an array, but not in reverse.
Thus to support Span alongside traditional APIs, calling Method(Span<byte>) internally from Method(byte[], int, int) is the easiest way to implement Span APIs.

  • Original Code:
void Write(byte[] buf, int offset, int count)
{
	for (int i = offset; i < offset + count; i++)
	{ 
		byte b = buf[i];
		/* Do Some Work */
	}
}
  • Patched Code to support Span (Ideal):
void Write(byte[] buf, int offset, int count)
{
	Write(buf.AsSpan(offset, count));
}

void Write(ReadOnlySpan<byte> span)
{
	for (int i = 0; i < span.Length; i++)
	{
		byte b = span[i];
		/* Do Some Work */
	}
}

However, since .Net Framework 4.0 and .Net Standard 1.0 target is not supported by System.Memory, I cannot use that tactic. Instead, I had to duplicate the logics which greatly increases the maintenance burden. That was why I did not add the Span API.

  • Patched Code to support Span (Duplication):
#if !NET40 || !NETSTANDARD1.0
#define SPAN_SUPPORT
#endif

void Write(byte[] buf, int offset, int count)
{
	for (int i = offset; i < count; i++)
	{
		byte b = span[i];
		/* Do Some Work */
	}
}

#if SPAN_SUPPORT
void Write(ReadOnlySpan<byte> span)
{
	for (int i = 0; i < span.Length; i++)
	{
		byte b = span[i];
		/* Do Some Work */
	}
}
#endif

I can translate the Span API into byte[] API, but it contradicts the reason why Span was built. It brings additional copying and lowers the performance, so the code becomes meaningless.

  • Patched Code to support Span (Translation):
#if !NET40 || !NETSTANDARD1.0
#define SPAN_SUPPORT
#endif

void Write(byte[] buf, int offset, int count)
{
	for (int i = offset; i < offset + count; i++)
	{
		byte b = span[i];
		/* Do Some Work */
	}
}

#if SPAN_SUPPORT
void Write(ReadOnlySpan<byte> span)
{
	Write(span.ToArray());
}
#endif

So there is not a perfect solution as long as I know. We have to choose between three options:

  1. Duplicated logics all across CharsetDetector and children of Prober
  2. Drop old targets (.Net Framework 4.0, .Net Standard 1.0)
  3. Give up Span APIs

Need to add one more target: .Net Framework 4.5

As long as I know, .Net Framework applications always prefer .Net Framework targeted library. It means an .Net Framework 4.8 (w Span) application will prefer .Net Framework 4.0 target (wo Span), not .Net Standard 2.0 (w Span) target. (In here "with Span" means the API is accessible with the help of NuGet package).

To solve this, We have to add .Net Framework 4.5 target in order to make sure that the latest .Net Framework applications have access to Span. Are you okay with adding more target?

EDIT: Sorry for editing, I found some error at sample code...

@ied206
Copy link
Contributor Author

ied206 commented Jan 5, 2020

Span API issue is quite complicated, thus I suggest to discuss the Span APIs in the separate issue. I think it is better to concentrate on reviewing the (byte[], int, int) overload commits in the PR.

I made several new commits, please review.

}
}
}

private static string FindCharSetByBom(byte[] buf, int len)
private static string FindCharSetByBom(byte[] buf, int offset, int len)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give an example when this is necessary? As far as I know, magic number is inserted first

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new method overload has an assumption: 'the actual data starts at buf[offset]'. It is useful when the actual data was loaded into the middle of the byte array.

{
// other than 0xa0, if every other character is ascii, the page is ascii
if ((buf[i] & 0x80) != 0 && buf[i] != 0xA0)
if ((buf[offset + i] & 0x80) != 0 && buf[offset + i] != 0xA0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify why this need to do offset again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was written under the same assumption with (new) line 371.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I-index already starts with offset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I made a mistake. I will patch it ASAP.

@@ -326,46 +357,46 @@ private void FindInputState(byte[] buf, int len)
else
{
if (InputState == InputState.PureASCII &&
(buf[i] == 0x1B || (buf[i] == 0x7B && _lastChar == 0x7E)))
(buf[offset + i] == 0x1B || (buf[offset + i] == 0x7B && _lastChar == 0x7E)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this offset

{
// found escape character or HZ "~{"
InputState = InputState.EscASCII;
_escCharsetProber = _escCharsetProber ?? GetNewProbers();
}
_lastChar = buf[i];
_lastChar = buf[offset + i];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this offset

@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 5, 2020

Sorry for editing, I found some error at sample code...

Could you add a test for this case (so that further tests indicate this)?

}
if (bytes.Length - offset < len)
{
throw new ArgumentOutOfRangeException(nameof(len));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exception is similar to the previous, but they are different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to find the best exception without introducing additional strings.

.Net Framework's FileStream.Write implementation use ArgumentException.

if (array.Length - offset < count)
	throw new ArgumentException(Environment.GetResourceString("Argument_InvalidOffLen"));

Should we benchmark the code and use throw new ArgumentException("Invalid offset and length")?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that ArgumentOutOfRangeException looks here, but @304NotModified will tell you what message to write, since I only help to conduct a review.

For example, here is a message in mscorelib.

And then, offset + len > bytes.Length looks better. Because offset may be larger than bytes.Length

@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 5, 2020

It seems to me that it is not worthwhile to detect the BOM by the offset, respectively, and the changes should be other. Because the Feed(byte[] buf, int offset, int len) function searches for BOM

@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 5, 2020

As long as I know, .Net Framework applications always prefer .Net Framework targeted library.

Yes. We can also use this service https://nugettoolsdev.azurewebsites.net :)

To solve this, We have to add .Net Framework 4.5 target in order to make sure that the latest .Net Framework applications have access to Span. Are you okay with adding more target?

I think it will look good since it already started in another PR

@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 5, 2020

Span API issue is quite complicated, thus I suggest to discuss the Span APIs in the separate issue. I think it is better to concentrate on reviewing the (byte[], int, int) overload commits in the PR.

I made several new commits, please review.

/cc @304NotModified

@ied206
Copy link
Contributor Author

ied206 commented Jan 6, 2020

It seems to me that it is not worthwhile to detect the BOM by the offset, respectively, and the changes should be other. Because the Feed(byte[] buf, int offset, int len) function searches for BOM

As long as I know, designating offset in a method with byte[] buf, int offset, int count signature means the actual data starts at buffer[offset]. It does not take account of the bytes where it is not covered by the [offset, offset + count). Actually, the pattern was copied from the .Net Framework's MemoryStream and FileStream implementation.
MemoryStream.Write(byte[] buffer, int offset, int count):

if ((count <= 8) && (buffer != _buffer))
{
  int byteCount = count;
  while (--byteCount >= 0)
    _buffer[_position + byteCount] = buffer[offset + byteCount];
}
else
  Buffer.InternalBlockCopy(buffer, offset, _buffer, _position, count);

FileStream.Write(byte[] buffer, int offset, int count):

if (_writePos > 0) {
	int numBytes = _bufferSize - _writePos;   // space left in buffer
	if (numBytes > 0) {
		if (numBytes > count)
			numBytes = count;
		Buffer.InternalBlockCopy(array, offset, _buffer, _writePos, numBytes);
		_writePos += numBytes;
		if (count==numBytes) return;
		offset += numBytes;
		count -= numBytes;
	}

We can find out the actual copy is done by Buffer.InternalBlockCopy, with the given offset.

However, if you still think my explanation is not enough, I will revert the code to always search BOM from buf[0].

@ied206
Copy link
Contributor Author

ied206 commented Jan 6, 2020

Sorry for editing, I found some error at sample code...

Could you add a test for this case (so that further tests indicate this)?

The error I mentioned in the sentence meant the sample code in the post, not in the committed code.
You are able to see that in the edited history.

Original Sample Code with Error

#if NET40 || NETSTANDARD1.0
#define SPAN_SUPPORT
#endif

Fixed Sample Code

#if !NET40 || !NETSTANDARD1.0
#define SPAN_SUPPORT
#endif

@ied206
Copy link
Contributor Author

ied206 commented Jan 6, 2020

I fixed the issue of adding the offset twice to the index base at CharsetDetector.FindInputState(). Thanks for pointing it out.

@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 7, 2020

As long as I know, designating offset in a method with byte[] buf, int offset, int count signature means the actual data starts at buffer[offset].

It seems to me that you need to look at it easier and consider it as an opportunity to go through the loop and unify the code. At least I looked at it like that before :)

However, if you still think my explanation is not enough, I will revert the code to always search BOM from buf[0].

Thank you for the clarification! I hadn’t thought about this before.

@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 7, 2020

The error I mentioned in the sentence meant the sample code in the post, not in the committed code.
You are able to see that in the edited history.

Thanks! I misunderstood you

@ied206
Copy link
Contributor Author

ied206 commented Jan 10, 2020

It seems to me that you need to look at it easier and consider it as an opportunity to go through the loop and unify the code. At least I looked at it like that before :)

I had fixed the issue in commit 58c009. Please review.

@304NotModified 304NotModified self-assigned this Jan 10, 2020
@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 10, 2020

I think in these PR we need to add a couple of changes before merging:

  • indicate in the documentation that for the DetectFromBytes(byte[], int, int) method, a search by BOM will start with offset
  • сorrect message for exception and call condition

Added BOM offset info to the docs of DetectFromBytes(byte[], int, int)
Improvme exception message of DetectFromBytes(byte[], int, int)
@ied206
Copy link
Contributor Author

ied206 commented Jan 14, 2020

  • indicate in the documentation that for the DetectFromBytes(byte[], int, int) method, a search by BOM will start with offset
  • сorrect message for exception and call condition

I reflected your advice into the code.

@rstm-sf
Copy link
Collaborator

rstm-sf commented Jan 14, 2020

It's look good. Thank you! :)

@304NotModified
Copy link
Member

Thanks for the PR!

I'm fine with merging this one, but I have one question.

The detection of the BOM in the "middle" of a byte array could be an issue for some use cases. Would it be better to make searching for the BOM it configurable ? (so adding a bool parameter?)

@ied206
Copy link
Contributor Author

ied206 commented Jan 22, 2020

The detection of the BOM in the "middle" of a byte array could be an issue for some use cases. Would it be better to make searching for the BOM it configurable ? (so adding a bool parameter?)

I think adding another overload having explicit BOM search location parameter would also be a good solution.

@304NotModified
Copy link
Member

👍. That's for another pull request? (After merging this one)

@304NotModified 304NotModified added this to the 2.3 milestone Jan 26, 2020
@304NotModified 304NotModified merged commit f1aa5fd into CharsetDetector:master Jan 26, 2020
@304NotModified
Copy link
Member

merged! Thanks!

@304NotModified 304NotModified modified the milestones: 2.3, 2.4 Feb 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants