Convert TextNormalizer to estimator #1276

Merged

Ivanidzo4ka
Contributor

Convert TextNormalizer to estimator

@Ivanidzo4ka Ivanidzo4ka added the API Issues pertaining the friendly API label Oct 17, 2018

[Argument(ArgumentType.AtMostOnce, HelpText = "Whether to keep numbers or remove them.", ShortName = "num", SortOrder = 2)]
- public bool KeepNumbers = true;
+ public bool KeepNumbers = TextNormalizerEstimator.Defaults.KeepNumbers;
}

internal const string Summary = "A text normalization transform that allows normalizing text case, removing diacritical marks, punctuation marks and/or numbers." +
" The transform operates on text input as well as vector of tokens/text (vector of ReadOnlyMemory).";

public const string LoaderSignature = "TextNormalizerTransform";
Contributor

@artidoro artidoro Oct 17, 2018

TextNormalizerTransform [](start = 47, length = 23)

We could get rid of this and use nameof(TextNormalizerTransform) instead? Otherwise we should probably make it private or internal. #Resolved
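
For illustration, a minimal sketch of the nameof suggestion. The class below is a stand-in written for this example, not the real ML.NET type; it only shows that nameof is a constant expression and can therefore initialize a const field.

```csharp
using System;

// Stand-in class for illustration only; not the actual ML.NET transform.
public sealed class TextNormalizerTransform
{
    // nameof(...) is evaluated at compile time, so it is allowed in a const initializer.
    // The resulting value is the string "TextNormalizerTransform".
    internal const string LoaderSignature = nameof(TextNormalizerTransform);
}
```

One trade-off to weigh, assuming the signature is persisted in saved models: with nameof, renaming the class would also change the string, whereas a literal stays stable across renames.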

// Greek letters combined with diacritics:
"ΆΑ", "ΈΕ", "ΉΗ", "ΊΙ", "ΌΟ", "ΎΥ", "ΏΩ", "ΐι", "ΪΙ", "ΫΥ", "άα", "έε", "ήη", "ίι", "ΰυ", "ϊι",
"ϋυ", "όο", "ύυ", "ώω", "ϓϒ", "ϔϒ",
public TextNormalizerTransform(IHostEnvironment env, ColumnInfo[] columns) :
Contributor

@artidoro artidoro Oct 17, 2018

public [](start = 8, length = 6)

Can we make this internal, since users will construct the transform only through the estimators? #Resolved

Contributor

Actually no, non-trainable transformers should have public ctors.


In reply to: 226024916 [](ancestors = 226024916)

Contributor

Ok good to know!


In reply to: 226071035 [](ancestors = 226071035,226024916)

None = 2
}

public static class Defaults
Contributor

@artidoro artidoro Oct 17, 2018

public [](start = 8, length = 6)

Could this be internal, or private instead of public? The user should not need it. #Resolved

// Hebrew combining diacritics
ch >= 0x0591 && ch <= 0x05BD || ch == 0x05C1 || ch == 0x05C2 || ch == 0x05C4 ||
ch == 0x05C5 || ch == 0x05C7 ||
private TextNormalizerEstimator(IHostEnvironment env, TextNormalizerTransform transformer)
Contributor

@artidoro artidoro Oct 17, 2018

TextNormalizerEstimator [](start = 16, length = 23)

When do we use this constructor? The above TextNormalizerEstimator(IHostEnvironment env, params TextNormalizerTransform.ColumnInfo[] columns) can call directly into base. #Resolved

Contributor Author

In the public constructor above.


In reply to: 226028712 [](ancestors = 226028712)

Contributor

How about:

public TextNormalizerEstimator(IHostEnvironment env, params TextNormalizerTransform.ColumnInfo[] columns)
    : base(Contracts.CheckRef(env, nameof(env)).Register(nameof(TextNormalizerEstimator)), new TextNormalizerTransform(env, columns))
{
}

And we get rid of the private one?


In reply to: 226032246 [](ancestors = 226032246,226028712)

}

[Fact]
public void TestOldSavingAndLoading()
Contributor

@artidoro artidoro Oct 17, 2018

TestOldSavingAndLoading [](start = 20, length = 23)

Both saving and loading seem to be done with the new transform code. Since you changed the VersionInfo, it might be useful to have another test that loads a model saved with the old transform code, so that we check that those models can still be loaded? #Resolved
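
As a rough sketch of what such a backward-compatibility check exercises: a loader that branches on the written version and still accepts the older layout. Everything here is hypothetical (the type names, the payload layout, and the idea that the only per-column parameter is a KeepNumbers flag); it is not the ML.NET model format, only the general version-gating pattern.

```csharp
using System;
using System.IO;

// Hypothetical, self-contained sketch; none of these members are ML.NET APIs.
public static class VersionedLoadSketch
{
    public const uint VerInitial = 0x00010001;   // assumed layout: one shared flag for all columns
    public const uint VerPerColumn = 0x00010002; // assumed layout: one flag per column

    // Writes a payload in the assumed old layout: version, column count, shared flag.
    public static byte[] SaveOldFormat(int columnCount, bool keepNumbers)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(VerInitial);
            writer.Write(columnCount);
            writer.Write(keepNumbers);
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Reads either layout and returns one KeepNumbers flag per column.
    public static bool[] Load(byte[] payload)
    {
        using (var reader = new BinaryReader(new MemoryStream(payload)))
        {
            uint version = reader.ReadUInt32();
            int columnCount = reader.ReadInt32();
            if (columnCount < 0)
                throw new InvalidDataException("Negative column count.");

            var keepNumbers = new bool[columnCount];
            if (version == VerInitial)
            {
                // Old models: expand the single shared flag to every column.
                bool shared = reader.ReadBoolean();
                for (int i = 0; i < columnCount; i++)
                    keepNumbers[i] = shared;
            }
            else if (version == VerPerColumn)
            {
                // New models: each column carries its own flag.
                for (int i = 0; i < columnCount; i++)
                    keepNumbers[i] = reader.ReadBoolean();
            }
            else
            {
                throw new InvalidDataException($"Unsupported version 0x{version:X8}.");
            }
            return keepNumbers;
        }
    }
}
```

A test in this spirit would call VersionedLoadSketch.Load(VersionedLoadSketch.SaveOldFormat(2, false)) and check that it yields two flags, both false. In a real test the old-format bytes would come from a model file checked in before the change rather than from a helper in the same code base.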

Contributor Author

Not anymore


In reply to: 226031938 [](ancestors = 226031938)

change defaults visibility as well
private static VersionInfo GetVersionInfo()
{
return new VersionInfo(
modelSignature: "TEXTNORM",
- verWrittenCur: 0x00010001, // Initial
+ //verWrittenCur: 0x00010001, // Initial
+ verWrittenCur: 0x00010002, // Params for each column.
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Oct 17, 2018

verWrittenCur: 0x00010002, // Params for each column. [](start = 16, length = 53)

@Zruty0, I'm actually not sure how beneficial and useful it is to provide options for each column rather than a set of columns with one set of options for all of them. It would be nice if you could comment. #Closed

Contributor

This would unblock certain pigsty optimizations, where you can bundle multiple calls to text normalizer (with different params) into one estimator. But since this is not incurring a data pass, I'm ok with not having a per-column option.


In reply to: 226042341 [](ancestors = 226042341)

return h.Apply("Loading Model", ch => new TextNormalizerTransform(h, ctx, input));
var type = inputSchema.GetColumnType(srcCol);
if (!TextNormalizerEstimator.IsColumnTypeValid(type))
throw Host.ExceptParam(nameof(inputSchema), TextNormalizerEstimator.ExpectedColumnType);
Contributor

@Zruty0 Zruty0 Oct 17, 2018

ExceptParam [](start = 27, length = 11)

ExceptSchemaMismatch #Resolved

_columns = new ColumnInfo[columnsLength];
for (int i = 0; i < columnsLength; i++)
{
// *** Binary format ***
Contributor

@Zruty0 Zruty0 Oct 17, 2018

/ *** Binary format *** [](start = 21, length = 23)

Yeah, I don't think it's necessary to introduce that extra code path just to keep per-column options. #Resolved

for (int i = 0; i < src.Count; i++)
inputSchema.TryGetColumnIndex(_parent._columns[i].Input, out int srcCol);
var srcType = inputSchema.GetColumnType(srcCol);
_types[i] = srcType.IsVector ? new VectorType(TextType.Instance) : srcType;
Contributor

@Zruty0 Zruty0 Oct 17, 2018

VectorType(TextType.Instance) : srcType [](start = 55, length = 39)

I believe you should validate srcType here.
Also, are you really creating a variable vector of text? Why do you even change type at all?

Contributor

Oh, I see, it's dropping the empty ones. Wow


In reply to: 226072116 [](ancestors = 226072116)
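
To make the "dropping the empty ones" point concrete, here is a small sketch under stated assumptions: the Normalize helper is a toy stand-in (lowercasing plus optional digit removal), not the ML.NET implementation. It shows why the output length can differ from the input length, which is presumably why the output is typed as a variable-length vector of text.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the normalizer; not the ML.NET implementation.
public static class DropEmptySketch
{
    // Lowercases a token and, when keepNumbers is false, strips digits.
    public static string Normalize(string token, bool keepNumbers)
    {
        var kept = token.Where(c => keepNumbers || !char.IsDigit(c)).ToArray();
        return new string(kept).ToLowerInvariant();
    }

    // Normalizes each token and drops those that become empty,
    // so the output length depends on the data rather than on the input size.
    public static string[] NormalizeTokens(string[] tokens, bool keepNumbers)
    {
        return tokens
            .Select(t => Normalize(t, keepNumbers))
            .Where(t => t.Length > 0)
            .ToArray();
    }
}
```

With the input { "Hello", "123", "W0rld" } and keepNumbers set to false, the result is { "hello", "wrld" }: three slots in, two out, because "123" normalizes to an empty string and is removed.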

Contributor

The validation comment still stands


In reply to: 226072884 [](ancestors = 226072884,226072116)

Contributor Author

What kind of validation?


In reply to: 226072949 [](ancestors = 226072949,226072884,226072116)


if (src.IsEmpty)
input.Schema.TryGetColumnIndex(_parent._columns[iinfo].Input, out int srcCol);
var srcType = input.Schema.GetColumnType(srcCol);
Contributor

@Zruty0 Zruty0 Oct 17, 2018

srcType [](start = 20, length = 7)

var srcType = input.Schema[_parent._columns[iinfo].Input].Type #Resolved


public static bool IsColumnTypeValid(ColumnType type) => (type.ItemType.IsText);

internal const string ExpectedColumnType = "Expected Text item type";
Contributor

@Zruty0 Zruty0 Oct 17, 2018

Expected Text item type [](start = 52, length = 23)

How about 'Text or vector of text'? #Resolved

if (!inputSchema.TryFindColumn(colInfo.Input, out var col))
throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input);
if (!IsColumnTypeValid(col.ItemType))
throw Host.ExceptParam(nameof(inputSchema), ExpectedColumnType);
Contributor

@Zruty0 Zruty0 Oct 17, 2018

ExceptParam [](start = 31, length = 11)

ExceptSchemaMismatch #Resolved

}

[Fact]
public void TextNormalizeStatic()
Contributor

@Zruty0 Zruty0 Oct 17, 2018

TextNormalizeStatic [](start = 20, length = 19)

Please don't bundle pigsty and non-pigsty tests in one class; put this one next to the other pigsty tests. #Resolved

Contributor

@Zruty0 Zruty0 left a comment

:shipit:

{
var col = new StopWordsCol();
col.Source = wordTokCols[i];
col.Source = wordTokCols[i];
Contributor Author

Source [](start = 4, length = 6)

fix layout

Contributor

@artidoro artidoro left a comment

:shipit:

@Ivanidzo4ka Ivanidzo4ka merged commit a285f8d into dotnet:master Oct 18, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
3 participants