-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert TextNormalizer to estimator #1276
Convert TextNormalizer to estimator #1276
Conversation
|
||
[Argument(ArgumentType.AtMostOnce, HelpText = "Whether to keep numbers or remove them.", ShortName = "num", SortOrder = 2)] | ||
public bool KeepNumbers = true; | ||
public bool KeepNumbers = TextNormalizerEstimator.Defaults.KeepNumbers; | ||
} | ||
|
||
internal const string Summary = "A text normalization transform that allows normalizing text case, removing diacritical marks, punctuation marks and/or numbers." + | ||
" The transform operates on text input as well as vector of tokens/text (vector of ReadOnlyMemory)."; | ||
|
||
public const string LoaderSignature = "TextNormalizerTransform"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextNormalizerTransform [](start = 47, length = 23)
We could get rid of this and use nameof(TextNormalizerTransform) instead? otherwise we shoudl probably make it private or internal #Resolved
// Greek letters combined with diacritics: | ||
"ΆΑ", "ΈΕ", "ΉΗ", "ΊΙ", "ΌΟ", "ΎΥ", "ΏΩ", "ΐι", "ΪΙ", "ΫΥ", "άα", "έε", "ήη", "ίι", "ΰυ", "ϊι", | ||
"ϋυ", "όο", "ύυ", "ώω", "ϓϒ", "ϔϒ", | ||
public TextNormalizerTransform(IHostEnvironment env, ColumnInfo[] columns) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
Can we make this internal, since users will construct the transform only through the estimators? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually no, non-trainable transformers should have public ctors.
In reply to: 226024916 [](ancestors = 226024916)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None = 2 | ||
} | ||
|
||
public static class Defaults |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
Could this be internal, or private instead of public? The user should not need it. #Resolved
// Hebrew combining diacritics | ||
ch >= 0x0591 && ch <= 0x05BD || ch == 0x05C1 || ch == 0x05C2 || ch == 0x05C4 || | ||
ch == 0x05C5 || ch == 0x05C7 || | ||
private TextNormalizerEstimator(IHostEnvironment env, TextNormalizerTransform transformer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextNormalizerEstimator [](start = 16, length = 23)
When do we use this constructor? The above TextNormalizerEstimator(IHostEnvironment env, params TextNormalizerTransform.ColumnInfo[] columns) can call directly into base. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
public TextNormalizerEstimator(IHostEnvironment env, params TextNormalizerTransform.ColumnInfo[] columns)
:base(Contracts.CheckRef(env, nameof(env)).Register(nameof(TextNormalizerEstimator)), new TextNormalizerTransform(env, columns))
{
}
And we get rid of the private one?
In reply to: 226032246 [](ancestors = 226032246,226028712)
} | ||
|
||
[Fact] | ||
public void TestOldSavingAndLoading() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestOldSavingAndLoading [](start = 20, length = 23)
Both saving and loading seem to be done with the new transform code. Since you changed the versioninfo, it might be useful to have an other test where you load a model that was saved using the old transform code, so that we check that those models can still be loaded? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change defaults visibility as well
private static VersionInfo GetVersionInfo() | ||
{ | ||
return new VersionInfo( | ||
modelSignature: "TEXTNORM", | ||
verWrittenCur: 0x00010001, // Initial | ||
//verWrittenCur: 0x00010001, // Initial | ||
verWrittenCur: 0x00010002, // Params for each column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
verWrittenCur: 0x00010002, // Params for each column. [](start = 16, length = 53)
@Zruty0, I'm actually not sure how benefitial and usefull to provide options for each column rather than provide set of columns and options for all of them. Would be nice if you can comment. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would unblock certain pigsty optimizations, where you can bundle multiple calls to text normalizer (with different params) into one estimator. But since this is not incurring a data pass, I'm ok with not having a per-column option.
In reply to: 226042341 [](ancestors = 226042341)
return h.Apply("Loading Model", ch => new TextNormalizerTransform(h, ctx, input)); | ||
var type = inputSchema.GetColumnType(srcCol); | ||
if (!TextNormalizerEstimator.IsColumnTypeValid(type)) | ||
throw Host.ExceptParam(nameof(inputSchema), TextNormalizerEstimator.ExpectedColumnType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExceptParam [](start = 27, length = 11)
ExceptSchemaMismatch
#Resolved
_columns = new ColumnInfo[columnsLength]; | ||
for (int i = 0; i < columnsLength; i++) | ||
{ | ||
// *** Binary format *** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ *** Binary format *** [](start = 21, length = 23)
yea I don't think it's necessary to introduce that extra code path just to keep per-column options. #Resolved
for (int i = 0; i < src.Count; i++) | ||
inputSchema.TryGetColumnIndex(_parent._columns[i].Input, out int srcCol); | ||
var srcType = inputSchema.GetColumnType(srcCol); | ||
_types[i] = srcType.IsVector ? new VectorType(TextType.Instance) : srcType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VectorType(TextType.Instance) : srcType [](start = 55, length = 39)
I believe you should validate srcType
here.
Also, are you really creating a variable vector of text? Why do you even change type at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
if (src.IsEmpty) | ||
input.Schema.TryGetColumnIndex(_parent._columns[iinfo].Input, out int srcCol); | ||
var srcType = input.Schema.GetColumnType(srcCol); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
srcType [](start = 20, length = 7)
var srcType = input.Schema[_parent._columns[iinfo].Input].Type
#Resolved
|
||
public static bool IsColumnTypeValid(ColumnType type) => (type.ItemType.IsText); | ||
|
||
internal const string ExpectedColumnType = "Expected Text item type"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expected Text item type [](start = 52, length = 23)
how about 'Text or vector of text' ? #Resolved
if (!inputSchema.TryFindColumn(colInfo.Input, out var col)) | ||
throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input); | ||
if (!IsColumnTypeValid(col.ItemType)) | ||
throw Host.ExceptParam(nameof(inputSchema), ExpectedColumnType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExceptParam [](start = 31, length = 11)
ExceptSchemaMismatch
#Resolved
} | ||
|
||
[Fact] | ||
public void TextNormalizeStatic() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextNormalizeStatic [](start = 20, length = 19)
please don't bundle pigsty and non-pigsty tests in one class, put it next to other pigsty tests #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
var col = new StopWordsCol(); | ||
col.Source = wordTokCols[i]; | ||
col.Source = wordTokCols[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Source [](start = 4, length = 6)
fix layout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convert TextNormalizer to estimator