-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert LdaTransform to IEstimator/ITransformer API #1410
Conversation
@@ -17,12 +17,19 @@ | |||
using Microsoft.ML.Runtime.Internal.Utilities; | |||
using Microsoft.ML.Runtime.Model; | |||
using Microsoft.ML.Runtime.TextAnalytics; | |||
using Microsoft.ML.Core.Data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ctrl+R+G (aka sort using) #Closed
|
||
// The type T asociated with the delegate will be the actual value type once #863 goes in. | ||
// However, until such time as #863 goes in, it would be too awkward to attempt to extract the metadata. | ||
// For now construct the useless object then pass it into the delegate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment doesn't seems right. #Resolved
"Latent Dirichlet Allocation Transform", LdaTransform.LoaderSignature)] | ||
|
||
[assembly: LoadableClass(typeof(IRowMapper), typeof(LdaTransform), null, typeof(SignatureLoadRowMapper), | ||
"Latent Dirichlet Allocation Transform", LdaTransform.LoaderSignature)] | ||
|
||
namespace Microsoft.ML.Runtime.TextAnalytics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Runtime [](start = 23, length = 7)
Can you merge with master, I'm sure Senja changed this namespace to something better. #Resolved
@@ -646,7 +646,7 @@ public void TestLdaTransformEmptyDocumentException() | |||
|
|||
try | |||
{ | |||
var lda = new LdaTransform(Env, args, srcView); | |||
var lda = LdaTransform.Create(Env, args, srcView); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LdaTransform [](start = 26, length = 12)
Can you switch to estimator call?
Basically var lda = LdaEstimator(env, cols).Fit(srcView).Transform(scrView); #Resolved
/// <param name="numSummaryTermPerTopic">The number of words to summarize the topic</param> | ||
/// <param name="numBurninIterations">The number of burn-in iterations</param> | ||
/// <param name="resetRandomGenerator">Reset the random number generator for each document</param> | ||
/// <param name="onFit">Called upon fitting with the learnt enumeration on the dataset.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sprinkle it with dots at the end of sentence. #Resolved
Contracts.Assert(0 <= iinfo && iinfo < _parent.ColumnPairs.Length); | ||
disposer = null; | ||
|
||
var test = TestType(_srcTypes[iinfo]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestType [](start = 27, length = 8)
@ivan. Should we put this check here v/s putting it in the Mapper constructor ? Does it make a difference ? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It make difference at what point it will fail. I would prefer to test types in constructor.
In reply to: 230899143 [](ancestors = 230899143)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer not to have TestType
idiom please
In reply to: 232105072 [](ancestors = 232105072,230899143)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if column type in incorrect we now throw ExceptSchemaMismatch
In reply to: 232437787 [](ancestors = 232437787,232105072,230899143)
private struct Config | ||
{ | ||
public readonly int NumTopic; | ||
public readonly Single AlphaSum; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public readonly Single AlphaSum; | |
public readonly float AlphaSum; | |
``` #Resolved |
{ | ||
public readonly int NumTopic; | ||
public readonly Single AlphaSum; | ||
public readonly Single Beta; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public readonly Single Beta; | |
public readonly float Beta; | |
``` #Resolved |
@@ -22,6 +22,7 @@ | |||
using System.Text; | |||
using Xunit; | |||
using Xunit.Abstractions; | |||
using static Microsoft.ML.Transforms.Text.LdaTransform; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort the usings. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Control+R Control+G (remove and sort Using) gives this order itself
In reply to: 231324047 [](ancestors = 231324047)
} | ||
|
||
/// <include file='doc.xml' path='doc/members/member[@name="LightLDA"]/*' /> | ||
/// <param name="input">The column to apply to.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you describe what this transform is doing briefly? #Resolved
/// <param name="onFit">Called upon fitting with the learnt enumeration on the dataset.</param> | ||
public static Vector<float> ToLdaTopicVector(this Vector<float> input, | ||
int numTopic = LdaEstimator.Defaults.NumTopic, | ||
Single alphaSum = LdaEstimator.Defaults.AlphaSum, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single alphaSum = LdaEstimator.Defaults.AlphaSum, | |
float alphaSum = LdaEstimator.Defaults.AlphaSum, | |
``` #Resolved |
public static Vector<float> ToLdaTopicVector(this Vector<float> input, | ||
int numTopic = LdaEstimator.Defaults.NumTopic, | ||
Single alphaSum = LdaEstimator.Defaults.AlphaSum, | ||
Single beta = LdaEstimator.Defaults.Beta, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single beta = LdaEstimator.Defaults.Beta, | |
float beta = LdaEstimator.Defaults.Beta, | |
``` #Resolved |
Could you add mlcontext extensions? |
@@ -155,8 +160,10 @@ public bool TryUnparse(StringBuilder sb) | |||
} | |||
} | |||
|
|||
private sealed class ColInfoEx | |||
public sealed class ColumnInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not combine this with transformInfo? You can just add this public constructor and change the name to columnInfo, while keeping the other methods internal/private. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any reason why you can't delete TransformInfo class and use ColumnInfo instead.
In reply to: 231632966 [](ancestors = 231632966)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getSrc(ref src); | ||
lda.Output(in src, ref dst, numBurninIter, reset); | ||
}; | ||
return new LdaTransform(_host, input, _columns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=> #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
using (var ch = Host.Start("Train")) | ||
{ | ||
Train(ch, input, _ldas); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Train [](start = 16, length = 5)
Pete suggested that we make training a private static method in the transform, and that we have a constructor for the transform that will accept the trained parameters as an input.
This should make it easier to allow initialization of the transform using parameters coming from an external training routine. It is not something that we are exposing now, however.
This allows the transform to only keep the fields storing trained parameters. For example _types should no longer be needed as a field in the transform. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented something like this for whitening.
In reply to: 231639657 [](ancestors = 231639657)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. thanks for giving the context about this approach.
In reply to: 231640188 [](ancestors = 231640188,231639657)
|
||
NumThread = item.NumThreads ?? args.NumThreads ?? 0; | ||
ectx.CheckUserArg(NumThread >= 0, nameof(item.NumThreads), "Must be positive or zero."); | ||
public sealed class LdaState : IDisposable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
Can this be made internal/private? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like some 'LDA guts' and should under no circumstances be exposed to the user.
In reply to: 231641222 [](ancestors = 231641222)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this internal. Instead added a class LdaTopicSummary to provide topic summary information back to users (when needed)
In reply to: 232437742 [](ancestors = 232437742,231641222)
… fields in LdaTopicSummary
@Zruty0 I have addressed your PR comments. Could you kindly take a look ? #Resolved |
ping @Zruty0 #Resolved |
using System; | ||
using System.Collections.Generic; | ||
|
||
namespace Microsoft.ML.Transforms.Text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Microsoft.ML.Transforms.Text [](start = 10, length = 28)
Pete will ask you to put static extensions into Microsoft.ML.StaticPipe; #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ectx.CheckUserArg(NumBurninIter >= 0, nameof(item.NumBurninIterations), "Must be non-negative."); | ||
Contracts.CheckValue(input, nameof(input)); | ||
Contracts.CheckValueOrNull(output); | ||
Contracts.CheckUserArg(numTopic > 0, nameof(numTopic), "Must be positive."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CheckUserArg [](start = 26, length = 12)
Unless parsing command line, use CheckParam
instead of CheckUserArg
#Resolved
ctx.CheckAtModel(GetVersionInfo()); | ||
h.CheckValue(input, nameof(input)); | ||
// For each topic, provide information about the (item, score) pairs. | ||
public readonly Dictionary<int, List<Tuple<int, float>>> ItemScoresPerTopic; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List<Tuple<int, float> [](start = 44, length = 22)
Tuple<int, float>
-> (int Item, float Score)
(use named tuple instead of tuples)
List
-> `ImmutableArray
#Resolved
ctx.CheckAtModel(GetVersionInfo()); | ||
h.CheckValue(input, nameof(input)); | ||
// For each topic, provide information about the (item, score) pairs. | ||
public readonly Dictionary<int, List<Tuple<int, float>>> ItemScoresPerTopic; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the int
key in this dictionary? Is it a sequential integer? If yes, make it into ImmutableArray
. If it's not a sequential integer, describe what it is, and make it into an IReadOnlyDictionary
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made it ImmutableArray
since its a sequential integer
In reply to: 234857025 [](ancestors = 234857025)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approval conditional on fixing all the remaining open comments :) |
|
||
/// <summary> | ||
/// Initializes a new instance of <see cref="LatentDirichletAllocationEstimator"/>. | ||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use summary form method above.
It should be user friendly. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
ResetRandomGenerator = item.ResetRandomGenerator ?? args.ResetRandomGenerator; | ||
internal ColumnInfo(Column item, Arguments args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal ColumnInfo(Column item, Arguments args) [](start = 12, length = 48)
you can just call constructor above, and that way you don't need to duplicate code.
internal ColumnInfo(Column item, Arguments args): this( item.Source, item.Name, args.NumTopic, ....)
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
} | ||
|
||
public const string LoaderSignature = "LdaTransform"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
internal #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
// The following call fails because of the following issue | ||
// https://github.com/dotnet/machinelearning/issues/969 | ||
// In this test it manifests because of the WordBagEstimator in the estimator chain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if you uncomment this?
I've recently convert WordBag to estimator so it should work properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR converts LdaTransform to the IEstimator/ITransformer API paradigm (a work item related to #754)
Created separate issue #1411 for providing a convenient way of providing LDA topic summary info to the user