-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding documentation about the rest of the classes involved on generating the CSharpAPI #529
Conversation
…les. For the summary text that is common between several learners, the examples will be added on a separate node. An example of how that will look like is in the LogisticRegressionBinaryClassifier and LogisticRegressionClassifier.
…espective projects.
changing the root of the XML documents from docs -> doc, since its only one. Switching all <see href /> to the valid <see cref />
I would suggest to mostly review the text in the doc.xml files. I have marked this as WIP, because i need to do a final pass on extracting the examples separately from the nodes for the xml that is shared between two classes with different names, and only reference the example from the class with the corresponding name. I will update the WIP tag, and close this comment once done. |
// * Each column builds/uses exactly one "vocabulary" (dictionary). | ||
// * Output columns are KeyType-valued. | ||
// * The Key value is the one-based index of the item in the dictionary. | ||
// * Not found is assigned the value zero. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll have to be careful here. Conceptually, the key-values are logically starting at 0, but physically valid values start at 1. I feel like this might not be the best place to talk about key-values unless you're really going to go into them, since otherwise it may be confusing. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving it as is, since it is just a code comment for us.
In reply to: 202454960 [](ancestors = 202454960)
@@ -10,6 +10,7 @@ | |||
<ProjectReference Include="..\Microsoft.ML.Core\Microsoft.ML.Core.csproj" /> | |||
<ProjectReference Include="..\Microsoft.ML.CpuMath\Microsoft.ML.CpuMath.csproj" /> | |||
<ProjectReference Include="..\Microsoft.ML.Data\Microsoft.ML.Data.csproj" /> | |||
<ProjectReference Include="..\Microsoft.ML.FastTree\Microsoft.ML.FastTree.csproj" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 4, length = 84)
Is this necessary? Is this due to some sort of indirect dependency we have via depedency injection or somesuch? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's actually PKPD referencing FastTree in the documentation through a that adds this project reference.
Removing the reference, and the see tag. Let me know if the opposite is desired.
In reply to: 202456059 [](ancestors = 202456059)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I'd prefer to not have things like this result in dependencies. :)
In reply to: 202741801 [](ancestors = 202741801,202456059)
Well, partially. The L1 regularization is actually a technique related but distinct from L-BFGS called OWL-QN. #Pending Refers to: src/Microsoft.ML.StandardLearners/Standard/LogisticRegression/doc.xml:13 in db2592d. [](commit_id = db2592d, deletion_comment = False) |
LASSO is an acronym so I wonder if this should be capitalized? #Closed Refers to: src/Microsoft.ML.StandardLearners/Standard/LogisticRegression/doc.xml:20 in db2592d. [](commit_id = db2592d, deletion_comment = False) |
src/Microsoft.ML.FastTree/doc.xml
Outdated
</summary> | ||
<remarks> | ||
In machine learning it is a pretty common and powerful approach to utilize the already trained model in the process of defining features. | ||
<para>A most obvious example could be to use the model's scores as features to downstream models. For example, we might run clustering on the original features, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A most obvious example [](start = 14, length = 22)
I feel like the words "most obvious" aren't adding to the description here. #Closed
src/Microsoft.ML.FastTree/doc.xml
Outdated
<para>A most obvious example could be to use the model's scores as features to downstream models. For example, we might run clustering on the original features, | ||
and use the cluster distances as the new feature set. | ||
Instead of consuming the model's output, we could go deeper, and extract the 'intermediate outputs' that are used to produce the final score. </para> | ||
There's a number of famous or popular examples of this technique: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a number of famous or popular examples [](start = 8, length = 46)
Should be "There are ... examples" not "There is ... examples". #Closed
src/Microsoft.ML.FastTree/doc.xml
Outdated
</remarks> | ||
<example> | ||
<code> | ||
pipeline.Add(new TreeLeafFeaturizer()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pipeline.Add(new TreeLeafFeaturizer()) [](start = 10, length = 38)
This is very specific to the CSharpApi.cs
code, and will be going away. We're going to need to remember to nuke it. With this going away, ought we to even bother especially since this doesn't seem terribly informative? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment would apply to all the examples. Keeping it, since leaving the documentation without any example would not look good.
In reply to: 202459319 [](ancestors = 202459319)
can be different from LightGbmClassifier, which develops a multi-class classifier directly. | ||
</para> | ||
</remarks> | ||
<seealso cref='Microsoft.ML.Trainers.LogisticRegressionClassifier'>LogisticRegressionClassifier</seealso> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LogisticRegressionClassifier [](start = 6, length = 105)
My expectation with <seealso
tags is that they take the form <seealso cref="Foo"/>
not <seealso cref="Foo">Foo</seealso>
. That latter form, even if it were supported, I would discourage us from using because it will not track renames of the elements. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to this when i needed the absolute name, because all the namespace repetition looked lousy. Agree that the renaming is more important, thought. switching back.
In reply to: 202460733 [](ancestors = 202460733)
src/Microsoft.ML.Transforms/doc.xml
Outdated
<para> | ||
This transform uses a set of aggregators to count the number of non-default values for each slot and | ||
instantiates a <see cref="Microsoft.ML.Runtime.Data.DropSlotsTransform"/> to actually drop the slots. | ||
This transform is useful when applied together with a CategoricalHashOneHotVectorizer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CategoricalHashOneHotVectorizer [](start = 64, length = 31)
Should this have a <see
tag somehow? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, but the compiler was complaining of not being able to find a reference to that class, although the cref of the see tag had absolute reference to the CategoricalHashOneHotVectorizer, starting from the namespace.
I'll check the proj references. Makes no sense to add a reference to the dll containing the C# API, though (if that will fix it ).
In reply to: 202461085 [](ancestors = 202461085)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sfilipi There are a few ways to address this if you want the link to resolve but don't want to mess up code references. My favorite tends to be declaring an assembly reference but specifying <Aliases>
to ensure the reference can't be readily used in code. This form allows for a compiler-validated cref
that doesn't result in an assembly reference in the compiled assembly. I'll try to find an example.
A secondary way to resolve this is using a fully-expanded cref that the compiler simply passes through without validation, but that documentation tools know how to parse:
<!-- This is inferior because the compiler will not validate it, but works as a last resort -->
<see cref="T:Fully.Qualified.Path.To.TypeName" />
``` #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what was not working, actually.
I didn't specify the T: in the cref. I'll try that.
Thanks for the suggestion, though.
In reply to: 202780648 [](ancestors = 202780648)
// Imputation modes are supported for vectors both by slot and across all slots. | ||
// </summary> | ||
// REVIEW: May make sense to implement the transform template interface. | ||
/// <include file='doc.xml' path='doc/members/member[@name="NAReplace"]/*' /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the XML comments have to be tripple ///
indented.
I think this // REVIEW
comment might be messing up the XML parsing. Maybe it should go somewhere else (or even away, it's not adding much info). My suspicion is that this comment isn't actually being compiled into the docs. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... wait were you trying to make it no longer an XML doc comment?
In that case you might want to take our those <summary>
tags.
In reply to: 202461616 [](ancestors = 202461616)
|
||
<member name="CharacterTokenizer"> | ||
<summary> | ||
Character-oriented tokenizer where text is considered a sequence of characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Character-oriented [](start = 8, length = 18)
There's something that seems a bit off about this description. "Character-oriented" is a strange description, in any case text is as far as I know always considered a sequence of characters, and at the end of this I don't really have a clear idea of what it's doing.
Maybe, "breaks text into individual tokens consisting of each individual character" might be a good description? #Closed
</item> | ||
<item> | ||
<description>L-p vector normalization.</description> | ||
</item> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later in this document and I think in other documents you tended to do
<item><description>Foo</description></item>
vs.
<item>
<description>Foo</description>
</item>
The first form might make this list a little less voluminous and easier to read. #Closed
These learner will request normalization from the data pipeline if the | ||
classifier indicates it would benefit from it. Note that even if the | ||
classifier indicates that it does not need caching, OVA will always | ||
request caching, as it will be performing multiple passes over the data set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be misplaced #Closed
<example name="MultiClassNaiveBayesTrainer"> | ||
<example> | ||
<code language="csharp"> | ||
pipeline.Add(new NaiveBayesClassifier(){ NormalizeFeatures = NormalizeOption.Auto, Caching = CachingOptions.Memory }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note about consistency between examples #Closed
</example> | ||
|
||
<member name="OVA"> | ||
<summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
summary [](start = 7, length = 7)
I think the text below would make a great first paragraph of the remarks section, and summary should be one sentence along the lines of:
Trains a one-versus-all multi-class classifier on top of specified binary classifier. #Closed
Prediction is then performed by running these binary classifiers, and choosing the prediction with the highest confidence score. | ||
</summary> | ||
<remarks> | ||
<para>This algorithm can be treated as a wrapper for all the binary classifiers in ML.NET. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be treated as a wrapper for all [](start = 29, length = 35)
'can be used with any of' sounds more appropriate #Closed
@@ -31,6 +31,29 @@ namespace Microsoft.ML.Runtime.Learners | |||
using TDistPredictor = IDistPredictorProducing<Float, Float>; | |||
using CR = RoleMappedSchema.ColumnRole; | |||
|
|||
/// <summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
summary [](start = 9, length = 7)
Why here? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's no entry point for it. it is only needed here.
In reply to: 203200111 [](ancestors = 203200111)
src/Microsoft.ML.Transforms/doc.xml
Outdated
Here p(x,y) is the joint probability density function of X and Y, p(x) and p(y) are the marginal probability density functions of X and Y respectively. | ||
In general, a higher mutual information between the dependent variable (or label) and an independent variable (or feature) means | ||
that the label has higher mutual dependence over that feature. | ||
The mutual information feature selection mode selects the features based on the mutual information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a tautology to me. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not adding to the understanding either. Removing it.
In reply to: 203200715 [](ancestors = 203200715)
</example> | ||
</example> | ||
|
||
<member name="MutualInformationFeatureSelection"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MutualInformationFeatureSelection [](start = 18, length = 33)
Duplicate? #Closed
src/Microsoft.ML.Transforms/doc.xml
Outdated
</remarks> | ||
<example> | ||
<code language="csharp"> | ||
pipeline.Add(new FeatureSelectorByMutualInformation() { Column = new[]{ "Feature1"}, SlotsInOutput = 6 }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new[] [](start = 75, length = 5)
does it have to be an array? That's unfortunate #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.Transforms/doc.xml
Outdated
</member> | ||
|
||
<member name="OptionalColumnTransform"> | ||
<summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
summary [](start = 7, length = 7)
this should probably go into remarks?
The summary could then be
Creates a new column with the specified type and default values.
``` #Closed
</para> | ||
<para> | ||
When computing the mean/max/min value, there is also an option to compute it over the whole column instead of per slot. | ||
This option has a default value of true for variable length vectors, and false for known length vectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooooh, I remember that... Not a source of confusion at all... #Closed
src/Microsoft.ML.Transforms/doc.xml
Outdated
Un-groups vector columns into sequences of rows, inverse of Group transform. | ||
</summary> | ||
<remarks> | ||
<para>This can be thought of as an inverse of the CombinerByContiguousGroupId. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CombinerByContiguousGroupId [](start = 58, length = 27)
link? #Closed
</example> | ||
|
||
<member name="KeyToText"> | ||
<summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
summary [](start = 7, length = 7)
I suggest
Transforms keys in a key column into the corresponding values
The fact that it is using the metadata looks like implementation detail to me #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML.Transforms/doc.xml
Outdated
The resulting data will have all the group key columns preserved, | ||
and the aggregated columns will become variable-length vectors of the original types. | ||
<para>This transform essentially performs the following SQL-like operation:</para> | ||
<para>GroupKey1, GroupKey2, ... GroupKeyK, LIST(Value1), LIST(Value2), ... LIST(ValueN)</para> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
para [](start = 8, length = 4)
I think you missed the SELECT keyword #Closed
// | ||
// All metadata is preserved for the retained columns. For 'unrolled' columns, all known metadata | ||
// except slot names is preserved. | ||
/// <include file='doc.xml' path='doc/members/member[@name="Ungroup"]/*' /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
include [](start = 9, length = 7)
Why do we have both text and include here? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left the comment where they were in the original places, just made them comments instead of documentation - style comments, to avoid having someone working on the code having to open the doc and read through it.
The text in the doc was better worded that the actual comment in the code.
In reply to: 203202675 [](ancestors = 203202675)
</example> | ||
|
||
<member name="WordTokenizer"> | ||
<summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
summary [](start = 7, length = 7)
again, this would make a good first paragraph in remarks, and the summary could be
Splits the text into words using the separator character(s).
``` #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the changes done, only one sentence is gone.
In reply to: 203203181 [](ancestors = 203203181)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
Sorry, I wasn't clear. In reply to: 405745390 [](ancestors = 405745390) Refers to: src/Microsoft.ML.StandardLearners/Standard/doc.xml:10 in 6f45e69. [](commit_id = 6f45e69, deletion_comment = False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…ting the CSharpAPI (dotnet#529) * Moving from xml strings to having the documentation details in xml files. For the summary text that is common between several learners, the examples will be added on a separate node. An example of how that will look like is in the LogisticRegressionBinaryClassifier and LogisticRegressionClassifier. * fixing the aftermath of renaming the XML file. * removing the Desc from the EntryPoint attribute is a bad idea. * removing the XML docs from the doc folder, and added them under the respective projects. * Some OS get picky about casing. * file name should be vanilla * Adding documentation for the first group of transforms * adding more documentation. changing the root of the XML documents from docs -> doc, since its only one. Switching all <see href /> to the valid <see cref /> * formatting tweaks, and adressing most of the code comments. * Extracted the examples outside of the member nodes in the xml, so that they only appear in the CSharpApi classes, and not on the runtime classes. * small fixes * addressing code comments * addressing Pete's comments. * Fixing language around the CharTokenizer description. Closes dotnet#389
Resolves #389
Added more documentation and examples about mostly transforms components. (A few learners as well.)
The documentation for the classes involved in generating the entry points lives in the doc.xml documents, since it needs to be referenced from two locations, for the most part, and since the CSharpApi is auto-generated.