-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migration of first IDataView
docs
#173
Conversation
Incidentally I am putting this in a |
docs/code/KeyValues.md
Outdated
The reality cannot be seen by any conventional means I am aware of, save for | ||
viewing ML.NET's workings in the debugger or using the API and inspecting | ||
these raw values yourself: that `4000` you would see is really stored as the | ||
`byte` `1`, `4001` as `2`, `4002` as `3`, and the missing `�` stored as `0`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a unicode replacement char '�' in this line. I'm assuming this is intentional, though perhaps there's another way to convey the missing key value in a way which doesn't remind me of Mojibake.
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/code/IDataViewImplementation.md
Outdated
nonetheless quite valuable to know. That is, not the `IDataView` spec itself, | ||
but many of the logical implications of that spec. | ||
|
||
We will here starts with the idioms and practices for `IDataView` generally, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"We will here starts" should be "We will here start" #Resolved
docs/code/IDataViewImplementation.md
Outdated
Presumably you are motivated to read this document because you have some | ||
problem of how to get some data into ML.NET, or process data using ML.NET, or | ||
something along these lines. There is a decision to be made about how to even | ||
engineer a solution. Sometimes its quite obvious: text featurization obviously |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Sometimes its quite" should be "Sometimes it's quite" #Resolved
docs/code/IdvFileFormat.md
Outdated
use of LEB128 in places where we are writing values that, on balance, we | ||
expect to be relatively small, and only in cases where there is no potential | ||
for benefit for random access to the associated stream, since LEB128 is | ||
incompatible with random access. However this is not formulated into anything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"However this is not" should be "However, this is not" #Resolved
|
||
Note that identically *constructed* data views are not necessarily | ||
*functionally* identical. Consider this usage of the train and score transform | ||
with `xf=trainScore{tr=ap}`, where we first train averaged perceptron, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend UpperCamel TrainScore
as this is the case I see most often. Also TrainScore is not currently available in ML.NET (or xf=). And, spelling out tr=AveragedPerceptron may be easier for readers to understand. Granted case and long vs. short is up to personal preference. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh... I thought I'd excised all maml
command line things. But now I remember, I saw this, threw up my hands, left it for later, then promptly forgot about it till now. This might require a bit more thought.
Unfortunately I don't know that there's a way to express this in the new public API of ML.NET though. Hmmm. We also haven't migrated some things like the markdown saver. Hmmm.
In reply to: 188811867 [](ancestors = 188811867)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I just don't know what to do about this. A verbal description of what would happen if it were possible to do this is fairly feeble.
In reply to: 188843730 [](ancestors = 188843730,188811867)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment out the section until which point we have a TrainScore? Or perhaps leave behind in another file until we can move it back here? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. Let me think about this. Let me know if you have other ideas.
In reply to: 188877540 [](ancestors = 188877540)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm afraid I have to punt on this one... there's kind of no replacement for this yet. It's kind of minor, we can adjust later I hope.
In reply to: 189015019 [](ancestors = 189015019,188877540)
docs/code/IDataViewImplementation.md
Outdated
Moreover: it is a solution to a problem that does not exist. `IDataView`s are | ||
fundamentally composable structures already, and one of the most fundamental | ||
operations you can do is transform columns into different types. So, there is | ||
no need for you to do the conversion yourself. Indeed it is harmful for you to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Indeed it is harmful" should be "Indeed, it is harmful" #Resolved
docs/code/IDataViewTypeSystem.md
Outdated
|
||
* The `DimCount` property indicates the number of dimensions and the `GetDim` | ||
method returns a particular dimension value. All dimension values are non- | ||
negative integers. A zero dimension value indicates unknown (or variable) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Severely minor: recommend, "A zero-dimension value " #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think perhaps this is a misparse. It's "zero (dimension value)". I can rephrase though, since "2-dimension" and such things are common constructs, and it does resemble that...
In reply to: 188812830 [](ancestors = 188812830)
docs/code/IDataViewTypeSystem.md
Outdated
* `U1`, `U2`, `U4`, `U8`: unsigned integer types with the indicated number of | ||
bytes | ||
|
||
* `UG`: unsigned type with 16-bytes, typically used as an unique ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"as an unique" should be "as a unique"
... yes I had to look this one up to make sure I'm not crazy: https://www.quora.com/Which-is-the-correct-grammar-usage-a-unique-or-an-unique/answer/Stephanie-Fysh?share=1 #Resolved
docs/code/IDataViewTypeSystem.md
Outdated
than the number of bytes in the source's underlying type, or the `Count` | ||
value is positive. In the latter case, the `Count` is necessarily less than | ||
`2^^k`, where `k` is the number of bits in the destination type's underlying | ||
type. For example, `U1[1-*] `can be converted to `U2[1-*]`, but `U2[1-*]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`U1[1-*] `can be converted
has the space on the wrong side of the back tick. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first read this, my brain somehow filled in "traintrack" for "back tick." I imagined a poor little space, on the wrong side of the backtick, suddenly finding itself set upon by the unsavory characters in the KeyType gang...
In reply to: 188813760 [](ancestors = 188813760)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how you think... #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Justin, just clarifying, was there a comment attached here? I've reviewed the language in the section about up-conversion, it seems correct?
In reply to: 188877915 [](ancestors = 188877915)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh oh oh... this was a comment on the earlier issue. Geez CodeFlow github integration sure isn't great at tracking replies. :P
In reply to: 188963444 [](ancestors = 188963444,188877915)
docs/code/IdvFileFormat.md
Outdated
column across multiple rows. Block format is dictated by a codec. There is a | ||
table-of-contents and lookup table to facilitate quasi-random access to | ||
particular blocks. (Quasi in the sense that looking up the value for a column | ||
and particular row may require .) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfinished thought #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/code/IdvFileFormat.md
Outdated
use of LEB128 in places where we are writing values that, on balance, we | ||
expect to be relatively small, and only in cases where there is no potential | ||
for benefit for random access to the associated stream, since LEB128 is | ||
incompatible with random access. However this is not formulated into anything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"However this" should be "However, this" #Resolved
docs/code/IdvFileFormat.md
Outdated
8 | ulong | **Version**: Indicates the version of the data file. | ||
16 | ulong | **CompatibleVersion**: Indicates the minimum reader version that can interpret this file, possibly with some data loss. | ||
24 | long | **TableOfContentsOffset**: The offset to the column table of contents structure. | ||
32 | long | **TailOffset**: The eight-byte tail signature starts at this offset. So, the entire dataset stream should be considered to have eight plus this value bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"this value bytes" should be (something like) "this value in bytes" #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually grammatically correct but, I agree, slightly confusing. I think you were parsing it as "(eight) plus (this value bytes)", but the correct interpretation is "(eight plus this value) bytes". I have changed it though, to say it has "byte length of eight plus this value," to remove any misunderstandings.
In reply to: 188814633 [](ancestors = 188814633)
docs/code/IdvFileFormat.md
Outdated
|
||
The enum for compression kind is one byte, and follows this scheme: | ||
|
||
Compresion Kind | Code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Compresion" should be "Compression " #Resolved
docs/code/IdvFileFormat.md
Outdated
Zlib (i.e., [RFC1950](http://www.ietf.org/rfc/rfc1950.txt)) | 2 | ||
|
||
None means no compression. DEFLATE is the default scheme. There is a tendency | ||
to conflate Zlib and DEFLATE, so to be clear: Zlib can be (somewhat inexactly) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"zlib" is generally fully lowercase. #Resolved
docs/code/IdvFileFormat.md
Outdated
None means no compression. DEFLATE is the default scheme. There is a tendency | ||
to conflate Zlib and DEFLATE, so to be clear: Zlib can be (somewhat inexactly) | ||
considered a wrapped version of DEFLATE, but it is still a distinct (but | ||
closely related) format. However both are implemented by the Zlib library, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"However both are" should be "However, both are" #Resolved
docs/code/KeyValues.md
Outdated
feature vector to be a vector of floating point values and *not* keys, in | ||
typical usage the majority of usages of keys is as some sort of intermediate | ||
value on the way to that final feature vector. (Unless, say, doing something | ||
like preparing labels for a multiclass learner or somesuch.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"somesuch" should be "some such" #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? Hmmm.
Anyway I think I can delete it, it's a bit too folksy of a colloquialism and the "say" already suggests that this is merely one example and there's no need to belabor the point.
In reply to: 188815291 [](ancestors = 188815291)
docs/code/KeyValues.md
Outdated
implementation side, which is both less for us to maintain, and also for users | ||
gives consistency in behavior. | ||
|
||
So for example, the `charTokenize` above might appear to be a strange choice: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend UpperCamel for CharTokenize
#Resolved
docs/code/KeyValues.md
Outdated
gives consistency in behavior. | ||
|
||
So for example, the `charTokenize` above might appear to be a strange choice: | ||
*why* represent characters as keys? The reason is that the N-gram transform is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a few spelling of "ngram" in the docs. Personally, I like "ngram" and "chargram". #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/code/KeyValues.md
Outdated
and so forth. | ||
|
||
There is another implication: a hypothetical type `U1<4000-4002>` is actually | ||
a sensible type in this scheme. The `U1` indicates that is is stored in one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"that is is stored" should be "that is stored" #Resolved
docs/code/VBufferCareFeeding.md
Outdated
representations, since they are equivalent it follows that any code consuming | ||
`VBuffer`s must work equally well with *both*. That is, there must never be a | ||
condition where data is read and assumed to be either sparse, or dense, since | ||
implementors of `IDataView` and related interfaces are perfectly free to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"implementors" is less common of a spelling than "implementers" #Resolved
docs/code/VBufferCareFeeding.md
Outdated
One possible alternate (wrong) implementation of this would be to just say | ||
`dst=src` then scale all contents of `dst.Values` by `c`. But, then `dst` and | ||
`src` would share references to their internal arrays, completely compromising | ||
the callers ability to do anything useful with them: if the caller were to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"callers" should be "caller's" #Resolved
docs/code/VBufferCareFeeding.md
Outdated
representations, since they are equivalent it follows that any code consuming | ||
`VBuffer`s must work equally well with *both*. That is, there must never be a | ||
condition where data is read and assumed to be either sparse, or dense, since | ||
implementors of `IDataView` and related interfaces are perfectly free to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"implementors" is a less common spelling of "implementers" #Resolved
operate on acceptable sets. | ||
|
||
1. The simple enumeration of `UInt128` numeric values from any number is an | ||
acceptable set. (This covers how most loaders generate IDs. Typically we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Typically we" should be "Typically, we" #Resolved
optional metadata. | ||
|
||
Column names are case sensitive. Multiple columns can share the same name, in | ||
which case, one of the columns hides the others, in the sense that the name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to say which column hides the others. Perhaps, "the highest index numbered". #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. While this is part of the existing implementations, it isn't really part of the spec or ISchema
interface. If you were to construct a dataview with columns ["A", "B", "B"]
, but the schema implementation behaves so that TryGetColumnIndex("B", out int colIdx)" succeeds with
colIndex==1`, it would behave perfectly in, say, a transform pipeline and whatnot. I will though add a note about this in the implementation -- actually maybe a bit more, since I see that I had neglected to mention anything about the schema in the implementation notes aside from one or two notes about metadata...
In reply to: 188891796 [](ancestors = 188891796)
* Unsigned 16 byte values for ids and probabilistically unique hashes | ||
* Date time, date time zone, and timespan | ||
* Key types | ||
* Vector types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend "Vector types—for the above base types" #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. If it were that simple I might do that, but it's actually more flexible than that. Vector types over any so-called primitive type, including user defined types, are perfectly all right. But if I were to do that here I would have to define the concept of a primitive type. We already have a vector section though that goes into this in somewhat more detail. I'd prefer to leave that detail there.
In reply to: 188892678 [](ancestors = 188892678)
* Key types | ||
* Vector types | ||
|
||
The set of standard types will likely be expanded over time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call out some notably (currently) missing types? eg: half-precision floats, mixed types, decimal, tensors. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. I'm inclined to not do anything we don't have a definite plan for, if that's all right.
FYI the "tensor" type is a multidim vector, which is already covered as we see for.
In reply to: 188896792 [](ancestors = 188896792)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I released the tensor was wrong as I was re-passing the longer datetype discussion in IDataViewTypeSystem.md and forgot to cleanup my comment.
Not listing the missing types is good too; just ideas. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes like I say I'm wary of listing them, since it could easily be interpreted as a definite future plan. Mentioning that we're adding more IDataView components, sure, but more types -- eh.
In reply to: 189032170 [](ancestors = 189032170)
* Unsigned integer values using 1, 2, 4, or 8 bytes | ||
* Unsigned 16 byte values for ids and probabilistically unique hashes | ||
* Date time, date time zone, and timespan | ||
* Key types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the list of supported column types, should we include, "User defined types"? For example you could pass a custom datatype between two transforms. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this is not a list of "supported" column types, it is a list of "standard" types. (You'll note that we already talked about user defined types in the area preceding this list.)
In reply to: 188897498 [](ancestors = 188897498)
individual slots of a vector-valued column, values associated with a key type | ||
column, whether a column's data is normalized, etc. | ||
|
||
While IDataView schema supports an arbitrary number of columns, it, like most |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the limits of the "arbitrary" number of columns? int.MaxValue? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's arbitrary in the sense that the schema is not fixed, not arbitrary in the sense of completely unlimited and unbounded.
In reply to: 188898484 [](ancestors = 188898484)
|
||
Machine learning and advanced analytics applications often involve high- | ||
dimensional data. For example, a common technique for learning from text, | ||
known as bag-of-words, represents each word in the text as a numeric feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be helpful to our users to link Wikipedia-terms like bag-of-words, or more of a distraction? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah actually, especially since the point of this paragraph seems to be to explain the concept. Wouldn't hurt to have a backup description at hand in case this brief description is insufficient.
In reply to: 188899313 [](ancestors = 188899313)
that maps a text value to the sequence of individual terms in that text, | ||
naturally produces variable-length vectors of text. Then, a hashing ngram | ||
transform may map the variable-length vectors of text to a bag-of-ngrams | ||
representation, which naturally produces numeric vectors of length $2^k$, where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think math renders in github. Could hot-link to a service such as:
src: <img src="https://latex.codecogs.com/gif.latex?$2^k$" title="$2^k$" />
Though we'd have to investigate which services are suitable for use; though the more appropriate method maybe creating an img
folder and storing the prerendered. That option is hard to edit though. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simple things like this, I'd rather just format them differently.
What I've seen other people do is format this like 2^^k
or something. (Not sure why ^^
... Python does **
, everyone knows Python, I'm not sure what language ^^
is from.)
In reply to: 188901632 [](ancestors = 188901632)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW don't search for "latex image service" at work or with kids in the room... trust me on this. :P
In reply to: 188975188 [](ancestors = 188975188,188901632)
docs/code/IDataViewImplementation.md
Outdated
examples of this throughout the codebase. | ||
|
||
Nevertheless: in very specific circumstances we have relaxed this. For | ||
example, the TLC API serves up corrupt `IDataView` implementations that have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should "TLC" remain in the docs? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/code/IDataViewTypeSystem.md
Outdated
savers, trainers, predictors, etc.). The team is actively working on many | ||
more. | ||
|
||
The name IDataView was inspired from the database world, where the term table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This paragraph is also in IDataViewDesignPrinciples.md #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's deliberate. This is kind of spec-ish whitepaper-ish.
In reply to: 188909199 [](ancestors = 188909199)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/code/IDataViewTypeSystem.md
Outdated
For example: | ||
|
||
* A column may have a `BL` valued piece of metadata associated with the string | ||
`IsNormalized` indicating whether the column can be interpreted as a label. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this example confusing. Are we saying a user named the column IsNormalized
, or is the name of the metadata IsNormalized
. Why would either of these tell if the column can be a label? (perhaps regression for linear models?)
Perhaps: "* A column may have a BL
valued piece of metadata (along with its associated string IsNormalized
to name it) indicating the column has already been normalized, and won't require additional normalization before certain learners or transforms." #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire sentence is absolute nonsense so I'm just going to rewrite it. Normalization has nothing to do with labels. Also I'm not sure why we're saying "Metadata is associated with the string..." Metadata just has names, period.
Hello, I am associated with the string Tom Finley. :D :D
In reply to: 189046936 [](ancestors = 189046936)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you great person associated with the string, "Tom Finley"!
docs/code/IDataViewTypeSystem.md
Outdated
|
||
* A column produced by a scorer may have several pieces of associated | ||
metadata, indicating the "scoring column group id" that it belongs to, what | ||
kind of scorer produced the column (e.g., binary classification), and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see 9 instances of double spaces in this file (and perhaps more at the EOL boundary). eg: "scorer · · produced" #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think I've successfully regular expressioned my way around weeding out these problems.
In reply to: 189163935 [](ancestors = 189163935)
docs/code/IDataViewTypeSystem.md
Outdated
a column of a custom (non-standard) type, `Picture<*,*,4>`, where the | ||
asterisks indicate that the picture dimensions are unknown. The last dimension | ||
of `4` indicates that there are four channels in each pixel: the three color | ||
components, plus the alpha channel. Applying a `BitmapScaler` transform scales |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we deprecated BitmapScaler
for ImageResizer
(and similar for the rest of the image transforms. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you're right... and image loader transform and so on, and so on.
In reply to: 189171796 [](ancestors = 189171796)
docs/code/IdvFileFormat.md
Outdated
* All numbers are stored as little-endian, using their natural fix-length | ||
binary encoding. | ||
|
||
* Strings are stored using an unsigned LEB128 number describing the number of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend moving the LEB128 wiki link to the first mention, as I just looked up LEB128 before getting to the wiki link in the next paragraph. #Resolved
docs/code/VBufferCareFeeding.md
Outdated
|
||
A `VBuffer<T>` is a generic type that supports both dense and sparse vectors | ||
over items of type `T`. This is the representation type for all | ||
[`VectorType`](../public/IDataViewTypeSystem.md#vector-representations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broken link. Should be: [`VectorType`](IDataViewTypeSystem.md#vector-representations)
#Resolved
docs/code/VBufferCareFeeding.md
Outdated
Regarding the generic type parameter `T`, the only real assumption made about | ||
this type is that assignment (that is, using `=`) is sufficient to create an | ||
*independent* copy of that item. All representation types of the | ||
[primitive types](../public/IDataViewTypeSystem.md#standard-column-types) have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broken link. Should be: [primitive types](IDataViewTypeSystem.md#standard-column-types)
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've finished my passes over the docs. Besides the above newly added (6) comments, it looks good to me.
The only other unresolved topic was n-gram vs. ngram, which I think we should punt on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8f1da5c
to
c8ead0e
Compare
Migration of some existing internal documentation, rephrased in some places to be more appropriate in context (hopefully successfully). Related to dotnet#160, though this PR would be just part of addressing the issue of moving over internal docs.
Migration of some existing internal documentation, rephrased in some places to be more appropriate in context (hopefully successfully). Related to #160, though this PR would be just part of addressing the issue of moving over internal docs.