| 95 | 95 |         <para>Generally, ensemble models provide better coverage and accuracy than single decision trees. | 
| 96 | 96 |          Each tree in a decision forest outputs a Gaussian distribution.</para> | 
| 97 | 97 |          <para>For more information, see: </para> | 
| 98 |  | -        <list> | 
|  | 98 | +        <list type='bullet'> | 
| 99 | 99 |           <item><description><a href='http://en.wikipedia.org/wiki/Random_forest'>Wikipedia: Random forest</a></description></item> | 
| 100 | 100 |           <item><description><a href='http://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf'>Quantile regression forest</a></description></item> | 
| 101 | 101 |           <item><description><a href='https://blogs.technet.microsoft.com/machinelearning/2014/09/10/from-stumps-to-trees-to-forests/'>From Stumps to Trees to Forests</a></description></item> | 
|  | 
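The hunk above states that each tree in a decision forest outputs a Gaussian distribution rather than a single number. As a rough illustration of that idea (in the spirit of the quantile regression forest paper linked above, not the library's actual implementation), the following Python sketch summarizes every leaf of every tree by the mean and variance of the training labels that reached it. The dataset, tree count, and leaf statistics are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.1 * rng.normal(size=500)

trees = []
for _ in range(10):
    # Bootstrap-sample the data, as a random forest would.
    idx = rng.integers(0, len(X), len(X))
    tree = DecisionTreeRegressor(max_leaf_nodes=16).fit(X[idx], y[idx])
    # Summarize each leaf by the mean/variance of the training labels it received,
    # so the tree can be read as outputting a distribution rather than a point.
    leaves = tree.apply(X[idx])
    stats = {leaf: (y[idx][leaves == leaf].mean(), y[idx][leaves == leaf].var())
             for leaf in np.unique(leaves)}
    trees.append((tree, stats))

def predict_distribution(x):
    # Collect each tree's (mean, variance) for the leaf that x falls into,
    # then report the ensemble mean and the spread across trees.
    per_tree = [stats[tree.apply(x.reshape(1, -1))[0]] for tree, stats in trees]
    means = np.array([m for m, _ in per_tree])
    return means.mean(), means.std()

print(predict_distribution(np.zeros(5)))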
| 146 | 146 |       <summary> | 
| 147 | 147 |         Trains a tree ensemble, or loads it from a file, then maps a numeric feature vector | 
| 148 | 148 |         to three outputs: | 
| 149 |  | -        <list> | 
|  | 149 | +        <list type='number'> | 
| 150 | 150 |           <item><description>A vector containing the individual tree outputs of the tree ensemble.</description></item> | 
| 151 | 151 |           <item><description>A vector indicating the leaves that the feature vector falls on in the tree ensemble.</description></item> | 
| 152 | 152 |           <item><description>A vector indicating the paths that the feature vector falls on in the tree ensemble.</description></item> | 
|  | 
| 157 | 157 |       </summary> | 
| 158 | 158 |       <remarks> | 
| 159 | 159 |         In machine learning, it is a common and powerful approach to use an already trained model when defining new features. | 
| 160 |  | -        <para>One such example would be the use of model's scores as features to downstream models. For example, we might run clustering on the original features,  | 
|  | 160 | +        <para>One such example is the use of a model's scores as features for downstream models. For instance, we might run clustering on the original features,  | 
| 161 | 161 |         and use the cluster distances as the new feature set. | 
| 162 |  | -        Instead of consuming the model's output, we could go deeper, and extract the 'intermediate outputs' that are used to produce the final score. </para> | 
|  | 162 | +        Instead of consuming the model's output, we could go deeper, and extract the 'intermediate outputs' that are used to produce the final score. </para> | 
| 163 | 163 |         There are a number of well-known examples of this technique: | 
| 164 |  | -        <list> | 
| 165 |  | -          <item><description>A deep neural net trained on the ImageNet dataset, with the last layer removed, is commonly used to compute the 'projection' of the image into the 'semantic feature space'. | 
| 166 |  | -            It is observed that the Euclidean distance in this space often correlates with the 'semantic similarity': that is, all pictures of pizza are located close together, | 
|  | 164 | +        <list type='bullet'> | 
|  | 165 | +          <item><description>A deep neural net trained on the ImageNet dataset, with the last layer removed, is commonly used to compute the 'projection' of the image into the 'semantic feature space'. | 
|  | 166 | +            It is observed that the Euclidean distance in this space often correlates with the 'semantic similarity': that is, all pictures of pizza are located close together, | 
| 167 | 167 |             and far away from pictures of kittens. </description></item> | 
| 168 |  | -          <item><description>A matrix factorization and/or LDA model is also often used to extract the 'latent topics' or 'latent features' associated with users and items.</description></item> | 
| 169 |  | -          <item><description>The weights of the linear model are often used as a crude indicator of 'feature importance'. At the very minimum, the 0-weight features are not needed by the model, | 
| 170 |  | -            and there's no reason to compute them. </description></item> | 
|  | 168 | +          <item><description>A matrix factorization and/or LDA model is also often used to extract the 'latent topics' or 'latent features' associated with users and items.</description></item> | 
|  | 169 | +          <item><description>The weights of the linear model are often used as a crude indicator of 'feature importance'. At the very minimum, the 0-weight features are not needed by the model, | 
|  | 170 | +            and there's no reason to compute them. </description></item> | 
| 171 | 171 |         </list> | 
| 172 | 172 |         <para>The tree featurizer uses decision tree ensembles for feature engineering in the same fashion as the examples above.</para> | 
| 173 |  | -        <para>Let's assume that we've built a tree ensemble of 100 trees with 100 leaves each (it doesn't matter whether boosting was used or not in training).  | 
|  | 173 | +        <para>Let's assume that we've built a tree ensemble of 100 trees with 100 leaves each (it doesn't matter whether boosting was used or not in training).  | 
| 174 | 174 |         If we associate each leaf of each tree with a sequential integer, we can, for every incoming example x,  | 
| 175 |  | -        produce an indicator vector L(x), where Li(x) = 1 if the example x 'falls' into the leaf #i, and 0 otherwise.</para> | 
|  | 175 | +        produce an indicator vector L(x), where Li(x) = 1 if the example x 'falls' into the leaf #i, and 0 otherwise.</para> | 
| 176 | 176 |         <para>Thus, for every example x, we produce a 10000-valued vector L, with exactly 100 1s and the rest zeroes.  | 
| 177 |  | -        This 'leaf indicator' vector can be considered the ensemble-induced 'footprint' of the example.</para> | 
| 178 |  | -        <para>The 'distance' between two examples in the L-space is actually a Hamming distance, and is equal to the number of trees that do not distinguish the two examples.</para> | 
|  | 177 | +        This 'leaf indicator' vector can be considered the ensemble-induced 'footprint' of the example.</para> | 
|  | 178 | +        <para>The 'distance' between two examples in the L-space is actually a Hamming distance, equal to twice the number of trees that send the two examples to different leaves.</para> | 
| 179 | 179 |         <para>We could repeat the same thought process for the non-leaf, or internal, nodes of the trees (we know that each tree has exactly 99 of them in our 100-leaf example),  | 
| 180 |  | -        and produce another indicator vector, N (size 9900), for each example, indicating the 'trajectory' of each example through each of the trees.</para> | 
| 181 |  | -        <para>The distance in the combined 19900-dimensional LN-space will be equal to the number of 'decisions' in all trees that 'agree' on the given pair of examples.</para> | 
|  | 180 | +        and produce another indicator vector, N (size 9900), for each example, indicating the 'trajectory' of each example through each of the trees.</para> | 
|  | 181 | +        <para>The distance in the combined 19900-dimensional LN-space counts the leaves and internal 'decisions', across all trees, on which the two examples' trajectories differ.</para> | 
| 182 | 182 |         <para>The TreeLeafFeaturizer also produces a third vector, T, defined as Ti(x) = the output of tree #i on example x.</para> | 
| 183 | 183 |       </remarks> | 
| 184 | 184 |       <example> | 
|  | 
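As a concrete illustration of the three outputs described in the summary and remarks above (T, the per-tree outputs; L, the leaf indicators; N, the path indicators), the following Python sketch uses scikit-learn's random forest as a stand-in for the tree ensemble. The data, the 100-tree and 100-leaf sizes, and the variable names are illustrative assumptions, not the ML.NET TreeLeafFeaturizer API.

import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=1000)

# 100 trees with at most 100 leaves each, mirroring the example in the remarks.
forest = RandomForestRegressor(n_estimators=100, max_leaf_nodes=100).fit(X, y)

x = X[:2]  # two examples, to make the distances discussed above concrete

# T: Ti(x) = output of tree #i on example x.
T = np.column_stack([tree.predict(x) for tree in forest.estimators_])

# L: one indicator per leaf of every tree; exactly one leaf fires per tree,
# so each row contains exactly n_estimators ones (the 'footprint' of the example).
leaf_ids = forest.apply(x)                     # (n_examples, n_trees) leaf indices
encoder = OneHotEncoder(handle_unknown='ignore').fit(forest.apply(X))
L = encoder.transform(leaf_ids).toarray()

# N: one indicator per node of every tree, marking the root-to-leaf 'trajectory'.
# Note: scikit-learn's decision_path marks leaves as well as internal nodes,
# unlike the N described above, which covers internal nodes only.
N = hstack([tree.decision_path(x) for tree in forest.estimators_]).toarray()

# Hamming distance in L-space: twice the number of trees that send the
# two examples to different leaves.
hamming_L = int((L[0] != L[1]).sum())
# Node-level distance: nodes at which the two examples' trajectories differ.
hamming_N = int((N[0] != N[1]).sum())

print(T.shape, L.shape, N.shape, hamming_L, hamming_N)

With 100 trees of up to 100 leaves each, L has about 10,000 columns, in line with the sizes discussed in the remarks; N has roughly 20,000 columns here only because scikit-learn's path encoding includes the leaves as well as the internal nodes.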