Training step - Meaning of Arrays #12
Great question. They would be considered distinct documents with the same training labels. I originally went with an approach that extracted a set of sentences from a source text and ran the training algorithm sentence by sentence. I did this because repetition was important to training good models. This is no longer the case, as I've made training focus more on the quality of training examples. Your idea of allowing multiple documents to be sent during training works now, but because the document is stored as a "Data" node, it has no unique identity, for instance a URL identifying that document's text. What I am going to do is improve the data model to include an optional document identifier. This would be something you pass along during training:
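Something along these lines, as a sketch (the "id" field name and URL here are illustrative, not a final API):
{
  "text": [
    "Interoperability is the ability of making systems and organizations work together."
  ],
  "label": [
    "Interoperability"
  ],
  "id": "http://example.org/documents/interoperability"
}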
Let me know what you think.
Thx a lot for the explanation and yes, this improved data model would be exactly what I was looking for! 👍 How many training samples would you say are a good amount for your algorithm (roughly: 10k, 100k, 1m?), and would there be a big difference between a few large documents vs. many small documents (like tweets)?
In the movie review dataset, as few as 200 documents is enough to train a model that classifies correctly 60% of the time. Accuracy increases with the number of documents, though eventually at the cost of performance. I'm working on putting together a set of guidelines, drawn from the examples. As far as document size goes, batching tweets with the same hashtags together into one document is equivalent to submitting them individually one by one; all content is treated equally during training. Good generalizations come from content with some uniformity in its grammar, which allows generalizations to be made over a large set of examples. Since the training model performs grammar induction, having many movie reviews by the same author would be less effective than having all reviews in the training data authored by different people.
I attempted to train a model with the example you gave and there seem to be a few issues. Is there an issue with my installation?
C:\Users>curl -H "Content-Type: application/json" -d '{"label": ["Documen
It looks like the JSON request was malformed. I think that on Windows the command prompt doesn't handle single quotes around the JSON payload, so you need to wrap the payload in double quotes and escape the inner quotes.
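As a sketch, the request with cmd.exe-style escaping would look something like this (the endpoint URL assumes a default local install, so adjust it to yours):
C:\Users>curl -H "Content-Type: application/json" -d "{\"text\": [\"A document to train on.\"], \"label\": [\"Document\"]}" http://localhost:7474/service/graphify/training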
Yeah, it seems to be a command prompt issue. Works great using the REST Console Chrome plugin. Very impressed with this plugin, keep up the good work. Hope it yields good results with what I am trying to do.
I'm glad you were able to get it working. Thanks for your support. Please let me know how it goes. |
Hi,
this is more of a question than an "issue": I noticed that during the training step I need to pass an array like:
{
  "text": [
    "Interoperability is the ability of making systems and organizations work together."
  ],
  "label": [
    "Interoperability"
  ]
}
to the endpoint, but in all of your examples the array contains only one element. I am wondering what it would mean for the classifier if I pass several elements in the "text" array. Would they be considered different elements of the same document, or would it see them as two separate documents that have the same label?
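For illustration, I mean a payload like this (the second sentence is just a made-up example):
{
  "text": [
    "Interoperability is the ability of making systems and organizations work together.",
    "Semantic interoperability is the ability of systems to automatically interpret exchanged information."
  ],
  "label": [
    "Interoperability"
  ]
}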
Related to this, and as some input: it would be great if it were actually possible to pass several documents with the same label in "one go" during training. That would drastically reduce the number of HTTP requests in my case and probably speed up training with hundreds of thousands of small documents.
Just an idea :)