Training step - Meaning of Arrays #12
Great question. They would be considered distinct documents with the same training labels. I originally went with an approach that extracted a set of sentences from a source text and ran the training algorithm sentence by sentence. I did this because repetition was important to training good models. This is no longer the case, as I've made training focus more on the quality of training examples. Your idea of allowing multiple documents to be sent during training works now, but because the document is stored as a "Data" node, it has no unique identity, for instance a URL identifying that document's text. What I am going to do is improve the data model to include an optional document identifier. This would be something you pass along during training:
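Something along these lines, as a sketch (the "id" field name and URL here are illustrative, not a final API):
{
  "text": [
    "Interoperability is the ability of making systems and organizations work together."
  ],
  "label": [
    "Interoperability"
  ],
  "id": "http://example.org/documents/interoperability"
}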
Let me know what you think.
Thx a lot for the explanation and yes, this improved data model would be exactly what I was looking for! 👍 How many training samples would you say are a good amount for your algorithm (roughly: 10k, 100k, 1m?), and would there be a big difference between a few large documents vs. many small documents (like tweets)?
In the movie review dataset, as few as 200 documents is enough to train a model that classifies correctly 60% of the time. Accuracy increases with the number of documents, though eventually at the cost of performance. I'm working on putting together a set of guidelines, drawn from the examples. As far as document size goes, batching tweets with the same hashtags together into one document is equivalent to submitting them individually one by one; all content is treated equally during training. Good generalizations come from content with some uniformity in its grammar, which allows generalizations to be made over a large set of examples. Since the training model performs grammar induction, having many movie reviews by the same author would be less effective than having all reviews in the training data authored by different people.
I attempted to train a model with the example you gave and there seem to be a few issues. Is there an issue with my installation?
C:\Users>curl -H "Content-Type: application/json" -d '{"label": ["Documen
It looks like the JSON request was malformed. I think that on Windows the command prompt doesn't handle single quotes around the JSON payload, so you need to wrap the payload in double quotes and escape the inner quotes.
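As a sketch, the request with cmd.exe-style escaping would look something like this (the endpoint URL assumes a default local install, so adjust it to yours):
C:\Users>curl -H "Content-Type: application/json" -d "{\"text\": [\"A document to train on.\"], \"label\": [\"Document\"]}" http://localhost:7474/service/graphify/training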
Yeah, it seems to be a command prompt issue. Works great using the REST Console Chrome plugin. Very impressed with this plugin, keep up the good work. Hope it yields good results with what I am trying to do.
I'm glad you were able to get it working. Thanks for your support. Please let me know how it goes. |
Hi,
this is more of a question than an "issue": I noticed that during the training step I need to pass an array like:
{
  "text": [
    "Interoperability is the ability of making systems and organizations work together."
  ],
  "label": [
    "Interoperability"
  ]
}
to the endpoint, but in all of your examples the array contains only one element. I am wondering what it would mean for the classifier if I pass several elements in the "text" array. Would they be considered different elements of the same document, or would it see them as two separate documents that have the same label?
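For illustration, I mean a payload like this (the second sentence is just a made-up example):
{
  "text": [
    "Interoperability is the ability of making systems and organizations work together.",
    "Semantic interoperability is the ability of systems to automatically interpret exchanged information."
  ],
  "label": [
    "Interoperability"
  ]
}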
Related to this, and as some input: it would be great if it were actually possible to pass several documents with the same label in "one go" during training. That would drastically reduce the number of HTTP requests in my case and probably speed up training with hundreds of thousands of small documents.
Just an idea :)