In NoSke there are two modes for creating subcorpora:
after issuing a concordance search (KWIC):
FULL: create a subcorpus directly from a search result, i.e. after a search, store all hits in a subcorpus.
TIC: use the tick boxes in the KWIC view to select or deselect specific lines of the search result to be saved as a subcorpus.
Note:
In NoSke, for both FULL and TIC, you can (and must!) specify which environment of each hit is stored in the subcorpus by selecting one of the structures, e.g. s, p, doc ...
When using tick marks in NoSke, you can specify whether the ticked or the unticked items are selected for further display and/or storage.
NoSke offers functionality for managing subcorpora, most importantly for deleting them.
Asil's corpsum
only supplied Version TIC (which is the more complicated form)
only stored the full doc for the selections
In theory, TIC can be used to emulate FULL: with a "select all" button, TIC becomes "equivalent" to FULL. We might run into trouble, though: with FULL only the search request is passed to NoSke, whereas for TIC a complete list of all found ids has to be passed (which can range into the thousands? millions?).
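To get a feeling for the size concern, here is a rough sketch that measures how long the token-id part of such a request gets. The function name is made up for illustration, and the sketch only counts characters; it says nothing about what request size NoSke or the web server in front of it actually accepts.

```python
def id_filter_length(token_ids):
    """Length in characters of a CQL token-id match like [#467|#5139|...]."""
    return len("[" + "|".join(f"#{i}" for i in token_ids) + "]")


# Compare a handful of ticked lines with a "select all" over many hits.
small = id_filter_length([467, 5139, 5213])
large = id_filter_length(range(1_000_000))
print(small, large)
```

The length grows roughly linearly with the number of ids, so a "select all" that is emulated through TIC sends a far larger request than the FULL variant, which only carries the original search.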
Supplying only TIC and always storing the doc could be justified as a simplistic solution for now, but:
when using TIC (tick boxes), "select all", "select only marked" and "select all unmarked" should be supplied
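A minimal sketch of how the three selection options could be resolved into the list of token ids that is then sent along. The function name and the mode strings are hypothetical and not taken from corpsum or NoSke.

```python
def resolve_selection(hit_ids, ticked_ids, mode):
    """Return the token ids to store, given all hits and the ticked ones.

    mode is one of "all", "marked", "unmarked" (names are illustrative).
    """
    ticked = set(ticked_ids)
    if mode == "all":
        return list(hit_ids)
    if mode == "marked":
        return [i for i in hit_ids if i in ticked]
    if mode == "unmarked":
        return [i for i in hit_ids if i not in ticked]
    raise ValueError(f"unknown selection mode: {mode!r}")


hits = [467, 5139, 5213, 7963]
print(resolve_selection(hits, [5139], "marked"))
print(resolve_selection(hits, [5139], "unmarked"))
```

Keeping the original hit order in the result makes the outcome predictable regardless of the order in which the user ticked the lines.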
Possible workaround: users who want to work with subcorpora in a more sophisticated way can, for now, create and manage them directly in NoSke.
Technical background & caveats
Subcorpora are handled with subcorp.
The API documentation for this method is currently faulty! It describes subcorp only as a method for
getting the list of existing sub-corpora and
deleting subcorpora
The documentation fails to describe how to create subcorpora (and even its description of how to delete them seems faulty).
Currently, the only way to find out how to create a subcorpus is to sniff the calls issued by the NoSke GUI.
What can be learned from those calls is that subcorp is passed the concordance query in the form q=... and not in the form json={ concordance_query [...] }.
I am awaiting an answer from the NoSke developers on whether this is a bug or a feature, and whether we can hope for proper API documentation for subcorp.
Overall idea:
When hitting the "+" (create subcorp) in crystal, a whole cascade of API calls is issued, the most important of which is subcorp.
subcorp is handed the original query (but as q=, not as json=).
The difference between FULL and TIC: for TIC, the list of selected token ids is passed as an additional query.
This is a feature of the "normal" NoSke search: when you pass several queries, they are executed one after another, each query filtering down the output of the previous one.
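The chaining behaviour can be pictured with a small toy model. This is plain Python over made-up data and not NoSke code; it only illustrates the idea that each query narrows the result of the one before it.

```python
def apply_queries(tokens, queries):
    """Run each query on the output of the previous one."""
    hits = list(tokens)
    for matches in queries:
        hits = [tok for tok in hits if matches(tok)]
    return hits


# Toy data: (token id, lemma) pairs standing in for corpus positions.
tokens = [(467, "Haus"), (468, "sein"), (5139, "Haus"), (5213, "Haus")]
ticked = {467, 5213}


def lemma_is_haus(tok):
    return tok[1] == "Haus"


def id_is_ticked(tok):
    return tok[0] in ticked


# First query: the original search. Second query: the ticked token ids.
result = apply_queries(tokens, [lemma_is_haus, id_is_ticked])
print(result)
```

In this picture FULL corresponds to passing only the first query, and TIC to passing the first query followed by the id filter.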
For example, the subcorp call that creates a subcorpus from a search for lemma "Haus", narrowed down to manually ticked lines, would be: https://noske-amc.acdh.oeaw.ac.at/bonito/run.cgi/subcorp?q=q[lemma="Haus"]&q=p0 0 1 [#467|#5139|#5213|#7963|#8617|#114000|#119985|#132096|#142360|#149482]&corpname=amc4_demo&create=1&format=json&subcname=Haus_10&struct=s
([#467|#5139|#5213|...] is standard CQL syntax for selecting specific token ids; it simply says: match the token with id 467 OR 5139 OR ...)
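A sketch of how such a URL could be assembled for both modes. The function names are hypothetical. The parameter names (q, corpname, create, format, subcname, struct) and the "p0 0 1" prefix are copied verbatim from the sniffed call above, since they are not covered by the API documentation. The sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode

BASE_URL = "https://noske-amc.acdh.oeaw.ac.at/bonito/run.cgi/subcorp"


def token_id_filter(token_ids):
    """Second q= value that keeps only the given token ids (TIC mode)."""
    alternatives = "|".join(f"#{i}" for i in token_ids)
    # "p0 0 1" is taken as-is from the call captured in the NoSke GUI.
    return f"p0 0 1 [{alternatives}]"


def build_subcorp_url(cql, corpname, subcname, struct, token_ids=None):
    """Build a create-subcorpus URL; pass token_ids for TIC, omit for FULL."""
    params = [("q", "q" + cql)]
    if token_ids:
        params.append(("q", token_id_filter(token_ids)))
    params += [
        ("corpname", corpname),
        ("create", "1"),
        ("format", "json"),
        ("subcname", subcname),
        ("struct", struct),
    ]
    # urlencode percent-encodes brackets, quotes and spaces, and keeps
    # the repeated q parameters in the order given.
    return BASE_URL + "?" + urlencode(params)


tic_url = build_subcorp_url('[lemma="Haus"]', "amc4_demo", "Haus_10", "s",
                            token_ids=[467, 5139, 5213])
full_url = build_subcorp_url('[lemma="Haus"]', "amc4_demo", "Haus_all", "s")
print(tic_url)
print(full_url)
```

The result is percent-encoded, so it looks different from the URL quoted above but decodes to the same parameters. A list of pairs is used instead of a dict because q has to appear twice and in a fixed order.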
hpreki changed the title from "Known issues" to "Known issues - Subcorpora" on Dec 11, 2023