Known issues - Subcorpora #3

Open
2 of 4 tasks
hpreki opened this issue Dec 11, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@hpreki
Contributor

hpreki commented Dec 11, 2023

Subcorpora

Usage of existing

  • make pre-existing subcorpora selectable

Creation of new subcorpus

NoSke offers two modes for creating a subcorpus, both available after a concordance search (KWIC):

  • FULL: create the subcorpus directly from a search result. After a search, all hits are stored in the subcorpus.
  • TIC: use the tick boxes in the KWIC view to select or deselect specific lines of the search result, and save only those as the subcorpus.

Note:

  • In NoSke, for both FULL and TIC, you can (and must!) specify which environment of the search result is stored in the subcorpus by selecting one of the structures, e.g. s, p, doc ...
  • In NoSke, when using tick marks, you can specify whether the ticked or the unticked items are kept for further display and/or storage.
  • NoSke offers functionality for managing subcorpora, most importantly for deleting them.

Asil's corpsum

  • only supplied the TIC version (which is the more complicated form)
  • only stored the full doc for the selections

In theory, TIC can be used to emulate FULL: with a "select all" button, TIC is "equivalent" to FULL. We might run into trouble, though: with FULL only the search request is passed to NoSke, whereas for TIC a complete list of all found ids has to be passed (which can range into the thousands? millions?).
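To get a feeling for how large the TIC id list becomes, here is a minimal sketch that measures the length of the token-id filter for a given number of hits. The function name `tic_filter_length` is hypothetical, and the sketch says nothing about what request size NoSke or any proxy in front of it actually accepts.

```python
def tic_filter_length(token_ids):
    """Return the character length of a CQL id filter such as [#467|#5139].

    Hypothetical helper: it only measures the unencoded filter text,
    not the full request.
    """
    body = "|".join(f"#{tid}" for tid in token_ids)
    return len(f"[{body}]")


if __name__ == "__main__":
    # Illustrative id ranges only, not real search results.
    for count in (10, 10_000, 1_000_000):
        ids = range(100_000, 100_000 + count)
        print(count, "ids ->", tic_filter_length(ids), "characters")
```

The length grows linearly with the number of selected hits, which is why a "select all" in TIC mode is much heavier than FULL, where only the original query is sent.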

Supplying only TIC and always storing the doc could be justified as a simplistic solution for now, but:

  • when using TIC (tick boxes), "select all", "select only marked" and "select all unmarked" should be supplied
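The three selection modes could be resolved to a list of token ids roughly as follows. This is only a sketch: the function name `select_token_ids` and the mode strings are made up for illustration and do not come from crystal or NoSke.

```python
def select_token_ids(hit_ids, ticked_ids, mode):
    """Resolve a selection mode to the token ids that go into the subcorpus.

    hit_ids    -- ids of all hits in the search result, in display order
    ticked_ids -- ids of the hits whose tick box is checked
    mode       -- "all", "marked" or "unmarked" (hypothetical names)
    """
    ticked = set(ticked_ids)
    if mode == "all":
        return list(hit_ids)
    if mode == "marked":
        return [tid for tid in hit_ids if tid in ticked]
    if mode == "unmarked":
        return [tid for tid in hit_ids if tid not in ticked]
    raise ValueError(f"unknown selection mode: {mode!r}")


if __name__ == "__main__":
    hits = [467, 5139, 5213, 7963]
    print(select_token_ids(hits, [5139, 7963], "unmarked"))
```

For "all", it may be simpler to skip the id list entirely and fall back to the FULL behaviour of passing only the original query.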

Possible workaround: users who want to work with subcorpora in a more sophisticated way can, for now, create and manage them in NoSke directly.

Technical background & caveats

Subcorpora are handled with subcorp.
The API documentation for this method is currently faulty! It describes subcorp as a method only for

  • getting the list of existing sub-corpora and
  • deleting subcorpora

The documentation fails to describe how to create subcorpora (and even the description of how to use it for deletion seems faulty).

Currently the only way to find out how to create a subcorpus is to sniff the calls issued by the NoSke GUI.
What can be learned from there is that subcorp is passed the concordance query in the form q=... and not in the form json={ concordance_query [...] }.
I am awaiting an answer from the NoSke developers on whether this is a bug or a feature, and whether we can hope for proper API documentation for subcorp.

Overall idea:
Hitting the "+" (create subcorp) in crystal issues a whole cascade of API calls, the most important of which is subcorp.
subcorp is handed the original query (but as q=, not as json=).

The difference between FULL and TIC: for TIC, the list of selected token ids is passed as an additional query.
This relies on a feature of the "normal" NoSke search: when you pass several queries, they are executed one after another, each query filtering down the output of the previous one.

For example, the subcorp call that creates a subcorpus from a search for lemma "Haus", restricted to manually ticked lines, would be:
https://noske-amc.acdh.oeaw.ac.at/bonito/run.cgi/subcorp?q=q[lemma="Haus"]&q=p0 0 1 [#467|#5139|#5213|#7963|#8617|#114000|#119985|#132096|#142360|#149482]&corpname=amc4_demo&create=1&format=json&subcname=Haus_10&struct=s

([#467|#5139|#5213|...] is standard CQL syntax for selecting specific token ids: it simply says "match the token with id 467 OR 5139 OR ...".)
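A minimal sketch of how such a call could be assembled in Python. The function names are hypothetical, and the parameter names (q, corpname, create, format, subcname, struct) as well as the `p0 0 1` prefix are copied verbatim from the sniffed example above, not taken from any official API documentation. The sketch only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Endpoint taken from the example call above.
BASE_URL = "https://noske-amc.acdh.oeaw.ac.at/bonito/run.cgi/subcorp"


def token_id_filter(token_ids):
    """Build the second q value, e.g. 'p0 0 1 [#467|#5139]'.

    The 'p0 0 1' prefix is reproduced as seen in the sniffed call.
    """
    alternatives = "|".join(f"#{tid}" for tid in token_ids)
    return f"p0 0 1 [{alternatives}]"


def subcorp_create_url(corpname, subcname, struct, cql, token_ids=None):
    """Assemble a subcorp creation URL (hypothetical helper).

    Without token_ids only the original query is passed (FULL);
    with token_ids a second q parameter filters the hits (TIC).
    """
    params = [("q", "q" + cql)]
    if token_ids:
        params.append(("q", token_id_filter(token_ids)))
    params += [
        ("corpname", corpname),
        ("create", "1"),
        ("format", "json"),
        ("subcname", subcname),
        ("struct", struct),
    ]
    # urlencode percent-encodes brackets, quotes, '#' and '|' and keeps
    # repeated keys in order, which matters because each q filters the
    # result of the previous one.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(subcorp_create_url("amc4_demo", "Haus_10", "s",
                             '[lemma="Haus"]', [467, 5139, 5213]))
```

Passing the parameters as a list of pairs rather than a dict is deliberate: a dict cannot hold the q key twice.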

@hpreki hpreki changed the title Known issues Known issues - Subcorpora Dec 11, 2023
@lukas-moertl
Collaborator

@hpreki what do you mean with "pre-existing sub corpora: not selectable yet - is a must"?

@hpreki
Contributor Author

hpreki commented Dec 20, 2023

> @hpreki what do you mean with "pre-existing sub corpora: not selectable yet - is a must"?

is already closed.

@lukas-moertl lukas-moertl added the enhancement New feature or request label Jan 15, 2024
@lukas-moertl
Collaborator

disallow ticking of documents of different corpora

@ctot-nondef
Member

as mentioned in #51

3 participants