Detecting Free-form Citation Text #1

bnewbold · 2019-05-20T23:28:19Z

A great way to use biblio-glutton's deeper citation matching capabilities would be for this extension to detect strings on a page that look like citations (e.g., a list of reference strings) and, if they don't already contain identifiers or URLs, run a match query against the biblio-glutton instance.

The current code uses regexes to match identifier strings. Detecting reference/citation strings would probably be harder: look for a sequence of <li> elements, or short paragraphs? Maybe JavaScript code already exists that does this sort of fuzzy detection, or there is code in another language that could be adapted.
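To make the <li>/short-paragraph idea a bit more concrete, here is a rough sketch of a rule-based scorer. Everything in it (the function names `citationScore`, `needsMatchQuery`, `isReferenceList`, the patterns, and the thresholds) is hypothetical and not taken from the extension; it only illustrates one possible starting point, and real reference strings vary far more than these patterns cover.

```javascript
// Hypothetical heuristics, not the extension's code: score how much a
// string resembles a bibliographic reference.
const YEAR = /\b(1[89]\d{2}|20\d{2})\b/;                 // 1800-2099
const AUTHOR = /\b[A-Z][a-z]+,?\s+(?:[A-Z]\.\s?)+/;      // "Smith, J." or "Smith J. K."
const PAGES = /\b\d{1,4}\s?[-\u2013]\s?\d{1,4}\b/;       // "123-145"
const HAS_ID = /\b10\.\d{4,9}\/\S+|https?:\/\/\S+/i;     // DOI or URL already present

// One point per signal; strings that are very short or very long score 0.
function citationScore(text) {
  const t = text.trim();
  if (t.length < 40 || t.length > 600) return 0;
  let score = 0;
  if (YEAR.test(t)) score += 1;
  if (AUTHOR.test(t)) score += 1;
  if (PAGES.test(t)) score += 1;
  return score;
}

// Only strings without an identifier or URL need the expensive match query.
function needsMatchQuery(text, threshold = 2) {
  return !HAS_ID.test(text) && citationScore(text) >= threshold;
}

// Treat a run of sibling items (e.g. the text of consecutive <li> elements)
// as a reference list when most of them look like citations.
function isReferenceList(itemTexts, minItems = 3, minFraction = 0.6) {
  if (itemTexts.length < minItems) return false;
  const hits = itemTexts.filter((t) => citationScore(t) >= 2).length;
  return hits / itemTexts.length >= minFraction;
}
```

In the extension, `itemTexts` could come from something like the `textContent` of the children of each <ul>/<ol>, but that wiring is left out here.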

Other issues (off the top of my head): a full match query is more expensive than an identifier lookup query, and submitting full "raw strings" raises greater privacy concerns than submitting only matched identifiers (e.g., passwords, email addresses, or other strings could accidentally get sent in the API request).
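One partial mitigation for the privacy issue would be a conservative filter applied before any raw string leaves the browser. The sketch below is only an illustration under my own assumptions: `safeToSend`, the patterns, and the length limit are made up, and a filter like this reduces accidental leaks rather than ruling them out.

```javascript
// Hypothetical pre-flight check, not part of the extension: refuse to send
// strings that look like they contain personal data or secrets.
const EMAIL = /[^\s@]+@[^\s@]+\.[^\s@]+/;
// Long unbroken runs of token-like characters (API keys, session IDs, ...).
const LONG_TOKEN = /[A-Za-z0-9+\/_=-]{24,}/;

function safeToSend(text, maxLength = 600) {
  const t = text.trim();
  if (t.length === 0 || t.length > maxLength) return false;
  if (EMAIL.test(t)) return false;
  if (LONG_TOKEN.test(t)) return false;
  return true;
}
```

Rejecting oversized strings also caps the size of each match query, which helps a little with the cost concern.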

This would be a bunch of work, and it isn't something I could contribute to in the near future, but I'm opening this issue in case others are interested or know of a project that has already done the hard bits.

kermitt2 · 2019-05-20T23:44:32Z

Thanks Bryan!

This would indeed be a long-term goal given the work to be done. I've thought about intermediate steps, and I plan to add the first ones in the near future:

- Let the user freely highlight a reference text in the web page, and call the glutton bibliographic reference matching on this text to add the matched DOI + OA URL. In this case, the user is doing the hard job...
- The same for a PDF displayed in a web browser (this would go to GROBID first, which is more robust with respect to noisy text).
- The user selects a full bibliographical section, and GROBID is used to segment the references before calling glutton.
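The first step (user highlights a reference, the extension asks glutton) could look roughly like the sketch below. It assumes the glutton instance exposes its raw-string match at `GET /service/lookup?biblio=...`; the base URL, the function names `buildLookupUrl` and `lookupSelection`, and the error handling are placeholders of mine, so check them against the glutton documentation before relying on them.

```javascript
// Assumption: the glutton instance is reachable at `baseUrl` and answers
// GET /service/lookup?biblio=<raw reference string>.
function buildLookupUrl(baseUrl, rawCitation) {
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  const url = new URL("service/lookup", base);
  // Collapse whitespace and line breaks picked up from the page selection.
  url.searchParams.set("biblio", rawCitation.replace(/\s+/g, " ").trim());
  return url.toString();
}

// Returns the parsed JSON record, or null when the request does not succeed
// (for example when no match is found).
async function lookupSelection(baseUrl, selectedText) {
  const response = await fetch(buildLookupUrl(baseUrl, selectedText));
  if (!response.ok) return null;
  return response.json();
}
```

In a page context this might be called as `lookupSelection("http://localhost:8080", window.getSelection().toString())`, with the DOI and OA URL then read from the returned record.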

Then, to move forward in the direction you point to, we could imagine a lightweight deep learning model for web pages, very similar to GROBID's reference identifier, run by the browser extension via https://github.com/transcranial/keras-js (most of the current GROBID models can already run with Keras). Some tags like <li> could be used as additional features, but the problem is then finding training data.

The execution of the model on a particular page could be triggered by the user, rather than running automatically in the background like the current regex, both to avoid the legitimate privacy concerns you raise and because it would slow down the browser (unlike the current regex).
