Detecting Free-form Citation Text #1

Open
bnewbold opened this issue May 20, 2019 · 1 comment

Comments

@bnewbold

A great way to use biblio-glutton's deeper citation matching capabilities would be for this extension to detect strings on a page that look like citations (e.g., a list of reference strings) and, if they don't already contain identifiers or URLs, run a match query against the biblio-glutton instance.

The current code uses regexes to match identifier strings. Detecting reference/citation strings would probably be harder: look for a sequence of <li> elements, or short paragraphs? Maybe JavaScript code already exists that does this sort of fuzzy detection, or code in another language that could be adapted.
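
To make the idea concrete, here is a rough sketch of what such fuzzy detection might look like on plain strings. Everything in it is an assumption for illustration: the function names, the regexes, the length bounds, and the threshold of 3 are hypothetical and untuned, and none of it comes from the extension's current code.

```javascript
// Hypothetical heuristic, NOT the extension's code: scores how much a
// string resembles a bibliographic reference. All patterns are untuned guesses.
const IDENTIFIER_PATTERNS = [
  /\b10\.\d{4,9}\/\S+/,          // DOI
  /\barXiv:\s*\d{4}\.\d{4,5}/i,  // arXiv identifier
  /\bPMID:?\s*\d+/i,             // PubMed identifier
  /https?:\/\/\S+/i,             // any URL
];

// True if the string already carries an identifier or URL, in which case
// the existing identifier lookup is enough.
function hasIdentifier(text) {
  return IDENTIFIER_PATTERNS.some((re) => re.test(text));
}

// Adds one point per citation-like cue found in the string.
function citationScore(text) {
  const t = text.trim();
  if (t.length < 40 || t.length > 600) return 0;               // implausible length
  let score = 0;
  if (/\b(?:19|20)\d{2}[a-z]?\b/.test(t)) score += 1;           // publication year
  if (/[A-Z][a-z]+,?\s+(?:[A-Z]\.\s?)+/.test(t)) score += 1;    // "Surname, A. B."
  if (/\bet al\b/.test(t)) score += 1;                          // "et al"
  if (/\b\d+\s*[-–]\s*\d+\b/.test(t)) score += 1;               // page range
  if (/\b(?:vol|pp|journal|proceedings|press)\b/i.test(t)) score += 1; // venue words
  return score;
}

// Keeps strings that look like citations and have no identifier yet.
function findCitationCandidates(strings, threshold = 3) {
  return strings.filter(
    (s) => citationScore(s) >= threshold && !hasIdentifier(s)
  );
}
```

In a page, the input strings could come from the text of consecutive <li> elements or short paragraphs; only the returned candidates would need a full match query.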

Other issues (off the top of my head): a full match query is more expensive than an identifier lookup, and submitting full "raw strings" raises bigger privacy concerns than submitting only matched identifiers (e.g., passwords, email addresses, or other strings could accidentally get sent in the API request).
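
Both concerns could be softened a little on the client side. The sketch below is only an illustration under stated assumptions: the patterns for "sensitive" strings are crude guesses that will miss things and block some legitimate references, and `makeCachedMatcher` and `isSafeToSubmit` are hypothetical names, not part of the extension or of biblio-glutton.

```javascript
// Hypothetical pre-submission filter: crude patterns for strings that
// should never be sent to a remote service. Illustrative only.
const SENSITIVE_PATTERNS = [
  /[^\s@]+@[^\s@]+\.[^\s@]+/,              // email address
  /\b(?:password|passwd|token|secret)\b/i, // credential-related words
  /\b\d{9,}\b/,                            // long digit runs (phone, card, account numbers)
];

function isSafeToSubmit(text) {
  return !SENSITIVE_PATTERNS.some((re) => re.test(text));
}

// Wraps a match function (assumed to perform the expensive query) so that
// unsafe strings are dropped locally and each string is queried only once.
function makeCachedMatcher(matchFn) {
  const cache = new Map();
  return function cachedMatch(text) {
    const key = text.trim();
    if (!isSafeToSubmit(key)) return null; // never leaves the browser
    if (!cache.has(key)) cache.set(key, matchFn(key));
    return cache.get(key);
  };
}
```

A filter like this reduces accidental leaks but does not remove the privacy concern; a blocklist of patterns cannot recognize every sensitive string.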

This would be a bunch of work, and it isn't something I could contribute to in the near future, but I'm opening this issue in case others are interested or know of a project that has already done the hard bits.

@kermitt2
Owner

Thanks Bryan!

This would indeed be a long-term goal given the work involved. I've thought about intermediate steps, and I plan to add the first ones in the near future:

  • let the user freely highlight a reference text in the web page, and call the glutton bibliographical reference matching on this text to add the matched DOI + OA URL. So in this case, the user is doing the hard job...
  • the same on a PDF displayed in a web browser (this would go to grobid first, which is more robust with respect to noisy text),
  • the user selects a full bibliographical section, and grobid is used to segment the references, which are then sent to glutton.

Then, to move further in the direction you point to, we could imagine a lightweight deep learning model for web pages, very similar to GROBID's reference identifier, run by the browser extension via https://github.com/transcranial/keras-js (most of the current grobid models can already run with Keras). Some tags like <li> could be used as additional features, but the problem is then finding training data.

Running the model on a particular page could be triggered by the user, rather than automated in the background like the current regex, both to avoid the legitimate privacy concerns you raise and because it would slow down the browser (unlike the current regex).
