One of the most challenging issues surfaced by those sample queries is misspelling. To accommodate frequent misspellings, you also need to handle domain-specific terms that are not English dictionary words (e.g. tnf). For instance, if you spell-check only English words, you then need to distinguish general English misspellings from non-English tokens like 'tnf'.
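One way to sketch this distinction (a minimal illustration, not this project's implementation; the two vocabularies below are toy stand-ins for a real English dictionary and a domain lexicon of gene symbols, drug names, MeSH terms, etc.) is to gate the spell checker on dictionary membership:

```python
import difflib

# Toy vocabularies; real ones would be loaded from files (assumption).
ENGLISH = {"seizure", "disease", "diseases", "congenital", "exfoliative",
           "breast", "liposomes", "mammalian", "cancer", "protein"}
DOMAIN = {"tnf", "ampk", "vegf", "hac1", "swi"}

def correct_token(token: str) -> str:
    """Suggest a correction only for near-misses of English words;
    leave exact dictionary hits and known domain terms untouched."""
    t = token.lower()
    if t in ENGLISH or t in DOMAIN:
        return token
    # Unknown token: correct it only if it is close to an English word;
    # otherwise assume it is a domain term we simply don't know yet.
    close = difflib.get_close_matches(t, ENGLISH, n=1, cutoff=0.8)
    return close[0] if close else token

print(correct_token("sizure"))  # seizure
print(correct_token("tnf"))     # tnf (domain term, untouched)
```

The pass-through default matters: an unrecognized token that is not close to any English word is more likely a domain term than a typo, so silently "correcting" it would do more harm than good.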
Target Users
General
Our target users are biological and medical researchers, that is, “domain experts”. Their primary need is to stay informed about advancements in their research field. They may also be curious about notable efforts at the margins of or outside their expertise, whether as inspiration for their own work or because of the attention those efforts attract, but these needs are secondary.
These researchers are very familiar with PubMed and use it regularly to search for articles. The target users are interested in collecting information related to a topic, and so would typically use PubMed by submitting “informational queries”. This is unlike “navigational queries” used to locate a specific record or set. It is important to note that like most PubMed visitors, our target users have no background or specialized training with databases, knowledge representation or search.
Given these needs, when users have a set of search criteria they would expect to receive a set of articles ordered by date, akin to a feed. It would not necessarily be problematic for a system to return zero new results (as happens for 10% of PubMed queries). Domain experts are also quite tolerant of noise in sets of informational search hits, because they are mindful of the qualitative difference between content provided by bibliographic indexes and, say, Google. Indeed, their expectations differ from Google or Amazon, where one expects a ranked list of possibly millions of results.
Our users are accustomed to submitting queries consisting of a few terms. As with PubMed and web search engines, the overwhelming majority (>90%) of users submit a single query in a brief session; only a minority go on to refine an initial search. Target users are error-prone: a significant proportion of input search terms are misspelled (10-20%), in line with observations from PubMed. On average, target users enter three (3) terms in a typical query, and the majority (70%) enter up to four (4) terms to define the content they see. Being non-experts in search with informational needs, they will not entertain annotating terms with specialized tags or syntax (e.g. operators), and they are typically not interested in advanced search functions. All of this notwithstanding, it should be noted that a large proportion of PubMed queries in general include an author name (35%).
Individual Users
Rather than define prototypical users a priori and derive hypothetical queries, we do the inverse: randomly sample queries from one day of PubMed logs (Wilbur WJ, Kim W, Xie N. 2006. Spelling correction in the PubMed search engine. Information Retrieval 9; accessible at https://ftp.ncbi.nlm.nih.gov/pub/wilbur/DAYSLOG/) and attempt to infer their intentions. Below in Table 1, I randomly sample 1000 lines of the PubMed log using the command:
$ shuf -n 1000 pubmed-queries.txt > sample-pubmed-queries.txt
and have (arbitrarily) selected 30 queries of interest.

Table 1. Informational query samples (queries shown as logged; spelling corrections in parentheses)
Michael Elowitz
delta opioid receptor
propofol, sizure (seizure)
top 10 deseases for males (diseases)
Christopher Barrett
fetal hemoglobin congeintal heart disease (congenital)
hypoxia sex hormones
rhabdomyolysis and labetalol
swi promoter escape
Kawasaki
jessell t
Thymic selection
gene silencing plant
ampk
reproducibility
muscle myogenesis
Bra1 brain protein
psychometric properties hawaii early learning profile
rheumatoid arthritis X-rays scoring methods reliability
cornea and oxidative stress
Lipossomes cancers Karposi Sarcoma (Liposomes; Kaposi)
brteast cancer positive nodes (breast)
tamoxifen MBC
cyclosporin, expoliative dermatitis (exfoliative)
pancreatitis c reactive peptide
depression and fluoxetine
Hac1 yeast mammanlian (mammalian)
vegf and prostate
cutaneous effects of topamax
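Several of the sampled queries are bare author names ("Michael Elowitz", "jessell t"), consistent with the 35% author-name figure noted above. A rough heuristic for flagging them can be sketched as follows (an assumption for illustration, not a method from this project; a real detector would need a name gazetteer and handle many more formats):

```python
import re

# Two assumed name shapes seen in the sample:
#   "surname initial"       -> "jessell t"
#   "Firstname Lastname"    -> "Michael Elowitz"
AUTHOR_PATTERNS = [
    re.compile(r"[a-z]+ [a-z]", re.IGNORECASE),  # surname + single initial
    re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+"),      # capitalized first + last
]

def looks_like_author(query: str) -> bool:
    """True if the whole query matches one of the assumed name shapes."""
    q = query.strip()
    return any(p.fullmatch(q) for p in AUTHOR_PATTERNS)

for q in ["Michael Elowitz", "jessell t", "delta opioid receptor", "Kawasaki"]:
    print(q, "->", looks_like_author(q))
```

Note the heuristic's blind spot: a single-word query like "Kawasaki" is ambiguous between a surname and a topic (Kawasaki disease), which is exactly why informational and navigational intent are hard to separate from the query string alone.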
Background
Opposing retrieval models
exact-match (Boolean)
best-match (Information retrieval)
Notes on library science tradition vs information & computer paradigm
**Criticisms of Boolean**
Only (1) seems valid after refuting these criticisms.
On IR and the popularity of Google
There is no question that systems such as Google, based on a kind of scoring system, are easy to use and highly popular. But much of the popularity of contemporary search engines may also be attributed to the easy pickings afforded by the first generation of Internet full-text based systems (owing to the cheap cost of digital storage capacity after 1990): no doubt it is good to have all text on the web indexed and made searchable—and often with free access. However, when the easy pickings have been utilized, more complex strategies (and more humanistic approaches) may be needed to make further progress.
PubMed Log analysis
References