Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove variant names from index #529

Merged
merged 1 commit into from
Sep 29, 2021
Merged

Conversation

orangejulius
Copy link
Member

Who's on First variant names are a useful collection of unofficial names for places, but they tend to be pretty messy. This PR explores the effect of removing them from indexing.

While there might be occasionally useful names in there, it seems like the majority are exact or near duplicates of more official names, or names that are so colloquial that they are not particularly useful (do we really need to support returning NYC for queries for "the big apple"?).

Here are some variant names for some key places, just to record the kind of data that's in there:

NYC:

Bigapple
NY City
NY Cty
New York City
New York Cty
Newyork
Newyorkcity
Novaiorque
Nycity
Thebigapple
Big Apple

San Francisco:

S Francisco
S. Francisco
SFO
Sanfran
Sanfrancisco
Frisco

China:

China - Peoples Republic
China Peoples Rep
China, People's Republic
Chinese
PR China
PR of China
People's Republic of China
Peoples Republic of China

Copy link
Member

@Joxit Joxit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Paris

Pantruche
Ville-Lumière
Paname
Lutèce

I'm quite amazed, Google does return Paris when you search Ville-Lumière or Paname, maybe an easter egg 😅

@missinglink
Copy link
Member

I'm 👍 for this, just interested to see if it causes any acceptance test failures before merging.

@missinglink
Copy link
Member

If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit.

I don't remember the history of this but it's likely the convention was copied across to other parts of the codebase.

@missinglink
Copy link
Member

I thought things like Ville-Lumière, Frisco, Big Apple were supposed to be filed under 'colloquial' 🤷‍♂️

It's seems 'variant' is where toponyms go to die 😆

@orangejulius
Copy link
Member Author

If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit.

Agreed, I was just thinking about that. The impact of all those extra names is probably even higher for Placeholder since it considers matches across the entire parent hierarchy.

Who's on First variant names are a useful collection of unofficial names
for places, but they tend to be pretty messy. This PR explores the
effect of removing them from indexing.

While there might be occasionally useful names in there, it seems like
the majority are exact or near duplicates of more official names, or
names that are so colloquial that they are not particularly useful (do
we _really_ need to support returning NYC for queries for "the big
apple"?).

For reference, here are some variant names for some key places, just to
record the kind of data that's in there:

NYC:
```
Bigapple
NY City
NY Cty
New York City
New York Cty
Newyork
Newyorkcity
Novaiorque
Nycity
Thebigapple
Big Apple
```

San Francisco:
```
S Francisco
S. Francisco
SFO
Sanfran
Sanfrancisco
Frisco
```

China:
```
China - Peoples Republic
China Peoples Rep
China, People's Republic
Chinese
PR China
PR of China
People's Republic of China
Peoples Republic of China
```
@orangejulius
Copy link
Member Author

orangejulius commented Sep 28, 2021

Okay, the results from this are in and they look pretty good. There is a decent increase to the overall score of our autocomplete acceptance tests, and a big increase in some other test cases like top_us_cities and us_states. There's almost no difference to test suites that look at addresses, which is expected for a change to the WOF importer.

As far as I can see there are almost no significant regressions from this change. A few individual autocomplete characters here and there, but nothing that looks like a trend.

If I had to summarize, overall it looks like removing variant names has three main positive effects

  • Removing useless variant names allows desired results to score higher (the extra variant names mean Elasticsearch considers the name field to be longer, and thus gives a lower score)
  • Desired records that were previously erronously deduplicated due to variant names that happened to mach will now be displayed (the variant name could be either on the undesired record that remained or on the desired record that was removed)
  • Records that aren't desired are often removed from results, because the variant name that was causing it to be shown in results is now no longer included

Here's some cases that show off one or more of these.

New York, New York

This is a query that has often been tough to get right since there are several results we want near the top, and lots of chances for duplicates or undesirable records to sneak in.

image
The autocomplete results don't really tell the whole story, here's the results from the query before/after:
image

The test says that both New York city and county should appear in the results. I'd argue We should add New York State to that list. But in any case the removal of variant names mean that the desired results for the WOF city and county records score higher than before. This boosts them above the East New York locality and the New York City result from Geonames (hopefully we can remove that one completely via deduplication someday).

I think this also shows that when we really fix our scoring in pelias/pelias#862, we'll see even more and better results like this.

Missouri

A common trend in city and state results is fixing issues where a record simply wasn't ever displayed because it would be deduplicated. For example, the state of Missouri would essentially never come up in results because it would be deduped with Missouri Township, MO, which has Missouri in its list of variant names.

Our deduplication code currently prefers more granular results (for example, locality over county or region) in these cases. We might want to make that a little bit more strict with something like pelias/api#1557. A region and a locality with wildly different populations should probably not be considered duplicates if we can avoid that causing issues with places like Berlin.

image

image

There's still some deduplication related issues here that we should look at, many of them can be fixed with data updates.
image

Various cities now showing up earlier in autocomplete

Individually these are all not necessarily amazing changes, but I noticed a decent trend of cities showing up one or two characters earlier in results. When we're talking queries that are only 2-4 characters, that's actually a big deal!
image

image

image

Final notes

I was expecting a big of a decrease in index size for this change, since there are a reasonable number of variant names out there. But it turned out to only be about 5MB. I suppose there might be a slight performance increase because fewer documents will match any given query, but I'm expecting it to not be something we can notice. We should just be able to go by the various improvements and feel confident merging this :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants