-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use fuse.js for dbt docs search #162
Conversation
From the look of things, there will need to be some level of transformation from the raw data in manifest.json, because Fuse expects a list of items but each model in the manifest is its own object. The data that gets passed into Fuse will need to look more like this: [
{
"node": "model.educationperfect.classes",
"description": "All classes, including staff classes and Unlicensed Users",
"alias": "classes",
"columns": [
{
"name": "class_id",
"description": "",
"tags": []
},
{
"name": "is_current_academic_year",
"description": "Current means that today's date is inside the bounds of the academic year set on the class (if one is set)",
"tags": ["some-tag"]
}
],
"tags": ["another-tag"],
"raw_sql": "with source as (\r\n select *\r\n from {{ source('classes', 'classes') }}\r\n),\r\n\r\nrenamed as (\r\n select \r\n id as class_id,\r\n name as class_name,\r\n school as org_id,\r\n datecreated as class_created_at,\r\n case\r\n when classtypeid = 1 then 'Standard'\r\n when classtypeid = 2 then 'Unlicenced Users'\r\n when classtypeid = 3 then 'Staff'\r\n when classtypeid = 4 then 'CRM Staff'\r\n end as class_type,\r\n academicyearid as academic_year_id,\r\n classtypeid = 1 as is_student_class \r\n from source\r\n),\r\n\r\nacademic_years as (\r\n select *\r\n from {{ref('classes__academic_years')}}\r\n),\r\n\r\ncombined as (\r\n select renamed.*,\r\n coalesce(academic_years.is_currently_active, false) as is_current_academic_year\r\n from renamed\r\n left outer join academic_years using (academic_year_id)\r\n)\r\n\r\nselect * from combined",
},
{
"node": "model.educationperfect.classes__academic_years",
"description": "",
"alias": "classes__academic_years",
"columns": [
{
"name": "academic_year_id",
"description": "",
"tags": []
},
{
"name": "start_date",
"description": "",
"tags": []
},
{
"name": "end_date",
"description": "",
"tags": []
}
],
"tags": [],
"raw_sql": "with source as (\r\n select * from {{source('classes', 'academicyear')}}\r\n),\r\n\r\nrenamed as (\r\n select id as academic_year_id,\r\n name as academic_year_name,\r\n hemisphereid as hemisphere_id,\r\n startdate as start_date,\r\n enddate as end_date,\r\n startdate <= date_trunc('day', getdate()) and enddate >= date_trunc('day', getdate()) as is_currently_active\r\n from source\r\n)\r\n\r\nselect *\r\nfrom renamed",
}
] than the default: "columns": {
"class_id": {
"name": "class_id",
"description": "",
"tags": []
}
...
} I'll work on that next for a while, because that doesn't depend on working out why Fuse gives |
This reverts commit 4f57431.
…/dbt-docs into feat-143-fusejs-search
@drewbanin Happy New Year! When you're back on deck, I'd appreciate a conceptual review before I go too much further down this rabbit hole:
The other blocker I have at the moment is successfully installing Fuse. I got it working in a new Glitch project perfectly happily, which seems to be identical to the docs project. Obviously I've done something wrong though, or there's a weird interplay between Fuse and some other lib that docs uses. Once Fuse actually loads and I can get some results, I can dive into highlighting results etc instead of just dumping stuff in the console. |
// make a copy | ||
var newModel = _.assign({}, model); | ||
newModel.columns = _.values(model.columns); | ||
newModel.searchableName = getModelName(model); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interesting side-effect of this approach is that macros from other packages become much surfaced much more often than before (esp on short search terms), e.g. dbt-utils and codegen, because the package name is also passed in as a search option, as opposed to just for disambiguation before being rendered on the page.
It does the same thing for sources (in that the schema name is searchable as well now). This feels more clearly cut as a benefit.
I've done it this way because fuse gives back a range of matching characters for highlighting, and tacking on an extra string afterwards makes that a bit less elegant than I'd like (and means that schema names etc wouldn't highlight at all). I think this is somewhere between a net positive and neutral, but might be lying to myself so that I don't have to fix it. Open to alternative views.
In general, I'm a bit bemused at the number of different ways different resources can be named, and that there isn't a canonical field for this. Or is there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha - ok - this one sounds like a can of worms. dbt does have "unique identifiers" but they are not semantically super useful (eg. they look like model.fishtown.fct_users
). It would be really good to give every node type a canonical name for search representation, but that's something we'd need to layer in as you've done here AFAIK
Going to come back to this to take a deeper dive soon
if( | ||
(show_names && match.key === "searchableName") | ||
|| (show_descriptions && (match.key === "description" || match.key == "columns.description")) | ||
|| (show_columns && match.key === "columns.name") | ||
|| (show_code && match.key === "raw_sql") | ||
|| (show_tags && (match.key === "tags" || match.key == "columns.tags")) | ||
) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, these tickboxes no longer do exactly what they used to. They used to be "search only these fields for the query term", but are now "only return results where this field was one of the ones that included a query term".
In real life, this means that a model tagged with my-cool-tag
will lose out to a different model called my_cool_tag
and tagged with something-else
when searching for cool tag
with Tags Only ticked.
As far as I can see, to get around this would require a re-instantiation of Fuse with a new index (by passing just the relevant keys) each time these tick boxes were toggled.
const fuseOptions = { | ||
includeScore: true, | ||
includeMatches: true, | ||
ignoreLocation: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of Fuse's options is that by setting ignoreLocation, there's no point in setting threshold. Does that sound right to you?
I'd like to tone down the fuzziness a bit, but it's not in terms of how close to the start of the string a term is, it's how aggressively it thinks character replacements/transpositions should be to get a match. And I don't think it offers any other levers to pull.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm hm hm! I am reading this slightly differently - it sounds so me like ignoreLocation
will score matches in any part of the string, but I believe that threshold still has a role to play! I played around with this by setting threshold to 0.1
, then searching for "zz" and did not see any matches. Next, I changed the threshold to 0.9
and searched for "zz" and I did see matches!
I bet we can make the search less fuzzy by reducing the threshold from the default (0.6
) to a smaller number... unclear to me how to pick that number beyond guess-and-check'ing though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! Currently on 0.4 - I tried 0.3 but that excluded some things I thought should be allowed.
overflow: hidden; | ||
text-overflow: ellipsis; | ||
display: -webkit-box; | ||
-webkit-line-clamp: 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like there should be some way to extend a base line-clamp, but my CSS isn't good enough so just copy-pasting for now.
@drewbanin I'm pretty sure everything works end-to-end now! I've left comments and questions in different places. Next step is to work out why it performs so slowly on larger projects - if you've got any hints on that front that'd be great. I've spent some time enabling and disabling indexing by various keys, but haven't found a specific smoking gun. I have noticed that it seems to be calling some functions twice, but no idea where or why. |
includeScore: true, | ||
includeMatches: true, | ||
ignoreLocation: true, | ||
//useExtendedSearch: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really want to enable extended search, because it means you can prefix things with a single quote to force exact matches over fuzzy ones. However when I enabled it, I found that the matches array has way more sharp edges (to name a few: unsorted, duplicate records, overlapping ranges).
The duplicates and sorting are pretty trivial to resolve, but I haven't come up with an elegant solution to making sense of [[1, 3], [1, 1], [1, 2], [2, 3]]. I think this might be something to come back and have a second pass at once the bulk of it's merged in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right on - totally agree - let's get fuse in here and then make it more sophisticated in the future!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joellabes I took another pass at this today - thanks for your patience!! Looking really good so far. For my first pass, mostly focused on the UX side of things. Need to come back here to dig a little bit deeper into the code and how everything works. Let me know if you have any questions, thoughts, or comments!!
src/app/main/index.js
Outdated
@@ -145,67 +145,9 @@ angular | |||
}); | |||
|
|||
$scope.$watch('search.query', function(q) { | |||
$scope.search.results = assignSearchRelevance(projectService.search(q)); | |||
$scope.search.results = projectService.search(q); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that both this line and this line in components/search/search.js
are both called when the search.query
variable changes. I'm seeing some input lag between typing a key and seeing results (~150 models in my test project) and I wonder if running the search twice for each keystroke contributes to the latency?
scope.results = filterResults(projectService.search(scope.query), scope.checkboxStatus);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a quick loom with some of my observations & open questions - check it out here! https://www.loom.com/share/8422b5b0865d40e0a477cf25cb0815e5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken the redundant search call out of index.js - on my computer it's still sitting around 350ms per keystroke which is odd, I would have expected that to be much lower now that it's only doing the work once. I also did some basic Googling on $$debounceViewValueCommit but nothing jumped right out at me :(
const fuseOptions = { | ||
includeScore: true, | ||
includeMatches: true, | ||
ignoreLocation: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm hm hm! I am reading this slightly differently - it sounds so me like ignoreLocation
will score matches in any part of the string, but I believe that threshold still has a role to play! I played around with this by setting threshold to 0.1
, then searching for "zz" and did not see any matches. Next, I changed the threshold to 0.9
and searched for "zz" and I did see matches!
I bet we can make the search less fuzzy by reducing the threshold from the default (0.6
) to a smaller number... unclear to me how to pick that number beyond guess-and-check'ing though :)
includeScore: true, | ||
includeMatches: true, | ||
ignoreLocation: true, | ||
//useExtendedSearch: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right on - totally agree - let's get fuse in here and then make it more sophisticated in the future!
try { | ||
//As search terms become longer, be less tolerant of tiny fuzzy matches | ||
var shortestWord = q.split(' ').sort(function(a, b){ return a.length - b.length})[0] | ||
service.fuse.options.minMatchCharLength = Math.max(1, shortestWord.length - 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh... did you find that search results were too fuzzy before adding this code? I wonder it ratcheting down the threshold
will help with this without needing to hack into the search options for each keystroke
// make a copy | ||
var newModel = _.assign({}, model); | ||
newModel.columns = _.values(model.columns); | ||
newModel.searchableName = getModelName(model); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha - ok - this one sounds like a can of worms. dbt does have "unique identifiers" but they are not semantically super useful (eg. they look like model.fishtown.fct_users
). It would be really good to give every node type a canonical name for search representation, but that's something we'd need to layer in as you've done here AFAIK
Going to come back to this to take a deeper dive soon
@drewbanin I’ve gotten rid of the duplicate searches, but it still performs really badly. I don’t have the JS chops to debug this unfortunately - do you mind if I throw the heavy lifting on this one at you? Happy to jump on a call or something and pair on it if you think that’d help, but I’d mostly be along for the ride 😬 |
@drewbanin another bump here sorry! Would love to get this closed out, but I'm sure there's a lot going on over there 🏎️ |
I'm abandoning this in favour of unabandoning PR 145, which is simple and works! Fuse.js's weighting didn't feel good to me. Reading the docs doesn't make me confident that those weighting/fuzziness issues are readily resolvable, so I'm not eager to put a lot of effort into resolving crummy performance (especially since there's not really any SMEs available right now!) just to discover that the whole premise is flawed. |
resolves #143, resolves #166
Description
Moving on from the work done in PR #145, use Fuse's fuzzy search instead of hand-rolled version.
Extremely WIP-y, running into issues when running
yarn install
locally so kicking the work off to the CI version on NetlifyChecklist
CHANGELOG.md
and added information about my change to the "dbt next" section.