Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement basic full-text-search #1359

Merged
merged 6 commits into from
Oct 15, 2022
Merged

Implement basic full-text-search #1359

merged 6 commits into from
Oct 15, 2022

Conversation

justinfagnani
Copy link
Collaborator

@justinfagnani justinfagnani commented Oct 8, 2022

This implements full-text search by:

  1. Stemming a number of fields and storing the results in an array field in CustomElement documents
  2. Stemming the user's text query
  3. Retrieving elements with a collectionGroup query across the customElement collections and filtering with an array-contains-any operator

I'm using /~https://github.com/NaturalNode/natural for stemming. It looks well maintained and thorough compared to other libraries.

The fields currently indexed are:

  • package description
  • element summary
  • element description
  • A split of the element name by -, using segments more than 3 characters long (to get button from sl-button, etc).

Builds on #1364. We need to fix #1351 to ensure that queries are confined to a single namespace.

@justinfagnani justinfagnani changed the base branch from main to latest-field October 13, 2022 15:35
@justinfagnani justinfagnani changed the title WIP: prototype full-text-search Implement basic full-text-search Oct 13, 2022
@justinfagnani justinfagnani marked this pull request as ready for review October 13, 2022 15:41
Base automatically changed from latest-field to main October 13, 2022 17:04
Copy link
Contributor

@aomarks aomarks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I can see us moving to something e.g. Lucene based, or a separate hosted service in the future -- but I like how simple this implementation is and am happy to see how far it can take us (especially after adding some ranking).

Main comment is for more test coverage :D

elementName: String!
tag: String = "latest"
): CustomElement
elements(query: String, limit: Int): [CustomElement!]!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we perhaps also want a strict lookup like the element API we had before, or will we instead support search prefixes like package:foo element:bar?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the package() query gives us what we had with element() - and we can make some of the fields like customElements into queries if we want that.

I think this query should grow to add both search operators and possibly subqueries/fields that get you from element to package here.

One consideration is how to implement structured + text search. If we move to use a search service do we do both Firestore + search service queries and mix them? If a search service offer structured fields do we delegate all searches to it?

packages/catalog-server/src/lib/catalog.ts Show resolved Hide resolved
// Update the PackageVersion doc
await t.update(versionRef, {
// We remove the converter to fix the types:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand "remove the converter" here, since it's actually adding "withConverter" -- seems like the opposite? :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.withConverter(null) removes the packageVersionConverter that's added in getPackageVersionCollectionRef.

const tagName = c.export.name;
// Grab longer tag name parts for searching. We want "button" from
// md-button, etc.
const tagNameParts = tagName.split('-').filter((s) => s.length > 3);
Copy link
Contributor

@aomarks aomarks Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe even >=2? Thinking of md- which I heard might be the new prefix for mwc?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did > 3 to try to exclude the prefixes - I wanted to try to capture words here, not the prefixes. This is very debatable, I just threw this in, but that was my thinking.

if (queryTerms.length > 10) {
queryTerms.length = 10;
}
dbQuery = dbQuery.where('searchTerms', 'array-contains-any', queryTerms);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed offline, but just for the record here -- it sounds like the searchTerms field is automatically indexed, and firestore supports indexed array-contains-any lookups.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, and there are no errors in the emulator.

packages/catalog-server/src/lib/graphql.ts Outdated Show resolved Hide resolved
limit,
}: {
query?: string;
limit?: number;
Copy link
Contributor

@aomarks aomarks Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can limit be non-optional here, since the same 25 default is assigned earlier in the call stack? Or maybe factored out to a constant? (Just thinking if we want to change the default, it would be easy to accidentally only do it in one place).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually seems fine for it to differ. The graphql one is what matters most. This is just a fallback incase someone called the repo directly without a limit. Required is fine too, though in a JS world I'd still usually check before launching a non-limited query :)

@@ -25,6 +25,13 @@ const testPackage1Path = fileURLToPath(
new URL('../test-packages/test-1/', import.meta.url)
);

// A set of import, fetch, search tests that use the same data
const TEST_SEQUENCE_ONE = 'test-data-1';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly confused by the term "sequence" here, it's actually a partition name? Or is sequence some technical firestore term?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my term... I'm meaning a sequence of tests that rely on each other. I really want some kind of test nesting.

@@ -84,6 +91,37 @@ test('A second import does nothing', async () => {
assert.equal(importResult.problems, undefined);
});

test('Full text search', async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I thought I sent this comment with the review, but must have lost it somehow).

Feels like we could do with some more test coverage here. Maybe a test rig/utility function that would make it easier to concisely write the scenarios? Some cases it feels like we should have:

  • Search by tag name term
  • Search by package term
  • Search by entire package name
  • Search by entire tag name
  • Search by description term
  • Tests for the various stemmings/other normalization we expect to be happening -- case sensitivity, removing punctuation, plural normalization, etc.
  • A search with both a matching and non-matching term (to confirm it's OR not AND matching)
  • Checking that the limit is respected
  • No matching results
  • No elements in the index at all

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we should definitely have more coverage. I added a few more tests for now.

@justinfagnani justinfagnani requested a review from aomarks October 14, 2022 21:21
@justinfagnani justinfagnani merged commit 3abc31d into main Oct 15, 2022
@justinfagnani justinfagnani deleted the elements-query branch October 15, 2022 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[catalog-server] Ensure DB partitioning for tests and staging purposes
2 participants