Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve search index generation for PHP.net #154

Merged
merged 5 commits into from
Oct 6, 2024

Conversation

lhsazevedo
Copy link
Contributor

Note

This is a companion PR to php/web-php#1084, but it is not dependent on it and can be merged independently.

Intro

PHD index entries can have both short (sdesc) and long (ldesc) descriptions. Often, the short description is empty, leading to the current fallback mechanism in getShortDescription and getLongDescription:

// phpdotnet/phd/Format.php
final public function getLongDescription($id, &$isLDesc = null) {
    if ($this->indexes[$id]["ldesc"]) {
        $isLDesc = true;
        return $this->indexes[$id]["ldesc"];
    } else {
        $isLDesc = false;
        return $this->indexes[$id]["sdesc"];
    }
}
final public function getShortDescription($id, &$isSDesc = null) {
    if ($this->indexes[$id]["sdesc"]) {
        $isSDesc = true;
        return $this->indexes[$id]["sdesc"];
    } else {
        $isSDesc = false;
        return $this->indexes[$id]["ldesc"];
    }
}

Problem

The current search index JSON generation doesn't utilize this fallback
mechanism, resulting in missing entries in PHP.net search results.

Solution

This PR addresses the issue by:

  • Using the long description as the title when the short description is empty.
  • Using the parent book title as the long description in such cases.
  • Ignoring non-page entries (non-chunk entries) when generating the search index JSON.

Impact

  • Improves search result completeness and accuracy.
  • Reduces search index size by excluding irrelevant entries.

Examples

Currently, "PHP Manual > Language Reference > Types > String" is missing from search results due to an empty short description. This change will use "Strings" (the long description) as the title and "Language Reference" (the parent book title) as the long description.

search-index.json

 [
-  "",
+  "Strings",
   "language.types.string",
   "sect1"
 ],
-[
-  "",
-  "language.types.string",
-  "sect2"
-],

...
  [
-   "",
+   "Enumerations",
    "language.types.enumerations",
    "sect1"
  ],

search-description.json

-  "language.types.string": "Strings",
+  "language.types.string": "Language Reference",
...
-  "language.types.enumerations": "Enumerations",
+  "language.types.enumerations": "Language Reference",

Statistics

stat before after change
Entries count 29,765 11,256 -62%
Entries lacking title (sdesc) 19,955 0 -100%
Search index size 1,383KB 693KB -50%

Generated Search Index Diff Preview

Only the first 100 lines are shown for brevity.

 [
   [
-    "",
+    "Copyright",
     "copyright",
     "legalnotice"
   ],
   [
-    "",
-    "index",
-    "info"
-  ],
-  [
-    "",
-    "index",
-    "book"
-  ],
-  [
-    "",
-    "preface",
-    "section"
-  ],
-  [
-    "",
+    "Preface",
     "preface",
     "preface"
   ],
   [
-    "",
-    "intro-whatis",
-    "example"
-  ],
-  [
-    "",
+    "What is PHP?",
     "intro-whatis",
     "section"
   ],
   [
-    "",
+    "What can PHP do?",
     "intro-whatcando",
     "section"
   ],
   [
-    "",
+    "Introduction",
     "introduction",
     "chapter"
   ],
   [
-    "",
+    "What do I need?",
     "tutorial.requirements",
     "section"
   ],
   [
-    "",
-    "tutorial.firstpage",
-    "example"
-  ],
-  [
-    "",
-    "tutorial.firstpage",
-    "example"
-  ],
-  [
-    "",
+    "Your first PHP-enabled page",
     "tutorial.firstpage",
     "section"
   ],
   [
-    "",
-    "tutorial.useful",
-    "example"
-  ],
-  [
-    "",
-    "tutorial.useful",
-    "example"
-  ],
-  [
-    "",
-    "tutorial.useful",
-    "example"
-  ],
-  [
-    "",
+    "Something Useful",
     "tutorial.useful",
     "section"
   ],
   [
-    "",
-    "tutorial.forms",
-    "example"
-  ],

Improves the search indexes generated by the PHP-Web format by:
- Adding short descriptions to entries that lack them
- Skipping non-chunk entries (page elements)
@lhsazevedo lhsazevedo force-pushed the improve-search-index branch from 3c44489 to e3fb93a Compare October 5, 2024 05:51
Co-authored-by: Kamil Tekiela <tekiela246@gmail.com>
@lhsazevedo lhsazevedo force-pushed the improve-search-index branch from 3acf1b6 to ba87ca6 Compare October 5, 2024 22:47
@kamil-tekiela
Copy link
Member

Will this break the search if we merge it now?

@lhsazevedo
Copy link
Contributor Author

No. It is safe to merge as it doesn't alter the JSON structure.

@kamil-tekiela kamil-tekiela merged commit 673b2da into php:master Oct 6, 2024
9 checks passed
@lhsazevedo lhsazevedo deleted the improve-search-index branch October 6, 2024 13:55
@lhsazevedo
Copy link
Contributor Author

Thank you!

@kamil-tekiela
Copy link
Member

Just curious. You said it would not affect the current functionality, right? Does that mean you changed something in the other PR that will avail of this change?

@lhsazevedo
Copy link
Contributor Author

I said that it wouldn't break the current search, but you can already see the improved results on php.net.

Unfortunately, the current UI hides them under the last result group ("Other Matches"). You'll probably need to scroll down the menu to see it. You may also need to clear you local storage because the search index is cached for two weeks.

Try searching for "syntax", "types" or "operators". You should see some language reference results under "Other Matches". Those pages were missing from the index before because they don't have an sdesc.

@haszi
Copy link
Contributor

haszi commented Oct 20, 2024

Problem

The current search index JSON generation doesn't utilize this fallback mechanism, resulting in missing entries in PHP.net search results.

Solution

This PR addresses the issue by:

...

  • Ignoring non-page entries (non-chunk entries) when generating the search index JSON.

One question: how does removing non-chunk entries address the issue of empty long/short descriptions not using their short/long counterpart? Beside that this change not related to the problem, it also removes the possibility of searching for any useful sections that don't necessarily need an entire page (e.g. the ternary or the null coalescing operators).

@lhsazevedo
Copy link
Contributor Author

Thanks for bringing this up.

The main reason for skipping non-chunk entries was to avoid adding an sdesc to them, which would make them appear in the search results. The front-end does not properly handle non-chunk entries, and displaying them would lead to broken links. I realize I should have clarified this in the PR.

To properly integrate non-chunk entries into the search, we'd need to:

  1. Include the container chunk ID in the index (unless parent_id is guaranteed to be a chunk).
  2. Update the front-end to handle non-chunk links correctly, using the container chunk ID as the path and the entry ID as a hash property. For example, https://www.php.net/manual/en/language.operators.comparison.php#language.operators.comparison.ternary .
  3. Determine the best value for the ldesc for non-chunk entries. Using the container chunk's sdesc might be preferable over the parent book/set sdesc.

Your question led me to some interesting discoveries:

  1. There are 18,205 non-chunk entries currently missing sdesc, but 370 non-chunk entries actually have both sdesc and ldesc. These are <refentry> aliases (like DateTime::getTimezone, an alias for date_timezone_get()) and were removed from the index after the PR. I think they should be using the same chunk value as the entry they alias to, but it is hardcoded as false (0):

    phd/phpdotnet/phd/Index.php

    Lines 254 to 269 in ec49154

    if ($array === true) {
    foreach($lastChunk["sdesc"] as $sdesc) {
    $this->commit[++$idx] = [
    "docbook_id" => $lastChunkId,
    "filename" => $lastChunk["filename"],
    "parent_id" => $this->currentchunk,
    "sdesc" => $sdesc,
    "ldesc" => $lastChunk["ldesc"],
    "element" => $lastChunk["element"],
    "previous" => $lastChunk["previous"],
    "next" => ($lastChunk["chunk"] ? "POST-REPLACEMENT" : ""),
    "chunk" => 0,
    ];
    $this->POST_REPLACEMENT_INDEXES[] = array("docbook_id" => $lastChunkId, "idx" => $idx);
    }
    }

    I'll create an issue so we can discuss how to handle this.

  2. Another issue: while the entries' IDs are not unique, I've seen some search related code assuming they are, which definitely needs to be corrected. Will create an issue for this as well.

@lhsazevedo
Copy link
Contributor Author

@kamil-tekiela would you mind adding the hacktoberfest-accepted label to this one?

@Girgias Girgias added the hacktoberfest-accepted Hacktober PRs label Oct 25, 2024
@Girgias
Copy link
Member

Girgias commented Oct 25, 2024

@kamil-tekiela would you mind adding the hacktoberfest-accepted label to this one?

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants