Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recent change in Wikipedias Layout breaks the table of contents #143

Closed
Builditluc opened this issue Jan 25, 2023 · 3 comments · Fixed by #150
Closed

Recent change in Wikipedias Layout breaks the table of contents #143

Builditluc opened this issue Jan 25, 2023 · 3 comments · Fixed by #150
Assignees
Labels
breaking-change This introduces a breaking change to the codebase. Increment the major version bug-type: parser This bug is related to the parser (for example, incorrect lists or missing elements in the article) state: approved This issue or pull request was approved and can be worked on type: bug This fixes a bug. Increment the minor version

Comments

@Builditluc
Copy link
Owner

Builditluc commented Jan 25, 2023

TLDR: The Table of contents won't be visible until the issue with the parser has been resolved. All other features should be working correctly. Check out this issue frequently, as all news about the progress will be posted here

Issue Status

A checklist of what needs to be done can be found here. Currently, the progress in development can be tracked by looking at the following branches related to the issue:

Recently, the Wikimedia Foundation announced its first change to the design of its site (including the articles) in over a decade, in which they "Put Usability at the Forefront"

One of the changes was a renewal of the table of contents, now on the left of the article. This changed its appearance in the HTML code considerably, revealing some significant flaws in the parsing design (We've always assumed that the table of contents would be at a specific location in a specific format. This is now no longer the case).

I've already come up with a solution to this problem. We can get the table of contents easily from the Wikipedia JSON API, already formatted with all of the required data. A quick example of the returned table of contents (internally called sections) for the article Meaning

{
    "parse": {
        "title": "Meaning",
        "pageid": 18916,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Arts and entertainment",
                "number": "1",
                "index": "1",
                "fromtitle": "Meaning",
                "byteoffset": 785,
                "anchor": "Arts_and_entertainment",
                "linkAnchor": "Arts_and_entertainment"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "See also",
                "number": "2",
                "index": "2",
                "fromtitle": "Meaning",
                "byteoffset": 1238,
                "anchor": "See_also",
                "linkAnchor": "See_also"
            }
        ],
        "showtoc": ""
    }
}

Response of the API call: https://en.wikipedia.org/w/api.php?format=json&action=parse&page=Meaning&prop=sections

As we can see, the API gives us all of the necessary data (level, number, linkAnchor). To implement this change we would, however, have to rewrite the whole table of contents section in the parser. Because we only get the id of the span that contains the header, we would also have to rewrite the header handling system. This requires tapping into the code of the UI.

These are some pretty big changes, but it'll be a high priority for me to fix this as soon as possible. I'll start with designing the interfaces for the new parser and once the design is fleshed out, I'll start implementing the changes.

@Builditluc Builditluc added state: approved This issue or pull request was approved and can be worked on type: bug This fixes a bug. Increment the minor version breaking-change This introduces a breaking change to the codebase. Increment the major version bug-type: parser This bug is related to the parser (for example, incorrect lists or missing elements in the article) labels Jan 25, 2023
@Builditluc Builditluc self-assigned this Jan 25, 2023
@Builditluc Builditluc pinned this issue Jan 25, 2023
@Builditluc Builditluc moved this from Todo to In Development in wiki-tui: Roadmap Jan 25, 2023
@Builditluc
Copy link
Owner Author

Builditluc commented Jan 26, 2023

Table of Contents redesign

Data Structure and Representation

First, we need to know exactly what information we want in the end. We need to ask ourselves, what information do we need for our current functionality to work (Don't want to overcomplicate this)? Well, we need the following things:
  • Some sort of identifier (maybe the id of the page the toc refers to)
  • The title, drawn at the top of the list (usually the page title)
  • Well, a list of all of the actual items

For every item, we would need the following things:

  • Some identifier (used internally for things like jumping to the header with a specific id)
  • The level (can be used to identify the type of header, maybe even represented as an enum)
  • The Text or title of the item
  • Some way of linking to a specific position in the article

When we look at what the API gives us, we can see that everything we need is given to us
Sample API result:

{
    "parse": {
        "title": "GitHub",
        "pageid": 18545292,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "History",
                "number": "1",
                "index": "1",
                "fromtitle": "GitHub",
                "byteoffset": 5788,
                "anchor": "History",
                "linkAnchor": "History"
            },
            ...
        ],
        ...
    }
}

When now try to put these requirements into a rust data type, we come out with the following (this is just a first draft. variable names and types are not fixed)

struct TableOfContents {
    id: usize,
    title: String,
    items: Vec<TableOfContentsItem>
}

struct TableOfContentsItem {
    id: usize,
    level: usize, // or an Enum, don't know yet
    text: String,
    anchor: String,
}

The anchor would be the id of the header that this item is linked to. When parsing the article, we can then just look at what TableOfContentsItem has that id and then save the id in the ArticleElement. We've just found out the first API requirement we need. Some sort of lookup that returns the correct item id when given an anchor.

Integration with the Article

The Problem

The current implementation separates the table of contents from the article. The UI has to fetch the table of contents from the article and then do its actions there manually (matching the items against headers, fetching the title, formatting the item title, etc.). This adds a lot of overhead in the UI that's not necessary (we can add functions for everything the UI needs from the article to Article itself, so we don't need to interact with the underlying data directly (this also allows us to change the structure and data representation, like adding more fields, without having to rewrite the UI.

When we look at the things we currently do with the table of contents, we can see that it's all over the place. For example, look at what happens when you click on an item in the toc:

  1. We iterate through every item in the SelectView (not the actual toc data) and we then compare the text of each item with the text of the item we clicked.
  2. We then fetch the index of said item in the SelectView
  3. We then call ArticleView::select_header with this index

There are quite a few things wrong and just dangerous about this. First, the whole thing depends on the headers being in the same order everywhere (in the Article and the SelectView). We don't use identifiers. No. We use the index.... That's just a disaster waiting to happen. Also, we don't work with the actual data here, just the items stored in the SelectView and some coordinates in an array on the Articles side.

Redesigned Solution

The solution for this would be quite simple actually. We use identifiers to refer to an item instead of its index. Currently, we need the following functionality:

  • Get the level and text from an item given its id
  • Get the anchor for an item given its id (needed for the linking of an ArticleElement to a TableOfContentsItem)
  • Get the coordinates for an item given its id (this functionality would be in the ArticleContent and not the Article itself, because these depend on the size of the view (can change)

As we can see, that's not too complicated.

One thing to note about the text of these items. Because the displayed text can be changed in the config, we format the text right in the parsing stage. This configuration change only configures the text displayed in the SelectView. So why would we need to complicate the parser when we can just let the UI handle formatting?

Naming

I believe this is the right time to change the naming of some of the structs and variables of the table of contents. A better name for it would be Section or Sections. This is also the name the Wikipedia API uses for the toc. This would mean that the Article struct would now have an attribute called sections that stores the individual sections of the article, called Section. This would remove the need for another struct called TableOfContents entirely because all of the logic is going to be moved into the main Article struct.

@Builditluc
Copy link
Owner Author

Making API Calls

One Struct handles it all

Currently, interacting with the Wikipedia API happens directly inside of the ArticleBuilder and SearchBuilder respectively. This means that we need two implementations of validating parameters, checking for errors in the responses, and making the requests. One way to solve this would be to just create helper functions for these things but I believe it would be best to move the direct API interactions into a separate struct that handles it all (parameter validation, making requests, maybe in the future even caching). Because this struct interacts with the Wikimedia API (basically Wikimedia is the database Wikipedia is built on. That means we can handle all kinds of sites that use Wikimedia and have its API endpoint), a good name for it would be WikimediaAPI. I don't want to overcomplicate things so the API Handler would only handle the things we need for now, easily expanded once needed.

Structure and current Requirements

Currently, we only need four different main things for the handler to do:

  • Search for an Article
  • Continue Searching for an Article
  • Fetch an Article given its title (used by the links)
  • Fetch an Article given its id (used by the search results)

When we only implement these four things, we not only keep the struct clean and simple, but we can expand it in the future. The handler would then have private helper functions that validate arguments and check for wikimedia errors. The return types for each of these functions would be json, keeping it simple (we don't have to create a struct for each API response and don't have to deal with many many Option<T> attributes).

For the error handling, we would create an WikimediaError enum with all of the different errors in it. That way we can simplify the error handling in the UI. That means each of the endpoints returns a Result<Json, WikimediaError> (not sure yet how the raw json type is named).

Testing

Creating one struct for handling all of the wikimedia API interactions enables us to test it better. We can mock the Wikipedia API for example and then test all the different things that can happen for each of the requirements.

@Builditluc
Copy link
Owner Author

Builditluc commented Jan 30, 2023

Things that need to be done:

  • write tests for the API handler
  • create an API handler
  • rework the parser system
  • implement the reworked parser into the UI
  • remove the old code

Builditluc added a commit that referenced this issue Feb 6, 2023
| Linked Issues | #143 |
|---------------|------|

Changes made:
* feat: add Parser, ElementParser and Element traits
* feat: add Article and Section and implement Article::from_mediawiki
* feat: add Text element
* feat: add Header element
* feat: add MediawikiParagraphParser
* feat: add MediawikiUnsupportedElementParser
* feat: add MediawikiParser::new
* feat: finish basic parsing
* feat: add Header element
* feat: implement link element
* feat: add italic and bold element parser
* feat: add MediawikiListParser
@Builditluc Builditluc linked a pull request Feb 6, 2023 that will close this issue
@github-project-automation github-project-automation bot moved this from In Development to Done in wiki-tui: Roadmap Feb 12, 2023
@Builditluc Builditluc unpinned this issue Feb 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change This introduces a breaking change to the codebase. Increment the major version bug-type: parser This bug is related to the parser (for example, incorrect lists or missing elements in the article) state: approved This issue or pull request was approved and can be worked on type: bug This fixes a bug. Increment the minor version
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant