Recent change in Wikipedias Layout breaks the table of contents #143

Builditluc · 2023-01-25T19:28:51Z

TLDR: The Table of contents won't be visible until the issue with the parser has been resolved. All other features should be working correctly. Check out this issue frequently, as all news about the progress will be posted here

Issue Status

A checklist of what needs to be done can be found here. Currently, the progress in development can be tracked by looking at the following branches related to the issue:

143-rework-backend: This is the main branch of the issue and all finished items on the checklist get merged into this one
(merged Create Api Handler #147) 143-create-api-handler: The new API handler is created in this branch
(merged Create new parser system #149) 143-create-parser-system: The new parser system is created in this branch

Recently, the Wikimedia Foundation announced its first change to the design of its site (including the articles) in over a decade, in which they "Put Usability at the Forefront"

One of the changes was a renewal of the table of contents, now on the left of the article. This changed its appearance in the HTML code considerably, revealing some significant flaws in the parsing design (We've always assumed that the table of contents would be at a specific location in a specific format. This is now no longer the case).

I've already come up with a solution to this problem. We can get the table of contents easily from the Wikipedia JSON API, already formatted with all of the required data. A quick example of the returned table of contents (internally called sections) for the article Meaning

{
    "parse": {
        "title": "Meaning",
        "pageid": 18916,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Arts and entertainment",
                "number": "1",
                "index": "1",
                "fromtitle": "Meaning",
                "byteoffset": 785,
                "anchor": "Arts_and_entertainment",
                "linkAnchor": "Arts_and_entertainment"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "See also",
                "number": "2",
                "index": "2",
                "fromtitle": "Meaning",
                "byteoffset": 1238,
                "anchor": "See_also",
                "linkAnchor": "See_also"
            }
        ],
        "showtoc": ""
    }
}

Response of the API call: https://en.wikipedia.org/w/api.php?format=json&action=parse&page=Meaning&prop=sections

As we can see, the API gives us all of the necessary data (level, number, linkAnchor). To implement this change we would, however, have to rewrite the whole table of contents section in the parser. Because we only get the id of the span that contains the header, we would also have to rewrite the header handling system. This requires tapping into the code of the UI.

These are some pretty big changes, but it'll be a high priority for me to fix this as soon as possible. I'll start with designing the interfaces for the new parser and once the design is fleshed out, I'll start implementing the changes.

The text was updated successfully, but these errors were encountered:

Builditluc · 2023-01-26T07:14:05Z

Table of Contents redesign

Data Structure and Representation

First, we need to know exactly what information we want in the end. We need to ask ourselves, what information do we need for our current functionality to work (Don't want to overcomplicate this)? Well, we need the following things:

Some sort of identifier (maybe the id of the page the toc refers to)
The title, drawn at the top of the list (usually the page title)
Well, a list of all of the actual items

For every item, we would need the following things:

Some identifier (used internally for things like jumping to the header with a specific id)
The level (can be used to identify the type of header, maybe even represented as an enum)
The Text or title of the item
Some way of linking to a specific position in the article

When we look at what the API gives us, we can see that everything we need is given to us
Sample API result:

{
    "parse": {
        "title": "GitHub",
        "pageid": 18545292,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "History",
                "number": "1",
                "index": "1",
                "fromtitle": "GitHub",
                "byteoffset": 5788,
                "anchor": "History",
                "linkAnchor": "History"
            },
            ...
        ],
        ...
    }
}

When now try to put these requirements into a rust data type, we come out with the following (this is just a first draft. variable names and types are not fixed)

struct TableOfContents {
    id: usize,
    title: String,
    items: Vec<TableOfContentsItem>
}

struct TableOfContentsItem {
    id: usize,
    level: usize, // or an Enum, don't know yet
    text: String,
    anchor: String,
}

The anchor would be the id of the header that this item is linked to. When parsing the article, we can then just look at what TableOfContentsItem has that id and then save the id in the ArticleElement. We've just found out the first API requirement we need. Some sort of lookup that returns the correct item id when given an anchor.

Integration with the Article

The Problem

The current implementation separates the table of contents from the article. The UI has to fetch the table of contents from the article and then do its actions there manually (matching the items against headers, fetching the title, formatting the item title, etc.). This adds a lot of overhead in the UI that's not necessary (we can add functions for everything the UI needs from the article to Article itself, so we don't need to interact with the underlying data directly (this also allows us to change the structure and data representation, like adding more fields, without having to rewrite the UI.

When we look at the things we currently do with the table of contents, we can see that it's all over the place. For example, look at what happens when you click on an item in the toc:

We iterate through every item in the SelectView (not the actual toc data) and we then compare the text of each item with the text of the item we clicked.
We then fetch the index of said item in the SelectView
We then call ArticleView::select_header with this index

There are quite a few things wrong and just dangerous about this. First, the whole thing depends on the headers being in the same order everywhere (in the Article and the SelectView). We don't use identifiers. No. We use the index.... That's just a disaster waiting to happen. Also, we don't work with the actual data here, just the items stored in the SelectView and some coordinates in an array on the Articles side.

Redesigned Solution

The solution for this would be quite simple actually. We use identifiers to refer to an item instead of its index. Currently, we need the following functionality:

Get the level and text from an item given its id
Get the anchor for an item given its id (needed for the linking of an ArticleElement to a TableOfContentsItem)
Get the coordinates for an item given its id (this functionality would be in the ArticleContent and not the Article itself, because these depend on the size of the view (can change)

As we can see, that's not too complicated.

One thing to note about the text of these items. Because the displayed text can be changed in the config, we format the text right in the parsing stage. This configuration change only configures the text displayed in the SelectView. So why would we need to complicate the parser when we can just let the UI handle formatting?

Naming

I believe this is the right time to change the naming of some of the structs and variables of the table of contents. A better name for it would be Section or Sections. This is also the name the Wikipedia API uses for the toc. This would mean that the Article struct would now have an attribute called sections that stores the individual sections of the article, called Section. This would remove the need for another struct called TableOfContents entirely because all of the logic is going to be moved into the main Article struct.

Builditluc · 2023-01-27T06:54:14Z

Making API Calls

One Struct handles it all

Currently, interacting with the Wikipedia API happens directly inside of the ArticleBuilder and SearchBuilder respectively. This means that we need two implementations of validating parameters, checking for errors in the responses, and making the requests. One way to solve this would be to just create helper functions for these things but I believe it would be best to move the direct API interactions into a separate struct that handles it all (parameter validation, making requests, maybe in the future even caching). Because this struct interacts with the Wikimedia API (basically Wikimedia is the database Wikipedia is built on. That means we can handle all kinds of sites that use Wikimedia and have its API endpoint), a good name for it would be WikimediaAPI. I don't want to overcomplicate things so the API Handler would only handle the things we need for now, easily expanded once needed.

Structure and current Requirements

Currently, we only need four different main things for the handler to do:

Search for an Article
Continue Searching for an Article
Fetch an Article given its title (used by the links)
Fetch an Article given its id (used by the search results)

When we only implement these four things, we not only keep the struct clean and simple, but we can expand it in the future. The handler would then have private helper functions that validate arguments and check for wikimedia errors. The return types for each of these functions would be json, keeping it simple (we don't have to create a struct for each API response and don't have to deal with many many Option<T> attributes).

For the error handling, we would create an WikimediaError enum with all of the different errors in it. That way we can simplify the error handling in the UI. That means each of the endpoints returns a Result<Json, WikimediaError> (not sure yet how the raw json type is named).

Testing

Creating one struct for handling all of the wikimedia API interactions enables us to test it better. We can mock the Wikipedia API for example and then test all the different things that can happen for each of the requirements.

Builditluc · 2023-01-30T16:49:41Z

Things that need to be done:

| Linked Issues | #143 | |---------------|------| Changes made: * feat: add Parser, ElementParser and Element traits * feat: add Article and Section and implement Article::from_mediawiki * feat: add Text element * feat: add Header element * feat: add MediawikiParagraphParser * feat: add MediawikiUnsupportedElementParser * feat: add MediawikiParser::new * feat: finish basic parsing * feat: add Header element * feat: implement link element * feat: add italic and bold element parser * feat: add MediawikiListParser

Builditluc self-assigned this Jan 25, 2023

Builditluc added this to wiki-tui: Roadmap Jan 25, 2023

github-project-automation bot moved this to Todo in wiki-tui: Roadmap Jan 25, 2023

Builditluc pinned this issue Jan 25, 2023

Builditluc moved this from Todo to In Development in wiki-tui: Roadmap Jan 25, 2023

This was referenced Feb 1, 2023

Create Api Handler #147

Merged

Create new parser system #148

Merged

Create new parser system #149

Merged

Builditluc linked a pull request Feb 6, 2023 that will close this issue

Rework backend #150

Merged

Builditluc mentioned this issue Feb 12, 2023

Rework backend #150

Merged

Builditluc closed this as completed in #150 Feb 12, 2023

github-project-automation bot moved this from In Development to Done in wiki-tui: Roadmap Feb 12, 2023

Builditluc unpinned this issue Feb 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recent change in Wikipedias Layout breaks the table of contents #143

Recent change in Wikipedias Layout breaks the table of contents #143

Builditluc commented Jan 25, 2023 •

edited

Loading

Builditluc commented Jan 26, 2023 •

edited

Loading

Builditluc commented Jan 27, 2023

Builditluc commented Jan 30, 2023 •

edited

Loading

Recent change in Wikipedias Layout breaks the table of contents #143

Recent change in Wikipedias Layout breaks the table of contents #143

Comments

Builditluc commented Jan 25, 2023 • edited Loading

Issue Status

Builditluc commented Jan 26, 2023 • edited Loading

Table of Contents redesign

Data Structure and Representation

Integration with the Article

The Problem

Redesigned Solution

Naming

Builditluc commented Jan 27, 2023

Making API Calls

One Struct handles it all

Structure and current Requirements

Testing

Builditluc commented Jan 30, 2023 • edited Loading

Builditluc commented Jan 25, 2023 •

edited

Loading

Builditluc commented Jan 26, 2023 •

edited

Loading

Builditluc commented Jan 30, 2023 •

edited

Loading