Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zimcheck: better and faster redirect loop check #312

Merged
merged 1 commit into from
Aug 12, 2022

Conversation

veloman-yunkan
Copy link
Collaborator

@veloman-yunkan veloman-yunkan commented Jul 27, 2022

Fixes #161

Depends on openzim/libzim#716

With this implementation deep chains of redirections are not mis-reported as loops. Besides it is faster compared to the old
implementation for the following reasons:

  • Redirection info is read from every entry/dirent exactly once; all subsequent processing is with minimal in-memory data required for the task.

  • When a standalone loop redirection is performed (for example, zimcheck -L with no other option) the auxiliary effient-order-to-by-path-order conversion table is not computed. In this case, in addition to shorter runtime, the memory usage is lower, too.

Benchmark data for zimcheck -L stackoverflow.com_en_all_2022-05.zim:

Run time Memory usage
Old implementation (cold start): 631 seconds 1097 MiB
Old implementation (warm start): 500 seconds 1097 MiB
This PR (cold start): 287 seconds 927 MiB
This PR (warm start): 275 seconds 927 MiB

@veloman-yunkan
Copy link
Collaborator Author

CI fails because of the dependence on openzim/libzim#716

@veloman-yunkan
Copy link
Collaborator Author

openzim/libzim#716 was merged and participated in the nightly CI build. I reran the CI here today. Macos picked up the change and passed, but Linux failed since it refers to the xenial version of deps, which are no longer being built by the kiwix-build CI.

@kelson42
Copy link
Contributor

kelson42 commented Aug 3, 2022

@legoktm Something for you?

@kelson42 kelson42 force-pushed the zimcheck_faster_standalone_redirect_loop_check branch from d87dd35 to ddf3cab Compare August 4, 2022 14:48
@codecov
Copy link

codecov bot commented Aug 4, 2022

Codecov Report

Merging #312 (c0755b7) into master (e7853d1) will increase coverage by 0.71%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #312      +/-   ##
==========================================
+ Coverage   43.54%   44.25%   +0.71%     
==========================================
  Files          23       23              
  Lines        2338     2368      +30     
  Branches     1304     1318      +14     
==========================================
+ Hits         1018     1048      +30     
  Misses       1304     1304              
  Partials       16       16              
Impacted Files Coverage Δ
src/zimcheck/checks.cpp 96.57% <100.00%> (+0.32%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@kelson42
Copy link
Contributor

kelson42 commented Aug 6, 2022

@veloman-yunkan Two small "problems" are reported by Codefactor.

@veloman-yunkan veloman-yunkan force-pushed the zimcheck_faster_standalone_redirect_loop_check branch from ddf3cab to ffd1714 Compare August 6, 2022 17:22
@veloman-yunkan
Copy link
Collaborator Author

@veloman-yunkan Two small "problems" are reported by Codefactor.

Not anymore

Copy link
Collaborator

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could do even better on the algorithm.
Your algorithm loop twice on the redirection table to resolve the redirect. Once in detectLoopStatus to know if a redirection chain is looping or not. And a second one in resolveLoopStatus to mark the redirection chain as looping or not.

If we not mark the redirection loop with LOOP/NONLOOP, we could mark the loopStatus only with a integer indicating the number of the resolveLoopStatus run. And then only store if the nth run of detectLoopStatus resolve to a loop or not. When reporting, we loop on the loopStatus for what run has covered it and look if the run is loop or not.

Something like (not tested):

class RedirectionTable
{
private: // types
    enum LoopStatus
    {
        UNKNOWN,
        NONLOOP,
        BASE_RUN_NUMBER
    };


public: // functions
    RedirectionTable() {
       // NONLOOP (1) is not looping
       runStatus.push_back(false);
    }
    void addRedirectionEntry(zim::entry_index_type targetEntryIndex)
    {
        redirTable.push_back(targetEntryIndex);
        loopStatus.push_back(LoopStatus::UNKNOWN);
    }

    void addItem()
    {
        redirTable.push_back(redirTable.size());
        loopStatus.push_back(LoopStatus::NONLOOP);
    }

    size_t size() const { return redirTable.size(); }

    bool isInRedirectionLoop(zim::entry_index_type i)
    {
        if ( loopStatus[i] == UNKNOWN )
        {
            resolveLoopStatus(i);
        }

        return runStatus[loopStatus[i]-1];
    }

private: // functions
    void resolveLoopStatus(zim::entry_index_type i)
    {
        auto runNumber = self.nextRunNumber++;
        while (true)
        {
            if ( loopStatus[i] != LoopStatus::UNKNOWN ) {
                if (loopStatus[i] == runNumber) {
                  // The current run is a loop
                  runStatus.push_back(true);
                } else {
                  // The current run is the same status than the first know status found.
                  runStatus.push_back(runStatus[loopStatus[i]-1]);
                };
                return;
            }
            loopStatus[i] = runNumber;
            i = redirTable[i];
        }
    }

private: // data
    std::vector<zim::entry_index_type> redirTable;
    std::vector<uint32_t> loopStatus;
    std::vector<runStatus> runStatus;
    uint32_t nextRunNumber = BASE_RUN_NUMBER;
};

} // unnamed namespace

void test_redirect_loop(const zim::Archive& archive, ErrorLogger& reporter) {
    reporter.infoMsg("[INFO] Checking for redirect loops...");

    RedirectionTable redirTable;
    for(const auto& entry: archive.iterByPath())
    {
        if ( entry.isRedirect() )
            redirTable.addRedirectionEntry(entry.getRedirectEntryIndex());
        else
            redirTable.addItem();
    }

    for(zim::entry_index_type i = 0; i < redirTable.size(); ++i )
    {
        if(redirTable.isInRedirectionLoop(i)){
            const auto entry = archive.getEntryByPath(i);
            reporter.addMsg(MsgId::REDIRECT_LOOP, {{"entry_path", entry.getPath()}});
        }
    }

Comment on lines 777 to 778
std::deque<zim::entry_index_type> redirTable;
std::deque<LoopStatus> loopStatus;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why using std::deque ? std::vector already have push_back and we never front_pop

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of std::vector's higher memory usage requirements during growth. When you incrementally populate an std::vector with N>>1 elements its high-watermark memory usage is at about 3/2N (~N/2 of previous storage + ~N new storage during the last reallocation). std::deque typically being implemented as a vector of dynamically allocated fixed size blocks is devoid of that disadvantage.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could reserve the space once at initialization as we know the number of entries to add

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but then we will need to separately track the count of items already added. What are the cons of std::deque?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but then we will need to separately track the count of items already added.

With std::vector::reserve we can pre-allocate the memory for the vector without changing the semantic.
Or I misunderstand your argument...

What are the cons of std::deque?

The semantic. std::deque is a queue. Which means we use to to pass elements in order (add with push_back, read with pop_front). Here we have the semantic of a growing vector, so we should use a vector.
We may have the hack of using deque for memory allocation improvement. But as we can reserve the memory with
redirTable.reserve(archive.getAllEntryCount()), there is no real reason to use it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With std::vector::reserve we can pre-allocate the memory for the vector without changing the semantic.
Or I misunderstand your argument...

Indeed. Totally forgot about reserve()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@veloman-yunkan
Copy link
Collaborator Author

I wonder if we could do even better on the algorithm.

You try to save runtime by using more memory (uint32_t instead of uint_8). Sometimes, higher memory usage can result in slower execution. So, though your approach will definitely work, it's not a fact that it will be faster. And I think that memory usage is more important than the runtime - if a program is simply slow you can wait till it completes, whereas if it requires too much memory it simply doesn't run.

@kelson42
Copy link
Contributor

I have no opinion on this specific algorithm question because I don't have the numbers. But, regarding zimcheck, the speed is definitly the priority point to take in consideration. That said, the problem reported by @veloman-yunkan is real: the memory usage should be "reasonable" and should not scale proportionaly to the number of errors/warnings or even size of the ZIM. To conclude here, if we can do significantly faster without using significantly more memory, then this is welcome IMO. Otherwise, probably not.

@mgautierfr
Copy link
Collaborator

That said, the problem reported by @veloman-yunkan is real: the memory usage should be "reasonable" and should not scale proportionally to the number of errors/warnings or even size of the ZIM

My idea will have a memory usage growing proportionally to the number of entry in the ZIM file. (8*nbEntries+1*nbRedirectionChains). It is more than @veloman-yunkan solution (5*nbEntries) but it is in the proportion of what we already do in libzim:

  • We mmap the urlPtrList (8*nbEntries)
  • We mmap the titlePtrList (4*nbEntries)
  • For fast iteration, we also create a vector (8*nbEntries) to store the entries order

On top of that, we have to remember that the performance improvement is theoretical for now, we must do some testing to be sure that one solution is more performant than the other.

@veloman-yunkan
Copy link
Collaborator Author

My idea will have a memory usage growing proportionally to the number of entry in the ZIM file. (8nbEntries+1nbRedirectionChains). It is more than @veloman-yunkan solution (5*nbEntries) but it is in the proportion of what we already do in libzim:

* We mmap the urlPtrList (8*nbEntries)

* We mmap the titlePtrList (4*nbEntries)

mmap-ed memory shouldn't be accounted in this analysis as 100% memory consumption since it can be easily discarded by the OS if there is not enough RAM for the real working set size.

* For fast iteration, we also create a vector (8*nbEntries) to store the entries order

This one can be halved.

On top of that, we have to remember that the performance improvement is theoretical for now, we must do some testing to be sure that one solution is more performant than the other.

During redirect loop detection the working set is limited to the data maintained in RedirectionTable (i.e. (8*nbEntries+1*nbRedirectionChains) vs (5*nbEntries)1. The size of that data, the size of the CPU data cache, the latency of the memory bus and the access patterns (coming from the redirections) will affect the actual performance. I don't think that spending time on evaluating the performance on multiple ZIM files and diverse hardware is justified in the pursuit of (some potential) extra speed-up in one of the cheapest checks of zimcheck (at the cost of sacrificing the memory usage improvement brought by this PR).


1 the latter, BTW, could be reduced to (4*nbEntries) by not maintaining a separate loopStatus table and reusing redirTable for that purpose, but I didn't like that hack because of the reduced clarity.

@mgautierfr
Copy link
Collaborator

I agree that you current algorithm is already a big improvement compared to the previous one. Let's move on with it. We may come back on this when/if we want to improve it further more.

Let's fix the discussion about std::deque/std::vector and we will be good.

Comment on lines 719 to 723
explicit RedirectionTable(size_t entryCount)
{
loopStatus.reserve(entryCount);
redirTable.reserve(entryCount);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent indentation

@mgautierfr
Copy link
Collaborator

Please fix the indentation and rebase-fixup.
You can directly merge the PR, no need for another review round.

@mgautierfr mgautierfr self-requested a review August 11, 2022 14:51
With this implementation deep chains of redirections are not
mis-reported as loops. Besides it is faster compared to the old
implementation for the following reasons:

- Redirection info is read from every entry/dirent exactly once;
  all subsequent processing is with minimal in-memory data required
  for the task.

- When a standalone loop redirection is performed (for example,
  `zimcheck -L` with no other option) the auxiliary
  effient-order-to-by-path-order conversion table is not computed.
  In this case, in addition to shorter runtime, the memory usage is
  lower, too.
@veloman-yunkan veloman-yunkan force-pushed the zimcheck_faster_standalone_redirect_loop_check branch from c0755b7 to ddb1cf3 Compare August 12, 2022 08:49
@veloman-yunkan
Copy link
Collaborator Author

The Packages CI flows fail since they don't see the change brought by openzim/libzim#716.

@veloman-yunkan
Copy link
Collaborator Author

@kelson42 Should I merge? Or we better resolve the issue with the Packages builds first?

@kelson42 kelson42 merged commit 43a80f4 into master Aug 12, 2022
@kelson42 kelson42 deleted the zimcheck_faster_standalone_redirect_loop_check branch August 12, 2022 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Zimcheck (was: Creator) should check that we have no redirect loop
3 participants