
Project Ideas: High performance string matching


High performance string search automaton improvements and benchmarks

Finding text patterns is central to ScanCode license and copyright detection. In particular we use and contribute to pyahocorasick https://github.com/WojciechMula/pyahocorasick which is an Aho-Corasick automaton written in C as a Python extension.

A project in this domain could pursue one or more of these goals:

  • Add support for memory mapping to the pyahocorasick automaton such that it can be loaded once in memory and shared across multiple processes (a sketch of the current per-process loading cost follows this list).

  • Help reorganize the code tree into a more conventional layout, including a src directory for the C/C++ and Python code and common conventions for the test code (one possible layout is sketched after this list).

  • Investigate and benchmark alternative (memory-mapped) automaton implementations in Rust, Go, or C/C++. These libraries may not have a Python binding or may not allow memory mapping. The volume of data we handle requires reusing, extending, or creating specialized matching automatons that can eventually handle billions of nodes and grow larger than the available memory, so memory mapping is likely a must-have feature. The goal of this task would be to (see the benchmark harness sketch after this list):

    • review and select interesting alternative FOSS automaton implementations
    • create minimal Python bindings if missing
    • benchmark the speed and memory usage of these implementations against pyahocorasick, using ScanCode license detection as the workload. A possible direction is to use finite state transducers (possibly weighted).
  • Investigate ways to get some regex-like acceleration using automatons. We use automatons for exact multi-pattern matching in license detection, and regex-based lexers for copyright detection. The goal of this idea is to design and implement a library that accelerates some limited regex and approximate matching using automatons (a prefiltering sketch follows this list).

  • Feature hashing research: we deal with many "features", and hashing them to bound the number and size of each feature seems valuable for working with lower-dimensional data. The goal is to research the validity of feature hashing with short hashes (15, 16 and 32 bits) and evaluate whether this leads to acceptable false-positive rates and accuracy loss in the context of the data structures mentioned above. In particular this should be benchmarked using ScanCode license detection (a small collision-rate experiment is sketched after this list).
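
For the memory-mapping idea, here is a minimal sketch of how an automaton is built and copied into each process today, using the public pyahocorasick API (add_word, make_automaton, iter, and pickling). The patterns and values are made up; the memory-mapped loading described in the final comment is the missing piece this project would add.

```python
import pickle

import ahocorasick  # pip install pyahocorasick

def build_automaton(patterns):
    # Build an Aho-Corasick automaton mapping each pattern to (index, pattern).
    automaton = ahocorasick.Automaton()
    for index, pattern in enumerate(patterns):
        automaton.add_word(pattern, (index, pattern))
    automaton.make_automaton()
    return automaton

automaton = build_automaton(["apache", "mit license", "gpl-2.0"])

# Today: every worker process deserializes its own full copy of the trie.
blob = pickle.dumps(automaton)
worker_copy = pickle.loads(blob)  # repeated once per worker process

for end_index, (index, pattern) in worker_copy.iter("the apache and gpl-2.0 texts"):
    print(end_index, pattern)

# Goal: replace the pickle round-trip with a read-only, memory-mapped file so
# the OS page cache shares a single copy of the automaton across processes.
```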
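For the code tree reorganization, one possible conventional layout (the names are illustrative, not a decided structure) could look like:

```
pyahocorasick/
├── src/
│   ├── c/               # C extension sources
│   └── pyahocorasick/   # pure-Python package code
├── tests/               # all test code in one conventional location
├── setup.py
└── README.rst
```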
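For the benchmarking sub-tasks, a minimal harness sketch could time automaton construction and search and record peak RSS. The patterns and corpus below are synthetic placeholders; a real run would use the ScanCode license rules and real codebases, and an alternative automaton implementation would be plugged in behind the same function for a like-for-like comparison.

```python
import resource  # Unix-only; peak RSS would need another source on Windows
import time

import ahocorasick

def peak_rss_mb():
    # ru_maxrss is in kilobytes on Linux and bytes on macOS; assumes Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def benchmark(patterns, corpus):
    start = time.perf_counter()
    automaton = ahocorasick.Automaton()
    for index, pattern in enumerate(patterns):
        automaton.add_word(pattern, index)
    automaton.make_automaton()
    build_seconds = time.perf_counter() - start

    start = time.perf_counter()
    hits = sum(1 for _ in automaton.iter(corpus))
    search_seconds = time.perf_counter() - start
    return build_seconds, search_seconds, hits, peak_rss_mb()

patterns = [f"pattern-{i}" for i in range(100_000)]
corpus = " ".join(patterns[::7]) * 3
print(benchmark(patterns, corpus))
```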
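For the regex-acceleration idea, one possible scheme (an assumption for illustration, not an existing ScanCode design) is literal prefiltering: extract a required literal from each regex, locate all the literals in a single Aho-Corasick pass, and run the full regex only in a small window around each hit, so the regexes touch only a few bytes per candidate.

```python
import re

import ahocorasick

# Each rule pairs a required literal (the prefilter key) with its full regex.
RULES = [
    ("copyright", re.compile(r"copyright\s+\(c\)\s+\d{4}", re.IGNORECASE)),
    ("spdx", re.compile(r"spdx-license-identifier:\s*[\w.\-]+", re.IGNORECASE)),
]

prefilter = ahocorasick.Automaton()
for rule_index, (literal, _) in enumerate(RULES):
    prefilter.add_word(literal, rule_index)
prefilter.make_automaton()

def match(text, window=80):
    # One automaton pass over the lowered text, then a bounded regex search
    # around each literal hit.
    lowered = text.lower()
    for end, rule_index in prefilter.iter(lowered):
        name, pattern = RULES[rule_index]
        start = max(0, end - window)
        found = pattern.search(text, start, min(len(text), end + window))
        if found:
            yield name, found.group(0)

print(list(match("Copyright (c) 2022 nexB Inc. SPDX-License-Identifier: Apache-2.0")))
```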
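For the feature hashing research, a small experiment sketch could hash synthetic token features into b-bit buckets and measure the collision rate, a first proxy for the false-positive risk; real validation would use actual ScanCode license detection features.

```python
import hashlib
from collections import defaultdict

def bucket(feature, bits):
    # Truncate a stable hash of the feature to the low `bits` bits.
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << bits) - 1)

def collision_rate(features, bits):
    buckets = defaultdict(set)
    for feature in features:
        buckets[bucket(feature, bits)].add(feature)
    # Count distinct features that share a bucket with another feature.
    collided = sum(len(members) - 1 for members in buckets.values())
    return collided / len(features)

features = [f"token-{i}" for i in range(50_000)]
for bits in (15, 16, 32):
    print(bits, f"{collision_rate(features, bits):.4f}")
```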

  • Level

    • Advanced
  • Tech

    • Rust, Go, C/C++, Python
  • Mentors
