make compiled files mmap-compatible #119

Draft: wants to merge 41 commits into base: main

41 commits
3ffdbef
manager for string constants
mr-martian Jul 26, 2021
1be7e67
transducer read/write functions
mr-martian Jul 26, 2021
0789bb7
string_view for old compilers
mr-martian Jul 27, 2021
c785992
work on python bindings
mr-martian Jul 27, 2021
5e758ee
get it working
mr-martian Jul 27, 2021
fe8b858
remove unneeded args
mr-martian Jul 27, 2021
836c97a
link to ICU so that it actually runs
mr-martian Jul 27, 2021
1612064
transducer mmap sort by input symbol
mr-martian Jul 29, 2021
644f9b1
class for executing transducers
mr-martian Jul 30, 2021
aa0fac9
move endian helpers for mmap to their own header
mr-martian Jul 30, 2021
aeb0fe9
run TransducerExe for matching
mr-martian Jul 30, 2021
cbc3272
use new TransducerExe in lt-proc
mr-martian Jul 30, 2021
a8b0ace
AlphabetExe
mr-martian Aug 2, 2021
a195210
lt-comp working with new format
mr-martian Aug 2, 2021
77a4a1c
lt-print and lt-trim accepting new format
mr-martian Aug 2, 2021
531e9ae
helper functions continue to be fun
mr-martian Aug 2, 2021
dad61a5
actually mmap
mr-martian Aug 2, 2021
a4b2962
missed some debug statement
mr-martian Aug 2, 2021
ccb9ab7
split read functions for TransducerExe
mr-martian Aug 2, 2021
f9d54e1
lsx-proc needs to add symbols at runtime, so support that
mr-martian Aug 3, 2021
66400c6
pass along -DHAVE_STRING_VIEW to other repos
mr-martian Aug 3, 2021
8517e21
yet more helper functions
mr-martian Aug 4, 2021
7155329
reading in offsets
mr-martian Aug 5, 2021
17a257d
const lookup function for StringWriter
mr-martian Aug 7, 2021
55d00ac
start dropping compression.h
mr-martian Aug 21, 2021
4aa03cb
continuing to maintain odd (buggy?) behavior
mr-martian Aug 21, 2021
6c0a712
Compression → OldBinary, serialiser readers for perceptron
mr-martian Aug 31, 2021
55f69af
Merge branch 'master' into mmap
mr-martian Sep 2, 2021
ed69eaf
byteswaps in old headers and tags don't get adjusted by serialiser.h
mr-martian Sep 4, 2021
3bd79b4
we need a way to make AlphabetExe reindex if we read in more strings
mr-martian Sep 9, 2021
c2ec4c5
add -H option to lt-comp so it doesn't eat ε (#92)
mr-martian Sep 10, 2021
7a4410d
Merge branch 'master' into mmap
mr-martian Nov 14, 2021
65e341f
Merge branch 'master' into mmap
mr-martian Mar 15, 2022
b80d920
helper functions are nice
mr-martian Mar 15, 2022
f6f26ed
Merge branch 'master' into mmap
mr-martian May 26, 2022
9d71a8d
Merge branch 'master' into mmap
mr-martian Jul 1, 2022
ac01b01
get tests passing
mr-martian Jul 1, 2022
d719977
python bindings need to know about string_view
mr-martian Jul 1, 2022
620f512
move reading back to file_utils
mr-martian Jul 1, 2022
01fd739
header-reading util
mr-martian Jul 1, 2022
c5d3325
Merge branch 'master' into mmap
mr-martian Jul 23, 2022
7 changes: 6 additions & 1 deletion configure.ac
@@ -46,7 +46,12 @@ AC_CHECK_LIB(xml2, xmlReaderForFile)

# Checks for header files.
AC_HEADER_STDC
AC_CHECK_HEADERS([stdlib.h string.h unistd.h stddef.h])
AC_CHECK_HEADERS([stdlib.h string.h unistd.h stddef.h string_view])

have_sv=""
AC_CHECK_HEADERS([string_view], [have_sv="-DHAVE_STRING_VIEW"], [have_sv=""])
AC_SUBST([have_sv])

AC_CHECK_HEADER([utf8cpp/utf8.h], [CPPFLAGS="-I/usr/include/utf8cpp/ $CPPFLAGS"], [
AC_CHECK_HEADER([utf8.h], [], [AC_MSG_ERROR([You don't have utfcpp installed.])])
])
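The `have_sv` substitution in the `configure.ac` hunk resolves to `-DHAVE_STRING_VIEW` when the `<string_view>` header is found, and the commit list mentions "string_view for old compilers". A minimal sketch of how a header might consume that macro is shown here. The alias name `ustring_view_sketch`, the fallback struct, and `to_owned` are assumptions for illustration; they are not the contents of the PR's `string_view.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#ifdef HAVE_STRING_VIEW
#include <string_view>
// Hypothetical alias: a non-owning view over UTF-16 code units.
using ustring_view_sketch = std::basic_string_view<char16_t>;
#else
// Hypothetical fallback for toolchains without <string_view>:
// a pointer + length pair, just enough to illustrate the idea.
struct ustring_view_sketch {
  const char16_t* ptr = nullptr;
  std::size_t len = 0;
  ustring_view_sketch() = default;
  ustring_view_sketch(const char16_t* p, std::size_t n) : ptr(p), len(n) {}
  const char16_t* data() const { return ptr; }
  std::size_t size() const { return len; }
};
#endif

// Copy a view into an owning string; compiles against either branch.
inline std::u16string to_owned(ustring_view_sketch v) {
  assert(v.data() != nullptr || v.size() == 0);
  return std::u16string(v.data(), v.size());
}
```

Because both branches expose `data()` and `size()`, code written against those two members would not need to know which branch was selected at configure time.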
2 changes: 1 addition & 1 deletion lttoolbox.pc.in
@@ -7,4 +7,4 @@ Name: lttoolbox
Description: Augmented letter transducer tools for natural language processing
Version: @VERSION@
Libs: -L${libdir} -llttoolbox@VERSION_MAJOR@
Cflags: -I${includedir}/lttoolbox-@VERSION_API@
Cflags: -I${includedir}/lttoolbox-@VERSION_API@ @have_sv@
16 changes: 8 additions & 8 deletions lttoolbox/Makefile.am
@@ -1,14 +1,14 @@

h_sources = alphabet.h att_compiler.h buffer.h compiler.h compression.h \
deserialiser.h entry_token.h expander.h file_utils.h fst_processor.h input_file.h lt_locale.h \
match_exe.h match_node.h match_state.h my_stdio.h node.h \
pattern_list.h regexp_compiler.h serialiser.h sorted_vector.h state.h string_utils.h \
transducer.h trans_exe.h xml_parse_util.h xml_walk_util.h exception.h tmx_compiler.h \
h_sources = alphabet.h alphabet_exe.h att_compiler.h binary_headers.h buffer.h compiler.h compression.h \
deserialiser.h endian_util.h entry_token.h expander.h file_utils.h fst_processor.h input_file.h lt_locale.h \
match_exe.h match_node.h match_state.h match_state2.h mmap.h my_stdio.h node.h old_binary.h \
pattern_list.h regexp_compiler.h serialiser.h sorted_vector.h state.h string_utils.h string_view.h string_writer.h symbol_iter.h \
transducer.h transducer_exe.h trans_exe.h xml_parse_util.h xml_walk_util.h exception.h tmx_compiler.h \
ustring.h sorted_vector.hpp
cc_sources = alphabet.cc att_compiler.cc compiler.cc compression.cc entry_token.cc \
cc_sources = alphabet.cc alphabet_exe.cc att_compiler.cc compiler.cc binary_headers.cc compression.cc entry_token.cc \
expander.cc file_utils.cc fst_processor.cc input_file.cc lt_locale.cc match_exe.cc \
match_node.cc match_state.cc node.cc pattern_list.cc \
regexp_compiler.cc sorted_vector.cc state.cc string_utils.cc transducer.cc \
match_node.cc match_state.cc match_state2.cc node.cc old_binary.cc pattern_list.cc \
regexp_compiler.cc sorted_vector.cc state.cc string_utils.cc string_writer.cc symbol_iter.cc transducer.cc transducer_exe.cc \
trans_exe.cc xml_parse_util.cc xml_walk_util.cc tmx_compiler.cc ustring.cc

library_includedir = $(includedir)/$(PACKAGE_NAME)-$(VERSION_API)/$(PACKAGE_NAME)
99 changes: 70 additions & 29 deletions lttoolbox/alphabet.cc
@@ -19,6 +19,8 @@
#include <lttoolbox/my_stdio.h>
#include <lttoolbox/serialiser.h>
#include <lttoolbox/deserialiser.h>
#include <lttoolbox/endian_util.h>
#include <lttoolbox/old_binary.h>

#include <cctype>
#include <cstdlib>
@@ -110,12 +112,6 @@ Alphabet::operator()(UString const &s) const
return it->second;
}

bool
Alphabet::isSymbolDefined(UString const &s)
{
return slexic.find(s) != slexic.end();
}

bool
Alphabet::isSymbolDefined(const UString& s) const
{
@@ -129,23 +125,20 @@ Alphabet::size() const
}

void
Alphabet::write(FILE *output)
Alphabet::write(FILE *output) const
{
// First, we write the taglist
Compression::multibyte_write(slexicinv.size(), output); // taglist size
for(size_t i = 0, limit = slexicinv.size(); i < limit; i++)
{
Compression::string_write(slexicinv[i].substr(1, slexicinv[i].size()-2), output);
for (auto& it : slexicinv) {
Compression::string_write(it.substr(1, it.size()-2), output);
}

// Then we write the list of pairs
// All numbers are biased + slexicinv.size() to be positive or zero
size_t bias = slexicinv.size();
Compression::multibyte_write(spairinv.size(), output);
for(size_t i = 0, limit = spairinv.size(); i != limit; i++)
{
Compression::multibyte_write(spairinv[i].first + bias, output);
Compression::multibyte_write(spairinv[i].second + bias, output);
for (auto& it : spairinv) {
Compression::multibyte_write(it.first + bias, output);
Compression::multibyte_write(it.second + bias, output);
}
}

@@ -157,26 +150,20 @@ Alphabet::read(FILE *input)
a_new.spair.clear();

// Reading of taglist
int32_t tam = Compression::multibyte_read(input);
std::map<int32_t, std::string> tmp;
while(tam > 0)
{
tam--;
UString mytag = "<"_u;
mytag += Compression::string_read(input);
mytag += ">"_u;
for (uint64_t tam = OldBinary::read_int(input, true); tam > 0; tam--) {
UString mytag;
mytag += '<';
OldBinary::read_ustr(input, mytag, true);
mytag += '>';
a_new.slexicinv.push_back(mytag);
a_new.slexic[mytag]= -a_new.slexicinv.size(); // ToDo: This does not turn the result negative due to unsigned semantics
}

// Reading of pairlist
size_t bias = a_new.slexicinv.size();
tam = Compression::multibyte_read(input);
while(tam > 0)
{
tam--;
int32_t first = Compression::multibyte_read(input);
int32_t second = Compression::multibyte_read(input);
for (uint64_t tam = OldBinary::read_int(input, true); tam > 0; tam--) {
int32_t first = OldBinary::read_int(input, true);
int32_t second = OldBinary::read_int(input, true);
std::pair<int32_t, int32_t> tmp(first - bias, second - bias);
int32_t spair_size = a_new.spair.size();
a_new.spair[tmp] = spair_size;
@@ -186,6 +173,30 @@ Alphabet::read(FILE *input)
*this = a_new;
}

void
Alphabet::write_mmap(FILE* output, StringWriter& sw) const
{
write_le_64(output, slexicinv.size());
for (auto& it : slexicinv) {
StringRef r = sw.add(it);
write_le_32(output, r.start);
write_le_32(output, r.count);
}
}

void
Alphabet::read_mmap(FILE* input, StringWriter& sw)
{
int64_t count = read_le_64(input);
for (int64_t i = 0; i < count; i++) {
uint32_t s = read_le_32(input);
uint32_t c = read_le_32(input);
UString t = UString{sw.get(s, c)};
slexicinv.push_back(t);
slexic[t] = -i-1;
}
}

void
Alphabet::serialise(std::ostream &serialised) const
{
@@ -210,6 +221,30 @@ Alphabet::deserialise(std::istream &serialised)
}
}

void
Alphabet::read_serialised(FILE* in)
{
slexicinv.clear();
slexic.clear();
spairinv.clear();
spair.clear();
uint64_t len = OldBinary::read_int(in, false);
for (uint64_t i = 0; i < len; i++) {
UString t;
OldBinary::read_ustr(in, t, false);
slexicinv.push_back(t);
slexic[t] = -(int)i - 1;
}
len = OldBinary::read_int(in, false);
for (uint64_t i = 0; i < len; i++) {
int32_t a = OldBinary::read_int(in, false);
int32_t b = OldBinary::read_int(in, false);
auto p = make_pair(a, b);
spairinv.push_back(p);
spair[p] = i;
}
}

void
Alphabet::writeSymbol(int32_t const symbol, UFILE *output) const
{
@@ -307,6 +342,12 @@ Alphabet::createLoopbackSymbols(std::set<int32_t> &symbols, Alphabet &basis, Sid
}
}

std::vector<UString>&
Alphabet::getTags()
{
return slexicinv;
}

std::vector<int32_t>
Alphabet::tokenize(const UString& str) const
{
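`Alphabet::write_mmap` stores each tag as a `(start, count)` pair of little-endian 32-bit integers that point into a shared string table, and `read_mmap` resolves those pairs back into strings. A minimal sketch of that layout follows, using an in-memory byte buffer instead of a `FILE*`. All names here (`put_le_32`, `get_le_32`, `StringRefSketch`, `StringTableSketch`, `write_tags`, `read_tags`) are hypothetical stand-ins, not the PR's `endian_util.h` or `StringWriter`, and the element count is written as 32 bits for brevity where the PR writes 64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append a 32-bit value least-significant byte first,
// independent of the host's byte order.
inline void put_le_32(std::vector<uint8_t>& out, uint32_t v) {
  for (int i = 0; i < 4; i++) {
    out.push_back(static_cast<uint8_t>((v >> (8 * i)) & 0xFF));
  }
}

// Read a 32-bit little-endian value starting at byte offset pos.
inline uint32_t get_le_32(const std::vector<uint8_t>& in, std::size_t pos) {
  assert(pos + 4 <= in.size());
  uint32_t v = 0;
  for (int i = 0; i < 4; i++) {
    v |= static_cast<uint32_t>(in[pos + i]) << (8 * i);
  }
  return v;
}

// Offset and length of one string inside the shared table.
struct StringRefSketch { uint32_t start; uint32_t count; };

// Hypothetical string table: one contiguous buffer, with reuse of
// any substring that is already present.
struct StringTableSketch {
  std::u16string buffer;
  StringRefSketch add(const std::u16string& s) {
    std::size_t pos = buffer.find(s);
    if (pos == std::u16string::npos) {
      pos = buffer.size();
      buffer += s;
    }
    return {static_cast<uint32_t>(pos), static_cast<uint32_t>(s.size())};
  }
  std::u16string get(uint32_t start, uint32_t count) const {
    return buffer.substr(start, count);
  }
};

// Layout: count, then one (start, count) pair per tag.
inline std::vector<uint8_t> write_tags(const std::vector<std::u16string>& tags,
                                       StringTableSketch& table) {
  std::vector<uint8_t> out;
  put_le_32(out, static_cast<uint32_t>(tags.size()));
  for (const auto& t : tags) {
    StringRefSketch r = table.add(t);
    put_le_32(out, r.start);
    put_le_32(out, r.count);
  }
  return out;
}

inline std::vector<std::u16string> read_tags(const std::vector<uint8_t>& in,
                                             const StringTableSketch& table) {
  std::vector<std::u16string> tags;
  uint32_t n = get_le_32(in, 0);
  for (uint32_t i = 0; i < n; i++) {
    std::size_t pos = 4 + 8 * static_cast<std::size_t>(i);
    tags.push_back(table.get(get_le_32(in, pos), get_le_32(in, pos + 4)));
  }
  return tags;
}
```

Fixed-width records like these are what make the format suitable for memory mapping: a reader can compute the offset of the i-th record directly instead of decoding variable-length integers in sequence.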
16 changes: 12 additions & 4 deletions lttoolbox/alphabet.h
@@ -23,6 +23,7 @@
#include <set>
#include <vector>
#include <cstdint>
#include <lttoolbox/string_writer.h>
#include <lttoolbox/ustring.h>

using namespace icu;
@@ -114,9 +115,6 @@ class Alphabet
* @param s symbol
* @return true if defined
*/
bool isSymbolDefined(UString const &s);
// TODO: This should always be const.
// But binary compatibility, so have 2 copies for now.
bool isSymbolDefined(UString const &s) const;

/**
@@ -129,17 +127,22 @@ class Alphabet
* Write method.
* @param output output stream.
*/
void write(FILE *output);
void write(FILE *output) const;

/**
* Read method.
* @param input input stream.
*/
void read(FILE *input);

void write_mmap(FILE* output, StringWriter& sw) const;
void read_mmap(FILE* input, StringWriter& sw);

void serialise(std::ostream &serialised) const;
void deserialise(std::istream &serialised);

void read_serialised(FILE* in);

/**
* Write a symbol enclosed by angle brackets in the output stream.
* @param symbol symbol code.
@@ -200,6 +203,11 @@ class Alphabet
*/
void createLoopbackSymbols(std::set<int32_t> &symbols, Alphabet &basis, Side s = right, bool nonTagsToo = false);

/**
* Return a reference to the array of tags
*/
std::vector<UString>& getTags();

std::vector<int32_t> tokenize(const UString& str) const;

bool sameSymbol(const int32_t tsym, const Alphabet& other, const int32_t osym,
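In `alphabet.cc`, the new readers give the tag at position `i` of the tag vector the code `-i-1`, so tags occupy the negative integers and the vector returned by `getTags()` can be indexed by reversing that mapping. The sketch below spells out the two directions with the cast made explicit, which sidesteps the unsigned-negation concern raised in the `ToDo` comment in `Alphabet::read`. The function names are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Tag at vector position `index` gets code -(index) - 1, i.e. -1, -2, -3, ...
// The cast to a signed type happens before negation, so no unsigned
// wraparound is involved.
inline int32_t tag_code_from_index(std::size_t index) {
  return -static_cast<int32_t>(index) - 1;
}

// Inverse mapping: recover the vector position from a (negative) tag code.
inline std::size_t tag_index_from_code(int32_t code) {
  assert(code < 0);
  return static_cast<std::size_t>(-(code + 1));
}
```

Adding 1 before negating in `tag_index_from_code` keeps the arithmetic in range even for the most negative `int32_t` value.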