Error while mutating Tokenizer not available for language #2601

Closed
danielmai opened this issue Sep 18, 2018 · 2 comments

@danielmai
Contributor

Title: Error while mutating Tokenizer not available for language

If you suspect this could be a bug, follow the template.

  • What version of Dgraph are you using?
    v1.0.8

  • Have you tried reproducing the issue with latest release?
    Yes.

  • What is the hardware spec (RAM, OS)?
    64 GB. Ubuntu 18.04.

  • Steps to reproduce the issue (command/config used to run Dgraph).

  1. Run 1 Dgraph Zero and 1 Dgraph Alpha.
  2. Run dgraph live with the 21-million movie data set (RDF and schema from the benchmarks repo):
dgraph live -r 21million.rdf.gz -s 21million.schema
  3. After running for ~50 minutes, I see a lot of aborts and repeated "Tokenizer not available" errors. Something like this:
2018/09/18 10:41:52 batch.go:125: Error while mutating Tokenizer not available for language: pl
2018/09/18 10:41:52 batch.go:125: Error while mutating Tokenizer not available for language: sl
2018/09/18 10:41:52 batch.go:125: Error while mutating Tokenizer not available for language: hi
2018/09/18 10:41:52 batch.go:125: Error while mutating Tokenizer not available for language: sr
Total Txns done:    17094 RDFs per second:    3992 Time Elapsed: 1h11m22s, Aborts: 70232
Total Txns done:    17094 RDFs per second:    3990 Time Elapsed: 1h11m24s, Aborts: 70232
Total Txns done:    17094 RDFs per second:    3988 Time Elapsed: 1h11m26s, Aborts: 70232
Total Txns done:    17094 RDFs per second:    3986 Time Elapsed: 1h11m28s, Aborts: 70232
2018/09/18 10:42:00 batch.go:125: Error while mutating Tokenizer not available for language: es-419

This continues indefinitely from this point on. Dgraph keeps retrying the failed mutations.

  • Expected behaviour and actual result.

Dgraph live loader finishes loading the data set.

@danielmai danielmai added the kind/bug Something is broken. label Sep 18, 2018
@danielmai
Contributor Author

I tried re-running dgraph live from a fresh cluster with the following modified schema:

diff --git a/data/21million.schema b/data/21million.schema
index 9dc91a2..7900554 100644
--- a/data/21million.schema
+++ b/data/21million.schema
@@ -4,7 +4,7 @@ genre                : uid @reverse @count .
 initial_release_date : datetime @index(year) .
 rating               : uid @reverse .
 country              : uid @reverse .
-loc                  : geo @index(geo) .
-name                 : string @index(hash, fulltext, trigram) @lang .
+loc                  : geo .
+name                 : string @index(hash, trigram) @lang .
 starring             : uid @count .
 _share_hash_         : string @index(exact) .

It's been running for an hour and I see aborts in the live loader output:

Total Txns done:    19519 RDFs per second:    3823 Time Elapsed: 1h25m5s, Aborts: 1454

In the Alpha's /debug/requests I see this trace for Server.Mutate:


When | Elapsed (s) | Event
-- | -- | --
2018/09/18 12:11:18.324662 | 110.322906 | Server.Mutate
12:11:18.325889 | .  1227 | ... Added Internal edges
12:11:18.328082 | .  2192 | ... Proposing data with key: 01-17683798971607446653. Timeout: 4s
12:11:18.328369 | .   287 | ... Waiting for the proposal.
12:11:22.328219 | 3.999850 | ... Internal context timed out with error: context deadline exceeded. Retrying...
12:11:22.328249 | .    30 | ... Proposing data with key: 01-17134495942084430214. Timeout: 8s
12:11:22.328630 | .   380 | ... Waiting for the proposal.
12:11:30.328496 | 7.999866 | ... Internal context timed out with error: context deadline exceeded. Retrying...
12:11:30.328534 | .    38 | ... Proposing data with key: 01-16310624025916951683. Timeout: 16s
12:11:30.329358 | .   824 | ... Waiting for the proposal.
12:11:46.328672 | 15.999314 | ... Internal context timed out with error: context deadline exceeded. Retrying...
12:11:46.328704 | .    32 | ... Proposing data with key: 01-1565500641011909092. Timeout: 32s
12:11:46.329015 | .   311 | ... Waiting for the proposal.
12:12:18.328814 | 31.999800 | ... Internal context timed out with error: context deadline exceeded. Retrying...
12:12:18.328833 | .    18 | ... Proposing data with key: 01-9267239880346609593. Timeout: 1m4s
12:12:18.329192 | .   359 | ... Waiting for the proposal.
12:13:08.639079 | 50.309887 | ... Done with error: <nil>
12:13:08.642800 | .  3721 | ... Prewrites err: <nil>. Attempting to commit/abort immediately.
12:13:08.647566 | .  4766 | ... Status of commit at ts: 39894: <nil>

srfrog pushed a commit that referenced this issue Sep 19, 2018
Reuse known language stopwords with similar languages. If/when the
support for the languages is added the aliases are ignored.

Ref: #2601
@srfrog srfrog self-assigned this Sep 28, 2018
@srfrog
Contributor

srfrog commented Sep 29, 2018

We can fix some languages with #2602, but we are still missing some, such as @pl. Here's a list of the languages we could support: https://github.com/blevesearch/bleve/tree/master/analysis/lang

srfrog pushed a commit that referenced this issue Oct 2, 2018
* add language aliases for broader support.

Reuse known language stopwords with similar languages. If/when the
support for the languages is added the aliases are ignored.

* added a test for all supported and potential fulltext index language tokenizers

Ref: #2601
dna2github pushed a commit to dna2fork/dgraph that referenced this issue Jul 19, 2019
* add language aliases for broader support.

Reuse known language stopwords with similar languages. If/when the
support for the languages is added the aliases are ignored.

* added a test for all supported and potential fulltext index language tokenizers

Ref: dgraph-io#2601