Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IllegalStateException during indexing #175

Closed
attila-vegh opened this issue Mar 7, 2022 · 32 comments · Fixed by #176
Closed

IllegalStateException during indexing #175

attila-vegh opened this issue Mar 7, 2022 · 32 comments · Fixed by #176
Assignees
Labels
serialization-format-change Changes that affect index serialization format
Milestone

Comments

@attila-vegh
Copy link

I got the next exception from Jandex library:

Caused by: java.lang.IllegalStateException: Class not found in class table:scala.Function18$$anonfun$curried$1
at org.jboss.jandex.IndexWriterV2.positionOf (IndexWriterV2.java:507)
at org.jboss.jandex.IndexWriterV2.writeTypeEntry (IndexWriterV2.java:789)
at org.jboss.jandex.IndexWriterV2.writeTypeTable (IndexWriterV2.java:271)
at org.jboss.jandex.IndexWriterV2.write (IndexWriterV2.java:211)
at org.jboss.jandex.IndexWriter.write (IndexWriter.java:110)
at org.jboss.jandex.IndexWriter.write (IndexWriter.java:74)
at org.jboss.jandex.JarIndexer.createJarIndex (JarIndexer.java:193)

Please investigate this issue. thanks

@Ladicek
Copy link
Contributor

Ladicek commented Mar 7, 2022

I'd gladly investigate the issue, if you provided a way to reproduce. Since the stack trace starts at JarIndexer.createJarIndex, I assume you have a JAR file that you're trying to index. Is that JAR file available somewhere, or is it possible to build it from source that is available somewhere?

@attila-vegh
Copy link
Author

The jar file is our company build.
I`m sending a screenshot where the "same" object is created twice
Dotname@957 vs DotName2717

image

I think the type is constructed from scala jar, what is a dependency jar for any other library file.
Our jar does not use scala directly.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 7, 2022

The Scala library, at least the one I'm looking at (org.scala:scala-library:2.13.8) doesn't contain scala.Function18$$anonfun$curried$1. I'll need a way to reproduce the issue, I don't see a way around it.

Your screenshot is unfortunately useless. The nameTable in IndexWriterV2 is a HashMap and DotName provides equality and hash code, so the fact that there are 2 different instances of DotName with the same content doesn't mean anything.

@attila-vegh
Copy link
Author

We are decided to filter out indexing scala jars. Now the indexing is ok

@attila-vegh
Copy link
Author

attila-vegh commented Mar 7, 2022

The Function is inside the: scala-library-2.9.1.jar

@Ladicek
Copy link
Contributor

Ladicek commented Mar 7, 2022

Thank you! Indeed running java -jar core/target/jandex-3.0.0-SNAPSHOT.jar ~/Downloads/scala-library-2.9.1-1.jar fails with the same exception:

$ java -jar core/target/jandex-3.0.0-SNAPSHOT.jar ~/Downloads/scala-library-2.9.1-1.jar 
java.lang.IllegalStateException: Class not found in class table:scala.collection.TraversableLike$$anonfun$isEmpty$1
        at org.jboss.jandex.IndexWriterV2.positionOf(IndexWriterV2.java:491)
        at org.jboss.jandex.IndexWriterV2.writeTypeEntry(IndexWriterV2.java:801)
        at org.jboss.jandex.IndexWriterV2.writeTypeTable(IndexWriterV2.java:264)
        at org.jboss.jandex.IndexWriterV2.write(IndexWriterV2.java:206)
        at org.jboss.jandex.IndexWriter.write(IndexWriter.java:102)
        at org.jboss.jandex.IndexWriter.write(IndexWriter.java:66)
        at org.jboss.jandex.JarIndexer.createJarIndex(JarIndexer.java:197)
        at org.jboss.jandex.JarIndexer.createJarIndex(JarIndexer.java:88)
        at org.jboss.jandex.Main.getIndex(Main.java:87)
        at org.jboss.jandex.Main.execute(Main.java:67)
        at org.jboss.jandex.Main.main(Main.java:52)

I'll look into it.

In the meantime, I believe what you did seems right, indexing the Scala classes would most likely be useless for you.

@Ladicek Ladicek reopened this Mar 8, 2022
@Ladicek
Copy link
Contributor

Ladicek commented Mar 8, 2022

Reopening, I can reproduce the issue by indexing a single file: scala/Function2$$anonfun$curried$1$$anonfun$apply$1.class. That should itself work fine, because Jandex is explicitly designed for incomplete classpaths. Yet I get the exception when writing the index through IndexWriterV2.

It also turns out that the hint at 2 different DotName objects is relevant. When the type table is being serialized, it seems the name table contains 2 entries with a key of scala.Function2$$anonfun$curried$1. Those 2 DotNames are differently componentized, and there may be a bug either in the name destructuring logic, or in equality/hash code.

Looking more into it.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 8, 2022

One more thing: I recall we adjusted DotName parsing to account for $ at the beginning or at the end of a name component. Since the class names in question here contain $$, I'm thinking maybe those adjustments are at fault. Need to check.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 8, 2022

The changes we did to account for $ at first/last position don't seem to be related.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 8, 2022

OK so DotName.hashCode seems fine, hash codes of those 2 different DotNames are equal. DotName.equals is more suspicious, but I guess that one of the DotName's core assumptions is that componentization is canonical (there's exactly one componentization for each valid name). In other words, the problem is probably in the code that creates the componentized DotNames in the first place.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 8, 2022

(Side note: to be completely fair, implementing DotNames equality by delegating to DotName.toString's equality would probably be easiest and I'm not convinced it would be much slower :-) )

@Ladicek
Copy link
Contributor

Ladicek commented Mar 8, 2022

So the 1st DotName is created by parsing a field, whose descriptor is scala/Function2$$anonfun$curried$1. This is componentized by NameTable.convertToName as:

  • {local = "scala", innerClass = false}
  • {local = "Function2$", innerClass = false}
  • {local = "anonfun", innerClass = true}
  • {local = "curried", innerClass = true}
  • {local = "1", innerClass = true}

The same DotName would later be created by parsing a method parameter, whose descriptor is identical, but that is correctly found in the intern pool.

The 2nd DotName is created by parsing a generic signature of that same method, which reads (Lscala/Function2<TT1;TT2;TR;>.$anonfun$curried$1;)V. That is a little strange, but valid.

The generic signature parser first finds a type name of scala.Function2. That name is componentized by NameTable.convertToName as:

  • {local = "scala", innerClass = false}
  • {local = "Function2", innerClass = false}

Then, the generic signature parser finds and parses the type arguments, and then it looks for nested types. Those are delimited by ., so there's just one: $anonfun$curried$1. NameTable.wrap is used to create a DotName whose prefix is the 2-element name shown above and local name is $anonfun$curried$1. In total, it looks like this:

  • {local = "scala", innerClass = false}
  • {local = "Function2", innerClass = false}
  • {local = "$anonfun$curried$1", innerClass = true}

Each case makes perfect sense when taken in isolation. However, looking at both of them at the same time, I'm starting to believe that DotName componentization most likely should NOT be assumed to be canonical. In other words, DotName.equals is wrong when both names are componentized.

@attila-vegh
Copy link
Author

It is possible to somehow resolve this?

@Ladicek
Copy link
Contributor

Ladicek commented Mar 8, 2022

It is possible. I'm still working on how to do it best.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 8, 2022

Hi @n1hility, could you please comment on how DotName componentization is supposed to work? At the moment, it seems to me that DotName.equals assumes that there's always exactly 1 componentization of any given name, but I've discovered (see above if you have plenty of time) that that's not really the case.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 9, 2022

I realized yesterday evening that the generic signature parser is probably at fault, because it does not componentize the inner class name. I'll check today.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 9, 2022

I came up with a relatively simple change that makes the generic signature parser componentize the inner class name and end up with the following DotName:

  • {local = "scala", innerClass = false}
  • {local = "Function2", innerClass = false}
  • {local = "$anonfun", innerClass = true}
  • {local = "curried", innerClass = true}
  • {local = "1", innerClass = true}

This is almost identical to the DotName created by parsing the descriptor, shown in one of the previous comments. The only difference is the placement of that one extra $.

Fixing that will likely affect the treatment of $ in the first/last position, as I mentioned above. Will continue digging.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 9, 2022

I think I just got enlightened :-) The problem with a descriptor like scala/Function2$$anonfun$curried$1 is that it is literally impossible to know if any given $ is a nesting separator or a true part of the name. Therefore, the only possible canonical componentization of a name like this is to split on every occurence of $, which leads to introduction of empty local names.

We eliminated empty local names previously (as I mentioned several times, by treating $ in the first/last position specially), which leads to ambiguity: should we split greedily (leading to Function2 + $anonfun), or eagerly (leading to Function2$ + anonfun). Any given answer to this question may be wrong, which will manifest in a situation like this, when there's another source of the name which leads to a different componentization (in this case, it's the generic signature).

The only question is, well, now what :-)

@Ladicek
Copy link
Contributor

Ladicek commented Mar 9, 2022

I have a patch locally that allows (not sure if correctly :-) ) indexing scala-library-2.9.1-1.jar. That patch attempts to restore canonical DotName componentization.

However, looking more at DotName code, it only guards against . separators when constructing componentized instances. It seems $ is allowed to be mixed relatively freely. Sometimes, it's part of the local name, and sometimes, it's a nested type separator. Under that condition, componentization cannot be canonical, and hence DotName.equals() is wrong when both instances are componentized.

@n1hility
Copy link
Contributor

@Ladicek sorry for the delay I will take a look it this tomorrow (today for you). Skimming it sounds like you are on the right track. The intention is that it's not precise (since not enough info), but it should be deterministic.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 11, 2022

@n1hility thanks, I think I've got it now. I need to fix DotName.equals(), which will be interesting (I found in history that it was specifically refactored to avoid allocations etc.), but doable.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 11, 2022

I have a fixed version of DotName.equals that allows successfully indexing the entire scala-library-2.9.1-1.jar. Unfortunately, reading the index back fails, so the fix isn't exactly a fix :-) Will continue digging.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 11, 2022

Reproducing the additional failure is possible by indexing one file (scala/collection/parallel/ParSeqLike$Elements$$anonfun$psplit$3$$anon$3.class) and then serializing and deserializing that index.

Of course, it's once again caused by a piece of code that assumes that componentization is canonical. This time, it's the code that reads/writes the name table.

Specifically, these 2 names are serialized next to each other:

  • scala.collection.parallel.ParSeqLike$Elements$$anonfun$psplit
  • scala.collection.parallel.ParSeqLike$Elements$$anonfun$psplit$3

The first is componentized like this:

  • {local = "scala", innerClass = false}
  • {local = "collection", innerClass = false}
  • {local = "parallel", innerClass = false}
  • {local = "ParSeqLike", innerClass = false}
  • {local = "Elements$", innerClass = true}
  • {local = "anonfun", innerClass = true}
  • {local = "psplit", innerClass = true}

While the second is componentized like this:

  • {local = "scala", innerClass = false}
  • {local = "collection", innerClass = false}
  • {local = "parallel", innerClass = false}
  • {local = "ParSeqLike", innerClass = false}
  • {local = "Elements$$anonfun$psplit$3", innerClass = true}

The code that constructs componentized DotNames during deserialization completely miscalculates the name depths and from there, everything goes wrong.

Will think about a fix next week.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 11, 2022

(Adding here just for the fun: after the index reader miscalculates the name depths, it even creates non-existing names such as scala.collection.parallel.ParSeqLike$Elements$$anonfun$psplit$3$3$ :-) )

@n1hility
Copy link
Contributor

@Ladicek ok thanks. it's all yours. One thing to check out is the sorting is correct, since these names are written in a radix tree order where each branch is preceded by its peers and finally the immediate parent (prefix())

@Ladicek
Copy link
Contributor

Ladicek commented Mar 14, 2022

Yea the sorting is correct. It's the assumption that it's possible, when deserializing a name, to get the correct prefix by taking the previously deserialized name and call prefix() on it N times, where N is read from the serialized format of the name. That assumes canonical componentization.

I realized over the weekend that there's a simple fix. Instead of writing a depth of the name, which will later be used to find the prefix, we can just store an index of the prefix in the name table. Since names are written in order, the prefix must have already been written, so we can simply look it up from the table, instead of inspecting the previously deserialized name. In case of huge JARs with 1000s classes, indices can be large numbers and the total index size could increase. That can be mitigated relatively easily: instead of storing an [absolute] index, we can store a [relative] offset. Since offsets would always be negative, we would actually store the offset modified like this: -1 would be stored as 0, -2 would be stored as 1, -3 would be stored as 2, etc. To sum up, if int currentIndex is an index of the currently serialized name and int prefixIndex is an index of the prefix, we would write currentIndex - prefixIndex - 1.

That obviously requires a new persistent format version, so can't be fixed in 2.4.

EDIT: the offset format described above actually can't be used, because a prefix of null requires extra treatment. So we'll use 0 to mean a prefix of null, otherwise we'll store an absolute value of the prefix offset (so currentIndex - prefixIndex).

@Ladicek
Copy link
Contributor

Ladicek commented Mar 14, 2022

Aaaand with the 2 changes described above (fixed DotName.equals when both names are componentized, and fixed serialization/deserialization of name prefix), the entire scala-library-2.9.1-1.jar file can be successfully indexed and the index can be successfully serialized and deserialized again. Yay! :-)

@Ladicek
Copy link
Contributor

Ladicek commented Mar 15, 2022

I could write a test for the DotName.equals bug relatively easily, but I'm struggling to write a test for the name prefix serialization/deserialization issue. This small snippet of Scala code, compiled with Scala 2.11.12, reproduces the issue:

package test

trait Func[-T1, -T2, +R] {
  def apply(v1: T1, v2: T2): R

  def curried: T1 => T2 => R = {
    (x1: T1) => (x2: T2) => apply(x1, x2)
  }
}

But I'm struggling to write an equivalent Java code that would also exhibit the issue. Will continue digging for some time, not sure what to do if I'm not successful.

@Ladicek
Copy link
Contributor

Ladicek commented Mar 15, 2022

I just realized that I could just put the Scala source code into the test-data submodule and be done with it. There's a Maven plugin for Scala that can be used to compile it. The path of least resistance! :-)

@Ladicek
Copy link
Contributor

Ladicek commented Mar 15, 2022

OK this is embarrassing. I spent more than a day trying to find a reproducer for the name prefix serialization/deserialization issue, and it looks like my test for DotName.equals reproduces the problem too, if I roundtrip that index. The reproducer is more subtle, because deserializing the index doesn't fail, it just builds a nonsensical index in memory. But that's good enough for me.

@Ladicek Ladicek linked a pull request Mar 15, 2022 that will close this issue
@Ladicek
Copy link
Contributor

Ladicek commented Mar 15, 2022

Fixes are in #176. Due to the need to change persistent format, this can't be backported to 2.x, unfortunately.

@Ladicek Ladicek self-assigned this Mar 16, 2022
@Ladicek Ladicek added this to the 3.0.0 milestone Mar 16, 2022
@Ladicek
Copy link
Contributor

Ladicek commented Mar 23, 2022

Fixed in #176.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
serialization-format-change Changes that affect index serialization format
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants