Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce hash table switch strategy #895

Merged
merged 1 commit into from
Feb 7, 2019
Merged

Conversation

fwbrasil
Copy link
Contributor

@fwbrasil fwbrasil commented Jan 5, 2019

This is the first change related to #893.

Problem

The current switch strategies don't work well when the keys are sparse values since the density of the table switch is too low and Graal has to choose less efficient strategies.

Solution

Introduce a new switch strategy based on cheap hash functions. The Hasher evaluates the available functions and tries to find one that guarantees no conflicts for the provided keys. The approach uses imperfect hash functions based on the paper "Improving Switch Statement Performance with Hashing Optimized at Compile Time".

HashTableSwitchOp emits the assembly code for the hash table switch. If there's a default target, it also emits the original keys as part of the table and verifies the value to handle possible conflicts with unknown keys.

I've done some benchmarking and, in some cases, the hash table strategy has similar performance to a regular table switch but with better compaction:

BinaryStrategy[avgEffort=6.416389351081533] (the default)

Benchmark                    Mode  Cnt         Score        Error  Units
HashSwitchBench.httpStatus  thrpt   20  26146297.275 ± 345930.644  ops/s

Table switch (density 0.15048543689320387)

Benchmark                    Mode  Cnt         Score        Error  Units
HashSwitchBench.httpStatus  thrpt   20  32873184.212 ± 384754.780  ops/s

Hasher[function=x, effort=1, cardinality=224] (density 0.2767857142857143)

Benchmark                    Mode  Cnt         Score        Error  Units
HashSwitchBench.httpStatus  thrpt   20  32400323.086 ± 218257.904  ops/s

Even though the hash table strategy can achieve better compaction than the regular table switch for some cases, the current heuristic used to decide between table switches and other strategies isn't very robust and rarely chooses the hash switch. This issue will be addressed in another PR as described in #893

Notes

  1. Is it ok to implement the new strategy only for one architecture?
  2. I don't have much experience writing assembly. This first version doesn't seem very optimized since it uses more registers and uses two moves when defaultTarget is not null. I'm also not sure if the padding is ok. Suggestions are appreciated :)

double minDensity = 1 / Math.sqrt(strategy.getAverageEffort() * tableSwitchBonus);

Optional<Hasher> hasher = Optional.empty();
if (supportsHashTableSwitch()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a supportsHashTableSwitch() seems odd. Could we instead add a method that returns the hasher and the default implementation just returns Optional.empty()?

@christhalinger
Copy link
Contributor

christhalinger commented Jan 7, 2019

Is it ok to implement the new strategy only for one architecture?

Yes, that's ok. Especially since this is not a feature but an optimization. Some things are only implemented for AMD64. We just need to find a good way to abstract it (see my comment).

@@ -88,6 +90,8 @@

public static class Options {
// @formatter:off
@Option(help = "", type = OptionType.Expert)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New options must not have an empty help value. That said, it would be best to avoid adding this option altogether. Can you not tune the heuristic to choose the hashing strategy?

masm.jmp(scratchReg);

// Inserting padding so that jump table address is 4-byte aligned
if ((masm.position() & 0x3) != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be masm.align(4). I guess this is a copy paste from the existing TableSwitchOp, so maybe fix it too.

masm.leaq(scratchReg, new AMD64Address(AMD64.rip, 0));
final int afterLea = masm.position();

// Jump to default target if index doesn't contain the expected value
Copy link
Member

@tkrodriguez tkrodriguez Jan 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments explaining the table structure would be helpful and maybe merge the defaultTarget != null ifs.


// Jump to default target if index doesn't contain the expected value
if (defaultTarget != null) {
masm.movslq(valueScratchReg, new AMD64Address(scratchReg, idxScratchReg, Scale.Times8, 0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might consider loading 8 bytes here instead of issuing two loads. If you put key is in the low 32 bits you can still use cmpl and then just perform a shift instead of the second load. You'd want to align the table at 8 bytes instead which might be a good idea in any case when you have both key and offset in the table. Having a single load would also allow you to eliminate the move of indexReg to idxScratchReg since you'd only have a single use of it.

}
}
if (candidates.isEmpty()) {
return Optional.empty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea how often this happens in practice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It happens often since minDensity is normally too low and the hasher stops the evaluation based on it. It's necessary to review the heuristic that makes the decision between the density of the table switch x the cost of other strategies. I'll remove the bonus parameter for now and follow up with a separate PR.

@tkrodriguez
Copy link
Member

In your benchmark numbers what kind of switch implementation is being used in the "without hash switch" case? Presumably it's not a tableswitch. how does that perform?

@plokhotnyuk
Copy link

I hope it will fix the following performance regression: #412

@fwbrasil
Copy link
Contributor Author

fwbrasil commented Jan 8, 2019

@plokhotnyuk It doesn't seem it will. The values are too sparse and the hasher can only find a function at arity 846, which would make the table too large.

public static void main(String[] args) {
        int[] keys = {3355, -1193609971, 3373707, -43264386, 1901043637, -1724546052, 116079, -2102114367, -608539730, -460163995, -1666926107, -1974943731, 1369680106, 226316666, 1699658638,
                        36848094, -1085743021, -1994383672, -1824078800, 3314158, 647263482, 583435343, 1690715422, 1592986856, -1189803120, 594447228, 1021718665, -174080651, -782008927, -690137550,
                        1512257332, -1412739479, 486184128, -1774134745, -101767272, 284519336, -312724053, 545330631, 765915793, 585125142, 1272354024, -475640257};
        JavaConstant[] c = new JavaConstant[keys.length];
        for (int i = 0; i < keys.length; i++)
            c[i] = JavaConstant.forInt(keys[i]);
        Optional<Hasher> forKeys = forKeys(c, 0.00001);
        System.out.println(forKeys); // Optional[Hasher[function=rotateRight(x, p) ^ x, effort=5, cardinality=864]]
    }

@Use protected Value value;
@Temp({REG, HINT}) protected Value hash;
@Temp protected Value scratch;
@Temp({REG, HINT}) protected Value entryScratch;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if these annotations are correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the HINT on this one and add REG to scratch. It can be tricky to get optimal register usage for these complex operations since we don't have a completely flexible way of specifying register interference to the allocator. The use of Alive combined with Temp is the safe way to go.

@fwbrasil
Copy link
Contributor Author

fwbrasil commented Jan 8, 2019

@christhalinger @tkrodriguez @dougxc thank you for the reviews! I've just addressed your suggestions.

@tkrodriguez I changed the code to force specific strategies to collect the performance numbers:

BinaryStrategy[avgEffort=6.416389351081533] (the default)

Benchmark                    Mode  Cnt         Score        Error  Units
HashSwitchBench.httpStatus  thrpt   20  26146297.275 ± 345930.644  ops/s


Table switch (density 0.15048543689320387)

Benchmark                    Mode  Cnt         Score        Error  Units
HashSwitchBench.httpStatus  thrpt   20  32873184.212 ± 384754.780  ops/s


Hasher[function=x, effort=1, cardinality=224] (density 0.2767857142857143)

Benchmark                    Mode  Cnt         Score         Error  Units
HashSwitchBench.httpStatus  thrpt   20  65606834.026 ± 1701259.405  ops/s

I'm puzzled that the hash table switch performs much better than the table switch. I did a bunch of manual testing and it seems to function properly, but I wonder if there's something wrong.

@fwbrasil
Copy link
Contributor Author

fwbrasil commented Jan 9, 2019

okay, I just found the bug. The register annotations were wrong. The hash table switch has performance similar to the regular table switch after the fix:

Hasher[function=x, effort=1, cardinality=224] (density 0.2767857142857143)

Benchmark                    Mode  Cnt         Score        Error  Units
HashSwitchBench.httpStatus  thrpt   20  32400323.086 ± 218257.904  ops/s


import jdk.vm.ci.meta.JavaConstant;

class Switch03TestGenerator {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created this class to automatically generate source code with switches that use all available hash functions. It'll be useful if someone adds a new hash function, but let me know if you'd prefer not to have it in the repository.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine to leave it in. I would suggest making it a static nested class named Generator at the bottom of Switch03.java to make the connection as clear as possible.

@fwbrasil fwbrasil changed the title [WIP] Introduce hash table switch strategy Introduce hash table switch strategy Jan 9, 2019
@fwbrasil
Copy link
Contributor Author

fwbrasil commented Jan 9, 2019

I've introduced tests and addressed the initial review comments. The PR is ready for review.

@fwbrasil fwbrasil force-pushed the hash branch 3 times, most recently from efd3ed2 to 7e74561 Compare January 9, 2019 11:10
* Applies the hash function.
*
* @param value the value to be hashed
* @param min the lowest key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@param min {@code value} is guaranteed to be greater or equal to this minimum
You should add a @returns tag that explains any guarantees (if any) on the return value based on value and min.

public abstract int apply(long value, long min);

/**
* Applies the hash function to a lir value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "apply" here I think you mean "generates LIR that implements the hash function in terms of value and min". In that case, please update the documentation and rename the method to generate.

public abstract Value apply(Value value, Value min, ArithmeticLIRGeneratorTool gen);

/**
* Returns an estimate of the hashing effort.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume "effort" means execution cost (in time) of the hashing function. Please update doc to clarify this.

static {
//@formatter:off

// `x` is the value and `s` is the min (lowest key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use the term "key" here since you describe x as the value. For consistency with the methods above, please use val and min instead of x and s.

import jdk.vm.ci.meta.Value;

/**
* This class holds a hash function at a specific cardinality and min value (lowest key).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain cardinality a bit more in this doc.

@fwbrasil
Copy link
Contributor Author

fwbrasil commented Jan 9, 2019

@dougxc I've just pushed changes to address your comments and add a new test that will fail if the set of hash functions changes and the test is not updated.

@tkrodriguez
Copy link
Member

Why did the performance drop so much after your changes? Was it executing incorrectly so it seemed faster than it really was?

@fwbrasil fwbrasil force-pushed the hash branch 3 times, most recently from 65c35d5 to f176c28 Compare January 10, 2019 16:50
@fwbrasil
Copy link
Contributor Author

@dougxc @tkrodriguez @christhalinger thank you for the reviews! I believe all review comments were addressed, please let me know if there's anything pending to merge this change.

@fwbrasil
Copy link
Contributor Author

fwbrasil commented Jan 15, 2019

any updates on this? I'd like to create the second PR with a better heuristic to choose the switch strategy

@fwbrasil
Copy link
Contributor Author

@dougxc @tkrodriguez since I've planned opening multiple PR for #893 and github doesn't support dependent pull requests, I'm currently blocked. Could you clarify how the merge process work so I can plan accordingly? I imagine it's not as simple as just hitting the merge button here on github.

@dougxc
Copy link
Member

dougxc commented Jan 22, 2019

@tkrodriguez can you please integrate this PR.

@tkrodriguez
Copy link
Member

Sorry I hadn't responded earlier. It has to be run through out internal gate which does a much more extensive set of tests. I think there's also a question of whether this is a win overall even though it looks good on a microbenchmark. I'll run our benchmarks and see how it looks as well. So it requires a bit more attention from me and I've been juggling a few other things. Sorry for not communicating that.

You can always create a new PR that includes these changes and then merge and rebase it away once it lands. Git makes that very easy.

@fwbrasil
Copy link
Contributor Author

@tkrodriguez thanks for the update! Please keep in mind that this change is part of a larger plan detailed on #893, so unless the benchmarks have switches with sparse values, it probably won't improve performance. The hash strategy will be important to compile type switches using addresses, which are sparse values.

@fwbrasil
Copy link
Contributor Author

Also, given the current heuristic to decide which compilation strategy to use, the hash switch is very rarely chosen.

@@ -44,6 +44,8 @@
*/
public interface ArithmeticLIRGeneratorTool {

LIRGenerator getLIRGen();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this override? It sort of defeats the purpose of the interface to expose this here. I think replacing ArithmeticLIRGeneratorTool with ArithmeticLIRGenerator in Hasher and HashFunction resolve the problem without requiring this override.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! it does work, I've removed the override

@tkrodriguez
Copy link
Member

The main concern is regressions caused by it. Improvements would be a welcome bonus at this early point.

Can you rebase onto latest master?

@fwbrasil fwbrasil force-pushed the hash branch 2 times, most recently from cfc55ee to 31b045d Compare January 22, 2019 22:29
@fwbrasil
Copy link
Contributor Author

@tkrodriguez branch rebased

* Tests optimization of hash table switches.
* Code generated by `Switch03.TestGenerator.main`
*/
public class Switch03 extends JTTTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a more descriptive name since this is specifically about the hash switch, like HashSwitchTest. Also is there any reason this is a JTTest? These are special variants of regular compiler tests that are exercised with their parameters bounds as constants in addition to a normal compile to test high level transforms. I don't think you care about that here. I think this should probably be over in org.graalvm.compiler.core.test. Or if you wanted it to be a more specific test of the LIR you could make it a LIRTest and check for the existence of a HashTableSwitchOp in the final LIR. Anyway, renaming is enough for now. You can make any improvements in a follow on PR.

keys[idx] = keyConstants[i];
targets[idx] = keyTargets[i];
}
emitHashTableSwitch(hasher.get(), keys, defaultTarget, targets, value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasher.get is just h isn't it?

Set<Integer> seen = new HashSet<>(keys.length);
for (JavaConstant key : keys) {
int hash = function.apply(key.asLong(), min) & (cardinality - 1);
if (seen.contains(hash)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this be simplified with the else removed?

Suggested change
if (seen.contains(hash)) {
if (!seen.add(hash)) {

@tkrodriguez
Copy link
Member

The benchmark results look ok. It doesn't trigger very often from what I can see. Should we add an option to force it's use? That could be in a follow on PR if you like.


private static Value rotateRight(ArithmeticLIRGenerator gen, Value i, Value distance) {
// (i >>> distance) | (i << -distance)
return gen.emitOr(gen.emitUShr(i, distance), gen.emitShl(i, gen.emitNegate(distance)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD64ArithmeticLIRGenerator provides emitRor which would be a lot more efficient. Maybe provide a platform independent definition like above and use emitRor for AMD64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkrodriguez maybe I could add emitRor to ArithmeticLIRGenerator with the slow implementation as the default and then AMD64ArithmeticLIRGenerator overrides it with a better impl?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's what I was thinking.

@fwbrasil
Copy link
Contributor Author

fwbrasil commented Feb 4, 2019

@tkrodriguez I've just addressed your last comments.

The benchmark results look ok. It doesn't trigger very often from what I can see. Should we add an option to force it's use? That could be in a follow on PR if you like.

Yes, I have a branch with a new heuristic that will make the hash table switch more competitive. I'll create the PR once this change is merged. The binary strategy has a large code footprint, so the hash table strategy will be chosen in some cases where the binary one is chosen at the moment.

@tkrodriguez
Copy link
Member

Ok. I'm going to do some copyright data adjustments in a new commit and push it.

@dougxc dougxc merged commit b84b8c9 into oracle:master Feb 7, 2019
}

private Value toDWORD(Value v, ArithmeticLIRGenerator t) {
if (v.getPlatformKind() != AMD64Kind.DWORD) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This use of AMD64Kind in a shared file should be fixed in the next PR. I would in general expect the values to come into this function with the right sizing of either int or long with the caller ensuring that it was at least those types already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wait for the next PR? Best to fix this here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me fix it. Should I create a new PR with the fix or are you going to reopen this one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants