In this post, I will try to explain Rabin-Karp Algorithm & its worst time complexity. Rabin-Karp pattern matching algorithm is based on hash matching.

Let's take an example.

Text: aadabc

Pattern: abc

We will use a simple hash function.

Lets create a table as below:

a b c d e

1 2 3 4 5

So how do we calculate the hash value? We will add up the values as per the table.

Pattern hash is a + b + c = 1 + 2 + 3 = 6

For comparing the text, first we will take 3 characters as our pattern length is 3. Then we we will shift one letter in each step & recalculate hash.

t[1] + t[2] + t[3] = 1 + 1 + 4 = 6

So hash matched. We need to compare each character from 1 to 3 position with the characters in the pattern.

t[1] = p[1], match length 1, increment both t & p.

t[2] ≠ p[2], matching failed, we will match characters from t[2] to t[4].

t[2] + t[3] + t[4] = 1 + 4 + 1 = 6

So hash matched. We need to compare each character from 2 to 4 position with the characters in the pattern.

t[2] = p[1], match length 1, increment both t & p.

t[3] ≠ p[2], matching failed, we need to match from t[3] to t[5].

t[3] + t[4] + t[5] = 4 + 1 + 2 = 7

Hash matching failed. We need to match from t[4] to t[6].

t[4] + t[5] + t[6] = 1 + 2 + 3 = 6

So hash matched. We need to compare each character from 4 to 6 position with the characters in the pattern.

t[4] = p[1], match length 1, increment both t & p.

t[5] = p[2], match length 2, increment both t & p.

t[6] = p[3], match length 3, so we found our match.

Now let's discuss few drawbacks of above example. When we are shifting one position in the text, we are recalculating the entire hash.

If we have pattern length 20, we will add up 20 values. If we have pattern length 10, we will add 10 values. So it's not constant & time complexity of hash function depends on pattern length m.

So if we use normal hash function, time complexity will be O(mn). That's why we need to use concept of rolling hash function. It uses sliding window technique.

When we shift one position, we will remove first position value from previous hash & add current position value to that. So whatever the value of m is, we can recalculate hash with only two operations.

That is constant & that will bring down time complexity to O(n).

Here is how we can do that.

t[1] + t[2] + t[3] = 1 + 1 + 4 = 6

To get hash for t[2] to t[4], we can do following.

6 - t[1] = 5

5 + t[4] = 6

So, t[2] + t[3] + t[4] = 6

There is one more thing. Whenever there is a hash match, we need to match pattern with that portion of text & verify if characters are same. In the example above, we saw that hash value is same but characters are not.

These false positives will increase time complexity. And if you are using a bad hash function, then you might get false positives in every step. And you might need to compare half of the pattern characters on an average before you find a mismatch.

So time complexity on that case becomes (m / 2) * n or O(mn). That's why Rabin-Karp algorithm has a worst time complexity of O(mn).

We can do some improvement on the hash function that can reduce time complexity to O(m+n) in most of the cases. I will try to write it in some future post.

But I hope you have got a clear understanding of this string searching algorithm & why its worst time complexity is O(mn).

## No comments:

## Post a comment