Hash Tables

Gregory M. Kapfhammer

March 31, 2025

Efficient way to find data in a collection?

  • Sort the items in a collection
  • Use binary search to find an item
  • Amortize the cost of sorting
  • Operate on collections are static
  • Wait, what are the limitations of this approach?

What are mappings and hash table structures?

  • A mapping is an association between two sets of items
  • A mapping associates a value to a key
  • This association is called a key-value pair
  • Key are ideally unique, so there is only one value per key
  • Why do dictionary operations only take constant time?

Mapping abstract data type

  • get(k) - return the value associate to the key k; Usually an error (KeyError) is raised if the given key is not present

  • put(k,v) - Add the key-value pair (k,v) to the mapping

  • Implemented as __getitem__ and __setitem__ in Python
  • Complete implementation is sophisticated and nuanced
  • Small mistakes can lead to performance problems
  • Start with a simple implementation and iteratively improve it

Can you sketch an efficient dictionary implementation?

Preliminary hash table

class Entry:
    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __str__(self):
        return str(self.key) + " : " + str(self.value)

def mapput(L, key, value):
    for e in L:
        if e.key == key:
            e.value = value
            return
    L.append(Entry(key, value))

def mapget(L, key):
    for e in L:
        if e.key == key:
            return e.value
    raise KeyError

m = []
mapput(m, 4, 'five')
mapput(m, 1, 'one')
mapput(m, 13, 'thirteen')
mapput(m, 4, 'four')
assert(mapget(m, 1) == 'one')
assert(mapget(m, 4) == 'four')
print(mapget(m, 1) == 'one')
print(mapget(m, 4) == 'four')
True
True

Try the basic hash table

What can we learn from the implementation of Entry and the mapput and mapget functions?

  • Offers a new API for accessing data stored in a list
  • Uses the Entry class to store key-value pairs
  • Provides the mapput function to add key-value pairs
  • Uses the mapget function to retrieve values
  • Yet, no efficiency gains and not a suitable interface!

Can you suggest ways to improve this implementation?

What is our ultimate goal for the interface to a hash table?

d = {'key1': 'value1', 'key2': 'value2'}

for k in d:
    print(k)
for v in d.values():
    print(v)
for k, v in d.items():
    print(k, v)

print(d)
key1
key2
value1
value2
key1 value1
key2 value2
{'key1': 'value1', 'key2': 'value2'}

Complete API for a hash table

  • __getitem__(k): return the value associate to the key k; usually an error (KeyError) is raised if the given key is not present

  • __setitem__(k, v): Add the key-value pair (k,v) to the mapping

  • remove(k): Remove the entry with key k if it exists

  • __len__: return the number of keys in the dictionary

  • __contains__(k): return true if the mapping contains a pair with key k

  • __iter__: return an iterator over the keys in the dictionary

  • values: return an iterator over the values in the dictionary

  • items: return an iterator over the key-value pairs (as tuples)

  • __str__: return a string representation of the mapping

Be careful! Don’t assume details about hash table behavior!

  • 🎉 Remember, dict is a non-sequential collection
  • 🎲 The order of items in a dict is unpredictable
  • 🔄 Order can change between two iterations of same dict
  • 📝 Now dict items are in a fixed order since using a list
  • 🚀 Moving items around a list improves running times!

Implementing a performant dictionary is surprisingly challenging! What strategies can we use for assessment?

List-based hash table implementation

class ListMappingSimple:
    def __init__(self):
        self._entries = []

    def put(self, key, value):
        e = self._entry(key)
        if e is not None:
            e.value = value
        else:
            self._entries.append(Entry(key, value))

    def get(self, key):
        e = self._entry(key)
        if e is not None:
            return e.value
        else:
            raise KeyError

    def remove(self, key):
        e = self._entry(key)
        if e is not None:
            self._entries.remove(e)

    def _entry(self, key):
        for e in self._entries:
            if e.key == key:
                return e
        return None

    def __str__(self):
        return "{" + ", ".join(str(e) for e in self._entries) + "}"

    def __len__(self):
        return len(self._entries)

    def __contains__(self, key):    
        if self._entry(key) is None:
            return False
        else:
            return True

    def __iter__(self):
      return (e.key for e in self._entries)

    def values(self):
        return (e.value for e in self._entries)

    def items(self):
        return ((e.key, e.value) for e in self._entries)

    __getitem__ = get
    __setitem__ = put

What is the key drawback of this implementation?

First, try out the ListMappingSimple!

my_map = ListMappingSimple()
my_map["name"] = "John Doe"
my_map["age"] = 30
my_map["city"] = "New York"
my_map.put("occupation", "Developer")

print(f"Entire map: {my_map}")
print(f"Map size: {len(my_map)}")

print("Retrieving values:")
print(f"Name: {my_map.get('name')}")
print(f"Age: {my_map['age']}")
print(f"City: {my_map.get('city')}")
Entire map: {name : John Doe, age : 30, city : New York, occupation : Developer}
Map size: 4
Retrieving values:
Name: John Doe
Age: 30
City: New York

More operations of ListMappingSimple

print("Checking if keys exist:")
print(f"'name' exists: {'name' in my_map}")
print(f"'email' exists: {'email' in my_map}")

my_map["age"] = 31
print(f"Updated age: {my_map['age']}")

print("Iterating through keys:")
for key in my_map:
    print(f"Key: {key}")
Checking if keys exist:
'name' exists: True
'email' exists: False
Updated age: 31
Iterating through keys:
Key: name
Key: age
Key: city
Key: occupation

Updates to a ListMappingSimple

print("Removing 'city' entry")
my_map.remove("city")
print(f"Updated map: {my_map}")
print(f"'city' exists after removal: {'city' in my_map}")

print("Iterating through key-value pairs:")
for key, value in my_map.items():
    print(f"{key}: {value}")
Removing 'city' entry
Updated map: {name : John Doe, age : 31, occupation : Developer}
'city' exists after removal: False
Iterating through key-value pairs:
name: John Doe
age: 31
occupation: Developer
  • This is a flexible and useful way to store key-value pairs!
  • What is the key drawback of this implementation?

The ListMapping class is not efficient! Ideas?

  • Goal is to get near constant-time access to data like dict

  • Right now more than one method has \(O(n)\) running time!

  • Idea: Use a hash function to map keys to indices for values

A hash function takes a key and returns an integer. Most classes in Python implement a method called __hash__ for this purpose. Let’s try it!

A HashMap containing many ListMaps

class HashMappingSimple:
    def __init__(self):
        self._size = 100
        self._buckets = [ListMappingSimple() for _ in range(self._size)]

    def put(self, key, value):
        m = self._bucket(key)
        m[key] = value

    def get(self, key):
        m = self._bucket(key)
        return m[key]

    def _bucket(self, key):
        return self._buckets[hash(key) % self._size]
  • The 100 instances of ListMapping are called buckets
  • The hash function maps keys to a bucket in the HashMap
  • Warning! If two keys hash to the same index, it is a collision!

Using the HashMappingSimple

hash_map = HashMappingSimple()
hash_map.put("name", "Alice Smith")
hash_map.put("age", 28)
hash_map.put("email", "alice@example.com")
hash_map.put("city", "Boston")

print(f"Name: {hash_map.get('name')}")
print(f"Age: {hash_map.get('age')}")
print(f"Email: {hash_map.get('email')}")
print(f"City: {hash_map.get('city')}")
Name: Alice Smith
Age: 28
Email: alice@example.com
City: Boston
  • The put method allows for adding key-value pairs
  • The get method allows for retrieving the value for a key

Understanding the _bucket method

def _bucket(self, key):
    return self._buckets[hash(key) % self._size]
  • Locates the correct bucket for any key in the hash map
  • hash(key) converts any key into a numeric value
  • % self._size ensures the index is within valid range of 0 to size-1
  • self._buckets[index] retrieves the specific bucket of ListMappingSimple
  • Each bucket is a separate list that handles its own keys and values
  • Goal is to distribute data for faster lookups for \(O(1)\) average case
  • However, multiple keys may share the same bucket! Collisions!
  • The buckets store the data mapped to them by the hash function
  • What are the key trade-offs for hashing and bucket count?

Bucket count for HashMappingSimple?

  • Picking 100 buckets is arbitrary used for illustration!
  • There is a trade-off between the number of buckets and:
    • Time overhead
    • Space overhead
    • Number of collisions
  • Strategies for picking the number of buckets?

Key: Use more buckets as the number of entries increases! This approach only increases the hash table’s space overhead in a demand-driven fashion.

HashMap with dynamic bucket count

class HashMappingPrime:
    def __init__(self, size = 2):
        self._size = size
        self._buckets = [ListMappingSimple() for _ in range(self._size)]
        self._length = 0

    def put(self, key, value):
        m = self._bucket(key)
        if key not in m:
            self._length += 1
        m[key] = value
        if self._length > self._size:
            self._double()

    def get(self, key):
        m = self._bucket(key)
        return m[key]

    def remove(self, key):
        m = self._bucket(key)
        m.remove(key)

    def __contains__(self, key):
        m = self._bucket(key)
        return key in m

    def _bucket(self, key):
        return self._buckets[hash(key) % self._size]

    def _double(self):
        oldbuckets = self._buckets
        self._size *= 2
        self._buckets = [ListMappingSimple() for _ in range(self._size)]
        for bucket in oldbuckets:
            for key, value in bucket.items():
                m = self._bucket(key)
                m[key] = value

    def __len__(self):
        return self._length

    def __iter__(self):
        for b in self._buckets:
            for k in b:
                yield k

    def values(self):
        for b in self._buckets:
            for v in b.values():
                yield v

    def items(self):
        for b in self._buckets:
            for k, v in b.items():
                yield k, v

    def __str__(self):
        itemlist = [str(e) for b in self._buckets for e in b._entries]
        return "{" + ", ".join(itemlist) + "}"

    __getitem__ = get
    __setitem__ = put

What is a drawback of this HashMap implementation?

Rehashing with the _double method

def _double(self):
1    oldbuckets = self._buckets
2    self._size *= 2
3    self._buckets = [ListMapping() for i in range(self._size)]
4    for bucket in oldbuckets:
        for key, value in bucket.items():
5            m = self._bucket(key)
            m[key] = value
1
Save references to the old buckets
2
Double the size of the HashMap
3
Create the new list of buckets
4
Add in all of the old entries
5
Identify new bucket for each key-value pair

What is the worst-case time complexity of _double?

Improving the design of ListMapping and HashMapping

  • The data structures share methods in common
  • Factoring out a superclass can reduce code duplication
  • Goal is to improve the design, not overall performance

Creating a common superclass

class Mapping:

    def get(self, key):
        raise NotImplementedError

    def put(self, key, value):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def _entryiter(self):
        raise NotImplementedError   

    def __iter__(self):
      return (e.key for e in self._entryiter())

    def values(self):
        return (e.value for e in self._entryiter())

    def items(self):
        return ((e.key, e.value) for e in self._entryiter())

    def __contains__(self, key):
        try:
            self.get(key)
        except KeyError:
            return False
        return True

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, value):
        self.put(key, value)

    def __str__(self):
        return "{" + ", ".join(str(e) for e in self._entryiter()) + "}"

Raise NotImplementedError if a subclass does not implement a method

Refactor the ListMapping

class ListMapping(Mapping):
    def __init__(self):
        self._entries = []

    def put(self, key, value):
        e = self._entry(key)
        if e is not None:
            e.value = value
        else:
            self._entries.append(Entry(key, value))

    def get(self, key):
        e = self._entry(key)
        if e is not None:
            return e.value
        else:
            raise KeyError

    def _entry(self, key):
        for e in self._entries:
            if e.key == key:
                return e
        return None

    def _entryiter(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)

Notation ListMapping(Mapping) creates an inheritance hierarchy

Refactor the HashMapping

class HashMapping(Mapping):
    def __init__(self, size = 100):
        self._size = size
        self._buckets = [ListMapping() for _ in range(self._size)]
        self._length = 0

    def _entryiter(self):
        return (e for bucket in self._buckets for e in bucket._entryiter())

    def get(self, key):
        bucket = self._bucket(key)
        return bucket[key]

    def put(self, key, value):
        bucket = self._bucket(key)
        if key not in bucket:
            self._length += 1
        bucket[key] = value
        if self._length > self._size:
            self._double()

    def __len__(self):
        return self._length

    def _bucket(self, key):
        return self._buckets[hash(key) % self._size]

    def _double(self):
        oldbuckets = self._buckets
        self.__init__(self._size * 2)
        for bucket in oldbuckets:
            for key, value in bucket.items():
                self[key] = value

Wow, this implementation is more concise and easier to understand!

Use the HashMapping class

hash_map = HashMapping(size=10)

print(f"Initial hash map with {len(hash_map)} entries")
print("Adding entries to the hash map!")
hash_map["name"] = "Alice Johnson"
hash_map["age"] = 32
hash_map["email"] = "alice@example.com"

print(f"Hash map now has {len(hash_map)} entries")
print("Retrieving values:")
print(f"  Name: {hash_map.get('name')}")
print(f"  Age: {hash_map['age']}")
print(f"  Email: {hash_map['email']}")
Initial hash map with 0 entries
Adding entries to the hash map!
Hash map now has 3 entries
Retrieving values:
  Name: Alice Johnson
  Age: 32
  Email: alice@example.com

Wow, similar to Python’s dictionary!

hash_map = dict()

hash_map["name"] = "Alice Johnson"
hash_map["age"] = 32
hash_map["email"] = "alice@example.com"

print(f"  Name: {hash_map.get('name')}")
print(f"  Age: {hash_map['age']}")
print(f"  Email: {hash_map['email']}")
  Name: Alice Johnson
  Age: 32
  Email: alice@example.com
  • Similar core operations like [], get(), and put()
  • Automatic resizing with a simple “load factor”
  • Uses hash function for key distribution

Hash table data structure

ListMapping

Implementation

  • Good design
  • Easy to understand
  • List of key-value pairs

Performance

  • Can be inefficient
  • Limits memory usage
  • Does not match dict

HashMapping

Implementation

  • Good design
  • Challenging to build
  • Use the ListMapping

Performance

  • Improved efficiency
  • Amortizes costs
  • Increases memory usage