Hash Table: Supported Operations
Purpose: maintain a (possibly evolving) set of stuff
(transactions, people + associated data, IP addresses, etc.).
Insert: add new record.
Delete: delete existing record.
Lookup: check for a particular record (a “dictionary”)
Applications:
1. Application: De-Duplication
Given: a “stream” of objects.
(Linear scan through a huge file. Or objects arriving in real time)
Goal: remove duplicates (keep track of unique objects)
report unique visitors to web site
avoid duplicates in search results.
Solution: when new object x arrives
lookup x in hash table H
if not found, Insert x into H.
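The de-duplication loop above can be sketched in Python as follows. This is a minimal illustration, assuming Python's built-in `set` plays the role of the hash table H; the function name `unique_in_order` and the sample visitor addresses are made up for the example.

```python
def unique_in_order(stream):
    """Yield each object the first time it appears in the stream."""
    seen = set()  # built-in hash table of keys, standing in for H
    for x in stream:
        if x not in seen:  # Lookup x in H
            seen.add(x)    # Insert x into H
            yield x


# Hypothetical sample data: visitor IP addresses arriving over time.
visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
unique_visitors = list(unique_in_order(visitors))
```

Because the function is a generator, it works equally well on a linear scan of a huge file and on objects arriving one at a time.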
2. The 2-SUM Problem
Input: unsorted array A of n integers. Target sum t.
Goal: determine whether or not there are two numbers x, y in A with $x + y = t$.
Naive Solution:
$\Theta(n^2)$ time via exhaustive search.
Better:
1.) sort A ($\Theta(n \log n)$ time)
2.) for each x in A, look for t-x in A via binary search ($\Theta(n \log n)$ time overall).
Amazing:
1.) insert elements of A into hash table H.
2.) for each x in A, Lookup t-x in H.
$\Theta(n)$ time.
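A minimal sketch of the hash-table approach to 2-SUM, assuming Python's `set` as the hash table. It is a one-pass variation on the two steps above: each x is looked up before it is inserted, so an element is never paired with itself. The function name `has_two_sum` is made up for the example.

```python
def has_two_sum(a, t):
    """Return True if two entries of a (at different positions) sum to t."""
    seen = set()  # hash table of the elements scanned so far
    for x in a:
        if t - x in seen:  # Lookup t - x
            return True
        seen.add(x)        # Insert x
    return False


found = has_two_sum([3, 8, 1, 5], 9)    # 8 + 1 = 9
missing = has_two_sum([3, 8, 1, 5], 10)  # would need a second 5
```

With constant-time Insert and Lookup, the loop does a constant amount of work per element, which is where the linear running time comes from.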
3. Further Immediate Applications
Historical application: symbol tables in compilers.
Blocking network traffic.
Search algorithms (game tree exploration)
Use hash table to avoid exploring any configuration
(e.g., an arrangement of chess pieces) more than once.
4. High-Level Idea.
Setup: universe U [all IP addresses, all names, all chessboard configurations, etc.] [generally really big]
Goal: want to maintain an evolving set $S \subseteq U$
[generally, of reasonable size].
Solution:
1.) pick n = number of buckets.
2.) choose a hash function $h: U \to \{0, 1, 2, \ldots, n-1\}$: it takes a key as input and returns a position between $0$ and $n-1$.
3.) use array A of length n, store x in A[h(x)].
About naive solutions:
1. Array-based solution [indexed by u]:
$O(1)$ operations but $\Theta(|U|)$ space.
2. List-based solution:
$\Theta(|S|)$ space but $\Theta(|S|)$ Lookup.
5. Resolving Collisions.
Collision: distinct $x, y \in U$ such that $h(x) = h(y)$, i.e., the hash function returns the same position for different keys.
1.) Solution #1: (separate) chaining.
keep a linked list in each bucket.
given a key/object x, perform Insert/Delete/Lookup in the list in A[h(x)] (h(x): the bucket for x; A[h(x)]: the linked list for x).
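Separate chaining can be sketched as below. This is an illustration under stated assumptions, not a reference implementation: a Python list stands in for each bucket's linked list, Python's built-in `hash` reduced modulo the bucket count stands in for h, and the class name `ChainedHashTable` is made up for the example.

```python
class ChainedHashTable:
    """Set of keys with one chain (here: a Python list) per bucket."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _chain(self, key):
        # h(x): built-in hash reduced to a bucket index (an assumption).
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key):
        chain = self._chain(key)
        if key not in chain:  # keep each key at most once
            chain.append(key)

    def lookup(self, key):
        return key in self._chain(key)

    def delete(self, key):
        chain = self._chain(key)
        if key in chain:
            chain.remove(key)


table = ChainedHashTable(n_buckets=8)
table.insert(17)
table.insert(25)  # 17 and 25 both land in bucket 1 when n_buckets = 8
```

Every operation only touches the chain in one bucket, so its cost is proportional to that chain's length.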
2.) Solution #2: open addressing. (only one object per bucket)
Hash function now specifies a probe sequence.
Examples: linear probing (look consecutively: 17, then 18, 19, ...).
Double hashing (the first hash function specifies the initial bucket that you probe; the second one specifies the offset for each subsequent probe).
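A minimal sketch of open addressing with linear probing. It assumes Python's built-in `hash` for the initial bucket, uses `None` to mark an empty slot (so `None` cannot be stored as a key), and leaves out Delete, which needs extra care under open addressing. The class name `LinearProbingSet` is made up for the example.

```python
class LinearProbingSet:
    """Open addressing: at most one key per slot, probed consecutively."""

    def __init__(self, n_buckets=8):
        self.slots = [None] * n_buckets  # None marks an empty slot

    def _probe_sequence(self, key):
        n = len(self.slots)
        start = hash(key) % n  # initial bucket (assumed hash function)
        for step in range(n):
            yield (start + step) % n  # then the next slot, wrapping around

    def insert(self, key):
        for i in self._probe_sequence(key):
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return
        raise RuntimeError("hash table is full")

    def lookup(self, key):
        for i in self._probe_sequence(key):
            if self.slots[i] is None:
                return False  # an empty slot ends the search
            if self.slots[i] == key:
                return True
        return False


probing = LinearProbingSet(n_buckets=8)
for key in (17, 25, 9):  # all three start at slot 1
    probing.insert(key)
```

The three colliding keys end up in consecutive slots, which is the clustering behaviour that makes a small load factor important for this scheme.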
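Definition: the load factor of a hash table is:
$\alpha = \dfrac{\text{number of objects in the hash table}}{\text{number of buckets in the hash table}}$
Note:
1.) $\alpha = O(1)$ is a necessary condition for operations to run in constant time.
2.) with open addressing, need $\alpha \ll 1$ (only one object per bucket).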
6. Pathological Data Sets
Upshot #2: for good HT performance, need a good hash function.
Ideal: use a super-clever hash function guaranteed to spread every data set out evenly.
Problem: DOES NOT EXIST! (for every hash function, there is a pathological data set)
Reason: fix a hash function $h: U \to \{0, 1, 2, \ldots, n-1\}$.
By the Pigeonhole Principle, there exists a bucket i such that at least $|U|/n$ elements of U hash to i under h.
If the data set is drawn only from these, everything collides!
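To make the pigeonhole argument concrete, here is a small sketch that builds a pathological data set for one specific, deliberately simple hash function. The function `simple_hash` (key modulo the number of buckets) and the numbers used are assumptions chosen for illustration.

```python
from collections import Counter


def simple_hash(x, n):
    """A deliberately simplistic hash function: key modulo bucket count."""
    return x % n


n = 10  # number of buckets (illustrative)
# Every key of the form m * n + 3 lands in bucket 3.
pathological_keys = [m * n + 3 for m in range(1000)]
bucket_counts = Counter(simple_hash(x, n) for x in pathological_keys)
```

All 1000 keys share a single bucket, so a chained table holding them degenerates into one long list.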
7. Pathological Data in the Real World.
Main Point: can paralyze several real-world systems by exploiting badly designed hash functions.
open source.
overly simplistic hash function.
(easy to reverse engineer a pathological data set)
Solutions
1. Use a cryptographic hash function (e.g., SHA-2):
infeasible to reverse engineer a pathological data set.
2. Use randomization:
design a family H of hash functions such that for all data sets S, “almost all” functions $h \in H$
spread S out “pretty evenly”.
Universal Hash Functions
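Definition: Let H be a set of hash functions from U to $\{0, 1, 2, \ldots, n-1\}$.
H is universal if and only if:
for all $x, y \in U$ (with $x \neq y$),
$\Pr_{h \in H}[x, y \text{ collide}] \leq \frac{1}{n}$
(collide: $h(x) = h(y)$),
when h is chosen uniformly at random from H.
(That is, the collision probability is as small as that of perfectly random hashing.)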
Example: Hashing IP Addresses.
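Let U = IP addresses (of the form $(x_1, x_2, x_3, x_4)$), with each $x_i \in \{0, 1, 2, \ldots, 255\}$.
Let n = a prime (a small multiple of the # of objects in the HT).
Construction: define one hash function $h_a$ per 4-tuple $a = (a_1, a_2, a_3, a_4)$ with each $a_i \in \{0, 1, 2, \ldots, n-1\}$.
Define $h_a$: IP addrs $\to$ buckets by
$h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \bmod n$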
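The construction can be sketched in Python as follows: one hash function per 4-tuple of coefficients, each computing a dot product with the four parts of the address modulo a prime n. The bucket count 257, the helper name `random_ip_hash`, and the dotted-string input format are assumptions made for the example.

```python
import random

N_BUCKETS = 257  # illustrative prime; in practice sized to the data set


def random_ip_hash(n=N_BUCKETS, rng=random):
    """Pick a = (a1, a2, a3, a4) uniformly at random and return h_a."""
    a = [rng.randrange(n) for _ in range(4)]

    def h(ip):
        # Split "192.168.0.1" into its four parts x1..x4.
        x = [int(part) for part in ip.split(".")]
        return sum(ai * xi for ai, xi in zip(a, x)) % n

    return h


h = random_ip_hash(rng=random.Random(0))  # seeded for repeatability
bucket = h("192.168.0.1")
```

Choosing the coefficients at random when the table is created is what prevents an adversary from preparing a colliding data set in advance.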
A Universal Hash Function
Define: $H = \{ h_a \mid a_1, a_2, a_3, a_4 \in \{0, 1, 2, \ldots, n-1\} \}$
Theorem: This family is universal.
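Proof (Part I):
Consider distinct IP addresses $(x_1, x_2, x_3, x_4)$, $(y_1, y_2, y_3, y_4)$.
Assume: $x_4 \neq y_4$ (the addresses differ in at least one coordinate; the argument is the same for any other).
Question: what is the collision probability $\Pr_{h_a \in H}[h_a(x_1, \ldots, x_4) = h_a(y_1, \ldots, y_4)]$?
Note: collision
$\iff a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 \equiv a_1 y_1 + a_2 y_2 + a_3 y_3 + a_4 y_4 \pmod{n}$
$\iff a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3} a_i (y_i - x_i) \pmod{n}$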
Proof (Part II)
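The story so far: with $a_1, a_2, a_3$ fixed arbitrarily, how many choices of $a_4$ satisfy
$a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3} a_i (y_i - x_i) \pmod{n}$?
Key Claim: the left-hand side is equally likely to be any of {0,1,2,…,n-1}.
Reason: $x_4 \neq y_4$, so $x_4 - y_4 \not\equiv 0 \pmod{n}$ (this uses $n > 255$), and n is prime with $a_4$ chosen uniformly at random, so $a_4 (x_4 - y_4) \bmod n$ takes each value exactly once as $a_4$ ranges over {0,1,2,…,n-1}.
Hence exactly one choice of $a_4$ causes a collision, and $\Pr[h_a(x) = h_a(y)] = \frac{1}{n}$.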
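The key claim can be checked by brute force for one concrete case. The sketch below fixes three coefficients, tries every possible fourth coefficient, and counts how many cause two addresses to collide. The prime 257, the two addresses, and the fixed coefficients are arbitrary choices made for illustration.

```python
def h_a(a, x, n):
    """Dot product of coefficients a and address parts x, modulo n."""
    return sum(ai * xi for ai, xi in zip(a, x)) % n


n = 257                    # a prime larger than 255 (illustrative)
x = (192, 168, 0, 1)
y = (10, 0, 0, 42)         # differs from x in the last coordinate
a_fixed = (5, 77, 200)     # a1, a2, a3 fixed arbitrarily

# Every value the left-hand side a4 * (x4 - y4) mod n can take.
lhs_values = {(a4 * (x[3] - y[3])) % n for a4 in range(n)}

# The choices of a4 for which x and y collide.
colliding_a4 = [
    a4 for a4 in range(n)
    if h_a(a_fixed + (a4,), x, n) == h_a(a_fixed + (a4,), y, n)
]
```

The left-hand side hits every residue once, so exactly one of the n choices of the last coefficient produces a collision, matching the 1/n bound.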
Bloom Filter: Supported Operations.
Fast Inserts and Lookups.
Comparison to Hash Tables.
Pros: more space efficient
Cons:
1) can’t store an associated object.
2) No deletions.
3) Small false positive probability.
(might say x has been inserted even though it hasn’t been)
Applications:
Original: early spellcheckers.
Canonical: list of forbidden passwords.
Modern: network routers
(limited memory, need to be super-fast).
Bloom Filter: Under the Hood:
Ingredients:
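1) array A of n bits.
(So $b = n/|S|$ = # of bits per object in the data set S)
2) k hash functions $h_1, \ldots, h_k$
(k = small constant)
Insert(x):
for i = 1, 2, …, k
set A[$h_i(x)$] = 1 (whether or not the bit is already set to 1)
Lookup(x): return TRUE if and only if
A[$h_i(x)$] = 1 for every i = 1,2,…,k.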
Note: no false negatives
(if x was inserted, Lookup(x) is guaranteed to succeed).
But: false positive if all k bits A[$h_i(x)$] were already set to 1 by other insertions.
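The Insert and Lookup procedures can be sketched as below. Two choices here are assumptions made for the example rather than part of the notes: the k hash functions are simulated by hashing the item together with the index i using SHA-256 from Python's `hashlib`, and the bit array is stored as one byte per bit for readability (a real filter would pack the bits). The class name `BloomFilter` and the sample passwords are made up.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits, k):
        self.n_bits = n_bits
        self.k = k
        self.bits = bytearray(n_bits)  # one byte per bit, all zero

    def _positions(self, item):
        # Simulate h_1..h_k by salting the item with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1  # set the bit whether or not it is already 1

    def lookup(self, item):
        # TRUE only if every one of the k bits is set.
        return all(self.bits[pos] for pos in self._positions(item))


forbidden = BloomFilter(n_bits=1024, k=5)
for password in ("password123", "letmein"):
    forbidden.insert(password)
```

A lookup of an inserted item always succeeds, since its k bits were set on insertion and bits are never cleared. A lookup of any other item may occasionally succeed too, which is the false positive case.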
Heuristic Analysis
Intuition: there should be a trade-off between space and error (false positive)
probability.
Assume: all $h_i(x)$'s are uniformly random and independent (across different i's and x's).
Setup: n bits, insert data set S into bloom filter.
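Note: for each bit of A, the probability it has been set to 1 is (under the above assumption):
$1 - \left(1 - \frac{1}{n}\right)^{k|S|} \approx 1 - e^{-k|S|/n} = 1 - e^{-k/b}$
where b = # of bits per object (n/|S|).
Story so far: the probability that a given bit is 1 is $\approx 1 - e^{-k/b}$.
So: under the assumption, for x not in S, the false positive probability is
$\varepsilon \approx \left(1 - e^{-k/b}\right)^k$ (the error rate),
where b = # of bits per object.
How to set k?: for fixed b, $\varepsilon$ is minimized by setting $k \approx (\ln 2) \cdot b$.
Plugging back in: $\varepsilon \approx \left(\frac{1}{2}\right)^{(\ln 2) b}$, or $b \approx 1.44 \log_2 \frac{1}{\varepsilon}$.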
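The heuristic formulas can be evaluated numerically to get a feel for the trade-off. The sketch below assumes the approximations derived above; the function names and the choice of 8 bits per object are made up for the example.

```python
import math


def false_positive_rate(b, k):
    """Heuristic error rate (1 - e^(-k/b))^k for b bits per object."""
    return (1 - math.exp(-k / b)) ** k


def best_k(b):
    """Number of hash functions suggested by k ~ (ln 2) * b, as an integer."""
    return max(1, round(math.log(2) * b))


b = 8                             # bits per object (illustrative)
k = best_k(b)                     # (ln 2) * 8 is about 5.5, rounds to 6
eps = false_positive_rate(b, k)   # roughly 2% under the heuristic
```

Under these assumptions, spending 8 bits per object already brings the estimated error rate down to about 2%, and each additional bit per object shrinks it further by a constant factor.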