Algorithms2-week4-hashtable

Hash Table: Supported Operations
Purpose: maintain a (possibly evolving) set of stuff
(transactions, people + associated data, IP addresses, etc.).
Insert: add new record.
Delete: delete existing record.
Lookup: check for a particular record (a “dictionary”)
Applications:
1. Application: De-Duplication
Given: a “stream” of objects.
(Linear scan through a huge file. Or objects arriving in real time)
Goal: remove duplicates (keep track of unique objects)
$\cdot$ report unique visitors to web site
$\cdot$ avoid duplicates in search results.
Solution: when new object x arrives
$\cdot$ lookup x in hash table H
$\cdot$ if not found, Insert x into H.
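The Lookup-then-Insert loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the name `unique_items` is hypothetical, and Python's built-in `set` stands in for the hash table H.

```python
from typing import Hashable, Iterable, Iterator


def unique_items(stream: Iterable[Hashable]) -> Iterator[Hashable]:
    """Yield each object the first time it appears in the stream."""
    seen = set()            # plays the role of the hash table H
    for x in stream:
        if x not in seen:   # Lookup x in H
            seen.add(x)     # if not found, Insert x into H
            yield x


visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
print(list(unique_items(visitors)))  # ['10.0.0.1', '10.0.0.2', '10.0.0.3']
```

Because it is a generator, the sketch also works on objects arriving one at a time, not only on a list already in memory.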
2. The 2-SUM Problem
Input: unsorted array A of n integers. Target sum t.
Goal: determine whether or not there are two numbers x, y in A with
$x+y=t$
Naive Solution: $\Theta(n^2)$ time via exhaustive search.
Better:
1.) sort A ($\Theta(n \log n)$ time)
2.) for each x in A, look for t-x in A via binary search ($\Theta(n \log n)$ time in total).
Amazing:
1.) insert elements of A into hash table H.
2.) for each x in A, Lookup t-x in H. $\Theta(n)$ time in total, assuming constant-time hash table operations.
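A hedged sketch of the hash-table approach to 2-SUM. It is a single-pass variant of the two steps above: each x is looked up before it is inserted, so x is never paired with itself (an assumption about the intended problem, since the notes do not say whether x and y must be different array entries). The name `has_two_sum` is hypothetical.

```python
from typing import Sequence


def has_two_sum(a: Sequence[int], t: int) -> bool:
    """Return True if two entries of `a` at different positions sum to t."""
    seen = set()              # hash table H of the elements seen so far
    for x in a:
        if t - x in seen:     # Lookup t - x
            return True
        seen.add(x)           # Insert x
    return False


print(has_two_sum([8, 3, 5, 1], 9))   # True  (8 + 1)
print(has_two_sum([8, 3, 5, 1], 7))   # False
print(has_two_sum([4], 8))            # False (4 may not be used twice)
```

Each element costs one Lookup and one Insert, so the expected running time is linear in n under the usual assumption of constant-time hash table operations.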
3. Further Immediate Applications
$\cdot$ Historical application : symbol tables in compilers.
$\cdot$ Blocking network traffic.
$\cdot$ Search algorithms (game tree exploration)
$\cdot \cdot$ Use hash table to avoid exploring any configuration
(arrangement of chess pieces ) more than once.
4. High-Level Idea.
Setup: universe U [all IP addresses, all names, all chessboard configurations, etc.] [generally really big]
Goal: want to maintain an evolving set $S \subseteq U$
[generally, of reasonable size].
Solution:
1.) pick n = number of buckets.
2.) choose a hash function h that takes a key as input and returns a position between $0$ and $n-1$: $h:U \rightarrow \{0,1,2,...,n-1\}$.
3.) use array A of length n, store x in A[h(x)].
About the naive solutions:
1. Array-based solution [indexed by $u \in U$]
$\cdot$ $O(1)$ operations but $\Theta(|U|)$ space.
2. List-based solution: $\Theta(|S|)$ space but $\Theta(|S|)$ Lookup.
5. Resolving Collisions.
Collision: distinct $x,y \in U$ such that $h(x) = h(y)$, i.e., the hash function returns the same position for different keys.
1.) Solution #1: (separate) chaining.
$\cdot$ keep a linked list in each bucket.
$\cdot$ given a key/object x, perform Insert/Delete/Lookup in the list in A[h(x)] (h(x): bucket for x; A[h(x)]: linked list for x).
2.) Solution #2: open addressing. (only one object per bucket)
$\cdot$ Hash function now specifies a probe sequence $h_1(x), h_2(x), ...$ (keep trying until an open slot is found).
$\cdot$ Examples: linear probing (look consecutively: bucket 17, then 18, 19, ...);
double hashing (the first hash function specifies the initial bucket to probe, the second specifies the offset for each subsequent probe).
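To make Solution #1 concrete, here is a minimal sketch of separate chaining. Everything in it is an assumption made for illustration: the class name `ChainedHashTable`, the use of Python lists in place of linked lists, and the use of the built-in `hash()` reduced mod n as the hash function h.

```python
class ChainedHashTable:
    """Hash table with separate chaining (illustrative sketch only)."""

    def __init__(self, num_buckets: int = 8) -> None:
        # One chain per bucket; Python lists stand in for linked lists.
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, key):
        # h(x) = hash(x) mod n selects the bucket; A[h(x)] is its chain.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value=None) -> None:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def lookup(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key) -> None:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
        raise KeyError(key)


table = ChainedHashTable(num_buckets=2)   # tiny table forces collisions
for i in range(10):
    table.insert(i, i * i)
print(table.lookup(3))                    # 9
```

With only 2 buckets and 10 keys every chain holds several entries, yet all operations stay correct; they just get slower as the chains grow.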
Definition: the load factor of a hash table is:
$\alpha = \frac{\#\text{ of objects in hash table}}{\#\text{ of buckets of hash table}}$
Note:
1.) $\alpha = O(1)$ is a necessary condition for operations to run in constant time.
2.) with open addressing, need $\alpha \ll 1$ (only one object per bucket, so $\alpha$ can never exceed 1).
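The following sketch illustrates open addressing with linear probing and shows why the load factor cannot exceed 1 there. It is deliberately simplified: fixed size, no deletion, `None` reserved as the empty-slot marker, and the built-in `hash()` as the hash function. The class name `LinearProbingSet` is hypothetical.

```python
class LinearProbingSet:
    """Open-addressing set with linear probing (fixed size, no deletion)."""

    def __init__(self, num_buckets: int) -> None:
        self.slots = [None] * num_buckets   # at most one object per bucket
        self.count = 0

    def _probe(self, key):
        # Probe sequence: h(x), h(x)+1, h(x)+2, ... wrapping around.
        n = len(self.slots)
        start = hash(key) % n
        for i in range(n):
            yield (start + i) % n

    def insert(self, key) -> None:
        for pos in self._probe(key):
            if self.slots[pos] is None:     # first open slot: store here
                self.slots[pos] = key
                self.count += 1
                return
            if self.slots[pos] == key:      # already present
                return
        raise RuntimeError("table is full (load factor has reached 1)")

    def contains(self, key) -> bool:
        for pos in self._probe(key):
            if self.slots[pos] is None:     # empty slot ends the search
                return False
            if self.slots[pos] == key:
                return True
        return False

    def load_factor(self) -> float:
        return self.count / len(self.slots)


s = LinearProbingSet(8)
for key in (17, 25, 33):                    # all three hash to bucket 1
    s.insert(key)
print(s.slots)          # [None, 17, 25, 33, None, None, None, None]
print(s.load_factor())  # 0.375
```

The three colliding keys end up in consecutive buckets 1, 2, 3. As the load factor approaches 1 such runs get long, which is the reason for keeping it well below 1.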
6. Pathological Data Sets
Upshot #2: for good hash table performance, need a good hash function.
Ideal: use a super-clever hash function guaranteed to spread every data set out evenly.
Problem: DOES NOT EXIST! (for every hash function, there is a pathological data set)
Reason: fix a hash function h: $U \rightarrow \{0,1,...,n-1 \}$
$\Rightarrow$ by the Pigeonhole Principle, there exists a bucket i such that at least $\frac{|U|}{n}$ elements of U hash to i under h.
$\Rightarrow$ if the data set is drawn only from these, everything collides!
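A small demonstration of the pigeonhole argument, assuming a deliberately simplistic hash function h(x) = x mod n chosen only for illustration. The helper `bucket_sizes` and the constant `N_BUCKETS` are hypothetical names.

```python
N_BUCKETS = 10


def h(x: int) -> int:
    """Overly simple hash function: x mod n (assumed for this demo)."""
    return x % N_BUCKETS


def bucket_sizes(keys, hash_fn, n):
    """Count how many keys land in each of the n buckets."""
    counts = [0] * n
    for x in keys:
        counts[hash_fn(x)] += 1
    return counts


benign = range(1000)                        # 1000 consecutive integers
pathological = range(0, 10000, N_BUCKETS)   # 1000 multiples of n

print(max(bucket_sizes(benign, h, N_BUCKETS)))        # 100
print(max(bucket_sizes(pathological, h, N_BUCKETS)))  # 1000
```

Both data sets contain 1000 keys, but the second is drawn only from keys that hash to bucket 0, so every Lookup degenerates into a scan of a single 1000-element chain.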
7. Pathological Data in the Real World.
Main Point: can paralyze several real-world systems by exploiting badly designed hash functions.
$--$ open source.
$--$ overly simplistic hash function.
(easy to reverse engineer a pathological data set)
Solutions
1. Use a cryptographic hash function (e.g., SHA-2)
$--$ infeasible to reverse engineer a pathological data set.
2. Use randomization.
$--$ design a family H of hash functions such that for all data sets S, “almost all” functions $h\in H$ spread S out “pretty evenly”.
Universal Hash Functions
Definition: Let H be a set of hash functions from U to
$\{0,1,2,...,n-1\}$ .
H is universal if and only if:
for all x,y in U (with $x\neq y$)
$Pr_{h\in H} [x, y \text{ collide}] \leq \frac{1}{n}$ (collide: $h(x)=h(y)$),
when h is chosen uniformly at random from H,
i.e., the collision probability is as small as with the "gold standard" of perfectly random hashing.
Example: Hashing IP Addresses.
Let U = IP addresses (of the form ($x_1,x_2,x_3,x_4$)), with each $x_i \in \{0,1,2,...,255 \}$
Let n = a prime (roughly a small multiple of the # of objects in the hash table)
Construction: Define one hash function $h_a$ per 4-tuple a = ($a_1,a_2,a_3,a_4$) with each $a_i \in \{0,1,2,3,...,n-1 \}$.
Define: $h_a$ : IP addrs $\rightarrow$ buckets by
$h_a(x_1,x_2,x_3,x_4) = (a_1x_1+a_2x_2+a_3x_3+a_4x_4) \bmod n$
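The construction above translates almost directly into code. The sketch below is illustrative only: the function names `random_coefficients` and `ip_hash`, the dotted-string input format, and the choice n = 257 are assumptions, not part of the notes.

```python
import random
from typing import Sequence


def random_coefficients(n: int, rng: random.Random) -> tuple:
    """Pick a = (a_1, a_2, a_3, a_4) uniformly from {0, ..., n-1}^4."""
    return tuple(rng.randrange(n) for _ in range(4))


def ip_hash(a: Sequence[int], ip: str, n: int) -> int:
    """h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + ... + a_4 x_4) mod n."""
    x = [int(part) for part in ip.split(".")]
    return sum(a_i * x_i for a_i, x_i in zip(a, x)) % n


n = 257                                    # a prime number of buckets
print(ip_hash((1, 2, 3, 4), "10.20.30.40", n))   # (10+40+90+160) % 257 = 43

a = random_coefficients(n, random.Random())      # chosen once, at table creation
print(ip_hash(a, "192.168.0.1", n))              # some bucket in 0..256
```

The coefficients are drawn once when the table is created and then kept fixed, so the same address always maps to the same bucket within that table.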
A Universal Hash Function
Define: $H=\{ h_a \mid a_1,a_2,a_3,a_4 \in \{0,1,2,...,n-1\}\}$
$h_a(x_1,x_2,x_3,x_4) = (a_1x_1+a_2x_2+a_3x_3+a_4x_4) \bmod n$
Theorem: This family is universal.
Proof (Part I)
Consider distinct IP addresses ($x_1$, $x_2$, $x_3$, $x_4$), ($y_1$, $y_2$, $y_3$, $y_4$).
Assume: $x_4 \neq y_4$ (the addresses differ in at least one coordinate; the other cases are symmetric).
Note: collision $\Leftrightarrow$
$a_1x_1+a_2x_2+a_3x_3+a_4x_4 \equiv a_1y_1+a_2y_2+a_3y_3+a_4y_4 \pmod{n}$
$\Leftrightarrow$ $a_4(x_4-y_4) \equiv \sum^3_{i=1} a_i(y_i-x_i) \pmod{n}$
Proof (Part II)
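The Story So Far: with $a_1,a_2,a_3$ fixed arbitrarily, how many choices of $a_4$ satisfy
$a_4(x_4-y_4) \equiv \sum^3_{i=1}a_i(y_i-x_i) \pmod{n}$ ?
Key Claim: the left-hand side is equally likely to be any of {0,1,2,…,n-1}.
Reason: $x_4 \neq y_4$, so $x_4-y_4 \not\equiv 0 \pmod{n}$ (this needs $x_4-y_4$ not to be a multiple of n, e.g. n > 255); n is prime and $a_4$ is uniform at random, so multiplying by $x_4-y_4$ permutes {0,1,2,…,n-1}.
$\Rightarrow$ exactly one of the n choices of $a_4$ causes a collision, so the collision probability is $\frac{1}{n}$.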
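The key claim can be checked by brute force for a concrete case. The sketch below assumes n = 257 (a prime larger than 255) and two arbitrarily chosen addresses that differ in the last coordinate; the helper name `colliding_a4` is hypothetical. For each fixed $(a_1,a_2,a_3)$ it lists every $a_4$ that makes the two addresses collide.

```python
def colliding_a4(x, y, a123, n):
    """Return every a_4 in {0, ..., n-1} with h_a(x) == h_a(y)."""
    def h(a, z):
        return sum(a_i * z_i for a_i, z_i in zip(a, z)) % n

    return [a4 for a4 in range(n)
            if h((*a123, a4), x) == h((*a123, a4), y)]


n = 257                        # prime, larger than 255
x = (192, 168, 0, 1)
y = (192, 168, 7, 200)         # differs from x in the 4th coordinate

for a123 in [(0, 0, 0), (5, 17, 211), (256, 1, 99)]:
    solutions = colliding_a4(x, y, a123, n)
    print(a123, len(solutions))    # exactly 1 solution in each case
```

One bad $a_4$ out of n possible values is exactly the $\frac{1}{n}$ collision probability that the definition of a universal family requires.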
Bloom Filter: Supported Operations.
Fast Inserts and Lookups.
Comparison to Hash Tables.
Pros: more space efficient
Cons:
1) can’t store an associated object.
2) No deletions.
3) Small false positive probability.
(might say x has been inserted even though it hasn't been)
Applications:
Original: early spellcheckers.
Canonical: list of forbidden passwords.
Modern: network routers,
$--$ Limited memory, need to be super-fast.
Bloom Filter: Under the Hood:
Ingredients:
1) array of n bits.
(So $\frac{n}{|S|}$ = # of bits per object in the data set S)
2) k hash functions $h_1, ..., h_k$ (k = small constant)
Insert(x) :
$\cdot$ for i = 1, 2, …, k
$--$ set A[ $h_i(x)$ ] = 1
Lookup(x): return TRUE $\Leftrightarrow$
A[ $h_i(x)$ ] = 1 for every i = 1,2,…,k.
Note: no false negatives
(if x was inserted, Lookup(x) is guaranteed to succeed).
But: false positive if all k bits A[ $h_i(x)$ ] were already set to 1 by other insertions.
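The Insert and Lookup procedures above fit in a few lines of Python. In this sketch the k hash functions are derived by hashing the string "i:x" with SHA-256, which is an assumption made for illustration (the notes do not prescribe how to obtain $h_1, ..., h_k$). The class name `BloomFilter` is hypothetical, and a `bytearray` with one byte per bit is used for readability rather than space efficiency.

```python
import hashlib


class BloomFilter:
    """Bloom filter over strings (illustrative sketch)."""

    def __init__(self, n_bits: int, k: int) -> None:
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits)    # A: n bits, all initially 0

    def _positions(self, x: str):
        # Assumed construction of h_i(x): SHA-256 of "i:x", reduced mod n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def insert(self, x: str) -> None:
        for pos in self._positions(x):
            self.bits[pos] = 1           # set A[h_i(x)] = 1

    def lookup(self, x: str) -> bool:
        # TRUE iff A[h_i(x)] = 1 for every i = 1, ..., k
        return all(self.bits[pos] for pos in self._positions(x))


bf = BloomFilter(n_bits=1024, k=5)
for word in ["password", "123456", "qwerty"]:
    bf.insert(word)
print(bf.lookup("qwerty"))               # True: no false negatives
```

A Lookup for a string that was never inserted usually returns False, but it may return True if other insertions happened to set all k of its bits.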
Heuristic Analysis
Intuition: should be a trade-off between space and error (false positive) probability.
Assume: all $h_i(x)$'s are uniformly random and independent.
Setup: n bits, insert data set S into bloom filter.
Note: for each bit of A, the probability it’s been set to 1 is (under the above assumption):
$1-(1-\frac{1}{n})^{k|S|} \approx 1 - e^{-\frac{k|S|}{n}}=1-e^{-\frac{k}{b}}$
(using $1-\frac{1}{n} \approx e^{-\frac{1}{n}}$ for large n), where b = # of bits per object (n/|S|).

Story so far: the probability that a given bit is 1 is $\approx 1- e^{-\frac{k}{b}}$.
So: under the assumption, for x not in S, the false positive probability is
$\approx [1-e^{-\frac{k}{b}}]^k$, the error rate $\epsilon$,
where b = # of bits per object.
How to set k?: for fixed b, $\epsilon$ is minimized by setting
$k\approx (\ln 2)\cdot b$
Plugging back in:
$\epsilon \approx (\frac{1}{2})^{(\ln 2) b}$ or $b \approx 1.44\log_2\frac{1}{\epsilon}$
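As a worked example of these formulas, the sketch below computes the bits per object b and the number of hash functions k for a target error rate, then plugs them back into the heuristic false positive estimate. The function names are hypothetical, and rounding k to the nearest integer is an added assumption since k must be a whole number.

```python
import math


def bloom_parameters(epsilon: float):
    """Return (b, k): bits per object and number of hash functions."""
    b = math.log2(1 / epsilon) / math.log(2)   # b = 1.44 * log2(1/epsilon)
    k = max(1, round(math.log(2) * b))         # k = (ln 2) * b, rounded
    return b, k


def false_positive_rate(b: float, k: int) -> float:
    """Heuristic estimate [1 - e^(-k/b)]^k from the analysis above."""
    return (1 - math.exp(-k / b)) ** k


b, k = bloom_parameters(0.01)
print(round(b, 2), k)                 # about 9.59 bits per object, k = 7
print(false_positive_rate(b, k))      # close to the 1% target
print(false_positive_rate(8, 6))      # about 2% with 8 bits per object
```

So under the heuristic assumptions, roughly 10 bits per object suffice for a 1% false positive rate, regardless of how large the objects themselves are.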
