1. Association rules
The technology focuses on common events, not rare events (the long tail).
Support for itemset I = the number of baskets containing all items in I.
Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.
1. Find the subsets (candidate itemsets).
2. Compute their support.
Applications
1. Items that appear together too often could represent plagiarism.
2. Unusual words appearing together in a large number of documents.
Stop words: very common words, e.g. "he", "is", "a".
TF/IDF : term frequency inverse document frequency. Relates the number of times a word appears to the number of documents in which it appears.
Because association rules are hard to find directly, find frequent itemsets first.
Use k nested loops to generate all sets of size k.
The true cost of mining disk-resident data is usually the number of disk I/Os.
Pass: one complete read of all baskets in turn.
Main memory bottleneck
1. As we read baskets, we need to count something.
2. The number of different things we can count is limited by main memory.
3. Swapping counts in/out is a disaster.
Naïve algorithm
From each basket of n items, generate its n(n-1)/2 pairs by two nested loops.
Fails if (#items)^2 exceeds main memory.
Suppose 10^5 items. Suppose counts are 4-byte integers.
Number of pairs of items: 10^5 * (10^5 - 1)/2 ≈ 5*10^9.
Therefore about 2*10^10 bytes (20 GB) of main memory would be needed.
1. Count all pairs, using a triangular matrix. Requires only 4 bytes per pair; the matrix is stored as a one-dimensional array.
Find pair {i, j}, with 1 <= i < j <= n, at position (i-1)(n - i/2) + j - i.
Total number of pairs n(n-1)/2; total bytes about 2n^2.
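To make the indexing concrete, here is a small Python sketch of this scheme (the helper names and the 1..n item numbering are illustrative, not from the source):

from itertools import combinations

def triangle_index(i, j, n):
    # 1-based position of pair {i, j}, 1 <= i < j <= n: (i-1)(n - i/2) + j - i.
    return (i - 1) * (2 * n - i) // 2 + j - i

def count_pairs(baskets, n):
    # One count per pair, laid out in a single one-dimensional array.
    counts = [0] * (n * (n - 1) // 2)
    for basket in baskets:
        for i, j in combinations(sorted(set(basket)), 2):
            counts[triangle_index(i, j, n) - 1] += 1
    return counts

baskets = [[1, 2, 3], [2, 3, 4], [1, 2]]
print(count_pairs(baskets, 4))  # [2, 1, 0, 2, 1, 1]; pair {2,3} sits at position 4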
2. Keep a table of triples [i, j, c]. Requires 12 bytes per pair, but only for those pairs with count > 0.
Total bytes used is about 12p, where p is the number of pairs that actually occur.
Beats triangular matrix if at most 1/3 of possible pairs actually occur.
A-Priori algorithm: a two-pass approach. Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
Pass 1: read baskets and count in main memory the occurrences of each item. Items that appear at least s times are the frequent items.
Pass 2: read baskets again and count in main memory only those pairs both of whose members were found in pass 1 to be frequent, using a triangular matrix.
Frequent triples and beyond: C1 = all items; in general, Lk = members of Ck with support >= s, and C(k+1) = (k+1)-sets, every k-subset of which is in Lk.
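A minimal Python sketch of the two passes for k = 2 (the function name and toy data are illustrative):

from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # Pass 1: count individual items; L1 = items with support >= s.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only candidate pairs C2, i.e. pairs of two frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in set(basket) if i in frequent)
        pair_counts.update(combinations(kept, 2))
    return {p: c for p, c in pair_counts.items() if c >= s}  # L2

baskets = [["milk", "bread"], ["milk", "beer"], ["milk", "bread", "beer"]]
print(apriori_pairs(baskets, 2))  # {('bread', 'milk'): 2, ('beer', 'milk'): 2}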
2. Improvements to A-Priori
Hash-based filtering
S = a set of one billion strings of length 10.
We have 1 GB of main memory, i.e. 8 billion bits.
Create a bit array of 8 billion bits, initially all 0.
Choose a hash function h with range [0, 8*10^9) and hash each member of S to one of the bits, setting that bit to 1.
Filter the file F of strings by hashing each string and outputting only those that hash to a 1.
At most 1/8 of the bit array is 1, so at most 1/8 of the strings not in S are (unluckily) output; these are false positives, while members of S are never lost.
Repeating the filter with a fresh hash function and bit array cuts the false positives by a factor of up to 8 each time; each filter step costs one pass through the remaining file F.
Repeat passes until few false positives remain.
Throwing darts
If we throw k darts at n equally likely targets, the probability that a given target gets at least one dart is 1 - (1 - 1/n)^k ≈ 1 - e^(-k/n).
Superimposed codes (Bloom filters)
We could use two hash functions, and hash each member of S to two bits of the bit array.
Around 1/4 of the array is 1's (2*10^9 darts thrown at 8*10^9 targets; by the darts formula the fraction is 1 - e^(-1/4) ≈ 0.22).
The more hash functions, the smaller the probability of a false positive, up to a point.
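A toy sketch of the filter; the array size and the salted hash functions stand in for the 8-billion-bit array and truly independent hashes (k = 1 gives the single-hash filter above):

N_BITS = 1000  # toy stand-in for 8 * 10**9 bits

def hashes(x, k):
    # k independent-ish hash values for x, derived by salting Python's hash.
    return [hash((seed, x)) % N_BITS for seed in range(k)]

def build_filter(members, k):
    bits = [0] * N_BITS
    for x in members:
        for b in hashes(x, k):
            bits[b] = 1
    return bits

def maybe_member(bits, x, k):
    # True for every member; occasionally True for a non-member (false positive).
    return all(bits[b] for b in hashes(x, k))

bits = build_filter(["apple", "pear"], k=2)
print(maybe_member(bits, "apple", 2))   # True: no false negatives
print(maybe_member(bits, "banana", 2))  # almost certainly False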
PCY algorithm
During pass 1 of A-Priori, most memory is idle. Use that memory to keep counts of buckets into which pairs of items are hashed.
Pairs of items are generated from the input file anyway; we need to know whether a pair can be present at least s times.
On pass 2, we only count pairs that hash to frequent buckets.
for (each basket) {
    for (each item in the basket)
        add 1 to the item's count;
    for (each pair of items in the basket) {
        hash the pair to a bucket;
        add 1 to the count for that bucket;
    }
}
A bucket that a frequent pair hashes to is surely frequent.
Even without any frequent pair, a bucket can be frequent.
But in the best case, a bucket's count is less than the support s; then no pair hashing to that bucket can be frequent, and all such pairs are eliminated as candidates.
Bitmap: replace the bucket counts by a bit vector (1 if the bucket's count >= s). 4-byte integers are replaced by bits, so the bit vector requires 1/32 of the memory. Also, decide which items are frequent and list them for the second pass.
Pass 2: count all pairs {i, j} that meet the conditions for being a candidate pair.
- Both i and j are frequent items.
- The pair {i,j} hashes to a bucket number whose bit in the bit vector is 1.
Memory details: buckets require a few bytes each. On the second pass, a table of (item, item, count) triples is essential: because candidate pairs are generated dynamically as baskets are read, the fixed-layout triangular matrix cannot be used.
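A compact PCY sketch covering both passes; the bucket count and hash function are illustrative, and the pass-2 Counter plays the role of the (item, item, count) triples table:

from collections import Counter
from itertools import combinations

N_BUCKETS = 10007  # illustrative; in practice, as many as memory allows

def bucket(pair):
    return hash(pair) % N_BUCKETS

def pcy(baskets, s):
    # Pass 1: count items, and hash each pair to a bucket, counting the bucket.
    item_counts, bucket_counts = Counter(), Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair)] += 1
    frequent = {i for i, c in item_counts.items() if c >= s}
    bitmap = {b for b, c in bucket_counts.items() if c >= s}  # frequent buckets
    # Pass 2: count a pair only if both items are frequent AND
    # the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent)
        for pair in combinations(items, 2):
            if bucket(pair) in bitmap:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}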
Multistage algorithm
After pass 1 of PCY, rehash (with a second hash function) only those pairs that qualify for pass 2 of PCY. On this middle pass, fewer pairs contribute to buckets, so there are fewer false positives, i.e. frequent buckets with no frequent pair.
Pass 3: count only those pairs {i, j} that satisfy these candidate pair conditions.
- Both i and j are frequent items.
- Using the first hash function, the pair hashes to a bucket whose bit in the first bit-vector is 1.
- Using the second hash function, the pair hashes to a bucket whose bit in the second bit-vector is 1.
The two hash functions have to be independent, and we need to check both hashes on the third pass.
Multihash
Use several independent hash tables on the first pass.
Either multistage or multihash can use more than two hash functions.
The bitmaps in multihash are smaller than in multistage, since the hash tables split the first pass's memory and so have fewer buckets each.
Too many hash functions makes all bucket counts reach s or more, because each table becomes too small to separate pairs.
A-Priori, PCY, etc. take k passes to find frequent itemsets of size k.
Simple algorithm
Take a random sample of the market baskets.
Run A-Priori or one of its improvements in main memory.
Use as your support threshold a suitably scaled-back number, e.g. s/100 if the sample is 1% of the baskets.
SON algorithm
Repeatedly read small subsets of the baskets into main memory and perform the first pass of the simple algorithm on each subset. An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets.
On a second pass, count all the candidate itemsets and determine which are frequent in the entire set.
Key monotonicity idea : an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset.
If baskets are distributed among many nodes, compute frequent itemsets at each node, then distribute the candidates from each node.
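A sketch of SON over in-memory chunks, standing in for distributed nodes; it reuses the apriori_pairs sketch above, and the chunking scheme is illustrative:

from collections import Counter
from itertools import combinations

def son_pairs(baskets, s, n_chunks):
    chunks = [baskets[i::n_chunks] for i in range(n_chunks)]
    # First pass: anything frequent in some chunk, at the scaled-down
    # threshold, becomes a candidate; monotonicity rules out false negatives.
    candidates = set()
    for chunk in chunks:
        candidates |= set(apriori_pairs(chunk, max(1, s // n_chunks)))
    # Second pass: count every candidate over the entire data set.
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            if pair in candidates:
                counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= s}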
Toivonen’s Algorithm
Start as in the simple algorithm, but lower the threshold slightly for the sample.
If the sample is 1% of the baskets, use s/125 as the support threshold rather than s/100.
Goal is to avoid missing any itemset that is frequent in the full set of baskets.
Add to the itemsets that are frequent in the sample the negative border of these itemsets.
An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are.
Negative border : ABCD is in the negative border if and only if
- It is not frequent in the sample
- All of ABC, BCD, ACD and ABD are.
A singleton like {A} is in the negative border if and only if it is not frequent in the sample, because its only immediate subset, the empty set, is always frequent.
In a second pass, count all candidate frequent itemsets from the first pass, and also count their negative border. If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.
If there is an itemset that is frequent in the whole, but not frequent in the sample, then there is a member of the negative border for the sample that is frequent in the whole.
Proof : suppose not
- There is an itemset S frequent in the whole but not frequent in the sample.
- Nothing in the negative border is frequent in the whole.
Let T be a smallest subset of S that is not frequent in the sample. T is frequent in the whole (S is frequent, plus monotonicity). T is in the negative border (every immediate subset of T is frequent in the sample; otherwise T would not be a smallest such subset).
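A small helper that computes the negative border directly from the definition (names and toy data are illustrative):

def negative_border(frequent, all_items):
    # `frequent` = itemsets frequent in the sample, as an iterable of sets.
    freq = {frozenset(f) for f in frequent}
    # Singletons: in the border iff not frequent (empty set is always frequent).
    border = {frozenset([a]) for a in all_items if frozenset([a]) not in freq}
    # Any larger border set extends some frequent set by one item.
    for s in freq:
        for a in all_items - s:
            t = s | {a}
            if t not in freq and all(t - {x} in freq for x in t):
                border.add(t)
    return border

freq = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
print(negative_border(freq, {"A", "B", "C", "D"}))
# {"D"} (infrequent singleton) and {"A","B","C"} (all immediate subsets frequent)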
Compacting the output
1. Maximal frequent itemsets : no immediate superset is frequent.
2. Closed itemsets : no immediate superset has the same count(>0).
3. Finding similar sets
An essential technique for finding similar documents.
1. Shingling
2. Minhashing: compress large sets into small signatures while preserving similarity.
3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.
Shingles
A k-shingle for a document is a sequence of k consecutive characters that appears in the document.
You must pick k large enough, or most documents will have most shingles.
Why hashing 9-shingles to 4 bytes beats using 4-shingles directly:
- 4-shingles can in principle represent 2^32 values, but since each byte is a character, most of that space is wasted (real text uses only a small fraction of the possible byte values).
- A good hash function spreads the 9-shingles nearly uniformly over the 2^32 buckets.
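A minimal shingling sketch, hashing 9-shingles to 32-bit tokens (CRC32 is just an example hash):

import zlib

def shingles(doc, k=9):
    # The set of k-shingles, each hashed to a 4-byte token.
    return {zlib.crc32(doc[i:i + k].encode()) for i in range(len(doc) - k + 1)}

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(len(a & b) / len(a | b))  # Jaccard similarity of the two shingle sets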
Minhashing
Jaccard similarity: Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2| = a / (a + b + c).
Rows: elements of the universal set.
Columns: sets, i.e. documents.
False negatives or false positives can occur.
Signatures : hash each column c to a small signature sig(c)
To produce a signature of length n, n hash functions are needed (each simulating a random permutation of the rows).
True random permutations are impractical: they are hard to generate, take too much space to store, and accessing rows in permuted order is slow.
for each row r
    for each column c
        if c has 1 in row r
            for each hash function h_i do
                if h_i(r) is a smaller value than M(i, c) then
                    M(i, c) := h_i(r);
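A Python rendering of this pseudocode, simulating the permutations with random linear hash functions (a common implementation choice, assumed here):

import random

def minhash_signatures(columns, n_hashes, n_rows, prime=2**31 - 1):
    # M[i][c] = min over rows r having a 1 in column c of h_i(r).
    random.seed(0)
    coeffs = [(random.randrange(1, prime), random.randrange(prime))
              for _ in range(n_hashes)]  # h_i(r) = (a*r + b) % prime
    M = [[float("inf")] * len(columns) for _ in range(n_hashes)]
    for r in range(n_rows):
        hashed = [(a * r + b) % prime for a, b in coeffs]
        for c, col in enumerate(columns):
            if r in col:  # column c has 1 in row r
                for i in range(n_hashes):
                    if hashed[i] < M[i][c]:
                        M[i][c] = hashed[i]
    return M

cols = [{0, 2, 3}, {0, 2, 4}, {1, 5}]  # columns as sets of rows holding a 1
sig = minhash_signatures(cols, n_hashes=100, n_rows=6)
agree = sum(sig[i][0] == sig[i][1] for i in range(100)) / 100
print(agree)  # estimates Jaccard(cols[0], cols[1]) = 2/4 = 0.5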
Locality sensitive hashing
Even if all signatures fit in main memory, comparing them pairwise is expensive.
With 10^6 columns, about 5*10^11 column comparisons are needed.
Example: 100,000 columns with signatures of 100 integers each (40 MB total); we want pairs that are 80% similar; the 5,000,000,000 pairs of signatures can take a while to compare, so choose 20 bands of 5 integers per band.
Suppose C1 and C2 are 80% similar:
- Probability they agree in one particular band: (0.8)^5 = 0.328.
- Probability they agree in no band at all: (1 - 0.328)^20 = 0.00035, the false-negative rate.
Suppose C1 and C2 are 40% similar:
- Probability they agree in one particular band: (0.4)^5 ≈ 0.01.
- Probability they agree in at least one band: at most 20 * 0.01 = 0.2, the false-positive rate.
(The less similar a pair, the smaller its chance of becoming a candidate.)
Trade-off: using 15 bands of 5 rows instead of 20 reduces false positives but increases false negatives.
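A sketch of the banding step (bucketing each band's slice of the signature), followed by the two probabilities worked out above; the bucketing-by-tuple trick is an implementation choice:

from collections import defaultdict

def lsh_candidates(signatures, bands, rows_per_band):
    # signatures[c] is the signature (a list of ints) of column c.
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for c, sig in enumerate(signatures):
            chunk = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[chunk].append(c)  # identical band slices share a bucket
        for cols in buckets.values():
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates

print(1 - (1 - 0.8 ** 5) ** 20)  # ~0.99965: 80%-similar pairs are rarely missed
print(1 - (1 - 0.4 ** 5) ** 20)  # ~0.186: some 40%-similar pairs become candidates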
4. Applications of LSH
Represent a fingerprint by the set of positions of minutiae.
Assume 20% of grid squares hold a minutia, and that two prints of the same finger agree on a given minutia square with probability at least 80%.
Probability that two different fingerprints match on a chosen set of 3 grid squares: (0.2)^6 = 0.000064.
Probability that two prints of the same finger match on those 3 squares: ((0.2)(0.8))^3 = 0.004096.
With 1024 chosen sets of 3 squares, same finger detected: 1 - (1 - 0.004096)^1024 ≈ 0.985 (about 1.5% false negatives).
Different fingers falsely matched: 1 - (1 - 0.000064)^1024 ≈ 0.063 (6.3% false positives).
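These figures can be reproduced directly:

p_same = (0.2 * 0.8) ** 3  # same finger matches on one triple: 0.004096
p_diff = 0.2 ** 6          # different fingers match on one triple: 0.000064
print(1 - (1 - p_same) ** 1024)  # ~0.985, i.e. ~1.5% false negatives
print(1 - (1 - p_diff) ** 1024)  # ~0.063, i.e. 6.3% false positives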
5. Theory of LSH
Euclidean distance: the space has some number of real-valued dimensions and dense points.
Non-Euclidean distance: based on properties of points rather than their location in a space.
Distance measure
1. d(x,y) >= 0
2. d(x,y) = 0 iff x = y
3. d(x,y) = d(y,x)
4. d(x,y) <= d(x,z) + d(z,y) (triangle inequality)
L2 norm = the square root of the sum of the squares of the differences between x and y in each dimension (the conventional distance).
L1 norm = sum of the differences in each dimension. (manhattan distance)
L∞ norm = the maximum of the differences between x and y in any dimension.
Jaccard distance = 1 - Jaccard similarity.
Cosine distance
Two points' vectors make an angle, whose cosine is the normalized dot product of the vectors:
cos θ = (p1 · p2) / (|p1||p2|)
Edit distance
Number of inserts and deletes of characters needed to turn one string into the other.
d(x,y) = |x| + |y| - 2|LCS(x,y)|
LCS = longest common subsequence = any longest string obtained both by deleting from x and deleting from y.
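A direct implementation of this formula, counting inserts and deletes via the LCS:

from functools import lru_cache

def edit_distance(x, y):
    # Insert/delete-only edit distance: |x| + |y| - 2|LCS(x, y)|.
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == len(x) or j == len(y):
            return 0
        if x[i] == y[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return len(x) + len(y) - 2 * lcs(0, 0)

print(edit_distance("abcde", "acfdeg"))  # LCS = "acde", so 5 + 6 - 8 = 3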
Variant edit distance
Allow insert, delete, and mutate (replace one character with another).
Hamming distance
Number of positions in which bit-vectors differ.
Families of hash function
h(x) = h(y) means h says x and y are equal.
A family H of hash functions is (d1, d2, p1, p2)-sensitive if for any two points x and y:
- if d(x, y) <= d1, then Prob[h(x) = h(y)] >= p1;
- if d(x, y) >= d2, then Prob[h(x) = h(y)] <= p2.
(d1 is the smaller distance, p1 the larger probability.)
For minhashing, Prob[h(x) = h(y)] = 1 - d(x, y), where d is Jaccard distance.
H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
For jaccard similarity, minhashing gives us a (d1, d2, (1-d1), (1-d2))-sensitive family for any d1<d2.
The theory leaves unknown what happens to pairs that are at distance between d1 and d2.