1. Association rules
The technology focuses on common events, not rare events (the long tail).
Support for itemset I = the number of baskets containing all items in I.
Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.
1. Find the subsets (candidate itemsets).
2. Compute their support.
Applications
1. Items that appear together too often could represent plagiarism.
2. Unusual words appearing together in a large number of documents.
Stop words: very common words, e.g. "he", "is", "a".
TF/IDF : term frequency inverse document frequency. Relates the number of times a word appears to the number of documents in which it appears.
Because association rules are hard to find directly, find frequent itemsets first.
Use k nested loops to generate all sets of size k.
The true cost of mining disk-resident data is usually the number of disk I/Os.
Pass: one complete read of all baskets in turn.
Main memory bottleneck
1. As we read baskets, we need to count something.
2. The number of different things we can count is limited by main memory.
3. Swapping counts in/out is a disaster.
Naïve algorithm
From each basket of n items, generate its n(n-1)/2 pairs by two nested loops.
Fails if (#items)^2 exceeds main memory.
Suppose 10^5 items. Suppose counts are 4-byte integers.
Number of pairs of items: 10^5 * (10^5 - 1)/2 ≈ 5*10^9.
Therefore about 2*10^10 bytes (20 GB) of main memory would be needed.
1. Count all pairs, using a triangular matrix. Requires only 4 bytes per pair; the matrix is stored as a one-dimensional array.
Find pair {i, j}, with 1 <= i < j <= n, at position (i-1)(n - i/2) + j - i.
Total number of pairs n(n-1)/2; total bytes about 2n^2.
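To make the indexing concrete, here is a small Python sketch of this scheme (the helper names and the 1..n item numbering are illustrative, not from the source):

from itertools import combinations

def triangle_index(i, j, n):
    # 1-based position of pair {i, j}, 1 <= i < j <= n: (i-1)(n - i/2) + j - i.
    return (i - 1) * (2 * n - i) // 2 + j - i

def count_pairs(baskets, n):
    # One count per pair, laid out in a single one-dimensional array.
    counts = [0] * (n * (n - 1) // 2)
    for basket in baskets:
        for i, j in combinations(sorted(set(basket)), 2):
            counts[triangle_index(i, j, n) - 1] += 1
    return counts

baskets = [[1, 2, 3], [2, 3, 4], [1, 2]]
print(count_pairs(baskets, 4))  # [2, 1, 0, 2, 1, 1]; pair {2,3} sits at position 4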
2. Keep a table of triples [i, j, c]. Requires 12 bytes per pair, but only for those pairs with count > 0.
Total bytes used is about 12p, where p is the number of pairs that actually occur.
Beats triangular matrix if at most 1/3 of possible pairs actually occur.
A-Priori algorithm: a two-pass approach. Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
Pass 1: read baskets and count in main memory the occurrences of each item. Items that appear at least s times are the frequent items.
Pass 2: read baskets again and count in main memory only those pairs both of whose members were found in pass 1 to be frequent, using a triangular matrix.
Frequent triples and beyond: C1 = all items; in general, Lk = members of Ck with support >= s, and C(k+1) = (k+1)-sets, every k-subset of which is in Lk.
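A minimal Python sketch of the two passes for k = 2 (the function name and toy data are illustrative):

from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # Pass 1: count individual items; L1 = items with support >= s.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only candidate pairs C2, i.e. pairs of two frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in set(basket) if i in frequent)
        pair_counts.update(combinations(kept, 2))
    return {p: c for p, c in pair_counts.items() if c >= s}  # L2

baskets = [["milk", "bread"], ["milk", "beer"], ["milk", "bread", "beer"]]
print(apriori_pairs(baskets, 2))  # {('bread', 'milk'): 2, ('beer', 'milk'): 2}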
2. Improvements to A-Priori
Hash-based filtering
S = a set of one billion strings of length 10.
We have 1 GB of main memory, i.e. 8 billion bits.
Create a bit array of 8 billion bits, initially all 0.
Choose a hash function h with range [0, 8*10^9) and hash each member of S to one of the bits, setting that bit to 1.
Filter the file F of strings by hashing each string and outputting only those that hash to a 1.
At most 1/8 of the bit array is 1, so at most 1/8 of the strings not in S are (unluckily) output; these are false positives, while members of S are never lost.
Repeating the filter with a fresh hash function and bit array cuts the false positives by a factor of up to 8 each time; each filter step costs one pass through the remaining file F.
Repeat passes until few false positives remain.
Throwing darts
If we throw k darts at n equally likely targets, the probability that a given target gets at least one dart is 1 - (1 - 1/n)^k ≈ 1 - e^(-k/n).
Superimposed codes (Bloom filters)
We could use two hash functions, and hash each member of S to two bits of the bit array.
Around 1/4 of the array is 1's (2*10^9 darts thrown at 8*10^9 targets; by the darts formula the fraction is 1 - e^(-1/4) ≈ 0.22).
The more hash functions, the smaller the probability of a false positive, up to a point.
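A toy sketch of the filter; the array size and the salted hash functions stand in for the 8-billion-bit array and truly independent hashes (k = 1 gives the single-hash filter above):

N_BITS = 1000  # toy stand-in for 8 * 10**9 bits

def hashes(x, k):
    # k independent-ish hash values for x, derived by salting Python's hash.
    return [hash((seed, x)) % N_BITS for seed in range(k)]

def build_filter(members, k):
    bits = [0] * N_BITS
    for x in members:
        for b in hashes(x, k):
            bits[b] = 1
    return bits

def maybe_member(bits, x, k):
    # True for every member; occasionally True for a non-member (false positive).
    return all(bits[b] for b in hashes(x, k))

bits = build_filter(["apple", "pear"], k=2)
print(maybe_member(bits, "apple", 2))   # True: no false negatives
print(maybe_member(bits, "banana", 2))  # almost certainly False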
PCY algorithm
During pass 1 of A-Priori, most memory is idle. Use that memory to keep counts of buckets into which pairs of items are hashed.
Pairs of items are generated from the input file anyway; we need to know whether a pair can be present at least s times.
On pass 2, we only count pairs that hash to frequent buckets.
for (each basket) {
    for (each item in the basket)
        add 1 to the item's count;
    for (each pair of items in the basket) {
        hash the pair to a bucket;
        add 1 to the count for that bucket;
    }
}
A bucket that a frequent pair hashes to is surely frequent.
Even without any frequent pair, a bucket can be frequent.
But in the best case, a bucket's count is less than the support s; then no pair hashing to that bucket can be frequent, and all such pairs are eliminated as candidates.
Bitmap: replace the bucket counts by a bit vector (1 if the bucket's count >= s). 4-byte integers are replaced by bits, so the bit vector requires 1/32 of the memory. Also, decide which items are frequent and list them for the second pass.
Pass 2: count all pairs {i, j} that meet the conditions for being a candidate pair.
- Both i and j are frequent items.
- The pair {i,j} hashes to a bucket number whose bit in the bit vector is 1.
Memory details: buckets require a few bytes each. On the second pass, a table of (item, item, count) triples is essential: because candidate pairs are generated dynamically as baskets are read, the fixed-layout triangular matrix cannot be used.
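A compact PCY sketch covering both passes; the bucket count and hash function are illustrative, and the pass-2 Counter plays the role of the (item, item, count) triples table:

from collections import Counter
from itertools import combinations

N_BUCKETS = 10007  # illustrative; in practice, as many as memory allows

def bucket(pair):
    return hash(pair) % N_BUCKETS

def pcy(baskets, s):
    # Pass 1: count items, and hash each pair to a bucket, counting the bucket.
    item_counts, bucket_counts = Counter(), Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair)] += 1
    frequent = {i for i, c in item_counts.items() if c >= s}
    bitmap = {b for b, c in bucket_counts.items() if c >= s}  # frequent buckets
    # Pass 2: count a pair only if both items are frequent AND
    # the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent)
        for pair in combinations(items, 2):
            if bucket(pair) in bitmap:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}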
Multistage algorithm
After pass 1 of PCY, rehash (with a second hash function) only those pairs that qualify for pass 2 of PCY. On this middle pass, fewer pairs contribute to buckets, so there are fewer false positives, i.e. frequent buckets with no frequent pair.
Pass 3: count only those pairs {i, j} that satisfy these candidate pair conditions.
- Both i and j are frequent items.
- Using the first hash function, the pair hashes to a bucket whose bit in the first bit-vector is 1.
- Using the second hash function, the pair hashes to a bucket whose bit in the second bit-vector is 1.
The two hash functions have to be independent, and we need to check both hashes on the third pass.
Multihash
Use several independent hash tables on the first pass.
Either multistage or multihash can use more than two hash functions.
The bitmaps in multihash are smaller than in multistage, since the hash tables split the first pass's memory and so have fewer buckets each.
Too many hash functions makes all bucket counts reach s or more, because each table becomes too small to separate pairs.
A-Priori, PCY, etc. take k passes to find frequent itemsets of size k.
Simple algorithm
Take a random sample of the market baskets.
Run A-Priori or one of its improvements in main memory.
Use as your support threshold a suitably scaled-back number, e.g. s/100 if the sample is 1% of the baskets.
SON algorithm
Repeatedly read small subsets of the baskets into main memory and perform the first pass of the simple algorithm on each subset. An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets.
On a second pass, count all the candidate itemsets and determine which are frequent in the entire set.
Key monotonicity idea : an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset.
If baskets are distributed among many nodes, compute frequent itemsets at each node, then distribute the candidates from each node.
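A sketch of SON over in-memory chunks, standing in for distributed nodes; it reuses the apriori_pairs sketch above, and the chunking scheme is illustrative:

from collections import Counter
from itertools import combinations

def son_pairs(baskets, s, n_chunks):
    chunks = [baskets[i::n_chunks] for i in range(n_chunks)]
    # First pass: anything frequent in some chunk, at the scaled-down
    # threshold, becomes a candidate; monotonicity rules out false negatives.
    candidates = set()
    for chunk in chunks:
        candidates |= set(apriori_pairs(chunk, max(1, s // n_chunks)))
    # Second pass: count every candidate over the entire data set.
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            if pair in candidates:
                counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= s}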
Toivonen’s Algorithm
Start as in the simple algorithm, but lower the threshold slightly for the sample.
If the sample is 1% of the baskets, use s/125 as the support threshold rather than s/100.
Goal is to avoid missing any itemset that is frequent in the full set of baskets.
Add to the itemsets that are frequent in the sample the negative border of these itemsets.
An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are.
Negative border : ABCD is in the negative border if and only if
- It is not frequent in the sample
- All of ABC, BCD, ACD and ABD are.
A singleton like {A} is in the negative border if and only if it is not frequent in the sample, because its only immediate subset, the empty set, is always frequent.
In a second pass, count all candidate frequent itemsets from the first pass, and also count their negative border. If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.
If there is an itemset that is frequent in the whole, but not frequent in the sample, then there is a member of the negative border for the sample that is frequent in the whole.
Proof : suppose not
- There is an itemset S frequent in the whole but not frequent in the sample.
- Nothing in the negative border is frequent in the whole.
Let T be a smallest subset of S that is not frequent in the sample. T is frequent in the whole (S is frequent, plus monotonicity). T is in the negative border (every immediate subset of T is frequent in the sample; otherwise T would not be a smallest such subset).
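A small helper that computes the negative border directly from the definition (names and toy data are illustrative):

def negative_border(frequent, all_items):
    # `frequent` = itemsets frequent in the sample, as an iterable of sets.
    freq = {frozenset(f) for f in frequent}
    # Singletons: in the border iff not frequent (empty set is always frequent).
    border = {frozenset([a]) for a in all_items if frozenset([a]) not in freq}
    # Any larger border set extends some frequent set by one item.
    for s in freq:
        for a in all_items - s:
            t = s | {a}
            if t not in freq and all(t - {x} in freq for x in t):
                border.add(t)
    return border

freq = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
print(negative_border(freq, {"A", "B", "C", "D"}))
# {"D"} (infrequent singleton) and {"A","B","C"} (all immediate subsets frequent)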
Compacting the output
1. Maximal frequent itemsets : no immediate superset is frequent.
2. Closed itemsets : no immediate superset has the same count(>0).
3. Finding similar sets
An essential technique for finding similar documents.
1. Shingling
2. Minhashing: compress large sets into small signatures while preserving similarity.
3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.
Shingles
A k-shingle for a document is a sequence of k consecutive characters that appears in the document.
You must pick k large enough, or most documents will have most shingles.
Why hashing 9-shingles to 4 bytes beats using 4-shingles directly:
- 4-shingles can in principle represent 2^32 values, but since each byte is a character, most of that space is wasted (real text uses only a small fraction of the possible byte values).
- A good hash function spreads the 9-shingles nearly uniformly over the 2^32 buckets.
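A minimal shingling sketch, hashing 9-shingles to 32-bit tokens (CRC32 is just an example hash):

import zlib

def shingles(doc, k=9):
    # The set of k-shingles, each hashed to a 4-byte token.
    return {zlib.crc32(doc[i:i + k].encode()) for i in range(len(doc) - k + 1)}

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(len(a & b) / len(a | b))  # Jaccard similarity of the two shingle sets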
Minhashing
Jaccard similarity: Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2| = a / (a + b + c).
Rows: elements of the universal set.
Columns: sets, i.e. documents.
False negatives or false positives can occur.
Signatures : hash each column c to a small signature sig(c)
To produce a signature of length n, n hash functions are needed (each simulating a random permutation of the rows).
True random permutations are impractical: they are hard to generate, take too much space to store, and accessing rows in permuted order is slow.
for each row r
    for each column c
        if c has 1 in row r
            for each hash function h_i do
                if h_i(r) is a smaller value than M(i, c) then
                    M(i, c) := h_i(r);
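A Python rendering of this pseudocode, simulating the permutations with random linear hash functions (a common implementation choice, assumed here):

import random

def minhash_signatures(columns, n_hashes, n_rows, prime=2**31 - 1):
    # M[i][c] = min over rows r having a 1 in column c of h_i(r).
    random.seed(0)
    coeffs = [(random.randrange(1, prime), random.randrange(prime))
              for _ in range(n_hashes)]  # h_i(r) = (a*r + b) % prime
    M = [[float("inf")] * len(columns) for _ in range(n_hashes)]
    for r in range(n_rows):
        hashed = [(a * r + b) % prime for a, b in coeffs]
        for c, col in enumerate(columns):
            if r in col:  # column c has 1 in row r
                for i in range(n_hashes):
                    if hashed[i] < M[i][c]:
                        M[i][c] = hashed[i]
    return M

cols = [{0, 2, 3}, {0, 2, 4}, {1, 5}]  # columns as sets of rows holding a 1
sig = minhash_signatures(cols, n_hashes=100, n_rows=6)
agree = sum(sig[i][0] == sig[i][1] for i in range(100)) / 100
print(agree)  # estimates Jaccard(cols[0], cols[1]) = 2/4 = 0.5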
Locality sensitive hashing
Even if all signatures fit in main memory, comparing them pairwise is expensive.
With 10^6 columns, about 5*10^11 column comparisons are needed.
Example: 100,000 columns with signatures of 100 integers each (40 MB total); we want pairs that are 80% similar; the 5,000,000,000 pairs of signatures can take a while to compare, so choose 20 bands of 5 integers per band.
Suppose C1 and C2 are 80% similar:
- Probability they agree in one particular band: (0.8)^5 = 0.328.
- Probability they agree in no band at all: (1 - 0.328)^20 = 0.00035, the false-negative rate.
Suppose C1 and C2 are 40% similar:
- Probability they agree in one particular band: (0.4)^5 ≈ 0.01.
- Probability they agree in at least one band: at most 20 * 0.01 = 0.2, the false-positive rate.
(The less similar a pair, the smaller its chance of becoming a candidate.)
Trade-off: using 15 bands of 5 rows instead of 20 reduces false positives but increases false negatives.
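A sketch of the banding step (bucketing each band's slice of the signature), followed by the two probabilities worked out above; the bucketing-by-tuple trick is an implementation choice:

from collections import defaultdict

def lsh_candidates(signatures, bands, rows_per_band):
    # signatures[c] is the signature (a list of ints) of column c.
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for c, sig in enumerate(signatures):
            chunk = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[chunk].append(c)  # identical band slices share a bucket
        for cols in buckets.values():
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates

print(1 - (1 - 0.8 ** 5) ** 20)  # ~0.99965: 80%-similar pairs are rarely missed
print(1 - (1 - 0.4 ** 5) ** 20)  # ~0.186: some 40%-similar pairs become candidates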
4. Applications of LSH
Represent a fingerprint by the set of positions of minutiae.
Assume 20% of grid squares hold a minutia, and that two prints of the same finger agree on a given minutia square with probability at least 80%.
Probability that two different fingerprints match on a chosen set of 3 grid squares: (0.2)^6 = 0.000064.
Probability that two prints of the same finger match on those 3 squares: ((0.2)(0.8))^3 = 0.004096.
With 1024 chosen sets of 3 squares, same finger detected: 1 - (1 - 0.004096)^1024 ≈ 0.985 (about 1.5% false negatives).
Different fingers falsely matched: 1 - (1 - 0.000064)^1024 ≈ 0.063 (6.3% false positives).
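These figures can be reproduced directly:

p_same = (0.2 * 0.8) ** 3  # same finger matches on one triple: 0.004096
p_diff = 0.2 ** 6          # different fingers match on one triple: 0.000064
print(1 - (1 - p_same) ** 1024)  # ~0.985, i.e. ~1.5% false negatives
print(1 - (1 - p_diff) ** 1024)  # ~0.063, i.e. 6.3% false positives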
5. Theory of LSH
Euclidean distance: the space has some number of real-valued dimensions and dense points.
Non-Euclidean distance: based on properties of points rather than their location in a space.
Distance measure
1. d(x,y) >= 0
2. d(x,y) = 0 iff x = y
3. d(x,y) = d(y,x)
4. d(x,y) <= d(x,z) + d(z,y) (triangle inequality)
L2 norm = the square root of the sum of the squares of the differences between x and y in each dimension (the conventional distance).
L1 norm = sum of the differences in each dimension. (manhattan distance)
L∞ norm = the maximum of the differences between x and y in any dimension.
Jaccard distance = 1 - Jaccard similarity.
Cosine distance
Two points' vectors make an angle, whose cosine is the normalized dot product of the vectors:
cos θ = (p1 · p2) / (|p1||p2|)
Edit distance
Number of inserts and deletes of characters needed to turn one string into the other.
d(x,y) = |x| + |y| - 2|LCS(x,y)|
LCS = longest common subsequence = any longest string obtained both by deleting from x and deleting from y.
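A direct implementation of this formula, counting inserts and deletes via the LCS:

from functools import lru_cache

def edit_distance(x, y):
    # Insert/delete-only edit distance: |x| + |y| - 2|LCS(x, y)|.
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == len(x) or j == len(y):
            return 0
        if x[i] == y[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return len(x) + len(y) - 2 * lcs(0, 0)

print(edit_distance("abcde", "acfdeg"))  # LCS = "acde", so 5 + 6 - 8 = 3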
Variant edit distance
Allow insert, delete, and mutate (replace one character with another).
Hamming distance
Number of positions in which bit-vectors differ.
Families of hash function
h(x) = h(y) means h says x and y are equal.
A family H of hash functions is (d1, d2, p1, p2)-sensitive if for any two points x and y:
- if d(x, y) <= d1, then Prob[h(x) = h(y)] >= p1;
- if d(x, y) >= d2, then Prob[h(x) = h(y)] <= p2.
(d1 is the smaller distance, p1 the larger probability.)
For minhashing, Prob[h(x) = h(y)] = 1 - d(x, y), where d is Jaccard distance.
H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
For jaccard similarity, minhashing gives us a (d1, d2, (1-d1), (1-d2))-sensitive family for any d1<d2.
The theory leaves unknown what happens to pairs that are at distance between d1 and d2.