[PS] News Clustering

Programmers: News Clustering

Conditions

Jaccard similarity J(A, B) defines the intersection size of the two sets divided by the union size.
In multiple set that allows duplication of elements, the intersection has min(# of elements in A, # of elements in B). And the union has max(...).
Each string length is 2...1,000.
Cut the input string into two letters and make it as a set.
Only alphabet character pairs are valid and discard the pair if it contains spaces, numbers or special characters.
Differences in upper case and lower case should be ignored.
Since the similarity value is in 0...1 range real number, 65536 should be multiplied and print as integer value.

Strategy

Python set type provides intersection and union methods, but these are not available because there is no overlap.
Counting the number of each element is required because it is a set that allows overlap.
Use min() and max() listed in the conditions.
If the element of s1 is in s2, it can be in the intersection set. Calculate it and remove the element from s2.
If the element of s1 is not in s2, collect it seperately.
After the for statement, s2 has elements which are not in s1.
Add the number of elements in the s1_not_s2 and s2 to union.
If the union is zero, it cannot be divided into zero when calculating the Jaccard similarity, so it should be printed by multiplying 1 as given in the condition.
Otherwise, calculate the similarity and multiply it by 65536, and print only integer part.

def split2char(str):
    temp = {}
    for word in [str[i:i+2] for i in range(len(str)-1)]:
        if word.isalpha():
            uppered = word.upper()
            temp[uppered] = temp.get(uppered, 0) + 1
    return temp


def solution(str1, str2):
    s1 = split2char(str1);
    s2 = split2char(str2);

    inter = 0
    union = 0

    s1_not_s2 = {}
    for k1, v1 in s1.items():
        if k1 in s2:
            inter += min(v1, s2[k1])
            union += max(v1, s2[k1])
            del s2[k1]
        else:
            s1_not_s2[k1] = v1

    union += sum(list(s1_not_s2.values()))
    union += sum(list(s2.values()))

    if union == 0:
        return 65536
    else:
        return int(65536 / union * inter)

Remember

I think v1 can add directly to the union without making s1_not_s2.
I need to remember how to find intersecion and union in this solution.