La foret rouge
Published on

[PS] News Clustering

Authors
  • avatar
    Name
    신주용

Programmers: News Clustering

Conditions

  • Jaccard similarity J(A, B) defines the intersection size of the two sets divided by the union size.
  • In multiple set that allows duplication of elements, the intersection has min(# of elements in A, # of elements in B). And the union has max(...).
  • Each string length is 2...1,000.
  • Cut the input string into two letters and make it as a set.
  • Only alphabet character pairs are valid and discard the pair if it contains spaces, numbers or special characters.
  • Differences in upper case and lower case should be ignored.
  • Since the similarity value is in 0...1 range real number, 65536 should be multiplied and print as integer value.

Strategy

  • Python set type provides intersection and union methods, but these are not available because there is no overlap.
  • Counting the number of each element is required because it is a set that allows overlap.
  • Use min() and max() listed in the conditions.
  • If the element of s1 is in s2, it can be in the intersection set. Calculate it and remove the element from s2.
  • If the element of s1 is not in s2, collect it seperately.
  • After the for statement, s2 has elements which are not in s1.
  • Add the number of elements in the s1_not_s2 and s2 to union.
  • If the union is zero, it cannot be divided into zero when calculating the Jaccard similarity, so it should be printed by multiplying 1 as given in the condition.
  • Otherwise, calculate the similarity and multiply it by 65536, and print only integer part.
def split2char(str):
    temp = {}
    for word in [str[i:i+2] for i in range(len(str)-1)]:
        if word.isalpha():
            uppered = word.upper()
            temp[uppered] = temp.get(uppered, 0) + 1
    return temp


def solution(str1, str2):
    s1 = split2char(str1);
    s2 = split2char(str2);

    inter = 0
    union = 0

    s1_not_s2 = {}
    for k1, v1 in s1.items():
        if k1 in s2:
            inter += min(v1, s2[k1])
            union += max(v1, s2[k1])
            del s2[k1]
        else:
            s1_not_s2[k1] = v1

    union += sum(list(s1_not_s2.values()))
    union += sum(list(s2.values()))

    if union == 0:
        return 65536
    else:
        return int(65536 / union * inter)

Remember

  • I think v1 can add directly to the union without making s1_not_s2.
  • I need to remember how to find intersecion and union in this solution.