[PS] News Clustering
- Published on
- Published on
- Authors
- Name
- 신주용
Programmers: News Clustering
Conditions
- Jaccard similarity
J(A, B)
defines the intersection size of the two sets divided by the union size. - In multiple set that allows duplication of elements, the intersection has
min(# of elements in A, # of elements in B)
. And the union hasmax(...)
. - Each string length is 2...1,000.
- Cut the input string into two letters and make it as a set.
- Only alphabet character pairs are valid and discard the pair if it contains spaces, numbers or special characters.
- Differences in upper case and lower case should be ignored.
- Since the similarity value is in 0...1 range real number, 65536 should be multiplied and print as integer value.
Strategy
- Python
set
type provides intersection and union methods, but these are not available because there is no overlap. - Counting the number of each element is required because it is a set that allows overlap.
- Use
min()
andmax()
listed in the conditions. - If the element of
s1
is ins2
, it can be in the intersection set. Calculate it and remove the element froms2
. - If the element of
s1
is not ins2
, collect it seperately. - After the for statement,
s2
has elements which are not ins1
. - Add the number of elements in the
s1_not_s2
ands2
to union. - If the union is zero, it cannot be divided into zero when calculating the Jaccard similarity, so it should be printed by multiplying 1 as given in the condition.
- Otherwise, calculate the similarity and multiply it by 65536, and print only integer part.
def split2char(str):
temp = {}
for word in [str[i:i+2] for i in range(len(str)-1)]:
if word.isalpha():
uppered = word.upper()
temp[uppered] = temp.get(uppered, 0) + 1
return temp
def solution(str1, str2):
s1 = split2char(str1);
s2 = split2char(str2);
inter = 0
union = 0
s1_not_s2 = {}
for k1, v1 in s1.items():
if k1 in s2:
inter += min(v1, s2[k1])
union += max(v1, s2[k1])
del s2[k1]
else:
s1_not_s2[k1] = v1
union += sum(list(s1_not_s2.values()))
union += sum(list(s2.values()))
if union == 0:
return 65536
else:
return int(65536 / union * inter)
Remember
- I think
v1
can add directly to the union without makings1_not_s2
. - I need to remember how to find intersecion and union in this solution.