# What are the best approach and tools for applying Jaccard's Ratio and Kolmogorov–Smirnov to two text lists?

Keywords: python-3.x string list statistics

Question:

I am using Kolmogorov-Smirnov and Jaccard's Ratio to determine whether a relationship exists between two lists of words. Each list is ranked 1 to n, the lists contain different numbers of items, only some items appear in both lists, and there are no duplicate items within a list.

Question 1: What is the best way to apply these two statistics, Kolmogorov-Smirnov and Jaccard's Ratio?

• Apply the statistic to the full lists only?
• Apply the statistic per ranked item in the list and plot the results as a line graph?
• Both approaches?

Question 2: What are the best languages and tools for calculating these statistics on text lists? From what I have read so far, Python is frequently used for Jaccard's Ratio, and Excel or Python is used to calculate Kolmogorov-Smirnov.

Any guidance would be most appreciated. Thank you.

Jaccard's Ratio

I have found the following, which applies Jaccard to two full lists of text strings. Code reference: How can I calculate the Jaccard Similarity of two lists containing strings in Python?

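``````
def jaccard_similarity(list1, list2):
    # Work on sets so repeated items cannot inflate the union size
    set1, set2 = set(list1), set(list2)
    intersection = set1 & set2
    print(sorted(intersection))  # show the items the two lists share
    union = set1 | set2
    # Two empty lists have no items to compare; report 0.0 instead of dividing by zero
    return len(intersection) / len(union) if union else 0.0
``````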

But I have also found the following, which compares two individual strings by splitting each one into words. Code reference: https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50

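``````
def get_jaccard_sim(str1, str2):
    # Split each string on whitespace and keep the unique words
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    # |A ∩ B| / |A ∪ B|, with the union size computed by inclusion-exclusion
    return float(len(c)) / (len(a) + len(b) - len(c))
``````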
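For the "per ranked item" option in Question 1, one possible reading is to compute Jaccard on the top-k items of each list for every depth k, which yields a sequence that could be drawn as a line graph. The following is only a sketch under that assumption. The helper names `jaccard` and `jaccard_by_depth` and the sample word lists are hypothetical and do not come from either code reference.

```python
def jaccard(items1, items2):
    """Jaccard similarity of two collections, treated as sets."""
    set1, set2 = set(items1), set(items2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


def jaccard_by_depth(list1, list2):
    """Jaccard similarity of the top-k prefixes for k = 1 .. longest list."""
    depth = max(len(list1), len(list2))
    # Slicing past the end of the shorter list simply returns the whole list
    return [jaccard(list1[:k], list2[:k]) for k in range(1, depth + 1)]


# Hypothetical ranked lists of different lengths with partial overlap
list_a = ["apple", "pear", "plum", "kiwi", "fig"]
list_b = ["plum", "apple", "lime", "fig", "pear", "date"]

curve = jaccard_by_depth(list_a, list_b)
for k, value in enumerate(curve, start=1):
    print(k, round(value, 3))
```

The last value in `curve` equals the full-list Jaccard similarity, so this covers both options from Question 1 in a single pass.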

Kolmogorov-Smirnov

Related material, such as the Wikipedia article, shows a graph, so I don't know whether this test should also be applied per item in the list, as illustrated at https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

``````
from scipy.stats import ks_2samp

# Two numeric samples to compare
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]

# Two-sample Kolmogorov-Smirnov test
ks_statistic, p_value = ks_2samp(x, y)

print(ks_statistic)
print(p_value)
``````
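The graph in the Wikipedia article plots the two empirical cumulative distribution functions, and the two-sample KS statistic is the largest vertical gap between them. Because the test needs numeric samples, the sketch below assumes that each shared word is replaced by its rank position in each list. That mapping is just one possible choice, not an established method for text lists. The helper names `shared_item_ranks` and `ks_distance` and the sample word lists are hypothetical. The statistic is computed by hand so that the step behind the graph is visible.

```python
from bisect import bisect_right


def shared_item_ranks(list1, list2):
    """1-based ranks, in each list, of the items that appear in both lists."""
    pos1 = {item: rank for rank, item in enumerate(list1, start=1)}
    pos2 = {item: rank for rank, item in enumerate(list2, start=1)}
    shared = [item for item in list1 if item in pos2]
    return [pos1[w] for w in shared], [pos2[w] for w in shared]


def ks_distance(sample1, sample2):
    """Largest vertical gap between the two empirical CDFs."""
    s1, s2 = sorted(sample1), sorted(sample2)
    points = sorted(set(s1) | set(s2))
    # bisect_right counts how many values are <= x, giving the ECDF at x
    return max(
        abs(bisect_right(s1, x) / len(s1) - bisect_right(s2, x) / len(s2))
        for x in points
    )


# Hypothetical ranked lists of different lengths with partial overlap
list_a = ["apple", "pear", "plum", "kiwi", "fig"]
list_b = ["plum", "apple", "lime", "fig", "pear", "date"]

ranks_a, ranks_b = shared_item_ranks(list_a, list_b)
print(ranks_a, ranks_b)
print(ks_distance(ranks_a, ranks_b))

# Same data as the scipy snippet: the samples do not overlap, so the gap is 1.0
print(ks_distance([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
```

The rank samples produced here could also be passed to `ks_2samp` to obtain a p-value, although with only a handful of shared items the test has very little power.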