Compare Lists in Python

If you search for how to compare two lists in Python, you will find a lot of helpful pages in a lot of places, many of which assume you are working with numbers or you want exact matches. But what if you want to compare all the items in one list with all the items in another list and you want to be able to set some arbitrary measure of similarity or difference?

The problem arose for me recently when I was trying to compare two lists of different lengths. The two lists represented keyword sets derived from a corpus using NMF, which I had run with two different component values. As part of wanting to discover a probable “best fit” I wanted to compare which strings had remained the same and which had changed to some degree.

My first impulse was to try the Jaccard coefficient, and I used some simple code to make that work:

def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)

I then embedded that bit of code, but it could be any code you wanted, in the following:

for jk, jv in enumerate(second_list):
    for ik, iv in enumerate(first_list): 

The logic is pretty simple, but it is a leap, at least for me, in terms of how I think about things. When I started work on this, I kept trying to pack everything in one for loop: after all, I wanted to compare one list to another. But I wanted to compare all of one list with all of another list, which means I needed to iterate through both lists. A simpler version of this would be:

for j in second_list:
    for i in first_list:

The addition of enumerate above was so that I could keep track of which string in each list was matching without necessarily having to see the string itself — I could use the index values that enumerate produces to call those, if I needed. enumerate is one of those functions I regularly forget, and it is very convenient: essentially it takes a list of items and transforms it into a list of tuples where the first value is the item’s index and the second value is the item itself, so [‘a’] becomes [(0,’a’)]. You can call the parts of the tuple by any variable name you like, but I tend to stick with k and v, for key and value, because … well, because. (It could easily be anything else, and I’ve even written code that called three-item tuples with rather bland, and thus also not advisable, t, u, v. Do not do this.)

So essentially both the for loops above are transforming each of the lists involved into a list of tuples and then walking through the list, comparing the items themselves but reporting only their indices.

It doesn’t really matter which list is which, so far as I can tell, so long as you keep the variables correctly aligned. My final code block looked like this:

print("Jc = Jaccard coefficient")
for jk, jv in enumerate(topics_45):
    for ik, iv in enumerate(topics_35):
        if jaccard_similarity(iv.split(" "), jv.split(" ")) > 0.5:
            print(f"35-{ik} and 45-{jk} have a Jc of {jaccard_similarity(iv,jv):.2f}.") 

My next step is to determine how to transform this into a network or tree so that I can see which keyword clusters continues (relatively) unchanged — where I set the threshold for relatively (and perhaps end up using something other than the Jaccard coefficient which doesn’t seem terribly discriminating — and also where clusters split or, in a few cases, disappear/die.