Hierarchical Clustering of Facebook Friends

Studying Artificial Intelligence has exposed me to the many sub-fields of research in the area. Machine Learning, which is the study of algorithms that allow computers to learn and evolve based on the data they process, is particularly interesting. Unsatisfied with my shallow theoretical understanding, I recently ordered O'Reilly's Programming Collective Intelligence book, which provides hands-on experience with practical examples and code.

I have a few projects in mind that need to perform various forms of clustering, so I decided to try a quick project to get my hands dirty. First, I needed a source of data that I could cluster in a sane way. It struck me that it would be interesting to cluster Facebook friends based on the similarity of status updates. We just need to fetch some status updates for all (or some) of our friends, and build up from there.

The first step is to use the Facebook Graph API to fetch our friends. We need the base URL and our OAuth access token. To get both, I just went to the API reference page and copied the access token from the URLs of the examples. Let's first get our friends:

import httplib2
import json

ACCESS_TOKEN = "2227470867|..."
FRIEND_URL = "https://graph.facebook.com/me/friends?access_token=%s" % ACCESS_TOKEN

def get_friends():
  "Gets all of the friends. Returns a Name -> Id mapping"
  http = httplib2.Http(timeout=10) 
  resp, raw = http.request(FRIEND_URL, "GET")
  data = json.loads(raw)["data"]
  return dict([(x["name"], x["id"]) for x in data])

>> friends = get_friends()
>> friends.items()[:3]
[("Micky Mouse", "12345"), ("Minnie Mouse", "13252"), ("Donald Duck", "123451")]

Now we have a mapping of friend names to their Facebook IDs. We can use this to scrape status updates from their walls. Again, we use the Facebook Graph API: we access the feed resource for each friend, using their user ID and our access token.

WALL_URL_FORM = "https://graph.facebook.com/%(id)s/feed?access_token=%(access)s"

def get_friend_wall_posts(friends, num_pages=1):
  "Returns all the posts made on friends' walls for the number of wall pages"
  all_data = []
  http = httplib2.Http(timeout=10)

  # Download walls for each friend
  for friend_name, friend_id in friends.items():
    current_url = WALL_URL_FORM % {"id":friend_id, "access":ACCESS_TOKEN}
    for page in xrange(num_pages):
      resp, raw = http.request(current_url, "GET")
      raw = json.loads(raw)
      data = raw["data"]
      for post in data:
        all_data.append(post)
      print "Added %d posts from %s" % (len(data), friend_name)

      # Go to the next page
      current_url = raw["paging"]["next"]

  return all_data

>> raw_posts = get_friend_wall_posts(friends, 3)
Added 22 posts from Donald Duck
Added 24 posts from Micky Mouse
Added 20 posts from Minnie Mouse

At this point we have everything we need. However, we need to slice and dice the data a little to get it into a more usable format. What we currently have are the “raw” posts, which contain comments, images, links, albums, and other things that are not useful for our purposes. We only need the status message text, so let's cut away everything else.

def filter_data(data):
  "Filters data to just get id, name, and message for status updates"
  filtered = []
  for entry in data:
    try:
      if entry["type"] == "status":
        id = entry["id"]
        name = entry["from"]["name"]
        message = entry["message"].lower()
        filtered.append((id, name, message))
    except KeyError:
      print "Failed to parse: ", entry

  return filtered

def filter_friends_only(friends, filtered):
  # Keep only posts whose author name is one of our friends
  return [post for post in filtered if post[1] in friends]

>> filtered = filter_data(raw_posts)
>> filtered_friends = filter_friends_only(friends, filtered)

The first function, filter_data(), extracts just the status updates and simplifies the data representation to a tuple of (message id, user name, message text). The second function, filter_friends_only(), only retains status updates that our friends made themselves, and not posts others have made on our friends' walls.

Now that we have all the data we need, we must “normalize” it in some way so that we can compare people against each other. The simplest way to do this is to determine the domain of all used words, prune out ones that occur very rarely (only once or twice), and those that occur with high frequency (it, the, a, etc.), and use the remaining words as a list of key words. We can count how frequently each friend uses those words, and use this frequency vector to compare them. First, we can construct our domain and do some frequency counts:

import re
from collections import defaultdict

def get_words(filtered):
  """Processes status updates to return 2 dictionaries.
  The first dictionary returns a mapping of words to a set
  of message id's that contain the word.

  The second dictionary returns a mapping of people to a
  dictionary of words -> occurrences"""
  p = re.compile("[^a-zA-Z]+")

  all_words = defaultdict(set)
  person_count = {}

  for id,name,mesg in filtered:
    person_count.setdefault(name, defaultdict(int))

    words = p.split(mesg)
    for word in words:
      if word:
        all_words[word].add(id)
        person_count[name][word] += 1

  return all_words,person_count

def filter_words(total_mesg, all_words, person_data, max_occur=0.02, min_occur=0.0007):
  """Filters the dictionaries to only contain words that have
  a specified minimum and maximum occurrence proportion.

  Returns a list of the "good" words, and a modified person dictionary."""
  good_words = []
  props = []
  for word,messages in all_words.items():
    proportion = len(messages) / (1.0*total_mesg)
    props.append((proportion, word))
    if proportion < max_occur and proportion > min_occur:
      good_words.append(word)
  props.sort()
  print props

  persons = dict([(person, dict([(word, person_data[person][word]) for word in good_words])) for person in person_data])
  return good_words, persons

>> all_words, person_data = get_words(filtered)
>> print "Total words: %d, Total People: %d" % (len(all_words), len(person_data))
Total words: 9009, Total People: 172

>> all_words, person_data = filter_words(len(filtered), all_words, person_data)
[(0.00027601435274634281, u'aaa'), ...., (0.32514490753519182, u'i')]
>> print "Filtered words: %d, Filtered People: %d" % (len(all_words), len(person_data))
Filtered words: 2171, Filtered People: 172

There are a few values worth tuning for each run. The main ones are max_occur and min_occur, the arguments to filter_words(). They are used to prune the set of words used for matching, and they have a large impact. In my case, I found that there was very little overlap in words in general, so the values had to be very low. I just eyeballed them: I set max_occur at the point where very common words started appearing, and min_occur somewhat above the threshold for a single use. This leaves a fairly reasonable set of about 2K words to compare people with.
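To make that eyeballing a little less ad hoc, here is a small sketch (not part of the original pipeline; the cutoff values are just illustrative) that buckets the word proportions so you can see where the counts fall off before picking thresholds:

```python
from collections import defaultdict

def proportion_histogram(props, cutoffs=(0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 1.0)):
    "Counts how many words fall into each bucket (first cutoff >= proportion wins)"
    counts = defaultdict(int)
    for proportion, word in props:
        for cutoff in cutoffs:
            if proportion <= cutoff:
                counts[cutoff] += 1
                break
    return dict(counts)

# Hypothetical entries in the same (proportion, word) shape as the props list
sample = [(0.0003, "aaa"), (0.002, "clustering"), (0.3, "i")]
```

Feeding it the props list printed by filter_words() shows roughly how many words each candidate threshold would keep or discard.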

The algorithm that performs the clustering is pretty straightforward. We start out with each person as their own “cluster”. We compute the distance between each pair of clusters, and merge the two clusters that are most similar. We repeat this process until only a single root cluster remains. The key ingredients to make this work are a way to represent a cluster and a way to compute the distance between clusters. A cluster is very simple: it contains a frequency vector of word occurrences, and two optional children, the clusters that were merged to form it. I borrowed the bicluster class from the book:

class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.vec = vec
    self.left = left
    self.right = right
    self.distance = distance
    self.id = id

def vectorize_dict(key_ordered, d):
  "Vectorizes the values of a dictionary by iterating in a specific order"
  return [d[key] for key in key_ordered]

>> person_clusters = [bicluster(vectorize_dict(all_words, person_data[p]), id=p) for p in person_data.keys()]

There are a handful of methods we could use to compute the distance, but one simple and useful metric is the Pearson correlation coefficient. It computes the strength of a linear fit to the data, returning a value between -1 and 1, where 0 indicates no correlation, 1 a strong positive correlation, and -1 a strong negative one. By returning 1 minus the coefficient, we can use it as a measure of distance between our clusters, where similar clusters are close to 0. Here is a simple function to compute the distance:

from math import sqrt
def pearson(v1,v2):
  sum1 = 0
  sum2 = 0
  sum1Sq = 0
  sum2Sq = 0
  pSum = 0

  # Compute all the sums
  for i in xrange(len(v1)):
    val1 = v1[i]
    val2 = v2[i]
    sum1 += val1
    sum2 += val2
    sum1Sq += val1*val1
    sum2Sq += val2*val2
    pSum += val1*val2

  # Calculate r (Pearson score)
  n = len(v1)
  num = pSum - (sum1*sum2/float(n))
  den = sqrt((sum1Sq - pow(sum1,2)/float(n)) * (sum2Sq - pow(sum2,2)/float(n)))
  if den==0: return 0

  return 1.0-num/den
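As a quick sanity check that the distance behaves as claimed, here is a compact, self-contained restatement of the same formula (the name pearson_distance is mine, not from the book) with two assertions:

```python
from math import sqrt

def pearson_distance(v1, v2):
    "1 - Pearson r: 0 for perfectly correlated vectors, 2 for perfectly inverted ones"
    n = float(len(v1))
    sum1, sum2 = sum(v1), sum(v2)
    sum1Sq = sum(x * x for x in v1)
    sum2Sq = sum(x * x for x in v2)
    pSum = sum(x * y for x, y in zip(v1, v2))
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return 1.0 - num / den

# Identical vectors are as close as possible; inverted ones are as far as possible
assert abs(pearson_distance([1, 2, 3], [1, 2, 3])) < 1e-9
assert abs(pearson_distance([1, 2, 3], [3, 2, 1]) - 2.0) < 1e-9
```

A nice property of this metric is that it corrects for overall verbosity: a friend who posts twice as much, but with the same relative word usage, still ends up at distance 0.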

We now have our clusters, and a means of measuring distance, so we need the actual function to perform the clustering. This is mostly from the book, but I will reproduce it here with my minor modifications:

def hcluster(clust,distance=pearson):
  distances = {}
  currentclustid = -1

  while len(clust)>1:
    lowestpair = (0,1)
    closest = distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in xrange(len(clust)):
      for j in xrange(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances:
          distances[(clust[i].id,clust[j].id)] = distance(clust[i].vec,clust[j].vec)
        d = distances[(clust[i].id,clust[j].id)]

        if d<closest:
          closest = d
          lowestpair = (i,j)

    # calculate the average of the two clusters
    mergevec = [(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
                for i in xrange(len(clust[0].vec))]

    # create the new cluster
    newcluster = bicluster(mergevec,left=clust[lowestpair[0]],
                           right=clust[lowestpair[1]],
                           distance=closest,id=currentclustid)

    # cluster ids that weren't in the original set are negative
    currentclustid -= 1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]

>> person_root = hcluster(person_clusters)

At long last! We have clustered our friend group. It is rather difficult to visualize our results in their current form, so we make use of the Python Imaging Library (PIL) to generate a basic visualization of our cluster. I will omit this code, but it is available in the source. My friends don't have unusually strange names; I've just anonymized my results for privacy.
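If you just want a quick look at the tree without PIL, a plain-text dump (modeled on the book's printclust; the bicluster stub is restated here so the snippet stands alone) works too:

```python
class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.vec, self.left, self.right = vec, left, right
        self.distance, self.id = distance, id

def render_cluster(clust, depth=0):
    "Returns the tree as indented text lines; '-' marks merged (negative id) clusters"
    merged = isinstance(clust.id, int) and clust.id < 0
    lines = ["  " * depth + ("-" if merged else str(clust.id))]
    for child in (clust.left, clust.right):
        if child is not None:
            lines += render_cluster(child, depth + 1)
    return lines
```

Joining the returned lines with newlines and printing person_root gives an indented outline where each '-' is a merge point and the leaves are friend names.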

A partial sub-tree showing the clustering at the leaves of the tree:
Partial Sub-tree

The full tree, showing the general structure of the tree:
Full Tree

Here is a link to the full-sized version.

Some of the clusters will make more sense than others, based on friend groups and such. Others cannot be clustered in any meaningful way due to lack of data. A critical factor in getting good results is having enough data for all of your friends. In my tests, getting only the first page's worth of status updates for each friend resulted in a terrible clustering. Getting the first three pages dramatically improved things, but there is still room for improvement. One of the limitations is how much data you can fetch from Facebook without getting rate-limited.

It is clear that we are only scratching the surface of what can be used to cluster friends. You could go further and use pages that friends have liked, pull in biographies, favorite quotes, music interests, or even mine links and videos that they share. Those are outside the scope of my simple example, but it should be relatively clear how to do that based on these techniques.
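As one concrete direction: the Graph API's likes connection returns the same {"data": [...]} envelope as the friends and feed resources, so extracting liked page names looks just like get_friends(). The response below is fabricated for illustration:

```python
LIKES_URL_FORM = "https://graph.facebook.com/%(id)s/likes?access_token=%(access)s"

def parse_liked_pages(raw):
    "Extracts the set of liked page names from a decoded likes response"
    return set(entry["name"] for entry in raw["data"])

# Illustrative (made-up) response in the standard Graph API envelope
sample_response = {"data": [
    {"name": "Python", "category": "Interest", "id": "1"},
    {"name": "Collective Intelligence", "category": "Book", "id": "2"},
]}
```

The liked-page names could then be folded into the same frequency vectors we built from status words, or clustered as their own feature set.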

If you want to try playing with the code, it is posted in its entirety, with some slight modifications, on GitHub.

Happy Clustering!

armondadgar posted this