<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>I’m Armon Dadgar. Currently living in San Francisco and working for a startup called @Kiip. This is my space to provide yet another technical blog. Although I am a polyglot, I have a particular fondness for Python, Erlang, and C. I find problems of distributed computing, “Big Data” and “Big Storage” fascinating and hope to do more research in that area. I studied Computer Science at the University of Washington.</description><title>SIGINT</title><generator>Tumblr (3.0; @armondadgar)</generator><link>http://armondadgar.com/</link><item><title>Django: Development to Deployment (Part 3)</title><description>&lt;p&gt;I &lt;a href="http://armondadgar.com/post/7031565232/django-endtoend-part1" target="_blank"&gt;started the series&lt;/a&gt; with 
setting up your local environment using Vagrant and Fabric to quickly bootstrap. In &lt;a href="http://armondadgar.com/post/7267697181/django-endtoend-part2" target="_blank"&gt;the second part&lt;/a&gt;, we reviewed some conventions for Django development as well as useful tools and tricks. In the final part of the series we will cover a simple deployment to Amazon EC2.&lt;/p&gt;

&lt;h2&gt;Getting started with AWS&lt;/h2&gt;

&lt;p&gt;The first step in deploying to Amazon EC2 is to set up an account with Amazon Web Services. This is fairly straightforward. Go to &lt;a href="http://aws.amazon.com/" target="_blank"&gt;http://aws.amazon.com/&lt;/a&gt;, click on “Create an AWS Account”, and follow the steps. It may take a few hours for the account to become active, but then you will be able to log in to the AWS Management Console. From there you have access to all the AWS services.&lt;/p&gt;

&lt;p&gt;For our simple application, open the console and click the EC2 tab. The Amazon
Elastic Compute Cloud (EC2) allows you to rent server infrastructure on a
pay-as-you-go basis. This is great for startups or projects where a large investment
in server infrastructure is not possible. It is perfect for our project, as we will
deploy onto a single server. There are several key concepts in EC2:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;Instances&amp;#160;: An instance in EC2 is a single host. There are a variety of instance types, ranging from micro to high-memory large instances. Each instance type has a different amount of CPU power, RAM, I/O capacity, and disk space. The more powerful machines cost more on an hourly basis.&lt;/li&gt;
&lt;li&gt;Elastic Block Store&amp;#160;: EBS is a service which allows for persistent storage. A typical EC2 instance has “instance storage” provided which is very large and high speed, but there is no data backup provided. If the instance dies, then the instance store is lost. EBS instead provides disks which are backed up and will persist in case an instance dies. They are not necessary, but may be useful depending on the application.&lt;/li&gt;
&lt;li&gt;AMI&amp;#160;: AMI is short for Amazon Machine Image, and it is basically a “snapshot” of a running machine. When a new instance is started, it uses an AMI as its base image. This image may have any operating system or software pre-installed. Typically, you would start an instance with something like Ubuntu or CentOS with a default install, and then customize it from there. If you want, you can create your own AMI from an existing setup.&lt;/li&gt;
&lt;!-- more --&gt;
&lt;li&gt;Security Groups&amp;#160;: Security groups allow you to set a firewall configuration policy for a set of hosts in EC2. This increases the security of your servers and allows you to run software that listens on the network while limiting connections to only those originating from your servers. A common mistake is to not configure the security groups, and thus make it impossible to access your web server.&lt;/li&gt;
&lt;li&gt;Elastic IPs&amp;#160;: Every Amazon EC2 instance has a public and private DNS name. The private DNS name can only be used by hosts on EC2 in the same region. This is used for servers to communicate with each other, without causing the data to leave the data center. It is important that things such as SQL databases and Memcached instances be accessed using the private DNS name to minimize latency and bandwidth cost. The public DNS name is accessible by hosts on the Internet, however the DNS name has no guarantee of stability, and the IP address may change at any time. Elastic IPs allow you to provide a stable IP address to an EC2 host. An Elastic IP is first allocated and then associated with a specific host.&lt;/li&gt;
&lt;li&gt;Key Pairs&amp;#160;: EC2 instances are typically configured using SSH if they are running Linux, or Remote Desktop if they are running Windows. EC2 maintains a set of named key pairs which are used to communicate with your hosts. When a host is created, you may specify a key pair to use, and every SSH connection must provide the key to log in.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;There is a lot to digest when getting started with Amazon EC2, but most of it is fairly basic. To get us started, create a new key pair. This will download a .pem file that should be kept safe. This file can be used to access your hosts, so do not make it public. Next, create a security group. Click the security group, go to the “Inbound” tab, and enable access to your web server on port 80 from all IPs (specified as 0.0.0.0/0 in CIDR format). Also enable incoming SSH connections on port 22.&lt;/p&gt;
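
&lt;p&gt;To make the CIDR notation concrete, here is a minimal Python sketch using the standard library &lt;code&gt;ipaddress&lt;/code&gt; module. It only illustrates how a rule such as 0.0.0.0/0 matches addresses, not how EC2 evaluates security groups internally. The helper name &lt;code&gt;cidr_allows&lt;/code&gt; and the sample addresses are made up for this example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import ipaddress

def cidr_allows(cidr, address):
    # True when the address falls inside the CIDR block
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)

# 0.0.0.0/0 covers every IPv4 address, so any client may connect
print(cidr_allows("0.0.0.0/0", "203.0.113.7"))
# A narrower block such as 10.0.0.0/8 only admits addresses starting with 10.
print(cidr_allows("10.0.0.0/8", "203.0.113.7"))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first call prints True and the second prints False. Opening port 80 to 0.0.0.0/0 is what makes the site public, while port 22 could be limited to a narrower block if you only connect from known addresses.&lt;/p&gt;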

&lt;p&gt;Once that is all set up, you are ready to create your first instance. Go to Instances, and click on “Launch Instance”. The first step is to select an AMI. Since in part 1 we set up Vagrant to develop within a Linux environment running Ubuntu 10.04, we want to use the same thing on EC2. The 32-bit AMI that uses the instance storage is ami-e4d42d8d.
Next we select the type of instance we want, in this case a single small instance.
We can continue on and select a name, key pair, and security group. Eventually we
get to review our setup:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_loi9cg9BHK1qgbach.png" alt="Instance Review"/&gt;&lt;/p&gt;

&lt;p&gt;Once you click launch, the instance will begin booting and will be available in a few minutes. When the status is “Running”, you can select the instance and find its public DNS name. To log in, use your key pair and execute something like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ chmod 600 testdjango.pem
$ ssh -i testdjango.pem ubuntu@ec2-...compute-1.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you can log in, then you have succeeded! If not, check the status of your host and make sure the security group is properly configured to allow incoming connections. At this point, we have a blank host up and running, but we need to bootstrap it to run our Django server.&lt;/p&gt;

&lt;h2&gt;Bootstrapping with Fabric&lt;/h2&gt;

&lt;p&gt;Once we create a new host, we need to bootstrap it with our environment. This is normally the most difficult part of
using a service such as EC2. Unless you are using automated tools, bootstrapping is tedious and error-prone. However,
because we have invested time in defining our Fabric file, we can bootstrap new hosts painlessly.&lt;/p&gt;

&lt;p&gt;To start, we need to modify our fabfile to specify the hosts. You should modify the &lt;code&gt;HOSTS&lt;/code&gt; variable at the top of
the file to include the EC2 public hostname. Next, we need to provide the key file that is used to communicate
with the host. Hosts on EC2 accept SSH connections only if a valid identity file is provided. By default, our Fabric
file will use the .pem file at config/aws/testdjango.pem. You should replace this with the actual key that was downloaded
when you generated your key pair.&lt;/p&gt;

&lt;p&gt;Next, we need to enable our hosts to get the code from GitHub. To do this, we need to generate a set
of “deploy keys”, which are SSH keys that you provide to GitHub so that your code can be cloned.
Generating the keys is straightforward:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cd config; ssh-keygen -f id_rsa -t rsa -N ''
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you have the key, go to the GitHub page for your project, click on “Admin”, and in the Deploy Keys
section, you need to upload the contents of id_rsa.pub. This will enable the EC2 servers to clone a copy
of your repo to serve the site.&lt;/p&gt;

&lt;p&gt;Once we have specified the hosts and set up all our keys, we are ready to bootstrap. We
can just issue the command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ fab production bootstrap
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will select the production environment, and run the bootstrap command on all the hosts. This is similar
to the command we ran in part 1:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ fab vagrant setup_vagrant
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The setup_vagrant command shares most of its subroutines with bootstrap, but a few minor differences are
present due to the VirtualBox environment. Once the bootstrap command is finished, we should be able
to point our browser at the public hostname of our EC2 instance and see our site running live.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_loi9f9FzSP1qgbach.png" alt="Site Live"/&gt;&lt;/p&gt;

&lt;p&gt;This covered the simplest case of bootstrapping in production. To build a staging environment,
we can just define a different set of hosts for staging, and run the same command, selecting
staging instead of production.&lt;/p&gt;

&lt;h2&gt;Updating code and pushing to production&lt;/h2&gt;

&lt;p&gt;We’ve covered everything necessary to configure our servers and get our site running live.
However, websites are inherently iterative. As such, a common process is deploying the latest version
of code to the running site with minimal interruption. One way to do this would be to rerun bootstrap,
but this is rather invasive and may cause a few minutes of downtime.&lt;/p&gt;

&lt;p&gt;Instead, we can use a set of Fabric commands to make the process easy and minimize downtime.
The critical commands are the following:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;cut_staging&amp;#160;: Merges the code from master into the staging branch. This allows the latest
              development code to be deployed to a staging environment.&lt;/li&gt;
&lt;li&gt;cut_release&amp;#160;: Merges the code from staging into release. This allows code that has been
              tested in a staging environment to be deployed to production.&lt;/li&gt;
&lt;li&gt;pull&amp;#160;: Performs a Git pull, so that the server has the latest code from the proper branch.&lt;/li&gt;
&lt;li&gt;reload&amp;#160;: Reloads both Nginx and uWSGI so that any new code changes take effect.&lt;/li&gt;
&lt;li&gt;syncdb&amp;#160;: Performs a syncdb and migrate command so that DB schemas are brought up to date.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Because Fabric commands can be composed, we can easily use a single command to perform all
the relevant steps. The most common is to push code from the development (master) branch
to staging. We can use the following to do so:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ fab cut_staging staging pull reload syncdb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can then test our changes in staging and look for any bugs. Once we are confident in
our code, we can easily deploy to production:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ fab cut_release production pull reload syncdb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will basically perform the same steps, but merge into the release branch and
run on the production hosts. At this point, our latest code changes are available in
production.&lt;/p&gt;
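
&lt;p&gt;Why the order of the words after fab matters can be sketched in plain Python. This is not Fabric itself: a dictionary stands in for Fabric’s shared env object, and the host name, the task bodies, and the &lt;code&gt;run_tasks&lt;/code&gt; helper are all hypothetical. The idea it illustrates is that an environment task such as production only records settings, and the tasks that follow read them.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Stand-in for Fabric's shared env object (assumption: a plain dict)
env = {"hosts": [], "branch": None}
log = []

def production():
    # Environment task: only records which hosts and branch to use
    env["hosts"] = ["prod.example.com"]  # hypothetical host name
    env["branch"] = "release"

def pull():
    # Per-host task: reads whatever the environment task selected
    for host in env["hosts"]:
        log.append("git pull %s on %s" % (env["branch"], host))

def reload_servers():
    for host in env["hosts"]:
        log.append("reload nginx and uwsgi on %s" % host)

def run_tasks(*tasks):
    # Run each task in the order given, like the words after fab
    for task in tasks:
        task()

run_tasks(production, pull, reload_servers)
print(log)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this sketch, listing pull before production would leave it with an empty host list and nothing to do, which is why the environment name comes before the per-host tasks.&lt;/p&gt;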

&lt;h2&gt;Buttoning Down&lt;/h2&gt;

&lt;p&gt;I’ve tried to cover all the basics required to get our site deployed to EC2,
but there are still many things needed to button down our deployment. For most
deployments, you will probably want to make use of a database to provide persistence.
Setting up a DB is outside the scope of this post, but one could either set up
an EC2 instance and manually configure MySQL or PostgreSQL, or use Amazon RDS
to provide a database as a service.&lt;/p&gt;

&lt;p&gt;Additionally, most sites will benefit from using caching, particularly memcache.
Memcache is installed by default on all of our machines, but the Django settings
files need to be updated to use the proper hostnames.&lt;/p&gt;

&lt;p&gt;Lastly, if you would like to run your site on EC2 with nicer URLs you have two
options. You can create an Elastic IP and assign it to your instance, which will
give it a stable public IP, and then update your DNS A record to point to the
EIP. Alternatively, you can just set up a DNS CNAME record pointing to the public hostname
of the EC2 instance. The advantage of the first approach is that Elastic IPs can
be remapped very quickly in the case of a host failure, whereas changing a CNAME may
take a substantial amount of time.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;In the final part of this series I covered how to deploy our code to Amazon EC2.
Our deployment is simplified by the tools we used and by our foresight in
maintaining consistent development and production environments. This allows us to
iterate rapidly without worrying about incompatibilities or environmental issues.
We can use Fabric both to maintain and grow our staging and production fleets and
to do simple code deployments. Lastly, we briefly covered some of the considerations
needed for a properly buttoned-down deployment.&lt;/p&gt;

&lt;p&gt;I wrote this series of blog posts for a number of reasons. I wanted to swap out
the information I had built up in my mind for my own reference. It also serves
as documentation for team members working on projects so that they may fully grasp
our setup and process. Lastly, I hoped to share our methodology and structure so
that those who are new to Django may adopt sensible conventions for projects as
well as more rigorous engineering processes.&lt;/p&gt;

&lt;p&gt;Regardless of your background or exposure to Django, I hope some of this information
was valuable. If you have any questions about the setup, please ping me
(Twitter works well: &lt;a href="http://twitter.com/ArmonDadgar/" target="_blank"&gt;@ArmonDadgar&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Resources:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/armon/DjangoProjectExample" target="_blank"&gt;GitHub Project&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://aws.amazon.com/ec2/" target="_blank"&gt;Amazon EC2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description><link>http://armondadgar.com/post/7744212093</link><guid>http://armondadgar.com/post/7744212093</guid><pubDate>Sun, 17 Jul 2011 18:34:00 -0700</pubDate><category>Django</category><category>Python</category><category>EC2</category><category>AWS</category><category>Fabric</category><category>Deployment</category></item><item><title>Django: Development to Deployment (Part 2)</title><description>&lt;p&gt;&lt;b&gt; Update: &lt;a href="http://armondadgar.com/post/7744212093/django-endtoend-part3" target="_blank"&gt;Part 3 is out now: Deploying to AWS.&lt;/a&gt;&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="http://armondadgar.com/post/7031565232/django-endtoend-part1" target="_blank"&gt;Part 1&lt;/a&gt; of our series I covered setting up your local environment using Vagrant for virtualization and Fabric in combination with several other tools to do the bootstrapping. In Part 2, I’ll cover my Django setup and the development process.&lt;/p&gt;

&lt;h2&gt;Django Settings&lt;/h2&gt;

&lt;p&gt;At the end of part 1 we created a blank project and had a web server running that allowed us to reach the congratulations page. The next step is to configure Django, mostly by modifying our settings.py file. This file controls all the settings for Django, and is very important.&lt;/p&gt;

&lt;p&gt;The first thing I do is add some helper methods to resolve the absolute path to the current directory. This has an edge case related to how we use symlinks inside Vagrant, that we need to check for, but is otherwise straightforward:&lt;/p&gt;

&lt;!-- more --&gt;

&lt;pre&gt;&lt;code&gt;
import os
import os.path
import sys
import datetime

##### Path Resolution

# Get the current path
BASE_PATH = os.path.dirname(os.path.abspath(__file__))

# Hack for vagrant
if BASE_PATH == "/project/project":
  BASE_PATH = "/server/env.example.com/project/project"

# Makes a normalized path from the base path
def make_abs_path(*rel_path):
  args = (BASE_PATH,) + rel_path
  return os.path.normpath(os.path.join(*args))

# Modify the python path
sys.path.append(make_abs_path("core/"))

# Make the tmp paths
TMP_PATHS = ["django","query"]
for p in TMP_PATHS:
  path = make_abs_path("../../tmp/",p)
  if not os.path.exists(path):
    os.makedirs(path)

# Get the date string
NOW = datetime.datetime.now()
DATE_STR = NOW.strftime("%Y-%m-%d")
&lt;/code&gt;&lt;/pre&gt;
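
&lt;p&gt;As a quick check of what the helper returns, here is a small standalone sketch. It fixes &lt;code&gt;BASE_PATH&lt;/code&gt; to the Vagrant value from the settings above and uses &lt;code&gt;posixpath&lt;/code&gt; so the result is the same on any platform. Both choices are assumptions made for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import posixpath

# Assumed value, copied from the Vagrant hack in the settings above
BASE_PATH = "/server/env.example.com/project/project"

def make_abs_path(*rel_path):
    # Join onto the base path, then collapse any ".." segments
    args = (BASE_PATH,) + rel_path
    return posixpath.normpath(posixpath.join(*args))

print(make_abs_path("templates"))
print(make_abs_path("../../tmp/", "django"))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first call prints /server/env.example.com/project/project/templates and the second prints /server/env.example.com/tmp/django, showing that relative segments are resolved against the project folder rather than the current working directory.&lt;/p&gt;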

&lt;p&gt;Once we have these settings, we can use the &lt;code&gt;make_abs_path()&lt;/code&gt; method to generate our paths. This is used in several places, such as for template locations, static media, logging, etc. Many of the other settings are pretty typical for Django, but one common requirement is to support multiple “environments”. You may want to have separate settings for development, staging, and production. Clearly, you wouldn’t want to use the same databases or caches. To support this,
I check for the existence of two files to help determine the current environment. If a file named &lt;code&gt;PRODUCTION&lt;/code&gt; is in the project folder, we set our environment to production, and import production settings. Likewise, if we see a &lt;code&gt;STAGING&lt;/code&gt; file, we do the same. In the absence of these files, we load our development settings.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
# Check which environment to load
if os.path.exists(make_abs_path("PRODUCTION")):
  from settings_prod import *
  ENVIRONMENT = "PRODUCTION"
elif os.path.exists(make_abs_path("STAGING")):
  from settings_stage import *
  ENVIRONMENT = "STAGING"
else:
  from settings_dev import *
  ENVIRONMENT = "DEV"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I prefer this approach due to its simplicity. I create a blank file with the appropriate name and only add it
to the correct git branch, and the proper settings are loaded. Here are the steps to create a new staging and release branch, with the files that trigger the environmental settings:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
$ git checkout -b stage
$ touch project/STAGING
$ git add project/STAGING
$ git commit -m "Add staging file"
$ git push -u origin stage
$ git checkout -b release
$ git mv project/STAGING project/PRODUCTION
$ git commit -m "Add production file"
$ git push -u origin release
$ git checkout master
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Our staging machines use the stage branch from the git repo, and our production machines use the release branch. Now that we have added the proper files to each of the branches, the proper settings will be loaded for each environment. Usually, the only settings that need to be set per environment are for the database and caches. Here is an example of the production settings:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
DEBUG = False
TEMPLATE_DEBUG = False

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'exampleproddb',
        'USER': 'produser',
        'PASSWORD': 'ExampleProdPass',
        'HOST': 'db.example.com',
        'PORT': '',
    }
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': ['memcache.example.com:11211'],
        'KEY_PREFIX': 'PROD',
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It is critical to disable DEBUG on production, both for performance and because debug error pages expose details of your code and configuration. In addition, we use a standard MySQL database configuration and a Memcached backend. For caches, always set a &lt;code&gt;KEY_PREFIX&lt;/code&gt; to prevent staging or development environments from clobbering the cached production data. Once Django has been configured you can create your first app.&lt;/p&gt;
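
&lt;p&gt;To see why the prefix prevents collisions, the sketch below mimics the way Django’s default key function combines the prefix, a version number, and the key before it reaches Memcached. The function name &lt;code&gt;compose_key&lt;/code&gt; and the sample key are made up here, so treat the Django cache documentation as the authoritative reference.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def compose_key(key, key_prefix, version=1):
    # Prefix, version, and key joined by colons
    return "%s:%s:%s" % (key_prefix, version, key)

print(compose_key("user_42", "PROD"))
print(compose_key("user_42", "STAGE"))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same application key ends up as PROD:1:user_42 in production and STAGE:1:user_42 in staging, so two environments sharing one Memcached server never overwrite each other’s entries.&lt;/p&gt;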

&lt;h2&gt;Apps and Templates&lt;/h2&gt;

&lt;p&gt;In Django, an app is merely an organizational unit. An app is nothing more than a collection of views and models. I try to maintain a clean folder structure by placing apps in a subdirectory of the project folder. As you can see in our &lt;a href="https://github.com/armon/DjangoProjectExample" target="_blank"&gt;example repository&lt;/a&gt;, I’ve created a folder in project called apps/ and added our &lt;code&gt;main&lt;/code&gt; app in there. I’m not going to cover the details of using Django, so suffice it to say that our app renders only the index page using a template:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
from django.shortcuts import render_to_response
from django.template import RequestContext

def index(request):
  return render_to_response("home/index.html", context_instance=RequestContext(request))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We use the convenience method &lt;code&gt;render_to_response&lt;/code&gt; to handle template rendering. We’ve configured Django to
look in the project/templates/ directory for templates, so index.html should be inside project/templates/home/.
It is very standard to have a base template (&lt;a href="https://github.com/armon/DjangoProjectExample/blob/master/project/templates/base.html" target="_blank"&gt;project/templates/base.html&lt;/a&gt;), and then have &lt;a href="https://github.com/armon/DjangoProjectExample/blob/master/project/templates/home/index.html" target="_blank"&gt;templates&lt;/a&gt; which extend the base template with the page-specific content. This is a simple method of maintaining a unified look while respecting the DRY principle.&lt;/p&gt;

&lt;h2&gt;Models and Migrations&lt;/h2&gt;

&lt;p&gt;We now have a very simple app that is able to generate our index page and show off our lolcat.
However, we can’t do much beyond serving this simple static page. Most non-trivial applications
will need to make use of a database to provide a persistent datastore. In Django, this is done by
writing Models which are built on top of Django’s ORM framework. Models provide a convenient and simple
means of accessing our data, and are automatically transformed into tables and rows in our database.&lt;/p&gt;

&lt;p&gt;One annoyance with Django out of the box is the inability to change the schema. Let’s say you define
a Person model with age and name. You may later decide that you would also like to store their gender.
Django provides no good solution to this problem, however there is a third-party plugin called &lt;a href="http://south.aeracode.org/" target="_blank"&gt;South&lt;/a&gt; which solves it. South provides simple schema migrations and allows
you to be somewhat more flexible and adaptive. I highly recommend it and use it for every project.&lt;/p&gt;

&lt;p&gt;The only change when you adopt South is that instead of using the typical &lt;code&gt;syncdb&lt;/code&gt;
command, you must now use the &lt;code&gt;migrate&lt;/code&gt; command. This command might need to apply multiple migrations
to bring the database up to the current state. As an example, we might define our model as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
from django.db import models

class Person(models.Model):
  name = models.CharField(max_length=64,db_index=True)
  age = models.IntegerField()  # max_length has no effect on integer fields
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once we have our initial model, we need to instruct South to create our initial schema. This is
made slightly more complicated due to vagrant and virtualenv, but is nonetheless simple:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
$ vagrant ssh
$ cd /server/env.example.com/
$ source bin/activate
$ cd project/project/
$ python manage.py schemamigration --initial main
Creating migrations directory at '/project/project/apps/main/migrations'...
Creating __init__.py in '/project/project/apps/main/migrations'...
 + Added model main.Person
Created 0001_initial.py. You can now apply this migration with: ./manage.py migrate main
$ python manage.py syncdb ; python manage.py migrate
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After we run all these commands, we will have created our first migration and applied it
to create the table. At this point, we might decide to extend our Person to add a gender.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
class Person(models.Model):
  name = models.CharField(max_length=64,db_index=True)
  age = models.IntegerField()  # max_length has no effect on integer fields
  gender = models.CharField(max_length=6, null=True, default=None)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After we do this, we can generate a new migration that will add the column to our table.
This is all done automatically by South, we just need to instruct it to generate the necessary
files and apply them:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
$ python manage.py schemamigration --auto main
 + Added field gender on main.Person
Created 0002_auto__add_field_person_gender.py. You can now apply this migration with: ./manage.py migrate main
$ python manage.py migrate
Running migrations for main:
 - Migrating forwards to 0002_auto__add_field_person_gender.
 &amp;gt; main:0002_auto__add_field_person_gender
 - Loading initial data for main.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, South is very simple to use but also is extremely useful in allowing us to change
our schema. In addition to supporting forward migrations, it can also handle reverse migrations to
rollback an unwanted change.&lt;/p&gt;

&lt;p&gt;In our examples, we needed to manually SSH in to apply our migrations. This can be a bit tedious,
so there is a Fabric task to handle it. If all we need to do is run &lt;code&gt;syncdb&lt;/code&gt; and &lt;code&gt;migrate&lt;/code&gt;, which
are usually done after deploying new code, then we can use the &lt;code&gt;syncdb&lt;/code&gt; task:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
$ fab vagrant syncdb
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will run both tasks without needing to SSH in and activate virtualenv.&lt;/p&gt;

&lt;h2&gt;Development Server&lt;/h2&gt;

&lt;p&gt;Our current setup uses Nginx to handle the incoming requests and uWSGI as our
application server. However, uWSGI is designed to be performant and as such it 
caches our application code in-memory. This means that when we update our python
files our changes are not reflected until uWSGI is restarted or reloaded (to reload
uWSGI send the HUP signal to it). For rapid development this can be inconvenient since
we need to constantly restart it. To minimize the overhead of development, we can run
the Django development server which automatically reloads on code change. This allows
us to make changes and immediately see the results.&lt;/p&gt;
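
&lt;p&gt;Sending that HUP signal can be scripted. The following is a minimal sketch using only the Python standard library. The pid file location is a hypothetical path, so adjust it to wherever your uWSGI instance writes its pid.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import os
import signal

def read_pid(pid_file):
    # The pid file holds the process id as plain text
    with open(pid_file) as handle:
        return int(handle.read().strip())

def reload_uwsgi(pid_file="/tmp/uwsgi.pid"):  # hypothetical path
    # Deliver SIGHUP to the process named in the pid file
    os.kill(read_pid(pid_file), signal.SIGHUP)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling &lt;code&gt;reload_uwsgi()&lt;/code&gt; after editing a file has the same effect as restarting by hand, but for day-to-day work the development server described here is still the more convenient option.&lt;/p&gt;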

&lt;p&gt;To do this, we just use a Fabric task to start the development server:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
$ fab vagrant dev_server
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To use this server, we just use local port 7000 instead of 8080. Where port 8080
will use uWSGI to serve requests, port 7000 is used in development only and uses
the dev server to handle requests. We don’t use the development server in production
because it is not particularly fast or efficient.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;In Part 2 I tried to cover some of the tips and tricks for development in our environment.
We covered how to configure Django settings such that we can control the settings on a per-environment
basis and easily generate absolute paths. Django apps are placed in a sub-folder and templates in
a separate folder to keep the folder structure clean. South was introduced as a simple tool
for enabling schema migrations. Lastly, we covered using the Django development server for more
rapid development.&lt;/p&gt;

&lt;p&gt;I hope some of this was useful as an example of how to structure a Django project,
and some tricks for development. In the next part we will cover using Git to manage
the stage and release branches and deploying our project onto Amazon EC2.&lt;/p&gt;

&lt;p&gt;Resources:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/armon/DjangoProjectExample" target="_blank"&gt;GitHub Project&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://south.aeracode.org/" target="_blank"&gt;South&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.djangoproject.com/en/1.3/" target="_blank"&gt;Django Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.djangoproject.com/en/1.3/intro/tutorial01/" target="_blank"&gt;Django Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description><link>http://armondadgar.com/post/7267697181</link><guid>http://armondadgar.com/post/7267697181</guid><pubDate>Tue, 05 Jul 2011 10:00:00 -0700</pubDate><category>Django</category><category>Python</category><category>Web Development</category><category>South</category></item><item><title>Django: Development to Deployment (Part 1)</title><description>&lt;p&gt;&lt;b&gt; Update: &lt;a href="http://armondadgar.com/post/7744212093/django-endtoend-part3" target="_blank"&gt;Part 3 is out now: Deploying to AWS.&lt;/a&gt;&lt;/b&gt;&lt;br/&gt;&lt;b&gt; Update: &lt;a href="http://armondadgar.com/post/7267697181/django-endtoend-part2" target="_blank"&gt;Part 2 is out now: Developing Django.&lt;/a&gt;&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;As programmers we adopt new tools to make our lives easier and increase the speed of development. Web development is an area of extremely rapid innovation with many incredible libraries and packages. I have worked on a number of Django powered websites, and I wanted to share my process to help those who are just getting started or who want to improve their setup.&lt;/p&gt;

&lt;p&gt;I plan on covering an end-to-end project, from development to deployment on AWS, so there is a lot to cover. In an attempt to make this more manageable, I’ll break the series into multiple parts. In this segment we will set up the local environment.&lt;/p&gt;

&lt;h2&gt;Virtualization&lt;/h2&gt;

&lt;p&gt;It may happen that you have an idea for the next billion dollar, social, 2.0 cloud service and you want to just start coding immediately. So you download the tools you need and start hacking. Soon, you need to work with others and incompatibilities arise. Then you push to production and nothing seems to work. In our exuberance to build things, we sometimes forget the engineering part of our work: it’s all about the process.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;Firstly, embrace virtualization wholeheartedly. Virtualization, for the unfamiliar, allows you to run a full operating system within another operating system. What does this mean? It means you and your team can develop on Windows, OS X, or Linux but have all your code running within a consistent environment where the operating system and package versions are controlled. This means you don’t need to worry about compatibility across development machines, staging or production. The immediate cost is a slight learning curve, but the long-term payoff of consistent environments is that bugs are found more quickly and rarely reach production.&lt;/p&gt;

&lt;p&gt;To use virtual machines (VMs) in our workflow, we make use of &lt;a href="http://www.virtualbox.org/" target="_blank"&gt;VirtualBox&lt;/a&gt; which is a tool from Oracle (previously Sun). VirtualBox is a “hypervisor”, because it can supervise multiple operating systems. It interfaces those operating systems with the host system, and makes things like network and file access possible. However, dealing with VirtualBox can be a bit cumbersome and error-prone, so we use a cool tool called &lt;a href="http://vagrantup.com/" target="_blank"&gt;Vagrant&lt;/a&gt; to make development easier. Vagrant wraps around VirtualBox and uses a “Vagrantfile” to automatically set up and provision our virtual machines. This simplifies things and makes them much more repeatable, and we can use version control on our Vagrantfile to further button down our setup.&lt;/p&gt;

&lt;p&gt;I’ve created a GitHub repository to host our example project, called &lt;a href="https://github.com/armon/DjangoProjectExample" target="_blank"&gt;DjangoProjectExample&lt;/a&gt;. Here is an example &lt;a href="https://github.com/armon/DjangoProjectExample/blob/master/Vagrantfile" target="_blank"&gt;Vagrantfile&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
Vagrant::Config.run do |config|
  config.vm.box = "lucid32"

  # The url from where the 'config.vm.box' box will be fetched if it
  # doesn't already exist on the user's system.
  config.vm.box_url = "http://files.vagrantup.com/lucid32.box"

  # Assign this VM to a host only network IP, allowing you to access it
  # via the IP.
  config.vm.network :hostonly, "33.33.33.60"

  # Forward a port from the guest to the host, which allows for outside
  # computers to access the VM, whereas host only networking does not.
  config.vm.forward_port 80, 8080
  config.vm.forward_port 81, 7000

  # Share an additional folder to the guest VM.
  config.vm.share_folder("v2-data", "/project", "./")
end
&lt;/code&gt;
&lt;/pre&gt;

&lt;p&gt;The file is written in Ruby, but it is well documented. Our file is fairly basic: it instructs Vagrant to download a copy of the Ubuntu Lucid 32-bit image as the base of our VM, to make it available at IP 33.33.33.60, to forward some ports (so we can access our web server), and to share our project folder. The goal is to mask the fact that we are developing inside a virtual machine. By mapping our project folder into the VM, all our code is available to both us and the VM, and by forwarding the ports we can access localhost just like we normally would but have the connection forwarded to our VM.&lt;/p&gt;

&lt;p&gt;To get started using these tools, we need to get VirtualBox and Vagrant, and then bring our VM up. This is fairly straightforward. First, download VirtualBox from the official website. Vagrant can be installed via the gem system:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo gem install vagrant
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once this is done, we can manipulate our VM using vagrant:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ vagrant up        # Starts the VM
$ vagrant halt      # Stops the VM
$ vagrant status    # Shows the status
$ vagrant reload    # Restarts the VM
$ vagrant ssh       # SSH into the VM
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Those are the basic commands that are needed. You can read the documentation for advanced features and commands.
By doing &lt;code&gt;vagrant up&lt;/code&gt;, a headless VM will be configured and started in the background, ready for use.&lt;/p&gt;

&lt;h2&gt;Bootstrapping our Environment&lt;/h2&gt;

&lt;p&gt;At this point, we have managed to bring up a headless Ubuntu system. Now we need to set up our project so we can
start developing. There are a number of ways to do this. The one we are all most familiar with is manually SSH’ing
in and tweaking things until they work. While this works, it is often error prone and difficult to repeat.
There are a number of tools designed to automate this, including &lt;a href="http://www.opscode.com/chef/" target="_blank"&gt;Chef&lt;/a&gt;, &lt;a href="http://www.puppetlabs.com/" target="_blank"&gt;Puppet&lt;/a&gt;, and &lt;a href="http://docs.fabfile.org/" target="_blank"&gt;Fabric&lt;/a&gt;. Chef and Puppet are very sophisticated and extremely flexible, as they allow for centralized management, client-server configurations, and roles. However, those strengths add a fair bit of complexity, and they can be difficult to get started with.&lt;/p&gt;

&lt;p&gt;Somewhere between fully automated and completely manual is Fabric. Fabric is a command line tool for system administration. It allows you to define tasks in Python and have them run on any number of machines. This is my personal choice for small projects as it is very simple to get started with and has a shallow learning curve. Fabric is very flexible in its use, but I prefer to define a set of “environments”, and design my tasks to work in any environment. Environments are simple: they establish a list of hosts that should be configured and tweak various settings. I use a “vagrant”, “staging”, and “production” environment. If you hate the idea of using virtualization, you could write a new “local” environment that forgoes using VMs. In addition to environments, we take our routine commands and implement them as Fabric tasks. Examples are bootstrapping new hosts, updating the running code, restarting web servers, etc. Then we can simply compose our environments and tasks from the command line, and issue a command that basically says “download the latest code, and restart the production web servers”.&lt;/p&gt;
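
&lt;p&gt;To make the environment-plus-task idea concrete, here is a minimal sketch in plain Python 3. It is not Fabric’s API and not the project’s fabfile: the &lt;code&gt;ENVIRONMENTS&lt;/code&gt; table, the &lt;code&gt;run()&lt;/code&gt; helper, the task names and the staging/production host names are all made up for illustration. Only the vagrant IP comes from the Vagrantfile above.&lt;/p&gt;

```python
# Plain-Python sketch of the "environment + task" pattern; this is NOT Fabric's API.
# All task names and the staging/production host names are hypothetical.

ENVIRONMENTS = {
    "vagrant": ["33.33.33.60"],
    "staging": ["staging.example.com"],
    "production": ["web1.example.com", "web2.example.com"],
}


def update_code(host):
    "Stand-in for a task that would pull the latest code on a host"
    return "update code on %s" % host


def restart_web(host):
    "Stand-in for a task that would restart the web server on a host"
    return "restart web server on %s" % host


def run(environment, tasks):
    "Runs every task, in order, against every host of the chosen environment"
    log = []
    for task in tasks:
        for host in ENVIRONMENTS[environment]:
            log.append(task(host))
    return log


for line in run("production", [update_code, restart_web]):
    print(line)
```

&lt;p&gt;The point of the split is that a task never hard-codes a host: the same two tasks work unchanged against “vagrant” or “staging” just by naming a different environment.&lt;/p&gt;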

&lt;p&gt;Just as Vagrant used a Vagrantfile, Fabric uses a fabfile (fabfile.py specifically). &lt;a href="https://github.com/armon/DjangoProjectExample/blob/master/fabfile.py" target="_blank"&gt;Here&lt;/a&gt; is the one for our project. Don’t be overwhelmed: there is a lot there, but most of it is very mechanical. The command we are immediately interested in is &lt;code&gt;setup_vagrant&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def setup_vagrant():
  "Bootstraps the Vagrant environment"
  require('hosts', provided_by=[vagrant])
  sub_stop_processes()   # Stop everything
  sub_install_packages() # Get the installed packages
  sub_build_packages()   # Build some stuff
  sub_get_virtualenv()   # Download virtualenv
  sub_make_virtualenv()  # Build the virtualenv
  sub_vagrant_link_project() # Links the project in
  sub_get_requirements() # Get the requirements (pip install)
  sub_get_admin_media()  # Copy Django admin media over
  sudo("usermod -aG vagrant www-data") # Add www-data to the vagrant group
  sub_copy_memcached_config() # Copies the memcache config
  sub_start_processes()  # Start everything
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is the basic command that we issue when we want to provision our vagrant instance. As the function names and comments indicate, we stop all the running services first, install and build the packages we need, get everything in order, and start up the processes. The installed packages are pretty basic: NTP to keep our time updated, Python for obvious reasons, git, gcc, and memcached. These packages are relatively up-to-date in the package manager, so we don’t need to build them ourselves. The packages we need to build are &lt;a href="http://nginx.org/" target="_blank"&gt;Nginx&lt;/a&gt;, which is a web server, and &lt;a href="http://projects.unbit.it/uwsgi/" target="_blank"&gt;uWSGI&lt;/a&gt;, which is a Python application server. We use Nginx as it is very performant for serving static content and handling a large number of connections. For the dynamic content, Nginx passes the request to uWSGI, which is a fast and stable application server that works very well for Django. Together, these are the basic ingredients of any front-end server.&lt;/p&gt;

&lt;p&gt;In addition to the basics, there are two nifty tools that are integrated in our setup flow. The first is &lt;a href="http://pypi.python.org/pypi/virtualenv" target="_blank"&gt;VirtualEnv&lt;/a&gt;, which is used to build an isolated Python environment. What that means is that we can have a Python environment with total control over the versions of libraries that are installed. It is common to work on projects that have different version requirements (Django 1.2 vs Django 1.3), and these are very difficult to resolve if dependencies are installed at a system level. Instead of installing requirements into our system site-packages directory, VirtualEnv creates a “virtual environment” (shocker, I know), and installs libraries in there.
When we execute &lt;code&gt;sub_get_virtualenv()&lt;/code&gt; it downloads the VirtualEnv package, and &lt;code&gt;sub_make_virtualenv()&lt;/code&gt; creates a new virtual environment for us at /server/env.example.com. Because of the isolation this provides, we could easily have another project at /server/env.foobar.com with conflicting dependencies. For the developer, once this is set up it is completely transparent, so there is really no reason not to do it. As we will see, we can use this to control the versions we deploy to production to ensure we don’t run into conflicts.&lt;/p&gt;
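
&lt;p&gt;As a rough illustration of what “an isolated environment in its own directory” means, the sketch below uses the &lt;code&gt;venv&lt;/code&gt; module from the Python 3 standard library. This is a swapped-in stand-in, not the VirtualEnv package that the fabfile downloads, and the temporary directory is an assumption so the example can run anywhere instead of writing to /server.&lt;/p&gt;

```python
import os
import tempfile
import venv

# Hypothetical location; the post uses /server/env.example.com instead.
env_dir = os.path.join(tempfile.mkdtemp(), "env.example.com")

# with_pip=False keeps the sketch fast and offline; a real setup would want pip.
venv.create(env_dir, with_pip=False)

# Every environment carries its own pyvenv.cfg and its own site-packages tree.
print(os.path.exists(os.path.join(env_dir, "pyvenv.cfg")))
```

&lt;p&gt;A second call with a different directory would give a second, fully independent environment, which is what lets two projects pin conflicting library versions side by side.&lt;/p&gt;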

&lt;p&gt;Once we have our virtual environment, we need to install the frameworks we use for development, the main one being Django. Continuing our trend of automation, we use &lt;a href="http://pypi.python.org/pypi/pip" target="_blank"&gt;PIP&lt;/a&gt;, which is a package installer for Python. Not coincidentally, PIP understands virtual environments and can install directly into our folder instead of at the system level. When Fabric calls &lt;code&gt;sub_get_requirements()&lt;/code&gt; we are invoking PIP with our &lt;a href="https://github.com/armon/DjangoProjectExample/blob/master/requirements.txt" target="_blank"&gt;requirements.txt&lt;/a&gt; file. For our simple example, we rely on &lt;a href="https://www.djangoproject.com/" target="_blank"&gt;Django&lt;/a&gt;, &lt;a href="http://south.aeracode.org/" target="_blank"&gt;South&lt;/a&gt;, which provides migrations for our models, and &lt;a href="http://pypi.python.org/pypi/pylibmc" target="_blank"&gt;pylibmc&lt;/a&gt;, which is a memcache client.&lt;/p&gt;

&lt;p&gt;There is a lot to internalize, and although we are adding complexity in the configuration and setup of projects, this is a one-time cost. Instead of needing to remember how to set things up or documenting them haphazardly, we have a self-documenting fabfile. As long as you avoid by-hand configuration and always add each step to the fabfile, you never need to worry about forgetting one. All this enables us to provision a system with a single command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ fab vagrant setup_vagrant default_project
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will select “vagrant” as our environment, meaning the only host we are provisioning is our local VM. Then by chaining setup_vagrant we will run all the commands to set up our environment. Lastly, when default_project runs it will create a blank generic Django project for us.&lt;/p&gt;

&lt;p&gt;Now, if we point our browser to http://localhost:8080, we see:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_lmi9ffCTrZ1qgbach.png" alt="It worked!"/&gt;&lt;/p&gt;

&lt;p&gt;Let’s summarize the commands that were necessary for us to set all of this up:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ vagrant up
$ fab vagrant setup_vagrant default_project
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;By putting our effort into defining our environment in a tangible way, we can make use of tools such as Vagrant and Fabric to make setting up easy and portable. If we need to onboard a new developer, we can now have the entire stack up and running in 20 minutes. I won’t deny that there is a learning curve involved with all these tools, but the benefits truly do outweigh the costs.&lt;/p&gt;

&lt;p&gt;I hope that our setup can be used as an example of how to add rigor and automation to the development process. There are some additional details that I did not discuss, but everything is available in the git project. In the next part, we will dig into setting up our Django project and doing some development.&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://github.com/armon/DjangoProjectExample" target="_blank"&gt;GitHub Project&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.virtualbox.org/" target="_blank"&gt;VirtualBox&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://vagrantup.com/" target="_blank"&gt;Vagrant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://docs.fabfile.org/" target="_blank"&gt;Fabric&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://nginx.org/" target="_blank"&gt;Nginx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://projects.unbit.it/uwsgi/" target="_blank"&gt;uWSGI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://pypi.python.org/pypi/virtualenv" target="_blank"&gt;VirtualEnv&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://pypi.python.org/pypi/pip" target="_blank"&gt;PIP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description><link>http://armondadgar.com/post/7031565232</link><guid>http://armondadgar.com/post/7031565232</guid><pubDate>Tue, 28 Jun 2011 19:56:00 -0700</pubDate><category>Django</category><category>Python</category><category>Vagrant</category><category>Web Development</category><category>Fabric</category></item><item><title>Hierarchical Clustering of Facebook Friends</title><description>&lt;p&gt;Studying Artificial Intelligence has exposed me to the many sub-fields of research in the area. Machine Learning, which is the study of algorithms that allow computers to learn and evolve based on the data they process, is particularly interesting. Unsatisfied with my shallow theoretical understanding, I recently ordered O’Reilly’s &lt;a href="http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325" target="_blank"&gt;Programming Collective Intelligence&lt;/a&gt; book, which provides hands-on experience with practical examples and code.&lt;/p&gt;

&lt;p&gt;I have a few projects in mind that need to perform various forms of clustering, so I decided to try a quick project to get my hands dirty. First, I needed a source of data that I could cluster in a sane way. It struck me that it would be interesting to cluster Facebook friends based on the similarity of status updates. We just need to fetch some status updates for all (or some) of our friends, and build up from there.&lt;/p&gt;

&lt;p&gt;The first step is to use the &lt;a href="https://developers.facebook.com/docs/reference/api/" target="_blank"&gt;Facebook Graph API&lt;/a&gt; to fetch our friends. We need to get the base URL, and our OAuth access token. To get both, I just go to the API reference page, and copy the access token from the URLs of the examples. Let’s first get our friends:&lt;/p&gt;

&lt;!-- more --&gt;

&lt;pre&gt;&lt;code&gt;
import httplib2
import json

ACCESS_TOKEN = "2227470867|..."
FRIEND_URL = "https://graph.facebook.com/me/friends?access_token=%s" % ACCESS_TOKEN

def get_friends():
  "Gets all of the friends. Returns a Name -&amp;gt; Id mapping"
  http = httplib2.Http(timeout=10) 
  resp, raw = http.request(FRIEND_URL, "GET")
  data = json.loads(raw)["data"]
  return dict([(x["name"], x["id"]) for x in data])

&amp;gt;&amp;gt; friends = get_friends()
&amp;gt;&amp;gt; friends.items()[:3]
[("Micky Mouse", "12345"), ("Minnie Mouse", "13252"), ("Donald Duck", "123451")]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we have a mapping of friend names to their Facebook IDs. We can use this to scrape status updates from their walls. Again, we use the Facebook Graph API. We access the feed resource of each friend, using their user ID and our access token.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
WALL_URL_FORM = "https://graph.facebook.com/%(id)s/feed?access_token=%(access)s"

def get_friend_wall_posts(friends, num_pages=1):
  "Returns all the posts made on friends walls for the number of wall pages"
  all_data = []
  http = httplib2.Http(timeout=10)

  # Download walls for each friend
  for friend_name, friend_id in friends.items():
    current_url = WALL_URL_FORM % {"id":friend_id, "access":ACCESS_TOKEN}
    for page in xrange(num_pages):
      resp, raw = http.request(current_url, "GET")
      raw = json.loads(raw)
      data = raw["data"]
      for post in data:
        all_data.append(post)
      print "Added %d posts from %s" % (len(data), friend_name)

      # Go to the next page
      try:
        current_url = raw["paging"]["next"]
      except KeyError:  # no "paging"/"next" entry means this was the last page
        break

  return all_data

&amp;gt;&amp;gt; raw_posts = get_friend_wall_posts(friends, 3)
Added 22 posts from Donald Duck
Added 24 posts from Micky Mouse
Added 20 posts from Minnie Mouse
....
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At this point we have everything we need. However, we need to slice and dice the data a little bit to get it into a more usable format. What we currently have are the “raw” posts, which contain comments, images, links, albums, and other things that are not useful for our purposes. We only need the status message text, so let’s cut away everything else.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def filter_data(data):
  "Filters data to just get id, name, and message for status updates"
  filtered = []
  for entry in data:
    try:
      if entry["type"] != "status":
        continue
      id = entry["id"]
      name = entry["from"]["name"]
      message = entry["message"].lower()
      filtered.append((id, name, message))
    except:
      print "Failed to parse: ",entry

  return filtered

def filter_friends_only(friends, filtered):
  # Keep only posts whose author name (post[1]) is a key in the friends dict
  return [post for post in filtered if post[1] in friends]

&amp;gt;&amp;gt; filtered = filter_data(raw_posts)
&amp;gt;&amp;gt; filtered_friends = filter_friends_only(friends, filtered)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first function, &lt;code&gt;filter_data()&lt;/code&gt;, is used to just get status updates, and to simplify the data representation to a tuple of (message id, user name, message text). The second function, &lt;code&gt;filter_friends_only()&lt;/code&gt;, only retains status updates that our friends make, and not any posts others have made to our friends’ walls.&lt;/p&gt;

&lt;p&gt;Now that we have all the data we need, we must “normalize” it in some way so that we can compare people against each other. The simplest way to do this is to determine the domain of all used words, prune out ones that occur very rarely (only once or twice), and those that occur with high frequency (it, the, a, etc.), and use the remaining words as a list of key words. We can count how frequently each friend uses those words, and use this frequency vector to compare them. First, we can construct our domain and do some frequency counts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import re
from collections import defaultdict

def get_words(filtered):
  """
  Processes status updates to return 2 dictionaries.
  The first dictionary returns a mapping of words to a set
  of message id's that contain the word.

  The second dictionary returns a mapping of people to a
  dictionary of words -&amp;gt; occurrences
  """
  p = re.compile("[^a-zA-Z]+")

  all_words = defaultdict(set)
  person_count = {}

  for id,name,mesg in filtered:
    person_count.setdefault(name, defaultdict(int))

    words = p.split(mesg)
    for word in words:
      all_words[word].add(id)
      person_count[name][word]+=1

  return all_words,person_count

def filter_words(total_mesg, all_words, person_data, max_occur=0.02, min_occur=0.0007):
  """
  Filters the dictionaries to only contain words that have
  a specified minimum and maximum occurrence proportion.

  Returns a list of the "good" words, and a modified person dictionary.
  """
  good_words = []
  props = []
  for word,messages in all_words.items():
    proportion = len(messages) / (1.0*total_mesg)
    props.append((proportion, word))
    if proportion &amp;lt; max_occur and proportion &amp;gt; min_occur:
      good_words.append(word)
  props.sort()
  print props

  persons = dict([(person, dict([(word, person_data[person][word]) for word in good_words])) for person in person_data])
  return good_words, persons

&amp;gt;&amp;gt; all_words, person_data = get_words(filtered)
&amp;gt;&amp;gt; print "Total words: %d, Total People: %d" % (len(all_words), len(person_data))
Total words: 9009, Total People: 172

&amp;gt;&amp;gt; all_words, person_data = filter_words(len(filtered), all_words, person_data)
[(0.00027601435274634281, u'aaa'), ...., (0.32514490753519182, u'i')]
&amp;gt;&amp;gt; print "Filtered words: %d, Filtered People: %d" % (len(all_words), len(person_data))
Filtered words: 2171, Filtered People: 172
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are a few values worth tuning for each run. The main ones are &lt;em&gt;max_occur&lt;/em&gt; and &lt;em&gt;min_occur&lt;/em&gt;, which are the arguments to &lt;code&gt;filter_words()&lt;/code&gt;. Those are used to prune the set of words that are used to do the matching, and have a large impact. In my case, I found that there was very little overlap in words in general, so the values had to be very low. I just eyeballed those values and set &lt;em&gt;max_occur&lt;/em&gt; to the point where very common words were appearing and &lt;em&gt;min_occur&lt;/em&gt; somewhat above the threshold for a single use. This leaves a fairly reasonable set of about 2K words to compare people with.&lt;/p&gt;

&lt;p&gt;The algorithm that performs the clustering is pretty straightforward. We start out with each person as part of their own “cluster”. We compute the distance between each pair of clusters, and merge the two clusters that are most similar. We repeat this process until we have only a single root cluster remaining. The key ingredients to make this work are a way to represent a cluster and a way to compute the distance between clusters. A cluster is very simple: it contains a frequency vector of word occurrences, and, for interior clusters, the two children that were merged to form it. I am borrowing the &lt;code&gt;bicluster&lt;/code&gt; class from the book:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance

def vectorize_dict(key_ordered, counts):
  "Vectorizes the values of a dictionary by iterating in a specific order"
  return [counts[key] for key in key_ordered]

&amp;gt;&amp;gt; person_clusters = [bicluster(vectorize_dict(all_words, person_data[p]), id=p) for p in person_data.keys()]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are a handful of methods we could use to compute the distance, but one simple and useful metric is the Pearson correlation coefficient. It is used to compute the strength of a linear fit to data. It returns a value between -1 and 1, where 0 indicates no fit, 1 indicates a strong positive fit, and -1 a strong negative fit. To use it as a measure of distance between our clusters, we return 1 minus the coefficient, so that strongly correlated clusters end up with a distance close to 0. Here is a simple function to compute the distance:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
from math import sqrt
def pearson(v1,v2):
  sum1 = 0
  sum2 = 0
  sum1Sq = 0
  sum2Sq = 0
  pSum = 0

  # Compute all the sums
  for i in xrange(len(v1)):
    val1 = v1[i]
    val2 = v2[i]
    sum1 += val1
    sum2 += val2
    sum1Sq += val1*val1
    sum2Sq += val2*val2
    pSum += val1*val2

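  # Calculate r (Pearson score); a float length avoids integer division on word counts
  n = float(len(v1))
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  # Return 1 - r so that more similar vectors give a smaller distance
  return 1.0-num/den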
&lt;/code&gt;&lt;/pre&gt;
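
&lt;p&gt;As a quick sanity check of the distance, here is a compact Python 3 restatement of the same formula run on two tiny hand-made vectors. The name &lt;code&gt;pearson_distance&lt;/code&gt; and the sample vectors are mine, chosen only for illustration: vectors that rise together should come out near 0, and vectors that move in opposite directions near 2.&lt;/p&gt;

```python
from math import sqrt


def pearson_distance(v1, v2):
    "Returns 1 - r, where r is the Pearson correlation of v1 and v2"
    n = float(len(v1))
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(x * x for x in v2)
    p_sum = sum(x * y for x, y in zip(v1, v2))

    num = p_sum - (sum1 * sum2 / n)
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        return 0.0  # same convention as the function above
    return 1.0 - num / den


print(pearson_distance([1, 2, 3], [2, 4, 6]))  # same trend, distance near 0
print(pearson_distance([1, 2, 3], [3, 2, 1]))  # opposite trend, distance near 2
```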

&lt;p&gt;We now have our clusters, and a means of measuring distance, so we need the actual function to perform the clustering. This is mostly from the book, but I will reproduce it here with my minor modifications:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
def hcluster(clust,distance=pearson):
  distances={}
  currentclustid=-1

  while len(clust)&amp;gt;1:
    lowestpair=(0,1)
    closest=distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in xrange(len(clust)):
      for j in xrange(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances: 
          d = distances[(clust[i].id,clust[j].id)] = distance(clust[i].vec,clust[j].vec)
        else:
          d = distances[(clust[i].id,clust[j].id)]

        if d&amp;lt;closest:
          closest=d
          lowestpair=(i,j)

    # calculate the average of the two clusters
    mergevec=[
    (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 
    for i in xrange(len(clust[0].vec))]

    # create the new cluster
    newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                         right=clust[lowestpair[1]],
                         distance=closest,id=currentclustid)


    # cluster ids that weren't in the original set are negative
    currentclustid-=1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]

&amp;gt;&amp;gt; person_root = hcluster(person_clusters)
&lt;/code&gt;&lt;/pre&gt;
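
&lt;p&gt;Before reaching for an image library, a plain-text dump of the tree can be a handy first look. The sketch below is Python 3 and stands on its own: &lt;code&gt;Node&lt;/code&gt; is a minimal stand-in for the &lt;code&gt;bicluster&lt;/code&gt; class with the same attribute names, and the three-person tree is invented for the example rather than taken from my results.&lt;/p&gt;

```python
from collections import namedtuple

# Minimal stand-in for the bicluster class above (same attribute names).
Node = namedtuple("Node", "id left right distance")


def leaf_ids(node):
    "Returns the ids of the leaves under node, left to right"
    if node.left is None and node.right is None:
        return [node.id]
    return leaf_ids(node.left) + leaf_ids(node.right)


def format_tree(node, depth=0):
    "Returns one indented line per cluster; merged clusters are shown as '-'"
    if node.left is None and node.right is None:
        return ["  " * depth + str(node.id)]
    lines = ["  " * depth + "-"]
    lines += format_tree(node.left, depth + 1)
    lines += format_tree(node.right, depth + 1)
    return lines


# Hypothetical three-person tree: a and b merged first, then joined by c.
a = Node("a", None, None, 0.0)
b = Node("b", None, None, 0.0)
c = Node("c", None, None, 0.0)
root = Node(-2, Node(-1, a, b, 0.1), c, 0.4)

print("\n".join(format_tree(root)))
```

&lt;p&gt;Because the functions only touch &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;left&lt;/code&gt; and &lt;code&gt;right&lt;/code&gt;, they should work on the root returned by &lt;code&gt;hcluster()&lt;/code&gt; as well, once ported to whichever Python version you are running.&lt;/p&gt;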

&lt;p&gt;At long last! We have clustered our friend group. It is rather difficult to visualize our results in the current form, so we make use of the Python Imaging Library (PIL), to generate a basic visualization of our cluster. I will omit this code, but it is available in the &lt;a href="https://gist.github.com/982485" target="_blank"&gt;source&lt;/a&gt;. My friends don’t have unusually strange names, I’ve just anonymized my results for privacy.&lt;/p&gt;

&lt;p&gt;A partial sub-tree showing the clustering at the leaves of the tree:&lt;br/&gt;&lt;img src="http://media.tumblr.com/tumblr_llhfm0B7QC1qgbach.png" alt="Partial Sub-tree"/&gt;&lt;br/&gt;&lt;/p&gt;

&lt;p&gt;The full tree, showing the general structure of the tree: &lt;br/&gt;&lt;img src="http://media.tumblr.com/tumblr_llhfmghlTl1qgbach.jpg" alt="Full Tree"/&gt;&lt;br/&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://imgur.com/KWnvz?full" target="_blank"&gt;Here is a link to the full-sized version&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Some of the clusters will make more sense than others, based on friend groups and such. Others cannot be clustered in any meaningful way due to lack of data. A critical factor in getting good results is having enough data from all of your friends. In my tests I tried getting only the first page worth of status updates for each friend which resulted in a terrible clustering. Getting the first three pages dramatically improved things, but there is still room for improvement. One of the limitations is how much data you can fetch from Facebook without getting rate-limited.&lt;/p&gt;

&lt;p&gt;It is clear that we are only scratching the surface of what can be used to cluster friends. You could go further and use pages that friends have liked, pull in biographies, favorite quotes, music interests, or even mine links and videos that they share. Those are outside the scope of my simple example, but it should be relatively clear how to do that based on these techniques.&lt;/p&gt;

&lt;p&gt;If you want to try playing with the code, it is &lt;a href="https://gist.github.com/982485" target="_blank"&gt;posted in its entirety with some slight modifications on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy Clustering!&lt;/p&gt;</description><link>http://armondadgar.com/post/5662017226</link><guid>http://armondadgar.com/post/5662017226</guid><pubDate>Fri, 20 May 2011 00:32:00 -0700</pubDate><category>Machine Learning</category><category>Artificial Intelligence</category><category>Python</category><category>Clustering</category><category>Facebook</category></item><item><title>Open Sourced: A tale of two Priority Queue's</title><description>&lt;p&gt;One of my favorite data structures is the binary heap. I first learned about it in my data structures class, and remember marveling at its simplicity and elegance. How could something so simple be so powerful and useful?&lt;/p&gt;



&lt;p&gt;For those unfamiliar with the binary heap, it is a relatively simple data structure. It is a binary tree, which means each node has at most 2 children. On top of this, there are several constraints which are imposed. One is that the children of each node must have a value that is greater than or equal to the value of their parent. This means we have some sort of loose ordering, where nodes closer to the root have a lower value. Secondly, the tree must be &lt;em&gt;complete&lt;/em&gt;. This means that all levels of the tree except for the last level must be full, and the last level must be filled left-to-right. As described, this would be a Min-Heap, since the minimum value is always at the root. If we were to reverse our comparisons, so that the children must be less than or equal to the current node, we would have a Max-Heap with the maximum value at the root.&lt;/p&gt;



&lt;p&gt;One of the primary uses of a Min/Max Heap is as a Priority Queue. If one inserts nodes keyed on their &amp;#8220;priority&amp;#8221;, then the node with the highest priority is always at the root.&lt;/p&gt;
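
&lt;p&gt;For a feel of how this looks in practice, here is a small Python 3 sketch using the standard library &lt;code&gt;heapq&lt;/code&gt; module, which maintains a Min-Heap inside an ordinary list. The names and ages are made up for the example.&lt;/p&gt;

```python
import heapq

# Hypothetical (age, name) entries; tuples compare by age first.
people = [(34, "Carol"), (19, "Alice"), (27, "Bob"), (52, "Dave")]

heap = []
for person in people:
    heapq.heappush(heap, person)

youngest = heap[0]  # peek at the root without removing it
in_order = [heapq.heappop(heap) for _ in range(len(heap))]
print(youngest, in_order)
```

&lt;p&gt;Popping repeatedly hands the entries back youngest first, which is exactly the behaviour a priority queue needs.&lt;/p&gt;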



&lt;p&gt;Below is an example of a Min-Heap, where each node represents a person&amp;#8217;s age and name. We can see that there is a rough ordering with the youngest person at the root.&lt;/p&gt;



&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_ljrmlwj5991qgbach.png"/&gt;&lt;/p&gt;



&lt;!-- more --&gt;

&lt;p&gt;What makes binary heaps very interesting are their performance characteristics. In the case of a Min-Heap, we can find the minimum element in O(1), which is useful for implementing a Priority Queue. We can also insert a node and remove the minimum in O(log n). Insertion is done by adding the new node at the bottom of the tree (remember it must be filled left-to-right), and then repeatedly swapping it with its parent node while it has a lower value. Since this is a complete binary tree, the height is at most log n, so we need to do at most log n swaps per insert. Deletion is done by moving the last node to the root, and repeatedly swapping it with the child that has a lower value. Again, we need to do at most log n swaps down any given path of the tree.&lt;/p&gt;



&lt;p&gt;This property means that a binary heap can be used to perform sorting in O(n log n) which competes with the best sorting algorithms. So in summary, binary heaps give us:&lt;/p&gt;



&lt;ul&gt;&lt;li&gt;O(1) time for the min/max value&lt;/li&gt;
&lt;li&gt;O(log n) insertion / deletion time&lt;/li&gt;
&lt;li&gt;O(n log n) as a sorting implementation&lt;/li&gt;
&lt;li&gt;Dead simple implementation (Ever tried an AVL tree?)&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Okay, so at this point we know that binary heaps are pretty cool. But it gets better. If we had to implement a binary heap as a tree, we would need each node to have pointers to its children and its parent, and we would waste several words per node. However, because binary heaps must be complete, they can be implemented as an array. The array is laid out as a breadth first pass of the tree, so the root node is first, then all the nodes at depth 1, all the nodes at depth 2, etc. This means the nodes that we are inserting always go at the end of the array, since they are filling the final depth left-to-right.&lt;/p&gt;
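
&lt;p&gt;The array layout and the two swapping procedures can be sketched in a few lines of Python 3. This is an illustration only, not the C library discussed below: the function names are mine, and it relies on the usual index arithmetic for a breadth first layout, where the node at index i has its parent at (i - 1) // 2 and its children at 2i + 1 and 2i + 2.&lt;/p&gt;

```python
def sift_up(heap, i):
    "Swaps heap[i] with its parent until the parent is no larger"
    while i != 0:
        parent = (i - 1) // 2
        if min(heap[parent], heap[i]) == heap[parent]:
            break  # parent is already the smaller (or equal) value
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent


def sift_down(heap, i):
    "Swaps heap[i] with its smallest child until neither child is smaller"
    while True:
        # Children of index i live at 2i+1 and 2i+2, if those slots exist
        family = [k for k in (i, 2 * i + 1, 2 * i + 2) if k in range(len(heap))]
        smallest = min(family, key=lambda k: heap[k])
        if smallest == i:
            break
        heap[smallest], heap[i] = heap[i], heap[smallest]
        i = smallest


def push(heap, value):
    heap.append(value)  # new node fills the last level left-to-right
    sift_up(heap, len(heap) - 1)


def pop_min(heap):
    heap[0], heap[-1] = heap[-1], heap[0]  # move the last node to the root
    smallest = heap.pop()
    if heap:
        sift_down(heap, 0)
    return smallest


heap = []
for value in [5, 3, 8, 1, 9, 2]:
    push(heap, value)
print([pop_min(heap) for _ in range(len(heap))])
```

&lt;p&gt;Pushing every value and then popping until the heap is empty yields the values in ascending order, which is the O(n log n) sort mentioned in the list above.&lt;/p&gt;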



&lt;p&gt;I needed a quick refresher on C, so I decided to implement a Min-Heap in C. The results can be found on &lt;a title="c-minheap-array" href="https://github.com/armon/c-minheap-array" target="_blank"&gt;GitHub here&lt;/a&gt;. It is a simple C library that implements the Min-Heap using an array, and supports resizing (growing and shrinking). The approach is to double or halve the array as we need more or less space. What bothered me about this implementation was the memory it wastes. Because the array grows simply by doubling, it may be that I only needed a few more nodes but have now allocated space for millions. In addition, resizing requires an expensive copy operation which runs in time proportional to the length of the array. I wanted to see if there were any clever tricks that could fix this.&lt;/p&gt;



&lt;p&gt;What I decided to do was to make use of a little indirection. Instead of using a plain array, I decided to make use of a small table which points to other arrays which hold the actual entries. This looks something like this:&lt;/p&gt;



&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_ljrnn6hNfH1qgbach.png"/&gt;&lt;/p&gt;



&lt;p&gt;The idea is to make use of low level virtual memory APIs to manage the memory used for the heap. Modern operating systems all operate on virtual memory, which is basically an abstraction on physical memory. All processes are provided with a memory address space that starts at 0 and goes to 2^32 or 2^64 depending on the CPU architecture, even though most systems do not have that much memory available. The memory space a process has is &amp;#8220;virtual&amp;#8221; because it does not directly correspond to physical memory, and in fact most of the memory space is not mapped to RAM and will cause an exception if the program ever addresses it (the infamous segmentation fault). Instead, memory is typically divided into &amp;#8220;pages&amp;#8221;, which are smaller chunks, usually 4K in size, and pages are mapped to physical memory as needed. This is done using low level system calls such as brk() and mmap(), which library routines like malloc() are built on. These system calls instruct the kernel to map additional pages of virtual memory onto physical memory, so that the program can safely make use of them without crashing.&lt;/p&gt;
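
&lt;p&gt;To get a feel for page-level allocation without writing any C, the Python 3 standard library exposes the same facility through its &lt;code&gt;mmap&lt;/code&gt; module. The sketch below asks for a single anonymous page and writes to it. It is only meant to show the idea of mapping memory one page at a time, and is not taken from either C library.&lt;/p&gt;

```python
import mmap

# Ask the OS for one page of anonymous memory (not backed by any file).
page = mmap.mmap(-1, mmap.PAGESIZE)

page[0:5] = b"hello"      # the page is now safe to read and write
first_bytes = page[0:5]
page_size = len(page)

page.close()              # hand the page back to the OS
print(page_size, first_bytes)
```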



&lt;p&gt;We can make use of the same methods to implement our heap more efficiently, so I created a separate &lt;a title="c-minheap-indirect" href="https://github.com/armon/c-minheap-indirect" target="_blank"&gt;GitHub project&lt;/a&gt; to do just that. Now, instead of allocating a large array, we initially map in a single page to use as our mapping table, and a single page to hold our heap entries. Each node in the heap requires 2 words (8 bytes on a 32-bit system, or 16 bytes on a 64-bit system). Since we have 4K pages, we can store 512 nodes per page on a 32-bit system or 256 on a 64-bit system. Our mapping table needs a single word to reference each page, so each page of the table can hold 1024 references on a 32-bit system or 512 on a 64-bit system.&lt;/p&gt;
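&lt;p&gt;To make the arithmetic concrete, here is a minimal sketch in C of how a logical heap index could be translated into a mapping table slot and an offset inside that page. The names (HEAP_PAGE_SIZE, struct node, struct location, locate) and the key/value layout of a node are assumptions made for illustration and may not match the c-minheap-indirect code. The sketch also assumes a fixed 4K page rather than asking the operating system for the page size.&lt;/p&gt;

```c
/* Assumed page size of 4K; a real program would query the OS instead. */
#define HEAP_PAGE_SIZE 4096UL

/* A heap node of two machine words. The key/value split is an assumption. */
struct node {
    void *key;
    void *value;
};

/* Where a logical heap index lives (hypothetical helper type). */
struct location {
    unsigned long table_slot; /* which entry of the mapping table */
    unsigned long offset;     /* which node inside that page */
};

/* Nodes per page: 512 with 4-byte pointers, 256 with 8-byte pointers. */
static unsigned long nodes_per_page(void)
{
    return HEAP_PAGE_SIZE / sizeof(struct node);
}

/* Page references per page of the mapping table: 1024 or 512. */
static unsigned long refs_per_page(void)
{
    return HEAP_PAGE_SIZE / sizeof(void *);
}

/* Split a logical index into a table slot and an offset in that page. */
static struct location locate(unsigned long index)
{
    struct location loc;
    loc.table_slot = index / nodes_per_page();
    loc.offset = index % nodes_per_page();
    return loc;
}
```

&lt;p&gt;With 8-byte pointers, for example, index 259 would land in table slot 1 at offset 3, since each page holds 256 nodes.&lt;/p&gt;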



&lt;p&gt;The major benefit of this new architecture is that resizing is now extremely cheap, and memory utilization is much better. Previously, when we needed more space we would reallocate a whole new array and copy over all the previous entries, which would take O(n) time. Now, we can map in a single page of memory and add a reference to it in our mapping table. We have to be careful to resize the mapping table if it ever runs out of space, but that is cheap due to its small size. Because we only add new pages on an as-needed basis, the most space we waste is about a single page, or 4K. This is as opposed to potentially wasting hundreds of megabytes or gigabytes of space. Similarly, when we are downsizing, we just unmap pages and avoid the copying that the array implementation had to do.&lt;/p&gt;
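&lt;p&gt;As a rough illustration of the difference in wasted space, the sketch below counts the unused slots in both schemes. The function names are hypothetical, and it assumes the array doubles exactly at the moment it is full and that pages are added one at a time.&lt;/p&gt;

```c
/*
 * Unused slots right after a full array of old_capacity slots doubles
 * to make room for one more entry (assumed growth policy).
 */
static unsigned long wasted_after_doubling(unsigned long old_capacity)
{
    unsigned long new_capacity = 2 * old_capacity;
    unsigned long used = old_capacity + 1;
    return new_capacity - used;
}

/*
 * Unused slots in the last page when used entries are stored in pages
 * of per_page nodes each, mapping only as many pages as needed.
 */
static unsigned long wasted_with_pages(unsigned long used,
                                       unsigned long per_page)
{
    unsigned long pages = (used + per_page - 1) / per_page; /* round up */
    return pages * per_page - used;
}
```

&lt;p&gt;For example, an array that doubles from 1,048,576 slots to hold one more entry leaves 1,048,575 slots unused, while the paged layout with 256 nodes per page leaves 255 unused for the same number of entries.&lt;/p&gt;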



&lt;p&gt;One thing that is unclear is the performance implications of our change. It is simple to see that by adding our indirection table, we now need to follow an extra pointer to get to the data for every access of a heap entry. This is disconcerting because the speed of most applications is bounded by memory access and not CPU speed, since modern CPUs have far outstripped memory speed while memory latency has remained roughly constant. As an example, most instructions only take a few cycles to execute (add, shift, multiply), while a memory access that misses the cache might take over 300 CPU cycles. In this context, we might be worried that the indirection table interferes with cache locality and introduces a dramatic slowdown. To test this, I decided to perform a simple benchmark. I took both implementations, and timed them using the provided test program with input sizes of 1, 5, 10, 50, and 100 million nodes using &lt;a title="Benchmark Script" href="https://gist.github.com/630a6a074d1a5ee8d2b0" target="_blank"&gt;this script&lt;/a&gt;. The test generates a pseudo-random set of keys and inserts them into the Min-Heap and then extracts them, checking that the order is correct. I averaged the results of 3 runs per input size, and graphed the averages:&lt;/p&gt;



&lt;p&gt;&lt;img src="http://media.tumblr.com/tumblr_ljrp3bjXN91qgbach.png"/&gt;&lt;/p&gt;



&lt;p&gt;Okay, so what do these numbers tell us? It is important to note that we are graphing on a logarithmic scale, so don&amp;#8217;t let that fool you. It is clear that the implementation based on the indirection table is slower in every case, and roughly averages a 30% slowdown. The array implementation is absent from the final test of 100 million nodes because it is unable to allocate an array large enough during a resize to hold all the nodes. The indirection version is able to complete since it uses no more space than it actually needs.&lt;/p&gt;



&lt;p&gt;The complexity of modern processors makes it hard to reason about the performance numbers. On one hand, we expect the O(n) copy time for the array implementation to be problematic, but that cost is amortized over the majority of the operations. For the indirection table, we might expect a very large penalty for the extra pointer dereference, but with large amounts of cache the indirection table may be largely cached, so we avoid paying the penalty for most accesses, especially near the root. Even with numbers in hand, it is hard to recommend one implementation over another. In situations where memory is limited, the indirection version is more suitable. Additionally, if a system needs soft real-time characteristics, then the indirection table is a better choice as it will never perform a large copy on any single operation. However, if raw speed is all that matters, then the array implementation is the winner. If the maximum number of elements is known in advance, then the array can be pre-allocated to that size and avoid the copy operations altogether, making it even faster (the indirection table can do the same, and pre-allocate the pages, saving the allocation time).&lt;/p&gt;



&lt;p&gt;I suppose the moral is never assume anything about the performance implications of a particular implementation. Always test and decide based on what criteria are most important to your application, as it&amp;#8217;s not one size fits all.&lt;/p&gt;



&lt;p&gt;Feel free to ask me any questions!&lt;/p&gt;



&lt;ul&gt;&lt;li&gt;GitHub: &lt;a title="-minheap-array" href="https://github.com/armon/c-minheap-array" target="_blank"&gt;c-minheap-array&lt;/a&gt;&lt;/li&gt;&#13;
&lt;li&gt;GitHub: &lt;a title="-minheap-indirect" href="https://github.com/armon/c-minheap-indirect" target="_blank"&gt;c-minheap-indirect&lt;/a&gt;&lt;/li&gt;&#13;
&lt;li&gt;&lt;a href="https://gist.github.com/630a6a074d1a5ee8d2b0" target="_blank"&gt;Benchmark Script&lt;/a&gt;&lt;/li&gt;&#13;
&lt;li&gt;Wikipedia: &lt;a title="Binary Heaps" href="http://en.wikipedia.org/wiki/Binary_heap" target="_blank"&gt;Binary Heap&lt;/a&gt;&lt;/li&gt;&#13;
&lt;li&gt;Wikipedia: &lt;a title="Virtual Memory" href="http://en.wikipedia.org/wiki/Virtual_memory" target="_blank"&gt;Virtual Memory&lt;/a&gt;&lt;/li&gt;&#13;
&lt;/ul&gt;</description><link>http://armondadgar.com/post/4672394755</link><guid>http://armondadgar.com/post/4672394755</guid><pubDate>Sat, 16 Apr 2011 16:14:00 -0700</pubDate><category>C</category><category>Priority Queue</category><category>Min Heap</category><category>Data Structures</category><category>Open Source</category><category>Programming</category></item><item><title>Hello World!</title><description>&lt;p&gt;My name is Armon Dadgar and I am studying Computer Science at the University of Washington (as I type this post, I am about to begin the final quarter of my undergraduate degree).&lt;/p&gt;



&lt;p&gt;This Tumblr is meant to be Yet Another Technical Blog. I plan to occasionally post articles that I find interesting and which I think others will find useful. I have a few projects that I&amp;#8217;ve worked on in the past that I will try to open source on GitHub shortly and then describe here.&lt;/p&gt;



&lt;p&gt;I am also working on a startup, &lt;a title="Amped Systems" target="_blank" href="http://ampedsystems.com/"&gt;Amped Systems&lt;/a&gt;, where we are focussed on student-oriented services. More about this later&amp;#8230; Since our project is powered by Django, I will soon be posting a multi-part series on the modern tools of Python web developers, and examples of how to get started with Django.&lt;/p&gt;



&lt;p&gt;Until then, adios!&lt;/p&gt;</description><link>http://armondadgar.com/post/3790027780</link><guid>http://armondadgar.com/post/3790027780</guid><pubDate>Fri, 11 Mar 2011 12:51:00 -0800</pubDate></item></channel></rss>
