SoCal or NO Cal: K-means Clustering and a Tale as Old as Maybe 1850 (depending on who you ask)

Isaiahwestphalen
4 min read · May 11, 2021


As someone who has set foot in the state only once, I had largely been unaware of the feud between Northern and Southern California. That changed recently, when I was investigating the California Housing Prices dataset from Kaggle.

While we won't go too far into it, as I am far from an expert, the feud began around 1850, when the pro-slavery South began pushing for a split. And while I don't want to add any fuel to the fire, there does appear to be some actual data to back up the idea of a divided California, at least going by housing districts. We'll explore this divide through k-means clustering.

What Is K-Means Clustering?

K-means clustering is an unsupervised learning method that separates data into k clusters based on each data point's proximity to a cluster mean, or centroid.

The centroids are initially picked at random, and each point is assigned to the cluster of its closest centroid. Once this is done, the mean point of each cluster is calculated and assigned as the new centroid. This repeats until the assignments stop changing, at which point the sum of squared distances from each point to its cluster's centroid has reached a (local) minimum.

Here is an example of k-means in action on 100 randomly generated points. Each star is a centroid, and each point’s cluster is determined by how close it is to a respective centroid.
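A quick sketch of that kind of demo using scikit-learn (the choice of 3 clusters and the uniformly random points here are my own, for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 100 randomly generated 2-D points
rng = np.random.default_rng(42)
points = rng.random((100, 2))

# fit k-means with 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)

# color each point by its cluster and draw each centroid as a star
plt.scatter(points[:, 0], points[:, 1], c=kmeans.labels_, cmap="viridis")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="*", s=300, c="red")
plt.show()
```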

K-means clustering can be used in a variety of ways, but it is especially useful for finding relationships that may exist within a given dataset. You can use the results of k-means clustering to make informed decisions about how to build a supervised learning model, or simply use the information it provides to make your own decisions.

How Does This Apply to California?

One aspect of k-means clustering that I find really interesting is its ability to find relationships in geo-data. Since the clustering itself is calculated based on distance, it stands to reason that this would be the case. Out of curiosity, I wanted to see what I could find when applying this tool to California housing information. This is where the California Housing Prices dataset comes in. This Kaggle dataset includes the latitude and longitude of each Californian housing district, as well as the median house age, total rooms in the district, population, number of households, median income, median house value, and so on. In this case, however, I was only interested in each district's latitude and longitude.

Let’s Get into Some Coding!

First let’s read in our dataset to create a pandas DataFrame for each district latitude and longitude:
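Something like the following does the job, assuming the Kaggle csv has been downloaded as housing.csv (the file name is an assumption; the latitude/longitude column names come from the dataset):

```python
import pandas as pd

try:
    # the Kaggle California Housing Prices csv (file name is an assumption)
    housing = pd.read_csv("housing.csv")
except FileNotFoundError:
    # tiny stand-in so the snippet still runs without the download
    housing = pd.DataFrame({"latitude": [37.88, 37.86, 34.05, 33.93],
                            "longitude": [-122.23, -122.22, -118.24, -118.30]})

# keep just the district coordinates
coords = housing[["latitude", "longitude"]]
print(coords.head())
```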

Now let’s plot this and see what our data looks like initially!
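A simple scatter plot of the coordinates is enough here (again, the housing.csv file name is an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt

try:
    housing = pd.read_csv("housing.csv")  # file name is an assumption
except FileNotFoundError:
    # tiny stand-in so the snippet still runs without the download
    housing = pd.DataFrame({"latitude": [37.88, 37.86, 34.05, 33.93],
                            "longitude": [-122.23, -122.22, -118.24, -118.30]})

# longitude on x and latitude on y gives the familiar map orientation
plt.scatter(housing["longitude"], housing["latitude"], s=2, alpha=0.4)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
```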

It’s California!

At this point we can start investigating with k-means. We're going into this with no target in mind, as this is unsupervised learning. So instead of measuring how well we're clustering with a metric like accuracy, we'll use each point's silhouette score. A silhouette score measures how similar, or close, a point is to the other points in its own cluster versus points in other clusters. Scores range from -1 to 1, and the closer to 1 the better (this is just a brief overview of what goes on behind the scenes when scoring a clustering). What we want to do here is find the number of clusters that maximizes the silhouette score. For this we'll use a loop to iterate through possible numbers of clusters and find which one gives us the maximum score:

Since k-means is based on distance, it's good practice to scale your data first. We need to start at two clusters (a single cluster is just the data we already have), and we'll go up to 100 clusters first, investigating past that only if we need to. We'll also use random step sizes to make sure the loop doesn't run for too long.
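Put together, the search might look something like this sketch (the StandardScaler, the 100-cluster cap, and the random step sizes follow the description above; the housing.csv file name is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

try:
    housing = pd.read_csv("housing.csv")  # file name is an assumption
except FileNotFoundError:
    # synthetic two-blob stand-in so the snippet still runs without the download
    rng0 = np.random.default_rng(0)
    housing = pd.DataFrame({
        "latitude": np.r_[rng0.normal(38, 1, 200), rng0.normal(34, 1, 200)],
        "longitude": np.r_[rng0.normal(-122, 1, 200), rng0.normal(-118, 1, 200)],
    })

# k-means is distance-based, so scale the coordinates first
X = StandardScaler().fit_transform(housing[["latitude", "longitude"]])

best_k, best_score = None, -1.0
rng = np.random.default_rng(42)
k = 2  # a single cluster is just the data we already have
while k <= 100:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
    k += int(rng.integers(1, 5))  # random step so the loop doesn't run too long

print(f"best k: {best_k}, silhouette score: {best_score:.3f}")
```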

After running this code, we find that the best score was achieved when our housing districts were split into two clusters. From here we can go about visualizing them:
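Refitting with the winning k = 2 and coloring each district by its cluster might look like this (the housing.csv file name is again an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

try:
    housing = pd.read_csv("housing.csv")  # file name is an assumption
except FileNotFoundError:
    # tiny stand-in so the snippet still runs without the download
    housing = pd.DataFrame({"latitude": [37.88, 37.86, 34.05, 33.93],
                            "longitude": [-122.23, -122.22, -118.24, -118.30]})

# refit with the winning number of clusters and label every district
X = StandardScaler().fit_transform(housing[["latitude", "longitude"]])
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

plt.scatter(housing["longitude"], housing["latitude"], c=labels, s=2, cmap="coolwarm")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()
```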

Now we have to interpret what our k-means clusters are telling us. In this case we got lucky: the clusters are separated by north and south. While I may not be a California native, the first time I saw this visualization I was still able to guess that the clusters are defined by Northern and Southern California. That's when I did some digging, and to really see what was going on, I made a second visual comparing the k-means clusters to an actual map of NoCal and SoCal:

While the eastern edge of our split differs slightly from the actual divide, the western split is right where you'd expect it to be! It was at this point that I took a deeper dive into the differences between Northern and Southern California, and the feud came to my attention. While it's generally better to let bygones be bygones, the clustering does suggest a real north-south split in California, at least among its housing districts.


Isaiahwestphalen

I am a Data Scientist from Boston, MA. I’m passionate about learning more about the world around us, math, physics, and anything to do with music and painting.