This is a problem I have encountered many times where the goal is to take a sample of spatial locations and apply constraints to the algorithm. In addition to providing a pre-determined number of *K* clusters a fixed size of elements needs to be held constant within each cluster. An application of this algorithm is when one needs to geographically stratify and pre-allocate the sample frame but keep the sizes the same (or constant) to facilitate operational fielding of a study.

I have done a cursory look for other approaches to this problem but have come up fairly empty. I would certainly be interested in other approaches that are used. However, in general, this is somewhat counter to the textbook teaching of k-means clustering where cluster sizes naturally form based on the specified criteria.

This is one of several approaches to determine the optimal clustering when dealing with spatial data. Other cluster assignment approaches could be used. One in particular is the CLARANS algorithm, but like other clustering approaches it does not constrain the sizes of the clusters. Ultimately the goal here is to keep the clusters the same size and to reduce the total spatial distance from the center of the cluster.

I created a random dataset with just under 10000 randomly selected geographic coordinates (removing Alaska and Hawaii) in the 48 states. Based on the latitude and longitude the locations can be clustered and the sizes constrained. In this example I use exactly equal sized clusters (except when n is not divisible by K), . However, the option exists where one could pre-allocated the cluster sizes so they are fixed in advance but are different from cluster to cluster and then constrained to those sizes if desired.

Centers |
||

Latitude | Longitude | Cluster |

37.46644 | -113.412 | 1 |

40.24648 | -74.7457 | 2 |

31.89746 | -85.5054 | 3 |

41.08111 | -85.3031 | 4 |

As for the distance function I take a couple things into account. First, the earth does not have a constant radius. Because the earth is spinning around so fast it flattens a bit. So I built into the distance function those two radii. This way when traveling over a greater distance of latitude the correct distance can be calculated based on the flattening earth. Second, because the earth is mostly round the Pythagorean theorem doesn’t exactly apply and a more accurate distance around a curved earth is needed. Consequently, the flattening of the earth as well as the curvature of the earth is combined as a more complex formula that is used in the function.

I’m still working on fine tuning and making the algorithm better but my initial algorithm is as follows:

1) set equal cluster size, e.g. n/k, or assign specified sizes.

2) initialize cluster assignment. I’m still working on a better approach but for now I just randomly select, order and systematically assign it through all observations.

3) calculate the center of the clusters.

4) take the first observation and assign it to the closest cluster.

5) since one cluster now has and another has establish a trade to even out the sizes. The closest observation to the giving cluster is then traded.

6) this process continues through all locations.

7) the sum of the distance from each observation to its assigned centroid is calculated.

8) if the next iteration doesn’t decrease that distance (within the tolerance threshold) then stop.

9) continue the process with step 3 until the maximum iteration is meet.

The following code is what I used for my prototype and is not strictly optimized and may take several minutes (~15) on datasets with many thousands of observations. I’ll provide optimized R, Python, and maybe some PHP code at a later time. I’ve included a verbose version of the R code where it will provide convergence information as well as a timer indicating how long it will take before the maximum iteration is met.

# Convert to radian as_radians = function(theta=0){ return(theta * pi / 180) } calc_dist = function(fr, to) { lat1 = as_radians(fr$lat) lon1 = as_radians(fr$lon) lat2 = as_radians(to$lat) lon2 = as_radians(to$lon) a = 3963.191; b = 3949.903; numerator = ( a^2 * cos(lat2) )^2 + ( b^2 * sin(lat2) ) ^2 denominator = ( a * cos(lat2) )^2 + ( b * sin(lat2) )^2 radiusofearth = sqrt(numerator/denominator) #Accounts for the ellipticity of the earth. d = radiusofearth * acos( sin(lat1) * sin(lat2) + cos(lat1)*cos(lat2)*cos(lon2 - lon1) ) d.return = list(distance_miles=d) return(d.return) } raw.og = read.csv("http://statistical-research.com/wp-content/uploads/2013/11/sample_geo.txt", header=T, sep="\t") orig.data = raw.og[,1:3] dirichletClusters_constrained = function(orig.data, k=5, max.iter =50, tolerance = 1, plot.iter=TRUE) { fr = to = NULL r.k.start = sample(seq(1:k)) n = nrow( orig.data ) k.size = ceiling(n/k) initial.clusters = rep(r.k.start, k.size) if(n%%length(initial.clusters)!=0){ exclude.k = length(initial.clusters) - n%%length(initial.clusters) } else { exclude.k = 0 } orig.data$cluster = initial.clusters[1:(length(initial.clusters)-exclude.k)] orig.data$cluster_original = orig.data$cluster ## Calc centers and merge mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) ) tmp1 = matrix( match(orig.data$cluster, mu[,3]) ) orig.data.centers = cbind(as.matrix(orig.data), mu[tmp1,])[,c(1:2,4:6)] ## Calc initial distance from centers fr$lat = orig.data.centers[,3]; fr$lon = orig.data.centers[,4] to$lat = orig.data.centers[,1]; to$lon = orig.data.centers[,2] orig.data$distance.from.center = calc_dist(fr, to)$distance_miles orig.data$distance.from.center_original = orig.data$distance.from.center ## Set some initial configuration values is.converged = FALSE iteration = 0 error.old = Inf error.curr = Inf while ( !is.converged && iteration < max.iter ) { # Iterate until threshold or maximum iterations if(plot.iter==TRUE){ plot(orig.data$Longitude, orig.data$Latitude, col=orig.data$cluster, pch=16, cex=.6, xlab="Longitude",ylab="Latitude") } iteration = iteration + 1 start.time = as.numeric(Sys.time()) cat("Iteration ", iteration,sep="") for( i in 1:n ) { # Iterate over each observation and measure the distance each observation' from its mean center # Produces an exchange. It takes the observation closest to it's mean and in return it gives the observation # closest to the giver, k, mean fr = to = distances = NULL for( j in 1:k ){ # Determine the distance from each k group fr$lat = orig.data$Latitude[i]; fr$lon = orig.data$Longitude[i] to$lat = mu[j,1]; to$lon = mu[j,2] distances[j] = as.numeric( calc_dist(fr, to) ) } # Which k cluster is the observation closest. which.min.distance = which(distances==min(distances), arr.ind=TRUE) previous.cluster = orig.data$cluster[i] orig.data$cluster[i] = which.min.distance # Replace cluster with closest cluster # Trade an observation that is closest to the giving cluster if(previous.cluster != which.min.distance){ new.cluster.group = orig.data[orig.data$cluster==which.min.distance,] fr$lat = mu[previous.cluster,1]; fr$lon = mu[previous.cluster,2] to$lat = new.cluster.group$Latitude; to$lon = new.cluster.group$Longitude new.cluster.group$tmp.dist = calc_dist(fr, to)$distance_miles take.out.new.cluster.group = which(new.cluster.group$tmp.dist==min(new.cluster.group$tmp.dist), arr.ind=TRUE) LocationID = new.cluster.group$LocationID[take.out.new.cluster.group] orig.data$cluster[orig.data$LocationID == LocationID] = previous.cluster } } # Calculate new cluster means mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) ) tmp1 = matrix( match(orig.data$cluster, mu[,3]) ) orig.data.centers = cbind(as.matrix(orig.data), mu[tmp1,])[,c(1:2,4:6)] mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) ) ## Calc initial distance from centers fr$lat = orig.data.centers[,3]; fr$lon = orig.data.centers[,4] to$lat = orig.data.centers[,1]; to$lon = orig.data.centers[,2] orig.data$distance.from.center = calc_dist(fr, to)$distance_miles # Test for convergence. Is the previous distance within the threshold of the current total distance from center error.curr = sum(orig.data$distance.from.center) error.diff = abs( error.old - error.curr ) error.old = error.curr if( !is.nan( error.diff ) && error.diff < tolerance ) { is.converged = TRUE } # Set a time to see how long the process will take is going through all iterations stop.time = as.numeric(Sys.time()) hour.diff = (((stop.time - start.time) * (max.iter - iteration))/60)/60 cat("\n Error ",error.diff," Hours remain from iterations ",hour.diff,"\n") # Write out iterations. Can later be used as a starting point if iterations need to pause write.table(orig.data, paste("C:\\optimize_iteration_",iteration,"_instore_data.csv", sep=""), sep=",", row.names=F) } centers = data.frame(mu) ret.val = list("centers" = centers, "cluster" = factor(orig.data$cluster), "LocationID" = orig.data$LocationID, "Latitude" = orig.data$Latitude, "Longitude" = orig.data$Longitude, "k" = k, "iterations" = iteration, "error.diff" = error.diff) return(ret.val) } # Constrained clustering cl_constrain = dirichletClusters_constrained(orig.data, k=4, max.iter=5, tolerance=.0001, plot.iter=TRUE) table( cl_constrain$cluster ) plot(cl_constrain$Longitude, cl_constrain$Latitude, col=cl_constrain$cluster, pch=16, cex=.6, xlab="Longitude",ylab="Latitude") library(maps) map("state", add=T) points(cl_constrain$centers[,c(2,1)], pch=4, cex=2, col='orange', lwd=4)

## anspiess

/ November 5, 2013Interesting post! With a little bit of work I think one could modify the kmeans.c function in the R source /library/stats/src folder not only to minimize the squared distances from the centroids but to also use equal group sizes as an additional constraint into the minimization. If you know someone proficient in C, maybe this may be worth trying…

Cheers, Andrej.

## Paulo van Breugel

/ November 27, 2013For similar functionality (and somewhat approach I think), you may want to have a look at the spcosa R package, and more specifically the function stratify. Did you do any further work on this? It would be interesting to have a python implementation that could be used in GRASS or QGIS.

## Philippe Perie

/ October 1, 2014Hi,

Nice code !

I’m actually using it to minimize travelling costs while ensuring geographical spread for a face to face Survey, the number of clusters is the number of sampling points.

Maybe a simple suggestion to speed the process without having to play with the intricacies of C : since there is only 2 latitude and longitude, we can just sort the file by latitude or longitude or a function of both to help the algorithm with the inital cluster assignments (closer to the final solution, less iterations)

## rania

/ January 23, 2015Many thanks for this nicely explained algorithm and code!

I am testing it for a different application related to location privacy.

I would like to ask if there is any publication such as book or paper where this approach is published.