This post will be about replicating the Marcos Lopez de Prado algorithm from his paper building diversified portfolios that outperform out of sample. This algorithm is one that attempts to make a tradeoff between the classic mean-variance optimization algorithm that takes into account a covariance structure, but is unstable, and an inverse volatility algorithm that ignores covariance, but is more stable.

This is a paper that I struggled with until I ran the code in Python (I have anaconda installed but have trouble installing some packages such as keras because I’m on windows…would love to have someone walk me through setting up a Linux dual-boot), as I assumed that the clustering algorithm actually was able to concretely group every asset into a particular cluster (I.E. ETF 1 would be in cluster 1, ETF 2 in cluster 3, etc.). Turns out, that isn’t at all the case.

Here’s how the algorithm actually works.

First off, it computes a covariance and correlation matrix (created from simulated data in Marcos’s paper). Next, it uses a hierarchical clustering algorithm on a distance-transformed correlation matrix, with the “single” method (I.E. friend of friends–do ?hclust in R to read up more on this). The key output here is the order of the assets from the clustering algorithm. Note well: this is the only relevant artifact of the entire clustering algorithm.

Using this order, it then uses an algorithm that does the following:

Initialize a vector of weighs equal to 1 for each asset.

Then, run the following recursive algorithm:

1) Break the order vector up into two equal-length (or as close to equal length) lists as possible.

2) For each half of the list, compute the inverse variance weights (that is, just the diagonal) of the covariance matrix slice containing the assets of interest, and then compute the variance of the cluster when multiplied by the weights (I.E. w’ * S^2 * w).

3) Then, do a basic inverse-variance weight for the two clusters. Call the weight of cluster 0 alpha = 1-cluster_variance_0/(cluster_variance_0 + cluster_variance_1), and the weight of cluster 1 its complement. (1 – alpha).

4) Multiply all assets in the original vector of weights containing assets in cluster 0 with the weight of cluster 0, and all weights containing assets in cluster 1 with the weight of cluster 1. That is, weights[index_assets_cluster_0] *= alpha, weights[index_assets_cluster_1] *= 1-alpha.

5) Lastly, if the list isn’t of length 1 (that is, not a single asset), repeat this entire process until every asset is its own cluster.

Here is the implementation in R code.

First off, the correlation matrix and the covariance matrix for use in this code, obtained from Marcos Lopez De Prado’s code in the appendix in his paper.

> covMat V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 1 1.000647799 -0.003050479 0.010033224 -0.010759689 -0.005036503 0.008762563 0.998201625 -0.001393196 -0.001254522 -0.009365991 2 -0.003050479 1.009021349 0.008613817 0.007334478 -0.009492688 0.013031817 -0.009420720 -0.015346223 1.010520047 1.013334849 3 0.010033224 0.008613817 1.000739363 -0.000637885 0.001783293 1.001574768 0.006385368 0.001922316 0.012902050 0.007997935 4 -0.010759689 0.007334478 -0.000637885 1.011854725 0.005759976 0.000905812 -0.011912269 0.000461894 0.012572661 0.009621670 5 -0.005036503 -0.009492688 0.001783293 0.005759976 1.005835878 0.005606343 -0.009643250 1.008567427 -0.006183035 -0.007942770 6 0.008762563 0.013031817 1.001574768 0.000905812 0.005606343 1.064309825 0.004413960 0.005780148 0.017185396 0.011601336 7 0.998201625 -0.009420720 0.006385368 -0.011912269 -0.009643250 0.004413960 1.058172027 -0.006755374 -0.008099181 -0.016240271 8 -0.001393196 -0.015346223 0.001922316 0.000461894 1.008567427 0.005780148 -0.006755374 1.074833155 -0.011903469 -0.013738378 9 -0.001254522 1.010520047 0.012902050 0.012572661 -0.006183035 0.017185396 -0.008099181 -0.011903469 1.075346677 1.015220126 10 -0.009365991 1.013334849 0.007997935 0.009621670 -0.007942770 0.011601336 -0.016240271 -0.013738378 1.015220126 1.078586686 > corMat V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 1 1.000000000 -0.003035829 0.010026270 -0.010693011 -0.005020245 0.008490954 0.970062043 -0.001343386 -0.001209382 -0.009015412 2 -0.003035829 1.000000000 0.008572055 0.007258718 -0.009422702 0.012575370 -0.009117080 -0.014736040 0.970108941 0.971348946 3 0.010026270 0.008572055 1.000000000 -0.000633903 0.001777455 0.970485047 0.006205079 0.001853505 0.012437239 0.007698212 4 -0.010693011 0.007258718 -0.000633903 1.000000000 0.005709500 0.000872861 -0.011512172 0.000442908 0.012052964 0.009210090 5 -0.005020245 -0.009422702 0.001777455 0.005709500 1.000000000 0.005418538 -0.009347204 0.969998023 -0.005945165 -0.007625721 6 0.008490954 0.012575370 0.970485047 0.000872861 0.005418538 1.000000000 0.004159261 0.005404237 0.016063910 0.010827955 7 0.970062043 -0.009117080 0.006205079 -0.011512172 -0.009347204 0.004159261 1.000000000 -0.006334331 -0.007592568 -0.015201540 8 -0.001343386 -0.014736040 0.001853505 0.000442908 0.969998023 0.005404237 -0.006334331 1.000000000 -0.011072068 -0.012759610 9 -0.001209382 0.970108941 0.012437239 0.012052964 -0.005945165 0.016063910 -0.007592568 -0.011072068 1.000000000 0.942667300 10 -0.009015412 0.971348946 0.007698212 0.009210090 -0.007625721 0.010827955 -0.015201540 -0.012759610 0.942667300 1.000000000

Now, for the implementation.

This reads in the two matrices above and gets the clustering order.

covMat <- read.csv('cov.csv', header = FALSE) corMat <- read.csv('corMat.csv', header = FALSE) clustOrder <- hclust(dist(corMat), method = 'single')$order

This is the clustering order:

> clustOrder [1] 9 2 10 1 7 3 6 4 5 8

Next, the getIVP (get Inverse Variance Portfolio) and getClusterVar functions (note: I’m trying to keep the naming conventions identical to Dr. Lopez’s paper)

getIVP <- function(covMat) { # get inverse variance portfolio from diagonal of covariance matrix invDiag <- 1/diag(as.matrix(covMat)) weights <- invDiag/sum(invDiag) return(weights) } getClusterVar <- function(covMat, cItems) { # compute cluster variance from the inverse variance portfolio above covMatSlice <- covMat[cItems, cItems] weights <- getIVP(covMatSlice) cVar <- t(weights) %*% as.matrix(covMatSlice) %*% weights return(cVar) }

Next, my code diverges from the code in the paper, because I do not use the list comprehension structure, but instead opt for a recursive algorithm, as I find that style to be more readable.

One wrinkle to note is the use of the double arrow dash operator, to assign to a variable outside the scope of the recurFun function. I assign the initial weights vector w in the global environment, and update it from within the recurFun function. I am aware that it is a faux pas to create variables in the global environment, but my attempts at creating a temporary environment in which to update the weight vector did not produce the updating mechanism I had hoped to, so a little bit of assistance with refactoring this code would be appreciated.

getRecBipart <- function(covMat, sortIx) { # keeping track of weights vector in the global environment assign("w", value = rep(1, ncol(covMat)), envir = .GlobalEnv) # run recursion function recurFun(covMat, sortIx) return(w) } recurFun <- function(covMat, sortIx) { # get first half of sortIx which is a cluster order subIdx <- 1:trunc(length(sortIx)/2) # subdivide ordering into first half and second half cItems0 <- sortIx[subIdx] cItems1 <- sortIx[-subIdx] # compute cluster variances of covariance matrices indexed # on first half and second half of ordering cVar0 <- getClusterVar(covMat, cItems0) cVar1 <- getClusterVar(covMat, cItems1) alpha <- 1 - cVar0/(cVar0 + cVar1) # updating weights outside the function using scoping mechanics w[cItems0] <<- w[cItems0] * alpha w[cItems1] <<- w[cItems1] * (1-alpha) # rerun the function on a half if the length of that half is greater than 1 if(length(cItems0) > 1) { recurFun(covMat, cItems0) } if(length(cItems1) > 1) { recurFun(covMat, cItems1) } }

Lastly, let’s run the function.

out <- getRecBipart(covMat, clustOrder)

With the result (which matches the paper):

> out [1] 0.06999366 0.07592151 0.10838948 0.19029104 0.09719887 0.10191545 0.06618868 0.09095933 0.07123881 0.12790318

So, hopefully this democratizes the use of this technology in R. While I have seen a raw Rcpp implementation and one from the Systematic Investor Toolbox, neither of those implementations satisfied me from a “plug and play” perspective. This implementation solves that issue. Anyone here can copy and paste these functions into their environment and immediately make use of one of the algorithms devised by one of the top minds in quantitative finance.

A demonstration in a backtest using this methodology will be forthcoming.

Thanks for reading.

NOTE: I am always interested in networking and full-time opportunities which may benefit from my skills. Furthermore, I am also interested in project work in the volatility ETF trading space. My linkedin profile can be found here.