So far in the series "Building a Powerful Crypto-Asset Portfolio Using Multi-Factor Models" we have published three articles: "Theoretical Basics", "Data Preprocessing", and "Factor Validity Testing".
Together, they explain the theory behind multi-factor strategies and the steps of single-factor testing.
1. Reasons for factor correlation testing: multicollinearity
Single-factor testing screens out a batch of effective factors, but these factors cannot go straight into the factor library. Factors fall into broad categories according to their economic meaning, and factors of the same type tend to be strongly correlated. If they are added to the library without correlation screening and then used together in a multiple linear regression to estimate expected returns, a multicollinearity problem arises. In econometrics, multicollinearity means that some or all explanatory variables in a regression model have an exact (or nearly exact) linear relationship, i.e. the variables are highly correlated.
Therefore, after the effective factors have been screened out, the correlation of factors within each major category must first be tested. For highly correlated factors, either discard the less significant ones or perform factor synthesis.
The mathematical explanation of multicollinearity is as follows:
$$$ Yᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + ... + βₖXₖᵢ + μᵢ , i = 1,2,...,n $$$
There will be two situations:
1. If $$C₂X₂ᵢ + C₃X₃ᵢ + … + CₖXₖᵢ$$ = constant vector, where the $$Cᵢ$$ are not all 0, then the $$Xᵢ$$ are perfectly collinear.
2. If $$C₂X₂ᵢ + C₃X₃ᵢ + … + CₖXₖᵢ + Vᵢ$$ = constant vector, where the $$Cᵢ$$ are not all 0 and $$Vᵢ$$ is a random error term, then the $$Xᵢ$$ are approximately collinear.
Consequences of multicollinearity:
1. Parameter estimators do not exist under perfect collinearity
2. Under approximate collinearity, the OLS estimators become unreliable: their variances are greatly inflated
We first define the variance inflation factor (VIF) as $$VIF = 1 / (1 - r²ᵢⱼ)$$, which measures how much the variance of a parameter estimator is inflated by multicollinearity: as the correlation coefficient rises, the VIF increases sharply (a numerical check appears in the sketch after this list).
Take the binary linear model as an example: $$Y = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + μᵢ$$
$$$ var(β₂) = (σ²/ΣX²₂ᵢ)·(1/(1-r²₂₃)) $$$
where the squared correlation coefficient is
$$$ r²₂₃ = (ΣX₂ᵢX₃ᵢ)² / (ΣX²₂ᵢ · ΣX²₃ᵢ) ≤ 1 $$$
so that
$$$ 1/(1-r²₂₃) ≥ 1 $$$
Completely non-collinear (completely uncorrelated): $$r²₂₃ = 0 → var(β₂) = σ²/ΣX²₂ᵢ$$
Approximately collinear: $$0 < r²₂₃ < 1 → var(β₂) = (σ²/ΣX²₂ᵢ)·(1/(1-r²₂₃)) > σ²/ΣX²₂ᵢ$$; the closer $$r²₂₃$$ is to 1, the larger the variance
Completely collinear: $$r²₂₃ = 1 → var(β₂) = ∞$$; the variance is infinite
3. The economic meaning of the parameter estimates becomes unreasonable
4. The significance test of individual variables (t test) becomes unreliable
5. The model's predictions fail: the return rates fitted by the multiple linear regression become highly inaccurate.
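To make consequence 2 concrete, here is a minimal sketch (the DataFrame factor_df and its column names are hypothetical, not from this article's data) that uses statsmodels' variance_inflation_factor to show how two nearly collinear factors inflate the VIF:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical factor exposure matrix: rows = observations, columns = factors
rng = np.random.default_rng(0)
factor_df = pd.DataFrame(rng.standard_normal((200, 3)),
                         columns=["momentum_1m", "momentum_3m", "volume"])
# Make the two momentum factors almost collinear on purpose
factor_df["momentum_3m"] = 0.95 * factor_df["momentum_1m"] + 0.05 * rng.standard_normal(200)

# VIF for each factor: regress it on all other factors, VIF = 1 / (1 - R²)
X = sm.add_constant(factor_df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=factor_df.columns,
)
print(factor_df.corr().round(2))
print(vif.round(1))   # the two momentum factors show strongly inflated VIFs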
2. Step 1: Correlation test of factors of the same type
Test the correlation between newly calculated factors and the factors already in the library. Generally speaking, correlations are computed on two types of data:
1. Calculate the correlation based on the factor values of all tokens during the backtest period
2. Calculate the correlation based on the factor excess return values of all tokens during the backtest period
$$$ Excess return = long-group return − baseline return, where return = ln(closeₜ / closeₜ₋₁) $$$
Every factor we look for should contribute to, and help explain, the token's rate of return. The purpose of the correlation test is to find factors that explain and contribute to strategy returns in different ways, because returns are the strategy's ultimate goal. If two factors describe returns identically, they are redundant even when their factor values differ significantly. We therefore do not look for factors whose values differ, but for factors that describe returns differently, which is why we chose factor excess-return values to calculate correlations.
Our strategy runs at daily frequency, so we calculate the correlation-coefficient matrix of factor excess returns across the dates of the backtest interval.
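As a rough sketch of this step (the DataFrames and factor names below are hypothetical, not the article's own data structures), the daily excess-return series of each factor can be aligned by date and fed into pandas' corr():

import numpy as np
import pandas as pd

# Hypothetical daily excess-return series per factor, indexed by date.
# In the actual workflow these come from the single-factor backtests:
# excess return = long-group return - baseline return, per factor per day.
dates = pd.date_range("2023-01-01", periods=180, freq="D")
rng = np.random.default_rng(1)
factor_names = ["momentum", "volume", "volatility"]
excess_returns = pd.DataFrame(
    {name: rng.normal(0.0005, 0.01, len(dates)) for name in factor_names},
    index=dates,
)

# Correlation-coefficient matrix of factor excess returns over the backtest period
corr_matrix = excess_returns.corr()

# For a newly tested factor, rank library factors by absolute correlation with it
new_factor = "momentum"
top_corr = corr_matrix.abs()[new_factor].drop(new_factor).sort_values(ascending=False)
print(corr_matrix.round(3))
print(top_corr.head(1))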
Code to find the top n factors in the library most correlated with the new factor:
def get_n_max_corr(self, factors, n=1):
    # Excess-return series of the factors already in the library
    factors_excess = self.get_excess_returns(factors)
    # Excess-return series of the factor currently being tested
    save_factor_excess = self.get_excess_return(self.factor_value, self.start_date, self.end_date)
    if len(factors_excess) < 1:
        return save_factor_excess, 1.0, None
    factors_excess[self.factor_name] = save_factor_excess['excess_return']
    factors_excess = pd.concat(factors_excess, axis=1)
    factors_excess.columns = factors_excess.columns.levels[0]
    # get corr matrix
    factor_corr = factors_excess.corr()
    factor_corr_df = factor_corr.abs().loc[self.factor_name]
    max_corr_score = factor_corr_df.sort_values(ascending=False).iloc[1:].head(n)
    return save_factor_excess, factor_corr_df, max_corr_score
3. Step 2: Factor selection and factor synthesis
For factor sets with high correlation, there are two ways to deal with them:
(1) Factor selection
Based on each factor's ICIR, return, turnover, and Sharpe ratio, keep the factor that is most effective along the chosen dimension and delete the others, as sketched below.
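A minimal sketch of the selection route (the icir Series and the correlated group below are hypothetical): within a group of highly correlated factors, keep the one with the strongest metric, here the largest absolute ICIR, and drop the rest.

import pandas as pd

# Hypothetical ICIR values for a group of highly correlated momentum factors
icir = pd.Series({"momentum_1m": 0.45, "momentum_3m": 0.38, "momentum_6m": 0.52})

# Keep the factor with the largest absolute ICIR and drop the rest of the group
keep = icir.abs().idxmax()
drop = [name for name in icir.index if name != keep]
print(f"keep: {keep}, drop: {drop}")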
(2) Factor synthesis
Synthesize the factors in the set, retaining as much effective cross-sectional information as possible.
$$$ F = w₁·f₁ + w₂·f₂ + ... + wₙ·fₙ $$$
where F is the final composite factor and the fᵢ are the factors to be synthesized.
Assume that there are currently 3 factor matrices to be processed:
synthesis = pd.concat([a, b, c], axis=1)
synthesis

                a         b         c
BTC.BN   0.184865 -0.013253 -0.001557
ETH.BN   0.185691  0.022708  0.031793
BNB.BN   0.242072 -0.180952 -0.067430
LTC.BN   0.275923 -0.125712 -0.049596
AAVE.BN  0.204443 -0.000819 -0.006550
...           ...       ...       ...
SOC.BN   0.231638 -0.095946 -0.049495
AVAX.BN  0.204714 -0.079707 -0.041806
DAO.BN   0.194990  0.022095 -0.011764
ETC.BN   0.184236 -0.021909 -0.013325
TRX.BN   0.175118 -0.055077 -0.039513
2.1 Equal weighting
Each factor receives the same weight $$(w = 1 / N)$$, where N is the number of factors; the composite factor is the average of the individual factor values.
For example, for the momentum category, take the one-month, two-month, three-month, six-month, and twelve-month rates of return: the loadings of these factors are averaged with equal weights into a new momentum factor loading, which is then normalized again (see the sketch after the code below).
synthesis1 = synthesis.mean(axis=1)  # row-wise mean across factors
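The example above mentions re-normalizing after averaging. As a minimal follow-up sketch (assuming a z-score standardization, one common choice; the original article does not fix the method here):

# Re-standardize the composite factor cross-sectionally (z-score, one common choice)
synthesis1 = (synthesis1 - synthesis1.mean()) / synthesis1.std()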
2.2 Historical IC weighting, historical ICIR weighting, historical return weighting
The factors are weighted by their IC values (or ICIR values, or historical returns) over the backtest period. The backtest period contains many sub-periods, each with its own IC value, so the factor's weight is taken as their average, usually the arithmetic mean of the ICs over the backtest period.
# Weight normalization (the factor-weighting methods later in this article also generally require it)
w_IC = ic.mean() / ic.mean().sum()
w_ICIR = icir.mean() / icir.mean().sum()
w_Ret = Return.mean() / Return.mean().sum()

# Pick one of the three weighting schemes:
synthesis2 = (synthesis * w_IC).sum(axis=1)
synthesis2 = (synthesis * w_ICIR).sum(axis=1)
synthesis2 = (synthesis * w_Ret).sum(axis=1)
2.3 Historical IC half-life weighting, historical ICIR half-life weighting
Both 2.1 and 2.2 use arithmetic averages, implicitly assuming that every IC (or ICIR) in the backtest period has the same influence on the factor.
In reality, however, each period does not affect the current period equally: the influence decays over time. The closer a period is to the present, the larger its impact; the further away, the smaller. Based on this principle, before computing the IC weights we first define half-life weights, which are larger for recent periods and smaller for distant ones.
Definition of the half-life weight, with T the number of periods in the backtest window and H the half-life:
$$$ wₜ = 2^((t − T − 1)/H), t = 1, 2, …, T $$$
The weights are then normalized so that they sum to 1.
# Half-life weights
def Decay(H, T):
    t = np.arange(T + 1)[1:]
    wt = 2 ** ((t - T - 1) / H)   # half-life weights
    decay = wt / wt.sum()         # normalize
    return decay

# Historical IC half-life weighting
w_bs = Decay(6, 12)               # assume T=12, H=6
ic_bs = ic.mul(w_bs, axis=0)
w = ic_bs.mean() / ic_bs.mean().sum()
synthesis3 = (synthesis * w).sum(axis=1)

# Historical ICIR half-life weighting
# Builds on IC half-life weighting by dividing by the standard deviation of the IC values.
w_bs = Decay(6, 12)
ic_bs = ic.mul(w_bs, axis=0)
ir_bs = ic_bs.mean() / ic.std()
w = ir_bs / ir_bs.sum()
synthesis3 = (synthesis * w).sum(axis=1)
2.4 Maximize ICIR weighting
Solve an optimization problem for the factor weights w that maximize the ICIR of the composite factor.
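For reference, and consistent with the code blocks below, the optimization and its closed-form solution can be written as (μ_IC is the vector of mean factor ICs over the backtest period, Σ_IC the covariance matrix of the IC series):
$$$ max_w ICIR(w) = (wᵀ μ_IC) / √(wᵀ Σ_IC w) ⟹ w* ∝ Σ_IC⁻¹ μ_IC $$$
The resulting w* is then normalized so the weights sum to 1.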
Covariance matrix estimation: the covariance matrix measures the correlation between different assets. In statistics, the sample covariance matrix is often used in place of the population covariance matrix, but when the sample size is small the two can differ greatly. Shrinkage estimation was therefore proposed; its principle is to minimize the mean squared error between the estimated covariance matrix and the true covariance matrix.
Approaches:
1. Sample covariance matrix
# Maximize ICIR weighting (sample covariance)
ic_cov = np.array(ic.cov())
inv_ic_cov = np.linalg.inv(ic_cov)
ic_vector = np.mat(ic.mean())
w = inv_ic_cov * ic_vector.T
w = w / w.sum()
synthesis4 = (synthesis * pd.DataFrame(w, index=synthesis.columns)[0]).sum(axis=1)
2. Ledoit-Wolf shrinkage: introduce a shrinkage coefficient that mixes the sample covariance matrix with a structured target (a scaled identity matrix) to reduce the impact of noise.
# Maximize ICIR weighting (Ledoit-Wolf shrinkage covariance)
from sklearn.covariance import LedoitWolf

model = LedoitWolf()
model.fit(ic)
ic_cov_lw = model.covariance_
inv_ic_cov = np.linalg.inv(ic_cov_lw)
ic_vector = np.mat(ic.mean())
w = inv_ic_cov * ic_vector.T
w = w / w.sum()
synthesis4 = (synthesis * pd.DataFrame(w, index=synthesis.columns)[0]).sum(axis=1)
3. Oracle Approximating Shrinkage (OAS): an improvement on Ledoit-Wolf shrinkage that aims to estimate the true covariance matrix more accurately when the sample size is small by further adjusting the shrinkage. (The implementation mirrors the Ledoit-Wolf code; see the sketch below.)
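A minimal sketch using scikit-learn's OAS estimator, mirroring the Ledoit-Wolf block above (the variables ic and synthesis are the same as in the previous blocks):

# Maximize ICIR weighting (Oracle Approximating Shrinkage covariance)
import numpy as np
import pandas as pd
from sklearn.covariance import OAS

model = OAS()
model.fit(ic)                        # ic: factor IC series over the backtest period
ic_cov_oas = model.covariance_
inv_ic_cov = np.linalg.inv(ic_cov_oas)
ic_vector = np.mat(ic.mean())
w = inv_ic_cov * ic_vector.T
w = w / w.sum()
synthesis4 = (synthesis * pd.DataFrame(w, index=synthesis.columns)[0]).sum(axis=1)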
2.5 Principal Component Analysis PCA
Principal Component Analysis (PCA) is a statistical method used to reduce dimensionality and extract the main features of data. The goal is to map the original data to a new coordinate system through linear transformation to maximize the variance of the data in the new coordinate system.
Specifically, PCA first finds the principal component of the data, i.e. the direction with the largest variance. It then finds the second principal component, which is orthogonal (uncorrelated) to the first and has the next-largest variance. This process is repeated until all principal components have been found.
# Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

model1 = PCA(n_components=1)
model1.fit(f)                        # f: matrix of factor values to be combined
w = model1.components_
w = w / w.sum()
weighted_factor = (f * pd.DataFrame(w, columns=f.columns).iloc[0]).sum(axis=1)
About LUCIDA & FALCON
Lucida ( https://www.lucida.fund/ ) is an industry-leading quantitative hedge fund that entered the crypto market in April 2018. It mainly trades CTA, statistical-arbitrage, and option volatility-arbitrage strategies, and currently manages US$30 million.
Falcon ( https://falcon.lucida.fund/ ) is a new generation of Web3 investment infrastructure. Based on a multi-factor model, it helps users "select", "buy", "manage", and "sell" crypto assets. Falcon was incubated by Lucida in June 2022.
More content can be found at https://linktr.ee/lucida_and_falcon
Previous articles
Use multi-factor strategies to build a powerful crypto asset portfolio#Factor Validity Test#
Use multi-factor strategies to build powerful crypto asset portfolios#Data Preprocessing#
Construct a powerful crypto asset portfolio using multi-factor strategies#Theoretical Basics#
From Tech Breakthroughs to Market Boom: Understanding the Link in the Crypto Bull Market
What exactly is driving Crypto's bull market? Is it a technological upgrade?
Development as the Driving Force: Understanding the Impact on Token Price Performance?
Is "the team doing something" really related to the currency price?
5 Million Rows of Data Recap: Investigating The Crypto Market's 3-Year Bull Run @LUCIDA
LUCIDA: Use multi-factor models to select tracks and currencies
LUCIDA × SnapFingers DAO: 21 top public chains in three-year bull market recap
LUCIDA × OKLink: The value of on-chain data to secondary market investment