Using Multi-Factor Strategies to Build a Powerful Crypto-Asset Portfolio #Factor Category Analysis: Factor Synthesis#


Continuing from the previous installments, we have published three articles in the series "Building a Powerful Crypto-Asset Portfolio Using Multi-Factor Models": "Theoretical Basics", "Data Preprocessing", and "Factor Validity Testing".

The first three articles explain the theory behind multi-factor strategies and the steps of single-factor testing.

1. Reasons for factor correlation testing: multicollinearity

Single-factor testing leaves us with a batch of effective factors, but they cannot go straight into the factor library. Factors fall into broad categories according to their economic meaning, and factors within the same category tend to be strongly correlated. If we skipped correlation screening, stored them all, and ran a multiple linear regression of expected returns on these factors, we would run into a multicollinearity problem. In econometrics, multicollinearity means that some or all of the explanatory variables in a regression model share an exact (or nearly exact) linear relationship, i.e., the variables are highly correlated.

Therefore, once the effective factors have been screened out, we first need to test the correlation of factors within each major category. For highly correlated factors, we either discard those with lower significance or synthesize them into a composite factor.

The mathematical explanation of multicollinearity is as follows:

$$ Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + \mu_i, \qquad i = 1, 2, \dots, n $$

Two cases can arise:

1. If there exist constants $c_2, c_3, \dots, c_k$, not all zero, such that $c_2 X_{2i} + c_3 X_{3i} + \dots + c_k X_{ki} = \text{constant}$, then the $X_i$ are perfectly collinear.

2. If $c_2 X_{2i} + c_3 X_{3i} + \dots + c_k X_{ki} + v_i = \text{constant}$, where the $c_i$ are not all zero and $v_i$ is a random error term, then the $X_i$ are approximately collinear.

Consequences of multicollinearity:

1. Parameter estimators do not exist under perfect collinearity

2. Under approximate collinearity, the OLS estimators are no longer efficient: their variances are inflated

We first define the variance inflation factor (VIF) as $VIF = 1 / (1 - r_{ij}^2)$, which measures how much the variance of a parameter estimator is inflated by multicollinearity: as the correlation coefficient increases, the VIF rises sharply (a short numerical sketch follows this list).

Take the two-variable linear model $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \mu_i$ as an example:

$$ \operatorname{var}(\hat\beta_2) = \frac{\sigma^2}{\sum X_{2i}^2} \cdot \frac{1}{1 - r_{23}^2} $$

where the squared correlation coefficient between $X_2$ and $X_3$ is

$$ r_{23}^2 = \frac{\left(\sum X_{2i} X_{3i}\right)^2}{\sum X_{2i}^2 \, \sum X_{3i}^2} \le 1, \qquad \text{so} \qquad \frac{1}{1 - r_{23}^2} \ge 1 $$

• Completely non-collinear (uncorrelated): $r_{23}^2 = 0 \Rightarrow \operatorname{var}(\hat\beta_2) = \sigma^2 / \sum X_{2i}^2$

• Approximately collinear: $0 < r_{23}^2 < 1 \Rightarrow \operatorname{var}(\hat\beta_2) = \dfrac{\sigma^2}{\sum X_{2i}^2} \cdot \dfrac{1}{1 - r_{23}^2} > \sigma^2 / \sum X_{2i}^2$; the closer $r_{23}^2$ is to 1, the larger the variance

• Completely collinear: $r_{23}^2 = 1 \Rightarrow \operatorname{var}(\hat\beta_2) = \infty$, the variance is infinite

3. The economic meaning of the parameter estimates becomes implausible

4. The significance tests of the variables (t-tests) lose their meaning

5. The model's predictive power fails: the return forecasts fitted by the multivariate linear model become highly inaccurate, and the model breaks down.
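To make the variance-inflation point above concrete, here is a minimal numerical sketch (the data and the 0.95 correlation level are invented purely for illustration) that computes VIFs with statsmodels:

```python
# Minimal sketch: variance inflation under approximate collinearity (illustrative data only)
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + np.sqrt(1 - 0.95 ** 2) * rng.normal(size=n)   # corr(x2, x3) ≈ 0.95
X = pd.DataFrame({"const": 1.0, "x2": x2, "x3": x3})

# VIF ≈ 1 / (1 - r²): with r ≈ 0.95 this is roughly 10, i.e. var(β) is inflated about tenfold
for i in [1, 2]:                      # skip the constant column
    print(X.columns[i], variance_inflation_factor(X.values, i))
```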

2. Step 1: Correlation test of factors of the same type

Test the correlation between the newly calculated factors and the factors already in the library. Generally speaking, the correlation can be computed on two kinds of data:

1. Correlation computed from the factor values of all tokens over the backtest period

2. Correlation computed from the factor excess-return values of all tokens over the backtest period

$$ \text{Excess return} = \text{long-group return} - \text{benchmark return}, \qquad \text{return} = \ln(\text{close}_t / \text{close}_{t-1}) $$
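As a minimal, self-contained sketch of this definition (synthetic prices; the equal-weighted long group and the equal-weighted universe benchmark are assumptions, since the article does not fix them here):

```python
# Sketch of the excess-return definition above (synthetic data, illustrative assumptions)
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
close = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.02, size=(6, 4)), axis=0)),  # fake daily closes
    columns=["BTC.BN", "ETH.BN", "BNB.BN", "LTC.BN"],
)

ret = np.log(close / close.shift(1))            # return = ln(close_t / close_{t-1})
long_group = ["BTC.BN", "ETH.BN"]               # assume these are the factor's long-group tokens
long_return = ret[long_group].mean(axis=1)      # equal-weighted long-group return (assumption)
benchmark_return = ret.mean(axis=1)             # assume the benchmark is the whole universe
excess_return = long_return - benchmark_return  # excess-return series used for the correlations
```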

Each factor we select should contribute some explanatory power for token returns. The purpose of the correlation test is to find factors that explain and contribute to strategy returns in different ways, because the ultimate goal of the strategy is return. If two factors describe returns in the same way, then even if their factor values differ significantly the difference is meaningless. So we are not looking for factors whose values differ greatly from each other, but for factors that describe returns differently, which is why we ultimately chose to compute correlations on factor excess-return values.

Our strategy runs at a daily frequency, so we compute the correlation coefficient matrix between factor excess returns across the dates of the backtest interval.

Code to find the top n factors in the library most correlated with the new factor:

```python
def get_n_max_corr(self, factors, n=1):
    # excess-return series of the factors already in the library
    factors_excess = self.get_excess_returns(factors)
    # excess-return series of the factor being tested
    save_factor_excess = self.get_excess_return(self.factor_value, self.start_date, self.end_date)
    if len(factors_excess) < 1:
        return save_factor_excess, 1.0, None
    factors_excess[self.factor_name] = save_factor_excess['excess_return']
    factors_excess = pd.concat(factors_excess, axis=1)
    factors_excess.columns = factors_excess.columns.levels[0]
    # correlation matrix of the excess returns
    factor_corr = factors_excess.corr()
    factor_corr_df = factor_corr.abs().loc[self.factor_name]
    # drop the factor's correlation with itself and keep the n largest
    max_corr_score = factor_corr_df.sort_values(ascending=False).iloc[1:].head(n)
    return save_factor_excess, factor_corr_df, max_corr_score
```

3. Step 2: Factor selection and factor synthesis

For sets of highly correlated factors, there are two ways to handle them:

(1) Factor selection

Based on the factor's own ICIR, return, turnover, and Sharpe ratio, keep the factor that is most effective along the chosen dimension and delete the others.
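A minimal sketch of this selection step, with invented metric values (the article does not specify which dimension to rank on; ICIR is used here as an example):

```python
# Sketch: keep the most effective factor in a correlated group, drop the rest (invented metrics)
import pandas as pd

metrics = pd.DataFrame(
    {
        "ICIR":     [0.45, 0.38, 0.41],
        "return":   [0.12, 0.10, 0.15],
        "turnover": [0.30, 0.55, 0.42],
        "sharpe":   [1.1, 0.9, 1.3],
    },
    index=["mom_1m", "mom_3m", "mom_6m"],   # hypothetical correlated momentum factors
)

keep = metrics["ICIR"].idxmax()             # rank on ICIR; any of the other columns works the same way
drop = metrics.index.difference([keep])
print("keep:", keep, "| drop:", list(drop))
```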

(2) Factor synthesis

Synthesize the factors in the set, retaining as much effective cross-sectional information as possible:

$$ F = w_1 f_1 + w_2 f_2 + \dots + w_n f_n $$

where $F$ is the final composite factor, the $f_i$ are the factors to be synthesized, and the $w_i$ are their weights.

Assume that there are currently three factor series a, b, c to be processed:

```python
synthesis = pd.concat([a, b, c], axis=1)
synthesis
```

```
                a         b         c
BTC.BN   0.184865 -0.013253 -0.001557
ETH.BN   0.185691  0.022708  0.031793
BNB.BN   0.242072 -0.180952 -0.067430
LTC.BN   0.275923 -0.125712 -0.049596
AAVE.BN  0.204443 -0.000819 -0.006550
...           ...       ...       ...
SOC.BN   0.231638 -0.095946 -0.049495
AVAX.BN  0.204714 -0.079707 -0.041806
DAO.BN   0.194990  0.022095 -0.011764
ETC.BN   0.184236 -0.021909 -0.013325
TRX.BN   0.175118 -0.055077 -0.039513
```

2.1 Equal weighting

Each factor receives the same weight ($w = 1/\text{number of factors}$), and the composite factor is the average of the individual factor values.

E.g. momentum-type factors: the one-month, two-month, three-month, six-month, and twelve-month returns. The loadings of these factors are weighted equally and averaged into a new composite momentum loading, which is then normalized again.

```python
synthesis1 = synthesis.mean(axis=1)   # row-wise mean across the factor columns
```
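The re-normalization step mentioned above is not spelled out in the original; a common choice (an assumption here, not necessarily the authors' exact method) is a simple cross-sectional z-score of the composite loading:

```python
# Re-standardize the composite factor (cross-sectional z-score; one common choice, assumed here)
synthesis1 = (synthesis1 - synthesis1.mean()) / synthesis1.std()
```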

2.2 Historical IC weighting, historical ICIR weighting, historical return weighting

The factors are weighted by their IC values (or ICIR values, or historical return values) over the backtest period. The backtest contains many periods, each with its own IC value, so their average is used as the factor's weight; usually the arithmetic mean of the IC over the backtest period is used.

```python
# Weight normalization (most of the weighting schemes below also require this step)
w_IC = ic.mean() / ic.mean().sum()            # weights from mean IC per factor
w_ICIR = icir.mean() / icir.mean().sum()      # weights from mean ICIR per factor
w_Ret = Return.mean() / Return.mean().sum()   # weights from mean historical return per factor

# pick one of the three weighting schemes:
synthesis2 = (synthesis * w_IC).sum(axis=1)
synthesis2 = (synthesis * w_ICIR).sum(axis=1)
synthesis2 = (synthesis * w_Ret).sum(axis=1)
```
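The `ic`, `icir` and `Return` tables above are assumed to already exist, with one row per backtest period and one column per factor. As a hedged sketch of how such an IC table might be built (cross-sectional Spearman rank IC between factor values and next-period returns; the input frames are hypothetical):

```python
# Sketch: build a period x factor IC table (factor_values and forward_returns are hypothetical inputs)
import pandas as pd

def build_ic_table(factor_values: dict, forward_returns: pd.DataFrame) -> pd.DataFrame:
    """factor_values[name]: date x token factor exposures; forward_returns: date x token next-period returns."""
    ic = {}
    for name, values in factor_values.items():
        # cross-sectional rank correlation per date
        ic[name] = values.corrwith(forward_returns, axis=1, method="spearman")
    return pd.DataFrame(ic)

# a per-factor ICIR could then be computed as ic.mean() / ic.std()
```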

2.3 Historical IC half-life weighting, historical ICIR half-life weighting

Both 2.1 and 2.2 use arithmetic averages, which implicitly assumes that every IC and ICIR in the backtest period affects the factor equally.

In reality, however, each period of the backtest does not affect the current period equally: there is time decay. The closer a period is to the present, the larger its impact; the further away, the smaller. Based on this principle, before computing the IC weights we first define a half-life weight that is larger for periods close to the present and smaller for distant ones.

Mathematical definition of the half-life weight:

• Half-life $H$: each time we move back $H$ periods, the weight decays by half (exponential decay)

• $T$: the number of periods considered in the backtest

This gives $w_t = 2^{(t - T - 1)/H}$ for $t = 1, \dots, T$ (with $t = T$ the most recent period), normalized to sum to 1.

```python
# Half-life weights
def Decay(H, T):
    t = np.arange(T + 1)[1:]        # t = 1, 2, ..., T (T is the most recent period)
    wt = 2 ** ((t - T - 1) / H)     # weight halves every H periods moving back in time
    decay = wt / wt.sum()           # normalize
    return decay

# Historical IC half-life weighting
w_bs = Decay(6, 12)                 # assume T = 12, H = 6
ic_bs = ic.mul(w_bs, axis=0)
w = ic_bs.mean() / ic_bs.mean().sum()
synthesis3 = (synthesis * w).sum(axis=1)

# Historical ICIR half-life weighting:
# on top of the IC half-life weighting, divide by the standard deviation of the IC
w_bs = Decay(6, 12)
ic_bs = ic.mul(w_bs, axis=0)
ir_bs = ic_bs.mean() / ic.std()
w = ir_bs / ir_bs.sum()
synthesis3 = (synthesis * w).sum(axis=1)
```
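A quick sanity check of the decay weights under the assumed H = 6, T = 12: the most recent period should carry twice the weight of the period six steps earlier.

```python
# Sanity check of the half-life weights (H = 6, T = 12 as assumed above)
w = Decay(6, 12)
print(w[-1] / w[-7])   # ≈ 2.0: moving back 6 periods halves the weight
```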

2.4 Maximize ICIR weighting

Solve for the factor weight vector $w$ that maximizes the composite factor's ICIR, $ICIR(w) = \frac{w^{\top}\overline{IC}}{\sqrt{w^{\top}\Sigma w}}$, where $\overline{IC}$ is the vector of mean ICs and $\Sigma$ is the covariance matrix of the IC series. The optimal weight is proportional to $\Sigma^{-1}\overline{IC}$, which is what the code below computes (up to normalization).

Covariance matrix estimation: the covariance matrix measures how the different series co-move. In statistics, the sample covariance matrix is often used in place of the population covariance matrix, but when the sample size is insufficient the two can differ greatly. Shrinkage estimation was therefore proposed; its principle is to minimize the mean squared error between the estimated covariance matrix and the true covariance matrix.

Approaches:

1. Sample covariance matrix

```python
# Maximize-ICIR weighting (sample covariance)
ic_cov = np.array(ic.cov())
inv_ic_cov = np.linalg.inv(ic_cov)
ic_vector = np.mat(ic.mean())
w = inv_ic_cov * ic_vector.T        # w ∝ Σ⁻¹ · mean(IC)
w = w / w.sum()
synthesis4 = (synthesis * pd.DataFrame(w, index=synthesis.columns)[0]).sum(axis=1)
```

2. Ledoit-Wolf shrinkage: introduce a shrinkage coefficient that blends the sample covariance matrix with a structured target (a scaled identity matrix) to reduce the impact of noise.

```python
# Maximize-ICIR weighting (Ledoit-Wolf shrinkage estimate of the covariance)
from sklearn.covariance import LedoitWolf

model = LedoitWolf()
model.fit(ic)
ic_cov_lw = model.covariance_
inv_ic_cov = np.linalg.inv(ic_cov_lw)
ic_vector = np.mat(ic.mean())
w = inv_ic_cov * ic_vector.T
w = w / w.sum()
synthesis4 = (synthesis * pd.DataFrame(w, index=synthesis.columns)[0]).sum(axis=1)
```

3. Oracle Approximating Shrinkage (OAS): an improvement on Ledoit-Wolf shrinkage that aims to estimate the true covariance matrix more accurately when the sample is small, by adjusting the shrinkage intensity. (The implementation mirrors the Ledoit-Wolf version; see the sketch below.)
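Following the note above that the implementation mirrors the Ledoit-Wolf version, a sketch would only swap in scikit-learn's `OAS` estimator:

```python
# Maximize-ICIR weighting (Oracle Approximating Shrinkage estimate of the covariance)
# Identical to the Ledoit-Wolf version above except for the estimator class
import numpy as np
import pandas as pd
from sklearn.covariance import OAS

model = OAS()
model.fit(ic)
ic_cov_oas = model.covariance_
inv_ic_cov = np.linalg.inv(ic_cov_oas)
ic_vector = np.mat(ic.mean())
w = inv_ic_cov * ic_vector.T
w = w / w.sum()
synthesis4 = (synthesis * pd.DataFrame(w, index=synthesis.columns)[0]).sum(axis=1)
```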

2.5 Principal Component Analysis PCA

Principal Component Analysis (PCA) is a statistical method used to reduce dimensionality and extract the main features of data. The goal is to map the original data to a new coordinate system through linear transformation to maximize the variance of the data in the new coordinate system.

Specifically, PCA first finds the first principal component, the direction of largest variance in the data. It then finds the second principal component, which is orthogonal (uncorrelated) to the first and has the largest remaining variance. The process is repeated until all principal components have been found.

```python
# Principal Component Analysis (PCA)
from sklearn.decomposition import PCA

model1 = PCA(n_components=1)
model1.fit(synthesis)                     # fit on the factor value matrix
w = model1.components_                    # loadings of the first principal component
w = w / w.sum()
weighted_factor = (synthesis * pd.DataFrame(w, columns=synthesis.columns).iloc[0]).sum(axis=1)
```
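Two practical checks worth adding (they are not part of the original): how much cross-sectional variance the first component explains, and the arbitrary sign of PCA loadings, which may need flipping so that a larger composite value still means a stronger factor signal.

```python
# Optional checks on the PCA composite (the sign convention here is an assumption, not from the original)
print(model1.explained_variance_ratio_)            # variance share captured by the first component
if weighted_factor.corr(synthesis.mean(axis=1)) < 0:
    weighted_factor = -weighted_factor             # flip the arbitrary PCA sign to track the raw factors
```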

About LUCIDA & FALCON

Lucida ( https://www.lucida.fund/ ) is an industry-leading quantitative hedge fund that entered the crypto market in April 2018. It mainly trades CTA, statistical arbitrage, option volatility arbitrage and other strategies, with assets under management currently at US$30 million.

Falcon ( https://falcon.lucida.fund/ ) is a new generation of Web3 investment infrastructure. Based on a multi-factor model, it helps users "select", "buy", "manage" and "sell" crypto assets. Falcon was incubated by Lucida in June 2022.

More content can be found at https://linktr.ee/lucida_and_falcon
