Continuing from the previous installments: we have already published two articles in the series "Building a Powerful Crypto-Asset Portfolio Using Multi-Factor Models": "Theoretical Basics" and "Data Preprocessing".
This is the third article: factor validity testing.
After computing the factor values, we first need to test the validity of each factor and screen for factors that meet the requirements of significance, stability, monotonicity, and return contribution. Factor validity is tested by analyzing the relationship between the current period's factor values and the next period's expected returns. There are three classic methods:
IC/IR method: the IC/IR value is the correlation coefficient between factor values and expected returns; the larger its absolute value, the better the factor's performance.
T value (regression method): the t value reflects the significance of the coefficient obtained by regressing the next period's returns on the current period's factor values. By checking whether the regression coefficient passes the t-test, we judge how much the current period's factor values contribute to the next period's returns. This method is typically used in multivariate (i.e., multi-factor) regression models.
Stratified backtesting method: tokens are stratified by factor value, and the return of each layer's token portfolio is then calculated to judge the monotonicity of the factor.
1. IC/IR Method
(1) Definition of IC/IR
IC: Information Coefficient, which measures the factor's ability to predict token returns. The IC of a given period is the correlation coefficient between the current period's factor values and the next period's returns.
$$$ ICₜ = Correlation(fₜ , Rₜ₊₁) $$$
fₜ: factor value in period t
Rₜ₊₁: The rate of return of token in period t+1
IC ∈ [-1, 1]; the larger the absolute value of IC, the stronger the factor's token-selection ability.
The closer the IC is to 1, the stronger the positive correlation between the factor value and the next period's return. IC = 1 means the factor's token selection is 100% accurate: the token with the highest factor score will have the largest gain in the next rebalancing cycle.
The closer the IC is to -1, the stronger the negative correlation between the factor value and the next period's return. IC = -1 means the token with the highest factor score will have the largest decline in the next rebalancing cycle; the factor is a perfect contrarian indicator.
The closer the IC is to 0, the weaker the factor's predictive power, indicating that the factor has no ability to predict token returns.
IR: Information Ratio, which measures the factor's ability to deliver stable alpha. IR is the mean of IC across all periods divided by the standard deviation of IC across all periods.
$$$ IR = mean(ICₜ) / std(ICₜ) $$$
When the absolute value of the mean IC is greater than 0.05 (some practitioners use 0.02), the factor's token-selection ability is strong. When IR is greater than 0.5, the factor has a strong ability to deliver excess returns stably.
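As a quick numeric sketch of how these thresholds are applied (the IC series here is synthetic, generated purely for illustration; in practice it comes from the per-period IC calculation described in this section):

```python
import numpy as np

# Hypothetical per-period rank IC series for one factor, e.g. 24 monthly
# rebalancing periods (synthetic data for illustration only).
rng = np.random.default_rng(0)
ic_series = rng.normal(loc=0.06, scale=0.08, size=24)

ic_mean = ic_series.mean()
ir = ic_mean / ic_series.std(ddof=1)

# Screening rules from the text: |IC mean| > 0.05 and IR > 0.5
print(f"IC mean = {ic_mean:.3f}, IR = {ir:.3f}")
print("significant:", abs(ic_mean) > 0.05, "| stable:", ir > 0.5)
```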
(2) IC calculation method
- Normal IC (Pearson correlation): the classic Pearson correlation coefficient. This method carries several assumptions: the data are continuous and normally distributed, and the two variables have a linear relationship.
$$$ IC_{Pearson,t} = \frac{cov(f_t, R_{t+1})}{\sqrt{var(f_t)\,var(R_{t+1})}} = \frac{\sum_{i=1}^{N}(f_{i,t} - \bar{f}_t)(R_{i,t+1} - \bar{R}_{t+1})}{\sqrt{\sum_{i=1}^{N}(f_{i,t} - \bar{f}_t)^2 \sum_{i=1}^{N}(R_{i,t+1} - \bar{R}_{t+1})^2}} $$$
where i indexes the N tokens in the pool at period t.
- Rank IC (Spearman's rank correlation coefficient): first rank both variables, then compute the Pearson correlation coefficient on the ranks. Spearman's rank correlation evaluates the monotonic relationship between two variables and, because the values are converted to ranks, it is less affected by outliers. Pearson's correlation evaluates the linear relationship between two variables; it not only places prerequisites on the raw data but is also strongly affected by outliers. In practice, computing the rank IC is the more robust choice.
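A minimal sketch of the difference, using pandas' built-in correlation methods on a hypothetical one-period cross-section (all numbers invented for illustration):

```python
import pandas as pd

# Hypothetical cross-section: factor values f_t and next-period returns
# R_{t+1} for 6 tokens; the last token's return is a deliberate outlier.
df = pd.DataFrame({
    'factor':  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'return1': [0.01, 0.02, 0.03, 0.04, 0.05, 5.00],
})

normal_ic = df['factor'].corr(df['return1'], method='pearson')
rank_ic   = df['factor'].corr(df['return1'], method='spearman')

# The relationship is perfectly monotonic, so rank IC = 1.0, while the
# outlier drags the Pearson IC well below 1.
print(normal_ic, rank_ic)
```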
(3) IC/IR method code implementation
```python
import numpy as np
import pandas as pd

class TestAlpha(object):
    def __init__(self, ini_data):
        self.ini_data = ini_data

    def chooseDate(self, cycle, start_date, end_date):
        '''
        Build a list of unique dates in ascending order and keep the last
        date of each cycle as a rebalancing date.
        cycle: day, month, quarter, year
        Operates on the date column of the raw dataframe self.ini_data.
        '''
        chooseDate = []
        dateList = sorted(self.ini_data[self.ini_data['date'].between(start_date, end_date)]['date'].drop_duplicates().values)
        dateList = pd.to_datetime(dateList)
        for i in range(len(dateList) - 1):
            if getattr(dateList[i], cycle) != getattr(dateList[i + 1], cycle):
                chooseDate.append(dateList[i])
        chooseDate.append(dateList[-1])
        chooseDate = [date.strftime('%Y-%m-%d') for date in chooseDate]
        return chooseDate

    def ICIR(self, chooseDate, factor):
        # 1. Compute the IC of each rebalancing date, i.e. IC_t
        testIC = pd.DataFrame(index=chooseDate, columns=['normalIC', 'rankIC'])
        dfFactor = self.ini_data[self.ini_data['date'].isin(chooseDate)][['date', 'name', 'price', factor]]
        for i in range(len(chooseDate) - 1):
            # (1) normal IC (Pearson)
            X = dfFactor[dfFactor['date'] == chooseDate[i]][['date', 'name', 'price', factor]].rename(columns={'price': 'close0'})
            Y = pd.merge(X, dfFactor[dfFactor['date'] == chooseDate[i + 1]][['date', 'name', 'price']],
                         on=['name']).rename(columns={'price': 'close1'})
            Y['returnM'] = (Y['close1'] - Y['close0']) / Y['close0']
            Yt = np.array(Y['returnM'])
            Xt = np.array(Y[factor])
            Y_mean = Y['returnM'].mean()
            X_mean = Y[factor].mean()
            num = np.sum((Xt - X_mean) * (Yt - Y_mean))
            den = np.sqrt(np.sum((Xt - X_mean) ** 2) * np.sum((Yt - Y_mean) ** 2))
            normalIC = num / den  # Pearson correlation
            # (2) rank IC (Spearman): Pearson correlation on the ranks
            Yr = Y['returnM'].rank()
            Xr = Y[factor].rank()
            rankIC = Yr.corr(Xr)
            testIC.iloc[i] = normalIC, rankIC
        testIC = testIC[:-1]  # the last date has no next-period return

        # 2. From IC_t, derive ['IC_Mean', 'IC_Std', 'IR',
        #    'share of IC<0 (factor direction)', 'share of |IC|>0.05']
        '''
        ICmean:  |IC|>0.05 -> strong token-selection ability, factor value highly
                 correlated with next-period return; |IC|<0.05 -> weak ability.
        IR:      |IR|>0.5 -> strong selection ability with stable IC values;
                 |IR|<0.5 -> IC unstable, factor not very effective;
                 near 0 -> basically invalid.
        IClZero (IC less than Zero): a share of IC<0 near one half -> neutral
                 factor; well above half -> negative factor, i.e. higher factor
                 values mean lower returns.
        ICALzpF (|IC| above 0.05): a high share means the factor is effective
                 most of the time.
        '''
        IR = testIC.mean() / testIC.std()
        IClZero = testIC[testIC < 0].count() / testIC.count()
        ICALzpF = testIC[abs(testIC) > 0.05].count() / testIC.count()
        combined = pd.concat([testIC.mean(), testIC.std(), IR, IClZero, ICALzpF], axis=1)
        combined.columns = ['ICmean', 'ICstd', 'IR', 'IClZero', 'ICALzpF']

        # 3. Cumulative IC curve over the rebalancing periods
        print("Test IC Table:")
        print(testIC)
        print("Result:")
        print('normal Skewness:', testIC['normalIC'].skew(), 'rank Skewness:', testIC['rankIC'].skew())
        print('normal Kurtosis:', testIC['normalIC'].kurt(), 'rank Kurtosis:', testIC['rankIC'].kurt())
        return combined, testIC.cumsum().plot()
```
2. T-value test (regression method)
The t-value method also tests the relationship between the current period's factor values and the next period's returns, but unlike the IC/IR method it does not measure correlation. Instead, it regresses the next period's return (the dependent variable Y) on the current period's factor value (the independent variable X), then performs a t-test on the factor's regression coefficient to check whether it differs significantly from 0, i.e., whether this period's factor value affects the next period's return.
In essence, this method fits a simple (single-variable) linear regression model. The specific formula is as follows:
$$$ Rₜ₊₁ = αₜ + βₜfₜ + μₜ $$$
Rₜ₊₁: Token rate of return in period t+1
fₜ: factor value in period t
βₜ: The regression coefficient of the factor value in period t, that is, the factor return rate
αₜ: Intercept term, reflecting the average impact of all variables not included in the model on Rₜ₊₁
(1) Regression method theory
Set the significance level α, usually 10%, 5%, or 1%.
Test hypothesis: $$H₀ : βₜ =0$$, $$H₁ : βₜ ≠ 0$$
$$$ t = (β̂ₜ − βₜ) / se(β̂ₜ) ~ tα/₂(n−k) $$$
n: the sample size; k: the number of parameters in the regression model
- If |t statistic| > tα/₂(n−k) → H₀ is rejected, that is, the factor value fₜ in this period has a significant impact on the next period's return Rₜ₊₁.
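A sketch of this test on simulated data, using scipy's `linregress` (the arrays `factor_t` and `ret_next` are hypothetical names for one period's factor values and next-period returns; all numbers are synthetic):

```python
import numpy as np
from scipy import stats

# Hypothetical single period: regress next-period returns on current
# factor values for 50 tokens and t-test the slope (H0: beta_t = 0).
rng = np.random.default_rng(42)
factor_t = rng.normal(size=50)                                 # f_t
ret_next = 0.02 * factor_t + rng.normal(scale=0.01, size=50)   # R_{t+1}

res = stats.linregress(factor_t, ret_next)
t_stat = res.slope / res.stderr   # same t statistic as the formula above

# Two-sided test at alpha = 5%: reject H0 when |t| > t_{alpha/2}(n - 2)
crit = stats.t.ppf(0.975, df=len(factor_t) - 2)
print(f"beta_hat = {res.slope:.4f}, t = {t_stat:.2f}, critical = {crit:.2f}")
print("factor significant this period:", abs(t_stat) > crit)
```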
(2) Regression method code implementation
```python
    def regT(self, chooseDate, factor, return_24h):
        testT = pd.DataFrame(index=chooseDate, columns=['coef', 'T'])
        for i in range(len(chooseDate) - 1):
            X = self.ini_data[self.ini_data['date'] == chooseDate[i]][factor].values
            Y = self.ini_data[self.ini_data['date'] == chooseDate[i + 1]][return_24h].values
            b, intc = np.polyfit(X, Y, 1)   # slope and intercept
            ut = Y - (b * X + intc)         # residuals
            # t value: t = (b_hat - 0) / se(b_hat)
            n = len(X)
            dof = n - 2                     # degrees of freedom
            s = np.sqrt(np.sum(ut ** 2) / dof)               # residual standard error
            se_b = s / np.sqrt(np.sum((X - X.mean()) ** 2))  # standard error of the slope
            t_stat = b / se_b
            testT.iloc[i] = b, t_stat
        testT = testT[:-1]
        testT_mean = testT['T'].abs().mean()
        testTL196 = len(testT[testT['T'].abs() > 1.96]) / len(testT)
        print('testT_mean:', testT_mean)
        print('share of |T| > 1.96:', testTL196)
        return testT
```
3. Stratified backtesting method
Stratification means dividing all tokens into layers by factor value; backtesting means calculating the return of each layer's token portfolio.
(1) Stratification
First, obtain the factor values for the token pool and sort the tokens in ascending order of factor value, i.e., tokens with smaller factor values rank first. Then divide the tokens equally into layers according to this ordering: layer 0 holds the tokens with the smallest factor values and layer 9 those with the largest.
In theory, "equal division" means splitting the number of tokens equally, i.e., each layer contains the same number of tokens; this is achieved with quantiles. In practice, the total number of tokens is not necessarily a multiple of the number of layers, so the layer sizes are not necessarily equal.
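A small illustration of this, assuming a hypothetical pool of 23 tokens split into 10 layers with pandas `qcut`:

```python
import numpy as np
import pandas as pd

# Hypothetical pool of 23 tokens: qcut assigns each token to a decile by
# factor value; since 23 is not a multiple of 10, layer sizes differ.
factor_values = pd.Series(np.arange(23) * 0.1)
groups = pd.qcut(factor_values, 10, labels=[str(i) for i in range(10)])

sizes = groups.value_counts().sort_index()
print(sizes.tolist())  # each layer holds 2 or 3 tokens, not all equal
```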
(2) Backtesting
After dividing the tokens into 10 groups in ascending order of factor value, calculate the return of each group's portfolio. This step treats each layer's tokens as one investment portfolio (the tokens in each layer's portfolio change across backtest periods) and computes the portfolio's overall return in the next period. Whereas the IC/IR and t-value methods relate the current period's factor values to the next period's overall returns, stratified backtesting requires restratifying and recomputing the layered portfolio returns at every rebalancing date in the backtest period. Finally, each layer's per-period returns are cumulatively multiplied to obtain the layer's cumulative return.
Ideally, for a good factor, group 9 has the highest cumulative return curve and group 0 the lowest, with returns increasing monotonically from group 0 to group 9.
The curve of group 9 minus group 0 (i.e., the long-short return) should then rise steadily.
(3) Stratified backtesting method code implementation
```python
    def layBackTest(self, chooseDate, factor):
        f = {}
        returnM = {}
        for i in range(len(chooseDate) - 1):
            df1 = self.ini_data[self.ini_data['date'] == chooseDate[i]].rename(columns={'price': 'close0'})
            Y = pd.merge(df1, self.ini_data[self.ini_data['date'] == chooseDate[i + 1]][['date', 'name', 'price']],
                         left_on=['name'], right_on=['name']).rename(columns={'price': 'close1'})
            f[i] = Y[factor]
            returnM[i] = Y['close1'] / Y['close0'] - 1
        labels = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
        res = pd.DataFrame(index=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'LongShort'])
        res[chooseDate[0]] = 1
        for i in range(len(chooseDate) - 1):
            dfM = pd.DataFrame({'factor': f[i], 'returnM': returnM[i]})
            # split tokens into 10 equal-sized groups by factor value
            dfM['group'] = pd.qcut(dfM['factor'], 10, labels=labels)
            dfGM = dfM.groupby('group').mean()[['returnM']]
            # long-short return: long the top group (9), short the bottom group (0)
            dfGM.loc['LongShort'] = dfGM.loc['9'] - dfGM.loc['0']
            # compound each group's return onto the previous period's net value
            res[chooseDate[i + 1]] = res[chooseDate[i]] * (1 + dfGM['returnM'])
        # correlation between group rank and cumulative layered return
        data = pd.DataFrame({'cumReturn': res.iloc[:10, -1],
                             'Group': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
        df3 = data.corr()
        print("Correlation Matrix:")
        print(df3)
        return res.T.plot(title='Group backtest net worth curve')
```
About LUCIDA & FALCON
Lucida ( https://www.lucida.fund/ ) is an industry-leading quantitative hedge fund that entered the Crypto market in April 2018. It mainly trades CTA, statistical arbitrage, option volatility arbitrage, and other strategies, with approximately US$30 million currently under management.
Falcon ( https://falcon.lucida.fund/ ) is a new generation of Web3 investment infrastructure. It is based on a multi-factor model and helps users "select", "buy", "manage" and "sell" crypto assets. Falcon was hatched by Lucida in June 2022.
More content can be found at https://linktr.ee/lucida_and_falcon
Previous articles
Use multi-factor strategies to build powerful crypto asset portfolios#Data Preprocessing#
Construct a powerful crypto asset portfolio using multi-factor strategies#Theoretical Basics#
From Tech Breakthroughs to Market Boom: Understanding the Link in the Crypto Bull Market
What exactly is driving Crypto's bull market? Is it a technological upgrade?
Development as the Driving Force: Understanding the Impact on Token Price Performance?
Is "the team doing something" really related to the currency price?
5 million rows of data review Crypto’s three-year bull market @LUCIDA
LUCIDA: Use multi-factor models to select tracks and currencies
LUCIDA × SnapFingers DAO: 21 top public chains in three-year bull market recap
LUCIDA × OKLink: The value of on-chain data to secondary market investment