Preface
In the previous installment we published the first article in the series "Constructing a Powerful Crypto-Asset Portfolio Using Multi-Factor Strategies" - Theoretical Basis. This is the second article - Data Preprocessing.
Before and after calculating factor data, and before testing the validity of a single factor, the relevant data must be processed. Data preprocessing covers the handling of duplicate values, outliers/missing values/extreme values, standardization, and data frequency.
1. Duplicate values
Data-related definitions:
Key: a unique index. e.g. for a dataset covering all tokens and all dates, the key is "token_id/contract_address - date".
Value: the object indexed by the key is called the "value".
Diagnosing duplicate values first requires understanding what the data "should" look like. The data usually takes one of the following forms:
Time series data. The key is "time". e.g. five years of price data for a single token.
Cross-sectional data. The key is "individual". e.g. the price data of all tokens in the crypto market on 2023.11.01.
Panel data. The key is the "individual-time" combination. e.g. the price data of all tokens from 2019.01.01 to 2023.11.01.
Principle: once the data's index (key) is determined, you know at which level the data should contain no duplicate values.
Check methods:
Check for duplicates: pd.DataFrame.duplicated(subset=[key1, key2, ...])
Count the duplicates: pd.DataFrame.duplicated(subset=[key1, key2, ...]).sum()
Inspect a duplicate sample: df[df.duplicated(subset=[...])].sample(), then use df.loc to select all duplicate rows sharing that sample's key.
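As a minimal sketch, assuming a hypothetical panel DataFrame df keyed by token_id and date:

import pandas as pd

# Hypothetical panel data; the key should be "token_id - date"
df = pd.DataFrame({
    'token_id': ['BTC', 'BTC', 'ETH', 'ETH', 'ETH'],
    'date': ['2023-11-01', '2023-11-01', '2023-11-01', '2023-11-02', '2023-11-02'],
    'close': [34500.0, 34500.0, 1800.0, 1850.0, 1849.0],
})
key = ['token_id', 'date']

# Count rows that duplicate an earlier row on the key
n_dup = df.duplicated(subset=key).sum()
print(f"duplicate rows: {n_dup}")

# Inspect one duplicated sample, then use df.loc to pull every row sharing its key
if n_dup > 0:
    sample = df[df.duplicated(subset=key)].sample(1)
    mask = (df[key] == sample[key].iloc[0]).all(axis=1)
    print(df.loc[mask])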
pd.merge(df1, df2, on=[key1, key2, ...], indicator=True, validate='1:1')
When merging horizontally, adding the indicator parameter generates a _merge field; dfm['_merge'].value_counts() then shows how many rows in the merged data set come from each source. Adding the validate parameter checks whether the keys in the merged data set relate as expected ('1:1', '1:m' or 'm:m'; the last case effectively needs no validation). If the relationship is not as expected, the merge raises an error and execution aborts.
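A minimal sketch of such a validated merge, using hypothetical price and market-cap frames:

import pandas as pd

price = pd.DataFrame({'token_id': ['BTC', 'ETH'], 'date': ['2023-11-01'] * 2,
                      'close': [34500.0, 1800.0]})
mcap = pd.DataFrame({'token_id': ['BTC', 'ETH'], 'date': ['2023-11-01'] * 2,
                     'market_cap': [6.7e11, 2.2e11]})

# validate='1:1' raises MergeError if either side has duplicate keys;
# indicator=True adds a _merge column recording each row's origin
dfm = pd.merge(price, mcap, on=['token_id', 'date'], how='outer',
               indicator=True, validate='1:1')
print(dfm['_merge'].value_counts())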
2. Outliers/missing values/extreme values
Common causes of outliers:
Extreme cases. For example, a token priced at $0.000001 or with a market cap of only $500,000: even a tiny change produces a return of dozens of times.
Data characteristics. For example, if token price data starts being downloaded on January 1, 2020, then the return for January 1, 2020 naturally cannot be calculated, because there is no previous day's closing price.
Data errors. Data providers inevitably make mistakes, such as recording a token price of 12 yuan as 1.2 yuan.
Principles for handling outliers and missing values:
Delete. Outliers that cannot be reasonably explained or corrected may be deleted.
Replace. Usually used for extreme values, e.g. winsorizing or taking logarithms (less commonly used).
Fill. Missing values can also be filled in a reasonable way. Common methods include the mean (or a moving average), interpolation, filling with 0 (df.fillna(0)), forward filling (df.ffill()) and backward filling (df.bfill()). Whichever method is used, consider whether the assumption it relies on holds; use backward filling with caution in machine learning, as it carries a risk of look-ahead bias. A short sketch of these fills follows.
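A minimal sketch of the common fill methods, assuming a hypothetical daily series with gaps:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan],
              index=pd.date_range('2023-11-01', periods=5, freq='D'))

filled_zero = s.fillna(0)          # fill with a constant 0
filled_mean = s.fillna(s.mean())   # fill with the sample mean
filled_ffill = s.ffill()           # forward fill: carry the last known value forward
filled_bfill = s.bfill()           # backward fill: uses future values -> look-ahead risk
filled_interp = s.interpolate()    # linear interpolation between known points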
How to deal with extreme values:
1. Percentile method.
After sorting the data from smallest to largest, values beyond the minimum and maximum percentiles are replaced with the boundary values. For data with a long history this method is relatively crude and not very applicable: forcibly clipping a fixed proportion of the data may cause a corresponding loss of information.
2. 3σ / three-standard-deviation method
The standard deviation $$\sigma$$ of the factor data reflects how dispersed the data are, i.e. their volatility. The range $$[\mu - 3\sigma, \mu + 3\sigma]$$ is used to identify and replace outliers; approximately 99.73% of the data fall within it. The premise is that the factor data follow a normal distribution, i.e. $$X \sim N(\mu, \sigma^2)$$.
Here $$\mu = \frac{1}{n}\sum_{i=1}^{n} X_i$$ and $$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2$$, and the reasonable range of factor values is $$[\mu - 3\sigma, \mu + 3\sigma]$$.
Adjust all factor values in the data set as follows:
$$$ X'_i = \begin{cases} \mu + 3\sigma, & X_i > \mu + 3\sigma \\ X_i, & \mu - 3\sigma \le X_i \le \mu + 3\sigma \\ \mu - 3\sigma, & X_i < \mu - 3\sigma \end{cases} $$$
The disadvantage of this method is that data commonly used in quantitative finance, such as stock prices and token prices, often exhibit leptokurtic, fat-tailed distributions that violate the normality assumption; in that case the $$3\sigma$$ method will misidentify a large amount of data as outliers.
3. Median Absolute Deviation (MAD) method
This method is based on the median and the absolute deviation, which makes the processed data less sensitive to extreme values or outliers and more robust than methods based on the mean and standard deviation.
Median of the absolute deviations: $$MAD = median(|X_i - X_{median}|)$$
The reasonable range of factor values is $$[X_{median} - n \times MAD, X_{median} + n \times MAD]$$. Adjust all factor values in the data set as follows:
$$$ X'_i = \begin{cases} X_{median} + n \times MAD, & X_i > X_{median} + n \times MAD \\ X_i, & X_{median} - n \times MAD \le X_i \le X_{median} + n \times MAD \\ X_{median} - n \times MAD, & X_i < X_{median} - n \times MAD \end{cases} $$$
import numpy as np
import pandas as pd

# Handle extreme values in factor data
class Extreme(object):
    def __init__(s, ini_data):
        s.ini_data = ini_data  # factor data as a pandas Series

    def three_sigma(s, n=3):
        # Clip values to [mean - n*std, mean + n*std]
        mean = s.ini_data.mean()
        std = s.ini_data.std()
        low = mean - n * std
        high = mean + n * std
        return np.clip(s.ini_data, low, high)

    def mad(s, n=3):
        # Clip values to [median - n*MAD, median + n*MAD]
        median = s.ini_data.median()
        mad_median = abs(s.ini_data - median).median()
        high = median + n * mad_median
        low = median - n * mad_median
        return np.clip(s.ini_data, low, high)

    def quantile(s, l=0.025, h=0.975):
        # Clip values to the given lower/upper percentiles
        low = s.ini_data.quantile(l)
        high = s.ini_data.quantile(h)
        return np.clip(s.ini_data, low, high)
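For example (a hypothetical factor series with one obvious extreme value), the class above could be used like this:

import pandas as pd

factor = pd.Series([0.5, 0.6, 0.55, 0.58, 12.0])  # 12.0 is an obvious extreme value
ext = Extreme(factor)
clipped_sigma = ext.three_sigma(n=3)
clipped_mad = ext.mad(n=3)
clipped_pct = ext.quantile(l=0.025, h=0.975)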
3. Standardization
1.Z-score normalization
Premise: $$X \sim N(\mu, \sigma^2)$$
Because the standard deviation is used, this method is sensitive to outliers in the data.
$$$ x'_i = \frac{x_i - \mu}{\sigma} = \frac{x_i - mean(X)}{std(X)} $$$
2. Min-Max scaling
Transforms each factor value into the $$(0,1)$$ interval so that data of different magnitudes or ranges can be compared; it changes neither the shape of the distribution within the data, nor makes the values sum to 1.
Because the maximum and minimum values are used, it is sensitive to outliers.
Unifying the scale makes it easier to compare data measured in different dimensions.
$$$ x'_i = \frac{x_i - min(X)}{max(X) - min(X)} $$$
3. Rank Scaling
Converts the data into their ranks, and converts the ranks into scores between 0 and 1, typically their percentiles within the data set.
Since ranks are not affected by outliers, this method is insensitive to outliers.
Absolute distances between points are not preserved; the data are reduced to relative rankings.
$$$ NormRank_i = \frac{Rank_{X_i} - min(Rank_X)}{max(Rank_X) - min(Rank_X)} = \frac{Rank_{X_i}}{N} $$$
where $$min(Rank_X) = 0$$ and $$N$$ is the total number of data points in the interval.
# Standardize factor data
class Scale(object):
    def __init__(s, ini_data, date):
        s.ini_data = ini_data  # factor data as a pandas Series
        s.date = date

    def zscore(s):
        # Subtract the mean and divide by the standard deviation
        mean = s.ini_data.mean()
        std = s.ini_data.std()
        return s.ini_data.sub(mean).div(std)

    def maxmin(s):
        # Rescale values into the [0, 1] interval
        min_val = s.ini_data.min()
        max_val = s.ini_data.max()
        return s.ini_data.sub(min_val).div(max_val - min_val)

    def normRank(s):
        # Rank the column; method='min' gives tied values the same (minimum) rank
        # rather than the average rank, then scale ranks into (0, 1]
        ranks = s.ini_data.rank(method='min')
        return ranks.div(ranks.max())
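As a usage sketch (a hypothetical cross-sectional factor series for a single date), assuming the class above:

import pandas as pd

factor = pd.Series([0.5, 0.6, 0.55, 0.58, 1.2])
scaler = Scale(factor, date='2023-11-01')
z = scaler.zscore()      # zero mean, unit standard deviation
mm = scaler.maxmin()     # values rescaled into [0, 1]
rk = scaler.normRank()   # rank-based scores in (0, 1]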
4. Data frequency
Sometimes the data we obtain is not at the frequency required for the analysis. For example, if the analysis is monthly but the raw data is daily, we need "downsampling", i.e. aggregating the data to a monthly frequency.
Downsampling
This refers to aggregating a set of rows into a single row, e.g. aggregating daily data into monthly data. The characteristics of each aggregated indicator need to be considered; common operations include:
first value/last value
mean/median
standard deviation
Upsampling
This refers to splitting one row of data into multiple rows, e.g. using annual data for a monthly analysis. This usually just requires simple repetition; sometimes the annual value needs to be distributed proportionally across the months. A short sketch of both operations follows.
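A minimal sketch with pandas resample, using hypothetical daily and annual series:

import numpy as np
import pandas as pd

daily = pd.Series(np.random.rand(90),
                  index=pd.date_range('2023-01-01', periods=90, freq='D'))

# Downsampling: aggregate daily data to monthly, choosing the right statistic per indicator
monthly_last = daily.resample('ME').last()   # e.g. month-end closing price ('M' on older pandas)
monthly_mean = daily.resample('ME').mean()   # e.g. average daily value
monthly_std = daily.resample('ME').std()     # e.g. within-month volatility

# Upsampling: repeat an annual value for every month of that year
annual = pd.Series([1.0, 2.0], index=pd.to_datetime(['2022-01-01', '2023-01-01']))
monthly_from_annual = annual.resample('MS').ffill()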
About LUCIDA & FALCON
Lucida ( https://www.lucida.fund/ ) is an industry-leading quantitative hedge fund that entered the crypto market in April 2018, mainly trading CTA, statistical arbitrage, option volatility arbitrage and other strategies, with assets under management currently at US$30 million.
Falcon ( https://falcon.lucida.fund/ ) is a new generation of Web3 investment infrastructure. Based on a multi-factor model, it helps users "select", "buy", "manage" and "sell" crypto assets. Falcon was incubated by Lucida in June 2022.
More content can be found at https://linktr.ee/lucida_and_falcon
Previous articles
Constructing a Powerful Crypto-Asset Portfolio Using Multi-Factor Strategies - Theoretical Basis
From Tech Breakthroughs to Market Boom: Understanding the Link in the Crypto Bull Market
What exactly is driving Crypto's bull market? Is it a technological upgrade?
Development as the Driving Force: Understanding the Impact on Token Price Performance?
Is "the team doing something" really related to the currency price?
5 Million Rows of Data Recap: Investigating The Crypto Market's 3-Year Bull Run @LUCIDA
LUCIDA: Use multi-factor models to select tracks and currencies
LUCIDA × SnapFingers DAO: 21 top public chains in three-year bull market recap
LUCIDA × OKLink: The value of on-chain data to secondary market investment