Installing
pip installation
pip install copulas
Reference docs
Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given a table of numerical data, we can use Copulas to learn the distribution and later generate new synthetic rows that follow the same statistical properties.
Some of the features provided by this library include:
- A variety of distributions for modeling univariate data.
- Multiple Archimedean copulas for modeling bivariate data.
- Gaussian and Vine copulas for modeling multivariate data.
- Automatic selection of univariate distributions and bivariate copulas.
Probability Integral Transform
Suppose we have a random variable $X$ that comes from a distribution with cumulative distribution function $F(X)$. Then, we can define a random variable $Y$ as
$Y = F(X)$
and prove that $Y$ follows a uniform distribution over the interval $[0, 1]$.
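The proof can be sketched in one line. This is a minimal derivation under the assumption (not stated above) that $F$ is continuous and strictly increasing, so that the inverse $F^{-1}$ exists:

```latex
P(Y \le y) = P\bigl(F(X) \le y\bigr)
           = P\bigl(X \le F^{-1}(y)\bigr)
           = F\bigl(F^{-1}(y)\bigr)
           = y, \qquad 0 \le y \le 1 .
```

Since $P(Y \le y) = y$ on $[0, 1]$ is exactly the cumulative distribution function of the uniform distribution, $Y$ is uniform on that interval.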
code
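from scipy import stats
import matplotlib.pyplot as plt

def func_CDF_uniform():
    X = stats.norm.rvs(size=10000)  # draw 10,000 samples from a standard normal
    Xp = stats.norm.cdf(X)  # apply the normal CDF (probability integral transform)
    fig = plt.figure(figsize=(10, 4))
    fig.add_subplot(1, 2, 1)
    plt.hist(X, density=True, bins=10)  # histogram of the original samples
    plt.title('Samples')
    plt.xlabel('x')
    fig.add_subplot(1, 2, 2)
    plt.hist(Xp, density=True, bins=10)  # histogram of the transformed samples
    plt.title('Transformed Samples')
    plt.xlabel('x')
    plt.show()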
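A histogram only gives a visual impression of uniformity. As a complementary check, the sketch below runs a Kolmogorov-Smirnov test against the uniform distribution on [0, 1]. The seed and the variable names (`rng`, `x`, `u`) are assumptions chosen for illustration and are not part of the original example.

```python
import numpy as np
from scipy import stats

# Fixed seed (an arbitrary choice) so the run is reproducible
rng = np.random.default_rng(0)

# Draw 10,000 standard normal samples and push them through the normal CDF
x = rng.standard_normal(10000)
u = stats.norm.cdf(x)

# Compare the transformed samples with Uniform(0, 1);
# a small KS statistic means the empirical CDF is close to the uniform CDF
ks_statistic, p_value = stats.kstest(u, "uniform")
print(f"KS statistic: {ks_statistic:.4f}, p-value: {p_value:.3f}")
```

A small statistic together with a p-value that is not close to zero is consistent with the transformed samples being uniform.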
Copulas
The key intuition underlying copula functions is the idea that marginal distributions can be modeled independently from the joint distribution. Consider a dataset with two columns A and B. A copula-based modeling approach would:
- Model A and B independently, transforming them into uniform distributions using the probability integral transform.
- Model the relationship between the transformed variables using the copula function.
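The two steps above can be sketched by hand with NumPy and SciPy, without the Copulas library. This is an illustrative sketch and not how the library is implemented: the toy data, the choice of exponential and gamma marginals, the correlation of 0.7, and all variable names are assumptions made for the example. The dependence is modeled with a Gaussian copula, which reduces to a single correlation coefficient in the bivariate case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical toy data: two dependent columns with non-normal marginals
n = 5000
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=n)
a = stats.expon.ppf(stats.norm.cdf(latent[:, 0]), scale=2.0)  # column A
b = stats.gamma.ppf(stats.norm.cdf(latent[:, 1]), a=3.0)      # column B

# Step 1: model each column independently, then map it to [0, 1]
# with its fitted CDF (probability integral transform)
a_loc, a_scale = stats.expon.fit(a, floc=0)
b_shape, b_loc, b_scale = stats.gamma.fit(b, floc=0)
u_a = stats.expon.cdf(a, loc=a_loc, scale=a_scale)
u_b = stats.gamma.cdf(b, b_shape, loc=b_loc, scale=b_scale)

# Step 2: model the dependence between the uniform variables.
# For a Gaussian copula, convert to normal scores and estimate their correlation.
eps = 1e-9  # keep values strictly inside (0, 1) so norm.ppf stays finite
z_a = stats.norm.ppf(np.clip(u_a, eps, 1 - eps))
z_b = stats.norm.ppf(np.clip(u_b, eps, 1 - eps))
rho = np.corrcoef(z_a, z_b)[0, 1]

# Sampling reverses the steps: correlated normals -> uniforms -> original scales
z_new = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
u_new = stats.norm.cdf(z_new)
a_new = stats.expon.ppf(u_new[:, 0], loc=a_loc, scale=a_scale)
b_new = stats.gamma.ppf(u_new[:, 1], b_shape, loc=b_loc, scale=b_scale)

# Rank correlation should be similar in the real and the synthetic data
rho_real, _ = stats.spearmanr(a, b)
rho_synth, _ = stats.spearmanr(a_new, b_new)
print(f"estimated copula correlation: {rho:.3f}")
print(f"Spearman (real): {rho_real:.3f}, Spearman (synthetic): {rho_synth:.3f}")
```

Keeping the marginals and the dependence structure separate means either part can be swapped out on its own, for example a different marginal family for one column or a different copula for the dependence.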
When modeling with a Gaussian copula, the GaussianMultivariate class automatically transforms the columns using the best available distribution.
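from copulas.multivariate import GaussianMultivariate

# df (a table with 'age' and 'income' columns), the plotting helpers
# side_by_side and hist_1d, and plt (matplotlib.pyplot) are assumed
# to be defined earlier in the session.
copula = GaussianMultivariate()
copula.fit(df)
# Apply each fitted marginal CDF to its own column
age_cdf = copula.univariates[0].cdf(df['age'])
inc_cdf = copula.univariates[1].cdf(df['income'])
side_by_side(hist_1d, {'Age': age_cdf, 'Income': inc_cdf})
plt.show()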
In the resulting figure, the transformed data looks much more uniform than the original values.
Comparing the real data and the synthetic data
from copulas.visualization import compare_2d

# Draw as many synthetic rows as there are real ones
synthetic = copula.sample(len(df))
print(synthetic.head())
compare_2d(df, synthetic)