Installing

pip installation

pip install copulas

Reference docs
Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given a table containing numerical data, we can use Copulas to learn the distribution and later generate new synthetic rows that follow the same statistical properties.

Some of the features provided by this library include:

A variety of distributions for modeling univariate data.
Multiple Archimedean copulas for modeling bivariate data.
Gaussian and Vine copulas for modeling multivariate data.
Automatic selection of univariate distributions and bivariate copulas.

Probability Integral Transform

Suppose we have a random variable $X$ with a continuous cumulative distribution function $F$. Then we can define a random variable $Y$ as
$Y = F(X)$
and prove that $Y$ follows a uniform distribution over the interval $[0, 1]$.
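
The proof is short if we additionally assume, for simplicity, that $F$ is strictly increasing, so that the inverse $F^{-1}$ exists. For any $y \in [0, 1]$:

$$P(Y \le y) = P(F(X) \le y) = P(X \le F^{-1}(y)) = F(F^{-1}(y)) = y$$

This is exactly the cumulative distribution function of the uniform distribution on $[0, 1]$.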

code

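import matplotlib.pyplot as plt
from scipy import stats

def func_CDF_uniform():
    # Draw 10,000 samples from a standard normal distribution
    X = stats.norm.rvs(size=10000)
    # Apply the normal CDF to every sample (probability integral transform)
    Xp = stats.norm.cdf(X)

    fig = plt.figure(figsize=(10, 4))
    fig.add_subplot(1, 2, 1)
    plt.hist(X, density=True, bins=10)  # histogram of the original samples
    plt.title('Samples')
    plt.xlabel('x')

    fig.add_subplot(1, 2, 2)
    plt.hist(Xp, density=True, bins=10)  # histogram of the transformed samples
    plt.title('Transformed Samples')
    plt.xlabel('F(x)')

    plt.show()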
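
A histogram only gives a visual impression. As a complementary numerical check, the transformed samples can be compared against the uniform distribution with a Kolmogorov-Smirnov test. The sketch below is not part of the original post; the function name `pit_uniformity_check` and the fixed seed are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def pit_uniformity_check(n_samples=10000, seed=0):
    """Hypothetical helper: apply the normal CDF to normal samples and
    test the result against Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)   # samples from N(0, 1)
    u = stats.norm.cdf(x)                # probability integral transform
    # One-sample Kolmogorov-Smirnov test against the uniform distribution on [0, 1]
    result = stats.kstest(u, 'uniform')
    return u, result.statistic, result.pvalue
```

A small KS statistic (the largest gap between the empirical CDF of the transformed samples and the uniform CDF) is consistent with the transformed samples being uniform.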

Copulas

The key intuition underlying copula functions is the idea that marginal distributions can be modeled independently from the joint distribution. Consider a dataset with two columns A and B. A copula-based modeling approach would:

Model A and B independently, transforming them into uniform distributions using the probability integral transform.
Model the relationship between the transformed variables using the copula function.
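
The two steps above can be sketched by hand with NumPy and SciPy. This is a minimal illustration under stated assumptions, not how the Copulas library is implemented: the marginals are modeled with the empirical CDF (ranks) instead of fitted parametric distributions, the copula is assumed to be Gaussian, and the function names `fit_gaussian_copula` and `sample_gaussian_copula` are hypothetical.

```python
import numpy as np
from scipy import stats

def fit_gaussian_copula(data):
    """data: array of shape (n_rows, n_columns).
    Returns the correlation matrix of the normal scores."""
    n = data.shape[0]
    # Step 1: transform each column to (0, 1) with its empirical CDF.
    # Dividing ranks by n + 1 keeps the values strictly inside (0, 1).
    ranks = np.apply_along_axis(stats.rankdata, 0, data)
    u = ranks / (n + 1)
    # Step 2: model the dependence between the transformed columns.
    # For a Gaussian copula, map the uniforms to standard normal scores
    # and estimate their correlation matrix.
    z = stats.norm.ppf(u)
    return np.corrcoef(z, rowvar=False)

def sample_gaussian_copula(data, corr, n_samples, seed=0):
    """Draw synthetic rows: sample correlated normals, map them to
    uniforms, then invert each marginal with empirical quantiles."""
    rng = np.random.default_rng(seed)
    dim = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(dim), corr, size=n_samples)
    u = stats.norm.cdf(z)
    columns = [np.quantile(data[:, j], u[:, j]) for j in range(dim)]
    return np.column_stack(columns)
```

Because the marginals and the dependence structure are handled separately, either part could be swapped out (for example, a parametric marginal or a different copula family) without touching the other.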

To model the data with a Gaussian copula, we can use the GaussianMultivariate class, which automatically transforms each column using the best available univariate distribution.

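import matplotlib.pyplot as plt
from copulas.multivariate import GaussianMultivariate
# Plotting helpers used in this post; assumed to be importable from
# copulas.visualization (availability depends on the installed version)
from copulas.visualization import hist_1d, side_by_side

# df is a table with numerical 'age' and 'income' columns
copula = GaussianMultivariate()
copula.fit(df)
# Transform each column with the CDF of its fitted univariate distribution
age_cdf = copula.univariates[0].cdf(df['age'])
inc_cdf = copula.univariates[1].cdf(df['income'])
side_by_side(hist_1d, {'Age': age_cdf, 'Income': inc_cdf})
plt.show()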

In the resulting figure, the transformed data looks much more uniform than the original values.

Comparing the real data and the synthetic data

synthetic = copula.sample(len(df))
print(synthetic.head())
compare_2d(df, synthetic)
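
Besides the visual comparison, a few summary statistics can be compared numerically. The sketch below is an addition to the original post and uses only NumPy; the helper name `compare_summary` is hypothetical, and it assumes both tables contain only numerical columns in the same order.

```python
import numpy as np

def compare_summary(real, synthetic):
    """Hypothetical helper: largest absolute gaps between the column means,
    standard deviations and pairwise correlations of two numeric tables."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0))
    std_gap = np.abs(real.std(axis=0) - synthetic.std(axis=0))
    corr_gap = np.abs(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synthetic, rowvar=False)
    )
    return {
        'mean_gap': float(mean_gap.max()),
        'std_gap': float(std_gap.max()),
        'corr_gap': float(corr_gap.max()),
    }
```

With the tables above, it could be called as `compare_summary(df, synthetic)`. Small gaps suggest that the synthetic rows reproduce the first- and second-order statistics of the real data, though they do not by themselves show that the full joint distribution matches.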
