Synthetic Data for Machine Learning
Loading the dataset
# Note: load_boston was removed in scikit-learn 1.2, so this snippet needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
Generating synthetic data. This simulates a scenario where a company may be unwilling to share the real dataset but is willing to release a synthetic copy that preserves many of the real dataset’s properties for researchers to use.
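import numpy as np
# GaussianMultivariate is assumed here to come from the copulas library
from copulas.multivariate import GaussianMultivariate

def create_synthetic(X, y):
    # Stack features and target so the copula models their joint distribution
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)
    model = GaussianMultivariate()
    model.fit(dataset)
    # Draw as many synthetic rows as there are real rows
    synthetic = model.sample(len(dataset))
    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]
    return X, y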
X_syn, y_syn = create_synthetic(X_train, y_train)
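To make the modeling step above more concrete, here is a minimal NumPy sketch of the idea behind a Gaussian copula: turn each column into normal scores, capture the dependence in a correlation matrix, sample from it, and map the samples back to the original scale. It is an illustration under simplifying assumptions (rank-based scores and empirical marginals), not the implementation behind GaussianMultivariate. The function name gaussian_copula_sample and its parameters are made up for this example.

```python
from statistics import NormalDist

import numpy as np

_norm = NormalDist()
_ppf = np.vectorize(_norm.inv_cdf)  # standard normal quantile function
_cdf = np.vectorize(_norm.cdf)      # standard normal CDF


def gaussian_copula_sample(data, n_samples, seed=0):
    """Sample synthetic rows from a Gaussian copula with empirical marginals.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # 1. Map each column to (0, 1) via its ranks, then to normal scores.
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    z = _ppf(ranks / (n + 1))
    # 2. The correlation of the normal scores parameterizes the copula.
    corr = np.corrcoef(z, rowvar=False)
    # 3. Sample correlated normals and map them back to uniforms.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = _cdf(z_new)
    # 4. Invert each marginal with the empirical quantile function.
    return np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(d)]
    )
```

Because the marginals are inverted with empirical quantiles, every synthetic value stays inside the observed range of its column, while the correlation matrix carries the dependence between columns.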
# Training
from sklearn.linear_model import ElasticNet
model = ElasticNet()  # elastic net regression
model.fit(X_syn, y_syn)
score_syn = model.score(X_test, y_test)
model.fit(X_train, y_train)
score_real = model.score(X_test, y_test)
print('syn score is {}'.format(score_syn))
print('real score is {}'.format(score_real))
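For regressors such as ElasticNet, scikit-learn's score method returns the coefficient of determination R². The sketch below recomputes that quantity by hand to show what the two printed numbers measure. The helper r2_manual is illustrative and not part of scikit-learn.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Squared error of the predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Squared error of always predicting the mean of y_true
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A value of 1 means perfect predictions, 0 matches always predicting the mean of y_test, and negative values are worse than that baseline. A synthetic-data score close to the real-data score therefore suggests that the synthetic copy kept the relationships the model relies on.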
Vine Copula
A vine is a graphical tool for labeling constraints in high-dimensional probability distributions. A regular vine (R-vine) is a special case in which all constraints are two-dimensional or conditional two-dimensional. Although the number of parametric multivariate copula families with flexible dependence is limited, there are many parametric families of bivariate copulas, and an R-vine lets these bivariate families be combined into a high-dimensional dependence model. R-vines have also proven useful in other problems, such as (constrained) sampling of correlation matrices and building non-parametric continuous Bayesian networks.
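As a rough illustration of "two-dimensional or conditional two-dimensional" constraints, the sketch below lists the pair-copulas of a D-vine, one simple kind of regular vine in which the variables are arranged along a path. The function d_vine_pairs and its output format are assumptions made for this example. It only enumerates the structure and does not fit any copula.

```python
def d_vine_pairs(d):
    """List (tree, left, right, conditioning) for a D-vine on variables 1..d.

    Illustrative helper: each tuple stands for one bivariate copula between
    `left` and `right`, conditional on the variables in `conditioning`.
    """
    pairs = []
    for tree in range(1, d):
        for i in range(1, d - tree + 1):
            j = i + tree
            # Variables strictly between i and j on the path are conditioned on.
            conditioning = tuple(range(i + 1, j))
            pairs.append((tree, i, j, conditioning))
    return pairs


for tree, left, right, cond in d_vine_pairs(4):
    given = " | " + ",".join(map(str, cond)) if cond else ""
    print(f"tree {tree}: c({left},{right}{given})")
```

For d = 4 this yields six pair-copulas: three unconditional ones in the first tree, two conditioned on a single variable, and one conditioned on two variables. In general a vine on d variables uses d(d-1)/2 bivariate copulas.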