【IMDB】IMDB数据集导入PostgreSQL和join order benchmark(JOB)查询生成

简述

  • IMDB数据库是一个很大的,被广泛使用的电影,电视节目和演员信息的数据库,它包括了有关电影、电视节目、演员、制作公司、编剧、导演等信息。IMDB数据集可以为电影评论、分类、预测以及其他机器学习任务提供有用的参考信息。
  • Join Order Benchmark(JOB)是一个数据库基准测试,旨在评估数据库优化器的能力,特别是在确定关系表之间连接顺序方面。该基准测试涉及多个关系表的连接,挑战数据库优化器在处理复杂查询时进行最佳加入顺序决策的能力。论文链接:http://www.vldb.org/pvldb/vol9/p204-leis.pdf

join order benchmark(JOB)查询获取

IMDB数据集导入PostgreSQL和join order benchmark(JOB)查询生成:

join order benchmark(JOB)-github-含有安装教程

进入github,需要查询语句直接下载即可:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-UK92X7yI-1686801757755)(typora_img/image-20230615111540448.png)]

注意,代码里有给出IMDB数据集的下载,但是第二步的网站链接失效了,所以用其它方法导入:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-3AJ1dPZb-1686801757756)(typora_img/image-20230615111713043.png)]

IMDB导入数据到PG

数据集 TPC-H、TPC-DS、IMDB的导入使用

(1)下载CSV等文件

下载 imdb.tgz,放置到到某个路径,记住该路径,后面有用。作者这里放置在/var/lib/pgsql/benchmark

接着,解压imdb.tgz:

tar -zxvf imdb.tgz

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-L6aDiH1J-1686801757757)(typora_img/image-20230615112333547.png)]

以下命令都需要进入psql后运行:

(2)psql进入PG,创建数据库:

CREATE DATABASE imdbload;

使用imdbload数据库:

\c imdbload

(2)执行sql脚本创建,注意讲前面的路径修改为imdb.tgz的放置路径:

\i /var/lib/pgsql/benchmark/schematext.sql;

(3)导入数据,注意讲前面的路径修改为imdb.tgz的放置路径:

\copy aka_name from '/var/lib/pgsql/benchmark/aka_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy aka_title from '/var/lib/pgsql/benchmark/aka_title.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy cast_info from '/var/lib/pgsql/benchmark/cast_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy char_name from '/var/lib/pgsql/benchmark/char_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy comp_cast_type from '/var/lib/pgsql/benchmark/comp_cast_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_name from '/var/lib/pgsql/benchmark/company_name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy company_type from '/var/lib/pgsql/benchmark/company_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy complete_cast from '/var/lib/pgsql/benchmark/complete_cast.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy info_type from '/var/lib/pgsql/benchmark/info_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy keyword from '/var/lib/pgsql/benchmark/keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy kind_type from '/var/lib/pgsql/benchmark/kind_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy link_type from '/var/lib/pgsql/benchmark/link_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_companies from '/var/lib/pgsql/benchmark/movie_companies.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info from '/var/lib/pgsql/benchmark/movie_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_info_idx from '/var/lib/pgsql/benchmark/movie_info_idx.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_keyword from '/var/lib/pgsql/benchmark/movie_keyword.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy movie_link from '/var/lib/pgsql/benchmark/movie_link.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy name from '/var/lib/pgsql/benchmark/name.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy person_info from '/var/lib/pgsql/benchmark/person_info.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy role_type from '/var/lib/pgsql/benchmark/role_type.csv' with delimiter as ',' csv quote '"' escape as '\';
\copy title from '/var/lib/pgsql/benchmark/title.csv' with delimiter as ',' csv quote '"' escape as '\';

(4)检验数据(可有可无

导入后我们并不知道是否导入成功,可以写个shell脚本检验下。当然,如果嫌麻烦可以跳过,检查一两个表即可。

bash命令显示imdbload的所有表:

echo "\dt" | psql -t -A -d imdbload

如果显示的结果是:

public|aka_name|table|postgres
public|aka_title|table|postgres
public|cast_info|table|postgres
public|char_name|table|postgres
public|comp_cast_type|table|postgres
public|company_name|table|postgres
public|company_type|table|postgres
public|complete_cast|table|postgres
public|info_type|table|postgres
public|keyword|table|postgres
public|kind_type|table|postgres
public|link_type|table|postgres
public|movie_companies|table|postgres
public|movie_info|table|postgres
public|movie_info_idx|table|postgres
public|movie_keyword|table|postgres
public|movie_link|table|postgres
public|name|table|postgres
public|person_info|table|postgres
public|role_type|table|postgres
public|title|table|postgres

那么脚本需要分割|

#!/bin/bash

# 获取所有表格名称
TABLES=$(echo "\dt" | psql -t -A -d imdbload)

# 遍历每个表格并获取其记录数
for table in $TABLES; do
	table=$(echo "${table}" | cut -d '|' -f 2)
    count=$(echo "SELECT COUNT(*) FROM $table" | psql -t -A -d imdbload)
    echo "$table: $count"
done

如果只是movie_info,那么去掉第八行table=$(echo "${table}" | cut -d '|' -f 2)

测试:

vim test_imdb.sh,将完整的脚本写入,wq退出。然后sh test_imdb.sh,如果结果:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-vYnH2HRs-1686801757757)(typora_img/image-20230615114426024.png)]

发现都有数据,那么说明导入数据成功!

猜你喜欢

转载自blog.csdn.net/aruewds/article/details/131225201