Loading Partial Data and Adding Filter Conditions via JDBC in Spark

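When Spark SQL connects to MySQL, Oracle, Greenplum, or a similar database over JDBC, there are cases where loading the full table is unnecessary. For example:

Only some of the columns are needed
Only the rows that match a filter condition are needed
In these cases, change the value passed to option("dbtable", tablename) on the JDBC reader. The official Spark documentation describes the relevant properties (Spark 2.3, "JDBC To Other Databases", which lists all of them):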

Property Name   Meaning
url   The JDBC URL to connect to. The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret
dbtable   The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
driver   The class name of the JDBC driver to use to connect to this URL.
… …   … …
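dbtable: the JDBC table to read. Anything that is valid in a FROM clause is accepted, so a subquery in parentheses can be used in place of a full table.

The test code is as follows: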

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object JDBCSource {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Greenplum_test").setMaster("local[*]")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    // dbtable is used as the source of a SELECT statement (it goes into the FROM clause),
    // so a subquery must be wrapped in parentheses and given an alias.
    val tablename = "(select id,name,gender from test.info where gender='man') temp"

    val data = spark.read
      .format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", "jdbc:mysql://localhost:3306/test")
      .option("dbtable", tablename) // pass the subquery in place of a table name
      .option("user", "username")
      .option("password", "password")
      .load()

    data.show()
    spark.stop()
  }
}
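
The subquery string passed to dbtable above can also be assembled from its parts, which covers both cases mentioned earlier (a subset of columns, and rows matching a condition). The following is only a minimal sketch: DbTableSubquery and its build method are hypothetical names introduced here for illustration and are not part of Spark. It concatenates plain strings, so it assumes the column names, table name, and condition come from trusted code rather than user input.

```scala
// Hypothetical helper (not a Spark API): builds a value for the "dbtable" option.
object DbTableSubquery {
  // columns: columns to select; table: source table;
  // where: optional filter condition; alias: alias for the subquery.
  // Plain string concatenation: only pass trusted values.
  def build(columns: Seq[String],
            table: String,
            where: Option[String] = None,
            alias: String = "temp"): String = {
    require(columns.nonEmpty, "at least one column is required")
    val whereClause = where.map(cond => s" where $cond").getOrElse("")
    // Parentheses plus an alias make the subquery usable in a FROM clause.
    s"(select ${columns.mkString(",")} from $table$whereClause) $alias"
  }

  def main(args: Array[String]): Unit = {
    val tablename = build(Seq("id", "name", "gender"), "test.info", Some("gender='man'"))
    println(tablename)
  }
}
```

The resulting string can be passed to .option("dbtable", ...) exactly as tablename is in the test code. Omitting the where argument yields a subquery that only narrows the columns.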
