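When Spark SQL reads from MySQL, Oracle, Greenplum, or a similar database over JDBC, loading the full table is not always necessary. For example:
Only some of the columns are needed
Only the rows that match a filter condition are needed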
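In these cases, change the value passed to option("dbtable", tablename) on the JDBC reader. See the property descriptions in the official Spark documentation: (Spark 2.3, jdbc-to-other-databases, full property list)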
Property Name | Meaning
url | The JDBC URL to connect to. The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret
dbtable | The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
driver | The class name of the JDBC driver to use to connect to this URL.
… | …
dbtable: the JDBC table to read. Instead of a full table name, a subquery in parentheses can also be supplied.
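Because the dbtable value ends up in the FROM clause of the statement Spark generates, a subquery passed this way needs both the parentheses and an alias. A minimal sketch of a helper that produces such a value from a plain SELECT statement could look like the following. The names DbTableSubquery and asDbTable and the default alias temp are hypothetical, chosen only for illustration, and are not part of Spark.

```scala
object DbTableSubquery {
  // Hypothetical helper (not a Spark API): wraps a SELECT statement in
  // parentheses and appends an alias so it can be used as a dbtable value.
  // A trailing semicolon is dropped because it is not valid inside a subquery.
  def asDbTable(query: String, alias: String = "temp"): String = {
    val cleaned = query.trim.stripSuffix(";").trim
    s"($cleaned) $alias"
  }

  def main(args: Array[String]): Unit = {
    val dbtable = asDbTable("select id,name,gender from test.info where gender='man';")
    // The resulting string is what would be passed to option("dbtable", ...)
    println(dbtable)
  }
}
```

The alias is written without the AS keyword on purpose: that form is accepted as a table alias by MySQL and Greenplum and also by Oracle, which does not allow AS before a table alias.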
The test code is as follows:
import org.apache.spark.sql.SparkSession

object JDBCSource {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for testing
    val spark = SparkSession.builder()
      .appName("Greenplum_test")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    // dbtable is used as the source of a SELECT statement (it goes into the FROM clause),
    // so a subquery passed here must be parenthesized and given an alias.
    val tablename = "(select id,name,gender from test.info where gender='man') temp"

    val data = spark.read
      .format("jdbc")
      // Connector/J 5.x class name; Connector/J 8.x uses com.mysql.cj.jdbc.Driver
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", "jdbc:mysql://localhost:3306/test")
      .option("dbtable", tablename) // pass the subquery instead of a table name
      .option("user", "username")
      .option("password", "password")
      .load()

    data.show()
    spark.stop()
  }
}
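For readers on a newer Spark version than the 2.3 documentation referenced above: from Spark 2.4 the JDBC source also accepts a query option, which takes the SELECT statement directly, and Spark adds the parentheses and alias itself. The sketch below only builds the option map, so it runs without a Spark dependency. The names JdbcQueryOptions and queryOptions are hypothetical, and the connection values are the same placeholders as in the test code.

```scala
object JdbcQueryOptions {
  // Hypothetical helper (not a Spark API): collects the reader options for a
  // JDBC read that uses "query" (assumed Spark 2.4 or later) instead of "dbtable".
  // No alias is written here because Spark wraps and aliases the query itself.
  def queryOptions(url: String, driver: String, user: String,
                   password: String, query: String): Map[String, String] =
    Map(
      "url" -> url,
      "driver" -> driver,
      "user" -> user,
      "password" -> password,
      "query" -> query
    )

  def main(args: Array[String]): Unit = {
    val opts = queryOptions(
      url = "jdbc:mysql://localhost:3306/test",
      driver = "com.mysql.jdbc.Driver",
      user = "username",
      password = "password",
      query = "select id,name,gender from test.info where gender='man'"
    )
    // With a SparkSession named `spark` in scope, the map would be used as:
    //   spark.read.format("jdbc").options(opts).load().show()
    println(opts.keys.toSeq.sorted.mkString(", "))
  }
}
```

The query and dbtable options cannot be set on the same reader, so the map deliberately contains only one of them.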