Application development principles:
Set the configuration options properly and let Spark SQL's built-in optimizations work for you, e.g. Project Tungsten, AQE, and SQL functions.
Project Tungsten: compact off-heap binary row format plus whole-stage code generation; enabled by default.
AQE (Adaptive Query Execution): re-optimizes the query plan at runtime, e.g. coalescing small shuffle partitions, switching join strategies, and splitting skewed join partitions.
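A minimal sketch of turning these on (the AQE keys below are standard Spark 3.x settings; the app name is made up for illustration; Tungsten needs no switch):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-demo")
  // Master switch for Adaptive Query Execution
  .config("spark.sql.adaptive.enabled", "true")
  // Merge small shuffle partitions after each stage
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // Split skewed partitions in sort-merge joins
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  .getOrCreate()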
Implementation steps:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

val dates: List[String] = List("2020-01-01", "2020-01-02", "2020-01-03")
val rootPath: String = _

// Read one day's log files, deduplicate, and explode userInterestList
def createDF(rootPath: String, date: String): DataFrame = {
  val path: String = rootPath + date
  val df = spark.read.parquet(path)
    .distinct
    .withColumn("userInterest", explode(col("userInterestList")))
  df
}

// Select fields, filter, deduplicate again, then union the per-day results
val distinctItems: DataFrame = dates.map { case date: String =>
  val df: DataFrame = createDF(rootPath, date)
    .select("userId", "itemId", "userInterest", "accessFreq")
    .filter("accessFreq in ('High', 'Medium')")
    .distinct
  df
}.reduce(_ union _)
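Why this is slow: every date kicks off its own Parquet scan, and distinct plus explode run over full-width rows before any filtering or column pruning, so the engine deduplicates and expands far more data than it ultimately keeps.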
Optimization:
val dates: List[String] = List("2020-01-01", "2020-01-02", "2020-01-03")
val rootPath: String = _
val filePaths: List[String] = dates.map(rootPath + _)

/**
 * Scan all files in a single read
 * Filter and prune columns first
 * Then explode userInterestList
 * Finally deduplicate once
 */
val distinctItems = spark.read.parquet(filePaths: _*)
  .filter("accessFreq in ('High', 'Medium')")
  .select("userId", "itemId", "userInterestList")
  .withColumn("userInterest", explode(col("userInterestList")))
  .select("userId", "itemId", "userInterest")
  .distinct
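To check that the filter and column pruning really reach the Parquet scan, inspect the physical plan (a standard DataFrame call, not part of the original steps) and look for PushedFilters and a narrowed ReadSchema in the FileScan node:

distinctItems.explain(true)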
Anti-pattern: on the right-side table, call map and instantiate a Util class inside it to obtain the hash algorithm, then concatenate the join keys and hash them; this creates a new Util (and a new MessageDigest) for every single record.
import java.security.MessageDigest
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._

class Util {
  val md5: MessageDigest = MessageDigest.getInstance("MD5")
  val sha256: MessageDigest = _ // other hash algorithms
}

val df: DataFrame = _
// A Util is instantiated once per record
val ds: Dataset[(String, Int)] = df.map { case row: Row =>
  val util = new Util()
  val s: String = row.getString(0) + row.getString(1) + row.getString(2)
  val hashKey: String = util.md5.digest(s.getBytes).map("%02X".format(_)).mkString
  (hashKey, row.getInt(3))
}
Optimization:
// mapPartitions: instantiate Util once per partition instead of once per record
val ds: Dataset[(String, Int)] = df.mapPartitions(iterator => {
  val util = new Util()
  val res = iterator.map { row =>
    val s: String = row.getString(0) + row.getString(1) + row.getString(2)
    val hashKey: String = util.md5.digest(s.getBytes).map("%02X".format(_)).mkString
    (hashKey, row.getInt(3))
  }
  res
})
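Sharing one MessageDigest across all rows of a partition is safe here: java.security.MessageDigest.digest resets the instance after each call, and a partition's rows are consumed sequentially by a single task, so no synchronization is needed. The same once-per-partition pattern applies to any expensive, reusable resource, such as database connections or compiled regexes.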