Learning Spark: Tuning and Debugging Spark (7)
1. Tuning and debugging Spark usually means changing the runtime configuration options of a Spark application.
These options live in SparkConf: a SparkConf instance is created whenever a SparkContext is created.
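A minimal sketch of configuring an application through SparkConf (the application name and master URL below are placeholders, not values from the original text):

import org.apache.spark.{SparkConf, SparkContext}

// Build a SparkConf, set options on it, then hand it to the SparkContext.
val conf = new SparkConf()
conf.set("spark.app.name", "My Spark App")   // placeholder name; same effect as setAppName(...)
conf.set("spark.master", "local[4]")         // placeholder master; same effect as setMaster(...)
// The conf is copied when the context is constructed, so set everything first.
val sc = new SparkContext(conf)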
2. Spark resolves the effective value of each configuration option using a specific order of precedence (a sketch follows below):
highest priority goes to options set explicitly in user code with the set() method;
next come flags passed to spark-submit;
then values written in a configuration file;
and finally the system defaults.
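A small sketch of how that precedence plays out for a single option; the memory values here are made up for illustration:

import org.apache.spark.SparkConf

// Suppose conf/spark-defaults.conf contains:  spark.executor.memory  512m
// and the job is launched with:               spark-submit --executor-memory 1g ...
// An explicit set() in user code still takes priority over both:
val conf = new SparkConf()
conf.set("spark.executor.memory", "2g")
println(conf.get("spark.executor.memory"))   // "2g": the in-code value shadows the flag and the defaults file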
3. There are two ways to inspect application progress and performance metrics: the web UI, and the log files produced by the driver and executor processes.
4. The components of Spark execution: jobs, tasks, and stages
Goal: use the Spark shell to work through a simple log-analysis application.
scala> val input = sc.textFile("/home/spark01/Documents/input.text")
input: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:27

scala> val tokenized = input.map(line => line.split(" ")).filter(words => words.size > 0)
tokenized: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at filter at <console>:29

scala> val counts = tokenized.map(words => (words(0), 1)).reduceByKey{ (a, b) => a + b }
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:31

scala> input.toDebugString
res0: String =
(1) MapPartitionsRDD[3] at textFile at <console>:27 []
 |  /home/spark01/Documents/input.text HadoopRDD[2] at textFile at <console>:27 []

scala> counts.toDebugString
res1: String =
(1) ShuffledRDD[7] at reduceByKey at <console>:31 []
 +-(1) MapPartitionsRDD[6] at map at <console>:31 []
    |  MapPartitionsRDD[5] at filter at <console>:29 []
    |  MapPartitionsRDD[4] at map at <console>:29 []
    |  MapPartitionsRDD[3] at textFile at <console>:27 []
    |  /home/spark01/Documents/input.text HadoopRDD[2] at textFile at <console>:27 []

scala> counts.collect()
res2: Array[(String, Int)] = Array((ERROR,1), (##input.text##,1), (INFO,4), ("",2), (WARN,2))

scala> counts.cache()
res3: counts.type = ShuffledRDD[7] at reduceByKey at <console>:31

scala> counts.collect()
res5: Array[(String, Int)] = Array((ERROR,1), (##input.text##,1), (INFO,4), ("",2), (WARN,2))
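For comparison with the shell session, here is a minimal sketch of the same analysis as a standalone application; the object name is hypothetical and the input path is the one used above:

import org.apache.spark.{SparkConf, SparkContext}

object LogTokenCount {
  def main(args: Array[String]): Unit = {
    // Configuration goes through SparkConf, as in point 1.
    val conf = new SparkConf().setAppName("LogTokenCount")
    val sc = new SparkContext(conf)

    val input = sc.textFile("/home/spark01/Documents/input.text")
    val tokenized = input.map(line => line.split(" ")).filter(words => words.size > 0)
    val counts = tokenized.map(words => (words(0), 1)).reduceByKey(_ + _)

    counts.cache()                        // keep the shuffled result for reuse, as in the shell session
    counts.collect().foreach(println)     // same action as above

    sc.stop()
  }
}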
5. The Spark web UI
By default it is available at http://localhost:4040.
Through a browser you can inspect the details of jobs that have already run, as shown in the figures below.
[Figure 1: user interface listing all jobs/tasks. Figure 2: detail view for job 2.]
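If port 4040 is already in use or needs to change, the UI port can be overridden through the configuration; a minimal sketch, with a placeholder port number:

import org.apache.spark.{SparkConf, SparkContext}

// The web UI is served by the driver while the SparkContext is alive.
val conf = new SparkConf().setAppName("UiPortExample").setMaster("local[2]")
conf.set("spark.ui.port", "4041")   // placeholder; the default is 4040
val sc = new SparkContext(conf)
// Browse to http://localhost:4041 while this application is running.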
6. Key performance considerations (a sketch follows below):
at the code level: degree of parallelism, serialization format, and memory management;
in the runtime environment: hardware provisioning.
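A minimal sketch that touches each of the code-level knobs above; the partition count, serializer choice, and storage level are illustrative, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("TuningKnobs")
// Serialization format: use Kryo for shuffle and cache data.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val lines = sc.textFile("/home/spark01/Documents/input.text")
// Degree of parallelism: request an explicit number of reduce-side partitions.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _, 8)
// Memory management: cache in serialized form to trade CPU for a smaller footprint.
counts.persist(StorageLevel.MEMORY_ONLY_SER)
counts.count()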