After a heavy time cost (mostly spent downloading a huge number of jars), the first example from the book 'Learning Spark' finally runs through.
The source code is very simple:
/**
 * Illustrates flatMap + countByValue for wordcount.
 */
package com.oreilly.learningsparkexamples.scala

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val master = args.length match {
      case x: Int if x > 0 => args(0)
      case _ => "local"
    }
    println("**spark-home :" + System.getenv("SPARK_HOME")) // null if not set
    val sc = new SparkContext(master, "WordCount", System.getenv("SPARK_HOME"))
    val input = args.length match {
      case x: Int if x > 1 => sc.textFile(args(1))
      case _ => sc.parallelize(List("pandas", "i like pandas")) // else generate a list as input data
    }
    val words = input.flatMap(line => line.split(" "))
    args.length match {
      case x: Int if x > 2 => {
        val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y } // same as reduceByKey((x, y) => x + y)
        counts.saveAsTextFile(args(2))
      }
      case _ => { // else count by word occurrences
        val wc = words.countByValue()
        println(wc.mkString(","))
      }
    }
  }
}
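Note that the two branches compute the same counts in different places: reduceByKey aggregates on the executors and leaves the result in an RDD, while countByValue brings a local Map back to the driver. A minimal pure-Scala sketch of the two counting strategies on the sample input (plain collections only, no Spark; the WordCountLocal name is made up for illustration):

```scala
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val input = List("pandas", "i like pandas")
    val words = input.flatMap(_.split(" "))

    // reduceByKey-style: pair each word with 1, then sum the 1s per key
    val reduced: Map[String, Int] =
      words.map((_, 1)).groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2).sum) }

    // countByValue-style: count occurrences of each distinct word directly
    val counted: Map[String, Int] =
      words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

    // both strategies produce the same word counts
    assert(reduced == counted)
    println(counted.mkString(",")) // map iteration order may vary
  }
}
```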
The project layout looks like this:
Below are the logs from a run in local mode:
JHLinMacBook:learning-spark-src userxx$ spark-submit --verbose --class com.oreilly.learningsparkexamples.scala.WordCount target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
Using properties file: null
Parsed arguments:
  master                  local[*]
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.oreilly.learningsparkexamples.scala.WordCount
  primaryResource         file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
  name                    com.oreilly.learningsparkexamples.scala.WordCount
  childArgs               []
  jars                    null
  packages                null
  repositories            null
  verbose                 true
Spark properties used, including those specified through --conf and those from the properties file null:

Main class:
com.oreilly.learningsparkexamples.scala.WordCount
Arguments:

System properties:
SPARK_SUBMIT -> true
spark.app.name -> com.oreilly.learningsparkexamples.scala.WordCount
spark.jars -> file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
spark.master -> local[*]
Classpath elements:
file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar
/Users/userxx/Cloud/Spark/spark-1.4.1-bin-hadoop2.4

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/22 23:20:13 INFO SparkContext: Running Spark version 1.4.1
2015-09-22 23:20:13.411 java[918:1903] Unable to load realm info from SCDynamicStore
15/09/22 23:20:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/22 23:20:13 WARN Utils: Your hostname, JHLinMacBook resolves to a loopback address: 127.0.0.1; using 192.168.1.144 instead (on interface en0)
15/09/22 23:20:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/09/22 23:20:13 INFO SecurityManager: Changing view acls to: userxx
15/09/22 23:20:13 INFO SecurityManager: Changing modify acls to: userxx
15/09/22 23:20:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(userxx); users with modify permissions: Set(userxx)
15/09/22 23:20:14 INFO Slf4jLogger: Slf4jLogger started
15/09/22 23:20:14 INFO Remoting: Starting remoting
15/09/22 23:20:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.144:49613]
15/09/22 23:20:14 INFO Utils: Successfully started service 'sparkDriver' on port 49613.
15/09/22 23:20:14 INFO SparkEnv: Registering MapOutputTracker
15/09/22 23:20:14 INFO SparkEnv: Registering BlockManagerMaster
15/09/22 23:20:14 INFO DiskBlockManager: Created local directory at /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/blockmgr-acf8bb83-2c13-4df4-8301-30a8a78ebcc6
15/09/22 23:20:14 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/09/22 23:20:15 INFO HttpFileServer: HTTP File server directory is /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/httpd-38c60499-b2cf-4fe1-b94f-f3d04fcae1b3
15/09/22 23:20:15 INFO HttpServer: Starting HTTP Server
15/09/22 23:20:15 INFO Utils: Successfully started service 'HTTP file server' on port 49614.
15/09/22 23:20:15 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/22 23:20:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/09/22 23:20:15 INFO SparkUI: Started SparkUI at http://192.168.1.144:4040
15/09/22 23:20:15 INFO SparkContext: Added JAR file:/Users/userxx/Cloud/Spark/learning-spark-src/target/scala-2.10/learning-spark-examples_2.10-0.0.1.jar at http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar with timestamp 1442935215534
15/09/22 23:20:15 INFO Executor: Starting executor ID driver on host localhost
15/09/22 23:20:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49616.
15/09/22 23:20:15 INFO NettyBlockTransferService: Server created on 49616
15/09/22 23:20:15 INFO BlockManagerMaster: Trying to register BlockManager
15/09/22 23:20:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:49616 with 265.4 MB RAM, BlockManagerId(driver, localhost, 49616)
15/09/22 23:20:15 INFO BlockManagerMaster: Registered BlockManager
15/09/22 23:20:16 INFO SparkContext: Starting job: countByValue at WordCount.scala:28
15/09/22 23:20:16 INFO DAGScheduler: Registering RDD 3 (countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO DAGScheduler: Got job 0 (countByValue at WordCount.scala:28) with 1 output partitions (allowLocal=false)
15/09/22 23:20:16 INFO DAGScheduler: Final stage: ResultStage 1(countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
15/09/22 23:20:16 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
15/09/22 23:20:16 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at countByValue at WordCount.scala:28), which has no missing parents
15/09/22 23:20:16 INFO MemoryStore: ensureFreeSpace(3072) called with curMem=0, maxMem=278302556
15/09/22 23:20:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KB, free 265.4 MB)
15/09/22 23:20:16 INFO MemoryStore: ensureFreeSpace(1755) called with curMem=3072, maxMem=278302556
15/09/22 23:20:16 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1755.0 B, free 265.4 MB)
15/09/22 23:20:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49616 (size: 1755.0 B, free: 265.4 MB)
15/09/22 23:20:16 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/09/22 23:20:16 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at countByValue at WordCount.scala:28)
15/09/22 23:20:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/09/22 23:20:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1465 bytes)
15/09/22 23:20:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/09/22 23:20:16 INFO Executor: Fetching http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar with timestamp 1442935215534
15/09/22 23:20:17 INFO Utils: Fetching http://192.168.1.144:49614/jars/learning-spark-examples_2.10-0.0.1.jar to /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/userFiles-a8ac42d1-cf52-4428-8075-2090b6ed6c85/fetchFileTemp4094197159668194166.tmp
15/09/22 23:20:17 INFO Executor: Adding file:/private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/userFiles-a8ac42d1-cf52-4428-8075-2090b6ed6c85/learning-spark-examples_2.10-0.0.1.jar to class loader
15/09/22 23:20:17 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 879 bytes result sent to driver
15/09/22 23:20:17 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 300 ms on localhost (1/1)
15/09/22 23:20:17 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/09/22 23:20:17 INFO DAGScheduler: ShuffleMapStage 0 (countByValue at WordCount.scala:28) finished in 0.332 s
15/09/22 23:20:17 INFO DAGScheduler: looking for newly runnable stages
15/09/22 23:20:17 INFO DAGScheduler: running: Set()
15/09/22 23:20:17 INFO DAGScheduler: waiting: Set(ResultStage 1)
15/09/22 23:20:17 INFO DAGScheduler: failed: Set()
15/09/22 23:20:17 INFO DAGScheduler: Missing parents for ResultStage 1: List()
15/09/22 23:20:17 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at countByValue at WordCount.scala:28), which is now runnable
15/09/22 23:20:17 INFO MemoryStore: ensureFreeSpace(2304) called with curMem=4827, maxMem=278302556
15/09/22 23:20:17 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.3 KB, free 265.4 MB)
15/09/22 23:20:17 INFO MemoryStore: ensureFreeSpace(1371) called with curMem=7131, maxMem=278302556
15/09/22 23:20:17 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1371.0 B, free 265.4 MB)
15/09/22 23:20:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49616 (size: 1371.0 B, free: 265.4 MB)
15/09/22 23:20:17 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/09/22 23:20:17 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at countByValue at WordCount.scala:28)
15/09/22 23:20:17 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/09/22 23:20:17 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1245 bytes)
15/09/22 23:20:17 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/09/22 23:20:17 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/09/22 23:20:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
15/09/22 23:20:17 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1076 bytes result sent to driver
15/09/22 23:20:17 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 84 ms on localhost (1/1)
15/09/22 23:20:17 INFO DAGScheduler: ResultStage 1 (countByValue at WordCount.scala:28) finished in 0.084 s
15/09/22 23:20:17 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/09/22 23:20:17 INFO DAGScheduler: Job 0 finished: countByValue at WordCount.scala:28, took 0.702187 s
pandas -> 2,i -> 1,like -> 1
15/09/22 23:20:17 INFO SparkContext: Invoking stop() from shutdown hook
15/09/22 23:20:17 INFO SparkUI: Stopped Spark web UI at http://192.168.1.144:4040
15/09/22 23:20:17 INFO DAGScheduler: Stopping DAGScheduler
15/09/22 23:20:17 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/09/22 23:20:17 INFO Utils: path = /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2/blockmgr-acf8bb83-2c13-4df4-8301-30a8a78ebcc6, already present as root for deletion.
15/09/22 23:20:17 INFO MemoryStore: MemoryStore cleared
15/09/22 23:20:17 INFO BlockManager: BlockManager stopped
15/09/22 23:20:17 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/22 23:20:17 INFO SparkContext: Successfully stopped SparkContext
15/09/22 23:20:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/09/22 23:20:17 INFO Utils: Shutdown hook called
15/09/22 23:20:17 INFO Utils: Deleting directory /private/var/folders/rt/6f6nq06577vb3c0d8bskm97m0000gn/T/spark-49efa949-3a64-4404-b495-f91435fe4ee2
As you can see, I submitted the application in verbose mode, so the details above come through more clearly.
Tips:
One thing you should know is that spark-submit is strict about parameter ordering:
spark-submit [options] <app jar | python file> [app arguments]
That is, the application jar must come after the options (if any), and the application arguments come last.
For example, submitting a word-count app to a standalone Spark cluster in client deploy mode:
spark-submit --master spark://gzsw-02:7077 --class org.apache.spark.examples.JavaWordCount --executor-memory 600m --total-executor-cores 16 --verbose --deploy-mode client /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/lib/spark-examples-1.4.1-hadoop2.4.0.jar /home/hadoop/spark/spark-1.4.1-bin-hadoop2.4/RELEASE 2 output-result
Here the second application argument (2) is the minimum number of partitions, and output-result is the output directory for the detailed word-count result.
Supplement:
Yes, Spark can do certain 'in-memory computing'. Of course, we have long used memory to speed computation up; we just may not have formed a systematic theory around it yet.
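Concretely, Spark's in-memory computing mostly shows up through RDD caching: an RDD that is reused by several actions can be pinned in memory after its first computation, instead of being recomputed from the source each time. A minimal sketch, assuming an existing SparkContext named sc and the same sample input as the WordCount example above (not a complete program):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.parallelize(List("pandas", "i like pandas")).flatMap(_.split(" "))

// cache() marks the RDD for in-memory storage (MEMORY_ONLY), so the two
// actions below share one computation of `words` instead of repeating it
words.cache() // shorthand for words.persist(StorageLevel.MEMORY_ONLY)

val total = words.count()         // first action: computes and caches the RDD
val byWord = words.countByValue() // second action: reads the cached blocks
```

This is exactly why iterative workloads (the case the book emphasizes) benefit so much: the working set stays in executor memory across jobs.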