[HiBD] Fwd: HiBD in OpenStack

Xiaoyi Lu lu.932 at osu.edu
Wed Sep 27 10:20:27 EDT 2017


Rajarshi - This user reported an issue with spark-shell. I think we had a similar ticket earlier. Can you try to reproduce this one? Also, add it to the existing spark-shell ticket so we can keep track of our solutions later.

Xiaoyi

> Begin forwarded message:
> 
> From: John Garbutt <john.garbutt at stackhpc.com>
> Subject: Re: HiBD in OpenStack
> Date: September 27, 2017 at 8:59:10 AM EDT
> To: Xiaoyi Lu <lu.932 at osu.edu>
> Cc: Stig Telfer <stig at stackhpc.com>, Dhabaleswar Panda <panda at cse.ohio-state.edu>, John Taylor <John.Taylor at stackhpc.com>
> 
> Hi,
> 
> I wasn't trying any of the benchmarks yet; I am just running ./bin/spark-shell and then trying:
> 
>   val textFile = sc.textFile("README.md")
>   textFile.count()
> 
> To make sure I got the filename correct, I tried an invalid filename and got the following error:
>   org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/spark/README.md22
> 
> While debugging, we confirmed ibping works between the master and slave hosts, and we have the Mellanox OFED 4.1.1 drivers installed on CentOS 7.4.
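For reference, the ibping check mentioned above looks roughly like this (a hedged sketch, assuming the OFED/infiniband-diags utilities are installed on both hosts; the LID value is a placeholder):

```shell
# On the slave host: start the ibping responder.
ibping -S

# On the master host: find the slave port's LID with ibstat on the slave,
# then ping that LID (the value 4 below is a placeholder).
ibping 4
```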
> 
> Here is the config:
> 
> cat /opt/spark/conf/slaves
> vanilla-2-7-3-centos7-vanilla-2-7-3-slave-0
> 
> cat /opt/spark/conf/spark-defaults.conf
> spark.ib.enabled true
> hadoop.ib.enabled true
> spark.master spark://vanilla-2-7-3-centos7-vanilla-2-7-3-master-0:7077
> spark.executor.extraLibraryPath /opt/spark/lib/native/Linux-amd64-64:/opt/hadoop/lib/native
> spark.driver.extraLibraryPath /opt/spark/lib/native/Linux-amd64-64:/opt/hadoop/lib/native
> spark.rdma.dev.name mlx5_0
> 
> cat /opt/spark/conf/spark-env.sh
> export HADOOP_CONF_DIR=/opt/stack/conf
> export SPARK_LOCAL_IP=vanilla-2-7-3-centos7-vanilla-2-7-3-master-0
> export SPARK_MASTER_HOST=vanilla-2-7-3-centos7-vanilla-2-7-3-master-0
> export SPARK_WORKER_MEMORY=64g
> export SPARK_WORKER_CORES=8
> export SPARK_WORKER_INSTANCES=1
> export SPARK_DAEMON_MEMORY=2g
> 
> To start Spark I was using:
> /opt/spark/sbin/start-all.sh 
> 
> The spark.rdma.dev.name setting was just a guess, renamed from the Hadoop option we use to target the correct interface, since the mlx5_1 interface isn't up; only the mlx5_0 interface is up.
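To confirm which device the config should name, the port state can be checked directly (a hedged sketch; these OFED tools are already installed in the setup described above):

```shell
# List all RDMA devices and their port states; mlx5_0 should show
# state PORT_ACTIVE and phys_state LINK_UP, while mlx5_1 will not.
ibv_devinfo | grep -E 'hca_id|state'

# Or query one CA/port directly; State should be "Active" and
# Physical state "LinkUp".
ibstat mlx5_0 1
```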
> 
> From the Spark web UI I extracted these logs for stdout:
> [WARN] 2017-09-27 12:48:19  client.c:896 ucr_probe_blocking return value -1, conn id 0.
> And for stderr:
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 17/09/27 12:46:12 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 13030 at vanilla-2-7-3-centos7-vanilla-2-7-3-slave-0.novalocal
> 17/09/27 12:46:12 INFO SignalUtils: Registered signal handler for TERM
> 17/09/27 12:46:12 INFO SignalUtils: Registered signal handler for HUP
> 17/09/27 12:46:12 INFO SignalUtils: Registered signal handler for INT
> 17/09/27 12:46:12 INFO SecurityManager: Changing view acls to: hadoop
> 17/09/27 12:46:12 INFO SecurityManager: Changing modify acls to: hadoop
> 17/09/27 12:46:12 INFO SecurityManager: Changing view acls groups to: 
> 17/09/27 12:46:12 INFO SecurityManager: Changing modify acls groups to: 
> 17/09/27 12:46:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
> 17/09/27 12:46:13 INFO TransportClientFactory: Successfully created connection to /10.60.253.139:44806 after 52 ms (0 ms spent in bootstraps)
> 17/09/27 12:46:13 INFO SecurityManager: Changing view acls to: hadoop
> 17/09/27 12:46:13 INFO SecurityManager: Changing modify acls to: hadoop
> 17/09/27 12:46:13 INFO SecurityManager: Changing view acls groups to: 
> 17/09/27 12:46:13 INFO SecurityManager: Changing modify acls groups to: 
> 17/09/27 12:46:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
> 17/09/27 12:46:13 INFO TransportClientFactory: Successfully created connection to /10.60.253.139:44806 after 1 ms (0 ms spent in bootstraps)
> 17/09/27 12:46:13 INFO SparkEnv: create RdmaBlockTransferService with 8
> 17/09/27 12:46:13 INFO DiskBlockManager: Created local directory at /tmp/spark-b5b30630-9813-4153-9549-f9f1a9a0ff14/executor-95269895-6f57-4620-9cd8-422aabf9b15a/blockmgr-51353602-19aa-41d4-baff-ca4654d750f3
> 17/09/27 12:46:13 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
> 17/09/27 12:46:13 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.60.253.139:44806
> 17/09/27 12:46:13 INFO WorkerWatcher: Connecting to worker spark://Worker@10.60.253.140:37273
> 17/09/27 12:46:13 INFO TransportClientFactory: Successfully created connection to /10.60.253.140:37273 after 1 ms (0 ms spent in bootstraps)
> 17/09/27 12:46:13 INFO WorkerWatcher: Successfully connected to spark://Worker@10.60.253.140:37273
> 17/09/27 12:46:13 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
> 17/09/27 12:46:13 INFO Executor: Starting executor ID 0 on host 10.60.253.140
> 17/09/27 12:46:13 INFO RdmaServer: Starting RDMAReader
> 17/09/27 12:46:13 INFO RdmaServer: IPC RdmaServer listener on 32463 with ctx id 0: starting
> 17/09/27 12:46:13 INFO RdmaServer: IPC RdmaServer handler 1 on 32463: starting
> 17/09/27 12:46:13 INFO RdmaServer: IPC RdmaServer handler 0 on 32463: starting
> 17/09/27 12:46:13 INFO RdmaServer: IPC RdmaServer handler 2 on 32463: starting
> 17/09/27 12:46:13 INFO RdmaServer: IPC RdmaServer handler 3 on 32463: starting
> 17/09/27 12:46:13 INFO RdmaServer: IPC RdmaServer handler 4 on 32463: starting
> 17/09/27 12:46:13 INFO RdmaServer: IPC RdmaServer handler 5 on 32463: starting
> 17/09/27 12:46:13 INFO RdmaBlockTransferService: RdmaShuffleServer created on 32463
> 17/09/27 12:46:13 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
> 17/09/27 12:46:13 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 10.60.253.140, 32463, None)
> 17/09/27 12:46:13 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 10.60.253.140, 32463, None)
> 17/09/27 12:46:13 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 10.60.253.140, 32463, None)
> 17/09/27 12:46:13 INFO Executor: Using REPL class URI: spark://10.60.253.139:44806/classes
> 17/09/27 12:46:18 INFO RdmaShuffleClient: In pconn, get num of peers 0
> 17/09/27 12:46:18 INFO RdmaClient: TotalSize = 4294967296 slab size  = 33554432 buffer (128 to 524288) count is 32
> 17/09/27 12:46:18 INFO RdmaClient: Starting RDMAReader
> 17/09/27 12:47:11 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 17/09/27 12:47:11 INFO CoarseGrainedExecutorBackend: Got assigned task 1
> 17/09/27 12:47:11 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 17/09/27 12:47:11 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 17/09/27 12:47:11 INFO TorrentBroadcast: Started reading broadcast variable 1
> ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
> Failed status transport retry counter exceeded (12) for wr_id -1641979904
> 17/09/27 12:49:33 INFO Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0)
> 17/09/27 12:49:33 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
> 17/09/27 12:49:34 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 17/09/27 12:49:34 INFO RdmaServer: Stopping RdmaServer on 32463 with ctx id 0
> 17/09/27 12:49:34 INFO RdmaServer: IPC RdmaServer handler 0 on 32463: exiting
> 17/09/27 12:49:34 INFO RdmaServer: IPC RdmaServer handler 1 on 32463: exiting
> 17/09/27 12:49:34 INFO RdmaServer: IPC RdmaServer handler 2 on 32463: exiting
> 17/09/27 12:49:34 INFO RdmaServer: IPC RdmaServer handler 3 on 32463: exiting
> 17/09/27 12:49:34 INFO RdmaServer: IPC RdmaServer handler 4 on 32463: exiting
> 17/09/27 12:49:34 INFO RdmaServer: IPC RdmaServer handler 5 on 32463: exiting
> 17/09/27 12:49:34 INFO RdmaServer: Stopping IPC RdmaServer listener on 32463 with ctx id 0
> 17/09/27 12:49:35 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> 3 with ctx id 0
> I hope that helps.
> 
> Many thanks,
> John
> 
> On Fri, Sep 22, 2017 at 7:44 PM, Xiaoyi Lu <lu.932 at osu.edu> wrote:
> Hi, John,
> 
> Which benchmark are you running here? What’s the command you are using?
> 
> And also how about your configuration?
> 
> Thanks,
> Xiaoyi
> 
> > On Sep 22, 2017, at 2:19 PM, John Garbutt <john.garbutt at stackhpc.com> wrote:
> >
> > Hi,
> >
> > Sorry for the slow follow-up (mostly due to OpenStack conference travel). Having got Hadoop up previously, I have finally got around to experimenting with Spark.
> >
> > I seem to have Spark configured, with the master and slave processes running. I have an active spark-shell up, but it seems to lock up when attempting to execute a task on a worker. This is what I did in the spark-shell:
> >
> > val textFile = sc.textFile("README.md")
> > textFile.count()
> >
> > The spark-shell was running on the master node; it seems to connect to the master and dispatch the work to the slave OK, but the job never completes.
> >
> > I get this in the worker standard out:
> >
> > [WARN] 2017-09-22 18:04:49 client.c:896 ucr_probe_blocking return value -1, conn id 0.
> >
> > And in standard error I see:
> > <snip lots of startup logs>
> >
> >
> > 17/09/22 18:03:25 INFO RdmaClient: Starting RDMAReader
> > <seem to get here with an open spark-shell>
> > 17/09/22 18:03:45 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> > 17/09/22 18:03:45 INFO CoarseGrainedExecutorBackend: Got assigned task 1
> > 17/09/22 18:03:45 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> > 17/09/22 18:03:45 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> > 17/09/22 18:03:45 INFO TorrentBroadcast: Started reading broadcast variable 1
> >
> > <waits a good while>
> > ctx error: ibv_poll_cq() failed: IBV_WC_SUCCESS != wc.status
> > Failed status transport retry counter exceeded (12) for wr_id 313143296
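The "transport retry counter exceeded" completion status (IBV_WC_RETRY_EXC_ERR, value 12 in the verbs API) usually points at the fabric rather than Spark itself. A hedged sketch of checks that can narrow it down, assuming the infiniband-diags tools from OFED are installed:

```shell
# Show link state, width, and speed for every port on the fabric;
# a port stuck at Init or a downgraded link is a common culprit.
iblinkinfo

# Dump the port error counters on the local HCA (port 1 of mlx5_0 here);
# rising symbol-error or retry counters suggest a cabling or routing problem.
perfquery -C mlx5_0 -P 1
```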
> >
> > I am wondering if you could give us any pointers on how to debug this issue further?
> >
> > Many thanks,
> > John G
> >
> >
> > On Thu, Aug 31, 2017 at 10:54 PM, Xiaoyi Lu <lu.932 at osu.edu> wrote:
> > Sounds good, Stig. Let us know if you find any other issue.
> >
> > Thanks,
> > Xiaoyi
> >
> > Sent from my iPhone
> >
> > > On Aug 31, 2017, at 5:35 PM, Stig Telfer <stig at stackhpc.com> wrote:
> > >
> > > circumvented
> >
> 
> 


