[mvapich-discuss] PMI_KVS_Get error in mvapich2-1.5.1p1

Mike Heinz michael.heinz at qlogic.com
Tue Oct 19 09:34:15 EDT 2010


Jonathan,

I'm still not sure whether that odd exit(1) call is correct, but my own issue appears to be resolved: it looks like a mismatch in the mpi-selector settings across the cluster was causing one node to run the QLogic version of "mpispawn" even though the binaries were compiled against vanilla mvapich2. The resulting mismatch in the startup communication caused the runs to fail.
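
For what it's worth, if that exit(1) really is leftover debugging code, my reading of the intended miss path (from the pmi_tree.c snippet quoted below) is just the two calls it currently makes unreachable. This is only a guess to make the question concrete, not a tested patch:

    } else {
        /* Guess at the intended flow once the fprintf/exit(1) pair is
         * dropped: a local KVS miss is not fatal, the request is parked
         * and the "get" is forwarded up the mpispawn tree so an
         * ancestor can answer it later. */
        save_pending_req(rank, key, fd);   /* park until an answer arrives */
        send_parent(rank, msg, msg_len);   /* forward the request upward */
    }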

I may put together a patch with improved error reporting for this case; to identify the node that was sending "uuid" strings instead of hostnames, I had to add an fprintf() call that dumps the raw PMI strings.
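
The change I have in mind is roughly along the lines of the helper below. This is just a sketch; dump_pmi_msg is a name I'm inventing here, not something that exists in pmi_tree.c, and the exact arguments would follow whatever is in scope where the incoming message is parsed:

    #include <stdio.h>

    /* Sketch: dump each raw PMI message together with the rank/fd it
     * came from, so a node that publishes uuid strings instead of
     * hostnames stands out immediately. */
    static void dump_pmi_msg(int rank, int fd, const char *msg, int msg_len)
    {
        fprintf(stderr,
                "mpirun_rsh: raw PMI msg from rank %d (fd %d, %d bytes): '%.*s'\n",
                rank, fd, msg_len, msg_len, msg);
    }

Wired in where the incoming message is parsed, something along those lines was enough to spot the node feeding in the wrong key/value pairs.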

-----Original Message-----
From: Jonathan Perkins [mailto:perkinjo at cse.ohio-state.edu] 
Sent: Saturday, October 16, 2010 12:12 PM
To: Mike Heinz
Cc: mvapich-discuss at cse.ohio-state.edu; Todd Rimmer
Subject: Re: [mvapich-discuss] PMI_KVS_Get error in mvapich2-1.5.1p1

Hello Mike, the snippet of code you found does look suspect.  We're
investigating the history of this piece of code and will get back to
you with any solutions we find as soon as we can.

On Fri, Oct 15, 2010 at 4:20 PM, Mike Heinz <michael.heinz at qlogic.com> wrote:
> I've encountered an odd piece of code while looking into this error:
>
> Is the call to exit(1) supposed to be there, or is it leftover debugging code?
>
> pmi_tree.c:
>
>    switch (strlen(command)) {
>    case 3:                     /* get, put */
>        if (0 == strcmp(command, "get")) {
>            char *kvc_val = check_kvc(key);
>            hdr.msg_rank = rank;
>            if (kvc_val) {
>                sprintf(resp, "cmd=get_result rc=0 value=%s\n", kvc_val);
>                hdr.msg_len = strlen(resp);
>                if (src == MT_CHILD) {
>                    write(fd, &hdr, msg_hdr_s);
>                }
>                writeline(fd, resp, hdr.msg_len);
>            } else {
>                fprintf(stderr, "mpirun_rsh: PMI key '%s' not found.", key);
>                exit(1);
>                /* add pending req */
>                save_pending_req(rank, key, fd);
>                /* send req to parent */
>                send_parent(rank, msg, msg_len);
>            }
>        }
>
>
>
> -----Original Message-----
> From: mvapich-discuss-bounces at cse.ohio-state.edu [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Mike Heinz
> Sent: Friday, October 15, 2010 2:30 PM
> To: mvapich-discuss at cse.ohio-state.edu
> Subject: [mvapich-discuss] PMI_KVS_Get error
>
> Hi; I'm trying to diagnose a problem on a small cluster, and I'm hoping someone can point me in the right direction.
>
> We have a standardized set of test scripts used to verify cluster operation, and on one particular cluster mvapich2 is failing with the following message:
>
> mpirun_rsh: PMI key 'MVAPICH2_0001' not found.[cli_0]: readline failed
>
> Googling around produced nothing useful - can someone suggest where I should look for this problem? Other MPIs are working correctly.
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo



