PVM Troubleshooting

LINUX, C++, troubleshooting


First identify the type of your problem:

  1. I cannot start pvm
  2. I cannot add machines in pvm
  3. I cannot compile my parallel program
  4. My master program crashes (runtime)
  5. My slave program crashes (runtime)
  6. I have problems while configuring a PVM (a parallel virtual machine)
Unidentified problem? Report in detail to: Jan Lemeire:    jan.lemeire@vub.ac.be    02/629.2997

Troubles starting pvm.

Problems with adding machines in pvm.

Problem: I cannot add a host to the PVM. I get a message like this one:
pvm> add infopc16
0 successful
                    HOST     DTID
                infopc16 Can't start pvmd
pvm>

 

Solution:

  1. It could be that someone restarted that machine in order to use another operating system. Since we do not install the software for operating systems other than Linux, you will not be able to add a machine to a PVM if it is not running Linux. Or the machine is simply down. Check this by trying to ssh to that machine: ssh machine. The error message No Route To Host indicates that the machine is down.
  2. Is the name of the machine you are using present in your .rhosts file (see pvm configuration)? If not, then the local machine will not be allowed to start pvmd for you.
  3. Always halt your PVM when you have finished your tests!
  4. If you get the 'Can't start pvmd' response very quickly:
     PVM creates a pvmd.xxxxx file in the /tmp directory of every host when a user adds that host to a PVM. To allow different users to add the same machine to different PVMs, these files are specific to each user. The file is a lock file: its purpose is to prevent PVM from starting pvmd twice for the same user. So whenever you add a host to a PVM, PVM first checks the /tmp directory of that host for a lock file of yours. If there is one, PVM refuses to start a second pvmd and reports an error. That is how it detects attempts to add the same host twice. What probably happened is that someone or something stopped the machine in an abnormal way, so the lock file was never removed. On each machine that you cannot add, remove the file called pvmd.xxxxx (where xxxxx is your user identification) from the /tmp directory with the following commands:
ssh name_of_machine   (go to the machine with the pvmd problem)
cd /tmp
ls -l pvmd.*          (you are the owner of your pvmd file)
rm pvmd.xxxxx         (answer with y if asked)
Then halt pvm and restart it.
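
The manual steps above can be sketched as a small shell function. This is only an illustration, not part of PVM: the name remove_pvmd_lock is made up, and it assumes that the suffix of the lock file is your numeric user id as printed by id -u.

```shell
#!/bin/sh
# Hypothetical helper, not part of PVM: remove your own pvmd lock file.
# Assumption: the xxxxx suffix is your numeric user id (`id -u`).
remove_pvmd_lock() {
    dir="${1:-/tmp}"                 # directory that holds the lock file
    lock="$dir/pvmd.$(id -u)"
    if [ -e "$lock" ]; then
        rm -f "$lock" && echo "removed $lock"
    else
        echo "no lock file at $lock"
    fi
}
```

Run remove_pvmd_lock (without arguments it looks in /tmp) on the machine that refuses to be added, after halting pvm. Because it only touches the file that carries your own user id, it leaves the lock files of other users alone.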

Problem: I get a message like this one:


pvm> add infopc26
0 successful
                    HOST     DTID
                infopc26 Duplicate host
pvm>

 
Solution: This machine is already in your PVM. Check it with conf:
pvm> conf
1 host, 1 data format
                    HOST     DTID     ARCH  SPEED
                infopc26    40000    LINUX    1000

Problem: I cannot add nodes of the Crunch cluster to my PVM. It is very strange because it works perfectly for the other machines of the lab.

Solution:

  1. Check your .rhosts file. It should contain crunch. This is the name of the machine that serves as gateway to the cluster or, to be more precise, the name that the nodes of Crunch use for it. You read that correctly: the gateway is called crunch when seen from within the cluster and info9 when seen from, say, the internet.
  2. Another possibility is that I forgot to add your profile to the crunch-users on the cluster, in which case you should e-mail me.


Problem: I have problems adding a node of the cluster this time, although it used to work before.

Solution:

  1. This problem is again related to the lock files that are left behind on the machines. The situation on the Crunch cluster is somewhat different in that you cannot start telnet sessions to the individual nodes of the cluster. You can use remote shell (rsh), however, so you do not need to log in on the machines to solve your problem. Follow the steps below, replacing the x with the number of the node that is bothering you.
  2. Type 'rsh crunchx ls /tmp' in a shell. This should display a list of the files in the /tmp directory of that machine. You should find a pvmd.xxxx file where xxxx is your user id. See here if you do not know your user id on our machines.
  3. Now delete this lock file of yours by typing: 'rsh crunchx rm /tmp/pvmd.xxxx'. That should do it.
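
If several nodes are affected, the rsh commands from the steps above can be generated instead of typed one by one. The sketch below only prints the commands so that you can check them first. The function name is made up, and it assumes that the nodes are called crunch followed by their number and that your user id on the cluster is the one printed by id -u on the machine where you run it.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the rsh commands that delete
# your pvmd lock file on the given Crunch nodes.
# Assumptions: nodes are named crunch<number>; the lock file suffix is
# your numeric user id and is the same on every node.
crunch_cleanup_cmds() {
    uid=$(id -u)
    for n in "$@"; do
        echo "rsh crunch$n rm -f /tmp/pvmd.$uid"
    done
}
```

For example, crunch_cleanup_cmds 3 7 prints one rsh line for crunch3 and one for crunch7. Once the lines look right, they can be executed by piping the output to sh.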


Compile time problems

These tips are for people programming in C.

Problem: When I compile I get warnings and/or errors.

Solution:

  1. You need an include statement like #include <pvm3.h>. In that case you must tell gcc where it can find the include file. Why? Because the angle-bracket syntax makes gcc search only the system-wide include directories. You can add further include directories with a capital i (-I) as a command line parameter for gcc, once per directory. Like this: gcc a_program.c -I/path_to_include_files.
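
If you do not know which directory to pass with -I, a small search over a few candidate directories can help. The function below is a hypothetical helper, and the directories in the usage example are guesses that you should replace with the ones of your installation. In a standard PVM 3 installation the header is called pvm3.h.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory that contains HEADER.
# usage: find_include_dir HEADER DIR...
find_include_dir() {
    header="$1"; shift
    for dir in "$@"; do
        if [ -f "$dir/$header" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1                         # header not found in any directory
}
```

A possible use, with /home/pvm3/include as an assumed location: gcc a_program.c -I"$(find_include_dir pvm3.h /usr/include /home/pvm3/include)".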


Problem: When linking I get a lot of errors for each of the pvm functions I use.

Solution:

  1. These errors are probably reported as undefined reference. This is a very simple problem: you need to link your program with the pvm library. You can either give gcc the complete path of the library file, or combine -L and -l. Note that -L on its own does not link anything: it only adds a directory to the list of directories in which gcc looks for libraries, while -l names the library to link (-lpvm3 makes gcc look for libpvm3.so or libpvm3.a). In the first case, if the library is a static archive in /home/pvm3/lib, the command would look like this: gcc program.c /home/pvm3/lib/libpvm3.a. The combination of -L and -l is more convenient if you have several library files in one directory. The syntax is then the following: gcc program.c -L/home/pvm3/lib -lpvm3.
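
Putting the compile and link options together, the complete command line can be sketched as follows. The function only prints the command. INC_DIR and LIB_DIR are made-up variable names, and their defaults are the placeholder paths used in this text, not necessarily the paths of your installation.

```shell
#!/bin/sh
# Hypothetical helper: print the gcc command line for a PVM program.
# Assumptions: INC_DIR and LIB_DIR are placeholders for your installation;
# -lpvm3 resolves to libpvm3.so or libpvm3.a inside LIB_DIR.
pvm_gcc_cmd() {
    src="$1"
    inc="${INC_DIR:-/path_to_include_files}"
    lib="${LIB_DIR:-/home/pvm3/lib}"
    # The library options come after the source file: gcc resolves
    # undefined references against libraries that follow on the line.
    echo "gcc $src -I$inc -L$lib -lpvm3 -o ${src%.c}"
}
```

Calling pvm_gcc_cmd program.c prints the command, which you can inspect and then run by hand.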

Run time problems with master

When a master or slave crashes, type reset in pvm to stop all processes that are still running (otherwise they keep on running forever!).
Don't forget to call pvm_exit() at the end of your code.


Problem: I get a message like this one:
libpvm [pid27726]: /tmp/pvmd.19120: No such file or directory

Solution:

  1. You forgot to start a PVM. As a result your program is not able to request services from any PVM and it cannot run.
  2. You did start a PVM, but you started your program on a machine that is not part of the PVM. Use the command hostname in your console and conf in pvm to check this.
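
The comparison between hostname and the output of conf can also be scripted. The sketch below reads the conf output on standard input and succeeds only if the given host name appears in the HOST column. The function name is made up, and it assumes that conf prints the same table layout as shown in this document.

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 if HOSTNAME is listed in the output
# of the pvm console command conf (read from standard input).
# usage: host_in_conf HOSTNAME < conf_output
host_in_conf() {
    awk -v h="$1" '
        seen && $1 == h { found = 1 }   # host rows follow the header row
        $1 == "HOST"    { seen = 1 }    # header row of the conf table
        END             { exit (found ? 0 : 1) }
    '
}
```

If your pvm console accepts commands on standard input (an assumption to verify on your system), a possible use is: echo conf | pvm | host_in_conf "$(hostname)". Keep in mind that hostname may print a fully qualified name while conf shows the short one.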

Problem: Some messages sent at the end of the program do not seem to arrive.

Solution: There is no pvm_exit() at the end of your code.

Runtime problems with slave

Check the pvm log file pvml.xxxxx in your /tmp directory for slave error messages.
When a master or slave crashes, type reset in pvm to stop all processes that are still running (otherwise they keep on running forever!).
Don't forget to call pvm_exit() at the end of your code.
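
To look at the most recent slave messages in the log file mentioned above, a helper along these lines may be handy. It is only a sketch: the function name is made up, and it assumes that the xxxxx suffix of the log file is your numeric user id as printed by id -u.

```shell
#!/bin/sh
# Hypothetical helper: print the last lines of your PVM log file.
# Assumption: the xxxxx suffix is your numeric user id (`id -u`).
# usage: show_pvm_log [DIR] [NUMBER_OF_LINES]
show_pvm_log() {
    dir="${1:-/tmp}"
    lines="${2:-20}"
    log="$dir/pvml.$(id -u)"
    if [ -r "$log" ]; then
        tail -n "$lines" "$log"
    else
        echo "no readable log file at $log" >&2
        return 1
    fi
}
```

Run show_pvm_log on the machine where the slave was spawned, since each host keeps its own log file in its own /tmp directory.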

Problem: I get this message:
XTerm xt error: Can't open display

Solution: If you are trying to debug your slaves, spawn them on the same machine as the one you are working on.

pvm error: "Cannot find executable slave"

PVM configuration problems

For our students, working on our LINUX system, the configuration of their account should be all right.
Configuration problems are problems you experience while using the pvm program. This program is used to add hosts to or delete hosts from a PVM, to see which processes are running on the PVM, and to stop a PVM.

Problem: My PVM does not start anymore, although it used to work before. After typing pvm on the command line I get a message like:
 
%pvm
libpvm [pid2745]: mksocs() connect: Invalid argument
pvmd already running.
libpvm [pid2745]: mksocs() connect: Invalid argument
libpvm [pid2745]: mksocs() connect: Invalid argument
libpvm [pid2745]: mksocs() connect: Invalid argument
libpvm [pid2745]: pvm_mytid(): Can't contact local daemon
%
 

  Solution:

  1. This topic is related to the lock file problem described in the section on problems with adding machines in pvm, so please read that section to find the real solution to your problem. But keep reading the next point to understand what happened.
  2. The machine on which you issued the pvm command seems to have suffered an abnormal stop in its operations (in other words it crashed or someone stopped it in an improper manner). Such an unexpected stop of course makes it impossible for the different programs to remove their lock files. The symptoms differ from those in that section only because the lock file is present on the local machine, that is all.
  3. Always stop your PVM when you have finished your tests to avoid this kind of problem.

