IMP Reference Guide  develop.18ca3ba1ae,2021/02/28
The Integrative Modeling Platform
IMP::parallel Namespace Reference

Distribute IMP tasks to multiple processors or machines.

Detailed Description

This module employs a manager-worker model; the main (manager) IMP process sends the tasks out to one or more workers. Tasks cannot communicate with each other, but return results to the manager. The manager can then start new tasks, possibly using results returned from completed tasks. The system is fault tolerant; if a worker fails, any tasks running on that worker are automatically moved to another worker.

To use the module, first create a Manager object. Add one or more workers to the Manager using its add_worker() method. Example workers are LocalWorker, which simply starts another IMP process on the same machine as the manager, and SGEQsubWorkerArray, which starts an array of multiple workers on a Sun Grid Engine cluster. Next, call the Manager.get_context() method, which creates and returns a new Context object. Add tasks to the Context with the Context.add_task() method; each task is simply a Python function or other callable object. Finally, call Context.get_results_unordered() to send the tasks out to the workers. A worker runs only one task at a time; if there are more tasks than workers, later tasks are queued until a worker finishes an earlier task. Context.get_results_unordered() returns the result of each task as it completes.
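The steps above can be sketched as follows. This is a minimal illustration, not code from the IMP distribution: SquareTask and run_tasks are hypothetical names, and the use of two LocalWorker objects is arbitrary. It assumes the code is saved as an importable module, because tasks are sent to the workers and so most likely need to be defined somewhere the workers can import them from.

```python
class SquareTask:
    """Hypothetical task: squares one number on a worker."""

    def __init__(self, value):
        self.value = value

    def __call__(self):
        # No setup function is used here, so the task takes no arguments.
        return self.value * self.value


def run_tasks(values, num_workers=2):
    """Run one SquareTask per value on local workers; return sorted results."""
    # Imported lazily so that SquareTask can be used without IMP installed.
    import IMP.parallel

    manager = IMP.parallel.Manager()
    for _ in range(num_workers):
        manager.add_worker(IMP.parallel.LocalWorker())

    context = manager.get_context()
    for value in values:
        context.add_task(SquareTask(value))

    # Results arrive in completion order, so sort them for a stable answer.
    return sorted(context.get_results_unordered())
```

Calling run_tasks(range(5)) from a separate script should start the workers and collect one result per task, provided IMP is installed and the workers can start.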

Setup in IMP is often expensive, and thus the Manager.get_context() method allows you to specify a Python function or other callable object to do any setup for the tasks. This function will be run on the worker before any tasks from that context are started (the return values from this function are passed to the task functions). If multiple tasks from the same context are run on the same worker, the setup function is only called once.
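A setup function might look like the sketch below. The names setup, LookupTask and run_with_setup are hypothetical, and the lookup table is only a stand-in for expensive IMP setup such as building a model. The sketch assumes that the values returned by the setup function are passed to each task as positional arguments, and that the code lives in an importable module.

```python
def setup():
    """Hypothetical setup: runs once per worker for a given context."""
    # Stand-in for costly setup work (for example, building an IMP model).
    table = {n: n ** 3 for n in range(100)}
    # Return a tuple; its items are assumed to become the task arguments.
    return (table,)


class LookupTask:
    """Hypothetical task that reuses the state built by setup()."""

    def __init__(self, key):
        self.key = key

    def __call__(self, table):
        # 'table' is the value returned by setup() on this worker.
        return self.key, table[self.key]


def run_with_setup(keys, num_workers=2):
    """Run one LookupTask per key; return a dict mapping key to result."""
    # Imported lazily so the task and setup code work without IMP installed.
    import IMP.parallel

    manager = IMP.parallel.Manager()
    for _ in range(num_workers):
        manager.add_worker(IMP.parallel.LocalWorker())

    # Pass the setup callable when creating the context.
    context = manager.get_context(setup)
    for key in keys:
        context.add_task(LookupTask(key))
    return dict(context.get_results_unordered())
```

Because the table is built in setup() rather than inside each task, a worker that runs several tasks from this context builds it only once.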

Troubleshooting

Several common problems with this module are described below, together with solutions.

  • Manager process fails with /bin/sh: qsub: command not found, but qsub works fine from a terminal.
    SGEQsubWorkerArray uses the qsub command to submit the SGE job that starts the workers. Thus, qsub must be in your system PATH. This may not be the case if you are using a shell script such as imppy.sh to start IMP. To fix this, modify the shell script to add the directory containing qsub to the PATH, or remove the setting of PATH entirely.
  • The manager process 'hangs' and does not do anything when Context.get_results_unordered() is called.
    Usually this is because no workers have successfully started up. Check the worker output files to determine what the problem is.
  • Worker output files contain only a Python traceback ending in ImportError: No module named IMP.parallel.worker_handler.
    The workers simply run 'python' and expect to be able to load the IMP Python modules. If you need to run a modified version of Python, or you usually prefix your Python command with a shell script such as imppy.sh, you need to tell the workers to do that too. Specify the full command line needed to start a suitable Python interpreter as the 'python' argument when you create the Manager object.
  • Worker output files contain only a Python traceback ending in socket.error: (110, 'Connection timed out').
    The workers need to connect to the machine running the manager process over the network. This connection can fail (or time out) if that machine is firewalled. It can also fail if the manager machine is multi-homed (a common setup for the headnode of a compute cluster). For a multi-homed manager machine, use the 'host' argument when you create the Manager object to tell the workers the name of the machine as visible to them (typically this is the name of the machine's internal network interface).
  • Worker output files contain only a Python traceback ending in socket.error: (111, 'Connection refused').
    If the manager encounters an error and exits, it will no longer be around to accept connections from workers, so they will get this error when they try to start up. Check the manager log file for errors. Alternatively, the manager may have simply finished all of its work and exited normally before the worker started (either the manager had little work to do, or the worker took a very long time to start up). This is normal.
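Some of the fixes above can be checked or applied from Python. The sketch below is illustrative only: find_qsub and make_manager are hypothetical helpers, not part of IMP. The first uses the standard library's shutil.which to see whether qsub is visible on the PATH of the current process. The second passes the 'python' and 'host' arguments described above through to the Manager.

```python
import shutil


def find_qsub(path=None):
    """Return the full path to qsub, or None if it is not on the PATH.

    With path=None the PATH of the current process is searched, which is
    the PATH the manager would see when it tries to run qsub.
    """
    return shutil.which("qsub", path=path)


def make_manager(python_command=None, host=None):
    """Create a Manager, overriding the worker Python command and/or host."""
    # Imported lazily so that find_qsub() can be used without IMP installed.
    import IMP.parallel

    kwargs = {}
    if python_command is not None:
        # Full command line the workers should use to start Python.
        kwargs["python"] = python_command
    if host is not None:
        # Name of the manager machine as seen from the workers.
        kwargs["host"] = host
    return IMP.parallel.Manager(**kwargs)
```

For example, make_manager(python_command="/path/to/imppy.sh python", host="headnode.internal") would cover the ImportError and multi-homed cases; both values are placeholders and must be replaced with ones that match your installation.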

Info

Author(s): Ben Webb

Maintainer: benmwebb

License: LGPL. This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

Namespaces

 manager_communicator
 Classes for communicating from the manager to workers.
 
 subproc
 Subprocess handling.
 
 util
 Utilities for the IMP.parallel module.
 

Classes

class  Context
 A collection of tasks that run in the same environment.
 
class  Error
 Base class for all errors specific to the parallel module.
 
class  LocalWorker
 A worker running on the same machine as the manager.
 
class  Manager
 Manages workers and contexts.
 
class  NetworkError
 Error raised if a problem occurs with the network.
 
class  NoMoreWorkersError
 Error raised if all workers failed, so tasks cannot be run.
 
class  RemoteError
 Error raised if a worker has an unhandled exception.
 
class  SGEPEWorkerArray
 An array of workers in a Sun Grid Engine system parallel environment.
 
class  SGEQsubWorkerArray
 An array of workers on a Sun Grid Engine system, started with 'qsub'.
 
class  Worker
 Representation of a single worker.
 
class  WorkerArray
 Representation of an array of workers.
 

Functions

def get_data_path
 Return the full path to one of this module's data files.
 
def get_example_path
 Return the full path to one of this module's example files.
 
def get_module_name
 Return the fully-qualified name of this module.
 
def get_module_version
 Return the version of this module, as a string.
 

Function Documentation

def IMP.parallel.get_data_path(fname)

Return the full path to one of this module's data files.

Note
This function is only available in Python.

Definition at line 620 of file parallel/__init__.py.

def IMP.parallel.get_example_path(fname)

Return the full path to one of this module's example files.

Note
This function is only available in Python.

Definition at line 625 of file parallel/__init__.py.

def IMP.parallel.get_module_name()

Return the fully-qualified name of this module.

Note
This function is only available in Python.

Definition at line 615 of file parallel/__init__.py.

def IMP.parallel.get_module_version()

Return the version of this module, as a string.

Note
This function is only available in Python.

Definition at line 610 of file parallel/__init__.py.