IMP  2.4.0
The Integrative Modeling Platform
EMageFit protocol

This page gives a full description of running EMageFit, from the collection of data to the final production of models. For a demonstration of actually applying it to an example complex, see the EMageFit tutorial.

Input data

Only three things are needed to get models using EMageFit:

  1. A set of PDB files with the components of the assembly.
  2. A set of EM images.
  3. A configuration file.

Each PDB file must contain a protein, a DNA strand, or a subcomplex. A PDB file may contain several chains, and all of them will be treated as a single rigid body. All the chains that are going to be assembled must have different IDs (for example, a chain with ID 'A' cannot be present in two different files). All the atoms in each PDB file will be used, so duplicated atoms need to be removed. There is also no harm in removing other records such as REMARK, SOURCE, COMPND, SEQRES, DBREF and CONECT; they are not relevant for the type of problem that EMageFit solves.
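The file preparation described above can be sketched in Python. This is a minimal illustration, not part of EMageFit: the function names are hypothetical, chain IDs are read from column 22 of ATOM/HETATM records as defined by the PDB format, and removal of duplicated atoms is not covered.

```python
from collections import defaultdict

# Record types the text lists as safe to remove.
DROPPED_RECORDS = ("REMARK", "SOURCE", "COMPND", "SEQRES", "DBREF", "CONECT")


def strip_records(pdb_lines):
    """Return the lines whose record name is not in DROPPED_RECORDS."""
    return [ln for ln in pdb_lines if not ln.startswith(DROPPED_RECORDS)]


def chain_ids(pdb_lines):
    """Collect chain IDs from ATOM/HETATM records (column 22 of the PDB format)."""
    return {ln[21] for ln in pdb_lines
            if ln.startswith(("ATOM", "HETATM")) and len(ln) > 21}


def shared_chains(files):
    """Map each chain ID used by more than one file to the files that use it.

    `files` is a dict of {file name: list of PDB lines}.
    """
    owners = defaultdict(list)
    for name, lines in files.items():
        for chain in chain_ids(lines):
            owners[chain].append(name)
    return {c: sorted(names) for c, names in owners.items() if len(names) > 1}
```

An empty result from shared_chains means that no chain ID is reused across the input files.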

IMP can read three image formats: Spider, JPG and TIFF. Spider is probably the most commonly used, since it is specific to EM. The free program em2em can be used to convert other EM image formats to Spider. JPG and TIFF are more useful for visualization; EMageFit includes a script (convert_spider_to_jpg) to convert Spider files to JPG.

Each EM image has to be a separate file. The images to be used need to be listed in a "selection file". This is just a file with 2 columns: the left column is the name of the image, and the right column is either 0 or 1. A value of 0 means that the image will not be used. For example, the following selection file

image1.spi 1
image2.spi 0
image3.spi 1

contains 3 images but only image1.spi and image3.spi will be used for modeling.
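The selection file format is simple enough to parse in a few lines. The following is a minimal sketch, assuming whitespace-separated columns and image names without spaces; read_selection is a hypothetical helper, not an EMageFit function.

```python
def read_selection(text):
    """Return the image names flagged 1 in a two-column selection file."""
    selected = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2 or fields[1] not in ("0", "1"):
            raise ValueError(f"malformed selection line: {line!r}")
        name, flag = fields
        if flag == "1":
            selected.append(name)
    return selected
```

Applied to the three-line example above, the function returns image1.spi and image3.spi.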

An EMageFit configuration file is just a Python file with classes that describe all the parameters and restraints for the modeling. Using a Python file as configuration file makes adding new parameters to the simulation trivial. An example configuration file is included as part of the EMageFit tutorial.

Producing models

Producing models requires 4 steps:

  1. Doing the preliminary dockings.
  2. Obtaining models with Monte Carlo optimization.
  3. Gathering the solutions from the Monte Carlo optimizations.
  4. Combining models from Monte Carlo with DOMINO to get even better models.

Pairwise dockings

EMageFit performs docking between components that are subject to cross-linking restraints using the program HEXDOCK. This is what EMageFit does:

This procedure is run with the --dock argument to the emagefit command line tool. (See the EMageFit tutorial for an example command line.)

Once this step is complete, the user must take the information from the dockings and put it in the configuration file, indicating which component is anchored and referencing the files of relative transformations from the dockings. The options to modify are self.anchor and self.dock_transforms. See the EMageFit tutorial for an example.

The pairwise dockings are optional; EMageFit can work without them. In that case all the Monte Carlo moves will be random. The user should indicate that docking solutions are not available by setting the option self.non_relative_move_prob (the probability of doing a move that is not docking-related) to 1. This means that EMageFit will always do a random move.

Another strategy is possible: even if the docking program does not produce solutions compatible with the cross-linking restraints, the user may still want to use them. They have reasonable conformations after all, without clashes, and are perhaps not far from some other conformation that actually satisfies the restraints. It is advisable then to set self.non_relative_move_prob to a low number (say 0.2, meaning that a random move will be chosen only 20% of the time and a move from a docking solution the rest of the time).
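The effect of self.non_relative_move_prob can be illustrated with a small simulation. This sketch takes the stated definition (the probability of a move that is not docking-related) at face value; choose_move and docking_fraction are hypothetical names and do not reproduce EMageFit's internal move selection.

```python
import random


def choose_move(non_relative_move_prob, rng):
    """Pick the kind of Monte Carlo move for one step.

    Returns "random" with probability non_relative_move_prob,
    otherwise "docking".
    """
    if rng.random() < non_relative_move_prob:
        return "random"
    return "docking"


def docking_fraction(non_relative_move_prob, steps=10000, seed=0):
    """Estimate the fraction of docking-derived moves over many steps."""
    rng = random.Random(seed)
    picks = [choose_move(non_relative_move_prob, rng) for _ in range(steps)]
    return picks.count("docking") / steps
```

With a value of 1 the estimated docking fraction is exactly zero, which matches the case where no docking solutions are available.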

emagefit also takes an option --log. If it is used (recommended), a log file is generated (otherwise, the log is output to the screen). This logging information comes only from EMageFit itself, and is different from the IMP logging system. The granularity of the logging can be selected by using the variables from the Python logging module, e.g. DEBUG, INFO, etc.
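The level constants of the Python logging module act as thresholds: a message is emitted only if its level is at least the configured one. The sketch below shows this with the standard library alone; the logger names and messages are invented for the illustration and are not EMageFit output.

```python
import io
import logging


def demo(level):
    """Log one DEBUG and one INFO message at the given threshold; return the output."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep the messages out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
    logger.addHandler(handler)
    logger.debug("move accepted")
    logger.info("step finished")
    logger.removeHandler(handler)
    return stream.getvalue().splitlines()
```

Calling demo(logging.DEBUG) keeps both messages, while demo(logging.INFO) drops the DEBUG one.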

Obtain models with simulated annealing Monte Carlo optimization

Once the relative docking transformations are set, models can be generated using Monte Carlo optimizations. This uses the --monte_carlo argument to emagefit, which takes a single numeric argument: the random seed to use for the optimization. Repeated runs using the same random seed should generate the same outputs. If running on a single machine, the special value -1 can be used, which uses the current time to set the random seed (this is not recommended when running multiple jobs on a compute cluster, since several jobs could start at the same time and would then share a seed).
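One way to avoid seed collisions on a cluster is to derive a distinct, fixed seed from each job's index. The sketch below is an illustration under assumptions: job_seeds and sample are hypothetical helpers, the base seed of 1000 is arbitrary, and Python's random module stands in for the optimizer's random number generator.

```python
import random


def job_seeds(n_jobs, base_seed=1000):
    """Give every job its own fixed seed: base_seed, base_seed + 1, ..."""
    return [base_seed + i for i in range(n_jobs)]


def sample(seed, n=3):
    """Stand-in for an optimization run: draw n numbers from a seeded generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
```

Two calls to sample with the same seed return the same numbers, which mirrors the reproducibility described above; different seeds give independent runs.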

Each run generates an SQLite database containing only one solution. The run should be repeated many times, each with a different seed, to generate multiple solutions as multiple database files.

Gather the results of all Monte Carlo optimizations

In this step, all of the independent Monte Carlo solutions are gathered into a single database file. This uses the --gather option to emagefit.
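Conceptually, gathering amounts to appending the rows of many single-solution SQLite files into one file. The sketch below shows that idea with the standard sqlite3 module. It is not the EMageFit implementation: the table name "results" and the assumption that all inputs share one schema are hypothetical.

```python
import sqlite3


def gather(db_paths, out_path, table="results"):
    """Copy every row of `table` from each input database into out_path.

    The table name is hypothetical; all inputs are assumed to share
    the same schema. Returns the number of rows in the gathered table.
    """
    if not db_paths:
        raise ValueError("no input databases given")
    out = sqlite3.connect(out_path)
    try:
        for i, path in enumerate(db_paths):
            out.execute("ATTACH DATABASE ? AS src", (path,))
            if i == 0:
                # Create an empty table with the columns of the first input.
                out.execute(
                    f"CREATE TABLE IF NOT EXISTS {table} AS "
                    f"SELECT * FROM src.{table} WHERE 0"
                )
            out.execute(f"INSERT INTO {table} SELECT * FROM src.{table}")
            out.commit()  # DETACH is not allowed inside an open transaction
            out.execute("DETACH DATABASE src")
        return out.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        out.close()
```

With one solution per input file, the returned count equals the number of Monte Carlo runs gathered.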

Combine the models from Monte Carlo with DOMINO

The solutions in the gathered database are already solutions for the modeling. They are a set of discrete solutions that can be improved by combining the positions of the components in all of them. For example, if there are 100 solutions from the Monte Carlo experiments, then each component has 100 possible positions. The positions should be already correct, but DOMINO will explore all the possible combinations of Monte Carlo solutions, further improving their quality. If the assembly has 4 components, DOMINO can efficiently explore the 100^4 (that is, 10^8) possible combinations.
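The size of the search space follows from picking one position per component. The sketch below checks the arithmetic and enumerates a tiny case by brute force. The function names are hypothetical, and the enumeration is only an illustration of what a "combination" is; it says nothing about how DOMINO explores the space.

```python
from itertools import product


def n_combinations(n_solutions, n_components):
    """Number of ways to pick one position per component."""
    return n_solutions ** n_components


def enumerate_combinations(positions_per_component):
    """Brute-force list of every combination (only feasible for tiny inputs)."""
    return list(product(*positions_per_component))
```

For 100 solutions and 4 components, n_combinations gives 100,000,000, which is why exhaustive enumeration is impractical at that scale.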

Visualizing the models and understanding the information in the database of solutions

The database of results contains all the positions for the rigid bodies in the solutions. At this point, EMageFit can write them out as PDB files.

Each record for a solution in the database contains the following information: