IMP Reference Guide
develop.7400db2aee,2024/11/23
The Integrative Modeling Platform
|
This page gives a full description of running EMageFit, from the collection of data to the final production of models. For a demonstration of actually applying it to an example complex, see the EMageFit tutorial.
Only three things are needed to get models using EMageFit:
Each PDB file must contain a protein, a DNA strand, or a subcomplex. It is possible to have various different chains within the same PDB, and all of them will be considered as a rigid body. All the chains that are going to be assembled must have different ID (for example a chain with ID 'A' cannot be present in two different files). All the atoms in each PDB file will be used ,so if there are duplicated atoms, they need to be removed; in fact there is no harm in also removing other records like REMARK, SOURCE, COMPND, SEQRES, DBREF, CONECT. They are not relevant for the type of problem that EMageFit solves.
IMP
can understand 3 image formats: Spider, JPG and TIFF. Spider is probably the most commonly used, since it is specific for EM. The free program em2em can be used to convert other EM image formats to Spider. JPG and TIFF are more useful for visualization; EMageFit includes a script (convert_spider_to_jpg) to convert Spider files to JPG.
Each EM image has to be a separate file. The images to be used need to be listed in a "selection file". This is just a file with 2 columns; the left column is simply the name of the image, and the right column is either 0 or 1. 0 signifies that that image will not be used. For example, the following selection file
image1.spi 1 image2.spi 0 image3.spi 1
contains 3 images but only image1.spi and image3.spi will be used for modeling.
An EMageFit configuration file is just a Python file with classes that describe all the parameters and restraints for the modeling. Using a Python file as configuration file makes adding new parameters to the simulation trivial. An example configuration file is included as part of the EMageFit tutorial.
Producing models requires 4 steps:
EMageFit performs docking between components that are subject to cross-linking restraints using the program HEXDOCK. This is what EMageFit does:
A-B-D-E \|/ Cthe computed maximum spanning tree could look similar to:
A-B-D-E | Cwhich implies that B should be anchored and be the first receptor, as it is the component with the largest number of neighbors. An edge indicates that a docking should be done. In this case: A (ligand) is docked to B (receptor), C is docked to B, D is docked to B, and finally E is docked to D.
This procedure is run with the –dock
argument to the emagefit
command line tool. (See the EMageFit tutorial for an example command line.)
Once this step is complete, the user must take the information from the dockings and put it in the configuration file, indicating which component is anchored and referencing the files of relative transformations from the dockings. The options to modify are self.anchor
and self.dock_transforms
. See the EMageFit tutorial for an example. As mentioned before, the pairwise dockings are optional; EMageFit can work without them. In that case all the Monte Carlo moves will be random. The user should indicate that docking solutions are not available by setting the option self.non_relative_move_prob
(probability of doing a not docking-related move) to value 1. This means that EmageFit will always do a random move. Another strategy is possible: even if the docking program does not produce solutions compatible with the cross-linking restraints the user may still want to use them. They have reasonable conformations after all, without classes, and perhaps not far from some other conformation that actually satisfies the restraints. It is advisable then to set self.non_relative_move_prob
to a low number (say 0.2, meaning that a move from a docking solution will be chosen only 20% of the time).
emagefit also takes an option –log
. If it is used (recommended), a log file is generated (otherwise, the log is output to the screen). This logging information comes only from EMageFit itself, and is different from the IMP
logging system. The granularity of the logging can be selected by using the variables from the Python logging module, e.g. DEBUG, INFO, etc.
Once the relative docking transformations are set, models can be generated using Monte Carlo optimizations. This uses the –monte_carlo
argument to emagefit, which takes a single numeric argument, which is the random seed to use for the optimization. Repeated runs using the same random seed should generate the same outputs. If running on a single machine, the special value -1 can be used, which uses the current time to set the random seed (this is not recommended when running multiple jobs on a compute cluster, since several jobs could start at the same time).
Each run generates an SQLite database containing only one solution. It should be repeated many times to generate multiple solutions, as multiple database files.
In this step, all of the independent Monte Carlo solutions are gathered into a single database file. This uses the –gather
option to emagefit.
The solutions in the gathered database are already solutions for the modeling. They are a set of discrete solutions that can be improved by combining the positions of the components in all of them. For example, if there are 100 solutions from the Monte Carlo experiments, then each component has 100 100 possible positions. The positions should be already correct, but DOMINO will explore all the possible combinations of Monte Carlo solutions further improving their quality. If the assembly has 4 components, DOMINO can efficiently explore the 1004 possible combinations.
The database of results contains all the positions for the rigid bodies in the solutions. At this point, EMageFit can write them out as PDB files.
Each record for a solution in the database contains the following information: