IMP Reference Guide
develop.330bebda01,2025/01/21
The Integrative Modeling Platform
|
Sampling exhaustiveness protocol. More...
Sampling exhaustiveness protocol.
This module implements the sampling exhaustiveness test described in Viswanath et al, 2017. The protocol is primarily designed to work with models generated by the Integrative Modeling Platform (IMP) (and more specifically the IMP::pmi module), but could probably be adapted for other systems.
pyRMSD is needed. (This is a fork of the original pyRMSD - which is no longer maintained - to fix bugs and add Python 3 support.)
In the Sali lab, pyRMSD is already built, so can be used with module load python3/pyrmsd
.
The protocol is typically run using the imp_sampcon
command line tool:
imp_sampcon show_stat
to show the available fields (e.g. scoring function terms, restraint satisfaction) in an IMP::pmi stat file.imp_sampcon select_good
to select a subset of good-scoring models from a set of IMP::pmi trajectories.imp_sampcon plot_score
to plot the score distributions of the selected models.imp_sampcon exhaust
to analyze the selected models and determine whether sampling was exhaustive.For a full demonstration of the protocol, see its usage in IMP's actin modeling tutorial.
PMI analysis is the current way to select models for the exhaustiveness test. It selects models based on MCMC equilibration and HDBSCAN density-based clustering of model scores.
In this module, we also have scripts for an earlier method to select models based on thresholds on score terms.
select_good
selects a set of good-scoring models based on user-specified score thresholds. The options are described in detail in the actin modeling tutorial.
A few things to note:
select_good
has two modes to help with this. First, one can just guess the values of lower and upper thresholds of score terms based on individual stat files from sampling and use select_good
in FILTER mode (i.e. without -e option) to see how many models one can get with the current thresholds. One can also plot these scores using plot_score
to aid in refining the thresholds. Once the final thresholds are obtained, one can use select_good
in EXTRACT mode (i.e. with -e option) to extract the corresponding RMFs.-alt
and -aut
(aggregate lower and upper thresholds) are the lower threshold and upper thresholds to be set for most score terms like Total_Score
. -mlt
and -mut
(member lower and upper thresholds) only need to be set for crosslinks when specifying distance thresholds for each type of crosslink.-sl
(selection list) specifies which scores are selected based on specified thresholds and -pl
(print list) is the extra set of scores that will be printed for each selected model. If a score term is already in the selection list it need not be specified in the print list. Currently, empty print lists are not allowed.$run_directory/$run_prefix*/output
where $run_directory and $run_prefix are specified by -rd
and -rp
.See the usage of exhaust
in IMP's actin modeling tutorial.
The output includes
Convergence of top score file : *.Top_Score_Conv.txt
contains the average and standard deviation of the top score (second and third columns) obtained by random subsets of size 10%, 20% …. 100% of the total number of models (first column). Difference in score distributions : *.Score_Hist_*.txt
stores the histogram of score distributions and *.KS_test.txt
contains the effect size D and p-value from the KS test. Chi-square test : actin.ChiSquare_Grid_Stats.txt
contains for each clustering threshold (first column), the p-value, Cramer’s V and population of models in the contingency table (second, third and last columns). actin.Sampling_Precision_Stats.txt
contains the value of sampling precision and the p-value, Cramer’s V and population of models at the sampling precision.
PDF files are generated for results of the above tests if the gnuplot option is specified (-gp
).
Clusters : A directory is created for each cluster (cluster.0
, cluster.1
and so on). The indices of models belonging to each cluster x are listed in cluster.x.all.txt
, and listed by sample in cluster.x.sample_A.txt
and cluster.x.sample_B.txt
. Cluster populations are in actin.Cluster_Population.txt
, showing the number of models in samples A (second column) and B (third column) in each cluster (first column). Cluster precisions are in actin.Cluster_Precision.txt
, with the precision defined by the average RMSD to the cluster centroid. The individual cluster directories contain representative bead models and localization densities.
Note that for both the score distribution test and the chi-square tests, the distributions of models in samples A and B are deemed to be different only if the p-value is significant as well as effect size (D or Cramer's V) is also large. See also Similarity of Scores
section in the paper Viswanath et al, 2017.
The model precision is now calculated as the average RMSD between a model in the cluster and the cluster centroid model, and not using weighted RMSF as mentioned in the original paper.
Here a few special cases of exhaust
are mentioned.
There are two separate number-of-cores arguments in exhaust
. One (-c
), sets the number of parallel threads to be used while computing the RMSD matrix, and the other (-cc
), sets the number of cores to be used for clustering and parallelly reading the RMF files. For the latter, especially in clustering, large systems may occupy a significant portion of the memory and it is advised to keep this option at a lower value in that case. The former is relatively more memory-robust.
Sometimes one would like to simply obtain clusters given a threshold, without going through the sampling precision calculation. For example, this could be because the user has already run exhaust
once on the same input and found too many clusters at the sampling precision and would like to visualize a smaller number of clusters by clustering at a threshold worse than the sampling precision. In this case, one can use the -s -ct <threshold>
options to skip the expensive steps of distance matrix generation and sampling precision calculation and directly cluster models at the given threshold.
This is usually not necessary if the sampling uses an EM map to place the complex. One must not align when specifying the subunit or a selected set of particles (-su
or -sn
option), since the reference frame will be specified by the fixed particles.
By default, exhaust
considers all subunits for RMSD and clustering. There are a couple of options to specify one or more subunits in particular. To select a single subunit -su
option can be used. To select multiple subunits or domains of subunits, -sn
option can be used. Protein domains are specified with start and end residue numbers. Each selection is listed as an entry in a dictionary called density_custom_ranges
in a text file, like in the following example. ``` density_custom_ranges = {"Rpb4":[(1,200,"Rpb4")],"Rpb7":[("Rpb7")],"Rpb11-Rpb14-subcomplex":[("Rpb11"),("Rpb14")]} `` If there are multiple protein copies, to specify copy number in the protein name the format
prot.copy_number<tt>must be used. For example: `` density_custom_ranges = {"Rpb4":[(1,200,"Rpb4.0")],"Rpb7.1":[(1,710,"Rpb7.1")]} ```
Note that one usually does not align (-a
) when specifying select subunits, since the frame of reference is determined by the fixed subunits.
A density file must be specified. The syntax is identical to the selection text file above (-sn
). The output will contain MRC files, one per element of the dictionary in the density file.
Voxel (-v
) and density threshold (-dt
) input options are related to density generation.
The protocol can also handle systems with ambiguity and symmetry (e.g. multiple protein copies whose positions can be interchanged), where this information needs to be considered while calculating the RMSD between models. The RMSD between two protein models is the minimum RMSD over permutations of equivalent proteins.
Note**: You only need to use this option if your protein copies are symmetric, i.e. they are interchangeable in the model. If there are multiple copies but they occupy distinct positions in the model, this option will not be useful and will give the same result as regular exhaust
without ambiguity option.
If a system has 2 copies of protein A and 1 copy of protein B, i.e. the proteins are A.0, A.1,B.0. The RMSD between any pair of models m0 and m1, is the minimum RMSD between RMSD[m0(A.0,A.1,B.0) , m1(A.0,A.1,B.1)]
and RMSD[m0(A.0,A.1,B.1), m1(A.1,A.0,B.1]
. Note that the copies of A in m1 were interchanged while calculating the second RMSD.
To implement this, pyRMSD takes an additional argument symm_groups
which is a list of particle indices of equivalent particles. For the above case for instance, symm_groups
has one symmetric group with the particle indices of A.0 and A.1. symm_groups=[[[A.0.b0,A.1.b0],[A.0.b1,A.1.b1],[A.0.b2,A.1.b2]..[A.0.bn,A.1.bn]]]
. Here A.X.bi
is the index of the i'th bead in protein A.X and the ith beads of the two protein copies are considered equivalent particles.
To generate this list of symmetric groups, one needs to pass an additional file with the ambiguity option to the master exhaust script. The file contains one line per symmetric group, and components of symmetric groups are separated by white space. See also the example in symminput
.
Additionally, as different individual elements of a symm group, one can specify a complex of multiple molecules such that they are treated as a single unit while trying out different permutations to find the minimum RMSD. As an illustration, for a system with A.0
, A.1
, A.2
, B.0
, B.1
, B.2
, without complexes, the ambiguity file would look as follows:
``` A.0 A.1 A.2 B.0 B.1 B.2 ```
There are a total of 64 permutations/swaps to explore, i.e. the final RMSD is the minimum of the RMSD of these permutations. If A.x
and B.x
are complexed, specifying this should make the ambiguity file look as follows:
``` A.0/B.0 A.1/B.1 A.2/B.2 ```
This has only 8 permutations to explore.
Note**: While using complexes (i.e. A.0/B.0
), the script assumes that the order in which the selected particles are returned by the Selection on the model Hierarchy is regular, i.e. either A.0 A.1 B.0 B.1
or A.0 B.0 A.1 B.1
and any other arbitrary order might cause the symm groups to be incorrect.
Namespaces | |
clustering_rmsd | |
Utilities to cluster and calculate RMSD and precision. | |
good_scoring_model_selector | |
Select good-scoring models based on scores and/or data satisfaction. | |
precision_rmsd | |
Calculation of precision and RMSD. | |
rmsd_calculation | |
Utilities to help with RMSD calculation. | |
scores_convergence | |
Utilities to manage distributions of scores. | |
Functions | |
def | get_data_path |
Return the full path to one of this module's data files. More... | |
def | get_example_path |
Return the full path to one of this module's example files. More... | |
def | get_module_name |
Return the fully-qualified name of this module. More... | |
def | get_module_version |
Return the version of this module, as a string. More... | |
def IMP.sampcon.get_data_path | ( | fname | ) |
Return the full path to one of this module's data files.
Definition at line 17 of file sampcon/__init__.py.
def IMP.sampcon.get_example_path | ( | fname | ) |
Return the full path to one of this module's example files.
Definition at line 22 of file sampcon/__init__.py.
def IMP.sampcon.get_module_name | ( | ) |
Return the fully-qualified name of this module.
Definition at line 12 of file sampcon/__init__.py.
def IMP.sampcon.get_module_version | ( | ) |
Return the version of this module, as a string.
Definition at line 7 of file sampcon/__init__.py.