IMP Tutorial for IMP version 2.4.0
Stage 4 - Analysis Part 1

Introduction

In the analysis stage we cluster (group by similarity) the sampled models to determine high-probability configurations. Comparing clusters may indicate that there are multiple acceptable configurations given the data.

Precomputed results

A long modeling run was precomputed and analyzed. Both the modeling output and the corresponding analysis can be downloaded from our website.

Clustering top models (clustering.py)

The clustering.py script, found in the rnapolii/analysis directory, calls the AnalysisReplicaExchange0 macro, which finds the top-scoring models, extracts their coordinates, runs k-means clustering, and performs basic cluster analysis, including creating localization densities for each subunit. The script generates an output directory with one subdirectory per requested cluster. Each subdirectory contains an RMF and a PDB file for each extracted structure, a stat file, and the localization densities.

We can choose the number of clusters, the subunits we want to use to calculate the RMSD, and the number of good-scoring solutions to include. These options are at the top of the script:

num_clusters = 1 # how many clusters to create
num_top_models = 5 # total number of best models to analyze
merge_directories = ["../modeling/"] # directories to analyze
prefiltervalue = 2900.0 # prefilter by score

If sampling was performed in several independent runs, all of them can be analyzed together by appending their directories to merge_directories. The prefiltervalue removes all models that score worse than this value (that is, models with a numerically higher score are not clustered), which helps to reduce the problem size.
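To make the interplay between the prefilter and the number of top models concrete, here is a minimal plain-Python sketch. It is not the macro's own code: the function name and the scores are made up, and it assumes that a lower score is better, so that the prefilter acts as an upper bound on the score.

```python
def select_top_models(scores, prefilter_value, num_top_models):
    """Return indices of the best-scoring models that pass the prefilter.

    Assumption: lower scores are better, so models scoring at or above
    prefilter_value are discarded before ranking.
    """
    passing = [i for i, s in enumerate(scores) if s < prefilter_value]
    passing.sort(key=lambda i: scores[i])  # best (lowest) score first
    return passing[:num_top_models]


# Made-up scores, as if pooled from several sampling runs
scores = [2950.0, 2810.5, 2875.2, 3010.7, 2790.1, 2899.9, 2840.0]
top = select_top_models(scores, prefilter_value=2900.0, num_top_models=5)
print(top)  # prints [4, 1, 6, 2, 5]
```

Two of the seven models are rejected by the prefilter, and the remaining five are ranked from best to worst.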

Create the analysis macro and pass it basic information (it will search for stat files):

model=IMP.Model()
mc=IMP.pmi.macros.AnalysisReplicaExchange0(model,
                                        merge_directories=merge_directories)

These are features that are kept around (and moved to the cluster stat files):

feature_list=["ISDCrossLinkMS_Distance_intrarb",
              "ISDCrossLinkMS_Distance_interrb",
              "ISDCrossLinkMS_Data_Score",
              "GaussianEMRestraint_None",
              "SimplifiedModel_Linker_Score_None",
              "ISDCrossLinkMS_Psi",
              "ISDCrossLinkMS_Sigma"]

Now we specify the subunits (or groups or fractions of subunits) for which we want to create localization densities. density_names is a dictionary in which the keys are convenient names like "Rpb1-CTD" and each value is a list of selections. A selection item can be either a domain name like "Rpb1" or a tuple like (200,300,"Rpb1"), which means residues 200-300 of component Rpb1. This enables the user to combine multiple selections for a single density calculation.

density_names = {"Rpb4":["Rpb4"],
                 "Rpb7":["Rpb7"]}
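As an illustration of the selection format, the sketch below builds a larger dictionary that mixes whole components with a residue range, and prints what each entry selects. The "Rpb1-CTD" and "Rpb4-Rpb7" entries and the describe_selection helper are hypothetical and are not part of clustering.py; the residue range is only the example used in the text.

```python
# Hypothetical selections, for illustration only
density_names = {
    "Rpb4": ["Rpb4"],                  # a whole component
    "Rpb7": ["Rpb7"],
    "Rpb1-CTD": [(200, 300, "Rpb1")],  # residues 200-300 of Rpb1
    "Rpb4-Rpb7": ["Rpb4", "Rpb7"],     # two selections combined in one density
}


def describe_selection(item):
    """Turn one selection item into a human-readable description."""
    if isinstance(item, str):
        return f"all of {item}"
    start, end, component = item
    return f"residues {start}-{end} of {component}"


for name, selections in density_names.items():
    print(name, "->", "; ".join(describe_selection(s) for s in selections))
```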

Next, we specify the components used in calculating the RMSD between models. All selections here are used together for a single RMSD calculation between two models. The format is the same as density_names. One use case is when only a subset of the system is actually being sampled (with the rest kept static). Note that unless you provide something to align_names (see below), no alignment is done before calculating RMSD.

rmsd_names = {"Rpb4":"Rpb4",
              "Rpb7":"Rpb7"}
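To see why alignment matters, consider a minimal sketch of an RMSD computed without any superposition. This is plain Python rather than the macro's implementation, and the coordinates are made up. Two models with identical shape but different positions give a nonzero RMSD, which is the desired behavior when an absolute reference frame exists.

```python
import math


def rmsd_no_alignment(coords_a, coords_b):
    """RMSD between two equally ordered coordinate lists, with no superposition."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both models must contain the same number of points")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: model_2 is model_1 shifted by 3 units along z
model_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_2 = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(rmsd_no_alignment(model_1, model_2))  # prints 3.0
```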

Next, we specify the components used for structural alignment. Alignment is needed when there is no absolute reference frame (such as an EM map). The format is the same as for density_names and rmsd_names. In this case we use None because the EM map already provides a reference frame.

align_names = None

Finally, we start the clustering. Most of the options were chosen earlier in the script.

mc.clustering(prefiltervalue=prefiltervalue, # prefilter the models by score
              number_of_best_scoring_models=num_top_models, # number of models to be clustered
              alignment_components=None, # list of proteins you want to use for structural alignment
              rmsd_calculation_components=rmsd_names, # list of proteins used to calculate the rmsd
              distance_matrix_file="distance.rawmatrix.pkl", # save the distance matrix
              outputdir=out_dir, # location for clustering results
              feature_keys=feature_list, # extract these fields from the stat file
              load_distance_matrix_file=False, # if True, skip the matrix calculation and read the precalculated matrix
              display_plot=True, # display the heat map plot of the distance matrix
              exit_after_display=False, # if True, exit after displaying the distance matrix plot
              get_every=1, # use every n-th structure (values above 1 skip structures for faster computation)
              number_of_clusters=num_clusters, # number of clusters to be used by kmeans algorithm
              voxel_size=3.0, # voxel size of the mrc files
              density_custom_ranges=density_names) # setup the list of densities to be calculated

Results

Run the clustering script by changing into the rnapolii/analysis directory and then running:

python clustering.py

If you ran modeling.py with the --test option, it is a good idea to give the --test option to clustering.py as well (this increases the prefilter value; none of the 50 test models generated may be good enough to satisfy the default prefilter value). With such minimal sampling, the quality of the results is unlikely to be high; you can download the precalculated results and the resulting clusters from our website.

First we can look through the cluster results directory to see the output (see example below). The clustering directory contains the distance matrix plot (described below) and a folder for each cluster. Within the cluster folder are PDB and RMF files containing members of each cluster, localization densities for requested components (the .mrc files), and a stat file output (with one entry for each cluster member). All RMF, PDB, and MRC files should be viewable in Chimera.

clustering files
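A quick way to check the output is to count the files of each type in every cluster folder. The sketch below does this with the standard library only. The directory name "clustering_output" is a placeholder for whatever out_dir is set to in your script, and the sketch assumes that cluster folders are named cluster.N, as in the provided results.

```python
from collections import Counter
from pathlib import Path


def summarize_clusters(output_dir):
    """Map each cluster.N folder to a count of its files per extension."""
    root = Path(output_dir)
    if not root.is_dir():
        return {}
    summary = {}
    for cluster_dir in sorted(root.glob("cluster.*")):
        if cluster_dir.is_dir():
            suffixes = (p.suffix for p in cluster_dir.iterdir() if p.is_file())
            summary[cluster_dir.name] = dict(Counter(suffixes))
    return summary


# "clustering_output" is a placeholder; use the value of out_dir from your script
for name, counts in summarize_clusters("clustering_output").items():
    print(name, counts)
```

For each cluster you should see matching numbers of RMF and PDB files, plus one .mrc file per requested density.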

Here is an example modeling result (from the provided files, cluster.1/4.rmf3, the cluster center):

Example result

Next we can examine the plots produced by the clustering script. The plots are written to a single file (dist_matrix.pdf) in the clustering directory. The first plot is the distance matrix of the models after they have been grouped into clusters. The matrix should show the requested number of clusters, with much lower within-cluster than between-cluster distances. If this is not the case, then perhaps too many clusters were chosen.
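The visual check on the distance matrix can also be expressed numerically. The following sketch compares the mean within-cluster and between-cluster distances for a small, made-up distance matrix and cluster assignment; it is an illustration and not part of the clustering script.

```python
def mean_within_and_between(dist, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                within.append(dist[i][j])
            else:
                between.append(dist[i][j])

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(between)


# Made-up pairwise RMSDs for four models and their cluster labels
dist = [
    [0.0, 2.0, 9.0, 10.0],
    [2.0, 0.0, 8.0, 9.0],
    [9.0, 8.0, 0.0, 3.0],
    [10.0, 9.0, 3.0, 0.0],
]
labels = [0, 0, 1, 1]
within, between = mean_within_and_between(dist, labels)
print(within, between)  # prints 2.5 9.0
```

A between-cluster mean that is several times larger than the within-cluster mean corresponds to the clear block structure we hope to see in the plot.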

The second plot is a dendrogram, basically showing the distance matrix in a hierarchical way. Each vertical line from the bottom is a model, and the horizontal lines show the RMSD agreement between models. Sometimes the dendrogram can indicate a natural number of clusters, which can help determine the correct number to use. Here is the result from using 2 clusters on the example results:

Distance matrix and dendrogram
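To illustrate what the dendrogram encodes, the sketch below performs a simple agglomerative clustering on a made-up distance matrix and records the height at which each merge happens. Single linkage is an assumption chosen for brevity; the linkage method used for the plot is not stated here and may differ.

```python
def single_linkage_merges(dist):
    """Return (members_a, members_b, height) for each merge, in order."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Made-up pairwise RMSDs for four models
dist = [
    [0.0, 2.0, 9.0, 10.0],
    [2.0, 0.0, 8.0, 9.0],
    [9.0, 8.0, 0.0, 3.0],
    [10.0, 9.0, 3.0, 0.0],
]
merges = single_linkage_merges(dist)
for members_a, members_b, height in merges:
    print(members_a, "+", members_b, "at", height)
```

Here the merge heights are 2.0, 3.0, and 8.0. The large jump before the last merge is the kind of gap that suggests two natural clusters.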

Next we can examine the localization densities of a cluster. These can give a qualitative idea of the precision of a cluster. Below we show results from cluster.1 in the provided results: the native structure without Rpb4/7 (in blue), the target density map (in mesh), and the localization densities (Rpb4 in cyan, Rpb7 in purple). The localizations are quite narrow and close to the native solution:

Localization densities
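Conceptually, a localization density records how often a component occupies each region of space across the models in a cluster. The sketch below shows a simplified version of this idea as a voxel occupancy count. It is not the macro's density calculation, which may differ (for example, by smoothing); the function name and coordinates are made up, and only the voxel size of 3.0 is taken from the script.

```python
from collections import Counter


def voxel_occupancy(models, voxel_size=3.0):
    """Fraction of models that place at least one point in each voxel.

    models: list of models, each a list of (x, y, z) coordinates.
    """
    counts = Counter()
    for coords in models:
        # a set, so that each model contributes at most once per voxel
        voxels = {tuple(int(c // voxel_size) for c in xyz) for xyz in coords}
        counts.update(voxels)
    return {voxel: n / len(models) for voxel, n in counts.items()}


# Made-up positions of a single bead in three models
models = [[(1.0, 1.0, 1.0)], [(2.0, 2.0, 2.0)], [(7.0, 1.0, 1.0)]]
density = voxel_occupancy(models)
for voxel, fraction in sorted(density.items()):
    print(voxel, round(fraction, 2))
```

A narrow density corresponds to a few voxels with fractions close to 1, while a spread-out density has many voxels with small fractions.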

For quantitative analysis of the clustering results we need to call another script (see Part 2).