IMP logo
IMP Tutorial
Deposition of integrative models

Introduction

In this tutorial we will introduce the procedure used to deposit integrative modeling studies in the PDB-Dev database in mmCIF format.

We will demonstrate the procedure using IMP and its PMI module, but the database will accept integrative models from any software, as long as they are compliant mmCIF files (e.g. there are several HADDOCK and Rosetta models already in PDB-Dev).

This tutorial can be followed in several ways:

Why PDB-Dev?

PDB-Dev is a database run by the wwPDB. It is specifically for the deposition of integrative models, i.e. models generated using more than one source of experimental data. (Models that use only one experiment generally go in PDB; models that use no experimental information - theoretical models - go in ModelArchive).

Why mmCIF?

wwPDB already uses the mmCIF file format for X-ray structures, and has a formal data model (PDBx) to describe these structures. The format is extensible; extension dictionaries exist to describe NMR, SAS, EM data, etc. For integrative models, we use PDBx plus an "integrative/hybrid methods" (IHM) extension dictionary. This supports coarse-grained structures, multiple input experimental data sources, multiple states, multiple scales, and ensembles related by time or other order.

Study versus single model

A frequently asked question is whether mmCIF files can be generated directly from an IMP::Model or an RMF file. This generally isn't possible because RMF and IMP::Model are designed to store one or more output models, where each model is a set of coordinates for a single conformation of the system being studied. A deposition, on the other hand, aims to cover a complete modeling study, and should capture the entire ensemble of output models, all of the input data needed to reproduce the modeling, and quality metrics such as the precision of the ensemble and the degree to which it fits the output data. A deposition is also designed to be visualized, and so may contain additional data not used in the modeling itself, such as preset colors or views to match figures in the publication or highlight regions of interest, and more human-descriptive names for parts of the system. Thus, deposition is largely a data-gathering exercise, and benefits from a modeling study being tidy and well organized (for example, by storing it in a GitHub repository) so that data is easy to find and track to its source.

File format

The file format used by PDB for deposition of integrative modeling studies is the same mmCIF format that is used for crystal structures, with extensions (the Integrative/Hybrid Methods (IHM) dictionary) to support coarse-grained structures, multiple input experimental data sources, multiple states, multiple scales, and ensembles related by time or other order. Currently, models that are compliant with the IHM dictionary can be deposited in the PDB-Dev database.

mmCIF is a text format with a well-defined syntax, so in principle files could be generated by hand or with simple scripts. However, it is generally easier to use the existing python-ihm library. This stores the same data as in an mmCIF file, but represents it as a set of Python classes, so it is easier to manipulate. IMP includes a copy of the python-ihm library, and uses it internally to read and write mmCIF files.

There are two main approaches to generating an mmCIF file for deposition:

  1. Use the python-ihm library directly, by writing a Python script that reads in output models and input data, adds annotations, and writes out an mmCIF file.
  2. Use the ProtocolOutput class to automatically capture an entire IMP::pmi modeling protocol.

The first approach offers the greatest flexibility for custom protocols or for modeling outside of IMP itself, but does require everything to be done manually. For a good example of this approach, see the modeling of Nup133. (One example of the tradeoff between flexibility and manual coding: the cross-links used in the study are stored in a plain text file so a custom Python class had to be written to parse them.)

The second approach will be explored in this tutorial.

Basic usage of ProtocolOutput

ProtocolOutput is designed to be attached to a top-level PMI object (usually IMP::pmi::topology::System). Then, as the script is run, it will capture all of the information about the modeling study, in an ihm.System object. Additional information not in the modeling script itself, such as the resulting publication, can be added using the python-ihm API.

We will demonstrate ProtocolOutput by using it to make a deposition of the RNA Pol II stalk modeling, as covered in the introductory PMI tutorial. First, download the files for this tutorial by using the "Clone or download" link at the tutorial's GitHub page.

We will then modify the modeling.py script from the introductory tutorial, calling it deposition.py in the rnapolii/modeling subdirectory, to capture the study as mmCIF.

The first modification to deposition.py is to import the PMI mmCIF and python-ihm Python modules:

# Imports needed to use ProtocolOutput
import ihm
import ihm.location
import ihm.model

Next, as soon as the top-level IMP::pmi::topology::System object is created, we attach a ProtocolOutput object (BuildSystem contains a system member):

# Use the BuildSystem macro to build states from the topology file
# Record the modeling protocol to an mmCIF file
bs.system.add_protocol_output(po)
po.system.title = "Modeling of RNA Pol II"
# Add publication
po.system.citations.append(ihm.Citation.from_pubmed_id(25161197))

Note that the ProtocolOutput object po simply wraps an ihm.System object as po.system. We can then customize the ihm.System by setting a human-readable title and adding a citation (here we use ihm.Citation.from_pubmed_id, which looks up a citation by PubMed ID - this particular PubMed ID is actually for the previously-published modeling of the Nup84 complex).

As it is generally undesirable to rerun the entire modeling just to generate an mmCIF file, we can save a lot of time when we do the replica exchange by turning on ReplicaExchange's test_mode, which skips the actual Monte Carlo simulation (so we use the previously-generated trajectory):

# This object defines all components to be sampled as well as the sampling protocol
root_hier=root_hier,
monte_carlo_sample_objects=dof.get_movers(),
output_objects=outputobjects,
monte_carlo_temperature=1.0,
simulated_annealing=True,
simulated_annealing_minimum_temperature=1.0,
simulated_annealing_maximum_temperature=2.5,
simulated_annealing_minimum_temperature_nframes=200,
simulated_annealing_maximum_temperature_nframes=20,
replica_exchange_minimum_temperature=1.0,
replica_exchange_maximum_temperature=2.5,
number_of_best_scoring_models=10,
monte_carlo_steps=num_mc_steps,
number_of_frames=num_frames,
global_output_directory="output",
test_mode=True)
# Start Sampling
mc1.execute_macro()

Once we're done with our PMI protocol, we call the ProtocolOutput.finalize method to collect all of the information about the integrative modeling protocol in ihm.System:

po.finalize()

Note that the ihm.System object, stored as the system attribute of ProtocolOutput, contains a full description of the system:

s = po.system
print("all subunits:",
[a.details for a in s.asym_units])
print("first subunit sequence:",
"".join(r.code for r in s.asym_units[0].entity.sequence))
print("all restraints on the system:", s.restraints)

Note that python-ihm only knows about three restraints, the two crosslinking restraints and the one EM density restraint, not the connectivity or excluded volume. This is because PDB considers that all valid structures satisfy connectivity and excluded volume by definition.

mmCIF file format

At this point we have basic information which we can write out to an mmCIF file. Let's do that to get an idea of what the file looks like:

import ihm.dumper
with open('initial.cif', 'w') as fh:
ihm.dumper.write(fh, [s])

mmCIF is a simple text format, so we can look at initial.cif in any text editor. Some simple data are stored in straightforward key:value pairs, e.g.

_struct.title 'Modeling of RNA Pol II'

More complex data can be stored as a table using a loop construct. This lists the fields (headings for the columns) after which the data (as rows) follow, e.g.

loop_
_software.pdbx_ordinal
_software.name
_software.classification
_software.description
_software.version
_software.type
_software.location
1 'IMP PMI module' 'integrative model building' 'integrative model building'
2.12.0 program https://integrativemodeling.org
2 'Integrative Modeling Platform (IMP)' 'integrative model building'
'integrative model building' 2.12.0 program https://integrativemodeling.org

Missing data are represented by ? if they are unknown or . if they are deliberately omitted (for example if they don't make sense here).

In each case both the categories (struct, software) and the data items (title, pdbx_ordinal, etc.) are defined by PDB, in the PDBx dictionary. This is the same dictionary used for regular PDB entries (e.g. crystal structures). Categories prefixed with ihm are in the IHM dictionary, which is specific to integrative models.

Linking to other data

Integrative modeling draws on datasets from a variety of sources, so for a complete deposition all of this data need to be available. python-ihm provides an ihm.dataset module to describe various types of dataset, such as input experimental data, auxiliary output files (such as localization densities), or workflow files (such as Python scripts for modeling or visualization). For example, each restraint has an associated dataset:

print("restraint datasets:", [r.dataset for r in s.restraints])

The data are not placed directly in the mmCIF file - rather, the file contains links (using the ihm.location module). These links can be:

  • an identifier in a domain-specific database, such as PDB or EMDB.
  • a DOI where the files can be obtained.
  • a path to a file on the local disk.

Database identifiers are preferable because the databases are curated by domain experts and include domain-specific information, and the files are in standard formats. ProtocolOutput will attempt to use these where possible. When a file is used for the modeling which cannot be tracked back to a database, ProtocolOutput will include its path (relative to that of the mmCIF file). For example, in this case the cross-links used are stored in simple CSV files and are linked as such:

# Datasets for XL-MS restraint
for r in s.restraints:
if isinstance(r, ihm.restraint.CrossLinkRestraint):
print("XL-MS dataset at:", r.dataset.location.path)
print("Details:", r.dataset.location.details)

In addition, the Python script itself is linked from the mmCIF file. Such local paths won't be available to end users, so for deposition we need to replace these paths with database IDs or DOIs (more on that later).

Datasets all support a simple hierarchy, where one dataset can be derived from another. In this case the EM restraint uses a locally-available GMM file, but it is derived from a density map which is stored in EMDB (as EMD-1883). ProtocolOutput is able to read (using the ihm.metadata module) metadata in both the GMM file and .mrc file to automatically determine this relationship:

# Dataset for EM restraint
em, = [r for r in s.restraints
if isinstance(r, ihm.restraint.EM3DRestraint)]
d = em.dataset
print("GMM file at", d.location.path)
print("is derived from EMDB entry", d.parents[0].location.access_code)

In cases where ProtocolOutput is not able to determine the parent of a dataset, we can create a dataset manually and add it to d.parents, either directly or using the add_primary method. For example, see the modeling of the NPC where we linked our EM restraint back to the parent (the density map in EMDB) and grandparent (cryo-ET tilt series in EMPIAR from which the density map was constructed).

As a further example of linkage, see the links in the previously-published modeling of the Nup84 complex below. The mmCIF file links to the data directly used in the modeling (cross-links, crystal structures, electron microscopy class averages, comparative models, and Python scripts) via database IDs or DOIs. Furthermore, where available links are provided from this often-processed data back to the original data, such as templates for comparative models, mass spectometry spectra for cross-links, or micrographs for class averages:

Annotation of input files

ProtocolOutput will look at all input files to try to extract as much metadata as possible. As described above this is used to look up database identifiers, but it can also detect other inputs, such as the templates used for comparative modeling. Thus, it is important for deposition that all input files are annotated as well as possible:

  • deposit input files in a domain-specific database where possible and use the deposited file (which typically will contain identifying headers) in the modeling repository.
  • for PDB crystal structures, do not remove the original headers, such as the HEADER and TITLE lines.
  • for MODELLER comparative models, leave in the REMARK records and make sure that any files mentioned in REMARK 6 ALIGNMENT: or REMARK 6 SCRIPT: records are available (modify the paths if necessary, for example if you moved the PDB file into a different directory from the modeling script and alignment).
  • for manually generated PDB files, such as those extracted from a published work or generated by docking or other means, add suitable EXPDTA and TITLE records to the files for ProtocolOutput to pick up. See the python-ihm docs for more information.
  • for GMM files used for the EM density restraint, keep the original MRC file around and make sure that the # data_fn: header in the GMM file points to it.

Polishing the deposition

ProtocolOutput attempts to automatically generate as much as possible of the mmCIF file, but there are some areas where manual intervention is necessary because the data is missing, or its guess was incorrect. This data can be corrected by manipulating the ihm.System object directly, at the end of the script. We will look at a few examples in this section.

Cross-linker type

For cross-linking experiments, the mmCIF file contains a description of the cross-linking reagent used. This information is not in the CSV file. It can be provided in the Python script with the linker argument to passed to the PMI CrossLinkingMassSpectrometryRestraint, but if this argument was missing (or incorrect) we can correct it here. By checking the publications for the cross-link datasets from Al Burlingame's lab and Juri Rappsilber's lab), we can determine that the DSS and BS3 cross-linkers were used, respectively. These are common enough cross-linkers that python-ihm already includes definitions for them in the ihm.cross_linkers module (for less common linkers we can create an ihm.ChemDescriptor object from scratch to describe its chemistry). Then we just set the linker type for each cross-linking restraint in the list of all restraints:

# Definitions of some common crosslinkers
import ihm.cross_linkers
# Set linker types to match those used (DSS and BS3)
trnka, chen = [r for r in s.restraints
if isinstance(r, ihm.restraint.CrossLinkRestraint)]
trnka.linker = ihm.cross_linkers.dss
chen.linker = ihm.cross_linkers.bs3

Correct number of output models

ProtocolOutput captures information on the actual sampling, storing it in ihm.protocol.Step objects. Here it correctly notes that we ran Monte Carlo to generate 20000 frames:

# Get last step of last protocol (protocol is an 'orphan' because
# we haven't used it for a model yet)
last_step = s.orphan_protocols[-1].steps[-1]
print(last_step.num_models_end)

However, in many modeling scenarios the modeling script is run multiple times on a compute cluster to generate several independent trajectories which are then combined. ProtocolOutput cannot know whether this happened. It is straightforward though to use the python-ihm API to manually change the number of output models to match that reported in the publication:

# Correct number of output models to account for multiple runs
last_step.num_models_end = 200000

Add model coordinates

The current mmCIF file contains all of the input data, and notes that Monte Carlo was used to generate frames, but doesn't actually store any of those coordinates in the file. Ultimately this information will be added automatically by PMI model analysis and validation scripts, but for the time being any information about clustering, localization densities, and final models needs to be added to the file using the python-ihm API:

# Get last protocol in the file
protocol = po.system.orphan_protocols[-1]
# State that we filtered the 200000 frames down to one cluster of
# 100 models:
analysis = ihm.analysis.Analysis()
protocol.analyses.append(analysis)
analysis.steps.append(ihm.analysis.ClusterStep(
feature='RMSD', num_models_begin=200000,
num_models_end=100))

python-ihm allows for models to be placed into groups (ihm.model.ModelGroup) so we can make such a group for the cluster. These groups are in turn grouped in an ihm.model.State, a distinct state of the system (e.g. open/closed form, or reactant/product in a chemical reaction). Finally, states can themselves be grouped into ihm.model.StateGroup objects. All of these groups act like regular Python lists. In this case we have only one state, so we just add our model group to that:

mg = ihm.model.ModelGroup(name="Cluster 0")
# Add to last state
po.system.state_groups[-1][-1].append(mg)

Next, we can describe the cluster using the ihm.model.Ensemble class, noting that it was generated from our clustering:

e = ihm.model.Ensemble(model_group=mg,
num_models=100,
post_process=analysis.steps[-1],
name="Cluster 0")
po.system.ensembles.append(e)

To add a model to the file, we create an ihm.model.Model object, add atoms or coarse-grained objects to it, and then add that to the previously-created ModelGroup. ProtocolOutput provides a convenience function add_model which does this, converting the current IMP structure to python-ihm (we just need to load an IMP structure first, e.g. from an RMF file using IMP.rmf.load_frame).

import RMF
import IMP.rmf
# Add the model from RMF (let's say it's frame 42 from the output RMF file)
rh = RMF.open_rmf_file_read_only('output/rmfs/0.rmf3')
IMP.rmf.link_hierarchies(rh, [root_hier])
IMP.rmf.load_frame(rh, RMF.FrameID(42))
del rh
model = po.add_model(e.model_group)

In practice we typically want to store more than just a single model in the file, since a single model cannot represent the flexibility and variability of the complete ensemble. mmCIF files allow for storing multiple conformations (like a traditional PDB file for storing an NMR ensemble) and also representations of the entire ensemble (localization densities).

However, mmCIF is a text format and is not best suited for storing large numbers of structures. We can instead store the coordinates in a simpler, binary file and link to it from the mmCIF using the ihm.location.OutputFileLocation class, providing one or more representative structures (such as the cluster centroid) in the mmCIF file itself. DCD is one such format that we can (ab)use for this purpose (it is really designed for atomic trajectories). python-ihm provides an ihm.model.DCDWriter class which, given an ihm.model.Model, will output DCD.

If we have localization densities from the analysis for the cluster, we can add those to the Ensemble as ihm.model.LocalizationDensity objects. These take python-ihm subunits (or subsets of them) as ihm.AsymUnit objects. ProtocolOutput provides an asym_units member, which is a Python dict, to get python-ihm subunits given PMI component names (such as "Rpb4.0"). For example, we can easily add the localization density of the Rpb4 subunit for the entire cluster to the file:

# Look up the ihm.AsymUnit corresponding to a PMI component name
asym = po.asym_units['Rpb4.0']
# Add path to a local output file
loc = ihm.location.OutputFileLocation('output/cluster0.Rpb4.mrc')
den = ihm.model.LocalizationDensity(file=loc, asym_unit=asym)
# Add to ensemble
e.densities.append(den)

(We can also add information about the fit of the model to each of the restraints. Most python-ihm restraint objects contain a fits member which is a Python dict with ihm.model.Model objects as keys and some sort of restraint-fit object as values. For example, we could note that the cluster centroid model m was fit against the EM map by adding an ihm.restraint.EM3DRestraintFit object with the cross correlation coefficient, if known. We could also specify the Bayesian nuisances ψ and σ for our cross-links with ihm.restraint.CrossLinkFit objects.)

Replace local links with DOIs

We added paths for localization densities and the DCD file with ihm.location.FileLocation objects. These files won't be accessible to end users since they live on the local disk. (Recall from earlier that ProtocolOutput also adds local paths for the cross-link CSV files.)

FileLocation takes an optional repo argument which, if given, is an ihm.location.Repository describing the DOI where the files can be found, generally as a zip file (the first argument to ihm.location.FileLocation is a path within the repository if repo is specified, otherwise a path on the local disk). So we could explicitly refer to a DOI for each of our external files. (A number of services provide DOIs for uploaded files, such as Zenodo or FigShare.)

An alternative is to retroactively change the existing FileLocations (both those created automatically by ProtocolOutput and manually by us using python-ihm) using the root and top_directory arguments to Repository and the ihm.System.update_locations_in_repositories function:

# Replace local links with DOIs
repo = ihm.location.Repository(doi="10.5281/zenodo.2598760", root="../..",
top_directory="salilab-imp_deposition_tutorial-1ad5919",
url="https://zenodo.org/record/2598760/files/salilab/"
"imp_deposition_tutorial-v0.2.zip")
s.update_locations_in_repositories([repo])

This assumes that the zipfile at the given DOI and URL was created by archiving all files under the top directory of the GitHub repository (../.. relative to the directory containing the script). (It was actually created automatically by Zenodo when a new release of this tutorial was created on GitHub. Zenodo puts all the files in the zipfile under a uniquely-named directory, so we need to set top_directory to match.)

In principle archiving a GitHub repository at Zenodo is as simple as downloading the repository from GitHub as a zip file, then uploading that file to Zenodo. In practice, however, these files might be quite large, and so viewers such as ChimeraX may need to download a multi-gigabyte zip file just to get a relatively small input, such as a CSV or FASTA file. To remedy this, the repository can be split into multiple smaller zip files. See for example the multiple files archiving the 2018 NPC study; these were generated using this script.

Thus the cross-link file which was previously linked with the local path ../data/polii_xlinks.csv can now be found by downloading imp_deposition_tutorial-v0.2.zip and extracting salilab-imp_deposition_tutorial-1ad5919/rnapolii/data/polii_xlinks.csv from it:

# Datasets for XL-MS restraint
trnka, chen = [r for r in s.restraints
if isinstance(r, ihm.restraint.CrossLinkRestraint)]
d = trnka.dataset
print("XL-MS dataset now at %s/%s inside %s"
% (d.location.repo.top_directory,
d.location.path, d.location.repo.url))

Output

We can now save the entire protocol to an mmCIF file (or to BinaryCIF, if we have the Python msgpack package installed) using the ihm.dumper.write function:

po.finalize()
with open('rnapolii.cif', 'w') as fh:
ihm.dumper.write(fh, [po.system])
#with open('rnapoliii.bcif', 'wb') as fh:
# ihm.dumper.write(fh, [s], format='BCIF')

(BinaryCIF stores all the same data as mmCIF, but rather than storing it as a text file, the data is compressed and then written out as a MessagePack file, which is a form of binary JSON.)

Visualization

ChimeraX

mmCIF files can be viewed in many viewers. However, most viewers do not yet support the integrative modeling extensions, and so may only show the atomic parts of the model (if any). Integrative models can be viewed in ChimeraX - be sure to use a recent nightly build, and open the file from the ChimeraX command line using the format ihm option, e.g. open rnapolii.cif format ihm. (If you also want to see any DCD files, add ensembles true to the end of your command line to load them and then use the ChimeraX coordset command to manipulate the set of coordinates, e.g. coordset slider #1.3.2.)

ChimeraX also supports downloading and displaying structures directly from the PDB-Dev database, for example from the ChimeraX command line open 10 from pdbdev.

Note that even though PDB-Dev is still quite small, the models vary widely in composition, e.g.

  • Some models (e.g. PDBDEV_00000018) are atomic and look much like a traditional "PDB" x-ray structure.
  • Some models (e.g. PDBDEV_00000002) contain multiple states that have different sequences (this particular case contains exosome in nucleus-localized and cytoplasm-localized forms).
  • Some models are not of proteins at all - e.g. PDBDEV_00000008 is a model of chromatin, part of the fly genome.
  • Some models contain multiple representations - e.g. PDBDEV_00000012, a model of the yeast nuclear pore complex, contains both models of the scaffold (ring) and the flexible FG repeat regions which occupy the center of the pore.

VMD is also reportedly working on support in their forthcoming 1.9.4 release.

An example view of the deposition, as rendered by ChimeraX, is shown below. The EM map is shown as a mesh, the DSS cross-links as green dashed lines, and the BS3 cross-links as blue dashed lines. ChimeraX also shows the comparative models used as starting guesses for the subunit structures - toggle off the display of the result models to see them.

Web browser

mmCIF files can also be viewed in a web browser, using Mol* Viewer. (Note that it currently only shows the coordinates, not the experimental information or other metadata.)

Plain text

mmCIF files are also just plain text files, and can be viewed in any text editor (for example to check for errors, particularly for categories that ChimeraX doesn't support yet). Most of the data is stored as simple tables (look for the loop_ keyword in the file). For example, the coordinates of the coarse-grained beads are stored in the _ihm_sphere_obj_site table, the start of which in rnapolii.cif looks like:

loop_
_ihm_sphere_obj_site.ordinal_id
_ihm_sphere_obj_site.entity_id
_ihm_sphere_obj_site.seq_id_begin
_ihm_sphere_obj_site.seq_id_end
_ihm_sphere_obj_site.asym_id
_ihm_sphere_obj_site.Cartn_x
_ihm_sphere_obj_site.Cartn_y
_ihm_sphere_obj_site.Cartn_z
_ihm_sphere_obj_site.object_radius
_ihm_sphere_obj_site.rmsf
_ihm_sphere_obj_site.model_id
1 1 1 1 A 158.180 180.876 96.861 3.068 . 1
2 1 2 2 A 105.913 135.689 154.058 2.888 . 1
3 1 3 3 A 104.305 132.810 156.027 2.273 . 1
4 1 4 4 A 101.027 134.743 156.557 3.008 . 1

This simply states that a sphere representing residue 1 in chain A is centered at (158.180, 180.876, 96.861) and has radius 3.068, and so on.

As Python objects

We can also read in mmCIF or BinaryCIF files using python-ihm's ihm.reader module and explore the data using the python-ihm API. For example, we can read the structure we just created:

import ihm.reader
with open('rnapolii.cif') as fh:
s, = ihm.reader.read(fh)
print(s.title, s.restraints, s.ensembles, s.state_groups)

Similarly we could explore any integrative model deposited in PDB-Dev. For example we can look at PDB-Dev #14, a HADDOCK model:

try:
import urllib.request as urllib2 # python3
except ImportError:
import urllib2 # python2
fh = urllib2.urlopen('https://pdb-dev.wwpdb.org/cif/PDBDEV_00000014.cif')
s, = ihm.reader.read(fh)
print(s.title, s.restraints, s.ensembles, s.state_groups)

Further reading

More tutorials on using IMP are available at the IMP web site.

CC BY-SA logo