IMP Manual  for IMP version 2.6.0
rnapolii_2.md
Stage 2 - Representation of subunits and translation of the data into spatial restraints {#rnapolii_2}
========================================================================================

In this stage, we first define a representation of the system and then convert the data into spatial restraints. Both steps are performed by the script `rnapolii/modeling/modeling.py`, which uses the
[topology file](@ref IMP::pmi::topology::TopologyReader),
`topology.txt`, to define the system components and their representation
parameters.

### Setting up Model Representation in IMP

**Representation**
Very generally, the *representation* of a system is defined by all the variables that need to be determined based on input information, including the assignment of the system components to geometric objects (e.g. points, spheres, ellipsoids, and 3D Gaussian density functions).

Our RNA Pol II representation employs *spherical beads* of varying sizes and *3D Gaussians*, which coarsen domains of the complex using several resolution scales simultaneously. The *spatial restraints* will be applied to individual resolution scales as appropriate.

Beads and Gaussians of a given domain are arranged into either a rigid body or a flexible string, based on the crystallographic structures. In a *rigid body*, all the beads and the Gaussians of a given domain have their relative distances constrained during configurational sampling, while in a *flexible string* the beads and the Gaussians are restrained by the sequence connectivity.
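
To make the rigid-body idea concrete, the following minimal sketch applies one rotation and one translation to a set of bead centers and checks that all pairwise distances are unchanged. It is plain Python and does not use IMP; the function names `rigid_transform` and `pairwise_distances` and the bead coordinates are hypothetical, chosen only for illustration.

```python
import math


def rigid_transform(points, angle, shift):
    """Rotate points about the z axis by angle (radians), then translate by shift."""
    c, s = math.cos(angle), math.sin(angle)
    moved = []
    for x, y, z in points:
        moved.append((c * x - s * y + shift[0],
                      s * x + c * y + shift[1],
                      z + shift[2]))
    return moved


def pairwise_distances(points):
    """Return the distances between all unordered pairs of points."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]


# Hypothetical bead centers (in angstroms) belonging to one rigid body
beads = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 1.0)]
moved = rigid_transform(beads, angle=0.04, shift=(2.0, 0.0, -1.0))

# The whole body moved, but its internal geometry is preserved
same = all(math.isclose(a, b, abs_tol=1e-9)
           for a, b in zip(pairwise_distances(beads), pairwise_distances(moved)))
print(same)
```

In a flexible string, by contrast, each bead would be moved individually and only the distances between sequence neighbors would be restrained.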

<img src="rnapolii_Multi-scale_representation.png" width="600px" />
_Multi-scale representation of Rpb1 subunit of RNA Pol II_

The Gaussian mixture model (GMM) of a subunit is the set of all 3D Gaussians used to represent it; it is used to calculate the EM score. The GMM of a subunit can be calculated automatically via the
[topology file](@ref IMP::pmi::topology::TopologyReader).
For the purposes of this tutorial, we have already created these for Rpb4 and Rpb7 and placed them in the `rnapolii/data` directory in their respective `.mrc` and `.txt` files.

**Dissecting the script**
The script `rnapolii/modeling/modeling.py` sets up the representation of the system and the restraints. (Subsequently it also performs [sampling](@ref rnapolii_3), but more on that later.)

**Header**
The first part of the script defines the files used in model building and restraint generation.

\code{.py}
#---------------------------
# Set up Input Files
#---------------------------
datadirectory = "../data/"
topology_file = datadirectory+"topology.txt"
target_gmm_file = datadirectory+'emd_1883.map.mrc.gmm.50.txt'
\endcode

The first section defines where input files are located. The
[topology file](@ref IMP::pmi::topology::TopologyReader)
defines how the system components are structurally represented. `target_gmm_file` stores the EM map for the entire complex, which has already been converted into a Gaussian mixture model.

\code{.py}
#--------------------------
# Set MC Sampling Parameters
#--------------------------
num_frames = 20000
num_mc_steps = 10
\endcode

The MC sampling parameters control how much sampling is done. `num_frames` is the number of frames (model structures) that will be output during sampling, and `num_mc_steps` is the number of Monte Carlo steps between output frames. This setup therefore encompasses 200000 MC steps in total.
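
The total follows directly from the two parameters. The short check below reproduces the arithmetic; the name `total_mc_steps` is introduced here only for illustration and does not appear in `modeling.py`.

```python
num_frames = 20000   # frames (model structures) written during sampling
num_mc_steps = 10    # Monte Carlo steps between consecutive frames

# total_mc_steps is a hypothetical name, not a variable of the script
total_mc_steps = num_frames * num_mc_steps
print(total_mc_steps)  # 200000
```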

\code{.py}
#--------------------------
# Create movers
#--------------------------

# rigid body movement params
rb_max_trans = 2.00
rb_max_rot = 0.04

# flexible bead movement
bead_max_trans = 3.00

rigid_bodies = [["Rpb4"],
                ["Rpb7"]]
super_rigid_bodies = [["Rpb4","Rpb7"]]
chain_of_super_rigid_bodies = [["Rpb4"],
                               ["Rpb7"]]
\endcode

The movers section defines movement parameters and hierarchies of movers. `rb_max_trans` and `bead_max_trans` set the maximum translation (in angstroms) allowed per MC step. `rb_max_rot` is the maximum rotation for rigid bodies in radians.
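
To give a feel for what these limits mean, here is a minimal sketch of a bounded random move proposal in plain Python. It is an assumption-laden illustration, not IMP's mover implementation: the helper names `propose_translation` and `propose_rotation_angle` are hypothetical, and the choice of a uniform distribution inside a ball is ours.

```python
import math
import random


def propose_translation(max_trans, rng):
    """Draw a random displacement whose length is at most max_trans (angstroms)."""
    while True:
        # Rejection-sample a point uniformly inside the unit ball
        v = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        if sum(c * c for c in v) <= 1.0:
            return [max_trans * c for c in v]


def propose_rotation_angle(max_rot, rng):
    """Draw a random rotation angle in [-max_rot, max_rot] (radians)."""
    return rng.uniform(-max_rot, max_rot)


rng = random.Random(0)                     # fixed seed for reproducibility
shift = propose_translation(2.00, rng)     # uses the rb_max_trans value
angle = propose_rotation_angle(0.04, rng)  # uses the rb_max_rot value

shift_length = math.sqrt(sum(c * c for c in shift))
print(shift_length <= 2.00, abs(angle) <= 0.04)

# For orientation: the rotation limit expressed in degrees
print(round(math.degrees(0.04), 2))  # 2.29
```

Small limits like these mean each move perturbs the system only slightly, which tends to keep the acceptance rate of Monte Carlo moves reasonable.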

`rigid_bodies` is a Python list defining the components that will be moved as rigid bodies. Components must be identified by the _domain name_ (as given in the topology file).

`super_rigid_bodies` defines sets of rigid bodies and beads that will move together in an additional Monte Carlo move.

`chain_of_super_rigid_bodies` sets additional Monte Carlo movers along the connectivity chain of a subunit. It groups sequence-connected rigid domains and/or beads into overlapping pairs and triplets. Each of these groups will be moved rigidly. This mover helps to sample complex topologies, made of several rigid bodies connected by flexible linkers, more efficiently.
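
The grouping into overlapping pairs and triplets can be sketched as follows. This is a plain-Python illustration of the idea described above, not the code PMI uses; the function `overlapping_groups` and the domain names `d1` to `d4` are hypothetical.

```python
def overlapping_groups(chain, sizes=(2, 3)):
    """Return all runs of consecutive chain elements of the given sizes."""
    groups = []
    for size in sizes:
        for start in range(len(chain) - size + 1):
            groups.append(chain[start:start + size])
    return groups


# Hypothetical sequence-connected domains of one subunit
chain = ["d1", "d2", "d3", "d4"]
groups = overlapping_groups(chain)
for group in groups:
    print(group)
# ['d1', 'd2']
# ['d2', 'd3']
# ['d3', 'd4']
# ['d1', 'd2', 'd3']
# ['d2', 'd3', 'd4']
```

Because neighboring groups share members, a move applied to one group can be followed by a move of an overlapping group, which lets larger rearrangements propagate along the chain.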

**Build the Model Representation**
The next bit of code takes the input files and creates an
[IMP hierarchy](@ref IMP::atom::Hierarchy) based on the given
topology, rigid body lists and data files:

\code{.py}
# Initialize model
m = IMP.Model()

# Create list of components from topology file
topology = IMP.pmi.topology.TopologyReader(topology_file)
domains = topology.component_list

bm = IMP.pmi.macros.BuildModel(m,
                               component_topologies=domains,
                               list_of_rigid_bodies=rigid_bodies,
                               list_of_super_rigid_bodies=super_rigid_bodies,
                               chain_of_super_rigid_bodies=chain_of_super_rigid_bodies)

representation = bm.get_representation()
\endcode

The [representation](@ref IMP::pmi::representation::Representation)
object returned holds a list of molecular hierarchies that define the model; it is passed to subsequent functions.

\code{.py}
# Randomize the initial configuration before sampling
representation.shuffle_configuration(50)
\endcode

This line randomizes the initial configuration to remove any bias from the starting configuration read from the input files. Since each subunit is composed of rigid bodies (i.e., beads constrained in a structure) and flexible beads, the configuration of the system is initialized by placing each rigid body and each flexible bead randomly in a box with a side of 50 Angstroms, far enough from each other to prevent any steric clashes. The rigid bodies are also randomly rotated.
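
The placement step can be pictured with the following plain-Python sketch, which draws random centers inside a cubic box and rejects candidates that come too close to an already placed center. It is not how `shuffle_configuration` is implemented; the function `shuffle_positions`, the box centered on the origin, and the 10 angstrom minimum separation are all assumptions made for illustration.

```python
import math
import random


def shuffle_positions(n_bodies, box_side, min_separation, rng, max_tries=10000):
    """Place n_bodies random centers in a cube, at least min_separation apart."""
    half = box_side / 2.0
    positions = []
    tries = 0
    while len(positions) < n_bodies:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all bodies without clashes")
        candidate = tuple(rng.uniform(-half, half) for _ in range(3))
        # Keep the candidate only if it is far enough from every placed center
        if all(math.dist(candidate, p) >= min_separation for p in positions):
            positions.append(candidate)
    return positions


rng = random.Random(42)  # fixed seed for reproducibility
centers = shuffle_positions(n_bodies=5, box_side=50.0, min_separation=10.0, rng=rng)
print(len(centers))  # 5
```

A random rotation of each rigid body would be applied on top of this; it is omitted here to keep the sketch short.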

**Additional parameters and lists**
\code{.py}
# Add default mover parameters to simulation
representation.set_rigid_bodies_max_rot(rb_max_rot)
representation.set_floppy_bodies_max_trans(bead_max_trans)
representation.set_rigid_bodies_max_trans(rb_max_trans)
\endcode

Here, we set the maximum rotation and translation for rigid bodies and "floppy bodies" (which are our flexible strings of beads).

\code{.py}
outputobjects = [] # reporter objects (for stat files)
sampleobjects = [] # sampling objects

# Add the representation to the output and sample lists
outputobjects.append(representation)
sampleobjects.append(representation)
\endcode

We would like both to sample the structural model and to output information about it. Therefore, the representation must be added to both the output and sample lists.

---

### Set up Restraints

After defining the representation of the model, we build the restraints by which the individual structural models will be scored based on the input data.

**Excluded Volume Restraint**
\code{.py}
ev = IMP.pmi.restraints.stereochemistry.ExcludedVolumeSphere(
    representation, resolution=20)
ev.add_to_model()
outputobjects.append(ev)
\endcode

The excluded volume restraint is calculated at resolution 20 (20 residues per bead).
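
As a rough picture of what "20 residues per bead" means, the sketch below splits a residue range into consecutive beads of at most 20 residues. How PMI actually partitions residues at a given resolution is not shown in this tutorial, so this is only an assumed scheme; the function `beads_for_range` and the residue range 1 to 50 are hypothetical.

```python
def beads_for_range(first_residue, last_residue, residues_per_bead=20):
    """Split an inclusive residue range into consecutive (start, end) beads."""
    beads = []
    start = first_residue
    while start <= last_residue:
        end = min(start + residues_per_bead - 1, last_residue)
        beads.append((start, end))
        start = end + 1
    return beads


# Hypothetical 50-residue segment coarse-grained at resolution 20
print(beads_for_range(1, 50))  # [(1, 20), (21, 40), (41, 50)]
```

Evaluating excluded volume on these few large beads rather than on every residue keeps the number of pairwise checks small.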

**Crosslinks**

A crosslinking restraint is implemented as a distance restraint between two residues. The two residues are each defined by the protein (component) name and the residue number. The script here extracts the correct four columns that provide this information from the [input data file](@ref rnapolii_1).

\code{.py}
columnmap={}
columnmap["Protein1"]="pep1.accession"
columnmap["Protein2"]="pep2.accession"
columnmap["Residue1"]="pep1.xlinked_aa"
columnmap["Residue2"]="pep2.xlinked_aa"
columnmap["IDScore"]=None
...
xl1 = IMP.pmi.restraints.crosslinking.ISDCrossLinkMS(representation,
                                                     datadirectory+'polii_xlinks.csv',
                                                     length=21.0,
                                                     slope=0.01,
                                                     columnmapping=columnmap,
                                                     resolution=1.0,
                                                     label="Trnka",
                                                     csvfile=True)

xl1.add_to_model()

# Since we are sampling psi, crosslink restraint must be added to sampleobjects
sampleobjects.append(xl1)
outputobjects.append(xl1)
\endcode

An object `xl1` for this crosslinking restraint is created and then added to the model. Its main parameters are:
* `length`: The maximum length of the crosslink
* `slope`: Slope of the linear energy function added to the sigmoidal restraint
* `columnmapping`: Defines the structure of the input file
* `resolution`: The resolution at which the restraint is evaluated. 1 = residue level
* `label`: A label for this set of crosslinks, helpful to identify them later in the stat file
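
The role of the column mapping can be illustrated with the standard library alone. The sketch below reads a small in-memory CSV and uses the mapping to pull out the four fields that define each crosslink. It does not reproduce how `ISDCrossLinkMS` parses its input: the function `read_crosslinks` is hypothetical, and the two data rows are made up for illustration rather than taken from `polii_xlinks.csv`.

```python
import csv
import io

# Same keys and column names as in the script (IDScore omitted for brevity)
columnmap = {
    "Protein1": "pep1.accession",
    "Protein2": "pep2.accession",
    "Residue1": "pep1.xlinked_aa",
    "Residue2": "pep2.xlinked_aa",
}

# Made-up rows for illustration only; not real crosslink data
sample_csv = io.StringIO(
    "pep1.accession,pep1.xlinked_aa,pep2.accession,pep2.xlinked_aa,note\n"
    "Rpb4,10,Rpb7,25,example\n"
    "Rpb7,40,Rpb7,80,example\n"
)


def read_crosslinks(handle, columnmap):
    """Return (protein1, residue1, protein2, residue2) tuples from a CSV handle."""
    links = []
    for row in csv.DictReader(handle):
        links.append((row[columnmap["Protein1"]],
                      int(row[columnmap["Residue1"]]),
                      row[columnmap["Protein2"]],
                      int(row[columnmap["Residue2"]])))
    return links


crosslinks = read_crosslinks(sample_csv, columnmap)
print(crosslinks)  # [('Rpb4', 10, 'Rpb7', 25), ('Rpb7', 40, 'Rpb7', 80)]
```

Keeping the column names in a mapping means that a data file with different headers can be used by changing only the dictionary values.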

\code{.py}
# optimize a bit before adding the EM restraint
representation.optimize_floppy_bodies(10)
\endcode

All flexible beads are initially optimized for 10 Monte Carlo steps, keeping the rigid bodies fixed in space.

**EM Restraint**

\code{.py}
em_components = bm.get_density_hierarchies([t.domain_name for t in domains])
gemt = IMP.pmi.restraints.em.GaussianEMRestraint(em_components,
                                                 target_gmm_file,
                                                 scale_target_to_mass=True,
                                                 slope=0.000001,
                                                 weight=100.0)
gemt.add_to_model()
outputobjects.append(gemt)
\endcode

The GaussianEMRestraint uses a density overlap function to compare the model to the data. First, the EM map is approximated with a Gaussian mixture model (done separately). Second, the components of the model are represented with Gaussians (forming the model GMM). The restraint takes the following parameters:
* `scale_target_to_mass`: ensures the total mass of model and map are identical
* `slope`: nudges the model closer to the map when far away
* `weight`: heuristic, needed to calibrate the EM restraint with the other terms

The restraint is then added to the model and to the output objects. Nothing is being sampled, so it does not need to be added to the sample objects.
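
The idea of a density overlap between two GMMs can be sketched with isotropic Gaussians, for which the overlap integral has a simple closed form. This is only a simplified illustration under our own assumptions (isotropic components, a plain sum over all pairs); it is not the scoring function of `GaussianEMRestraint`, and the names `gaussian_overlap` and `gmm_overlap` and all numbers are hypothetical.

```python
import math


def gaussian_overlap(center_a, sigma_a, weight_a, center_b, sigma_b, weight_b):
    """Overlap integral of two weighted, normalized isotropic 3D Gaussians."""
    var = sigma_a ** 2 + sigma_b ** 2
    dist_sq = sum((a - b) ** 2 for a, b in zip(center_a, center_b))
    norm = (2.0 * math.pi * var) ** -1.5
    return weight_a * weight_b * norm * math.exp(-dist_sq / (2.0 * var))


def gmm_overlap(model_gmm, map_gmm):
    """Sum the overlaps of every model Gaussian with every map Gaussian."""
    return sum(gaussian_overlap(*m, *d) for m in model_gmm for d in map_gmm)


# Each Gaussian is (center, sigma, weight); all values are made up
map_gmm = [((0.0, 0.0, 0.0), 5.0, 1.0), ((10.0, 0.0, 0.0), 5.0, 1.0)]
model_fitted = [((1.0, 0.0, 0.0), 5.0, 1.0), ((9.0, 0.0, 0.0), 5.0, 1.0)]
model_displaced = [((41.0, 0.0, 0.0), 5.0, 1.0), ((49.0, 0.0, 0.0), 5.0, 1.0)]

# A model sitting in the density overlaps far more than a displaced one
print(gmm_overlap(model_fitted, map_gmm) > gmm_overlap(model_displaced, map_gmm))
```

The overlap decays quickly with distance, which is consistent with the tutorial's description of `slope` as a term that nudges the model toward the map when it is far away.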

---

Completion of these steps sets the energy function.
The next step is \ref rnapolii_3.