1 """@namespace IMP.pmi1.topology
2 Set of python classes to create a multi-state, multi-resolution IMP hierarchy.
3 * Start by creating a System with `model = IMP.Model(); s = IMP.pmi1.topology.System(model)`. The System will store all the states.
4 * Then call System.create_state(). You can easily create a multistate system by calling this function multiples times.
5 * For each State, call State.create_molecule() to add a Molecule (a uniquely named polymer). This function returns the Molecule object which can be passed to various PMI functions.
6 * Some useful functions to help you set up your Molecules:
7 * Access the sequence residues with slicing (Molecule[a:b]) or functions like Molecule.get_atomic_residues() and Molecule.get_non_atomic_residues(). These functions all return python sets for easy set arithmetic using & (and), | (or), - (difference)
8 * Molecule.add_structure() to add structural information from a PDB file.
9 * Molecule.add_representation() to create a representation unit - here you can choose bead resolutions as well as alternate representations like densities or ideal helices.
10 * Molecule.create_clone() lets you set up a molecule with identical representations, just a different chain ID. Use Molecule.create_copy() if you want a molecule with the same sequence but that allows custom representations.
11 * Once data has been added and representations chosen, call System.build() to create a canonical IMP hierarchy.
12 * Following hierarchy construction, setup rigid bodies, flexible beads, etc in IMP::pmi1::dof.
13 * Check your representation with a nice printout: IMP::atom::show_with_representation()
14 See a [comprehensive example](https://integrativemodeling.org/nightly/doc/ref/pmi_2multiscale_8py-example.html) for using these classes.
16 Alternatively one can construct the entire topology and degrees of freedom via formatted text file with TopologyReader and IMP::pmi1::macros::BuildSystem(). This is used in the [PMI tutorial](@ref rnapolii_stalk).
17 Note that this only allows a limited set of the full options available to PMI users (rigid bodies only, fixed resolutions).
20 from __future__
import print_function
28 from collections
import defaultdict
29 from bisect
import bisect_left
30 from math
import pi,cos,sin
31 from operator
import itemgetter
33 def _build_ideal_helix(model, residues, coord_finder):
34 """Creates an ideal helix from the specified residue range
35 Residues MUST be contiguous.
36 This function actually adds them to the TempResidue hierarchy
41 for n, tempres
in enumerate(residues):
42 if tempres.get_has_structure():
43 raise Exception(
"You tried to build ideal_helix for a residue "
44 "that already has structure:",tempres)
45 if n>0
and (
not tempres.get_index()==prev_idx+1):
46 raise Exception(
"Passed non-contiguous segment to build_ideal_helix for",tempres.get_molecule())
50 rp.set_name(
"Residue_%i" % tempres.get_index())
58 x = 2.3 * cos(n * 2 * pi / 3.6)
59 y = 2.3 * sin(n * 2 * pi / 3.6)
60 z = 6.2 / 3.6 / 2 * n * 2 * pi / 3.6
68 tempres.set_structure(this_res)
69 created_hiers.append(this_res)
70 prev_idx = tempres.get_index()
71 coord_finder.add_residues(created_hiers)
75 """A dictionary-like wrapper for reading and storing sequence data"""
76 def __init__(self,fasta_fn,name_map=None):
77 """read a fasta file and extract all the requested sequences
78 @param fasta_fn sequence file
79 @param name_map dictionary mapping the fasta name to final stored name
81 self.sequences = IMP.pmi1.tools.OrderedDict()
82 self.read_sequences(fasta_fn,name_map)
84 return len(self.sequences)
85 def __contains__(self,x):
86 return x
in self.sequences
87 def __getitem__(self,key):
90 allseqs = list(self.sequences.keys())
91 return self.sequences[allseqs[key]]
93 raise Exception(
"You tried to access sequence num",key,
"but there's only",len(self.sequences.keys()))
95 return self.sequences[key]
97 return self.sequences.__iter__()
100 for s
in self.sequences:
101 ret +=
'%s\t%s\n'%(s,self.sequences[s])
103 def read_sequences(self,fasta_fn,name_map=None):
106 with open(fasta_fn,
'r') as fh:
107 for (num, line)
in enumerate(fh):
108 if line.startswith(
'>'):
110 self.sequences[code] = seq.strip(
'*')
111 code = line.rstrip()[1:]
112 if name_map
is not None:
114 code = name_map[code]
123 "Found FASTA sequence before first header at line %d: %s" % (num + 1, line))
126 self.sequences[code] = seq.strip(
'*')
130 """Automatically setup Sytem and Degrees of Freedom with a formatted text file.
131 The file is read in and each part of the topology is stored as a
132 ComponentTopology object for input into IMP::pmi1::macros::BuildSystem.
133 The topology file should be in a simple pipe-delimited format:
135 |molecule_name|color|fasta_fn|fasta_id|pdb_fn|chain|residue_range|pdb_offset|bead_size|em_residues_per_gaussian|rigid_body|super_rigid_body|chain_of_super_rigid_bodies|flags|
136 |Rpb1 |blue |1WCM.fasta|1WCM:A|1WCM.pdb|A|1,1140 |0|10|0|1|1,3|1||
137 |Rpb1 |blue |1WCM.fasta|1WCM:A|1WCM.pdb|A|1141,1274|0|10|0|2|1,3|1||
138 |Rpb1 |blue |1WCM.fasta|1WCM:A|1WCM.pdb|A|1275,END |0|10|0|3|1,3|1||
139 |Rpb2 |red |1WCM.fasta|1WCM:B|1WCM.pdb|B|all |0|10|0|4|2,3|2||
140 |Rpb2.1 |green |1WCM.fasta|1WCM:B|1WCM.pdb|B|all |0|10|0|4|2,3|2||
144 These are the fields you can enter:
145 - `component_name`: Name of the component (chain). Serves as the parent
146 hierarchy for this structure.
147 - `color`: The color used in the output RMF file. Uses chimera names or R,G,B values
148 - `fasta_fn`: Name of FASTA file containing this component.
149 - `fasta_id`: String found in FASTA sequence header line. The sequence read
150 from the file is assumed to be a protein sequence. If it should instead
151 be treated as RNA or DNA, add an ',RNA' or ',DNA' suffix. For example,
152 a `fasta_id` of 'myseq,RNA' will read the sequence 'myseq' from the
153 FASTA file and treat it as RNA.
154 - `pdb_fn`: Name of PDB file with coordinates (if available).
155 If left empty, will set up as BEADS (you can also specify "BEADS")
156 Can also write "IDEAL_HELIX".
157 - `chain`: Chain ID of this domain in the PDB file.
158 - `residue_range`: Comma delimited pair defining range.
159 Can leave empty or use 'all' for entire sequence from PDB file.
160 The second item in the pair can be END to select the last residue in the
162 - `pdb_offset`: Offset to sync PDB residue numbering with FASTA numbering.
163 - `bead_size`: The size (in residues) of beads used to model areas not
164 covered by PDB coordinates. These will be automatically built.
165 - `em_residues`: The number of Gaussians used to model the electron
166 density of this domain. Set to zero if no EM fitting will be done.
167 The GMM files will be written to <gmm_dir>/<component_name>_<em_res>.txt (and .mrc)
168 - `rigid_body`: Leave empty if this object is not in a rigid body.
169 Otherwise, this is a number corresponding to the rigid body containing
170 this object. The number itself is just used for grouping things.
171 - `super_rigid_body`: Like a rigid_body, except things are only occasionally rigid
172 - `chain_of_super_rigid_bodies` For a polymer, create SRBs from groups.
173 - `flags` additional flags for advanced options
174 \note All filenames are relative to the paths specified in the constructor.
183 @param topology_file Pipe-delimited file specifying the topology
184 @param pdb_dir Relative path to the pdb directory
185 @param fasta_dir Relative path to the fasta directory
186 @param gmm_dir Relative path to the GMM directory
188 self.topology_file = topology_file
189 self.molecules = IMP.pmi1.tools.OrderedDict()
190 self.pdb_dir = pdb_dir
191 self.fasta_dir = fasta_dir
192 self.gmm_dir = gmm_dir
193 self._components = self.
read(topology_file)
197 "Use 'get_components()' instead of 'component_list'.")
198 def __get_component_list(self):
return self._components
199 component_list = property(__get_component_list)
201 def write_topology_file(self,outfile):
202 with open(outfile,
"w")
as f:
203 f.write(
"|molecule_name|color|fasta_fn|fasta_id|pdb_fn|chain|"
204 "residue_range|pdb_offset|bead_size|em_residues_per_gaussian|"
205 "rigid_body|super_rigid_body|chain_of_super_rigid_bodies|\n")
206 for c
in self._components:
207 output = c.get_str()+
'\n'
212 """ Return list of ComponentTopologies for selected components
213 @param topology_list List of indices to return"""
214 if topology_list ==
"all":
215 topologies = self._components
218 for i
in topology_list:
219 topologies.append(self._components[i])
222 def get_molecules(self):
223 return self.molecules
225 def read(self, topology_file, append=False):
226 """Read system components from topology file. append=False will erase
227 current topology and overwrite with new
230 is_directories =
False
235 with open(topology_file)
as infile:
237 if line.lstrip()==
"" or line[0]==
"#":
239 elif line.split(
'|')[1].strip()
in (
"molecule_name"):
241 is_directories =
False
244 elif line.split(
'|')[1] ==
"component_name":
247 "Old-style topology format (using "
248 "|component_name|) is deprecated. Please switch to "
249 "the new-style format (using |molecule_name|)\n")
251 is_directories =
False
253 elif line.split(
'|')[1] ==
"directories":
255 "Setting directories in the topology file "
256 "is deprecated. Please do so through the "
257 "TopologyReader constructor. Note that new-style "
258 "paths are relative to the current working "
259 "directory, not the topology file.\n")
260 is_directories =
True
262 fields = line.split(
'|')
263 setattr(self, fields[1],
266 new_component = self._parse_line(line, linenum, old_format)
267 self._components.append(new_component)
269 return self._components
271 def _parse_line(self, component_line, linenum, old_format):
272 """Parse a line of topology values and matches them to their key.
273 Checks each value for correct syntax
274 Returns a list of Component objects
278 values = [s.strip()
for s
in component_line.split(
'|')]
283 c.molname = values[1]
285 c._domain_name = values[2]
288 names = values[1].split(
'.')
294 c.copyname = names[1]
297 c.copyname = names[1]
298 errors.append(
"Molecule name should be <molecule.copyID>")
299 errors.append(
"For component %s line %d " % (c.molname,linenum))
300 c._domain_name = c.molname +
'.' + c.copyname
301 colorfields = values[2].split(
',')
302 if len(colorfields)==3:
303 c.color = [float(x)
for x
in colorfields]
304 if any([x>1
for x
in c.color]):
305 c.color=[x/255
for x
in c.color]
308 c._orig_fasta_file = values[3]
309 c.fasta_file = values[3]
310 fasta_field = values[4].split(
",")
311 c.fasta_id = fasta_field[0]
313 if len(fasta_field) > 1:
314 c.fasta_flag = fasta_field[1]
315 c._orig_pdb_input = values[5]
316 pdb_input = values[5]
317 tmp_chain = values[6]
320 bead_size = values[9]
323 rbs = srbs = csrbs =
''
329 if c.molname
not in self.molecules:
330 self.molecules[c.molname] = _TempMolecule(c)
333 c._orig_fasta_file = self.molecules[c.molname].orig_component.fasta_file
334 c.fasta_id = self.molecules[c.molname].orig_component.fasta_id
335 self.molecules[c.molname].add_component(c,c.copyname)
338 c.fasta_file = os.path.join(self.fasta_dir,c._orig_fasta_file)
340 errors.append(
"PDB must have BEADS, IDEAL_HELIX, or filename")
341 errors.append(
"For component %s line %d is not correct"
342 "|%s| was given." % (c.molname,linenum,pdb_input))
343 elif pdb_input
in (
"IDEAL_HELIX",
"BEADS"):
344 c.pdb_file = pdb_input
346 c.pdb_file = os.path.join(self.pdb_dir,pdb_input)
349 if len(tmp_chain)==1
or len(tmp_chain)==2:
352 errors.append(
"PDB Chain identifier must be one or two characters.")
353 errors.append(
"For component %s line %d is not correct"
354 "|%s| was given." % (c.molname,linenum,tmp_chain))
358 if rr.strip()==
'all' or str(rr)==
"":
359 c.residue_range =
None
360 elif len(rr.split(
','))==2
and self._is_int(rr.split(
',')[0])
and (self._is_int(rr.split(
',')[1])
or rr.split(
',')[1] ==
'END'):
362 c.residue_range = (int(rr.split(
',')[0]), rr.split(
',')[1])
363 if c.residue_range[1] !=
'END':
364 c.residue_range = (c.residue_range[0], int(c.residue_range[1]))
366 if old_format
and c.residue_range[1] == -1:
367 c.residue_range = (c.residue_range[0],
'END')
369 errors.append(
"Residue Range format for component %s line %d is not correct" % (c.molname, linenum))
370 errors.append(
"Correct syntax is two comma separated integers: |start_res, end_res|. end_res can also be END to select the last residue in the chain. |%s| was given." % rr)
371 errors.append(
"To select all residues, indicate |\"all\"|")
374 if self._is_int(offset):
375 c.pdb_offset=int(offset)
379 errors.append(
"PDB Offset format for component %s line %d is not correct" % (c.molname, linenum))
380 errors.append(
"The value must be a single integer. |%s| was given." % offset)
383 if self._is_int(bead_size):
384 c.bead_size=int(bead_size)
385 elif len(bead_size)==0:
388 errors.append(
"Bead Size format for component %s line %d is not correct" % (c.molname, linenum))
389 errors.append(
"The value must be a single integer. |%s| was given." % bead_size)
392 if self._is_int(emg):
394 c.density_prefix = os.path.join(self.gmm_dir,c.get_unique_name())
395 c.gmm_file = c.density_prefix+
'.txt'
396 c.mrc_file = c.density_prefix+
'.gmm'
398 c.em_residues_per_gaussian=int(emg)
400 c.em_residues_per_gaussian = 0
402 c.em_residues_per_gaussian = 0
404 errors.append(
"em_residues_per_gaussian format for component "
405 "%s line %d is not correct" % (c.molname, linenum))
406 errors.append(
"The value must be a single integer. |%s| was given." % emg)
410 if not self._is_int(rbs):
411 errors.append(
"rigid bodies format for component "
412 "%s line %d is not correct" % (c.molname, linenum))
413 errors.append(
"Each RB must be a single integer, or empty. "
414 "|%s| was given." % rbs)
415 c.rigid_body = int(rbs)
419 srbs = srbs.split(
',')
421 if not self._is_int(i):
422 errors.append(
"super rigid bodies format for component "
423 "%s line %d is not correct" % (c.molname, linenum))
424 errors.append(
"Each SRB must be a single integer. |%s| was given." % srbs)
425 c.super_rigid_bodies = srbs
429 if not self._is_int(csrbs):
430 errors.append(
"em_residues_per_gaussian format for component "
431 "%s line %d is not correct" % (c.molname, linenum))
432 errors.append(
"Each CSRB must be a single integer. |%s| was given." % csrbs)
433 c.chain_of_super_rigid_bodies = csrbs
437 raise ValueError(
"Fix Topology File syntax errors and rerun: " \
444 """Change the GMM dir"""
445 self.gmm_dir = gmm_dir
446 for c
in self._components:
447 c.gmm_file = os.path.join(self.gmm_dir,c.get_unique_name()+
".txt")
448 c.mrc_file = os.path.join(self.gmm_dir,c.get_unique_name()+
".mrc")
449 print(
'new gmm',c.gmm_file)
452 """Change the PDB dir"""
453 self.pdb_dir = pdb_dir
454 for c
in self._components:
455 if not c._orig_pdb_input
in (
"",
"None",
"IDEAL_HELIX",
"BEADS"):
456 c.pdb_file = os.path.join(self.pdb_dir,c._orig_pdb_input)
459 """Change the FASTA dir"""
460 self.fasta_dir = fasta_dir
461 for c
in self._components:
462 c.fasta_file = os.path.join(self.fasta_dir,c._orig_fasta_file)
464 def _is_int(self, s):
468 return float(s).is_integer()
473 """Return list of lists of rigid bodies (as domain name)"""
474 rbl = defaultdict(list)
475 for c
in self._components:
477 rbl[c.rigid_body].append(c.get_unique_name())
481 """Return list of lists of super rigid bodies (as domain name)"""
482 rbl = defaultdict(list)
483 for c
in self._components:
484 for rbnum
in c.super_rigid_bodies:
485 rbl[rbnum].append(c.get_unique_name())
489 """Return list of lists of chains of super rigid bodies (as domain name)"""
490 rbl = defaultdict(list)
491 for c
in self._components:
492 for rbnum
in c.chain_of_super_rigid_bodies:
493 rbl[rbnum].append(c.get_unique_name())
496 class _TempMolecule(object):
497 """Store the Components and any requests for copies"""
499 self.molname = init_c.molname
502 self.add_component(init_c,init_c.copyname)
503 self.orig_copyname = init_c.copyname
504 self.orig_component = self.domains[init_c.copyname][0]
505 def add_component(self,component,copy_id):
506 self.domains[copy_id].append(component)
507 component.domainnum = len(self.domains[copy_id])-1
509 return ','.join(
'%s:%i'%(k,len(self.domains[k]))
for k
in self.domains)
511 class _Component(object):
512 """Stores the components required to build a standard IMP hierarchy
513 using IMP.pmi1.BuildModel()
519 self.fasta_file =
None
520 self._orig_fasta_file =
None
522 self.fasta_flag =
None
524 self._orig_pdb_input =
None
526 self.residue_range =
None
529 self.em_residues_per_gaussian = 0
532 self.density_prefix =
''
534 self.rigid_body =
None
535 self.super_rigid_bodies = []
536 self.chain_of_super_rigid_bodies = []
539 return ",".join(
"%s" % x
for x
in l)
542 return self.get_str()
545 return "%s.%s.%i"%(self.molname,self.copyname,self.domainnum)
548 res_range = self.residue_range
549 if self.residue_range
is None:
552 if self.copyname!=
'':
553 name +=
'.'+self.copyname
554 if self.chain
is None:
559 if isinstance(color, list):
560 color=
','.join([str(x)
for x
in color])
561 a=
'|'+
'|'.join([name,color,self._orig_fasta_file,self.fasta_id,
562 self._orig_pdb_input,chain,self._l2s(list(res_range)),
563 str(self.pdb_offset),str(self.bead_size),
564 str(self.em_residues_per_gaussian),
565 str(self.rigid_body)
if self.rigid_body
else '',
566 self._l2s(self.super_rigid_bodies),
567 self._l2s(self.chain_of_super_rigid_bodies)])+
'|'
572 def __get_name(self):
return self.molname
573 name = property(__get_name)
577 "Use 'get_unique_name()' instead of 'domain_name'.")
578 def __get_domain_name(self):
return self._domain_name
579 domain_name = property(__get_domain_name)
583 '''Extends the functionality of IMP.atom.Molecule'''
585 def __init__(self,hierarchy):
586 IMP.atom.Molecule.__init__(self,hierarchy)
595 def get_extended_name(self):
596 return self.get_name()+
"."+\
597 str(self.get_copy_index())+\
598 "."+str(self.get_state_index())
600 def get_sequence(self):
603 def get_residue_indexes(self):
606 def get_residue_segments(self):
613 s=
'PMIMoleculeHierarchy '
616 s+=
" "+
"State "+str(self.get_state_index())
617 s+=
" "+
"N residues "+str(len(self.get_sequence()))
Hierarchy get_parent() const
Get the parent particle.
std::string get_unique_name(std::string templ)
Return a unique name produced from the string.
def __init__
read a fasta file and extract all the requested sequences
def set_gmm_dir
Change the GMM dir.
static Atom setup_particle(Model *m, ParticleIndex pi, Atom other)
static XYZR setup_particle(Model *m, ParticleIndex pi)
def get_components
Return list of ComponentTopologies for selected components.
Legacy PMI1 module to represent, score, sample and analyze models.
void handle_use_deprecated(std::string message)
Break in this method in gdb to find deprecated uses at runtime.
Extends the functionality of IMP.atom.Molecule.
static Residue setup_particle(Model *m, ParticleIndex pi, ResidueType t, int index, int insertion_code)
A decorator for keeping track of copies of a molecule.
def get_super_rigid_bodies
Return list of lists of super rigid bodies (as domain name)
A dictionary-like wrapper for reading and storing sequence data.
def deprecated_method
Python decorator to mark a method as deprecated.
def set_fasta_dir
Change the FASTA dir.
std::string get_relative_path(std::string base, std::string relative)
Return a path to a file relative to another file.
int get_state_index(Hierarchy h)
Walk up the hierarchy to find the current state.
def read
Read system components from topology file.
General purpose algebraic and geometric methods that are expected to be used by a wide variety of IMP...
The general base class for IMP exceptions.
Associate an integer "state" index with a hierarchy node.
def set_pdb_dir
Change the PDB dir.
Class to handle individual particles of a Model object.
def get_chains_of_super_rigid_bodies
Return list of lists of chains of super rigid bodies (as domain name)
def get_rigid_bodies
Return list of lists of rigid bodies (as domain name)
Store info for a chain of a protein.
int get_copy_index(Hierarchy h)
Walk up the hierarchy to find the current copy index.
Functionality for loading, creating, manipulating and scoring atomic structures.
std::string get_chain_id(Hierarchy h)
Walk up the hierarchy to determine the chain id.
A decorator for a molecule.
Automatically setup Sytem and Degrees of Freedom with a formatted text file.