RMF
The RMF file format (short for Rich Molecular Format) stores hierarchical data about a molecular structure in a file or buffer in memory. This data can include
For example, a protein can be stored as a hierarchy where the root is the whole molecule. The root has one node per chain, each chain has one node per residue, and each residue has one node per atom. Each node in the hierarchy has the appropriate data stored along with it: a chain node has the chain identifier, a residue node has the residue type, and atom nodes have coordinates, atom type and element. Bonds between atoms or coarser elements are stored explicitly, since having to consult external databases to generate bonds is the source of much of the difficulty of dealing with other formats such as PDB.
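The chain/residue/atom layout described above can be sketched with a small stand-alone model. This is only an illustration in plain Python: the `Node` class, the `atoms` helper and the attribute names (`chain_id`, `residue_type`, `coordinates`, `element`) are hypothetical and are not the RMF API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy hierarchy (hypothetical, not the RMF API)."""
    name: str
    data: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, name, **data):
        # Create a child node carrying its own data and attach it.
        child = Node(name, data)
        self.children.append(child)
        return child


def atoms(node):
    """Yield every node below `node` (inclusive) that carries coordinates."""
    if "coordinates" in node.data:
        yield node
    for child in node.children:
        yield from atoms(child)


# root (molecule) -> chain -> residue -> atom
root = Node("protein")
chain = root.add_child("chain A", chain_id="A")
gly = chain.add_child("residue 1", residue_type="GLY")
n = gly.add_child("N", coordinates=(0.0, 0.0, 0.0), element="N")
ca = gly.add_child("CA", coordinates=(1.5, 0.0, 0.0), element="C")

# Bonds are stored explicitly as pairs of nodes rather than being
# regenerated from an external database.
bonds = [(n, ca)]

atom_names = [a.name for a in atoms(root)]
```

Because the bonds are part of the stored data, a reader of such a structure never has to guess connectivity from residue names.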
The file might also include a node for storing the r-value for a FRET measurement between two residues (with links to the residues) as well as extra markers to highlight key parts of the molecule.
Multiple conformations of the hierarchy are stored as frames. Each frame has the same hierarchical structure, but some aspects of the data (e.g. coordinates) can have different values for each frame (or no value for a particular frame if they do not apply to it).
A hierarchical storage format was chosen since
In addition to structural information, the file can also store information about
Simple example:
Larger pdb:
More technically, each node in the RMF hierarchy has
One accesses nodes in the hierarchy using handles, RMF::NodeHandle and RMF::NodeConstHandle. The root handle can be fetched from the RMF::FileHandle using RMF::FileHandle::get_root_node().
Each attribute is identified by a key (eg RMF::IntKey) and is defined by a unique combination of
On a per RMF basis, the data associated with a given key can either have one value for each node which has that attribute, or one value per frame per node with the attribute. The methods in RMF::NodeHandle to get and set the attributes take an optional frame number.
The library provides decorators to group together and provide easier manipulation of standard attributes. Examples include RMF::Particle, RMF::decorator::Colored, RMF::decorator::Ball etc. See Decorators for more information.
The data types that can currently be stored in an RMF file are
Name | Description | C++ type | Python type |
---|---|---|---|
RMF::Float | a floating point value | float | float |
RMF::String | a UTF-8 string | std::string | str |
RMF::Int | a 64-bit integer | int | int |
RMF::Vector3 | three Float values | RMF::Vector3 | RMF.Vector3 |
RMF::Vector4 | four Float values | RMF::Vector4 | RMF.Vector4 |
RMF::String can be used either to store arbitrary text, or paths to files. Paths are stored as relative paths, relative to the directory containing the RMF file. This ensures that if the entire directory structure containing the RMF file is archived, the paths are still correct. The convention is that string attributes containing paths are named ending in "filename" or "filenames". Special handling is done on such attributes (e.g. if an RMF is moved to a different directory with rmf_slice or rmf_cat, the relative paths of the static frame are updated accordingly).
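As a rough illustration of what updating a relative path involves when a file moves to another directory, the sketch below uses only the Python standard library. The helper names `rebase_relative_path` and `is_path_attribute`, the directory layout and the attribute names are assumptions made for this example; this is not how rmf_slice or rmf_cat are implemented.

```python
import os.path


def is_path_attribute(name):
    """Apply the naming convention: path attributes end in 'filename(s)'."""
    return name.endswith(("filename", "filenames"))


def rebase_relative_path(path, old_dir, new_dir):
    """Re-express `path`, given relative to old_dir, relative to new_dir."""
    # Resolve the stored path against the directory it was relative to...
    target = os.path.normpath(os.path.join(old_dir, path))
    # ...then express the same target relative to the new directory.
    return os.path.relpath(target, new_dir)


# Hypothetical layout: the file moves from project/runs to project/archive/v1.
old_dir = os.path.join("project", "runs")
new_dir = os.path.join("project", "archive", "v1")
stored = os.path.join("..", "maps", "density.mrc")

moved = rebase_relative_path(stored, old_dir, new_dir)
```

Here `moved` still refers to `project/maps/density.mrc`, but it now climbs two directory levels instead of one.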
In addition, an arbitrary-length list of any of the above can be stored. The type for that is the type for the single data with an s on the end, e.g. Floats for a list of Float values. These are passed as std::vector-like lists in C++ and lists in Python.
Each data type has associated typedefs such as
Name | Type | Role |
---|---|---|
RMF::Float | float | the type used to pass a floating point value |
RMF::Floats | std::vector<RMF::Float> | the type used to pass a list of floating point values. It looks like an std::vector in C++ and a list in Python |
RMF::FloatKey | RMF::ID<RMF::FloatTraits> | an RMF::Key used to identify a floating point value associated with a node in the RMF hierarchy |
RMF::FloatsKey | std::vector<RMF::FloatKey> | an RMF::Key used to identify a list of floating point values associated with a node in the RMF hierarchy |
RMF::FloatTraits | RMF::FloatTraits | a traits class to tell HDF5 how to read and write one or more floating point values |
RMF::FloatsTraits | RMF::FloatsTraits | a traits class to tell HDF5 how to read and write one or more lists of floating point values |
Certain nodes modify how their children should behave. This modification can be either through inheritance (e.g. all descendants are assumed to have the property unless they explicitly override it) or composition (the descendant's property is the ancestor's composed with its own). Note that since a given node can be reached through multiple paths in the hierarchy, a given view of the file might need multiple objects (e.g. graphics) for a single node.
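The difference between inheritance and composition, and the effect of reaching one node through several paths, can be seen in a small stand-alone sketch. Everything here is an assumption made for illustration: the `Node` class, the choice of a color as the inherited property, and a plain translation offset as the composed property. It does not use the RMF API.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

Triple = Tuple[float, float, float]


@dataclass
class Node:
    """Toy node (hypothetical, not the RMF API)."""
    name: str
    color: Optional[Triple] = None          # inherited unless overridden
    offset: Triple = (0.0, 0.0, 0.0)        # composed with the ancestors'
    children: list = field(default_factory=list)


def resolve(node, inherited_color=None, parent_offset=(0.0, 0.0, 0.0), out=None):
    """Collect, per node name, one (color, offset) pair for each path to it."""
    out = {} if out is None else out
    # Inheritance: use the node's own color if set, else the ancestor's.
    color = node.color if node.color is not None else inherited_color
    # Composition: combine the ancestor's offset with the node's own.
    offset = tuple(p + o for p, o in zip(parent_offset, node.offset))
    out.setdefault(node.name, []).append((color, offset))
    for child in node.children:
        resolve(child, color, offset, out)
    return out


# "ball" is shared by two parents, so it is reached through two paths.
ball = Node("ball")
left = Node("left", color=(1.0, 0.0, 0.0), offset=(1.0, 0.0, 0.0), children=[ball])
right = Node("right", offset=(0.0, 2.0, 0.0), children=[ball])
root = Node("root", color=(0.0, 0.0, 1.0), children=[left, right])

views = resolve(root)
```

The single `ball` node ends up with two entries in `views`, one red and shifted along x, one blue and shifted along y, which is why a viewer may need more than one graphical object per node.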
Current examples are
Each RMF file stores one or more frames (conformations). The attributes of a node in a given conformation are the union of conformation-specific attributes as well as static attributes (values that hold for all frames).
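A minimal sketch of that union, using plain dictionaries rather than the RMF API, is given here. The function name `attributes_in_frame` and the attribute names are hypothetical, and letting a frame-specific value take precedence over a static value of the same name is an assumption of this example, not something stated above.

```python
def attributes_in_frame(static, per_frame, frame):
    """Return the attributes visible for one node in one frame."""
    merged = dict(static)                     # values that hold for all frames
    merged.update(per_frame.get(frame, {}))   # values specific to this frame
    return merged


# Hypothetical data for a single node.
static = {"mass": 12.0, "radius": 1.7}
per_frame = {
    0: {"coordinates": (0.0, 0.0, 0.0)},
    1: {"coordinates": (1.0, 0.0, 0.0)},
}

frame1 = attributes_in_frame(static, per_frame, 1)
# Frame 2 has no frame-specific values, so only static attributes remain.
frame2 = attributes_in_frame(static, per_frame, 2)
```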
As with nodes, frames have a hierarchical relationship to one another. This hierarchy supports natural representation of clustering results (e.g. a frame for the cluster center with a child frame for each conformation that is in the cluster). By convention, sequential frames in a simulation should be stored with the successor frame as a child of the predecessor.
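Both conventions can be modeled with a simple parent map, sketched here in plain Python. The helper names (`add_frame`, `children_of`, `path_from_root`) and the frame labels are hypothetical and do not correspond to RMF functions.

```python
def add_frame(parents, name, parent=None):
    """Record a frame and the frame it is a child of (None for a root)."""
    parents[name] = parent
    return name


def children_of(parents, frame):
    """List the frames stored directly under `frame`."""
    return [f for f, p in parents.items() if p == frame]


def path_from_root(parents, frame):
    """Follow parent links upward and return the frames in root-first order."""
    path = []
    while frame is not None:
        path.append(frame)
        frame = parents[frame]
    return path[::-1]


parents = {}

# Clustering: the cluster center is the parent of each member conformation.
center = add_frame(parents, "cluster center")
for i in range(3):
    add_frame(parents, f"member {i}", center)

# Simulation: each successor frame is a child of its predecessor.
step0 = add_frame(parents, "step 0")
step1 = add_frame(parents, "step 1", step0)
step2 = add_frame(parents, "step 2", step1)

members = children_of(parents, center)
trajectory = path_from_root(parents, step2)
```

With this layout, a trajectory is recovered by walking from the last frame back to its root, and a cluster by listing the children of its center.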
Frames also have arbitrary attributes associated with them, like nodes.
When adding data to an RMF file that is just to be used for internal consumption, one should create a new category. For example, IMP defines an "imp" category when arbitrary particle data are stored.
If, instead, the data is likely to be of general interest, it probably makes sense to add it to the documentation of this library so that the names used can be standardized.
The RMF library has supported various on-disk formats. Currently three output methods are supported: files with suffix .rmf, files with suffix .rmfz, and buffers in memory.
The current format stores the structure in an Avro Object Container. If the .rmfz suffix is used, the contents are compressed. The structure is stored as a series of records, each containing either a frame or static data (there can be multiple static data frames; they are implicitly merged). Upon opening, the file is scanned once; after that, frames can be accessed in a random access fashion. See Frame.json for the schema.
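The scan-once, then random-access pattern can be sketched with an in-memory list standing in for the file. This toy model does not read Avro and does not follow the Frame.json schema: the record layout (`kind`, `data`), the function name `scan`, and the rule that a later static record overrides an earlier one on conflict are all assumptions made for illustration.

```python
def scan(records):
    """Single pass: merge static records and index where each frame lives."""
    static = {}
    frame_index = []
    for position, record in enumerate(records):
        if record["kind"] == "static":
            # Multiple static records are merged into one view.
            static.update(record["data"])
        else:
            # Remember the position so the frame can be fetched directly later.
            frame_index.append(position)
    return static, frame_index


# Hypothetical record stream with static data interleaved between frames.
records = [
    {"kind": "static", "data": {"name": "protein"}},
    {"kind": "frame", "data": {"coordinates": (0.0, 0.0, 0.0)}},
    {"kind": "static", "data": {"radius": 1.7}},
    {"kind": "frame", "data": {"coordinates": (1.0, 0.0, 0.0)}},
]

static, frame_index = scan(records)

# Random access: jump straight to the second frame using the index.
second_frame = records[frame_index[1]]["data"]
```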
The format is robust to corruption (all on-disk data are safe if garbage data is written or the process is killed).
There are several ways that the files can be made more compact (without breaking forwards compatibility of existing files). They can be investigated further if there is sufficient demand.
If HDF5 is available when RMF is built, wrappers for it will be built and support for older HDF5-based RMF formats will be compiled.
A quick comparison of the various options (taken from benchmark/benchmark_rmf.cpp).
type | create | write frame | open | traverse | read frame | size |
---|---|---|---|---|---|---|
rmf | 0.11 | 0.05 | 0.2 | 0.03 | 0.04 | 14M |
rmfz | 0.09 | 0.10 | 0.5 | 0.03 | 0.05 | 10M |
buffer | 0.09 | 0.10 | 0.5 | 0.03 | 0.05 | 10M |
The operations are:
- create: create an RMF file with a hierarchy with 45000 particles
- write frame: save coordinates for those particles
- open: open the file
- traverse: traverse the loaded hierarchy, touching atom, residue and chain data
- read frame: load the coordinates from a frame
- size: the size of the file. The raw data saved is 11M.

Note that the file stayed in RAM for these operations (hence the identical buffer and to disk times).