Overview

The RMF file format (short for Rich Molecular Format) stores hierarchical data about a molecular structure in a file or buffer in memory. This data can include:

molecular structures stored hierarchically. These structures need not be at atomic resolution.
feature information about parts of the structures, such as how well they fit a particular measurement.
geometric markup such as segments, surfaces, balls, and colors, which can be used to improve visualization.

For example, a protein can be stored as a hierarchy whose root is the whole molecule. The root has one node per chain, each chain has one node per residue, and each residue has one node per atom. Each node stores the data appropriate to its level: a chain node has the chain identifier, a residue node has the residue type, and atom nodes have coordinates, atom type, and element. Bonds between atoms or coarser elements are stored explicitly, because relying on external databases to generate bonds is the source of much of the difficulty of dealing with other formats such as PDB.
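The structural part of this example can be sketched as a small tree. This is a toy in-memory model in plain Python, not the RMF API: the `Node` class, its field names, and the attribute names are hypothetical, and the coordinates are arbitrary placeholder values.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy hierarchy node: a name, a type label, attributes, and children."""
    name: str
    node_type: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, name, node_type, **attributes):
        # Create a child node, attach it to this node, and return it.
        child = Node(name, node_type, attributes)
        self.children.append(child)
        return child


def count_nodes(node):
    """Count a node and all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children)


# molecule -> chain -> residue -> atoms, each level carrying its own data
protein = Node("protein", "molecule")
chain = protein.add_child("A", "chain", chain_id="A")
residue = chain.add_child("GLY1", "residue", residue_type="GLY", residue_index=1)
residue.add_child("N", "atom", element="N", coordinates=(0.0, 0.0, 0.0))
residue.add_child("CA", "atom", element="C", coordinates=(1.5, 0.0, 0.0))
```

Note that the residue type and index are stored once on the residue node rather than repeated on every atom.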

The file might also include a node for storing the r-value for a FRET measurement between two residues (with links to the residues) as well as extra markers to highlight key parts of the molecule.

Multiple conformations of the hierarchy are stored as frames. Each frame has the same hierarchical structure, but some aspects of the data (e.g. coordinates) can have different values in each frame (or no value in a particular frame if they are not applicable there).

A hierarchical storage format was chosen since

most biological molecules have a natural hierarchical structure
it reduces redundancy (e.g. the residue type is stored only once, as is the residue index)
most software uses a hierarchy of some sort to represent structures at runtime, so less translation is needed
low resolution and multiresolution structures are then more natural as they are just truncations of a full, atomic hierarchy.

In addition to structural information, the file can also store information about

bonds
how different parts of the structure relate to different experimental data
different organization schemes on the structure
arbitrary extra data needed by other programs
associated authorship and publication information

Examples

Simple example:

simple pdb
simple hierarchy for the hierarchy and data stored
simple on disk for how the data is stored in an RMF3 file.

Larger pdb:

3U7W pdb
3U7W hierarchy for the hierarchy and data stored
3U7W on disk for how the data is stored in an RMF3 file.

The RMF Hierarchy

More technically, each node in the RMF hierarchy has

a type (RMF::NodeType)
a human readable name (RMF::NodeHandle::get_name())
an ID that is unique within the file (RMF::NodeHandle::get_id())
and associated attributes.

One accesses nodes in the hierarchy using handles, RMF::NodeHandle and RMF::NodeConstHandle. The root handle can be fetched from the RMF::FileHandle using RMF::FileHandle::get_root_node().

Attributes

Each attribute is identified by a key (e.g. RMF::IntKey) and is defined by a unique combination of:

a category such as physics, identified by an RMF::Category.
a name string

Within a given RMF file, the data associated with a key can have either one value for each node that has the attribute, or one value per frame for each such node. The methods in RMF::NodeHandle to get and set the attributes take an optional frame number.
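The idea that a key is determined by its category and name can be illustrated with a small registry. This is a sketch in plain Python under the assumption that keys behave like small integer IDs; the `KeyRegistry` class and the category and attribute names are hypothetical and do not reproduce the RMF API.

```python
class KeyRegistry:
    """Hands out one integer key per unique (category, name) pair."""

    def __init__(self):
        self._keys = {}

    def get_key(self, category, name):
        # Reuse the existing key for a known pair, otherwise allocate the next ID.
        return self._keys.setdefault((category, name), len(self._keys))


registry = KeyRegistry()
mass_key = registry.get_key("physics", "mass")
same_key = registry.get_key("physics", "mass")   # same category and name
other_key = registry.get_key("shape", "mass")    # same name, different category
```

Asking for the same category and name twice yields the same key, while the same name in a different category yields a distinct one.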

The library provides decorators that group standard attributes together and make them easier to manipulate. Examples include RMF::decorator::Particle, RMF::decorator::Colored, and RMF::decorator::Ball. See Decorators for more information.

The data types that can currently be stored in an RMF file are

Name	Description	C++ type	Python type
RMF::Float	a floating point value	`float`	`float`
RMF::String	a utf8 string	`std::string`	`str`
RMF::Int	a 64 bit integer	`int`	`int`
RMF::Vector3	three Float values	RMF::Vector3	RMF.Vector3
RMF::Vector4	four Float values	RMF::Vector4	RMF.Vector4

RMF::String can be used either to store arbitrary text, or paths to files. Paths are stored as relative paths, relative to the directory containing the RMF file. This ensures that if the entire directory structure containing the RMF file is archived, the paths are still correct. The convention is that string attributes containing paths are named ending in "filename" or "filenames". Special handling is done on such attributes (e.g. if an RMF is moved to a different directory with rmf_slice or rmf_cat, the relative paths of the static frame are updated accordingly).
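The relative-path convention can be sketched with the Python standard library. The helper names below are hypothetical and the file locations are made up for illustration; this is not how the RMF library itself is implemented.

```python
import os.path


def to_stored_path(file_path, rmf_path):
    """Express file_path relative to the directory holding the RMF file."""
    rmf_dir = os.path.dirname(os.path.abspath(rmf_path))
    return os.path.relpath(os.path.abspath(file_path), rmf_dir)


def from_stored_path(stored, rmf_path):
    """Resolve a stored relative path against the RMF file's directory."""
    rmf_dir = os.path.dirname(os.path.abspath(rmf_path))
    return os.path.normpath(os.path.join(rmf_dir, stored))


def is_path_attribute(name):
    """Apply the naming convention: path attributes end in 'filename(s)'."""
    return name.endswith(("filename", "filenames"))


# Hypothetical locations, used only to show the round trip.
rmf_file = os.path.join(os.sep, "data", "project", "output", "model.rmf")
map_file = os.path.join(os.sep, "data", "project", "maps", "density.mrc")
stored = to_stored_path(map_file, rmf_file)
```

Because only the relative form is stored, moving the whole `project` directory leaves the stored value valid; moving only the RMF file requires recomputing it, which is the kind of update described above for rmf_slice and rmf_cat.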

In addition, an arbitrary-length list of any of the above can be stored. The list type is named by adding an s to the name of the single-value type, e.g. Floats for a list of Float values. These are passed as std::vector-like lists in C++ and as lists in Python.

Each data type has associated typedefs such as

Name	Type	Role
RMF::Float	`float`	the type used to pass a floating point value
RMF::Floats	std::vector<RMF::Float>	the type used to pass a list of floating point values. It looks like an `std::vector` in `C++` and a `list` in `Python`
RMF::FloatKey	RMF::ID<RMF::FloatTraits>	an RMF::Key used to identify a floating point value associated with a node in the RMF hierarchy
RMF::FloatsKey	RMF::ID<RMF::FloatsTraits>	an RMF::Key used to identify a list of floating point values associated with a node in the RMF hierarchy
RMF::FloatTraits	RMF::FloatTraits	a traits class to tell HDF5 how to read and write one or more floating point values
RMF::FloatsTraits	RMF::FloatsTraits	a traits class to tell HDF5 how to read and write one or more lists of floating point values

Inheritance of properties

Certain nodes modify how their children should behave. This modification can be either through inheritance (e.g. all descendants are assumed to have the property unless they explicitly override it) or composition (the descendant's property is the ancestors' property composed with its own). Note that since a given node can be reached through multiple paths in the hierarchy, a given view of the file might need multiple objects (e.g. graphics) for a single node.

Current examples are

RMF::decorator::Colored is inherited. That is, a node that is not an RMF::decorator::Colored node has the color of its closest RMF::decorator::Colored ancestor.
The RMF::decorator::Particle and RMF::decorator::RigidParticle coordinates are transformed by any RMF::decorator::ReferenceFrame(s) that occur on the particle itself or its ancestors. That is, a node that is an RMF::decorator::Particle or RMF::decorator::Ball with an ancestor that is an RMF::decorator::ReferenceFrame has global coordinates given by the RMF::decorator::ReferenceFrame's transformation applied to its coordinates. See the example reference_frames.py.
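The two rules above, inheritance for color and composition for reference frames, can be sketched as follows. This is a simplified toy model in plain Python and not the RMF decorators: `ToyNode` is hypothetical, each node is assumed to have a single parent, and reference frames are reduced to pure translations (no rotation) to keep the composition easy to follow.

```python
class ToyNode:
    """Toy node with an optional color, translation, and local coordinates."""

    def __init__(self, name, parent=None, color=None,
                 translation=None, coordinates=None):
        self.name = name
        self.parent = parent
        self.color = color
        self.translation = translation
        self.coordinates = coordinates


def effective_color(node):
    """Inheritance: use the closest color on the node or its ancestors."""
    while node is not None:
        if node.color is not None:
            return node.color
        node = node.parent
    return None


def global_coordinates(node):
    """Composition: apply every translation from the node up to the root."""
    x, y, z = node.coordinates
    current = node
    while current is not None:
        if current.translation is not None:
            tx, ty, tz = current.translation
            x, y, z = x + tx, y + ty, z + tz
        current = current.parent
    return (x, y, z)


root = ToyNode("molecule", color=(1.0, 0.0, 0.0), translation=(10.0, 0.0, 0.0))
chain = ToyNode("chain", parent=root, translation=(0.0, 5.0, 0.0))
atom = ToyNode("atom", parent=chain, coordinates=(1.0, 1.0, 1.0))
```

Here the atom has no color of its own, so it takes the root's color, and its global position is its local coordinates shifted by both enclosing translations. With real rotations the order of composition matters, which this translation-only sketch does not show.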

Frames

Each RMF file stores one or more frames (conformations). The attributes of a node in a given conformation are the union of conformation-specific attributes as well as static attributes (values that hold for all frames).
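The union of static and frame-specific attributes can be sketched with two dictionaries. The function and attribute names are hypothetical, and the rule that a frame-specific value takes precedence when both exist is an assumption of this sketch rather than something stated above.

```python
def attributes_in_frame(static_attrs, frame_attrs):
    """Return the attributes visible on a node in one frame."""
    merged = dict(static_attrs)   # values that hold for all frames
    merged.update(frame_attrs)    # assumed: frame-specific values win on conflict
    return merged


static_attrs = {"name": "CA", "radius": 1.0}
per_frame = [
    {"coordinates": (0.0, 0.0, 0.0)},   # frame 0
    {"coordinates": (1.0, 0.0, 0.0)},   # frame 1
]

frame0 = attributes_in_frame(static_attrs, per_frame[0])
frame1 = attributes_in_frame(static_attrs, per_frame[1])
```

Both frames see the same name and radius, while the coordinates differ from frame to frame.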

As with nodes, frames have a hierarchical relationship to one another. This hierarchy supports a natural representation of clustering results (e.g. a frame for the cluster center with a child frame for each conformation in the cluster). By convention, sequential frames in a simulation should be stored with the successor frame as a child of the predecessor.
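Both conventions, clusters and sequential trajectories, can be sketched with parent links between frames. The `Frame` class and helper functions are hypothetical illustrations in plain Python and are not the RMF frame API.

```python
class Frame:
    """Toy frame that records its parent and children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def make_trajectory(names):
    """Chain frames so that each successor is a child of its predecessor."""
    frames = []
    previous = None
    for name in names:
        previous = Frame(name, parent=previous)
        frames.append(previous)
    return frames


def depth(frame):
    """Number of ancestors between a frame and its root."""
    count = 0
    while frame.parent is not None:
        frame = frame.parent
        count += 1
    return count


# Clustering: one frame for the center, one child per member conformation.
center = Frame("cluster center")
members = [Frame(f"member {i}", parent=center) for i in range(3)]

# Simulation: a chain of frames, each the child of the one before it.
trajectory = make_trajectory(["step 0", "step 1", "step 2"])
```

A cluster is therefore a shallow, wide tree, while a trajectory is a deep chain with one child per frame.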

Frames also have arbitrary attributes associated with them, like nodes.

Adding custom data to an RMF

When adding data to an RMF file that is only for internal consumption, one should create a new category. For example, IMP defines an "imp" category for storing arbitrary particle data.

If, instead, the data is likely to be of general interest, it probably makes sense to add it to the documentation of this library so that the names used can be standardized.

On disk format

The RMF library has supported various on-disk formats. Currently three output methods are supported: files with the suffix .rmf, files with the suffix .rmfz, and buffers in memory.

The current format stores the structure in an Avro Object Container. If the .rmfz suffix is used, the contents are compressed. The structure is stored as a series of records, each containing either a frame or static data (there can be multiple static data records; they are implicitly merged). Upon opening, the file is scanned once; after that, frames can be accessed in a random-access fashion. See Frame.json for the schema.
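The scan-once behaviour described above can be sketched over an in-memory list of records. The record layout used here (a `kind` field and a `data` dictionary) is a hypothetical stand-in chosen for illustration; it is not the schema in Frame.json, and no Avro library is involved.

```python
def scan(records):
    """One pass over the records: merge static data and index the frames."""
    static = {}
    frame_positions = []
    for position, record in enumerate(records):
        if record["kind"] == "static":
            static.update(record["data"])      # static records are merged
        else:
            frame_positions.append(position)   # remember where each frame is
    return static, frame_positions


def read_frame(records, frame_positions, index):
    """Random access: jump straight to a frame using the index built by scan."""
    return records[frame_positions[index]]["data"]


records = [
    {"kind": "static", "data": {"name": "CA"}},
    {"kind": "frame", "data": {"coordinates": (0.0, 0.0, 0.0)}},
    {"kind": "static", "data": {"radius": 1.0}},
    {"kind": "frame", "data": {"coordinates": (1.0, 0.0, 0.0)}},
]

static, frame_positions = scan(records)
second_frame = read_frame(records, frame_positions, 1)
```

After the single scan, any frame can be fetched by index without rereading the records that precede it.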

The format is robust to corruption (all on disk data are safe if garbage data is written or the process is killed).

There are several ways that the files can be made more compact (without breaking backwards compatibility with existing files). They can be investigated further if there is sufficient demand.

If HDF5 is available when RMF is built, wrappers for it will be built and support for older HDF5-based RMF formats will be compiled.

Benchmarks

A quick comparison of the various options (taken from benchmark/benchmark_rmf.cpp).

type	create	write frame	open	traverse	read frame	size
rmf	0.11	0.05	0.2	0.03	0.04	14M
rmfz	0.09	0.10	0.5	0.03	0.05	10M
buffer	0.09	0.10	0.5	0.03	0.05	10M

The operations are:

create: create an RMF file with a hierarchy with 45000 particles
write frame: save coordinates for those particles
open: open the file
traverse: traverse the loaded hierarchy, touching atom, residue and chain data
read frame: load the coordinates from a frame
size: the size of the file. The raw data saved is 11M.

Note that the file stayed in RAM for these operations (hence the identical buffer and on-disk times).
