RMF
File Format

Overview

The RMF file format (short for Rich Molecular Format) stores hierarchical data about a molecular structure in a file or buffer in memory. This data can include

  • molecular structures stored hierarchically. These structures need not be atomic resolution.
  • feature information about parts of the structures, such as how well they fit a particular measurement.
  • geometric markup such as segments, surfaces, balls and colors, which can be used to improve visualization.

For example, a protein can be stored as a hierarchy where the root is the whole molecule. The root has one node per chain, each chain has one node per residue, and each residue has one node per atom. Each node in the hierarchy has the appropriate data stored along with it: a chain node has the chain identifier, a residue node has the residue type, and an atom node has coordinates, atom type and element. Bonds between atoms or coarser elements are stored explicitly, because relying on external databases to generate bonds is the source of much of the difficulty of working with other formats such as PDB.
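As a rough illustration of the layout described above, the following Python sketch models such a hierarchy with a plain dataclass. It does not use the RMF library: the Node class, its field names and the sample values are all hypothetical and only mirror the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a hierarchy node (not the RMF API)."""
    name: str
    data: Dict[str, object] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, name: str, **data: object) -> "Node":
        """Create a child node carrying the given data and return it."""
        child = Node(name, dict(data))
        self.children.append(child)
        return child


# root -> chain -> residue -> atom, each level carrying its own data
protein = Node("protein")
chain = protein.add_child("chain A", chain_id="A")
residue = chain.add_child("residue 1", residue_type="GLY", residue_index=1)
n = residue.add_child("N", element="N", atom_type="N",
                      coordinates=(0.0, 0.0, 0.0))
ca = residue.add_child("CA", element="C", atom_type="CA",
                       coordinates=(1.46, 0.0, 0.0))

# Bonds are listed explicitly as pairs of nodes rather than inferred.
bonds: List[Tuple[Node, Node]] = [(n, ca)]


def count_nodes(node: Node) -> int:
    """Count a node together with all of its descendants."""
    return 1 + sum(count_nodes(c) for c in node.children)
```

The residue type and index appear once, on the residue node, rather than being repeated on every atom.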

The file might also include a node for storing the r-value for a FRET measurement between two residues (with links to the residues) as well as extra markers to highlight key parts of the molecule.

Multiple conformations of the hierarchy are stored as frames. Each frame has the same hierarchical structure, but some aspects of the data (e.g. coordinates) can have different values in each frame (or no value in a particular frame if they are not applicable there).

A hierarchical storage format was chosen since

  • most biological molecules have a natural hierarchical structure
  • it reduces redundancy (e.g. the residue type is only stored once, as is the residue index)
  • most software uses a hierarchy of some sort to represent structures at runtime, so less translation is needed
  • low resolution and multiresolution structures are then more natural as they are just truncations of a full, atomic hierarchy.
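The last point can be sketched in a few lines of Python. This is only an illustration using a hypothetical nested-dict representation, not the RMF API: cutting an atomic hierarchy at a fixed depth yields a residue-level one. A real coarse model would also store data such as coordinates on the nodes that become leaves.

```python
def make(name, children=None):
    """Build a node of the hypothetical form {"name": ..., "children": [...]}."""
    return {"name": name, "children": children or []}


def truncate(node, max_depth):
    """Return a copy of the hierarchy keeping only nodes up to max_depth."""
    if max_depth == 0:
        return make(node["name"])
    return make(node["name"],
                [truncate(c, max_depth - 1) for c in node["children"]])


def leaves(node):
    """Names of the leaf nodes, in depth-first order."""
    if not node["children"]:
        return [node["name"]]
    return [name for c in node["children"] for name in leaves(c)]


atomic = make("protein", [
    make("chain A", [
        make("GLY 1", [make("N"), make("CA")]),
        make("ALA 2", [make("N"), make("CA")]),
    ]),
])

# Depth 0 is the root, depth 1 the chains, depth 2 the residues.
residue_level = truncate(atomic, 2)
```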

In addition to structural information, the file can also store information about

  • bonds
  • how different parts of the structure relate to different experimental data
  • different organization schemes on the structure
  • arbitrary extra data needed by other programs
  • associated authorship and publication information

Examples

Simple example:

Larger pdb:

The RMF Hierarchy

More technically, each node in the RMF hierarchy has

One accesses nodes in the hierarchy using handles, RMF::NodeHandle and RMF::NodeConstHandle. The root handle can be fetched from the RMF::FileHandle using RMF::FileHandle::get_root_node().

Attributes

Each attribute is identified by a key (e.g. RMF::IntKey) and is defined by a unique combination of

On a per RMF basis, the data associated with a given key can either have one value for each node which has that attribute, or one value per frame per node with the attribute. The methods in RMF::NodeHandle to get and set the attributes take an optional frame number.

The library provides decorators to group together and provide easier manipulation of standard attributes. Examples include RMF::Particle, RMF::decorator::Colored, RMF::decorator::Ball etc. See Decorators for more information.

The data types that can currently be stored in an RMF file are

Name          Description             C++ type      Python type
RMF::Float    a floating point value  float         float
RMF::String   a utf8 string           std::string   str
RMF::Int      a 64-bit integer        int           int
RMF::Vector3  three Float values      RMF::Vector3  RMF.Vector3
RMF::Vector4  four Float values       RMF::Vector4  RMF.Vector4

RMF::String can be used either to store arbitrary text, or paths to files. Paths are stored as relative paths, relative to the directory containing the RMF file. This ensures that if the entire directory structure containing the RMF file is archived, the paths are still correct. The convention is that string attributes containing paths are named ending in "filename" or "filenames". Special handling is done on such attributes (e.g. if an RMF is moved to a different directory with rmf_slice or rmf_cat, the relative paths of the static frame are updated accordingly).
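The path handling described above can be sketched with the Python standard library. The helper names below are hypothetical and are not part of RMF; the sketch only shows how a path stored relative to the RMF file's directory can be resolved, and re-based when the RMF file moves.

```python
import posixpath


def is_path_attribute(name: str) -> bool:
    """Naming convention: path attributes end in 'filename' or 'filenames'."""
    return name.endswith(("filename", "filenames"))


def to_stored_path(file_path: str, rmf_path: str) -> str:
    """Express file_path relative to the directory containing the RMF file."""
    return posixpath.relpath(file_path, start=posixpath.dirname(rmf_path))


def resolve_stored_path(stored: str, rmf_path: str) -> str:
    """Turn a stored relative path back into a full path."""
    base = posixpath.dirname(rmf_path)
    return posixpath.normpath(posixpath.join(base, stored))


def rebase_stored_path(stored: str, old_rmf: str, new_rmf: str) -> str:
    """Recompute a stored path after the RMF file moves to new_rmf."""
    return to_stored_path(resolve_stored_path(stored, old_rmf), new_rmf)


stored = to_stored_path("/data/maps/density.mrc", "/data/run1/model.rmf")
moved = rebase_stored_path(stored,
                           "/data/run1/model.rmf",
                           "/data/archive/2020/model.rmf")
```

Here stored is "../maps/density.mrc" and moved is "../../maps/density.mrc"; both resolve to the same file.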

In addition, an arbitrary-length list of any of the above can be stored. The type name for a list is the name of the single-value type with an s appended, e.g. Floats for a list of Float values. These are passed as std::vector-like lists in C++ and as lists in Python.

Each data type has associated typedefs such as

Name               Type                        Role
RMF::Float         float                       the type used to pass a floating point value
RMF::Floats        std::vector<RMF::Float>     the type used to pass a list of floating point values; it looks like an std::vector in C++ and a list in Python
RMF::FloatKey      RMF::ID<RMF::FloatTraits>   an RMF::Key used to identify a floating point value associated with a node in the RMF hierarchy
RMF::FloatsKey     std::vector<RMF::FloatKey>  an RMF::Key used to identify a list of floating point values associated with a node in the RMF hierarchy
RMF::FloatTraits   RMF::FloatTraits            a traits class that tells HDF5 how to read and write one or more floating point values
RMF::FloatsTraits  RMF::FloatsTraits           a traits class that tells HDF5 how to read and write one or more lists of floating point values

Inheritance of properties

Certain nodes modify how their children should behave. This modification can be either through inheritance (e.g. all descendants are assumed to have the property unless they explicitly override it) or composition (the descendant's property is the ancestor's composed with its own). Note that since a given node can be reached through multiple paths in the hierarchy, a given view of the file might need multiple objects (e.g. graphics) for a single node.
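The two modes can be contrasted with a small Python sketch. The property names ("color" as an inherited property, "offset" as a composed one, with addition standing in for composing transformations) and the dict-based nodes are assumptions made for illustration only.

```python
ZERO = (0.0, 0.0, 0.0)


def inherited_color(path):
    """Inheritance: the nearest node on the root-to-node path that sets a color wins."""
    color = None
    for node in path:
        if node.get("color") is not None:
            color = node["color"]
    return color


def composed_offset(path):
    """Composition: every offset on the root-to-node path contributes."""
    total = ZERO
    for node in path:
        offset = node.get("offset", ZERO)
        total = tuple(a + b for a, b in zip(total, offset))
    return total


left = {"name": "left copy", "color": (1.0, 0.0, 0.0), "offset": (10.0, 0.0, 0.0)}
right = {"name": "right copy", "offset": (-10.0, 0.0, 0.0)}
domain = {"name": "domain", "offset": (0.0, 5.0, 0.0)}  # child of both copies
atom = {"name": "CA", "color": (0.0, 0.0, 1.0)}         # overrides the color

# The same domain node is reached through two different paths.
via_left = [left, domain]
via_right = [right, domain]
```

Because domain is reached through both left and right, composed_offset gives (10.0, 5.0, 0.0) along one path and (-10.0, 5.0, 0.0) along the other, which is why a viewer may need two graphical objects for that one node.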

Current examples are

Frames

Each RMF file stores one or more frames (conformations). The attributes of a node in a given conformation are the union of conformation-specific attributes as well as static attributes (values that hold for all frames).
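A minimal Python sketch of that union, using plain dicts rather than the RMF API. It assumes that when the same attribute appears both as static and as frame-specific data, the frame-specific value takes precedence; the text above does not spell this out.

```python
from typing import Dict, List

Attributes = Dict[str, object]


def attributes_in_frame(static: Attributes,
                        per_frame: List[Attributes],
                        frame: int) -> Attributes:
    """Union of static and frame-specific attributes for one node.

    Assumption: if a name appears in both, the frame-specific value wins.
    """
    merged = dict(static)
    merged.update(per_frame[frame])
    return merged


static_attrs: Attributes = {"element": "C", "mass": 12.0}
frame_attrs: List[Attributes] = [
    {"coordinates": (0.0, 0.0, 0.0)},
    {"coordinates": (1.0, 0.0, 0.0)},
    {},  # no coordinates stored for this frame
]
```

The static values are visible in every frame, while coordinates exist only in the frames that store them.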

As with nodes, frames have a hierarchical relationship to one another. This hierarchy supports a natural representation of clustering results (e.g. a frame for the cluster center with a child frame for each conformation in the cluster). By convention, sequential frames in a simulation should be stored with the successor frame as a child of its predecessor.
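Both conventions can be sketched with a simple parent table. The FrameTree class below is hypothetical and unrelated to the RMF API; it only shows how a cluster and a trajectory look when frames are arranged as a hierarchy.

```python
from typing import List, Optional


class FrameTree:
    """Hypothetical frame hierarchy: each frame records its parent frame."""

    def __init__(self) -> None:
        self.names: List[str] = []
        self.parents: List[Optional[int]] = []

    def add_frame(self, name: str, parent: Optional[int] = None) -> int:
        """Add a frame under the given parent and return its index."""
        self.names.append(name)
        self.parents.append(parent)
        return len(self.names) - 1

    def children(self, frame: int) -> List[int]:
        """Indices of the frames whose parent is the given frame."""
        return [i for i, p in enumerate(self.parents) if p == frame]


frames = FrameTree()

# Clustering: the cluster center is the parent of its member conformations.
center = frames.add_frame("cluster 0 center")
member_a = frames.add_frame("member a", parent=center)
member_b = frames.add_frame("member b", parent=center)

# Simulation: each step is stored as a child of its predecessor.
step0 = frames.add_frame("step 0")
step1 = frames.add_frame("step 1", parent=step0)
step2 = frames.add_frame("step 2", parent=step1)
```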

Frames also have arbitrary attributes associated with them, like nodes.

Adding custom data to an RMF

When adding data to an RMF file that is only for internal consumption, one should create a new category. For example, IMP defines an "imp" category when arbitrary particle data are stored.

If, instead, the data is likely to be of general interest, it probably makes sense to add it to the documentation of this library so that the names used can be standardized.

On disk format

The RMF library has supported various on-disk formats. Currently three output methods are supported: files with suffix .rmf, files with suffix .rmfz, and buffers in memory.

The current format stores the structure in an Avro Object Container. If the .rmfz suffix is used, the contents are compressed. The structure is stored as a series of records, each containing either a frame or static data (there can be multiple static data frames - they are implicitly merged). Upon opening, the file is scanned once; after that, frames can be accessed in a random access fashion. See Frame.json for the schema.
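The single scan on opening can be pictured with the following Python sketch. It uses an in-memory list in place of the Avro container, and the record layout ("kind" and "data" fields) is hypothetical rather than the schema from Frame.json. It also assumes that when static records overlap, later ones override earlier ones, which the description above does not specify.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def scan(records: List[Record]) -> Tuple[Dict[str, object], List[int]]:
    """One pass over the records: merge static data, index frame positions."""
    static: Dict[str, object] = {}
    frame_positions: List[int] = []
    for position, record in enumerate(records):
        if record["kind"] == "static":
            static.update(record["data"])  # assumption: later records win
        else:
            frame_positions.append(position)
    return static, frame_positions


def load_frame(records: List[Record], frame_positions: List[int],
               frame: int) -> object:
    """Random access to a frame through the index built by scan()."""
    return records[frame_positions[frame]]["data"]


records: List[Record] = [
    {"kind": "static", "data": {"element": "C"}},
    {"kind": "frame", "data": {"coordinates": (0.0, 0.0, 0.0)}},
    {"kind": "static", "data": {"mass": 12.0}},
    {"kind": "frame", "data": {"coordinates": (1.0, 0.0, 0.0)}},
]

static_data, index = scan(records)
```

After the scan, any frame can be fetched directly through the index without re-reading the records before it.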

The format is robust to corruption (all on disk data are safe if garbage data is written or the process is killed).

There are several ways that the files can be made more compact (without breaking forwards compatibility of existing files). They can be investigated further if there is sufficient demand.

If HDF5 is available when RMF is built, wrappers for it will be built and support for older HDF5-based RMF formats will be compiled.

Benchmarks

A quick comparison of the various options (taken from benchmark/benchmark_rmf.cpp).

type    create  write frame  open  traverse  read frame  size
rmf     0.11    0.05         0.2   0.03      0.04        14M
rmfz    0.09    0.10         0.5   0.03      0.05        10M
buffer  0.09    0.10         0.5   0.03      0.05        10M

The operations are:

  • create: create an RMF file with a hierarchy with 45000 particles
  • write frame: save coordinates for those particles
  • open: open the file
  • traverse: traverse the loaded hierarchy, touching atom, residue and chain data
  • read frame: load the coordinates from a frame
  • size: the size of the file. The raw data saved is 11M.

Note that the file stayed in RAM for these operations (hence the identical buffer and on-disk times).