IMP Tutorial
This tutorial demonstrates the EMSequenceFinder method for assigning amino acid residue sequence to backbone fragments traced in an input cryo-electron microscopy (cryo-EM) map.
EMSequenceFinder relies on a Bayesian scoring function for ranking the 20 standard amino acid residue types at a given backbone position, based on the fit to a density map, map resolution, and secondary structure propensity. The fit to the density is quantified by a convolutional neural network that was trained on 5.56 million amino acid residue densities extracted from cryo-EM maps at 3–10 Å resolution and corresponding atomic structure models deposited in the Electron Microscopy Data Bank (EMDB). For more information, see Mondal et al., 2025.
This tutorial can be followed in several ways, including by running the Jupyter notebook doc/emseqfinder.ipynb.

EMSequenceFinder is implemented as part of the Integrative Modeling Platform (IMP). It is usually used by running the emseqfinder command-line tool.
First, download the files for this tutorial by using the "Clone or download" link at the tutorial's GitHub page. Then, install all dependencies, namely the mrcfile, scipy, scikit-learn, statsmodels, pandas, and tensorflow Python packages. One way to get these dependencies is via conda-forge. In order for TensorFlow prediction to work correctly on GPUs with libdevice, you may have to run
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX
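
Before running the tool, it can help to confirm that the listed packages are importable in the active environment. The following is a minimal sketch, not part of EMSequenceFinder itself: the helper name missing_packages is hypothetical, and the only assumption about the packages is their standard import names (scikit-learn is imported as sklearn).

```python
import importlib.util

# Package name (as installed) -> module name (as imported).
# scikit-learn is the one package whose import name differs.
REQUIRED_MODULES = {
    "mrcfile": "mrcfile",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "statsmodels": "statsmodels",
    "pandas": "pandas",
    "tensorflow": "tensorflow",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the package names whose import module cannot be located.

    Hypothetical helper for illustration; it only looks the modules up
    and does not import them, so it stays fast even for tensorflow.
    """
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed dependencies were found.")
```

If any package is reported as missing, install it (for example from conda-forge) before continuing.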
Input files for the protocol should be placed in the subdirectories pdb_files, cryoem_maps, and fasta_files, containing input files in .pdb, .map, and .fasta format respectively. Name all three files for a given run with the same stem. For this tutorial we have provided pdb_files/EMD-8637.pdb, cryoem_maps/EMD-8637.map, and fasta_files/EMD-8637.fasta.
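
Because the three files for a run must share a stem, a quick consistency check of the input layout can catch naming mistakes early. The sketch below is illustrative only: the function names are hypothetical, and it assumes the three subdirectories sit under a common root directory.

```python
from pathlib import Path

# Subdirectory name -> expected file extension, as described above.
INPUT_LAYOUT = {
    "pdb_files": ".pdb",
    "cryoem_maps": ".map",
    "fasta_files": ".fasta",
}


def stems_in(directory, extension):
    """Return the stems of all files in `directory` with `extension`."""
    directory = Path(directory)
    if not directory.is_dir():
        return set()
    return {p.stem for p in directory.iterdir() if p.suffix == extension}


def check_inputs(root="."):
    """Return (complete, incomplete) for the input layout under `root`.

    `complete` is a sorted list of stems present in all three
    subdirectories; `incomplete` maps each remaining stem to the sorted
    list of subdirectories where its file is missing.
    """
    root = Path(root)
    found = {
        subdir: stems_in(root / subdir, ext)
        for subdir, ext in INPUT_LAYOUT.items()
    }
    all_stems = set().union(*found.values())
    complete = set.intersection(*found.values())
    incomplete = {
        stem: sorted(d for d, stems in found.items() if stem not in stems)
        for stem in all_stems - complete
    }
    return sorted(complete), incomplete
```

For the files provided with this tutorial, calling check_inputs() from the directory that contains the three subdirectories should list EMD-8637 as complete and report nothing as incomplete.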
Run the protocol on all of the files using emseqfinder batch. This will take a few minutes to run.
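
If you prefer to drive the run from Python (for example, inside a notebook), the command can be launched with the standard subprocess module. This is a rough sketch under stated assumptions: the helper names batch_environment and run_batch are hypothetical, the tool is assumed to be on PATH, and it is assumed to be run from the directory containing the three input subdirectories. The only command used is emseqfinder batch, as given above; the XLA_FLAGS handling mirrors the export line mentioned earlier and is applied only when CONDA_PREFIX is set and XLA_FLAGS is not.

```python
import os
import shutil
import subprocess


def batch_environment(base_env=None):
    """Return a copy of the environment with the XLA_FLAGS workaround.

    XLA_FLAGS is added only if CONDA_PREFIX is defined and XLA_FLAGS is
    not already set, so an existing user setting is never overwritten.
    """
    env = dict(os.environ if base_env is None else base_env)
    conda_prefix = env.get("CONDA_PREFIX")
    if conda_prefix and "XLA_FLAGS" not in env:
        env["XLA_FLAGS"] = f"--xla_gpu_cuda_data_dir={conda_prefix}"
    return env


def run_batch(workdir="."):
    """Run `emseqfinder batch` in `workdir` (assumed to hold the inputs).

    Raises FileNotFoundError if the tool is not on PATH, and
    subprocess.CalledProcessError if it exits with a nonzero status.
    """
    if shutil.which("emseqfinder") is None:
        raise FileNotFoundError("emseqfinder was not found on PATH")
    return subprocess.run(
        ["emseqfinder", "batch"],
        cwd=workdir,
        env=batch_environment(),
        check=True,
    )
```

Calling run_batch() then behaves like typing the command in a shell, with the advantage that a failed run raises an exception rather than passing silently.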
The following output files will be generated:
*_ML_side_ML_prob.dat contains fragment-wise sequence scores.
batch_matching_results.txt contains overall sequence matching accuracy per structure.

More tutorials on using IMP are available at the IMP web site.