Introduction

This tutorial demonstrates how to run EMSequenceFinder on a traced backbone model and its cryo-electron microscopy (cryo-EM) map.

EMSequenceFinder is a method for assigning amino acid residue sequence to backbone fragments traced in an input cryo-electron microscopy (cryo-EM) map. It relies on a Bayesian scoring function that ranks the 20 standard amino acid residue types at a given backbone position, based on the fit to the density map, the map resolution, and secondary structure propensity. The fit to the density is quantified by a convolutional neural network trained on 5.56 million amino acid residue densities extracted from cryo-EM maps at 3–10 Å resolution and corresponding atomic structure models deposited in the Electron Microscopy Data Bank (EMDB). For more information, see Mondal et al., 2025.
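
As a purely illustrative sketch of how a Bayesian ranking of this kind can combine evidence, the snippet below multiplies a per-residue-type density-fit probability by a secondary-structure prior and normalizes the result. The three residue types, all numbers, and the function name rank_residue_types are made up for illustration. This is not the published scoring function, which also accounts for map resolution and is described in the paper cited above.

```python
def rank_residue_types(fit_prob, ss_prior):
    """Rank residue types by a posterior proportional to fit_prob * ss_prior."""
    unnormalized = {aa: fit_prob[aa] * ss_prior[aa] for aa in fit_prob}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all residue types have zero probability")
    posterior = {aa: p / total for aa, p in unnormalized.items()}
    # Highest posterior first.
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values for a single backbone position (not real data).
fit_prob = {"ALA": 0.5, "GLY": 0.3, "LEU": 0.2}   # stand-in for the CNN output
ss_prior = {"ALA": 0.4, "GLY": 0.1, "LEU": 0.5}   # stand-in for a helix propensity

ranking = rank_residue_types(fit_prob, ss_prior)
for aa, p in ranking:
    print(f"{aa}: {p:.3f}")
```

In this toy case the prior promotes LEU above GLY even though GLY fits the density better, which is the kind of trade-off such a scoring function is meant to capture.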

This tutorial can be followed in several ways:

Download the files using the "Clone or download" link at the tutorial's GitHub page and use them in conjunction with this text.
Download the files from GitHub and, using Jupyter Notebook, open the notebook doc/emseqfinder.ipynb.
Load the tutorial directly in your browser, courtesy of Google Colaboratory. (This needs no software installed on your machine, but may take a while to load.)

Basic usage of emseqfinder

EMSequenceFinder is implemented as part of the Integrative Modeling Platform (IMP). It is usually run via the emseqfinder command-line tool.

First, download the files for this tutorial by using the "Clone or download" link at the tutorial's GitHub page. Then, install all dependencies, namely:

IMP
The STRIDE command-line tool for secondary structure assignment
The mrcfile, scipy, scikit-learn, statsmodels, pandas, and tensorflow Python packages
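
Before running the protocol, it can help to confirm that the dependencies listed above are visible to the current Python environment. The sketch below checks for importable modules and for executables on the PATH without importing anything heavy. It assumes that the STRIDE binary is installed under the name stride, which may differ on your system. The helper name find_missing is just for illustration.

```python
import importlib.util
import shutil

# Import names for the packages listed above. scikit-learn is imported as
# "sklearn", and "IMP" is the import name of the IMP Python module.
PYTHON_MODULES = ["IMP", "mrcfile", "scipy", "sklearn",
                  "statsmodels", "pandas", "tensorflow"]
# Assumption: the STRIDE executable is called "stride" on this system.
EXECUTABLES = ["stride", "emseqfinder"]


def find_missing(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return (missing_modules, missing_executables) for this environment."""
    missing_modules = [m for m in modules
                       if importlib.util.find_spec(m) is None]
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


missing_modules, missing_executables = find_missing()
print("Missing Python modules:", missing_modules or "none")
print("Missing executables:", missing_executables or "none")
```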

One way to get these dependencies is via conda-forge. For TensorFlow prediction to work correctly on GPUs with libdevice, you may have to set the XLA_FLAGS environment variable first:

export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX
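
When working from a Jupyter notebook rather than a shell, the same setting can be applied from Python. The following is a minimal sketch; the helper name with_cuda_data_dir is hypothetical. It leaves the environment untouched when CONDA_PREFIX is not set or when the flag is already present. To be safe, run it before importing tensorflow or launching emseqfinder, so that the variable is in place before TensorFlow initializes.

```python
import os


def with_cuda_data_dir(env):
    """Return a copy of env whose XLA_FLAGS points XLA at CONDA_PREFIX."""
    new_env = dict(env)
    prefix = new_env.get("CONDA_PREFIX")
    existing = new_env.get("XLA_FLAGS", "")
    # Nothing to do outside a conda environment or if the flag is already set.
    if not prefix or "--xla_gpu_cuda_data_dir" in existing:
        return new_env
    flag = f"--xla_gpu_cuda_data_dir={prefix}"
    # Preserve any flags that were already present.
    new_env["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return new_env


# Apply to the current process; child processes inherit the setting.
os.environ.update(with_cuda_data_dir(os.environ))
```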

Place the input files for the protocol in the subdirectories pdb_files, cryoem_maps, and fasta_files, in .pdb, .map, and .fasta format respectively. All three files for a given run must share the same stem. For this tutorial we have provided pdb_files/EMD-8637.pdb, cryoem_maps/EMD-8637.map, and fasta_files/EMD-8637.fasta.
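
The naming convention above is easy to get wrong when preparing several structures, so a quick consistency check can save a failed run. The sketch below collects every stem found in any of the three subdirectories and reports which of the expected files are absent. The function name check_inputs is hypothetical and is not part of emseqfinder.

```python
from pathlib import Path

# Subdirectory name -> expected file extension, as described above.
LAYOUT = {"pdb_files": ".pdb", "cryoem_maps": ".map", "fasta_files": ".fasta"}


def check_inputs(root="."):
    """Map each stem to the list of expected input files that are missing."""
    root = Path(root)
    stems = set()
    for subdir, ext in LAYOUT.items():
        directory = root / subdir
        if directory.is_dir():
            stems.update(p.stem for p in directory.glob(f"*{ext}"))
    problems = {}
    for stem in sorted(stems):
        missing = [f"{subdir}/{stem}{ext}"
                   for subdir, ext in LAYOUT.items()
                   if not (root / subdir / f"{stem}{ext}").is_file()]
        if missing:
            problems[stem] = missing
    return problems


for stem, missing in check_inputs(".").items():
    print(f"{stem}: missing {', '.join(missing)}")
```

An empty result means every stem has all three input files.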

Run the protocol on all of the input files using emseqfinder batch. This takes a few minutes.

emseqfinder batch
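
If you prefer to drive the run from Python, for example from a notebook cell, the command can be launched with the standard subprocess module. This is only a thin wrapper around the command shown above. The function name run_batch and its runner parameter (which exists only to make the wrapper easy to test) are illustrative. It assumes the working directory contains the three input subdirectories.

```python
import shutil
import subprocess


def run_batch(workdir=".", runner=subprocess.run):
    """Run `emseqfinder batch` inside workdir and return the result."""
    # check=True raises CalledProcessError if the tool exits with an error.
    return runner(["emseqfinder", "batch"], cwd=workdir, check=True)


if __name__ == "__main__":
    if shutil.which("emseqfinder") is None:
        print("emseqfinder not found on PATH; is IMP installed?")
    else:
        run_batch(".")
```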

The following output files will be generated:

*_ML_side_ML_prob.dat contains fragment-wise sequence scores.
batch_matching_results.txt contains overall sequence matching accuracy per structure.

cat EMD-8637_ML_side_ML_prob.dat

cat batch_matching_results.txt
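
The same files can be gathered from Python, which is convenient when several structures were processed in one batch. The sketch below only locates the files by the names given above and returns their raw text. It does not parse them, since the column layout is not described here. It assumes the outputs are written to the working directory, and the helper name collect_outputs is hypothetical.

```python
from pathlib import Path


def collect_outputs(root="."):
    """Return {file name: file text} for the emseqfinder output files in root."""
    root = Path(root)
    # One fragment-wise score file per structure, then the batch summary.
    paths = sorted(root.glob("*_ML_side_ML_prob.dat"))
    summary = root / "batch_matching_results.txt"
    if summary.is_file():
        paths.append(summary)
    return {p.name: p.read_text() for p in paths}


for name, text in collect_outputs(".").items():
    print(f"==> {name} <==")
    print(text)
```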
