1. Cross-link identification ambiguity

There are several ways to model identification ambiguity.

One way to do it is to use the UniqueID keyword; cross-links with the same UniqueID are considered ambiguous:

xldb='''Protein 1,Protein 2,Residue 1,Residue 2,UniqueID,Score
ProtA,ProtB,1,10,1,1.0
ProtA,ProtB,1,11,1,2.0
ProtA,ProtB,1,21,2,2.0
'''
with open('xlinks.csv', 'w') as xlf:
    xlf.write(xldb)

In the example above, cross-links ProtA:1-ProtB:10 and ProtA:1-ProtB:11 are ambiguous because they were assigned the same UniqueID (1). The third cross-link, ProtA:1-ProtB:21, has its own UniqueID (2) and is therefore unambiguous.
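To make the grouping rule concrete, here is a minimal sketch in plain Python that groups the rows of the table above by their UniqueID column. It does not use IMP and is not how the library is implemented; the helper name group_by_unique_id is hypothetical and only illustrates the idea that rows sharing a UniqueID form one ambiguous set.

```python
import csv
import io
from collections import defaultdict

# Same table as in the example above.
xldb = '''Protein 1,Protein 2,Residue 1,Residue 2,UniqueID,Score
ProtA,ProtB,1,10,1,1.0
ProtA,ProtB,1,11,1,2.0
ProtA,ProtB,1,21,2,2.0
'''


def group_by_unique_id(text):
    """Hypothetical helper: map each UniqueID to the cross-links sharing it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        link = "{}:{}-{}:{}".format(row["Protein 1"], row["Residue 1"],
                                    row["Protein 2"], row["Residue 2"])
        groups[row["UniqueID"]].append(link)
    return dict(groups)


groups = group_by_unique_id(xldb)
# The size of each group is the number of alternative identifications.
ambiguity = {uid: len(links) for uid, links in groups.items()}
print(groups)
print(ambiguity)
```

Group "1" holds two alternative identifications and group "2" holds one, which matches the Ambiguity values 2 and 1 in the database listing for this example.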

Now we create a conversion map between the internal keywords for cross-link features and the column names used in the file:

import IMP.pmi.io.crosslink

cldbkc = IMP.pmi.io.crosslink.CrossLinkDataBaseKeywordsConverter()
cldbkc.set_protein1_key("Protein 1")
cldbkc.set_protein2_key("Protein 2")
cldbkc.set_residue1_key("Residue 1")
cldbkc.set_residue2_key("Residue 2")
cldbkc.set_unique_id_key("UniqueID")
cldbkc.set_id_score_key("Score")

With this keyword converter, let's read the cross-link database:

cldb = IMP.pmi.io.crosslink.CrossLinkDataBase(cldbkc)

cldb.create_set_from_file("xlinks.csv")

Let's check that the database looks OK:

print(cldb)

Output

1
--- XLUniqueID 1
--- XLUniqueSubIndex 1
--- XLUniqueSubID 1.1
--- Protein1 ProtA
--- Protein2 ProtB
--- Residue1 1
--- Residue2 10
--- IDScore 1.0
--- Redundancy 1
--- RedundancyList ['1.1']
--- Ambiguity 2
--- Residue1LinksNumber 3
--- Residue2LinksNumber 1
-------------
--- XLUniqueID 1
--- XLUniqueSubIndex 2
--- XLUniqueSubID 1.2
--- Protein1 ProtA
--- Protein2 ProtB
--- Residue1 1
--- Residue2 11
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['1.2']
--- Ambiguity 2
--- Residue1LinksNumber 3
--- Residue2LinksNumber 1
-------------
2
--- XLUniqueID 2
--- XLUniqueSubIndex 1
--- XLUniqueSubID 2.1
--- Protein1 ProtA
--- Protein2 ProtB
--- Residue1 1
--- Residue2 21
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['2.1']
--- Ambiguity 1
--- Residue1LinksNumber 3
--- Residue2LinksNumber 1
-------------

As you can see, there are two unique indexes, 1 and 2. The first unique index contains two identifications, with sub-IDs 1.1 and 1.2, corresponding to the two ambiguous restraints. The second contains a single identification, 2.1.
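The Residue1LinksNumber and Residue2LinksNumber values in the listing are consistent with a simple count of how many cross-links involve a given (protein, residue) pair. This is an interpretation of the printed output, not something taken from the IMP documentation, and the helper links_per_residue below is hypothetical. A minimal plain-Python sketch under that assumption:

```python
from collections import Counter

# The three cross-links of the example: (protein1, residue1, protein2, residue2).
links = [
    ("ProtA", 1, "ProtB", 10),
    ("ProtA", 1, "ProtB", 11),
    ("ProtA", 1, "ProtB", 21),
]


def links_per_residue(links):
    """Hypothetical helper: count the cross-links touching each residue."""
    counts = Counter()
    for p1, r1, p2, r2 in links:
        # Assumption: a link whose two ends are the same residue counts once,
        # hence the set.
        for end in {(p1, r1), (p2, r2)}:
            counts[end] += 1
    return counts


counts = links_per_residue(links)
for p1, r1, p2, r2 in links:
    print(p1, r1, p2, r2, counts[(p1, r1)], counts[(p2, r2)])
```

Under this reading, residue 1 of ProtA takes part in all three cross-links, while each ProtB residue takes part in one, which agrees with the values 3 and 1 shown in the listing.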

2. Compositional ambiguity

Compositional ambiguity occurs when multiple identical copies of the same protein are present in the sample, and a cross-link cannot be attributed to one copy or another.

To make the example more interesting, suppose we already have an identification ambiguity, so that we can see how the two kinds of ambiguity combine. In the data below, note that the first two cross-links share the same UniqueID:

xldb='''Protein 1,Protein 2,Residue 1,Residue 2,UniqueID,Score
ProtA,ProtB,1,10,1,1.0
ProtA,ProtB,1,11,1,2.0
ProtB,ProtA,21,1,2,2.0
ProtA,ProtA,1,2,3,3.0
'''
with open('xlinks.csv', 'w') as xlf:
    xlf.write(xldb)

We will first create a database:

cldbkc = IMP.pmi.io.crosslink.CrossLinkDataBaseKeywordsConverter()
cldbkc.set_protein1_key("Protein 1")
cldbkc.set_protein2_key("Protein 2")
cldbkc.set_residue1_key("Residue 1")
cldbkc.set_residue2_key("Residue 2")
cldbkc.set_unique_id_key("UniqueID")
cldbkc.set_id_score_key("Score")
cldb = IMP.pmi.io.crosslink.CrossLinkDataBase(cldbkc)
cldb.create_set_from_file("xlinks.csv")

Now suppose we know that there are two copies of ProtA, which we called ProtA.1 and ProtA.2 in our IMP hierarchy. Let's first rename ProtA to ProtA.1 at both ends of each cross-link:

from IMP.pmi.io.crosslink import FilterOperator as FO
import operator
fo1 = FO(cldb.protein1_key, operator.eq, "ProtA")
cldb.set_value(cldb.protein1_key, "ProtA.1", fo1)
fo2 = FO(cldb.protein2_key, operator.eq, "ProtA")
cldb.set_value(cldb.protein2_key, "ProtA.1", fo2)
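The filter used above is a triple of key, comparison operator and value, and the new value is written only to entries that pass the filter. The plain-Python sketch below mimics that idea on a list of dictionaries. It does not use IMP; set_value_where and the dictionary keys are hypothetical names chosen for illustration only.

```python
import operator

# Toy stand-in for the database: one dict per cross-link (protein names only).
rows = [
    {"Protein1": "ProtA", "Protein2": "ProtB"},
    {"Protein1": "ProtB", "Protein2": "ProtA"},
    {"Protein1": "ProtA", "Protein2": "ProtA"},
]


def set_value_where(rows, key, new_value, test_key, op, test_value):
    """Hypothetical helper: set row[key] where op(row[test_key], test_value)."""
    for row in rows:
        if op(row[test_key], test_value):
            row[key] = new_value


# Rename ProtA to ProtA.1 at the first end, then at the second end.
set_value_where(rows, "Protein1", "ProtA.1", "Protein1", operator.eq, "ProtA")
set_value_where(rows, "Protein2", "ProtA.1", "Protein2", operator.eq, "ProtA")
print(rows)
```

Two passes are needed because each end of a cross-link is stored under its own key, so a protein that appears at either end (or at both) is only fully renamed after both columns have been processed.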

Next we clone all cross-links involving ProtA.1 so that each of them is also assigned to ProtA.2:

cldb.clone_protein("ProtA.1", "ProtA.2")

Let's check that the database looks OK:

print(cldb)

Output

1
--- XLUniqueID 1
--- XLUniqueSubIndex 1
--- XLUniqueSubID 1.1
--- Protein1 ProtA.1
--- Protein2 ProtB
--- Residue1 1
--- Residue2 10
--- IDScore 1.0
--- Redundancy 1
--- RedundancyList ['1.1']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 1
--- XLUniqueSubIndex 2
--- XLUniqueSubID 1.2
--- Protein1 ProtA.2
--- Protein2 ProtB
--- Residue1 1
--- Residue2 10
--- IDScore 1.0
--- Redundancy 1
--- RedundancyList ['1.2']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 1
--- XLUniqueSubIndex 3
--- XLUniqueSubID 1.3
--- Protein1 ProtA.1
--- Protein2 ProtB
--- Residue1 1
--- Residue2 11
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['1.3']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 1
--- XLUniqueSubIndex 4
--- XLUniqueSubID 1.4
--- Protein1 ProtA.2
--- Protein2 ProtB
--- Residue1 1
--- Residue2 11
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['1.4']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
2
--- XLUniqueID 2
--- XLUniqueSubIndex 1
--- XLUniqueSubID 2.1
--- Protein1 ProtB
--- Protein2 ProtA.1
--- Residue1 21
--- Residue2 1
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['2.1']
--- Ambiguity 2
--- Residue1LinksNumber 2
--- Residue2LinksNumber 5
-------------
--- XLUniqueID 2
--- XLUniqueSubIndex 2
--- XLUniqueSubID 2.2
--- Protein1 ProtB
--- Protein2 ProtA.2
--- Residue1 21
--- Residue2 1
--- IDScore 2.0
--- Redundancy 1
--- RedundancyList ['2.2']
--- Ambiguity 2
--- Residue1LinksNumber 2
--- Residue2LinksNumber 5
-------------
3
--- XLUniqueID 3
--- XLUniqueSubIndex 1
--- XLUniqueSubID 3.1
--- Protein1 ProtA.1
--- Protein2 ProtA.1
--- Residue1 1
--- Residue2 2
--- IDScore 3.0
--- Redundancy 1
--- RedundancyList ['3.1']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 3
--- XLUniqueSubIndex 2
--- XLUniqueSubID 3.2
--- Protein1 ProtA.2
--- Protein2 ProtA.1
--- Residue1 1
--- Residue2 2
--- IDScore 3.0
--- Redundancy 1
--- RedundancyList ['3.2']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 3
--- XLUniqueSubIndex 3
--- XLUniqueSubID 3.3
--- Protein1 ProtA.1
--- Protein2 ProtA.2
--- Residue1 1
--- Residue2 2
--- IDScore 3.0
--- Redundancy 1
--- RedundancyList ['3.3']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------
--- XLUniqueID 3
--- XLUniqueSubIndex 4
--- XLUniqueSubID 3.4
--- Protein1 ProtA.2
--- Protein2 ProtA.2
--- Residue1 1
--- Residue2 2
--- IDScore 3.0
--- Redundancy 1
--- RedundancyList ['3.4']
--- Ambiguity 4
--- Residue1LinksNumber 5
--- Residue2LinksNumber 2
-------------

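As you can see, there are three unique indexes, 1, 2 and 3. The first index contains four cross-links (two identifications, each with two possible copies of ProtA), the second contains two cross-links (one identification with two possible copies), and the third contains four cross-links (ProtA appears at both ends, giving two times two combinations of copies).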
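The way the two ambiguities multiply can be reproduced with a short plain-Python sketch. It does not call IMP and is not the library's implementation; expand_copies is a hypothetical helper that replaces every occurrence of the renamed protein by each of its copies and enumerates all combinations of the two ends.

```python
from itertools import product

# Cross-links after renaming ProtA to ProtA.1:
# (protein1, residue1, protein2, residue2, unique_id)
links = [
    ("ProtA.1", 1, "ProtB", 10, 1),
    ("ProtA.1", 1, "ProtB", 11, 1),
    ("ProtB", 21, "ProtA.1", 1, 2),
    ("ProtA.1", 1, "ProtA.1", 2, 3),
]


def expand_copies(links, original, copies):
    """Hypothetical helper: enumerate every copy assignment for each link."""
    expanded = []
    for p1, r1, p2, r2, uid in links:
        ends1 = copies if p1 == original else [p1]
        ends2 = copies if p2 == original else [p2]
        for c1, c2 in product(ends1, ends2):
            expanded.append((c1, r1, c2, r2, uid))
    return expanded


expanded = expand_copies(links, "ProtA.1", ["ProtA.1", "ProtA.2"])

# Number of alternatives per UniqueID after expansion.
counts = {}
for *_, uid in expanded:
    counts[uid] = counts.get(uid, 0) + 1
print(counts)
```

The sketch yields four, two and four alternatives for UniqueIDs 1, 2 and 3, the same group sizes as in the listing. The order in which the alternatives are enumerated may differ from the order IMP prints them in.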
