Structure Similarity Search
Introduction
The functions of biological molecules follow their form (or shape). This in turn means that molecules with similar shapes or structures tend to have similar functions. The number, size, and complexity of experimental structures in the Protein Data Bank (PDB) continue to grow each year. Many of the experimental structures are assemblies of multiple proteins or multiple copies of a protein. The assembly coordinates may be specific subsets of the model (deposited) coordinates, or may be derived by applying specific types of symmetry operations. Having to query both deposited and assembly coordinates makes finding structurally similar proteins and assemblies challenging.
RCSB.org also offers access to more than a million computed structure models (CSMs). The coordinates of these models do not include any symmetry-related information, so the model and assembly coordinates are identical and are included by default in structure-based searches.
What is Structure Similarity Search?
The Structure Similarity Search option allows you to query the PDB archive using the 3D shape of a protein structure. This method, developed by RCSB PDB (Guzenko et al., 2020), treats proteins as volumes of space filled by atoms (i.e., a density distribution) instead of as a collection of atomic coordinates and chain connectivities. The protein volumes are decomposed using a mathematical tool known as 3D Zernike polynomials and are described as vectors of Zernike moments. This approach describes volumes with compact descriptors that are invariant to rotation and translation (Novotni and Klein, 2004). The search uses BioZernike descriptors to capture the global volumetric shape of the protein and assess global 3D-shape similarity. It is fast for both individual protein chains and assemblies.
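The idea of a descriptor that is invariant to rotation and translation can be illustrated with a toy example. The sketch below does not implement 3D Zernike moments or BioZernike descriptors. It swaps in a much simpler stand-in, a histogram of point distances from the centroid, purely to show what invariance means. The function names, bin count, and sample points are all hypothetical.

```python
import math

def radial_signature(points, bins=8, max_radius=10.0):
    """Toy shape descriptor: histogram of distances from the centroid.

    This is NOT a Zernike descriptor; it only shares the property of being
    unchanged when the point set is rotated or translated.
    """
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    hist = [0] * bins
    for p in points:
        r = math.dist(p, centroid)
        idx = min(int(r / max_radius * bins), bins - 1)
        hist[idx] += 1
    return hist

def rotate_z_and_shift(points, angle, shift):
    """Rotate points about the z axis by `angle` radians, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]

# Hypothetical "atoms"; real structures have thousands of them.
points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 3.2, 0.0),
          (0.0, 0.0, 4.7), (2.1, 2.1, 2.1)]
moved = rotate_z_and_shift(points, math.radians(40), (5.0, -3.0, 2.0))

# The descriptor is the same before and after the rigid-body move.
same = radial_signature(points) == radial_signature(moved)
```

Because the descriptor does not depend on orientation, two structures can be compared without first superposing them. This is one reason descriptor-based searches can be fast.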
Why run a Structure Similarity Search?
Finding and classifying structures in the PDB is fundamental to understanding functional and evolutionary relationships. While sequence-based searches can reveal conserved domains in proteins, there are many examples in biology where the protein shapes (and functions) are similar despite sequence variations. In addition, the same protein may adopt more than one conformation, such as the open and closed forms of an enzyme. Sequence-based searches cannot distinguish such structures; finding them requires structure similarity search options.
Moreover, some proteins are stabilized and/or function as part of an assembly, where they interact with one or more copies of themselves or with other proteins. The structure similarity search option allows you to identify similar assemblies, enabling exploration of the shape and interactions of the protein (or its complex).
Documentation
Several options can be combined to run a structure- or shape-based search. They are described here under the Query and Results sections:
- Query - describes the options to input the query shape (e.g., a chain or assembly); define the search type (i.e., strict or relaxed); and specify the structural hierarchy for finding a match (i.e., either a chain or assembly)
- Results - describes the options available for what you wish to see on the results page
Query Options
Structure similarity searches can be launched for shapes similar to:
- a given polymer chain ID - the specified shape may either be matched to a polymer chain (default) or an assembly (needs to be specified)
- a given assembly ID - the specified assembly may either be matched to another assembly (default) or a polymer chain (needs to be specified)
For any structure similarity search it is possible to choose between two modes of matching using the drop down menu:
- Strict: use this if you want to be sure your matches are all relevant, at the risk of not finding some more distant matches
- Relaxed: use this if you want to be sure your matches include all similar structures, at the risk of bringing in some False Positives
Note that while the strict or relaxed options may be selected for the structure similarity searches launched from the Advanced Search panel, the searches launched from the Structure Summary Page automatically select the strict search option.
Along with specifying the Query shape, there are options available to specify what the input shape is compared to. These include:
- Assemblies: use this to match your query to complete assemblies (this is relevant if you are interested in the overall shape of a complex but not its composition)
- Chains: use this to match your query to individual chains of protein structures (this is particularly useful if you expect the chain to be part of a larger complex)
Reasonable defaults are set automatically: a search defined using an assembly ID will search for assemblies, and a search defined using a chain ID will search for chains. It may be worth changing these options if your query returns no results or does not return the expected results.
Note: The "Search for" options are independent of the “Return” options that can be specified in the bottom-left corner of the Advanced Query Builder. While the "Search for" option determines which structures a query finds, the “Return” options change how results are presented.
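The query options above (query shape, strict or relaxed mode, "Search for" target, and "Return" type) can also be written down as a structured request. The sketch below is a minimal illustration that assumes the JSON layout of the public RCSB Search API as best recalled here. The field names (`service`, `operator`, `target_search_space`, `asym_id`, `return_type`) and their values are assumptions not stated in this document and should be checked against the Search API documentation. The helper `build_structure_query` is hypothetical.

```python
# Assumed mappings from the options in this guide to Search API values.
OPERATORS = {"strict": "strict_shape_match", "relaxed": "relaxed_shape_match"}
SEARCH_SPACES = {"assemblies": "assembly", "chains": "polymer_entity_instance"}

def build_structure_query(entry_id, *, assembly_id=None, chain_id=None,
                          mode="strict", search_for=None, return_type="entry"):
    """Build a request body (as a dict) for a structure similarity search.

    Exactly one of `assembly_id` or `chain_id` defines the query shape.
    When `search_for` is omitted, the default mirrors the guide: assembly
    queries look for assemblies and chain queries look for chains.
    """
    if (assembly_id is None) == (chain_id is None):
        raise ValueError("give exactly one of assembly_id or chain_id")
    if assembly_id is not None:
        value = {"entry_id": entry_id, "assembly_id": str(assembly_id)}
        default_space = "assemblies"
    else:
        value = {"entry_id": entry_id, "asym_id": chain_id}
        default_space = "chains"
    return {
        "query": {
            "type": "terminal",
            "service": "structure",
            "parameters": {
                "value": value,
                "operator": OPERATORS[mode],
                # "Search for": decides which structures the query finds.
                "target_search_space": SEARCH_SPACES[search_for or default_space],
            },
        },
        # "Return": decides only how the hits are presented.
        "return_type": return_type,
    }

# An assembly query matched against chains, presented as polymer entities.
cross_query = build_structure_query(
    "1TRZ", assembly_id=1, search_for="chains", return_type="polymer_entity"
)
```

Keeping `target_search_space` and `return_type` as separate arguments reflects the independence of the "Search for" and "Return" options described in the note above.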
The Query can be defined for structures available from RCSB.org and also for structures not available from RCSB.org.
Query for structures available from RCSB.org
For these structures, both polymer chain and assembly based structure similarity searches can be launched from (a) the Advanced Search panel and (b) the Structure Summary page of the 3D structure.
Query using the 'Advanced Search' panel
The structure similarity search options are available from the “Advanced Search” panel and can be accessed by typing a PDB ID or an RCSB.org assigned CSM ID in the box listed under Structure Similarity (Figure 1).
Figure 1: Options for launching a Structure Similarity search from the Advanced Search Query builder.
Once a 3D structure ID (PDB ID for experimental structures or RCSB.org assigned CSM ID) is typed in the box, some additional options become available. These options allow you to specify the query shape by selecting the suitable Assembly ID or Chain ID and the hierarchy for the search - i.e., search for a matching assembly or polymer chain.
When you specify the query by entering a PDB ID and selecting an Assembly ID, the following options are possible:
- The option “Assemblies” is selected by default in the “Search for” pull-down menu. Decide whether to include or exclude CSMs, and click on the blue Search button with a green magnifying lens icon to launch the search. It is expected that you will also select “Assemblies” in the results "Return" options (Figure 2A). Other options may be selected as appropriate.
- If you wish to find a polymer chain that matches the specified assembly, select “Chain” in the “Search for” pull-down options and select the appropriate “Return” options. (See Example)
If an RCSB.org assigned CSM ID is used for this search, remember to turn on the Include CSM toggle switch (see Figure 2B). Note that for CSMs the assembly coordinates are the same as the model coordinates, so the assembly is denoted as the deposited assembly.
When you specify the query by entering a PDB ID and selecting a Chain ID, the following options are possible:
- Select the chain ID of the protein of interest in the query structure. The option “Chains” is selected by default in the “Search for” pull-down menu. Decide whether to include or exclude CSMs, and click on the blue Search button with a green magnifying lens icon to launch the search. It is expected that you will also select “Polymer Entities” in the results "Return" options (Figure 3A). Other options may be selected as appropriate.
- If you wish to find an assembly that matches the specified polymer chain, select “Assembly” in the “Search for” pull-down options and select the appropriate “Return” options. (See Example)
If an RCSB.org assigned CSM ID is used for this search, remember to turn on the Include CSM toggle switch (see Figure 3B).
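For scripted searches, the Include CSM toggle has a counterpart in the request body. The sketch below assumes that the public RCSB Search API controls this through a `request_options.results_content_type` list with the values `"experimental"` and `"computational"`. These names are recalled from that API and are not stated in this document, so treat them as assumptions. The helper `set_include_csms` and the sample payload are hypothetical.

```python
import copy

def set_include_csms(payload, include):
    """Return a copy of a search request with CSMs switched on or off.

    Assumed field: request_options.results_content_type, where
    "computational" stands for computed structure models (CSMs).
    """
    updated = copy.deepcopy(payload)  # leave the caller's dict untouched
    content = ["experimental", "computational"] if include else ["experimental"]
    updated.setdefault("request_options", {})["results_content_type"] = content
    return updated

# A chain-based query for illustration (field names are assumptions too).
chain_query = {
    "query": {
        "type": "terminal",
        "service": "structure",
        "parameters": {
            "value": {"entry_id": "1MBN", "asym_id": "A"},
            "operator": "strict_shape_match",
        },
    },
    "return_type": "polymer_entity",
}

with_csms = set_include_csms(chain_query, include=True)
without_csms = set_include_csms(chain_query, include=False)
```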
Query from the Structure Summary page
All 3D structures available from RCSB.org (experimental structures and CSMs) have a dedicated Structure Summary page that displays information about the entities and assemblies of that entry. To search for structures similar to any one polymer entity in the structure, click on the “Structure” link above the details listed for the macromolecule (Figure 4).
Figure 4: Options to launch a structure based search from the structure summary page (highlighted in a red box).
To search for assemblies similar to a specific assembly of the structure, click on the “Find Similar Assemblies” link below the snapshot of the assembly on the page (Figure 5).
Figure 5: Options to launch a search for an assembly from the structure summary page. Click on the link highlighted in the red box.
Query for Structures not available from RCSB.org
Query for Structures Available in Other Public Data Resources
This option may be used to find structures that are similar to a 3D structure included in a public data resource other than RCSB.org, e.g., AlphaFold, RoseTTAFold, or ESMFold predictions. The query structure is specified using a URL.
To use this feature, open the Advanced Search Query Builder and scroll to the options for Structure Similarity. Switch the input mode from “Entry ID” to “File URL” (Figure 6). Make sure the URL uses the “http” or “https” protocol. Specify the file format; the default is mmCIF, but BinaryCIF and PDB files are also supported. Select “Polymer entities” or "Structures" in the results Return options, as appropriate. Decide whether to include or exclude CSMs, and click on the blue Search button with a green magnifying lens icon to launch the search.
The search will be based on the deposited coordinates, also referred to as “asymmetric unit”. Note: this is different from the 3D experimental or CSM entry-ID-based query, which allows you to select a specific assembly or chain identifier for the search.
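A File URL query can be sketched as a request body in the same way as an entry-ID query. The example below checks the protocol requirement described above and builds the body. The `url` and `format` field names and the format codes (`cif`, `bcif`, `pdb`) are assumptions based on the public RCSB Search API and should be verified there. The helper `build_url_query` and the example URL are hypothetical.

```python
from urllib.parse import urlparse

# Assumed codes for the three file formats named in this guide.
FORMAT_CODES = {"mmCIF": "cif", "BinaryCIF": "bcif", "PDB": "pdb"}

def build_url_query(file_url, file_format="mmCIF", mode="strict",
                    return_type="entry"):
    """Build a structure similarity request for a file referenced by URL."""
    scheme = urlparse(file_url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"URL must use http or https, got {scheme!r}")
    if file_format not in FORMAT_CODES:
        raise ValueError(f"unsupported format: {file_format}")
    operator = "strict_shape_match" if mode == "strict" else "relaxed_shape_match"
    return {
        "query": {
            "type": "terminal",
            "service": "structure",
            "parameters": {
                # No assembly or chain identifier is given here: the guide
                # states that such searches use the deposited coordinates.
                "value": {"url": file_url, "format": FORMAT_CODES[file_format]},
                "operator": operator,
            },
        },
        "return_type": return_type,
    }

# Hypothetical URL, used only to show the shape of the request.
url_query = build_url_query("https://example.org/models/my_model.cif")
```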
Note: In CSM structures with local low confidence regions, i.e., for CIF files from AlphaFold, RoseTTAFold, ESMFold, where the `ma_qa_metric_local` cif category is present and the local pLDDT scores are less than 70, a pre-filtering step is applied to remove these regions from the query. Excluding such unstructured or highly flexible regions of CSMs can reduce the number of false positives and negatives in the query results.
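The pre-filtering rule in the note above amounts to a threshold on per-residue confidence. The sketch below illustrates that rule on a plain list of (residue number, pLDDT) pairs. It is not the server-side implementation and does not parse the `ma_qa_metric_local` category; the function name and sample values are hypothetical.

```python
PLDDT_CUTOFF = 70.0  # threshold stated in the note above

def drop_low_confidence(residues, cutoff=PLDDT_CUTOFF):
    """Keep residues whose local pLDDT is at or above the cutoff.

    `residues` is a list of (residue_number, plddt) pairs. Residues with
    scores below the cutoff are removed, mirroring the described filter.
    """
    return [(number, score) for number, score in residues if score >= cutoff]

# Made-up scores: a confident core with a low-confidence tail.
model = [(1, 35.2), (2, 48.9), (3, 71.0), (4, 92.4), (5, 88.1), (6, 69.9)]
kept = drop_low_confidence(model)
kept_numbers = [number for number, _ in kept]
```

In this made-up model, residues 1, 2, and 6 fall below the cutoff and would not contribute to the query shape.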
Figure 6: Structure Similarity Search options using a File URL to specify a non-RCSB.org 3D structure as a Query.
Query for Structures Available on Your Local Drive
This option may be used to specify custom queries by uploading your own local file to search for structures that are similar to the shape of the molecule in your file.
To use this feature, switch the input mode to “File Upload”. This will give you a menu that allows you to select a file from your file system (Figure 7A). Files with “.cif”, “.bcif”, “.pdb”, and “.ent” extensions as well as their gzipped (“.gz”) variants are supported. Once you choose the file, it is automatically uploaded to our servers and the input mode switches to “File URL” (Figure 7B). Your file will be referenced by a unique URL. This random URL cannot be guessed by other users; however, please note that your data is accessible to anyone who knows this URL.
The uploaded file will be available for 90 days, which means you can bookmark your search or share it with colleagues for a limited amount of time. If you want to persistently reference a search in a publication, blog post, or similar, you should upload your file to a file sharing service like Dropbox or Google Drive and reference it using the “File URL” feature. The same goes for queries that will be stored in MyPDB. The maximum supported file size is 10 MB; larger files also require the use of an external file sharing service.
Figure 7: Structure Similarity Search options defined by (A) uploading a file from a local drive to (B) create a temporary web link.
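The upload constraints described above (supported extensions, gzipped variants, and the size limit) can be checked locally before a file is selected. The sketch below is a hypothetical pre-flight check, not part of any RCSB.org tool. It assumes that "10 MB" means 10 × 1024 × 1024 bytes, which this guide does not specify.

```python
SUPPORTED_EXTENSIONS = (".cif", ".bcif", ".pdb", ".ent")
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumption: 10 MB read as 10 MiB

def upload_problem(filename, size_bytes):
    """Return a short description of why a file cannot be uploaded,
    or None if it meets the constraints listed in this guide."""
    name = filename.lower()
    if name.endswith(".gz"):
        name = name[:-3]  # gzipped variants of supported files are accepted
    if not name.endswith(SUPPORTED_EXTENSIONS):
        return "unsupported file extension"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "file larger than 10 MB; use an external file sharing service"
    return None

ok = upload_problem("model.cif.gz", 2_000_000)
bad_type = upload_problem("model.xyz", 1_000)
too_big = upload_problem("complex.pdb", 25 * 1024 * 1024)
```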
Note: Uploaded files behave like external files referenced by their URL and the same limitations regarding the handling of assemblies and regions with low pLDDT confidence values will apply.
Results
Depending on the options selected, structure similarity search results list similar entities or assemblies.
For entity based searches, each matched entity can be superposed on the query entity and viewed in 3D using the pairwise alignment tool by clicking on the View button next to “Structure Match” (Figure 8).
Note: The View button is only available if searching for structures using a 3D experimental or CSM entry-ID-based query.
For assembly based searches, each matched assembly is assigned a Structure Match Score: the probability, expressed as a percentage, that it matches the query structure. A score of 100 indicates a perfect match, while lower scores indicate lower degrees of similarity between the assemblies (Figure 9).
Note: Results for searches uploaded using the “File URL” or “File Upload” options are treated as assembly based searches. Since the results returned by these searches are assemblies, a Structure Match Score is reported.
Limitations of Structure Similarity Search
The structure similarity search system has some limitations:
- The method cannot report an RMSD because it produces only a globally optimal superposition of the volumes and has no information about which residues are paired in the alignment. Instead, the method outputs a score that indicates the likelihood that the match is relevant.
- Highly symmetric assemblies often produce false positives (with lower scores), e.g., searching for a D3 point-group symmetric assembly will likely match a few unrelated D3 assemblies with lower scores.
- Flexible NMR structures will often be unmatched because of their long flexible tails.
- Long protruding tails will cause otherwise globally similar shapes to fail to match.
- The matching is global, thus local similarities are not found. For example:
- when searching for chains: 2 chains that are similar only in some common domain will usually not match,
- when searching for assemblies: 2 assemblies that are similar in some subset of chains but not globally will usually not match.
Examples
1. Search for entities similar to Myoglobin
- Launch this search from the Advanced Search interface for PDB ID 1mbn, Chain ID A
- Select the strict search option, Display results as Polymer Entities, include CSMs, and launch the search (Figure 10)
Figure 10: Options to run a structure based search for chain ID A in PDB entry 1mbn, to return polymer entities. The search includes CSMs.
- The search results show many myoglobin entities, some hemoglobin entities, a few neuroglobin entities, and some other entities.
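For readers who want to script Example 1, the sketch below writes the same choices (PDB ID 1mbn, Chain ID A, strict mode, polymer entities returned, CSMs included) as a request body and encodes it into a URL. It makes no network call. The endpoint URL, the `json` query parameter, and all field names are assumptions recalled from the public RCSB Search API and are not stated in this guide. The chain ID shown in the interface is also assumed to correspond to the `asym_id` field.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoint of the RCSB Search API; verify before use.
SEARCH_ENDPOINT = "https://search.rcsb.org/rcsbsearch/v2/query"

myoglobin_query = {
    "query": {
        "type": "terminal",
        "service": "structure",
        "parameters": {
            "value": {"entry_id": "1MBN", "asym_id": "A"},
            "operator": "strict_shape_match",          # strict search option
            "target_search_space": "polymer_entity_instance",  # search for chains
        },
    },
    "return_type": "polymer_entity",                   # display as polymer entities
    "request_options": {
        "results_content_type": ["experimental", "computational"],  # include CSMs
    },
}

def to_get_url(payload):
    """Encode a request body into a URL with a single `json` parameter."""
    compact = json.dumps(payload, separators=(",", ":"))
    return SEARCH_ENDPOINT + "?" + urlencode({"json": compact})

search_url = to_get_url(myoglobin_query)

# Decoding the URL again recovers the original request body.
round_trip = json.loads(parse_qs(urlparse(search_url).query)["json"][0])
```

The resulting URL could be passed to any HTTP client. The response format is not covered here.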
2. Search for entities that are conformationally similar to the open form of hexokinase
- Use a structure of the enzyme hexokinase in an “open” conformation as a query. Launch this search from the Advanced Search interface for PDB ID 2yhx, Chain ID A (Figure 11)
- Select the strict search option, Display results as Polymer Entities, include CSMs, and launch the search.
Figure 11: Options for searching structures that are conformationally similar to the open form of hexokinase.
- The search results show other hexokinase and related proteins. Note that the better matches are hexokinase entities with an open conformation while the matches listed towards the end of the result list include the same or related enzyme entities in the closed conformation.
3. Search for assemblies similar to the SARS-CoV-2 Spike protein trimer
- The SARS-CoV-2 spike protein is composed of three polymer chains, each of which has a receptor-binding domain that can be in an open (or up) conformation for interacting with cellular receptors or a closed (or down) conformation. The Structure Similarity Search functionality can be used to identify spike structures that have a similar arrangement of these domains.
- To find spike structures where all three receptor-binding domains are closed, launch the structure similarity search from the Structure Summary page for the PDB ID 6vxx, Biological Assembly 1 (Figure 12).
Figure 12: Options to search for structures with the same assembly from the structure summary page of PDB ID 6vxx.
- The search results show similar spike protein assemblies with closed conformations.
4. Search for assemblies similar to Insulin hexamers
- Launch this search from the Structure Summary page for the PDB entry 1trz, Biological Assembly 3 (Figure 13)
Figure 13: Options to launch a structure (assembly) based search from the structure summary page of PDB ID 1trz.
The search results show many other similar insulin assemblies, and some unrelated structures with Structure Match Scores of about 12%.
5. Search for single chain Insulin with a shape similar to mature Insulin (composed of two polymer chains)
- Launch this search from the Advanced Search interface for PDB ID 1trz, Assembly ID 1
- Select the strict search option and the Chains option in the “Search for” pull down menu
- Display results (i.e., set the “Return” option) as Polymer Entities.
Figure 14: Advanced Search Query Builder options to launch a search for a single chain that matches an assembly in PDB entry 1trz.
The search results show a number of single chain Insulin molecules.
6. Search for assemblies similar to the Chymotrypsin polymer
- Launch this search from the Advanced Search interface for PDB ID 1k2i, Chain ID A
- Select the strict search option, and the Assemblies option in the “Search for” pull down menu
- Display results (i.e., set the “Return” options) as Assemblies, include CSMs, and launch the search.
Figure 15: Advanced Search Query Builder options to launch a search for assemblies that match a chain in PDB entry 1k2i.
The search results show many other chymotrypsin structures. Some are single chains and some are assemblies (composed of multiple polymer chains), but all match the overall shape specified in the query.
References
- Guzenko, D., Burley, S. K., & Duarte, J. M. (2020). Real time structural search of the Protein Data Bank. PLoS Computational Biology, https://doi.org/10.1371/journal.pcbi.1007970
- Novotni, M., & Klein, R. (2004). Shape retrieval using 3D Zernike descriptors. Comput. Aided Des., 36, 1047-1062, https://doi.org/10.1016/j.cad.2004.01.005