Sequence Clustering Update

New Features

05/03

The coverage requirement for sequence clustering of polymer entities has been relaxed from 90% to 80%

Sequence cluster groups enable exploration of sets of homologous sequences and can reveal trends across hundreds of related proteins.

RCSB.org offers data files that contain the results of the weekly clustering of protein sequences in the PDB by MMseqs2 at 30%, 40%, 50%, 70%, 90%, 95%, and 100% sequence identity. Note that these files use polymer entity identifiers instead of chain identifiers to avoid redundancy. The files are plain text with one cluster per line, sorted from largest to smallest cluster.
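As a rough illustration of the file layout described above, the following Python sketch reads cluster data with one cluster per line and builds a lookup from polymer entity identifier to cluster index. The sample contents, the whitespace separator between identifiers, the `1ABC_1`-style identifier format, and the function names are assumptions made for this example; check them against an actual downloaded file.

```python
from typing import Dict, List


def parse_clusters(text: str) -> List[List[str]]:
    """Parse cluster file contents: one cluster per line.

    Assumption: identifiers on a line are separated by whitespace.
    """
    clusters = []
    for line in text.splitlines():
        members = line.split()
        if members:  # skip blank lines
            clusters.append(members)
    return clusters


def index_by_entity(clusters: List[List[str]]) -> Dict[str, int]:
    """Map each polymer entity identifier to the index of its cluster."""
    return {
        entity_id: cluster_index
        for cluster_index, members in enumerate(clusters)
        for entity_id in members
    }


# Hypothetical sample data, not taken from a real cluster file.
SAMPLE = """\
1ABC_1 2DEF_1 3GHI_2
4JKL_1 5MNO_1
6PQR_1
"""

clusters = parse_clusters(SAMPLE)
sizes = [len(members) for members in clusters]
lookup = index_by_entity(clusters)

# The files are described as sorted from largest to smallest cluster.
is_sorted = all(a >= b for a, b in zip(sizes, sizes[1:]))
print(sizes, is_sorted, lookup["5MNO_1"])
```

Because each line is a complete cluster, the line position can serve as a cluster index within a single file, but it should not be treated as a stable identifier across weekly updates.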

The Advanced Search Group option also simplifies PDB searching by generating a non-redundant search result set based on sequence identity clustering (as well as on UniProt ID and group depositions).

The clustering requires a meaningful overlap between sequences (in addition to their sequence identity). This coverage requirement has been relaxed from 90% to 80%, which is the coverage threshold used by UniRef. This change addresses some unintuitive clusterings, where highly similar sequences were assigned to different clusters.
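To make the effect of the relaxed coverage requirement concrete, here is a minimal Python sketch of a pairwise check that combines an identity threshold with a coverage threshold. It is not the MMseqs2 algorithm: the function name, the definition of coverage as aligned length divided by sequence length, and the requirement that both sequences meet the threshold are assumptions chosen for illustration.

```python
def may_share_cluster(
    identity: float,
    aligned_length: int,
    length_a: int,
    length_b: int,
    min_identity: float,
    min_coverage: float,
) -> bool:
    """Return True if a pair meets both the identity and coverage thresholds.

    Assumption: coverage is the aligned length divided by the full sequence
    length, and it must hold for both sequences.
    """
    if length_a <= 0 or length_b <= 0:
        raise ValueError("sequence lengths must be positive")
    coverage_a = aligned_length / length_a
    coverage_b = aligned_length / length_b
    return identity >= min_identity and min(coverage_a, coverage_b) >= min_coverage


# Hypothetical pair: 98% identical over 170 aligned residues,
# with full lengths of 200 and 180 residues (coverage 0.85 and about 0.94).
pair = dict(identity=0.98, aligned_length=170, length_a=200, length_b=180)

old_rule = may_share_cluster(**pair, min_identity=0.95, min_coverage=0.9)
new_rule = may_share_cluster(**pair, min_identity=0.95, min_coverage=0.8)
print(old_rule, new_rule)
```

In this made-up case the pair is nearly identical where it aligns, yet the longer sequence is only 85% covered, so the 90% requirement keeps the two apart while the 80% requirement lets them group together.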

Consequently, the sequence clusters offered are slightly larger on average and fewer in number. Some group identifiers have changed. Redundancy-filtered result sets, which collapse similar polymer entities into groups, can now be navigated more efficiently because there are fewer groups to explore.

User guides are available for Grouping Structures and Sequence-based Clustering.


RCSB PDB Core Operations are funded by the U.S. National Science Foundation (DBI-2321666), the U.S. Department of Energy (DE-SC0019749), and the National Cancer Institute, National Institute of Allergy and Infectious Diseases, and National Institute of General Medical Sciences of the National Institutes of Health under grant R01GM157729.