Sequence Motif Search

Introduction

What is a sequence motif?

Sequence motifs are short, conserved segments of protein or nucleic acid sequence that are present in many proteins or genes, respectively, and are believed to have specific functional significance. In some cases the entire set of amino acids or nucleotides in the motif is conserved and required to perform the specific function, while in other cases only the amino acids or nucleotides at specific positions in the motif are conserved and significant for the function.

What is a Sequence Motif Search?

The sequence motif search option allows you to query for amino acid or nucleotide sequence fragments in a FASTA sequence that appear frequently in polymers present in 3D structures.

Why run a Sequence Motif Search?

Finding a specific sequence motif in a protein or nucleic acid suggests that it may have the function associated with the motif - i.e., it can be used to predict function(s).

Another reason to run a sequence motif search is that it differs from a regular sequence-based search in two ways:

  • The sequence defining the motif is short (so regular sequence-based searches will not work effectively)
  • Parts of the motif may have alternate sequences or may not be conserved at all (so specific conditions have to be included in the query to define the non-contiguous conserved amino acids/nucleotides in the motif)

Documentation

The sequence motif search options are available from the Advanced Search Query builder (Figure 1).

Figure 1: Interface to specify a sequence motif search for Protein, DNA, or RNA sequences in different formats and find polymer entities that match the query. If appropriate, turn on the toggle switch to include Computed Structure Models (CSMs) in the search.

Query options

  • Sequence motif searches can be run for protein, DNA, or RNA sequences. Select the type of query sequence using the Sequence Type options.
  • The actual sequence motif can be specified using three types of syntax:
    • Simple: Search for sequence fragments using IUPAC one-letter codes for amino acids like MQTIF. Use the symbol ‘X’ to allow any amino acid at a position. E.g., a query for SH3 domains using the sequence -X-P-P-X-P (where X is a variable residue and P is Proline) can be expressed as: XPPXP.
    • PROSITE: Complex queries can be expressed using PROSITE patterns. For details, see the PROSITE definitions.
    • RegEx: Regular expressions are supported as an alternative representation of complex queries. For instance:
      • A fixed number of variable residues is specified by the {n} notation, where n is the number of variable residues. To query a motif with seven variable residues between W and G and twenty variable residues between G and L, use the following notation: W.{7}G.{20}L
      • Variable ranges are expressed by the {n,m} notation, where n is the minimum and m the maximum number of repetitions. For example, the zinc finger motif that binds Zn in a DNA-binding domain can be expressed as: C.{2,4}C.{12}H.{3,5}H
      • The '^' operator searches for sequence motifs at the beginning of a protein sequence. The following two queries find sequences with N-terminal Histidine tags: ^HHHHHH or ^H{6}
      • Square brackets specify alternative residues at a particular position. The Walker (P loop) motif that binds ATP or GTP can be expressed as: [AG]....GK[ST] (A or G are followed by 4 variable residues, then G and K, and finally S or T)
  • Before running the search remember to do the following:
    • change the result return option to Polymer entities
    • decide whether to include CSMs (default option) or exclude them (by turning off the toggle switch next to the Search button).
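
The query syntaxes above can be tried out locally before running a search. The following is a minimal sketch using Python's standard re module, not the search service itself: it assumes that 'X' is the only wildcard in the Simple syntax, and the helper names (simple_to_regex, find_motifs) and the toy sequence are made up for illustration. The search service may interpret patterns differently in edge cases.

```python
import re


def simple_to_regex(motif):
    """Translate a Simple-syntax protein motif into a regular expression.

    Assumption: 'X' is the only wildcard; every other letter is literal.
    """
    return "".join("." if ch == "X" else re.escape(ch) for ch in motif.upper())


# Patterns taken from the query options described above.
PATTERNS = {
    "SH3 (Simple)": simple_to_regex("XPPXP"),
    "fixed gaps": r"W.{7}G.{20}L",
    "zinc finger": r"C.{2,4}C.{12}H.{3,5}H",
    "N-terminal His tag": r"^H{6}",
    "Walker (P loop)": r"[AG]....GK[ST]",
}


def find_motifs(sequence):
    """Return, for each named pattern, the substrings it matches in `sequence`."""
    return {
        name: [m.group(0) for m in re.finditer(pattern, sequence)]
        for name, pattern in PATTERNS.items()
    }


# Hypothetical toy sequence, not a real protein.
toy = "HHHHHHMSAPPLPGAGSGKGKSTQ"
hits = find_motifs(toy)
for name, matches in hits.items():
    print(name, matches)
```

Note that re.finditer reports non-overlapping matches only, so a local check like this can undercount when motif occurrences overlap.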

Result options

The search results display the numbering for the sequence match region (corresponding to PDBx/mmCIF file numbering) (Figure 2). Click on the 3D View button included for each matched result to view the structure interactively in 3D. The matched region specified in the results can be examined closely.

Figure 2: Part of the query results page for a sequence motif search, showing the regions of the polymer entity that match the query sequence motif in a red box. Clicking on the 3D View button marked with red arrows opens the structure in Mol*.
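
To relate a reported match region to a sequence you have locally, it can help to compute match positions yourself. The sketch below assumes the reported numbering is a 1-based, inclusive position along the polymer entity sequence; that is an assumption made for illustration, and the function name match_regions and the toy sequence are hypothetical.

```python
import re


def match_regions(pattern, sequence):
    """Return (start, end, matched_text) tuples with 1-based, inclusive positions."""
    regions = []
    for m in re.finditer(pattern, sequence):
        # re reports 0-based, end-exclusive spans. Adding 1 to the start
        # makes both ends 1-based and inclusive.
        regions.append((m.start() + 1, m.end(), m.group(0)))
    return regions


# Hypothetical toy sequence, not a real protein.
toy = "MSAPPLPGAGSGKGKSTQ"
regions = match_regions(r"[AG]....GK[ST]", toy)
for start, end, text in regions:
    print(f"{start}-{end}: {text}")
```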

Examples

  • Query for SH3 domains - use the Simple format sequence “XPPXP” (where X is a variable residue and P is Proline)
  • Query for a specific pattern of sequence - use the PROSITE format sequence motif query “[AC]-x-V-x(4)-{ED}” which translates to [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}.
  • Query for the Walker (P loop) motif that binds ATP or GTP - use the RegEx format sequence motif query “[AG]....GK[ST]” (where A or G is followed by 4 variable residues, then G and K, and finally S or T)
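
The PROSITE and RegEx formats express largely the same ideas, so a PROSITE pattern can often be rewritten as a regular expression. The following is a minimal sketch of such a conversion, assuming only the basic PROSITE elements: single residues, x for any residue, [..] alternatives, {..} exclusions, (n) and (n,m) repeats, and the < and > terminus anchors. The function name prosite_to_regex is hypothetical, and this is not how the search service processes patterns.

```python
import re

# One PROSITE element: a residue specification, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # '<' and '>' anchor the pattern to the N- and C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        core = element.lstrip("<").rstrip(">")
        m = ELEMENT.fullmatch(core)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        residue, low, high = m.groups()
        if residue in ("x", "X"):
            residue = "."  # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # any residue except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(prefix + residue + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)
print(re.search(regex, "GGAKVLLLLQGG"))
```

Elements outside this subset raise a ValueError rather than being guessed at, which keeps the sketch honest about what it does not cover.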


Please report any encountered broken links to info@rcsb.org
Last updated: 9/21/2022