Monday, January 16, 2017

The longest and shortest protein sequences in the PDB

Run the following commands in order to find the lengths of the longest and shortest protein sequences in the PDB:
  1. wget ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt
  2. grep "mol:protein" pdb_seqres.txt | cut -d' ' -f3 | cut -d':' -f2 | sort -n | head -n 1
  3. grep "mol:protein" pdb_seqres.txt | cut -d' ' -f3 | cut -d':' -f2 | sort -n | tail -n 1
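The Bash commands above do the following:
  1. Download every sequence in the PDB (FASTA format)
  2. Print the length of the shortest protein sequence, which is 2
  3. Print the length of the longest protein sequence, which is 5037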
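
Steps 2 and 3 each read the whole file and sort every length. As a rough sketch, both values can also be found in a single pass with awk. The function name pdb_length_range is made up for this example, and it assumes every header line has the form ">ID mol:protein length:N  NAME", as in the output shown in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "MIN MAX" protein chain lengths in one pass.
# Assumes header lines look like: >1abc_A mol:protein length:123  NAME
pdb_length_range() {
  awk '
    /^>/ && $2 == "mol:protein" {
      split($3, kv, ":")            # kv[1] = "length", kv[2] = the number
      len = kv[2] + 0               # force numeric comparison
      if (n == 0 || len < min) min = len
      if (n == 0 || len > max) max = len
      n++
    }
    END { if (n > 0) print min, max }
  ' "$1"
}
```

Calling pdb_length_range pdb_seqres.txt prints the two lengths on one line, separated by a space. Nothing is sorted, so memory use stays constant no matter how large the file is.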

Once the lengths are known, the following commands show which PDB entries contain the shortest and longest sequences:
  • grep "mol:protein" pdb_seqres.txt | grep "length:2 "
>1ahg_C mol:protein length:2  PHOSPHO-5'-PYRIDOXYL TYROSINE
>1ahg_D mol:protein length:2  PHOSPHO-5'-PYRIDOXYL TYROSINE
>1lgc_H mol:protein length:2  DIPEPTIDE
>1lgc_I mol:protein length:2  DIPEPTIDE
>1lgc_J mol:protein length:2  DIPEPTIDE
>4d2c_D mol:protein length:2  L-ALANINE-L-PHENYLALANINE
>4m6g_B mol:protein length:2  L-alanine-iso-D-glutamine

  • grep "mol:protein" pdb_seqres.txt | grep "length:5037 "
>3j8e_G mol:protein length:5037  Ryanodine receptor 1
>3j8e_C mol:protein length:5037  Ryanodine receptor 1
>3j8e_E mol:protein length:5037  Ryanodine receptor 1
>3j8e_H mol:protein length:5037  Ryanodine receptor 1
>4uwa_A mol:protein length:5037  RYANODINE RECEPTOR 1
>4uwa_B mol:protein length:5037  RYANODINE RECEPTOR 1
>4uwa_C mol:protein length:5037  RYANODINE RECEPTOR 1
>4uwa_D mol:protein length:5037  RYANODINE RECEPTOR 1
>4uwe_A mol:protein length:5037  RYANODINE RECEPTOR 1
>4uwe_B mol:protein length:5037  RYANODINE RECEPTOR 1
>4uwe_C mol:protein length:5037  RYANODINE RECEPTOR 1
>4uwe_D mol:protein length:5037  RYANODINE RECEPTOR 1
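
The two lookups above rely on the trailing space after the number so that, for example, length:2 does not also match length:20. A possible alternative is to compare the whole third field exactly. The sketch below is one way to do that; the function name pdb_chains_with_length is invented here, and it assumes the same header layout as the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list protein chain IDs whose length equals $2.
# Usage: pdb_chains_with_length FILE LENGTH
# Assumes header lines look like: >1abc_A mol:protein length:123  NAME
pdb_chains_with_length() {
  awk -v want="length:$2" '
    # Exact match on the third field, so length:2 never matches length:20.
    /^>/ && $2 == "mol:protein" && $3 == want {
      print substr($1, 2)           # drop the leading ">" from the chain ID
    }
  ' "$1"
}
```

For example, pdb_chains_with_length pdb_seqres.txt 5037 prints one chain ID per line, such as 3j8e_G, without the description column.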

_EOF_
