This project was a collaboration with Jodi Schwarz and Joseph Azofeifa to create a set of Hidden Markov Model databases that would outperform existing methods for detecting Leucine-Rich Repeats (LRRs) in protein sequences, in particular LRRs in G Protein-Coupled Receptors (LGRs). The two families we were interested in were Toll-like and Follicle Stimulating Hormone Receptors (FSHR). This work is the subject of a pending scientific publication.

Installation
============

Requirements:
---

Perl 5.12 or later @ http://perldoc.perl.org

HMMER 3.0 @ http://hmmer.janelia.org/

TMHMM 2.0 @ http://www.cbs.dtu.dk/services/TMHMM-2.0/

(TMHMM is not executed by this program, but its output is passed in as a parameter. Because versions for different operating systems may differ in output format, note that we use decodeanhmm.Linux_x86_64 with the option model file tmhmm-2.0c/lib/TMHMM2.0.model.)

Unzip the compressed archive and leave the underlying file structure intact. The scripts require no modification to run. To run the program from locations other than its installation directory, add the installation directory to your system path. Otherwise, it must be run from its installation directory.

Usage
---

To run LRR_Discoverr.pl:

    perl LRR_Discoverr.pl

Options

    -e  Sets the E-value threshold for reporting domains, for all models
    -c  Sets the number of CPUs to run hmmscan with

Query File

Contains one or more amino acid sequences in FASTA format.

TMHMM Result File

The results of an execution of TMHMM 2.0 on the Query File. It should resemble the following:

    >a00136801.t1
    %len 337
    %lett A:29 C:9 D:3 E:22 F:22 G:21 H:6 I:24 K:16 L:30 M:10 N:11 P:16 Q:8 R:16 S:21 T:23 V:25 W:11 Y:14
    %score BG 1433.732113 (4.254398 per character)
    %score FW 1391.588186 (4.129342 per character)
    %score NB(0) 1399.491262 (4.152793 per character)
    %score LO(0) 34.240851 (0.101605 per character)
    %pred NB(0): o 1 46, M 47 69, i 70 88, M 89 108, o 109 122, M 123 145, i 146 164, M 165 187, o 188 196, M 197 219, i 220 247, M 248 270, o 271 284, M 285 307, i 308 337
    MTSEPEPEHHYNHTSAPETEPESSVYEPTAEAEAEPLPEWSKATEEWGIAWDIHQYGLGGVYTLLFLFITMS
    ?0 ooooooooooooooooooooooooooooooooooooooooooooooMMMMMMMMMMMMMMMMMMMMMMMiii
    LIKRIKQGRTGGQGHKVPMVVLSLLGMFCLTRGLCLCIDAYRWKKIMPVFFVNVFWGIGQPCIISAYTLVFI
    ?0 iiiiiiiiiiiiiiiiMMMMMMMMMMMMMMMMMMMMooooooooooooooMMMMMMMMMMMMMMMMMMMMMM
    VMRNALTLKQNFRRWYTTRNIAIATLPYFIFAFGAELTLSFAPSFKGIAFTCQLLYILYGSSLSVFYSMISF
    ?0 MiiiiiiiiiiiiiiiiiiiMMMMMMMMMMMMMMMMMMMMMMMoooooooooMMMMMMMMMMMMMMMMMMMM
    LLWKKLKVATKNRWNSESANRCGKRTRTIFRTCVAAVFGGIAICAMQLFAMIGVYGIFSEARHVSAWPWWAF
    ?0 MMMiiiiiiiiiiiiiiiiiiiiiiiiiiiiMMMMMMMMMMMMMMMMMMMMMMMooooooooooooooMMMM
    QTLFRVVEIYMVVVLCYAVNDRNVEAKKGEIAPTSLNSETPVKPLEVEA
    ?0 MMMMMMMMMMMMMMMMMMMiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

Output
---

Several files and directories are created in the directory where the Query File is located.

_NameToNumber.fa

The query file with each sequence name translated into a unique integer code. This ensures that HMMER does not remove vital identification information from the result file it produces.

_NameToNumber.index

This file contains the relationship between query sequence names and the IDs assigned to them. It is used to translate IDs back into their given names in later result files.

Toll-like_HMMDB_ _NameToNumber.fa [Directory]

This directory contains the results of scanning the query file, with encoded sequence names, against the Toll-like Hidden Markov Model Database. It contains several files, which are described below.

Toll-like_HMMDB_ _NameToNumber.fa/ _NameToNumber.fa.hmmscan

This file contains the raw results of running hmmscan from HMMER on the query sequences against the Toll-like HMMDB. It contains all predicted domains for all sequences. Sequence names are encoded to ensure they will not be cut off.

Toll-like_HMMDB_ _NameToNumber.fa/ _NameToNumber.fa_dom_table.hmmscan

This file contains only the domain records, in compressed format. It is used for reducing overlap between models for each sequence queried. Sequence names are encoded to ensure they will not be cut off.

Toll-like_HMMDB_ _NameToNumber.fa/ _NameToNumber.fa_dom_table.hmmscan.reduced

This file contains only records of domains that do not overlap completely. Overlapping domains have their E-values compared, and the domain with the greater E-value is discarded. Sequence names are encoded to ensure they will not be cut off.

Toll-like_HMMDB_ _NameToNumber.fa/ _NameToNumber.fa_dom_table.hmmscan.reduced.reindexed

Decodes the sequence names from _NameToNumber.fa_dom_table.hmmscan.reduced so that the actual sequence names are present. This is needed to cross-reference with the TMHMM results.

Toll-like_HMMDB_ _NameToNumber.fa/ _NameToNumber.fa_dom_table.hmmscan.reduced.reindexed.gpcr_predictions.tsv

Cross-references the TMHMM results with the Toll-like HMMDB hits. If a sequence has 4 or more TMHMM transmembrane zones and one or more LRRs, it is reported in this file in the following format:

    sequence name   # TM regions ~ First TM position ~ Last TM position:   LRRs

where each LRR is listed as:

    Domain Start   Domain End   E Value,

For example:

    a00549601.t1 4~73~539: 864 878 0.00094, 806 820 0.0098, 770 792 0.0017, 833 850 0.0021,

Regex:

    /(.+)\t(([0-9]+)~([0-9]+)~([0-9]+)):\t((([0-9]+)\t([0-9]+)\t([0-9e-]+),)+)/

This filter is deliberately permissive (it accepts fewer than 7 TM regions) so that hits that are not yet classified, or that are more distantly related, can still appear in our results. With that in mind, predictions can be strengthened by manually examining the alignment.

Toll-like_HMMDB_ _NameToNumber.fa/ _NameToNumber.fa_dom_table.hmmscan.reduced.reindexed.gpcr_predictions.tsv_hits_only.fa

For convenience, the sequences that appear in _NameToNumber.fa_dom_table.hmmscan.reduced.reindexed.gpcr_predictions.tsv are written to this file in FASTA format, with resolved sequence names.

FSHR_HMMDB_ _NameToNumber.fa [Directory]

This directory contains the same file types as the Toll-like directory, except that they are derived from scans against the FSHR HMMDB.
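
The overlap reduction that produces the `.reduced` file can be sketched as follows. This is an illustrative Python sketch, not the code that LRR_Discoverr.pl runs. It assumes that "overlap completely" means one domain's coordinates lie entirely within another's, and the names `Domain`, `contains` and `reduce_overlaps` are hypothetical.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    """One predicted domain on a query sequence (hypothetical record layout)."""
    start: int
    end: int
    evalue: float


def contains(outer: Domain, inner: Domain) -> bool:
    """True when `inner` lies entirely within `outer`.

    Assumption: this is what "overlap completely" means in the README.
    """
    return outer.start <= inner.start and inner.end <= outer.end


def reduce_overlaps(domains: List[Domain]) -> List[Domain]:
    """Drop the greater E-value domain of every completely overlapping pair."""
    kept: List[Domain] = []
    # Visit the lowest E-value domains first so they win any conflict.
    for dom in sorted(domains, key=lambda d: d.evalue):
        if not any(contains(k, dom) or contains(dom, k) for k in kept):
            kept.append(dom)
    # Report the survivors in sequence order.
    return sorted(kept, key=lambda d: d.start)


if __name__ == "__main__":
    hits = [Domain(10, 30, 1e-5), Domain(12, 28, 1e-3), Domain(25, 45, 1e-4)]
    for dom in reduce_overlaps(hits):
        print(dom.start, dom.end, dom.evalue)
```

In this sketch the domain at 12 to 28 is discarded because it sits inside the better-scoring domain at 10 to 30, while the partially overlapping domain at 25 to 45 is kept.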
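
The reporting rule for gpcr_predictions.tsv (4 or more TM zones and one or more LRRs) can be illustrated with a short Python sketch. It is not the script's own code: it assumes the TM zones are the `M` segments on the `%pred` line of the TMHMM Result File, and the names `tm_regions`, `is_candidate` and `MIN_TM_REGIONS` are hypothetical.

```python
import re
from typing import List, Tuple

MIN_TM_REGIONS = 4  # threshold stated in this README


def tm_regions(pred_line: str) -> List[Tuple[int, int]]:
    """Return (start, end) for every 'M' (membrane) segment of a %pred line."""
    return [(int(s), int(e)) for s, e in re.findall(r"\bM (\d+) (\d+)", pred_line)]


def is_candidate(pred_line: str, lrr_count: int) -> bool:
    """Apply the README rule: at least 4 TM zones and at least one LRR hit."""
    return len(tm_regions(pred_line)) >= MIN_TM_REGIONS and lrr_count >= 1


if __name__ == "__main__":
    # The %pred line from the TMHMM example in this README.
    pred = ("%pred NB(0): o 1 46, M 47 69, i 70 88, M 89 108, o 109 122, "
            "M 123 145, i 146 164, M 165 187, o 188 196, M 197 219, "
            "i 220 247, M 248 270, o 271 284, M 285 307, i 308 337")
    regions = tm_regions(pred)
    print(len(regions), regions[0], regions[-1])
    print(is_candidate(pred, lrr_count=1))
```

Whether the reported "First TM position" and "Last TM position" are segment starts or ends is not stated in this README, so the sketch returns both coordinates of every segment and leaves that choice to the caller.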
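
For downstream analysis, a line of gpcr_predictions.tsv can be read back with a small parser such as the Python sketch below. It is a sketch under assumptions rather than part of the distributed scripts: it uses its own, more tolerant patterns instead of the regex quoted in this README, accepts any whitespace where that regex expects tabs, and the names `Prediction` and `parse_prediction` are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple

# Sequence name, then "<# TM regions>~<first TM position>~<last TM position>:"
HEADER = re.compile(r"^(.+?)\s+(\d+)~(\d+)~(\d+):")
# One LRR entry: domain start, domain end, E-value, trailing comma.
LRR = re.compile(r"(\d+)\s+(\d+)\s+([0-9.eE+-]+),")


class Prediction(NamedTuple):
    name: str
    tm_count: int
    first_tm: int
    last_tm: int
    lrrs: List[Tuple[int, int, float]]  # (start, end, E-value)


def parse_prediction(line: str) -> Prediction:
    """Parse one line of gpcr_predictions.tsv (field separators assumed to be whitespace)."""
    header = HEADER.match(line)
    if header is None:
        raise ValueError("not a prediction line: %r" % line)
    lrrs = [(int(s), int(e), float(ev))
            for s, e, ev in LRR.findall(line[header.end():])]
    return Prediction(header.group(1), int(header.group(2)),
                      int(header.group(3)), int(header.group(4)), lrrs)


if __name__ == "__main__":
    example = ("a00549601.t1 4~73~539: 864 878 0.00094, 806 820 0.0098, "
               "770 792 0.0017, 833 850 0.0021,")
    pred = parse_prediction(example)
    print(pred.name, pred.tm_count, len(pred.lrrs))
```

The example line is the one shown in the format description for gpcr_predictions.tsv.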