资源说明:Trseeker
# Trseeker: a Python library and framework for working with various genomics data ## Contents - [Introduction](#_toolkit_intro) - [Settings](#_toolkit_settings) - [Models](#_models) - [DNA Sequence](#_models_seq) - [Tandem repeat](#_models_trf) - [Organism](#_models_organism) - [Dataset](#_models_dataset) - [Blast output](#_models_blast) - [Chromosome](#_models_chr) - [WGS assembly](#_models_wgs) - [Genome](#_models_genome) - [Kmer](#_models_kmer) - [Kmers](#_models_kmers) - [Readers](#_io) - [Tab file](#_io_tab) - [Block file](#_io_block) - [Fasta file](#_io_fasta) - [GBFF file](#_io_gbff) - [FTP](#_io_ftp) - [NCBI FTP](#_io_ncbi_ftp) - [MONGO DB](#_io_mongo) - [TRF output](#_io_trf) - [TRs file](#_io_trs) - [Ngram file](#_io_ngram) - [Blast output](#_io_blast) - [Fastq file](#_io_sra) - [Toolkit](#_tools) - [Annotation tools](#_annotation) - [Assembly tools](#_assembly_tools) - [Tulip](#_tools_tulip) - [TRs nomenclature](#_tools_trs_group) - [Sequence](#_tools_sequence) - [Various for files](#_tools_var) - [Various](#_tools_other) - [Kmers](#_tools_kmers) - [Edit distance](#_tools_ed) - [Repbase](#_tools_repbase) - [Trace](#_tools_trace) - [Statistics](#_tools_stat) - [Parsing](#_tools_parsers) - [TRF](#_tools_trf) - [TRs datasets](#_tools_trs) - [SRA datasets](#_tools_sra) - [Sequence patterns](#_tools_patterns) - [Blast](#_tools_blast) - [ESA](#_tools_sa) - [NCBI genomes](#_tools_ncbi_genome) - [Networks](#_tools_networks) - [Networks](#_tools_ncbi_ann) - [Jellyfish](#_tools_jellyfish) - [Trees](#_tools_tree) - [Classifiction TRs in types](#_trs_types) - [Working with Entrez NCBI databases](#_tools_entrez) ## Introduction Framework состоит из следующих частей: - модели данных - чтение/запись данных согласно моделям - инструменты работы с этими данными ## Настройки фреймворка В файле settings.py: ```python SETTINGS_FILENAME = "settings.yaml" NGRAM_LENGTH = 23 NGRAM_N = 100000000 ``` Настройки можно прочитать и записать: ```python from trseeker.settings import load_settings from trseeker.settings import save_settings settings_dict = load_settings() save_settings(settings_dict) ``` Настройки содержат следующие параметры: ```yaml trseeker: os: linux64 root_dir: /root/Dropbox/workspace/trseeker work_dir: /home blast_settings: blast_location: blast_location_NIX: /home/ncbi-blast-2.2.26+/bin/legacy_blast.pl bl2seq blast_location_WIN: c:\Program Files\NCBI\blast-2.2.26+\bin\legacy_blast.pl bl2seq.exe jellyfish_location: jellyfish repbase_db_folder: /home/rebase_blast_db/repbase blast_e: 1e-20 blast_b: 20000000 blast_v: 20000000 trf_settings: trf_location: /root/trf404.linux64 trf_match: 2 trf_mismatch: 5 trf_indel: 7 trf_p: 80 trf_q: 10 trf_threshold: 50 trf_length: 2000 trf_param_postfix: 2.5.7.80.10.50.2000 trf_masked_file: False trf_flanked_data: True trf_data_file: True trf_nohtml: True overlapping_cutoff: 10 overlapping_cutoff_proc: 30 overlapping_gc_diff: 0.05 ngrams_settings: ngram_length: 23 ngram_m: 10000000 ``` ## Avaliable Models ### DNA Sequence ```python from trseeker.models.sequence_model import SequenceModel ``` Attribites: - seq_gi (int) - seq_ref - seq_description - seq_sequence - seq_length (int) - seq_gc (float) - seq_revcom, reverce complement - seq_gapped (int) - seq_chr - seq_head - seq_start_position (int) - seq_end_position (int) Properties - length (self.seq_length) - sequence (self.seq_sequence) - fasta - header - contige_coverage if **_cov_** key in name ```python print seq_obj.fasta >>> ">[seq_ref]\n[seq_sequence]\n" ``` - sa_input ```python print seq_obj.sa_input >>> "[seq_sequence]$" ``` - ncbi_fasta ```python print seq_obj.ncbi_fasta >>> ">gi|[seq_gi]|ref|[seq_ref]|[seq_description]\n[seq_sequence]\n" ``` Methods: - add_sequence_revcom() - set_dna_sequence(self, title, sequence, description=None) ```python self.seq_ref = title ``` - set_ncbi_sequence(self, head, sequence) Chromosome name is **?** or setted with parse_chromosome_name(head). ```python (self.seq_gi, self.seq_ref, self.seq_description) = parse_fasta_head(head) ``` - set_gbff_sequence(self, head, sequence) Head is a dictionary with gi, ref, description keys. Chromosome name is **?** or setted with parse_chromosome_name(head["description"]). Sequence is cleared with clear_sequence(s) function. Lowercase and all non-DNA characters replacing with **n**. If the sequence has **n** then it is gapped. ### TRF results ```python from trseeker.models.trf_model import TRModel ``` Attributes: - project, project name - id (float) - trf_id (int) - trf_type, - trf_family, - trf_family_prob (float), - trf_l_ind (int) - trf_r_ind (int) - trf_period (int) - trf_n_copy (float) - trf_pmatch (float) - trf_pvar (float) - trf_entropy (float) - trf_consensus - trf_array - trf_array_gc (float) - trf_consensus_gc (float) - trf_gi - trf_head - trf_param - trf_array_length (int) - trf_chr - trf_joined (int) - trf_superfamily - trf_superfamily_self - trF_superfamily_ref - trf_family - trf_subfamily - trf_subsubfamily - trf_family_network - trf_family_self - trf_family_ref - trf_hor (int) - trf_n_chrun (int) - trf_chr_refgenome - trf_bands_refgenome - trf_repbase - trf_strand Methods - set_project_data(project), set self.project to given project - set_raw_trf(head, body, line), head, body and line from TRF parser - get_index_repr() ```python print trf_obj.get_index_repr() ''' Tab delimted string with \n-symbol: trf_id trf_period trf_array_length trf_pvar trf_gi trf_l_ind trf_r_ind trf_chr ''' ``` - get_numerical_repr() ```python print trf_obj.get_numerical_repr() >>> [trf_period]\t[trf_array_length]\t[trf_array_gc]\n ``` - get_fasta_repr(), where head is trf_obj.trf_id and sequence is trf_obj.trf_array - or **fasta** property - get_monomer_fasta_repr(), where head is trf_obj.trf_id and sequence is trf_obj.trf_consensus - get_family_repr() - get_gff3_string(self, chromosome=True, trs_type="complex_tandem_repeat", probability=1000, tool="PySatDNA", prefix=None, properties={"id":"trf_id", "family": "trf_family_self"}) ```python print trf_obj.get_family_repr() ''' Tab delimted string with \n-symbol: trf_id trf_period trf_array_length trf_array_gc trf_pvar trf_gi trf_l_ind trf_r_ind trf_chr trf_repbase trf_superfamily trf_family trf_subfamily ''' ``` For network slice added one more index - gid (group id) ```python from trseeker.models.trf_model import NetworkSliceModel slice_obj = NetworkSliceModel() ``` #### TRsClassificationModel(AbstractModel): Model for keeping classification data. Attributes: - project - id (int) - trf_id (int) - trf_period (int) - trf_array_length (int) - trf_array_gc - trf_type - trf_family - trf_subfamily - trf_family_prob (float) - trf_family_kmer - trf_subfamily_kmer - trf_family_self - class_ssr - class_tssr - class_sl - class_good - class_micro - class_100bp - class_perfect - class_x4 - class_entropy - class_gc - trf_consensus ```python class_obj = TRsClassificationModel() class_obj.set_with_trs(trf_obj) print class_obj.network_head() ``` ### Organism model ```python from trseeker.models.organism_model import OrganismModel ``` Attributes: - organism_taxon - organism_common_name - organism_acronym - organism_description - organism_wgs_projects - organism_genome_assemblies ### Dataset model ```python from trseeker.models.dataset_model import DatasetModel ``` Attributes: - dataset_taxon - dataset_id - dataset_sources - dataset_description - dataset_gc (float) - dataset_length (int) - dataset_trs_n (int) - dataset_trs_length (int) - dataset_trs_mean_gc (float) - dataset_trs_fraq (float) ### Blast Results Model ```python from trseeker.models.blast_model import BlastResultModel ``` Attributes: - query_id (int) - query_gi (int) - query_ref - subject_id - subject_gi(int) - subject_ref - query_start (int) - query_end (int) - subject_start (int) - subject_end (int) - evalue (float) - bit_score (flaot) - score (int) - alignment_length (int) - proc_identity (float) - identical (int) - mismatches (int) - positives (int) - gap_opens (int) - gaps (int) - proc_positives (float) - frames - query_frame (int) - subject_frame (int) - fraction_of_query (float) Additional functions: - read_blast_file(blast_file, length), return subject_ref -> list of matches (BlastResultModel models). ```python from trseeker.models.blast_model import read_blast_file ref_to_blast_obj = read_blast_file(file_name) ``` ### Chromosome model ```python from trseeker.models.chromosome_model import ChromosomeModel ``` Attributes: - chr_genome - chr_number - chr_taxon - chr_prefix - chr_gpid - chr_acronym - chr_contigs - chr_length - chr_mean_gc - chr_trs_all - chr_trs_3000 - chr_trs_all_proc - chr_trs_3000_proc - chr_trs_all_length - chr_trs_3000_length - genome_gaps - chr_sum_gc ### WGS assembly model ```python from trseeker.models.wgs_model import WGSModel ``` Attributes: - wgs_prefix - wgs_taxon - wgs_gpid - wgs_acronym - wgs_contigs (int) - wgs_length (int) - wgs_mean_gc (float) - wgs_trs_all (int) - wgs_trs_3000 (int) - wgs_trs_1000 (int) - wgs_trs_500 (int) - wgs_trs_all_proc (float) - wgs_trs_3000_proc (float) - wgs_trs_1000_proc (float) - wgs_trs_500_proc (float) - wgs_trs_all_length (int) - wgs_trs_3000_length (int) - wgs_trs_1000_length (int) - wgs_trs_500_length (int) - wgs_sum_gc (float) Methods: Clear trf information (set to 0): ```python wgs_obj.clear_trf() ``` ### Genome model ```python from trseeker.models.genome_model import GenomeModel ``` - genome_taxon - genome_prefix - genome_gpid - genome_acronym - genome_chromosomes - genome_contigs - genome_length - genome_mean_gc - genome_trs_all - genome_trs_3000 - genome_trs_all_proc - genome_trs_3000_proc - genome_trs_all_length - genome_trs_3000_length - genome_gaps - genome_sum_gc ### RepeatMasker track data ```python from trseeker.models.genome_model import RepeatMaskerAnnotation ``` Container for RepeatMasker track. For example from http://hgdownload.cse.ucsc.edu/goldenPath/felCat5/database/rmsk.txt.gz ``` Table fields: `bin` smallint(5) unsigned NOT NULL, `swScore` int(10) unsigned NOT NULL, `milliDiv` int(10) unsigned NOT NULL, `milliDel` int(10) unsigned NOT NULL, `milliIns` int(10) unsigned NOT NULL, `genoName` varchar(255) NOT NULL, `genoStart` int(10) unsigned NOT NULL, `genoEnd` int(10) unsigned NOT NULL, `genoLeft` int(11) NOT NULL, `strand` char(1) NOT NULL, `repName` varchar(255) NOT NULL, `repClass` varchar(255) NOT NULL, `repFamily` varchar(255) NOT NULL, `repStart` int(11) NOT NULL, `repEnd` int(11) NOT NULL, `repLeft` int(11) NOT NULL, `id` char(1) NOT NULL, ``` ### Ngram/kmer model ```python from trseeker.models.ngram_model import NgramModel ``` ```python ngram_obj = NgramModel(seq_f, seq_r) ngram_obj.add_tr(trf_obj, tf) print ngram_obj >>> '[fseq]\t[rseq]\t[tf]\t[df]\t[len taxons]\t[len fams]\n' print ngram_obj.get_families() >>> ??? ``` Attributes - seq_r - seq_f - tf (float) - df (int) - taxons (set) - trs (set) - families (dict) ### Kmer index model ```python from trseeker.models.ngrams_model import KmerIndexModel ``` Attributes: - kmer - rkmer - tf (int) - df (int) - docs (list of int) - freqs (list of float) ### Kmer Slice model ```python from trseeker.models.ngrams_model import KmerSliceModel ``` #### Attributes - kmer - local_tf (int) - df (int) - f_trs - f_wgs - f_mwgs - f_trace - f_sra - f_ref1 - f_ref2 - f_caroli ### SliceTreeModel ```python from trseeker.models.ngrams_model import SliceTreeModel ``` #### Attributes - deep (int) - size (int) - blast_fams (list of str) - maxdf (int) - nmaxdf (int) - pmaxdf (float) - gc_var (float) - units (list of int) - trs (list of int) - kmers (list of SliceTreeModel) #### Additional function Yield KmerIndexModel: ```python sc_ngram_reader(file_name) ``` Yield (ngram id, [(seq id, tf), ...]): ```python sc_ngram_trid_reader(file_name) ``` Read kmer index data as list: Parameters: - index file - microsatellite kmer dictionary, kmer->index ```python read_kmer_index(ngram_index_file, micro_kmers, cutoff=1) ``` ## IO functions ### Tab file ```python from trseeker.seqio.tab_file import TabDelimitedFileIO reader = TabDelimitedFileIO(skip_first=False, format_func=None, delimiter='\t', skip_startswith='#') reader.read_from_file(file_name) ``` Avaliable all functions from parent class AbstractFileIO from PyExp package. Useful functions: - sc_iter_tab_file(input_file, data_type, remove_starts_with=None) - sc_iter_simple_tab_file(input_file) - sc_read_dictionary(dict_file, value_func=None) - sc_read_simple_tab_file(input_file) - sc_write_model_to_tab_file(output_file, objs) ```python from trseeker.seqio.tab_file import sc_iter_tab_file for wgs_obj in sc_iter_tab_file(file_name, WGSModel): print wgs_obj.wgs_taxon ``` ```python from trseeker.seqio.tab_file import sc_iter_simple_tab_file for item in sc_iter_simple_tab_file(file_name): print item[0] ``` ```python from trseeker.seqio.tab_file import sc_read_dictionary for data in sc_read_dictionary(file_name): for key, value in data.items(): print key, value ``` ```python from trseeker.seqio.tab_file import sc_read_simple_tab_file for data in sc_read_simple_tab_file(file_name): for item in data: print "id = ", item[0] ``` ### Block file ```python from trseeker.seqio.block_file import AbstractBlockFileIO reader = AbstractBlockFileIO(token, **args) for (head, body, start, next) in reader.read_online(file_name): pirnt head ``` Avaliable all functions from parent class AbstractFileIO from PyExp package. ### Fasta file ```python from trseeker.seqio.fasta_file import FastaFileIO reader = FastaFileIO() ``` #### Useful functions: - sc_iter_fasta(file_name) - fasta_reader(file_name), same as sc_iter_fasta - sc_iter_fasta_simple(file_name) - save_all_seq_with_exact_substring(fasta_file, substring, output_file) - sort_fasta_file_by_length(file_name) ```python from trseeker.seqio.fasta_file import sc_iter_fasta for seq_obj in sc_iter_fasta(file_name): print seq_obj.sequence ``` ```python from trseeker.seqio.fasta_file import sc_iter_fasta_simple for (gi, sequence) in sc_iter_fasta_simple(file_name): print gi ``` ```python from trseeker.seqio.fasta_file import save_all_seq_with_exact_substring save_all_seq_with_exact_substring(fasta_file, substring, output_file) ``` Sort by length and save fasta file: ```python sort_fasta_file_by_length(file_name) ``` ### GBFF file ```python from trseeker.seqio.gbff_file import GbffFileIO reader = GbffFileIO() ``` #### Useful functions: - sc_iter_gbff(file_name) - sc_iter_gbff_simple(file_name) - sc_parse_gbff_in_folder(gbbf_folder, fasta_folder, fasta_postfix, mask) ```python from trseeker.seqio.gbff_file import sc_iter_gbff for seq_obj in sc_iter_gbff(file_name): print seq_obj.length ``` ```python from trseeker.seqio.gbff_file import sc_iter_gbff_simple for (gi, sequence) in sc_iter_gbff_simple(file_name): print gi ``` ```python from trseeker.seqio.gbff_file import sc_parse_gbff_in_folder sc_parse_gbff_in_folder("/home/user/", "/home/user/fasta", "fa", "mouse") ``` ### FTP IO ```python from trseeker.seqio.frt_io import AbstractFtpIO reader = AbstractFtpIO(ftp_address=address) reader.connect() reader.cd(['/home', 'user', 'data']) reader.ls() >>> ['readme.txt', 'data.fa'] file_size = reader.size(file_name) reader.get(file, output_file) reader.unzip(file_name) ``` Download from NCBI ftp with aspera: ```python download_with_aspera_from_ncbi(source, destination) ``` ### NCBI ftp ```python from trseeker.seqio.ncbi_ftp import NCBIFtpIO reader = NCBIFtpIO() reader.download_wgs_fasta(wgs_list, file_suffix, output_folder, unzip=False) reader.download_all_wgs_in_fasta(output_folder, aspera=False, get_size=False, unzip=False) reader.download_all_wgs_in_gbff(output_folder) reader.download_chromosomes_in_fasta(ftp_address, name, output_folder) reader.download_trace_data(taxon, output_folder) reader.download_sra_from_ddbj(ftp_address, output_folder) reader.download_with_aspera(local_path, remove_server, remote_path) ``` ### Mongo db reader Not implemented yet. ```python from trseeker.seqio.mongodb_reader import MongoDBReader reader = MongoDBReader() db = reader.get_trdb_conn() ``` ### TRF file #### Useful functions: - sc_parse_raw_trf_folder(trf_raw_folder, output_trf_file, project=None) ```python from trseeker.seqio.trf_file import TRFFileIO from trseeker.seqio.trf_file import sc_parse_raw_trf_folder for trf_obj in TRFFileIO(file_name, filter=True): print trf_obj.trf_id ``` ```python sc_parse_raw_trf_folder(trf_raw_folder, output_trf_file, project="mouse_genome") ``` При чтение данных TRF происходит их фильтрация по следующим параметрам: 1. Убираются все вложенные поля меньшей длины. 2. Если поля overlapping, то сейчас ничего не делается, а раньше если ```python overlap_proc_diff >= settings["trf_settings"]["overlapping_cutoff_proc"] gc_dif <= settings["trf_settings"]["overlapping_gc_diff"] ``` поля объяединяются в одно поле. Иначе считаем, что это отдельные поля. 3. Если поля совпадают, то выбирается то которое с большим trf_pmatch. TODO: убрать пересечение, так как это видимо в том числе и ошибки ассмеблера и это будет мешать корректной классификации. Убрать в отдельную часть. ### TRs file ```python from trseeker.seqio.tr_file import * ``` Functions: - save_trs_dataset(trs_dataset, output_file) - save_trs_class_dataset(tr_class_data, output_file) - get_trfid_obj_dict(trf_large_file) - get_classification_dict(fam_kmer_index_file) Save trf file as fasta file: ```python # don't save trf_obj if trf_family is ALPHA save_trs_as_fasta(trf_file, fasta_file, add_project=False, skip_alpha=False) ``` ```python from trseeker.seqio.tr_file import get_trfid_obj_dict trid2trfobj = get_trfid_obj_dict(trf_large_file) ``` - get_all_trf_objs(trf_large_file) - get_all_class_objs(trf_class_file) - get_class_objs_dict(trf_class_file) ```python from trseeker.seqio.tr_file import get_all_trf_objs from trseeker.seqio.tr_file import get_all_class_objs trf_obj_list = get_all_trf_objs(trf_large_file) class_obj_list = get_all_class_objs(trf_class_file) ``` - get_trf_objs_dict(trf_large_file) ```python trf_obj_dics = get_trf_objs_dict(trf_large_file) ``` - read_trid2meta(file_name) Load trid to full index dictionary as string: ```python read_trid2ngrams(annotation_ngram_folder, trf_large_file) ``` TODO: rewrite this with AbstractReaders ### Ngram file TODO: rewrite this with AbstractReaders ```python from trseeker.seqio.ngram_file import save_ngram_index save_ngram_index(ngram_index_file, hash2id, result_tf, result_df, result_rtf, result_rdf, seen_rev, hash2rev, ngram2kmerann) save_ngram_pos_index(ngram_trids_file, id2trids, id2trid2tf) save_distance_data(dist_file, distances) ``` ### Blast results file ```python from trseeker.seqio.blast_file import get_blast_result get_blast_result(blast_file, length, gap_size=1000, min_align=500, min_length=2400, format_function=None, min_score=90) ``` Для фильтров используются following parameters: 1. **cutoff** - TRs array length. 2. **match** and **min_match** from **blast_match_proc** settings 3. **min_align** is minimal length of alignment (array_length / .min_alig) 4. **blast_gap_size** is size of the gap between two match regions (array_length * min_match) 5. **min_length** is minimal total length (array_length * match) Этапы анализа: 1. From blast output files extracted ref_gi to BlastResultModel dictionary. 2. Removed all nested matches. 3. Overlapping matches joined. 4. Joined matches with length less than **min_align** are dropped. 5. Matches with gap length less than **gap_size** are joined 6. Joined gapped matches with length less than **min_align** are dropped. 7. Results formatted with given format_function Dataset updating functions: ```python update_with_repbase_blast_result(trs_dataset, annotation_self_folder, filters) # inside function # result is semicolon-delimited trs_dataset[i].trf_repbase = result # filters example filters = { "blast_gap_size": 300, "min_align": 1000, "min_length": 3000, } ``` ```python update_with_self_blast_result(trs_dataset, annotation_self_folder, filters_obj) # inside function # result comma-delimited # inside get_filters(array_length) generate parameters by array_length: # filters = filters_obj.get_filters(trf_obj.trf_array_length) trs_dataset[i].trf_family_self = result ``` ```python update_with_ref_blast_result(trs_dataset, annotation_self_folder, filters) # inside function # comma-delimited data trs_dataset[i].trf_family_ref = result ``` TODO: refractor to more generic form with setattr(...) Self and ref annotation comma-delimited. Repbase annotation semicolon-delimited. Avalibale format functions: - format_function_repbase(blast_dataset) - format_function_self(blast_dataset) [DEFAULT] While blast results parsing: - skipped all inner matches - joined overlapped - skipped all alignments with length less thatn min_align paramter - joined gapped if gap less than gap_size paramter ### SRA file TODO: move FastqObj to models ```python from trseeker.seqio.sra_file import FastqObj fastq_obj = FastqObj(head, seq, strain, qual_str, phred33=False) print fastq_obj.fastq print fastq_obj.fastq_with_error(error) print fastq_obj.sequence print fastq_obj.fasta print fastq_obj.trimmed_fastq print fastq_obj.gc ``` Properties: - head - seq - qual - strain - id - trimmed_seq - trimmed_qual - qv: list of Q - phredX - adaptor_positions: positions of adapter - adapter_contamination - qual_junk - status - parts Methods related to trimming: - trim(), NotImplemented - trim_by_quality_cutoff(cutoff=30) - trim_exact_adaptor(adaptor, verbose=False), NotImplemented - trim_inexact_adaptor(adaptor, verbose=False, num_errors=2), NotImplemented Additional functions: - fastq_reader ```python for fastq_obj in fastq_reader(fastq_file, phred33=False): print fastq_obj.seq ``` ## Toolkit ### Assembly annotation Compute intersection between two datasets. Dataset format: list of tuples (chrName, startPos, endPos, feature_id). ```python from trseeker.tools.annotation import * compute_intersection_intervals(data_a, data_b) ``` Usage example: ```python data_b = open("cat_good_trf.gff3").readlines() data_b = [x.strip().split("\t") for x in data_b] data_b = [(x[0], int(x[3]), int(x[4]), x[-1]) for x in data_b] for x in compute_intersection_intervals(data_a, data_b): print x ``` ### Assembly tools ```python from trseeker.tools.assembly_tools import * N50_contig_length, N50, shortest_seq, longest_seq = get_N50(lengths) ``` ### Tulip files Format and write network data to tulip format from distance_file with given cutoff (defaul=90). - distance_file: (node_a, node_b, distance) - tulip_file_output: network in tulip format - cutoff: distance cutoff ```python from trseeker.tools.tulip_tools import * format_distances_to_tulip(distance_file, tulip_file_output, cutoff=90) ``` ### TRs nomenclature TODO: rename package ```python from trseeker.tools.trs_groups import * ``` Get next family index. Index limited to latin alphabet characters. Otherwise will return X1,X2 and so on: ```python letter = get_index(i) ``` Get most frequent element of list: ```python element = get_popular(s) ``` Get unit and letter for family. Unit picked by most common pmatch value. A seen_units is a dictionary unit_size to frequence, it is needed for letter choosing: ```python unit, letter, seen_units = get_family_name(ids, seen_units) ``` Join families with common members, sort by family size: ```python join_families_with_common(families) ``` ### Sequence tools ```python from trseeker.tools.sequence_tools import * ``` Return complementary sequence: ```python revcom = get_revcomp(sequence) ``` Get shift variants for TR monomer: ```python sequences = get_revcomp(sequence) ``` Return normalized sequence with following rules: 1. if T > A then return reverse complement 2. if A == T and G > C then return reverse complement 3. else return sequence ```python sequence = fix_strand(sequence) ``` Count GC content: ```python gc = get_gc(sequence) ``` Check for N|n in sequence. Return n(with N) or w(whole): ```python bool_value = check_gapped(sequence) ``` Return subsequence: ```python get_subseq(seq, start, end) ``` Clear sequence (ACTGN alphabet): 1. lower case 2. remove any gaps 3. remove all letters except atgcn ```python sequence = clear_sequence(sequence) ``` Return list of sequences splited by pattern with added end. Pattern is regexp: ```python restriction(sequence, pattern, end=None) ``` Return number of fragments after restriction of sequence with given regexp pattern: ```python restriction_fragments_n(sequence, pattern) ``` Check tandem repeat synonims between two sequence with same length: ```python check_cyclic_repeats(seq_a, seq_b) ``` Return sequence with n mutations: ```python random_mutation(seq, n, alphabet="actgn +") ``` Return consensus string for given list of strings: ```python get_consensus(strs) ``` Take a minimal sequence from lexicographically sorted rotations of sequence and its reverse complement. Example: ACT, underlined - reverse complement seqeunces ACT, AGT, CTA, GTA, TAC, TAG. Find all possible multimers, e.g. replace GTAGTAGTA consensus sequence with ACT. Return: - list of sorted TRs - list of (df, consensus) pairs sorted by array number ```python remove_consensus_redundancy(trf_objs) ``` ### Various useful functions for working with files ```python from trseeker.tools.seqfile import * ``` - save_list(file_name, data) - save_dict(file_name, dictionary) - save_sorted_dict(d, file, by_value=True, reverse=True, min_value=None, key_type=None) - count_lines(file) - sort_file_by_int_field(file_name, field) ### Other useful functions (other_tools.py) ```python from trseeker.tools.other_tools import * ``` Sort dictionary by key. Retrun list of (k, v) pairs: ```python sort_dictionary_by_key(d, reverse=False) ``` Cast list elements to string: ```python as_strings(alist) ``` Return list from dictionary, return list of (key, value) pairs: ```python dict2list(adict) ``` Sort dictionary by value. Retrun list of (v, k) pairs: ```python sort_dictionary_by_value(d, reverse=False) ``` Remove duplicates from list and sort it: ```python remove_duplicates_and_sort(data) remove_duplicates(data) ``` Remove redundancy or empty elements in given list: ```python remove_redundancy(alist) ``` Remove nested fragments, ata format [start, end, ...]: ```python clear_fragments_redundancy(data, extend=False, same_case_func=None) ``` ### Ngrams (kmers) tools ```python from trseeker.tools.ngrams_tools import * ``` Yields all ngrams of length n from given text: ```python generate_ngrams(text, n=12) ``` Yields all ngrams of length n from given text: ```python generate_window(text, n=12, step=None) ``` Returns m most frequent (ngram of length n, tf) tuples for given text: ```python get_ngrams(text, m=None, n=23, k=None, skip_n=False) ``` Returns m most frequent (ngram of length n, fraction of possible ngrams) tuples for given text: ```python get_ngrams_freq(text, m=None, n=23, k=None) ``` Returns a feature set {'ngram':'ngram',...} of m most frequent ngram of length n for given text: ```python get_ngrams_feature_set(text, m=5, n=12) ``` Returns a distance between two ngram sets where distance is a sum(min(ngram_a, ngram_b) for each common ngram). Format for ngrams_a and ngrams_b is a dictionary {ngram:n, ...}: ```python get_ngram_freq_distance(ngrams_a, ngrams_b) ``` Returns a distance between two ngram sets where distance is a len(). Format for ngrams_a and ngrams_b is a dictionary {ngram:n, ...}: ```python get_ngram_common_distance(ngrams_a, ngrams_b) ``` Return repeatness coefficient. From 0 (e.g. polyA) to 1 (unique sequence): ```python get_repeatness_coefficent(length, k, kmern) # Inside this function: # kmern * (N-1) / (N**2) ``` Return expressivenesss coefficient. ```python get_expessiveness_coefficent(kmern, k) # Inside this function: # kmern * 1. / 4^k ``` Update tf and df data with k-mers from given sequence (it will be lowered): ```python from trseeker.tools.ngrams_tools import count_kmer_tfdf k = 23 for sequence in sequences: tf_dict, df_dict, local_tf, local_df = count_kmer_tfdf(sequence, tf_dict, df_dict, k) ``` Compute max df, number and procent of sequence with given ngram. Return (maxdf, nmaxdf, pmaxdf, ngram_seqs) ```python (maxdf, nmaxdf, pmaxdf, ngram_seqs) = get_df_stats_for_list(data, k, kmer2df): ``` Get tf and df frequencies for kmer list: ```python get_kmer_tf_df_for_data(data, k, docids=False) ``` Get list of (kmer, revkmer, tf, df, docids) for given data: ```python process_list_to_kmer_index(data, k, docids=True, cutoff=None) ``` Get list of (kmer, revkmer, tf, df, docids) for multifasta file: ```python compute_kmer_index_for_fasta_file(file_name, index_file, k=23) ``` Get list of (kmer, revkmer, tf, df, docids) for TRs file, filter ny sequence complexity defined by get_zlib_complexity function: ```python compute_kmer_index_for_trf_file(file_name, index_file, k=23, max_complexity=None, min_complexity=None) ``` Compute kmer coverage and set of kmers: ```python (m, kmers) = get_sequence_kmer_coverage(sequence, kmers, k) ``` ### Edit distance functions ```python from trseeker.tools.edit_distance import * ``` Return list of element (*d*) positions in given list: ```python get_pos(L, d) ``` Return edit distance between two given strings. Edit distance dynamic programming implementation. Length of second sequence does not change similarity value: 1. monomer mode: double short monomer sequence if len2/len1 < 2 2. full_info: boolean, return full information Return: percent of ED similarity, or if full_info (distance, number of d, positions of d, S matrix) ```python get_ed_similarity(s1, s2, monomer_mode=False, full_info=False, verbose=False) ``` Return two sequences edit distance full information. 1. s1: sequence of tandem repeat monomer 2. s2: sequence of tandem repeat monomer 3. verbose: boolean 4. monomer mode: double short monomer sequence if len2/len1 < 2. Return: ("max_sim Nmax_sim %sim pos_list", pos_list, all_result, length seq1, length seq2) ```python get_edit_distance_info(s1, s2, verbose=False, monomer_mode=False) ``` Return ED valie for two sequences: ```python get_edit_distance(s1, s2) ``` Get last row of ED matrix between two strings: ```python get_edit_distance_row(s1, s2) ``` Get Hamming distance: the number of corresponding symbols that differs in given strings: ```python hamming_distance(s1, s2) ``` ### Working with Repbase files TODO: move to Readers ```python from trseeker.tools.repbase_tools import join_repbase_files ``` Function join all Repbase fasta files in one huge fasta; reformat headers for compatibility with NCBI tools: ```python join_repbase_files(input_folder, output_file) ``` ### Working with Trace files ```python from trseeker.tools.trace_tools import unclip_trace_file ``` Unclip. Remove vector flanks from Trace data. ```python unclip_trace_file(fasta_file, clip_file, uncliped_file) ``` Read clip dictionary from clip_file_name gz archive: ```python get_clip_data(clip_file_name) ``` ### Function related to statistics ```python from trseeker.tools.statistics import * ``` Calculate sigma, return sum(module(xi - mean): ```python get_sigma(data) ``` Return dictionary containing simple statistics for given list. Dictionary keys: mean, variance, sigma, sample deviation: ```python get_mean(data) get_standard_deviation(variance) t_test(sample_mean, dist_mean, variance, N) get_element_frequences(data) get_simple_statistics(data) ``` ### Parsers ```python from trseeker.tools.parsers import * parse_fasta_head(fa_head) parse_chromosome_name(head) trf_parse_line(line) trf_parse_param(line) trf_parse_head(line) get_wgs_prefix_from_ref(ref) get_wgs_prefix_from_head(head) ``` ### Functions related to TRF ```python from trseeker.tools.trf_tools import * ``` ```python trf_search(file_name) ``` ```python trf_search_in_dir(folder, verbose=True, file_suffix=".fa", output_folder=None) ``` Parallel TRF seeach: ```python trf_worker(args) trf_search_in_dir_parallel(folder, verbose=True, file_suffix=".fa", output_folder=None, threads=1) ``` Create output TRF file with tandem repeats with length greater than from input file. Function returns number of tandem repeats in output file: ```python trf_filter_by_array_length(trf_file, output_file, cutoff) ``` Create output TRF file with tandem repeats with unit length greater than from input file. Function returns number of tandem repeats in output file: ```python trf_filter_by_monomer_length(trf_file, output_file, cutoff) ``` Create output TRF file with tandem repeats with GI that don't match GI_LIST. List of GI, see TRF and FA specifications, GI is first value in TRF row. ```python trf_filter_exclude_by_gi_list(trf_file, output_file, gi_list_to_exclude) ``` Write TRF file tab delimited representation.Representation: numerical|index|agc_apm|with_monomer|family ```python trf_representation(trf_file, trf_output, representation) ``` Write statistics data: field, N: ```python trf_write_field_n_data(trf_file, file_output, field, field_format="%s") ``` Write statistics data: field_a, field_b: ```python trf_write_two_field_data(trf_file, file_output, field_a, field_b) ``` Function prints chr, all trs, 3000 trs, 10000 trs: ```python count_trs_per_chrs(all_trf_file) ``` Function prints number of items with given fasta head fragment: ```python count_trf_subset_by_head(trf_file, head_value) ``` Several fasta heads impossible to parse, so it is simpler to fix them postfactum: ```python fix_chr_names(trf_file, temp_file_name=None, case=None) ``` ### Working with TRs datasets ```python from trseeker.tools.trs_dataset import * ``` - COLORS dictionary - COLORS_50 dictionary ```python COLORS_50 = { "denim": ("#1560bd", (21,96,189)), } # TODO: finish it ``` - get_colors(family) - get_colors_rgb(family) - create_mathematice_dataset_by_family(trs_dataset, path_to_mathematica_folder, min_borders, max_borders) ### Working with SRA data ```python from trseeker.tools.sra_tools import * ``` Iterate over fastaq data. ```python sra_fastaq_reader(file_name) ``` Iterate over fasta SRA data. ```python sra_fasta_reader(file_name) ``` - read_fastaq_freq_to_memory(file_name) - fastaq_to_fasta(file_name, output_file) - write_fastaq_repeats(input_file, output_file, min_tf=1000) - seq_to_bin(seq) - bin_to_seq(bseq) Read SRA fasta data without read repeats. ```python write_reduced_fasta(input_file, output_reduced) ``` Write ngrams data from SRA fasta data. ```python write_ngrams(input_file, output_ngram, NGRAM_N) ``` ### Working with sequence patterns ```python from trseeker.tools.sequence_patterns import * ``` Avaliable patterns: - MARS1 - MARS2 - CENPB - PRDB9 Functions: Translate sequence in DNA15 in reg exp: ```python re_translate(sequence) ``` Return list of sequences where one letter changed to N: ```python re_get_plus_minus(sequence) re_get_mutations(sequence) ``` Return list of patterns one not mutated and second mutated: ```python get_double_pattern(pattern_static, pattern_dynamic) ``` - get_mutated_pattern_twice(pattern) - get_mutated_pattern_trice(pattern) - pattern_search(name, sequence, pattern_function, pattern_function_params) ### Working with BLAST ```python from trseeker.tools.blast_tools import * ``` - blastn(database, query, output, e_value=None) - create_db(fasta_file, output, verbose=False, title=None) - alias_tool(dblist, output, title) - bl2seq(input1, input2, output) - create_db_for_genome(file_pattern=None, chromosome_list=None, output=None, title=None) - get_gi_list(gi_score_file, score_limit=90) - get_all_gi_from_blast(blast_file, mode="gi") - get_all_blast_obj_from_blast(blast_file, mode="ref") - blastn_search_for_trs(trf_large_file, db, annotation_self_folder, temp_file, skip_by_family=None, is_huge_alpha=False) - bl2seq_search_for_trs(trf_large_file, annotation_bl2seq_folder, temp_file) ### Working with suffix arrays ```python from trseeker.tools.sa_tools import * ``` Fasta to SA input. Text file of sequences delimeted by $ symbol: ```python fasta_to_sa_input(fasta_file, sa_input_file, index_file_name=None, file_name=None, start_id=None, increment=False) ``` SA input file to fasta file: ```python sa_input_to_fasta(input_file, output_file) ``` Function precompiles doc2id and id2doc pickled dictionary: ```python pickle_dictionary_for_docid_trid(sa_doc_index, doc_to_trf_file, trf_to_doc_file) ``` Function filters and writes sa data with given min tf and df: ```python filter_sa_dataset(sa_file, output_file, min_tf, min_df) ``` Yields corpus texts. Corpus is $ delimited text file: ```python iterate_sa_corpus(corpus) ``` ### Working with NCBI genome information ```python from trseeker.tools.ncbi_genomes import * ``` Scrap finished WGS genome projects wtih given SubGroup name: ```python load_genome_projects(subgroup) ``` Scrap finished WGS genome projects wtih given SubGroup name: ```python load_genome_projects_exclude(notsubgroups) ``` Print data for initiating PySatDNA projects: ```python print_add_project_data(genome_projects, pid_type, pr_type) ``` ### Working with graph data ```python from trseeker.tools.network_tools import * ``` Compute network slices using graph_tool library or networkx. ```python compute_network(network_file, output_file_pattern, id2nodename) ``` - create_graph_on_fly(network_file) - load_graph(ml_file) - analyse_network_graph_tool(G, output_file_pattern, trf_large_index_file) - write_classification_graph_tool(output_file, components, id2nodename) - write_classification_neworkx(output_file, components, trid2meta) - create_graphml(network_file, ml_file, id2nodename) - load_networkx(network_file) - init_graph_networkx(network_data, start=0, precise=1, id2nodename=None) - analyse_networkx(G, network_data, output_file_pattern, id2nodename) ### Working with NCBI annotations ```python from trseeker.tools.ncbi_annotation_tools import * ``` Read ideogram file and return return dict chr -> list. ```python get_ideogram_dict(idiogram_file, mode="NCBI") get_gene_list(file_gene_list, color='#000000', left_padding=30, gene_group_label='C57BL/6J') # NB: Not implemented yet. ``` ### Jellyfish wrapper ```python from trseeker.tools.jellyfish_tools import * # jellyfish settings: location = settings["blast_settings"]["jellyfish_location"] ``` Count kmers with Jellyfish ```python count_kmers(input_fasta, ouput_prefix, k, mintf=None) merge_kmers(folder, ouput_prefix, ouput_file) stats_kmers(db_file, stats_file) histo_kmers(db_file, histo_file) dump_kmers(db_file, fasta_file, dumpmintf) query_kmers(db_file, query_hashes, both_strands=True, verbose=True) get_kmer_db_and_fasta(folder, input_file, kmers_file, k=23, mintf=None) query_and_write_coverage_histogram(db_file, query_sequence, output_file, k=23) # Save coverage histogram into output_file for given query_sequence. ``` Shortcut: ```python from trseeker.tools.jellyfish_tools import sc_compute_kmer_data sc_compute_kmer_data(fasta_file, jellyfish_data_folder, jf_db, jf_dat, k, mintf, dumpmintf) ``` ### Classifiction TRs in types Classification of TRs into types. ```python from trseeker.tools.trs_types import * ``` Available types: ```python trs_types = { "all": 0, "micro": 0, "perfect": 0, "too_short": 0, "100bp": 0, "gc": 0, "x4": 0, "entropy": 0, "ok": 0, } ``` Usage: ```python trs_types, class_objs, trf_objs = get_trs_types(trf_all_file, trf_all_class_file, settings) ``` Used settings: ```python settings["other"]["trs_types"]["possible_array_length"] settings["other"]["trs_types"]["microsat_monomer"] settings["other"]["trs_types"]["min_array_length"] settings["other"]["trs_types"]["min_gc"] settings["other"]["trs_types"]["max_gc"] settings["other"]["trs_types"]["min_copy_number"] settings["other"]["trs_types"]["min_entropy"] ``` ### Tree data ```python from trseeker.tools.tree_tools import * ``` #### NodeData class Stores tree-relevant data associated with nodes. ```python node = NodeData(taxon=None, branchlength=1.0, support=None, comment=None, bootstrap=None, taxon_name=None, distance=None, elements=None) ``` #### Chain class Stores a list of nodes that are linked together. ```python chain = Chain() # Return a list of all node ids ids = chain.all_ids() # Attaches node to another: (self, node, prev) chain.add(node, prev=None) # Deletes node from chain and relinks successors to predecessor: collapse(self, id). chain.collapse(id) # Kills a node from chain without caring to what it is connected chain.kill(id) # Disconnects node from his predecessor chain.unlink(id) # Connects son to parent chain.link(id) # Check if grandchild is a subnode of parent chain.is_parent_of(parent, grandchild) # Returns a list of all node_ids between two nodes (excluding start, including end) chain.trace(start, finish) ``` #### Node datastructure A single node ```python # Represents a node with one predecessor and multiple successors node = Node(data=None) node.set_id(id) node.set_data(data) id = node.get_id() # Returns a list of the node's successors id = node.get_succ() # Returns the id of the node's predecessor id = node.get_prev() node.set_prev() # Adds a node id to the node's successors node.add_succ(id) # Removes a node id from the node's successors node.remove_succ(id) # Sets the node's successors node.set_succ(id) ``` #### Tree datastructure Represents a tree using a chain of nodes with on predecessor (=ancestor) and multiple successors (=subclades) ```python t = Tree(weight=1.0, rooted=False, name="", data=NodeData, max_support=1.0) t.is_leaf(node) t.is_subtree(node) t.is_terminal(node) t.is_internal(node) t.is_preterminal(node) t.has_support(node=None) l = t.get_node_to_one_d(tree) t.is_identical(tree) node = t.node(id) nodes = t.get_terminals() n = t.count_terminals(node=None) node = t.common_ancestor(node1, node2) d = t.distance(nod1, node2) dist_dict = t.get_distance_for_all_from_node(id) bl = t.sum_branchlength(root=None, node=None) ids = t.split(parent_id=None, n=2, branchlength=1.0) # Bootstrap tree with given cutoff value. Nodes with suppert below cutoff will be collapsed t.bootstrap(value) # Prunes a terminal taxon from the tree t.prune(taxon) # Return a list of all otus downwards from a node t.get_taxa(node_id=None) ids = t.search_taxon(taxon) t.to_string(support_as_branchlengths=False, branchlengths_only=False, plain=True, plain_newick=False, ladderize=None) t.make_info_string(data, terminal=False) t.ladderize_nodes(nodes, ladderize=None) t.newickize(node, ladderize=None) ``` Functions: Get list of (slice_id, file_path): ```python slice_files = get_slice_files(folder) ``` read_tree_slices(slice_files): ```python slice_files = read_tree_slices(slice_files) ``` ### Function for downloading datasets from NCBI ```python from trseeker.tools.entrez_datanase import * ``` Download all proteins according to a given query into output file and then return these proteins as fasta text. ```python fasta_data = download_proteins_from_ncbi(query, output_file, email, batch=500, verbose_step=1000) ``` Download items from given database according to query, rettype, and retmode. Save output to output_file and return as text. ```python data = download_items_from_ncbi(query, database, output_file, email, rettype="fasta", retmode="text", batch=500, verbose_step=1000) ``` Get items from given database according to query, rettype, and retmode. ```python items = get_items_from_ncbi(query, database, email, rettype="fasta", retmode="text", batch=500, verbose_step=1000) ``` Get RNA SRA datasets from NCBI according to taxid. Output includes LIBRARY_SOURCE, STUDY_ABSTRACT, DESIGN_DESCRIPTION, PRIMARY_ID, DESCRIPTION, LINKS. And the second part of the output contains full xml data. ```python data, _ = get_rna_sra_datasets_by_taxid(taxid, email, batch=500, verbose_step=1) all_links = {} for *item, links in datasets.values(): links = dict([ (url.split("/")[-1], url) for url in links]) for sra in links: all_links[sra] = links[sra] ``` Download and unpack SRA filrs from NCBI according to taxid. ```python download_rna_sra_datasets_by_taxid(taxid, email, output_folder, threads=30, batch=500, verbose_step=1) ``` Download genomes and annotation from NCBI according to taxid. ```python download_genome_assemblies_and_annotation_from_ncbi(taxid, output_folder, threads=30, only_refseq=True) ```
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。