# Trseeker: a Python library and framework for working with various genomics data

## Contents

- [Introduction](#_toolkit_intro)
- [Settings](#_toolkit_settings)
- [Models](#_models)
    - [DNA Sequence](#_models_seq)
    - [Tandem repeat](#_models_trf)
    - [Organism](#_models_organism)
    - [Dataset](#_models_dataset)
    - [Blast output](#_models_blast)
    - [Chromosome](#_models_chr)
    - [WGS assembly](#_models_wgs)
    - [Genome](#_models_genome)
    - [Kmer](#_models_kmer)
    - [Kmers](#_models_kmers)
- [Readers](#_io)
    - [Tab file](#_io_tab)
    - [Block file](#_io_block)
    - [Fasta file](#_io_fasta)
    - [GBFF file](#_io_gbff)
    - [FTP](#_io_ftp)
    - [NCBI FTP](#_io_ncbi_ftp)
    - [MONGO DB](#_io_mongo)
    - [TRF output](#_io_trf)
    - [TRs file](#_io_trs)
    - [Ngram file](#_io_ngram)
    - [Blast output](#_io_blast)
    - [Fastq file](#_io_sra)
- [Toolkit](#_tools)
    - [Annotation tools](#_annotation)
    - [Assembly tools](#_assembly_tools)
    - [Tulip](#_tools_tulip)
    - [TRs nomenclature](#_tools_trs_group)
    - [Sequence](#_tools_sequence)
    - [Various for files](#_tools_var)
    - [Various](#_tools_other)
    - [Kmers](#_tools_kmers)
    - [Edit distance](#_tools_ed)
    - [Repbase](#_tools_repbase)
    - [Trace](#_tools_trace)
    - [Statistics](#_tools_stat)
    - [Parsing](#_tools_parsers)
    - [TRF](#_tools_trf)
    - [TRs datasets](#_tools_trs)
    - [SRA datasets](#_tools_sra)
    - [Sequence patterns](#_tools_patterns)
    - [Blast](#_tools_blast)
    - [ESA](#_tools_sa)
    - [NCBI genomes](#_tools_ncbi_genome)
    - [Networks](#_tools_networks)
    - [NCBI annotations](#_tools_ncbi_ann)
    - [Jellyfish](#_tools_jellyfish)
    - [Trees](#_tools_tree)
    - [Classification of TRs into types](#_trs_types)
    - [Working with Entrez NCBI databases](#_tools_entrez)





## Introduction

The framework consists of the following parts:

- data models
- reading/writing data according to the models
- tools for working with these data



## Framework settings

In the file settings.py:

```python
SETTINGS_FILENAME = "settings.yaml"
NGRAM_LENGTH = 23
NGRAM_N = 100000000
```

Settings can be read and written:

```python
from trseeker.settings import load_settings
from trseeker.settings import save_settings

settings_dict = load_settings()
save_settings(settings_dict)
```

The settings contain the following parameters:

```yaml	
trseeker:
    os: linux64
    root_dir: /root/Dropbox/workspace/trseeker
    work_dir: /home
blast_settings:
    blast_location:
    blast_location_NIX: /home/ncbi-blast-2.2.26+/bin/legacy_blast.pl bl2seq
    blast_location_WIN: c:\Program Files\NCBI\blast-2.2.26+\bin\legacy_blast.pl bl2seq.exe
    jellyfish_location: jellyfish
    repbase_db_folder: /home/rebase_blast_db/repbase
    blast_e: 1e-20
    blast_b: 20000000
    blast_v: 20000000
trf_settings:
    trf_location: /root/trf404.linux64
    trf_match: 2
    trf_mismatch: 5
    trf_indel: 7
    trf_p: 80
    trf_q: 10
    trf_threshold: 50
    trf_length: 2000
    trf_param_postfix: 2.5.7.80.10.50.2000
    trf_masked_file: False
    trf_flanked_data: True
    trf_data_file: True
    trf_nohtml: True
    overlapping_cutoff: 10
    overlapping_cutoff_proc: 30
    overlapping_gc_diff: 0.05
ngrams_settings:
    ngram_length: 23
    ngram_m: 10000000
```
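
Individual parameters can then be read from the returned dictionary; a minimal sketch, assuming the YAML structure above:

```python
from trseeker.settings import load_settings

settings = load_settings()
# nested keys mirror the YAML structure shown above
print settings["trf_settings"]["trf_threshold"]
```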


## Available Models




### DNA Sequence

```python
from trseeker.models.sequence_model import SequenceModel
```

Attributes:

- seq_gi (int)
- seq_ref
- seq_description
- seq_sequence
- seq_length (int)
- seq_gc (float)
- seq_revcom, reverse complement
- seq_gapped (int)
- seq_chr
- seq_head
- seq_start_position (int)
- seq_end_position (int)

Properties

- length (self.seq_length)
- sequence (self.seq_sequence)
- fasta
- header
- contige_coverage, if the **_cov_** key is in the name

```python
print seq_obj.fasta
>>> ">[seq_ref]\n[seq_sequence]\n"
```

- sa_input

```python
print seq_obj.sa_input
>>> "[seq_sequence]$"
```

- ncbi_fasta

```python
print seq_obj.ncbi_fasta
>>> ">gi|[seq_gi]|ref|[seq_ref]|[seq_description]\n[seq_sequence]\n"
```

Methods:

- add_sequence_revcom()
- set_dna_sequence(self, title, sequence, description=None)

```python
self.seq_ref = title
```

- set_ncbi_sequence(self, head, sequence)

The chromosome name is **?** or is set with parse_chromosome_name(head).

```python	
(self.seq_gi, self.seq_ref, self.seq_description) = parse_fasta_head(head)
```

- set_gbff_sequence(self, head, sequence)

Head is a dictionary with gi, ref, description keys.

The chromosome name is **?** or is set with parse_chromosome_name(head["description"]).

The sequence is cleaned with the clear_sequence(s) function: it is lowercased and all non-DNA characters are replaced with **n**. If the sequence contains **n**, it is considered gapped.
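
A minimal usage sketch combining the setters and properties above (the title and sequence values are hypothetical):

```python
from trseeker.models.sequence_model import SequenceModel

seq_obj = SequenceModel()
seq_obj.set_dna_sequence("my_ref", "ACTGACTGACTG")
print seq_obj.length  # seq_length
print seq_obj.fasta   # ">my_ref\n...\n", see the fasta property above
```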



### TRF results

```python
from trseeker.models.trf_model import TRModel
```

Attributes:

- project, project name
- id (float)
- trf_id (int)
- trf_type
- trf_family
- trf_family_prob (float)
- trf_l_ind (int)
- trf_r_ind (int)
- trf_period (int)
- trf_n_copy (float)
- trf_pmatch (float)
- trf_pvar (float)
- trf_entropy (float)
- trf_consensus
- trf_array
- trf_array_gc (float)
- trf_consensus_gc (float)
- trf_gi
- trf_head
- trf_param
- trf_array_length (int)
- trf_chr
- trf_joined (int)
- trf_superfamily
- trf_superfamily_self
- trF_superfamily_ref
- trf_family
- trf_subfamily
- trf_subsubfamily
- trf_family_network
- trf_family_self
- trf_family_ref
- trf_hor (int)
- trf_n_chrun (int)
- trf_chr_refgenome
- trf_bands_refgenome
- trf_repbase
- trf_strand

Methods

- set_project_data(project), set self.project to given project
- set_raw_trf(head, body, line), head, body and line from TRF parser
- get_index_repr()

```python	
print trf_obj.get_index_repr()
'''
Tab-delimited string ending with a \n symbol:
trf_id
trf_period
trf_array_length
trf_pvar
trf_gi
trf_l_ind
trf_r_ind
trf_chr
'''
```

- get_numerical_repr()

```python
print trf_obj.get_numerical_repr()
>>> [trf_period]\t[trf_array_length]\t[trf_array_gc]\n
```

- get_fasta_repr() or the **fasta** property, where the head is trf_obj.trf_id and the sequence is trf_obj.trf_array
- get_monomer_fasta_repr(), where the head is trf_obj.trf_id and the sequence is trf_obj.trf_consensus
- get_family_repr()
- get_gff3_string(self, chromosome=True, trs_type="complex_tandem_repeat", probability=1000, tool="PySatDNA", prefix=None, properties={"id":"trf_id", "family": "trf_family_self"})

```python
print trf_obj.get_family_repr()
'''
Tab-delimited string ending with a \n symbol:
trf_id
trf_period
trf_array_length
trf_array_gc
trf_pvar
trf_gi
trf_l_ind
trf_r_ind
trf_chr
trf_repbase
trf_superfamily
trf_family
trf_subfamily
'''
```
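
The GFF3 serialization can be used, for example, to export a whole dataset for a genome browser; a sketch with a hypothetical file name, assuming the default arguments shown above and a TRs file readable by get_all_trf_objs (see the Readers section):

```python
from trseeker.seqio.tr_file import get_all_trf_objs

with open("trs.gff3", "w") as fh:
    for trf_obj in get_all_trf_objs("all.trf"):
        fh.write(trf_obj.get_gff3_string())
```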

For network slices one more index is added: gid (group id).

```python
from trseeker.models.trf_model import NetworkSliceModel	

slice_obj = NetworkSliceModel()
```

#### TRsClassificationModel(AbstractModel):

Model for keeping classification data.

Attributes:

- project
- id (int)
- trf_id (int)
- trf_period (int)
- trf_array_length (int)
- trf_array_gc
- trf_type
- trf_family
- trf_subfamily
- trf_family_prob (float)
- trf_family_kmer
- trf_subfamily_kmer
- trf_family_self
- class_ssr
- class_tssr
- class_sl
- class_good
- class_micro
- class_100bp
- class_perfect
- class_x4
- class_entropy
- class_gc
- trf_consensus

```python

class_obj = TRsClassificationModel()
class_obj.set_with_trs(trf_obj)

print class_obj.network_head()

```



### Organism model

```python
from trseeker.models.organism_model import OrganismModel		
```

Attributes:

- organism_taxon
- organism_common_name
- organism_acronym
- organism_description
- organism_wgs_projects
- organism_genome_assemblies



### Dataset model

```python
from trseeker.models.dataset_model import DatasetModel		
```

Attributes:

- dataset_taxon
- dataset_id
- dataset_sources
- dataset_description
- dataset_gc (float)
- dataset_length (int)
- dataset_trs_n (int)
- dataset_trs_length (int)
- dataset_trs_mean_gc (float)
- dataset_trs_fraq (float)



### Blast Results Model

```python
from trseeker.models.blast_model import BlastResultModel		
```

Attributes:

- query_id (int)
- query_gi (int)
- query_ref
- subject_id
- subject_gi(int)
- subject_ref
- query_start (int)
- query_end (int)
- subject_start (int)
- subject_end (int)
- evalue  (float)
- bit_score (float)
- score (int)
- alignment_length (int)
- proc_identity (float)
- identical (int)
- mismatches (int)
- positives (int)
- gap_opens (int)
- gaps (int)
- proc_positives (float)
- frames
- query_frame (int)
- subject_frame (int)
- fraction_of_query (float)  

Additional functions:

- read_blast_file(blast_file, length), returns a dictionary mapping subject_ref to a list of matches (BlastResultModel objects).

```python
from trseeker.models.blast_model import read_blast_file

ref_to_blast_obj = read_blast_file(file_name)
```



### Chromosome model

```python
from trseeker.models.chromosome_model import ChromosomeModel
```

Attributes:

- chr_genome
- chr_number
- chr_taxon
- chr_prefix
- chr_gpid
- chr_acronym
- chr_contigs
- chr_length
- chr_mean_gc
- chr_trs_all
- chr_trs_3000
- chr_trs_all_proc
- chr_trs_3000_proc
- chr_trs_all_length
- chr_trs_3000_length
- genome_gaps
- chr_sum_gc



### WGS assembly model

```python
from trseeker.models.wgs_model import WGSModel
```

Attributes:

- wgs_prefix
- wgs_taxon
- wgs_gpid
- wgs_acronym
- wgs_contigs (int)
- wgs_length (int)
- wgs_mean_gc (float)
- wgs_trs_all (int)
- wgs_trs_3000 (int)
- wgs_trs_1000 (int)
- wgs_trs_500 (int)
- wgs_trs_all_proc (float)
- wgs_trs_3000_proc (float)
- wgs_trs_1000_proc (float)
- wgs_trs_500_proc (float)
- wgs_trs_all_length (int)
- wgs_trs_3000_length (int)
- wgs_trs_1000_length (int)
- wgs_trs_500_length (int)
- wgs_sum_gc (float)

Methods:

Clear trf information (set to 0):

```python
wgs_obj.clear_trf()
```



### Genome model

```python
from trseeker.models.genome_model import GenomeModel
```

- genome_taxon
- genome_prefix
- genome_gpid
- genome_acronym
- genome_chromosomes
- genome_contigs
- genome_length
- genome_mean_gc
- genome_trs_all
- genome_trs_3000
- genome_trs_all_proc
- genome_trs_3000_proc
- genome_trs_all_length
- genome_trs_3000_length
- genome_gaps
- genome_sum_gc

### RepeatMasker track data

```python
from trseeker.models.genome_model import RepeatMaskerAnnotation
```

Container for a RepeatMasker track, for example from http://hgdownload.cse.ucsc.edu/goldenPath/felCat5/database/rmsk.txt.gz

```
Table fields:
      `bin` smallint(5) unsigned NOT NULL,
      `swScore` int(10) unsigned NOT NULL,
      `milliDiv` int(10) unsigned NOT NULL,
      `milliDel` int(10) unsigned NOT NULL,
      `milliIns` int(10) unsigned NOT NULL,
      `genoName` varchar(255) NOT NULL,
      `genoStart` int(10) unsigned NOT NULL,
      `genoEnd` int(10) unsigned NOT NULL,
      `genoLeft` int(11) NOT NULL,
      `strand` char(1) NOT NULL,
      `repName` varchar(255) NOT NULL,
      `repClass` varchar(255) NOT NULL,
      `repFamily` varchar(255) NOT NULL,
      `repStart` int(11) NOT NULL,
      `repEnd` int(11) NOT NULL,
      `repLeft` int(11) NOT NULL,
      `id` char(1) NOT NULL,
```
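
A sketch of reading such a track, assuming RepeatMaskerAnnotation is a tab-delimited model compatible with sc_iter_tab_file (see the Readers section), that its attribute names mirror the table fields above, and that rmsk.txt is unpacked:

```python
from trseeker.models.genome_model import RepeatMaskerAnnotation
from trseeker.seqio.tab_file import sc_iter_tab_file

for rm_obj in sc_iter_tab_file("rmsk.txt", RepeatMaskerAnnotation):
    print rm_obj.repName, rm_obj.repClass
```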



### Ngram/kmer model

```python
from trseeker.models.ngram_model import NgramModel
```

```python
ngram_obj = NgramModel(seq_f, seq_r)
ngram_obj.add_tr(trf_obj, tf)

print ngram_obj
>>> '[fseq]\t[rseq]\t[tf]\t[df]\t[len taxons]\t[len fams]\n'

print ngram_obj.get_families()
>>> ???
```

Attributes

- seq_r
- seq_f
- tf (float)
- df (int)
- taxons (set)
- trs (set)
- families (dict)



### Kmer index model

```python
from trseeker.models.ngrams_model import KmerIndexModel
```

Attributes:

- kmer
- rkmer
- tf (int)
- df (int)
- docs (list of int)
- freqs (list of float)

### Kmer Slice model

```python
from trseeker.models.ngrams_model import KmerSliceModel
```

#### Attributes

- kmer
- local_tf (int)
- df (int)
- f_trs
- f_wgs
- f_mwgs
- f_trace
- f_sra
- f_ref1
- f_ref2
- f_caroli

### SliceTreeModel

```python
from trseeker.models.ngrams_model import SliceTreeModel
```

#### Attributes

- deep (int)
- size (int)
- blast_fams (list of str)
- maxdf (int)
- nmaxdf (int)
- pmaxdf (float)
- gc_var (float)
- units (list of int)
- trs (list of int)
- kmers (list of SliceTreeModel)

#### Additional functions

Yield KmerIndexModel:

```python
sc_ngram_reader(file_name)
```

Yield (ngram id, [(seq id, tf), ...]):

```python
sc_ngram_trid_reader(file_name)
```

Read kmer index data as list:

Parameters:

- index file
- microsatellite kmer dictionary, kmer->index


```python
read_kmer_index(ngram_index_file, micro_kmers, cutoff=1)
```



## IO functions




### Tab file

```python
from trseeker.seqio.tab_file import TabDelimitedFileIO

reader = TabDelimitedFileIO(skip_first=False, format_func=None, delimiter='\t', skip_startswith='#')

reader.read_from_file(file_name)
```

All functions from the parent class AbstractFileIO (PyExp package) are available.

Useful functions:

- sc_iter_tab_file(input_file, data_type, remove_starts_with=None)
- sc_iter_simple_tab_file(input_file)
- sc_read_dictionary(dict_file, value_func=None)
- sc_read_simple_tab_file(input_file)
- sc_write_model_to_tab_file(output_file, objs)

```python	
from trseeker.seqio.tab_file import sc_iter_tab_file

for wgs_obj in sc_iter_tab_file(file_name, WGSModel):
	print wgs_obj.wgs_taxon
```

```python
from trseeker.seqio.tab_file import sc_iter_simple_tab_file

for item in sc_iter_simple_tab_file(file_name):
	print item[0]
```

```python
from trseeker.seqio.tab_file import sc_read_dictionary

for data in sc_read_dictionary(file_name):
	for key, value in data.items():
		print key, value
```

```python
from trseeker.seqio.tab_file import sc_read_simple_tab_file

for data in sc_read_simple_tab_file(file_name):
	for item in data:
		print "id = ", item[0]	
```
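
The writer counterpart takes an output path and a list of model objects; a sketch, assuming wgs_objs were produced by one of the readers above:

```python
from trseeker.seqio.tab_file import sc_write_model_to_tab_file

sc_write_model_to_tab_file(output_file, wgs_objs)
```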



### Block file

```python
from trseeker.seqio.block_file import AbstractBlockFileIO

reader = AbstractBlockFileIO(token, **args)
for (head, body, start, next) in reader.read_online(file_name):
	print head
```

All functions from the parent class AbstractFileIO (PyExp package) are available.



### Fasta file

```python
from trseeker.seqio.fasta_file import FastaFileIO

reader = FastaFileIO()
```

#### Useful functions:

- sc_iter_fasta(file_name)
- fasta_reader(file_name), same as sc_iter_fasta
- sc_iter_fasta_simple(file_name)
- save_all_seq_with_exact_substring(fasta_file, substring, output_file)
- sort_fasta_file_by_length(file_name)

```python
from trseeker.seqio.fasta_file import sc_iter_fasta

for seq_obj in sc_iter_fasta(file_name):
	print seq_obj.sequence
```

```python
from trseeker.seqio.fasta_file import sc_iter_fasta_simple

for (gi, sequence) in sc_iter_fasta_simple(file_name):
	print gi
```

```python
from trseeker.seqio.fasta_file import save_all_seq_with_exact_substring

save_all_seq_with_exact_substring(fasta_file, substring, output_file)
```

Sort by length and save fasta file:

```python
sort_fasta_file_by_length(file_name)
```



### GBFF file

```python
from trseeker.seqio.gbff_file import GbffFileIO

reader = GbffFileIO()
```

#### Useful functions:

- sc_iter_gbff(file_name)
- sc_iter_gbff_simple(file_name)
- sc_parse_gbff_in_folder(gbbf_folder, fasta_folder, fasta_postfix, mask)

```python
from trseeker.seqio.gbff_file import sc_iter_gbff

for seq_obj in sc_iter_gbff(file_name):
	print seq_obj.length
```

```python
from trseeker.seqio.gbff_file import sc_iter_gbff_simple

for (gi, sequence) in sc_iter_gbff_simple(file_name):
	print gi
```

```python
from trseeker.seqio.gbff_file import sc_parse_gbff_in_folder

sc_parse_gbff_in_folder("/home/user/", "/home/user/fasta", "fa", "mouse")
```




### FTP IO

```python	
from trseeker.seqio.frt_io import AbstractFtpIO

reader = AbstractFtpIO(ftp_address=address)

reader.connect()

reader.cd(['/home', 'user', 'data'])

reader.ls()
>>> ['readme.txt', 'data.fa']

file_size = reader.size(file_name)

reader.get(file, output_file)

reader.unzip(file_name)
```

Download from NCBI ftp with aspera:

```python
download_with_aspera_from_ncbi(source, destination)
```



### NCBI ftp

```python
from trseeker.seqio.ncbi_ftp import NCBIFtpIO

reader = NCBIFtpIO()

reader.download_wgs_fasta(wgs_list, file_suffix, output_folder, unzip=False)

reader.download_all_wgs_in_fasta(output_folder, aspera=False, get_size=False, unzip=False)

reader.download_all_wgs_in_gbff(output_folder)

reader.download_chromosomes_in_fasta(ftp_address, name, output_folder)

reader.download_trace_data(taxon, output_folder)

reader.download_sra_from_ddbj(ftp_address, output_folder)

reader.download_with_aspera(local_path, remove_server, remote_path)
```



### Mongo db reader

Not implemented yet.

```python	
from trseeker.seqio.mongodb_reader import MongoDBReader

reader = MongoDBReader()
db = reader.get_trdb_conn()
```



### TRF file

#### Useful functions:

- sc_parse_raw_trf_folder(trf_raw_folder, output_trf_file, project=None)

```python
from trseeker.seqio.trf_file import TRFFileIO

from trseeker.seqio.trf_file import sc_parse_raw_trf_folder

for trf_obj in TRFFileIO(file_name, filter=True):
	print trf_obj.trf_id
```

```python
sc_parse_raw_trf_folder(trf_raw_folder, output_trf_file, project="mouse_genome")
```

While reading TRF data, the records are filtered as follows:

1. All nested arrays of shorter length are removed.
2. For overlapping arrays nothing is currently done; previously, if

```python
overlap_proc_diff >= settings["trf_settings"]["overlapping_cutoff_proc"] 
gc_dif <= settings["trf_settings"]["overlapping_gc_diff"]
``` 
held, the arrays were merged into one field; otherwise they were treated as separate arrays.
3. If arrays coincide, the one with the greater trf_pmatch is kept.

TODO: remove the overlaps, since they apparently include assembler errors as well and will interfere with correct classification. Move this into a separate step.
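
For illustration, a minimal sketch of the joining rule that was used previously (a hypothetical helper, not part of the library):

```python
def should_join(overlap_proc_diff, gc_diff, settings):
    # two overlapping arrays were merged into one field only when
    # both the overlap percent and the GC difference criteria held
    return (overlap_proc_diff >= settings["trf_settings"]["overlapping_cutoff_proc"]
            and gc_diff <= settings["trf_settings"]["overlapping_gc_diff"])
```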


### TRs file

```python
from trseeker.seqio.tr_file import *
```

Functions:

- save_trs_dataset(trs_dataset, output_file)
- save_trs_class_dataset(tr_class_data, output_file)
- get_trfid_obj_dict(trf_large_file)
- get_classification_dict(fam_kmer_index_file)

Save trf file as fasta file:

```python
# don't save trf_obj if trf_family is ALPHA
save_trs_as_fasta(trf_file, fasta_file, add_project=False, skip_alpha=False)
```

```python
from trseeker.seqio.tr_file import get_trfid_obj_dict

trid2trfobj = get_trfid_obj_dict(trf_large_file)
```

- get_all_trf_objs(trf_large_file)
- get_all_class_objs(trf_class_file)
- get_class_objs_dict(trf_class_file)

```python
from trseeker.seqio.tr_file import get_all_trf_objs
from trseeker.seqio.tr_file import get_all_class_objs

trf_obj_list = get_all_trf_objs(trf_large_file)
class_obj_list = get_all_class_objs(trf_class_file)
```

- get_trf_objs_dict(trf_large_file)

```python
trf_obj_dics = get_trf_objs_dict(trf_large_file)
```

- read_trid2meta(file_name)

Load trid to full index dictionary as string:

```python
read_trid2ngrams(annotation_ngram_folder, trf_large_file)
```

TODO: rewrite this with AbstractReaders



### Ngram file

TODO: rewrite this with AbstractReaders

```python
from trseeker.seqio.ngram_file import save_ngram_index

save_ngram_index(ngram_index_file,
                 hash2id,
                 result_tf,
                 result_df,
                 result_rtf,
                 result_rdf,
                 seen_rev,
                 hash2rev,
                 ngram2kmerann)

save_ngram_pos_index(ngram_trids_file, id2trids, id2trid2tf)

save_distance_data(dist_file, distances)
```



### Blast results file

```python	
from trseeker.seqio.blast_file import get_blast_result

get_blast_result(blast_file, length, gap_size=1000, min_align=500, min_length=2400, format_function=None, min_score=90)
```

The following parameters are used for filtering:

1. **cutoff** is the TRs array length.
2. **match** and **min_match** come from the **blast_match_proc** settings.
3. **min_align** is the minimal length of an alignment (array_length / min_align).
4. **blast_gap_size** is the size of the gap between two matched regions (array_length * min_match).
5. **min_length** is the minimal total length (array_length * match).

Analysis steps:

1. A ref_gi -> BlastResultModel dictionary is extracted from the BLAST output files.
2. All nested matches are removed.
3. Overlapping matches are joined.
4. Joined matches shorter than **min_align** are dropped.
5. Matches separated by a gap shorter than **gap_size** are joined.
6. Joined gapped matches shorter than **min_align** are dropped.
7. Results are formatted with the given format_function.


Dataset updating functions:

```python
update_with_repbase_blast_result(trs_dataset, annotation_self_folder, filters)

# inside function
# result is semicolon-delimited
trs_dataset[i].trf_repbase = result

# filters example
filters = {
	"blast_gap_size": 300,
	"min_align": 1000,
	"min_length": 3000,
}
```

```python
update_with_self_blast_result(trs_dataset, annotation_self_folder, filters_obj)
	
# inside function
# result comma-delimited
# inside get_filters(array_length) generate parameters by array_length:
# 	filters = filters_obj.get_filters(trf_obj.trf_array_length)
trs_dataset[i].trf_family_self = result
```

```python
update_with_ref_blast_result(trs_dataset, annotation_self_folder, filters)

# inside function
# comma-delimited data
trs_dataset[i].trf_family_ref = result
```

TODO: refactor to a more generic form with setattr(...)

Self and ref annotations are comma-delimited; Repbase annotations are semicolon-delimited. Available format functions:

- format_function_repbase(blast_dataset)
- format_function_self(blast_dataset) [DEFAULT]

While parsing BLAST results:
- all inner matches are skipped
- overlapping matches are joined
- alignments shorter than the min_align parameter are skipped
- gapped matches are joined if the gap is smaller than the gap_size parameter



### SRA file

TODO: move FastqObj to models

```python
from trseeker.seqio.sra_file import FastqObj

fastq_obj = FastqObj(head, seq, strain, qual_str, phred33=False)
print fastq_obj.fastq
print fastq_obj.fastq_with_error(error)
print fastq_obj.sequence
print fastq_obj.fasta
print fastq_obj.trimmed_fastq
print fastq_obj.gc
```

Properties:

- head
- seq
- qual
- strain
- id
- trimmed_seq
- trimmed_qual
- qv: list of Q - phredX
- adaptor_positions: positions of adapter
- adapter_contamination
- qual_junk
- status
- parts

Methods related to trimming:

- trim(), NotImplemented
- trim_by_quality_cutoff(cutoff=30)
- trim_exact_adaptor(adaptor, verbose=False), NotImplemented
- trim_inexact_adaptor(adaptor, verbose=False, num_errors=2), NotImplemented

Additional functions:

- fastq_reader

```python
for fastq_obj in fastq_reader(fastq_file, phred33=False):
	print fastq_obj.seq
```




## Toolkit



### Assembly annotation

Compute intersection between two datasets.
Dataset format: list of tuples (chrName, startPos, endPos, feature_id).

```python
from trseeker.tools.annotation import *

compute_intersection_intervals(data_a, data_b)
```

Usage example:

```python
data_b = open("cat_good_trf.gff3").readlines()
data_b = [x.strip().split("\t") for x in data_b]
data_b = [(x[0], int(x[3]), int(x[4]), x[-1]) for x in data_b]

for x in compute_intersection_intervals(data_a, data_b):
    print x
```



### Assembly tools

```python
from trseeker.tools.assembly_tools import *

N50_contig_length, N50, shortest_seq, longest_seq = get_N50(lengths)
```
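
N50 is the length such that contigs of that length or longer contain at least half of the total assembly length; a usage sketch with hypothetical contig lengths:

```python
from trseeker.tools.assembly_tools import get_N50

lengths = [5000, 3000, 2000, 1000, 500]  # total length 11500
N50_contig_length, N50, shortest_seq, longest_seq = get_N50(lengths)
# for these lengths the N50 is 3000, since 5000 + 3000 >= 11500 / 2
```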



### Tulip files

Format and write network data to tulip format from distance_file with the given cutoff (default=90).
    
- distance_file: (node_a, node_b, distance)
- tulip_file_output: network in tulip format
- cutoff: distance cutoff

```python
from trseeker.tools.tulip_tools import *

format_distances_to_tulip(distance_file, tulip_file_output, cutoff=90)
```



### TRs nomenclature

TODO: rename package
```python
from trseeker.tools.trs_groups import *
```
Get the next family index. Indexes are limited to Latin alphabet letters; when those run out, X1, X2 and so on are returned:
```python
letter = get_index(i)
```
Get most frequent element of list:
```python
element = get_popular(s)
```
Get the unit and letter for a family. The unit is picked by the most common pmatch value. seen_units is a dictionary mapping unit size to frequency; it is needed for choosing the letter:
```python
unit, letter, seen_units = get_family_name(ids, seen_units)	
```
Join families with common members, sort by family size:
```python
join_families_with_common(families)
```



### Sequence tools

```python
from trseeker.tools.sequence_tools import *
```
Return complementary sequence:
```python
revcom = get_revcomp(sequence)
```

Get shift variants for TR monomer:

```python
sequences = get_revcomp(sequence)
```

Return normalized sequence with following rules:

1. if T > A then return reverse complement
2. if A == T and G > C  then return reverse complement
3. else return sequence

```python
sequence = fix_strand(sequence)
```
Count GC content:
```python
gc = get_gc(sequence)
```
Check for N|n in the sequence. Returns "n" (with N) or "w" (whole):
```python
bool_value = check_gapped(sequence)
```
Return subsequence:
```python
get_subseq(seq, start, end)
```
Clear sequence (ACTGN alphabet):

1. lower case
2. remove any gaps
3. remove all letters except atgcn

```python
sequence = clear_sequence(sequence)
```
Return a list of sequences split by the pattern, with end appended. The pattern is a regexp:
```python
restriction(sequence, pattern, end=None)
```
Return number of fragments after restriction of sequence with given regexp pattern:
```python
restriction_fragments_n(sequence, pattern)
```
Check whether two sequences of the same length are cyclic tandem repeat synonyms:
```python
check_cyclic_repeats(seq_a, seq_b)
```
Return sequence with n mutations:
```python
random_mutation(seq, n, alphabet="actgn +")
```
Return consensus string for given list of strings:
```python
get_consensus(strs)
```

Take the minimal sequence from the lexicographically sorted rotations of a sequence and its reverse complement.
Example for ACT (reverse-complement rotations included): ACT, AGT, CTA, GTA, TAC, TAG.

Find all possible multimers, e.g. replace a GTAGTAGTA consensus sequence with ACT.

Return:

- list of sorted TRs
- list of (df, consensus) pairs sorted by array number

```python
remove_consensus_redundancy(trf_objs)
```



### Various useful functions for working with files

```python
from trseeker.tools.seqfile import *
```

- save_list(file_name, data)
- save_dict(file_name, dictionary)
- save_sorted_dict(d, file, by_value=True, reverse=True, min_value=None, key_type=None)
- count_lines(file)
- sort_file_by_int_field(file_name, field)
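
A short usage sketch for the dictionary writers (the data and file names are hypothetical):

```python
from trseeker.tools.seqfile import save_dict, save_sorted_dict

counts = {"ACT": 10, "AGT": 3}
save_dict("counts.tsv", counts)
save_sorted_dict(counts, "counts_sorted.tsv", by_value=True, reverse=True)
```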



### Other useful functions (other_tools.py)

```python
from trseeker.tools.other_tools import *
```

Sort a dictionary by key. Returns a list of (k, v) pairs:
```python
sort_dictionary_by_key(d, reverse=False)
```
Cast list elements to string:
```python
as_strings(alist)
```
Convert a dictionary to a list of (key, value) pairs:
```python
dict2list(adict)
```
Sort a dictionary by value. Returns a list of (v, k) pairs:
```python
sort_dictionary_by_value(d, reverse=False)
```
Remove duplicates from list and sort it:
```python
remove_duplicates_and_sort(data)

remove_duplicates(data)
```
Remove redundant or empty elements from the given list:
```python
remove_redundancy(alist)
```
Remove nested fragments; data format: [start, end, ...]:
```python
clear_fragments_redundancy(data, extend=False, same_case_func=None)
```






### Ngrams (kmers) tools

```python
from trseeker.tools.ngrams_tools import *
```

Yields all ngrams of length n from given text:
```python
generate_ngrams(text, n=12)
```
Yields windows of length n with the given step over the text:
```python
generate_window(text, n=12, step=None)
```
Returns m most frequent (ngram of length n, tf) tuples for given text:
```python
get_ngrams(text, m=None, n=23, k=None, skip_n=False)
```
Returns m most frequent (ngram of length n, fraction of possible ngrams) tuples for given text:
```python
get_ngrams_freq(text, m=None, n=23, k=None)
```
Returns a feature set {'ngram':'ngram',...}  of m most frequent ngram of length n for given text:
```python
get_ngrams_feature_set(text, m=5, n=12)
```
Returns a distance between two ngram sets where distance is a sum(min(ngram_a, ngram_b) for each common ngram). Format for ngrams_a and ngrams_b is a dictionary {ngram:n, ...}:
```python
get_ngram_freq_distance(ngrams_a, ngrams_b)
```
Returns a distance between two ngram sets where distance is a len(). Format for ngrams_a and ngrams_b is a dictionary {ngram:n, ...}:
```python
get_ngram_common_distance(ngrams_a, ngrams_b)
```
Return the repeatness coefficient, from 0 (e.g. polyA) to 1 (a unique sequence):
```python
get_repeatness_coefficent(length, k, kmern)
# Inside this function:
# kmern * (N-1) / (N**2)
```
Return the expressiveness coefficient:
```python
get_expessiveness_coefficent(kmern, k)
# Inside this function:
# kmern * 1. / 4^k
```
Update tf and df data with k-mers from given sequence (it will be lowered):
```python	
from trseeker.tools.ngrams_tools import count_kmer_tfdf

k = 23
tf_dict, df_dict = {}, {}  # start from empty tf/df dictionaries
for sequence in sequences:
	tf_dict, df_dict, local_tf, local_df = count_kmer_tfdf(sequence, tf_dict, df_dict, k)
```

Compute the max df and the number and percent of sequences containing a given ngram. Returns (maxdf, nmaxdf, pmaxdf, ngram_seqs):

```python
(maxdf, nmaxdf, pmaxdf, ngram_seqs) = get_df_stats_for_list(data, k, kmer2df)
```

Get tf and df frequencies for kmer list:

```python
get_kmer_tf_df_for_data(data, k, docids=False)
```

Get list of (kmer, revkmer, tf, df, docids) for given data:

```python
process_list_to_kmer_index(data, k, docids=True, cutoff=None)
```

Get list of (kmer, revkmer, tf, df, docids) for multifasta file:

```python
compute_kmer_index_for_fasta_file(file_name, index_file, k=23)
```

Get a list of (kmer, revkmer, tf, df, docids) for a TRs file, filtered by sequence complexity as defined by the get_zlib_complexity function:

```python
compute_kmer_index_for_trf_file(file_name, index_file, k=23, max_complexity=None, min_complexity=None)
```

Compute kmer coverage and set of kmers:

```python
(m, kmers) = get_sequence_kmer_coverage(sequence, kmers, k)
```



### Edit distance functions

```python
from trseeker.tools.edit_distance import *
```

Return list of element (*d*) positions in given list:
```python
get_pos(L, d)
```
Return the edit distance between two given strings (a dynamic programming implementation). The length of the second sequence does not change the similarity value:

1. monomer mode: double short monomer sequence if len2/len1 < 2
2. full_info: boolean, return full information

Returns: the percent of ED similarity, or, if full_info, (distance, number of d, positions of d, S matrix):
```python
get_ed_similarity(s1, s2, monomer_mode=False, full_info=False, verbose=False)
```

Return full edit distance information for two sequences.

1. s1: sequence of tandem repeat monomer
2. s2: sequence of tandem repeat monomer
3. verbose: boolean
4. monomer mode: double short monomer sequence if len2/len1 < 2.

Return: ("max_sim Nmax_sim %sim pos_list", pos_list, all_result, length seq1, length seq2)

```python
get_edit_distance_info(s1, s2, verbose=False, monomer_mode=False)
```

Return the ED value for two sequences:
```python
get_edit_distance(s1, s2)
```

Get last row of ED matrix between two strings:
```python
get_edit_distance_row(s1, s2)
```

Get the Hamming distance: the number of corresponding symbols that differ in the given strings:
```python
hamming_distance(s1, s2)
```
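
For example, two sequences differing at a single position:

```python
print hamming_distance("GATTACA", "GACTACA")
>>> 1
```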



### Working with Repbase files

TODO: move to Readers

```python
from trseeker.tools.repbase_tools import join_repbase_files
```
The function joins all Repbase fasta files into one large fasta file and reformats the headers for compatibility with NCBI tools:
```python
join_repbase_files(input_folder, output_file)
```



### Working with Trace files

```python
from trseeker.tools.trace_tools import unclip_trace_file
```

Unclip, i.e. remove vector flanks from Trace data:
```python
unclip_trace_file(fasta_file, clip_file, uncliped_file)
```

Read clip dictionary from clip_file_name gz archive:
```python
get_clip_data(clip_file_name)
```




### Function related to statistics

```python
from trseeker.tools.statistics import *
```

Calculate sigma, return sum(abs(xi - mean)):
```python
get_sigma(data)
```
Other statistical helpers; get_simple_statistics returns a dictionary of simple statistics for a given list, with keys mean, variance, sigma, sample deviation:
```python
get_mean(data)

get_standard_deviation(variance)

t_test(sample_mean, dist_mean, variance, N)

get_element_frequences(data)

get_simple_statistics(data)
```



### Parsers

```python
from trseeker.tools.parsers import *

parse_fasta_head(fa_head)

parse_chromosome_name(head)

trf_parse_line(line)

trf_parse_param(line)

trf_parse_head(line)

get_wgs_prefix_from_ref(ref)

get_wgs_prefix_from_head(head)
```
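
For example, parse_fasta_head splits an NCBI-style fasta header into (gi, ref, description), as used by set_ncbi_sequence in the Models section (the header value below is hypothetical):

```python
from trseeker.tools.parsers import parse_fasta_head

head = ">gi|123456|ref|NC_000001.1| Mus musculus chromosome 1"
(gi, ref, description) = parse_fasta_head(head)
```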



### Functions related to TRF
```python
from trseeker.tools.trf_tools import *
```

```python
trf_search(file_name)
```
```python
trf_search_in_dir(folder, verbose=True, file_suffix=".fa", output_folder=None)
```

Parallel TRF search:

```python
trf_worker(args)
trf_search_in_dir_parallel(folder, verbose=True, file_suffix=".fa", output_folder=None, threads=1)
```

Create an output TRF file with tandem repeats whose array length is greater than the cutoff. The function returns the number of tandem repeats in the output file:

```python
trf_filter_by_array_length(trf_file, output_file, cutoff)
```
Create an output TRF file with tandem repeats whose unit length is greater than the cutoff. The function returns the number of tandem repeats in the output file:

```python
trf_filter_by_monomer_length(trf_file, output_file, cutoff)
```
Create an output TRF file with tandem repeats whose GI does not match GI_LIST (a list of GIs; see the TRF and FA specifications; the GI is the first value in a TRF row):

```python
trf_filter_exclude_by_gi_list(trf_file, output_file, gi_list_to_exclude)
```
Write a tab-delimited representation of a TRF file. Representation: numerical|index|agc_apm|with_monomer|family:
```python
trf_representation(trf_file, trf_output, representation)
```
Write statistics data: field, N:
```python
trf_write_field_n_data(trf_file, file_output, field, field_format="%s")
```
Write statistics data: field_a, field_b:
```python
trf_write_two_field_data(trf_file, file_output, field_a, field_b)
```
The function prints chr, all TRs, 3000 TRs, 10000 TRs:
```python
count_trs_per_chrs(all_trf_file)
```
The function prints the number of items with a given fasta head fragment:
```python
count_trf_subset_by_head(trf_file, head_value)
```
Several fasta heads are impossible to parse, so it is simpler to fix them post factum:
```python
fix_chr_names(trf_file, temp_file_name=None, case=None)
```




### Working with TRs datasets
```python
from trseeker.tools.trs_dataset import *
```

- COLORS dictionary
- COLORS_50 dictionary

```python
COLORS_50 = {
  "denim": ("#1560bd", (21,96,189)),
}
# TODO: finish it
```

- get_colors(family)
- get_colors_rgb(family)
- create_mathematice_dataset_by_family(trs_dataset, path_to_mathematica_folder, min_borders, max_borders)



### Working with SRA data
```python
from trseeker.tools.sra_tools import *
```

Iterate over fastaq data.
```python
sra_fastaq_reader(file_name)
```
Iterate over fasta SRA data.

```python
sra_fasta_reader(file_name)
```

- read_fastaq_freq_to_memory(file_name)
- fastaq_to_fasta(file_name, output_file)
- write_fastaq_repeats(input_file, output_file, min_tf=1000)
- seq_to_bin(seq)
- bin_to_seq(bseq)

Read SRA fasta data and write it without repeated reads:
```python
write_reduced_fasta(input_file, output_reduced)
```
Write ngrams data from SRA fasta data.
```python
write_ngrams(input_file, output_ngram, NGRAM_N)
```




### Working with sequence patterns
```python
from trseeker.tools.sequence_patterns import *
```

Available patterns:

- MARS1
- MARS2
- CENPB
- PRDB9

Functions:

Translate a sequence in the DNA15 alphabet into a regular expression:
```python
re_translate(sequence)
```

Return a list of sequences where one letter is changed to N:
```python
re_get_plus_minus(sequence)
re_get_mutations(sequence)
```

Return a list of patterns, the first not mutated and the second mutated:
```python
get_double_pattern(pattern_static, pattern_dynamic)
```

- get_mutated_pattern_twice(pattern)
- get_mutated_pattern_trice(pattern)
- pattern_search(name, sequence, pattern_function, pattern_function_params)



### Working with BLAST
```python
from trseeker.tools.blast_tools import *
```

- blastn(database, query, output, e_value=None)
- create_db(fasta_file, output, verbose=False, title=None)
- alias_tool(dblist, output, title)
- bl2seq(input1, input2, output)
- create_db_for_genome(file_pattern=None, chromosome_list=None, output=None, title=None)
- get_gi_list(gi_score_file, score_limit=90)
- get_all_gi_from_blast(blast_file, mode="gi")
- get_all_blast_obj_from_blast(blast_file, mode="ref")
- blastn_search_for_trs(trf_large_file, db, annotation_self_folder, temp_file, skip_by_family=None, is_huge_alpha=False)
- bl2seq_search_for_trs(trf_large_file, annotation_bl2seq_folder, temp_file)
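
A typical flow is to build a database from a fasta file and then search a query against it; a sketch with hypothetical file names, using the e-value from the BLAST settings above:

```python
from trseeker.tools.blast_tools import create_db, blastn

create_db("trs.fa", "trs_db", verbose=False, title="TRs")
blastn("trs_db", "query.fa", "result.blast", e_value=1e-20)
```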



### Working with suffix arrays
```python
from trseeker.tools.sa_tools import *
```

Fasta to SA input: a text file of sequences delimited by the $ symbol:

```python
fasta_to_sa_input(fasta_file, sa_input_file, index_file_name=None, file_name=None, start_id=None, increment=False)
```

SA input file to fasta file:

```python
sa_input_to_fasta(input_file, output_file)
```

The function precompiles pickled doc2id and id2doc dictionaries:

```python
pickle_dictionary_for_docid_trid(sa_doc_index, doc_to_trf_file, trf_to_doc_file)
```

The function filters and writes SA data with the given minimal tf and df:

```python
filter_sa_dataset(sa_file, output_file, min_tf, min_df)
```

Yields corpus texts. A corpus is a $-delimited text file:

```python
iterate_sa_corpus(corpus)
```



### Working with NCBI genome information
```python
from trseeker.tools.ncbi_genomes import *
```

Scrape finished WGS genome projects with the given SubGroup name:
```python
load_genome_projects(subgroup)
```
Scrape finished WGS genome projects excluding the given SubGroup names:
```python
load_genome_projects_exclude(notsubgroups)
```
Print data for initiating PySatDNA projects:
```python
print_add_project_data(genome_projects, pid_type, pr_type)
```



### Working with graph data
```python
from trseeker.tools.network_tools import *
```

Compute network slices using graph_tool library or networkx.
```python
compute_network(network_file, output_file_pattern, id2nodename)
```

- create_graph_on_fly(network_file)
- load_graph(ml_file)
- analyse_network_graph_tool(G, output_file_pattern, trf_large_index_file)
- write_classification_graph_tool(output_file, components, id2nodename)
- write_classification_neworkx(output_file, components, trid2meta)
- create_graphml(network_file, ml_file, id2nodename)
- load_networkx(network_file)
- init_graph_networkx(network_data, start=0, precise=1, id2nodename=None)
- analyse_networkx(G, network_data, output_file_pattern, id2nodename)



### Working with NCBI annotations
```python
from trseeker.tools.ncbi_annotation_tools import *
```

Read an ideogram file and return a dict chr -> list.

```python
get_ideogram_dict(idiogram_file, mode="NCBI")

get_gene_list(file_gene_list, color='#000000', left_padding=30, gene_group_label='C57BL/6J')
# NB: Not implemented yet.
``` 



### Jellyfish wrapper

```python
from trseeker.tools.jellyfish_tools import *

# jellyfish settings:
location = settings["blast_settings"]["jellyfish_location"]
```

Count kmers with Jellyfish:

```python
count_kmers(input_fasta, ouput_prefix, k, mintf=None)

merge_kmers(folder, ouput_prefix, ouput_file)

stats_kmers(db_file, stats_file)

histo_kmers(db_file, histo_file)

dump_kmers(db_file, fasta_file, dumpmintf)

query_kmers(db_file, query_hashes, both_strands=True, verbose=True)

get_kmer_db_and_fasta(folder, input_file, kmers_file, k=23, mintf=None)

query_and_write_coverage_histogram(db_file, query_sequence, output_file, k=23)
# Save coverage histogram into output_file for given query_sequence.
```
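
A typical sequence of calls; a sketch with hypothetical file names (the exact database file name produced by count_kmers depends on the Jellyfish version):

```python
count_kmers("genome.fa", "genome_k23", 23, mintf=2)
stats_kmers("genome_k23_0", "genome_k23.stats")
dump_kmers("genome_k23_0", "genome_k23.kmers.fa", dumpmintf=2)
```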

Shortcut:

```python
from trseeker.tools.jellyfish_tools import sc_compute_kmer_data

sc_compute_kmer_data(fasta_file, jellyfish_data_folder, jf_db, jf_dat, k, mintf, dumpmintf)
```



### Classification of TRs into types

Classification of TRs into types.

```python
from trseeker.tools.trs_types import *
```

Available types:

```python
trs_types = {
    "all": 0,
    "micro": 0,
    "perfect": 0,
    "too_short": 0,
    "100bp": 0,
    "gc": 0,
    "x4": 0,
    "entropy": 0,
    "ok": 0,
}
```

Usage:

```python
trs_types, class_objs, trf_objs = get_trs_types(trf_all_file, trf_all_class_file, settings)
```

Used settings:

```python
settings["other"]["trs_types"]["possible_array_length"]
settings["other"]["trs_types"]["microsat_monomer"]
settings["other"]["trs_types"]["min_array_length"]
settings["other"]["trs_types"]["min_gc"]
settings["other"]["trs_types"]["max_gc"]
settings["other"]["trs_types"]["min_copy_number"]
settings["other"]["trs_types"]["min_entropy"]
```



### Tree data

```python
from trseeker.tools.tree_tools import *
```

#### NodeData class

Stores tree-relevant data associated with nodes.

```python
node = NodeData(taxon=None, branchlength=1.0,
                 support=None, comment=None, bootstrap=None,
                 taxon_name=None, distance=None, elements=None)
```

#### Chain class

Stores a list of nodes that are linked together.

```python
chain = Chain()

# Return a list of all node ids
ids = chain.all_ids()

# Attaches node to another: (self, node, prev)
chain.add(node, prev=None)

# Deletes node from chain and relinks successors to predecessor: collapse(self, id).
chain.collapse(id)

# Kills a node from chain without caring to what it is connected
chain.kill(id)

# Disconnects node from his predecessor
chain.unlink(id)

# Connects son to parent
chain.link(id)

# Check if grandchild is a subnode of parent
chain.is_parent_of(parent, grandchild)

# Returns a list of all node_ids between two nodes (excluding start, including end)
chain.trace(start, finish)
```

#### Node datastructure

A single node

```python
# Represents a node with one predecessor and multiple successors
node = Node(data=None)
node.set_id(id)
node.set_data(data)
id = node.get_id()

# Returns a list of the node's successors
id = node.get_succ()
# Returns the id of the node's predecessor
id = node.get_prev()
node.set_prev()

# Adds a node id to the node's successors
node.add_succ(id)
# Removes a node id from the node's successors
node.remove_succ(id)
# Sets the node's successors
node.set_succ(id)
```

#### Tree datastructure

Represents a tree using a chain of nodes with one predecessor (=ancestor) and multiple successors (=subclades).

```python

t = Tree(weight=1.0, rooted=False,
                 name="", data=NodeData, max_support=1.0)

t.is_leaf(node)
t.is_subtree(node)
t.is_terminal(node)
t.is_internal(node)
t.is_preterminal(node)
t.has_support(node=None)

l = t.get_node_to_one_d(tree)
t.is_identical(tree)

node = t.node(id)

nodes = t.get_terminals()
n = t.count_terminals(node=None)
node = t.common_ancestor(node1, node2)
d = t.distance(nod1, node2)

dist_dict = t.get_distance_for_all_from_node(id)

bl = t.sum_branchlength(root=None, node=None)

ids = t.split(parent_id=None, n=2, branchlength=1.0)

# Bootstrap the tree with a given cutoff value. Nodes with support below the cutoff will be collapsed
t.bootstrap(value)

# Prunes a terminal taxon from the tree
t.prune(taxon)

# Return a list of all otus downwards from a node
t.get_taxa(node_id=None)

ids = t.search_taxon(taxon)

t.to_string(support_as_branchlengths=False, branchlengths_only=False, plain=True, plain_newick=False, ladderize=None)

t.make_info_string(data, terminal=False)

t.ladderize_nodes(nodes, ladderize=None)

t.newickize(node, ladderize=None)
```

Functions:

Get list of (slice_id, file_path):

```python
slice_files = get_slice_files(folder)
```

read_tree_slices(slice_files):

```python
slice_files = read_tree_slices(slice_files)
```




### Functions for downloading datasets from NCBI

```python
from trseeker.tools.entrez_datanase import *
```

Download all proteins matching a given query into the output file and then return these proteins as fasta text.

```python
fasta_data = download_proteins_from_ncbi(query, output_file, email, batch=500, verbose_step=1000)
```

Download items from given database according to query, rettype, and retmode. Save output to output_file and return as text.

```python
data = download_items_from_ncbi(query, 
                             database, 
                             output_file, 
                             email, 
                             rettype="fasta", 
                             retmode="text", 
                             batch=500, 
                             verbose_step=1000)

```

Get items from given database according to query, rettype, and retmode.

```python
items = get_items_from_ncbi(query, 
                         database, 
                         email, 
                         rettype="fasta", 
                         retmode="text", 
                         batch=500, 
                         verbose_step=1000)
```

Get RNA SRA datasets from NCBI according to taxid.

The output includes LIBRARY_SOURCE, STUDY_ABSTRACT, DESIGN_DESCRIPTION, PRIMARY_ID, DESCRIPTION, and LINKS; the second part of the output contains the full XML data.

```python
data, _ = get_rna_sra_datasets_by_taxid(taxid, email, batch=500, verbose_step=1)

all_links = {}
# collect SRA file name -> URL pairs from the LINKS part of each dataset record
for *item, links in data.values():
    links = dict((url.split("/")[-1], url) for url in links)
    for sra in links:
        all_links[sra] = links[sra]
```

Download and unpack SRA files from NCBI according to taxid.

```python
download_rna_sra_datasets_by_taxid(taxid, email, output_folder, threads=30, batch=500, verbose_step=1)
```

Download genomes and annotation from NCBI according to taxid.

```python
download_genome_assemblies_and_annotation_from_ncbi(taxid, output_folder, threads=30, only_refseq=True)
```


