FALDO-paper
This GitHub repository is being used to draft a scientific manuscript
to describe FALDO, a formal ontology for Feature Annotation Locations
in RDF: https://github.com/JervenBolleman/FALDO

FALDO was begun at the BioHackathon 2012 meeting in Japan,
https://github.com/dbcls/bh12/wiki/Feature-annotation-locations-in-RDF


Citation
========

A preprint of this work is now available to be cited as follows:

> Jerven Bolleman, Christopher J. Mungall, Francesco Strozzi, Joachim Baran,
> Michel Dumontier, Raoul J. P. Bonnal, Robert Buels, Robert Hoehndorf,
> Takatomo Fujisawa, Toshiaki Katayama, Peter J. A. Cock (2014)
> **FALDO: A semantic standard for describing the location of nucleotide
> and protein feature annotation.** bioRxiv http://dx.doi.org/10.1101/002121
> http://biorxiv.org/content/early/2014/01/31/002121

Formal journal submission is planned shortly.


LaTeX
=====

We are currently writing the manuscript in LaTeX, using a BMC journal
template (files named `bmc_article.*`), with `location.tex` as the
primary file, which includes the sections as separate child files:

 * `abstract.tex` - Abstract
 * `background.tex` - Background
 * `implementation.tex` - Implementation
 * `results.tex` - Results
 * `discussion.tex` - Discussion
 * `conclusions.tex` - Conclusions
 * `avareq.tex` - Availability and requirements

To produce the whole PDF file, use LaTeX and BibTeX:

    $ pdflatex location.tex
    $ bibtex location
    $ pdflatex location.tex

------------------------------------------------------------------------------

FALDO
=====

FALDO is the Feature Annotation Location Description Ontology.
It is a simple ontology to describe sequence feature positions and regions as found in 
[GFF3](http://www.sequenceontology.org/gff3.shtml), [DDBJ](http://www.ddbj.nig.ac.jp),
[EMBL](http://www.embl.org), [GenBank](http://www.ncbi.nlm.nih.gov/genbank) files,
[UniProt](http://www.uniprot.org), and many other bioinformatics resources.

The aim of this ontology is to describe the position of a sequence region or a feature.
It does not aim to describe the features or regions themselves, but instead depends on resources
such as the Sequence Ontology or the UniProt core ontology.

Examples
--------

The Turtle examples omit prefix declarations to save space.

### Known positions

A genomic region (`faldo:Region`) where we know exactly where it starts and ends on the reference genome sequence:

```turtle
<_:1> a faldo:Region ;
           faldo:begin <_:1b> ;
           faldo:end <_:1e> .

<_:1b> a faldo:Position ; 
           a faldo:ExactPosition ;
           a faldo:ForwardStrandPosition ;
            faldo:position "1"^^xsd:integer ;
            faldo:reference ddbj:XXXDSDS .

<_:1e> a faldo:Position ; 
           a faldo:FuzzyPosition ;
           a faldo:ForwardStrandPosition ;
           faldo:begin <_:1ea> ;
           faldo:end <_:1eb> ;
           faldo:reference ddbj:XXXDSDS .

<_:1ea> a faldo:Position ;
        a faldo:ExactPosition ;
        a faldo:ForwardStrandPosition ;
           faldo:position "3"^^xsd:integer ;
           faldo:reference ddbj:XXXDSDS .

<_:1eb> a faldo:Position ;
        a faldo:ExactPosition ;
        a faldo:ForwardStrandPosition ;
           faldo:position "7"^^xsd:integer ;
           faldo:reference ddbj:XXXDSDS .
```

A genomic region where the begin is on one contig and the end on another:

```turtle
<_:2> a faldo:Region ;
           faldo:begin <_:2b> ;
           faldo:end <_:2e> .
<_:2b> a faldo:Position ; 
            a faldo:ExactPosition ;
            faldo:position "1"^^xsd:integer ;
            faldo:reference <_:contig17> .
<_:2e> a faldo:Position; 
           a faldo:ExactPosition ;
           faldo:position "4"^^xsd:integer ;
           faldo:reference <_:contig29> .
```

A rather crucial difference from most begin and end conventions is that here they are the biological begin and end,
not a convention where the smaller number is the start and the larger number is the end.

```
----->increasing count of position
123456789012345678901234567890
actgacgactagatcgatcgatcgactagt

tgactgctgatctagctagctagctgatca
     <----- direction of transcription 
     |    |--transcription on reverse strand begins here
     |--transcription on reverse strand ends here      
```

For example the *cheY* gene in
Escherichia coli str. K-12 substr. [MG1655](http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.2)
is described in the INSDC feature table as `complement(1965072..1965461)`,
which is 390 base pairs using inclusive one-based counting. In FALDO this is described as:
```turtle
<_:geneCheY> a  ; # A gene as defined by the Sequence Ontology
           rdfs:label "cheY" ;
           faldo:location <_:example> .

uniprot:P0AE67 up:encodedBy <_:geneCheY> .

<_:example> a faldo:Region ;
           faldo:begin <_:example_b> ;
           faldo:end <_:example_e> .

<_:example_b> a faldo:Position ,
                faldo:ExactPosition ,
                faldo:ReverseStrandPosition ;
            faldo:position "1965461"^^xsd:integer ; #see the end is smaller than the begin
            faldo:reference refseq:NC_000913.2 .


<_:example_e> a faldo:Position ,
                faldo:ExactPosition ,
                faldo:ReverseStrandPosition ;
            faldo:position "1965072"^^xsd:integer ; #see the end is smaller than the begin
            faldo:reference refseq:NC_000913.2 .
```
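
The mapping from the INSDC location string to the begin and end values in this example can be sketched in Python. This is only an illustration and not part of FALDO: the helper name `insdc_to_faldo` and the keys of the returned dictionary are assumptions, and only simple `start..end` and `complement(start..end)` locations are handled.

```python
import re

# Hypothetical helper for illustration; not part of FALDO or any INSDC toolkit.
# Only "start..end" and "complement(start..end)" are recognised.
_SIMPLE_LOCATION = re.compile(r"^(complement\()?(\d+)\.\.(\d+)(\))?$")


def insdc_to_faldo(location):
    """Map a simple INSDC location to biological begin/end values."""
    match = _SIMPLE_LOCATION.match(location.strip())
    # Reject strings with an unbalanced "complement(" ... ")" wrapper.
    if match is None or (match.group(1) is None) != (match.group(4) is None):
        raise ValueError(f"unsupported INSDC location: {location!r}")
    low, high = int(match.group(2)), int(match.group(3))
    if low > high:
        raise ValueError(f"start is larger than end in {location!r}")
    reverse = match.group(1) is not None
    return {
        # On the reverse strand the biological begin is the larger coordinate.
        "begin": high if reverse else low,
        "end": low if reverse else high,
        "strand": ("faldo:ReverseStrandPosition" if reverse
                   else "faldo:ForwardStrandPosition"),
        # Inclusive one-based counting: both endpoints are part of the feature.
        "length": high - low + 1,
    }


chey = insdc_to_faldo("complement(1965072..1965461)")
print(chey["begin"], chey["end"], chey["length"])  # 1965461 1965072 390
```

For a forward-strand location such as `10..20` the same function returns begin 10 and end 20, so the strand type alone tells a consumer which coordinate is numerically larger.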

### Fuzzy positions

Assume we have a protein amino acid sequence "ACK" and a mass spectrometry experiment says that amino acid
A or C is glycosylated, but we don't know which of the two it is. We do know it is not "K".


```turtle
<_:glycosylatedAminoAcid>	a	glycan:glycol:glycosylated_AA ; # The glycan ontology is used here
				faldo:location <_:fuzzyPosition> .
<_:fuzzyPosition> 	a 	faldo:FuzzyPosition ,
				faldo:InRangePosition ;
			faldo:begin <_:exactBegin> ;
			faldo:end   <_:exactEnd> .
<_:exactBegin>		a	faldo:ExactPosition ;
			faldo:position 1 ;
			faldo:reference <_:sequence> .
<_:exactEnd>		a	faldo:ExactPosition ;
			faldo:position 2 ;
			faldo:reference <_:sequence> .
<_:sequence> a uniprot:Sequence ;
           rdf:value "ACK" .
```
In the above example, the `uniprot` and `glycan` prefixes refer to the UniProt and glycoprotein schemas, respectively.
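
To make the one-based, inclusive range concrete, the following Python sketch lists the residues that the in-range position above could refer to. The function name `candidate_residues` is hypothetical and is not defined by FALDO.

```python
def candidate_residues(sequence, begin, end):
    """Return (position, residue) pairs inside a one-based, inclusive range."""
    if not 1 <= begin <= end <= len(sequence):
        raise ValueError("range must lie within the sequence")
    # Python strings are zero-based, so position p maps to index p - 1.
    return [(pos, sequence[pos - 1]) for pos in range(begin, end + 1)]


candidates = candidate_residues("ACK", 1, 2)
print(candidates)  # [(1, 'A'), (2, 'C')]
```

The lysine at position 3 is outside the range, which matches the statement that the modified residue is known not to be "K".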

### Probabilistic fuzzy positions

Here we have a begin position that could be one of two nucleotides. This case uses
a probabilistic model denoting that the feature could start at either position 1 or position 2. Position 1
has a likelihood of 0.1 and position 2 has a likelihood of 0.9.

```turtle
<_:3> a    faldo:Region ;
           faldo:begin <_:3b> ;
           faldo:end <_:3e> .

<_:3b> a   faldo:ProbablePosition ;
           faldop:posibilities ( <_:3bp1> <_:3bp2> ) .

<_:3bp1> a faldop:ProbablePosition ;
           faldop:probability "0.1"^^xsd:double ;
           faldop:location <_:3bb1> .

<_:3bp2> a faldop:ProbablePosition ;
           faldop:probability "0.9"^^xsd:double ;
           faldop:location <_:3bb2> .
<_:3bb1> a faldo:Position ,
           faldo:ExactPosition ;
           faldo:position "1"^^xsd:integer ;
           faldo:reference <_:1Strand> .

<_:3bb2> a faldo:Position ,
           faldo:ExactPosition ;
           faldo:position "2"^^xsd:integer ;
           faldo:reference <_:1Strand> .
```
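
A consumer of such data might want to check the alternatives and pick the most likely begin position. The Python sketch below does this for the two values in the example. It assumes that the listed alternatives are exhaustive, so their probabilities should sum to 1; that requirement and the function name `most_likely_position` are assumptions made for illustration, not something the example ontology states.

```python
import math

# Position -> probability, taken from the example above.
possibilities = {1: 0.1, 2: 0.9}


def most_likely_position(possibilities):
    """Return the position with the highest probability."""
    if not possibilities:
        raise ValueError("no alternatives given")
    total = sum(possibilities.values())
    # Assumption: the alternatives are exhaustive, so they must sum to 1.
    if not math.isclose(total, 1.0):
        raise ValueError(f"probabilities sum to {total}, expected 1.0")
    return max(possibilities, key=possibilities.get)


print(most_likely_position(possibilities))  # 2
```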

License
-------

[![Creative Commons License](http://i.creativecommons.org/l/by/3.0/88x31.png)](http://creativecommons.org/licenses/by/3.0/) This work is licensed under a [Creative Commons Attribution 3.0 Unported License](http://creativecommons.org/licenses/by/3.0/).

