Changeset 1242:a148f32ba0a0
- Timestamp: 04/29/08 10:36:04
- Author: Anton Nekrutenko <anton@bx.psu.edu>
- branch: default
- convert_revision: svn:9bcadc22-80f8-0310-8a53-c8f022958886/galaxy/trunk@2603
- Message: More clean up and modification of FASTA tools
| r1239 | r1242 | |
| 55 | 55 | <tool file="maf/maf_to_bed.xml" /> |
|---|
| 56 | 56 | <tool file="maf/maf_to_fasta.xml" /> |
|---|
| | 57 | <tool file="fasta_tools/tabular_to_fasta.xml" /> |
|---|
| | 58 | </section> |
|---|
| | 59 | <section name="FASTA manipulation" id="fasta_manipulation"> |
|---|
| | 60 | <tool file="fasta_tools/fasta_compute_length.xml" /> |
|---|
| | 61 | <tool file="fasta_tools/fasta_filter_by_length.xml" /> |
|---|
| | 62 | <tool file="fasta_tools/fasta_concatenate_by_species.xml" /> |
|---|
| | 63 | <tool file="fasta_tools/fasta_to_tabular.xml" /> |
|---|
| 57 | 64 | <tool file="fasta_tools/tabular_to_fasta.xml" /> |
|---|
| 58 | 65 | </section> |
|---|
| r1213 | r1242 | |
| 265 | 265 | <tool file="fasta_tools/fasta_filter_by_length.xml" /> |
|---|
| 266 | 266 | <tool file="fasta_tools/fasta_concatenate_by_species.xml" /> |
|---|
| | 267 | <tool file="fasta_tools/fasta_to_tabular.xml" /> |
|---|
| | 268 | <tool file="fasta_tools/tabular_to_fasta.xml" /> |
|---|
| 267 | 269 | </section> |
|---|
| 268 | 270 | <section name="Short Read Analysis" id="short_read_analysis"> |
|---|
| r1214 | r1242 | |
| 13 | 13 | input_filename = sys.argv[1] |
|---|
| 14 | 14 | output_filename = sys.argv[2] |
|---|
| | 15 | keep_first = int( sys.argv[3] ) + 1 |
|---|
| 15 | 16 | tmp_title = tmp_seq = '' |
|---|
| 16 | 17 | tmp_seq_count = 0 |
|---|
| 17 | 18 | seq_hash = {} |
|---|
| | 19 | |
|---|
| | 20 | if keep_first == 0: |
|---|
| | 21 | keep_first = None |
|---|
| 18 | 22 | |
|---|
| 19 | 23 | for i, line in enumerate( file( input_filename ) ): |
|---|
| … | … | |
| 39 | 43 | for i, fasta_title in title_keys: |
|---|
| 40 | 44 | tmp_seq = seq_hash[ ( i, fasta_title ) ] |
|---|
| 41 | | output_handle.write( "%s\t%d\n" % ( fasta_title[ 1: ], len( tmp_seq ) ) ) |
|---|
| | 45 | output_handle.write( "%s\t%d\n" % ( fasta_title[ 1:keep_first ], len( tmp_seq ) ) ) |
|---|
| 42 | 46 | output_handle.close() |
|---|
| 43 | 47 | |
|---|
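
The `keep_first` handling added in this hunk relies on Python slice semantics: `fasta_title[ 1:keep_first ]` drops the leading `>` and, when `keep_first` is `None`, keeps the rest of the title. One thing worth checking against the hunk as shown: `int( sys.argv[3] ) + 1` turns an input of `0` into `1`, so the `keep_first == 0` test would not appear to fire for the default value, and the slice `[ 1:1 ]` would be empty. The sketch below is one possible way to get the documented "'0' = keep the whole thing" behaviour by testing for zero before adding one. The helper name `truncate_title` is hypothetical and not part of the changeset.

```python
def truncate_title(fasta_title, keep_first):
    """Hypothetical helper: strip the leading '>' and keep `keep_first` characters.

    keep_first == 0 means "keep the whole title", as the tool's help text states.
    """
    # Test for zero BEFORE shifting by one; the +1 accounts for the '>' at index 0.
    end = None if keep_first == 0 else keep_first + 1
    return fasta_title[1:end]


title = ">EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_"
print(truncate_title(title, 14))  # first 14 title characters after '>'
print(truncate_title(title, 0))   # whole title without the leading '>'
```
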
| r1182 | r1242 | |
| 1 | | <tool id="fasta_compute_length" name="Count FASTA Length"> |
|---|
| 2 | | <description> </description> |
|---|
| 3 | | <command interpreter="python">fasta_compute_length.py $input $output </command> |
|---|
| | 1 | <tool id="fasta_compute_length" name="Compute"> |
|---|
| | 2 | <description>sequence length</description> |
|---|
| | 3 | <command interpreter="python">fasta_compute_length.py $input $output $keep_first</command> |
|---|
| 4 | 4 | <inputs> |
|---|
| 5 | | <param name="input" type="data" format="fasta" label="Fasta file"/> |
|---|
| | 5 | <param name="input" type="data" format="fasta" label="Compute length for these sequences"/> |
|---|
| | 6 | <param name="keep_first" type="integer" size="5" value="0" label="How many title characters to keep?" help="'0' = keep the whole thing"/> |
|---|
| 6 | 7 | </inputs> |
|---|
| 7 | 8 | <outputs> |
|---|
| … | … | |
| 11 | 12 | <test> |
|---|
| 12 | 13 | <param name="input" value="454.fasta" /> |
|---|
| | 14 | <param name="keep_first" value="0"/> |
|---|
| 13 | 15 | <output name="output" file="fasta_tool_compute_length_1.out" /> |
|---|
| 14 | 16 | </test> |
|---|
| 15 | 17 | <test> |
|---|
| 16 | 18 | <param name="input" value="extract_genomic_dna_out1.fasta" /> |
|---|
| | 19 | <param name="keep_first" value="0"/> |
|---|
| 17 | 20 | <output name="output" file="fasta_tool_compute_length_2.out" /> |
|---|
| 18 | 21 | </test> |
|---|
| … | … | |
| 22 | 25 | **What it does** |
|---|
| 23 | 26 | |
|---|
| 24 | | This tool counts the length of each fasta sequence in the file. The output file has two columns per line (separated by tab): fasta titles and lengths of the sequences. |
|---|
| | 27 | This tool counts the length of each fasta sequence in the file. The output file has two columns per line (separated by tab): fasta titles and lengths of the sequences. The option *How many characters to keep?* allows to select a specified number of letters from the beginning of each FASTA entry. |
|---|
| 25 | 28 | |
|---|
| 26 | 29 | ----- |
|---|
| … | … | |
| 28 | 31 | **Example** |
|---|
| 29 | 32 | |
|---|
| 30 | | - assume the input file contains fasta sequences:: |
|---|
| | 33 | Suppose you have the following FASTA formatted sequences from a Roche (454) FLX sequencing run:: |
|---|
| 31 | 34 | |
|---|
| 32 | | >seq1 |
|---|
| 33 | | TCATTTA |
|---|
| 34 | | >seq2 |
|---|
| 35 | | ATGGCGTCGGCC |
|---|
| 36 | | >seq3 |
|---|
| 37 | | TCACATGATG |
|---|
| | 35 | >EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_ |
|---|
| | 36 | TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACG |
|---|
| | 37 | TTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG |
|---|
| | 38 | >EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_ |
|---|
| | 39 | AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA |
|---|
| 38 | 40 | |
|---|
| 39 | | - the tool will return(first column is the titles, second column is the length of the sequences):: |
|---|
| | 41 | Running this tool while setting **How many characters to keep?** to **14** will produce this:: |
|---|
| 40 | 42 | |
|---|
| 41 | | >seq1 7 |
|---|
| 42 | | >seq2 12 |
|---|
| 43 | | >seq3 10 |
|---|
| 44 | | |
|---|
| | 43 | EYKX4VC02EQLO5 108 |
|---|
| | 44 | EYKX4VC02D4GS2 60 |
|---|
| | 45 | |
|---|
| 45 | 46 | |
|---|
| 46 | 47 | </help> |
|---|
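
The behaviour described in the help text above (one output row per sequence, title and length separated by a tab, with optional title truncation) can be sketched as follows. This is an illustration written for Python 3, not the `fasta_compute_length.py` script itself; the function name `fasta_lengths` is an assumption, and sequence lines that appear before the first title are simply ignored here.

```python
def fasta_lengths(lines, keep_first=0):
    """Yield (title, length) pairs from an iterable of FASTA lines.

    Hypothetical helper; keep_first == 0 keeps the whole title.
    """
    end = None if keep_first == 0 else keep_first + 1
    title, length = None, 0
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new title closes the previous record, if there was one.
            if title is not None:
                yield title[1:end], length
            title, length = line, 0
        elif title is not None:
            # Sequences may span several lines; add up the pieces.
            length += len(line)
    if title is not None:
        yield title[1:end], length


example = """\
>EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_
TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACG
TTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG
>EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_
AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA
""".splitlines()

for title, length in fasta_lengths(example, keep_first=14):
    print("%s\t%d" % (title, length))
```

With `keep_first=14` this prints the two rows from the help example (`EYKX4VC02EQLO5` with 108 and `EYKX4VC02D4GS2` with 60).
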
| r1124 | r1242 | |
| 1 | | <tool id="fasta_concatenate0" name="Concatenate FASTA" version="0.0.0"> |
|---|
| 2 | | <description>alignment by species</description> |
|---|
| | 1 | <tool id="fasta_concatenate0" name="Concatenate" version="0.0.0"> |
|---|
| | 2 | <description>FASTA alignment by species</description> |
|---|
| 3 | 3 | <command interpreter="python">fasta_concatenate_by_species.py $input1 $out_file1</command> |
|---|
| 4 | 4 | <inputs> |
|---|
| … | … | |
| 15 | 15 | </tests> |
|---|
| 16 | 16 | <help> |
|---|
| | 17 | |
|---|
| | 18 | **What it does** |
|---|
| | 19 | |
|---|
| 17 | 20 | This tools attempts to parse FASTA headers to determine the species for each sequence in a multiple FASTA alignment. |
|---|
| 18 | 21 | It then linearly concatenates the sequences for each species in the file, creating one sequence per determined species. |
|---|
| | 22 | |
|---|
| | 23 | ------- |
|---|
| 19 | 24 | |
|---|
| 20 | 25 | **Example** |
|---|
| … | … | |
| 46 | 51 | |
|---|
| 47 | 52 | |
|---|
| 48 | | Becomes:: |
|---|
| | 53 | becomes:: |
|---|
| 49 | 54 | |
|---|
| 50 | 55 | >hg18 |
|---|
| r1214 | r1242 | |
| 1 | | <tool id="fasta_filter_by_length" name="Filter FASTA by Length"> |
|---|
| 2 | | <description> </description> |
|---|
| | 1 | <tool id="fasta_filter_by_length" name="Filter"> |
|---|
| | 2 | <description>sequences by length</description> |
|---|
| 3 | 3 | <command interpreter="python">fasta_filter_by_length.py $input $min_length $max_length $output </command> |
|---|
| 4 | 4 | <inputs> |
|---|
| 5 | 5 | <param name="input" type="data" format="fasta" label="Fasta file"/> |
|---|
| 6 | | <param name="min_length" type="integer" size="15" value="0" label="Minimal length of the return sequence" /> |
|---|
| 7 | | <param name="max_length" type="integer" size="15" value="0" label="Maximum length of the return sequence" help="no limitation if 0"/> |
|---|
| | 6 | <param name="min_length" type="integer" size="15" value="0" label="Minimal length" /> |
|---|
| | 7 | <param name="max_length" type="integer" size="15" value="0" label="Maximum length" help="Setting to '0' will return all sequences longer than the 'Minimal length'"/> |
|---|
| 8 | 8 | </inputs> |
|---|
| 9 | 9 | <outputs> |
|---|
| … | … | |
| 22 | 22 | .. class:: infomark |
|---|
| 23 | 23 | |
|---|
| 24 | | **TIP**. If only want to show sequences longer than a threshold, set *minimal length* to the threshold and leave *maximum length* to zero. |
|---|
| | 24 | **TIP**. To return sequences longer than a certain length, set *Minimal length* to desired value and leave *Maximum length* set to '0'. |
|---|
| 25 | 25 | |
|---|
| 26 | 26 | ----- |
|---|
| … | … | |
| 28 | 28 | **What it does** |
|---|
| 29 | 29 | |
|---|
| 30 | | This tool accepts two parameters: *minimal length* and *maximum length*, and returns sequences of length within the two thresholds. |
|---|
| | 30 | Outputs sequences between *Minimal length* and *Maximum length*. |
|---|
| 31 | 31 | |
|---|
| 32 | 32 | ----- |
|---|
| … | … | |
| 34 | 34 | **Example** |
|---|
| 35 | 35 | |
|---|
| 36 | | - assume the input file contains fasta sequences:: |
|---|
| | 36 | Suppose you have the following FASTA formatted sequences:: |
|---|
| 37 | 37 | |
|---|
| 38 | 38 | >seq1 |
|---|
| … | … | |
| 45 | 45 | ATGGAAGC |
|---|
| 46 | 46 | |
|---|
| 47 | | - return sequences with length longer than 10bp (set the *minimal length* to 10, and the *maximum length* to 0 (no limitation)):: |
|---|
| | 47 | Setting the **Minimal length** to **10**, and the **Maximum length** to **0** will return all sequences longer than 10 bp:: |
|---|
| 48 | 48 | |
|---|
| 49 | 49 | >seq1 |
|---|
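
The length filter described in this help text, where a *Maximum length* of `0` means "no upper limit", can be sketched as below. The function name `filter_by_length` and the toy records are illustrative only, and treating both bounds as inclusive is an assumption: the diff does not show how `fasta_filter_by_length.py` handles sequences that are exactly at a threshold.

```python
def filter_by_length(records, min_length=0, max_length=0):
    """Return the (title, sequence) pairs whose length passes both thresholds.

    max_length == 0 disables the upper bound. Inclusive bounds are an
    assumption of this sketch, not something the changeset confirms.
    """
    kept = []
    for title, seq in records:
        if len(seq) < min_length:
            continue
        if max_length != 0 and len(seq) > max_length:
            continue
        kept.append((title, seq))
    return kept


# Toy records of length 4, 12 and 20 (made up for illustration).
records = [(">a", "ACGT"), (">b", "ACGT" * 3), (">c", "ACGT" * 5)]
print(filter_by_length(records, min_length=10))                 # keeps >b and >c
print(filter_by_length(records, min_length=10, max_length=15))  # keeps >b only
```
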
| r1214 | r1242 | |
| 14 | 14 | infile = sys.argv[1] |
|---|
| 15 | 15 | outfile = sys.argv[2] |
|---|
| | 16 | keep_first = int( sys.argv[3] ) + 1 |
|---|
| 16 | 17 | title = '' |
|---|
| 17 | 18 | sequence = '' |
|---|
| 18 | 19 | sequence_count = 0 |
|---|
| | 20 | |
|---|
| | 21 | if keep_first == 0: |
|---|
| | 22 | keep_first = None |
|---|
| | 23 | |
|---|
| 19 | 24 | for i, line in enumerate( open( infile ) ): |
|---|
| 20 | 25 | line = line.rstrip( '\r\n' ) |
|---|
| … | … | |
| 39 | 44 | for i, fasta_title in title_keys: |
|---|
| 40 | 45 | sequence = seq_hash[( i, fasta_title )] |
|---|
| 41 | | out.write( "%s\t%s\n" %( fasta_title, sequence ) ) |
|---|
| | 46 | out.write( "%s\t%s\n" %( fasta_title[ 1:keep_first ], sequence ) ) |
|---|
| 42 | 47 | out.close() |
|---|
| 43 | 48 | |
|---|
| r1182 | r1242 | |
| 1 | 1 | <tool id="fasta2tab" name="FASTA-to-Tabular" version="1.0.0"> |
|---|
| 2 | | <description>Converts a FASTA file to Tabular format</description> |
|---|
| 3 | | <command interpreter="python">fasta_to_tabular.py $input $output</command> |
|---|
| | 2 | <description>converts FASTA file to tabular format</description> |
|---|
| | 3 | <command interpreter="python">fasta_to_tabular.py $input $output $keep_first</command> |
|---|
| 4 | 4 | <inputs> |
|---|
| 5 | | <param name="input" type="data" format="fasta" label="Fasta file"/> |
|---|
| | 5 | <param name="input" type="data" format="fasta" label="Convert these sequences"/> |
|---|
| | 6 | <param name="keep_first" type="integer" size="5" value="0" label="How many title characters to keep?" help="'0' = keep the whole thing"/> |
|---|
| 6 | 7 | </inputs> |
|---|
| 7 | 8 | <outputs> |
|---|
| … | … | |
| 11 | 12 | <test> |
|---|
| 12 | 13 | <param name="input" value="454.fasta" /> |
|---|
| | 14 | <param name="keep_first" value="0"/> |
|---|
| 13 | 15 | <output name="output" file="fasta_to_tabular_out1.tabular" /> |
|---|
| 14 | 16 | </test> |
|---|
| 15 | 17 | <test> |
|---|
| 16 | 18 | <param name="input" value="4.fasta" /> |
|---|
| | 19 | <param name="keep_first" value="0"/> |
|---|
| 17 | 20 | <output name="output" file="fasta_to_tabular_out2.tabular" /> |
|---|
| 18 | 21 | </test> |
|---|
| … | … | |
| 20 | 23 | <help> |
|---|
| 21 | 24 | |
|---|
| | 25 | **What it does** |
|---|
| | 26 | |
|---|
| | 27 | This tool converts FASTA formatted sequences to TAB-delimited format. The option *How many characters to keep?* allows to select a specified number of letters from the beginning of each FASTA entry. |
|---|
| | 28 | |
|---|
| | 29 | ----- |
|---|
| | 30 | |
|---|
| 22 | 31 | **Example** |
|---|
| 23 | 32 | |
|---|
| 24 | | A fasta file with two sequences:: |
|---|
| | 33 | Suppose you have the following FASTA formatted sequences from a Roche (454) FLX sequencing run:: |
|---|
| 25 | 34 | |
|---|
| 26 | | >seq1 |
|---|
| 27 | | CCGGTATCCG |
|---|
| 28 | | >seq2 |
|---|
| 29 | | CTTACC |
|---|
| | 35 | >EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_ |
|---|
| | 36 | TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACG |
|---|
| | 37 | TTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG |
|---|
| | 38 | >EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_ |
|---|
| | 39 | AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA |
|---|
| 30 | 40 | |
|---|
| 31 | | Returns:: |
|---|
| | 41 | Running this tool while setting **How many characters to keep?** to **14** will produce this:: |
|---|
| | 42 | |
|---|
| | 43 | EYKX4VC02EQLO5 TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACGTTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG |
|---|
| | 44 | EYKX4VC02D4GS2 AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA |
|---|
| 32 | 45 | |
|---|
| 33 | | >seq1 CCGGTATCCG |
|---|
| 34 | | >seq2 CTTACC |
|---|
| 35 | 46 | |
|---|
| 36 | 47 | </help> |
|---|
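
As a rough illustration of the conversion this help text describes, the sketch below joins multi-line sequences and emits one `title<TAB>sequence` row per record, with the leading `>` removed as in the r1242 example output. It is written for Python 3 and is not the `fasta_to_tabular.py` script; the name `fasta_to_tabular` and the return type (a list of strings) are assumptions.

```python
def fasta_to_tabular(lines, keep_first=0):
    """Convert FASTA lines to tab-delimited rows (hypothetical helper).

    keep_first == 0 keeps the whole title; otherwise only that many
    characters after the leading '>' are kept.
    """
    end = None if keep_first == 0 else keep_first + 1
    rows = []  # [title, sequence] pairs in input order
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            rows.append([line[1:end], ""])
        elif rows:
            # Continuation line: append to the current record's sequence.
            rows[-1][1] += line
    return ["%s\t%s" % (title, seq) for title, seq in rows]


for row in fasta_to_tabular([">seq1 demo", "CCGG", "TATCCG", ">seq2", "CTTACC"], keep_first=4):
    print(row)
```
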
| r1183 | r1242 | |
| 1 | 1 | <tool id="tab2fasta" name="Tabular-to-FASTA" version="1.1.0"> |
|---|
| 2 | | <description>Converts a tabular file to FASTA format</description> |
|---|
| | 2 | <description>converts tabular file to FASTA format</description> |
|---|
| 3 | 3 | <command interpreter="python">tabular_to_fasta.py $input $title_col $seq_col $output </command> |
|---|
| 4 | 4 | <inputs> |
|---|
| … | … | |
| 20 | 20 | <help> |
|---|
| 21 | 21 | |
|---|
| | 22 | **What it does** |
|---|
| | 23 | |
|---|
| | 24 | Converts tab delimited data into FASTA formatted sequences. |
|---|
| | 25 | |
|---|
| | 26 | ----------- |
|---|
| | 27 | |
|---|
| 22 | 28 | **Example** |
|---|
| 23 | 29 | |
|---|
| 24 | | Solexa data:: |
|---|
| | 30 | Suppose this is a sequence file produced by Illumina (Solexa) sequencer:: |
|---|
| 25 | 31 | |
|---|
| 26 | 32 | 5 300 902 419 GACTCATGATTTCTTACCTATTAGTGGTTGAACATC |
|---|
| 27 | 33 | 5 300 880 431 GTGATATGTATGTTGACGGCCATAAGGCTGCTTCTT |
|---|
| 28 | 34 | |
|---|
| 29 | | Selecting **c3 and c4** as the Title Columns and **c5** as the Sequence Column will result in:: |
|---|
| | 35 | Selecting **c3** and **c4** as the **Title column(s)** and **c5** as the **Sequence column** will result in:: |
|---|
| 30 | 36 | |
|---|
| 31 | 37 | >902_419 |
|---|
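
The reverse conversion can be sketched in the same spirit. The example above shows title columns **c3** and **c4** producing `>902_419`, so this sketch joins the selected title fields with an underscore. Everything else is an assumption made for illustration: the function name `tabular_to_fasta`, the 1-based column arguments, and the choice to write each sequence on a single line after its title.

```python
def tabular_to_fasta(lines, title_cols, seq_col):
    """Convert tab-delimited lines to FASTA text (hypothetical helper).

    title_cols is a list of 1-based column numbers (c1 is 1, c2 is 2, ...);
    seq_col is the 1-based column holding the sequence.
    """
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        title = "_".join(fields[c - 1] for c in title_cols)
        records.append(">%s\n%s\n" % (title, fields[seq_col - 1]))
    return "".join(records)


solexa = [
    "5\t300\t902\t419\tGACTCATGATTTCTTACCTATTAGTGGTTGAACATC",
    "5\t300\t880\t431\tGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTT",
]
print(tabular_to_fasta(solexa, title_cols=[3, 4], seq_col=5), end="")
```
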