Changeset 1242:a148f32ba0a0

Timestamp: 04/29/08 10:36:04
Author: Anton Nekrutenko <anton@bx.psu.edu>
branch: default
convert_revision: svn:9bcadc22-80f8-0310-8a53-c8f022958886/galaxy/trunk@2603
Message: More clean up and modification of FASTA tools

Files:

  • tool_conf.xml.main

    r1239 r1242  
    5555    <tool file="maf/maf_to_bed.xml" /> 
    5656    <tool file="maf/maf_to_fasta.xml" /> 
     57    <tool file="fasta_tools/tabular_to_fasta.xml" /> 
     58  </section> 
     59  <section name="FASTA manipulation" id="fasta_manipulation"> 
     60    <tool file="fasta_tools/fasta_compute_length.xml" /> 
     61    <tool file="fasta_tools/fasta_filter_by_length.xml" /> 
     62    <tool file="fasta_tools/fasta_concatenate_by_species.xml" /> 
     63    <tool file="fasta_tools/fasta_to_tabular.xml" /> 
    5764    <tool file="fasta_tools/tabular_to_fasta.xml" /> 
    5865  </section> 
  • tool_conf.xml.sample

    r1213 r1242  
    265265    <tool file="fasta_tools/fasta_filter_by_length.xml" /> 
    266266    <tool file="fasta_tools/fasta_concatenate_by_species.xml" /> 
     267    <tool file="fasta_tools/fasta_to_tabular.xml" /> 
     268    <tool file="fasta_tools/tabular_to_fasta.xml" /> 
    267269  </section> 
    268270  <section name="Short Read Analysis" id="short_read_analysis"> 
  • tools/fasta_tools/fasta_compute_length.py

    r1214 r1242  
    1313    input_filename = sys.argv[1] 
    1414    output_filename = sys.argv[2] 
     15    keep_first = int( sys.argv[3] ) + 1 
    1516    tmp_title = tmp_seq = '' 
    1617    tmp_seq_count = 0 
    1718    seq_hash = {} 
     19 
     20    if keep_first == 0: 
     21        keep_first = None 
    1822 
    1923    for i, line in enumerate( file( input_filename ) ): 
     
    3943    for i, fasta_title in title_keys: 
    4044        tmp_seq = seq_hash[ ( i, fasta_title ) ] 
    41         output_handle.write( "%s\t%d\n" % ( fasta_title[ 1: ], len( tmp_seq ) ) ) 
     45        output_handle.write( "%s\t%d\n" % ( fasta_title[ 1:keep_first ], len( tmp_seq ) ) ) 
    4246    output_handle.close() 
    4347 
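
Review note on the `keep_first` handling in the hunk above: read literally, the script adds 1 to the argument (to step past the leading `>`) before it tests for 0, so an input of `0` becomes `1` and the `None` branch looks unreachable. The sketch below reproduces that logic in Python 3 so the effect on the title slice can be checked. `title_slice_end` mirrors the lines shown in the diff; `title_slice_end_adjusted` is a hypothetical variant (not part of the changeset) that tests for 0 first. Code outside the displayed hunks could alter this behaviour, so treat it as an illustration only.

```python
def title_slice_end(keep_first_arg):
    """Mirror the diff: add 1 for the leading '>', then map 0 to None."""
    keep_first = int(keep_first_arg) + 1
    if keep_first == 0:  # only reachable when the argument is -1
        keep_first = None
    return keep_first


def title_slice_end_adjusted(keep_first_arg):
    """Hypothetical variant: treat 0 as 'keep everything' before adding 1."""
    keep_first = int(keep_first_arg)
    return None if keep_first == 0 else keep_first + 1


title = ">EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_"

print(repr(title[1:title_slice_end("14")]))           # 'EYKX4VC02EQLO5'
print(repr(title[1:title_slice_end("0")]))            # ''
print(repr(title[1:title_slice_end_adjusted("0")]))   # whole title without '>'
```

With the logic as displayed, the default value `0` used by the tool's tests would yield an empty title column, while `14` gives the 14-character read name shown in the help example.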
  • tools/fasta_tools/fasta_compute_length.xml

    r1182 r1242  
    1 <tool id="fasta_compute_length" name="Count FASTA Length"> 
    2         <description> </description> 
    3         <command interpreter="python">fasta_compute_length.py $input $output </command> 
     1<tool id="fasta_compute_length" name="Compute"> 
     2        <description>sequence length</description> 
     3        <command interpreter="python">fasta_compute_length.py $input $output $keep_first</command> 
    44        <inputs> 
    5                 <param name="input" type="data" format="fasta" label="Fasta file"/> 
     5                <param name="input" type="data" format="fasta" label="Compute length for these sequences"/> 
     6                <param name="keep_first" type="integer" size="5" value="0" label="How many title characters to keep?" help="'0' = keep the whole thing"/> 
    67        </inputs> 
    78        <outputs> 
     
    1112                <test> 
    1213                        <param name="input" value="454.fasta" /> 
     14                        <param name="keep_first" value="0"/> 
    1315                        <output name="output" file="fasta_tool_compute_length_1.out" /> 
    1416                </test> 
    1517                <test> 
    1618                        <param name="input" value="extract_genomic_dna_out1.fasta" /> 
     19                        <param name="keep_first" value="0"/> 
    1720                        <output name="output" file="fasta_tool_compute_length_2.out" /> 
    1821                </test> 
     
    2225**What it does** 
    2326 
    24  This tool counts the length of each fasta sequence in the file. The output file has two columns per line (separated by tab): fasta titles and lengths of the sequences.  
     27This tool counts the length of each fasta sequence in the file. The output file has two columns per line (separated by tab): fasta titles and lengths of the sequences. The option *How many characters to keep?* allows to select a specified number of letters from the beginning of each FASTA entry.  
    2528 
    2629-----    
     
    2831**Example** 
    2932 
    30 - assume the input file contains fasta sequences:: 
     33Suppose you have the following FASTA formatted sequences from a Roche (454) FLX sequencing run:: 
    3134 
    32         &gt;seq1 
    33         TCATTTA 
    34         &gt;seq2 
    35         ATGGCGTCGGCC 
    36         &gt;seq3 
    37         TCACATGATG 
     35    &gt;EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_ 
     36    TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACG 
     37    TTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG 
     38    &gt;EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_ 
     39    AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA 
    3840 
    39 - the tool will return(first column is the titles, second column is the length of the sequences):: 
     41Running this tool while setting **How many characters to keep?** to **14** will produce this:: 
    4042         
    41         &gt;seq1        7 
    42         &gt;seq2        12 
    43         &gt;seq3        10 
    44          
     43        EYKX4VC02EQLO5  108 
     44        EYKX4VC02D4GS2   60 
     45 
    4546 
    4647        </help> 
  • tools/fasta_tools/fasta_concatenate_by_species.xml

    r1124 r1242  
    1 <tool id="fasta_concatenate0" name="Concatenate FASTA" version="0.0.0"> 
    2   <description>alignment by species</description> 
     1<tool id="fasta_concatenate0" name="Concatenate" version="0.0.0"> 
     2  <description>FASTA alignment by species</description> 
    33  <command interpreter="python">fasta_concatenate_by_species.py $input1 $out_file1</command> 
    44  <inputs> 
     
    1515  </tests> 
    1616  <help> 
     17   
     18**What it does** 
     19   
    1720This tools attempts to parse FASTA headers to determine the species for each sequence in a multiple FASTA alignment. 
    1821It then linearly concatenates the sequences for each species in the file, creating one sequence per determined species. 
     22 
     23------- 
    1924 
    2025**Example** 
     
    4651 
    4752 
    48 Becomes:: 
     53becomes:: 
    4954   
    5055  >hg18 
  • tools/fasta_tools/fasta_filter_by_length.xml

    r1214 r1242  
    1 <tool id="fasta_filter_by_length" name="Filter FASTA by Length"> 
    2         <description> </description> 
     1<tool id="fasta_filter_by_length" name="Filter"> 
     2        <description>sequences by length</description> 
    33        <command interpreter="python">fasta_filter_by_length.py $input $min_length $max_length $output </command> 
    44        <inputs> 
    55                <param name="input" type="data" format="fasta" label="Fasta file"/> 
    6                 <param name="min_length" type="integer" size="15" value="0" label="Minimal length of the return sequence" /> 
    7                 <param name="max_length" type="integer" size="15" value="0" label="Maximum length of the return sequence" help="no limitation if 0"/>  
     6                <param name="min_length" type="integer" size="15" value="0" label="Minimal length" /> 
     7                <param name="max_length" type="integer" size="15" value="0" label="Maximum length" help="Setting to '0' will return all sequences longer than the 'Minimal length'"/>  
    88        </inputs> 
    99        <outputs> 
     
    2222.. class:: infomark 
    2323 
    24 **TIP**. If only want to show sequences longer than a threshold, set *minimal length* to the threshold and leave *maximum length* to zero
     24**TIP**. To return sequences longer than a certain length, set *Minimal length* to desired value and leave *Maximum length* set to '0'
    2525 
    2626----- 
     
    2828**What it does** 
    2929         
    30  This tool accepts two parameters: *minimal length* and *maximum length*, and returns sequences of length within the two thresholds
     30Outputs sequences between *Minimal length* and *Maximum length*
    3131  
    3232----- 
     
    3434**Example** 
    3535 
    36 - assume the input file contains fasta sequences:: 
     36Suppose you have the following FASTA formatted sequences:: 
    3737 
    3838        &gt;seq1 
     
    4545        ATGGAAGC 
    4646 
    47 - return sequences with length longer than 10bp (set the *minimal length* to 10, and the *maximum length* to 0 (no limitation))::  
     47Setting the **Minimal length** to **10**, and the **Maximum length** to **0** will return all sequences longer than 10 bp:: 
    4848 
    4949        &gt;seq1 
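
The filtering rule described in this help text can be sketched as follows. This is not the code of `fasta_filter_by_length.py`, which the changeset does not show; the function name, the record tuples, and the sample data are hypothetical. The sketch assumes both bounds are inclusive, which the help text does not state, and treats a `max_length` of 0 as "no upper limit".

```python
def filter_by_length(records, min_length=0, max_length=0):
    """Yield (title, sequence) pairs whose length lies within the bounds.

    Assumption: bounds are inclusive; max_length == 0 disables the upper bound.
    """
    for title, seq in records:
        if len(seq) < min_length:
            continue
        if max_length and len(seq) > max_length:
            continue
        yield title, seq


# Hypothetical sample records, not taken from the tool's test data.
records = [("seq1", "A" * 12), ("seq2", "A" * 8), ("seq3", "A" * 10)]

for title, seq in filter_by_length(records, min_length=10, max_length=0):
    print(">%s\n%s" % (title, seq))
```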
  • tools/fasta_tools/fasta_to_tabular.py

    r1214 r1242  
    1414    infile = sys.argv[1] 
    1515    outfile = sys.argv[2] 
     16    keep_first = int( sys.argv[3] ) + 1 
    1617    title = '' 
    1718    sequence = '' 
    1819    sequence_count = 0 
     20     
     21    if keep_first == 0: 
     22        keep_first = None 
     23 
    1924    for i, line in enumerate( open( infile ) ): 
    2025        line = line.rstrip( '\r\n' ) 
     
    3944    for i, fasta_title in title_keys: 
    4045        sequence = seq_hash[( i, fasta_title )] 
    41         out.write( "%s\t%s\n" %( fasta_title, sequence ) ) 
     46        out.write( "%s\t%s\n" %( fasta_title[ 1:keep_first ], sequence ) ) 
    4247    out.close() 
    4348 
  • tools/fasta_tools/fasta_to_tabular.xml

    r1182 r1242  
    11<tool id="fasta2tab" name="FASTA-to-Tabular" version="1.0.0"> 
    2         <description>Converts a FASTA file to Tabular format</description> 
    3         <command interpreter="python">fasta_to_tabular.py $input $output</command> 
     2        <description>converts FASTA file to tabular format</description> 
     3        <command interpreter="python">fasta_to_tabular.py $input $output $keep_first</command> 
    44        <inputs> 
    5                 <param name="input" type="data" format="fasta" label="Fasta file"/> 
     5                <param name="input" type="data" format="fasta" label="Convert these sequences"/> 
     6                <param name="keep_first" type="integer" size="5" value="0" label="How many title characters to keep?" help="'0' = keep the whole thing"/> 
    67        </inputs> 
    78        <outputs> 
     
    1112                <test> 
    1213                        <param name="input" value="454.fasta" /> 
     14                        <param name="keep_first" value="0"/> 
    1315                        <output name="output" file="fasta_to_tabular_out1.tabular" /> 
    1416                </test> 
    1517                <test> 
    1618                        <param name="input" value="4.fasta" /> 
     19                        <param name="keep_first" value="0"/> 
    1720                        <output name="output" file="fasta_to_tabular_out2.tabular" /> 
    1821                </test> 
     
    2023        <help> 
    2124         
     25**What it does** 
     26 
     27This tool converts FASTA formatted sequences to TAB-delimited format. The option *How many characters to keep?* allows to select a specified number of letters from the beginning of each FASTA entry.  
     28 
     29-----    
     30 
    2231**Example** 
    2332 
    24 A fasta file with two sequences:: 
     33Suppose you have the following FASTA formatted sequences from a Roche (454) FLX sequencing run:: 
    2534 
    26         &gt;seq1 
    27         CCGGTATCCG 
    28         &gt;seq2 
    29         CTTACC 
     35    &gt;EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_ 
     36    TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACG 
     37    TTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG 
     38    &gt;EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_ 
     39    AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA 
    3040 
    31 Returns:: 
     41Running this tool while setting **How many characters to keep?** to **14** will produce this:: 
     42         
     43        EYKX4VC02EQLO5  TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACGTTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG 
     44        EYKX4VC02D4GS2  AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA 
    3245 
    33         &gt;seq1        CCGGTATCCG 
    34         &gt;seq2        CTTACC 
    3546 
    3647        </help> 
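
The conversion in the help example above can be sketched in Python 3 as below. This is a minimal illustration, not the code of `fasta_to_tabular.py`: the function name and the list-of-lines input are assumptions, and `keep_first` follows the documented meaning (0 keeps the whole title) rather than the exact slicing in the script.

```python
def fasta_to_tabular(lines, keep_first=0):
    """Return (title, sequence) rows, joining multi-line sequences.

    keep_first: number of title characters to keep; 0 keeps the whole title.
    """
    rows = []
    title, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                rows.append((title, "".join(chunks)))
            name = line[1:]  # drop the leading '>'
            title = name[:keep_first] if keep_first else name
            chunks = []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        rows.append((title, "".join(chunks)))
    return rows


example = [
    ">EYKX4VC02EQLO5 length=108 xy=1826_0455 region=2 run=R_2007_11_07_16_15_57_",
    "TCCGCGCCGAGCATGCCCATCTTGGATTCCGGCGCGATGACCATCGCCCGCTCCACCACG",
    "TTCGGCCGGCCCTTCTCGTCGAGGAATGACACCAGCGCTTCGCCCACG",
    ">EYKX4VC02D4GS2 length=60 xy=1573_3972 region=2 run=R_2007_11_07_16_15_57_",
    "AATAAAACTAAATCAGCAAAGACTGGCAAATACTCACAGGCTTATACAATACAAATGTAA",
]

for title, sequence in fasta_to_tabular(example, keep_first=14):
    print("%s\t%s" % (title, sequence))
```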
  • tools/fasta_tools/tabular_to_fasta.xml

    r1183 r1242  
    11<tool id="tab2fasta" name="Tabular-to-FASTA" version="1.1.0"> 
    2         <description>Converts a tabular file to FASTA format</description> 
     2        <description>converts tabular file to FASTA format</description> 
    33        <command interpreter="python">tabular_to_fasta.py $input $title_col $seq_col $output </command> 
    44        <inputs> 
     
    2020        <help> 
    2121         
     22**What it does** 
     23 
     24Converts tab delimited data into FASTA formatted sequences. 
     25 
     26----------- 
     27         
    2228**Example** 
    2329 
    24 Solexa data:: 
     30Suppose this is a sequence file produced by Illumina (Solexa) sequencer:: 
    2531 
    2632        5       300     902     419     GACTCATGATTTCTTACCTATTAGTGGTTGAACATC 
    2733        5       300     880     431     GTGATATGTATGTTGACGGCCATAAGGCTGCTTCTT 
    2834         
    29 Selecting **c3 and c4** as the Title Columns and **c5** as the Sequence Column will result in:: 
     35Selecting **c3** and **c4** as the **Title column(s)** and **c5** as the **Sequence column** will result in:: 
    3036 
    3137        &gt;902_419
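
The column selection in this example can be sketched as below. The changeset does not show `tabular_to_fasta.py`, so this is a hypothetical reimplementation: the function name and signature are assumptions, and joining the title columns with `_` is inferred from the `>902_419` output above. Column numbers are 1-based, matching the `c3`, `c4`, `c5` notation.

```python
def tabular_to_fasta(lines, title_cols, seq_col):
    """Build FASTA lines from tab-delimited rows.

    title_cols: 1-based column numbers joined with '_' to form the title.
    seq_col: 1-based column number holding the sequence.
    """
    out = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        title = "_".join(fields[c - 1] for c in title_cols)
        out.append(">" + title)
        out.append(fields[seq_col - 1])
    return out


solexa = [
    "5\t300\t902\t419\tGACTCATGATTTCTTACCTATTAGTGGTTGAACATC",
    "5\t300\t880\t431\tGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTT",
]

print("\n".join(tabular_to_fasta(solexa, title_cols=[3, 4], seq_col=5)))
```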