Adding Additional ENCODE Datasets

To add datasets to the ENCODE import tool, edit the file /cache/encode_datasets/encode_datasets.loc, which is located on g2.bx.psu.edu.

Currently, only files adhering to the Browser Extensible Data (BED) format are allowed.

After you have added your datasets, reset the Galaxy server so that it picks up the changes.

Format of encode_datasets.loc

  • Tab-delimited file
    • There are 5 required fields
  • Lines beginning with # are ignored
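The format rules above can be sketched in a few lines of Python. This is only an illustration, not the code Galaxy itself uses: the function name parse_loc_lines is hypothetical, and skipping blank lines and lines with fewer than 5 fields is an assumption about how malformed input might reasonably be handled.

```python
def parse_loc_lines(lines):
    """Yield the five fields of each usable entry in a .loc file.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        # Lines beginning with '#' are ignored (blank lines too, by assumption).
        if not line or line.startswith("#"):
            continue
        # Fields are delimited by tabs, not spaces.
        fields = line.split("\t")
        # Assumption: entries missing any of the 5 required fields are skipped.
        if len(fields) < 5:
            continue
        yield fields[:5]
```

For a real file, this could be called as list(parse_loc_lines(open(path))), where path points at your copy of encode_datasets.loc.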

Description of Fields

First Field

  • Abbreviation of the ENCODE group to which the data belongs
  • Valid abbreviations are as follows:
    • CC = Chromatin and Chromosomes
    • GT = Genes and Transcripts
    • MSA = Multi-species Sequence Analysis
    • TR = Transcription Regulation

Second Field

  • Database build for which the data is valid
  • Examples:
    • hg17
    • hg16

Third Field

  • Description of the dataset
    • This is displayed on the tool's selection page and in the history

Fourth Field

  • A unique ID for the dataset
    • Any combination of letters and/or numbers is acceptable
      • The one exception is the keyword None: do not use it, or your data won't be accessible
    • Make sure that the ID you choose is different from every other ID in the file
      • If two datasets share an ID, one of them will be unknown to the tool

Fifth Field

  • The full path, including the file name, of the dataset you are adding
  • This file must be accessible to the Galaxy server
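The field rules described above can be checked before an entry is added. The following is a minimal sketch under stated assumptions: check_entry and VALID_GROUPS are hypothetical names, and the str.isalnum() test is one interpretation of "any combination of letters and/or numbers", not a rule taken from the tool's source.

```python
# The four group abbreviations listed under "First Field".
VALID_GROUPS = {"CC", "GT", "MSA", "TR"}


def check_entry(fields):
    """Return a list of problems found in one entry (empty if none).

    Hypothetical helper for illustration; not part of Galaxy.
    """
    if len(fields) != 5:
        return ["expected 5 fields, got %d" % len(fields)]
    group, build, description, dataset_id, path = fields
    problems = []
    if group not in VALID_GROUPS:
        problems.append("unknown group abbreviation: %r" % group)
    if not build:
        problems.append("database build is empty")
    if not description:
        problems.append("description is empty")
    if dataset_id == "None":
        # The keyword None makes the data inaccessible.
        problems.append("the ID must not be the keyword None")
    elif not dataset_id.isalnum():
        # Assumption: IDs consist of letters and digits only.
        problems.append("ID should contain only letters and numbers: %r" % dataset_id)
    if not path:
        problems.append("file path is empty")
    return problems
```

Uniqueness of the ID and accessibility of the file depend on the rest of the file and on the server, so they are left out of this per-entry check.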

An Example Entry

You want to add a dataset with the following characteristics:

  • Belongs in the Chromatin and Chromosomes group
  • Is based on the hg17 build
  • Has the description "Some really cool data"
  • Is located at /cache/encode_datasets/encodeData1.bed, a path accessible to the Galaxy server
  • Uses the ID encodeCCReallyCoolData, which you have checked (and double-checked) is not already taken

The entry would look like this:

CC	hg17	Some really cool data	encodeCCReallyCoolData	/cache/encode_datasets/encodeData1.bed
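Because the fields must be separated by real tab characters, which are easy to lose when copying and pasting, it can help to generate the line rather than type it. A minimal sketch, assuming a hypothetical helper named format_entry:

```python
def format_entry(group, build, description, dataset_id, path):
    """Join the five fields into one tab-delimited .loc entry.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    fields = [group, build, description, dataset_id, path]
    for value in fields:
        # A tab or newline inside a field would corrupt the entry.
        if "\t" in value or "\n" in value:
            raise ValueError("field contains a tab or newline: %r" % value)
    return "\t".join(fields)


entry = format_entry(
    "CC",
    "hg17",
    "Some really cool data",
    "encodeCCReallyCoolData",
    "/cache/encode_datasets/encodeData1.bed",
)
print(entry)
```

The printed line matches the example entry above and can be appended to encode_datasets.loc.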

Some Questions/Answers

Why doesn't my dataset appear?

  • You didn't reset the server
    • The server must be reset before the tool becomes aware of the new dataset
  • You did not include all the required fields
    • Fields are delimited by tabs
  • The file you specified isn't accessible to the Galaxy server
    • Check permissions
  • The file you specified doesn't exist
    • Check your spelling
  • You used an ID (field 4) which matches another dataset
    • Or someone reused your ID
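Several of the causes above can be checked mechanically. The sketch below is an assumption-laden illustration: diagnose is a hypothetical name, it expects entries already split into five fields, and os.access tests readability for the user running the script, which may not be the user the Galaxy server runs as.

```python
import os
from collections import Counter


def diagnose(entries):
    """Return messages about duplicate IDs and missing or unreadable files.

    entries is a list of 5-field lists, one per line of the .loc file.
    Hypothetical helper for illustration; not part of Galaxy.
    """
    messages = []
    # Count how often each ID (field 4) occurs across all entries.
    id_counts = Counter(fields[3] for fields in entries)
    for fields in entries:
        dataset_id, path = fields[3], fields[4]
        if id_counts[dataset_id] > 1:
            messages.append("%s: ID is used by more than one dataset" % dataset_id)
        if not os.path.isfile(path):
            messages.append("%s: file does not exist: %s" % (dataset_id, path))
        elif not os.access(path, os.R_OK):
            # Checked for the current user only; the server's user may differ.
            messages.append("%s: file is not readable: %s" % (dataset_id, path))
    return messages
```

An empty result does not rule out a server that has not been reset since the file was edited, so that cause still has to be checked by hand.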