II. Overview of datasets available in the database
b. Basic genome annotation data
g. Phylogenetic distribution data
i. Curated data on gene validation
III. Using the database
3. Query history functionality page
4. Posting lists of genes and compounds for community sharing
5. Gathering information about targets and compounds using community surveys
IV. Some example queries and their results
1. Example 1 (Search and rank genes)
2. Example 2 (Search compounds)
I. Introduction (back to Table of Contents)
The large and growing need for antiparasitic drugs (attributable to poor efficacy, high toxicity, and/or emergence of drug resistance), coupled with increasing interest in this area from both academic and pharmaceutical sector research programs, has motivated the development of a "Drug Target Portfolio" database that is accessible via the web at TDRtargets.org. This database facilitates two different approaches for identification of drug targets and potential lead compounds against tropical disease causing pathogens: I. “Search for target genes” - A genomic approach for identifying and prioritizing parasite genes as potential therapeutic targets. II. “Search for Small molecules” - A Chemoinformatics approach to identify links between small molecule compounds and parasite genes using evidence such as protein orthology; similarity of chemical compounds to biologically active small molecules or biological ligands/metabolites; small molecule docking to potential targets, etc. Both these approaches employ multiple different search criteria and query strategies and provide the users with the power and flexibility to analyze several parasite genomes or chemical datasets from a single web accessible, open source platform. Users can run queries and browse the results obtained, save results and publish them on the database to share with the wider scientific community, upload external datasets and modify previous results, prioritize results by ranking them and export results along with genomic datasets as spreadsheets. This document provides a detailed description of the various datasets, data integration, functionality and tools available in the database along with some examples which showcase the utility of the database. You may also want to check out the slideshow supplementing this tutorial, a link for which is available from the home page.
II. Overview of datasets available in the database
II. 1. Genomic datasets (back to Table of Contents) (running searches for target genes of interest)
Genomic datasets include complete genome sequence information for the pathogen of interest and all other annotation data that comes with it: gene ID, gene name, functional annotation such as enzymes or transporters, metabolic pathway mapping, and other information dependent on protein primary structure like length, molecular weight, isoelectric point, signal peptide, GPI anchor, transmembrane domain, etc. Most of this information has been obtained from the respective genome databases for these parasites. In addition, other datasets relevant for drug target prioritization including structure, expression, antigenicity, phylogenetic distribution, essentiality, genetic/chemical validation, druggability, assayability and bibliographic references are obtained from a variety of sources and mapped to parasite genes. Sections II. 1. a. to II. 1. l. provide more information regarding the source and utility of the above described data types. Section III. 1. provides an overview of the functionality available in TDRtargets for browsing and querying genomic datasets. Section IV provides some use case examples of searching datasets and prioritizing targets.
II. 1. a. Species included (back to Table of Contents) (selecting pathogens to search targets)
Species for which datasets are currently loaded in the database:
Species for which future loading of datasets is planned:
II. 1. b. Basic genome annotation data (back to Table of Contents) (formulating queries using this data)
Under this category all basic genome information has been captured for the species of interest. This includes gene ID, gene name, gene product name, exon count, length of gene, length of protein, molecular weight of protein, isoelectric point of protein, hydrophobicity of proteins, number of transmembrane domains and presence of signal peptide. Data were obtained from respective genome databases (GenBank, GeneDB, PlasmoDB, ToxoDB, Leproma, and TubercuList).
Genes were classified into enzyme, transporter and receptor categories as follows. Enzymes: genes that have one or more of the following features: (1) an EC number; (2) a GO term for catalytic activity (GO:0003824) or one of its more specific subterms, e.g., kinase activity (GO:0016301); (3) an enzyme name (e.g., dehydrogenase, calpain, etc.); (4) orthology to known enzymes from other organisms (e.g., Saccharomyces cerevisiae). Transporters: genes that have one or more of the following features: (1) a GO term for transporter activity (GO:0005215) or one of its more specific subterms, e.g., chloride channel activity (GO:0005254) (excluding any genes associated with non-transmembrane transport, e.g., carrier proteins or proteins involved in vesicle transport); (2) a transporter name, e.g., pteridine transporter. Receptors: genes that have one or more of the following features: (1) a GO term for receptor activity (GO:0004872) or one of its child terms, e.g., peptide receptor activity (GO:0001653); (2) a name that includes "receptor."
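The classification rules above can be sketched in a few lines of Python. This is only an illustration, not the pipeline actually used by TDRtargets: the Gene record, the keyword lists and the assumption that GO terms have already been expanded to include their ancestors are all hypothetical, and the exclusion of non-transmembrane transport is not modeled.

```python
from dataclasses import dataclass, field

# GO root terms quoted in the text above.
CATALYTIC_ACTIVITY = "GO:0003824"
TRANSPORTER_ACTIVITY = "GO:0005215"
RECEPTOR_ACTIVITY = "GO:0004872"

# Hypothetical keyword lists; the real curation vocabulary is not given here.
ENZYME_NAME_HINTS = ("dehydrogenase", "calpain")
TRANSPORTER_NAME_HINTS = ("transporter",)


@dataclass
class Gene:
    name: str
    ec_numbers: set = field(default_factory=set)
    # Assumed to contain every annotated GO term plus all of its ancestors,
    # so testing for the root term also covers the more specific subterms.
    go_terms: set = field(default_factory=set)
    has_enzyme_ortholog: bool = False


def classify(gene):
    """Return the set of functional classes (possibly empty) for a gene."""
    name = gene.name.lower()
    classes = set()
    if (gene.ec_numbers
            or CATALYTIC_ACTIVITY in gene.go_terms
            or any(hint in name for hint in ENZYME_NAME_HINTS)
            or gene.has_enzyme_ortholog):
        classes.add("enzyme")
    if (TRANSPORTER_ACTIVITY in gene.go_terms
            or any(hint in name for hint in TRANSPORTER_NAME_HINTS)):
        classes.add("transporter")
    if RECEPTOR_ACTIVITY in gene.go_terms or "receptor" in name:
        classes.add("receptor")
    return classes
```

A gene can fall into more than one class, which is why the function returns a set rather than a single label.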
II. 1. c. Functional annotation data (back to Table of Contents) (formulating queries using this data)
Under this category, functional annotation information available for individual genes is captured. This comprises protein domain (Pfam / InterPro), gene ontology (GO) annotation and EC number annotation for enzymes. Data were obtained from the respective genome databases of the pathogens, and InterPro scans were run to get the most recent protein family and GO assignments. Using the GO annotation data, GO slim categories were created for assigning genes to various functional class categories and metabolic process categories.
II. 1. d. Structure data (back to Table of Contents) (formulating queries using this data)
Under this category, information on the availability of crystal structures for proteins is captured. In addition, molecular modeling of the protein structure was undertaken on a genome-wide scale for all genomes of interest. Structure models for either the whole protein or part of a protein (protein domain) were obtained based on a template structure. Structure data were obtained from PDB. Structure model data were obtained from Andrej Sali's lab; the models can be accessed from http://modbase.compbio.ucsf.edu/modbase-cgi/index.cgi.
II. 1. e. Expression data (back to Table of Contents) (formulating queries using this data)
Expression evidence was collected from several microarray experiments and combined in a simplified scheme. Genes were grouped in five categories depending on how much expression or upregulation they showed in the selected life cycle stage: 0-20%, 20-40%, 40-60%, 60-80% and 80-100%. The 0-20% category represents the lowest 20% of genes, showing the least expression or upregulation in the corresponding life cycle stage, and the 80-100% category represents the top 20% of genes.
Percentiles of P. falciparum genes are based simply on their expression levels relative to all other genes; e.g., a gene expressed highly at all life-cycle stages would fall into the 80-100% category for all stages. In contrast, percentiles for the Murphy & Brown M. tuberculosis data are based on the fold induction of each gene relative to a baseline stage (normal conditions for rapid growth). Percentiles for the Hasan et al. M. tuberculosis data are based on the genes' predicted importance in dormant stages based on expression data and other data.
More detailed (fine-grained) queries on gene expression can be performed with the original datasets at their respective sources (for example PlasmoDB).
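As a rough illustration of the five-category scheme described above, the following Python sketch ranks genes by a single value (expression level or fold induction) and assigns each one to a 20% bin. The function name and the tie-breaking rule are assumptions made for this example; the exact procedure used to build the database is not described here.

```python
def expression_bins(values):
    """Map each gene ID to one of five 20% bins by rank.

    `values` is a dict of gene ID -> expression level (or fold induction).
    Ties are broken by gene ID, which is an assumption of this sketch.
    """
    labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]
    ranked = sorted(values, key=lambda gene: (values[gene], gene))
    n = len(ranked)
    bins = {}
    for rank, gene in enumerate(ranked):
        # rank / n lies in [0, 1); scaling by 5 selects the bin index 0..4
        bins[gene] = labels[min(4, rank * 5 // n)]
    return bins
```

With this scheme a query for the 80-100% category returns the top fifth of genes for the chosen life cycle stage.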
II. 1. f. Antigenicity data (back to Table of Contents) (formulating queries using this data)
Putative antigenic peptides were predicted for genes using the method of Kolaskar and Tongaonkar, as implemented in EMBOSS (antigenic). Each predicted epitope has an associated score based on the physicochemical properties of the amino acid residues. A cumulative antigenicity score was calculated as the sum of scores of all predicted epitopes for a given protein. A normalized antigenicity index was calculated as the ratio of this cumulative score to the protein length. Finally, percentile values for the antigenicity index were obtained by calculating the percent of proteins in a genome that fall below a given antigenicity index. Thus, a query for antigenicity index percentile greater than 80 will retrieve all proteins in the top 20 percent of antigenicity index values for the given genome.
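The arithmetic described above can be written out as follows. This is a minimal sketch that assumes the per-epitope scores have already been produced by the prediction program; the function names are hypothetical, and treating "fall below" as strictly lower is an assumption.

```python
from bisect import bisect_left


def antigenicity_index(epitope_scores, protein_length):
    """Cumulative epitope score divided by protein length."""
    return sum(epitope_scores) / protein_length


def percentile_below(indices):
    """For each protein, the percent of proteins with a strictly lower index.

    `indices` is a dict of protein ID -> antigenicity index for one genome.
    """
    n = len(indices)
    ordered = sorted(indices.values())
    return {
        protein: 100.0 * bisect_left(ordered, value) / n
        for protein, value in indices.items()
    }
```

A query such as "antigenicity index percentile > 80" then amounts to keeping the proteins whose value in the returned dict exceeds 80.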
II. 1. g. Phylogenetic distribution data (back to Table of Contents) (formulating queries using this data)
Phyletic distribution data for orthologs were obtained from the OrthoMCL database. Paralogs and orthologs for individual genes were identified and clustered into ortholog groups by reciprocal best BLAST hits (all-against-all) and Markov clustering. Clustering of genes that are orthologs from different species allows us to transfer or adopt functional information for a gene from a reference species to the parasite species of interest. In addition, these data are also useful for identifying duplication of genes and expansion of gene families. More information on ortholog identification and clustering can be obtained from http://www.orthomcl.org.
II. 1. h. Essentiality data (back to Table of Contents) (formulating queries using this data)
This is a collection of experimental data from various species on gene essentiality. Genome-wide gene knockout and knockdown data from certain species (C. elegans, E. coli, S. cerevisiae and M. tuberculosis) can be used to predict whether or not orthologous genes from parasitic species of interest are essential. More data from other reference species will be added in the future. Data for essentiality were obtained from the Saccharomyces Genome Database (SGD), Profiling of E. coli Chromosome (PEC), Keio Collection, National Microbial Pathogen Data Resource (NMPDR), WormBase and New England Biolabs (NEB).
II. 1. i. Curated data on gene validation (back to Table of Contents) (formulating queries using this data)
Genetic and biochemical data on functional studies of various pathogen genes were collected from the literature by manually looking through publications on individual genes and by community-wide surveys. Data collected in this fashion were represented in a structured ontology format for better data querying and retrieval. The structured ontology represents validation of phenotypes observed for each gene. Validation is classified as genetic or chemical. Under genetic validation, data are available for validation of phenotype by overexpression, loss-of-function mutant, knockout unrecovered, or RNAi/antisense assay. Under chemical validation, data are available for validation of phenotype by cell-free assay, in vitro culture assay, animal system, and clinical assay. The phenotypes observed from these studies are categorized as 'abnormal' or 'lethal.' This curation process was primarily carried out by the TDR Drug Target Group. The community-wide survey for drug targets in Human African Trypanosomiasis (HAT) was initiated by the Drug Discovery group at the University of Dundee following a workshop on drug discovery for Trypanosomatid diseases and was hosted at https://decide.ideareach.com/. Further such surveys for helminth and other protozoan parasites will be conducted at the tdrtargets.org site in the near future. All the survey datasets will be uploaded into the database using the structured ontology format.
II. 1. j. Druggability data (back to Table of Contents) (formulating queries using this data)
Druggability analysis was carried out to estimate the likelihood of a protein being druggable. The druggability index (Dindex) is a composite score consisting of a weighted normalized sum, where each of the different druggability prediction methods is given a different weight depending on its relative contribution to the prediction. The Dindex values range from 0 to 1, where a larger index score for a gene means that the gene is more likely to be a druggable target. By doing sequence similarity searches (using ortholog clustering and BLAST) against a database of known targets derived from the latest Inpharmatica SAR literature database (StARlite), a large number of proteins in TDR priority species could be linked to a known druggable target with at least one small molecule compound with a binding affinity less than 10 µM. The StARlite database is now accessible from the ChEMBL (www.ebi.ac.uk/chembldb) site.
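One plausible reading of "weighted normalized sum" is a weighted average of per-method scores that each lie between 0 and 1. The sketch below follows that reading only; the method names and weights are invented for illustration and are not the values used to compute the Dindex in the database.

```python
def dindex(scores, weights):
    """Weighted average of per-method druggability scores.

    `scores`  : dict of method name -> score in [0, 1] for one gene
    `weights` : dict of method name -> non-negative weight (hypothetical)
    Only methods that produced a score for this gene enter the sum;
    that normalization choice is an assumption of this sketch.
    """
    total_weight = sum(weights[method] for method in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[method] * scores[method] for method in scores)
    return weighted / total_weight
```

Because every score lies in [0, 1] and the weights are non-negative, the result also stays within the 0 to 1 range quoted above.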
Drug-to-gene association data available from DrugBank (www.drugbank.ca) and ChEMBL were mined and homologs and orthologs of these genes were mapped to pathogen species of interest. While some of the compounds in DrugBank might prove relevant for parasite diseases, it is more likely that these compounds are best used as informatics probes to identify small focused diversity sets for screening. The DrugBank dataset was obtained from Robert K. Campbell of the Marine Biological Laboratory at Woods Hole and the ChEMBL dataset from John Overington of ChEMBL. In addition, data for associations of drugs/compounds with genes was also mined from the literature through PubMed (www.ncbi.nlm.nih.gov/pubmed).
II. 1. k. Assayability (back to Table of Contents) (formulating queries using this data)
For the purposes of this database, a target is considered assayable if it is an enzyme included in Sigma-Aldrich's collection of assays, or if it has been assayed according to the BRENDA database.
The BRENDA database contains categories for cloned and purified genes but not assayed genes per se, so to create our "assayed" category we combined entries from the Km and Specific Activity categories, which give the clearest picture of whether a protein has actually been enzymatically assayed.
The mapping of the BRENDA entries to the TDR genes was carried out as follows: (1) Mapping by EC number: EC numbers in BRENDA were used to map the entries to those TDR genes with EC numbers; (2) for BRENDA entries where there was no match to a gene by EC, the gene was identified by name in the species-specific database (e.g. PlasmoDB) and mapped to that gene; (3) if there was no gene in the species-specific database with the same EC or name as in the BRENDA entry, the gene was identified by BLASTing the sequence from the associated BRENDA literature reference.
For each TDR species all entries for the genus were mapped. So, for example, genes that were assayed/purified/cloned in Plasmodium knowlesi were mapped to Plasmodium falciparum. The source species is specified in the TDR gene entry page under Assayability.
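The three-step mapping described above is a simple fallback chain, sketched below. The entry layout, the lookup tables and the blast_lookup callable are all hypothetical stand-ins; in particular, blast_lookup merely represents the manual step of BLASTing the sequence from the cited reference.

```python
def map_brenda_entry(entry, genes_by_ec, genes_by_name, blast_lookup):
    """Map one BRENDA entry to a gene ID.

    Tries EC number first, then gene name, then a sequence-based lookup.
    Returns (gene_id, method), or (None, None) if nothing matched.
    """
    ec = entry.get("ec")
    if ec and ec in genes_by_ec:
        return genes_by_ec[ec], "ec"

    name = entry.get("name", "").lower()
    if name and name in genes_by_name:
        return genes_by_name[name], "name"

    hit = blast_lookup(entry)  # hypothetical: returns a gene ID or None
    if hit is not None:
        return hit, "blast"
    return None, None
```

Returning the method alongside the gene ID keeps a record of how each mapping was made, which is useful when reviewing lower-confidence matches.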
II. 1. l. Bibliographic references (back to Table of Contents) (formulating queries using this data)
Reference data for genes were mined from PubMed, or were provided by curators and/or users of the site. Entries in PubMed contain references to EC numbers (for enzymes) and CAS registry numbers (for drugs) and where available these were used to map references to genes. In each case this mapping also required the name of the organism (e.g. Leishmania) to be present in the title or abstract of the cited reference.
II. 2. Chemical datasets (back to Table of Contents) (running searches for small molecules of interest)
The chemical datasets contained in TDRtargets comprise a compendium of information on small molecules obtained from various sources. In similar fashion to genomic datasets, TDRtargets captures a wide variety of information regarding small molecules including names and synonyms, identifiers and tags, and others based on chemical properties such as structure, molecular weight, LogP value, number of H-bond donor/acceptor groups, number of flexible bonds, rule of five compliance, atomic composition, etc. In addition, bioactivity data and information regarding associated target organisms or target genes for a subset of small molecules have also been captured by TDRtargets. By integrating a comprehensive chemical dataset and enabling chemoinformatic approaches in TDRtargets, the database provides users with the ability to: i) search and obtain a list of drugs/inhibitors known to target a particular pathogen (some of these may have a defined mechanism of action and known target whereas others may simply be hits from a screen); ii) search for novel small molecules with similar scaffold to drugs/inhibitors of interest; iii) link novel target genes to drug-like small molecules. Sections II. 2. a. to II. 2. g. provide more information regarding the source and utility of the chemical datasets. Section III. 2. provides an overview of the functionality available in TDRtargets for browsing and querying chemical datasets. Section IV provides some use case examples of searching chemical datasets to obtain a list of useful chemicals.
II. 2. a. Basic information (back to Table of Contents) (formulating queries using this data)
The information captured in this section allows searches based on the chemical names (including synonyms) and tags like InChi or InChi Key.
II. 2. b. Chemical properties (back to Table of Contents) (formulating queries using this data)
Under this category, all relevant information based on the chemical composition of small molecules is captured to allow queries based on molecular weight, lipophilicity (LogP value), H-bond donors/acceptors, number of flexible bonds and Lipinski rule of 5 qualification.
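For readers unfamiliar with the rule of 5, the sketch below applies the commonly cited Lipinski thresholds (molecular weight at most 500, LogP at most 5, at most 5 H-bond donors, at most 10 H-bond acceptors). How TDRtargets itself computes the qualification flag is not stated here, so the function names and the number of violations tolerated are assumptions.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-5 thresholds are exceeded."""
    checks = [
        mol_weight > 500,   # molecular weight in daltons
        logp > 5,           # octanol-water partition coefficient
        h_donors > 5,       # hydrogen-bond donors
        h_acceptors > 10,   # hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(mol_weight, logp, h_donors, h_acceptors,
                        max_violations=1):
    """True if the compound exceeds no more than `max_violations` thresholds.

    Allowing one violation is a common convention and an assumption here;
    pass max_violations=0 for the strict reading.
    """
    count = lipinski_violations(mol_weight, logp, h_donors, h_acceptors)
    return count <= max_violations
```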
II. 2. c. Number of atoms (back to Table of Contents) (formulating queries using this data)
This category contains data on the atomic composition of the small molecules.
II. 2. d. Pharmacological activity (back to Table of Contents) (formulating queries using this data)
Many of the small molecule datasets integrated into the TDRtargets database come with some form of bioactivity data resulting from the screening experiments they were originally obtained from. The bioactivity data are represented in a variety of ways owing to the diversity of the assay protocols used for the screens. A list of bioactivity measurements can be seen in the “Activities” menu and include 50% inhibitory concentration (IC50), inhibitory constant (Ki), minimum inhibitory concentration (MIC), inhibition, activity, 50% effective concentration (EC50), 50% effective dose (ED50), percent growth inhibition, ratio, Log ki, 50% growth inhibition (GI50), half life (T1/2), Log IC50, inhibition frequency index (IFI), XC50, inhibitory dose (ID50), CC50, selectivity, T/C, pA2.
II. 2. e. Associated genes (back to Table of Contents) (formulating queries using this data)
Curated information regarding which proteins are targeted by which drugs/inhibitors was obtained from various sources and integrated into the TDRtargets database. Sections II. 1. i. to II. 1. k. provide information about how genes were curated for known association with pharmacological agents from sources such as DrugBank, ChEMBL, BRENDA and published literature. The same curated data are used to make the association from the small molecule side in this section. In addition to these curated datasets, many predicted associations were made between target genes and small molecules. In brief, OrthoMCL was used to convert known associations between compounds and proteins from non-TDR Targets species to predicted associations between these same compounds and orthologous proteins in TDR Targets species.
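The orthology-based transfer described above can be pictured as follows. This is a simplified sketch, not the actual pipeline: the input structures (a protein-to-ortholog-group table and a protein-to-species table) and all identifiers are hypothetical.

```python
from collections import defaultdict


def predict_associations(known, group_of, species_of, tdr_species):
    """Transfer compound-protein links to orthologs in TDR species.

    `known`       : iterable of (compound, protein) curated associations
    `group_of`    : dict of protein -> ortholog group ID
    `species_of`  : dict of protein -> species name
    `tdr_species` : set of species names to predict for
    Returns a set of predicted (compound, protein) pairs.
    """
    members = defaultdict(set)
    for protein, group in group_of.items():
        members[group].add(protein)

    predicted = set()
    for compound, protein in known:
        group = group_of.get(protein)
        if group is None:
            continue  # protein is not in any ortholog group
        for other in members[group]:
            if other != protein and species_of[other] in tdr_species:
                predicted.add((compound, other))
    return predicted
```

Predicted pairs are kept separate from the curated ones, so a user can still tell direct evidence apart from evidence inferred through orthology.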
II. 2. f. Information source (back to Table of Contents) (formulating queries using this data)
This section contains information regarding the original source from which the small molecule datasets contained in TDRtargets were obtained. These sources are most often databases such as the ChEMBL and StARlite databases (www.ebi.ac.uk/chembldb), DrugBank (www.drugbank.ca), and PubChem (www.ncbi.nlm.nih.gov/pccompound) or published, open access data from academia and industry obtained from one of the above-mentioned databases (for example, hits from antimalarial screens run in St. Jude Children's Research Hospital, Novartis-GNF and GlaxoSmithKline were obtained from www.ebi.ac.uk/chemblntd), or in-house curation efforts of TDRtargets from published literature and surveys of the scientific community.
II. 2. g. Structure based searches (back to Table of Contents) (formulating queries using this data)
Under this section, the TDRtargets database contains information regarding the chemical structure of small molecules stored as SDF files. This allows for searching and finding small molecules based on similarities in structure. The available chemoinformatic tools allow users to draw a structure of interest for making substructure and similarity searches. This is discussed further in section III. 2.
III. Using the database (back to Table of Contents)
The TDRtargets database is an open access, lightweight database and is quite simple to use. The primary functionalities of the database include: i) Search pages - where users may search for genes and small molecule compounds of interest using many different criteria to query the underlying databases; ii) History page - where the results from gene and compound queries are archived and manipulated in various ways to rank targets identified using the search page functionalities; iii) Posted lists page - for sharing data with the broader scientific community; iv) Targets survey page - for gathering curated information about individual target genes from experts in the research community. In this section we provide a guide to all the functionalities available in the TDRtargets database and show how to use them effectively for browsing, filtering and ranking genes and small molecule compounds of interest.
Registration and logging in - Many useful functions of the database are only available upon logging in. Users will have to make an initial registration at the site using the link provided on the top right hand corner of the web page. Users will need to provide their name, email ID and create a password at this step. Once created, users can log in using their email ID as the user name and associated password. In case the user has forgotten the password, it can be reset by filling out the registration form again. At this point, the users will be prompted that the user name already exists and will be given the choice to reset their password. The new password will be emailed to the user subsequently. Once logged in, users can save their queries and results permanently (see below in saving data), post their queries for use by others from the scientific community (see below in section III. 4.) and contribute to ongoing curation of genes and compounds (see below in section III. 5.).
III. 1. TARGETS search page (back to Table of Contents)
The TARGETS search page allows the user to query the genome for a pathogen of interest using one or more of the criteria listed on the page (see discussion above in section II. 1.). The functionalities provided on this page are segregated into three steps to help in framing and managing queries: 1. Choosing pathogen(s), 2. Filtering chosen pathogen(s) genome(s) based on any one or a combination of various search parameters (criteria of interest), and 3. Naming the query appropriately for future reference. Steps 1 & 2 can be carried out on the Targets search page only while step 3 can be executed from either this page or the History page. (Analyzing and/or combining query results can also be done on the History page.)
For choosing pathogens in step 1, users will have to select from the list shown on the top of the search page (see also section II. 1. a.). If no pathogen is selected, the search will retrieve genes qualifying for the search criteria from all pathogens by default. Although we expect that the majority of the searches will be run on a particular pathogen of interest, it is conceivable that groups of pathogens may be selected to run a query (for example - search for secreted kinases in all apicomplexan parasites or search for ion channels from all helminth pathogens or search for genes with only plant orthologs from all TriTryp parasites). It is also conceivable that searches can be run on all pathogens (for example - search for all enzymes from all parasites that DO NOT have orthologs in any animals). This search can be run by either selecting all the parasites or by NOT selecting any parasite and by default retrieving genes from all parasites. Simply selecting one or more parasites and running a search without selecting any other search parameter will retrieve all the genes from the selected organism(s). Simply clicking on a search button on this page without choosing any organism or any other search parameter will dump a list of all genes from all organisms (WARNING! - this last search can take up to a minute or so to execute). These kinds of genome-wide dumps of gene lists can be exported along with information for many different criteria using the export option on the History page (see section III. 3. a.).
After choosing the parasite, the next step is to filter the genome of the parasite using one or more of the criteria listed under the heading “Filter targets based on”. The criteria listed here are grouped into different categories. These are name/annotation, features, structures, expression, antigenicity, phylogenetic distribution, essentiality, validation, druggability, assayability and bibliographic references. Details about how these datasets were generated or where they were obtained from are provided in section II. 1. Here we will discuss how to use these datasets.
Name/annotation (back to Table of Contents) - The criteria under this category can be accessed by clicking the search link (blue fonts) shown below the heading. This will open up a search form which will allow searches based on - a) Name. This is a text search based on the original annotation contained in the parasite genome; b) Identifier/accession. This is a text search based on the original gene identifier obtained from the primary source for the parasite genome; c) EC number. This search is based on the 4 digit enzyme commission number (www.chem.qmul.ac.uk/iubmb/enzyme/) as annotated in the original source genome. A typical entry will be 1.1.1.1 for searches based on all 4 digits and 1.1.1.- for searches based on only 3 digits, and so on. To obtain all genes annotated with EC numbers from the selected parasite genome, just type "any" in the box; d) Gene ontology. This is again a text search and is based on the GO hierarchical functional annotation as provided in the OBO foundry (www.obofoundry.org); e) Pfam/InterPro domains. This is a text search by Pfam (www.pfam.sanger.ac.uk) accession or description; f) Functional category. This is a curated dataset based on protein functional class (see section II. 1. c. for more details); g) GO Slim category. This is a search based on high level GO terms which can be selected from the pull-down menu; h) KEGG high-level pathway and i) KEGG detailed pathway. Both of these cater to metabolic pathway searches and are based on KEGG (www.genome.jp/kegg) mapping. The pathways can be selected from the associated pull-down menus. At the bottom of the search form for this category, there are clickable buttons to either run the search or reset the form. Selecting and running searches on more than one criterion under this category will retrieve genes that intersect for all chosen criteria (AND search). To perform a union (OR search) of a combination of criteria, run queries on individual criteria of interest and then combine them in the History page.
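The behaviour of the EC number box can be illustrated with a small matching function. This is only a sketch of the query semantics as described above (full four-part numbers, a dash as a wildcard for unspecified positions, and the keyword "any"); it is not the code behind the search form, and the function name is hypothetical.

```python
def ec_matches(query, annotated):
    """True if an annotated EC number satisfies an EC query string.

    `query` may be a full number, a partial number with '-' placeholders,
    or the keyword 'any' (matches every gene that has an EC annotation).
    """
    if query.strip().lower() == "any":
        return bool(annotated)
    q_parts = query.strip().split(".")
    a_parts = annotated.strip().split(".")
    if len(q_parts) != 4 or len(a_parts) != 4:
        return False
    # A '-' in the query matches whatever the annotation has at that position.
    return all(q == "-" or q == a for q, a in zip(q_parts, a_parts))
```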
Features (back to Table of Contents) - Most criteria under this category require numerical input, except a couple which take a "Y" or "N" entry, and an operator menu allows users to select =, > or < operations with reference to the numerical entry used. For example, protein length > 150 residues or molecular weight < 50000 daltons. There is currently no restriction or designated range for numerical entries and users will simply get no results when nothing matches the numerical value entered. All the datasets are associated with positive numerical values only. The search form can be accessed by clicking the search link (blue fonts) shown below the heading for framing searches based on - a) Protein length. The search is run based on the number of protein residues and numbers 0 and upwards can be used; b) Molecular weight. The search is run based on the molecular weight of the protein and numbers 0 and upwards can be used. NOTE: The values are stored as Daltons rather than as kiloDaltons; c) Isoelectric point. The search is run based on the calculated isoelectric point (pI) of the protein, which is expected to lie within the pH range 0 - 14; d) Signal peptide. This is a present or absent call and "Y" or "N" is selected from the pull-down menu; e) GPI anchor. This is a present or absent call and "Y" or "N" is selected from the pull-down menu; f) Number of transmembrane domains. This search is based on the predicted occurrence of transmembrane domains and numbers 0 and upwards can be used; g) Number of exons. This search is based on the number of exons as annotated in the gene models and numbers 0 and upwards can be used. Selecting and running searches on more than one criterion under this category will retrieve genes that intersect for all chosen criteria (AND search). To perform a union (OR search) of a combination of criteria, run queries on individual criteria of interest and then combine them in the History page.
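The difference between an AND search within one form and an OR combination made later on the History page comes down to set intersection versus set union. The sketch below uses a toy gene table with invented values to show both; the field names and the helper function are hypothetical.

```python
def and_query(genes, predicates):
    """Gene IDs that satisfy every predicate (AND search in one category)."""
    return {
        gene_id for gene_id, gene in genes.items()
        if all(predicate(gene) for predicate in predicates)
    }


# Toy data, invented for illustration; molecular weight is in daltons.
genes = {
    "geneA": {"length": 200, "mol_weight": 22000},
    "geneB": {"length": 100, "mol_weight": 11000},
    "geneC": {"length": 600, "mol_weight": 66000},
}

long_proteins = and_query(genes, [lambda g: g["length"] > 150])
light_proteins = and_query(genes, [lambda g: g["mol_weight"] < 50000])

# Both criteria in one form: intersection (AND).
both = and_query(genes, [lambda g: g["length"] > 150,
                         lambda g: g["mol_weight"] < 50000])

# Two separate queries combined afterwards: union (OR).
either = long_proteins | light_proteins
```

Here `both` contains only geneA, while `either` contains all three genes.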
Structure (back to Table of Contents) - Under this category users can search for genes that either have a crystal structure from PDB (www.pdb.org) or a structure model predicted from the Modbase (www.modbase.compbio.ucsf.edu) pipeline. This is a simple selection form. Selecting and running searches on both criteria under this category will retrieve genes that qualify for either one of the chosen criteria (OR search). To perform an intersection (AND search) of the two criteria, run queries on the individual criteria and then combine them in the History page.
Expression (back to Table of Contents) - Under this category users can search for genes with a certain expression profile. As of this release TDRtargets provides expression data for only two pathogens - P. falciparum and M. tuberculosis. For each of these pathogens users can select the stage from which expression data are required. For M. tuberculosis there are data for the dormancy stage only, while for P. falciparum data are available for all erythrocytic life cycle stages. In addition to selecting the pathogen stage, for M. tuberculosis only, users will have to select the experiments from which the expression data were obtained (see section II. 1. e. for more details). The expression level pull-down menu shows the binned intervals for the expression percentile. As an example query for M. tuberculosis one can make the following selections - "stage" = dormancy; "dataset" = Hasan; "expression level" = 80 - 100%. This query will retrieve all highly expressed genes during dormancy from M. tuberculosis according to the experiment conducted by Hasan et al.
Antigenicity (back to Table of Contents) - Under this category users can search for genes based on their predicted antigenic properties. Further information regarding the methods used for these calculations is given in the "more info on this search" link shown under the search form for this category. This is again a numerical search and users can search based on number of epitopes or normalized cumulative score or antigenicity index percentile or a combination (intersection) of any of these measures. The numerical entries can be used in combination with the =, < or > operators. Selecting and running searches on more than one criterion under this category will retrieve genes that intersect for all chosen criteria (AND search). To perform a union (OR search) of a combination of criteria, run queries on individual criteria of interest and then combine them in the History page.
Phylogenetic distribution (back to Table of Contents) - This search is used to find genes whose orthologs are present or absent in various organisms of interest. For example, one can search for all Plasmodium genes that are conserved in all plants and absent in animals. Currently the search form lists a selection of organisms on which orthology searches can be run, but this can be expanded if required to any of the organisms that are included in OrthoMCL DB (see section II. 1. g. for more details). In addition, duplicated or expanded gene families can be identified by searching for paralogs using the "number of paralogs" option. This is a numerical entry and any number 0 and above is accepted in combination with the =, < or > operators. Selecting and running searches on more than one criterion under this category will retrieve genes that satisfy all chosen criteria (AND search). To perform a union (OR search) of a combination of criteria, run a query on each criterion of interest and then combine the results in the History page.
Essentiality (back to Table of Contents) - This allows for searches based on genome-wide essentiality data for pathogen genes. Evidence for essentiality was obtained from various sources (see section II. 1. h.). Since genome-wide essentiality data is available for only a few species, most of which are not pathogens, this data was mapped to pathogen genomes based on orthology. Users need to decide which organism is a good model for the pathogen under consideration. For example, C. elegans is a better model than E. coli for B. malayi essentiality. Choosing the "any evidence for essentiality" option will search for essential genes based on evidence from any or all of the listed model organisms. The pull-down menu provided for each organism lists the essentiality calls made in the original data source. For example, by selecting "inviable" in S. cerevisiae one can search for all genes from the pathogen of interest that are orthologs of S. cerevisiae genes required for the viability of yeast. Users can also combine essentiality evidence from more than one model organism, for example, essential in E. coli and M. tuberculosis. Selecting and running searches on more than one criterion under this category will retrieve genes that qualify for any one of the chosen criteria (OR search). To perform an intersection (AND search) of the criteria, run a query on each criterion individually and then combine the results in the History page.
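The orthology-based mapping described above can be pictured with a minimal Python sketch. The gene IDs, ortholog group names and the two dictionaries are hypothetical placeholders and do not reflect how TDRtargets stores its data; only the "inviable" call is taken from the text.

```python
# Hypothetical data: pathogen gene -> ortholog group, and
# ortholog group -> essentiality calls reported for a model organism.
pathogen_groups = {"geneA": "OG_1", "geneB": "OG_2", "geneC": "OG_3"}
yeast_calls = {"OG_1": {"inviable"}, "OG_2": {"viable"}}


def genes_with_call(pathogen_groups, model_calls, wanted_call):
    """Return pathogen genes whose ortholog group carries the wanted call."""
    return {
        gene
        for gene, group in pathogen_groups.items()
        if wanted_call in model_calls.get(group, set())
    }


inviable_orthologs = genes_with_call(pathogen_groups, yeast_calls, "inviable")
print(sorted(inviable_orthologs))
```

In this sketch a gene with no ortholog in the chosen model organism (geneC) is never retrieved, which is one reason the choice of model organism matters.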
Validation data (back to Table of Contents) - This criterion allows users to search for genes which have been experimentally validated as being good or useful targets. Since this data was obtained by manual curation on a gene-by-gene basis, there is no genome-wide coverage for this dataset. Curation is ongoing and for some pathogens there may be no data available yet. Users can choose between "genetic validation" and "pharmacological validation" and select the corresponding experiment or assay from the pull-down menus. For a subset of these curations, phenotypic data is also available and is provided as an additional choice.
Druggability (back to Table of Contents) - This criterion allows users to search for pathogen genes that are considered druggable (for more details see section II. 1. j. and the "more information about this search" link under the druggability search menu). The "druggability index" is a simple numerical entry based on a previously determined druggability value ranging from 0 to 1. The numerical entries can be combined with the =, < or > operators. Users can also search for genes with "associated compounds" using evidence types such as "curated" or "predicted", or a combination of these by choosing "any" from the pull-down menu. Selecting and running searches on more than one criterion under this category will retrieve genes that satisfy all chosen criteria (AND search). To perform a union (OR search) of a combination of criteria, run a query on each criterion of interest and then combine the results in the History page.
Assayability (back to Table of Contents) - This search is simple to execute. Users only need to select the criteria by clicking on the check boxes. Users can get more information about this search under the "more about assayability" link. Selecting and running searches on both criteria listed under this category will retrieve genes that satisfy both criteria (AND search). To perform a union (OR search) on these, run a query on each criterion of interest and then combine the results using the "Union" functionality in the History page.
Bibliographic references (back to Table of Contents) - This is again a simple search for genes which have publications associated with them. Two different searches can be run: one is a search for genes with any associated publication, and the other is a more detailed search for genes associated with a particular publication. For the latter search, users can fill in one or more publication details such as PubMed ID, journal name, volume number, year of publication, page numbers or the title of the paper. Users can combine more than one search criterion under "detailed search for genes associated with a particular publication". These combinations will retrieve genes that satisfy all chosen criteria (AND search). To perform a union (OR search) on these, run a query on each criterion of interest and then combine the results using the "Union" functionality in the History page.
Naming, viewing and weighting query results (back to Table of Contents) - Steps 3 and 4 on the TARGETS search page are primarily designed to help users navigate the query naming, weighting and HISTORY functionalities. While running multiple queries on a selected pathogen, users can name each query and assign weight values to the genes retrieved as query results in step 3. Naming is optional, but clearly named queries are much easier to find and refer to later. Similarly, weighting is optional but recommended as it helps to rank genes within any given list. By default, each gene is associated with a weight value of 1.
Running a single query (back to Table of Contents) - In step 3, by clicking on the "Run this Query" button, users can view the results of the current query on the result page, where the retrieved genes are shown as a tabular list. The query name (if one was given) is shown at the top of the list and the "show query parameters" link provides details of the selections made to run the query. On the next line, users can see the number of genes retrieved for the query. Only 25 genes are listed per page by default, but one can choose to see up to 200 per page and navigate from the first to the last page. The columns in the table display species name, gene ID, ortholog group ID and gene product name. In addition, the current release of the database provides a link to associated compounds from the result page through a "show all / curated / predicted compounds" link provided just above the gene list. The links to compounds are provided by default for all search results (irrespective of whether or not pharmacological validation or compound associations were included as search criteria) and by clicking these links users can view a list of small molecule compounds associated with the listed genes. Once the associated compounds list is viewed, it will also appear as an entry under the "My Drug Queries" section of the History page.
Running multiple queries (back to Table of Contents) - In step 3, by clicking on the "Next Query" button, users will be asked to continue to run searches on the chosen organism. This cycle can be repeated until the user has performed searches for all relevant criteria. As explained above, naming each search query will help keep track of all the queries. After selecting the criteria for the last round of search, the user needs to click the "This is the last query" button to view the cumulative result from the batch of searches run as a set. In this case the result page shows the Union of all the searches. If numerical weight values have been applied to the searches, the values will be summed up in a cumulative manner and the final weight values for the genes will be shown on the result page. This result page will appear different from the one described above. Here, users will see a "Your Scoring Strategy" box listing all the queries and the associated weight values. Users have the option to modify the assigned weight values for the respective queries and then click the "update" button at the bottom to re-rank the list of genes accordingly. The graph plotted at the top right of the result page depicts the binned distribution of the listed genes based on the final cumulative weight assignments. This is a histogram of gene frequency (the Y axis) versus weight (the X axis). To allow automatic generation of reasonable histograms for all queries, the X axis is divided into 10 weight "bins" of equal range, with weights increasing from left to right. The number displayed underneath each bar represents the mean of that bin. The table listing the genes on this page is also different from the one described above. Here, in addition to gene ID and product name, the final cumulative weight values and information on how many criteria each gene qualifies for are also shown.
At the bottom of this result page, a link is provided to export the gene list as a tab-delimited spreadsheet (for more details see the export data topic in section III. 3.).
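The binning behind the weight histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is made up, the weights are invented, and the label under each bar is taken to be the midpoint of the bin range, which is one possible reading of "the mean of that bin".

```python
def weight_histogram(weights, n_bins=10):
    """Count genes per equal-width weight bin.

    Returns a list of (bin_midpoint, gene_count) pairs, lowest weights first.
    """
    lo, hi = min(weights), max(weights)
    # Fall back to a width of 1.0 when every gene has the same weight.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for w in weights:
        # The maximum weight sits on the top edge and goes into the last bin.
        idx = min(int((w - lo) / width), n_bins - 1)
        counts[idx] += 1
    midpoints = [lo + (i + 0.5) * width for i in range(n_bins)]
    return list(zip(midpoints, counts))


# Invented cumulative weights for ten genes.
example_weights = [1, 1, 2, 3, 3, 3, 5, 8, 10, 11]
histogram = weight_histogram(example_weights)
for midpoint, count in histogram:
    print(f"{midpoint:5.1f} {'#' * count}")
```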
Assigning weights to rank genes (back to Table of Contents) - For each search described above, users have the option of assigning a numerical weight value to the genes retrieved as query results. This can be done either on the search page itself or in the History page. By default a weight value of "1" is associated with each gene. So, even if the user did not assign any weights, the genes can still be ranked by adding up the default values. For example, one can run the following set of queries - Q1: All P. falciparum kinases (by text search on product name) = 147 results; Q2: All P. falciparum enzymes (by "any" in EC number search) = 723 results; Q3: All P. falciparum secreted proteins (by signal peptide "Y" search) = 815 results. Since no weight values have been assigned to these three queries, a default weight of "1" has been applied to the results. Now on the History page, when these 3 queries are combined by the Union functionality, the resulting gene list (1497 genes) will be ordered by the cumulative weight value of each gene. In this example, 3 genes are ranked above the remaining 1494 genes as they have "3" as the final weight value. The example described here illustrates running the queries one by one (see above in running single queries). When this set of queries is run as a batch (see above in running multiple queries) the union is done automatically and the final ranked list is shown as the result. More information on assigning and combining weights can be found below in section III. 3.
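The ranking logic of a weighted union can be illustrated with a short Python sketch. The gene IDs and the three small lists below are invented stand-ins for the kinase, enzyme and secreted protein queries; the function is a hypothetical helper, not part of the database.

```python
from collections import defaultdict


def weighted_union(queries):
    """Combine (gene_set, weight) pairs and rank genes by summed weight."""
    totals = defaultdict(float)
    for genes, weight in queries:
        for gene in genes:
            totals[gene] += weight
    # Highest cumulative weight first; ties are ordered by gene ID.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


kinases = {"geneA", "geneB", "geneC"}
enzymes = {"geneB", "geneC", "geneD"}
secreted = {"geneC", "geneE"}

# Every query keeps the default weight of 1.
ranked = weighted_union([(kinases, 1), (enzymes, 1), (secreted, 1)])
for gene, weight in ranked:
    print(gene, weight)
```

If the counts quoted above are taken at face value, the same bookkeeping implies that 182 genes appear in exactly two of the three lists: the three lists hold 147 + 723 + 815 = 1685 entries in total, the union holds 1497 genes, and the 3 genes found in all three lists account for 6 of the 188 surplus entries.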
III. 2. COMPOUNDS search page (back to Table of Contents)
Using the COMPOUNDS search page users can find a list of small molecule compounds that are of interest to them. Two kinds of searches can be performed on this page: 1). Text-based searches. This part is divided into different search types - basic information, chemical properties, atoms, activities, genes and information source - each offering a distinct set of criteria to search on. 2). Structure-based searches. This part provides users with the chemo-informatic tools necessary to make 2 dimensional chemical structure similarity and substructure searches.
Please note that these two parts of the search page are independent forms that cannot be submitted simultaneously. You submit either the first form or the second. If you need to search for compounds that meet criteria from both search forms (e.g. compounds with a specified molecular weight -- using the first form -- and that are also a substructure of another compound -- that you've drawn using the second form) then you can combine the two results at the HISTORY page.
Basic information (back to Table of Contents) - In this type of search, small molecule compounds can be identified by their name or by their standard InChI and InChIKey representations. You can read more about these representations/identifiers at the excellent Unofficial InChI FAQ. These are simple text-based searches where the search terms have to be filled into the respective text boxes. When running searches based on chemical names, users have the option to run either an "exact match" search or a "partial match" search. However, because the InChI and InChIKey identifiers are standard tags for individual chemicals, they are always run in exact match mode.
Chemical properties (back to Table of Contents) - All the criteria listed under this type of search require numerical inputs in combination with the =, > and < operators, except for the Lipinski's rule of five filter, for which the number of qualifying rules can be selected from a pull-down menu.
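As an illustration of what the "number of qualifying rules" could mean, the Python sketch below counts how many of the four rules a compound passes. The thresholds are the commonly cited formulation of Lipinski's rule of five (molecular weight of at most 500, logP of at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The exact cut-offs and descriptors used by TDRtargets are not stated here, so they should be treated as assumptions, and the property values in the example are invented.

```python
def lipinski_rules_passed(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-five conditions are met.

    Thresholds follow the commonly cited formulation and are an
    assumption about what the database filter checks.
    """
    checks = [
        mol_weight <= 500,
        logp <= 5,
        h_donors <= 5,
        h_acceptors <= 10,
    ]
    return sum(checks)


# Invented property values for two hypothetical compounds.
small_compound = lipinski_rules_passed(180.2, 1.2, 1, 4)
bulky_compound = lipinski_rules_passed(650.0, 6.1, 2, 8)
print(small_compound, bulky_compound)
```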
Atoms (back to Table of Contents) - This is a simple search based on the atomic composition of the chemical compounds. Users can select the kind of atom they are interested in from the pull-down menu and enter a numerical value in combination with the =, > and < operators to indicate the number of times the selected atom should occur in a chemical compound. Users can select more than one kind of atom by clicking on the "Add Atom" button. Selecting multiple atoms in a single step will run an "AND" search (intersection) automatically. To perform an "OR" search, users will have to combine individual atom queries using the "union" functionality in the History page.
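The AND behaviour of a multi-atom search can be mimicked with a small Python sketch. It assumes that atom counts are read from a simple molecular formula string (no brackets or charges), which is a simplification chosen for the example and not a description of how the database computes them. Function names and formulas are illustrative only.

```python
import operator
import re

# The three operators offered by the search form.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C9H8O4'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def matches_all(formula, constraints):
    """True if the formula satisfies every (atom, operator, value) constraint."""
    counts = atom_counts(formula)
    return all(OPS[op](counts.get(atom, 0), value) for atom, op, value in constraints)


# "More than 2 oxygens AND no nitrogen", applied to two formulas.
constraints = [("O", ">", 2), ("N", "=", 0)]
print(matches_all("C9H8O4", constraints))
print(matches_all("C8H9NO2", constraints))
```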
Activities (back to Table of Contents) - This criterion helps users search for compounds with known bioactivity (see section II. 2. d. for more details). To obtain all compounds with any kind of biological activity users can simply select the "Any Activity" option by placing a tick mark in the check box and running a search. Alternatively, users can select any one of the various bioactivities shown in the pull down menu and enter a numerical value in the text box in combination with the =, > or < operators. When any one bioactivity is selected from the pull down menu, the units in which that activity is documented will appear next to the text box. This will give users some idea about the numerical values that have to be entered.
Genes (back to Table of Contents) - This criterion helps to identify small molecules for which a target gene is known or predicted. The known target information is a curated dataset and the association between small molecules and genes in this category could be drug/inhibitor - protein interaction or a ligand - protein interaction (see section II. 2. e. for more details). Users need to select the evidence type (dataset selection) from the pull-down menu to run the search. NOTE: The predicted compounds can be a large list and it may take a minute or so to retrieve all the results.
Information source (back to Table of Contents) - This is a simple search and helps to identify small molecules based on the source from where these were acquired by TDRtargets DB. Users only need to select the source from the pull-down menu to run the search.
Structure based searches (back to Table of Contents) - There are two ways in which structure based searches can be run on the Compounds search page. One way is to draw the molecule of interest using the JME applet provided on the search page (obtained from Novartis, courtesy of Peter Ertl). Using the tools in the applet, users can draw the structure of interest and then perform either substructure or exact or similarity search using the options given below the applet. The second method of running this search is by pasting the details from the SDF or MOL file describing the structure of the molecule of interest. To do this users need to select the "Paste MDL-SDF file / MDL-Mol file" option and then paste the details in the text box that appears. Users also have the option of restricting the search by atom/bond comparisons and by checking bond configuration (E/Z and R/S). NOTE: Structure based searches from molecules of interest can also be performed from the result page that lists the retrieved compounds. For each compound, there are links for substructure and similarity searches provided on the third column along with other details like name and molecular formula. Links to these searches are also provided inside the compound page which can be viewed by clicking on either the compound name or the molecule ID (1st column in the results table).
Running a compound query (back to Table of Contents) - To run a query on compounds, users need to go to the Compounds search page and either choose to run a text based search or a structure based search. The parameters used for the searches can be entered into the search form following the details discussed above. The search can be initiated by hitting the "search" button provided on the search page (the text and structure based searches have separate "search" buttons) and the results are viewed as a tabular list on the results page. This table has 3 columns - the first column lists the molecule ID, the 2nd column shows the 2 dimensional structure of the molecule and the 3rd column provides details of properties such as the name, molecular weight and formula and, links for substructure and similarity searches. 25 compounds are shown per page and there are buttons above the table that can be used to go to the next/last page or come back to the previous/first page. Each small molecule compound has an information page containing all the details associated with that compound and this page is accessed by clicking on the molecule ID or the compound name from the compound result page. The results of compound queries are archived as search history and can be viewed in the History page. In similar fashion to results from searches on genes, results of compound searches can also be combined by the various functionalities available on the History page.
Here is an example set of queries and their combination. Q1: Search for compounds with molecular weight < 300. Result = 159269 compounds; Q2: Search for compounds that meet all 4 rules listed in Lipinski's rule of five. Result = 603995 compounds; Q3: Search for compounds with bioactivity and have an IC50 value < 10 nM. Result = 32872 compounds. Now on the History page, select results obtained from these 3 queries and take the intersection of these (see section III. 3.). 1986 compounds will be retrieved as a result of this intersection and will be listed as a new query result (Q4) on the History page. This list of compounds can be further filtered for association with genes. To do this, another query (Q5) is run on the search page as (for example) Q5: Search for compounds with gene association based on "curated" information. Result = 2046 compounds. Now, combining Q4 and Q5 by intersection on the History page we will retrieve 48 compounds as a new result. In the example queries discussed above, individual queries were run and combined in the History page to demonstrate the functionality available in the database. An informed user, however, may run the same set of queries as a single query to get the same 48 compounds. This can be done as follows - search for compounds with molecular weight < 300, Lipinski rule = 4, activity by IC50 < 10 nM & associated gene by evidence = curated. Running a query with this combination of criteria will retrieve 48 compounds.
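The equivalence between combining separate queries by intersection and running one query with all criteria can be checked on a toy dataset. In the Python sketch below the five compounds, their property values and the field names are invented; only the four criteria mirror the example above.

```python
# Invented compound records; field names are hypothetical.
compounds = [
    {"id": "cmpd1", "mw": 250.0, "lipinski": 4, "ic50_nm": 5.0, "curated_gene": True},
    {"id": "cmpd2", "mw": 280.0, "lipinski": 4, "ic50_nm": 50.0, "curated_gene": True},
    {"id": "cmpd3", "mw": 450.0, "lipinski": 4, "ic50_nm": 2.0, "curated_gene": False},
    {"id": "cmpd4", "mw": 199.0, "lipinski": 3, "ic50_nm": 1.0, "curated_gene": True},
    {"id": "cmpd5", "mw": 120.0, "lipinski": 4, "ic50_nm": 8.0, "curated_gene": False},
]


def run_query(predicate):
    """Return the IDs of all compounds that satisfy the predicate."""
    return {c["id"] for c in compounds if predicate(c)}


# Separate queries, combined afterwards as on the History page.
q1 = run_query(lambda c: c["mw"] < 300)
q2 = run_query(lambda c: c["lipinski"] == 4)
q3 = run_query(lambda c: c["ic50_nm"] < 10)
q4 = q1 & q2 & q3
q5 = run_query(lambda c: c["curated_gene"])
stepwise = q4 & q5

# The same criteria submitted as a single query.
single = run_query(
    lambda c: c["mw"] < 300
    and c["lipinski"] == 4
    and c["ic50_nm"] < 10
    and c["curated_gene"]
)
print(sorted(stepwise), sorted(single))
```

The stepwise route keeps every intermediate list available for reuse, while the single query is quicker when only the final list is of interest.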
Retrieving genes associated with a list of compounds (back to Table of Contents) - Since the TDRtargets database captures data on association between gene and compounds, users can choose to view a list of compounds associated with genes retrieved as results for a query (see topic Running a single query under section III. 1.) and vice versa. For example, to view all the genes associated with the 48 compounds retrieved in the example search discussed above, users will have to click on the query title listing the 48 compounds in the History page to view the list of compounds on the result page. At the top of the page, just above the table showing the compound list, there are links to view associated genes obtained by either curation or prediction or both. Clicking on the "Curated" button retrieves 95 genes from all the pathogens that are contained in TDRtargets DB.
III. 3. Query HISTORY page (back to Table of Contents)
The History page provides access to results of all the queries that have been run to search for both genes and compounds. In addition to archiving query results, this page also contains various functionality tools that can be used to manage and combine query results in various ways. This section will provide a brief description of all the features available on this page. NOTE: To make full use of all the functionalities provided on this page, users will have to be logged in with a user name and password (see beginning of section III).
Upload (back to Table of Contents) - The "Upload" function allows the user to upload a list of genes from a given pathogen, which will be listed as a new query on the History page under "My target queries" and be available for combining and analysis with other available queries. The upload file should be a text file consisting of a list of gene IDs, which should be the same IDs used by the TDRtargets DB for that pathogen. As an option, users can also upload a list of gene IDs with associated numerical weight values, with each weight on the same line as its gene ID and separated from it by a tab. This functionality is quite useful for uploading an external dataset not contained in the current version of TDRtargets DB. For example, one can upload proteomics data as evidence for expression. The upload text file will have gene IDs as the first column and the number of peptide hits for each gene as the second column. Once uploaded, the numerical values representing peptide hits will be treated as the weights associated with the genes. The upload form also provides text boxes for naming the list and providing a description of the data uploaded.
The 'Upload' function is available only for lists of genes.
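To make the upload format concrete, the Python sketch below reads a tab-separated gene list of the kind described above. The gene IDs are placeholders (real uploads must use the IDs that TDRtargets uses for the pathogen), and giving a gene without an explicit value the default weight of 1 is an assumption made for this example.

```python
def parse_upload(text, default_weight=1.0):
    """Parse 'gene_id<TAB>weight' lines into a {gene_id: weight} dict.

    Lines that contain only a gene ID receive the default weight.
    """
    genes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) > 1 and fields[1].strip():
            weight = float(fields[1])
        else:
            weight = default_weight
        genes[fields[0].strip()] = weight
    return genes


# Placeholder gene IDs with peptide hit counts as weights.
upload_text = "GENE_0001\t12\nGENE_0002\t3\n\nGENE_0003\n"
uploaded = parse_upload(upload_text)
print(uploaded)
```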
My target queries (back to Table of Contents) - This portion of the History page contains all the "gene queries" run using the search options available from the Targets search page. The queries are listed here in chronological order and numbered from 1 upward. For each entry, the details shown include the query name, number of genes contained in the query result (indicated as records), a link allowing the export of that list, a link to see the parameters (criteria) used to run the query, and a link to delete the entry. There are also 2 text boxes associated with each query. One box accepts any numerical value and is used to weight or rank genes. The other text box can be used to enter a new name or modify an existing name of the query (see below in rename queries). To view the list of genes retrieved as results for a query, users need to click on the query name. This will open up the result page for genes (see Running a single query and Running multiple queries topics under section III. 1.). Queries are stored here only on a temporary basis and may be lost when you close your web browser. To permanently retain the queries listed here, it is recommended that users search and manage queries after logging into the DB with their user name and password.
My drug queries (back to Table of Contents) - This portion of the History page contains all the "compound queries" run using the search options available from the Compounds search page. The queries are listed here in chronological order and numbered from 1 upward. For each entry, the details shown include the query name, number of small molecule compounds contained in the query result (indicated as records), a link to see the parameters (criteria) used to run the query, and a link to delete the entry. Queries can be renamed by using the text box associated with each query (see below in rename queries). As of now, there is no mechanism in place for ranking chemical compounds by assigning numerical weights. To view the list of compounds retrieved as results for the queries users need to click on the query name. This will open up the result page for compounds (see Running a compound query topic in section III. 2.). Queries are stored here only on a temporary basis and may be lost when you close your web browser. To permanently retain the queries listed here, it is recommended that users search and manage queries after logging into the DB with their user name and password.
Combine or act on selected queries (back to Table of Contents) - This part of the History page lists all the functionalities available for managing the collection of gene and compound queries listed on the History page. These functionalities enable the following actions on the selected queries - Union, Intersection, Subtraction, Change species, Delete, Rename & Save. To run these functionalities, one or more queries have to be selected from the ones listed by placing a tick mark in the check box associated with each query (except for query subtraction, see below).
Query Union (back to Table of Contents) - This is the OR Boolean functionality and can be used to combine and make a union of all the selected queries. This function is applicable for both gene and compound queries. NOTE: gene and compound queries cannot be combined by "Union" with each other. However it is OK to select a set of gene queries to perform "Union" between them and a set of compound queries to perform "Union" between them at the same time. These two unions will be restricted to the query types but can be performed at the same time. Beware that combining some compound queries can take up more time than gene queries and hence it is good to perform "Union" on one type of query at a time. The result from a "Union" combination will be listed as a new query under the relevant query type with its name reading "Union of (serial numbers of the combined queries), 600 records". NOTE: It is a good idea to rename the result query appropriately as the serial numbers listed will not always be the same. The resulting gene or compound list obtained by "Union" will contain only the unique entries from the individual lists. The contents of the "Union" result can be viewed by clicking on the title.
In addition to simply combining lists of genes, the "Union" function also performs the critical task of ranking the genes contained in the combined list. This is accomplished by adding up the individual weight values associated with each query (see Assigning weights to rank genes topic in section III. 1.). The "Show parameters" link in the union result can be opened to see the details of the queries used to make the union combination and the weight values associated with each one.
Query Intersection (back to Table of Contents) - This is the AND Boolean functionality and can be used to combine and make an intersection of all the selected queries. This function is applicable for both gene and compound queries. NOTE: gene and compound queries cannot be combined by "Intersection" with each other. However it is OK to select a set of gene queries to perform "Intersection" between them and a set of compound queries to perform "Intersection" between them at the same time. These two intersections will be restricted to the query types but can be performed at the same time. Beware that combining some compound queries can take more time than gene queries and hence it is good to perform "Intersection" on one type of query at a time. The result from an "Intersection" combination will be listed as a new query under the relevant query type with its name reading "Intersection of (serial numbers of the combined queries), 600 records". NOTE: It is a good idea to rename the result query appropriately as the serial numbers listed will not always be the same. The resulting gene or compound list obtained by "Intersection" will automatically filter out any entries not present in all selected lists and so this type of combination is more restrictive. The contents of the "Intersection" result can be viewed by clicking on the title.
Query Subtraction (back to Table of Contents) - This is the NOT Boolean functionality and can be used to subtract genes or compounds from a given list. Users will have to specify which query(ies) are to be subtracted from which others using the subtract selection box, which can be accessed by clicking on the link "click here to specify your subtraction choices" in the subtraction functionality. The following is a simple illustration of how this functionality works. Suppose there are 4 gene queries (Q1, Q2, Q3 & Q4). For subtracting Q2 from Q1, the number "1" is selected from the "These" column and the number "2" is selected from the "Minus these" column in the subtract selection box. Similarly for subtracting Q3 & Q4 from Q1, the number "1" is selected from the "These" column and the numbers "3" & "4" are selected from the "Minus these" column in the subtract selection box. In this manner any selection of queries can be subtracted from one another. The same approach works for subtracting compound queries. NOTE: gene and compound queries cannot be combined by "Subtraction" with each other. However it is OK to select a set of gene queries to perform "Subtraction" between them and a set of compound queries to perform "Subtraction" between them at the same time. These two subtractions will be restricted to the query types but can be performed at the same time. Beware that combining some compound queries can take more time than gene queries and hence it is good to perform "Subtraction" on one type of query at a time. The result from a "Subtraction" combination will be listed as a new query under the relevant query type with its name reading "Subtraction of (query serial number(s)) minus (query serial number(s)), 600 records". NOTE: It is a good idea to rename the result query appropriately as the serial numbers listed will not always be the same. The contents of the "Subtraction" result can be viewed by clicking on the title.
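The "These" and "Minus these" selections behave like a set difference. The Python sketch below reproduces the two cases from the illustration above using invented gene IDs; the helper function is hypothetical and only mirrors the description.

```python
def subtract(these, *minus_these):
    """Return the entries of 'these' that occur in none of the subtracted lists."""
    remaining = set(these)
    for query in minus_these:
        remaining -= query
    return remaining


# Four invented gene queries.
q1 = {"geneA", "geneB", "geneC", "geneD"}
q2 = {"geneB"}
q3 = {"geneC"}
q4 = {"geneD", "geneE"}

q1_minus_q2 = subtract(q1, q2)          # "These" = 1, "Minus these" = 2
q1_minus_q3_q4 = subtract(q1, q3, q4)   # "These" = 1, "Minus these" = 3 & 4
print(sorted(q1_minus_q2), sorted(q1_minus_q3_q4))
```

Unlike union and intersection, the result depends on which query is placed in the "These" column: subtracting Q1 from Q4 in this example would leave only geneE.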
Change species for a query (back to Table of Contents) - This functionality is useful for rerunning existing queries on a different organism. To do this, select one or more queries from the list shown under "My target queries" by checking the tick box, select the name of the pathogen on which to re-run the query from the pull-down menu, and then choose the "Change species" option. The resulting list of genes from the newly chosen organism will be listed as a new query result. For example, if a P. falciparum query was already run and has yielded a list of genes qualifying for the criteria - P. falciparum genes that are kinases AND have a signal peptide AND have no animal ortholog - then the same set of criteria can be used to retrieve genes from T. gondii or T. brucei by using this option. There is no need to go back to the search page and make the same selections again for the 2 other organisms. However, this option to change species comes with the risk that criteria that are unique to one organism may be used unknowingly to run searches on other organisms. For example, if the initial query had been P. falciparum genes that are kinases AND have a signal peptide AND are highly expressed in red blood stages of the parasite, applying this query to T. gondii is not feasible because there is no blood stage expression data for T. gondii. Similarly, confusion can arise from phylogenetic searches. Therefore, users need to understand which criteria can be seamlessly applied to other species and which cannot.
Delete a query (back to Table of Contents) - This is a simple functionality that is used to delete (permanently remove) the selected queries from database storage. Single queries can be deleted by clicking on the delete button present in each query entry. To delete several queries at once, select queries using the check boxes and then choose the delete option from the functionalities section and hit the "Do it" button. This works the same way for both gene and compound queries. In fact, gene and compound queries can be selected and deleted at the same time.
Renaming queries (back to Table of Contents) - This is a simple functionality that is used to rename the selected queries listed on the History page. To rename a query, select the concerned query by checking the tick box and then enter the new name for the query using the "Rename" text box associated with the query entry. Finally, choose the rename option from the functionalities section and hit the "Do it" button. This works the same way for both gene and compound queries. In fact, gene and compound queries can be selected and renamed at the same time.
Saving queries (back to Table of Contents) - This is again a simple functionality that is used to save (permanently retain) the results of the gene queries performed in this database. NOTE: Users must be logged in to save their work (see beginning of section III). This functionality is not currently available for compound queries; however, compound queries performed within a user session will remain saved (but they will not appear in "My Saved query sets"). To save one or more queries, select the queries by checking the corresponding check boxes and then choose the save option from the functionality box. The save option has a text box in which users can enter a name under which the chosen queries will be saved. For example, assume that 4 different queries were run on P. falciparum, weighted and then combined by union to yield a 5th query entry on the History page containing the final ranked list of genes. Each of these 5 query entries for P. falciparum genes can be saved separately, but a more convenient and practical way is to select all five queries and save them as a single entry. This way all the queries used to generate the final ranked list of P. falciparum genes are kept together and are easy to track for future reference. The saved queries are listed in the "My Saved query sets" portion of the History page. There are three links shown under the title of each saved query set. These links are for publishing, deleting and retrieving the saved queries.
Publish (back to Table of Contents) - This allows saved queries to be posted on the TDRtargets site for others to view and manipulate the posted list (see below in III. 4. for more details).
Delete (back to Table of Contents) - Clicking on the "Delete" button allows saved queries to be deleted permanently from the database. Warning: there is no way to retrieve the queries once deleted from this area. Before permanently deleting anything, retrieve the concerned query to the current session (see below) to view and modify if required.
Move to my current session (back to Table of Contents) - This allows users to move saved queries from the "My saved query sets" area to the "My target queries" area of the current session. The retrieved queries will stay in the current session until they are deleted. Deleting the retrieved queries from the current session does not delete the saved version of the query.
Export (back to Table of Contents) - This functionality allows users to download a list of genes, with or without other associated datasets, as tab-delimited text files or Excel spreadsheets. The easiest way to export data is to click on the "export" link associated with each gene query entry on the History page. This opens the "export options" page, which lists all the data types available for download from the search categories Name/annotation, Features, Structures, Phylogenetic Distribution, Druggability, Assayability and Bibliographic references. By default, the gene name/accession, organism name and product description are downloaded, along with any assigned weight values (for weighted union results only). At the bottom of this list users can choose the download format from a pull down menu. When exporting query results obtained by running the "Union" functionality, users can choose to download the data either as a simple spreadsheet file only, or as a simple spreadsheet file that also includes a second "dynamic spreadsheet". The dynamic spreadsheet allows users to change the weight values of the individual criteria used to run the queries; the genes on the spreadsheet are then re-ranked and their positions in the list re-ordered to reflect the modified weight values. Generating the dynamic spreadsheet may take a while, so a link indicating that there is a pending request for file export is shown at the top of the History page. Once the file is ready for download, it is listed under the "My files" section at the bottom of the History page. Clicking on the links shown there downloads the files.
The export functionality is currently available for genes only.
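The re-ranking performed by the dynamic spreadsheet can also be reproduced outside the spreadsheet from a tab-delimited export. The sketch below assumes a hypothetical export layout with one row per gene and one 1/0 column per criterion; the actual column names and layout of a TDRtargets export may differ. It also assumes that a gene's score is the sum of the weights of the criteria it meets.

```python
import csv
import io

# Hypothetical tab-delimited export: one row per gene, one 1/0 column per criterion.
EXPORT_TEXT = (
    "gene\tkinase\tsignal_peptide\tabsent_in_animals\n"
    "geneA\t1\t1\t1\n"
    "geneB\t1\t0\t1\n"
    "geneC\t0\t1\t0\n"
)

def rerank(export_text, weights):
    """Score each gene as the sum of the weights of the criteria it meets, highest first."""
    rows = csv.DictReader(io.StringIO(export_text), delimiter="\t")
    scored = [
        (row["gene"], sum(w for name, w in weights.items() if row[name] == "1"))
        for row in rows
    ]
    # Sort by descending score, then by gene name so that ties are ordered predictably.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

equal = rerank(EXPORT_TEXT, {"kinase": 25, "signal_peptide": 25, "absent_in_animals": 25})
shifted = rerank(EXPORT_TEXT, {"kinase": 10, "signal_peptide": 50, "absent_in_animals": 10})
print(equal)
print(shifted)
```

With equal weights geneB ranks above geneC; raising the weight of the signal peptide criterion moves geneC above geneB.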
III. 4. Posting lists of genes for community sharing (back to Table of Contents)
Using this option, users can post or publish a list of genes for sharing with other community members. Only those lists which have been saved in the database are eligible to be published and so users will have to be logged in on the website to be able to save and publish their datasets. To post a list of genes, users will have to click on the "Publish" link shown in the saved query entry. This will open up the publishing page in which users will be able to provide an appropriate query name and description and include relevant references. Then hitting the publish button on this page will post the selected query set for community viewing on the POSTED LISTS page. In addition the posted list will also appear as an entry in the "My published query sets" area towards the bottom of the History page. Here, details like the name of the published query set, date & time of publication, a link to the description of the publication (as entered by the user while publishing) and a link to "Unpublish" the dataset are shown. By clicking on the "Unpublish" button, that dataset can be removed from public viewing.
POSTED LISTS page (back to Table of Contents) - This page contains all the datasets that have been posted on the website by various users. The posted datasets are listed in chronological order under the title "Community Share". Each dataset is represented by a clickable title followed by the name of the user who posted the dataset and the date and time of posting. Below this, the description of the posted dataset is shown; the full description can be viewed by clicking on the "Show complete description" link. Clicking on the title opens a page containing all data associated with the posted query set. The individual queries in the set are listed in the "Queries in this set" area. Users can click on the name of an individual query to view the list of genes and see the parameters (criteria and associated weight values in case of ranked lists) used to run the query. The check boxes allow query selection, and selected queries can be imported into the current session on the History page by pressing the "Import into my history" button. On the History page, imported gene queries are listed in the "My target queries" area and imported compound queries are listed in the "My compound queries" area. All users who visit the database are able to view and access published queries.
III. 5. Gathering information about targets using community surveys (back to Table of Contents)
The TDRtargets database resource is spearheading an ongoing effort to collect and curate literature information related to target identification for various pathogens of interest. In addition to the in-house effort from the network involved in database construction and maintenance, we are actively soliciting community input. The original idea for surveying the scientific community came from a workshop on Drug Discovery for Trypanosomatid Diseases, held in Dundee in February 2007. After a successful pilot survey conducted for Human African Trypanosomiasis, the idea was to expand the action to other TDR priority organisms. The result will be a central repository containing information on current drug targets, providing a framework for collaborations and an important resource for database curators. In order to implement this, the TDRtargets database contains a Targets Survey page that allows users to browse, search and submit relevant information for genes. The browse and search options can be accessed by all users, but to submit entries users must be registered and logged in.
Browsing survey entries (back to Table of Contents) - There are a total of 60 entries made by different users and all 60 are available for browsing by clicking on the link under the browse heading. The "survey entries" page contains a list of entries, each of which is represented by the gene name, author name, pathogen name and date of entry. To view all associated information of each entry, users will have to click on the "Go to the page for this entry" link provided with each entry.
Searching survey entries (back to Table of Contents) - Under the search topic, clicking on the "Search the contents of entries" link will open up a small search form. This form provides a text search for Author and pull-down menu selection for organism, validation, assay status, availability and activity. The search and reset buttons are provided at the bottom of the search form. Searches can be done using just one parameter or a combination of parameters. For example, searching for "Organism = Trypanosoma brucei" and "Validation = Genetic OR Chemical" retrieves 33 genes as results. The results are shown on a new page with similar layout to the "Browse" page.
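The way search parameters combine (AND between fields, OR between the values chosen for one field) can be sketched as follows. The entries, field names and the `search` helper below are made up for illustration and do not reflect the actual survey data or the site's implementation.

```python
# Hypothetical survey entries; field names mirror the search form, values are invented.
ENTRIES = [
    {"gene": "gene1", "organism": "Trypanosoma brucei", "validation": "Genetic"},
    {"gene": "gene2", "organism": "Trypanosoma brucei", "validation": "None"},
    {"gene": "gene3", "organism": "Plasmodium falciparum", "validation": "Chemical"},
    {"gene": "gene4", "organism": "Trypanosoma brucei", "validation": "Chemical"},
]

def search(entries, **allowed):
    """Keep entries whose value for every given field is one of the allowed values."""
    return [
        entry for entry in entries
        if all(entry.get(field) in values for field, values in allowed.items())
    ]

# Organism = Trypanosoma brucei AND Validation = Genetic OR Chemical
hits = search(
    ENTRIES,
    organism={"Trypanosoma brucei"},
    validation={"Genetic", "Chemical"},
)
print([entry["gene"] for entry in hits])
```

Calling `search` with a single field corresponds to searching on just one parameter.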
Submitting survey entries (back to Table of Contents) - Users can contribute information regarding one or more genes in one or more pathogens using the "Submit a new entry" link shown under the submit heading. As mentioned above, only registered users can contribute curated information. The submit entry form consists of a series of text boxes and pull down menus which can be used to contribute information. The author name is automatically entered for the person logged in. The species for which the user wants to contribute can be selected from the "Species" pull down menu. Right now this list is restricted to TDR priority pathogens only. The most important entry is the gene ID information as this is required for mapping the details entered to the gene records contained in the database. The users have the choice of entering the gene ID (matching what is already in TDRtargets) or the EC number or GenBank/EMBL/Swissprot accessions or simply the target name. The validation, assayability, availability and activity entries are made by selecting the appropriate entries from the associated pull down menus. Finally there is a text box provided for entering a description about the entry. The submission process is completed by pressing the "submit idea" button at the bottom of the form.
IV. Some example queries and their results
IV. 1. Example 1 (search and rank genes) (back to Table of Contents)
Go to the Targets search page to run the following example.
Step 1: Pathogen = Plasmodium falciparum
Step 2: Text Search = Kinase
Step 3: Query-1 Name = P. falciparum Kinase. Weight = 25
Step 4: Press “next query”
Step 5: Signal Peptide = Y
Step 6: Query-2 Name = P. falciparum genes with signal peptide. Weight = 25
Step 7: Press “next query”
Step 8: Phylogenetic Distribution = NOT IN Homo sapiens; NOT IN Mus musculus;
Step 9: Query-3 Name = P. falciparum genes absent in animals. Weight = 25
Step 10: Press “this is the last query”
Step 11: View ranked list of genes on the result page. Total = 3864 genes obtained by weighted Union of genes from 3 queries. (P. falciparum Kinase = 147 genes; P. falciparum genes with signal peptide = 815; P. falciparum genes absent in animals = 3669 genes).
There are 5 P. falciparum genes with weight = 75 and these are the top ranking genes in a total of 3864 for the chosen criteria.
Step 12: To get a list of curated compounds associated with any of the 3864 genes obtained as result above, click on the "Show curated compounds" link on the page showing the list of genes. Result = 144 compounds.
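The ranking in this example can be pictured as a weighted union: every gene returned by at least one query is kept, and its score is the sum of the weights of the queries that returned it. The sketch below uses small toy gene sets in place of the real results (147, 815 and 3669 genes), and the `weighted_union` helper is a hypothetical illustration rather than the site's own code.

```python
from collections import defaultdict

def weighted_union(queries):
    """queries: list of (gene_set, weight). Returns [(gene, total_weight)], highest first."""
    totals = defaultdict(int)
    for genes, weight in queries:
        for gene in genes:
            totals[gene] += weight
    # Sort by descending score, then by gene name so that ties are ordered predictably.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Toy gene sets standing in for the three queries of Example 1.
kinases = {"g1", "g2", "g3"}
signal_peptide = {"g2", "g3", "g4"}
absent_in_animals = {"g3", "g4", "g5"}

ranked = weighted_union([(kinases, 25), (signal_peptide, 25), (absent_in_animals, 25)])
for gene, score in ranked:
    print(gene, score)
```

Only genes that satisfy all three criteria reach the maximum score of 75, which corresponds to the top-ranking genes reported in Step 11.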
IV. 2. Example 2 (search compounds) (back to Table of Contents)
Go to the Compounds search page to run the following example.
Step 1: Molecular weight < 300
Step 2: Press "Search"
Step 3: View list of 159269 compounds
Go back to Compounds search page (if you use the browser button to go back you have to reset the search page)
Step 4: Lipinski's rule of five = 4
Step 5: Press "Search"
Step 6: View list of 603995 compounds
Go back to Compounds search page (if you use the browser button to go back you have to reset the search page)
Step 7: Activity = IC50 < 10 nM
Step 8: Press "Search"
Step 9: View list of 32872 compounds
Step 10: Go to the History page and combine the results from the above 3 compound queries by intersection. Result = 1986 compounds (NOTE: this step may take a minute or so to execute)
Step 11: Click and view the list of compounds obtained by intersection.
Step 12: To get a list of curated genes associated with any of the 1986 compounds obtained as result above, click on the "Show curated genes" link on the result page showing the list of compounds. Result = 95 genes (retrieved from various different pathogens).
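Combining the three compound queries by intersection keeps only the compounds that appear in every result list. A minimal sketch, assuming each query result is a set of compound identifiers (the IDs and the `intersect` helper are hypothetical):

```python
# Toy compound ID sets standing in for the three query results of Example 2.
mw_below_300 = {"c1", "c2", "c3", "c4"}
lipinski_4 = {"c2", "c3", "c4", "c5"}
ic50_below_10nm = {"c3", "c4", "c6"}

def intersect(*queries):
    """Return the IDs present in every one of the given query results."""
    return set.intersection(*(set(query) for query in queries))

common = intersect(mw_below_300, lipinski_4, ic50_below_10nm)
print(sorted(common))
```

The intersection can never be larger than the smallest input list, which is consistent with the 1986 compounds obtained above from lists of 159269, 603995 and 32872 compounds.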