
The Laboratory for Information Analysis and Knowledge Mobilization

 

Systems Biology

Researchers: Igor Jurisica, University of Toronto
Partners: IBM Life Sciences Discovery Centre

 

OBJECTIVES

Tools and resources for systematically discovering clinically relevant biomarkers will be created. Prognostic, predictive and early disease detection biomarkers will be identified and characterized for pancreatic cancer and sarcoma. These cancers have been selected because they represent a particularly grave health threat, and because local experts and collaborators are well positioned to follow through and translate the results into improved patient care. Better characterization of prognostic and predictive biomarkers may also reveal new drug targets and treatment options, reducing the cost of biological experiments and, by improving cure rates, the cost of cancer-related health care. Two major areas for innovation in these cancers are improved early diagnostics and tumor characterization, both of which support the discovery of more effective and curative therapies following a “personalized medicine” approach that tailors treatments to individual patients. This will be implemented in three specific aims:

  • Expand the integrative computational biology resource for cancer signatures (Cancer Data Integration Portal; http://ophid.utoronto.ca/cdip), known and predicted protein interactions database I2D (http://ophid.utoronto.ca/i2d), and pathway annotation databases as presented on the Cancer Gene Encyclopedia (http://ophid.utoronto.ca/cgep). To eliminate overlap and further synergize with ORF-GL2, we will focus on pancreatic cancer and sarcoma.
  • Develop and apply novel network-based approaches to biomarker discovery for early disease detection, improved diagnosis and prognosis, and treatment response prediction (http://ophid.utoronto.ca/navigator). While the tools themselves are being developed in ORF-GL2, their application to sarcoma and pancreatic cancer is not covered.
  • Identify new therapeutic targets and treatment options tailored to each individual patient. Develop new tools for supporting “personalized medicine”; namely, to effectively present information to patients when predictive biomarkers suggest altering standard of care (http://ophid.utoronto.ca/networx; http://dcv.uhnres.utoronto.ca/SCRIPDB). While the resources themselves are being developed in ORF-GL2, their application to sarcoma and pancreatic cancer is not covered.

CONTEXT

Cancer development is a multi-step process that leads to uncontrolled tumor cell growth, caused by and resulting in complex changes: many genes are amplified, deleted, mutated, or up- or down-regulated, and many pathways are activated or suppressed. Estimating from the CONCORD study across 1.9 million patients from 31 countries and 5 continents, current treatments achieve a 5-year survival rate for less than 50% of diagnosed cancers{Coleman, 2008 #732}. Years of research improved survival in breast and prostate cancers by finding molecular markers for early diagnosis and by individualized treatment. However, pancreatic cancer remains almost 100% lethal, and the overall survival rate for lung cancer has barely improved during the past decades, having only moved from 13% to 16%. Tools and resources will be implemented and integrated to support integrative computational biology, focusing on cancer genome, proteome and interactome (funds from ORF-GL2). More protein structures will be determined and characterized, focusing on cancer-related proteins (funds from NIH, IBM, and ORF-GL2). Individual tools and resources will be built in collaboration with other leading institutes (e.g., NIH, DFCI, CNIO, HWI, U of Toronto, and Weizmann). Markers for early disease detection, as well as prognostic and predictive markers, will be identified and validated with the overall goal of improving survival rates in pancreatic cancer (Drs. Hedley, Reis) and sarcoma (Dr. Maestro). The research will enhance predictive signatures by identifying new potential treatment targets and discovering effective drug combinations. This translational research is in part supported by existing grants (e.g., ORF-RE4 & -GL2, PSA, CIRM, NIH, DOD, CIHR, and OICR), and such approaches are a foundational element of the institution’s strategic research plan (see below). Importantly, all tools and resources are, and will remain, in the public domain to benefit research.

The LONG-TERM GOAL: Develop and then apply novel tools for the integration, analysis and interpretation of complex biomedical data, with the aim of identifying testable hypotheses and building useful models.

NEED: Existing algorithms either do not scale with the size and complexity of realistic problems or provide only partial answers. Using a combination of heuristic algorithms that intelligently reduce the search space, together with compiler and hardware optimization, we strive to develop “fast enough” approaches. The required speed is context-dependent: 1) in protein crystallography(J43) it is measured in years; 2) in biomarker optimization(MS1, J14-16, J20, J23, J24, J27, J36, J39, J40, J45) it is measured in weeks or months; and 3) in integrative interpretation and hypothesis generation(J5, J18, J19, J21, J22, J26, J30-33, J35, J41, J42) it must be interactive. We will build upon tools, resources and experience from successful systems (J4, J5, J14, J18, J19, J21, J22, J26, J30-33, J35, J41-43, J47-49).

SYNERGY: Developing such a comprehensive platform requires an interdisciplinary team and sufficient infrastructure. In this application, we seek funding for graduate students and a PDF. Importantly: 1) optimized computing infrastructure is already in place; 2) programmers, technicians and system/DB administrators are already funded from ORF; 3) integrative computational biology resources are being created using ORF-GL2; 4) validation of testable hypotheses is funded by ORF-RE, NIH, CIHR. Application of this research will enable “personalized medicine” that effectively tailors treatments to individual patients, which in turn will lead to more effective and curative therapies against cancer.

The SHORT-TERM GOAL: 1) develop a scalable, probabilistic, network-based algorithm for comprehensive identification of effective biomarkers for early disease detection, improved diagnosis and prognosis, and treatment response prediction; 2) develop scalable network inference approaches using a Bayesian approach to predict combination treatment options using drug target databases and screens combined with networks of physical protein interactions; and 3) develop planning approaches for drug synthesis. The algorithms have to handle incomplete, contradictory and ambiguous information in an automated fashion; support multiple viewpoints and contexts; seamlessly integrate diverse data sources; scale to ultra-high dimensions; support multimodal and rapidly evolving representations; and handle incompleteness of domain theories. Some also need to be interactive.

To TEST & EVALUATE these approaches, we will use multiple strategies: first, leave-one-out cross-validation; then validation on independent data sets; and finally, biological validation (other grants).
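The first evaluation step, leave-one-out cross-validation, can be sketched as follows. This is a minimal illustration only: the nearest-centroid classifier, the two-gene toy data and all function names are assumptions made for the example, not the classifiers or datasets used in the project.

```python
from statistics import mean

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean distance)."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return min(sorted(centroids), key=lambda lab: dist(centroids[lab]))

def leave_one_out_accuracy(X, y):
    """Hold out each sample once, train on the rest, and score the held-out prediction."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Toy expression values for a hypothetical 2-gene signature (not real data).
X = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [3.0, 3.1], [2.9, 3.3], [3.2, 2.8]]
y = ["good", "good", "good", "poor", "poor", "poor"]
loo_accuracy = leave_one_out_accuracy(X, y)
```

Because the signature genes would normally be selected from the same data, any feature selection has to be repeated inside each fold; selecting genes once on all samples and then cross-validating is one source of the overoptimistic estimates discussed in the background section.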

CANCER INFORMATICS BACKGROUND

Most cancers lack effective early disease markers and prognostic and predictive signatures. The main challenge is learning from and using the enormous amounts of complex molecular data from tumor samples. Many methods have been introduced to generate prognostic and predictive signatures from these data (reviewed in{Dupuy, 2007 #13230}), but results remain less than satisfactory: signatures suffer from severe overtraining{Dupuy, 2007 #13230}, are overoptimistic{Michiels, 2005 #13324}, or lack robustness{Ein-Dor, 2005 #3823;Ein-Dor, 2006 #3824} due to small numbers of samples{Ein-Dor, 2006 #3824}. Existing signatures overlap only partially, and often do not validate on external cohorts or by other biological assays(MS3, J8). Reasons include: 1) patient diversity and tumor heterogeneity, 2) the range of platforms, 3) the diversity of statistical and bioinformatics approaches, 4) the small number of samples analyzed{Ein-Dor, 2006 #3824}, and 5) the existence of multiple equivalent signatures{Ein-Dor, 2005 #3823}(J14). A promising alternative to the brute-force approach that works in the space of expression levels of thousands of genes, microRNAs, metabolites or proteins is an approach that strives to characterize each tumor on the basis of a small number of biologically relevant pathway-related variables{Bauer-Mehren, 2009 #13383;Mazumder, 2008 #13366;Yu, 2007 #5}, integrating expression and other high-throughput{Kelder,  #1} as well as clinical data{Chen,  #2824}.

Despite noise present in interaction data sets, their systematic analysis uncovers biologically relevant information: lethality{Jeong, 2001 #3;Hahn, 2005 #453}, functional organization{Maslov, 2002 #8;Gavin,  #7;Wuchty, 2006 #2685}, hierarchical structure{Ravasz, 2002 #16;Yu, 2006 #2655}, modularity {Han, 2004 #215}(J22) and network-building motifs{Milo, 2002 #11;Rice, 2005 #4419;Przulj, 2004 #511}(MS1,J5). Thus, networks have a strong structure-function relationship{Przulj, 2004 #511}, which can aid interpreting cancer data. Many interactions are transient, and the networks change across tissues or under different stimuli{Hu, 2007 #4421}(MS4). Studying the dynamics of these networks is an exponentially more complex task. Many stable complexes show strong co-expression of corresponding genes, whereas transient complexes lack this support{Jansen, 2002 #2270}(MS2). These contextual network dynamics must be considered when linking interactions to phenotypes and when studying network topology. Systematic graph theory analysis of dynamic changes in interaction networks, combined with gene/protein cancer profiles, will enable systematic analysis of cancer{Kato, 2006 #2643;Jonsson, 2006 #2670;Wachi, 2005 #2610}(MS1,MS3). Implementing algorithms using heuristics fine-tuned for interaction networks(J5,J30,J31) will ensure scalability.

The most successful network-based methods of gene group identification for class prediction have been the score-based sub-network markers{Ideker, 2002 #555;Chuang, 2007 #3515;Nacu, 2007 #4423;Hwang, 2009 #13142}. Sub-networks identified using these approaches were recently shown to be highly conserved across studies and to perform better than individual genes or pre-defined gene groups{Chuang, 2007 #3515}. Considering network modularity further improves prediction(J22). Combining existing known and predicted interactions from I2D with novel local co-expression annotation of existing edges will elucidate disease-specific dynamics and identify the local network structures (graphlets(J5, J30)) that are the most aberrant components in the cancer network, as compared to normal.
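A score-based sub-network search of this kind can be sketched as a greedy expansion from a seed gene. The sketch below is a simplified illustration under stated assumptions: the aggregate score (sum of per-gene z-scores divided by the square root of the sub-network size) is one common choice in this family of methods, and the toy network, z-scores, `min_gain` threshold and function names are hypothetical, not the project's algorithm.

```python
import math

def subnetwork_score(genes, z):
    """Aggregate score: sum of per-gene z-scores divided by sqrt(sub-network size)."""
    return sum(z[g] for g in genes) / math.sqrt(len(genes))

def grow_subnetwork(seed, adjacency, z, min_gain=0.05):
    """Greedily add the neighbouring gene that most improves the aggregate score."""
    members = {seed}
    score = subnetwork_score(members, z)
    while True:
        # Candidate genes are interaction partners of current members.
        frontier = {n for g in members for n in adjacency[g]} - members
        best_gene, best_score = None, score
        for cand in sorted(frontier):
            s = subnetwork_score(members | {cand}, z)
            if s > best_score:
                best_gene, best_score = cand, s
        # Stop when no neighbour helps, or the improvement is negligible.
        if best_gene is None or best_score - score < min_gain:
            return members, score
        members.add(best_gene)
        score = best_score

# Toy interaction network and differential-expression z-scores (illustrative only).
adjacency = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B"}}
z = {"A": 2.0, "B": 1.8, "C": 0.1, "D": 1.5}
members, score = grow_subnetwork("A", adjacency, z)
```

In this toy case the search starting at A adds B and then D but rejects C, whose weak z-score would dilute the aggregate score.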

Cancer treatment fails because cancer initiates in multiple ways and develops treatment resistance{Yun, 2008 #2002;Tsao, 2005 #1992}(J8, J11). Many cancers can be fractionated into therapeutic subsets based on characteristic molecular phenotypes{Pegram, 1998 #53;Slamon, 2009 #63;Buyse, 2006 #3376;Spentzos, 2004 #2013}(J27), but because of the many dynamic changes occurring in the cancer milieu, efforts to systematically analyze and model these changes are still in their infancy. We need an integrative approach to distinguish causal from bystander targets, i.e., one that combines multiple genomic/proteomic datasets with pathway and annotation databases using systematic, probabilistic analysis and modeling.

Identifying novel targets will require designing new drugs. Organic syntheses can be modeled as resource-constrained branching plans in a discrete, fully observable state space, similar to branched planning domains such as nondeterministic planning{Rintanen,  #43}. Despite 40 years of work{Corey, 1969, 1969;Law, 2009 #29;Todd, 2005 #41;Clayden, 2000 #39}, the chemical synthesis planning problem is largely unknown to the AI community. Plans cannot have cycles, and the branching factor in the search for a synthesis can be large. ChemAxon (www.chemaxon.com) provides a free library of 145 manually curated, commonly used reactions. Since the reaction patterns can match multiple distinct parts of the goal molecule, dozens of precursor reagents can be proposed for each reaction{Agarwal, 1978 #45}. The reaction library can be considerably enlarged using a statistical analysis of synthesis databases, since {Law, 2009 #29} estimate the existence of 92,781 unique known transformations.

Integrative computational biology approaches will be used to achieve the following four goals:

  1. TARGET DISCOVERY: Identify novel drug protein targets by exploring physical binding partners (known and predicted);
  2. MODE OF ACTION: Identify all biochemical pathways and biological processes affected by a drug using genome-wide response data from phenogenomics screens and drug:target databases;
  3. OFF-TARGET EFFECTS: Identify new affected pathways and possibly explain drug side effects; and
  4. DRUG REPURPOSING: Identify new drug candidates with relevant effect on proteins and pathways. Relevant targets will be determined by systematic and integrative analysis of data from our OC data portal.

Importantly, all data resources and tools will be made publicly available to further support academic research and broaden translation of our results.

AIM 1

Counterintuitively, good signatures comprise not only the most differential genes. Identifying the “best supporting actors” calls for integrative approaches. Training and validating signatures requires a comprehensive set of relevant datasets. Identifying “almost all” good signatures demands scalable heuristics. FOCUS of this grant: pancreatic cancer and sarcoma.

To enable comprehensive training and testing data sets and gold standards, we created Cancer Data Integration Portal (CDIP; http://ophid.utoronto.ca/cdip), which covers lung, ovarian, prostate and head&neck cancers (funded by ORF-GL2). As one example, the lung portal covers 2,799 primary tumor samples from 50 published and 8 private/internal datasets (focusing on datasets with prognostic and histological classifiers where raw data and clinical data exist). These data are extended with significantly deregulated genes in CDIP, identified by our internal analyses and data from CancerGenes MSKCC and GeneSigDB.

Cancer Data Integration Portal implements a comprehensive curated molecular data repository across multiple cancers (lung, ovarian, prostate, head&neck) and histological subtypes (http://ophid.utoronto.ca/cdip); its ovarian cancer (OC) subset was developed with DOD#W81XWH-05-1-0104 funds. Multiple experimental platforms are included, such as mRNA microarray, aCGH, SNP, miRNA, proteomics and methylation, as well as different histologies, xenografts and cell lines. The portal integrates significantly deregulated genes identified by our analyses and data from CancerGenes MSKCC, GeneSigDB and TCGA. For example, CDIP v.1.0 includes 243 significantly up-regulated and 45 down-regulated genes in at least 10 different analyses of OC data. It also comprises over 4,000 unique profiles for non-small cell lung cancer (NSCLC) samples, with 1,232 significantly up-regulated and 1,710 significantly down-regulated genes.

Pathway databases (KEGG, Reactome, PathwayCommons) will be combined to diminish false positives and negatives. We will then add protein interactions (I2D(MS2, J21), STRING, PSICQUIC, PhosphoSite, iHOP and BCMS web servers) to improve their coverage and relevance(J9,J15,J16,J23).

We will improve algorithms using experience from our earlier work, e.g., (MS3,J24,J27,J40,J48), and mainly(J14). There, a modified steepest descent algorithm identified a good signature. However, comprehensive permutation analysis identified 1,789 unique 6-gene signatures that validated as equally good. The genes forming these signatures are not the most differential, so current methods do not find them. Thus, we need probably approximately correct (PAC) machine learning techniques to identify “almost all” and “almost optimal” signatures. One strategy is to combine existing strong methods(J27,MS3,J48,J45,J40) for identifying differential genes with annotation from pathways, interactions, literature(J21), etc., and with the FPClass association mining algorithm(J4) to derive equivalence classes of features, which could be used to produce alternative signatures. We will also improve our method for identifying network signatures, which grows sub-networks by trading off class relevance against modularity(J22), and by using ensemble methods(J32).

AIM 2

Single drugs are not effective for all patients. Using chemogenomics screens and drug:target databases, combined with signatures from Aim#1 and probabilistic network modeling, we can rationally design drug combinations for individual patients that maximize positive outcome and minimize harmful effects.

We integrate multiple investigative approaches because none of the existing drug resources provides a full disease picture. Our preliminary studies have determined that currently tested CTEP drugs do not have extensive molecular response data available; we will therefore use “CTEP described” targets mapped to protein interaction networks and pathways. We are, however, unable to identify genes up- and down-regulated by these drugs from existing data. We will take more descriptive modes of action from DrugBank, but from these we cannot identify perturbed pathways. Although not all drugs have been tested on ConnectivityMap or in chemogenomics screens, those that have been can provide invaluable information about their activity.

To enable combination drug therapy prediction we have integrated several drug:target databases. DrugBank (http://www.drugbank.ca) encodes drug chemical features and their known targets. We extracted 9,906 human drug:target interactions covering 4,252 unique drugs and 2,108 unique proteins. Anatomical Therapeutic Chemical Codes (http://www.whocc.no/atc_ddd_index) categorize drugs according to the organ or system on which they act. The NCI-60 dataset (http://dtp.cancer.gov/mtargets/mt_index.html) contains the drug concentrations that inhibit cancer cell growth by 50% for 5,183 drugs across 60 well-characterized cancer cell lines, including NSCLC. The ConnectivityMap (http://www.broadinstitute.org/cmap) provides a series of microarray experiments performed on cell lines treated with 1,309 drugs at various dosages across several time points. Additional sources include OMIM (http://www.ncbi.nlm.nih.gov/Omim), CTD (http://ctd.mdibl.org), and PharmGKB (http://www.pharmgkb.org), as well as side-effect information extracted from publicly available package inserts (http://sideeffects.embl.de), covering 1,450 side-effects among 888 drugs. We have extracted 4,854,447 compounds (ChemDraw CDX or MOL files) from the last 10 years of US patents (112,329 patents).

Drug modes of action are complex and still poorly understood. A high-throughput assay unique to yeast (barcode-based chemogenomic screens) measures the individual drug response of every yeast deletion mutant in parallel. Here we integrate the three largest yeast chemogenomic experiments and develop a data-mining approach to investigate drug effects at the system level. We identify yeast pathways, functions, and phenotypes targeted by particular drugs, generate and identify clusters in drug-drug and pathway-pathway networks, and build NetwoRx, which allows users to screen new gene lists against the entire drug collection. The NetwoRx portal (http://ophid.utoronto.ca/networx) enables users to explore drug effects at the systems level. It stores pre-computed drug lists for KEGG pathways, GO categories, YEASTRACT transcription factor targets, and orthologs of human KEGG DISEASE groups. Users can interactively explore or download pathway-drug, pathway-pathway, and drug-drug networks, or submit a new gene set to NetwoRx and retrieve drugs that target it. Each drug links to the PubChem database.

We will facilitate drug therapy prediction by integrating drug databases. We extracted 9,906 human drug:target interactions covering 4,252 unique drugs and 2,108 unique proteins from DrugBank (www.drugbank.ca). Anatomical Therapeutic Chemical Codes categorize drugs according to the organ or system on which they act (www.whocc.no/atc_ddd_index). The NCI-60 data comprise the concentrations at which 5,183 drugs inhibit cell growth by 50% in 60 cancer cell lines (dtp.cancer.gov/mtargets). The ConnectivityMap provides a series of microarray experiments performed on cell lines treated with 1,309 drugs at various dosages across several time points (www.broadinstitute.org/cmap). Additional sources include OMIM (www.ncbi.nlm.nih.gov/Omim), CTD (ctd.mdibl.org), 1,450 side-effects among 888 drugs (sideeffects.embl.de), and PharmGKB (www.pharmgkb.org).

Although drug modes of action are complex and poorly understood, NetwoRx is an effective way to identify drug combinations(J46). We will extend drug target proteins in DrugBank with 3,134 interactions in I2D and 11,656 interactions from FPClass predictions. We will use data from ConnectivityMap and a rank-based approach to determine drugs that, when combined, invert the gene expression of a given set. A bipartite graph of drug:pathway associations can be used to further prioritize targets, leveraging existing information from the I2D database with a seeded Bayesian network approach{Djebbari, 2008 #13137} to capture gene expression activity in response to drug treatment. Extending current descriptive drug target networks, our Bayesian approach can infer predictive models of the complex behavior of biological systems in response to drugs. Predictive network models may provide a mechanism to develop testable hypotheses to prioritize drug targets as we move from cancer signatures to pathways and individualized medicine. Scalability of this modeling is essential so that we can target real problems. We will validate the most frequent genes across our top pathways using CDIP and GeneSigDB, and by biological experiments (external funds).
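One way to picture a rank-based search for drug pairs that invert a signature is shown below. It is a deliberately simplified stand-in, not the ConnectivityMap scoring scheme or the project's method: each drug is represented by a gene list ordered from most induced to most repressed, a signature gene counts as reversed when it falls in the opposite extreme of that list, and a pair is scored by the number of signature genes reversed by at least one of its drugs (which assumes the two drugs act independently). All drug names, gene names and the `frac` cutoff are hypothetical.

```python
from itertools import combinations

def reversed_genes(ranked_genes, disease_up, disease_down, frac=0.25):
    """Signature genes that a single drug pushes in the opposite direction.

    ranked_genes is ordered from most induced to most repressed by the drug.
    """
    k = max(1, int(len(ranked_genes) * frac))
    induced = set(ranked_genes[:k])
    repressed = set(ranked_genes[-k:])
    return (set(disease_up) & repressed) | (set(disease_down) & induced)

def best_pair(profiles, disease_up, disease_down):
    """Pick the drug pair whose combined coverage of the signature is largest."""
    coverage = {d: reversed_genes(r, disease_up, disease_down) for d, r in profiles.items()}
    scored = {
        (a, b): len(coverage[a] | coverage[b])
        for a, b in combinations(sorted(coverage), 2)
    }
    return max(sorted(scored), key=scored.get), coverage

# Hypothetical drug response rankings over eight genes (illustrative only).
profiles = {
    "drugA": ["g3", "g5", "g6", "g7", "g8", "g4", "g2", "g1"],
    "drugB": ["g4", "g3", "g5", "g6", "g1", "g7", "g8", "g2"],
    "drugC": ["g1", "g2", "g5", "g6", "g7", "g8", "g3", "g4"],
}
disease_up, disease_down = ["g1", "g2"], ["g3", "g4"]
pair, coverage = best_pair(profiles, disease_up, disease_down)
```

Here neither drugA nor drugB reverses the whole signature alone, but together they cover all four genes, while drugC mimics the disease state and contributes nothing.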

Signatures mapped to protein interactions will be annotated with profiles from proteomic, CGH, miRNA studies, and with network structures(J5). Network structure relates to network and protein function{Przulj, 2004 #511}. Motif identification enables useful biological insight{Milo, 2002 #11}. However, motifs are network structures identified with human bias. Although it is possible to comprehensively induce all small sub-graphs to systematically analyze protein interaction networks{Przulj, 2004 #208;Colak, 2009 #3540}, two challenges remain. First, functional characterization of these “building blocks” remains to be performed. Second, these building blocks form larger sub-graphs that need to be systematically analyzed. Several approaches compare network similarity through counting frequency of modules{Kalaev, 2008 #2784}; however, computing network similarity exactly is intractable in general, since the underlying subgraph isomorphism problem is NP-complete. Finally, ‘communities’ (functionally related network modules) will be identified using graph clustering methods; enrichment of disease-related proteins within these communities will be used to implicate new proteins in a given disorder{Goh, 2007 #4160}. Then, calculating similarity for multiple sub-networks will lead to predicting possible drug combinations that modify sub-networks to increase their similarity to networks from responders and normal samples.
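Enrichment of disease-related proteins within a community is commonly assessed with a one-sided hypergeometric test; the text does not specify which test will be used, so the sketch below is an assumption. The function name and the counts are illustrative only.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, community, overlap):
    """P(X >= overlap) when `community` proteins are drawn without replacement
    from `population` proteins, `annotated` of which are disease-related."""
    upper = min(annotated, community)
    total = comb(population, community)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, community - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Toy numbers: a network of 20 proteins, 5 known disease proteins,
# and a community of 4 proteins containing 3 of them.
p_value = hypergeom_enrichment_p(population=20, annotated=5, community=4, overlap=3)
```

With many communities tested at once, the resulting p-values would need a multiple-testing correction before any community is used to implicate new proteins.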

Integrating interaction networks with gene expression data, phenotypic data, genetic interactions, and established disease-related proteins has been used successfully to identify novel disease proteins{Pujana, 2007 #12682}, as well as to predict gene annotation and protein function{Tasan, 2008 #12911}. Integrating gene expression data with interactions may provide evidence of the dynamic nature of the interactions within cancer networks.

A few examples of NetwoRx-enabled analysis:

  1. Some gene sets are drug-network hubs. In the integrated network of connections between drugs and transcription-factor targets, the IFH1 and GCR1 gene sets are much more druggable than others.
  2. Some drugs share modes of action. A focused search for the connection between the chemotherapeutic agent Cisplatin and DNA damage pathways returns two KEGG pathways: nucleotide excision repair (sce03420) and homologous recombination (sce03440). Querying NetwoRx for all drugs connected to these pathways reveals that many cancer drugs affect both.
  3. Clustering reveals drugs with similar pathway response profiles. Analyzing the drug-pathway matrix using the WGCNA method (REF) identified dozens of modules with shared modes of action.
  4. NetwoRx identifies drugs that target sets of aging genes in multiple studies. Three studies (Matecic et al. 2010, Fabrizio et al. 2010 and Powers et al. 2006) assayed thousands of yeast knockout strains to identify genes that extend chronological lifespan. These studies showed very poor overlap in terms of the genes identified, but we find that they share many targeting drugs (p<<0.01).
  5. NetwoRx identifies drugs with modes of action similar to known aging drugs. We queried the drug-drug network to identify drugs whose activity profiles were similar to those of resveratrol, rapamycin, and caffeine.

Existing tools and resources will be further integrated and expanded to support integrative analyses of ovarian cancer profiles (http://ophid.utoronto.ca/navigator, http://ophid.utoronto.ca/i2d, http://ophid.utoronto.ca/cdip, http://ophid.utoronto.ca/mirDIP, and a mirror of GeneCards, http://ophid.utoronto.ca/genecards). External pathway databases will be combined and annotated to diminish false positives and false negatives. It is useful to combine annotated pathways with protein interactions to improve their coverage and relevance{Radulovich, 2010; Savas, 2009}. The following resources will be integrated: KEGG, Reactome, PathwayCommons, I2D, and interactions from STRING, PSICQUIC, PhosphoSite, iHOP and BCMS web servers. This annotated portal for pathways and interactions is highly beneficial for improved biomarker discovery and interpretation, e.g.,{Notta, 2011; Eppert, 2011; Elschenbroich, 2011; Navab, 2011; Shirdel, 2011; Wei, 2011; Zhu, 2010; Reis, 2010; Radulovich, 2010; Fortney, 2010; Agarwal, 2009; Cox, 2009; Deribe, 2009; Savas, 2009; Tomasini, 2009; Gortzak-Uzan, 2008}.

PARP1 Example: Compounds that inhibit mTOR and are also direct interactors of PARP1 (Table 1). Networks of PARP1-related drugs and interactors (Figure ). Sirolimus (highlighted) is the only compound that is an mTOR inhibitor and is also associated with PARP1 through the KEGG base excision repair pathway in the ConnectivityMap analysis. In addition, the network and drugs identified from CTD are shown in Figure .

   


Searching ChEMBL for PARP1 and mTOR inhibitors: PARP1 has 1,637 bioactive compounds associated with it and mTOR has 4,024. The overlap includes CHEMBL1090, CHEMBL50, and CHEMBL38165. CHEMBL1090 is the approved drug Vidarabine.

Then, using ConnectivityMap, we identified genes consistently up- or down-regulated by Vidarabine. We then identified those that are significantly up- or down-regulated in CDIP_ovarian_cancer (Cancer Data Integration Portal; http://ophid.utoronto.ca/cdip), to focus on up-regulated genes that Vidarabine represses and down-regulated genes that Vidarabine can overexpress.
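The filtering step described above amounts to two set intersections. The sketch below uses placeholder gene identifiers and a hypothetical function name; it shows only the logic, not the actual ConnectivityMap or CDIP gene lists.

```python
def reversal_candidates(drug_up, drug_down, tumor_up, tumor_down):
    """Genes deregulated in the tumor that the drug moves in the opposite direction."""
    return {
        # Up-regulated in the tumor, repressed by the drug.
        "repressed_by_drug": sorted(set(tumor_up) & set(drug_down)),
        # Down-regulated in the tumor, induced by the drug.
        "induced_by_drug": sorted(set(tumor_down) & set(drug_up)),
    }

# Placeholder identifiers, not real results.
candidates = reversal_candidates(
    drug_up=["geneA", "geneB", "geneC"],
    drug_down=["geneD", "geneE"],
    tumor_up=["geneD", "geneF"],
    tumor_down=["geneB", "geneC", "geneG"],
)
```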

 

The corresponding protein interaction network from I2D for this analysis is shown in Figure .

Searching for PARP1 in the Google Patent database yields 712 results. Restricting to post-2001 leaves 681 patents and patent applications. Restricting further to issued patents (the subset for which we have data, although applications could be downloaded as well) leaves 43 patents. Several of these contain phthalazinone or indole moieties, which are common pharmaceutical motifs and appear in a number of structures in our drug database.

 

 

 

 

AIM 3

Designing effective drugs for targets from Aims#1&2 is complex and expensive. Systematically exploring the possible space for compound synthesis requires an expert medicinal chemist. By learning from past synthesis plans and using efficient planning, one can automate the process.

We have extracted 4,854,447 compounds from US patents (112,329 patents) and created a database of compounds and reactions(J49)(http://dcv.uhnres.utoronto.ca/SCRIPDB), which will serve as our domain knowledge. Usefully, over 57 thousand patents include up to 100 compounds, and almost 1,300 patents include over a thousand compounds each.

Representations of molecules, reactions, and their use in a synthesis can be seen as generalizations of domain independent planning problems, where sets of Boolean variables have been replaced with labeled graphs. States of organic syntheses are molecules, which are represented by graphs where vertices correspond to atoms (of elements such as carbon, oxygen, hydrogen, etc.) and edges correspond to bonds of different types (single, double, aromatic). The operators in chemistry are reactions, which describe the change in bonds that can be enacted on molecules containing appropriate reactive centers. Reaction specifications include required activating substructures, which must be present to enable a reaction, and interfering substructures that forbid them. These required and forbidden substructures correspond to positive and negative preconditions of planning operators. Chemists create syntheses using a version of regression. As with operators in domain-independent planning, reaction application can be considered in the forward or reverse directions. A reversed reaction application asks whether it is possible to create a given product via a given reaction and, if so, what reagents are sufficient. These reagents may be recursively analyzed in turn until all proposed starting materials are commercially available. This recursive reversed reaction application, called retrosynthetic analysis, provides an algorithm to generate syntheses (Corey’s development of this technique was honored with the 1990 Nobel Prize in Chemistry). Retrosynthetic analysis is conceptually similar to regression except regression typically begins with a set of goal states whereas a synthesis has a single desired product that is fully specified.
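The recursive structure of retrosynthetic analysis can be sketched as a backward search that stops at commercially available materials. This is a toy model under strong assumptions: molecules are opaque strings rather than labeled graphs, each reversed reaction is a pre-computed table entry rather than the result of substructure matching, and all molecule and reaction names are hypothetical.

```python
def retrosynthesize(target, reverse_reactions, available, depth=5, _path=()):
    """Return a synthesis tree for `target`, or None if none is found within `depth` steps."""
    if target in available:
        return {"molecule": target, "buy": True}
    # Give up when the step budget is exhausted or the plan would contain a cycle.
    if depth == 0 or target in _path:
        return None
    for reaction, precursors in reverse_reactions.get(target, []):
        subtrees = [
            retrosynthesize(p, reverse_reactions, available, depth - 1, _path + (target,))
            for p in precursors
        ]
        # A reversed reaction is usable only if every precursor can itself be obtained.
        if all(t is not None for t in subtrees):
            return {"molecule": target, "reaction": reaction, "from": subtrees}
    return None

def starting_materials(tree):
    """Collect the purchasable leaves of a synthesis tree."""
    if tree.get("buy"):
        return {tree["molecule"]}
    return set().union(*(starting_materials(t) for t in tree["from"]))

# Hypothetical reversed reactions: product -> [(reaction name, precursor list), ...]
reverse_reactions = {
    "product": [("rxn1", ["X", "unobtainable"]), ("rxn2", ["intermediate", "C"])],
    "intermediate": [("rxn3", ["A", "B"])],
}
available = {"A", "B", "C", "X"}
plan = retrosynthesize("product", reverse_reactions, available)
```

The search rejects `rxn1` because one of its precursors can be neither bought nor made, and returns the two-step route through `rxn2` and `rxn3`. It returns the first feasible plan it finds; ranking alternative plans is a separate problem.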

By finding a minimal common structure among compounds in the same patent in SCRIPDB, we identify the core structure and possible synthesis steps. Then, given a new lead, the system will be able to propose meaningful molecules. Importantly, it has to provide a solution that is reachable (there are possible steps from lead to drug), reasonable (a human medicinal chemist would not find the molecules impossible), and ranked so that the best solutions are prioritized. As the number of transformations increases, we achieve improved reachability but require better ranking. We will use past experience to guide the ranking: drugs for cancer “X” should look more like other cancer “X” drugs than like other medications. This would prevent the system from discovering a completely new drug, but would enable it to find useful solutions.
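The ranking heuristic, that candidates should resemble existing drugs for the same cancer, can be illustrated with a Tanimoto similarity over substructure features. The choice of Tanimoto similarity, the feature sets and the candidate names are assumptions made for this sketch; the text does not specify the similarity measure.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of substructure features."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_candidates(candidates, reference_drugs):
    """Order candidate molecules by mean similarity to known drugs for the same cancer."""
    def score(features):
        return sum(tanimoto(features, ref) for ref in reference_drugs) / len(reference_drugs)
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)

# Hypothetical feature sets; a real system would derive them from molecular graphs.
reference_drugs = [
    {"phthalazinone", "amide", "fluoro"},
    {"phthalazinone", "amide", "piperazine"},
]
candidates = {
    "cand1": {"phthalazinone", "amide"},
    "cand2": {"indole", "amide"},
    "cand3": {"steroid"},
}
ranked = rank_candidates(candidates, reference_drugs)
```

A ranking of this kind favors familiar scaffolds by design, which matches the trade-off stated above: it will not surface a completely new drug class, but it keeps proposals within chemically plausible territory.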

Copyright ©2012 LIAKM, Toronto, Canada