Chemistry space refers to the combinatorial and configurational space spanned by all possible molecules (i.e. those combinations of atoms allowed by the rules of valence in energetically stable spatial arrangements). It is estimated that the total number of possible small organic molecules populating chemistry space could exceed 10^60, a number vastly greater than the number of molecules that have actually been isolated or synthesized.

Chemistry space is, of course, more than an uncategorized list of possible molecules. Molecules in chemistry space are related to each other in different ways: by similarity relationships, by chemical reaction pathways, and in other ways. Chemical reactions allow us to move from one molecule or chemical structure to another, defining a chemical reaction network in chemistry space. The similarity relationships include those of constitutional similarity (similarity of atoms in the molecule), structural similarity (similarity of substructures comprising the molecule), similarity of three-dimensional shape, similarity of chemical properties, or similarity of effects on the human (or animal) body due to binding to similar proteins. There are certainly more ways to assess molecular similarity than there are to skin the proverbial cat. Each similarity metric can be used to define a pairwise distance between molecules, which in turn can be used to generate a weighted or unweighted network. While many of these similarity measures are related to each other, they are not identical, and thus each will result in a different network.
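As a minimal sketch of how a similarity metric can be turned into an unweighted network, the example below assumes that each molecule is described by a set of substructure labels, uses the Tanimoto (Jaccard) coefficient as the similarity measure, and joins two molecules whenever their similarity reaches a threshold. The molecule names, feature sets, and threshold are hypothetical, chosen only for illustration; a different metric or threshold would generally yield a different network.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_network(features: dict, threshold: float) -> dict:
    """Build adjacency sets: an edge joins two molecules whose
    similarity is at least `threshold` (an unweighted network)."""
    adjacency = {name: set() for name in features}
    for m1, m2 in combinations(features, 2):
        if tanimoto(features[m1], features[m2]) >= threshold:
            adjacency[m1].add(m2)
            adjacency[m2].add(m1)
    return adjacency


# Hypothetical toy "fingerprints" (sets of substructure labels), not real data.
toy_features = {
    "mol_A": frozenset({"phenyl", "amide", "hydroxyl"}),
    "mol_B": frozenset({"phenyl", "amide", "methyl"}),
    "mol_C": frozenset({"phenyl", "amide", "hydroxyl", "methyl"}),
    "mol_D": frozenset({"thiophene", "nitrile"}),
}

network = similarity_network(toy_features, threshold=0.6)
for molecule, neighbors in network.items():
    print(molecule, sorted(neighbors))
```

Keeping the similarity value on each edge instead of applying a threshold would give the weighted variant of the same network.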

The topological characteristics of these chemistry space networks are of considerable interest, both for fundamental reasons and for practical applications to drug design. But the enormous size of chemistry space makes its thorough exploration impossible. Thus a key question in drug design is how to optimally direct research efforts towards regions of chemistry space that are most likely to contain molecules with useful biological activity. The regions of chemistry space that have been mapped through experimental investigations are extremely limited and constitute an obviously biased sample. Chemists isolate, synthesize and study molecules for a variety of reasons, including novelty, structural diversity, similarity to known drug leads, availability of source materials, unusual properties, and peer pressure. Thus it is not clear a priori whether different regions of chemistry space, or chemistry spaces constructed using different similarity metrics, should have any common characteristics, or whether the network topology of chemistry spaces should be more similar to biological networks or to social networks.

Not all chemical spaces are created equal!

Relating chemical similarity to similarity in biological activity produced by the molecules introduces yet another level of complication [1]. Changes in biological activities resulting from changes in molecular structure are described by chemists through structure-activity relationships. The fundamental assumption implicit in such studies is that similar molecules should exhibit similar activities in biological assays; this is known as the similarity principle [2]. (More generally, while similar molecules may not always exhibit similar activities in individual biological assays, similar molecules do display similar broad patterns of biological activities across a range of related protein targets [3-6].) Significant deviations from the similarity principle have nevertheless been observed, with very similar molecules often exhibiting very different biological activities [2]. This is one of the major reasons for the failure of structure-activity relationship models [7]. Gerry Maggiora postulated that such deviations arise on account of the complex nature of the activity landscape associated with biological assays, and he coined the term “activity cliffs” to characterize such regions of the structure-activity landscape [8]. In Maggiora's topographical metaphor, smooth regions of the structure-activity landscape (either flat like Kansas or like the rolling hills of England) are those that best satisfy the similarity principle. Measures such as the structure-activity landscape index (SALI) [9-12], which quantifies the change in biological activity produced by a given change in chemical structure, have been devised to characterize activity cliffs. Utilizing a cutoff value of the index enables one to represent sets of molecules through network graphs that highlight abrupt changes in biological activity associated with the steepest cliffs.
Steep activity cliffs (Bryce Canyon-like regions), associated with high SALI values, represent the most challenging regions of a structure-activity relationship to model quantitatively, but they are also the most interesting regions for purposes of drug design, because small structural modifications in a molecule can lead to a drug with vastly improved potency. Exploiting such modifications is the goal of the process known as lead optimization.
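The text above does not spell out the index itself. The sketch below assumes the commonly quoted form SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)), where A is a measured activity and sim is a similarity between 0 and 1, and shows how a cutoff selects the steepest cliffs as network edges. The activities, similarities, cutoff, and the handling of identical representations (sim = 1) are all illustrative assumptions rather than values or conventions taken from the cited work.

```python
import math


def sali(activity_i: float, activity_j: float, similarity: float) -> float:
    """Assumed form of the index: |A_i - A_j| / (1 - sim(i, j))."""
    if similarity >= 1.0:
        # Assumption: identical representations with different activities
        # are treated as an infinitely steep cliff.
        return math.inf if activity_i != activity_j else 0.0
    return abs(activity_i - activity_j) / (1.0 - similarity)


def sali_edges(activities: dict, similarities: dict, cutoff: float) -> list:
    """Return (mol_1, mol_2, SALI) for every pair whose index reaches the cutoff."""
    edges = []
    for (m1, m2), sim in similarities.items():
        value = sali(activities[m1], activities[m2], sim)
        if value >= cutoff:
            edges.append((m1, m2, value))
    return edges


# Hypothetical activities (e.g. pIC50 values) and pairwise similarities.
activities = {"mol_A": 6.0, "mol_B": 6.2, "mol_C": 8.5}
similarities = {
    ("mol_A", "mol_B"): 0.50,
    ("mol_A", "mol_C"): 0.90,
    ("mol_B", "mol_C"): 0.75,
}

cliffs = sali_edges(activities, similarities, cutoff=5.0)
for m1, m2, value in cliffs:
    print(f"{m1} - {m2}: SALI = {value:.1f}")
```

In this toy set the pair with the highest similarity and the largest activity gap (mol_A and mol_C) receives the largest index, which is the behavior the activity-cliff picture leads one to expect. Raising the cutoff keeps only the steepest cliffs.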

Network topology of chemistry spaces

The degree distribution P(k) is the probability that a given node in a network has exactly k links or connections to other nodes. Scale-free networks are characterized by a power-law degree distribution: the probability that a node has k links follows P(k) ∼ k^(-γ). Such distributions appear linear on a plot of log P(k) versus log k. The properties of a scale-free network are often determined by a relatively small number of highly connected nodes (hubs). In contrast, the tail of the degree distribution of a random network decreases exponentially with the degree k, as P(k) ∼ exp(-k), so nodes whose degrees deviate significantly from the average degree are extremely rare. For a chemistry space network, we take each molecule as a node of the network, and use a discretized similarity measure to define the edges. Investigation of a number of chemistry space networks using a variety of similarity measures has revealed the heavy-tailed degree distribution characteristic of a small-world network [13-15], as seen in the figure below.
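The definitions above can be sketched in a few lines. The example computes an empirical P(k) from an adjacency structure and estimates the exponent γ as the negative slope of a least-squares line through log P(k) versus log k. The function names and the toy inputs are illustrative assumptions. A straight-line fit on a log-log plot is a crude estimator, and maximum-likelihood methods are generally preferred for real data.

```python
import math
from collections import Counter


def degree_distribution(adjacency: dict) -> dict:
    """Empirical P(k): fraction of nodes that have exactly k neighbors."""
    degrees = [len(neighbors) for neighbors in adjacency.values()]
    counts = Counter(degrees)
    total = len(degrees)
    return {k: count / total for k, count in sorted(counts.items())}


def fit_power_law_exponent(pk: dict) -> float:
    """Estimate gamma in P(k) ~ k^(-gamma) as minus the least-squares
    slope of log P(k) against log k (degree-zero nodes are skipped)."""
    points = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct positive degrees")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


# Toy star graph: one hub connected to three leaves.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(degree_distribution(star))

# Synthetic values proportional to k^(-2.5); the normalization constant
# does not affect the fitted slope.
synthetic_pk = {k: k ** -2.5 for k in range(1, 51)}
gamma = fit_power_law_exponent(synthetic_pk)
print(f"estimated gamma = {gamma:.3f}")
```

On the synthetic input the fit recovers the exponent used to generate it. On real chemistry space networks the low-degree and high-degree ends of the distribution are typically noisy, so the range of k used for the fit matters.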

Degree distribution of a chemistry space

Hubs in chemistry space are represented by molecules with high leverage in structure-activity relationship models. Such molecules are important for maintaining the diversity of a chemical library and for ensuring good predictive performance of structure-activity relationship models across a wide domain of applicability. The ability to identify diverse structures, spanning very different bond frameworks or structural scaffolds, with similar activities (known as scaffold hopping) is of great importance for drug design.

Activity cliffs lead to the breakdown of simple structure-activity relationship models in their vicinity. One reason for encountering activity cliffs is the difference between the characteristics of biological networks and those of the networks generated by commonly used chemical representations. Mapping the locations of activity cliffs for different representations, and comparing the global characteristics of SALI sub-networks with those of the underlying chemistry space networks generated using each representation, can guide the modeler in the choice of an appropriate chemical structure representation.

The figure above shows the SALI sub-network (in red) of a small set of molecules superimposed upon the underlying chemistry space network (in black). A higher density of SALI edges in a region of a chemistry space network graph built with a particular chemical structure representation indicates a more challenging structure-activity relationship for that representation in that region of chemistry space. Appreciation for the role of polypharmacology (the interaction of a drug with multiple targets) is also leading to a rapidly growing interest in the investigation of networks in chemistry space [16-17].


  1. Bajorath, J.; Peltason, L.; Wawer, M.; Guha, R.; Lajiness, M. S.; Van Drie, J. H. Navigating structure-activity landscapes. Drug Discov. Today, 2009, 14 (13-14), 698–705.

  2. Martin, Y. C.; Kofron, J. L.; Traphagen, L. M. Do structurally similar molecules have similar biological activity? J. Med. Chem., 2002, 45, 4350-4358.

  3. Fliri, A. F.; Loging, W. T.; Thadeio, P. F.; Volkmann, R. A. Biospectra analysis: Model proteome characterization for linking molecular structure and biological response. J. Med. Chem., 2005, 48, 6918-6925.

  4. Fliri, A. F.; Loging, W. T.; Thadeio, P. F.; Volkmann, R. A. Biological spectra analysis: Linking biological activity profiles to molecular structure. Proc. Nat. Acad. Sci. USA, 2005, 102, 261-266.

  5. Klabunde, T. Chemogenomic approaches to drug discovery: similar receptors bind similar ligands. Br. J. Pharmacol., 2007, 152 (1), 5-7.

  6. Rognan, D. Chemogenomic approaches to rational drug design. Br. J. Pharmacol., 2007, 152, 38-52.

  7. Kubinyi, H. Why Models Fail

  8. Maggiora, G. M. On Outliers and Activity Cliffs - Why QSAR Often Disappoints. J. Chem. Inf. Model., 2006, 46 (4), 1535.

  9. Guha, R.; Van Drie, J. H. Structure-Activity Landscape Index: Identifying and Quantifying Activity Cliffs. J. Chem. Inf. Model., 2008, 48, 646–658.

  10. Guha, R.; Van Drie, J. H. Assessing How Well a Modeling Protocol Captures a Structure-Activity Landscape. J. Chem. Inf. Model., 2008, 48 (8), 1716–1728.

  11. Peltason, L.; Bajorath, J. SAR Index: quantifying the nature of structure-activity relationships. J. Med. Chem., 2007, 50, 5571-5578.

  12. Wawer, M.; Peltason, L.; Weskamp, N.; Teckentrup, A.; Bajorath, J. Structure-activity relationship anatomy by network-like similarity graphs and local structure-activity relationship indices. J. Med. Chem., 2008, 51, 6075-6084.

  13. Benz, R. W.; Swamidass, J.; Baldi, P. Discovery of Power-Laws in Chemical Space. J. Chem. Inf. Model., 2008, 48, 1138–1151.

  14. Tanaka, N.; Ohno, K.; Niimi, T.; Moritomo, A.; Mori, K.; Orita, M. Small-World Phenomena in Chemical Library Networks: Application to Fragment-Based Drug Discovery. J. Chem. Inf. Model., 2009, 49 (12), 2677–2686.

  15. Krein, M. P.; Sukumar, N. Exploration of the Topology of Chemical Spaces with Network Measures, J. Phys. Chem. A, 2011, 11:6; DOI:

  16. Hopkins, A. L. Network pharmacology: the next paradigm in drug discovery. Nature Chem. Biol., 2008, 4, 682-690.

  17. Milletti, F.; Vulpetti, A. Predicting polypharmacology by binding site similarity: from kinases to the protein universe. J. Chem. Inf. Model., 2010, 50 (8), 1418-1431.