Reviewer(s)' Comments to Author:

Reviewer: 1
Recommendation: Publish after major revisions noted.
Comments:
The paper by Wester et al. reports an analysis of several compound databases according to their topological diversity building on a topology analysis presented in an accompanying paper. The idea of analyzing topologies is interesting, but the data analysis and discussion are not convincing and should be improved. I recommend publication after major revisions following the comments below.

1. The rule for eliminating divalent nodes is formulated as rule no. 3 in definition 2, which is not even followed later because many rings finally show 2 divalent nodes and not just one. Obviously, the authors decided that the minimum ring size to be represented is a 3-membered ring, probably in order to draw edges as lines and not curves and show graphs that look like hydrocarbons. This representation is quite misleading because practically none of the graphs represented as multiple 3-rings correspond to a real molecule with the ring size shown, and it is easy to predict that most chemists will be misled into believing that these graphs are hydrocarbons. I would not be surprised to see them end up in the CAS database as hydrocarbons by mistake. Graphs, in particular those of scaffolds ignoring divalent nodes and that are the center of this study, are not molecules. The representation used consistently in the accompanying paper is the only correct one and should also be used here.

| We agree with Reviewer 1 that using minimal representatives of scaffold
| topologies is potentially confusing, so we have replaced all the lines in
| topologies containing 2-nodes by curves.

2. The number of accessible topologies for molecules is a function of molecular size since each ring requires a minimum number of atoms and small rings are more difficult to prepare, particularly with multiple rings. Therefore the topology analysis cannot be made irrespective of molecular size.
Please break down the databases by molecular size (in number of atoms) and see which and how many topologies are covered for each molecular size in each library. See also which is the minimum molecular size necessary for each topology to be represented, and at which molecular size the number of examples for any given topology decreases. Such an analysis might reveal interesting differences between the different databases. Is the example with 165 rings in PubChem a peptide?

| Reviewer 1 is correct in stating that atom-typed topologies (i.e., those
| associated with chemical structures) are a function of molecular size.
| However, topologies are mathematical objects and, as such, have no mass (and
| no size). We suggested that topologies can be associated with molecules by
| providing the "minimum" topologies, i.e., the absolute minimum number of
| atoms that could conceivably be coerced into a molecule representing a
| particular topology. However, certain topologies require a larger number of
| atoms before an actual "real world" representation is found. One such example
| is the missing topology from the 4-ring topology family, a non-planar graph
| that can only be realized as a 120-atom tri-ester molecule
| shaped like a Moebius band. Modifying the reviewer's suggestion slightly, we
| examined the average number of atoms per scaffold for the most frequent
| topologies and compared it with the size of the minimum topologies. We put this
| analysis at the end of section 3.
|
| The 165-ring topology was computed from a copper
| tetracarboranylphenylporphyrin.

3. In principle the number of compounds per possible topology should grow with the exhaustiveness of the enumeration in each library.
Thus the topological diversity per compound decreases as the completeness of the enumeration increases, and vice versa; therefore, having low topology-numbers per compound is not necessarily bad, as implied in the writing; it just means that the library is exhaustive and detailed. Partial sampling (here many topologies) versus exhaustiveness has always been an argument in combinatorial chemistry to best use screening capacity, and each strategy has its pros and cons. This aspect should be discussed here in a balanced manner rather than professing that more topology per compound is better.

| We did not intend to imply that one classification was better or worse than
| the other. Indeed, these two schemes are complementary (exhaustive
| enumeration and scaffold topologies). We have made a number of changes
| throughout the text to clarify our thoughts on the issue, and added a much
| stronger discussion in the conclusions of how the two strategies interact and
| complement each other.

4. An important hypothesis to justify this study is that topology is a key categorization of molecules in the perspective of selecting compounds for biological testing, as expressed in the last sentence of the introduction. Although figure 2b is a nice selection, the possible importance of topological diversity for bioactivity should be discussed critically. At least it's not obvious to this reviewer that the classification should be useful. For instance, the set consisting of Vitamin A, Vitamin C, Niacin, serotonin, menthol, vanillin, aspirin, paracetamol, and TNT, has just one topology but a diversity of biological activities, but the set consisting of benzene, naphthalene, pyrene, anthracene and phenanthrene, has five topologies but a rather narrow bioactivity spectrum. Topologically, glucose and benzene fall together, while sucrose and biphenyl fall together in a different topology; is this useful?
The 262144 possible decanucleotides consisting of four DNA-bases can have 512 different topologies, but are these diverse? The 1024 decapeptides consisting of all possible combinations of phenylalanine and tryptophan also have 512 different topologies, but the 10^10 decapeptides made of all combinations of phenylalanine with Gly, Asp, Lys, Leu, Gln, Ser, Met, Val, and Glu only represent 11 topologies; which set is more diverse?

| Reviewer 1, most certainly, provides some interesting examples as to the
| diversity of bioactivities for a single topology. However, the stated goal
| of our classification was to provide a comprehensive overview of the
| scaffolds that are potentially available across all chemical space within the
| constraints of this work. We did not intend to substitute for any of the
| existing methods, rather to provide a rapid classification system, which is
| strengthened by its generality as opposed to its subclassification abilities.
| Such a general method cannot be a substitute for finer-detail methods that
| examine chemistry:biology interactions. Rather, it is provided to give a
| most-general viewpoint. We hope the reviewer will agree that, when it comes
| to mapping, the exact 3D map of North America serves a different purpose
| compared to the map of interstate highways; in a sense, this is exactly what
| we're asked to account for, and we would like to suggest that general
| topological mapping tools serve a different purpose compared to exact,
| atom-typed analyses. We have expanded the conclusions to make these points
| clearer.

5. Extending on the last point, in fact the idea of "scaffold hopping" is an attempt to dissociate bioactivity from topology and there is good evidence that this works. A comment on this would be welcome.

| Without doubt, scaffold hopping as envisioned, say, by Cramer, is intended to
| find biologically equiactive replacements for known chemistry, often by
| identifying a scaffold that can be replaced.
| Our methods cannot easily be
| used to "scaffold hop", since no atom types are provided. Hence, it could be
| quite difficult to identify which topologically-equivalent atom-typed
| scaffolds can replace any given query.

-------------------------------------------------------------------------------

Reviewer: 2
Recommendation: Publish after major revisions noted.
Comments:
In this paper the authors study the distribution of topology scaffolds in six databases of interest for medicinal chemists, using the methods described in a previous paper. This is a timely issue and the paper is well written. There are a few points that the authors should consider addressing:

1) The relation between topological diversity and chemical diversity should be explored further. There are a few programs that determine the number of scaffolds present in a library (ClassPharmer, SARvision, etc.; these programs are likely available to the author with Pharma affiliation). It would make the paper more interesting to a broader audience if the authors traced a correlation between the topological scaffolds and the scaffolds found in the different databases.

| The stated goal of our classification is to provide a comprehensive overview
| of the scaffold classes (topologies) that are potentially available across
| all chemical space within the constraints of this work. By definition, such
| a general examination is not intended to substitute for fine-detail analyses,
| such as those found in ClassPharmer and SARvision, among others. Indeed,
| the two types of analyses are complementary and we have expanded the
| conclusions to make this point clearer. However, we agree that some
| commentary on scaffolds is appropriate and so we have added general comments
| on the relative numbers of scaffolds per database. Also, we have converted
| the scaffolds in Bemis and Murcko's paper to topologies and compared with our
| results. Please see our answer to comment (2) below as well.
2) A metric for diversity of a library can be obtained by looking at the number of scaffolds per molecule. It may be interesting to explore this parameter.

| Reviewer 2 makes an interesting suggestion. However, the problem one faces
| in this analysis is that topology is most general, whereas atom-typed
| scaffolds tend to be more specific. When we look at the number of scaffolds
| per molecule and compare per topology across libraries, we find that the size
| of the database is the predominant trend in the observed values. We report
| this result near the end of the analysis section, where we examine the most
| frequent topologies.

3) A truly non-drug like database should be added for comparison (pigments and/or environmental toxicants, etc.). All the sets selected, including PubMed, have a bias towards biological relevance, including databases for screening.

| This is an excellent suggestion, and we have added the DSSTox dataset (a
| subset of PubChem) to our analyses.

The paper is technically sound, but it is somewhat repetitive and does not provide a significant insight, but just a compilation of results. Perhaps addressing some of the points above would contribute to increase its interest.

-------------------------------------------------------------------------------

Reviewer: 3
Recommendation: Publish after major revisions noted.
Comments:
The study of rings in chemical universe and ring composition of available chemical databases is actual and of interest to current chemical informatics. The communication consists of 2 parts. The first, shorter part is a description of the methodology, and in the 2nd part an application to the study of several chemical databases follows. I do not have any critical points concerning the actual study.
To make the paper more readable and more useful for the general chemoinformatics community, however, I suggest the following modification: Since the two parts are closely related and refer to each other, I strongly recommend merging the two manuscripts into one communication. The first theoretical part may be somewhat shortened; the more technical part (which is too technical for the general reader of JCIM) should be moved to the supporting information section.

| These papers have distinctly different audiences and are also authored
| by different groups. The first paper is intended to provide a general
| algorithm for mathematical enumeration of scaffold topologies, while the
| application part is intended as an example. For these two reasons, we
| believe that the manuscripts should not be merged.

Some additional points: I am missing some citations relevant to this topic, for example: Lipkus, A. (2001), 'Exploring chemical rings in a simple topological-descriptor space.', J. Chem. Inf. Comput. Sci. 41, 430-438, or Xu, Y. & Johnson, M. (2002), 'Using Molecular Equivalence Numbers To Visually Explore Structural Features that Distinguish Chemical Libraries', J. Chem. Inf. Comput. Sci. 42, 912-926.

| We wish to thank the referee for pointing these papers out. We have added
| a reference to the Lipkus paper in the chemical database evaluation.

The PubChem database currently contains 10.9 million unique compounds (the PubChem command all[filt]). The author listed 11.6 million in Nov. 2006. How is this possible?

| We have added the following footnote to Table 3:
|
|     PubChem substances were used in the analyses because, at the time they
|     were performed, substances but not compounds could be identified as active.

The form of citations does not conform to the JCIM format.

| We have corrected the format of the citations.

-------------------------------------------------------------------------------

Reviewer: 4
Recommendation: Publish after minor revisions noted.
Comments:
The manuscript presents a novel method to analyze large chemical databases using topological scaffolds. This is a good paper with a lot of interesting data; however, the discussion falls a bit short. Below are some suggestions the authors may want to take into account:

1) What are the advantages and disadvantages of using scaffolds vs. other characteristics (such as number of rings, k-connected components, topological indices, energetic) when analyzing chemical databases?

| The use of scaffold topologies can quickly break down the chemical universe
| into subsets, enabling each one to be queried separately while ignoring those
| that do not match. This procedure, done once for all chemicals in a
| database, can rapidly subset all chemicals and provides an additional tool
| for ultra-fast searching. This tool is not intended to replace, but to
| complement, existing methods. OEChem versions of this software will be made
| available upon publication of this work. We have expanded the conclusions to
| make these points clearer.

2) There are other published studies reporting classification of large chemical databases; how does the current work differ from previous ones? For instance, Bemis & Murcko, J. Med. Chem. 1996 found out that 32 scaffolds represented all compounds of a relatively small database of 5120 drugs. Does the present study also find these scaffolds? Kerber, et al. MATCH 2005 54, 301-312 compare the set of potential (in silico) vs. known (Beilstein) organic compounds and found that most in silico compounds have yet to be synthesized. Fink & Reymond, J. Chem. Inf. Model., 2007 analyze the GDB database with principal component analysis and neural networks for several physicochemical properties. Are any of the conclusions found in the above papers confirmed (or contradicted) in the present manuscript?

| Regarding the Bemis and Murcko paper, we did not use atom-type scaffolds like
| they did.
| However, on converting their scaffolds to our scaffold topologies,
| we indeed find that their results are very similar to ours, and we have
| incorporated this analysis near the end of section 3. We thank Reviewer 4
| for the suggestion to look more deeply at this article. Kerber et al. note
| that "the overwhelming majority of organic compounds is unknown"; this is
| certainly our observation as well and we have made this explicit in the
| conclusions. As for Fink et al., we use GDB in our comparison; it is however
| impossible to perform the same type of multivariate analyses (such as PCA or
| NN) since our topological scaffolds cannot be mapped by traditional chemical
| descriptors.

3) It would be worth discussing further the topologies that are predicted theoretically but not present in specific chemical databases (such as topology 17).

| We have added a paragraph at the end of Section 3 discussing some features
| of topologies missing from the chemical databases and made explicit how few
| topologies are actually present of those possible theoretically.

4) Are there specific topologies that are clearly missing from the natural products database (DNP) but present in other databases? Same goes with biologically active molecules.

| We have rewritten many of the conclusions to clarify and extend our findings.
| To answer the reviewer's questions, we found that DNP was depopulated with
| respect to other databases for topologies with multiple rings emanating from
| a central vertex or vertices (topology numbers 9 and 31-33; see the
| conclusions), but there are no clear trends we have discovered so far of other
| topologies missing in the biologically oriented databases that are present
| elsewhere. However, another reviewer suggested that we look at a
| non-druglike database. We selected DSSTox and found that 4-nodes were almost
| entirely absent and we have incorporated this finding in the paper.
5) The algorithm used to uniquely identify scaffolds (the return index) is described in the first paper of the series. In the current paper the algorithm is related to the characteristic polynomial of molecular graphs; could the authors elaborate?

6) The return index algorithm is correct up to eight rings (according to the first paper). The databases studied in this paper contain compounds with more than eight rings (cf. Tables 4); how were these scaffolds canonized?

| We added the following footnote to Table 3 (the only place where we consider
| topologies of more than 8 rings):
|
|     Since the return-index is not guaranteed to completely distinguish
|     scaffold topologies for r > 8, the numbers presented in this table
|     generally are lower bounds; however, we do believe them to be good
|     estimates as we employed additional strategies for >8-ring
|     structures to help provide further resolution, such as computing
|     multiple return-indices using different values in the adjacency
|     matrix to represent loops. In addition, the total number of
|     topologies for each database with r > 8 was small: < 0.62%
|     except for DNP (3.68%) and PC actives (1.33%), both small databases.

-------------------------------------------------------------------------------

Reviewer: 5
Recommendation: Publish after minor revisions noted.

| We have made changes in the text relevant to all the comments below.

Comments:
On page 2, line 5: strike "is" from "... space is was recently..."

In Table 2, the numbers following "scaffolds" should refer to Figure 2(a) (so the reader doesn't first consider them as counts).

In Table 3 and various text, be consistent using "PC actives" or "PubChem actives". Also, referring to just "PubChem" may be confusing.

| We now consistently refer to PC actives.

On page 4, line 39, "safely" is not the right word in "... safely ignoring higher degree nodes". Consider using "Therefore, we ignored higher degree nodes..." as the last sentence of the paragraph.
On page 12, line 45: Consider using "...other databases have a fraction of possibilities", or "other databases do not" instead of "...other databases are incomplete".

On page 15, line 42: "... more than do known chemicals" is awkward. Try starting the sentence, "Compared to other databases, ..."

On page 18 (top): The GENSMI algorithm did not "compare every new SMILES to all other with a given class". Actually, a hash table indexing scheme was used to avoid making all comparisons. Complexity was O(n) but actual performance was constant due to the large hash table size.

| We have deleted the incorrect phrase and thank the reviewer for the
| clarification.