Daniel W. Zaharevitz, Frederick Biomedical Supercomputing Center, Developmental Therapeutics Program, PRI/DynCorp
                               open compounds   proprietary compounds    total
total                                  233985                  218322   452307
no connection table                      4686                    4430     9116
incomplete connection table              1669                    2243     3912
excluded atom                           11963                    4671    16634
other errors                               60                      55      115
valid connection table                 215607                  206923   422530

Note that the biggest loss at this step is due to compounds with excluded atoms ( 3.7% of all NSC's ). Missing or incomplete connection tables is also a significant cause of loss ( 2.9% ). Overall, 93.4% of all NSC's had a connection table that was valid for 2D to 3D conversion.
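The percentages quoted for this first pass can be rechecked from the counts themselves. The following is a minimal Python sketch of that arithmetic; the dictionary layout and the function name `percent_of_total` are illustrative choices, not part of the original processing scripts.

```python
# Counts copied from the first-pass table: (open, proprietary).
counts = {
    "total": (233985, 218322),
    "no connection table": (4686, 4430),
    "incomplete connection table": (1669, 2243),
    "excluded atom": (11963, 4671),
    "other errors": (60, 55),
    "valid connection table": (215607, 206923),
}


def percent_of_total(category):
    """Percentage of all NSC's (open + proprietary) that fall in a category."""
    return 100.0 * sum(counts[category]) / sum(counts["total"])


excluded = percent_of_total("excluded atom")
missing_or_incomplete = (percent_of_total("no connection table")
                         + percent_of_total("incomplete connection table"))
valid = percent_of_total("valid connection table")

# Sanity check: every compound is either rejected for one reason or valid.
accounted = sum(sum(v) for k, v in counts.items() if k != "total")

print(f"excluded atom:         {excluded:.1f}%")
print(f"missing or incomplete: {missing_or_incomplete:.1f}%")
print(f"valid:                 {valid:.1f}%")
```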
                            open   proprietary   total
total                       7698          9531   17229
no connection table          125           102     227
incomplete conn. table        13            77      90
excluded atom                247           311     558
other errors                   0             1       1
valid connection table      7313          9040   16353

The reformatting losses here are similar to the first pass; 3.2% of total lost due to excluded atoms, 1.8% due to missing or incomplete connection tables, and 94.9% of all NSC's had a connection table valid for 2D to 3D conversion.
The other major cause of failure was the inability of the builder to handle compounds with large rings and, to some extent, compounds with complicated ring fusions. One simple modification to Chem-X that had some impact was to increase the maximum allowed CPU time for a build to 45 seconds. Another possibility is to add more builder fragments as described above. The disadvantage, of course, is that unless many compounds can be generated from a single fragment, this method is very inefficient. One advantage, however, is that the builder fragments can be constructed from high quality x-ray structures with the proper stereochemistry. Daniel Zaharevitz has added 22 builder fragments to the fragment database, including compounds such as phorbol, brefeldin, and cyclosporin. The overall impact of these additions is small in terms of numbers (approximately 200 failures fixed), but it does ensure that some classes of compounds of special interest to the program are correctly built. The more general method for handling this kind of Chem-X build failure is to use other build programs. We have CONCORD, Version 3.0. CONCORD has generalized ring closure methods that can handle rings with up to 25 members. With the modified Chem-X and CONCORD 3.0, 16250 3D structures were successfully generated from the 16353 valid connection tables in the update. This is 99.4% of the valid connection tables and 94.3% of all the NSC's in the update.
The "hidden" problem of incorrect stereochemistry remains. One possible way to mitigate the problem is to build alternatives for each compound. Of course, not all alternatives could be built, or the database would grow to an impossible size. From the distribution of stereochemical centers in the open database, it is estimated that if a maximum of 4 alternatives is built, the size of the database would increase by somewhat less than a factor of 2. This choice would mean that almost 90% of the database would be sure to have the correct stereochemistry represented, at the cost of adding a number of structures with incorrect stereochemistry. It has not yet been decided whether this is worth doubling the size of ( and the CPU time to key and search ) the database.
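The trade-off described here can be sketched as a short calculation. The Python below assumes that a compound with n stereochemical centers has at most 2^n stereoisomers (symmetry can reduce this), and that the policy builds min(2^n, cap) structures per compound; the report does not state how compounds above the cap are handled, so that rule is an assumption. The distribution passed in is made up purely to exercise the function and is not the distribution of the open database.

```python
def enumerate_capped(stereocenter_counts, cap=4):
    """Estimate database growth when stereo alternatives are capped.

    stereocenter_counts maps number of stereocenters -> number of compounds.
    Returns (growth_factor, fraction_fully_enumerated).
    """
    total = sum(stereocenter_counts.values())
    built = 0   # structures that would be stored
    fully = 0   # compounds whose every alternative fits under the cap
    for n_centers, n_compounds in stereocenter_counts.items():
        n_isomers = 2 ** n_centers          # upper bound on stereoisomers
        built += n_compounds * min(n_isomers, cap)
        if n_isomers <= cap:
            fully += n_compounds
    return built / total, fully / total


# Hypothetical distribution (compounds per 100), NOT data from the report.
hypothetical = {0: 70, 1: 15, 2: 5, 3: 6, 6: 4}
growth, covered = enumerate_capped(hypothetical, cap=4)
print(f"growth factor {growth:.2f}, fully enumerated {covered:.0%}")
```

With the real distribution of stereochemical centers substituted for the hypothetical one, the same function would give the growth factor and coverage for any choice of cap.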
Like the prekey step, the conformational keying step is driven by a file which contains a call to a log file for each segment to be keyed. This .cmd file can be generated by the script addcmd.csh. The top level log file, which opens the database and calls the .cmd file, is called 3dkey.log. The log file which does the actual keying for each segment is generate_3d_keys.log.
The database was conformationally keyed using a modification of the default Chem-X rule-based procedure. The maximum number of rotatable bonds was increased to 15 and the maximum number of conformations was increased to 1,600,000. These limits eliminated approximately 3% of the database from consideration. A time limit of 5 CPU minutes was also included. Keying test databases indicated that 8-12% of the structures had no conformations that satisfied the conformational rules. Examination of these structures revealed a large number that had been built with crowded aromatic rings. It was found that increasing the number of search points for the CONJUGATED bond type to 6, from the default of 2, solved the majority of these errors. Rather than have all structures searched with the increased number of points, a check was put into the keying script to test whether no conformations were accepted with the default number of points. If so, the number of points for the CONJUGATED bond type was increased and the keying redone. A database field was then set to indicate that the number of points had been increased. With this scheme in effect, only about 3-4% of the structures in the database resulted in no acceptable conformations. The keying of the 407912 structures took approximately 4 CPU months on a VAX 9000 and an SGI Indigo workstation.
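The retry scheme for the CONJUGATED bond type can be sketched as follows. This is Python written for illustration, not the Chem-X keying script: `key_with_fallback` and `toy_count_accepted` are hypothetical names, and the toy function simply stands in for a real keying run that reports how many conformations passed the rules. Only the point counts 2 and 6 come from the text.

```python
from typing import Callable, Tuple

DEFAULT_CONJUGATED_POINTS = 2    # default stated in the text
INCREASED_CONJUGATED_POINTS = 6  # increased value stated in the text


def key_with_fallback(
    structure: str,
    count_accepted: Callable[[str, int], int],
) -> Tuple[int, bool]:
    """Return (accepted conformations, points_increased flag)."""
    accepted = count_accepted(structure, DEFAULT_CONJUGATED_POINTS)
    if accepted > 0:
        return accepted, False
    # Nothing passed the rules with the default points: redo the keying
    # with more points and record that this was done.
    accepted = count_accepted(structure, INCREASED_CONJUGATED_POINTS)
    return accepted, True


def toy_count_accepted(structure: str, conjugated_points: int) -> int:
    """Made-up stand-in for a keying run (hypothetical behaviour)."""
    if structure == "crowded":
        return 0 if conjugated_points < 6 else 12
    if structure == "hopeless":
        return 0
    return 40


results = {name: key_with_fallback(name, toy_count_accepted)
           for name in ("ordinary", "crowded", "hopeless")}
for name, (accepted, increased) in results.items():
    print(name, accepted, increased)
```

The point of the design is cost: only structures that fail with the default setting pay for the finer search, and the flag preserves which setting produced the keys.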
We first looked at methods to enable us to handle very flexible molecules. There is a /RANDOM n switch in the conformation generation setup that tells the program to try to find n acceptable conformations at random in the search space instead of performing a systematic search of the space. Our plan was to use a random search when the number of conformations exceeded a set value ( we are now using a limit of 2,000,000 ). We found two major problems with this: 1) a few structures would get stuck in a seemingly infinite loop and would not be cut off by the time limit, and 2) when searching, molecules that had random conformations generated would badly distort to fit the query. Chemical Design fixed these bugs with the Oct '93 release of Chem-X, and the random search strategy for very flexible compounds now seems to work acceptably well. Chemical Design has also introduced a query directed flexible fit search option that should work well for very flexible molecules, but we have not yet evaluated this option.
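The switch between systematic and random searching can be illustrated with a small Python sketch. It assumes that a systematic search visits every combination of torsion settings, so its size is the product of the number of points on each rotatable bond. The function names and the per-bond point counts in the example are illustrative; only the 2,000,000 limit comes from the text.

```python
from math import prod

RANDOM_SEARCH_THRESHOLD = 2_000_000  # limit quoted in the text


def systematic_conformation_count(points_per_bond):
    """Grid size of a systematic search: product of points over all bonds."""
    return prod(points_per_bond)


def choose_search(points_per_bond, threshold=RANDOM_SEARCH_THRESHOLD):
    """Pick 'systematic' or 'random' from the size of the search space."""
    n = systematic_conformation_count(points_per_bond)
    strategy = "random" if n > threshold else "systematic"
    return strategy, n


# Illustrative molecules with 3 points on every rotatable bond.
print(choose_search([3] * 13))
print(choose_search([3] * 14))
```

Because the grid size grows exponentially with the number of rotatable bonds, one extra bond is enough to move a molecule from one regime to the other.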
An analysis of our first large scale search ( Wang, S., et al., submitted to J. Med. Chem. ) showed that a number of false positive hits were due to structures that only satisfied the query in a high energy conformation. The presence of high energy hits is not surprising because the only criterion for judging the acceptability of a conformation was a simple set of rules. For a large database it isn't practical to do a full energy calculation for every conformation generated. The Chem-X program provides a compromise in the form of a simple "bump-check", which simply calculates the number of atom pairs that have overlapping VdW radii [Chem-X manual, Section 14.4]. This is only done on conformations that are accepted by the conformational rules. We wanted to see if implementing this type of energy calculation would cut down on the number of high energy hits while leaving us with a search fast enough to be practical for our large database. We tested this by altering the search setup and searching a small database of structures that had been subjected to a conformational search with MM energy calculation.
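The idea behind a bump-check can be shown in a few lines of Python. This is a generic sketch of counting atom pairs with overlapping VdW radii, not the Chem-X implementation. The radii are Bondi values in angstroms, the `scale` factor is hypothetical, and leaving out directly bonded pairs (which always overlap) is an assumption about how such a check has to be set up.

```python
from itertools import combinations
from math import dist

# Bondi van der Waals radii in angstroms (assumed; Chem-X may use others).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}


def bump_count(atoms, bonded=frozenset(), scale=1.0):
    """Count atom pairs closer than the (scaled) sum of their VdW radii.

    atoms  : list of (element, (x, y, z)) tuples
    bonded : set of frozenset({i, j}) index pairs to skip
    """
    bumps = 0
    for (i, (el_i, xyz_i)), (j, (el_j, xyz_j)) in combinations(enumerate(atoms), 2):
        if frozenset((i, j)) in bonded:
            continue  # bonded atoms always overlap, so they are not bumps
        if dist(xyz_i, xyz_j) < scale * (VDW_RADII[el_i] + VDW_RADII[el_j]):
            bumps += 1
    return bumps


# Tiny made-up geometry: two bonded carbons, a close hydrogen, a far oxygen.
atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.54, 0.0, 0.0)),
    ("H", (2.0, 0.0, 0.0)),
    ("O", (10.0, 0.0, 0.0)),
]
print(bump_count(atoms, bonded={frozenset((0, 1))}))
```

A count like this is far cheaper than a molecular mechanics energy, which is why it is attractive as a filter on conformations that have already passed the rules.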
We found that simply turning the bump-check calculation on was not satisfactory. Although it did cut down on the number of high energy hits, many low energy hits were also missed. We felt that the reason for this, as well as the problem with molecules that had no acceptable conformations, was that the structure built by the Chem-X builder was a poor starting point for the conformational search. In order to get a better starting point we instituted a pre-conformational keying step. The aim in this step was to find the most open ( within a fairly small search range ) starting conformation for the conformational keying step. In the conformational keying step the number of points around the SINGLE, ALPHA, and CONJUGATED bond types was increased to six. With these changes in place, we searched the small database of structures that had been modelled and found 6 hits out of the 7 structures ( 86%) with good low energy fits to the query ( RMS < 0.4 ), 8 hits out of the 19 structures (42%) with moderate low energy fits ( 0.4 < RMS < 1.0 ) and 2 hits out of the 9 structures ( 22% ) with poor low energy fits ( RMS > 1.0 ). Thus this search strategy, at least for this set of compounds, is able to reduce the number of high energy hits while retaining almost all the good low energy hits.
We used this setup to key the updated database ( see generate_3d_keys.log ). With the maximum number of rotatable bonds increased to 100, we now only lost 0.04% of the database to this limit. We also found that now only 0.5% of the molecules in the database had no acceptable conformations.
We estimate that the CPU time for conformationally keying our entire database with this search strategy would increase by about a factor of five. Because the pre-key step only needs to be done once, and because with the contacted conformations excluded the key search step becomes somewhat more selective, the search time would only increase by about a factor of two or three. This means that a typical search of the entire database would take 6-8 days of CPU time to complete.