Building the NCI DIS 3D Database

Daniel W. Zaharevitz, Frederick Biomedical Supercomputing Center, Developmental Therapeutics Program, PRI/DynCorp

Extraction of 2D Information from the DIS

Procedures

The first step in building the 3D database is to extract the connection table information from the DIS Chemistry Database. This begins with creating a DIS file containing the NSC's of interest (see the DIS manual). Unfortunately, the standard DIS save file format won't work for this application. Instead, run DIS and make the file active. While DIS is running, use another window to move to the WORK directory. In that directory you will find a file called Z***nn.TMP, where nn is the number of the DIS file of interest. Copy the Z***nn.TMP file to another directory. For a database called name, I usually call this file name.NSC.

This file is then used as input to the NLMMAN program supplied by Fein-Marquart Associates. The output of this program is an ASCII file in a format that, as far as I can tell, is no longer widely used. A program is available that reformats this file into MDL's SDFile format (Dalby, A., et al. J. Chem. Inf. Comput. Sci. 32:244-255 (1992)). This reformatting program is called DIS_CHEMX and is based on a program given to the NCI by Kevin Haracki of American Cyanamid Company, Pearl River, NY 10965. It was heavily modified for NCI use by Scott Pelfrey and Daniel Zaharevitz of the DTP Computing Center. The source for this program resides in the directory U2:[SANSS] on the DTP VAX.

In the reformatting process, compounds with missing or incomplete connection tables are eliminated. Compounds with elements that the builders are unable to handle are also eliminated. In the first pass the allowed elements were H, C, O, N, S, P, F, Cl, Br, I, and Si. As described below, further work has allowed Pt to be added to this list. A command file, NLMBAT.COM, has been written to run these procedures. Note that it requires READALL privileges, so this phase of database building can only be done by system staff.
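The elimination rules applied during reformatting can be illustrated with a minimal Python sketch. This is not the DIS_CHEMX source. The record layout (a dict with 'atoms' and 'bonds'), the function name classify, and the test used here for an "incomplete" connection table (a bond that points at a nonexistent atom) are all assumptions made for illustration only. The element list is the one given in the text, with Pt included.

```python
# Elements the builders can handle: the first pass list from the text, plus Pt.
ALLOWED_ELEMENTS = {"H", "C", "O", "N", "S", "P", "F", "Cl", "Br", "I", "Si", "Pt"}


def classify(record):
    """Return the reason a record is rejected, or None if it is usable.

    `record` is a hypothetical dict with:
      'atoms': list of element symbols
      'bonds': list of (i, j) pairs of 0-based atom indices
    """
    atoms = record.get("atoms")
    bonds = record.get("bonds")
    if not atoms:
        return "no connection table"
    n = len(atoms)
    # Assumed definition of "incomplete": bonds missing, or a bond that
    # refers to an atom index outside the atom list.
    if bonds is None or any(not (0 <= i < n and 0 <= j < n) for i, j in bonds):
        return "incomplete connection table"
    if any(symbol not in ALLOWED_ELEMENTS for symbol in atoms):
        return "excluded atom"
    return None


# Usage: a carbon-oxygen fragment passes, an iron compound is excluded.
ok = classify({"atoms": ["C", "O"], "bonds": [(0, 1)]})
rejected = classify({"atoms": ["C", "Fe"], "bonds": [(0, 1)]})
```

The checks are ordered so that each rejected record is counted under exactly one reason, which is how the tallies in the tables below are organized.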

First Pass

Our current main database started with the 452,307 compounds registered as of June 10, 1992. Of these, 218,322 are proprietary compounds. The results of the reformatting are given in the table.


                              open compounds   proprietary compounds     total

total                                 233985                  218322    452307
no connection table                     4686                    4430      9116
incomplete connection table             1669                    2243      3912
excluded atom                          11963                    4671     16634
other errors                              60                      55       115
valid connection table                215607                  206923    422530

Note that the biggest loss at this step is due to compounds with excluded atoms (3.7% of all NSC's). Missing or incomplete connection tables are also a significant cause of loss (2.9%). Overall, 93.4% of all NSC's had a connection table that was valid for 2D to 3D conversion.
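The percentages quoted above follow directly from the totals in the first pass table. The short Python check below reproduces them; the dictionary keys and the helper name percent are chosen here for illustration, while the counts are taken from the table.

```python
# Totals (open + proprietary) from the first pass table.
first_pass = {
    "total": 452307,
    "no connection table": 9116,
    "incomplete connection table": 3912,
    "excluded atom": 16634,
    "other errors": 115,
    "valid connection table": 422530,
}


def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


total = first_pass["total"]
excluded_pct = percent(first_pass["excluded atom"], total)
missing_pct = percent(
    first_pass["no connection table"] + first_pass["incomplete connection table"],
    total,
)
valid_pct = percent(first_pass["valid connection table"], total)

# Consistency check: the four loss categories plus the valid count
# should account for every registered compound.
losses = sum(
    count
    for name, count in first_pass.items()
    if name not in ("total", "valid connection table")
)
consistent = (total - losses) == first_pass["valid connection table"]

print(excluded_pct, missing_pct, valid_pct, consistent)
```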

Update

The update to the database started with 17229 compounds that had been registered between June 11, 1992 and May 17, 1994. For this update, structures with platinum atoms were not excluded. No other changes were made to the reformatting procedure described above. The results of this reformatting are given in the following table.


                              open compounds   proprietary compounds     total

total                                   7698                    9531     17229
no connection table                      125                     102       227
incomplete connection table               13                      77        90
excluded atom                            247                     311       558
other errors                               0                       1         1
valid connection table                  7313                    9040     16353

The reformatting losses here are similar to those in the first pass: 3.2% of the total were lost due to excluded atoms and 1.8% due to missing or incomplete connection tables, while 94.9% of all NSC's had a connection table valid for 2D to 3D conversion.

Conversion of 2D structures to 3D structures

Procedures

The connection tables in the MDL SDFile format are read into Chem-X, and 2D to 3D conversion is done using the Chem-X builder. A Chem-X log file, loadmac.log, has been written to automate this procedure. This file is located in the $CHEMX_LOCAL_LOGS directory. We have found that the resulting database file occupies about 3.5 megabytes per 1000 compounds. After building the database, the log file writes out the list of successful builds in a .set file.
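The figure of about 3.5 megabytes per 1000 compounds makes it easy to budget disk space before a build. The following Python sketch simply scales that figure. The function name and the example count of 100,000 structures are illustrative and do not refer to any particular database.

```python
MB_PER_1000_COMPOUNDS = 3.5  # approximate figure quoted in the text


def estimated_db_megabytes(n_compounds):
    """Rough database size in megabytes for `n_compounds` built structures."""
    return n_compounds / 1000 * MB_PER_1000_COMPOUNDS


# Illustrative only: a hypothetical database of 100,000 structures.
example_mb = estimated_db_megabytes(100_000)
print(example_mb)
```

Because the per-compound figure is an observed average, the result should be read as an estimate rather than an exact file size.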

First Pass

The connection tables were read into Chem-X and 3D coordinates were generated using the Chem-X builder (July '92 version). The default set of builder fragments was used, but the maximum allowable bond angle deviation was increased to 60 degrees to allow 3-membered rings to pass the build quality check. Also, a maximum of 4 CPU seconds was allowed per structure. Overall, 3D coordinates could be successfully generated from 96.5% of the valid connection tables (407912 structures). The majority of the failures were due to compounds that had rings larger than 7 members (the largest ring represented in the builder fragment library). An important error that does not show up in this statistic is incorrect stereochemistry. Many compounds in the DIS do not have their stereochemistry specified, and for those that do, we haven't found a way to use the information to set the atom and bond centered parity fields in the SDFile. The result is that the configuration at stereochemical centers is arbitrary.

Update

Although successfully building a 3D structure for over 90% of the NSC's in the DIS is encouraging, there is room for improvement. The largest single cause of failure was the inability of the builder to handle organometallic compounds. One possible way to attack this problem is to add atom types to the Chem-X parameterization file and add builder fragments to the Chem-X fragment database. Xinjian Yan in Bill Milne's group in the Laboratory of Medicinal Chemistry investigated the feasibility of this approach with respect to platinum compounds (the largest single group of organometallics in the DIS). A hexavalent platinum and a tetravalent platinum atom type were added to the parameterization. He looked for platinum compounds in the Cambridge Structural Database and used these geometries to generate 136 new platinum-containing builder fragments. With these fragments in place, 3D structures could be generated for approximately 80% of the platinum compounds in the DIS. Although this is a reasonable result, it was felt that the effort required to extend these results to other metal atoms was greater than the benefit at this time.

The other major cause of failure was the inability of the builder to handle compounds with large rings and, to some extent, compounds with complicated ring fusions. One simple modification to Chem-X that had some impact was to increase the maximum allowed CPU time for a build to 45 seconds. Another possibility is to add more builder fragments as described above. The disadvantage, of course, is that unless many compounds can be generated from a single fragment, this method is very inefficient. One advantage, however, is that the builder fragments can be constructed from high quality X-ray structures with the proper stereochemistry. Daniel Zaharevitz has added 22 builder fragments to the fragment database, including compounds such as phorbol, brefeldin, and cyclosporin. The overall impact of these additions is small in terms of numbers (approximately 200 failures fixed), but it does ensure that some classes of compounds of special interest to the program are correctly built. The more general method for handling this kind of Chem-X build failure is to use other build programs. We have CONCORD, Version 3.0. CONCORD has generalized ring closure methods that can handle rings with up to 25 members. With the modified Chem-X and CONCORD 3.0, 16250 3D structures were successfully generated from the 16353 valid connection tables in the update. This is 99.4% of the valid connection tables and 94.3% of all the NSC's in the update.
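The idea of falling back to a second build program when the first one fails can be sketched in Python as follows. Neither Chem-X nor CONCORD is actually called here: toy_chemx and toy_concord are toy stand-ins that look only at a hypothetical 'largest_ring' field, using the ring size limits mentioned in the text (7 for the original Chem-X fragment library, 25 for CONCORD). The sketch shows only the control flow, under those assumptions.

```python
def build_with_fallback(record, builders):
    """Try each (name, build_fn) pair in order.

    Returns (name, coords) for the first builder that succeeds,
    or (None, None) if every builder fails. A builder signals
    failure by returning None.
    """
    for name, build in builders:
        coords = build(record)
        if coords is not None:
            return name, coords
    return None, None


# Toy stand-ins for the real programs (hypothetical, for illustration only).
def toy_chemx(record):
    return "3D coordinates" if record["largest_ring"] <= 7 else None


def toy_concord(record):
    return "3D coordinates" if record["largest_ring"] <= 25 else None


BUILDERS = [("chemx", toy_chemx), ("concord", toy_concord)]

# Usage: a 12-membered ring fails the first builder and is passed to the second.
used, coords = build_with_fallback({"largest_ring": 12}, BUILDERS)
```

Keeping the builders in an ordered list means the preferred program is always tried first, and structures that no builder can handle are reported rather than silently dropped.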

The "hidden" problem of incorrect stereochemistry remains. One possible way to mitigate the problem is to build alternatives for each compound. Of course, not all alternatives could be built, or the database would grow to an impossible size. From the distribution of stereochemical centers in the open database, it is estimated that if a maximum of 4 alternatives is built, the size of the database would increase by somewhat less than a factor of 2. This choice would mean that almost 90% of the database would be sure to have the correct stereochemistry represented, at the cost of adding a number of structures with incorrect stereochemistry. It has not yet been decided whether this is worth doubling the size of the database (and the CPU time to key and search it).
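The trade-off can be made concrete with a small Python calculation. A compound with n unspecified stereocenters has up to 2^n configurations, so a cap of 4 alternatives fully covers compounds with at most 2 centers. The distribution used below is a made-up placeholder chosen only to show the arithmetic; it is not the distribution measured in the open database, and the function names are likewise illustrative.

```python
def alternatives_built(n_centers, cap=4):
    """Number of stereo alternatives built for a compound, limited by `cap`."""
    return min(2 ** n_centers, cap)


def growth_and_coverage(distribution, cap=4):
    """Estimate database growth and guaranteed-correct coverage.

    `distribution` maps number of stereocenters -> fraction of compounds.
    Returns (growth_factor, covered_fraction), where covered_fraction is the
    share of compounds whose every configuration fits within the cap.
    """
    growth = sum(
        frac * alternatives_built(n, cap) for n, frac in distribution.items()
    )
    coverage = sum(frac for n, frac in distribution.items() if 2 ** n <= cap)
    return growth, coverage


# PLACEHOLDER distribution (invented for illustration, not NCI data).
placeholder = {0: 0.60, 1: 0.20, 2: 0.09, 3: 0.07, 4: 0.04}
growth, coverage = growth_and_coverage(placeholder)
```

The same function can be rerun with a different cap to see how quickly the database would grow if more alternatives per compound were allowed.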

The Pre-Key Step

Procedures

The pre-key step requires the presence of a .pcmd file. This file can be created from the .set file for the database by running the shell script addpcmd.csh. The .pcmd file is just a list of calls to the log file seg_noc.log, one for each segment. This log file is called from prekey.log, which opens the database. There are other ways to automate this procedure in Chem-X, such as using PCL, but the method used allows for simple crash recovery: just edit the .pcmd file to exclude those segments that have been successfully completed and restart.
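The crash recovery edit described above can also be done mechanically. The Python sketch below treats each line of the command file as opaque text and assumes only that the segment name is the last whitespace-separated token on the line. That layout, the placeholder command text "run-segment", and the function name are assumptions for illustration; they are not actual Chem-X syntax.

```python
def remaining_commands(pcmd_lines, completed_segments):
    """Return the command lines whose segment has not yet been completed.

    Assumes (hypothetically) that the segment name is the last
    whitespace-separated token on each line. Blank lines are kept.
    """
    done = set(completed_segments)
    kept = []
    for line in pcmd_lines:
        tokens = line.split()
        if tokens and tokens[-1] in done:
            continue  # this segment finished before the crash; skip it
        kept.append(line)
    return kept


# Usage with placeholder command text (not real Chem-X commands).
lines = ["run-segment seg001", "run-segment seg002", "run-segment seg003"]
todo = remaining_commands(lines, ["seg001", "seg002"])
```

Writing the filtered lines back to a new command file and restarting then resumes the run at the first unfinished segment.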

First Pass

The pre-key step was not used in the first pass build of the database.

Update

The motivation for the pre-key step and results of its use are discussed in the following section.

Conformational Keying

Procedures

Like the pre-key step, the conformational keying step is driven by a file which contains a call to a log file for each segment to be keyed. This .cmd file can be generated by the script addcmd.csh. The top level log file, which opens the database and calls the .cmd file, is called 3dkey.log. The log file which does the actual keying for each segment is generate_3d_keys.log.

First Pass

The database was conformationally keyed using a modification of the default Chem-X rule-based procedure. The maximum number of rotatable bonds was increased to 15 and the maximum number of conformations was increased to 1,600,000. These limits eliminated approximately 3% of the database from consideration. A time limit of 5 CPU minutes was also included. Keying test databases indicated that 8-12% of the structures had no conformations that satisfied the conformational rules. Examination of these structures revealed a large number that had been built with crowded aromatic rings. It was found that increasing the number of search points for the CONJUGATED bond type to 6 from the default of 2 solved the majority of these errors. Rather than have all structures searched with the increased number of points, a check was put into the keying script to test whether no conformations were accepted with the default number of points. If so, the number of points for the CONJUGATED bond type was increased and the keying redone. A database field was then set to indicate that the number of points had been increased. With this scheme in effect, only about 3-4% of the structures in the database resulted in no acceptable conformations. The keying of the 407912 structures took approximately 4 CPU months on a VAX 9000 and an SGI Indigo workstation.
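The retry scheme described above (key with the default number of CONJUGATED search points, and redo the keying with more points only if nothing was accepted) can be sketched in Python. The actual logic lives in the Chem-X keying script. Here key_fn is a hypothetical callback that returns the number of accepted conformations, and toy_key_fn is a toy stand-in used only to exercise the control flow.

```python
DEFAULT_POINTS = 2    # default search points for the CONJUGATED bond type
INCREASED_POINTS = 6  # value used when the default yields no conformations


def key_with_retry(structure, key_fn):
    """Key a structure, retrying with more CONJUGATED points if needed.

    `key_fn(structure, conjugated_points)` is a hypothetical callback that
    returns the number of accepted conformations.
    """
    accepted = key_fn(structure, DEFAULT_POINTS)
    points_increased = False
    if accepted == 0:
        accepted = key_fn(structure, INCREASED_POINTS)
        points_increased = True  # mirrors the database field set in the text
    return {"accepted": accepted, "points_increased": points_increased}


# Toy stand-in: a structure yields conformations only when given at least
# as many points as its (invented) 'needs_points' value.
def toy_key_fn(structure, conjugated_points):
    return 5 if conjugated_points >= structure["needs_points"] else 0


easy = key_with_retry({"needs_points": 2}, toy_key_fn)
crowded = key_with_retry({"needs_points": 6}, toy_key_fn)
```

Only the structures that fail with the default setting pay the cost of the finer search, which is the point of the scheme.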

Update

We first looked at methods to enable us to handle very flexible molecules. There is a /RANDOM n switch in the conformation generation setup that tells the program to try to find n acceptable conformations at random in the search space instead of performing a systematic search of the space. Our plan was to use a random search when the number of conformations exceeded a set value (we are now using a limit of 2,000,000). We found two major problems with this: 1) a few structures would get stuck in a seemingly infinite loop and would not be cut off by the time limit, and 2) when searching, molecules that had random conformations generated would badly distort to fit the query. Chemical Design fixed these bugs with the Oct '93 release of Chem-X, and the random search strategy for very flexible compounds now seems to work acceptably well. Chemical Design has also introduced a query directed flexible fit search option that should work well for very flexible molecules, but we have not yet evaluated this option.
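The decision between a systematic and a random search can be illustrated with a short Python sketch. It assumes that the size of the systematic search is the product of the number of search points on each rotatable bond, which is the usual way a grid search scales. The function name and return values are illustrative, and Chem-X itself is not called.

```python
from math import prod

CONFORMATION_LIMIT = 2_000_000  # limit quoted in the text


def search_mode(points_per_bond, limit=CONFORMATION_LIMIT):
    """Choose a search strategy from the size of the systematic search.

    `points_per_bond` lists the number of search points for each rotatable
    bond. Returns (mode, n_conformations), where mode is 'random' when the
    systematic count exceeds `limit` and 'systematic' otherwise.
    """
    n_conformations = prod(points_per_bond)
    mode = "random" if n_conformations > limit else "systematic"
    return mode, n_conformations


# Usage: eight bonds at 6 points each stay under the limit, nine do not.
mode_8, count_8 = search_mode([6] * 8)
mode_9, count_9 = search_mode([6] * 9)
```

The count grows exponentially with the number of rotatable bonds, so a single extra bond is often enough to push a molecule over the limit.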

An analysis of our first large scale search (Wang, S., et al. submitted to J. Med. Chem.) showed that a number of false positive hits were due to structures that only satisfied the query in a high energy conformation. The presence of high energy hits is not surprising because the only criterion for judging the acceptability of a conformation was a simple set of rules. For a large database it isn't practical to do a full energy calculation for every conformation generated. The Chem-X program provides a compromise in the form of a simple "bump-check", which simply calculates the number of atom pairs that have overlapping VdW radii [Chem-X manual, Section 14.4]. This is only done on conformations that are accepted by the conformational rules. We wanted to see if implementing this type of energy calculation would cut down on the number of high energy hits while leaving us with a search fast enough to be practical for our large database. We tested this by altering the search setup and searching a small database of structures that had been subjected to a conformational search with MM energy calculation.
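A bump check of this general kind is simple to write down. The Python sketch below counts atom pairs whose separation is less than the sum of their van der Waals radii. It is not the Chem-X implementation, whose exact criterion is described only in its manual. The choice to skip directly bonded pairs, the optional scale factor, and the function name are assumptions made here. The radii are the commonly used Bondi values in angstroms.

```python
from itertools import combinations
from math import dist

# Bondi van der Waals radii in angstroms.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}


def bump_count(atoms, bonded=frozenset(), scale=1.0):
    """Count atom pairs with overlapping van der Waals spheres.

    `atoms`  : list of (element, (x, y, z)) tuples
    `bonded` : set of frozenset({i, j}) index pairs to skip (assumption:
               directly bonded atoms always overlap and are not bumps)
    `scale`  : optional factor applied to the sum of radii
    """
    count = 0
    for (i, (elem_i, pos_i)), (j, (elem_j, pos_j)) in combinations(
        enumerate(atoms), 2
    ):
        if frozenset((i, j)) in bonded:
            continue
        if dist(pos_i, pos_j) < scale * (VDW_RADII[elem_i] + VDW_RADII[elem_j]):
            count += 1
    return count


# Usage: two non-bonded carbons 2.0 angstroms apart count as one bump.
close_pair = [("C", (0.0, 0.0, 0.0)), ("C", (2.0, 0.0, 0.0))]
bumps = bump_count(close_pair)
```

A practical version would probably also skip atoms separated by two bonds, since these are always closer than the sum of their radii; that refinement is omitted to keep the sketch short.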

We found that simply turning the bump-check calculation on was not satisfactory. Although it did cut down on the number of high energy hits, many low energy hits were also missed. We felt that the reason for this, as well as for the problem with molecules that had no acceptable conformations, was that the structure built by the Chem-X builder was a poor starting point for the conformational search. In order to get a better starting point we instituted a pre-conformational keying step. The aim in this step was to find the most open (within a fairly small search range) starting conformation for the conformational keying step. In the conformational keying step the number of points around the SINGLE, ALPHA, and CONJUGATED bond types was increased to six. With these changes in place, we searched the small database of structures that had been modelled and found 6 hits out of the 7 structures (86%) with good low energy fits to the query (RMS < 0.4), 8 hits out of the 19 structures (42%) with moderate low energy fits (0.4 < RMS < 1.0), and 2 hits out of the 9 structures (22%) with poor low energy fits (RMS > 1.0). Thus this search strategy, at least for this set of compounds, is able to reduce the number of high energy hits while retaining almost all the good low energy hits.

We used this setup to key the updated database (see generate_3d_keys.log). With the maximum number of rotatable bonds increased to 100, we lost only 0.04% of the database to this limit. We also found that only 0.5% of the molecules in the database now had no acceptable conformations.

We estimate that the CPU time for conformationally keying our entire database with this search strategy would increase by about a factor of five. Because the pre-key step only needs to be done once, and because, with the contacted conformations excluded, the key search step becomes somewhat more selective, the search time would only increase by about a factor of two or three. This means that a typical search of the entire database would take 6-8 days of CPU time to complete.