Data Structures

Molecules are represented as collections of organized data in computer memory. The organization and type of data used by a molecular modeling program is generically called a "data structure". The data structure in CHARMM, for example, is called the "Principal Structure File", in QUANTA the "molecular structure file", and in SYBYL the "molecule description".

Programs that perform calculations (e.g. multi-purpose modeling programs like Insight II) tend to have complex data structures capable of storing many different kinds of data, whereas programs with limited capabilities (e.g. graphics programs like RasMol) have correspondingly simple data structures.

How Data Structures are Organized

The organization of the data structure is hierarchical, and essentially mirrors the organization of a molecular structure. The fundamental organizational units of the data structure correspond to atoms, which may, in turn, be grouped into bonds, angles, and dihedrals. Atoms may also be grouped into substructures, such as monomers, subunits, secondary structure elements, and prosthetic groups. Substructure assignments may be performed automatically and/or manually, depending on the program and type of molecule. In many programs, the data structure can be partitioned to simultaneously accommodate two or more individual molecules.
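The hierarchy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (Atom, Bond, Substructure, Molecule, Workspace) and their fields are hypothetical and do not correspond to the internal layout of any particular program.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    index: int      # identification number of the atom
    name: str       # atom name, e.g. "H1"
    element: str    # element symbol, e.g. "H"


@dataclass
class Bond:
    i: int          # identification number of the first atom
    j: int          # identification number of the second atom


@dataclass
class Substructure:
    name: str                                         # e.g. a monomer name
    atom_indices: list = field(default_factory=list)  # member atoms


@dataclass
class Molecule:
    name: str
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)
    substructures: list = field(default_factory=list)


@dataclass
class Workspace:
    """Partitioned container that can hold two or more molecules at once."""
    molecules: dict = field(default_factory=dict)

    def add(self, molecule):
        self.molecules[molecule.name] = molecule


water = Molecule(
    name="water",
    atoms=[Atom(1, "O", "O"), Atom(2, "H1", "H"), Atom(3, "H2", "H")],
    bonds=[Bond(1, 2), Bond(1, 3)],
    substructures=[Substructure("HOH", [1, 2, 3])],
)
workspace = Workspace()
workspace.add(water)
```

Angles and dihedrals could be stored the same way as bonds, as triplets and quartets of atom identification numbers.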

Bookkeeping and User Interaction with the Data Structure

A bookkeeping system is used to uniquely identify the contents of the data structure and to track the presence and sources of data. Each organizational unit of data (e.g. atom, bond, monomer, substructure, etc.) is assigned an identification number and/or name at the time of input. The user can selectively operate on specific organizational units of the data structure by specifying their identifiers or via interaction with a graphical representation of the molecule(s), depending on the program. Examples of operations include color coding, labeling, and energy calculations.
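As a rough sketch of how identifiers support selective operations, the following Python fragment selects atoms by identifier and applies a color-coding operation to the selection. The record layout and the function names select and set_color are assumptions made for illustration only; real programs expose selection through their own command languages or graphical interfaces.

```python
# Hypothetical atom records; each carries an identification number and names.
atoms = [
    {"id": 1, "name": "N",  "residue": "GLY", "residue_id": 1},
    {"id": 2, "name": "CA", "residue": "GLY", "residue_id": 1},
    {"id": 3, "name": "N",  "residue": "ALA", "residue_id": 2},
    {"id": 4, "name": "CA", "residue": "ALA", "residue_id": 2},
    {"id": 5, "name": "CB", "residue": "ALA", "residue_id": 2},
]


def select(atoms, **criteria):
    """Return the atoms whose fields match every given identifier."""
    return [a for a in atoms if all(a.get(k) == v for k, v in criteria.items())]


def set_color(selection, color):
    """Example operation: tag each selected atom with a display color."""
    for atom in selection:
        atom["color"] = color


# Operate on one monomer by its identification number.
alanine_atoms = select(atoms, residue_id=2)
set_color(alanine_atoms, "red")

# Operate on all atoms that share an atom name.
alpha_carbons = select(atoms, name="CA")
```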

Data Structures for Biopolymers

For biopolymers, the dependence of the data structure on user-supplied and calculated data is kept to a minimum because of the size of these molecules. Instead, some of the data is predefined for each type of monomer and distributed with the software in the form of libraries or dictionaries. When oligomers are built, the data structure is automatically filled in from a user-specified sequence. When data for experimentally determined macromolecules (e.g. proteins) is input, the correspondence between its atoms and their defined groupings (e.g. sequence, backbone, sidechain, secondary structure element, subunit, etc.) can be automatically established. Furthermore, data that cannot be calculated for large molecules (e.g. partial atomic charges) can be automatically added to the data structure.
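The idea of filling in the data structure from a monomer library can be sketched as follows. This is a minimal illustration, not the mechanism of any specific program: the RESIDUE_LIBRARY layout and the build_from_sequence function are hypothetical, only backbone and beta-carbon atoms are listed, and the charge values are placeholders that are not taken from any force field.

```python
# Hypothetical monomer library: (atom name, element, partial charge).
# Charges are illustrative placeholders, not force-field values.
RESIDUE_LIBRARY = {
    "GLY": [("N", "N", -0.4), ("CA", "C", 0.0), ("C", "C", 0.5), ("O", "O", -0.5)],
    "ALA": [("N", "N", -0.4), ("CA", "C", 0.1), ("C", "C", 0.5), ("O", "O", -0.5),
            ("CB", "C", -0.1)],
}


def build_from_sequence(sequence):
    """Create atom records for each monomer named in the sequence."""
    atoms = []
    serial = 1
    for residue_id, residue_name in enumerate(sequence, start=1):
        if residue_name not in RESIDUE_LIBRARY:
            # A monomer outside the library cannot be recognized.
            raise KeyError(f"no library entry for monomer {residue_name!r}")
        for atom_name, element, charge in RESIDUE_LIBRARY[residue_name]:
            atoms.append({
                "id": serial,
                "name": atom_name,
                "element": element,
                "charge": charge,          # predefined, not calculated
                "residue": residue_name,
                "residue_id": residue_id,  # position in the sequence
            })
            serial += 1
    return atoms


peptide = build_from_sequence(["GLY", "ALA", "GLY"])
```

Because every atom record carries its monomer name and sequence position, groupings by sequence follow directly from the library lookup.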

A number of bookkeeping operations may be automatically performed. For example, sequence bookkeeping can facilitate insertions, deletions, and mutations. Bookkeeping of backbone and side chain torsional angles can facilitate conformational studies.

The major disadvantage of a specialized data structure for biopolymers is that the scope of the dictionary-based data determines the kinds of molecules that can be recognized. This can constrain the applicability of some programs to chemically standard biopolymers, and preclude atom-by-atom modifications (i.e. editing) that result in non-standard structures. Data for non-standard monomers can often be added to the dictionary when this problem arises, but this may be difficult in some programs. CHARMM allows biopolymers to be modified by the process of "patching".

Data

The contents of the data structure describe a specific molecule or collection of molecules. The contents of the data structure may change as operations are performed on the molecule(s), as computational results (e.g. partial atomic charges, atomic coordinates) are added, and/or as constants and parameters are input. Most programs safeguard the data structure, requiring user consent to change or add data.

Descriptions of chemical composition, geometry, and bonding are the core components of the data structures of all molecular modeling/graphics programs. Knowledge of chemical composition and bonding (chemical configuration) is a prerequisite for molecular modeling. Geometry information may be derived computationally using model building methods (small molecules and oligomers only) or experimentally using methods like X-ray crystallography and NMR spectroscopy.

Chemical Composition and Bonding

Chemical composition is, at the very least, specified by element name (e.g. C, H, O, etc.), and in many programs (particularly programs that perform molecular mechanics calculations), by hybridization state as well. The element name and hybridization are typically denoted together by an atom type code, the nature of which varies from program to program (e.g. 12, c=, c.sp2). Bonding (connectivity) may be user-specified or calculated from interatomic distances, depending on the program.
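One common way to calculate connectivity from interatomic distances is to treat two atoms as bonded when their separation is no greater than the sum of their covalent radii plus a tolerance. The sketch below assumes that rule; the function name perceive_bonds, the tolerance of 0.4 angstroms, and the approximate radii are assumptions for illustration, and individual programs use their own criteria.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms.
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # angstroms; an assumed margin, programs differ


def perceive_bonds(elements, coordinates, tolerance=TOLERANCE):
    """Return index pairs (i, j) of atoms that are close enough to be bonded."""
    bonds = []
    for i, j in combinations(range(len(elements)), 2):
        cutoff = COVALENT_RADII[elements[i]] + COVALENT_RADII[elements[j]] + tolerance
        if math.dist(coordinates[i], coordinates[j]) <= cutoff:
            bonds.append((i, j))
    return bonds


# Water: oxygen at the origin, O-H distance of about 0.96 angstroms.
elements = ["O", "H", "H"]
coordinates = [(0.0, 0.0, 0.0), (0.9572, 0.0, 0.0), (-0.2400, 0.9266, 0.0)]
bonds = perceive_bonds(elements, coordinates)
```

Here the two O-H pairs fall within the cutoff, while the two hydrogens, about 1.5 angstroms apart, do not.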

Geometry

Geometry is specified in the form of coordinates. The two major coordinate systems are:

Cartesian coordinates (x, y, z values for each atom).
Z-matrix or internal coordinates (bond lengths, bond angles, and dihedral angles for each bonded pair, triplet, and quartet of atoms, respectively).

In practice, most multi-purpose molecular modeling and molecular graphics programs require Cartesian coordinate input. Some molecular mechanics and quantum chemical programs require internal coordinates as input, whereas others accept either representation. For some purposes (notably energy minimization via a quantum method), direct input of internal coordinates can result in greater computational efficiency. Internal coordinates are usually input manually, which can be both tedious and tricky for non-trivial molecules. Aids for inputting Z-matrices are available.
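The relationship between the two coordinate systems can be illustrated by computing internal coordinates from Cartesian ones. The sketch below uses only the Python standard library; the function names are hypothetical, angles are reported in degrees, and the dihedral is computed with the usual atan2 formulation, whose sign depends on the convention adopted.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def bond_length(p0, p1):
    """Distance between a bonded pair of atoms."""
    return norm(sub(p1, p0))


def bond_angle(p0, p1, p2):
    """Angle in degrees at p1 for the bonded triplet p0-p1-p2."""
    u, v = sub(p0, p1), sub(p2, p1)
    cosine = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def dihedral_angle(p0, p1, p2, p3):
    """Torsion in degrees about the p1-p2 bond for the quartet p0-p1-p2-p3."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals to the two planes
    b2_unit = tuple(x / norm(b2) for x in b2)
    return math.degrees(math.atan2(dot(cross(n1, n2), b2_unit), dot(n1, n2)))


# Water in Cartesian coordinates (angstroms).
oxygen = (0.0, 0.0, 0.0)
hydrogen_1 = (0.9572, 0.0, 0.0)
hydrogen_2 = (-0.2400, 0.9266, 0.0)

oh_length = bond_length(oxygen, hydrogen_1)
hoh_angle = bond_angle(hydrogen_1, oxygen, hydrogen_2)
```

For this geometry the O-H length is 0.9572 angstroms and the H-O-H angle is close to 104.5 degrees, which are the values a Z-matrix for water would list directly.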

Other Data

Some data is constant from molecule to molecule, and is generally supplied with the software. Examples include:

van der Waals radii.
Partial atomic charges (biopolymers), basis sets, and other electronic data.
Physical constants (e.g. Planck's constant).
Force-field parameters (e.g. constants for bond stretching, angle bending, etc.).

Data Storage

Data files distributed with a program (e.g. dictionaries, parameter files, etc.) are usually maintained in a central fixed location whose directory path is known to the program. Such data files may be automatically input during program initialization and/or as the data is needed. In some cases, data may be hard coded in a program (usually physical constants that never require modification).

The contents of the data structure can be saved to a file (sometimes a set of files) that is generically called a "molecule file." Molecule files contain all information required to recreate the data structure during subsequent uses of the program. Molecule file formats tend to be highly specific for a given program. The lack of a standard file format can create difficulties in transferring data between programs.

Some programs recognize third-party formats in addition to their native format. The Protein Data Bank (PDB) format is recognized by many programs, but this format is not well suited for small molecules. The program "Babel" can interconvert many different molecule file formats. Caution must always be used when transferring data between programs, particularly in regard to the translation of atom type codes.
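As an illustration of reading a third-party format, the sketch below parses a single PDB ATOM record. PDB records use fixed column positions, so the fields are sliced by column rather than split on whitespace. The function name parse_atom_record and the returned dictionary keys are hypothetical, the sample record is made up, and only a subset of the fields is extracted.

```python
# A made-up sample record, assembled field by field to show the column layout.
SAMPLE_LINE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "ALA"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "  11.104"    # columns 31-38: x coordinate
    "   6.134"    # columns 39-46: y coordinate
    "  -6.504"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    "  0.00"      # columns 61-66: temperature factor
    "          "  # columns 67-76: blank
    " N"          # columns 77-78: element symbol
)


def parse_atom_record(line):
    """Extract selected fields from one ATOM or HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_id": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),  # empty if the line is too short
    }


atom = parse_atom_record(SAMPLE_LINE)
```

Note that the record carries an atom name and an element symbol but no program-specific atom type code, which is one reason atom types must be translated or reassigned with care when data moves between programs.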