WO2002017112A1

WO2002017112A1 - Genetic programming for performing direct marketing

Info

Publication number: WO2002017112A1
Application number: PCT/US2001/026216
Authority: WO
Original assignee: Minetech, Inc.
Priority date: 2000-08-23
Filing date: 2001-08-21
Publication date: 2002-02-28
Also published as: AU2001285191A1

Abstract

A genetic algorithm process for analyzing direct marketing problems (Fig. 2). The process incorporates the steps of creating and evaluating an initial model population against a training database; creating a working model population by removing models from the initial population which have relatively low fitness functions (114); selecting a model with relatively high fitness as a base for follow-on model generation (116); and iteratively modifying that selected model to generate a set of models from which the highest value fitness function is selected. The automatic genetic program process operates in combination with an interface (200) which displays a current model form along with the genetic process fitness history and cum lift values. The interface interacts with the automatic genetic program process to allow a user to modify the form of the model being used as a base for the genetic program process. Because the user has access to the fitness history and cum lift evaluations of the models, the user can guide the model creation process to more effectively analyze direct marketing problems.

Description

Title

Genetic Programming for Performing Direct Marketing

Field of The Invention

The present invention relates to the field of computer assisted problem solving. In

particular, the present invention relates to a system for applying genetic programming to the

problem of maximizing the response rate to a direct marketing campaign.

Background of The Invention

Direct marketing involves the direct sales of goods and/or services to large numbers

of individual customers. Generally, a marketing representative will contact selected individual customers from a vast number of potential customers. Because it is impossible or

uneconomical (or both) to contact all the potential customers and because a marketer only desires to contact those customers most likely to respond positively to a marketing campaign,

some selection of potential customers must be made. Making the proper selection of highly

responsive potential customers is often the critical difference between successful and

unsuccessful campaigns.

The tools available to assist in the customer selection include a database having individual customer names in a vector with various characteristics of the customer (for example, age, gender, income, zip code, etc.) which may be analyzed with conventional

discriminant analysis, logistic regression and/or ordinary least squares regression. A technique for analyzing the database using genetic programming is also available. A genetic

program ("GP") is a series of mathematical procedures for search and optimization that

transforms a set of individuals, represented as mathematical objects, into a new set of individuals using operations patterned after the principle of the survival of the fittest. In short, the GP process randomly generates a set of mathematical functions or programs which

can evaluate the mathematical representation of individuals with respect to a certain fitness function, and then systematically modifies that random set of functions using reproduction,

crossover and mutation operators until the initial random set of functions is transformed into

a set of individuals wherein the fitness function is maximized. In general, GPs incorporate an

indeterminate number of operators.

In the GP process, the fitness function is essentially a function that scores or evaluates

any one program with respect to a desired response. That is, the fitness function measures the

difference between a desired or optimum response and the response generated by the program at issue. The GP process also includes several types of operators. The reproduction operator is a process which duplicates or copies functions depending on their fitness scores; the crossover

operator combines two functions depending on their fitness scores; and the mutation operator randomly changes functions. In the GP process, the reproduction, crossover and mutation

operators are recursively applied to the initial random set of functions until this recursive

application creates a set of functions optimized with respect to the fitness function.

The use of techniques implementing GPs is important because such techniques do not depend on an a priori parameter selection or parametric assumption as in other conventional

statistical analysis. Because GP techniques are not based on assumptions concerning the form or size of the initial function set (e.g., GPs do not require random or normal distribution of the individual customer characteristics), the GP techniques are widely applicable. Moreover, they have been shown to be effective on large datasets and

optimization problems. The difficulty with GP techniques, however, is that they are based on

calculations with respect to an empirically defined fitness function. As a result, there is little

or no feedback to the user during the calculation process. More importantly, external restrictions on the form of the model generated are not conveniently implemented.

Accordingly, a tool for providing feedback and guiding the GP optimization process is

desirable.

Objects of The Invention

It is an object of the present invention to use genetic program techniques to determine

the likely response profile for a direct marketing campaign.

It is another object of the present invention to provide feedback concerning the

optimization process to a user of genetic program techniques.

It is still a further object of the present invention to create an interface for using

genetic program techniques that graphically illustrates the optimization process.

It is still another object of the present invention to create an interface for editing the genetic program optimization sequence and graphically representing that edited sequence.

It is still a further object of the present invention to create an interface for editing the

genetic program sequence and graphically displaying the effect of the edited sequence on the fitness function.

Summary of The Invention

The present invention is a genetic algorithm process for analyzing direct marketing

problems. The process incorporates the steps of creating and evaluating an initial model population against a training database; creating a working model population by removing

models from the initial population which have relatively low fitness functions; selecting a model with relatively high fitness as a base for follow-on model generation; and iteratively modifying that selected model to generate a set of models from which the highest value

fitness function is selected. The automatic genetic program process operates in combination

with an interface which displays a current model form along with the genetic process fitness

history and cum lift values. The interface interacts with the automatic genetic program process to allow a user to modify the form of the model being used as a base for the genetic

program process. Because the user has access to the fitness history and cum lift evaluations of the models, the user can guide the model creation process to more effectively analyze

specific direct marketing problems.

Brief Description of The Drawings

Figure 1 illustrates a computer system according to one embodiment of the present

invention.

Figure 2 illustrates a sequence of process steps for carrying out a genetic program

according to one embodiment of the present invention.

Figure 3 illustrates a node structure for representing a function according to one embodiment of the present invention.

Figure 4 illustrates a display for an operator selection interface to a genetic program

process according to one embodiment of the present invention.

Figure 5 illustrates a display for an interface to a genetic program process according to

one embodiment of the present invention.

Figure 6 illustrates a display for an interface to a genetic program process according to

one embodiment of the present invention.

Detailed Description of The Invention

Figure 1 illustrates a computer system according to one embodiment of the present invention. Computer 20 comprises a central processing unit (CPU) 30 and main memory 40.

Computer 20 is coupled to an Input/Output (I/O) system 10, disk storage unit 50, and network connections 55, 57 and 59. The I/O system 10 includes a display 5, a keyboard 7 and a cursor

control device (e.g., mouse, trackball, etc.). In general, the disk storage unit 50 stores a series

of instructions in a program and associated data for operating the computer system. The instructions and data are retrieved from disk storage 50 and stored in main memory 40. The CPU 30 retrieves instructions and associated data from main memory 40 and then executes

those instructions as defined by the program. The computer system 20 may also retrieve program instructions and/or data from world wide web connection 55, intranet or local area

network (LAN) connection 57 or other external network connection 59. The computer

system 20 may retrieve and/or send instructions and/or data by encoding such data on

carrier signals and transmitting those carrier signals over the network connections. These

connections and methods for transferring data or instructions to the computer system 20 are

well known to those of skill in the art.

In the field of direct marketing, a marketer begins with a large list of potential

individual customers, each individual customer associated with a vector of characteristics such as, for example, age, income, gender, education or number of children. Each of the

characteristics in this vector can be represented numerically (e.g., gender = 1/0, education = number of years, etc.). This list of potential customers and associated characteristics is referred to as a customer database. Given unlimited resources, the marketer would directly

contact all individuals in the customer database. Because the marketer has resource limitations, however, the objective of the marketer is to select that subset of potential

customers (wherein the size of the subset is a function of the resource constraint) which has the highest likelihood of favorably responding to a direct marketing campaign.

In order to achieve this marketing objective, the marketer analyzes the customer

database using a computer implementing an analysis program. Accordingly, the marketer must create an analysis function which, when applied to the customer database, generates the

desired response. For example, the analysis function identifies those customers most likely to

respond to a marketing campaign or identifies those customers most likely to respond and

spend the most money. To create such an analysis function, the marketer creates a training database which is a subset of the customer database wherein actual responses and the quality

of those responses have been measured and quantified. Of course, the training database could

be as large as the customer database. An analysis function may then be created to model the actual responses using the training database because the performance of the analysis function

can be measured against actual results. Once an analysis function has been created, it may

then be applied to the customer database. For example, a customer database having

1,000,000 entries may be available to a marketer but the marketer may only be able to

measure response rates to a certain marketing campaign on 5,000 individuals. Accordingly

then, the marketer would use the measured response from the 5,000 individual sample and create an analysis function that predicted the response rates in that sample. The marketer

would then apply that resulting analysis function to the 1,000,000 entry customer database.

To generate an analysis function, direct marketers need a technique to gauge known responses. One measure is the Proportion of Total Correct Classifications (PTCC) which is

calculated as a cross-tabulation. For example, if a sample contains 100 individuals with a

15% actual response rate and a 24% predicted response rate, and the model has

correctly predicted 74 nonresponders and 13 responders, then the PTCC value is (74+13)/100 = 87%. Although the PTCC value is frequently used, it is not appropriate for

many problems such as, for example, when the assessment criterion imposes a penalty for misclassifications.
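By way of illustration only, the PTCC arithmetic of the example above may be sketched in Python as follows. The function name `ptcc` and the variable names are hypothetical; the false-positive and false-negative counts are derived from the 24% predicted and 15% actual response rates stated in the example.

```python
def ptcc(true_neg: int, true_pos: int, total: int) -> float:
    """Proportion of Total Correct Classifications, as a fraction."""
    return (true_neg + true_pos) / total


# Cross-tabulation for the 100-individual example in the text.
true_neg, true_pos = 74, 13
false_pos = 24 - true_pos  # predicted responders who did not respond
false_neg = 15 - true_pos  # actual responders the model missed

# The four cells of the cross-tabulation account for every individual.
assert true_neg + true_pos + false_pos + false_neg == 100

print(f"PTCC = {ptcc(true_neg, true_pos, 100):.0%}")
```

Note that the 11 false positives and the 2 false negatives count equally against the PTCC value, which is why the measure is unsuitable when misclassifications carry different penalties.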

For direct marketers, a more meaningful measure of the response model is referred to as the cum lift. As direct marketers desire to identify individuals most likely to respond to a solicitation, they want to measure the number of responses, beyond a random selection of

individuals, most likely to respond to a solicitation. The cum lift is, then, an index of how

many more responses are expected with a selection based on a model over the responses to be

expected with a random selection of individuals. The following process, for example, may be used to calculate the cum lift: i) for the form of model under consideration, score a test

sample, assigning each individual a predicted probability of response (PPR); ii) rank the scored file by PPR; iii) divide the ranked file into ten ranked groups (group 1 receives the top 10%, group 2 receives the next 10%, etc.); iv) identify the number of actual responses ($AR_j$, $j = 1, \ldots, 10$) identified

by the model in each decile; v) calculate the decile response rate as $AR_j$ divided by the number of

individuals in the scored file decile; vi) calculate the cumulative response rate for depth of file

(CR) as ($\sum_{j=1}^{n} AR_j$) divided by the total number of individuals through depth $n$; and vii) calculate cum lift as

(CR) divided by the total response rate, multiplied by 100.
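By way of illustration only, the cum lift steps above may be sketched in Python as follows. The function name `cum_lift_by_decile` and the toy sample are hypothetical. The sketch assumes a binary response, at least one responder, at least as many individuals as groups, and that the cumulative response rate at a given depth is the number of responses divided by the number of individuals through that depth.

```python
def cum_lift_by_decile(scored, n_groups=10):
    """Return the cum lift at each depth of file.

    scored: list of (ppr, responded) pairs, responded being 1 or 0.
    """
    # Steps i-ii: the file is already scored; rank it by PPR, highest first.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_n = len(ranked)
    total_rate = sum(resp for _, resp in ranked) / total_n
    lifts, cum_resp, cum_n = [], 0, 0
    for g in range(n_groups):
        # Step iii: integer slicing keeps the groups as equal as possible.
        group = ranked[g * total_n // n_groups:(g + 1) * total_n // n_groups]
        # Steps iv-vi: accumulate responses and individuals through this depth.
        cum_resp += sum(resp for _, resp in group)
        cum_n += len(group)
        cumulative_rate = cum_resp / cum_n
        # Step vii: index against the overall response rate.
        lifts.append(100 * cumulative_rate / total_rate)
    return lifts


# Toy test sample: 20 individuals, 4 responders, PPR falling with index.
sample = [((20 - i) / 20, 1 if i in (0, 1, 2, 5) else 0) for i in range(20)]
print([round(x) for x in cum_lift_by_decile(sample)])
```

At full depth of file the cum lift is always 100, since selecting everyone is equivalent to a random selection.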

As shown in this example, the cum lift is a measure of how much better a model

predicts the actual response rate in comparison to a random sample. For example, a cum lift of 294 for the top decile means that when soliciting the top 10% of a customer pool identified

by a model, the total number of responses is 2.94 times the number of responders expected by random sampling. While the cum lift analysis works well for binary responders (e.g.,

whether customer responds or not), it also works well for continuous response data (e.g., where responders spend differing amounts of money and the objective is to optimize the

response of high spenders).

Given a response measurement tool such as cum lift, the marketer must then determine

the analysis function. According to one embodiment of the present invention, such an analysis function is created by using a genetic algorithm in combination with an interactive function editing interface. That is, the genetic algorithm operates to evaluate, in an essentially random order, a large number of possible analysis functions. As any one of these

possible functions is being evaluated, the form of that function is graphically depicted on a

computer display. By using an interactive function editing interface, the form of the analysis

function eventually created in the genetic program process can be edited as the evaluation

process is ongoing. As a result, the genetic program process can be guided to create an analysis function more specifically tailored to a specific marketing problem.

The evaluation of any one analysis function is determined with respect to a fitness function. A fitness function is the difference between the results of the analysis function under consideration at any one time and a known ideal response. For example, the measured

quality of response for one individual of the training database may be X₁, while the function

under consideration f(c₁, c₂, c₃, c₄) (wherein c₁ through c₄ are numeric representations of age, gender,

income and education, respectively) generates a value of X₂. The fitness value for that

individual is X₂ − X₁. Such a fitness value is then calculated for every individual of a training

set from the customer database. A composite fitness function of the fitness values for the training set with respect to characteristics c₁ through c₄ is then generated. Preferably, the

fitness function reduces to a single number such as an average or statistical mean over the elements of the training set. The fitness function could also be a vector, analytical or discrete

function. The genetic program technique creates, evaluates and re-creates new functions until the fitness function over all the program forms is minimized.
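By way of illustration only, a composite fitness function that reduces to a single number may be sketched in Python as follows. The names `composite_fitness`, `candidate` and `training_set`, and the training rows themselves, are hypothetical. The sketch takes the absolute value of each per-individual difference before averaging so that positive and negative deviations do not cancel; that choice is an assumption and is not stated in the description.

```python
from statistics import mean


def composite_fitness(program, training_set):
    """Average absolute deviation of program output from measured response."""
    return mean(abs(program(c) - measured) for c, measured in training_set)


# Hypothetical candidate program over (age, gender, income, education).
def candidate(c):
    age, gender, income, education = c
    return 2 * age + 3 * gender - 0.1 * income - education


# Hypothetical training rows: (characteristics, measured response).
training_set = [((25, 1, 65, 14), 30.0), ((40, 0, 80, 16), 60.0)]
print(composite_fitness(candidate, training_set))
```

Under this convention a lower composite value indicates a better program, consistent with the statement that the fitness function is minimized.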

Figure 2 is a flow-chart of the genetic programming process of the present invention.

The process 100 starts by the Create Initial Population 102 step which creates a number of programs (typically randomly). A program is simply a mathematical function constructed

from a selection of a set of mathematical operators and from the characteristics associated with each individual of the training (or customer) database. For example, where an individual has characteristics c₁ (age) = 25, c₂ (gender) = 1, c₃ (income) = 65 and c₄ (education) = 14, one function f(c₁₁, c₁₂, c₁₃, c₁₄) for that individual may be [2 × (age) + 3 ×

(gender) − 0.1 × (income) − (education)] = 32.5. The form of the function can be any

combination of one or more mathematical operators chosen from a set including but not limited to multiplication, addition, subtraction, division, sine, cosine, tangent, exponential,

log, powers, summation, or tabular representation. Figure 3 illustrates a graphical

representation of the function f, wherein the characteristics (305, 307, 309 and 311) are combined (313, 315, 317, 319, 321 and 323) in a node structure. The set of operators used in

constructing functions according to the genetic programming technique is selected or constructed by a user. For example, Figure 4 illustrates an interface for a genetic alphabet

selector as used in a genetic program process. Through this interface, a user can select the

types of operators (e.g., arithmetic 330, circular 340, numeric 350, logicals 360, hyperbolic

370 or custom 380). By selecting the custom function button 390, a user can create a

specialized operator which may then be used as part of the set of operators used in the genetic programming technique.
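By way of illustration only, a node structure of the general kind described for the function f may be sketched in Python as follows. The class `Node`, the operator table `OPS` and the function `evaluate` are hypothetical names, and the tree shape shown is one possible arrangement rather than the arrangement of Figure 3. Only three binary arithmetic operators are included.

```python
import operator
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Node:
    """An operator applied to two child subtrees."""
    op: str
    children: Tuple["Tree", ...]


# A tree is an operator node, a characteristic name, or a numeric constant.
Tree = Union[Node, str, float]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, row):
    """Evaluate a tree against one individual's characteristics."""
    if isinstance(tree, Node):
        left, right = (evaluate(child, row) for child in tree.children)
        return OPS[tree.op](left, right)
    if isinstance(tree, str):
        return row[tree]  # terminal: look up the characteristic
    return tree  # terminal: numeric constant


# 2 x age + 3 x gender - 0.1 x income - education
f = Node("-", (
    Node("-", (
        Node("+", (Node("*", (2.0, "age")), Node("*", (3.0, "gender")))),
        Node("*", (0.1, "income")),
    )),
    "education",
))

individual = {"age": 25, "gender": 1, "income": 65, "education": 14}
print(evaluate(f, individual))
```

Representing a program as a tree of this kind is what allows later operations to cut, swap or replace whole branches while leaving a well-formed function.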

Referring back to Figure 2, the GP process continues with the step Execute Each

Program 106 wherein each program is executed. That is, one of the randomly generated functions is chosen, and that function is evaluated for each individual in the training database. As a result, each individual has an assigned numerical value (step 112) for the function being

evaluated. This numerical value is then compared with a measured response value to

generate a fitness value for each individual in the training database. A composite fitness

function, here the average of individual fitness values, is then generated. While the fitness function characterizes the difference between the modelled and measured response, this

function will vary depending on the problem and the characteristics of the data being modelled. Those of skill in the art recognize the variety of ways in which such a function can be created.

Once a fitness function has been generated, the basic iterative loop of the genetic

program process begins. Here, the termination test 104 is evaluated (for example, achieving a known best solution or achieving a certain degree of improvement in average fitness for the

working population), and if satisfied, the process ENDS 101. If not, the next step is to

Remove Program(s) with Relatively Low Fitness 114. The phrase "Relatively Low Fitness" is used to connote either selection based on a probability proportionate to normalized fitness values or selection based on equal probability among individuals having fitness outside some

defined threshold (e.g., less than 0.5). Step 114 causes the removal of the less fit members of the program population from being used to breed the next generation of functions. Step 114 improves the average fitness and eases memory requirements by keeping the working

population within reasonable limits.
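By way of illustration only, the threshold form of step 114 may be sketched in Python as follows. The function name `remove_low_fitness`, the program labels and the fitness values are hypothetical. The sketch assumes normalized fitness values in which 1 is best and 0 is worst, and removes every program below the threshold rather than sampling among them.

```python
def remove_low_fitness(population, threshold=0.5):
    """Drop programs whose normalized fitness falls below the threshold.

    population: list of (program_name, normalized_fitness) pairs, where
    1.0 is the best normalized fitness and 0.0 the worst.
    """
    return [(name, fit) for name, fit in population if fit >= threshold]


def average_fitness(population):
    return sum(fit for _, fit in population) / len(population)


population = [("p1", 0.92), ("p2", 0.48), ("p3", 0.75), ("p4", 0.10)]
working = remove_low_fitness(population)
print(working)
print(average_fitness(population), average_fitness(working))
```

The working population is both smaller and fitter on average than the initial population, which corresponds to the two effects attributed to step 114.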

Step 116, Select Program With High Fitness, then picks at least one program which

has a relatively high fitness value (compared to other programs) as a basis for creating new

programs. Using this chosen program, step 118, Choose an Operation to Perform, then

determines which operation will be used to modify the chosen program. In this step, a random number generator selects between the various operation choices. The output of the random number generator is weighted to select one or more of the operation choices more

often than other of such operation choices. In addition, as discussed more fully below, step

118 may receive instructions from a user through the user interface 200. These instructions may dictate a part of the form of the program generation and so enable the user to guide the program generation process.
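By way of illustration only, the weighted random choice of step 118 may be sketched in Python as follows. The names `OPERATION_WEIGHTS` and `choose_operation` are hypothetical; the 700/250/50 weighting is the example weighting given elsewhere in this description, and the fixed seed is used only to make the sketch repeatable.

```python
import random

# Example weighting: out of 1,000 choices, crossover is chosen about 700
# times, reproduction about 250 times and permutation about 50 times.
OPERATION_WEIGHTS = {"crossover": 700, "reproduction": 250, "permutation": 50}


def choose_operation(rng):
    """Pick one operation with probability proportional to its weight."""
    names = list(OPERATION_WEIGHTS)
    weights = list(OPERATION_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
draws = [choose_operation(rng) for _ in range(10_000)]
for name in OPERATION_WEIGHTS:
    print(name, draws.count(name))
```

A user instruction received through the interface could be modelled as an override of the returned value or as an adjustment of the weights; neither is shown here.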

Crossover 120 (sexual reproduction) and Reproduction 130 (asexual reproduction) are two basic operations that may be performed in step 118. Permutation 140 and mutation 150

(asexual reproduction) also play a role. Typically, the vast majority of chosen operations are the reproduction and crossover operations. For example, in a population of 1,000 choices, the

particular weighting of the process might specify that the crossover operation is chosen 700 times, reproduction is chosen 250 times and permutation is chosen 50 times. The preferred

mix of sexual and asexual reproduction is 66% sexual and 34% asexual. The genetic process parameters might also specify that mutation of a single node occur with a probability of

p=0.0001. Thus, if the average individual has Q points at which mutation might occur, a total

of Q*N*p alleles will be mutated (in a population of size N). If Q is 10, then 1 node out of 10,000 alleles in a population of 1,000 individuals will be altered as a result of the mutation

operation. The sexual reproduction crossover operation 120 requires a group of at least two

programs (typically two parents which mate to create two siblings). As a result, second

program(s) are picked to mate with the chosen program(s) from step 118. Typically, the mate

would be the next highest fitness value individual. Of course, the choice of a mate may be

made randomly. For each mating, a crossover point is separately selected at random from

among both internal and external points within each parent at Select Crossover Points 122.

Then newly created programs are produced at Perform Crossover 124 from the mating group using crossover. The crossover points divide each of the parents into first and second parts.

Mating involves connecting the first part of parent A with the second part of parent B and vice versa. Accordingly, two parents would produce two offspring. There are two variants of

this crossover operation. The first is the one point crossover in which the first part of parent A is identical with the first part of parent B. In such a case, the second parts of parents A and

B are swapped. The second variant of the crossover operation evaluates all common points of

parents A and B and exchanges them based on a 50% probability. As a general matter, all three versions of the crossover operation are each used 33% of the time.
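By way of illustration only, the basic two-parent crossover may be sketched in Python as follows. For brevity the sketch represents each program as a flat list of terms rather than as a node structure; that simplification, the function name `crossover` and the example parents are assumptions made for illustration.

```python
def crossover(parent_a, parent_b, point_a, point_b):
    """Swap the second parts of two parents at their crossover points."""
    child_1 = parent_a[:point_a] + parent_b[point_b:]
    child_2 = parent_b[:point_b] + parent_a[point_a:]
    return child_1, child_2


# Hypothetical parents, each a list of terms to be summed.
parent_a = ["2*age", "3*gender", "-0.1*income"]
parent_b = ["age", "-education", "income", "5*gender"]

child_1, child_2 = crossover(parent_a, parent_b, point_a=1, point_b=2)
print(child_1)
print(child_2)
```

Two parents yield two offspring, and every term of both parents survives in exactly one offspring, so the operation recombines existing material without creating or discarding any.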

In the GP process, there is no requirement that the population be maintained at a constant size. The version of the crossover operation producing two offspring from two parents has the convenient attribute of maintaining the population at constant size. Other operations each produce one offspring from one parent so that they too maintain constant

population size. On the other hand, if the crossover operation acts on a group of more than

two parents, the size of the population may grow. For example, if three parents formed a mating group, each parent would have two crossover points selected for it and there would be

27 possible offspring (3 × 3 × 3). Even if the three offspring equivalent to the three original parents are excluded, there would be 24 possible new offspring available. In general, if there

are N parents, then N − 1 crossover points would be selected for each and there would be N^N − N new offspring available. When an operation produces more offspring than parents,

then either the population can be allowed to grow or the population can be trimmed back to a

desired (presumably constant) size when the next round of fitness proportionate reproduction

takes place.
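By way of illustration only, the offspring count for a mating group of N parents may be checked by enumeration in Python as follows. The function name `new_offspring` and the segment labels are hypothetical. The sketch assumes that each parent has already been divided into N segments by its N − 1 crossover points and that the k-th segment of an offspring is taken from the k-th segment of some parent.

```python
from itertools import product


def new_offspring(parents):
    """Enumerate offspring of N parents, excluding copies of the parents.

    parents: list of N sequences, each holding that parent's N segments.
    """
    n = len(parents)
    combos = {
        tuple(parents[source][k] for k, source in enumerate(choice))
        for choice in product(range(n), repeat=n)
    }
    return combos - {tuple(p) for p in parents}


three = [("a1", "a2", "a3"), ("b1", "b2", "b3"), ("c1", "c2", "c3")]
print(len(new_offspring(three)))  # 3**3 - 3
```

With two parents the same enumeration gives 2 new offspring, matching the constant-population case described above.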

In asexual reproduction, one parent produces one sibling. Asexual reproduction

involves three operations, reproduction 130, permutation 140 and mutation 150. A method for selecting one of these three computational procedures for reproduction is to select them with a probability proportional to their normalized fitness. A preferred method is to select

mutation 80% of the time (wherein node mutation is selected 6%, terminal mutation is selected 30% and 10% random mutation) and replication is chosen 20% of the time. The

permutation operation 140 selects a Permutation Point 142 from among the internal points

within the selected individual function. Once this point is selected, Permutation 144 is

performed, by reordering the selected program's subprocedures, parameters, or both at the permutation points. For example, one permutation is to switch operators or swap branches of

an existing operator. Also, terminals can be swapped or replaced with constants. Similarly, the mutation operation selects a Mutation Point 152 for each selected program; at this selected point, Mutation 154 then randomly generates, for each selected program, a portion of a program and inserts it at the mutation point. The portion inserted is typically a single point,

but may be a sub-program or other function.
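By way of illustration only, one form of permutation, swapping the branches of an existing operator, may be sketched in Python as follows. The sketch represents a program as nested tuples of the form (operator, left, right); that representation, the function name `permute` and the use of a path of child indices to identify the permutation point are assumptions made for illustration.

```python
def permute(tree, path=()):
    """Swap the two branches of the operator node found at `path`.

    tree: nested tuples of the form (operator, left, right).
    path: child indices (0 = left, 1 = right) leading to the permutation
          point; an empty path selects the root.
    """
    if not path:
        op, left, right = tree
        return (op, right, left)
    op, *children = tree
    index = path[0]
    children[index] = permute(children[index], path[1:])
    return (op, *children)


program = ("-", ("*", 2, "age"), "education")
print(permute(program))        # swap the branches of the root "-"
print(permute(program, (0,)))  # swap the branches of the inner "*"
```

Swapping the branches of a commutative operator such as multiplication leaves the function value unchanged, whereas swapping the branches of subtraction changes it.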

Once new programs (or functions) have been created, the new program(s) are executed

and evaluated in step 160 for all individuals in the training set. The fitness values for all the

individuals are then used to form a composite fitness function, in this case, an average fitness

value over the individuals. The evaluation creates a fitness value associated with the new

program. Thereafter, the GP process returns to the termination test 104. The GP process then iterates through new generations of functions until either a performance criterion is met (e.g., fitness function minimized) or a user selected number of generations has been created and

evaluated. Each generation of programs is used to breed a subsequent generation in which

the fittest members go on to breed other generations. Refinement of programs within and

over generations (and hence, optimization of the fitness function) is a consequence of

iterating this GP process using fitness proportionate selection for the programs.

Moreover, because of the iterative nature of the GP process, an audit trail can be

created of the entire process from the creation of the initial population of individuals to the current working population of individuals. For example, suppose we denote the individuals of the initial population as I1, I2, I3, . . . These individuals can be either stored directly or one

can store the random algorithm (and random seeds) used to generate the initial members. When a crossover is performed on two individuals (say I1 and I2, at point p of parent 1 and

point q of parent 2), an expression is created involving 5 items: namely, the symbolic string

"CROSSOVER", the identities of the two individuals being crossed at the time (i.e. I1 and I2) and the two crossover points (i.e. p and q). This new string would be the identity (i.e. audit

trail) of the newly created individual. If a subsequent crossover (or other operation) were performed on this individual, this string would, in turn, become an argument of a new operation. Similarly, when a permutation is performed on an individual, an expression is

created involving 3 items: namely, the symbolic string "PERMUTATION", the identity of the individual, and the permutation point. An example would be (PERMUTE I4 t) if the permutation operation had been performed on individual I4 at point t.
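By way of illustration only, the audit trail expressions described above may be sketched in Python as nested tuples. The function names `crossover_record` and `permutation_record`, and the labels I1, I2, p, q and t, are illustrative placeholders for identities and points.

```python
def crossover_record(parent_1, parent_2, point_p, point_q):
    """Five-item identity of an individual created by crossover."""
    return ("CROSSOVER", parent_1, parent_2, point_p, point_q)


def permutation_record(individual, point):
    """Three-item identity of an individual created by permutation."""
    return ("PERMUTATION", individual, point)


# An individual bred from I1 and I2, then permuted at point t.
child = crossover_record("I1", "I2", "p", "q")
grandchild = permutation_record(child, "t")
print(grandchild)
```

Because the identity of the child becomes an argument of the next record, the full lineage of any individual can be read back from its identity alone.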

The first step in the iterative process involves activating each program. Activation

means having each program attempt to accomplish its goal, producing an objective result. In the preferred embodiment, entities are computer programs, so activation requires executing

the programs of the population on the training database. The second step in the process assigns a fitness value to the objective result, and associates that fitness value with its corresponding entity. For programs concerning direct marketing, the fitness value is

generally a number, or a vector, which reflects the difference between the results of the

program execution and measured results. Of course, the fitness value could be any symbolic

representation.

In general, some of the programs will prove to be better than others when a value is

assigned to them after their interaction with the "environment" (e.g., individual characteristics) of the direct marketing problem. The best value (fitness) may be the lowest number (as is the case here where we are measuring the deviation between a result and a

known perfect solution). In other problems, the best value (fitness) may be the highest

number (e.g. scoring direct "hits"). The value (fitness) assigned may be a single numerical value or a vector of values, although it is often most convenient that it be a single numerical

value. In many problems, the best value is known or measured. However, even in such

problems, it is also known that lower (or higher) numbers connoting better fitness may be

attained over time and the best value attained by the process over a given time may need to be identified.

A useful method for organizing raw fitness values involves normalizing the raw values and then calculating probabilities based on the normalized values. The best raw fitness value is assigned a normalized fitness of 1, the worst value is assigned a value of 0, and all intermediate raw values are assigned in the range of 0 to 1. The probability of being

selected for program breeding is determined by the equation $P_i = f_i / \sum_{x=1}^{N} f_x$,

where $P_i$ is the probability of breeding for individual $i$ having a normalized fitness of $f_i$, and $N$ is the total number of the population. Thus, an individual's probability of breeding equals

the individual's normalized fitness value divided by the sum of all the normalized fitness values of the population. In this way, the normalized fitness values range between 0 and 1, with a value of 1 associated with the best fitness and a value of 0 associated with the worst,

and the sum of all the individuals' probabilities equals 1.
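By way of illustration only, the normalization and the breeding probabilities may be sketched in Python as follows. The function name `breeding_probabilities` and the raw fitness values are hypothetical. The description states only that intermediate raw values are assigned within the range 0 to 1; the linear scaling used here is an assumption.

```python
def breeding_probabilities(raw_fitness, lower_is_better=True):
    """Return (normalized fitness values, breeding probabilities)."""
    best = min(raw_fitness) if lower_is_better else max(raw_fitness)
    worst = max(raw_fitness) if lower_is_better else min(raw_fitness)
    if best == worst:
        raise ValueError("all raw fitness values are equal")
    # Assumed linear scaling: best raw value -> 1, worst raw value -> 0.
    normalized = [(worst - r) / (worst - best) for r in raw_fitness]
    total = sum(normalized)
    # Each probability is f_i divided by the sum of all normalized values.
    return normalized, [f / total for f in normalized]


# Hypothetical raw deviations from a known response (lower is better).
normalized, probabilities = breeding_probabilities([2.0, 4.0, 10.0])
print(normalized)
print(probabilities)
```

Under this scaling the worst individual receives a breeding probability of exactly 0, so it can never be selected by a strictly proportionate method.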

At any given time, there is one individual in every finite population having a single

fitness value that is the best amongst that population. Moreover, some environments or data

set characteristics have a known best fitness value. Examples are when fitness is measured as

deviation from a known answer or number of matches. The process of the present invention

may occasionally generate an individual whose value (fitness) happens to equal the known best value. Thus, this overall process can produce the best solution to a particular problem. This is an important characteristic of the overall process, but it is only one characteristic.

Another important characteristic (and the one which is more closely analogous to nature) is that a population of programs exists and is maintained which collectively exhibits a tendency

to increase their fitness over a period of time. That is, the average (or other group measure)

of the fitness values for all the programs in the working set tends to increase over time. By

virtue of the many individuals with good, but not the very best, fitness values, the population exhibits the ability to robustly and relatively quickly deal with changes in the data set characteristics. Thus, although the variety in the population lowers its overall average fitness value, that variety gives the population an ability to robustly adapt to

changes in the environment.

Another embodiment of the GP process involves the process of affecting the

probabilities of which program or programs will breed further generations. As described

above, one way to determine breeding probability is to select the program with the highest

fitness value. A number of other methods exist, however, which tend to determine entities of relatively high value. The theoretically most attractive way to determine breeding probability is to do so with a probability proportionate to a fitness value (once so normalized between 0

and 1). Thus, an individual with fitness of 0.95 has a 19 times greater chance of breeding than an individual of fitness value 0.05. Occasionally individuals with relatively low fitness

values will be selected. This selection will be appropriately rare, but it may occur.

Furthermore, if the distribution of normalized fitness values is reasonably flat, this method is

especially workable. However, if the fitness values are heavily skewed (perhaps with most

lying near 1.00), then making the selection using a probability that is simply proportionate to normalized fitness will result in the differential advantage of the most fit individuals in the

population being relatively small and the operation of the entire process being prolonged. Thus, as a practical matter, breeding is done with equal probability among those individuals

with relatively high fitness values rather than being made with probability strictly proportionate to normalized fitness. This is typically accomplished by breeding individuals

whose fitness lies outside some threshold value. One implementation of this approach is to

select a threshold as some number of standard deviations from the mean (selecting, for example, all individuals whose fitness is at least one standard deviation above the mean fitness).
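
As a non-limiting sketch, threshold-based breeding with equal probability among the high-fitness individuals might look as follows in Python. The sketch assumes that higher normalized fitness is better, that the threshold lies above the mean, and that the population standard deviation is used; the fallback to the single best individual when nobody clears the threshold is likewise an assumption, as are the function names.

```python
import random
import statistics

def eligible_breeders(fitness, k=1.0):
    """Indices of individuals at or above mean + k standard deviations."""
    cutoff = statistics.mean(fitness) + k * statistics.pstdev(fitness)
    return [i for i, f in enumerate(fitness) if f >= cutoff]

def pick_breeder(fitness, k=1.0, rng=random):
    """Choose one eligible individual with equal probability."""
    pool = eligible_breeders(fitness, k)
    if not pool:
        # Assumed fallback: nobody clears the threshold, so use the best one.
        pool = [max(range(len(fitness)), key=lambda i: fitness[i])]
    return rng.choice(pool)
```

With fitness values of 0.1, 0.2, 0.3, 0.4, 0.95 and 0.97 and a threshold of one standard deviation, only the last two individuals are eligible, and each is chosen with equal probability.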

During the iterative loop of the GP process, breeding parameters are modified through the Choose Operation step 118. As noted above, the possible operations include crossover,

permutation, and reproduction. The preferred operation is crossover, followed by reproduction, and lastly permutation. However, this preference is only a generalization; different preferences may work better in specific cases. Also, as illustrated in figure 2, a user may select a particular form of operation through User Interface 200. Figure 5

illustrates User Interface 200. It is a graphical user interface in which the model form

window 250 is displayed along with one or more measures of the effectiveness of the model.

As shown in figure 5, for example, the Fitness History window 220 and cum lift window 230 are graphically displayed.

The User Interface 200 works in conjunction with the genetic algorithm of figure 2 to

provide a user controlled genetic program process. As described above, the genetic program of figure 2 creates an initial population of models and then uses the programs with the best

fit, in steps 118 to 160, to breed a new generation of programs which subsequently become

the new working model population. The automatic iteration of this process eventually

generates a model that exhibits the best fitness. The user interface 200 allows a user to alter

the automatic iterating and edit or lock-in a selected form for part of the model. Editing or

locking in certain model sections is desirable to make the models conform to constraints not otherwise taken into account by the fitness function. Such edits may be based on the

performance measures of the fitness function graphically displayed and dynamically updated as individual models are evaluated.

As illustrated in figure 5, the user interface 200 displays a graphic representation of the program in the model form window 250 along with the fitness history in window 220 and

cum lift in window 230. The program that is graphically represented in window 250 is the

program being evaluated against the training database in step 160 illustrated in figure 2. The particular program illustrated in figure 5 computes the expression ((Dollar) multiplied by (Dollar plus 3)) multiplied by (3 divided by product type). The fitness history 220 is a graphical display of the calculated fitness values for each program that has been displayed in the model form window 250. Specifically, if the GP process is evaluating the Nth program in step 160, the iterative process of the GP has already calculated (N-1) fitness values for (N-1) previous programs, and those (N-1) fitness values are displayed in window 220 in the order in which they were generated.
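
For illustration, the example program of figure 5 can be represented as a nested expression tree and scored against a single database record. The tuple encoding, the record keys `Dollar` and `ProductType`, and the protected division that returns 1.0 on a zero divisor are all assumptions of this sketch and are not taken from the description.

```python
import operator

def protected_div(a, b):
    """Division that returns 1.0 on a zero divisor (assumed convention)."""
    return a / b if b != 0 else 1.0

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": protected_div}

def evaluate(node, record):
    """Recursively evaluate a (op, left, right) tree against one record."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, record), evaluate(right, record))
    if isinstance(node, str):
        return record[node]  # terminal: a named input variable
    return node              # terminal: a numeric constant

# ((Dollar) * (Dollar + 3)) * (3 / product type)
MODEL = ("*", ("*", "Dollar", ("+", "Dollar", 3)), ("/", 3, "ProductType"))

score = evaluate(MODEL, {"Dollar": 2.0, "ProductType": 3.0})
```

For a record with Dollar equal to 2 and product type equal to 3, the expression evaluates to (2 x 5) x (3 / 3), which is 10.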

While the user interface may simply track fitness values, it may also retain a vector of

parameters for each fitness value such as a high/low range or a normalized distribution or other parameters that represent a fitness value. The interface may also retain a vector for each fitness value such that selection of a fitness value along the fitness history curve prompts the

actual model form corresponding to the selected fitness history value to be displayed. Finally, the cum lift display is a representation of the cum lift by decile for the Nth program of step

160. The display in window 230 may alternatively be a table or bar chart or any

representation of the cum lift decile analysis. Alternatively, the interface may also store the

cum lift history in a similar manner to the fitness value history. For example, the window

230 might display the cum lift history for the 1st decile (top 10%) of all the (N-1) models

generated to date.
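
A minimal Python sketch of a cum lift decile analysis is given below for illustration. It assumes that cum lift at a given depth is 100 times the cumulative response rate of the top-scored records divided by the overall response rate of the file, so that a value of 100 means no improvement over mailing at random; the function name and this indexing convention are assumptions of the sketch.

```python
def cum_lift_by_decile(scores, responses, n_groups=10):
    """Return the cum lift at the bottom of each decile (index 100 = average)."""
    n = len(scores)
    if n < n_groups:
        raise ValueError("need at least one record per group")
    # Rank the file from the highest model score to the lowest.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    ranked = [responses[i] for i in order]
    overall_rate = sum(ranked) / n
    if overall_rate == 0:
        raise ValueError("file contains no responders")
    lifts = []
    for d in range(1, n_groups + 1):
        depth = (d * n) // n_groups           # records in the top d deciles
        cum_rate = sum(ranked[:depth]) / depth
        lifts.append(100.0 * cum_rate / overall_rate)
    return lifts
```

The last entry always equals 100 under this convention, because the cumulative response rate over the whole file is the overall response rate.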

In operation, the combination of the display of the Nth model in window 250 with the fitness history in window 220 and the cum lift in window 230 enables the user of the GP

process to evaluate, while models are being created, the effectiveness of the models with respect to the direct marketing problem being analyzed. For example, while the (N-1) model is being analyzed, a user may have decided that a certain section of that model should be retained for all follow-on models that will be generated in this GP process. Accordingly, the user would select the "Pause" button 260 on the interface tool bar, which will temporarily halt the evaluation of the (N-1) model over the training database. The user would then drag the cursor over the section of the model to be saved in follow-on models and select each function or input to be saved. When the user continues the evaluation process by selecting the "Continue" button on the toolbar, the GP process will finish evaluating the (N-1) model and will input at step 180 (figure 2) the selected portions of the (N-1) model to be used in the Nth and follow-on models. Alternatively, the user may set a parameter such that the section of the (N-1) model to be used in the Nth model is used only in the Nth model or only in a set number of models following the (N-1) model.

Importantly, the reason the user of the GP process is able to know that certain sections

of the model may need to be retained (or edited out) is that the user can detect the raised cum lift for a particular model and also, by selecting the fitness history, determine which models might be similar to (or different from) the current Nth model. Where high (or low) fitness models have similar sections to the Nth model, the similar section of the Nth model can then

be used as a base for further modifications. The user might also be subject to arbitrary

limitations on individual parameters that are unique to any particular problem. In this case,

the user can specify those limits through combinations of input parameters and operations

such that a section of a model is built and input into the GP process through step 118. In this

way the user of the GP process can guide the model creation process of the GP algorithm such that an analysis of any particular problem is customized to its specific limitations.

Figure 6 illustrates another embodiment of the interface of the present invention. In

particular, the view (or model form) window 250 is displayed alongside the fitness history

window 220 and a decile analysis window 235. In addition, the response profile window 260 and gene pool diversity window 270 are displayed. The fitness history window 220 plots the

fitness of the best-of-generation (BOG) and the 2nd best-of-generation individuals. Do not

expect BOG fitness to increase with every new generation; it will not; it may even go down. Normal behavior is for short periods of growth followed by many generations of apparently no change, followed again by sudden increases. Modern theorists refer to this process as

punctuated equilibrium.

The view window 250 automatically displays the current best-of-generation program. In this window you can view any one individual program in the population. This window has

several functions. The view window contains a small cluster of menu items that allow you to step forwards and backwards throughout the entire population, to stretch and shrink the

display vertically and horizontally and to export any of the evolved programs in a variety of formats. Every time you select a different program the cum lift window automatically

updates with that program's output. The view window menu items include:

First Displays the current best-of-generation program

> » >» Step forward 1, 10 or 100 individuals and display that program

< « <« Step back 1, 10 or 100 individuals and display that program

Vert+ Stretch vertically a bit

Vert- Shrink vertically a bit

Horz+ Stretch horizontally a bit

Horz- Shrink horizontally a bit

Export Exports the displayed program.

Restore Restores the display if the Program window is maximized

The view window buttons include:

Short Names toggles the displayed variable names between the given names and the

abbreviated names; useful when displaying long programs.

Auto Size redraws the displayed program to fill the current window size.

Font+ Font- increase and decrease the font size of the displayed program.

Best so far shows the program with the highest top decile in prior generations.

Print prints the program tree on your printer. Use landscape for best results.

Sweep scans the displayed program for redundant branches by sequentially testing the effect of removing every branch in the program. The test is simply the resulting cum lift at a specific depth of file chosen by you. Normal practice is to set the depth of file at the level of your usual mailing.
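
The sweep operation may be sketched as follows, purely by way of example. The sketch assumes binary operators held in nested tuples, interprets "removing a branch" as replacing an operator node by one of its two children, and treats a branch as redundant when its removal does not lower the cum lift at the chosen depth of file. All names, the record keys and the protected division are assumptions of the sketch.

```python
import operator

def safe_div(a, b):
    return a / b if b != 0 else 1.0  # assumed protected division

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": safe_div}

def evaluate(node, record):
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, record), evaluate(right, record))
    return record[node] if isinstance(node, str) else node

def lift_at_depth(program, records, responses, depth):
    """Cum lift (index 100 = average) of the top `depth` fraction of the file."""
    scores = [evaluate(program, r) for r in records]
    order = sorted(range(len(records)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(depth * len(records)))
    top_rate = sum(responses[i] for i in order[:k]) / k
    return 100.0 * top_rate / (sum(responses) / len(responses))

def prunings(node):
    """Yield every program obtained by removing exactly one branch."""
    if not isinstance(node, tuple):
        return
    op, left, right = node
    yield left                     # drop the right branch and the operator
    yield right                    # drop the left branch and the operator
    for sub in prunings(left):
        yield (op, sub, right)
    for sub in prunings(right):
        yield (op, left, sub)

def sweep(program, records, responses, depth):
    """Return (pruned_program, lift) pairs whose lift is not below the original."""
    base = lift_at_depth(program, records, responses, depth)
    found = []
    for candidate in prunings(program):
        lift = lift_at_depth(candidate, records, responses, depth)
        if lift >= base:
            found.append((candidate, lift))
    return found
```

In this sketch a program such as Dollar + (0 x Age) is reported as reducible to Dollar alone, because the second branch contributes nothing to the ranking of the file.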

In the view window, when the cursor is placed on a model form part and clicked, a sensitivity window is selected. This window enables a user to study cross-sections of the program's performance at any point. Also, the node output diagnostics window is available in the view window, and it displays the cumulative average response in fine detail. When you select the analysis view mode from the windows menu, evolution is frozen and you have the opportunity to investigate any of the programs in the pack in fine detail.

Furthermore, there are four inspection windows in analysis mode, three of which have

already been described (View window, Decile window and Sensitivity window) and a fourth,

which is only visible in this mode. This window, headed "Cumulative response in detail",

displays a chart and sequences the entire file in descending order of model score and displays

the cumulative average response (or profit) for each observed score group that is output by the model. It is perfectly possible to have as many score groups as there are records in the data file. Similarly it is possible to have just two. This chart is used to identify the cut-off

scores for a given model prior to processing an external data file with that model. The vertical cursor may be moved with the spin control and the position in terms of depth-of-file

and model-output-score at that position is displayed in the information boxes at the top.
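
The chart described above can be approximated, for illustration, by the following Python sketch. It sorts the file by descending model score, forms one group per distinct score, and reports the depth of file and cumulative average response at each group. The function names and the tuple layout of each row are assumptions of the sketch.

```python
from itertools import groupby

def cumulative_response_by_score(scores, responses):
    """Return (score, depth_of_file, cumulative_average_response) per score group."""
    pairs = sorted(zip(scores, responses), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    rows, seen, total = [], 0, 0.0
    for score, group in groupby(pairs, key=lambda p: p[0]):
        for _, response in group:
            seen += 1
            total += response
        rows.append((score, seen / n, total / seen))
    return rows

def cutoff_score(rows, depth_of_file):
    """First score, reading down the file, at which the requested depth is reached."""
    for score, depth, _ in rows:
        if depth >= depth_of_file:
            return score
    return rows[-1][0]
```

A cut-off found in this way could then be applied to an external data file by keeping only the records whose model score is at or above the returned value.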

In addition, the view window includes diagnostic boxes. These boxes show: the

current activity and the program being worked on; the current generation number; the number of failed programs (Runts) in the last generation; the fitness of the best individual in the last

generation; and the time per complete cycle, from one generation to the next in hours, minutes and seconds. At any time during evolution a user can pause and save an entire population of programs together with all the population and breeding parameter settings.

These saved files are called scenarios.

At any time during evolution a user can pause and fetch a previously saved scenario and

continue evolution with the recovered set of programs instead. A user can also remove

unwanted scenarios. The file menu items Save scenario, Fetch scenario and Delete

scenario accomplish these tasks. Work can be easily recovered by fetching the crash-save scenario. The decile analysis window 235 is one of four display modes available for viewing

the predictive power of any individual model. Clicking the left mouse button on the window

region 235 cycles the display through three alternative views of the fit. The view of the decile

analysis is a tabular result of the model results by decile; the first alternative view is a cum

lift plot; the second alternative is a cumulative response by decile and the third alternative is

an advantage index (bar chart of cum delta over 100).

The gene pool diversity window 270 is one of three graphs. The bar chart shows the average frequency of occurrence across the entire population of each element in the genetic alphabet. It is used to monitor the loss of genetic diversity. As evolution progresses and

different sequences of code are favored differently, so it is inevitable that the individual elements of the genetic alphabet will start to appear with differing frequencies. The same is

also true of the candidate variables. This window allows a user to see how things are

progressing. Three different views can be seen:

GENE POOL DIVERSITY shows the average frequency of occurrence of each

element of the current problem's genetic alphabet. Clicking the graph displays a bar graph showing the average frequency of occurrence of each of the independent variables across the entire population of programs. This is a strong indicator of which variables really matter.

VARIABLES: mean frequencies displays the average occurrence of each variable

across the entire population. The small button at the top right shows the chart as a sorted list.

FITNESS vs COMPLEXITY plots the top 300 programs' fitness (y axis) against the

number of internal nodes in each program (x axis). The three charts are used to judge whether some sort of intervention is required to alter the course of evolution.
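
The quantities behind these three charts may be computed along the following lines. This is a sketch only: it assumes programs are stored as nested tuples of the form (operator, child, child), and the function names are hypothetical.

```python
from collections import Counter

def elements(node):
    """Yield every function and terminal appearing in a nested-tuple program."""
    if isinstance(node, tuple):
        yield node[0]
        for child in node[1:]:
            yield from elements(child)
    else:
        yield node

def mean_frequencies(population):
    """Average number of occurrences of each element per program."""
    totals = Counter()
    for program in population:
        totals.update(elements(program))
    return {element: count / len(population)
            for element, count in totals.items()}

def internal_nodes(node):
    """Number of operator (non-terminal) nodes, used as the complexity measure."""
    if not isinstance(node, tuple):
        return 0
    return 1 + sum(internal_nodes(child) for child in node[1:])
```

Restricting the result of `mean_frequencies` to variable names gives the variables chart, while pairing each program's fitness with `internal_nodes` gives the points of the fitness against complexity plot.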

Once a model has evolved that appears to meet your objectives it will be important to

gain some understanding of its potential performance on real world mailing lists or corporate databases. Unfortunately, no simple formula exists for calculating the true variances of the Cum Lifts. For this reason, a simulation facility called the Bootstrap is used which repeatedly draws random samples (with replacement) of a predetermined size from the source data and

recalculates the decile Cum Lifts for every sample. The process may take a few minutes if

the sample sizes are in the millions. Your data file may hold only 25,000 records, but if you are planning a mailing shot of 2,000,000, then you should set a bootstrap sample size of 2,000,000. The number of bootstrap samples drawn should be at least 30. On completion of sampling, the Bootstrap results are displayed in a table. The default confidence interval for upper and lower Cum Lifts is set at 80%. You can, however, change this by choosing from

the confidence% menu which offers settings of 95%, 90%, 80%, 75% and 66.66%. In the

present context, 80% is a very sound basis for decision taking.
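
One possible form of the Bootstrap facility is sketched below in Python for the top decile only. The sketch assumes that cum lift is indexed so that 100 equals the file average and that the upper and lower bounds are taken as percentiles of the sorted bootstrap results; the description does not state how the interval is computed, so the percentile method and all names here are assumptions.

```python
import math
import random

def top_decile_cum_lift(sample):
    """Cum lift (index 100 = average) of the top 10% of (score, response) pairs."""
    ranked = sorted(sample, key=lambda rec: rec[0], reverse=True)
    depth = max(1, len(ranked) // 10)
    top_rate = sum(resp for _, resp in ranked[:depth]) / depth
    overall_rate = sum(resp for _, resp in ranked) / len(ranked)
    return 100.0 * top_rate / overall_rate

def bootstrap_cum_lift(scores, responses, sample_size, n_samples=30,
                       confidence=0.80, seed=None):
    """Return assumed percentile (lower, upper) bounds on the top-decile cum lift."""
    if sum(responses) == 0:
        raise ValueError("source data contains no responders")
    rng = random.Random(seed)
    records = list(zip(scores, responses))
    lifts = []
    while len(lifts) < n_samples:
        sample = rng.choices(records, k=sample_size)  # draw with replacement
        if any(resp for _, resp in sample):           # lift undefined otherwise
            lifts.append(top_decile_cum_lift(sample))
    lifts.sort()
    tail = (1.0 - confidence) / 2.0
    lower = lifts[math.floor(tail * (n_samples - 1))]
    upper = lifts[math.ceil((1.0 - tail) * (n_samples - 1))]
    return lower, upper
```

Because sampling is with replacement, the sample size may exceed the number of records in the source data, as in the example of a 25,000-record file and a planned mailing of 2,000,000.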

While this invention has been particularly described and illustrated with reference to a

preferred embodiment, it will be understood by one of skill in the art that changes in the above description or illustrations may be made with respect to form or detail without

departing from the spirit and scope of the invention.

Claims

We claim:

1. A method for database analysis, comprising: creating a database including measured customer responses;

creating an initial generation of a plurality of models to simulate said measured customer responses in said database, each of said models consisting of at least one operator selected from a defined set of operators;

evaluating a fitness function to quantify the difference between said measured and simulated customer responses from said initial generation of models;

creating a graphical user interface for displaying a decile analysis of said simulated customer responses;

using said evaluated fitness function and said decile analysis to breed a plurality of model generations wherein one of said models of said plurality of model generations optimizes said fitness function.

2. A method, as in claim 1, wherein:

said interface displays the best of generation model form adjacent said decile analysis.

3. A method, as in claim 2, wherein:

said interface edits said model form and said decile analysis is recalculated to

reflect said edits to said model form.

4. A method, as in claim 3, wherein:

said interface uses point and click cursor control operations to edit said model

form.

5. A method, as in claim 1, wherein:

said interface edits the number and type of operators in said defined set during

said breeding.

6. A method, as in claim 2, wherein

said interface displays a genetic diversity window adjacent said decile analysis window wherein said genetic diversity window displays the frequency with which independent variables occur throughout said model generations.

7. A method, as in claim 1, further comprising:

selecting a plurality of random samples of database entries having a predetermined size; and

recalculating said decile analysis on each of said plurality of random samples.

8. A system for database analysis, comprising:

a central processing unit coupled to a memory system and a display wherein

said central processing unit operates according to a program retrieved from

said memory system;

said program instructs said central processing unit to retrieve data from said

memory;

said program creates a database including measured customer responses;

said program creates an initial generation of a plurality of models to simulate said measured customer responses in said database, each of said models

consisting of at least one operator selected from a defined set of operators;

said program evaluates a fitness function to quantify the difference between

said measured and simulated customer responses from said initial generation of models;

said program creates a graphical user interface for displaying a decile analysis

of said simulated customer responses;

said program uses said evaluated fitness function and said decile analysis to breed a plurality of model generations wherein one of said models of said

plurality of model generations optimizes said fitness function.

9. A system, as in claim 8, wherein:

10. A system, as in claim 9, wherein:

said interface edits said model form and said decile analysis is recalculated to reflect said edits to said model form.

11. A system, as in claim 10, wherein:

said interface uses point and click cursor control operations to edit said model form.

12. A system, as in claim 8, wherein: said interface edits the number and type of operators in said defined set during said breeding.

13. A system, as in claim 9, wherein said interface displays a genetic diversity window adjacent said decile analysis

window wherein said genetic diversity window displays the frequency with which independent variables occur throughout said model generations.

14. A system, as in claim 8, wherein:

said program selects a plurality of random samples of database entries having a predetermined size; and said program recalculates said decile analysis on each of said plurality of

random samples.