WO2002017112A1 - Genetic programming for performing direct marketing - Google Patents

Publication number
WO2002017112A1
WO2002017112A1 PCT/US2001/026216 US0126216W WO0217112A1 WO 2002017112 A1 WO2002017112 A1 WO 2002017112A1 US 0126216 W US0126216 W US 0126216W WO 0217112 A1 WO0217112 A1 WO 0217112A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
fitness
program
models
interface
Prior art date
Application number
PCT/US2001/026216
Other languages
French (fr)
Original Assignee
Minetech, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Minetech, Inc.
Priority to AU2001285191A
Publication of WO2002017112A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Definitions

  • a fitness function is the difference between the results of the analysis function under consideration at any one time and a known ideal response. For example, the measured
  • quality of response for one individual of the training database may be X, while the function
  • a composite fitness function of the fitness values for the training set with respect to characteristics c1 through c4 is then generated.
  • the fitness function reduces to a single number such as an average or statistical mean over the elements of the training set.
  • the fitness function could also be a vector, analytical or discrete
  • the genetic program technique creates, evaluates and re-creates new functions until the fitness function over all the program forms is minimized.
  • Figure 2 is a flow-chart of the genetic programming process of the present invention.
  • the process 100 starts by the Create Initial Population 102 step which creates a number of programs (typically randomly).
  • a program is simply a mathematical function constructed
  • Figure 3 illustrates a graphical representation of
  • figure 4 illustrates an interface for a genetic alphabet
  • selector as used in a genetic program process. Through this interface, a user can select the
  • custom function button 390 By selecting the custom function button 390, a user can create a
  • Program 106 wherein each program is executed. That is, one of the randomly generated functions is chosen, and that function is evaluated for each individual in the training database. As a result, each individual has an assigned numerical value (step 112) for the function being
  • the termination test 104 is evaluated (for example, achieving a known best solution or achieving a certain degree of improvement in average fitness for the
  • Relatively Low Fitness is used to connote either selection based on a probability proportionate to normalized fitness values or selection based on equal probability among individuals having fitness outside some
  • Step 114 causes the removal of the less fit members of the program population from being used to breed the next generation of functions. Step 114 improves the average fitness and eases memory requirements by keeping the working
  • Step 116 Select Program With High Fitness, then picks at least one program which
  • step 118 Choose an Operation to Perform
  • a random number generator selects between the various operation choices.
  • the output of the random number generator is weighted to select one or more of the operation choices more
  • 118 may receive instructions from a user through the user interface 200. These instructions may dictate a part of the form of the program generation and so enable the user to guide the program generation process.
  • weighting of the process might specify that the crossover operation is chosen 700 times, reproduction is chosen 250 times and permutation is chosen 50 times.
  • the genetic process parameters might also specify that mutation of a single node occur with a probability of
  • Q*N*p alleles will be mutated (in a population of size N). If Q is 10, then 1 node out of 10,000 alleles in a population of 1,000 individuals will be altered as a result of the mutation
  • the sexual reproduction crossover operation 120 requires a group of at least two
  • program(s) are picked to mate with the chosen program(s) from step 118.
  • the mate is typically, the mate
  • crossover points divide each of the parents into first and second parts.
  • Mating involves connecting the first part of parent A with the second part of parent B and vice versa. Accordingly, two parents would produce two offspring. There are two variants of
  • the first is the one point crossover in which the first part of parent A is identical with the first part of parent B.
  • each parent would have two crossover points selected for it and there would be
  • If there are N parents, then N − 1 crossover points would be selected for each and there would be N^N − N new offspring available. When an operation produces more offspring than parents,
  • the population can be allowed to grow or the population can be trimmed back to a
  • a method for selecting one of these three computational procedures for reproduction is to select them with a probability proportional to their normalized fitness.
  • a preferred method is to select
  • permutation operation 140 selects a Permutation Point 142 from among the internal points
  • one permutation is to switch operators or swap branches of
  • Mutation 154 then randomly generates, for each selected program, a portion of a program and inserts it at the mutation point.
  • the portion inserted is typically a single point
  • the evaluation creates a fitness value associated with the new
  • the GP process returns to the termination test 104.
  • the GP process then iterates through new generations of functions until either a performance criterion is met (e.g., fitness function minimized) or a user selected number of generations has been created and
  • the first step in the iterative process involves activating each program. Activation
  • entities are computer programs, so activation requires executing
  • the second step in the process assigns a fitness value to the objective result, and associates that fitness value with its corresponding entity. For programs concerning direct marketing, the fitness value is
  • the best value (fitness) may be the lowest number (as is the case here where we are measuring the deviation between a result and a
  • the best value (fitness) may be the highest
  • the value (fitness) assigned may be a single numerical value or a vector of values, although it is often most convenient that it be a single numerical
  • a useful method for organizing raw fitness values involves normalizing the raw values and then calculating probabilities based on the normalized values.
  • the best raw fitness value is assigned a normalized fitness of 1
  • the worst value is assigned a value of 0
  • all intermediate raw values are assigned in the range of 0 to 1.
  • the individual's normalized fitness value divided by the sum of all the normalized fitness values of the population.
  • the normalized fitness values range between 0 and 1, with a value of 1 associated with the best fitness and a value of 0 associated with the worst,
  • the average (or other group measure)
  • the population exhibits the ability to robustly and relatively quickly deal with changes in the data set characteristics.
  • the variety in the population lowers its overall average fitness value; additionally, the population's variety gives the population an ability to robustly adapt to
  • one way to determine breeding probability is to select the program with the highest
  • a threshold as some number of standard deviations from the mean (selecting for example, all individuals whose fitness is one standard deviation from the mean fitness).
  • breeding parameters are modified through the Choose Operation step 118.
  • the possible operations include crossover,
  • User Interface 200 It is a graphical user interface in which the model form
  • window 250 is displayed along with one or more measures of the effectiveness of the model.
  • the Fitness History window 220 and cum lift window 230 are graphically displayed.
  • the User Interface 200 works in conjunction with the genetic algorithm of figure 2 to
  • the genetic program of figure 2 creates an initial population of models and then uses the programs with the best
  • the user interface 200 allows a user to alter
  • Such edits may be based on the
  • the user interface 200 displays a graphic representation of the program in the model form window 250 along with the fitness history in window 220 and
  • the fitness history 220 is a graphical display of the calculated fitness values for each program that has been displayed in the model form window 250. Specifically, if the GP process is evaluating the Nth program in step 160, the iterative process of the GP has already calculated (N-1) fitness values for (N-1) previous
  • While the user interface may simply track fitness values, it may also retain a vector of
  • the interface may also retain a vector for each fitness value such that selection of a fitness value along the fitness history curve prompts the
  • the cum lift display is a representation of the cum lift by decile for the Nth program of step
  • the display in window 230 may alternatively be a table or bar chart or any
  • the interface may also store the
  • the combination of the display of the Nth model in window 250 with the fitness history in window 220 and the cum lift in window 230 enables the user of the GP
  • the GP process will finish evaluating the (N-1) model and will input at step 180 (figure 2) the selected portions of the (N-1) model to be used on the Nth and follow on models.
  • the user may set a parameter such that the section
  • the similar section of the Nth model may need to be retained (or edited out) is that the user can detect the raised cum lift for a particular model and also, by selecting the fitness history, determine which models might be similar (or different) than the current Nth model. Where high (or low) fitness models have similar sections to the Nth model, the similar section of the Nth model can then
  • Figure 6 illustrates another embodiment of the interface of the present invention.
  • the view (or model form) window 250 is displayed alongside the fitness history
  • the fitness history window 220 plots the
  • the view window 250 automatically displays the current best-of-generation program. In this window you can view any one individual program in the population. This window has
  • the view window contains a small cluster of menu items that allow you to step forwards and backwards throughout the entire population, to stretch and shrink the
  • the view window menu items include:
  • Horz+ Stretch horizontally a bit
  • the view window buttons include:
  • Short Names toggles the displayed variable names between the given names and the
  • Font+ Font- increase and decrease the font size of the displayed program.
  • This window enables a user to study cross-sections of the
  • the node output diagnostics window is available in the view window and it displays the detailed cumulative average response in fine detail.
  • the vertical cursor may be moved with the spin control and the position in terms of depth-of-file
  • model-output-score at that position is displayed in the information boxes at the top.
  • the view window includes diagnostic boxes. These boxes show: the
  • the decile analysis window 235 is one of four display modes available for viewing
  • region 235 cycles the display through three alternative views of the fit.
  • the gene pool diversity window 270 is one of three graphs.
  • the bar chart shows the average frequency of occurrence across the entire population of each element in the genetic alphabet. It is used to monitor the loss of genetic diversity. As evolution progresses and
  • This window allows a user to see how things are
  • GENE POOL DIVERSITY shows the average frequency of occurrence of each
  • VARIABLES mean frequencies displays the average occurrence of each variable
  • the small button at the top right shows the chart as a sorted list.
  • the three charts are used to judge whether some sort of intervention is required to alter the course of evolution.
  • the number of bootstrap samples drawn should be at least 30.
  • results are displayed in a table.
  • the default confidence interval for upper and lower Cum Lifts is set at 80%. You can, however, change this by choosing from

Abstract

A genetic algorithm process for analyzing direct marketing problems (Fig. 2). The process incorporates the steps of creating and evaluating an initial model population against a training database; creating a working model population by removing models from the initial population which have relatively low fitness functions (114); selecting a model with relatively high fitness as a base for follow-on model generation (116); and iteratively modifying that selected model to generate a set of models from which the highest value fitness function is selected. The automatic genetic program process operates in combination with an interface (200) which displays a current model form along with the genetic process fitness history and cum lift values. The interface interacts with the automatic genetic program process to allow a user to modify the form of the model being used as a base for the genetic program process. Because the user has access to the fitness history and cum lift evaluations of the models, the user can guide the model creation process to more effectively analyze direct marketing problems.

Description

Title
Genetic Programming for Performing Direct Marketing
Field of The Invention
The present invention relates to the field of computer assisted problem solving. In
particular, the present invention relates to a system for applying genetic programming to the
problem of maximizing the response rate to a direct marketing campaign.
Background of The Invention
Direct marketing involves the direct sales of goods and/or services to large numbers
of individual customers. Generally, a marketing representative will contact selected individual customers from a vast number of potential customers. Because it is impossible or
uneconomical (or both) to contact all the potential customers and because a marketer only desires to contact those customers most likely to respond positively to a marketing campaign,
some selection of potential customers must be made. Making the proper selection of highly
responsive potential customers is often the critical difference between successful and
unsuccessful campaigns.
The tools available to assist in the customer selection include a database having individual customer names in a vector with various characteristics of the customer (for example, age, gender, income, zip code, etc.) which may be analyzed with conventional
discriminant analysis, logistic regression and/or ordinary least squares regression. A technique for analyzing the database using genetic programming is also available. A genetic
program ("GP") is a series of mathematical procedures for search and optimization that
transforms a set of individuals, represented as mathematical objects, into a new set of individuals using operations patterned after the principle of the survival of the fittest. In short, the GP process randomly generates a set of mathematical functions or programs which
can evaluate the mathematical representation of individuals with respect to a certain fitness function, and then systematically modifies that random set of functions using reproduction,
crossover and mutation operators until the initial random set of functions is transformed into
a set of individuals wherein the fitness function is maximized. In general, GPs incorporate an
indeterminate number of operators.
In the GP process, the fitness function is essentially a function that scores or evaluates
any one program with respect to a desired response. That is, the fitness function measures the
difference between a desired or optimum response and the response generated by the program at issue. The GP process also includes several types of operators. The reproduction operator is a process which duplicates or copies functions depending on their fitness scores; the crossover
operator combines two functions depending on their fitness scores; and the mutation operator randomly changes functions. In the GP process, the reproduction, crossover and mutation
operators are recursively applied to the initial random set of functions until this recursive
application creates a set of functions optimized with respect to the fitness function.
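The loop described above can be illustrated with a minimal sketch. It is not the implementation of the present invention: the genetic alphabet, the tuple-based tree encoding, the depth cap, the operator probabilities and every function name below are assumptions chosen for illustration. Fitness is taken here as the mean absolute deviation from the desired response, so lower values are better.

```python
import operator
import random

# Hypothetical genetic alphabet: binary operators plus feature names and constants.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x0", "x1", 1.0, 2.0]
MAX_DEPTH = 6  # assumed cap so offspring cannot grow without bound


def random_program(rng, depth=3):
    """Grow a random expression tree as nested tuples (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    return (op, random_program(rng, depth - 1), random_program(rng, depth - 1))


def evaluate(program, row):
    """Apply a program to one individual's characteristics (a dict)."""
    if isinstance(program, tuple):
        op, left, right = program
        return FUNCTIONS[op](evaluate(left, row), evaluate(right, row))
    return row[program] if isinstance(program, str) else program


def fitness(program, training):
    """Mean absolute deviation from the desired response; lower is better."""
    return sum(abs(evaluate(program, row) - y) for row, y in training) / len(training)


def depth(program):
    """Number of operator levels in the tree (0 for a bare terminal)."""
    if not isinstance(program, tuple):
        return 0
    return 1 + max(depth(program[1]), depth(program[2]))


def paths(program, prefix=()):
    """Return the index path to every node of the tree."""
    found = [prefix]
    if isinstance(program, tuple):
        found += paths(program[1], prefix + (1,)) + paths(program[2], prefix + (2,))
    return found


def subtree(program, path):
    """Follow an index path down to the node it names."""
    for index in path:
        program = program[index]
    return program


def replace(program, path, new):
    """Return a copy of program with the node at path swapped for new."""
    if not path:
        return new
    parts = list(program)
    parts[path[0]] = replace(program[path[0]], path[1:], new)
    return tuple(parts)


def crossover(rng, a, b):
    """Graft a random subtree of b onto a random point of a."""
    return replace(a, rng.choice(paths(a)), subtree(b, rng.choice(paths(b))))


def mutate(rng, program):
    """Replace a random node with a freshly generated subtree."""
    return replace(program, rng.choice(paths(program)), random_program(rng, 2))


def evolve(training, pop_size=40, generations=20, seed=0):
    """Run the generate, score, select and vary loop; return the fittest program."""
    rng = random.Random(seed)
    population = [random_program(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, training))
        survivors = population[: pop_size // 2]  # discard the less fit half
        children = []
        while len(children) < pop_size - len(survivors):
            draw = rng.random()
            if draw < 0.70:
                child = crossover(rng, rng.choice(survivors), rng.choice(survivors))
            elif draw < 0.95:
                child = rng.choice(survivors)  # reproduction: copy unchanged
            else:
                child = mutate(rng, rng.choice(survivors))
            if depth(child) <= MAX_DEPTH:
                children.append(child)
        population = survivors + children
    return min(population, key=lambda p: fitness(p, training))
```

Calling `evolve` with a list of `(characteristics, response)` pairs, such as `[({"x0": a, "x1": b}, 2 * a + b) for a in range(5) for b in range(5)]`, returns the fittest tree found. Because the fitter half of each generation is carried forward unchanged, the best fitness in the population never gets worse from one generation to the next.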
The use of techniques implementing GPs is important because such techniques do not depend on an a priori parameter selection or parametric assumption as in other conventional
statistical analysis. Because GP techniques are not based on assumptions concerning the form or size of the initial function set (e.g., GPs do not require a random or normal distribution of the individual customer characteristics), the GP techniques are widely applicable. Moreover, they have been shown to be effective on large datasets and
optimization problems. The difficulty with GP techniques, however, is that they are based on
calculations with respect to an empirically defined fitness function. As a result, there is little
or no feedback to the user during the calculation process. More importantly, external restrictions on the form of the model generated are not conveniently implemented.
Accordingly, a tool for providing feedback and guiding the GP optimization process is
desirable.
Objects of The Invention
It is an object of the present invention to use genetic program techniques to determine
the likely response profile for a direct marketing campaign.
It is another object of the present invention to provide feedback concerning the
optimization process to a user of genetic program techniques.
It is still a further object of the present invention to create an interface for using
genetic program techniques that graphically illustrates the optimization process.
It is still another object of the present invention to create an interface for editing the genetic program optimization sequence and graphically representing that edited sequence.
It is still a further object of the present invention to create an interface for editing the
genetic program sequence and graphically displaying the effect of the edited sequence on the fitness function.
Summary of The Invention
The present invention is a genetic algorithm process for analyzing direct marketing
problems. The process incorporates the steps of creating and evaluating an initial model population against a training database; creating a working model population by removing
models from the initial population which have relatively low fitness functions; selecting a model with relatively high fitness as a base for follow-on model generation; and iteratively modifying that selected model to generate a set of models from which the highest value
fitness function is selected. The automatic genetic program process operates in combination
with an interface which displays a current model form along with the genetic process fitness
history and cum lift values. The interface interacts with the automatic genetic program process to allow a user to modify the form of the model being used as a base for the genetic
program process. Because the user has access to the fitness history and cum lift evaluations of the models, the user can guide the model creation process to more effectively analyze
specific direct marketing problems.
Brief Description of The Drawings
Figure 1 illustrates a computer system according to one embodiment of the present
invention.
Figure 2 illustrates a sequence of process steps for carrying out a genetic program
according to one embodiment of the present invention.
Figure 3 illustrates a node structure for representing a function according to one embodiment of the present invention.
Figure 4 illustrates a display for an operator selection interface to a genetic program
process according to one embodiment of the present invention.
Figure 5 illustrates a display for an interface to a genetic program process according to
one embodiment of the present invention.
Figure 6 illustrates a display for an interface to a genetic program process according to
one embodiment of the present invention.
Detailed Description of The Invention
Figure 1 illustrates a computer system according to one embodiment of the present invention. Computer 20 comprises a central processing unit (CPU) 30 and main memory 40.
Computer 20 is coupled to an Input/Output (I/O) system 10, disk storage unit 50, and network connections 55, 57 and 59. The I/O system 10 includes a display 5, a keyboard 7 and a cursor
control device (e.g., mouse, trackball, etc.). In general, the disk storage unit 50 stores a series
of instructions in a program and associated data for operating the computer system. The instructions and data are retrieved from disk storage 50 and stored in main memory 40. The CPU 30 retrieves instructions and associated data from main memory 40 and then executes
those instructions as defined by the program. The computer system 20 may also retrieve program instructions and/or data from world wide web connection 55, intranet or local area
network (LAN) connection 57 or other external network connection 59. The computer
system 20 may retrieve and/or send instructions and/or data by encoding such data on
carrier signals and transmitting those carrier signals over the network connections. These
connections and methods for transferring data or instructions to the computer system 20 are
well known to those of skill in the art.
In the field of direct marketing, a marketer begins with a large list of potential
individual customers, each individual customer associated with a vector of characteristics such as, for example, age, income, gender, education or number of children. Each of the
characteristics in this vector can be represented numerically (e.g., gender = 1/0, education = number of years, etc.). This list of potential customers and associated characteristics is referred to as a customer database. Given unlimited resources, the marketer would directly
contact all individuals in the customer database. Because the marketer has resource limitations, however, the objective of the marketer is to select that subset of potential
customers (wherein the size of the subset is a function of the resource constraint) which has the highest likelihood of favorably responding to a direct marketing campaign.
In order to achieve this marketing objective, the marketer analyzes the customer
database using a computer implementing an analysis program. Accordingly, the marketer must create an analysis function which, when applied to the customer database, generates the
desired response. For example, the analysis function identifies those customers most likely to
respond to a marketing campaign or identifies those customers most likely to respond and
spend the most money. To create such an analysis function, the marketer creates a training database which is a subset of the customer database wherein actual responses and the quality
of those responses have been measured and quantified. Of course, the training database could
be as large as the customer database. An analysis function may then be created to model the actual responses using the training database because the performance of the analysis function
can be measured against actual results. Once an analysis function has been created, it may
then be applied to the customer database. For example, a customer database having
1,000,000 entries may be available to a marketer but the marketer may only be able to
measure response rates to a certain marketing campaign on 5,000 individuals. Accordingly
then, the marketer would use the measured response from the 5,000 individual sample and create an analysis function that predicted the response rates in that sample. The marketer
would then apply that resulting analysis function to the 1,000,000 entry customer database.
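The final step, applying a finished analysis function to the customer database and keeping only as many customers as resources allow, can be sketched as follows. The field names, the scoring formula and the `budget` parameter are hypothetical; in practice the analysis function would be the one fitted on the training database.

```python
def select_customers(customer_db, analysis_fn, budget):
    """Return the `budget` customers with the highest analysis-function score."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(customer_db, key=analysis_fn, reverse=True)
    return ranked[:budget]


def example_analysis_fn(customer):
    """Hypothetical scoring formula standing in for a fitted analysis function."""
    return 0.001 * customer["income"] + 2.0 * customer["children"] - 0.1 * customer["age"]


# Each customer is a vector of numerically encoded characteristics (gender = 1/0).
customer_db = [
    {"id": 1, "age": 30, "income": 40000, "gender": 1, "children": 0},
    {"id": 2, "age": 45, "income": 90000, "gender": 0, "children": 2},
    {"id": 3, "age": 60, "income": 50000, "gender": 1, "children": 1},
    {"id": 4, "age": 25, "income": 20000, "gender": 0, "children": 0},
]

selected = select_customers(customer_db, example_analysis_fn, budget=2)
print([c["id"] for c in selected])  # [2, 3]
```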
To generate an analysis function, direct marketers need a technique to gauge known responses. One measure is the Proportion of Total Correct Classifications (PTCC) which is
calculated as a cross-tabulation. For example, if a sample contains 100 individuals with a
15% actual response rate and a 24% predicted response rate, and the model correctly predicts
74 nonresponders and 13 responders, then the PTCC value is (74+13)/100 = 87%. Although the PTCC value is frequently used, it is not appropriate for
many problems such as, for example, when the assessment criterion imposes a penalty for misclassifications.
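The PTCC arithmetic in this example can be checked with a short sketch. The function name `ptcc` is an assumption. The 2 missed responders and 11 false alarms are not stated in the text; they follow from the stated figures (15 actual responders less 13 found, and 24 predicted responders less 13 correct).

```python
def ptcc(actual, predicted):
    """Proportion of Total Correct Classifications for binary labels."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# 13 responders found, 2 responders missed, 11 false alarms, 74 nonresponders found.
actual = [1] * 13 + [1] * 2 + [0] * 11 + [0] * 74
predicted = [1] * 13 + [0] * 2 + [1] * 11 + [0] * 74

print(sum(actual), sum(predicted))  # 15 24
print(ptcc(actual, predicted))      # 0.87
```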
For direct marketers, a more meaningful measure of the response model is referred to as the cum lift. As direct marketers desire to identify individuals most likely to respond to a solicitation, they want to measure the number of responses, beyond a random selection of
individuals, most likely to respond to a solicitation. The cum lift is, then, an index of how
many more responses are expected with a selection based on a model over the responses to be
expected with a random selection of individuals. The following process, for example, may be used to calculate the cum lift: i) for the form of model under consideration, score a test
sample, assigning each individual a predicted probability of response (PPR); ii) rank the scored file by PPR; iii) divide the ranked file into ten ranked groups (group 1 receives the top 10%, group 2 receives the next 10%, etc.); iv) identify the number of actual responses ($AR_j$, $j = 1, \ldots, 10$) identified
by the model in each decile; v) calculate the decile response rate as $AR_j$ divided by the number of
individuals in the scored file decile; vi) calculate the cumulative response rate for depth of file
(CR) as $\sum_{j} AR_j$, summed over the deciles down to that depth, divided by the total number of individuals in those deciles; and vii) calculate cum lift as
(CR) divided by the total response rate multiplied by 100.
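The seven steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the sample data and the use of integer division to size the deciles are assumptions. The cumulative response rate is taken over the individuals in the deciles down to each depth, so that a random ranking gives a cum lift of about 100.

```python
def cum_lift_by_decile(scores, responses):
    """Return ten cum lift values, one for each depth of file (10%, 20%, ..., 100%)."""
    n = len(scores)
    if n < 10 or n != len(responses):
        raise ValueError("need at least 10 scored individuals and one response each")
    total_rate = sum(responses) / n
    if total_rate == 0:
        raise ValueError("no responses in the sample, so cum lift is undefined")
    # Steps (i)-(ii): rank individuals by predicted probability, highest first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    ranked = [responses[i] for i in order]
    lifts = []
    for decile in range(1, 11):
        depth = n * decile // 10                      # individuals in deciles 1..decile
        cumulative_rate = sum(ranked[:depth]) / depth  # step (vi)
        lifts.append(100 * cumulative_rate / total_rate)  # step (vii)
    return lifts

# Hypothetical test sample of 20 individuals, already listed in score order.
scores = [s / 20 for s in range(20, 0, -1)]
responses = [1, 1, 0, 1] + [0] * 15 + [1]
lifts = cum_lift_by_decile(scores, responses)
print([round(x) for x in lifts])
```

In this made-up sample both individuals in the top decile respond while the overall response rate is 20%, so the top-decile cum lift rounds to 500; at full depth of file the value returns to 100.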
As shown in this example, the cum lift is a measure of how much better a model
predicts the actual response rate in comparison to a random sample. For example, a cum lift of 294 for the top decile means that when soliciting the top 10% of a customer pool identified
by a model, the total number of responses is 2.94 times the number of responders expected by random sampling. While the cum lift analysis works well for binary responders (e.g.,
whether customer responds or not), it also works well for continuous response data (e.g., where responders spend differing amounts of money and the objective is to optimize the
response of high spenders).
Given a response measurement tool such as cum lift, the marketer must then determine
the analysis function. According to one embodiment of the present invention, such an analysis function is created by using a genetic algorithm in combination with an interactive function editing interface. That is, the genetic algorithm operates to evaluate, in an essentially random order, a large number of possible analysis functions. As any one of these
possible functions is being evaluated, the form of that function is graphically depicted on a
computer display. By using an interactive function editing interface, the form of the analysis
function eventually created in the genetic program process can be edited as the evaluation
process is ongoing. As a result, the genetic program process can be guided to create an analysis function more specifically tailored to a specific marketing problem.
The evaluation of any one analysis function is determined with respect to a fitness function. A fitness function is the difference between the results of the analysis function under consideration at any one time and a known ideal response. For example, the measured
quality of response for one individual of the training database may be $X_1$, while the function
under consideration $f(c_1, c_2, c_3, c_4)$ (wherein $c_1$ through $c_4$ are numeric representations of age, gender,
income and education, respectively) generates a value of $X_2$. The fitness value for that
individual is $X_2 - X_1$. Such a fitness value is then calculated for every individual of a training
set from the customer database. A composite fitness function of the fitness values for the training set with respect to characteristics $c_1$ through $c_4$ is then generated. Preferably, the
fitness function reduces to a single number such as an average or statistical mean over the elements of the training set. The fitness function could also be a vector, analytical or discrete
function. The genetic program technique creates, evaluates and re-creates new functions until the fitness function over all the program forms is minimized.
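A composite fitness of this kind can be sketched in a few lines. The training records, the measured responses and the candidate model below are hypothetical. The sketch also assumes the absolute value of the per-individual deviation is averaged, so that over- and under-predictions do not cancel; the text itself only specifies the difference between the two values.

```python
def composite_fitness(model, individuals, measured):
    """Mean absolute deviation between model output (X2) and measured response (X1)."""
    deviations = [abs(model(*c) - x1) for c, x1 in zip(individuals, measured)]
    return sum(deviations) / len(deviations)

# Hypothetical training set: (age, gender, income, education) per individual.
individuals = [(25, 1, 65, 14), (40, 0, 80, 16)]
measured = [30.0, 50.0]  # hypothetical measured quality of response

def candidate(c1, c2, c3, c4):
    # One candidate analysis function; lower composite fitness is better here.
    return 2 * c1 + 3 * c2 - 0.1 * c3 - c4

fitness = composite_fitness(candidate, individuals, measured)
print(fitness)
```

The candidate outputs 32.5 and 56 for the two records, giving deviations of 2.5 and 6 and a composite fitness of 4.25.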
Figure 2 is a flow-chart of the genetic programming process of the present invention.
The process 100 starts by the Create Initial Population 102 step which creates a number of programs (typically randomly). A program is simply a mathematical function constructed
from a selection of a set of mathematical operators and from the characteristics associated with each individual of the training (or customer) database. For example, where an individual has characteristics $c_1$ (age) = 25, $c_2$ (gender) = 1, $c_3$ (income) = 65 and $c_4$ (education) = 14, one function $f(c_1, c_2, c_3, c_4)$ for that individual may be
$2 \times (\text{age}) + 3 \times (\text{gender}) - 0.1 \times (\text{income}) - (\text{education}) = 32.5$. The form of the function can be any
combination of one or more mathematical operators chosen from a set including but not limited to multiplication, addition, subtraction, division, sine, cosine, tangent, exponential,
log, powers, summation, or tabular representation. Figure 3 illustrates a graphical
representation of the function f, wherein the characteristics (305, 307, 309 and 311) are combined (313, 315, 317, 319, 321 and 323) in a node structure. The set of operators used in
constructing functions according to the genetic programming technique is selected or constructed by a user. For example, figure 4 illustrates an interface for a genetic alphabet
selector as used in a genetic program process. Through this interface, a user can select the
types of operators (e.g., arithmetic 330, circular 340, numeric 350, logicals 360, hyperbolic
370 or custom 380). By selecting the custom function button 390, a user can create a
specialized operator which may then be used as part of the set of operators used in the genetic programming technique.
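To make the node structure concrete, the example function can be written as a small expression tree and evaluated for one individual. The nested-tuple encoding, the operator names and the `evaluate` helper are illustrative assumptions; only the function itself and the characteristic values come from the text.

```python
import math
import operator

# A small "genetic alphabet": operator name -> callable (assumed subset).
OPERATORS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "sin": math.sin,
}

def evaluate(node, record):
    """Evaluate a program tree for one individual's characteristics."""
    if isinstance(node, tuple):          # internal node: (operator, child, ...)
        op, *args = node
        return OPERATORS[op](*(evaluate(a, record) for a in args))
    if isinstance(node, str):            # terminal: a named characteristic
        return record[node]
    return node                          # terminal: a numeric constant

# 2 x age + 3 x gender - 0.1 x income - education
program = ("sub",
           ("sub",
            ("add", ("mul", 2, "age"), ("mul", 3, "gender")),
            ("mul", 0.1, "income")),
           "education")

individual = {"age": 25, "gender": 1, "income": 65, "education": 14}
print(evaluate(program, individual))
```

Adding a custom operator then amounts to registering one more entry in `OPERATORS`, which mirrors the role of the custom function button in the alphabet selector.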
Referring back to figure 2, the GP process continues with the step Execute Each
Program 106 wherein each program is executed. That is, one of the randomly generated functions is chosen, and that function is evaluated for each individual in the training database. As a result, each individual has an assigned numerical value (step 112) for the function being
evaluated. This numerical value is then compared with a measured response value to
generate a fitness value for each individual in the training database. A composite fitness
function, here the average of individual fitness values, is then generated. While the fitness function characterizes the difference between the modelled and measured response, this
function will vary depending on the problem and the characteristics of the data being modelled. Those of skill in the art recognize the variety of ways in which such a function can be created.
Once a fitness function has been generated, the basic iterative loop of the genetic
program process begins. Here, the termination test 104 is evaluated (for example, achieving a known best solution or achieving a certain degree of improvement in average fitness for the
working population), and if satisfied, the process ENDS 101. If not, the next step is to
Remove Program(s) with Relatively Low Fitness. The phrase "Relatively Low Fitness" is used to connote either selection based on a probability proportionate to normalized fitness values or selection based on equal probability among individuals having fitness outside some
defined threshold (e.g., less than 0.5). Step 114 causes the removal of the less fit members of the program population from being used to breed the next generation of functions. Step 114 improves the average fitness and eases memory requirements by keeping the working
population within reasonable limits.
Step 116, Select Program With High Fitness, then picks at least one program which
has a relatively high fitness value (compared to other programs) as a basis for creating new
programs. Using this chosen program, step 118, Choose an Operation to Perform, then
determines which operation will be used to modify the chosen program. In this step, a random number generator selects between the various operation choices. The output of the random number generator is weighted to select one or more of the operation choices more
often than other of such operation choices. In addition, as discussed more fully below, step
118 may receive instructions from a user through the user interface 200. These instructions may dictate a part of the form of the program generation and so enable the user to guide the program generation process.
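The weighted random choice in step 118 can be sketched with the standard library. The weights below borrow the 700/250/50 split that the description quotes for a population of 1,000 choices; the function name and the fixed seed are assumptions made only so the sketch is repeatable.

```python
import random

# Relative weights for each operation (700/250/50 split quoted in the description).
OPERATION_WEIGHTS = {"crossover": 700, "reproduction": 250, "permutation": 50}

def choose_operation(rng, weights=OPERATION_WEIGHTS):
    """Pick one operation with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)  # fixed seed only for repeatability of the sketch
counts = {name: 0 for name in OPERATION_WEIGHTS}
for _ in range(1000):
    counts[choose_operation(rng)] += 1
print(counts)
```

A user instruction received through the interface could be modelled as a temporary override of these weights, although the text does not specify how such instructions are encoded.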
Crossover 120 (sexual reproduction) and Reproduction 130 (asexual reproduction) are two basic operations that may be performed in step 118. Permutation 140 and mutation 150
(asexual reproduction) also play a role. Typically, the vast majority of chosen operations are the reproduction and crossover operations. For example, in a population of 1,000 choices, the
particular weighting of the process might specify that the crossover operation is chosen 700 times, reproduction is chosen 250 times and permutation is chosen 50 times. The preferred
mix of sexual and asexual reproduction is 66% sexual and 34% asexual. The genetic process parameters might also specify that mutation of a single node occur with a probability of
p=0.0001. Thus, if the average individual has Q points at which mutation might occur, a total
of Q*N*p alleles will be mutated (in a population of size N). If Q is 10, then 1 node out of 10,000 alleles in a population of 1,000 individuals will be altered as a result of the mutation
operation. The sexual reproduction crossover operation 120 requires a group of at least two
programs (typically two parents which mate to create two siblings). As a result, second
program(s) are picked to mate with the chosen program(s) from step 118. Typically, the mate
would be the next highest fitness value individual. Of course, the choice of a mate may be
made randomly. For each mating, a crossover point is separately selected at random from
among both internal and external points within each parent at Select Crossover Points 122.
Then newly created programs are produced at Perform Crossover 124 from the mating group using crossover. The crossover points divide each of the parents into first and second parts.
Mating involves connecting the first part of parent A with the second part of parent B and vice versa. Accordingly, two parents would produce two offspring. There are two variants of
this crossover operation. The first is the one point crossover in which the first part of parent A is identical with the first part of parent B. In such a case, the second parts of parents A and
B are swapped. The second variant of the crossover operation evaluates all common points of
parents A and B and exchanges them based on a 50% probability. As a general matter, each of the three versions of the crossover operation is used 33% of the time.
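A two-parent crossover on tree-shaped programs can be sketched as below. The nested-tuple representation and all helper names are assumptions. The sketch swaps whole subtrees at the chosen points, which is one way of exchanging the "parts" of each parent in a node structure, and it treats every internal and terminal point (including the root) as a candidate crossover point.

```python
import random

def paths(tree, prefix=()):
    """Yield the path to every point (internal or terminal) of a nested-tuple program."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def subtree(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(parent_a, parent_b, point_a, point_b):
    """Swap the subtrees at the two crossover points, giving two offspring."""
    child_a = replace(parent_a, point_a, subtree(parent_b, point_b))
    child_b = replace(parent_b, point_b, subtree(parent_a, point_a))
    return child_a, child_b

def random_crossover(parent_a, parent_b, rng):
    """Select one crossover point per parent at random, then cross."""
    point_a = rng.choice(list(paths(parent_a)))
    point_b = rng.choice(list(paths(parent_b)))
    return crossover(parent_a, parent_b, point_a, point_b)

a = ("add", "age", ("mul", 2, "income"))
b = ("sub", "gender", "education")
print(crossover(a, b, (2,), (1,)))
# (('add', 'age', 'gender'), ('sub', ('mul', 2, 'income'), 'education'))
```

Because the parents are immutable tuples, both survive unchanged, and the two offspring keep the population size constant when they replace two removed programs.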
In the GP process, there is no requirement that the population be maintained at a constant size. The version of the crossover operation producing two offspring from two parents has the convenient attribute of maintaining the population at constant size. Other operations each produce one offspring from one parent so that they too maintain constant
population size. On the other hand, if the crossover operation acts on a group of more than
two parents, the size of the population may grow. For example, if three parents formed a mating group, each parent would have two crossover points selected for it and there would be
27 possible offspring ($3 \times 3 \times 3$). Even if the three offspring equivalent to the three original parents are excluded, there would be 24 possible new offspring available. In general, if there
are $N$ parents, then $N - 1$ crossover points would be selected for each and there would be $N^N - N$ new offspring available. When an operation produces more offspring than parents,
then either the population can be allowed to grow or the population can be trimmed back to a
desired (presumably constant) size when the next round of fitness proportionate reproduction
takes place.
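The offspring count for a mating group of N parents can be verified by enumeration. As a simplifying assumption, each parent is modelled here as a flat list of N segments (the pieces left after N - 1 crossover points are chosen) rather than as a tree, and every segment label is distinct.

```python
from itertools import product

def multi_parent_offspring(parents):
    """Enumerate new offspring from N parents, each already split into N segments."""
    n = len(parents)
    children = set()
    # For each segment position, take that segment from any one of the N parents.
    for choice in product(range(n), repeat=n):
        child = tuple(parents[src][pos] for pos, src in enumerate(choice))
        children.add(child)
    originals = {tuple(p) for p in parents}
    return children - originals  # exclude copies of the original parents

parents = [["a1", "a2", "a3"],
           ["b1", "b2", "b3"],
           ["c1", "c2", "c3"]]
offspring = multi_parent_offspring(parents)
print(len(offspring))  # 24
```

With three parents the enumeration yields 27 combinations, of which 3 reproduce the parents, leaving the 24 new offspring stated in the text.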
In asexual reproduction, one parent produces one sibling. Asexual reproduction
involves three operations, reproduction 130, permutation 140 and mutation 150. A method for selecting one of these three computational procedures for reproduction is to select them with a probability proportional to their normalized fitness. A preferred method is to select
mutation 80% of the time (wherein node mutation is selected 6%, terminal mutation is selected 30% and 10% random mutation) and replication is chosen 20% of the time. The
permutation operation 140 selects a Permutation Point 142 from among the internal points
within the selected individual function. Once this point is selected, Permutation 144 is
performed, by reordering the selected program's subprocedures, parameters, or both at the permutation points. For example, one permutation is to switch operators or swap branches of
an existing operator. Also, terminals can be swapped or replaced with constants. Similarly, the mutation operation selects a Mutation Point 152 for each selected program; at this selected point, Mutation 154 then randomly generates, for each selected program, a portion of a program and inserts it at the mutation point. The portion inserted is typically a single point,
but may be a sub-program or other function.
Once new programs (or functions) have been created, the new program(s) are executed
and evaluated in step 160 for all individuals in the training set. The fitness values for all the
individuals are then used to form a composite fitness function, in this case, an average of the fitness
values over the individuals. The evaluation creates a fitness value associated with the new
program. Thereafter, the GP process returns to the termination test 104. The GP process then iterates through new generations of functions until either a performance criterion is met (e.g., fitness function minimized) or a user-selected number of generations has been created and
evaluated. Each generation of programs is used to breed a subsequent generation in which
the fittest members go on to breed other generations. Refinement of programs within and
over generations (and hence, optimization of the fitness function) is a consequence of
iterating this GP process using fitness proportionate selection for the programs.
Moreover, because of the iterative nature of the GP process, an audit trail can be
created of the entire process from the creation of the initial population of individuals to the current working population of individuals. For example, suppose we denote the individuals of the initial population as I1, I2, I3, . . . These individuals can be either stored directly or one
can store the random algorithm (and random seeds) used to generate the initial members. When a crossover is performed on two individuals (say I1 and I2, at point p of parent 1 and
point q of parent 2), an expression is created involving 5 items, namely the symbolic string
"CROSSOVER", the identities of the two individuals being crossed at the time (i.e. I1 and I2) and the two crossover points (i.e. p and q). This new string would be the identity (i.e. audit
trail) of the newly created individual. If a subsequent crossover (or other operation) were performed on this individual, this string would, in turn, become an argument of a new operation. Similarly, when a permutation is performed on an individual, an expression is
created involving 3 items, namely the symbolic string "PERMUTATION", the identity of the individual, and the permutation point. An example would be (PERMUTE I4 t) if the permutation operation had been performed on individual I4 at point t.
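The audit-trail expressions nest naturally, since the identity of a derived individual becomes an argument of the next operation. A minimal sketch follows, assuming initial individuals are labelled with strings, points are integers, and records are tuples; the helper names are illustrative.

```python
def crossover_record(parent1, parent2, p, q):
    """Five items: the operation name, two parent identities, two crossover points."""
    return ("CROSSOVER", parent1, parent2, p, q)

def permutation_record(individual, point):
    """Three items: the operation name, the individual's identity, the point."""
    return ("PERMUTATION", individual, point)

def ancestors(record):
    """Return the initial-population labels that a record descends from."""
    if isinstance(record, str):
        return {record}
    found = set()
    for item in record[1:]:               # skip the operation name
        if isinstance(item, (str, tuple)):  # identities; integer points are ignored
            found |= ancestors(item)
    return found

child = crossover_record("I1", "I2", 3, 5)
grandchild = permutation_record(child, 2)
print(grandchild)
# ('PERMUTATION', ('CROSSOVER', 'I1', 'I2', 3, 5), 2)
print(sorted(ancestors(grandchild)))  # ['I1', 'I2']
```

Storing only these records plus the random seeds is enough to replay how any individual in the working population was derived.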
The first step in the iterative process involves activating each program. Activation
means having each program attempt to accomplish its goal, producing an objective result. In the preferred embodiment, entities are computer programs, so activation requires executing
the programs of the population on the training database. The second step in the process assigns a fitness value to the objective result, and associates that fitness value with its corresponding entity. For programs concerning direct marketing, the fitness value is
generally a number, or a vector, which reflects the difference between the results of the
program execution and measured results. Of course, the fitness value could be any symbolic
representation.
In general, some of the programs will prove to be better than others when a value is
assigned to them after their interaction with the "environment" (e.g., individual characteristics) of the direct marketing problem. The best value (fitness) may be the lowest number (as is the case here where we are measuring the deviation between a result and a
known perfect solution). In other problems, the best value (fitness) may be the highest
number (e.g. scoring direct "hits"). The value (fitness) assigned may be a single numerical value or a vector of values, although it is often most convenient that it be a single numerical
value. In many problems, the best value is known or measured. However, even in such
problems, it is also known that lower (or higher) numbers connoting better fitness may be
attained over time and the best value attained by the process over a given time may need to be identified.
A useful method for organizing raw fitness values involves normalizing the raw values and then calculating probabilities based on the normalized values. The best raw fitness value is assigned a normalized fitness of 1, the worst value is assigned a value of 0, and all intermediate raw values are assigned in the range of 0 to 1. The probability of being
selected for program breeding is determined by the equation $P_i = f_i / \sum_{x=1}^{N} f_x$,
where $P_i$ is the probability of breeding for individual $i$ having a normalized fitness of $f_i$, and $N$ is the total number of individuals in the population. Thus, an individual's probability of breeding equals
the individual's normalized fitness value divided by the sum of all the normalized fitness values of the population. In this way, the normalized fitness values range between 0 and 1, with a value of 1 associated with the best fitness and a value of 0 associated with the worst,
and the sum of all the individuals' probabilities equals 1.
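The normalization and the breeding probabilities can be sketched as follows. The text fixes only the end points (best maps to 1, worst to 0); linear scaling of the intermediate values is an assumption of this sketch, as are the function names and sample values.

```python
def normalize(raw, lower_is_better=True):
    """Map raw fitness values onto [0, 1] with 1 for the best and 0 for the worst."""
    best, worst = (min(raw), max(raw)) if lower_is_better else (max(raw), min(raw))
    if best == worst:
        return [1.0 for _ in raw]   # all programs are equally fit
    return [(worst - r) / (worst - best) for r in raw]

def breeding_probabilities(normalized):
    """P_i = f_i divided by the sum of all normalized fitness values."""
    total = sum(normalized)
    return [f / total for f in normalized]

raw = [2.0, 4.0, 6.0, 10.0]       # hypothetical deviations; lower is better
normalized = normalize(raw)
probabilities = breeding_probabilities(normalized)
print(normalized)                  # [1.0, 0.75, 0.5, 0.0]
print(probabilities)
```

Under this scheme the worst program never breeds, because its normalized fitness and therefore its probability are exactly zero.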
At any given time, there is one individual in every finite population having a single
fitness value that is the best amongst that population. Moreover, some environments or data
set characteristics have a known best fitness value. Examples are when fitness is measured as
deviation from a known answer or number of matches. The process of the present invention
may occasionally generate an individual whose value (fitness) happens to equal the known best value. Thus, this overall process can produce the best solution to a particular problem. This is an important characteristic of the overall process, but it is only one characteristic.
Another important characteristic (and the one which is more closely analogous to nature) is that a population of programs exists and is maintained which collectively exhibits a tendency
to increase their fitness over a period of time. That is, the average (or other group measure)
of the fitness values for all the programs in the working set tends to increase over time. By
virtue of the many individuals with good, but not the very best, fitness values, the population exhibits the ability to robustly and relatively quickly deal with changes in the data set characteristics. Thus, the variety in the population lowers its overall average fitness value; additionally, the population's variety gives the population an ability to robustly adapt to
changes in the environment.
Another embodiment of the GP process involves the process of affecting the
probabilities of which program or programs will breed further generations. As described
above, one way to determine breeding probability is to select the program with the highest
fitness value. A number of other methods exist, however, which tend to determine entities of relatively high value. The theoretically most attractive way to determine breeding probability is to do so with a probability proportionate to a fitness value (once so normalized between 0
and 1). Thus, an individual with fitness of 0.95 has a 19 times greater chance of breeding than an individual of fitness value 0.05. Occasionally individuals with relatively low fitness
values will be selected. This selection will be appropriately rare, but it may occur.
Furthermore, if the distribution of normalized fitness values is reasonably flat, this method is
especially workable. However, if the fitness values are heavily skewed (perhaps with most
lying near 1.00), then making the selection using a probability that is simply proportionate to normalized fitness will result in the differential advantage of the most fit individuals in the
population being relatively small and the operation of the entire process being prolonged. Thus, as a practical matter, breeding is done with equal probability among those individuals
with relatively high fitness values rather than being made with probability strictly proportionate to normalized fitness. This is typically accomplished by breeding individuals
whose fitness lies outside some threshold value. One implementation of this approach is to
select a threshold as some number of standard deviations from the mean (selecting for example, all individuals whose fitness is one standard deviation from the mean fitness).
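Threshold-based breeding can be sketched with the standard library. This assumes normalized fitness where higher is better, reads "one standard deviation from the mean" as at least one standard deviation above it, and uses the population standard deviation; the sample values and names are hypothetical.

```python
import random
import statistics

def breeding_pool(fitness, k=1.0):
    """Indices of individuals at least k standard deviations above the mean fitness."""
    mean = statistics.mean(fitness)
    sd = statistics.pstdev(fitness)
    return [i for i, f in enumerate(fitness) if f >= mean + k * sd]

fitness = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.95, 0.97]
pool = breeding_pool(fitness)
print(pool)  # [7, 8]

# Every member of the pool is then equally likely to be chosen for breeding.
rng = random.Random(0)  # fixed seed only for repeatability of the sketch
parent_index = rng.choice(pool)
```

Compared with strictly proportionate selection, the two near-identical top scorers here share the breeding opportunities equally instead of competing on a difference of 0.02.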
During the iterative loop of the GP process, breeding parameters are modified through the Choose Operation step 118. As noted above, the possible operations include crossover,
permutation, and reproduction. The preferred operation is crossover, followed by reproduction, and lastly permutation. However, this preference is only a generalization; different preferences may work better with some specific examples. Also, as illustrated in figure 2, a user may select a particular form of operation through User Interface 200. Figure 5
illustrates User Interface 200. It is a graphical user interface in which the model form
window 250 is displayed along with one or more measures of the effectiveness of the model.
As shown in figure 5, for example, the Fitness History window 220 and cum lift window 230 are graphically displayed.
The User Interface 200 works in conjunction with the genetic algorithm of figure 2 to
provide a user controlled genetic program process. As described above, the genetic program of figure 2 creates an initial population of models and then uses the programs with the best
fit, in steps 118 to 160, to breed a new generation of programs which subsequently become
the new working model population. The automatic iteration of this process eventually
generates a model that exhibits the best fitness. The user interface 200 allows a user to alter
the automatic iterating and edit or lock-in a selected form for part of the model. Editing or
locking in certain model sections is desirable to make the models conform to constraints not otherwise taken into account by the fitness function. Such edits may be based on the
performance measures of the fitness function graphically displayed and dynamically updated as individual models are evaluated.
As illustrated in figure 5, the user interface 200 displays a graphic representation of the program in the model form window 250 along with the fitness history in window 220 and
cum lift in window 230. The program that is graphically represented in window 250 is the
program being evaluated against the training database in step 160 illustrated in figure 2. The particular program illustrated in figure 5 performs the program ((Dollar) multiplied by (Dollar
plus 3)) multiplied by (3 divided by product type). The fitness history 220 is a graphical display of the calculated fitness values for each program that has been displayed in the model form window 250. Specifically, if the GP process is evaluating the Nth program in step 160, the iterative process of the GP has already calculated (N-1) fitness values for (N-1) previous
programs and those (N-1) fitness values are displayed in window 220 in the numerical order in which they were generated.
While the user interface may simply track fitness values, it may also retain a vector of
parameters for each fitness value such as a high/low range or a normalized distribution or other parameters that represent a fitness value. The interface may also retain a vector for each fitness value such that selection of a fitness value along the fitness history curve prompts the
actual model form corresponding to the selected fitness history value to be displayed. Finally, the cum lift display is a representation of the cum lift by decile for the Nth program of step
160. The display in window 230 may alternatively be a table or bar chart or any
representation of the cum lift decile analysis. Alternatively, the interface may also store the
cum lift history in a similar manner to the fitness value history. For example, the window
230 might display the cum lift history for the 1st decile (top 10%) of all the (N-1) models
generated to date.
In operation, the combination of the display of the Nth model in window 250 with the fitness history in window 220 and the cum lift in window 230 enables the user of the GP
process to evaluate, while models are being created, the effectiveness of the models with respect to the direct marketing problem being analyzed. For example, while the (N-1)th model
is being analyzed, a user may have decided that a certain section of that model should be
retained for all follow-on models that will be generated in this GP process. Accordingly, the user would select the "Pause" button 260 on the interface tool bar which will temporarily halt
the evaluation of the (N-1)th model over the training database. The user would then drag the cursor over that section of the model to be saved in follow-on models and select each
function or input to be saved. When the user continues the evaluation process by selecting the "Continue" button on the toolbar, the GP process will finish evaluating the (N-1)th model and will input at step 180 (figure 2) the selected portions of the (N-1)th model to be used on the Nth and follow-on models. Alternatively, the user may set a parameter such that the section
of the (N-1)th model to be used in the Nth model is only used in the Nth model or is only used
in a set number of models following the (N-1)th model.
Importantly, the reason the user of the GP process is able to know that certain sections
of the model may need to be retained (or edited out) is that the user can detect the raised cum lift for a particular model and also, by selecting the fitness history, determine which models might be similar to (or different from) the current Nth model. Where high (or low) fitness models have similar sections to the Nth model, the similar section of the Nth model can then
be used as a base for further modifications. The user might also be subject to arbitrary
limitations on individual parameters that are unique to any particular problem. In this case,
the user can specify those limits through combinations of input parameters and operations
such that a section of a model is built and input into the GP process through step 118. In this
way the user of the GP process can guide the model creation process of the GP algorithm such that an analysis of any particular problem is customized to its specific limitations.
Figure 6 illustrates another embodiment of the interface of the present invention. In
particular, the view (or model form) window 250 is displayed alongside the fitness history
window 220 and a decile analysis window 235. In addition, the response profile window 260 and gene pool diversity window 270 are displayed. The fitness history window 220 plots the
fitness of the best-of-generation (BOG) and the 2nd best-of-generation individuals. Do not
expect BOG fitness to increase with every new generation; it will not; it may even go down. Normal behavior is for short periods of growth followed by many generations of apparently no change, followed again by sudden increases. Modern theorists refer to this process as
punctuated equilibrium.
The view window 250 automatically displays the current best-of-generation program. In this window you can view any one individual program in the population. This window has
several functions. The view window contains a small cluster of menu items that allow you to step forwards and backwards throughout the entire population, to stretch and shrink the
display vertically and horizontally and to export any of the evolved programs in a variety of formats. Every time you select a different program the cum lift window automatically
updates with that program's output. The view window menu items include:
First Displays the current best-of-generation program
> >> >>> Step forward 1, 10 or 100 individuals and display that program
< << <<< Step back 1, 10 or 100 individuals and display that program
Vert+ Stretch vertically a bit
Vert- Shrink vertically a bit
Horz+ Stretch horizontally a bit
Horz- Shrink horizontally a bit
Export Exports the displayed program.
Restore Restores the display if the Program window is maximized
The view window buttons include:
Short Names toggles the displayed variable names between the given names and the
abbreviated names; useful when displaying long programs.
Auto Size redraws the displayed program to fill the current window size.
Font+ Font- increase and decrease the font size of the displayed program.
Best so far shows the program with the highest top decile in prior generations.
Print prints the program tree on your printer. Use landscape for best results.
Sweep scans the displayed program for redundant branches by sequentially
testing the effect of removing every branch in the program. The test is simply the resulting cum lift at a specific
depth of file chosen by you. Normal practice is to set the depth of file
at the level of your usual mailing.
In the view window, when the cursor is placed on a model form part and clicked, a
sensitivity window is selected. This window enables a user to study cross-sections of the
program's performance at any point. Also, the node output diagnostics window is available in the view window and it displays the cumulative average response in fine detail. When you select the analysis view mode from the windows menu, evolution is frozen and
you have the opportunity to investigate any of the programs in the pack in fine detail.
Furthermore, there are four inspection windows in analysis mode, three of which have
already been described (View window, Decile window and Sensitivity window) and a fourth,
which is only visible in this mode. This window, headed "Cumulative response in detail",
displays a chart and sequences the entire file in descending order of model score and displays
the cumulative average response (or profit) for each observed score group that is output by the model. It is perfectly possible to have as many score groups as there are records in the data file. Similarly it is possible to have just two. This chart is used to identify the cut-off
scores for a given model prior to processing an external data file with that model. The vertical cursor may be moved with the spin control and the position in terms of depth-of-file
and model-output-score at that position is displayed in the information boxes at the top.
In addition, the view window includes diagnostic boxes. These boxes show: the
current activity and the program being worked on; the current generation number; the number of failed programs (Runts) in the last generation; the fitness of the best individual in the last
generation; and the time per complete cycle, from one generation to the next in hours, minutes and seconds. At any time during evolution a user can pause and save an entire population of programs together with all the population and breeding parameter settings.
These saved files are called scenarios.
At any time during evolution a user can pause and fetch a previously saved scenario and
continue evolution with the recovered set of programs instead. A user can also remove
unwanted scenarios. The file menu items Save scenario, Fetch scenario and Delete
scenario accomplish these tasks. Work can be easily recovered by fetching the crash-save scenario. The decile analysis window 235 is one of four display modes available for viewing
the predictive power of any individual model. Clicking the left mouse button on the window
region 235 cycles the display through three alternative views of the fit. The view of the decile
analysis is a tabular result of the model results by decile; the first alternative view is a cum
lift plot; the second alternative is a cumulative response by decile and the third alternative is
an advantage index (bar chart of cum delta over 100).
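The four views can all be derived from one decile table. The sketch below assumes the conventional definitions: Cum Lift is 100 times the cumulative response rate divided by the overall response rate, and the advantage index is the Cum Lift minus 100. The function name `decile_analysis` and the dictionary keys are hypothetical, and the data file is assumed to hold at least as many records as groups.

```python
def decile_analysis(scores, responses, groups=10):
    """Return one dict per decile with rate, cumulative rate, Cum Lift and advantage."""
    # Rank the responses by model score, best score first.
    ranked = [r for _, r in sorted(zip(scores, responses),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    if n < groups:
        raise ValueError("need at least one record per group")
    overall = sum(ranked) / n
    if overall == 0:
        raise ValueError("lift is undefined when there are no responses")
    table, cum_sum, cum_n = [], 0.0, 0
    for g in range(groups):
        # Integer arithmetic keeps the groups as equal in size as possible.
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        cum_sum += sum(chunk)
        cum_n += len(chunk)
        cum_rate = cum_sum / cum_n
        cum_lift = 100.0 * cum_rate / overall
        table.append({
            "decile": g + 1,
            "rate": sum(chunk) / len(chunk),
            "cum_rate": cum_rate,
            "cum_lift": cum_lift,
            "advantage": cum_lift - 100.0,  # "cum delta over 100"
        })
    return table
```

The tabular view corresponds to the whole table, the Cum Lift plot to the `cum_lift` column, the cumulative response view to `cum_rate`, and the advantage index bar chart to `advantage`.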
The gene pool diversity window 270 is one of three graphs. The bar chart shows the average frequency of occurrence across the entire population of each element in the genetic alphabet. It is used to monitor the loss of genetic diversity. As evolution progresses and
different sequences of code are favored differently, so it is inevitable that the individual elements of the genetic alphabet will start to appear with differing frequencies. The same is
also true of the candidate variables. This window allows a user to see how things are
progressing. Three different views can be seen:
GENE POOL DIVERSITY shows the average frequency of occurrence of each
element of the current problem's genetic alphabet. Clicking the graph displays a bar graph showing the average frequency of occurrence of each of the independent variables across the entire population of programs. This is a strong indicator of which variables really matter.
VARIABLES: mean frequencies displays the average occurrence of each variable
across the entire population. The small button at the top right shows the chart as a sorted list.
FITNESS vs COMPLEXITY plots the fitness of the top 300 programs (y axis) against the
number of internal nodes in each program (x axis). The three charts are used to judge whether some sort of intervention is required to alter the course of evolution.
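The quantities behind these three charts can be sketched as follows. The representation is an assumption made for illustration: each program is a nested tuple whose first item is an operator from the genetic alphabet and whose remaining items are sub-programs or variable names. The function names are hypothetical.

```python
from collections import Counter


def walk(program):
    """Yield (symbol, is_internal) for every node of a nested-tuple program."""
    if isinstance(program, tuple):
        yield program[0], True          # operator node
        for child in program[1:]:
            yield from walk(child)
    else:
        yield program, False            # terminal, assumed to be a variable name


def mean_frequencies(population):
    """Return (operator_means, variable_means): average occurrences per program."""
    operators, variables = Counter(), Counter()
    for program in population:
        for symbol, is_internal in walk(program):
            (operators if is_internal else variables)[symbol] += 1
    size = len(population)
    return ({s: c / size for s, c in operators.items()},
            {s: c / size for s, c in variables.items()})


def internal_nodes(program):
    """Count operator nodes: the x axis of the FITNESS vs COMPLEXITY chart."""
    return sum(1 for _, is_internal in walk(program) if is_internal)
```

For example, `mean_frequencies([("+", "age", ("*", "income", "age")), ("*", "income", "tenure")])` gives the operator means for the GENE POOL DIVERSITY view and the variable means for the VARIABLES view.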
Once a model has evolved that appears to meet your objectives it will be important to
gain some understanding of its potential performance on real world mailing lists or corporate databases. Unfortunately, no simple formula exists for calculating the true variances of the Cum Lifts. For this reason, a simulation facility called the Bootstrap is used which repeatedly draws random samples (with replacement) of a predetermined size from the source data and
recalculates the decile Cum Lifts for every sample. The process may take a few minutes if
the sample sizes are in the millions. Your data file may contain only 25,000 records, but if you are
planning a mailing shot of 2,000,000, then you should set a bootstrap sample size of
2,000,000. The number of bootstrap samples drawn should be at least 30. On completion of sampling, the Bootstrap results are displayed in a table. The default confidence interval for upper and lower Cum Lifts is set at 80%. You can, however, change this by choosing from
the confidence% menu which offers settings of 95%, 90%, 80%, 75% and 66.66%. In the
present context, 80% is a very sound basis for decision taking.
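The Bootstrap procedure can be sketched as follows. The text does not say how the upper and lower Cum Lifts are derived from the samples, so this example assumes a simple percentile interval. The function names, the `(score, response)` record layout and the fixed seed are likewise assumptions for illustration.

```python
import random


def cum_lifts(records, groups=10):
    """Cum Lift per decile for a list of (score, response) records."""
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)
    n = len(ranked)
    overall = sum(resp for _, resp in ranked) / n
    if overall == 0:
        raise ValueError("lift is undefined when a sample has no responses")
    lifts = []
    for g in range(1, groups + 1):
        stop = g * n // groups
        cum_rate = sum(resp for _, resp in ranked[:stop]) / stop
        lifts.append(100.0 * cum_rate / overall)
    return lifts


def bootstrap_cum_lifts(records, sample_size, n_samples=30,
                        confidence=0.80, groups=10, seed=0):
    """Return (lower, upper) Cum Lift bounds per decile from resampled data."""
    rng = random.Random(seed)
    # Each sample is drawn with replacement at the planned mailing size.
    draws = [cum_lifts(rng.choices(records, k=sample_size), groups)
             for _ in range(n_samples)]
    tail = (1.0 - confidence) / 2.0
    low_index = int(tail * (n_samples - 1))
    high_index = n_samples - 1 - low_index
    bounds = []
    for g in range(groups):
        values = sorted(draw[g] for draw in draws)
        bounds.append((values[low_index], values[high_index]))
    return bounds
```

Because the sample is drawn with replacement, `sample_size` may exceed the number of records in the source data, which is what allows a 25,000-record file to simulate a 2,000,000-piece mailing. A pure-Python loop of that size is slow; the sketch is meant to show the logic rather than to be run at production scale.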
While this invention has been particularly described and illustrated with reference to a
preferred embodiment, it will be understood by one of skill in the art that changes in the above description or illustrations may be made with respect to form or detail without
departing from the spirit and scope of the invention.

Claims

We claim:
1. A method for database analysis, comprising:
creating a database including measured customer responses;
creating an initial generation of a plurality of models to simulate said measured customer responses in said database, each of said models consisting of at least one operator selected from a defined set of operators;
evaluating a fitness function to quantify the difference between said measured and simulated customer responses from said initial generation of models;
creating a graphical user interface for displaying a decile analysis of said simulated customer responses;
using said evaluated fitness function and said decile analysis to breed a plurality of model generations wherein one of said models of said plurality of model generations optimizes said fitness function.
2. A method, as in claim 1, wherein:
said interface displays the best of generation model form adjacent said decile analysis.
3. A method, as in claim 2, wherein:
said interface edits said model form and said decile analysis is recalculated to
reflect said edits to said model form.
4. A method, as in claim 3, wherein:
said interface uses point and click cursor control operations to edit said model
form.
5. A method, as in claim 1, wherein:
said interface edits the number and type of operators in said defined set during
said breeding.
6. A method, as in claim 2, wherein:
said interface displays a genetic diversity window adjacent said decile analysis window wherein said genetic diversity window displays the frequency with which independent variables occur throughout said model generations.
7. A method, as in claim 1, further comprising:
selecting a plurality of random samples of database entries having a predetermined size; and
recalculating said decile analysis on each of said plurality of random samples.
8. A system for database analysis, comprising:
a central processing unit coupled to a memory system and a display wherein
said central processing unit operates according to a program retrieved from
said memory system;
said program instructs said central processing unit to retrieve data from said
memory;
said program creates a database including measured customer responses;
said program creates an initial generation of a plurality of models to simulate said measured customer responses in said database, each of said models
consisting of at least one operator selected from a defined set of operators;
said program evaluates a fitness function to quantify the difference between
said measured and simulated customer responses from said initial generation of models;
said program creates a graphical user interface for displaying a decile analysis
of said simulated customer responses;
said program uses said evaluated fitness function and said decile analysis to breed a plurality of model generations wherein one of said models of said
plurality of model generations optimizes said fitness function.
9. A system, as in claim 8, wherein:
said interface displays the best of generation model form adjacent said decile analysis.
10. A system, as in claim 9, wherein:
said interface edits said model form and said decile analysis is recalculated to reflect said edits to said model form.
11. A system, as in claim 10, wherein:
said interface uses point and click cursor control operations to edit said model form.
12. A system, as in claim 8, wherein:
said interface edits the number and type of operators in said defined set during said breeding.
13. A system, as in claim 9, wherein:
said interface displays a genetic diversity window adjacent said decile analysis window wherein said genetic diversity window displays the frequency with which independent variables occur throughout said model generations.
14. A system, as in claim 8, wherein:
said program selects a plurality of random samples of database entries having a predetermined size; and
said program recalculates said decile analysis on each of said plurality of random samples.

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001285191A AU2001285191A1 (en) 2000-08-23 2001-08-21 Genetic programming for performing direct marketing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US64396600A 2000-08-23 2000-08-23
US09/643,966 2000-08-23

Publications (1)

Publication Number Publication Date
WO2002017112A1 (en) 2002-02-28

Family

ID=24582869

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/026216 WO2002017112A1 (en) 2000-08-23 2001-08-21 Genetic programming for performing direct marketing

Country Status (2)

Country Link
AU (1) AU2001285191A1 (en)
WO (1) WO2002017112A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022147190A1 (en) * 2020-12-30 2022-07-07 Natural Computation LLC Automated feature extraction using genetic programming

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5781698A (en) * 1995-10-31 1998-07-14 Carnegie Mellon University Method of autonomous machine learning
US5933818A (en) * 1997-06-02 1999-08-03 Electronic Data Systems Corporation Autonomous knowledge discovery system and method
US5940825A (en) * 1996-10-04 1999-08-17 International Business Machines Corporation Adaptive similarity searching in sequence databases
US6185561B1 (en) * 1998-09-17 2001-02-06 Affymetrix, Inc. Method and apparatus for providing and expression data mining database
US6253196B1 (en) * 1997-07-10 2001-06-26 International Business Machines Corporation Generalized model for the exploitation of database indexes

Also Published As

Publication number Publication date
AU2001285191A1 (en) 2002-03-04

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 69(1) EPC (COMMUNICATION OF 08-08-2003, EPO FORM 1205A)

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP