Detecting Cancer with an RBF Network Trained by a Genetic Algorithm
This presentation will introduce genetic algorithms. I will go over a genetic algorithm that I wrote in C# to predict the benignness or malignancy of breast cancer tumors. This is a computer program that I wrote and entered in the Society of Actuaries' third annual Forecasting and Futurism section contest, where it was selected as the winning entry. Hi, my name is Jeff Heaton. I am an EHR informatics scientist, a blogger, and an author. You can see my contact information here. We'll also talk about one of my books during this presentation, which highlights genetic algorithms as well as other nature-inspired algorithms. The problem definition was to use a genetically trained radial basis function neural network to predict the malignancy or benignness of breast tumors. You can see the data that I used here. I used data that is available from the UCI Machine Learning Repository. Here's some more information on this data set. It was originally obtained by Dr. William Wolberg at the University of Wisconsin Hospitals. This is a commonly used data set, available from the University of California, Irvine's machine learning archive. The purpose of my program was to demonstrate that the Parallel class of C# could be used to train such a neural network with a genetic algorithm in a highly parallel fashion on a multi-processor computer. Let's start by looking at exactly what machine learning is and what form a lot of these models actually take. Here you see a very generic sort of model. You have inputs, in this case five; this is simply an array of numbers. And you have outputs from the predictive model. Here we have three outputs. Often there's just one output, or there may be an output for each class that you want classified from the input data. These inputs and outputs are just arrays of numbers.
The job of the machine learning model is to learn to produce close to the same outputs as you would expect from your sample data set, your training data set, for the specified inputs. The really neat part of machine learning algorithms is when your inputs do not exactly match the data that you originally collected to train with. In this case, the machine learning algorithm should learn to abstract and still produce outputs that are consistent with what you would expect from the given inputs. This is often a black box. You don't know exactly how the machine learning algorithm is actually doing that. Some machine learning algorithms can give you some degree of rules to tell you what happened; however, they are the exception rather than the norm. There are two broad classifications that you can put these machine learning algorithms, or models, into: classification models and regression models. There are other types, but these are two of the main ones, and these are the two that I will be talking about in this presentation. Classification outputs a class that the input data belongs to, i.e., whether a given tumor's features make it a benign tumor or a malignant tumor. Regression models output a number. For example, you might give one a particular car's features and it would output what it thought the miles per gallon would be. Regression models are a lot more like a mathematical function. The actual output from them is a number rather than a class, as is the case with classification models. So what exactly is a model and how does one work? A model is usually some sort of mathematical equation or algorithm. These models have coefficients or other parameters that can be adjusted to tune what the model is actually going to output. Two very simple models that are often discussed are linear regression models and polynomial models. These two are based on very simple mathematical equations.
Though they are relatively simple, they can handle a surprising amount of data. Let's look at a really simple example of how you might use a linear regression model: temperature conversion. This is converting from Celsius to Fahrenheit or vice versa. In this case, we're looking at how to convert a Celsius temperature reading into a Fahrenheit one. Here you can see the simple equation to do this: F = (9/5)C + 32. There are really two coefficients here: nine fifths and 32. You may not think of 32 as a coefficient because it's not really multiplied by anything. However, it actually is. It's multiplied by the variable to the power of zero, which is one. Okay, that's a bit of a stretch, but it helps to think of it as a coefficient so that it fits in nicely with the other coefficients. You might also think of 32 as b, the y-intercept. Look at the top equation. This is the general form for linear regression. You see the coefficients labeled c and the x inputs labeled x. You can also see the lone coefficient, or y-intercept, as c sub n. For temperature conversion, there is only one x, the Celsius value. So we're left with a much simpler form: y equals coefficient zero times x sub zero, plus the remaining coefficient. We could think of coefficient one, like I said before, as being multiplied by x sub zero to the power of zero, which is effectively one, giving us c sub one like we have there. Then you can see the two coefficients, nine fifths and 32, plugged in from before. There are many ways to train or fit a model to the training data. For the temperature model, we know the coefficients. They're listed here. And here is our coefficient vector array: [1.8, 32], where 1.8 is nine fifths. Those are the values that we would want to adjust, if we did not know them, to make the model fit temperature conversion.
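To make the temperature model concrete, here is a minimal sketch of evaluating that two-coefficient linear model. The contest program itself is in C#; Python is used here just for illustration, and the function name is my own.

```python
def linear_model(coefficients, x):
    """Evaluate y = c0 * x + c1 for a single input x."""
    c0, c1 = coefficients
    return c0 * x + c1

# The known coefficient vector for Celsius -> Fahrenheit: [9/5, 32].
celsius_to_fahrenheit = [1.8, 32.0]

print(linear_model(celsius_to_fahrenheit, 100.0))  # approximately 212
print(linear_model(celsius_to_fahrenheit, 0.0))    # approximately 32
```

If we did not know the coefficients, that two-element list is exactly the vector a training algorithm would adjust.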
We would probably initially assign those to random numbers and then adjust them using one of the various fitting or training algorithms. So if we don't know the coefficients, we can make the computer try to figure out what they are for us. That is called model fitting, or training. Training is really nothing more than finding the coefficients that produce the desired output. Training is usually iterative. You often start with random coefficients and simply refine them over and over again until you get closer and closer to the desired output. Marginal improvements are produced with each iteration. Training can take a very long time. The training method to use depends on whether the model that you're trying to train is differentiable. That means: can you take the derivative of it? If you cannot take the derivative of it, then you should use one of these methods. These methods will also work if you can in fact take the derivative of the model; however, a differentiable method will usually be more efficient if your model is in fact differentiable. One of these is genetic algorithms, which I will be talking about more in the next section. Simulated annealing, Nelder-Mead, hill climbing, and even a very primitive greedy random walk can be used on non-differentiable models. A greedy random walk simply assigns random values to all of the coefficients or parameters and keeps the new random values only if they actually improve the situation. You're essentially just trying random brute-force attempts until you find a good set of parameters or coefficients. This is a really bad way to train, but it is a good worst-case baseline for comparison. If you can, in fact, take the derivative of your model and your objective function, or scoring function, then you should probably use one of the differentiable methods.
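The greedy random walk just described can be sketched in a few lines. This is an illustrative Python version (the contest program is C#), trying to recover the temperature-conversion coefficients from sample data; the function names, search range, and iteration count are all made up for the example.

```python
import random

def score(coefficients, samples):
    """Sum of squared errors of y = c0*x + c1 over the training samples."""
    c0, c1 = coefficients
    return sum((c0 * x + c1 - y) ** 2 for x, y in samples)

def greedy_random_walk(samples, iterations=5000, seed=42):
    """Propose random coefficient vectors; keep a proposal only if it
    scores better than the best seen so far."""
    rng = random.Random(seed)
    best = [rng.uniform(-50, 50), rng.uniform(-50, 50)]
    best_score = score(best, samples)
    for _ in range(iterations):
        candidate = [rng.uniform(-50, 50), rng.uniform(-50, 50)]
        candidate_score = score(candidate, samples)
        if candidate_score < best_score:  # keep only improvements
            best, best_score = candidate, candidate_score
    return best

# Training data generated from the known conversion formula.
samples = [(c, c * 1.8 + 32) for c in range(-40, 101, 10)]
print(greedy_random_walk(samples))  # drifts toward [1.8, 32] over time
```

Because every proposal is completely random, convergence is slow and imprecise, which is exactly why this only serves as a worst-case baseline.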
For many of these, I just have a few of the more common ones listed here, such as gradient descent, sometimes called backpropagation when dealing with neural networks, least sum of squares, Levenberg-Marquardt, and scaled conjugate gradient, just to name a few. We can use many different models beyond just the linear regression model that we saw for temperature conversion. There are general linear models, neural networks, deep belief networks, support vector machines, and various Bayesian models. All of these models can be trained using the training algorithms that I just mentioned for differentiable and non-differentiable models. However, some of these models are differentiable and some are not, so that will determine what training algorithm you might use. Additionally, some of these models have very specialized training algorithms that are fine-tuned to their unique needs. All of these models operate very similarly. The models accept inputs and produce one or more outputs. Some models support regression, some support classification, and others support both. They all have a concept of weights, coefficients, or other parameters. These weights, coefficients, or other parameters are updated to train, or fit, the model. You just need to select an appropriate training algorithm for the model. For my contest entry, I used an RBF neural network. RBF networks can accept multiple inputs and be used for either classification or regression. My entry used classification: malignant or benign. They use weights and a variable number of RBF functions; training adjusts these weights and the RBF parameters for the individual RBF functions. RBF stands for radial basis function. So what exactly is a radial basis function? A radial basis function, or RBF, is a special class of function. The RBFs used for RBF networks typically have a range of 0 to 1. These RBFs have two parameters, a center and a width. RBFs are symmetrical about the center. There are many different types of RBF.
The RBF that I used for this example is a Gaussian RBF. It's a very common type of RBF used in a lot of artificial intelligence programs. To calculate an RBF, you first must calculate r, which is the distance between the center and the input. It is important to note that the center has a number of values equal to the number of inputs, so you're definitely dealing with a multi-dimensional point for the center. Then you calculate the RBF using r; below is the Gaussian RBF. The value w is the width. Notice that the width is a single value, a scalar, compared to the vector of the center. So you have one number that defines the width, and you have an array, equal in length to the number of inputs, that defines the center of the RBF function. Here is the graph of the Gaussian RBF. Notice it looks like the normal distribution. Using several of these RBF functions, because you usually have more than one RBF function in an RBF network, you can train the network to approximate and fit models of fairly complex data. Here we see a graphical representation of the exact RBF network that I evolved for this contest entry. You can see all of my inputs at the far left. You can see that I'm using three radial basis functions, and you can see that there are weights between the inputs and those radial basis functions. The outputs from the radial basis functions go to simple summations that feed the two output neurons. One represents malignant, the other benign. Whichever of those two values has the highest output, or the highest activation, is the class that we choose. This is how you use an RBF network for classification. If you wanted to use it for regression, you would have one single output. Notice that you also have a bias, which is always a single value of one. There are also two sets of weights: the weights between the inputs and the RBFs, and more weights between the RBFs and the summations. So how exactly is the RBF network calculated?
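The Gaussian RBF and the network structure just described can be sketched as follows. This is a hedged Python illustration, not the C# contest code: I've used one common Gaussian form, exp(-(r/w)^2), and all the weights, centers, and function names in the usage example are made up.

```python
import math

def gaussian_rbf(x, center, width):
    """One common Gaussian RBF: exp(-(r/width)^2), where r is the
    Euclidean distance from the input vector x to the RBF's center.
    Peaks at 1.0 when x equals the center; symmetric about it."""
    r = math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))
    return math.exp(-((r / width) ** 2))

def rbf_network(x, rbfs, output_weights):
    """Minimal forward pass: weight the inputs into each RBF, then take
    weighted sums of the RBF activations (plus a bias of 1.0) per output."""
    activations = []
    for input_weights, center, width in rbfs:
        weighted = [w * xi for w, xi in zip(input_weights, x)]
        activations.append(gaussian_rbf(weighted, center, width))
    activations.append(1.0)  # the bias: always a single one value
    return [sum(w * a for w, a in zip(weights, activations))
            for weights in output_weights]

# Two RBFs and two outputs (e.g. malignant vs. benign); values are made up.
rbfs = [([1.0, 1.0], [0.0, 0.0], 1.0),
        ([1.0, 1.0], [1.0, 1.0], 1.0)]
output_weights = [[0.5, -0.5, 0.1],   # last weight applies to the bias
                  [-0.5, 0.5, 0.2]]
outputs = rbf_network([0.0, 0.0], rbfs, output_weights)
predicted = outputs.index(max(outputs))  # highest activation wins the class
```

For classification, as in the contest entry, the output neuron with the highest activation determines the class; for regression you would keep a single output.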
You can see the final output for each of the output neurons, given by f. We're basically taking a weighted summation of the summations, with the RBFs fed into those summations. You can see the whole equation for it here. x is the input vector, or the features, and n is the number of inputs. a and b are those connection weights. Remember, we had two different arrays of weights for this network: the weights from the inputs to the RBFs, and then a second set of weights from those RBFs to the outputs. And that double-bar-looking notation is a distance, a norm that takes into account all of the center values relative to the x being passed in. This can all be reduced to the training vector. The training vector, which is what we're going to actually adjust with the training algorithm, consists of the input coefficients or weights, the output summation coefficients or weights, the RBF width values, and the RBF center values. The width is the same in every direction, so no multiple widths come into play; each RBF has its own single width value. This all results in a numeric vector. You see an example of what that might look like at the very bottom. Even if you don't understand exactly how the RBF calculation worked in the previous slide, you can still use the RBF. It's a model, and you can throw almost anything you want into a model. It may not be a very good model, but it is some sort of equation that produces a number and has parameters that you can fine-tune. And those will be adjusted by the training algorithm. In this case, we're using the well-defined RBF model. So now let's talk about the basics of genetic algorithms. Remember, we're simply adjusting that training vector that we just saw. The first step is to create some number of random individuals, say a thousand. These are just random training vectors. Next, devise a score function. The score function will be used to score each of these random training vectors.
They won't do that well because they're random, but you have to start somewhere. Then you select a relatively small number, for example five, of the top individuals to pass directly to the next generation. This is called elitism. Then you choose a certain percentage, maybe 20 percent, of the top individuals to pass into the next generation with slight modifications. This is mutation. Then you fill the rest of the next generation with individuals that share traits from pairs of individuals in the general population, giving precedence to the top-scoring individuals. Now let's look at mutation. Mutation is asexual reproduction. You choose a parent. This parent is ideally going to be one of the top solutions in your population of these vectors. It is then changed just slightly. Notice the green trait is something new that was added. This is usually just a random number. You'll randomly change one of the numbers in the solution vector for the particular parent that you've chosen. That child will then be passed on to the next generation. This is how new information is added. If not for this, the random traits that were in the first population would be all that you would have. You'd have massive inbreeding as these traits were just shared and recombined in all sorts of different ways. Mutation lets the population invent new features and add them to itself. Crossover involves two parents, a mother and a father. Gender doesn't really exist here, so mother and father are largely arbitrary. Here you see that we basically spliced the array: we took part of it from the mother and part from the father, and we created a child. This allows two successful genomes, or solution vectors, to pool their traits and create a child that is hopefully even better suited than the mother or the father actually were. Of course, you want to choose parents only from the top portion of the population, those that got the best scores. This is survival of the fittest.
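The elitism, mutation, and crossover steps above fit together in a short loop. This is an illustrative Python sketch (the actual contest entry is a parallel C# program); the population size, elite count, mutation fraction, and selection rules are all example choices, not the entry's real settings.

```python
import random

def evolve(score, vector_len, population_size=100, generations=200,
           elite_count=5, mutate_fraction=0.2, seed=1):
    """Evolve coefficient vectors; lower score is better."""
    rng = random.Random(seed)
    population = [[rng.uniform(-50, 50) for _ in range(vector_len)]
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=score)            # best (lowest score) first
        # Elitism: the top few pass through unchanged.
        next_gen = [ind[:] for ind in population[:elite_count]]
        # Mutation: copy a top individual, randomly change one trait.
        mutant_target = elite_count + int(mutate_fraction * population_size)
        while len(next_gen) < mutant_target:
            parent = rng.choice(population[:population_size // 5])[:]
            parent[rng.randrange(vector_len)] = rng.uniform(-50, 50)
            next_gen.append(parent)
        # Crossover: splice two top-scoring parents to make a child.
        while len(next_gen) < population_size:
            mother, father = rng.sample(population[:population_size // 5], 2)
            cut = rng.randrange(1, vector_len) if vector_len > 1 else 0
            next_gen.append(mother[:cut] + father[cut:])
        population = next_gen
    return min(population, key=score)

# Example: evolve the two temperature-conversion coefficients.
samples = [(c, c * 1.8 + 32) for c in range(-40, 101, 10)]
def sse(coeffs):
    return sum((coeffs[0] * x + coeffs[1] - y) ** 2 for x, y in samples)

best = evolve(sse, vector_len=2)
print(best)  # should score far better than an untrained guess
```

For the contest entry, the vector being evolved is the RBF network's full training vector, and the score function measures classification error on the tumor data rather than squared error on temperatures.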
There are lots of additional, very advanced features that you can add to GAs. For example, you might want to use species and only allow crossover between similar solution vectors. Does gender really matter? There are some algorithms that do try to introduce some sort of gender. Do you want to use just two parents? You can use many different parents. Do you want to introduce monogamy? Maybe once two particular solutions have chosen to mate, they will mate again in the next generation. An innovation database lets you avoid trying the same sort of mutation over and over when it did not produce good results in a previous generation. You can use tournaments, where you basically have genomes fight for survival. You can use islands and land bridges to separate the population and then reintroduce it later. I've placed my program out on GitHub so that you can get it. You can also see some of its features here. If you enjoy this sort of algorithm, you may want to purchase my nature-inspired algorithms book. Depending on where you are in the time-space continuum, this is either a book that's available on Amazon or it is in the middle of its Kickstarter project. If you are looking at this in either February or March of 2014, then the Kickstarter project may very well still be running, and I encourage you to look at the link associated with this video and maybe consider backing it. If not, you may want to look for this book on Amazon. Thank you for watching my video.