Asynchronous Differential Evolution, with efficient multiprocessing

Differential Evolution (DE) is a type of genetic algorithm that works especially well for optimizing combinations of real-valued variables. The ade Python package does simple and smart population initialization, informative progress reporting, adaptation of the vector differential scaling factor F based on how much each generation is improving, and automatic termination after a reasonable level of convergence to the best solution.

But most significantly, ade performs Differential Evolution asynchronously. When running on a multicore CPU or cluster, ade can get the DE processing done several times faster than standard single-threaded DE. It does this without departing in any way from the numeric operations performed by the classic Storn and Price algorithm, using either a randomly chosen or best candidate scheme.

You get a substantial multiprocessing speed-up and the well-understood, time-tested behavior of the classic DE/​rand/1/bin or DE/​best/1/bin algorithm. (You can pick which one to use.) The very same target1 and base2 selection, mutation,3 scaling, and crossover4 are done for a sequence of targets in the population, just like you’re used to. The underlying numeric recipe is not altered at all, but everything runs a lot faster.

How is this possible? The answer is found in asynchronous processing and the deferred lock concurrency mechanism provided by the Twisted framework.

Lots of Lock

The magic happens in the challenge method of the DifferentialEvolution object. Before considering what’s different about this, let’s consider what’s the same as usual. Here’s a simplified version of the method, omitting some if-then blocks that can be disregarded while the algorithm is running:

def challenge(self, kt, kb):
    k0, k1 = self.p.sample(2, kt, kb)
    # Await legit values for all individuals used here
    yield self.p.lock(kt, kb, k0, k1)
    iTarget, iBase, i0, i1 = self.p.individuals(kt, kb, k0, k1)
    # All but the target can be released right away,
    # because we are only using their local values here
    # and don't care if someone else changes them at this
    # point. The target can't be released yet because its
    # value might change and someone waiting on its result
    # will need the accurate one
    self.p.release(kb, k0, k1)
    # Do the mutation and crossover with unbounded values
    iChallenger = iBase + (i0 - i1) * self.F
    self.crossover(iTarget, iChallenger)
    # Continue with values limited to their bounds
    self.p.limit(iChallenger)
    # Now the hard part: Evaluate fitness of the challenger
    yield iChallenger.evaluate()
    if iChallenger < iTarget:
        # The challenger won the tournament, replace
        # the target
        self.p[kt] = iChallenger
    # Now that the individual at the target index has been
    # determined, we can finally release the lock for that index
    self.p.release(kt)

The DifferentialEvolution.challenge method challenges a target (“candidate,” or “parent”) individual at an index kt in the population with a challenger individual. These “individuals” are instances of the Individual class. Each instance mostly acts like a sequence of float values–a chunk of digital DNA–to feed to your evaluation (fitness) function. It has an evaluate method that you run to obtain a sum-of-squared error result of that function, which gets stored in the individual’s SSE attribute. It also has some methods that make comparisons with other individuals easier.

The DE algorithm runs the challenge method for each individual in the population, considering that individual the target of a possible replacement. It creates a challenger in a multi-step process. First, it computes the vector sum of a base individual (at index kb) and a scaled vector difference5 between two randomly chosen other individuals that are distinct from each other and both the target and base individuals. Then it does a crossover operation between this intermediate result (I call it a mutant) and the target, randomly selecting each real-valued sequence item from either the mutant or the target. Here’s what the crossover method looks like:

def crossover(self, parent, mutant):
    j = random.randint(0, self.p.Nd-1)
    for k in range(self.p.Nd):
        if k == j:
            # The mutant gets to keep at least this one value
            continue
        if self.CR < random.uniform(0, 1.0):
            # CR is probability of the mutant's value being
            # used. Only if U[0,1] random variate is bigger (as it
            # seldom will be with typical CR ~ 0.9), is the mutant's
            # value discarded and the parent's used.
            mutant[k] = parent[k]

The mutant is modified in place. Note that, with typical CR values closer to 1.0 than 0.5, it’s more likely that a given sequence item will be from the mutant than from the target. That makes sense, since we are challenging the target with something different, and we’d just be wasting time if we made the something different barely different at all.
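As a quick sanity check on that claim, here is a small standalone experiment (not part of ade) that counts how many of the Nd sequence items, on average, come from the parent when CR is 0.9:

```python
import random

random.seed(7)
Nd, CR = 10, 0.9    # dimensionality and crossover rate, as in the text
trials = 10000
kept_from_parent = 0
for _ in range(trials):
    j = random.randint(0, Nd - 1)   # the index the mutant always keeps
    for k in range(Nd):
        # The parent's value is used only when the U[0,1] variate
        # exceeds CR, which happens about 10% of the time here
        if k != j and CR < random.uniform(0, 1.0):
            kept_from_parent += 1
avg = kept_from_parent / trials
```

The expected number of parent values surviving crossover is (Nd-1)(1-CR) = 0.9, so on average less than one of the ten items comes from the target.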

So far, all of this is conventional Differential Evolution. If the base individual is randomly chosen (but never the same as the target), then it is referred to as DE/​rand/1/bin. If the base is always the best individual in the population (with the lowest SSE), then you are running DE/​best/1/bin.
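Here is a compact NumPy sketch of one challenger construction, written from the description above rather than copied from ade (the function and variable names, population size, and parameter values are mine):

```python
import numpy as np

rng = np.random.default_rng(2018)
Np, Nd = 20, 5        # population size and dimensionality (arbitrary)
F, CR = 0.8, 0.9      # scaling factor and crossover rate
pop = rng.uniform(-5.0, 5.0, (Np, Nd))
SSEs = rng.uniform(1.0, 100.0, Np)    # pretend fitness results

def make_challenger(kt, randomBase=True):
    # Base: random (but never the target) for DE/rand/1/bin, or the
    # lowest-SSE individual for DE/best/1/bin
    if randomBase:
        kb = kt
        while kb == kt:
            kb = int(rng.integers(Np))
    else:
        kb = int(np.argmin(SSEs))
    # Two distinct others, different from both target and base
    k0, k1 = rng.choice(
        [k for k in range(Np) if k not in (kt, kb)], 2, replace=False)
    # Mutation: base plus scaled vector difference
    mutant = pop[kb] + F * (pop[k0] - pop[k1])
    # Binomial crossover, keeping at least one mutant value at index j
    challenger = pop[kt].copy()
    j = rng.integers(Nd)
    for k in range(Nd):
        if k == j or rng.uniform() < CR:
            challenger[k] = mutant[k]
    return challenger

c = make_challenger(14)
```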

In either case, two other individuals (never the same as the target or base) are randomly chosen from the population to compute a scaled vector difference from. The challenge method references these individuals from a Population object via integer indices. The target’s index in the population is kt, the base’s index is kb, and the two randomly chosen other individuals have indices k0 and k1. The sampling of k0 and k1 is shown in FIG. 1. The population members (individuals) are shown in the upper array p.6

Critically, there’s something else going on that’s also shown in FIG. 1. Each individual has its own concurrency lock, in the lower array d.

FIG. 1: Begin the challenge by acquiring concurrency locks.

Each lock is an instance of the Twisted7 framework’s DeferredLock. When you try to acquire one of those, it returns a Deferred object that fires when, and only when, the lock is yours. The challenge method acquires all the locks for all four individuals involved in the challenge and yields a Deferred that fires when all four locks have been acquired. That happens, incidentally, with a call to the Population object’s lock method:

def lock(self, *indices):
    """
    Obtains the locks for individuals at the specified indices,
    submits a request to acquire them, and returns a `Deferred`
    that fires when all of them have been acquired.
    """
    dList = []
    for k in indices:
        if indices.count(k) > 1:
            raise ValueError(
                "Requesting the same lock twice will result in deadlock!")
        dList.append(self.d[k].acquire())
    return defer.DeferredList(dList).addErrback(oops)

Now, the example of FIG. 1 is a challenge of the individual at p14. That’s the target. The individual at p7 is the base, and the individuals at p1 and p4 were selected for the vector subtraction operation. We need valid digital DNA for all four of those individuals before we can proceed. But when our challenge starts, the individuals at p8, p7, p5, p2, p1, and p0 are busy getting evaluated for other challenges, as indicated by the grey boxes. We can have the locks d14 and d4 right away, but we need to wait for the locks d7 and d1. So we yield the Deferred until we can have it all.8

Once that Deferred fires, we resume the challenge method at FIG. 2. Note that the individuals at p2 and p0 are still not ready. But we don’t care about that, because our particular challenge is completely unaffected by them.

We release the locks d7, d4, and d1 as soon as we’ve retrieved copies ib, i1, and i0 of the individuals at p7, p4, and p1. Once we have those copies, we don’t care how the population updates at those positions. Entirely new individuals could occupy those slots right away for all we care. We have what we need for the challenge to the individual at p14.

We proceed with the conventional DE steps for challenging the target individual:

  • computing a scaled vector difference and adding that to the base individual with the statement
    iChallenger = iBase + (i0 - i1) * self.F,

  • doing crossover with
    self.crossover(iTarget, iChallenger),

  • limiting the parameter values to their bounds with
    self.p.limit(iChallenger), an important but little-discussed part of DE, which ade does with the simple and well-accepted “reflection” method,

  • checking whether the challenger passes any constraints put on its combination of parameters, an optional and unconventional DE step that avoids the CPU load of evaluating hopeless or invalid challengers, and, finally,

  • evaluating the challenger with
    iChallenger.evaluate(), which must return a Deferred that gets yielded.

That last step is what takes a while; otherwise you wouldn’t be bothering with trying to speed up your DE heuristics. And so your fitness function must not block, but rather return a Deferred that fires with the SSE when it’s available. Look at the two examples included with ade to see how you might implement a Twisted-friendly, asynchronous evaluation function.
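In ade itself the evaluation function returns a Twisted Deferred; below is the same idea sketched with stdlib asyncio instead, with a made-up fitness function, just to show the shape of a non-blocking evaluator:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def fitness(x):
    # Stand-in for a slow fitness computation: sum-of-squared error
    # from an all-ones vector (purely illustrative)
    return sum((xi - 1.0)**2 for xi in x)

async def evaluate(pool, x):
    # Hand the CPU work to a worker and return an awaitable
    # immediately, instead of blocking the event loop
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, fitness, x)

async def main():
    with ThreadPoolExecutor() as pool:
        # Several evaluations can be in flight at once
        return await asyncio.gather(
            evaluate(pool, [0.0, 0.0]), evaluate(pool, [1.0, 2.0]))

SSEs = asyncio.run(main())
```

In real use the workers would be separate processes, so the CPU-bound evaluations actually run in parallel rather than just concurrently.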

FIG. 2: Do the DE stuff, releasing what locks we can.
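The “reflection” bounds-limiting mentioned in the steps above can be sketched like this (my own minimal version, not ade's actual limit method):

```python
import numpy as np

def reflect(values, lower, upper):
    # Reflect out-of-bounds values back inside [lower, upper]; repeat a
    # few times in case a big excursion reflects past the other bound
    v = np.array(values, dtype=float)
    for _ in range(10):
        above = v > upper
        v[above] = 2*upper[above] - v[above]
        below = v < lower
        v[below] = 2*lower[below] - v[below]
        if not np.any((v > upper) | (v < lower)):
            break
    return v
```

Reflection preserves how far a value strayed past the bound, instead of just clipping everything to the boundary and piling challengers up there.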

This is the key to the performance gain that ade gives you: It goes through each target of the population asynchronously. It doesn’t wait for one target to finish being challenged before starting a challenge on the next. So there are other challenges going on when we start our example challenge of the individual at p14. But we can only proceed with a challenge once all the individuals involved are ready.
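To make the idea concrete, here is a toy model of overlapping challenges using stdlib asyncio locks in place of Twisted's DeferredLock (the indices and timings are invented):

```python
import asyncio

async def challenge(locks, indices, log, label):
    # Wait until every individual involved in this challenge is free,
    # like ade yielding a Deferred that fires when all locks are held.
    # (Acquiring in sorted order avoids deadlock in this toy.)
    for k in sorted(indices):
        await locks[k].acquire()
    try:
        await asyncio.sleep(0.01)   # stand-in for a slow evaluation
        log.append(label)
    finally:
        for k in indices:
            locks[k].release()

async def main():
    locks = {k: asyncio.Lock() for k in range(16)}
    log = []
    # A and B touch disjoint indices, so they overlap; C shares
    # index 7 with A and must wait for A to finish
    await asyncio.gather(
        challenge(locks, {14, 7, 1, 4}, log, "A"),
        challenge(locks, {13, 6, 2, 5}, log, "B"),
        challenge(locks, {12, 7, 3, 8}, log, "C"))
    return log

log = asyncio.run(main())
```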

FIG. 3 shows what happens when your CPU-burning evaluation function finally gets done. A tournament (really, just an if statement comparing the two individuals) determines if the challenger gets to take the place of the target. And now, only now, we can release our last lock, d14.

FIG. 3: The challenger wins; replace target and release its lock.

The individual at p0 still isn’t ready, but who cares? It means nothing to this challenge. But imagine other challenges going on as all this happens. Say there’s one challenging the next individual in the population, at p13, that uses our target individual (at p14) as its base. It can’t proceed until our challenge is done. And, if it’s so unfortunate as to have chosen the individual at p0 as its i0 or i1, it might be waiting a while.

Fortunately, Differential Evolution is typically run with dozens or even hundreds of population members, many times more than the four or eight CPU cores you have available.9 So the asynchronous processing will usually find a target it can work on that does have four individuals all ready to be accessed, without disturbing the classic numeric recipe of DE. When you watch the progress of each generation, you’ll see it start out fast as the CPU cores find lots of challenges to proceed with and then slow down as the unfortunate challenges that depend on slow results are all that’s left to do. Even so–as long as your evaluation function is dispatched to an idle CPU core and doesn’t block–ade will typically speed up your Differential Evolution by a factor of several times.


The ade package comes with two example Python scripts. You can extract the scripts and their support files to a subdirectory ade-examples of your home directory by typing the following in a shell:

user@computer:~$ ade-examples

You will see something like this:

Extracting ade examples to
Subdirectory created
/home/user/ade-examples/README.txt created
/home/user/ade-examples/ created
/home/user/ade-examples/ created
/home/user/ade-examples/goldstein-price.c created

The Goldstein-Price Test Function

Do a cd to the newly created ~/ade-examples directory and compile goldstein-price.c there using gcc:

user@computer:~/ade-examples$ gcc -Wall -o goldstein-price goldstein-price.c

Then you can run the example script:

user@computer:~/ade-examples$ ./

You will see stuff fly by on the console that shows progress of the Differential Evolution algorithm. This little compiled test function runs so fast that you can barely tell there’s anything significant going on. Take a look at goldstein-price.log to see what the result looks like after running on my eight-core CPU; it takes about 0.7 seconds to converge and stop trying any more DE generations.

The population is fairly small (40 individuals) in this example, so there’s not as much benefit from the asynchronous multi-processing.10 But it still runs about twice as fast with four CPU cores as it does with just one. Try changing the value of N in the Solver class of ~/ade-examples/​ to see how much difference the number of parallel processes makes on your CPU.

One important thing to observe about this example is how it does its evaluation function in an asynchronous multiprocessing-friendly fashion. The key to that is found in the cmd and outReceived methods of the ExecProtocol class, which implements a Twisted process protocol:

def cmd(self, *args):
    """
    Sends function arguments to the executable via stdin, returning
    a C{Deferred} that fires with the result received via
    stdout, with numeric conversions.
    """
    yield self.lock.acquire()
    self.transport.write(" ".join([str(x) for x in args]) + '\n')
    yield self.lock.acquire()
    self.lock.release()
    defer.returnValue(float(self.response))

def outReceived(self, data):
    """
    Processes a line of response data from the executable.
    """
    data = data.replace('\r', '').strip('\n')
    if not data or not self.lock.locked:
        return
    self.response = data
    self.lock.release()

Deferred locks again! In this case, a lock is acquired when the x and y variables are sent to the compiled goldstein-price function via STDIN and released when a result is received via STDOUT. The cmd method yields the Deferred object from its attempt to acquire the lock, and so it plays nice with the Twisted reactor and lets ade do its asynchronous thing.

The ExecProtocol is used by a Runner object, which spawns an instance of the goldstein-price executable you compiled. One instance of MultiRunner manages a pool of Runner instances; as a callable object, it serves as the evaluation function for the DE algorithm.11 Each call to the MultiRunner object acquires a (deferred) instance of Runner from a DeferredQueue and calls it (when available) to obtain the Goldstein-Price function’s value for a given pair of x and y values. Take a look at the script to better understand how all of these pieces fit together in just a couple hundred lines of object-oriented Python code.
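Here is a rough asyncio analogue of that pattern (not ade's code): workers get checked out of a queue, called, and put back, with the Goldstein-Price function computed in-process instead of by the spawned executable:

```python
import asyncio

async def goldstein_price(x, y):
    # The Goldstein-Price test function; its global minimum is 3 at (0, -1).
    # A real Runner would get this value from the compiled executable.
    a = 1 + (x + y + 1)**2 * (19 - 14*x + 3*x*x - 14*y + 6*x*y + 3*y*y)
    b = 30 + (2*x - 3*y)**2 * (18 - 32*x + 12*x*x + 48*y - 36*x*y + 27*y*y)
    return a * b

class Pool:
    # Analogous to MultiRunner with its DeferredQueue of Runner instances
    def __init__(self, workers):
        self.q = asyncio.Queue()
        for w in workers:
            self.q.put_nowait(w)

    async def __call__(self, x, y):
        worker = await self.q.get()     # waits if every worker is busy
        try:
            return await worker(x, y)
        finally:
            self.q.put_nowait(worker)   # return the worker to the pool

async def main():
    pool = Pool([goldstein_price, goldstein_price])
    return await asyncio.gather(pool(0.0, -1.0), pool(1.0, 1.0))

values = asyncio.run(main())
```

The queue is what caps concurrency at the number of workers: callers past that limit simply wait for a worker to come back.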

Nonlinear Curve Fitting

Our second example fits a nonlinear curve to filtered voltage and temperature observations. It is pure Python (nothing compiled), and chews on a lot of data points, and so it takes a while to converge to a solution even with multiple CPU cores cranking away. On my eight-core CPU running at about 4 GHz, ade required just over 31 minutes to reach the solution shown in FIG. 4 and stop. (It took five times as long to reach a comparable solution and stop on another run I did that was limited to using just one core.)

FIG. 4: Nonlinear curve fit found after 70 DE generations

This example resulted from a slightly over-designed solution to an actual engineering problem. I had commissioned an outdoor equipment shed that gets quite hot with the electrical stuff operating inside. I wanted to run a ventilation fan when doing so would cool things off, and only then. What I had handy for measuring inside and outside temperatures was some thermistors and a dual-input USB-connected voltage measuring device made by a cool little Swiss company called YoctoPuce. On a temporary basis, I also had access to a precision temperature measuring device.

On each of the two inputs of the voltage measuring device (a Yocto-0-10V-Rx), I connected a voltage divider with a 100K NTC thermistor in the upper leg. The lower leg is made up of a precision carbon resistor in parallel with the 11.2K input impedance of the Yocto-0-10V-Rx. The upper leg is connected to a 23 VDC power supply that the Yocto-0-10V-Rx provides for some reason. The voltage measured at each input goes up as the temperature increases, because the thermistor resistance goes down (in a nearly exponential fashion) with higher temperatures. I put both thermistors and the borrowed temperature measuring device right next to each other and let a single-board computer log data for about a week of cool nights and warm days. Once I had used ade to evolve parameters for the curve shown in FIG. 4 above, I ran a Python service on the single-board computer to control the fan.
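For the curious, the divider arithmetic works out like this (the precision resistor value below is a placeholder, not the one I actually used):

```python
def parallel(r1, r2):
    # Equivalent resistance of two resistors in parallel
    return r1 * r2 / (r1 + r2)

def thermistor_R(v, vp=23.0, r_fixed=10e3, r_input=11.2e3):
    # Lower leg: the precision resistor in parallel with the
    # Yocto-0-10V-Rx's 11.2K input impedance (r_fixed is made up)
    rl = parallel(r_fixed, r_input)
    # Divider with the thermistor Rt in the upper leg:
    #   v = vp * rl / (Rt + rl)  =>  Rt = rl * (vp - v) / v
    return rl * (vp - v) / v
```

As the thermistor warms up, Rt drops and the measured voltage v rises, which is the behavior described above.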

The example script downloads the resulting data set from my web server (about 190 kB compressed) to your ~/ade-examples directory–only the first time you run it. (Privacy note: This will result in a log file entry on my web server; see the really short privacy policy at the bottom of this and every page on this site.) Then it decompresses the dataset on the fly, loads Numpy arrays from it, and, in the method of its Runner object partially copied below, proceeds with the (asynchronous) DE algorithm:

def __call__(self):
    args = self.args
    names_bounds = yield self.ev.setup().addErrback(oops)
    self.p = Population(
        names_bounds[0], names_bounds[1],
        popsize=args.p, constraints=[self.ev.constraint])
    yield self.p.setup().addErrback(oops)
    F = [float(x) for x in args.F.split(',')]
    de = DifferentialEvolution(
        self.p,
        CR=args.C, F=F, maxiter=args.m,
        randomBase=args.r, uniform=args.u, adaptive=not args.n)
    yield de()
    yield self.shutdown()

The evaluation method evaluate of Runner does a call to an Evaluator instance via a ProcessQueue object from my AsynQueue package. There are as many Evaluator instances as you have CPU cores, and the ProcessQueue runs them on separate CPU cores–you guessed it–asynchronously.

Here is the Evaluator method that gets called with each combination:

def __call__(self, values):
    SSE = 0
    self.txy =[self.kTC:])
    for k in (1, 2):
        tv, t_curve, weights = self.curve_k(values, k)
        squaredResiduals = weights * np.square(t_curve - tv[:,0])
        SSE += np.sum(squaredResiduals)
    return SSE
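With hypothetical numbers, the weighted SSE arithmetic in that loop looks like this:

```python
import numpy as np

# Made-up curve outputs, observed temperatures, and weights
t_curve = np.array([20.0, 25.0, 30.0])
tv      = np.array([19.5, 26.0, 30.5])
weights = np.array([1.0, 2.0, 1.0])

# Weighted sum of squared residuals, as in Evaluator.__call__ above
squaredResiduals = weights * np.square(t_curve - tv)
SSE = float(np.sum(squaredResiduals))
```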

The actual curve function whose parameters vp, rs, a0, and a1 we are trying to determine is found in Evaluator.curve:

def curve(self, v, vp, rs, a0, a1):
    """
    Given a 1-D vector of actual voltages followed by arguments
    defining curve parameters, returns a 1-D vector of
    temperatures (degrees C) for those voltages.
    """
    return a1 * np.log(rs*v / (a0*(vp-v)))
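For instance, with placeholder parameter values (invented for illustration, not the fitted ones), the curve maps rising voltages to rising temperatures:

```python
import numpy as np

def curve(v, vp, rs, a0, a1):
    # Same form as Evaluator.curve above
    return a1 * np.log(rs * v / (a0 * (vp - v)))

v = np.array([2.0, 4.0, 8.0])
# All parameter values below are made up for this example
t = curve(v, vp=23.0, rs=10e3, a0=1e3, a1=25.0)
```

With a1 positive, the log argument grows as v approaches vp, so temperature increases monotonically with voltage, matching the thermistor behavior described earlier.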

There are two other parameters in the digital DNA that we are trying to evolve: time constants of two filter stages that make the fast thermal response of my little thermistors match up with the slower thermal response of the temperature sensor I was using. Those time constants weren’t important for real-world usage of the thermistors once I got the curve function’s parameters figured out, but I needed them to match up the X (observed voltage) and Y (measured temperature) as closely as possible for the curve fitting. When is called at the beginning of the evaluation function above, it returns X values filtered with those parameters, values[self.kTC:].

You can see all these parameters in the initial population, which is printed to the console when the script starts.

Parameters tc0 and tc1 are the time constants for two IIR lowpass filter stages, mimicking the thermal mass of the temperature sensor. Parameters v1_a0, v1_a1, and v1_rs are the terms for the nonlinear curve for the first thermistor voltage divider, and the three parameters starting with v2_ are the same for the second voltage divider. Finally, parameter vp is the nominal 23V voltage supplied by the Yocto-0-10V-Rx.

The values of tc0 and tc1 are subject to a constraint, which must be satisfied for the individual to even be evaluated and was defined with a keyword of the Population constructor. In this example, the sole constraint is provided by a method of Evaluator that returns True only if the second filter stage has a larger (slower) time constant than the first:

def constraint(self, values):
    """
    Only allows successive time constants that increase with each
    stage. Avoids duplicate evaluation of equivalent filters.
    """
    tcs = [values[x] for x in sorted(values.keys()) if x.startswith('tc')]
    return np.all(np.greater(np.diff(tcs), 0))

This keeps ade from having to consider combinations of filter stages that are exactly equivalent, cutting the search space in half. You can use constraints to rule out bogus, duplicative, or doomed combinations of parameters and do the hard work of evaluation only on unique and viable individuals.
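For example, calling a standalone version of that constraint with hypothetical parameter dictionaries:

```python
import numpy as np

def constraint(values):
    # Same logic as Evaluator.constraint: time constants must
    # strictly increase from one filter stage to the next
    tcs = [values[x] for x in sorted(values.keys()) if x.startswith('tc')]
    return bool(np.all(np.greater(np.diff(tcs), 0)))

ok = constraint({'tc0': 10.0, 'tc1': 35.0, 'vp': 23.0})    # increasing: allowed
bad = constraint({'tc0': 35.0, 'tc1': 10.0, 'vp': 23.0})   # decreasing: rejected
```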


Now, the best individuals randomly chosen for the initial population in this example don’t look too promising. The very best one has a sum-of-squared-error (SSE) of about 6,000, and things fall off rapidly after that. Let’s watch how quickly a little evolution improves on pure randomness.12

Midway through the second generation, differential evolution has found a successful challenger with half the SSE of the best initial individual!13 And by the fifth generation, the best challenger has an SSE not even half as bad as that.

You can see the outcome of each tournament between target and its challenger with the characters that march across the console while ade is running in your terminal. If the character is an “X”, the challenger was worse and the target keeps its place. If the character is a numeric digit, the challenger won its tournament, and the number represents how much better the challenger was than the target it replaced. Early on, there will be a lot of impressive gains as awful randomly-chosen individuals and their early offspring get the boot.

In later generations, most of the characters will be “X” as very good targets demand even better challengers to replace them. And when they do get replaced, the digits will mostly be “1”, which indicates that the challenger was less than 150% better than its target. It’s still progress, but slower. There will also be less frequent reports of a new overall best individual. Take a look at what happened in generation 40, for example.

One successful challenger became the overall best individual, with an impressively low SSE of 927. There were only 14 or so other success stories (about 10% of the population), none of which broke the best-individual record, and only one of which had an SSE less than half of the target it replaced. One of the successful challengers was so marginal in its improvement that it got a “0” digit, indicating that it was less than 1% better than the target it replaced.

Something else is different in generation 40: The lower bound for the F differential scaling factor is no longer 0.5. The default behavior of ade is to automatically adjust F with each generation once improvements slow down. First it reduces the lower bound of F, then the upper bound, when the challengers’ improvements aren’t quite adding up anymore.

Eventually, when the range for randomly choosing a value of F becomes so small and low that the optimization is just tweaking around a local minimum, ade stops and reports its final result. In this example, that happened after 70 generations, with a result that was only about 1.5% better than the one that showed up in generation 40, and about 5% better than the result found in generation 20.

You can wait for minor improvements as long as you want (see the bitterEnd option to the DifferentialEvolution constructor), but don’t feel bad about seeing slow progress after a certain point. Most of your own DNA evolution got done in single-celled organisms a billion years ago. A newborn baby adopted from your ancestors 20,000 years ago could grow up and go to college, graduate just fine, and not look or act out of place.


Copyright (C) 2017-2018 by Edwin A. Suominen,

See for API documentation as well as information about
Ed's background and other projects, software and otherwise.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the
License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS
IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language
governing permissions and limitations under the License.


  1. The target is the population member that is compared to a challenger (or child) and replaced if the challenger has better fitness. The target is also called the candidate, although that is a bit misleading in that it is really an incumbent: The target already exists and is being challenged by a usurper that wants to take its place. 

  2. The base individual is also in a sense a parent, in that the challenger’s “chromosome” values are based on its values. With the crossover operation, the target individual represents the other parent. 

  3. Mutation is also referred to as perturbation, since the individuals in DE are defined by a sequence of real values rather than encoded chromosomes with discrete digits (or DNA base pairs) that are randomly changed. 

  4. Crossover is also called recombination. It is (somewhat) analogous to what happens biologically with chromosomes of two parents. 

  5. The vector difference is a subtraction between each item in the values sequence of one individual object and the corresponding item in the values sequence of the other. The scaling is simply a multiplication of all the values after the subtraction has been done. 

  6. The “array” depicted is actually an instance of Population, which does a lot of cool stuff in addition to just holding a bag full of individuals and their DeferredLock instances. 

  7. “Twisted is an event-driven networking engine written in Python and licensed under the open source MIT license.” It also happens to work great for asynchronous coding that doesn’t even involve networking. Once you quit writing blocking code, you just start to think differently about programming. 

  8. A Deferred object tied to an immediately available result simply fires right away. So we always yield one, even when all the individuals are ready. 

  9. Or even 32, for those of you who like running expensive space heaters. 

  10. Each time the algorithm runs a challenge, it locks four individuals, or a full 10% of this small population. Once several challenges have gotten underway, it’s less likely that a new challenge will happen to select four individuals that are not being evaluated and subject to replacement. Take another look at FIG. 1. 

  11. The callable MultiRunner object gets made the DE evaluation function when a reference to it is provided as the first constructor argument of Population. See the constructor of Solver in the script. 

  12. It’s not technically pure randomness, as the default for ade is to initialize the population using a Latin hypercube to initialize pseudorandom parameters with minimal clustering. 

  13. There’s a minor bug in ade I haven’t fixed as of this writing (sorry) that causes the “best individual” reporting to start out with ones worse than the best one found in the initial population because it reports successful challengers, and they only need to be better than the lousy target they are replacing. After a generation or two, the best successful challenger is almost always better than the best initial individual anyhow.