I determine which two rows are most similar in a Numpy 2-D array X that I maintain, having Nr rows and Nc columns.

The array's first column contains the SSEs of individuals and the remaining columns contain the parameter values that resulted in each SSE. The array values are normalized such that the average of each column is 1.0.
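The column normalization described above can be sketched as follows; the array contents here are made-up example values, not anything from ade itself:

```python
import numpy as np

# Illustrative SSE+values array: first column is SSE, the rest are
# parameter values (example data, not from ade).
X = np.array([[2.0, 4.0, 1.0],
              [6.0, 8.0, 3.0]])

# Divide each column by its mean so every column averages exactly 1.0.
X_norm = X / X.mean(axis=0)
```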

Instance Variable Nr The number of rows in X.
Instance Variable Nc The number of columns in X, SSE + parameter values.
Instance Variable X My Numpy 2-D array having up to Nr active rows and exactly Nc columns of SSE+values combinations.
Instance Variable S A Numpy 1-D array having a scaling factor for the sum-of-squared-differences calculated by __call__. The scaling factor is the reciprocal of the variance of all active rows in X, or None if the variance needs to be (re)computed.
Instance Variable K A set of indices to the active rows in X.
Instance Variable Kn A set of the indices that have never been in the population.
Class Variable Np_max The maximum number of row pairs to examine for differences in __call__.
Class Variable Kn_penalty The multiplicative penalty to impose on the computed difference to favor pairs where at least one member has been a population member.
Method __init__ ClosestPairFinder(Nr, Nc)
Method clear Sets my K and Kn to empty sets and S to None, returning me to a virginal state.
Method setRow Call with the row index to my X array and a 1-D array Z with the SSE+values that are to be stored in that row.
Method clearRow Call with the row index to my X array to have me disregard the SSE+values stored in that row. If the index is in my Kn set, discards it from there.
Method pairs_sampled Returns a 2-D Numpy array of N pairs of separate row indices to my X array, randomly sampled from my set K with replacement.
Method pairs_all Returns a 2-D Numpy array of all pairs of separate row indices to my X array where the second value in each pair is greater than the first value.
Method calcerator Does the calculation of most expendable row index for calculate in Twisted-friendly fashion, iterating over computationally intensive chunks of processing.
Method done Undocumented
Method calculate Calculates the most expendable row index in my X array.
Method __call__ Returns a Deferred that fires with the row index to my X array of the SSE+values combination that is most expendable (closest to another one, and not currently in the population).
Nr =
The number of rows in X.
Nc =
The number of columns in X, SSE + parameter values.
X =
My Numpy 2-D array having up to Nr active rows and exactly Nc columns of SSE+values combinations.
S =
A Numpy 1-D array having a scaling factor for the sum-of-squared-differences calculated by __call__. The scaling factor is the reciprocal of the variance of all active rows in X, or None if the variance needs to be (re)computed.
K =
A set of indices to the active rows in X.
Kn =
A set of the indices that have never been in the population.
Np_max =
The maximum number of row pairs to examine for differences in __call__.
Kn_penalty =
The multiplicative penalty to impose on the computed difference to favor pairs where at least one member has been a population member.
def __init__(self, Nr, Nc):

ClosestPairFinder(Nr, Nc)

def clear(self):

Sets my K and Kn to empty sets and S to None, returning me to a virginal state.

def setRow(self, k, Z, neverInPop=False):

Call with the row index to my X array and a 1-D array Z with the SSE+values that are to be stored in that row.

Nulls out my S scaling array to force re-computation of the column-wise variances when __call__ runs next, because the new row entry will change them.

Never call this with an inf or NaN anywhere in Z. An exception will be raised if you try.

Parameters
    neverInPop: Set True to indicate that this SSE+value combination was never in the population and thus should be more likely to be bumped in favor of a newcomer during size limiting.
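The finite-value rule above can be sketched with an illustrative helper (this is not the actual ade internals, just the check setRow implies):

```python
import numpy as np

def check_row(Z):
    """Illustrative sketch: setRow refuses any Z containing inf or NaN."""
    Z = np.asarray(Z)
    if not np.all(np.isfinite(Z)):
        raise ValueError("Z contains inf or NaN")
    return Z
```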
def clearRow(self, k):

Call with the row index to my X array to have me disregard the SSE+values stored in that row. If the index is in my Kn set, discards it from there.

Nulls out my S scaling array to force re-computation of the column-wise variances when __call__ runs next, because disregarding the row entry will change them.

def pairs_sampled(self, N):

Returns a 2-D Numpy array of N pairs of separate row indices to my X array, randomly sampled from my set K with replacement.

The second value in each row of the returned array must be greater than the first value. (There may be duplicate rows, however.) Sampling of K continues until there are enough suitable rows.
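A minimal sketch of this sampling behavior, assuming K is a plain Python set and using NumPy's Generator API (the function body is illustrative, not the ade implementation):

```python
import numpy as np

def pairs_sampled(K, N, rng=None):
    """Draw N pairs of distinct row indices from the set K, with
    replacement, keeping only pairs where the second index is greater
    than the first. Duplicate pairs may occur."""
    if rng is None:
        rng = np.random.default_rng()
    Ka = np.array(sorted(K))
    rows = []
    # Keep sampling until there are enough suitable (i < j) pairs.
    while len(rows) < N:
        i, j = rng.choice(Ka, size=2)
        if i < j:
            rows.append((i, j))
    return np.array(rows)
```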

def pairs_all(self):

Returns a 2-D Numpy array of all pairs of separate row indices to my X array where the second value in each pair is greater than the first value.

The returned array will have N*(N-1)/2 rows and two columns, where N is the length of my K set of row indices.
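The full enumeration can be sketched with np.triu_indices, which yields exactly the N*(N-1)/2 upper-triangle pairs described above (a sketch, not the ade implementation):

```python
import numpy as np

def pairs_all(K):
    """Return every pair (i, j) of active row indices with j > i,
    giving N*(N-1)/2 rows for N = len(K)."""
    Ka = np.array(sorted(K))
    # Upper triangle of an NxN grid, excluding the diagonal.
    i, j = np.triu_indices(len(Ka), k=1)
    return np.column_stack([Ka[i], Ka[j]])
```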

def calcerator(self, Nr, Np, K=None):

Does the calculation of most expendable row index for calculate in Twisted-friendly fashion, iterating over computationally intensive chunks of processing.

If K is specified (only for unit testing), the result is the 1-D array of differences D (local), properly scaled (see below). Otherwise, the result is the row index with the smallest difference. In either case, my result Bag contains the result.

The D vector is scaled down by mean SSE to favor lower-SSE history. If the history comes to have more never-population records than those that have been in the population, elements of D corresponding to pairs where the first item in the pair was never in the population are scaled down dramatically. The purpose of that is to keep a substantial fraction of the non-population history reserved for those who once were in the population.
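The never-in-population penalty described above might be applied along these lines; the penalty value and the masking mechanics here are assumptions for illustration, not the actual ade code, and the real logic only kicks in once never-population records outnumber the rest:

```python
import numpy as np

Kn_penalty = 1.5                             # assumed penalty value
D = np.array([4.0, 2.0, 3.0])                # scaled differences per pair
first_never_in_pop = np.array([False, False, True])

# Shrinking D for pairs whose first member was never in the population
# makes those rows look "closer", so they get bumped first.
D = np.where(first_never_in_pop, D / Kn_penalty, D)
```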

def done(self, null):
Undocumented
def calculate(self, Nr, Np, K=None):

Calculates the most expendable row index in my X array.

Nr is the number of rows addressed in my K row index array and Np is the maximum number of pairs to examine. The optional K array is for unit testing only.

Returns a Deferred that fires with the most expendable row index.

def __call__(self, Np=None, K=None):

Returns a Deferred that fires with the row index to my X array of the SSE+values combination that is most expendable (closest to another one, and not currently in the population).

If I have just a single SSE+value combination, the Deferred fires with that combination's row index in X. If there are no legit combinations, it fires with None.

If the maximum number of pairs Np to examine (default Np_max) is greater than N*(N-1)/2, where N is the length of my K set of row indices, pairs_all is called to examine all suitable pairs.

Otherwise, pairs_sampled is called instead and examination is limited to a random sample of Np suitable pairs. With the default Np_max of 10000, this occurs at N>142. With Np_max of 1000, it occurs with N>45. Since the N_max of History has a default of 1000, pairs_sampled is what's going to be used in all practical situations.
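The crossover points claimed above follow directly from the pair count; a quick arithmetic check:

```python
# pairs_sampled takes over once N*(N-1)/2 exceeds Np_max.
def n_pairs(N):
    return N * (N - 1) // 2

assert n_pairs(142) > 10000 and n_pairs(141) <= 10000   # Np_max = 10000
assert n_pairs(46) > 1000 and n_pairs(45) <= 1000       # Np_max = 1000
```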

The similarity is determined from the sum of squared differences between two rows, divided by the column-wise variance of all (active) rows.
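The metric can be sketched as follows; this is an assumed form matching the description, not the exact ade implementation, and the array values are made up:

```python
import numpy as np

# Example normalized SSE+values array (illustrative data).
X = np.array([[1.0, 0.9, 1.2],
              [1.1, 1.0, 0.8],
              [0.9, 1.1, 1.0]])

# Reciprocal column-wise variances over the active rows (the role
# played by the S scaling array).
S = 1.0 / np.var(X, axis=0)

def diff(k1, k2):
    """Scaled sum of squared differences between rows k1 and k2."""
    return np.sum(S * (X[k1] - X[k2])**2)

# The pair with the smallest diff() is the most similar.
```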

Parameters
    Np: Set to the maximum number of pairs to examine. Default is Np_max.
    K: For unit testing only: Supply a 2-D Numpy array of pairs of row indices, and the Deferred will fire with just the sum-of-squares difference between each pair.
API Documentation for ade, generated by pydoctor at 2022-11-17 13:13:22.