ade.history.ClosestPairFinder(object)
class documentation
Part of ade.history
(View In Hierarchy)
I determine which two rows of a 2-D Numpy array X that I maintain, having up to Nr rows and Nc columns, are most similar to each other.
The array's first column contains the SSEs of individuals and the remaining columns contain the parameter values that resulted in each SSE. The array values are normalized such that the average of each of the columns is 1.0.
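The column-wise normalization described above can be sketched as follows. The array contents here are made up for illustration, and `X_normalized` is a hypothetical name, not part of the class's API:

```python
import numpy as np

# Illustrative SSE+values array: first column is SSE, the rest are
# parameter values. These numbers are invented for the example.
X = np.array([
    [2.0, 10.0, 0.5],
    [4.0, 30.0, 1.5],
])

# Divide each column by its mean so that every column averages 1.0
X_normalized = X / X.mean(axis=0)
print(X_normalized.mean(axis=0))  # each column's mean is now 1.0
```

This keeps all columns, SSE included, on a comparable scale before any row-to-row differences are computed.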
Instance Variable | Nr | The number of rows in X. |
Instance Variable | Nc | The number of columns in X, SSE + parameter values. |
Instance Variable | X | My Numpy 2-D array having up to Nr active rows and exactly Nc columns of SSE+values combinations. |
Instance Variable | S | A Numpy 1-D array having a scaling factor for the sum-of-squared-differences calculated by __call__. The scaling factor is the reciprocal of the variance of all active rows in X, or None if the variance needs to be (re)computed. |
Instance Variable | K | A set of indices to the active rows in X. |
Instance Variable | Kn | A set of the indices that have never been in the population. |
Class Variable | Np_max | The maximum number of row pairs to examine for differences in __call__ . |
Class Variable | Kn_penalty | The multiplicative penalty to impose on the computed difference to favor pairs where at least one member has been a population member. |
Method | __init__ | ClosestPairFinder(Nr, Nc) |
Method | clear | Sets my K and Kn to empty sets and S to None, returning me to a virginal state. |
Method | setRow | Call with the row index to my X array and a 1-D array Z with the SSE+values that are to be stored in that row. |
Method | clearRow | Call with the row index to my X array to have me disregard the SSE+values that are to be stored in that row. If the index is in my Kn set, discards it from there. |
Method | pairs_sampled | Returns a 2-D Numpy array of N pairs of separate row indices to my X array, randomly sampled from my set K with replacement. |
Method | pairs_all | Returns a 2-D Numpy array of all pairs of separate row indices to my X array where the second value in each pair is greater than the first value. |
Method | calcerator | Does the calculation of the most expendable row index for calculate in Twisted-friendly fashion, iterating over computationally intensive chunks of processing. |
Method | done | Undocumented |
Method | calculate | Calculates the most expendable row index in my X array. |
Method | __call__ | Returns a Deferred that fires with the row index to my X array of the SSE+values combination that is most expendable (closest to another one, and not currently in the population). |
A Numpy 1-D array having a scaling factor for the sum-of-squared-differences calculated by __call__. The scaling factor is the reciprocal of the variance of all active rows in X, or None if the variance needs to be (re)computed.
Call with the row index to my X array and a 1-D array Z with the SSE+values that are to be stored in that row.
Nulls out my S scaling array to force re-computation of the column-wise variances when __call__ runs next, because the new row entry will change them.
Never call this with an inf or NaN anywhere in Z. An exception will be raised if you try.
Parameters | neverInPop | Set True to indicate that this SSE+value was never in the population and thus should be more likely to be bumped in favor of a newcomer during size limiting. |
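The contract described above (reject any non-finite Z, store the row, invalidate the cached scaling vector) might be sketched like this; `setRow_sketch` is a hypothetical stand-in for illustration, not ade's actual implementation:

```python
import numpy as np

def setRow_sketch(X, k, Z):
    """
    Hypothetical sketch of setRow's documented contract: store the
    SSE+values row Z at index k of X, refusing any inf or NaN. The
    real method would also null out the cached S scaling array so
    the column-wise variances get recomputed.
    """
    Z = np.asarray(Z)
    if not np.all(np.isfinite(Z)):
        # The docs promise an exception for inf or NaN values
        raise ValueError("Z must not contain inf or NaN")
    X[k] = Z

# Usage with an invented 3x2 array
X = np.zeros((3, 2))
setRow_sketch(X, 1, np.array([2.0, 3.0]))
```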
Call with the row index to my X array to have me disregard the SSE+values that are to be stored in that row. If the index is in my Kn set, discards it from there.
Nulls out my S scaling array to force re-computation of the column-wise variances when __call__ runs next, because disregarding the row entry will change them.
Returns a 2-D Numpy array of N pairs of separate row indices to my X array, randomly sampled from my set K with replacement.
The second value in each row of the returned array must be greater than the first value. (There may be duplicate rows, however.) Sampling of K continues until there are enough suitable rows.
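The sampling behavior described above can be sketched as follows; `pairs_sampled_sketch` and its batch-rejection loop are an illustrative guess at the documented behavior, not ade's actual code:

```python
import numpy as np

def pairs_sampled_sketch(K, N, rng=None):
    """
    Hypothetical sketch of pairs_sampled's documented behavior: draw
    pairs of row indices from K with replacement, keeping only pairs
    whose second index exceeds the first, until N pairs are collected.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = np.array(sorted(K))
    rows = []
    while len(rows) < N:
        # Draw a batch of candidate pairs with replacement
        ij = rng.choice(K, size=(N, 2))
        # Keep only pairs with second value greater than the first
        ij = ij[ij[:, 1] > ij[:, 0]]
        rows.extend(ij.tolist())
    return np.array(rows[:N])

pairs = pairs_sampled_sketch({0, 2, 5, 7, 9}, 4)
```

Note that, as the documentation says, duplicate pairs can appear in the result because sampling is done with replacement.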
Returns a 2-D Numpy array of all pairs of separate row indices to my X array where the second value in each pair is greater than the first value.
The returned array will have N*(N-1)/2 rows and two columns, where N is the length of my K set of row indices.
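The exhaustive pairing described above can be sketched with NumPy's upper-triangle indices; `pairs_all_sketch` is a hypothetical illustration, not ade's actual implementation:

```python
import numpy as np

def pairs_all_sketch(K):
    """
    Hypothetical sketch of pairs_all's documented behavior: every pair
    of row indices from K whose second value is greater than the first,
    giving N*(N-1)/2 rows for N indices.
    """
    K = np.array(sorted(K))
    # Strict upper triangle of an N x N grid enumerates all i < j pairs
    a, b = np.triu_indices(len(K), k=1)
    return np.column_stack([K[a], K[b]])

pairs = pairs_all_sketch({0, 3, 5, 8})
# 4 indices -> 4*3/2 = 6 pairs, each with second value > first
```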
Does the calculation of the most expendable row index for calculate in Twisted-friendly fashion, iterating over computationally intensive chunks of processing.
If K is specified (only for unit testing), the result is the 1-D array of differences D (local), properly scaled (see below). Otherwise, the result is the row index with the smallest difference. In either case, my result Bag contains the result.
The D vector is scaled down by mean SSE to favor lower-SSE history. If the history comes to have more never-population records than those that have been in the population, elements of D corresponding to pairs where the first item in the pair was never in the population are scaled down dramatically. The purpose of that is to keep a substantial fraction of the non-population history reserved for those who once were in the population.
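The scaling and penalty described above might look roughly like the following; the division by mean SSE, the flag array, and the `1` returned by argmin are all assumptions made for this sketch, not ade's actual code:

```python
import numpy as np

# Invented per-pair quantities: D holds each pair's scaled difference,
# meanSSE each pair's mean SSE, and firstNeverInPop flags pairs whose
# first member was never in the population.
Kn_penalty = 1e-3  # illustrative value for the Kn_penalty class variable
D = np.array([4.0, 2.0, 6.0])
meanSSE = np.array([2.0, 1.0, 3.0])
firstNeverInPop = np.array([False, True, False])

# Scale down by mean SSE to favor lower-SSE history (assumed form)
D_scaled = D / meanSSE

# Dramatically scale down pairs whose first member was never in the
# population, making them look "closer" and so more expendable
D_scaled = np.where(firstNeverInPop, D_scaled * Kn_penalty, D_scaled)

# The smallest element marks the most expendable pair
most_expendable_pair = np.argmin(D_scaled)
```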
Calculates the most expendable row index in my X array.
Nr is the number of rows addressed in my K row index array and Np is the maximum number of pairs to examine. The optional K array is for unit testing only.
Returns a Deferred that fires with the most expendable row index.
Returns a Deferred that fires with the row index to my X array of the SSE+values combination that is most expendable (closest to another one, and not currently in the population).
If I have just a single SSE+value combination, the Deferred fires with that combination's row index in X. If there are no legit combinations, it fires with None.
If the maximum number of pairs Np to examine (default Np_max) is greater than N*(N-1)/2, where N is the length of my K set of row indices, pairs_all is called to examine all suitable pairs. Otherwise, pairs_sampled is called instead and examination is limited to a random sample of Np suitable pairs. With the default Np_max of 10000, this occurs at N>142. With Np_max of 1000, it occurs with N>45. Since the N_max of History has a default of 1000, pairs_sampled is what's going to be used in all practical situations.
The similarity is determined from the sum of squared differences between two rows, divided by the column-wise variance of all (active) rows.
Parameters | Np | Set to the maximum number of pairs to examine. Default is Np_max. |
K | For unit testing only: Supply a 2-D Numpy array of pairs of row indices, and the Deferred will fire with just the sum-of-squares difference between each pair. |
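The similarity measure described above (sum of squared differences between two rows, scaled by the reciprocal of the column-wise variance of the active rows) can be sketched as follows; `scaled_squared_difference` is a hypothetical helper for illustration, not ade's actual method:

```python
import numpy as np

def scaled_squared_difference(X, i, j, active):
    """
    Hypothetical sketch of the documented similarity measure: the sum
    of squared differences between rows i and j of X, with each
    column's squared difference multiplied by the reciprocal of that
    column's variance over the active rows (the role played by S).
    """
    S = 1.0 / np.var(X[active], axis=0)  # reciprocal column-wise variance
    return float(np.sum(S * (X[i] - X[j])**2))

# Usage with an invented 3x2 SSE+values array, all rows active
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 9.0]])
d = scaled_squared_difference(X, 0, 1, [0, 1, 2])
```

Dividing by the variance weights each column equally regardless of its spread, so no single parameter dominates the distance between rows.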