Module astStats

module for performing statistical calculations.

(c) 2007-2012 Matt Hilton

(c) 2013-2014 Matt Hilton & Steven Boada

http://astlib.sourceforge.net

This module (as you may notice) provides very few statistical routines. It does, however, provide biweight (robust) estimators of location and scale, as described in Beers et al. 1990 (AJ, 100, 32), in addition to a robust least squares fitting routine that uses the biweight transform.

Some routines may fail with a 'divide by zero' error if they are passed lists with too few items. Where this occurs, the function returns None. If astStats.REPORT_ERRORS=True (the default), an error message is also printed to the console. Scripts can handle such errors by testing whether an astStats function returned None.
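
The None-return convention can be handled with an explicit check. The sketch below is not astLib code: safe_stdev is a hypothetical stand-in that mimics the behaviour described above (returning None instead of raising on a divide by zero), so the pattern can be tried without astLib installed.

```python
import math

REPORT_ERRORS = True  # stand-in for the flag described above, not astStats itself


def safe_stdev(data_list):
    """Hypothetical stand-in: sample standard deviation, or None on divide by zero."""
    n = len(data_list)
    try:
        mean = sum(data_list) / n
        variance = sum((x - mean) ** 2 for x in data_list) / (n - 1)
    except ZeroDivisionError:
        if REPORT_ERRORS:
            print("ERROR: divide by zero (too few items in list)")
        return None
    return math.sqrt(variance)


# A single item cannot give a sample standard deviation, so None comes back.
result = safe_stdev([4.2])
if result is None:
    scatter = float("nan")  # fall back to a placeholder instead of crashing
else:
    scatter = result
```

The same "is None" test would apply to any function in this module whose documentation carries the note "Returns None if an error occurs".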

For more extensive statistics modules, the Python bindings for GNU R (http://rpy.sourceforge.net) or SciPy (http://www.scipy.org) are suggested.

Functions
float
mean(dataList)
Calculates the mean average of a list of numbers.
float
weightedMean(dataList)
Calculates the weighted mean average of a two dimensional list (value, weight) of numbers.
float
stdev(dataList)
Calculates the (sample) standard deviation of a list of numbers.
float
rms(dataList)
Calculates the root mean square of a list of numbers.
float
weightedStdev(dataList)
Calculates the weighted (sample) standard deviation of a list of numbers.
float
median(dataList)
Calculates the median of a list of numbers.
float
modeEstimate(dataList)
Returns an estimate of the mode of a set of values by mode=(3*median)-(2*mean).
float
MAD(dataList)
Calculates the Median Absolute Deviation of a list of numbers.
float
normalizdMAD(dataList)
Calculates the normalized Median Absolute Deviation of a list of numbers. For a Gaussian distribution, the standard deviation is approximately 1.4826 times the MAD.
float
biweightLocation(dataList, tuningConstant=6.0)
Calculates the biweight location estimator (like a robust average) of a list of numbers.
float
biweightScale(dataList, tuningConstant=9.0)
Calculates the biweight scale estimator (like a robust standard deviation) of a list of numbers.
float
biweightScale_test(dataList, tuningConstant=9.0)
Calculates the biweight scale estimator (like a robust standard deviation) of a list of numbers.
dictionary
biweightClipped(dataList, tuningConstant, sigmaCut)
Iteratively calculates biweight location and scale, using sigma clipping, for a list of values.
list
biweightTransform(dataList, tuningConstant)
Calculates the biweight transform for a set of values.
float
gapperEstimator(dataList)
Calculates the Gapper Estimator (like a robust standard deviation) on a list of numbers.
dictionary
OLSFit(dataList)
Performs an ordinary least squares fit on a two dimensional list of numbers.
dictionary
clippedMeanStdev(dataList, sigmaCut=3.0, maxIterations=10.0)
Calculates the clipped mean and stdev of a list of numbers.
dictionary
clippedMedianStdev(dataList, sigmaCut=3.0, maxIterations=10.0)
Calculates the clipped median and stdev of a list of numbers.
dictionary
clippedWeightedLSFit(dataList, sigmaCut)
Performs a weighted least squares fit on a list of numbers with sigma clipping.
dictionary
weightedLSFit(dataList, weightType)
Performs a weighted least squares fit on a three dimensional list of numbers [x, y, y error].
dictionary
biweightLSFit(dataList, tuningConstant, sigmaCut=None)
Performs a weighted least squares fit, where the weights used are the biweight transforms of the residuals to the previous best fit (the procedure is iterative).
list
cumulativeBinner(data, binMin, binMax, binTotal)
Bins the input data cumulatively.
list
binner(data, binMin, binMax, binTotal)
Bins the input data.
list
weightedBinner(data, weights, binMin, binMax, binTotal)
Bins the input data, recorded frequency is sum of weights in bin.
tuple
bootstrap(data, statistic, resamples=1000, alpha=0.05, output='ci', **kwargs)
Returns the bootstrap estimate of the confidence interval for the given statistic.
tuple
runningStatistic(x, y, statistic='mean', binNumber=10, **kwargs)
Calculates the value given by statistic in bins of x.
numpy.array
slice_sampler(px, N=1, x=None)
Provides N samples from a user-defined discrete distribution.
Variables
  REPORT_ERRORS = True
  __package__ = 'astLib'
Function Details

mean(dataList)

Calculates the mean average of a list of numbers.

Parameters:
  • dataList (list or numpy array) - input data, must be a one dimensional list
Returns: float
mean average

weightedMean(dataList)

Calculates the weighted mean average of a two dimensional list (value, weight) of numbers.

Parameters:
  • dataList (list) - input data, must be a two dimensional list in format [value, weight]
Returns: float
weighted mean average

stdev(dataList)

Calculates the (sample) standard deviation of a list of numbers.

Parameters:
  • dataList (list or numpy array) - input data, must be a one dimensional list
Returns: float
standard deviation

rms(dataList)

Calculates the root mean square of a list of numbers.

Parameters:
  • dataList (list) - input data, must be a one dimensional list
Returns: float
root mean square

weightedStdev(dataList)

Calculates the weighted (sample) standard deviation of a list of numbers.

Parameters:
  • dataList (list) - input data, must be a two dimensional list in format [value, weight]
Returns: float
weighted standard deviation

Note: Returns None if an error occurs.

median(dataList)

Calculates the median of a list of numbers.

Parameters:
  • dataList (list or numpy array) - input data, must be a one dimensional list
Returns: float
median average

modeEstimate(dataList)

Returns an estimate of the mode of a set of values by mode=(3*median)-(2*mean).

Parameters:
  • dataList (list) - input data, must be a one dimensional list
Returns: float
estimate of mode average
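
The formula above can be written out directly. This is a minimal sketch using the standard library rather than astLib itself; the name mode_estimate is illustrative only.

```python
import statistics


def mode_estimate(data_list):
    """Estimate the mode as (3 * median) - (2 * mean)."""
    return 3.0 * statistics.median(data_list) - 2.0 * statistics.mean(data_list)


values = [1.0, 2.0, 3.0, 4.0, 10.0]  # median 3.0, mean 4.0
estimate = mode_estimate(values)      # 3 * 3.0 - 2 * 4.0 = 1.0
```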

MAD(dataList)

Calculates the Median Absolute Deviation of a list of numbers.

Parameters:
  • dataList (list) - input data, must be a one dimensional list
Returns: float
median absolute deviation

normalizdMAD(dataList)

Calculates the normalized Median Absolute Deviation of a list of numbers. For a Gaussian distribution, the standard deviation is approximately 1.4826 times the MAD.

Parameters:
  • dataList (list) - input data, must be a one dimensional list
Returns: float
normalized median absolute deviation
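
As an illustration of the two MAD routines, the sketch below assumes the usual definition MAD = median(|x - median(x)|) and scales it by the factor 1.4826 quoted above. Whether astLib computes it in exactly this way is an assumption; the function names are illustrative.

```python
import statistics


def mad(data_list):
    """Median of the absolute deviations from the median."""
    centre = statistics.median(data_list)
    return statistics.median([abs(x - centre) for x in data_list])


def normalized_mad(data_list):
    """MAD scaled by 1.4826 so that it is comparable to a Gaussian sigma."""
    return 1.4826 * mad(data_list)


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # the outlier barely affects the MAD
spread = normalized_mad(values)        # MAD is 1.0, so this is 1.4826
```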

biweightLocation(dataList, tuningConstant=6.0)

Calculates the biweight location estimator (like a robust average) of a list of numbers.

Parameters:
  • dataList (list) - input data, must be a one dimensional list
  • tuningConstant (float) - 6.0 is recommended.
Returns: float
biweight location

Note: Returns None if an error occurs.
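
For orientation, here is a sketch of a single-step biweight location following the form given in Beers et al. 1990: u = (x - M) / (c * MAD), with the estimate M + sum((x - M)(1 - u^2)^2) / sum((1 - u^2)^2) taken over points with |u| < 1, where M is the median and c the tuning constant. It is not astLib's implementation, and returning None when the MAD is zero is an assumption made to mirror the note above.

```python
import statistics


def biweight_location(data_list, tuning_constant=6.0):
    """Single-step biweight location; returns None if the MAD is zero."""
    m = statistics.median(data_list)
    mad = statistics.median([abs(x - m) for x in data_list])
    if mad == 0:
        return None  # would otherwise divide by zero
    top = 0.0
    bottom = 0.0
    for x in data_list:
        u = (x - m) / (tuning_constant * mad)
        if abs(u) < 1.0:  # points far from the median get zero weight
            w = (1.0 - u * u) ** 2
            top += (x - m) * w
            bottom += w
    return m + top / bottom


# The gross outlier (1000) is ignored; the plain mean would be about 169.
location = biweight_location([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])
```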

biweightScale(dataList, tuningConstant=9.0)

Calculates the biweight scale estimator (like a robust standard deviation) of a list of numbers.

Parameters:
  • dataList (list) - input data, must be a one dimensional list
  • tuningConstant (float) - 9.0 is recommended.
Returns: float
biweight scale

Note: Returns None if an error occurs.
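
A sketch of the biweight scale in the form given by Beers et al. 1990 follows: with u = (x - M) / (c * MAD), the scale is sqrt(n) * sqrt(sum((x - M)^2 (1 - u^2)^4)) / |sum((1 - u^2)(1 - 5u^2))|, both sums taken over |u| < 1. This is an illustration, not astLib's code, and the None returns are assumptions that mirror the note above.

```python
import math
import statistics


def biweight_scale(data_list, tuning_constant=9.0):
    """Biweight scale estimate; returns None if it cannot be computed."""
    m = statistics.median(data_list)
    mad = statistics.median([abs(x - m) for x in data_list])
    if mad == 0:
        return None  # would otherwise divide by zero
    n = len(data_list)
    top = 0.0
    bottom = 0.0
    for x in data_list:
        u = (x - m) / (tuning_constant * mad)
        if abs(u) < 1.0:  # distant points do not contribute
            top += (x - m) ** 2 * (1.0 - u * u) ** 4
            bottom += (1.0 - u * u) * (1.0 - 5.0 * u * u)
    if bottom == 0:
        return None
    return math.sqrt(n) * math.sqrt(top) / abs(bottom)


# Close to, but a little below, the sample standard deviation (about 1.58).
scale = biweight_scale([1.0, 2.0, 3.0, 4.0, 5.0])
```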

biweightScale_test(dataList, tuningConstant=9.0)

Calculates the biweight scale estimator (like a robust standard deviation) of a list of numbers.

Parameters:
  • dataList (list) - input data, must be a one dimensional list
  • tuningConstant (float) - 9.0 is recommended.
Returns: float
biweight scale

Note: Returns None if an error occurs.

biweightClipped(dataList, tuningConstant, sigmaCut)

Iteratively calculates biweight location and scale, using sigma clipping, for a list of values. The calculation is performed on the first column of a multi-dimensional list; other columns are ignored.

Parameters:
  • dataList (list) - input data
  • tuningConstant (float) - 6.0 is recommended for location estimates, 9.0 is recommended for scale estimates
  • sigmaCut (float) - sigma clipping to apply
Returns: dictionary
estimate of biweight location, scale, and list of non-clipped data, in the format {'biweightLocation', 'biweightScale', 'dataList'}

Note: Returns None if an error occurs.

biweightTransform(dataList, tuningConstant)

Calculates the biweight transform for a set of values. Useful for using as weights in robust line fitting.

Parameters:
  • dataList (list) - input data, must be a one dimensional list
  • tuningConstant (float) - 6.0 is recommended for location estimates, 9.0 is recommended for scale estimates
Returns: list
list of biweights

gapperEstimator(dataList)

Calculates the Gapper Estimator (like a robust standard deviation) on a list of numbers. Beers et al. 1990 recommend this for small number statistics, as it is insensitive to outliers and reproduces the true dispersion of the system more accurately than the canonical rms standard deviation. See Hou et al. 2009 for a comparison.

Parameters:
  • dataList (list) - input data, must be a one dimensional list
Returns: float
The dispersion of the dataList
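
The gapper is built from the gaps between the sorted values: sigma = sqrt(pi) / (n (n - 1)) * sum(w_i * g_i), with gaps g_i = x_(i+1) - x_i and weights w_i = i (n - i) for i = 1 ... n-1. The sketch below follows that formula; it is an illustration rather than astLib's code, and returning None for fewer than two values is an assumption.

```python
import math


def gapper_estimator(data_list):
    """Gapper scale estimate from the weighted gaps of the sorted data."""
    x = sorted(data_list)
    n = len(x)
    if n < 2:
        return None  # no gaps to measure
    total = 0.0
    for i in range(1, n):
        gap = x[i] - x[i - 1]   # g_i
        weight = i * (n - i)    # w_i
        total += weight * gap
    return math.sqrt(math.pi) / (n * (n - 1)) * total


# Unit gaps and n = 5: the weights sum to 20 = n (n - 1), giving sqrt(pi).
dispersion = gapper_estimator([5.0, 3.0, 1.0, 4.0, 2.0])
```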

OLSFit(dataList)

Performs an ordinary least squares fit on a two dimensional list of numbers. Minimum number of data points is 5.

Parameters:
  • dataList (list) - input data, must be a two dimensional list in format [x,y]
Returns: dictionary
slope and intercept on y-axis, with associated errors, in the format {'slope', 'intercept', 'slopeError', 'interceptError'}

Note: Returns None if an error occurs.
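
As a point of reference, an ordinary least squares fit returning the same dictionary keys can be sketched as follows. The error formulas are the textbook standard errors based on the residual variance with n - 2 degrees of freedom; whether astLib uses exactly these expressions is an assumption.

```python
import math


def ols_fit(data_list):
    """Fit y = slope * x + intercept to rows of [x, y]; None on failure."""
    n = len(data_list)
    if n < 5:  # the text above states a minimum of 5 data points
        return None
    xs = [row[0] for row in data_list]
    ys = [row[1] for row in data_list]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        return None  # all x values equal, so the slope is undefined
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residual_var = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)
    ) / (n - 2)
    return {
        "slope": slope,
        "intercept": intercept,
        "slopeError": math.sqrt(residual_var / sxx),
        "interceptError": math.sqrt(residual_var * (1.0 / n + x_mean ** 2 / sxx)),
    }


fit = ols_fit([[float(x), 2.0 * x + 1.0] for x in range(6)])  # points on y = 2x + 1
```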

clippedMeanStdev(dataList, sigmaCut=3.0, maxIterations=10.0)

Calculates the clipped mean and stdev of a list of numbers.

Parameters:
  • dataList (list) - input data, one dimensional list of numbers
  • sigmaCut (float) - clipping in Gaussian sigma to apply
  • maxIterations (int) - maximum number of iterations
Returns: dictionary
format {'clippedMean', 'clippedStdev', 'numPoints'}
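
Sigma clipping can be sketched as a loop that repeatedly discards points lying more than sigmaCut standard deviations from the mean. The version below is an illustration, not astLib's code: it assumes at least two input values, uses the sample standard deviation, and stops early once an iteration removes nothing.

```python
import math


def _mean_stdev(values):
    """Mean and sample standard deviation of a list with two or more items."""
    n = len(values)
    mean = sum(values) / n
    stdev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, stdev


def clipped_mean_stdev(data_list, sigma_cut=3.0, max_iterations=10):
    """Iteratively clip outliers, then report the surviving mean and stdev."""
    values = list(data_list)
    mean, stdev = _mean_stdev(values)
    for _ in range(max_iterations):
        kept = [v for v in values if abs(v - mean) <= sigma_cut * stdev]
        if len(kept) == len(values) or len(kept) < 2:
            break  # converged, or too few points left to continue
        values = kept
        mean, stdev = _mean_stdev(values)
    return {"clippedMean": mean, "clippedStdev": stdev, "numPoints": len(values)}


data = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 100.0]
result = clipped_mean_stdev(data, sigma_cut=2.0)  # the value 100.0 is clipped
```

With only ten points, the outlier inflates the standard deviation enough that a 3 sigma cut leaves it in place, which is why the usage line applies a 2 sigma cut.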

clippedMedianStdev(dataList, sigmaCut=3.0, maxIterations=10.0)

Calculates the clipped median and stdev of a list of numbers.

Parameters:
  • dataList (list) - input data, one dimensional list of numbers
  • sigmaCut (float) - clipping in Gaussian sigma to apply
  • maxIterations (int) - maximum number of iterations
Returns: dictionary
format {'clippedMean', 'clippedStdev', 'numPoints'}

clippedWeightedLSFit(dataList, sigmaCut)

Performs a weighted least squares fit on a list of numbers with sigma clipping. Minimum number of data points is 5.

Parameters:
  • dataList (list) - input data, must be a three dimensional list in format [x, y, y weight]
Returns: dictionary
slope and intercept on y-axis, with associated errors, in the format {'slope', 'intercept', 'slopeError', 'interceptError'}

Note: Returns None if an error occurs.

weightedLSFit(dataList, weightType)

Performs a weighted least squares fit on a three dimensional list of numbers [x, y, y error].

Parameters:
  • dataList (list) - input data, must be a three dimensional list in format [x, y, y error]
  • weightType (string) - if "errors", weights are calculated assuming the input data is in the format [x, y, error on y]; if "weights", the weights are assumed to be already calculated and stored in a fourth column [x, y, error on y, weight] (as used by e.g. astStats.biweightLSFit)
Returns: dictionary
slope and intercept on y-axis, with associated errors, in the format {'slope', 'intercept', 'slopeError', 'interceptError'}

Note: Returns None if an error occurs.
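
For the "errors" case, a weighted straight-line fit can be sketched with the standard formulas, taking each weight as 1 / (y error)^2. This is an illustration and not astLib's implementation; the "weights" mode is not covered, and a y error of zero would raise a ZeroDivisionError here.

```python
import math


def weighted_ls_fit(data_list):
    """Fit y = slope * x + intercept to rows of [x, y, y_error]."""
    s = sx = sy = sxx = sxy = 0.0
    for x, y, y_error in data_list:
        w = 1.0 / (y_error * y_error)  # weight from the error on y
        s += w
        sx += w * x
        sy += w * y
        sxx += w * x * x
        sxy += w * x * y
    delta = s * sxx - sx * sx
    if delta == 0:
        return None  # degenerate input, e.g. all x values equal
    return {
        "slope": (s * sxy - sx * sy) / delta,
        "intercept": (sxx * sy - sx * sxy) / delta,
        "slopeError": math.sqrt(s / delta),
        "interceptError": math.sqrt(sxx / delta),
    }


rows = [
    [0.0, 1.0, 0.1],
    [1.0, 3.0, 0.2],
    [2.0, 5.0, 0.1],
    [3.0, 7.0, 0.3],
    [4.0, 9.0, 0.2],
]  # points on y = 2x + 1 with differing errors
fit = weighted_ls_fit(rows)
```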

biweightLSFit(dataList, tuningConstant, sigmaCut=None)

Performs a weighted least squares fit, where the weights used are the biweight transforms of the residuals to the previous best fit, i.e. the procedure is iterative. It converges very quickly (the number of iterations is set to 10 by default). Minimum number of data points is 10.

This seems to give slightly different results to the equivalent R routine, so use at your own risk!

Parameters:
  • dataList (list) - input data, must be a three dimensional list in format [x, y, y weight]
  • tuningConstant (float) - 6.0 is recommended for location estimates, 9.0 is recommended for scale estimates
  • sigmaCut (float) - sigma clipping to apply (set to None if not required)
Returns: dictionary
slope and intercept on y-axis, with associated errors, in the format {'slope', 'intercept', 'slopeError', 'interceptError'}

Note: Returns None if an error occurs.

cumulativeBinner(data, binMin, binMax, binTotal)

Bins the input data cumulatively.

Parameters:
  • data - input data, must be a one dimensional list
  • binMin (float) - minimum value from which to bin data
  • binMax (float) - maximum value up to which to bin data
  • binTotal (int) - number of bins
Returns: list
binned data, in format [bin centre, frequency]

binner(data, binMin, binMax, binTotal)

Bins the input data.

Parameters:
  • data - input data, must be a one dimensional list
  • binMin (float) - minimum value from which to bin data
  • binMax (float) - maximum value up to which to bin data
  • binTotal (int) - number of bins
Returns: list
binned data, in format [bin centre, frequency]
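
The output format [bin centre, frequency] can be illustrated with a small pure-Python sketch. The edge handling is an assumption: bins are treated as half-open intervals and values outside binMin to binMax are ignored, which may differ from astLib's behaviour.

```python
def binner(data, bin_min, bin_max, bin_total):
    """Count values in equal-width bins; returns [[bin centre, frequency], ...]."""
    step = (bin_max - bin_min) / float(bin_total)
    counts = [0] * bin_total
    for value in data:
        if bin_min <= value < bin_max:  # out-of-range values are skipped
            index = min(int((value - bin_min) / step), bin_total - 1)
            counts[index] += 1
    return [[bin_min + (i + 0.5) * step, counts[i]] for i in range(bin_total)]


# Three unit-width bins between 0 and 3; the value 7.0 falls outside the range.
histogram = binner([0.5, 1.5, 1.6, 2.5, 7.0], 0.0, 3.0, 3)
# histogram == [[0.5, 1], [1.5, 2], [2.5, 1]]
```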

weightedBinner(data, weights, binMin, binMax, binTotal)

Bins the input data, recorded frequency is sum of weights in bin.

Parameters:
  • data - input data, must be a one dimensional list
  • weights - weight of each value in data, summed to give the recorded frequency in each bin
  • binMin (float) - minimum value from which to bin data
  • binMax (float) - maximum value up to which to bin data
  • binTotal (int) - number of bins
Returns: list
binned data, in format [bin centre, frequency]

bootstrap(data, statistic, resamples=1000, alpha=0.05, output='ci', **kwargs)

Returns the bootstrap estimate of the confidence interval for the given statistic. The confidence level is 100*(1-alpha) percent. A 1d array is passed to the function statistic. Any arguments needed by statistic are passed via **kwargs.

Parameters:
  • data (list) - The data on which the given statistic is calculated
  • statistic (function) - The statistic desired
  • resamples (int) - The number of bootstrap resamplings
  • alpha (float) - The confidence level is 100*(1-alpha) percent; the default of 0.05 gives 95%
  • output - The format of the output. 'ci' gives the confidence interval, and 'errorbar' gives the length of the errorbar suitable for plotting with matplotlib.
  • kwargs (Keywords) - Arguments needed by the statistic function
Returns: tuple
(Lower Interval, Upper Interval)
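
The idea behind the 'ci' output can be sketched with a percentile bootstrap: resample the data with replacement, evaluate the statistic on each resample, and read the interval from the sorted results. This is an illustration rather than astLib's code. The seed argument is a hypothetical addition for reproducibility, and the statistic receives a plain list instead of a 1d array.

```python
import random
import statistics


def bootstrap_ci(data, statistic, resamples=1000, alpha=0.05, seed=None, **kwargs):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)  # seed is hypothetical, not an astLib parameter
    n = len(data)
    estimates = sorted(
        statistic([rng.choice(data) for _ in range(n)], **kwargs)
        for _ in range(resamples)
    )
    lower_index = int((alpha / 2.0) * resamples)
    upper_index = min(int((1.0 - alpha / 2.0) * resamples), resamples - 1)
    return estimates[lower_index], estimates[upper_index]


data = [2.1, 2.5, 1.9, 2.3, 2.8, 2.0, 2.4, 2.6, 2.2, 2.7]
lower, upper = bootstrap_ci(data, statistics.mean, seed=1)  # 95% interval on the mean
```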

runningStatistic(x, y, statistic='mean', binNumber=10, **kwargs)

Calculates the value given by statistic in bins of x. Useful for plotting a running mean value for a scatter plot, for example. This function allows the computation of the sum, mean, median, std, or other statistic of the values within each bin.

NOTE: if the statistic is a callable function and there are empty data bins, those bins are skipped to keep the function from failing.

Parameters:
  • x (numpy array) - data over which the bins are calculated
  • y (numpy array) - values for corresponding x values
  • statistic (string or function) - The statistic to compute (default is 'mean'). Acceptable values are 'mean', 'median', 'sum', 'std', or a callable function. Extra arguments are passed as kwargs.
  • binNumber (int) - The desired number of bins for the x data.
Returns: tuple
A tuple of two lists containing the left bin edges and the value of the statistic in each of the bins.
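
A running statistic of this kind can be sketched by sorting the y values into equal-width bins of x and applying the statistic to each bin. The sketch below is not astLib's code: it accepts only a callable, always skips empty bins, and assumes the x values are not all identical (otherwise the bin width would be zero).

```python
import statistics


def running_statistic(x, y, statistic=statistics.mean, bin_number=10):
    """Apply statistic to the y values falling in each equal-width bin of x."""
    x_min, x_max = min(x), max(x)
    width = (x_max - x_min) / float(bin_number)
    bins = [[] for _ in range(bin_number)]
    for xi, yi in zip(x, y):
        # The largest x value is placed in the last bin.
        index = min(int((xi - x_min) / width), bin_number - 1)
        bins[index].append(yi)
    left_edges, values = [], []
    for i, members in enumerate(bins):
        if members:  # skip empty bins
            left_edges.append(x_min + i * width)
            values.append(statistic(members))
    return left_edges, values


x = [float(i) for i in range(10)]
y = [2.0 * xi for xi in x]
edges, means = running_statistic(x, y, bin_number=2)
# edges == [0.0, 4.5], means == [4.0, 14.0]
```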

slice_sampler(px, N=1, x=None)

Provides N samples from a user-defined discrete distribution.

>>> slice_sampler(px, N=1, x=None)

If x=None (default) or if len(x) != len(px), it will return an array of integers between 0 and len(px)-1. If x is given, it will return the samples from x according to the distribution px.

Originally written by Adam Laiacano, 2011

Parameters:
  • px (numpy array or list) - A discrete probability distribution
  • N (int) - The number of samples to return, default is 1
  • x (numpy array or list) - Optional array/list of observation values to return, where prob(x) = px
Returns: numpy.array
The desired number of samples drawn from the distribution
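
The input and output conventions described above can be illustrated with a simple sampler. Note that this sketch swaps in inverse-CDF sampling in place of slice sampling, returns a plain list rather than a numpy array, and adds a hypothetical seed argument; the name sample_discrete is also illustrative.

```python
import bisect
import itertools
import random


def sample_discrete(px, n=1, x=None, seed=None):
    """Draw n samples from the discrete distribution px (inverse-CDF method)."""
    rng = random.Random(seed)  # seed is hypothetical, for reproducibility
    cumulative = list(itertools.accumulate(px))
    total = cumulative[-1]  # px need not be normalized
    indices = []
    for _ in range(n):
        r = rng.random() * total
        # First index whose cumulative probability exceeds r.
        indices.append(min(bisect.bisect_right(cumulative, r), len(px) - 1))
    if x is None or len(x) != len(px):
        return indices  # integers between 0 and len(px) - 1
    return [x[i] for i in indices]


# All of the probability sits on the middle entry, so every draw returns it.
draws = sample_discrete([0.0, 1.0, 0.0], n=5, x=["a", "b", "c"], seed=0)
# draws == ['b', 'b', 'b', 'b', 'b']
```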