fitter module reference

Main module of the fitter package.

This module provides the Fitter class for fitting multiple probability distributions to data samples and comparing their goodness of fit using various metrics.

class fitter.fitter.Fitter(data: ndarray | list[float], xmin: float | None = None, xmax: float | None = None, bins: int = 100, distributions: list[str] | str | None = None, timeout: int = 30, density: bool = True, verbose: bool = True)

Fit a data sample to known distributions

A naive approach to identifying the underlying distribution that could have generated a data set is to compare the histogram of the data with the PDF (probability density function) of a known distribution (e.g., normal).

Yet, the parameters of the distribution are not known and there are lots of distributions. Therefore, an automatic way to fit many distributions to the data would be useful, which is what is implemented here.

Given a data sample, we use the fit method of SciPy to extract the parameters of a given distribution that best fit the data. We repeat this for all available distributions. Finally, we provide a summary so that one can see the quality of the fit for those distributions.

Here is an example where we generate a sample from a gamma distribution.

>>> # First, we create a data sample following a Gamma distribution
>>> from scipy import stats
>>> data = stats.gamma.rvs(2, loc=1.5, scale=2, size=20000)

>>> # We then create the Fitter object
>>> import fitter
>>> f = fitter.Fitter(data)

>>> # just a trick to use only 10 distributions instead of 80 to speed up the fitting
>>> f.distributions = f.distributions[0:10] + ['gamma']

>>> # fit and plot
>>> f.fit()
>>> f.summary()

              sumsquare_error     aic            bic     kl_div  ks_statistic  ks_pvalue
loggamma            0.001176  995.866732 -159536.164644     inf      0.008459   0.469031
gennorm             0.001181  993.145832 -159489.437372     inf      0.006833   0.736164
norm                0.001189  992.975187 -159427.247523     inf      0.007138   0.685416
truncnorm           0.001189  996.975182 -159408.826771     inf      0.007138   0.685416
crystalball         0.001189  996.975078 -159408.821960     inf      0.007138   0.685434

Once the data has been fitted, the summary() method returns a sorted dataframe where the index represents the distribution names.

The AIC is computed as $\mathrm{AIC} = 2k - 2\log(\mathrm{Lik})$, and the BIC as $\mathrm{BIC} = k\log(n) - 2\log(\mathrm{Lik})$,

where Lik is the maximized value of the likelihood function of the model, n the number of data points, and k the number of parameters.
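As a rough illustration of the two formulas above, the sketch below fits a gamma distribution with SciPy and evaluates both criteria by hand. The variable names (`log_lik`, `k`, `n`) are chosen for this example only, and the way the package itself counts parameters or evaluates the likelihood may differ.

```python
import numpy as np
from scipy import stats

# Synthetic sample; the seed is only there to make the run repeatable.
data = stats.gamma.rvs(2, loc=1.5, scale=2, size=2000, random_state=0)

# Maximum-likelihood fit: returns (a, loc, scale) for the gamma distribution.
params = stats.gamma.fit(data)

# Maximized log-likelihood, log(Lik), evaluated at the fitted parameters.
log_lik = np.sum(stats.gamma.logpdf(data, *params))

k = len(params)  # number of fitted parameters
n = len(data)    # number of data points

aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")
```

Because both criteria share the term -2 log(Lik), they differ only in how strongly they penalize the number of parameters: BIC penalizes more than AIC as soon as log(n) exceeds 2.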

Looping over the 80 distributions in SciPy can take some time, so you can overwrite the distributions attribute with a subset if you want. To reload all distributions, call load_all_distributions().

Some distributions do not converge when fitting. There is a timeout of 30 seconds after which the fitting procedure is cancelled. You can change this timeout attribute if needed.

If the histogram of the data has outliers or very long tails, you may want to increase the number of bins (the bins attribute) or ignore data below or above a certain range. The latter can be achieved by setting the xmin and xmax attributes. If you set xmin, you can come back to the original data by setting xmin to None (same for xmax), or simply create a new instance.
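To make the effect of range limits and binning concrete, here is a small NumPy sketch that trims a heavy-tailed sample to a range and builds a normalized histogram from it. It only mimics the idea: the helper name `trimmed_histogram` is hypothetical and not part of the package, and whether the package includes the boundary values is an assumption.

```python
import numpy as np

def trimmed_histogram(data, xmin=None, xmax=None, bins=100):
    """Keep values inside [xmin, xmax] and return (bin centers, densities)."""
    data = np.asarray(data, dtype=float)
    lo = data.min() if xmin is None else xmin  # None means "no lower cut"
    hi = data.max() if xmax is None else xmax  # None means "no upper cut"
    kept = data[(data >= lo) & (data <= hi)]
    density, edges = np.histogram(kept, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, density

rng = np.random.default_rng(0)
sample = rng.standard_cauchy(5000)  # heavy tails produce extreme outliers

x_all, y_all = trimmed_histogram(sample)
x_cut, y_cut = trimmed_histogram(sample, xmin=-10, xmax=10)
print(x_all.min(), x_all.max())  # range stretched by a few outliers
print(x_cut.min(), x_cut.max())  # range restricted to the chosen window
```

Without the cut, a handful of extreme values stretches the bins so much that most of the sample falls into very few of them, which is the situation the paragraph above warns about.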

distributions: list[str]: list of distributions to test

fit(progress: bool = False, n_jobs: int = -1, max_workers: int = -1, prefer: str = 'processes') → None

Fit all distributions to the data and compute goodness-of-fit metrics.

Loops over all distributions in parallel and finds the best parameters to fit the data. Populates the following attributes:

df_errors: DataFrame with sum of squared errors and information criteria

fitted_param: Parameters that best fit the data for each distribution

fitted_pdf: PDF values generated with the fitted parameters

Args:
    progress: If True, display progress bar during fitting.
    n_jobs: Number of jobs for parallel processing (deprecated, use max_workers).
    max_workers: Number of parallel workers (-1 for all CPUs).
    prefer: Joblib parallelization method ('processes' or 'threads').

Note:
    The fitting uses parallel processing for speed. Distributions that fail or time out are assigned infinite error values.
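The per-distribution work can be pictured with plain SciPy calls. The sketch below is a serial, simplified stand-in for the parallel loop: it has no timeout handling and uses only three candidate distributions picked for illustration. The local names `errors` and `fitted` are ours and are not the package's attributes.

```python
import numpy as np
from scipy import stats

data = stats.gamma.rvs(2, loc=1.5, scale=2, size=5000, random_state=1)

# Normalized histogram of the data, evaluated at the bin centers.
y, edges = np.histogram(data, bins=100, density=True)
x = 0.5 * (edges[:-1] + edges[1:])

errors, fitted = {}, {}
for name in ["gamma", "norm", "expon"]:
    dist = getattr(stats, name)
    try:
        params = dist.fit(data)     # maximum-likelihood parameters
        pdf = dist.pdf(x, *params)  # fitted PDF on the histogram grid
        errors[name] = float(np.sum((pdf - y) ** 2))
        fitted[name] = params
    except Exception:
        errors[name] = np.inf       # failed fits rank last

best = min(errors, key=errors.get)
print(best, errors[best])
```

Assigning an infinite error to a failed fit keeps the distribution in the results while guaranteeing it never ranks ahead of one that converged.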

get_best(method: str = 'sumsquare_error') → dict[str, dict[str, float]]

Return the best fitted distribution and its parameters.

Args:
    method: Metric to use for ranking ('sumsquare_error', 'aic', 'bic', etc.).

Returns:
    Dictionary with distribution name as key and parameter dictionary as value. Example: {'gamma': {'a': 2.0, 'loc': 1.5, 'scale': 2.0}}
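The returned dictionary pairs SciPy's positional fit results with parameter names. One way to build such a mapping, assuming the names come from the distribution's `shapes` attribute followed by `loc` and `scale`, is sketched below. `named_params` is a hypothetical helper, not a function of the package.

```python
from scipy import stats

def named_params(dist_name, params):
    """Map a tuple as returned by dist.fit() to {parameter name: value}."""
    dist = getattr(stats, dist_name)
    # `shapes` is a comma-separated string such as "a", or None when the
    # distribution has no shape parameters (e.g. the normal distribution).
    shape_names = [s.strip() for s in dist.shapes.split(",")] if dist.shapes else []
    names = shape_names + ["loc", "scale"]
    return dict(zip(names, params))

print(named_params("gamma", (2.0, 1.5, 2.0)))  # {'a': 2.0, 'loc': 1.5, 'scale': 2.0}
print(named_params("norm", (0.0, 1.0)))        # {'loc': 0.0, 'scale': 1.0}
```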

hist() → None

Draw normalized histogram of the data using bins.

Examples:

>>> from scipy import stats
>>> data = stats.gamma.rvs(2, loc=1.5, scale=2, size=20000)
>>> import fitter
>>> fitter.Fitter(data).hist()

plot_pdf(names: list[str] | str | None = None, Nbest: int = 5, lw: float = 2, method: str = 'sumsquare_error') → None

Plot probability density functions of fitted distributions.

Args:
    names: Distribution name(s) to plot; can be a single string or a list of strings. If None, plots the Nbest distributions.
    Nbest: Number of best-fitting distributions to plot (when names is None).
    lw: Line width for the plots.
    method: Metric to use for ranking distributions ('sumsquare_error', 'aic', 'bic', etc.).

summary(Nbest: int = 5, lw: float = 2, plot: bool = True, method: str = 'sumsquare_error', clf: bool = True) → DataFrame

Display summary of best fitting distributions.

Args:
    Nbest: Number of best distributions to include in summary.
    lw: Line width for plots.
    plot: If True, create histogram and PDF overlay plot.
    method: Metric to use for ranking distributions.
    clf: If True, clear figure before plotting.

Returns:
    DataFrame with fitting results for the Nbest distributions.
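Ranking by a metric amounts to sorting the results table on one column and keeping the first rows. The pandas sketch below reuses three columns of the example summary table from the Fitter class description. `rank` is a hypothetical helper that only approximates the returned DataFrame, and it assumes that smaller values are better for the chosen metric.

```python
import pandas as pd

# Values copied from the example summary table (three of its columns).
results = pd.DataFrame(
    {
        "sumsquare_error": [0.001176, 0.001181, 0.001189, 0.001189, 0.001189],
        "aic": [995.866732, 993.145832, 992.975187, 996.975182, 996.975078],
        "ks_statistic": [0.008459, 0.006833, 0.007138, 0.007138, 0.007138],
    },
    index=["loggamma", "gennorm", "norm", "truncnorm", "crystalball"],
)

def rank(df, method="sumsquare_error", Nbest=5):
    """Return the Nbest rows with the smallest value of `method`."""
    return df.sort_values(by=method).head(Nbest)

print(rank(results, method="aic", Nbest=3))
```

Note how the order changes with the metric: loggamma comes first by sum of squared errors, while norm comes first by AIC.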

property xmax: float: Consider only data below xmax. Set to None to reset and use the full data range again.

property xmin: float: Consider only data above xmin. Set to None to reset and use the full data range again.

fitter.fitter.get_common_distributions() → list[str]

Get commonly used distributions that are available in scipy.stats.

Returns:
    List of common distribution names that have fit methods.

Note:
    Filters based on scipy version to avoid errors with missing distributions.

fitter.fitter.get_distributions() → list[str]

Get all scipy.stats distributions that have a fit method.

Returns:
    List of distribution names as strings.
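A comparable list can be assembled directly from scipy.stats. The sketch below keeps every module-level object that is a continuous distribution instance, all of which expose a `fit` method. This is one possible filter and not necessarily the one the function applies, so the two lists may differ; the helper name `continuous_distribution_names` is hypothetical.

```python
from scipy import stats

def continuous_distribution_names():
    """Names of scipy.stats objects that are continuous distributions."""
    return sorted(
        name
        for name in dir(stats)
        if isinstance(getattr(stats, name), stats.rv_continuous)
    )

names = continuous_distribution_names()
print(len(names), names[:5])
```

The exact count depends on the installed SciPy version, which is why the result is printed rather than hard-coded.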

histfit module reference

Histogram fitting module for Gaussian distributions.

This module provides functionality to fit Gaussian distributions to histogram data, with support for error estimation through Monte Carlo sampling.

Fit and plot Gaussian distributions to histogram data.

This class fits a Gaussian (normal) distribution to histogram data using least squares optimization with optional Monte Carlo error estimation.

The input can be either:

- Raw data: the histogram is computed automatically.
- Pre-computed histogram: X (bin centers) and Y (densities) arrays.

For better parameter estimation, the fit can be repeated with added noise (controlled by error_rate) to estimate uncertainty in mu, sigma, and amplitude.

Examples:

>>> from fitter import HistFit
>>> import scipy.stats
>>> data = [scipy.stats.norm.rvs(2, 3.4) for _ in range(10000)]
>>> hf = HistFit(data, bins=30)
>>> hf.fit(error_rate=0.03, Nfit=20)
>>> print(hf.mu, hf.sigma, hf.amplitude)

Using a pre-computed histogram:

>>> import matplotlib.pyplot as plt
>>> Y, X, _ = plt.hist(data, bins=30, density=True)
>>> hf = HistFit(X=X, Y=Y)
>>> hf.fit(error_rate=0.03, Nfit=20)

Attributes:
    mu (float): Mean of the fitted Gaussian distribution.
    sigma (float): Standard deviation of the fitted distribution.
    amplitude (float): Amplitude scaling factor.
    X (np.ndarray): Bin centers of the histogram.
    Y (np.ndarray): Probability density values.

Warning:

Currently handles only Gaussian distributions. API may change in future versions.

fit(error_rate: float = 0.05, semilogy: bool = False, Nfit: int = 100, error_kwargs: dict[str, Any] | None = None, fit_kwargs: dict[str, Any] | None = None) → tuple[float, float, float]

Fit Gaussian distribution to histogram data with error estimation.

Performs multiple fits with added noise to estimate parameter uncertainty. Creates two figures: one showing individual fits, another showing uncertainty bands.

Args:
    error_rate: Relative error to add as Gaussian noise (e.g., 0.05 = 5%).
    semilogy: If True, use logarithmic y-axis for the plot.
    Nfit: Number of Monte Carlo iterations for error estimation.
    error_kwargs: Plotting kwargs for individual noisy fits (default: thin black transparent lines).
    fit_kwargs: Plotting kwargs for final averaged fit (default: thick red line).

Returns:
    Tuple of (mu, sigma, amplitude) from the averaged fit.
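The repeated-fit idea can be sketched without any plotting. The example below perturbs the histogram densities with relative Gaussian noise, fits a Gaussian to each noisy copy with scipy.optimize.curve_fit, and averages the parameters. The optimizer, the way the noise is scaled, and the local names (`gaussian`, `fits`, `n_fit`) are assumptions made for this illustration and may not match the class internals.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, mu, sigma, amplitude):
    """Normal PDF scaled by an amplitude factor."""
    return amplitude * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
data = rng.normal(2.0, 3.4, size=10_000)

# Normalized histogram: Y holds densities, X the bin centers.
Y, edges = np.histogram(data, bins=30, density=True)
X = 0.5 * (edges[:-1] + edges[1:])

error_rate, n_fit = 0.03, 20
fits = []
for _ in range(n_fit):
    # Perturb each density by relative Gaussian noise (assumed noise model).
    noisy = Y * (1 + rng.normal(0, error_rate, size=Y.size))
    popt, _ = curve_fit(gaussian, X, noisy, p0=[data.mean(), data.std(), 1.0])
    fits.append(popt)

fits = np.array(fits)
mu, sigma, amplitude = fits.mean(axis=0)  # averaged fit
spread = fits.std(axis=0)                 # rough uncertainty per parameter
print(mu, sigma, amplitude)
print(spread)
```

The spread of the individual fits gives a rough sense of how sensitive each parameter is to noise in the histogram, which is what the uncertainty bands in the second figure convey visually.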
