
Finding Probability Distribution Parameters from Percentiles

26 Jun 2014 · BSD · 4 min read
How to determine the parameters of a probability distribution given two percentile constraints

Introduction

Suppose there's a 10% chance of something being less than 30 and a 90% chance of it being less than 60. Once you pick a probability distribution family (normal, gamma, etc.) you need a way of determining what parameters will satisfy your two requirements.

More precisely, suppose a random variable X has a two-parameter distribution. The problem is to find values of those parameters so that Pr(X < x1) = p1 and Pr(X < x2) = p2.

This article will show how to compute these parameters for normal, Cauchy, Weibull, gamma, and inverse gamma distributions using Python's SciPy library.

The problem of finding parameters to satisfy two percentile equations is practical. It comes up, for example, in determining prior distributions in Bayesian statistics. This article also serves as a brief introduction to using SciPy.
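As a quick illustration of the opening example, here is a sketch that solves the normal case directly with SciPy, assuming Pr(X &lt; 30) = 0.1 and Pr(X &lt; 60) = 0.9:

```python
from scipy import stats

# Standard normal quantiles at the two target probabilities
z1 = stats.norm.ppf(0.1)   # roughly -1.28
z2 = stats.norm.ppf(0.9)   # roughly +1.28

# Solve mu + z1*sigma = 30 and mu + z2*sigma = 60
sigma = (60 - 30) / (z2 - z1)
mu = 30 - z1 * sigma

print(mu, sigma)  # mu = 45 by symmetry; sigma is about 11.7
```

The same two linear equations underlie the general solution described later in the article.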

Background

The mathematical solution to the problem posed here is described in the technical report Determining distribution parameters from quantiles. This article explains how to implement the calculations in that report in Python. A companion article gives more motivation for the problem.

Using the Code

The code described here is very simple to call. Suppose you want to find the mean and standard deviation for a normal distribution. You simply call normal_parameters with the appropriate arguments. It returns the mean and standard deviation as a pair.

(mean, stdev) = normal_parameters(x1, p1, x2, p2)

The functions for other distributions are similar:

  • cauchy_parameters
  • weibull_parameters
  • gamma_parameters
  • inverse_gamma_parameters
  • beta_parameters

All functions take the same four arguments and all return two parameters. However, the meaning of the parameters is different for each distribution. For the normal distribution, the problem is finding μ and σ so that F(x1) = p1 and F(x2) = p2 where F(x) is defined by:

F(x) = \frac{1}{\sqrt{2\pi}\sigma} \int_{-\infty}^x  \exp\left( -\frac{(t-\mu)^2}{2\sigma^2}\right) \, dt

For the Cauchy distribution, the problem is similar except now:

F(x) = \frac{1}{\pi\sigma} \int_{-\infty}^x \frac{dt}{ 1 + \left(\frac{t-\mu}{\sigma} \right)^2 }

For the Weibull distribution, the problem is finding γ and β where:

F(x) = \frac{\gamma}{\beta^\gamma} \int_0^x t^{\gamma-1} \exp\left(-(t/\beta)^\gamma \right)\, dt

For the gamma distribution, the problem is finding α and β where:

F(x) = \frac{1}{\Gamma(\alpha)\,\beta^\alpha} \int_0^x t^{\alpha-1} \exp(-t/\beta)\, dt

For the inverse gamma distribution, the problem is finding α and β where:

F(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^x t^{-\alpha-1}\exp(-\beta/t)\, dt

For the beta distribution, the problem is finding a and b where:

F(x) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)} \int_0^x t^{a-1}(1-t)^{b-1}\, dt

Looking Inside the Code

The code presented here depends critically on SciPy, a large library for scientific computing in Python. For an introduction to SciPy, see the CodeProject article Getting started with the SciPy (Scientific Python) library.

For the normal and Cauchy distributions, the location parameter is given by:

\frac{x_1 F^{-1}(p_2) - x_2 F^{-1}(p_1)}{F^{-1}(p_2) - F^{-1}(p_1)}

and the scale parameter is given by:

\frac{x_2 - x_1}{F^{-1}(p_2) - F^{-1}(p_1)}

where F(x) is the CDF of the normal or Cauchy distribution as in the previous section. The inverse of the CDF is given by the ppf method in SciPy. Here "ppf" stands for "percent point function." Other libraries may call this the quantile function. The probability distribution classes are located in scipy.stats.
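These two formulas translate directly into code. The sketch below shows one way to write normal_parameters and cauchy_parameters; the shared helper location_scale_parameters is a name introduced here for illustration, not part of SciPy or the downloadable source.

```python
from scipy import stats

def location_scale_parameters(ppf, x1, p1, x2, p2):
    # ppf is the quantile function of the standard (location 0, scale 1)
    # member of the family. Solving x = location + scale*ppf(p) at the
    # two constraints gives the formulas from the text.
    q1, q2 = ppf(p1), ppf(p2)
    location = (x1 * q2 - x2 * q1) / (q2 - q1)
    scale = (x2 - x1) / (q2 - q1)
    return (location, scale)

def normal_parameters(x1, p1, x2, p2):
    return location_scale_parameters(stats.norm.ppf, x1, p1, x2, p2)

def cauchy_parameters(x1, p1, x2, p2):
    return location_scale_parameters(stats.cauchy.ppf, x1, p1, x2, p2)
```

For example, normal_parameters(30, 0.1, 60, 0.9) returns a mean of 45 (by symmetry) and a standard deviation of about 11.7.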

The parameters for the Weibull distribution are given by a simple closed-form formula that does not require any SciPy functionality. The inverse gamma parameters are also easy to find: if X has an inverse gamma distribution, then 1/X has a gamma distribution, so the inverse gamma problem reduces to the gamma problem with transformed arguments.
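For the Weibull case, the closed form follows from writing the CDF as F(x) = 1 − exp(−(x/β)^γ): setting F(x) = p gives log(−log(1 − p)) = γ(log x − log β), which is linear in log x, so two constraints determine γ and β exactly. A sketch (a derivation-based reimplementation, not the downloadable source) looks like this:

```python
import math

def weibull_parameters(x1, p1, x2, p2):
    # F(x) = 1 - exp(-(x/beta)**gamma) = p  implies
    # log(-log(1 - p)) = gamma * (log x - log beta),
    # a line in log x; two points determine its slope and intercept.
    g1 = math.log(-math.log(1 - p1))
    g2 = math.log(-math.log(1 - p2))
    gamma = (g2 - g1) / (math.log(x2) - math.log(x1))
    beta = x1 / (-math.log(1 - p1)) ** (1 / gamma)
    return (gamma, beta)
```

Because the solution is exact, plugging the returned parameters back into the CDF reproduces p1 and p2 to machine precision.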

The gamma distribution parameters cannot be obtained so simply. The code to find these parameters illustrates SciPy's root-finding routines.

from scipy import stats, optimize

def gamma_parameters(x1, p1, x2, p2):
    
    # Standardize so that x1 < x2 and p1 < p2
    if p1 > p2:
        (p1, p2) = (p2, p1)
        (x1, x2) = (x2, x1)
    
    # Function whose root gives the gamma shape parameter alpha
    def objective(alpha):
        return stats.gamma.ppf(p2, alpha) / stats.gamma.ppf(p1, alpha) - x2/x1
    
    # The objective function we want to find a root of is decreasing.
    # We need to find an interval over which it goes from positive to negative.
    left = right = 1.0
    while objective(left) < 0.0:
        left /= 2
    while objective(right) > 0.0:
        right *= 2
    alpha = optimize.bisect(objective, left, right)
    beta = x1 / stats.gamma.ppf(p1, alpha)
    
    return (alpha, beta)

Notice the auxiliary function objective defined inside the gamma_parameters function. A nice feature of Python is that you can easily define little functions like this using closures. Only gamma_parameters needs this function, so we define it inside gamma_parameters to keep from cluttering the outer scope.

To find where the function objective is 0, we use one of the root-finding methods in the scipy.optimize module. The code here uses the bisect method, an algorithm that is slow compared to other root-finding algorithms but very reliable.

For the beta distribution, we have no direct method for solving for the distribution parameters. Instead we solve an optimization problem using the SciPy function fmin. We search for the parameters that come closest to satisfying the percentile constraints; provided the constraints are feasible, the minimizing parameters satisfy them exactly.

from scipy import stats, optimize

def beta_parameters(x1, p1, x2, p2):
    """Find parameters for a beta random variable X
    so that P(X < x1) = p1 and P(X < x2) = p2."""

    def square(x): 
        return x*x

    def objective(v):
        (a, b) = v
        temp  = square( stats.beta.cdf(x1, a, b) - p1 )
        temp += square( stats.beta.cdf(x2, a, b) - p2 )
        return temp
    
    # arbitrary initial guess of (3, 3) for the parameters
    xopt = optimize.fmin(objective, (3, 3))
    return (xopt[0], xopt[1])

The approach used to calculate the beta distribution parameters could be used for other distributions as well. In fact, we could have used this approach for all distributions. However, the specialized methods used for the other distributions are more accurate and more efficient.
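A simple way to sanity-check any of these solvers is to plug the returned parameters back into the corresponding CDF. The snippet below repeats the fmin search inline (with illustrative constraint values chosen here, and tighter tolerances than the defaults) so it stands alone:

```python
from scipy import stats, optimize

# Illustrative constraints (chosen arbitrarily for this check)
x1, p1, x2, p2 = 0.2, 0.1, 0.8, 0.9

def objective(v):
    a, b = v
    return ((stats.beta.cdf(x1, a, b) - p1)**2
            + (stats.beta.cdf(x2, a, b) - p2)**2)

# Tight tolerances so the residuals are negligible
a, b = optimize.fmin(objective, (3, 3), xtol=1e-10, ftol=1e-12, disp=False)

# Plug the fitted parameters back into the CDF
print(stats.beta.cdf(x1, a, b))  # should be close to 0.1
print(stats.beta.cdf(x2, a, b))  # should be close to 0.9
```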

History

  • 3rd February, 2010: Initial version
  • 4th February, 2010: Article updated
  • 26th July, 2010: Beta distribution added

License

This article, along with any associated source code and files, is licensed under The BSD License


Written By
John D. Cook, President, John D. Cook Consulting
United States
I work in the areas of applied mathematics, data analysis, and data privacy.

Check out my blog or send me a note.

 


Comments and Discussions

 
Question: Log Normal Parameters — K_Lago, 2-Feb-19
Question: Looking for a C# version — victorbos, 21-Oct-11
Answer: Re: Looking for a C# version — John D. Cook, 21-Oct-11
General: Re: Looking for a C# version — victorbos, 21-Oct-11
Thanks for your quick reply.

Re 1) I am actually looking for source code (C# or C++) to the numerical algorithm so that I can adapt it for my needs in my own experimental app.

Re 2) Indeed, I do essentially that when I have large samples. However, when I have smaller samples, on the order of 30-60 observations each (and in many cases where P(X < x1) = 0 and P(X > x2) = 0), sampling leads to unsatisfactory results, so I exhaustively compute the distribution of the sum the "old fashioned" way. I guess there is no alternative. I was hoping that there was, as I have thousands of such sets of observations and am looking for a fast way to handle the convolutions. My problem is complicated a bit because the range of x1 and x2 in one distribution is modest (e.g., in the hundreds) and in the other an order of magnitude higher.

Once again, thank you very much for your pointers.

Cheers!

--VVX
