Compare an empirical data set against several candidate distributions to help identify the one that fits best.

Usage

tidy_distribution_comparison(.x, .distribution_type = "continuous")

Arguments

.x

The data set passed to the function, for example a numeric vector such as mtcars$mpg

.distribution_type

The kind of data supplied: either "continuous" (the default) or "discrete"

Value

An invisible list object. A tibble is printed.

Details

The purpose of this function is to take the data set provided and try to find the distribution that may fit it best. The .distribution_type parameter must be set to either continuous or discrete so that the function tries the appropriate types of distributions.

The following distributions are used:

Continuous:

  • tidy_beta

  • tidy_cauchy

  • tidy_exponential

  • tidy_gamma

  • tidy_logistic

  • tidy_lognormal

  • tidy_normal

  • tidy_pareto

  • tidy_uniform

  • tidy_weibull

Discrete:

  • tidy_binomial

  • tidy_geometric

  • tidy_hypergeometric

  • tidy_poisson

The function returns a list of tibbles. The following tibbles are returned:

  • comparison_tbl

  • deviance_tbl

  • total_deviance_tbl

  • aic_tbl

  • kolmogorov_smirnov_tbl

  • multi_metric_tbl

The comparison_tbl is a long tibble that lists the values of the density function against the given data.

The deviance_tbl and the total_deviance_tbl give the simple difference between the actual (empirical) density and the estimated density for each estimated distribution.
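
The idea of a density deviance can be illustrated with base R alone. This is a minimal sketch, not the package's internal code: the normal candidate, the 50-point grid, and the summary as the absolute value of the summed differences (suggested only by the abs_tot_deviance column name) are all assumptions made for illustration.

```r
x <- mtcars$mpg

# Empirical density of the data on a 50-point grid (grid size is an assumption)
emp <- density(x, n = 50)

# Candidate density: a normal with parameters estimated from the data
# (an illustrative choice; the package compares several distributions)
est_dy <- dnorm(emp$x, mean = mean(x), sd = sd(x))

# Point-wise deviance: empirical density minus estimated density
deviance <- emp$y - est_dy

# One summary number for the candidate
# (assumed here: absolute value of the summed differences)
abs_tot_deviance <- abs(sum(deviance))
abs_tot_deviance
```

A smaller value means the candidate density tracks the empirical density more closely on the grid.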

The aic_tbl provides the AIC of an lm model of the estimated density against the empirical density.
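
As a rough illustration of what "the AIC of an lm model of the estimated density against the empirical density" can look like, the sketch below uses only base R. It is not taken from the package source: the normal candidate, the grid size, and the variable names emp_dy and est_dy are assumptions.

```r
x <- mtcars$mpg

# Empirical density on a 50-point grid (grid size is an assumption)
emp <- density(x, n = 50)

# Estimated density for one candidate (a normal, chosen for illustration)
df <- data.frame(
  emp_dy = emp$y,
  est_dy = dnorm(emp$x, mean = mean(x), sd = sd(x))
)

# Linear model of the estimated density against the empirical density
fit <- lm(est_dy ~ emp_dy, data = df)

# AIC of that model; repeating this per candidate gives one value each
aic_value <- AIC(fit)
aic_value
```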

For now, the kolmogorov_smirnov_tbl provides the results of a two-sided ks.test of the estimated density against the empirical density.
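
A two-sided Kolmogorov-Smirnov test can be run directly with stats::ks.test. The sketch below is only an illustration under assumptions: it compares the raw data with draws from a fitted normal, and it uses the default p-value computation rather than the Monte-Carlo method shown in the printed output. Because mtcars$mpg contains ties, some R versions emit a warning about exact p-values.

```r
set.seed(123)
x <- mtcars$mpg

# Simulated values from a candidate distribution
# (a normal with parameters estimated from the data; illustrative choice)
sim <- rnorm(length(x), mean = mean(x), sd = sd(x))

# Two-sample, two-sided Kolmogorov-Smirnov test
ks <- ks.test(x, sim, alternative = "two.sided")

# The statistic, p-value, and alternative correspond to the
# ks_statistic, ks_pvalue, and alternative columns of the tibble
ks$statistic
ks$p.value
ks$alternative
```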

The multi_metric_tbl will summarise all of these metrics into a single tibble.
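
Because the return value is a named list, individual tibbles can be pulled out with $. The following is a small usage sketch, assuming the TidyDensity package is installed; the object name out and the choice to sort by the KS statistic are illustrative only.

```r
library(TidyDensity)

# Fit the continuous candidates to the mpg column of mtcars
out <- tidy_distribution_comparison(mtcars$mpg, "continuous")

# List the tibbles contained in the returned list
names(out)

# Extract the summary tibble and order it by the KS statistic
multi <- out$multi_metric_tbl
multi[order(multi$ks_statistic), ]
```

No single metric settles the choice; reading the columns of multi_metric_tbl side by side is usually more informative than ranking on one of them.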

Author

Steven P. Sanderson II, MPH

Examples

xc <- mtcars$mpg
output_c <- tidy_distribution_comparison(xc, "continuous")
#> For the beta distribution, its mean 'mu' should be 0 < mu < 1. The data will
#> therefore be scaled to enforce this.

xd <- trunc(xc)
output_d <- tidy_distribution_comparison(xd, "discrete")

output_c
#> $comparison_tbl
#> # A tibble: 352 × 8
#>    sim_number     x     y    dx       dy     p     q dist_type
#>    <fct>      <int> <dbl> <dbl>    <dbl> <dbl> <dbl> <fct>    
#>  1 1              1  21    2.97 0.000114 0.625  10.4 Empirical
#>  2 1              2  21    4.21 0.000455 0.625  10.4 Empirical
#>  3 1              3  22.8  5.44 0.00142  0.781  13.3 Empirical
#>  4 1              4  21.4  6.68 0.00355  0.688  14.3 Empirical
#>  5 1              5  18.7  7.92 0.00721  0.469  14.7 Empirical
#>  6 1              6  18.1  9.16 0.0124   0.438  15   Empirical
#>  7 1              7  14.3 10.4  0.0192   0.125  15.2 Empirical
#>  8 1              8  24.4 11.6  0.0281   0.812  15.2 Empirical
#>  9 1              9  22.8 12.9  0.0395   0.781  15.5 Empirical
#> 10 1             10  19.2 14.1  0.0516   0.531  15.8 Empirical
#> # ℹ 342 more rows
#> 
#> $deviance_tbl
#> # A tibble: 352 × 2
#>    name                       value
#>    <chr>                      <dbl>
#>  1 Empirical                0.451  
#>  2 Beta c(1.11, 1.58, 0)    0.366  
#>  3 Cauchy c(19.2, 7.38)    -0.389  
#>  4 Exponential c(0.05)      0.318  
#>  5 Gamma c(11.47, 1.75)    -0.0967 
#>  6 Logistic c(20.09, 3.27)  0.213  
#>  7 Lognormal c(2.96, 0.29)  0.428  
#>  8 Pareto c(10.4, 1.62)     0.391  
#>  9 Uniform c(8.34, 31.84)   0.00787
#> 10 Weibull c(3.58, 22.29)   0.0577 
#> # ℹ 342 more rows
#> 
#> $total_deviance_tbl
#> # A tibble: 10 × 2
#>    dist_with_params        abs_tot_deviance
#>    <chr>                              <dbl>
#>  1 Beta c(1.11, 1.58, 0)              0.687
#>  2 Gamma c(11.47, 1.75)               0.715
#>  3 Logistic c(20.09, 3.27)            0.946
#>  4 Lognormal c(2.96, 0.29)            1.24 
#>  5 Gaussian c(20.09, 5.93)            3.40 
#>  6 Weibull c(3.58, 22.29)             4.24 
#>  7 Uniform c(8.34, 31.84)             5.37 
#>  8 Exponential c(0.05)                7.17 
#>  9 Pareto c(10.4, 1.62)               8.69 
#> 10 Cauchy c(19.2, 7.38)              11.9  
#> 
#> $aic_tbl
#> # A tibble: 10 × 3
#>    dist_type               aic_value abs_aic
#>    <fct>                       <dbl>   <dbl>
#>  1 Beta c(1.11, 1.58, 0)       -27.5    27.5
#>  2 Pareto c(10.4, 1.62)         93.4    93.4
#>  3 Gaussian c(20.09, 5.93)    -162.    162. 
#>  4 Uniform c(8.34, 31.84)     -163.    163. 
#>  5 Weibull c(3.58, 22.29)     -176.    176. 
#>  6 Exponential c(0.05)        -202.    202. 
#>  7 Gamma c(11.47, 1.75)       -210.    210. 
#>  8 Cauchy c(19.2, 7.38)       -228.    228. 
#>  9 Logistic c(20.09, 3.27)    -234.    234. 
#> 10 Lognormal c(2.96, 0.29)    -247.    247. 
#> 
#> $kolmogorov_smirnov_tbl
#> # A tibble: 10 × 6
#>    dist_type              ks_statistic ks_pvalue ks_method alternative dist_char
#>    <fct>                         <dbl>     <dbl> <chr>     <chr>       <chr>    
#>  1 Beta c(1.11, 1.58, 0)         0.781  0.000500 Monte-Ca… two-sided   Beta c(1…
#>  2 Cauchy c(19.2, 7.38)          0.562  0.000500 Monte-Ca… two-sided   Cauchy c…
#>  3 Exponential c(0.05)           0.5    0.00100  Monte-Ca… two-sided   Exponent…
#>  4 Gamma c(11.47, 1.75)          0.156  0.838    Monte-Ca… two-sided   Gamma c(…
#>  5 Logistic c(20.09, 3.2…        0.125  0.966    Monte-Ca… two-sided   Logistic…
#>  6 Lognormal c(2.96, 0.2…        0.188  0.626    Monte-Ca… two-sided   Lognorma…
#>  7 Pareto c(10.4, 1.62)          0.656  0.000500 Monte-Ca… two-sided   Pareto c…
#>  8 Uniform c(8.34, 31.84)        0.188  0.645    Monte-Ca… two-sided   Uniform …
#>  9 Weibull c(3.58, 22.29)        0.219  0.444    Monte-Ca… two-sided   Weibull …
#> 10 Gaussian c(20.09, 5.9…        0.156  0.840    Monte-Ca… two-sided   Gaussian…
#> 
#> $multi_metric_tbl
#> # A tibble: 10 × 8
#>    dist_type abs_tot_deviance aic_value abs_aic ks_statistic ks_pvalue ks_method
#>    <fct>                <dbl>     <dbl>   <dbl>        <dbl>     <dbl> <chr>    
#>  1 Beta c(1…            0.687     -27.5    27.5        0.781  0.000500 Monte-Ca…
#>  2 Gamma c(…            0.715    -210.    210.         0.156  0.838    Monte-Ca…
#>  3 Logistic…            0.946    -234.    234.         0.125  0.966    Monte-Ca…
#>  4 Lognorma…            1.24     -247.    247.         0.188  0.626    Monte-Ca…
#>  5 Gaussian…            3.40     -162.    162.         0.156  0.840    Monte-Ca…
#>  6 Weibull …            4.24     -176.    176.         0.219  0.444    Monte-Ca…
#>  7 Uniform …            5.37     -163.    163.         0.188  0.645    Monte-Ca…
#>  8 Exponent…            7.17     -202.    202.         0.5    0.00100  Monte-Ca…
#>  9 Pareto c…            8.69       93.4    93.4        0.656  0.000500 Monte-Ca…
#> 10 Cauchy c…           11.9      -228.    228.         0.562  0.000500 Monte-Ca…
#> # ℹ 1 more variable: alternative <chr>
#> 
#> attr(,".x")
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
#> attr(,".n")
#> [1] 32