
Compare Empirical Data to Distributions
Source: R/util-distribution-comparison.R
tidy_distribution_comparison.Rd
Compare an empirical data set against several candidate distributions to help find the one that fits best.
Arguments
- .x
The data set being passed to the function.
- .distribution_type
The kind of data, either "continuous" or "discrete".
Details
The purpose of this function is to take a data set and try to find the
distribution that fits it best. The .distribution_type parameter must be
set to either continuous or discrete so that the function tries the
appropriate types of distributions.
The following distributions are used:
Continuous:
tidy_beta
tidy_cauchy
tidy_exponential
tidy_gamma
tidy_logistic
tidy_lognormal
tidy_normal
tidy_pareto
tidy_uniform
tidy_weibull
Discrete:
tidy_binomial
tidy_geometric
tidy_hypergeometric
tidy_poisson
The function returns a list of tibbles:
comparison_tbl
deviance_tbl
total_deviance_tbl
aic_tbl
kolmogorov_smirnov_tbl
multi_metric_tbl
The comparison_tbl is a long tibble that lists the values of the density function against the given data.
The deviance_tbl and the total_deviance_tbl give the simple difference between the actual density and the estimated density for each estimated distribution.
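As a rough illustration of the deviance idea described above, the following base R sketch computes a point-wise difference between an empirical density and one candidate density, then takes the absolute value of the total. It is not the package's internal code: the choice of a normal candidate with moment estimates, the evaluation grid, and the names emp, est_dy, and dens_diff are all assumptions made for this example.

```r
# Illustrative sketch only (base R); not the internals of
# tidy_distribution_comparison().
x <- mtcars$mpg

# Empirical density evaluated at as many grid points as there are observations
# (an assumption made here to keep the example small).
emp <- stats::density(x, n = length(x))

# Candidate: a normal density with moment estimates, on the same grid.
est_dy <- stats::dnorm(emp$x, mean = mean(x), sd = stats::sd(x))

# Point-wise difference between empirical and estimated density,
# and the absolute value of its total.
dens_diff <- emp$y - est_dy
abs_tot_deviance <- abs(sum(dens_diff))
abs_tot_deviance
```

Under this reading, a smaller abs_tot_deviance means the candidate density tracks the empirical density more closely overall.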
The aic_tbl provides the AIC of an lm model of the estimated density against the empirical density.
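The AIC description above can be sketched in base R as follows. This is a hedged illustration rather than the package's implementation: the normal candidate, the grid size, the direction of the regression (estimated density as the response), and the data frame name dens_df are assumptions.

```r
# Illustrative sketch only (base R); not the internals of
# tidy_distribution_comparison().
x <- mtcars$mpg

# Empirical density and a candidate (normal) density on the same grid.
emp <- stats::density(x, n = length(x))
dens_df <- data.frame(
  emp_dy = emp$y,
  est_dy = stats::dnorm(emp$x, mean = mean(x), sd = stats::sd(x))
)

# Linear model of the estimated density against the empirical density,
# then its AIC and the absolute value of that AIC.
fit <- stats::lm(est_dy ~ emp_dy, data = dens_df)
aic_value <- stats::AIC(fit)
abs_aic <- abs(aic_value)
c(aic_value = aic_value, abs_aic = abs_aic)
```

Because density values are small, the residual variance of such a model can be tiny and the log-likelihood positive, which is why AIC values can come out negative.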
The kolmogorov_smirnov_tbl currently provides a two.sided ks.test of the estimated density against the empirical density.
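A two-sided Kolmogorov-Smirnov comparison of this kind can be sketched with stats::ks.test(). Again, this is only an illustration under assumptions (a normal candidate, a small evaluation grid, default p-value settings) and may differ from what the function does internally.

```r
# Illustrative sketch only (base R); not the internals of
# tidy_distribution_comparison().
x <- mtcars$mpg

# Empirical density and a candidate (normal) density on the same grid.
emp <- stats::density(x, n = length(x))
est_dy <- stats::dnorm(emp$x, mean = mean(x), sd = stats::sd(x))

# Two-sample, two-sided KS test of estimated vs. empirical density values.
ks <- stats::ks.test(est_dy, emp$y, alternative = "two.sided")

# The pieces that correspond to columns such as ks_statistic and ks_pvalue.
ks_statistic <- unname(ks$statistic)
ks_pvalue <- ks$p.value
ks$alternative
```

A small statistic together with a large p-value gives no evidence that the two sets of density values differ in distribution.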
The multi_metric_tbl summarises all of these metrics in a single tibble.
Examples
xc <- mtcars$mpg
output_c <- tidy_distribution_comparison(xc, "continuous")
#> For the beta distribution, its mean 'mu' should be 0 < mu < 1. The data will
#> therefore be scaled to enforce this.
xd <- trunc(xc)
output_d <- tidy_distribution_comparison(xd, "discrete")
output_c
#> $comparison_tbl
#> # A tibble: 352 × 8
#> sim_number x y dx dy p q dist_type
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 1 21 2.97 0.000114 0.625 10.4 Empirical
#> 2 1 2 21 4.21 0.000455 0.625 10.4 Empirical
#> 3 1 3 22.8 5.44 0.00142 0.781 13.3 Empirical
#> 4 1 4 21.4 6.68 0.00355 0.688 14.3 Empirical
#> 5 1 5 18.7 7.92 0.00721 0.469 14.7 Empirical
#> 6 1 6 18.1 9.16 0.0124 0.438 15 Empirical
#> 7 1 7 14.3 10.4 0.0192 0.125 15.2 Empirical
#> 8 1 8 24.4 11.6 0.0281 0.812 15.2 Empirical
#> 9 1 9 22.8 12.9 0.0395 0.781 15.5 Empirical
#> 10 1 10 19.2 14.1 0.0516 0.531 15.8 Empirical
#> # ℹ 342 more rows
#>
#> $deviance_tbl
#> # A tibble: 352 × 2
#> name value
#> <chr> <dbl>
#> 1 Empirical 0.451
#> 2 Beta c(1.11, 1.58, 0) 0.366
#> 3 Cauchy c(19.2, 7.38) -0.389
#> 4 Exponential c(0.05) 0.318
#> 5 Gamma c(11.47, 1.75) -0.0967
#> 6 Logistic c(20.09, 3.27) 0.213
#> 7 Lognormal c(2.96, 0.29) 0.428
#> 8 Pareto c(10.4, 1.62) 0.391
#> 9 Uniform c(8.34, 31.84) 0.00787
#> 10 Weibull c(3.58, 22.29) 0.0577
#> # ℹ 342 more rows
#>
#> $total_deviance_tbl
#> # A tibble: 10 × 2
#> dist_with_params abs_tot_deviance
#> <chr> <dbl>
#> 1 Beta c(1.11, 1.58, 0) 0.687
#> 2 Gamma c(11.47, 1.75) 0.715
#> 3 Logistic c(20.09, 3.27) 0.946
#> 4 Lognormal c(2.96, 0.29) 1.24
#> 5 Gaussian c(20.09, 5.93) 3.40
#> 6 Weibull c(3.58, 22.29) 4.24
#> 7 Uniform c(8.34, 31.84) 5.37
#> 8 Exponential c(0.05) 7.17
#> 9 Pareto c(10.4, 1.62) 8.69
#> 10 Cauchy c(19.2, 7.38) 11.9
#>
#> $aic_tbl
#> # A tibble: 10 × 3
#> dist_type aic_value abs_aic
#> <fct> <dbl> <dbl>
#> 1 Beta c(1.11, 1.58, 0) -27.5 27.5
#> 2 Pareto c(10.4, 1.62) 93.4 93.4
#> 3 Gaussian c(20.09, 5.93) -162. 162.
#> 4 Uniform c(8.34, 31.84) -163. 163.
#> 5 Weibull c(3.58, 22.29) -176. 176.
#> 6 Exponential c(0.05) -202. 202.
#> 7 Gamma c(11.47, 1.75) -210. 210.
#> 8 Cauchy c(19.2, 7.38) -228. 228.
#> 9 Logistic c(20.09, 3.27) -234. 234.
#> 10 Lognormal c(2.96, 0.29) -247. 247.
#>
#> $kolmogorov_smirnov_tbl
#> # A tibble: 10 × 6
#> dist_type ks_statistic ks_pvalue ks_method alternative dist_char
#> <fct> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Beta c(1.11, 1.58, 0) 0.781 0.000500 Monte-Ca… two-sided Beta c(1…
#> 2 Cauchy c(19.2, 7.38) 0.562 0.000500 Monte-Ca… two-sided Cauchy c…
#> 3 Exponential c(0.05) 0.5 0.00100 Monte-Ca… two-sided Exponent…
#> 4 Gamma c(11.47, 1.75) 0.156 0.838 Monte-Ca… two-sided Gamma c(…
#> 5 Logistic c(20.09, 3.2… 0.125 0.966 Monte-Ca… two-sided Logistic…
#> 6 Lognormal c(2.96, 0.2… 0.188 0.626 Monte-Ca… two-sided Lognorma…
#> 7 Pareto c(10.4, 1.62) 0.656 0.000500 Monte-Ca… two-sided Pareto c…
#> 8 Uniform c(8.34, 31.84) 0.188 0.645 Monte-Ca… two-sided Uniform …
#> 9 Weibull c(3.58, 22.29) 0.219 0.444 Monte-Ca… two-sided Weibull …
#> 10 Gaussian c(20.09, 5.9… 0.156 0.840 Monte-Ca… two-sided Gaussian…
#>
#> $multi_metric_tbl
#> # A tibble: 10 × 8
#> dist_type abs_tot_deviance aic_value abs_aic ks_statistic ks_pvalue ks_method
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Beta c(1… 0.687 -27.5 27.5 0.781 0.000500 Monte-Ca…
#> 2 Gamma c(… 0.715 -210. 210. 0.156 0.838 Monte-Ca…
#> 3 Logistic… 0.946 -234. 234. 0.125 0.966 Monte-Ca…
#> 4 Lognorma… 1.24 -247. 247. 0.188 0.626 Monte-Ca…
#> 5 Gaussian… 3.40 -162. 162. 0.156 0.840 Monte-Ca…
#> 6 Weibull … 4.24 -176. 176. 0.219 0.444 Monte-Ca…
#> 7 Uniform … 5.37 -163. 163. 0.188 0.645 Monte-Ca…
#> 8 Exponent… 7.17 -202. 202. 0.5 0.00100 Monte-Ca…
#> 9 Pareto c… 8.69 93.4 93.4 0.656 0.000500 Monte-Ca…
#> 10 Cauchy c… 11.9 -228. 228. 0.562 0.000500 Monte-Ca…
#> # ℹ 1 more variable: alternative <chr>
#>
#> attr(,".x")
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
#> attr(,".n")
#> [1] 32