Statistics 3: Grubbs' Outlier Test

Accept or Reject an Outlier?

Stephen Lukacs (2) iquanta.org/instruct/python

Understanding Grubbs' Outlier Test

Grubbs' test was first proposed by Frank Grubbs in 1950. For any one-dimensional dataset, it checks whether the smallest or largest value is a genuine outlier that may be removed. The test uses a statistical criterion to decide whether the suspect point may be rejected and disregarded, or must be accepted and kept in the dataset.

It is a three-step process. First, calculate the Grubbs' value using the equation: $$g = \frac{\mid outlier - \bar{x} \mid}{ \sigma }$$ where the outlier is either the minimum or maximum datapoint of the dataset to be tested, and $\bar{x}$ and $\sigma$ are the mean and standard deviation of the dataset, respectively.

Second, look up the Grubbs' critical value for the number of datapoints, n, and the required confidence level, CL, in the table below. Third, if the g value is less than or equal to the critical value, the datapoint must be accepted and included. If the g value is greater than the critical value, you may reject and disregard that datapoint.
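
The three steps can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the page's own `Gtest` routine: the helper name `grubbs_value` is hypothetical, the critical values are copied from the n = 4 row of the table on this page, and the use of the sample standard deviation (n - 1 denominator) is an assumption, chosen because it is the convention the tabulated values appear to follow.

```python
import statistics

# Critical values copied from the n = 4 row of the table on this page.
CRITICAL_N4 = {50: 1.12500, 80: 1.35000, 90: 1.42500, 95: 1.46250,
               98: 1.48500, 99: 1.49250, 99.5: 1.49625, 99.9: 1.49925}

def grubbs_value(data, outlier_type="minimum"):
    """Step 1: return the suspect datapoint and its g value (hypothetical helper)."""
    outlier = min(data) if outlier_type == "minimum" else max(data)
    mean = statistics.mean(data)
    sigma = statistics.stdev(data)  # assumption: sample standard deviation (n - 1)
    return outlier, abs(outlier - mean) / sigma

data = [6.18, 6.28, 4.85, 6.49]  # the page's demonstration data
outlier, g = grubbs_value(data, "minimum")

# Steps 2 and 3: compare g with the critical value at each confidence level.
decisions = {
    cl: "may be rejected" if g > critical else "must be accepted"
    for cl, critical in CRITICAL_N4.items()
}

print(f"outlier = {outlier}, g = {g:.4f}")
for cl, verdict in decisions.items():
    print(f"at {cl}% confidence the datapoint {verdict}")
```

For the demonstration data, g falls between the 95% and 98% critical values, so the minimum may be rejected at 95% confidence but must be accepted at 98%.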

. .. Grubbs' Critical Values .. .

n	50%CL	80%CL	90%CL	95%CL	98%CL	99%CL	99.5%CL	99.9%CL
3	1.00000	1.12947	1.14837	1.15312	1.15445	1.15464	1.15468	1.15470
4	1.12500	1.35000	1.42500	1.46250	1.48500	1.49250	1.49625	1.49925
5	1.22903	1.48971	1.60163	1.67139	1.72528	1.74886	1.76368	1.78025
6	1.31660	1.59429	1.72888	1.82212	1.90360	1.94425	1.97282	2.01074
7	1.39131	1.67850	1.82798	1.93813	2.04161	2.09730	2.13911	2.20059
8	1.45604	1.74908	1.90895	2.03165	2.15249	2.22083	2.27437	2.35863
9	1.51291	1.80978	1.97726	2.10956	2.24427	2.32315	2.38681	2.49195
10	1.56350	1.86296	2.03623	2.17607	2.32203	2.40972	2.48208	2.60593
11	1.60895	1.91022	2.08801	2.23391	2.38915	2.48428	2.56412	2.70460
12	1.65016	1.95270	2.13410	2.28495	2.44796	2.54942	2.63573	2.79095
13	1.68780	1.99123	2.17556	2.33054	2.50011	2.60702	2.69897	2.86730
14	1.72241	2.02645	2.21320	2.37165	2.54685	2.65848	2.75537	2.93538
15	1.75440	2.05885	2.24762	2.40904	2.58909	2.70486	2.80611	2.99656
16	1.78413	2.08884	2.27931	2.44327	2.62756	2.74696	2.85208	3.05192
17	1.81187	2.11672	2.30863	2.47481	2.66282	2.78545	2.89401	3.10233
18	1.83786	2.14276	2.33591	2.50402	2.69532	2.82082	2.93248	3.14846
19	1.86230	2.16717	2.36139	2.53119	2.72542	2.85350	2.96795	3.19090
20	1.88534	2.19014	2.38527	2.55658	2.75343	2.88382	3.00080	3.23012
21	1.90714	2.21181	2.40775	2.58039	2.77959	2.91208	3.03136	3.26649
22	1.92780	2.23231	2.42895	2.60278	2.80411	2.93850	3.05988	3.30036
23	1.94745	2.25176	2.44901	2.62392	2.82716	2.96330	3.08659	3.33200
24	1.96616	2.27026	2.46805	2.64391	2.84890	2.98663	3.11169	3.36164
25	1.98401	2.28788	2.48614	2.66287	2.86946	3.00864	3.13533	3.38950
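
The page does not say how the table was generated. As a hedged sketch, the entries appear consistent with the standard one-sided formula $G_{crit} = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2}{n-2+t^2}}$, where $t$ is the upper $\alpha/n$ critical value of Student's t distribution with $n-2$ degrees of freedom and $\alpha = 1 - CL/100$. The function below, whose name `grubbs_critical` is hypothetical, evaluates that formula with the standard library only. It substitutes $t = \sqrt{n-2}\,\tan\theta$, integrates the t density numerically with Simpson's rule, and finds $\theta$ by bisection.

```python
import math

def grubbs_critical(n, confidence_pct, panels=2000):
    """One-sided Grubbs critical value for n datapoints (assumed formula, see text)."""
    if n < 3:
        raise ValueError("need at least three datapoints")
    alpha = 1.0 - confidence_pct / 100.0
    dof = n - 2
    target = 0.5 - alpha / n  # P(0 < T < t) for the wanted t quantile
    # Normalising constant of the t density after substituting t = sqrt(dof) * tan(u).
    c = math.gamma((dof + 1) / 2) / (math.sqrt(math.pi) * math.gamma(dof / 2))

    def half_cdf(theta):
        # Simpson's rule for c * integral from 0 to theta of cos(u)**(dof - 1) du.
        h = theta / panels
        total = 1.0 + math.cos(theta) ** (dof - 1)
        for k in range(1, panels):
            total += (4 if k % 2 else 2) * math.cos(k * h) ** (dof - 1)
        return c * total * h / 3.0

    # Bisection on theta in [0, pi/2]; half_cdf increases with theta.
    lo, hi = 0.0, math.pi / 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if half_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    theta = (lo + hi) / 2
    # Under the substitution, t**2 / (dof + t**2) equals sin(theta)**2.
    return (n - 1) / math.sqrt(n) * math.sin(theta)

print(f"{grubbs_critical(4, 95):.5f}")  # 1.46250, the table entry for n = 4 at 95%CL
```

Working in $\theta$ rather than $t$ keeps the integration range bounded, which avoids the very large $t$ values that occur for small n at high confidence levels.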

"""
reference: https://iquanta.org/instruct/python ::: Statistics 3: Grubb's Outlier Test ::: Stephen Lukacs, Ph.D. ©2023-02-14
"""
from py4web import URL, request
from yatl.helpers import *
from iquanta.mcp import is_str_float, is_str_int, str_to_float, str_to_int, extra_x
from iquanta.chmpy import Gtest, grubbs

BR, B = TAG['br/'], TAG['b']
#demo_data = "0.1190\n0.09847\n0.09852"
demo_data = "6.18\n6.28\n4.85\n6.49"

rtn = FORM(_action=None, _method="post")
if ('txtfile' in request.forms):
    txt, data = request.forms['txtfile'], [ ]
    for l in txt.strip().split('\n'):
        if is_str_float(l.strip()):
            data.append(str_to_float(l.strip()))
    #rtn.append(CAT(data, BR()))
otype = request.forms.get('otype')

rtn.append(STYLE("input[type=text] { width: 70px; text-align: center; border-radius: 7px; } textarea { margin: 0px; width: 295px; height: 200px; border-radius: 5px; } p { margin: 2px 0px; padding: 8px; border-radius: 10px; border: 2px solid silver; }"))
# Redisplay the submitted text (held in `txt`) if there is one, otherwise show the demonstration data.
rtn.append(CAT(DIV("Enter the measurements to be averaged below...", BR(), TEXTAREA(txt if ('txt' in locals()) else demo_data, _name="txtfile"), _style="float:left;"), DIV(BR(), "Which datapoint to test?", BR(), SELECT(OPTION("Minimum Outlier", _value="minimum"), OPTION("Maximum Outlier", _value="maximum"), _name="otype"), *[BR()]*2, INPUT(_type="submit", _value="Upload"), ", or, just Upload to run the demonstration.", _style="float:left; margin: 5px;"), DIV(_style="float:none;clear:both;")))

#rtn.append(otype)
if ('data' in locals()):
    CIs = (50, 80, 90, 95, 99, 99.5, 99.9,)
    Gtests = [ Gtest(data, otype, ci) for ci in CIs ]
    g = Gtests[-1]
    p = P()
    #p.append(CAT(str(g), BR()))
    p.append(CAT(B(f'You may reject the datapoint ({g[7]}) with {"only" if (g[5] < 50.) else ""} {g[5]:.2f}% confidence, or:', BR(), _style="font-size:18pt; font-weight:bold;")))
    for ci, g in zip(CIs, Gtests):
        #p.append(CAT(XML(g), BR()*2,))
        if g[6]:
            p.append(SPAN(XML(f'<b>at {ci}% confidence</b> the datapoint <b>must be accepted</b> with x&#772; = {g[1]:.4g} and &sigma; = {g[2]:.4g}.<br/>'), _style="font-size:14pt;"))
        else:
            p.append(SPAN(XML(f'<b>at {ci}% confidence</b> the datapoint <b>may be rejected</b> with x&#772; = {g[1]:.4g} and &sigma; = {g[2]:.4g} if accepted, and x&#772; = {g[8]:.4g} and &sigma; = {g[9]:.4g} if rejected.<br/>'), _style="font-size:14pt;"))
    rtn.append(p)
dv = DIV(H3("Understanding Grubbs' Outlier Test"), "Grubbs' test was first proposed by Frank Grubbs in 1950.  For any one-dimensional dataset, it checks whether the smallest or largest value is a genuine outlier that may be removed.  The test uses a statistical criterion to decide whether the suspect point may be rejected and disregarded, or must be accepted and kept in the dataset.", *[BR()]*2, r"It is a three-step process.  First, calculate the Grubbs' value using the equation: $$g = \frac{\mid outlier - \bar{x} \mid}{ \sigma }$$ where the outlier is either the minimum or maximum datapoint of the dataset to be tested, and \(\bar{x}\) and \(\sigma\) are the mean and standard deviation of the dataset, respectively.", *[BR()]*2, "Second, look up the Grubbs' critical value for the number of datapoints, n, and the required confidence level, CL, in the table below.  Third, if the g value is less than or equal to the critical value, the datapoint must be accepted and included.  If the g value is greater than the critical value, you may reject and disregard that datapoint.", *[BR()]*2, _style="")
g = grubbs()
gt = g.find('table#grubbs')[0]
gt['_style'] = "margin: auto;"
dv.append(CAT(g.find('style')[0], H3(". .. Grubbs' Critical Values .. ."), gt,))
#dv.append(CAT(*[BR()]*4, g))
dv.append(CAT("lecture by Stephen Lukacs, Ph.D., ©2011 - 2023; updated: March 7, 2023.  all data confirmed via ", A("lecture_data_analysis.nb", _href=URL('static', "pdf/lecture_data_analysis8.pdf"), _target="data_analysis"), "."))
rtn.append(dv)
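
The listing above depends on `Gtest` and the string helpers from the `iquanta` package, whose source is not shown here. As a rough standalone approximation of the numbers the report prints (mean and standard deviation with the suspect point accepted, and again with it rejected), the following sketch uses only the standard library. The names `parse_measurements` and `summarize` are hypothetical, and the sample standard deviation is again an assumption rather than a confirmed detail of `Gtest`.

```python
import statistics

def parse_measurements(text):
    """Collect one float per line, skipping lines that do not parse as numbers."""
    values = []
    for line in text.strip().split("\n"):
        try:
            values.append(float(line.strip()))
        except ValueError:
            continue
    return values

def summarize(data, outlier_type="minimum"):
    """Mean and sample standard deviation with and without the suspect datapoint."""
    outlier = min(data) if outlier_type == "minimum" else max(data)
    kept = list(data)
    kept.remove(outlier)  # drops a single occurrence of the suspect value
    return {
        "outlier": outlier,
        "mean_if_accepted": statistics.mean(data),
        "sigma_if_accepted": statistics.stdev(data),
        "mean_if_rejected": statistics.mean(kept),
        "sigma_if_rejected": statistics.stdev(kept),
    }

# The page's demonstration data plus one unparsable line, which is ignored.
summary = summarize(parse_measurements("6.18\n6.28\n4.85\n6.49\nnot a number"), "minimum")
for key, value in summary.items():
    print(f"{key}: {value:.4g}")
```

Removing the minimum from the demonstration data shrinks the standard deviation considerably, which is why the page reports both sets of statistics whenever a datapoint may be rejected.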