JKQtPlotter/examples/simpletest_datastore_groupedstat/README.md
jkriege2 2b5a30a668 statistics library: added functions for grouped statistics and adaptors for them
added example for grouped statistics
some improvements to documentation
2019-06-13 10:29:12 +02:00

15 KiB

Tutorial (JKQTPDatastore): 1-Dimensional Group Statistics with JKQTPDatastore

[JKQTPlotterBasicJKQTPDatastore]: @ref JKQTPlotterBasicJKQTPDatastore "Basic Usage of JKQTPDatastore" [JKQTPlotterBasicJKQTPDatastoreIterators]: @ref JKQTPlotterBasicJKQTPDatastoreIterators "Iterator-Based usage of JKQTPDatastore" [JKQTPlotterBasicJKQTPDatastoreStatistics]: @ref JKQTPlotterBasicJKQTPDatastoreStatistics "Advanced 1-Dimensional Statistics with JKQTPDatastore" [JKQTPlotterBasicJKQTPDatastoreRegression]: @ref JKQTPlotterBasicJKQTPDatastoreRegression "Regression Analysis (with the Statistics Library)" [JKQTPlotterBasicJKQTPDatastoreStatisticsGroupedStat]: @ref JKQTPlotterBasicJKQTPDatastoreStatisticsGroupedStat "1-Dimensional Group Statistics with JKQTPDatastore" [JKQTPlotterBasicJKQTPDatastoreStatistics2D]: @ref JKQTPlotterBasicJKQTPDatastoreStatistics2D "Advanced 2-Dimensional Statistics with JKQTPDatastore" [statisticslibrary]: @ref jkqtptools_math_statistics "JKQTPlotter Statistics Library"

This tutorial project (see ./examples/simpletest_datastore_groupedstat/) explains several advanced functions of JKQTPDatastore in combination with the statisticslibrary conatined in JKQTPlotter.

Note that there are additional tutorial explaining other aspects of data mangement in JKQTPDatastore:

  • [JKQTPlotterBasicJKQTPDatastore]
  • [JKQTPlotterBasicJKQTPDatastoreIterators]
  • [JKQTPlotterBasicJKQTPDatastoreStatistics]
  • [JKQTPlotterBasicJKQTPDatastoreRegression]
  • [JKQTPlotterBasicJKQTPDatastoreStatisticsGroupedStat]
  • [JKQTPlotterBasicJKQTPDatastoreStatistics2D]

[TOC]

The source code of the main application can be found in jkqtplotter_simpletest_datastore_groupedstat.cpp. This tutorial cites only parts of this code to demonstrate different ways of working with data for the graphs.

Barcharts & Boxplots from categorized data

Generating a Dataset for Grouped Barcharts

To demonstrate the grouped statistics, we first have to generate a dataset. The datapoints consist of pairs <group,value>, where the groups are encoded by the numbers 1,2,3 and in each group, several measurements are taken:

    size_t colBarRawGroup=datastore1->addColumn("barchart, rawdata, group");
    size_t colBarRawValue=datastore1->addColumn("barchart, rawdata, value");

    //      data for group 1
    datastore1->appendToColumns(colBarRawGroup, colBarRawValue, 1, 1.1);
    datastore1->appendToColumns(colBarRawGroup, colBarRawValue, 1, 1.5);
    datastore1->appendToColumns(colBarRawGroup, colBarRawValue, 1, 0.8);
    // ...

    //      data for group 2
    datastore1->appendToColumns(colBarRawGroup, colBarRawValue, 2, 2.2);
    // ...
    
    //      data for group 3
    datastore1->appendToColumns(colBarRawGroup, colBarRawValue, 3, 4.1);
    // ...
    

Note that the data does not have to be sorted. You can add the dataset in any order!

This dataset can be visualized with a simple scatter plot:

    JKQTPXYLineGraph* gScatterForBar;
    plotbarchart->addGraph(gScatterForBar=new JKQTPXYLineGraph(plotbarchart));
    gScatterForBar->setXYColumns(colBarRawGroup, colBarRawValue);
    gScatterForBar->setDrawLine(false);
    gScatterForBar->setSymbolType(JKQTPCross);
    gScatterForBar->setSymbolSize(5);
    gScatterForBar->setSymbolColor(QColorWithAlphaF(QColor("red"), 0.5));

jkqtplotter_simpletest_datastore_groupedstat_barchartrawdata

Calculating Grouped Statistics for a Barchart

Now we want to draw a barchart for every group, which indicates the average in each group. This is done using methods from the statistics library. First we need to group the data using jkqtpstatGroupData(), which assembles the data points in each group groupeddataBar

    std::map<double, std::vector<double> > groupeddataBar;
    jkqtpstatGroupData(datastore1->begin(colBarRawGroup), datastore1->end(colBarRawGroup),
                       datastore1->begin(colBarRawValue), datastore1->end(colBarRawValue),
                       groupeddataBar);

Now we can calculate the statistics for each group separately: Data is collected in new columns colBarGroup, colBarAverage and colBarStdDev. The statistics is then calculated by simply iterating over groupeddataBar and calling functions like jkqtpstatAverage() for each group:

    size_t colBarGroup=datastore1->addColumn("barchart, group");
    size_t colBarAverage=datastore1->addColumn("barchart, group-average");
    size_t colBarStdDev=datastore1->addColumn("barchart, group-stddev");

    for (auto it=groupeddataBar.begin(); it!=groupeddataBar.end(); ++it) {
        datastore1->appendToColumn(colBarGroup, it->first);
        datastore1->appendToColumn(colBarAverage, jkqtpstatAverage(it->second.begin(), it->second.end()));
        datastore1->appendToColumn(colBarStdDev, jkqtpstatStdDev(it->second.begin(), it->second.end()));
    }

Finally the calculated groups are drawn:

    JKQTPBarVerticalErrorGraph* gBar;
    plotbarchart->addGraph(gBar=new JKQTPBarVerticalErrorGraph(plotbarchart));
    gBar->setXYColumns(colBarGroup, colBarAverage);
    gBar->setYErrorColumn(static_cast<int>(colBarStdDev));

jkqtplotter_simpletest_datastore_groupedstat_barchart

In order to safe yo the typing of the code above, shortcuts in the form of adaptors exist:

    jkqtpstatAddYErrorBarGraph(plotbarchart->getPlotter(),
                               datastore1->begin(colBarRawGroup), datastore1->end(colBarRawGroup),
                               datastore1->begin(colBarRawValue), datastore1->end(colBarRawValue));

Also other flavors exist that generate different graphs (see the JKQTPlotter documentation):

  • jkqtpstatAddYErrorLineGraph() / jkqtpstatAddXErrorLineGraph()
  • jkqtpstatAddYErrorBarGraph() / jkqtpstatAddXErrorBarGraph()
  • jkqtpstatAddYErrorImpulsesGraph() / jkqtpstatAddXErrorImpulsesGraph()
  • jkqtpstatAddYErrorParametrizedScatterGraph() / jkqtpstatAddXErrorParametrizedScatterGraph()
  • jkqtpstatAddYErrorFilledCurveGraph() / jkqtpstatAddXErrorFilledCurveGraph()
  • jkqtpstatAddYErrorGraph() / jkqtpstatAddXErrorGraph()

Calculating Grouped Statistics for a Boxplot

With the methods above we can also calculate more advanced statistics, like e.g. boxplots:

    size_t colBarMedian=datastore1->addColumn("barchart, group-median");
    size_t colBarMin=datastore1->addColumn("barchart, group-min");
    size_t colBarMax=datastore1->addColumn("barchart, group-max");
    size_t colBarQ25=datastore1->addColumn("barchart, group-Q25");
    size_t colBarQ75=datastore1->addColumn("barchart, group-Q75");
    for (auto it=groupeddataBar.begin(); it!=groupeddataBar.end(); ++it) {
        datastore1->appendToColumn(colBarMedian, jkqtpstatMedian(it->second.begin(), it->second.end()));
        datastore1->appendToColumn(colBarMin, jkqtpstatMinimum(it->second.begin(), it->second.end()));
        datastore1->appendToColumn(colBarMax, jkqtpstatMaximum(it->second.begin(), it->second.end()));
        datastore1->appendToColumn(colBarQ25, jkqtpstatQuantile(it->second.begin(), it->second.end(), 0.25));
        datastore1->appendToColumn(colBarQ75, jkqtpstatQuantile(it->second.begin(), it->second.end(), 0.75));
    }

The result can be plotted using JKQTPBoxplotVerticalGraph, which receives a column for each value class of the final plot:

    JKQTPBoxplotVerticalGraph* gBoxplot;
    plotboxplot->addGraph(gBoxplot=new JKQTPBoxplotVerticalGraph(plotboxplot));
    gBoxplot->setPositionColumn(colBarGroup);
    gBoxplot->setMinColumn(colBarMin);
    gBoxplot->setMaxColumn(colBarMax);
    gBoxplot->setMedianColumn(colBarMedian);
    gBoxplot->setPercentile25Column(colBarQ25);
    gBoxplot->setPercentile75Column(colBarQ75);

jkqtplotter_simpletest_datastore_groupedstat_boxplot

In order to safe yo the typing of the code above, shortcuts in the form of adaptors exist:

    jkqtpstatAddHBoxplotsAndOutliers(plotboxplot->getPlotter(),
                                    datastore1->begin(colBarRawGroup), datastore1->end(colBarRawGroup),
                                    datastore1->begin(colBarRawValue), datastore1->end(colBarRawValue));

Also other flavors exist that generate different graphs (see the JKQTPlotter documentation):

  • jkqtpstatAddVBoxplotsAndOutliers() / jkqtpstatAddHBoxplotsAndOutliers()
  • jkqtpstatVAddBoxplots() / jkqtpstatHAddBoxplots()
  • jkqtpstatAddBoxplots()

(Scatter-)Graphs with X/Y-errors from Categorized Data

Dataset for XY Scatter Graphs

First we generate a second dataset, which is going to be used for a scaterplot. The datapoints consist of pairs <x,y>, that are based on a parabula with random deviations, both in x- and y-direction:

    size_t colScatterRawX=datastore1->addColumn("scatterplot, rawdata, x");
    size_t colScatterRawY=datastore1->addColumn("scatterplot, rawdata, y");
    std::random_device rd; // random number generators:
    std::mt19937 gen{rd()};
    std::normal_distribution<> d1{0,0.5};
    const size_t N=100;
    const double xmax=3.5;
    for (size_t i=0; i<N; i++) {
        const double x=(static_cast<double>(i)-static_cast<double>(N)/2.0)*xmax/(static_cast<double>(N)/2.0);
        const double y=jkqtp_sqr(x)+2.0;
        datastore1->appendToColumns(colScatterRawX, colScatterRawY, x+d1(gen), y+d1(gen));
    }

This dataset can be visualized:

    JKQTPXYParametrizedScatterGraph* gScatterRaw;
    plotscattererrors->addGraph(gScatterRaw=new JKQTPXYParametrizedScatterGraph(plotscattererrors));
    gScatterRaw->setXYColumns(colScatterRawX, colScatterRawY);
    gScatterRaw->setDrawLine(false);
    gScatterRaw->setSymbolType(JKQTPCross);
    gScatterRaw->setSymbolSize(5);

jkqtplotter_simpletest_datastore_groupedstat_scatterrawdata

Calculating x- and y-Errors from Categorized Data

Now we want to draw a scatterchart of the data, where data-points should be grouped together, in x-intervals of width 0.5. From all the points in each interval, we calculate the in both x- and y-direction the average and standard deviation. First we need to group the data using jkqtpstatGroupData(), which assembles the data points in each group groupeddataScatter. For the custom grouping of the datapoints we use the optional functor provided to jkqtpstatGroupData(): We use jkqtpstatGroupingCustomRound1D() with given parameters 0.25 for the (center) location of the first bin and bin width 0.5. The functor is not built by hand (which would be possible using std::bind), but with the generator function jkqtpstatMakeGroupingCustomRound1D(). In addition we use a variant of jkqtpstatGroupData(), which outputs a column with the category assigned to every data pair in the input data range:

    std::map<double, std::pair<std::vector<double>,std::vector<double> > > groupeddataScatter;
    size_t colScatterRawGroup=datastore1->addColumn("scatterplot, rawdata, assigned-group");
    jkqtpstatGroupData(datastore1->begin(colScatterRawX), datastore1->end(colScatterRawX),
                       datastore1->begin(colScatterRawY), datastore1->end(colScatterRawY),
                       datastore1->backInserter(colScatterRawGroup),
                       groupeddataScatter,
                       jkqtpstatMakeGroupingCustomRound1D(0.25, 0.5));

The column colScatterRawGroup can now be used to color the scatter graph:

    gScatterRaw->setColorColumn(colScatterRawGroup);

Now we can calculate the statistics for each group separately: Data is collected in two new columns. Then the statistics is calculated by simply iterating over groupeddataScatter and calling functions like jkqtpstatAverage() for each group:

    size_t colScatterXAvg=datastore1->addColumn("scatter, x, average");
    size_t colScatterXStd=datastore1->addColumn("scatter, x, stddev");
    size_t colScatterYAvg=datastore1->addColumn("scatter, y, average");
    size_t colScatterYStd=datastore1->addColumn("scatter, y, stddev");

    for (auto it=groupeddataScatter.begin(); it!=groupeddataScatter.end(); ++it) {
        datastore1->appendToColumn(colScatterXAvg, jkqtpstatAverage(it->second.first.begin(), it->second.first.end()));
        datastore1->appendToColumn(colScatterXStd, jkqtpstatStdDev(it->second.first.begin(), it->second.first.end()));
        datastore1->appendToColumn(colScatterYAvg, jkqtpstatAverage(it->second.second.begin(), it->second.second.end()));
        datastore1->appendToColumn(colScatterYStd, jkqtpstatStdDev(it->second.second.begin(), it->second.second.end()));
    }

Finally the calculated groups are drawn

    JKQTPXYLineErrorGraph* gScatterErr;
    plotscattererrors->addGraph(gScatterErr=new JKQTPXYLineErrorGraph(plotscattererrors));
    gScatterErr->setXYColumns(colScatterXAvg, colScatterYAvg);
    gScatterErr->setXErrorColumn(static_cast<int>(colScatterXStd));
    gScatterErr->setYErrorColumn(static_cast<int>(colScatterYStd));
    gScatterErr->setSymbolType(JKQTPFilledTriangle);
    gScatterErr->setDrawLine(false);

jkqtplotter_simpletest_datastore_groupedstat_scatter

In order to safe yo the typing of the code above, shortcuts in the form of adaptors exist:

    jkqtpstatAddXYErrorLineGraph(plotscattererrors->getPlotter(),
                                 datastore1->begin(colScatterRawX), datastore1->end(colScatterRawX),
                                 datastore1->begin(colScatterRawY), datastore1->end(colScatterRawY),
                                 jkqtpstatMakeGroupingCustomRound1D(0.25, 0.5));

Also other flavors exist that generate different graphs (see the JKQTPlotter documentation):

  • jkqtpstatAddXYErrorLineGraph()
  • jkqtpstatAddXYErrorParametrizedScatterGraph()
  • jkqtpstatAddXYErrorGraph()

Screenshot of the full Program

The output of the full test program jkqtplotter_simpletest_datastore_groupedstat.cpp looks like this:

jkqtplotter_simpletest_datastore_groupedstat