YODA - Yet more Objects for Data Analysis 2.0.0

YODA is a package for creation and analysis of statistical data, written in C++ and usable from C++ and Python. The development and developers of YODA have emerged from the sub-field of Monte Carlo event generator validation and tuning in high-energy physics (HEP), but very intentionally there is nothing in YODA which is specific to that application!

HEP researchers may reasonably ask why we would develop another data analysis package when the ROOT (http://cern.ch/root/) package is so well established? Here are our answers, along the way illustrating the design principles of YODA:

  • ROOT is a very large and monolithic package, containing many, many features other than basic data analysis types. For our applications it was not desirable to introduce such a huge dependency just to get some histogramming functionality. YODA design aim #1: be small and to-the-point, i.e. do one job really well.
  • ROOT does not support sparse binning, i.e. gaps between histogram bins in either 1D or 2D histograms. This feature is required in general, and in particular had long been a problem for the Rivet and Professor systems, which required post-processing scripts to strip out spurious bins not present in reference data from e.g. the HepData repository. The only neat solution was to write our own histogram classes, supporting bin-masking and other useful features like intrinsic support for fill-weighting. YODA design aim #2: support general requirements of statistical data objects.
  • ROOT is infamously difficult to use as a library: it likes to take over the command line parsing, manage object memory with hidden state (leading to crashes or memory leaks), etc. We wanted neither to deal with those problems ourselves nor to pass them on to YODA. Our previous, more lightweight, choice also had design issues that impacted on users. YODA design aim #3: make the nicest, sanest programming interface for data analysis possible.
  • In ROOT in particular, histograms as statistical data objects and histograms as graphical data representations are conflated concepts. It's hence impossible to declare some data as constant (for safety) without then being unable to change its plotting style. We don't like that. We also want to avoid common mistakes such as confusing the height of a histogram bin (a plotting concept) with its statistical content – YODA is designed to make the meaning of bin contents very clear and unambiguous. YODA design aim #4: separate data handling from presentation.
  • Event generators of various kinds do not always produce simulated events with a physical distribution: the events themselves may be weighted for technical or efficiency reasons. They may also produce negative weights, or a whole collection of weights with different meanings for each event: these aspects of statistical analysis are not well-served by existing systems. YODA design aim #5: handle weights in the best possible way, including negative weights and weight vectors.
  • Like ROOT and AIDA, the emphasis of YODA is on reproducible, programmatic data analysis. Plots should not normally require manual intervention: if they are made via a script or program, they can be reproduced after a long delay or when the original author has moved on... that's good for science. Unlike ROOT, we don't think that C++ is a sane scripting language. YODA design aim #6: enable pleasant, clear programmatic analysis and plotting from C++ programs or Python scripts.
  • Finally, high-statistics analysis requires the ability to split analyses into many small runs which are executed in parallel. YODA design aim #7: store all the information required for complete and exact reconstruction of up to second-order statistical moments of complete runs from the merging of constituent equivalent or distinct sub-runs. Store sufficient data that normal post-processing operations such as scaling or division may be applied after a run-merging phase.
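
To make design aims #5 and #7 concrete, here is a minimal pure-Python sketch of the underlying idea: if a data object stores the raw weighted sums rather than a derived mean or bin height, sub-runs can be merged by simple addition and the moments recomputed afterwards. The WeightedMoments class, its field names and its methods are hypothetical and written only for this illustration; they are not YODA's API, and YODA's own objects may store more information than is shown here.

```python
from dataclasses import dataclass


@dataclass
class WeightedMoments:
    """Hypothetical accumulator of weighted sums (illustration only, not a YODA class)."""
    num_fills: int = 0
    sum_w: float = 0.0    # sum of weights
    sum_w2: float = 0.0   # sum of squared weights
    sum_wx: float = 0.0   # sum of weight * x
    sum_wx2: float = 0.0  # sum of weight * x^2

    def fill(self, x: float, weight: float = 1.0) -> None:
        """Record one value; the weight may be negative."""
        self.num_fills += 1
        self.sum_w += weight
        self.sum_w2 += weight * weight
        self.sum_wx += weight * x
        self.sum_wx2 += weight * x * x

    def __add__(self, other: "WeightedMoments") -> "WeightedMoments":
        """Merge two sub-runs by adding their raw sums."""
        return WeightedMoments(
            self.num_fills + other.num_fills,
            self.sum_w + other.sum_w,
            self.sum_w2 + other.sum_w2,
            self.sum_wx + other.sum_wx,
            self.sum_wx2 + other.sum_wx2,
        )

    def scaled(self, factor: float) -> "WeightedMoments":
        """Rescale every weight by `factor`, e.g. as a post-merge normalisation."""
        return WeightedMoments(
            self.num_fills,
            factor * self.sum_w,
            factor * factor * self.sum_w2,
            factor * self.sum_wx,
            factor * self.sum_wx2,
        )

    def mean(self) -> float:
        """Weighted mean of x."""
        return self.sum_wx / self.sum_w

    def variance(self) -> float:
        """Weighted (population) variance of x."""
        return self.sum_wx2 / self.sum_w - self.mean() ** 2


# (x, weight) pairs, including one negative weight. The values are chosen to be
# exactly representable as floats, so the merged sums match the single-run sums
# bit for bit; with arbitrary inputs they agree up to floating-point rounding.
events = [(1.0, 1.0), (2.0, -0.5), (3.0, 2.0), (4.0, 1.0)]

# One complete run over all events.
full = WeightedMoments()
for x, w in events:
    full.fill(x, w)

# The same events split across two sub-runs, then merged.
run_a, run_b = WeightedMoments(), WeightedMoments()
for x, w in events[:2]:
    run_a.fill(x, w)
for x, w in events[2:]:
    run_b.fill(x, w)
merged = run_a + run_b
```

In this sketch merged compares equal to full, and merged.scaled(2.0) has the same mean as the unscaled object, because a uniform weight rescaling cancels in the ratio. Storing only a per-run mean and width would not allow this kind of reconstruction without also knowing the weight sums of each sub-run.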