2020 Campaign: Available Data, Models and Data Analysis Resources

The information collected during 2020 has been completely organized. It
is only partially analyzed, in the sense that there are many kinds of
information that we haven’t applied to any particular study and
therefore haven’t processed or refined. Even so, every piece of
information is ready enough for any interested party to work on it.

As there are many points to mention, we’ve split this article into as
many sections as possible, hoping you can go directly to the parts that
interest you.

If you have a particular need that you believe we might help you with,
don’t hesitate to reach out. We have a lot of unpublished material and
source code that we are still working on.

Synthetic Description

The BI dataset is generated fundamentally by analyzing produce and soil
collected on farms, and produce acquired from a variety of commercial
sources. It is meant to be as neutral and deep as possible: we want it
to sustain deep, committed research, to support testing hypotheses
across sufficiently different groups of points, and so on. We tried to
sample from sources as heterogeneous and as representative as possible.

It is sample centric: each point is information about either a piece
of crop or a soil/crop pair.

It is heterogeneous: depending on the source of the sample,
different amounts of information can be reasonably expected. To give a
simple example: no soil data can be paired with a piece of fruit bought
at a supermarket.

It is composite: it is very extensive and is assembled from an array of
complex data sources, mainly SurveyStack surveys. For that reason,
substantial preprocessing is needed on our side before sharing, and
typically some more work is required to adapt it to each research
project that is going to be based on it.

It is multisampled: each produce sample we receive is divided into up
to three sub-samples, which all share the same sample id but are
otherwise identified separately, so there are repeated measurements of
most specimens. To avoid overfitting, care must be taken while training
models to keep all samples sharing a sample_id together in the same
training fold.
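
As a minimal sketch of that constraint (not our actual training code), the snippet below assigns rows to cross-validation folds so that every row with the same sample_id lands in the same fold. The function name `grouped_folds` and the example ids are hypothetical, and the greedy balancing rule is just one reasonable choice.

```python
from collections import defaultdict

def grouped_folds(sample_ids, n_folds=5):
    """Assign each row to a fold so rows sharing a sample_id stay together."""
    # Collect the row indices that belong to each sample_id.
    groups = defaultdict(list)
    for row, sid in enumerate(sample_ids):
        groups[sid].append(row)

    # Greedy balancing: place the largest groups first, always into the
    # fold that currently holds the fewest rows.
    fold_sizes = [0] * n_folds
    fold_of_row = [None] * len(sample_ids)
    for sid, rows in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        target = fold_sizes.index(min(fold_sizes))
        for row in rows:
            fold_of_row[row] = target
        fold_sizes[target] += len(rows)
    return fold_of_row

# Hypothetical rows: three sub-samples of "s1", two of "s2", one of "s3".
sample_ids = ["s1", "s1", "s2", "s3", "s1", "s2"]
folds = grouped_folds(sample_ids, n_folds=2)
```

If scikit-learn is available, its `GroupKFold` splitter serves the same purpose when sample_id is passed as the `groups` argument.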

What’s available

    • Crop data: species, variety, color, observed defects, mineral
      composition, antioxidants, polyphenols, brix, protein (grain
      only). NIR scans of surface and ground material. UV/VIS scans
      of surface and ground material.
    • Farm data: location and farm practices such as management,
      irrigation and soil treatment.
    • Soil data: soil health metrics (respiration, organic carbon
      content, pH), UV/VIS and NIR scans, mineral composition.

Data Dictionary

Data Dictionary File

Our Focus and Results So Far

These are some of the analyses that we’ve performed so far on the 2020
data.

This material can be found across our 2020 briefs.

Grain Data

Grain Soils Data
All soil models are also in this brief.

Produce Data Analysis

Leafy Greens Data Analysis

Non Grain Soils Data analysis, except predictive models, which are
included in the previous brief.


We’ve generated many synthesis materials which you can find in our
briefs, such as

  • Histograms for most variables, by crop, climate region and variety.
  • Reference value tables containing maximums, minimums, means, medians,
    variation ratios and variance.
  • Large collections of box plots comparing subdivisions of crops such
    as climate regions, varieties, and cultivation techniques for
    different variables.
  • Many correlation analyses, both within crop attributes and soil
    attributes separately and across those two categories.
  • Identification of outliers for nutrients, minerals and spectra.
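
To illustrate what one row of such a reference value table might contain, here is a small sketch using only the Python standard library. The function name, the example brix values, and the reading of "variation ratio" as the coefficient of variation (standard deviation divided by mean) are assumptions made for illustration; the briefs may define these quantities differently.

```python
import statistics

def reference_values(values):
    """Summary statistics for one variable within one group (e.g. one crop)."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance (n - 1)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": variance,
        # Assumption: "variation ratio" taken as the coefficient of variation.
        "cv": (variance ** 0.5) / mean if mean else float("nan"),
    }

# Hypothetical brix readings for a single crop.
brix = [8.0, 10.0, 12.0, 14.0]
row = reference_values(brix)
```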


We’ve tried several algorithms. So far we rely mainly on Quantile
Random Forest (QRF) when we want results as accurate and useful as
possible, and on linear and multilinear regressions whenever we want to
do deeper statistical analysis, typically comparing the results of both.
We’ve tested many more models, which we did not adopt for reasons such
as accuracy or suitability to our particular needs.

Most modelling is structured around modular datasets. We typically
organize the available data into categories according to the intended
use of the model, design different combinations tailored to the
possibilities and needs of different user groups, train models for all
the combinations and compare their precision. For crop nutrient
prediction, 21 models were designed, targeting a spectrum of users from
a buyer in a supermarket to a fully equipped professional lab.
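
The idea of modular datasets can be sketched as enumerating combinations of data categories, each combination defining the input columns of one candidate model. The category names and column names below are hypothetical, and the sketch does not reproduce the 21 designs mentioned above; it only shows the combinatorial step, after which a model would be trained and scored per variant.

```python
from itertools import combinations

# Hypothetical categories, each mapping to the columns a user group could supply.
categories = {
    "surface_scan": ["uvvis_surface"],
    "metadata": ["species", "variety", "color"],
    "lab_scan": ["nir_ground", "uvvis_ground"],
}

def dataset_variants(categories):
    """Return every non-empty combination of categories with its column list."""
    names = list(categories)
    variants = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            columns = [col for name in combo for col in categories[name]]
            variants["+".join(combo)] = columns
    return variants

variants = dataset_variants(categories)
for name, columns in variants.items():
    print(name, columns)
```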

Nutrient Prediction based on reflectometer scan (in field modelling)

This was one of our main drivers. Currently, we are predicting
antioxidants, polyphenols, proteins and brix using the UV/VIS
bionutrient scanner. The QRF model returns an 80% confidence interval
for the value of each nutrient. In most cases, this interval is narrow
enough, between a decile and a quartile wide, to tell the user what
quality range the piece of produce being analyzed falls in, using just
an instantaneous scan of the surface of the product.
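
To make the interval idea concrete, the sketch below derives an 80% interval as the 10th to 90th percentile of a set of candidate values for one scan, then locates the interval ends within a reference distribution for the crop. This is a simplification under stated assumptions: a real quantile forest weights training responses by leaf membership rather than taking plain percentiles, and all names and numbers here are hypothetical.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of pre-sorted values, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def interval_80(candidate_values):
    """80% interval: from the 10th to the 90th percentile of the candidates."""
    vals = sorted(candidate_values)
    return percentile(vals, 0.10), percentile(vals, 0.90)

def percentile_rank(reference, value):
    """Fraction of reference values at or below `value`."""
    return sum(r <= value for r in reference) / len(reference)

# Hypothetical candidate values for one scan and a reference distribution.
candidates = [40.0 + i for i in range(11)]
reference = [float(i) for i in range(100)]

low, high = interval_80(candidates)
low_rank = percentile_rank(reference, low)
high_rank = percentile_rank(reference, high)
```

The difference between `high_rank` and `low_rank` is the share of the reference distribution covered by the interval, which is what decides whether a prediction narrows quality down to about a decile or about a quartile.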

These models are deployed, so they can currently be used in the field
for those crop/nutrient combinations that yielded enough precision.

Nutrient Prediction based on soil conditions

This work is reflected in our soil briefs. There’s substantial interest
in modelling the relationship between these factors and also in proving
that the correlation exists. This is still a work in progress, but
we’ve done significant work already.

Soil Organic Carbon Content Prediction based on reflectometer scan

We are attempting to replicate the same strategy we used for
nutrients. The results are encouraging, but we decided this model needs
a sharper level of precision to be relevant for users, so we want to
improve it before releasing it.

Statistical Results

This is also reflected in our briefs.

  • Median comparisons that try to quantify the influence of farm
    management and broader farm practices on the nutritional content of
    crops. These are non-parametric comparisons between reference
    neutral farms and farms tagged as users of each practice. They are
    available as tables that report the percentage displacement of the
    mean for the practicing farms.
  • Some clustering work on nutrient content across several
    categories such as farm practices and climate regions.
  • Some ANOVA work comparing climate regions, varieties and practices
    for our main studied nutrients.
  • Correlations between most variables, recorded and plotted.
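
As an illustration of the comparison between reference farms and practicing farms, the sketch below computes a percentage displacement and a permutation p-value on the difference of medians. The briefs do not state which non-parametric test was used, so the permutation test is a stand-in chosen for illustration, and the function names and values are hypothetical.

```python
import random
import statistics

def percent_shift(reference, practicing, stat=statistics.median):
    """Percentage displacement of `stat` for practicing farms vs. reference."""
    base = stat(reference)
    return 100.0 * (stat(practicing) - base) / base

def permutation_p_value(reference, practicing, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference of medians."""
    rng = random.Random(seed)
    observed = abs(statistics.median(practicing) - statistics.median(reference))
    pooled = list(reference) + list(practicing)
    n_ref = len(reference)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.median(pooled[n_ref:])
                   - statistics.median(pooled[:n_ref]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical nutrient values for reference and practicing farms.
reference = [10.0, 11.0, 12.0, 13.0, 14.0]
practicing = [12.0, 13.0, 15.0, 16.0, 17.0]

shift = percent_shift(reference, practicing)
p_value = permutation_p_value(reference, practicing)
```

Passing `stat=statistics.mean` to `percent_shift` gives the displacement of the mean instead of the median.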