Data Analysis, RFC 2019 dataset

Hello, everyone!

We’d like to share our progress on this analysis periodically. Since there are many possible ways to advance beyond what’s already available, it’s very useful to share ideas, get feedback, and compare and test each other’s hypotheses.

Results (download and open in web browser):

Our results are mostly available in the dataAnalysis.html file inside our GitLab repo, where you can also find the associated R objects.

We began with some fundamental steps to make the data usable: merging, cleaning, formatting, and eliminating outliers. The resulting organization of the data can be seen in our spectral-curve plots. The curves have been clustered by their successions of derivative signs in order to spot the most obvious patterns. In these plots we observed some curious behaviour that we are still analyzing; for now we have mostly classified the most peculiar patterns as atypical data.
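
To make the clustering criterion concrete, here is a minimal Python sketch of grouping curves by their succession of derivative signs (the actual analysis is in R; the sample names and values below are invented for illustration):

```python
# Group spectral curves by the succession of signs of their successive
# differences, collapsing runs of the same sign into one symbol.

def sign_signature(curve):
    """Collapse signs of successive differences into a run-free string,
    e.g. a curve that rises then falls yields '+-'."""
    signs = []
    for a, b in zip(curve, curve[1:]):
        s = "+" if b > a else "-" if b < a else "0"
        if not signs or signs[-1] != s:
            signs.append(s)
    return "".join(signs)

def cluster_by_signature(curves):
    """Group curve names sharing the same derivative-sign succession."""
    clusters = {}
    for name, curve in curves.items():
        clusters.setdefault(sign_signature(curve), []).append(name)
    return clusters

curves = {
    "sample_a": [0.1, 0.3, 0.5, 0.4, 0.2],   # rises then falls -> '+-'
    "sample_b": [1.0, 1.2, 1.5, 1.1, 0.9],   # same pattern     -> '+-'
    "sample_c": [0.9, 0.7, 0.4, 0.6, 0.8],   # falls then rises -> '-+'
}
print(cluster_by_signature(curves))
```

Curves with identical signatures share the same qualitative shape, which is what makes the atypical patterns stand out.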

Our first target is to predict polyphenol and antioxidant content using data obtained with the Our Sci spectrometer. We’ve already done exhaustive modelling work over a large number of subsets of the available dimensions (mostly combinations of the Our Sci spectra, the SWIR spectra measured with the SIWARE device, and metadata such as produce color) with a variety of statistical models. We’ve organized a powerful testing framework that let us run hundreds of thousands of modelling instances, with repetitions and a reasonable level of cross-validation (15 folds) to avoid bias and overfitting as much as possible. If you’d like us to run a particular model you believe is suited to this situation and we haven’t tried it, please let us know; we can probably run it with the same procedure without much effort.
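
As a rough illustration of the cross-validation loop at the core of such a framework (not the actual R tooling), here is a minimal Python sketch with 15 folds; a one-variable least-squares fit stands in for the real candidate models:

```python
# Minimal k-fold cross-validation sketch: average out-of-fold R^2
# over k = 15 folds, matching the fold count mentioned in the text.
import random

def fit_ols(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def cv_r2(xs, ys, k=15, seed=0):
    """Average held-out R^2 over k folds (each fold needs > 1 point)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = fit_ols([xs[i] for i in train], [ys[i] for i in train])
        held = [ys[i] for i in fold]
        pred = [model(xs[i]) for i in fold]
        mh = sum(held) / len(held)
        ss_res = sum((y - p) ** 2 for y, p in zip(held, pred))
        ss_tot = sum((y - mh) ** 2 for y in held)
        scores.append(1 - ss_res / ss_tot)
    return sum(scores) / k

xs = list(range(30))
ys = [2 * x + 1 for x in xs]   # noiseless line, so held-out R^2 is ~1
print(round(cv_r2(xs, ys), 6))
```

The same loop works for any model that exposes fit and predict steps, which is what makes running an extra suggested model cheap.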

This is still a testing stage in which we are searching for the most promising models, though our plan is nearly complete and we are about to move to a second round of selection among a much smaller subset of models, which will be examined in more depth. This is why you’ll find some unreasonable models in the tables, for example models that are clearly overfitted. It also let us verify that our cross-validation was effectively distinguishing overfitted linear models with ease (compare the original rsquared column in the linear models table with the cvRsquared one).
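
That check can be expressed as a small filter over table rows. The column names rsquared and cvRsquared follow the linear-models table in the brief; the model names and numbers below are invented for illustration:

```python
# Flag models whose in-sample rsquared is far above their cross-validated
# cvRsquared: a large gap is the overfitting signature described above.

rows = [
    {"model": "lm_full",    "rsquared": 0.99, "cvRsquared": 0.31},
    {"model": "lm_reduced", "rsquared": 0.72, "cvRsquared": 0.68},
]

def overfitted(row, gap=0.2):
    """True when the in-sample fit outruns the held-out fit by > gap."""
    return row["rsquared"] - row["cvRsquared"] > gap

print([r["model"] for r in rows if overfitted(r)])
```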

Special effort has been made to fit the classical linear models, first shown in the brief as a tentative ANCOVA fit and then detailed in the modelling section. A variance-stabilizing transformation was tried (without interesting outcomes), and a shallow pass at robust modelling was carried out, yielding an interesting 20% reduction in variance without any special tuning. Robust fitting and outlier weighting is one of the directions we will explore further.
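
As a hedged sketch of the robust-fitting direction (a stand-in, not the exact procedure used in the brief), here is iteratively reweighted least squares with Huber weights, which automatically downweights outlying points:

```python
# Robust line fit via IRLS with Huber weights: points with residuals
# beyond `delta` get weight delta/|residual|, shrinking their influence.

def wls(xs, ys, ws):
    """Weighted least squares for y = a + b*x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    b = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) / \
        sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    return my - b * mx, b

def huber_fit(xs, ys, delta=1.0, iters=20):
    """Alternate weighted fits and Huber reweighting until stable."""
    ws = [1.0] * len(xs)
    for _ in range(iters):
        a, b = wls(xs, ys, ws)
        resid = [y - (a + b * x) for x, y in zip(xs, ys)]
        ws = [1.0 if abs(r) <= delta else delta / abs(r) for r in resid]
    return a, b

xs = list(range(10))
ys = [3 + 2 * x for x in xs]
ys[4] = 60.0                    # inject one gross outlier
a, b = huber_fit(xs, ys)
print(round(a, 2), round(b, 2))  # close to the true (3, 2) despite the outlier
```

An ordinary least-squares fit on the same data would be pulled well away from the true line, which is the kind of variance the robust pass reduces.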

We’ve done preliminary work on machine learning models such as neural networks and support vector machines, calibrating the same system we used for the massive modelling iterations and looking for reasonable tuning ranges for the hyperparameters. We’ll offer a similarly broad survey of these models, but they do not appear especially well suited to these problems: the preliminary results are well below what the more inference-oriented models have achieved (the observed metrics tend to be about 50% worse). As the computations involved are exceedingly heavy, with these sections we started a new organization in which the heavy modelling work is done in separate files. This part is not finished and is still unmerged with the main document, in a separate modelling folder.

We already have promising results from our Random Forest models, which seem to show very high efficiency on this problem. We’ve used them both for regression on the actual content of the studied molecules and for classification over those variables’ quartiles. Both procedures appear to work really well, as you can see in what is currently the final modelling section of our brief.
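
To make the two target encodings concrete: the raw content values are used directly for regression, and are binned into quartile classes for the classification variant. A minimal Python sketch of the quartile binning (the values below are made-up illustrations, not real measurements):

```python
# Assign each continuous measurement a quartile class Q1..Q4,
# turning the regression target into a classification target.

def quartile_labels(values):
    """Label each value by the sample quartile it falls into."""
    ranked = sorted(values)
    n = len(ranked)
    cuts = [ranked[(i * n) // 4] for i in (1, 2, 3)]  # 25/50/75% cut points
    return [f"Q{1 + sum(v >= c for c in cuts)}" for v in values]

polyphenols = [0.8, 2.1, 3.4, 1.2, 4.9, 2.7, 0.5, 3.9]
print(quartile_labels(polyphenols))
```

Quartile classes keep the folds balanced by construction, which can make the classification scores easier to interpret than raw regression error.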

If everything keeps working this way, this modelling task will probably be finished soon, and we will have powerful models both to predict the content of these important nutrients in food and to improve our existing tools using what we can learn from the SIWARE device’s broader spectra. The expertise gained in working with the dataset, together with the iterative tidying and organizing steps these investigations have required, will leave us well equipped to test further hypotheses and pursue other modelling options made possible by the large amount of varied data available.

We look forward to hearing about your findings and observations, and we wish you luck.

Just as more background, we’ve had more discussion here. Please don’t respond there (respond here!); I’m posting it just for reference.

Once we are done with traditional inference, it’s probably worth trying functional data methods. I was a little skeptical of their usefulness for this situation, but I’ve seen some interesting results in a similar investigation.

The most elementary application would be to project onto a functional basis (Fourier comes to mind, considering this is spectra after all), picking elements until a reasonable level of variance is explained, and then use the coefficients of this projection as parameters. This allows a high degree of smoothing and lets us work with actually continuous and differentiable functions, so in a way we are getting information for free. I’m eager to see how it impacts the models.
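
A minimal sketch of that idea, assuming a uniformly sampled spectrum: project onto a Fourier basis, adding harmonics until a target share of the variance is explained, and keep the coefficients as features (the test signal below is synthetic):

```python
# Project a sampled signal onto a Fourier basis, one harmonic at a time,
# stopping once the reconstruction explains `var_target` of the variance.
import math

def fourier_features(ys, var_target=0.95, max_harmonics=20):
    """Return (coefficients, harmonics_used) for the signal ys."""
    n = len(ys)
    mean = sum(ys) / n
    centred = [y - mean for y in ys]
    total = sum(c * c for c in centred)
    coeffs, recon = [mean], [0.0] * n
    for k in range(1, max_harmonics + 1):
        # Discrete Fourier coefficients for harmonic k
        a = 2 / n * sum(c * math.cos(2 * math.pi * k * i / n)
                        for i, c in enumerate(centred))
        b = 2 / n * sum(c * math.sin(2 * math.pi * k * i / n)
                        for i, c in enumerate(centred))
        coeffs += [a, b]
        recon = [r + a * math.cos(2 * math.pi * k * i / n)
                   + b * math.sin(2 * math.pi * k * i / n)
                 for i, r in enumerate(recon)]
        explained = 1 - sum((c - r) ** 2
                            for c, r in zip(centred, recon)) / total
        if explained >= var_target:
            return coeffs, k
    return coeffs, max_harmonics

# A signal built from two harmonics is fully captured after two harmonics:
ys = [3 + math.sin(2 * math.pi * i / 64)
        + 0.5 * math.cos(4 * math.pi * i / 64) for i in range(64)]
coeffs, k = fourier_features(ys)
print(k)
```

The truncated coefficient vector is the smoothed, differentiable representation of the curve, and it is what would be fed to the downstream models.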

We kept testing our models, and there are reasons to believe precision is below what was originally reported. I’ll soon post an updated version with this new, more precise testing and some new tooling we’ve added to flag atypical data and get tighter modelling.

Octavio - feel free to update the top-level post here - I made it wiki-editable. If adjustments are needed, just keep that post current.