Here is a glossary to help you with the main concepts and vocabulary used in the DISCOVER module.
A Model is a general machine learning model that can be trained with data. Models can be used to predict experimental data, suggest how to guide your discovery, or generate new biosequences, among many other things. Depending on their purpose, our platform considers three main types of models: Predictive, Evolutive and Generative.
In our platform, Predictive Models are machine learning models trained to predict a given measurement. Given an initial dataset in which each sample includes all the measurements and descriptors of interest, you can train a Predictive Model to predict a specific measurement for new samples. This tool also provides an estimate of the prediction error, calculated following machine learning best practices.
Evolutive Models are a tool for designing your next experiment. They are essentially Predictive Models embedded in an optimization framework. This tool suggests the best next designs (and/or environmental conditions) to test when you are looking to improve a given measurement.
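The idea of a predictive model embedded in an optimization loop can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_predictor` stands in for a trained Predictive Model, the option names and values are invented, and the exhaustive search is only the simplest possible optimizer, not necessarily the one DISCOVER uses.

```python
from itertools import product


def toy_predictor(design):
    """Stand-in for a trained Predictive Model (hypothetical scoring rule)."""
    promoter_effect = {"pTDH3": 1.0, "pACT1": 0.6}
    temperature_penalty = 0.01 * (design["temperature"] - 30) ** 2
    return promoter_effect[design["promoter"]] - temperature_penalty


def suggest_next_designs(predictor, options, top_k=2):
    """Score every combination of options and return the top_k designs."""
    names = list(options)
    candidates = [dict(zip(names, combo)) for combo in product(*options.values())]
    # Highest predicted measurement first.
    return sorted(candidates, key=predictor, reverse=True)[:top_k]


# Hypothetical design space: one categorical and one numerical descriptor.
options = {"promoter": ["pTDH3", "pACT1"], "temperature": [25, 30, 37]}
best = suggest_next_designs(toy_predictor, options)
```

In this toy setup, `best` holds the two candidate designs with the highest predicted measurement, which would be the ones to test next.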
Generative Models are a tool based on deep learning. The algorithm can be trained on a set of amino acid sequences and will then attempt to generate new samples that mimic the properties of the provided data.
To do their job, most machine learning models need to be trained first. Machine learning models are just mathematical models with a number of parameters that must be adjusted to obtain a desired outcome. The optimal parameter values for a given task are found by using a training dataset, which contains useful information about the job to be done, in combination with a specific training algorithm. The whole process of finding the parameters of the model is called training.
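As a rough illustration of what training means, the sketch below fits a two-parameter linear model to a small training dataset using gradient descent. The model, the data, and all function names are hypothetical and chosen for brevity. They are not part of DISCOVER, which may use very different models and training algorithms.

```python
def predict(params, x):
    """Mathematical model with two adjustable parameters."""
    slope, intercept = params
    return slope * x + intercept


def mean_squared_error(params, dataset):
    """Average squared difference between predictions and known values."""
    return sum((predict(params, x) - y) ** 2 for x, y in dataset) / len(dataset)


def train(dataset, learning_rate=0.05, steps=2000):
    """Training algorithm: gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(dataset)
    for _ in range(steps):
        errors = [predict((slope, intercept), x) - y for x, y in dataset]
        grad_slope = sum(2 * e * x for e, (x, _) in zip(errors, dataset)) / n
        grad_intercept = sum(2 * e for e in errors) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical training dataset generated from y = 2x + 1.
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
trained_params = train(training_data)
```

After training, `trained_params` is very close to `(2.0, 1.0)`, the parameter values that reproduce the training data.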
Just as gene variants represent the different possible forms of a gene (resulting from changes in its nucleotide sequence), in DISCOVER the word “variant” refers to each of the possible values of a categorical variable. For example, the variants of the variable “color” may be “yellow”, “white” and “brown”; the variants of a variable that represents a promoter in a specific position may be “plCL1”, “pTDH3” and “pACT1”, as they are the labels for specific promoter sequences.
In DISCOVER, the word “descriptors” is used as a synonym for “variables”, “features” or “columns”.
In Predictive Models, the “target” is a variable (a measurement or descriptor) that is usually unknown and that you want to predict for a set of samples. In Evolutive Models, the target is the measurement whose values you want to maximize.
Some datasets contain columns or variables whose values cannot be meaningfully ordered. This is the case for features that contain label strings, such as the variant names of an enzyme, or other measures that aren’t sortable, such as color names or qualitative observations. Within DISCOVER, the values of those types of columns are known as categorical values. You can also specify that a feature containing numbers is categorical, if the column has a limited number of distinct values and you consider that there isn’t an evident ordinal relationship between them.
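The sketch below shows one way to list the distinct values (variants) of a column, including a numeric column that could reasonably be declared categorical. The dataset, the column names, and the `variants` helper are hypothetical and only meant to make the definition concrete.

```python
# Hypothetical dataset: each dict is one sample; column names are made up.
samples = [
    {"promoter": "pTDH3", "plate": 1, "yield": 0.82},
    {"promoter": "pACT1", "plate": 2, "yield": 0.64},
    {"promoter": "pTDH3", "plate": 1, "yield": 0.91},
    {"promoter": "pACT1", "plate": 3, "yield": 0.58},
]


def variants(samples, column):
    """Return the distinct values of a column, in first-seen order."""
    return list(dict.fromkeys(sample[column] for sample in samples))


# Label strings: a categorical column.
promoter_variants = variants(samples, "promoter")

# Numbers used as labels: few distinct values and no meaningful order,
# so this column could be declared categorical as well.
plate_variants = variants(samples, "plate")
```

Here the plate identifiers are numbers, but they behave like labels, so treating them as categorical avoids implying that plate 3 is somehow "more" than plate 1.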
Numerical values are the type of value associated with quantitative measurements. Numerical variables can contain integers or floating-point numbers. Note that columns of this type in CSV files should contain only numbers (including numbers with decimal or thousands separator characters).
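Before uploading a CSV file, it can help to check that a column intended as numerical really contains only numbers. The sketch below is one possible check. The helper names are hypothetical, and the separator characters are assumptions: the text above does not state which decimal and thousands separators DISCOVER accepts.

```python
def parse_number(text, decimal=".", thousands=","):
    """Convert a CSV cell to float, allowing the given separator characters."""
    cleaned = text.strip().replace(thousands, "")
    if decimal != ".":
        cleaned = cleaned.replace(decimal, ".")
    return float(cleaned)  # raises ValueError if the cell is not a number


def is_numerical_column(values, decimal=".", thousands=","):
    """Return True only if every cell in the column parses as a number."""
    for value in values:
        try:
            parse_number(value, decimal, thousands)
        except ValueError:
            return False
    return True


# Hypothetical column contents read from a CSV file as strings.
clean_column = ["1", "2.5", "3,000"]
mixed_column = ["1", "high", "2.5"]
clean_is_numerical = is_numerical_column(clean_column)
mixed_is_numerical = is_numerical_column(mixed_column)
```

A column such as `mixed_column`, which mixes numbers with a text label, fails the check and would need to be cleaned or treated as categorical. Note that Python's `float` also accepts strings such as `1e5` or `nan`, so a stricter check may be appropriate in practice.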