Design space complexity is the number of all possible combinations of descriptor* values found in the dataset used to train a model. In other words, it is the total number of designs that can be built by combining the descriptor values that appear in the samples of your dataset.
This value doesn't depend on the dataset's cardinality (the number of samples), but on the diversity of values that each feature takes across the samples.
In the Discover toolkit, after a predictive or evolutive machine learning model is trained, the model's results page displays the Design Space Complexity in the summary card.
This is very helpful when comparing different datasets, especially in optimization problems (evolutive models), where the number of combinations is the total number of candidates that an exhaustive search optimization algorithm would need to explore to find the optima.
In the case of categorical descriptors, the design space complexity is the product of the number of levels (categories) of each descriptor. For example, if a design has 2 descriptors, a promoter and a gene, with 2 possible promoters and 6 gene variants, the design space complexity is 2 × 6 = 12.
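The promoter and gene example can be reproduced with a minimal Python sketch. The function name `design_space_complexity` and the level names (`P1`, `G1`, ...) are hypothetical and used only for illustration; this is not the Discover toolkit's own code.

```python
from math import prod


def design_space_complexity(samples_by_descriptor):
    """Multiply the number of distinct levels observed for each descriptor.

    `samples_by_descriptor` maps a descriptor name to the values seen in the
    dataset. Repeated values count once, so the result depends on the
    diversity of values, not on the number of samples.
    """
    return prod(len(set(values)) for values in samples_by_descriptor.values())


# Hypothetical dataset: 2 promoters and 6 gene variants, with repeated values.
design = {
    "promoter": ["P1", "P2", "P1", "P2", "P1", "P2", "P1"],
    "gene": ["G1", "G2", "G3", "G4", "G5", "G6", "G1"],
}

print(design_space_complexity(design))  # 12
```

Adding more samples that reuse the same promoters and genes leaves the result unchanged, which is why the value reflects value diversity rather than dataset size.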
When numerical descriptors are included in the design, the number of combinations is calculated by discretizing the numerical range of these descriptors into a finite number of “categories”.
FAQ:
How are new samples generated?
In the case of categorical descriptors, the Discover toolkit simply builds candidates from the combinations of the categories of the different descriptors. When numeric descriptors are included, the Discover toolkit first discretizes each numeric descriptor into a finite number of evenly spaced levels within the numeric range of that feature in the training dataset. These levels are then combined in the same way as categorical features.
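The procedure described in this answer can be sketched in Python as follows. This is an illustration under stated assumptions, not the toolkit's implementation: the function names, the descriptor names, and the number of levels (`n_levels=3`) are all hypothetical, and the article does not say how many levels the Discover toolkit actually uses.

```python
from itertools import product


def evenly_spaced_levels(values, n_levels):
    """Return n_levels evenly spaced values spanning the observed range."""
    lo, hi = min(values), max(values)
    if n_levels < 2 or lo == hi:
        return [lo]
    step = (hi - lo) / (n_levels - 1)
    return [lo + i * step for i in range(n_levels)]


def build_candidates(categorical, numeric, n_levels):
    """Combine categorical levels with discretized numeric levels.

    n_levels is an assumed parameter chosen for this sketch.
    """
    names = list(categorical) + list(numeric)
    level_lists = [sorted(set(values)) for values in categorical.values()]
    level_lists += [evenly_spaced_levels(values, n_levels) for values in numeric.values()]
    # The Cartesian product yields one candidate per combination of levels.
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


candidates = build_candidates(
    categorical={"promoter": ["P1", "P2", "P1"]},
    numeric={"temperature": [30.0, 37.0, 33.5]},
    n_levels=3,
)

print(len(candidates))  # 6
```

Here the 2 promoter categories are combined with 3 temperature levels (30.0, 33.5 and 37.0), giving 2 × 3 = 6 candidates, which is also the design space complexity of this toy dataset.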
*“Descriptors” is used as a synonym for “variables”, “features” or “columns”.