Generate random data with integers. The approach is as follows:
- There is one collection for each cardinality. All collections contain the same fields.
- Each field contains data generated from a particular distribution. The data can be anything: a single type, mixed types, a single mathematical distribution (e.g. normal), or a mixture of distributions.
- The committed configuration file and the corresponding data file are reduced to only two small collections, so that Evergreen tests run fast and the git repository stays small. For actual experiments, add more data sizes and regenerate the data locally.
- All data is saved in a single JavaScript file, jstests/query_golden/libs/data/ce_accuracy_test.data, with a corresponding schema file, jstests/query_golden/libs/data/ce_accuracy_test.schema.
- The data file can be loaded directly inside a JS test. Loading it creates a global variable dataSet. The data is stored this way because it is the only way to load external JSON data without installing external tools in Evergreen.
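The approach described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual generator: the field names (uniform_int, normal_int, mixed), the collection naming scheme (ce_data_N), the shape of each entry, and the helper functions are all hypothetical. Only the global variable name dataSet comes from the description above.

```python
import json
import random


def generate_collection(cardinality, seed=0):
    """Return `cardinality` documents that all share the same fields."""
    rng = random.Random(seed)
    docs = []
    for _ in range(cardinality):
        docs.append({
            # Uniformly distributed integers (hypothetical field).
            "uniform_int": rng.randint(0, 1000),
            # Integers drawn from a normal distribution, then rounded.
            "normal_int": round(rng.gauss(500, 100)),
            # Mixed types: either an integer or a string.
            "mixed": rng.choice([rng.randint(0, 100),
                                 "str_%d" % rng.randint(0, 100)]),
        })
    return docs


def generate_data_set(cardinalities, seed=0):
    """Build one collection per cardinality; all have the same fields."""
    return [
        {
            # Hypothetical naming scheme, not taken from the real tool.
            "collectionName": "ce_data_%d" % n,
            "documents": generate_collection(n, seed + i),
        }
        for i, n in enumerate(cardinalities)
    ]


def to_js_source(data_set):
    """Serialize the data as JavaScript that defines a global `dataSet`."""
    return "dataSet = " + json.dumps(data_set) + ";\n"


if __name__ == "__main__":
    # Two small collections, mirroring the reduced committed configuration.
    print(to_js_source(generate_data_set([10, 50]))[:80])
```

Writing the output of to_js_source to a file gives a script that a JS test can load without any JSON tooling, which is the motivation given in the last bullet.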
Cost Model Calibrator
Python virtual environment
The following assumes you are using Python from the MongoDB toolchain:
/opt/mongodbtoolchain/v4/bin/python3
Getting started
(mongo-python3) deactivate # only if you have another python env activated
sh> /opt/mongodbtoolchain/v4/bin/python3 -m venv cm # create new env
sh> source cm/bin/activate # activate new env
(cm) python -m pip install -r requirements.txt # install required packages
(cm) python start.py # run the calibrator
(cm) deactivate # back to bash
sh>
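Because the steps above depend on the right environment being active, a quick check like the following can confirm that the interpreter is running inside a virtual environment before packages are installed. It is a minimal sketch using only the standard library; the function name in_virtualenv is hypothetical and is not part of the calibrator.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print("virtual environment active:", sys.prefix)
    else:
        print("no virtual environment active; run 'source cm/bin/activate'")
```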
Install new packages
(cm) python -m pip install <package_name> # install <package_name>
(cm) python -m pip freeze > requirements.txt # do not forget to update requirements.txt