diff --git a/docs/assets/images/api-core-ug-hp.png b/docs/assets/images/api-core-ug-hp.png new file mode 100644 index 00000000000..6c04780a96e Binary files /dev/null and b/docs/assets/images/api-core-ug-hp.png differ diff --git a/docs/assets/images/api-core-ug-metrics.png b/docs/assets/images/api-core-ug-metrics.png new file mode 100644 index 00000000000..f059e99a97b Binary files /dev/null and b/docs/assets/images/api-core-ug-metrics.png differ diff --git a/docs/assets/images/qs04.png b/docs/assets/images/qs04.png index 1ec6148c5e8..a50e3d3339a 100644 Binary files a/docs/assets/images/qs04.png and b/docs/assets/images/qs04.png differ diff --git a/docs/assets/images/qswebui-mnist-pytorch-search.png b/docs/assets/images/qswebui-mnist-pytorch-search.png new file mode 100644 index 00000000000..9adc36d24ec Binary files /dev/null and b/docs/assets/images/qswebui-mnist-pytorch-search.png differ diff --git a/docs/assets/images/qswebui-multi-trial-search.png b/docs/assets/images/qswebui-multi-trial-search.png new file mode 100644 index 00000000000..7835e00b88d Binary files /dev/null and b/docs/assets/images/qswebui-multi-trial-search.png differ diff --git a/docs/assets/images/qswebui-recent-local.png b/docs/assets/images/qswebui-recent-local.png index 6ac205df510..e514eaf95dd 100644 Binary files a/docs/assets/images/qswebui-recent-local.png and b/docs/assets/images/qswebui-recent-local.png differ diff --git a/docs/get-started/webui-qs-multi.rst b/docs/get-started/webui-qs-multi.rst new file mode 100644 index 00000000000..352fea239df --- /dev/null +++ b/docs/get-started/webui-qs-multi.rst @@ -0,0 +1,233 @@ +.. _qs-webui-multi: + +############################# + Run a Hyperparameter Search +############################# + +.. meta:: + :description: Learn how to run your first multi-trial experiment, or search, in Determined. + :keywords: PyTorch API,MNIST,model developer,quickstart,search + +Follow these steps to see how to run your first search in Determined. 
+ +A multi-trial search (or hyperparameter search) allows you to optimize your model by exploring +different configurations of hyperparameters automatically. This is more efficient than manually +tuning each parameter. In this guide, we'll show you how to modify the existing ``const.yaml`` +configuration file used in the single-trial experiment to run a multi-trial search. + +**Now that we have established a baseline performance by creating our single-trial experiment, we +can create a search (multi-trial experiment) and compare the outcome with our baseline. We hope to +see improvements gained through hyperparameter tuning and optimization.** + +*************** + Prerequisites +*************** + +You must have a running Determined cluster with the CLI installed. + +- To set up a local cluster, visit :ref:`basic`. +- To set up a remote cluster, visit the :ref:`Installation Guide ` where you'll + find options for On Prem, AWS, GCP, Kubernetes, and Slurm. + +.. note:: + + Visit :ref:`qs-webui` to learn how to run your first single-trial experiment in Determined. + +********************************* + Prepare Your Configuration File +********************************* + +In our single-trial experiment, our ``const.yaml`` file looks something like this: + +.. code:: yaml + + name: mnist_pytorch_const + hyperparameters: + learning_rate: 1.0 + n_filters1: 32 + n_filters2: 64 + dropout1: 0.25 + dropout2: 0.5 + searcher: + name: single + metric: validation_loss + max_length: + batches: 1000 # approximately 1 epoch + smaller_is_better: true + entrypoint: python3 train.py + +To convert this into a multi-trial search, we will need to modify the hyperparameters section and +the searcher configuration. We'll tell Determined to use Random Search, which randomly selects +values from the specified ranges, and set ``max_trials`` to 20. + +Copy the following code and save the file as ``search.yaml`` in the same directory as your +``const.yaml`` file: + +.. 
code:: yaml + + name: mnist_pytorch_search + hyperparameters: + learning_rate: + type: log + base: 10 + minval: 1e-4 + maxval: 1.0 + n_filters1: + type: int + minval: 16 + maxval: 64 + n_filters2: + type: int + minval: 32 + maxval: 128 + dropout1: + type: double + minval: 0.2 + maxval: 0.5 + dropout2: + type: double + minval: 0.3 + maxval: 0.6 + + searcher: + name: random + metric: validation_loss + max_trials: 20 + max_length: + batches: 1000 + smaller_is_better: true + + entrypoint: python3 train.py + +******************* + Create the Search +******************* + +Once you've created the new configuration file, you can create and run the search using the +following command: + +.. code:: bash + + det experiment create search.yaml . + +This will start the search, and Determined will run multiple trials, each with a different +combination of hyperparameters from the defined ranges. + +******************** + Monitor the Search +******************** + +In the WebUI, navigate to the **Searches** tab to monitor the progress of your search. You’ll be +able to see the different trials running, their status, and their performance metrics. Determined +also offers built-in visualizations to help you understand the results. + + .. image:: /assets/images/qswebui-multi-trial-search.png + :alt: Determined AI WebUI Dashboard showing a user's recent multi-trial search + +********************* + Analyze the Results +********************* + +After the search is complete, you can review the best-performing trials and the hyperparameter +configurations that led to them. This will help you identify the optimal settings for your model. + +Select **mnist_pytorch_search** to view all runs including single-trial experiments. Then choose +which runs you want to compare. + + .. 
image:: /assets/images/qswebui-mnist-pytorch-search.png + :alt: Determined AI WebUI Dashboard with mnist pytorch search selected and ready to compare + +************ + Go Further +************ + +Once you've mastered the basics, you can take your experiments to the next level by exploring more +advanced configurations. In this section, we'll cover how to run two additional configurations: +``dist_random.yaml`` and ``adaptive.yaml``. These examples introduce new concepts such as +distributed training and adaptive hyperparameter search methods. + +Running ``dist_random.yaml`` +============================ + +To run the distributed random search experiment, use the following command: + +.. code:: bash + + det experiment create dist_random.yaml . + +Running ``adaptive.yaml`` +========================= + +To run the adaptive search experiment, use the following command: + +.. code:: bash + + det experiment create adaptive.yaml . + +These advanced configurations allow you to scale your experiments and optimize your model +performance more efficiently. As you become more comfortable with these concepts, you’ll be able to +leverage the full power of Determined for more complex machine learning workflows. + +************** + Key Concepts +************** + +This section provides an overview of the key concepts you’ll need to understand when working with +Determined, particularly when running single-trial and multi-trial experiments. + +Single-Trial Experiment (Run) +============================= + +- **Definition:** A single-trial experiment (or run) allows you to establish a baseline performance + for your model. + +- **Purpose:** Running a single trial is useful for understanding how your model performs with a + fixed set of hyperparameters. It serves as a benchmark against which you can compare results from + more complex searches. 
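For intuition, the way the ``random`` searcher treats the ranges defined earlier in ``search.yaml`` can be sketched in plain Python. This is a hypothetical re-implementation for illustration only: the actual sampling happens inside the Determined master, and nothing below is part of Determined's API.

```python
import math
import random

def sample_trial_hparams(rng: random.Random) -> dict:
    """Draw one trial's hyperparameters, mirroring the ranges in search.yaml.

    Illustrative sketch only; Determined performs this sampling server-side.
    """
    return {
        # type: log, base 10, minval 1e-4, maxval 1.0 -> uniform in exponent space
        "learning_rate": 10 ** rng.uniform(math.log10(1e-4), math.log10(1.0)),
        # type: int ranges are sampled uniformly over integers
        "n_filters1": rng.randint(16, 64),
        "n_filters2": rng.randint(32, 128),
        # type: double ranges are sampled uniformly over floats
        "dropout1": rng.uniform(0.2, 0.5),
        "dropout2": rng.uniform(0.3, 0.6),
    }

# max_trials: 20 -> twenty independent draws, one per trial
trials = [sample_trial_hparams(random.Random(seed)) for seed in range(20)]
```

Note how the ``log`` type spreads ``learning_rate`` evenly across orders of magnitude, which is why it is preferred over a plain ``double`` range when the useful values span 1e-4 to 1.0.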
+ +Multi-Trial Experiment (Search) +=============================== + +- **Definition:** A multi-trial experiment (or search) allows you to optimize your model by + exploring different configurations of hyperparameters automatically. +- **Purpose:** A search systematically tests various hyperparameter combinations to find the + best-performing configuration. This is more efficient than manually tuning each parameter. + +Searcher +======== + +- **Random Search:** Randomly samples hyperparameters from the specified ranges for each trial. It + is straightforward and provides a simple way to explore a large search space. + +- **Adaptive ASHA:** Uses an adaptive algorithm to allocate resources dynamically to the most + promising trials. It starts many trials but continues only those that show early success, + optimizing resource usage. + +Resource Allocation +=================== + +- **Distributed Training:** Involves training your model across multiple GPUs (or CPUs) to speed up + the process. This is particularly useful for large models or large datasets. +- **Slots Per Trial:** Specifies the number of GPUs (or CPUs) each trial will use. For example, + setting ``slots_per_trial: 1`` means each trial will use one GPU or CPU. + +Metrics +======= + +- **Validation Loss:** A common metric used to evaluate the performance of a model during training. + Lower validation loss usually indicates a better model. + +- **Accuracy:** Measures how often the model correctly predicts the target variable. It is + typically used for classification tasks where you want to maximize the number of correct + predictions. + +Baseline Performance +==================== + +- **Establishing a Baseline:** Before running a search, it's important to establish a baseline + performance using a single-trial experiment. This gives you a reference point to compare the + results of your multi-trial searches. 
+ +- **Comparison in Run Tab:** Once you have established a baseline performance, you can create a + search and compare all outcomes in the Run tab. This helps you determine the effectiveness of + different hyperparameter configurations. diff --git a/docs/get-started/webui-qs.rst b/docs/get-started/webui-qs.rst index 8e7f9ee3b5c..49077eb12bf 100644 --- a/docs/get-started/webui-qs.rst +++ b/docs/get-started/webui-qs.rst @@ -6,7 +6,7 @@ .. meta:: :description: Learn how to run your first experiment in Determined. - :keywords: PyTorch API,MNIST,model developer,quickstart + :keywords: PyTorch API,MNIST,model developer,quickstart,search,run Follow these steps to see how to run your first experiment. @@ -20,24 +20,47 @@ You must have a running Determined cluster with the CLI installed. - To set up a remote cluster, visit the :ref:`Installation Guide ` where you'll find options for On Prem, AWS, GCP, Kubernetes, and Slurm. -******************* - Run an Experiment -******************* +********** + Concepts +********** + +- Single-Trial Run: A single-trial experiment (or run) allows you to establish a baseline + performance for your model. Running a single trial is useful for understanding how your model + performs with a fixed set of hyperparameters. It serves as a benchmark against which you can + compare results from more complex searches. + +- Multi-Trial Search: A multi-trial experiment (or search) allows you to optimize your model by + exploring different configurations of hyperparameters automatically. A search systematically + tests various hyperparameter combinations to find the best-performing configuration. This is more + efficient than manually tuning each parameter. + +- Remote Distributed Training: Remote distributed training jobs enable you to train your model + across multiple GPUs or nodes in a cluster, significantly reducing the time required for training + large models or datasets. 
This approach allows for efficient scaling and management of resources, + particularly in more demanding machine learning tasks. + +********************************* + Execute and Compare Experiments +********************************* + +In this section, we'll first execute a single-trial run before running a search. This will establish +the baseline performance of our model and will give us a reference point to compare the results of +our multi-trial search. Finally, we'll run a remote distributed training job. .. tabs:: .. tab:: - locally + single-trial run - Train a single model for a fixed number of batches, using constant values for all - hyperparameters on a single *slot*. A slot is a CPU or CPU computing device, which the - Determined master schedules to run. + Follow these steps to train a single model for a fixed number of batches, using constant + values for all hyperparameters on a single *slot*. A slot is a CPU or GPU computing device, + which the Determined master schedules to run. .. note:: - To run an experiment in a local training environment, your Determined cluster requires only - a single CPU or GPU. A cluster is made up of a master and one or more agents. A single + To execute an experiment in a local training environment, your Determined cluster requires + only a single CPU or GPU. A cluster is made up of a master and one or more agents. A single machine can serve as both a master and an agent. **Create the Experiment** @@ -61,9 +84,9 @@ You must have a running Determined cluster with the CLI installed. *context directory* for your model. Determined copies the model context directory contents to the trial container working directory. - **View the Experiment** + **View the Run** - #. To view the experiment in your browser: + #. To view the run in your browser: - Enter the following URL: **http://localhost:8080/**. This is the cluster address for your local training environment. 
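The comparison this section builds toward, checking whether any search trial beats the single-trial baseline on ``validation_loss`` when ``smaller_is_better: true``, can be sketched as follows. The run names and loss values below are made up for illustration and are not produced by any Determined API.

```python
# A minimal sketch of the baseline-vs-search comparison described above,
# using hypothetical validation_loss values. Determined records these
# metrics for you; this only illustrates the selection rule.
def best_run(losses: dict) -> str:
    # smaller_is_better: true -> the lowest validation_loss wins
    return min(losses, key=losses.get)

losses = {
    "baseline (const.yaml)": 0.112,   # hypothetical single-trial result
    "search trial A": 0.087,          # hypothetical search-trial results
    "search trial B": 0.131,
}
print(best_run(losses))  # -> search trial A
```

If the winning entry comes from the search, the hyperparameter tuning improved on the baseline; if the baseline wins, the search ranges or budget may need adjusting.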
@@ -72,13 +95,13 @@ You must have a running Determined cluster with the CLI installed. #. Navigate to the home page and then visit your **Uncategorized** experiments. + - Determined displays all runs in a flat view for ease of comparison. + .. image:: /assets/images/qswebui-recent-local.png :alt: Determined AI WebUI Dashboard showing a user's recent experiment submissions - #. Select the experiment to display the experiment’s details such as Metrics. - - .. image:: /assets/images/qswebui-metrics-local.png - :alt: Determined AI WebUI Dashboard showing details for a local experiment + #. Selecting the experiment displays more details such as metrics and checkpoints. With this + baseline, we can now execute a multi-trial experiment, or "search". **Create a Strong Password** @@ -91,7 +114,92 @@ You must have a running Determined cluster with the CLI installed. .. tab:: - remotely + multi-trial search + + Once you have established a baseline performance by creating your single-trial experiment (or + "run"), you can create a multi-trial experiment (or "search") and compare the outcome with the + baseline. + + To do this, first create a ``search.yaml`` configuration file for executing the multi-trial + search. + + #. Prepare the configuration file. + + - Copy the following code and save the file as ``search.yaml`` in the same directory as + your ``const.yaml`` file: + + .. code:: yaml + + name: mnist_pytorch_search + hyperparameters: + learning_rate: + type: log + base: 10 + minval: 1e-4 + maxval: 1.0 + n_filters1: + type: int + minval: 16 + maxval: 64 + n_filters2: + type: int + minval: 32 + maxval: 128 + dropout1: + type: double + minval: 0.2 + maxval: 0.5 + dropout2: + type: double + minval: 0.3 + maxval: 0.6 + + searcher: + name: random + metric: validation_loss + max_trials: 20 + max_length: + batches: 1000 + smaller_is_better: true + + entrypoint: python3 train.py + + #. 
Create the Search + + Once you've created the new configuration file, you can create and run the search using the + following command: + + .. code:: bash + + det experiment create search.yaml . + + This will start the search, and Determined will run multiple trials, each with a different + combination of hyperparameters from the defined ranges. + + #. Monitor the Search + + In the WebUI, navigate to the **Searches** tab to monitor the progress of your search. + You’ll be able to see the different trials running, their status, and their performance + metrics. Determined also offers built-in visualizations to help you understand the results. + + .. image:: /assets/images/qswebui-multi-trial-search.png + :alt: Determined AI WebUI Dashboard showing a user's recent multi-trial search + + #. Analyze the Results + + After the search is complete, you can review the best-performing trials and the + hyperparameter configurations that led to them. This will help you identify the optimal + settings for your model. + + Selecting **mnist_pytorch_search** takes you to the "runs" view where you can choose which + runs you want to compare. + + .. image:: /assets/images/qswebui-mnist-pytorch-search.png + :alt: Determined AI WebUI Dashboard with mnist pytorch search selected and ready to compare + + .. tab:: + + remote distributed training job Run a remote distributed training job. @@ -144,16 +252,10 @@ You must have a running Determined cluster with the CLI installed. #. In your browser, navigate to the home page and then visit **Your Recent Submissions**. - .. image:: /assets/images/qswebui-recent-remote.png - :alt: Determined AI WebUI Dashboard showing a user's recent experiment submissions - #. Select the experiment to display the experiment’s details such as Metrics. Notice the loss curve is similar to the locally-run, single-GPU experiment but the time to complete the trial is reduced by about half. - .. 
image:: /assets/images/qswebui-metrics-remote.png - :alt: Determined AI WebUI Dashboard showing details for a remote distributed experiment - ************ Learn More ************ diff --git a/docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst b/docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst index ad745985a56..ca5ad851f98 100644 --- a/docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst +++ b/docs/model-dev-guide/api-guides/apis-howto/api-core-ug.rst @@ -181,11 +181,16 @@ Run the following command to run the experiment: det e create metrics.yaml . -Open the Determined WebUI again and go to the **Overview** tab. +Open the Determined WebUI again, select your experiment, and go to the experiment's **Overview** +tab. The WebUI now displays metrics. -The WebUI now displays metrics. In this step, you learned how to add a few new lines of code in -order to report training and validation metrics to the Determined master. Next, we’ll modify our -script to report checkpoints. +.. image:: /assets/images/api-core-ug-metrics.png + :alt: Determined AI WebUI Dashboard showing experiment metrics + +| + +In this step, you learned how to add a few new lines of code in order to report training and +validation metrics to the Determined master. Next, we’ll modify our script to report checkpoints. *********************** Step 3: Checkpointing *********************** @@ -290,7 +295,7 @@ Run the following command to run the experiment: det e create checkpoints.yaml . -f -In the Determined WebUI, nagivate to the **Checkpoints** tab. +In the Determined WebUI, navigate to the experiment's **Checkpoints** tab. Checkpoints are saved and deleted according to the default :ref:`experiment-config-checkpoint-policy`. You can modify the checkpoint policy in the experiment @@ -377,11 +382,16 @@ Run the following command to run the experiment: det e create adaptive.yaml . -In the Determined WebUI, navigate to the **Hyperparameters** tab. 
+In the Determined WebUI, navigate to the experiment's **Hyperparameters** tab. You should see a graph in the WebUI that displays the various trials initiated by the Adaptive ASHA hyperparameter search algorithm. +.. image:: /assets/images/api-core-ug-hp.png + :alt: Determined AI WebUI Dashboard showing experiment hyperparameters + +| + ****************************** Step 5: Distributed Training ****************************** @@ -503,7 +513,7 @@ Run the following command to run the experiment: det e create distributed.yaml . -In the Determined WebUI, go to the **Cluster** pane. +In the Determined WebUI, go to the **Cluster** pane in the left navigation. You should be able to see multiple slots active corresponding to the value you set for ``slots_per_trial`` you set in ``distributed.yaml``, as well as logs appearing from multiple ranks. diff --git a/docs/tutorials/quickstart-mdldev.rst b/docs/tutorials/quickstart-mdldev.rst index 6d102df66cd..0bdaaba2da3 100644 --- a/docs/tutorials/quickstart-mdldev.rst +++ b/docs/tutorials/quickstart-mdldev.rst @@ -210,11 +210,11 @@ schedules to run. :align: center :alt: Dashboard - The figure shows two experiments. Experiment **11** has **COMPLETED** and experiment **12** is + The figure shows all runs. Experiment **55989** has **COMPLETED** and experiment **55990** is still **ACTIVE**. Your experiment number and status can differ depending on how many times you run the examples. -#. While an experiment is in the ACTIVE, training state, click the experiment name to see the +#. While an experiment is in the ACTIVE, training state, click the experiment ID to see the **Metrics** graph update for your currently defined metrics: .. image:: /assets/images/qs04.png @@ -226,11 +226,6 @@ schedules to run. #. After the experiment completes, click the experiment name to view the trial page: - .. 
image:: /assets/images/qs03.png - :width: 704px - :align: center - :alt: Trial page - Now that you have a fundamental understanding of Determined, follow the next example to learn how to scale to distributed training. @@ -277,11 +272,12 @@ the number of GPUs per machine. You can change the value to match your hardware det -m http://:8080 experiment create distributed.yaml . #. To view the WebUI dashboard, enter the cluster address in your browser address bar, accept - ``determined`` as the default username, and click **Sign In**. A password is not required. + ``determined`` as the default username, and click **Sign In**. You'll need to set a :ref:`strong + password `. -#. Click the **Experiment** name to view the experiment’s trial display. The loss curve is similar - to the single-GPU experiment in the previous exercise but the time to complete the trial is - reduced by about half. +#. Click the experiment's ID to view the experiment’s trial display. The loss curve is similar to + the single-GPU experiment in the previous exercise but the time to complete the trial is reduced + by about half. ********************************* Run a Hyperparameter Tuning Job