Handbook for PhD Students
This PhD Handbook serves a dual purpose: it describes the research methodology of our group and gives general advice to students, and it sets out the standards that all students in the group are expected to strive for and the processes they are expected to follow.
Systematic evaluation is a central element of a research methodology. Much of the research in our group is empirical, which means that claims and hypotheses are tested in practice.
An important question to consider is what evaluation tasks to use. The choice of evaluation tasks should be guided by your open problem: completing the tasks successfully should require a solution to that problem. I call such tasks challenge tasks because they allow you to say "this task cannot be solved (well) with current methods due to the open problem". Using a non-contrived, ambitious challenge task gives your research a stronger justification and can force you to think outside the box to find novel ideas.
When designing experiments, try to have specific questions in mind that the experiments should answer. For example:
- How does your method compare to other methods? (e.g. rewards, accuracy, sample complexity, compute time)
- How does your method scale in certain dimensions? (e.g. number of agents, actions, states)
- What is the contribution/effect of different components in your method? (ablation study)
- How does your method perform if certain assumptions are violated?
If you observed a specific result and have a hypothesis to explain the result, you may want to design additional experiments to specifically test your hypothesis.
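One way to keep such questions explicit is to write the experiment plan down as data before running anything. The sketch below is a minimal illustration rather than a prescribed workflow: the method names, the choice of number of agents as the scaling dimension, and the number of seeds are all placeholder assumptions.

```python
from itertools import product

# Hypothetical factors; replace them with the dimensions relevant to your open problem.
METHODS = ["ours", "baseline_a"]   # "How does your method compare to other methods?"
NUM_AGENTS = [2, 4, 8]             # "How does your method scale in certain dimensions?"
SEEDS = range(5)                   # repeated runs, so that variance can be reported


def build_runs(methods, num_agents, seeds):
    """Enumerate one run configuration per (method, scale, seed) combination."""
    return [
        {"method": m, "num_agents": n, "seed": s}
        for m, n, s in product(methods, num_agents, seeds)
    ]


runs = build_runs(METHODS, NUM_AGENTS, SEEDS)
print(f"{len(runs)} runs planned")
```

Listing the runs up front makes it easy to check that every question has a corresponding set of runs, and that each comparison uses the same seeds and scales.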
The use of baselines is crucial to set the bar for what constitutes good and bad performance. There are generally three kinds of baselines:
- Best-case/worst-case: Best/worst-case (or upper/lower) baselines are useful to show the margin of possible improvement. Best/worst can mean different things depending on your work. For example, a best-case may be an exact method with "unlimited" compute time, or with access to privileged information that would normally not be available. A worst-case may be a method that is limited or basic in some specific sense.
- Alterations/ablations: If your method consists of several interacting components, it is important to show the effect of each component. For example, these may be elements of your network architecture and input, loss functions, or other sub-functions in your method. Ablation studies take your full method and remove or alter certain elements to show the performance under these ablated/altered versions.
- State-of-the-art: If suitable, it can be a good idea to include some recent state-of-the-art algorithms for comparison. Ideally, each algorithm should use hyper-parameters that were optimised for that algorithm specifically (rather than using the same hyper-parameters for all algorithms), to maximise performance of each algorithm.
Importantly, try to think of sensible baselines early in your research rather than as an afterthought.
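As an illustration of the alterations/ablations baseline, the following sketch generates "leave-one-out" variants of a method from a dictionary of on/off components. The component names are hypothetical, and disabling one component at a time is only one possible ablation scheme; altering a component (rather than removing it) would need a richer configuration.

```python
# Hypothetical components of a method; True means the component is enabled.
FULL_METHOD = {
    "attention": True,
    "auxiliary_loss": True,
    "reward_shaping": True,
}


def leave_one_out_ablations(full_config):
    """Return the full config plus one variant per component with that component disabled."""
    variants = {"full": dict(full_config)}
    for name in full_config:
        ablated = dict(full_config)   # copy, so the full config is left unchanged
        ablated[name] = False
        variants[f"no_{name}"] = ablated
    return variants


variants = leave_one_out_ablations(FULL_METHOD)
for label, config in variants.items():
    print(label, config)
```

Each variant would then be run through the same experiments as the full method, so that any difference in performance can be attributed to the disabled component.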
Once you have results from your experiments, you should use the right tools to understand the data. Use good visualisation, including second-order statistics (e.g. standard deviation/error). Statistical hypothesis tests should be used to establish whether observed differences are statistically significant or could plausibly be explained by random variation between runs. It is crucial to understand the meaning of statistical significance and how to apply such tests correctly. This guide on experimental design in RL research (and this guide on hypothesis testing) is a useful starting point.
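As a rough sketch of what this can look like in practice, the code below computes a mean and standard error per method and runs a two-sided permutation test on the difference of means, using only the Python standard library. The permutation test is my choice for illustration and is not the only appropriate test; the per-seed returns are made-up placeholder numbers, not real results.

```python
import math
import random
import statistics


def mean_and_standard_error(samples):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


def permutation_test(a, b, num_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference of means by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (num_permutations + 1)


# Placeholder data: final returns from five independent seeds per method (illustrative only).
ours = [10.0, 12.0, 11.0, 13.0, 14.0]
baseline = [9.0, 8.0, 10.0, 9.0, 9.0]

for name, samples in [("ours", ours), ("baseline", baseline)]:
    mean, sem = mean_and_standard_error(samples)
    print(f"{name}: mean={mean:.2f}, standard error={sem:.2f}")

p_value = permutation_test(ours, baseline)
print(f"permutation test p-value: {p_value:.4f}")
```

With so few seeds, a small p-value should still be read with care: it says that a difference this large would be unlikely if the method labels were interchangeable, not that the difference is large or practically important.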