For about three years, our research group at Politecnico di Milano has been carrying out an intensive series of experiments aimed at reproducing the results reported in several papers describing recommender algorithms based on deep learning. The main goal of this research is to assess whether the baselines chosen for comparison in the original papers are strong enough to confirm the claimed progress.
In addition to the main research findings (most of the baselines are weak and the reported progress is phantom), our work has highlighted some partially unexpected side outcomes. More specifically, we have discovered several “bad practices” in the evaluation procedures of almost all the papers we analyzed. Some of these issues are already known in the IR community (lack of reproducibility); others were unexpected (errors and questionable choices in the evaluation procedure) and proved worryingly common in our study.
The focus of this talk is neither on the progress of deep learning recommender algorithms nor on reproducibility issues (both topics have been widely discussed in other venues), although I will briefly touch on both. Rather, the presentation will focus on describing the bad practices we have detected during these years of experiments, along with an analysis of their possible causes and possible remedies.