################################################################
## Reviewer 1
################################################################
################################################################

1. Paper Summary

This paper proposes a new video frame interpolation method based on representation learning instead of flow warping. The proposed method increases inference speed and obtains better results than existing methods. The authors have conducted extensive experiments to demonstrate the effectiveness of FLAVR in interpolating single or multiple frames in a video. It is also shown that the proposed FLAVR method can serve as a preliminary processing step for videos, which helps improve the performance of various applications such as action recognition, optical flow estimation, and video object tracking.

2. Strengths

The organization of this paper is good, and the research problem is well motivated. The proposed method is novel to some extent: unlike existing methods, this paper interpolates frames based on representation learning rather than flow warping. The method details are clearly presented. The effectiveness of the proposed method is validated and analyzed from many aspects, including inference speed, accuracy, and downstream applications.

3. Weaknesses

There are some drawbacks and unclear parts that need further explanation.

1. Experimental results show that the proposed method helps improve the performance of downstream applications. Nevertheless, the representations extracted for frame interpolation carry no (task-related) semantic guarantee. Some theoretical evidence should be given to support these results.
2. A user study is conducted for evaluation, reporting human visual perception of image quality. However, information about the surveyed user community is missing; the characteristics of the surveyed population will influence the results.
3. Although the proposed method increases inference speed, FLAVR may not save computation time once the training process is taken into account. In addition, the proposed method is not flexible for multi-frame interpolation: a separate model must be trained for each number of interpolated frames.

4. Recommendation

Weak Accept

5. Justification

This paper clearly presents the limitations of existing flow-based approaches to frame interpolation. The authors propose the new idea of using representation learning to achieve the interpolation objective, which models the frame interpolation task from a different perspective. The reviewer also appreciates the result analysis and discussion presented in the paper. Although some parts could be improved, as mentioned in the Weaknesses section, this paper is above the acceptance criteria.

################################################################
## Reviewer 2
################################################################
################################################################

1. Paper Summary

This paper tackles the video frame interpolation problem, which consists of generating new frames between existing video frames to obtain a video with more detailed motion (slow motion). The standard approach to this problem is to estimate optical flow between frames and generate new frames by warping the existing ones. This paper, in contrast, proposes an approach that does not use optical flow estimation, since flow errors can propagate into the generated frames. The authors propose a neural network (encoder-decoder style) that directly reconstructs the frames between existing frames. They show that such a model can achieve better PSNR and SSIM metrics, and that it can also be used as a feature learning approach for action recognition and optical flow estimation.

2. Strengths

- The paper is easy to follow and well structured.
- The model is very simple and achieves very good performance on all benchmarks presented. A simple solution with good results is the perfect combination.
- The experimental section is very complete: results on reconstruction, feature learning, ablation studies, and a user study on video quality. The supplementary material is also very complete and provides code and predicted video samples to check the quality of the model outputs.

3. Weaknesses

- For the task of frame interpolation, one of the main challenges is to reproduce the movement of deformable objects. For instance, in the video of the dog chasing ducks in your supplemental material, the main disagreement between the compared models is the area around the dog. I therefore think it would be better to compute the reconstruction metrics (SSIM, PSNR) only in these areas. This can be done using segmentation masks (annotated or generated by an auxiliary model) or optical flow; a minimal sketch of such a masked metric is given after this list.
- There is not much technical novelty in this paper, since reconstruction models using encoder-decoder architectures have already been used in many similar problems.
- I also have a concern about what happens if a very large K is used or if the input frames depict drastic changes in the scene. Is the network able to hallucinate some of the changes, or does it tend to simply propagate the motion? For instance, if a new object appears between two frames, how is the network able to hallucinate this object in the scene? Similarly, how does it handle occluded objects?
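To make the masked-metric suggestion concrete, the following is a minimal sketch assuming NumPy and scikit-image are available and that a binary mask of the moving region is provided (from annotation, a segmentation model, or thresholded optical-flow magnitude). The function names and the flow threshold are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of region-restricted PSNR/SSIM for interpolation evaluation.
# Assumes pred/gt are HxWx3 uint8 frames and mask is an HxW boolean array
# marking the moving object; all names here are illustrative.
import numpy as np
from skimage.metrics import structural_similarity

def masked_psnr(pred, gt, mask, max_val=255.0):
    """PSNR computed only over the pixels selected by the mask."""
    pred, gt = pred.astype(np.float64), gt.astype(np.float64)
    m = mask.astype(bool)
    mse = np.mean((pred[m] - gt[m]) ** 2)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))

def masked_ssim(pred, gt, mask):
    """Mean of the per-pixel SSIM map over the masked region."""
    _, ssim_map = structural_similarity(
        gt, pred, channel_axis=-1, data_range=255, full=True
    )
    return ssim_map[mask.astype(bool)].mean()

# One possible mask source: threshold the optical-flow magnitude so that only
# pixels moving more than ~1 px count as the "motion" region.
# mask = np.linalg.norm(flow, axis=-1) > 1.0
```

Reporting such masked scores alongside the full-frame PSNR/SSIM would directly quantify the disagreement around deformable, fast-moving regions.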
4. Recommendation

Strong Accept

5. Justification

This paper proposes a simple approach that achieves state-of-the-art performance on a very complex task. In addition, the submission is very complete, with many insightful experiments, code, and prediction samples that facilitate the review. My only comments against this work are the lack of technical novelty and some doubts about the way the reconstruction metrics are computed. However, since the pros outweigh the cons by a large margin, I recommend acceptance of this paper.

################################################################
## Reviewer 3
################################################################
################################################################

1. Paper Summary

The authors propose a flow-free approach for multi-frame video interpolation. It leverages 3D spatio-temporal kernels to learn motion properties from unlabelled videos and delivers about a 6x speed-up over current state-of-the-art video interpolation methods. Further, the authors aim to achieve a good trade-off between visual quality and inference speed for video interpolation.

2. Strengths

1. The paper is well written and the motivation is clearly defined. FLAVR utilizes spatio-temporal kernels for motion modeling and is designed to directly predict multiple intermediate frames in a single forward pass without requiring access to external flow or depth maps. Furthermore, it is the first video frame interpolation approach that is both optical-flow-free and able to make single-shot multi-frame predictions.
2. Extensive experiments and results on three novel applications of FLAVR, including action recognition, optical flow estimation, and video object tracking, demonstrate the efficiency of the proposed method.
3. The network is very simple and easy to follow, and shows superior results on the considered tasks.

3. Weaknesses

To further prove the effectiveness of the proposed approach, the authors should consider a few suggestions:

1. Some lines are confusing and require proper justification, e.g., "The final prediction layer (the purple block) is implemented as a convolution layer to project the 3D feature maps into (k-1) frame predictions." A sketch of one possible reading of this layer is given after this list.
2. Replacing 2D convolutions with 3D convolutions would, as per my understanding, add computational complexity and thus lower the overall speed. The authors could add an experiment comparing 2D and 3D convolutions along with their parameter counts (see the simple comparison in the sketch after this list).
3. "We also remove all temporal striding, as downsampling operations like striding and pooling are known to remove details that are crucial for generating sharper images." The authors should give a relevant citation.
4. The authors could also experiment with the latest state-of-the-art video interpolation architectures.
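To make points 1 and 2 concrete, here is a minimal PyTorch sketch of (a) one possible reading of a convolutional head that projects 3D spatio-temporal features into (k-1) RGB frame predictions, and (b) a parameter-count comparison between a 2D and a 3D convolution. The channel sizes, k, and the time-folding strategy are illustrative assumptions, not the authors' exact FLAVR configuration.

```python
# Illustrative sketch only: channel sizes, k, and the time-folding strategy
# are assumptions, not the authors' exact FLAVR configuration.
import torch
import torch.nn as nn

N, C, T, H, W = 1, 64, 4, 128, 224   # decoder feature map (batch, C, T, H, W)
k = 8                                 # predict k - 1 = 7 intermediate frames
feat = torch.randn(N, C, T, H, W)

# (a) Prediction head: fold the temporal axis into channels, then a single
#     convolution maps the fused features to 3 * (k - 1) output channels,
#     i.e. one RGB image per predicted frame.
head = nn.Conv2d(C * T, 3 * (k - 1), kernel_size=3, padding=1)
frames = head(feat.reshape(N, C * T, H, W))      # (N, 3*(k-1), H, W)
frames = frames.reshape(N, k - 1, 3, H, W)       # one RGB frame per time step

# (b) Parameter counts: a 3x3x3 Conv3d has ~3x the weights of a 3x3 Conv2d
#     with the same channel widths, which is the complexity concern above.
conv2d = nn.Conv2d(C, C, kernel_size=3, padding=1)
conv3d = nn.Conv3d(C, C, kernel_size=3, padding=1)
n2 = sum(p.numel() for p in conv2d.parameters())
n3 = sum(p.numel() for p in conv3d.parameters())
print(f"Conv2d: {n2:,} params | Conv3d: {n3:,} params")  # 36,928 vs 110,656
```

A table listing parameters and FLOPs for a 2D variant of the same backbone would directly address point 2.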
4. Recommendation

Strong Accept

5. Justification

1. The paper is well written and the motivation is clearly defined. FLAVR utilizes spatio-temporal kernels for motion modeling and is designed to directly predict multiple intermediate frames in a single forward pass without requiring access to external flow or depth maps. Furthermore, it is the first video frame interpolation approach that is both optical-flow-free and able to make single-shot multi-frame predictions.
2. Extensive experiments and results on three novel applications of FLAVR, including action recognition, optical flow estimation, and video object tracking, demonstrate the efficiency of the proposed method.

################################################################
## Reviewer 4
################################################################
################################################################

1. Paper Summary

This paper proposes an optical-flow-free approach for frame interpolation. The proposed method leverages simple 3D U-Nets and achieves comparable results with faster inference speed.

2. Strengths

(1) This paper is well written and easy to follow.
(2) The framework of the proposed method (FLAVR) is very simple; it does not need optical flow and still achieves comparable performance.
(3) Because of the simple architecture, FLAVR achieves a good trade-off between performance and inference speed for video interpolation.

3. Weaknesses

My main concern is the experimental comparison and the references. This paper does not appear to cite any 2022 works and cites very few frame interpolation works from 2021. The authors are encouraged to compare with the latest works.

4. Recommendation

Weak Accept

5. Justification

As shown in the Strengths and Weaknesses sections.