Deep-learning-based video compression techniques have advanced rapidly in recent years. This paper adopts the Conditional Augmented Normalizing Flow video codec (CANF-VC) as its base system. To improve the quality of the condition signal (image) for CANF, we propose a two-layer learning-based video codec. At a small cost in additional bit rate, the low-resolution base layer provides side information that improves the quality of the motion-compensated reference frame through a super-resolution module with a merge-net. The base layer also provides information to the skip-mask generator. The resulting skip mask guides the coding mechanism to reduce the number of samples transmitted for the high-resolution enhancement layer. Experimental results indicate that the proposed two-layer coding scheme achieves a 22.19% PSNR BD-Rate saving and a 49.59% MS-SSIM BD-Rate saving over H.265 (HM 16.20) on the UVG test sequences.