Both structural and contextual information is essential and widely used in image analysis. However, current multi-view stereo (MVS) approaches usually use a single common pre-trained model as pixel descriptor to extract features, which mix structural and contextual information together and thus increase the difficulty of matching correspondence. In this paper, we propose FADE (feature aggregation for depth estimation), which treats spatial and context information separately and focuses on aggregating features for efficient learning of the MVS problem. Spatial information includes image details such as edges and corners, whereas context information comprises object features such as shapes and traits. To aggregate these multi-level features, we use an attention mechanism to select important features for matching. We then build a plane sweep volume by using a homography backward warping method to generate match candidates. Furthermore, we propose a novel cost volume regularization network aims to minimize the noise in the matching candidates. Finally, we take advantage of 3D stacked hourglass and regression to produces high-quality depth maps. With these well-aggregated features, FADE can efficiently perform dense depth reconstruction, achieving state-of-the-art performance in terms of accuracy and requiring the least amount of model parameters.