We address the difficulties in action recognition that arise from large intra-class variations. These variations make it infeasible to represent one action instance using other instances of the same action. We therefore propose to extract both instance-specific and class-consistent features to facilitate action recognition. Specifically, the instance-specific features exploit the self-similarities among the frames of each video instance, while the class-consistent features summarize within-class similarities. We introduce a generative formulation to combine these two complementary types of features. Experimental results demonstrate the effectiveness of our approach.
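To make the two feature types concrete, the following is a minimal illustrative sketch and not the formulation proposed in this work. It assumes each frame is already described by a fixed-length descriptor, uses cosine similarity as a stand-in for frame self-similarity, and uses a simple mean over videos as a stand-in for a within-class summary. The function names `frame_self_similarity` and `class_prototype` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def frame_self_similarity(frames):
    """Instance-specific view (assumed): T x T similarity matrix among the
    frame descriptors of a single video."""
    return [[cosine(u, v) for v in frames] for u in frames]


def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def class_prototype(videos):
    """Class-consistent view (assumed): average each video's frame descriptors,
    then average those per-video means across all videos of one class."""
    return mean_vector([mean_vector(video) for video in videos])


# Toy example: one video with three 2-D frame descriptors.
video_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarity = frame_self_similarity(video_a)

# A second video of the same (hypothetical) class with a single frame.
video_b = [[3.0, 1.0]]
prototype = class_prototype([video_a, video_b])
```

In this sketch the self-similarity matrix depends only on the frames of one video, whereas the prototype pools information across videos of the same class, which mirrors the distinction drawn above. How the two are combined generatively is not specified here and is left out of the sketch.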