Local contexts here refer to properties that vary with spatial location. In the bag-of-visual-words (BoVW) model, for example, they can simply be the spatial locations of local features. To obtain richer contexts, one can model the relationships between neighboring local features, or extract some representation from the region surrounding each local feature. For the extraction, one can use either engineered or learned descriptors, and for the learning, one can apply either supervised or unsupervised algorithms. Once such representations are obtained, a question remains: how should they be used? Image understanding comprises tasks diverse in form, such as image classification (one label per image), image labeling (one label per pixel), and verification (one label per pair). Moreover, the way of leveraging contexts also depends on the particular approach taken to each task. Focusing on representing and leveraging local contexts in images, this thesis covers the four topics below.

1. Spatial directed graphs for image classification, based on BoVW. The context representations are simply the spatial locations of local features, according to which multiple local pooling operations are applied. Directed graphs replace the spatial pyramid, modeling not only the sliced blocks but also the relationships between them.

2. Contextual pooling for image classification, also based on BoVW. The context representations are extracted with BoVW from the regions adjacent to local features, and multiple local pooling operations are applied according to them. True context representations replace spatial locations, which helps discriminate ambiguous features that are not finely aligned.

3. Hierarchical contexts for image labeling, based on deep convolutional neural networks (CNNs). The network spontaneously learns hierarchical context representations, which are then used to classify pixels.

4. Learned local contexts for cross-view gait-based human identification, based on deep CNNs. The network learns context representations that are fed into a spontaneously learned comparator, which predicts similarities.
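To make the spatial pooling idea in topics 1 and 2 concrete, the sketch below shows the basic BoVW operation they build on: quantized local features are pooled into per-region histograms according to their spatial locations, and the regional histograms are concatenated into one image descriptor. This is a minimal illustration of location-based pooling only; the function name, the 2x2 grid, and the toy features are assumptions for the example, not the thesis's actual spatial-pyramid or directed-graph formulation.

```python
def bovw_spatial_histogram(features, vocab_size, grid=(2, 2)):
    """Pool quantized local features into per-cell histograms.

    features: list of (x, y, word_id), with x, y normalized to [0, 1).
    Returns the concatenation of one word histogram per grid cell.
    """
    gx, gy = grid
    hists = [[0] * vocab_size for _ in range(gx * gy)]
    for x, y, word in features:
        # assign the feature to a spatial cell by its location
        cell = int(x * gx) * gy + int(y * gy)
        hists[cell][word] += 1
    # concatenating cell histograms retains coarse spatial layout
    return [count for hist in hists for count in hist]

# toy image: 3 local features quantized against a 4-word vocabulary
feats = [(0.1, 0.2, 0), (0.6, 0.7, 2), (0.9, 0.1, 2)]
desc = bovw_spatial_histogram(feats, vocab_size=4)  # length 2*2*4 = 16
```

Topic 2 keeps this pooling machinery but replaces the raw `(x, y)` coordinates with context representations extracted from each feature's neighborhood, so that pooling groups features by what surrounds them rather than by where they sit.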
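Topic 4's verification setting (one label per pair) can be sketched as a siamese-style pipeline: a shared encoder maps each input to a context representation, and a comparator maps the pair of representations to a similarity score. The linear-plus-tanh encoder and the distance-based comparator below are stand-ins chosen for brevity; the thesis's encoder and comparator are both learned CNN components.

```python
import math

def encode(x, weights):
    # shared encoder: one linear layer with tanh squashing (stand-in
    # for the learned CNN feature extractor)
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in weights]

def comparator(a, b):
    # stand-in comparator: turn squared distance between the two
    # codes into a similarity in (0, 1]; 1 means identical codes
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2)

w = [[0.5, -0.2], [0.1, 0.4]]  # toy shared encoder weights
s_same = comparator(encode([1.0, 2.0], w), encode([1.0, 2.0], w))
s_diff = comparator(encode([1.0, 2.0], w), encode([-2.0, 0.5], w))
```

Because the encoder is shared, the two inputs are embedded in the same representation space, and training the comparator end-to-end lets it learn which differences between codes matter for deciding whether a pair depicts the same person.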