The BM3D algorithm is currently regarded as one of the best image denoising methods. It exploits not only the spatial correlation among pixels within an image block but also the similarity between blocks. As a result, it preserves image detail well and introduces few artifacts. However, BM3D has very high computational complexity and a heavy computational load, which makes it too slow for many real-time applications. This study uses the numerical computing power and parallelism of graphics processing units (GPUs) to parallelize BM3D, with the aim of reducing its execution time and meeting real-time processing requirements. The work covers three aspects. (1) The working characteristics of GPU parallel computing and parallel programming are examined. When an algorithm is parallelized with CUDA, the limits of the GPU hardware, memory, and other resources must be taken into account to obtain the best parallel performance. (2) Based on the characteristics of BM3D, the algorithm flow is analyzed and, as an approximation, the block-matching procedure in the second step is omitted and replaced by the block-matching result of the first step. The algorithm is divided into relatively independent functional modules, and each module's strategy and the order in which the modules are processed are modified to reduce the amount of serial computation and make the algorithm suitable for parallel execution. (3) A parallel version of BM3D is implemented on the CUDA programming model. Each CUDA module is optimized in three respects, namely parallelism, memory access, and instruction usage, to improve computational efficiency and shorten execution time.
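The block-matching step that the first stage performs, and whose result the second stage reuses, can be illustrated with a minimal sketch. This is not the CUDA implementation described in this study; it is a plain Python illustration under assumed names and parameters (`block_distance`, `match_blocks`, `size`, `radius`, `threshold`, and `max_matches` are all hypothetical). It shows that each reference block is matched independently of every other block, which is the property that makes the step suitable for one GPU thread or thread block per reference block.

```python
def block_distance(img, r0, c0, r1, c1, size):
    """Mean squared difference between two size x size blocks of img."""
    total = 0.0
    for dr in range(size):
        for dc in range(size):
            d = img[r0 + dr][c0 + dc] - img[r1 + dr][c1 + dc]
            total += d * d
    return total / (size * size)


def match_blocks(img, r0, c0, size=2, radius=2, threshold=1.0, max_matches=4):
    """Return top-left corners of the blocks most similar to the reference block.

    The search is restricted to a window of +/- radius pixels around the
    reference block at (r0, c0). All parameter values are illustrative
    assumptions, not the settings used in the study.
    """
    height, width = len(img), len(img[0])
    candidates = []
    for r in range(max(0, r0 - radius), min(height - size, r0 + radius) + 1):
        for c in range(max(0, c0 - radius), min(width - size, c0 + radius) + 1):
            d = block_distance(img, r0, c0, r, c, size)
            if d <= threshold:
                candidates.append((d, r, c))
    # Smallest distance first; the reference block itself has distance 0.
    candidates.sort()
    return [(r, c) for _, r, c in candidates[:max_matches]]


if __name__ == "__main__":
    image = [
        [1, 2, 1, 2],
        [3, 4, 3, 4],
        [1, 2, 1, 2],
        [3, 4, 3, 4],
    ]
    # The group found here could be cached and reused by a later stage
    # instead of running the search a second time.
    print(match_blocks(image, 0, 0, threshold=0.0))
```

Because the loop body for one reference block never reads the result of another, the outer iteration over reference blocks is the natural unit to distribute across GPU threads.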
Experimental results show that the PSNR of CUDA_BM3D is slightly lower than that of the original BM3D algorithm but still higher than that of the 3DDCT and Wiener filtering algorithms. The speedup ratio is at least 36, and for images of size 352×288 the algorithm meets the requirement of real-time processing. The GPU-accelerated BM3D algorithm can therefore satisfy real-time demands in image processing. The GPU parallel computing approach and optimization strategies can also be applied to other complex image algorithms, making real-time execution of such algorithms possible.
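The quantities reported above can be made concrete with a short sketch. PSNR is computed from the mean squared error between a reference image and its denoised version, and the speedup ratio is the serial execution time divided by the parallel execution time. The function names, the 8-bit peak value of 255, and the 25 frames-per-second target are assumptions chosen for illustration; the real-time frame rate used in the experiments is not stated here.

```python
import math


def psnr(reference, denoised, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    total = 0.0
    count = 0
    for row_ref, row_den in zip(reference, denoised):
        for a, b in zip(row_ref, row_den):
            total += (a - b) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


def speedup(cpu_seconds, gpu_seconds):
    """Ratio of serial (CPU) to parallel (GPU) execution time."""
    return cpu_seconds / gpu_seconds


def meets_real_time(seconds_per_frame, target_fps=25.0):
    """True if one frame is processed within the per-frame time budget.

    target_fps is an assumed value, not one taken from the experiments.
    """
    return seconds_per_frame <= 1.0 / target_fps


if __name__ == "__main__":
    clean = [[10, 20], [30, 40]]
    restored = [[11, 19], [30, 41]]
    print(round(psnr(clean, restored), 2))
    print(speedup(72.0, 2.0))
    print(meets_real_time(0.03))
```

A higher PSNR indicates a denoised image closer to the reference, so a slight PSNR decrease paired with a large speedup is the trade-off the approximation in the second step accepts.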