In this article I will discuss how you can use OpenGL textures and buffers in a CUDA kernel. I will demonstrate a simple post-process effect that can be applied to off-screen textures and then rendered to the screen using a full-screen quad. I will assume the reader has some basic knowledge of C/C++ programming, OpenGL, and CUDA. If you are new to OpenGL, you can refer to my previous article titled Introduction to OpenGL; if you have never done anything with CUDA, you can follow my previous article titled Introduction to CUDA.
- 1 Introduction
- 2 Setting Up CUDA
- 3 Creating a Texture Object
- 4 Creating a Pixel Buffer Object
- 5 Creating a Renderbuffer
- 6 Creating a Framebuffer
- 7 Register Resources with CUDA
- 7.1 Register a Texture Resource with CUDA
To register an OpenGL texture or render-buffer resource with CUDA, you must use the cudaGraphicsGLRegisterImage method. This method accepts an OpenGL texture or render-buffer resource ID as a parameter and provides a pointer to a cudaGraphicsResource_t object in return. The cudaGraphicsResource_t object is then used to map the memory location defined by the texture object so that it can be used as a texture reference in CUDA later.
The cudaGraphicsGLRegisterImage method has the following signature:
    cudaError_t cudaGraphicsGLRegisterImage( struct cudaGraphicsResource** resource, GLuint image, GLenum target, unsigned int flags )
Where each parameter has the following definition:
  - struct cudaGraphicsResource** resource: A pointer to the registered resource object that can be used to map the OpenGL texture object to a CUDA texture reference.
  - GLuint image: The unique identifier for the OpenGL texture or render buffer object that has been previously defined.
  - GLenum target: Identifies the type of the object specified by image. If image is a texture resource, then target must be GL_TEXTURE_2D, GL_TEXTURE_RECTANGLE, GL_TEXTURE_CUBE_MAP, GL_TEXTURE_3D, or GL_TEXTURE_2D_ARRAY. If image refers to a render-buffer object, then target must be GL_RENDERBUFFER.
  - unsigned int flags: The register flags specify the intended usage and can be one of the following values:
    - cudaGraphicsRegisterFlagsNone: This specifies no hint about the usage of this resource. In this case, CUDA assumes the resource will be used for both reading and writing. This is the default value.
    - cudaGraphicsRegisterFlagsReadOnly: This resource will be used for read-only purposes and CUDA will not be used to write to this resource.
    - cudaGraphicsRegisterFlagsWriteDiscard: Specifies that CUDA will not read from this resource and that the entire buffer contents will be discarded every time it is needed. This is safe to do if you assume the entire buffer will be redrawn every frame.
    - cudaGraphicsRegisterFlagsSurfaceLoadStore: This flag specifies that this resource will be bound to a surface reference instead of a texture reference. This option is only available on devices that support compute capability 2.0 or higher.
This method will return cudaSuccess if nothing went wrong. If you try to use this method to register an OpenGL resource object that is neither a texture object nor a render buffer (for example, a pixel buffer object or a vertex buffer object), then this function will probably return cudaErrorUnknown and a message like “Unknown device/driver error”. This vague error probably indicates that you are not passing the right object type to the function, or that you are trying to register a render buffer but specifying GL_TEXTURE_2D as the target value when you should be specifying GL_RENDERBUFFER instead.
Register a Vertex Buffer or Pixel Buffer with CUDA
- 8 Rendering the Scene
- 9 Post-Process the Scene
- 10 The CUDA Kernel
- 11 Display the Final Result
- 12 Exercise
- 13 Conclusion
- 14 References
- 15 Download the Source
Introduction
Besides the memory types discussed in the previous article on the CUDA Memory Model, CUDA programs have access to two more types of memory: texture memory, which is available on devices that support compute capability 1.0 and better, and surface memory, which is available on devices that support compute capability 2.0 and better. Texture memory is useful for fetching texture elements from a texture, while surface memory is more like a pixel buffer object: it simply represents a block of memory that can be both read from and written to.
Texture and surface memory reside in device memory (also called off-chip memory). Global memory also resides in device memory, and we know that accessing global memory is relatively slow (about 100x slower) compared to accessing the on-chip (cache) memory. However, the high latency incurred by global memory accesses does not exactly apply to texture memory because, unlike global memory, accesses to texture memory are cached, even on devices of compute capability 1.x.
Reading from texture or surface memory costs a read from device memory only if a cache miss occurs; otherwise it costs only a read from the texture cache, which is a very low-latency memory access. Since the texture cache is optimized for 2D locality, threads of the same warp that read texture addresses located close together in texture space will achieve the best performance. Texture memory is also optimized for streaming fetches (when all the threads in a warp access texture addresses with 2D locality), so even if a cache miss does occur, the latency to access texture memory will not be high.
There are several benefits to accessing device memory through texture or surface fetching rather than through global or constant memory:
- If the memory reads do not follow strict access patterns that are required to achieve high performance when accessing global or constant memory (coalesced memory access for example), we can still achieve high-bandwidth access as long as we can access the texture memory with spatial locality (texture fetches are located close to each other in the 2D texture).
- Addressing calculations are performed by dedicated units.
- Packed data may be broadcast to separate variables in a single operation.
- 8-bit and 16-bit integer input data can be converted to 32-bit floating point values during the texture fetch operation.
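The last two benefits in the list above can be illustrated on the CPU. The following is a minimal C++ sketch, not CUDA code: Texel8 and TexelF are hypothetical stand-ins for CUDA's uchar4 and float4 vector types, and fetchNormalized is an illustrative helper that mimics a fetch in normalized-float read mode, where each unsigned 8-bit component is mapped to the range [0.0, 1.0] and the four packed components arrive as four separate floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for CUDA's uchar4 and float4 vector types.
struct Texel8 { std::uint8_t r, g, b, a; };
struct TexelF { float r, g, b, a; };

// Map an unsigned 8-bit component to [0.0, 1.0], so that 255 becomes 1.0.
inline float normalizedFromU8(std::uint8_t v) {
    return static_cast<float>(v) / 255.0f;
}

// Emulate one fetch: read a packed 4-component texel and hand back
// four separate floating point values in a single operation.
inline TexelF fetchNormalized(const Texel8* texels, std::size_t count, std::size_t index) {
    assert(index < count);  // a plain array has no clamping, so guard the index
    const Texel8 t = texels[index];
    return TexelF{ normalizedFromU8(t.r), normalizedFromU8(t.g),
                   normalizedFromU8(t.b), normalizedFromU8(t.a) };
}
```

On the GPU this conversion is done by the texture hardware during the fetch, so the kernel does not spend any instructions on it.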
In this article I will show you how you can map an OpenGL 2D texture to a CUDA texture so that it can be accessed in an optimized way in a CUDA kernel.
Setting Up CUDA
By default, the CUDA context is not configured to work with the OpenGL context. To tell CUDA that you will be using it with OpenGL, you must initialize the CUDA context and the OpenGL context together. To do that, you must first call cudaGLSetGLDevice. The only parameter to this method is the ID of the device in your system that should be set up to use the OpenGL context. If you have only one CUDA device, you can usually pass 0 to this method to initialize the default device to share resources with OpenGL.
Creating a Texture Object
Before we can start manipulating OpenGL textures in CUDA, we must first define a texture. You can create textures of many different pixel formats but for this article, I will use 4-component (Red, Green, Blue, and Alpha) unsigned byte textures (GL_RGBA).
To create an OpenGL texture, you can use the following method:
    GLuint texture;
    glGenTextures( 1, &texture );
    glBindTexture( GL_TEXTURE_2D, texture );

    // set basic parameters
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

    // Create texture data (4-component unsigned byte)
    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL );

    // Unbind the texture
    glBindTexture( GL_TEXTURE_2D, 0 );
On line 1, we define a handle that is used to uniquely identify the OpenGL texture object. The method glGenTextures is used to obtain unique texture object IDs that we can use to refer to this texture throughout the application.
On line 3, the texture object is bound to the GL_TEXTURE_2D texture target. From this point on, we can use the GL_TEXTURE_2D target identifier to refer to this texture.
Each texture in OpenGL has a set of properties (or attributes) which we can manipulate using the glTexParameter[i|f] methods. The first two settings will determine what happens when we try to fetch a pixel beyond the size of the texture. In this case, we will simply clamp the out-of-bound texture coordinate to the edge of the texture map. Since texture coordinates are usually defined in the range [0..1), accessing a pixel outside of this range would usually result in an error (like trying to access an array out-of-bounds) but the GL_CLAMP_TO_EDGE setting allows us to request a pixel of the texture outside of the normalized range without accessing out-of-bounds memory. The texture coordinates will simply be clamped into the allowed range when the texture is accessed.
The next settings on lines 8 and 9 determine how the pixels of the texture are blended if the pixel is mapped to an area larger (GL_TEXTURE_MIN_FILTER) than a single texture element, or smaller (GL_TEXTURE_MAG_FILTER) than a single texture element. In this case, the GL_NEAREST parameter specifies that no filtering should occur – just return the pixel closest to the requested texture coordinate.
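To make the effect of these wrap and filter settings concrete, here is a small CPU-side sketch of how a normalized texture coordinate could be turned into a texel index along one axis when GL_CLAMP_TO_EDGE and GL_NEAREST are in effect. The function name nearestTexelClampToEdge is hypothetical (it is not an OpenGL function), and the sketch only models the index selection, not the actual hardware path.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Pick the texel index along one axis for a normalized coordinate s,
// assuming nearest filtering and clamp-to-edge wrapping.
// 'size' is the texture width (or height) in texels.
inline int nearestTexelClampToEdge(float s, int size) {
    assert(size > 0);
    // Nearest filtering: the texel whose cell contains s * size.
    const int i = static_cast<int>(std::floor(s * static_cast<float>(size)));
    // Clamp-to-edge: coordinates outside the texture map to the border texels.
    return std::min(std::max(i, 0), size - 1);
}
```

For a texture that is 4 texels wide, a coordinate of 0.5 selects texel 2, while coordinates of -0.5 and 1.5 are clamped to texels 0 and 3 respectively.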
We haven’t yet told OpenGL how large our texture is, and thus no texture memory has been allocated for it. To actually allocate memory for the texture, we use the glTexImage2D method. In addition to the size of the texture, we must also specify the internal format of the texture. In this case, I want to access this texture in CUDA with Red, Green, Blue, and Alpha components with each component being an unsigned byte.
On line 15, the texture object is unbound so we return OpenGL back to its normal state.
When no longer needed (when your application is finished running for example), the texture object can be deleted using the glDeleteTextures method.
Creating a Pixel Buffer Object
If your graphics adapter has support for pixel buffer objects (if you have a graphics adapter that supports CUDA, you are pretty much guaranteed to have support for this extension), then you can use a pixel buffer object (PBO) to store the result of the CUDA kernel and then copy the contents of the PBO to a texture to be rendered to the screen.
To create a pixel buffer object, you can use the following function:
    GLuint bufferID;
    glGenBuffers( 1, &bufferID );
    glBindBuffer( GL_PIXEL_UNPACK_BUFFER, bufferID );
    glBufferData( GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW );
    glBindBuffer( GL_PIXEL_UNPACK_BUFFER, 0 );
To create a PBO, we must perform 3 simple steps:
- Generate a unique buffer object ID using the glGenBuffers method.
- Bind the buffer using a valid target (for PBOs this should be either GL_PIXEL_PACK_BUFFER or GL_PIXEL_UNPACK_BUFFER). In this case, the target isn’t really important yet as long as it’s one of these two.
- Define some data for the buffer. The buffer data is defined using the glBufferData method and it takes the target, the size of the buffer in bytes and the usage hints as parameters.
The final argument to the glBufferData method is the usage hints. In this case, we want a buffer that will be streamed (updated once every frame) and drawn to the screen (via a texture copy) so the GL_STREAM_DRAW usage hint is probably the best for what we want to use this buffer for. If you are curious what other usage hints are available, I encourage you to read the following topic: http://www.songho.ca/opengl/gl_pbo.html.
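The size argument passed to glBufferData in the listing above is given in bytes. As a minimal sketch, assuming the PBO must hold exactly one image in the same 4-component unsigned byte format as the texture created earlier, the size could be computed as follows. The helper name rgba8BufferSize is hypothetical and not part of OpenGL.

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes needed for a width x height image with four
// unsigned byte components (red, green, blue, alpha) per pixel.
inline std::size_t rgba8BufferSize(std::size_t width, std::size_t height) {
    assert(width > 0 && height > 0);
    const std::size_t bytesPerPixel = 4;  // one byte per component
    return width * height * bytesPerPixel;
}
```

For a 640 x 480 render window this gives 1,228,800 bytes. If your pixel format differs from the one assumed here, the bytes-per-pixel factor has to change accordingly.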
When the buffer is no longer needed (when your application is finished running for example), you can use the glDeleteBuffers method to release the buffer.
Creating a Renderbuffer
Texture objects are great for storing data that contains color information, and pixel buffer objects are great for storing general (unspecified) pixel data, but what about stencil or depth information? The render buffer object is well suited for storing depth information.
To create a render buffer for storing depth values, you would use the following methods:
    GLuint depthBuffer;
    glGenRenderbuffers( 1, &depthBuffer );
    glBindRenderbuffer( GL_RENDERBUFFER, depthBuffer );
    glRenderbufferStorage( GL_RENDERBUFFER, GL_DEPTH_COMPONENT, width, height );

    // Unbind the depth buffer
    glBindRenderbuffer( GL_RENDERBUFFER, 0 );
This isn’t much different than the way we define a PBO except for the way we define the storage for the render buffer. Since we want to use this render buffer for storing the depth information of our rendered scene, we will specify GL_DEPTH_COMPONENT as the internal format of the render buffer. This is perfectly suitable for the depth buffer that will be attached to the frame buffer object that I’ll define next.
Of course, if you’re finished with your render buffer (at the end of the program for example) then you should delete it using the glDeleteRenderbuffers method.
Creating a Framebuffer
Before we can apply the post-process effect to our scene, we must render it into an off-screen buffer called a frame-buffer. OpenGL defines several default frame-buffers, but these buffers are best suited for rendering our final post-processed scene. To create an intermediate buffer, we can define our own frame-buffer by attaching a color texture and a depth buffer, and then render our scene to this custom frame-buffer. We can then use the color texture as the input to our CUDA kernel to process the scene. Finally, we render the post-processed image to the default OpenGL frame-buffer so that it appears on the screen.
You may want to check if your graphics card has support for frame-buffers by checking for the “GL_ARB_framebuffer_object” extension. Again, if you have a graphics card that supports CUDA, there is a pretty good chance your graphics adapter will support this extension.
To create a frame buffer we need to define at least one color texture and one depth buffer and attach these to the frame-buffer.
Using the methods described above to define a color texture and a depth buffer that match the width and height of our render window, we can then attach those buffers to our frame-buffer that will be used to render our scene.
To define a frame-buffer object you would use the following method:
    GLuint framebuffer;
    glGenFramebuffers( 1, &framebuffer );
    glBindFramebuffer( GL_FRAMEBUFFER, framebuffer );
    glFramebufferTexture2D( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, colorAttachment0, 0 );
    glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_RENDERBUFFER, depthAttachment );
The framebuffer is created using the glGenFramebuffers method shown here on line 2. Before we can populate the frame-buffer, we must bind it using the glBindFramebuffer method supplying GL_FRAMEBUFFER as the target and the ID of the frame-buffer we just generated.
The frame-buffer can support multiple color attachment points and a single depth attachment point and a single stencil attachment point. It is not necessary to have a stencil attachment point and since we aren’t using it in this application, I will skip adding a stencil buffer to the frame-buffer in this example.
The color texture is attached to the frame buffer using the glFramebufferTexture2D method. The first argument is always going to be GL_FRAMEBUFFER and the second parameter is the attachment point we want to add this texture to. Theoretically, the frame buffer can support up to 32 color attachment points but the actual number of supported color attachment points should be queried using the method:
    int maxAttachments = 0;
    glGetIntegerv( GL_MAX_COLOR_ATTACHMENTS, &maxAttachments );
The minimum supported color attachment points is 1, so if your graphics adapter has support for the GL_ARB_framebuffer_object extension, then you are guaranteed to be able to attach at least one color attachment.
In our case, we want to attach the color texture we defined earlier to the GL_COLOR_ATTACHMENT0 color attachment point.
The next parameters specify the texture target, texture object and mip-map level of the texture we generated earlier. Since we defined a 2D texture with only a single mip-map (at level 0) we specify the texture target should be GL_TEXTURE_2D, the texture object ID of the texture previously generated, and a mip-level of “0”.
The depth buffer was defined as a render buffer. Render buffers are attached to the frame-buffer using the glFramebufferRenderbuffer method. The frame-buffer supports at most one depth attachment point. We use GL_DEPTH_ATTACHMENT to specify the only depth buffer that is attached to this frame-buffer. Since it’s a render buffer, the target can only be GL_RENDERBUFFER, and the final parameter to this method is the depth buffer ID we generated earlier.
Now that we’ve defined a color attachment and a depth attachment for our frame-buffer, it should be ready to render to; but we need to check that our frame-buffer is good enough according to our graphics driver. To do that, we use the method glCheckFramebufferStatus and if this method returns GL_FRAMEBUFFER_COMPLETE then we’re good to go. If it returns something else, then we need to determine what went wrong. If you are having trouble with your frame buffers, I would encourage you to read the topic on OpenGL frame buffer objects located here: http://www.songho.ca/opengl/gl_fbo.html.
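When glCheckFramebufferStatus returns something other than GL_FRAMEBUFFER_COMPLETE, it helps to turn the numeric status into a readable name before logging it. The sketch below is one possible way to do that. So that it compiles without the OpenGL headers, it copies the numeric values of a few status enums into local constants (the k-prefixed names are hypothetical, and only a subset of the possible status codes is covered); in a real application you would switch on the GL_ constants directly.

```cpp
#include <cassert>
#include <string>

// Local copies of OpenGL status values, named after the enums they mirror.
constexpr unsigned kFramebufferComplete         = 0x8CD5;  // GL_FRAMEBUFFER_COMPLETE
constexpr unsigned kIncompleteAttachment        = 0x8CD6;  // GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT
constexpr unsigned kIncompleteMissingAttachment = 0x8CD7;  // GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT
constexpr unsigned kFramebufferUnsupported      = 0x8CDD;  // GL_FRAMEBUFFER_UNSUPPORTED

// Translate a status code into the name of the matching OpenGL enum.
inline std::string framebufferStatusName(unsigned status) {
    switch (status) {
        case kFramebufferComplete:         return "GL_FRAMEBUFFER_COMPLETE";
        case kIncompleteAttachment:        return "GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT";
        case kIncompleteMissingAttachment: return "GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT";
        case kFramebufferUnsupported:      return "GL_FRAMEBUFFER_UNSUPPORTED";
        default:                           return "unrecognized framebuffer status";
    }
}

// Debug-build guard: stop early if the frame-buffer cannot be rendered to.
inline void requireComplete(unsigned status) {
    assert(status == kFramebufferComplete);
}
```

With a frame-buffer bound, the value returned by glCheckFramebufferStatus( GL_FRAMEBUFFER ) would be passed to framebufferStatusName for logging, or to requireComplete during development.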
Register Resources with CUDA
Before a texture or buffer can be used by a CUDA application, the buffer (or texture) must be registered. A resource that is either a texture object or a render buffer is treated differently than buffer objects (vertex buffer object or pixel buffer object). This might be confusing at first because of the naming of “render buffer” and “pixel buffer” and “vertex buffer”. A good way to remember this is that a pixel buffer object cannot be attached to a frame buffer but a render buffer can. In this way, a render buffer is more like a texture than a pixel buffer is.
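The distinction described above can be summarized in a few lines of code. The following sketch uses a hypothetical enum, GLResourceKind, to name the four kinds of OpenGL objects mentioned in this section, and returns the name of the CUDA runtime function that registers each kind: cudaGraphicsGLRegisterImage for image-like resources and cudaGraphicsGLRegisterBuffer for buffer objects. It does not call CUDA itself; it only encodes the rule.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum for the kinds of OpenGL objects discussed in this section.
enum class GLResourceKind { Texture, RenderBuffer, PixelBuffer, VertexBuffer };

// Name of the CUDA runtime function used to register each kind of object.
inline std::string registrationCallFor(GLResourceKind kind) {
    switch (kind) {
        case GLResourceKind::Texture:
        case GLResourceKind::RenderBuffer:
            return "cudaGraphicsGLRegisterImage";   // image-like resources
        case GLResourceKind::PixelBuffer:
        case GLResourceKind::VertexBuffer:
            return "cudaGraphicsGLRegisterBuffer";  // linear buffer objects
    }
    assert(false && "unhandled GLResourceKind");
    return "";
}
```

Note how the render buffer falls on the same side as the texture, while the pixel buffer is grouped with the vertex buffer.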