Fast Pixel Transfers with Pixel Buffer Objects
I often use the GPU to perform fast pixel format conversions. For example, I convert the current framebuffer to YUV420P so I can feed it into a video encoder, since encoders often use this pixel format by default. You can do this on the CPU, though for large (e.g. 4K) video this adds quite a delay.
In this article I'll describe a solution I use to download the contents of a texture from the GPU back to the CPU. A trivial solution is to use glReadPixels(). This is slow, though, because by default glReadPixels() has to wait until all queued draw commands in the GL driver queue are processed before it can download the pixels. When using plain glReadPixels() without a PBO, downloading a 1280 x 720 frame takes about 16-22 ms. A faster solution is to use pixel buffer objects (PBOs), which takes about 0.5 ms on a 5+ year old GPU (Radeon 4850).
A pixel buffer object (PBO) is a buffer object, just like the buffers you bind to e.g. GL_ARRAY_BUFFER. It can be used to store pixel data. When a buffer is bound to GL_PIXEL_PACK_BUFFER, a call to glReadPixels() can return immediately, but it does not actually read the pixels back to the CPU. Instead of doing a read, OpenGL writes (packs) the data into the currently bound PBO. Of course GL still needs to make sure that all drawing calls have been processed before the user actually downloads the pixels. But because PBOs give us a way to pipeline the read backs, we can make sure that we read the pixels after the draw calls have been processed.
But how do we know when the draw calls in the driver command queue have been processed? The common solution is to make use of the fact that the command queue will probably be about 1 or 2 frames behind. This is kind of a 'guess' but one which has been correct in about 99% of the tests I did.
What we want to accomplish is that we read the pixel data from a PBO only once the data has actually been written into it. So we need to read from PBOs whose read back was triggered N frames earlier. I'm using N here as you should experiment with a value that fits your GPU, though 2 or 3 seems like a good choice.
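The rotation through N PBOs can be sketched without any GL calls. The block below is a minimal simulation, not the downloader itself: `ReadbackSchedule` and `Step` are hypothetical names introduced here, and the class only tracks which PBO slot is used on each frame and which earlier frame's pixels would be available to map at that point.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One frame of the schedule (hypothetical helper type).
struct Step {
    std::size_t slot;            // index of the PBO used this frame
    bool has_result;             // false during the first N warm-up frames
    std::uint64_t result_frame;  // frame whose pixels can be mapped (valid if has_result)
};

// Simulates cycling through N PBOs: on every frame we would map the PBO in
// `slot` (if it already holds a pending read) and then trigger a new read into it.
class ReadbackSchedule {
public:
    explicit ReadbackSchedule(std::size_t num_pbos)
        : num_pbos_(num_pbos == 0 ? 1 : num_pbos)  // guard against division by zero
        , issued_in_(num_pbos_, 0) {}

    Step next() {
        Step s{};
        s.slot = static_cast<std::size_t>(frame_ % num_pbos_);
        s.has_result = frame_ >= num_pbos_;
        s.result_frame = s.has_result ? issued_in_[s.slot] : 0;
        issued_in_[s.slot] = frame_;  // the new read into this slot belongs to the current frame
        ++frame_;
        return s;
    }

private:
    std::size_t num_pbos_;
    std::vector<std::uint64_t> issued_in_;  // frame in which each slot's read was triggered
    std::uint64_t frame_ = 0;
};
```

With N = 3, the first three calls to `next()` only trigger reads; from the fourth call on, each frame yields the pixels of the frame that was rendered three frames earlier. That fixed latency of N frames is the price of not stalling.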
So remember that glReadPixels() works differently when a GL_PIXEL_PACK_BUFFER is bound. When a PBO is bound, the glReadPixels() call triggers GL to write the data of the currently bound read framebuffer into the bound PBO. Therefore triggering a write into the PBO looks a bit like this (read on as there is more to it):
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, 0);
Although the above code makes sure that GL writes data into the PBO, we still need a way to actually download the pixels from GPU to CPU. To download the data from a (any) buffer we can use glMapBuffer[Range]() and glUnmapBuffer(). When we call glMapBuffer(), GL returns a pointer to the data of the currently bound PBO. We can use memcpy to copy the data to an internal buffer that we can process.
When we call glMapBuffer(), GL makes sure that all draw calls have finished, just like before when using glReadPixels() without a bound PBO. So in this sense it's doing the same thing. But (!) the big difference is that we can pipeline our calls to glMapBuffer(). We pipeline by calling glMapBuffer() on a PBO whose glReadPixels() was triggered a couple of frames back, so that GL doesn't have to wait until all draw calls are ready, because they will already have been processed.
The image below shows how this should work:
Another important aspect of reading (or writing) data from the GPU is that you need to make sure that you're using a pixel format that is efficient for the GPU. This means you want to use a default format, which in most cases is GL_BGRA. When you don't use a default format, GL needs to convert the data, which takes time and slows down the read back. It may even cause GL to not use an optimized path at all, so that you get the same performance as without PBOs. With that in mind, it doesn't make sense to use PBOs with GL_RGB, for instance (though it may be that your GPU can process GL_RGB quickly). It's best to test a couple of different formats and see what performance you get.
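To compare formats you need some way to time the read back. The helper below is a small sketch using `std::chrono`; `average_ms` is a hypothetical name, and the callable you pass in would wrap your own read back call for one format. Note that it only measures how long the calling thread is blocked, which is the cost that matters for the frame loop, not how long the GPU itself works on the transfer.

```cpp
#include <chrono>
#include <functional>

// Runs `fn` `iterations` times and returns the average wall-clock time per
// call in milliseconds. Returns 0.0 when `iterations` is not positive.
double average_ms(const std::function<void()>& fn, int iterations) {
    if (iterations <= 0) {
        return 0.0;
    }
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}
```

You would call it once per candidate format, for example with a lambda that runs one frame of your download routine, and compare the averages. Skip the first N warm-up frames when measuring a pipelined PBO download, since those frames only trigger reads.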
One last thing I want to share is why you pass zero as the last parameter to glReadPixels(). When a PBO is bound, the value you pass is not a pointer but a byte offset into the currently bound PBO. You can pass other values too, as long as the offset plus the size of the pixel data still fits inside the buffer.
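One use of a non-zero offset is to keep several frames in a single, larger PBO instead of one PBO per frame. The sketch below only shows the offset arithmetic; `slot_offset` and `offset_as_pointer` are hypothetical helpers, and it assumes the PBO was allocated with room for all slots.

```cpp
#include <cstddef>
#include <cstdint>

// Byte offset of frame slot `slot` inside one large PBO, where every slot
// holds `frame_bytes` bytes.
inline std::size_t slot_offset(std::size_t slot, std::size_t frame_bytes) {
    return slot * frame_bytes;
}

// glReadPixels() takes its last argument as a pointer. With a PBO bound, GL
// interprets that pointer's value as a byte offset, so the offset is passed
// by casting the integer to a pointer. It is never dereferenced by us.
inline void* offset_as_pointer(std::size_t offset) {
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(offset));
}
```

A read into slot `i` would then look like `glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, offset_as_pointer(slot_offset(i, nbytes)))`, with `nbytes` being the size of one frame as in the code below.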
The code below comes from a test I did with PBOs. Use it as inspiration :). Note that there is a bit more to it and that there are more ways to download from the GPU in an optimized way: some GPUs support simultaneous upload and download, some have extensions that help, and some GPUs work better with PBOs than others.
    #ifndef POLY_PBO_DOWNLOADER_H
    #define POLY_PBO_DOWNLOADER_H

    #define __STDC_LIMIT_MACROS
    #include <stdint.h>
    #include <glad/glad.h>

    namespace poly {

      class PboDownloader {
      public:
        PboDownloader();
        ~PboDownloader();
        int init(GLenum fmt, int w, int h, int num);
        void download();

      public:
        GLenum fmt;
        GLuint* pbos;
        uint64_t num_pbos;
        uint64_t dx;
        uint64_t num_downloads;
        int width;
        int height;
        int nbytes;              /* number of bytes in the pbo buffer. */
        unsigned char* pixels;   /* the downloaded pixels. */
      };

    } /* namespace poly */

    #endif
    #include <string.h> /* memcpy() */
    #include <poly/Log.h>
    #include <poly/PboDownloader.h>
    #include <poly/Timer.h>

    namespace poly {

      PboDownloader::PboDownloader()
        :fmt(0)
        ,pbos(NULL)
        ,num_pbos(0)
        ,dx(0)
        ,num_downloads(0)
        ,width(0)
        ,height(0)
        ,nbytes(0)
        ,pixels(NULL)
      {
      }

      PboDownloader::~PboDownloader() {

        if (NULL != pbos) {
          /* Assumes the GL context that created the buffers is still current. */
          glDeleteBuffers((GLsizei)num_pbos, pbos);
          delete[] pbos;
          pbos = NULL;
        }

        if (NULL != pixels) {
          delete[] pixels;
          pixels = NULL;
        }
      }

      int PboDownloader::init(GLenum format, int w, int h, int num) {

        if (NULL != pbos) {
          SX_ERROR("Already initialized. Not necessary to initialize again; or shutdown first.");
          return -1;
        }

        if (0 >= num) {
          SX_ERROR("Invalid number of PBOs: %d", num);
          return -2;
        }

        if (num > 10) {
          SX_WARNING("Asked to create more than 10 buffers; that is probably a bit too much.");
        }

        fmt = format;
        width = w;
        height = h;
        num_pbos = num;

        /* Assumes tightly packed rows. With the default GL_PACK_ALIGNMENT of 4,
           1 and 3 byte formats need a width that is a multiple of 4. */
        if (GL_RED == fmt || GL_GREEN == fmt || GL_BLUE == fmt) {
          nbytes = width * height;
        }
        else if (GL_RGB == fmt || GL_BGR == fmt) {
          nbytes = width * height * 3;
        }
        else if (GL_RGBA == fmt || GL_BGRA == fmt) {
          nbytes = width * height * 4;
        }
        else {
          SX_ERROR("Unhandled pixel format, use GL_RED, GL_GREEN, GL_BLUE, GL_RGB, GL_BGR, GL_RGBA or GL_BGRA.");
          return -3;
        }

        if (0 == nbytes) {
          SX_ERROR("Invalid width or height given: %d x %d", width, height);
          return -4;
        }

        pbos = new GLuint[num];
        if (NULL == pbos) {
          SX_ERROR("Cannot allocate pbos.");
          return -5;
        }

        pixels = new unsigned char[nbytes];
        if (NULL == pixels) {
          SX_ERROR("Cannot allocate pixel buffer.");
          return -6;
        }

        glGenBuffers(num, pbos);
        for (int i = 0; i < num; ++i) {
          SX_VERBOSE("pbodownloader.pbos[%d] = %u, nbytes: %d", i, pbos[i], nbytes);
          glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
          glBufferData(GL_PIXEL_PACK_BUFFER, nbytes, NULL, GL_STREAM_READ);
        }

        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

        return 0;
      }

      void PboDownloader::download() {

        unsigned char* ptr = NULL;
        uint64_t start_ns = nanos();
        uint64_t end_ns = 0;
        uint64_t delta_ns = 0;

    #define USE_PBO 1
    #if USE_PBO

        if (num_downloads < num_pbos) {
          /* First we trigger a read into each of our pbos, so that glMapBuffer()
             later reads from the oldest pending one first. */
          glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[dx]);
          /* When a GL_PIXEL_PACK_BUFFER is bound, the last 0 is used as offset into the buffer to read into. */
          glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, 0);
          SX_DEBUG("glReadPixels() with pbo: %u", pbos[dx]);
        }
        else {
          SX_DEBUG("glMapBuffer() with pbo: %u", pbos[dx]);

          /* Read from the oldest pending pbo. */
          glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[dx]);

          ptr = (unsigned char*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
          if (NULL != ptr) {
            memcpy(pixels, ptr, nbytes);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
          }
          else {
            SX_ERROR("Failed to map the buffer");
          }

          /* Trigger the next read. */
          SX_DEBUG("glReadPixels() with pbo: %u", pbos[dx]);
          glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, 0);
        }

        ++dx;
        dx = dx % num_pbos;

        num_downloads++;
        if (num_downloads == UINT64_MAX) {
          num_downloads = num_pbos;
        }

        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    #else

        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0); /* just make sure we're not accidentally using a PBO. */
        glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, pixels);

    #endif

        end_ns = nanos();
        delta_ns = end_ns - start_ns;
        SX_VERBOSE("Download took: %f ms. ", ((double)delta_ns) / 1000000.0);
      }

    } /* namespace poly */
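One detail the size calculation in init() glosses over is row padding. By default GL_PACK_ALIGNMENT is 4, so GL pads every row it writes to a multiple of 4 bytes, and `width * height * 3` is too small for a GL_RGB read back whose width is not a multiple of 4. The sketch below shows the arithmetic for GL_UNSIGNED_BYTE data; `padded_row_bytes` and `readback_bytes` are hypothetical helpers, and it assumes the other pack parameters (such as GL_PACK_ROW_LENGTH) are left at their defaults.

```cpp
#include <cstddef>

// Size of one row in bytes after rounding up to `pack_alignment`
// (the value of GL_PACK_ALIGNMENT: 1, 2, 4 or 8).
inline std::size_t padded_row_bytes(std::size_t width,
                                    std::size_t bytes_per_pixel,
                                    std::size_t pack_alignment) {
    const std::size_t row = width * bytes_per_pixel;
    return (row + pack_alignment - 1) / pack_alignment * pack_alignment;
}

// Buffer size that is large enough for a full read back of width x height
// pixels. Padding the last row as well gives a safe upper bound.
inline std::size_t readback_bytes(std::size_t width,
                                  std::size_t height,
                                  std::size_t bytes_per_pixel,
                                  std::size_t pack_alignment) {
    return padded_row_bytes(width, bytes_per_pixel, pack_alignment) * height;
}
```

For a 1280 x 720 GL_BGRA frame nothing changes, because each row is already a multiple of 4 bytes. For a 1278 x 720 GL_RGB frame each row grows from 3834 to 3836 bytes. The alternative is to call glPixelStorei(GL_PACK_ALIGNMENT, 1) before reading, which makes the tightly packed size correct.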