Blog of roxlu, co-founder of Apollo Media. Contact info[shift+2]apollomedia.nl.

Fast Pixel Transfers with Pixel Buffer Objects

I often use the GPU to perform fast pixel format conversions. For example, I convert the current framebuffer to YUV420P so I can feed it into a video encoder, since encoders often use this pixel format by default. You can do this on the CPU, though for large (e.g. 4K) video this will add quite a delay.

In this article I'll describe a solution I use to download the contents of a texture from the GPU back to the CPU. The trivial way to do this is to use glReadPixels(). That solution is slow though, because by default glReadPixels() has to wait until all queued draw commands in the GL driver queue are processed before it is able to download the pixels. When using plain glReadPixels() without a PBO, downloading a 1280 x 720 frame takes about 16ms-22ms. A faster solution is to use pixel buffer objects (PBOs), which takes about 0.5ms on a 5+ year old GPU (Radeon 4850).

A pixel buffer object (PBO) is a buffer, just like e.g. a buffer bound to GL_ARRAY_BUFFER. It can be used to store pixel data. When a GL_PIXEL_PACK_BUFFER is bound, a call to glReadPixels() will return immediately. That call does not hand the pixels to you, though: instead of reading into client memory, OpenGL writes (packs) the data into the currently bound PBO. Of course GL still needs to make sure that all drawing calls have been processed before the user actually downloads the pixels. But because PBOs give us a way to pipeline the read backs, we can make sure that we read the pixels after the draw calls have been processed.

But how do we know when the draw calls in the driver command queue have been processed? The common solution is to make use of the fact that the command queue will probably be about 1 or 2 frames behind. This is kind of a 'guess' but one which has been correct in about 99% of the tests I did.

What we want to accomplish is that we read the pixel data from a PBO only after the data has actually been written into it. So we need to read from PBOs that were filled N frames earlier. I'm using N here because you should experiment with a value that fits your GPU, though using 2 or 3 seems like a good choice.
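To make the N-frame lag concrete, here is a small CPU-only sketch of the bookkeeping. It makes no GL calls at all; RingSchedule and step() are made-up names for illustration, and each "slot" stands in for one PBO. Treat it as a way to reason about the schedule, not as the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

/* Hypothetical stand-in for a ring of N PBOs; no OpenGL is involved. */
struct RingSchedule {
  std::vector<std::int64_t> frame_in_slot; /* frame stored in each slot, -1 = still empty. */
  std::size_t dx;                          /* slot that is used for the current frame. */

  explicit RingSchedule(std::size_t n) : frame_in_slot(n, -1), dx(0) {
    assert(n > 0);
  }

  /* Call once per frame. Returns the number of the frame whose pixels
     would be mapped and copied now, or -1 while the ring is filling up. */
  std::int64_t step(std::int64_t frame) {
    std::int64_t ready = frame_in_slot[dx]; /* was written N frames ago (or never). */
    frame_in_slot[dx] = frame;              /* stands in for glReadPixels() into this slot. */
    dx = (dx + 1) % frame_in_slot.size();
    return ready;
  }
};
```

With three slots, step() returns -1 for frames 0, 1 and 2, then 0 for frame 3, 1 for frame 4, and so on: the pixels you get are always N frames old.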

So remember that glReadPixels() works differently when a GL_PIXEL_PACK_BUFFER is bound. When a PBO is bound the glReadPixels() call will trigger GL to write the data of the currently bound read framebuffer into the bound PBO.

Therefore triggering a write into the PBO looks a bit like this
(read on as there is more to it):

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, 0);

Although the above code makes sure that GL writes data into the PBO, we still need a way to actually download the pixels from GPU to CPU. To download the data from a (any) buffer we can use glMapBuffer[Range]() and glUnmapBuffer(). When we call glMapBuffer(), GL returns a pointer to the data of the currently bound PBO. We can use memcpy to copy the data to an internal buffer that we can process. When glMapBuffer() is called, GL makes sure that all draw calls are finished, just like before when using glReadPixels() without a bound PBO. So in this sense it's doing the same thing. But (!) the big difference is that we can pipeline our calls to glMapBuffer(). We pipeline by calling glMapBuffer() with a bound PBO for which glReadPixels() was triggered a couple of frames back, so that GL doesn't have to wait until all draw calls are ready because they will already be processed. The image below shows how this should work:

Another important aspect of reading (or writing) data from the GPU is that you need to make sure that you're using a pixel format that is efficient for the GPU. This means you want to use a default format, which in most cases is GL_BGRA. When you don't use a default format, GL will need to convert the data, which takes time and will slow down the read back, or even cause GL to not use an optimized path at all so that you get the same performance as you get without using PBOs. So with that in mind, it doesn't make sense to use PBOs with GL_RGB for instance (though it may be that your GPU can process GL_RGB quickly). It's best to test a couple of different formats and see what performance you get.
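Whichever formats you test, the PBO has to be big enough for what GL packs into it. The sketch below computes a safe buffer size for GL_UNSIGNED_BYTE data. It also accounts for GL_PACK_ALIGNMENT, which defaults to 4 and pads every row to a multiple of that value; this matters for 3-byte formats when the width is not a multiple of 4. The Format enum and the function names are hypothetical stand-ins so the example compiles without GL headers.

```cpp
#include <cassert>
#include <cstddef>

/* Hypothetical stand-in for the GL format enums (GL_RED, GL_RGB, ...). */
enum class Format { Red, RGB, BGR, RGBA, BGRA };

/* Bytes per pixel, assuming GL_UNSIGNED_BYTE components. */
inline std::size_t bytes_per_pixel(Format f) {
  switch (f) {
    case Format::Red:  return 1;
    case Format::RGB:
    case Format::BGR:  return 3;
    case Format::RGBA:
    case Format::BGRA: return 4;
  }
  return 0;
}

/* Size of one row, rounded up to the pack alignment (1, 2, 4 or 8). */
inline std::size_t row_stride(Format f, std::size_t width, std::size_t alignment = 4) {
  assert(alignment == 1 || alignment == 2 || alignment == 4 || alignment == 8);
  std::size_t row = width * bytes_per_pixel(f);
  return (row + alignment - 1) / alignment * alignment;
}

/* Upper bound for the number of bytes a read of width x height pixels needs. */
inline std::size_t buffer_size(Format f, std::size_t width, std::size_t height,
                               std::size_t alignment = 4) {
  return row_stride(f, width, alignment) * height;
}
```

For a 1280 x 720 frame the padding makes no difference, because both 1280 * 3 and 1280 * 4 are multiples of 4. For a width like 1281 with a 3-byte format, each row grows from 3843 to 3844 bytes.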

One last thing I want to share is that you pass zero as the last parameter to glReadPixels(). You can pass other values there too: when a PBO is bound, the value is used as a byte offset into the currently bound PBO, so it has to leave enough room in the buffer for the pixels being read.
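One possible use for a non-zero offset, sketched here as an assumption and not something taken from my test code: storing the three planes of a YUV420P frame in a single PBO, with one glReadPixels() call per plane. PlaneLayout and yuv420p_layout() are hypothetical names, and the sketch assumes one byte per pixel and tightly packed rows (row sizes that are a multiple of GL_PACK_ALIGNMENT).

```cpp
#include <cassert>
#include <cstddef>

/* Hypothetical layout of a YUV420P frame in one buffer: Y, then U, then V. */
struct PlaneLayout {
  std::size_t y_offset; /* byte offsets, to be passed as the last */
  std::size_t u_offset; /* argument of glReadPixels() while the   */
  std::size_t v_offset; /* PBO is bound.                          */
  std::size_t total;    /* number of bytes the PBO needs to hold. */
};

inline PlaneLayout yuv420p_layout(std::size_t width, std::size_t height) {
  /* Even dimensions keep the chroma planes at exactly a quarter of the luma plane. */
  assert(width % 2 == 0 && height % 2 == 0);
  std::size_t y_size = width * height;
  std::size_t c_size = (width / 2) * (height / 2);
  PlaneLayout layout;
  layout.y_offset = 0;
  layout.u_offset = y_size;
  layout.v_offset = y_size + c_size;
  layout.total = y_size + 2 * c_size;
  return layout;
}
```

For 1280 x 720 this gives offsets 0, 921600 and 1152000 and a total of 1382400 bytes, which is 1.5 bytes per pixel.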

The code below comes from a test I did with PBOs. Use it as inspiration :). Note that there is a bit more to it and that there are more ways to download from the GPU in an optimized way: e.g. some GPUs support simultaneous upload / download or have extensions that help, and some GPUs work better with PBOs than others.

#ifndef POLY_PBO_DOWNLOADER_H
#define POLY_PBO_DOWNLOADER_H
 
#define __STDC_LIMIT_MACROS
#include <stdint.h>
#include <glad/glad.h>
 
namespace poly {
 
  class PboDownloader {
  public:
    PboDownloader();
    ~PboDownloader();
    int init(GLenum fmt, int w, int h, int num);
    void download();
 
  public:
    GLenum fmt;
    GLuint* pbos;
    uint64_t num_pbos;
    uint64_t dx;
    uint64_t num_downloads;
    int width;
    int height;
    int nbytes; /* number of bytes in the pbo buffer. */
    unsigned char* pixels; /* the downloaded pixels. */
  };
 
} /* namespace poly */
 
#endif

#include <string.h> /* memcpy(), needed on every platform and not only on Linux. */
#include <poly/Log.h>
#include <poly/PboDownloader.h>
#include <poly/Timer.h>
 
namespace poly {
 
  PboDownloader::PboDownloader() 
    :fmt(0)
    ,pbos(NULL)
    ,num_pbos(0)
    ,dx(0)
    ,num_downloads(0)
    ,width(0)
    ,height(0)
    ,nbytes(0)
    ,pixels(NULL)
  {
  }
 
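  PboDownloader::~PboDownloader() {
    if (NULL != pbos) {
      /* Needs a current GL context; releases the buffers created in init(). */
      glDeleteBuffers((GLsizei)num_pbos, pbos);
      delete[] pbos;
      pbos = NULL;
    }
    if (NULL != pixels) {
      delete[] pixels;
      pixels = NULL;
    }
  }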
 
  int PboDownloader::init(GLenum format, int w, int h, int num) {
 
    if (NULL != pbos) {
      SX_ERROR("Already initialized. Not necessary to initialize again; or shutdown first.");
      return -1;
    }
 
    if (0 >= num) {
      SX_ERROR("Invalid number of PBOs: %d", num);
      return -2;
    }
 
    if (num > 10) {
      SX_WARNING("Asked to create more than 10 buffers; that is probably a bit too much.");
    }
 
    fmt = format;
    width = w;
    height = h;
    num_pbos = num;
 
    if (GL_RED == fmt || GL_GREEN == fmt || GL_BLUE == fmt) {
      nbytes = width * height;
    }
    else if (GL_RGB == fmt || GL_BGR == fmt) {
      nbytes = width * height * 3;
    }
    else if (GL_RGBA == fmt || GL_BGRA == fmt) {
      nbytes = width * height * 4;
    }
    else {
      SX_ERROR("Unhandled pixel format, use GL_RED, GL_GREEN, GL_BLUE, GL_RGB, GL_BGR, GL_RGBA or GL_BGRA.");
      return -3;
    }
 
    if (0 == nbytes) {
      SX_ERROR("Invalid width or height given: %d x %d", width, height);
      return -4;
    }
 
    pbos = new GLuint[num];
    if (NULL == pbos) {
      SX_ERROR("Cannot allocate pbos.");
      return -3;
    }
 
    pixels = new unsigned char[nbytes];
    if (NULL == pixels) {
      SX_ERROR("Cannot allocate pixel buffer.");
      return -5;
    }
 
    glGenBuffers(num, pbos);
    for (int i = 0; i < num; ++i) {
 
      SX_VERBOSE("pbodownloader.pbos[%d] = %u, nbytes: %d", i, pbos[i], nbytes);
 
      glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
      glBufferData(GL_PIXEL_PACK_BUFFER, nbytes, NULL, GL_STREAM_READ);
    }
 
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
 
    return 0;
  }
 
  void PboDownloader::download() {
    unsigned char* ptr;
    uint64_t start_ns = nanos();
    uint64_t end_ns = 0;
    uint64_t delta_ns = 0;
 
#define USE_PBO 1
#if USE_PBO
 
    if (num_downloads < num_pbos) {
      /* 
         First trigger a glReadPixels() into every pbo once, so that the 
         glMapBuffer() calls below always read from the pbo that was filled 
         longest ago. 
      */
      glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[dx]);
      glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, 0);   /* When a GL_PIXEL_PACK_BUFFER is bound, the last 0 is used as offset into the buffer to read into. */
      SX_DEBUG("glReadPixels() with pbo: %d", pbos[dx]);
    }
    else {
 
      SX_DEBUG("glMapBuffer() with pbo: %d", pbos[dx]);
 
      /* Read from the pbo that was filled longest ago. */
      glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[dx]);
 
      ptr = (unsigned char*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
      if (NULL != ptr) {
        memcpy(pixels, ptr, nbytes);
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
      }
      else {
        SX_ERROR("Failed to map the buffer");
      }
 
      /* Trigger the next read. */
      SX_DEBUG("glReadPixels() with pbo: %d", pbos[dx]);
      glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, 0);
    }
 
    ++dx;
    dx = dx % num_pbos;
 
    num_downloads++;
    if (num_downloads == UINT64_MAX) {
      num_downloads = num_pbos;
    }
 
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
#else
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0); /* just make sure we're not accidentally using a PBO. */
    glReadPixels(0, 0, width, height, fmt, GL_UNSIGNED_BYTE, pixels);
#endif
 
    end_ns = nanos();
 
    delta_ns = end_ns - start_ns;
    SX_VERBOSE("Download took: %f ms. ", ((double)delta_ns) / 1000000.0);
  }
 
} /* namespace poly */