Channel: OpenCV Q&A Forum - RSS feed

Fastest way to convert BGR <-> RGB! Aka: Do NOT use Numpy magic "tricks".

*I was reading this question: https://answers.opencv.org/question/188664/colour-conversion-from-bgr-to-rgb-in-cv2-is-slower-than-on-python/ and it didn't explain things very well at all. So here's a deep examination and explanation for everyone's future reference!*

**Converting RGB to BGR (and vice versa) is one of the most important operations you can do in OpenCV if you're interoperating with other libraries, raw system memory, etc., since each imaging library depends on its own special channel order.**

There are many ways to achieve the conversion, and `cv2.cvtColor()` is often frowned upon because there are "much faster" ways to do it via numpy "view" manipulation.

Whenever you convert colors in OpenCV, you actually invoke a huge machinery:

https://github.com/opencv/opencv/blob/8c0b0714e76efef4a8ca2a7c410c60e55c5e9829/modules/imgproc/src/color.cpp#L20-L25

https://github.com/opencv/opencv/blob/8b541e450b511fde9dd363fa55a30fbb6fc0ace6/modules/imgproc/src/color_rgb.dispatch.cpp#L426-L437

As you can see, internally OpenCV creates an OpenCL kernel with the instructions for the data transformation (falling back to optimized CPU code when OpenCL isn't available) and then runs it. This produces brand-new (re-arranged) image data in memory, which is of course a relatively slow operation, involving a new memory allocation and a data copy.

However, there is another way to flip between RGB and BGR channel orders, which is very popular - and very bad (as you'll find out soon). And that is: using numpy's built-in methods for manipulating the array data. Note that there are two ways to manipulate data in Numpy:

- The first way, the bad way, just changes the "view" of the Numpy array and is therefore instant (`O(1)`), but does NOT transform the underlying `img.data` in RAM/memory.
This means that the raw memory does NOT contain the new channel order; Numpy instead "fakes" it by creating a "view" which basically says "when we read this data from RAM, interpret it as R=B, G=G, B=R". (Technically speaking, it changes the `.strides` property of the Numpy object: instead of saying "read R, then G, then B" (channel stride `1`, i.e. going forwards in RAM when reading the color channels), it now says "read B, then G, then R" (channel stride `-1`, i.e. going backwards in RAM).)
- The second way, which is totally fine, is to *always ensure* that the pixel data is properly arranged in memory too. That is a lot slower, but is *almost always necessary*, depending on which library/API your data will be sent to!

To determine whether a numpy array manipulation has also changed the underlying MEMORY, look at the `img.flags['C_CONTIGUOUS']` value. If `True`, the data in RAM is in the correct order (that's great!). If `False`, the data in RAM is in the wrong order and we are "cheating" via a numpy view instead (that's BAD!). Whenever you use the view-based methods to flip channels in an ndarray (such as `RGB -> BGR`), its `C_CONTIGUOUS` flag becomes `False`. If you then flip the image's channels again (such as `BGR -> back to RGB`), its `C_CONTIGUOUS` becomes `True` again. So the view can be transformed multiple times, and the contiguous flag only says `True` when the view happens to match the actual RAM layout.

So... in which situations do you need the data to ALWAYS be contiguous? Well, it varies by API...

- OpenCV APIs *all* need data in **contiguous** order. If you give OpenCV non-contiguous data from Python, the Python-to-OpenCV wrapper layer *internally* makes a COPY of the badly formatted image you gave it (materializing your view into contiguous RAM), and THEN finally passes the copied-and-contiguous image to the internal OpenCV C++ function.
This is of course very wasteful!
- Matplotlib APIs do not need contiguous data, because they have stride-handling code. But all of their calls are slowed down when given non-contiguous data, [as seen here](https://github.com/scivision/pymap3d/issues/30#issuecomment-537663693).
- Other libraries: it depends on the library. Some of them do something like "take the `img.data` memory address and give it to a raw Windows API via a COM call", in which case YES, the RAM data MUST be contiguous too.

**What type of data do YOU need?**

If you want the SAFEST possible data that is 100% sure to work in ANY API ANYWHERE, you should always produce CONTIGUOUS pixel data. The up-front conversion doesn't take long, since we're still talking about very fast operations!

There are situations where non-contiguous data is fine, such as when you do all image manipulation purely in Numpy math without any library APIs (in which case there's no real reason to make the RAM layout contiguous). But as soon as you invoke various library APIs, you should *pretty much always* have contiguous data, otherwise you'll create huge performance issues (or even completely incorrect results). I'll explain those performance issues further down, but first let's look at the various "conversion techniques" people use in Python.

**Techniques**

Without further ado, here are all the ways people convert back and forth between RGB and BGR in Python. These benchmarks are on a 4K image (3840x2160):

- Always Contiguous: No. Method: `x = x[...,::-1]`. Speed: `237 nsec (aka 0.237 usec aka 0.000237 msec) per call`
- Always Contiguous: Yes. Method: `x = x[...,::-1].copy()`. Speed: `37.5 msec per call`
- Always Contiguous: No. Method: `x = x[:, :, [2, 1, 0]]`. Speed: `12.6 msec per call`
- Always Contiguous: Yes. Method: `x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR)`. Speed: `5.39 msec per call`
- Always Contiguous: No.
Method: `x = np.fliplr(x.reshape(-1,3)).reshape(x.shape)`. Speed: `1.62 usec (aka 0.00162 msec) per call`
- Always Contiguous: Yes. Method: `x = np.fliplr(x.reshape(-1,3)).reshape(x.shape).copy()`. Speed: `37.4 msec per call`
- Always Contiguous: No. Method: `x = np.flip(x, axis=2)`. Speed: `2.74 usec (aka 0.00274 msec) per call`
- Always Contiguous: Yes. Method: `x = np.flip(x, axis=2).copy()`. Speed: `37.5 msec per call`
- Always Contiguous: Yes. Method: `r = x[..., 0].copy(); x[..., 0] = x[..., 2]; x[..., 2] = r`. Speed: `21.8 msec per call`
- Always Contiguous: Yes. Method: `x[:, :, [0, 2]] = x[:, :, [2, 0]]`. Speed: `21.7 msec per call`
- Always Contiguous: Yes. Method: `x[..., [0, 2]] = x[..., [2, 0]]`. Speed: `21.8 msec per call`
- Always Contiguous: Yes. Method: `x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]`. Speed: `33.1 msec per call`
- Always Contiguous: Yes. Method: `x[:, :] = x[:, :, [2, 1, 0]]`. Speed: `49.3 msec per call`
- Always Contiguous: Yes. Method: `foo = x.copy()`. Speed: `11.8 msec per call`

(That last example doesn't change the RGB/BGR channel order at all; it's just included as a reference, to show how slow Numpy is at doing a *super simple* copy of an already-contiguous chunk of RAM. Even when the data is already in the proper order, Numpy's copy is very slow... And if `x` had been non-contiguous, it would be even slower, as shown by the `x = x[...,::-1].copy()` entry near the top of the list (equivalent to saying `bar = x[...,::-1]; foo = bar.copy()`), which took `37.5 msec` and demonstrates Numpy copying non-contiguous RAM (a numpy "view" marked as "read in reverse order" via stride `-1`) into contiguous RAM.)

PS: Whenever we want contiguous data from numpy, we mostly use `x.copy()` to tell Numpy to allocate new RAM and copy all the data into it in the correct (contiguous) order.
There's also an `np.ascontiguousarray(x)` API, but it does the *exact* same thing (it copies too, but only when the Numpy data isn't already contiguous) and requires much more typing. ;-)

And in a *few* of the examples we use special indexing (such as `x[:, :, [0, 1, 2]] = x[:, :, [2, 1, 0]]`) to overwrite the memory directly, which always produces contiguous memory with correct strides, and is faster than telling Numpy to do a `.copy()`, but is still *extremely* slow compared to `cv2.cvtColor()`.

Docs for the various Numpy functions: [copy](https://docs.scipy.org/doc/numpy/reference/generated/numpy.copy.html), [ascontiguousarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ascontiguousarray.html), [fliplr](https://docs.scipy.org/doc/numpy/reference/generated/numpy.fliplr.html#numpy.fliplr), [flip](https://docs.scipy.org/doc/numpy/reference/generated/numpy.flip.html#numpy.flip), [reshape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html)

Here's the benchmark that was used:

`python -m timeit -s "import numpy as np; import cv2; x = np.zeros([2160,3840,3], np.uint8); x[:,:,2] = 255; x[:,:,1] = 100" "ALGORITHM HERE"`

Replace the `"ALGORITHM HERE"` part with one of the algorithms above, such as `"x = np.flip(x, axis=2).copy()"`.

**People's Misunderstandings of those Benchmarks**

Alright, so we're finally getting to the whole purpose of this article! When people see the benchmarks above, they usually think: "Oh my god, `x = x[...,::-1]` executes in `0.000237` milliseconds, and `x = cv2.cvtColor(x, cv2.COLOR_RGB2BGR)` executes in `5.39` milliseconds, which is **about 22,700 times slower!!**" And then they decide to always use "Numpy view manipulation" for their channel conversions. That's a huge mistake.
And here's why:
- When you call an OpenCV API from Python and pass it a `numpy.ndarray` image object, there's a process which prepares that data for internal usage within OpenCV (since OpenCV itself doesn't use `ndarray` internally; it uses `cv::Mat`).
- First, your Python object (coming through the `PyOpenCV` module) goes into the appropriate `pyopencv_to()` function, whose purpose is to convert raw Python objects (numbers, strings, `ndarray`, etc.) into something usable by OpenCV internally in C++.
- Your Python object first enters the "full ArgInfo converter" code at https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L249
- That code, in turn, looks at the object and determines whether it's a number, a float, or a tuple... If it's any of those, it does the appropriate conversion. Otherwise it assumes it's a Numpy array, at this line: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L292
- Next, it begins to analyze the Numpy array to determine how to use the data internally. It wants to answer "do we need to copy the data, or can we use it as-is? Do we need to cast the data?", see here: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L300
- It first does some simple checks of whether the number type in the Numpy array is legal. (If the type is illegal, it marks the data as "needs copy" *and* "needs cast".)
- Next, it retrieves the "strides" information from the Numpy array - those simple numbers (such as `-1`) which determine how to read the array (e.g. backwards, in the case of our "fast" numpy-based "channel flipping" code earlier): https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L341-L342
- Then it analyzes the strides for all dimensions of the Numpy array, and if it finds a non-contiguous stride (our "screwed up" data layout caused by those so-called "fast" Numpy view manipulations), it marks the data as "needs copy": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L344-L357
- Next, if "needs copy" is true, it does this horrible thing: https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L367-L374
- As you can see, it calls `PyArray_Cast()` (if casting was needed too) or `PyArray_GETCONTIGUOUS()` (if we only need to make the data contiguous). *Both* of those functions, whichever one is called, generate a brand-new Python Numpy object, with all data COPIED by Numpy into brand-new memory and re-arranged into proper contiguous ordering. That's extremely wasteful! I'll explain more soon, after this walkthrough of what the code does...
- Finally, the code proceeds to create a `cv::Mat` object whose data pointer points at the internal byte array (RAM) of the Numpy object, i.e. the RAM address you can easily see in Python by typing `img.data`. That's an incredibly fast operation, because it just stores a pointer which says "use the existing RAM data owned by Numpy at RAM address XYZ": https://github.com/opencv/opencv/blob/778f42ad34559451d62ac9ba585717aec77fb23a/modules/python/src2/cv2.cpp#L415-L416

So, can you spot the problem yet? When you pass a contiguous Numpy array, the conversion into OpenCV is pretty much INSTANT: "This data looks fine!
Just give its RAM address to `cv::Mat` and voila!". But when you instead insist on using those so-called "fast" channel transformations, where you "tweak" the Numpy array's view and stride values, you are giving OpenCV a Numpy array with non-contiguous RAM and bad "strides". The PyOpenCV layer (the wrapper between OpenCV and Python) detects this problem, and creates a BRAND NEW, COPIED, RE-ARRANGED (CONTIGUOUS) NUMPY ARRAY. This is VERY, VERY SLOW. In other words, if you've used those dumb Numpy "view" manipulation tricks, *EVERY* call to an OpenCV API causes a HUGE memory copy (images are large, especially 1080p+ screenshots/video frames), plus a lot of work inside `PyArray_GETCONTIGUOUS` / `PyArray_Cast` to create that new object while respecting your tweaked "strides". Your code won't be faster at all. It will be SLOWER!

**Demonstration of the Slowness**

Let's use a random OpenCV API to demonstrate the slowdown caused by all of those conversions. We'll use `cv2.imshow` here, but *any* OpenCV API call does the same "Python to OpenCV" conversion of the numpy data, so the exact API doesn't matter. They will *all* have this overhead. Here's the example code:

```python
import cv2
import numpy as np
import time

#img1 = cv2.imread("yourimage.png") # If you want to test with an image.
img1 = np.zeros([2160,3840,3], np.uint8) # Create a 4K image.
img1[:,:,2] = 255; img1[:,:,1] = 100 # Fill the channels with different values.
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data.
print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)

# NOTE: cv2.namedWindow returns None in Python, so keep the window NAME
# around and pass that to cv2.imshow.
wnd = ""
cv2.namedWindow(wnd, cv2.WINDOW_NORMAL)

def show1():
    cv2.imshow(wnd, img1)

def show2():
    cv2.imshow(wnd, img2)

iterations = 20

start = time.perf_counter()
for i in range(0, iterations):
    show1()
elapsed1 = (time.perf_counter() - start) * 1000
elapsed1_percall = elapsed1 / iterations

start = time.perf_counter()
for i in range(0, iterations):
    show2()
elapsed2 = (time.perf_counter() - start) * 1000
elapsed2_percall = elapsed2 / iterations

# We know that the contiguous (img1) data does not need conversion,
# which tells us that the runtime of the contiguous data is the
# "internal work of the imshow" function. We only want to measure
# the conversion time for non-contiguous data. So we'll subtract
# the first image's (contiguous) runtime from the non-contiguous time.
noncontiguous_overhead_per_call = elapsed2_percall - elapsed1_percall
print("Extra time taken per OpenCV call when given non-contiguous data (in ms):", noncontiguous_overhead_per_call, "ms")
```

The results:

```
img1 contiguous True img2 contiguous False
img1 strides (11520, 3, 1) img2 strides (11520, 3, -1)
Extra time taken per OpenCV call when given non-contiguous data (in ms): 39.45334999999999 ms
```

As you can see, the extra time added to an OpenCV call when copy-conversion is needed (`39.45 ms`) is pretty much the same as calling Numpy's own `img.copy()` on a "flipped view" inside Python itself (as seen in the earlier benchmark entry "Method: `x = x[...,::-1].copy()`. Speed: `37.5 msec per call`"). So yes, *every* time you call OpenCV with a non-contiguous Numpy array as its argument, you are causing a Numpy `.copy()` to happen internally!
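One cheap way to protect yourself from this hidden cost is to normalize arrays once, at the boundary where your code hands them to OpenCV. Here's a minimal numpy-only sketch; the helper name `ensure_contiguous` is my own invention, not an OpenCV API:

```python
import numpy as np

def ensure_contiguous(img: np.ndarray) -> np.ndarray:
    """Return `img` unchanged if its RAM layout is already C-contiguous;
    otherwise pay the copy cost ONCE here, instead of silently paying it
    inside every single OpenCV call."""
    if img.flags['C_CONTIGUOUS']:
        return img
    return np.ascontiguousarray(img)

img = np.zeros((2160, 3840, 3), np.uint8)   # contiguous 4K image
flipped = img[..., ::-1]                    # "fast" view: channel stride -1

print(flipped.flags['C_CONTIGUOUS'])        # False
fixed = ensure_contiguous(flipped)
print(fixed.flags['C_CONTIGUOUS'])          # True
print(ensure_contiguous(img) is img)        # True: no copy if already fine
```

After this, every OpenCV call that receives `fixed` pays zero internal conversion overhead, no matter how many calls you make.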
PS: If we repeat the same test above with 1920x1080 test data instead of 4K test data, we get `Extra time taken per OpenCV call when given non-contiguous data (in ms): 9.972125 ms`, which means that at the world's most popular image resolution (1080p), you're still adding around 10 milliseconds of overhead to *all* of your OpenCV calls.

**Numpy "tricks" will cause subtle Bugs too!**

Using those Numpy "tricks" isn't *just* extremely slow. It will cause *very subtle bugs* in your code, *too*. Look at this code and see if you can figure out the bug yourself before you run this example:

```python
import cv2
import numpy as np

img1 = np.zeros([200,200,3], np.uint8) # Create a 200x200 image. (Is Contiguous)
img2 = img1[...,::-1] # Make a "channel flipped view" of the Numpy data. (A Non-Contiguous View)
print("img1 contiguous", img1.flags['C_CONTIGUOUS'], "img2 contiguous", img2.flags['C_CONTIGUOUS'])
print("img1 strides", img1.strides, "img2 strides", img2.strides)
cv2.rectangle(img2, (80,80), (120,120), (255,255,255), 2)
cv2.imshow("", img2)
cv2.waitKey(0) # Needed so the window actually renders.
```

What do you think the result will be when running this program? Logically, you expect to see a black image with a white rectangle in the middle... But instead, you see *nothing* except a black image. Why? Well, it's simple... think about what was explained earlier about **how** `PyOpenCV` converts every incoming `numpy.ndarray` object into an internal C++ `cv::Mat` object. In this example, we're giving a *non-contiguous* `ndarray` as an argument to `cv2.rectangle()`, which causes `PyOpenCV` to "fix" the data by making a temporary, internal, *contiguous* `.copy()` of the image data, and then wrapping the *copy*'s memory address in a `cv::Mat`. Next, it passes that `cv::Mat` object to the internal C++ "draw rectangle" function, which dutifully draws a rectangle onto the memory pointed to by the `cv::Mat` object... which is... the memory of the temporary internal *copy* of your input array, since a copy had to be created...
So, OpenCV happily writes a rectangle to the temporary object *copy*. And then, when execution returns to Python, you of course see NO RECTANGLE, since *nothing* was drawn to *your* actual `ndarray` data in RAM (its memory storage was non-contiguous and therefore *not usable as-is* by OpenCV). If you want to see what the code above *should* be doing, simply add `img2 = img2.copy()` immediately above the `cv2.rectangle` call. That makes the `img2` ndarray contiguous in memory, so OpenCV won't need to make a copy of it (and will be able to use *that* exact object's memory internally, as intended)... After that tweak, you'll see OpenCV properly drawing the rectangle onto the image... This is the kind of subtle bug that is very easy to cause when you're playing around with faked Numpy "views" rather than *real* contiguous memory.

**Bonus: A note about Numpy "slices"**

Numpy allows you to efficiently "slice" arrays, to extract a "partial view" of the data. This is very useful for images, since you can do things such as extracting a 100x100 pixel square from the middle of an image. The slicing syntax is `img_sliced = img[y1:y2,x1:x2]`. This produces a fully usable Numpy object which points at the data of the original image (they share the same memory), but which only covers the sub-range you wanted. It's therefore super fast (the slice is just a small object that points at a range of the original array's data and says how to interpret it), and you can use it in any context where you'd pass an image, such as to an OpenCV function, which will then only operate on the sliced segment of RAM. That's really useful! However, be aware that Numpy slices inherit the `strides` of the object they were sliced from (and, with them, any contiguity problems)!
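A quick numpy-only check makes that inheritance visible (the image sizes here are arbitrary):

```python
import numpy as np

img = np.zeros((500, 500, 3), np.uint8)  # contiguous image
bad = img[..., ::-1]                     # non-contiguous channel-flipped view

s1 = img[0:100, 0:100]   # slice taken from the contiguous image
s2 = bad[0:100, 0:100]   # slice taken from the flipped view

print(s1.strides)   # (1500, 3, 1)  -> positive strides, packed pixels
print(s2.strides)   # (1500, 3, -1) -> inherits the reversed channel stride
print(np.shares_memory(s1, img))   # True: slices share the original's RAM
```

The slice of the flipped view carries the `-1` channel stride with it, so handing `s2` to OpenCV triggers the same internal copy as handing it the whole flipped image.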
So if you slice a non-contiguous array, you'll generate a non-contiguous slice object too, which is horrible and has *all* the issues of non-contiguous objects. It's *only* safe to make partial views/slices (like `img[0:100, 0:100]`) when `img` itself is already PROVEN to be fully contiguous (with no "Numpy tricks" applied to it). In that case, feel free to pass your partial image slices to OpenCV functions; you won't invoke any copy-mechanics! (Note that such a 2D crop reports `C_CONTIGUOUS : False` itself, since its rows are no longer adjacent in RAM, but OpenCV can still use it as-is: `cv::Mat` supports a row "step", so what matters to the wrapper is that the strides are positive and the pixels within each row are packed.)

Alternatively, if you already have a non-contiguous image array and you want to slice it, it's faster to *slice first* and *then* make the *slice* contiguous, since that means less data copying (for example, a 100x100 slice of a 4K image needs much less copying "to make contiguous" than the whole image would). By slicing first and then making a contiguous copy of the slice, you ensure that your slice is contiguous and safe to use with OpenCV. As an example, let's say that `xyz` is a non-contiguous image; in that case, the technique would look like `slice = xyz[0:100, 0:100].copy()` (create a non-contiguous slice "view" of the non-contiguous image, then force a copy, which creates a new contiguous array based on the slice's view). Alternatively, if you *don't know* whether the image you're slicing from is already contiguous, you can use `slice = np.ascontiguousarray(xyz[0:100, 0:100])` (creates a slice "view", then returns that view's array as-is if it's already contiguous, or otherwise copies the data to a new contiguous array and returns that instead).

**Bonus: What to do when you get a non-contiguous ndarray from a library?**

As an example, the very cool `D3DShot` [library](https://github.com/SerpentAI/D3DShot) has an optional `numpy` mode where it retrieves the screenshots as `ndarray` objects.
The problem is that D3DShot generates them from RAM data laid out in a different order, so it tweaks the `ndarray` strides etc. to give us an object of the proper "shape" `(height, width, 3 color channels in RGB order)`. Its `.flags` property shows that Contiguous is FALSE. So what do you do? If you pass that directly to OpenCV, you'll invoke the heavy `PyOpenCV` copy-mechanics described earlier. Well, you have two options.

In this example case, the colors are in RGB order, and you want them to be BGR for usage in OpenCV. So you should invoke `cv2.cvtColor`, which internally triggers the Numpy `.copy()` for you (just like all OpenCV APIs do when given non-contiguous data) and then changes the color order in RAM for you.

The second option is for when you have Numpy data that is already in the correct color order (such as BGR), but whose RAM is non-contiguous. In that case, you should *directly* invoke `img = img.copy()` to tell Numpy to make a contiguous copy of the array, to fix it. Then you're welcome to use that contiguous copy for everything. Also note that you can use `img = np.ascontiguousarray(img)` instead, if you're not sure whether your library always returns non-contiguous data; it automatically returns the same array if it was already contiguous, or does a `.copy()` if it was non-contiguous.
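You don't need D3DShot installed to see that second option in action. Here's a numpy-only sketch that fakes a non-contiguous capture with a flipped view, and shows that `np.ascontiguousarray` copies only when it must:

```python
import numpy as np

frame = np.zeros((1080, 1920, 3), np.uint8)  # stand-in for a library's capture
noncontig = frame[..., ::-1]                 # fake a non-contiguous frame (a view)

a = np.ascontiguousarray(noncontig)  # non-contiguous input -> new contiguous copy
b = np.ascontiguousarray(frame)      # already contiguous -> same object, no copy

print(a.flags['C_CONTIGUOUS'], np.shares_memory(a, frame))  # True False
print(b is frame)                                           # True
```

So it's safe to call `np.ascontiguousarray` unconditionally on whatever the library hands you: contiguous arrays pass through untouched, and only the problematic ones pay the copy cost.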
Alright, so let's look at the `D3DShot` example:

```python
import cv2
import d3dshot
import time

d = d3dshot.create(capture_output="numpy", frame_buffer_size=60)
img1 = d.screenshot()
img2 = d.screenshot()
print(img1.strides, img1.flags)
print(img2.strides, img2.flags)
print("-------------")

start = time.perf_counter()
img1_justcopy = img1.copy() # copy RGB image to new, contiguous RAM
elapsed = (time.perf_counter() - start) * 1000
print(img1_justcopy.strides, img1_justcopy.flags)
print("justcopy milliseconds:", elapsed)
print("-------------")

start = time.perf_counter()
img1 = img1.copy()
img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img1.strides, img1.flags)
print("copy+cvtColor milliseconds:", elapsed)
print("-------------")

start = time.perf_counter()
img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR) # flip RGB -> BGR
elapsed = (time.perf_counter() - start) * 1000
print(img2.strides, img2.flags)
print("cvtColor milliseconds:", elapsed)
```

Output:

```
(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
(1920, 1, 2073600)   C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
justcopy milliseconds: 9.122899999999989
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
copy+cvtColor milliseconds: 12.177900000000019
-------------
(5760, 3, 1)   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
cvtColor milliseconds: 11.461500000000013
```

These examples are all on my 1920x1080 screen, so they're not directly comparable to the 4K-resolution times we saw in the earlier benchmarks.

Anyway, the first thing we can see is that the two captured images (img1 and img2) coming straight from the `D3DShot` library have very strange `strides` values, and `C_CONTIGUOUS : False`. That's because they are raw RAM given to D3DShot by Windows and then just packaged into an ndarray with custom strides that make it read the raw RAM data in the desired order.

Next, we see that just doing `img1_justcopy = img1.copy()` (which copies the RGB-channeled, non-contiguous RAM into new, contiguous RAM, but does not change the channel order; the image will still be RGB) takes `9.12 ms`, which is indeed how slow Numpy is at copying non-contiguous `ndarray` data into new, contiguous RAM. Basically, internally, Numpy has to do a ton of looping to read the data byte-by-byte while writing each byte into the correct position in the new, contiguous RAM. The PyArray (Numpy) copying of non-contiguous data to contiguous data is always the slowest operation. That's why we want to avoid having non-contiguous RAM in the first place.

We also demonstrated how to "copy AND fix the colors from RGB to BGR" in two different ways. Doing `img1 = img1.copy(); img1 = cv2.cvtColor(img1, cv2.COLOR_RGB2BGR)` takes `12.18 ms`, while letting `cvtColor` trigger the Numpy `.copy` internally, by directly calling `img2 = cv2.cvtColor(img2, cv2.COLOR_RGB2BGR)`, takes `11.46 ms`. The slight difference is of course because two separate function calls involve slightly more work than letting OpenCV do the Numpy copying inside a single call. In both cases, a PyArray (Numpy) copy operation happens internally, to give us a straight, contiguous RAM location. And then that fixed, contiguous ndarray goes through `cvtColor`, which fixes the color channel order.
That gives you the following guidelines for dealing with image data from libraries:

- If your Numpy data is always non-contiguous but is already in the correct channel order (you don't need to convert RGB to/from BGR, etc.): use `img = img.copy()` to force Numpy to make a contiguous copy of the data, which is then usable in all OpenCV calls without any bugs and without causing any slow internal, temporary copying.
- If your Numpy data is SOMETIMES non-contiguous but is already in the correct channel order: use `img = np.ascontiguousarray(img)`, which automatically copies the array to make it contiguous if necessary, or otherwise returns the exact same array (if it was already contiguous).
- If your Numpy data has the wrong color channel order (contiguous or not; it doesn't matter which): use `img = cv2.cvtColor(img, cv2.COLOR_)`, which internally does the `.copy` (only when necessary) slightly more efficiently than two separate Python statements, and then does the color conversion very rapidly with accelerated code.

All of these techniques give you fast, contiguous RAM, in the color arrangement of your choice!

**Conclusions**

Stop using Numpy view manipulations and "tricks". They are not "cool". They lead to SUBTLE BUGS and they are EXTREMELY SLOW. You are slowing down all of your OpenCV API calls by about 40 milliseconds PER CALL (at 4K resolution) or 10 milliseconds PER CALL (at 1920x1080 resolution), since your "cool" data has to be converted internally by OpenCV into PROPER CONTIGUOUS RAM. Those `39.45 ms (@ 4K) or 9.97 ms (@ 1920x1080)` are wasted on EVERY OpenCV call whenever you give OpenCV a non-contiguous image. So if (as people often do) you're passing the image to multiple OpenCV APIs to analyze it in multiple ways, you are causing extreme slowdowns in your code.
Use `cv2.cvtColor()` instead, which does a super-fast, one-time conversion to the proper format using accelerated code. You are guaranteed to get contiguous data which works as-is for EVERY OpenCV call, with no memory copying/conversion needed. And OpenCV's color converter is WAY FASTER than Numpy's internal data copier/converter.

Let's end by imagining a scenario where you use some Python library to capture an RGB 4K screenshot as a numpy array, and you need that data in OpenCV. You think you're clever and write `img = img[...,::-1]` to "turn the RGB data into BGR (which OpenCV needs)", thinking "Wow, my code is so fast! That RGB-to-BGR operation only took `0.000237 ms`!"... And *then* you call five different OpenCV functions to analyze that screenshot-image in various ways. Well, since you're causing one internal Numpy copy-conversion-to-contiguous PER CALL, you're now paying `5 * 39.45 = 197.25 ms` of total conversion overhead, just to turn your "clever" Numpy view into a proper contiguous memory stream over and over again. Does it still sound "slow" to do a single, one-time `5.39 ms (@ 4K) or 1.53 ms (@ 1920x1080)` conversion via `cv2.cvtColor()`? ;-)

Stop. Using. Numpy. Tricks!

Enjoy! ;-)
