Darktable Performance: Intel vs. Apple and Linux vs. MacOS
I’ve been looking at ways of speeding up my editing workflow, and I’m principally considering two options to replace my Intel-based Mac in the near future:
An upcoming Mac Pro based on Apple's ARM processors
A GNU/Linux desktop based on an AMD Threadripper
I wanted to see how the new Apple M-based processors compare to the existing X86/AMD64 architectures, and also how MacOS compares to GNU/Linux when running darktable, specifically. GNU/Linux is by far the focus of darktable development, so how would it compare to running on MacOS?
A couple of years ago, I did most of my editing on an Intel NUC with 16GB RAM and integrated graphics. It had been working fine for 8MP images, and even 16MP images, but it never felt snappy. Obviously, computers continue to get faster, and I figured I’d upgrade at some point; the forcing function came when I moved to the Leica Q2 and its larger raw files. I figured if I needed more storage, I should take the opportunity to improve the performance as well. I ended up replacing it with a 16” MacBook Pro with a Radeon 5500M and an 8TB SSD. That has been working out well, but again, I hit the limit of the on-device storage; and honestly the lower-powered laptop components weren’t as fast as what I could get with a desktop.
It’s probably helpful at this point to understand what contributes to a smooth user experience in darktable, and what’s important to user performance. There are the basic steps of reading an image from disk and putting it into memory, but the bulk of your time is going to be spent rendering images. darktable produces a processed image by executing its rendering pipeline, which includes all of the modules you’ve applied in the image stack: like levels, exposure, cropping, and denoising. Some of these are really basic, but some are complex operations; and in order to minimize artifacts between stages of the rendering pipeline, darktable treats every pixel value along the way as a 32-bit single-precision floating-point number–the type of math that CPUs don’t traditionally do very well. To really crank through these complex operations, you want to look toward co-processors, special silicon made by companies like Nvidia or (formerly) ATI: discrete graphics cards.
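To make the point about artifacts between pipeline stages concrete, here is a minimal sketch. It is not darktable's code; the two "modules" (darken by a factor of 4, then brighten by the same factor) and the pixel values are made up for illustration. It compares storing an integer between stages with storing a 32-bit float.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def darken_then_brighten_int(pixels, factor=4):
    """Hypothetical two-stage pipeline that stores integers between stages."""
    darkened = [p // factor for p in pixels]  # fractional detail is lost here
    return [p * factor for p in darkened]


def darken_then_brighten_float(pixels, factor=4):
    """The same pipeline, storing 32-bit floats between stages."""
    darkened = [to_float32(p / factor) for p in pixels]
    return [to_float32(p * factor) for p in darkened]


pixels = [100, 101, 102, 103]  # four neighbouring tones, made up for the example
int_result = darken_then_brighten_int(pixels)
float_result = darken_then_brighten_float(pixels)

print(int_result)    # [100, 100, 100, 100]
print(float_result)  # [100.0, 101.0, 102.0, 103.0]
```

With integers in between, four distinct tones collapse into one, which is the kind of banding a float pipeline avoids.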
Expensive workstations have been using expensive workstation graphics cards for decades; and more recently, supercomputers performing complex simulations have been using multiple ultra-high-end cards because of how good they are at doing high-precision calculations quickly. The good news is that if you just want to speed up your darktable rendering, you don’t need one of those astronomically expensive workstation cards; you’re probably going to see better performance with the simply outrageously priced gaming cards. Generally speaking, the workstation cards are optimized for double-precision, or 64-bit, calculations, which darktable doesn’t use. For simulations requiring extensive precision, those extra bits will pay dividends; but when you’re mapping back to a limited color gamut, it simply isn’t going to matter, and 32-bit calculations are significantly faster regardless of whether you have a workstation or consumer-class card.
darktable utilizes OpenCL, which is supported by the major graphics card vendors and operating systems. Its first commercial implementation was in Apple’s Mac OS X 10.6 Snow Leopard in 2009; and despite Apple deprecating it in 2018 with MacOS 10.14 Mojave, it’s still available on the integrated GPUs in their M1-based computers. Because Apple was deprecating OpenCL in favour of its proprietary Metal API, and based on discussion in the darktable IRC channel (and general preference), at the start of the year I started looking into switching back to a Linux desktop. At the same time, reviews of the M1-based Mac Mini were coming in with praise.
Apple's M1
I think the M1 chipset is exciting in a lot of ways, and I wish Apple were more open about the rest of their hardware so that the overall systems could once again be solid Linux-boxes–but, those wishes aside, let’s focus on the key benefits of M1 and why it really is a better performing architecture for most people. Most users want a quiet–and therefore an energy-efficient–computer which performs normal desktop tasks, like Web browsing and maybe traditional office tasks, and basic media playing. Despite the efforts of Web developers, these tasks aren’t all that taxing, with most of the time involving moving information from disk or the network into memory, having the CPU do something with it, and then pushing that update to memory/disk/display or whatever. The only “special” work beyond that which really helps out the user is speeding-up encryption/decryption (for file encryption, like Apple’s FileVault; or accessing secure Web sites) and optimized decoding for media streams, like H.264 or H.265, to efficiently playback high-resolution video. The M1 is built on an extremely advanced fab process, which gives it better heat dissipation and power utilization compared to the current generation of Intel CPUs. Beyond that there are architectural differences which play into the M1’s favour.
There aren’t really philosophically pure RISC and CISC chips anymore, in part because of what we mentioned earlier with the desire to optimize certain complex tasks. Nevertheless, you may have heard that the M1-based Mac is a RISC–or “reduced instruction set”–computer, while Intel produces chips with a complex instruction set. What this means is technical, but importantly, it means that the M1 can do some fancy tricks not available to Intel or AMD X86-based chips. Because of guarantees in the M1’s instruction set, it can be much more efficient in processing upcoming instructions; while the AMD/Intel CPUs either need to wait, guess, or rely on developers, who tend to be pretty poor at managing these optimizations themselves. RISC instructions are fixed-length and need to access memory via load/store commands; while CISC instructions need not do either of those things. This allows the M1 to more efficiently process instructions, and confidently re-order instructions for better throughput automatically.
Intel and AMD chips also perform reordering, or out-of-order execution, but have much smaller amounts of memory dedicated to storing upcoming instructions (the re-order buffer). Again, the reasons for this are in part driven by the architecture itself, where re-ordering and speculative execution in the CISC chips can lead to larger performance hits when the speculatively-performed (or predicted) work ends up not being used, and the CPU needs to operate on some other data instead; and in part because the process of simply identifying and processing instructions is more expensive with CISC.
Just as reducing the size of the CPU improves efficiency and overall performance, bringing things closer together does the same thing. We’ve seen mobile phones continue to shrink by bringing more components into the same package: CPU, GPU, memory, etc. The M1 fully embraces this: not only does it avoid accessing memory across the northbridge (recent generations of Intel chips have moved this closer to the CPU as well), it also uses a “unified memory architecture”: a single pool of memory which can be directly accessed and shared across the CPU, GPU, and the M1’s “neural cores”.
With Intel chips, certain devices–like integrated graphics–may use the same memory as the CPU. The M1's implementation differs in two significant ways here: the M1 uses faster memory than current-gen Intel chips, and unlike Intel's architecture, the M1 shares better. With Intel, effectively the computer puts up fences, and tells the CPU which memory is its, and which memory is reserved for–say, the graphics card, or GPU. In the Intel world, if you need to do work with some data on the GPU, you first need to copy that data to the GPU's section of the memory–even if it's already there, in the CPU's section. This is relatively quick, but still takes time and adds up if you have to do it a lot. With the M1, the GPU can directly reference the same data the CPU was working on, which means you need neither to copy the memory into the GPU's special area, nor to copy it back out to the CPU's area.
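As a loose analogy only (this is ordinary Python memory, not how either chip's memory controller works), the difference between the fenced-off model and the shared model looks like the difference between copying a buffer and taking a view of it. The buffer contents and variable names are made up for the example.

```python
# Stand-in for image data that the "CPU" has been working on.
frame = bytearray(b"\x10\x20\x30\x40")

# Fenced-off model: the "GPU" gets its own copy of the data.
gpu_copy = bytearray(frame)
gpu_copy[0] = 0xFF
copy_visible_to_cpu = frame[0] == 0xFF  # False: the result must be copied back

# Shared model: the "GPU" works on a view of the very same buffer.
gpu_view = memoryview(frame)
gpu_view[1] = 0xEE
view_visible_to_cpu = frame[1] == 0xEE  # True: no copy in either direction

print(copy_visible_to_cpu, view_visible_to_cpu)
```

The copy costs time twice, once in and once out; the view costs nothing extra, which is the saving the unified memory architecture is after.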
Benchmark
How does the M1-based Mac Mini compare to a 16” Macbook, when it comes to image processing in darktable?
I ran with the latest version of darktable in both cases, built for X86, so it was running through Apple’s Rosetta 2 translation layer on the M1 Mini. It would likely run faster if it were native, but there are a few dependencies that just aren’t available yet on the M1. We’re testing OpenCL-optimized pipelines as well, and those probably won’t see as much change from a native build.
Looking at the on-CPU performance only, the 16”’s 8-core i9 (2.4GHz, boostable to 5.0GHz) out-performs the M1 in raw throughput when looking at the total dataset. This is more pronounced, the more memory is necessary–so we seem to be seeing the effects of needing to swap to handle the lower RAM available on the M1 Mini. This was the 8GiB version of the mini, and you'd likely see much closer performance with the 16GiB version. In fact, in the “low memory” dataset, which comprised the majority of these images, the performance trends on-CPU were basically identical.
The Datasets
The images were made up of Adobe Digital Negative files, almost exclusively from the Leica X2 and Leica Q2. In addition to looking at the overall dataset, I split them into "low", "medium", and "high"-memory groups, roughly based on which cards would be able to fully fit the required data in memory. 62.5% of the files needed about 1GiB or less of memory–the "low memory" data set, and any modern graphics card would have absolutely no trouble with that amount of data. Also, looking at this low-memory group, we see very little difference between the M1 and the Intel Core i9–actually there's a bit of an advantage to the M1 at this point.
The larger memory available to the 16” allows it to pull ahead as the complexity increases. 37.1% of the images were in the “mid-memory” category. This is where most of the Q2 images ended up, with roughly 97% of the 47MP images requiring between 1GiB and 4GiB of memory when running on the GPU–again, these will fit on any recent discrete GPU from AMD or Nvidia, but ATI and Nvidia did release OpenCL cards with between 1GiB and 4GiB of on-card memory, and the integrated Intel graphics only allocated about 1GiB of memory. 16MP images coming from the X2 made up 8% of the “mid-memory” set.
About 1.2% of Q2 images made up the “high-memory” dataset; just under the 1.4% of Q2 images which stayed in the “low memory” group. Because of the low sample size here, I think the “high memory” group is interesting, but not statistically significant. It does show clear results for the dataset though, with the on-board Radeon 5500M and eGPU 5700 clearly leading the pack, and the M1 GPU clearly beating on-CPU or integrated Intel graphics. It’s interesting to note that darktable reported allocating up to 5GiB on the M1 GPU, but would only allocate 1GiB for the integrated Intel graphics. The Radeon-based cards each had 8GiB of memory.
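The grouping described above can be sketched in a few lines. The cut-offs follow the rough boundaries given in the text (about 1GiB and 4GiB); the function name and the sample memory requirements are hypothetical, not values from my benchmark.

```python
from collections import Counter

GIB = 1024 ** 3


def memory_group(required_bytes: int) -> str:
    """Assign an image to a dataset based on the GPU memory it needs."""
    if required_bytes <= 1 * GIB:
        return "low"
    if required_bytes <= 4 * GIB:
        return "mid"
    return "high"


# Made-up per-image memory requirements, in GiB, purely for illustration.
sample_requirements_gib = [0.4, 0.9, 2.5, 3.8, 5.0]
groups = Counter(memory_group(int(g * GIB)) for g in sample_requirements_gib)

for name in ("low", "mid", "high"):
    print(name, groups[name])
```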
One interesting note here–that caught me by surprise, given that the darktable devs are principally focused on GNU/Linux, and Apple has deprecated OpenCL in MacOS–is that there wasn’t a noticeable, significant performance loss on MacOS when compared to GNU/Linux, either on-CPU or using Intel graphics. I used the same 2015 Macbook to test GNU/Linux and MacOS.
Hardware and Expectations
If you’re looking to pick the best hardware for running darktable–or DaVinci Resolve, or Final Cut Pro–what numbers on the spec sheet are most important?
Well, there are the quintessential basics–and these are even more critical if you’re condemned to Adobe: fast memory, solid CPU, and disk I/O. Memory speed, in both latency and bandwidth, is currently a clear win for the M1, with AMD and Intel both now generally using DDR4@3200MHz on high-end workstations. If you’re using Adobe Lightroom or Photoshop, the CPU becomes more important, since–based on other people’s reports and benchmarks–Adobe effectively added “GPU acceleration” as a marketing point rather than a performance feature. With darktable, or a professional NLE, you can significantly improve performance by offloading work to the GPU–how much depends on effectively two variables: how quickly you can transfer data to (and from) the GPU, and how quickly the GPU can do the work.
For the latter–how fast the GPU is at doing the work–you can basically look up the floating-point operations per second, and that’s going to give you a good idea of where the card sits on the performance yardstick. Specifically, the metric that’s relevant here is single-precision (or 32-bit) floating-point operations, and they’re usually shown in either teraflops (trillions of operations per second) or gigaflops (billions of operations per second). You want to make sure that, when you’re comparing cards, you’re using the same unit or converting–obviously 3 teraflops is 1000 times faster than 3 gigaflops; that is, 3 teraflops is the same as 3000 gigaflops. The current leader for a single-card solution is the Nvidia Ampere-based GeForce RTX 3090, which advertises about 35 and a half teraflops–if you can get your hands on one.
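If you want to avoid unit slips when comparing spec sheets, a tiny helper does the converting for you. This is just a sketch; the function and table names are my own, and the two figures used are the ones quoted in this article for the M1 GPU (an estimate) and the integrated Intel graphics.

```python
# Conversion factors to a common unit (gigaflops).
UNITS_IN_GIGAFLOPS = {"gigaflops": 1, "teraflops": 1000}


def to_gigaflops(value: float, unit: str) -> float:
    """Convert a single-precision throughput figure to gigaflops."""
    return value * UNITS_IN_GIGAFLOPS[unit]


m1_gpu = to_gigaflops(2.6, "teraflops")      # estimated figure for the M1 GPU
intel_igpu = to_gigaflops(360, "gigaflops")  # integrated Intel graphics

print(to_gigaflops(3, "teraflops"))  # 3000
print(f"M1 GPU is about {m1_gpu / intel_igpu:.1f}x the Intel figure on paper")
```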
I had hoped to test a Radeon Pro VII, the fastest AMD GPU on the market–and the fastest GPU supported by MacOS–but it’s been on back-order for several months now, and I missed the initial batch. I’ll see about adding it to the article on https-notta-dot-pro once it arrives and I can run the benchmark; but it only advertises 13.1 teraflops–only about 37% of the RTX 3090’s floating-point operations. The older, non-pro variant, if you could get your hands on it while it was still shipping, was actually slightly faster, having been clocked higher, but with weaker double-precision (64-bit) performance; which, again, doesn’t matter for the programs we’re considering.
The cards we did test are all MacOS compatible, and the floating-point figures are taken from the manufacturer, with the exception of the M1’s GPU. Most people reference an AnandTech article, which estimates 2.6 teraflops–so we’ll go with that. It fits the performance metrics, trouncing the 360 gigaflops of the integrated Intel graphics on the 16” Macbook, but falling short of its dedicated GPU, the Radeon 5500M, which boasts almost 4 and a half teraflops. We also tested the AMD Radeon 5700, built into the higher-end of the two recently released Sonnet eGPU Breakaway Pucks, which advertises just under 8 teraflops.
For comparison, Apple used to sell two Blackmagic eGPUs on their Website, one with the Radeon RX Vega 56 that offered 10.5 teraflops, and one–which is still available–using the Radeon Pro 580 which offers 5.5 teraflops.
As you can see from the charts, though, performance doesn’t carry over perfectly.
Unfortunately, we only had one eGPU here, making it hard to build a generalization, but we clearly don’t see the expected nearly 2x performance boost going from the internal 5500M to the 5700 eGPU–performance is flat. So what happened to those extra 3 and a half teraflops?
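For reference, this is the back-of-the-envelope expectation the spec sheets would suggest. The figures are the rounded ones quoted above ("almost 4 and a half" and "just under 8" teraflops), so treat the result as approximate; the dictionary and function names are my own.

```python
# Advertised single-precision throughput, in teraflops (rounded, as quoted above).
ADVERTISED_TFLOPS = {
    "Intel integrated": 0.36,
    "M1 GPU (estimate)": 2.6,
    "Radeon 5500M": 4.5,
    "Radeon 5700 eGPU": 8.0,
}


def expected_speedup(slower: str, faster: str) -> float:
    """Speedup you would expect if performance tracked teraflops alone."""
    return ADVERTISED_TFLOPS[faster] / ADVERTISED_TFLOPS[slower]


speedup = expected_speedup("Radeon 5500M", "Radeon 5700 eGPU")
extra = ADVERTISED_TFLOPS["Radeon 5700 eGPU"] - ADVERTISED_TFLOPS["Radeon 5500M"]

print(f"expected speedup: {speedup:.2f}x")  # expected speedup: 1.78x
print(f"extra teraflops:  {extra:.1f}")     # extra teraflops:  3.5
```

On paper, then, the eGPU should have been roughly 1.8 times as fast as the internal card; the benchmark showed nothing of the sort.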
eGPU
Without more data it’s hard to say exactly; however, going back to that other performance variable–how fast can we get data to the card and back–it’s clear that the eGPU-based cards are getting hit here. Now, this has nothing to do with bandwidth, assuming your eGPU is directly connected to the PC and not sharing any bandwidth (and you have a high-quality, short, cable). That’s a lot of caveats–but basically, Thunderbolt 3 has plenty of bandwidth.
Remember way back when we were talking about the M1, and why it was super fast–assuming you didn’t just jump to the benchmark results? Well, the problem here is the counterpoint to how the M1 is improving performance by bringing more components closer together–an eGPU is moving the graphics card further from the CPU and memory–both logically and physically. This increases latency, which means getting data to and from the GPU gets a constant penalty added on to it. The good news is, this penalty shouldn’t get bigger as the data gets larger–assuming you don’t need to make additional round trips. You see, darktable will tile images that are too large to fit into the card’s memory, and that will lead to additional round trips–so being able to push and read data less often will show an improvement.
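Here is a toy model of that penalty. It assumes one round trip per tile and a fixed cost per round trip; both are simplifications of my own (darktable's real tiling is more involved), and the 5ms penalty is a made-up number, not a measurement.

```python
GIB = 1024 ** 3


def round_trips(image_bytes: int, usable_card_bytes: int) -> int:
    """Number of tiles (and so round trips) needed; always at least one."""
    return max(1, -(-image_bytes // usable_card_bytes))  # ceiling division


def latency_overhead_ms(image_bytes: int, usable_card_bytes: int,
                        penalty_ms_per_trip: int) -> int:
    """Total fixed latency cost, ignoring the bandwidth-dependent part."""
    return round_trips(image_bytes, usable_card_bytes) * penalty_ms_per_trip


PENALTY_MS = 5  # hypothetical per-round-trip penalty for an eGPU link

fits_on_card = latency_overhead_ms(3 * GIB, 8 * GIB, PENALTY_MS)
needs_tiling = latency_overhead_ms(3 * GIB, 1 * GIB, PENALTY_MS)

print(fits_on_card)  # 5
print(needs_tiling)  # 15
```

The same image pays the penalty once when it fits on the card, and once per tile when it doesn't.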
When we look at the charts, we see that the 5700 eGPU and its 8 teraflops actually underperformed all other options (besides the Intel GPU) on half of the images. By contrast, it and the 5500M were the clear winners in the mid-memory and high-memory datasets.
Results
So, to the questions: MacOS or GNU/Linux? and if MacOS–Intel or ARM?
The operating system itself doesn’t seem to make a huge difference–again, there are some meta considerations that, I promise, I’ll get back to. The real difference, though, is the availability of hardware support–Nvidia’s top-of-the-line 3090 is the clear performance king, breaking 35 teraflops, while the runner-up 3080 is much more affordable, and still way beyond the nearest competitor–the Radeon VII–with nearly 30 teraflops compared to just over 13 teraflops. It’s possible an upcoming Intel-based Mac Pro will support a Radeon Pro VII (or maybe two)–but even two would still provide less than 90% of the performance of a single RTX 3080. It’s also worth noting that even if you happen to see an improvement in Adobe’s Lightroom using GPU acceleration, it’ll only use a single card; darktable and professional NLEs will use multiple GPUs when they can.
There’s also the upcoming question of an M-based Mac Pro. Clearly, the M1 GPU isn’t a performant solution compared to discrete options, which leaves three options: add support for eGPUs, support dedicated GPUs, or improve the M1. The performance hits on eGPUs make them the least attractive, and honestly, I’m not sure how likely that would be without straight-up adding support for dedicated GPUs. On the other hand, Apple currently offers the add-on Afterburner card for its Mac Pros; so it’s not entirely unreasonable that Apple will eventually add support for some means of accelerating video work–either supporting its existing ecosystem, or effectively offering extra M-GPUs.
The interesting question is: what is the future of Apple’s homegrown silicon? What are they currently sitting on, and what will they have in a year? The oft-repeated rumor is that Apple’s planning a 128-core GPU for their professional line. The current M1 has up to 8 graphics cores. If they can scale that linearly, we’d be looking at a 16-times performance increase, or about 41.6 teraflops: on paper, ahead of even the RTX 3090 (and likely behind whatever Nvidia’s 2022 “less unaffordable” high-end card is). For comparison, a 16-core GPU scaled linearly would give you a 16% boost over the 16"'s Radeon Pro 5500M; and a 32-core GPU would get you to 80% of the performance of the Radeon VII based on single-precision floating-point operations.
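The linear-scaling arithmetic is easy to check from the figures quoted in this article: an estimated 2.6 teraflops for the 8-core M1 GPU, about 4.5 for the 5500M, and 13.1 for the Radeon VII class. Linear scaling is itself an optimistic assumption, and the function name is my own.

```python
M1_GPU_CORES = 8
M1_GPU_TFLOPS = 2.6  # third-party estimate quoted earlier, not an Apple figure


def scaled_tflops(cores: int) -> float:
    """Teraflops for a hypothetical M-series GPU, assuming perfect linear scaling."""
    return M1_GPU_TFLOPS * cores / M1_GPU_CORES


for cores in (16, 32, 128):
    print(f"{cores:>3} cores: {scaled_tflops(cores):.1f} teraflops")

boost_over_5500m = scaled_tflops(16) / 4.5 - 1  # roughly 0.16
share_of_radeon_vii = scaled_tflops(32) / 13.1  # roughly 0.79
```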
Improving Darktable Performance
It’s worth considering how to improve darktable performance itself (or any similar software). With darktable, it’s worth checking what GPU it’s trying to use; I was surprised to find that on my Macbook Pro, it defaulted to the integrated GPU for the most complex tasks, rather than the Radeon–swapping that was an instant performance improvement. With darktable, there are either two or three different pipelines during the normal editing flow, and with enough GPUs, you can run each in parallel on a dedicated card. In a single-monitor mode, you have the main editing window, and the less demanding preview window. With a dual-monitor mode, you can make use of three different GPUs to accelerate each of those. Multiple GPUs will also accelerate exports, or probably more importantly, thumbnail generation on the light table.
GPU RAM
As described earlier, fast floating-point performance is key; but how much RAM? The darktable documentation doesn't give good guidance, other than saying more is better. I only looked at 16 megapixel and 47 megapixel images, but since 100% of the 16 megapixel images fit under 4GiB, and only a handful of the 47 megapixel images made it up to 8GiB, most users in that 20-50 megapixel range are probably going to be safe with the fairly standard 8GiB of RAM on a current card. The lower-end Sonnet eGPU, based on the Radeon RX 5500 XT, only has 4GiB–so I'd avoid that model–but if you can find a good deal, or have an old 4GiB card just lying around to upgrade your machine, it'll be good for most of your 16 or 20 megapixel images; or really any images if all you have otherwise is integrated graphics.
What's clear is that you're not going to need all 24GiB of RAM on the Nvidia RTX 3090; although if you're shooting with the Fuji GFX100 or your camera's high-res multi-shot mode, you'll fairly easily bust out of the 8GiB most cards offer, or even the 10GiB of the RTX 3080, meaning you're going to have to bite that upgrade cost, or find a lower-performing 16GiB card–like the Radeon VII.
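For a rough feel of why memory needs grow with resolution, you can estimate the size of a single full-resolution buffer. This sketch assumes four 32-bit float channels per pixel, which is my assumption about the buffer layout rather than something measured here; a real pipeline holds several such buffers at once (input, output, and scratch space), so the actual requirement is a multiple of this.

```python
BYTES_PER_FLOAT32 = 4
CHANNELS_PER_PIXEL = 4  # assumption: four float channels stored per pixel


def buffer_bytes(megapixels: float) -> int:
    """Size of one full-resolution float buffer, in bytes."""
    pixels = int(megapixels * 1_000_000)
    return pixels * CHANNELS_PER_PIXEL * BYTES_PER_FLOAT32


for mp in (16, 47, 100):  # 100 is a round stand-in for a ~100MP sensor
    size = buffer_bytes(mp)
    print(f"{mp:>3}MP: {size / 1024 ** 2:,.0f} MiB per buffer")
```

Going from 16 to 47 megapixels roughly triples every buffer, which lines up with the Q2 images landing in a higher memory group than the X2 ones.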
MacOS Pros and Cons
I’m not a huge fan of MacOS, but treat it effectively as an expediency for certain tasks. I actually still use that NUC for many things that are neither video-nor-photo-centric.
Beyond commercial support–that is, a wider variety of programs and hardware–MacOS does make managing multiple color-managed monitors easier. Single-monitor set-ups for Linux aren’t bad, but multiple monitors get awkward. I left Linux when the colour management daemon, colord, began to fail to work outside of Gnome. There are other options, but effectively, it felt like the walls were closing in on me in Linux: between systemd and Gnome; and for ultimate performance, needing closed proprietary drivers again, I just gave in and switched to OSX for the situations where I needed video performance and colour management. In for a penny, in for a pound.
The drawbacks? Beyond being closed and proprietary, Apple has a track record of changing things on software developers with minimal warning and no recourse. Remember when they switched the ports between generations, and suddenly none of your peripherals worked? Or removed a headphone port? It’s like that for developers, too. My understanding is that’s what did Apple’s own Aperture in–internal APIs changed, necessitating effectively a full rewrite. Especially, with OpenCL having been deprecated, there’s real concern that Cupertino has already hung the sword over our head.
Where does that leave us?
Ok, that was a lot of information, and where does that leave us?
Well, honestly I don’t know. I can’t find someone selling a tri-3090 system anymore, given shortages; and we don’t actually know what the M-based Mac Pro will offer (let alone when it’ll appear). We don’t know how Apple’s M-CPUs, GPUs, or Unified Memory architecture will scale–how much memory it will support, how it will perform as it scales out, and how much it will cost as it does.
At present, the clear performance option, if money is no object, is a dual or triple Nvidia-based system; and it’s likely that’ll beat any Mac Pro-based option out next year. You’re subject to Nvidia’s proprietary drivers, but you’re not going to be subject to the whims of Cupertino.
For now, I haven’t decided what my next move is.
-NP