Why the M2 is more advanced than it seemed

When Apple launched its M2 chip at WWDC 18 months ago, in June 2022, pretty well everyone saw it as evolutionary: significantly faster than its predecessor, the M1, but offering little real change in capability. In this article, which follows on from last week’s about changed support for AI, I take a deeper dive inside Apple silicon to discover whether the M2 is more than we thought at the time. The clues come in the instruction set supported by the CPU cores inside the chips.

When comparing chips, emphasis is laid on performance, particularly benchmark tests, which are the equivalent of an athlete’s sprint performance in that they tell you how quickly the chip can run today’s tasks. If you intend keeping your Mac for longer than a year, you should also be interested in how its chip will perform when running the tasks of the future, or, for our track athlete, whether they’re also good enough at field events to make a good decathlete: that’s their capability.

Changes in instructions since the M1

For CPUs, capability is primarily determined by the instructions they run. Today you may not be interested in whether your Mac’s chip can perform ray-tracing very quickly, but in a couple of years’ time the hardware-accelerated ray-tracing in the GPU of an M3 could make all the difference. While Apple adds plenty of its own hardware, including the GPU, a neural engine, and its legendary matrix co-processor the AMX, the capability of the CPU cores remains central to many of the tasks performed by its chips. Those instructions are defined by Arm, and licensed to Apple, in its Instruction Set Architecture (ISA), documented in a manual currently well over 5,000 pages long.

Mercifully, Arm defines its ISA in versions. CPU cores in M1 chips use ARMv8.5-A, while those in M2 and M3 chips use ARMv8.6-A, and Arm helpfully explains their main differences in this list of changes in ARMv8.6-A of 2019:

  • General Matrix Multiply (AI and others)
  • bfloat16 data type and arithmetic instructions (AI and others)
  • Finer-grained traps for virtualisation (virtualisation)
  • Wait-for-event traps for virtualisation (virtualisation)
  • High precision time (1 GHz, general)
  • Extended Pointer Authentication (security).

Of those, support for bfloat16 and General Matrix Multiply is likely to have the most impact on the user.

Although Macs based on the M1 chip weren’t released until November 2020, a year after the introduction of ARMv8.6-A, the lead time in chip design and development is such that there was no time to incorporate the changes of 2019, the year after bfloat16 first appeared.

bfloat16

As I explained before, we’re dealing here with three different floating-point number formats, each expressed using a sign (+ or -), a fraction whose length determines its precision, and an exponent that determines the overall range of numbers that can be represented in that format. Before the introduction of bfloat16, the choice for AI and some other computation came down to two formats:

  • float32 (single-precision), with a range of about +/- 1.2 x 10^-38 to 3.4 x 10^38, occupying 32 bits
  • float16 (half-precision), with a range of about +/- 6.1 x 10^-5 to 65,504, occupying 16 bits.

float32 has been almost universally used in AI and other applications where double-precision (float64) isn’t required.
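
To see why conversion between these formats can be simple or involved, compare how each one spends its bits between the exponent (range) and the fraction (precision):

Format     Sign  Exponent  Fraction  Total bits
float32     1       8        23         32
float16     1       5        10         16
bfloat16    1       8         7         16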

bfloat16 adds to those an intermediate, with the same range as float32, of about +/- 1.2 x 10^-38 to 3.4 x 10^38, but occupying only half the space, at lower precision. It’s designed for easy conversion with float32, as its sign and exponent remain unchanged; only the fraction (significand, or mantissa) has to be extended or truncated, depending on which direction you’re going in. Converting between float32 and float16 is more involved, and most importantly, as the range of float16 is far smaller, numbers beyond it can’t be represented at all: any floating-point number larger than 65,504 overflows, which is a severe limitation for many applications.
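That conversion is simple enough to sketch in a few lines of Swift. The function names here are mine, for illustration only, and a production conversion would round to nearest rather than simply truncate, but this shows the principle: going to bfloat16 just drops the low 16 bits of the float32 bit pattern, and coming back restores them as zeroes.

// Truncating conversion from float32 to bfloat16: keep the top 16 bits,
// which hold the sign, all 8 exponent bits, and the top 7 fraction bits.
func toBFloat16(_ x: Float) -> UInt16 {
    UInt16(truncatingIfNeeded: x.bitPattern >> 16)
}

// Reverse conversion: the dropped fraction bits return as zeroes.
func fromBFloat16(_ b: UInt16) -> Float {
    Float(bitPattern: UInt32(b) << 16)
}

let x: Float = 123456.78              // far beyond float16’s limit of 65,504
print(fromBFloat16(toBFloat16(x)))    // 123392.0: range kept, precision reduced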

A number format that is half the length of float32 isn’t only important when storing large amounts of data, but also has substantial effects on the performance of operations. Those are usually accelerated using ‘single instruction, multiple data’ (SIMD) techniques, where a register is packed with two or more values, and the core then executes instructions on them in parallel. In the CPU cores of M-series chips, that’s normally done using 128-bit registers, which can hold four float32 values, or eight bfloat16 values. For tasks involving thousands of arithmetic operations, packing registers with twice the number of values can almost double the throughput, as reported in tests on Arm processors.
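Swift has no bfloat16 type, so this sketch substitutes Float16, which is also 16 bits wide and only available on Apple silicon, purely to illustrate the packing:

// Four 32-bit values fill one 128-bit register.
let wide = SIMD4<Float>(1, 2, 3, 4)
// Eight 16-bit values fit in the same 128 bits.
let narrow = SIMD8<Float16>(1, 2, 3, 4, 5, 6, 7, 8)

// Each multiply below is one vector operation across all its lanes,
// so the 16-bit vector gets through twice the arithmetic per instruction.
print(wide * wide)        // SIMD4<Float>(1.0, 4.0, 9.0, 16.0)
print(narrow * narrow)    // all eight lanes squared in one go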

In applications where its reduced precision can be accepted, the bfloat16 format thus offers the same range as float32 and simple, quick conversion to and from it, while occupying half the storage and delivering up to double the performance in SIMD execution.

But my Mac isn’t training AI models

When Google’s AI researchers first claimed that bfloat16 is “the secret to high performance”, they were of course referring to those developing AI models, rather than ordinary users. That article was published in August 2019, shortly before Arm added bfloat16 support in ARMv8.6-A, and little more than a year before Apple announced the M1, which explains why none of the hardware in the M1 could have supported bfloat16.

Both Arm and Apple recognise the importance of performing as much AI training as possible on-device rather than in the cloud. The case for this has been argued eloquently by Helen Norman of Arm, independently of Apple.

It doesn’t take much imagination to come up with local training tasks that could improve commonplace features in macOS. Many of us disable spell-checking because of its seeming inability to recognise when we should be using similar words like their, there and they’re. Wouldn’t it be so much better if suggested corrections were based on grammar, usage and context? While that’s already starting to improve, and Sonoma’s auto-completion is getting smarter, there’s still ample room to do better. That depends, in part at least, on your Mac learning your writing style using on-device training.

At the start of this article, I explained how this isn’t about the performance of current tasks, but the capability to accomplish the tasks our apps and macOS will be doing in the future, when some of those will involve the sort of training that’s currently left to specialised or dedicated systems.

CPU cores aren’t the only hardware in Apple silicon to support AI: depending on the task, macOS may use the GPU, Apple’s specialised neural engine, or its legendary AMX. Given that those units in the M1 were designed and developed over the same timescale as its CPU cores, it seems improbable that they have bfloat16 support, and Apple has only just added that number type to Metal Performance Shaders in Sonoma.

Does my Mac support it?

If you’re unsure whether your Apple silicon Mac supports bfloat16 in its CPU cores, there’s an easy way to check. In Terminal, run the command
sysctl -A > ~/Documents/sysctloutput.text
where the last path is a new text file to take the output from the command.

If that file contains the line
hw.optional.arm.FEAT_BF16: 1
then its cores have hardware support for bfloat16. If the number given is 0, then I’m afraid they don’t. Apple provides information to decode most of the hw.optional.arm features here.
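
If you’d rather check from code, here’s a minimal sketch in Swift using sysctlbyname() to query that single key; on an Apple silicon Mac without the feature the value returned is 0, while on Intel Macs the key is absent altogether:

import Darwin

// Read the kernel’s bfloat16 feature flag directly.
var value: Int32 = 0
var size = MemoryLayout<Int32>.size
if sysctlbyname("hw.optional.arm.FEAT_BF16", &value, &size, nil, 0) == 0 {
    print(value == 1 ? "bfloat16 supported in CPU cores" : "no bfloat16 support")
} else {
    print("key not present")   // an Intel Mac, or an older version of macOS
}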

Not having bfloat16 support in hardware isn’t the end of the world, nor does it mean your M1 Mac is already obsolete. What it does mean, though, is that as more and heavier AI is rolled out in the coming years, some of those features will run noticeably more slowly on it. Spare a thought for Intel Macs: no matter how fast their CPUs might be, or how many cores they have, they will never have any equivalent hardware support for AI.