
name99

macrumors 68020
Jun 21, 2004
2,407
2,309
That's not what they mean by broadcast. They basically just mean transmitting radio signals, i.e. just communicating. This appears to just be some updated patents related to the Pencil, covering communication between tablet and pencil. The reference to broadcast of a video signal between the stylus and other devices is unclear to me; it may relate to how they process stylus input, or perhaps they intend to allow you to use the stylus on, say, your phone/iPad surface but do the actual drawing on a different display or Mac, like a Wacom. But it's probably just some optimisation to how the Pencil works.
I honestly wonder what drives people to make comments like this, without even LOOKING AT the patent.

The FIRST damn sentence says: "A stylus usable with an electronic device to provide an input to the electronic device, may include a stylus body, a battery within the stylus body, a first antenna configured to receive a broadcast signal, the broadcast signal including video content,"

Later details talk about this antenna picking up ATSC content.
If words aren't your thing, look at the pictures!
[Attached screenshot: figure from the patent]
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
...maybe?

I don't know anything about academic research on hierarchical memories. So I don't know if certain schemes have already been modeled or tested. I do have some understanding of the related problem in storage, but I don't even know if that understanding is portable - does the relationship between disk and SSD match that between RAM and HBM? No idea.

But I can at least imagine a few different ways you might want to approach the problem of the OS supporting some level of memory-tier awareness. Obviously you could support direct allocations from HBM or RAM, with varying failure modes when HBM isn't available. You could have priority levels for determining who gets evicted to DRAM. This might need to interact with OS-level permissions.
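
To make that concrete, here's a minimal sketch of what a tier-aware allocation API might look like. Everything here (names, tiers, fallback policies) is invented for illustration; it is not any real Apple or OS interface:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical tier-aware allocator. It just illustrates the design
 * space described above: explicit tier requests plus a policy for
 * what happens when HBM is full or absent. */

typedef enum { TIER_HBM, TIER_DRAM } mem_tier;
typedef enum {
    FALLBACK_FAIL,   /* return NULL if the requested tier is unavailable  */
    FALLBACK_DEMOTE  /* silently satisfy the request from the slower tier */
} tier_policy;

void *tier_alloc(size_t size, mem_tier want, tier_policy policy) {
    int hbm_available = 0;           /* pretend HBM is absent or full */
    if (want == TIER_HBM && !hbm_available) {
        if (policy == FALLBACK_FAIL)
            return NULL;             /* caller handles the failure mode */
        /* FALLBACK_DEMOTE: fall through to DRAM */
    }
    return malloc(size);             /* stand-in for a real tier allocator */
}

int main(void) {
    void *p = tier_alloc(1 << 20, TIER_HBM, FALLBACK_DEMOTE);
    printf("request %s\n", p ? "satisfied (demoted to DRAM)" : "failed");
    free(p);
    return 0;
}
```

Eviction priorities (who gets pushed from HBM to DRAM first) and permission checks would layer on top of something like this.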

When you talk about a "dedicated cache die", I assume you're talking about something like AMD's X3D. Does that make sense for Apple? I mean, what does it buy them, in their target markets? So far, X3D has proven useful mostly (though not exclusively) in games. That's not exactly a motivating factor for Apple.

Of course Apple would have a minor advantage with a cache die, in that they wouldn't have to reduce their clock speed to take advantage of it the way AMD does, since they clock lower anyway.

The real question is what problem does this solve for Apple?

You only care about these bandwidths if you have a lot of GPU performance, in which case you have the perimeter (with many ideas for how to extend this) to support lots of memory controllers, PHYs, and separate DRAM chips. Why not use a scheme that works well (including with all the Apple infrastructure no-one ever talks about, like QoS and use of the SLC for IP block communication), which can grow the bandwidth as required, and which uses standard DRAM rather than more expensive HBM?

A 3090 has ~936 GB/s to its GDDR6X (not HBM, incidentally).
An M2 Ultra has 800 GB/s to DRAM, and that's with the older LPDDR5; an M4 Ultra would presumably have about 1.3x this, so ~1050 GB/s. Then an Extreme doubles this...
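
For what it's worth, that 1.3x lines up with the LPDDR5-to-LPDDR5X per-pin rates (6400 vs 8533 MT/s) at unchanged bus width. A quick back-of-envelope check (the M4 Ultra figure is, of course, an extrapolation, not a spec):

```c
#include <stdio.h>

int main(void) {
    /* Back-of-envelope scaling used above. 8533/6400 is the LPDDR5X
     * vs LPDDR5 per-pin rate; bus width is assumed unchanged. */
    double m2_ultra = 800.0;            /* GB/s, LPDDR5-6400 */
    double scale    = 8533.0 / 6400.0;  /* ~1.33x            */
    printf("M4 Ultra estimate: %.0f GB/s\n", m2_ultra * scale);       /* ~1067 */
    printf("Extreme estimate : %.0f GB/s\n", 2.0 * m2_ultra * scale); /* ~2133 */
    return 0;
}
```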

I can see value in adding an additional die on top of the SoC that provides something "different", possibly a very large SLC (so, equivalent to V-Cache), more likely IMHO an augmentation of the SLC that provides persistence (i.e. MRAM or ReRAM). But I just don't see the point of adding HBM.
Certainly in the V-Cache version there's no issue of APIs; you could just use the device as an extension of the SLC.

There's a reason I always refer to what Apple offers as an SLC, not an L3. The SLC does a lot more than, and is a lot smarter than, a standard L3.
Remember that the SLC:
(a) acts as a memory-side cache. In other words ALL memory traffic is visible to the SLC, which is why it can act as a communication mechanism between devices communicating via DMA - rather than the DMA writing to RAM, then being read from it, the DMA writes to SLC (which effectively IS RAM as far as addresses and the memory controller are concerned) then some other DMA reads from SLC.

(b) offers a massive amount of controllability. Different clients are given quotas; within a client's quota, different streams of use can be given streamIDs, so that different elements can fight among themselves over who lives in cache without affecting other clients; clients can lock items in cache (e.g. the GPU can lock items that are used every frame, but which might otherwise be evicted as the textures and so on of each frame are pulled in); and so on. There are amazing features there, like large items (think textures) being cached as only every n'th line, so that a cached line is pulled in rapidly and, while it is being processed, the next (n-1) lines are transferred from memory.
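
No such interface is public, and Apple exposes none of this to software outside its own drivers, but purely as an illustration, the control surface just described might look something like this (every name and field below is invented):

```c
#include <stdio.h>

/* Entirely hypothetical SLC policy descriptor, restating the features
 * described above: per-client quotas, streamIDs that compete only
 * within their client's quota, line-locking for per-frame data, and
 * caching only every n'th line of a large buffer. */

typedef enum { CLIENT_CPU, CLIENT_GPU, CLIENT_ANE } slc_client;

typedef struct {
    slc_client client;
    unsigned   quota_mb;   /* max SLC footprint for this client        */
    unsigned   stream_id;  /* streams fight only within their quota    */
    int        locked;     /* pin lines (e.g. data reused every frame) */
    unsigned   stride_n;   /* cache 1 of every n lines of a big item   */
} slc_policy;

static void slc_apply(const slc_policy *p) {
    printf("client %d: quota %u MB, stream %u, locked=%d, 1-in-%u lines\n",
           p->client, p->quota_mb, p->stream_id, p->locked, p->stride_n);
}

int main(void) {
    slc_policy gpu_frame = { CLIENT_GPU, 16, 1, 1, 1 };  /* locked, every line */
    slc_policy gpu_tex   = { CLIENT_GPU, 32, 2, 0, 4 };  /* every 4th line     */
    slc_apply(&gpu_frame);
    slc_apply(&gpu_tex);
    return 0;
}
```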

You don't want to give up any of this goodness!
The obvious way to do things, IF Apple goes down this path, would IMHO be to copy the way things were done back in the early '90s: keep the SLC tags on the main SoC die (taking up the space the SLC occupies on, say, a Max chip), and put the SLC storage in the separate die mounted above. This is Apple, not AMD; they don't have to cater to every imaginable config. They can just say that (as an example) when you buy a Pro you get an on-die SLC (of 36MB or whatever), and when you buy a Max, you get an off-die SLC of 256MB (or whatever).
On-die tags will mean that MOST transactions run just as fast as on the Pro; only transferring data to or from the SLC storage adds a few cycles (to the ~50 cycles already taken).
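
Putting assumed numbers on that last point (both cycle counts are guesses for illustration, not measured Apple figures):

```c
#include <stdio.h>

int main(void) {
    /* Illustrative latency math for the split-SLC idea above. */
    int slc_hit = 50;  /* assumed cycles for an SLC hit today        */
    int hop     = 4;   /* assumed extra cycles to a stacked data die */
    printf("tag hit, data on-die : %d cycles\n", slc_hit);
    printf("tag hit, data off-die: %d cycles\n", slc_hit + hop);
    /* Tags stay on the SoC die, so hits and misses are detected at
     * full speed; only actual data movement pays the die-to-die hop. */
    return 0;
}
```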
 

diamond.g

macrumors G4
Mar 20, 2007
11,435
2,658
OBX
The real question is what problem does this solve for Apple?

You only care about these bandwidths if you have a lot of GPU performance, in which case you have the perimeter (with many ideas for how to extend this) to support lots of memory controllers, PHYs, and separate DRAM chips. Why not use a scheme that works well (including with all the Apple infrastructure no-one ever talks about, like QoS and use of the SLC for IP block communication), which can grow the bandwidth as required, and which uses standard DRAM rather than more expensive HBM?

A 3090 has ~936 GB/s to its GDDR6X (not HBM, incidentally).
An M2 Ultra has 800 GB/s to DRAM, and that's with the older LPDDR5; an M4 Ultra would presumably have about 1.3x this, so ~1050 GB/s. Then an Extreme doubles this...

I can see value in adding an additional die on top of the SoC that provides something "different", possibly a very large SLC (so, equivalent to V-Cache), more likely IMHO an augmentation of the SLC that provides persistence (i.e. MRAM or ReRAM). But I just don't see the point of adding HBM.
Certainly in the V-Cache version there's no issue of APIs; you could just use the device as an extension of the SLC.

There's a reason I always refer to what Apple offers as an SLC, not an L3. The SLC does a lot more than, and is a lot smarter than, a standard L3.
Remember that the SLC:
(a) acts as a memory-side cache. In other words ALL memory traffic is visible to the SLC, which is why it can act as a communication mechanism between devices communicating via DMA - rather than the DMA writing to RAM, then being read from it, the DMA writes to SLC (which effectively IS RAM as far as addresses and the memory controller are concerned) then some other DMA reads from SLC.

(b) offers a massive amount of controllability. Different clients are given quotas; within a client's quota, different streams of use can be given streamIDs, so that different elements can fight among themselves over who lives in cache without affecting other clients; clients can lock items in cache (e.g. the GPU can lock items that are used every frame, but which might otherwise be evicted as the textures and so on of each frame are pulled in); and so on. There are amazing features there, like large items (think textures) being cached as only every n'th line, so that a cached line is pulled in rapidly and, while it is being processed, the next (n-1) lines are transferred from memory.

You don't want to give up any of this goodness!
The obvious way to do things, IF Apple goes down this path, would IMHO be to copy the way things were done back in the early '90s: keep the SLC tags on the main SoC die (taking up the space the SLC occupies on, say, a Max chip), and put the SLC storage in the separate die mounted above. This is Apple, not AMD; they don't have to cater to every imaginable config. They can just say that (as an example) when you buy a Pro you get an on-die SLC (of 36MB or whatever), and when you buy a Max, you get an off-die SLC of 256MB (or whatever).
On-die tags will mean that MOST transactions run just as fast as on the Pro; only transferring data to or from the SLC storage adds a few cycles (to the ~50 cycles already taken).
In this instance, how is the SLC different from, say, AMD's Infinity Cache?
 

Confused-User

macrumors 6502a
Oct 14, 2014
850
984
I agree that there are probably different ways to model memory hierarchy in APIs, and furthermore that the optimal APIs for different use cases can differ as well. However, here I am mainly concerned with the question "what would Apple do?". Historically, they seem to prefer solutions that hide this kind of complexity from the user, going to great lengths to smooth things out for client software. NUMA and tiered-memory programming is fun on supercomputers, not so much on consumer hardware. Apple wants their APIs to be scalable across different types of software. They don't even give you APIs for pinning threads to CPU cores. Their CPU hierarchy abstraction is the shared L2 cache, and that's it. And even those APIs don't work reliably: I am currently experimenting with SME, for example, and I can't get the work to execute on the E-cluster.

Apple invested who knows how much $$$ designing a NUMA system (M1/M2 Ultra) that would appear to have uniform access properties. And their patents go into great detail describing systems that would copy data across memory controllers to optimize access patterns and power efficiency. So no, I don't expect them to expose more details to the user. Rather they will try to design systems that do the "right thing" automatically. Whether it will be successful or not is a different question.
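
(On the quoted point about pinning: the closest public knob is a thread QoS class, which is a hint to the scheduler rather than a placement guarantee. A minimal sketch using the real pthread QoS call; whether the work actually lands on the E-cluster remains up to the scheduler, which is exactly the complaint above:)

```c
#include <pthread.h>
#include <pthread/qos.h>  /* Apple-specific QoS API */
#include <stdio.h>

/* QOS_CLASS_BACKGROUND *tends* to steer a thread toward E-cores on
 * Apple silicon, but it is advisory; there is no hard pinning API. */
static void *worker(void *arg) {
    (void)arg;
    pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);
    /* ... work the scheduler will usually, not always, run on E-cores ... */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    printf("done\n");
    return 0;
}
```
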
I agree with your take on Apple's history. I don't think that's dispositive, though it is suggestive.

I mean... there is no M3 Ultra. Maybe that's just timing, TSMC processes, supply chain issues, etc. Or maybe they aren't thrilled with how that's worked out and they're changing course a little bit. I don't think they're going to change course a lot, because I think we'd be hearing more about interesting patents, but maybe they will change a bit. Perhaps the lack of complete success in automatically doing the right thing so far has led them to expose more choice to ... well, not users, but developers, which is somewhat different.

A number of important differences come to mind. First, Apple is feeding a bunch of bandwidth-reliant IP blocks off the same memory interface, not just the CPU cores. Second, Apple's CPU cores have access to much higher bandwidth than AMD's cores. Third, a large cache (I am talking about hundreds of MBs, not dozens) could potentially allow Apple to use a smaller memory interface to main RAM, saving costs. That is the solution in Fig. 2/3 that I posted in #1145.
Hm. You do make a good point here; the GPU especially might see a really significant advantage from a 128-256MB cache. Plus there's all the other stuff @name99 wrote about.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
I don't know anything about academic research on hierarchical memories. So I don't know if certain schemes have already been modeled or tested. I do have some understanding of the related problem in storage, but I don't even know if that understanding is portable - does the relationship between disk and SSD match that between RAM and HBM? No idea.
FYI, those relationships do not match up. HBM is just a different way of packaging and interfacing to DRAM. The DRAM has all the normal DRAM characteristics you're used to, and while the interface is different in detail from LPDDRn, it's a variation on the same theme, not anything fundamentally different.

For this reason, I'm skeptical of the idea of using off the shelf LPDDRn as a cache for HBMn, or vice versa.
 

Confused-User

macrumors 6502a
Oct 14, 2014
850
984
more likely IMHO an augmentation of SLC that provides persistence (ie MRAM or ReRAM).
You've mentioned persistence at least one other time recently.

What's the big advantage to persistence in an SLC? Most of what goes through there is even more ephemeral than RAM, though I can see a very nice use for that in storage buffering - you could have flush/fsync/whatever return as soon as the data is in the SLC (assuming you have a way of recovering it, like replaying from a log, after a crash/reboot). Still, that doesn't seem like enough of a motivation to make your whole SLC (or a huge piece of it) out of persistent memory. More like a reason to add that to their disk controller.
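
Purely as a sketch of that flush/fsync idea, with plain RAM standing in for a persistent region and every name invented; the point is only the control flow, where the caller unblocks once the record is in the (notionally durable) log and the SSD write is deferred:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fast-fsync via a persistent write-ahead log. The wal[]
 * array pretends to be MRAM; after a crash you would replay any valid
 * entries to the SSD instead of losing them. */

#define LOG_SLOTS 64
typedef struct { uint64_t lba; uint8_t data[512]; int valid; } log_rec;
static log_rec wal[LOG_SLOTS];  /* pretend this is persistent MRAM */
static int head;

int fast_fsync(uint64_t lba, const uint8_t *buf) {
    log_rec *r = &wal[head++ % LOG_SLOTS];
    r->lba = lba;
    memcpy(r->data, buf, sizeof r->data);
    r->valid = 1;   /* durability point, in this sketch */
    return 0;       /* caller unblocks here; SSD drain happens later */
}

int main(void) {
    uint8_t block[512] = {0};
    fast_fsync(42, block);
    printf("ack'd after WAL write; SSD write would be deferred\n");
    return 0;
}
```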
 
  • Like
Reactions: chown33

Confused-User

macrumors 6502a
Oct 14, 2014
850
984
FYI, those relationships do not match up. HBM is just a different way of packaging and interfacing to DRAM. The DRAM has all the normal DRAM characteristics you're used to, and while the interface is different in detail from LPDDRn, it's a variation on the same theme, not anything fundamentally different.

For this reason, I'm skeptical of the idea of using off the shelf LPDDRn as a cache for HBMn, or vice versa.
In what way do they not match up? I mean, aside from the obvious, which is that SSD speed advantages over disks are more significant than HBM's over DRAM.

Although... come to think of it, SSDs have a huge advantage in latency over disks, as well as better transfer rates. HBM has no advantage there over normal DRAM, and maybe (IIRC) a small disadvantage, possibly due to lower clocks; I don't remember.
 

mr_roboto

macrumors 6502a
Sep 30, 2020
856
1,866
In what way do they not match up? I mean, aside from the obvious, which is that SSD speed advantages over disks are more significant than HBM's over DRAM.

Although... come to think of it, SSDs have a huge advantage in latency over disks, as well as better transfer rates. HBM has no advantage there over normal DRAM, and maybe (IIRC) a small disadvantage, possibly due to lower clocks; I don't remember.
That's exactly why they don't match up. HBM is different packaging for same-old same-old DRAM, but SSDs are a fundamentally different technology from spinning disks.

The biggest advantages HBM provides are up to a 1024 bit wide data path to a single HBM stack, and the "stack" itself - usually HBM packages have something like eight vertically stacked DRAM die. So, what HBM provides over (LP)DDRn is reduced physical area and volume for a package-integrated, high-capacity, high-datapath-width DRAM assembly. No advantage in latency, and probably less BW per data wire than DDRn since as you mentioned it's often clocked slower. (The plan being to make up for that with lots more wires.)
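
To put rough numbers on that (speed grades are illustrative picks from common HBM2e and LPDDR5 parts; exact figures vary):

```c
#include <stdio.h>

int main(void) {
    /* Width vs per-pin speed, as described above. */
    double hbm_pins = 1024.0, hbm_gbps_per_pin = 3.2;  /* one HBM2e stack    */
    double lp_pins  = 16.0,   lp_gbps_per_pin  = 6.4;  /* one LPDDR5 channel */

    printf("HBM2e stack   : %6.1f GB/s\n", hbm_pins * hbm_gbps_per_pin / 8.0);
    printf("LPDDR5 channel: %6.1f GB/s\n", lp_pins * lp_gbps_per_pin / 8.0);
    /* ~410 vs ~13 GB/s: HBM wins on wire count, not per-wire speed. */
    return 0;
}
```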

So, HBM has reasons to exist, but none of them are anything like "it's a good idea to use HBMn as cache for LPDDRn, or vice versa".
 
  • Like
Reactions: altaic

Confused-User

macrumors 6502a
Oct 14, 2014
850
984
That's exactly why they don't match up. HBM is different packaging for same-old same-old DRAM, but SSDs are a fundamentally different technology from spinning disks.

The biggest advantages HBM provides are up to a 1024 bit wide data path to a single HBM stack, and the "stack" itself - usually HBM packages have something like eight vertically stacked DRAM die. So, what HBM provides over (LP)DDRn is reduced physical area and volume for a package-integrated, high-capacity, high-datapath-width DRAM assembly. No advantage in latency, and probably less BW per data wire than DDRn since as you mentioned it's often clocked slower. (The plan being to make up for that with lots more wires.)

So, HBM has reasons to exist, but none of them are anything like "it's a good idea to use HBMn as cache for LPDDRn, or vice versa".
12-hi is common (and 16-hi is specced by JEDEC, though I don't think any is shipping yet), but sure, all of that makes sense... which is why I was suggesting that @leman might be wrong about Apple using HBM as cache. It makes much more sense to me to use it as a separate memory tier, if you're going to use it at all. I thought you might be talking about some other factor.

Note that I am far from convinced that Apple will use it. But if they do, and they have both HBM and regular DDR in a single system, I still think it's more likely they'll expose the difference than conceal it. As an L4 cache of sorts, I don't see it being all that great (it definitely can't replace L3 or SLC caches, with that latency), but if they build an M4 Ultra with up to 256GB of HBM (384GB if 16-hi 48GB is real), some AI types would probably be extremely happy.

BTW, when I mentioned earlier that Apple would have a minor advantage over AMD's V-Cache/X3D because they run at lower clocks, I left out something probably much more important: they throw off less heat, which means cooling an Mx chip with stacked cache is a much easier job, and that probably simplifies the implementation a lot.
 
  • Like
Reactions: mr_roboto

leman

macrumors Core
Oct 14, 2008
19,516
19,664
which is why I was suggesting that @leman might be wrong about Apple using HBM as cache. It makes much more sense to me to use it as a separate memory tier, if you're going to use it at all. I thought you might be talking about some other factor.

Just to clarify — I don't think I ever suggested that Apple will use HBM as a cache. In fact, I explicitly mention in #1145 that I do not believe HBM has characteristics interesting to Apple, either as cache memory or as main memory.

There are a lot of topics being discussed here, so some things will inevitably be mixed up or lost. Apologies if I might have been unclear on this!
 

Confused-User

macrumors 6502a
Oct 14, 2014
850
984
Just to clarify — I don't think I ever suggested that Apple will use HBM as a cache. In fact, I explicitly mention in #1145 that I do not believe HBM has characteristics interesting to Apple, either as cache memory or as main memory.

There are a lot of topics being discussed here, so some things will inevitably be mixed up or lost. Apologies if I might have been unclear on this!
You were not, and I did lose track of some of our earlier discussion. Sorry about that!

In the end I also don't think HBM is their best choice. Mostly I was wondering whether other things they might do (like supporting persistent memory) might change that analysis, but unless they do, I don't see it happening. They've sort of built a poor-man's HBM by going wide on their memory buses, and that seems like a better tradeoff for them anyway: they aren't bound by HBM's fixed width (1024b, or 2048b for HBM4), which may in many cases be better for power consumption; they do better on latency; etc. Oh, and they can source from more vendors, who aren't constrained on supply (as HBM is), which leads to better pricing.
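
The numbers bear that framing out; both figures below are public (M2 Ultra's 1024-bit bus, LPDDR5-6400):

```c
#include <stdio.h>

int main(void) {
    /* Sanity check of the "poor-man's HBM" framing. */
    double bus_bits = 1024.0;  /* same width as a single HBM stack */
    double gbps_pin = 6.4;     /* LPDDR5-6400                      */
    printf("M2 Ultra peak: %.1f GB/s\n", bus_bits * gbps_pin / 8.0);
    /* ~819 GB/s, matching the ~800 GB/s Apple quotes: HBM-class
     * bandwidth from standard LPDDR, with more supplier choice. */
    return 0;
}
```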

@name99 I'm still super curious why you think persistent memory (of presumably modest size, rather than the entire memory) would be worth doing.
 
  • Like
Reactions: chown33

Basic75

macrumors 68020
May 17, 2011
2,098
2,446
Europe
If Apple sticks to 6-core clusters (which seems highly likely), I doubt a "Pro" chip would have 8 P-cores. But we'll just have to wait and see.
 

komuh

macrumors regular
May 13, 2023
126
113
I think the Ultra will be something like a 1.5x-Max-sized new chip, just to make some extra cash from a smaller die, so probably 16 or 14 P-cores or something like that, and maybe some more power in the GPU cluster.
 

Boil

macrumors 68040
Oct 23, 2018
3,477
3,173
Stargate Command
Rumors give us a new desktop-specific chip, the Hidra...
  • 1x Hidra = M4 Max (desktop variant)
  • 2x Hidra = M4 Ultra
  • 4x Hidra = M4 Extreme
Some rumors discuss the high-end desktop Macs having all Performance cores and no Efficiency cores, while other rumors talk about chips having more Efficiency cores than Performance cores, on the theory that most apps don't fully utilize many cores: the smaller number of P-cores handles the apps that focus on single-core work, while the E-cores crunch out multi-core workloads...

Theorize from there...
 

thenewperson

macrumors 6502a
Mar 27, 2011
992
912
Absolutely prefer “Supreme” over the rumored “Extreme” — miles better, IMHO.

I'd be okay with Extreme if that wasn't the end and there was still a step above it. Otherwise, yeah, I prefer Supreme to Extreme as well. However, if it's the top chip with no other step up, I'd really like Zenith over Supreme.
 

name99

macrumors 68020
Jun 21, 2004
2,407
2,309
You've mentioned persistence at least one other time recently.

What's the big advantage to persistence in an SLC? Most of what goes through there is even more ephemeral than RAM, though I can see a very nice use for that in storage buffering - you could have flush/fsync/whatever return as soon as the data is in the SLC (assuming you have a way of recovering it, like replaying from a log, after a crash/reboot). Still, that doesn't seem like enough of a motivation to make your whole SLC (or a huge piece of it) out of persistent memory. More like a reason to add that to their disk controller.
One obvious point is that when the machine is powered down (i.e. the hour or so between when you last used your iPhone and when you next use it), everything is still in the SLC. That's surely good for some speedup (minor) and energy savings (less minor).
I don't know how long an iPhone sleeps before it goes to the deepest stage of sleep (powering down SLC and DRAM), but this will help with any delay longer than that.

If you want to get more fancy, there's then capacity to augment the SSD.
Maybe by storing temporary data in transit to the SSD in MRAM rather than DRAM, you could worry less about possible data loss, and thereby either speed up the file system or reduce intermediate copies? (I don't know enough about file systems to know if this is feasible, but, like everyone, I know that right now, when DRAM contents could in principle vanish at any moment, you do have to be somewhat careful about the order in which you write blocks from DRAM to SSD.)
Or, maybe when you power down the device, some kernel of essential startup files are placed in the MRAM, so that restarting from power-down is even faster?
Or you could put various of the small databases that the OS accesses frequently into MRAM rather than on the SSD, and have them run faster and at lower power?
Or capture logs to MRAM rather than SSD, and only push them to SSD in large transactions?

As for why you would use MRAM (or ReRAM): the point is not even necessarily the persistence, it's the density. The MRAM and ReRAM guys have been saying for years that "soon" they'll provide better density than SRAM, albeit on processes somewhat trailing those used for SRAM. Micross claims to be offering a 1Gb (so 128MB) MRAM right now, which I *assume* is a single chip about the size of a DRAM chip, so somewhere between maybe 25 and 80 mm^2?
Now, can this overcome the fact of being fabbed on a lesser node? The MRAM guys keep saying "soon". I don't know enough to know if the timeline of "soon" is steadily shrinking from "ten years from now" to "next year", or if it's perpetually "five years from now". People (including serious people, like TSMC) keep working on it, so I assume there's potential there at some point.
 

Antony Newman

macrumors member
May 26, 2014
55
46
UK
If companies are racing to get on- and off-chip silicon photonic/optical solutions out before 2027 (when Nvidia is expected to preview their own data centre offering), and TSMC's 3D Optical Engine (gen 1) is available in 2025 (gen 3, with 1.6TB/s bandwidth and lower latency, in 2027), would Apple still need (expensive) HBM4 memory?

If in 2027 Apple could build a 12cm x 12cm SoC on TSMC's 3D Optical Engine (or COUPE), is it likely that Apple would take the opportunity to push the limits of what could be created (for their own data centres, with trickle-down for the Mac Pro)?

https://www.tomshardware.com/deskto...ficient-silicon-photonics-interconnect-for-ai

 

Confused-User

macrumors 6502a
Oct 14, 2014
850
984
If companies are racing to get on- and off-chip silicon photonic/optical solutions out before 2027 (when Nvidia is expected to preview their own data centre offering), and TSMC's 3D Optical Engine (gen 1) is available in 2025 (gen 3, with 1.6TB/s bandwidth and lower latency, in 2027), would Apple still need (expensive) HBM4 memory?

If in 2027 Apple could build a 12cm x 12cm SoC on TSMC's 3D Optical Engine (or COUPE), is it likely that Apple would take the opportunity to push the limits of what could be created (for their own data centres, with trickle-down for the Mac Pro)?

https://www.tomshardware.com/deskto...ficient-silicon-photonics-interconnect-for-ai

As we've just been discussing, Apple definitely doesn't "need" HBM at all. But in any case this interconnect tech is not likely relevant.

Optical is an exciting future tech, but needing a different process (some sort of 65nm in this case) is probably a nonstarter for memory links due to the latency that would be added. For traditional main memory, anyway; if you're willing to use CXL, this might be one way to carry PCIe signals.
 
  • Like
Reactions: Antony Newman

cassmr

macrumors member
Apr 12, 2021
58
62
I honestly wonder what drives people to make comments like this, without even LOOKING AT the patent.

The FIRST damn sentence says: "A stylus usable with an electronic device to provide an input to the electronic device, may include a stylus body, a battery within the stylus body, a first antenna configured to receive a broadcast signal, the broadcast signal including video content,"

Later details talk about this antenna picking up ATSC content.
If words aren't your thing, look at the pictures!
[Attached screenshot: figure from the patent]

I did look at the patent. Patents are drafted very, very broadly. Words like "antenna" and "broadcast signal" have broad meanings; e.g. Bluetooth, 5G, and Wi-Fi all use antennas, and antennas and broadcast towers are often represented in that way in patent figures. As such, it can often be hard to tell the intended use of a patent from its claims, which are the only enforceable bit of the patent; all the other descriptions are primarily background information to help understand the claims.

So I did look at the core of the patent, and it wasn't so limited to the use you were describing, which, as you said, didn't make much sense. However, to be fair to you, later on in the patent they do say:

Electronic devices such as smartphones, tablets, and computers, may be invaluable for tasks such as retrieving information, coordinating actions, and generally communicating with others. However, these functions may be unavailable when cellular or other Internet connected networks are unavailable. Electronic devices have also become ubiquitous for consumption of audio, video, and other media in a convenient manner. These functions may also be unavailable when cellular or other Internet connected networks are unavailable. Moreover, due to the ubiquity of Internet-based, on-demand content delivery sources (e.g., streaming video), as well as the general trend towards smaller and lighter electronic devices, such devices may not be configured to receive and/or process broadcast content, such as broadcast TV signals.

That's not a meaningful limit on the patent, but it does suggest that they may intend to add this functionality to the Apple Pencil or another stylus, so that you can get broadcast TV in cases where you don't have other means of receiving the signal.

I do wonder if the use case makes more sense with the Apple Vision Pro and successor devices. Perhaps it is useful for receiving broadcast television and overlaying stylus input. However, while they discuss broadcast TV specifically, I suspect this is misdirection, and the intended use is broadcast of video more generally.

Realistically, we will probably never know, as most patents don't end up in products 1:1.
 

diamond.g

macrumors G4
Mar 20, 2007
11,435
2,658
OBX
So does Infinity Cache provide ways to control the retention or otherwise of cache lines?
Hell, is Infinity Cache even a MEMORY-SIDE cache?
It seems like the cache sits on the memory controller die. I guess it is just a large, fancy L3 cache. So probably not as good as an SLC.
 