Apple Silicon in Sciences

leman · Jan 15, 2024

iPadified said:
Perhaps I am too old to get worked up by these things anymore. Correct the authors and hope they learn and move on.

Same here. But I do get worked up if unpublished papers like these are used to start public discussions about topics I am passionate about.

apostolosdt said:
I would be very happy to see such a correction. I'm interested---honestly.

Academia is weird sometimes and depends a lot on the person. I've reached out multiple times to various authors (both within and outside my fields) with either questions or corrections. Most reply and are genuinely very courteous. Some are less so.

Then again, I know someone who got a paper retracted because some anonymous moron wrote to the editors asking them to retract a peer-reviewed paper on the grounds that "the example on page 6 contradicts my own field research notes, and author also quotes a book that I don't like" (really, that was the full extent of the complaint!), and the reviewers actually went ahead with the retraction (!!!). It was a new and very inexperienced editorial team and the first issue of the journal, and they probably freaked out (I assume that the anonymous moron was some sort of big honcho in that particular area). It really caused my friend a lot of anguish. So much for academic integrity... Most people are really nice and helpful, but you can't let your guard down; these things can be very costly.

iPadified · Jan 15, 2024

leman said:
Same here. But I do get worked up if unpublished papers like these are used to start public discussions about topics I am passionate about.

Academia is weird sometimes and depends a lot on the person. I've reached out multiple times to various authors (both within and outside my fields) with either questions or corrections. Most reply and are genuinely very courteous. Some are less so.

Then again, I know someone who got a paper retracted because some anonymous moron wrote to the editors asking them to retract a peer-reviewed paper on the grounds that "the example on page 6 contradicts my own field research notes, and author also quotes a book that I don't like" (really, that was the full extent of the complaint!), and the reviewers actually went ahead with the retraction (!!!). It was a new and very inexperienced editorial team and the first issue of the journal, and they probably freaked out (I assume that the anonymous moron was some sort of big honcho in that particular area). It really caused my friend a lot of anguish. So much for academic integrity... Most people are really nice and helpful, but you can't let your guard down; these things can be very costly.

Strong feelings sometimes in academia, definitely also in my field

apostolosdt said:
I would be very happy to see such a correction. I'm interested---honestly.

Well, if the results are unrealistically good, the authors have an explanation problem. They need to explain in detail under which circumstances the results are valid and also provide proof that no methodological errors have been made. Could the authors do that? Are the methods described in such detail that others can reproduce them? Did the authors corroborate the results using at least two independent testing methods? Could the authors exclude any hardware bottlenecks not related to the test? If hardware was tested, was the software optimized for the respective platform? How was that validated? If algorithms were tested, did they have the same hardware support? How was that tested? Has the measuring technique been validated for both hardware platforms? Is the measurement straightforward, or does it include some algorithm that is easy to get wrong? A typing error can easily lead to large deviations. How was this excluded? Did the authors demonstrate the superiority of the new method in some useful, compute-demanding task such as simulation, rendering, or AI? What were the results of a real-world application?

Have independent labs and researchers been able to reproduce the results?

Very strong claims need very strong scientific arguments. Showing some graphs or a benchmark is not sufficient; such claims require strong proof, ideally mechanistic explanations or, less satisfyingly, extensive testing and exclusion of error sources.

I am sorry, but the claim that a low-end graphics card would beat a supercomputer seems like a cold fusion result to me, so consider me a sceptic. I might be wrong, but it is not my job to prove that I am wrong.

theorist9 · Jan 15, 2024

iPadified said:
The distinction is between a scientific journal, magazines and popular science and the ordinary newspapers.
In my field, engineering/bioengineering/biology, it is only a publication after a peer review and then only in recognised scientific journals. The Economist is not a scientific journal as I understand and would therefore not count as a scientific publication. Posting a manuscript on a preprint server has zero weight in my publication list.

There are some differences between fields but generally if a paper is not peer reviewed - be aware and any scientist should know that and be prepared for some crap.

I peer review about 10 manuscripts per year (and say no to the rest of the invitations), and there is always something wrong with the approach or the interpretations, so rejections and major revisions are the common outcome, even though these authors have all done the best they could. Sometimes it is an economic limitation, and sometimes I do wonder if the PI saw the manuscript before submission because of the poor quality. Finding errors and crappy approaches happens all the time.

You're conflating two very different things. You're falsely equating "iPadified can't put non-peer-reviewed work on their CV" to "Theorist9 can't use the verb 'publish' when describing someone posting a paper on ArXiv".

What you are allowed to count as a publication on your CV has nothing to do with whether it's legitimate for me to use the verb "publish" when describing someone posting on ArXiv. You're making an incorrect and gratuitous criticism of my word choice, and it's also a red herring, because it entirely misses the point of this discussion, which is whether those scientists deserve criticism for what they posted.

Bottom line: I was not wrong for using "publish" in this context, and I don't know why you are so insistent on continuing to press the idea that I was.

Plus I'm also in the sciences (though in a different field; my PhD is in chemistry), and I've never seen CV publications restricted to peer-reviewed publications only. So I think this is more what your university restricts (or your personal view) than what your field restricts. For instance, if you're the author of a significant government or industry technical publication, that is a legitimate part of your body of work in the sciences, and thus does belong on your CV, even if it's not peer-reviewed. The same applies to patents.

Examples:

1) Here's the 1984 Bell Labs Technical Report in which Bjarne Stroustrup, who designed and implemented C++, presented his codification of the language. I can imagine you lecturing Bjarne about how "it is only a publication after a peer review and then only in recognised scientific journals...and would therefore....[have] zero weight in my publication list." Are you beginning to see the absurdity here?

https://cds.cern.ch/record/169940/files/cer-000081326.pdf

2) Here's Rodger Easton's 1974 patent for GPS ("Navigation system using satellites and passive ranging techniques"), which he developed while working for the US Naval Research Lab. You're in engineering. Are you seriously going to tell another engineer that Easton's GPS patent doesn't belong on his publications list? And that because it appears in a government publication rather than a peer-reviewed scientific journal, it is incorrect to say one published a patent? Is the absurdity of your criticism of me (that you can't use the verb "publish" for non-peer-reviewed works) starting to become clearer?

https://patentimages.storage.googleapis.com/dd/07/4f/b6f5e1415d59a3/US3789409.pdf

3) If you were an analytical or environmental chemist, and you authored or edited this annual technical report (and your name were on it -- I don't see authors' or editors' names here, but sometimes they do have them), it would certainly belong on your CV:

https://www.ams.usda.gov/sites/default/files/media/2021PDPAnnualSummary.pdf

4) In my world, science outreach is important. If I were working on, say, carbon capture, and published an article about this in The Economist, it would by all means be acceptable to put that on my CV.

Chuckeee · Jan 15, 2024

iPadified said:
Very strong claims need very strong scientific arguments. Showing some graphs or a benchmark is not sufficient; such claims require strong proof, ideally mechanistic explanations or, less satisfyingly, extensive testing and exclusion of error sources.

Basically restating the Sagan Standard: Extraordinary Claims Require Extraordinary Evidence.

A new superconducting material that has a 10% improved tensile strength and a 2 degree warmer critical temperature does not require the same quantity or quality of supporting evidence as a new room-temperature superconductor that resembles stainless steel.

Technological advancement does not occur in a vacuum, required standards for evidence/proof are dependent on the nature of the claim. There is not a single all encompassing standard for review and validation for papers.

theorist9 · Jan 15, 2024

Chuckeee said:
Basically restating the Sagan Standard: Extraordinary Claims Require Extraordinary Evidence.

And Sagan's phrasing was itself a restatement! Quote Investigator has a great history of this aphorism, going back to Benjamin Bayly (1708): "These matters being very extraordinary, will require a very extraordinary proof."

Quote Origin: Extraordinary Claims Require Extraordinary Evidence – Quote Investigator®

quoteinvestigator.com

cbum · Jan 15, 2024

First, I've always been well served by the maxim: extraordinary claims need extraordinary proof.

Second, and most comforting: The key feature of the scientific method is that it is self-correcting!

😉

theorist9 · Jan 15, 2024

cbum said:
First, I've always been well served by the maxim: extraordinary claims need extraordinary proof.

Second, and most comforting: The key feature of the scientific method is that it is self-correcting!

😉

And it's able to be self-correcting because of a key feature of science—at least the natural sciences: As my mentor was fond of saying, "science is independent of us".

iPadified · Jan 15, 2024

theorist9 said:
You're conflating two very different things. You're falsely equating "iPadified can't put non-peer-reviewed work on their CV" to "Theorist9 can't use the verb 'publish' when describing someone posting a paper on ArXiv".

What you are allowed to count as a publication on your CV has nothing to do with whether it's legitimate for me to use the verb "publish" when describing someone posting on ArXiv. You're making an incorrect and gratuitous criticism of my word choice, and it's also a red herring, because it entirely misses the point of this discussion, which is whether those scientists deserve criticism for what they posted.

Bottom line: I was not wrong for using "publish" in this context, and I don't know why you are so insistent on continuing to press the idea that I was.

Plus I'm also in the sciences (though in a different field; my PhD is in chemistry), and I've never seen CV publications restricted to peer-reviewed publications only. So I think this is more what your university restricts (or your personal view) than what your field restricts. For instance, if you're the author of a significant government or industry technical publication, that is a legitimate part of your body of work in the sciences, and thus does belong on your CV, even if it's not peer-reviewed. The same applies to patents.

Examples:

1) Here's the 1984 Bell Labs Technical Report in which Bjarne Stroustrup, who designed and implemented C++, presented his codification of the language. I can imagine you lecturing Bjarne about how "it is only a publication after a peer review and then only in recognised scientific journals...and would therefore....[have] zero weight in my publication list." Are you beginning to see the absurdity here?

https://cds.cern.ch/record/169940/files/cer-000081326.pdf

2) Here's Rodger Easton's 1974 patent for GPS ("Navigation system using satellites and passive ranging techniques"), which he developed while working for the US Naval Research Lab. You're in engineering. Are you seriously going to tell another engineer that Easton's GPS patent doesn't belong on his publications list? And that because it appears in a government publication rather than a peer-reviewed scientific journal, it is incorrect to say one published a patent? Is the absurdity of your criticism of me (that you can't use the verb "publish" for non-peer-reviewed works) starting to become clearer?

https://patentimages.storage.googleapis.com/dd/07/4f/b6f5e1415d59a3/US3789409.pdf

3) If you were an analytical or environmental chemist, and you authored or edited this annual technical report (and your name were on it -- I don't see authors' or editors' names here, but sometimes they do have them), it would certainly belong on your CV:

https://www.ams.usda.gov/sites/default/files/media/2021PDPAnnualSummary.pdf

4) In my world, science outreach is important. If I were working on, say, carbon capture, and published an article about this in The Economist, it would by all means be acceptable to put that on my CV.

Did we not talk about scientific publications?

Just to be clear: we are talking about manuscripts submitted to a preprint server that usually end up as publications in peer-reviewed scientific journals or proceedings, hence the domain of academia? I also assume we are talking about original research.

Exceptions do not make rules, and citing work done in 1974 and 1984 is also not valid for today’s values in STEM research dissemination in academia.

I divide my publication list into:
1) Peer-reviewed journal publications (the primary reason for more funding). Generally high quality and the main body of scientific documentation. They cite each other.
2) Conference proceedings (lower quality than 1). However, my friends in “compute” say it is in proceedings that they publish their important work, not in 1. These are reviewed and are also cited.
3) Patents. These are in fact peer reviewed for novelty, but patents do not require any proof that the claim is valid. Patents are part of the publication list but are hardly solid scientific documentation of the invention. Patents are often written to confuse and to secure commercial freedom to operate, rather than with the clarity required of a peer-reviewed scientific article. Patents are rarely cited due to their poor scientific quality. Patents are secured before the work is published in 1 and 2.
4) Outreach articles. These have very little scientific value, are based on previous research under points 1 and 2, and thereby provide no new knowledge. Usable for showing that you are disseminating your research to groups other than scientists, which is important. Very rarely cited in 1 and 2. Use these at your own peril in publications in 1 and 2.
5) UN and government reports. These are not widely used and may contain novel data or interpretations of data collected in 1 and 2.
6) Books. These are typically teaching material or a collection of methods, not a place for documenting novel research in STEM science. Based on 1 and 2.

All are part of the publication list, but each has a different purpose and different quality, and they should not be confused. Sorry, we live in an impact-factor and citation-driven funding regime where peer-reviewed articles rule. There are some funds geared towards innovation that also look at patents as a quality parameter of a researcher. It is possible to survive the STEM academic world with publications in categories 1 and 2, but not in 3 - 6.

In the competitive academic world, publication lists that mix the above categories are seen, but make no mistake: a trained scientist will see through such attempts quite easily.

theorist9 · Jan 15, 2024

iPadified said:
Just to be clear: we are talking about....

Just to be clear, the point I keep making, and that you keep dodging, is that it's absurd to assert that, because you wouldn't include a document on your personal CV, I'm thus wrong to use the verb "publish" when discussing it.

You say you're a scientist, yet somehow you're unwilling to acknowledge that yawning logical disconnect. I'm just begging for some clear logical thinking here, and I'm not getting it. Instead, you keep throwing up the smoke screen of peer review, which, to use your phrasing:

iPadified said:
a trained scientist will see through...quite easily.

How about just coming clean and admitting you got that one wrong? I don't know you, so I can only judge you by your actions, but it feels like you're not someone who likes admitting they're wrong, and are doubling down to avoid that.

As someone who actually is a scientist, I think it's important to acknowledge when I get something wrong, and I have a long history on this site to support that. Just search for the phrase "thanks for the correction" under my user name.

How often on this site have you admitted you got something wrong, and thanked the person for pointing it out?

If your criticism of my post had been legitimate, I would have acknowledged it and thanked you for it. But I have little tolerance for those who criticize just for the sake of it, particularly when they get it wrong, as you did, and refuse to back down from it. That's the kind of behavior that makes interactions much less pleasant.

leman · Jan 22, 2024

I had to use my old 16" Intel MBP for a few days while on a business trip, and it again made it clear to me how far Apple has come in just a few years with their hardware. For example, I've been working on this software package for dataset coverage optimization: single-threaded code, matrix-heavy computations. On M3 with AMX, the code on a large dataset runs in 13 seconds. On the Intel i9-9980HK it takes 78 seconds, 6x slower. A staggering difference in practice.
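
For anyone who wants to try a comparison like this on their own machines, here is a minimal sketch of how such a timing might be taken. Everything in it is an assumption for illustration: `multiply` is a naive stand-in workload, not the package mentioned above, `measureSeconds` is a hypothetical helper, and nothing here goes through Accelerate or AMX.

```swift
import Foundation

// Hypothetical stand-in workload: naive single-threaded matrix multiply
// C = A * B for square, row-major matrices of size n.
func multiply(_ a: [Double], _ b: [Double], n: Int) -> [Double] {
  var c = [Double](repeating: 0, count: n * n)
  for i in 0..<n {
    for k in 0..<n {
      let aik = a[i * n + k]
      for j in 0..<n {
        c[i * n + j] += aik * b[k * n + j]
      }
    }
  }
  return c
}

// Hypothetical helper: runs `body` once and returns its result together
// with the elapsed wall-clock time in seconds.
func measureSeconds<T>(_ body: () -> T) -> (value: T, seconds: Double) {
  let start = Date()
  let value = body()
  return (value, Date().timeIntervalSince(start))
}

let size = 128
let lhs = [Double](repeating: 1, count: size * size)
let rhs = [Double](repeating: 2, count: size * size)
let timed = measureSeconds { multiply(lhs, rhs, n: size) }
print("n = \(size): \(timed.seconds) s")

// The speedup is just the ratio of the two wall-clock times,
// here using the 78 s and 13 s figures quoted above.
let speedup = 78.0 / 13.0
print("speedup: \(speedup)x")
```

Run the same build on both machines with the same input; a single run like this is noisy, so repeating the measurement a few times and taking the best or median time is the usual precaution.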

The Mercurian · Feb 10, 2024

Hey all. What's the current thinking on the best BLAS to use on M1 silicon for R/Stan etc.? I googled, but most info I'm finding is from 2021 or so, so I'm assuming it may not be current. I just updated to Sonoma and R is using:

BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

leman · Feb 10, 2024

The Mercurian said:
Hey all. What's the current thinking on the best BLAS to use on M1 silicon for R/Stan etc.? I googled, but most info I'm finding is from 2021 or so, so I'm assuming it may not be current. I just updated to Sonoma and R is using:

That’s interesting, did it use vecLib by default? I had to explicitly change the symlink to make it use Apple's implementation.

Using vecLib for BLAS is a no-brainer at this point IMO. I see speedups between 3-5x on matrix-heavy code.

The Mercurian · Feb 10, 2024

leman said:
That’s interesting, did it use vecLib by default? I had to explicitly change the symlink to make it use Apple's implementation.

Using vecLib for BLAS is a no-brainer at this point IMO. I see speedups between 3-5x on matrix-heavy code.

I think it did, but I had multiple windows open and maybe confused myself 😅😅 But I did notice that things got faster after updating, before I did anything. Anyhow, it seems I'm already on the fastest 🥳

name99 · Feb 10, 2024

The Mercurian said:
I think it did, but I had multiple windows open and maybe confused myself 😅😅 But I did notice that things got faster after updating, before I did anything. Anyhow, it seems I'm already on the fastest 🥳

Guys following AMX details might be interested in (2024) https://eprint.iacr.org/2024/002.pdf
Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs

I don't care about the crypto stuff, but I do like that they synthesized the work of myself, Pete Cawley, and Dougall Johnson to do something innovative with AMX, and along the way validated much of what we three concluded about how the system works.

name99 · Feb 28, 2024

Something that's been commented on is the apparent non-existence of the UltraFusion pads on the side of the M3 Max. Along with that goes the question of how the geometry of the mythical 4-chip M3 Extreme might work.
This patent suggests answers (though historically packaging is one of those areas where Apple throws out a lot of ideas then eventually implements only one of them, very unlike say CPU or GPU).

US20230299007A1 - Scalable Large System Based on Organic Interconnect - Google Patents

Multi-chip modules and methods of fabrication are described. The MCM may include a plurality of dies in which die-to-die routing can be partitioned within multiple metal routing layers for shorter die-to-die routings, while longer die-to-die routing can be routed primarily in a single metal...

patents.google.com

Look at, eg

Screenshot 2024-02-28 at 10.46.07 PM.png

The big idea here (though the patent does not put it this way) is that this is essentially a chiplet design.
The chips 110 are Max's. The big thing in the middle is a combination of some small chiplets (think something a little like EMIB) embedded in a substrate. As chiplets they incorporate all the IO hardware (SerDes and PHYs) that took up space for UltraFusion on M1 and M2; and as chiplets they can be produced on an older node, as old as makes sense.

So the dashed area of say 110-4 has beneath the chip a whole lot of dense pins that connect to the IO chiplet below it embedded in the substrate. The chiplet then does routing and communicates with the other three chiplets via further routing layers beneath it in the substrate.

This obviously scales up to 4 chips (as far as we go, I suspect, for M3 and M4). What I find especially compelling about this diagram is that it explains/looks so much like the actual M3 Max, in particular the bizarre way the SLC, memory controllers, and DRAM PHYs go about 3/4 of the way up the side of the chip and then just stop. If the area below the chip in that area is reserved for microbumps connecting to the submerged IO chiplets, then the layout makes sense...

This same basic design can easily be used for two chips as an Ultra. It could also fairly easily be expanded to 6 (3+3) chips. Beyond that, I expect something additional will be required; but one step at a time!

leman · Feb 29, 2024

name99 said:
The big idea here (though the patent does not put it this way) is that this is essentially a chiplet design.
The chips 110 are Max's. The big thing in the middle is a combination of some small chiplets (think something a little like EMIB) embedded in a substrate. As chiplets they incorporate all the IO hardware (SerDes and PHYs) that took up space for UltraFusion on M1 and M2; and as chiplets they can be produced on an older node, as old as makes sense.

So the idea is moving the UltraFusion infrastructure off-chip and connecting the on-chip network to the substrate which then does serialization and routing? Is this feasible? It was my understanding that the physical layer is necessary to transfer data off-chip to begin with. I have zero background in electrical connections at this scale, so I don’t have any intuition what is required. But can one really just connect the internal fabric via micro bumps without running into major issues with the signal?

name99 said:
What I find especially compelling about this diagram is that it explains/looks so much like the actual M3 Max, in particular the bizarre way the SLC, memory controllers, and DRAM PHYs go about 3/4 of the way up the side of the chip and then just stop. If the area below the chip in that area is reserved for microbumps connecting to the submerged IO chiplets, then the layout makes sense...

M1 Max had the same design though. I thought that the other side of the chip is used for IO, which is why the DRAM infrastructure doesn’t extend there.

I do find your interpretation very compelling and it makes sense to me. I’m just trying to understand the evidence better.

name99 · Feb 29, 2024

leman said:
So the idea is moving the UltraFusion infrastructure off-chip and connecting the on-chip network to the substrate which then does serialization and routing? Is this feasible? It was my understanding that the physical layer is necessary to transfer data off-chip to begin with. I have zero background in electrical connections at this scale, so I don’t have any intuition what is required. But can one really just connect the internal fabric via micro bumps without running into major issues with the signal?

M1 Max had the same design though. I thought that the other side of the chip is used for IO, which is why the DRAM infrastructure doesn’t extend there.

I do find your interpretation very compelling and it makes sense to me. I’m just trying to understand the evidence better.

You're right! I forgot that M1 Max was also "asymmetric" that way.

As for "can you move the PHY and SerDes parts of UltraFusion off the main chip?": well, the patent says so!
These things are hard to read if you don't know what you are looking for, but look at the difference between diagrams 1A (existing design) and 1B (new design). There are two big differences.
First is an RDL between the chips and the substrate. This allows some rerouting of signals so that the geometry of how the signals leave the Max doesn't have to match the way they enter the IO chiplet.
Second is the IO chiplets themselves (labelled 160).

As I said, packaging especially seems to throw out ideas a lot faster than Apple uses them! (Or perhaps they get reused in places we don't explore much like Apple Watch or Vision Pro packaging?) But enough pieces of what seems to be required for the next scaling up of M-series seem to be present here.

Regulus67 · May 18, 2024

Apple has published a lot of Apple Developer WWDC23 YouTube videos in the past month.
What would be your thoughts on this?

I am asking as a curious enthusiast, with no skill in programming.
I tried to study programming in a remote class at Sundsvall University (Sweden), 7 years ago. But the teacher never replied or gave any guidance. And told me to look on the internet when I struggled with Python. Yes, I dropped out.

Side note: I have studied several languages (Norwegian, Swedish, English, German, Russian and old Greek), and I need a proper structure to learn, with a clear syntax. Just learning words and listening to conversation doesn't do me any good.

I have watched a few of the videos already, and they give good guidance, as far as my sketchy understanding allows.

Philip Turner · May 18, 2024

Programming languages are means through which you speak to the ALUs and memory hierarchy. Most of programming is not only learning languages, but software engineering/systems engineering. Finding ways to translate ideas into chunks of code with levels of abstraction and modularity. Problem solving skills to stop bugs before they happen, or find them quickly and resolve them in baby steps. In scientific computing, it’s important to be able to understand every single algorithm at the lowest level. Often I sketch out the algorithm visually on my iPad, or occasionally in Minecraft (as I did to understand bulge chasing for 2-stage matrix diagonalization).

Programming is engineering. There are many tools you must learn to get things done. You get better with hands on experience, discovering problems relevant to your own goals and fixing them.
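
To make the "understand every algorithm at the lowest level" point concrete, here is a minimal sketch of the basic building block behind bulge chasing, a Householder reflector, in plain Swift. It uses the textbook form H = I - 2vvᵀ with a unit vector v; `makeReflector` and `applyReflector` are hypothetical names chosen for illustration, and a real implementation may well use a different scaling convention.

```swift
// Hypothetical helper: Euclidean norm of a vector.
func norm(_ x: [Float]) -> Float {
  var sum: Float = 0
  for value in x { sum += value * value }
  return sum.squareRoot()
}

// Hypothetical helper: builds a unit vector v such that
// (I - 2 v vᵀ) x is zero everywhere except in its first element.
func makeReflector(_ x: [Float]) -> [Float] {
  let xNorm = norm(x)
  // Covers both the empty vector and the all-zero vector.
  guard xNorm > 0 else { return [Float](repeating: 0, count: x.count) }

  // v = x + sign(x[0]) * ‖x‖ * e₁; matching the sign avoids cancellation.
  var v = x
  v[0] += (x[0] >= 0 ? xNorm : -xNorm)

  let vNorm = norm(v)
  return v.map { $0 / vNorm }
}

// Hypothetical helper: applies (I - 2 v vᵀ) to x, i.e. x - 2 (v · x) v.
func applyReflector(_ v: [Float], to x: [Float]) -> [Float] {
  var dot: Float = 0
  for (vi, xi) in zip(v, x) { dot += vi * xi }
  return zip(x, v).map { $0 - 2 * dot * $1 }
}

let column: [Float] = [3, 4, 0]
let reflector = makeReflector(column)
let reflected = applyReflector(reflector, to: column)
print(reflected)  // approximately [-5, 0, 0]
```

Applying the same reflector to every column (and then every row) of a matrix is a dot product followed by a rank-1 update, which is the pattern that maps onto BLAS calls.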

Philip Turner · May 18, 2024

The process of understanding how a matrix is transformed from band form to tridiagonal form:

And the successful implementation, testing, and optimization in the Swift programming language:

Code:

//
//  BulgeChasing.swift
//
//
//  Created by Philip Turner on 3/11/24.
//

import Accelerate

extension Diagonalization {
  // Returns a sequence of reflectors.
  mutating func chaseBulges() -> [Float] {
    // Allocate a matrix to store the bulge reflectors.
    var bulgeReflectors = [Float](
      repeating: .zero, count: problemSize * problemSize)
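    // Note: this pointer escapes the closure. Swift only guarantees its
    // validity inside the closure, so treat this as a shortcut.
    let reflectors = bulgeReflectors
      .withContiguousMutableStorageIfAvailable { $0.baseAddress! }!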
    let dotProducts: UnsafeMutablePointer<Float> =
      .allocate(capacity: 3 * blockSize)
    defer { dotProducts.deallocate() }
   
    // Loop over the bulge chasing sweeps.
    let sweepEnd = max(0, problemSize - 2)
    for sweepID in 0..<sweepEnd {
      var maxOperationID = (problemSize - 2) - (sweepID + 1)
      maxOperationID /= blockSize
     
      // Loop over the bulges within this sweep.
      for operationID in 0...maxOperationID {
        let startOfRow = operationID &* blockSize
        let startOfColumn = sweepID &* problemSize
        let startRowID = (sweepID + 1) &+ startOfRow
       
        // Create a reflector using the 'ReflectorGeneration' API.
        var generationDesc = ReflectorGenerationDescriptor()
        let columnID = (sweepID + 1) &+ max(-1, startOfRow &- blockSize)
        let matrixBaseAddress = columnID &* problemSize &+ startRowID
        let matrixSource = UnsafePointer(matrixPointer! + matrixBaseAddress)
        generationDesc.source = matrixSource
       
        // Find the address to begin writing data at.
        let reflectorBaseAddress = startOfColumn &+ (sweepID + 1) &+ startOfRow
        let reflector = reflectors + reflectorBaseAddress
        generationDesc.destination = reflector
       
        // Determine the dimension of the reflector.
        let maxReflectorElementID = problemSize &- (sweepID + 1)
        let endOfRow = min(startOfRow &+ blockSize, maxReflectorElementID)
        generationDesc.dimension = endOfRow &- startOfRow
        ReflectorGeneration(descriptor: generationDesc)
       
        // Apply to the trailing submatrix.
        let endRowID = min(startRowID &+ blockSize, problemSize)
        applyBulgeChase(
          reflector: reflector,
          dotProducts: dotProducts,
          startRowID: startRowID,
          endRowID: endRowID)
      }
    }
   
    return bulgeReflectors
  }
 
  // Applies the reflector to the trailing submatrix.
  @_transparent
  private mutating func applyBulgeChase(
    reflector: UnsafePointer<Float>,
    dotProducts: UnsafeMutablePointer<Float>,
    startRowID: Int,
    endRowID: Int
  ) {
    let startApplicationID = max(startRowID &- blockSize, 0)
    let endApplicationID = min(endRowID &+ blockSize, problemSize)
    let dotProductCount = endApplicationID &- startApplicationID
    let rangeCount = endRowID &- startRowID
   
    // Apply the reflector to the matrix, from the left.
    do {
      let matrixOffset = startApplicationID &* problemSize &+ startRowID
      let A = matrixPointer! + matrixOffset
      let B = reflector
      let C = dotProducts
#if false
      for m in 0..<dotProductCount {
        for n in 0..<1 {
          var dotProduct: Float = .zero
          for k in 0..<rangeCount {
            let lhsValue = matrix[matrixOffset + m * problemSize + k]
            let rhsValue = reflector[n * dotProductCount + k]
            dotProduct += lhsValue * rhsValue
          }
          dotProducts[m * 1 + n] = dotProduct
        }
      }
#else
      var TRANSA = CChar(84) // T
      var TRANSB = CChar(78) // N
      var M = Int32(truncatingIfNeeded: dotProductCount)
      var N = Int32(1)
      var K = Int32(truncatingIfNeeded: rangeCount)
      var ALPHA = Float(1)
      var LDA = Int32(truncatingIfNeeded: problemSize)
      var BETA = Float(0)
      var LDB = Int32(truncatingIfNeeded: dotProductCount)
      var LDC = Int32(truncatingIfNeeded: dotProductCount)
      sgemm_(
        &TRANSA,
        &TRANSB,
        &M,
        &N,
        &K,
        &ALPHA,
        A, &LDA,
        B, &LDB,
        &BETA,
        C, &LDC)
#endif
    }
   
    // Subtract the rank-1 update (dot products times reflector) from the matrix.
    do {
      let matrixOffset = startApplicationID &* problemSize &+ startRowID
      let X = reflector
      let Y = dotProducts
      let A = matrixPointer! + matrixOffset
#if false
      for m in 0..<dotProductCount {
        for n in 0..<rangeCount {
          let lhsValue = dotProducts[m * 1]
          let rhsValue = reflector[n * 1]
          matrix[matrixOffset + m * problemSize + n] -= lhsValue * rhsValue
        }
      }
#else
      var M = Int32(truncatingIfNeeded: rangeCount)
      var N = Int32(truncatingIfNeeded: dotProductCount)
      var ALPHA = Float(-1)
      var INCX = Int32(1)
      var INCY = Int32(1)
      var LDA = Int32(truncatingIfNeeded: problemSize)
      sger_(
        &M,
        &N,
        &ALPHA,
        X, &INCX,
        Y, &INCY,
        A, &LDA)
#endif
    }
   
    // Apply the reflector to the matrix, from the right.
    do {
      let matrixOffset = startRowID &* problemSize &+ startApplicationID
      let A = matrixPointer! + matrixOffset
      let B = reflector
      let C = dotProducts
#if false
      for m in 0..<dotProductCount {
        for n in 0..<1 {
          var dotProduct: Float = .zero
          for k in 0..<rangeCount {
            let lhsValue = matrix[matrixOffset + k * problemSize + m]
            let rhsValue = reflector[n * dotProductCount + k]
            dotProduct += lhsValue * rhsValue
          }
          dotProducts[m * 1 + n] = dotProduct
        }
      }
#else
      var TRANSA = CChar(78) // N
      var TRANSB = CChar(78) // N
      var M = Int32(truncatingIfNeeded: dotProductCount)
      var N = Int32(1)
      var K = Int32(truncatingIfNeeded: rangeCount)
      var ALPHA = Float(1)
      var LDA = Int32(truncatingIfNeeded: problemSize)
      var BETA = Float(0)
      var LDB = Int32(truncatingIfNeeded: dotProductCount)
      var LDC = Int32(truncatingIfNeeded: dotProductCount)
      sgemm_(
        &TRANSA,
        &TRANSB,
        &M,
        &N,
        &K,
        &ALPHA,
        A, &LDA,
        B, &LDB,
        &BETA,
        C, &LDC)
#endif
    }
   
    // Subtract the rank-1 update (reflector times dot products) from the matrix.
    do {
      let matrixOffset = startRowID &* problemSize &+ startApplicationID
      let X = dotProducts
      let Y = reflector
      let A = matrixPointer! + matrixOffset
#if false
      for m in 0..<rangeCount {
        for n in 0..<dotProductCount {
          let lhsValue = reflector[m]
          let rhsValue = dotProducts[n]
          matrix[matrixOffset + m * problemSize + n] -= lhsValue * rhsValue
        }
      }
#else
      var M = Int32(truncatingIfNeeded: dotProductCount)
      var N = Int32(truncatingIfNeeded: rangeCount)
      var ALPHA = Float(-1)
      var INCX = Int32(1)
      var INCY = Int32(1)
      var LDA = Int32(truncatingIfNeeded: problemSize)
      sger_(
        &M,
        &N,
        &ALPHA,
        X, &INCX,
        Y, &INCY,
        A, &LDA)
#endif
    }
  }
}

The most workable code in highly complex software systems is code that looks and feels like natural language. In fact, there needs to be a natural-language comment before every section of code. Separate complex lines into distinct statements, which are easier to interpret, even if it feels redundant.
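As a small illustration of that style, here is a minimal sketch of applying a reflector from the left using plain loops, with a comment before each section and one idea per statement. The helper name `applyReflectorFromLeft` and the row-major array layout are hypothetical, not taken from the code above, and the sketch assumes the reflector v is already scaled so that H = I - v * v^T (that is, ||v||^2 = 2).

```swift
// Hypothetical helper for illustration only. Applies H = I - v * v^T to a
// row-major matrix from the left, assuming v is pre-scaled (||v||^2 == 2).
func applyReflectorFromLeft(
  matrix: inout [Float],
  rowCount: Int,
  columnCount: Int,
  reflector: [Float]
) {
  // Check that the inputs have consistent sizes.
  precondition(matrix.count == rowCount * columnCount)
  precondition(reflector.count == rowCount)

  // Take the dot product of the reflector with every column of the matrix.
  var dotProducts = [Float](repeating: 0, count: columnCount)
  for columnID in 0..<columnCount {
    var dotProduct: Float = 0
    for rowID in 0..<rowCount {
      let matrixAddress = rowID * columnCount + columnID
      let matrixValue = matrix[matrixAddress]
      let reflectorValue = reflector[rowID]
      dotProduct += matrixValue * reflectorValue
    }
    dotProducts[columnID] = dotProduct
  }

  // Subtract the rank-1 update (reflector times dot products).
  for rowID in 0..<rowCount {
    for columnID in 0..<columnCount {
      let matrixAddress = rowID * columnCount + columnID
      let update = reflector[rowID] * dotProducts[columnID]
      matrix[matrixAddress] -= update
    }
  }
}

// Usage: v = [1, 1] gives H = [[0, -1], [-1, 0]].
var exampleMatrix: [Float] = [1, 2, 3, 4]
applyReflectorFromLeft(
  matrix: &exampleMatrix, rowCount: 2, columnCount: 2, reflector: [1, 1])
print(exampleMatrix) // prints [-3.0, -4.0, -1.0, -2.0]
```

The two loops correspond to the matrix-vector product and the rank-1 update, the same two steps that the BLAS calls perform in the optimized path.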

Philip Turner · May 18, 2024

When you talk about compressing a billion-step build sequence for a replicating nanosystem into a megabyte of information, it's all about code and methods to efficiently transform data: information theory, understanding how many bits are necessary to represent something in memory, and then writing assembly instructions or designing low-level computer hardware to execute a given goal, such as self-replication.

Regulus67 · May 18, 2024

Philip Turner said:
The process of understanding how a matrix is transformed from band form to tridiagonal form:
...

The most workable code in highly complex software systems is code that looks and feels like natural language. In fact, there needs to be a natural-language comment before every section of code. Separate complex lines into distinct statements, which are easier to interpret, even if it feels redundant.

Actually a very good example, even if in a roundabout way.
Because you shot way above my head, into the stratosphere. But the point is well taken.

This is very similar to how the Apple Developer videos I have watched are presented. So I take it as a confirmation that they are pretty good (classes). Which is what I wanted to know.

Thank you very much 👍

PetarM · May 5, 2025

Hi!

Does anyone have experience running FP16 computations via AMX?

According to the benchmark from corsix (https://github.com/corsix/amx), the FP16 throughput is double that of FP32. If true, then the base M1 AMX would have higher peak throughput in FP16 than the GPU. But I never observed this in MLX (nor Julia).

Running a test in MLX on the M4 Max, both the CPU and GPU saw a small speedup when running in FP16: 10-15% faster. The GPU got to 94% of its peak theoretical throughput.

The weird thing: for matrices of size 4096x4096 and 8192x8192, the performance on AMX in FP16 collapsed to half that of FP32.

leman · May 5, 2025

PetarM said:
Hi!

Does anyone have experience running FP16 computations via AMX?

According to the benchmark from corsix (https://github.com/corsix/amx), the FP16 throughput is double that of FP32. If true, then the base M1 AMX would have higher peak throughput in FP16 than the GPU. But I never observed this in MLX (nor Julia).

Running a test in MLX on the M4 Max, both the CPU and GPU saw a small speedup when running in FP16: 10-15% faster. The GPU got to 94% of its peak theoretical throughput.

The weird thing: for matrices of size 4096x4096 and 8192x8192, the performance on AMX in FP16 collapsed to half that of FP32.

I did some experiments with SME on M4, and the FP16 outer product has the same rate as the FP32 variant. Note that the only supported accumulator is FP32, so that is what might be limiting the performance. From what I understand, AMX might also support FP16-to-FP16 accumulation, but those instructions are not documented. The GPU performs FP32 and FP16 operations at the same rate, as far as I know.

How are you invoking AMX?

PetarM · May 5, 2025

leman said:
I did some experiments with SME on M4, and the FP16 outer product has the same rate as the FP32 variant. Note that the only supported accumulator is FP32, so that is what might be limiting the performance. From what I understand, AMX might also support FP16-to-FP16 accumulation, but those instructions are not documented. The GPU performs FP32 and FP16 operations at the same rate, as far as I know.

How are you invoking AMX?

I was testing it in Python, using MLX. I just call the matmul function provided by MLX and select the CPU as the device type.

The GPU indeed performs FP16 at the same rate as FP32, but as @name99 and @Philip Turner explained earlier in the thread, the register pressure is lower in half precision, so the GPU manages to achieve throughput closer to the theoretical maximum.

When I ran the benchmark from corsix, I found that matfp_f16f16_x*y+z achieved double the GFLOPS of matfp_f32f32_x*y+z. As the author explained, the code doesn't issue any loads/stores, so it's not representative of real-world performance. Not sure what to make of that. F16 matmul would seem to be supported by AMX at the instruction level, but is not used by AppleAccelerate?
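For anyone wanting to compare numbers like these across runs, here is a minimal sketch of turning a matmul timing into GFLOPS and a fraction of peak. The helper names `matmulGFLOPS` and `utilization` are hypothetical, the 2 * n^3 operation count is the usual convention for a dense n x n matmul, and the peak figure is an assumption you would have to supply from your own measurements or a spec sheet.

```swift
// Hypothetical helper: converts a square matmul timing into GFLOPS, assuming
// the conventional count of 2 * n^3 floating-point operations.
func matmulGFLOPS(matrixSize n: Int, seconds: Double) -> Double {
  // Count one multiply and one add per inner-loop iteration.
  let size = Double(n)
  let operations = 2 * size * size * size

  // Convert operations per second to billions of operations per second.
  let operationsPerSecond = operations / seconds
  return operationsPerSecond / 1e9
}

// Hypothetical helper: fraction of an assumed peak throughput that was achieved.
func utilization(achievedGFLOPS: Double, peakGFLOPS: Double) -> Double {
  return achievedGFLOPS / peakGFLOPS
}

// Usage with made-up numbers (not measurements from this thread).
let achieved = matmulGFLOPS(matrixSize: 1000, seconds: 2.0)
print(achieved) // prints 1.0
```

Comparing FP16 and FP32 this way only makes sense if both runs use the same matrix size and the timing excludes warm-up and data conversion.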
