
diamond.g

macrumors G4
Mar 20, 2007
11,405
2,638
OBX
The 3060 (non-Ti) does 24 MH/s (similar TFLOPS and bandwidth), and if you get around Nvidia's nerf it will do 50 MH/s. Not sure at what wattage though.

EDIT: minerstat says 110W for the 50MH/s rate.
 

leman

macrumors Core
Oct 14, 2008
19,494
19,631
I thought GDDR had lower access latency. But regardless of that, we know the *effective average latency* for getting bytes from unified memory into the registers is higher. Whether that is due to the base latency of the RAM, a different number of channels, a different minimum word size moved, less parallelism in the controllers... 🤷‍♂️

Ah, that's interesting! Do you have more details?

I can definitely imagine that the effective path to DRAM is longer, Apple is likely relying on its large caches to feed the GPU.
 

StellarVixen

macrumors 68040
Mar 1, 2018
3,253
5,779
Somewhere between 0 and 1
Most cryptocurrencies to date are built on "proof of work": Solve a really hard math problem whose solution is trivial to verify (this is mining), get rewarded with some meaningless tokens which crypto promoters claim are going to go to the moon because obviously the whole world wants to abandon conventional money and switch to crypto. The claims are all pump and dump nonsense, but we live in interesting times, so we haven't gotten to the end of the pumpers creating crypto bubbles to profit off the gullible.

Anyways. PoW uses enormous amounts of electrical power. This is by design: the act of solving those hard math problems is also tied to how a cryptocurrency network signs off on transactions (i.e. Bob has 5 coins and wants to pay Alice three of them). The idea is that by forcing anyone who wants to propose including some transactions in the globally visible public ledger to prove that they burned a substantial amount of compute power working on a meaningless math problem, you can make it too expensive for individual bad actors to try to push cheating transactions (e.g. pay the same "coin" to two different people) into the ledger.
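To make the "hard to solve, trivial to verify" part concrete, here's a toy nonce search (just the general shape of proof of work, not Ethereum's actual Ethash; the header string and difficulty are made up):

[CODE=swift]
import Foundation
import CryptoKit

// Toy proof of work: find a nonce so that SHA-256(header || nonce) starts with
// `difficulty` zero bytes. Finding the nonce is expensive; checking it is one hash.
func mine(header: Data, difficulty: Int) -> UInt64 {
    var nonce: UInt64 = 0
    while true {
        var candidate = header
        withUnsafeBytes(of: nonce.littleEndian) { candidate.append(contentsOf: $0) }
        let digest = SHA256.hash(data: candidate)
        if digest.prefix(difficulty).allSatisfy({ $0 == 0 }) {
            return nonce
        }
        nonce &+= 1
    }
}

let header = Data("Bob pays Alice 3 coins".utf8)
let nonce = mine(header: header, difficulty: 2)   // ~65k hashes on average to find
print("found nonce \(nonce)")                     // anyone can re-hash once to verify
[/CODE]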

But if the world actually ran on PoW crypto, we'd end up using more energy on mining than industrial and residential uses put together. As public awareness about this problem rises, crypto promoters have been searching for a solution. It's hard to sell people on the idea that crypto is Money 2.0 if adopting Money 2.0 means accelerating global warming and harming the economy.

The only one that's gotten much traction so far is Proof of Stake. However, it amounts to "Let those with the most cryptocurrency make the rules". This appeals to the people who hold a large amount of crypto and want to become the new oligarchs of the world without actually doing anything of merit. I probably do not need to tell you why this is not likely to actually work in the real world.

Also, while crypto promoters have been claiming PoS is about to be real for a long time, it never quite materializes.

Also, even if you solved PoW, crypto would still face enormous problems taking over global finance. It turns out that despite what the pumpers claim, cryptocurrencies are terrible at actually being currency at any kind of large scale.
Bias aside, this is a good explanation.
 
  • Like
Reactions: Newfiejudd

quarkysg

macrumors 65816
Oct 12, 2019
1,245
833
I thought GDDR had lower access latency. But regardless of that, we know the *effective average latency* for getting bytes from unified memory into the registers is higher (for this access pattern). Whether that is due to the base latency of the RAM, a different number of channels, a different minimum word size moved, less parallelism in the controllers... 🤷‍♂️
From what I’ve read, GDDR is suited for sequential access and DDR is suited for random access.

I guess the mining software is better optimized for dGPU.

In theory UMA should come out ahead, all else being equal (e.g. memory bandwidth, GPU power). In fact, for random access over a large dataset (~4GB), UMA with DDR should be better, as there's no PCIe bottleneck between system memory and VRAM.
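To illustrate the difference, a rough Metal sketch (assumptions: an Apple silicon Mac, and a buffer size that is just a stand-in for the real dataset): with UMA the CPU writes straight into a GPU-visible shared buffer, while the dGPU-style path keeps a second GPU-only copy that has to be filled with an explicit blit, which on a real dGPU is what crosses PCIe.

[CODE=swift]
import Metal

let device = MTLCreateSystemDefaultDevice()!
let length = 256 * 1024 * 1024                    // stand-in for the ~4GB dataset

// UMA path: one allocation, visible to both CPU and GPU, no copy at all.
let shared = device.makeBuffer(length: length, options: .storageModeShared)!
shared.contents().storeBytes(of: UInt64(0xDEADBEEF), as: UInt64.self)

// dGPU-style path: a GPU-only buffer that must be filled via a blit
// (on a discrete card this copy is the PCIe traffic).
let gpuOnly = device.makeBuffer(length: length, options: .storageModePrivate)!
let queue = device.makeCommandQueue()!
let cmd = queue.makeCommandBuffer()!
let blit = cmd.makeBlitCommandEncoder()!
blit.copy(from: shared, sourceOffset: 0, to: gpuOnly, destinationOffset: 0, size: length)
blit.endEncoding()
cmd.commit()
cmd.waitUntilCompleted()
[/CODE]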
 

quarkysg

macrumors 65816
Oct 12, 2019
1,245
833
How? Once the data is in memory, and you are purely doing memory access and crunching numbers, how is it optimized for dGPU's.
Software needs to be optimised for the architecture it is meant to run on. Let's take the Stockfish thread for example. There's a difference between the standard Stockfish code and the version that has been optimised for the M1.

I think you already know this, but are just trying to stir things up?
 

diamond.g

macrumors G4
Mar 20, 2007
11,405
2,638
OBX
Software needs to be optimised for the architecture it is meant to run on. Let's take the Stockfish thread for example. There's a difference between the standard Stockfish code and the version that has been optimised for the M1.

I think you already know this, but are just trying to stir things up?
Not trying to stir things up. Does Metal provide the means to change the things @ChainfireXDA mentioned were issues they saw?

The issue really seems to be memory subsystem centric. I know GPU speed doesn't really affect hashrate (within reason). UMA is supposed to be an advantage (which it should be for initial DAG creation, for sure), and because Apple is using LPDDR, latency should be lower than GDDR.

Maybe the DAG needs to be spread across all 4 memory chips to get the full 400GB/s of bandwidth (assuming the issue is actually bandwidth and not the latency of getting data into the cache for operation)?
 

quarkysg

macrumors 65816
Oct 12, 2019
1,245
833
Not trying to stir things up. Does Metal provide the means to change the things @ChainfireXDA mentioned were issues they saw?

The issue really seems to be memory subsystem centric. I know GPU speed doesn't really affect hashrate (within reason). UMA is supposed to be an advantage (which it should be for initial DAG creation, for sure), and because Apple is using LPDDR, latency should be lower than GDDR.

Maybe the DAG needs to be spread across all 4 memory chips to get the full 400GB/s of bandwidth (assuming the issue is actually bandwidth and not the latency of getting data into the cache for operation)?
Well, I have to admit I know very little on Metal's capability.

On the surface, from my point of view, UMA vs. dGPU VRAM over PCIe, as well as DDR vs. GDDR RAM, make for enough of an architectural difference to necessitate re-optimisation. Using the same code that works well on a traditional dGPU setup would likely cause bottlenecks on a UMA machine. We can look at the game Baldur's Gate 3, with its before and after optimisation phases for the M1 Macs.

From what I read on the Internet, it seems that the M1's CPU cores are able to fully saturate its memory bandwidth, while the M1 Pro/Max CPU cores can only chew through up to about 200GB/s. The GPU cores on the M1 Pro/Max, on the other hand, can reach more than 300GB/s. This is still quite a bit lower than high-end dGPU memory bandwidth, so that may account for some of the deficit, assuming the mining algorithm is bandwidth starved.
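As a rough idea of how that kind of number is measured on the CPU side, here's a single-threaded copy-test sketch (my own illustration, not a proper STREAM benchmark; you'd need several threads to actually saturate an M1 Pro/Max):

[CODE=swift]
import Foundation

// Copy 256MB and report GB/s, counting both the read and the write traffic.
let count = 256 * 1024 * 1024 / MemoryLayout<UInt64>.size
let src = [UInt64](repeating: 1, count: count)
var dst = [UInt64](repeating: 0, count: count)

let start = DispatchTime.now()
dst.withUnsafeMutableBytes { d in
    src.withUnsafeBytes { s in d.copyMemory(from: s) }
}
let seconds = Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9

let bytesMoved = 2.0 * Double(count * MemoryLayout<UInt64>.size)   // read + write
print(String(format: "%.1f GB/s (single thread)", bytesMoved / seconds / 1e9))
[/CODE]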

If the algorithm can push the entire dataset into the dGPU's VRAM and let it sit there while the GPU does its work, without shuffling massive datasets back to system memory, then the dGPU has an advantage there.

If there's a need to continuously shuffle data back and forth between dGPU VRAM and system RAM, then the UMA architecture will likely be advantageous, because that step is not needed. But if the algorithm still does this copying on a UMA system, then that is an area for optimisation. I believe this is what the Baldur's Gate 3 team did.

I have to admit again that I'm not familiar with the mining algorithm, so I cannot comment much there either.

In any case, the M1 Macs are a change in paradigm for developers, so it'll take time for software to get optimised. At the end of the day, the M1 SoCs may turn out to be a bad fit for blockchain mining, and I'm very OK with that ... heh heh.
 

Botts85

macrumors regular
Feb 9, 2007
229
175
It's been interesting reading this thread, specifically around the memory access of GDDR vs DDR. That makes some sense looking at the performance of the AMD cards with HBM2 at Ethereum mining.
 

jmurphyau

macrumors newbie
May 20, 2019
9
5
Melbourne AU
I've coded the Ethereum algorithm on Metal and in ARM assembly (using the new crypto instructions) - testing the results using the go-ethereum repo (i.e. making sure input X produces output Y) - on my M1 MacBook Pro (Apple M1 Max), I'm getting:

GPU: over 100Mh/s @ 30 watts of additional power usage
CPU: over 10Mh/s @ 20 watts of additional power usage

I'm in the process of testing this more (against test pools with real life values rather than the test data from go-ethereum) and then turning this into a Mac app atm (won't go on the App Store because of their rules)
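For anyone curious about the CPU path: it starts with checking whether the chip reports the ARMv8.2 SHA-3 extensions (EOR3/RAX1/XAR/BCAX, which help Keccak) before picking a kernel. A minimal sketch of that check is below; the sysctl key name is what I see reported on my M1, so treat the exact name as an assumption:

[CODE=swift]
import Foundation

// Returns true if the named hw.optional sysctl reports 1 on this machine.
func hasFeature(_ name: String) -> Bool {
    var value: Int32 = 0
    var size = MemoryLayout<Int32>.size
    return sysctlbyname(name, &value, &size, nil, 0) == 0 && value == 1
}

if hasFeature("hw.optional.armv8_2_sha3") {
    print("SHA-3 instructions available: use the accelerated Keccak kernel")
} else {
    print("Fall back to a plain C/Swift Keccak implementation")
}
[/CODE]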

Edit: looking at this thread, it looks like someone else has already done the same thing? I wonder why my results (for GPU) are much higher?
 
Last edited:

diamond.g

macrumors G4
Mar 20, 2007
11,405
2,638
OBX
I've coded the Ethereum algorithm on Metal and in ARM assembly (using the new crypto instructions) - testing the results using the go-ethereum repo (i.e. making sure input X produces output Y) - on my M1 MacBook Pro (Apple M1 Max), I'm getting:

GPU: over 100Mh/s @ 30 watts of additional power usage
CPU: over 10Mh/s @ 20 watts of additional power usage

I'm in the process of testing this more (against test pools with real life values rather than the test data from go-ethereum) and then turning this into a Mac app atm (won't go on the App Store because of their rules)

Edit: looking at this thread, it looks like someone else has already done the same thing? I wonder why my results (for GPU) are much higher?
Now that is an awesome result, if it translates to a “real” pool.
 

ChainfireXDA

macrumors newbie
Nov 8, 2021
6
5
This was my first ever GPU/Metal code. It's quite possible yours is simply *much better*. Nor did I have reason to believe it could be much faster, as the speed is essentially the same as the OpenCL miner's. There are definitely some parts of my Metal code that I thought could be faster, but I didn't see the route to get there. Maybe you've solved some of those. I'd be happy to check out your solution; if it really does produce 100Mh/s, that is truly epic.
 

Pressure

macrumors 603
May 30, 2006
5,166
1,531
Denmark
I've coded the Ethereum algorithm on Metal and in ARM assembly (using the new crypto instructions) - testing the results using the go-ethereum repo (i.e. making sure input X produces output Y) - on my M1 MacBook Pro (Apple M1 Max), I'm getting:

GPU: over 100Mh/s @ 30 watts of additional power usage
CPU: over 10Mh/s @ 20 watts of additional power usage

I'm in the process of testing this more (against test pools with real life values rather than the test data from go-ethereum) and then turning this into a Mac app atm (won't go on the App Store because of their rules)

Edit: looking at this thread, it looks like someone else has already done the same thing? I wonder why my results (for GPU) are much higher?
That looks great!

Mind posting a screenshot of the hashing power?
 

jmurphyau

macrumors newbie
May 20, 2019
9
5
Melbourne AU
@ChainfireXDA, do you have any advice on how or where to test Ethereum code? I'm at the point where I need to connect to a test pool (I wouldn't want to test against a prod pool), but I don't really know where to start.
 

ChainfireXDA

macrumors newbie
Nov 8, 2021
6
5
I guess the first step is implementing the basic protocol v1 ( https://github.com/nicehash/Specifications/blob/master/EthereumStratum_NiceHash_v1.0.0.txt )

It's absolutely fine connecting to a production pool for this. I use ViaBTC ( eth.viabtc.top port 433 or 3333 ). Most pools really don't care - if you do anything weird, they drop the connection, and that's that; don't worry too much about it.

That being said, I think you can set up geth or something to create a local Ethereum testnet, and you'd then run a stratum proxy for the mining code to connect to? I've never done this though.

My own code is closed because I reuse a lot of stuff from my job, but if you're planning to open-source what you've created yourself, you can share it and I might be able to plug it into my network layer for testing - time permitting. That only works (quickly) if you wrote it in C/C++, though; my stuff isn't compatible with ObjC/Swift without some hoops to jump through. Or you can try hacking it into ethminer, which is also open source. You can try the latter even if you plan to keep your stuff closed: as long as you don't spread any binaries, you're not obligated to share the source.
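For what it's worth, step one of that protocol is roughly: open a TCP connection to the pool and send mining.subscribe as a newline-terminated JSON-RPC line, then read the reply. A sketch using the Network framework (pool host/port as above; the miner name string is just a placeholder):

[CODE=swift]
import Foundation
import Network

let connection = NWConnection(host: "eth.viabtc.top", port: 3333, using: .tcp)
connection.stateUpdateHandler = { state in
    guard case .ready = state else { return }
    // EthereumStratum v1 handshake, one JSON-RPC object per line.
    let subscribe = #"{"id":1,"method":"mining.subscribe","params":["MyTestMiner/0.1","EthereumStratum/1.0.0"]}"# + "\n"
    connection.send(content: subscribe.data(using: .utf8),
                    completion: .contentProcessed { error in
        if let error = error { print("send failed: \(error)"); return }
        connection.receive(minimumIncompleteLength: 1, maximumLength: 4096) { data, _, _, _ in
            if let data = data { print(String(decoding: data, as: UTF8.self)) }
        }
    })
}
connection.start(queue: .main)
RunLoop.main.run()   // keep the toy example alive while waiting for the pool's reply
[/CODE]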
 
  • Like
Reactions: Romain_H

jmurphyau

macrumors newbie
May 20, 2019
9
5
Melbourne AU
Seems due to memory and memory cache performance.

The first tests were using such a small amount of data that it likely all fit in the cache, whereas now there is much more data being worked with.
 

mi7chy

macrumors G4
Oct 24, 2014
10,591
11,279
Seems due to memory and memory cache performance.

The first tests were using such a small amount of data that it likely all fit in the cache, whereas now there is much more data being worked with.

What hashrate are you seeing now? Even half would be considered very good.
 

Botts85

macrumors regular
Feb 9, 2007
229
175
The CPU performance has me considering trying Monero on the M1.
What hashrate are you seeing now? Even half would be considered very good.
30Mh/s at that wattage would be freaking incredible. IIRC, the 3060 Ti was the efficiency king at 60MH/s and 120W.

That's about 0.5Mh per watt (2 watts per Mh). If Apple Silicon did 1Mh per watt it'd be a huge step forward.
 

mi7chy

macrumors G4
Oct 24, 2014
10,591
11,279
The CPU performance has me considering trying Monero on the M1.

30Mh/s at that wattage would be freaking incredible. IIRC, the 3060 Ti was the efficiency king at 60MH/s and 120W.

That's about 0.5Mh per watt (2 watts per Mh). If Apple Silicon did 1Mh per watt it'd be a huge step forward.

The 3060 Ti FE has 3x the ETH hashrate, is 22% more efficient per W, and costs almost 1/8 as much ($400 3060 Ti from Best Buy vs $3100+ M1 Max).

https://www.tomshardware.com/how-to/optimize-your-gpu-for-ethereum-mining
3060ti FE tuned 60.6 MH/s @ 115.6W = 0.524 MH/s per W

https://github.com/Chainfire/UselethMiner
M1 Max GPU 10.3 MH/s @ 24W = 0.429 MH/s per W
M1 Max GPU+CPU 20MH/s @ 62W = 0.323 MH/s per W
M1 Max CPU 10.4 MH/s @ 40W = 0.260 MH/s per W

For Monero, it's near AMD 5600U mobile hashrate.

https://xmrig.com/benchmark
[attached: xmrig benchmark screenshots]
 

diamond.g

macrumors G4
Mar 20, 2007
11,405
2,638
OBX
Seems due to memory and memory cache performance.

The first tests were using such a small amount of data that it likely all fit in the cache, whereas now there is much more data being worked with.
Well, that is a shame; it would have been cool to get 100MH/s without burning all the power GA102 does to get there. It is also interesting that there is such a high penalty for not staying in the cache, as you would think 400GB/s of bandwidth would make up for it.
 