cMP Slow AFPS 4K Write performance

handheldgames · Jan 31, 2019

If user

h9826790 said:
I bet that disk benchmark disabled cache intentionally in order to measure pure disk performance. However, that will be totally unrealistic.

e.g. When Apple design APFS, it should be use with cache (this is the normal situation). Therefore, the 4k no cache writing performance isn't that important. And they should focus on improving the real world whole system performance, but not pure disk 4k no cache writing performance.

I will say may be you really discovered one of the performance disadvantage by using APFS (vs HFS+). However, if that doses't matter in real world, is that really still so important?

Importance depends on the individual. In your case, the answer appears to be no. All I can do is bring forward the facts...

Now.. those willing to spend / invest in a $200-$400 PCIe 3.0 PCIe SSD adapter and another pile of cash for multiple SSD's probably have a higher level of interest/importance in achieving top performance for their workflow. These types of users probably also use 3 sticks of memory per CPU in a 2009-2012 cMP to improve memory performance that many claim has no real-world performance gain.
[doublepost=1548985583][/doublepost]

crjackson2134 said:
Yes, I too have noticed this as well. Subjectively, I’ll add that HFS+ feels snappier when clicking on a target and waiting for a disk response. I won’t try to prove it to anyone, I have no need. It just feels that way to me.

I know, I’ll be the next person to be on the receiving end of contrary comments, but I do notice the same behaviors reported in your observations.

Thx for sharing, I've been trying to identify a slowdown in PCIe SSD performance for over a year. While I thought it was an effect of the patches to the firmware bitcode that has been seen on Windows, it's somewhat comforting to know the source of the big drop in small file performance.

w1z · Feb 3, 2019

So I decided to test 4k random write and read performance using fio on 4 separate partitions (HFS, HFS Encrypted, APFS and APFS Encrypted) created on my 970 Pro NVMe 1TB drive, which had 70% available space:

* deleted the test file from the partitions before running each test.

HFS results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/HFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=70279: Sun Feb  3 17:34:35 2019
  write: IOPS=139k, BW=543MiB/s (570MB/s)(50.0MiB/92msec)
  cpu          : usr=45.05%, sys=52.75%, ctx=66, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=3.8%, 8=75.4%, 16=20.6%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.5%, 8=4.4%, 16=2.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=543MiB/s (570MB/s), 543MiB/s-543MiB/s (570MB/s-570MB/s), io=50.0MiB (52.4MB), run=92-92msec

Random Read (parameters: no caching/direct access, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/HFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randread

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=72157: Sun Feb  3 17:45:21 2019
  read: IOPS=225k, BW=877MiB/s (920MB/s)(50.0MiB/57msec)
  cpu          : usr=46.43%, sys=53.57%, ctx=17, majf=0, minf=23
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.1%, 16=6.3%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=0.0%, 16=5.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=877MiB/s (920MB/s), 877MiB/s-877MiB/s (920MB/s-920MB/s), io=50.0MiB (52.4MB), run=57-57msec

HFS Encrypted results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/HFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=75394: Sun Feb  3 18:04:18 2019
  write: IOPS=82.1k, BW=321MiB/s (336MB/s)(50.0MiB/156msec)
  cpu          : usr=48.39%, sys=48.39%, ctx=489, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=50.7%, 16=49.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.1%, 8=0.9%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=321MiB/s (336MB/s), 321MiB/s-321MiB/s (336MB/s-336MB/s), io=50.0MiB (52.4MB), run=156-156msec

Random Read (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/HFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randread

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=79594: Sun Feb  3 18:31:34 2019
  read: IOPS=225k, BW=877MiB/s (920MB/s)(50.0MiB/57msec)
  cpu          : usr=46.43%, sys=51.79%, ctx=5, majf=0, minf=24
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=0.0%, 16=5.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=877MiB/s (920MB/s), 877MiB/s-877MiB/s (920MB/s-920MB/s), io=50.0MiB (52.4MB), run=57-57msec

APFS results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/APFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=70783: Sun Feb  3 17:37:32 2019
  write: IOPS=53.6k, BW=209MiB/s (219MB/s)(50.0MiB/239msec)
  cpu          : usr=44.54%, sys=40.34%, ctx=2458, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=35.8%, 16=64.1%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.4%, 8=0.5%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=209MiB/s (219MB/s), 209MiB/s-209MiB/s (219MB/s-219MB/s), io=50.0MiB (52.4MB), run=239-239msec

Random Read (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/APFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randread

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=72996: Sun Feb  3 17:50:07 2019
  read: IOPS=229k, BW=893MiB/s (936MB/s)(50.0MiB/56msec)
  cpu          : usr=47.27%, sys=52.73%, ctx=19, majf=0, minf=23
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=0.1%, 16=5.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=893MiB/s (936MB/s), 893MiB/s-893MiB/s (936MB/s-936MB/s), io=50.0MiB (52.4MB), run=56-56msec

APFS Encrypted results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=32 --size=50M --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=71324: Sun Feb  3 17:40:24 2019
  write: IOPS=28.1k, BW=110MiB/s (115MB/s)(50.0MiB/456msec)
  cpu          : usr=29.23%, sys=24.40%, ctx=8555, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=42.8%, 16=57.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=98.9%, 8=1.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=50.0MiB (52.4MB), run=456-456msec

Random Read (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=32 --size=50M --readwrite=randread

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=74222: Sun Feb  3 17:56:47 2019
  read: IOPS=225k, BW=877MiB/s (920MB/s)(50.0MiB/57msec)
  cpu          : usr=46.43%, sys=51.79%, ctx=15, majf=0, minf=23
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=0.0%, 16=5.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=877MiB/s (920MB/s), 877MiB/s-877MiB/s (920MB/s-920MB/s), io=50.0MiB (52.4MB), run=57-57msec

It's quite clear that 4k random write performance, including iops, drops by ~50% when using APFS in comparison to HFS. Also, when using APFS Encrypted expect 4k random write performance to drop by another ~50% when compared to APFS and for HFS Encrypted by ~40% when compared to HFS.

4k random read performance was not impacted in all tests. I am beginning to lean towards this being a file system issue unless the T2 test results prove otherwise.

h9826790 · Feb 3, 2019

W1SS said:

So I decided to test 4k random write and read performance using fio on 4 separate partitions (HFS, HFS Encrypted, APFS and APFS Encrypted) created on my 970 Pro NVMe 1TB drive, which had 70% available space:

* deleted the test file from the partitions before running each test.

HFS results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/HFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=70279: Sun Feb  3 17:34:35 2019
  write: IOPS=139k, BW=543MiB/s (570MB/s)(50.0MiB/92msec)
  cpu          : usr=45.05%, sys=52.75%, ctx=66, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=3.8%, 8=75.4%, 16=20.6%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.5%, 8=4.4%, 16=2.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=543MiB/s (570MB/s), 543MiB/s-543MiB/s (570MB/s-570MB/s), io=50.0MiB (52.4MB), run=92-92msec

Random Read (parameters: no caching/direct access, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/HFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randread

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=72157: Sun Feb  3 17:45:21 2019
  read: IOPS=225k, BW=877MiB/s (920MB/s)(50.0MiB/57msec)
  cpu          : usr=46.43%, sys=53.57%, ctx=17, majf=0, minf=23
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.1%, 16=6.3%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=0.0%, 16=5.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=877MiB/s (920MB/s), 877MiB/s-877MiB/s (920MB/s-920MB/s), io=50.0MiB (52.4MB), run=57-57msec

HFS Encrypted results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/HFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=75394: Sun Feb  3 18:04:18 2019
  write: IOPS=82.1k, BW=321MiB/s (336MB/s)(50.0MiB/156msec)
  cpu          : usr=48.39%, sys=48.39%, ctx=489, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=50.7%, 16=49.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.1%, 8=0.9%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=321MiB/s (336MB/s), 321MiB/s-321MiB/s (336MB/s-336MB/s), io=50.0MiB (52.4MB), run=156-156msec

Random Read (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/HFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randread

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=79594: Sun Feb  3 18:31:34 2019
  read: IOPS=225k, BW=877MiB/s (920MB/s)(50.0MiB/57msec)
  cpu          : usr=46.43%, sys=51.79%, ctx=5, majf=0, minf=24
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=0.0%, 16=5.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=877MiB/s (920MB/s), 877MiB/s-877MiB/s (920MB/s-920MB/s), io=50.0MiB (52.4MB), run=57-57msec

APFS results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/APFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=70783: Sun Feb  3 17:37:32 2019
  write: IOPS=53.6k, BW=209MiB/s (219MB/s)(50.0MiB/239msec)
  cpu          : usr=44.54%, sys=40.34%, ctx=2458, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=35.8%, 16=64.1%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.4%, 8=0.5%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=209MiB/s (219MB/s), 209MiB/s-209MiB/s (219MB/s-219MB/s), io=50.0MiB (52.4MB), run=239-239msec

Random Read (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=/Volumes/APFS/test --bs=4k --iodepth=32 --size=50M --readwrite=randread

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=72996: Sun Feb  3 17:50:07 2019
  read: IOPS=229k, BW=893MiB/s (936MB/s)(50.0MiB/56msec)
  cpu          : usr=47.27%, sys=52.73%, ctx=19, majf=0, minf=23
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=0.1%, 16=5.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=893MiB/s (936MB/s), 893MiB/s-893MiB/s (936MB/s-936MB/s), io=50.0MiB (52.4MB), run=56-56msec

APFS Encrypted results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=32 --size=50M --readwrite=randwrite

test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=71324: Sun Feb  3 17:40:24 2019
  write: IOPS=28.1k, BW=110MiB/s (115MB/s)(50.0MiB/456msec)
  cpu          : usr=29.23%, sys=24.40%, ctx=8555, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=42.8%, 16=57.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=98.9%, 8=1.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=50.0MiB (52.4MB), run=456-456msec

Random Read (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜  ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=32 --size=50M --readwrite=randread

test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.12
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=74222: Sun Feb  3 17:56:47 2019
  read: IOPS=225k, BW=877MiB/s (920MB/s)(50.0MiB/57msec)
  cpu          : usr=46.43%, sys=51.79%, ctx=15, majf=0, minf=23
  IO depths    : 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.4%, 8=0.0%, 16=5.6%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=12800,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=877MiB/s (920MB/s), 877MiB/s-877MiB/s (920MB/s-920MB/s), io=50.0MiB (52.4MB), run=57-57msec

It's quite clear that 4k random write performance, including iops, drops by ~50% when using APFS in comparison to HFS. Also, when using APFS Encrypted expect 4k random write performance to drop by another ~50% when compared to APFS and for HFS Encrypted by ~40% when compared to HFS.

4k random read performance was not impacted in all tests. I am beginning to lean towards this being a file system issue unless the T2 test results prove otherwise.

Do you mind to check activity monitor and see if the APFS 4k write possible CPU single thread limiting?

Of course, this is still a kind of file system issue (too rely on CPU single thread performance). But if this is the case, then it may explain why cMP is so affected, but not other Mac with newer CPU.

If anyone can check is there any difference between 2.26 and 3.46GHz CPU, that will also help.

w1z · Feb 3, 2019

h9826790 said:
Do you mind to check activity monitor and see if the APFS 4k write possible CPU single thread limiting?

Nope, it's multi-threaded for both write and read... 4 threads increased in utilization by the kernel_task process. I had to increase the test file size to 100M as the 50M test was completing too quick for me to track the utilization.

Edit: I think you were spot on in pointing to the AmorphousDiskMark being a single thread app.. If you take the fio APFS read and write results and divide them by 4 (the number of threads) you would get the same results as the ones seen in AmorphousDiskMark

thornslack · Feb 3, 2019

I think he was asking more if any single thread is getting maxed out, versus multi thread distribution capability. ie is the cMPs IPC to blame.

Even if this were the case however, it would put to a fault in APFS design, no? Why would you want your native file system to impose that sort of overhead? The whole purpose of new frameworks for OS and compute (Vulcan, Metal, dx12) is to reduce that sort of load.

w1z · Feb 3, 2019

thornslack said:
I think he was asking more if any single thread is getting maxed out, versus multi thread distribution capability. ie is the cMPs IPC to blame.

Even if this were the case however, it would put to a fault in APFS design, no? Why would you want your native file system to impose that sort of overhead? The whole purpose of new frameworks for OS and compute (Vulcan, Metal, dx12) is to reduce that sort of load.

It wasn't running in a single thread. The 4 threads had about 20% to 30% utilization.

It could be a bug or by design. Perhaps Apple knew there was an overhead which could be the reason they chose to offload it to the T2 coprocessor and have it altogether manage file system operations, encryption and security in one package.

AidenShaw · Feb 3, 2019

W1SS said:
It wasn't running in a single thread. The 4 threads had about 20% to 30% utilization.

Four threads is useful for measuring spinners.

NVMe drives are designed for up to 64K or more threads.

handheldgames · Feb 3, 2019

AidenShaw said:
Four threads is useful for measuring spinners.

NVMe drives are designed for up to 64K or more threads.

Is it me or it's hard to believe that an NVMe SSD can handle up to 65536 simultaneous threads.

w1z · Feb 3, 2019

handheldgames said:
Is it me or it's hard to believe that an NVMe SSD can handle up to 65536 simultaneous threads.

If my memory serves me right, NVMe supports a maximum of 64K I/O queues but not threads.. there's a difference between queues and threads. That's not to say that there aren't NAND-based storage solutions that support 64K threads but NVMe isn't one of them.

deconstruct60 · Feb 3, 2019

thornslack said:
....
Even if this were the case however, it would put to a fault in APFS design, no? Why would you want your native file system to impose that sort of overhead? The whole purpose of new frameworks for OS and compute (Vulcan, Metal, dx12) is to reduce that sort of load.

APFS keeps checksums on the metadata of the file system. HFS+ does not. [ ZFS takes even more 'horsepower' grunt to checksum all of the data. ]. The IOPS dropping off is suggestive of the file system doing more computation in addition to I/O requests. ( encryption is indicative. I/O rate also goes down).
[ I would curious what Apple pushes their checksum function through. if they vectorized it for something like AVX2 and fall back to SSE4 if ancient then that could be a difference also. If so may see T2 system difference, but because of the CPU more so than the SSD. ]

At first glance it also seems to get worse as crank up the Queue Depth. When is even more metadata changes in parallel.
[doublepost=1549258244][/doublepost]

W1SS said:
So I decided to test 4k random write and read performance using fio on 4 separate partitions (HFS, HFS Encrypted, APFS and APFS Encrypted) created on my 970 Pro NVMe 1TB drive, which had 70% available space:

* deleted the test file from the partitions before running each test.

HFS results
Random Write (parameters: no caching/direct, file size 50MB, block size 4k, io depth 32)

Code:

➜ ~ fio --randrepeat=1 --ioengine=posixaio --direct=1 --gtod_reduce=1 --name=test --....

...

Seems likely that fio is using POSIX file i/o ops. [ A quick web search points to a page which list various flavors of unix where this has been ported to. ]

APFS normal mode isn't really POSIX complaint. Apple has adaptations so that if only go through the POSIX APIs that it still is complaint, but that isn't the normal mode.

Same issue for the other benchmarks which may being doing low level POSIX api calls to be portable.

AidenShaw · Feb 4, 2019

handheldgames said:
Is it me or it's hard to believe that an NVMe SSD can handle up to 65536 simultaneous threads.

W1SS said:
If my memory serves me right, NVMe supports a maximum of 64K I/O queues but not threads...

Is an I/O queue not analogous to a thread? An NVMe drive can have up to 65535 (64 Ki - 1) queues with 65536 (64 Ki) active commands in each - or roughly 4 Gi I/O operations in flight at once.

nvme-performance-tp692-1-1610us.pdf at Seagate.com

NVMe also has a much more robust command-queue structure with a significantly larger queue depth than AHCI/ SATA. Whereas AHCI/SATA has one command queue with a depth of 32 commands, NVMe is designed to have up to 65,535 queues with as many as 65,536 commands per queue. The much higher queue depth for NVMe allows for a greater number of commands that can be executed simultaneously

That is why running AHCI-era benchmarks on NVMe disks can be misleading to useless. Instead of QD32, try QD32Ki (QD32768).

handheldgames · Feb 4, 2019

AidenShaw said:
Is an I/O queue not analogous to a thread? An NVMe drive can have up to 65535 (64 Ki - 1) queues with 65536 (64 Ki) active commands in each - or roughly 4 Gi I/O operations in flight at once.

That is why running AHCI-era benchmarks on NVMe disks can be misleading to useless. Instead of QD32, try QD32Ki (QD32768).

I had no idea that QD32Ki is a common NVMe performance benchmark. Please share additional info. I'd be interested to see the reviews where NVMe ssd's are tortured to that level of performance.

Although I'm wondering... which is misleading to useless?

w1z · Feb 5, 2019

AidenShaw said:
Is an I/O queue not analogous to a thread? An NVMe drive can have up to 65535 (64 Ki - 1) queues with 65536 (64 Ki) active commands in each - or roughly 4 Gi I/O operations in flight at once.

That is why running AHCI-era benchmarks on NVMe disks can be misleading to useless. Instead of QD32, try QD32Ki (QD32768).

Not really. I/O queues are generated and managed by threads but threads are not generated nor managed by I/O queues. You cannot have 64K threads generating or managing the same I/O queue but you can have 64K I/O queues generated and managed by a single thread in a single process or by multiple threads in single or multiple processes. So NVMe cannot handle 64K threads but can handle 64K I/O queues with 64K commands in each queue initiated and managed by a single or x number of threads (where x is defined by system capabilities and resources).

Personally, I have not seen or read anywhere about NVMe being able to handle 64K threads simultaneously. FC-NVMe (fiber channel over nvme) on the other hand can but that requires an FC setup with multiple storage controllers connected to multiple JBOFs over 1 or 2 nodes at a minimum.

Retesting using QD32768 (summaries)

HFS random write (4k, No Cache, 50M, 4 threads)
write: IOPS=85.9k, BW=336MiB/s (352MB/s)(50.0MiB/149msec)

HFS Encrypted random write (4k, No Cache, 50M, 4 threads)
write: IOPS=76.2k, BW=298MiB/s (312MB/s)(50.0MiB/168msec)

APFS random write (4k, No Cache, 50M, 5 threads)
write: IOPS=57.7k, BW=225MiB/s (236MB/s)(50.0MiB/222msec)

APFS Encrypted random write (4k, No Cache, 50M, 6 threads)
write: IOPS=26.8k, BW=105MiB/s (110MB/s)(50.0MiB/478msec)

deconstruct60 · Feb 5, 2019

W1SS said:
....

Retesting using QD32768 (summaries)

HFS random write (4k, No Cache, 50M, 4 threads)
write: IOPS=85.9k, BW=336MiB/s (352MB/s)(50.0MiB/149msec)

HFS Encrypted random write (4k, No Cache, 50M, 4 threads)
write: IOPS=76.2k, BW=298MiB/s (312MB/s)(50.0MiB/168msec)

APFS random write (4k, No Cache, 50M, 5 threads)
write: IOPS=57.7k, BW=225MiB/s (236MB/s)(50.0MiB/222msec)

APFS Encrypted random write (4k, No Cache, 50M, 6 threads)
write: IOPS=26.8k, BW=105MiB/s (110MB/s)(50.0MiB/478msec)

Symptomatic of a lock (or set of locks ) being used to serialize metadata updates. APFS is probably isn't optimized for RDBMS workloads. Also pretty good chance Apple made their POSIX compliant file API work correctly (still pass the test0 more so than fast.

Also not surprising in context of iOS devices maxing out about 4 "big cores".

w1z · Feb 24, 2019

So I got to fiddle around with a base model iMac Pro that has a T2 chip at an Apple store and ran the same benchmarks ( AmorphiousDiskMark ) ... the results are not surprising as it confirms what I always suspected - it's the T2 chip and my money is on the SSD controller and crypto accelerator (behind APFS encrypted's results).

APFS - 50MB test

4K QD32 AVERAGE READ 285 MB/s - 4K QD32 AVERAGE WRITE 280 MB/S
4K AVERAGE READ 268 MB/s - 4K AVERAGE WRITE 244 MB/s

APFS Encrypted results were similar!

Edited: added AmorphiousDiskMark

cynics · Mar 17, 2019

Write amplification will be insanely high with a benchmark using 4k blocks on a SSD optimized filesystem that utilizes write-on-copy. You are essentially doing everything that Apple tells you not to do when programing for APFS all at once.

If write amplification is literally 20-40x higher than real world operation (kernel clusters I/O operation to maximize sequential) its no surprise encrypted performance would be so bad. Need new benchmark software.

turbineseaplane · Mar 30, 2019

I'm late to this thread and by no means knowledgeable here..

I've just put a Sabrent Rocket 2TB into my machine - did a DD clone and using APFS getting this result.

I'm on High Sierra, and by no means wedded to APFS.

Would I be better off speed wise installing a fresh copy of High Sierra on HFS+ on this new NVMe drive and then pulling over my user data manually?

Is that the consensus of this thread - this write speed thing is an APFS thing or do we think the benchmarks are the issue?

Just a note.
I'm running a Hack w/ i7 8700 based upon Gigabyte z370n-wifi MoBo

handheldgames · Mar 30, 2019

turbineseaplane said:
I'm late to this thread and by no means knowledgeable here..

I've just put a Sabrent Rocket 2TB into my machine - did a DD clone and using APFS getting this result.

I'm on High Sierra, and by no means wedded to APFS.

Would I be better off speed wise installing a fresh copy of High Sierra on HFS+ on this new NVMe drive and then pulling over my user data manually?

Is that the consensus of this thread - this write speed thing is an APFS thing or do we think the benchmarks are the issue?

Just a note.
I'm running a Hack w/ i7 8700 based upon Gigabyte z370n-wifi MoBo

Small file write performance seems to take the biggest hit in APFS.
For most tasks, the difference in performance should not be noticeable.
Most tasks on MacOS that write lots of small files, such as Mail, Messages, etc. are constrained by a much slower internet bandwidth so there probably isn't any discernible slowdown.
Processes that write lots of small files, such as installing an Operating system or cloning a drive will take longer.
Large file writes are not effected by APFS and in some cases it's faster than HFS+

If your hack is running well on APFS, you should be good to go. Although, a quick check the tonymacx86 forums wouldn't hurt.

turbineseaplane · Mar 30, 2019

handheldgames said:
If your hack is running well on APFS, you should be good to go. Although, a quick check the tonymacx86 forums wouldn't hurt.

Excellent!
Thank you for the reply.

Things have been superb so far...I was just worried about the benchmarks and if something wasn’t aligned right after my dd clone operation...but after seeing this thread and hearing your thoughts, it seems clear that APFS is the issue here.

Thanks again!

canhaz · Sep 25, 2019

I can confirm what others are seeing. Here are my results.

HW: Macbook Pro 13" 2018 (T2 chip). 2TB Internal SSD formatted APFS with Filevault 2.

Granted my space is really bad 8% free. But that write performance on 4k random. FML

solaris8x86 · Oct 10, 2019

I'm a final cut pro user. I can confirm the benefit of high sequential read/write speed has no big affect to the performance of FCPX when editing 4k clips (h264) in the editing windows (real time preview). But the random I/O is the key factor for how smooth your editing can be. This is the result of my Acrea RAID-0 (SanDisk Extreme Pro 480GB x 4 (RAID-0)) volume with Acrea controller I/O (firmware adjusted set to 4k block (not 64bit LBA). I have a high 4k random I/O benchmark closes to the Apple SSD. And better than NVMe 970 Evo (PCIe 2.0 adapter). But if your NVMe adapter comes with a PCIe 2.0 -> PCIe3.0 switch on its board. Then NVMe is obviously better because the random I/O can be improved significantly.

The sequential read/write improvement is only useful when doing a backup job. In the real world applications usage. It is not that really helpful as expected. That's why a lot of people have switched to NVMe on Mac Pro 4,1 or 5,1, but the smoothness of the editing speed hasn't been improved much. (of course, it also has a direct relationship to GPU hardware acceleration) too. But in my experience. The 4K random I/O seems play a key role after a few tests on my Mac Pro 2010.

** Acrea 6Gbps SAS RAID controller - benchmark of RAID-0 APFS volume. High random I/O is the key to improve the responding speed (sensitivity) of applications. Not by the sequential I/O performance. After changing the I/O setting from [64bit LBA] to [4K block] in the controller (firmware setting). My 4K QD1 test score jumps from just 22 to 90.62. After that, when editing a 4K 8GB size h264 video clip in FCPX. The editing experience and application sensitivity has a big improvement. (At least, loading the preview thumbnails in the FCPX timeline is much much faster).

Now, this theory explains why Mac Pro 4,1 and 5,1 people even after enabled the GPU hardware acceleration and with NVMe SSD installed. They still cannot obtain a performance result closes to the latest MacBook Pro 2018 models or Mac Pro 6,1. If you pay attention to the benchmarks we'd gathered. You would notice that the different which impacts to the application performance is the performance result of 4K QD1 test. If you after installing a NVMe SSD to your Mac Pro 4,1 or 5,1 and your 4K QD1 score is still around 40... then sorry. You wouldn't be seeing a major improvement on your applications. If your score is about 100. Then, It's close to a MacBook Pro 2018 or above (when editing a clips on the FCPX timeline is noticeable much faster).

My point in this reply is, don't just blind-upgrade the NVMe to your Mac Pro unless you can set it right and understand the theory of how the application works. You are not just only or always doing backup jobs all the time to draw the benefit of NVMe anyway. But your time is spending more on using the applications.

Eneco · Oct 10, 2019

solaris8x86 said:
After changing the I/O setting from [64bit LBA] to [4K block] in the controller (firmware setting). My 4K QD1 test score jumps from just 22 to 90.62

I'm not that much of an expert, but I guess this setting can only be changed because you are using Acrea RAID-0? This isn't a setting that can be changed on any hard drive itself, right?

solaris8x86 · Oct 10, 2019

Eneco said:
I'm not that much of an expert, but I guess this setting can only be changed because you are using Acrea RAID-0? This isn't a setting that can be changed on any hard drive itself, right?

You are correct. It's a setting of the SAS controller itself. Lucky, a dedicated controller can do this kind of trick on a Mac, saved my old Mac. With this trick, Its performance on the timeline of FCPX is close to a Mac Pro 2013 (not that fast, but close to ~80-90% performance of a Mac Pro 2013 (6,1) when processing a 4K 60fps direct h264 video clip editing (not need to transcode to PRORES 422).

Eneco · Oct 10, 2019

I see. I guess there is no other way to improve the 4k read performance?

RAID 5 would increase 4k read speed as far as I know, but would require at least 3 disks and since Mojave doesn't support booting from RAID, that's not an option.

I've read that using a 4KiB / sector drive could improve 4k read as well, but there aren't any NVME drives available I think.

solaris8x86 · Oct 11, 2019

Eneco said:
RAID 5 would increase 4k read speed as far as I know, but would require at least 3 disks and since Mojave doesn't support booting from RAID, that's not an option.

That's not entirely accuracy. Mojave can boot from any hardware RAID controller. Because it is hardware RAID. Mojave even doesn't know it is a RAID drive when booting up. Your mention is only applicable to the macOS software RAID volume.

cMP Slow AFPS 4K Write performance

macrumors 68000

macrumors 6502a

macrumors P6

macrumors 6502a

macrumors 6502

macrumors 6502a

macrumors P6

macrumors 68000

macrumors 6502a

macrumors G5

macrumors P6

macrumors 68000

macrumors 6502a

macrumors G5

macrumors 6502a

macrumors G4

Contributor

macrumors 68000

Contributor

macrumors 6502

macrumors regular

macrumors regular

macrumors regular

macrumors regular

macrumors regular

Our Staff