Merge tag 'asoc-fix-v6.9-rc2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v6.9

A relatively large set of fixes here; the biggest piece of it is a series correcting some problems with the delay reporting for Intel SOF cards, but there is a bunch of other things. Everything here is driver specific except for a fix in the core for an issue with sign extension when handling volume controls.

Takashi Iwai
4 weeks ago
10646 changed files with 631197 additions and 193617 deletions
@ -1,4 +1,5 @@

Alan Cox <alan@lxorguk.ukuu.org.uk>
Alan Cox <root@hraefn.swansea.linux.org.uk>
Christoph Hellwig <hch@lst.de>
Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Marc Gonzalez <marc.w.gonzalez@free.fr>
@ -0,0 +1,276 @@

What:           /sys/kernel/debug/iommu/intel/iommu_regset
Date:           December 2023
Contact:        Jingqi Liu <Jingqi.liu@intel.com>
Description:
        This file dumps all the register contents for each IOMMU device.

        Example in Kabylake:

        ::

          $ sudo cat /sys/kernel/debug/iommu/intel/iommu_regset

          IOMMU: dmar0 Register Base Address: 26be37000

          Name       Offset   Contents
          VER        0x00     0x0000000000000010
          GCMD       0x18     0x0000000000000000
          GSTS       0x1c     0x00000000c7000000
          FSTS       0x34     0x0000000000000000
          FECTL      0x38     0x0000000000000000

          [...]

          IOMMU: dmar1 Register Base Address: fed90000

          Name       Offset   Contents
          VER        0x00     0x0000000000000010
          GCMD       0x18     0x0000000000000000
          GSTS       0x1c     0x00000000c7000000
          FSTS       0x34     0x0000000000000000
          FECTL      0x38     0x0000000000000000

          [...]

          IOMMU: dmar2 Register Base Address: fed91000

          Name       Offset   Contents
          VER        0x00     0x0000000000000010
          GCMD       0x18     0x0000000000000000
          GSTS       0x1c     0x00000000c7000000
          FSTS       0x34     0x0000000000000000
          FECTL      0x38     0x0000000000000000

          [...]

What:           /sys/kernel/debug/iommu/intel/ir_translation_struct
Date:           December 2023
Contact:        Jingqi Liu <Jingqi.liu@intel.com>
Description:
        This file dumps the table entries for Interrupt
        remapping and Interrupt posting.

        Example in Kabylake:

        ::

          $ sudo cat /sys/kernel/debug/iommu/intel/ir_translation_struct

          Remapped Interrupt supported on IOMMU: dmar0
          IR table address:100900000

          Entry  SrcID    DstID     Vct  IRTE_high         IRTE_low
          0      00:0a.0  00000080  24   0000000000040050  000000800024000d
          1      00:0a.0  00000001  ef   0000000000040050  0000000100ef000d

          Remapped Interrupt supported on IOMMU: dmar1
          IR table address:100300000
          Entry  SrcID    DstID     Vct  IRTE_high         IRTE_low
          0      00:02.0  00000002  26   0000000000040010  000000020026000d

          [...]

          ****

          Posted Interrupt supported on IOMMU: dmar0
          IR table address:100900000
          Entry  SrcID  PDA_high  PDA_low  Vct  IRTE_high  IRTE_low

What:           /sys/kernel/debug/iommu/intel/dmar_translation_struct
Date:           December 2023
Contact:        Jingqi Liu <Jingqi.liu@intel.com>
Description:
        This file dumps the Intel IOMMU DMA remapping tables, such
        as the root table, context table, PASID directory and PASID
        table entries, in debugfs. In legacy mode, PASID is not
        supported, so the PASID field defaults to '-1' and the other
        PASID-related fields are invalid.

        Example in Kabylake:

        ::

          $ sudo cat /sys/kernel/debug/iommu/intel/dmar_translation_struct

          IOMMU dmar1: Root Table Address: 0x103027000
          B.D.F    Root_entry
          00:02.0  0x0000000000000000:0x000000010303e001

          Context_entry
          0x0000000000000102:0x000000010303f005

          PASID  PASID_table_entry
          -1     0x0000000000000000:0x0000000000000000:0x0000000000000000

          IOMMU dmar0: Root Table Address: 0x103028000
          B.D.F    Root_entry
          00:0a.0  0x0000000000000000:0x00000001038a7001

          Context_entry
          0x0000000000000000:0x0000000103220e7d

          PASID  PASID_table_entry
          0      0x0000000000000000:0x0000000000800002:0x00000001038a5089

          [...]

What:           /sys/kernel/debug/iommu/intel/invalidation_queue
Date:           December 2023
Contact:        Jingqi Liu <Jingqi.liu@intel.com>
Description:
        This file exports invalidation queue internals of each
        IOMMU device.

        Example in Kabylake:

        ::

          $ sudo cat /sys/kernel/debug/iommu/intel/invalidation_queue

          Invalidation queue on IOMMU: dmar0
          Base: 0x10022e000  Head: 20  Tail: 20
          Index  qw0               qw1               qw2
          0      0000000000000014  0000000000000000  0000000000000000
          1      0000000200000025  0000000100059c04  0000000000000000
          2      0000000000000014  0000000000000000  0000000000000000

                 qw3               status
                 0000000000000000  0000000000000000
                 0000000000000000  0000000000000000
                 0000000000000000  0000000000000000

          [...]

          Invalidation queue on IOMMU: dmar1
          Base: 0x10026e000  Head: 32  Tail: 32
          Index  qw0               qw1               status
          0      0000000000000004  0000000000000000  0000000000000000
          1      0000000200000025  0000000100059804  0000000000000000
          2      0000000000000011  0000000000000000  0000000000000000

          [...]

What:           /sys/kernel/debug/iommu/intel/dmar_perf_latency
Date:           December 2023
Contact:        Jingqi Liu <Jingqi.liu@intel.com>
Description:
        This file is used to control and show counts of
        execution time ranges for various types per DMAR.

        First, write a value to
        /sys/kernel/debug/iommu/intel/dmar_perf_latency
        to enable sampling.

        The possible values are as follows:

        * 0 - disable sampling of all latency data

        * 1 - enable sampling of IOTLB invalidation latency data

        * 2 - enable sampling of devTLB invalidation latency data

        * 3 - enable sampling of interrupt entry cache invalidation latency data

        Next, reading /sys/kernel/debug/iommu/intel/dmar_perf_latency gives
        a snapshot of the sampling results of all enabled monitors.

        Examples in Kabylake:

        ::

          1) Disable sampling of all latency data:

          $ sudo echo 0 > /sys/kernel/debug/iommu/intel/dmar_perf_latency

          2) Enable sampling of IOTLB invalidation latency data:

          $ sudo echo 1 > /sys/kernel/debug/iommu/intel/dmar_perf_latency

          $ sudo cat /sys/kernel/debug/iommu/intel/dmar_perf_latency

          IOMMU: dmar0 Register Base Address: 26be37000
                      <0.1us  0.1us-1us  1us-10us  10us-100us  100us-1ms
            inv_iotlb  0       0          0         0           0

                       1ms-10ms  >=10ms  min(us)  max(us)  average(us)
            inv_iotlb  0         0       0        0        0

          [...]

          IOMMU: dmar2 Register Base Address: fed91000
                      <0.1us  0.1us-1us  1us-10us  10us-100us  100us-1ms
            inv_iotlb  0       0          18        0           0

                       1ms-10ms  >=10ms  min(us)  max(us)  average(us)
            inv_iotlb  0         0       2        2        2

          3) Enable sampling of devTLB invalidation latency data:

          $ sudo echo 2 > /sys/kernel/debug/iommu/intel/dmar_perf_latency

          $ sudo cat /sys/kernel/debug/iommu/intel/dmar_perf_latency

          IOMMU: dmar0 Register Base Address: 26be37000
                        <0.1us  0.1us-1us  1us-10us  10us-100us  100us-1ms
            inv_devtlb  0       0          0         0           0

                         1ms-10ms  >=10ms  min(us)  max(us)  average(us)
            inv_devtlb   0         0       0        0        0

          [...]

What:           /sys/kernel/debug/iommu/intel/<bdf>/domain_translation_struct
Date:           December 2023
Contact:        Jingqi Liu <Jingqi.liu@intel.com>
Description:
        This file dumps a specified page table of the Intel IOMMU
        in legacy mode or scalable mode.

        For a device that only supports legacy mode, dump its
        page table via the debugfs file in the debugfs device
        directory, e.g.
        /sys/kernel/debug/iommu/intel/0000:00:02.0/domain_translation_struct.

        For a device that supports scalable mode, dump the
        page table of a specified pasid via the debugfs file in
        the debugfs pasid directory, e.g.
        /sys/kernel/debug/iommu/intel/0000:00:02.0/1/domain_translation_struct.

        Examples in Kabylake:

        ::

          1) Dump the page table of device "0000:00:02.0", which only supports
          legacy mode:

          $ sudo cat /sys/kernel/debug/iommu/intel/0000:00:02.0/domain_translation_struct

          Device 0000:00:02.0 @0x1017f8000
          IOVA_PFN           PML5E               PML4E
          0x000000008d800 |  0x0000000000000000  0x00000001017f9003
          0x000000008d801 |  0x0000000000000000  0x00000001017f9003
          0x000000008d802 |  0x0000000000000000  0x00000001017f9003

          PDPE                PDE                 PTE
          0x00000001017fa003  0x00000001017fb003  0x000000008d800003
          0x00000001017fa003  0x00000001017fb003  0x000000008d801003
          0x00000001017fa003  0x00000001017fb003  0x000000008d802003

          [...]

          2) Dump the page table of device "0000:00:0a.0" with PASID "1", which
          supports scalable mode:

          $ sudo cat /sys/kernel/debug/iommu/intel/0000:00:0a.0/1/domain_translation_struct

          Device 0000:00:0a.0 with pasid 1 @0x10c112000
          IOVA_PFN           PML5E               PML4E
          0x0000000000000 |  0x0000000000000000  0x000000010df93003
          0x0000000000001 |  0x0000000000000000  0x000000010df93003
          0x0000000000002 |  0x0000000000000000  0x000000010df93003

          PDPE                PDE                 PTE
          0x0000000106ae6003  0x0000000104b38003  0x0000000147c00803
          0x0000000106ae6003  0x0000000104b38003  0x0000000147c01803
          0x0000000106ae6003  0x0000000104b38003  0x0000000147c02803

          [...]

@ -0,0 +1,153 @@

What:           /sys/bus/dax/devices/daxX.Y/align
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (RW) Provides a way to specify an alignment for a dax device.
        Values allowed are constrained by the physical address ranges
        that back the dax device, and also by arch requirements.

What:           /sys/bus/dax/devices/daxX.Y/mapping
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (WO) Provides a way to allocate a mapping range under a dax
        device. Specified in the format <start>-<end>.

What:           /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
What:           /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
What:           /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) A dax device may have multiple constituent discontiguous
        address ranges. These are represented by the different
        'mappingX' subdirectories. The 'start' attribute indicates the
        start physical address for the given range. The 'end' attribute
        indicates the end physical address for the given range. The
        'page_offset' attribute indicates the offset of the current
        range in the dax device.

What:           /sys/bus/dax/devices/daxX.Y/resource
Date:           June, 2019
KernelVersion:  v5.3
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) The resource attribute indicates the starting physical
        address of a dax device. In case of a device with multiple
        constituent ranges, it indicates the starting address of the
        first range.

What:           /sys/bus/dax/devices/daxX.Y/size
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (RW) The size attribute indicates the total size of a dax
        device. For creating subdivided dax devices, or for resizing
        an existing device, the new size can be written to this as
        part of the reconfiguration process.

What:           /sys/bus/dax/devices/daxX.Y/numa_node
Date:           November, 2019
KernelVersion:  v5.5
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) If NUMA is enabled and the platform has affinitized the
        backing device for this dax device, emit the CPU node
        affinity for this device.

What:           /sys/bus/dax/devices/daxX.Y/target_node
Date:           February, 2019
KernelVersion:  v5.1
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) The target-node attribute is the Linux numa-node that a
        device-dax instance may create when it is online. Prior to
        being online the device's 'numa_node' property reflects the
        closest online cpu node which is the typical expectation of a
        device 'numa_node'. Once it is online it becomes its own
        distinct numa node.

What:           $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/available_size
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) The available_size attribute tracks available dax region
        capacity. This only applies to volatile hmem devices, not pmem
        devices, since pmem devices are defined by nvdimm namespace
        boundaries.

What:           $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/size
Date:           July, 2017
KernelVersion:  v5.1
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) The size attribute indicates the size of a given dax region
        in bytes.

What:           $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/align
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) The align attribute indicates the alignment of the dax
        region. Changes to the alignment may not always be valid, for
        example when certain mappings were created with a 2M alignment
        and the region is then switched to 1G. All ranges are therefore
        validated against the new value being attempted, post resizing.

What:           $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/seed
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) The seed device is a concept for dynamic dax regions to be
        able to split the region amongst multiple sub-instances. The
        seed device, similar to libnvdimm seed devices, is a device
        that starts with zero capacity allocated and unbound to a
        driver.

What:           $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/create
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (RW) The create interface to the dax region provides a way to
        create a new unconfigured dax device under the given region, which
        can then be configured (with a size etc.) and then probed.

What:           $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/delete
Date:           October, 2020
KernelVersion:  v5.10
Contact:        nvdimm@lists.linux.dev
Description:
        (WO) The delete interface for a dax region provides for deletion
        of any 0-sized and idle dax devices.

What:           $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/id
Date:           July, 2017
KernelVersion:  v5.1
Contact:        nvdimm@lists.linux.dev
Description:
        (RO) The id attribute indicates the region id of a dax region.

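
As a rough sketch of the create/size/delete flow these attributes describe (device names such as dax0.0 and dax0.1 are illustrative assumptions, not fixed names):

```shell
#!/bin/sh
# Hypothetical walk-through of dynamic dax region reconfiguration.
region="$(readlink -f /sys/bus/dax/devices/dax0.0)/../dax_region"

echo 1 > "$region/create"            # create a new, unconfigured (0-sized) device
echo $((4 << 30)) > /sys/bus/dax/devices/dax0.1/size   # allocate 4 GiB to it
cat "$region/available_size"         # remaining unallocated region capacity

echo 0 > /sys/bus/dax/devices/dax0.1/size   # shrink back to 0 first;
echo dax0.1 > "$region/delete"              # delete only accepts 0-sized, idle devices
```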
What:           /sys/bus/dax/devices/daxX.Y/memmap_on_memory
Date:           January, 2024
KernelVersion:  v6.8
Contact:        nvdimm@lists.linux.dev
Description:
        (RW) Control the memmap_on_memory setting if the dax device
        were to be hotplugged as system memory. This determines whether
        the 'altmap' for the hotplugged memory will be placed on the
        device being hotplugged (memmap_on_memory=1) or if it will be
        placed on regular memory (memmap_on_memory=0). This attribute
        must be set before the device is handed over to the 'kmem'
        driver (i.e. hotplugged into system-ram). Additionally, this
        depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
        memmap_on_memory parameter for memory_hotplug. This is
        typically set on the kernel command line -
        memory_hotplug.memmap_on_memory set to 'true' or 'force'.
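
A hedged sketch of the sequence implied above (the device name and the driver_override step are assumptions about a typical rebind flow; the attribute must be set before the device is bound to 'kmem'):

```shell
#!/bin/sh
# Place the hotplug altmap on the dax device itself, then hand it to kmem.
dev=dax0.0   # assumed device name

echo "$dev" > /sys/bus/dax/drivers/device_dax/unbind
echo 1 > "/sys/bus/dax/devices/$dev/memmap_on_memory"   # altmap on the device
echo kmem > "/sys/bus/dax/devices/$dev/driver_override"
echo "$dev" > /sys/bus/dax/drivers/kmem/bind            # hotplug as system-ram
```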

@ -0,0 +1,9 @@

What:           /sys/bus/iio/devices/iio:deviceX/in_shunt_resistorY
KernelVersion:  6.7
Contact:        linux-iio@vger.kernel.org
Description:
        The value of the shunt resistor may be known only at runtime
        and set by a client application. This attribute allows setting
        its value in micro-ohms. X is the IIO index of the device.
        Y is the channel number. The value is used to calculate
        current, power and accumulated energy.
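
For example, a client application with a 100 milliohm shunt on channel 0 of device 0 (the device index and channel number are illustrative assumptions) would write the value in micro-ohms:

```shell
#!/bin/sh
# 100 milliohms = 100000 micro-ohms; used for current/power/energy scaling.
echo 100000 > /sys/bus/iio/devices/iio:device0/in_shunt_resistor0
cat /sys/bus/iio/devices/iio:device0/in_shunt_resistor0
```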

@ -0,0 +1,11 @@

What:           /sys/fs/virtiofs/<n>/tag
Date:           Feb 2024
Contact:        virtio-fs@lists.linux.dev
Description:
        [RO] The mount "tag" that can be used to mount this filesystem.

What:           /sys/fs/virtiofs/<n>/device
Date:           Feb 2024
Contact:        virtio-fs@lists.linux.dev
Description:
        Symlink to the virtio device that exports this filesystem.
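
A short sketch tying the two attributes together (the tag value "myfs" and the mount point are assumptions):

```shell
#!/bin/sh
# Locate a virtio-fs instance by tag, then mount it by that tag.
for d in /sys/fs/virtiofs/*; do
    if [ "$(cat "$d/tag")" = "myfs" ]; then
        readlink -f "$d/device"    # the backing virtio device
    fi
done
mount -t virtiofs myfs /mnt
```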

@ -0,0 +1,4 @@

What:           /sys/kernel/mm/mempolicy/
Date:           January 2024
Contact:        Linux memory management mailing list <linux-mm@kvack.org>
Description:    Interface for Mempolicy

@ -0,0 +1,25 @@

What:           /sys/kernel/mm/mempolicy/weighted_interleave/
Date:           January 2024
Contact:        Linux memory management mailing list <linux-mm@kvack.org>
Description:    Configuration Interface for the Weighted Interleave policy

What:           /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
Date:           January 2024
Contact:        Linux memory management mailing list <linux-mm@kvack.org>
Description:    Weight configuration interface for nodeN

        The interleave weight for a memory node (N). These weights are
        utilized by tasks which have set their mempolicy to
        MPOL_WEIGHTED_INTERLEAVE.

        These weights only affect new allocations, and changes at runtime
        will not cause migrations on already allocated pages.

        The minimum weight for a node is always 1.

        Minimum weight: 1
        Maximum weight: 255

        Writing an empty string or `0` will reset the weight to the
        system default. The system default may be set by the kernel
        or drivers at boot or during hotplug events.
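
For instance, to bias new MPOL_WEIGHTED_INTERLEAVE allocations 3:1 toward node 0 over node 1 (the node numbers are assumptions):

```shell
#!/bin/sh
echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
# Writing 0 (or an empty string) restores the system default weight:
echo 0 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
```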

@ -0,0 +1,24 @@

.. SPDX-License-Identifier: GPL-2.0

Address translation
===================

x86 AMD
-------

Zen-based AMD systems include a Data Fabric that manages the layout of
physical memory. Devices attached to the Fabric, like memory controllers,
I/O, etc., may not have a complete view of the system physical memory map.
These devices may provide a "normalized", i.e. device physical, address
when reporting memory errors. Normalized addresses must be translated to
a system physical address for the kernel to act on the memory.

The AMD Address Translation Library (CONFIG_AMD_ATL) provides translation
for this case.

Glossary of acronyms used in address translation for Zen-based systems:

* CCM = Cache Coherent Moderator
* COD = Cluster-on-Die
* COH_ST = Coherent Station
* DF = Data Fabric

@ -1,15 +1,10 @@

.. SPDX-License-Identifier: GPL-2.0

Error decoding
==============

x86
---

Error decoding on AMD systems should be done using the rasdaemon tool:
https://github.com/mchehab/rasdaemon/

@ -0,0 +1,7 @@

.. SPDX-License-Identifier: GPL-2.0
.. toctree::
   :maxdepth: 2

   main
   error-decoding
   address-translation

@ -1,8 +1,12 @@

.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==================================================
Reliability, Availability and Serviceability (RAS)
==================================================

This documents different aspects of the RAS functionality present in the
kernel.

RAS concepts
************

@ -0,0 +1,633 @@

.. SPDX-License-Identifier: GPL-2.0-only

================
Design of dm-vdo
================

The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corp. starting in 2009. It was first released in 2013 and has been used in
production environments ever since. It was made open-source in 2017 after
Permabit was acquired by Red Hat. This document describes the design of
dm-vdo. For usage, see vdo.rst in the same directory as this file.

Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4K. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
reference a single 4K block of actual storage. It can achieve compression
rates of 14:1. All zero blocks consume no storage at all.

Theory of Operation
===================

The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first is to recognize duplicate data. The second is to avoid
storing multiple copies of those duplicates. Therefore, dm-vdo has two main
parts: a deduplication index (called UDS) that is used to discover
duplicate data, and a data store with a reference counted block map that
maps from logical block addresses to the actual storage location of the
data.

Zones and Threading
-------------------

Due to the complexity of data optimization, the number of metadata
structures involved in a single write operation to a vdo target is larger
than in most other targets. Furthermore, because vdo must operate on small
block sizes in order to achieve good deduplication rates, acceptable
performance can only be achieved through parallelism. Therefore, vdo's
design attempts to be lock-free.

Most of a vdo's main data structures are designed to be easily divided into
"zones" such that any given bio must only access a single zone of any zoned
structure. Safety with minimal locking is achieved by ensuring that during
normal operation, each zone is assigned to a specific thread, and only that
thread will access the portion of the data structure in that zone.
Associated with each thread is a work queue. Each bio is associated with a
request object (the "data_vio") which will be added to a work queue when
the next phase of its operation requires access to the structures in the
zone associated with that queue.

Another way of thinking about this arrangement is that the work queue for
each zone has an implicit lock on the structures it manages for all its
operations, because vdo guarantees that no other thread will alter those
structures.

Although each structure is divided into zones, this division is not
reflected in the on-disk representation of each data structure. Therefore,
the number of zones for each structure, and hence the number of threads,
can be reconfigured each time a vdo target is started.

The Deduplication Index
-----------------------

In order to identify duplicate data efficiently, vdo was designed to
leverage some common characteristics of duplicate data. From empirical
observations, we gathered two key insights. The first is that in most data
sets with significant amounts of duplicate data, the duplicates tend to
have temporal locality. When a duplicate appears, it is more likely that
other duplicates will be detected, and that those duplicates will have been
written at about the same time. This is why the index keeps records in
temporal order. The second insight is that new data is more likely to
duplicate recent data than it is to duplicate older data and in general,
there are diminishing returns to looking further back in time. Therefore,
when the index is full, it should cull its oldest records to make space for
new ones.

Another important idea behind the design of the index is that the ultimate
goal of deduplication is to reduce storage costs. Since there is a
trade-off between the storage saved and the resources expended to achieve
those savings, vdo does not attempt to find every last duplicate block. It
is sufficient to find and eliminate most of the redundancy.

Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate. In the most common case, this occurs
because it is too costly to update the index when a block is over-written
or discarded. Doing so would require either storing the block name along
with the blocks, which is difficult to do efficiently in block-based
storage, or reading and rehashing each block before overwriting it.
Inaccuracy can also result from a hash collision where two different blocks
have the same name. In practice, this is extremely unlikely, but because
vdo does not use a cryptographic hash, a malicious workload could be
constructed. Because of these inaccuracies, vdo treats the locations in the
index as hints, and reads each indicated block to verify that it is indeed
a duplicate before sharing the existing block with a new one.

Records are collected into groups called chapters. New records are added to
the newest chapter, called the open chapter. This chapter is stored in a
format optimized for adding and modifying records, and the content of the
open chapter is not finalized until it runs out of space for new records.
When the open chapter fills up, it is closed and a new open chapter is
created to collect new records.

Closing a chapter converts it to a different format which is optimized for
reading. The records are written to a series of record pages based on the
order in which they were received. This means that records with temporal
locality should be on a small number of pages, reducing the I/O required to
retrieve them. The chapter also compiles an index that indicates which
record page contains any given name. This index means that a request for a
name can determine exactly which record page may contain that record,
without having to load the entire chapter from storage. This index uses
only a subset of the block name as its key, so it cannot guarantee that an
index entry refers to the desired block name. It can only guarantee that if
there is a record for this name, it will be on the indicated page. Closed
chapters are read-only structures and their contents are never altered in
any way.

Once enough records have been written to fill up all the available index
space, the oldest chapter is removed to make space for new chapters. Any
time a request finds a matching record in the index, that record is copied
into the open chapter. This ensures that useful block names remain available
in the index, while unreferenced block names are forgotten over time.

In order to find records in older chapters, the index also maintains a
higher level structure called the volume index, which contains entries
mapping each block name to the chapter containing its newest record. This
mapping is updated as records for the block name are copied or updated,
ensuring that only the newest record for a given block name can be found.
An older record for a block name will no longer be found even though it has
not been deleted from its chapter. Like the chapter index, the volume index
uses only a subset of the block name as its key and cannot definitively
say that a record exists for a name. It can only say which chapter would
contain the record if a record exists. The volume index is stored entirely
in memory and is saved to storage only when the vdo target is shut down.

A request for a particular block name will first look up the name in the
volume index. This search will either indicate that the name is new, or
which chapter to search. If it returns a chapter, the request looks up its
name in the chapter index. This will indicate either that the name is new,
or which record page to search. Finally, if it is not new, the request will
look for its name in the indicated record page. This process may require up
to two page reads per request (one for the chapter index page and one for
the record page). However, recently accessed pages are cached so that these
page reads can be amortized across many block name requests.

The volume index and the chapter indexes are implemented using a |
||||
memory-efficient structure called a delta index. Instead of storing the |
||||
entire block name (the key) for each entry, the entries are sorted by name |
||||
and only the difference between adjacent keys (the delta) is stored. |
||||
Because we expect the hashes to be randomly distributed, the size of the |
||||
deltas follows an exponential distribution. Because of this distribution, |
||||
the deltas are expressed using a Huffman code to take up even less space. |
||||
The entire sorted list of keys is called a delta list. This structure |
||||
allows the index to use many fewer bytes per entry than a traditional hash |
||||
table, but it is slightly more expensive to look up entries, because a |
||||
request must read every entry in a delta list to add up the deltas in order |
||||
to find the record it needs. The delta index reduces this lookup cost by |
||||
splitting its key space into many sub-lists, each starting at a fixed key |
||||
value, so that each individual list is short. |
||||
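
The delta-list encoding and its linear lookup can be illustrated with a
short Python sketch (illustrative only; the kernel implementation packs the
deltas as Huffman-coded bit fields rather than Python integers):

```python
def delta_encode(sorted_keys):
    """Store the first key, then only the gap to each successor."""
    deltas = [sorted_keys[0]]
    for prev, cur in zip(sorted_keys, sorted_keys[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_contains(deltas, key):
    """Walk the list, summing deltas, until the key is found or passed."""
    total = 0
    for delta in deltas:
        total += delta
        if total == key:
            return True
        if total > key:
            return False  # keys are sorted, so the scan can stop early
    return False
```

Because small gaps dominate when keys are uniformly random, the per-entry
cost approaches the entropy of the gap distribution, and keeping each
sub-list short bounds the cost of the linear scan.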

The default index size can hold 64 million records, corresponding to about
256GB of data. This means that the index can identify duplicate data if the
original data was written within the last 256GB of writes. This range is
called the deduplication window. If new writes duplicate data that is older
than that, the index will not be able to find it because the records of the
older data have been removed. This means that if an application writes a
200 GB file to a vdo target and then immediately writes it again, the two
copies will deduplicate perfectly. Doing the same with a 500 GB file will
result in no deduplication, because the beginning of the file will no
longer be in the index by the time the second write begins (assuming there
is no duplication within the file itself).
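
The 256GB figure follows directly from the record count and the block size
(taking "million" here as 2^20, and assuming one record per 4KB block):

```python
records = 64 * 1024 * 1024      # default index capacity: 64 million records
block_size = 4096               # each record names one 4 KB data block
window = records * block_size   # bytes of writes the index can remember
assert window == 256 * 2**30    # the 256 GB deduplication window
```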

If an application anticipates a data workload that will see useful
deduplication beyond the 256GB threshold, vdo can be configured to use a
larger index with a correspondingly larger deduplication window. (This
configuration can only be set when the target is created, not altered
later. It is important to consider the expected workload for a vdo target
before configuring it.) There are two ways to do this.

One way is to increase the memory size of the index, which also increases
the amount of backing storage required. Doubling the size of the index will
double the length of the deduplication window at the expense of doubling
the storage size and the memory requirements.

The other option is to enable sparse indexing. Sparse indexing increases
the deduplication window by a factor of 10, at the expense of also
increasing the storage size by a factor of 10. However, with sparse
indexing, the memory requirements do not increase. The trade-off is
slightly more computation per request and a slight decrease in the amount
of deduplication detected. For most workloads with significant amounts of
duplicate data, sparse indexing will detect 97-99% of the deduplication
that a standard index will detect.

The vio and data_vio Structures
-------------------------------

A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
fields and data to track vdo-specific information. A struct vio maintains a
pointer to a bio but also tracks other fields specific to the operation of
vdo. The vio is kept separate from its related bio because there are many
circumstances where vdo completes the bio but must continue to do work
related to deduplication or compression.

Metadata reads and writes, and other writes that originate within vdo, use
a struct vio directly. Application reads and writes use a larger structure
called a data_vio to track information about their progress. A struct
data_vio contains a struct vio and also includes several other fields
related to deduplication and other vdo features. The data_vio is the
primary unit of application work in vdo. Each data_vio proceeds through a
set of steps to handle the application data, after which it is reset and
returned to a pool of data_vios for reuse.

There is a fixed pool of 2048 data_vios. This number was chosen to bound
the amount of work that is required to recover from a crash. In addition,
benchmarks have indicated that increasing the size of the pool does not
significantly improve performance.

The Data Store
--------------

The data store is implemented by three main data structures, all of which
work in concert to reduce or amortize metadata updates across as many data
writes as possible.

*The Slab Depot*

Most of the vdo volume belongs to the slab depot. The depot contains a
collection of slabs. The slabs can be up to 32GB, and are divided into
three sections. Most of a slab consists of a linear sequence of 4K blocks.
These blocks are used either to store data, or to hold portions of the
block map (see below). In addition to the data blocks, each slab has a set
of reference counters, using 1 byte for each data block. Finally, each slab
has a journal.
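
The reference-counter overhead is easy to bound: one byte per 4K data
block. A rough check for a maximally sized slab (illustrative; it ignores
the blocks occupied by the counters and the slab journal themselves):

```python
slab_bytes = 32 * 2**30                  # maximum slab size: 32 GB
block_size = 4096
data_blocks = slab_bytes // block_size   # upper bound on blocks per slab
counter_bytes = data_blocks * 1          # 1 byte of reference count each
assert counter_bytes == 8 * 2**20        # ~8 MB of counters per 32 GB slab
```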

Reference updates are written to the slab journal. Slab journal blocks are
written out either when they are full, or when the recovery journal
requests they do so in order to allow the main recovery journal (see below)
to free up space. The slab journal is used both to ensure that the main
recovery journal can regularly free up space, and also to amortize the cost
of updating individual reference blocks. The reference counters are kept in
memory and are written out, a block at a time in oldest-dirtied order, only
when there is a need to reclaim slab journal space. The write operations
are performed in the background as needed so they do not add latency to
particular I/O operations.

Each slab is independent of every other. They are assigned to "physical
zones" in round-robin fashion. If there are P physical zones, then slab n
is assigned to zone n mod P.

The slab depot maintains an additional small data structure, the "slab
summary," which is used to reduce the amount of work needed to come back
online after a crash. The slab summary maintains an entry for each slab
indicating whether or not the slab has ever been used, whether all of its
reference count updates have been persisted to storage, and approximately
how full it is. During recovery, each physical zone will attempt to recover
at least one slab, stopping whenever it has recovered a slab which has some
free blocks. Once each zone has some space, or has determined that none is
available, the target can resume normal operation in a degraded mode. Read
and write requests can be serviced, perhaps with degraded performance,
while the remainder of the dirty slabs are recovered.

*The Block Map*

The block map contains the logical to physical mapping. It can be thought
of as an array with one entry per logical address. Each entry is 5 bytes,
36 bits of which contain the physical block number which holds the data for
the given logical address. The other 4 bits are used to indicate the nature
of the mapping. Of the 16 possible states, one represents a logical address
which is unmapped (i.e. it has never been written, or has been discarded),
one represents an uncompressed block, and the other 14 states are used to
indicate that the mapped data is compressed, and which of the compression
slots in the compressed block contains the data for this logical address.
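
The 36-bit physical block number plus 4-bit state fits in exactly 40 bits,
or 5 bytes, and is enough to address 2^36 4K blocks. A pack/unpack sketch
(illustrative only; the field names and actual on-disk byte layout in the
kernel may differ):

```python
PBN_BITS = 36    # physical block number width
STATE_BITS = 4   # mapping state width

def pack_entry(pbn, state):
    """Pack a physical block number and mapping state into 40 bits."""
    assert pbn < 2**PBN_BITS and state < 2**STATE_BITS
    return (pbn << STATE_BITS) | state

def unpack_entry(entry):
    """Recover the (pbn, state) pair from a packed 5-byte entry."""
    return entry >> STATE_BITS, entry & (2**STATE_BITS - 1)
```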

In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries. Each mapping page is actually
a leaf of a radix tree which consists of block map pages at each level.
There are 60 radix trees which are assigned to "logical zones" in
round-robin fashion. (If there are L logical zones, tree n will belong to
zone n mod L.) At each level, the trees are interleaved, so logical
addresses 0-811 belong to tree 0, logical addresses 812-1623 belong to
tree 1, and so on. The interleaving is maintained all the way up to the 60
root nodes. Choosing 60 trees results in an evenly distributed number of
trees per zone for a large number of possible logical zone counts. The
storage for the 60 tree roots is allocated at format time. All other block
map pages are allocated out of the slabs as needed. This flexible
allocation avoids the need to pre-allocate space for the entire set of
logical mappings and also makes growing the logical size of a vdo
relatively easy.
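
Following the interleaving rule above, the tree (and hence the logical
zone) responsible for a logical block address can be computed directly
(an illustrative sketch, not kernel code):

```python
ENTRIES_PER_PAGE = 812   # mapping entries per 4K block map page
TREE_COUNT = 60

def tree_for(lba):
    """Leaf pages are dealt out to the 60 trees in round-robin order."""
    return (lba // ENTRIES_PER_PAGE) % TREE_COUNT

def zone_for(lba, logical_zones):
    """Tree n belongs to logical zone n mod L."""
    return tree_for(lba) % logical_zones
```

Since 60 = 2 * 2 * 3 * 5, zone counts of 1, 2, 3, 4, 5, 6, 10, 12, 15,
20, 30, or 60 divide the trees evenly, which is why 60 gives a balanced
distribution for so many possible logical zone counts.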

In operation, the block map maintains two caches. It is prohibitive to keep
the entire leaf level of the trees in memory, so each logical zone
maintains its own cache of leaf pages. The size of this cache is
configurable at target start time. The second cache is allocated at start
time, and is large enough to hold all the non-leaf pages of the entire
block map. This cache is populated as pages are needed.

*The Recovery Journal*

The recovery journal is used to amortize updates across the block map and
slab depot. Each write request causes an entry to be made in the journal.
Entries are either "data remappings" or "block map remappings." For a data
remapping, the journal records the logical address affected and its old and
new physical mappings. For a block map remapping, the journal records the
block map page number and the physical block allocated for it. Block map
pages are never reclaimed or repurposed, so the old mapping is always 0.

Each journal entry is an intent record summarizing the metadata updates
that are required for a data_vio. The recovery journal issues a flush
before each journal block write to ensure that the physical data for the
new block mappings in that block are stable on storage, and journal block
writes are all issued with the FUA bit set to ensure the recovery journal
entries themselves are stable. The journal entry and the data write it
represents must be stable on disk before the other metadata structures may
be updated to reflect the operation. These entries allow the vdo device to
reconstruct the logical to physical mappings after an unexpected
interruption such as a loss of power.

*Write Path*

All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
as vdo has done enough work to guarantee that it can complete the write
eventually. Generally, the data for acknowledged but unflushed write I/O
can be treated as though it is cached in memory. If an application
requires data to be stable on storage, it must issue a flush or write the
data with the FUA bit set like any other asynchronous I/O. Shutting down
the vdo target will also flush any remaining I/O.

Application write bios follow the steps outlined below.

1. A data_vio is obtained from the data_vio pool and associated with the
   application bio. If there are no data_vios available, the incoming bio
   will block until a data_vio is available. This provides back pressure
   to the application. The data_vio pool is protected by a spin lock.

   The newly acquired data_vio is reset and the bio's data is copied into
   the data_vio if it is a write and the data is not all zeroes. The data
   must be copied because the application bio can be acknowledged before
   the data_vio processing is complete, which means later processing steps
   will no longer have access to the application bio. The application bio
   may also be smaller than 4K, in which case the data_vio will have
   already read the underlying block and the data is instead copied over
   the relevant portion of the larger block.

2. The data_vio places a claim (the "logical lock") on the logical address
   of the bio. It is vital to prevent simultaneous modifications of the
   same logical address, because deduplication involves sharing blocks.
   This claim is implemented as an entry in a hashtable where the key is
   the logical address and the value is a pointer to the data_vio
   currently handling that address.

   If a data_vio looks in the hashtable and finds that another data_vio is
   already operating on that logical address, it waits until the previous
   operation finishes. It also sends a message to inform the current lock
   holder that it is waiting. Most notably, a new data_vio waiting for a
   logical lock will flush the previous lock holder out of the compression
   packer (step 8d) rather than allowing it to continue waiting to be
   packed.

   This stage requires the data_vio to get an implicit lock on the
   appropriate logical zone to prevent concurrent modifications of the
   hashtable. This implicit locking is handled by the zone divisions
   described above.

3. The data_vio traverses the block map tree to ensure that all the
   necessary internal tree nodes have been allocated, by trying to find
   the leaf page for its logical address. If any interior tree page is
   missing, it is allocated at this time out of the same physical storage
   pool used to store application data.

   a. If any page-node in the tree has not yet been allocated, it must be
      allocated before the write can continue. This step requires the
      data_vio to lock the page-node that needs to be allocated. This
      lock, like the logical block lock in step 2, is a hashtable entry
      that causes other data_vios to wait for the allocation process to
      complete.

      The implicit logical zone lock is released while the allocation is
      happening, in order to allow other operations in the same logical
      zone to proceed. The details of allocation are the same as in
      step 4. Once a new node has been allocated, that node is added to
      the tree using a similar process to adding a new data block mapping.
      The data_vio journals the intent to add the new node to the block
      map tree (step 10), updates the reference count of the new block
      (step 11), and reacquires the implicit logical zone lock to add the
      new mapping to the parent tree node (step 12). Once the tree is
      updated, the data_vio proceeds down the tree. Any other data_vios
      waiting on this allocation also proceed.

   b. In the steady-state case, the block map tree nodes will already be
      allocated, so the data_vio just traverses the tree until it finds
      the required leaf node. The location of the mapping (the "block map
      slot") is recorded in the data_vio so that later steps do not need
      to traverse the tree again. The data_vio then releases the implicit
      logical zone lock.

4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
   made to allocate a free data block. This allocation ensures that the
   data_vio can write its data somewhere even if deduplication and
   compression are not possible. This stage gets an implicit lock on a
   physical zone to search for free space within that zone.

   The data_vio will search each slab in a zone until it finds a free
   block or decides there are none. If the first zone has no free space,
   it will proceed to search the next physical zone by taking the implicit
   lock for that zone and releasing the previous one until it finds a
   free block or runs out of zones to search. The data_vio will acquire a
   struct pbn_lock (the "physical block lock") on the free block. The
   struct pbn_lock also has several fields to record the various kinds of
   claims that data_vios can have on physical blocks. The pbn_lock is
   added to a hashtable like the logical block locks in step 2. This
   hashtable is also covered by the implicit physical zone lock. The
   reference count of the free block is updated to prevent any other
   data_vio from considering it free. The reference counters are a
   sub-component of the slab and are thus also covered by the implicit
   physical zone lock.

5. If an allocation was obtained, the data_vio has all the resources it
   needs to complete the write. The application bio can safely be
   acknowledged at this point. The acknowledgment happens on a separate
   thread to prevent the application callback from blocking other data_vio
   operations.

   If an allocation could not be obtained, the data_vio continues to
   attempt to deduplicate or compress the data, but the bio is not
   acknowledged because the vdo device may be out of space.

6. At this point vdo must determine where to store the application data.
   The data_vio's data is hashed and the hash (the "record name") is
   recorded in the data_vio.

7. The data_vio reserves or joins a struct hash_lock, which manages all of
   the data_vios currently writing the same data. Active hash locks are
   tracked in a hashtable similar to the way logical block locks are
   tracked in step 2. This hashtable is covered by the implicit lock on
   the hash zone.

   If there is no existing hash lock for this data_vio's record_name, the
   data_vio obtains a hash lock from the pool, adds it to the hashtable,
   and sets itself as the new hash lock's "agent." The hash_lock pool is
   also covered by the implicit hash zone lock. The hash lock agent will
   do all the work to decide where the application data will be written.
   If a hash lock for the data_vio's record_name already exists, and the
   data_vio's data is the same as the agent's data, the new data_vio will
   wait for the agent to complete its work and then share its result.

   In the rare case that a hash lock exists for the data_vio's hash but
   the data does not match the hash lock's agent, the data_vio skips to
   step 8h and attempts to write its data directly. This can happen if
   two different data blocks produce the same hash, for example.

8. The hash lock agent attempts to deduplicate or compress its data with
   the following steps.

   a. The agent initializes and sends its embedded deduplication request
      (struct uds_request) to the deduplication index. This does not
      require the data_vio to get any locks because the index components
      manage their own locking. The data_vio waits until it either gets a
      response from the index or times out.

   b. If the deduplication index returns advice, the data_vio attempts to
      obtain a physical block lock on the indicated physical address, in
      order to read the data and verify that it is the same as the
      data_vio's data, and that it can accept more references. If the
      physical address is already locked by another data_vio, the data at
      that address may soon be overwritten so it is not safe to use the
      address for deduplication.

   c. If the data matches and the physical block can add references, the
      agent and any other data_vios waiting on it will record this
      physical block as their new physical address and proceed to step 9
      to record their new mapping. If there are more data_vios in the
      hash lock than there are references available, one of the remaining
      data_vios becomes the new agent and continues to step 8d as if no
      valid advice was returned.

   d. If no usable duplicate block was found, the agent first checks that
      it has an allocated physical block (from step 4) that it can write
      to. If the agent does not have an allocation, some other data_vio in
      the hash lock that does have an allocation takes over as agent. If
      none of the data_vios have an allocated physical block, these writes
      are out of space, so they proceed to step 13 for cleanup.

   e. The agent attempts to compress its data. If the data does not
      compress, the data_vio will continue to step 8h to write its data
      directly.

      If the compressed size is small enough, the agent will release the
      implicit hash zone lock and go to the packer (struct packer) where
      it will be placed in a bin (struct packer_bin) along with other
      data_vios. All compression operations require the implicit lock on
      the packer zone.

      The packer can combine up to 14 compressed blocks in a single 4k
      data block. Compression is only helpful if vdo can pack at least 2
      data_vios into a single data block. This means that a data_vio may
      wait in the packer for an arbitrarily long time for other data_vios
      to fill out the compressed block. There is a mechanism for vdo to
      evict waiting data_vios when continuing to wait would cause
      problems. Circumstances causing an eviction include an application
      flush, device shutdown, or a subsequent data_vio trying to overwrite
      the same logical block address. A data_vio may also be evicted from
      the packer if it cannot be paired with any other compressed block
      before more compressible blocks need to use its bin. An evicted
      data_vio will proceed to step 8h to write its data directly.

   f. If the agent fills a packer bin, either because all 14 of its slots
      are used or because it has no remaining space, it is written out
      using the allocated physical block from one of its data_vios. Step
      8d has already ensured that an allocation is available.

   g. Each data_vio sets the compressed block as its new physical address.
      The data_vio obtains an implicit lock on the physical zone and
      acquires the struct pbn_lock for the compressed block, which is
      modified to be a shared lock. Then it releases the implicit physical
      zone lock and proceeds to step 8i.

   h. Any data_vio evicted from the packer will have an allocation from
      step 4. It will write its data to that allocated physical block.

   i. After the data is written, if the data_vio is the agent of a hash
      lock, it will reacquire the implicit hash zone lock and share its
      physical address with as many other data_vios in the hash lock as
      possible. Each data_vio will then proceed to step 9 to record its
      new mapping.

   j. If the agent actually wrote new data (whether compressed or not),
      the deduplication index is updated to reflect the location of the
      new data. The agent then releases the implicit hash zone lock.

9. The data_vio determines the previous mapping of the logical address.
   There is a cache for block map leaf pages (the "block map cache"),
   because there are usually too many block map leaf nodes to store
   entirely in memory. If the desired leaf page is not in the cache, the
   data_vio will reserve a slot in the cache and load the desired page
   into it, possibly evicting an older cached page. The data_vio then
   finds the current physical address for this logical address (the "old
   physical mapping"), if any, and records it. This step requires a lock
   on the block map cache structures, covered by the implicit logical
   zone lock.

10. The data_vio makes an entry in the recovery journal containing the
    logical block address, the old physical mapping, and the new physical
    mapping. Making this journal entry requires holding the implicit
    recovery journal lock. The data_vio will wait in the journal until all
    recovery blocks up to the one containing its entry have been written
    and flushed to ensure the transaction is stable on storage.

11. Once the recovery journal entry is stable, the data_vio makes two slab
    journal entries: an increment entry for the new mapping, and a
    decrement entry for the old mapping. These two operations each require
    holding a lock on the affected physical slab, covered by its implicit
    physical zone lock. For correctness during recovery, the slab journal
    entries in any given slab journal must be in the same order as the
    corresponding recovery journal entries. Therefore, if the two entries
    are in different zones, they are made concurrently, and if they are in
    the same zone, the increment is always made before the decrement in
    order to avoid underflow. After each slab journal entry is made in
    memory, the associated reference count is also updated in memory.

12. Once both of the reference count updates are done, the data_vio
    acquires the implicit logical zone lock and updates the
    logical-to-physical mapping in the block map to point to the new
    physical block. At this point the write operation is complete.

13. If the data_vio has a hash lock, it acquires the implicit hash zone
    lock and releases its hash lock to the pool.

    The data_vio then acquires the implicit physical zone lock and
    releases the struct pbn_lock it holds for its allocated block. If it
    had an allocation that it did not use, it also sets the reference
    count for that block back to zero to free it for use by subsequent
    data_vios.

    The data_vio then acquires the implicit logical zone lock and releases
    the logical block lock acquired in step 2.

    The application bio is then acknowledged if it has not previously been
    acknowledged, and the data_vio is returned to the pool.
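
The logical lock of step 2 is essentially a map from logical address to
the data_vio that holds it. A minimal single-threaded Python model of the
claim/release protocol (the kernel version is per-zone, relies on the
implicit zone lock rather than a mutex, and notifies waiters by message
passing):

```python
class LogicalLockTable:
    """Toy model of the per-zone logical lock hashtable (step 2)."""

    def __init__(self):
        self.holders = {}   # logical block address -> current lock holder
        self.waiters = {}   # logical block address -> list of waiters

    def claim(self, lba, data_vio):
        """Return True if the lock was taken, False if data_vio must wait."""
        if lba not in self.holders:
            self.holders[lba] = data_vio
            return True
        self.waiters.setdefault(lba, []).append(data_vio)
        return False

    def release(self, lba):
        """Hand the lock to the oldest waiter, if any, else drop it."""
        waiting = self.waiters.get(lba, [])
        if waiting:
            self.holders[lba] = waiting.pop(0)
        else:
            del self.holders[lba]
```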

*Read Path*

An application read bio follows a much simpler set of steps. It does steps
1 and 2 in the write path to obtain a data_vio and lock its logical
address. If there is already a write data_vio in progress for that logical
address that is guaranteed to complete, the read data_vio will copy the
data from the write data_vio and return it. Otherwise, it will look up the
logical-to-physical mapping by traversing the block map tree as in step 3,
and then read and possibly decompress the indicated data at the indicated
physical block address. A read data_vio will not allocate block map tree
nodes if they are missing. If the interior block map nodes do not exist
yet, the logical block map address must still be unmapped and the read
data_vio will return all zeroes. A read data_vio handles cleanup and
acknowledgment as in step 13, although it only needs to release the
logical lock and return itself to the pool.

*Small Writes*

All storage within vdo is managed as 4KB blocks, but it can accept writes
as small as 512 bytes. Processing a write that is smaller than 4K requires
a read-modify-write operation that reads the relevant 4K block, copies the
new data over the appropriate sectors of the block, and then launches a
write operation for the modified data block. The read and write stages of
this operation are nearly identical to the normal read and write
operations, and a single data_vio is used throughout this operation.
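
The modify stage of this read-modify-write is a sector-aligned splice,
sketched below (an illustrative Python model based only on the 512-byte
sector and 4KB block figures above, not the kernel implementation):

```python
BLOCK_SIZE = 4096
SECTOR_SIZE = 512

def apply_small_write(block, first_sector, data):
    """Copy sector-aligned data over part of a 4K block already read in."""
    assert len(block) == BLOCK_SIZE
    assert len(data) % SECTOR_SIZE == 0
    start = first_sector * SECTOR_SIZE
    assert start + len(data) <= BLOCK_SIZE
    # Splice the new sectors into the block; the result is written back.
    return block[:start] + data + block[start + len(data):]
```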

*Recovery*

When a vdo is restarted after a crash, it will attempt to recover from the
recovery journal. During the pre-resume phase of the next start, the
recovery journal is read. The increment portion of each valid entry is
played into the block map. Next, valid entries are played, in order as
required, into the slab journals. Finally, each physical zone attempts to
replay at least one slab journal to reconstruct the reference counts of
one slab. Once each zone has some free space (or has determined that it
has none), the vdo comes back online, while the remainder of the slab
journals are used to reconstruct the rest of the reference counts in the
background.

*Read-only Rebuild*

If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost. The vdo may be instructed to rebuild as best it can in order to
return to a writable state. However, this is never done automatically due
to the possibility that data has been lost. During a read-only rebuild,
the block map is recovered from the recovery journal as before. However,
the reference counts are not rebuilt from the slab journals. Instead, the
reference counts are zeroed, the entire block map is traversed, and the
reference counts are updated from the block mappings. While this may lose
some data, it ensures that the block map and reference counts are
consistent with each other. This allows vdo to resume normal operation
and accept further writes.
.. SPDX-License-Identifier: GPL-2.0-only

dm-vdo
======

The dm-vdo (virtual data optimizer) device mapper target provides
block-level deduplication, compression, and thin provisioning. As a device
mapper target, it can add these features to the storage stack, compatible
with any file system. The vdo target does not protect against data
corruption, relying instead on integrity protection of the storage below
it. It is strongly recommended that lvm be used to manage vdo volumes. See
lvmvdo(7).

Userspace component
===================

Formatting a vdo volume requires the use of the 'vdoformat' tool, available
at:

https://github.com/dm-vdo/vdo/

In most cases, a vdo target will recover from a crash automatically the
next time it is started. In cases where it encountered an unrecoverable
error (either during normal operation or crash recovery) the target will
enter or come up in read-only mode. Because read-only mode is indicative of
data loss, a positive action must be taken to bring vdo out of read-only
mode. The 'vdoforcerebuild' tool, available from the same repo, is used to
prepare a read-only vdo to exit read-only mode. After running this tool,
the vdo target will rebuild its metadata the next time it is started.
Although some data may be lost, the rebuilt vdo's metadata will be
internally consistent and the target will be writable again.

The repo also contains additional userspace tools which can be used to
inspect a vdo target's on-disk metadata. Fortunately, these tools are
rarely needed except by dm-vdo developers.

Metadata requirements
=====================

Each vdo volume reserves 3GB of space for metadata, or more depending on
its configuration. It is helpful to check that the space saved by
deduplication and compression is not cancelled out by the metadata
requirements. An estimation of the space saved for a specific dataset can
be computed with the vdo estimator tool, which is available at:

https://github.com/dm-vdo/vdoestimator/

Target interface
================

Table line
----------

::

        <offset> <logical device size> vdo V4 <storage device>
        <storage device size> <minimum I/O size> <block map cache size>
        <block map era length> [optional arguments]
|
||||
|
||||
Required parameters: |
||||
|
||||
offset: |
||||
The offset, in sectors, at which the vdo volume's logical |
||||
space begins. |
||||
|
||||
logical device size: |
||||
The size of the device which the vdo volume will service, |
||||
in sectors. Must match the current logical size of the vdo |
||||
volume. |
||||
|
||||
storage device: |
||||
The device holding the vdo volume's data and metadata. |
||||
|
||||
storage device size: |
||||
The size of the device holding the vdo volume, as a number |
||||
of 4096-byte blocks. Must match the current size of the vdo |
||||
volume. |
||||
|
||||
minimum I/O size: |
||||
The minimum I/O size for this vdo volume to accept, in |
||||
bytes. Valid values are 512 or 4096. The recommended value |
||||
is 4096. |
||||
|
||||
block map cache size: |
||||
The size of the block map cache, as a number of 4096-byte |
||||
blocks. The minimum and recommended value is 32768 blocks. |
||||
If the logical thread count is non-zero, the cache size |
||||
must be at least 4096 blocks per logical thread. |
||||
|
||||
block map era length: |
||||
The speed with which the block map cache writes out |
||||
modified block map pages. A smaller era length is likely to |
||||
reduce the amount of time spent rebuilding, at the cost of |
||||
increased block map writes during normal operation. The |
||||
maximum and recommended value is 16380; the minimum value |
||||
is 1. |
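The relationship between human-friendly sizes and the numeric table fields
can be sketched with a little shell arithmetic. The device path and the
1 GB sizes below are purely illustrative:

```shell
#!/bin/sh
# Illustrative only: derive the table-line fields for a hypothetical
# 1 GB logical / 1 GB physical vdo volume backed by /dev/dm-1.
logical_gb=1
physical_gb=1
# <logical device size> is expressed in 512-byte sectors.
logical_sectors=$(( logical_gb * 1024 * 1024 * 1024 / 512 ))
# <storage device size> is expressed in 4096-byte blocks.
physical_blocks=$(( physical_gb * 1024 * 1024 * 1024 / 4096 ))
# 4096 = recommended minimum I/O size, 32768 = minimum cache size,
# 16380 = maximum (and recommended) block map era length.
echo "0 $logical_sectors vdo V4 /dev/dm-1 $physical_blocks 4096 32768 16380"
```

The line this prints matches the first table line used in the Examples
section of this document.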
Optional parameters:
--------------------
Some or all of these parameters may be specified as <key> <value> pairs.

Thread-related parameters:

Different categories of work are assigned to separate thread groups, and
the number of threads in each group can be configured separately.

If <hash>, <logical>, and <physical> are all set to 0, the work handled by
all three thread types will be handled by a single thread. If any of these
values are non-zero, all of them must be non-zero.

        ack:
                The number of threads used to complete bios. Since
                completing a bio calls an arbitrary completion function
                outside the vdo volume, threads of this type allow the vdo
                volume to continue processing requests even when bio
                completion is slow. The default is 1.

        bio:
                The number of threads used to issue bios to the underlying
                storage. Threads of this type allow the vdo volume to
                continue processing requests even when bio submission is
                slow. The default is 4.

        bioRotationInterval:
                The number of bios to enqueue on each bio thread before
                switching to the next thread. The value must be greater
                than 0 and not more than 1024; the default is 64.

        cpu:
                The number of threads used to do CPU-intensive work, such
                as hashing and compression. The default is 1.

        hash:
                The number of threads used to manage data comparisons for
                deduplication based on the hash value of data blocks. The
                default is 0.

        logical:
                The number of threads used to manage caching and locking
                based on the logical address of incoming bios. The default
                is 0; the maximum is 60.

        physical:
                The number of threads used to manage administration of the
                underlying storage device. At format time, a slab size for
                the vdo is chosen; the vdo storage device must be large
                enough to have at least 1 slab per physical thread. The
                default is 0; the maximum is 16.

Miscellaneous parameters:

        maxDiscard:
                The maximum size of discard bio accepted, in 4096-byte
                blocks. I/O requests to a vdo volume are normally split
                into 4096-byte blocks, and processed up to 2048 at a time.
                However, discard requests to a vdo volume can be
                automatically split to a larger size, up to <maxDiscard>
                4096-byte blocks in a single bio, and are limited to 1500
                at a time. Increasing this value may provide better overall
                performance, at the cost of increased latency for the
                individual discard requests. The default and minimum is 1;
                the maximum is UINT_MAX / 4096.

        deduplication:
                Whether deduplication is enabled. The default is 'on'; the
                acceptable values are 'on' and 'off'.

        compression:
                Whether compression is enabled. The default is 'off'; the
                acceptable values are 'on' and 'off'.

Device modification
-------------------

A modified table may be loaded into a running, non-suspended vdo volume.
The modifications will take effect when the device is next resumed. The
modifiable parameters are <logical device size>, <physical device size>,
<maxDiscard>, <compression>, and <deduplication>.

If the logical device size or physical device size is changed, upon
successful resume vdo will store the new values and require them on future
startups. These two parameters may not be decreased. The logical device
size may not exceed 4 PB. The physical device size must increase by at
least 32832 4096-byte blocks if at all, and must not exceed the size of the
underlying storage device. Additionally, when formatting the vdo device, a
slab size is chosen: the physical device size may never increase above the
size which provides 8192 slabs, and each increase must be large enough to
add at least one new slab.
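The growth rules above can be sketched as a small validation. The block
counts are illustrative, and the slab-count constraint is omitted for
brevity:

```shell
#!/bin/sh
# Sketch of the physical-size growth rules; sizes are illustrative and the
# slab-count constraint is not checked here.
old_blocks=262144    # current physical size in 4096-byte blocks (1 GB)
new_blocks=524288    # proposed physical size (2 GB)
min_growth=32832     # minimum growth, in 4096-byte blocks

if [ "$new_blocks" -lt "$old_blocks" ]; then
        echo "invalid: the physical size may not be decreased"
elif [ "$new_blocks" -gt "$old_blocks" ] &&
     [ $(( new_blocks - old_blocks )) -lt "$min_growth" ]; then
        echo "invalid: must grow by at least $min_growth blocks"
else
        echo "ok: grow from $old_blocks to $new_blocks blocks"
fi
# prints: ok: grow from 262144 to 524288 blocks
```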
Examples:

Start a previously-formatted vdo volume with 1 GB logical space and 1 GB
physical space, storing to /dev/dm-1 which has more than 1 GB of space.

::

        dmsetup create vdo0 --table \
                "0 2097152 vdo V4 /dev/dm-1 262144 4096 32768 16380"

Grow the logical size to 4 GB.

::

        dmsetup reload vdo0 --table \
                "0 8388608 vdo V4 /dev/dm-1 262144 4096 32768 16380"
        dmsetup resume vdo0

Grow the physical size to 2 GB.

::

        dmsetup reload vdo0 --table \
                "0 8388608 vdo V4 /dev/dm-1 524288 4096 32768 16380"
        dmsetup resume vdo0

Grow the physical size by 1 GB more and increase max discard sectors.

::

        dmsetup reload vdo0 --table \
                "0 10485760 vdo V4 /dev/dm-1 786432 4096 32768 16380 maxDiscard 8"
        dmsetup resume vdo0

Stop the vdo volume.

::

        dmsetup remove vdo0

Start the vdo volume again. Note that the logical and physical device sizes
must still match, but other parameters can change.

::

        dmsetup create vdo1 --table \
                "0 10485760 vdo V4 /dev/dm-1 786432 512 65550 5000 hash 1 logical 3 physical 2"

Messages
--------

All vdo devices accept messages in the form:

::

        dmsetup message <target-name> 0 <message-name> <message-parameters>

The messages are:

        stats:
                Outputs the current view of the vdo statistics. Mostly used
                by the vdostats userspace program to interpret the output
                buffer.

        dump:
                Dumps many internal structures to the system log. This is
                not always safe to run, so it should only be used to debug
                a hung vdo. Optional parameters to specify structures to
                dump are:

                        viopool: The pool of I/O requests for incoming bios
                        pools: A synonym of 'viopool'
                        vdo: Most of the structures managing on-disk data
                        queues: Basic information about each vdo thread
                        threads: A synonym of 'queues'
                        default: Equivalent to 'queues vdo'
                        all: All of the above.

        dump-on-shutdown:
                Perform a default dump the next time vdo shuts down.

Status
------

::

        <device> <operating mode> <in recovery> <index state>
        <compression state> <physical blocks used> <total physical blocks>

        device:
                The name of the vdo volume.

        operating mode:
                The current operating mode of the vdo volume; values may be
                'normal', 'recovering' (the volume has detected an issue
                with its metadata and is attempting to repair itself), and
                'read-only' (an error has occurred that forces the vdo
                volume to only support read operations and not writes).

        in recovery:
                Whether the vdo volume is currently in recovery mode;
                values may be 'recovering' or '-' which indicates not
                recovering.

        index state:
                The current state of the deduplication index in the vdo
                volume; values may be 'closed', 'closing', 'error',
                'offline', 'online', 'opening', and 'unknown'.

        compression state:
                The current state of compression in the vdo volume; values
                may be 'offline' and 'online'.

        used physical blocks:
                The number of physical blocks in use by the vdo volume.

        total physical blocks:
                The total number of physical blocks the vdo volume may use;
                the difference between this value and the
                <used physical blocks> is the number of blocks the vdo
                volume has left before being full.

Memory Requirements
===================

A vdo target requires a fixed 38 MB of RAM along with the following amounts
that scale with the target:

- 1.15 MB of RAM for each 1 MB of configured block map cache size. The
  block map cache requires a minimum of 150 MB.
- 1.6 MB of RAM for each 1 TB of logical space.
- 268 MB of RAM for each 1 TB of physical storage managed by the volume.

The deduplication index requires additional memory which scales with the
size of the deduplication window. For dense indexes, the index requires 1
GB of RAM per 1 TB of window. For sparse indexes, the index requires 1 GB
of RAM per 10 TB of window. The index configuration is set when the target
is formatted and may not be modified.
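Putting these figures together, a rough RAM estimate for a hypothetical
target (256 MB cache, 1 TB logical, 4 TB physical, dense index with a 1 TB
window; all sizes illustrative, not recommendations) might look like:

```shell
#!/bin/sh
# Rough RAM estimate using the per-resource figures above.
# The target sizes are illustrative only.
awk 'BEGIN {
        fixed_mb    = 38     # fixed overhead
        cache_mb    = 256    # configured block map cache size
        logical_tb  = 1      # logical space
        physical_tb = 4      # physical storage
        window_tb   = 1      # dense deduplication index window

        base_mb  = fixed_mb + 1.15 * cache_mb + 1.6 * logical_tb \
                   + 268 * physical_tb
        index_gb = 1 * window_tb   # dense index: 1 GB RAM per 1 TB window

        printf "base: %.1f MB, index: %d GB\n", base_mb, index_gb
}'
# prints: base: 1406.0 MB, index: 1 GB
```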
Module Parameters
=================

The vdo driver has a numeric parameter 'log_level' which controls the
verbosity of logging from the driver. The default setting is 6
(LOGLEVEL_INFO and more severe messages).

Run-time Usage
==============

When using dm-vdo, it is important to be aware of the ways in which its
behavior differs from other storage targets.

- There is no guarantee that over-writes of existing blocks will succeed.
  Because the underlying storage may be multiply referenced, over-writing
  an existing block generally requires a vdo to have a free block
  available.

- When blocks are no longer in use, sending a discard request for those
  blocks lets the vdo release references for those blocks. If the vdo is
  thinly provisioned, discarding unused blocks is essential to prevent the
  target from running out of space. However, due to the sharing of
  duplicate blocks, no discard request for any given logical block is
  guaranteed to reclaim space.

- Assuming the underlying storage properly implements flush requests, vdo
  is resilient against crashes; however, unflushed writes may or may not
  persist after a crash.

- Each write to a vdo target entails a significant amount of processing.
  However, much of the work is parallelizable. Therefore, vdo targets
  achieve better throughput at higher I/O depths, and can support up to
  2048 requests in parallel.

Tuning
======

The vdo device has many options, and it can be difficult to make optimal
choices without perfect knowledge of the workload. Additionally, most
configuration options must be set when a vdo target is started, and cannot
be changed while the target is active. Ideally, tuning with simulated
workloads should be performed before deploying vdo in production
environments.

The most important value to adjust is the block map cache size. In order to
service a request for any logical address, a vdo must load the portion of
the block map which holds the relevant mapping. These mappings are cached.
Performance will suffer when the working set does not fit in the cache. By
default, a vdo allocates 128 MB of metadata cache in RAM to support
efficient access to 100 GB of logical space at a time. It should be scaled
up proportionally for larger working sets.
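For example, scaling the default ratio (128 MB of cache per 100 GB of
actively used logical space) to a hypothetical 500 GB working set:

```shell
#!/bin/sh
# Scale the default block map cache ratio (128 MB per 100 GB of actively
# used logical space) to an illustrative 500 GB working set.
working_set_gb=500
cache_mb=$(( working_set_gb * 128 / 100 ))
# The <block map cache size> table field is given in 4096-byte blocks.
cache_blocks=$(( cache_mb * 1024 * 1024 / 4096 ))
echo "cache: ${cache_mb} MB = ${cache_blocks} blocks"
# prints: cache: 640 MB = 163840 blocks
```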
The logical and physical thread counts should also be adjusted. A logical
thread controls a disjoint section of the block map, so additional logical
threads increase parallelism and can increase throughput. Physical threads
control a disjoint section of the data blocks, so additional physical
threads can also increase throughput. However, excess threads can waste
resources and increase contention.

Bio submission threads control the parallelism involved in sending I/O to
the underlying storage; fewer threads mean there is more opportunity to
reorder I/O requests for performance benefit, but also that each I/O
request has to wait longer before being submitted.

Bio acknowledgment threads are used for finishing I/O requests. This is
done on dedicated threads since the amount of work required to execute a
bio's callback cannot be controlled by the vdo itself. Usually one thread
is sufficient, but additional threads may be beneficial, particularly when
bios have CPU-heavy callbacks.

CPU threads are used for hashing and for compression; in workloads with
compression enabled, more threads may result in higher throughput.

Hash threads are used to sort active requests by hash and determine whether
they should deduplicate; the most CPU-intensive action done by these
threads is the comparison of 4096-byte data blocks. In most cases, a single
hash thread is sufficient.

.. SPDX-License-Identifier: GPL-2.0

==================
Obsolete GPIO APIs
==================

.. toctree::
   :maxdepth: 1

   Character Device Userspace API (v1) <../../userspace-api/gpio/chardev_v1>
   Sysfs Interface <../../userspace-api/gpio/sysfs>
   Mockup Testing Module <gpio-mockup>

==================================
Register File Data Sampling (RFDS)
==================================

Register File Data Sampling (RFDS) is a microarchitectural vulnerability that
only affects Intel Atom parts (also branded as E-cores). RFDS may allow
a malicious actor to infer data values previously used in floating point
registers, vector registers, or integer registers. RFDS does not provide the
ability to choose which data is inferred. CVE-2023-28746 is assigned to RFDS.

Affected Processors
===================
Below is the list of affected Intel processors [#f1]_:

  =================== ============
  Common name         Family_Model
  =================== ============
  ATOM_GOLDMONT       06_5CH
  ATOM_GOLDMONT_D     06_5FH
  ATOM_GOLDMONT_PLUS  06_7AH
  ATOM_TREMONT_D      06_86H
  ATOM_TREMONT        06_96H
  ALDERLAKE           06_97H
  ALDERLAKE_L         06_9AH
  ATOM_TREMONT_L      06_9CH
  RAPTORLAKE          06_B7H
  RAPTORLAKE_P        06_BAH
  ATOM_GRACEMONT      06_BEH
  RAPTORLAKE_S        06_BFH
  =================== ============

As an exception to this table, Intel Xeon E family parts ALDERLAKE (06_97H)
and RAPTORLAKE (06_B7H) codenamed Catlow are not affected. They are reported
as vulnerable in Linux because they share the same family/model with an
affected part. Unlike their affected counterparts, they do not enumerate
RFDS_CLEAR or CPUID.HYBRID. This information could be used to distinguish
between the affected and unaffected parts, but it is deemed not worth adding
complexity as the reporting is fixed automatically when these parts enumerate
RFDS_NO.

Mitigation
==========
Intel released a microcode update that enables software to clear sensitive
information using the VERW instruction. Like MDS, RFDS deploys the same
mitigation strategy to force the CPU to clear the affected buffers before an
attacker can extract the secrets. This is achieved by using the otherwise
unused and obsolete VERW instruction in combination with a microcode update.
The microcode clears the affected CPU buffers when the VERW instruction is
executed.

Mitigation points
-----------------
VERW is executed by the kernel before returning to user space, and by KVM
before VMentry. None of the affected cores support SMT, so VERW is not
required at C-state transitions.

New bits in IA32_ARCH_CAPABILITIES
----------------------------------
Newer processors, and microcode updates on existing affected processors,
added new bits to the IA32_ARCH_CAPABILITIES MSR. These bits can be used to
enumerate vulnerability and mitigation capability:

- Bit 27 - RFDS_NO - When set, processor is not affected by RFDS.
- Bit 28 - RFDS_CLEAR - When set, processor is affected by RFDS, and has the
  microcode that clears the affected buffers on VERW execution.

Mitigation control on the kernel command line
---------------------------------------------
The kernel command line allows controlling the RFDS mitigation at boot time
with the parameter "reg_file_data_sampling=". The valid arguments are:

  ========== =================================================================
  on         If the CPU is vulnerable, enable mitigation; CPU buffer clearing
             on exit to userspace and before entering a VM.
  off        Disables mitigation.
  ========== =================================================================

The mitigation default is selected by CONFIG_MITIGATION_RFDS.

Mitigation status information
-----------------------------
The Linux kernel provides a sysfs interface to enumerate the current
vulnerability status of the system: whether the system is vulnerable, and
which mitigations are active. The relevant sysfs file is:

  /sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling

The possible values in this file are:

  .. list-table::

     * - 'Not affected'
       - The processor is not vulnerable
     * - 'Vulnerable'
       - The processor is vulnerable, but no mitigation enabled
     * - 'Vulnerable: No microcode'
       - The processor is vulnerable but microcode is not updated.
     * - 'Mitigation: Clear Register File'
       - The processor is vulnerable and the CPU buffer clearing mitigation
         is enabled.

References
----------
.. [#f1] Affected Processors
         https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
================================================
StarFive StarLink Performance Monitor Unit (PMU)
================================================

The StarFive StarLink Performance Monitor Unit (PMU) exists within the
StarLink Coherent Network on Chip (CNoC) that connects multiple CPU
clusters with an L3 memory system.

The uncore PMU supports an overflow interrupt, up to 16 programmable 64-bit
event counters, and an independent 64-bit cycle counter.
The PMU can only be accessed via Memory Mapped I/O and is common to the
cores connected to the same PMU.

The driver exposes supported PMU events in the sysfs "events" directory
under::

        /sys/bus/event_source/devices/starfive_starlink_pmu/events/

The driver exposes the CPU used to handle PMU events in the sysfs "cpumask"
file under::

        /sys/bus/event_source/devices/starfive_starlink_pmu/cpumask/

The driver describes the format of config (event ID) in the sysfs "format"
directory under::

        /sys/bus/event_source/devices/starfive_starlink_pmu/format/

Example of perf usage::

        $ perf list

        starfive_starlink_pmu/cycles/             [Kernel PMU event]
        starfive_starlink_pmu/read_hit/           [Kernel PMU event]
        starfive_starlink_pmu/read_miss/          [Kernel PMU event]
        starfive_starlink_pmu/read_request/       [Kernel PMU event]
        starfive_starlink_pmu/release_request/    [Kernel PMU event]
        starfive_starlink_pmu/write_hit/          [Kernel PMU event]
        starfive_starlink_pmu/write_miss/         [Kernel PMU event]
        starfive_starlink_pmu/write_request/      [Kernel PMU event]
        starfive_starlink_pmu/writeback/          [Kernel PMU event]


        $ perf stat -a -e /starfive_starlink_pmu/cycles/ sleep 1

Sampling is not supported. As a result, "perf record" is not supported.
Attaching to a task is not supported; only system-wide counting is
supported.
.. SPDX-License-Identifier: GPL-2.0

=========================================
Flexible Return and Event Delivery (FRED)
=========================================

Overview
========

The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:

1) Improve overall performance and response time by replacing event
   delivery through the interrupt descriptor table (IDT event
   delivery) and event return by the IRET instruction with lower
   latency transitions.

2) Improve software robustness by ensuring that event delivery
   establishes the full supervisor context and that event return
   establishes the full user context.

The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is used also to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.

In addition to these transitions, the FRED architecture defines a new
instruction (LKGS) for managing the state of the GS segment register.
The LKGS instruction can be used by 64-bit operating systems that do
not use the new FRED transitions.

Furthermore, the FRED architecture is easy to extend for future CPU
architectures.

Software based event dispatching
================================

FRED operates differently from IDT in terms of event handling. Instead
of directly dispatching an event to its handler based on the event
vector, FRED requires the software to dispatch an event to its handler
based on both the event's type and vector. Therefore, an event dispatch
framework must be implemented to facilitate the event-to-handler
dispatch process. The FRED event dispatch framework takes control
once an event is delivered, and employs a two-level dispatch.

The first level dispatching is event type based, and the second level
dispatching is event vector based.

Full supervisor/user context
============================

FRED event delivery atomically saves and restores the full
supervisor/user context upon event delivery and return. It thus avoids
the problem of transient states due to %cr2 and/or %dr6, and the kernel
no longer needs to handle all the ugly corner cases caused by
half-baked entry states.

FRED allows explicit unblocking of NMIs with the new event return
instructions ERETS/ERETU, avoiding the mess caused by IRET, which
unconditionally unblocks NMIs, e.g., when an exception happens during
NMI handling.

FRED always restores the full value of %rsp, thus ESPFIX is no longer
needed when FRED is enabled.

LKGS
====

LKGS behaves like the MOV to GS instruction except that it loads the
base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
segment's descriptor cache. With LKGS, an operating system can avoid
mucking with the kernel GS, i.e., it can always operate with its own
GS base address.

Because FRED event delivery from ring 3 and ERETU both swap the value
of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, plus
the introduction of the LKGS instruction, the SWAPGS instruction is no
longer needed when FRED is enabled, and is thus disallowed (#UD).

Stack levels
============

Four stack levels, 0 through 3, are introduced to replace the
nonreentrant IST for event handling, and each stack level should be
configured to use a dedicated stack.

The current stack level could be unchanged or go higher upon FRED
event delivery. If unchanged, the CPU keeps using the current event
stack. If higher, the CPU switches to a new event stack specified by
the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP[123].

Only execution of a FRED return instruction, ERETS or ERETU, can lower
the current stack level, causing the CPU to switch back to the stack it
was on before a previous event delivery that promoted the stack level.