Merge tag 'asoc-fix-v6.9-rc2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v6.9

A relatively large set of fixes here; the biggest piece of it is a
series correcting some problems with the delay reporting for Intel SOF
cards, but there's a bunch of other things.  Everything here is driver
specific except for a fix in the core for an issue with sign extension
when handling volume controls.
Takashi Iwai
commit 100c85421b
.get_maintainer.ignore | 1
.gitignore | 1
.mailmap | 3
CREDITS | 10
Documentation/ABI/obsolete/sysfs-gpio | 4
Documentation/ABI/testing/configfs-usb-gadget-ffs | 12
Documentation/ABI/testing/debugfs-cxl | 34
Documentation/ABI/testing/debugfs-driver-qat | 26
Documentation/ABI/testing/debugfs-hisi-hpre | 22
Documentation/ABI/testing/debugfs-hisi-sec | 22
Documentation/ABI/testing/debugfs-hisi-zip | 22
Documentation/ABI/testing/debugfs-intel-iommu | 276
Documentation/ABI/testing/gpio-cdev | 9
Documentation/ABI/testing/sysfs-bus-coresight-devices-tpdm | 87
Documentation/ABI/testing/sysfs-bus-cxl | 34
Documentation/ABI/testing/sysfs-bus-dax | 153
Documentation/ABI/testing/sysfs-bus-iio-adc-pac1934 | 9
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats | 18
Documentation/ABI/testing/sysfs-bus-usb | 10
Documentation/ABI/testing/sysfs-bus-vdpa | 10
Documentation/ABI/testing/sysfs-class-hwmon | 27
Documentation/ABI/testing/sysfs-class-led-trigger-netdev | 12
Documentation/ABI/testing/sysfs-class-led-trigger-tty | 14
Documentation/ABI/testing/sysfs-class-net-queues | 23
Documentation/ABI/testing/sysfs-class-usb_role | 6
Documentation/ABI/testing/sysfs-devices-system-cpu | 1
Documentation/ABI/testing/sysfs-driver-qat | 20
Documentation/ABI/testing/sysfs-fs-f2fs | 52
Documentation/ABI/testing/sysfs-fs-virtiofs | 11
Documentation/ABI/testing/sysfs-kernel-mm-cma | 6
Documentation/ABI/testing/sysfs-kernel-mm-damon | 16
Documentation/ABI/testing/sysfs-kernel-mm-mempolicy | 4
Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave | 25
Documentation/Makefile | 5
Documentation/RCU/checklist.rst | 32
Documentation/RCU/rcu_dereference.rst | 5
Documentation/RCU/torture.rst | 2
Documentation/RCU/whatisRCU.rst | 19
Documentation/admin-guide/RAS/address-translation.rst | 24
Documentation/admin-guide/RAS/error-decoding.rst | 11
Documentation/admin-guide/RAS/index.rst | 7
Documentation/admin-guide/RAS/main.rst | 10
Documentation/admin-guide/README.rst | 69
Documentation/admin-guide/cgroup-v1/cpusets.rst | 2
Documentation/admin-guide/cgroup-v1/hugetlb.rst | 20
Documentation/admin-guide/cifs/introduction.rst | 2
Documentation/admin-guide/device-mapper/index.rst | 2
Documentation/admin-guide/device-mapper/vdo-design.rst | 633
Documentation/admin-guide/device-mapper/vdo.rst | 406
Documentation/admin-guide/edid.rst | 35
Documentation/admin-guide/gpio/gpio-mockup.rst | 8
Documentation/admin-guide/gpio/index.rst | 6
Documentation/admin-guide/gpio/obsolete.rst | 13
Documentation/admin-guide/hw-vuln/index.rst | 1
Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst | 104
Documentation/admin-guide/hw-vuln/spectre.rst | 8
Documentation/admin-guide/index.rst | 4
Documentation/admin-guide/kdump/kdump.rst | 7
Documentation/admin-guide/kdump/vmcoreinfo.rst | 8
Documentation/admin-guide/kernel-parameters.rst | 1
Documentation/admin-guide/kernel-parameters.txt | 686
Documentation/admin-guide/laptops/thinkpad-acpi.rst | 7
Documentation/admin-guide/media/visl.rst | 12
Documentation/admin-guide/media/vivid.rst | 2
Documentation/admin-guide/mm/damon/reclaim.rst | 27
Documentation/admin-guide/mm/damon/usage.rst | 158
Documentation/admin-guide/mm/numa_memory_policy.rst | 9
Documentation/admin-guide/perf/hisi-pcie-pmu.rst | 32
Documentation/admin-guide/perf/index.rst | 1
Documentation/admin-guide/perf/starfive_starlink_pmu.rst | 46
Documentation/admin-guide/pm/amd-pstate.rst | 59
Documentation/admin-guide/reporting-regressions.rst | 2
Documentation/admin-guide/sysctl/kernel.rst | 43
Documentation/admin-guide/sysctl/net.rst | 5
Documentation/admin-guide/tainted-kernels.rst | 4
Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst | 1985
Documentation/arch/arm64/elf_hwcaps.rst | 49
Documentation/arch/arm64/silicon-errata.rst | 5
Documentation/arch/arm64/sme.rst | 11
Documentation/arch/arm64/sve.rst | 10
Documentation/arch/riscv/vm-layout.rst | 16
Documentation/arch/x86/amd-memory-encryption.rst | 16
Documentation/arch/x86/amd_hsmp.rst | 7
Documentation/arch/x86/boot.rst | 3
Documentation/arch/x86/pti.rst | 6
Documentation/arch/x86/resctrl.rst | 8
Documentation/arch/x86/topology.rst | 24
Documentation/arch/x86/x86_64/fred.rst | 96
Documentation/arch/x86/x86_64/index.rst | 1
Documentation/bpf/kfuncs.rst | 8
Documentation/bpf/map_lpm_trie.rst | 2
Documentation/bpf/standardization/instruction-set.rst | 594
Documentation/bpf/verifier.rst | 2
Documentation/conf.py | 6
Documentation/core-api/workqueue.rst | 43
Documentation/dev-tools/checkpatch.rst | 4
Documentation/dev-tools/kasan.rst | 41
Documentation/dev-tools/kselftest.rst | 16
Documentation/dev-tools/ubsan.rst | 28
Documentation/devicetree/bindings/Makefile | 3
Some files were not shown because too many files have changed in this diff.

1
.get_maintainer.ignore

@@ -1,4 +1,5 @@
Alan Cox <alan@lxorguk.ukuu.org.uk>
Alan Cox <root@hraefn.swansea.linux.org.uk>
Christoph Hellwig <hch@lst.de>
Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Marc Gonzalez <marc.w.gonzalez@free.fr>

1
.gitignore

@@ -52,6 +52,7 @@
*.xz
*.zst
Module.symvers
dtbs-list
modules.order
#

3
.mailmap

@@ -439,6 +439,8 @@ Mukesh Ojha <quic_mojha@quicinc.com> <mojha@codeaurora.org>
Muna Sinada <quic_msinada@quicinc.com> <msinada@codeaurora.org>
Murali Nalajala <quic_mnalajal@quicinc.com> <mnalajal@codeaurora.org>
Mythri P K <mythripk@ti.com>
Nadav Amit <nadav.amit@gmail.com> <namit@vmware.com>
Nadav Amit <nadav.amit@gmail.com> <namit@cs.technion.ac.il>
Nadia Yvette Chambers <nyc@holomorphy.com> William Lee Irwin III <wli@holomorphy.com>
Naoya Horiguchi <naoya.horiguchi@nec.com> <n-horiguchi@ah.jp.nec.com>
Nathan Chancellor <nathan@kernel.org> <natechancellor@gmail.com>
@@ -573,6 +575,7 @@ Simon Kelley <simon@thekelleys.org.uk>
Sricharan Ramabadhran <quic_srichara@quicinc.com> <sricharan@codeaurora.org>
Srinivas Ramana <quic_sramana@quicinc.com> <sramana@codeaurora.org>
Sriram R <quic_srirrama@quicinc.com> <srirrama@codeaurora.org>
Stefan Wahren <wahrenst@gmx.net> <stefan.wahren@i2se.com>
Stéphane Witzmann <stephane.witzmann@ubpmes.univ-bpclermont.fr>
Stephen Hemminger <stephen@networkplumber.org> <shemminger@linux-foundation.org>
Stephen Hemminger <stephen@networkplumber.org> <shemminger@osdl.org>

10
CREDITS

@@ -63,6 +63,11 @@ D: dosfs, LILO, some fd features, ATM, various other hacks here and there
S: Buenos Aires
S: Argentina
NTFS FILESYSTEM
N: Anton Altaparmakov
E: anton@tuxera.com
D: NTFS filesystem
N: Tim Alpaerts
E: tim_alpaerts@toyota-motor-europe.com
D: 802.2 class II logical link control layer,
@@ -2955,6 +2960,11 @@ S: 2364 Old Trail Drive
S: Reston, Virginia 20191
S: USA
N: Sekhar Nori
E: nori.sekhar@gmail.com
D: Maintainer of Texas Instruments DaVinci machine support, contributor
D: to device drivers relevant to that SoC family.
N: Fredrik Noring
E: noring@nocrew.org
W: http://www.lysator.liu.se/~noring/

4
Documentation/ABI/obsolete/sysfs-gpio

@@ -28,5 +28,5 @@ Description:
/label ... (r/o) descriptive, not necessarily unique
/ngpio ... (r/o) number of GPIOs; numbered N to N + (ngpio - 1)
This ABI is deprecated and will be removed after 2020. It is
replaced with the GPIO character device.
This ABI is obsoleted by Documentation/ABI/testing/gpio-cdev and will be
removed after 2020.

12
Documentation/ABI/testing/configfs-usb-gadget-ffs

@@ -4,6 +4,14 @@ KernelVersion: 3.13
Description: The purpose of this directory is to create and remove it.
A corresponding USB function instance is created/removed.
There are no attributes here.
All parameters are set through FunctionFS.
All attributes are read only:
============= ============================================
ready 1 if the function is ready to be used, e.g.
if userspace has written descriptors and
strings to ep0, so the gadget can be
enabled - 0 otherwise.
============= ============================================
All other parameters are set through FunctionFS.
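For instance, once userspace has written descriptors and strings to ep0,
the flag can be polled; g1 and ffs.myfunc are placeholder names, and
configfs is assumed to be mounted at /sys/kernel/config::
$ cat /sys/kernel/config/usb_gadget/g1/functions/ffs.myfunc/ready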

34
Documentation/ABI/testing/debugfs-cxl

@@ -33,3 +33,37 @@ Description:
device cannot clear poison from the address, -ENXIO is returned.
The clear_poison attribute is only visible for devices
supporting the capability.
What: /sys/kernel/debug/cxl/einj_types
Date: January, 2024
KernelVersion: v6.9
Contact: linux-cxl@vger.kernel.org
Description:
(RO) Prints the CXL protocol error types made available by
the platform in the format:
0x<error number> <error type>
The possible error types are (as of ACPI v6.5):
0x1000 CXL.cache Protocol Correctable
0x2000 CXL.cache Protocol Uncorrectable non-fatal
0x4000 CXL.cache Protocol Uncorrectable fatal
0x8000 CXL.mem Protocol Correctable
0x10000 CXL.mem Protocol Uncorrectable non-fatal
0x20000 CXL.mem Protocol Uncorrectable fatal
The <error number> can be written to einj_inject to inject
<error type> into a chosen dport.
What: /sys/kernel/debug/cxl/$dport_dev/einj_inject
Date: January, 2024
KernelVersion: v6.9
Contact: linux-cxl@vger.kernel.org
Description:
(WO) Writing an integer to this file injects the corresponding
CXL protocol error into $dport_dev ($dport_dev will be a device
name from /sys/bus/pci/devices). The integer to type mapping for
injection can be found by reading from einj_types. If the dport
was enumerated in RCH mode, a CXL 1.1 error is injected, otherwise
a CXL 2.0 error is injected.
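As a hedged sketch, listing the supported error numbers and then injecting
a CXL.mem Protocol Correctable error (0x8000 from the table above) into a
hypothetical dport 0000:0c:00.0 might look like::
# cat /sys/kernel/debug/cxl/einj_types
# echo 0x8000 > /sys/kernel/debug/cxl/0000:0c:00.0/einj_inject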

26
Documentation/ABI/testing/debugfs-driver-qat

@@ -81,3 +81,29 @@ Description: (RO) Read returns, for each Acceleration Engine (AE), the number
<N>: Number of Compress and Verify (CnV) errors and type
of the last CnV error detected by Acceleration
Engine N.
What: /sys/kernel/debug/qat_<device>_<BDF>/heartbeat/inject_error
Date: March 2024
KernelVersion: 6.8
Contact: qat-linux@intel.com
Description: (WO) Write to inject an error that simulates a heartbeat
failure. This is to be used for testing purposes.
After writing this file, the driver stops arbitration on a
random engine and disables the fetching of heartbeat counters.
If a workload is running on the device, a job submitted to the
accelerator might not get a response and a read of the
`heartbeat/status` attribute might report -1, i.e. device
unresponsive.
The error is unrecoverable, thus the device must be restarted to
restore its functionality.
This attribute is available only when the kernel is built with
CONFIG_CRYPTO_DEV_QAT_ERROR_INJECTION=y.
A write of 1 enables error injection.
The following example shows how to enable error injection::
# cd /sys/kernel/debug/qat_<device>_<BDF>
# echo 1 > heartbeat/inject_error
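Afterwards, the simulated failure can be observed through the
`heartbeat/status` attribute mentioned above; a read reporting -1
indicates the device has become unresponsive::
# cat heartbeat/status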

22
Documentation/ABI/testing/debugfs-hisi-hpre

@@ -111,6 +111,28 @@ Description: QM debug registers(regs) read hardware register value. This
node is used to show the change of the qm register values. This
node can help users to check the change of register values.
What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/qm_state
Date: Jan 2024
Contact: linux-crypto@vger.kernel.org
Description: Dump the state of the device.
0: busy, 1: idle.
Only available for PF; has no other effect on HPRE.
What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/dev_timeout
Date: Feb 2024
Contact: linux-crypto@vger.kernel.org
Description: Set the wait time used when stopping a queue fails. Available
for both PF and VF; has no other effect on HPRE.
0: do not wait (default); other values: wait dev_timeout * 20
microseconds.
What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/dev_state
Date: Feb 2024
Contact: linux-crypto@vger.kernel.org
Description: Dump the stop queue status of the QM. The default value is 0.
If dev_timeout is set and stopping a queue fails, dev_state
returns a non-zero value. Available for both PF and VF; has
no other effect on HPRE.
What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/diff_regs
Date: Mar 2022
Contact: linux-crypto@vger.kernel.org
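As a hedged sketch (0000:79:00.0 is a hypothetical PF address; the
hisi_sec2 and hisi_zip directories below expose the same nodes), the qm
attributes above might be exercised like this::
# cd /sys/kernel/debug/hisi_hpre/0000:79:00.0/qm
# echo 50 > dev_timeout
# cat dev_state
# cat qm_state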

22
Documentation/ABI/testing/debugfs-hisi-sec

@@ -91,6 +91,28 @@ Description: QM debug registers(regs) read hardware register value. This
node is used to show the change of the qm register values. This
node can help users to check the change of register values.
What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/qm_state
Date: Jan 2024
Contact: linux-crypto@vger.kernel.org
Description: Dump the state of the device.
0: busy, 1: idle.
Only available for PF; has no other effect on SEC.
What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/dev_timeout
Date: Feb 2024
Contact: linux-crypto@vger.kernel.org
Description: Set the wait time used when stopping a queue fails. Available
for both PF and VF; has no other effect on SEC.
0: do not wait (default); other values: wait dev_timeout * 20
microseconds.
What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/dev_state
Date: Feb 2024
Contact: linux-crypto@vger.kernel.org
Description: Dump the stop queue status of the QM. The default value is 0.
If dev_timeout is set and stopping a queue fails, dev_state
returns a non-zero value. Available for both PF and VF; has
no other effect on SEC.
What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/diff_regs
Date: Mar 2022
Contact: linux-crypto@vger.kernel.org

22
Documentation/ABI/testing/debugfs-hisi-zip

@@ -104,6 +104,28 @@ Description: QM debug registers(regs) read hardware register value. This
node is used to show the change of the qm register values. This
node can help users to check the change of register values.
What: /sys/kernel/debug/hisi_zip/<bdf>/qm/qm_state
Date: Jan 2024
Contact: linux-crypto@vger.kernel.org
Description: Dump the state of the device.
0: busy, 1: idle.
Only available for PF; has no other effect on ZIP.
What: /sys/kernel/debug/hisi_zip/<bdf>/qm/dev_timeout
Date: Feb 2024
Contact: linux-crypto@vger.kernel.org
Description: Set the wait time used when stopping a queue fails. Available
for both PF and VF; has no other effect on ZIP.
0: do not wait (default); other values: wait dev_timeout * 20
microseconds.
What: /sys/kernel/debug/hisi_zip/<bdf>/qm/dev_state
Date: Feb 2024
Contact: linux-crypto@vger.kernel.org
Description: Dump the stop queue status of the QM. The default value is 0.
If dev_timeout is set and stopping a queue fails, dev_state
returns a non-zero value. Available for both PF and VF; has
no other effect on ZIP.
What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/diff_regs
Date: Mar 2022
Contact: linux-crypto@vger.kernel.org

276
Documentation/ABI/testing/debugfs-intel-iommu

@@ -0,0 +1,276 @@
What: /sys/kernel/debug/iommu/intel/iommu_regset
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file dumps all the register contents for each IOMMU device.
Example in Kabylake:
::
$ sudo cat /sys/kernel/debug/iommu/intel/iommu_regset
IOMMU: dmar0 Register Base Address: 26be37000
Name Offset Contents
VER 0x00 0x0000000000000010
GCMD 0x18 0x0000000000000000
GSTS 0x1c 0x00000000c7000000
FSTS 0x34 0x0000000000000000
FECTL 0x38 0x0000000000000000
[...]
IOMMU: dmar1 Register Base Address: fed90000
Name Offset Contents
VER 0x00 0x0000000000000010
GCMD 0x18 0x0000000000000000
GSTS 0x1c 0x00000000c7000000
FSTS 0x34 0x0000000000000000
FECTL 0x38 0x0000000000000000
[...]
IOMMU: dmar2 Register Base Address: fed91000
Name Offset Contents
VER 0x00 0x0000000000000010
GCMD 0x18 0x0000000000000000
GSTS 0x1c 0x00000000c7000000
FSTS 0x34 0x0000000000000000
FECTL 0x38 0x0000000000000000
[...]
What: /sys/kernel/debug/iommu/intel/ir_translation_struct
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file dumps the table entries for Interrupt
remapping and Interrupt posting.
Example in Kabylake:
::
$ sudo cat /sys/kernel/debug/iommu/intel/ir_translation_struct
Remapped Interrupt supported on IOMMU: dmar0
IR table address:100900000
Entry SrcID DstID Vct IRTE_high IRTE_low
0 00:0a.0 00000080 24 0000000000040050 000000800024000d
1 00:0a.0 00000001 ef 0000000000040050 0000000100ef000d
Remapped Interrupt supported on IOMMU: dmar1
IR table address:100300000
Entry SrcID DstID Vct IRTE_high IRTE_low
0 00:02.0 00000002 26 0000000000040010 000000020026000d
[...]
****
Posted Interrupt supported on IOMMU: dmar0
IR table address:100900000
Entry SrcID PDA_high PDA_low Vct IRTE_high IRTE_low
What: /sys/kernel/debug/iommu/intel/dmar_translation_struct
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file dumps Intel IOMMU DMA remapping tables, such
as root table, context table, PASID directory and PASID
table entries in debugfs. Legacy mode doesn't support
PASID, hence the PASID field defaults to '-1' and other
PASID-related fields are invalid.
Example in Kabylake:
::
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_translation_struct
IOMMU dmar1: Root Table Address: 0x103027000
B.D.F Root_entry
00:02.0 0x0000000000000000:0x000000010303e001
Context_entry
0x0000000000000102:0x000000010303f005
PASID PASID_table_entry
-1 0x0000000000000000:0x0000000000000000:0x0000000000000000
IOMMU dmar0: Root Table Address: 0x103028000
B.D.F Root_entry
00:0a.0 0x0000000000000000:0x00000001038a7001
Context_entry
0x0000000000000000:0x0000000103220e7d
PASID PASID_table_entry
0 0x0000000000000000:0x0000000000800002:0x00000001038a5089
[...]
What: /sys/kernel/debug/iommu/intel/invalidation_queue
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file exports invalidation queue internals of each
IOMMU device.
Example in Kabylake:
::
$ sudo cat /sys/kernel/debug/iommu/intel/invalidation_queue
Invalidation queue on IOMMU: dmar0
Base: 0x10022e000 Head: 20 Tail: 20
Index qw0 qw1 qw2
0 0000000000000014 0000000000000000 0000000000000000
1 0000000200000025 0000000100059c04 0000000000000000
2 0000000000000014 0000000000000000 0000000000000000
qw3 status
0000000000000000 0000000000000000
0000000000000000 0000000000000000
0000000000000000 0000000000000000
[...]
Invalidation queue on IOMMU: dmar1
Base: 0x10026e000 Head: 32 Tail: 32
Index qw0 qw1 status
0 0000000000000004 0000000000000000 0000000000000000
1 0000000200000025 0000000100059804 0000000000000000
2 0000000000000011 0000000000000000 0000000000000000
[...]
What: /sys/kernel/debug/iommu/intel/dmar_perf_latency
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file is used to control and show counts of
execution time ranges for various types per DMAR.
Firstly, write a value to
/sys/kernel/debug/iommu/intel/dmar_perf_latency
to enable sampling.
The possible values are as follows:
* 0 - disable sampling all latency data
* 1 - enable sampling IOTLB invalidation latency data
* 2 - enable sampling devTLB invalidation latency data
* 3 - enable sampling intr entry cache invalidation latency data
Next, reading /sys/kernel/debug/iommu/intel/dmar_perf_latency gives
a snapshot of the sampling results of all enabled monitors.
Examples in Kabylake:
::
1) Disable sampling all latency data:
$ echo 0 | sudo tee /sys/kernel/debug/iommu/intel/dmar_perf_latency
2) Enable sampling IOTLB invalidation latency data
$ echo 1 | sudo tee /sys/kernel/debug/iommu/intel/dmar_perf_latency
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
IOMMU: dmar0 Register Base Address: 26be37000
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
inv_iotlb 0 0 0 0 0
1ms-10ms >=10ms min(us) max(us) average(us)
inv_iotlb 0 0 0 0 0
[...]
IOMMU: dmar2 Register Base Address: fed91000
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
inv_iotlb 0 0 18 0 0
1ms-10ms >=10ms min(us) max(us) average(us)
inv_iotlb 0 0 2 2 2
3) Enable sampling devTLB invalidation latency data
$ echo 2 | sudo tee /sys/kernel/debug/iommu/intel/dmar_perf_latency
$ sudo cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
IOMMU: dmar0 Register Base Address: 26be37000
<0.1us 0.1us-1us 1us-10us 10us-100us 100us-1ms
inv_devtlb 0 0 0 0 0
>=10ms min(us) max(us) average(us)
inv_devtlb 0 0 0 0
[...]
What: /sys/kernel/debug/iommu/intel/<bdf>/domain_translation_struct
Date: December 2023
Contact: Jingqi Liu <Jingqi.liu@intel.com>
Description:
This file dumps a specified page table of Intel IOMMU
in legacy mode or scalable mode.
For a device that only supports legacy mode, dump its
page table by the debugfs file in the debugfs device
directory. e.g.
/sys/kernel/debug/iommu/intel/0000:00:02.0/domain_translation_struct.
For a device that supports scalable mode, dump the
page table of specified pasid by the debugfs file in
the debugfs pasid directory. e.g.
/sys/kernel/debug/iommu/intel/0000:00:02.0/1/domain_translation_struct.
Examples in Kabylake:
::
1) Dump the page table of device "0000:00:02.0" that only supports legacy mode.
$ sudo cat /sys/kernel/debug/iommu/intel/0000:00:02.0/domain_translation_struct
Device 0000:00:02.0 @0x1017f8000
IOVA_PFN PML5E PML4E
0x000000008d800 | 0x0000000000000000 0x00000001017f9003
0x000000008d801 | 0x0000000000000000 0x00000001017f9003
0x000000008d802 | 0x0000000000000000 0x00000001017f9003
PDPE PDE PTE
0x00000001017fa003 0x00000001017fb003 0x000000008d800003
0x00000001017fa003 0x00000001017fb003 0x000000008d801003
0x00000001017fa003 0x00000001017fb003 0x000000008d802003
[...]
2) Dump the page table of device "0000:00:0a.0" with PASID "1" that
supports scalable mode.
$ sudo cat /sys/kernel/debug/iommu/intel/0000:00:0a.0/1/domain_translation_struct
Device 0000:00:0a.0 with pasid 1 @0x10c112000
IOVA_PFN PML5E PML4E
0x0000000000000 | 0x0000000000000000 0x000000010df93003
0x0000000000001 | 0x0000000000000000 0x000000010df93003
0x0000000000002 | 0x0000000000000000 0x000000010df93003
PDPE PDE PTE
0x0000000106ae6003 0x0000000104b38003 0x0000000147c00803
0x0000000106ae6003 0x0000000104b38003 0x0000000147c01803
0x0000000106ae6003 0x0000000104b38003 0x0000000147c02803
[...]

9
Documentation/ABI/testing/gpio-cdev

@@ -6,8 +6,9 @@ Description:
The character device files /dev/gpiochip* are the interface
between GPIO chips and userspace.
The ioctl(2)-based ABI is defined and documented in
[include/uapi]<linux/gpio.h>.
The ioctl(2)-based ABI is defined in
[include/uapi]<linux/gpio.h> and documented in
Documentation/userspace-api/gpio/chardev.rst.
The following file operations are supported:
@@ -17,8 +18,8 @@ Description:
ioctl(2)
Initiate various actions.
See the inline documentation in [include/uapi]<linux/gpio.h>
for descriptions of all ioctls.
See Documentation/userspace-api/gpio/chardev.rst
for a description of all ioctls.
close(2)
Stops and frees up the I/O contexts that were associated

87
Documentation/ABI/testing/sysfs-bus-coresight-devices-tpdm

@@ -170,3 +170,90 @@ Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_t
Description:
(RW) Set/Get the MSR(mux select register) for the DSB subunit
TPDM.
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_mode
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description: (Write) Set the data collection mode of CMB tpdm. Continuous
change creates CMB data set elements on every CMBCLK edge.
Trace-on-change creates CMB data set elements only when a new
data set element differs in value from the previous element
in a CMB data set.
Accepts only one of two values - 0 or 1.
0 : Continuous CMB collection mode.
1 : Trace-on-change CMB collection mode.
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_trig_patt/xpr[0:1]
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description:
(RW) Set/Get the value of the trigger pattern for the CMB
subunit TPDM.
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_trig_patt/xpmr[0:1]
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description:
(RW) Set/Get the mask of the trigger pattern for the CMB
subunit TPDM.
What: /sys/bus/coresight/devices/<tpdm-name>/dsb_patt/tpr[0:1]
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description:
(RW) Set/Get the value of the pattern for the CMB subunit TPDM.
What: /sys/bus/coresight/devices/<tpdm-name>/dsb_patt/tpmr[0:1]
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description:
(RW) Set/Get the mask of the pattern for the CMB subunit TPDM.
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_patt/enable_ts
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description:
(Write) Set the pattern timestamp of CMB tpdm. Read
the pattern timestamp of CMB tpdm.
Accepts only one of two values - 0 or 1.
0 : Disable CMB pattern timestamp.
1 : Enable CMB pattern timestamp.
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_trig_ts
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description:
(RW) Set/Get the trigger timestamp of the CMB for tpdm.
Accepts only one of two values - 0 or 1.
0 : Set the CMB trigger type to false.
1 : Set the CMB trigger type to true.
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_ts_all
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description:
(RW) Read or write the status of the timestamp on all interfaces.
Only the values 0 and 1 can be written to this node. Set this
node to 1 to request a timestamp for all trace packets.
0 : Disable the timestamp of all trace packets.
1 : Enable the timestamp of all trace packets.
What: /sys/bus/coresight/devices/<tpdm-name>/cmb_msr/msr[0:31]
Date: January 2024
KernelVersion: 6.9
Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
Description:
(RW) Set/Get the MSR(mux select register) for the CMB subunit
TPDM.
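As an illustrative sketch (tpdm0 is a placeholder device name), selecting
trace-on-change collection and enabling timestamps through the nodes above
might look like::
# cd /sys/bus/coresight/devices/tpdm0
# echo 1 > cmb_mode
# echo 1 > cmb_patt/enable_ts
# echo 1 > cmb_ts_all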

34
Documentation/ABI/testing/sysfs-bus-cxl

@@ -552,3 +552,37 @@ Description:
attribute is only visible for devices supporting the
capability. The retrieved errors are logged as kernel
events when cxl_poison event tracing is enabled.
What: /sys/bus/cxl/devices/regionZ/accessY/read_bandwidth
/sys/bus/cxl/devices/regionZ/accessY/write_bandwidth
Date: Jan, 2024
KernelVersion: v6.9
Contact: linux-cxl@vger.kernel.org
Description:
(RO) The aggregated read or write bandwidth of the region. The
number is the accumulated read or write bandwidth of all CXL memory
devices that contribute to the region, in MB/s. It is identical
to the data that should appear in
/sys/devices/system/node/nodeX/accessY/initiators/read_bandwidth or
/sys/devices/system/node/nodeX/accessY/initiators/write_bandwidth.
See Documentation/ABI/stable/sysfs-devices-node. access0 provides
the number to the closest initiator and access1 provides the
number to the closest CPU.
What: /sys/bus/cxl/devices/regionZ/accessY/read_latency
/sys/bus/cxl/devices/regionZ/accessY/write_latency
Date: Jan, 2024
KernelVersion: v6.9
Contact: linux-cxl@vger.kernel.org
Description:
(RO) The read or write latency of the region. The number is
the worst read or write latency of all CXL memory devices that
contribute to the region, in nanoseconds. It is identical to the
data that should appear in
/sys/devices/system/node/nodeX/accessY/initiators/read_latency or
/sys/devices/system/node/nodeX/accessY/initiators/write_latency.
See Documentation/ABI/stable/sysfs-devices-node. access0 provides
the number to the closest initiator and access1 provides the
number to the closest CPU.

153
Documentation/ABI/testing/sysfs-bus-dax

@@ -0,0 +1,153 @@
What: /sys/bus/dax/devices/daxX.Y/align
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(RW) Provides a way to specify an alignment for a dax device.
Values allowed are constrained by the physical address ranges
that back the dax device, and also by arch requirements.
What: /sys/bus/dax/devices/daxX.Y/mapping
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(WO) Provides a way to allocate a mapping range under a dax
device. Specified in the format <start>-<end>.
What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/start
What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/end
What: /sys/bus/dax/devices/daxX.Y/mapping[0..N]/page_offset
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(RO) A dax device may have multiple constituent discontiguous
address ranges. These are represented by the different
'mappingX' subdirectories. The 'start' attribute indicates the
start physical address for the given range. The 'end' attribute
indicates the end physical address for the given range. The
'page_offset' attribute indicates the offset of the current
range in the dax device.
What: /sys/bus/dax/devices/daxX.Y/resource
Date: June, 2019
KernelVersion: v5.3
Contact: nvdimm@lists.linux.dev
Description:
(RO) The resource attribute indicates the starting physical
address of a dax device. In case of a device with multiple
constituent ranges, it indicates the starting address of the
first range.
What: /sys/bus/dax/devices/daxX.Y/size
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(RW) The size attribute indicates the total size of a dax
device. For creating subdivided dax devices, or for resizing
an existing device, the new size can be written to this as
part of the reconfiguration process.
What: /sys/bus/dax/devices/daxX.Y/numa_node
Date: November, 2019
KernelVersion: v5.5
Contact: nvdimm@lists.linux.dev
Description:
(RO) If NUMA is enabled and the platform has affinitized the
backing device for this dax device, emit the CPU node
affinity for this device.
What: /sys/bus/dax/devices/daxX.Y/target_node
Date: February, 2019
KernelVersion: v5.1
Contact: nvdimm@lists.linux.dev
Description:
(RO) The target-node attribute is the Linux numa-node that a
device-dax instance may create when it is online. Prior to
being online the device's 'numa_node' property reflects the
closest online cpu node which is the typical expectation of a
device 'numa_node'. Once it is online it becomes its own
distinct numa node.
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/available_size
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(RO) The available_size attribute tracks available dax region
capacity. This only applies to volatile hmem devices, not pmem
devices, since pmem devices are defined by nvdimm namespace
boundaries.
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/size
Date: July, 2017
KernelVersion: v5.1
Contact: nvdimm@lists.linux.dev
Description:
(RO) The size attribute indicates the size of a given dax region
in bytes.
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/align
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(RO) The align attribute indicates alignment of the dax region.
Changes to align may not always be valid, for example when
certain mappings were created with 2M and we then switch to 1G.
This validates all ranges against the new value being attempted,
post resizing.
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/seed
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(RO) The seed device is a concept for dynamic dax regions to be
able to split the region amongst multiple sub-instances. The
seed device, similar to libnvdimm seed devices, is a device
that starts with zero capacity allocated and unbound to a
driver.
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/create
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(RW) The create interface to the dax region provides a way to
create a new unconfigured dax device under the given region, which
can then be configured (with a size etc.) and then probed.
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/delete
Date: October, 2020
KernelVersion: v5.10
Contact: nvdimm@lists.linux.dev
Description:
(WO) The delete interface for a dax region provides for deletion
of any 0-sized and idle dax devices.
What: $(readlink -f /sys/bus/dax/devices/daxX.Y)/../dax_region/id
Date: July, 2017
KernelVersion: v5.1
Contact: nvdimm@lists.linux.dev
Description:
(RO) The id attribute indicates the region id of a dax region.
What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory
Date: January, 2024
KernelVersion: v6.8
Contact: nvdimm@lists.linux.dev
Description:
(RW) Control the memmap_on_memory setting if the dax device
were to be hotplugged as system memory. This determines whether
the 'altmap' for the hotplugged memory will be placed on the
device being hotplugged (memmap_on_memory=1) or if it will be
placed on regular memory (memmap_on_memory=0). This attribute
must be set before the device is handed over to the 'kmem'
driver (i.e. hotplugged into system-ram). Additionally, this
depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled
memmap_on_memory parameter for memory_hotplug. This is
typically set on the kernel command line -
memory_hotplug.memmap_on_memory set to 'true' or 'force'.
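Putting the region interfaces above together, a minimal sketch of creating
and later deleting a subdivided device; dax0.0, dax0.1 and the 16 GiB size
are placeholders::
# region=$(readlink -f /sys/bus/dax/devices/dax0.0)/../dax_region
# cat $region/available_size
# echo 1 > $region/create
# echo $((16 << 30)) > /sys/bus/dax/devices/dax0.1/size
# echo 0 > /sys/bus/dax/devices/dax0.1/size
# echo dax0.1 > $region/delete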

9
Documentation/ABI/testing/sysfs-bus-iio-adc-pac1934

@@ -0,0 +1,9 @@
What: /sys/bus/iio/devices/iio:deviceX/in_shunt_resistorY
KernelVersion: 6.7
Contact: linux-iio@vger.kernel.org
Description:
The value of the shunt resistor may be known only at runtime
and set by a client application. This attribute allows setting
its value in micro-ohms. X is the IIO index of the device.
Y is the channel number. The value is used to calculate
current, power and accumulated energy.
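For instance, a client application might program a 10 milliohm shunt on
channel 1 of a hypothetical iio:device0::
# echo 10000 > /sys/bus/iio/devices/iio:device0/in_shunt_resistor1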

18
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats

@@ -11,7 +11,7 @@ saw any problems).
What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of correctable errors seen and reported by this
PCI device using ERR_COR. Note that since multiple errors may
@@ -32,7 +32,7 @@ Description: List of correctable errors seen and reported by this
What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable fatal errors seen and reported by this
PCI device using ERR_FATAL. Note that since multiple errors may
@@ -62,7 +62,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: List of uncorrectable nonfatal errors seen and reported by this
PCI device using ERR_NONFATAL. Note that since multiple errors
@@ -100,20 +100,20 @@ collectors) that are AER capable. These indicate the number of error messages as
device, so these counters include them and are thus cumulative of all the error
messages on the PCI hierarchy originating at that root port.
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_cor
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_COR messages reported to rootport.
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_fatal
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_FATAL messages reported to rootport.
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_nonfatal
What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.

10
Documentation/ABI/testing/sysfs-bus-usb

@@ -442,6 +442,16 @@ What: /sys/bus/usb/devices/usbX/descriptors
Description:
Contains the interface descriptors, in binary.
What: /sys/bus/usb/devices/usbX/bos_descriptors
Date: March 2024
Contact: Elbert Mai <code@elbertmai.com>
Description:
Binary file containing the cached binary device object store (BOS)
of the device. This consists of the BOS descriptor followed by the
set of device capability descriptors. All descriptors read from
this file are in bus-endian format. Note that the kernel will not
request the BOS from a device if its bcdUSB is less than 0x0201.
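A quick way to inspect the cached descriptors (usb1 is a placeholder;
devices with bcdUSB below 0x0201 expose no BOS)::
$ xxd /sys/bus/usb/devices/usb1/bos_descriptors | head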
What: /sys/bus/usb/devices/usbX/idProduct
Description:
Product ID, in hexadecimal.

10
Documentation/ABI/testing/sysfs-bus-vdpa

@@ -1,6 +1,6 @@
What: /sys/bus/vdpa/drivers_autoprobe
Date: March 2020
Contact: virtualization@lists.linux-foundation.org
Contact: virtualization@lists.linux.dev
Description:
This file determines whether new devices are immediately bound
to a driver after the creation. It initially contains 1, which
@@ -12,7 +12,7 @@ Description:
What: /sys/bus/vdpa/driver_probe
Date: March 2020
Contact: virtualization@lists.linux-foundation.org
Contact: virtualization@lists.linux.dev
Description:
Writing a device name to this file will cause the kernel to
bind the device to a compatible driver.
@@ -22,7 +22,7 @@ Description:
What: /sys/bus/vdpa/drivers/.../bind
Date: March 2020
Contact: virtualization@lists.linux-foundation.org
Contact: virtualization@lists.linux.dev
Description:
Writing a device name to this file will cause the driver to
attempt to bind to the device. This is useful for overriding
@@ -30,7 +30,7 @@ Description:
What: /sys/bus/vdpa/drivers/.../unbind
Date: March 2020
Contact: virtualization@lists.linux-foundation.org
Contact: virtualization@lists.linux.dev
Description:
Writing a device name to this file will cause the driver to
attempt to unbind from the device. This may be useful when
@@ -38,7 +38,7 @@ Description:
What: /sys/bus/vdpa/devices/.../driver_override
Date: November 2021
Contact: virtualization@lists.linux-foundation.org
Contact: virtualization@lists.linux.dev
Description:
This file allows the driver for a device to be specified.
When specified, only a driver with a name matching the value

27
Documentation/ABI/testing/sysfs-class-hwmon

@@ -149,6 +149,15 @@ Description:
RW
What: /sys/class/hwmon/hwmonX/inY_fault
Description:
Reports a voltage hard failure (e.g. shorted component)
- 1: Failed
- 0: Ok
RO
What: /sys/class/hwmon/hwmonX/cpuY_vid
Description:
CPU core reference voltage.
@@ -968,6 +977,15 @@ Description:
RW
What: /sys/class/hwmon/hwmonX/humidityY_max_alarm
Description:
Maximum humidity detection
- 0: OK
- 1: Maximum humidity detected
RO
What: /sys/class/hwmon/hwmonX/humidityY_max_hyst
Description:
Humidity hysteresis value for max limit.
@@ -987,6 +1005,15 @@ Description:
RW
What: /sys/class/hwmon/hwmonX/humidityY_min_alarm
Description:
Minimum humidity detection
- 0: OK
- 1: Minimum humidity detected
RO
What: /sys/class/hwmon/hwmonX/humidityY_min_hyst
Description:
Humidity hysteresis value for min limit.

12
Documentation/ABI/testing/sysfs-class-led-trigger-netdev

@@ -88,6 +88,8 @@ Description:
speed of 10MBps of the named network device.
Setting this value also immediately changes the LED state.
Present only if the named network device supports 10Mbps link speed.
What: /sys/class/leds/<led>/link_100
Date: Jun 2023
KernelVersion: 6.5
@@ -101,6 +103,8 @@ Description:
speed of 100Mbps of the named network device.
Setting this value also immediately changes the LED state.
Present only if the named network device supports 100Mbps link speed.
What: /sys/class/leds/<led>/link_1000
Date: Jun 2023
KernelVersion: 6.5
@@ -114,6 +118,8 @@ Description:
speed of 1000Mbps of the named network device.
Setting this value also immediately changes the LED state.
Present only if the named network device supports 1000Mbps link speed.
What: /sys/class/leds/<led>/link_2500
Date: Nov 2023
KernelVersion: 6.8
@@ -127,6 +133,8 @@ Description:
speed of 2500Mbps of the named network device.
Setting this value also immediately changes the LED state.
Present only if the named network device supports 2500Mbps link speed.
What: /sys/class/leds/<led>/link_5000
Date: Nov 2023
KernelVersion: 6.8
@@ -140,6 +148,8 @@ Description:
speed of 5000Mbps of the named network device.
Setting this value also immediately changes the LED state.
Present only if the named network device supports 5000Mbps link speed.
What: /sys/class/leds/<led>/link_10000
Date: Nov 2023
KernelVersion: 6.8
@@ -153,6 +163,8 @@ Description:
speed of 10000Mbps of the named network device.
Setting this value also immediately changes the LED state.
Present only if the named network device supports 10000Mbps link speed.
What: /sys/class/leds/<led>/half_duplex
Date: Jun 2023
KernelVersion: 6.5

14
Documentation/ABI/testing/sysfs-class-led-trigger-tty

@@ -1,11 +1,11 @@
What: /sys/class/leds/<led>/ttyname
What: /sys/class/leds/<tty_led>/ttyname
Date: Dec 2020
KernelVersion: 5.10
Contact: linux-leds@vger.kernel.org
Description:
Specifies the tty device name of the triggering tty
What: /sys/class/leds/<led>/rx
What: /sys/class/leds/<tty_led>/rx
Date: February 2024
KernelVersion: 6.8
Description:
@@ -13,7 +13,7 @@ Description:
If set to 0, the LED will not blink on reception.
If set to 1 (default), the LED will blink on reception.
What: /sys/class/leds/<led>/tx
What: /sys/class/leds/<tty_led>/tx
Date: February 2024
KernelVersion: 6.8
Description:
@@ -21,7 +21,7 @@ Description:
If set to 0, the LED will not blink on transmission.
If set to 1 (default), the LED will blink on transmission.
What: /sys/class/leds/<led>/cts
What: /sys/class/leds/<tty_led>/cts
Date: February 2024
KernelVersion: 6.8
Description:
@@ -31,7 +31,7 @@ Description:
If set to 0 (default), the LED will not evaluate CTS.
If set to 1, the LED will evaluate CTS.
What: /sys/class/leds/<led>/dsr
What: /sys/class/leds/<tty_led>/dsr
Date: February 2024
KernelVersion: 6.8
Description:
@@ -41,7 +41,7 @@ Description:
If set to 0 (default), the LED will not evaluate DSR.
If set to 1, the LED will evaluate DSR.
What: /sys/class/leds/<led>/dcd
What: /sys/class/leds/<tty_led>/dcd
Date: February 2024
KernelVersion: 6.8
Description:
@@ -51,7 +51,7 @@ Description:
If set to 0 (default), the LED will not evaluate CAR (DCD).
If set to 1, the LED will evaluate CAR (DCD).
What: /sys/class/leds/<led>/rng
What: /sys/class/leds/<tty_led>/rng
Date: February 2024
KernelVersion: 6.8
Description:

23
Documentation/ABI/testing/sysfs-class-net-queues

@@ -96,3 +96,26 @@ Description:
Indicates the absolute minimum limit of bytes allowed to be
queued on this network device transmit queue. Default value is
0.
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_thrs
Date: Jan 2024
KernelVersion: 6.9
Contact: netdev@vger.kernel.org
Description:
Tx completion stall detection threshold in ms. The kernel
guarantees detection of all stalls longer than this threshold,
but may also detect stalls longer than half of the threshold.
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_cnt
Date: Jan 2024
KernelVersion: 6.9
Contact: netdev@vger.kernel.org
Description:
Number of detected Tx completion stalls.
What: /sys/class/net/<iface>/queues/tx-<queue>/byte_queue_limits/stall_max
Date: Jan 2024
KernelVersion: 6.9
Contact: netdev@vger.kernel.org
Description:
Longest detected Tx completion stall. Write 0 to clear.
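Putting the three attributes together, a hedged monitoring sketch for
queue tx-0 of a hypothetical interface eth0::
# cd /sys/class/net/eth0/queues/tx-0/byte_queue_limits
# echo 100 > stall_thrs
# cat stall_cnt
# echo 0 > stall_max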

6
Documentation/ABI/testing/sysfs-class-usb_role

@@ -19,3 +19,9 @@ Description:
- none
- host
- device
What: /sys/class/usb_role/<switch>/connector
Date: Feb 2024
Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Description:
Optional symlink to the USB Type-C connector.

1
Documentation/ABI/testing/sysfs-devices-system-cpu

@@ -516,6 +516,7 @@ What: /sys/devices/system/cpu/vulnerabilities
/sys/devices/system/cpu/vulnerabilities/mds
/sys/devices/system/cpu/vulnerabilities/meltdown
/sys/devices/system/cpu/vulnerabilities/mmio_stale_data
/sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
/sys/devices/system/cpu/vulnerabilities/retbleed
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
/sys/devices/system/cpu/vulnerabilities/spectre_v1

20
Documentation/ABI/testing/sysfs-driver-qat

@@ -141,3 +141,23 @@ Description:
64
This attribute is only available for qat_4xxx devices.
What: /sys/bus/pci/devices/<BDF>/qat/auto_reset
Date: March 2024
KernelVersion: 6.8
Contact: qat-linux@intel.com
Description: (RW) Reports the current state of the autoreset feature
for a QAT device.
Write to the attribute to enable or disable device auto reset.
Device auto reset is disabled by default.
The values are:
* 1/Yy/on: auto reset enabled. If the device encounters an
unrecoverable error, it will be reset automatically.
* 0/Nn/off: auto reset disabled. If the device encounters an
unrecoverable error, it will not be reset.
This attribute is only available for qat_4xxx devices.
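For example (0000:6b:00.0 is a hypothetical BDF)::
# echo on > /sys/bus/pci/devices/0000:6b:00.0/qat/auto_reset
# cat /sys/bus/pci/devices/0000:6b:00.0/qat/auto_reset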

52
Documentation/ABI/testing/sysfs-fs-f2fs

@@ -205,7 +205,7 @@ Description: Controls the idle timing of system, if there is no FS operation
What: /sys/fs/f2fs/<disk>/discard_idle_interval
Date: September 2018
Contact: "Chao Yu" <yuchao0@huawei.com>
Contact: "Sahitya Tummala" <stummala@codeaurora.org>
Contact: "Sahitya Tummala" <quic_stummala@quicinc.com>
Description: Controls the idle timing of discard thread given
this time interval.
Default is 5 secs.
@@ -213,7 +213,7 @@ Description: Controls the idle timing of discard thread given
What: /sys/fs/f2fs/<disk>/gc_idle_interval
Date: September 2018
Contact: "Chao Yu" <yuchao0@huawei.com>
Contact: "Sahitya Tummala" <stummala@codeaurora.org>
Contact: "Sahitya Tummala" <quic_stummala@quicinc.com>
Description: Controls the idle timing for gc path. Set to 5 seconds by default.
What: /sys/fs/f2fs/<disk>/iostat_enable
@@ -701,29 +701,31 @@ Description: Support configuring fault injection type, should be
enabled with the fault_injection option; fault type values
are shown below, and single or combined types are supported.
=================== ===========
Type_Name Type_Value
=================== ===========
FAULT_KMALLOC 0x000000001
FAULT_KVMALLOC 0x000000002
FAULT_PAGE_ALLOC 0x000000004
FAULT_PAGE_GET 0x000000008
FAULT_ALLOC_BIO 0x000000010 (obsolete)
FAULT_ALLOC_NID 0x000000020
FAULT_ORPHAN 0x000000040
FAULT_BLOCK 0x000000080
FAULT_DIR_DEPTH 0x000000100
FAULT_EVICT_INODE 0x000000200
FAULT_TRUNCATE 0x000000400
FAULT_READ_IO 0x000000800
FAULT_CHECKPOINT 0x000001000
FAULT_DISCARD 0x000002000
FAULT_WRITE_IO 0x000004000
FAULT_SLAB_ALLOC 0x000008000
FAULT_DQUOT_INIT 0x000010000
FAULT_LOCK_OP 0x000020000
FAULT_BLKADDR 0x000040000
=================== ===========
=========================== ===========
Type_Name Type_Value
=========================== ===========
FAULT_KMALLOC 0x000000001
FAULT_KVMALLOC 0x000000002
FAULT_PAGE_ALLOC 0x000000004
FAULT_PAGE_GET 0x000000008
FAULT_ALLOC_BIO 0x000000010 (obsolete)
FAULT_ALLOC_NID 0x000000020
FAULT_ORPHAN 0x000000040
FAULT_BLOCK 0x000000080
FAULT_DIR_DEPTH 0x000000100
FAULT_EVICT_INODE 0x000000200
FAULT_TRUNCATE 0x000000400
FAULT_READ_IO 0x000000800
FAULT_CHECKPOINT 0x000001000
FAULT_DISCARD 0x000002000
FAULT_WRITE_IO 0x000004000
FAULT_SLAB_ALLOC 0x000008000
FAULT_DQUOT_INIT 0x000010000
FAULT_LOCK_OP 0x000020000
FAULT_BLKADDR_VALIDITY 0x000040000
FAULT_BLKADDR_CONSISTENCE 0x000080000
FAULT_NO_SEGMENT 0x000100000
=========================== ===========
What: /sys/fs/f2fs/<disk>/discard_io_aware_gran
Date: January 2023

11
Documentation/ABI/testing/sysfs-fs-virtiofs

@@ -0,0 +1,11 @@
What: /sys/fs/virtiofs/<n>/tag
Date: Feb 2024
Contact: virtio-fs@lists.linux.dev
Description:
[RO] The mount "tag" that can be used to mount this filesystem.
What: /sys/fs/virtiofs/<n>/device
Date: Feb 2024
Contact: virtio-fs@lists.linux.dev
Description:
Symlink to the virtio device that exports this filesystem.
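The tag feeds straight into mount(8); a minimal sketch, assuming instance
0 exports the tag myfs::
$ cat /sys/fs/virtiofs/0/tag
myfs
# mount -t virtiofs myfs /mnt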

6
Documentation/ABI/testing/sysfs-kernel-mm-cma

@@ -23,3 +23,9 @@ Date: Feb 2021
Contact: Minchan Kim <minchan@kernel.org>
Description:
the number of pages CMA API failed to allocate
What: /sys/kernel/mm/cma/<cma-heap-name>/release_pages_success
Date: Feb 2024
Contact: Anshuman Khandual <anshuman.khandual@arm.com>
Description:
the number of pages the CMA API succeeded in releasing

16
Documentation/ABI/testing/sysfs-kernel-mm-damon

@@ -34,7 +34,9 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
kdamond. Writing 'update_schemes_tried_bytes' to the file
updates only '.../tried_regions/total_bytes' files of this
kdamond. Writing 'clear_schemes_tried_regions' to the file
removes contents of the 'tried_regions' directory.
removes contents of the 'tried_regions' directory. Writing
'update_schemes_effective_quotas' to the file updates
'.../quotas/effective_bytes' files of this kdamond.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
Date: Mar 2022
@@ -208,6 +210,12 @@ Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the size
quota of the scheme in bytes.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/effective_bytes
Date: Feb 2024
Contact: SeongJae Park <sj@kernel.org>
Description: Reading from this file gets the effective size quota of the
scheme in bytes, which is adjusted for the time quota and goals.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/reset_interval_ms
Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>
@@ -221,6 +229,12 @@ Description: Writing a number 'N' to this file creates the number of
directories for setting automatic tuning of the scheme's
aggressiveness named '0' to 'N-1' under the goals/ directory.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_metric
Date: Feb 2024
Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the quota
auto-tuning goal metric.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_value
Date: Nov 2023
Contact: SeongJae Park <sj@kernel.org>
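A hedged sketch of refreshing and then reading the effective quota for
kdamond 0, context 0, scheme 0, using the state file and the
effective_bytes node described above::
# cd /sys/kernel/mm/damon/admin/kdamonds/0
# echo update_schemes_effective_quotas > state
# cat contexts/0/schemes/0/quotas/effective_bytes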

4
Documentation/ABI/testing/sysfs-kernel-mm-mempolicy

@@ -0,0 +1,4 @@
What: /sys/kernel/mm/mempolicy/
Date: January 2024
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Interface for Mempolicy

25
Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

@@ -0,0 +1,25 @@
What: /sys/kernel/mm/mempolicy/weighted_interleave/
Date: January 2024
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Configuration Interface for the Weighted Interleave policy
What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
Date: January 2024
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Weight configuration interface for nodeN
The interleave weight for a memory node (N). These weights are
utilized by tasks which have set their mempolicy to
MPOL_WEIGHTED_INTERLEAVE.
These weights only affect new allocations, and changes at runtime
will not cause migrations on already allocated pages.
The minimum weight for a node is always 1.
Minimum weight: 1
Maximum weight: 255
Writing an empty string or `0` will reset the weight to the
system default. The system default may be set by the kernel
or drivers at boot or during hotplug events.
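For example, biasing allocations three-to-one toward node0 over node1
(node numbers are illustrative; writing an empty string or 0 afterwards
resets a node to the system default)::
# echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
# echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node1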

5
Documentation/Makefile

@@ -111,7 +111,9 @@ $(YNL_INDEX): $(YNL_RST_FILES)
$(YNL_RST_DIR)/%.rst: $(YNL_YAML_DIR)/%.yaml $(YNL_TOOL)
$(Q)$(YNL_TOOL) -i $< -o $@
htmldocs: $(YNL_INDEX)
htmldocs texinfodocs latexdocs epubdocs xmldocs: $(YNL_INDEX)
htmldocs:
@$(srctree)/scripts/sphinx-pre-install --version-check
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))
@@ -176,6 +178,7 @@ refcheckdocs:
$(Q)cd $(srctree);scripts/documentation-file-ref-check
cleandocs:
$(Q)rm -f $(YNL_INDEX) $(YNL_RST_FILES)
$(Q)rm -rf $(BUILDDIR)
$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/userspace-api/media clean

32
Documentation/RCU/checklist.rst

@@ -68,7 +68,8 @@ over a rather long period of time, but improvements are always welcome!
rcu_read_lock_sched(), or by the appropriate update-side lock.
Explicit disabling of preemption (preempt_disable(), for example)
can serve as rcu_read_lock_sched(), but is less readable and
prevents lockdep from detecting locking issues.
prevents lockdep from detecting locking issues. Acquiring a
spinlock also enters an RCU read-side critical section.
Please note that you *cannot* rely on code known to be built
only in non-preemptible kernels. Such code can and will break,
@@ -382,16 +383,17 @@ over a rather long period of time, but improvements are always welcome!
must use whatever locking or other synchronization is required
to safely access and/or modify that data structure.
Do not assume that RCU callbacks will be executed on the same
CPU that executed the corresponding call_rcu() or call_srcu().
For example, if a given CPU goes offline while having an RCU
callback pending, then that RCU callback will execute on some
surviving CPU. (If this was not the case, a self-spawning RCU
callback would prevent the victim CPU from ever going offline.)
Furthermore, CPUs designated by rcu_nocbs= might well *always*
have their RCU callbacks executed on some other CPUs, in fact,
for some real-time workloads, this is the whole point of using
the rcu_nocbs= kernel boot parameter.
Do not assume that RCU callbacks will be executed on
the same CPU that executed the corresponding call_rcu(),
call_srcu(), call_rcu_tasks(), call_rcu_tasks_rude(), or
call_rcu_tasks_trace(). For example, if a given CPU goes offline
while having an RCU callback pending, then that RCU callback
will execute on some surviving CPU. (If this was not the case,
a self-spawning RCU callback would prevent the victim CPU from
ever going offline.) Furthermore, CPUs designated by rcu_nocbs=
might well *always* have their RCU callbacks executed on some
other CPUs, in fact, for some real-time workloads, this is the
whole point of using the rcu_nocbs= kernel boot parameter.
In addition, do not assume that callbacks queued in a given order
will be invoked in that order, even if they all are queued on the
@@ -444,7 +446,7 @@ over a rather long period of time, but improvements are always welcome!
real-time workloads than is synchronize_rcu_expedited().
It is also permissible to sleep in RCU Tasks Trace read-side
critical, which are delimited by rcu_read_lock_trace() and
critical sections, which are delimited by rcu_read_lock_trace() and
rcu_read_unlock_trace(). However, this is a specialized flavor
of RCU, and you should not use it without first checking with
its current users. In most cases, you should instead use SRCU.
@@ -490,6 +492,12 @@ over a rather long period of time, but improvements are always welcome!
since the last time that you passed that same object to
call_rcu() (or friends).
CONFIG_RCU_STRICT_GRACE_PERIOD:
combine with KASAN to check for pointers leaked out
of RCU read-side critical sections. This Kconfig
option is tough on both performance and scalability,
and so is limited to four-CPU systems.
__rcu sparse checks:
tag the pointer to the RCU-protected data structure
with __rcu, and sparse will warn you if you access that
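A minimal sketch of that annotation (``struct foo``, ``gp``, and ``newp`` are illustrative names)::

    static struct foo __rcu *gp;    /* sparse now tracks this pointer */

    rcu_assign_pointer(gp, newp);   /* updater side */
    p = rcu_dereference(gp);        /* reader side, within a
                                     * read-side critical section */
    /* A plain "p = gp;" would draw a sparse warning. */
</grinsert>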

5
Documentation/RCU/rcu_dereference.rst

@ -408,7 +408,10 @@ member of the rcu_dereference() to use in various situations:
RCU flavors, an RCU read-side critical section is entered
using rcu_read_lock(), anything that disables bottom halves,
anything that disables interrupts, or anything that disables
preemption.
preemption. Please note that spinlock critical sections
are also implied RCU read-side critical sections, even when
they are preemptible, as they are in kernels built with
CONFIG_PREEMPT_RT=y.
2. If the access might be within an RCU read-side critical section
on the one hand, or protected by (say) my_lock on the other,
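A sketch of the spinlock form noted in situation 1 (``mylock``, ``gp``, and ``do_something()`` are illustrative names)::

    struct foo *p;

    spin_lock(&mylock);
    /* Per the rule above, this spinlock critical section is an
     * implied RCU read-side critical section, even where spinlock
     * critical sections are preemptible (CONFIG_PREEMPT_RT=y). */
    p = rcu_dereference(gp);
    if (p)
        do_something(p->a);
    spin_unlock(&mylock);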

2
Documentation/RCU/torture.rst

@ -318,7 +318,7 @@ Suppose that a previous kvm.sh run left its output in this directory::
tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28
Then this run can be re-run without rebuilding as follow:
Then this run can be re-run without rebuilding as follows::
kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28

19
Documentation/RCU/whatisRCU.rst

@ -172,14 +172,25 @@ rcu_read_lock()
critical section. Reference counts may be used in conjunction
with RCU to maintain longer-term references to data structures.
Note that anything that disables bottom halves, preemption,
or interrupts also enters an RCU read-side critical section.
Acquiring a spinlock also enters an RCU read-side critical
section, even for spinlocks that do not disable preemption,
as is the case in kernels built with CONFIG_PREEMPT_RT=y.
Sleeplocks do *not* enter RCU read-side critical sections.
rcu_read_unlock()
^^^^^^^^^^^^^^^^^
void rcu_read_unlock(void);
This temporal primitive is used by a reader to inform the
reclaimer that the reader is exiting an RCU read-side critical
section. Note that RCU read-side critical sections may be nested
and/or overlapping.
section. Anything that enables bottom halves, preemption,
or interrupts also exits an RCU read-side critical section.
Releasing a spinlock also exits an RCU read-side critical section.
Note that RCU read-side critical sections may be nested and/or
overlapping.
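A minimal reader sketch showing the pairing (``gp`` and ``do_something()`` are illustrative, not part of the API being described)::

    struct foo *p;

    rcu_read_lock();
    p = rcu_dereference(gp);
    if (p)
        do_something(p->a);
    rcu_read_unlock();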
synchronize_rcu()
^^^^^^^^^^^^^^^^^
@ -952,8 +963,8 @@ unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
initialized after each and every call to kmem_cache_alloc(), which renders
reference-free spinlock acquisition completely unsafe. Therefore, when
using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
(Those willing to use a kmem_cache constructor may also use locking,
including cache-friendly sequence locking.)
(Those willing to initialize their locks in a kmem_cache constructor
may also use locking, including cache-friendly sequence locking.)
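A hedged sketch of that constructor approach (``struct foo`` and the cache name are illustrative)::

    static void foo_ctor(void *obj)
    {
        struct foo *fp = obj;

        /* Runs when the slab page of objects is created, not on
         * every kmem_cache_alloc(), so the lock remains valid
         * across SLAB_TYPESAFE_BY_RCU reuse of the object. */
        spin_lock_init(&fp->lock);
    }

    cache = kmem_cache_create("foo", sizeof(struct foo), 0,
                              SLAB_TYPESAFE_BY_RCU, foo_ctor);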
With traditional reference counting -- such as that implemented by the
kref library in Linux -- there is typically code that runs when the last

24
Documentation/admin-guide/RAS/address-translation.rst

@ -0,0 +1,24 @@
.. SPDX-License-Identifier: GPL-2.0
Address translation
===================
x86 AMD
-------
Zen-based AMD systems include a Data Fabric that manages the layout of
physical memory. Devices attached to the Fabric, like memory controllers,
I/O, etc., may not have a complete view of the system physical memory map.
These devices may provide a "normalized", i.e. device physical, address
when reporting memory errors. Normalized addresses must be translated to
a system physical address for the kernel to act on the memory.
AMD Address Translation Library (CONFIG_AMD_ATL) provides translation for
this case.
Glossary of acronyms used in address translation for Zen-based systems
* CCM = Cache Coherent Moderator
* COD = Cluster-on-Die
* COH_ST = Coherent Station
* DF = Data Fabric

11
Documentation/RAS/ras.rst → Documentation/admin-guide/RAS/error-decoding.rst

@ -1,15 +1,10 @@
.. SPDX-License-Identifier: GPL-2.0
Reliability, Availability and Serviceability features
=====================================================
This documents different aspects of the RAS functionality present in the
kernel.
Error decoding
---------------
==============
* x86
x86
---
Error decoding on AMD systems should be done using the rasdaemon tool:
https://github.com/mchehab/rasdaemon/

7
Documentation/admin-guide/RAS/index.rst

@ -0,0 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
.. toctree::
:maxdepth: 2
main
error-decoding
address-translation

10
Documentation/admin-guide/ras.rst → Documentation/admin-guide/RAS/main.rst

@ -1,8 +1,12 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
============================================
Reliability, Availability and Serviceability
============================================
==================================================
Reliability, Availability and Serviceability (RAS)
==================================================
This documents different aspects of the RAS functionality present in the
kernel.
RAS concepts
************

69
Documentation/admin-guide/README.rst

@ -262,9 +262,11 @@ Compiling the kernel
- Make sure you have at least gcc 5.1 available.
For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
- Do a ``make`` to create a compressed kernel image. It is also
possible to do ``make install`` if you have lilo installed to suit the
kernel makefiles, but you may want to check your particular lilo setup first.
- Do a ``make`` to create a compressed kernel image. It is also possible to do
``make install`` if you have lilo installed or if your distribution has an
install script recognized by the kernel's installer. Most popular
distributions will have a recognized install script. You may want to
check your distribution's setup first.
To do the actual install, you have to be root, but none of the normal
build should require that. Don't take the name of root in vain.
@ -301,32 +303,51 @@ Compiling the kernel
image (e.g. .../linux/arch/x86/boot/bzImage after compilation)
to the place where your regular bootable kernel is found.
- Booting a kernel directly from a floppy without the assistance of a
bootloader such as LILO, is no longer supported.
If you boot Linux from the hard drive, chances are you use LILO, which
uses the kernel image as specified in the file /etc/lilo.conf. The
kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
/boot/bzImage. To use the new kernel, save a copy of the old image
and copy the new image over the old one. Then, you MUST RERUN LILO
to update the loading map! If you don't, you won't be able to boot
the new kernel image.
Reinstalling LILO is usually a matter of running /sbin/lilo.
You may wish to edit /etc/lilo.conf to specify an entry for your
old kernel image (say, /vmlinux.old) in case the new one does not
work. See the LILO docs for more information.
After reinstalling LILO, you should be all set. Shutdown the system,
- Booting a kernel directly from a storage device without the assistance
of a bootloader such as LILO or GRUB is no longer supported on BIOS
(non-EFI) systems. On UEFI/EFI systems, however, you can use EFISTUB,
which allows the motherboard to boot directly to the kernel.
On modern workstations and desktops, it's generally recommended to use a
bootloader as difficulties can arise with multiple kernels and secure boot.
For more details on EFISTUB,
see "Documentation/admin-guide/efi-stub.rst".
- It's important to note that as of 2016 LILO (LInux LOader) is no longer in
active development, though as it was extremely popular, it often comes up
in documentation. Popular alternatives include GRUB2, rEFInd, Syslinux,
systemd-boot, or EFISTUB. For various reasons, it's not recommended to use
software that's no longer in active development.
- Chances are your distribution includes an install script and running
``make install`` will be all that's needed. Should that not be the case
you'll have to identify your bootloader and reference its documentation or
configure your EFI.
Legacy LILO Instructions
------------------------
- If you use LILO the kernel images are specified in the file /etc/lilo.conf.
The kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
/boot/bzImage. To use the new kernel, save a copy of the old image and copy
the new image over the old one. Then, you MUST RERUN LILO to update the
loading map! If you don't, you won't be able to boot the new kernel image.
- Reinstalling LILO is usually a matter of running /sbin/lilo. You may wish
to edit /etc/lilo.conf to specify an entry for your old kernel image
(say, /vmlinux.old) in case the new one does not work. See the LILO docs
for more information.
- After reinstalling LILO, you should be all set. Shut down the system,
reboot, and enjoy!
If you ever need to change the default root device, video mode,
etc. in the kernel image, use your bootloader's boot options
where appropriate. No need to recompile the kernel to change
these parameters.
- If you ever need to change the default root device, video mode, etc. in the
kernel image, use your bootloader's boot options where appropriate. No need
to recompile the kernel to change these parameters.
- Reboot with the new kernel and enjoy.
If something goes wrong
-----------------------

2
Documentation/admin-guide/cgroup-v1/cpusets.rst

@ -179,7 +179,7 @@ files describing that cpuset:
- cpuset.mem_hardwall flag: is memory allocation hardwalled
- cpuset.memory_pressure: measure of how much paging pressure in cpuset
- cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- cpuset.memory_spread_slab flag: OBSOLETE. Doesn't have any function.
- cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- cpuset.sched_relax_domain_level: the searching range when migrating tasks

20
Documentation/admin-guide/cgroup-v1/hugetlb.rst

@ -65,10 +65,12 @@ files include::
1. Page fault accounting
hugetlb.<hugepagesize>.limit_in_bytes
hugetlb.<hugepagesize>.max_usage_in_bytes
hugetlb.<hugepagesize>.usage_in_bytes
hugetlb.<hugepagesize>.failcnt
::
hugetlb.<hugepagesize>.limit_in_bytes
hugetlb.<hugepagesize>.max_usage_in_bytes
hugetlb.<hugepagesize>.usage_in_bytes
hugetlb.<hugepagesize>.failcnt
The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
control group and enforces the limit during page fault. Since HugeTLB
@ -82,10 +84,12 @@ getting SIGBUS.
2. Reservation accounting
hugetlb.<hugepagesize>.rsvd.limit_in_bytes
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
hugetlb.<hugepagesize>.rsvd.failcnt
::
hugetlb.<hugepagesize>.rsvd.limit_in_bytes
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
hugetlb.<hugepagesize>.rsvd.failcnt
The HugeTLB controller allows users to limit the HugeTLB reservations per control
group and enforces the controller limit at reservation time and at the fault of

2
Documentation/admin-guide/cifs/introduction.rst

@ -28,7 +28,7 @@ Introduction
high performance safe distributed caching (leases/oplocks), optional packet
signing, large files, Unicode support and other internationalization
improvements. Since both Samba server and this filesystem client support the
CIFS Unix extensions, and the Linux client also suppors SMB3 POSIX extensions,
CIFS Unix extensions, and the Linux client also supports SMB3 POSIX extensions,
the combination can provide a reasonable alternative to other network and
cluster file systems for fileserving in some Linux to Linux environments,
not just in Linux to Windows (or Linux to Mac) environments.

2
Documentation/admin-guide/device-mapper/index.rst

@ -34,6 +34,8 @@ Device Mapper
switch
thin-provisioning
unstriped
vdo-design
vdo
verity
writecache
zero

633
Documentation/admin-guide/device-mapper/vdo-design.rst

@ -0,0 +1,633 @@
.. SPDX-License-Identifier: GPL-2.0-only
================
Design of dm-vdo
================
The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corp. starting in 2009. It was first released in 2013 and has been used in
production environments ever since. It was made open-source in 2017 after
Permabit was acquired by Red Hat. This document describes the design of
dm-vdo. For usage, see vdo.rst in the same directory as this file.
Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4K. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
reference a single 4K block of actual storage. It can achieve compression rates
of 14:1. All zero blocks consume no storage at all.
Theory of Operation
===================
The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first is to recognize duplicate data. The second is to avoid
storing multiple copies of those duplicates. Therefore, dm-vdo has two main
parts: a deduplication index (called UDS) that is used to discover
duplicate data, and a data store with a reference counted block map that
maps from logical block addresses to the actual storage location of the
data.
Zones and Threading
-------------------
Due to the complexity of data optimization, the number of metadata
structures involved in a single write operation to a vdo target is larger
than most other targets. Furthermore, because vdo must operate on small
block sizes in order to achieve good deduplication rates, acceptable
performance can only be achieved through parallelism. Therefore, vdo's
design attempts to be lock-free.
Most of a vdo's main data structures are designed to be easily divided into
"zones" such that any given bio must only access a single zone of any zoned
structure. Safety with minimal locking is achieved by ensuring that during
normal operation, each zone is assigned to a specific thread, and only that
thread will access the portion of the data structure in that zone.
Associated with each thread is a work queue. Each bio is associated with a
request object (the "data_vio") which will be added to a work queue when
the next phase of its operation requires access to the structures in the
zone associated with that queue.
Another way of thinking about this arrangement is that the work queue for
each zone has an implicit lock on the structures it manages for all its
operations, because vdo guarantees that no other thread will alter those
structures.
Although each structure is divided into zones, this division is not
reflected in the on-disk representation of each data structure. Therefore,
the number of zones for each structure, and hence the number of threads,
can be reconfigured each time a vdo target is started.
The Deduplication Index
-----------------------
In order to identify duplicate data efficiently, vdo was designed to
leverage some common characteristics of duplicate data. From empirical
observations, we gathered two key insights. The first is that in most data
sets with significant amounts of duplicate data, the duplicates tend to
have temporal locality. When a duplicate appears, it is more likely that
other duplicates will be detected, and that those duplicates will have been
written at about the same time. This is why the index keeps records in
temporal order. The second insight is that new data is more likely to
duplicate recent data than it is to duplicate older data and in general,
there are diminishing returns to looking further back in time. Therefore,
when the index is full, it should cull its oldest records to make space for
new ones. Another important idea behind the design of the index is that the
ultimate goal of deduplication is to reduce storage costs. Since there is a
trade-off between the storage saved and the resources expended to achieve
those savings, vdo does not attempt to find every last duplicate block. It
is sufficient to find and eliminate most of the redundancy.
Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate. In the most common case, this occurs
because it is too costly to update the index when a block is over-written
or discarded. Doing so would require either storing the block name along
with the blocks, which is difficult to do efficiently in block-based
storage, or reading and rehashing each block before overwriting it.
Inaccuracy can also result from a hash collision where two different blocks
have the same name. In practice, this is extremely unlikely, but because
vdo does not use a cryptographic hash, a malicious workload could be
constructed. Because of these inaccuracies, vdo treats the locations in the
index as hints, and reads each indicated block to verify that it is indeed
a duplicate before sharing the existing block with a new one.
Records are collected into groups called chapters. New records are added to
the newest chapter, called the open chapter. This chapter is stored in a
format optimized for adding and modifying records, and the content of the
open chapter is not finalized until it runs out of space for new records.
When the open chapter fills up, it is closed and a new open chapter is
created to collect new records.
Closing a chapter converts it to a different format which is optimized for
reading. The records are written to a series of record pages based on the
order in which they were received. This means that records with temporal
locality should be on a small number of pages, reducing the I/O required to
retrieve them. The chapter also compiles an index that indicates which
record page contains any given name. This index means that a request for a
name can determine exactly which record page may contain that record,
without having to load the entire chapter from storage. This index uses
only a subset of the block name as its key, so it cannot guarantee that an
index entry refers to the desired block name. It can only guarantee that if
there is a record for this name, it will be on the indicated page. Closed
chapters are read-only structures and their contents are never altered in
any way.
Once enough records have been written to fill up all the available index
space, the oldest chapter is removed to make space for new chapters. Any
time a request finds a matching record in the index, that record is copied
into the open chapter. This ensures that useful block names remain available
in the index, while unreferenced block names are forgotten over time.
In order to find records in older chapters, the index also maintains a
higher level structure called the volume index, which contains entries
mapping each block name to the chapter containing its newest record. This
mapping is updated as records for the block name are copied or updated,
ensuring that only the newest record for a given block name can be found.
An older record for a block name will no longer be found even though it has
not been deleted from its chapter. Like the chapter index, the volume index
uses only a subset of the block name as its key and can not definitively
say that a record exists for a name. It can only say which chapter would
contain the record if a record exists. The volume index is stored entirely
in memory and is saved to storage only when the vdo target is shut down.
From the viewpoint of a request for a particular block name, the request will
first look up the name in the volume index. This search will either indicate that
the name is new, or which chapter to search. If it returns a chapter, the
request looks up its name in the chapter index. This will indicate either
that the name is new, or which record page to search. Finally, if it is not
new, the request will look for its name in the indicated record page.
This process may require up to two page reads per request (one for the
chapter index page and one for the record page). However, recently
accessed pages are cached so that these page reads can be amortized across
many block name requests.
The volume index and the chapter indexes are implemented using a
memory-efficient structure called a delta index. Instead of storing the
entire block name (the key) for each entry, the entries are sorted by name
and only the difference between adjacent keys (the delta) is stored.
Because we expect the hashes to be randomly distributed, the size of the
deltas follows an exponential distribution. Because of this distribution,
the deltas are expressed using a Huffman code to take up even less space.
The entire sorted list of keys is called a delta list. This structure
allows the index to use many fewer bytes per entry than a traditional hash
table, but it is slightly more expensive to look up entries, because a
request must read every entry in a delta list to add up the deltas in order
to find the record it needs. The delta index reduces this lookup cost by
splitting its key space into many sub-lists, each starting at a fixed key
value, so that each individual list is short.
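A simplified sketch of the lookup idea (plain integers rather than Huffman-coded deltas; all names are illustrative, not dm-vdo functions)::

    /* Keys in a delta list are recovered by summing deltas. */
    static int delta_list_lookup(const unsigned int *delta, int n,
                                 unsigned int base_key,
                                 unsigned int target)
    {
        unsigned int key = base_key;
        int i;

        for (i = 0; i < n; i++) {
            key += delta[i];
            if (key == target)
                return i;       /* index of the record */
            if (key > target)
                return -1;      /* list is sorted, so absent */
        }
        return -1;
    }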
The default index size can hold 64 million records, corresponding to about
256GB of data. This means that the index can identify duplicate data if the
original data was written within the last 256GB of writes. This range is
called the deduplication window. If new writes duplicate data that is older
than that, the index will not be able to find it because the records of the
older data have been removed. This means that if an application writes a
200 GB file to a vdo target and then immediately writes it again, the two
copies will deduplicate perfectly. Doing the same with a 500 GB file will
result in no deduplication, because the beginning of the file will no
longer be in the index by the time the second write begins (assuming there
is no duplication within the file itself).
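As a rough check of that figure::

    64 million records x 4 KB per record ~= 256 GB of unique data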
If an application anticipates a data workload that will see useful
deduplication beyond the 256GB threshold, vdo can be configured to use a
larger index with a correspondingly larger deduplication window. (This
configuration can only be set when the target is created, not altered
later. It is important to consider the expected workload for a vdo target
before configuring it.) There are two ways to do this.
One way is to increase the memory size of the index, which also increases
the amount of backing storage required. Doubling the size of the index will
double the length of the deduplication window at the expense of doubling
the storage size and the memory requirements.
The other option is to enable sparse indexing. Sparse indexing increases
the deduplication window by a factor of 10, at the expense of also
increasing the storage size by a factor of 10. However, with sparse
indexing, the memory requirements do not increase. The trade-off is
slightly more computation per request and a slight decrease in the amount
of deduplication detected. For most workloads with significant amounts of
duplicate data, sparse indexing will detect 97-99% of the deduplication
that a standard index will detect.
The vio and data_vio Structures
-------------------------------
A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
fields and data to track vdo-specific information. A struct vio maintains a
pointer to a bio but also tracks other fields specific to the operation of
vdo. The vio is kept separate from its related bio because there are many
circumstances where vdo completes the bio but must continue to do work
related to deduplication or compression.
Metadata reads and writes, and other writes that originate within vdo, use
a struct vio directly. Application reads and writes use a larger structure
called a data_vio to track information about their progress. A struct
data_vio contains a struct vio and also includes several other fields
related to deduplication and other vdo features. The data_vio is the
primary unit of application work in vdo. Each data_vio proceeds through a
set of steps to handle the application data, after which it is reset and
returned to a pool of data_vios for reuse.
There is a fixed pool of 2048 data_vios. This number was chosen to bound
the amount of work that is required to recover from a crash. In addition,
benchmarks have indicated that increasing the size of the pool does not
significantly improve performance.
The Data Store
--------------
The data store is implemented by three main data structures, all of which
work in concert to reduce or amortize metadata updates across as many data
writes as possible.
*The Slab Depot*
Most of the vdo volume belongs to the slab depot. The depot contains a
collection of slabs. The slabs can be up to 32GB, and are divided into
three sections. Most of a slab consists of a linear sequence of 4K blocks.
These blocks are used either to store data, or to hold portions of the
block map (see below). In addition to the data blocks, each slab has a set
of reference counters, using 1 byte for each data block. Finally each slab
has a journal.
Reference updates are written to the slab journal. Slab journal blocks are
written out either when they are full, or when the recovery journal
requests they do so in order to allow the main recovery journal (see below)
to free up space. The slab journal is used both to ensure that the main
recovery journal can regularly free up space, and also to amortize the cost
of updating individual reference blocks. The reference counters are kept in
memory and are written out, a block at a time in oldest-dirtied-order, only
when there is a need to reclaim slab journal space. The write operations
are performed in the background as needed so they do not add latency to
particular I/O operations.
Each slab is independent of every other. They are assigned to "physical
zones" in round-robin fashion. If there are P physical zones, then slab n
is assigned to zone n mod P.
The slab depot maintains an additional small data structure, the "slab
summary," which is used to reduce the amount of work needed to come back
online after a crash. The slab summary maintains an entry for each slab
indicating whether or not the slab has ever been used, whether all of its
reference count updates have been persisted to storage, and approximately
how full it is. During recovery, each physical zone will attempt to recover
at least one slab, stopping whenever it has recovered a slab which has some
free blocks. Once each zone has some space, or has determined that none is
available, the target can resume normal operation in a degraded mode. Read
and write requests can be serviced, perhaps with degraded performance,
while the remainder of the dirty slabs are recovered.
*The Block Map*
The block map contains the logical to physical mapping. It can be thought
of as an array with one entry per logical address. Each entry is 5 bytes,
36 bits of which contain the physical block number which holds the data for
the given logical address. The other 4 bits are used to indicate the nature
of the mapping. Of the 16 possible states, one represents a logical address
which is unmapped (i.e. it has never been written, or has been discarded),
one represents an uncompressed block, and the other 14 states are used to
indicate that the mapped data is compressed, and which of the compression
slots in the compressed block contains the data for this logical address.
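As a sketch of how such a 5-byte entry can be unpacked once loaded into a 64-bit value (the exact on-disk bit layout and byte order are the driver's and are not specified here)::

    /* 40-bit entry: low 36 bits hold the physical block number,
     * the remaining 4 bits hold the mapping state. */
    static void unpack_block_map_entry(u64 entry, u64 *pbn,
                                       unsigned int *state)
    {
        *pbn = entry & ((1ULL << 36) - 1);
        *state = (unsigned int)(entry >> 36) & 0xf;
    }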
In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries. Each mapping page is actually
a leaf of a radix tree which consists of block map pages at each level.
There are 60 radix trees which are assigned to "logical zones" in round
robin fashion. (If there are L logical zones, tree n will belong to zone n
mod L.) At each level, the trees are interleaved, so logical addresses
0-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so
on. The interleaving is maintained all the way up to the 60 root nodes.
Choosing 60 trees results in an evenly distributed number of trees per zone
for a large number of possible logical zone counts. The storage for the 60
tree roots is allocated at format time. All other block map pages are
allocated out of the slabs as needed. This flexible allocation avoids the
need to pre-allocate space for the entire set of logical mappings and also
makes growing the logical size of a vdo relatively easy.
In operation, the block map maintains two caches. It is prohibitive to keep
the entire leaf level of the trees in memory, so each logical zone
maintains its own cache of leaf pages. The size of this cache is
configurable at target start time. The second cache is allocated at start
time, and is large enough to hold all the non-leaf pages of the entire
block map. This cache is populated as pages are needed.
*The Recovery Journal*
The recovery journal is used to amortize updates across the block map and
slab depot. Each write request causes an entry to be made in the journal.
Entries are either "data remappings" or "block map remappings." For a data
remapping, the journal records the logical address affected and its old and
new physical mappings. For a block map remapping, the journal records the
block map page number and the physical block allocated for it. Block map
pages are never reclaimed or repurposed, so the old mapping is always 0.
Each journal entry is an intent record summarizing the metadata updates
that are required for a data_vio. The recovery journal issues a flush
before each journal block write to ensure that the physical data for the
new block mappings in that block are stable on storage, and journal block
writes are all issued with the FUA bit set to ensure the recovery journal
entries themselves are stable. The journal entry and the data write it
represents must be stable on disk before the other metadata structures may
be updated to reflect the operation. These entries allow the vdo device to
reconstruct the logical to physical mappings after an unexpected
interruption such as a loss of power.
*Write Path*
All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
as vdo has done enough work to guarantee that it can complete the write
eventually. Generally, the data for acknowledged but unflushed write I/O
can be treated as though it is cached in memory. If an application
requires data to be stable on storage, it must issue a flush or write the
data with the FUA bit set like any other asynchronous I/O. Shutting down
the vdo target will also flush any remaining I/O.
Application write bios follow the steps outlined below.
1. A data_vio is obtained from the data_vio pool and associated with the
application bio. If there are no data_vios available, the incoming bio
will block until a data_vio is available. This provides back pressure
to the application. The data_vio pool is protected by a spin lock.
The newly acquired data_vio is reset and the bio's data is copied into
the data_vio if it is a write and the data is not all zeroes. The data
must be copied because the application bio can be acknowledged before
the data_vio processing is complete, which means later processing steps
will no longer have access to the application bio. The application bio
may also be smaller than 4K, in which case the data_vio will have
already read the underlying block and the data is instead copied over
the relevant portion of the larger block.
2. The data_vio places a claim (the "logical lock") on the logical address
of the bio. It is vital to prevent simultaneous modifications of the
same logical address, because deduplication involves sharing blocks.
This claim is implemented as an entry in a hashtable where the key is
the logical address and the value is a pointer to the data_vio
currently handling that address.
If a data_vio looks in the hashtable and finds that another data_vio is
already operating on that logical address, it waits until the previous
operation finishes. It also sends a message to inform the current
lock holder that it is waiting. Most notably, a new data_vio waiting
for a logical lock will flush the previous lock holder out of the
compression packer (step 8d) rather than allowing it to continue
waiting to be packed.
This stage requires the data_vio to get an implicit lock on the
appropriate logical zone to prevent concurrent modifications of the
hashtable. This implicit locking is handled by the zone divisions
described above.
3. The data_vio traverses the block map tree to ensure that all the
necessary internal tree nodes have been allocated, by trying to find
the leaf page for its logical address. If any interior tree page is
missing, it is allocated at this time out of the same physical storage
pool used to store application data.
a. If any page-node in the tree has not yet been allocated, it must be
allocated before the write can continue. This step requires the
data_vio to lock the page-node that needs to be allocated. This
lock, like the logical block lock in step 2, is a hashtable entry
that causes other data_vios to wait for the allocation process to
complete.
The implicit logical zone lock is released while the allocation is
happening, in order to allow other operations in the same logical
zone to proceed. The details of allocation are the same as in
step 4. Once a new node has been allocated, that node is added to
the tree using a similar process to adding a new data block mapping.
The data_vio journals the intent to add the new node to the block
map tree (step 10), updates the reference count of the new block
(step 11), and reacquires the implicit logical zone lock to add the
new mapping to the parent tree node (step 12). Once the tree is
updated, the data_vio proceeds down the tree. Any other data_vios
waiting on this allocation also proceed.
b. In the steady-state case, the block map tree nodes will already be
allocated, so the data_vio just traverses the tree until it finds
the required leaf node. The location of the mapping (the "block map
slot") is recorded in the data_vio so that later steps do not need
to traverse the tree again. The data_vio then releases the implicit
logical zone lock.
4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
made to allocate a free data block. This allocation ensures that the
data_vio can write its data somewhere even if deduplication and
compression are not possible. This stage gets an implicit lock on a
physical zone to search for free space within that zone.
The data_vio will search each slab in a zone until it finds a free
block or decides there are none. If the first zone has no free space,
it will proceed to search the next physical zone by taking the implicit
lock for that zone and releasing the previous one until it finds a
free block or runs out of zones to search. The data_vio will acquire a
struct pbn_lock (the "physical block lock") on the free block. The
struct pbn_lock also has several fields to record the various kinds of
claims that data_vios can have on physical blocks. The pbn_lock is
added to a hashtable like the logical block locks in step 2. This
hashtable is also covered by the implicit physical zone lock. The
reference count of the free block is updated to prevent any other
data_vio from considering it free. The reference counters are a
sub-component of the slab and are thus also covered by the implicit
physical zone lock.
5. If an allocation was obtained, the data_vio has all the resources it
needs to complete the write. The application bio can safely be
acknowledged at this point. The acknowledgment happens on a separate
thread to prevent the application callback from blocking other data_vio
operations.
If an allocation could not be obtained, the data_vio continues to
attempt to deduplicate or compress the data, but the bio is not
acknowledged because the vdo device may be out of space.
6. At this point vdo must determine where to store the application data.
The data_vio's data is hashed and the hash (the "record name") is
recorded in the data_vio.
7. The data_vio reserves or joins a struct hash_lock, which manages all of
the data_vios currently writing the same data. Active hash locks are
tracked in a hashtable similar to the way logical block locks are
tracked in step 2. This hashtable is covered by the implicit lock on
the hash zone.
If there is no existing hash lock for this data_vio's record_name, the
data_vio obtains a hash lock from the pool, adds it to the hashtable,
and sets itself as the new hash lock's "agent." The hash_lock pool is
also covered by the implicit hash zone lock. The hash lock agent will
do all the work to decide where the application data will be
written. If a hash lock for the data_vio's record_name already exists,
and the data_vio's data is the same as the agent's data, the new
data_vio will wait for the agent to complete its work and then share
its result.
In the rare case that a hash lock exists for the data_vio's hash but
the data does not match the hash lock's agent, the data_vio skips to
step 8h and attempts to write its data directly. This can happen if two
different data blocks produce the same hash, for example.
8. The hash lock agent attempts to deduplicate or compress its data with
the following steps.
a. The agent initializes and sends its embedded deduplication request
(struct uds_request) to the deduplication index. This does not
require the data_vio to get any locks because the index components
manage their own locking. The data_vio waits until it either gets a
response from the index or times out.
b. If the deduplication index returns advice, the data_vio attempts to
obtain a physical block lock on the indicated physical address, in
order to read the data and verify that it is the same as the
data_vio's data, and that it can accept more references. If the
physical address is already locked by another data_vio, the data at
that address may soon be overwritten so it is not safe to use the
address for deduplication.
c. If the data matches and the physical block can add references, the
agent and any other data_vios waiting on it will record this
physical block as their new physical address and proceed to step 9
to record their new mapping. If there are more data_vios in the hash
lock than there are references available, one of the remaining
data_vios becomes the new agent and continues to step 8d as if no
valid advice was returned.
d. If no usable duplicate block was found, the agent first checks that
it has an allocated physical block (from step 3) that it can write
to. If the agent does not have an allocation, some other data_vio in
the hash lock that does have an allocation takes over as agent. If
none of the data_vios have an allocated physical block, these writes
are out of space, so they proceed to step 13 for cleanup.
e. The agent attempts to compress its data. If the data does not
compress, the data_vio will continue to step 8h to write its data
directly.
If the compressed size is small enough, the agent will release the
implicit hash zone lock and go to the packer (struct packer) where
it will be placed in a bin (struct packer_bin) along with other
data_vios. All compression operations require the implicit lock on
the packer zone.
The packer can combine up to 14 compressed blocks in a single 4k
data block. Compression is only helpful if vdo can pack at least 2
data_vios into a single data block. This means that a data_vio may
wait in the packer for an arbitrarily long time for other data_vios
to fill out the compressed block. There is a mechanism for vdo to
evict waiting data_vios when continuing to wait would cause
problems. Circumstances causing an eviction include an application
flush, device shutdown, or a subsequent data_vio trying to overwrite
the same logical block address. A data_vio may also be evicted from
the packer if it cannot be paired with any other compressed block
before more compressible blocks need to use its bin. An evicted
data_vio will proceed to step 8h to write its data directly.
f. If the agent fills a packer bin, either because all 14 of its slots
are used or because it has no remaining space, it is written out
using the allocated physical block from one of its data_vios. Step
8d has already ensured that an allocation is available.
g. Each data_vio sets the compressed block as its new physical address.
The data_vio obtains an implicit lock on the physical zone and
acquires the struct pbn_lock for the compressed block, which is
modified to be a shared lock. Then it releases the implicit physical
zone lock and proceeds to step 8i.
h. Any data_vio evicted from the packer will have an allocation from
step 3. It will write its data to that allocated physical block.
i. After the data is written, if the data_vio is the agent of a hash
lock, it will reacquire the implicit hash zone lock and share its
physical address with as many other data_vios in the hash lock as
possible. Each data_vio will then proceed to step 9 to record its
new mapping.
j. If the agent actually wrote new data (whether compressed or not),
the deduplication index is updated to reflect the location of the
new data. The agent then releases the implicit hash zone lock.
9. The data_vio determines the previous mapping of the logical address.
There is a cache for block map leaf pages (the "block map cache"),
because there are usually too many block map leaf nodes to store
entirely in memory. If the desired leaf page is not in the cache, the
data_vio will reserve a slot in the cache and load the desired page
into it, possibly evicting an older cached page. The data_vio then
finds the current physical address for this logical address (the "old
physical mapping"), if any, and records it. This step requires a lock
on the block map cache structures, covered by the implicit logical zone
lock.
10. The data_vio makes an entry in the recovery journal containing the
logical block address, the old physical mapping, and the new physical
mapping. Making this journal entry requires holding the implicit
recovery journal lock. The data_vio will wait in the journal until all
recovery blocks up to the one containing its entry have been written
and flushed to ensure the transaction is stable on storage.
11. Once the recovery journal entry is stable, the data_vio makes two slab
journal entries: an increment entry for the new mapping, and a
decrement entry for the old mapping. These two operations each require
holding a lock on the affected physical slab, covered by its implicit
physical zone lock. For correctness during recovery, the slab journal
entries in any given slab journal must be in the same order as the
corresponding recovery journal entries. Therefore, if the two entries
are in different zones, they are made concurrently, and if they are in
the same zone, the increment is always made before the decrement in
order to avoid underflow. After each slab journal entry is made in
memory, the associated reference count is also updated in memory.
12. Once both of the reference count updates are done, the data_vio
acquires the implicit logical zone lock and updates the
logical-to-physical mapping in the block map to point to the new
physical block. At this point the write operation is complete.
13. If the data_vio has a hash lock, it acquires the implicit hash zone
lock and releases its hash lock to the pool.
The data_vio then acquires the implicit physical zone lock and releases
the struct pbn_lock it holds for its allocated block. If it had an
allocation that it did not use, it also sets the reference count for
that block back to zero to free it for use by subsequent data_vios.
The data_vio then acquires the implicit logical zone lock and releases
the logical block lock acquired in step 2.
The application bio is then acknowledged if it has not previously been
acknowledged, and the data_vio is returned to the pool.
*Read Path*
An application read bio follows a much simpler set of steps. It does steps
1 and 2 in the write path to obtain a data_vio and lock its logical
address. If there is already a write data_vio in progress for that logical
address that is guaranteed to complete, the read data_vio will copy the
data from the write data_vio and return it. Otherwise, it will look up the
logical-to-physical mapping by traversing the block map tree as in step 3,
and then read and possibly decompress the indicated data at the indicated
physical block address. A read data_vio will not allocate block map tree
nodes if they are missing. If the interior block map nodes do not exist
yet, the logical block map address must still be unmapped and the read
data_vio will return all zeroes. A read data_vio handles cleanup and
acknowledgment as in step 13, although it only needs to release the logical
lock and return itself to the pool.
*Small Writes*
All storage within vdo is managed as 4KB blocks, but it can accept writes
as small as 512 bytes. Processing a write that is smaller than 4K requires
a read-modify-write operation that reads the relevant 4K block, copies the
new data over the appropriate sectors of the block, and then launches a
write operation for the modified data block. The read and write stages of
this operation are nearly identical to the normal read and write
operations, and a single data_vio is used throughout this operation.
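Conceptually, the modify step splices the small write into the containing block (a hedged sketch; ``read_block()`` and ``write_block()`` are hypothetical 4 KB block I/O helpers, not dm-vdo functions)::

    static void small_write_rmw(u64 pbn, unsigned int first_sector,
                                unsigned int nr_sectors,
                                const char *new_data)
    {
        char buf[4096];

        read_block(pbn, buf);               /* read the whole 4K block */
        memcpy(buf + first_sector * 512,    /* splice in new sectors */
               new_data, nr_sectors * 512);
        write_block(pbn, buf);              /* write the modified block */
    }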
*Recovery*
When a vdo is restarted after a crash, it will attempt to recover from the
recovery journal. During the pre-resume phase of the next start, the
recovery journal is read. The increment portion of valid entries is played
into the block map. Next, valid entries are played, in order as required,
into the slab journals. Finally, each physical zone attempts to replay at
least one slab journal to reconstruct the reference counts of one slab.
Once each zone has some free space (or has determined that it has none),
the vdo comes back online, while the remainder of the slab journals are
used to reconstruct the rest of the reference counts in the background.
*Read-only Rebuild*
If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost. The vdo may be instructed to rebuild as best it can in order to
return to a writable state. However, this is never done automatically due
to the possibility that data has been lost. During a read-only rebuild, the
block map is recovered from the recovery journal as before. However, the
reference counts are not rebuilt from the slab journals. Instead, the
reference counts are zeroed, the entire block map is traversed, and the
reference counts are updated from the block mappings. While this may lose
some data, it ensures that the block map and reference counts are
consistent with each other. This allows vdo to resume normal operation and
accept further writes.

406
Documentation/admin-guide/device-mapper/vdo.rst

@ -0,0 +1,406 @@
.. SPDX-License-Identifier: GPL-2.0-only
dm-vdo
======
The dm-vdo (virtual data optimizer) device mapper target provides
block-level deduplication, compression, and thin provisioning. As a device
mapper target, it can add these features to the storage stack, compatible
with any file system. The vdo target does not protect against data
corruption, relying instead on integrity protection of the storage below
it. It is strongly recommended that lvm be used to manage vdo volumes. See
lvmvdo(7).
Userspace component
===================
Formatting a vdo volume requires the use of the 'vdoformat' tool, available
at:
https://github.com/dm-vdo/vdo/
In most cases, a vdo target will recover from a crash automatically the
next time it is started. In cases where it encountered an unrecoverable
error (either during normal operation or crash recovery) the target will
enter or come up in read-only mode. Because read-only mode is indicative of
data-loss, a positive action must be taken to bring vdo out of read-only
mode. The 'vdoforcerebuild' tool, available from the same repo, is used to
prepare a read-only vdo to exit read-only mode. After running this tool,
the vdo target will rebuild its metadata the next time it is
started. Although some data may be lost, the rebuilt vdo's metadata will be
internally consistent and the target will be writable again.
The repo also contains additional userspace tools which can be used to
inspect a vdo target's on-disk metadata. Fortunately, these tools are
rarely needed except by dm-vdo developers.
Metadata requirements
=====================
Each vdo volume reserves 3GB of space for metadata, or more depending on
its configuration. It is helpful to check that the space saved by
deduplication and compression is not cancelled out by the metadata
requirements. An estimation of the space saved for a specific dataset can
be computed with the vdo estimator tool, which is available at:
https://github.com/dm-vdo/vdoestimator/
Target interface
================
Table line
----------
::
<offset> <logical device size> vdo V4 <storage device>
<storage device size> <minimum I/O size> <block map cache size>
<block map era length> [optional arguments]
Required parameters:
offset:
The offset, in sectors, at which the vdo volume's logical
space begins.
logical device size:
The size of the device which the vdo volume will service,
in sectors. Must match the current logical size of the vdo
volume.
storage device:
The device holding the vdo volume's data and metadata.
storage device size:
The size of the device holding the vdo volume, as a number
of 4096-byte blocks. Must match the current size of the vdo
volume.
minimum I/O size:
The minimum I/O size for this vdo volume to accept, in
bytes. Valid values are 512 or 4096. The recommended value
is 4096.
block map cache size:
The size of the block map cache, as a number of 4096-byte
blocks. The minimum and recommended value is 32768 blocks.
If the logical thread count is non-zero, the cache size
must be at least 4096 blocks per logical thread.
block map era length:
The speed with which the block map cache writes out
modified block map pages. A smaller era length is likely to
reduce the amount of time spent rebuilding, at the cost of
increased block map writes during normal operation. The
maximum and recommended value is 16380; the minimum value
is 1.
Optional parameters:
--------------------
Some or all of these parameters may be specified as <key> <value> pairs.
Thread related parameters:
Different categories of work are assigned to separate thread groups, and
the number of threads in each group can be configured separately.
If <hash>, <logical>, and <physical> are all set to 0, the work of all
three thread types will be handled by a single thread. If any of these
values are non-zero, all of them must be non-zero.
ack:
The number of threads used to complete bios. Since
completing a bio calls an arbitrary completion function
outside the vdo volume, threads of this type allow the vdo
volume to continue processing requests even when bio
completion is slow. The default is 1.
bio:
The number of threads used to issue bios to the underlying
storage. Threads of this type allow the vdo volume to
continue processing requests even when bio submission is
slow. The default is 4.
bioRotationInterval:
The number of bios to enqueue on each bio thread before
switching to the next thread. The value must be greater
than 0 and not more than 1024; the default is 64.
cpu:
The number of threads used to do CPU-intensive work, such
as hashing and compression. The default is 1.
hash:
The number of threads used to manage data comparisons for
deduplication based on the hash value of data blocks. The
default is 0.
logical:
The number of threads used to manage caching and locking
based on the logical address of incoming bios. The default
is 0; the maximum is 60.
physical:
The number of threads used to manage administration of the
underlying storage device. At format time, a slab size for
the vdo is chosen; the vdo storage device must be large
enough to have at least 1 slab per physical thread. The
default is 0; the maximum is 16.
Miscellaneous parameters:
maxDiscard:
The maximum size of discard bio accepted, in 4096-byte
blocks. I/O requests to a vdo volume are normally split
into 4096-byte blocks, and processed up to 2048 at a time.
However, discard requests to a vdo volume can be
automatically split to a larger size, up to <maxDiscard>
4096-byte blocks in a single bio, and are limited to 1500
at a time. Increasing this value may provide better overall
performance, at the cost of increased latency for the
individual discard requests. The default and minimum is 1;
the maximum is UINT_MAX / 4096.
deduplication:
Whether deduplication is enabled. The default is 'on'; the
acceptable values are 'on' and 'off'.
compression:
Whether compression is enabled. The default is 'off'; the
acceptable values are 'on' and 'off'.
Device modification
-------------------
A modified table may be loaded into a running, non-suspended vdo volume.
The modifications will take effect when the device is next resumed. The
modifiable parameters are <logical device size>, <physical device size>,
<maxDiscard>, <compression>, and <deduplication>.
If the logical device size or physical device size are changed, upon
successful resume vdo will store the new values and require them on future
startups. These two parameters may not be decreased. The logical device
size may not exceed 4 PB. The physical device size must increase by at
least 32832 4096-byte blocks if at all, and must not exceed the size of the
underlying storage device. Additionally, when formatting the vdo device, a
slab size is chosen: the physical device size may never increase above the
size which provides 8192 slabs, and each increase must be large enough to
add at least one new slab.
Examples:
Start a previously-formatted vdo volume with 1 GB logical space and 1 GB
physical space, storing to /dev/dm-1 which has more than 1 GB of space.
::
dmsetup create vdo0 --table \
"0 2097152 vdo V4 /dev/dm-1 262144 4096 32768 16380"
Grow the logical size to 4 GB.
::
dmsetup reload vdo0 --table \
"0 8388608 vdo V4 /dev/dm-1 262144 4096 32768 16380"
dmsetup resume vdo0
Grow the physical size to 2 GB.
::
dmsetup reload vdo0 --table \
"0 8388608 vdo V4 /dev/dm-1 524288 4096 32768 16380"
dmsetup resume vdo0
Grow the physical size by 1 GB more and increase max discard sectors.
::
dmsetup reload vdo0 --table \
"0 10485760 vdo V4 /dev/dm-1 786432 4096 32768 16380 maxDiscard 8"
dmsetup resume vdo0
Stop the vdo volume.
::
dmsetup remove vdo0
Start the vdo volume again. Note that the logical and physical device sizes
must still match, but other parameters can change.
::
dmsetup create vdo1 --table \
"0 10485760 vdo V4 /dev/dm-1 786432 512 65550 5000 hash 1 logical 3 physical 2"
Messages
--------
All vdo devices accept messages in the form:
::
dmsetup message <target-name> 0 <message-name> <message-parameters>
The messages are:
stats:
Outputs the current view of the vdo statistics. Mostly used
by the vdostats userspace program to interpret the output
buffer.
dump:
Dumps many internal structures to the system log. This is
not always safe to run, so it should only be used to debug
a hung vdo. Optional parameters to specify structures to
dump are:
viopool: The pool of I/O requests for incoming bios
pools: A synonym of 'viopool'
vdo: Most of the structures managing on-disk data
queues: Basic information about each vdo thread
threads: A synonym of 'queues'
default: Equivalent to 'queues vdo'
all: All of the above.
dump-on-shutdown:
Perform a default dump next time vdo shuts down.
Status
------
::
<device> <operating mode> <in recovery> <index state>
<compression state> <physical blocks used> <total physical blocks>
device:
The name of the vdo volume.
operating mode:
The current operating mode of the vdo volume; values may be
'normal', 'recovering' (the volume has detected an issue
with its metadata and is attempting to repair itself), and
'read-only' (an error has occurred that forces the vdo
volume to only support read operations and not writes).
in recovery:
Whether the vdo volume is currently in recovery mode;
values may be 'recovering' or '-' which indicates not
recovering.
index state:
The current state of the deduplication index in the vdo
volume; values may be 'closed', 'closing', 'error',
'offline', 'online', 'opening', and 'unknown'.
compression state:
The current state of compression in the vdo volume; values
may be 'offline' and 'online'.
used physical blocks:
The number of physical blocks in use by the vdo volume.
total physical blocks:
The total number of physical blocks the vdo volume may use;
the difference between this value and the
<used physical blocks> is the number of blocks the vdo
volume has left before being full.
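For example, the status of the volume from the examples above could be read
with ``dmsetup status`` (a sketch; the numbers are hypothetical, and dmsetup
prefixes the output with the start sector, length, and target type)::
dmsetup status vdo0
0 2097152 vdo vdo0 normal - online online 17612 262144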
Memory Requirements
===================
A vdo target requires a fixed 38 MB of RAM along with the following amounts
that scale with the target:
- 1.15 MB of RAM for each 1 MB of configured block map cache size. The
block map cache requires a minimum of 150 MB.
- 1.6 MB of RAM for each 1 TB of logical space.
- 268 MB of RAM for each 1 TB of physical storage managed by the volume.
The deduplication index requires additional memory which scales with the
size of the deduplication window. For dense indexes, the index requires 1
GB of RAM per 1 TB of window. For sparse indexes, the index requires 1 GB
of RAM per 10 TB of window. The index configuration is set when the target
is formatted and may not be modified.
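As a rough worked example using the figures above: a volume managing 4 TB of
physical storage and 4 TB of logical space, with the minimum 150 MB block map
cache and a dense index with a 1 TB deduplication window, would need
approximately 38 + (1.15 * 150) + (1.6 * 4) + (268 * 4) MB plus 1 GB for the
index, or about 2.3 GB of RAM in total.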
Module Parameters
=================
The vdo driver has a numeric parameter 'log_level' which controls the
verbosity of logging from the driver. The default setting is 6
(LOGLEVEL_INFO and more severe messages).
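For example, verbosity could be raised to include debug messages either at
module load time or at runtime (a sketch; the sysfs path assumes the module
is loaded as ``dm_vdo``)::
modprobe dm-vdo log_level=7
echo 7 > /sys/module/dm_vdo/parameters/log_level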
Run-time Usage
==============
When using dm-vdo, it is important to be aware of the ways in which its
behavior differs from other storage targets.
- There is no guarantee that over-writes of existing blocks will succeed.
Because the underlying storage may be multiply referenced, over-writing
an existing block generally requires a vdo to have a free block
available.
- When blocks are no longer in use, sending a discard request for those
blocks lets the vdo release references for those blocks. If the vdo is
thinly provisioned, discarding unused blocks is essential to prevent the
target from running out of space. However, due to the sharing of
duplicate blocks, no discard request for any given logical block is
guaranteed to reclaim space.
- Assuming the underlying storage properly implements flush requests, vdo
is resilient against crashes; however, unflushed writes may or may not
persist after a crash.
- Each write to a vdo target entails a significant amount of processing.
However, much of the work is parallelizable. Therefore, vdo targets
achieve better throughput at higher I/O depths, and can support up to 2048
requests in parallel.
Tuning
======
The vdo device has many options, and it can be difficult to make optimal
choices without perfect knowledge of the workload. Additionally, most
configuration options must be set when a vdo target is started and cannot be
changed while the target is active; changing them requires shutting the target
down completely. Ideally, tuning with simulated
workloads should be performed before deploying vdo in production
environments.
The most important value to adjust is the block map cache size. In order to
service a request for any logical address, a vdo must load the portion of
the block map which holds the relevant mapping. These mappings are cached.
Performance will suffer when the working set does not fit in the cache. By
default, a vdo allocates 128 MB of metadata cache in RAM to support
efficient access to 100 GB of logical space at a time. It should be scaled
up proportionally for larger working sets.
The logical and physical thread counts should also be adjusted. A logical
thread controls a disjoint section of the block map, so additional logical
threads increase parallelism and can increase throughput. Physical threads
control a disjoint section of the data blocks, so additional physical
threads can also increase throughput. However, excess threads can waste
resources and increase contention.
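Because thread counts are not among the parameters that can be reloaded (see
the table modification rules above), trying a different configuration requires
stopping and restarting the target. A sketch, reusing the table format from
the examples above::
dmsetup remove vdo0
dmsetup create vdo0 --table \
"0 2097152 vdo V4 /dev/dm-1 262144 4096 32768 16380 logical 4 physical 2 hash 1"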
Bio submission threads control the parallelism involved in sending I/O to
the underlying storage; fewer threads mean there is more opportunity to
reorder I/O requests for performance benefit, but also that each I/O
request has to wait longer before being submitted.
Bio acknowledgment threads are used for finishing I/O requests. This is
done on dedicated threads since the amount of work required to execute a
bio's callback cannot be controlled by the vdo itself. Usually one thread
is sufficient, but additional threads may be beneficial, particularly when
bios have CPU-heavy callbacks.
CPU threads are used for hashing and for compression; in workloads with
compression enabled, more threads may result in higher throughput.
Hash threads are used to sort active requests by hash and determine whether
they should deduplicate; the most CPU-intensive action performed by these
threads is the comparison of 4096-byte data blocks. In most cases, a single
hash thread is sufficient.

35
Documentation/admin-guide/edid.rst

@ -24,37 +24,4 @@ restrictions later on.
As a remedy for such situations, the kernel configuration item
CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows providing an
individually prepared or corrected EDID data set in the /lib/firmware
directory from where it is loaded via the firmware interface. The code
(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for
commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
not contain code to create these data. In order to elucidate the origin
of the built-in binary EDID blobs and to facilitate the creation of
individual data for a specific misbehaving monitor, commented sources
and a Makefile environment are given here.
To create binary EDID and C source code files from the existing data
material, simply type "make" in tools/edid/.
If you want to create your own EDID file, copy the file 1024x768.S,
replace the settings with your own data and add a new target to the
Makefile. Please note that the EDID data structure expects the timing
values in a different form than the standard X11 format.
X11:
HTimings:
hdisp hsyncstart hsyncend htotal
VTimings:
vdisp vsyncstart vsyncend vtotal
EDID::
#define XPIX hdisp
#define XBLANK htotal-hdisp
#define XOFFSET hsyncstart-hdisp
#define XPULSE hsyncend-hsyncstart
#define YPIX vdisp
#define YBLANK vtotal-vdisp
#define YOFFSET vsyncstart-vdisp
#define YPULSE vsyncend-vsyncstart
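As a worked example, the standard VESA 1024x768@60 modeline (pixel clock
65 MHz, HTimings 1024 1048 1184 1344, VTimings 768 771 777 806) translates to
XPIX=1024, XBLANK=320, XOFFSET=24, XPULSE=136, YPIX=768, YBLANK=38, YOFFSET=3
and YPULSE=6.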
directory from where it is loaded via the firmware interface.

8
Documentation/admin-guide/gpio/gpio-mockup.rst

@ -3,6 +3,14 @@
GPIO Testing Driver
===================
.. note::
This module has been obsoleted by the more flexible gpio-sim module (see
gpio-sim.rst). New developments should use that API, and existing users are
encouraged to migrate as soon as possible.
This module will continue to be maintained but no new features will be
added.
The GPIO Testing Driver (gpio-mockup) provides a way to create simulated GPIO
chips for testing purposes. The lines exposed by these chips can be accessed
using the standard GPIO character device interface as well as manipulated

6
Documentation/admin-guide/gpio/index.rst

@ -1,16 +1,16 @@
.. SPDX-License-Identifier: GPL-2.0
====
gpio
GPIO
====
.. toctree::
:maxdepth: 1
Character Device Userspace API <../../userspace-api/gpio/chardev>
gpio-aggregator
sysfs
gpio-mockup
gpio-sim
Obsolete APIs <obsolete>
.. only:: subproject and html

13
Documentation/admin-guide/gpio/obsolete.rst

@ -0,0 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
==================
Obsolete GPIO APIs
==================
.. toctree::
:maxdepth: 1
Character Device Userspace API (v1) <../../userspace-api/gpio/chardev_v1>
Sysfs Interface <../../userspace-api/gpio/sysfs>
Mockup Testing Module <gpio-mockup>

1
Documentation/admin-guide/hw-vuln/index.rst

@ -21,3 +21,4 @@ are configurable at compile, boot or run time.
cross-thread-rsb
srso
gather_data_sampling
reg-file-data-sampling

104
Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst

@ -0,0 +1,104 @@
==================================
Register File Data Sampling (RFDS)
==================================
Register File Data Sampling (RFDS) is a microarchitectural vulnerability that
only affects Intel Atom parts (also branded as E-cores). RFDS may allow
a malicious actor to infer data values previously used in floating point
registers, vector registers, or integer registers. RFDS does not provide the
ability to choose which data is inferred. CVE-2023-28746 is assigned to RFDS.
Affected Processors
===================
Below is the list of affected Intel processors [#f1]_:
=================== ============
Common name Family_Model
=================== ============
ATOM_GOLDMONT 06_5CH
ATOM_GOLDMONT_D 06_5FH
ATOM_GOLDMONT_PLUS 06_7AH
ATOM_TREMONT_D 06_86H
ATOM_TREMONT 06_96H
ALDERLAKE 06_97H
ALDERLAKE_L 06_9AH
ATOM_TREMONT_L 06_9CH
RAPTORLAKE 06_B7H
RAPTORLAKE_P 06_BAH
ATOM_GRACEMONT 06_BEH
RAPTORLAKE_S 06_BFH
=================== ============
As an exception to this table, Intel Xeon E family parts ALDERLAKE(06_97H) and
RAPTORLAKE(06_B7H) codenamed Catlow are not affected. They are reported as
vulnerable in Linux because they share the same family/model with an affected
part. Unlike their affected counterparts, they do not enumerate RFDS_CLEAR or
CPUID.HYBRID. This information could be used to distinguish between the
affected and unaffected parts, but it is deemed not worth adding complexity as
the reporting is fixed automatically when these parts enumerate RFDS_NO.
Mitigation
==========
Intel released a microcode update that enables software to clear sensitive
information using the VERW instruction. Like MDS, RFDS deploys the same
mitigation strategy to force the CPU to clear the affected buffers before an
attacker can extract the secrets. This is achieved by using the otherwise
unused and obsolete VERW instruction in combination with a microcode update.
The microcode clears the affected CPU buffers when the VERW instruction is
executed.
Mitigation points
-----------------
VERW is executed by the kernel before returning to user space, and by KVM
before VMentry. None of the affected cores support SMT, so VERW is not required
at C-state transitions.
New bits in IA32_ARCH_CAPABILITIES
----------------------------------
Newer processors and microcode updates on existing affected processors add new
bits to the IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate the
vulnerability and the mitigation capability:
- Bit 27 - RFDS_NO - When set, processor is not affected by RFDS.
- Bit 28 - RFDS_CLEAR - When set, processor is affected by RFDS, and has the
microcode that clears the affected buffers on VERW execution.
Mitigation control on the kernel command line
---------------------------------------------
The kernel command line allows controlling the RFDS mitigation at boot time
with the parameter "reg_file_data_sampling=". The valid arguments are:
========== =================================================================
on If the CPU is vulnerable, enable mitigation; CPU buffer clearing
on exit to userspace and before entering a VM.
off Disables mitigation.
========== =================================================================
Mitigation default is selected by CONFIG_MITIGATION_RFDS.
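For example, the mitigation could be forced on regardless of the Kconfig
default by appending the parameter to the kernel command line (a sketch)::
reg_file_data_sampling=on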
Mitigation status information
-----------------------------
The Linux kernel provides a sysfs interface to enumerate the current
vulnerability status of the system: whether the system is vulnerable, and
which mitigations are active. The relevant sysfs file is:
/sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
The possible values in this file are:
.. list-table::
* - 'Not affected'
- The processor is not vulnerable
* - 'Vulnerable'
- The processor is vulnerable, but no mitigation enabled
* - 'Vulnerable: No microcode'
- The processor is vulnerable but microcode is not updated.
* - 'Mitigation: Clear Register File'
- The processor is vulnerable and the CPU buffer clearing mitigation is
enabled.
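For example, on an affected system running updated microcode, reading the file
might return (illustrative output)::
$ cat /sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
Mitigation: Clear Register File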
References
----------
.. [#f1] Affected Processors
https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html

8
Documentation/admin-guide/hw-vuln/spectre.rst

@ -473,8 +473,8 @@ Spectre variant 2
-mindirect-branch=thunk-extern -mindirect-branch-register options.
If the kernel is compiled with a Clang compiler, the compiler needs
to support -mretpoline-external-thunk option. The kernel config
CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with
the latest updated microcode.
CONFIG_MITIGATION_RETPOLINE needs to be turned on, and the CPU needs
to run with the latest updated microcode.
On Intel Skylake-era systems the mitigation covers most, but not all,
cases. See :ref:`[3] <spec_ref3>` for more details.
@ -609,8 +609,8 @@ kernel command line.
Selecting 'on' will, and 'auto' may, choose a
mitigation method at run time according to the
CPU, the available microcode, the setting of the
CONFIG_RETPOLINE configuration option, and the
compiler with which the kernel was built.
CONFIG_MITIGATION_RETPOLINE configuration option,
and the compiler with which the kernel was built.
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.

4
Documentation/admin-guide/index.rst

@ -1,3 +1,4 @@
=================================================
The Linux kernel user's and administrator's guide
=================================================
@ -37,6 +38,7 @@ problems and bugs in particular.
reporting-issues
reporting-regressions
quickly-build-trimmed-linux
verify-bugs-and-bisect-regressions
bug-hunting
bug-bisect
tainted-kernels
@ -122,7 +124,7 @@ configure specific aspects of kernel behavior to your liking.
pmf
pnp
rapidio
ras
RAS/index
rtc
serial-console
svga

7
Documentation/admin-guide/kdump/kdump.rst

@ -191,9 +191,7 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
CPU is enough for kdump kernel to dump vmcore on most of systems.
However, you can also specify nr_cpus=X to enable multiple processors
in kdump kernel. In this case, "disable_cpu_apicid=" is needed to
tell kdump kernel which cpu is 1st kernel's BSP. Please refer to
admin-guide/kernel-parameters.txt for more details.
in kdump kernel.
With CONFIG_SMP=n, the above things are not related.
@ -454,8 +452,7 @@ Notes on loading the dump-capture kernel:
to use multi-thread programs with it, such as parallel dump feature of
makedumpfile. Otherwise, the multi-thread program may have a great
performance degradation. To enable multi-cpu support, you should bring up an
SMP dump-capture kernel and specify maxcpus/nr_cpus, disable_cpu_apicid=[X]
options while loading it.
SMP dump-capture kernel and specify maxcpus/nr_cpus options while loading it.
* For s390x there are two kdump modes: If a ELF header is specified with
the elfcorehdr= kernel parameter, it is used by the kdump kernel as it

8
Documentation/admin-guide/kdump/vmcoreinfo.rst

@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
the kernel start address. Used to convert a virtual address from the
direct kernel map to a physical address.
vmap_area_list
--------------
VMALLOC_START
-------------
Stores the virtual area list. makedumpfile gets the vmalloc start value
from this variable and its value is necessary for vmalloc translation.
Stores the base address of the vmalloc area. makedumpfile gets this value,
as it is necessary for vmalloc translation.
mem_map
-------

1
Documentation/admin-guide/kernel-parameters.rst

@ -108,6 +108,7 @@ is applicable::
CMA Contiguous Memory Area support is enabled.
DRM Direct Rendering Management support is enabled.
DYNAMIC_DEBUG Build in debug messages and enable them at runtime
EARLY Parameter processed too early to be embedded in initrd.
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
EFI EFI Partitioning (GPT) is enabled
EVM Extended Verification Module

686
Documentation/admin-guide/kernel-parameters.txt

File diff suppressed because it is too large Load Diff

7
Documentation/admin-guide/laptops/thinkpad-acpi.rst

@ -444,7 +444,9 @@ event code Key Notes
0x1008 0x07 FN+F8 IBM: toggle screen expand
Lenovo: configure UltraNav,
or toggle screen expand
or toggle screen expand.
On newer platforms (2024+)
replaced by 0x131f (see below)
0x1009 0x08 FN+F9 -
@ -504,6 +506,9 @@ event code Key Notes
0x1019 0x18 unknown
0x131f ... FN+F8 Platform Mode change.
Implemented in driver.
... ... ...
0x1020 0x1F unknown

12
Documentation/admin-guide/media/visl.rst

@ -49,6 +49,10 @@ Module parameters
visl_dprintk_frame_start, visl_dprintk_nframes, but controls the dumping of
buffer data through debugfs instead.
- tpg_verbose: Write extra information on each output frame to ease debugging
the API. When set to true, the output frames are not stable for a given input
as some information like pointers or queue status will be added to them.
What is the default use case for this driver?
---------------------------------------------
@ -57,8 +61,12 @@ This assumes that a working client is run against visl and that the ftrace and
OUTPUT buffer data is subsequently used to debug a work-in-progress
implementation.
Information on reference frames, their timestamps, the status of the OUTPUT and
CAPTURE queues and more can be read directly from the CAPTURE buffers.
Even though no video decoding is actually done, the output frames can be used
against a reference for a given input, except if tpg_verbose is set to true.
Depending on the tpg_verbose parameter value, information on reference frames,
their timestamps, the status of the OUTPUT and CAPTURE queues and more can be
read directly from the CAPTURE buffers.
Supported codecs
----------------

2
Documentation/admin-guide/media/vivid.rst

@ -60,7 +60,7 @@ all configurable using the following module options:
- node_types:
which devices should each driver instance create. An array of
hexadecimal values, one for each instance. The default is 0x1d3d.
hexadecimal values, one for each instance. The default is 0xe1d3d.
Each value is a bitmask with the following meaning:
- bit 0: Video Capture node

27
Documentation/admin-guide/mm/damon/reclaim.rst

@ -117,6 +117,33 @@ milliseconds.
1 second by default.
quota_mem_pressure_us
---------------------
Desired level of memory pressure-stall time in microseconds.
While keeping the caps that are set by the other quotas, DAMON_RECLAIM
automatically increases and decreases the effective level of the quota, aiming
for this level of memory pressure to be incurred. System-wide ``some`` memory
PSI in microseconds per quota reset interval (``quota_reset_interval_ms``) is
collected and compared to this value to see if the aim is satisfied. A value
of zero means disabling this auto-tuning feature.
Disabled by default.
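For example, aiming for 10 ms of ``some`` memory PSI per quota reset interval
could be configured as below (a sketch; the path assumes DAMON_RECLAIM's usual
module parameters directory)::
# echo 10000 > /sys/module/damon_reclaim/parameters/quota_mem_pressure_us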
quota_autotune_feedback
-----------------------
User-specifiable feedback for auto-tuning of the effective quota.
While keeping the caps that are set by the other quotas, DAMON_RECLAIM
automatically increases and decreases the effective level of the quota, aiming
to receive this feedback of value ``10,000`` from the user. DAMON_RECLAIM
assumes the feedback value and the quota are positively proportional. A value
of zero means disabling this auto-tuning feature.
Disabled by default.
wmarks_interval
---------------

158
Documentation/admin-guide/mm/damon/usage.rst

@ -83,10 +83,10 @@ comma (",").
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
│ │ │ │ │ │ │ │ age/min,max
│ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms
│ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
│ │ │ │ │ │ │ │ │ 0/target_value,current_value
│ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
│ │ │ │ │ │ │ │ 0/type,matching,memcg_id
@ -153,6 +153,9 @@ Users can write below commands for the kdamond to the ``state`` file.
- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme
action tried regions directory for each DAMON-based operation scheme of the
kdamond.
- ``update_schemes_effective_bytes``: Update the contents of
``effective_bytes`` files for each DAMON-based operation scheme of the
kdamond. For more details, refer to :ref:`quotas directory <sysfs_quotas>`.
If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
@ -180,19 +183,14 @@ In each context directory, two files (``avail_operations`` and ``operations``)
and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
exist.
DAMON supports multiple types of monitoring operations, including those for
virtual address space and the physical address space. You can get the list of
available monitoring operations set on the currently running kernel by reading
DAMON supports multiple types of :ref:`monitoring operations
<damon_design_configurable_operations_set>`, including those for virtual address
space and the physical address space. You can get the list of available
monitoring operations set on the currently running kernel by reading
``avail_operations`` file. Based on the kernel configuration, the file will
list some or all of below keywords.
- vaddr: Monitor virtual address spaces of specific processes
- fvaddr: Monitor fixed virtual address ranges
- paddr: Monitor the physical address space of the system
Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
differences between the operations sets in terms of the monitoring target
regions.
list different available operation sets. Please refer to the :ref:`design
<damon_operations_set>` for the list of all available operation sets and their
brief explanations.
You can set and get what type of monitoring operations DAMON will use for the
context by writing one of the keywords listed in ``avail_operations`` file and
@ -247,17 +245,11 @@ process to the ``pid_target`` file.
targets/<N>/regions
-------------------
When ``vaddr`` monitoring operations set is being used (``vaddr`` is written to
the ``contexts/<N>/operations`` file), DAMON automatically sets and updates the
monitoring target regions so that entire memory mappings of target processes
can be covered. However, users could want to set the initial monitoring region
to specific address ranges.
In contrast, DAMON do not automatically sets and updates the monitoring target
regions when ``fvaddr`` or ``paddr`` monitoring operations sets are being used
(``fvaddr`` or ``paddr`` have written to the ``contexts/<N>/operations``).
Therefore, users should set the monitoring target regions by themselves in the
cases.
In case of ``fvaddr`` or ``paddr`` monitoring operations sets, users are
required to set the monitoring target address ranges. In case of ``vaddr``
operations set, it is not mandatory, but users can optionally set the initial
monitoring region to specific address ranges. Please refer to the :ref:`design
<damon_design_vaddr_target_regions_construction>` for more details.
For such cases, users can explicitly set the initial monitoring target regions
as they want, by writing proper values to the files under this directory.
@ -302,27 +294,8 @@ In each scheme directory, five directories (``access_pattern``, ``quotas``,
The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
from the file and their meaning are as below.
Note that support of each action depends on the running DAMON operations set
:ref:`implementation <sysfs_context>`.
- ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
Supported by ``vaddr`` and ``fvaddr`` operations set.
- ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
Supported by ``vaddr`` and ``fvaddr`` operations set.
- ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
- ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
Supported by ``vaddr`` and ``fvaddr`` operations set.
- ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
Supported by ``vaddr`` and ``fvaddr`` operations set.
- ``lru_prio``: Prioritize the region on its LRU lists.
Supported by ``paddr`` operations set.
- ``lru_deprio``: Deprioritize the region on its LRU lists.
Supported by ``paddr`` operations set.
- ``stat``: Do nothing but count the statistics.
Supported by all operations sets.
from the file and their meanings are the same as those in the list on the
:ref:`design doc <damon_design_damos_action>`.
The ``apply_interval_us`` file is for setting and getting the scheme's
:ref:`apply_interval <damon_design_damos>` in microseconds.
@ -350,8 +323,9 @@ schemes/<N>/quotas/
The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
DAMON-based operation scheme.
Under ``quotas`` directory, three files (``ms``, ``bytes``,
``reset_interval_ms``) and two directories (``weights`` and ``goals``) exist.
Under ``quotas`` directory, four files (``ms``, ``bytes``,
``reset_interval_ms``, ``effective_bytes``) and two directories (``weights`` and
``goals``) exist.
You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
``reset interval`` in milliseconds by writing the values to the three files,
@ -359,7 +333,17 @@ respectively. Then, DAMON tries to use only up to ``time quota`` milliseconds
for applying the ``action`` to memory regions of the ``access_pattern``, and to
apply the action to only up to ``bytes`` bytes of memory regions within the
``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the
quota limits.
quota limits unless at least one :ref:`goal <sysfs_schemes_quota_goals>` is
set.
The time quota is internally transformed to a size quota. Between the
transformed size quota and the user-specified size quota, the smaller one is applied.
Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the
effective size quota is further adjusted. Reading ``effective_bytes`` returns
the current effective size quota. The file is not updated in real time, so
users should ask the DAMON sysfs interface to update the content of the file
by writing the special keyword ``update_schemes_effective_bytes`` to the
relevant ``kdamonds/<N>/state`` file.
Under ``weights`` directory, three files (``sz_permil``,
``nr_accesses_permil``, and ``age_permil``) exist.
@ -382,11 +366,11 @@ number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each goal and current achievement.
Among the multiple feedback values, the best one is used.
Each goal directory contains two files, namely ``target_value`` and
``current_value``. Users can set and get any number to those files to set the
feedback. User space main workload's latency or throughput, system metrics
like free memory ratio or memory pressure stall time (PSI) could be example
metrics for the values. Note that users should write
Each goal directory contains three files, namely ``target_metric``,
``target_value`` and ``current_value``. Users can set and get the three
parameters for the quota auto-tuning goals that are specified in the :ref:`design
doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each
of the files. Note that users should further write
``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
directory <sysfs_kdamond>` to pass the feedback to DAMON.
@ -579,11 +563,11 @@ monitoring results recording.
While the monitoring is turned on, you could record the tracepoint events and
show results using tracepoint supporting tools like ``perf``. For example::
# echo on > monitor_on
# echo on > kdamonds/0/state
# perf record -e damon:damon_aggregated &
# sleep 5
# kill -9 $(pidof perf)
# echo off > monitor_on
# echo off > kdamonds/0/state
# perf script
kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864
[...]
@ -628,9 +612,17 @@ debugfs Interface (DEPRECATED!)
move, please report your usecase to damon@lists.linux.dev and
linux-mm@kvack.org.
DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``,
``init_regions``, ``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``,
``mk_contexts`` and ``rm_contexts`` under its debugfs directory,
``<debugfs>/damon/``.
``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation
notice. Reading it returns the deprecation notice, as below::
# cat DEPRECATED
DAMON debugfs interface is deprecated, so users should move to DAMON_SYSFS. If you cannot, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org.
Attributes
@ -755,19 +747,17 @@ Action
~~~~~~
The ``<action>`` is a predefined integer for memory management :ref:`actions
<damon_design_damos_action>`. The supported numbers and their meanings are as
below.
- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if
``target`` is ``paddr``.
- 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if
``target`` is ``paddr``.
- 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
- 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if
``target`` is ``paddr``.
- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if
``target`` is ``paddr``.
- 5: Do nothing but count the statistics
<damon_design_damos_action>`. The mapping between the ``<action>`` values and
the memory management actions is as below. For the detailed meaning of the
action and DAMON operations set supporting each action, please refer to the
list on :ref:`design doc <damon_design_damos_action>`.
- 0: ``willneed``
- 1: ``cold``
- 2: ``pageout``
- 3: ``hugepage``
- 4: ``nohugepage``
- 5: ``stat``
Quota
~~~~~
@ -848,16 +838,16 @@ Turning On/Off
Setting the files as described above doesn't incur effect unless you explicitly
start the monitoring. You can start, stop, and check the current status of the
monitoring by writing to and reading from the ``monitor_on`` file. Writing
``on`` to the file starts the monitoring of the targets with the attributes.
Writing ``off`` to the file stops those. DAMON also stops if every target
process is terminated. Below example commands turn on, off, and check the
status of DAMON::
monitoring by writing to and reading from the ``monitor_on_DEPRECATED`` file.
Writing ``on`` to the file starts the monitoring of the targets with the
attributes. Writing ``off`` to the file stops those. DAMON also stops if
every target process is terminated. Below example commands turn on, off, and
check the status of DAMON::
# cd <debugfs>/damon
# echo on > monitor_on
# echo off > monitor_on
# cat monitor_on
# echo on > monitor_on_DEPRECATED
# echo off > monitor_on_DEPRECATED
# cat monitor_on_DEPRECATED
off
Please note that you cannot write to the above-mentioned debugfs files while
@ -873,11 +863,11 @@ can get the pid of the thread by reading the ``kdamond_pid`` file. When the
monitoring is turned off, reading the file returns ``none``. ::
# cd <debugfs>/damon
# cat monitor_on
# cat monitor_on_DEPRECATED
off
# cat kdamond_pid
none
# echo on > monitor_on
# echo on > monitor_on_DEPRECATED
# cat kdamond_pid
18594
@ -907,5 +897,5 @@ directory by putting the name of the context to the ``rm_contexts`` file. ::
# ls foo
# ls: cannot access 'foo': No such file or directory
Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
root directory only.
Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on_DEPRECATED`` files
are in the root directory only.

9
Documentation/admin-guide/mm/numa_memory_policy.rst

@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
can fall back to all existing numa nodes. This is effectively
MPOL_PREFERRED allowed for a mask rather than a single node.
MPOL_WEIGHTED_INTERLEAVE
This mode operates the same as MPOL_INTERLEAVE, except that
interleaving behavior is executed based on weights set in
/sys/kernel/mm/mempolicy/weighted_interleave/
Weighted interleave allocates pages on nodes according to a
weight. For example if nodes [0,1] are weighted [5,2], 5 pages
will be allocated on node0 for every 2 pages allocated on node1.
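For example, the [5,2] weighting above could be configured via the sysfs
directory mentioned above (a sketch assuming per-node files named ``nodeN``)::
# echo 5 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
# echo 2 > /sys/kernel/mm/mempolicy/weighted_interleave/node1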
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES

32
Documentation/admin-guide/perf/hisi-pcie-pmu.rst

@ -37,9 +37,21 @@ Example usage of perf::
hisi_pcie0_core0/rx_mwr_cnt/ [kernel PMU event]
------------------------------------------
$# perf stat -e hisi_pcie0_core0/rx_mwr_latency/
$# perf stat -e hisi_pcie0_core0/rx_mwr_cnt/
$# perf stat -g -e hisi_pcie0_core0/rx_mwr_latency/ -e hisi_pcie0_core0/rx_mwr_cnt/
$# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0xffff/
$# perf stat -e hisi_pcie0_core0/rx_mwr_cnt,port=0xffff/
The related events are usually used to calculate bandwidth, latency, or other
metrics. They need to start and end counting at the same time, therefore
related events are best used in the same event group to get the expected
values. There are two ways to know whether events are related:
a) By event name, such as the latency events "xxx_latency, xxx_cnt" or
bandwidth events "xxx_flux, xxx_time".
b) By event type, such as "event=0xXXXX, event=0x1XXXX".
Example usage of perf group::
$# perf stat -e "{hisi_pcie0_core0/rx_mwr_latency,port=0xffff/,hisi_pcie0_core0/rx_mwr_cnt,port=0xffff/}"
The current driver does not support sampling, so "perf record" is
unsupported. Attaching to a task is also unsupported for the PCIe PMU.
@ -51,8 +63,12 @@ Filter options
The PMU can only monitor the performance of traffic downstream of the target
Root Ports or the downstream target Endpoint. The PCIe PMU driver supports "port" and
"bdf" interfaces for users, and these two interfaces aren't supported at the
same time.
"bdf" interfaces for users.
Please note that one of these two interfaces must be set, and these two
interfaces aren't supported at the same time; if both are set, only the
"port" filter is valid.
If the "port" filter is not set, or is set explicitly to zero (the default),
the "bdf" filter will be in effect, because "bdf=0" means 0000:00:00.0.
- port
@ -95,7 +111,7 @@ Filter options
Example usage of perf::
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,trig_len=0x4,trig_mode=1/ sleep 5
3. Threshold filter
@ -109,7 +125,7 @@ Filter options
Example usage of perf::
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,thr_len=0x4,thr_mode=1/ sleep 5
4. TLP Length filter
@ -127,4 +143,4 @@ Filter options
Example usage of perf::
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,len_mode=0x1/ sleep 5
$# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,len_mode=0x1/ sleep 5

1
Documentation/admin-guide/perf/index.rst

@ -13,6 +13,7 @@ Performance monitor support
imx-ddr
qcom_l2_pmu
qcom_l3_pmu
starfive_starlink_pmu
arm-ccn
arm-cmn
xgene-pmu

46
Documentation/admin-guide/perf/starfive_starlink_pmu.rst

@ -0,0 +1,46 @@
================================================
StarFive StarLink Performance Monitor Unit (PMU)
================================================
StarFive StarLink Performance Monitor Unit (PMU) exists within the
StarLink Coherent Network on Chip (CNoC) that connects multiple CPU
clusters with an L3 memory system.
The uncore PMU supports overflow interrupt, up to 16 programmable 64bit
event counters, and an independent 64bit cycle counter.
The PMU can only be accessed via Memory Mapped I/O and is common to the
cores connected to the same PMU.
The driver exposes supported PMU events in the sysfs "events" directory under::
/sys/bus/event_source/devices/starfive_starlink_pmu/events/
The driver exposes the cpu used to handle PMU events in the sysfs "cpumask"
attribute under::
/sys/bus/event_source/devices/starfive_starlink_pmu/cpumask/
The driver describes the format of the config field (event ID) in the sysfs
"format" directory under::
/sys/bus/event_source/devices/starfive_starlink_pmu/format/
Example of perf usage::
$ perf list
starfive_starlink_pmu/cycles/ [Kernel PMU event]
starfive_starlink_pmu/read_hit/ [Kernel PMU event]
starfive_starlink_pmu/read_miss/ [Kernel PMU event]
starfive_starlink_pmu/read_request/ [Kernel PMU event]
starfive_starlink_pmu/release_request/ [Kernel PMU event]
starfive_starlink_pmu/write_hit/ [Kernel PMU event]
starfive_starlink_pmu/write_miss/ [Kernel PMU event]
starfive_starlink_pmu/write_request/ [Kernel PMU event]
starfive_starlink_pmu/writeback/ [Kernel PMU event]
$ perf stat -a -e /starfive_starlink_pmu/cycles/ sleep 1
Sampling is not supported. As a result, "perf record" is not supported.
Attaching to a task is not supported, only system-wide counting is supported.

59
Documentation/admin-guide/pm/amd-pstate.rst

@ -300,8 +300,8 @@ platforms. The AMD P-States mechanism is the more performance and energy
efficiency frequency management method on AMD processors.
AMD Pstate Driver Operation Modes
=================================
``amd-pstate`` Driver Operation Modes
======================================
``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
non-autonomous (passive) mode and guided autonomous (guided) mode.
@ -353,6 +353,48 @@ is activated. In this mode, driver requests minimum and maximum performance
level and the platform autonomously selects a performance level in this range
and appropriate to the current workload.
``amd-pstate`` Preferred Core
=================================
The core frequency is subject to process variation in semiconductors. Not all
cores are able to reach the maximum frequency while respecting the
infrastructure limits. Consequently, AMD has redefined the concept of
maximum frequency of a part: a fraction of the cores can reach the maximum
frequency. To find the best process scheduling policy for a given scenario,
the OS needs to know the core ordering informed by the platform through the
highest performance capability register of the CPPC interface.
``amd-pstate`` preferred core enables the scheduler to prefer scheduling on
cores that can achieve a higher frequency with lower voltage. The preferred
core rankings can dynamically change based on the workload, platform conditions,
thermals and ageing.
The priority metric will be initialized by the ``amd-pstate`` driver. The ``amd-pstate``
driver will also determine whether or not ``amd-pstate`` preferred core is
supported by the platform.
The ``amd-pstate`` driver provides an initial core ordering when the system
boots. The platform uses the CPPC interfaces to communicate the core ranking to
the operating system and scheduler, making sure that the OS chooses the
highest-performance cores first when scheduling processes. When the
``amd-pstate`` driver receives a highest-performance-change message, it
updates the core ranking and sets the cpu's priority.
``amd-pstate`` Preferred Core Switch
=====================================
Kernel Parameters
-----------------
``amd-pstate`` preferred core has two states: enabled and disabled. The state
can be chosen via kernel parameters; it is enabled by default.
``amd_prefcore=disable``
For systems that support ``amd-pstate`` preferred core, the core rankings will
always be advertised by the platform. But the OS can choose to ignore them via the
kernel parameter ``amd_prefcore=disable``.
User Space Interface in ``sysfs`` - General
===========================================
@ -385,6 +427,19 @@ control its functionality at the system level. They are located in the
to the operation mode represented by that string - or to be
unregistered in the "disable" case.
``prefcore``
Preferred core state of the driver: "enabled" or "disabled".
"enabled"
Enable the ``amd-pstate`` preferred core.
"disabled"
Disable the ``amd-pstate`` preferred core.
This attribute is read-only and reflects the preferred core state set
by the kernel parameter.
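For example, the state could be checked as below (a sketch; the path assumes
the attribute sits with the other global ``amd-pstate`` attributes under
``/sys/devices/system/cpu/amd_pstate/``)::
$ cat /sys/devices/system/cpu/amd_pstate/prefcore
enabled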
``cpupower`` tool support for ``amd-pstate``
===============================================

2
Documentation/admin-guide/reporting-regressions.rst

@ -31,7 +31,7 @@ The important bits (aka "TL;DR")
Linux kernel regression tracking bot "regzbot" track the issue by specifying
when the regression started like this::
#regzbot introduced v5.13..v5.14-rc1
#regzbot introduced: v5.13..v5.14-rc1
All the details on Linux kernel regressions relevant for users

43
Documentation/admin-guide/sysctl/kernel.rst

@ -296,12 +296,30 @@ kernel panic). This will output the contents of the ftrace buffers to
the console. This is very useful for capturing traces that lead to
crashes and outputting them to a serial console.
= ===================================================
0 Disabled (default).
1 Dump buffers of all CPUs.
2 Dump the buffer of the CPU that triggered the oops.
= ===================================================
======================= ===========================================
0 Disabled (default).
1 Dump buffers of all CPUs.
2(orig_cpu) Dump the buffer of the CPU that triggered the
oops.
<instance> Dump the specific instance buffer on all CPUs.
<instance>=2(orig_cpu) Dump the specific instance buffer on the CPU
that triggered the oops.
======================= ===========================================
Dumping multiple instances is also supported; the instances are separated
by commas. If the global buffer also needs to be dumped, specify the dump
mode (1/2/orig_cpu) for the global buffer first.
So, for example, to dump the "foo" and "bar" instance buffers on all CPUs,
the user can::
echo "foo,bar" > /proc/sys/kernel/ftrace_dump_on_oops
To dump the global buffer and the "foo" instance buffer on all
CPUs, along with the "bar" instance buffer on the CPU that triggered the
oops, the user can::
echo "1,foo,bar=2" > /proc/sys/kernel/ftrace_dump_on_oops
ftrace_enabled, stack_tracer_enabled
====================================
@ -594,6 +612,9 @@ default (``MSGMNB``).
``msgmni`` is the maximum number of IPC queues. 32000 by default
(``MSGMNI``).
All of these parameters are set per ipc namespace. The maximum number of bytes
in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
respected hierarchically in each user namespace.
msg_next_id, sem_next_id, and shm_next_id (System V IPC)
========================================================
@ -850,6 +871,7 @@ bit 3 print locks info if ``CONFIG_LOCKDEP`` is on
bit 4 print ftrace buffer
bit 5 print all printk messages in buffer
bit 6 print all CPUs backtrace (if available in the arch)
bit 7 print only tasks in uninterruptible (blocked) state
===== ============================================
So for example to print tasks and memory info on panic, user can::
@ -1274,15 +1296,20 @@ are doing anyway :)
shmall
======
This parameter sets the total amount of shared memory pages that
can be used system wide. Hence, ``shmall`` should always be at least
``ceil(shmmax/PAGE_SIZE)``.
This parameter sets the total number of shared memory pages that can be used
inside an ipc namespace. Shared memory pages are counted separately for each
ipc namespace, and the count is not inherited. Hence, ``shmall`` should always
be at least ``ceil(shmmax/PAGE_SIZE)``.
If you are not sure what the default ``PAGE_SIZE`` is on your Linux
system, you can run the following command::
# getconf PAGE_SIZE
To reduce or disable the ability to allocate shared memory, you must create a
new ipc namespace, set this parameter to the required value, and prohibit the
creation of new ipc namespaces in the current user namespace; alternatively,
cgroups can be used.
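For example, to allow roughly 16 GiB of shared memory with 4 KiB pages,
``shmall`` needs to be at least 16 GiB / 4 KiB = 4194304 pages (illustrative
values)::
# getconf PAGE_SIZE
4096
# echo 4194304 > /proc/sys/kernel/shmall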
shmmax
======

5
Documentation/admin-guide/sysctl/net.rst

@ -206,6 +206,11 @@ Will increase power usage.
Default: 0 (off)
mem_pcpu_rsv
------------
Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
rmem_default
------------

4
Documentation/admin-guide/tainted-kernels.rst

@ -34,7 +34,7 @@ name of the command ('Comm:') that triggered the event::
You'll find a 'Not tainted: ' there if the kernel was not tainted at the
time of the event; if it was, then it will print 'Tainted: ' and characters
either letters or blanks. In above example it looks like this::
either letters or blanks. In the example above it looks like this::
Tainted: P W O
@ -52,7 +52,7 @@ At runtime, you can query the tainted state by reading
tainted; any other number indicates the reasons why it is. The easiest way to
decode that number is the script ``tools/debugging/kernel-chktaint``, which your
distribution might ship as part of a package called ``linux-tools`` or
``kernel-tools``; if it doesn't you can download the script from
``kernel-tools``; if it doesn't, you can download the script from
`git.kernel.org <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/tools/debugging/kernel-chktaint>`_
and execute it with ``sh kernel-chktaint``, which would print something like
this on the machine that had the statements in the logs that were quoted earlier::

1985
Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst

File diff suppressed because it is too large Load Diff

49
Documentation/arch/arm64/elf_hwcaps.rst

@ -317,6 +317,55 @@ HWCAP2_LRCPC3
HWCAP2_LSE128
Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0011.
HWCAP2_FPMR
Functionality implied by ID_AA64PFR2_EL1.FPMR == 0b0001.
HWCAP2_LUT
Functionality implied by ID_AA64ISAR2_EL1.LUT == 0b0001.
HWCAP2_FAMINMAX
Functionality implied by ID_AA64ISAR3_EL1.FAMINMAX == 0b0001.
HWCAP2_F8CVT
Functionality implied by ID_AA64FPFR0_EL1.F8CVT == 0b1.
HWCAP2_F8FMA
Functionality implied by ID_AA64FPFR0_EL1.F8FMA == 0b1.
HWCAP2_F8DP4
Functionality implied by ID_AA64FPFR0_EL1.F8DP4 == 0b1.
HWCAP2_F8DP2
Functionality implied by ID_AA64FPFR0_EL1.F8DP2 == 0b1.
HWCAP2_F8E4M3
Functionality implied by ID_AA64FPFR0_EL1.F8E4M3 == 0b1.
HWCAP2_F8E5M2
Functionality implied by ID_AA64FPFR0_EL1.F8E5M2 == 0b1.
HWCAP2_SME_LUTV2
Functionality implied by ID_AA64SMFR0_EL1.LUTv2 == 0b1.
HWCAP2_SME_F8F16
Functionality implied by ID_AA64SMFR0_EL1.F8F16 == 0b1.
HWCAP2_SME_F8F32
Functionality implied by ID_AA64SMFR0_EL1.F8F32 == 0b1.
HWCAP2_SME_SF8FMA
Functionality implied by ID_AA64SMFR0_EL1.SF8FMA == 0b1.
HWCAP2_SME_SF8DP4
Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1.
HWCAP2_SME_SF8DP2
Functionality implied by ID_AA64SMFR0_EL1.SF8DP2 == 0b1.
4. Unused AT_HWCAP bits
-----------------------

5
Documentation/arch/arm64/silicon-errata.rst

@ -35,8 +35,9 @@ can be triggered by Linux).
For software workarounds that may adversely impact systems unaffected by
the erratum in question, a Kconfig entry is added under "Kernel
Features" -> "ARM errata workarounds via the alternatives framework".
These are enabled by default and patched in at runtime when an affected
CPU is detected. For less-intrusive workarounds, a Kconfig option is not
With the exception of workarounds for errata deemed "rare" by Arm, these
are enabled by default and patched in at runtime when an affected CPU is
detected. For less-intrusive workarounds, a Kconfig option is not
available and the code is structured (preferably with a comment) in such
a way that the erratum will not be hit.

11
Documentation/arch/arm64/sme.rst

@ -75,7 +75,7 @@ model features for SME is included in Appendix A.
2. Vector lengths
------------------
SME defines a second vector length similar to the SVE vector length which is
SME defines a second vector length similar to the SVE vector length which
controls the size of the streaming mode SVE vectors and the ZA matrix array.
The ZA matrix is square with each side having as many bytes as a streaming
mode SVE vector.
@ -238,12 +238,12 @@ prctl(PR_SME_SET_VL, unsigned long arg)
bits of Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become
unspecified, including both streaming and non-streaming SVE state.
Calling PR_SME_SET_VL with vl equal to the thread's current vector
length, or calling PR_SME_SET_VL with the PR_SVE_SET_VL_ONEXEC flag,
length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
does not constitute a change to the vector length for this purpose.
* Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared.
Calling PR_SME_SET_VL with vl equal to the thread's current vector
length, or calling PR_SME_SET_VL with the PR_SVE_SET_VL_ONEXEC flag,
length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
does not constitute a change to the vector length for this purpose.
@ -379,9 +379,8 @@ The regset data starts with struct user_za_header, containing:
/proc/sys/abi/sme_default_vector_length
Writing the text representation of an integer to this file sets the system
default vector length to the specified value, unless the value is greater
than the maximum vector length supported by the system in which case the
default vector length is set to that maximum.
default vector length to the specified value rounded to a supported value
using the same rules as for setting vector length via PR_SME_SET_VL.
The result can be determined by reopening the file and reading its
contents.
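For example, a 32-byte default vector length could be requested and the
result read back (illustrative; the written value is rounded as described
above)::
# echo 32 > /proc/sys/abi/sme_default_vector_length
# cat /proc/sys/abi/sme_default_vector_length
32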

10
Documentation/arch/arm64/sve.rst

@ -117,11 +117,6 @@ the SVE instruction set architecture.
* The SVE registers are not used to pass arguments to or receive results from
any syscall.
* In practice the affected registers/bits will be preserved or will be replaced
with zeros on return from a syscall, but userspace should not make
assumptions about this. The kernel behaviour may vary on a case-by-case
basis.
* All other SVE state of a thread, including the currently configured vector
length, the state of the PR_SVE_VL_INHERIT flag, and the deferred vector
length (if any), is preserved across all syscalls, subject to the specific
@ -428,9 +423,8 @@ The regset data starts with struct user_sve_header, containing:
/proc/sys/abi/sve_default_vector_length
Writing the text representation of an integer to this file sets the system
default vector length to the specified value, unless the value is greater
than the maximum vector length supported by the system in which case the
default vector length is set to that maximum.
default vector length to the specified value rounded to a supported value
using the same rules as for setting vector length via PR_SVE_SET_VL.
The result can be determined by reopening the file and reading its
contents.

16
Documentation/arch/riscv/vm-layout.rst

@ -144,14 +144,8 @@ passing 0 into the hint address parameter of mmap. On CPUs with an address space
smaller than sv48, the CPU maximum supported address space will be the default.
Software can "opt-in" to receiving VAs from another VA space by providing
a hint address to mmap. A hint address passed to mmap will cause the largest
address space that fits entirely into the hint to be used, unless there is no
space left in the address space. If there is no space available in the requested
address space, an address in the next smallest available address space will be
returned.
For example, in order to obtain 48-bit VA space, a hint address greater than
:code:`1 << 47` must be provided. Note that this is 47 due to sv48 userspace
ending at :code:`1 << 47` and the addresses beyond this are reserved for the
kernel. Similarly, to obtain 57-bit VA space addresses, a hint address greater
than or equal to :code:`1 << 56` must be provided.
a hint address to mmap. When a hint address is passed to mmap, the returned
address will never use more bits than the hint address. For example, if a hint
address of `1 << 40` is passed to mmap, a valid returned address will never use
bits 41 through 63. If no mappable addresses are available in that range, mmap
will return `MAP_FAILED`.

16
Documentation/arch/x86/amd-memory-encryption.rst

@ -87,14 +87,14 @@ The state of SME in the Linux kernel can be documented as follows:
kernel is non-zero).
SME can also be enabled and activated in the BIOS. If SME is enabled and
activated in the BIOS, then all memory accesses will be encrypted and it will
not be necessary to activate the Linux memory encryption support. If the BIOS
merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate
memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or
by supplying mem_encrypt=on on the kernel command line. However, if BIOS does
not enable SME, then Linux will not be able to activate memory encryption, even
if configured to do so by default or the mem_encrypt=on command line parameter
is specified.
activated in the BIOS, then all memory accesses will be encrypted and it
will not be necessary to activate the Linux memory encryption support.
If the BIOS merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG),
then memory encryption can be enabled by supplying mem_encrypt=on on the
kernel command line. However, if BIOS does not enable SME, then Linux
will not be able to activate memory encryption, even if configured to do
so by default or the mem_encrypt=on command line parameter is specified.
Secure Nested Paging (SNP)
==========================

7
Documentation/arch/x86/amd_hsmp.rst

@ -13,7 +13,8 @@ set of mailbox registers.
More details on the interface can be found in chapter
"7 Host System Management Port (HSMP)" of the family/model PPR
Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
HSMP interface is supported on EPYC server CPU models only.
@ -97,8 +98,8 @@ what happened. The transaction returns 0 on success.
More details on the interface and message definitions can be found in chapter
"7 Host System Management Port (HSMP)" of the respective family/model PPR
eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
User space C-APIs are made available by linking against the esmi library,
which is provided by the E-SMS project https://developer.amd.com/e-sms/.
which is provided by the E-SMS project https://www.amd.com/en/developer/e-sms.html.
See: https://github.com/amd/esmi_ib_library

3
Documentation/arch/x86/boot.rst

@ -878,7 +878,8 @@ Protocol: 2.10+
address if possible.
A non-relocatable kernel will unconditionally move itself and to run
at this address.
at this address. A relocatable kernel will move itself to this address if it
is loaded below this address.
============ =======
Field name: init_size

6
Documentation/arch/x86/pti.rst

@ -26,9 +26,9 @@ comments in pti.c).
This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled. It can be
enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
Once enabled at compile-time, it can be disabled at boot with the
'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
time. Once enabled at compile-time, it can be disabled at boot with
the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
Page Table Management
=====================

8
Documentation/arch/x86/resctrl.rst

@ -45,7 +45,7 @@ mount options are:
Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
bandwidth in MBps
bandwidth in MiBps
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
@ -526,7 +526,7 @@ threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.
In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well. The
resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjust the memory bandwidth percentages to ensure::
@ -573,13 +573,13 @@ Memory b/w domain is L3 cache.
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
Memory bandwidth Allocation specified in MBps
Memory bandwidth Allocation specified in MiBps
---------------------------------------------
Memory bandwidth domain is L3 cache.
::
MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------

24
Documentation/arch/x86/topology.rst

@ -47,17 +47,21 @@ AMD nomenclature for package is 'Node'.
Package-related topology information in the kernel:
- cpuinfo_x86.x86_max_cores:
- topology_num_threads_per_package()
The number of cores in a package. This information is retrieved via CPUID.
The number of threads in a package.
- cpuinfo_x86.x86_max_dies:
- topology_num_cores_per_package()
The number of dies in a package. This information is retrieved via CPUID.
The number of cores in a package.
- topology_max_dies_per_package()
The maximum number of dies in a package.
- cpuinfo_x86.topo.die_id:
The physical ID of the die. This information is retrieved via CPUID.
The physical ID of the die.
- cpuinfo_x86.topo.pkg_id:
@ -96,16 +100,6 @@ are SMT- or CMT-type threads.
AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses
"core".
Core-related topology information in the kernel:
- smp_num_siblings:
The number of threads in a core. The number of threads in a package can be
calculated by::
threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings
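As a rough kernel-internal sketch using only the accessors named above (not
code taken from the kernel), the removed calculation translates to::

	/* Threads per package via the new topology accessors. */
	unsigned int threads_per_package = topology_num_threads_per_package();

	/* Threads per core, if needed, can be derived from the same accessors. */
	unsigned int threads_per_core = topology_num_threads_per_package() /
					topology_num_cores_per_package();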
Threads
=======
A thread is a single scheduling unit. It's the equivalent of a logical Linux

96
Documentation/arch/x86/x86_64/fred.rst

@ -0,0 +1,96 @@
.. SPDX-License-Identifier: GPL-2.0
=========================================
Flexible Return and Event Delivery (FRED)
=========================================
Overview
========
The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:
1) Improve overall performance and response time by replacing event
delivery through the interrupt descriptor table (IDT event
delivery) and event return by the IRET instruction with lower
latency transitions.
2) Improve software robustness by ensuring that event delivery
establishes the full supervisor context and that event return
establishes the full user context.
The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is also used to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.
In addition to these transitions, the FRED architecture defines a new
instruction (LKGS) for managing the state of the GS segment register.
The LKGS instruction can be used by 64-bit operating systems that do
not use the new FRED transitions.
Furthermore, the FRED architecture is easy to extend for future CPU
architectures.
Software based event dispatching
================================
FRED operates differently from IDT in terms of event handling. Instead
of directly dispatching an event to its handler based on the event
vector, FRED requires the software to dispatch an event to its handler
based on both the event's type and vector. Therefore, an event dispatch
framework must be implemented to facilitate the event-to-handler
dispatch process. The FRED event dispatch framework takes control
once an event is delivered, and employs a two-level dispatch.
The first level dispatching is event type based, and the second level
dispatching is event vector based.
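As an illustration only, under assumed type and handler names (none of these
are actual kernel symbols), a two-level dispatch of this shape could look
like::

	/* All names below are hypothetical, for illustration only. */
	static void fred_dispatch(unsigned int type, unsigned int vector)
	{
		switch (type) {				/* first level: event type */
		case EVENT_TYPE_EXTINT:
			external_interrupt(vector);	/* second level: vector */
			break;
		case EVENT_TYPE_FAULT:
			fault_handlers[vector]();	/* second level: vector table */
			break;
		default:
			unknown_event(type, vector);
		}
	}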
Full supervisor/user context
============================
FRED event delivery and return atomically save and restore the full
supervisor/user context. This avoids the problem of transient states
due to %cr2 and/or %dr6, and removes the need to handle all the ugly
corner cases caused by half-baked entry states.
FRED allows explicit unblock of NMI with new event return instructions
ERETS/ERETU, avoiding the mess caused by IRET which unconditionally
unblocks NMI, e.g., when an exception happens during NMI handling.
FRED always restores the full value of %rsp, thus ESPFIX is no longer
needed when FRED is enabled.
LKGS
====
LKGS behaves like the MOV to GS instruction except that it loads the
base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
segment's descriptor cache. With LKGS, mucking with the kernel GS is
avoided entirely, i.e., an operating system can always operate with
its own GS base address.
Because FRED event delivery from ring 3 and ERETU both swap the value
of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, plus
the introduction of the LKGS instruction, the SWAPGS instruction is no
longer needed when FRED is enabled, and is thus disallowed (#UD).
Stack levels
============
Four stack levels, 0 through 3, are introduced to replace the
non-reentrant IST for event handling, and each stack level should be
configured to use a dedicated stack.
The current stack level could be unchanged or go higher upon FRED
event delivery. If unchanged, the CPU keeps using the current event
stack. If higher, the CPU switches to a new event stack specified by
the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP[123].
Only execution of a FRED return instruction, ERET[US], can lower the
current stack level, causing the CPU to switch back to the stack it was
on before a previous event delivery that promoted the stack level.
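As a hedged sketch (the stack symbols are hypothetical; only the MSR names
come from the text above), configuring dedicated stacks for the higher
levels might look like::

	/* One dedicated stack per configured stack level (kernel C). */
	wrmsrl(MSR_IA32_FRED_RSP1, (unsigned long)level1_stack_top); /* hypothetical stack */
	wrmsrl(MSR_IA32_FRED_RSP2, (unsigned long)level2_stack_top); /* hypothetical stack */
	wrmsrl(MSR_IA32_FRED_RSP3, (unsigned long)level3_stack_top); /* hypothetical stack */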

1
Documentation/arch/x86/x86_64/index.rst

@ -15,3 +15,4 @@ x86_64 Support
cpu-hotplug-spec
machinecheck
fsgs
fred

8
Documentation/bpf/kfuncs.rst

@ -177,10 +177,10 @@ In addition to kfuncs' arguments, verifier may need more information about the
type of kfunc(s) being registered with the BPF subsystem. To do so, we define
flags on a set of kfuncs as follows::
BTF_SET8_START(bpf_task_set)
BTF_KFUNCS_START(bpf_task_set)
BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
BTF_SET8_END(bpf_task_set)
BTF_KFUNCS_END(bpf_task_set)
This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.
@ -347,10 +347,10 @@ Once the kfunc is prepared for use, the final step to making it visible is
registering it with the BPF subsystem. Registration is done per BPF program
type. An example is shown below::
BTF_SET8_START(bpf_task_set)
BTF_KFUNCS_START(bpf_task_set)
BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
BTF_SET8_END(bpf_task_set)
BTF_KFUNCS_END(bpf_task_set)
static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
.owner = THIS_MODULE,

2
Documentation/bpf/map_lpm_trie.rst

@ -17,7 +17,7 @@ significant byte.
LPM tries may be created with a maximum prefix length that is a multiple
of 8, in the range from 8 to 2048. The key used for lookup and update
operations is a ``struct bpf_lpm_trie_key``, extended by
operations is a ``struct bpf_lpm_trie_key_u8``, extended by
``max_prefixlen/8`` bytes.
- For IPv4 addresses the data length is 4 bytes
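As a hedged illustration of that layout (the field names here are
illustrative, not the exact uapi definition; ``__u32``/``__u8`` come from
``<linux/types.h>``), an IPv4 key could be declared as::

	struct ipv4_lpm_key {
		__u32 prefixlen;	/* up to 32 for IPv4 */
		__u8  data[4];		/* max_prefixlen/8 = 4 address bytes */
	};

	struct ipv4_lpm_key key = {
		.prefixlen = 24,
		.data = { 192, 168, 1, 0 },	/* 192.168.1.0/24 */
	};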

594
Documentation/bpf/standardization/instruction-set.rst

@ -1,11 +1,11 @@
.. contents::
.. sectnum::
=======================================
BPF Instruction Set Specification, v1.0
=======================================
======================================
BPF Instruction Set Architecture (ISA)
======================================
This document specifies version 1.0 of the BPF instruction set.
This document specifies the BPF instruction set architecture (ISA).
Documentation conventions
=========================
@ -24,22 +24,22 @@ a type's signedness (`S`) and bit width (`N`), respectively.
.. table:: Meaning of signedness notation.
==== =========
`S` Meaning
S Meaning
==== =========
`u` unsigned
`s` signed
u unsigned
s signed
==== =========
.. table:: Meaning of bit-width notation.
===== =========
`N` Bit width
N Bit width
===== =========
`8` 8 bits
`16` 16 bits
`32` 32 bits
`64` 64 bits
`128` 128 bits
8 8 bits
16 16 bits
32 32 bits
64 64 bits
128 128 bits
===== =========
For example, `u32` is a type whose valid values are all the 32-bit unsigned
@ -48,31 +48,31 @@ numbers.
Functions
---------
* `htobe16`: Takes an unsigned 16-bit number in host-endian format and
* htobe16: Takes an unsigned 16-bit number in host-endian format and
returns the equivalent number as an unsigned 16-bit number in big-endian
format.
* `htobe32`: Takes an unsigned 32-bit number in host-endian format and
* htobe32: Takes an unsigned 32-bit number in host-endian format and
returns the equivalent number as an unsigned 32-bit number in big-endian
format.
* `htobe64`: Takes an unsigned 64-bit number in host-endian format and
* htobe64: Takes an unsigned 64-bit number in host-endian format and
returns the equivalent number as an unsigned 64-bit number in big-endian
format.
* `htole16`: Takes an unsigned 16-bit number in host-endian format and
* htole16: Takes an unsigned 16-bit number in host-endian format and
returns the equivalent number as an unsigned 16-bit number in little-endian
format.
* `htole32`: Takes an unsigned 32-bit number in host-endian format and
* htole32: Takes an unsigned 32-bit number in host-endian format and
returns the equivalent number as an unsigned 32-bit number in little-endian
format.
* `htole64`: Takes an unsigned 64-bit number in host-endian format and
* htole64: Takes an unsigned 64-bit number in host-endian format and
returns the equivalent number as an unsigned 64-bit number in little-endian
format.
* `bswap16`: Takes an unsigned 16-bit number in either big- or little-endian
* bswap16: Takes an unsigned 16-bit number in either big- or little-endian
format and returns the equivalent number with the same bit width but
opposite endianness.
* `bswap32`: Takes an unsigned 32-bit number in either big- or little-endian
* bswap32: Takes an unsigned 32-bit number in either big- or little-endian
format and returns the equivalent number with the same bit width but
opposite endianness.
* `bswap64`: Takes an unsigned 64-bit number in either big- or little-endian
* bswap64: Takes an unsigned 64-bit number in either big- or little-endian
format and returns the equivalent number with the same bit width but
opposite endianness.
@ -97,40 +97,101 @@ Definitions
A: 10000110
B: 11111111 10000110
Conformance groups
------------------
An implementation does not need to support all instructions specified in this
document (e.g., deprecated instructions). Instead, a number of conformance
groups are specified. An implementation must support the base32 conformance
group and may support additional conformance groups, where supporting a
conformance group means it must support all instructions in that conformance
group.
The use of named conformance groups enables interoperability between a runtime
that executes instructions and tools such as compilers that generate
instructions for the runtime. Thus, capability discovery in terms of
conformance groups might be done manually by users or automatically by tools.
Each conformance group has a short ASCII label (e.g., "base32") that
corresponds to a set of instructions that are mandatory. That is, each
instruction has one or more conformance groups of which it is a member.
This document defines the following conformance groups:
* base32: includes all instructions defined in this
specification unless otherwise noted.
* base64: includes base32, plus instructions explicitly noted
as being in the base64 conformance group.
* atomic32: includes 32-bit atomic operation instructions (see `Atomic operations`_).
* atomic64: includes atomic32, plus 64-bit atomic operation instructions.
* divmul32: includes 32-bit division, multiplication, and modulo instructions.
* divmul64: includes divmul32, plus 64-bit division, multiplication,
and modulo instructions.
* packet: deprecated packet access instructions.
Instruction encoding
====================
BPF has two instruction encodings:
* the basic instruction encoding, which uses 64 bits to encode an instruction
* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
constant) value after the basic instruction for a total of 128 bits.
* the wide instruction encoding, which appends a second 64 bits
after the basic instruction for a total of 128 bits.
The fields conforming an encoded basic instruction are stored in the
following order::
Basic instruction encoding
--------------------------
opcode:8 src_reg:4 dst_reg:4 offset:16 imm:32 // In little-endian BPF.
opcode:8 dst_reg:4 src_reg:4 offset:16 imm:32 // In big-endian BPF.
A basic instruction is encoded as follows::
**imm**
signed integer immediate value
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| opcode | regs | offset |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| imm |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
**offset**
signed integer offset used with pointer arithmetic
**opcode**
operation to perform, encoded as follows::
**src_reg**
the source register number (0-10), except where otherwise specified
(`64-bit immediate instructions`_ reuse this field for other purposes)
+-+-+-+-+-+-+-+-+
|specific |class|
+-+-+-+-+-+-+-+-+
**dst_reg**
destination register number (0-10)
**specific**
The format of these bits varies by instruction class
**opcode**
operation to perform
**class**
The instruction class (see `Instruction classes`_)
**regs**
The source and destination register numbers, encoded as follows
on a little-endian host::
+-+-+-+-+-+-+-+-+
|src_reg|dst_reg|
+-+-+-+-+-+-+-+-+
and as follows on a big-endian host::
+-+-+-+-+-+-+-+-+
|dst_reg|src_reg|
+-+-+-+-+-+-+-+-+
**src_reg**
the source register number (0-10), except where otherwise specified
(`64-bit immediate instructions`_ reuse this field for other purposes)
**dst_reg**
destination register number (0-10)
**offset**
signed integer offset used with pointer arithmetic
**imm**
signed integer immediate value
Note that the contents of multi-byte fields ('imm' and 'offset') are
stored using big-endian byte ordering in big-endian BPF and
little-endian byte ordering in little-endian BPF.
Note that the contents of multi-byte fields ('offset' and 'imm') are
stored using big-endian byte ordering on big-endian hosts and
little-endian byte ordering on little-endian hosts.
For example::
@ -143,71 +204,83 @@ For example::
Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.
As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
instruction uses a 64-bit immediate value that is constructed as follows.
The 64 bits following the basic instruction contain a pseudo instruction
using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
and imm containing the high 32 bits of the immediate value.
Wide instruction encoding
--------------------------
Some instructions are defined to use the wide instruction encoding,
which uses two 32-bit immediate values. The 64 bits following
the basic instruction format contain a pseudo instruction
with 'opcode', 'dst_reg', 'src_reg', and 'offset' all set to zero.
This is depicted in the following figure::
basic_instruction
.-----------------------------.
| |
code:8 regs:8 offset:16 imm:32 unused:32 imm:32
| |
'--------------'
pseudo instruction
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| opcode | regs | offset |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| imm |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| next_imm |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
**opcode**
operation to perform, encoded as explained above
Thus the 64-bit immediate value is constructed as follows:
**regs**
The source and destination register numbers, encoded as explained above
**offset**
signed integer offset used with pointer arithmetic
**imm**
signed integer immediate value
imm64 = (next_imm << 32) | imm
**reserved**
unused, set to zero
where 'next_imm' refers to the imm value of the pseudo instruction
following the basic instruction. The unused bytes in the pseudo
instruction are reserved and shall be cleared to zero.
**next_imm**
second signed integer immediate value
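As a hedged sketch in plain C (not any official decoder), unpacking these
fields from an 8-byte little-endian encoding, including the imm64
construction for wide instructions, could look like::

	#include <stdint.h>
	#include <string.h>

	struct fields { uint8_t opcode, src_reg, dst_reg; int16_t offset; int32_t imm; };

	/* Assumes a little-endian host, matching the layout figures above. */
	static struct fields decode_le(const uint8_t insn[8])
	{
		struct fields f;

		f.opcode  = insn[0];
		f.src_reg = insn[1] >> 4;	/* high nibble of 'regs' */
		f.dst_reg = insn[1] & 0x0f;	/* low nibble of 'regs' */
		memcpy(&f.offset, &insn[2], 2);	/* 16-bit little-endian 'offset' */
		memcpy(&f.imm, &insn[4], 4);	/* 32-bit little-endian 'imm' */
		return f;
	}

	/* For a wide instruction: imm64 = (next_imm << 32) | imm. */
	static uint64_t imm64(int32_t imm, int32_t next_imm)
	{
		return ((uint64_t)next_imm << 32) | (uint32_t)imm;
	}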
Instruction classes
-------------------
The three LSB bits of the 'opcode' field store the instruction class:
========= ===== =============================== ===================================
class value description reference
========= ===== =============================== ===================================
BPF_LD 0x00 non-standard load operations `Load and store instructions`_
BPF_LDX 0x01 load into register operations `Load and store instructions`_
BPF_ST 0x02 store from immediate operations `Load and store instructions`_
BPF_STX 0x03 store from register operations `Load and store instructions`_
BPF_ALU 0x04 32-bit arithmetic operations `Arithmetic and jump instructions`_
BPF_JMP 0x05 64-bit jump operations `Arithmetic and jump instructions`_
BPF_JMP32 0x06 32-bit jump operations `Arithmetic and jump instructions`_
BPF_ALU64 0x07 64-bit arithmetic operations `Arithmetic and jump instructions`_
========= ===== =============================== ===================================
The three least significant bits of the 'opcode' field store the instruction class:
===== ===== =============================== ===================================
class value description reference
===== ===== =============================== ===================================
LD 0x0 non-standard load operations `Load and store instructions`_
LDX 0x1 load into register operations `Load and store instructions`_
ST 0x2 store from immediate operations `Load and store instructions`_
STX 0x3 store from register operations `Load and store instructions`_
ALU 0x4 32-bit arithmetic operations `Arithmetic and jump instructions`_
JMP 0x5 64-bit jump operations `Arithmetic and jump instructions`_
JMP32 0x6 32-bit jump operations `Arithmetic and jump instructions`_
ALU64 0x7 64-bit arithmetic operations `Arithmetic and jump instructions`_
===== ===== =============================== ===================================
Arithmetic and jump instructions
================================
For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:
For arithmetic and jump instructions (``ALU``, ``ALU64``, ``JMP`` and
``JMP32``), the 8-bit 'opcode' field is divided into three parts::
============== ====== =================
4 bits (MSB) 1 bit 3 bits (LSB)
============== ====== =================
code source instruction class
============== ====== =================
+-+-+-+-+-+-+-+-+
| code |s|class|
+-+-+-+-+-+-+-+-+
**code**
the operation code, whose meaning varies by instruction class
**source**
**s (source)**
the source operand location, which unless otherwise specified is one of:
====== ===== ==============================================
source value description
====== ===== ==============================================
BPF_K 0x00 use 32-bit 'imm' value as source operand
BPF_X 0x08 use 'src_reg' register value as source operand
K 0 use 32-bit 'imm' value as source operand
X 1 use 'src_reg' register value as source operand
====== ===== ==============================================
**instruction class**
@ -216,70 +289,75 @@ code source instruction class
Arithmetic instructions
-----------------------
``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
``ALU`` uses 32-bit wide operands while ``ALU64`` uses 64-bit wide operands for
otherwise identical operations. ``ALU64`` instructions belong to the
base64 conformance group unless noted otherwise.
The 'code' field encodes the operation as below, where 'src' and 'dst' refer
to the values of the source and destination registers, respectively.
========= ===== ======= ==========================================================
code value offset description
========= ===== ======= ==========================================================
BPF_ADD 0x00 0 dst += src
BPF_SUB 0x10 0 dst -= src
BPF_MUL 0x20 0 dst \*= src
BPF_DIV 0x30 0 dst = (src != 0) ? (dst / src) : 0
BPF_SDIV 0x30 1 dst = (src != 0) ? (dst s/ src) : 0
BPF_OR 0x40 0 dst \|= src
BPF_AND 0x50 0 dst &= src
BPF_LSH 0x60 0 dst <<= (src & mask)
BPF_RSH 0x70 0 dst >>= (src & mask)
BPF_NEG 0x80 0 dst = -dst
BPF_MOD 0x90 0 dst = (src != 0) ? (dst % src) : dst
BPF_SMOD 0x90 1 dst = (src != 0) ? (dst s% src) : dst
BPF_XOR 0xa0 0 dst ^= src
BPF_MOV 0xb0 0 dst = src
BPF_MOVSX 0xb0 8/16/32 dst = (s8,s16,s32)src
BPF_ARSH 0xc0 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
BPF_END 0xd0 0 byte swap operations (see `Byte swap instructions`_ below)
========= ===== ======= ==========================================================
===== ===== ======= ==========================================================
name code offset description
===== ===== ======= ==========================================================
ADD 0x0 0 dst += src
SUB 0x1 0 dst -= src
MUL 0x2 0 dst \*= src
DIV 0x3 0 dst = (src != 0) ? (dst / src) : 0
SDIV 0x3 1 dst = (src != 0) ? (dst s/ src) : 0
OR 0x4 0 dst \|= src
AND 0x5 0 dst &= src
LSH 0x6 0 dst <<= (src & mask)
RSH 0x7 0 dst >>= (src & mask)
NEG 0x8 0 dst = -dst
MOD 0x9 0 dst = (src != 0) ? (dst % src) : dst
SMOD 0x9 1 dst = (src != 0) ? (dst s% src) : dst
XOR 0xa 0 dst ^= src
MOV 0xb 0 dst = src
MOVSX 0xb 8/16/32 dst = (s8,s16,s32)src
ARSH 0xc 0 :term:`sign extending<Sign Extend>` dst >>= (src & mask)
END 0xd 0 byte swap operations (see `Byte swap instructions`_ below)
===== ===== ======= ==========================================================
Underflow and overflow are allowed during arithmetic operations, meaning
the 64-bit or 32-bit value will wrap. If BPF program execution would
result in division by zero, the destination register is instead set to zero.
If execution would result in modulo by zero, for ``BPF_ALU64`` the value of
the destination register is unchanged whereas for ``BPF_ALU`` the upper
If execution would result in modulo by zero, for ``ALU64`` the value of
the destination register is unchanged whereas for ``ALU`` the upper
32 bits of the destination register are zeroed.
``BPF_ADD | BPF_X | BPF_ALU`` means::
``{ADD, X, ALU}``, where 'code' = ``ADD``, 'source' = ``X``, and 'class' = ``ALU``, means::
dst = (u32) ((u32) dst + (u32) src)
where '(u32)' indicates that the upper 32 bits are zeroed.
``BPF_ADD | BPF_X | BPF_ALU64`` means::
``{ADD, X, ALU64}`` means::
dst = dst + src
``BPF_XOR | BPF_K | BPF_ALU`` means::
``{XOR, K, ALU}`` means::
dst = (u32) dst ^ (u32) imm32
dst = (u32) dst ^ (u32) imm
``BPF_XOR | BPF_K | BPF_ALU64`` means::
``{XOR, K, ALU64}`` means::
dst = dst ^ imm32
dst = dst ^ imm
Note that most instructions have an instruction offset of 0. Only three instructions
(``BPF_SDIV``, ``BPF_SMOD``, ``BPF_MOVSX``) have a non-zero offset.
(``SDIV``, ``SMOD``, ``MOVSX``) have a non-zero offset.
Division, multiplication, and modulo operations for ``ALU`` are part
of the "divmul32" conformance group, and division, multiplication, and
modulo operations for ``ALU64`` are part of the "divmul64" conformance
group.
The division and modulo operations support both unsigned and signed flavors.
For unsigned operations (``BPF_DIV`` and ``BPF_MOD``), for ``BPF_ALU``,
'imm' is interpreted as a 32-bit unsigned value. For ``BPF_ALU64``,
For unsigned operations (``DIV`` and ``MOD``), for ``ALU``,
'imm' is interpreted as a 32-bit unsigned value. For ``ALU64``,
'imm' is first :term:`sign extended<Sign Extend>` from 32 to 64 bits, and then
interpreted as a 64-bit unsigned value.
For signed operations (``BPF_SDIV`` and ``BPF_SMOD``), for ``BPF_ALU``,
'imm' is interpreted as a 32-bit signed value. For ``BPF_ALU64``, 'imm'
For signed operations (``SDIV`` and ``SMOD``), for ``ALU``,
'imm' is interpreted as a 32-bit signed value. For ``ALU64``, 'imm'
is first :term:`sign extended<Sign Extend>` from 32 to 64 bits, and then
interpreted as a 64-bit signed value.
@ -291,11 +369,15 @@ etc. This specification requires that signed modulo use truncated division
a % n = a - n * trunc(a / n)
The ``BPF_MOVSX`` instruction does a move operation with sign extension.
``BPF_ALU | BPF_MOVSX`` :term:`sign extends<Sign Extend>` 8-bit and 16-bit operands into 32
The ``MOVSX`` instruction does a move operation with sign extension.
``{MOVSX, X, ALU}`` :term:`sign extends<Sign Extend>` 8-bit and 16-bit operands into 32
bit operands, and zeroes the remaining upper 32 bits.
``BPF_ALU64 | BPF_MOVSX`` :term:`sign extends<Sign Extend>` 8-bit, 16-bit, and 32-bit
operands into 64 bit operands.
``{MOVSX, X, ALU64}`` :term:`sign extends<Sign Extend>` 8-bit, 16-bit, and 32-bit
operands into 64 bit operands. Unlike other arithmetic instructions,
``MOVSX`` is only defined for register source operands (``X``).
The ``NEG`` instruction is only defined when the source bit is clear
(``K``).
Shift operations use a mask of 0x3F (63) for 64-bit operations and 0x1F (31)
for 32-bit operations.
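Taking the masks literally (a worked restatement of the rule above, not new
semantics)::

	{LSH, X, ALU64} means:  dst <<= (src & 0x3F)
	{LSH, X, ALU}   means:  dst = (u32)((u32)dst << (src & 0x1F))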
@ -303,43 +385,45 @@ for 32-bit operations.
Byte swap instructions
----------------------
The byte swap instructions use instruction classes of ``BPF_ALU`` and ``BPF_ALU64``
and a 4-bit 'code' field of ``BPF_END``.
The byte swap instructions use instruction classes of ``ALU`` and ``ALU64``
and a 4-bit 'code' field of ``END``.
The byte swap instructions operate on the destination register
only and do not use a separate source register or immediate value.
For ``BPF_ALU``, the 1-bit source operand field in the opcode is used to
For ``ALU``, the 1-bit source operand field in the opcode is used to
select what byte order the operation converts from or to. For
``BPF_ALU64``, the 1-bit source operand field in the opcode is reserved
``ALU64``, the 1-bit source operand field in the opcode is reserved
and must be set to 0.
========= ========= ===== =================================================
class source value description
========= ========= ===== =================================================
BPF_ALU BPF_TO_LE 0x00 convert between host byte order and little endian
BPF_ALU BPF_TO_BE 0x08 convert between host byte order and big endian
BPF_ALU64 Reserved 0x00 do byte swap unconditionally
========= ========= ===== =================================================
===== ======== ===== =================================================
class source value description
===== ======== ===== =================================================
ALU TO_LE 0 convert between host byte order and little endian
ALU TO_BE 1 convert between host byte order and big endian
ALU64 Reserved 0 do byte swap unconditionally
===== ======== ===== =================================================
The 'imm' field encodes the width of the swap operations. The following widths
are supported: 16, 32 and 64.
are supported: 16, 32 and 64. Width 64 operations belong to the base64
conformance group and other swap operations belong to the base32
conformance group.
Examples:
``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16/32/64 means::
``{END, TO_LE, ALU}`` with imm = 16/32/64 means::
dst = htole16(dst)
dst = htole32(dst)
dst = htole64(dst)
``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 16/32/64 means::
``{END, TO_BE, ALU}`` with imm = 16/32/64 means::
dst = htobe16(dst)
dst = htobe32(dst)
dst = htobe64(dst)
``BPF_ALU64 | BPF_TO_LE | BPF_END`` with imm = 16/32/64 means::
``{END, TO_LE, ALU64}`` with imm = 16/32/64 means::
dst = bswap16(dst)
dst = bswap32(dst)
@ -348,56 +432,61 @@ Examples:
Jump instructions
-----------------
``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
otherwise identical operations.
``JMP32`` uses 32-bit wide operands and indicates the base32
conformance group, while ``JMP`` uses 64-bit wide operands for
otherwise identical operations, and indicates the base64 conformance
group unless otherwise specified.
The 'code' field encodes the operation as below:
======== ===== === =========================================== =========================================
code value src description notes
======== ===== === =========================================== =========================================
BPF_JA 0x0 0x0 PC += offset BPF_JMP class
BPF_JA 0x0 0x0 PC += imm BPF_JMP32 class
BPF_JEQ 0x1 any PC += offset if dst == src
BPF_JGT 0x2 any PC += offset if dst > src unsigned
BPF_JGE 0x3 any PC += offset if dst >= src unsigned
BPF_JSET 0x4 any PC += offset if dst & src
BPF_JNE 0x5 any PC += offset if dst != src
BPF_JSGT 0x6 any PC += offset if dst > src signed
BPF_JSGE 0x7 any PC += offset if dst >= src signed
BPF_CALL 0x8 0x0 call helper function by address see `Helper functions`_
BPF_CALL 0x8 0x1 call PC += imm see `Program-local functions`_
BPF_CALL 0x8 0x2 call helper function by BTF ID see `Helper functions`_
BPF_EXIT 0x9 0x0 return BPF_JMP only
BPF_JLT 0xa any PC += offset if dst < src unsigned
BPF_JLE 0xb any PC += offset if dst <= src unsigned
BPF_JSLT 0xc any PC += offset if dst < src signed
BPF_JSLE 0xd any PC += offset if dst <= src signed
======== ===== === =========================================== =========================================
The BPF program needs to store the return value into register R0 before doing a
``BPF_EXIT``.
======== ===== ======= =============================== ===================================================
code value src_reg description notes
======== ===== ======= =============================== ===================================================
JA 0x0 0x0 PC += offset {JA, K, JMP} only
JA 0x0 0x0 PC += imm {JA, K, JMP32} only
JEQ 0x1 any PC += offset if dst == src
JGT 0x2 any PC += offset if dst > src unsigned
JGE 0x3 any PC += offset if dst >= src unsigned
JSET 0x4 any PC += offset if dst & src
JNE 0x5 any PC += offset if dst != src
JSGT 0x6 any PC += offset if dst > src signed
JSGE 0x7 any PC += offset if dst >= src signed
CALL 0x8 0x0 call helper function by address {CALL, K, JMP} only, see `Helper functions`_
CALL 0x8 0x1 call PC += imm {CALL, K, JMP} only, see `Program-local functions`_
CALL 0x8 0x2 call helper function by BTF ID {CALL, K, JMP} only, see `Helper functions`_
EXIT 0x9 0x0 return {CALL, K, JMP} only
JLT 0xa any PC += offset if dst < src unsigned
JLE 0xb any PC += offset if dst <= src unsigned
JSLT 0xc any PC += offset if dst < src signed
JSLE 0xd any PC += offset if dst <= src signed
======== ===== ======= =============================== ===================================================
The BPF program needs to store the return value into register R0 before doing an
``EXIT``.
Example:
``BPF_JSGE | BPF_X | BPF_JMP32`` (0x7e) means::
``{JSGE, X, JMP32}`` means::
if (s32)dst s>= (s32)src goto +offset
where 's>=' indicates a signed '>=' comparison.
``BPF_JA | BPF_K | BPF_JMP32`` (0x06) means::
``{JA, K, JMP32}`` means::
gotol +imm
where 'imm' means the branch offset comes from the insn 'imm' field.
Note that there are two flavors of ``BPF_JA`` instructions. The
``BPF_JMP`` class permits a 16-bit jump offset specified by the 'offset'
field, whereas the ``BPF_JMP32`` class permits a 32-bit jump offset
Note that there are two flavors of ``JA`` instructions. The
``JMP`` class permits a 16-bit jump offset specified by the 'offset'
field, whereas the ``JMP32`` class permits a 32-bit jump offset
specified by the 'imm' field. A conditional jump with an offset wider
than 16 bits may be converted to a conditional jump with an offset under
16 bits plus a 32-bit unconditional jump.
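As an illustrative sketch with a hypothetical target, ``if dst == src goto
+100000`` (too far for 'offset') could be rewritten as::

	if dst != src goto +1	// inverted condition, fits the 16-bit 'offset'
	gotol +100000		// {JA, K, JMP32}: 32-bit offset in 'imm'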
All ``CALL`` and ``JA`` instructions belong to the
base32 conformance group.
Helper functions
~~~~~~~~~~~~~~~~
@ -416,78 +505,83 @@ Program-local functions
~~~~~~~~~~~~~~~~~~~~~~~
Program-local functions are functions exposed by the same BPF program as the
caller, and are referenced by offset from the call instruction, similar to
``BPF_JA``. The offset is encoded in the imm field of the call instruction.
A ``BPF_EXIT`` within the program-local function will return to the caller.
``JA``. The offset is encoded in the imm field of the call instruction.
An ``EXIT`` within the program-local function will return to the caller.
Load and store instructions
===========================
For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
8-bit 'opcode' field is divided as:
============ ====== =================
3 bits (MSB) 2 bits 3 bits (LSB)
============ ====== =================
mode size instruction class
============ ====== =================
The mode modifier is one of:
============= ===== ==================================== =============
mode modifier value description reference
============= ===== ==================================== =============
BPF_IMM 0x00 64-bit immediate instructions `64-bit immediate instructions`_
BPF_ABS 0x20 legacy BPF packet access (absolute) `Legacy BPF Packet access instructions`_
BPF_IND 0x40 legacy BPF packet access (indirect) `Legacy BPF Packet access instructions`_
BPF_MEM 0x60 regular load and store operations `Regular load and store operations`_
BPF_MEMSX 0x80 sign-extension load operations `Sign-extension load operations`_
BPF_ATOMIC 0xc0 atomic operations `Atomic operations`_
============= ===== ==================================== =============
The size modifier is one of:
============= ===== =====================
size modifier value description
============= ===== =====================
BPF_W 0x00 word (4 bytes)
BPF_H 0x08 half word (2 bytes)
BPF_B 0x10 byte
BPF_DW 0x18 double word (8 bytes)
============= ===== =====================
For load and store instructions (``LD``, ``LDX``, ``ST``, and ``STX``), the
8-bit 'opcode' field is divided as::
+-+-+-+-+-+-+-+-+
|mode |sz |class|
+-+-+-+-+-+-+-+-+
**mode**
The mode modifier is one of:
============= ===== ==================================== =============
mode modifier value description reference
============= ===== ==================================== =============
IMM 0 64-bit immediate instructions `64-bit immediate instructions`_
ABS 1 legacy BPF packet access (absolute) `Legacy BPF Packet access instructions`_
IND 2 legacy BPF packet access (indirect) `Legacy BPF Packet access instructions`_
MEM 3 regular load and store operations `Regular load and store operations`_
MEMSX 4 sign-extension load operations `Sign-extension load operations`_
ATOMIC 6 atomic operations `Atomic operations`_
============= ===== ==================================== =============
**sz (size)**
The size modifier is one of:
==== ===== =====================
size value description
==== ===== =====================
W 0 word (4 bytes)
H 1 half word (2 bytes)
B 2 byte
DW 3 double word (8 bytes)
==== ===== =====================
Instructions using ``DW`` belong to the base64 conformance group.
**class**
The instruction class (see `Instruction classes`_)
Regular load and store operations
---------------------------------
The ``BPF_MEM`` mode modifier is used to encode regular load and store
The ``MEM`` mode modifier is used to encode regular load and store
instructions that transfer data between a register and memory.
``BPF_MEM | <size> | BPF_STX`` means::
``{MEM, <size>, STX}`` means::
*(size *) (dst + offset) = src
``BPF_MEM | <size> | BPF_ST`` means::
``{MEM, <size>, ST}`` means::
*(size *) (dst + offset) = imm32
*(size *) (dst + offset) = imm
``BPF_MEM | <size> | BPF_LDX`` means::
``{MEM, <size>, LDX}`` means::
dst = *(unsigned size *) (src + offset)
Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW`` and
'unsigned size' is one of u8, u16, u32 or u64.
Where '<size>' is one of: ``B``, ``H``, ``W``, or ``DW``, and
'unsigned size' is one of: u8, u16, u32, or u64.
Sign-extension load operations
------------------------------
The ``BPF_MEMSX`` mode modifier is used to encode :term:`sign-extension<Sign Extend>` load
The ``MEMSX`` mode modifier is used to encode :term:`sign-extension<Sign Extend>` load
instructions that transfer data between a register and memory.
``BPF_MEMSX | <size> | BPF_LDX`` means::
``{MEMSX, <size>, LDX}`` means::
dst = *(signed size *) (src + offset)
Where size is one of: ``BPF_B``, ``BPF_H`` or ``BPF_W``, and
'signed size' is one of s8, s16 or s32.
Where size is one of: ``B``, ``H``, or ``W``, and
'signed size' is one of: s8, s16, or s32.
Atomic operations
-----------------
@ -497,10 +591,12 @@ interrupted or corrupted by other access to the same memory region
by other BPF programs or means outside of this specification.
All atomic operations supported by BPF are encoded as store operations
that use the ``BPF_ATOMIC`` mode modifier as follows:
that use the ``ATOMIC`` mode modifier as follows:
* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
* ``{ATOMIC, W, STX}`` for 32-bit operations, which are
part of the "atomic32" conformance group.
* ``{ATOMIC, DW, STX}`` for 64-bit operations, which are
part of the "atomic64" conformance group.
* 8-bit and 16-bit wide atomic operations are not supported.
The 'imm' field is used to encode the actual atomic operation.
@ -510,18 +606,18 @@ arithmetic operations in the 'imm' field to encode the atomic operation:
======== ===== ===========
imm value description
======== ===== ===========
BPF_ADD 0x00 atomic add
BPF_OR 0x40 atomic or
BPF_AND 0x50 atomic and
BPF_XOR 0xa0 atomic xor
ADD 0x00 atomic add
OR 0x40 atomic or
AND 0x50 atomic and
XOR 0xa0 atomic xor
======== ===== ===========
``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
``{ATOMIC, W, STX}`` with 'imm' = ADD means::
*(u32 *)(dst + offset) += src
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means::
``{ATOMIC, DW, STX}`` with 'imm' = ADD means::
*(u64 *)(dst + offset) += src
@ -531,20 +627,20 @@ two complex atomic operations:
=========== ================ ===========================
imm value description
=========== ================ ===========================
BPF_FETCH 0x01 modifier: return old value
BPF_XCHG 0xe0 | BPF_FETCH atomic exchange
BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
FETCH 0x01 modifier: return old value
XCHG 0xe0 | FETCH atomic exchange
CMPXCHG 0xf0 | FETCH atomic compare and exchange
=========== ================ ===========================
The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations. If the ``BPF_FETCH`` flag
The ``FETCH`` modifier is optional for simple atomic operations, and is
always set for the complex atomic operations. If the ``FETCH`` flag
is set, then the operation also overwrites ``src`` with the value that
was in memory before it was modified.
The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
The ``XCHG`` operation atomically exchanges ``src`` with the value
addressed by ``dst + offset``.
The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
The ``CMPXCHG`` operation atomically compares the value addressed by
``dst + offset`` with ``R0``. If they match, the value addressed by
``dst + offset`` is replaced with ``src``. In either case, the
value that was at ``dst + offset`` before the operation is zero-extended
@ -553,25 +649,25 @@ and loaded back to ``R0``.
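In the pseudocode used above, the 64-bit ``CMPXCHG`` semantics (a
restatement of the description, not an additional operation) are::

	tmp = *(u64 *)(dst + offset)
	*(u64 *)(dst + offset) = (tmp == R0) ? src : tmp
	R0 = tmp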
64-bit immediate instructions
-----------------------------
Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
encoding defined in `Instruction encoding`_, and use the 'src' field of the
Instructions with the ``IMM`` 'mode' modifier use the wide instruction
encoding defined in `Instruction encoding`_, and use the 'src_reg' field of the
basic instruction to hold an opcode subtype.
The following table defines a set of ``BPF_IMM | BPF_DW | BPF_LD`` instructions
with opcode subtypes in the 'src' field, using new terms such as "map"
The following table defines a set of ``{IMM, DW, LD}`` instructions
with opcode subtypes in the 'src_reg' field, using new terms such as "map"
defined further below:
========================= ====== === ========================================= =========== ==============
opcode construction opcode src pseudocode imm type dst type
========================= ====== === ========================================= =========== ==============
BPF_IMM | BPF_DW | BPF_LD 0x18 0x0 dst = imm64 integer integer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x1 dst = map_by_fd(imm) map fd map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x3 dst = var_addr(imm) variable id data pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x4 dst = code_addr(imm) integer code pointer
BPF_IMM | BPF_DW | BPF_LD 0x18 0x5 dst = map_by_idx(imm) map index map
BPF_IMM | BPF_DW | BPF_LD 0x18 0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data pointer
========================= ====== === ========================================= =========== ==============
======= ========================================= =========== ==============
src_reg pseudocode imm type dst type
======= ========================================= =========== ==============
0x0 dst = (next_imm << 32) | imm integer integer
0x1 dst = map_by_fd(imm) map fd map
0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data pointer
0x3 dst = var_addr(imm) variable id data pointer
0x4 dst = code_addr(imm) integer code pointer
0x5 dst = map_by_idx(imm) map index map
0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data pointer
======= ========================================= =========== ==============
where
@ -609,5 +705,9 @@ Legacy BPF Packet access instructions
-------------------------------------
BPF previously introduced special instructions for access to packet data that were
carried over from classic BPF. However, these instructions are
deprecated and should no longer be used.
carried over from classic BPF. These instructions used an instruction
class of ``LD``, a size modifier of ``W``, ``H``, or ``B``, and a
mode modifier of ``ABS`` or ``IND``. The 'dst_reg' and 'offset' fields were
set to zero, and 'src_reg' was set to zero for ``ABS``. However, these
instructions are deprecated and should no longer be used. All legacy packet
access instructions belong to the "packet" conformance group.

2
Documentation/bpf/verifier.rst

@ -562,7 +562,7 @@ works::
* ``checkpoint[0].r1`` is marked as read;
* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
by ``clean_live_states()``. After this processing ``checkpoint[0].r0`` has a
by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
read mark and all other registers and stack slots are marked as ``NOT_INIT``
or ``STACK_INVALID``

6
Documentation/conf.py

@ -346,9 +346,9 @@ sys.stderr.write("Using %s theme\n" % html_theme)
html_static_path = ['sphinx-static']
# If true, Docutils "smart quotes" will be used to convert quotes and dashes
# to typographically correct entities. This will convert "--" to "—",
# which is not always what we want, so disable it.
smartquotes = False
# to typographically correct entities. However, conversion of "--" to "—"
# is not always what we want, so enable only quotes.
smartquotes_action = 'q'
# Custom sidebar templates, maps document names to template names.
# Note that the RTD theme ignores this

43
Documentation/core-api/workqueue.rst

@ -77,10 +77,12 @@ wants a function to be executed asynchronously it has to set up a work
item pointing to that function and queue that work item on a
workqueue.
Special purpose threads, called worker threads, execute the functions
off of the queue, one after the other. If no work is queued, the
worker threads become idle. These worker threads are managed in so
called worker-pools.
A work item can be executed in either a thread or the BH (softirq) context.
For threaded workqueues, special purpose threads, called [k]workers, execute
the functions off of the queue, one after the other. If no work is queued,
the worker threads become idle. These worker threads are managed in
worker-pools.
The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
@ -91,6 +93,12 @@ for high priority ones, for each possible CPU and some extra
worker-pools to serve work items queued on unbound workqueues - the
number of these backing pools is dynamic.
BH workqueues use the same framework. However, as there can only be one
concurrent execution context, there's no need to worry about concurrency.
Each per-CPU BH worker pool contains only one pseudo worker which represents
the BH execution context. A BH workqueue can be considered a convenience
interface to softirq.
Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit. They can influence some
aspects of the way the work items are executed by setting flags on the
@ -106,7 +114,7 @@ unless specifically overridden, a work item of a bound workqueue will
be queued on the worklist of either normal or highpri worker-pool that
is associated to the CPU the issuer is running on.
For any worker pool implementation, managing the concurrency level
For any thread pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue. cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
@ -164,6 +172,17 @@ resources, scheduled and executed.
``flags``
---------
``WQ_BH``
BH workqueues can be considered a convenience interface to softirq. BH
workqueues are always per-CPU and all BH work items are executed in the
queueing CPU's softirq context in the queueing order.
All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the
only allowed additional flag.
BH work items cannot sleep. All other features such as delayed queueing,
flushing and canceling are supported.
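A minimal sketch (hypothetical names; the allocation and queueing calls are
shown as they would appear in a driver init path)::

	static void my_bh_fn(struct work_struct *work)
	{
		/* Runs in the queueing CPU's softirq context; must not sleep. */
	}
	static DECLARE_WORK(my_bh_work, my_bh_fn);

	static struct workqueue_struct *my_bh_wq;

	my_bh_wq = alloc_workqueue("my_bh", WQ_BH, 0);	/* max_active must be 0 */
	queue_work(my_bh_wq, &my_bh_work);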
``WQ_UNBOUND``
Work items queued to an unbound wq are served by the special
worker-pools which host workers which are not bound to any
@ -237,15 +256,11 @@ may queue at the same time. Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.
Some users depend on the strict execution ordering of ST wq. The
combination of ``@max_active`` of 1 and ``WQ_UNBOUND`` used to
achieve this behavior. Work items on such wq were always queued to the
unbound worker-pools and only one work item could be active at any given
time thus achieving the same ordering property as ST wq.
In the current implementation the above configuration only guarantees
ST behavior within a given NUMA node. Instead ``alloc_ordered_workqueue()`` should
be used to achieve system-wide ST behavior.
Some users depend on strict execution ordering where only one work item
is in flight at any given time and the work items are processed in
queueing order. While the combination of ``@max_active`` of 1 and
``WQ_UNBOUND`` used to achieve this behavior, this is no longer the
case. Use ``alloc_ordered_workqueue()`` instead.
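For example, a minimal sketch with a hypothetical name::

	static struct workqueue_struct *ordered_wq;

	ordered_wq = alloc_ordered_workqueue("my_ordered", 0);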
Example Execution Scenarios

4
Documentation/dev-tools/checkpatch.rst

@ -168,7 +168,7 @@ Available options:
- --fix
This is an EXPERIMENTAL feature. If correctable errors exists, a file
This is an EXPERIMENTAL feature. If correctable errors exist, a file
<inputfile>.EXPERIMENTAL-checkpatch-fixes is created which has the
automatically fixable errors corrected.
@ -181,7 +181,7 @@ Available options:
- --ignore-perl-version
Override checking of perl version. Runtime errors maybe encountered after
Override checking of perl version. Runtime errors may be encountered after
enabling this flag if the perl version does not meet the minimum specified.
- --codespell

41
Documentation/dev-tools/kasan.rst

@ -169,7 +169,7 @@ Error reports
A typical KASAN report looks like this::
==================================================================
BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [test_kasan]
BUG: KASAN: slab-out-of-bounds in kmalloc_oob_right+0xa8/0xbc [kasan_test]
Write of size 1 at addr ffff8801f44ec37b by task insmod/2760
CPU: 1 PID: 2760 Comm: insmod Not tainted 4.19.0-rc3+ #698
@ -179,8 +179,8 @@ A typical KASAN report looks like this::
print_address_description+0x73/0x280
kasan_report+0x144/0x187
__asan_report_store1_noabort+0x17/0x20
kmalloc_oob_right+0xa8/0xbc [test_kasan]
kmalloc_tests_init+0x16/0x700 [test_kasan]
kmalloc_oob_right+0xa8/0xbc [kasan_test]
kmalloc_tests_init+0x16/0x700 [kasan_test]
do_one_initcall+0xa5/0x3ae
do_init_module+0x1b6/0x547
load_module+0x75df/0x8070
@ -200,8 +200,8 @@ A typical KASAN report looks like this::
save_stack+0x43/0xd0
kasan_kmalloc+0xa7/0xd0
kmem_cache_alloc_trace+0xe1/0x1b0
kmalloc_oob_right+0x56/0xbc [test_kasan]
kmalloc_tests_init+0x16/0x700 [test_kasan]
kmalloc_oob_right+0x56/0xbc [kasan_test]
kmalloc_tests_init+0x16/0x700 [kasan_test]
do_one_initcall+0xa5/0x3ae
do_init_module+0x1b6/0x547
load_module+0x75df/0x8070
@ -277,6 +277,27 @@ traces point to places in code that interacted with the object but that are not
directly present in the bad access stack trace. Currently, this includes
call_rcu() and workqueue queuing.
CONFIG_KASAN_EXTRA_INFO
~~~~~~~~~~~~~~~~~~~~~~~
Enabling CONFIG_KASAN_EXTRA_INFO allows KASAN to record and report more
information. The extra information currently supported is the CPU number and
timestamp at allocation and free. More information can help find the cause of
the bug and correlate the error with other system events, at the cost of using
extra memory to record more information (more cost details in the help text of
CONFIG_KASAN_EXTRA_INFO).
Here is the report with CONFIG_KASAN_EXTRA_INFO enabled (only the
different parts are shown)::
==================================================================
...
Allocated by task 134 on cpu 5 at 229.133855s:
...
Freed by task 136 on cpu 3 at 230.199335s:
...
==================================================================
Implementation details
----------------------
@ -510,15 +531,15 @@ When a test passes::
When a test fails due to a failed ``kmalloc``::
# kmalloc_large_oob_right: ASSERTION FAILED at lib/test_kasan.c:163
# kmalloc_large_oob_right: ASSERTION FAILED at mm/kasan/kasan_test.c:245
Expected ptr is not null, but is
not ok 4 - kmalloc_large_oob_right
not ok 5 - kmalloc_large_oob_right
When a test fails due to a missing KASAN report::
# kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:974
# kmalloc_double_kzfree: EXPECTATION FAILED at mm/kasan/kasan_test.c:709
KASAN failure expected in "kfree_sensitive(ptr)", but none occurred
not ok 44 - kmalloc_double_kzfree
not ok 28 - kmalloc_double_kzfree
At the end the cumulative status of all KASAN tests is printed. On success::
@ -534,7 +555,7 @@ There are a few ways to run KUnit-compatible KASAN tests.
1. Loadable module
With ``CONFIG_KUNIT`` enabled, KASAN-KUnit tests can be built as a loadable
module and run by loading ``test_kasan.ko`` with ``insmod`` or ``modprobe``.
module and run by loading ``kasan_test.ko`` with ``insmod`` or ``modprobe``.
2. Built-In

16
Documentation/dev-tools/kselftest.rst

@ -245,6 +245,10 @@ Contributing new tests (details)
TEST_PROGS, TEST_GEN_PROGS mean it is the executable tested by
default.
TEST_GEN_MODS_DIR should be used by tests that require modules to be built
before the test starts. The variable will contain the name of the directory
containing the modules.
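For instance, a hypothetical selftest Makefile whose modules live in a
``modules/`` subdirectory would set::

	TEST_GEN_MODS_DIR := modules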
TEST_CUSTOM_PROGS should be used by tests that require custom build
rules and prevent common build rule use.
@ -255,9 +259,21 @@ Contributing new tests (details)
TEST_PROGS_EXTENDED, TEST_GEN_PROGS_EXTENDED mean it is the
executable which is not tested by default.
TEST_FILES, TEST_GEN_FILES mean it is a file which is used by the
test.
TEST_INCLUDES is similar to TEST_FILES, it lists files which should be
included when exporting or installing the tests, with the following
differences:
* symlinks to files in other directories are preserved
* the part of paths below tools/testing/selftests/ is preserved when
copying the files to the output directory
TEST_INCLUDES is meant to list dependencies located in other directories of
the selftests hierarchy.
* First use the headers inside the kernel source and/or git repo, and then the
system headers. Headers for the kernel release as opposed to headers
installed by the distro on the system should be the primary focus to be able

28
Documentation/dev-tools/ubsan.rst

@ -49,34 +49,22 @@ Report example
Usage
-----
To enable UBSAN configure kernel with::
To enable UBSAN, configure the kernel with::
CONFIG_UBSAN=y
CONFIG_UBSAN=y
and to check the entire kernel::
CONFIG_UBSAN_SANITIZE_ALL=y
To enable instrumentation for specific files or directories, add a line
similar to the following to the respective kernel Makefile:
- For a single file (e.g. main.o)::
UBSAN_SANITIZE_main.o := y
- For all files in one directory::
UBSAN_SANITIZE := y
To exclude files from being instrumented even if
``CONFIG_UBSAN_SANITIZE_ALL=y``, use::
To exclude files from being instrumented use::
UBSAN_SANITIZE_main.o := n
and::
and to exclude all targets in one directory use::
UBSAN_SANITIZE := n
When disabled for all targets, specific files can be enabled using::
UBSAN_SANITIZE_main.o := y
Detection of unaligned accesses is controlled through the separate option
CONFIG_UBSAN_ALIGNMENT. It's off by default on architectures that support
unaligned accesses (CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y). One could

3
Documentation/devicetree/bindings/Makefile

@ -64,9 +64,6 @@ override DTC_FLAGS := \
-Wno-unique_unit_address \
-Wunique_unit_address_if_enabled
# Disable undocumented compatible checks until warning free
override DT_CHECKER_FLAGS ?=
$(obj)/processed-schema.json: $(DT_DOCS) $(src)/.yamllint check_dtschema_version FORCE
$(call if_changed_rule,chkdt)
