bzr branch http://suren.me/webbzr/alps/pcitool
Revision 273 by Suren A. Chilingaryan: Add BIOS and kernel optimization instructions

BIOS
====
The important options in BIOS:
 - IOMMU (Intel VT-d) - enables hardware translation between physical and bus addresses
 - No Snoop - disables hardware cache coherency between DMA and CPU
 - Max Payload (MMIO Size) - maximal (useful) payload for the PCIe protocol
 - Above 4G Decoding - this seems to allow bus addresses wider than 32 bits
 - Memory performance - frequency, channel interleaving, and the hardware prefetcher all affect memory performance

IOMMU
=====
 - As many PCI devices can address only 32-bit memory, some address translation
   mechanism is required for DMA operation (it also helps with security, limiting
   PCI devices to the allowed address range only). There are several methods to
   achieve this; a driver-side sketch follows after this list.
   * Linux provides so-called Bounce Buffers (SWIOTLB). This is just a small memory
     buffer in the lower 4 GB of memory. The DMA is actually performed into this
     buffer and the data is then copied to the appropriate location. One problem
     with SWIOTLB is that it does not guarantee 4K-aligned addresses when mapping
     memory pages (in order to use the space optimally). This is properly supported
     neither by NWLDMA nor by IPEDMA.
   * Alternatively, a hardware IOMMU can be used, which provides hardware address
     translation between physical and bus addresses. To use it, the technology has
     to be enabled both in the BIOS and in the kernel.
     + Intel VT-d or AMD-Vi (AMD IOMMU) virtualization technology has to be enabled
     + On Intel, it is enabled with the "intel_iommu=on" kernel parameter (the
       alternative is to build the kernel with CONFIG_INTEL_IOMMU_DEFAULT_ON)
     + Checking: dmesg | grep -e IOMMU -e DMAR -e PCI-DMA
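
For illustration, a minimal driver-side sketch (hypothetical code, not from the
pcitool sources), using the older pci_* DMA wrappers that match the function names
used elsewhere in these notes. The driver only declares a 32-bit DMA mask and maps
a buffer; whether the returned bus address comes from SWIOTLB bouncing or from
IOMMU translation is decided by the kernel, transparently to the driver:

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    static int example_map(struct pci_dev *pdev, void *buf, size_t len)
    {
        dma_addr_t bus_addr;

        /* Declare that the device can only address 32 bits */
        if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)))
            return -EIO;

        /* Map the buffer for device-to-memory transfers; the kernel may
         * bounce (SWIOTLB) or translate (IOMMU) behind this call */
        bus_addr = pci_map_single(pdev, buf, len, PCI_DMA_FROMDEVICE);
        if (pci_dma_mapping_error(pdev, bus_addr))
            return -ENOMEM;

        /* ... program bus_addr into the DMA engine and run the transfer ... */

        pci_unmap_single(pdev, bus_addr, len, PCI_DMA_FROMDEVICE);
        return 0;
    }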

DMA Cache Coherency
===================
The DMA API distinguishes two types of memory: coherent and non-coherent.
 - For coherent memory, the hardware takes care of cache consistency. This is often
   achieved by snooping (No Snoop should be disabled in the BIOS). Alternatively,
   the same effect can be achieved by using non-cached memory. There are
   architectures with 100% cache-coherent memory and others where only part of the
   memory is kept cache-coherent. For such architectures, coherent memory can be
   allocated with
       dma_alloc_coherent(...) / dma_alloc_attrs(...)
   * However, coherent memory can be slow (especially on large SMP systems). Also,
     the minimal allocation unit may be restricted to a page. Therefore, it is
     useful to group several consistent mappings together, as sketched below.
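
A minimal sketch of such a grouped coherent allocation (hypothetical fragment; the
64-byte descriptor stride is an arbitrary example): one coherent page is allocated
and subdivided among several small descriptors, instead of paying a full page for
each of them:

    #include <linux/dma-mapping.h>

    void *vaddr;
    dma_addr_t bus_addr;

    /* One coherent page; vaddr is the CPU view, bus_addr the device view.
     * No sync calls are needed for this memory. */
    vaddr = dma_alloc_coherent(&pdev->dev, PAGE_SIZE, &bus_addr, GFP_KERNEL);
    if (vaddr) {
        void *desc0          = vaddr;          /* descriptor 0 */
        dma_addr_t desc0_bus = bus_addr;
        void *desc1          = vaddr + 64;     /* descriptor 1, same page */
        dma_addr_t desc1_bus = bus_addr + 64;

        /* ... use the descriptors ... */

        dma_free_coherent(&pdev->dev, PAGE_SIZE, vaddr, bus_addr);
    }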

 - On the other hand, it is possible to allocate streaming DMA memory, which is
   synchronized using:
       pci_dma_sync_single_for_device / pci_dma_sync_single_for_cpu
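   The ownership hand-over looks roughly like this (hypothetical fragment, reusing
   the bus_addr mapping from the IOMMU sketch above):

       /* Give the buffer to the device before starting the transfer */
       pci_dma_sync_single_for_device(pdev, bus_addr, len, PCI_DMA_FROMDEVICE);

       /* ... the device writes into the buffer via DMA ... */

       /* Take the buffer back before the CPU reads the data */
       pci_dma_sync_single_for_cpu(pdev, bus_addr, len, PCI_DMA_FROMDEVICE);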

 - It may happen that all memory is coherent anyway and we do not need to call
   these two functions. Currently, this seems not to be required on x86_64, which
   may indicate that snooping is performed for all available memory. On the other
   hand, it may just be that, luckily, nothing has been cached so far.

PCIe Payload
============
 - A kind of MTU for the PCIe protocol. The higher the value, the lower the
   slowdown caused by protocol headers while streaming large amounts of data; see
   the back-of-envelope calculation after this list. The current values can be
   checked with 'lspci -vv'. For each device, there are 2 values:
   * MaxPayload under DevCap indicates the MaxPayload supported by the device
   * MaxPayload under DevCtl indicates the MaxPayload negotiated between the device
     and the chipset. The negotiated MaxPayload is the minimal value among all the
     infrastructure between the device and the chipset. Normally, it is limited by
     the MaxPayload supported by the PCIe root port of the chipset. Most systems
     are currently restricted to 256 bytes.
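
A back-of-envelope illustration of why the payload size matters, assuming roughly
24 bytes of per-TLP overhead (framing, sequence number, header, LCRC); the exact
overhead depends on the header format and optional ECRC:

    #include <stdio.h>

    int main(void)
    {
        const double overhead = 24.0;           /* approx. bytes per TLP */
        const int payloads[] = { 128, 256, 512 };

        for (int i = 0; i < 3; i++) {
            double p = payloads[i];
            /* fraction of link bytes carrying actual data */
            printf("MaxPayload %3d: efficiency %.1f%%\n",
                   payloads[i], 100.0 * p / (p + overhead));
        }
        return 0;
    }

This prints about 84% for 128 bytes, 91% for 256 bytes, and 96% for 512 bytes, so
going from 128 to 256 bytes already recovers most of the header overhead.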

Memory Performance
==================
 - Memory performance is quite critical as we currently triple the PCIe bandwidth:
   the DMA writes to memory, we read that memory (it is not in the cache), and we
   write it back to memory.
 - The most important thing is to enable Channel Interleaving (otherwise a
   single-channel copy will be performed). On the other hand, Rank Interleaving
   does not matter much.
 - On some motherboards (ASRock X79, for instance), when the memory speed is set
   manually, interleaving left in AUTO mode is switched off. So it is safer to
   switch interleaving on manually.
 - Hardware prefetching helps a little bit and should be turned on.
 - A faster memory frequency helps. As we are streaming, I guess this is more
   important than even slightly better CAS & RAS latencies, but I have not checked.
 - Memory bank conflicts may sometimes significantly harm performance. A bank
   conflict happens if we read and write from/to different rows of the same bank
   (there could also be a conflict with the DMA operation). I don't have a good
   idea how to prevent this yet.
 - The most efficient memcpy implementation depends on the CPU generation. For the
   latest models, AVX seems to be the most efficient. Filling all AVX registers
   before writing increases performance. It also helps quite a lot if multiple
   pages are copied in parallel (still, first we read from multiple pages and then
   write to multiple pages; see ssebench). A sketch of the idea follows after this
   list.
 - Usage of HugePages makes performance more stable. Using page-locked memory does
   not help at all.
 - This will still give about 10-15 GB/s at most, and on multiprocessor systems
   only about 5 GB/s because of performance penalties due to snooping. Therefore,
   copying with multiple threads is preferable.
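
A sketch of the batched AVX copy idea mentioned above (hypothetical code, in the
spirit of ssebench; assumes AVX support, 32-byte-aligned buffers, and a size that
is a multiple of 128 bytes):

    #include <immintrin.h>
    #include <stddef.h>

    static void avx_copy(void *dst, const void *src, size_t size)
    {
        const __m256i *s = (const __m256i *) src;
        __m256i *d = (__m256i *) dst;
        size_t blocks = size / (4 * sizeof(__m256i));   /* 128-byte blocks */

        while (blocks--) {
            /* Fill a batch of YMM registers first ... */
            __m256i r0 = _mm256_load_si256(s + 0);
            __m256i r1 = _mm256_load_si256(s + 1);
            __m256i r2 = _mm256_load_si256(s + 2);
            __m256i r3 = _mm256_load_si256(s + 3);

            /* ... then issue all the stores, so loads and stores are grouped
             * instead of being interleaved on a fine grain */
            _mm256_store_si256(d + 0, r0);
            _mm256_store_si256(d + 1, r1);
            _mm256_store_si256(d + 2, r2);
            _mm256_store_si256(d + 3, r3);

            s += 4;
            d += 4;
        }
    }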