Module dma_axi_write_simple
This document contains technical documentation for the dma_axi_write_simple
module.
This module has a register interface, so make sure to study the register interface documentation as well as this top-level document.
To browse the source code, visit the repository on GitHub.
This module contains a simple Direct Memory Access (DMA) component for streaming data from FPGA to DDR memory over AXI. The implementation is optimized for
very low resource usage, and
maximum AXI/data throughput.
Being simplified, however, it has the following limitations:
Can only handle writing to continuous ring buffer space in DDR. Has no scatter-gather capability.
Does not support data strobing or narrow AXI transfers. All addresses must be aligned with the AXI data width.
Uses a static compile-time packet length, with no support for partial packets.
Packet length must be power of two.
These limitations and the simplicity of the design are intentional. This is what enables the low resource usage and high throughput.
C++ driver
There is a complete C++ driver available in the cpp sub-folder in the repository. The class provides a convenient API for setting up the module and receiving stream data from the FPGA. See the header file for documentation.
Simulate and build FPGA with register artifacts
This module is controlled over a register bus, with code generated by hdl-registers. See Register interface for register documentation.
Generated register code artifacts are not checked in to the repository. The recommended way to use hdl-modules is with tsfpga (see Getting started), in which case register code is always generated and kept up to date automatically. This is by far the most convenient and portable solution.
If you dont’t want to use tsfpga, you can integrate hdl-registers code generation in your build/simulation flow or use the hard coded artifacts below (not recommended).
Hard coded artifacts
Not recommended, but if you don’t want to use tsfpga or hdl-registers,
these generated VHDL artifacts can be included in the dma_axi_write_simple
library
for simulation and synthesis:
The first few are source files that shall be included in your simulation as well build project. The last one is a simulation file that shall be included only in your simulation project. These generated C++ artifacts can be used to control the module from software:
Warning
When copy-pasting generated artifacts, there is a large risk that things go out of sync when e.g. versions are bumped. An automated solution with tsfpga is highly recommended.
Register interface
This module is controlled and monitored over a register bus.
Please see separate HTML page
for register documentation.
Register code is generated using hdl-registers based on the regs_dma_axi_write_simple.toml data file.
dma_axi_write_simple.vhd
Main implementation of the simple DMA functionality. This entity is not suitable for instantiation in a user design, use instead e.g. dma_axi_write_simple_axi_lite.vhd.
Packet length
The packet_length_beats
generic specifies the packet length in terms of number of
input stream
beats.
When one packet of streaming data has been written to DDR,
the write_done
interrupt will trigger and the buffer_written_address
register
will be updated.
This indicates to the software that there is data in the buffer that can be read.
Note
The packet length is a compile-time parameter. It can not be changed during runtime and there is no support for writing or clearing partial packets.
This saves a lot of resources and is part of the simple nature of this DMA core.
If the packet length specified by the user equates more than one maximum-length AXI burst, the core will perform burst splitting internally.
Data width conversion
The core supports data width conversion between the input stream
and the AXI bus.
If the stream_data_width
and axi_data_width
are not equal,
width_conversion.vhd will be instantiated for a lightweight conversion.
Note that axi_data_width
must be the native width of the AXI port.
Resource usage
The core has a simple design with the goal of low resource utilization in mind. See Resource utilization for some build numbers. These numbers are incredibly low compared to some other implementations.
The special case when packet_length_beats
is 1 has an optimized implementation that gives
even lower resource usage than the general case.
This comes at the cost of quite poor memory performance, since every data beat becomes and
AXI burst in that case.
AXI/data throughput
The core has a one-cycle overhead per packet.
Meaning that for each packet, the input stream
will stall (stream_ready = 0
) for one
clock cycle.
This is assuming that AWREADY
and WREADY
are high.
If they are not, their stall will be propagated to the stream
.
This performance should be enough for even the most demanding applications. The one-cycle overhead could theoretically be optimized away, but it is quite likely that downstream AXI interconnect infrastructure has some overhead for each address transaction anyway. I.e. the one-cycle overhead in this core is probably not limiting the throughput overall.
If the memory buffer is full, the stream
will stall until there is space.
When the software writes an updated buffer_read_address
register indicating available space,
the stream
will start after two clock cycles.
AXI behavior
The core is designed to be as well-behaved as possible in an AXI sense:
AXI bursts of the maximum length possible will be used.
The
AW
transaction is only initiated once we have at least oneW
beat available.BREADY
is always high.
This gives very good AXI performance.
W channel block
Related to bullet point 2 above, the core does NOT accumulate a whole burst in order to
guarantee no holes in the data.
Meaning, it is possible that an AW
and a few W
transactions happen, but then the
stream
can stop for a while and block the AXI bus before the burst is finished.
This can be problematic if the downstream AXI slave is a crossbar/interconnect that arbitrates between multiple AXI masters.
It is up to the user to make sure that either,
The
stream
never stops within a packet, so that optimal AXI performance is reached.Or, the downstream AXI slave can handle holes without impacting performance. axi_write_throttle.vhd is designed to help with this.
AXI3
Setting the enable_axi3
generic will make the core compliant with AXI3 instead of AXI4.
The core does not use any of the ID fields (AWID
, WID
, BID
) so the only difference
is the burst length limitation.
dma_axi_write_simple_axi_lite.vhd
Top level for the simple DMA module, with an AXI-Lite register interface. This top level is suitable for instantiation in a user design. It integrates dma_axi_write_simple.vhd and an AXI-Lite register file.
See dma_axi_write_simple.vhd for more documentation.
Resource utilization
This entity has netlist builds set up with automatic size checkers in module_dma_axi_write_simple.py. The following table lists the resource utilization for the entity, depending on generic configuration.
Generics |
Total LUTs |
FFs |
Maximum logic level |
---|---|---|---|
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 1 |
156 |
207 |
16 |
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 16 |
157 |
226 |
12 |
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 2048 |
132 |
218 |
11 |
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 16384 |
134 |
218 |
10 |
address_width = 29 stream_data_width = 64 axi_data_width = 32 packet_length_beats = 16384 |
171 |
320 |
11 |
address_width = 29 stream_data_width = 64 axi_data_width = 128 packet_length_beats = 16384 |
198 |
410 |
11 |
dma_axi_write_simple_sim_pkg.vhd
Package with functions to simulate and check the DMA functionality.