Module dma_axi_write_simple

This document contains technical documentation for the dma_axi_write_simple module. This module has a register interface, so make sure to study the register interface documentation as well as this top-level document. To browse the source code, visit the repository on GitHub.

This module contains an open-source Direct Memory Access (DMA) component for streaming data from FPGA to DDR memory over AXI. Sometimes called “AXI DMA S2MM”. The implementation is optimized for

very low resource usage, and
maximum AXI/data throughput.

Being simplified, however, it has the following limitations:

Can only write to continuous ring buffer space in DDR. No scatter-gather support.
Does not support data strobing or narrow AXI transfers. All addresses must be aligned with the AXI data width.
Uses a static compile-time packet length, with no support for partial packets.
Packet length must be power of two.

These limitations and the simplicity of the design are intentional. This is what enables the low resource usage and high throughput.

C++ driver

There is a complete C++ driver available in the cpp sub-folder in the repository. The class provides a convenient API for setting up the module and receiving stream data from the FPGA. It supports an interrupt-based as well as a polling-based workflow. See the header file for documentation.

Simulate and build FPGA with register artifacts

This module is controlled over a register bus, with code generated by hdl-registers. See Register interface for register documentation.

Generated register code artifacts are not checked in to the repository. The recommended way to use hdl-modules is with tsfpga (see Getting started), in which case register code is always generated and kept up to date automatically. This is by far the most convenient and portable solution.

If you dont’t want to use tsfpga, you can integrate hdl-registers code generation in your build/simulation flow or use the hard coded artifacts below (not recommended).

Hard coded artifacts

Not recommended, but if you don’t want to use tsfpga or hdl-registers, these generated VHDL artifacts can be included in the dma_axi_write_simple library for simulation and synthesis:

The first few are source files that shall be included in your simulation as well build project. The last one is a simulation file that shall be included only in your simulation project. These generated C++ artifacts can be used to control the module from software:

Warning

When copy-pasting generated artifacts, there is a large risk that things go out of sync when e.g. versions are bumped. An automated solution with tsfpga is highly recommended.

Register interface

This module is controlled and monitored over a register bus. Please see separate HTML page for register documentation. Register code is generated using hdl-registers based on the regs_dma_axi_write_simple.toml data file.

dma_axi_write_simple.vhd

View source code on GitHub.

$component dma_axi_write_simple is generic ( address_width : axi_address_width_t; stream_data_width : axi_data_width_t; axi_data_width : axi_data_width_t; packet_length_beats : positive; enable_axi3 : boolean; write_done_aggregate_count : positive; write_done_aggregate_ticks : positive ); port ( clk : in std_ulogic; --# {{}} stream_ready : out std_ulogic; stream_valid : in std_ulogic; stream_data : in std_ulogic_vector; --# {{}} regs_up : out dma_axi_write_simple_regs_up_t; regs_down : in dma_axi_write_simple_regs_down_t; interrupt : out std_ulogic; --# {{}} axi_write_m2s : out axi_write_m2s_t; axi_write_s2m : in axi_write_s2m_t ); end component;$

Main implementation of the simple DMA functionality. This entity is not suitable for instantiation in a user design, use instead e.g. dma_axi_write_simple_axi_lite.vhd.

Packet length

The packet_length_beats generic specifies the packet length in terms of number of input stream beats. When one packet of streaming data has been written to DDR, the write_done interrupt will trigger and the buffer_written_address register will be updated. This indicates to the software that there is data in the buffer that can be read.

Note

The packet length is a compile-time parameter. It can not be changed during runtime and there is no support for writing or clearing partial packets.

This saves a lot of resources and is part of the simple nature of this DMA core.

If the packet length specified by the user equates more than one maximum-length AXI burst, the core will perform burst splitting internally.

Data width conversion

The core supports data width conversion between the input stream and the AXI bus. If the stream_data_width and axi_data_width are not equal, width_conversion.vhd will be instantiated for a lightweight conversion. Note that axi_data_width must be the native width of the AXI port.

Resource usage

The core has a simple design with the goal of low resource utilization in mind. See Resource utilization for some build numbers. These numbers are incredibly low compared to some other implementations.

The special case when packet_length_beats is 1 has an optimized implementation that gives even lower resource usage than the general case. This comes at the cost of quite poor memory performance, since every data beat becomes and AXI burst in that case.

AXI/data throughput

The core has a one-cycle overhead per packet. Meaning that for each packet, the input stream will stall (stream_ready = 0) for one clock cycle. This is assuming that AWREADY and WREADY are high. If they are not, their stall will be propagated to the stream.

This performance should be enough for even the most demanding applications. The one-cycle overhead could theoretically be optimized away, but it is quite likely that downstream AXI interconnect infrastructure has some overhead for each address transaction anyway. I.e. the one-cycle overhead in this core is probably not limiting the throughput overall.

If the memory buffer is full, the stream will stall until there is space. When the software writes an updated buffer_read_address register indicating available space, the stream will start after two clock cycles.

AXI behavior

The core is designed to be as well-behaved as possible in an AXI sense:

AXI bursts of the maximum length possible will be used.
The AW transaction is only initiated once we have at least one W beat available.
BREADY is always high.

This gives very good AXI performance.

W channel block

Related to bullet point 2 above, the core does NOT accumulate a whole burst in order to guarantee no holes in the data. Meaning, it is possible that an AW and a few W transactions happen, but then the stream might stop for a while and block the AXI bus before the burst is finished.

This can be problematic if the downstream AXI slave is a crossbar/interconnect that arbitrates between multiple AXI masters.

It is up to the user to make sure that either,

The stream stopping within a packet is rare enough that sufficient AXI performance is reached.
Or, the downstream AXI slave can handle holes without impacting performance. axi_write_throttle.vhd is designed to help with this.

AXI3

Setting the enable_axi3 generic will make the core compliant with AXI3 instead of AXI4. The core does not use any of the ID fields (AWID, WID, BID) so the only difference is the burst length limitation.

`write_done` interrupt aggregation

The write_done interrupt bit is triggered every time a packet has been written to memory (see Register interface). If an interrupt-based control flow is used, this can lead to a high interrupt rate if packets are small but data rate is high. This can be problematic for the CPU.

The generics write_done_aggregate_count and write_done_aggregate_ticks can be used to aggregate the interrupt event so it is triggered more sparsely. See event_aggregator.vhd for details. Commonly, both generics are set to ensure that the interrupt is triggered:

Not too often (too little data per interrupt, too high overhead in interrupt manager).
Not too seldom (too much data per interrupt, buffer might fill up and stream might stall).
Not too delayed (too high latency in data flow).

dma_axi_write_simple_axi_lite.vhd

View source code on GitHub.

$component dma_axi_write_simple_axi_lite is generic ( address_width : axi_address_width_t; stream_data_width : axi_data_width_t; axi_data_width : axi_data_width_t; packet_length_beats : positive; enable_axi3 : boolean; write_done_aggregate_count : positive; write_done_aggregate_ticks : positive ); port ( clk : in std_ulogic; --# {{}} stream_ready : out std_ulogic; stream_valid : in std_ulogic; stream_data : in std_ulogic_vector; --# {{}} regs_m2s : in axi_lite_m2s_t; regs_s2m : out axi_lite_s2m_t; interrupt : out std_ulogic; --# {{}} axi_write_m2s : out axi_write_m2s_t; axi_write_s2m : in axi_write_s2m_t ); end component;$

Top level for the simple DMA module, with an AXI-Lite register interface. This top level is suitable for instantiation in a user design. It integrates dma_axi_write_simple.vhd and an AXI-Lite register file.

See dma_axi_write_simple.vhd for more documentation.

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_dma_axi_write_simple.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for **dma_axi_write_simple_axi_lite** netlist builds.
Generics	Total LUTs	FFs	Maximum logic level
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 1	156	207	16
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 16	157	226	12
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 2048	132	218	11
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 16384	134	218	10
address_width = 29 stream_data_width = 64 axi_data_width = 32 packet_length_beats = 16384	171	320	11
address_width = 29 stream_data_width = 64 axi_data_width = 128 packet_length_beats = 16384	198	410	11
address_width = 29 stream_data_width = 64 axi_data_width = 64 packet_length_beats = 1024 write_done_aggregate_count = 512 write_done_aggregate_ticks = 262144	156	247	12

dma_axi_write_simple_sim_pkg.vhd

View source code on GitHub.

Package with functions to simulate and check the DMA functionality.

Module dma_axi_write_simple

C++ driver

Simulate and build FPGA with register artifacts

Hard coded artifacts

Register interface

dma_axi_write_simple.vhd

Packet length

Data width conversion

Resource usage

AXI/data throughput

AXI behavior

W channel block

AXI3

write_done interrupt aggregation

dma_axi_write_simple_axi_lite.vhd

Resource utilization

dma_axi_write_simple_sim_pkg.vhd

`write_done` interrupt aggregation