# Hands On: Design Flow for Heterogeneous Embedded Computing Infrastructures

The tutorial is divided into three main parts. In the first part, we create a hardware accelerator compliant with the ARTICo<sup>3</sup> processing architecture using the MDC tool. The system bitstreams are then generated by Vivado, invoked through the ARTICo<sup>3</sup> toolchain. In the second part, a dataflow-based application is developed using PREESM, which automatically dispatches jobs among the available hardware resources (CPUs and/or FPGA slots). The last part shows how to generate and automatically instrument code with PAPIFY in order to monitor the whole hardware infrastructure.

# **1** Hardware Accelerators Creation

As in last year's tutorial<sup>1</sup>, on which this one is based, the use case is an edge detection application involving two different algorithms: Sobel and Roberts. The individual networks and HDL component libraries have been created using the CAPH<sup>2</sup> tool. Starting from their dataflow descriptions, MDC can merge them into a single multi-dataflow network (for more details, see the tutorial mentioned above).

#### **ARTICo<sup>3</sup>** Kernel Generation

To use the coarse-grain reconfigurable computing core generated by MDC, it is necessary to generate an ARTICo<sup>3</sup>-compliant kernel.

1. Launch the MDC executable, located in the folder /home/embedded/Desktop/MDC\_CPS/MDC\_tool/eclipse, and confirm the workspace with OK.

2. Import Project:

> File > Import... > General > Existing Project into Workspace

Browse to /home/embedded/Desktop/MDC\_CPS/MDC\_input/Tutorial\_EdgeDetection, then:

> OK > Finish

3. Create a new run configuration:

> Run > Run configurations...

then right-click on Orcc compilation and select New.

4. Fill in the following compilation settings, as shown in Figure 1.

Name: choose a name for the configuration (for instance "Tutorial\_EdgeDetection")

Project: select "Tutorial\_EdgeDetection"

Backend:

- Select a backend: MDC

- Output Folder: /home/embedded/Desktop/artico3/demos/mdc\_monitors

Options:

- Tick "List of Networks to be Compiled and Merged"
- Number of Networks: 2

<sup>1</sup>http://www.cpsschool.eu/wp-content/uploads/2018/09/Tutorial\_Multi-Grain-Reconfiguration-1.pdf

<sup>2</sup>http://caph.univ-bpclermont.fr

- XDF List of Files: select the two input dataflow networks: "edgeDetection.roberts" and "edgeDetection.sobel"

- Merging Algorithm: EMPIRIC
- Tick "Generate RVC-CAL multi-data flow" with "DUMMY" as option
- Select "Generate HDL multi-dataflow"
- Protocol file: MDC\_CPS/MDC\_input/protocol/protocol\_CAPH.xml
- HDL component library: MDC\_CPS/MDC\_input/HDL\_compLib (this folder must contain all the necessary HDL files)
- Tick "System Generation"
- Tick "ARTICo<sup>3</sup> Backend"
- Tick "Enable Monitoring" and select the last three monitors

5. Select Apply, then Run.

#### Output folder

The output folder contains:

- src/: includes all the necessary files to create the PAPIFY-monitored and ARTICo<sup>3</sup>-compliant CGR accelerator
- mdc-papi\_info.xml: describes the PAPI configurations of the MDC accelerator.
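Before moving on to the system implementation, it can be handy to confirm that the MDC run produced both entries listed above. The following is a minimal sketch: the helper name `missing_mdc_outputs` is hypothetical, and it only checks that the two entries exist, not what they contain.

```python
from pathlib import Path

# Entries that the tutorial lists for the MDC output folder.
EXPECTED_ENTRIES = ("src", "mdc-papi_info.xml")


def missing_mdc_outputs(output_folder):
    """Return the expected entries that are absent from output_folder."""
    root = Path(output_folder)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# Output folder configured in the MDC compilation settings.
folder = "/home/embedded/Desktop/artico3/demos/mdc_monitors"
missing = missing_mdc_outputs(folder)
print("Output folder looks complete" if not missing else f"Missing: {missing}")
```

An empty list means both `src/` and `mdc-papi_info.xml` were found.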

#### System Implementation

- **input**: HDL files generated by the MDC framework
- **output**: bitstreams of the synthesized system

Let's run the synthesis and the bitstream generation using the ARTICo<sup>3</sup> toolchain and a build.cfg configuration file. This file can be created from scratch with the options shown in Fig. 2, but one is already provided in the output folder /home/embedded/Desktop/artico3/demos/mdc\_monitors; tune its options to your own needs.

1. Open a terminal in the output folder (/home/embedded/Desktop/artico3/demos/mdc\_monitors); the following commands are launched from there.

2. Set up the ARTICo<sup>3</sup> environment by running:

   \$ source /home/embedded/Desktop/artico3/tools/setting.sh

3. Generate the RTL system:

   \$ a3dk

   \$ export\_hw

4. Build the system (this step is SKIPPED DURING THE TUTORIAL):

   \$ build\_hw

The bitstreams will be created in the folder .../mdc\_output/build.hw/bin/. At this point, they should be copied to the filesystem of the target device: the ARTICo<sup>3</sup> runtime functions are in charge of managing the FPGA reconfiguration. All the necessary steps are detailed on the ARTICo<sup>3</sup> website<sup>3</sup>.

Optional:

<sup>3</sup>https://des-cei.github.io/tools/artico3/tutorials/setup#execute-on-target-platform

| Field | Value shown in the GUI |
|---|---|
| Name | Tutorial\_EdgeDetection |
| Project | Tutorial\_EdgeDetection |
| Select a backend | MDC |
| Output folder | /home/embedded/Desktop/artico3/demos/mdc\_monitors |
| Number of Networks | 2 |
| XDF List of Files | edgeDetection.roberts, edgeDetection.sobel |
| Merging Algorithm | EMPIRIC |
| CAL type | DUMMY |
| Protocol File | /home/embedded/Desktop/MDC\_CPS/MDC\_input/protocol/protocol\_CAPH.xml |
| HDL Component Library | /home/embedded/Desktop/MDC\_CPS/MDC\_input/HDL\_compLib |
| Checkbox options | List of Networks to be Compiled and Merged; Generate RVC-CAL multi-dataflow; Generate HDL multi-dataflow; System Generation; Generate Accelerator IP; ARTICo<sup>3</sup> Backend; Enable Monitoring; Enable Profiling |
| Monitors under Enable Monitoring | Monitoring of Input FIFO; Monitoring of Clock Cycles; Monitoring of Input Tokens; Monitoring of Output Tokens |

Figure 1: Compilation settings as in the MDC GUI

In order to connect the PYNQ board to your laptop, two options are available for the tutorial:

1. Use a serial connection on port USB1 with a baud rate of 115200.<sup>4</sup>

2. Use the Ethernet port of the PYNQ board, connected to your Local Area Network (LAN)<sup>5</sup> (or directly to your laptop with a cable). The PYNQ board can be set up with a static IP:

\$ ifconfig eth0 192.168.0.xxx
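When choosing the static address, the board and the laptop need to end up in the same subnet, otherwise the board will not be reachable without a router. The check below is a small sketch using Python's standard `ipaddress` module; the /24 prefix length and the sample addresses are assumptions for illustration, not values prescribed by the tutorial.

```python
import ipaddress


def same_subnet(board_ip, laptop_ip, prefix_len=24):
    """Return True if both IPv4 addresses belong to the same subnet.

    prefix_len=24 corresponds to the netmask 255.255.255.0 (an assumption).
    """
    network = ipaddress.ip_network(f"{laptop_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(board_ip) in network


# Hypothetical addresses, for illustration only.
print(same_subnet("192.168.0.10", "192.168.0.2"))  # True
print(same_subnet("192.168.1.10", "192.168.0.2"))  # False
```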

<sup>4</sup>**Teraterm** and **Putty** are two options.

<sup>5</sup>https://www.wikihow.com/Create-a-Local-Area-Network-(LAN)

```
# build.cfg
[General]
Name = MDC_filters
TargetBoard = pynq,c
TargetPart = xc7z020clg400-1
ReferenceDesign = mdc
TargetOS = linux
TargetXil = vivado,2017.1
CFlags = -O3

[A3Kernel@CGR_accelerator]
HwSource = verilog
MemBanks = 3
Regs = 8
RstPol = low

...
```

Figure 2: Configuration File Options
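The configuration file uses an INI-like layout, so its options can be inspected with Python's standard `configparser` before launching the toolchain. This is only a sketch: the sample text is abridged from Figure 2, the key spellings (`Name`, `TargetXil`) and the `A3Kernel@<kernel name>` section convention are read off that figure and should be treated as assumptions, and `summarize_build_cfg` is a hypothetical helper.

```python
import configparser

# Abridged sample modelled on Figure 2 (assumed key and section spellings).
SAMPLE_CFG = """
[General]
Name = MDC_filters
TargetBoard = pynq,c
TargetPart = xc7z020clg400-1
TargetOS = linux
TargetXil = vivado,2017.1

[A3Kernel@CGR_accelerator]
HwSource = verilog
MemBanks = 3
Regs = 8
RstPol = low
"""


def summarize_build_cfg(text):
    """Return (general_options, {kernel_name: kernel_options})."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the original capitalisation of the keys
    parser.read_string(text)
    general = dict(parser["General"])
    kernels = {
        section.split("@", 1)[1]: dict(parser[section])
        for section in parser.sections()
        if section.startswith("A3Kernel@")
    }
    return general, kernels


general, kernels = summarize_build_cfg(SAMPLE_CFG)
tool, version = general["TargetXil"].split(",")
print(tool, version)
print(sorted(kernels))
```

All values are returned as strings, so numeric options such as `MemBanks` need an explicit `int(...)` conversion if they are to be compared or computed with.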

To access the PYNQ OS command line, use ssh:

\$ ssh linaro@192.168.0.xxx

If you want full access to the PYNQ filesystem, the best option is to use Ubuntu's File Manager with the sftp protocol, as shown in Figure 3.



Figure 3: Ubuntu's File Manager to navigate the Pynq Filesystem

# 2 Hardware/Software Code-Generation Setup for Design Space Exploration

- inputs: bitstreams

- outputs: code ready to be compiled and executed

# 2.1 Parameterized and Interfaced Synchronous DataFlow (PiSDF)

The algorithm of the application is one of the main inputs of the method and needs to be specified. Because the algorithm description complies with a dataflow Model of Computation (MoC), the method can exploit the parallelism that such a description expresses intrinsically. For this purpose, the PiSDF MoC is used: a graph that connects *Actors* and *Parameters* through *FIFOs* and *Parameter dependency links*.
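To make the link between a dataflow graph and parallelism concrete, the sketch below solves the balance equations of a plain SDF graph, which is what a PiSDF graph reduces to once its parameters are fixed to values. The actor names are the ones that appear in the tutorial's scenario, but the chain topology, the token rates and the `N_TILES` value are assumptions made for illustration; the actual graph is the one shown in PREESM.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, fifos):
    """Solve the SDF balance equations of a connected graph.

    fifos: list of (producer, tokens_produced, consumer, tokens_consumed).
    Returns the smallest positive integer number of firings per actor.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in fifos:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
            elif src in rate and dst in rate and rate[src] * prod != rate[dst] * cons:
                raise ValueError(f"inconsistent rates on FIFO {src} -> {dst}")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer one.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * scale) for a, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {a: c // common for a, c in counts.items()}


N_TILES = 4  # hypothetical parameter value
actors = ["Read_Image", "Tiling", "Filter", "Merge", "Save_Image"]
fifos = [
    ("Read_Image", 1, "Tiling", 1),
    ("Tiling", N_TILES, "Filter", 1),  # one token per tile
    ("Filter", 1, "Merge", N_TILES),
    ("Merge", 1, "Save_Image", 1),
]
print(repetition_vector(actors, fifos))
```

With these assumed rates, `Filter` fires four times per graph iteration while every other actor fires once. Those four firings are independent of each other, which is the kind of parallelism that can be spread over several processing elements.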

#### 1. Open PREESM:

Within the folder

> /home/embedded/Desktop/preesm-3.17.0.201909161224-linux.gtk.x86\_64/

open PREESM by double-clicking on

> eclipse

#### 2. Import the template project:

The project created for this tutorial is located within the folder

> /home/embedded/Desktop/preesm\_project/tutorial

In the > Project Explorer panel, click on the > Import projects.

Then, in the appearing wizard window, select:

> General > Existing Project into Workspace > Next

Select root directory:

> /home/embedded/Desktop/preesm\_project/tutorial/tutorialSummerSchoolFixedTile

and press OK and Finish:


Figure 4: Select Project

Within the folder Algo, the algorithm is described using the PiSDF MoC. Within the folder Archi, the hardware architecture is described using the S-LAM. Further information can be found on the PREESM web page<sup>6</sup>.

3. **Open the PiSDF**: Within the folder > Algo, double-click on the .diagram file: the PiSDF of the image processing algorithm will be displayed (Fig. 5).

# 2.2 S-LAM

The specific device used to test the application needs to be described, since it serves as an input for the mapping and scheduling of the application onto the architecture. For this purpose, the S-LAM is used as an abstract platform model.

1. **Open the S-LAM**: Within the folder > Archi, double-click on the ARTICo3\_4.slam file: the architecture model will be displayed (Fig. 6).

In Fig. 6, the blue boxes are the Processing Elements (PEs) and the pink boxes are the memories of our architecture. The board used for the tutorial is a PYNQ equipped with a Zynq device, which comprises two ARM Cortex-A9 cores and a Xilinx FPGA.

<sup>6</sup>https://preesm.github.io/



Figure 5: PiSDF of the image processing algorithm.



**Figure 6:** Architecture: one CPU and four ARTICo<sup>3</sup> slots.

The S-LAM models one CPU core (Core0) and four ARTICo<sup>3</sup> slots (Slot1 to Slot4).

2. The S-LAM can be modified by using the palette on the right side of the screen.

PREESM lets you specify the nature of each PE within the S-LAM in order to describe heterogeneous systems. To do so, choose a PE and select the tab > Properties > Basic at the bottom of the screen (Fig. 7). Within the > definition property, it is possible to set the PE to:

• **ARM**: it generates code ready to be compiled and executed on a CPU.

| Property   | Value    |
|------------|----------|
| definition | Hardware |
| hardwareId | 3        |
| id         | Slot1    |
| refinement |          |

Figure 7: Properties tab on the bottom of the screen.

• **Hardware**: it generates code, ready to be compiled, that offloads part of the processing to the FPGA side (by making use of hardware accelerators).

In the tutorial, the proposed S-LAM has one CPU and four ARTICo<sup>3</sup> slots (Fig. 6).

# 2.3 Scenario

The **Scenario** is the last input for the mapping and scheduling within PREESM. It provides additional information: optional affinities that force actors to execute on specific processing elements, the data size of the FIFO tokens, the execution timings of the actors, etc. A detailed explanation of all the features available in the **Scenario** can be found online<sup>7</sup>.

Let's open the scenario: in the Project Explorer tab, double-click on

Scenarios > pynq4slots.scenario

as shown in Figure 8.

Let's now set up the input files for the PiSDF and the SLAM by clicking on Browse and by choosing:

- PiSDF: FixedTileSize.pi
- S-LAM: ARTICo3\_4.slam

| Scenario field (Overview tab) | File path |
|-------------------------------|-----------|
| Algorithm file path           | /tutorialSummerSchoolFixedTile/Algo/FixedTileSize.pi |
| Architecture file path        | /tutorialSummerSchoolFixedTile/Archi/ARTICo3\_4.slam |

Figure 8: Scenario overview.

<sup>7</sup>https://preesm.github.io/tutos/

Some details of the tab > PAPIFY are analyzed in the last part of the tutorial. Let's now focus on the tab > Constraints, as highlighted in Fig. 9:

| Element of the Constraints tab | Content |
|--------------------------------|---------|
| Scenario tabs | Overview, Constraints, Timings, Simulation, Codegen, Parameters, PAPIFY, Energy |
| Selected operator | Core0 |
| Actors listed under FixedTileSize | Read\_Image, Save\_Image, Filter, Tiling, Merge |

Figure 9: Scenario in PREESM.

In this tab, we can assign the execution of a specific actor (or set of actors) to a specific PE (or set of PEs). Keep in mind that you can execute on the FPGA **only** actors whose behaviour has previously been synthesized using Vivado. If you assign any other actor to the FPGA side, the code generation will end without errors but, during execution, the software will not find the right bitstream to write into the Configuration Memory.

Since a hardware accelerator has been designed only for the *Filter* actor, let's set the *Constraints* as follows:

- Core0: enable all actors execution
- Slot1: select just *Filter*
- Slot2: select just *Filter*
- Slot3: select just *Filter*
- Slot4: select just *Filter*
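Because a wrong assignment to an FPGA slot only shows up at run time, it can be worth checking the intended constraints before generating code. The sketch below mirrors the settings listed above in a plain dictionary; it does not read the PREESM scenario file, and the helper name and the convention that hardware operators start with `Slot` are assumptions for illustration.

```python
# Actors for which a hardware accelerator exists; in this tutorial, only Filter.
ACCELERATED_ACTORS = {"Filter"}

ALL_ACTORS = {"Read_Image", "Tiling", "Filter", "Merge", "Save_Image"}

# Operator -> actors allowed to run on it, mirroring the Constraints tab.
CONSTRAINTS = {
    "Core0": set(ALL_ACTORS),
    "Slot1": {"Filter"},
    "Slot2": {"Filter"},
    "Slot3": {"Filter"},
    "Slot4": {"Filter"},
}


def invalid_hw_assignments(constraints, accelerated, hw_prefix="Slot"):
    """Return {operator: [actors]} for actors mapped to a slot without an accelerator."""
    problems = {}
    for operator, actors in constraints.items():
        if operator.startswith(hw_prefix):
            bad = set(actors) - set(accelerated)
            if bad:
                problems[operator] = sorted(bad)
    return problems


print(invalid_hw_assignments(CONSTRAINTS, ACCELERATED_ACTORS))  # {}
```

An empty dictionary means that every actor assigned to a slot has a matching accelerator.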

## 2.4 Design Space Exploration

It is possible to change the parameter values in the PiSDF, the S-LAM and/or the Scenario and execute the workflow as many times as you want. After executing the generated code on the target device, the effects of each change can be observed and collected, thus allowing a Design Space Exploration (DSE).
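One simple way to organise such an exploration is to list the candidate configurations up front. The sketch below enumerates, as an assumed example of a design space, the sets of operators on which the *Filter* actor may run (with or without Core0, on the first k slots). It only produces the list: applying each point in the Scenario, running the workflow and measuring on the board remain manual steps.

```python
from itertools import product

SLOTS = ["Slot1", "Slot2", "Slot3", "Slot4"]


def design_points(slots=SLOTS):
    """Yield the operators allowed to run Filter, one list per design point."""
    for k, filter_on_cpu in product(range(len(slots) + 1), (True, False)):
        operators = (["Core0"] if filter_on_cpu else []) + slots[:k]
        if operators:  # Filter must be allowed on at least one operator
            yield operators


for point in design_points():
    print(point)
```

With four slots this yields nine design points, from `['Core0']` alone up to Core0 plus all four slots.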

# 3 Monitoring, Code Generation and Profiling

### **Monitoring Configuration**

The configuration is done in the PAPIFY tab of the scenario file. The resulting configuration is shown in Figure 10.

1. Import the monitoring info
   - Click on the Browse button
   - Select PAPI\_info.xml, available in tutorialSummerSchoolFixedTile/Code
2. In PAPIFY PE configuration, associate PAPI components with PE types
   - perf\_event ↔ x86
   - artico3 ↔ Hardware
3. In PAPIFY actor configuration, associate PAPI events with actors
   - Select the timing and PAPI\_L1\_DCM events for every actor
   - Select the artico3:::MDC CLOCK CYCLE event for the actor Filter

PAPIFY file path: /tutorialSummerSchoolFixedTile/Code/PAPI\_info.xml

PAPIFY PE configuration (each S-LAM processing element type is associated with its PAPI component):

| Component type \ PE type | x86 | Hardware |
|--------------------------|-----|----------|
| perf\_event              | YES | NO       |
| artico3                  | NO  | YES      |

PAPIFY actor configuration (each actor is associated with its events):

| Actor name \ Event name | Timing | PAPI\_L1\_DCM | PAPI\_L1\_ICM | PAPI\_TLB\_DM | PAPI\_TLB\_IM | PAPI\_HW\_INT |
|-------------------------|--------|---------------|---------------|---------------|---------------|---------------|
| FixedTileSize           | YES    | YES           | NO            | NO            | NO            | NO            |
| Read\_Image             | YES    | YES           | NO            | NO            | NO            | NO            |
| Save\_Image             | YES    | YES           | NO            | NO            | NO            | NO            |

Figure 10: PAPIFY configuration.
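The selection made in the PAPIFY tab amounts to two small mappings: one from PE type to PAPI component, and one from actor to monitored events. The sketch below writes them out as plain Python data to make the resulting selection explicit. It does not read or write any PAPIFY file, the helper name is hypothetical, and the artico3 event name is copied as printed in this tutorial, so its exact spelling on the target may differ.

```python
# PAPI component per PE type, as set in the PAPIFY PE configuration.
COMPONENT_FOR_PE_TYPE = {"x86": "perf_event", "Hardware": "artico3"}

# Events selected for every actor.
COMMON_EVENTS = ["Timing", "PAPI_L1_DCM"]

# Extra events per actor. The event name is copied as printed in the
# tutorial text; treat its exact spelling as an assumption.
EXTRA_EVENTS = {"Filter": ["artico3:::MDC CLOCK CYCLE"]}


def events_per_actor(actors, common=COMMON_EVENTS, extra=EXTRA_EVENTS):
    """Return {actor: [events]} combining common and actor-specific events."""
    return {actor: list(common) + list(extra.get(actor, [])) for actor in actors}


selection = events_per_actor(["Read_Image", "Tiling", "Filter", "Merge", "Save_Image"])
for actor, events in selection.items():
    print(actor, events)
```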

### **Code Generation**

- Right click on Codegen.workflow available in Workflows folder
- Click on Preesm > Run Workflow
- Select pynq4slots.scenario from tutorialSummerSchoolFixedTile/Scenarios folder

#### Compile and Execute on Pynq Board

- Copy the complete tutorialSummerSchoolFixedTile/generated/Code folder to the PYNQ board
- Set up compilation and execution: source compile\_and\_setup.sh
- Go to the execution directory: cd /home/linaro/mdc\_summer\_school/bin
- Execute the application: <code>./summerSchoolFixedTile</code>

#### **Profiling analysis**

- On your laptop, after installing the PAPIFY-VIEWER tool<sup>8</sup>, open its containing folder:

cd /home/embedded/Desktop/papify/PapifyViewer

- Launch the PAPIFY-VIEWER tool: python PapifyViewerDynamic.py

- In the *Choose Folder* option, select the *papify-output* folder. It can be found in the folder from which the application is executed; in our case, /home/linaro/mdc\_summer\_school/bin/papify-output.

- Select the *Cores fixed* option to visualize the execution timing of the application. The result should be equivalent to the one shown in Figure 11.



Figure 11: Application execution timing in PAPIFY-VIEWER.

<sup>8</sup>How to install Papify-Viewer: https://gitlab.citsem.upm.es/papify/papify/tree/master/PapifyViewer