Offloading CPU-2-FPGA
BTC
5G Baseband for L1 Offloading.
Key Zynq UltraScale+ RFSoC Portfolio Benefits:
- Integrated Soft-Decision Forward Error Correction (SD-FEC) cores
- LDPC codec (SD-FEC) to meet 5G standards and support for custom codes
- Turbo Decode (SD-FEC) for 4G LTE-Advanced and 4G LTE Pro
- DSP48-rich fabric (6,620 GMACs) provides high-performance filtering and encoding/decoding
- 33 Gb/s transceivers for 12.2G CPRI and expansion into 16G & 25G CPRI
Hardware programming for software developers:
Several disrupting
factors in the traditional microprocessors being the chip of choice for C-Programming algorithms.
These include cost and accessibility of cross-compilation tools, the processing power, the speed limitations of microprocessors, and the availability of more reliable building blocks.
These include cost and accessibility of cross-compilation tools, the processing power, the speed limitations of microprocessors, and the availability of more reliable building blocks.
This is based on hundreds of hours of class and lab testing.
These authors are willing to share their instructional materials, curricula, and advice for the readers here.
Problem identification:
Microprocessors continue to represent the largest "bang for the buck" and are at the center of most systems.
FPGAs are semi-custom, co-processing resource that is "picking off" the parallelization tasks from CPUs. FPGAs do this – at lower clock speeds and power – by deploying multi-core parallelism.</p>
HPRC:
High Performance Reconfigurable Computing (HPRC) as a branch of Computer Science is thriving.
HPRC Largely driven by GPGPU (general-purpose graphics processing unit) growth, HPRC is also supported by FPGA-based applications.
The programming environment is considered to be the main obstacle preventing FPGAs from being used to their full potential in accelerators. Thus, the need to gain familiarity with High Level Languages (HLLs) is inevitable.
A high-level language (HLL) is a programming language such as C, FORTRAN, or Pascal that enables a programmer to write programs that are more or less independent of a particular type of computer. Such languages are considered high-level because they are closer to human languages and further from machine languages.
Architectural differences in C for FPGAs vs. C for CPUs:
The C language that is refactored for FPGA, can be characterized as a stream-oriented programming language, and process-based language.
Processes are organized into main building blocks interconnected using streams to form the architecture for the desired hardware module.
From a hardware perspective, processes and streams are hardware modules and FIFOs (First In, First Out registers) respectively.
The C programming model is generally based on the Communicating Sequential Processes model.
Every process must be classified as a hardware or a software process. It is the programmer's responsibility to ensure inter-process synchronization.
Like all human readable HLLs, C does not provide access to the clock signal, which relieves the designer from implementing cycle synchronization procedures.
Understand, it is possible to attach HDL modules and synchronize them at the RTL, (Real Time Level), using clock signals. It is worth noting that C as a hardware design language does not permit dynamic resource allocation (e.g., "malloc()" and "calloc()").
The second unique language construct, besides being process-oriented, is stream orientation. Streams are unidirectional and can interconnect only two processes, which imposes restrictions on hardware module architectures designed in C.
Since pipelines can become a source of deadlocks, the designer particularly needs to consider mechanisms to avoid them. Unfortunately, occurrences of deadlocks are difficult to trace during simulations since the "#pragma co pipeline" C-to-HDL compiler directive is ignored during software simulation.
These problems are usually revealed after implementation when the module is tested in hardware.
In addition to streams and processes, C as a design method provides signals and semaphores. These structures are used for inter-process synchronization. The best practice is often to implement pure pipeline modules, with the lowest possible number of synchronization signals.
Software processes are converted to multiple streaming hardware processes where they use streams, signals, or memory for synchronization. |
HLLs are used for purposes of data type flexibility in terms of HDL module integration use.
Typically, there will be a range of data structures available such as co_int2, co_int32, co_uint1, co_uint32, etc.
These constructs are also a source of inconsistency between the software and hardware implementations.
Prior to FPGA implementation, all hardware modules should be simulated on a GPP (general purpose processor) where their data structures are mapped on the types available on the GPP.
Unfortunately GPPs use limited sets of data types, so each time a simulation is performed, the data is extended to the nearest wider data type, which affects intrinsic computation precision.
This occurs unless a dedicated macro is used (e.g. "UADD4()" and "UDIV20()"); thus, using macros is encouraged.
Special attention must be paid to functions, since they highly simplify modular implementation, since this is the common design strategy.
The following pragmas are useful: "co inline," "co implementation," "co unroll," "co pipeline," and "co set."
These allow module shaping, providing a set of restrictions. By example, using "inline" in a function body enables the compiler to freely modify the internal architecture of the module.
if not implimented, the function is treated as a uniform module, which cannot be modified by the compiler.
Static recurrence is permitted and proves to be a useful structure in many applications such as a binary tree implementation with the "add_tree()" command is needed.
The limitations of using C for hardware result mostly from the exceptions of the adaptation of ANSI C to hardware design. Notable examples include:
- Lack of dynamic recurrence
- No support for unions
- Dynamic memory allocation is unsupported (free(), malloc()) in hardware
- Limited support of pointers
- A pointer may only point to one block of memory
- Pointers must be determined at compilation time
Several techniques to optimize performance of the implemented hardware modules that may work.
Reading and writing to streams can be implemented in several ways and one is to provide an efficiant pipeline performance (1 cycle per single operation).
Data access conflicts may also contribute to significant reduction of expected performance. A solutions, is important to debug such conflicts by memory duplication or multi-object-optimization by table scalarization (using the "co_array_config()" instruction).
The number of combination logic levels should be kept reasonably low, which can be achieved with a single pragma parameter (e.g., "co set Stage Delay 32").
Stage Delay Analysis provides the tools needed to see how decisions made in C algorithms will propagate in logic and clock cycles. |
As a general rule, it is recommended to use appropriate data structures to maximize data throughput.
The C-to-FGPA IDE delivers a range of tools which facilitate flexable debugging.
One convenient tool is the Stage Master Explorer (SME), which may be used to examine code and pinpoint throughput bottlenecks.
The measured performance in the SME is expressed through a set of four parameters, which characterize the digital module: Latency, Rate, Max. Unit Delay, and Effective Rate.
Two simple teaching examples:
1.) Implementation of an FPGA-accelerated HASH function using C:
In this exercise exampled below, the programmer's, task was to implement a hash function in C.
The algorithm could not be implemented in a naive way (by copying and pasting it inside the hardware main function body), but it had to be re-coded and optimized as hardware.
The first step was to run a given hash function on a GPP to produce a reference output for the given input.
The reference code in ANSI-C is shown below.
Every student-programmer had this same input vector, but different functions, preventing plagiarism. Such solutions have the additional advantage of verifying if the whole solution is correct.
To do so, a programmer must only check the correctness of one output number.
Noted that the interface between the CPU and FPGA is 32 bits wide, and equates to four-bytes of data is transferred at once to the hardware module to minimize bottleneck via system throughput.
The programmers results differed in quality, but the best ones used pipelined operations, which resulted in a high throughput with slightly higher latency than for non-pipelined cases programming solutions.
A few student-programmers unrolled an internal for loop using the pragma "CO UNROLL" directive.
By using pipelined operations, they reduced the hashing function execution time N times and increased output N times. The disadvantage of such an approach was the high usage of available logic resources for high over-all data-flow.
#2 A Prime Number Generator using C:
Another tutorial began by asking student-programmers to find algorithms in literature used as Prime Number Generators (PNGs).
This knowledge was used during the following hands-on exercises.
The students also had to write their own module(s) in VHDL, which were used in the PNG.
The module could be a trivial one but – if so – the number had to be greater than one (e.g., the operation of square and incrementation by constant factor).
The modules had to be 100% compatible with the C external module standards and verified as accurate in a test-bench.
The students got their guidelines about the communication interface connecting the CPU and FPGA (using a 32-bit bus) and software design was a user interface that collects information regarding the lower and upper bounds between which the prime numbers had to be found.
This information had to be sent to the hardware module using a stream, after which the hardware process started to run. The algorithms selected by student-programmers generated proper primes and sent them to the software application, whose role was to present these results on-screen and write them into a text file.
PNG was used as an example for two reasons:
- It is hard to write such a function in pure VHDL, which makes the advantages of C more visible
- Every algorithm that can be used to generate primes is composed of many operations. Therefore it is easier to pick one and implement it in VHDL, avoiding duplication across students within the same group.
Some students decided to implement PNG in naive way using algorithms called trial division; that is, by checking if a number can be divided, without a remainder, only by itself. As a "reward," they had to implement some "nasty" operation in VHDL, like a modulus or a square root. Many students decided to use the Sieve of Eratosthenes; a few students decided to use the "Fermat's 4k+1" and "Euler's 6k+1" algorithms to check if a number is a prime. The best method was one proposed by student Grzegorz Glowka BSc due to its adaptation of all three previously mentioned algorithms. Student Glowka observed that -- in some intervals -- some methods are more efficient than others, and he implemented his PNG in such a manner as to leverage this fact.
The lessons from the exercise were as follows:
- Implement everything in C and perform software simulations, then run working applications and tests that measure their performance in the hardware
- Find weaknesses of C-to-HDL translation and eliminate them by replacing them with regards to HDL blocks. Regenerate the whole design, implement it in hardware, and then run tests and measure their performance.
- Consider rewriting key elements in HDL to tune performance.
- Rewrite the whole hardware part into HDL, or only those parts that are responsible for data processing, leaving C to interface and transmit data, etc.
Results:
Since we are teachers, our students are our "results."
We are happy to report that both their grades were up (from prior years of similar coursework) and that students felt a little more prepared for corporate life with more hands-on method of approach.
The class continues to develop as the institution frame-work provides.
Acknowledgments:
The authors would like to acknowledge the assistance given by Brian Durwood of Impulse Accelerated Technologies in the preparation of this article.
About the authorsGrzegorz Gancarczyk was born in Nowy Sacz, Poland, in 1984. He received an MSc degree in the field of electronics from the AGH University of Science and Technology (AGH-UST), Krakow, Poland, in 2009.
Since 2009, he is with the Academic Computer Centre (ACC) CYFRONET AGH, Krakow, Poland and now also with the Department of Electronics, AGH-UST, Krakow, Poland. His research interests include engineering education, statistics, stochastic processes, phenomenon of noise, digital signal processing and hardware acceleration of numerical methods. You can contact Grzegorz at gegula@agh.edu.pl
Maciej Wielgosz was born in Krakow, Poland, in 1979. He received his MSc and PhD degrees in the field of electronics from the AGH-UST, Krakow, Poland, in 2005 and 2010, respectively.
Since 2005, he is with the ACC CYFRONET AGH, Krakow, Poland and since 2009 also with the Dept. of Electr., AGH-UST, Krakow, Poland. He has published over 40 papers in journals and conferences and also one book: "FPGA implementation of the selected floating point operations" (Warszawa: Akademicka Oficyna Wydawnicza EXIT, 2010). His research interests include educational issues in electronics, data compression, neural networks and hardware acceleration of computations. You can contact Maciej at wielgosz@agh.edu.pl
Kazimierz Wiatr was born in Tarnow, Poland, in 1955. He received MSc and PhD degrees in the field of electrical engineering from the AGH-UST, Krakow, Poland, in 1980 and 1987, respectively, D. Hab. (habilitation) degree in electronics from the University of Technology of Lodz, Lodz, Poland, in 1999. Professor degree in 2002.
Since 1980, he works at the Dept. of Electr., AGH-UST, Krakow, Poland. Head of Reconfigurable Computing Systems Group. Since 2004 director of the ACC CYFRONET AGH. Since 2006 chairman of the board of PIONIER - Polish Optical Internet - Consortium. Between 1998 and 2002 adviser to the Prime Minister of Poland on "educational and upbringing of the young generation".
Managed 9 Polish Scientific Research Committee research grants. His works resulted in over 200 publications, 19 books, 5 patents and 35 industrial implementations.
Achieved Polish Science and Higher Education Minister's Award. Has been involved with youth education for more than 30 years. One of the founders of the Polish independent scouting movement. His research interests include educational issues, processes automation, image systems, multiprocessor and many core systems, reconfigurable devices and hardware methods of calculations accelerating.
Prof. Wiatr was appointed in 2007 to a chairman of Tarnow Scientific Society. Member of the Polish Information Processing Society, European Organization for Information and Microelectronics (EUROMICRO). In the Sixth and Seventh Term Senate was a chairman of the Science, Educational and Sport Committee.
Reviewer in the IEEE Expert Magazine, IEE Computer and Digital Techniques, IEE Electronic Letters, IEEE Transactions on Neural Networks, Eurasip Journal on Applied Signal Processing, Journal Machine Graphics and Vision. Prof. Wiatr can be contacted at wiatr@agh.edu.pl
If you found this article to be of interest, visit Programmable Logic Designline where – in addition to my Max's Cool Beans blogs – you will find the latest and greatest design, technology, product, and news articles with regard to programmable logic devices of every flavor and size (FPGAs, CPLDs, CSSPs, PSoCs...).
Also, you can obtain a highlights update delivered directly to your inbox by signing up for my weekly newsletter – just Click Here to request this newsletter using the Manage Newsletters tab (if you aren't already a member you'll be asked to register, but it's free and painless so don't let that stop you [grin]).
OptEdit Article Domain-Source
Offloading CPU To FPGA Article ends here:
Door-closes session class over!