Good Grief, Charlie Brown, I am feeling somewhat fatigued. I’m also feeling a tad confused because I now have so many articles linking back and forth to each other that it’s making my head spin.
To briefly summarize the current state of play in a crunchy nutshell, I’m in the process of penning a series of Arduino Bootcamp columns for Practical Electronics (PE), which is the UK’s premier electronics and computing hobbyist magazine.
Since the editor of that illustrious tome has instructed me to focus on nitty-gritty hands-on stuff, I’m writing ancillary articles to capture and share a lot of super-interesting contextual material.
Meanwhile, over on EE Journal, I recently posted Arduino 0s and 1s, LOW and HIGH, False and True, and Other Stuff, which took a rather deep dive into some really interesting stuff that people don’t talk about as much as perhaps they should.
I just finished writing another column for EE Journal called Mysteries of the Ancients: Binary Coded Decimal (BCD). It has to be acknowledged that this one is a whopper (36 pages in Word on my computer). This has taken me days of effort, not least creating the 30+ diagrams I needed to help explain things (two of those diagrams are shown here to give you a tempting taster of what is to come).
I didn’t set out to write so much, but new thoughts along the lines of “I must mention this” and “I wonder if they know that” kept on popping into my mind.
For example, as part of my BCD column, I wanted to cover the different ways folks perform borrow operations in the USA and the UK while performing decimal subtractions. Since that paper was growing so large, I decided to post those discussions as Neither a Borrower nor a Lender Be here on the Cool Beans Blog.
It’s funny how many scientists and professional engineers assume that BCD is a relic from far-off times of interest only to geeks and nerds like your humble narrator (I pride myself on my humility). In reality, this form of encoding continues to be employed far more than you might think. For example, while writing my BCD column, I took a break to chat with my friend Jonny Doin, who is the Founder and CEO at GridVortex Systems in Brazil.
When I told Jonny what I was working on, he replied: “What a coincidence! I just passed the last 10 hours working on porting my single-clock-cycle binary-to-BCD logic to a full serial processing logic implementation as part of my FPGA embedded framework.”
I asked Jonny to tell me more, and he sent me a great write-up. Unfortunately, by this time my BCD column was so large that there was no way to squeeze this in, so I decided to post it here as a Cool Beans Blog. The following is what Jonny sent to me. Enjoy!
At GridVortex, we offer deeply embedded engineering design and consulting services for safety-critical and mission-critical systems. When I say safety- and mission-critical, I mean the sort of systems that, if they fail, people can die, which means we need them not to fail.
During the last two decades, we have designed systems that are running in the field continuously, having amassed more than two million operating hours. These are instrumentation and measurement systems that are running every day in fully automated industrial plants, laboratories, and businesses. We are known by some as “The fail-safe bare-metal guys.”
Having come from the circuit design world, our firmware was always very much influenced by RTL logic design, and you can see from the similarity between our C code and VHDL how we think in a logic-partitioning and aspect-oriented approach.
One of the things I love as an embedded designer is writing boilerplate code. Logic infrastructure. Also called “computer machinery.” Besides embedded firmware in C, we also do RTL logic design and modeling and FPGA circuit design. We have quite a few FPGA test boards in the lab with really powerful chips, but I wanted to have a platform to design and implement embedded systems directly in FPGA logic. I wanted something that did not involve a soft processor, but rather used the logic resources in the FPGA to implement generic boilerplate computer machinery that supports the embedded system logic written as top-layer modules.
During the COVID pandemic I started to create this framework targeted primarily at large FPGAs, but also at small and low-cost technologies. I purchased a few ICE40UP5K chips and boards and started to implement our embedded systems core in VHDL.
Our system is a streams-oriented system. Think about a Unix machine, where everything is a stream. You have stdin, stdout, files, and physical devices. Everything is a stream, so you can have a serial terminal command line interface (CLI) written for a serial port and reuse it encapsulated in a pipe. Or you can take input from one file and generate JSON frames into another file. Or you can send graphics commands to a user interface, serially.
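As a small software illustration of the "everything is a stream" idea, here is a Python sketch in which the same command handler works with any text stream. The handler name run_cli and the JSON frame layout are invented for this example and are not part of the GridVortex framework.

```python
import io
import json


def run_cli(src, dst):
    """Read one command per line from src and write one JSON frame per line to dst.

    src and dst can be any text streams: a serial port wrapper, a pipe,
    a file, or an in-memory buffer. The handler neither knows nor cares.
    """
    for line in src:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        frame = {"cmd": parts[0], "args": parts[1:]}
        dst.write(json.dumps(frame) + "\n")


# Here the "terminal" is an in-memory stream; sys.stdin and sys.stdout
# (or two open files) could be passed in exactly the same way.
out = io.StringIO()
run_cli(io.StringIO("led on\nread adc 3\n"), out)
frames = [json.loads(text) for text in out.getvalue().splitlines()]
```

Because the handler only sees stream objects, swapping the serial port for a pipe or a file needs no change to the command logic.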
All of this is possible in a Unix system due to the computer machinery of the operating system. It is written in such a way that from the perspective of each process, the system simply “has” all those logic layers existing for the process to use. In reality, software is a virtual abstraction of computer machinery run serially by a magic component called a central processor. What we did was to take the processor out of the system.
The advantage of approaching embedded systems using pure logic and distributed computer machinery is that we get a more robust system by eliminating the CPU as a single point of failure. We also get more energy-efficient systems because we can run with clocks of 10 MHz or less and obtain the same performance, latency, and real-time response as a 200+ MHz 32-bit processor. The only downside is that we use more logic.
Writing an embedded-systems-ready FPGA framework is exactly that, implementing the computer machinery needed for the embedded logic to behave in a streams-oriented environment. It also means that all things you take for granted have to be built from thin air.
One such thing is at the core of every computer system in existence—an infrastructure responsible for converting numeric information back and forth from binary to human-readable text. This is one of the oldest and most important pieces of boilerplate logic of all time: the printf() family of functions. Think about it. To truly be able to call my system an “embedded framework,” I needed to implement printf() functionality in logic.
At the heart of printf() numeric conversion functions is BCD. The core of numeric conversion and ASCII representation is essential in printf() machinery. My first PRINTF() implementation was targeted to high-performance, single-clock-cycle streams, capable of generating Gbps text streams. The BCD functions were large single-clock functions capable of processing 32-bit and 64-bit numbers in parallel. However, when it came to implementing this functionality in ultrasmall FPGAs where every look-up table (LUT) counts, a change in the BCD conversion algorithm was required. In this case, I needed a fully serial algorithm.
One very popular algorithm for BCD conversion is known by the nickname “double-dabble.” This involves a simple decimal-shift-register, which takes a serial stream of binary bits and applies these bits to a BCD-encoded shift register. The most popular form of the algorithm is known as a “shift_and_add_3,” in which you shift the bits and apply a correction to every digit larger than 4, so the digit will generate a “decimal carry over.” The problem with the “add 3” algorithm is that it does not scale well for arbitrary binary widths, but we needed fully generic logic.
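To make the shift-and-add-3 idea concrete, here is a minimal Python sketch of the textbook double-dabble loop. It is a software model only, not the GridVortex logic; the function name double_dabble and its width and num_digits parameters are made up for this illustration.

```python
def double_dabble(value, width=32, num_digits=10):
    """Convert an unsigned binary value to packed BCD (4 bits per digit).

    Textbook shift-and-add-3: before each 1-bit shift, every BCD digit
    larger than 4 gets 3 added, so that the shift (a multiply by two)
    produces a proper decimal carry into the next digit.
    """
    bcd = 0
    for i in reversed(range(width)):  # feed the binary bits in, MSB first
        for d in range(num_digits):
            digit = (bcd >> (4 * d)) & 0xF
            if digit > 4:
                bcd += 3 << (4 * d)  # 5..9 becomes 8..12, still one nibble
        bcd = (bcd << 1) | ((value >> i) & 1)
    return bcd


# Packed BCD printed in hexadecimal reads like the decimal number.
result = format(double_dabble(255), "x")
```

Note that every digit is inspected and corrected on every bit, which is why a fully parallel version needs one correction circuit per digit.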
Thus, we implemented a slightly different algorithm, which applies a pre-bias to all digits prior to the multiplication, that is, before each binary shift. The algorithm is simple and was implemented as a single-clock layer in the high-speed version of the BCD core.
The fully serial version needed to save on LUTs, so the pre-bias circuit was reduced to a single digit. Now we have 4-bit input and 4-bit output lookup tables that map nicely to 4 LUT4s, as opposed to 40 LUT4s for a 32-bit binary-to-BCD conversion. To maintain our savings, though, we need to avoid more LUTs. This means we need some way of making our single-digit LUT process every decimal digit without using indexing (variable indexes generate multiplexers along with lots and lots of LUTs). Instead, I implemented a barrel shifter organized as a circular ring register. Barrel shifters are made out of pure wires with no logical LUTs involved (the mapper can still use LUTs as routing though).
At each clock, the ring register shifts 4 bits up from the least significant BCD digits to the most significant BCD digits, so if you apply 10 clocks, you have your 10 digits circulated. The last digit (the most significant BCD digit) is recirculated back into the first digit (the least significant), through our single digit pre-bias LUT, so pre-biasing all 10 digits requires 10 clocks. Of course, this means we can no longer process Gbps text streams, but our embedded system can make do with ~3.5 Mbps per stream.
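The single-digit scheme can be modeled in a few lines of Python. This is a sketch under stated assumptions rather than a description of the actual VHDL: it guesses that the pre-bias LUT is the usual "add 3 when the digit is 5 or more" mapping, and that one extra step per input bit performs the binary shift once the 10 ring clocks are done. The names PREBIAS_LUT, ring_clock, shift_in_bit, and serial_bin2bcd are hypothetical.

```python
NUM_DIGITS = 10  # enough for a 32-bit value (maximum 4294967295)

# Assumed pre-bias table: 4 bits in, 4 bits out. Digits 5..9 get +3.
# Entries 10..15 never occur in valid BCD and are don't-cares.
PREBIAS_LUT = [(d + 3) & 0xF if d >= 5 else d for d in range(16)]


def ring_clock(digits):
    """One clock: every digit moves up one place, and the most significant
    digit recirculates through the single LUT into the least significant slot."""
    return [PREBIAS_LUT[digits[-1]]] + digits[:-1]


def shift_in_bit(digits, bit):
    """Shift the whole BCD register left by one bit, inserting bit at the bottom."""
    out, carry = [], bit
    for d in digits:  # digits[0] is the least significant digit
        v = (d << 1) | carry
        out.append(v & 0xF)
        carry = v >> 4
    return out


def serial_bin2bcd(value, width=32):
    """Return the BCD digits of value, least significant digit first."""
    digits = [0] * NUM_DIGITS
    for i in reversed(range(width)):
        for _ in range(NUM_DIGITS):  # 10 clocks pre-bias all 10 digits
            digits = ring_clock(digits)
        digits = shift_in_bit(digits, (value >> i) & 1)
    return digits


text = "".join(str(d) for d in reversed(serial_bin2bcd(1234)))
```

After 10 ring clocks every digit is back in its original position, having passed through the one shared LUT exactly once, so no digit indexing (and hence no multiplexer) is needed.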
This kept the BCD core reasonably small. For example, a PRINTF_BIN2ASC block, complete with leading spaces, a fixed decimal point, and a minus/plus sign, fits in under 140 LUTs. That enables the rest of the text-generating framework to handle binary-to-human-readable data on the fly, from an upstream string FIFO stream to a downstream FIFO stream, like a UART_TX stream.
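For readers who want to see what the last step from BCD digits to text involves, here is a hedged Python sketch. It is not the PRINTF_BIN2ASC logic itself; the function name bcd_to_text, its parameters, and the choice to always print a sign in a fixed-width field are assumptions made for this example.

```python
def bcd_to_text(digits, negative=False, frac_digits=0):
    """Format BCD digits (least significant first) as fixed-width ASCII text.

    Each digit becomes ASCII by adding 0x30 ('0'). Leading zeros are
    replaced with spaces, a sign is placed in front of the first digit,
    and a decimal point is inserted frac_digits places from the right.
    """
    chars = [chr(0x30 + d) for d in reversed(digits)]
    split = len(chars) - frac_digits
    int_part = "".join(chars[:split]).lstrip("0") or "0"
    text = ("-" if negative else "+") + int_part
    if frac_digits:
        text += "." + "".join(chars[split:])
    # Field width: all digits, one sign, and the decimal point if present
    width = len(digits) + 1 + (1 if frac_digits else 0)
    return text.rjust(width)


# The value 1234 as ten BCD digits, shown as a negative number with two decimals.
line = bcd_to_text([4, 3, 2, 1, 0, 0, 0, 0, 0, 0], negative=True, frac_digits=2)
```

The key point is that once the number is in BCD, producing ASCII is just a per-digit offset plus some bookkeeping for spaces, sign, and decimal point.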
By using FIFOs at every stream interface, and internal busses, we can have dynamic text generation, string buffers, and runtime selectable streams.
If you are interested in learning more about this, or about the systems we create at GridVortex, please feel free to contact me at Jonny Doin on LinkedIn.