wonderfly Blog

Elbert V2 FPGA Programming Tips

About

The Elbert V2 is an entry-level FPGA board manufactured by Numato Lab. It has a Xilinx Spartan 3A FPGA and 16 Mb of SPI flash memory. This post is about how to generate (“synthesize”) a bitstream for it and how to program that bitstream onto it, on Linux.

Install Xilinx ISE Webpack on Linux

The Xilinx ISE Webpack is the official (and only?) tool to generate bitstreams for Xilinx FPGAs. The download and installation process is full of roadblocks and sorely lacking in documentation, but in the end the Linux version worked out pretty well for me. The tool is free, though to download it you first need to create an account with Xilinx and get a free license (that lasts a year?). The installation package is composed of four individual tarballs, each about 2GB in size. Extract the first one, ending with .tar; a file called xsetup will be present in the root directory, which is the installer. Run it and it’ll guide you through the installation process. The installation is relatively painless, except… that when the installation completes, the wizard simply exits, leaving you wondering where the heck the program has been installed to.
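
To make those first steps concrete, here is roughly what they look like in a shell (the tarball and directory names below are placeholders for whatever you actually downloaded):

$ tar xf <ise-installer-part-1>.tar
$ cd <extracted-directory>
$ ./xsetup    # may need sudo, depending on where you choose to install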

Towards the end of the installation, it’ll ask you for an installation location, though at that location the installer will create a deep chain of directories and dozens of files. The final program, ISE, is at $INSTALL_DIR/14.7/ISE_DS/ISE/bin/lin64/ise (14.7 is the version of the ISE Webpack installed, so yours may be different, and lin64 is the 64-bit version; there is a 32-bit version too). One more thing you need to be aware of: before invoking that program, you’ll need to source a shell script installed at $INSTALL_DIR/14.7/ISE_DS/settings64.sh (again, note the version and 64-bit).
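
Putting the two together, launching ISE from a shell looks like this (adjust the version number and the 64-bit paths to match your install):

$ source $INSTALL_DIR/14.7/ISE_DS/settings64.sh
$ $INSTALL_DIR/14.7/ISE_DS/ISE/bin/lin64/ise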

Once you’ve worked around all that, the actual ISE program is nicely designed and runs pretty smoothly. You can create projects, load existing projects, generate bitstreams, etc.

Programming Elbert V2 SPI Flash

The Elbert V2 can be connected to a Linux PC via USB, and the standard cdc_acm Linux kernel module will create a pseudo tty device, e.g., /dev/ttyACM0, for the connected Elbert V2. To program a bitstream file (with the file extension .bin) to the Elbert V2, there is a Python script provided by Numato Lab: https://github.com/numato/samplecode/blob/master/FPGA/ElbertV2/tools/configuration/python/elbertconfig.py. The only thing to note here is the need to have the pyserial library installed; otherwise programming works like a charm: python elbertconfig.py /dev/ttyACM0 <bitstream-file.bin>.
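
For example (top.bin is a stand-in for whatever bitstream file your ISE project produced):

$ pip install pyserial
$ python elbertconfig.py /dev/ttyACM0 top.bin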

Intel NUC Tips and Tricks

Work around Ubuntu’s “blank screen” boot failure

In my experience this is not unique to Ubuntu (Fedora and Arch Linux also had a similar problem), but after a fresh install of Ubuntu on the NUC, if you reboot, your NUC will flash the “Intel NUC” BIOS screen for a quick second and then go blank indefinitely. Power cycling doesn’t seem to help. The fix is to append nomodeset to the kernel command line.

To do this, plug in a USB stick with a bootable image. Reboot and select the bootable image from the USB stick. Select “Try Ubuntu” instead of “Install Ubuntu” so you won’t wipe out the image already installed on your hard drive/SSD. Once booted into the live image, open a shell and mount the EFI and rootfs partitions of the on-disk OS:

$ mkdir root boot
$ sudo mount /dev/sda1 boot
$ sudo mount /dev/sda2 root
$ sudo vi root/boot/grub/grub.cfg
# Append 'nomodeset' to the kernel command line
$ sudo umount root
$ sudo umount boot
$ sudo reboot

NOTE: here I’m editing the file in the rootfs partition because the “root” GRUB config that lives in the EFI partition simply sources the one in the rootfs partition. If you’re ever unsure which file matters, start with the one in the EFI partition, because that’s what the BIOS will load, and follow where it points.
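
For reference, the line to look for inside grub.cfg is the one starting with linux; after the edit it should look something like this (the kernel version and UUID here are placeholders, not values from my system):

linux /boot/vmlinuz-<version> root=UUID=<uuid> ro quiet splash nomodeset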

Unplug the USB stick, reboot, and you should be able to see the boot continue as usual.

Use a desktop switch to connect your NUC and your primary desktop computer for faster file transfers

The NUC comes with a wifi interface, which is convenient for connecting to the Internet. But for data transfers between the NUC and your primary desktop, it’s best to use an Ethernet switch for more bandwidth and less variability. When I did this I saw a roughly 100x increase in bandwidth:

[I] ~> iperf -c 192.168.86.31 (wifi endpoint)
------------------------------------------------------------
Client connecting to 192.168.86.31, TCP port 5001
TCP window size:  129 KByte (default)
------------------------------------------------------------
[  1] local 192.168.86.250 port 53639 connected with 192.168.86.31 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.19 sec  12.1 MBytes  9.98 Mbits/sec

[I] ~> iperf -c 192.168.5.31 (ethernet endpoint)
------------------------------------------------------------
Client connecting to 192.168.5.31, TCP port 5001
TCP window size:  129 KByte (default)
------------------------------------------------------------
[  1] local 192.168.5.32 port 53640 connected with 192.168.5.31 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.04 sec  1.10 GBytes   940 Mbits/sec

Use X11 forwarding to run GUI applications from your primary desktop

If you are using the NUC in a headless configuration (i.e., no monitor connected) and yet need to run a GUI application from time to time, X11 forwarding over SSH can be a good option. On macOS, there is XQuartz. After installing XQuartz, run ssh nuc -Y (-X will most likely also work) and you will get an SSH session with X11 forwarding. If you then type, e.g., firefox, in that terminal, a Firefox window will open on your Mac.
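
If no window shows up, one thing worth checking (an assumption on my part; a default Ubuntu install may not need it changed) is that the NUC’s sshd allows X11 forwarding:

$ grep -i x11forwarding /etc/ssh/sshd_config
X11Forwarding yes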

I am not yet sure if this is better than VNC, but in my experience VNC gave me a gray screen with no content. Could be a mistake on my end, but I didn’t investigate further.

Comparing two unsigned 64-bit numbers

Suppose you are to write a C function that compares two uint64_t numbers. The function is supposed to return a negative number if the first argument compares less than the second, zero if they compare equal, or a positive number if the first is greater. Sounds pretty familiar, right? Many C library functions take a “comparator” function pointer that does exactly that. While it sounds trivial, there is a subtle gotcha that is worth writing a blog post about.

It is tempting to write a trivial comparator like this:

int compare1(uint64_t a, uint64_t b) {
  return a - b;
}

While it looks pretty neat, and seemingly satisfies the requirement, it is vulnerable to a common pitfall in arithmetic on large numbers: overflow! To demonstrate the problem with this implementation, let’s look at the following sample program:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  uint64_t a = 0;
  uint64_t b = UINT64_MAX;

  printf("a - b (as uint64_t) = %" PRIu64 "\n", a - b);
  printf("a - b (as int64_t) = %" PRIi64 "\n", (int64_t)(a - b));
  printf("a - b (as delta of int64_t's) = %" PRId64 "\n",
         (int64_t)a - (int64_t)b);
  return 0;
}

What do you think the output of the three printf statements would be?

~$ clang source.c && ./a.out
a - b (as uint64_t) = 1
a - b (as int64_t) = 1
a - b (as delta of int64_t's) = 1
~$

All three print a positive 1, even though a (0) is clearly smaller than b (UINT64_MAX). This is because 0 - UINT64_MAX wraps around the 64-bit range to become 1. As you can see, casting the result, or casting a and b individually, to signed integers doesn’t help, because UINT64_MAX is already larger than INT64_MAX. If a and b were 32-bit numbers you could up-cast them to 64-bit numbers to avoid the overflow problem, but for 64-bit numbers that trick doesn’t work any more.

So what is the fix? Well, don’t try to be clever with subtraction. Use plain old comparisons:

int compare2(uint64_t a, uint64_t b) {
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}
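
As an aside: if the goal is to plug this into a C library function like qsort, the comparator has to go through const void * pointers. Here is a minimal sketch (compare_u64 is my own name for the wrapper, not a standard function):

#include <stdint.h>
#include <stdlib.h>

/* qsort-style comparator for arrays of uint64_t: dereference the void
 * pointers, then use plain comparisons to avoid the overflow pitfall. */
static int compare_u64(const void *pa, const void *pb) {
  uint64_t a = *(const uint64_t *)pa;
  uint64_t b = *(const uint64_t *)pb;
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

/* Usage: qsort(values, count, sizeof(uint64_t), compare_u64); */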

Those who understand a little bit of assembly may wonder: don’t comparisons boil down to subtractions eventually, since a cmp instruction basically performs the subtraction and throws away the result anyway? While that’s true, the hardware is kind enough to keep a flags register that tracks overflows in arithmetic operations. In this case, since we are dealing with unsigned integer overflow, the carry flag will be set when we cmp 0 against UINT64_MAX. The compiler, in turn, is smart enough to read this flag after a cmp instruction, and will use an “overflow aware” jump instruction like jae (jump if above or equal, an unsigned comparison that takes the carry flag into account) to always get the correct comparison result.

This is evident in the compiler-generated assembly for the above two comparator implementations (generated with clang -S source.c):

	.p2align	4, 0x90         ## -- Begin function compare1
_compare1:                              ## @compare1
	.cfi_startproc
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	movq	%rdi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movq	-8(%rbp), %rax
	subq	-16(%rbp), %rax
                                        ## kill: def $eax killed $eax killed $rax
	popq	%rbp
	retq
	.cfi_endproc
                                        ## -- End function
	.p2align	4, 0x90         ## -- Begin function compare2
_compare2:                              ## @compare2
	.cfi_startproc
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	movq	%rdi, -16(%rbp)
	movq	%rsi, -24(%rbp)
	movq	-16(%rbp), %rax
	cmpq	-24(%rbp), %rax
	jae	LBB2_2                  <===== Notice the use of JAE here
## %bb.1:
	movl	$-1, -4(%rbp)
	jmp	LBB2_5
LBB2_2:
	movq	-16(%rbp), %rax
	cmpq	-24(%rbp), %rax
	jbe	LBB2_4
## %bb.3:
	movl	$1, -4(%rbp)
	jmp	LBB2_5
LBB2_4:
	movl	$0, -4(%rbp)
LBB2_5:
	movl	-4(%rbp), %eax
	popq	%rbp
	retq
	.cfi_endproc
                                        ## -- End function

grep, sed and awk

These three tools seem to have a great deal of overlapping functionality. All have built-in regular expression matching, and much of grep’s functionality can be replicated with sed and awk. For example, if I have a file, test.txt, with the content:

ONE 1 I
TWO 2 II
#START
THREE:3:III
FOUR:4:IV
FIVE:5:V
#STOP
SIX 6 VI
SEVEN 7 VII

And if I want to print the lines that don’t have the keywords “START” or “STOP” in them, I could do:

grep -vE "START|STOP" test.txt 

Or:

sed -nr '/START|STOP/ !p' test.txt 

Or:

awk '$0 !~ /START|STOP/' test.txt

Having used all, my rules of thumb for picking the right tool to use for a task are:

  • grep is for search only
  • sed is for search and replace. The emphasis is on ed, for editing. You start with a regular expression to find the text you want to edit, and take an action on the matching (or non-matching) text. The possible actions are:

    • substitute, with the s/<pattern>/<substitute>/ command
    • delete, with the d command
    • or simply print, with the p command
  • awk’s unique power is in processing columns or fields in the matching lines of text. Suppose you have a table of data. You could first use pattern matching to select the rows you want to operate on, and then use awk’s powerful column handling to process each column - printing, modifying, skipping, etc. (see the example below)

Of course I could separate the search from the editing and column processing, by using grep for the search and piping its output to sed or awk, but sometimes it’s nicer to have them all done with one command.
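
As a quick illustration of that last point, here is one way to pull just the numeric column out of the colon-delimited rows of test.txt above, skipping the #START/#STOP markers (the NF == 3 guard simply filters out the space-delimited rows):

awk -F: '$0 !~ /START|STOP/ && NF == 3 { print $2 }' test.txt

This prints 3, 4, and 5.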

System Design

Design Netflix

Netflix Architecture

From https://medium.com/@narengowda/netflix-system-design-dbec30fede8d.

Notes

Video transcoding. For example, H.265 -> MP4. Netflix may change resolution and frame rate to improve the user experience on a particular device.

Netflix also creates different video files depending on the network speed you have. On average, 1200 files are created for one video.

Once a video is transcoded, the thousands of copies of it get pushed to the many streaming servers Netflix operates, via its Content Delivery Network (CDN).

When a user clicks the play button on a video, they are matched to a nearby streaming server that has the video copy with the best resolution and frame rate for their device.

Content recommendation. Netflix uses data mining (Hadoop) and machine learning to recommend videos the user might enjoy, based on their browsing/viewing history.

Client side. Netflix supports 2200 different devices: Android, iOS, gaming consoles, web apps, etc., involving a variety of client-side technologies. On the website, they use React JS a lot, for its startup speed, runtime performance, and modularity.

Front-end load balancing. User requests are routed to Netflix’s front ends via AWS’s Elastic Load Balancer, which is a two-tier load balancer. At the first tier, it uses DNS-based round robin to select an ELB endpoint in a particular “zone”. At the second tier, it does another round robin to select a server within that zone.

EV Cache. A distributed key-value store based on Memcached. Reads are served by the nearest server, or a backup server if the nearest is down. Writes are duplicated to all servers. SSDs are used for persistent yet performant storage.

Database. Static data like user profiles and billing information is stored in MySQL databases. Netflix runs its own MySQL deployment on EC2 VMs and handles database replication itself. Replica failover is done by updating a DNS entry for the DB host.

Cassandra. A distributed wide-column NoSQL data store, designed for consistent read/write performance at scale. Netflix stores user viewing history in Cassandra. Recent viewing history, which undergoes frequent reads and writes, is stored uncompressed, while older history is compressed to save storage.

Monitoring and event processing. Netflix’s clients generate a lot of events every second of every day: ~500 billion events per day (~1.3 PB of data), with ~8 million events and ~24 GB of data to be processed per second during peak hours. Events include video viewing activities, UI activities, error logs, etc.

They use Apache Chukwa to collect and monitor these events. Chukwa is built on HDFS and MapReduce. The collected events are routed by Kafka to various data sinks: S3, Elasticsearch, and possibly a secondary Kafka.

Netflix uses Elasticsearch to help map a user-facing failure (a failure to watch a video) to error logs. Elasticsearch is a document search engine. AWS offers a managed version of it.

Autoscaling. When people get home from work, load increases and the system automatically scales up.

Media processing. Switching gears a bit, over to the content production side: before a video is put on the Netflix site, it undergoes a lot of processing steps. For instance, large videos are split into chunks and encoded in parallel.

Spark. Content recommendation and personalization is done via managed Spark clusters.