Phillip Stanley-Marbell
Foundations of Embedded Systems
Department of Engineering, University of Cambridge
http://physcomp.eng.cam.ac.uk
(~40 minutes)
Version 0.2020
Fascicle 01: RISC-V Processor Design Project Introduction
(Video)
27
Intended Learning Outcomes for This Tutorial
2
By the end of this session, you will be able to:
Enumerate differences between PALs, PLAs/CPLDs, and FPGAs
Enumerate differences between a programmable processor and programmable logic
Identify different methods to use circuits to achieve computation
Describe the requirements and logistics of this project
Logistics
3
▶︎ Remember to maintain a lab notebook
▶︎ Expect to spend about 20 hours per week
▶︎ Handout is linked from Moodle, along with several self-contained tutorials (“fascicles”)
▶︎ Attendance tracking on Moodle
Online Version of Project: Coordinate times with your team members
On-Campus Version of Project (no on-campus version in 2020):
Required: Tuesdays 11:00–13:00 and Fridays 09:00–11:00
Optional Afternoons: Fridays 14:00–17:00
See http://teaching.eng.cam.ac.uk/node/444 for more
Resources (online and live video sessions)
4
Server with tools installed (cpu0.f-of-e.org)
▶︎ Remote access to Lattice iCE40 MDP evaluation board
▶︎ Remote access to Keithley SMU2450 for power measurements
1hr live video sessions on Tue. and Fri. (09:00 London / 10:00 Paris / 16:00 Beijing)
Use the live video sessions to request power measurements
On Microsoft Teams (that’s the University’s / Department’s choice) (link)
Teams of Three (grouped by last name)
5
Team 1: Ch, Du, Fl
Team 2: Ge, Ge, Ha
Team 3: Ho, La, Le
Team 4: Ne, Sl, Zh
The University of Cambridge has sanctioned using Microsoft Teams
Assessment
6
80 marks in total
Interim Report #1 (20pts): Cover sheet + two A4 sides, 10pt font, 2cm borders. Due at 16:00 UK on Friday 22nd May, 2020
Interim Report #2 (30pts): Cover sheet + two A4 sides, 10pt font, 2cm borders. Due at 16:00 UK on Friday 29th May, 2020
Final Report (30pts): Cover sheet + ten A4 sides, 10pt font, 2cm borders. Due at 16:00 UK on Friday 5th June, 2020
See http://teaching.eng.cam.ac.uk/node/444 for report guidelines
▶︎ Submit your reports via Moodle
▶︎ Remember to indicate your “attendance” via Moodle (see slide 3)
and attend the live video sessions via Microsoft Teams (see slide 4)
Intended Learning Outcomes for the Project
7
By the end of the project (four weeks from today), you will have:
Gained experience using FPGAs to implement digital hardware designs
Gained experience using hardware description languages to implement digital circuits
Worked in a team to design improvements for a processor microarchitecture
Obtained hands-on experience modifying the design of a complete microprocessor
Evaluated tradeoffs between design requirements (power/energy, time, FPGA resources)
How Many Gates Does it Take to Blink an LED?
8
We will look at different ways to achieve the same computational task
Computational task: Toggle an LED at ~1Hz
Computational state: State of LED (on or off)
Approaches: Computation directly in gates, or use gates to make a processor and then use software
Which approach is best (power/energy efficiency, time efficiency, resource efficiency, engineer’s time)?
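To make the "directly in gates" approach concrete, here is a minimal Python sketch of a free-running counter whose top bit drives the LED. The 48 MHz clock frequency is an assumption chosen for illustration (this tutorial does not state the clock used), and the function names are hypothetical.

```python
import math

CLOCK_HZ = 48_000_000  # ASSUMPTION: illustrative clock frequency, not from the tutorial


def counter_width(clock_hz: int, toggle_hz: float) -> int:
    """Width N of a free-running counter whose top bit changes state roughly
    toggle_hz times per second (the top bit flips every 2**(N-1) cycles)."""
    return round(math.log2(clock_hz / toggle_hz)) + 1


def led_trace(width: int, cycles: int) -> list:
    """Simulate the counter for `cycles` clock edges; the LED is the top bit."""
    count, trace = 0, []
    for _ in range(cycles):
        count = (count + 1) % (1 << width)   # `width` flip-flops plus an incrementer
        trace.append(count >> (width - 1))   # LED is driven by the most significant bit
    return trace


width = counter_width(CLOCK_HZ, 1.0)
actual_toggle_hz = CLOCK_HZ / 2 ** (width - 1)
print(width, round(actual_toggle_hz, 3))  # 27 0.715
print(led_trace(3, 8))                    # [0, 0, 0, 1, 1, 1, 1, 0]
```

Under these assumptions, a few dozen flip-flops and an incrementer are enough for the task, which gives a rough lower bound to compare against the resources used by a whole processor running a C program.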
How Many Gates Does it Take to Blink an LED?
9
Using a custom circuit implemented directly in FPGA logic (using the D-flip-flop hard IP)
$ yosys -p "synth_ice40 -blif blinkDataflow.blif; write_json blinkDataflow.json" dffHardIP.v blinkDataflow.v
$ nextpnr-ice40 --up5k --package uwg30 --json blinkDataflow.json --pcf blink.pcf --asc blinkDataflow.asc
Using a C program, which runs on a processor, which is in turn implemented using FPGA logic
$ yosys -q ../../../yscripts/sail.ys
$ nextpnr-ice40 --up5k --package uwg30 --json sail.json --pcf pcf/sail.pcf --asc sail.asc
More details in Fascicle 10 (link) and Fascicle 11 (link)
Multiplexing Hardware…
10
▶︎ Space ◀ Examples? Advantages?
▶︎ Time ◀ Examples? Advantages?
When would you prefer one option to the other?
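One way to think about the space/time question is to count hardware and clock cycles for the same computation. The sketch below sums a list of numbers two ways: with a tree of adders laid out in space, and with a single adder reused over time. The function names and the cost model (count adders and levels or cycles only) are illustrative assumptions, and the input is assumed to be non-empty.

```python
def sum_in_space(values):
    """Adder tree: every addition gets its own adder.
    Returns (result, adders used, tree levels)."""
    layer = list(values)
    adders, levels = 0, 0
    while len(layer) > 1:
        # Pair up neighbours; each pair needs one dedicated adder.
        nxt = [layer[i] + layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        adders += len(nxt)
        if len(layer) % 2:            # an odd element passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
        levels += 1
    return layer[0], adders, levels


def sum_in_time(values):
    """One adder plus an accumulator register, reused once per clock cycle.
    Returns (result, adders used, clock cycles)."""
    acc, cycles = 0, 0
    for v in values:
        acc += v
        cycles += 1
    return acc, 1, cycles


print(sum_in_space(range(1, 9)))  # (36, 7, 3)
print(sum_in_time(range(1, 9)))   # (36, 1, 8)
```

In this toy model, the spatial version uses seven adders but finishes in three levels of logic, while the temporal version uses one adder but needs eight cycles.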
From Logic Gates to Processors
11
Computing systems (e.g., a microcontroller) are composed of many individual logic gates
Logic gates are chained in space to construct adders, multipliers, whole ALUs, pipelines, and so on
[Figure: functional unit with inputs a and b and output a op b. Image source: Wikipedia]
The whole processor is also reused in time, to execute multiple iterations of algorithms
[Figure: five-stage pipeline with stages IF, ID, EX, MA, WB. Image source: Wikipedia]
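The reuse of the pipeline in time can be sketched by listing which instruction occupies each stage in each clock cycle. The sketch below assumes an ideal five-stage pipeline (no stalls or hazards, one instruction issued per cycle); the instruction names and the function name are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def pipeline_schedule(instructions):
    """Return one dict per clock cycle mapping stage name -> instruction,
    or None when the stage is empty. ASSUMPTION: ideal pipeline, no stalls."""
    n_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            idx = cycle - depth  # instruction that entered IF `depth` cycles ago
            row[stage] = instructions[idx] if 0 <= idx < len(instructions) else None
        schedule.append(row)
    return schedule


for cycle, row in enumerate(pipeline_schedule(["i0", "i1", "i2"])):
    print(cycle, row)
```

With three instructions, the same five stages are reused over seven cycles; in cycle 2, for example, `i2` is in IF, `i1` is in ID, and `i0` is in EX.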
Multiplexing Hardware in Time: Microprocessors
12
[Figure: processor block diagram. A five-stage pipeline (Fetch, Decode, Execute, Memory Access, Register File write-back) with a Program Counter, 16 + 8 architectural registers, Cache, Main memory, Memory Management Unit (MMU), Interrupt Controller, programmable clock source, and memory-mapped peripherals (Timer / RTC, UART, A/D Converter, Battery Monitor, Failure Monitor, Network Interface). Legend: marked structures are modeled at bit-level, enabling monitoring of signal transition activity and SEUs during simulation]
▶︎ See Sunflower showpipe command
Example output, one record per clock cycle:
buslock=0, buslocker=-1, EX touches mem = 0
WB: [TAS],0
MA: []
EX: [ANDI],1
ID: [0xe100],0
IF: [0x6013],0
node ID=0, PC=0x80042ba, ICLK=1668327, sleep?=0
buslock=1, buslocker=0, EX touches mem = 0
WB: []
MA: [ANDI],2
EX: []
ID: []
IF: [0xd109],0
node ID=0, PC=0x80042bc, ICLK=1668328, sleep?=0
buslock=1, buslocker=0, EX touches mem = 0
WB: []
MA: [ANDI],1
EX: []
ID: [0xd109],0
IF: [0x6413],0
node ID=0, PC=0x80042be, ICLK=1668329, sleep?=0
buslock=1, buslocker=0, EX touches mem = 1
WB: [ANDI],0
MA: []
EX: [MOVBSG],1
ID: [0x6413],0
IF: [0xd109],0
node ID=0, PC=0x80042c0, ICLK=1668330, sleep?=0
buslock=1, buslocker=0, EX touches mem = 0
WB: []
MA: [MOVBSG],0
EX: [MACL],1
ID: [0xd109],0
IF: [0x410b],0
Multiplexing Hardware in Space: Programmable Logic Devices
13
Microprocessors
Fixed hardware; achieve different functionality by loading different programs
Programmable Logic
▶︎ Programmable in this case means configurable
▶︎ Achieve different functionality by wiring up generic components
▶︎ Generic components may be a collection of ANDs and ORs, a collection of lookup tables (LUTs), etc.
Logic gates are connected in a fixed configuration to construct adders, multipliers, whole ALUs, and so on
[Figure: functional unit with inputs a and b and output a op b; five-stage pipeline with stages IF, ID, EX, MA, WB]
PALs, PLAs (CPLDs), and FPGAs
14
Historical progression
▶︎ The earliest programmable logic devices were one-time mask-programmable, not reprogrammable
▶︎ One of the earliest reprogrammable logic array devices was the Altera EP300 (1984):
Source: Altera
Notation
15
PALs, PLAs (CPLDs), and FPGAs
16
PAL Architecture: Programmable AND array and fixed OR array
▶︎ Design is broken up into macrocells. Each macrocell is a sum of products
▶︎ Any macrocell input, or its complement, can be routed to any AND gate in the AND array
All the product terms are summed in a single OR gate for each macrocell
PALs (and PLAs, which we will see next) are a good match for designs that are mostly combinational logic
Source: Altera
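The sum-of-products structure of a PAL macrocell can be sketched in a few lines of Python. The encoding of the programmable AND array used here (1 selects an input, 0 selects its complement, None leaves it unconnected) is an assumption made for illustration and does not describe any particular device's programming format.

```python
def product_term(inputs, fuses):
    """AND of the selected literals.
    fuses[i] is 1 (use input i), 0 (use its complement), or None (not connected)."""
    return all(bit == want for bit, want in zip(inputs, fuses) if want is not None)


def macrocell(inputs, product_terms):
    """Fixed OR gate summing all of the macrocell's programmed product terms."""
    return any(product_term(inputs, fuses) for fuses in product_terms)


# Hypothetical configuration: XOR as a sum of products, a AND (NOT b) OR (NOT a) AND b
XOR_TERMS = [(1, 0), (0, 1)]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(macrocell((a, b), XOR_TERMS)))
```

Changing only the `XOR_TERMS` table changes the function computed, which is the sense in which the AND array is programmable while the OR gate stays fixed.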
PALs, PLAs (CPLDs), and FPGAs
17
PLA/CPLD: Programmable AND array and programmable OR Array
▶︎ Design is again broken up into macrocells
(Xilinx CoolRunner-II)
Source: Xilinx
PALs vs PLAs/CPLDs
18
PAL Architecture: Programmable AND array, fixed OR array (Source: Altera)
PLA/CPLD Architecture: Programmable AND and OR arrays (Source: Xilinx)
Both PALs and PLAs/CPLDs have few sequential logic elements and do not scale to large designs
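One consequence of a programmable OR array is that several outputs can share the same product terms. The sketch below models this under an assumed encoding (each AND-plane row uses 1 for an input, 0 for its complement, None for unconnected; each OR-plane row lists the indices of the product terms it sums). The names and the example configuration are hypothetical.

```python
def pla(inputs, and_plane, or_plane):
    """Evaluate a PLA: programmable AND plane feeding a programmable OR plane."""
    terms = [
        all(bit == want for bit, want in zip(inputs, row) if want is not None)
        for row in and_plane
    ]
    # Each output ORs together whichever product terms it has been wired to.
    return [any(terms[i] for i in selected) for selected in or_plane]


# Hypothetical configuration with two inputs (a, b) and two outputs.
AND_PLANE = [(1, 0), (0, 1), (1, 1)]   # a AND NOT b, NOT a AND b, a AND b
OR_PLANE = [
    [0, 1],       # output 0: a XOR b
    [0, 1, 2],    # output 1: a OR b, reusing product terms 0 and 1
]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, [int(x) for x in pla((a, b), AND_PLANE, OR_PLANE)])
```

In a PAL, where the OR connections are fixed per macrocell, the two shared product terms would have to be programmed separately for each output.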
Field-Programmable Gate Arrays (FPGAs)
19
Fine-grained: A large collection of generic LUTs rather than AND and OR arrays
All can be wired together in essentially arbitrary topologies
Source: Lattice
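A lookup table can be thought of as a small truth table indexed by its inputs. The sketch below models a 4-input LUT as a 16-bit integer; the bit ordering (input `a` as the least significant index bit) and the function names are assumptions chosen for illustration.

```python
def lut4(init: int, a: int, b: int, c: int, d: int) -> int:
    """4-input LUT: `init` is a 16-bit truth table, the inputs form the index.
    ASSUMPTION: `a` is the least significant index bit."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (init >> index) & 1


def make_init(fn) -> int:
    """Build the 16-bit truth table for any Boolean function of four inputs."""
    init = 0
    for index in range(16):
        a, b, c, d = index & 1, (index >> 1) & 1, (index >> 2) & 1, (index >> 3) & 1
        init |= (fn(a, b, c, d) & 1) << index
    return init


and4 = make_init(lambda a, b, c, d: a & b & c & d)
xor_ab = make_init(lambda a, b, c, d: a ^ b)
print(hex(and4), hex(xor_ab))      # 0x8000 0x6666
print(lut4(xor_ab, 1, 0, 0, 0))    # 1
```

The same `lut4` hardware computes any function of four inputs; only the 16 configuration bits change, which is what makes the LUT a generic building block.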
The FPGA for This Project: Lattice iCE40 Ultra Plus FPGA
20
The FPGA for This Project: Lattice iCE40 Ultra Plus FPGA
21
Source: Lattice
Size Comparison: Lattice iCE40 vs Typical FPGAs
22
Think of the iCE40 as an IC with two SPI interfaces, two I2C interfaces, about
5000 4-input lookup tables and D-flip-flops, and ~20 configurable I/O buffers
and differential amplifiers which you can wire together in (essentially) any
topology of your choosing
Source: Lattice
Source: Xilinx
FPGAs vs. PALs/PLAs/CPLDs
23
FPGAs: Almost always require an additional nonvolatile memory IC (e.g., flash) to store the FPGA configuration, because the on-chip configuration memory is volatile SRAM
FPGAs: A few FPGAs can store configuration in non-volatile on-chip memory (e.g., Actel/Microsemi Igloo and Lattice iCE40)
PALs/PLAs/CPLDs: Almost always nonvolatile (configuration stored directly in the same IC)
Tour of the Material
24
This Video
Tour of the Material
25
Suggested Four-Week Plan
26
Week 1: Clone a copy of the course git repository into your account on the coursework server and work through
the introductory examples in the section of the handout titled “Blinking an LED using the iCE40 FPGA and the
Open-Source FPGA Tools”. Perform your first baseline power/timing/resource measurements, skim the iCE40
Ultra Plus FPGA datasheet (link) and read the research article (link) on processor performance limits.
Week 2: Become familiar with the RISC-V ISA and with the provided baseline RISC-V processor implementation in
Verilog. Read the section of the handout titled “A Brief Introduction to Computer Architecture with RISC-V” and watch
the videos on computer architecture, pipelining, and the Sunflower processor emulator. Complete your baseline
power, performance, and resource usage measurements. Propose planned changes. Submit interim report #1.
Week 3: Implement your proposed changes to improve timing, resource usage, and power/energy. Watch the video
on design optimization and Pareto optimality. Skim the iCE40 Memory Usage Guide (link) and the DSP Function
Usage Guide (link). If relevant to your proposed ideas, you might also want to skim the SPRAM Usage Guide (link),
the Oscillator Usage Guide (link), and the Technology Library Usage Guide (link). Submit interim report #2.
Week 4: Finalize your design and prepare for the competition. Write and submit your final report.
The section of the handout titled "RISC-V Processor Design Project Logistics" (link) provides a more
detailed walk-through of the project activities over the four weeks.
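The plan mentions Pareto optimality when comparing designs on power, time, and FPGA resources. As a minimal sketch of the idea, the code below keeps only the designs that are not dominated by any other design. The design names and all measurement values are made up for illustration, and lower is assumed to be better for every metric.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every metric and strictly better in
    at least one (ASSUMPTION: lower is better for all metrics)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(designs):
    """Return the designs that no other design dominates."""
    return {
        name: metrics
        for name, metrics in designs.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in designs.items()
            if other_name != name
        )
    }


# HYPOTHETICAL measurements: (power in mW, run time in ms, LUTs used)
designs = {
    "baseline": (10.0, 5.0, 3000),
    "design_a": (8.0, 5.0, 3200),    # lower power, more LUTs
    "design_b": (12.0, 4.0, 3500),   # faster, but more power and LUTs
    "design_c": (11.0, 5.5, 3100),   # worse than baseline on every metric
}

print(sorted(pareto_front(designs)))  # ['baseline', 'design_a', 'design_b']
```

Here `design_c` drops out because the baseline is better on all three metrics, while the remaining designs each represent a different tradeoff.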
Things to Do
27
Complete “Muddiest Point” for Fascicle 01 (link)
Read/watch the material in Fascicle 01 (link)
Start reading/watching the material in Fascicle 03 (link) and Fascicle 11 (link)
Decide which team member will lead power/energy reduction, time reduction, and resource reduction, respectively
Read the article “Understanding Some Simple Processor Performance Limits” (link)