To get more answers about Cheetah, I watched a webinar on February 21st.
There's a new three-letter acronym for us: FGP, or Fine-Grained Parallelism, and the VCS 2017.03 release is the first production release to use it. Here's what to expect from the archived webinar:
- Enabling parallelism in simulation flows
- Describing the FGP technology and how it uses many core processors
- Importance of native integration of FGP technology
- Using VCS 2017.03 FGP, with examples
When I first heard about Cheetah I expected that the new technology would force me to learn a new methodology, adopt new coding styles, or change something in the design and verification flow. However, I've since learned that you can leave your design as-is; you don't need to change the VCS simulation environment or your testbenches. The learning curve seems brief, which is always helpful.
So, who benefits from the parallelism of Cheetah? Design and verification engineers who are simulating RTL and gate-level designs. You also get support for familiar technologies like:
- Native Low Power (NLP)
- Verdi debug with parallel fast signal database (FSDB)
The growth in many-core processor architectures continues, and by 2020 we could be seeing up to 150 cores per processor, according to Intel and wcctech.com. Up to 8 cores per processor is considered multicore, a point reached back in 2012; above that number we are living in the many-core world:
This processor roadmap makes it clear that EDA software tools like simulators need to use many cores efficiently by design. The very first logic simulators were built for single cores using serial simulation, so even if you had a 2-socket, 16-core server, the simulation used only a single core, leaving 15 cores idle:
Using coarse-grained parallelism you could divide the four blocks of your RTL design across four cores, still leaving 12 cores idle:
With the Cheetah FGP approach we finally have fine-grained parallelism, which uses all 16 cores:
Synopsys has patented its FGP approach to load-balancing how the design is partitioned and simulated across the many cores of x86 processors.
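To see why fine-grained partitioning balances load better than coarse blocks, here's a minimal Python sketch. This is not VCS itself: the block sizes are hypothetical, and simple greedy scheduling stands in for Synopsys's patented load balancer.

```python
import heapq

def makespan(tasks, n_cores):
    """Greedy list scheduling: each task goes to the currently
    least-loaded core; return the time the last core finishes."""
    cores = [0.0] * n_cores
    heapq.heapify(cores)
    for t in sorted(tasks, reverse=True):
        heapq.heappush(cores, heapq.heappop(cores) + t)
    return max(cores)

# Coarse-grained: the design split into 4 big blocks (hypothetical sizes).
coarse = [400.0, 300.0, 200.0, 100.0]
# Fine-grained: the same 1000 units of work split into 100 small chunks.
fine = [10.0] * 100

print(makespan(coarse, 16))  # 400.0 -- bounded by the largest block
print(makespan(fine, 16))    # 70.0  -- close to the ideal 1000/16 = 62.5
```

The coarse split can never finish faster than its biggest block, no matter how many cores sit idle; the fine split spreads work across all 16 cores and lands near the ideal runtime.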
From a pragmatic viewpoint, which types of simulation will work best with FGP, and which are ill-suited? The following table provides guidance: SoC designs for graphics, CPU and networking are well-suited for parallelism in simulation:
| Well Suited | Ill Suited |
|---|---|
| Graphics RTL designs | Testbenches |
| Low power RTL designs | Lots of PLI/DPI content |
| Multicore CPU RTL | Low activity per cycle simulations |
| Networking RTL designs | Interactive debug sessions |
| Gate level designs | Short-running simulations |
Simulation speedup results from the FGP technology are shown below for a series of complex designs ranging from 12M to 2,000M gates:
For each unique design you have to determine how many cores yield the best FGP speedup. Here's an example of running FGP on RTL and gate-level simulation (GLS) while varying the number of cores:
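The shape of such a core-count sweep can be approximated with Amdahl's law. The sketch below assumes a hypothetical design where 90% of simulation activity parallelizes; real FGP scaling depends on the design's activity profile, which is why each design needs its own sweep.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: the serial fraction caps overall speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

# Hypothetical design where 90% of the simulation work parallelizes.
for n in (2, 4, 8, 16, 32):
    print(n, round(amdahl_speedup(0.90, n), 2))
# 2 1.82
# 4 3.08
# 8 4.71
# 16 6.4
# 32 7.8  -- saturating toward the 1/(1-0.9) = 10x ceiling
```

The diminishing returns past 16 cores mirror why simply throwing more cores at a simulation stops paying off once the serial portion dominates.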
The two Synopsys speakers at the webinar were David Hsu from the marketing side and Bruce Greene from the AE ranks. Bruce has been with Synopsys since 2001, so he has a deep understanding of VCS.
To compile your design with FGP just add one switch:
% vcs -f flist -fgp
At run time you define how many cores to use:
% simv -fgp=num_threads:8
To figure out whether FGP will provide any simulation speedup, use a new tool called the Dynamic Profiler; it will tell you whether there's enough activity in your design to warrant a benefit from FGP. After an FGP run completes you can even visualize how each core was used. Here's a chart showing how 8 cores were used:
Multicore CPUs are fantastic in terms of offering more compute power, but to get the best results you need software that can also exploit parallelism. The engineering team at Synopsys has parallelized its VCS simulator with an FGP technology called Cheetah, and you can now enjoy the simulation speed improvements. Sign up for the archived webinar to learn more about this promising approach to speeding up simulation run times for RTL and gate-level designs.