You are currently viewing SemiWiki as a guest which gives you limited access to the site. To view blog comments and experience other SemiWiki features you must be a registered member. Registration is fast, simple, and absolutely free so please, join our community today!

  • Architecting an ML Design

    Discussion on machine learning (ML) and hardware design has been picking up significantly in two fascinating areas: how ML can advance hardware design methods and how hardware design methods can advance building ML systems. Here Ill talk about the latter, particularly about architecting ML-enabled SoCs. This approach is getting major traction for inferencing applications particularly when driven by power considerations (eg in the IoT and increasingly in automotive apps), also in training when driven by demand for highest performance per Watt (eg Google TPU).

    Architecture has to be the center of design in such a system, optimizing the algorithm and architecture for your CNN. For the core algorithm, this has to be in choice of number, types (convolution, pooling, ) and characteristics (stride, ) of layers in the network. In the implementation, memory support is one critical consideration. Neurons like tightly-coupled memories for weights, activation functions and retrieving and storing neuron inputs and outputs, but everything cant be tightly-coupled so you also need high-bandwidth access to larger memories like caching but implementation can be quite different in CNNs. And you want to optimize PPA, aggressively if the design is targeted to a battery-operated application.

    Article: Double Patterning for IC Design, Extraction and Signoff-platform-architect-ultra-ai-min.jpg

    There are challenges in finding an optimal architecture for these systems. RTL isnt an option you need to test out different options with 100s of MB of data (images as an obvious example) so this has to run at a higher level of abstraction. And its not obvious (to me at least) how you would experiment with CNN architectures in a JIT-type virtual-prototype running on a general-purpose processor (or even a cluster). This class of architecture design seems a much better fit with TLM approaches, for example Platform Architect Ultra.

    Youll likely start in one of the standard CNN frameworks (Caffe, TensorFlow, etc) where most experts in the field are familiar with building CNN graphs. You can translate this into a workload model in Platform Architect (Synopsys calls this a task graph). This can then be mapped into an IP in your larger SoC model: CPUs, memory interfaces and on-chip bus. Naturally the platform supports a rich library of IPs and VIPs for this task:

    • Processors Arm, Synopsys ARC, Tensilica and CEVA DSP, traffic generators and more
    • Memory subsystem DDR and multiport memory controllers from Arm and Synopsys
    • Interconnect models from Synopsys, Arteris IP, Arm and NetSpeed

    Architecting the bulk of the SoC is pretty familiar; what is different here is designing the task graph for the CNN, or adapting it from an imported graph. I dont want to get too much into the details of exactly how this works you should follow the webinar link at the end to get a more complete description. But I think it is useful to highlight some of the key points for those of us who think more in terms of conventional logic.

    Since youre likely working with an imported CNN task graph, you probably need to build sub-components which will ultimately connect to TLM models, starting for example with a convolution layer in which you need tasks like read_input (reading an image sub-block), read_coefficients (e.g. for weights), process the neural function and write_result. The tool supports this through creation of a task components as blocks which you then connect to create a graph, indicating serial versus parallel processing. You also add parametrization to tasks and connections for things like height and width of a frame and the stride (how much the filter shifts on the input map per convolution).

    You can them simulate this model and sweep over different scenarios, looking at basic latencies and utilization of resources, and fine-tune and connect this as a hierarchical sub-component in the larger task graph. You will also build a structure to support the main body of your SoC (this is standard Platform Architect stuff), into which your CNN task graph is instantiated.

    From here you start mapping between tasks and TLM models in the library connecting your task-graph to a TLM implementation. Now when you simulate with scenarios, you get more realistic info on latency, throughput, utilization and so on. You can also do system-level power analysis at this stage by adding UPF3 power monitors (state machines, bound to each TLM model, with estimated power per state. Ah-hah - so thats where these UPF3 monitors are used &#128526

    The presenters (Malte Doerper PMM, and Tim Kogel AE manager, both in the virtual prototyping group) show application of these concepts to an AlexNet case study (AlexNet is the reference CNN benchmark everyone uses). Their goal on starting is to process a frame in 10ms with 4 frames processing in parallel, and to consume less than 500mW. They show a nice analysis, based on running over multiple scenarios where they can compare implementation choices against meeting the performance spec, power and energy (the old trick run faster, higher power but lower energy). And of course they meet their power and performance goals!

    I found this a pretty interesting exposition of how CNN design can fit comfortably into the full SoC architecture goal. Modeling convolution layers through tasks may look a little different, but it all seems to flow quite smoothly into full architecture creation. Theres a lot more detail in the webinar, along with a quite detailed Q&A. You can learn more HERE.