Getting Maximum Performance Bang for Your Buck through Parallelism
by Bernard Murphy on 06-26-2016 at 12:00 pm

Finding a way to optimally parallelize linear code for multi-processor platforms has been a holy grail of computer science for many years. The challenge is that we think linearly and design algorithms in the same way, but then want to speed up our analysis by adding parallelism to the algorithms we have already designed.

But the “first design, then parallelize” approach is intrinsically hard because you’re trying to impose parallel structure onto sequential code. That leaves algorithm designers with the burden of deciding where and how best to do so, which they do with varying degrees of success, risking missed race conditions and overlooked opportunities to parallelize further.

This is not to say that there hasn’t been progress in at least simplifying the description task. OpenMP takes a directive-based approach, where pragmas are layered on top of existing code. This works insofar as it simplifies describing where and how you want to parallelize: pragmas can mark functions and loops that may run in parallel, which is certainly simpler than hand-written threading. But you still have to be sure the underlying functions are thread-safe, and an intuitive understanding of the algorithm often becomes significantly harder as you move pieces around to accommodate those pragmas.
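For illustration (this sketch is mine, not from the article), a typical OpenMP directive looks like the following. The pragma asks the compiler to distribute the loop iterations across threads, but the programmer remains responsible for the loop body being thread-safe:

#include <vector>

// Scale a vector in place. The pragma requests that iterations be split
// across threads; this is safe only because each iteration touches a
// distinct element and shares no mutable state.
void scale(std::vector<double>& v, double k) {
    #pragma omp parallel for
    for (long i = 0; i < (long)v.size(); ++i)
        v[i] *= k;
}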

For many applications this may be good enough. But no one would claim this comes anywhere near the aspirational goal: describe what you want in some kind of algorithm and let the compiler take care of optimizing for maximum parallelism with race-free safety. And for many applications, best possible performance is a non-negotiable top priority. Finite element analysis for stress, thermal, flow and EMI problems is a good example: higher performance means more accurate results, and the highest possible accuracy/quality of result is the only thing that matters.

That means algorithm designers are often willing to re-write core algorithms, even in new languages, especially when code needs to be re-factored anyway. And that opens up opportunities to consider very different approaches to coding, including switching from imperative programming (describing what calculation to perform and how to perform it) to declarative programming (describing what calculation to perform and letting the compiler figure out the best way to perform it).
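As a rough analogy in C++17 (my sketch, not the article’s; it assumes a standard library with parallel execution policies), the contrast looks like this. The imperative version spells out the order of operations; the declarative version states only what is to be computed and leaves the execution strategy, including parallelism, to the implementation:

#include <execution>
#include <numeric>
#include <vector>

// Imperative: we dictate the exact sequence of additions.
double dot_imperative(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i)
        s += a[i] * b[i];
    return s;
}

// Declarative: "a sum of pairwise products"; the library and compiler
// decide how (and whether) to parallelize it.
double dot_declarative(const std::vector<double>& a, const std::vector<double>& b) {
    return std::transform_reduce(std::execution::par,
                                 a.begin(), a.end(), b.begin(), 0.0);
}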

One such approach, designed originally at Texas Tech in partnership with NASA, subsequently transferred to and now marketed by Texas Multicore Technologies (TMT), is based on a language called SequenceL™. SequenceL doesn’t aim to be yet another general-purpose language; it is designed for serious math and science algorithms. The compiler optimizes from SequenceL into C++ (with optional OpenCL for GPU targeting), which can coexist with algorithms for other purposes built through more pedestrian paths.

As a very simple example, you can multiply two matrices in a single statement; the parallelism is inferred from this structure.
MatrixMultiply(A(2),B(2))[i,j]:= sum( A[i,all] * B[all,j] );

This illustrates the objective to express mathematical intent and not have it become entangled and made opaque by implementation details, which in SequenceL are left to the compiler.
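For contrast, here is a hedged sketch of what the same computation typically looks like when the loops and the parallelization are spelled out by hand in C++ with OpenMP (my illustration of those implementation details; TMT’s generated code will differ):

#include <vector>

using Matrix = std::vector<std::vector<double>>;

Matrix matrix_multiply(const Matrix& A, const Matrix& B) {
    const long n = (long)A.size();      // rows of A
    const size_t m = B[0].size();       // columns of B
    const size_t inner = B.size();      // shared dimension
    Matrix C(n, std::vector<double>(m, 0.0));
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)        // rows of C are independent, so safe to parallelize
        for (size_t j = 0; j < m; ++j)
            for (size_t p = 0; p < inner; ++p)
                C[i][j] += A[i][p] * B[p][j];   // sum( A[i,all] * B[all,j] )
    return C;
}

Every loop bound, iteration order and threading decision here is an implementation detail the programmer must get right by hand; in the SequenceL statement above, all of it is left to the compiler.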

How well this works is illustrated by multiple customer results. One industrial application was Emerson Process Management’s need to improve software for building network graphs for plants and oilfields. Their existing Java-based solution was estimated to require unreasonable runtimes to build a graph for 1000 nodes. Worse still, after 5 months of redesign the authors of the original code had failed to improve the Java implementation enough to reach target performance. They then reached out to TMT, who completed a SequenceL solution in 3 weeks that also happened to beat target performance by 10X.

Another customer achieved a 26X speedup in a core Fortran-based computational fluid dynamics solver with 25% less code; a solution that used to take 2 weeks now completes in overnight runs. Yet another customer, very experienced in parallelizing code, commented that SequenceL was like MATLAB on steroids. Pretty high praise. A lot of this is apparently due not just to automating obvious parallelism but also to finding and exploiting finer-grained opportunities to optimize that would be beyond human patience (and schedules) to explore, and to doing so throughout with guaranteed safety against race conditions.

As the IoT takes off, problems like this are going to become increasingly important. It isn’t all going to be about Big Data analytics. It’s also going to be about hard science and engineering analytics. I suspect you’re going to be hearing more about TMT in the near future. You can learn more about TMT and SequenceL HERE.
