LCPC 2009: Third day
The third day started as usual with breakfast, although today everything ran 15 minutes late because of the laziness brought on by the wonderful dinner of the previous day.
This day there would be two sessions and a tutorial, but neither Nikola nor I was registered for the tutorial. Its topic was SSA-based register allocation. It was not that the topic was uninteresting, but attending would have made things a bit more difficult if we wanted to go back to NYC that same day.
The first session was about runtime and feedback-driven optimization. The first work presented was that of Vijay Nagarajan and Rajiv Gupta. Their idea consisted of (let's say it this way) bravely parallelizing regions of code and then, if a dependence was violated, recovering from it. Their implementation relied on an additional Itanium instruction and on reordering instructions to minimize misspeculation. Conflicts were detected at barrier points. The proposal was evaluated in an Itanium simulator called SESC.
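To make the "speculate, check at the barrier, recover" idea concrete, here is a minimal sketch in Python. It is only my own illustration, not the authors' implementation: it simulates the parallel tasks one after another, and every name in it (`speculative_region`, `read`, `write`) is hypothetical. Each task runs against a snapshot of memory with buffered writes; at the barrier, a task that read an address written by an earlier task has violated a dependence, so the region is re-executed serially.

```python
def speculative_region(memory, tasks):
    """Run tasks speculatively; return True if the speculation succeeded.

    memory: dict mapping address -> value (updated in place).
    tasks:  list of functions task(read, write), given in program order.
    """
    snapshot = dict(memory)  # every speculative task sees the pre-region state
    logs = []
    for task in tasks:  # simulated one by one; conceptually these run in parallel
        reads, writes = set(), {}

        def read(addr, reads=reads, writes=writes):
            if addr in writes:  # a task always sees its own writes
                return writes[addr]
            reads.add(addr)
            return snapshot[addr]

        def write(addr, value, writes=writes):
            writes[addr] = value  # buffered, not yet visible to anyone else

        task(read, write)
        logs.append((reads, writes))

    # Barrier: a dependence is violated if a task read a stale value, that is,
    # an address that an earlier task (in program order) wrote.
    written_before = set()
    for reads, writes in logs:
        if reads & written_before:
            # Recovery: memory is untouched so far, so just rerun serially.
            for task in tasks:
                task(lambda addr: memory[addr], memory.__setitem__)
            return False
        written_before |= set(writes)

    # No violation: commit the buffered writes in program order.
    for _, writes in logs:
        memory.update(writes)
    return True
```

With two independent tasks the speculation commits directly; if the second task reads what the first one writes, the check at the barrier fails and the serial rerun produces the correct result.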
The next speaker was Tongxin Bai, who presented another speculative parallelization technique, this time including a fast path. The work included a detailed mathematical model to predict the profitability of the speculation itself and the constraints that the runtime system had to fulfill to keep up with the performance.
The next speaker was Catherine Mills, who studies address streams, that is, the addresses issued by applications in their loads and stores. The main problem when analyzing these streams comes from the fact that applications can generate up to 44 GB of addresses per minute, so plain logging is completely infeasible. Their approach was to synthesize those streams so that only the minimal information needed to reproduce and analyze the execution performance is stored. The strategy showed an acceptable error rate of 8% compared to a real evaluation, which is not bad considering that the traces used were hundreds of times smaller than full traces.
The last speaker was Amir Kamil, who presented a very interesting proposal for PGAS languages (like UPC, but I believe it can also be applied to SPMD environments like MPI). Basically they aimed at reducing deadlocks caused by collectives like broadcast or barrier. To do this they enforced at runtime what is called textual alignment. Textual alignment means that your (source) code must be aligned. Intuitively, this means that no matter which path each particular thread takes, collectives must always happen in the same order. This approach was very useful for debugging purposes in their PGAS language called Titanium and had a small footprint, even for cases where the precision of the misalignment was reduced.
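A small sketch may help to picture what a textual alignment violation looks like. This is my own toy illustration in Python, not the Titanium implementation: each simulated thread records the source location of every collective it reaches, and a checker compares the recorded sequences afterwards (the talk described enforcement during execution). The names `run_thread` and `first_misalignment`, and the call-site labels, are hypothetical.

```python
def run_thread(rank, aligned):
    """Simulate one SPMD thread and return the call sites of its collectives."""
    trace = []

    def collective(site):
        trace.append(site)  # a real runtime would also perform the operation

    collective("broadcast@line10")
    if aligned or rank % 2 == 0:
        collective("barrier@line20")
    else:
        # Same kind of collective, but reached from a different source line:
        # the threads are no longer textually aligned.
        collective("barrier@line30")
    return trace


def first_misalignment(traces):
    """Return (position, call sites per thread) of the first disagreement,
    or None if every thread reached the same collectives in the same order."""
    longest = max(len(t) for t in traces)
    for i in range(longest):
        sites = [t[i] if i < len(t) else None for t in traces]
        if len(set(sites)) > 1:
            return i, sites
    return None
```

For two threads, `first_misalignment([run_thread(r, aligned=False) for r in range(2)])` points at the second collective, where rank 0 is at line 20 and rank 1 at line 30. With `aligned=True` the checker returns `None`.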
After the break the short paper session started, which included mine at the end. The first speaker was Henrique Andrade, who presented a code generation strategy for the SPADE compiler. SPADE is a full environment for productive stream development which includes a compiler. Henrique presented how they could leverage SIMD in order to improve the overall performance of the processes.
Kevin Williams presented a portable just-in-time specialization of dynamically typed scripting languages. His example was based on Lua, a nice imperative dynamic language. The portability of the JIT, a technique which by definition normally leads to non-portable implementations, relied on runtime C code generation. Several details had to be taken into account in order to make sure that a routine had not been compiled before. In my opinion, relying on a C compiler can imply too much overhead unless it is possible to invoke it extremely fast. If this is not possible, maybe the typical non-portable JIT solution turns out better.
The next work presented was about machine learning (ML) for a one-shot compiler. John Thomson presented how a compiler can be trained before its deployment, which is architecture dependent, in a sensible amount of time. The training means figuring out which optimization passes are more profitable in terms of performance (which, while typing this, leads me to the question of whether this technique could also be used to minimize compilation time; I suppose so). After this training the compiler is like any other compiler, simply with much better knowledge about the interactions between optimizations and the underlying architecture. This is what one-shot means: there is no need to train the compiler when using it.
After this, Boubacar Diouf presented a way of optimizing local memory allocation and assignment through a decoupled approach. Their strategy takes SSA and extends it to cover range blocks. In fact this is where current optimization trends are heading: going beyond scalar variables to array sections or ranges. Needless to say, reasoning about these is much more complex. Using this extended SSA, and for some restricted loops that are already tiled, they can optimize the local memory allocation, that is, the allocation of the local memories of forthcoming heterogeneous processors.
Finally it was my turn. I presented an extension of a classical loop optimization, loop unrolling, to cover the case of task parallelism. This way tasks that are deemed too fine grained, and thus yield no speedup or even worse performance than serial execution, can be aggregated and their granularity coarsened. If tasks are created in conditional code the aggregation can be performed as well, which leads to the nice fact that unbounded loops and uncountable loops are also eligible for this aggregation.
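The effect of the aggregation can be sketched by hand in a few lines of Python. This is only an illustration of the idea, not the compiler transformation I presented: the function `aggregated`, its parameters and the use of `concurrent.futures` are all assumptions made for the example. Work items that pass a condition are collected into batches of `factor` items, and each batch becomes a single task. Since the loop only consumes an iterator, the trip count does not need to be known in advance.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregated(items, work, condition, pool, factor):
    """Create one task per `factor` work items instead of one task per item.

    Returns (results in original order, number of tasks created).
    """
    def run_batch(batch):
        # The body of the coarsened task: several original tasks, run serially.
        return [work(x) for x in batch]

    futures, batch = [], []
    for x in items:  # works with any iterator, even of unknown length
        if condition(x):  # the original task was created conditionally
            batch.append(x)
            if len(batch) == factor:
                futures.append(pool.submit(run_batch, batch))
                batch = []
    if batch:  # leftover items, like the epilogue of an unrolled loop
        futures.append(pool.submit(run_batch, batch))

    results = [y for f in futures for y in f.result()]
    return results, len(futures)
```

For instance, inside `with ThreadPoolExecutor(max_workers=4) as pool:`, calling `aggregated(iter(range(10)), lambda x: x * x, lambda x: x % 2 == 0, pool, 3)` squares the five even numbers using two tasks instead of five.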
After this session finished, the organizing committee thanked us for attending and said they hoped to see us next year, when the conference will be hosted by Rice University, Texas.
Nikola and I headed to New York, where we will be spending three days of leisure, taking advantage of the fact that 12th October is a bank holiday in Spain (and Columbus Day in the USA).
We'll be officially back on the 14th, but maybe we won't be in the office until the 15th due to jet lag. See you then.
- rferrer's blog
