LCPC 2009: Second day

This second day of LCPC 2009 started like yesterday: with a second breakfast (for those like us who already had breakfast at the hotel, because it is free :). Today there was a fruit yogurt which at our european eyes was ill-shaped: the base was wider than the top and the lid of the top was "squirt-prone".

Once got through the breakfast, today's keynote was presented by Paul Feautrier. Paul spoke about the past, the present and the futur of the polytope model. In a nutshell: if we represent the iteration space of a loop nest (DO loops) as a vectorial space, it is possible to have an inequality representation (rather compact and not depending on the size of the iteration space itself) which is useful for reasoning about the loops itself. Polytope model is very useful for lots of optimization tasks with loops. Unfortunately, it is not the ultimate model for loops. Some of its problems include too stringent constraints on the kind of loops it can model and the fact that most interesting algorithms to be applied on this model have exponential complexity.  This last fact limits the size of the loops to be analyzed. Besides of this, polytope model has been used for scheduling loops, allocation of arrays and is currently the basis of GRAPHITE. GRAPHITE is a loop optimization framework bundled in recent versions of GCC.

Paul sketched a possible solution for the scalability problem of the polytope model in which applications must be regarded as a set of process, like in streaming computing, where polytope model can be applied individually, thus limiting the span of code analyzed, and then combined. Unfortunately, it looks that maybe not all programs can be expressed this way, at least in a straightforward way. Paul also made interesting points about domain specific parallelizations concerning to "external" knowledge that cannot be obtained from the code but a human knows because of the domain. For instance, in some physics simulations, influence of two elements very far from each other may be considered negligible and the underlying dependence of the code be ignored.

François Boudin was the chair of today's first session on languages and programing enironment.  First speaker, Ronald Veldema has presented a set of extensions to object oriented languages to make them more amenable to multicore programming. This extensions included a better macro system, class proxies, class mixins and functional classes. It is been a controversial talk since it looked like most of what he was proposing was already done in languages like Scala. The speaker, in my opinion, did not give a convincing answer farther than "I don't like Haskell" or "Scala is not functional".

Next speaker was Yonghong Yan who presented Hierarchical Place Trees (HTP). This was an abstraction implemented on top of X10, but easily extensible to other task based programming models, Basically it defines trees of memory hierarchy whose nodes are linked tasks, and these nodes specify which tasks can be run in order to obtain good locality.

OSCAR API was presented by Keiji Kimura. Basically a compiler extended by means of pragmas, including some of OpenMP and some of their own, along with a library API for high performance computing on low power multicore. According to their results they were able to decode MPEG2 and AAC in less than 6 watt, which is an impressive result (at least so looks to me). The interface that catched me the most was a power interface control which enabled several power savings techniques like clock gating or total processor shutdown (the latter a bit drastic maybe).

After lunch on our own, Nikola and me have gone to a mexican restaurant (note aside: a cool thing of USA is free tap water by default in every restaurant), it was the session on parallel compiler technology. The opening talk of this session dealt with Intervals. Presented by Nicholas Matsakis, Intervals were a neat and clean approach for expressing parallel control flow. Everything was built on top of an interval, a time conceptual entity with a beginning and an end. Work is supposed to happen between the start and the end. Intervals can be connected to other ones by means of a very simple "happens before" relationship. With this apparently simple scheme, it is possible to build execution models like fork-join, futures and preconditioned execution. This model also includes support for locks. It is a library written in Java and amenable for static analysis without all the pitfalls of static analysis using Java's Thread.

Next presentation was Nikola's one. Nikola presented a novel software cache management strategy for Cell. It is based on a high locality cache and a transactional cache and several policies of replacements based on the exact distribution of cache lines among the SPU threads. This classification allows to use more efficient algorithms in some cases which turns into speedups of 1.2X to 1.4X by just changing the software cache management code.

Albert Cohen presented an iteration-space slicing which tries to overcome some of the limitations of the polyhedral model under the Burgmeier arithmetic. I would say something more about this if I had understood it better. According to comments in the question turn, it seems a very promising line concerning to optimizations of loops currently limited by existing limitations of the polyhedral model.

An automatic data distribution for data locality for the Cell architecture was been presented by François Bodin. Basically it is a better implementation of OpenMP over Cell but trying to make the most of the ring-shaped bus of the Cell. Some additional operations like shift, broadcast and transpose were specifically designed on top of the Cell bus and used as communication primitives between SPUs.

After the sundae (yes, ice cream) break, it started the session about general compiler technology. Mattias Holm opened this session presenting an automated restructuring of linked data structures. Mainly targeted at sparse data structures, or structures loaded after input files, he shown a technique of allocation based on pools of data types which was performed before entering loops working on this linked data. They used LLVM to perform hot spot analyisis and after applying these optimizations on several linear algebra applications, they obtained impressive speedups of at least 5X up to 22X.

Dae Gon was in charge of the next presentation where he showed a smart technique for tiling skewed iteration spaces. In a nutshell, a bounding box around the skewed iteration space was defined and used for tiling instead. This solved a common problem appearing when tiling skewed loops.

Finally, Chunhua Liao presented the ROSE outliner. A tool to extract kernels off applications. He showed the strategy used by the source-to-source ROSE compiler when outlining, which involves some data liveness analysis for better decisions on the outline itself. This resulted in better quality outlines which did not degrade the performance like it happened in usual approaches based on pointers.

After this last session, all the conference attendees have headed to Bayard House, a nice restaurant «located along the waterfront on the banks of the Chesapeake and Delaware Canal in historic South Chesapeake City, Maryland». We had a very good dinner but before it we could listen to Bill Carlson giving us a talk about the history of UPC. It is been a fantastic keynote, very interesting full of nice (non technic) details about the origin, evolution and maturing process of the UPC language. Very fun and pleasant. The dinner was very good, well sized for an european like me: not too much not too little food and a fine dessert :)

Tomorrow is the last day. I'm presenting my paper on the session related to short papers. Stay tuned for more.