LCPC 2009: First day

Zero day: Arrival in North East, MD

It's been a long day. Our flight left Barcelona El Prat airport at 11:20, more or less (not bad considering that the scheduled time was 11:05). The flight lasted 8 hours and it was quiet, except for the final two hours, which were a bit shakier due to annoying turbulence here and there. After disembarking, we had to go through the ritual at border control, which took about half an hour. I did not have any problem, but Nikola had filled in the green form when, since he had a visa, all he needed to fill in was a very simple paper. Next: AirTrain to Federal Circle to pick up our car at Dollar. A rather new Ford Fusion, which Nikola started out hating because it is an automatic, but now, as I'm writing this, he is starting to like it (not because of the automatic gears but because of the finish of the dashboard, the car radio, etc.).

Then the GPS decided the best way to go from JFK to North East, Maryland, was through the middle of Brooklyn. Although it was painfully slow, stopping at each traffic light, we got a flavour of that part of the city that we probably would not have had if we had taken Shore Parkway. We had to cross the Verrazano Bridge, which has an expensive toll of $11!!! After that, the GPS insisted on going through several neighbourhoods full of those small houses that you see everywhere there are no flats. A bit fed up, we restarted the GPS and suddenly it took us to a nearby motorway. We'll never know whether we were already heading there before restarting it. Once on the highway we did the remaining 100 miles (~160 km) to North East. We arrived there at 20:15, had dinner (supper in the US :) and went to sleep. We were dead.

First Day

We got up at 7:00. Well, I got up a bit earlier because I wanted to have a shower. Unfortunately I did not know how the tap worked and I had a cold shower! Nikola told me the trick with that tap. Tomorrow it will be a warm shower, hopefully. After having breakfast we headed to Newark, DE. Fortunately there are no tolls on this route. It was breakfast time when we arrived at the Trabant Center at the University of Delaware, so we had a second breakfast.

The keynote of this first day was given by Thomas Sterling. He talked about the challenge of Exascale and how this feat is threatened by current limitations across the whole infrastructure, or stack (though he preferred to draw it as an hourglass, putting «Execution model» in the narrowest part). He finished his speech by introducing an example execution model called ParalleX. «Parcels» are a key part of this execution model. They encode an action, a payload and a continuation environment. User-level threads can be used to run these parcels and, if I understood it correctly, the execution model exposes a PGAS (Partitioned Global Address Space) behaviour like UPC or DSM.
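To make the idea of a parcel a bit more concrete, here is a minimal sketch of how I picture it. This is not the ParalleX API: the `Parcel` class, the `run` function and the way I model the continuation as a plain function are all my own assumptions, and everything runs in a single process.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Parcel:
    """Hypothetical parcel: an action, its payload and what to do afterwards."""
    action: Callable[[Any], Any]
    payload: Any
    # Assumption: the continuation receives the result and may return a new parcel.
    continuation: Optional[Callable[[Any], Optional["Parcel"]]] = None


def run(parcels):
    """Drain a queue of parcels, roughly as a user-level scheduler might."""
    queue = deque(parcels)
    results = []
    while queue:
        parcel = queue.popleft()
        value = parcel.action(parcel.payload)
        if parcel.continuation is None:
            results.append(value)          # nothing left to do with this value
        else:
            follow_up = parcel.continuation(value)
            if follow_up is not None:
                queue.append(follow_up)    # the work continues in a new parcel
    return results


# Square 7, then hand the result to a follow-up parcel that adds 1.
first = Parcel(
    action=lambda x: x * x,
    payload=7,
    continuation=lambda v: Parcel(action=lambda y: y + 1, payload=v),
)
print(run([first]))  # prints [50]
```

In a real system the follow-up parcel would presumably be sent to wherever its data lives instead of being appended to a local queue.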

After that, Jaspal Subhlok presented a framework for parallel computing on volatile computers. Volatile computers are desktops whose power is underused most of the time. It was interesting to see such a framework, but the evaluation was done on a cluster, so they did not recreate a real environment in which real failure resilience, one of the goals of the framework, could be observed.

Gabriel Tanase presented pList, a parallel list implementation that belongs to STAPL. STAPL is like a parallel STL for C++. This pList brings the main benefit of a vector, random access, to parallel programming. In essence they create a list of vectors which is distributed among the processors.
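The "list of vectors" idea can be sketched in a few lines. This is only my own illustration, not STAPL code: `ChunkedList` and its methods are made-up names, and all the chunks live in one process, whereas in pList each one could sit on a different processor.

```python
class ChunkedList:
    """Hypothetical sequence stored as a list of small vectors (Python lists)."""

    def __init__(self, items, chunk_size=4):
        self.chunks = [list(items[i:i + chunk_size])
                       for i in range(0, len(items), chunk_size)] or [[]]

    def _locate(self, index):
        # Walk the chunks to find which one holds the global index.
        if index < 0:
            raise IndexError("negative indices are not supported in this sketch")
        for chunk_id, chunk in enumerate(self.chunks):
            if index < len(chunk):
                return chunk_id, index
            index -= len(chunk)
        raise IndexError("index out of range")

    def __len__(self):
        return sum(len(chunk) for chunk in self.chunks)

    def __getitem__(self, index):
        chunk_id, offset = self._locate(index)
        return self.chunks[chunk_id][offset]

    def insert(self, index, value):
        # Only one chunk is touched; the others stay where they are.
        if index == len(self):
            self.chunks[-1].append(value)
            return
        chunk_id, offset = self._locate(index)
        self.chunks[chunk_id].insert(offset, value)


pl = ChunkedList(list(range(10)), chunk_size=4)
pl.insert(5, 99)
print(pl[5], pl[6], len(pl))  # prints 99 5 11
```

The point of the layout is that an insertion only disturbs one chunk, while indexing still works through a short walk over the chunks.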

I had a particular interest in the next talk, about hardware support for OpenMP collectives. Soohong Kim presented the Aggregated Function Units, which the main processor uses by means of an extension to the instruction set: checkin and checkout, to start and end the collective. The results were very good for barriers but somewhat disappointing for reductions, in which I had a special interest. The speaker was unable to give a clear reason why reductions did not improve more. Anyway, it is interesting to see such a design, since it can be an inspiration for future manycores with special programming model support.

After these three presentations we had lunch. The food was fine, though it looked a bit too sophisticated to me. It's maybe the first time I have had a croissant filled with shrimp paste for lunch.

Jacqueline Chame was the next speaker. She introduced a full workflow for loop transformation and tuning. She also called on the community to build a single, well-defined transformation strategy. She showed CHiLL scripts used for transformation and was able to convert a very naive Jacobi kernel into an alien-like code which looked very unfriendly. I guess this proves that: a) you should not transform your codes by hand or you will have maintenance issues, b) the complexity of transformation scripts is on par with the complexity of the transformed code.
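To give a taste of how quickly a kernel gets uglier, here is a small sketch of my own. It is neither a CHiLL script nor CHiLL output: I just took a naive 1-D Jacobi sweep and, by hand, tiled the inner loop and unrolled it by two. The function names and the tile size are arbitrary choices of mine.

```python
def jacobi_naive(a, steps):
    """Plain 1-D Jacobi: each interior point becomes the mean of its neighbours."""
    a = list(a)
    for _ in range(steps):
        b = list(a)
        for i in range(1, len(a) - 1):
            b[i] = 0.5 * (a[i - 1] + a[i + 1])
        a = b
    return a


def jacobi_transformed(a, steps, tile=4):
    """Same computation after tiling the i loop and unrolling it by two."""
    a = list(a)
    n = len(a)
    for _ in range(steps):
        b = list(a)
        for start in range(1, n - 1, tile):
            stop = min(start + tile, n - 1)
            i = start
            while i + 1 < stop:              # unrolled body, two points at a time
                b[i] = 0.5 * (a[i - 1] + a[i + 1])
                b[i + 1] = 0.5 * (a[i] + a[i + 2])
                i += 2
            if i < stop:                     # leftover iteration of the tile
                b[i] = 0.5 * (a[i - 1] + a[i + 1])
        a = b
    return a


grid = [0.0] * 10 + [1.0]
print(jacobi_naive(grid, 5) == jacobi_transformed(grid, 5))  # prints True
```

Both versions compute exactly the same values, but the second one already needs comments to be readable, and that is after only two transformations.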

A very fun and entertaining presentation by Henry Dietz followed. He showed how to implement a MIMD execution model on a GPU. Although the results are still a bit disappointing, with slowdowns of about 6x, they have made good progress given that they started with 10x slowdowns. They used a specially crafted version of a C compiler and a special version of CUDA that they made themselves.

Ge Gan presented a Decoupled Access Execution strategy to be applied to Cyclops 64 and OpenMP 3.0 tasks. He presented an OpenMP pragma for offloading work to the cores of the Cyclops. All the results were obtained by means of simulation since no implementation of Cyclops 64 exists.

Raymon Manley presented a technique for implementing streaming languages on general-purpose processors. The key strategy was the extensive use of vectorization, which gave better speedups than other programming models.

The next session chair was Lawrence Rauchwerger, who is always an interesting chair. First in this session was Souad Koliaï from the University of Versailles, France. She explained the techniques and workflow they used for tuning applications. One of the techniques was a bit controversial, since it was based on removing instructions and taking new metrics for comparison, regardless of the fact that this could lead to an invalid application.

After Souad, Chirag Dave presented a contribution on automatically tuning parallel and (hand) parallelized applications. Besides the, sadly usual, results of automatic parallelization, this talk proved that: a) the NAS benchmarks are as well parallelized as possible by hand and no tuning improves them, b) the previous claim is only false for SP, which can be improved slightly.

Finally, Liang Gu presented a model for predicting the performance of the Discrete Fourier Transform as implemented in the FFTW library. The model was shown to be faster than FFTW's own exploration and to give rather good performance.

Panel

There was an interesting, and fun, panel after the sessions, where the new trends in parallel programming were discussed. Henry Dietz raised the question of why there is no transparency in terms of performance when programming in parallel: it is not obvious that splitting two loops can lead to a dramatic improvement. The best moment came when David Padua gave a small presentation criticizing GPU programming: he considers that GPUs will eventually be integrated on chip, so there is no need to worry about the additional complexities of current GPU programming.
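One way to read the loop splitting remark is as loop fission: a loop whose body does two independent things is split into two loops. The sketch below is my own example, not something shown at the panel, and in plain Python it only illustrates the shape of the transformation, not the speedup; on real hardware the gain would come from things like cache behaviour or vectorization.

```python
def fused(a, b):
    """One loop doing two unrelated jobs per iteration (a and b of equal length)."""
    total = 0.0
    scaled = [0.0] * len(b)
    for i in range(len(a)):
        total += a[i]              # job 1: a reduction over a
        scaled[i] = 2.0 * b[i]     # job 2: an element-wise update of b
    return total, scaled


def split(a, b):
    """The same work after splitting the loop in two."""
    total = 0.0
    for i in range(len(a)):        # first loop: only the reduction
        total += a[i]
    scaled = [0.0] * len(b)
    for i in range(len(b)):        # second loop: only the element-wise update
        scaled[i] = 2.0 * b[i]
    return total, scaled


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print(fused(a, b) == split(a, b))  # prints True
```

Nothing in the source code tells you which of the two versions will be faster on a given machine, which I think is exactly the complaint.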

And that's all for today. More tomorrow.