Project: Multicore computing
Primary UITS contact: Judy Qiu
Last update: April 7, 2009
Description: A multicore CPU combines two or more cores (independent processing units) on a single chip. Currently, only dual-core and quad-core models are commonly available from vendors, although Intel, for example, has demonstrated an 80-core prototype. Moreover, according to a respected white paper (The Landscape of Parallel Computing Research: A View From Berkeley), future CPUs will have hundreds or thousands of cores. This will increase computing power for both research and commercial applications, but it will also present significant software challenges for parallel applications. In collaboration with the Community Grids Lab at IU, we are researching multicore technologies.
General goals:
- Develop, demonstrate, and test a hybrid model of parallel computing, involving workflow and mashups linking high performance parallel modules implemented on multicore clusters
- Implement and evaluate the performance of clusters of multicore systems
- Develop a suite of parallel data mining algorithms for applications including GIS, cheminformatics, bioinformatics, speech recognition, and image processing for polar science
Current status:
- We have defined a general parallel algorithm covering clustering and mixture models with built-in annealing to improve convergence.
- We have developed parallel methods for mapping high-dimensional spaces to a smaller number of dimensions for easier visualization and analytic processing. We are comparing principal component analysis (PCA), generative topographic mapping (GTM), and multidimensional scaling (MDS). We intend to add Bourgain's random projection method.
- We have measured and understood basic performance of two-processor systems with a total of eight cores.
- We have successfully tested methods on GIS and cheminformatics applications using a hybrid software model based on Microsoft CCR and DSS. We currently are looking at a bioinformatics clustering problem. We support several types of variables, including real-valued, binary, and profile representation used in bioinformatics.
- We have extended our work to a total of 32 cores on multicore clusters. We expect to further expand our test platforms to a total of 512 cores using a combination of different paradigms, from low-level technologies like MPI and CCR to new workflow and Internet computing approaches including DSS, Google's MapReduce, and Yahoo's Hadoop.
- We are investigating overheads coming from communication, the programming paradigm, and the use of virtual machines.
- With this enhanced computing resource, we will tackle major new applications, including the search for gene families in a collection of a million sequences. Early results suggest that our new annealing algorithms perform better than existing clustering and dimension reduction methods.
- Results have been presented worldwide at three conferences in fall 2007 and three so far in 2008, including presentations at CYFRONET AGH in Krakow, Poland, and the Many-Core Workshop in Shanghai, China. A presentation is also planned in Bloomington for October 2008 as part of the Research Technologies Roundtable series. We also plan to have an exhibition at the Fourth IEEE International eScience 2008 conference, December 7-12, 2008.
- High Energy Physics data analysis is both data intensive (petabyte scale) and computation intensive. We have developed a data analysis tool using DryadLINQ and its MapReduce support to analyze LHC particle physics experiment data from the Center for Advanced Computing Research at the California Institute of Technology. The tool uses DryadLINQ to distribute the data files across the available computing nodes and then executes a set of analysis scripts written in CINT (an interpreted language of the physics analysis package ROOT) on all these files in parallel. After processing each data file, the analysis scripts produce histograms of identified features, which are merged (the "Reduce" of MapReduce) to dynamically produce the final result of the overall data analysis.
- We are working with the IU School of Medicine to relate patient records to environmental factors, and the figure shows clusters in the patient records visualized after MDS dimension reduction. This involves clustering 160-dimensional vectors for more than 360,000 patient records. The results should help identify environmental conditions associated with the obesity epidemic, a well-documented public health problem in the United States, and clarify how those conditions act as intervening factors through their impact on physical activity and eating habits.
- Since the end of 2008, we have published two papers at the eScience 2008 conference and one paper in the book Trends in High Performance and Large Scale Computing. We have also been invited to write a chapter for the book Data Intensive Distributed Computing, due by mid-May 2009. We presented our work at the Microsoft Research TechFest on February 24, 2009, and at the Microsoft External Research Symposium on March 31, 2009. Our plan of activities for 2009 includes making our core parallel algorithms (e.g., vector-based deterministic annealing clustering and pairwise clustering) available as services for public access, and writing grant proposals for medical and biology applications.
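
The status items above mention clustering with built-in annealing. As a rough illustration only, the following serial Python sketch shows the general idea of textbook deterministic annealing clustering: soft memberships are computed at a temperature that is lowered step by step, so the centers start merged at the data mean and split as the system cools. It is not the project's parallel algorithm; the function names, the cooling schedule, the jitter size, and the sample data are all assumptions made for this example.

```python
import math
import random


def soft_assign(point, centers, temperature):
    """Gibbs membership probabilities of one point over all centers."""
    d2 = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    d_min = min(d2)  # shift by the minimum so the exponent never overflows
    weights = [math.exp(-(d - d_min) / temperature) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


def da_cluster(points, k, t_start, t_min=0.01, cooling=0.8, sweeps=30, seed=0):
    """Hypothetical helper: anneal k centers from t_start down to t_min."""
    rng = random.Random(seed)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    centers = [list(mean) for _ in range(k)]  # all centers start at the mean
    temperature = t_start
    while temperature > t_min:
        # A tiny jitter lets coincident centers separate once the
        # temperature drops below the point where a split is favorable.
        for center in centers:
            for j in range(dim):
                center[j] += rng.gauss(0.0, 1e-3)
        for _ in range(sweeps):
            probs = [soft_assign(p, centers, temperature) for p in points]
            for i in range(k):
                mass = sum(pr[i] for pr in probs)
                if mass > 0.0:  # leave a center alone if it owns no mass
                    centers[i] = [
                        sum(pr[i] * p[j] for pr, p in zip(probs, points)) / mass
                        for j in range(dim)
                    ]
        temperature *= cooling  # geometric cooling schedule (an assumption)
    return centers


# Made-up sample data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = sorted(da_cluster(points, k=2, t_start=200.0))
```

Starting well above the temperature at which the first split occurs means the result does not depend on an initial guess for the centers, which is the usual motivation for annealing over a plain k-means start.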
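
The histogram merge described in the High Energy Physics item follows the usual map and reduce pattern, which can be sketched in a few lines of plain Python. This is only a stand-in: it does not use DryadLINQ, ROOT, or CINT, it runs serially rather than across nodes, and the function names, the fixed bin width, and the in-memory lists standing in for data files are assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def histogram(values, bin_width):
    """Map step: bin the values of one 'file'; keys are integer bin indices."""
    return Counter(int(v // bin_width) for v in values)


def merge(left, right):
    """Reduce step: add two partial histograms bin by bin."""
    return left + right


def analyze(files, bin_width):
    """Build one partial histogram per file, then merge them into one result."""
    partials = (histogram(values, bin_width) for values in files)
    return reduce(merge, partials, Counter())


# Made-up measurements; each inner list plays the role of one data file.
files = [[0.5, 1.2, 1.9], [1.1, 3.7], []]
total = analyze(files, bin_width=1.0)
```

Because bin-wise addition is associative and commutative, the partial histograms can be merged in any order, which is what allows the per-file analysis to run independently before the final merge.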

