Because processor clock rates are stagnating, parallelism will be the main source of future performance improvements. Despite the growing complexity, users now demand better performance from general-purpose parallel software without sacrificing portability and maintainability. This is difficult to achieve and requires a coordinated approach across the entire software hierarchy, including operating systems, compilers, and applications. In particular, performance optimization becomes more difficult because the number of targets to optimize for keeps growing. With mass markets for multicore systems, the diversity of multicore architectures has increased as well. Their characteristics often differ slightly, e.g., in the number of executable threads, the cache sizes and architectures, or the memory access times and bandwidth. Many parallel applications are optimized at design time to achieve peak performance on a particular machine, but perform poorly on others. This is unacceptable for applications used in everyday life. Auto-tuners have great potential to tackle this problem effectively. Instead of being hard-wired in the code, the performance-relevant parameters of a multicore application are made configurable. An auto-tuner is then run on the target platform where the program is executed to systematically search for an optimal configuration, which is typically not known beforehand and may be counter-intuitive. When the program is migrated to another machine, auto-tuning is repeated, thus preserving portability. In this paper, we present our novel contributions to make auto-tuning work for general-purpose parallel programs, not just scientific numerical programs. Our experimental results show that auto-tuning is promising and worth integrating into operating systems and compilers, so that manycore applications can be tuned more effectively at run-time.
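To make the idea of configurable tuning parameters concrete, the following is a minimal sketch of an auto-tuning loop, not the tuner presented in this paper. It assumes an exhaustive search over a small parameter space and uses wall-clock time as the only cost metric; the names `auto_tune`, `measure_runtime`, `parallel_sum`, `num_threads`, and `chunk_size` are hypothetical and chosen for illustration only.

```python
import itertools
import time
from concurrent.futures import ThreadPoolExecutor


def measure_runtime(run, config, repeats=3):
    """Return the best wall-clock time of `run(**config)` over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run(**config)
        best = min(best, time.perf_counter() - start)
    return best


def auto_tune(run, search_space, measure=measure_runtime):
    """Exhaustively try every configuration and keep the cheapest one.

    `search_space` maps each tuning parameter name to its candidate values.
    Exhaustive search is an assumption made for brevity; a real tuner would
    likely use a smarter search strategy on larger spaces.
    """
    names = list(search_space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        cost = measure(run, config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def parallel_sum(num_threads, chunk_size, data=range(200_000)):
    """Hypothetical workload whose tuning parameters are not hard-wired."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(sum, chunks))


if __name__ == "__main__":
    # Re-run this search on every target machine; the best values may differ.
    space = {"num_threads": [1, 2, 4, 8], "chunk_size": [1_000, 10_000, 50_000]}
    config, seconds = auto_tune(parallel_sum, space)
    print("best configuration on this machine:", config)
```

Because the search runs on the target platform, the program itself stays unchanged when it is moved to another machine; only the tuning step is repeated.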