I want to use multi-threads to accelerate my program, but not sure which way is optimal.
Say we have 10000 small tasks, it takes maybe only 0.1s to finish one of them. Now I have a CPU with 12 cores and I want to use 12 threads to make it faster.
So far as I know, there are two ways:
There are always 12 threads running, each of them get one new task from the tasks pool after it finished its current work.
By separating the 10000 tasks into 12 parts and each thread works on one part.
The problem is, if I use tasks pool it is a waste of time for lock/unlock when multiple threads try to access the tasks pool. But the 2nd way is not ideal because some of the threads finish early, the total time depends on the slowest thread.
I am wondering how you deal with this kind of work and any other best way to do it? Thank you.
EDIT: Please note that the number 10000 is just for example, in practice, it may be 1e8 or more tasks and 0.1 per task is also an average time.
EDIT2: Thanks for all your answers :] It is good to know kinds of options.
Both ways suggested in the question will perform well and similarly to each another (in simple cases with predictable and relatively long duration of the tasks). If the target system type is known and available (and if performance is really a top concern), the approach should be chosen based on prototyping and measurements.
Do not necessarily prejudice yourself as to the optimal number of threads matching the number of the cores. If this is a regular server or desktop system, there will be various system processes kicking in here and then and you may see your 12 threads variously floating between processors which hurts memory caching.
There are also crucial non-measurement factors you should check: do those small tasks require any resources to execute? Do these resources impose additional potential delays (blocking) or competition? Are there additional apps competing for the CPU power? Will the application need to be grow to accommodate different execution environments, task types, or user interaction models?
If the answer to all is negative, here are some additional approaches that you can measure and consider.
Use only 10 or 11 threads. You will observe a small slowdown, or even a small speedup (the additional core will serve OS processes, so that thread affinity of the rest will become more stable compared to 12 threads). Any concurrent interactive activity on the system will see a big boost in responsiveness.
Create exactly 12 threads but explicitly set a different processor affinity mask to each, to impose a 1-1 mapping between threads and processors. This is good in the simplest near-academical case where there are no resources other than CPU and shared memory involved; you will see no chronic migration of threads across processes. The drawback is an algorithm closely coupled to a particular machine; on another machine it could behave so poorly as to finish never at all (because of an unrelated real time task that blocks one of your threads forever).
Create 12 threads and split the tasks evenly. Have each thread downgrade its own priority once it is past 40% and again once it is past 80% of its load. This will improve load balancing inside your process, but it will behave poorly if your application is competing with other CPU-bound processes.
So one midway between the two approaches is to break into say 100 batches of 100 tasks each and let the a core pick a batch of 100 tasks at a time from the task pool.
Perhaps if you model the randomness in execution time in a single core for a single task, and get an estimate of mutex locking time, you might be able to find an optimal batch size.
But without too much work we at least have the following lemma :
The slowest thread can only take at max 100*.1 = 10s more than others.
Task pool is always the best solution here. It's not just optimum time, it's also comprehensibility of code. You should never force your tasks to conform to the completely unrelated criteria of having the same number of subtasks as cores - your tasks have nothing to do with that (in general), and such a separation doesn't scale when you change machines, etc. It requires overhead to collaborate on combining results in subtasks for the final task, and just generally makes an easy task hard.
But you should not be worrying about the use of locks for taskpools. There are lockfree queues available if you ever determined them necessary. But determine that first. If time is your concern, use the appropriate methods of speeding up your task, and put your effort where you will get the most benefit. Profile your code. Why do your tasks take 0.1 s? Do they use an inefficient algorithm? Can loop unrolling help? If you find the hotspots in your code through profiling, you may find that locks are the least of your worries. And if you find everything is running as fast as possible, and you want that extra second from removing locks, search the internet with your favorite search engine for "lockfree queue" and "waitfree queue". Compare and swap makes atomic lists easy.
100ms/task - pile 'em on as they are - pool overhead will be insignificant.
1E8 tasks @ 0.1s/task = 10,000,000 seconds = 2777.7r hours = 115.7 days
That's much more than the interval between patch Tuesday reboots.
Even if you run this on Linux, you should batch up the output and flush it to disk in such a manner that the job is restartable.
Is there a database involved? If so, you should have told us!
Each working thread may have its own small task queue with the capacity of no more than one or two memory pages. When the queue size becomes low (a half of capacity) it should send a signal to some manager thread to populate it with more tasks. If queue is organized in batches then working threads do not need to enter critical sections as long as current batch is not empty. Avoiding critical sections will give you extra cycles for actual job. Two batches per queue are enough, and in this case one batch can take one memory page, and so queue takes two.
The point of memory pages is that thread does not have to jump all over the memory to fetch data. If all data are in one place (one memory page) you avoid cache misses.