[Issues] some CPUs are not used

Ewout M. Helmich helmich at astro.rug.nl
Thu Sep 27 21:58:07 CEST 2007


I checked and confirm that the node with the 4 jobs is not released 
until the entire job is finished. With 40 frames the overhead is smaller 
percentage-wise compared to 20, not bigger. I think it is correct that 
two different users cannot run jobs on the same node at the same time, 
even if one of the CPUs is idling, which is what you seem to suggest. 
It's easy to make the groups so that an even number is always created, 
but it really depends on the number of processes that you let run 
simultaneously on each node, which could be 1 or 3 as well as 2. I can 
probably use that information, however. I'd have to see how MDia is 
normally used before I can say anything about that; if you want to use 
the cluster more effectively with MDia, then what's needed is a 
dpu.run(..) command that includes information for more than one 
independent MDia task, which can then be run on different nodes/CPUs.
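
To make this concrete, here is a minimal sketch of what such a 
multi-task submission could look like. The "tasks" keyword and the 
list-of-dicts structure are invented for illustration; this is not the 
current DPU interface:

    # Hypothetical multi-task form of dpu.run: hand the DPU a list of
    # independent MDia tasks in a single call, so the scheduler can pack
    # one task per free CPU/core instead of blocking whole nodes.
    tasks = [{'task': 'MDia', 'red_filenames': [f]} for f in my_frames]
    dpu.run(tasks=tasks)  # invented signature, for illustration only

The essential point is that the scheduler must see the independent 
tasks together in one submission before it can spread them over 
otherwise idle CPUs.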

Ewout


Johannes Koppenhoefer wrote:
> Hi Ewout,
>
> in the situation you mentioned, the 4-frame CPU will indeed be blocked until the other 8-frame CPUs are finished. As you said, this causes only minor overhead.
> But if you submit a job with 40 frames, the job will be split over 5 CPUs and 3 nodes, of which one node only uses one CPU. The other CPU on this node will not be used by other jobs until the job finishes (at least on our cluster). This causes bigger overheads, up to 50%, if you submit single-file jobs (or single-CPU jobs like in MDia). The workaround, as John pointed out, is to optimize the lists, i.e. to choose a clever GROUP_SIZE.
> I am doing this now, but it might be useful to integrate a piece of code in Pipeline.py which chooses the GROUP_SIZE appropriately for all Tasks, in order to always have an even number of CPUs used.
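>
> A minimal sketch of what such a GROUP_SIZE chooser could look like 
> (illustrative only, not actual Pipeline.py code; it assumes two CPU 
> jobs per node, as in the example above):
>
>     def choose_group_size(n_files, max_group=8, procs_per_node=2):
>         # Pick the largest group size <= max_group for which the
>         # number of groups is a multiple of procs_per_node, so that
>         # no node is left with an idle CPU.
>         for size in range(max_group, 0, -1):
>             n_groups = (n_files + size - 1) // size  # ceil division
>             if n_groups % procs_per_node == 0:
>                 return size
>         return max_group  # fallback, e.g. for a single file
>
> For 40 frames this returns 7, giving 6 CPU jobs on 3 fully used nodes 
> instead of 5 CPU jobs on 3 nodes with one idle CPU.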
>
> Bye,
> Johannes
>
>
> "Ewout M. Helmich" <helmich at astro.rug.nl> schrieb am 27.09.2007 16:22:13:
>> Hi Johannes,
>>
>> I'm not sure I completely understand your problem, but I can explain a 
>> few things. In DBRecipes/mods/Pipeline.py a variable GROUP_SIZE is used 
>> which in the case of the image pipeline is 8. This results in your 20 
>> filenames being split up into groups of 8, 8 and 4. The GROUP_SIZE was 
>> chosen so as to work best for the HPC cluster in Groningen, in 
>> particular because of the ~30min job limitation in the "short queue" 
>> here. If the number of processes per node (CPUs/cores) is 2, as in 
>> Groningen, that means two nodes are reserved in the call to the PBS 
>> queuing system, where one is handling 16 frames and the other 4. That is 
>> not very balanced and we could try to optimize by dividing the load 
>> evenly. On the other hand I doubt that this alone would be a serious 
>> problem (the main question here being whether the node that handles 4 
>> files is occupied for the entire time the node that handles 16 is busy). 
>> You mention losing 50% of the CPUs; how is that exactly? Are you 
>> submitting many jobs where you specify one filename?
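>>
>> For illustration, dividing the load evenly would keep the same number 
>> of groups but balance their sizes, e.g. 20 files into 7, 7 and 6 
>> instead of 8, 8 and 4 (a sketch only, not the actual code in 
>> Pipeline.py):
>>
>>     def split_evenly(filenames, group_size=8):
>>         # Same number of groups as fixed-size chunking, but balanced:
>>         # 20 files -> sizes [7, 7, 6] instead of [8, 8, 4].
>>         n = len(filenames)
>>         n_groups = (n + group_size - 1) // group_size  # ceil(20/8) = 3
>>         base, extra = divmod(n, n_groups)              # 6, remainder 2
>>         groups, start = [], 0
>>         for i in range(n_groups):
>>             size = base + (1 if i < extra else 0)      # 7, 7, 6
>>             groups.append(filenames[start:start + size])
>>             start += size
>>         return groups
>>
>> Note that this still reserves two nodes for three jobs, but the 
>> longest single-CPU job drops from 8 frames to 7, so the whole job, 
>> and with it the half-used node, is released a little sooner.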
>>
>> Regards,
>> Ewout
>>
>> John P. McFarland wrote:
>>> Hi Johannes,
>>>
>>> The DPU/CPU behavior might have something to do with the cluster queueing 
>>> system not controlled by the DPU, but that is only a guess.  For now, you 
>>> could simply try to optimize the lists so that you use both CPUs on one 
>>> node and no others, if possible.
>>>
>>> I'm CCing this to the Issues list so that anybody else with some ideas 
>>> (especially our DPU experts) can chime in.
>>>
>>> Cheers,
>>>
>>>
>>> -=John
>>>
>>>
>>> On Mon, 24 Sep 2007, Johannes Koppenhoefer wrote:
>>>
>>>> Hello John,
>>>>
>>>> I have realized that if you submit a job on the dpu with the 
>>>> red_filenames option, and the number of files is e.g. 20, it results in 3 
>>>> CPU jobs, two on one node and one on the next node. Now, for some reason I 
>>>> do not understand, the second CPU on the second node is not going to be 
>>>> used by other processes. This is particularly painful in my situation, 
>>>> where I have to submit jobs that run on a single CPU, because I can use 
>>>> only half of the CPUs on our cluster and the rest is blocked. Is there any 
>>>> reason for this dpu behavior? Do you know of any quick workaround for me?
>>>>
>>>> Cheers,
>>>> Johannes
>>>>
>> -- 
>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>> Drs. Ewout Helmich		<><>
>> Kapteyn Astronomical Institute	<><> Astro-WISE/OmegaCEN
>> Landleven 12			<><>
>> P.O.Box 800			<><> email: helmich at astro.rug.nl
>> 9700 AV Groningen		<><> tel  : +31(0)503634548
>> The Netherlands			<><>
>> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>

