#1 Infinite DistRTgen WU
Posted: Sun Feb 03, 2013 5:33 am
by Janos (retired)
Anyone else having problems with infinitely running DistRTgen WU's?
On ALL of my machines crunching DistRTgen I get a WU once or twice a DAY which just runs and runs. 100% usage for the duration and I have caught tasks running for over 8 hours (I should never sleep!).
I've tried all the normal stuff like a big hammer and a strong telling off but no resolution as yet. Project reset, reinstall of BOINC, under clocking, default clocking. I've even tried a complete OS reinstall.
All the machines are working fine otherwise and it is just DistRTgen WU's which give me bother. There is nothing about this on the DistRTgen forums, which tends to make me think it is something I am doing wrong.
Even a healthy dose of single malt has not solved things. Any ideas? Think I should try some more 18 year old Macallan?
#2 Re: Infinite DistRTgen WU
Posted: Sun Feb 03, 2013 10:10 am
by Gary Mc
Janos wrote: Anyone else having problems with infinitely running DistRTgen WU's?
It happens from time to time. Just check it every now and then and delete the work unit if it is overrunning excessively. Having said that, it has not happened to me for some time, but I have done nothing in particular to try to resolve it.
Just cos I have a lot of credits does not mean I know what I am doing.
#3
Posted: Sun Feb 03, 2013 11:25 am
by Janos (retired)
I am going to have to find a fix as it is happening way too often. Like you, Gary, I have had the odd one in the past but the last few days have been crazy. Each machine is currently getting two or three a day. I just killed one WU which had been going for 2h 21m.

#4
Posted: Sun Feb 03, 2013 11:46 am
by Gary Mc
Janos wrote: I am going to have to find a fix as it is happening way too often. Like you, Gary, I have had the odd one in the past but the last few days have been crazy. Each machine is currently getting two or three a day. I just killed one WU which had been going for 2h 21m.

My ATI 7970 machine is running BOINC 7.0.27 (x64).
I am running Windows 7, latest version, with automatic updates on.
Driver information (whatever this means...):
Driver Packaging Version 8.961-120405a-137813C-ATI
Provider Advanced Micro Devices, Inc.
2D Driver Version 8.01.01.1243
2D Driver File Path /REGISTRY/MACHINE/SYSTEM/ControlSet001/Control/CLASS/{4D36E968-E325-11CE-BFC1-08002BE10318}/0002
Direct3D Version 7.14.10.0903
OpenGL Version 6.14.10.11631
AMD VISION Engine Control Center Version 2012.0405.2205.37728
AMD Audio Driver Version 7.12.0.7706
Best of luck. As I said, I usually just wait for these problems to resolve themselves.

#5
Posted: Sun Feb 03, 2013 12:38 pm
by Alez
I found this
http://www.freerainbowtables.com/phpBB3 ... fd2e117741
Don't know if it helps, but this is from the last post:
Mikey: I'm running stock settings. But 2 of the 4 cards involved came from the factory overclocked. I've got some additional info.
1. The dual GPU machine does get hanging WUs on both Device 0 and 1, it's just that Device 0 hangs are more common.
2. It appears that there is a driver reset each time the hang up appears. (This is probably the causative factor)
3. All cards involved come from Zotac.
Based on the above information, I don't know if the manufacturer is to blame or not. My other GPU machines (an older DELL Pentium machine with a GTX560 and a laptop with a GTX660M) don't seem to ever have a problem. I do think that the problem most likely occurs when there is a switch between another project (Collatz or PrimeGrid) and DistRTgen. What I do know is that all GTX560 / Ti are running the same driver version and Windows 7.
#6
Posted: Sun Feb 03, 2013 12:50 pm
by Alez
#7
Posted: Sun Feb 03, 2013 4:39 pm
by Janos (retired)
Ah, nice work! It certainly seems logical that a GPU thread could cause a timeout.
I will give the registry settings a whirl and report back.
Cheers
#8
Posted: Sun Feb 03, 2013 5:38 pm
by Janos (retired)
Just one hour after installing the new registry settings (with reboot) and I have an infinite WU.

#9
Posted: Sun Feb 03, 2013 6:28 pm
by Janos (retired)
And another. This is nuts!
#10
Posted: Sun Feb 03, 2013 8:31 pm
by Janos (retired)
Happened now 4 times but all on the same machine. It looks like the other two are fixed (famous last words).
I am going to reinstall drivers, Windows updates, the registry hack, etc. on the "failing machine" and see if that resolves things.
#11
Posted: Sun Feb 03, 2013 9:41 pm
by Alez
Are you running multiple units on the card or a single instance? Might that machine be memory bound? If you're not already doing so, try running 1 unit per GPU with a whole CPU core spare for each card.
I have a similar problem with POEM and my 3 GPUs. Units run fine on 1 GPU but simply stall out and keep resetting on the other two. I have to exclude POEM on those two and run it on 1 GPU only.
#12
Posted: Sun Feb 03, 2013 9:45 pm
by Alez
What version of BOINC are you running?
#13
Posted: Sun Feb 03, 2013 9:45 pm
by Janos (retired)
I am running a default config with a single WU running on a single 7970. There are no other tasks running (during this testing phase).
The PC is using less than 25% memory.
Touch wood, the other two machines are still working well.
#14
Posted: Sun Feb 03, 2013 9:46 pm
by Janos (retired)
alezevo1 wrote: What version of Boinc are you running ?
Windows 7 - 7.0.28 64bit
#15
Posted: Sun Feb 03, 2013 9:50 pm
by Alez
I'm running 7.0.47. So far it's running very well. You could try upgrading. If it doesn't work you can always just reinstall 7.0.28.
I was running 7.0.45 and that eventually caused the whole of BOINC to go into a reset loop on both the CPUs and the GPUs when it tried to start a second POEM unit on the same GPU.
#16
Posted: Sun Feb 03, 2013 9:55 pm
by Janos (retired)
Yeah, I might try that tomorrow. I was also thinking about swapping the 7970 in the machine which keeps failing with a card from another machine which now seems to be working, to test for any hardware issues.
#17
Posted: Sun Feb 03, 2013 10:01 pm
by Alez
Two other thoughts...
Check the power management options haven't reverted to sleeping the monitor or turning off the GPU. Or could your new KVM switch be causing the card to sense no monitor load and sleep the GPU?
And if I remember right, do the ATI cards not have a turbo mode or similar? I use Afterburner to check the core clocks etc. on my cards, as my 610 has a tendency (why, I don't know) to overclock itself from 810 MHz. At 870 MHz it runs SETI fine, but at 910 MHz the SETI units stall on the GPU and sit with a scheduler wait message. The error messages are better and the config setup for GPUs is far better on 7.0.47. Do your cards not have the ability to increase the core clock speed as the demand on the GPU goes up?
#18
Posted: Sun Feb 03, 2013 10:08 pm
by Janos (retired)
Power checked
Sleep checked
Same clock settings as the other two cards
Same driver versions
Same Windows setup (well, almost: this one is Ultimate and the other two are Pro)
I am going to test out the hardware tomorrow. Maybe flash the motherboard BIOS...
#19
Posted: Sun Feb 03, 2013 11:03 pm
by Alez
Knew I'd read something about this when looking at ATI cards (all mine are NVIDIA)
http://forums.anandtech.com/archive/ind ... 44769.html
Apparently the ZeroCore tech can sometimes idle your cards by itself. I think you need to set the cards into high performance mode or something. Not sure if this is of any help or not.
#20
Posted: Mon Feb 04, 2013 1:45 am
by Janos (retired)
I *think* I have fixed it.
After much use of big hammer and single malt, I swapped the power feed to the card and it now seems to be working perfectly.
Credits be incoming :)
Thanks guys for the help with the debug process.
The registry settings were a superb tip and definitely fixed the mission of Microsoft to protect the user from his own stupidity, at all cost, because Windows knows better.
#21
Posted: Mon Feb 04, 2013 1:47 am
by Alez
Microsoft always knows better

#22
Posted: Mon Feb 04, 2013 3:47 am
by Janos (retired)
Janos wrote: I *think* I have fixed it.
Hmm, not quite. I just had to kill another task on the same machine.

#23
Posted: Mon Feb 04, 2013 10:21 am
by Janos (retired)
Different machine this time. The rate of infinite WU's has dropped dramatically but it is still happening all too often.
I will try plan D and see what happens next...
#24
Posted: Tue Feb 05, 2013 7:14 am
by Janos (retired)
Out of my three crunchers, two had infinite units this morning: one of 6:03:01 and the other 5:47:41 of utter wasted electric. Not happy.
#25
Posted: Tue Feb 05, 2013 6:25 pm
by Janos (retired)
Came home to find more locked up WU's. That is more than 27 hours of GPU time lost today. Going to faff with more settings this evening but if it continues I will use a way bigger hammer.
#26
Posted: Wed Feb 06, 2013 10:00 pm
by Janos (retired)
I sat down tonight and thought: I can't find the cause, so how do I cure the symptom?
So, I have written some code to check the time the WU has been running, and if it is more than 20% over the average of the last 5 WU completion times then it auto-suspends the active WU. I can then manually see, at my leisure, if the suspended WU should be restarted or aborted.
Worst case is half a dozen suspended work units per day rather than 2 or 3 crunchers locked up achieving not very much for hours at a time.
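For anyone wanting to try the same idea, here is a minimal sketch of that watchdog rule in Python. It is not the code actually written for the post. The class name, method names and the choice of seconds as the unit are all assumptions made for illustration, and it only covers the decision logic: how elapsed and completion times are read from BOINC, and how the task is actually suspended, are left out.

```python
from collections import deque
from statistics import mean


class RunawayDetector:
    """Hypothetical helper: flags a WU that runs well past the recent average."""

    def __init__(self, window=5, margin=0.20):
        # margin=0.20 means "more than 20% over the average".
        self.margin = margin
        # Keep only the most recent `window` completion times (in seconds).
        self.recent = deque(maxlen=window)

    def record_completion(self, seconds):
        """Store the run time of a WU that finished normally."""
        self.recent.append(seconds)

    def limit(self):
        """Return the current cut-off in seconds, or None with no history."""
        if not self.recent:
            return None
        return mean(self.recent) * (1 + self.margin)

    def should_suspend(self, elapsed_seconds):
        """True if the running WU has exceeded the cut-off."""
        limit = self.limit()
        # Without any completed WUs there is nothing to compare against,
        # so never suspend blindly.
        return limit is not None and elapsed_seconds > limit
```

With five recorded completions of roughly ten minutes each, a WU at 700 seconds would be left alone, while one at 8460 seconds (the 2h 21m case earlier in the thread) would be flagged for suspension. Suspended tasks are still reviewed by hand, so a false positive only costs a manual resume.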