Ticket #905 (closed enhancement: ignored)

Opened 13 years ago

Last modified 12 years ago

Investigate fglrx

Reported by: jdreed
Owned by:
Priority: normal
Milestone: Current Semester
Component: --
Keywords:
Cc:
Fixed in version:
Upstream bug:

Description

Investigate whether using fglrx is a good idea and whether it is maintainable in the cluster environment. In particular, we care about robustness in the face of kernel upgrades (DKMS would be preferable to some sketchy ATI script).

Change History

comment:1 Changed 13 years ago by geofft

Not quite sure what you're asking, but fglrx on Natty at least uses DKMS.

Setting up dkms (2.1.1.2-5ubuntu1) ...
Setting up fglrx (2:8.840-0ubuntu4) ...
update-alternatives: using /usr/lib/fglrx/ld.so.conf to provide /etc/ld.so.conf.d/GL.conf (gl_conf) in auto mode.
update-initramfs: deferring update (trigger activated)
Loading new fglrx-8.840 DKMS files...
First Installation: checking all kernels...
Building only for 2.6.32-32-generic

etc.

comment:2 Changed 13 years ago by geofft

Saw this today on thornhump, a Dell 790 using Natty's fglrx:

[623243.846343] [fglrx] ASIC hang happened
[623243.846346] Pid: 3989, comm: Xorg Tainted: P        W   2.6.38-10-generic #46-Ubuntu
[623243.846348] Call Trace:
[623243.846371]  [<ffffffffa053cd4e>] ? KCL_DEBUG_OsDump+0xe/0x10 [fglrx]
[623243.846388]  [<ffffffffa054a14c>] ? firegl_hardwareHangRecovery+0x1c/0x50 [fglrx]
[623243.846419]  [<ffffffffa05cb619>] ? _ZN4Asic9WaitUntil15ResetASICIfHungEv+0x9/0x10 [fglrx]
[623243.846448]  [<ffffffffa05cb5cc>] ? _ZN4Asic9WaitUntil15WaitForCompleteEv+0x6c/0xb0 [fglrx]
[623243.846477]  [<ffffffffa05c9c74>] ? _ZN15ExecutableUnits10CPRingIdleE15idle_WaitMethod12_QS_CP_RING_+0xe4/0x1a0 [fglrx]
[623243.846510]  [<ffffffffa05d49c7>] ? _ZN11AsicCypress21initializeMicroEngineEv+0x147/0x160 [fglrx]
[623243.846539]  [<ffffffffa05c9b3b>] ? _ZN15ExecutableUnits7PM4idleE15idle_WaitMethod+0x4b/0x90 [fglrx]
[623243.846567]  [<ffffffffa05c9826>] ? _ZN15ExecutableUnits9assertPM4Eb+0x56/0x70 [fglrx]
[623243.846596]  [<ffffffffa05d1920>] ? _ZN8AsicR6009assertPM4Eb+0x40/0x70 [fglrx]
[623243.846622]  [<ffffffffa05a6e4a>] ? CMMQS_Initialize_WA+0x14a/0x170 [fglrx]
[623243.846641]  [<ffffffffa05671ad>] ? firegl_cmmqs_init+0x56d/0xa80 [fglrx]
[623243.846656]  [<ffffffffa05422b2>] ? firegl_addmap+0x4b2/0x870 [fglrx]
[623243.846675]  [<ffffffffa0566688>] ? firegl_cmmqs_createdriver+0x48/0x130 [fglrx]
[623243.846679]  [<ffffffff81279db9>] ? security_capable+0x29/0x30
[623243.846697]  [<ffffffffa0566640>] ? firegl_cmmqs_createdriver+0x0/0x130 [fglrx]
[623243.846712]  [<ffffffffa0545d9a>] ? firegl_ioctl+0x1ea/0x250 [fglrx]
[623243.846723]  [<ffffffffa0536d7e>] ? ip_firegl_unlocked_ioctl+0xe/0x20 [fglrx]
[623243.846728]  [<ffffffff811764cf>] ? do_vfs_ioctl+0x8f/0x360
[623243.846731]  [<ffffffff81164e73>] ? vfs_write+0x123/0x180
[623243.846734]  [<ffffffff81176831>] ? sys_ioctl+0x91/0xa0
[623243.846737]  [<ffffffff8100c002>] ? system_call_fastpath+0x16/0x1b

The previous warnings seem to be:

[623034.464127] [fglrx] ACPI is disabled on this system
[623034.563758] WARNING: at /build/buildd/linux-2.6.38/drivers/pci/msi.c:685 pci_enable_msi_block+0xc8/0xe0()

[623063.874197] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[623063.874214] WARNING: at /build/buildd/linux-2.6.38/drivers/gpu/drm/radeon/radeon_fence.c:248 radeon_fence_wait+0x36f/0x3e0 [radeon]()
[623063.874217] GPU lockup (waiting for 0x00402B25 last fence id 0x00402B24)

"ACPI is disabled on this system" is suspicious -- this is our only system that took the noacpi hack, I think. I wonder if this happens if that hack isn't in place.
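
To check whether the same sequence shows up on other machines (for example, ones without the hack), something like the following could pull the relevant lines out of saved dmesg output. This is only a rough sketch: the marker strings are copied from the excerpts above, while the function name `find_gpu_events` and the `sample` list are made up for illustration, and the regex assumes the usual "[seconds.microseconds] message" dmesg layout.

```python
import re

# Assumed dmesg layout: "[seconds.micros] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings copied from the log excerpts in this ticket.
MARKERS = (
    "ASIC hang happened",
    "ACPI is disabled",
    "GPU lockup",
)


def find_gpu_events(lines):
    """Return (timestamp, message) pairs for lines containing a marker.

    Hypothetical helper, not part of fglrx, radeon, or any existing tool.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped kernel line
        timestamp, message = float(match.group(1)), match.group(2)
        if any(marker in message for marker in MARKERS):
            events.append((timestamp, message))
    return events


# Sample input: a few lines taken from the excerpts above.
sample = [
    "[623034.464127] [fglrx] ACPI is disabled on this system",
    "[623063.874197] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec",
    "[623243.846343] [fglrx] ASIC hang happened",
    "[623243.846348] Call Trace:",
]

events = find_gpu_events(sample)
for timestamp, message in events:
    print(f"{timestamp:.6f}  {message}")
```

Feeding it the lines of a full saved log instead of `sample` would give the ordering of the ACPI notice, the radeon lockup, and the fglrx hang on that machine.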

comment:3 Changed 12 years ago by jdreed

  • Status changed from new to closed
  • Resolution set to ignored

I think we've sufficiently stopped caring. We have demonstrated that fglrx "just installs" these days. If we get to the point where we need it on cluster, we should enable it, but this ticket is sufficiently vague as to not be useful.

comment:4 Changed 12 years ago by jdreed

Also, for completeness, I don't know what thornhump had, but our official 790 hack was "reboot=pci", not "noacpi".
