Ticket #165 (closed defect: wontfix)

Opened 12 years ago

Last modified 12 years ago

HPET clock is broken on Dell 760 Hardware

Reported by: wdc
Owned by:
Priority: blocker
Milestone: Summer 2009 Deployment
Component: --
Keywords:
Cc:
Fixed in version:
Upstream bug: LP:348694 Kernel:13053

Description (last modified by wdc)

When you boot from older Intrepid media, you see a spew of kernel oops messages. The apparent workaround for this is to power cycle the machine.

A one-line change to print the error message only once is incorporated into the Intrepid update kernel, so this symptom only affects systems lacking the update kernel.

The warning is reporting a fault in the HPET clock which *IS* real.
The system takes a very long time to boot, and logging in is also slow. The workaround is to boot with the option acpi=off.

When you boot Jaunty, even from media as recent as 3/24, you get a blank screen unless you use the boot option acpi=off.

We need to find out where the fault lies and get it fixed.

Attachments

760-dmesg-oops.log Download (37.4 KB) - added by wdc 12 years ago.
dmesg output from debathena-cluster system with normal code path. acpi enabled, slow response. Notice the oops.
760-dmesg-apic-off.log Download (32.6 KB) - added by wdc 12 years ago.
Same system booted with "acpi=off"
760-dmesg-pit.log Download (41.6 KB) - added by wdc 12 years ago.
dmesg when clocksource=pit. Oddly, it complains of an hpet problem!
760-dmesg-hpet-disable.log Download (38.9 KB) - added by wdc 12 years ago.
dmesg with hpet=disable

Change History

comment:1 Changed 12 years ago by wdc

This turns out to be more than a one-line kernel change.
The kernel change makes the endless spew stop, but does not fix HPET.

The manifestation of the problem is more serious under Jaunty.
If you choose "Try Ubuntu..." from the live CD, you get a blank screen.

Under both Intrepid and Jaunty, you get much better operation if you set "acpi=off".

I have opened LP 348694 on this issue:
 https://bugs.launchpad.net/bugs/348694

comment:2 Changed 12 years ago by wdc

  • Description modified (diff)
  • Summary changed from 2.6.26 kernel does not get along with Dell 760 Hardware to HPET clock is broken on Dell 760 Hardware

Revised description to reflect current understanding of the problem.

Changed 12 years ago by wdc

dmesg output from debathena-cluster system with normal code path. acpi enabled, slow response. Notice the oops.

Changed 12 years ago by wdc

Same system booted with "acpi=off"

comment:3 Changed 12 years ago by wdc

Note! Attachment named "760-dmesg-apic-off.log" is named incorrectly.
It is "acpi=off" NOT apic that was set.

Changed 12 years ago by wdc

dmesg when clocksource=pit. Oddly, it complains of an hpet problem!

Changed 12 years ago by wdc

dmesg with hpet=disable

comment:4 Changed 12 years ago by wdc

With hpet=disable /proc/interrupts looks like this:

wdc-ubuntu-test% more /proc/interrupts
           CPU0       CPU1
  0:      41858      43789   IO-APIC-edge      timer
  1:          1          1   IO-APIC-edge      i8042
  7:          0          0   IO-APIC-edge      parport0
  8:          1          1   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          2          2   IO-APIC-edge      i8042
 16:        806        730   IO-APIC-fasteoi   uhci_hcd:usb1, HDA Intel
 17:        708        608   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb6
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb7
 22:          2          1   IO-APIC-fasteoi   uhci_hcd:usb3, ehci_hcd:usb4
 23:          0          0   IO-APIC-fasteoi   uhci_hcd:usb5, ehci_hcd:usb8
219:       5348       5124   PCI-MSI-edge      eth0
220:      11135       9240   PCI-MSI-edge      ahci
NMI:          0          0   Non-maskable interrupts
LOC:      11293      16932   Local timer interrupts
RES:      13002      13790   Rescheduling interrupts
CAL:        305        195   function call interrupts
TLB:        105         87   TLB shootdowns
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0

comment:5 Changed 12 years ago by wdc

With acpi=off /proc/interrupts looks like this:

wdc-ubuntu-test% cat /proc/interrupts 
           CPU0       
  0:         70   IO-APIC-edge      timer
  1:          2   IO-APIC-edge      i8042
  2:          0    XT-PIC-XT        cascade
  4:          2   IO-APIC-edge    
  8:          2   IO-APIC-edge      rtc0
 12:          4   IO-APIC-edge      i8042
 16:       1467   IO-APIC-fasteoi   uhci_hcd:usb1, HDA Intel
 17:        485   IO-APIC-fasteoi   uhci_hcd:usb2, uhci_hcd:usb6
 18:          0   IO-APIC-fasteoi   uhci_hcd:usb7
 22:          3   IO-APIC-fasteoi   uhci_hcd:usb3, ehci_hcd:usb4
 23:          0   IO-APIC-fasteoi   uhci_hcd:usb5, ehci_hcd:usb8
219:       5859   PCI-MSI-edge      eth0
220:      17928   PCI-MSI-edge      ahci
NMI:          0   Non-maskable interrupts
LOC:      31190   Local timer interrupts
RES:          0   Rescheduling interrupts
CAL:          0   function call interrupts
TLB:          0   TLB shootdowns
SPU:          0   Spurious interrupts
ERR:          0
MIS:          0
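
For anyone who wants to compare listings like the two above without eyeballing them, here is a minimal Python sketch that turns /proc/interrupts text into per-CPU counts. The function name parse_interrupts and the short SAMPLE string are illustrative assumptions; the sample rows are copied from the hpet=disable listing in comment 4. It assumes the usual layout (a header row of CPU names, then "label: counts description") and has only been reasoned through against the listings in this ticket.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text.

    Returns (cpus, table) where cpus is the list of CPU column names and
    table maps each row label (e.g. "0", "LOC") to (counts, description).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()          # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, sep, rest = ln.partition(":")
        if not sep:
            continue                 # not a "label: ..." row
        fields = rest.split()
        counts = []
        # At most one numeric count per CPU; rows such as ERR have fewer.
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return cpus, table


# Rows copied from the hpet=disable listing (comment 4).
SAMPLE = """\
           CPU0       CPU1
  0:      41858      43789   IO-APIC-edge      timer
LOC:      11293      16932   Local timer interrupts
ERR:          0
"""

cpus, table = parse_interrupts(SAMPLE)
timer_counts, timer_desc = table["0"]
print(cpus, sum(timer_counts), timer_desc)
```

To use it on a live system, pass it the contents of /proc/interrupts read as a string. Note that the acpi=off listing has a single CPU column, which this handles because the column count is taken from the header row.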

comment:6 Changed 12 years ago by wdc

Further testing under Jaunty gives new insight:

Setting "hpet=disable" results in a working Jaunty system.
With the default boot options, what's happening is that the kernel never sees the clock interrupts. You can simulate interrupts by hitting the power switch until it configures the keyboard, then hitting a key until it starts gdm and configures the mouse. After that, the system runs just fine as long as you wiggle the mouse to create interrupts.
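
One way to check the "kernel never sees the clock interrupts" symptom is to take two /proc/interrupts snapshots and see whether the timer rows advanced. The sketch below is an assumption-laden illustration, not a tested diagnostic: the helper names (timer_counts, timer_deltas, sample) are made up here, and it only looks at the "0" (timer) and "LOC" (local timer) rows seen in the listings above. On an affected machine the sleep between snapshots may itself not return promptly, which is a symptom in its own right.

```python
import time


def timer_counts(text):
    """Return {label: total across CPUs} for the IRQ 0 and LOC rows."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or label not in ("0", "LOC"):
            continue
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break                # reached the description columns
            total += int(field)
        totals[label] = total
    return totals


def timer_deltas(before, after):
    """How far each timer row advanced between two snapshots."""
    b, a = timer_counts(before), timer_counts(after)
    return {label: a[label] - b[label] for label in a if label in b}


def sample(interval=1.0, path="/proc/interrupts"):
    """Take two snapshots `interval` seconds apart and return the deltas."""
    with open(path) as f:
        first = f.read()
    time.sleep(interval)
    with open(path) as f:
        second = f.read()
    return timer_deltas(first, second)
```

Calling sample() on a healthy system should show at least one of the two rows advancing. A zero on IRQ 0 alone is not proof of a fault, since the local APIC timer (LOC) may be doing the work, as in the acpi=off listing in comment 5.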

comment:7 Changed 12 years ago by broder

  • Component changed from dotfiles to default

comment:8 Changed 12 years ago by wdc

We have more insight:

The root cause seems to be a bad interaction with the enablement of
"C States" in the "Performance" BIOS menu. This option is only available for high-end processors, and appears to have been turned on for MIT machines because we wanted the maximum energy savings.

Clearing this bit is a work-around for the problem.

comment:9 Changed 12 years ago by wdc

In case people are not following the LP bug, here is a significant status update.

  1. BZ 13053 has been opened with kernel.org:
      http://bugzilla.kernel.org/show_bug.cgi?id=13053
  2. Setting the boot option acpi_skip_timer_override currently seems to be the smallest-footprint workaround for the problem.
  3. Because of what acpi_skip_timer_override does (enable a legacy IRQ override), we are asking, "Is this a BIOS bug?"

For now, setting acpi_skip_timer_override lets the C States BIOS option do its work and lets interrupts reach the CPU, so everything runs, even with the HPET clock.
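
Since several boot options have been tried in this ticket (acpi=off, hpet=disable, clocksource=pit, acpi_skip_timer_override), it can help to confirm which of them a given machine was actually booted with. A minimal sketch, assuming the kernel command line is read from /proc/cmdline; the function name workaround_options and the example command line are made up for illustration.

```python
# Boot options mentioned in this ticket.
WATCHED = ("acpi", "hpet", "clocksource", "acpi_skip_timer_override")


def workaround_options(cmdline):
    """Return the watched options present on a kernel command line.

    Options of the form key=value map to their value; bare flags such as
    acpi_skip_timer_override map to True.
    """
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key in WATCHED:
            found[key] = value if sep else True
    return found


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Hypothetical command line, not taken from the attached logs.
example = "ro quiet splash acpi_skip_timer_override"
print(workaround_options(example))
```

On a live system, workaround_options(read_cmdline()) returns an empty dict when none of these options is set.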

comment:10 Changed 12 years ago by jdreed

  • Milestone set to Summer Deployment

comment:11 Changed 12 years ago by broder

  • Upstream bug set to https://bugs.launchpad.net/bugs/348694, http://bugzilla.kernel.org/show_bug.cgi?id=13053

comment:12 Changed 12 years ago by broder

  • Upstream bug changed from https://bugs.launchpad.net/bugs/348694, http://bugzilla.kernel.org/show_bug.cgi?id=13053 to LP:348694 Kernel:13053

comment:13 Changed 12 years ago by geofft

  • Status changed from new to closed
  • Resolution set to wontfix

We won't see any progress on this by the first summer deployment date (Friday), but Hotline has decided to just disable the "C States" in the BIOS until we hear more.

comment:14 Changed 12 years ago by wdc

On June 17, 2009 the updated (rev A03) BIOS became publicly available at:

 http://ftp.us.dell.com/bios/O760-A03.EXE

comment:15 Changed 12 years ago by wdc

I have installed the A03 BIOS on my test system and confirmed that the Jaunty Live CD
functions properly even when C States is enabled. With this sanity check, we now have a bona fide fix in BIOS rev A03.
