Commit 6145b29

aegl authored and jfvogel committed
x86/mce: Fix machine_check_poll() tests for error types
There has been a lurking "TBD" in the machine check poll routine ever
since it was first split out from the machine check handler. The
potential issue is that the poll routine may have just begun a read from
the STATUS register in a machine check bank when the hardware logs an
error in that bank and signals a machine check.

That race used to be pretty small back when machine checks were
broadcast, but the addition of local machine check means that the poll
code could continue running and clear the error from the bank before the
local machine check handler on another CPU gets around to reading it.

Fix the code to be sure to only process errors that need to be processed
in the poll code, leaving other logged errors alone for the machine
check handler to find and process.

 [ bp: Massage a bit and flip the "== 0" check to the usual !(..) test. ]

Fixes: b79109c ("x86, mce: separate correct machine check poller and fatal exception handler")
Fixes: ed7290d ("x86, mce: implement new status bits")
Reported-by: Ashok Raj <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Ashok Raj <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: linux-edac <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: x86-ml <[email protected]>
Cc: Yazen Ghannam <[email protected]>
Link: https://lkml.kernel.org/r/20190312170938.GA23035@agluck-desk
(cherry picked from commit f19501a)

Orabug: 29547647

Signed-off-by: Somasundaram Krishnasamy <[email protected]>
Reviewed-by: John Donnelly <[email protected]>

1 parent d73fd7c commit 6145b29

1 file changed: +37 -7 lines changed

arch/x86/kernel/cpu/mce/core.c

Lines changed: 37 additions & 7 deletions
@@ -739,19 +739,49 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 
 		barrier();
 		m.status = mce_rdmsrl(msr_ops.status(i));
+
+		/* If this entry is not valid, ignore it */
 		if (!(m.status & MCI_STATUS_VAL))
 			continue;
 
 		/*
-		 * Uncorrected or signalled events are handled by the exception
-		 * handler when it is enabled, so don't process those here.
-		 *
-		 * TBD do the same check for MCI_STATUS_EN here?
+		 * If we are logging everything (at CPU online) or this
+		 * is a corrected error, then we must log it.
 		 */
-		if (!(flags & MCP_UC) &&
-		    (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
-			continue;
+		if ((flags & MCP_UC) || !(m.status & MCI_STATUS_UC))
+			goto log_it;
+
+		/*
+		 * Newer Intel systems that support software error
+		 * recovery need to make additional checks. Other
+		 * CPUs should skip over uncorrected errors, but log
+		 * everything else.
+		 */
+		if (!mca_cfg.ser) {
+			if (m.status & MCI_STATUS_UC)
+				continue;
+			goto log_it;
+		}
+
+		/* Log "not enabled" (speculative) errors */
+		if (!(m.status & MCI_STATUS_EN))
+			goto log_it;
+
+		/*
+		 * Log UCNA (SDM: 15.6.3 "UCR Error Classification")
+		 * UC == 1 && PCC == 0 && S == 0
+		 */
+		if (!(m.status & MCI_STATUS_PCC) && !(m.status & MCI_STATUS_S))
+			goto log_it;
+
+		/*
+		 * Skip anything else. Presumption is that our read of this
+		 * bank is racing with a machine check. Leave the log alone
+		 * for do_machine_check() to deal with it.
+		 */
+		continue;
 
+log_it:
 		error_seen = true;
 
 		mce_read_aux(&m, i);
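
The decision made by the hunk above can be modelled outside the kernel as a pure function of the STATUS value. The following is only a sketch for reasoning about the change, not kernel code: `poll_should_log()` and `old_poll_should_log()` are hypothetical helpers, the `ST_*` macros are local stand-ins for the `MCI_STATUS_*` bits (bit positions assumed from the MCi_STATUS layout), `log_uc` stands in for `flags & MCP_UC`, and `ser` stands in for `mca_cfg.ser`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for MCI_STATUS_* (assumed MCi_STATUS bit positions). */
#define ST_VAL (1ULL << 63)	/* entry is valid */
#define ST_UC  (1ULL << 61)	/* uncorrected error */
#define ST_EN  (1ULL << 60)	/* error reporting enabled */
#define ST_PCC (1ULL << 57)	/* processor context corrupt */
#define ST_S   (1ULL << 56)	/* signalled via machine check */

/* Hypothetical model of the test removed by this commit. */
static bool old_poll_should_log(uint64_t status, bool log_uc, bool ser)
{
	if (!(status & ST_VAL))
		return false;
	if (!log_uc && (status & (ser ? ST_S : ST_UC)))
		return false;
	return true;
}

/* Hypothetical model of the tests added by this commit. */
static bool poll_should_log(uint64_t status, bool log_uc, bool ser)
{
	/* Invalid entries are ignored. */
	if (!(status & ST_VAL))
		return false;

	/* Logging everything, or a corrected error: log it. */
	if (log_uc || !(status & ST_UC))
		return true;

	/* From here on UC is set. Without SER, leave it to the handler. */
	if (!ser)
		return false;

	/* "Not enabled" (speculative) errors are logged. */
	if (!(status & ST_EN))
		return true;

	/* UCNA: UC == 1 && PCC == 0 && S == 0. */
	if (!(status & ST_PCC) && !(status & ST_S))
		return true;

	/* Anything else is presumed to be racing with a machine check. */
	return false;
}
```

Under these assumptions, a status of `ST_VAL | ST_UC | ST_EN | ST_PCC` on a system with `ser` set is one case where the two models disagree: the old test logs the bank because S is clear, while the new tests leave it for the machine check handler.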
