Smbios 0x17 error supermicro


Problem with Supermicro X9DAi (SYS-7037a-i)

Posted by SciMan » 10 Jan 2017, 22:57

I have a workstation based on the Supermicro X9DAi board in a SYS-7037a-i chassis, BIOS version 3.0a.

The workstation is connected to a UPS. Until recently everything was fine (six months since it was put into service), but today the following glitch suddenly appeared: I power it on from the front panel, the fans start and run loudly for about 5 seconds, and the red "i" ("information") LED blinks. Then the workstation aborts the startup and powers off without showing anything on the screen. A few seconds later (about 3 s) it powers on by itself and restarts the boot from scratch. The manual says nothing about an "i" LED; in its place it documents a temperature indicator, which means you should check the room for overheating or check whether something is obstructing the fans. I took the case apart, cleaned it, and reassembled it; the problem remained.

After some googling I found another possible cause: memory errors. The board has 128 GB of Reg-ECC memory that ran at Force speed 1600; I have now switched it to Auto in the BIOS (it became 1333). The problem did not go away. I will leave Memtest running overnight; incidentally, it reports ECC=off. The BIOS log contains an Smbios 0x01 SINGLE ECC-BIT ERROR entry, but it has been in the log for a long time, whereas the problem started today.

Please advise what the problem could be. I updated the BIOS firmware in the summer, following the procedure in the readme from the official Supermicro site, and it installed without problems.


Smbios 0x17 error supermicro

Question

I have a SuperMicro X10SAT motherboard, a Xeon E3-1275 CPU, 16 GB of DDR3 memory, and Win 10 Pro 64-bit.

The initial HDD was a WD Velociraptor, which experienced the same issues. I upgraded to a Samsung EVO SSD and performed a fresh install of Win 10.

There did not appear to be any issues until the next day. I believe some Windows upgrades were applied and the problem has resurfaced.

I ran SFC /SCANNOW and no issues were reported.

The PC appears to be operating properly now; however, past experience suggests it will struggle with future Windows updates and will eventually fail to boot.

I am in the process of recovering the initial load of Win 10 (prior to any updates) and will defer additional updates as long as possible, just to see if I can isolate the issue to the Windows updates.

I have contacted SuperMicro to see if a BIOS update is available that will correct this issue.

There is an abundance of information about these issues on the internet, but it usually ends with someone trying to sell some ‘fixit’ software.

I would appreciate suggestions and ideas on how to make these issues go away. We run a small business and the downtime is really putting us in a pinch.

**** An update since original post ****

As indicated, I’ve performed a fresh install of Win 10 Pro 64 bit (build 1703). Adobe XI, MS Office 10 Pro Plus, Visio, Acronis 2016 and Vedit applications have been installed. I deferred Win 10 updates. I have yet to connect to the internet.

SuperMicro could provide no definitive information on whether a BIOS upgrade would remedy the issue, that is, prevent it from happening again. I will try this sometime in the future, but I’m reluctant to go this route without knowing whether it will fix the issue.

The PC is presently booting up just fine without MBR 2 or MBR 3 errors, and the error logging (yes, it’s turned on) is not showing the SMBIOS 0x17 or 0x0a errors encountered previously. However, no Win 10 updates have been applied.

My research suggests that Win 10 updates frequently cause issues with certain PCs. Presently we have a variety of PCs and laptops running Win 10 Pro 64-bit and 32-bit without issues. The issue is specific to this PC.

All of this leads me to believe Win 10 updates are causing this machine to lose its mind. Everything is backed up for recovery; I just hate spending the time to redo something over and over again only to end up with the same result.

I’m looking for some insight and suggestions.


Smbios 0x17 error supermicro


Good afternoon, I have run into the following problem.
I have a server that refuses to start when it is rebooted from within Windows: it beeps (5 short, 1 long) and the screen shows Intel Reference Code Execution with code 02. Judging by the beeps, it is a memory problem. However, with the reset button everything starts normally; it then ran for 2 hours and hung. The configuration is as follows:
- Supermicro X11SSL-F (2 PCI-E x8, 1 PCI-E x16, 4 DDR4 DIMM, Video, Dual Gigabit LAN);
- QuadCore Intel Xeon E3-1240 v5, 3700 MHz (37 x 100);
- 2x Kingston 8GB DDR4 PC4-17000 [KVR21E15D8/8];
- Two mirrors on the integrated Intel controller.

I was rebooting because I had installed the latest MS updates for September. The OS is 2008R2. A similar problem has been reported elsewhere, only with different memory, and I have come across several other similar cases. I removed all of the installed updates (except the one for ie11), namely: KB3184471, KB3177186, KB3175024, KB3184122, KB3185911. After that everything rebooted successfully. Has anyone dealt with something similar?


Firmware Event Log

Based off work by Tim Hockin @ Google, Inc, 2006

Documentation License: Creative Commons Attribution 3.0 United States License

Table of Contents

    • Background
    • Terminology
    • Overview
    • Detailed Design
      • Log header
    • Events
    • Event types
      • EventPayloads
        • Event 0x01 — Single bit ECC memory error
        • Event 0x02 — Multi bit ECC memory error
        • Event 0x03 — Memory parity error
        • Event 0x04 — Bus timeout
        • Event 0x05 — IO channel check
        • Event 0x06 — S/W NMI
        • Event 0x07 — POST memory resize
        • Event 0x08 — POST error event
        • Event 0x09 — PCI PERR
        • Event 0x0A — PCI SERR
        • Event 0x0B — CPU failure
        • Event 0x0C — EISA timeout
        • Event 0x0D — Correctable memory log disabled
        • Event 0x0E — Log disabled
        • Event 0x10 — System limit exceeded
        • Event 0x11 — Async HW timer (WDT) timeout
        • Event 0x12 — System configuration information
        • Event 0x13 — HDD information
        • Event 0x14 — System reconfigured
        • Event 0x15 — Uncorrectable CPU-complex error
        • Event 0x16 — Log area reset/cleared
        • Event 0x17 — System boot
    • Log shrinking

Background

Modern firmware implementations often reserve a small region in non-volatile memory to store diagnostic and debugging information. However, there is no universal standard for the format. There is no standard location for the log to reside. And there is no standard, free/open-source code to store and retrieve records.

Google’s firmware event log format is based off of a well-understood standard format — SMBIOS System Event Log (Type 15). SMBIOS, however, usually exists strictly in RAM. We require that the event is stored persistently so that information about it can be retrieved at any later point in time, so methods have been defined to store the data in non-volatile memory (assumed to be flash memory).

This event log will store information about events critical to system operation. Such events include DRAM errors, bus failures, and arbitrary OEM-defined events such as basic kernel crash information and crisis recovery notification. The events which are logged are implementation-specific; that is, a given implementation might not actually recognize or commit log entries for every event.

Terminology

SMBIOS: System Management BIOS.
SMBIOS is a public specification which defines structured data about the system and methods for accessing that data.

    * http://www.dmtf.org/standards/smbios/

Overview

What we want is a way to keep track of a few dozen or possibly hundreds of events persistently across power cycles and reboots.

We will use a log in the system flash (memory mapped), which can be safely shrunk and copied with a double-buffered copy-and-shrink operation (more in the "Log shrinking" section below). We will enable logging of significant events which occur any time the system is in a running state, whether it’s booting or whether the OS is operational.

This will provide rich, detailed information about the recent history of a system, which can help us when diagnosing systemic problems, both during platform development as well as after deployment.

Detailed Design

The SMBIOS specification defines that the event log consists of an arbitrary length header (0 length is valid), followed immediately by the log data. There are a number of options for log access method, the simplest of which is memory-mapped. We will be using the system flash (which is usually already memory mapped) to store the event log.

The SMBIOS Event Log specification is not perfect, and creates a few rather unfortunate limitations. The largest possible event log, including any header is 65535 (0xffff) bytes. We will make our log as large as allowed. The last byte of the 64 KB flash sector will be left as 0xff, which is conveniently the end-of-log marker.

Log data is stored as a series of variable-length event records in chronological order. The SMBIOS specification does not provide for ring-buffer semantics, which makes for a less than ideal failure mode. If the log reaches capacity, the log is simply marked as full. In order to avoid this failure mode we will check the log size whenever we write an event. If the log has crossed a size threshold, the log will be automatically shrunk and relocated (for a safe copy operation, see "Log shrinking" below).

Log header

The event log header is found at the start of the log area. It can be either a standard format or an OEM-specific format. We will define it as an OEM-specific packed structure with the following layout:

The *elog_magic* field is used to identify the active log area. Since we will be using two buffer areas, the signature could be at either of two locations.

The *elog_version* field indicates what the layout of the log header is. This document describes the version 1 header.

The *elog_size* field indicates the size of the elog_header, in bytes.

The most significant bit of the *elog_sequence* indicates whether the value is valid or not: if the bit is set (i.e. the sequence number is negative), then the sequence number is not valid. The most significant byte of the sequence number must be the very last byte written when relocating the log, which will ensure that the entire log has been committed to flash before the header is considered valid. In the case of a mid-copy failure, it is possible that two instances of the magic number or even two valid *elog_header* structures could exist in the flash. The larger sequence number determines which header is correct.

The *elog_sequence* plus the number of events in the log yields the total event count. The event count must be essentially monotonic. The only times the event count may move backwards is when the log is cleared, in which case the *elog_sequence* becomes 0, and in the case of a wrap-around.
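The validity and selection rules for *elog_sequence* can be sketched in Python. This is only an illustration, not the firmware's code: it assumes *elog_sequence* is a 32-bit field (the width is not stated in this text), the function names are hypothetical, and wrap-around of the sequence number is not handled.

```python
SEQ_BITS = 32                      # assumed width of elog_sequence
SEQ_INVALID = 1 << (SEQ_BITS - 1)  # most significant bit set means "not valid"


def sequence_is_valid(seq: int) -> bool:
    """A sequence number is valid only when its most significant bit is clear."""
    return (seq & SEQ_INVALID) == 0


def pick_active_header(seq_a: int, seq_b: int):
    """Return 'a', 'b', or None for the two candidate log areas.

    Headers with an invalid sequence number are ignored; if both are
    valid, the larger sequence number wins.
    """
    valid = {name: seq for name, seq in (("a", seq_a), ("b", seq_b))
             if sequence_is_valid(seq)}
    if not valid:
        return None
    return max(valid, key=valid.get)


def total_event_count(seq: int, events_in_log: int) -> int:
    """elog_sequence plus the number of events currently in the log."""
    return seq + events_in_log
```

Note that an erased flash area reads as all 0xff bytes, so under this assumption an unwritten sequence field has its top bit set and is treated as invalid.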

The following pseudo-code describes the BIOS log-locating algorithm:

Events

The SMBIOS specification defines the layout of an event structure. Each event consists of a well defined, fixed size structure and an optional variable length data payload. The fixed size portion of an SMBIOS event can be interpreted as a packed structure of the following form:

The *event_id* field is a value defined in the "Event types" section below.

The *event_size* field defines the total size of the event, including the fixed and variable data portions.

The *event_time[]* field is the timestamp of the event. Each byte is a BCD value. The six bytes represent year, month, day, hour, minute, second.

The *event_data[]* field is the variable length data payload. The contents of this field are detailed in the "Event payloads" section below. Every event has at least 1 byte of payload. The final (or only) byte of payload is used as a sum-to-zero byte, so the bytes of every valid event sum to zero. There is still a small chance that a corrupted event, including corrupted data, could sum to zero, but this is probably negligible. If we want to be even more bulletproof, we could make the checksum 2 bytes. If the 1 byte sum is < 0xff, the second checksum byte is 0. If the 1 byte sum is exactly 0xff, then the first checksum byte is 0xfe and the second checksum byte is 1. This is probably not worth the complexity or the storage cost of 1 byte per event.
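The field descriptions above can be exercised with a short Python sketch. It assumes the fixed part is one byte of *event_id*, one byte of *event_size*, and six BCD bytes of *event_time*, in that order, and it maps the two-digit BCD year to 2000 + yy; both are assumptions for illustration, and `parse_event` and `build_event` are hypothetical helper names.

```python
from datetime import datetime

HEADER_LEN = 8  # event_id (1) + event_size (1) + event_time (6)


def bcd_to_int(b: int) -> int:
    """Decode one packed-BCD byte, e.g. 0x17 -> 17."""
    return (b >> 4) * 10 + (b & 0x0F)


def checksum_ok(event: bytes) -> bool:
    """A valid event sums to zero modulo 256 over all of its bytes."""
    return (sum(event) & 0xFF) == 0


def parse_event(buf: bytes, offset: int = 0) -> dict:
    """Parse one event record starting at `offset`."""
    event_id = buf[offset]
    event_size = buf[offset + 1]
    event = buf[offset:offset + event_size]
    if event_size < HEADER_LEN + 1 or len(event) != event_size:
        raise ValueError("truncated or malformed event")
    yy, mo, dd, hh, mi, ss = (bcd_to_int(b) for b in event[2:8])
    return {
        "id": event_id,
        "size": event_size,
        "time": datetime(2000 + yy, mo, dd, hh, mi, ss),  # century is assumed
        "payload": event[HEADER_LEN:-1],  # payload without the checksum byte
        "valid": checksum_ok(event),
    }


def build_event(event_id: int, time_bcd: bytes, payload: bytes = b"") -> bytes:
    """Assemble an event and append the sum-to-zero checksum byte."""
    body = bytes([event_id, HEADER_LEN + len(payload) + 1]) + time_bcd + payload
    return body + bytes([(-sum(body)) & 0xFF])
```

A reader built this way can skip a corrupted record by its *event_size* and still report that the checksum failed, rather than aborting the whole log walk.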

Event types

The SMBIOS specification defines several standard types and leaves room for OEM defined types. Below is the list of SMBIOS standard events (As of version 2.6 of the spec):

|| *Event ID* || *Meaning* ||
|| 0x00 || Reserved ||
|| 0x01 || Single-bit ECC error ||
|| 0x02 || Multi-bit ECC error ||
|| 0x03 || Memory parity error ||
|| 0x04 || Bus timeout ||
|| 0x05 || IO channel check ||
|| 0x06 || Software NMI ||
|| 0x07 || POST memory resize ||
|| 0x08 || POST error ||
|| 0x09 || PCI parity error ||
|| 0x0A || PCI system error ||
|| 0x0B || CPU failure ||
|| 0x0C || EISA FailSafe Timer timeout ||
|| 0x0D || Correctable memory log disabled ||
|| 0x0E || Specific event type log disabled ||
|| 0x10 || System limit exceeded (e.g. temperature) ||
|| 0x11 || Async HW timer (WDT) timeout ||
|| 0x12 || System configuration information ||
|| 0x13 || Hard disk information ||
|| 0x14 || System reconfigured ||
|| 0x15 || Uncorrectable CPU-complex error ||
|| 0x16 || Log area reset/cleared ||
|| 0x17 || System boot ||
|| 0x80-0xFE || OEM-specific events ||
|| 0xFF || End of log ||
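For reading a BIOS event log by hand, the table can be turned into a small lookup. This is a sketch that only mirrors the rows listed above; `describe_event` is a hypothetical helper, and a given vendor's BIOS may display different wording for the same ID.

```python
STANDARD_EVENTS = {
    0x00: "Reserved",
    0x01: "Single-bit ECC error",
    0x02: "Multi-bit ECC error",
    0x03: "Memory parity error",
    0x04: "Bus timeout",
    0x05: "IO channel check",
    0x06: "Software NMI",
    0x07: "POST memory resize",
    0x08: "POST error",
    0x09: "PCI parity error",
    0x0A: "PCI system error",
    0x0B: "CPU failure",
    0x0C: "EISA FailSafe Timer timeout",
    0x0D: "Correctable memory log disabled",
    0x0E: "Specific event type log disabled",
    0x10: "System limit exceeded (e.g. temperature)",
    0x11: "Async HW timer (WDT) timeout",
    0x12: "System configuration information",
    0x13: "Hard disk information",
    0x14: "System reconfigured",
    0x15: "Uncorrectable CPU-complex error",
    0x16: "Log area reset/cleared",
    0x17: "System boot",
}


def describe_event(event_id: int) -> str:
    """Translate an SMBIOS event ID into the meaning from the table."""
    if event_id in STANDARD_EVENTS:
        return STANDARD_EVENTS[event_id]
    if 0x80 <= event_id <= 0xFE:
        return "OEM-specific event"
    if event_id == 0xFF:
        return "End of log"
    return "Not listed in the table"
```

With this table, the `SMBIOS 0x17` and `0x0a` entries mentioned in the forum posts read as "System boot" and "PCI system error" respectively.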

Event payloads

The SMBIOS specification allows for a variable-length data payload on each event. This section details the payload for each event. If an event is not detailed here, there is no data payload other than the checksum. All payloads are assumed to be packed structures.

Event 0x01 — Single bit ECC memory error

This event indicates that a single-bit ECC (SBE) memory error has occurred. The event payload `dimm_number` indicates the DIMM that failed.

Event 0x02 — Multi bit ECC memory error

This event indicates that a multi-bit ECC (MBE) memory error has occurred. The event payload is the same as in Event 0x01.

Event 0x03 — Memory parity error

This event indicates that a memory parity error has occurred. The event payload is the same as in Event 0x01.

Event 0x04 — Bus timeout

This event indicates that a bus timeout has occurred while accessing a bus. The payload `which` field indicates the type of the bus, such as PCI or I2C. The `sub_type` field refers to the instance of the bus (for example, AMB I2C).

Event 0x05 — IO channel check

This event indicates that an error had occurred in an I/O channel in the system. The payload `which` field indicates type of the error as follows:


The `device` field contains the PCI function address (PFA) of the device reporting the error. The PFA is encoded as:

   * bits [15:8] - PCI bus number
   * bits [7:3]  - PCI device number
   * bits [2:0]  - PCI function number
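The PFA bit layout maps directly onto shifts and masks. A minimal Python sketch follows; the function names are hypothetical and the output format of `format_pfa` is just the conventional bus:device.function notation.

```python
def decode_pfa(pfa: int) -> tuple:
    """Split a 16-bit PCI function address into (bus, device, function)."""
    bus = (pfa >> 8) & 0xFF      # bits [15:8]
    device = (pfa >> 3) & 0x1F   # bits [7:3]
    function = pfa & 0x07        # bits [2:0]
    return bus, device, function


def encode_pfa(bus: int, device: int, function: int) -> int:
    """Pack bus, device and function numbers into a 16-bit PFA."""
    return ((bus & 0xFF) << 8) | ((device & 0x1F) << 3) | (function & 0x07)


def format_pfa(pfa: int) -> str:
    """Render a PFA in bus:device.function form."""
    bus, device, function = decode_pfa(pfa)
    return f"{bus:02x}:{device:02x}.{function}"
```

For example, a `device` value of 0x03E2 decodes to bus 3, device 28, function 2.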

Event 0x06 — S/W NMI

This payload-less event indicates that a S/W NMI has occurred.

*Note*: All events contain a checksum as standard payload.

Event 0x07 — POST memory resize

This event indicates that memory resizing had occurred during the firmware POST. There is no payload associated with this event except checksum.

Event 0x08 — POST error event

This event lists the POST errors that occurred during the boot process and reports them as a bit pattern defined in the SMBIOS specification.

Event 0x09 — PCI PERR

Event 0x0A — PCI SERR

These events indicate that a PCI PERR/SERR has occurred and list the device address in the `device` field of the payload.

Event 0x0B — CPU failure

This event indicates that a CPU failure had occurred. The nature of the error is indicated in `sub_type` field as follows:

   * 0x01 - CPU mismatch
   * 0x02 - CPU IERR# assertion
   * 0x03 - CPU BINIT# assertion

The `cpu_number` field identifies the CPU that caused the failure.

Event 0x0C — EISA timeout

This event indicates a timeout had occurred in the EISA bus. This event does not have any payload other than checksum.

Event 0x0D — Correctable memory log disabled

This event indicates that the logging of correctable memory errors had been disabled. This event does not have any payload other than checksum.

Event 0x0E — Log disabled

The `event_type` field indicates which type of event is no longer being logged. This event occurs when a particular event happens too frequently and the logger decides to stop logging it, to avoid filling the event log with events of a single type.

Event 0x10 — System limit exceeded

The `which` field points to the parameter that exceeded the system limit. The values for this field are not defined.

Event 0x11 — Async HW timer (WDT) timeout

The `timer` field indicates which watchdog timer timed out. Known values for this field are:

   * 0x0001 - TCO watchdog
   * 0x0002 - .Net watchdog

Event 0x12 — System configuration information

Event 0x13 — HDD information

No payload format is defined for these events.

Event 0x14 — System reconfigured

The `which` field indicates which aspect of the system was reconfigured. Known values for this field are:

   * 0x0001 - DIMMs reconfigured

Event 0x15 — Uncorrectable CPU-complex error

The `subtype` field indicates type of the CPU error. `cpu_number` identifies the CPU that generated the error.

Event 0x16 — Log area reset/cleared

The `bytes` field indicates how many bytes of the log area were discarded.

*NOTE*: Since the number of bytes cleared cannot be 0, the count field is 0-based; that is, a value of 0 indicates that 1 byte was cleared. This way we can represent values from 1 to 64K (a field value of 0xFFFF means 65536 bytes). It was also decided not to increase the size of the _bytes_ field from _uint16_t_ to _uint32_t_, because the current elog size can be at most 64K.
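The 0-based encoding of the count is easy to get wrong by one, so here is a tiny Python sketch of both directions. The helper names are hypothetical.

```python
def encode_cleared_bytes(count: int) -> int:
    """Convert a real byte count (1..65536) to the 0-based 16-bit field value."""
    if not 1 <= count <= 0x10000:
        raise ValueError("count must be between 1 and 65536")
    return count - 1


def decode_cleared_bytes(field: int) -> int:
    """Convert the stored 16-bit field value back to the real byte count."""
    return (field & 0xFFFF) + 1
```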

Event 0x17 — System boot

The `bootnum` field contains the sequential boot number of the boot being logged. This will always be the first event logged during boot up.

Log shrinking

Because the SMBIOS specification does not provide ring-buffer semantics, we want to avoid the "log full" scenario if at all possible. When writing new events to the log, we must ensure that there is space available for the new event. If there is not space available, then the log must be shrunk and re-written.

Because flash bits are only programmable from ‘1’ to ‘0’, it is effectively a write-once medium, with sector erases required in order to re-write. Flash chips have a variety of erasable sector sizes. Because we have multiple parts, we must be ready to handle the worst case — 64 KB erasable areas. That means that you can only erase the device 64 KB at a time. There is no way to erase less than that (on at least some flash chips). When we need to shrink the log, we need to avoid any time window where the log is not present on the flash chip. To do this, we will use a double-buffered copy operation. When we near capacity on one 64 KB log, we will copy a portion of the log to a second flash block, and then indicate that the second block is valid. Only after the new log is committed to the flash medium can we erase the old log. This should provide a reliable way of shrinking the log without risking a corrupted or missing log.

Shrinking the log will not be a fast operation. We want to avoid doing the log shrink at run-time if at all possible. To do this, we will evaluate the fullness of the log at bootup time. If the log is found to have crossed the "fullness threshold" (defined below), we can do a log shrink immediately, which will hopefully reduce the need to do log shrinks when logging events. This does not eliminate the need to check for space when logging an event, but should reduce the likelihood of not finding space.

To shrink the log, we scan forward from the start of the log until we cross the shrink threshold. Because we only want to discard whole events, we might discard slightly more than the nominal shrink size. We can then copy the remainder of the log to the new location. The following pseudo-code describes the BIOS log-shrinking algorithm in more detail:

    /*
     * The SMBIOS spec defines the log area size (including header) as a 16 bit field.
     */
    #define ELOG_TOTAL_SIZE       0xffff
    #define ELOG_SHRINK_SIZE      0x4000
    #define ELOG_FULL_THRESHOLD   0xf000
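The scan-and-discard step described in this section can be sketched in Python using the three constants above. This is an illustration under stated assumptions, not the BIOS algorithm itself: it works on the event area only (the log header is left out), it takes the second byte of each record as the record's total size, and it does not model the double-buffered flash copy or the header commit. The function names are hypothetical.

```python
ELOG_TOTAL_SIZE = 0xFFFF
ELOG_SHRINK_SIZE = 0x4000
ELOG_FULL_THRESHOLD = 0xF000
END_OF_LOG = 0xFF


def used_bytes(events: bytes) -> int:
    """Walk the event records and return the number of bytes in use."""
    offset = 0
    while offset < len(events) and events[offset] != END_OF_LOG:
        if offset + 1 >= len(events) or events[offset + 1] == 0:
            raise ValueError("corrupt or truncated event")
        offset += events[offset + 1]  # event size covers the whole record
    return offset


def needs_shrink(events: bytes) -> bool:
    """True once the log has crossed the fullness threshold."""
    return used_bytes(events) >= ELOG_FULL_THRESHOLD


def shrink(events: bytes, shrink_size: int = ELOG_SHRINK_SIZE) -> bytes:
    """Drop whole events from the front until at least shrink_size bytes
    are discarded, and return the events that should be copied over."""
    end = used_bytes(events)
    offset = 0
    while offset < end and offset < shrink_size:
        offset += events[offset + 1]
    return events[offset:end]
```

Because only whole events are dropped, the number of discarded bytes can exceed `shrink_size` by up to one event, which matches the "slightly more than the nominal shrink size" remark above.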

An ASUS V-PRO Z77 motherboard has 2×2 memory modules installed (Kingston and Corsair). All four sticks of RAM work properly and cause no complaints. However, in the dmidecode output below I am interested in the following lines:
Error Information Handle: 0x0060
Error Information Handle: 0x0063
What does it mean?
What are the reasons for these errors?
Unfortunately, I could not find information about this in Google.

$ sudo dmidecode --type 17
# dmidecode 2.12
# SMBIOS entry point at 0x000f04c0
SMBIOS 2.7 present.

Handle 0x005B, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x005C
    Error Information Handle: 0x0060
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: ChannelA-DIMM0
    Bank Locator: BANK 0
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1333 MHz
    Manufacturer: Kingston
    Serial Number: 9333B00B
    Asset Tag: 9876543210
    Part Number: 99U5584-007.A00LF 
    Rank: 1
    Configured Clock Speed: 1333 MHz

Handle 0x005F, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x005C
    Error Information Handle: No Error
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: ChannelA-DIMM1
    Bank Locator: BANK 1
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1333 MHz
    Manufacturer: 029E
    Serial Number: 00000000
    Asset Tag: 9876543210
    Part Number: CMZ8GX3M2A1600C9  
    Rank: 2
    Configured Clock Speed: 1333 MHz

Handle 0x0062, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x005C
    Error Information Handle: 0x0063
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: ChannelB-DIMM0
    Bank Locator: BANK 2
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1333 MHz
    Manufacturer: Kingston
    Serial Number: 1D10C373
    Asset Tag: 9876543210
    Part Number: 99U5584-018.A00LF 
    Rank: 1
    Configured Clock Speed: 1333 MHz

Handle 0x0065, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x005C
    Error Information Handle: No Error
    Total Width: 64 bits
    Data Width: 64 bits
    Size: 4096 MB
    Form Factor: DIMM
    Set: None
    Locator: ChannelB-DIMM1
    Bank Locator: BANK 3
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1333 MHz
    Manufacturer: 029E
    Serial Number: 00000000
    Asset Tag: 9876543210
    Part Number: CMZ8GX3M2A1600C9  
    Rank: 2
    Configured Clock Speed: 1333 MHz  

UPD

$ sudo dmidecode --type 18
# dmidecode 2.12
# SMBIOS entry point at 0x000f04c0
SMBIOS 2.7 present.

Handle 0x005D, DMI type 18, 23 bytes
32-bit Memory Error Information
    Type: OK
    Granularity: Unknown
    Operation: Unknown
    Vendor Syndrome: Unknown
    Memory Array Address: Unknown
    Device Address: Unknown
    Resolution: Unknown

Handle 0x0060, DMI type 18, 23 bytes
32-bit Memory Error Information
    Type: OK
    Granularity: Unknown
    Operation: Unknown
    Vendor Syndrome: Unknown
    Memory Array Address: Unknown
    Device Address: Unknown
    Resolution: Unknown

Handle 0x0063, DMI type 18, 23 bytes
32-bit Memory Error Information
    Type: OK
    Granularity: Unknown
    Operation: Unknown
    Vendor Syndrome: Unknown
    Memory Array Address: Unknown
    Device Address: Unknown
    Resolution: Unknown

Handle 0x0066, DMI type 18, 23 bytes
32-bit Memory Error Information
    Type: OK
    Granularity: Unknown
    Operation: Unknown
    Vendor Syndrome: Unknown
    Memory Array Address: Unknown
    Device Address: Unknown
    Resolution: Unknown

Hello,

I have a similar build, just with 2 x EPYC 7281, and a similar issue.

It seems the system resets itself once load is >80% and I/O is >50%.

Looking at that more closely, the only pieces of hardware that can do that are the watchdogs (which are not working in Linux right now anyway); maybe the BMC also has some sort of watchdog of its own (I can’t find any good documentation for the motherboard). Supermicro’s manual for the motherboard is also strange.

It looks to me like it is a matter of the memory configuration one is using (4, 8, 12, or 16 RAM modules, which is completely undocumented in the manual) and of the SATA/PCI-E/NVMe ports in use.

I use 4 x 32GB right now.

I’m using the internal M.2 port with a ‘Samsung SSD 960 EVO 250GB’ for *system* and have a second one in the PCI-e x8 slot. There are also 8 x 2TB NAS HDDs, 4 on each CPU SATA port (using the vendor’s cables).

The original configuration looked like this:

- 4 x 32GB RAM modules in D1/F1 (as the motherboard manual suggests)
- PCI-e x8 CPU1 slot: the second NVMe
- M.2 CPU1: the system NVMe (NVME_0 and NVME_1 ports unused)
- CPU1-SATA: 4 x 2TB HDD
- CPU2-SATA: 4 x 2TB HDD

That was a no-go: stressing the system a bit made it reboot itself after 5 to 10 minutes.

I also turned on EDAC and MCE in the kernel, and now I see an MCE on CPU24, but I don’t think it is real, since it occurs like this: the BMC reports an error on Disk18, SMART assertion (huh? I don’t have 18 disks), followed by a correctable MCE in the kernel, e.g.:

mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:24 (17:1:2) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
[Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000005a00000d
[Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2
[Hardware Error]: Power, Interrupts, etc. Error: Error on GMI link.
[Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)

*And* it only occurs under high load; under normal load I never see it.
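The flags in the log line can be recovered from the raw `MC22_STATUS` value. The Python sketch below is a rough decoder, assuming the usual x86 machine-check status bit positions (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57), SyndV at bit 53, and the extended error code in bits [21:16]; these positions are assumptions to check against the processor documentation, and `decode_mc_status` is a hypothetical helper.

```python
STATUS_FLAGS = {
    63: "Val",
    62: "Over",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    53: "SyndV",
}


def decode_mc_status(status: int) -> dict:
    """Decode a few well-known fields of a 64-bit machine-check status value."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "UC" not in flags,              # UC clear means a corrected error
        "extended_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }
```

Applied to 0xd82000000002080b, this yields the Over, MiscV and SyndV flags, a corrected error, and extended error code 2, which agrees with the kernel's own decoding in the log above.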

Here is what I did as a workaround for now:

First, get at least kernel 4.15-rc6 (this has fixed EDAC for EPYC). Make sure EDAC is turned on in the kernel config.

On the hardware side:

Power off the box. Pull out any PCI-e cards and any HDDs you don’t need, leaving only the HDD/SSD needed to boot the system.

Power on and in the BIOS turn OFF:

- Watchdog
- IOMMU
- ACS
- SR-IOV
- PCIe Spread Spectrum
- Core Performance Boost
- Global C-state Control
- any PCI-e/NVMe OPROMs you don’t need

Change:

- Determinism Slider to Performance
- Memory Clock to 2666 MHz
- (if you use UEFI, change the remaining OPROMs to EFI)

Save and perform a power cycle.

Once the box is up, open the IPMI web interface:

- Change the fan mode to HeavyIO
- Turn on the extra event features

Here this works as a workaround: I have been stressing the box for nearly a day now with a loop compiling LibreOffice and the kernel tree with -j$core_count.

I still see the MCE from time to time, and something may be wrong, but right now I’m not sure who to blame.

(PS: you can find me on freenode, just PM crazy if you wish)
