igt@xe_exec_system_allocator@madvise-split-vma on shard-bmg@IGTPW

Results for igt@xe_exec_system_allocator@madvise-split-vma


Detail	Value
Duration	unknown
Hostname	shard-bmg-2
Igt-Version	IGT-Version: 2.3-gced6a76d1 (x86_64) (Linux: 7.0.0-rc4-lgci-xe-xe-4749-4ae9f18564e78a544-debug+ x86_64)

Out

Using IGT_SRANDOM=1774064821 for randomisation
Opened device: /dev/dri/card0
Starting subtest: madvise-split-vma
runner: This test was killed due to a kernel taint (0x244).

This test caused an abort condition: Kernel badly tainted (0x244, 0x200) (check dmesg for details):
	TAINT_WARN: WARN_ON has happened.
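
The taint values reported by the runner can be decoded bit by bit. The following is a minimal sketch, assuming the bit-to-flag mapping from the kernel's "Tainted kernels" documentation; the table and the helper name `decode_taint` are illustrative and not part of IGT or the runner.

```python
# Bit positions of the kernel taint mask, as listed in the kernel's
# "Tainted kernels" documentation (assumed here, not taken from this log).
TAINT_FLAGS = {
    0: ("P", "PROPRIETARY_MODULE"),
    1: ("F", "FORCED_MODULE"),
    2: ("S", "CPU_OUT_OF_SPEC"),
    3: ("R", "FORCED_RMMOD"),
    4: ("M", "MACHINE_CHECK"),
    5: ("B", "BAD_PAGE"),
    6: ("U", "USER"),
    7: ("D", "DIE"),
    8: ("A", "OVERRIDDEN_ACPI_TABLE"),
    9: ("W", "WARN"),
    10: ("C", "CRAP"),
    11: ("I", "FIRMWARE_WORKAROUND"),
    12: ("O", "OOT_MODULE"),
    13: ("E", "UNSIGNED_MODULE"),
    14: ("L", "SOFTLOCKUP"),
    15: ("K", "LIVEPATCH"),
    16: ("X", "AUX"),
    17: ("T", "RANDSTRUCT"),
    18: ("N", "TEST"),
}


def decode_taint(mask):
    """Return (bit, letter, name) for every known taint bit set in mask."""
    return [
        (bit, letter, name)
        for bit, (letter, name) in sorted(TAINT_FLAGS.items())
        if mask & (1 << bit)
    ]


if __name__ == "__main__":
    # 0x244 is the full mask from the runner message; 0x200 is the part
    # the runner treats as an abort condition.
    for mask in (0x244, 0x200):
        flags = ", ".join(f"{letter}={name}" for _, letter, name in decode_taint(mask))
        print(f"{mask:#x}: {flags}")
```

Under that mapping, 0x244 is S, U and W together, which agrees with the `Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER` line in dmesg plus the new WARN, and 0x200 is the WARN bit alone.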

Err

Starting subtest: madvise-split-vma
(xe_exec_system_allocator:3623) xe/xe_ioctl-CRITICAL: Test assertion failure function xe_bo_create, file ../lib/xe/xe_ioctl.c:411:
(xe_exec_system_allocator:3623) xe/xe_ioctl-CRITICAL: Failed assertion: __xe_bo_create(fd, vm, size, placement, flags, ((void *)0), &handle) == 0
(xe_exec_system_allocator:3623) xe/xe_ioctl-CRITICAL: Last errno: 125, Operation canceled
(xe_exec_system_allocator:3623) xe/xe_ioctl-CRITICAL: error: -1 != 0
Received signal SIGQUIT.
Stack trace: 
 #0 [fatal_sig_handler+0x17b]
 #1 [__sigaction+0x50]
 #2 [read+0x11]
 #3 [_IO_file_underflow+0x165]
 #4 [__getdelim+0x15f]
 #5 [dwfl_report_module+0x91b]
 #6 [dwfl_linux_proc_report+0xd3]
 #7 [print_backtrace+0x5d]
 #8 [__igt_fail_assert+0x104]
 #9 [xe_bo_create+0x77]
 #10 [test_exec+0x456]
 #11 [__igt_unique____real_main2349+0x32d5]
 #12 [main+0x2d]
 #13 [__libc_init_first+0x8a]
 #14 [__libc_start_main+0x8b]
 #15 [_start+0x25]

Dmesg

<6> [94.202895] Console: switching to colour dummy device 80x25

<6> [94.203211] [IGT] xe_exec_system_allocator: executing

<3> [96.495609] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11367 recv=11366

<3> [98.799270] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11367 recv=11366

<6> [98.912593] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [98.912617] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [98.912627] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [98.912636] nvme 0000:05:00.0: [ 0] RxErr (First)

<3> [101.103188] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11368 recv=11366

<6> [101.216068] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [101.216093] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [101.216102] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [101.216111] nvme 0000:05:00.0: [ 0] RxErr (First)

<3> [103.407420] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11368 recv=11366

<3> [105.711726] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11369 recv=11366

<3> [108.016124] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11369 recv=11366

<6> [108.019567] [IGT] xe_exec_system_allocator: starting subtest madvise-split-vma

<7> [108.020135] xe 0000:03:00.0: [drm:drm_pagemap_dev_unhold_work [drm_gpusvm_helper]] Releasing reference on provider device and module.

<6> [108.133454] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [108.133480] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [108.133488] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [108.133497] nvme 0000:05:00.0: [ 0] RxErr (First)

<7> [108.754163] xe 0000:03:00.0: [drm:xe_hwmon_read [xe]] thermal data for group 0 val 0x2c2d292a

<7> [108.754307] xe 0000:03:00.0: [drm:xe_hwmon_read [xe]] thermal data for group 1 val 0x2c2c2c2d

<6> [108.861165] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [108.861189] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [108.861198] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [108.861207] nvme 0000:05:00.0: [ 0] RxErr (First)

<6> [108.970280] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [108.970304] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [108.970313] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [108.970322] nvme 0000:05:00.0: [ 0] RxErr (First)

<6> [109.080364] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [109.080389] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [109.080398] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [109.080407] nvme 0000:05:00.0: [ 0] RxErr (First)

<3> [110.320405] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11370 recv=11366

<3> [110.320478] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11371 recv=11366

<6> [110.429605] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [110.429629] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [110.429639] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [110.429648] nvme 0000:05:00.0: [ 0] RxErr (First)

<3> [112.624626] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11370 recv=11366

<3> [112.624700] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11371 recv=11366

<3> [114.928900] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11372 recv=11366

<3> [114.928973] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11373 recv=11366

<3> [117.233113] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11372 recv=11366

<3> [117.233189] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11373 recv=11366

<3> [119.537245] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11374 recv=11366

<3> [121.842493] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11374 recv=11366

<7> [123.767164] xe 0000:03:00.0: [drm:xe_hwmon_read [xe]] thermal data for group 0 val 0x2d2d2a2a

<7> [123.767309] xe 0000:03:00.0: [drm:xe_hwmon_read [xe]] thermal data for group 1 val 0x2d2d2c2d

<3> [124.145673] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11375 recv=11366

<6> [125.276463] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [125.276565] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [125.276574] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [125.276583] nvme 0000:05:00.0: [ 0] RxErr (First)

<6> [125.386671] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0

<4> [125.386696] nvme 0000:05:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

<4> [125.386705] nvme 0000:05:00.0: device [15b7:5017] error status/mask=00000001/0000e000

<4> [125.386714] nvme 0000:05:00.0: [ 0] RxErr (First)

<3> [126.449881] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11375 recv=11366

<7> [131.827342] xe 0000:03:00.0: [drm:xe_hw_engine_snapshot_capture [xe]] Tile0: GT0: Proceeding with manual engine snapshot

<4> [131.827837] xe 0000:03:00.0: [drm] Tile0: GT0: Check job timeout: seqno=16716, lrc_seqno=16716, guc_id=0, not started

<4> [136.946678] xe 0000:03:00.0: [drm] Tile0: GT0: Check job timeout: seqno=16716, lrc_seqno=16716, guc_id=0, not started

<7> [138.737656] xe 0000:03:00.0: [drm:xe_hwmon_read [xe]] thermal data for group 0 val 0x2d2d2a2b

<7> [138.737798] xe 0000:03:00.0: [drm:xe_hwmon_read [xe]] thermal data for group 1 val 0x2d2d2c2d

<4> [142.066775] xe 0000:03:00.0: [drm] Tile0: GT0: Check job timeout: seqno=16716, lrc_seqno=16716, guc_id=0, not started

<4> [147.186254] xe 0000:03:00.0: [drm] Tile0: GT0: Schedule disable failed to respond, guc_id=0

<6> [147.371507] xe 0000:03:00.0: [drm] Xe device coredump has been created

<6> [147.371523] xe 0000:03:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data

<6> [147.371525] xe 0000:03:00.0: [drm] Tile0: GT0: trying reset from guc_exec_queue_timedout_job [xe]

<6> [147.371627] xe 0000:03:00.0: [drm] Tile0: GT0: reset queued

<6> [147.371719] xe 0000:03:00.0: [drm] Tile0: GT0: reset started

<7> [147.371915] xe 0000:03:00.0: [drm:guc_ct_change_state [xe]] Tile0: GT0: GuC CT communication channel stopped

<7> [147.372358] xe 0000:03:00.0: [drm:xe_reg_sr_apply_mmio [xe]] Tile0: GT0: Applying GT save-restore MMIOs

<7> [147.372452] xe 0000:03:00.0: [drm:xe_reg_sr_apply_mmio [xe]] Tile0: GT0: REG[0x4148] = 0x00000000

<7> [147.372537] xe 0000:03:00.0: [drm:xe_reg_sr_apply_mmio [xe]] Tile0: GT0: REG[0x8828] = 0x00800000

<7> [147.372617] xe 0000:03:00.0: [drm:xe_reg_sr_apply_mmio [xe]] Tile0: GT0: REG[0xb0c8] = 0x11111440

<7> [147.372695] xe 0000:03:00.0: [drm:xe_reg_sr_apply_mmio [xe]] Tile0: GT0: REG[0xb104] = 0x08104440

<7> [147.372772] xe 0000:03:00.0: [drm:xe_reg_sr_apply_mmio [xe]] Tile0: GT0: REG[0xb108] = 0x30200000

<7> [147.372846] xe 0000:03:00.0: [drm:xe_reg_sr_apply_mmio [xe]] Tile0: GT0: REG[0xb158] = 0x0000007f

<7> [147.372917] xe 0000:03:00.0: [drm:xe_wopcm_init [xe]] WOPCM: 4096K

<7> [147.373022] xe 0000:03:00.0: [drm:xe_wopcm_init [xe]] GuC WOPCM is already locked [6144K, 832K)

<7> [147.373133] xe 0000:03:00.0: [drm:guc_ct_change_state [xe]] Tile0: GT0: GuC CT communication channel disabled

<7> [147.374336] xe 0000:03:00.0: [drm:xe_guc_ads_populate [xe]] Tile0: GT0: Updated ADS capture size 20480 (was 49152)

<3> [147.385043] xe 0000:03:00.0: [drm] *ERROR* Tile0: GT0: load failed: status = 0x400000A0, time = 9ms, freq = 2150MHz (req 2133MHz)

<3> [147.385139] xe 0000:03:00.0: [drm] *ERROR* Tile0: GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01

<3> [147.385151] xe 0000:03:00.0: [drm] *ERROR* Tile0: GT0: firmware signature verification failed

<3> [147.385317] xe 0000:03:00.0: [drm] *ERROR* Tile0: GT0: reset failed (-EPROTO)

<3> [147.385372] xe 0000:03:00.0: [drm] *ERROR* CRITICAL: Xe has declared device 0000:03:00.0 as wedged.

IOCTLs and executions are blocked.

For recovery procedure, refer to https://docs.kernel.org/gpu/drm-uapi.html#device-wedging

Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new

<7> [147.385397] xe 0000:03:00.0: [drm:guc_ct_change_state [xe]] Tile0: GT0: GuC CT communication channel stopped

<7> [147.385541] xe 0000:03:00.0: [drm:guc_ct_change_state [xe]] Tile0: GT1: GuC CT communication channel stopped

<3> [147.445680] xe 0000:03:00.0: [drm] *ERROR* Tile0: GT1: GuC mmio request 0x5507: no reply 0x5507

<6> [147.445701] xe 0000:03:00.0: [drm] device wedged, needs recovery

<4> [147.446062] ------------[ cut here ]------------

<4> [147.446064] xe 0000:03:00.0: [drm] Tile0: GT0: Kernel-submitted job timed out

<4> [147.446066] WARNING: drivers/gpu/drm/xe/xe_guc_submit.c:1641 at guc_exec_queue_timedout_job+0x1424/0x2400 [xe], CPU#8: kworker/u64:19/2339

<4> [147.446201] Modules linked in: snd_hda_codec_intelhdmi snd_hda_codec_hdmi pmt_crashlog mei_gsc_proxy mei_lb mei_gsc mtd_intel_dg xe drm_gpuvm drm_gpusvm_helper drm_buddy drm_ttm_helper ttm gpu_sched drm_suballoc_helper drm_exec drm_display_helper cec rc_core drm_kunit_helpers i2c_algo_bit kunit intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp hid_generic cmdlinepart eeepc_wmi spi_nor asus_wmi mei_pxp mei_hdcp sparse_keymap platform_profile mtd wmi_bmof kvm_intel binfmt_misc kvm usbhid irqbypass snd_hda_intel ghash_clmulni_intel hid snd_intel_dspcfg aesni_intel snd_hda_codec rapl r8169 video intel_cstate snd_hda_core snd_hwdep realtek i2c_i801 snd_pcm snd_timer i2c_mux idma64 nls_iso8859_1 spi_intel_pci snd mei_me soundcore spi_intel i2c_smbus mei intel_pmc_core pmt_telemetry wmi pmt_discovery pmt_class intel_pmc_ssram_telemetry pinctrl_alderlake acpi_tad intel_vsec acpi_pad dm_multipath msr nvme_fabrics fuse

<4> [147.446277] efi_pstore nfnetlink autofs4

<4> [147.446282] CPU: 8 UID: 0 PID: 2339 Comm: kworker/u64:19 Tainted: G S U 7.0.0-rc4-lgci-xe-xe-4749-4ae9f18564e78a544-debug+ #1 PREEMPT(lazy)

<4> [147.446285] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER

<4> [147.446287] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 1645 03/15/2024

<4> [147.446288] Workqueue: gt-ordered-wq drm_sched_job_timedout [gpu_sched]

<4> [147.446296] RIP: 0010:guc_exec_queue_timedout_job+0x142d/0x2400 [xe]

<4> [147.446377] Code: 74 04 48 8b 7f 08 4c 8b 6f 50 4d 85 ed 75 03 4c 8b 2f e8 76 5b 5e e1 48 89 c6 48 8d 3d cc a1 39 00 41 89 d8 44 89 e1 4c 89 ea <67> 48 0f b9 3a 48 8b 45 90 48 8b 40 60 e9 c6 ee ff ff 8b 70 08 49

<4> [147.446379] RSP: 0018:ffffc90003fcbca0 EFLAGS: 00010246

<4> [147.446381] RAX: ffffffffa11ff2bf RBX: 0000000000000000 RCX: 0000000000000000

<4> [147.446382] RDX: ffff888104985890 RSI: ffffffffa11ff2bf RDI: ffffffffa1003dc0

<4> [147.446384] RBP: ffffc90003fcbdb0 R08: 0000000000000000 R09: 0000000000000000

<4> [147.446385] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000

<4> [147.446386] R13: ffff888104985890 R14: ffff888131cb8818 R15: 00000000ffffffc2

<4> [147.446388] FS: 0000000000000000(0000) GS:ffff8888db09b000(0000) knlGS:0000000000000000

<4> [147.446389] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

<4> [147.446390] CR2: 000000c00041b000 CR3: 000000000344c001 CR4: 0000000000f72ef0

<4> [147.446392] PKRU: 55555554

<4> [147.446393] Call Trace:

<4> [147.446394] <TASK>

<4> [147.446398] ? lock_acquire+0xa0/0x2f0

<4> [147.446405] ? lock_release+0xd0/0x2b0

<4> [147.446410] drm_sched_job_timedout+0x94/0x1a0 [gpu_sched]

<4> [147.446415] process_one_work+0x239/0x760

<4> [147.446446] worker_thread+0x200/0x3f0

<4> [147.446449] ? __pfx_worker_thread+0x10/0x10

<4> [147.446451] kthread+0x10d/0x150

<4> [147.446455] ? __pfx_kthread+0x10/0x10

<4> [147.446458] ret_from_fork+0x3d4/0x480

<4> [147.446460] ? __pfx_kthread+0x10/0x10

<4> [147.446463] ret_from_fork_asm+0x1a/0x30

<4> [147.446470] </TASK>

<4> [147.446472] irq event stamp: 237671

<4> [147.446473] hardirqs last enabled at (237677): [<ffffffff814a9c09>] __up_console_sem+0x79/0xa0

<4> [147.446476] hardirqs last disabled at (237682): [<ffffffff814a9bee>] __up_console_sem+0x5e/0xa0

<4> [147.446478] softirqs last enabled at (236862): [<ffffffff813d0e7f>] __irq_exit_rcu+0x13f/0x160

<4> [147.446481] softirqs last disabled at (236855): [<ffffffff813d0e7f>] __irq_exit_rcu+0x13f/0x160

<4> [147.446483] ---[ end trace 0000000000000000 ]---

<6> [147.613214] Console: switching to colour frame buffer device 240x67

<7> [147.616915] xe 0000:03:00.0: [drm:drm_pagemap_dev_unhold_work [drm_gpusvm_helper]] Releasing reference on provider device and module.

<7> [147.621783] xe 0000:03:00.0: [drm:drm_client_dev_restore] fbdev: ret=0
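
The repeated TLB invalidation fence timeouts are easier to read once summarized. The following is a rough sketch, assuming dmesg lines in the format shown above; the regular expression, the `summarize_tlb_timeouts` helper, and the three-line sample (copied from this log) are illustrative only and not part of any CI tooling.

```python
import re

# Matches the kernel timestamp and the seqno/recv pair of a TLB
# invalidation fence timeout line as it appears in this dmesg.
TLB_TIMEOUT = re.compile(
    r"\[\s*(\d+\.\d+)\]\s.*TLB invalidation fence timeout, "
    r"seqno=(\d+) recv=(\d+)"
)

# A few lines copied from the log above, including one unrelated line.
SAMPLE = """\
<3> [96.495609] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11367 recv=11366
<3> [98.799270] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11367 recv=11366
<6> [98.912593] pcieport 0000:00:06.0: AER: Multiple Correctable error message received from 0000:05:00.0
<3> [101.103188] xe 0000:03:00.0: [drm] *ERROR* TLB invalidation fence timeout, seqno=11368 recv=11366
"""


def summarize_tlb_timeouts(text):
    """Count timeout lines and collect the seqno and recv values seen."""
    count = 0
    first_timestamp = None
    seqnos = set()
    recvs = set()
    for line in text.splitlines():
        match = TLB_TIMEOUT.search(line)
        if match is None:
            continue  # not a TLB invalidation timeout line
        count += 1
        if first_timestamp is None:
            first_timestamp = float(match.group(1))
        seqnos.add(int(match.group(2)))
        recvs.add(int(match.group(3)))
    return {
        "count": count,
        "first_timestamp": first_timestamp,
        "seqnos": sorted(seqnos),
        "recv_values": sorted(recvs),
    }


if __name__ == "__main__":
    print(summarize_tlb_timeouts(SAMPLE))
```

Run over the full log, a summary like this makes two things visible: `recv` never moves past 11366 while `seqno` keeps climbing, and the first timeout at 96.495609 precedes the `starting subtest madvise-split-vma` line at 108.019567.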

Created at 2026-03-21 04:22:07