Nintendo

Why Nintendo's Satoru Iwata refuses to lay off staff - https://www.polygon.com/2013/7/5/4496512/why-nintendos-satoru-iwata-refuses-to-lay-off-staff

source : https://gist.githubusercontent.com/plutooo/2aadbd4a718e269df474079dd2e584fb/raw/7b3af77b5202366c8934c88ef251f1e905967040/gistfile1.txt

...

Nintendo Will Pay Its Workers 10% More - https://www.gamespot.com/articles/nintendo-will-pay-its-workers-10-more/1100-6511268/ - The move is meant to invest in the workforce and address inflation.

...

== A one in a million bug in Switch kernel ==

Nintendo Switch firmware 14.0.0 was released yesterday. It contained many minor
changes to the kernel. One of them is that during user-mode cache operations
(flush / clean / zero), a secret byte in thread-local storage (TLS) is now set
to 1.

If an interrupt is received, kernel-mode reads the user-mode byte from TLS, and
if it's equal to 1, the kernel performs a memory barrier.
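
The handshake described above can be sketched as a toy model in Python. This is
not Nintendo's code: the class, field and function names are hypothetical, the
"barrier" is just a callback, and the flag is cleared on exit only because that
seems the natural counterpart of setting it.

```python
from contextlib import contextmanager


class ThreadLocalRegion:
    """Hypothetical stand-in for a thread's TLS; the field name is made up."""

    def __init__(self):
        self.cache_op_in_progress = 0  # the "secret byte"


@contextmanager
def cache_maintenance(tls):
    """User-mode side: mark the TLS byte for the duration of the cache ops."""
    tls.cache_op_in_progress = 1
    try:
        yield
    finally:
        tls.cache_op_in_progress = 0  # assumption: cleared when the ops finish


def handle_interrupt(tls, barrier):
    """Kernel side: read the user-mode byte and barrier only if it is 1."""
    if tls.cache_op_in_progress == 1:
        barrier()
        return True
    return False


tls = ThreadLocalRegion()
log = []

# Interrupt outside of cache maintenance: the kernel does nothing extra.
outside = handle_interrupt(tls, lambda: log.append("dsb"))

# Interrupt in the middle of cache maintenance: the kernel issues a barrier.
with cache_maintenance(tls):
    inside = handle_interrupt(tls, lambda: log.append("dsb"))
```

The point of the flag is that the kernel only pays for the extra barrier when
user-mode was actually in the middle of cache maintenance.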

Why is this complicated TLS communication scheme necessary between user-mode
and kernel? Nintendo would not introduce this out of the blue; there must be
some weird hardware phenomenon going on.

This took some time to figure out, but imagine the following sequence of
instructions executing:

    dc civac, x8
    add x8, x8, #32
    dc civac, x8      ← what happens if you take an interrupt here?
    add x8, x8, #32
    dc civac, x8
    add x8, x8, #32
    dsb sy            ← memory barrier
    ret

An interrupt may be received by the CPU at any point during game execution.
Interrupts may lead to "core migration", which is when the kernel scheduler
moves a thread to a different CPU core.

If we imagine a core migration in this code sequence, we can clearly see the
problem:

    dc civac, x8      ← Core 0
    add x8, x8, #32   ← Core 0
    dc civac, x8      ← Core 0
    add x8, x8, #32   ← Core 0
    dc civac, x8      ← Core 1  interrupt! core migration
    add x8, x8, #32   ← Core 1
    dc civac, x8      ← Core 1
    add x8, x8, #32   ← Core 1
    dsb sy            ← Core 1  memory barrier
    ret

Do you see the problem? There was never a memory barrier on core 0!
This means that the cache ops issued on core 0 are *not necessarily* complete
by the time the function returns! For a brief time, physical DRAM may hold
stale data for some of the cache lines.
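
To make the trace above concrete, here is a toy simulation in Python. It is a
sketch under a simplifying assumption: each core keeps its own list of cache
ops that have been issued but not yet waited on, and `dsb sy` only waits for
the ops issued on the core that executes it. The function name and parameters
are made up for illustration.

```python
def run_flush(num_lines, migrate_before=None, stride=32):
    """Return the addresses whose cache op was never covered by a barrier.

    migrate_before: index of the `dc civac` before which the thread is
    interrupted and moved from core 0 to core 1 (None = no migration).
    """
    pending = {0: [], 1: []}  # per-core cache ops issued but not yet waited on
    core = 0
    addr = 0
    for i in range(num_lines):
        if i == migrate_before:
            core = 1                # interrupt + migration, no barrier on core 0
        pending[core].append(addr)  # dc civac, x8
        addr += stride              # add x8, x8, #32
    pending[core].clear()           # dsb sy: covers only the current core's ops
    return pending[0] + pending[1]


no_migration = run_flush(4)                # -> []
migrated = run_flush(4, migrate_before=2)  # -> [0, 32]
```

With a migration before the third `dc civac`, the first two lines (offsets 0
and 32) are left without a barrier, which matches the listing above. "Left
without a barrier" only means completion is not guaranteed; on real hardware
those ops will usually have finished anyway, which is part of why the bug is
so rare.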

So to summarize, if:

(1) the CPU takes an interrupt inside a function like this (super rare)
AND
(2) the scheduler decides to perform core migration (super rare)

then you'd get some graphical glitches (games mainly use cache operations when
talking to the GPU).

In this situation, devs would probably blame faulty DRAM chips or CPU errata,
but this is purely a software bug!

This bug has existed since day zero, which means that it took 5 years (!) for
Nintendo to track it down.

Credit to whichever nameless Nintendo employee found this bug! The attention
to detail is incredible. And how do you even find / debug a bug like this?

Makes you think: do Linux, Windows and macOS handle this properly? Honestly, I
doubt it!

Thanks to SciresM for discussion / diff.

—plutoo

Tags: Company