commit 307e9f91ed0dcb6bf40ed7a9f7bb34b25ee06978 Author: Greg Kroah-Hartman Date: Wed Aug 15 18:11:09 2018 +0200 Linux 4.17.15 commit 62203c007b7c68833aa0bd8fb51ff1f41b521f27 Author: Borislav Petkov Date: Fri Apr 27 16:34:34 2018 -0500 x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present commit f8b64d08dde2714c62751d18ba77f4aeceb161d3 upstream. Move smp_num_siblings and cpu_llc_id to cpu/common.c so that they're always present as symbols and not only in the CONFIG_SMP case. Then, other code using them doesn't need ugly ifdeffery anymore. Get rid of some ifdeffery. Signed-off-by: Borislav Petkov Signed-off-by: Suravee Suthikulpanit Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Link: http://lkml.kernel.org/r/1524864877-111962-2-git-send-email-suravee.suthikulpanit@amd.com Signed-off-by: Guenter Roeck Signed-off-by: Greg Kroah-Hartman commit 27f172cc352873fb6aff24565e388708f1624e5c Author: Vlastimil Babka Date: Tue Aug 14 20:50:47 2018 +0200 x86/init: fix build with CONFIG_SWAP=n commit 792adb90fa724ce07c0171cbc96b9215af4b1045 upstream. The introduction of generic_max_swapfile_size and arch-specific versions has broken linking on x86 with CONFIG_SWAP=n due to undefined reference to 'generic_max_swapfile_size'. Fix it by compiling the x86-specific max_swapfile_size() only with CONFIG_SWAP=y. Reported-by: Tomas Pruzina Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Vlastimil Babka Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit ffdb0b32dbbd9adb7c084ca6a9c9dfc1c156cea6 Author: Abel Vesa Date: Wed Aug 15 00:26:00 2018 +0300 cpu/hotplug: Non-SMP machines do not make use of booted_once commit 269777aa530f3438ec1781586cdac0b5fe47b061 upstream. Commit 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once") breaks non-SMP builds. [ I suspect the 'bool' fields should just be made to be bitfields and be exposed regardless of configuration, but that's a separate cleanup that I'll leave to the owners of this file for later. - Linus ] Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once") Cc: Dave Hansen Cc: Thomas Gleixner Cc: Tony Luck Signed-off-by: Abel Vesa Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit 8eabab771cc7eefafc9dd6b29a038dc691e6a33e Author: Vlastimil Babka Date: Tue Aug 14 23:38:57 2018 +0200 x86/smp: fix non-SMP broken build due to redefinition of apic_id_is_primary_thread commit d0055f351e647f33f3b0329bff022213bf8aa085 upstream. The function has an inline "return false;" definition with CONFIG_SMP=n but the "real" definition is also visible leading to "redefinition of ‘apic_id_is_primary_thread’" compiler error. Guard it with #ifdef CONFIG_SMP Signed-off-by: Vlastimil Babka Fixes: 6a4d2657e048 ("x86/smp: Provide topology_is_primary_thread()") Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit f6b2c7253830473fedb8b5680894660e78e03aea Author: Josh Poimboeuf Date: Fri Aug 10 08:31:10 2018 +0100 x86/microcode: Allow late microcode loading with SMT disabled commit 07d981ad4cf1e78361c6db1c28ee5ba105f96cc1 upstream The kernel unnecessarily prevents late microcode loading when SMT is disabled. It should be safe to allow it if all the primary threads are online. Signed-off-by: Josh Poimboeuf Acked-by: Borislav Petkov Signed-off-by: David Woodhouse Signed-off-by: Greg Kroah-Hartman commit bc651db4b335fcbbe7d20e7416f97faab5580500 Author: David Woodhouse Date: Wed Aug 8 11:00:16 2018 +0100 tools headers: Synchronise x86 cpufeatures.h for L1TF additions commit e24f14b0ff985f3e09e573ba1134bfdf42987e05 upstream [ ... and some older changes in the 4.17.y backport too ...] Signed-off-by: David Woodhouse Signed-off-by: Greg Kroah-Hartman commit fcf9776ade8be41bd3cd7d691063a95174b3a8dc Author: Arnaldo Carvalho de Melo Date: Fri Jun 1 10:42:31 2018 -0300 tools headers: Synchronize prctl.h ABI header commit 63b89a19cc9ef911dcc64d41b60930c346eee0c0 upstream To pick up changes from: $ git log --oneline -2 -i include/uapi/linux/prctl.h 356e4bfff2c5 prctl: Add force disable speculation b617cfc85816 prctl: Add speculation control prctls $ tools/perf/trace/beauty/prctl_option.sh > before.c $ cp include/uapi/linux/prctl.h tools/include/uapi/linux/prctl.h $ tools/perf/trace/beauty/prctl_option.sh > after.c $ diff -u before.c after.c # --- before.c 2018-06-01 10:39:53.834073962 -0300 # +++ after.c 2018-06-01 10:42:11.307985394 -0300 @@ -35,6 +35,8 @@ [42] = "GET_THP_DISABLE", [45] = "SET_FP_MODE", [46] = "GET_FP_MODE", + [52] = "GET_SPECULATION_CTRL", + [53] = "SET_SPECULATION_CTRL", }; static const char *prctl_set_mm_options[] = { [1] = "START_CODE", $ This will be used by 'perf trace' to show these strings when beautifying the prctl syscall args. At some point we'll be able to say something like: 'perf trace --all-cpus -e prctl(option=*SPEC*)' To filter by arg by name. This silences this warning when building tools/perf: Warning: Kernel ABI header at 'tools/include/uapi/linux/prctl.h' differs from latest version at 'include/uapi/linux/prctl.h' Cc: Adrian Hunter Cc: David Ahern Cc: Jiri Olsa Cc: Namhyung Kim Cc: Thomas Gleixner Cc: Wang Nan Link: https://lkml.kernel.org/n/tip-zztsptwhc264r8wg44tqh5gp@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: David Woodhouse Signed-off-by: Greg Kroah-Hartman commit bbef6c71614ddbffede79f2f51a3796ac76a3b12 Author: Andi Kleen Date: Tue Aug 7 15:09:38 2018 -0700 x86/mm/kmmio: Make the tracer robust against L1TF commit 1063711b57393c1999248cccb57bebfaf16739e7 upstream The mmio tracer sets io mapping PTEs and PMDs to non present when enabled without inverting the address bits, which makes the PTE entry vulnerable for L1TF. Make it use the right low level macros to actually invert the address bits to protect against L1TF. In principle this could be avoided because MMIO tracing is not likely to be enabled on production machines, but the fix is straigt forward and for consistency sake it's better to get rid of the open coded PTE manipulation. Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit c4b60c527031d46e843a4ec9619a04dd3cb8045b Author: Andi Kleen Date: Tue Aug 7 15:09:39 2018 -0700 x86/mm/pat: Make set_memory_np() L1TF safe commit 958f79b9ee55dfaf00c8106ed1c22a2919e0028b upstream set_memory_np() is used to mark kernel mappings not present, but it has it's own open coded mechanism which does not have the L1TF protection of inverting the address bits. Replace the open coded PTE manipulation with the L1TF protecting low level PTE routines. Passes the CPA self test. Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit d224e11c59a933074c4c71e8739f39186d2de109 Author: Andi Kleen Date: Tue Aug 7 15:09:37 2018 -0700 x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert commit 0768f91530ff46683e0b372df14fd79fe8d156e5 upstream Some cases in THP like: - MADV_FREE - mprotect - split mark the PMD non present for temporarily to prevent races. The window for an L1TF attack in these contexts is very small, but it wants to be fixed for correctness sake. Use the proper low level functions for pmd/pud_mknotpresent() to address this. Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit aa04c664ccc3753d3ac2a19783d4f4a3f2163154 Author: Andi Kleen Date: Tue Aug 7 15:09:36 2018 -0700 x86/speculation/l1tf: Invert all not present mappings commit f22cc87f6c1f771b57c407555cfefd811cdd9507 upstream For kernel mappings PAGE_PROTNONE is not necessarily set for a non present mapping, but the inversion logic explicitely checks for !PRESENT and PROT_NONE. Remove the PROT_NONE check and make the inversion unconditional for all not present mappings. Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 6dcc5103dd5b1d63c1e7c3a1df5aa221a8eb92e7 Author: Thomas Gleixner Date: Tue Aug 7 08:19:57 2018 +0200 cpu/hotplug: Fix SMT supported evaluation commit bc2d8d262cba5736332cbc866acb11b1c5748aa9 upstream Josh reported that the late SMT evaluation in cpu_smt_state_init() sets cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied on the kernel command line as it cannot differentiate between SMT disabled by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and makes the sysfs interface unusable. Rework this so that during bringup of the non boot CPUs the availability of SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a 'primary' thread then set the local cpu_smt_available marker and evaluate this explicitely right after the initial SMP bringup has finished. SMT evaulation on x86 is a trainwreck as the firmware has all the information _before_ booting the kernel, but there is no interface to query it. Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: Josh Poimboeuf Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 4fef280872237c114d732db2f0460b00747ac895 Author: Paolo Bonzini Date: Sun Aug 5 16:07:47 2018 +0200 KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry commit 5b76a3cff011df2dcb6186c965a2e4d809a05ad4 upstream When nested virtualization is in use, VMENTER operations from the nested hypervisor into the nested guest will always be processed by the bare metal hypervisor, and KVM's "conditional cache flushes" mode in particular does a flush on nested vmentry. Therefore, include the "skip L1D flush on vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting. Add the relevant Documentation. Signed-off-by: Paolo Bonzini Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit daceaeec3e486247f455fe524055bdca26ca791a Author: Paolo Bonzini Date: Sun Aug 5 16:07:46 2018 +0200 x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry commit 8e0b2b916662e09dd4d09e5271cdf214c6b80e62 upstream Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is not needed. Add a new value to enum vmx_l1d_flush_state, which is used either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES. Signed-off-by: Paolo Bonzini Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit ea6c7c5dab4517bf21579363d66c48c991cc9027 Author: Paolo Bonzini Date: Sun Aug 5 16:07:45 2018 +0200 x86/speculation: Simplify sysfs report of VMX L1TF vulnerability commit ea156d192f5257a5bf393d33910d3b481bf8a401 upstream Three changes to the content of the sysfs file: - If EPT is disabled, L1TF cannot be exploited even across threads on the same core, and SMT is irrelevant. - If mitigation is completely disabled, and SMT is enabled, print "vulnerable" instead of "vulnerable, SMT vulnerable" - Reorder the two parts so that the main vulnerability state comes first and the detail on SMT is second. Signed-off-by: Paolo Bonzini Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 1273eb5fca6ae07004ecb1ebac27f2bdc367afe1 Author: Thomas Gleixner Date: Sun Aug 5 17:06:12 2018 +0200 Documentation/l1tf: Remove Yonah processors from not vulnerable list commit 58331136136935c631c2b5f06daf4c3006416e91 upstream Dave reported, that it's not confirmed that Yonah processors are unaffected. Remove them from the list. Reported-by: ave Hansen Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit a09777bcd347a2c9ee701b054c10eca3171baa43 Author: Nicolai Stange Date: Sun Jul 22 13:38:18 2018 +0200 x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() commit 18b57ce2eb8c8b9a24174a89250cf5f57c76ecdc upstream For VMEXITs caused by external interrupts, vmx_handle_external_intr() indirectly calls into the interrupt handlers through the host's IDT. It follows that these interrupts get accounted for in the kvm_cpu_l1tf_flush_l1d per-cpu flag. The subsequently executed vmx_l1d_flush() will thus be aware that some interrupts have happened and conduct a L1d flush anyway. Setting l1tf_flush_l1d from vmx_handle_external_intr() isn't needed anymore. Drop it. Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit de446dc3cf1ba99a0a7f1fe3286cc1b5d4579807 Author: Nicolai Stange Date: Sun Jul 29 13:06:04 2018 +0200 x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d commit ffcba43ff66c7dab34ec700debd491d2a4d319b4 upstream The last missing piece to having vmx_l1d_flush() take interrupts after VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on irq entry. Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(), ipi_entering_ack_irq(), smp_reschedule_interrupt() and uv_bau_message_interrupt(). Suggested-by: Paolo Bonzini Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 7654bab9d7d0aac9a9b7903483afcb1b180140b0 Author: Nicolai Stange Date: Sun Jul 29 12:15:33 2018 +0200 x86: Don't include linux/irq.h from asm/hardirq.h commit 447ae316670230d7d29430e2cbf1f5db4f49d14c upstream The next patch in this series will have to make the definition of irq_cpustat_t available to entering_irq(). Inclusion of asm/hardirq.h into asm/apic.h would cause circular header dependencies like asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/topology.h linux/smp.h asm/smp.h or linux/gfp.h linux/mmzone.h asm/mmzone.h asm/mmzone_64.h asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/irqdesc.h linux/kobject.h linux/sysfs.h linux/kernfs.h linux/idr.h linux/gfp.h and others. This causes compilation errors because of the header guards becoming effective in the second inclusion: symbols/macros that had been defined before wouldn't be available to intermediate headers in the #include chain anymore. A possible workaround would be to move the definition of irq_cpustat_t into its own header and include that from both, asm/hardirq.h and asm/apic.h. However, this wouldn't solve the real problem, namely asm/harirq.h unnecessarily pulling in all the linux/irq.h cruft: nothing in asm/hardirq.h itself requires it. Also, note that there are some other archs, like e.g. arm64, which don't have that #include in their asm/hardirq.h. Remove the linux/irq.h #include from x86' asm/hardirq.h. Fix resulting compilation errors by adding appropriate #includes to *.c files as needed. Note that some of these *.c files could be cleaned up a bit wrt. to their set of #includes, but that should better be done from separate patches, if at all. Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 2566de674202af2205a927b5ad1d704ce4145627 Author: Nicolai Stange Date: Fri Jul 27 13:22:16 2018 +0200 x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d commit 45b575c00d8e72d69d75dd8c112f044b7b01b069 upstream Part of the L1TF mitigation for vmx includes flushing the L1D cache upon VMENTRY. L1D flushes are costly and two modes of operations are provided to users: "always" and the more selective "conditional" mode. If operating in the latter, the cache would get flushed only if a host side code path considered unconfined had been traversed. "Unconfined" in this context means that it might have pulled in sensitive data like user data or kernel crypto keys. The need for L1D flushes is tracked by means of the per-vcpu flag l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a L1d flush based on its value and reset that flag again. Currently, interrupts delivered "normally" while in root operation between VMEXIT and VMENTER are not taken into account. Part of the reason is that these don't leave any traces and thus, the vmx code is unable to tell if any such has happened. As proposed by Paolo Bonzini, prepare for tracking all interrupts by introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in strong analogy to the per-vcpu ->l1tf_flush_l1d. A later patch will make interrupt handlers set it. For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86' per-cpu irq_cpustat_t as suggested by Peter Zijlstra. Provide the helpers kvm_set_cpu_l1tf_flush_l1d(), kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate. Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as l1tf_flush_l1d. Suggested-by: Paolo Bonzini Suggested-by: Peter Zijlstra Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Reviewed-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman commit 6896e226e6d732576e27d463c42f3139f847f3d6 Author: Nicolai Stange Date: Fri Jul 27 12:46:29 2018 +0200 x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 commit 9aee5f8a7e30330d0a8f4c626dc924ca5590aba5 upstream An upcoming patch will extend KVM's L1TF mitigation in conditional mode to also cover interrupts after VMEXITs. For tracking those, stores to a new per-cpu flag from interrupt handlers will become necessary. In order to improve cache locality, this new flag will be added to x86's irq_cpustat_t. Make some space available there by shrinking the ->softirq_pending bitfield from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS, i.e. 10. Suggested-by: Paolo Bonzini Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Reviewed-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman commit f5129286ade4ef85ad8ad26a4a43ba54136f3755 Author: Nicolai Stange Date: Sat Jul 21 22:35:28 2018 +0200 x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() commit 5b6ccc6c3b1a477fbac9ec97a0b4c1c48e765209 upstream Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes vmx_l1d_flush() if so. This test is unncessary for the "always flush L1D" mode. Move the check to vmx_l1d_flush()'s conditional mode code path. Notes: - vmx_l1d_flush() is likely to get inlined anyway and thus, there's no extra function call. - This inverts the (static) branch prediction, but there hadn't been any explicit likely()/unlikely() annotations before and so it stays as is. Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 569f08c9b0d0e5761e9a542e8da0fa0d2ca9d25d Author: Nicolai Stange Date: Sat Jul 21 22:25:00 2018 +0200 x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' commit 427362a142441f08051369db6fbe7f61c73b3dca upstream The vmx_l1d_flush_always static key is only ever evaluated if vmx_l1d_should_flush is enabled. In that case however, there are only two L1d flushing modes possible: "always" and "conditional". The "conditional" mode's implementation tends to require more sophisticated logic than the "always" mode. Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key with a 'vmx_l1d_flush_cond' one. There is no change in functionality. Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit afb059e688f22d860575cddea281534504397756 Author: Nicolai Stange Date: Sat Jul 21 22:16:56 2018 +0200 x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() commit 379fd0c7e6a391e5565336a646f19f218fb98c6c upstream vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no point in setting l1tf_flush_l1d to true from there again. Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 63a759508f3ec9af3b5014c371651db91a505f8b Author: Josh Poimboeuf Date: Wed Jul 25 10:36:45 2018 +0200 cpu/hotplug: detect SMT disabled by BIOS commit 73d5e2b472640b1fcdb61ae8be389912ef211bda upstream If SMT is disabled in BIOS, the CPU code doesn't properly detect it. The /sys/devices/system/cpu/smt/control file shows 'on', and the 'l1tf' vulnerabilities file shows SMT as vulnerable. Fix it by forcing 'cpu_smt_control' to CPU_SMT_NOT_SUPPORTED in such a case. Unfortunately the detection can only be done after bringing all the CPUs online, so we have to overwrite any previous writes to the variable. Reported-by: Joe Mario Tested-by: Jiri Kosina Fixes: f048c399e0f7 ("x86/topology: Provide topology_smt_supported()") Signed-off-by: Josh Poimboeuf Signed-off-by: Peter Zijlstra Signed-off-by: Greg Kroah-Hartman commit 0cd091501932ad434023bc21495714fee1e1b9d3 Author: Tony Luck Date: Thu Jul 19 13:49:58 2018 -0700 Documentation/l1tf: Fix typos commit 1949f9f49792d65dba2090edddbe36a5f02e3ba3 upstream Fix spelling and other typos Signed-off-by: Tony Luck Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 48f9b6554c21c137556d8a1a70c94b575ce3d334 Author: Nicolai Stange Date: Wed Jul 18 19:07:38 2018 +0200 x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content commit 288d152c23dcf3c09da46c5c481903ca10ebfef7 upstream The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order to evict the L1d cache. However, these pages are never cleared and, in theory, their data could be leaked. More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses. Fix this by initializing the individual vmx_l1d_flush_pages with a different pattern each. Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to "flush_pages" to reflect this change. Fixes: a47dd5f06714 ("x86/KVM/VMX: Add L1D flush algorithm") Signed-off-by: Nicolai Stange Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 0eb84cc57b0fc6dd76ab17f862c13a4321cae222 Author: Jiri Kosina Date: Sat Jul 14 21:56:13 2018 +0200 x86/speculation/l1tf: Unbreak !__HAVE_ARCH_PFN_MODIFY_ALLOWED architectures commit 6c26fcd2abfe0a56bbd95271fce02df2896cfd24 upstream pfn_modify_allowed() and arch_has_pfn_modify_check() are outside of the !__ASSEMBLY__ section in include/asm-generic/pgtable.h, which confuses assembler on archs that don't have __HAVE_ARCH_PFN_MODIFY_ALLOWED (e.g. ia64) and breaks build: include/asm-generic/pgtable.h: Assembler messages: include/asm-generic/pgtable.h:538: Error: Unknown opcode `static inline bool pfn_modify_allowed(unsigned long pfn,pgprot_t prot)' include/asm-generic/pgtable.h:540: Error: Unknown opcode `return true' include/asm-generic/pgtable.h:543: Error: Unknown opcode `static inline bool arch_has_pfn_modify_check(void)' include/asm-generic/pgtable.h:545: Error: Unknown opcode `return false' arch/ia64/kernel/entry.S:69: Error: `mov' does not fit into bundle Move those two static inlines into the !__ASSEMBLY__ section so that they don't confuse the asm build pass. Fixes: 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings") Signed-off-by: Jiri Kosina Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit f4e44a41979effe99d054d5c24c9058826fc805f Author: Thomas Gleixner Date: Fri Jul 13 16:23:26 2018 +0200 Documentation: Add section about CPU vulnerabilities commit 3ec8ce5d866ec6a08a9cfab82b62acf4a830b35f upstream Add documentation for the L1TF vulnerability and the mitigation mechanisms: - Explain the problem and risks - Document the mitigation mechanisms - Document the command line controls - Document the sysfs files Signed-off-by: Thomas Gleixner Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Acked-by: Linus Torvalds Link: https://lkml.kernel.org/r/20180713142323.287429944@linutronix.de Signed-off-by: Greg Kroah-Hartman commit 912d10ace8248917bb691dbae20ad1df3916d50f Author: Jiri Kosina Date: Fri Jul 13 16:23:25 2018 +0200 x86/bugs, kvm: Introduce boot-time control of L1TF mitigations commit d90a7a0ec83fb86622cd7dae23255d3c50a99ec8 upstream Introduce the 'l1tf=' kernel command line option to allow for boot-time switching of mitigation that is used on processors affected by L1TF. The possible values are: full Provides all available mitigations for the L1TF vulnerability. Disables SMT and enables all mitigations in the hypervisors. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. full,force Same as 'full', but disables SMT control. Implies the 'nosmt=force' command line option. sysfs control of SMT and the hypervisor flush control is disabled. flush Leaves SMT enabled and enables the conditional hypervisor mitigation. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. flush,nosmt Disables SMT and enables the conditional hypervisor mitigation. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. If SMT is reenabled or flushing disabled at runtime hypervisors will issue a warning. flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is started in a potentially insecure configuration. off Disables hypervisor mitigations and doesn't emit any warnings. Default is 'flush'. Let KVM adhere to these semantics, which means: - 'lt1f=full,force' : Performe L1D flushes. No runtime control possible. - 'l1tf=full' - 'l1tf-flush' - 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if SMT has been runtime enabled or L1D flushing has been run-time enabled - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted. - 'l1tf=off' : L1D flushes are not performed and no warnings are emitted. KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush' module parameter except when lt1f=full,force is set. This makes KVM's private 'nosmt' option redundant, and as it is a bit non-systematic anyway (this is something to control globally, not on hypervisor level), remove that option. Add the missing Documentation entry for the l1tf vulnerability sysfs file while at it. Signed-off-by: Jiri Kosina Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de Signed-off-by: Greg Kroah-Hartman commit 7fc6f0e2f7cb515ec42e12c632de8c36d756f4db Author: Thomas Gleixner Date: Fri Jul 13 16:23:24 2018 +0200 cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early commit fee0aede6f4739c87179eca76136f83210953b86 upstream The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support SMT) when the sysfs SMT control file is initialized. That was fine so far as this was only required to make the output of the control file correct and to prevent writes in that case. With the upcoming l1tf command line parameter, this needs to be set up before the L1TF mitigation selection and command line parsing happens. Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142323.121795971@linutronix.de Signed-off-by: Greg Kroah-Hartman commit 6c57ce2dd9fc11d93ad025e1910a8a0726bdb4c9 Author: Jiri Kosina Date: Fri Jul 13 16:23:23 2018 +0200 cpu/hotplug: Expose SMT control init function commit 8e1b706b6e819bed215c0db16345568864660393 upstream The L1TF mitigation will gain a commend line parameter which allows to set a combination of hypervisor mitigation and SMT control. Expose cpu_smt_disable() so the command line parser can tweak SMT settings. [ tglx: Split out of larger patch and made it preserve an already existing force off state ] Signed-off-by: Jiri Kosina Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142323.039715135@linutronix.de Signed-off-by: Greg Kroah-Hartman commit e86575c798b0826a44877ab839a3aa0381b677b1 Author: Thomas Gleixner Date: Fri Jul 13 16:23:22 2018 +0200 x86/kvm: Allow runtime control of L1D flush commit 895ae47f9918833c3a880fbccd41e0692b37e7d9 upstream All mitigation modes can be switched at run time with a static key now: - Use sysfs_streq() instead of strcmp() to handle the trailing new line from sysfs writes correctly. - Make the static key management handle multiple invocations properly. - Set the module parameter file to RW Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142322.954525119@linutronix.de Signed-off-by: Greg Kroah-Hartman commit 3983c579df7826b4b49fd984b3ba3dca94cbe1f4 Author: Thomas Gleixner Date: Fri Jul 13 16:23:21 2018 +0200 x86/kvm: Serialize L1D flush parameter setter commit dd4bfa739a72508b75760b393d129ed7b431daab upstream Writes to the parameter files are not serialized at the sysfs core level, so local serialization is required. Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142322.873642605@linutronix.de Signed-off-by: Greg Kroah-Hartman commit 6da20c6faec12bd0303b19d65beda0e0cfd70a3d Author: Thomas Gleixner Date: Fri Jul 13 16:23:20 2018 +0200 x86/kvm: Add static key for flush always commit 4c6523ec59fe895ea352a650218a6be0653910b1 upstream Avoid the conditional in the L1D flush control path. Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142322.790914912@linutronix.de Signed-off-by: Greg Kroah-Hartman commit f22496974121fe4f1530350c9a53631e3f02f878 Author: Thomas Gleixner Date: Fri Jul 13 16:23:19 2018 +0200 x86/kvm: Move l1tf setup function commit 7db92e165ac814487264632ab2624e832f20ae38 upstream In preparation of allowing run time control for L1D flushing, move the setup code to the module parameter handler. In case of pre module init parsing, just store the value and let vmx_init() do the actual setup after running kvm_init() so that enable_ept is having the correct state. During run-time invoke it directly from the parameter setter to prepare for run-time control. Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142322.694063239@linutronix.de Signed-off-by: Greg Kroah-Hartman commit fe7463bbef767b3d91e9113c69fea975a035388d Author: Thomas Gleixner Date: Fri Jul 13 16:23:18 2018 +0200 x86/l1tf: Handle EPT disabled state proper commit a7b9020b06ec6d7c3f3b0d4ef1a9eba12654f4f7 upstream If Extended Page Tables (EPT) are disabled or not supported, no L1D flushing is required. The setup function can just avoid setting up the L1D flush for the EPT=n case. Invoke it after the hardware setup has be done and enable_ept has the correct state and expose the EPT disabled state in the mitigation status as well. Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142322.612160168@linutronix.de Signed-off-by: Greg Kroah-Hartman commit 5a591c95e2a285dd1f3a24e2dbc1d62f43ad0057 Author: Thomas Gleixner Date: Fri Jul 13 16:23:17 2018 +0200 x86/kvm: Drop L1TF MSR list approach commit 2f055947ae5e2741fb2dc5bba1033c417ccf4faa upstream The VMX module parameter to control the L1D flush should become writeable. The MSR list is set up at VM init per guest VCPU, but the run time switching is based on a static key which is global. Toggling the MSR list at run time might be feasible, but for now drop this optimization and use the regular MSR write to make run-time switching possible. The default mitigation is the conditional flush anyway, so for extra paranoid setups this will add some small overhead, but the extra code executed is in the noise compared to the flush itself. Aside of that the EPT disabled case is not handled correctly at the moment and the MSR list magic is in the way for fixing that as well. If it's really providing a significant advantage, then this needs to be revisited after the code is correct and the control is writable. Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142322.516940445@linutronix.de Signed-off-by: Greg Kroah-Hartman commit fc7107040fa1d4b9d1c9f1b8da53af0579748e62 Author: Thomas Gleixner Date: Fri Jul 13 16:23:16 2018 +0200 x86/litf: Introduce vmx status variable commit 72c6d2db64fa18c996ece8f06e499509e6c9a37e upstream Store the effective mitigation of VMX in a status variable and use it to report the VMX state in the l1tf sysfs file. Signed-off-by: Thomas Gleixner Tested-by: Jiri Kosina Reviewed-by: Greg Kroah-Hartman Reviewed-by: Josh Poimboeuf Link: https://lkml.kernel.org/r/20180713142322.433098358@linutronix.de Signed-off-by: Greg Kroah-Hartman commit 49fc27f3c0f3135f1f9f9623f9962fff97de95ef Author: Thomas Gleixner Date: Sat Jul 7 11:40:18 2018 +0200 cpu/hotplug: Online siblings when SMT control is turned on commit 215af5499d9e2b55f111d2431ea20218115f29b3 upstream Writing 'off' to /sys/devices/system/cpu/smt/control offlines all SMT siblings. Writing 'on' merily enables the abilify to online them, but does not online them automatically. Make 'on' more useful by onlining all offline siblings. Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 0689d66648a5c48b04afec54125ab5812eea2ec0 Author: Konrad Rzeszutek Wilk Date: Thu Jun 28 17:10:36 2018 -0400 x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required commit 390d975e0c4e60ce70d4157e0dd91ede37824603 upstream If the L1D flush module parameter is set to 'always' and the IA32_FLUSH_CMD MSR is available, optimize the VMENTER code with the MSR save list. Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 0a06117dd3ee22d731560c9c27a18012f2454bde Author: Konrad Rzeszutek Wilk Date: Wed Jun 20 22:01:22 2018 -0400 x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs commit 989e3992d2eca32c3f1404f2bc91acda3aa122d8 upstream The IA32_FLUSH_CMD MSR needs only to be written on VMENTER. Extend add_atomic_switch_msr() with an entry_only parameter to allow storing the MSR only in the guest (ENTRY) MSR array. Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 7bbd5a4751ac4b96ca7982de15ba4daab76a0cd7 Author: Konrad Rzeszutek Wilk Date: Wed Jun 20 22:00:47 2018 -0400 x86/KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting commit 3190709335dd31fe1aeeebfe4ffb6c7624ef971f upstream This allows to load a different number of MSRs depending on the context: VMEXIT or VMENTER. Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit f38baa5bc13b62c8beeaa44ee495060f45e68f90 Author: Konrad Rzeszutek Wilk Date: Wed Jun 20 20:11:39 2018 -0400 x86/KVM/VMX: Add find_msr() helper function commit ca83b4a7f2d068da79a029d323024aa45decb250 upstream .. to help find the MSR on either the guest or host MSR list. Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 813e6ede42f26c74d936c8413db52d4243b2aac1 Author: Konrad Rzeszutek Wilk Date: Wed Jun 20 13:58:37 2018 -0400 x86/KVM/VMX: Split the VMX MSR LOAD structures to have an host/guest numbers commit 33966dd6b2d2c352fae55412db2ea8cfff5df13a upstream There is no semantic change but this change allows an unbalanced amount of MSRs to be loaded on VMEXIT and VMENTER, i.e. the number of MSRs to save or restore on VMEXIT or VMENTER may be different. Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit aab5ed8057b4ae701f32814a86d3d9a01b828942 Author: Paolo Bonzini Date: Mon Jul 2 13:07:14 2018 +0200 x86/KVM/VMX: Add L1D flush logic commit c595ceee45707f00f64f61c54fb64ef0cc0b4e85 upstream Add the logic for flushing L1D on VMENTER. The flush depends on the static key being enabled and the new l1tf_flush_l1d flag being set. The flags is set: - Always, if the flush module parameter is 'always' - Conditionally at: - Entry to vcpu_run(), i.e. after executing user space - From the sched_in notifier, i.e. when switching to a vCPU thread. - From vmexit handlers which are considered unsafe, i.e. where sensitive data can be brought into L1D: - The emulator, which could be a good target for other speculative execution-based threats, - The MMU, which can bring host page tables in the L1 cache. - External interrupts - Nested operations that require the MMU (see above). That is vmptrld, vmptrst, vmclear,vmwrite,vmread. - When handling invept,invvpid [ tglx: Split out from combo patch and reduced to a single flag ] Signed-off-by: Paolo Bonzini Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 348a22db760a954b7de6d77645dfe231dd172340 Author: Paolo Bonzini Date: Mon Jul 2 13:03:48 2018 +0200 x86/KVM/VMX: Add L1D MSR based flush commit 3fa045be4c720146b18a19cea7a767dc6ad5df94 upstream 336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR (IA32_FLUSH_CMD aka 0x10B) which has similar write-only semantics to other MSRs defined in the document. The semantics of this MSR is to allow "finer granularity invalidation of caching structures than existing mechanisms like WBINVD. It will writeback and invalidate the L1 data cache, including all cachelines brought in by preceding instructions, without invalidating all caches (eg. L2 or LLC). Some processors may also invalidate the first level level instruction cache on a L1D_FLUSH command. The L1 data and instruction caches may be shared across the logical processors of a core." Use it instead of the loop based L1 flush algorithm. A copy of this document is available at https://bugzilla.kernel.org/show_bug.cgi?id=199511 [ tglx: Avoid allocating pages when the MSR is available ] Signed-off-by: Paolo Bonzini Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 1db21b84cb84bffb9fb67650366d32b555920326 Author: Paolo Bonzini Date: Mon Jul 2 12:47:38 2018 +0200 x86/KVM/VMX: Add L1D flush algorithm commit a47dd5f06714c844b33f3b5f517b6f3e81ce57b5 upstream To mitigate the L1 Terminal Fault vulnerability it's required to flush L1D on VMENTER to prevent rogue guests from snooping host memory. CPUs will have a new control MSR via a microcode update to flush L1D with a single MSR write, but in the absence of microcode a fallback to a software based flush algorithm is required. Add a software flush loop which is based on code from Intel. [ tglx: Split out from combo patch ] [ bpetkov: Polish the asm code ] Signed-off-by: Paolo Bonzini Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 9431927cfe0353f8263f230fe8bc30d4a84a6aef Author: Konrad Rzeszutek Wilk Date: Mon Jul 2 12:29:30 2018 +0200 x86/KVM/VMX: Add module argument for L1TF mitigation commit a399477e52c17e148746d3ce9a483f681c2aa9a0 upstream Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka L1 terminal fault. The valid arguments are: - "always" L1D cache flush on every VMENTER. - "cond" Conditional L1D cache flush, explained below - "never" Disable the L1D cache flush mitigation "cond" is trying to avoid L1D cache flushes on VMENTER if the code executed between VMEXIT and VMENTER is considered safe, i.e. is not bringing any interesting information into L1D which might exploited. [ tglx: Split out from a larger patch ] Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 8efd53e73ae385947d22c6d0c1fc6dc718048a63 Author: Konrad Rzeszutek Wilk Date: Wed Jun 20 11:29:53 2018 -0400 x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present commit 26acfb666a473d960f0fd971fe68f3e3ad16c70b upstream If the L1TF CPU bug is present we allow the KVM module to be loaded as the major of users that use Linux and KVM have trusted guests and do not want a broken setup. Cloud vendors are the ones that are uncomfortable with CVE 2018-3620 and as such they are the ones that should set nosmt to one. Setting 'nosmt' means that the system administrator also needs to disable SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line parameter, or via the /sys/devices/system/cpu/smt/control. See commit 05736e4ac13c ("cpu/hotplug: Provide knobs to control SMT"). Other mitigations are to use task affinity, cpu sets, interrupt binding, etc - anything to make sure that _only_ the same guests vCPUs are running on sibling threads. Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit edaa3d5446cbd847f299dc12e66b2020d273091c Author: Thomas Gleixner Date: Fri Jun 29 16:05:48 2018 +0200 cpu/hotplug: Boot HT siblings at least once commit 0cc3cd21657be04cb0559fe8063f2130493f92cf upstream Due to the way Machine Check Exceptions work on X86 hyperthreads it's required to boot up _all_ logical cores at least once in order to set the CR4.MCE bit. So instead of ignoring the sibling threads right away, let them boot up once so they can configure themselves. After they came out of the initial boot stage check whether its a "secondary" sibling and cancel the operation which puts the CPU back into offline state. Reported-by: Dave Hansen Signed-off-by: Thomas Gleixner Tested-by: Tony Luck Signed-off-by: Greg Kroah-Hartman commit 50eb78a07451ee490d3bed0a1ab2065a3364d0fb Author: Thomas Gleixner Date: Fri Jun 29 16:05:47 2018 +0200 Revert "x86/apic: Ignore secondary threads if nosmt=force" commit 506a66f374891ff08e064a058c446b336c5ac760 upstream Dave Hansen reported, that it's outright dangerous to keep SMT siblings disabled completely so they are stuck in the BIOS and wait for SIPI. The reason is that Machine Check Exceptions are broadcasted to siblings and the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or reboots the machine. The MCE chapter in the SDM contains the following blurb: Because the logical processors within a physical package are tightly coupled with respect to shared hardware resources, both logical processors are notified of machine check errors that occur within a given physical processor. If machine-check exceptions are enabled when a fatal error is reported, all the logical processors within a physical package are dispatched to the machine-check exception handler. If machine-check exceptions are disabled, the logical processors enter the shutdown state and assert the IERR# signal. When enabling machine-check exceptions, the MCE flag in control register CR4 should be set for each logical processor. Reverting the commit which ignores siblings at enumeration time solves only half of the problem. The core cpuhotplug logic needs to be adjusted as well. This thoughtful engineered mechanism also turns the boot process on all Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU before the secondary CPUs are brought up. Depending on the number of physical cores the window in which this situation can happen is smaller or larger. On a HSW-EX it's about 750ms: MCE is enabled on the boot CPU: [ 0.244017] mce: CPU supports 22 MCE banks The corresponding sibling #72 boots: [ 1.008005] .... node #0, CPUs: #72 That means if an MCE hits on physical core 0 (logical CPUs 0 and 72) between these two points the machine is going to shutdown. At least it's a known safe state. It's obvious that the early boot can be hit by an MCE as well and then runs into the same situation because MCEs are not yet enabled on the boot CPU. But after enabling them on the boot CPU, it does not make any sense to prevent the kernel from recovering. Adjust the nosmt kernel parameter documentation as well. Reverts: 2207def700f9 ("x86/apic: Ignore secondary threads if nosmt=force") Reported-by: Dave Hansen Signed-off-by: Thomas Gleixner Tested-by: Tony Luck Signed-off-by: Greg Kroah-Hartman commit 1c777e6c2b02c7656e00784374122318465b940c Author: Michal Hocko Date: Wed Jun 27 17:46:50 2018 +0200 x86/speculation/l1tf: Fix up pte->pfn conversion for PAE commit e14d7dfb41f5807a0c1c26a13f2b8ef16af24935 upstream Jan has noticed that pte_pfn and co. resp. pfn_pte are incorrect for CONFIG_PAE because phys_addr_t is wider than unsigned long and so the pte_val reps. shift left would get truncated. Fix this up by using proper types. Fixes: 6b28baca9b1f ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation") Reported-by: Jan Beulich Signed-off-by: Michal Hocko Signed-off-by: Thomas Gleixner Acked-by: Vlastimil Babka Signed-off-by: Greg Kroah-Hartman commit 7c7de7559da38ab2f074f0f45c6432ff72933c5b Author: Vlastimil Babka Date: Fri Jun 22 17:39:33 2018 +0200 x86/speculation/l1tf: Protect PAE swap entries against L1TF commit 0d0f6249058834ffe1ceaad0bb31464af66f6e7a upstream The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for offset. The lower word is zeroed, thus systems with less than 4GB memory are safe. With 4GB to 128GB the swap type selects the memory locations vulnerable to L1TF; with even more memory, also the swap offfset influences the address. This might be a problem with 32bit PAE guests running on large 64bit hosts. By continuing to keep the whole swap entry in either high or low 32bit word of PTE we would limit the swap size too much. Thus this patch uses the whole PAE PTE with the same layout as the 64bit version does. The macros just become a bit tricky since they assume the arch-dependent swp_entry_t to be 32bit. Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Acked-by: Michal Hocko Signed-off-by: Greg Kroah-Hartman commit f85769f1400f9ac8ea9b1ef1ab0fd94f63fb9033 Author: Borislav Petkov Date: Fri Jun 22 11:34:11 2018 +0200 x86/CPU/AMD: Move TOPOEXT reenablement before reading smp_num_siblings commit 7ce2f0393ea2396142b7faf6ee9b1f3676d08a5f upstream The TOPOEXT reenablement is a workaround for broken BIOSen which didn't enable the CPUID bit. amd_get_topology_early(), however, relies on that bit being set so that it can read out the CPUID leaf and set smp_num_siblings properly. Move the reenablement up to early_init_amd(). While at it, simplify amd_get_topology_early(). Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit e486e3cbe16d76ef1aa6fde8a5f9c6d723c26009 Author: Konrad Rzeszutek Wilk Date: Wed Jun 20 16:42:58 2018 -0400 x86/cpufeatures: Add detection of L1D cache flush support. commit 11e34e64e4103955fc4568750914c75d65ea87ee upstream 336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR (IA32_FLUSH_CMD) which is detected by CPUID.7.EDX[28]=1 bit being set. This new MSR "gives software a way to invalidate structures with finer granularity than other architectual methods like WBINVD." A copy of this document is available at https://bugzilla.kernel.org/show_bug.cgi?id=199511 Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 51d8977e5a526d91874631f3473674c510b1be1f Author: Vlastimil Babka Date: Thu Jun 21 12:36:29 2018 +0200 x86/speculation/l1tf: Extend 64bit swap file size limit commit 1a7ed1ba4bba6c075d5ad61bb75e3fbc870840d6 upstream The previous patch has limited swap file size so that large offsets cannot clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation. It assumed that offsets are encoded starting with bit 12, same as pfn. But on x86_64, offsets are encoded starting with bit 9. Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA and 256TB with 46bit MAX_PA. Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2") Signed-off-by: Vlastimil Babka Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit 0daddfa173ab932070494f3c3b530b0ce6fd2db9 Author: Thomas Gleixner Date: Tue Jun 5 14:00:11 2018 +0200 x86/apic: Ignore secondary threads if nosmt=force commit 2207def700f902f169fc237b717252c326f9e464 upstream nosmt on the kernel command line merely prevents the onlining of the secondary SMT siblings. nosmt=force makes the APIC detection code ignore the secondary SMT siblings completely, so they even do not show up as possible CPUs. That reduces the amount of memory allocations for per cpu variables and saves other resources from being allocated too large. This is not fully equivalent to disabling SMT in the BIOS because the low level SMT enabling in the BIOS can result in partitioning of resources between the siblings, which is not undone by just ignoring them. Some CPUs can use the full resources when their sibling is not onlined, but this is depending on the CPU family and model and it's not well documented whether this applies to all partitioned resources. That means depending on the workload disabling SMT in the BIOS might result in better performance. Linus analysis of the Intel manual: The intel optimization manual is not very clear on what the partitioning rules are. I find: "In general, the buffers for staging instructions between major pipe stages are partitioned. These buffers include µop queues after the execution trace cache, the queues after the register rename stage, the reorder buffer which stages instructions for retirement, and the load and store buffers. In the case of load and store buffers, partitioning also provided an easier implementation to maintain memory ordering for each logical processor and detect memory ordering violations" but some of that partitioning may be relaxed if the HT thread is "not active": "In Intel microarchitecture code name Sandy Bridge, the micro-op queue is statically partitioned to provide 28 entries for each logical processor, irrespective of software executing in single thread or multiple threads. If one logical processor is not active in Intel microarchitecture code name Ivy Bridge, then a single thread executing on that processor core can use the 56 entries in the micro-op queue" but I do not know what "not active" means, and how dynamic it is. Some of that partitioning may be entirely static and depend on the early BIOS disabling of HT, and even if we park the cores, the resources will just be wasted. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 3708ec490bf9dc8243571c0daaf4503590e6caee Author: Thomas Gleixner Date: Wed Jun 6 00:57:38 2018 +0200 x86/cpu/AMD: Evaluate smp_num_siblings early commit 1e1d7e25fd759eddf96d8ab39d0a90a1979b2d8c upstream To support force disabling of SMT it's required to know the number of thread siblings early. amd_get_topology() cannot be called before the APIC driver is selected, so split out the part which initializes smp_num_siblings and invoke it from amd_early_init(). Signed-off-by: Thomas Gleixner Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 096700fb08dc07989030f30a01b1b7c70e445d4b Author: Borislav Petkov Date: Fri Jun 15 20:48:39 2018 +0200 x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP info commit 119bff8a9c9bb00116a844ec68be7bc4b1c768f5 upstream Old code used to check whether CPUID ext max level is >= 0x80000008 because that last leaf contains the number of cores of the physical CPU. The three functions called there now do not depend on that leaf anymore so the check can go. Signed-off-by: Borislav Petkov Signed-off-by: Thomas Gleixner Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 8346482d765318a50a01942bc23b4959656d15a9 Author: Thomas Gleixner Date: Wed Jun 6 01:00:55 2018 +0200 x86/cpu/intel: Evaluate smp_num_siblings early commit 1910ad5624968f93be48e8e265513c54d66b897c upstream Make use of the new early detection function to initialize smp_num_siblings on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's required for force disabling SMT. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 2910ebe9281197ab65c953543bb3db6f69d043f8 Author: Thomas Gleixner Date: Wed Jun 6 00:55:39 2018 +0200 x86/cpu/topology: Provide detect_extended_topology_early() commit 95f3d39ccf7aaea79d1ffdac1c887c2e100ec1b6 upstream To support force disabling of SMT it's required to know the number of thread siblings early. detect_extended_topology() cannot be called before the APIC driver is selected, so split out the part which initializes smp_num_siblings. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 06ebc9fd87005eb56c08aea3f2844aca5fc37c78 Author: Thomas Gleixner Date: Wed Jun 6 00:53:57 2018 +0200 x86/cpu/common: Provide detect_ht_early() commit 545401f4448a807b963ff17b575e0a393e68b523 upstream To support force disabling of SMT it's required to know the number of thread siblings early. detect_ht() cannot be called before the APIC driver is selected, so split out the part which initializes smp_num_siblings. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 9087f7047f6e6d25ada0cb6643ec27aba810d77b Author: Thomas Gleixner Date: Wed Jun 6 00:47:10 2018 +0200 x86/cpu/AMD: Remove the pointless detect_ht() call commit 44ca36de56d1bf196dca2eb67cd753a46961ffe6 upstream Real 32bit AMD CPUs do not have SMT and the only value of the call was to reach the magic printout which got removed. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 186aeab679e342e7223d8ff5f39027695168110f Author: Thomas Gleixner Date: Wed Jun 6 00:36:15 2018 +0200 x86/cpu: Remove the pointless CPU printout commit 55e6d279abd92cfd7576bba031e7589be8475edb upstream The value of this printout is dubious at best and there is no point in having it in two different places along with convoluted ways to reach it. Remove it completely. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 73650e0724c0ccd0f85e215d0c8866853391718d Author: Thomas Gleixner Date: Tue May 29 17:48:27 2018 +0200 cpu/hotplug: Provide knobs to control SMT commit 05736e4ac13c08a4a9b1ef2de26dd31a32cbee57 upstream Provide a command line and a sysfs knob to control SMT. The command line options are: 'nosmt': Enumerate secondary threads, but do not online them 'nosmt=force': Ignore secondary threads completely during enumeration via MP table and ACPI/MADT. The sysfs control file has the following states (read/write): 'on': SMT is enabled. Secondary threads can be freely onlined 'off': SMT is disabled. Secondary threads, even if enumerated cannot be onlined 'forceoff': SMT is permanentely disabled. Writes to the control file are rejected. 'notsupported': SMT is not supported by the CPU The command line option 'nosmt' sets the sysfs control to 'off'. This can be changed to 'on' to reenable SMT during runtime. The command line option 'nosmt=force' sets the sysfs control to 'forceoff'. This cannot be changed during runtime. When SMT is 'on' and the control file is changed to 'off' then all online secondary threads are offlined and attempts to online a secondary thread later on are rejected. When SMT is 'off' and the control file is changed to 'on' then secondary threads can be onlined again. The 'off' -> 'on' transition does not automatically online the secondary threads. When the control file is set to 'forceoff', the behaviour is the same as setting it to 'off', but the operation is irreversible and later writes to the control file are rejected. When the control status is 'notsupported' then writes to the control file are rejected. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 77db3823f0a5b6a9e3addb2e796dd9f08c14b02d Author: Thomas Gleixner Date: Tue May 29 17:49:05 2018 +0200 cpu/hotplug: Split do_cpu_down() commit cc1fe215e1efa406b03aa4389e6269b61342dec5 upstream Split out the inner workings of do_cpu_down() to allow reuse of that function for the upcoming SMT disabling mechanism. No functional change. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 165ae0b7dacbd442b7b20bd32588f28e58acc884 Author: Thomas Gleixner Date: Tue May 29 19:05:25 2018 +0200 cpu/hotplug: Make bringup/teardown of smp threads symmetric commit c4de65696d865c225fda3b9913b31284ea65ea96 upstream The asymmetry caused a warning to trigger if the bootup was stopped in state CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can now be invoked on already or still parked threads. But there is still no reason to have this be asymmetric. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit aef565d8587c335aef7fc9ba500ebcef531f6d39 Author: Thomas Gleixner Date: Thu Jun 21 10:37:20 2018 +0200 x86/topology: Provide topology_smt_supported() commit f048c399e0f7490ab7296bc2c255d37eb14a9675 upstream Provide information whether SMT is supoorted by the CPUs. Preparatory patch for SMT control mechanism. Suggested-by: Dave Hansen Signed-off-by: Thomas Gleixner Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit babba06adfa67263406ded5a3bafefb5037660e9 Author: Thomas Gleixner Date: Tue May 29 17:50:22 2018 +0200 x86/smp: Provide topology_is_primary_thread() commit 6a4d2657e048f096c7ffcad254010bd94891c8c0 upstream If the CPU is supporting SMT then the primary thread can be found by checking the lower APIC ID bits for zero. smp_num_siblings is used to build the mask for the APIC ID bits which need to be taken into account. This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different than the initial APIC ID in CPUID. But according to AMD the lower bits have to be consistent. Intel gave a tentative confirmation as well. Preparatory patch to support disabling SMT at boot/runtime. Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit f3839aa4a2b9b04b5b4144412e436d6811972d2b Author: Peter Zijlstra Date: Tue May 29 16:43:46 2018 +0200 sched/smt: Update sched_smt_present at runtime commit ba2591a5993eabcc8e874e30f361d8ffbb10d6d4 upstream The static key sched_smt_present is only updated at boot time when SMT siblings have been detected. Booting with maxcpus=1 and bringing the siblings online after boot rebuilds the scheduling domains correctly but does not update the static key, so the SMT code is not enabled. Let the key be updated in the scheduler CPU hotplug code to fix this. Signed-off-by: Peter Zijlstra Signed-off-by: Thomas Gleixner Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 3f1ad8f2b59eb19fcb130270d003be4c61ef4af7 Author: Konrad Rzeszutek Wilk Date: Wed Jun 20 16:42:57 2018 -0400 x86/bugs: Move the l1tf function and define pr_fmt properly commit 56563f53d3066afa9e63d6c997bf67e76a8b05c0 upstream The pr_warn in l1tf_select_mitigation would have used the prior pr_fmt which was defined as "Spectre V2 : ". Move the function to be past SSBD and also define the pr_fmt. Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf") Signed-off-by: Konrad Rzeszutek Wilk Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman commit ad71fbdf4f0e0fcd76454e2dcbeac8af4c3b7763 Author: Andi Kleen Date: Wed Jun 13 15:48:28 2018 -0700 x86/speculation/l1tf: Limit swap file size to MAX_PA/2 commit 377eeaa8e11fe815b1d07c81c4a0e2843a8c15eb upstream For the L1TF workaround its necessary to limit the swap file size to below MAX_PA/2, so that the higher bits of the swap offset inverted never point to valid memory. Add a mechanism for the architecture to override the swap file size check in swapfile.c and add a x86 specific max swapfile check function that enforces that limit. The check is only enabled if the CPU is vulnerable to L1TF. In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system with 46bit PA it is 32TB. The limit is only per individual swap file, so it's always possible to exceed these limits with multiple swap files or partitions. Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Reviewed-by: Josh Poimboeuf Acked-by: Michal Hocko Acked-by: Dave Hansen Signed-off-by: Greg Kroah-Hartman commit 4880e94df0dc52f8d360293f3f4411348d9970f6 Author: Andi Kleen Date: Wed Jun 13 15:48:27 2018 -0700 x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings commit 42e4089c7890725fcd329999252dc489b72f2921 upstream For L1TF PROT_NONE mappings are protected by inverting the PFN in the page table entry. This sets the high bits in the CPU's address space, thus making sure to point to not point an unmapped entry to valid cached memory. Some server system BIOSes put the MMIO mappings high up in the physical address space. If such an high mapping was mapped to unprivileged users they could attack low memory by setting such a mapping to PROT_NONE. This could happen through a special device driver which is not access protected. Normal /dev/mem is of course access protected. To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings. Valid page mappings are allowed because the system is then unsafe anyways. It's not expected that users commonly use PROT_NONE on MMIO. But to minimize any impact this is only enforced if the mapping actually refers to a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip the check for root. For mmaps this is straight forward and can be handled in vm_insert_pfn and in remap_pfn_range(). For mprotect it's a bit trickier. At the point where the actual PTEs are accessed a lot of state has been changed and it would be difficult to undo on an error. Since this is a uncommon case use a separate early page talk walk pass for MMIO PROT_NONE mappings that checks for this condition early. For non MMIO and non PROT_NONE there are no changes. Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Reviewed-by: Josh Poimboeuf Acked-by: Dave Hansen Signed-off-by: Greg Kroah-Hartman commit 79f7d72277f65414034738ef3cb7e45dd151bc19 Author: Andi Kleen Date: Wed Jun 13 15:48:26 2018 -0700 x86/speculation/l1tf: Add sysfs reporting for l1tf commit 17dbca119312b4e8173d4e25ff64262119fcef38 upstream L1TF core kernel workarounds are cheap and normally always enabled, However they still should be reported in sysfs if the system is vulnerable or mitigated. Add the necessary CPU feature/bug bits. - Extend the existing checks for Meltdowns to determine if the system is vulnerable. All CPUs which are not vulnerable to Meltdown are also not vulnerable to L1TF - Check for 32bit non PAE and emit a warning as there is no practical way for mitigation due to the limited physical address bits - If the system has more than MAX_PA/2 physical memory the invert page workarounds don't protect the system against the L1TF attack anymore, because an inverted physical address will also point to valid memory. Print a warning in this case and report that the system is vulnerable. Add a function which returns the PFN limit for the L1TF mitigation, which will be used in follow up patches for sanity and range checks. [ tglx: Renamed the CPU feature bit to L1TF_PTEINV ] Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Reviewed-by: Josh Poimboeuf Acked-by: Dave Hansen Signed-off-by: Greg Kroah-Hartman commit 9dcfa2d7575207205fb45bcfe9371abc73dc2eea Author: Andi Kleen Date: Wed Jun 13 15:48:25 2018 -0700 x86/speculation/l1tf: Make sure the first page is always reserved commit 10a70416e1f067f6c4efda6ffd8ea96002ac4223 upstream The L1TF workaround doesn't make any attempt to mitigate speculate accesses to the first physical page for zeroed PTEs. Normally it only contains some data from the early real mode BIOS. It's not entirely clear that the first page is reserved in all configurations, so add an extra reservation call to make sure it is really reserved. In most configurations (e.g. with the standard reservations) it's likely a nop. Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Reviewed-by: Josh Poimboeuf Acked-by: Dave Hansen Signed-off-by: Greg Kroah-Hartman commit 23cb0697cec31c89d2211c7479f10f4a972e95df Author: Andi Kleen Date: Wed Jun 13 15:48:24 2018 -0700 x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation commit 6b28baca9b1f0d4a42b865da7a05b1c81424bd5c upstream When PTEs are set to PROT_NONE the kernel just clears the Present bit and preserves the PFN, which creates attack surface for L1TF speculation speculation attacks. This is important inside guests, because L1TF speculation bypasses physical page remapping. While the host has its own migitations preventing leaking data from other VMs into the guest, this would still risk leaking the wrong page inside the current guest. This uses the same technique as Linus' swap entry patch: while an entry is is in PROTNONE state invert the complete PFN part part of it. This ensures that the the highest bit will point to non existing memory. The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and pte/pmd/pud_pfn undo it. This assume that no code path touches the PFN part of a PTE directly without using these primitives. This doesn't handle the case that MMIO is on the top of the CPU physical memory. If such an MMIO region was exposed by an unpriviledged driver for mmap it would be possible to attack some real memory. However this situation is all rather unlikely. For 32bit non PAE the inversion is not done because there are really not enough bits to protect anything. Q: Why does the guest need to be protected when the HyperVisor already has L1TF mitigations? A: Here's an example: Physical pages 1 2 get mapped into a guest as GPA 1 -> PA 2 GPA 2 -> PA 1 through EPT. The L1TF speculation ignores the EPT remapping. Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and they belong to different users and should be isolated. A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and gets read access to the underlying physical page. Which in this case points to PA 2, so it can read process B's data, if it happened to be in L1, so isolation inside the guest is broken. There's nothing the hypervisor can do about this. This mitigation has to be done in the guest itself. [ tglx: Massaged changelog ] Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Reviewed-by: Josh Poimboeuf Acked-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Dave Hansen Signed-off-by: Greg Kroah-Hartman commit 04ed26b3ddfa2f1fe4b83f9cd134b04b8e97f2ca Author: Linus Torvalds Date: Wed Jun 13 15:48:23 2018 -0700 x86/speculation/l1tf: Protect swap entries against L1TF commit 2f22b4cd45b67b3496f4aa4c7180a1271c6452f6 upstream With L1 terminal fault the CPU speculates into unmapped PTEs, and resulting side effects allow to read the memory the PTE is pointing too, if its values are still in the L1 cache. For swapped out pages Linux uses unmapped PTEs and stores a swap entry into them. To protect against L1TF it must be ensured that the swap entry is not pointing to valid memory, which requires setting higher bits (between bit 36 and bit 45) that are inside the CPUs physical address space, but outside any real memory. To do this invert the offset to make sure the higher bits are always set, as long as the swap file is not too big. Note there is no workaround for 32bit !PAE, or on systems which have more than MAX_PA/2 worth of memory. The later case is very unlikely to happen on real systems. [AK: updated description and minor tweaks by. Split out from the original patch ] Signed-off-by: Linus Torvalds Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Tested-by: Andi Kleen Reviewed-by: Josh Poimboeuf Acked-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Dave Hansen Signed-off-by: Greg Kroah-Hartman commit 89bcd057d28ffb3ab75f95348c3f518df9f15102 Author: Linus Torvalds Date: Wed Jun 13 15:48:22 2018 -0700 x86/speculation/l1tf: Change order of offset/type in swap entry commit bcd11afa7adad8d720e7ba5ef58bdcd9775cf45f upstream If pages are swapped out, the swap entry is stored in the corresponding PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate on PTE entries which have the present bit set and would treat the swap entry as phsyical address (PFN). To mitigate that the upper bits of the PTE must be set so the PTE points to non existent memory. The swap entry stores the type and the offset of a swapped out page in the PTE. type is stored in bit 9-13 and offset in bit 14-63. The hardware ignores the bits beyond the phsyical address space limit, so to make the mitigation effective its required to start 'offset' at the lowest possible bit so that even large swap offsets do not reach into the physical address space limit bits. Move offset to bit 9-58 and type to bit 59-63 which are the bits that hardware generally doesn't care about. That, in turn, means that if you on desktop chip with only 40 bits of physical addressing, now that the offset starts at bit 9, there needs to be 30 bits of offset actually *in use* until bit 39 ends up being set, which means when inverted it will again point into existing memory. So that's 4 terabyte of swap space (because the offset is counted in pages, so 30 bits of offset is 42 bits of actual coverage). With bigger physical addressing, that obviously grows further, until the limit of the offset is hit (at 50 bits of offset - 62 bits of actual swap file coverage). This is a preparatory change for the actual swap entry inversion to protect against L1TF. [ AK: Updated description and minor tweaks. Split into two parts ] [ tglx: Massaged changelog ] Signed-off-by: Linus Torvalds Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Tested-by: Andi Kleen Reviewed-by: Josh Poimboeuf Acked-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Dave Hansen Signed-off-by: Greg Kroah-Hartman commit 7a5ff7472a7266ca9c74bb9e12f0e3d9c01aae6c Author: Andi Kleen Date: Wed Jun 13 15:48:21 2018 -0700 x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT commit 50896e180c6aa3a9c61a26ced99e15d602666a4c upstream L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU speculates on PTE entries which do not have the PRESENT bit set, if the content of the resulting physical address is available in the L1D cache. The OS side mitigation makes sure that a !PRESENT PTE entry points to a physical address outside the actually existing and cachable memory space. This is achieved by inverting the upper bits of the PTE. Due to the address space limitations this only works for 64bit and 32bit PAE kernels, but not for 32bit non PAE. This mitigation applies to both host and guest kernels, but in case of a 64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of the PAE address space (44bit) is not enough if the host has more than 43 bits of populated memory address space, because the speculation treats the PTE content as a physical host address bypassing EPT. The host (hypervisor) protects itself against the guest by flushing L1D as needed, but pages inside the guest are not protected against attacks from other processes inside the same guest. For the guest the inverted PTE mask has to match the host to provide the full protection for all pages the host could possibly map into the guest. The hosts populated address space is not known to the guest, so the mask must cover the possible maximal host address space, i.e. 52 bit. On 32bit PAE the maximum PTE mask is currently set to 44 bit because that is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits the mask to be below what the host could possible use for physical pages. The L1TF PROT_NONE protection code uses the PTE masks to determine which bits to invert to make sure the higher bits are set for unmapped entries to prevent L1TF speculation attacks against EPT inside guests. In order to invert all bits that could be used by the host, increase __PHYSICAL_PAGE_SHIFT to 52 to match 64bit. The real limit for a 32bit PAE kernel is still 44 bits because all Linux PTEs are created from unsigned long PFNs, so they cannot be higher than 44 bits on a 32bit kernel. So these extra PFN bits should be never set. The only users of this macro are using it to look at PTEs, so it's safe. [ tglx: Massaged changelog ] Signed-off-by: Andi Kleen Signed-off-by: Thomas Gleixner Reviewed-by: Josh Poimboeuf Acked-by: Michal Hocko Acked-by: Dave Hansen Signed-off-by: Greg Kroah-Hartman commit ffc42003b11a54c06a9846b494f94e089498e18b Author: Nick Desaulniers Date: Fri Aug 3 10:05:50 2018 -0700 x86/irqflags: Provide a declaration for native_save_fl commit 208cbb32558907f68b3b2a081ca2337ac3744794 upstream. It was reported that the commit d0a8d9378d16 is causing users of gcc < 4.9 to observe -Werror=missing-prototypes errors. Indeed, it seems that: extern inline unsigned long native_save_fl(void) { return 0; } compiled with -Werror=missing-prototypes produces this warning in gcc < 4.9, but not gcc >= 4.9. Fixes: d0a8d9378d16 ("x86/paravirt: Make native_save_fl() extern inline"). Reported-by: David Laight Reported-by: Jean Delvare Signed-off-by: Nick Desaulniers Signed-off-by: Thomas Gleixner Cc: hpa@zytor.com Cc: jgross@suse.com Cc: kstewart@linuxfoundation.org Cc: gregkh@linuxfoundation.org Cc: boris.ostrovsky@oracle.com Cc: astrachan@google.com Cc: mka@chromium.org Cc: arnd@arndb.de Cc: tstellar@redhat.com Cc: sedat.dilek@gmail.com Cc: David.Laight@aculab.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulniers@google.com Signed-off-by: Greg Kroah-Hartman commit ce4ec59ee604bf66b034b832bd8c1373aace488f Author: Masami Hiramatsu Date: Sat Apr 28 21:37:03 2018 +0900 kprobes/x86: Fix %p uses in error messages commit 0ea063306eecf300fcf06d2f5917474b580f666f upstream. Remove all %p uses in error messages in kprobes/x86. Signed-off-by: Masami Hiramatsu Cc: Ananth N Mavinakayanahalli Cc: Anil S Keshavamurthy Cc: Arnd Bergmann Cc: David Howells Cc: David S . Miller Cc: Heiko Carstens Cc: Jon Medhurst Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Thomas Richter Cc: Tobin C . Harding Cc: Will Deacon Cc: acme@kernel.org Cc: akpm@linux-foundation.org Cc: brueckner@linux.vnet.ibm.com Cc: linux-arch@vger.kernel.org Cc: rostedt@goodmis.org Cc: schwidefsky@de.ibm.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/152491902310.9916.13355297638917767319.stgit@devbox Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 0a9da8dd128e2e3038b0b4355cc639769047976d Author: Jiri Kosina Date: Thu Jul 26 13:14:55 2018 +0200 x86/speculation: Protect against userspace-userspace spectreRSB commit fdf82a7856b32d905c39afc85e34364491e46346 upstream. The article "Spectre Returns! Speculation Attacks using the Return Stack Buffer" [1] describes two new (sub-)variants of spectrev2-like attacks, making use solely of the RSB contents even on CPUs that don't fallback to BTB on RSB underflow (Skylake+). Mitigate userspace-userspace attacks by always unconditionally filling RSB on context switch when the generic spectrev2 mitigation has been enabled. [1] https://arxiv.org/pdf/1807.07940.pdf Signed-off-by: Jiri Kosina Signed-off-by: Thomas Gleixner Reviewed-by: Josh Poimboeuf Acked-by: Tim Chen Cc: Konrad Rzeszutek Wilk Cc: Borislav Petkov Cc: David Woodhouse Cc: Peter Zijlstra Cc: Linus Torvalds Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1807261308190.997@cbobk.fhfr.pm Signed-off-by: Greg Kroah-Hartman commit 7676d2dee2b687449bb501a98062863741b30bc8 Author: Peter Zijlstra Date: Fri Aug 3 16:41:39 2018 +0200 x86/paravirt: Fix spectre-v2 mitigations for paravirt guests commit 5800dc5c19f34e6e03b5adab1282535cb102fafd upstream. Nadav reported that on guests we're failing to rewrite the indirect calls to CALLEE_SAVE paravirt functions. In particular the pv_queued_spin_unlock() call is left unpatched and that is all over the place. This obviously wrecks Spectre-v2 mitigation (for paravirt guests) which relies on not actually having indirect calls around. The reason is an incorrect clobber test in paravirt_patch_call(); this function rewrites an indirect call with a direct call to the _SAME_ function, there is no possible way the clobbers can be different because of this. Therefore remove this clobber check. Also put WARNs on the other patch failure case (not enough room for the instruction) which I've not seen trigger in my (limited) testing. Three live kernel image disassemblies for lock_sock_nested (as a small function that illustrates the problem nicely). PRE is the current situation for guests, POST is with this patch applied and NATIVE is with or without the patch for !guests. PRE: (gdb) disassemble lock_sock_nested Dump of assembler code for function lock_sock_nested: 0xffffffff817be970 <+0>: push %rbp 0xffffffff817be971 <+1>: mov %rdi,%rbp 0xffffffff817be974 <+4>: push %rbx 0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx 0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched> 0xffffffff817be981 <+17>: mov %rbx,%rdi 0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh> 0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax 0xffffffff817be98f <+31>: test %eax,%eax 0xffffffff817be991 <+33>: jne 0xffffffff817be9ba 0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp) 0xffffffff817be99d <+45>: mov %rbx,%rdi 0xffffffff817be9a0 <+48>: callq *0xffffffff822299e8 0xffffffff817be9a7 <+55>: pop %rbx 0xffffffff817be9a8 <+56>: pop %rbp 0xffffffff817be9a9 <+57>: mov $0x200,%esi 0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi 0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063ae0 <__local_bh_enable_ip> 0xffffffff817be9ba <+74>: mov %rbp,%rdi 0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock> 0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 End of assembler dump. POST: (gdb) disassemble lock_sock_nested Dump of assembler code for function lock_sock_nested: 0xffffffff817be970 <+0>: push %rbp 0xffffffff817be971 <+1>: mov %rdi,%rbp 0xffffffff817be974 <+4>: push %rbx 0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx 0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched> 0xffffffff817be981 <+17>: mov %rbx,%rdi 0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh> 0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax 0xffffffff817be98f <+31>: test %eax,%eax 0xffffffff817be991 <+33>: jne 0xffffffff817be9ba 0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp) 0xffffffff817be99d <+45>: mov %rbx,%rdi 0xffffffff817be9a0 <+48>: callq 0xffffffff810a0c20 <__raw_callee_save___pv_queued_spin_unlock> 0xffffffff817be9a5 <+53>: xchg %ax,%ax 0xffffffff817be9a7 <+55>: pop %rbx 0xffffffff817be9a8 <+56>: pop %rbp 0xffffffff817be9a9 <+57>: mov $0x200,%esi 0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi 0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063aa0 <__local_bh_enable_ip> 0xffffffff817be9ba <+74>: mov %rbp,%rdi 0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock> 0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 End of assembler dump. NATIVE: (gdb) disassemble lock_sock_nested Dump of assembler code for function lock_sock_nested: 0xffffffff817be970 <+0>: push %rbp 0xffffffff817be971 <+1>: mov %rdi,%rbp 0xffffffff817be974 <+4>: push %rbx 0xffffffff817be975 <+5>: lea 0x88(%rbp),%rbx 0xffffffff817be97c <+12>: callq 0xffffffff819f7160 <_cond_resched> 0xffffffff817be981 <+17>: mov %rbx,%rdi 0xffffffff817be984 <+20>: callq 0xffffffff819fbb00 <_raw_spin_lock_bh> 0xffffffff817be989 <+25>: mov 0x8c(%rbp),%eax 0xffffffff817be98f <+31>: test %eax,%eax 0xffffffff817be991 <+33>: jne 0xffffffff817be9ba 0xffffffff817be993 <+35>: movl $0x1,0x8c(%rbp) 0xffffffff817be99d <+45>: mov %rbx,%rdi 0xffffffff817be9a0 <+48>: movb $0x0,(%rdi) 0xffffffff817be9a3 <+51>: nopl 0x0(%rax) 0xffffffff817be9a7 <+55>: pop %rbx 0xffffffff817be9a8 <+56>: pop %rbp 0xffffffff817be9a9 <+57>: mov $0x200,%esi 0xffffffff817be9ae <+62>: mov $0xffffffff817be993,%rdi 0xffffffff817be9b5 <+69>: jmpq 0xffffffff81063ae0 <__local_bh_enable_ip> 0xffffffff817be9ba <+74>: mov %rbp,%rdi 0xffffffff817be9bd <+77>: callq 0xffffffff817be8c0 <__lock_sock> 0xffffffff817be9c2 <+82>: jmp 0xffffffff817be993 End of assembler dump. Fixes: 63f70270ccd9 ("[PATCH] i386: PARAVIRT: add common patching machinery") Fixes: 3010a0663fd9 ("x86/paravirt, objtool: Annotate indirect calls") Reported-by: Nadav Amit Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Reviewed-by: Juergen Gross Cc: Konrad Rzeszutek Wilk Cc: Boris Ostrovsky Cc: David Woodhouse Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman commit 02e1b5a02de3d4b22ebbddac3f80aa4ff86067a8 Author: Oleksij Rempel Date: Fri Jun 15 09:41:29 2018 +0200 ARM: dts: imx6sx: fix irq for pcie bridge commit 1bcfe0564044be578841744faea1c2f46adc8178 upstream. Use the correct IRQ line for the MSI controller in the PCIe host controller. Apparently a different IRQ line is used compared to other i.MX6 variants. Without this change MSI IRQs aren't properly propagated to the upstream interrupt controller. Signed-off-by: Oleksij Rempel Reviewed-by: Lucas Stach Fixes: b1d17f68e5c5 ("ARM: dts: imx: add initial imx6sx device tree source") Signed-off-by: Shawn Guo Signed-off-by: Amit Pundir Signed-off-by: Greg Kroah-Hartman commit 8ee542a6ef5d5b08e58f6657c004caa031ef144d Author: Al Viro Date: Thu Aug 9 17:51:32 2018 -0400 fix __legitimize_mnt()/mntput() race commit 119e1ef80ecfe0d1deb6378d4ab41f5b71519de1 upstream. __legitimize_mnt() has two problems - one is that in case of success the check of mount_lock is not ordered wrt preceding increment of refcount, making it possible to have successful __legitimize_mnt() on one CPU just before the otherwise final mntpu() on another, with __legitimize_mnt() not seeing mntput() taking the lock and mntput() not seeing the increment done by __legitimize_mnt(). Solved by a pair of barriers. Another is that failure of __legitimize_mnt() on the second read_seqretry() leaves us with reference that'll need to be dropped by caller; however, if that races with final mntput() we can end up with caller dropping rcu_read_lock() and doing mntput() to release that reference - with the first mntput() having freed the damn thing just as rcu_read_lock() had been dropped. Solution: in "do mntput() yourself" failure case grab mount_lock, check if MNT_DOOMED has been set by racing final mntput() that has missed our increment and if it has - undo the increment and treat that as "failure, caller doesn't need to drop anything" case. It's not easy to hit - the final mntput() has to come right after the first read_seqretry() in __legitimize_mnt() *and* manage to miss the increment done by __legitimize_mnt() before the second read_seqretry() in there. The things that are almost impossible to hit on bare hardware are not impossible on SMP KVM, though... Reported-by: Oleg Nesterov Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro Signed-off-by: Greg Kroah-Hartman commit 8b563c7c424da3978b694845bcf8c19e8b08ccc3 Author: Al Viro Date: Thu Aug 9 17:21:17 2018 -0400 fix mntput/mntput race commit 9ea0a46ca2c318fcc449c1e6b62a7230a17888f1 upstream. mntput_no_expire() does the calculation of total refcount under mount_lock; unfortunately, the decrement (as well as all increments) are done outside of it, leading to false positives in the "are we dropping the last reference" test. Consider the following situation: * mnt is a lazy-umounted mount, kept alive by two opened files. One of those files gets closed. Total refcount of mnt is 2. On CPU 42 mntput(mnt) (called from __fput()) drops one reference, decrementing component * After it has looked at component #0, the process on CPU 0 does mntget(), incrementing component #0, gets preempted and gets to run again - on CPU 69. There it does mntput(), which drops the reference (component #69) and proceeds to spin on mount_lock. * On CPU 42 our first mntput() finishes counting. It observes the decrement of component #69, but not the increment of component #0. As the result, the total it gets is not 1 as it should've been - it's 0. At which point we decide that vfsmount needs to be killed and proceed to free it and shut the filesystem down. However, there's still another opened file on that filesystem, with reference to (now freed) vfsmount, etc. and we are screwed. It's not a wide race, but it can be reproduced with artificial slowdown of the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups. Fix consists of moving the refcount decrement under mount_lock; the tricky part is that we want (and can) keep the fast case (i.e. mount that still has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu() before that mntput(). IOW, if mntput() observes (under rcu_read_lock()) a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to be dropped. Reported-by: Jann Horn Tested-by: Jann Horn Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro Signed-off-by: Greg Kroah-Hartman commit 0aeaf7c64bde62bae04cf50bd25d424094b47f53 Author: Al Viro Date: Thu Aug 9 10:15:54 2018 -0400 make sure that __dentry_kill() always invalidates d_seq, unhashed or not commit 4c0d7cd5c8416b1ef41534d19163cb07ffaa03ab upstream. RCU pathwalk relies upon the assumption that anything that changes ->d_inode of a dentry will invalidate its ->d_seq. That's almost true - the one exception is that the final dput() of already unhashed dentry does *not* touch ->d_seq at all. Unhashing does, though, so for anything we'd found by RCU dcache lookup we are fine. Unfortunately, we can *start* with an unhashed dentry or jump into it. We could try and be careful in the (few) places where that could happen. Or we could just make the final dput() invalidate the damn thing, unhashed or not. The latter is much simpler and easier to backport, so let's do it that way. Reported-by: "Dae R. Jeong" Cc: stable@vger.kernel.org Signed-off-by: Al Viro Signed-off-by: Greg Kroah-Hartman commit e3cf27089079ace669320a59059b47cab859dceb Author: Al Viro Date: Mon Aug 6 09:03:58 2018 -0400 root dentries need RCU-delayed freeing commit 90bad5e05bcdb0308cfa3d3a60f5c0b9c8e2efb3 upstream. Since mountpoint crossing can happen without leaving lazy mode, root dentries do need the same protection against having their memory freed without RCU delay as everything else in the tree. It's partially hidden by RCU delay between detaching from the mount tree and dropping the vfsmount reference, but the starting point of pathwalk can be on an already detached mount, in which case umount-caused RCU delay has already passed by the time the lazy pathwalk grabs rcu_read_lock(). If the starting point happens to be at the root of that vfsmount *and* that vfsmount covers the entire filesystem, we get trouble. Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro Signed-off-by: Greg Kroah-Hartman commit ce8998f030222a92d2012ebec3ab6771ecbdba9d Author: Linus Torvalds Date: Sun Aug 12 12:19:42 2018 -0700 init: rename and re-order boot_cpu_state_init() commit b5b1404d0815894de0690de8a1ab58269e56eae6 upstream. This is purely a preparatory patch for upcoming changes during the 4.19 merge window. We have a function called "boot_cpu_state_init()" that isn't really about the bootup cpu state: that is done much earlier by the similarly named "boot_cpu_init()" (note lack of "state" in name). This function initializes some hotplug CPU state, and needs to run after the percpu data has been properly initialized. It even has a comment to that effect. Except it _doesn't_ actually run after the percpu data has been properly initialized. On x86 it happens to do that, but on at least arm and arm64, the percpu base pointers are initialized by the arch-specific 'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init(). This had some unexpected results, and in particular we have a patch pending for the merge window that did the obvious cleanup of using 'this_cpu_write()' in the cpu hotplug init code: - per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE; + this_cpu_write(cpuhp_state.state, CPUHP_ONLINE); which is obviously the right thing to do. Except because of the ordering issue, it actually failed miserably and unexpectedly on arm64. So this just fixes the ordering, and changes the name of the function to be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu hotplug state, because the core CPU state was supposed to have already been done earlier. Marked for stable, since the (not yet merged) patch that will show this problem is marked for stable. Reported-by: Vlastimil Babka Reported-by: Mian Yousaf Kaukab Suggested-by: Catalin Marinas Acked-by: Thomas Gleixner Cc: Will Deacon Cc: stable@kernel.org Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit 20dfc496dce0b5146ea96d06079706cba603df0d Author: Quinn Tran Date: Thu Jul 26 16:34:44 2018 -0700 scsi: qla2xxx: Fix memory leak for allocating abort IOCB commit 5e53be8e476a3397ed5383c23376f299555a2b43 upstream. In the case of IOCB QFull, Initiator code can leave behind a stale pointer to an SRB structure on the outstanding command array. Fixes: 82de802ad46e ("scsi: qla2xxx: Preparation for Target MQ.") Cc: stable@vger.kernel.org #v4.16+ Signed-off-by: Quinn Tran Signed-off-by: Himanshu Madhani Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman commit a4a7bb655e9369feb2d941d7931cd390639a3311 Author: Bart Van Assche Date: Thu Aug 2 10:44:42 2018 -0700 scsi: sr: Avoid that opening a CD-ROM hangs with runtime power management enabled commit 1214fd7b497400d200e3f4e64e2338b303a20949 upstream. Surround scsi_execute() calls with scsi_autopm_get_device() and scsi_autopm_put_device(). Note: removing sr_mutex protection from the scsi_cd_get() and scsi_cd_put() calls is safe because the purpose of sr_mutex is to serialize cdrom_*() calls. This patch avoids that complaints similar to the following appear in the kernel log if runtime power management is enabled: INFO: task systemd-udevd:650 blocked for more than 120 seconds. Not tainted 4.18.0-rc7-dbg+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. systemd-udevd D28176 650 513 0x00000104 Call Trace: __schedule+0x444/0xfe0 schedule+0x4e/0xe0 schedule_preempt_disabled+0x18/0x30 __mutex_lock+0x41c/0xc70 mutex_lock_nested+0x1b/0x20 __blkdev_get+0x106/0x970 blkdev_get+0x22c/0x5a0 blkdev_open+0xe9/0x100 do_dentry_open.isra.19+0x33e/0x570 vfs_open+0x7c/0xd0 path_openat+0x6e3/0x1120 do_filp_open+0x11c/0x1c0 do_sys_open+0x208/0x2d0 __x64_sys_openat+0x59/0x70 do_syscall_64+0x77/0x230 entry_SYSCALL_64_after_hwframe+0x49/0xbe Signed-off-by: Bart Van Assche Cc: Maurizio Lombardi Cc: Johannes Thumshirn Cc: Alan Stern Cc: Tested-by: Johannes Thumshirn Reviewed-by: Johannes Thumshirn Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman commit f2be94d24f6a7f43087158209391a159a90fe492 Author: Daniel Borkmann Date: Wed Aug 8 19:23:13 2018 +0200 bpf, sockmap: fix bpf_tcp_sendmsg sock error handling commit 5121700b346b6160ccc9411194e3f1f417c340d1 upstream. While working on bpf_tcp_sendmsg() code, I noticed that when a sk->sk_err is set we error out with err = sk->sk_err. However this is problematic since sk->sk_err is a positive error value and therefore we will neither go into sk_stream_error() nor will we report an error back to user space. I had this case with EPIPE and user space was thinking sendmsg() succeeded since EPIPE is a positive value, thinking we submitted 32 bytes. Fix it by negating the sk->sk_err value. Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Signed-off-by: Daniel Borkmann Acked-by: John Fastabend Signed-off-by: Alexei Starovoitov Signed-off-by: Greg Kroah-Hartman commit 5c6850e8f7364a98729dd57435b8d9d55696b90c Author: Daniel Borkmann Date: Wed Aug 8 19:23:14 2018 +0200 bpf, sockmap: fix leak in bpf_tcp_sendmsg wait for mem path commit 7c81c71730456845e6212dccbf00098faa66740f upstream. In bpf_tcp_sendmsg() the sk_alloc_sg() may fail. In the case of ENOMEM, it may also mean that we've partially filled the scatterlist entries with pages. Later jumping to sk_stream_wait_memory() we could further fail with an error for several reasons, however we miss to call free_start_sg() if the local sk_msg_buff was used. Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Signed-off-by: Daniel Borkmann Acked-by: John Fastabend Signed-off-by: Alexei Starovoitov Signed-off-by: Greg Kroah-Hartman commit 303a32ac205b5dc186e1f02b273f4daadaa152e2 Author: Juergen Gross Date: Thu Aug 9 16:42:16 2018 +0200 xen/netfront: don't cache skb_shinfo() commit d472b3a6cf63cd31cae1ed61930f07e6cd6671b5 upstream. skb_shinfo() can change when calling __pskb_pull_tail(): Don't cache its return value. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross Reviewed-by: Wei Liu Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit ef74528ff7a839c84a7196b012db67225454b993 Author: Minchan Kim Date: Fri Aug 10 17:23:10 2018 -0700 zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature commit 4f7a7beaee77275671654f7b9f3f9e73ca16ec65 upstream. If zram supports writeback feature, it's no longer a BD_CAP_SYNCHRONOUS_IO device beause zram does asynchronous IO operations for incompressible pages. Do not pretend to be synchronous IO device. It makes the system very sluggish due to waiting for IO completion from upper layers. Furthermore, it causes a user-after-free problem because swap thinks the opearion is done when the IO functions returns so it can free the page (e.g., lock_page_or_retry and goto out_release in do_swap_page) but in fact, IO is asynchronous so the driver could access a just freed page afterward. This patch fixes the problem. BUG: Bad page state in process qemu-system-x86 pfn:3dfab21 page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1 flags: 0x17fffc000000008(uptodate) raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set bad because of flags: 0x8(uptodate) CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G B 4.18.0-rc5+ #1 Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017 Call Trace: dump_stack+0x5c/0x7b bad_page+0xba/0x120 get_page_from_freelist+0x1016/0x1250 __alloc_pages_nodemask+0xfa/0x250 alloc_pages_vma+0x7c/0x1c0 do_swap_page+0x347/0x920 __handle_mm_fault+0x7b4/0x1110 handle_mm_fault+0xfc/0x1f0 __get_user_pages+0x12f/0x690 get_user_pages_unlocked+0x148/0x1f0 __gfn_to_pfn_memslot+0xff/0x3c0 [kvm] try_async_pf+0x87/0x230 [kvm] tdp_page_fault+0x132/0x290 [kvm] kvm_mmu_page_fault+0x74/0x570 [kvm] kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm] kvm_vcpu_ioctl+0x388/0x5d0 [kvm] do_vfs_ioctl+0xa2/0x630 ksys_ioctl+0x70/0x80 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x55/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: https://lore.kernel.org/lkml/0516ae2d-b0fd-92c5-aa92-112ba7bd32fc@contabo.de/ Link: http://lkml.kernel.org/r/20180802051112.86174-1-minchan@kernel.org [minchan@kernel.org: fix changelog, add comment] Link: https://lore.kernel.org/lkml/0516ae2d-b0fd-92c5-aa92-112ba7bd32fc@contabo.de/ Link: http://lkml.kernel.org/r/20180802051112.86174-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20180805233722.217347-1-minchan@kernel.org [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Minchan Kim Reported-by: Tino Lehnig Tested-by: Tino Lehnig Cc: Sergey Senozhatsky Cc: Jens Axboe Cc: [4.15+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit f0f759e70ffa9319c8010b5f3f3c1f211b3efb41 Author: Daniel Bristot de Oliveira Date: Fri Jul 20 11:16:30 2018 +0200 sched/deadline: Update rq_clock of later_rq when pushing a task commit 840d719604b0925ca23dde95f1767e4528668369 upstream. Daniel Casini got this warn while running a DL task here at RetisLab: [ 461.137582] ------------[ cut here ]------------ [ 461.137583] rq->clock_update_flags < RQCF_ACT_SKIP [ 461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 assert_clock_updated.isra.32.part.33+0x17/0x20 [a ton of modules] [ 461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3 [ 461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013 [ 461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20 [ 461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b [ 461.137673] RSP: 0018:ffffa77e08cafc68 EFLAGS: 00010082 [ 461.137674] RAX: 0000000000000000 RBX: ffff8b3fc1702d80 RCX: 0000000000000006 [ 461.137674] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8b3fded164b0 [ 461.137675] RBP: ffffa77e08cafc68 R08: 0000000000000026 R09: 0000000000000339 [ 461.137676] R10: ffff8b3fd060d410 R11: 0000000000000026 R12: ffffffffa4e14e20 [ 461.137677] R13: ffff8b3fdec22940 R14: ffff8b3fc1702da0 R15: ffff8b3fdec22940 [ 461.137678] FS: 00007efe43ee5700(0000) GS:ffff8b3fded00000(0000) knlGS:0000000000000000 [ 461.137679] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 461.137680] CR2: 00007efe30000010 CR3: 0000000301744003 CR4: 00000000001606e0 [ 461.137680] Call Trace: [ 461.137684] push_dl_task.part.46+0x3bc/0x460 [ 461.137686] task_woken_dl+0x60/0x80 [ 461.137689] ttwu_do_wakeup+0x4f/0x150 [ 461.137690] ttwu_do_activate+0x77/0x80 [ 461.137692] try_to_wake_up+0x1d6/0x4c0 [ 461.137693] wake_up_q+0x32/0x70 [ 461.137696] do_futex+0x7e7/0xb50 [ 461.137698] __x64_sys_futex+0x8b/0x180 [ 461.137701] do_syscall_64+0x5a/0x110 [ 461.137703] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 461.137705] RIP: 0033:0x7efe4918ca26 [ 461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2 [ 461.137738] RSP: 002b:00007efe43ee4928 EFLAGS: 00000283 ORIG_RAX: 00000000000000ca [ 461.137739] RAX: ffffffffffffffda RBX: 0000000005094df0 RCX: 00007efe4918ca26 [ 461.137740] RDX: 0000000000000001 RSI: 0000000000000085 RDI: 0000000005094e24 [ 461.137741] RBP: 00007efe43ee49c0 R08: 0000000005094e20 R09: 0000000004000001 [ 461.137741] R10: 0000000000000001 R11: 0000000000000283 R12: 0000000000000000 [ 461.137742] R13: 0000000005094df8 R14: 0000000000000001 R15: 0000000000448a10 [ 461.137743] ---[ end trace 187df4cad2bf7649 ]--- This warning happened in the push_dl_task(), because __add_running_bw()->cpufreq_update_util() is getting the rq_clock of the later_rq before its update, which takes place at activate_task(). The fix then is to update the rq_clock before calling add_running_bw(). To avoid double rq_clock_update() call, we set ENQUEUE_NOCLOCK flag to activate_task(). Reported-by: Daniel Casini Signed-off-by: Daniel Bristot de Oliveira Signed-off-by: Peter Zijlstra (Intel) Acked-by: Juri Lelli Cc: Clark Williams Cc: Linus Torvalds Cc: Luca Abeni Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Tommaso Cucinotta Fixes: e0367b12674b sched/deadline: Move CPU frequency selection triggering points Link: http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bristot@redhat.com Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit c315c4e36d4dc98f2249a497c1a8b04f5d8a17f3 Author: Isaac J. Manjarres Date: Tue Jul 17 12:35:29 2018 -0700 stop_machine: Disable preemption after queueing stopper threads commit 2610e88946632afb78aa58e61f11368ac4c0af7b upstream. This commit: 9fb8d5dc4b64 ("stop_machine, Disable preemption when waking two stopper threads") does not fully address the race condition that can occur as follows: On one CPU, call it CPU 3, thread 1 invokes cpu_stop_queue_two_works(2, 3,...), and the execution is such that thread 1 queues the works for migration/2 and migration/3, and is preempted after releasing the locks for migration/2 and migration/3, but before waking the threads. Then, On CPU 2, a kworker, call it thread 2, is running, and it invokes cpu_stop_queue_two_works(1, 2,...), such that thread 2 queues the works for migration/1 and migration/2. Meanwhile, on CPU 3, thread 1 resumes execution, and wakes migration/2 and migration/3. This means that when CPU 2 releases the locks for migration/1 and migration/2, but before it wakes those threads, it can be preempted by migration/2. If thread 2 is preempted by migration/2, then migration/2 will execute the first work item successfully, since migration/3 was woken up by CPU 3, but when it goes to execute the second work item, it disables preemption, calls multi_cpu_stop(), and thus, CPU 2 will wait forever for migration/1, which should have been woken up by thread 2. However migration/1 cannot be woken up by thread 2, since it is a kworker, so it is affine to CPU 2, but CPU 2 is running migration/2 with preemption disabled, so thread 2 will never run. Disable preemption after queueing works for stopper threads to ensure that the operation of queueing the works and waking the stopper threads is atomic. Co-Developed-by: Prasad Sodagudi Co-Developed-by: Pavankumar Kondeti Signed-off-by: Isaac J. Manjarres Signed-off-by: Prasad Sodagudi Signed-off-by: Pavankumar Kondeti Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: bigeasy@linutronix.de Cc: gregkh@linuxfoundation.org Cc: matt@codeblueprint.co.uk Fixes: 9fb8d5dc4b64 ("stop_machine, Disable preemption when waking two stopper threads") Link: http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isaacm@codeaurora.org Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman commit 01f2e8354ca072a7d469adf60a4852f4eaaaf707 Author: Linus Torvalds Date: Mon Jan 8 11:51:04 2018 -0800 Mark HI and TASKLET softirq synchronous commit 3c53776e29f81719efcf8f7a6e30cdf753bee94d upstream. Way back in 4.9, we committed 4cd13c21b207 ("softirq: Let ksoftirqd do its job"), and ever since we've had small nagging issues with it. For example, we've had: 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred") 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do not get to run") 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") all of which worked around some of the effects of that commit. The DVB people have also complained that the commit causes excessive USB URB latencies, which seems to be due to the USB code using tasklets to schedule USB traffic. This seems to be an issue mainly when already living on the edge, but waiting for ksoftirqd to handle it really does seem to cause excessive latencies. Now Hanna Hawa reports that this issue isn't just limited to USB URB and DVB, but also causes timeout problems for the Marvell SoC team: "I'm facing kernel panic issue while running raid 5 on sata disks connected to Macchiatobin (Marvell community board with Armada-8040 SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2) uses a tasklet to clean the done descriptors from the queue" The latency problem causes a panic: mv_xor_v2 f0400000.xor: dma_sync_wait: timeout! Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction We've discussed simply just reverting the original commit entirely, and also much more involved solutions (with per-softirq threads etc). This patch is intentionally stupid and fairly limited, because the issue still remains, and the other solutions either got sidetracked or had other issues. We should probably also consider the timer softirqs to be synchronous and not be delayed to ksoftirqd (since they were the issue with the earlier watchdog problems), but that should be done as a separate patch. This does only the tasklet cases. Reported-and-tested-by: Hanna Hawa Reported-and-tested-by: Josef Griebichler Reported-by: Mauro Carvalho Chehab Cc: Alan Stern Cc: Greg Kroah-Hartman Cc: Eric Dumazet Cc: Ingo Molnar Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit f31d0592090f79fa2bb1ad1aced53fbf7aaad6ed Author: John David Anglin Date: Sun Aug 5 13:30:31 2018 -0400 parisc: Define mb() and add memory barriers to assembler unlock sequences commit fedb8da96355f5f64353625bf96dc69423ad1826 upstream. For years I thought all parisc machines executed loads and stores in order. However, Jeff Law recently indicated on gcc-patches that this is not correct. There are various degrees of out-of-order execution all the way back to the PA7xxx processor series (hit-under-miss). The PA8xxx series has full out-of-order execution for both integer operations, and loads and stores. This is described in the following article: http://web.archive.org/web/20040214092531/http://www.cpus.hp.com/technical_references/advperf.shtml For this reason, we need to define mb() and to insert a memory barrier before the store unlocking spinlocks. This ensures that all memory accesses are complete prior to unlocking. The ldcw instruction performs the same function on entry. Signed-off-by: John David Anglin Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Helge Deller Signed-off-by: Greg Kroah-Hartman commit 8ab85f0bec9b334b78fe95224c66a88a0669f212 Author: Helge Deller Date: Sat Jul 28 11:47:17 2018 +0200 parisc: Enable CONFIG_MLONGCALLS by default commit 66509a276c8c1d19ee3f661a41b418d101c57d29 upstream. Enable the -mlong-calls compiler option by default, because otherwise in most cases linking the vmlinux binary fails due to truncations of R_PARISC_PCREL22F relocations. This fixes building the 64-bit defconfig. Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Helge Deller Signed-off-by: Greg Kroah-Hartman