(Non-)kerberized NFS4 Riddle

Jurjen Bokma

Last modified: "2015-05-17 19:24:20 jurjen"

Abstract

We saw the Linux kernel get stuck, typically for 22s, in a gssd upcall. A network trace showed the NFS4 client requesting Kerberos tickets for a service that does not use Kerberos. It turns out that the non-kerberized NFS server should not have a Kerberos keytab!


Symptoms

We have an NFS4 client 'client'. It contacts two kinds of NFS4 servers: those that do use Kerberos ('srv_krb'), and those that don't ('srv_nokrb').

With some kernels (on Ubuntu, versions 4.4 up to 4.8; we have not checked later ones), the client will sometimes lock up. Syslog will report a CPU being stuck, typically for 22s, in rpc.gssd, and a trace like this one will show up in the kernel logs:


[ 3699.686975] Call Trace:
[ 3699.686978]  [<ffffffff81183457>] queued_spin_lock_slowpath+0xb/0xf
[ 3699.686980]  [<ffffffff81807a40>] _raw_spin_lock+0x20/0x30
[ 3699.686983]  [<ffffffffc0146d50>] gss_setup_upcall+0x160/0x390 [auth_rpcgss]
[ 3699.686985]  [<ffffffffc0147b8e>] gss_cred_init+0xce/0x350 [auth_rpcgss]
[ 3699.686987]  [<ffffffff810bf500>] ? prepare_to_wait_event+0xf0/0xf0
[ 3699.686997]  [<ffffffffc00df473>] rpcauth_lookup_credcache+0x1e3/0x280 [sunrpc]
[ 3699.686999]  [<ffffffffc014538e>] gss_lookup_cred+0xe/0x10 [auth_rpcgss]
[ 3699.687005]  [<ffffffffc00dec7c>] rpcauth_lookupcred+0x7c/0xb0 [sunrpc]
[ 3699.687010]  [<ffffffffc00dfc6a>] rpcauth_refreshcred+0x12a/0x1a0 [sunrpc]
[ 3699.687014]  [<ffffffffc00cf650>] ? call_bc_transmit+0x1a0/0x1a0 [sunrpc]
[ 3699.687019]  [<ffffffffc00cf650>] ? call_bc_transmit+0x1a0/0x1a0 [sunrpc]
[ 3699.687023]  [<ffffffffc00cfb30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
[ 3699.687027]  [<ffffffffc00cfb30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
[ 3699.687031]  [<ffffffffc00cfb6c>] call_refresh+0x3c/0x70 [sunrpc]
[ 3699.687036]  [<ffffffffc00db496>] __rpc_execute+0x86/0x440 [sunrpc]
[ 3699.687042]  [<ffffffffc00de57e>] rpc_execute+0x5e/0xb0 [sunrpc]
[ 3699.687046]  [<ffffffffc00d2210>] rpc_run_task+0x70/0x90 [sunrpc]
[ 3699.687054]  [<ffffffffc0a2a176>] nfs4_call_sync_sequence+0x56/0x80 [nfsv4]
[ 3699.687059]  [<ffffffffc0a2c60d>] _nfs4_lookup_root.isra.60+0xcd/0xe0 [nfsv4]
[ 3699.687067]  [<ffffffffc00ee0f8>] ? rpc_find_or_alloc_pipe_dir_object+0x88/0xa0 [sunrpc]
[ 3699.687071]  [<ffffffffc0a34632>] nfs4_lookup_root+0x52/0xf0 [nfsv4]
[ 3699.687076]  [<ffffffffc0a34721>] nfs4_lookup_root_sec+0x51/0x70 [nfsv4]
[ 3699.687080]  [<ffffffffc0a3477c>] nfs4_find_root_sec+0x3c/0xb0 [nfsv4]
[ 3699.687084]  [<ffffffffc0a375c9>] nfs4_proc_get_rootfh+0x39/0x90 [nfsv4]
[ 3699.687091]  [<ffffffffc0a4eb1c>] nfs4_get_rootfh+0x4c/0x110 [nfsv4]
[ 3699.687096]  [<ffffffffc00d1bc4>] ? rpc_clone_client_set_auth+0x44/0x50 [sunrpc]
[ 3699.687097]  [<ffffffff811de86d>] ? kmem_cache_alloc_trace+0x1ad/0x220
[ 3699.687104]  [<ffffffffc01671bf>] ? nfs_alloc_fattr+0x1f/0x70 [nfs]
[ 3699.687111]  [<ffffffffc0a4eebf>] nfs4_server_common_setup+0x9f/0x1d0 [nfsv4]
[ 3699.687116]  [<ffffffffc0a5029a>] nfs4_create_server+0x23a/0x360 [nfsv4]
[ 3699.687119]  [<ffffffff813f0f19>] ? find_next_bit+0x19/0x20
[ 3699.687124]  [<ffffffffc0a4845e>] nfs4_remote_mount+0x2e/0x60 [nfsv4]
[ 3699.687126]  [<ffffffff81204b49>] mount_fs+0x39/0x160
[ 3699.687128]  [<ffffffff811a6905>] ? __alloc_percpu+0x15/0x20
[ 3699.687130]  [<ffffffff8121fe17>] vfs_kern_mount+0x67/0x110
[ 3699.687135]  [<ffffffffc0a48386>] nfs_do_root_mount+0x86/0xc0 [nfsv4]
[ 3699.687139]  [<ffffffffc0a48784>] nfs4_try_mount+0x44/0xc0 [nfsv4]
[ 3699.687143]  [<ffffffffc01611c7>] ? get_nfs_version+0x27/0x90 [nfs]
[ 3699.687148]  [<ffffffffc016cba1>] nfs_fs_mount+0x481/0xd10 [nfs]
[ 3699.687153]  [<ffffffffc016da20>] ? nfs_clone_super+0x130/0x130 [nfs]
[ 3699.687157]  [<ffffffffc016bbe0>] ? param_set_portnr+0x50/0x50 [nfs]
[ 3699.687159]  [<ffffffff81204b49>] mount_fs+0x39/0x160
[ 3699.687160]  [<ffffffff811a6905>] ? __alloc_percpu+0x15/0x20
[ 3699.687161]  [<ffffffff8121fe17>] vfs_kern_mount+0x67/0x110
[ 3699.687163]  [<ffffffff81222426>] do_mount+0x236/0xd60
[ 3699.687164]  [<ffffffff8122324b>] SyS_mount+0x8b/0xe0
[ 3699.687165]  [<ffffffff81807df6>] entry_SYSCALL_64_fastpath+0x16/0x75
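
To recognize this pattern in saved kernel logs, a small parser can help. The following is a minimal sketch in Python, not part of the original investigation: the regular expression is fitted to the trace format shown above, and the names parse_frames and gss_upcall_during_mount are hypothetical. It rejoins frames that were wrapped onto a second line and then checks whether a GSS upcall appears inside a mount system call.

```python
import re

# One frame of the trace format shown above, for example:
# "[ 3699.686983]  [<ffffffffc0146d50>] gss_setup_upcall+0x160/0x390 [auth_rpcgss]"
# Group 1: optional "? " marker (frame found on the stack but not verified).
# Group 2: symbol name. Group 3: module name, absent for core kernel symbols.
FRAME = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[<[0-9a-f]+>\]\s+(\?\s+)?"
    r"([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_frames(trace_text):
    """Return a list of (symbol, module, reliable) tuples."""
    # A line that does not start with a "[ seconds.micros]" stamp is treated
    # as the wrapped tail of the previous frame and joined back onto it.
    joined = re.sub(r"\n(?!\[\s*\d+\.\d+\])", " ", trace_text.strip())
    frames = []
    for line in joined.splitlines():
        m = FRAME.search(line)
        if m:
            frames.append((m.group(2), m.group(3), m.group(1) is None))
    return frames


def gss_upcall_during_mount(frames):
    """True if the trace shows gss_setup_upcall reached from SyS_mount."""
    reliable = {symbol for symbol, _module, ok in frames if ok}
    return {"gss_setup_upcall", "SyS_mount"} <= reliable


# A few lines copied from the trace above, including one wrapped frame.
sample = """\
[ 3699.686980]  [<ffffffff81807a40>] _raw_spin_lock+0x20/0x30
[ 3699.686983]  [<ffffffffc0146d50>] gss_setup_upcall+0x160/0x390
[auth_rpcgss]
[ 3699.687080]  [<ffffffffc0a3477c>] nfs4_find_root_sec+0x3c/0xb0 [nfsv4]
[ 3699.687165]  [<ffffffff8122324b>] SyS_mount+0x8b/0xe0
"""

if __name__ == "__main__":
    for frame in parse_frames(sample):
        print(frame)
    print(gss_upcall_during_mount(parse_frames(sample)))
```

Matching on symbol names rather than addresses keeps the check independent of where the modules happen to be loaded.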
      

This does not occur in any of the 3.x kernels we used. It also depends on the network: in some networks it happens more often than in others.
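
When surveying a fleet of clients, it can help to sort machines by kernel release according to these observations. The sketch below is an illustration only; the function name observation_for and its three labels are hypothetical. It encodes just what is reported here: not seen on 3.x, seen on 4.4 up to 4.8, and everything else unchecked. "unchecked" does not mean safe.

```python
import platform
import re


def observation_for(release):
    """Classify a kernel release string such as '4.4.0-21-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        return "unchecked"
    version = (int(m.group(1)), int(m.group(2)))
    if version[0] == 3:
        return "not seen"   # no lockups on the 3.x kernels that were used
    if (4, 4) <= version <= (4, 8):
        return "seen"       # the range in which lockups were observed
    return "unchecked"      # nothing is known about other versions


if __name__ == "__main__":
    # platform.release() returns the running kernel's release string on Linux.
    print(platform.release(), observation_for(platform.release()))
```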

On affected machines, mounting shares from srv_nokrb hangs and, probably because of our particular configuration, sudo can no longer be used after that.

Sniffing the network with tcpdump shows the client requesting Kerberos tickets for srv_nokrb, even though no such tickets can be obtained (since the server doesn't use Kerberos in the first place). The request (and subsequent denial) is repeated many times over.
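
A capture of this kind can be set up roughly as follows. This is a sketch under assumptions, not the exact command we ran: it assumes Kerberos traffic on the standard port 88 and NFS4 on port 2049, and the helper name tcpdump_argv is hypothetical. It only builds the tcpdump command line; running the capture itself requires root privileges.

```python
import shlex

KERBEROS_PORT = 88   # standard KDC port, TCP and UDP (assumed default setup)
NFS_PORT = 2049      # standard NFS4 port (assumed default setup)


def tcpdump_argv(nfs_server, interface="any", outfile=None):
    """Build a tcpdump command that captures KDC and NFS traffic."""
    # Ticket requests go to the KDC, not to the NFS server, so all port 88
    # traffic is captured, plus the NFS traffic to the server in question.
    capture_filter = (
        f"port {KERBEROS_PORT} or (host {nfs_server} and port {NFS_PORT})"
    )
    argv = ["tcpdump", "-n", "-i", interface]  # -n: no name resolution
    if outfile:
        argv += ["-w", outfile]                # -w: write raw packets to file
    argv.append(capture_filter)
    return argv


if __name__ == "__main__":
    # Print a shell-quoted command line for copy and paste.
    print(shlex.join(tcpdump_argv("srv_nokrb", outfile="gssd.pcap")))
```

Writing the packets to a file makes it possible to inspect the repeated requests and denials afterwards in a protocol analyzer.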

Another symptom that may or may not have the same cause is that when many (typically 30 or more) clients hit the same non-kerberized NFS4 server somewhat concurrently (typically within 4s of one another), some of the clients lock up indefinitely.