Category: System Administration

December 11, 2023

Tableau: Upgrading from 2022.3.x to 2023.3.0

A.K.A. I upgraded and now my site has no content?!? Attempting to test the upgrade to 2023.3.0 in our development environment, the site was absolutely empty after the upgrade completed. No errors, nothing indicating something went wrong. Just nothing in the web page where I would expect to see data sources, workbooks, etc. The database still had a lot of ‘stuff’, the disk still had hundreds of gigs of ‘stuff’. But nothing showed up. I have experienced this problem starting with 2022.3.5 or 2022.3.11 and upgrading to 2023.3.0. I could upgrade to 2023.1.x and still have site content.

I wasn’t doing anything peculiar during the upgrade:

Run TableauServerTabcmd-64bit-2023-3-0.exe to upgrade the CLI
Run TableauServer-64bit-2023-3-0.exe to upgrade the Tableau binaries
Once installation completes, run open a new command prompt with Run as Administrator and launch “.\Tableau\Tableau Server\packages\scripts.20233.23.1017.0948\upgrade-tsm.cmd” –username username

The upgrade-tsm batch upgrades all of the components and database content. At this point, the server will be stopped. Start it. Verify everything looks OK – site is online, SSL is right, I can log in. Check out the site data … it’s not there!

Reportedly this is a known bug that only impacts systems that have been restored from backup. Since all of my servers were moved from Windows 2012 to Windows 2019 by backing up and restoring the Tableau sites … that’d be all of ’em! Fortunately, it is easy enough to make the data visible again. Run tsm maintenance reindex-search to recreate the search index. Refresh the user site, and there will be workbooks, data, jobs, and all sorts of things.

If reindexing does not sort the problem, tsm maintenance reset-searchserver should do it. The search reindex sorted me, though.

October 2, 2023

Linux – High Load with CIFS Mounts using Kernel 6.5.5

We recently updated our Fedora servers from 36 and 37 to 38. Since the upgrade, we have observed servers with very high load averages – 8+ on a 4-cpu server – but the server didn’t seem unreasonably slow. On the Unix servers I first used, Irix and Solaris, load average counts threads in a Runnable state. Linux, however, includes both Runnable and Uninterruptible states in the load average. This means processes waiting – on network calls using mkdir to a mounted remote server, local disk I/O – are included in the load average. As such, a high load average on Linux may indicate CPU resource contention but it may also indicate I/O contention elsewhere.

But there’s a third possibility – code that opts for the simplicity of the uninterrupted sleep without needing to be uninterruptible for a process. In our upgrade, CIFS mounts have a laundromat that I assume cleans up cache – I see four cifsd-cfid-laundromat threads in an uninterruptible sleep state – which means my load average, when the system is doing absolutely nothing, would be 4.

2023-10-03 11:11:12 [lisa@server01 ~/]# ps aux | grep " [RD]"
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1150 0.0 0.0 0 0 ? D Sep28 0:01 [cifsd-cfid-laundromat]
root 1151 0.0 0.0 0 0 ? D Sep28 0:01 [cifsd-cfid-laundromat]
root 1152 0.0 0.0 0 0 ? D Sep28 0:01 [cifsd-cfid-laundromat]
root 1153 0.0 0.0 0 0 ? D Sep28 0:01 [cifsd-cfid-laundromat]
root 556598 0.0 0.0 224668 3072 pts/11 R+ 11:11 0:00 ps aux

Looking around the Internet, I see quite a few bug reports regarding this situation … so it seems like a “ignore it and wait” problem – although the load average value is increased by these sleeping threads, it’s cosmetic. Which explains why the server didn’t seem to be running slowly even through the load average was so high.

https://lkml.org/lkml/2023/9/26/1144

Date: Tue, 26 Sep 2023 17:54:10 -0700
From: Paul Aurich 
Subject: Re: Possible bug report: kernel 6.5.0/6.5.1 high load when CIFS share is mounted (cifsd-cfid-laundromat in"D" state)

On 2023-09-19 13:23:44 -0500, Steve French wrote:
>On Tue, Sep 19, 2023 at 1:07 PM Tom Talpey <tom@talpey.com> wrote:
>> These changes are good, but I'm skeptical they will reduce the load
>> when the laundromat thread is actually running. All these do is avoid
>> creating it when not necessary, right?
>
>It does create half as many laundromat threads (we don't need
>laundromat on connection to IPC$) even for the Windows server target
>example, but helps more for cases where server doesn't support
>directory leases.

Perhaps the laundromat thread should be using msleep_interruptible()?

Using an interruptible sleep appears to prevent the thread from contributing
to the load average, and has the happy side-effect of removing the up-to-1s delay
when tearing down the tcon (since a7c01fa93ae, kthread_stop() will return
early triggered by kthread_stop).

~Paul

September 6, 2023

Redis Continually Receiving SIGTERM

I brought up a redis cluster — three servers which all logged basically nothing apart from the fact they were about to shut down. The service status showed as “Activating” — never started — and the server wasn’t doing anything useful.

The redis log reads:

2920940:signal-handler (1694019281) Received SIGTERM scheduling shutdown...
2921151:signal-handler (1694019374) Received SIGTERM scheduling shutdown...
2921518:signal-handler (1694019468) Received SIGTERM scheduling shutdown...
2921726:signal-handler (1694019561) Received SIGTERM scheduling shutdown...
2922133:signal-handler (1694019655) Received SIGTERM scheduling shutdown...
2922410:signal-handler (1694019748) Received SIGTERM scheduling shutdown...
2923173:signal-handler (1694019842) Received SIGTERM scheduling shutdown...
2923537:signal-handler (1694019935) Received SIGTERM scheduling shutdown...
2923747:signal-handler (1694020029) Received SIGTERM scheduling shutdown...
2924110:signal-handler (1694020122) Received SIGTERM scheduling shutdown...
2924319:signal-handler (1694020216) Received SIGTERM scheduling shutdown...
2924687:signal-handler (1694020309) Received SIGTERM scheduling shutdown...
2924900:signal-handler (1694020403) Received SIGTERM scheduling shutdown...
2925266:signal-handler (1694020496) Received SIGTERM scheduling shutdown...
2925467:signal-handler (1694020590) Received SIGTERM scheduling shutdown...

Turns out this is a hazard of copy/pasting a unit file from an older server — evidently redis cannot use a service type of “Forking” with systemd. To resolve the issue, edit /etc/systemd/system/redis.service and updating the type to “simple”. Use systemctl daemon-reload and then systemctl restart redis to launch redis with the new config … voila, I’ve got a cluster of three servers that are started and communicating.

September 1, 2023

Tableau: Workbooks and Views Created or Modified By a Specific Individual

I had a manager looking to locate a ‘something in Tableau’ that was created by a specific individual — in this case, it was a terminated employee so “just ask the person” was not a viable solution. I put together a query to list all workbooks owned by or modified by an individual:

SELECT w.id, w.name, w.description, w.owner_id, w.modified_by_user_id, owner_system_users.email AS owner_email, modified_system_users.email AS modifier_email
     FROM  public.workbooks AS w
      LEFT OUTER JOIN public.users AS owner_users on w.owner_id = owner_users.id
      LEFT OUTER JOIN public.users AS modified_users ON w.owner_id = modified_users.id
      LEFT OUTER JOIN public.system_users AS owner_system_users ON owner_system_users.id = owner_users.system_user_id
		LEFT OUTER JOIN public.system_users AS modified_system_users ON modified_system_users.id = modified_users.system_user_id
      WHERE owner_system_users.name = 'UserLogonID';
--      WHERE owner_system_users.email LIKE '%Smith%' OR modified_system_users.email = '%Smith%'
		;

As well as a query to identify all views owned by an individual:

SELECT views.*, owner_system_users.email AS owner_email
     FROM  public.views 
      LEFT OUTER JOIN public.users AS owner_users on views.owner_id = owner_users.id
      LEFT OUTER JOIN public.system_users AS owner_system_users ON owner_system_users.id = owner_users.system_user_id

      WHERE owner_system_users.name = 'UserLogonID';
--      WHERE owner_system_users.email LIKE '%Smith%' OR modified_system_users.email = '%Smith%'
		;

The email address based search is most reasonable — our email addresses are algorithmically based on our names, so we always know what the address would have been. Many contractors, however, don’t have Office 365 licenses or mailboxes … so I have to fall back to finding their logon ID in those cases.

August 23, 2023

Tableau – Data Source Connection Info and Workbooks

I think I finally have a query that links workbooks where data sources are used and the connection information from the data_connections table!

-- Query to find all data sources and where they are used
select system_users.email
, datasources.id, datasources.name, datasources.created_at, datasources.updated_at, datasources.db_class, datasources.db_name, datasources.site_id
, data_connections.server, data_connections.dbclass
, sites.name as SiteName, projects.name as ProjectName, workbooks.name as WorkbookName
from datasources
left outer join data_connections on data_connections.datasource_id = datasources.id
left outer join users on users.id = datasources.owner_id
left outer join system_users on users.system_user_id = system_users.id
left outer join sites on datasources.site_id = sites.id
left outer join projects on datasources.project_id = projects.id
left outer join workbooks on datasources.parent_workbook_id = workbooks.id
order by datasources.name
;

August 8, 2023

Verifying Connectivity From Locked Down Windows Desktop or Server

We frequently encounter individuals who cannot use something from their server or desktop — but their IT group has Windows locked down so they cannot just telnet to the destination on the port and check if it’s connecting. Windows doesn’t have a whole lot of useful tools of its own. Fortunately, I’ve found that nmap.org publishes a local install zip file for Windows.

Download latest Win32 zip file from https://nmap.org/dist/ — on 8/8/2023, that is https://nmap.org/dist/nmap-7.92-win32.zip

Extract the zip file contents somewhere (I use tmp, right in downloads works, whatever)
Open command prompt and change directory (cd) into the folder where nmap was extracted — e.g. cd /d c:\tmp\nmap-7.92

— A quick trick for opening a command prompt to a directory location: If you have a file explorer window open to the location, click into the header where the file path is shown and remove the text that appears there

Type cmd and hit enter

And voila — a command prompt opened to the same location you were viewing

In the command prompt, run an map command to test a specific port (-p) and host. Since some hosts do not return ICMP requests, I also include -P0 instructing nmap not to attempt pinging the host first. This example verifies we have connectivity to google.com on port 443 (https)

August 6, 2023

Linux: Disabling Wild Local DNS Server Thing (i.e. systemd-resolved)

I am certain there is some way to configure systemd-resolved to actually use internal DNS servers so you can resolve your local hostnames. But nothing I’ve tried have worked, and I don’t actually need this wild local DNS thing.

Here’s the problem — systemd-resolved creates an /etc/resolv.conf file that uses a localhost address as the nameserver — and that may very well forward requests out to Internet DNS servers. Which don’t have any clue about your internal DNS zones — thus you can no longer resolve local hostnames. Whenever I see 127.0.0.53 in /etc/resolv.conf, I know systemd-resolved is at work.

[lisa@linux ~]# cat /etc/resolv.conf
# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search example.com

To disable this local name resolution, stop and disable systemd-resolved, unlink the /etc/resolv.conf file it created, and restart NetworkManager

[lisa@linux ~]# systemctl stop systemd-resolved.service
[lisa@linux ~]# systemctl disable systemd-resolved.service
[lisa@linux ~]# unlink /etc/resolv.conf
[lisa@linux ~]# systemctl restart NetworkManager
Removed /etc/systemd/system/multi-user.target.wants/systemd-resolved.service.
Removed /etc/systemd/system/dbus-org.freedesktop.resolve1.service.

Voila, /etc/resolv.conf is now populated with reasonable internal DNS servers, and you can resolve local hostnames.

[lisa@linux ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search example.com
nameserver 10.1.2.33
nameserver 10.1.2.66

August 3, 2023

Cisco Aironet — Unable to Access Wireless Device That Moved to Different Access Point

I think this is a fairly esoteric issue — something that happens frequently enough but doesn’t actually impact functionality so anyone notices. We got a new WiFi doorbell that we set up inside (and could access it), took outside (and could access it) … but, when we went back inside? We could not access the doorbell. No HTTPS, no RTSP, no ICMP. Nothing.

Cisco access points maintain a list of associated wireless clients. These may also be kept in an arp table, although arp caching appears to be disabled by default. So device was on AP1, moved to AP2. Clients on AP2 (or AP3, or AP4) were able to access it since the switch has it registered on the port for AP2. Anything on AP1, however, cannot access the device. The MAC address still appeared in the associations table for AP1. You can set a lower activity timeout — the default was one day — to clear devices out more promptly. But … if the device communicates outside of its new WAP, how frequently is it going to be talking to a device on its old WAP? Generally, we’re talking to our servers (wired) or the Internet (also wired). So technically … Scott’s cell phone couldn’t reach my cell phone when I go from the bedroom to the office. But we never notice because we have minimal peer to peer communication. It’s not like doorbells are going to go walking about normally … but it was good to know a quick AP reboot would allow our cell phones to pull up the doorbell’s video feed.

July 31, 2023

SSH’ing to Older Cisco Access Points

Trying to ssh into our Cisco access points, we get an error saying “no matching key exchange method found. Their offer: diffie-hellman-group1-sha1” … to one-off enable older, deprecated algorithms, we added a cisco.conf to /etc/ssh/ssh_config.d (/etc/ssh/ssh_config includes /etc/ssh/ssh_config.d/*.conf)

Host <IP>
     Ciphers 3des-cbc,blowfish-cbc,aes128-cbc,aes128-ctr,aes256-ctr
     KeyAlgorithms diffie-hellman-group1-sha1

And restart sshd — voila*, you can SSH into the router / access point / etc.

* — you may get an invalid key length error. In this case, you need to regenerate the key on the Cisco device using a 2048-bit key:

config term
crypto key zeroize rsa
crypto key generate rsa modulus 2048
end

July 20, 2023

Cisco Catalyst 2960-S: Capturing All Traffic Sent Through a Port

We had an issue where an IOT device was not able to establish the connection it wanted — it would report it couldn’t connect to the Internet. I knew it could connect to the Internet in general; but, without knowing what tiny part of the Internet it used to determine ‘connected’ or ‘not connected’, we were stuck. Except! We recently upgraded the switch in our house to a Cisco Catalyst 2960S — which allows me to do one of the cool things I’d seen the network guys at work do but had never been able to reproduce at home: using SPAN (Switched Port ANalyzer). When we’d encounter strange behavior with a network device where we couldn’t just install Wireshark and get a network capture, the network group would basically clone all of the traffic sent to the device’s port to another switch port where we could capture traffic. They would send me a capture file, and it was just like having a Wireshark capture.

You can set up SPAN from the command line configuration, but I don’t have a username/password pair to log into SSH (and can only establish this from the command line configuration). Before breaking out the Cisco console cable, I tried running Cisco Network Assistant (unfortunately, a discontinued product line). One of the options under “Configure” => “Switching” is SPAN:

Since there was no existing SPAN session, I had to select a session number.

Then find the two ports — in the Ingress/Egress/Destination column, the port that is getting the traffic you want needs to either have Ingress (only incoming traffic), Egress (only outgoing traffic), or Both (all traffic). The port to which you want to clone the traffic is set to Destination. And the destination encapsulation is Replicate. Click apply.

In the example above, the laptop plugged in to GE1/0/24 gets all of the traffic traversing GE1/0/5 — running tshark -w /tmp/TheProblem.cap writes the packet capture to a file for later analysis. Caveat — the destination port is no longer “online” — it receives traffic but isn’t sending or receiving its own traffic … so make sure you aren’t using remote access to control the device!

To remove the SPAN, change the Ingress/Egress/Destination values back to “none”, change the destination encapsulation back to select one, and apply.

Since the source port is connected to one of our wireless access points, the network capture encompasses all wireless traffic through that access point.

And we were easily able to identify that this particular device uses the rule “I can ping 8.8.8.8” to determine if it is connected to the Internet. We were able to identify a firewall rule that prevented ICMP replies; allowing this traffic immediately allowed the devices to connect as expected.