Patching Notes

Timeline

Asha likes to submit her FedRAMP packet by about the 20th of the month, so try to finish patching before then.
Send the email about 1 week before patching starts.
Give a 15-minute warning in the Slack #xdr-patching, #xdr-content-aas, #xdr-soc, #xdr-engineering channels, etc. before patching.

Patching Process

Each month the AWS GovCloud(GC) TEST/PROD environments must be patched to comply with FedRAMP requirements. This wiki page outlines the process for patching the environment.

Send the email template below before patching, or create a calendar event for patching. The email addresses of the individuals who should get the invite are:

Leonard, Wesley A. <wesley.a.leonard@accenturefederal.com>; Waddle, Duane E. <duane.e.waddle@accenturefederal.com>; Nair, Asha A. <asha.a.nair@accenturefederal.com>; Crawley, Angelita <angelita.crawley@accenturefederal.com>; Rivas, Gregory A. <gregory.a.rivas@accenturefederal.com>; Damstra, Frederick T. <frederick.t.damstra@accenturefederal.com>; Poulton, Brad <brad.poulton@accenturefederal.com>; Kuykendall, Charles S. <charles.s.kuykendall@accenturefederal.com>; Williams, Colby <colby.williams@accenturefederal.com>; Naughton, Brandon <brandon.naughton@accenturefederal.com>; Cooper, Jeremy <jeremy.cooper@accenturefederal.com>; Jennings, Kendall <kendall.jennings@accenturefederal.com>; Lohmeyer, Dean <dean.lohmeyer@accenturefederal.com>; XDR-Patching <xdr.patching@accenturefederal.com>

SUBJECT: <INSERT MONTH> Patching

It is time for monthly patching again. Patching is going to occur during business hours within the next week or two.  Everything - including Customer LCPs - needs patching.  We will be doing the servers in 2 waves.
 
For real-time patching announcements, join the Slack #xdr-patching Channel. Announcements will be posted in that channel on what is going down and when.
 
Here is the proposed patching schedule:

Wednesday <INSERT MONTH> 11:
* Moose and Internal infrastructure
  * Patching
* CaaSP 
  * Patching
 
Thursday <INSERT MONTH> 12:
* Moose and Internal
  * Reboots
* All Customer LCP
  * Patching (AM)
  * Reboots (PM)
* CaaSP
  * Reboots

Monday <INSERT MONTH> 16:
* All Customer XDR Cloud
  * Patching
* All Search heads
  * Reboots (PM)

Tuesday <INSERT MONTH> 17:
* All Remaining XDR Cloud
  * Reboots (AM)
 
The customer and user impact will be during the reboots so they will be done in batches to reduce our total downtime.

Detailed Steps (Brad's patching)

HEY BRAD: READ ME!

Run the cmd below to deal with message: "This system is not registered with an entitlement server. You can use subscription-manager to register."

date; salt '*' state.sls os_modifications.rhel_deregistration --output-diff

It's safe to run on * and will remove any RHEL registration (or warnings about lack thereof) on systems that have a billing code.

Reminder - The legacy Reposerver was shut down in late February 2021, so consider it a suspect if you have issues.

Day 1 (Wednesday)

Patch GC TEST first! This helps find problems in TEST and potential problems in PROD. TEST is shut down to save on costs, so start it up first:

# To start up all of test run this command
xdrtest start --profile mdr-test-c2-gov
xdrtest start --profile mdr-test-modelclient-gov

# For just a single instance
xdrtest --profile mdr-test-c2-gov start salt-master

Post to Slack #xdr-patching Channel:

FYI, patching today. 
* This morning, patches to all internal systems, moose, and CaaSP. 
* No reboots, so impact should be minimal.

:warning: See if GitHub has any updates! Coordinate with Duane or Colby on GitHub Patching.

Step 1 of 1 (Day 1): Moose and C2 Infrastructure

Starting with Moose and Internal infra patching within GC TEST. Check disk space for potential issues. Return here to start on PROD after TEST is patched.

# Test connectivity between Salt Master and Minions
salt -C '* not ( afs* or nga* or doed* or dc-c19* or la-c19* or bas-* or ca-c19* or frtib* or dgi* or vmray* )' test.ping --out=txt

# Fred's update for df -h - checks for disk utilization at the 80-90% area
salt -C '* not ( afs* or nga* or doed* or dc-c19* or la-c19* or bas-* or ca-c19* or frtib* or dgi* or vmray* )' cmd.run 'df -h | egrep "[890][0-9]\%"'

# Review packages that will be updated. Some packages are versionlocked (Collectd, Splunk, Teleport, etc.).
salt -C '* not ( afs* or nga* or doed* or dc-c19* or la-c19* or bas-* or ca-c19* or frtib* or dgi* or vmray* )' cmd.run 'yum check-update'
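The egrep pattern above matches any two-digit percentage that starts with 8, 9, or 0, which is how it also catches 100%. As a minimal sketch of a more explicit alternative, the helper below reads `df -P` output on stdin and prints mount points at or above a threshold. The name `high_usage` and its threshold argument are assumptions for illustration, not part of the existing tooling.

```shell
# Hypothetical helper: print "<mount> <use%>" for filesystems at or above a threshold.
# Reads POSIX-format `df -P` output on stdin; the threshold defaults to 80.
high_usage() {
  awk -v limit="${1:-80}" '
    NR > 1 {                 # skip the header line
      pct = $5
      sub(/%/, "", pct)      # "85%" -> "85"
      if (pct + 0 >= limit) print $6, $5
    }'
}

# Example (run locally on a host):
# df -P | high_usage 80
```

Running this through Salt would mean nesting it inside cmd.run with extra quoting, so the egrep one-liner probably remains the simpler choice for the fleet-wide check.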

Also, the `phantom_repo` pkg wants to upgrade, but we are not ready. Let's exclude that.

date; salt -C '* not ( afs* or nga* or doed* or dc-c19* or la-c19* or bas-* or ca-c19* or frtib* or dgi* or vmray* or phantom-0* )' pkg.upgrade

# update phantom, but exclude the phantom repo. 
date; salt -C 'phantom-0*' pkg.upgrade disablerepo='["phantom-base",]'
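The long exclusion list is retyped in most commands on this page, which makes it easy to drop a customer by accident. A minimal sketch of building the compound target once and reusing it is below. The `build_target` function and the `CUSTOMERS` and `TARGET` variable names are assumptions for illustration.

```shell
# Hypothetical helper: build a Salt compound target of the form
#   <base> not ( pat1 or pat2 ... )
# The body runs in a subshell, so its variables do not leak into the caller.
build_target() (
  base="$1"
  shift
  excludes=""
  for pat in "$@"; do
    if [ -z "$excludes" ]; then
      excludes="$pat"
    else
      excludes="$excludes or $pat"
    fi
  done
  if [ -z "$excludes" ]; then
    printf '%s\n' "$base"
  else
    printf '%s not ( %s )\n' "$base" "$excludes"
  fi
)

# Example: the same exclusions as the pkg.upgrade target above.
CUSTOMERS='afs* nga* doed* dc-c19* la-c19* bas-* ca-c19* frtib* dgi*'
set -f    # disable globbing so the patterns stay literal when word-split
TARGET="$(build_target '*' $CUSTOMERS 'vmray*' 'phantom-0*')"
set +f
echo "$TARGET"
# salt -C "$TARGET" test.ping --out=txt
```

Keeping the customer prefixes in one variable means a new customer only has to be added in one place.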

Now patch vmray, that dirty Ubuntu server.

From the docs, their recommended patching is:

# Test connectivity between Salt Master and Minions
salt vmray* test.ping

# Stop Service
salt vmray* cmd.run 'systemctl stop vmray-server vmray-worker'

# Review packages that will be updated. Some packages are versionlocked (Collectd, Splunk, Teleport, etc.).
salt vmray* cmd.run 'apt list --upgradable'

# Update and Upgrade
date; salt vmray\* pkg.upgrade

#Or using the built-in package manager
date; salt vmray* cmd.run 'apt update && apt full-upgrade -y && apt autoremove -y'

3. Optional: /opt/vmray/bin/control_modules reload to reload the kernel module (only for the Worker).

# Start Service
4. systemctl start vmray-server vmray-worker
salt vmray* cmd.run 'systemctl start vmray-server vmray-worker'

5. Reboot the Server (later? or now?) wait until all servers get rebooted.

What about threatq? Ask Duane! It needs special handling.

Run it again to make sure nothing got missed.

salt -C '* not ( afs* or nga* or doed* or dc-c19* or la-c19* or bas-* or ca-c19* or frtib* or dgi* or vmray* or phantom-0* )' pkg.upgrade

:warning: After upgrades check on Portal to make sure it is still up.

Prod: Portal
Test: Portal

If Portal is down, start by restarting the Docker service. My guess is patching is messing with the network stack and the Docker service doesn't like that.

date; salt 'customer-portal*' cmd.run 'systemctl restart docker'

Portal Notes are here for further Troubleshooting if necessary: Portal Notes

Patch CaaSP

See Patch CaaSP instructions

Troubleshooting

Phantom error

phantom.msoc.defpoint.local:
    ERROR: Problem encountered upgrading packages. Additional info follows:

    changes:
        ----------
    result:
        ----------
        pid:
            40718
        retcode:
            1
        stderr:
            Running scope as unit run-40718.scope.
            Error in PREIN scriptlet in rpm package phantom_repo-4.9.39220-1.x86_64
            phantom_repo-4.9.37880-1.x86_64 was supposed to be removed but is not!
        stdout:
            Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
            Logging to /var/log/phantom/phantom_install_log
            error: %pre(phantom_repo-4.9.39220-1.x86_64) scriptlet failed, exit status 7

Error:

error: unpacking of archive failed on file /usr/lib/python2.7/site-packages/urllib3/packages/ssl_match_hostname: cpio: rename failed

salt dc-c19* cmd.run 'pip uninstall urllib3 -y'

This error is caused by the versionlock on the package. Use this to view the list

yum versionlock list
Error: Package: salt-minion-2018.3.4-1.el7.noarch (@salt-2018.3)
                       Requires: salt = 2018.3.4-1.el7
                       Removing: salt-2018.3.4-1.el7.noarch (@salt-2018.3)
                           salt = 2018.3.4-1.el7
                       Updated By: salt-2018.3.5-1.el7.noarch (salt-2018.3)
                           salt = 2018.3.5-1.el7

Error: installing package `kernel-3.10.0-1062.12.1.el7.x86_64` needs 7MB on the /boot filesystem

# Install yum utils 
yum install yum-utils

# Package-cleanup set count as how many old kernels you want left 
package-cleanup --oldkernels --count=1 -y

ISSUE: Salt-minion doesn't come back and has this error

/usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh: line 16: /lib/modules/3.10.0-957.21.3.el7.x86_64///lib/modules/3.10.0-957.21.3.el7.x86_64/kernel/sound/drivers/mpu401/snd-mpu401.ko.xz: No such file or directory

RESOLUTION: Manually reboot the OS; this is most likely due to a kernel upgrade.

Day 2 (Thursday)

Step 1 of 4 (Day 2): Reboot Internals

Long Day of Rebooting!

Post to Slack #xdr-patching Channel:

FYI, patching today. Rebooting TEST 
* In about 15 minutes: Reboots of moose, internal systems and CaaSP.
* Following that, patching (but not rebooting) of all customer PoPs/LCPs.
* Then this afternoon, reboots of those PoPs/LCPs.

Be sure to select ALL entities in Sensu for silencing not just the first 25. Sensu -> Entities -> Sort (name) -> Select Entity and Silence. This will silence both keepalive and other checks. Some silenced events will not unsilence and will need to be manually unsilenced. IDEA! restart the sensu server and the vault-3 server first. This helps with the clearing of the silenced entities.

GovCloud (TEST)

SSH via TSH into GC Salt-Master to reboot servers in GC that are on gc-dev.

# Login to Teleport
tsh --proxy=teleport.xdrtest.accenturefederalcyber.com login

# SSH to GC Salt-Master (TEST)
tshd salt-master

Start with Sensu and Vault

# Vault-3 and Sensu
salt -C 'vault-3* or sensu*' test.ping --out=txt
date; salt -C 'vault-3* or sensu*' system.reboot --async
watch "salt -C 'vault-3* or sensu*' test.ping --out=txt"

Reboot majority of servers in GC Test.

salt -C '*com not ( modelclient-splunk-idx* or moose-splunk-idx* or resolver* or sensu* or vmray-* or vault-3* or rhsso-0* )' test.ping --out=txt
date; salt -C '*com not ( modelclient-splunk-idx* or moose-splunk-idx* or resolver* or sensu* or vmray-* or vault-3* or rhsso-0* )' system.reboot --async

:warning:

You will lose connectivity to Teleport and Salt Master

Log back in and verify they are back up

watch "salt -C '*com not ( modelclient-splunk-idx* or moose-splunk-idx* or resolver* or sensu* or vmray-* or vault-3* or rhsso-0* )' cmd.run 'uptime' --out=txt"

Take care of the govcloud Resolvers one at a time. The vmray can be combined with one of the govcloud ones.

salt -C 'resolver-govcloud.pvt.*com or resolver-vmray-*.pvt.*com' test.ping --out=txt
date; salt -C 'resolver-govcloud.pvt.*com or resolver-vmray-*.pvt.*com' system.reboot --async
watch "salt -C 'resolver-govcloud.pvt.*com or resolver-vmray-*.pvt.*com' test.ping --out=txt"

salt -C 'resolver-govcloud-2.pvt.*com' test.ping --out=txt
date; salt -C 'resolver-govcloud-2.pvt.*com' system.reboot --async
watch "salt -C 'resolver-govcloud-2.pvt.*com' test.ping --out=txt"

Check uptime on the minions in GC to make sure you didn't miss any.

salt -C  '*com not ( modelclient-splunk-idx* or moose-splunk-idx* or threatq-* or vmray-server* )' cmd.run 'uptime | grep days'
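Grepping for "days" misses a host that has been up for exactly "1 day". As a minimal sketch, the filter below prints only the minion IDs that have been up for a day or more. It assumes the `--out=txt` format of one `minion-id: output` line per minion, and the name `not_rebooted` is hypothetical.

```shell
# Hypothetical filter: print minion IDs whose uptime is one day or more.
# Expects lines such as:
#   host.example.com:  10:02:11 up 12 days,  3:04,  1 user,  load average: ...
not_rebooted() {
  awk -F': ' '/ up [0-9]+ days?,/ { print $1 }'
}

# Example:
# salt -C '*com not ( modelclient-splunk-idx* or moose-splunk-idx* )' \
#   cmd.run 'uptime' --out=txt | not_rebooted
```

An empty result means every targeted minion that responded has rebooted within the last day.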

Duane Section (feel free to bypass)

-- I (Duane) did this a little different. Salt-master first, then everything but resolvers. Resolvers reboot one at a time.

salt -C '* not ( afs* or nga* or dc-c19* or la-c19* or qcomp* or salt-master* or moose-splunk-indexer-* or resolver* )' cmd.run 'shutdown -r now'

Reboot CaaSP

See Day 2 notes in Patch CaaSP instructions

GovCloud (PROD)

Post to Slack #xdr-patching Channel and #xdr-soc Channel:

FYI, patching today. Rebooting PROD 
* In about 15 minutes: Reboots of moose, internal systems and CaaSP, including the VPN.
* Following that, patching (but not rebooting) of all customer PoPs/LCPs.
* Then this afternoon, reboots of those PoPs/LCPs.

SSH via TSH into GC Salt-Master to reboot servers in GC that are on GC Prod.

# Login to Teleport
tsh --proxy=teleport.xdr.accenturefederalcyber.com login

# SSH to GC Salt-Master (PROD)
tsh ssh node=salt-master

:warning: Don't forget to silence Sensu! Be sure to post the Jurassic Park "Hold on to your butts" meme into the xdr-soc channel before restarting Prod. /giphy hold on to your butts

Start with Vault and Sensu

# Vault-1 and Sensu
salt -C 'vault-1*com or sensu*com' test.ping --out=txt
date; salt -C 'vault-1*com or sensu*com' system.reboot --async
watch "salt -C 'vault-1*com or sensu*com' test.ping --out=txt"

Reboot majority of servers.

salt -C  '*com not ( afs* or nga* or doed* or dc-c19* or la-c19* or dgi-* or moose-splunk-idx* or modelclient-splunk-idx* or bas-* or frtib* or ca-c19* or resolver* or vault-1*com or sensu*com or vmray-* )' test.ping --out=txt

date; salt -C  '*com not ( afs* or nga* or doed* or dc-c19* or la-c19* or dgi-* or moose-splunk-idx* or modelclient-splunk-idx* or bas-* or frtib* or ca-c19* or resolver* or vault-1*com or sensu*com or vmray-* )' system.reboot --async

:warning:

You will lose connectivity to Salt master

Log back in and verify they are back up

watch "salt -C  '*accenturefederalcyber.com not ( afs* or nga* or doed* or dc-c19* or la-c19* or dgi-* or moose-splunk-idx* or modelclient-splunk-idx* or bas-* or frtib* or ca-c19* or resolver* or vault-1*com or sensu*com or vmray-* )' cmd.run 'uptime' --out=txt"

Vault Service likes to crap out after reboot; verify the service is back up

Borrowed this from Vault Upgrade instructions

# Check the status
salt vault* cmd.run cmd='VAULT_SKIP_VERIFY=1 VAULT_ADDR=https://127.0.0.1 vault status'

# If you see "connection refused", the Vault service is not running
salt vault* cmd.run 'systemctl start vault'

# Check the status
salt vault* cmd.run cmd='VAULT_SKIP_VERIFY=1 VAULT_ADDR=https://127.0.0.1 vault status'

vault-1.pvt.xdr.accenturefederalcyber.com:
    Key                      Value
    ---                      -----
    Recovery Seal Type       shamir
    Initialized              true
    Sealed                   false
    Total Recovery Shares    5
    Threshold                2
    Version                  1.9.3
    Storage Type             dynamodb
    Cluster Name             vault-cluster-b6aa0cd0
    Cluster ID               d0d778a9-b123-4a6a-7712-0b99d54f8a00
    HA Enabled               true
    HA Cluster               https://10.40.0.204:443
    HA Mode                  standby
    Active Node Address      https://vault.pvt.xdr.accenturefederalcyber.com

Verify the UI is up Vault Prod

Take care of the resolvers one at a time and with the GC Prod Salt Master. Reboot one of each at the same time.

salt -C 'resolver-govcloud.pvt.*com or resolver-vmray-*.pvt.*com' test.ping --out=txt
date; salt -C 'resolver-govcloud.pvt.*com or resolver-vmray-*.pvt.*com' system.reboot --async
watch "salt -C 'resolver-govcloud.pvt.*com or resolver-vmray-*.pvt.*com' test.ping --out=txt"

salt -C 'resolver-govcloud-2.pvt.*com' test.ping --out=txt
date; salt -C 'resolver-govcloud-2.pvt.*com' system.reboot --async
watch "salt -C 'resolver-govcloud-2.pvt.*com' test.ping --out=txt"

Take care of the vmray-worker server separately due to taking a very long time (20-30 minutes) to reboot.

salt -C 'vmray-worker*com' test.ping --out=txt
date; salt -C 'vmray-worker*com' system.reboot --async
watch "salt -C 'vmray-worker*com' test.ping --out=txt"

Check uptime on the minions in GC Prod to make sure you didn't miss any.

salt -C  '*accenturefederalcyber.com not ( afs* or nga* or doed* or dc-c19* or la-c19* or dgi-* or moose-splunk-idx* or modelclient-splunk-idx* or bas-* or frtib* or ca-c19* )' cmd.run 'uptime | grep days'

Verify Portal is up: Portal

Look in Sensu for any silent alerts.

Step 2 of 4 (Day 2): Reboot Moose and Modelclient

GovCloud (TEST)
NOTE: indexer hostnames have changed!

salt 'moose-splunk-idx*' test.ping --out=txt

# First Indexer
salt moose-splunk-idx-7a4.pvt.xdrtest.accenturefederalcyber.com test.ping --out=txt
date; salt moose-splunk-idx-7a4.pvt.xdrtest.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
watch "salt moose-splunk-idx-7a4.pvt.xdrtest.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

salt 'modelclient-splunk-idx*' test.ping --out=txt

# First Indexer
salt 'modelclient-splunk-idx-326.pvt.xdrtest.accenturefederalcyber.com' test.ping --out=txt
date; salt modelclient-splunk-idx-326.pvt.xdrtest.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
watch "salt modelclient-splunk-idx-326.pvt.xdrtest.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

WAIT FOR SPLUNK CLUSTER TO HAVE 3 CHECKMARKS

Repeat the above patching steps for the additional indexers, waiting for 3 green checks in between each one.

# Second Moose indexer
salt moose-splunk-idx-3b9.pvt.xdrtest.accenturefederalcyber.com test.ping --out=txt
date; salt moose-splunk-idx-3b9.pvt.xdrtest.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
watch "salt moose-splunk-idx-3b9.pvt.xdrtest.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

# Second Modelclient indexer
salt 'modelclient-splunk-idx-129.pvt.xdrtest.accenturefederalcyber.com' test.ping --out=txt
date; salt modelclient-splunk-idx-129.pvt.xdrtest.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
watch "salt modelclient-splunk-idx-129.pvt.xdrtest.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

# Third Moose indexer
#salt moose-splunk-idx-568.pvt.xdrtest.accenturefederalcyber.com test.ping --out=txt
#date; salt moose-splunk-idx-568.pvt.xdrtest.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
#watch "salt moose-splunk-idx-568.pvt.xdrtest.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

# Third Modelclient indexer
salt 'modelclient-splunk-idx-8b8.pvt.xdrtest.accenturefederalcyber.com' test.ping --out=txt
date; salt modelclient-splunk-idx-8b8.pvt.xdrtest.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
watch "salt modelclient-splunk-idx-8b8.pvt.xdrtest.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

# Verify all indexers on Moose and Modelclient have been patched:
salt -C 'moose-splunk-idx* or modelclient-splunk-idx*' cmd.run 'uptime | grep days'

GovCloud (PROD)

salt -C 'moose-splunk-idx*' test.ping --out=txt

# First indexer
salt moose-splunk-idx-4e1.pvt.xdr.accenturefederalcyber.com test.ping --out=txt
date; salt moose-splunk-idx-4e1.pvt.xdr.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
watch "salt moose-splunk-idx-4e1.pvt.xdr.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

WAIT FOR SPLUNK CLUSTER TO HAVE 3 CHECKMARKS

Repeat the above patching steps for the additional indexers, waiting for 3 green checks in between each one.

# Second indexer
salt moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com test.ping --out=txt
date; salt moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
watch "salt moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

# Third indexer
salt moose-splunk-idx-1e7.pvt.xdr.accenturefederalcyber.com test.ping --out=txt
date; salt moose-splunk-idx-1e7.pvt.xdr.accenturefederalcyber.com system.reboot --async

# Indexers take a while to restart
watch "salt moose-splunk-idx-1e7.pvt.xdr.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

# Verify all indexers rebooted:
salt 'moose-splunk-idx*' cmd.run 'uptime | grep days'

# Verify Splunk is active on all indexers
salt 'moose-splunk-idx*' cmd.run 'systemctl status splunk | grep Active'
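The one-at-a-time indexer reboots above could be wrapped in a small loop that pauses between hosts. This is only a sketch: `rolling_reboot` is a hypothetical name, the pause relies on the operator checking the cluster by hand, and the `|| return 1` guards only help if the salt CLI reports failure through its exit status, which is an assumption.

```shell
# Hypothetical wrapper: reboot the given minions one at a time.
# After each reboot it waits for the operator to press Enter, which is the
# point to confirm the cluster shows 3 green checkmarks before moving on.
rolling_reboot() {
  for host in "$@"; do
    salt "$host" test.ping --out=txt || return 1
    date
    salt "$host" system.reboot --async || return 1
    printf 'Rebooted %s. Press Enter once the cluster has 3 green checkmarks: ' "$host"
    read -r answer || return 1
  done
}

# Example (PROD Moose indexers, in the order used above):
# rolling_reboot \
#   moose-splunk-idx-4e1.pvt.xdr.accenturefederalcyber.com \
#   moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com \
#   moose-splunk-idx-1e7.pvt.xdr.accenturefederalcyber.com
```

The manual pause is deliberate: the checkmark state lives in the Splunk UI, so the loop does not try to detect it.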

Troubleshooting

If there are no green checkmarks after restarting the index cluster

:warning: ONLY USE THESE STEPS IF SPLUNK IS CONFIGURED WITH A SEARCH FACTOR THAT IS NOT EQUAL TO THE REPLICATION FACTOR :warning:

Log into the cluster's manager, click on Settings -> Indexer clustering, click on the Indexes tab, then the Bucket Status button. Click on the Fixup Tasks - Pending tab, then the Generation button. Any bucket listed here with the Current Status of "does not meet: primacy & sf & rf" has to be deleted from the indexer with that copy of the bucket. To get the indexer, click on the Action link, then View Bucket Details. These buckets should have only one peer associated with them.

Unfortunately, this view does not provide you with the bucket name as it exists on the file system. You will need to replace the ~ (tilde) characters with _ (underscore) and find the bucket name in the file system, usually in /opt/splunkdata/hot/normal_primary/$indexname/ though some indices are in high_primary. In addition, the bucket may be in the index's db/ or colddb/ directory. Ensure the Splunk bucket in question is in the AWS S3 frozen bucket then remove it and restart Splunk on that instance. It is better to do this in batches so you only have to restart the indexer once.
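The tilde-to-underscore conversion and the directory search can be sketched as two small helpers. Both function names are hypothetical, and the form of the identifier shown in the UI is an assumption. The sketch only swaps the characters and then looks for directories whose name contains the result.

```shell
# Hypothetical helper: convert the identifier shown in the UI (tildes)
# to the form used in on-disk bucket names (underscores).
bucket_fs_pattern() {
  printf '%s\n' "$1" | tr '~' '_'
}

# Hypothetical helper: search a directory tree for bucket directories whose
# name contains the converted identifier.
#   $1 = directory to search, $2 = identifier as shown in the UI
find_bucket() {
  find "$1" -type d -name "*$(bucket_fs_pattern "$2")*"
}

# Example:
# find_bucket /opt/splunkdata/hot/normal_primary/<indexname> '<identifier from the UI>'
```

Searching from the index directory covers both db/ and colddb/ in one pass.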

If the bucket is not in S3, check the contents of the bucket on disk. If it has only rawdata/journal.gz or rawdata/{deleted,journal.gz} then it is safe to delete.

Helpful bash function for checking S3:

### Change the customer identifier
function awsls {
  /usr/local/bin/aws s3 ls --region us-gov-east-1 s3://xdr-<CUSTOMER>-prod-splunk-frozen/$1/frozendb/$2
}
### Usage: awsls <index> <bucket>
### Example: awsls _internaldb db_1639812753_1639799383_1639_FAE2A88B-E6D9-47D0-8F8C-4D7DA9B72531

:warning: END SF != RF WARNING :warning:

Is Splunk configured with the correct S3 URL?

Check the path value in the index volume config (either in Salt or on the cluster manager) and ensure it matches the bucket associated with the customer.

Example indexes.conf entry under [volume:smartstore]:

path = s3://xdr-doed-prod-splunk-smartstore/

The pattern is xdr-<CUSTOMER>-<PROD|TEST>-splunk-smartstore.
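A minimal sketch of checking a config file against that naming pattern is below. Both function names are hypothetical. It assumes the environment part is lowercase, as in the example, and that the first `path =` line in the file belongs to the smartstore volume.

```shell
# Hypothetical helper: expected SmartStore path for a customer and environment,
# following the xdr-<CUSTOMER>-<PROD|TEST>-splunk-smartstore pattern.
smartstore_path() {
  printf 's3://xdr-%s-%s-splunk-smartstore/\n' "$1" "$2"
}

# Hypothetical check: compare the first "path = ..." line in an
# indexes.conf-style file against the expected value.
#   $1 = config file, $2 = customer, $3 = environment (prod or test)
# Returns 0 on a match and 1 otherwise.
check_smartstore_path() {
  actual="$(sed -n 's/^[[:space:]]*path[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1)"
  [ "$actual" = "$(smartstore_path "$2" "$3")" ]
}

# Example:
# check_smartstore_path /path/to/indexes.conf doed prod || echo "path mismatch"
```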

Test one or more of the indexers to ensure they can communicate with the S3 bucket specified in the volume path

/opt/splunk/bin/splunk cmd splunkd rfs ls volume:smartstore

If the indexer/checkmarks don't come back ( legacy information )

If an indexer is not coming back up, look at the screenshot in AWS for this: "Probing EDD (edd=off to disable)... ok". Then look at the system log in AWS for this: "Please enter passphrase for disk splunkhot!:"

IF/WHEN an Indexer doesn't come back up follow these steps:

- In the AWS console, grab the instance ID. 
- Run the MDR/get-console.sh (Duane's script for pulling the system log)
- Look for "Please enter passphrase for disk splunkhot"

In AWS console stop instance (which will remove ephemeral splunk data) then start it. Then ensure the /opt/splunkdata/hot exists.

salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h'

IF the MOUNT for /opt/splunkdata/hot DOESN'T EXIST, STOP SPLUNK! Splunk will write to the wrong volume. Before mounting the new volume, clear out the wrong /opt/splunkdata/

rm -rf /opt/splunkdata/hot/*

salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl stop splunk'

Ensure the /opt/splunkdata doesn't already exist, before the boothook.

ssh prod-moose-splunk-indexer-1

If it doesn't then manually run the cloudinit boothook.

sh /var/lib/cloud/instance/boothooks/part-002
salt -C 'nga-splunk-indexer-2.msoc.defpoint.local' cmd.run 'sh /var/lib/cloud/instance/boothooks/part-002'

Ensure the hot directory is owned by splunk:splunk

ll /opt/splunkdata/
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'ls -larth /opt/splunkdata'
chown -R splunk: /opt/splunkdata/
salt -C '' cmd.run 'chown -R splunk: /opt/splunkdata/'

It will be waiting for the luks.key

systemctl daemon-reload
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl daemon-reload'
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl restart systemd-cryptsetup@splunkhot'
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl | egrep cryptset'

It is waiting for a command prompt; when you restart the service, it picks up the key from a file. Systemd sees the crypt setup service as a dependency for the Splunk service.

Look for "Cryptography Setup for splunkhot" in the output. This is good; it means it is ready for a restart of Splunk.

systemctl restart splunk
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl restart splunk'

Once the /opt/splunkdata/hot is visible in df -h and the splunk service is started, then wait for the cluster to have 3 green checkmarks.

Check the servers again to ensure all of them have rebooted.

salt -C 'moose-splunk-idx*' cmd.run 'uptime' --out=txt | sort

Ensure all Moose and Internal have been rebooted

salt -C '* not ( afs* or bas-* or ca-c19* or dc-c19* or dgi-* or doed* or frtib-* or la-c19* or nga* )' cmd.run uptime

Day 2 (Thursday), Step 3 of 4, Patching LCPs

salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' test.ping --out=txt
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'yum check-update'
salt -C 'bas* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'yum check-update --disablerepo=splunk-8.2'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'uptime'

# Fred's update for df -h:
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h | egrep "[890][0-9]\%"'

# Updates
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade

# If a repo gives an error, you may need to disable it.
# salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade disablerepo=msoc-repo # Optional for fix
# salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade disablerepo=splunk-8.2 # Optional for fix
# on 2020-07-23: salt -C 'nga-splunk-ds-1 or afs-splunk-ds-1 or afs-splunk-ds-2' pkg.upgrade disablerepo=splunk-7.0 # Optional for fix

Troubleshooting

Error on afs-splunk-ds-3: error: cannot open Packages database in /var/lib/rpm

Solution:

mkdir /root/backups.rpm/
cp -avr /var/lib/rpm/ /root/backups.rpm/
rm -f /var/lib/rpm/__db*
db_verify /var/lib/rpm/Packages
rpm --rebuilddb
yum clean all

Error on `*-ds`: Could not resolve 'reposerver.msoc.defpoint.local/splunk/7.0/repodata/repomd.xml'

Reason: POP Nodes shouldn't be using the .local DNS address.

Solution: Needs a permanent fix. For now, patch with the repo disabled:

salt -C '*-ds* not afs-splunk-ds-4' pkg.upgrade disablerepo=splunk-7.0

Day 2 (Thursday), Step 4 of 4 (afternoon), Reboots LCPs

Post to Slack #xdr-patching Channel

Resuming today's patching with the reboots of customer LCPs.

:warning: Remember to silence Sensu alerts before restarting servers.

NOTE: Restart LCPs one server at a time at a location in order to minimize risk of concurrent outages.

First syslog servers

Restart the first syslog server by itself to check for reboot issues. This will also grab a few FRTIB LCPs (10, 11, 12, 15 & 16)

salt -C '*syslog-1* not *.local' cmd.run 'uptime && hostname'
date; salt -C '*syslog-1* not *.local' system.reboot --async

#Look for /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid
watch "salt -C '*syslog-1* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"

#Ensure Splunk is running
salt -C '*syslog-1* not *.local' cmd.run '/opt/splunk/bin/splunk status'

Second syslog servers

WAIT! see commands below for a faster approach!

salt -C '*syslog-2* not *.local' cmd.run 'uptime && hostname'
date; salt -C '*syslog-2* not *.local' system.reboot --async
watch "salt -C '*syslog-2* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"

Remaining Syslog Servers

(We might be able to reboot some of these at the same time if they are in different locations. Check the location grain on them with grains.item location.)

afs-splunk-syslog-8: {u'location': u'az-east-us-2'}
afs-splunk-syslog-7: {u'location': u'az-east-us-2'}
afs-splunk-syslog-4: {u'location': u'San Antonio'}

# Location Grain
salt -C '*-splunk-syslog*' grains.item location
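To spot servers that must not be rebooted together, the grain output can be grouped by location. The sketch below assumes lines in exactly the format shown above, including the `u'...'` prefixes, which newer Salt versions may not print. The name `shared_locations` is hypothetical.

```shell
# Hypothetical helper: list locations that host more than one minion.
# Reads lines in the format shown above, e.g.
#   afs-splunk-syslog-8: {u'location': u'az-east-us-2'}
shared_locations() {
  sed -n "s/^\([^:]*\): {u'location': u'\(.*\)'}\$/\2|\1/p" |
    sort |
    awk -F'|' '
      { count[$1]++; hosts[$1] = (hosts[$1] == "" ? $2 : hosts[$1] " " $2) }
      END { for (loc in count) if (count[loc] > 1) print loc ": " hosts[loc] }' |
    sort
}

# Example: paste or pipe the grain output into the function.
# shared_locations < grains-output.txt
```

Minions listed on the same output line share a location, so they belong in separate reboot batches.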

# Even numbered LCPs 
salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' cmd.run 'uptime'

date; salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' system.reboot --async

watch "salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' test.ping --out=txt"

salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'

# Ensure Splunk is running
salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' cmd.run '/opt/splunk/bin/splunk status'

# Odd numbered LCPs
salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7 or *splunk-syslog-9' cmd.run 'uptime'

date; salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7 or *splunk-syslog-9' system.reboot --async

watch "salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7 or *splunk-syslog-9' test.ping --out=txt"

salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7 or *splunk-syslog-9' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'

salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7 or *splunk-syslog-9' cmd.run '/opt/splunk/bin/splunk status'

# Ensure Splunk is running in addition to syslog-ng
salt '*splunk-syslog*' cmd.run 'systemctl status splunk | grep Active'

# Did you miss some?
salt '*splunk-syslog*' cmd.run 'uptime | grep day'

Troubleshooting

Possible issue: If syslog-ng doesn't start, it might need the setenforce 0 command run ( left here for legacy reasons )

:warning: 2020-06-11 - had to do this for afs-syslog-5 through 8

salt afs-splunk-syslog-1 cmd.run 'setenforce 0'
salt afs-splunk-syslog-1 cmd.run 'systemctl stop rsyslog'
salt afs-splunk-syslog-1 cmd.run 'systemctl start syslog-ng'

watch "salt -C '*syslog-1* not *.local' test.ping"

If the syslog-ng service doesn't start, check the syslog-ng file for oms agent added configurations.

Possible issue: NGA LCP node hostnames change after reboot, and the Sensu agent name changes with them.

salt 'nga-splunk-ds-1' cmd.run 'hostnamectl set-hostname aws-splnks1-tts.nga.gov'
salt 'nga-splunk-ds-1' cmd.run 'hostnamectl status'
salt 'nga-splunk-ds-1' cmd.run 'systemctl stop sensu-agent'
salt 'nga-splunk-ds-1' cmd.run 'systemctl start sensu-agent'

Repeat for other LCP nodes

Verify logs are flowing

AFS Splunk Search Head - Access here to check logs on afssplhf103.us.accenturefederal.com

# index=* source=/opt/syslog-ng/* host=afs* earliest=-15m | stats count by host
# New search string
| tstats count WHERE index=* source=/opt/syslog-ng/* host=afs* earliest=-15m latest=now BY host

# another check to ensure logs are flowing 
index=network sourcetype="pan:traffic" earliest=-6h latest=now

Should see at least 5 hosts

NGA Splunk Search Head - Access here to check log on aws-syslog1-tts.nga.gov

index=network sourcetype="citrix:netscaler:syslog" earliest=-15m latest=now
index=zscaler sourcetype="zscaler:web" earliest=-15m latest=now

NOTE: NGA sourcetype="zscaler:web" logs are handled by fluentd and can lag behind by 10 minutes.

#index=* source=/opt/syslog-ng/* host=aws* earliest=-60m | stats count by host
#New search string
| tstats count WHERE index=* source=/opt/syslog-ng/* host=aws* earliest=-60m latest=now BY host

FRTIB Splunk Search Head - Access here to check logs

# Check to see if logs are coming from the syslog nodes. 
| tstats count WHERE index=* source=/opt/syslog-ng/* host=cvg* OR host=*siema*alight* OR host=rfrbxlvspk* earliest=-15m latest=now BY host

POP DS (could these be restarted at the same time? Or in 2 batches?)

Don't forget DS-4

# Try reboot at the same time
salt '*splunk*ds*' cmd.run 'uptime'
date; salt '*splunk*ds*' system.reboot --async
watch "salt '*splunk*ds*' test.ping --out=txt"
salt '*splunk-ds*' cmd.run 'systemctl status splunk | grep Active'

Did you get all of them?

salt -C ' * not *local not *.pvt.xdr.accenturefederalcyber.com' cmd.run uptime
salt -C ' * not *local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'uptime | grep day'

Don't forget to un-silence Sensu.

Day 3 (Monday)

Step 1 of 2, Customer Slices Patching

Shorter day of Patching! :-) Don't forget to patch and reboot in Test environment.

Post to Slack #xdr-patching Channel:

Today's patching is all XDR customer environments. Indexers and search heads will be patched this morning. Search heads will be rebooted this afternoon, and the indexers will be rebooted tomorrow. Thank you for your cooperation.

Run these commands on the GC Prod Salt Master. These commands should patch all Splunk instances.
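
The commands below repeat the same long compound target on every line. One possible way to cut down on copy/paste mistakes is to build it once and keep it in a variable. This is only a sketch: `compound_or` and `CUSTOMER_TARGET` are made-up names, and the pattern list is copied from the commands in this section.

```shell
# Join glob patterns with " or " to form a Salt compound target string.
compound_or() {
    target=""
    for pattern in "$@"; do
        if [ -z "$target" ]; then
            target="$pattern"
        else
            target="$target or $pattern"
        fi
    done
    printf '%s\n' "$target"
}

# Same customer slices as in the commands in this section.
CUSTOMER_TARGET=$(compound_or 'afs*local' 'afs*com' 'bas-*com' 'ca-c19*com' \
    'dc*com' 'dgi*com' 'doed*com' 'frtib*com' 'la-*com' 'nga*com' 'nga*local')

# Example (run on the Salt Master):
#   salt -C "$CUSTOMER_TARGET" test.ping --out=txt
```

Keep the variable quoted when passing it to `salt -C` so the shell does not expand the globs.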

salt -C 'afs*local or afs*com or bas-*com or ca-c19*com or dc*com or dgi*com or doed*com or frtib*com or la-*com or nga*com or nga*local' test.ping --out=txt

salt -C 'afs*local or afs*com or bas-*com or ca-c19*com or dc*com or dgi*com or doed*com or frtib*com or la-*com or nga*com or nga*local' cmd.run 'uptime'

# Fred's kung fu for df -h:
salt -C 'afs*local or afs*com or bas-*com or ca-c19*com or dc*com or dgi*com or doed*com or frtib*com or la-*com or nga*com or nga*local' cmd.run 'df -h | egrep "[890][0-9]\%"'

# SKIP this one as long as Fred's kung fu works
salt -C 'afs*local or afs*com or bas-*com or ca-c19*com or dc*com or dgi*com or doed*com or frtib*com or la-*com or nga*com or nga*local' cmd.run 'df -h'

# Check for upgrades
salt -C 'afs*local or afs*com or bas-*com or ca-c19*com or dc*com or dgi*com or doed*com or frtib*com or la-*com or nga*com or nga*local' cmd.run 'yum check-update'

# Upgrade the Packages
salt -C 'afs*local or afs*com or bas-*com or ca-c19*com or dc*com or dgi*com or doed*com or frtib*com or la-*com or nga*com or nga*local' pkg.upgrade

:warning: Some Splunk Indexers always have high disk space usage (83%). This is normal.
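
The `egrep "[890][0-9]\%"` trick above matches 80-99% and 100%. If a different cutoff is wanted (for example 85, to skip the indexers that always sit around 83%), a numeric filter might be easier to adjust. A rough sketch, assuming the plain `df -h` command is run with `--out=txt`; the function name `full_filesystems` is hypothetical.

```shell
# Reads `salt ... cmd.run 'df -h' --out=txt` on stdin and prints
# "<minion>: <use%> <mount point>" for filesystems at or above the
# threshold given as $1 (default 80). Mount points containing spaces
# are not handled.
full_filesystems() {
    awk -v limit="${1:-80}" '
        NF >= 2 {
            use = $(NF - 1)                 # e.g. "85%"
            if (use ~ /^[0-9]+%$/) {        # skips the df header line
                sub(/%$/, "", use)
                if (use + 0 >= limit + 0) print $1, use "%", $NF
            }
        }'
}

# Example (run on the Salt Master), where <target> is the compound target above:
#   salt -C '<target>' cmd.run 'df -h' --out=txt | full_filesystems 85
```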

Troubleshooting

The EPEL repo is enabled on afs-splunk-hf (I don't know why); had to run this to avoid an issue with the collectd package on msoc:

yum update --disablerepo epel

Day 3 (Monday afternoon)

Step 2 of 2, Customer Slices Search Heads Only Reboots

Post to Slack #xdr-patching Channel, #xdr-soc Channel, and #xdr-engineering Channel:

FYI: Rebooting the Splunk Search Heads as part of today's patching. Reboots will occur in 15 minutes.

:warning: Silence Sensu first! Run on the GC PROD Salt Master.

Commands to run on the GC PROD Salt Master:

salt -C '*-sh* and not *moose* and not fm-shared-search*' test.ping --out=txt | sort | grep -v jid

salt -C '*-sh* and not *moose* and not fm-shared-search*' cmd.run 'df -h | egrep "[890][0-9]\%"'

date; salt -C '*-sh* and not *moose* and not fm-shared-search*' system.reboot --async

watch "salt -C '*-sh* and not *moose* and not fm-shared-search*' cmd.run 'uptime'"

salt -C '*-sh* and not *moose* and not fm-shared-search*' cmd.run 'systemctl status splunk | grep Active'

:warning: Don't forget to un-silence Sensu.

Day 4 (Tuesday)

Step 1 of 1, Customer Slices CMs Reboots

Post to Slack #xdr-patching Channel:

Today's patching is the indexing clusters for all XDR customer environments. Cluster masters and indexers will be rebooted. Thank you for your cooperation.

:warning: Silence Sensu first! Run on the GC PROD Salt Master.

Commands to run on the GC PROD Salt Master:

salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt

#Did you silence sensu?

salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'df -h'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'

date; salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' system.reboot --async
watch "salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt"

salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'systemctl status splunk | grep Active'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'uptime'
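
With many minions it is easy to miss one that is not `active (running)` in the `systemctl status` output above. A small sketch that filters the healthy ones out, assuming the command is run with `--out=txt`; the name `splunk_not_running` is hypothetical.

```shell
# Reads `salt ... cmd.run 'systemctl status splunk | grep Active' --out=txt`
# on stdin and prints only the lines that are NOT "active (running)".
splunk_not_running() {
    grep -v -F 'Active: active (running)'
}

# Example (run on the Salt Master):
#   salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'systemctl status splunk | grep Active' --out=txt | splunk_not_running
```

No output means every minion that answered reports Splunk as running. A minion that returned nothing at all prints no line, so it will not be listed here.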

ulimit errors

May 27 17:08:57 la-c19-splunk-cm.msoc.defpoint.local splunk[3840]: /etc/rc.d/init.d/splunk: line 13: ulimit: open files: cannot modify limit: Invalid argument

afs-splunk-hf has a hard time restarting. Might need to stop then start the instance.

Log into the CMs

Generate the URLs on the GC Prod Salt Master. OR SEE COMMAND BELOW!

for i in `salt -C '( *splunk-cm* ) not moose*' test.ping --out=txt`; do echo https://${i}8000; done | grep -v True
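
The one-liner above works because `--out=txt` prints `<minion>: True`, so the trailing colon ends up in front of `8000`. A slightly more explicit sketch of the same idea, which also drops minions that did not answer `True`; the function name `cm_urls` is hypothetical and port 8000 is taken from the command above.

```shell
# Reads `salt ... test.ping --out=txt` lines ("<minion>: True") on stdin
# and prints one Splunk web URL per minion that responded.
cm_urls() {
    awk -F': ' '$2 == "True" { printf "https://%s:8000\n", $1 }'
}

# Example (run on the Salt Master):
#   salt -C '( *splunk-cm* ) not moose*' test.ping --out=txt | cm_urls
```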

Reboot the indexers one at a time (AFS cluster gets backed up when an indexer is rebooted)

Command to view "three green check marks" from salt.

salt -C ' *splunk-cm* not moose* ' state.sls splunk.master.cluster_status --static --out=json | jq --raw-output 'keys[] as $k | "\($k): \(.[$k] | .[].changes?.stdout)"' | grep -1 factor

NOTICE: Using compound targeting doesn't seem to work with multi-level grains after they reboot. After reboot if the grain targeting doesn't work try this to sync up the grains after the reboot: salt '*splunk-i*' saltutil.refresh_grains

# us-gov-east-1a
salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1a or G@ec2:placement:availability_zone:us-gov-east-1a ) not moose*' test.ping --out=txt

salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1a or G@ec2:placement:availability_zone:us-gov-east-1a ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'

date; salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1a or G@ec2:placement:availability_zone:us-gov-east-1a ) not moose*' system.reboot --async

watch "salt -C ' *splunk-cm* not moose* ' state.sls splunk.master.cluster_status | grep 'not met'"

salt -C ' *splunk-cm* not moose* ' state.sls splunk.master.cluster_status --static --out=json | jq --raw-output 'keys[] as $k | "\($k): \(.[$k] | .[].changes?.stdout)"' | grep -1 factor

salt '*splunk-i*' saltutil.refresh_grains

watch "salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1a or G@ec2:placement:availability_zone:us-gov-east-1a ) not moose*' test.ping --out=txt"

Wait for 3 green check marks
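
Instead of eyeballing `watch`, the wait could be scripted. This is a sketch only: `wait_until` and `cluster_healthy` are hypothetical names, and it assumes (as the `grep 'not met'` command above implies) that the cluster_status state output contains "not met" only while a factor is unmet. Still confirm the three green check marks on the CM before moving on.

```shell
# Re-run a command until it succeeds; give up after max_tries attempts.
# usage: wait_until <max_tries> <sleep_seconds> <command> [args...]
wait_until() {
    max_tries=$1
    pause=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        try=$((try + 1))
        sleep "$pause"
    done
    return 1
}

# Succeeds when the CMs returned something and none of it says "not met".
cluster_healthy() {
    report=$(salt -C ' *splunk-cm* not moose* ' state.sls splunk.master.cluster_status)
    [ -n "$report" ] || return 1
    ! printf '%s\n' "$report" | grep -q 'not met'
}

# Example (run on the Salt Master): poll every 60 seconds, up to 30 times.
#   wait_until 30 60 cluster_healthy && echo "factors met"
```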

Repeat for other AZs

# us-gov-east-1b
salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1b or G@ec2:placement:availability_zone:us-gov-east-1b ) not moose*' test.ping --out=txt

salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1b or G@ec2:placement:availability_zone:us-gov-east-1b ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'

date; salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1b or G@ec2:placement:availability_zone:us-gov-east-1b ) not moose*' system.reboot --async

watch "salt -C ' *splunk-cm* not moose* ' state.sls splunk.master.cluster_status | grep 'not met'"

salt -C ' *splunk-cm* not moose* ' state.sls splunk.master.cluster_status --static --out=json | jq --raw-output 'keys[] as $k | "\($k): \(.[$k] | .[].changes?.stdout)"' | grep -1 factor

salt '*splunk-i*' saltutil.refresh_grains

watch "salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1b or G@ec2:placement:availability_zone:us-gov-east-1b ) not moose*' test.ping --out=txt"

# 3 green checkmarks 

# us-gov-east-1c
salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1c or G@ec2:placement:availability_zone:us-gov-east-1c ) not moose*' test.ping --out=txt

salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1c or G@ec2:placement:availability_zone:us-gov-east-1c ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'

date; salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1c or G@ec2:placement:availability_zone:us-gov-east-1c ) not moose*' system.reboot --async

watch "salt -C ' *splunk-cm* not moose* ' state.sls splunk.master.cluster_status | grep 'not met'"

salt -C ' *splunk-cm* not moose* ' state.sls splunk.master.cluster_status --static --out=json | jq --raw-output 'keys[] as $k | "\($k): \(.[$k] | .[].changes?.stdout)"' | grep -1 factor

salt '*splunk-i*' saltutil.refresh_grains

watch "salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1c or G@ec2:placement:availability_zone:us-gov-east-1c ) not moose*' test.ping --out=txt"

# 3 green checkmarks

NOTE: NGA had a hard time getting 3 checkmarks. The CM was waiting on stuck buckets. Force rolled the buckets to get green checkmarks.
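
The three AZ blocks above differ only in the zone letter. A sketch of a helper that builds the per-AZ indexer target, so the letter is typed once; `indexer_target` is a hypothetical name, and the target string is copied from the commands above.

```shell
# Build the compound target for the customer indexers in one availability
# zone. $1 is the AZ letter (a, b or c). Both the us-east-1 and
# us-gov-east-1 zone names are matched, as in the commands above.
indexer_target() {
    printf '%s\n' "*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1$1 or G@ec2:placement:availability_zone:us-gov-east-1$1 ) not moose*"
}

# Example (run on the Salt Master):
#   salt -C "$(indexer_target b)" test.ping --out=txt
```

Still reboot one AZ at a time and wait for the check marks in between; do not loop the reboot over all three letters.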

Verify you got everything

Run this on GC Test and GC Prod

salt '*' cmd.run 'uptime | grep days'

:warning: Make sure the Sensu checks are not silenced.

Post to Slack #xdr-patching Channel:

Patching is done for this month.
