Patching Notes

Timeline

Asha likes to submit her FedRAMP packet before about the 20th, so try to get it done before that.
Send email ~ 1 week before.
Give 15 minute warning in these slack channels #mdr-patching, #mdr-content, etc before patching

Patching Process

Each month the environment must be patched to comply with FedRAMP requirements. This wiki page outlines the process for patching the environment.

Email Template that needs to be sent out prior to patching and email addresses of individuals who should get the email.

Leonard, Wesley A. <wesley.a.leonard@accenturefederal.com>; Waddle, Duane E. <duane.e.waddle@accenturefederal.com>; Nair, Asha A. <asha.a.nair@accenturefederal.com>; Middleton, S. <s.middleton@accenturefederal.com>; Crawley, Angelita <angelita.crawley@accenturefederal.com>; Rivas, Gregory A. <gregory.a.rivas@accenturefederal.com>; Damstra, Frederick T. <frederick.t.damstra@accenturefederal.com>; Poulton, Brad <brad.poulton@accenturefederal.com>; Williams, Colby <colby.williams@accenturefederal.com>; Mahmood, Shahid <shahid.mahmood@accenturefederal.com>; Naughton, Brandon <brandon.naughton@accenturefederal.com>

SUBJECT: December Patching

It is time for monthly patching again. Patching is going to occur during business hours within the next week or two.  Everything - including Customer POP/LCPs - needs patching.  We will be doing the servers in 2 waves.
 
For real-time patching announcements, join the slack channel #xdr-patching. Announcements will be posted in that channel on what is going down and when.
 
Here is the proposed patching schedule:

Wednesday Dec 11:
* Moose and Internal infrastructure
  * Patching
 
Thursday Dec 12:
* Moose and Internal
  * Reboots
* All Customer PoP/LCP
  * Patching (AM)
  * Reboots (PM)

Monday Dec 16:
* All Customer XDR Cloud
  * Patching
* All Search heads
  * Reboots (PM)

Tuesday Dec 17:
* All Remaining XDR Cloud
  * Reboots (AM)
 
The customer and user impact will be during the reboots so they will be done in batches to reduce our total downtime.

Detailed Steps (Brad's patching)

Day 1 (Wednesday), step 1 of 1: Moose and Internal infrastructure - Patching

Patch TEST first! This helps find problems in TEST and potential problems in PROD.

Post to Slack:

FYI, patching today. 
* This morning, patches to all internal systems and moose. 
* No reboots, so impact should be minimal.

:warning: See if GitHub has any updates! Coordinate with Duane or Colby on GitHub Patching.

Starting with moose and internal infra patching. Check disk space for potential issues.

salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' test.ping --out=txt
salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' cmd.run 'df -h /boot'  
salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' cmd.run 'df -h /var/log'   # some at 63%
salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' cmd.run 'df -h /var'        # one at 74%
salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' cmd.run 'df -h'

# Fred's update for df -h:
salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' cmd.run 'df -h | egrep "[890][0-9]\%"'

# Review packages that will be updated. some packages are versionlocked (Collectd, Splunk,etc.).
salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' cmd.run 'yum check-update' 

### OpenVPN sometimes goes down with patching and needs a restart of the service. 
### Let's patch the VPN after everthing else. I am not sure which package is causing the issue. Kernal? bind-utils? 

### Also, the phantom_repo pkg wants to upgrade, but we are not ready. Let's exclude that package to prevent errors. 
salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* or openvpn* )' pkg.upgrade exclude='phantom_repo'
salt -C 'openvpn*' pkg.upgrade

# Just to be sure, run it again to make sure nothing got missed. 
salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' pkg.upgrade exclude='phantom_repo'

#patch GC ( from the GC salt master )
salt -C  '*accenturefederalcyber.com not nihor*' test.ping
salt -C  '*accenturefederalcyber.com not nihor*' cmd.run 'df -h | egrep "[890][0-9]\%"'
salt -C  '*accenturefederalcyber.com not nihor*' cmd.run 'yum check-update'
salt -C  '*accenturefederalcyber.com not nihor*' pkg.upgrade

:warning: After upgrades check on Portal to make sure it is still up.

Phantom error

phantom.msoc.defpoint.local:
    ERROR: Problem encountered upgrading packages. Additional info follows:

    changes:
        ----------
    result:
        ----------
        pid:
            40718
        retcode:
            1
        stderr:
            Running scope as unit run-40718.scope.
            Error in PREIN scriptlet in rpm package phantom_repo-4.9.39220-1.x86_64
            phantom_repo-4.9.37880-1.x86_64 was supposed to be removed but is not!
        stdout:
            Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
            Logging to /var/log/phantom/phantom_install_log
            error: %pre(phantom_repo-4.9.39220-1.x86_64) scriptlet failed, exit status 7

Error: `error: unpacking of archive failed on file /usr/lib/python2.7/site-packages/urllib3/packages/ssl_match_hostname: cpio: rename failed`

salt ma-* cmd.run 'pip uninstall urllib3 -y'

This error is caused by the versionlock on the package. Use this to view the list

yum versionlock list
Error: Package: salt-minion-2018.3.4-1.el7.noarch (@salt-2018.3)
                       Requires: salt = 2018.3.4-1.el7
                       Removing: salt-2018.3.4-1.el7.noarch (@salt-2018.3)
                           salt = 2018.3.4-1.el7
                       Updated By: salt-2018.3.5-1.el7.noarch (salt-2018.3)
                           salt = 2018.3.5-1.el7

Error: installing package `kernel-3.10.0-1062.12.1.el7.x86_64` needs 7MB on the /boot filesystem

# Install yum utils 
yum install yum-utils

# Package-cleanup set count as how many old kernels you want left 
package-cleanup --oldkernels --count=1 -y

If VPN server stops working,

Try a stop and start of the VPN server (instructions). The private IP will probably change.

ISSUE: salt-minion doesn't come back and has this error

/usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh: line 16: /lib/modules/3.10.0-957.21.3.el7.x86_64///lib/modules/3.10.0-957.21.3.el7.x86_64/kernel/sound/drivers/mpu401/snd-mpu401.ko.xz: No such file or directory

RESOLUTION: Manually reboot the OS, this is most likely due to a kernal upgrade.

Day 2 (Thursday), step 1 of 4: Reboot Internals

Don't forget to reboot test.

Post to Slack:

FYI, patching today. 
* In about 15 minutes: Reboots of moose and internal systems, including the VPN.
* Following that, patching (but not rebooting) of all customer PoPs/LCPs.
* Then this afternoon, reboots of those those PoPs/LCPs.

Be sure to select ALL events in sensu for silencing not just the first 25. Sensu -> Entities -> Sort (name) -> Select Entity and Silence. This will silence both keepalive and other checks. Some silenced events will not unsilence and will need to be manually unsilenced. IDEA! restart the sensu server and the vault-3 server first. This helps with the clearing of the silenced entities.

salt -L 'vault-3.msoc.defpoint.local,sensu.msoc.defpoint.local' test.ping
date; salt -L 'vault-3.msoc.defpoint.local,sensu.msoc.defpoint.local' system.reboot
watch "salt -L 'vault-3.msoc.defpoint.local,sensu.msoc.defpoint.local' test.ping"
salt -C '* not ( moose-splunk-indexer* or afs* or nga* or ma-* or mo-* or la-* or dc-* or vault-3* or sensu* or interconnect* or resolver* or nihor* )' test.ping --out=txt
date; salt -C '* not ( moose-splunk-indexer* or afs* or nga* or ma-* or mo-* or la-* or dc-* or vault-3* or sensu* or interconnect* or resolver* or nihor* )' system.reboot
### You will lose connectivity to openvpn and salt master
### Log back in and verify they are back up
watch "salt -C '* not ( moose-splunk-indexer* or afs* or nga* or ma-* or mo-* or la-* or dc-* or vault-3* or sensu* or interconnect* or resolver* or nihor* )' cmd.run 'uptime' --out=txt"

# Take care of the interconencts/resolvers one at a time. 

# Production
salt 'interconnect-0.pvt.xdr.accenturefederalcyber.com' test.ping 
salt 'interconnect-0.pvt.xdr.accenturefederalcyber.com' system.reboot
salt 'interconnect-1.pvt.xdr.accenturefederalcyber.com' test.ping 
salt 'interconnect-1.pvt.xdr.accenturefederalcyber.com' system.reboot
salt 'resolver-commercial.pvt.xdr.accenturefederalcyber.com' test.ping
salt 'resolver-commercial.pvt.xdr.accenturefederalcyber.com' system.reboot
salt 'resolver-govcloud.pvt.xdr.accenturefederalcyber.com' test.ping
salt 'resolver-govcloud.pvt.xdr.accenturefederalcyber.com' system.reboot

# Test
salt 'interconnect-0.pvt.xdrtest.accenturefederalcyber.com' test.ping 
salt 'interconnect-0.pvt.xdrtest.accenturefederalcyber.com' system.reboot
salt 'interconnect-1.pvt.xdrtest.accenturefederalcyber.com' test.ping 
salt 'interconnect-1.pvt.xdrtest.accenturefederalcyber.com' system.reboot
salt 'resolver-commercial.pvt.xdrtest.accenturefederalcyber.com' test.ping
salt 'resolver-commercial.pvt.xdrtest.accenturefederalcyber.com' system.reboot
salt 'resolver-govcloud.pvt.xdrtest.accenturefederalcyber.com' test.ping
salt 'resolver-govcloud.pvt.xdrtest.accenturefederalcyber.com' system.reboot

I (Duane) did this a little different. Salt-master first, then openvpn, then everything but interconnects and resolvers. Interconnects and resolvers reboot one at a time.

salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* or openvpn* or qcomp* or salt-master* or moose-splunk-indexer-* or interconnect* or resolver* )' cmd.run 'shutdown -r now'

Day 2 (Thursday), Step 2 of 4: Reboot Moose

salt -C 'moose-splunk-indexer*' test.ping --out=txt

# Do the first indexers
salt -C 'moose-splunk-indexer-i-03ff4fb9915d5f7df.msoc.defpoint.local' test.ping --out=txt
date; salt -C 'moose-splunk-indexer-i-03ff4fb9915d5f7df.msoc.defpoint.local' system.reboot

# Indexers take a while to restart
watch "salt -C 'moose-splunk-indexer-i-03ff4fb9915d5f7df.msoc.defpoint.local' cmd.run 'uptime' --out=txt"
ping moose-splunk-indexer-i-03ff4fb9915d5f7df.msoc.defpoint.local

WAIT FOR SPLUNK CLUSTER TO HAVE 3 CHECKMARKS

Repeat the above patching steps for the additional indexers, waiting for 3 green checks in between each one.

# Do the second indexer
salt -C 'moose-splunk-indexer-i-0b11e585de680b383.msoc.defpoint.local' test.ping --out=txt
date; salt -C 'moose-splunk-indexer-i-0b11e585de680b383.msoc.defpoint.local' system.reboot

# Indexers take a while to restart
watch "salt -C 'moose-splunk-indexer-i-0b11e585de680b383.msoc.defpoint.local' cmd.run 'uptime' --out=txt"

# Do the third indexer
salt -C 'moose-splunk-indexer-i-00ca1da87a2abcd56.msoc.defpoint.local' test.ping --out=txt
date; salt -C 'moose-splunk-indexer-i-00ca1da87a2abcd56.msoc.defpoint.local' system.reboot

# Indexers take a while to restart
watch "salt -C 'moose-splunk-indexer-i-00ca1da87a2abcd56.msoc.defpoint.local' cmd.run 'uptime' --out=txt"

# Verify all indexers patched:
salt 'moose-splunk-indexer*' cmd.run 'uptime' --out=txt

If the indexer/checkmarks don't come back ( legacy information )

If an indexer is not coming back up...look at screenshot in AWS... see this: Probing EDD (edd=off to disable)... ok then look at system log in AWS see this: Please enter passphrase for disk splunkhot!:

IF/WHEN and indexer doesn't come back up follow these steps:

In the AWS console, grab the instance ID.
Run the MDR/get-console.sh (Duane's script for pulling the system log)
Look for "Please enter passphrase for disk splunkhot"

In AWS console stop instance (which will remove ephemeral splunk data) then start it. Then ensure the /opt/splunkdata/hot exists.

salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'df -h'

IF the MOUNT for /opt/splunkdata/hot DOESN'T EXIST, STOP SPLUNK! Splunk will write to the wrong volume. before mounting the new volume clear out the wrong /opt/splunkdata/

rm -rf /opt/splunkdata/hot/*

salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl stop splunk'

Ensure the /opt/splunkdata doesn't already exist, before the boothook.

ssh prod-moose-splunk-indexer-1

If it doesn't then manually run the cloudinit boot hook.

sh /var/lib/cloud/instance/boothooks/part-002
salt -C 'nga-splunk-indexer-2.msoc.defpoint.local' cmd.run 'sh /var/lib/cloud/instance/boothooks/part-002'

Ensure the hot directory is owned by splunk:splunk

ll /opt/splunkdata/
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'ls -larth /opt/splunkdata'
chown -R splunk: /opt/splunkdata/
salt -C '' cmd.run 'chown -R splunk: /opt/splunkdata/'

It will be waiting for the luks.key

systemctl daemon-reload
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl daemon-reload'
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl restart systemd-cryptsetup@splunkhot'
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl | egrep cryptset'

It is waiting for command prompt, when you restart the service it picks up the key from a file. Systemd sees the crypt setup service as a dependency for the splunk service.

Look for this. this is good, it is ready for restart of splunk

Cryptography Setup for splunkhot

systemctl restart splunk
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl restart splunk'

Once the /opt/splunkdata/hot is visible in df -h and the splunk service is started, then wait for the cluster to have 3 green checkmarks.

Check the servers again to ensure all of them have rebooted.

salt -C ''moose-splunk-indexer*'' cmd.run 'uptime' --out=txt | sort

Ensure all Moose and Internal have been rebooted

salt -C '* not ( afs* or saf* or nga* or ma-* or mo-* or dc-c19* or la-c19* )' cmd.run uptime

Day 2 (Thursday), Step 3 of 4, Patching POPs

(Presently this is only nga-* and afs-*, as the C-19 customers dont have pops)

salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' test.ping --out=txt
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'yum check-update'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'uptime'

# Check for sufficient space (or use fred's method, next comment)
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h /boot'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h /var/log'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h /var'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h'

# Fred's update for df -h:
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h | egrep "[890][0-9]\%"'

# Updates
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade

# If a repo gives an error, you may need to disable it.
# salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade disablerepo=msoc-repo # Optional for fix
# salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade disablerepo=splunk-7.0 # Optional for fix
# on 2020-07-23: salt -C 'nga-splunk-ds-1 or afs-splunk-ds-1 or afs-splunk-ds-2' pkg.upgrade disablerepo=splunk-7.0 # Optional for fix

Error on afs-splunk-ds-3: error: cannot open Packages database in /var/lib/rpm

Solution:

mkdir /root/backups.rpm/
cp -avr /var/lib/rpm/ /root/backups.rpm/
rm -f /var/lib/rpm/__db*
db_verify /var/lib/rpm/Packages
rpm --rebuilddb
yum clean all

Error on `*-ds`: Could not resolve 'reposerver.msoc.defpoint.local/splunk/7.0/repodata/repomd.xml'

Reason: POP Nodes shouldn't be using the .local dns address.

Solution: Needs a permanent fix. For now, patch with the repo disabled:

salt -C '*-ds* not afs-splunk-ds-4' pkg.upgrade disablerepo=splunk-7.0

Day 2 (Thursday), Step 4 of 4 (afternoon), Reboots POPs

Post to Slack:

Resuming today's patching with the reboots of customer POPs.

NOTE: Restart POPs one server at a time in order to minimize risk of concurrent outages.

First syslog servers

salt -C '*syslog-1* not *.local' cmd.run 'uptime'
date; salt -C '*syslog-1* not *.local' system.reboot

watch "salt -C '*syslog-1* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"
# Look for /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid

Possible issue: if syslog-ng doesn't start, it might need the setenforce 0 command run ( left here for legacy reasons )

2020-06-11 - had to do this for afs-syslog-5 through 8

salt saf-splunk-syslog-1 cmd.run 'setenforce 0'
salt saf-splunk-syslog-1 cmd.run 'systemctl stop rsyslog'
salt saf-splunk-syslog-1 cmd.run 'systemctl start syslog-ng'

watch "salt -C '*syslog-1* not *.local' test.ping"

Second syslog servers

salt -C '*syslog-2* not *.local' cmd.run 'uptime'
date; salt -C '*syslog-2* not *.local' system.reboot
watch "salt -C '*syslog-2* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"

Remaining Syslog Servers

salt -C '*syslog-3* not *.local' cmd.run 'uptime'
date; salt -C '*syslog-3* not *.local' system.reboot
watch "salt -C '*syslog-3* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"

salt -C '*syslog-4* not *.local' cmd.run 'uptime'
date; salt -C '*syslog-4* not *.local' system.reboot
watch "salt -C '*syslog-4* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"

salt -C '*syslog-7* not *.local' cmd.run 'uptime'
date; salt -C '*syslog-7* not *.local' system.reboot
watch "salt -C '*syslog-7* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"

salt -C '*syslog-8* not *.local' cmd.run 'uptime'
date; salt -C '*syslog-8* not *.local' system.reboot
watch "salt -C '*syslog-8* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"

(We might be able to reboot some of these at the same time. If they are in different locations. Check the location grain on them.) grains.item location

afs-splunk-syslog-8: {u'location': u'az-east-us-2'}

afs-splunk-syslog-7: {u'location': u'az-east-us-2'}
afs-splunk-syslog-4: {u'location': u'San Antonio'}

salt -C 'afs-splunk-syslog*'  grains.item location

salt -L 'afs-splunk-syslog-3, afs-splunk-syslog-7' cmd.run 'uptime'
date; salt -L 'afs-splunk-syslog-3, afs-splunk-syslog-7' system.reboot
watch "salt -L 'afs-splunk-syslog-3, afs-splunk-syslog-7' test.ping"

salt -L 'afs-splunk-syslog-4, afs-splunk-syslog-8' cmd.run 'uptime'
date; salt -L 'afs-splunk-syslog-4, afs-splunk-syslog-8' system.reboot
watch "salt -L 'afs-splunk-syslog-4, afs-splunk-syslog-8' test.ping"

Verify logs are flowing

https://afs-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search afssplhf103.us.accenturefederal.com

# index=* source=/opt/syslog-ng/* host=afs* earliest=-15m | stats count by host
| tstats count WHERE index=* source=/opt/syslog-ng/* host=afs* earliest=-15m latest=now BY host

https://nga-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search aws-syslog1-tts.nga.gov

index=network sourcetype="citrix:netscaler:syslog" earliest=-15m latest=now
index=zscaler sourcetype="zscaler:web" earliest=-15m latest=now

NOTICE: NGA sourcetype="zscaler:web" logs are handled by fluentd and can lag behind by 10 minutes.

#index=* source=/opt/syslog-ng/* host=aws* earliest=-60m | stats count by host
| tstats count WHERE index=* source=/opt/syslog-ng/* host=aws* earliest=-60m latest=now BY host

POP DS (could these be restarted at the same time? Or in 2 batches?)

salt -C '*splunk-ds-1* not *.local' cmd.run 'uptime'
date; salt -C '*splunk-ds-1* not *.local' system.reboot
watch "salt -C '*splunk-ds-1* not *.local' cmd.run 'uptime'"

salt -C '*splunk-ds-2* not *.local' cmd.run 'uptime'
date; salt -C '*splunk-ds-2* not *.local' system.reboot
watch "salt -C '*splunk-ds-2* not *.local' cmd.run 'uptime'"

salt afs-splunk-ds-4 cmd.run 'uptime'
date; salt afs-splunk-ds-4 system.reboot
watch "salt -C '*splunk-ds-4* not *.local' cmd.run 'uptime'"

Don't forget DS-4

# Try reboot at the same time
salt '*splunk*ds*' cmd.run 'uptime'
date; salt '*splunk*ds*' system.reboot
watch "salt '*splunk*ds*' test.ping"
salt '*splunk-ds*' cmd.run 'systemctl status splunk'

Did you get all of them?

salt -C ' * not *local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'uptime' --out=txt | sort

Day 3 (Monday), Step 1 of 2, Customer Slices Patching

Post to Slack:

Today's patching is all XDR customer environments. Indexers and searchheads will be patched this morning. Search heads will be rebooted this afternoon, and the indexers will be rebooted tomorrow. Thank you for your cooperation.

salt -C 'afs*local or ma-* or mo-*local or la-*local or nga*local or dc*local' test.ping --out=txt
salt -C 'afs*local or ma-* or mo-*local or la-*local or nga*local or dc*local' cmd.run 'uptime'
salt -C 'afs*local or ma-* or mo-*local or la-*local or nga*local or dc*local' cmd.run 'df -h'

# Fred's update for df -h:
salt -C 'afs*local or ma-* or mo-*local or la-*local or nga*local or dc*local' cmd.run 'df -h | egrep "[890][0-9]\%"'
salt -C 'afs*local or ma-* or mo-*local or la-*local or nga*local or dc*local' pkg.upgrade

EPEL repo is enabled on afs-splunk-hf ( I don't know why); had to run this to avoid issue with collectd package on msoc-repo

yum update --disablerepo epel

Day 3 (Monday afternoon), Step 2 of 2, Customer Slices Search Heads Only Reboots

Silence Sensu first!

Post to Slack (xdr-patching and xdr-soc):

FYI: Rebooting the Splunk search heads as part of today's patching. Reboots will occur in 15 minutes.

Commands to run:


salt -C '*-sh* and not *moose* and not qcompliance* and not fm-shared-search*' test.ping --out=txt | sort
salt -C '*-sh* and not *moose* and not qcompliance* and not fm-shared-search*' cmd.run 'df -h | egrep "[890][0-9]\%"'
salt -C '*-sh* and not *moose* and not qcompliance* and not fm-shared-search*' system.reboot
watch "salt -C '*-sh* and not *moose* and not qcompliance* and not fm-shared-search*' cmd.run 'uptime'"

Day 4 (Tuesday), Step 1 of 1, Customer Slices CMs Reboots

Today's patching is the indexing clusters for all XDR customer environments. Cluster masters and indexers will be rebooted this morning. Thank you for your cooperation.

Silence Sensu first!

salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt
#Did you silence sensu?
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'df -h'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' system.reboot
watch "salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt"
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'systemctl status splunk'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'uptime'

Log into the CM's

Generate the URLs:

for i in `salt -C '( *splunk-cm* ) not moose*' test.ping --out=txt`; do echo https://${i}8000; done | grep -v True

ulimit errors

May 27 17:08:57 la-c19-splunk-cm.msoc.defpoint.local splunk[3840]: /etc/rc.d/init.d/splunk: line 13: ulimit: open files: cannot modify limit: Invalid argument afs-splunk-hf has a hard time restarting. Might need to stop then start the instance.

Reboot the indexers one at a time (AFS cluster gets backed up when an indexer is rebooted) How to replicate this with ASGs?

TODO: get command to view "three green check marks" from salt.

salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1a not moose*' test.ping --out=txt
salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1a not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'
salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1a not moose*' system.reboot
watch "salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1a not moose*' test.ping --out=txt"

Wait for 3 green check marks

Repeat for AZs 2 & 3

salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1b not moose*' test.ping --out=txt
salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1b not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'
salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1b not moose*' system.reboot
watch "salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1b not moose*' test.ping --out=txt"
#3 green checkmarks 

salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1c not moose*' test.ping --out=txt
salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1c not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'
salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1c not moose*' system.reboot
watch "salt -C '*splunk-indexer-* and G@ec2:placement:availability_zone:us-east-1c not moose*' test.ping --out=txt"
#3 green checkmarks

NGA had a hard time getting 3 checkmarks The CM was waiting on stuck buckets. Force rolled the buckets to get green checkmarks.

Verify you got everything

salt '*' cmd.run 'uptime | grep days'
salt \* cmd.run 'uptime'

Patching Notes.md 25 KB

文件歷史 原始文件

Patching Notes

Timeline

Patching Process

Detailed Steps (Brad's patching)

Day 1 (Wednesday), step 1 of 1: Moose and Internal infrastructure - Patching

Error: `error: unpacking of archive failed on file /usr/lib/python2.7/site-packages/urllib3/packages/ssl_match_hostname: cpio: rename failed`

Error: installing package `kernel-3.10.0-1062.12.1.el7.x86_64` needs 7MB on the /boot filesystem

If VPN server stops working,

ISSUE: salt-minion doesn't come back and has this error

Day 2 (Thursday), step 1 of 4: Reboot Internals

Day 2 (Thursday), Step 2 of 4: Reboot Moose

WAIT FOR SPLUNK CLUSTER TO HAVE 3 CHECKMARKS

If the indexer/checkmarks don't come back ( legacy information )

Look for this. this is good, it is ready for restart of splunk

Day 2 (Thursday), Step 3 of 4, Patching POPs

Error on afs-splunk-ds-3: error: cannot open Packages database in /var/lib/rpm

Error on `*-ds`: Could not resolve 'reposerver.msoc.defpoint.local/splunk/7.0/repodata/repomd.xml'

Day 2 (Thursday), Step 4 of 4 (afternoon), Reboots POPs

First syslog servers

Possible issue: if syslog-ng doesn't start, it might need the setenforce 0 command run ( left here for legacy reasons )

Second syslog servers

Remaining Syslog Servers

Verify logs are flowing

POP DS (could these be restarted at the same time? Or in 2 batches?)

Day 3 (Monday), Step 1 of 2, Customer Slices Patching

Day 3 (Monday afternoon), Step 2 of 2, Customer Slices Search Heads Only Reboots

Day 4 (Tuesday), Step 1 of 1, Customer Slices CMs Reboots

Log into the CM's

ulimit errors

Reboot the indexers one at a time (AFS cluster gets backed up when an indexer is rebooted) How to replicate this with ASGs?

Repeat for AZs 2 & 3

Verify you got everything

:warning: MAKE SURE the Sensu checks are not silenced.

Patching Notes.md 25 KB 文件歷史 原始文件

Patching Notes

Timeline

Patching Process

Detailed Steps (Brad's patching)

Day 1 (Wednesday), step 1 of 1: Moose and Internal infrastructure - Patching

Error: error: unpacking of archive failed on file /usr/lib/python2.7/site-packages/urllib3/packages/ssl_match_hostname: cpio: rename failed

Error: installing package kernel-3.10.0-1062.12.1.el7.x86_64 needs 7MB on the /boot filesystem

If VPN server stops working,

ISSUE: salt-minion doesn't come back and has this error

Day 2 (Thursday), step 1 of 4: Reboot Internals

Day 2 (Thursday), Step 2 of 4: Reboot Moose

WAIT FOR SPLUNK CLUSTER TO HAVE 3 CHECKMARKS

If the indexer/checkmarks don't come back ( legacy information )

Look for this. this is good, it is ready for restart of splunk

Day 2 (Thursday), Step 3 of 4, Patching POPs

Error on afs-splunk-ds-3: error: cannot open Packages database in /var/lib/rpm

Error on *-ds: Could not resolve 'reposerver.msoc.defpoint.local/splunk/7.0/repodata/repomd.xml'

Day 2 (Thursday), Step 4 of 4 (afternoon), Reboots POPs

First syslog servers

Possible issue: if syslog-ng doesn't start, it might need the setenforce 0 command run ( left here for legacy reasons )

Second syslog servers

Remaining Syslog Servers

Verify logs are flowing

POP DS (could these be restarted at the same time? Or in 2 batches?)

Day 3 (Monday), Step 1 of 2, Customer Slices Patching

Day 3 (Monday afternoon), Step 2 of 2, Customer Slices Search Heads Only Reboots

Day 4 (Tuesday), Step 1 of 1, Customer Slices CMs Reboots

Log into the CM's

ulimit errors

Reboot the indexers one at a time (AFS cluster gets backed up when an indexer is rebooted) How to replicate this with ASGs?

Repeat for AZs 2 & 3

Verify you got everything

:warning: *MAKE SURE the Sensu checks are not silenced. *

Patching Notes.md 25 KB

文件歷史原始文件

Error: `error: unpacking of archive failed on file /usr/lib/python2.7/site-packages/urllib3/packages/ssl_match_hostname: cpio: rename failed`

Error: installing package `kernel-3.10.0-1062.12.1.el7.x86_64` needs 7MB on the /boot filesystem

Error on `*-ds`: Could not resolve 'reposerver.msoc.defpoint.local/splunk/7.0/repodata/repomd.xml'

:warning: MAKE SURE the Sensu checks are not silenced.