
Patching Notes.md

Timeline

  • Asha likes to submit her FedRAMP packet before about the 20th, so try to get patching done before then.
  • Send the announcement email ~1 week before patching.
  • Give a 15-minute warning in the Slack channels (#mdr-patching, #mdr-content, etc.) before patching.

Patching Process

Each month the environment must be patched to comply with FedRAMP requirements. This wiki page outlines the process for patching the environment.

Below is the email template that needs to be sent out prior to patching, along with the email addresses of the individuals who should receive it.

Leonard, Wesley A. <wesley.a.leonard@accenturefederal.com>; Waddle, Duane E. <duane.e.waddle@accenturefederal.com>; Nair, Asha A. <asha.a.nair@accenturefederal.com>; Middleton, S. <s.middleton@accenturefederal.com>; Crawley, Angelita <angelita.crawley@accenturefederal.com>; Rivas, Gregory A. <gregory.a.rivas@accenturefederal.com>; Damstra, Frederick T. <frederick.t.damstra@accenturefederal.com>
SUBJECT: December Patching
It is time for monthly patching again. Patching is going to occur during business hours within the next week or two.  Everything - including Customer POPs - needs patching.  We will be doing the servers in 2 waves.
 
For real-time patching announcements, join the slack channel #mdr-patching. Announcements will be posted in that channel on what is going down and when.
 
Here is the proposed patching schedule:

Wednesday Dec 11:
* Moose and Internal infrastructure
  * Patching
 
Thursday Dec 12:
* Moose and Internal
  * Reboots
* All Customer PoP
  * Patching (AM)
  * Reboots (PM)

Monday Dec 16:
* All Customer XDR Cloud
  * Patching
* All Search heads
  * Reboots (PM)

Tuesday Dec 17:
* All Remaining XDR Cloud
  * Reboots (AM)
 
The customer and user impact will be during the reboots, so they will be done in batches to reduce the total downtime.

Brad's Patching

:warning: See if GitHub has any updates! Coordinate with Duane on GitHub patching.
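
If the GitHub instance is a self-hosted GitHub Enterprise Server appliance (an assumption; confirm with Duane), the admin SSH shell can report whether a newer release is available. A minimal sketch, with a hypothetical hostname:

# From the GHES administrative SSH shell (port 122); hostname is a placeholder
ssh -p 122 admin@github.msoc.defpoint.local
ghe-update-check   # checks whether a newer GHES release is available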

Start with Moose and internal infrastructure patching. Check disk space for potential issues.

salt -C '* not ( afs* or saf* or nga* )' test.ping --out=txt
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'df -h /boot'  
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'df -h /var/log'   # some at 63%
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'df -h /var'        # one at 74%
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'df -h'
# review packages that will be updated; some packages are versionlocked (collectd, Splunk, etc.)
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'yum check-update' 
salt -C '* not ( afs* or saf* or nga* )' pkg.upgrade
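
As an optional preview before the pkg.upgrade step above, Salt can also list pending updates as structured output instead of raw yum text; a minimal sketch:

# read-only preview of what pkg.upgrade would install
salt -C '* not ( afs* or saf* or nga* )' pkg.list_upgrades --out=txt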

Error: error: unpacking of archive failed on file /usr/lib/python2.7/site-packages/urllib3/packages/ssl_match_hostname: cpio: rename failed

salt ma-* cmd.run 'pip uninstall urllib3 -y'

This error is caused by a versionlock on the package. Use this to view the versionlock list:

yum versionlock list
Error: Package: salt-minion-2018.3.4-1.el7.noarch (@salt-2018.3)
                       Requires: salt = 2018.3.4-1.el7
                       Removing: salt-2018.3.4-1.el7.noarch (@salt-2018.3)
                           salt = 2018.3.4-1.el7
                       Updated By: salt-2018.3.5-1.el7.noarch (salt-2018.3)
                           salt = 2018.3.5-1.el7
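
If this salt/salt-minion dependency error blocks the rest of the update, one workaround (an assumption, not the documented fix) is to exclude the salt packages from that run and handle the Salt upgrade separately:

# skip the salt packages so the remaining updates can proceed
salt -C '* not ( afs* or saf* or nga* )' cmd.run "yum -y update --exclude='salt*'"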

Error: installing package kernel-3.10.0-1062.12.1.el7.x86_64 needs 7MB on the /boot filesystem

# Install yum-utils
yum install yum-utils

# package-cleanup: set --count to how many old kernels you want to keep
package-cleanup --oldkernels --count=1
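
Optionally, yum's installonly_limit can cap how many kernels are retained so /boot is less likely to fill up again (an assumption; confirm before changing yum.conf on these hosts):

# show the current limit, then cap retained kernels at 2
grep installonly_limit /etc/yum.conf
sed -i 's/^installonly_limit=.*/installonly_limit=2/' /etc/yum.conf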

If the VPN server stops working, try a stop and start of the VPN server. The private IP will probably change.
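
A sketch of that stop/start via the AWS CLI (the instance ID is hypothetical; substitute the real VPN server's ID):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0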

ISSUE: salt-minion doesn't come back and has this error

/usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh: line 16: /lib/modules/3.10.0-957.21.3.el7.x86_64///lib/modules/3.10.0-957.21.3.el7.x86_64/kernel/sound/drivers/mpu401/snd-mpu401.ko.xz: No such file or directory

RESOLUTION: Manually reboot the OS; this is most likely due to a kernel upgrade.

Reboots Internals

Be sure to select ALL events in Sensu for silencing, not just the first 25: Sensu -> Entities -> Sort (name) -> Select Entity and Silence. This will silence both keepalive and other checks. Some silenced events will not unsilence and will need to be manually unsilenced. Some silenced events will still trigger; not sure why. The keepalive still triggers VictorOps. IDEA: restart the Sensu server and the vault-3 server first; this helps with clearing the silenced entities.
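
If the web UI bulk-select is misbehaving, sensuctl can create the same silences from the CLI (a sketch, assuming Sensu Go with sensuctl already configured; the entity name and expiry are examples):

# silence every check (including keepalive) on one entity for 4 hours
sensuctl silenced create --subscription entity:moose-splunk-indexer-1.msoc.defpoint.local --expire 14400 --reason "monthly patching"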

salt -L 'vault-3.msoc.defpoint.local,sensu.msoc.defpoint.local' test.ping
salt -L 'vault-3.msoc.defpoint.local,sensu.msoc.defpoint.local' cmd.run 'shutdown -r now'
salt -C '* not ( moose-splunk-indexer* or afs* or nga* or ma-* or la-* or vault-3* or sensu* )' test.ping --out=txt
salt -C '* not ( moose-splunk-indexer* or afs* or nga* or ma-* or la-* or vault-3* or sensu* )' cmd.run 'shutdown -r now'
#you will lose connectivity to openvpn and salt master
#log back in and verify they are back up
salt -C '* not ( moose-splunk-indexer* or afs* or saf* or nga* )' cmd.run 'uptime' --out=txt

Reboots Moose

salt -C 'moose-splunk-indexer*' test.ping --out=txt
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'shutdown -r now'
#indexers take a while to restart
watch "salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' test.ping"
ping moose-splunk-indexer-1.msoc.defpoint.local

# WAIT FOR THE SPLUNK CLUSTER TO HAVE 3 CHECKMARKS
# If an indexer is not coming back up, look at the instance screenshot in AWS; you may see:
#   Probing EDD (edd=off to disable)... ok
# Then look at the system log in AWS; you may see:
#   Please enter passphrase for disk splunkhot!:

IF/WHEN an indexer doesn't come back up, follow these steps: in AWS, grab the instance ID.

Run MDR/get-console.sh (Duane's script for pulling the system log) and look for "Please enter passphrase for disk splunkhot".
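
If get-console.sh isn't handy, the same system log can be pulled directly with the AWS CLI (the instance ID is a placeholder):

aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text | grep -i passphrase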

In the AWS console, stop the instance (which will remove the ephemeral Splunk data), then start it. Then ensure /opt/splunkdata/hot exists.

salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'df -h'

IF the MOUNT for /opt/splunkdata/hot DOESN'T EXIST, STOP SPLUNK! Splunk will write to the wrong volume. Before mounting the new volume, clear out the wrong /opt/splunkdata/:

rm -rf /opt/splunkdata/hot/*
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl stop splunk'

Ensure /opt/splunkdata doesn't already exist before running the boothook.

ssh prod-moose-splunk-indexer-1

If it doesn't, then manually run the cloud-init boothook:

sh /var/lib/cloud/instance/boothooks/part-002
salt -C 'nga-splunk-indexer-2.msoc.defpoint.local' cmd.run 'sh /var/lib/cloud/instance/boothooks/part-002'

Ensure the hot folder is owned by splunk:splunk:

ll /opt/splunkdata/
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'ls -larth /opt/splunkdata'
chown -R splunk: /opt/splunkdata/
salt -C '' cmd.run 'chown -R splunk: /opt/splunkdata/'

It will be waiting for the luks.key:

systemctl daemon-reload
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl daemon-reload'
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl restart systemd-cryptsetup@splunkhot'
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl | egrep cryptset'

It is waiting at a passphrase prompt; when you restart the service it picks up the key from a file. systemd sees the cryptsetup service as a dependency of the splunk service.
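
To confirm that dependency relationship on a box, systemctl can list what the splunk unit pulls in (a quick check, not part of the original runbook):

salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl list-dependencies splunk | grep -i crypt'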

# Look for this; it means the volume is ready and Splunk can be restarted:
#   Cryptography Setup for splunkhot

systemctl restart splunk
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl restart splunk'

Once /opt/splunkdata/hot is visible in df -h and the splunk service is started, wait for the cluster to have 3 green checkmarks.

Check the servers again to ensure all of them have rebooted.

salt -C 'moose-splunk-indexer*' cmd.run 'uptime' --out=txt | sort

Ensure all Moose and Internal have been rebooted

salt -C '* not ( afs* or nga* or la-* or ma-* )' cmd.run 'uptime' --out=txt | sort

Patching POPs

salt -C '* not *.local' test.ping --out=txt
salt -C '* not *.local' cmd.run 'yum check-update'
salt -C '* not *.local' cmd.run 'uptime'
# check for sufficient disk space
salt -C '* not *.local' cmd.run 'df -h /boot'
salt -C '* not *.local' cmd.run 'df -h /var/log'
salt -C '* not *.local' cmd.run 'df -h /var'
salt -C '* not *.local' cmd.run 'df -h'
salt -C '* not *.local' pkg.upgrade disablerepo=msoc-repo
salt -C '* not *.local' pkg.upgrade

Error on afs-splunk-ds-3: error: cannot open Packages database in /var/lib/rpm

Solution:

mkdir /root/backups.rpm/
cp -avr /var/lib/rpm/ /root/backups.rpm/
rm -f /var/lib/rpm/__db*
db_verify /var/lib/rpm/Packages
rpm --rebuilddb
yum clean all

Reboots POPs

DO NOT restart all POPs at the same time.

salt -C '*syslog-1* not *.local' cmd.run 'uptime'
salt -C '*syslog-1* not *.local' cmd.run 'shutdown -r now'

salt -C '*syslog-1* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep' 
#look for /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid

If syslog-ng doesn't start, it might need the setenforce 0 command run (left here for legacy reasons):

salt saf-splunk-syslog-1 cmd.run 'setenforce 0'
salt saf-splunk-syslog-1 cmd.run 'systemctl stop rsyslog'
salt saf-splunk-syslog-1 cmd.run 'systemctl start syslog-ng'

watch "salt -C '*syslog-1* not *.local' test.ping"

salt -C '*syslog-2* not *.local' cmd.run 'uptime'
salt -C '*syslog-2* not *.local' cmd.run 'shutdown -r now'
salt -C '*syslog-2* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'
salt -L 'nga-splunk-syslog-2,saf-splunk-syslog-2' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'

salt -C '*syslog-3* not *.local' cmd.run 'uptime'
salt -C '*syslog-3* not *.local' cmd.run 'shutdown -r now'

salt -C '*syslog-4* not *.local' cmd.run 'uptime'
salt -C '*syslog-4* not *.local' cmd.run 'shutdown -r now'

Repeat for syslog-5, syslog-6, syslog-7, and syslog-8.
(You might be able to reboot some of these at the same time if they are in different locations; check the location grain on them with grains.item location.)

afs-splunk-syslog-8: {u'location': u'az-east-us-2'}
afs-splunk-syslog-6: {u'location': u'az-central-us'}

afs-splunk-syslog-7: {u'location': u'az-east-us-2'}
afs-splunk-syslog-5: {u'location': u'az-central-us'}
afs-splunk-syslog-4: {u'location': u'San Antonio'}

salt -C 'afs-splunk-syslog*' grains.item location

salt -L 'afs-splunk-syslog-6, afs-splunk-syslog-8' cmd.run 'uptime'
salt -L 'afs-splunk-syslog-6, afs-splunk-syslog-8' cmd.run 'shutdown -r now'
watch "salt -L 'afs-splunk-syslog-6, afs-splunk-syslog-8' test.ping"

salt -L 'afs-splunk-syslog-5, afs-splunk-syslog-7' cmd.run 'uptime'
salt -L 'afs-splunk-syslog-5, afs-splunk-syslog-7' cmd.run 'shutdown -r now'
watch "salt -L 'afs-splunk-syslog-5, afs-splunk-syslog-7' test.ping"

#verify logs are flowing https://afs-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search afssplhf103.us.accenturefederal.com

index=* source=/opt/syslog-ng/* host=afs* earliest=-15m | stats  count by host

https://nga-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search aws-syslog1-tts.nga.gov

index=network sourcetype="citrix:netscaler:syslog" earliest=-15m
index=* source=/opt/syslog-ng/* host=aws* earliest=-60m | stats count by host

POP DS (could these be restarted at the same time, or in 2 batches? See the batching sketch below.)

salt -C '*splunk-ds-1* not *.local' cmd.run 'uptime'
salt -C '*splunk-ds-1* not *.local' cmd.run 'shutdown -r now'

salt -C '*splunk-ds-2* not *.local' cmd.run 'uptime'
salt -C '*splunk-ds-2* not *.local' cmd.run 'shutdown -r now'

salt afs-splunk-ds-[2,3,4] cmd.run 'uptime'
salt afs-splunk-ds-[2,3,4] cmd.run 'shutdown -r now'

Don't forget ds-3 and ds-4

#try reboot at the same time
salt '*splunk*ds*' cmd.run 'uptime'
salt '*splunk*ds*' cmd.run 'shutdown -r now'
watch "salt '*splunk*ds*' test.ping"
salt '*splunk-ds*' cmd.run 'systemctl status splunk'
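
To the batching question above: Salt's --batch-size flag can do a rolling reboot instead of all at once (a sketch; pick whatever batch size the team is comfortable with):

# reboot the POP deployment servers in rolling batches of 25%
salt --batch-size 25% '*splunk*ds*' cmd.run 'shutdown -r now'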

Did you get all of them?

salt -C ' * not *local ' cmd.run 'uptime' --out=txt | sort

Customer Slices Patching

salt -C 'afs*local or ma-*local or la-*local or nga*local' test.ping --out=txt
salt -C 'afs*local or ma-*local or la-*local or nga*local' cmd.run 'uptime'
salt -C 'afs*local or ma-*local or la-*local or nga*local' cmd.run 'df -h'
salt -C 'afs*local or ma-*local or la-*local or nga*local' pkg.upgrade

The epel repo is enabled on afs-splunk-hf (I don't know why). Had to run this to avoid an issue with the collectd package on msoc-repo:

yum update --disablerepo epel
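
If epel shouldn't be enabled there at all, yum-config-manager (from yum-utils) can disable it persistently (an assumption; confirm before changing repo config on afs-splunk-hf):

yum-config-manager --disable epel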

Customer Slices Search Heads Only Reboots

Silence Sensu first!

salt -C 'afs-splunk-sh*local or ma-*-splunk-sh*local or la-*-splunk-sh*local or nga-splunk-sh*local' test.ping --out=txt
salt -C 'afs-splunk-sh*local or ma-*-splunk-sh*local or la-*-splunk-sh*local or nga-splunk-sh*local' cmd.run 'df -h'
salt -C 'afs-splunk-sh*local or ma-*-splunk-sh*local or la-*-splunk-sh*local or nga-splunk-sh*local' cmd.run 'shutdown -r now'
salt -C 'afs-splunk-sh*local or ma-*-splunk-sh*local or la-*-splunk-sh*local or nga-splunk-sh*local' cmd.run 'uptime'

Customer Slices CMs Reboots

Silence Sensu first!

salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'df -h'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'shutdown -r now'
watch "salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt"
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'systemctl status splunk'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'uptime'

May 27 17:08:57 la-c19-splunk-cm.msoc.defpoint.local splunk[3840]: /etc/rc.d/init.d/splunk: line 13: ulimit: open files: cannot modify limit: Invalid argument

afs-splunk-hf has a hard time restarting. Might need to stop then start the instance.

Reboot indexers 1 at a time (the AFS cluster gets backed up when an indexer is rebooted). How to replicate this with ASGs? TODO: get a command to view the "three green checkmarks" from salt.
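
For the TODO above, one approximation of the "three green checkmarks" from the CLI is the cluster status command on the CM (a sketch; the Splunk path is an assumption and the command prompts for admin credentials):

# run on the cluster master; all peers should be Up and the replication/search factors met
/opt/splunk/bin/splunk show cluster-status --verbose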

salt -C '*splunk-indexer-1* and G@ec2:placement:availability_zone:us-east-1a not moose*' test.ping
salt -C '*splunk-indexer-1* not moose*' test.ping --out=txt
salt -C '*splunk-indexer-1* not moose*' cmd.run 'df -h'
salt -C '*splunk-indexer-1* not moose*' cmd.run 'shutdown -r now'

Wait for 3 green check marks

#repeat for indexers 2 & 3

salt -C '*splunk-indexer-2* not moose*' test.ping

3 green checkmarks

salt -C '*splunk-indexer-3* not moose*' test.ping
salt -L 'afs-splunk-indexer-3.msoc.defpoint.local,saf-splunk-indexer-3.msoc.defpoint.local' cmd.run 'df -h'

NGA had a hard time getting 3 checkmarks. The CM was waiting on stuck buckets; force-rolled the buckets to get green checkmarks.

# any host still showing uptime in days has not been rebooted
salt -C '* not *.local' cmd.run 'uptime | grep days'

:warning: *MAKE SURE the Sensu checks are not silenced.*