-----------
Date: Wednesday, November 13, 2019 at 3:53 PM
To: "Leonard, Wesley A." <wesley.a.leonard@accenturefederal.com>, "Waddle, Duane E." <duane.e.waddle@accenturefederal.com>
Subject: November Patching for MDR

Again, we are going to attempt to do much of the patching during business hours during the week. Everything - including Customer POPs - needs patches this time. We will be doing the servers in 2 waves.

Wave 1 is hot patching of all systems; this will fix the sudo issue and stage the kernel patches.

Wave 2 will be the needed reboots, as this is where we see the customer impact.

There is a Slack channel, #mdr-patching. You can join it to get real-time announcements on what is going down and when.

Rough preliminary plans are:

Wed Nov 13:
  Moose and Internal infrastructure
    Wave 1

Thursday Nov 14:
  Moose and Internal
    Wave 2
  All Customer PoPs
    Wave 1 (AM)
    Wave 2 (PM)

Monday Nov 18:
  All Customer MDR Cloud
    Wave 1
  All Search heads
    Wave 2 (PM)

Tuesday Nov 19:
  All Remaining MDR Cloud
    Wave 2 (AM)
The customer/user impact will be during the reboots; that is why I am doing them in batches, so our total downtime is less.

----------------------
# Restart the indexers one at a time (one from each group). Use the CM to see if the indexer comes back up properly.
salt -C '( *moose* or *saf* ) and *indexer-1*' cmd.run 'shutdown -r now'
# Check to ensure the hot volume is mounted at /opt/splunkdata/hot
salt -C '( *moose* or *saf* ) and *indexer-1*' cmd.run 'df -h'

# WAIT FOR 3 checks in the CM before restarting the next indexer.

# Repeat for indexer-2
salt -C '( *moose* or *saf* ) and *indexer-2*' cmd.run 'shutdown -r now'
# Check to ensure the hot volume is mounted at /opt/splunkdata/hot
salt -C '( *moose* or *saf* ) and *indexer-2*' cmd.run 'df -h'

# WAIT FOR 3 checks in the CM before restarting the next indexer.

# Repeat for indexer-3
salt -C '( *moose* or *saf* ) and *indexer-3*' cmd.run 'shutdown -r now'
# Check to ensure the hot volume is mounted at /opt/splunkdata/hot
salt -C '( *moose* or *saf* ) and *indexer-3*' cmd.run 'df -h'
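The repeat-per-indexer pattern above can be wrapped in a small helper (a hedged sketch: the salt targets are copied from the commands above, and the 3-green-checkmark check in the Cluster Master is still a manual step):

```shell
# Sketch of the restart-one-indexer-at-a-time step above.
# Polling test.ping only proves the minion is back; the 3 green
# checkmarks in the CM must still be confirmed by hand.
reboot_indexer() {
  idx="$1"   # e.g. indexer-1
  salt -C "( *moose* or *saf* ) and *${idx}*" cmd.run 'shutdown -r now'
  until salt -C "( *moose* or *saf* ) and *${idx}*" test.ping >/dev/null 2>&1; do
    sleep 30
  done
  # check the hot volume is mounted at /opt/splunkdata/hot
  salt -C "( *moose* or *saf* ) and *${idx}*" cmd.run 'df -h'
}
# reboot_indexer indexer-1   # then wait for 3 CM checks before indexer-2
```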

IF/WHEN an indexer doesn't come back up, follow these steps:
In AWS grab the instance ID.

Run the MDR/get-console.sh script.
Look for "Please enter passphrase for disk splunkhot"

In the AWS console stop the instance (which will remove ephemeral splunk data), then start it.
Then ensure /opt/splunkdata/hot exists.
If it doesn't, then manually run the cloudinit boot hook:
sh /var/lib/cloud/instance/boothooks/part-002

Ensure the hot folder is owned by splunk:splunk.
It will be waiting for the luks.key:
systemctl daemon-reload
systemctl restart systemd-cryptsetup@splunkhot
It is waiting at a command prompt; when you restart the service it picks up the key from a file. Systemd sees the cryptsetup service as a dependency for the splunk service.
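The dependency relationship described above can be confirmed on a recovered indexer (a hedged sketch; the splunk unit name is taken from the commands in these notes):

```shell
# Sketch: show that systemd-cryptsetup@splunkhot sits in the splunk
# unit's dependency tree, which is why splunk waits on the passphrase.
crypt_dep_of_splunk() {
  systemctl list-dependencies splunk | grep -i cryptsetup
}
# crypt_dep_of_splunk   # expect a systemd-cryptsetup@splunkhot.service line
```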

# Restart indexers (one at a time; wait for 3 green checkmarks in the Cluster Master)
salt -C 'nga*indexer-1*' test.ping
salt -C 'nga*indexer-1*' cmd.run 'shutdown -r now'

# Repeat for indexer-2 and indexer-3

# Ensure all have been restarted. Then done with NGA.
salt -C '*nga*' cmd.run 'uptime'

-------------------------------------------------

###
Brad's Actual Patching
###

Starting with Moose and internal infra, Wave 1. Check disk space for potential issues.

salt -C '* not ( afs* or saf* or nga* )' test.ping --out=txt
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'df -h /boot'
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'df -h /var/log'   # some at 63%
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'df -h /var'       # one at 74%
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'df -h'
# Review packages that will be updated. Some packages are versionlocked (collectd, Splunk, etc.).
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'yum check-update'
salt -C '* not ( afs* or saf* or nga* )' pkg.upgrade
This error: "error: unpacking of archive failed on file /usr/lib/python2.7/site-packages/urllib3/packages/ssl_match_hostname: cpio: rename failed" is fixed with:
pip uninstall urllib3

This error is caused by the versionlock on the package. Use this to view the list:
yum versionlock list

Error: Package: salt-minion-2018.3.4-1.el7.noarch (@salt-2018.3)
       Requires: salt = 2018.3.4-1.el7
       Removing: salt-2018.3.4-1.el7.noarch (@salt-2018.3)
           salt = 2018.3.4-1.el7
       Updated By: salt-2018.3.5-1.el7.noarch (salt-2018.3)
           salt = 2018.3.5-1.el7
Error: installing package kernel-3.10.0-1062.12.1.el7.x86_64 needs 7MB on the /boot filesystem
## Install yum-utils ##
yum install yum-utils

## package-cleanup: set count to how many old kernels you want left ##
package-cleanup --oldkernels --count=1

If the VPN server stops working, try a stop and start of the VPN server. The private IP will probably change.
###
Wave 2 Internals
###

Be sure to select ALL events in Sensu for silencing, not just the first 25.
Sensu -> Entities -> Sort (name) -> Select Entity and Silence. This will silence both keepalive and other checks.
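If the sensuctl CLI is available, the same silencing can be scripted instead of clicking through the UI (hedged: the `entity:<name>` subscription convention and these sensuctl flags are assumptions about the Sensu Go CLI, and the expire window here is arbitrary):

```shell
# Sketch: silence every check on one entity, keepalive included.
# Assumes Sensu Go's entity:<name> subscription naming.
silence_entity() {
  sensuctl silenced create \
    --subscription "entity:$1" \
    --reason "MDR patching" \
    --expire 14400
}
# for e in vault-3 sensu; do silence_entity "$e"; done
```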
Some silenced events will not unsilence and will need to be manually unsilenced.
Some silenced events will still trigger; not sure why. The keepalive still triggers VictorOps.
***IDEA: restart the sensu server and the vault-3 server first; this helps with clearing the silenced entities.
salt -L 'vault-3.msoc.defpoint.local,sensu.msoc.defpoint.local' test.ping
salt -L 'vault-3.msoc.defpoint.local,sensu.msoc.defpoint.local' cmd.run 'shutdown -r now'
salt -C '* not ( moose-splunk-indexer* or afs* or saf* or nga* or vault-3* or sensu* )' test.ping --out=txt
salt -C '* not ( moose-splunk-indexer* or afs* or saf* or nga* or vault-3* or sensu* )' cmd.run 'shutdown -r now'
# You will lose connectivity to OpenVPN and the salt master
# Log back in and verify they are back up
salt -C '* not ( moose-splunk-indexer* or afs* or saf* or nga* )' cmd.run 'uptime' --out=txt
###
Wave 2 Moose
###

salt -C 'moose-splunk-indexer*' test.ping --out=txt
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'shutdown -r now'
# Indexers take a while to restart
ping moose-splunk-indexer-1.msoc.defpoint.local
# WAIT FOR SPLUNK CLUSTER TO HAVE 3 CHECKMARKS
indexer-2 is not coming back up... look at the screenshot in AWS... see this: "Probing EDD (edd=off to disable)... ok"
Look at the system log in AWS, see this: "Please enter passphrase for disk splunkhot!:"

In the AWS console stop the instance (which will remove ephemeral splunk data), then start it.
Then ensure /opt/splunkdata/hot exists.
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'df -h'
IF the MOUNT for /opt/splunkdata/hot DOESN'T EXIST, STOP SPLUNK! Splunk will write to the wrong volume.
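The warning above can be made mechanical (a hedged sketch: `mountpoint` is assumed present on the indexers, and the target is a placeholder minion ID):

```shell
# Sketch: if /opt/splunkdata/hot is not a real mount, stop splunk so it
# cannot fill the root volume with hot buckets.
stop_splunk_if_hot_missing() {
  target="$1"
  salt -C "$target" cmd.run \
    'mountpoint -q /opt/splunkdata/hot || systemctl stop splunk'
}
# stop_splunk_if_hot_missing 'moose-splunk-indexer-1.msoc.defpoint.local'
```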
Before mounting the new volume, clear out the wrong /opt/splunkdata/.

salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl stop splunk'
Ensure /opt/splunkdata doesn't already exist before the boothook. (Theory: this causes the issue.)
ssh prod-moose-splunk-indexer-1
If it doesn't, then manually run the cloudinit boot hook:
sh /var/lib/cloud/instance/boothooks/part-002
salt -C 'nga-splunk-indexer-2.msoc.defpoint.local' cmd.run 'sh /var/lib/cloud/instance/boothooks/part-002'

Ensure the hot folder is owned by splunk:splunk:
ll /opt/splunkdata/
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'ls -larth /opt/splunkdata'
chown -R splunk: /opt/splunkdata/
salt -C 'nga-splunk-indexer-2.msoc.defpoint.local' cmd.run 'chown -R splunk: /opt/splunkdata/'
It will be waiting for the luks.key:
systemctl daemon-reload
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl daemon-reload'
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl restart systemd-cryptsetup@splunkhot'
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl | egrep cryptset'
It is waiting at a command prompt; when you restart the service it picks up the key from a file. Systemd sees the cryptsetup service as a dependency for the splunk service.

# Look for this; this is good, it is ready for a restart of splunk:
Cryptography Setup for splunkhot

systemctl restart splunk
salt -C 'moose-splunk-indexer-1.msoc.defpoint.local' cmd.run 'systemctl restart splunk'

Once /opt/splunkdata/hot is visible in df -h and the splunk service is started, wait for the cluster to have 3 green checkmarks.
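Instead of watching the UI for the 3 green checkmarks, the CM can be asked directly (hedged: the CM target glob and the /opt/splunk install path are assumptions; `splunk show cluster-status` is run on the cluster master):

```shell
# Sketch: summarize peer status from the cluster master. All peers "Up"
# plus replication/search factors met corresponds to the green checkmarks.
cm_cluster_status() {
  salt -C 'moose-splunk-cm*' cmd.run \
    '/opt/splunk/bin/splunk show cluster-status'
}
```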

Check the servers again to ensure all of them have rebooted:
salt -C 'moose-splunk-indexer*' cmd.run 'uptime' --out=txt | sort

Ensure all Moose and Internal have been rebooted:
salt -C '* not ( afs* or saf* or nga* )' cmd.run 'uptime' --out=txt | sort

###
Wave 1 POPs
###
salt -C '* not *.local' test.ping --out=txt
salt -C '* not *.local' cmd.run 'yum check-update'
salt -C '* not *.local' cmd.run 'uptime'
# Check for sufficient HD space
salt -C '* not *.local' cmd.run 'df -h /boot'
salt -C '* not *.local' cmd.run 'df -h /var/log'
salt -C '* not *.local' cmd.run 'df -h /var'
salt -C '* not *.local' cmd.run 'df -h'
salt -C '* not *.local' pkg.upgrade disablerepo=msoc-repo
salt -C '* not *.local' pkg.upgrade

###
Wave 2 POPs
###
DO NOT restart all POPs at the same time.
salt -C '*syslog-1* not *.local' cmd.run 'uptime'
salt -C '*syslog-1* not *.local' cmd.run 'shutdown -r now'

salt -C '*syslog-1* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'
# Look for /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid

SAF will need the setenforce run:
salt saf-splunk-syslog-1 cmd.run 'setenforce 0'
salt saf-splunk-syslog-1 cmd.run 'systemctl stop rsyslog'
salt saf-splunk-syslog-1 cmd.run 'systemctl start syslog-ng'

salt -C '*syslog-2* not *.local' cmd.run 'uptime'
salt -C '*syslog-2* not *.local' cmd.run 'shutdown -r now'
salt -C '*syslog-2* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'
salt -L 'nga-splunk-syslog-2,saf-splunk-syslog-2' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'

salt -C '*syslog-3* not *.local' cmd.run 'uptime'
salt -C '*syslog-3* not *.local' cmd.run 'shutdown -r now'

salt -C '*syslog-4* not *.local' cmd.run 'uptime'
salt -C '*syslog-4* not *.local' cmd.run 'shutdown -r now'

Repeat for syslog-5, syslog-6, syslog-7, and syslog-8.
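The "repeat for syslog-5 through syslog-8" step can be sketched as a loop (hedged: this reboots strictly one at a time and ignores the location-based batching discussed in these notes):

```shell
# Sketch: reboot the remaining POP syslog servers one at a time,
# reusing the 'not *.local' targeting from the commands above.
rolling_syslog_reboot() {
  for n in 5 6 7 8; do
    salt -C "*syslog-${n}* not *.local" cmd.run 'uptime'
    salt -C "*syslog-${n}* not *.local" cmd.run 'shutdown -r now'
  done
}
# rolling_syslog_reboot
```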
(Might be able to reboot some of these at the same time, if they are in different locations. Check the location grain on them.)
salt -C '*syslog* not *.local' grains.item location

afs-splunk-syslog-8: {u'location': u'az-east-us-2'}
afs-splunk-syslog-6: {u'location': u'az-central-us'}

afs-splunk-syslog-7: {u'location': u'az-east-us-2'}
afs-splunk-syslog-5: {u'location': u'az-central-us'}
afs-splunk-syslog-4: {u'location': u'San Antonio'}

salt -C 'afs-splunk-syslog*' grains.item location

salt -L 'afs-splunk-syslog-6,afs-splunk-syslog-8' cmd.run 'uptime'
salt -L 'afs-splunk-syslog-6,afs-splunk-syslog-8' cmd.run 'shutdown -r now'

salt -L 'afs-splunk-syslog-5,afs-splunk-syslog-7' cmd.run 'uptime'
salt -L 'afs-splunk-syslog-5,afs-splunk-syslog-7' cmd.run 'shutdown -r now'
# Verify logs are flowing
https://saf-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search
ddps03.corp.smartandfinal.com
index=* source=/opt/syslog-ng/* host=ddps* earliest=-15m | stats count by host

https://afs-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search
afssplhf103.us.accenturefederal.com
index=* source=/opt/syslog-ng/* host=afs* earliest=-15m | stats count by host

https://nga-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search
aws-syslog1-tts.nga.gov
index=network sourcetype="citrix:netscaler:syslog" earliest=-15m
index=* source=/opt/syslog-ng/* host=aws* earliest=-60m | stats count by host
POP ds (could these be restarted at the same time? Or in 2 batches?)
salt -C '*splunk-ds-1* not *.local' cmd.run 'uptime'
salt -C '*splunk-ds-1* not *.local' cmd.run 'shutdown -r now'

salt -C '*splunk-ds-2* not *.local' cmd.run 'uptime'
salt -C '*splunk-ds-2* not *.local' cmd.run 'shutdown -r now'

salt 'afs-splunk-ds-[234]' cmd.run 'uptime'
salt 'afs-splunk-ds-[234]' cmd.run 'shutdown -r now'

Don't forget ds-3 and ds-4.

salt '*splunk-ds*' cmd.run 'systemctl status splunk'

POP dcn
salt -C '*splunk-dcn-1* not *.local' cmd.run 'uptime'
salt -C '*splunk-dcn-1* not *.local' cmd.run 'shutdown -r now'

Did you get all of them?
salt -C '* not *local' cmd.run 'uptime'
###
Customer Slices Wave 1
###

salt -C 'afs*local or saf*local or nga*local' test.ping --out=txt
salt -C 'afs*local or saf*local or nga*local' cmd.run 'uptime'
salt -C 'afs*local or saf*local or nga*local' cmd.run 'df -h'
salt -C 'afs*local or saf*local or nga*local' pkg.upgrade

The epel repo is enabled on afs-splunk-hf (I don't know why).
Had to run this to avoid an issue with the collectd package on msoc-repo:

yum update --disablerepo epel
Silence Sensu first!
Customer Slices Search Heads Only Wave 2
salt -C 'afs-splunk-sh*local or saf-splunk-sh*local or nga-splunk-sh*local' test.ping --out=txt
salt -C 'afs-splunk-sh*local or saf-splunk-sh*local or nga-splunk-sh*local' cmd.run 'df -h'
salt -C 'afs-splunk-sh*local or saf-splunk-sh*local or nga-splunk-sh*local' cmd.run 'shutdown -r now'
salt -C 'afs-splunk-sh*local or saf-splunk-sh*local or nga-splunk-sh*local' cmd.run 'uptime'
###
Customer Slices CMs Wave 2
###

Silence Sensu first!
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'df -h'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'shutdown -r now'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'systemctl status splunk'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'uptime'

afs-splunk-hf has a hard time restarting. Might need to stop then start the instance.

Reboot indexers 1 at a time (the AFS cluster gets backed up when an indexer is rebooted):
salt -C '*splunk-indexer-1* not moose*' test.ping --out=txt
salt -C '*splunk-indexer-1* not moose*' cmd.run 'df -h'
salt -C '*splunk-indexer-1* not moose*' cmd.run 'shutdown -r now'
Wait for 3 green checkmarks.
# Repeat for indexers 2 & 3
salt -C '*splunk-indexer-2* not moose*' test.ping

3 green checkmarks
salt -C '*splunk-indexer-3* not moose*' test.ping
salt -L 'afs-splunk-indexer-3.msoc.defpoint.local,saf-splunk-indexer-3.msoc.defpoint.local' cmd.run 'df -h'

NGA had a hard time getting 3 checkmarks. The CM was waiting on stuck buckets; force-rolled the buckets to get green checkmarks.

salt -C '* not *.local' cmd.run 'uptime | grep days'
***MAKE SURE the Sensu checks are not silenced.***