# Patching Notes

## Timeline

* Asha likes to submit her FedRAMP packet before about the 20th, so try to get patching done before that.
* Send the email ~1 week before.
* Give a 15-minute warning in the Slack [#xdr-patching](https://afscyber.slack.com/archives/CJ462RRBM), [#xdr-content-aas](https://afscyber.slack.com/archives/C010NEX6X1N) channels, etc. before patching.

## HEY BRAD: READ ME!

Run the command below to deal with the message: `"This system is not registered with an entitlement server. You can use subscription-manager to register."`

```
date; salt '*' state.sls os_modifications.rhel_deregistration --output-diff
```

It's safe to run on `*` and will remove any RHEL registration (or warnings about lack thereof) on systems that have a billing code.

Also, a reminder that the legacy `Reposerver` was shut down in late February 2021, so consider it a suspect if you have issues.

## Patching Process

Each month the AWS `Commercial (Legacy) PROD` & `GovCloud (GC) TEST/PROD` environments must be patched to comply with FedRAMP requirements. This wiki page outlines the process for patching the environment.

Below are the email template that needs to be sent out prior to patching and the email addresses of the individuals who should get the email.

```
Leonard, Wesley A. ; Waddle, Duane E. ; Nair, Asha A. ; Middleton, Sheryl ; Crawley, Angelita ; Rivas, Gregory A. ; Damstra, Frederick T. ; Poulton, Brad ; Williams, Colby ; Naughton, Brandon ; Cooper, Jeremy ;
```

```
SUBJECT: Patching
```

```
It is time for monthly patching again. Patching is going to occur during business hours within the next week or two. Everything - including Customer POP/LCPs - needs patching. We will be doing the servers in 2 waves.

For real-time patching announcements, join the Slack #xdr-patching Channel. Announcements will be posted in that channel on what is going down and when.
Here is the proposed patching schedule:

Wednesday 11:
* Moose and Internal infrastructure
  * Patching
* CaaSP
  * Patching

Thursday 12:
* Moose and Internal
  * Reboots
* All Customer PoP/LCP
  * Patching (AM)
  * Reboots (PM)
* CaaSP
  * Reboots

Monday 16:
* All Customer XDR Cloud
  * Patching
* All Search heads
  * Reboots (PM)

Tuesday 17:
* All Remaining XDR Cloud
  * Reboots (AM)

The customer and user impact will be during the reboots so they will be done in batches to reduce our total downtime.
```

## Detailed Steps (Brad's patching)

### Day 1 (Wednesday), step 1 of 1: Moose and Internal Infrastructure - Patching

Patch `GC TEST` first! This helps find problems in `TEST` and potential problems in `PROD`.

Post to Slack [#xdr-patching Channel](https://afscyber.slack.com/archives/CJ462RRBM):

```
FYI, patching today.
* This morning, patches to all internal systems, moose, and CaaSP.
* No reboots, so impact should be minimal.
```

> :warning: **See if GitHub has any updates!** Coordinate with Duane or Colby on GitHub Patching.

Start with Moose and Internal infra patching within GovCloud TEST. Check disk space for potential issues. Return here to start on PROD after TEST is patched.

```
# Test connectivity between Salt Master and Minions
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* or threatq* )' test.ping --out=txt

# Fred's update for df -h - flags filesystems at 80% utilization or higher
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* or threatq* )' cmd.run 'df -h | egrep "[890][0-9]\%"'

# Review packages that will be updated. Some packages are versionlocked (Collectd, Splunk, etc.).
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* or threatq* )' cmd.run 'yum check-update'

# Older commands that are still viable if Fred's one-liner has issues; feel free to skip and move to pkg.upgrade line
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* or threatq* )' cmd.run 'df -h /boot'
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* or threatq* )' cmd.run 'df -h /var/log' # some at 63%
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* or threatq* )' cmd.run 'df -h /var' # one at 74%
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* or threatq* )' cmd.run 'df -h'
```

> :warning: **OpenVPN sometimes goes down with patching and needs a restart of the service.**

### Patch the VPN after everything else. I am not sure which package is causing the issue. Kernel? bind-utils?

### Also, the phantom_repo pkg wants to upgrade, but we are not ready. Let's exclude that package and OpenVPN server to prevent errors.

```
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* or threatq* or openvpn* )' pkg.upgrade exclude='phantom_repo'

# Now patch the OpenVPN server and monitor during the process in case any issues occur; i.e., you get kicked off of VPN, etc.
salt -C 'openvpn*' pkg.upgrade

# Just to be sure, run it again to make sure nothing got missed.
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or nihor* or bp-ot-demo* or bas-* or doed* or ca-c19* or frtib* or dgi* )' pkg.upgrade exclude='phantom_repo'
```

> :warning: After upgrades, check on Portal to make sure it is still up.
- Prod: [Portal](https://portal.xdr.accenturefederalcyber.com/choose/login/)
- Test: [Portal](https://portal.xdrtest.accenturefederalcyber.com/choose/login/)

If Portal is down, start by restarting the Docker service. My guess is that patching messes with the network stack and the Docker service doesn't like that.

```
salt 'customer-portal*' cmd.run 'systemctl restart docker'
```

Portal Notes are here for further troubleshooting if necessary: [Portal Notes](Portal%20Notes.md)

#### Patch CaaSP

See [Patch CaaSP instructions](Patching%20Notes--CaaSP.md)

#### Troubleshooting

Phantom error

```
phantom.msoc.defpoint.local:
    ERROR: Problem encountered upgrading packages. Additional info follows:

    changes:
        ----------
    result:
        ----------
        pid:
            40718
        retcode:
            1
        stderr:
            Running scope as unit run-40718.scope.
            Error in PREIN scriptlet in rpm package phantom_repo-4.9.39220-1.x86_64
            phantom_repo-4.9.37880-1.x86_64 was supposed to be removed but is not!
        stdout:
            Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
            Logging to /var/log/phantom/phantom_install_log
            error: %pre(phantom_repo-4.9.39220-1.x86_64) scriptlet failed, exit status 7
```

##### Error: `error: unpacking of archive failed on file /usr/lib/python2.7/site-packages/urllib3/packages/ssl_match_hostname: cpio: rename failed`

`salt 'ma-*' cmd.run 'pip uninstall urllib3 -y'`

This error is caused by the versionlock on the package.
Use this to view the list of versionlocked packages:

```
yum versionlock list

Error: Package: salt-minion-2018.3.4-1.el7.noarch (@salt-2018.3)
           Requires: salt = 2018.3.4-1.el7
           Removing: salt-2018.3.4-1.el7.noarch (@salt-2018.3)
               salt = 2018.3.4-1.el7
           Updated By: salt-2018.3.5-1.el7.noarch (salt-2018.3)
               salt = 2018.3.5-1.el7
```

##### Error: installing package `kernel-3.10.0-1062.12.1.el7.x86_64` needs 7MB on the /boot filesystem

```
# Install yum utils
yum install yum-utils

# package-cleanup: set --count to how many old kernels you want left
package-cleanup --oldkernels --count=1 -y
```

##### If VPN server stops working

Try a stop and start of the VPN service ([instructions](OpenVPN%20Notes.md)). The private IP will probably change.

##### ISSUE: salt-minion doesn't come back and has this error

`/usr/lib/dracut/modules.d/90kernel-modules/module-setup.sh: line 16: /lib/modules/3.10.0-957.21.3.el7.x86_64///lib/modules/3.10.0-957.21.3.el7.x86_64/kernel/sound/drivers/mpu401/snd-mpu401.ko.xz: No such file or directory`

RESOLUTION: Manually reboot the OS; this is most likely due to a kernel upgrade.

### Day 2 (Thursday), step 1 of 4: Reboot Internals

Long Day of Rebooting!

Post to Slack [#xdr-patching Channel](https://afscyber.slack.com/archives/CJ462RRBM):

```
FYI, patching today. Rebooting TEST
* In about 15 minutes: Reboots of moose, internal systems and CaaSP, including the VPN.
* Following that, patching (but not rebooting) of all customer PoPs/LCPs.
* Then this afternoon, reboots of those PoPs/LCPs.
```

Be sure to select ALL entities in Sensu for silencing, not just the first 25. Sensu -> Entities -> Sort (name) -> Select Entity and Silence. This will silence both keepalive and other checks. Some silenced events will not unsilence and will need to be manually unsilenced.

*IDEA! Restart the sensu server and the vault-3 server first. This helps with the clearing of the silenced entities.*

#### GovCloud (TEST)

SSH via TSH into `GC Salt-Master` to reboot servers in GC that are on `gc-dev`.
```
# Login to Teleport
tsh --proxy=teleport.xdrtest.accenturefederalcyber.com login

# SSH to GC Salt-Master (TEST)
tsh ssh node=salt-master
```

Start with `Sensu` and `Vault`

```
# Vault-3 and Sensu
salt -C 'vault-3* or sensu*' test.ping --out=txt
date; salt -C 'vault-3* or sensu*' system.reboot
watch "salt -C 'vault-3* or sensu*' test.ping --out=txt"
```

Reboot majority of servers in GC.

```
salt -C '*com not ( interconnect* or modelclient-splunk-idx* or moose-splunk-idx* or resolver* or sensu* or threatq-* or vault-3* )' test.ping --out=txt
date; salt -C '*com not ( interconnect* or modelclient-splunk-idx* or moose-splunk-idx* or resolver* or sensu* or threatq-* or vault-3* )' system.reboot
```

> :warning: **You will lose connectivity to OpenVPN and the Salt Master.** Log back in and verify they are back up.

```
watch "salt -C '*com not ( interconnect* or modelclient-splunk-idx* or moose-splunk-idx* or resolver* or sensu* or threatq-* or vault-3* )' cmd.run 'uptime' --out=txt"
```

Take care of the Interconnects/Resolvers one at a time. Reboot one of each at the same time.

```
salt -C 'interconnect-0.pvt.*com or resolver-govcloud.pvt.*com' test.ping --out=txt
date; salt -C 'interconnect-0.pvt.*com or resolver-govcloud.pvt.*com' system.reboot
watch "salt -C 'interconnect-0.pvt.*com or resolver-govcloud.pvt.*com' test.ping --out=txt"

salt -C 'interconnect-1.pvt.*com or resolver-govcloud-2.pvt.*com' test.ping --out=txt
date; salt -C 'interconnect-1.pvt.*com or resolver-govcloud-2.pvt.*com' system.reboot
watch "salt -C 'interconnect-1.pvt.*com or resolver-govcloud-2.pvt.*com' test.ping --out=txt"
```

Check uptime on the minions in GC to make sure you didn't miss any.

```
salt -C '*com not ( interconnect* or modelclient-splunk-idx* or moose-splunk-idx* or resolver* or sensu* or threatq-* or vault-3* )' cmd.run 'uptime | grep days'
```

### Duane Section (feel free to bypass)

--

I (Duane) did this a little differently.
Salt-master first, then OpenVPN, then everything but interconnects and resolvers. Interconnects and resolvers reboot one at a time.

```
salt -C '* not ( afs* or nga* or ma-* or dc-c19* or la-c19* or openvpn* or qcomp* or salt-master* or moose-splunk-indexer-* or interconnect* or resolver* )' cmd.run 'shutdown -r now'
```

--

#### GovCloud (PROD)

Post to Slack [#xdr-patching Channel](https://afscyber.slack.com/archives/CJ462RRBM) and [#xdr-soc Channel](https://afscyber.slack.com/archives/CFUP7STE2):

```
FYI, patching today. Rebooting PROD
* In about 15 minutes: Reboots of moose, internal systems and CaaSP, including the VPN.
* Following that, patching (but not rebooting) of all customer PoPs/LCPs.
* Then this afternoon, reboots of those PoPs/LCPs.
```

SSH via TSH into `GC Salt-Master` to reboot servers in GC that are on `gc-prod`.

```
# Login to Teleport
tsh --proxy=teleport.xdr.accenturefederalcyber.com login

# SSH to GC Salt-Master (PROD)
tsh ssh node=salt-master
```

Start with `Vault` and `Sensu`

```
# Vault-1 and Sensu
salt -C 'vault-1*com or sensu*com' test.ping --out=txt
date; salt -C 'vault-1*com or sensu*com' system.reboot
watch "salt -C 'vault-1*com or sensu*com' test.ping --out=txt"
```

Reboot majority of servers in GC.
```
salt -C '*com not ( afs* or nga* or ma-* or dc-c19* or la-c19* or dgi-* or moose-splunk-idx* or modelclient-splunk-idx* or nihor* or bp-ot-demo* or bas-* or doed* or frtib* or ca-c19* or interconnect* or resolver* or vault-1*com or sensu*com )' test.ping --out=txt
date; salt -C '*com not ( afs* or nga* or ma-* or dc-c19* or la-c19* or dgi-* or moose-splunk-idx* or modelclient-splunk-idx* or nihor* or bp-ot-demo* or bas-* or doed* or frtib* or ca-c19* or interconnect* or resolver* or vault-1*com or sensu*com )' system.reboot
```

> :warning: **You will lose connectivity to OpenVPN and the Salt Master.** Log back in and verify they are back up.

```
watch "salt -C '*accenturefederalcyber.com not ( afs* or nga* or ma-* or dc-c19* or la-c19* or dgi-* or moose-splunk-idx* or modelclient-splunk-idx* or nihor* or bp-ot-demo* or bas-* or doed* or frtib* or ca-c19* or interconnect* or resolver* or vault-1*com or sensu*com )' cmd.run 'uptime' --out=txt"
```

Take care of the interconnects/resolvers one at a time and with the `GC Salt Master`. Reboot one of each at the same time.

```
salt -C 'interconnect-0.pvt.*com or resolver-govcloud.pvt.*com' test.ping --out=txt
date; salt -C 'interconnect-0.pvt.*com or resolver-govcloud.pvt.*com' system.reboot
watch "salt -C 'interconnect-0.pvt.*com or resolver-govcloud.pvt.*com' test.ping --out=txt"

salt -C 'interconnect-1.pvt.*com or resolver-govcloud-2.pvt.*com' test.ping --out=txt
date; salt -C 'interconnect-1.pvt.*com or resolver-govcloud-2.pvt.*com' system.reboot
watch "salt -C 'interconnect-1.pvt.*com or resolver-govcloud-2.pvt.*com' test.ping --out=txt"
```

Check uptime on the minions in GC to make sure you didn't miss any.
```
salt -C '*accenturefederalcyber.com not ( afs* or nga* or ma-* or dc-c19* or la-c19* or dgi-* or moose-splunk-idx* or modelclient-splunk-idx* or nihor* or bp-ot-demo* or bas-* or doed* or frtib* or ca-c19* or interconnect* or resolver* or vault-1*com or sensu*com )' cmd.run 'uptime | grep days'
```

Verify Portal is up: [Portal](https://portal.xdr.accenturefederalcyber.com/)

Look in Sensu for any silent alerts.

#### Reboot CaaSP

See [Patching Notes--CaaSP.md](Patching%20Notes--CaaSP.md)

### Day 2 (Thursday), Step 2 of 4: Reboot Moose

Don't forget `GC TEST`! Start there first.

Log in to the [Moose Splunk CM](https://moose-splunk-cm.pvt.xdr.accenturefederalcyber.com:8000/) and go to `settings->indexer clustering`.

```
salt -C 'moose-splunk-idx*' test.ping --out=txt

# Do the first indexer
salt moose-splunk-idx-4e1.pvt.xdr.accenturefederalcyber.com test.ping --out=txt
date; salt moose-splunk-idx-4e1.pvt.xdr.accenturefederalcyber.com system.reboot

# Indexers take a while to restart
watch "salt moose-splunk-idx-4e1.pvt.xdr.accenturefederalcyber.com cmd.run 'uptime' --out=txt"
ping moose-splunk-idx-4e1.pvt.xdr.accenturefederalcyber.com
```

#### WAIT FOR SPLUNK CLUSTER TO HAVE 3 CHECKMARKS

Repeat the above reboot steps for the additional indexers, waiting for `3 green checks` in between each one.
```
# Do the second indexer
salt moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com test.ping --out=txt
date; salt moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com system.reboot

# Indexers take a while to restart
watch "salt moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

# Do the third indexer
salt moose-splunk-idx-7fa.pvt.xdr.accenturefederalcyber.com test.ping --out=txt
date; salt moose-splunk-idx-7fa.pvt.xdr.accenturefederalcyber.com system.reboot

# Indexers take a while to restart
watch "salt moose-splunk-idx-7fa.pvt.xdr.accenturefederalcyber.com cmd.run 'uptime' --out=txt"

# Verify all indexers patched:
salt 'moose-splunk-idx*' cmd.run 'uptime' --out=txt
```

#### Troubleshooting

##### If the indexer/checkmarks don't come back ( legacy information )

If an indexer is not coming back up, look at the screenshot in AWS. If you see this:

`Probing EDD (edd=off to disable)... ok`

then look at the system log in AWS for this:

`Please enter passphrase for disk splunkhot!`

IF/WHEN an `Indexer` doesn't come back up, follow these steps:

```
- In the AWS console, grab the instance ID.
- Run the MDR/get-console.sh (Duane's script for pulling the system log)
- Look for "Please enter passphrase for disk splunkhot"
```

In the AWS console, stop the instance (which will remove ephemeral splunk data), then start it. Then ensure that `/opt/splunkdata/hot` exists.

```
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h'
```

IF the MOUNT for `/opt/splunkdata/hot` DOESN'T EXIST, STOP SPLUNK! Splunk will write to the wrong volume. Before mounting the new volume, clear out the wrong `/opt/splunkdata/`

```
rm -rf /opt/splunkdata/hot/*
```

```
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl stop splunk'
```

Ensure the `/opt/splunkdata` doesn't already exist, before the `boothook`.

```
ssh prod-moose-splunk-indexer-1
```

If it doesn't, then manually run the `cloudinit boothook`.
```
sh /var/lib/cloud/instance/boothooks/part-002
salt -C 'nga-splunk-indexer-2.msoc.defpoint.local' cmd.run 'sh /var/lib/cloud/instance/boothooks/part-002'
```

Ensure the hot directory is owned by `splunk:splunk`

```
ll /opt/splunkdata/
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'ls -larth /opt/splunkdata'

chown -R splunk: /opt/splunkdata/
# The target is blank in these notes; fill in the affected indexer before running.
salt -C '' cmd.run 'chown -R splunk: /opt/splunkdata/'
```

It will be waiting for the `luks.key`

```
systemctl daemon-reload
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl daemon-reload'
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl restart systemd-cryptsetup@splunkhot'
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl | egrep cryptset'
```

It is waiting at the prompt for the passphrase; when you restart the service, it picks up the key from a file. `Systemd` sees the crypt setup service as a dependency for the Splunk service.

Look for this. This is good; it is ready for a restart of splunk:

`Cryptography Setup for splunkhot`

```
systemctl restart splunk
salt -C 'moose-splunk-idx-422.pvt.xdr.accenturefederalcyber.com' cmd.run 'systemctl restart splunk'
```

Once the `/opt/splunkdata/hot` is visible in `df -h` and the splunk service is started, then wait for the cluster to have 3 green checkmarks.

Check the servers again to ensure all of them have rebooted.
```
salt -C 'moose-splunk-idx*' cmd.run 'uptime' --out=txt | sort
```

Ensure all Moose and Internal have been rebooted

```
salt -C '* not ( afs* or bas-* or ca-c19* or dc-c19* or dgi-* or frtib-* or la-c19* or ma-* or nga* or nihors-* )' cmd.run uptime
```

### Day 2 (Thursday), Step 3 of 4, Patching LCPs

```
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' test.ping --out=txt
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'yum check-update'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'uptime'

# Check for sufficient space (or use Fred's method, next comment)
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h /boot'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h /var/log'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h /var'
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h'

# Fred's update for df -h:
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' cmd.run 'df -h | egrep "[890][0-9]\%"'

# Updates
salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade

# If a repo gives an error, you may need to disable it.
# salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade disablerepo=msoc-repo # Optional for fix
# salt -C '* not *.local not *.pvt.xdr.accenturefederalcyber.com' pkg.upgrade disablerepo=splunk-7.0 # Optional for fix
# on 2020-07-23: salt -C 'nga-splunk-ds-1 or afs-splunk-ds-1 or afs-splunk-ds-2' pkg.upgrade disablerepo=splunk-7.0 # Optional for fix
```

#### Troubleshooting

Error on `afs-splunk-ds-3`: `error: cannot open Packages database in /var/lib/rpm`

Solution:

```
mkdir /root/backups.rpm/
cp -avr /var/lib/rpm/ /root/backups.rpm/
rm -f /var/lib/rpm/__db*
db_verify /var/lib/rpm/Packages
rpm --rebuilddb
yum clean all
```

##### Error on `*-ds`: Could not resolve 'reposerver.msoc.defpoint.local/splunk/7.0/repodata/repomd.xml'

Reason: POP Nodes shouldn't be using the `.local` DNS address.

Solution: Needs a permanent fix. For now, patch with the repo disabled:

```
salt -C '*-ds* not afs-splunk-ds-4' pkg.upgrade disablerepo=splunk-7.0
```

### Day 2 (Thursday), Step 4 of 4 (afternoon), Reboots LCPs

Post to Slack [#xdr-patching Channel](https://afscyber.slack.com/archives/CJ462RRBM)

```
Resuming today's patching with the reboots of customer LCPs.
```

Remember to silence Sensu alerts before restarting servers.

NOTE: Restart LCPs one server at a time at a location in order to minimize risk of concurrent outages.

#### First syslog servers

Restart the first syslog server by itself to check for reboot issues.
```
salt -C '*syslog-1* not *.local' cmd.run 'uptime && hostname'
date; salt -C '*syslog-1* not *.local' system.reboot
watch "salt -C '*syslog-1* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"
# Look for /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid
```

#### Second syslog servers

```
salt -C '*syslog-2* not *.local' cmd.run 'uptime && hostname'
date; salt -C '*syslog-2* not *.local' system.reboot
watch "salt -C '*syslog-2* not *.local' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'"
```

#### Remaining Syslog Servers

(We might be able to reboot some of these at the same time, if they are in different locations. Check the location grain on them.)

`grains.item location`

```
# Example location grains:
# afs-splunk-syslog-8: {u'location': u'az-east-us-2'}
# afs-splunk-syslog-7: {u'location': u'az-east-us-2'}
# afs-splunk-syslog-4: {u'location': u'San Antonio'}
salt -C '*-splunk-syslog*' grains.item location

salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7' cmd.run 'uptime'
date; salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7' system.reboot
watch "salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7' test.ping"
salt -C '*splunk-syslog-3 or *splunk-syslog-5 or *splunk-syslog-7' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'

salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' cmd.run 'uptime'
date; salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' system.reboot
watch "salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' test.ping"
salt -C '*splunk-syslog-2 or *splunk-syslog-4 or *splunk-syslog-6 or *splunk-syslog-8' cmd.run 'ps -ef | grep syslog-ng | grep -v grep'
```

#### Troubleshooting

##### Possible issue: if syslog-ng doesn't start, it might need the setenforce 0 command run ( left here for legacy reasons )

> :warning: 2020-06-11 - had to do this for `afs-syslog-5` through `8`

```
salt saf-splunk-syslog-1 cmd.run 'setenforce 0'
salt saf-splunk-syslog-1 cmd.run 'systemctl stop rsyslog'
salt saf-splunk-syslog-1 cmd.run 'systemctl start syslog-ng'
watch "salt -C '*syslog-1* not *.local' test.ping"
```

If the syslog-ng service doesn't start, check the `syslog-ng` file for `oms agent` added configurations.

##### Possible issue: NGA LCP nodes hostnames change after reboot and Sensu agent name changes.

```
salt 'nga-splunk-ds-1' cmd.run 'hostnamectl set-hostname aws-splnks1-tts.nga.gov'
salt 'nga-splunk-ds-1' cmd.run 'hostnamectl status'
salt 'nga-splunk-ds-1' cmd.run 'systemctl stop sensu-agent'
salt 'nga-splunk-ds-1' cmd.run 'systemctl start sensu-agent'
```

Repeat for other LCP nodes

#### Verify logs are flowing

AFS Splunk Search Head - [Access here](https://afs-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search) to check logs on `afssplhf103.us.accenturefederal.com`

```
#index=* source=/opt/syslog-ng/* host=afs* earliest=-15m | stats count by host

#New search string
| tstats count WHERE index=* source=/opt/syslog-ng/* host=afs* earliest=-15m latest=now BY host
```

Should see at least 5 hosts

NGA Splunk Search Head - [Access here](https://nga-splunk-sh.msoc.defpoint.local:8000/en-US/app/search/search) to check logs on `aws-syslog1-tts.nga.gov`

```
index=network sourcetype="citrix:netscaler:syslog" earliest=-15m latest=now
index=zscaler sourcetype="zscaler:web" earliest=-15m latest=now
```

NOTICE: NGA `sourcetype="zscaler:web"` logs are handled by `fluentd` and can lag behind by 10 minutes.

```
#index=* source=/opt/syslog-ng/* host=aws* earliest=-60m | stats count by host

#New search string
| tstats count WHERE index=* source=/opt/syslog-ng/* host=aws* earliest=-60m latest=now BY host
```

### POP DS (could these be restarted at the same time? Or in 2 batches?)
Don't forget `DS-4`

```
# Try reboot at the same time
salt '*splunk*ds*' cmd.run 'uptime'
date; salt '*splunk*ds*' system.reboot
watch "salt '*splunk*ds*' test.ping"
salt '*splunk-ds*' cmd.run 'systemctl status splunk'
```

Did you get all of them?

```
salt -C '* not *local not *.pvt.xdr.accenturefederalcyber.com' cmd.run uptime
```

Don't forget to un-silence Sensu.

### Day 3 (Monday), Step 1 of 2, Customer Slices Patching

Shorter day of Patching! :-)

Post to Slack [#xdr-patching Channel](https://afscyber.slack.com/archives/CJ462RRBM):

```
Today's patching is all XDR customer environments. Indexers and Search heads will be patched this morning. Search heads will be rebooted this afternoon, and the indexers will be rebooted tomorrow. Thank you for your cooperation.
```

Run these commands on `GC Prod Salt Master`. These notes should patch all splunks.

```
salt -C 'afs*local or afs*com or ma-*com or la-*com or nga*com or nga*local or dc*com or nihor*com or bp-ot-demo*com or bas-*com or doed*com or frtib*com or ca-c19*com or dgi*com' test.ping --out=txt
salt -C 'afs*local or afs*com or ma-*com or la-*com or nga*com or nga*local or dc*com or nihor*com or bp-ot-demo*com or bas-*com or doed*com or frtib*com or ca-c19*com or dgi*com' cmd.run 'uptime'
salt -C 'afs*local or afs*com or ma-*com or la-*com or nga*com or nga*local or dc*com or nihor*com or bp-ot-demo*com or bas-*com or doed*com or frtib*com or ca-c19*com or dgi*com' cmd.run 'df -h'

# Fred's update for df -h:
salt -C 'afs*local or afs*com or ma-*com or la-*com or nga*com or nga*local or dc*com or nihor*com or bp-ot-demo*com or bas-*com or doed*com or frtib*com or ca-c19*com or dgi*com' cmd.run 'df -h | egrep "[890][0-9]\%"'
salt -C 'afs*local or afs*com or ma-*com or la-*com or nga*com or nga*local or dc*com or nihor*com or bp-ot-demo*com or bas-*com or doed*com or frtib*com or ca-c19*com or dgi*com' pkg.upgrade
```

NOTE: Some Splunk Indexers always have high disk space usage (83%). This is normal.
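
The 83% note above is a reminder that the `egrep "[890][0-9]\%"` filter has a fixed cutoff of 80%. Below is a minimal sketch of a filter with an adjustable threshold, assuming POSIX `awk` and `df -P` style output (use% in column 5, mount point in column 6). `high_usage` is a hypothetical helper name, not something that exists in our tooling.

```shell
# Hypothetical helper (not part of the existing tooling).
# Reads `df -P` style output on stdin and prints "<mount> <use%>" for every
# filesystem at or above the threshold (default 80).
high_usage() {
  threshold="${1:-80}"
  awk -v t="$threshold" '
    # Only rows whose 5th column looks like a percentage; this skips the header.
    $5 ~ /^[0-9]+%$/ {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= t) print $6, $5
    }'
}
```

For example, `df -P | high_usage 85` run on an indexer that normally sits at 83% would not list that volume, while anything at 85% or above still shows up.
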
#### Troubleshooting

EPEL repo is enabled on `afs-splunk-hf` (I don't know why); had to run this to avoid an issue with the collectd package on `msoc-repo`:

`yum update --disablerepo epel`

### Day 3 (Monday afternoon), Step 2 of 2, Customer Slices Search Heads Only Reboots

Silence Sensu first!

Post to Slack [#xdr-patching Channel](https://afscyber.slack.com/archives/CJ462RRBM), [#xdr-soc Channel](https://afscyber.slack.com/archives/CFUP7STE2), and [#xdr-engineering Channel](https://afscyber.slack.com/archives/CFTJSTGDB):

```
FYI: Rebooting the Splunk search heads as part of today's patching. Reboots will occur in 15 minutes.
```

Commands to run on the `GC PROD Salt Master`:

```
salt -C '*-sh* and not *moose* and not qcompliance* and not fm-shared-search*' test.ping --out=txt | sort
salt -C '*-sh* and not *moose* and not qcompliance* and not fm-shared-search*' cmd.run 'df -h | egrep "[890][0-9]\%"'
date; salt -C '*-sh* and not *moose* and not qcompliance* and not fm-shared-search*' system.reboot
watch "salt -C '*-sh* and not *moose* and not qcompliance* and not fm-shared-search*' cmd.run 'uptime'"
```

Don't forget to un-silence Sensu.

### Day 4 (Tuesday), Step 1 of 1, Customer Slices CMs Reboots

Long Day of Reboots!

Post to Slack [#xdr-patching Channel](https://afscyber.slack.com/archives/CJ462RRBM):

```
Today's patching is the indexing clusters for all XDR customer environments. Cluster masters and indexers will be rebooted. Thank you for your cooperation.
```

Silence `Sensu` first!

Run on the `GC PROD Salt Master`.

```
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt
# Did you silence sensu?
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'df -h'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'
date; salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' system.reboot
watch "salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' test.ping --out=txt"
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'systemctl status splunk | grep Active'
salt -C '( *splunk-cm* or *splunk-hf* ) not moose*' cmd.run 'uptime'
```

#### ulimit errors

`May 27 17:08:57 la-c19-splunk-cm.msoc.defpoint.local splunk[3840]: /etc/rc.d/init.d/splunk: line 13: ulimit: open files: cannot modify limit: Invalid argument`

afs-splunk-hf has a hard time restarting. Might need to stop then start the instance.

#### Log into the CMs

Generate the URLs on the `GC Salt Master`:

```
for i in `salt -C '( *splunk-cm* ) not moose*' test.ping --out=txt`; do echo https://${i}8000; done | grep -v True
```

#### Reboot the indexers one at a time (AFS cluster gets backed up when an indexer is rebooted)

TODO: get command to view "three green check marks" from salt.
```
salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1a or G@ec2:placement:availability_zone:us-gov-east-1a ) not moose*' test.ping --out=txt
salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1a or G@ec2:placement:availability_zone:us-gov-east-1a ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'
date; salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1a or G@ec2:placement:availability_zone:us-gov-east-1a ) not moose*' system.reboot
watch "salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1a or G@ec2:placement:availability_zone:us-gov-east-1a ) not moose*' test.ping --out=txt"
```

Wait for 3 green check marks

#### Repeat for other AZs

```
salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1b or G@ec2:placement:availability_zone:us-gov-east-1b ) not moose*' test.ping --out=txt
salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1b or G@ec2:placement:availability_zone:us-gov-east-1b ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'
date; salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1b or G@ec2:placement:availability_zone:us-gov-east-1b ) not moose*' system.reboot
watch "salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1b or G@ec2:placement:availability_zone:us-gov-east-1b ) not moose*' test.ping --out=txt"
# 3 green checkmarks

salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1c or G@ec2:placement:availability_zone:us-gov-east-1c ) not moose*' test.ping --out=txt
salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1c or G@ec2:placement:availability_zone:us-gov-east-1c ) not moose*' cmd.run 'df -h | egrep "[890][0-9]\%"'
date; salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1c or G@ec2:placement:availability_zone:us-gov-east-1c ) not moose*' system.reboot
watch "salt -C '*splunk-i* and ( G@ec2:placement:availability_zone:us-east-1c or G@ec2:placement:availability_zone:us-gov-east-1c ) not moose*' test.ping --out=txt"
# 3 green checkmarks
```

NGA had a hard time getting 3 checkmarks. The CM was waiting on stuck buckets. Force rolled the buckets to get green checkmarks.

#### Verify you got everything

Run this on `Commercial (legacy)` and `GC Salt Master`

```
salt '*' cmd.run 'uptime | grep days'
```

> :warning: *MAKE SURE the Sensu checks are not silenced.*

Post to Slack [#xdr-patching Channel](https://afscyber.slack.com/archives/CJ462RRBM):

```
Patching is done for this month.
```
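
One caveat on the `uptime | grep days` checks used throughout these notes: `uptime` prints `1 day` (singular) for a host that has been up between 24 and 48 hours, so `grep days` skips it, and the check cannot tell a host rebooted last Thursday from one that was missed entirely. Below is a minimal sketch of a stricter check, assuming the procps `uptime` output format and POSIX `awk`. `up_longer_than` is a hypothetical helper name, not part of the existing tooling.

```shell
# Hypothetical helper (not part of the existing tooling).
# Reads `uptime` output on stdin and exits 0 only when the host has been up
# for more than the given number of days (default 7).
up_longer_than() {
  max_days="${1:-7}"
  awk -v m="$max_days" '
    {
      days = 0
      # Find "up N day," or "up N days," and capture N; other forms count as 0 days.
      for (i = 1; i < NF; i++)
        if ($i == "up" && $(i + 2) ~ /^days?,?$/) days = $(i + 1)
      if (days + 0 > m) found = 1
    }
    END { exit (found ? 0 : 1) }'
}
```

On a single host, `uptime | up_longer_than 7 && hostname` prints the hostname only when the host has been up for more than 7 days, which is roughly what a server missed during this patching cycle would look like.
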