Splunk Upgrade Notes

Use a calendar appointment to notify the team when you are upgrading Splunk.

Naughton, Brandon <brandon.naughton@accenturefederal.com>; Williams, Colby <colby.williams@accenturefederal.com>; Waddle, Duane E. <duane.e.waddle@accenturefederal.com>; Damstra, Frederick T. <frederick.t.damstra@accenturefederal.com>; Reuther, John M. <john.m.reuther@accenturefederal.com>; Leonard, Wesley A. <wesley.a.leonard@accenturefederal.com>; Starcher, George <george.a.starcher@accenturefederal.com>; Rivas, Gregory A. <gregory.a.rivas@accenturefederal.com>; Jarrett, James M. <james.m.jarrett@accenturefederal.com>; Kerr, James <j.kerr@accenturefederal.com>

This is an FYI only. I plan on upgrading PROD Splunk during this time.

No need to notify the customer since this is a "behind the scenes" change. No customer-facing downtime.

Post to the Slack channels before you begin: xdr-patching, xdr-engineering, xdr-soc.

Starting dc-c19 Splunk upgrade. Please plan on outages.

NOTE: The CM should be at the same or higher version than any Search Head connecting to it. Thus, upgrade the FM-shared-search, monitoring console, and qcompliance after upgrading all the Cluster Masters.
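The version rule in the note above can be checked before a Search Head is connected. This is a minimal sketch, assuming GNU `sort -V` is available; `version_ge` and the two version values are hypothetical names and sample data, not part of the existing tooling.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is the same as or newer than B.
# Relies on `sort -V` (version sort), a GNU coreutils extension.
version_ge() {
    # The lower of the two versions sorts first; A >= B when that one is B.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical versions, for illustration only.
cm_version="8.2.2.1"
sh_version="8.2.2.1"

if version_ge "$cm_version" "$sh_version"; then
    echo "OK: CM $cm_version is at or above SH $sh_version"
else
    echo "STOP: upgrade the CM before SH $sh_version connects to it"
fi
```

In practice the two values would come from `/opt/splunk/bin/splunk version` on each host.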

Splunk Upgrade 2021 "The Big One Part 2"

09/27/2021

Overview

Upgrade Steps

  • Ensure there is a recent, persistent snapshot of the SH, HF, CM, etc. EBS volumes

    • Silence Sensu
    • Stop the Splunk service on the CM, SH, HF, and Cust-SH, and take an EBS snapshot of ALL the drives so that the snapshots will not be deleted in two days!
    • salt -C '( *sh* or *hf* ) and moose*' cmd.run 'systemctl stop splunk'
    • Choose whether to back up /opt/splunk AND/OR take a snapshot.
    • Back up /opt/splunk: tar -cvzf /opt/splunk/opt-splunk-backup.tar.gz /opt/splunk
    • snapshot name: -pre-upgrade-backup-
    • snapshot name: modelclient-splunk-hf-pre-upgrade-backup-8.0.5
    • Update the profile, InstanceId, and tag in the command below to create snapshots of all volumes

      aws --profile mdr-test-c2-gov ec2 create-snapshots --instance-specification 'InstanceId=i-02a546c0de3d20030,ExcludeBootVolume=false' --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Name,Value=modelclient-splunk-hf-pre-upgrade-backup-8.0.5}]'
      
    • Before Splunk Upgrades

      • Upgrade ES 6.1.1/6.2.0 -> 6.6.2

        • ES 6.6.2 is the minimum version supported by Splunk 8.2.2.1
        • The app failed to upload to the SH (it takes a long time). Modify etc/system/local/web.conf to allow large uploads.

          [settings]
          max_upload_size = 1024
          
        • Run the setup after the upgrade.

        • In CAASP, the app failed to upgrade with the error "invalid message type: 28" due to insufficient space in the /tmp dir. Be sure to have a minimum of 1.4 GB available in /tmp for the tarball to be extracted.

    • Update the salt pillar data to point to the new Splunk repo.

      • Dump all passwords from the password store PRIOR to upgrade.
        • Run on the HF: | rest /services/storage/passwords
    • Silence Sensu and update the repo at the same time

      • Set up silences on Sensu for ALL servers
      • Verify there is enough disk space: cmd.run 'df -h'
      • Apply the updated pillar data: salt -C 'moose* not moose-alsi*' saltutil.refresh_pillar
      • Verify the pillar is updated: salt -C 'moose* not moose-alsi*' pillar.item yumrepos:splunk
      • Run state.sls splunk.update_repo to update the repo
    • Stop all servers at the same time

      • Stop Splunk on all servers: cmd.run 'systemctl stop splunk'
    • Upgrade CM, SH, HF, customer SH ( if applicable )

      • Did you take an AWS EBS snapshot ???
      • Upgrade Splunk: salt -C '( *cm* or *sh* or *hf* ) and moose*' pkg.upgrade name=splunk
      • Splunk is now waiting for the license to be accepted. Do NOT start Splunk until after the indexers are upgraded.
      • Swap George's app SA-AFS-XDR-Threat62 for SA-AFS-XDR-Threat64 on Search Heads with ES installed
        • Install via the UI after extracting the app builder
        • rm -rf /opt/splunk/etc/apps/SA-AFS-XDR-Threat62
        • scp the new app via Teleport
        • Be sure to untar it twice!
    • Upgrade and Start Indexers

      • Upgrade Splunk: salt -C '*idx* and moose*' pkg.upgrade name=splunk
      • Start the indexers and accept the license: cmd.run 'systemctl start splunk'
        • cmd.run '/opt/splunk/bin/splunk version'
        • cmd.run '/opt/splunk/bin/splunk status'
    • Start CM and SH and Cust-SH

      • Start CM/SH/HF and accept license
        • salt -C '( *cm* or *sh* or *hf* ) and moose*' cmd.run 'systemctl start splunk'
    • Verify that Splunk Web is up, searches of the _internal index are working, and the CM shows three green checkmarks

    • Upgrade fm-shared-search-0/splunk-mc-0/qcompliance-splunk-sh

      • update pillar in...
        • salt/pillar/mc_variables.sls
        • salt/pillar/fm_shared_search.sls
      • see above steps for upgrading a SH
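The ordering in the steps above is the part that is easiest to get wrong: everything stops together, CM/SH/HF are upgraded but left stopped, and the indexers are upgraded and started before anything else. The sketch below only prints the salt commands in that order and runs nothing. `upgrade_plan` and its prefix argument are hypothetical, and targeting every `moose*` minion for the stop is an assumption, since the step does not name a target.

```shell
#!/bin/sh
# Dry-run sketch: prints the salt commands in upgrade order; it runs nothing.
# upgrade_plan PREFIX: PREFIX is the minion name prefix (default "moose").
upgrade_plan() {
    prefix="${1:-moose}"
    mgmt="( *cm* or *sh* or *hf* ) and ${prefix}*"
    idx="*idx* and ${prefix}*"
    # 1. Stop everything together (targeting all PREFIX minions is an assumption).
    echo "salt -C '${prefix}*' cmd.run 'systemctl stop splunk'"
    # 2. Upgrade the CM/SH/HF packages but leave them stopped.
    echo "salt -C '${mgmt}' pkg.upgrade name=splunk"
    # 3. Upgrade the indexers, then start them first.
    echo "salt -C '${idx}' pkg.upgrade name=splunk"
    echo "salt -C '${idx}' cmd.run 'systemctl start splunk'"
    # 4. Only now start CM/SH/HF.
    echo "salt -C '${mgmt}' cmd.run 'systemctl start splunk'"
}

upgrade_plan moose
```

Printing the plan first gives a chance to review the targets before pasting each line on the salt master.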

    After Splunk App Upgrades

    - Migrate the KV store storage engine to WiredTiger on the SHs (where the KV store is used).
        - Back up the KV store first!
        - https://docs.splunk.com/Documentation/Splunk/8.2.2/Admin/BackupKVstore#Back_up_and_restore_the_KV_store_with_point_in_time_consistency 
            - Verify backup is there in /opt/splunk/var/lib/splunk/kvstorebackup
        - https://docs.splunk.com/Documentation/Splunk/8.2.2/Admin/MigrateKVstore#Migrate_the_KV_store_after_an_upgrade_to_Splunk_Enterprise_8.1_or_higher_in_a_single-instance_deployment
    - Upgrade apps slowly so Brandon can troubleshoot errors!
    - Ensure 3 green checkmarks on the CM. Update the CM bundle to include `_cluster`; without this fix the CM does not show 3 green checkmarks. See here: [Fixes for not replicating indexes?](https://github.xdr.accenturefederalcyber.com/mdr-engineering/msoc-afs-cm/pull/9) (indexes _metrics and _introspection are not in _cluster)
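Checking that the KV store backup exists before starting the WiredTiger migration can be scripted. This is a rough sketch that assumes the backup is written as a non-empty archive file directly inside the kvstorebackup directory; `kv_backup_present` is a hypothetical helper, not a Splunk command.

```shell
#!/bin/sh
# kv_backup_present DIR: succeed when DIR holds at least one non-empty file.
# DIR defaults to the backup location named in the notes above.
kv_backup_present() {
    dir="${1:-/opt/splunk/var/lib/splunk/kvstorebackup}"
    [ -d "$dir" ] || return 1
    # find prints matching files; -size +0 skips zero-byte leftovers.
    found=$(find "$dir" -maxdepth 1 -type f -size +0 | head -n 1)
    [ -n "$found" ]
}

if kv_backup_present; then
    echo "KV store backup found; OK to start the WiredTiger migration"
else
    echo "No KV store backup found; take one before migrating"
fi
```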
    
    • Remove Sensu Silences
    • On the AFS/FRTIB cluster, ensure that Okta logs are still coming in
    • Check the lastchance index for unusual data. If the ES upgrade introduces new indexes that do not exist on the Splunk indexers, their data is put into the lastchance index.
    • Wait a week or two and delete the snapshots?
    • Upgrade Universal Forwarders connected to Moose ( monthly patching may do this for you! )
      • NOTE: The splunkforwarder repo for non-Splunk servers is the msoc repo
      • Move the rpm for splunkforwarder to the msoc repo and then upgrade on the servers
      • cmd.run 'yum clean all ; yum makecache fast'
      • pkg.upgrade name=splunkforwarder
      • Use a restart to accept the license
      • cmd.run 'systemctl restart splunkuf'
      • state.sls internal_splunk_forwarder --output-diff test=false
      • See which UFs have been upgraded. Splunk server UFs may get upgraded with a yum update.
        • salt 'minion*' cmd.run '/opt/splunkforwarder/bin/splunk version'
        • salt 'minion*' cmd.run '/opt/splunkforwarder/bin/splunk status'
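To see at a glance which forwarders are still on the old release, the version number can be pulled out of the `splunk version` output. This is a sketch only: the sample line is an assumed output format with a made-up build hash, and `extract_version` is a hypothetical helper that relies on `grep -o` (GNU and BSD grep).

```shell
#!/bin/sh
# extract_version: read `splunk version` output on stdin, print the version.
extract_version() {
    # -o prints only the matching part; the pattern is a dotted number,
    # so a build hash without dots is never picked up.
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Assumed sample line; the build hash is made up for illustration.
sample="Splunk Universal Forwarder 8.2.2.1 (build 0123456789ab)"
target="8.2.2.1"

found=$(printf '%s\n' "$sample" | extract_version)
if [ "$found" = "$target" ]; then
    echo "upgraded: $found"
else
    echo "still on: $found"
fi
```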

    Upgrade Splunk on LCPs

    • Update the pillar in {customer}variables.sls
      • salt -C 'dgi* not *com' test.ping
      • Apply the updated pillar data: saltutil.refresh_pillar
      • Verify the pillar is updated: pillar.item yumrepos:splunk
      • state.sls splunk.update_repo
      • Unneeded: yum clean all ; yum makecache fast
    • Silence Sensu
    • Create a backup
      • Ensure you have room to take a backup
        • cmd.run 'df -h /opt'
    • Stop Splunk and take a backup
      • cmd.run 'systemctl stop splunk'
      • cmd.run 'tar -czf /opt/opt-splunk-backup-8.0.5.tar.gz /opt/splunk'
      • Worried about space?
        • cmd.run 'tar -czf /opt/syslog-ng/opt-splunk-backup-8.0.5.tar.gz /opt/splunk'
      • cmd.run 'ls -larth /opt'
    • Upgrade Splunk
      • pkg.upgrade name=splunk
      • cmd.run 'systemctl start splunk'
      • cmd.run '/opt/splunk/bin/splunk version'
      • cmd.run 'tail /opt/splunk/var/log/splunk/splunkd.log'
    • Remove Sensu Silence
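The "ensure you have room" step above can be made less of an eyeball check by comparing the size of /opt/splunk with the free space on the backup target. This is a conservative sketch: it requires free space equal to the uncompressed tree, although the gzip archive is normally smaller. The helper names and the `SRC`/`DEST` variables are hypothetical.

```shell
#!/bin/sh
# used_kb DIR: size of DIR in KiB.  free_kb DIR: free KiB on DIR's filesystem.
used_kb() { du -sk "$1" | awk '{print $1}'; }
free_kb() { df -Pk "$1" | awk 'NR == 2 {print $4}'; }

# room_for_backup SRC DEST: succeed when DEST has at least SRC's size free.
room_for_backup() {
    [ "$(free_kb "$2")" -ge "$(used_kb "$1")" ]
}

# SRC and DEST are hypothetical overrides; defaults follow the notes above.
src="${SRC:-/opt/splunk}"
dest="${DEST:-/opt}"
if [ -d "$src" ] && room_for_backup "$src" "$dest"; then
    echo "enough room in $dest for a backup of $src"
else
    echo "not enough room in $dest (or $src missing); try another volume"
fi
```

If the check fails for /opt, rerunning with `DEST=/opt/syslog-ng` matches the "Worried about space?" fallback.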

    New Error: 11-09-2021 22:17:39.611 +0000 ERROR ExecProcessor [16242 ExecProcessor] - message from "/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/splunk_secure_gateway/bin/ssg_enable_modular_input.py" Socket error communicating with splunkd (error=[Errno 111] Connection refused), path = https://127.0.0.1:9666//services/server/roles

    The splunk_secure_gateway app is installed with 8.2. As far as I know it is not used, so this error can be ignored.

    <<<------------------ LEGACY -------------------->>>

    Splunk Upgrade 2020 "The Big One"

    08/11/2020

    Software is located in Duane's OneDrive.

    Overall Plan

    1. Upgrade AFS/NGA 7.0.3 -> 8.0.5

      1. Why? Because of SOC blockers.
      2. Prep Work
        1. Ensure apps are 8.x compatible. Make a list of the apps that will be upgraded before the Splunk upgrade and those that will be upgraded after it.
        2. Check modular apps for Python 3 compatibility
        3. Create a OneDrive doc to track compatibility
        4. Pull Brandon into app upgrade checks
        5. Pillar if/then for test/prod
        6. The Python Upgrade will be completed at a later date
          1. Upgrade using the Python 2 runtime and make minimal changes to Python code
      3. AFS/NGA upgrade
        1. Update the salt pillar data to point to the new 8.0.5 Splunk repo.
          • 0.2 Dump all passwords from the password store PRIOR to upgrade.
            • 0.2.1 Run on the HF: | rest /services/storage/passwords
        2. Ensure recent backup of SH EBS
        3. upgrade indexers: stop all at the same time
          • 3.1. apply the updated pillar data: salt 'afs*' saltutil.refresh_pillar
          • 3.2. verify the pillar is updated: salt 'afs*' pillar.item yumrepos:splunk
          • 3.3. verify there is enough disk space
        4. Upgrade CM
          • 0.1 Setup silence on Sensu for ALL servers
            1. Run: state.sls splunk.new_install to update the repo. Yes, it will restart Splunk. (ROOM FOR IMPROVEMENT: Make a new salt state for the Splunk repo)
            2. Stop Splunk: cmd.run 'systemctl stop splunk'
            3. Upgrade Splunk: pkg.upgrade name=splunk
              • 3.1 Splunk is now waiting for the license to be accepted. Do NOT start Splunk until after the indexers are upgraded.
        5. Upgrade SH
          • 0.1 Setup silence on Sensu
            1. Run: state.sls splunk.new_install to update repo
            2. Stop Splunk: cmd.run 'systemctl stop splunk'
              • 2.1 Back up /opt/splunk: tar -cvzf /opt/splunk/opt-splunk-backup.tar.gz /opt/splunk
            3. Upgrade Splunk: pkg.upgrade name=splunk
              • 3.1 Splunk is now waiting for the license to be accepted.
        6. Upgrade Indexers
          • 0.1 Setup silence on Sensu
            1. Run: state.sls splunk.new_install to update repo
            2. Stop Splunk: cmd.run 'systemctl stop splunk'
            3. Upgrade Splunk: pkg.upgrade name=splunk
            4. Start the indexers and accept the license: cmd.run 'systemctl start splunk'
              • 4.1 cmd.run '/opt/splunk/bin/splunk version'
              • 4.2 cmd.run '/opt/splunk/bin/splunk status'
        7. Start CM and SH
          1. Start the CM/SH and accept the license: cmd.run 'systemctl start splunk'
        8. Upgrade HF (slice only, not POPs)
          1. Run: state.sls splunk.new_install to update repo
          2. Stop Splunk: cmd.run 'systemctl stop splunk'
            • 2.1 Back up /opt/splunk: tar -cvzf /opt/splunk/opt-splunk-backup.tar.gz /opt/splunk
          3. Upgrade Splunk: pkg.upgrade name=splunk
          4. Start the HF and accept the license: cmd.run 'systemctl start splunk'
        9. After Splunk App Upgrades
          1. Upgrade ES 5.0.1 -> 6.2.0

            1. The app failed to upload to the SH (it takes a long time). Modify etc/system/local/web.conf to allow large uploads.

              max_upload_size = 1024
              
          2. See Matrix for other apps (upgrade apps slowly so Brandon can troubleshoot errors!)

          3. Run the GeoIP DB update

            1. /usr/local/bin/maxmind-downloader.sh
          4. Update the CM bundle to include _cluster; without this fix the CM does not show 3 green checkmarks. See here: Fixes for not replicating indexes? (indexes _metrics and _introspection are not in _cluster)

          5. NGA has an additional check on the Splunk HF IAM role for externalID. Be sure to add the "patch" back in. See here: Jira Ticket - MSOCI-623 - Splunk AWS TA doesn't support --external-id when assuming an IAM role. This is for the splunk_TA_aws app.

        10. Delete Sensu Silences
        11. Check the lastchance index for unusual data. If the ES upgrade introduces new indexes that do not exist on the Splunk indexers, their data is put into the lastchance index.
    2. Upgrade Moose 7.2.1 -> 8.0.5 DONE!

      1. test
        1. CM, SH, indexer, HF, Forwarders
      2. prod
      3. Upgrade *.local Universal Forwarders
    3. Upgrade Covids 8.0.4 -> 8.0.5

      1. test
      2. prod
    4. Upgrade POP nodes