# Splunk Migration from Commercial to GovCloud

Rough process to migrate from Commercial to GovCloud. Work in progress.

# Verify Splunk is operating in an acceptable fashion

Make note of obvious errors/warnings. Verify Splunk is operating correctly:

* Log into the SH GUI and check for errors
* Log into the CM GUI and check for errors
* On the CM, run:

```
sudo -u splunk /opt/splunk/bin/splunk show cluster-status
```

# Create "snapshots"

> :warning: Remember to check the 'No Reboot' box!

Create "final" snapshots of the SH, CM, and HF in AWS.

* Name: `moose-splunk-hf-FinalSnapshot-20200115`
* Description: `Final snapshot before migration to GC`
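If you would rather script this than click through the console, a minimal sketch using the AWS CLI (the instance ID below is a placeholder; repeat per host). `--no-reboot` is the CLI equivalent of checking the 'No Reboot' box:

```
# Placeholder instance ID; repeat for the SH, CM, and HF.
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "moose-splunk-hf-FinalSnapshot-20200115" \
  --description "Final snapshot before migration to GC" \
  --no-reboot
```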
# Create a branch

```
cd ~/msoc-infrastructure
git fetch --all
git checkout develop
git pull origin
git checkout -b feature/ftd_MSOCI-XXXX_MigrateXXXX
```

# Create the Okta apps

```
cd ~/msoc-infrastructure/tools/okta_app_maker
export OKTA_API_TOKEN=FILLMEIN
SPLUNK_PREFIX=moose
SPLUNK_PREFIX_CAP=Moose
ENVIRONMENT=prod
ENVIRONMENT_CAP=Prod
ENVIRONMENT_DNS=xdr # alternative: xdrtest
./okta_app_maker.py "${SPLUNK_PREFIX_CAP} Splunk SH [${ENVIRONMENT_CAP}] [GC]" "https://${SPLUNK_PREFIX}-splunk.pvt.${ENVIRONMENT_DNS}.accenturefederalcyber.com"
./okta_app_maker.py "${SPLUNK_PREFIX_CAP} Splunk CM [${ENVIRONMENT_CAP}] [GC]" "https://${SPLUNK_PREFIX}-splunk-cm.pvt.${ENVIRONMENT_DNS}.accenturefederalcyber.com:8000"
./okta_app_maker.py "${SPLUNK_PREFIX_CAP} Splunk HF [${ENVIRONMENT_CAP}] [GC]" "https://${SPLUNK_PREFIX}-splunk-hf.pvt.${ENVIRONMENT_DNS}.accenturefederalcyber.com:8000"
# For moose only:
./okta_app_maker.py "Qcompliance [${ENVIRONMENT_CAP}] [GC]" "https://qcompliance-splunk.pvt.${ENVIRONMENT_DNS}.accenturefederalcyber.com:8000"
```

Update `~/msoc-infrastructure/salt/pillar/xxxx_variables.sls` with the values from the scripts.

# Update Security Groups

```
cd ~/msoc-infrastructure/terraform/CUSTOMERDIRECTORY
tfswitch
# modify module "CUST_cluster" in "moose.tf" or "main.tf", and add:
#   migration_cidr = [ "10.40.16.0/22" ]
# Determine the actual CIDR block for vpc-splunk in the new account
```

Commit to git and do a PR.

```
terraform init
terraform workspace select prod
terraform apply
```

DEPRECATED: I don't think this is right. The existing security group should be updated.

~It's possible this will update the security group, but due to the ASG, it will not recreate the indexers. Rather than rebuilding each, use the AWS console to modify each indexer.~

1. ~Log onto the AWS Commercial Legacy Prod account.~
2. ~Go to EC2 -> Instances and filter on "CUST-indexer".~

# Copy the splunk modules to the new account

If this is a brand new account, initialize it following the usual processes, but do not apply the splunk modules yet.

If this is an existing account:

```
cd ~/xdr-terraform-live/
git checkout master
git fetch --all
git pull
git checkout -b feature/ftd_MSOCI-XXXX_MigrateCUSTToGC
cp -r 000-skeleton/{140-splunk-frozen-bucket,150-splunk-cluster-master,160-splunk-indexer-cluster,170-splunk-searchhead,180-splunk-heavy-forwarder} prod/aws-us-gov/CUSTOMERDIRECTORY/
git add prod/aws-us-gov/CUSTOMERDIRECTORY/{140-splunk-frozen-bucket,150-splunk-cluster-master,160-splunk-indexer-cluster,170-splunk-searchhead,180-splunk-heavy-forwarder}
```

Edit `prod/aws-us-gov/CUSTOMERDIRECTORY/account.hcl` and verify the following variables (a sketch follows this list):

* `vpc_info['vpc-splunk']` should be the correct subnet (something out of 10.42.0.0/16)
* `splunk_data_sources` should be an array of IPs that have access (maybe copied from `~/msoc-infrastructure/terraform/common/variables.tf`)
* `splunk_legacy_cidr` should be the legacy subnet
* `splunk_asg_sizes` should be [1, 1, 1] unless additional instances are needed
* `splunk_volume_sizes`, probably copied from elsewhere
* `instance_types`
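Purely as an illustration, the relevant portion of `account.hcl` might look something like the following. Every value here is a placeholder; copy the real keys and shapes from an existing account's `account.hcl` or the skeleton:

```
locals {
  # Placeholder values; see the sources noted in the list above.
  vpc_info = {
    "vpc-splunk" = "10.42.8.0/22"            # actual CIDR from the new account
  }
  splunk_data_sources = ["198.51.100.10/32"] # placeholder IPs that need access
  splunk_legacy_cidr  = "10.40.16.0/22"      # the legacy subnet
  splunk_asg_sizes    = [1, 1, 1]
  splunk_volume_sizes = [300, 300, 300]      # probably copied from elsewhere
  instance_types      = ["i3en.2xlarge"]     # hypothetical instance type
}
```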
Commit and do a PR for approval.

# Create the frozen bucket

```
cd ~/xdr-terraform-live/test/aws-us-gov/mdr-test-c2/140-splunk-frozen-bucket
terragrunt apply
```

# Create the SAML logins

```
cd ~/msoc-infrastructure/salt/pillar
vim moose_variables.sls
# copy the saml settings, applying different settings in an {% if %} clause
# for 'pvt.xdr.accenturefederalcyber.com' (see moose_variables.sls),
# and update with the information obtained in the following steps.
```

1. Log onto the [Okta Admin Active Applications](https://mdr-multipass-admin.okta.com/admin/apps/active)
1. For each of the new apps ("CUST Splunk CM [Prod] [GC]", HF, and SH), go to Groups, click 'Assign -> Assign to Groups', and assign the following groups:
   * CM: mdr-admins, mdr-engineers
   * HF: mdr-admins, mdr-engineers
   * SH: mdr-admins, mdr-engineers, Analysts

# Refresh the salt filesystem

By now, your PR for salt should have been merged. Make sure it's on all the salt masters.

```
tshp salt-master
salt-run fileserver.update
salt '*' saltutil.sync_all       # optional but why not?
salt '*' saltutil.refresh_pillar # optional but why not?
```

# Migrate the cluster master

## Terraform and highstate a cluster master in GovCloud

### Terraform it:

```
cd ~/xdr-terraform-live/prod/aws-us-gov/mdr-prod-c2/150-splunk-cluster-master
terragrunt apply
```

### Highstate it:

```
ssh gc-dev-salt-master
salt 'moose-splunk-cm.pvt.xdr.accenturefederalcyber.com' state.highstate --output-diff
# run it twice
salt 'moose-splunk-cm.pvt.xdr.accenturefederalcyber.com' state.highstate --output-diff
# reboot to ensure all changes are active
salt 'moose-splunk-cm.pvt.xdr.accenturefederalcyber.com' system.reboot
watch "salt 'moose-splunk-cm.pvt.xdr.accenturefederalcyber.com' test.ping"
```

# Initial rsync

Run an initial rsync of the remote host while it's still running, as an initial staging pass. This reduces the total time we run with reduced redundancy.

Prep for scp:

```
# generate key on new
tshp CUST-splunk-cm
sudo systemctl stop splunk
sudo systemctl disable splunk
sudo su - splunk
ssh-keygen # enter x3
cat ~/.ssh/id_rsa.pub
exit

# authorize key on old
tshp CUST-splunk-cm.msoc.defpoint.local
mkdir .ssh
cat >> .ssh/authorized_keys # paste from above
exit

# Validate that it's working
tshp CUST-splunk-cm
sudo su - splunk
ssh frederick.t.damstra@CUST-splunk-cm.msoc.defpoint.local
```

rsync legacy to local:

```
tshp CUST-splunk-cm
sudo su - splunk
# this can be run multiple times without issue. You may wish to do
# it first before you've stopped splunk to minimize the interruption.
time rsync --rsync-path="sudo rsync" -avz --delete --progress \
  frederick.t.damstra@CUST-splunk-cm.msoc.defpoint.local:/opt/splunk/ /opt/splunk/ \
  --exclude="*.log" --exclude '*.log.*' --exclude '*.bundle' --exclude ".ssh"
```

## Stop splunk on the legacy CM, and rsync all configuration to the new CM

```
tshp salt-master
# stop splunk on old and new
salt 'CUST-*-cm*' service.stop splunk
salt 'CUST-splunk-cm.msoc.defpoint.local' service.disable splunk
exit
```

rsync legacy to local:

```
tshp CUST-splunk-cm
sudo su - splunk
# this can be run multiple times without issue. You may wish to do
# it first before you've stopped splunk to minimize the interruption.
time rsync --rsync-path="sudo rsync" -avz --delete --progress \
  frederick.t.damstra@CUST-splunk-cm.msoc.defpoint.local:/opt/splunk/ /opt/splunk/ \
  --exclude="*.log" --exclude '*.log.*' --exclude '*.bundle' --exclude ".ssh"
```

Fix references:

```
find /opt/splunk -type f -name "*.conf" -exec grep -Hi msoc.defpoint.local {} \;
# fix anything found
```

## Update cluster master configuration to point to the new DNS name

```
cd
cd tmp
git clone git@github.xdr.accenturefederalcyber.com:mdr-engineering/msoc-moose-cm.git
cd msoc-moose-cm/
grep msoc `find . -type f`
# fix anything found
# Fix the coldToFrozen script: NOT ALL CUSTOMERS HAVE THIS
vim master-apps/TA-Frozen-S3/bin/coldToFrozenS3.py
# make it match ~/msoc-infrastructure/salt/fileroots/splunk/files/coldToFrozenS3.py
```

These changes will be deployed later.

## Fix permissions and start splunk on the new CM

```
ssh gc-prod-moose-splunk-cm
sudo chown -R splunk:splunk /opt/splunk
sudo systemctl start splunk
# check for issues:
sudo tail -F /opt/splunk/var/log/splunk/splunkd.log
# Make sure the bundle is active
sudo -u splunk /opt/splunk/bin/splunk show cluster-status
sudo -u splunk /opt/splunk/bin/splunk validate cluster-bundle
sudo -u splunk /opt/splunk/bin/splunk show cluster-bundle-status
sudo -u splunk /opt/splunk/bin/splunk apply cluster-bundle
sudo -u splunk /opt/splunk/bin/splunk show cluster-bundle-status
# optional but recommended
sudo -u splunk /opt/splunk/bin/splunk enable maintenance-mode
```

## For each indexer:

Get the list of indexers:

```
tshp ls | grep moose | grep indexer
```

For each indexer, as quickly as possible:

```
ssh dev-moose-splunk-indexer-i-00d5ea4121238bb1b
sudo sed -i 's/^master_uri.*$/master_uri = https:\/\/moose-splunk-cm.pvt.xdr.accenturefederalcyber.com:8089/g' /opt/splunk/etc/system/local/server.conf
sudo sed -i 's/msoc.defpoint.local/pvt.xdr.accenturefederalcyber.com/g' /opt/splunk/etc/apps/license_slave/local/server.conf
sudo systemctl stop splunk
sudo systemctl start splunk
# verify via 'show cluster-status' on the cm that it joined
sudo tail -F /opt/splunk/var/log/splunk/splunkd.log
```

## Validate cluster status:

```
ssh gc-prod-moose-splunk-cm
sudo -u splunk /opt/splunk/bin/splunk show cluster-status
sudo -u splunk /opt/splunk/bin/splunk disable maintenance-mode
sudo -u splunk /opt/splunk/bin/splunk apply cluster-bundle # this last one should error, but it's a failsafe
```

## Fix the search head

```
ssh prod-moose-splunk-sh
sudo sed -i 's/msoc.defpoint.local/pvt.xdr.accenturefederalcyber.com/g' /opt/splunk/etc/apps/license_slave/local/server.conf
sudo sed -i 's/msoc.defpoint.local/pvt.xdr.accenturefederalcyber.com/g' /opt/splunk/etc/apps/moose_sh_outputs/default/outputs.conf
sudo sed -i 's/moose-splunk-cm/moose-splunk-cm.pvt.xdr.accenturefederalcyber.com/g' /opt/splunk/etc/apps/connected_clusters/local/server.conf
sudo systemctl restart splunk
sudo tail -F /opt/splunk/var/log/splunk/splunkd.log
```

## Fix indexer discovery

Edit salt and update any references to the CM:

```
cd ~/msoc-infrastructure/salt/pillar
grep "moose-splunk-cm" `find . -type f`
# vim vim vim
git commit
git push
```

Then update the salt master, and:

```
ssh gc-prod-salt-master
sudo salt-run fileserver.update
salt '*' saltutil.refresh_pillar
# Don't want to start splunk on the old cluster master, though it likely wouldn't do anything.
salt -C 'moose* not moose-splunk-cm.msoc*' state.highstate --output-diff --force-color test=true 2>&1 | less -iSR
# validate you expect the changes, and then do it for real:
salt -C 'moose* not moose-splunk-cm.msoc*' state.highstate --output-diff test=false
```

This should *only* be necessary for moose:

```
salt '*.msoc.defpoint.local' state.sls internal_splunk_forwarder --output-diff --force-color test=true
salt '*.msoc.defpoint.local' state.sls internal_splunk_forwarder --output-diff --force-color test=false
salt '*.pvt*' state.sls internal_splunk_forwarder --output-diff --force-color test=true
salt '*.pvt*' state.sls internal_splunk_forwarder --output-diff --force-color test=false
```

---

# Migrate the indexers:

It's a good idea to check current usage first:

```
salt 'moose-splunk-i*' cmd.run 'df -h | grep opt'
```

## Stand up 3 new indexers in GC and add them to the cluster

Terraform the cluster:

```
cd ~/xdr-terraform-live/test/aws-us-gov/mdr-test-c2/160-splunk-indexer-cluster
terragrunt apply
```

Verify the indexers come online:

```
ssh gc-prod-salt-master
sudo salt-key -L | grep idx
exit
date
ssh gc-dev-moose-splunk-cm
# I believe it can take up to about a half hour to come online
sudo -u splunk /opt/splunk/bin/splunk show cluster-status
# you may need to highstate a second time? Not clear why this works sometimes and not others
```

## Create the legacy HEC

All converted clients need the legacy HEC module. This is not included in the skeleton directory.

### Update existing things to use the new HEC

Update salt:

* Validate connectivity from client LCPs to the GovCloud NLBs
* Update the NLB endpoints (should be 3 IP addresses)
  * May want to keep one of the old ones temporarily, so that errors can be recorded.
* Update -hec endpoints to use the xdr.accenturefederalcyber.com domain.

#### The Moose HECs

This may only be necessary for moose, but check for other HECs as well:

```
vim ~/xdr-terraform-live/prod/env.hcl # Update hec, hec_pub, and hec_pub_ack
cd ~/xdr-terraform-live/prod/aws/legacy-mdr-prod/045-kinesis-firehose-waf-logs
terragrunt apply
cd ~/xdr-terraform-live/prod/aws/legacy-mdr-prod/045-kinesis-portal-data-sync
terragrunt apply
```

### Remove the old HEC

Log into the console and check the old HEC to get a feel for what sort of traffic to expect.

```
cd ~/msoc-infrastructure/terraform/100-moose
vim moose.tf # Change create_hec_lb to false
tfswitch
terraform init
terraform workspace select prod
terraform apply
```

Note: HEC communications will be down until the next step is completed.

### Create a new HEC

```
cd ~/xdr-terraform-live/prod/aws-us-gov/mdr-prod-c2/165-splunk-legacy-hec
tfswitch
terragrunt init
terragrunt apply
```

Numerous salt things follow... this only applies to moose, so notes were not kept.

Validate that traffic is hitting the new HEC in a similar fashion to the old (remember that the negative TTL of 1 hour may apply).
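A quick way to spot-check the new HEC path from an LCP or similar source host. The hostname below is a hypothetical -hec endpoint; substitute the real one. The HEC health endpoint does not require a token:

```
# Hypothetical endpoint name; substitute the real -hec DNS name.
# A healthy HEC returns: {"text":"HEC is healthy","code":17}
curl -sk https://moose-hec.pvt.xdr.accenturefederalcyber.com:8088/services/collector/health

# Confirm what the name currently resolves to (negative TTL caching may
# delay clients picking up the new NLB addresses):
dig +short moose-hec.pvt.xdr.accenturefederalcyber.com
```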
### One at a time, stop the legacy indexers:

On each indexer:

```
sudo systemctl disable splunk
sudo -u splunk /opt/splunk/bin/splunk offline --enforce-counts
sudo tail -F /opt/splunk/var/log/splunk/splunkd.log
```

Watch the status via the cluster master before moving on to the next:

```
sudo -u splunk /opt/splunk/bin/splunk show cluster-status
```

### Validate:

* Search head can search
* No warnings on SH or Cluster Master

If you get integrity warnings, it is most likely because the 'sed' above added a newline. To fix, for each file that has the warning:

```
vim {file}
:set binary
:set noeol
:w!
ZZ
```
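If many files are flagged, a non-interactive equivalent of the vim fix above is a perl one-liner that drops the trailing newline in place. This is a sketch (the path is just an example); test it on one file first:

```
# Removes the newline from the end of the last line, in place;
# same effect as ':set binary', ':set noeol', ':w!' above.
sudo perl -i -pe 'chomp if eof' /opt/splunk/etc/apps/license_slave/local/server.conf
```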