
Splunk Migration from Commercial to GovCloud

Rough process to migrate from Commercial to GovCloud. Work in progress.

Verify Splunk is operating in an acceptable fashion and make note of any obvious errors or warnings.

Verify Splunk is operating correctly:

  • Log into the SH GUI, check for errors
  • Log into the CM GUI, check for errors
  • On the CM, run:

    sudo -u splunk /opt/splunk/bin/splunk show cluster-status
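It can also help to baseline what "normal" looks like before touching anything. This is an extra check, not part of the original steps; adjust the log paths if yours differ:

sudo -u splunk /opt/splunk/bin/splunk status
# skim recent errors/warnings so you can tell later which ones are pre-existing
sudo tail -n 200 /opt/splunk/var/log/splunk/splunkd.log | grep -E 'ERROR|WARN'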
    

Create "snapshots"

:warning: Remember to check the 'No Reboot' box!

Create "final" snapshots of the SH, CM, and HF on aws.

Name: moose-splunk-hf-FinalSnapshot-20200115
Description: Final snapshot before migration to GC
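If you'd rather script it than click through the console, something like the following should work. The instance ID is a placeholder, and this assumes the "snapshot" here is an AMI created via create-image (which is where the 'No Reboot' option lives):

# --no-reboot keeps the instance running while the image is created
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "moose-splunk-hf-FinalSnapshot-20200115" \
  --description "Final snapshot before migration to GC" \
  --no-reboot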

Create a branch

cd ~/msoc-infrastructure
git fetch --all
git checkout develop
git pull origin
git checkout -b feature/ftd_MSOCI-XXXX_MigrateXXXX

Create the okta apps

cd ~/msoc-infrastructure/tools/okta_app_maker
export OKTA_API_TOKEN=FILLMEIN
SPLUNK_PREFIX=moose
SPLUNK_PREFIX_CAP=Moose
ENVIRONMENT=prod
ENVIRONMENT_CAP=Prod
ENVIRONMENT_DNS=xdr # Alternative xdrtest
./okta_app_maker.py "${SPLUNK_PREFIX_CAP} Splunk SH [${ENVIRONMENT_CAP}] [GC]" "https://${SPLUNK_PREFIX}-splunk.pvt.${ENVIRONMENT_DNS}.accenturefederalcyber.com"
./okta_app_maker.py "${SPLUNK_PREFIX_CAP} Splunk CM [${ENVIRONMENT_CAP}] [GC]" "https://${SPLUNK_PREFIX}-splunk-cm.pvt.${ENVIRONMENT_DNS}.accenturefederalcyber.com:8000"
./okta_app_maker.py "${SPLUNK_PREFIX_CAP} Splunk HF [${ENVIRONMENT_CAP}] [GC]" "https://${SPLUNK_PREFIX}-splunk-hf.pvt.${ENVIRONMENT_DNS}.accenturefederalcyber.com:8000"
# For moose only:
./okta_app_maker.py "Qcompliance [${ENVIRONMENT_CAP}] [GC]" "https://qcompliance-splunk.pvt.${ENVIRONMENT_DNS}.accenturefederalcyber.com:8000"

Update ~/msoc-infrastructure/salt/pillar/xxxx_variables.sls with the values from the scripts.

Update Security Groups

cd ~/msoc-infrastructure/terraform/CUSTOMERDIRECTORY
tfswitch
# modify module "CUST_cluster" in "moose.tf" or "main.tf", and add:
#   migration_cidr = [ "10.40.16.0/22" ] # Determine actual CIDR block for vpc-splunk in the new account

Commit to GIT and do a PR

terraform init
terraform workspace select prod
terraform apply
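A quick sanity check that the new CIDR actually made it into a rule (a sketch; adjust the CIDR to whatever you put in migration_cidr):

aws ec2 describe-security-groups \
  --filters Name=ip-permission.cidr,Values=10.40.16.0/22 \
  --query 'SecurityGroups[].GroupName'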

DEPRECATED: I don't think this is right. The existing security group should be updated.

~It's possible this will update the security group, but due to the ASG, it will not recreate the indexers. Rather than rebuilding each, use the AWS console to modify each indexer.~

  1. ~Log onto the AWS Commercial Legacy Prod account.~
  2. ~Go to ec2->instances and filter on "CUST-indexer"~

Copy the splunk modules to the new account

If this is a brand new account, initialize following the usual processes, but do not apply the splunk modules yet.

If this is an existing account:

cd ~/xdr-terraform-live/
git checkout master
git fetch --all
git pull
git checkout -b feature/ftd_MSOCI-XXXX_MigrateCUSTToGC
cp -r 000-skeleton/{140-splunk-frozen-bucket,150-splunk-cluster-master,160-splunk-indexer-cluster,170-splunk-searchhead,180-splunk-heavy-forwarder} prod/aws-us-gov/CUSTOMERDIRECTORY/
git add prod/aws-us-gov/CUSTOMERDIRECTORY/{140-splunk-frozen-bucket,150-splunk-cluster-master,160-splunk-indexer-cluster,170-splunk-searchhead,180-splunk-heavy-forwarder}

Edit prod/aws-us-gov/CUSTOMERDIRECTORY/account.hcl and verify the following variables (a quick spot-check grep follows the list):

  • vpc_info['vpc-splunk'] should be the correct subnet (Something out of 10.42.0.0/16)
  • splunk_data_sources should be an array of IPs that have access (maybe copied from ~/msoc-infrastructure/terraform/common/variables.tf)
  • splunk_legacy_cidr should be the legacy subnet
  • splunk_asg_sizes should be [1, 1, 1] unless additional instances are needed
  • splunk_volume_sizes, probably copied from elsewhere
  • instance_types
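A quick grep to spot-check the values after editing (a sketch; adjust the path to the actual customer directory):

cd ~/xdr-terraform-live
grep -E 'vpc-splunk|splunk_data_sources|splunk_legacy_cidr|splunk_asg_sizes|splunk_volume_sizes|instance_types' \
  prod/aws-us-gov/CUSTOMERDIRECTORY/account.hcl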

Commit and do a PR for approval

Create the frozen bucket

cd ~/xdr-terraform-live/test/aws-us-gov/mdr-test-c2/140-splunk-frozen-bucket
terragrunt apply
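To confirm the bucket actually landed in the new account (a sketch; the exact bucket name varies, so just grep for 'frozen'):

aws s3 ls | grep -i frozen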

Create the SAML logins

cd ~/msoc-infrastructure/salt/pillar
vim moose_variables.sls
# copy the saml settings, applying different settings in an {% if %} clause for 'pvt.xdr.accenturefederalcyber.com' (see moose_variables.sls)
# and update with the information obtained in the following steps.
  1. Log into the Okta admin console and go to Active Applications
  2. For each of the new apps ("CUST Splunk CM [Prod] [GC]", HF, and SH), go to Groups, click 'Assign->Assign to Groups', and assign the following groups:
    • CM: mdr-admins, mdr-engineers
    • HF: mdr-admins, mdr-engineers
    • SH: mdr-admins, mdr-engineers, Analysts

Refresh the salt filesystem

By now, your PR for salt should have been merged. Make sure it's on all the salt masters.

tshp salt-master
salt-run fileserver.update
salt '*' saltutil.sync_all       # optional but why not?
salt '*' saltutil.refresh_pillar # optional but why not?
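A spot-check that the new SAML pillar data is visible to the CM minion can save a surprise later. This is a sketch and assumes the relevant settings live under a key whose name contains 'saml':

salt 'moose-splunk-cm*' pillar.items | grep -i -A3 saml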

Migrate the cluster master:

Terraform and highstate a cluster master in GovCloud.

Terraform it:

cd ~/xdr-terraform-live/prod/aws-us-gov/mdr-prod-c2/150-splunk-cluster-master
terragrunt apply

Highstate it:

ssh gc-dev-salt-master
salt 'moose-splunk-cm.pvt.xdr.accenturefederalcyber.com' state.highstate --output-diff
# run it twice
salt 'moose-splunk-cm.pvt.xdr.accenturefederalcyber.com' state.highstate --output-diff
# reboot to ensure all changes are active
salt 'moose-splunk-cm.pvt.xdr.accenturefederalcyber.com' system.reboot
watch "salt 'moose-splunk-cm.pvt.xdr.accenturefederalcyber.com' test.ping"

Initial rsync

Run an initial rsync from the remote host while it's still running to do an initial staging pass. This reduces the total time we spend running with reduced redundancy.

Prep for scp:

# generate key on new
tshp CUST-splunk-cm
sudo systemctl stop splunk
sudo systemctl disable splunk
sudo su - splunk
ssh-keygen
# press enter 3 times (accept defaults, no passphrase)
cat ~/.ssh/id_rsa.pub
exit

# authorize key on old
tshp CUST-splunk-cm.msoc.defpoint.local
mkdir .ssh
cat >> .ssh/authorized_keys
# paste from above
exit

# Validate that it's working
tshp CUST-splunk-cm
sudo su - splunk
ssh frederick.t.damstra@CUST-splunk-cm.msoc.defpoint.local

rsync legacy to local:

tshp CUST-splunk-cm
sudo su - splunk
# this can be run multiple times without issue. You may wish to do 
# it first before you've stopped splunk to minimize the interruption.
time rsync --rsync-path="sudo rsync" -avz --delete --progress \
  frederick.t.damstra@CUST-splunk-cm.msoc.defpoint.local:/opt/splunk/ /opt/splunk/ \
  --exclude="*.log"   --exclude '*.log.*'   --exclude '*.bundle' --exclude ".ssh"

Stop splunk on legacy CM, and rsync all configuration to the new CM.

tshp salt-master
# stop splunk on old and new
salt 'CUST-*-cm*' service.stop splunk
salt 'CUST-splunk-cm.msoc.defpoint.local' service.disable splunk
exit

rsync legacy to local:

tshp CUST-splunk-cm
sudo su - splunk
# final sync now that splunk is stopped on the legacy CM;
# it can still be re-run multiple times without issue.
time rsync --rsync-path="sudo rsync" -avz --delete --progress \
  frederick.t.damstra@CUST-splunk-cm.msoc.defpoint.local:/opt/splunk/ /opt/splunk/ \
  --exclude="*.log"   --exclude '*.log.*'   --exclude '*.bundle' --exclude ".ssh"

Fix references:

find /opt/splunk -type f -name "*.conf" -exec grep -Hi msoc.defpoint.local {} \;
# fix anything found
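If the grep turns anything up, the same substitution used later for the indexers can be adapted here. Review the matches first; this blanket replace is a sketch, not something to run blind:

# only run after reviewing the grep output above
sudo find /opt/splunk -type f -name "*.conf" \
  -exec grep -l msoc.defpoint.local {} \; \
  | xargs -r sudo sed -i 's/msoc.defpoint.local/pvt.xdr.accenturefederalcyber.com/g'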

Update cluster master configuration to point to new DNS name

cd
cd tmp
git clone git@github.xdr.accenturefederalcyber.com:mdr-engineering/msoc-moose-cm.git
cd msoc-moose-cm/
grep msoc `find . -type f`
# fix anything found

# Fix the coldToFrozen script: NOT ALL CUSTOMERS HAVE THIS
vim master-apps/TA-Frozen-S3/bin/coldToFrozenS3.py
# make this match ~/msoc-infrastructure/salt/fileroots/splunk/files/coldToFrozenS3.py
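A diff makes it obvious whether the repo copy already matches the salt copy (a sketch, run from the msoc-moose-cm checkout):

diff master-apps/TA-Frozen-S3/bin/coldToFrozenS3.py \
  ~/msoc-infrastructure/salt/fileroots/splunk/files/coldToFrozenS3.py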

These changes will be deployed later.

Fix permissions and start splunk on the new CM

ssh gc-prod-moose-splunk-cm
sudo chown -R splunk:splunk /opt/splunk
sudo systemctl start splunk
# check for issues:
sudo tail -F /opt/splunk/var/log/splunk/splunkd.log

# Make sure the bundle is active
sudo -u splunk /opt/splunk/bin/splunk show cluster-status
sudo -u splunk /opt/splunk/bin/splunk validate cluster-bundle
sudo -u splunk /opt/splunk/bin/splunk show cluster-bundle-status
sudo -u splunk /opt/splunk/bin/splunk apply cluster-bundle
sudo -u splunk /opt/splunk/bin/splunk show cluster-bundle-status

# optional but recommended
sudo -u splunk /opt/splunk/bin/splunk enable maintenance-mode

For each indexer:

Get list of indexers:

tshp ls | grep moose | grep indexer

For each indexer, as quickly as possible:

ssh dev-moose-splunk-indexer-i-00d5ea4121238bb1b
sudo sed -i 's/^master_uri.*$/master_uri = https:\/\/moose-splunk-cm.pvt.xdr.accenturefederalcyber.com:8089/g' /opt/splunk/etc/system/local/server.conf
sudo sed -i 's/msoc.defpoint.local/pvt.xdr.accenturefederalcyber.com/g' /opt/splunk/etc/apps/license_slave/local/server.conf
sudo systemctl stop splunk
sudo systemctl start splunk
# verify via 'show cluster-status' on cm that it joined
sudo tail -F /opt/splunk/var/log/splunk/splunkd.log
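Before moving on to the next indexer, it can be confirmed from the CM that this peer re-registered and is Up (optional extra check):

ssh gc-prod-moose-splunk-cm
sudo -u splunk /opt/splunk/bin/splunk list cluster-peers
# the indexer you just restarted should show status Up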

Validate cluster status:

ssh gc-prod-moose-splunk-cm
sudo -u splunk /opt/splunk/bin/splunk show cluster-status
sudo -u splunk /opt/splunk/bin/splunk disable maintenance-mode
sudo -u splunk /opt/splunk/bin/splunk apply cluster-bundle
# this last one should error, but it's a failsafe

Fix the search head

ssh prod-moose-splunk-sh
sudo sed -i 's/msoc.defpoint.local/pvt.xdr.accenturefederalcyber.com/g' /opt/splunk/etc/apps/license_slave/local/server.conf
sudo sed -i 's/msoc.defpoint.local/pvt.xdr.accenturefederalcyber.com/g' /opt/splunk/etc/apps/moose_sh_outputs/default/outputs.conf
sudo sed -i 's/moose-splunk-cm/moose-splunk-cm.pvt.xdr.accenturefederalcyber.com/g' /opt/splunk/etc/apps/connected_clusters/local/server.conf
sudo systemctl restart splunk
sudo tail -F /opt/splunk/var/log/splunk/splunkd.log
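One way to confirm the SH is actually talking to the GovCloud indexers is a quick CLI search (a sketch; it will prompt for Splunk credentials):

sudo -u splunk /opt/splunk/bin/splunk search 'index=_internal earliest=-5m | stats count by splunk_server'
# each GovCloud indexer should show up in the results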

Fix indexer discovery

Edit salt and update any references to the CM:

cd ~/msoc-infrastructure/salt/pillar
grep "moose-splunk-cm" `find . -type f`
# vim vim vim 
git commit
git push 

Then update the salt master, and:

ssh gc-prod-salt-master
sudo salt-run fileserver.update
salt '*' saltutil.refresh_pillar
# Don't want to start splunk on the old cluster master, though it likely wouldn't do anything.
salt -C 'moose* and not moose-splunk-cm.msoc*' state.highstate --output-diff --force-color test=true 2>&1 | less -iSR
# validate you expect the changes, and then do it for real:
salt -C 'moose* and not moose-splunk-cm.msoc*' state.highstate --output-diff test=false

This should only be necessary for moose:

salt '*.msoc.defpoint.local' state.sls internal_splunk_forwarder --output-diff --force-color test=true
salt '*.msoc.defpoint.local' state.sls internal_splunk_forwarder --output-diff --force-color test=false
salt '*.pvt*' state.sls internal_splunk_forwarder --output-diff --force-color test=true
salt '*.pvt*' state.sls internal_splunk_forwarder --output-diff --force-color test=false

#########################################

Migrate the indexers:

Good idea to check current usage:

salt 'moose-splunk-i*' cmd.run 'df -h | grep opt'

Stand up 3 new indexers in GC and add them to the cluster.

Terraform the cluster:

cd ~/xdr-terraform-live/test/aws-us-gov/mdr-test-c2/160-splunk-indexer-cluster
terragrunt apply

Verify the indexers come online:

ssh gc-prod-salt-master
sudo salt-key -L | grep idx
exit

date
ssh gc-dev-moose-splunk-cm
# I believe it can take up to about a half hour to come online
sudo -u splunk /opt/splunk/bin/splunk show cluster-status
# you may need to highstate a second time? Not clear what happens to cause this to work sometimes and not others

Create the legacy HEC

All converted clients need the legacy hec module. This is not included in the skeleton directory.
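One approach is to copy the module from an account that already has it, e.g. the moose layout referenced later in this page; EXISTINGCUSTOMER is a placeholder:

cd ~/xdr-terraform-live/
cp -r prod/aws-us-gov/EXISTINGCUSTOMER/165-splunk-legacy-hec prod/aws-us-gov/CUSTOMERDIRECTORY/
# adjust the inputs for this customer before running terragrunt apply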

Update existing things to use the new HEC

Update salt:

  • Validate connectivity from client LCPs to the GovCloud NLBs
  • Update the NLB endpoints (should be 3 IP addresses)
    • May want to keep one of the old ones temporarily, so that errors can be recorded.
  • Update -hec endpoints to use the xdr.accenturefederalcyber.com domain.
  • The Moose HECs

    This may only be necessary for moose, but check for other HECs as well:

    vim ~/xdr-terraform-live/prod/env.hcl
    # Update hec, hec_pub, and hec_pub_ack
    
    cd ~/xdr-terraform-live/prod/aws/legacy-mdr-prod/045-kinesis-firehose-waf-logs
    terragrunt apply
    cd ~/xdr-terraform-live/prod/aws/legacy-mdr-prod/045-kinesis-portal-data-sync
    terragrunt apply
    

    Remove the old HEC

    Log into the console and check the old HEC to get a feel for what sort of traffic to expect.

    cd ~/msoc-infrastructure/terraform/100-moose
    vim moose.tf
    # Change create_hec_lb to false
    tfswitch
    terraform init
    terraform workspace select prod
    terraform apply
    

    Note: HEC communications will be down until the next step is completed

    Create a new HEC

    cd ~/xdr-terraform-live/prod/aws-us-gov/mdr-prod-c2/165-splunk-legacy-hec
    tfswitch
    terragrunt init
    terragrunt apply
    

    numerous salt things... this only applies to moose so notes not kept

    Validate that traffic is hitting the new HEC in a similar fashion to the old (remember that the negative TTL of 1 hour may apply)
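    A quick way to check that the new HEC itself is answering (a sketch; the hostname is a guess and this assumes the default 8088 port, substitute the real endpoint):

    curl -sk https://moose-splunk-hec.pvt.xdr.accenturefederalcyber.com:8088/services/collector/health
    # a healthy HEC returns something like {"text":"HEC is healthy","code":17}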

One at a time, stop the legacy indexers:

On each indexer:

sudo systemctl disable splunk
sudo -u splunk /opt/splunk/bin/splunk offline --enforce-counts
sudo tail -F /opt/splunk/var/log/splunk/splunkd.log

Watch the status via the cluster master before moving on to the next:

sudo -u splunk /opt/splunk/bin/splunk show cluster-status

Validate:

  • Search head can search
  • No warnings on SH or Cluster Master

If you get integrity warnings, it is most likely because the 'sed' above added a newline. To fix, for each file that has the warning:

vim {file}
:set binary
:set noeol
:w!
ZZ
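If a lot of files are affected, the trailing newline can also be stripped non-interactively; this only removes a newline at end-of-file, but verify on one file first (a sketch):

perl -pi -e 'chomp if eof' {file}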