
Migration to SmartStore

Splunk documentation: https://docs.splunk.com/Documentation/Splunk/latest/Indexer/AboutSmartStore

Items of note:

  • SmartStore data retention is managed cluster-wide
  • Only maxGlobalDataSizeMB, maxGlobalRawDataSizeMB, and frozenTimePeriodInSecs control when data freezes (see the sketch after this list).
    • The most restrictive rule applies.
    • When buckets freeze, they are removed from both remote and local storage.
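
For example, the interplay for a single SmartStore index might look like the following sketch (the index name and sizes are illustrative):

[example_index]
remotePath = volume:smartstore/$_index_name
# Freeze buckets after one year...
frozenTimePeriodInSecs = 31557600
# ...or once the index exceeds ~1 TB across the whole cluster, whichever limit is hit first
maxGlobalDataSizeMB = 1048576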

Prerequisites

Create the Target S3 Bucket

Commit 35d2254 enabled the creation of a SmartStore (S2) specific S3 bucket for every customer.

To add it to existing customers/slices, copy the 145-splunk-smartstore-s3/ directory from xdr-terraform-live/test/aws-us-gov/mdr-test-modelclient/ to the target customer's directory, update the referenced tag to v3.2.14 or higher if necessary, then run terragrunt-local apply to create the customer's SmartStore S3 bucket.
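
A rough sketch of the sequence, assuming the standard xdr-terraform-live layout (the target-customer path is illustrative):

cd xdr-terraform-live/test/aws-us-gov
cp -r mdr-test-modelclient/145-splunk-smartstore-s3 <target-customer>/
# Bump the referenced module tag to v3.2.14 or higher in the copied
# configuration (typically terragrunt.hcl) if necessary
cd <target-customer>/145-splunk-smartstore-s3
terragrunt-local apply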

Ensure the Index Cluster's Search Factor and Replication Factor Are Equal

The search factor (SF) must equal the replication factor (RF). These settings are usually found in the Cluster Manager's /etc/system/local/server.conf.
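
A sketch of the relevant stanza, assuming an SF and RF of 2 (the values are illustrative; only their equality matters here):

[clustering]
# mode may read 'master' on older Splunk versions
mode = manager
replication_factor = 2
search_factor = 2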

Ensure the Value of maxDataSize Is Set to "auto"

This has been corrected in the msoc-skeleton-cm configuration for all future customers. See commit 052d212.

For existing customers, it may be necessary to remove a second, conflicting maxDataSize entry (maxDataSize = 5000) in master-apps/all_indexes/local/indexes.conf; one way to check is shown below.
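
btool prints the effective setting along with the file it came from, which makes conflicting entries easy to spot:

splunk btool indexes list --debug | grep maxDataSize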

Add minFreeSpace = 20% to the [diskUsage] Stanza in server.conf

This can be placed in the CM's master-apps/_cluster/local/server.conf or in an app of its own.

[diskUsage]
minFreeSpace = 20%

Create the SmartStore (Remote) Volume

Change the value of path to match the S3 bucket name created by Terraform.

[volume:smartstore]
storageType = remote
# Do we want a path or drop everything into '/'?
path = s3://xdr-CUSTOMER-ENVIRONMENT-splunk-smartstore/
remote.s3.endpoint = https://s3.us-gov-east-1.amazonaws.com
remote.s3.supports_versioning = true
remote.s3.encryption = sse-kms
remote.s3.kms.key_id = alias/SmartStore
remote.s3.kms.auth_region = us-gov-east-1
# SSL settings for S3 communications
remote.s3.sslVerifyServerCert = true
remote.s3.sslVersions = tls1.2
remote.s3.sslAltNameToCheck = s3.us-gov-east-1.amazonaws.com
# https://www.amazontrust.com/repository/SFSRootCAG2.pem
remote.s3.sslRootCAPath = $SPLUNK_HOME/etc/auth/SFSRootCAG2.pem
remote.s3.cipherSuite = ECDHE-ECDSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:AES128-SHA256:AES256-SHA256:AES256-SHA:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA256
remote.s3.ecdhCurves = prime256v1, secp384r1, secp521r1
# SSL settings for KMS communication
remote.s3.kms.sslVerifyServerCert = true
remote.s3.kms.sslVersions = tls1.2
remote.s3.kms.sslAltNameToCheck = kms.us-gov-east-1.amazonaws.com
remote.s3.kms.sslRootCAPath = $SPLUNK_HOME/etc/auth/SFSRootCAG2.pem
remote.s3.kms.cipherSuite = ECDHE-ECDSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:AES128-SHA256:AES256-SHA256:AES256-SHA:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA256
remote.s3.kms.ecdhCurves = prime256v1, secp384r1, secp521r1

NOTE: It may be possible to reduce the cipherSuite value to TLSv1.2+HIGH:@STRENGTH rather than specifying each type. The default value is TLSv1+HIGH:TLSv1.2+HIGH:@STRENGTH.

The same commit referenced above for the msoc-skeleton-cm repository added remotePath = volume:smartstore/$_index_name to the master-apps/all_indexes/local/indexes.conf file.

NOTE: Before restarting the index cluster the first time, log in to the Cluster Manager, ensure there are no pending fix-up tasks, and remove all excess buckets.

Add the SFSRootCAG2.pem to $SPLUNK_HOME/etc/auth/

This CA is used by Splunk to validate the certificates presented by the S3 and KMS endpoints in us-gov-east-1. See https://www.amazontrust.com/repository/

This file was added to the indexers' Salt state with commit 6d5dc54 and can be added to the indexers with salt customer-splunk-idx-* state.sls splunk.indexer (it may already be present).

Update indexes.conf and server.conf Deployed by the Cluster Manager

From the Salt server, run salt customer-splunk-cm* state.sls splunk.master.apply_bundle_master test=true to confirm that the indexers will receive the SmartStore volume definition and the updated server.conf. If the output looks correct, run the command again without test=true. Then SSH to the Cluster Manager, become the splunk user (you may need to authenticate as the minion user), and run splunk show cluster-bundle-status to observe the rolling cluster restart; correct any errors reported in the bundle-validation stage.
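
The full sequence looks roughly like this (minion globs are illustrative):

# From the Salt server: dry run, then apply
salt customer-splunk-cm* state.sls splunk.master.apply_bundle_master test=true
salt customer-splunk-cm* state.sls splunk.master.apply_bundle_master

# On the Cluster Manager, as the splunk user
splunk show cluster-bundle-status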

Test the Indexers' Ability to Communicate with S3

SSH to one of the customer indexers and switch to the splunk user. You should be able to run /usr/local/bin/aws s3 ls --region us-gov-east-1 s3://xdr-<customer>-<env>-splunk-smartstore/. If the command succeeds, it returns no output (the bucket is still empty at this point).

Additional tests:

# List all indices and buckets
splunk cmd splunkd rfs -- ls --starts-with volume:smartstore

# List all buckets in a specific index
splunk cmd splunkd rfs -- ls --starts-with volume:smartstore/<some-index>/

Migrate Indices to SmartStore

Confirm That the Cluster is Healthy and in the Complete State

Set constrain_singlesite_buckets = false in the Cluster Manager's server.conf under [clustering] and restart Splunk on the Cluster Manager. This can be applied via Salt: run salt <customer>-splunk-cm* state.sls splunk.master.init test=true first, then run it without test=true once you have validated that the INI change and the service restart (and perhaps a file-permission change on etc/passwd) are the only changes.
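
The resulting stanza on the Cluster Manager (other [clustering] settings omitted):

[clustering]
constrain_singlesite_buckets = false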

Check the Bucket Status panel and resolve any pending fix-up tasks.

Test With One Index to Start

Add remotePath = volume:smartstore/$_index_name to an index such as _introspection in the Cluster Manager's copy of master-apps/all_indexes/local/indexes.conf, set frozenTimePeriodInSecs = 0 and maxGlobalDataSizeMB = 0, then apply the change via Salt.
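
The resulting test stanza should look something like this (any existing _introspection settings stay in place):

[_introspection]
remotePath = volume:smartstore/$_index_name
frozenTimePeriodInSecs = 0
maxGlobalDataSizeMB = 0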

Check Splunk's Log

index=_internal sourcetype=splunkd TERM(action=upload) 
| rex field=cache_id "\w+\|(?<indice>[^~]+)" 
| stats count(eval(status=="attempting")) AS Attempting count(eval(status=="succeeded")) AS Succeeded count(eval(status=="failed")) AS Failed BY indice
| addcoltotals labelfield=indice

The _introspection index should appear in the search results with values under "Attempting" and "Succeeded". If the value under "Failed" is greater than zero, check splunkd.log on one of the indexers to troubleshoot.
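
A search along these lines can surface the underlying errors (the CacheManager component handles SmartStore uploads; the exact fields available may vary by Splunk version):

index=_internal sourcetype=splunkd component=CacheManager log_level=ERROR
| stats count BY host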

Additional Splunk Searches:

| rest /services/admin/cacheman/_metrics splunk_server=*-splunk-idx-* 
| fields splunk_server migration.*
| rename migration.* AS * 
| sort start_epoch
| eval Duration = end_epoch - start_epoch, Duration = tostring(Duration, "duration")
| convert timeformat="%F %T %Z" ctime(start_epoch) AS Start ctime(end_epoch) AS End 
| eval Completed = round(current_job/total_jobs,4)*100, End = if(isnull(End), "N/A", End), status = case( status=="running", "Running", status=="finished", "Finished", true(), status )
| eventstats sum(eval(Completed/3)) AS overall 
| eval overall = round(overall,2)
| fields splunk_server Start End Duration status total_jobs current_job Completed overall 
| rename splunk_server AS "Splunk Indexer" status AS Status current_job AS "Current Job" total_jobs AS "Total Jobs" 
| appendpipe [ 
    | stats count BY overall 
    | eval "Current Job" = "Overall Completion" 
    | rename overall AS Completed 
    | fields Completed "Current Job"] 
| fields - overall
| eval Completed = Completed . "%"

If Splunk restarts before the migration completes, the data returned by this endpoint may not be accurate. If that happens, run the following to count buckets that are not yet stable:

| rest /services/admin/cacheman splunk_server=*-splunk-idx-*
| search cm:bucket.stable=0
| stats count BY splunk_server

IMPORTANT: Do not forget to reconfigure the retention settings after the migration.
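
For example, restoring a one-year retention on the test index might look like this (the values are illustrative; use the customer's real retention policy):

[_introspection]
remotePath = volume:smartstore/$_index_name
frozenTimePeriodInSecs = 31557600
maxGlobalDataSizeMB = 500000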

Move Remaining Indices to SmartStore

Add remotePath = volume:smartstore/$_index_name under the [default] stanza as well as under all other index definitions. This is a good time to update indexes.conf entries where a stanza is relying on values from [default] rather than having them defined per index.

For example:

[os]
homePath = volume:normal_primary/$_index_name/db
coldPath = volume:normal_primary/$_index_name/colddb
remotePath = volume:smartstore/$_index_name
thawedPath = $SPLUNK_DB/os/thaweddb
tstatsHomePath = volume:high_primary/$_index_name/datamodel_summary
coldToFrozenScript = "/usr/bin/python3" "/usr/local/bin/coldToFrozenS3.py"
frozenTimePeriodInSecs = 31557600
lastChanceIndex = lastchance
maxConcurrentOptimizes = 24
maxDataSize = auto
maxHotBuckets = 10
maxTotalDataSizeMB = 4294967295
quarantineFutureSecs = 172800
quarantinePastSecs = 604800
repFactor = auto

Once all the index definitions have remotePath defined, use Salt to apply the bundle change to the indexers. Observe the progress of the bundle application from the Cluster Manager as described above, and use the upload search from the test phase to watch data move to S3.

Thawing Frozen Data

DO NOT thaw an archived (frozen) bucket into a SmartStore index!

Create a separate, "classic" index that does not use SmartStore (no remotePath defined) and thaw the buckets into that index's thawedPath, as sketched below.
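
A minimal sketch of such an index (the name and paths are illustrative):

[thawed_restore]
# No remotePath here: this index must remain classic (non-SmartStore)
homePath = $SPLUNK_DB/thawed_restore/db
coldPath = $SPLUNK_DB/thawed_restore/colddb
thawedPath = $SPLUNK_DB/thawed_restore/thaweddb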