Bulk Loading Options for Cassandra
Copying sstables directly (and picking them up with "refresh" or a node restart) is a good fit when:
- Your target cluster is not running, or, if it is running, is not sensitive to the latency of bulk loading at "top speed" and its associated operations.
- You are willing to de-duplicate sstable names manually (or have a tool to do it), and to figure out where to copy them in any non copy-all-to-all case. You are willing to run cleanup and/or major compaction afterward, and understand that some disk space is wasted until you do. [2]
- You don't want to deal with the potential failure modes of streaming, which are especially bad in non-LAN deploys including EC2.
- You are restoring in a case where RF=N, because you can just copy one node's data to all nodes in the new RF=N cluster and start the cluster without bootstrapping (auto_bootstrap: false in cassandra.yaml).
- The sstables you want to import are a different version than the target cluster currently creates. Example: trying to use sstableloader to load -hc- (1.0) sstables into a -hd- (1.1) cluster is reported to not work. [3]
- You have your source sstables somewhere like S3, which can easily parallelize copies to all target nodes. S3-to-EC2 transfer is fast and free, which is close to the best case for the inefficiency of the copy stage.
- You want to increase RF on a running cluster, and are ok with running cleanup and/or major compaction after you do.
- You want to restore from a cluster with RF=[x] to a cluster whose RF is the same or smaller and whose size is a multiple of [x]. Example: restoring a 9-node RF=3 cluster to a 3-node RF=3 cluster, you copy three source nodes' worth of sstables to each target node.
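The sstable-name de-duplication mentioned above can be scripted. Below is a minimal sketch that merges sstables from two source nodes into one target directory, renumbering generations so components never collide. The paths, the keyspace/column family names, and the 1.1-era "-hd-" filename format are all illustrative assumptions, not a definitive tool:

```shell
#!/bin/sh
set -e

# --- Demo setup: stand-ins for snapshots from two source nodes. ---
# Both nodes have an sstable at generation 1, so the names collide.
for n in source_node1 source_node2; do
  mkdir -p "$n"
  for c in Data.db Index.db; do
    : > "$n/mykeyspace-mycf-hd-1-$c"
  done
done

# --- Merge into one target directory, renumbering generations. ---
TARGET=target/mykeyspace/mycf
mkdir -p "$TARGET"
gen=0
for src in source_node1 source_node2; do
  # Distinct generation numbers present in this source directory.
  for g in $(ls "$src" | sed -n 's/.*-hd-\([0-9]*\)-.*/\1/p' | sort -un); do
    gen=$((gen + 1))
    # Rename every component (Data.db, Index.db, ...) of generation $g together.
    for f in "$src"/*-hd-"$g"-*; do
      base=$(basename "$f")
      cp "$f" "$TARGET/$(echo "$base" | sed "s/-hd-$g-/-hd-$gen-/")"
    done
  done
done

ls "$TARGET"   # generations 1 and 2, no collisions
```

After copying, start the node (auto_bootstrap: false for a fresh cluster) or call "refresh", then run cleanup and/or major compaction as the list above notes.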
sstableloader (or "bulkLoad" via JMX) is a good fit when:
- You have a running target cluster and want the bulk loading to respect, for example, streaming throttle limits.
- You don't have access to the data directory on your target cluster, and/or JMX to call "refresh" on it.
- Your replica placement strategy on the target cluster is so different from the source that the overhead of understanding where to copy sstables to is unacceptable, and/or you don't want to call cleanup on a superset of sstables.
- You have limited network bandwidth between the source of sstables and the target(s). In this case, copying a superset of sstables around is especially inefficient.
- Your infrastructure makes it easy to temporarily copy sstables to a set of sstableloader nodes, or to nodes on which you call "bulkLoad" via JMX. These nodes are either non-cluster-member hosts that can nonetheless participate in the cluster as a pseudo-member from an access perspective, or cluster members with sufficient headroom to bulk load.
- You can tolerate the potential data duplication and/or operational complexity that results from the fragility of streaming; LAN is the best case here. A notable difference between "bulkLoad" and sstableloader is that "bulkLoad" lacks sstableloader's "--ignores" option, which means you can't tell it to ignore replica targets on failure. [4]
- You understand that, because bulk loading uses streaming, streams on a per-sstable basis, and respects the streaming throughput cap, your performance is bounded in its ability to parallelize or burst, despite the "bulk" in the name.
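The throughput bound in that last point is easy to estimate: with streaming throttled in cassandra.yaml (stream_throughput_outbound_megabits_per_sec), a load cannot finish faster than data size divided by the cap. A quick back-of-the-envelope, assuming an illustrative 200 Mb/s throttle and 100 GB to stream from one sending node:

```shell
#!/bin/sh
# Lower bound on bulk-load time imposed by the streaming throttle.
# 200 Mb/s and 100 GB are illustrative numbers, not defaults we vouch for.
DATA_GB=100
THROTTLE_MBPS=200                 # megabits per second
MEGABITS=$((DATA_GB * 1024 * 8))  # 1 GB = 8192 megabits
MIN_SECONDS=$((MEGABITS / THROTTLE_MBPS))
echo "streaming lower bound: ${MIN_SECONDS}s (~$((MIN_SECONDS / 60)) minutes)"
```

The cap is per sending node, so spreading the load across several sstableloader hosts raises aggregate throughput, but each host still streams sstable-by-sstable under its own cap.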