Create Spark Dataset from Cassandra Table


Sample Program:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession.builder()
        .appName("cassandratable")
        .master("local[2]")
        .getOrCreate();

Map<String, String> cassandraMap = new HashMap<>();

cassandraMap.put("spark.cassandra.connection.host", "<cassandraHostName>"); // e.g. localhost
cassandraMap.put("spark.cassandra.connection.port", "<cassandraPort>"); // e.g. 9042
cassandraMap.put("keyspace", "<cassandraKeySpace>");
cassandraMap.put("table", "<cassandraTable>");

Dataset<Row> cassandraTableData = sparkSession.read()
        .format("org.apache.spark.sql.cassandra")
        .options(cassandraMap)
        .load();

System.out.println("Column Names : " + Arrays.asList(cassandraTableData.columns()));
System.out.println("Table Schema : " + cassandraTableData.schema());

Bullet Points:

  1. First, we create a SparkSession to run the Spark job in local mode (master set to 'local[2]').
  2. Next, we supply the Cassandra connection details as options: host name, port number, keyspace, and table name.
  3. We then read the Cassandra table using the 'org.apache.spark.sql.cassandra' format and load it into the cassandraTableData Dataset with the load() method.
  4. The table's column names can be retrieved with the columns() method.
  5. The table's schema can be retrieved with the schema() method.
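Once loaded this way, the Dataset behaves like any other Spark DataFrame and can be queried with the DataFrame API or Spark SQL. The sketch below continues from the sample program above; the column names 'user_id' and 'name' and the filter value are hypothetical placeholders, not taken from the original table.

```java
// Continuing from the code above: cassandraTableData is the Dataset<Row>
// returned by load(). Register it as a temporary view for Spark SQL.
cassandraTableData.createOrReplaceTempView("cassandra_table");

// DataFrame API: project two hypothetical columns and filter on one of them.
Dataset<Row> filtered = cassandraTableData
        .select("user_id", "name")                        // hypothetical column names
        .filter(org.apache.spark.sql.functions.col("user_id").gt(100));
filtered.show(10);

// Equivalent query through Spark SQL against the temporary view.
sparkSession
        .sql("SELECT user_id, name FROM cassandra_table WHERE user_id > 100")
        .show(10);
```

Note that the Spark Cassandra Connector pushes eligible projections and filters down to Cassandra where possible, so selecting only the needed columns can reduce the data transferred from the cluster.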