!pip install pyspark
import findspark
#findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

# create a SparkSession
spark = SparkSession.builder.appName("ReadCSV").getOrCreate()

# read CSV file
df = spark.read.csv("/content/drive/MyDrive/Phaul/small1.csv", header=True, inferSchema=True)

# show first 5 rows
df.show(5)

# stop SparkSession
#spark.stop()

cols = df.columns
cols

# count nulls per column; count(when(cond, c)) counts only the rows where the condition holds
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# drop rows containing any null; .show() returns None, so keep the
# assignment and the display as two separate statements
df = df.na.drop(how="any")
df.show()

from collections import Counter
from pyspark import SparkConf, SparkContext

def apriori(partition, support_threshold):
    # Classic Apriori over one partition. mapPartitions hands this function
    # a plain Python iterator of transactions, not an RDD, so everything
    # here is local. (The separate candidate-generation helper from the
    # original draft is absorbed into the join step below.)
    transactions = [frozenset(t) for t in partition]
    # frequent itemsets of size 1
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([item]) for item, n in counts.items() if n >= support_threshold}
    all_frequent = {itemset: counts[next(iter(itemset))] for itemset in frequent}
    k = 2
    while frequent:
        # join step: build size-k candidates from frequent size-(k-1) itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        frequent = {c for c, n in counts.items() if n >= support_threshold}
        all_frequent.update({c: counts[c] for c in frequent})
        k += 1
    return iter(all_frequent.items())

def SON_algorithm(data, support_threshold, num_partitions):
    partitioned_data = data.repartition(num_partitions)
    # Phase 1: run Apriori on each partition with the support threshold
    # scaled down by the number of partitions, and pool the locally
    # frequent itemsets as global candidates
    local_threshold = support_threshold / num_partitions
    candidates = (partitioned_data
                  .mapPartitions(lambda p: apriori(p, local_threshold))
                  .map(lambda pair: pair[0])
                  .distinct()
                  .collect())
    # Phase 2: count every candidate over the full dataset and keep only
    # the itemsets that meet the global support threshold
    candidate_bc = sc.broadcast(candidates)
    counts = partitioned_data.flatMap(
        lambda t: [(c, 1) for c in candidate_bc.value if c <= frozenset(t)])
    frequent_itemsets = set(counts.reduceByKey(lambda x, y: x + y)
                                  .filter(lambda pair: pair[1] >= support_threshold)
                                  .map(lambda pair: pair[0])
                                  .collect())
    return frequent_itemsets

conf = SparkConf().setAppName("SON Algorithm")
sc = SparkContext.getOrCreate(conf)
# split each CSV line into a transaction of fields (the header row is not stripped here)
data = sc.textFile("/content/drive/MyDrive/Phaul/small1.csv").map(lambda line: line.split(","))
cols = ['user_id', 'business_id']
#pythonLines = data.filter(lambda data: "color" in data)
#pythonLines.collect()
support_threshold = 100
num_partitions = 4
frequent_itemsets = SON_algorithm(data, support_threshold, num_partitions)
Write, run & share Scala code online using OneCompiler's Scala online compiler for free. It is one of the more robust, feature-rich online compilers for the Scala language, running on Scala 2.13.8. Getting started with OneCompiler's Scala compiler is simple and fast: the editor shows sample boilerplate code when you choose Scala as the language and start coding.
OneCompiler's Scala online editor supports stdin: users can give input to programs using the STDIN textbox under the I/O tab. The following sample Scala program takes a name as input and prints a hello message with that name.
object Hello {
  def main(args: Array[String]): Unit = {
    val name = scala.io.StdIn.readLine() // read input from STDIN
    println("Hello " + name)
  }
}
Scala is both an object-oriented and a functional programming language, created by Martin Odersky in 2003.
A variable is a name given to a storage area so that it can be identified in our programs.
var or val variable-name [: data-type] = [initial value]
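For example, a minimal sketch of both forms (the names and values are illustrative):

var count: Int = 10 // a var is mutable and can be reassigned
count = 11
val name = "Scala" // a val is immutable; reassigning it is a compile error
println(count + " " + name)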
If, if-else, and nested if-else statements are used when you want to perform a set of operations based on conditional expressions.
if (conditional-expression) {
  // code
}

if (conditional-expression) {
  // code if condition is true
} else {
  // code if condition is false
}

if (condition-expression1) {
  // code if above condition is true
} else if (condition-expression2) {
  // code if above condition is true
} else if (condition-expression3) {
  // code if above condition is true
}
...
else {
  // code if all the above conditions are false
}
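For example, a minimal sketch of an if / else-if chain (the marks value and the grade cutoffs are illustrative):

val marks = 72
if (marks >= 90) {
  println("Grade A")
} else if (marks >= 60) {
  println("Grade B")
} else {
  println("Grade C")
}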
A for loop is used to repeatedly execute a set of statements based on a criterion.
for (index <- range) {
  // code
}
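For example, printing the numbers in a range (1 to 5 is inclusive on both ends):

for (i <- 1 to 5) {
  println("iteration " + i)
}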
A while loop is also used to repeatedly execute a set of statements based on a condition. While is usually preferred when the number of iterations is not known in advance.
while (condition) {
  // code
}
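A minimal sketch, counting down until the condition fails:

var n = 3
while (n > 0) {
  println(n) // prints 3, 2, 1
  n -= 1
}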
A do-while loop is also used to repeatedly execute a set of statements based on a condition. It is mostly used when you need to execute the statements at least once.
do {
  // code
} while (condition)
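The same countdown written with do-while; note that the body runs once even when n starts at 0:

var n = 3
do {
  println(n)
  n -= 1
} while (n > 0)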
A function is a subroutine that contains a set of statements. Functions are usually written when the same set of statements needs to be called multiple times, which improves reusability and modularity.
def function-name(parameters: parameter-types): return-type = {
  // code
}
You can write the function definition either with or without the = sign. If the = is not present, the function will not return any value (its result type is Unit).
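For example, a sketch of both styles (the function names are illustrative): add declares an explicit return type and returns its last expression, while greet returns Unit, i.e. no useful value.

def add(a: Int, b: Int): Int = {
  a + b // the last expression is the return value
}

def greet(name: String): Unit = {
  println("Hello " + name) // returns Unit
}

println(add(2, 3)) // prints 5
greet("Scala")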