If you have chosen to work in Python, there are some things you need to know.
The Platform already has the most common libraries installed, but you will need to install the following libraries each time you start or restart a cluster.
Libraries
Installing libraries
Install libraries by using the “pip install” command:
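For example, to install the fsspec and s3fs libraries used later in this guide, run each command in its own notebook cell (a sketch only; install whichever libraries your project needs, and note that in a Databricks-style notebook the %pip magic can be used instead of plain pip):

pip install fsspec
pip install s3fs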
Importing libraries
You can import the libraries by executing the following script:
# Standard library
import re
import json
from datetime import datetime, date, timedelta
from functools import reduce
from typing import Union

# pandas
import pandas as pd

# PySpark
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import hour, date_format

# Delta Lake
from delta.tables import DeltaTable
Or by executing the following command instead:
%run /Shared/iese_data_exchange/01_utils/01_libraries/libraries
Access the data subsets
Once a data subset has been requested, it will be stored under a “path” with a “filename” that the Administrator provides to the research users. This “path” and “filename” can only be accessed by the users in the “Project Group” to which the project is allocated, and only from the “cluster” assigned to that group.
Method 1 to access the data: using Spark libraries
Read CSV file with a pipe delimiter specified
Type the following script:
def read_csv_1(path: str) -> DataFrame:
    # Build a CSV reader with a header row, a pipe delimiter and ISO-8859-1 encoding
    df = (spark
          .read
          .format('com.databricks.spark.csv')
          .option('header', True)
          .option('delimiter', '|')
          .option('charset', 'iso-8859-1')
          )
    return df.load(path)
Where:
# path (str): the full path of the CSV file, including the CSV filename.
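A minimal usage sketch (the path below is a hypothetical placeholder; use the path and filename provided by the Administrator):

df_subset = read_csv_1('s3a://bucket-name/project_1/filename.csv')   # hypothetical path
df_subset.show(5)   # preview the first rows of the data subset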
Read CSV file with a specified schema
def read_csv(path: str, schema: StructType, options: dict = {}) -> DataFrame:
    # Build a CSV reader with the provided schema
    df = (spark
          .read
          .format('csv')
          .schema(schema)
          )
    if options:  # apply any additional reader options, if provided
        df = df.options(**options)
    return df.load(path)
Where:
# path (str): the full path of the CSV file, including the CSV filename.
# schema (StructType): the schema describing the columns of the CSV file.
# options (dict): optional additional reader options, e.g. {'header': 'true', 'delimiter': '|'}.
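A minimal usage sketch. The column names and types below are hypothetical placeholders; replace them with the actual schema of your data subset, and use the path and filename provided by the Administrator:

# Hypothetical schema, for illustration only
subset_schema = StructType([
    StructField('customer_id', StringType(), True),
    StructField('purchase_date', DateType(), True),
    StructField('amount', DoubleType(), True),
])

df_subset = read_csv(
    's3a://bucket-name/project_1/filename.csv',    # hypothetical path
    schema=subset_schema,
    options={'header': 'true', 'delimiter': '|'}   # optional reader options
)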
Read Delta table
def read_delta_table(path: str, options: dict = {}) -> DataFrame:
    # Build a Delta reader
    df = (spark
          .read
          .format('delta')
          )
    if options:  # apply any additional reader options, if provided
        df = df.options(**options)
    return df.load(path)
Where:
# path (str): the full path of the Delta table folder.
# options (dict): optional additional reader options.
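A minimal usage sketch (the path below is a hypothetical placeholder; use the Delta table path provided by the Administrator):

df_delta = read_delta_table('s3a://bucket-name/project_1/delta_table_name')   # hypothetical path
df_delta.printSchema()   # inspect the columns of the Delta table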
Method 2: using other libraries
Install libraries
In separate notebook cells, execute the following scripts:
1.- Install the fsspec library: pip install fsspec
2.- Install the s3fs library to access the S3 bucket: pip install s3fs
3.- Check the files in your folder. Executing the following script will list the files within a path.
import io, os, boto3
import pandas as pd

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('path')  # bucket name only, without the s3a:// prefix
for object_summary in my_bucket.objects.filter(Prefix="prefixname/"):
    print(object_summary.key)  # print the key (file name) of each object under the prefix
Where:
# path = the name of the S3 bucket, without the s3a:// prefix, i.e. 907743700548-mlops-eu-iese-privacy-safe
# prefixname = the name of the folder where the file is stored, after the bucket name, i.e. project_1
The output will be a list of the files stored under this path.
4.- Convert the file (i.e. a CSV file) into a dataframe to start working with it:
item = pd.read_csv('fullpath', sep='|')
Where:
# fullpath = the complete path of the file, i.e. s3a://907743700548-mlops-eu-iese-privacy-safe/project_1/filename.csv
5.- Display the dataset: this will allow you to view and start “playing” with the dataset using the platform’s own capabilities, where item is the name of the dataframe generated in step 4.
This opens the possibility to view the dataframe or to plot it (see the sketch below).
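A minimal sketch of this step, assuming a Databricks-style notebook where the display() helper is available (item is the pandas dataframe created in step 4):

# Render the dataframe as an interactive table that can also be plotted from the notebook
display(item)

# Alternative preview that works in any Python environment
item.head(10)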