If you have chosen to work in Python, there are some things you need to know.
The Platform already has the most common libraries installed, but you will need to install the following libraries each time you start or restart a cluster.
Libraries
Installing libraries
Install libraries by using the “pip install” command:
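For example, to install the fsspec and s3fs libraries used later in this guide, run each command in its own notebook cell (a sketch only; install whichever libraries your project needs, and note that in a Databricks-style notebook the %pip magic can be used instead of plain pip):

pip install fsspec
pip install s3fs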
Importing libraries
You can import the libraries by executing the following script:
# Standard library
import re
import json
from datetime import datetime, date, timedelta
from functools import reduce
from typing import Union

# pandas
import pandas as pd

# PySpark
from pyspark.sql import SparkSession, DataFrame, Window
from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import hour, date_format

# Delta Lake
from delta.tables import DeltaTable
Or by executing the following command instead:
%run /Shared/iese_data_exchange/01_utils/01_libraries/libraries
Access the data subsets
Once a data subset has been requested, it will be stored under a “path” with a “filename” that the Administrator provides to the research users. This “path” and “filename” can only be accessed by the users in the “Project Group” to which the project is allocated, and only from the “cluster” assigned to that group.
Method 1 to access the data: using Spark libraries
Read CSV file with a pipe delimiter specified
Type the following script:
def read_csv_1(path: str) -> DataFrame:
    # Build a CSV reader with a header row, a pipe delimiter and ISO-8859-1 encoding
    df = (spark
          .read
          .format('com.databricks.spark.csv')
          .option('header', True)
          .option('delimiter', '|')
          .option('charset', 'iso-8859-1')
          )
    return df.load(path)
Where:
# path (str): the full path of the CSV file, including the CSV filename.
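A minimal usage sketch (the path below is a hypothetical placeholder; use the path and filename provided by the Administrator):

df_subset = read_csv_1('s3a://bucket-name/project_1/filename.csv')   # hypothetical path
df_subset.show(5)   # preview the first rows of the data subset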
Read CSV file with a specified schema
def read_csv(path: str, schema: StructType, options: dict = {}) -> DataFrame:
    # Build a CSV reader with the provided schema
    df = (spark
          .read
          .format('csv')
          .schema(schema)
          )
    if options:  # apply any additional reader options, if provided
        df = df.options(**options)
    return df.load(path)
Where:
# path (str): the full path of the CSV file, including the CSV filename.
# schema (StructType): the schema describing the columns of the CSV file.
# options (dict): optional additional reader options, e.g. {'header': 'true', 'delimiter': '|'}.
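A minimal usage sketch. The column names and types below are hypothetical placeholders; replace them with the actual schema of your data subset, and use the path and filename provided by the Administrator:

# Hypothetical schema, for illustration only
subset_schema = StructType([
    StructField('customer_id', StringType(), True),
    StructField('purchase_date', DateType(), True),
    StructField('amount', DoubleType(), True),
])

df_subset = read_csv(
    's3a://bucket-name/project_1/filename.csv',    # hypothetical path
    schema=subset_schema,
    options={'header': 'true', 'delimiter': '|'}   # optional reader options
)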
Read Delta table
def read_delta_table(path: str, options: dict = {}) -> DataFrame:
    # Build a Delta reader
    df = (spark
          .read
          .format('delta')
          )
    if options:  # apply any additional reader options, if provided
        df = df.options(**options)
    return df.load(path)
Where:
# path (str): the full path of the Delta table folder.
# options (dict): optional additional reader options.
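A minimal usage sketch (the path below is a hypothetical placeholder; use the Delta table path provided by the Administrator):

df_delta = read_delta_table('s3a://bucket-name/project_1/delta_table_name')   # hypothetical path
df_delta.printSchema()   # inspect the columns of the Delta table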
Method 2: using other libraries
Install libraries
In separate notebook cells, execute the following scripts:
1.- Install the fsspec library: pip install fsspec
2.- Install the s3fs library to access the S3 bucket: pip install s3fs
3.- Check the files in your folder. Executing the following script will list the files within a path.
import io, os, boto3
import pandas as pd

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('path')  # bucket name only, without the s3a:// prefix
for object_summary in my_bucket.objects.filter(Prefix="prefixname/"):
    print(object_summary.key)  # print the key (file name) of each object under the prefix
Where:
# path = the name of the S3 bucket, without the s3a:// prefix, i.e. 907743700548-mlops-eu-iese-privacy-safe
# prefixname = the name of the folder where the file is stored, after the bucket name, i.e. project_1
The output will be a list of the files stored under this path.
4.- Convert the file (i.e. a CSV file) into a dataframe to start working with it:
item = pd.read_csv('fullpath', sep='|')
Where:
# fullpath = the complete path of the file, i.e. s3a://907743700548-mlops-eu-iese-privacy-safe/project_1/filename.csv
5.- Display the dataset: this will allow you to view and start “playing” with the dataset using the platform’s own capabilities, where item is the name of the dataframe generated in step 4.
This opens the possibility to view the dataframe or to plot it (see the sketch below).
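A minimal sketch of this step, assuming a Databricks-style notebook where the display() helper is available (item is the pandas dataframe created in step 4):

# Render the dataframe as an interactive table that can also be plotted from the notebook
display(item)

# Alternative preview that works in any Python environment
item.head(10)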