If you have chosen to work in R, there are some things you need to know.
The Platform already has most of the typical libraries installed, but you will need to install the following libraries each time you start or restart a cluster.
Converting the data files into R-readable dataframes involves the following steps:
1.- Read the file (csv or parquet) in Pandas → to be done using PYTHON scripts.
2.- Convert the python Pandas dataframe to a Spark Dataframe → to be done using PYTHON scripts.
3.- Transform the Spark Dataframe into a SQL table → to be done using PYTHON scripts.
4.- Transform the SQL table into an R dataframe → to be done using R scripts.
1.- Read the file (csv or parquet) in Pandas
Install the required PYTHON packages:
Type %pip install fsspec in one of the cells and run it.
%pip install fsspec
Next, type %pip install s3fs in another cell and run it.
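For example, mirroring the previous cell:
%pip install s3fs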
Check the files under your project path.
%python
import io, os, boto3
import pandas as pd
bucket_path = '907743700548-mlops-eu-iese-privacy-safe' # name of the S3 bucket; check your bucket path in case it changes
project_folder = 'my_project_folder/' # name of the project folder
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucket_path)
for object_summary in my_bucket.objects.filter(Prefix=project_folder):
    print(object_summary.key)
Run the script. The output will be a list of the data files in the folder.
Next, read the data file you want to convert.
If the data file is a CSV file, run the following script.
Where:
907743700548-mlops-eu-iese-privacy-safe is the name of the S3 bucket
my_project is the name of the folder where the data is stored
my_file.csv is the name of the file to read. In this example it is a pipe-separated file, so sep='|'; for a comma-separated file use sep=',' and for a semicolon-separated file use sep=';'.
%python
df = pd.read_csv('s3a://907743700548-mlops-eu-iese-privacy-safe/my_project/my_file.csv', sep='|') # df is the name of the python pandas dataframe
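If the file is a parquet file instead of a CSV, a similar pandas call can be used. This is a sketch: it assumes a parquet engine such as pyarrow is available on the cluster, and my_file.parquet is an illustrative file name.
%python
df = pd.read_parquet('s3a://907743700548-mlops-eu-iese-privacy-safe/my_project/my_file.parquet') # df is the pandas dataframe; requires a parquet engine such as pyarrow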
Display the pandas dataframe to see its content and verify that the file has been read correctly.
%python
display(df) # where df is the name of the dataframe
2.- Convert the python Pandas dataframe to a Spark Dataframe
Run the following script.
%python
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Create a Spark DataFrame from a pandas DataFrame using Arrow
dfs = spark.createDataFrame(df)
display(dfs) # where dfs is the new name of the spark dataframe
You should see the same dataframe as the one displayed previously.
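As an optional sanity check (a sketch using the df and dfs names from above), the row counts of the pandas and Spark dataframes can be compared:
%python
print(len(df), dfs.count()) # both counts should be identical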
3.- Transform the Spark Dataframe into a SQL table
Run the following script.
%python
# convert the pyspark dataframe into a sql table (temp view)
dfs.createOrReplaceTempView('dfs_sql') # where dfs_sql is the name given to the sql table
Verify the content of the SQL table by running the following script.
%sql
select * from dfs_sql
4.- Transform the SQL table into an R dataframe
Run the following script in R.
%r
library(SparkR)
df_r <- sql('select * from dfs_sql') # where df_r is the name of the new R dataframe
Now display the R dataframe to validate that it has not changed during the successive transformations.
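For example, in a new R cell (a sketch; df_r is the SparkR dataframe created above):
%r
display(df_r) # renders the SparkR dataframe; head(df_r) previews the first rows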
An example of use has been created for researchers.
User Guide for Researchers: Python, PySpark, SQL and R.
1. Step 1: Use Python as the language to import the read utils and folder utils notebooks.
These utils are written in Python/PySpark, which is the default language of the Databricks notebook. To import the utils, run each utils notebook with a %run command followed by the path of the notebook.
Example:
%run /Shared/iese_data_exchange/01_utils/06_folder_utils/iese_folders_researchers
%run /Shared/iese_data_exchange/01_utils/03_input_utils/01_read_utils
2. Step 2: client_name and file_name are the input variables; change them to retrieve the data of a particular client and file.
Example:
client_name = 'imbric'
file_name = 'querySTIIESE.csv'
3. Step 3: The iese_folders_researchers utils notebook resolves the path of the file in the respective client folder. The parameters passed here are the client name and the file name.
Example:
cat_gen_ds_file_path = getReGenCatClientFile(client_name, file_name)
cat_gen_ds_file_path is the variable name.
getReGenCatClientFile is the function, defined in the iese_folders_researchers utils, that retrieves the path of the categorical descriptive statistics file for a given client_name and file_name.
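To confirm the resolved location before reading it in step 4, the path can simply be printed:
%python
print(cat_gen_ds_file_path) # S3 path of the file returned by getReGenCatClientFile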
4. Step 4: The 01_read_utils notebook reads the file from its saved location using PySpark. Since we are reading parquet files, the function used here is read_parquet_file.
Example: cat_desc_stats = read_parquet_file(cat_gen_ds_file_path)
The parameter passed here is the location of the file, stored in the variable cat_gen_ds_file_path from step 3.
5. Step 5: Convert the PySpark dataframe to a pandas dataframe to work with it in Python.
Example: cat_desc_stats.toPandas()
This shows the content of the dataframe as a pandas table.
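To keep working with the result in pandas, the conversion can be assigned to a variable; a minimal sketch (the name cat_desc_stats_pd is illustrative, not part of the utils):
%python
cat_desc_stats_pd = cat_desc_stats.toPandas() # local pandas dataframe
cat_desc_stats_pd.head() # preview the first rows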
6. Step 6: To display the dataframe in PySpark, use the command display(cat_desc_stats), where cat_desc_stats is the PySpark dataframe.
7. Step 7: Since a PySpark dataframe cannot be directly converted to R, it is advised to convert the PySpark dataframe to a SQL table, and then convert the SQL table to an R data.frame.
7a. To convert the PySpark dataframe to a SQL table, use the command below:
cat_desc_stats.createOrReplaceTempView('cat_desc_stats')
where cat_desc_stats is the PySpark dataframe and 'cat_desc_stats' is the name we give to the SQL table.
7b. To convert the SQL table to an R data.frame, use the command below:
%r
library(SparkR)
cat_desc_stats_rdf <- sql('select * from cat_desc_stats')
where we switch the language to R by using %r and create an R dataframe with a SQL query on the SQL table registered in step 7a.
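Note that sql() in SparkR returns a SparkDataFrame rather than a base R data.frame. If a local data.frame is needed (for example, for R packages that do not understand Spark dataframes), collect() brings the data into driver memory; a minimal sketch (the name cat_desc_stats_local is illustrative):
%r
cat_desc_stats_local <- collect(cat_desc_stats_rdf) # base R data.frame held in local memory
head(cat_desc_stats_local)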
Link to the notebooks: