If you have chosen to work in R, there are some things you need to know.
The Platform already has most of the typical libraries installed, but you will need to install the following libraries each time you start or restart a cluster.
Converting the data files into R-readable dataframes involves the following steps:
1.- Read the file (csv or parquet) in Pandas → to be done using PYTHON scripts.
2.- Convert the python Pandas dataframe to a Spark Dataframe → to be done using PYTHON scripts.
3.- Transform the Spark Dataframe into a SQL table → to be done using PYTHON scripts.
4.- Transform the SQL table into an R dataframe → to be done using R scripts.
1.- Read the file (csv or parquet) in Pandas
Install the required PYTHON packages:
Type %pip install fsspec in one of the cells and run it.
%pip install fsspec
Next, type %pip install s3fs in another cell and run it.
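For example, mirroring the previous cell:
%pip install s3fs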
Check the files under your project path.
%python
import io, os, boto3
import pandas as pd
bucket_path = '907743700548-mlops-eu-iese-privacy-safe' # name of the S3 bucket; check your bucket path in case it changes
project_folder = 'my_project_folder/' # name of the project folder
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucket_path)
for object_summary in my_bucket.objects.filter(Prefix=project_folder):
    print(object_summary.key)
Run the script. The output will be a list of the data files in the folder.
Next, read the data file you want to convert.
If the data file is a CSV file, run the following script.
Where:
907743700548-mlops-eu-iese-privacy-safe is the name of the S3 bucket
my_project is the name of the folder where the data is stored
my_file.csv is the name of the file to read. In this example it is a pipe-separated file, so sep='|'; for a comma-separated file use sep=',' and for a semicolon-separated file use sep=';'.
%python
df = pd.read_csv('s3a://907743700548-mlops-eu-iese-privacy-safe/my_project/my_file.csv', sep='|') # df is the name of the python pandas dataframe
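If the file is a parquet file instead of a CSV, a similar pandas call can be used. This is a sketch: it assumes a parquet engine such as pyarrow is available on the cluster, and my_file.parquet is an illustrative file name.
%python
df = pd.read_parquet('s3a://907743700548-mlops-eu-iese-privacy-safe/my_project/my_file.parquet') # df is the pandas dataframe; requires a parquet engine such as pyarrow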
Display the pandas dataframe to see its content and verify that the file has been read correctly.
%python
display(df) # where df is the name of the dataframe
2.- Convert the python Pandas dataframe to a Spark Dataframe
Run the following script.
%python
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Create a Spark DataFrame from a pandas DataFrame using Arrow
dfs = spark.createDataFrame(df)
display(dfs) # where dfs is the new name of the spark dataframe
You should see the same dataframe as the one displayed previously.
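As an optional sanity check (a sketch using the df and dfs names from above), the row counts of the pandas and Spark dataframes can be compared:
%python
print(len(df), dfs.count()) # both counts should be identical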
3.- Transform the Spark Dataframe into a SQL table
Run the following script.
%python
# convert the pyspark dataframe into a sql table (temp view)
dfs.createOrReplaceTempView('dfs_sql') # where dfs_sql is the name given to the sql table
Verify the content of the SQL table by running the following script.
%sql
select * from dfs_sql
4.- Transform the SQL table into an R dataframe
Run the following script in R.
%r
library(SparkR)
df_r <- sql('select * from dfs_sql') # where df_r is the name of the new R dataframe
Now display the R dataframe to validate that it has not changed during the successive transformations.
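For example, in a new R cell (a sketch; df_r is the SparkR dataframe created above):
%r
display(df_r) # renders the SparkR dataframe; head(df_r) previews the first rows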
An example of use has been created for researchers.
User Guide for Researchers: Python, PySpark, SQL and R.
1. Step 1: Use Python as the language to import the read utils and folder utils notebooks.
These utils are written in Python/PySpark, which is the default language of the Databricks notebook. To import the utils, run each utils notebook with a %run command followed by the path of the notebook.
Example:
%run /Shared/iese_data_exchange/01_utils/06_folder_utils/iese_folders_researchers
%run /Shared/iese_data_exchange/01_utils/03_input_utils/01_read_utils
2. Step 2: client_name and file_name are the input variables; change them to retrieve the data of a particular client and file.
Example:
client_name = 'imbric'
file_name = 'querySTIIESE.csv'
3. Step 3: The iese_folders_researchers utils notebook resolves the path of the file in the respective client folder. The parameters passed here are the client name and the file name.
Example:
cat_gen_ds_file_path = getReGenCatClientFile(client_name, file_name)
cat_gen_ds_file_path is the variable name.
getReGenCatClientFile is the function, defined in the iese_folders_researchers utils, that retrieves the path of the categorical descriptive statistics file for a given client_name and file_name.
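To confirm the resolved location before reading it in step 4, the path can simply be printed:
%python
print(cat_gen_ds_file_path) # S3 path of the file returned by getReGenCatClientFile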
4. Step 4: The 01_read_utils notebook reads the file from its saved location using PySpark. Since we are reading parquet files, the function used here is read_parquet_file.
Example: cat_desc_stats = read_parquet_file(cat_gen_ds_file_path)
The parameter passed here is the location of the file, stored in the variable cat_gen_ds_file_path from step 3.
5. Step 5: Convert the PySpark dataframe to a pandas dataframe to work with it in Python.
Example: cat_desc_stats.toPandas()
This shows the content of the dataframe as a pandas table.
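To keep working with the result in pandas, the conversion can be assigned to a variable; a minimal sketch (the name cat_desc_stats_pd is illustrative, not part of the utils):
%python
cat_desc_stats_pd = cat_desc_stats.toPandas() # local pandas dataframe
cat_desc_stats_pd.head() # preview the first rows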
6. Step 6: To display the dataframe in PySpark, use the command display(cat_desc_stats), where cat_desc_stats is the PySpark dataframe.
7. Step 7: Since a PySpark dataframe cannot be directly converted to R, it is advised to convert the PySpark dataframe to a SQL table, and then convert the SQL table to an R data.frame.
7a. To convert the PySpark dataframe to a SQL table, use the command below:
cat_desc_stats.createOrReplaceTempView('cat_desc_stats')
where cat_desc_stats is the PySpark dataframe and 'cat_desc_stats' is the name we give to the SQL table.
7b. To convert the SQL table to an R data.frame, use the command below:
%r
library(SparkR)
cat_desc_stats_rdf <- sql('select * from cat_desc_stats')
where we switch the language to R by using %r and create an R dataframe with a SQL query on the SQL table registered in step 7a.
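Note that sql() in SparkR returns a SparkDataFrame rather than a base R data.frame. If a local data.frame is needed (for example, for R packages that do not understand Spark dataframes), collect() brings the data into driver memory; a minimal sketch (the name cat_desc_stats_local is illustrative):
%r
cat_desc_stats_local <- collect(cat_desc_stats_rdf) # base R data.frame held in local memory
head(cat_desc_stats_local)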
Link to the notebooks: