Share
Explore

icon picker
Data Engineer Cert. - DataCamp

Welcome to my summary on Data Engineering Certification from DataCamp. This is an incomplete version. Some topic may not yet been covered. Please revisit this page from time to time to see the progress.

!Writing Efficient Code in Python

image.png

Build-in functions

Built-in types: list , tuple , set , dict , and others
Built-in functions: print(), len() , range(), round() , enumerate() , map() , zip() and others
Built-in modules: os, sys, itertools, collections, math, and others
Interesting points




Component of lambda function

Timing and profiling


Gaining Efficiencies


Pandas optimizations

!Writing Functions in Python

lorem ipsum

Intro to Shell

ls
file listing
pwd
print working directory
cd
change directory
cd ..
move up one level
cd ~
change to home directory
cp original.txt duplicate.txt
copy file
mv old_location.txt new_location.txt
move file (can also use for renaming file)
rm file.txt
remove file
rm -r directory
remove directory
cat file.txt
view file content
head file.txt
see first 10 lines of file
head -n 3 file.txt
see first 3 lines of file
ls -R
list everything below current directory
cut -f 2-5, 8 -d , values.csv
select column 2 to 5 and column 8 using , as separator
-f : field
-d : delimiter
grep search_term filename.csv
search for pattern in file
-c : print a count of matching lines instead of the lines
-h : do not print the filename when searching multiple files
-i : ignore case
-l : print the names of files that contain the matches
-n : print line number of matching lines
-v : invert match
head filename.csv > store.csv
store first 10 lines in filename.csv to store.csv
head -n 5 filename.csv | tail -n 2
store first 5 lines from filename.csv to be input for the next command
wc
print number of character -c , word -w , or line -l in a file
wildcards
* matches any characters at any length
? matches a single character
[...] matches any one of the characters inside
{...} matches any of comma separated patters inside
sort
sort output
-n numerical order
-r reversed order
-f fold case (case-insensitive)
-b ignore leading blank
uniq
remove adjacent of duplicate lines
filename=text.txt
assign text.txt to a variable called filename
echo $filename
print value contained in the variable filename
for filename in directory/*.csv; do echo $filename; done
print name of every file in the folder directory
nano filename
open filename in editor
ctrl + k cut a line
ctrl + u paste a line from clipboard
ctrl + o save the file
ctrl + x exit the editor
bash script.sh
run commands in script.sh

Data Processing in Shell

data download

curl [option flags] [URL]
client url download data from http or ftp
-O download with existing filename
-o newname download and rename to newname
-L redirect HTTP if 300 error code occurs
-C resume previous file transfer if it times out before completion
wget [option flags] [URL]
www get > native command to download files
better than curl for multiple file downloading
-b background download
-q turn off wget output
-c resume broken download
-i download from list given in a file
--wait=1 wait 1 second before download

csvkit

in2csv filename.xlsx > filename.csv
convert the first sheet in filename.xlsx to filename.csv
in2csv -n filename.xlsx
list all sheet names
in2csv filename.xlsx —sheet “sheet1” > filename.csv
convert sheet sheet1 to filename.csv
csvlook filename.csv
preview filename.csv to console
csvstat filename.csv
df.describe in console
csvcut -n filename.csv
list all column names in filename.csv
csvcut -c 1 filename.csv
return column index 1 (regarding result from csvcut -n ) from filename.csv
can be used as -c “column name” as well
csvgrep
filter by row using exact match or regex
must use one of the following options
-m exact row value
-r regex pattern
-f path to a file
csvgrep -c “column name” -m value filename.csv
filter filename.csv where column name == value
csvstack file1.csv file2.csv > allfile.csv
stack file1.csv and file2.csv together and save to allfile.csv
csvstack -g “f1”,”f2” -n "source" file1.csv file2.csv > allfile.csv
create a special column name source (instead of the default group)to identify which row comes from which file

SQL

connect to database sqlite:///database.db using query SELECT * FROM base and save to filename.csv
csvsql --query "SELECT * FROM base LIMIT 1" filename.csv
use the above query to select data from local filename.csv file
csvsql --query "SELECT * FROM base1 INNER JOIN base2 ON ID" base1.csv base2.csv
can use for multiple csv but the bases should appear in order according to SQL query
insert filename.csv to database
--no-inference disable type parsing (consider everything as string)
--no-constraints generate schema without length limit or null check

cron job

echo "* * * * * python hello_world.py" | crontab
Add as job that runs every minute on the minute to crontab
there are 5 * to indicate time for a cron job
minute(0-59)
hour(0-23)
day of month(1-31)
month(1-12)
day of week(0-6)
crontab -l
list all cron jobs

Intro to Bash

Bash is a script of shell commands
Usually starts with
#!/usr/bash or #!/bin/bash
file extension is .sh
run by bash scriptname.sh
arguments can be added by bash scriptname.sh arg1 arg2
These arg1 arg2 can be accessed inside the script by:
$1 get arg1
$2 get arg2
$@ get arg1 arg2
$# get 2 as the amount of arguments
variables
var1="sam" assign the value "sam" to var1 . Note that there must be no space between =
echo "There are $var1 items" will print There are 6 items
quotation marks
'sometext' interpreted literally
"sometext" interpreted literally except $ and `
`sometext` interpreted as bash command and return STDOUT into a variable
$(command) is equivalent to `command` but the $(command) is used more in modern applications
numerical calculation
expr 1 + 4 get 5
expr 11 + 3.5 get error as expr takes only int into account
echo "11 + 3.5" | bc get 14.5 as bc (basic calculator) can do more
echo "11/3" | bc get 3 since the decimal place has not been defined
echo "scale=3; 11/3" | bc get 3.666
$((11+3)) is expr 11 + 3
array
numerical-indexed
Want to print your doc?
This is not the way.
Try clicking the ⋯ next to your doc name or using a keyboard shortcut (
CtrlP
) instead.