Handy to know shizzle

Archiving Smarter Every Day episodes

Author: /u/kyle0r

Published: 2023-02-06.

Approach

I wanted to create an archive of https://www.youtube.com/@smartereveryday/videos created by American engineer and science communicator Destin Sandlin, who has become a famous YouTube content creator. It's undeniable that watching his content makes one smarter every time 😉

Major Kudos to Destin! 🖖 He has made some amazing content available to folks whilst also being a positive role model. Long may that continue. I’ve also discovered a bunch of other interesting folks and channels through Destin. 🙏

My motivations included:

Making a Blu-ray backup copy of @smartereveryday/videos for safekeeping - it's intellectually valuable content in my eyes. Who knows which YouTube channels will get shut down next, or which videos will be made private.

Enabling offline viewing e.g. USB stick in a TV.

To form part of our family knowledge archive, for current and future generations.

I’ve captured this doc so I can reuse it in the future - it might be a useful resource for others too.

Prerequisites

Basic to intermediate bash and vim experience is required. Of course you can replace vim with the EDITOR of your choice.

It will help if you have a good grasp of JSON for certain aspects, but this is not strictly required.

It will help if you have a basic to intermediate understanding of regular expressions (interactive online intro course).

yt-dlp utility (link), with the PhantomJS headless browser as an optional extra. Suggested install: python3 -m pip install --upgrade yt-dlp

jq utility (link) if you wish to parse JSON on the CLI and follow along with some of the approaches herein. Suggested install: apt install jq or the equivalent command from your distro's package manager.

Commands overview

grab --match-filter "title ~= (?i)Smarter Every Day [0-9]{1,3}" 'https://www.youtube.com/@smartereveryday/videos'

grab is a bash function as follows:

# the base alias

alias ___grab='yt-dlp --concurrent-fragments 10 --downloader aria2c --restrict-filenames'

# use functions to avoid dealing with quoting and escaping inside aliases

function grab() { ___grab --format 'bestvideo[ext!=webm]+bestaudio[ext!=webm]/best[ext!=webm]' "$@"; }

function grab-max() { ___grab --format 'bestvideo+bestaudio/best' --remux-video mkv "$@"; }

function grab-audio() { ___grab --format 'bestaudio[ext!=webm]' "$@"; }

function grab-audio-max() { ___grab --format 'bestaudio' --remux-video mka "$@"; }

⚠️ If you don’t have aria2c available, either install it or switch to the native downloader with --downloader native.
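If you would rather not hard-code the downloader, a minimal sketch along these lines could pick one at runtime. The helper name pick_downloader is hypothetical (it is not part of yt-dlp), and the sketch only assumes that aria2c, when installed, is on your PATH.

```shell
# Hypothetical helper (not part of yt-dlp): print the value to pass to
# yt-dlp's --downloader option, preferring aria2c when it is installed.
pick_downloader() {
  if command -v aria2c >/dev/null 2>&1; then
    printf 'aria2c\n'
  else
    printf 'native\n'
  fi
}

# Show the base command that would be used on this machine.
printf 'yt-dlp --concurrent-fragments 10 --downloader %s --restrict-filenames\n' "$(pick_downloader)"
```

The same $(pick_downloader) substitution could be dropped into the ___grab alias in place of the literal aria2c.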

Let's take a look at a 4K example video on YouTube:

grab --list-formats 'https://www.youtube.com/watch?v=LXb3EKWsInQ'

⁠

Notes on digital multimedia container formats and stream quality

As you can see from the --list-formats output, there are sometimes lots of formats to choose from!

The grab function attempts to simplify this by downloading the best and most compatible video and audio streams and muxing them together; typically this results in an mp4 container. The function excludes the webm container, which is less commonly supported on media players and devices than mp4 and mkv. If you prefer mkv containers you can append --remux-video mkv. ⚠️ A caveat to this webm exclusion is that video formats above 1080p with higher bitrates, and likewise some audio formats (Opus rather than AAC), are often only available in webm containers (especially on YouTube). Do consider what suits your requirements and make your own adjustments. Personally, for YouTube-esque content, 1080p is typically a good balance of file size and quality/bitrate. 2K+ streams are more demanding on storage space and on the CPU/GPU power required to decode them.

💡 Video quality - If you want yt-dlp to download maximum quality/bitrate streams you can use grab-max to create an mkv container with the best quality streams available.

💡 Audio quality - as a general rule of thumb both AAC and Opus are nearly transparent above ~128 kbit/s, so it's really a personal choice which you prefer. I have included grab-audio and grab-audio-max for your consideration.
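For the 1080p balance mentioned above, one possible adjustment is to cap the height in the format selector. This is a sketch only: grab_1080p and the YTDLP_BIN variable are hypothetical names introduced for illustration, while the [height<=1080] filter is standard yt-dlp format selection syntax.

```shell
# Hypothetical variant of grab, capped at 1080p and still excluding webm.
# YTDLP_BIN is an assumed override hook; it defaults to the real yt-dlp.
YTDLP_BIN="${YTDLP_BIN:-yt-dlp}"

grab_1080p() {
  "$YTDLP_BIN" --restrict-filenames \
    --format 'bestvideo[height<=1080][ext!=webm]+bestaudio[ext!=webm]/best[height<=1080][ext!=webm]' \
    "$@"
}
```

Usage would mirror grab, e.g. grab_1080p --list-formats followed by a video URL to check what the selector has to choose from.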

What is yt-dlp?

⁠


The yt-dlp utility is a fantastic open source Python project for downloading audio and video content from online platforms like YouTube. The project is an active fork of the inactive youtube-dl project. Link: https://github.com/yt-dlp/yt-dlp

⁠

💡 As a general rule of thumb - if there is media playing on a website yt-dlp is likely able to parse it and download the content.

Sidebar: I often use yt-dlp to view or obtain content from social media sites that are network blocked for privacy and/or family protection, and/or to grab media from links I do not wish to open in a browser for security/privacy/tracking reasons and/or don’t have accounts for. E.g. it works well to anonymously grab content from LinkedIn, Facebook, Twitter and Instagram and similar platforms.

One can use any node/container that can run Python (perhaps even serverless), optionally with a VPN or proxy, to grab content without sacrificing your privacy or revealing your internet location.

Procuring and/or updating yt-dlp

Detailed instructions are here. I use the pip approach:

# The following cmd can be used for installation and/or updating

python3 -m pip install --upgrade yt-dlp

💡 I would recommend using the Python/pip approach to install yt-dlp because it provides an easy way of keeping yt-dlp up to date, which is required to maintain support and compatibility for the ever-changing media platforms. I would recommend against distro packages because they tend to lag behind the release cycle of the project.

💡 yt-dlp has a regular release cycle, sometimes multiple releases per month, to deal with the ever-changing YouTube player and other sites updating their platform players. It's good to ensure you set up yt-dlp in a way that it's easy to update. The utility also has an --update option to self-update at invocation time, but as far as I can tell that only works as expected for binary installations; with a pip installation you update through pip.

It would be possible to create a simple bash alias to assist with updating:

alias yt-dlp-updater='python3 -m pip install --upgrade yt-dlp'

PhantomJS headless browser

PhantomJS is optional. If you have it available, it can enhance yt-dlp's ability to successfully parse URIs. PhantomJS is unfortunately no longer maintained but it's still a valid library for certain use cases like this. I have it in my node_modules so I provide that bin path to yt-dlp. I use the phantomjs-prebuilt package: https://www.npmjs.com/package/phantomjs-prebuilt. ⚠️ From an InfoSec perspective you don’t want PhantomJS to have elevated rights or rights to anything important in case some page that it parses contains an exploit that hasn’t been patched since development stopped on the project.

Here is my alias with an appended PATH for PhantomJS:

alias ___grab='PATH=$PATH:/path/to/phantomjs/bin yt-dlp --concurrent-fragments 10 --downloader aria2c --restrict-filenames'
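If the PhantomJS directory may not exist on every machine you use, a small guard can keep PATH clean. A minimal sketch, assuming a hypothetical helper named path_with_dir; the directory you pass in is whatever your own PhantomJS bin path happens to be.

```shell
# Hypothetical helper: print a PATH value with the given directory appended,
# but only if that directory actually exists.
path_with_dir() {
  if [ -d "$1" ]; then
    printf '%s:%s\n' "$PATH" "$1"
  else
    printf '%s\n' "$PATH"
  fi
}

# Example: a directory that does not exist leaves PATH unchanged.
path_with_dir /path/that/does/not/exist
```

The result could then be used as PATH="$(path_with_dir /path/to/phantomjs/bin)" in front of the yt-dlp command.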

Back on topic

The grab command with --match-filter was effective for the large majority of Smarter Every Day episodes in the videos playlist. There were a handful of episodes that didn't match the pattern, and a bunch of videos (~80 at the time of writing) that fall into one of the categories miscellaneous, supporting or legacy content.

For the small minority of episodes that were missed by the filter - I grabbed those separately. You can find my reconciliation steps in the next section.

Preview and reconciliation

For a super fast look at what your command would download you can use the --flat-playlist and --print options:

grab --match-filter 'title ~= (?i)Smarter Every Day [0-9]{1,3}' 'https://www.youtube.com/@smartereveryday/videos' --skip-download --flat-playlist --print title

⁠

💡 A little trick you can use with the --match-filter option is to invert the filter to see a list of everything that didn’t match your filter. Change the ~= to !~=. Simple but can be very useful.
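To get a feel for the pattern without touching the network, the same regular expression can be tried against a few titles with grep. This is only an approximation: grep -E is not the Python regex engine that yt-dlp uses, -i stands in for the (?i) flag, and the sample titles below are made up for illustration.

```shell
# Made-up sample titles, one per line (not real episode titles).
sample_titles() {
  printf '%s\n' \
    'Example Title A - Smarter Every Day 12' \
    'example title b - SMARTER EVERY DAY 250' \
    'Channel trailer (no episode number)'
}

# Roughly what 'title ~= (?i)Smarter Every Day [0-9]{1,3}' would keep ...
sample_titles | grep -Ei 'Smarter Every Day [0-9]{1,3}'

# ... and roughly what the inverted '!~=' filter would keep.
sample_titles | grep -Eiv 'Smarter Every Day [0-9]{1,3}'
```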

⚠️ Note that the previous --flat-playlist command executes quickly because it skips downloading, extracting and parsing the majority of the metadata associated with the given media resources. This means most metadata is not available in this mode. I would assume this also limits which fields --match-filter is able to match. The inverse option, --no-flat-playlist, will extract the full metadata - it's slower but can be useful.

grab --match-filter "title ~= (?i)Smarter Every Day [0-9]{1,3}" 'https://www.youtube.com/@smartereveryday/videos' --skip-download --no-flat-playlist --print title --print upload_date --print duration

⁠

Dumping JSON

Perhaps the most useful options for working with metadata are --dump-json and --dump-single-json. Let's take a look at them in action and use the fantastic JSON utility jq (see: link) to make a JSON query on the output:

grab --match-filter 'title !~= (?i)Smarter Every Day [0-9]{1,3}' 'https://www.youtube.com/@smartereveryday/videos' --skip-download --dump-json | jq 'with_entries( select( [.key] | inside( ["id","title", "duration", "upload_date"] ) ) )'

⁠

The --dump-json option will write one JSON object per media resource, per line. That is to say the output is newline-delimited JSON. See here and here for .jsonl format info.

With --dump-json, lines are output as each media resource is processed. This is where --dump-single-json is different.

--dump-single-json stores all data in a single JSON document - one side effect is that output is buffered until all items in the playlist are processed - this can take some minutes for longer playlists. The output is prepended with metadata about the playlist and source channel, at least for YouTube.
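A practical consequence of the newline-delimited format is that ordinary line tools work on --dump-json output. The sketch below uses a made-up three-line stand-in (sample_jsonl is a hypothetical helper, and the ids and titles are invented) rather than real yt-dlp output.

```shell
# Made-up stand-in for --dump-json output: one JSON object per line.
sample_jsonl() {
  printf '%s\n' \
    '{"id": "aaa", "title": "Example A"}' \
    '{"id": "bbb", "title": "Example B"}' \
    '{"id": "ccc", "title": "Example C"}'
}

# One line per media resource, so counting lines counts resources.
sample_jsonl | wc -l

# Each line is a complete JSON document on its own.
sample_jsonl | sed -n '2p'
```

A --dump-single-json document, by contrast, has to be parsed as a whole, with the per-video objects nested under its entries key.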

Saving JSON as a receipt and/or record keeping

You may wish to redirect the JSON output to a file as a receipt or for record keeping. Keep in mind the metadata can be rather verbose and is a good candidate for high levels of compression. Maybe your filesystem already does block level compression for you? If not you could use the modern zstd or legacy gzip etc.

grab --match-filter 'title !~= (?i)Smarter Every Day [0-9]{1,3}' 'https://www.youtube.com/@smartereveryday/videos' --skip-download --dump-single-json | zstdmt --stdout > smartereveryday.json.zst

💡 You may wish to use jq to post-process the JSON files to prune metadata that you don’t find relevant. As of writing the @smartereveryday/videos URI JSON was ~101 MiB uncompressed for 362 videos. That is avg ~0.28 MiB of metadata per video.

With a straightforward jq query it's possible to reduce the file size by ~94%:

jq -c 'del (.entries[] | .thumbnails, .automatic_captions, .formats, .requested_downloads, .requested_formats)' smartereveryday.json > smartereveryday-pruned.json

This takes the 101 MiB uncompressed JSON down to 5.7 MiB, and with zstd compression that goes down to 0.37 MiB which is an overall size reduction of ~99.6% 💪👍
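The percentages can be checked with a few lines of awk. The sizes are the ones quoted above; pct_saved is a hypothetical helper name used only for this sketch.

```shell
# Figures quoted above, in MiB, plus the video count.
raw=101; pruned=5.7; pruned_zst=0.37; videos=362

# Hypothetical helper: percentage saved when going from big ($2) to small ($1).
pct_saved() {
  awk -v small="$1" -v big="$2" 'BEGIN { printf "%.1f\n", (1 - small / big) * 100 }'
}

pct_saved "$pruned" "$raw"        # pruning only      -> 94.4
pct_saved "$pruned_zst" "$raw"    # pruning plus zstd -> 99.6

# Average metadata per video in MiB -> 0.28
awk -v raw="$raw" -v n="$videos" 'BEGIN { printf "%.2f\n", raw / n }'
```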

Ready to download

When you have finished previewing the media resources and tuning your filters etc., it's time to remove the options for skipping downloads and listing metadata, and actually download the media resources:

grab --match-filter "title ~= (?i)Smarter Every Day [0-9]{1,3}" 'https://www.youtube.com/@smartereveryday/videos'

Post download reconciliation

The --match-filter grabbed the majority of the episodes but there were a handful missing - so some reconciliation was required. The following command was my approach. I recorded a 5 minute screencast to explain the command and approach.

vimdiff <(seq 1 281) <(find . -maxdepth 1 -a -type f -a -name '*.mp4' | perl -ne '/^(.*?)Smarter_Every_Day_([0-9]{1,3})(?:.*?)-(?:\[[^\]]+\]\.mp4)$/ && { print "$2\t$_" } || { print "???\t$_" }' | sort -k1n,1)

⁠

YouTube link, and asciinema cast link
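If you only want the list of missing episode numbers rather than a visual diff, a non-interactive variant is possible. This is a sketch under assumptions, not the approach from the screencast: missing_episodes is a hypothetical helper, it assumes filenames contain Smarter_Every_Day_ followed by the episode number, and the sample filenames are made up.

```shell
# Hypothetical helper: read filenames on stdin and print every episode
# number in 1..MAX ($1) that has no matching filename.
missing_episodes() {
  max="$1"
  sed -n 's/.*Smarter_Every_Day_\([0-9]\{1,3\}\).*/\1/p' |
    awk -v max="$max" '
      { seen[$1 + 0] = 1 }
      END { for (i = 1; i <= max; i++) if (!(i in seen)) print i }'
}

# Made-up filenames: episodes 1 and 3 are present, 2 and 4 are not.
printf '%s\n' \
  './Example_A_-_Smarter_Every_Day_1-[id1].mp4' \
  './Example_B_-_Smarter_Every_Day_3-[id3].mp4' |
  missing_episodes 4
```

In practice the input would come from the same find command used above, with 281 as the maximum.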

⁠

Info and Episode lists specific to Smarter Every Day

⁠

https://thetvdb.com/series/smarter-every-day/allseasons/official⁠

https://www.imdb.com/title/tt4424838/episodes⁠

💡 If those links stop working you can always try pasting them into archive.org search.

Missing Episodes

E70 | Why clip feathers? An interview of Kelly | http://www.youtube.com/watch?v=x38p0EK7OA8 | private

E71 | Changing feathers while still being able to fly? | http://www.youtube.com/watch?v=3YSaCntyJ_Y | private

E72 | You Completed Deep Dive #2 | http://www.youtube.com/watch?v=4YhtJIVZUl4 | private

Number of Episodes

As of writing there are 281 episodes. One episode has two parts (E68). Three episodes are missing/private (E70 to E72); they are related to birds, so perhaps they were removed on ethics grounds? Total: 281 - 3 + 1 = 279 mp4s
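The expected file count can be compared against what is actually on disk. A minimal sketch, using the figures above; count_mp4 is a hypothetical helper and the directory argument defaults to the current directory.

```shell
# Figures from the text above.
total_episodes=281; private_episodes=3; extra_parts=1
expected_files=$((total_episodes - private_episodes + extra_parts))

# Hypothetical helper: count the mp4 files directly inside a directory.
count_mp4() {
  find "${1:-.}" -maxdepth 1 -type f -name '*.mp4' | wc -l | tr -d ' '
}

printf 'expected %s mp4 files, found %s\n' "$expected_files" "$(count_mp4 .)"
```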

Post processing and renaming

Objective

Renaming the mp4s to include an episode number prefix, so they have a cross-platform, human-friendly sort order. I.e. when you look at the Blu-ray file list you see a nice logical sort order of the files.

💡 Do consider that yt-dlp supports customising the filenames that it will use for downloaded content, at the very least it allows you to use the supported fields and maybe it supports pattern matching and replacement. I chose to do the renaming workflow visually as a separate step, because my dst filename was a little complex and needed QA steps. I chose to let yt-dlp be great at its primary function and handled the batch rename myself - each to his own approach as they say 😊

The src filename pattern looked like this: <title> Smarter_Every_Day <episode nr> <misc> <id> <filename suffix>

I recall adjusting a few of the src filenames to create <episode chapter>, which demarcates a multi-part episode, e.g. E68. Actually it looks like E68B was only linked in the description of E68A and not in the main videos playlist.

My dst filename pattern looked like this: <episode nr prefix - zero padded> <episode chapter if present> <title> Smarter_Every_Day <episode nr> <misc> <id> <filename suffix>
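To make the dst pattern concrete, here is a sketch that assembles one dst filename from its parts. dst_name is a hypothetical helper and the title and id values are made up; the zero padding uses printf with %03d, which assumes the episode number is passed without leading zeros.

```shell
# Hypothetical helper: build a dst filename.
# Arguments: episode, chapter (may be empty), title, misc (may be empty), id.
dst_name() {
  episode="$1"; chapter="$2"; title="$3"; misc="$4"; id="$5"
  printf '%03d%s_%s_-_Smarter_Every_Day_%s%s%s-[%s].mp4\n' \
    "$episode" "$chapter" "$title" "$episode" "$chapter" "$misc" "$id"
}

dst_name 7 '' 'Example_Title' '' 'abc123'
dst_name 68 'A' 'Example_Title' '' 'def456'
```

Zero padding to three digits is what keeps 7, 68 and 281 in episode order in a plain alphabetical file listing.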

I have recorded the bulk renaming approach:

YouTube link⁠

, and

asciinema cast link⁠

⁠

The following command takes a list of (src) filenames as input, and outputs a tab-separated list of sort key (episode number), filename (src) and renamed filename (dst), sorted by episode number, into vim. regex101.com was used to design and test the regex pattern.

find . -maxdepth 1 -a -type f -a -name '*.mp4' | perl -ne 'if (/^\.\/(?<title>.*?)(?:_-_|_|_-)Smarter_Every_Day_(?<episode>[0-9]{1,3})(?<chapter>[A-Z])?(?<misc>.*?)-(?<id>\[[^\]]+\])\.mp4$/) { chomp $_; print("$2\t$_"); printf("\t%03d",$+{episode}); print "$+{chapter}_$+{title}_-_Smarter_Every_Day_$+{episode}$+{chapter}$+{misc}-$+{id}.mp4\n"; } else { print "???\t$_" }'| sort -k1n,1 | vim +'set nowrap' -

Regex visualisation courtesy of https://jex.im/regulex. For the purposes of visualisation I had to remove the named groups because the parser didn’t like them. The regex is otherwise functionally identical.

⁠

To break down the command:

find the desired mp4 files and output the filenames as a newline-separated list

parse the input with perl: for each line, if the regex matches, output the sort key, src and dst filenames, else output ???\t$_

sort the output numerically on column 1 (sort by episode number)

open vim with the piped input in the buffer, and disable line wrapping

Once inside vim it's easy to use visual block mode CTRL+v to remove the sort column, and prepend the mv -iv command to each line, format, QA and execute the commands.

The following command takes the current vim buffer and, for each line, using tab as the input field separator, reads the fields into variables one (src) and two (dst), and prints a formatted string. The %q format escapes the strings so they are safe for use in shell scripting, so the paths will be safe to pass to mv and other commands. The variables are quoted so that a path containing spaces stays in one piece. See: https://manpages.debian.org/stable/coreutils/printf.1.en.html#q

:%!bash -c "while IFS=$'\t' "'read -r one two; do printf "\%q \%q\n" "$one" "$two"; done'

When everything looks good, you can execute the current vim buffer in bash.

⚠️ It is important to QA your commands before asking bash to execute your vim buffer :) What is your rollback plan?

:%!bash
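One possible rollback plan is to keep a copy of the src/dst list before turning it into mv commands, and to generate the reverse list from it. A minimal sketch, assuming a two-column tab-separated list (the sort column already removed); rollback_list is a hypothetical helper and the filenames are made up.

```shell
# Hypothetical helper: swap the src and dst columns of a tab-separated
# rename list read from stdin, giving the list needed to undo the renames.
rollback_list() {
  awk -F '\t' 'NF == 2 { printf "%s\t%s\n", $2, $1 }'
}

# Made-up example pair (src<TAB>dst).
printf '%s\t%s\n' \
  './Example_-_Smarter_Every_Day_7-[abc123].mp4' \
  './007_Example_-_Smarter_Every_Day_7-[abc123].mp4' |
  rollback_list
```

The swapped list can go through the same %q formatting step to produce the undo commands.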

There are multiple benefits from performing this workflow inside vim.

When executing or processing buffer content in external commands, the buffer input is replaced by the stdout and stderr from the given command, and can be moved to a separate buffer for notes and other evaluations. For batch jobs it can be useful to save the output as a receipt/log of the work done.

The vim u and CTRL+r keystroke commands can be used to cycle buffer changes backwards and forwards, e.g. before and after command execution. If the command(s) need adjustment: undo, modify, try again. 💡 Consider that in the case of renaming files one can undo the buffer changes but not the rename operation itself.