Let me let you into my head approaching the end of the 2023 Winter semester at SDSU. My naive understanding of ML + NLP architectures was overshadowed by my captivation with the shiny new algorithms in the landscape. I could open arXiv, scroll down to Machine Learning, and spend hours reading the newest papers. Scientific publications with funny titles that absurdly overstate what their underlying architectures could possibly accomplish: I loved it. Meme your way to the top in a scholarly article; pay homage to the old landscape, roast the previous architecture, introduce your hard work, and offer the reader directions to walk for further development. Now it was my turn: the final project of my CS561 course allowed us to extend the base BERT model and submit a novice white paper. I was challenged, and I let it fuel me. I wanted to speed things up, clean things up. I wanted to optimize. So I started my research. Every morning, two hours exploring papers on arXiv and HuggingFace. I dove down this rabbit hole; stick with me here, I know I make some jumps:
- BERT's bidirectionality allows it to contextually understand language better than previous models...
- ...but it's inefficient: it scales poorly and is computationally intensive.
- Dudes didn't plan, had to finish a project in 24 hours: "How far can we get with a single GPU in just one day?"
- Compression and sparsity techniques without sacrificing performance.
- ...a trillion, huh? That's a big number.
- Speed enhancements and scalability for large-scale LMs.
- Replace traditional attention with operations mapping inputs directly to their frequency characteristics, simplifying computation.
- Fast feedforward networks (FFFs) leverage basic CPU capabilities and optimized mathematical routines.

Hard stop. CPU? Seems like a step back in a world obsessed with GPUs, but here's the kicker:
Conditional execution based on the input, sidestepping the need for dense matrix multiplication: they used a balanced binary tree structure, descending it for each input. Looking back, I realize I was enamored precisely because I had difficulty grasping the logic of the FFF layer.
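To make that concrete, here's a minimal toy sketch of the idea as I understand it now. This is my own reconstruction, not the authors' implementation; the class name, shapes, and initialization are all made up. Each input descends a depth-d binary tree of routing neurons, so a forward pass touches only d + 1 neurons instead of all 2^d of them:

```python
# Toy fast-feedforward (FFF) layer: route each input down a binary tree.
# My reconstruction for illustration, not the UltraFastBERT code.
import torch
import torch.nn as nn

class TinyFFF(nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.depth = depth
        n_nodes = 2 ** depth - 1            # internal routing neurons
        n_leaves = 2 ** depth               # one tiny neuron per leaf
        self.node_w = nn.Parameter(torch.randn(n_nodes, width) / width ** 0.5)
        self.leaf_w = nn.Parameter(torch.randn(n_leaves, width) / width ** 0.5)
        self.leaf_out = nn.Parameter(torch.randn(n_leaves, width) / width ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (width,) -- a single token vector; batching omitted for clarity.
        node = 0
        for _ in range(self.depth):
            # One dot product and a sign check pick the child; weights on
            # untaken branches are never multiplied against x at all.
            go_right = (self.node_w[node] @ x) > 0
            node = 2 * node + (2 if go_right else 1)
        leaf = node - (2 ** self.depth - 1)       # heap index -> leaf index
        act = torch.relu(self.leaf_w[leaf] @ x)   # scalar activation
        return act * self.leaf_out[leaf]          # scales one output row

layer = TinyFFF(width=768, depth=11)  # 11 hops select 1 of 2048 leaves
y = layer(torch.randn(768))
```

The point that hooked me: the routing decision is a single dot product and a sign check, so there's no dense matrix multiply anywhere, which is why a plain CPU with decent math routines can keep up.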
All that said, I had the motivation and the topic for my final project in CS561 Deep Learning with NLP. With the help of a classmate, we set up the structure, then tested and benchmarked BERT and crammedBERT to compare against UltraFastBERT. Cool, so we would rigorously benchmark the models. The authors do include a disclaimer in the paper about where this idea could be applied and how it isn't fleshed out well enough yet, but I didn't care (understand).
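For flavor, this is roughly the kind of wall-clock harness we were aiming for. A reconstruction, not our actual code; comparing against crammedBERT or UltraFastBERT would mean loading their specific checkpoints and custom layers, which is where things got hairy:

```python
# Rough sketch of a CPU latency benchmark: average the forward-pass
# time per batch for a given HuggingFace checkpoint.
import time
import torch
from transformers import AutoModel, AutoTokenizer

def bench(name: str, n_runs: int = 20) -> float:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    inputs = tokenizer(["The movie was surprisingly good."] * 8,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        model(**inputs)                      # warm-up run
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
    return (time.perf_counter() - start) / n_runs

print("bert-base-uncased:", bench("bert-base-uncased"), "s/batch")
```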
We fumbled around, learning about the process of training, finetuning, and benchmarking. We racked up hours of VM time (I left it running overnight by accident), well over the time it took the authors to train their final models.
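The downstream finetuning step that eventually stumped me boils down to the standard HuggingFace recipe. Here's a minimal sketch on a GLUE task (SST-2), with plain `bert-base-uncased` standing in for the checkpoint, since plugging in UltraFastBERT was exactly the part I couldn't get working:

```python
# Standard HuggingFace finetuning recipe on GLUE SST-2.
# "bert-base-uncased" is a stand-in checkpoint for illustration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True,
                            padding="max_length"),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```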
The deadline approached. I put together what I had and stayed up nights trying to wrap my brain around why the last piece wasn't fitting: I wasn't able to finetune the model on downstream tasks in order to compare it to the others. Deadline passed. Asked for an extension. Didn't reply to my professor's emails. Apologized and asked for another extension. Bitterly submitted an unfinished paper with NotImplemented scattered throughout the text. I was mentally exhausted, and upset. Upset for the wrong reasons. I worked hard on this; I learned a lot. Ultimately, I'm embarrassed I did not take the time to fully unpack and understand the FFF layer in the architecture. That was my biggest mistake. I was never going to finish that paper with the knowledge I had.