Natural Language Processing for Language of Life (mRNA vaccine design)
Abstract
The COVID-19 pandemic accelerated the development of mRNA vaccines, yet identifying the optimal mRNA sequence for human use, particularly for the SARS-CoV-2 spike protein, remains challenging. This thesis focuses on optimizing the open reading frame (ORF), a crucial mRNA component composed of codons, which are nucleotide triplets that each code for an amino acid. We introduce a novel ‘valid-codon’ masking strategy to streamline codon-to-amino-acid mapping within the target protein sequence. We compare this approach with the ‘codon-box’ method, which groups codons with identical nucleotide compositions, and find that ‘valid-codon’ performs comparably in optimizing ORF sequences for gene expression. By integrating the masking strategy into a supervised fine-tuning (SFT) process using the pre-trained ProtBert model, we further optimize the ORF of the SARS-CoV-2 spike protein for human use. Results indicate that the ORF sequences produced by our fine-tuned models surpass those used in the Moderna and Pfizer vaccines in terms of gene expression and stability.
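The two masking ideas named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that ‘valid-codon’ masking means restricting each position to the codons that encode the target amino acid under the standard genetic code, and that a ‘codon-box’ is the set of codons sharing the same multiset of nucleotides. The function names `valid_codon_mask`, `codon_box`, and `masked_argmax` are hypothetical, and the scores stand in for whatever per-codon outputs a model would produce.

```python
from itertools import product

# Standard genetic code in RNA letters, in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))  # '*' marks a stop codon


def valid_codon_mask(amino_acid):
    """Boolean mask over the 64 codons: True where the codon encodes amino_acid.

    Assumption: this is one plausible reading of 'valid-codon' masking.
    """
    return [CODON_TO_AA[codon] == amino_acid for codon in CODONS]


def codon_box(codon):
    """Key identifying codons with the same nucleotide composition.

    Assumption: 'identical nucleotide composition' is read as the same
    multiset of bases, so AUG and GUA share a box.
    """
    return "".join(sorted(codon))


def masked_argmax(scores, mask):
    """Pick the highest-scoring codon among those the mask allows."""
    allowed = [i for i, ok in enumerate(mask) if ok]
    best = max(allowed, key=lambda i: scores[i])
    return CODONS[best]


if __name__ == "__main__":
    # Hypothetical uniform scores; a real model would supply per-codon values.
    scores = [0.0] * len(CODONS)
    for aa in "MFVFL":
        print(aa, masked_argmax(scores, valid_codon_mask(aa)))
```

Under this reading, the mask guarantees that any selected codon translates back to the intended amino acid, so the protein sequence is preserved while the choice among synonymous codons is left to the model.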
