
The good news on climate

The good news on climate: Part 1 (Working Draft)

May contain some factual inaccuracies and poor phrasing

Something curious has started to happen in the past few years. Hyper-focused technology nerds have gotten a lot more enthusiastic about climate change, while policy-focused people and newspapers have grown more and more dire and stressed as they discuss recent progress, COP goals, and how far we are drifting from the projections we are supposed to hit for net zero by 2050. Most people are pretty depressed about climate change, so let me tell you a story about how I became a bit more of an optimist.

I checked out of climate change updates around 2014, and checked out even harder around 2016 after the election of Trump. In 2014 it felt like things weren't moving quite fast enough despite best efforts, and after the change in administrations it felt doubly dire. I was still fixated on the idea that US federal policy would be the major driver of climate progress, and when I didn't see large changes there I simply decided that very little could be done, and that I should accept that in the middle of the century civilization might just begin to fall apart.

But there were a number of flaws in my thinking, and I have since gradually changed my mind about the trajectory we are on. The centrality of the United States in something as large and important as climate change was never assured. Governing bodies like the European Union and the Chinese Communist Party began to move full speed ahead on climate solutions, and some remarkable trends emerged. The first was a continuation: the dramatic fall in the price of both wind power and solar panels. Here I will focus on solar panels as one of the defining technologies of green energy production to emerge recently. The second is that battery quality and energy density began to take off once mass-produced electric cars arrived. Together, these two trends have begun to change the economics of energy.

With these two developments, the decade starting in 2020 has begun to see a shift. Creating energy from renewable resources is quickly becoming cheaper and easier than creating it from sources such as coal and natural gas. And storage for these sources, the answer to intermittency, has begun to have something of a revolution of its own. With battery prices falling thanks to electric cars, some grid operators now simply use lithium-ion batteries to stabilize their grids and reduce the use of natural gas peaker plants during periods of peak demand.

Clearly, however, we aren't quite there yet. Batteries are still too expensive to allow for a totally stable grid on their own. Worse, lithium is fairly rare in the Earth's crust, and fears of running out would become a real issue if we decided to use only lithium for batteries. Fortunately that isn't the case: newer storage systems beginning to come online, such as Form Energy's iron-air (rust) batteries and Antora's heat batteries, have the potential to offer even cheaper and longer-lasting forms of electrical storage. These examples are nice as single stories, but they are emblematic of a wider truth. Solutions to many problems can be found fairly quickly in modern research but are often not obvious to the layperson, or even the semi-experienced expert in an area, and it is very difficult to know whether an engineering difficulty could be solved easily with just bigger R&D budgets and more research focus.

Indeed, we appear to be on the cusp of a major tipping point. For years climate activists pushed against the prevailing economic winds, which favored drilling for oil and burning natural gas and coal for fuel. Coal, it seems, has already become the first victim: coal mining and burning have entered something of a "death spiral" recently, as coal simply can't compete on price per megawatt-hour.

There are some criticisms as to how fast things can change and how quickly we can retool our economic system to ramp up production of green energy. But energy production increased rapidly from the 1920s to the 1970s, at about 3% per person per year. Between 2022 and 2023, total installed solar capacity grew about 20%. At that pace capacity multiplies by roughly 1.2^27, well over a hundredfold, between now and 2050, easily enough solar power to cover all of the United States' energy needs. However, any reader worth their salt will note that extrapolation at this level is something of a public policy sin, especially with exponential numbers. Prophets of exponential population growth predicted catastrophe, yet growth slowed and is now estimated to begin a long decline around 2100 (itself another extrapolation).

Besides, if you look at the expansion of natural gas and oil drilling from around 1920 to 1970, that massive increase in energy usage happened in about 50 years. Not a tiny amount of time, but a time frame comparable to the timelines being discussed for global decarbonization (again, good news only for someone coming from an ultra-pessimistic background). And these trends feed on themselves. Once renewable energy becomes cheap enough, it seems likely the trends will become self-sustaining, and we will see them accelerate before they saturate. It's quite possible that instead of a linear spin-up we will see rapid decarbonization, if we can only get things to the point where green energy wins on cost. As technology advances, as solar becomes cheaper and higher performance, and as batteries of all forms follow suit, it's quite possible we will see massive shifts in the next decade. For me this was a rallying cry: research has given us seemingly miraculous solutions in the form of extremely cheap solar. If trends continue, solar energy will soon become inescapable regardless of political or environmental leanings, and that gives me hope.

Getting Started With BioJulia Part 2: Copying samtools and samtools-like things

Welcome back

Hopefully the last blog post made some sense. In this post I will talk about how to use Julia to filter BAM files and perform some of the useful tasks that samtools might otherwise do for you. This will essentially be scripting the filtering of a BAM file. Despite being more complicated up front, my hope is that this will be of some use to those of you who end up piping BAM files through multiple samtools commands and end up with Snakemake files containing massive rules just to filter out a certain kind of read.

Last time we worked out how to go through and analyze reads in a Pluto notebook, a special kind of computational notebook that is used a lot with Julia scripts. All of this code could also be run in a Jupyter notebook if you prefer. One caveat of this post is that it is likely to exist in a half-finished state for some time. At minimum, hopefully this code provides a jumping-off point for scripts that work with reads in Julia and perform some level of processing on them that might save you some time.


Starting off: last time we were able to access a single read and see what we were looking at with the following code.


using XAM, BioSequences
filename = "/path/to/dir/example.bam"
#create a reader object using the file we have supplied
reader = open(BAM.Reader, filename)

record = BAM.Record()
#make sure that the record is empty
#( this would be useful when  looping over multiple reads)
empty!(record)

#get the read from the reader and push it to the empty record object.
read!(reader, record)

#close the reader after we have read one sequence
close(reader)

#Use the BAM.sequence method to get the sequence from the BAM record object
bam_sequence = BAM.sequence(record)

We could look at the sequence as well as a few other characteristics of the read and print those out to the console. But let's say that we instead want to filter the file and write the result to a new BAM file. In that case there are a few steps we need to take to make this work in the BioJulia ecosystem.

We need a header for the new BAM file (we reuse the one from the reader), and then we need to open up a new BGZF stream, like below.

#this library is needed to write to a bamfile in julia
using BGZFStreams
function filter_bam(file_in, file_out)
    reader = open(BAM.Reader, file_in)
    record = BAM.Record()

    #get the header from the bam file.
    bam_file_header = BAM.header(reader)
    writer = BAM.Writer(BGZFStream(open(file_out, "w"), "w"), bam_file_header)
    
    #keep going through until the reader hits the end of the file.
    while !eof(reader)
        #make sure that the record is empty before overwriting it
        empty!(record)

        #check the new read and get it from the reader
        read!(reader, record)

        #check whether the read is mapped
        condition = BAM.ismapped(record)

        #if the read is mapped then we want to write it to the stream.
        if condition
            write(writer, record) 
        end

    end

    #close both the reader and the writer so that no issues exist
    close(reader)
    close(writer)
end
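To run this over a file, a call like the following should work (the output filename here is a hypothetical example):

filter_bam("/path/to/dir/example.bam", "/path/to/dir/example_mapped.bam")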

Notice how we could change the condition to whatever we want, and we can also combine conditions. For example, we could check whether the read is properly paired as well as whether it is mapped, and only keep reads with both characteristics. The advantage is that any function that takes in a read can be used as your conditional. We can remake samtools filtering with arbitrarily many rules! A sketch of a combined condition is below.
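As a minimal sketch, a combined predicate might look like this. The flag bit 0x0002 is the standard SAM "properly paired" bit, and the mapping quality cutoff of 30 is an arbitrary choice for illustration:

#a read passes if it is mapped, properly paired, and has decent mapping quality
function is_good_read(record)
    return BAM.ismapped(record) &&
           (BAM.flag(record) & 0x0002) != 0 &&
           BAM.mappingquality(record) >= 30
end

#inside the while loop you would then write:
#condition = is_good_read(record)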

Notably, this would be just as doable using pysam, or htslib in C, or through htslib's bindings in a bunch of different languages. The main advantages here are simply all of the other advantages you get from using Julia. I think its combination of speed and ease of writing makes it worth it.

Below is a toy example of filtering for one very bizarre read type.

#this library is needed to write to a bamfile in julia
using BGZFStreams
function filter_bam(file_in, file_out)
    reader = open(BAM.Reader, file_in)
    record = BAM.Record()

    #get the header from the bam file.
    bam_file_header = BAM.header(reader)
    writer = BAM.Writer(BGZFStream(open(file_out, "w"), "w"), bam_file_header)
    
    #keep going through until the reader hits the end of the file.
    while !eof(reader)
        #make sure that the record is empty before overwriting it
        empty!(record)

        #check the new read and get it from the reader
        read!(reader, record)

        #check whether the read's sequence is exactly AAAA
        #(note the dna"..." literal from BioSequences rather than a plain String)
        condition = (BAM.sequence(record) == dna"AAAA")

        #if the sequence matches then we want to write it to the stream.
        if condition
            write(writer, record) 
        end

    end

    #close both the reader and the writer so that no issues exist
    close(reader)
    close(writer)
end

In the above example we look for reads whose sequence exactly matches the one we supply, dna"AAAA" (a BioSequences literal that builds a DNA sequence to compare against). While a bizarre choice, if you really did have reads like this it would be a nightmare to come up with the equivalent conditionals using plain samtools. The other advantage is that computations in these filter functions will be quite fast (unlike plain Python) while remaining easy to read (unlike C).

Happy hacking!

Getting Started With BioJulia Part 1: Analyzing Bam Files?

Who is this for?

The following may be a weird mishmash of different levels of complexity. Hopefully a biologist who is just getting started in programming can use it to write some of their own genomics analysis scripts, or someone coming from computer science who is starting to work with sequences can learn which parts of sequencing reads are useful and interesting to biologists. However, this post may require a teensy bit of googling, or asking Mr. GPT, if you are less familiar with next generation sequencing or have no experience with programming (sorry!).

Why should I care about Julia?

Using Julia for bioinformatics has a few advantages, which is the main reason I am choosing it for a project in which I am analyzing BAM files. Julia is also getting more popular among data scientists in general and has some cool features, some of which make it an ideal alternative to Python or R for building command line tools that analyze BAM files. I like BAM files; it's a great format, and it's pretty ubiquitous in bioinformatics. As I have gotten deeper into working with sequencing data, I have found that I often want to work directly with BAM files and access all the information they provide before doing further analysis. One example is filtering out reads based on really complex criteria. If I have a complicated decision process for whether to keep or throw out a read, one that might change all the time, expressing it through samtools can be too hard. In fact, a number of bioinformatics tools are essentially scripts wrapping PySam or HTSeq that filter reads based on overlaps with GTF locations and mapping quality.

Originally I thought it might be a good idea to learn C and use the awesome htslib written by Heng Li and coauthors. But learning C was pretty hard, and I thought it might be a waste of time for my purposes. While Julia isn't a C replacement, it can offer some of that speed when working with very large sequencing datasets, and it has helped me iterate over experiments quickly where PySam may have made this difficult without learning Cython, or where R might have been impossible without writing Rcpp code.

The main downside compared to C, Go, and Rust is that Julia doesn't quite have a static compiler available yet, meaning that in some ways it is still similar to Python tools in terms of ease of installation: if you want to distribute a tool, you will need to supply it as a script. There are a lot of tools in the bioinformatics ecosystem for working with BAM files. DeepTools, Samtools, Bedtools, Picard, and GATK (as well as many others that I am blanking on here; whoops, thanks for your work in the ecosystem!) are some very popular tools that perform read filtering and are essentially toolkits for working with BAM files. Notably, people rarely write scripts to process BAM files themselves.

Often, instead of creating custom scripts, I find myself approximating what I want by piping many of the above tools together. But I want to be able to work with reads directly and also process them quickly. When I saw that BioJulia existed and had a number of packages, I thought it could be a much easier avenue than learning C, or than working with PySam, PyRanges, or GenomicRanges and hitting speed issues in the parts of the program that aren't just a wrapper around a C library. While GenomicRanges, PyRanges, and PySam are all extremely fast, since they wrap C, once you get the data out you will always hit some bottleneck working in Python or R. You can reach for packages like data.table and NumPy, but sometimes it's easier if you can just program in a "dumb" style without having to use someone else's libraries.

Again, this was an assumption that may not be quite true; I also wanted an excuse to learn Julia and complete the Python, R, Julia trifecta. This blog post is my attempt at writing some of the missing "introductory walkthroughs" for BioJulia. I intend to update this and my other posts and add more information on working with Julia. For today the main question we will be asking is: how would I get gene counts from a BAM file using BioJulia? A popular tool for this is htseq-count, written in Python, which allows strict filtering of overlaps. Another, faster tool is featureCounts, which is written in C.

We will be working in pure Julia to read in the BAM file and count overlaps with genes. As you will see, we are also able to filter BAM files as we might otherwise do beforehand with samtools. Finally, we can do some extra "tricks" fairly easily that we might otherwise reach for Picard to do. Time permitting, I may add a final post showing how to make graphics similar to DeepTools and RSeQC using BioJulia as well.

Writing your first few lines

During this walkthrough I will be using Pluto.jl, a Julia package for making interactive notebooks, to create and work with sequences. The main package I am using in this analysis is XAM.jl, for processing BAM and SAM records. I am also using BioSequences in order to do some processing of the sequences. Currently the head of your Pluto notebook should have a cell that looks like this:

using XAM, BioSequences, GenomicFeatures

Next we are going to try to open a BAM file and look at the first read. For our purposes, let's say we have a BAM file called ~/path/to/dir/example.bam. We are going to open it and take a look at the first few reads and what information they contain. Here is what we are going to run next.

begin
filename = "/path/to/dir/example.bam"
#create a reader object using the file we have supplied
reader = open(BAM.Reader, filename)
	
record = BAM.Record()
#make sure that the record is empty
#( this would be useful when  looping over multiple reads)
empty!(record)

#get the read from the reader and push it to the empty record object.
read!(reader, record)

#close the reader after we have read one sequence
close(reader)

#Use the BAM.sequence method to get the sequence from the BAM record object
bam_sequence = BAM.sequence(record)
end

Note that I have wrapped the entire command in a begin ... end block. Pluto enforces that cells can be placed in any order and update automatically, so if you want to paste a multi-line code block into a single cell, it needs to be wrapped this way. The code works exactly the same if you instead put each line into its own cell.

We can now see a sequence printed out in the notebook. This is great: we can access the raw sequences from our BAM file! What if we want to see where a read mapped to? Let's check the documentation for XAM.jl:

link to documentation of xam

We can use XAM.BAM.refname to get the name of the chromosome or contig (a big genomic chunk from assembly whose identity or completeness you might not be sure of). If we scroll down we can find even more ways to access the information stored for each read in the BAM file. XAM.BAM.position will give us the leftmost mapping position of the read. Great, now we can maybe see if it overlaps with a gene; let's grab the rightmost position as well with XAM.BAM.rightposition.

Now lets take this information and turn our read into a “genomic range” that we can check for overlaps with other ranges.

begin
end_position = BAM.rightposition(record)
start_position = BAM.position(record)
chromosome_name = BAM.refname(record)
end

Excellent! We can now follow the documentation for GenomicFeatures and use the above variables to create a new range: link to genomic intervals documentation

begin
#putting the strand as '.' means that we don't rely on strand information here
#if you want, you can pull the strand from the BAM record in XAM.jl and add it
#note: we store the sequence as a String so it matches the IntervalCollection{String} used below
our_interval = Interval(chromosome_name, start_position, end_position, '.', string(bam_sequence))
example_gene_interval = Interval(chromosome_name, start_position+50, end_position+50, '.', "EXAMPLE_GENE_ABC")
end

Finally, if we run an overlap query we can iterate over all the overlaps. Because the example gene has the same location as the other interval, just offset by 50 basepairs, we should be able to find an overlap: our read is probably around 100bp to 150bp long if it came from a classic next generation sequencing technology. Before we do this, we first make interval collections. Often you will want to overlap multiple reads against multiple locations, for example when calculating gene counts. To do that, you first add both your reads and your gene locations to separate interval collections so that GenomicFeatures can quickly search them both for overlaps.

Below is the code that builds both collections and runs the overlap query.

begin
#we are going to keep the string metadata that we have added for each gene
gene_collection = IntervalCollection{String}()
push!(gene_collection, example_gene_interval)

#we are going to keep the sequence of the read (as a string) just for fun
#this might be useful if we wanted to see if all of the reads that map to our gene have some odd feature that
#we need to worry about.
read_collection = IntervalCollection{String}()
push!(read_collection, our_interval)

for (x, y) in eachoverlap(gene_collection, read_collection)
    println("intersection found between ", x, " and ", y)
end

end

Our output should look something like this, ignoring the exact details.

intersection found between GenomicFeatures.Interval{String}:
  sequence name: chr1
  leftmost position: 10052
  rightmost position: 10160
  strand: .
  metadata: EXAMPLE_GENE_ABC and GenomicFeatures.Interval{String}:
  sequence name: chr1
  leftmost position: 10002
  rightmost position: 10110
  strand: .
  metadata: AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAAACAAAA

And that's our introduction to working with BAM files in Julia. Hopefully this is helpful to anyone looking to filter reads before performing gene counts, or to work more closely with BAM files in Julia. There are numerous data science packages that can groupby and perform other data-oriented verbs: DataFrames and Pipe are great for replicating much of what the tidyverse offers, and there is also a separate package called Tidier, which aims to directly copy much of the tidyverse, but in Julia.

Between DataFrames, Pipe, and Tidier, the rest of the analysis should be pretty straightforward, whether you are analyzing a large number of reads or creating a CSV file that you can then analyze using R or Python. A tiny example of a tidyverse-style summary with DataFrames is below.
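As a minimal sketch (the column names here are made up), grouping and summarizing with DataFrames.jl looks like this:

using DataFrames

df = DataFrame(gene = ["A", "A", "B"], count = [1, 2, 5])
#group by gene and sum the counts, in the style of group_by + summarize
per_gene = combine(groupby(df, :gene), :count => sum => :total)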

Getting Started With Julia

Getting Started With Julia, notes for bioinformaticians

I recently started working with Julia. Below are some initial notes I took while installing and working with the language, as well as the advantages it can provide.

I originally tried to install on macOS and found that it failed immediately; however, I was able to run the command to install juliaup.

The most important part is being able to run the following command, which installs juliaup:


curl -fsSL https://install.julialang.org | sh

Then you can install a new Julia version with the following command, where 1.10 is the version of the Julia language:


juliaup add +1.10

This will add a "channel", which tracks a particular release line of Julia. As of this writing, 1.10 is the stable release channel.
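A couple of other juliaup commands that are handy here (a quick sketch; check juliaup --help for the full list):

#list the channels you have installed
juliaup status
#make 1.10 the default version that the julia command launches
juliaup default 1.10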

To get LSP support in JupyterLab, install jupyterlab-lsp: https://github.com/jupyter-lsp/jupyterlab-lsp

Which Development Environment to use

You can use Pluto, but VS Code is the only editor with a built-in viewer for inspecting dataframes. The VS Code plugin can be set up almost exactly the same as RStudio, with a few complaints, and it is not as in-depth for analysis.

Useful functional packages

Transducers.jl and Folds.jl

Most of the main things are similar, but it is still hard to beat RStudio for all of its integration.


Using PackageCompiler!

For compilation you need to use Pkg.instantiate() and Pkg.resolve().

First you need to activate a Julia project: enter the package manager by pressing ] in the REPL, then run

activate .

If it fails, go up a directory, activate there, then rerun PackageCompiler and get it to work. Then you will have the dependencies correctly listed.
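As a minimal sketch of the whole dance (the package name and output path below are hypothetical placeholders):

using Pkg
Pkg.activate(".")      #activate the project in the current directory
Pkg.resolve()          #make Project.toml and Manifest.toml agree
Pkg.instantiate()      #install the exact dependency versions

using PackageCompiler
#build a custom sysimage containing our (hypothetical) package
create_sysimage([:MyPackage]; sysimage_path="my_sysimage.so")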

https://bkamins.github.io/julialang/2020/05/10/julia-project-environments.html

Julia made a breaking change in the types: instead of Void, the type is now Nothing, with the singleton value nothing. Make sure to account for this when running older code.
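A quick illustration:

#functions with no meaningful return value return `nothing`,
#whose type is `Nothing` (called `Void` in pre-1.0 Julia)
f() = nothing
@assert f() isa Nothing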

Transducers.jl and Automa.jl are also available for quickly parsing files.

https://docs.juliahub.com/General/FiniteStateTransducers/stable/ if you care about weighted finite state transducers

How do I rbind in julia?

I wanted to make a dataframe from a list of vectors.

Use vcat for rbinding matrices, or hcat for cbinding them. Once you have the matrix worked out, you can easily turn it into a dataframe.
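A minimal sketch (the data and column names are made up):

using DataFrames

rows = [[1, 2, 3], [4, 5, 6]]
#turn each vector into a 1-row matrix and stack them (rbind)
mat = vcat([permutedims(r) for r in rows]...)
df = DataFrame(mat, [:a, :b, :c])

#for two DataFrames, vcat(df1, df2) is the direct rbind equivalent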

VIM Plugin

I use NvChad and found there were a few things to be aware of because of the lazy package installation.

To get the Vim plugin working for inserting the Unicode symbols, you need to do a fair amount of setup that will take some time: https://github.com/folke/lazy.nvim/blob/c734d941b47312baafe3e0429a5fecd25da95f5f/README.md#L104

You need to load it only for the Julia filetype (the ft key):

  {
    "JuliaEditorSupport/julia-vim",
    -- ft takes a filetype name, not a file extension
    ft = "julia",
  },

Don't set the lazy boolean yourself in your package spec: with generic lazy loading the plugin won't work or install properly, and with ft set it will be lazy-loaded automatically when a Julia file is detected anyway.

https://docs.julialang.org/en/v1/manual/unicode-input/

Presenting WoofGPT

WoofGPT represents a novel breakthrough in animal language models

Recent breakthroughs in artificial intelligence and large language models have promised to reshape how we work, live, and think. Yet despite large breakthroughs in human-like intelligence, animal intelligence, and in particular dog-like intelligence, has proven difficult to pin down and capture. Despite a literature search, we found that almost no research had been conducted on dog language models. Seeing this scientific gap, we set out to create the first dog language model that can successfully simulate a conversation with a dog.

Using a generative transformer model, we achieve broad performance across a number of open source dog language benchmarks. Notably, we achieve SOTA and beat single-model performance using only a generative model, without finetuning.

              random baseline   WoofGPT   single-model SOTA
dogBENCH      50%               69%       71%
woofnogrande  50%               69%       65%
WOOF8k        0%                42%       10%

We think this model will democratize dog intelligence. If you don't own a dog: how many times have you wanted one, only to fret about the fees and upkeep required to interact with humankind's greatest friend? We will be releasing the training data and weights of WoofGPT publicly. We are also releasing a hosted version of the model below, so the public can play with it and experience this new intelligence for themselves.

Please go here if the model is unable to load!

(this is supposed to be a haha joke)