GitHub Archive Program: Why Archive GitHub Repositories?

Have you ever caught yourself wondering what life will be like 1,000 years from now? History tends to move in a spiral, repeating itself with huge modifications, so it is difficult to imagine how future generations will live. Technology, though, will certainly be part of their world. What will it look like? It’s hard to predict, but our distant descendants will be able to learn a great deal about today’s technology thanks to GitHub and its Arctic Code Vault, which preserves millions of open source projects, source code and all, for future generations.

Some very inspiring words are attributed to Steve Jobs: “We’re here to put a dent in the universe. Otherwise, why else even be here?” And that is true! Just look: the Mayans left us their calendars, and the Polynesians left us their massive stone heads on Easter Island, so we know at least something about how they lived. GitHub decided to go further and record not only how we live but also how developers think. The result is its ambitious Arctic undertaking: the GitHub Archive Program.

What is the GitHub Archive Program?

GitHub brought in a number of organizations as archive partners, among them the Software Heritage Foundation, the Internet Archive, the Long Now Foundation, the Arctic World Archive, Microsoft Research’s Project Silica, GHTorrent, and GH Archive, to build a repository unlike anything it usually hosts: the GitHub Arctic Code Vault. GitHub users and contributors are also key to this effort to preserve open-source software. In the long term, this store of the world’s open-source software works as a time machine that complements the archiving initiatives already under way.

The program employs a multi-tiered preservation strategy inspired by the ‘pace layers’ model, combining near-real-time copies with periodically updated and ultra-long-term cold storage mechanisms. The main goal is to leave the repositories, with all their metadata, to future generations so that they have a detailed understanding of what technology used to be like. In other words, it is food for thought for the technology historians of the future.

Where to find this Arctic Code Vault?

To keep it away from curious onlookers, GitHub stored this record of modern software development inside the Arctic Circle: to be precise, in a decommissioned coal mine deep beneath an Arctic mountain in Svalbard, a Norwegian archipelago between mainland Norway and the North Pole.

What does this Arctic GitHub data repository contain?

The Arctic vault holds roughly 21 trillion bytes (about 21 TB) of data, captured in a snapshot taken on the “mirror” date 02/02/2020. It is a single point-in-time copy of every active public GitHub repository that was being developed on the platform at that moment, together with its metadata (including every pull request, issue, commit, wiki, etc.).

Which repositories were snapshotted?

Here is a rundown of requirements that helped single out and select repositories for the snapshot:

every active public repository that had any commits within the 80 days before the “big” date.
every active public repo that had at least one star and any commits within the 365 days before 02/02/2020.
every public repository that had at least 250 stars, with no time limit.
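The three rules above can be sketched as a single predicate. This is only an illustration, not GitHub’s actual selection code: the `RepoStats` type, its field names, and the `in_snapshot` function are hypothetical, and the second rule is read as requiring both a star and a commit in the preceding 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Snapshot date taken from the article (02/02/2020).
SNAPSHOT_DATE = date(2020, 2, 2)


@dataclass
class RepoStats:
    """Hypothetical summary of a public repository as of the snapshot date."""
    last_commit: Optional[date]  # date of the most recent commit, if any
    stars: int                   # total star count


def in_snapshot(repo: RepoStats, snapshot: date = SNAPSHOT_DATE) -> bool:
    """Return True if the repository matches any of the three rules."""
    # Rule 3: at least 250 stars, regardless of activity.
    if repo.stars >= 250:
        return True
    if repo.last_commit is None:
        return False
    age = snapshot - repo.last_commit
    # Rule 1: any commit within the 80 days before the snapshot.
    if age <= timedelta(days=80):
        return True
    # Rule 2 (assumed reading): at least one star and a commit within 365 days.
    return repo.stars >= 1 and age <= timedelta(days=365)
```

For example, `in_snapshot(RepoStats(date(2019, 6, 1), 1))` is true under rule 2, while the same repository with zero stars would be left out.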

If you are curious how many people contributed and who exactly they are, it is easy to check. To honor every developer whose public repository made it into the vault, GitHub created a special badge, the Arctic Code Vault Badge, which is displayed in the highlights section of the developer’s GitHub profile.

How is the data kept?

One question remains open: “How will future generations of developers read all that code?” The answer lies on the surface: GitHub archived the software and recorded it on reels of film. There are 187 reels of digital photosensitive archival film holding data in the form of QR codes, plus 1 reel written in human language. That last reel is the most important one: it is a “guide reel” called the Tech Tree, and it contains human-readable information that helps developers of the future understand how to work with all those QR codes.

No way! How to read it?

To read or analyze the archived data, programmers of the future will follow a series of steps. GitHub has created a complicated but smart system to keep the data, so let’s see how everything works.

Because the information is stored as QR codes, future programmers will first need to decode it, that is, turn the images back into machine-readable data. This is only the first step. The decoded data is compressed, so it has to be decompressed before it becomes meaningful. The result is an archive file that contains a software project’s repository. It is like a book for them to read, since one repository can contain many files. But they can only read it if they have a machine able to interpret binary data, sequences of ones and zeros such as 11010100.
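The decompress-then-unarchive part of this process can be sketched in a few lines. This is an illustration under assumptions, not the vault’s documented tooling: it assumes the QR frames have already been decoded into one byte string, and it uses gzip and tar purely as stand-ins for whatever compression and archive formats a reel actually uses. The helper names `unpack_repository` and `make_demo_payload` are hypothetical.

```python
import gzip
import io
import tarfile
from typing import Dict


def unpack_repository(decoded: bytes) -> Dict[str, bytes]:
    """Decompress decoded bytes and return {path: contents} for each file.

    Assumes (for illustration only) a gzip-compressed tar archive.
    """
    raw_tar = gzip.decompress(decoded)          # step 2: decompress
    files: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(raw_tar), mode="r:") as archive:
        for member in archive.getmembers():     # step 3: walk the archive
            if member.isfile():
                files[member.name] = archive.extractfile(member).read()
    return files


def make_demo_payload() -> bytes:
    """Build a tiny gzip-compressed tar in memory to stand in for QR output."""
    data = b"print('hello from 2020')\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as archive:
        info = tarfile.TarInfo(name="demo-repo/main.py")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    return gzip.compress(buf.getvalue())


if __name__ == "__main__":
    for path, contents in unpack_repository(make_demo_payload()).items():
        print(path, len(contents), "bytes")
```

Everything stays in memory here, which keeps the sketch self-contained; a real reader would first have to scan the film and decode each QR frame, a step this example deliberately leaves out.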

So what is next? We unarchive them, of course!

After all these steps comes the most interesting part: how to read and run the data. Future computers will surely be far more advanced than ours, but they may not understand the configurations of the past, the ones we use today. That is where the Tech Tree comes back in: this human-readable guide includes instructions on how to build a machine that can read all those repos, so that future generations can make the best use of them.

What about the language?

Today, the human language of coding is English, no matter which popular programming language is used (Java, Python, JavaScript, Ruby), but who knows what will happen in the future and which language our descendants will speak. That is why GitHub provides the information in other languages as well: future ‘kids’ will be able to read the “Guide to the GitHub Code Vault” in five languages.

What is more, GitHub even included an uncompressed UTF-8 file containing the Universal Declaration of Human Rights in more than 500 languages at the beginning of every reel and also in the Tech Tree. Why? Because it is important for future generations to know what rights and freedoms we have now: they are an essential part not only of human history but of technical history as well.

In search of the past: what is the trick?

Earlier, we said that the vault holds one single snapshot of the data. But what if a natural disaster strikes or an intruder gets in? All that effort and all those resources would be lost for good. That “just in case” is exactly why it is always better to have a backup. A backup protects your code not only for future generations but also for your own future reference, which you will need far sooner.

Here, GitProtect fits right in! The solution offers unlimited retention, which goes well beyond 1,000 years, so you can use it to archive your old, unused repos. Archiving or deleting a repository is done in the “Danger Zone” of GitHub’s repository settings, so one wrong move can have severe, irreversible consequences. GitProtect’s custom backup schedule can rescue you from the “Danger Zone” peril.


The article was originally published on October 28th, 2022
