Archiving Bitbucket Content: Status Report

Pierre-Yves David
2020-05-26

We are continuing our effort to archive all Mercurial content hosted on Bitbucket before Atlassian deletes it. We are making great progress: we have already retrieved all existing public content. So far, we have identified a total of 244,609 public projects using Mercurial.

Currently archived

In these 244,609 projects we identified:

  • 244,569 source code repositories,
  • 81,154 wiki repositories,
  • 213,345 issues in their bug trackers (with 603,334 comments and 4,445 images),
  • 86,372 pull requests with their comments and 1,320 images,
  • 45,977 project attachments (for about 750GB of data).

Looking at the 5.5TB of repositories more closely gives interesting data:

  • 27,662 (8.5%) repositories went missing, either deleted or made private, since we observed them in February,
  • 98 (0.03%) repositories are inaccessible (Bitbucket itself crashing trying to access them),
  • a couple hundred repositories are still receiving pushes, less than one week before the initial deadline.
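
The report does not say which total the percentages above are relative to. As a quick sanity check, here is a minimal sketch that recomputes them, assuming (our guess, not something stated above) that the total is the number of source code and wiki repositories combined. The share helper is only there for illustration.

```python
# Counts quoted in the report.
SOURCE_REPOS = 244_569
WIKI_REPOS = 81_154
MISSING = 27_662
INACCESSIBLE = 98


def share(count, total, digits):
    """Return count as a percentage of total, rounded to `digits` decimals."""
    return round(100 * count / total, digits)


# Assumption: percentages are relative to source and wiki repositories combined.
all_repos = SOURCE_REPOS + WIKI_REPOS

print("missing:", share(MISSING, all_repos, 1), "%")
print("inaccessible:", share(INACCESSIBLE, all_repos, 2), "%")
```

Under that assumption the figures come out at 8.5% and 0.03%, matching the list. Using the source code repositories alone would give about 11.3% and 0.04% instead, so the combined total seems the more likely one.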

What is the plan for this content?

Ultimately, we plan to offer a set of tarballs for each project. People will be able to download:

  • the main Mercurial repositories,
  • the wiki repositories,
  • the set of metadata associated with a project (as JSON),
  • the individual project attachments.

However… we will only do this once the set of data is frozen. So for the coming months, our server will tirelessly gather all new content that gets pushed to Bitbucket. Starting July 1st, when the Mercurial content stops getting updated, we will start building and serving tarballs through the Software Heritage infrastructure.

Anything else?

Yes! In addition to offering tarballs, we are also planning to import all this content into the Software Heritage database. This will provide us with an excellent corpus to make Software Heritage's Mercurial importer more robust. This effort can start right now, so stay tuned for more news soon.