Note: Codecov has re-released our Python package as version 2.1.13. Moving forward, this uploader will follow the same deprecation schedule (to be released) as our other language specific uploaders. In the meantime, however, it will continue to function. We have edited our previous blog post on this topic to reflect this re-release.
Summary of Incident
On Wednesday, April 12th, Codecov removed our `codecov` package from PyPI. This removal was abrupt, with limited actionable messaging to end users and a meaningful lack of follow-up planning to gracefully off-ramp end users from this package after its removal from PyPI.
This post-mortem will outline our justification for removal, the incident timeline, errors that were made throughout this removal process, and culminate in lessons learned and mitigating guards we will put in place to prevent this class of problem from recurring.
Historical Context
Reasoning for the Codecov PyPI Package Removal
Since June of 2021, Codecov has made available a binary uploader written in NodeJS. This uploader was intended to centralize all users on a single source of truth for uploading to Codecov. The rationale for moving to this uploader can be found in a prior blog post.
Historically, Codecov had many language specific implementations of our uploader. These implementations caused problems for us and our users for the following reasons:
- It was easy for uploaders to “diverge” in functionality. When our primary uploader (Bash and, later, Node) added new functionality, it would take a long time for that functionality to propagate to the language specific uploaders, if it propagated at all.
- These uploaders were primarily community efforts, and as a result they were prone to being abandoned or neglected as members of the community moved away from these uploaders to our officially supported Bash and (later) Node uploaders.
- Users would find these language specific uploaders and attempt to use them, only to have them fail for the reasons above. This created friction for our users and frustration for our support team, who would have to field myriad support tickets about language specific uploaders.
Fundamentally, Codecov had several language specific, legacy uploaders that were difficult to maintain, lagged severely in functionality, and were, in some cases, abandoned by the community. Therefore, we made the decision to deprecate these uploaders and, over time, centralize on our binary NodeJS uploader.
Process for Deprecation
Codecov publicly announced the deprecation of these language specific uploaders in September of 2021, both through a blog post and on the repositories themselves. Additionally, we developed a brownout schedule for these uploaders, deliberately degrading service during set periods as a means to incentivize end users to switch to the new uploader.
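For readers unfamiliar with brownouts, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual implementation: the window dates, the `in_brownout` and `handle_upload` names, and the messages are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical brownout windows as (start, end) pairs in UTC.
# These dates are placeholders, NOT Codecov's real schedule.
BROWNOUT_WINDOWS = [
    (datetime(2022, 1, 10, 12, 0, tzinfo=timezone.utc),
     datetime(2022, 1, 10, 13, 0, tzinfo=timezone.utc)),
    (datetime(2022, 1, 24, 12, 0, tzinfo=timezone.utc),
     datetime(2022, 1, 24, 16, 0, tzinfo=timezone.utc)),
]


def in_brownout(now, windows=BROWNOUT_WINDOWS):
    """Return True if `now` falls inside any scheduled brownout window."""
    return any(start <= now < end for start, end in windows)


def handle_upload(now, windows=BROWNOUT_WINDOWS):
    """Reject uploads from a deprecated uploader during a brownout.

    Returns an (accepted, message) pair. Outside a brownout the upload
    still succeeds, but the message carries a deprecation warning.
    """
    if in_brownout(now, windows):
        return (False, "Deprecated uploader: scheduled brownout in effect. "
                       "Please migrate to the new uploader.")
    return (True, "Upload accepted. This uploader is deprecated; "
                  "please migrate to the new uploader.")
```

The key design point is that the rejection message tells the user what to do next, so a failed CI run points toward migration rather than reading as general unreliability.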
Incident Recap
Timeline of the Incident
All times PDT
April 11th:
- 1:26pm: a discussion was started in Slack about potentially removing the PyPI package. The discussion was prompted by an acute support issue with a customer struggling to use the package.
- 5:43pm: the decision was made to remove the package
April 12th:
- 5:38am: the package was removed by a team member
- ~6:00am: a version of the `codecov` package was uploaded by a PyPI user to preserve functionality
- ~7:00am: the `codecov` name was locked on PyPI
- 7:03am: a discussion was had about whether the team member had replaced the deleted package with an alternative to prevent squatting on the `codecov` package name. This step had not been taken.
- ~12:00pm: Codecov released a public statement regarding the PyPI package removal
Mistakes leading up to the Incident
Despite our defined process for deprecation, there were a couple of key mistakes leading up to the deletion of the Python package:
- Despite posting a brownout schedule, we were incredibly lax in our enforcement of brownouts. Even though we communicated the brownout schedule in the channels available to us (e.g., our blog and social media), these brownout periods still created a support burden and routinely surprised and frustrated our users. It could be argued that this was the point of brownouts, but customers seemed to focus on the unreliability of Codecov rather than on the need to replace the uploader.
- Our brownout schedule lapsed, and we did not maintain our conviction to fully disable the uploaders’ ability to function (primarily because of the support burden described in the previous point). This led to individuals using our language specific uploaders for much longer than we intended. Making matters worse, our backing down was implicit and not communicated. The brownout deadline came and went, and we did not decide as a team what action to take next, leading to unclear policy and no updated messaging to our end users.
The first key learning from this experience is that we should have committed to our stance and scheduling on fully sunsetting our language specific uploaders. Had we done so, we likely never would have hit this problem at all. You can even compare our initial posting of the uploader deprecation to what we have live and published today. We clearly backed down from our initial resolve.
Mistakes during the incident
Justifying the Package Removal
The first major mistake occurred in the decision making process around removing the `codecov` package. Specifically, we did not apply enough rigor to the question, “Why should we remove this package right now?”
Fundamentally, our decision to remove the package came from community and support frustration regarding use of the Python uploader. As stated above, the uploader lagged in feature set, had minimal support, and quite often would fail for users who were attempting to leverage features present in our Node uploader that were not present in our Python uploader.
An acute support issue around the Python uploader was the “straw that broke the camel’s back” and pushed us to make a pre-emptive decision on this particular uploader. This was a clear mistake. Even though customer frustration in the moment is painful and difficult, we cannot let it push us to make decisions in such a way that sidesteps processes or our fundamental decision making frameworks.
This removal justification was weak, and was not rooted firmly enough in data or in our prior communication efforts with customers. Since we did not develop a new plan after our brownout schedule lapsed, we let in-the-moment user and personal frustration drive our decision making instead. This was a huge miss on our part.
Performing the Removal
Once the decision to remove was made, we egregiously mishandled the process of removal. Meaningful places where we missed:
- We did not check our dashboards to determine active usage of the Python uploader before removal (note: usage was about 2.5% of all repositories on Codecov).
- We did not send a final communication to users that we were performing the removal.
- We did not reach out to PyPI maintainers ahead of time to give notice of this change.
- We did not make any plans for holding the package name to prevent squatting or the distribution of malware, nor did we discuss this possibility internally.
  - Note that we were able to confirm with the PyPI team that `codecov` packages uploaded after the removal did not contain malware, and no security issues arose as a result of this mistake.
Simply put, the removal was flippantly done and the ramifications were poorly considered. This was primarily due to the absence of an updated approach to uploader removal after we had surpassed the brownout period. With no process or plan, but with a longstanding initiative to eventually remove these uploaders, it was easy to delude ourselves into thinking we could remove them whenever we wanted and users would not be negatively impacted. After all, we’d been discussing the removal for years; surely these uploaders weren’t still being widely used? A brief glance at our data, instead of relying on intuition and assumption, would have shown that while usage of the Python uploader was low, it was not insignificant.
The above mistake, relying on intuition and assumption instead of the actual data we had on hand, explains the first three misses: we tricked ourselves into thinking this removal wasn’t a big deal because we’d been planning it for over a year.
The last miss, however, is attributable to inexperience. We had never pulled a package from PyPI before, and did not deeply consider the ramifications of doing so. Furthermore, our own security team wasn’t consulted ahead of the removal; consulting them likely would have resulted in a far better considered approach to removing the package.
The Way Forward
To avoid repeating the mistakes outlined above, Codecov will adopt the following processes, specifically for the removal/deprecation of features but also generally for any decision that may have a significant negative impact on some portion of our end users:
- If a process expires or is no longer relevant, the onus is on primary stakeholders to develop a new one immediately if possible. If this isn’t possible, at a minimum we will acknowledge that one does not exist or is not applicable and decide if/how we should proceed.
- Decisions that may have a profoundly negative impact on end users must — if at all possible — be verified with the data we have. If we cannot access data, then we must assume negative impacts are unavoidable and provide a reasonable off-ramp that incrementally pushes our users in the proper direction in a way that minimizes these negative impacts.
- All deletions of packages and repositories, whether in use by end users or not, must follow a formal ticketing process. These tickets must be created, approved, and executed to create the proper paper trail and ensure that multiple stakeholders — including security — can weigh in.
Specific Changes to our Approach to Package Removal
As mentioned, Codecov has several open source uploaders that we have active plans to deprecate and ultimately remove. It is therefore of paramount importance that we do not make this type of mistake again. To that end, we will adopt the following approach for future removals of our uploaders:
- We will republish, and firmly adhere to, a brownout schedule for our uploaders. In the presence of user frustration during these brownout periods, we will communicate clearly, openly, and empathetically with our users; but we will adhere to our decision making unless new information leads us to change course. If we do change course, we will update our users accordingly.
- We will post *timely* notices of this deprecation/deletion plan in all relevant locations: the GitHub repository, the package manager, within the output of the uploader in the user’s CI, our blog, social media, our feedback repository, and so on. These notices will be resurfaced via the means available to us (e.g., emails to impacted users, our blog, social media) 24 hours prior to any deletion.
- We will reach out to package manager maintainers ahead of removal to ensure that we remove the package in such a way that deeply considers our package users and those maintainers.
- The specific plans for removal of any particular package will be approved in advance by our security team.
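The commitments above lend themselves to a simple pre-removal gate. The following Python sketch shows one way such a check could look. It is an illustration only, not tooling we have built: the `RemovalRequest` type, its field names, and `blocking_reasons` are hypothetical, and treating the 24-hour notice as a minimum lead time is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumption: the final notice must go out at least 24 hours before deletion.
NOTICE_LEAD_TIME = timedelta(hours=24)


@dataclass
class RemovalRequest:
    """Hypothetical record of the sign-offs required before a package removal."""
    package: str
    ticket_approved: bool = False        # formal ticket created and approved
    usage_reviewed: bool = False         # usage data checked, not assumed
    security_approved: bool = False      # security team approved the plan
    maintainers_notified: bool = False   # package manager maintainers contacted
    final_notice_sent_at: Optional[datetime] = None


def blocking_reasons(req: RemovalRequest, planned_removal: datetime) -> List[str]:
    """Return every unmet requirement; an empty list means removal may proceed."""
    reasons = []
    if not req.ticket_approved:
        reasons.append("removal ticket has not been approved")
    if not req.usage_reviewed:
        reasons.append("usage data has not been reviewed")
    if not req.security_approved:
        reasons.append("security team has not approved the plan")
    if not req.maintainers_notified:
        reasons.append("package manager maintainers have not been notified")
    if req.final_notice_sent_at is None:
        reasons.append("final notice to users has not been sent")
    elif planned_removal - req.final_notice_sent_at < NOTICE_LEAD_TIME:
        reasons.append("final notice was sent less than 24 hours before removal")
    return reasons
```

Returning the full list of unmet requirements, rather than failing on the first one, makes it obvious to everyone on the ticket how far a removal is from being ready.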