The Benefits of the Black Box

Finding the right combination of openness and opacity.


2023-05-03

A couple of weeks ago, Twitter, via Elon, decided to share some of its most important algorithms. Users have long complained about the opacity of some of Twitter's key features, and open-sourcing these algorithms is a good way to change that. So was it a smart move?

For some code, open-sourcing is harmful. Bot detection via heuristics is one such example. An algorithm like that is effective because it looks for certain indicators that are not publicly known. If they were, the algorithm would be easier to game: it's easier to write a program that complies with a set of rules when those rules are known. The benefits of open source are still there - outsiders discover bugs and suggest improvements, which has been happening with the Twitter code - but the harm of the rules being known can easily outweigh those benefits.
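
To make the gaming concrete, here is a minimal sketch of a rule-based bot check. The rules and thresholds are hypothetical, invented purely for illustration - they are not Twitter's actual heuristics:

```python
from dataclasses import dataclass

@dataclass
class Account:
    age_days: int
    tweets_per_day: float
    has_default_avatar: bool

def looks_like_bot(account: Account) -> bool:
    # Each rule checks a secret indicator; the check only
    # works as long as the rules themselves stay secret.
    return (
        account.age_days < 30
        or account.tweets_per_day > 100
        or account.has_default_avatar
    )

# Once the rules are public, a bot author simply stays inside the thresholds:
compliant_bot = Account(age_days=31, tweets_per_day=99.0, has_default_avatar=False)
assert not looks_like_bot(compliant_bot)
```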

Before its source code was released, Twitter was benefiting from opacity in exactly this way. Twitter uses heuristics to eliminate spammy content and to promote content users find interesting, and those heuristics are now public. One example is the "heavy-ranker" algorithm, which plays a part in determining which tweets appear in the For You feed. The algorithm combines a machine learning model, which is opaque, with a deterministic function, which is not.

The machine learning model estimates the probabilities of certain events occurring as a result of a user seeing a tweet: for instance, the probability that the user will watch half of a video, or the probability that the user will reply. These probabilities are fed into the deterministic function, which assigns predefined weights to them and combines them into a single score. That score then determines which tweets are shown.
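
As a sketch, the deterministic step reduces to a weighted sum of the model's predicted probabilities. The event names and weights below are illustrative placeholders, not the actual values from Twitter's repository:

```python
# Sketch of the heavy-ranker's deterministic scoring step. Event names and
# weights are illustrative placeholders, not the values in Twitter's repository.
WEIGHTS = {
    "prob_favorite": 1.0,
    "prob_reply": 10.0,
    "prob_video_half_watched": 0.5,
    "prob_author_engages_with_reply": 50.0,
    "prob_report": -100.0,  # a negative weight pushes likely-reported tweets down
}

def score(predicted: dict[str, float]) -> float:
    """Combine the ML model's predicted event probabilities into one ranking score."""
    return sum(WEIGHTS[event] * p for event, p in predicted.items())

# Probabilities a (hypothetical) model predicted for one user/tweet pair:
predicted = {
    "prob_favorite": 0.30,
    "prob_reply": 0.02,
    "prob_video_half_watched": 0.10,
    "prob_author_engages_with_reply": 0.01,
    "prob_report": 0.001,
}
print(score(predicted))  # tweets in the candidate pool are ranked by this score
```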

The machine learning model is opaque because one can't infer the exact behavior of a large model from its source code alone. The opacity of machine learning models is often touted as a problem, but here it's a benefit: this kind of algorithm can be released to the public while revealing little about how to exploit it. By seeing it, the public learns enough to know that Twitter isn't engaging in obviously malicious behavior, such as explicitly favoring one political party, but not enough for bot developers to exploit the algorithm. Twitter still knows more than the public, because it has access to the production model as well as the training data, neither of which it is sharing. If it wanted to, it could also release statistics on the model's outputs to assure the public that there are no obvious biases in it.

The deterministic function is where open-sourcing causes issues, because its weights give spammers new information to work with. For instance, from the source code it appears valuable to get users to watch at least half of a video, or to have the author engage with replies to a tweet, because the probabilities of these events carry large weights in the deterministic calculation.
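
With the weight table public, finding the most profitable signals to manufacture is trivial. Continuing with the illustrative weights from the sketch above:

```python
# Once the weights are public, a spammer can rank which engagement signals
# are worth manufacturing. These are the illustrative weights from the sketch
# above, not Twitter's actual values.
WEIGHTS = {
    "prob_favorite": 1.0,
    "prob_reply": 10.0,
    "prob_video_half_watched": 0.5,
    "prob_author_engages_with_reply": 50.0,
    "prob_report": -100.0,
}

for event, weight in sorted(WEIGHTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{event:33s} {weight:+7.1f}")
# The top rows immediately suggest farming author engagement and replies:
# exactly the kind of information that was hidden before the release.
```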

From the perspective of Twitter's developers, the job was in a way easier before the algorithm was public: they only had to build an algorithm that achieved the desired behavior as well as possible. If the machine learning model didn't work exactly as it should, they could just add a calculation afterward to fix it. Now it's not so simple, because there's a second-order effect: public information can be used to exploit the algorithm, making it worse. This creates pressure toward opacity - more machine learning and fewer deterministic calculations.