Multiverse Podcast Episode 2: We must go deeper
The transcript is also available at Temi.
This is Episode 2: We must go deeper.
So there's an interesting discussion online from a Stanford computer scientist and partner at Andreessen Horowitz named Ali Yahya, who I've had the pleasure of speaking with. He worked on a few problems at Google Brain using something called a Robot Hive-Mind. And the idea is that by using distributed data and machine learning, they're able to address this issue of the long tail of AI. So with the long tail of AI, you have a lot of problems like this: say you're trying to make a robot. In his example, the robot is trying to open a door. That's easy to solve just with expert systems or direct programming, because you can just make something that grabs a door handle at a specific height and uses computer vision in a very basic way. The problem is, if you start looking at the different types of doors in your house or apartment, you'll actually see there's a pretty wide variety.
If you start looking further out into the broader world, you'll see that there are countless different types of door knobs and door knob devices, and types of door structures, almost too many for a system to just be programmed by hand through expert systems. And you really need to start thinking about this as a long tail problem. So he goes on to explore additional issues, such as different types of lighting and different environmental conditions. You don't know if there's a heavy wind blowing in a certain direction, or you don't know, for example, if it's going to be extremely dark or so bright that the cameras are maxed out. And if you start thinking about even more unusual examples, like doors on ships or on an RV or in a truck, it starts to become a problem that has an almost infinite number of variations.
And I think that this is an interesting problem, this long tail of AI. If you think about it, a lot of problems in AI are going to be long tail, maybe even the vast majority of them. I'll give you another example of a problem in machine learning. My family used to live in the South, and apparently Siri was pretty terrible for any accent that wasn't used around Cupertino. So there's an author from Southern Living, and she noted that she tried to dictate a text message to her friend through Siri. In the message, she says, "just wanted to see how y'all are". Siri heard "Just wanted to see how are you out there". In other words, it pretty much had no idea what she was saying. I spoke to a friend who worked as an engineer at Apple on language recognition.
And she said that for Mandarin, the Beijing style of Mandarin is quite good, but the support for different dialects is pretty scary. It's actually very poor. And she thought that it was never going to get done. They were trying to hire people from one region to interpret training recordings from several other regions. The team was tiny as well. This touches on The Mythical Man-Month. You can't just throw more engineers at a problem. You need to leverage the community for this. This will have to wait for another podcast. So if a company like Apple or Tesla has this problem, I mean, Apple has a $2 trillion market cap and Tesla's over $600 billion in market cap. These guys have virtually unlimited resources. Imagine if you're a startup founder or you're running a foundation or nonprofit, and you have almost no resources. What chance do you have to make an equitable and robust machine learning based product that's going to be free of these biases?
A lot of this comes from the fact that these engineers live around Silicon Valley. When they encounter these problems personally, they're going to be much more likely to proactively fix them. That's just less and less likely the further you get out from this kind of epicenter. And I think that it's perfectly understandable. It's not like Apple or Tesla are doing this intentionally, it's just the problem with centralized and biased data collection. Everything on the Hadron multiverse is an AI app. Some of these apps are standalone apps designed to work with partners and customers. That's like Hedge, which some of you have seen already with one of our partners. Others are more like utilities that you can use as a building block or a service to help you as a developer. That helps you make a better AI app from more diverse data and from a broader range of sources.
And one of them is an AI app that allows for decentralized data collection. It's something that we use, and it's something that we wanted to make available to other developers. Of course, with this kind of integrated economic system, there's competition for this as well. If you think that you can make a better decentralized data collection app, you are definitely free to do so -- that's decentralization at its most basic. It's essentially providing an incentive for people to collect and verify good data for training machine learning systems. If you go back to that door problem that Ali talked about, you can imagine how difficult it would be for a company to collect images of doors in all situations: in all lighting conditions, in every country, for every social class, in every neighborhood.
I mean, you can imagine how many different conditions there are, even just the condition of the door itself. You could take the same door and put it in different places. One's going to be dusty and dirty and another's going to be pristine, or maybe painted a different color. And it is just essentially an impossible problem to program manually, and it's going to be almost impossible to collect it all within one region or one city. So you could argue that perhaps you collect this data on Facebook or Google Image Search by searching for the word door or door knob, but you can imagine a lot of that data is going to be tagged in languages that the developers don't speak. For example, you search for doors in English and you're going to get a lot of results of doors in English speaking countries, but good luck trying to find these in every single language that you're trying to cover.
You're trying to make a system that's not brittle. That's going to be very difficult. We actually went through this exercise in a previous company. And, Michael, you might want to talk a little bit about this. We were training a deep learning model that does automatic image tagging for SEO. And we found it was extremely difficult to collect additional data from image search. It's a pretty terrible experience, because the quality of the images is typically extremely poor when you do an image search. This is not just Google, right? This happens on whatever platform you are going to use for collecting open source or even closed source images. Typically the highest ranked images in that list are all going to be from companies who are gaming the search algorithms to appear at the top of the search results. You know what I'm talking about. It's basically Shutterstock and other stock photography crap that looks basically all the same, with all the same looking models, with all the same lighting conditions, and watermarked to hell. So it's really not very usable, especially for training a deep learning system. You're almost going to make a system that recognizes the watermark instead of the object you're trying to get.
Yeah, that's definitely what we experienced when we tried to refine our image recognition models with images derived from search. The problem was that the images people were uploading to our website didn't look anything like the top search results for a particular search term. Another problem we faced was that the initial model was trained on some pretty faulty data. We used a publicly available pre-labeled image set, but later discovered that the labeling was inconsistent. There were some images which were labeled for a particular class if there was even a hint of that class in the image. The example that pops up in my head is the sunglasses one. For the sunglasses label, we had a variety of different pictures. Some were like product pictures, where the sunglasses took up the whole image, very representative of sunglasses. But then there were some pictures which were groups of people where one person had sunglasses protruding from their pocket.
It's really hard to properly train an algorithm when data isn't consistent. Here's another example of data labeling that doesn't help. We found that the image data had a number of images labeled with mufti, which I learned means plain clothes worn by a person who normally wears a uniform. I have no idea how the labelers of the images determined if a person in a photo was wearing mufti. And it was clear after all our training efforts that our algorithm didn't know, either. Being able to have a decentralized data collection app can really help with improving the quality of the training data.
Another AI app on the multiverse is a little more subtle in terms of why it helps address the long tail. That's a data annotation app. We also call it the Viewer sometimes. We think it's also something that needs distribution and decentralization as well. Because again, if you're trying to label data that comes in, you need people from all different cultures and regions. And you also need people from different cultures and regions to really curate the results that come back. You need multiple levels of curation to really get the kind of extremely robust systems that you'd want. Because what some people consider a door, other people might consider a portal, or something on a ship might not be considered a door at all. So you really need to have all this alignment in the data that you collect to make sure that it addresses the problem you're trying to solve. The door problem is probably not the one you're trying to solve, but it is just an example of the kind of challenges that you might encounter in the real world.
Thanks, Cliff. I kind of chuckled when you pointed out that most AI is developed and controlled by a very small elite group of companies. I was an engineer at Google for 15 years, and I had an up close view of its journey with AI, from the early days of recognizing cat pictures to that seminal AlphaGo moment that you mentioned, and then to actual consumer products like translation, phrase suggestions in email, and personal photo search. There's always a lot of fanfare for algorithmic advancements and cool products, but what gets less news are the behind the scenes details and difficulties in developing and working with AI. Maybe the best example of this was an incident in which the Photos app misclassified some pictures of darker skinned people as a certain type of animal. You might remember this. It was a really bad mistake and made quite a bit of news.
I remember going to the post-mortem presentation by the Photos team, where they explained how it happened. When most people think of software bugs, they think of a developer making a mistake in the logic that the program follows. So some people might have jumped to the conclusion that there was some engineer who either accidentally or maliciously coded in a mapping from darker skinned people to this animal category. But as Ev pointed out, machine learning is different from logic based coding. It's not built on step-by-step instructions. Instead, it learns by deriving patterns from the data it's trained with. So the next question you might ask is, okay, did the training data contain darker skinned people labeled as those animals? And the answer is no, it didn't. It did have pictures of the actual animals correctly labeled, as well as correctly labeled pictures of darker skinned people, but there just weren't enough of them for the AI to learn the correct patterns and associations. In machine learning parlance, this is called data bias.
That's when the training data set is not representative of the data that the AI processes in the real world. And this really gets to the heart of one of the main challenges in machine learning, which is data quality. When you think about it, machine learning is fundamentally dependent on good data, because that's what does the job of teaching in the world of machine learning. We can't really get around that. And it was particularly striking to realize that even Google, with its piles of cash and top talent in engineering and research, struggles with acquiring good training data for machine learning. At the end of the day, it's always going to be a hard problem because it requires intelligence from a lot of humans to produce this data, and it is not one of those things that you can algorithm away.
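As a rough illustration of what "not representative" can mean in practice, here is a minimal sketch that compares how often each label shows up in a training set against how often it shows up in a sample of real-world data. The function names, the toy labels, and the `min_ratio` threshold are all made up for this example; it is not how the Photos team or anyone else mentioned here diagnoses data bias.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's fraction of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def underrepresented(train_labels, real_world_labels, min_ratio=0.5):
    """List labels whose training share is below min_ratio times their real-world share.

    min_ratio is an arbitrary, hypothetical threshold chosen for illustration.
    """
    train = class_shares(train_labels)
    real = class_shares(real_world_labels)
    flagged = []
    for label, real_share in real.items():
        train_share = train.get(label, 0.0)
        if train_share < min_ratio * real_share:
            flagged.append(label)
    return sorted(flagged)


# Toy data: class "b" is 10% of training data but 50% of what the model sees in use.
train_labels = ["a"] * 90 + ["b"] * 10
real_world_labels = ["a"] * 50 + ["b"] * 50

flagged = underrepresented(train_labels, real_world_labels)
print(flagged)
```

A check like this only catches the imbalance you already have labels for. It says nothing about conditions nobody thought to collect, which is the harder part of the long tail.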
Another big challenge in machine learning also has to do with data quality. But in this case, it's not about the quantity or diversity of the dataset, but about the correctness of the data labels used in training and validation. Just last month, researchers at MIT published a paper showing that some of the most commonly used datasets have a significant number of errors. For example, they found that 6% of the ImageNet validation set had errors and 10% of QuickDraw labels had errors. There's a proverb in machine learning: garbage in, garbage out, which is just another way of saying that machine learning learns what you teach it. So in the same way that training data biases limit what the model can learn well, inconsistencies in the training data labels will result in a similar rate of errors at inference time.
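To make the label error point concrete, here is a small sketch of one side of it: when the validation labels themselves are wrong, even a model that gets every example right looks like it is making mistakes. The 10% rate is just an illustrative number, and `corrupt_labels` and `measured_accuracy` are hypothetical helpers written for this example, not anything from the paper mentioned above.

```python
def corrupt_labels(true_labels, error_every, num_classes):
    """Return a copy where every `error_every`-th label is shifted to a wrong class."""
    noisy = list(true_labels)
    for i in range(0, len(noisy), error_every):
        noisy[i] = (noisy[i] + 1) % num_classes
    return noisy


def measured_accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# 1000 examples over 5 classes, with 1 in 10 validation labels recorded wrongly.
true_labels = [i % 5 for i in range(1000)]
noisy_labels = corrupt_labels(true_labels, error_every=10, num_classes=5)

# A hypothetical model that always predicts the true class.
perfect_predictions = list(true_labels)

accuracy = measured_accuracy(perfect_predictions, noisy_labels)
print(accuracy)
```

The measured accuracy tops out at the share of labels that are correct, so with labels like these you can't tell a better model from a worse one once both get close to that ceiling.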
This issue of label inconsistency becomes more challenging as the classification task becomes more specialized, compared to simple image classification. For example, sentiment analysis tends towards even more inconsistency because the answer is more subjective. And, at the extreme end of the spectrum, you have tasks like medical image analysis, where it's well known that there's a high rate of disagreement even among experts. Hadron has been working with a large biotech company for a couple of years now. And we've developed a technology for addressing this problem by combining the statistics and algorithms from the machine learning space, with principles of consensus from the blockchain space. Essentially we've developed an AI to refine data sets, making it possible to train much more accurate AIs for these difficult tasks.
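The episode doesn't describe how that technology works, so the following is not Hadron's method. It is only a minimal sketch of the simplest form of label consensus: several annotators vote on each item, the majority label is accepted if enough of them agree, and everything else is held back for another round of review. The item names, labels, and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.6):
    """Return (label, agreement) if the top label reaches min_agreement, else (None, agreement)."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    agreement = top / len(votes)
    if agreement >= min_agreement:
        return label, agreement
    return None, agreement


# Hypothetical annotations: five annotators per item.
annotations = {
    "scan_001": ["benign", "benign", "benign", "malignant", "benign"],
    "scan_002": ["benign", "malignant", "malignant", "benign", "unsure"],
}

results = {item: consensus_label(votes) for item, votes in annotations.items()}
for item, (label, agreement) in results.items():
    status = label if label is not None else "needs review"
    print(item, status, agreement)
```

Plain majority voting treats every annotator as equally reliable, which is exactly what breaks down on expert tasks. More careful schemes weight annotators by their track record, but that is beyond this sketch.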
Thanks for joining us! In the next episode, the Hadron Multiverse -- Tokenomics. Find us on iTunes or your favorite podcast platform by searching for the Hadron Multiverse. Talk to you all then.