Extreme Classification – New Paradigm for Ranking and Recommendation

February 12, 2020


>>We have, again, a very stellar lineup
of speakers, as always. Beginning with Manik Varma
from Microsoft Research. Manik started out his career
as a Vision Researcher, but has now moved to interests in ML, and he does some very interesting work. In fact, the Smart
Game videos that were playing earlier and
[inaudible] in this talk. He’s involved in
that project too, but this talk that he’s going
to do today is focused on some very interesting work
that he’s been doing now for many years on
Extreme Classification. So, without much further ado,
I’ll hand it over to Manik.>>Thank you, Satish. Right. So, I’m Manik Varma from MSR India and I’m also
an Adjunct Faculty at IIT Delhi. And again, before I start
my talk, vote for PGM. So, I wanted to say I’ll be talking about
Extreme Classification, which is a new research area in Machine Learning that
we started in 2013, and which has also
opened up a new paradigm for thinking about key
applications in our industry, such as ranking
and recommendation. This is a joint project with my PhD students at IIT Delhi, Kunal Dahiya, Himanshu Jain
and Yashoteja Prabhu, as well as my colleagues at Bing Ads Rahul Agrawal
and Shrutendra Harsola, and with Anil Kag who is a research fellow here
at MSR India with me. Now, some of you
might not have heard the term Extreme
Classification before, so let me start by
giving some context. In classification, the complexity of the learning task has grown from binary classification, where we learned to pick a single item from two labels, to multiclass classification, where we learned to pick an item from more than two labels, to multi-label classification, where we learned to pick the most relevant subset of these labels.
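As a rough illustration of the three settings (a toy sketch, not from the talk), the difference lies entirely in what the classifier is allowed to output:

```python
# Toy illustration (not from the talk) of how the three settings differ
# only in what the classifier may output.
labels = ["cat", "dog", "car", "tree"]

binary_prediction = "cat"                  # binary: one of two labels
multiclass_prediction = "dog"              # multiclass: exactly one of the L labels
multilabel_prediction = {"cat", "tree"}    # multi-label: any subset of the L labels

# With L labels, the multi-label output space has 2**L elements, which is why
# even 5,000 labels already gives 2**5000 possible outputs.
print(2 ** len(labels))  # 16 possible subsets in this toy example
```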
At the same time, the complexity of the learning task has also grown in terms of the number of labels being considered. So, we moved on from working with two labels for binary classification to working with tens to hundreds and even thousands of labels for multi-class and multi-label learning. And if you looked at
the state of the art in 2012, then the largest multi-label
dataset had 5,000 labels, and so the size of
the output space was two to the power 5,000, which is quite large, and so going beyond that was
considered very hard. Then in WWW 2013, we published a paper which exploded the number of
labels being considered in a multi-label classifier
from 5,000 to 10 million. The motivating application
was to build a classifier that could automatically
predict the subset of millions of Bing queries, that might potentially lead
to a click on a new ad. So, we wanted to build a tool for advertisers such as Geico. So, that when they
created an ad like this, they could use
a tool to figure out which Bing queries might
lead to a click on the ad, and then Geico could
bid on these queries, and then if the query
got asked on Bing and the bid
was high enough, then the ad might get shown. Now, as you can well imagine, predicting queries from
ads and webpages is a really important problem
from both the commercial and
the research perspective. And so, many sophisticated NLP machine-learning
and IR techniques have been developed
in the literature. However, we decided to avoid all these approaches and simply
formulated the problem by taking the top 10 million
monetizable queries on Bing and treating each one of them as a separate label in an extreme
multi-label classified. So, we learned a multi-label random forest classifier
called MLRF, which would take
this ad as input, extract a bag-of-words
feature vector from the raw HTML
underlying the ad, and then simply classified
the feature vector into the subset of top 10
million Bing queries that might lead to a click. Now, this was a completely
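To make this formulation concrete, here is a minimal sketch of the ad-to-queries pipeline just described; the feature extractor, the toy query list, and the overlap-based scorer are all stand-ins, not the real MLRF classifier over 10 million queries:

```python
# A minimal sketch of the ad -> queries formulation described above.
# Everything here is a stand-in: the real system learns a classifier over
# the top 10 million monetizable Bing queries, not this toy setup.
from sklearn.feature_extraction.text import CountVectorizer

# The label space: each query is a separate label (tiny toy list here).
queries = ["distilled water", "buy distilled water", "car insurance", "engine oil"]

# Bag-of-words featurization of the ad text (in reality, the raw HTML).
vectorizer = CountVectorizer()
vectorizer.fit(queries + ["tesco distilled water 5 litres great value"])

def predict_queries(ad_text, top_k=2):
    """Score every query-label against the ad and return the top subset."""
    ad_vec = vectorizer.transform([ad_text])
    scores = []
    for q in queries:
        q_vec = vectorizer.transform([q])
        # Toy scorer: bag-of-words overlap. The real classifier learns
        # per-label models instead of using this similarity.
        scores.append((q, (ad_vec.multiply(q_vec)).sum()))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [q for q, s in scores[:top_k] if s > 0]

print(predict_queries("tesco distilled water 5 litres great value"))
```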
Now, this was a completely new and very different way of thinking about the problem, and as a result, MLRF gave significantly higher precision and coverage, as compared to the leading techniques that were in production in Bing at that point of time. So, as I mentioned, MLRF was published in WWW 2013 and it started this area
of extreme classification, which deals with multi-class
and multi-label problems, involving millions of labels. And since 2013,
Extreme Classification has come to be a thriving area of research in both academia
and industry. So, students are
now doing PhDs in the area, and
publications are coming out not only at top
tier Machine Learning and artificial
intelligence conferences, but also in top tier
applied conferences. Very popular workshops have been organized in the area
over the last five years, and if you turned up to the NIPS Extreme Classification workshop
held last month, you would have found
about 200 people from leading companies and
universities at the workshop. And if you would like to carry
out research in this area, then you might want
to go and check out the Extreme
Classification repository, which we maintain
and where we make freely available code, datasets, benchmark results et cetera, which make it very easy for new researchers to come in
and get started in this area. Also, what we’ve come
to realize since 2013, is that extreme
classification generalizes far beyond being
a simple tool for advertisers. In fact, one of
the most interesting questions to have come out of
our research, I think, is when or where in the world will you ever have 10 million labels
to choose from. What are the applications
of Extreme Classification? Well, if you think about it, 10 million is
a really large number. Even if it were to take just one second in order
to read out a label, it would take you
almost four months to go through a list
of 10 million labels without eating or sleeping. And to just put
things in perspective, when I was doing my PhD
Jitendra Malik and Pietro Perona would get up and tell the computer vision community that, according to Biederman, there are only 60 thousand object categories in the world. So, almost none of the computer vision problems
will make the cut. And even if you were to pick up an English language dictionary, it would only have between 100,000 and 500,000 words in it, depending on which
dictionary you picked up. And so, many NLP problems
will also not make the cut. Nevertheless, people have
found high-impact applications of extreme classification
over the last five years, in various areas ranging
from information retrieval, to recommender
systems, to NLP and vision and even bioinformatics
for that matter. So for the rest of the talk, I’ll be focusing on some
of these applications. And in order to discuss them, I’ll be switching
to the language of ranking and recommendation. The way we’ll model these applications is that there will be a space of users X and a space of items Y, and what we’d like to do is learn a multi-label classification function f, which is going to take a point in the space of users and map it to a set of points in the space of items. So that when a new user comes in, we can simply apply our classifier, see which items get predicted, and then return the items to the user, either as a recommended bag or as a ranked list depending on the application.
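Roughly, the reformulation looks like the sketch below; the linear per-item scorers and dimensions are illustrative stand-ins, not the talk's actual models:

```python
# A rough sketch of the f: users -> subsets of items reformulation.
# The scoring model is a stand-in; real systems use extreme classifiers
# such as Parabel over millions of item-labels.
import numpy as np

rng = np.random.default_rng(0)
num_items, dim = 1000, 32

# One linear scorer per item-label (illustrative only).
item_weights = rng.normal(size=(num_items, dim))

def recommend(user_features, top_k=10, as_ranked_list=True):
    """Apply the multi-label classifier f and return predicted items."""
    scores = item_weights @ user_features        # one score per item-label
    top = np.argsort(-scores)[:top_k]            # highest-scoring labels
    return list(top) if as_ranked_list else set(top.tolist())

user = rng.normal(size=dim)
print(recommend(user, top_k=5))                         # ranked list of item ids
print(recommend(user, top_k=5, as_ranked_list=False))   # recommended bag
```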
So, as you can see, extreme classification opens up a new paradigm for
reformulating applications, such as ranking
and recommendation by taking each item to be ranked or recommended
and treating it as a separate label in
a multi-label classifier. So let’s focus on some of the concrete examples
of this reformulation, staying within advertising
and Bing to start with. So here is an ad for
Tesco’s Distilled Water. And as you can
see, the advertiser has bid on just a single query, which is distilled water
five liters. And unfortunately,
what this implies is, that this ad can only get
shown if this query gets asked. And so, as a result, this ad has not been shown on Bing
for the last six months, even though it has been sitting in our inventory all along. So, we would like to rectify this problem by predicting the subset of Bing queries that might lead to a click on the ad, and then inserting the queries and the ad into the inverted Bing ad index, so that if one of these queries gets asked, then this ad can get shown automatically, even though the advertiser forgot to bid on the query.
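As a small sketch of the idea (the index structure and names are illustrative, not Bing's actual ad index), inserting predicted queries simply makes the ad retrievable for queries the advertiser never bid on:

```python
# Illustrative inverted ad index: query -> set of ad ids.
# The real Bing ad index is far more elaborate; this only shows the idea
# of inserting predicted queries so the ad becomes retrievable.
from collections import defaultdict

inverted_index = defaultdict(set)

def insert_ad(ad_id, bid_queries, predicted_queries):
    """Index the ad under both advertiser-bid and classifier-predicted queries."""
    for query in set(bid_queries) | set(predicted_queries):
        inverted_index[query].add(ad_id)

def retrieve_ads(query):
    """Ads eligible to be shown when this query gets asked."""
    return inverted_index.get(query, set())

insert_ad(
    ad_id="tesco_distilled_water",
    bid_queries=["distilled water five liters"],
    predicted_queries=["distilled water", "where to buy distilled water",
                       "distilled water delivery", "tesco distilled water"],
)
print(retrieve_ads("distilled water delivery"))  # {'tesco_distilled_water'}
```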
So, in this case, the input to our extreme classification function f is going to be the ad, and our labels are going to be the top 10 million monetizable queries on Bing. And as I just mentioned, this is not a new problem, we were not the first
ones to think of it and Bing already has an entire ensemble of
50 top notch algorithms, that are just predicting
queries from pages based on these sophisticated NLP machine learning
and IR techniques. But unfortunately, all of them failed in
this particular case. Bing made just one
recommendation, which is water five, which isn’t a very good one,
if you ask my opinion. Whereas, if you look
at our predictions, then they’re all on the money. So, we recommend distilled water, where to buy distilled water, distilled water delivery, Tesco’s distilled
water, and so on. And because of our predictions, this ad has now been shown on Bing multiple times and
has received many clicks. So let me take just a moment
to reflect on why the traditional approaches
don’t work well in this particular case and why
extreme classification does. And the high level intuition, is that a complex
Machine Learning Model, trained on a lot of data, might often outperform
a simple one. So, the reason
the traditional approaches don’t work is because the traditional approach tries to reduce the complexity of the problem by reducing it to a binary classification task. So the traditional approach learns a binary classifier h, which scans every phrase present on the ad, in order to determine whether it would make for a good prediction or not.
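Roughly, that baseline looks like the sketch below, where the binary classifier h is a placeholder; the important point is that its candidates are limited to phrases that literally occur on the ad:

```python
# Sketch of the traditional binary-classifier baseline: scan every phrase
# on the ad and ask a binary classifier h "is this phrase a good query?".
# h is a placeholder; the candidates are limited to phrases on the ad.

def extract_phrases(ad_text, max_len=3):
    """All n-grams (up to max_len words) that occur on the ad."""
    words = ad_text.lower().split()
    return {" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)}

def h(phrase):
    """Placeholder binary classifier: is this phrase a good query?"""
    return len(phrase.split()) >= 2          # stand-in decision rule

def predict_queries_baseline(ad_text):
    return [p for p in extract_phrases(ad_text) if h(p)]

ad = "Tesco distilled water 5 litres"
print(predict_queries_baseline(ad))
# "buy distilled water" can never be predicted: it is not a phrase on the ad.
```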
Now, unfortunately, ads are pithy and have very little text on them. So the binary
classifier can never recommend or predict a query
such as buy distilled water, simply because this query never occurs as a phrase
on the ad itself. Furthermore, the binary
classifier is also low capacity. So, it will often make mistakes, which is what you see
in this particular case when it recommends that water five is a good query to
predict for this particular ad. On the other hand, in
extreme classification, we actually embrace
the complexity of the problem by testing each ad against
all 10 million queries. So, we learn a hierarchy
over the space of queries, where all 10 million queries are in the root node
to start with, but then all the fruit
related queries go left, and all the engine
related queries, battery related
queries go right. And then, once we’ve gone right, let’s say the battery and the engine oil queries go left, whereas all the distilled water queries go right. So that when a new ad comes in, we can traverse it down the hierarchy very efficiently in logarithmic time, and can accurately predict all the distilled water queries, even though they don’t occur as phrases on the ad itself.
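Here is a very small sketch of the label-tree idea; the hand-built partitions and routing rules are placeholders rather than Parabel's learned ones, but they show how a new ad is routed to a leaf in logarithmic time and only that leaf's queries are predicted:

```python
# Tiny sketch of a label tree over queries. Parabel's real algorithm learns
# balanced partitions and per-node classifiers; here the routing rule is a
# placeholder, just to show the logarithmic-time traversal idea.

class Node:
    def __init__(self, queries=None, left=None, right=None, router=None):
        self.queries = queries      # only set at leaves
        self.left, self.right = left, right
        self.router = router        # function(ad_text) -> "left" or "right"

def route(ad_text, node):
    """Traverse the tree from the root to a leaf, then return its queries."""
    while node.queries is None:
        node = node.left if node.router(ad_text) == "left" else node.right
    return node.queries

# Hand-built toy tree (a real tree over 10M queries has depth ~ log2(10M) ~ 23).
leaf_engine = Node(queries=["engine oil", "car battery"])
leaf_water = Node(queries=["distilled water", "buy distilled water"])
root = Node(
    left=Node(queries=["apple juice", "bananas"]),
    right=Node(left=leaf_engine, right=leaf_water,
               router=lambda ad: "right" if "water" in ad else "left"),
    router=lambda ad: "right" if ("water" in ad or "engine" in ad) else "left",
)

print(route("tesco distilled water 5 litres", root))
```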
So, our hierarchical algorithm for extreme classification is called Parabel, and it will be published in 2018, and Parabel has a number
of advantages over MLRF. First, it can be up to 10
thousand times faster to train. So, what would take
us a few days to train on a thousand
core cluster using MLRF, can now be trained
in a few hours, on a single core of a standard
desktop using Parabel. And so, I really
need to commend Yash, because he’s done
a remarkable job on this paper, because many people all across the world for
the last three or four years, we’re trying to improve
upon our previous state of the art and they manage to
improve some of these metrics, but then the other metrics
would always get worse. So, for instance, people
would try and increase the prediction accuracy
or reduce the model size, but then the training time
and the prediction time of these new algorithms would
become unacceptably high. But for the first
time last year, Yash developed this tree model
called Parabel, which improves on all the metrics. So, as I said, it makes training significantly faster. It also reduces the model size from terabytes to gigabytes. It increases
the prediction accuracy, while maintaining
prediction time. So, we will be making
the code for Parabel freely available on the extreme classification
repository later on. And in addition, Rahul, Shrutendra and Anil made a number of domain-specific
contributions, which helped our
extreme classifiers to work really well on
real-world applications. And as a result of that, our extreme classifiers have now shipped in a number
of products in Bing ads, in almost all the
international markets. So here is a small
sampling of some of the products and
some of the markets, ranging from dynamic
Search Ads in the UK, to Text ads in
the French market, to Product ads in
Germany, and so on. And in each of these cases, extreme classifiers are ranked, either first or second, in that long list
of 50 algorithms, which are doing the same thing, and they’re generating
clicks that are in the double digits
in each case. So, this was an application
in advertising. Let me also consider an application in
recommender systems. So, we went and
downloaded information about 3 million
active Amazon items, from Amazon obviously, and now given the fact that the user might be interested in buying
one of these items, in this case a book on the
Rare Wildflowers of Kentucky. What we’d like to
do is to predict the other items that the user might also be
interested in buying. So, this is
the classical item to item recommendation
task and it is traditionally solved using
collaborative filtering and matrix factorization
based methods. But we can again reformulate it as an extreme
classification task, where this time the input to the extreme classification
function f is going to be the
product description of this particular item, and then the labels
are going to be the 3 million
active Amazon items. So, if we look at the results, then we see that Amazon
does not do a very good job, it predicts only 3 or it
recommends only 3 other books, whereas it should have
recommended a lot more, and if you look at
our recommendations, then we recommend not only whatever Amazon
was recommending, but also more, and with a lot more diversity. So, we recommend
birds of Kentucky, wildlife of Kentucky,
biodiversity in Kentucky, woody plants of
Kentucky, and so on. So, if you look at
the traditional approach to recommending items, it’s based on
collaborative filtering, where we take
the ratings matrix, which is a matrix specifying which user likes which items, and then we try and
factor this matrix as the product of
two low-rank matrices, one of them being a tall skinny matrix of user traits, and the other being a short fat matrix of item attributes. And then, in order to determine whether to recommend bananas to the third user or not, what we do is we learn these two matrices from our training data, and then once they’ve been learned, we take the row corresponding to the third user, multiply it by the column corresponding to bananas, and if the product is greater than a certain threshold, then we recommend bananas and otherwise we don’t.
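As a quick illustration of this low-rank recipe (dimensions, threshold, and factors are arbitrary stand-ins), the recommendation decision is just a dot product against a threshold:

```python
# Sketch of the collaborative filtering / low-rank factorization idea:
# R (users x items) is approximated as U @ V, then item j is recommended to
# user i if the dot product U[i] . V[:, j] clears a threshold.
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, rank = 5, 4, 2

# Stand-ins for learned factors (in practice U, V are fit to the ratings matrix).
U = rng.normal(size=(num_users, rank))      # tall, skinny: user traits
V = rng.normal(size=(rank, num_items))      # short, fat: item attributes

def recommend(user_idx, item_idx, threshold=0.0):
    score = U[user_idx] @ V[:, item_idx]
    return score > threshold

banana_idx = 1
print(recommend(user_idx=2, item_idx=banana_idx))  # recommend bananas to user 3?
```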
Now, unfortunately, this simplifying low-rank assumption that
the ratings matrix can be factored as the product of these two low-rank matrices, completely breaks down at
our level of complexity. And therefore, collaborative
filtering might not make very good recommendations when dealing with millions of items. On the other hand, in
extreme classification we make no such assumptions and in fact we do not ever approximate the matrix
as being low rank, or assume that there are linear distance-preserving embeddings, etc., and because of this, we can make very good predictions both accurately and efficiently, even when there are millions
of items to be dealt with. So, our algorithm for warm start
recommendation is called SwiftXML, and it will be published in WSDM 2018, and I should point out that WWW and WSDM are both top tier conferences for this kind of research. And we’ll also be
making the code for SwiftXML available shortly on the extreme
classification repository and all of you can try
it out for yourselves. And I won’t have time to go
into the technical details, but the key technical
contribution in SwiftXML is that we can now train not only on features from all the training points, but we can also leverage features from all the labels during both training and prediction.
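A rough sketch of what using label features alongside data-point features might look like; the additive scoring form below is purely illustrative and is not SwiftXML's actual model:

```python
# Illustrative only: score a (point, label) pair using both the data point's
# features and the label's own features (e.g., the text of an Amazon item).
# SwiftXML's actual objective and tree structure are more involved.
import numpy as np

rng = np.random.default_rng(0)
dim_point, dim_label = 16, 8

# Stand-ins for learned weights over point features and label features.
W_point = rng.normal(size=(dim_point,))
W_label = rng.normal(size=(dim_label,))

def score(point_features, label_features):
    """Combine evidence from the data point and from the label itself."""
    return W_point @ point_features + W_label @ label_features

x = rng.normal(size=dim_point)          # e.g., features of the query item
labels = {f"item_{i}": rng.normal(size=dim_label) for i in range(5)}

ranked = sorted(labels, key=lambda name: -score(x, labels[name]))
print(ranked[:3])                        # top recommended items
```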
So, let me finally cover just one more application,
the area of web search. So, if you go to a search engine
and submit a query, then it will recommend related queries that might
have served you better, or it might suggest
related queries that you could have
asked in addition to get more information
on the topic. So, for example, if you
go to Bing and search for “cam procedure shoulder”, Bing will recommend
that you might try asking about “Cam Newton
shoulder surgery” instead. So, this problem is known
as related searches, and we can again formulate it as an extreme
classification task, where the input to the extreme classifier
this time is going to be a query and our labels are going to be the top 100
million queries on Bing. Now, as you can see
from the results, Bing again does not
do a very good job. It predicts only one query, it should have done much more, and in fact this query is not related to the original
query it turns out. So, Cam procedure stands for comprehensive
arthroscopic management, it’s a very specific type
of shoulder surgery, whereas Bing thinks that you’re talking about
the football player Cam Newton and it recommends you learn about
his shoulder surgery instead. Whereas, if you look at
what we’re recommending, then we also recommend a cost of an arthroscopic
shoulder surgery, what to wear after
a shoulder surgery, how long should
you take off work after the shoulder surgery, different types of
shoulder surgery and what should be music to the ears of the legally minded
people in the room. We also recommend
shoulder replacement lawsuits. So, the reason
that traditional approach doesn’t work well in
this particular case, is because a traditional
approach is based on sessions. So, a session is when a user
comes in, types in a query, doesn’t like the results
that he sees, reformulates the query quickly and then goes and clicks
on something, right? So, you know that
the second query is a good recommendation
for the first query. The first query didn’t achieve the results whereas
the second did. But if your input query “cam
procedure shoulder” is new, it didn’t occur in
any of the sessions, then most of these session-based methods will fail, no matter how much fancy Machine Learning you
put on top of them, which is why Bing makes
so few recommendations. And then, even the
traditional methods based on query similarity
go for a toss, because these two queries seem
to be very similar, right? So, “cam procedure shoulder”, “Cam Newton shoulder surgery”
they get embedded together, which is why Bing
makes this mistake. Whereas if you look in
extreme classification, then it has been
designed specifically to handle unseen
test points, right? So, handling new queries like
“cam procedure shoulder”, which have not been seen before, comes very naturally
and we can make all of these good-quality recommendations based on our extreme classifier built on these deep learning features.
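One way to picture how a classifier can cope with such a huge label space is a shortlist-then-score scheme over the deep feature space, sketched below; this is only a schematic illustration, not Slice's actual training or indexing algorithm:

```python
# Schematic shortlist-then-score sketch for a very large label space:
# find a few nearest label representations in the deep feature space,
# then score only that shortlist. (Not Slice's actual algorithm; the
# embeddings and scorers here are random stand-ins.)
import numpy as np

rng = np.random.default_rng(0)
num_labels, dim = 100_000, 64

label_embeddings = rng.normal(size=(num_labels, dim))   # one vector per query-label
label_classifiers = rng.normal(size=(num_labels, dim))  # one linear scorer per label

def related_searches(query_embedding, shortlist_size=200, top_k=5):
    # 1. Shortlist: nearest label embeddings (a real system uses an ANN index).
    sims = label_embeddings @ query_embedding
    shortlist = np.argpartition(-sims, shortlist_size)[:shortlist_size]
    # 2. Score only the shortlisted labels with their classifiers.
    scores = label_classifiers[shortlist] @ query_embedding
    return shortlist[np.argsort(-scores)[:top_k]]

new_query = rng.normal(size=dim)    # deep features of an unseen query
print(related_searches(new_query))  # ids of the recommended related searches
```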
So, our algorithm for this, based on these deep learning features, is called Slice, and Himanshu has done a really incredible job on Slice, because Slice scales to 100 million labels, which is well beyond the scale of any other classifier out
there in the world today. So, we’re going to try
and submit Slice to KDD, and hopefully again
we’ll try and make it available on the extreme
classification repository. So, I could cover
many different applications ranging from
tagging on Wikipedia to personal recognition
on Facebook, to learning universal
deep features etc, but I don’t want
to go on and on, so let me conclude by
just reiterating that extreme classification is a new area in
Machine Learning, which not only helps us tackle web-scale
classification problems, but which also helps us in reformulating
key applications, such as ranking
and recommendation. Over the years, we’ve
developed a number of algorithms for tackling
some of these applications, and all of them are available on the extreme
classification repository; Parabel and SwiftXML will
be available shortly, and so you can go and
check out the repository where you will find
not only our code, but code from lots of
different groups around the world, datasets, papers, benchmark results etc,
which will help you get started if you’re
interested. Thank you very much.>>Thank you, Manik. We’ll take a couple of questions, if you’re okay. Yeah.>>Vote for PGM.>>Okay. That’s done, so that’s ticked off. If folks can raise their hand, we have people going around with a mic. Yeah, Shinni.>>Just out of curiosity,
is entropy still the guiding force
for branching out?>>No. So, as Shinni said, entropy is where we started off with MLRF in 2013. It’s a standard function for trying to split a node in a tree. From entropy we moved on to NDCG, which is a ranking function, in FastXML in 2014. From there we moved on to a slightly more advanced tree splitting function in Parabel. So, the criterion has changed, and we’ve gotten better over time. It’s one of the main reasons we have better accuracy now.
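For reference, here is the standard NDCG@k measure mentioned above, in a minimal form; FastXML's actual tree-splitting objective built around it is more involved:

```python
# Standard NDCG@k, the ranking measure mentioned above. FastXML's actual
# node-splitting objective built around it is more involved than this.
import math

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the labels in the order the classifier ranked them.
print(ndcg_at_k([1, 0, 1, 0, 0], k=3))
```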
>>There’s a question at the back [inaudible].>>My name is Amita Vaker from
[inaudible] My question is how these classifiers can
adjust themselves when the number of labels changes
all the time.>>Right. So, that’s
a very good question. One of the things I
wanted to talk about is where I’m going with this, right, with 100 million labels: I actually want to
build the next generation of search engines, using
extreme classification. So, if I can get
extreme classification to scale to billions of labels, then I can think
of each document on the web as being
a separate label, and now I can train
my extreme classifier, so that when a query comes in, it can predict the subset
of relevant documents, and now I can rank them on the basis of
the classifier’s score, and I have a new kind
of search engine. So, obviously, in order
to get that to work, I have to be able to deal with labels changing
all the time because new documents will be
coming in all the time. So, that’s a new
research project that we have going
on at the moment. If you are interested in
doing a PhD with me on that, please come and talk
to me afterwards. This is my shameless Bing ad. And at the moment I have two concrete options for you, if you
can’t wait for a PhD. The first is that these classifiers are
very fast to train, so as your label distribution
changes over time, you should just retrain
your classifier and redeploy. The other thing you can do, with things like SwiftXML etc., because you have features on your labels, is to say: okay, in this leaf node I have these labels, they have these features, and this new label that came in is very similar to them based on its features, so I just add it onto this leaf node. And then you can start predicting that label as well.
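A toy sketch of that hack (leaf representations and the similarity measure are made up for illustration): place a newly arrived label into the leaf whose existing labels have the most similar features, so it can be predicted without retraining the whole tree.

```python
# Toy sketch of the "add a new label to the most similar leaf" hack.
# Leaf representations and the similarity measure are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Existing leaf nodes of the label tree, each holding labels with features.
leaves = {
    "leaf_water":  {"distilled water": rng.normal(size=dim),
                    "buy distilled water": rng.normal(size=dim)},
    "leaf_engine": {"engine oil": rng.normal(size=dim),
                    "car battery": rng.normal(size=dim)},
}

def add_new_label(label_name, label_features):
    """Attach the new label to the leaf whose labels are most similar on average."""
    def leaf_similarity(leaf):
        feats = np.stack(list(leaf.values()))
        return float(feats.mean(axis=0) @ label_features)
    best = max(leaves, key=lambda name: leaf_similarity(leaves[name]))
    leaves[best][label_name] = label_features
    return best

print(add_new_label("distilled water delivery", rng.normal(size=dim)))
```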
So, you can apply hacks like this currently; for a more principled
solution you’ll have to wait for a little while.>>Okay, any more questions? If not, we’ll give
a hand to Manik again. Thank you, Manik.>>Thank you, guys.>>Thank you.
