A Case for Correctly Rounded Math Libraries, Santosh Nagarakatte

ZEESHAN: Two minutes... two minutes. One minute, one minute. We've got people coming in. So far, so good, everybody? Yeah? Cool. Very cool. How many people are going to Strange Loop? Cool. Excited about that, right? It's gonna be fun. All right. So, Santosh is another speaker who, as we talked about earlier, came through recommendations and interesting things. Every once in a while something like this happens: I was looking at a conference, and David Van Horn, a really, really dope professor at the University of Maryland, posted something that came across my timeline. It was pretty cool. Go read it; it does such a good job of selling this talk, hopefully without too much pressure. Santosh gave a talk about the work they have been doing on a case for correctly rounded elementary functions, and Van Horn said, quote, "I can't stop thinking about this talk. It's one of the best keynotes I've ever seen." Not to make it too much pressure. But one thing he says, and I want to read this because it's really good: first, Santosh is working on an old, well-trodden problem, one of our favorite things, correctly rounding floating-point results. Teams of highly trained people have been working for decades on this kind of work, and despite all of that, every implementation is incorrect on some inputs. What I really like about this work is that it takes a really old idea and brings new techniques, new research, and interesting results. And, as Van Horn says, with their work on correctly rounded elementary functions, they remarkably pulled it off. I'll send out the tweet thread, which is really, really great. One thing we want to showcase, like Deirdre's talk today, is how new work changes everything, or changes how we think about previous things. So, without further ado, I want to introduce Santosh Nagarakatte. Close. Yeah, that was tough, I was working on that. He's at Rutgers University. Without further ado, thank you so much for being here at PWLConf. [ Applause ] SANTOSH: Can people at the back hear me? Okay. So, thanks, Zeeshan, for the introduction. Those were really, really good words; I'll try to live up to them in this talk. Again, the goal of this talk is to get you excited: there are a lot of cool things to do in this space, and people are not paying attention, okay? So, what am I going to talk about? I'm going to talk about the RLIBM project. The alternative title is: how we built the world's fastest math library. And who cares about math libraries? Everyone uses math libraries, right? Some people do care. Whether you like it or not, everyone uses a math library, even people who took calculus and want to forget about elementary functions. What do these math libraries contain? They contain elementary functions, which are functions of a single variable involving things like addition, multiplication, trigonometric functions, and polynomials. It's like plumbing: you only think about the plumbing when you have an emergency. Natural language processing, speech processing, we just had a talk on processing beats, right? It's all there in those spaces. So, unsurprisingly, you have implementations of these functions. Where do you find them? In math libraries; when you compile your code, they are automatically included, okay? But unfortunately, they're all wrong. Okay? So, I'll tell you what "wrong" means. I come from a formal methods school.
Everything is failing, right? So, one of the issues with having wrong libraries is non-portability. You have your fancy GPU running your training model; it produces some results. Port it to another GPU that another company brought out, and you get totally different results. Imagine you are the developer trying to debug it. What a nightmare. You might be thinking, Santosh is just another guy talking about random things, right? If you don't trust me, there's a nice keynote by Ali Rahimi at NeurIPS. In the middle of it there's this point: I changed the rounding mode in my training to round-to-zero. Why use round-to-zero? It's attractive for performance. And the accuracy dropped to 25%. Okay? The model had a reasonable accuracy, 85% or so. Now it's 25%. Doing nothing, just changing the rounding. So, this is where, if you care about portability, if you want to make your code robust, there are a lot of interesting things to do in this space. So, effectively, we want to build correctly rounded elementary functions. And as Zeeshan alluded to, I come from compilers and formal methods. You don't wake up one day and start working on numerical analysis, so let me tell you a story about how I got excited about this area. About 10 years ago, we had been working on proving the correctness of compiler optimizations. Compilers are large pieces of software, millions of lines of code; everyone relies on them, and there are bugs. It's a fact of life, bugs are always there. But you don't believe your compiler is wrong. When something is wrong, you think, oh, my code is wrong, let me fix it first. But compilers do a lot of crazy things, and sometimes they're wrong. We wanted to address that. We built formal semantics of the compiler in a proof assistant, in collaboration with my advisers at Penn. And then we realized it's very difficult to get traction from developers that way. So, what did we do? In collaboration with my student and a few other collaborators, we built this domain-specific language called Alive. So, effectively, what was Alive? Alive was a small DSL where you specify your optimization, and its correctness is proven automatically by generating verification conditions and giving them to Z3. If Z3 says it's correct, your compiler optimization is correct. And today a lot of LLVM developers use Alive to check the correctness of new optimizations. Okay? This was primarily for peephole rewrites. And my student said, okay, we have just been focusing on integer optimizations; what about floating point? Compiler optimization writers fall into two categories, right? One says: I don't want to do anything clever with floating point; whatever the code says, I'll just do that. Even when they do that, they make mistakes. So, we had a variant of Alive, called Alive-FP, that we used to check bit-precise floating-point optimizations: is the value produced exactly the same? And we found bugs. If you don't know floating point: there are two zeros, plus zero and minus zero, and one of the optimizations was changing the sign of zero. Okay? My student told this to his lab mate, and he said, who cares about the sign of zero? But there's a nice article that says if you change the sign of zero, it causes differences in branch cuts. We filed this bug, and it was immediately fixed, okay? And then there is the second class of compiler optimization writers. They have this demonic mode called fast math, or, as I would call it, incorrect math. Okay? So, what do they do? They treat any identity that holds over the reals as if it were a floating-point identity, and optimize your code accordingly. Associativity, for example, does not hold in floating point.
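Just to see how little it takes, here is a two-line illustration, assuming Python; reassociating a sum, which is exactly the kind of rewrite fast math allows, already changes the last bit:

    # Reassociating a floating-point sum changes the result in the last bit.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)                   # 0.6000000000000001
    print(a + (b + c))                   # 0.6
    print((a + b) + c == a + (b + c))    # False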
For compilers, being able to treat floating-point arithmetic as associative is very important, and you should not blame them: you can't do parallelization without associativity. Okay? So, there is this fast-math mode. And this is the advantage of having curious students, right? My student was taking a machine learning course at the time and had a nice dataset. He was implementing some machine learning training algorithm; he implemented it in C++, trained it on the dataset, and got reasonable accuracy. And then, out of curiosity, or magic, I would say, he turned on fast-math optimizations. And the accuracy tanked significantly. Doing nothing else: for the same number of iterations, same amount of time, the accuracy drops. And we wanted to debug it. What's the issue? Small errors getting magnified, right? So, debugging is a nightmare. Okay. We looked around: who can help us? A debugger? There was none. So, we built our own. Okay? What kind of debugger would be useful? We said, let's do shadow execution. What do we do? We have your regular floating-point program, and we have another version where we execute it with real numbers. Simulated, so it is very slow, but we do lockstep execution with real numbers, compare the two programs, and see where they diverge, okay? And we found that the divergence happened in the math libraries. You call a math library function and it produces a very small error in the result, and that error gets amplified through cancellation and various other things. The end result is crashes. Okay? And we had a lot of fun debugging this, reproducing the same bugs and other aspects of it. Okay. So, this is what got us excited about looking at this problem. Okay? Math libraries are producing wrong results. This is not a new problem. Why isn't anyone producing correct results? Okay? So, now, what is a correct result? I have been talking about wrong results and correct results. What is a correct result in the context of floating point? You have a number line; here I'm showing you the number line. You have infinitely many real numbers. You cannot represent everything with floating point; you can represent only finitely many real numbers, because floating point has finite precision. Here we have the values that are exactly representable in the floating-point format, okay? And now you call an elementary function, a math library function: ln(x), say. It typically produces an irrational value. Most likely, 99% of the time, this value will not be exactly representable in your floating-point representation. So, now what do you do? You need to round the result to a nearby floating-point value. And there is a rounding mode that specifies how you do the rounding; the default is round-to-nearest, ties-to-even. You take your favorite algorithm for approximating ln(x), do it with real numbers, and then do the final rounding to get the floating-point result. What you get is the correctly rounded result, okay? If your library produces the correctly rounded result for all inputs, it's a correctly rounded library. Okay? So, why is this useful? It is useful because it enhances portability. For example, long ago, when the IEEE standard wasn't around, every piece of hardware had its own implementation of floating point, and porting code across multiple processors was just a nightmare, okay?
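To make "correctly rounded" concrete, here is a minimal sketch, assuming Python with the mpmath package as a slow high-precision oracle; the function name is just illustrative, not anyone's production code:

    import math
    from mpmath import mp, mpf, log as mplog

    mp.prec = 200     # work with about 200 bits, far beyond double's 53

    def correctly_rounded_log(x):
        # Evaluate ln(x) with 200-bit precision, then round once to the
        # nearest double.  With this much extra precision the answer is the
        # correctly rounded double for essentially every input (the
        # table-maker's dilemma means a fixed precision is never a hard
        # guarantee).
        return float(mplog(mpf(x)))

    x = 0.7
    print(correctly_rounded_log(x), math.log(x))   # libm agrees on most inputs, not all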
So, by standardizing it, you can make the life of developers easier. So, ideally, we want a correctly rounded library. Unfortunately, the conventional wisdom is that generating correctly rounded libraries is very expensive. That's why the IEEE standard doesn't mandate correct rounding for elementary functions; it mandates it only for the primitive operations: addition, subtraction, multiplication, division, and square root. So, why aren't existing approaches producing the correctly rounded result? Even before we go there, we need to look at how people approximate elementary functions. Okay? This is where you'll get a brief bit of real analysis; you don't need to understand any of it to follow the talk. Everyone approximates elementary functions with polynomials. There is a classic result from real analysis, the Weierstrass approximation theorem. It says that if you have a continuous real-valued function over a small domain, you can approximate it with a polynomial. So everyone approximates the real value of ln(x), as I show on the slide. The approximation is feasible only over a small domain. What is the range of a 32-bit float? Roughly from 2^-149 up to 2^128, and you cannot do polynomial approximation over such a large domain. So the first step in any implementation is range reduction. What is range reduction? You're reducing an input, which belongs to the full domain of roughly [2^-149, 2^128), to a smaller domain; for ln, something like [1, 2), or [1.2, 1.33), or so. Okay? How do you do this? You use properties of logarithms: you reduce the input to the smaller domain, you do the polynomial approximation over that smaller domain for the reduced input, and you eventually get the result for the reduced input. Then what do you need to do? You need to compensate that result to get the result for the original input. This last step is called output compensation. If you forget all the jargon, the thing to remember is: everyone approximates the real value; they do range reduction, they evaluate a polynomial approximation over the smaller domain, and then they do output compensation. All of these steps are done in floating point, okay? So they're all going to have some error. Why do they produce wrong results? I'm going to pictorially represent what happens with the existing approach. Let's say the real value of your elementary function is very close to the rounding boundary. Okay? In the slide, V1 and V2 are two representable values in your floating-point representation. What is the rounding boundary with respect to round-to-nearest-even? It's the midpoint. If the real value is very close to the midpoint, any small error you have will make you produce a wrong result. So the freedom that the polynomial generator has is the gray box that I show on the slide, okay? Now recall what I said: you have to do range reduction, output compensation, and polynomial evaluation, and all of those have floating-point error. Any such error is going to push the value outside the box, eventually resulting in wrong results, okay? So, the key takeaway to remember is that everyone is approximating the real value of f(x). Our key insight: why are we approximating the real value at all? In the RLIBM project, we make the case for directly approximating the correctly rounded result. Why? Because you don't care about the real value; you only care about the correctly rounded result.
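Backing up for a second to ground the range-reduction jargon, here is a toy sketch for log2, assuming Python; libm stands in for the polynomial approximation, which is exactly the piece the rest of the talk is about generating:

    import math

    def log2_via_range_reduction(x):
        # Range reduction: write x = m * 2**e with m in [0.5, 1), so that
        # log2(x) = e + log2(m) and log2 only needs approximating on [0.5, 1).
        m, e = math.frexp(x)
        # Polynomial approximation over the small domain; libm is a stand-in
        # here so the sketch runs.
        poly = math.log2(m)
        # Output compensation: map the result for the reduced input back to x.
        return e + poly

    print(log2_via_range_reduction(1536.0))   # about 10.5849625
    print(math.log2(1536.0))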
Now you're thinking: man, are you fooling us? You said you want to approximate the correctly rounded result, but you only know the correctly rounded result once you know the real value. So how do you know the correctly rounded result? Our key insight is: when you cannot solve a really complex problem, split it into smaller parts and do each part well. So we split the problem of generating efficient implementations into two parts. The first is the problem of generating the oracle, which tells you what the correctly rounded result is for each input. Okay? This can be slow. And, as was alluded to, this is a problem that people have attacked for a very long time, so there are good oracles out there. We build on the shoulders of giants, right? Once we have the oracle, what do we want? We want our library to run fast, right? We want efficient implementations that produce the correct result. So: design efficient implementations by directly targeting the correct result. That is the focus of the RLIBM project: given an oracle, let's generate polynomial approximations that produce this result efficiently. Okay. So, first insight: directly approximate the correctly rounded result. This is where thinking like a computer scientist actually helps. I have a lot of math background, but to do this I had to forget some of it and think like a person who implements stuff. Okay. Now, cool, we want to approximate the correctly rounded result, but that alone doesn't solve the problem. The next insight we had is about what surrounds a correctly rounded result. Let's say we are designing a math library for a 32-bit float. In which representation do we implement the library? We would implement it in double precision. So there is a range of real values, in double precision, around the correctly rounded result: produce any value in that range, and it will round to the correctly rounded result in the float. This is the freedom that you have. Okay? The gray area on the slide is that freedom. There's a range of values; in this case, three floating-point values. What is the rounding boundary between V1 and V2? The midpoint. Between V2 and V3? The midpoint. The region between the two midpoints is the freedom you have: anything in it rounds to the correctly rounded result. This is the freedom our polynomial generator gets. If you recall the earlier figure, this is much larger than whatever you had with the minimax polynomial. This is why our approach generates really fast implementations: more freedom means you can use lower-degree polynomials, which implies performance. Okay? So: directly approximate the correctly rounded result, and there is a range of values around the correctly rounded result such that if you produce any value in it, rounding produces the correctly rounded result. How do we generate such a polynomial? The insight is fine, but what is our methodology? Our methodology is: we have an oracle, and we have the target representation. For a given elementary function and rounding mode, for each input we use the oracle to compute the correctly rounded result, okay? Then we compute the rounding interval around that correctly rounded result; the gray boxes on the slide are those intervals. Once you have the interval, you don't care about the real value anymore; the interval is all you care about. Now you want to generate a polynomial that satisfies all these intervals, okay? What is a polynomial? You know the inputs, right? What we don't know is the coefficients of the polynomial. So the task is to generate these coefficients.
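To make the rounding interval concrete, here is a small sketch, assuming Python with numpy, that computes the interval of doubles around a given correctly rounded float; it ignores tie handling and the ends of the format, which a real generator has to treat carefully:

    import numpy as np

    def rounding_interval(f):
        # Given the correctly rounded float32 result f, the doubles that round
        # to f (nearest, ties-to-even) lie between the midpoints with f's two
        # float32 neighbors.  This interval is the freedom the polynomial
        # generator gets.
        below = np.nextafter(f, np.float32(-np.inf))
        above = np.nextafter(f, np.float32(np.inf))
        lo = (np.float64(below) + np.float64(f)) / 2.0
        hi = (np.float64(f) + np.float64(above)) / 2.0
        return lo, hi

    print(rounding_interval(np.float32(1.3862944)))   # interval around float32(ln 4)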
Okay. So, what does each interval specify? When the polynomial is evaluated at a particular input, its result should be greater than the lower bound and less than the upper bound of that input's interval. It's a constraint. So what we have is a system of linear inequalities, okay, for a given degree. Now, that is another well-studied problem: linear programming. There are very good solvers out there. Encode this as a system of linear inequalities, hand it to an LP solver, and it generates the coefficients. And there you go: you have a polynomial that satisfies all the points. This was the first insight we had, four years ago; it's the topic of the paper where we summarize this. So what do we have? Directly approximate the correctly rounded result, use the interval of values around the correctly rounded result, frame the problem as a linear program, and you can generate correctly rounded implementations. Life would have been amazing if there were only one representation and one rounding mode. But today everyone is designing their own representations, trading off dynamic range and precision. And if you're still not convinced: just two days ago, a consortium of industry researchers released a paper on a floating-point standard for accelerating machine learning workloads. It's a cool paper; look at all the tradeoffs required to accelerate machine learning. So our next question was: can we design a single polynomial approximation that produces correctly rounded results for multiple representations and multiple rounding modes? Why is this attractive? You're a library designer; you don't have to design a library for each representation and each rounding mode, okay? So then you may be thinking: what is the largest representation that anyone uses? Double, right? Right now, for most practical purposes. If I design it once for 64 bits, I'll be done. Let's say I design a correctly rounded library for 64-bit doubles. What do people do for other precisions? They just re-purpose this library for everything. What do you do? Say you have a 32-bit float input. You promote the input to double precision; every 32-bit value is representable in 64 bits. Okay. You call the 64-bit library, you get the output in double, and then you round it back to float, okay? How many of you think this will produce the correct result? Yeah. Anyone? Presumably I'm asking because I have something to say: this is wrong. This doesn't work. That's the negative result right there. Why doesn't it work? It doesn't work because of double rounding error. What is a double rounding error? I'm going to demonstrate it with a cartoonish figure. You have 64-bit doubles and you have 32-bit floats. The most important thing is the bottom row: there are three 32-bit float values there, and a lot more 64-bit values above them; just for exposition, I'm only showing a couple. The marked point is the real value produced by your elementary function. Now, you have a correctly rounded double library, so it produces the correct result from the perspective of a double library. Okay? But if the real value is very close to the midpoint between two floats, the double library rounds it, and the resulting double value sits exactly at that midpoint. If you now take this double value and round it to float, it rounds to one box; but if you look at the real value, it is less than the midpoint, so it should have rounded to the box on the right. So there are two roundings happening: the first when the designer of the elementary function built the double library, and the second when you take the double value and round it to float. And this is very common.
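Here is the same failure in a few lines, assuming Python with numpy, using a hand-picked real value that sits just above a float32 midpoint:

    from fractions import Fraction
    import numpy as np

    # A real value just above the midpoint of two adjacent float32 values
    # (1.0 and 1 + 2**-23); rounding it directly to float32 should give the
    # upper one.
    r = Fraction(1) + Fraction(1, 2**24) + Fraction(1, 2**60)

    as_double  = float(r)                # first rounding: real value -> nearest double
    via_double = np.float32(as_double)   # second rounding: that double -> nearest float
    direct_f32 = np.float32(1.0) + np.float32(2.0**-23)   # the correct float32 result

    print(via_double == direct_f32)      # False: the two-step path lands on 1.0 instead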
So, if you want to generate a single polynomial approximation that works for multiple representations, you need to make double rounding innocuous, okay? So, how do you make double rounding innocuous? This is our next magical result. This is the advantage of having great students, right, who come up with really amazing ideas. You have to maintain certain information about the real value such that the rounding decision is not changed in the presence of double rounding. So, what information do you need to maintain about the real value? There are three pieces of information that you want to maintain: is the value less than the midpoint? Is the value exactly at the midpoint? Is the value greater than the midpoint? All with respect to the target representation. If you maintain these three pieces of information exactly, you can make double rounding innocuous. Okay? This is the key insight. But what is our magical recipe? If you want to create a correctly rounded library for all representations up to N bits, okay, create a correctly rounded polynomial approximation for N+2 bits, two extra bits, with a non-standard rounding mode called round-to-odd, which I will describe in a moment. If you take this result and double round it down to any of those representations, it will produce correct results for all the rounding modes in the standard. Now, there is a lot to parse there. The first thing to remember is: if you want correctly rounded results up to N bits, create a library with 2 additional bits and with the round-to-odd mode. What is the round-to-odd mode? It is a non-standard mode. People have looked at it before; if you read the paper titled "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by Goldberg, there are references to it, in the context of converting a binary fraction to a decimal fraction. Why hasn't anyone used round-to-odd for elementary functions? Because people have been obsessed with real values, and never thought about specific rounding modes for generating libraries. Instead of approximating the correctly rounded result, we can directly approximate the round-to-odd result, and you don't need any additional numerical analysis. Now let's walk through the intuition. What is round-to-odd? It can be explained in two sentences. First, if the value is exactly representable, it rounds to that value, okay? If the value is not exactly representable -- [No audio] -- let me step back and give you a very pedantic example. Let's say we live in a world where we only have integers, okay, and you produce the value 9.6. You need to round. 9.6 is not exactly representable as an integer, so where do you round to? In a typical world you round to 10; with round-to-odd, you round to 9, because 9 is odd. This is the round-to-odd mode. Just to reinforce this: let's say you have the value 10.3. It is not exactly representable, so you round to 11, because 11 is odd. So, round-to-odd: if the value is exactly representable, it rounds to that value; if it's not exactly representable, it rounds to the nearest value that is odd. Okay? So, this is round-to-odd.
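Here is the rule as a couple of lines over plain integers, assuming Python, matching the 9.6 and 10.3 examples; this is just a sketch of the definition, not how a library would implement it:

    import math

    def round_to_odd_int(x):
        # Round-to-odd over the integers: exactly representable values stay
        # put; anything else goes to whichever neighboring integer is odd.
        if x == int(x):
            return int(x)
        lo, hi = math.floor(x), math.ceil(x)
        return lo if lo % 2 != 0 else hi

    print(round_to_odd_int(9.6))    # 9   (9 is odd)
    print(round_to_odd_int(10.3))   # 11  (11 is odd)
    print(round_to_odd_int(10.0))   # 10  (exactly representable, stays even)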
Now, why does round-to-odd make double rounding innocuous? Let me illustrate. Let's say we are designing a correctly rounded library that produces correct results for all representations up to 32 bits. We need to create a polynomial approximation for a 34-bit representation. In the first row, I'm showing the 34-bit values. Between two adjacent 32-bit values there are three additional 34-bit values that are exactly representable. Now, any value that is representable in 32 bits will be even in the 34-bit representation. Okay? So, on the slide: this will be even, this will be odd, this will be even, this will be odd, this will be even, in the 34-bit representation. The key point to remember is that the midpoint between two 32-bit values is even in 34 bits. And any value less than the midpoint rounds, under round-to-odd, to the odd value below it; it rounds to the green value on the slide. Now, if you double round it later on, it will round to the correct floating-point result. Okay? So round-to-odd preserves this information, the three pieces of information that are required to correctly round any value. What are those? Is the value less than the midpoint? Exactly at the midpoint? Greater than the midpoint? The proof is a bit more involved, and I don't want to burden you with the details, but the key theorem that we prove in the paper is: take a real value. To round a real value to any representation, there are four pieces of information that you need; we call them rounding components. And what we show is that the round-to-odd result preserves the exact same rounding components for rounding any real value to any representation. Okay? So the end result is: if you want an approximation that produces correct results up to N bits, create a library that produces correct results for N+2 bits with the round-to-odd mode, and it magically works for all five rounding modes. Now the question is, how do we go about creating this library, right? We use exactly the RLIBM approach. You need an oracle; there was no existing oracle for round-to-odd results, so we built a 34-bit round-to-odd oracle. For each input, we get the correct round-to-odd result and then generate the rounding interval, the odd interval: every value in the interval rounds to the correct round-to-odd result. The only issue you need to worry about here is when the result of the elementary function is exactly representable in your representation and it is even; then its odd interval is a singleton. Handling singletons is challenging. Fortunately, this problem of identifying singletons in the context of elementary functions is well studied. What is a floating-point input? It's a rational value. And if the output is exactly representable in a floating-point representation, it means the output is rational too. So you have an elementary function that takes rational inputs and produces rational outputs, and there are fantastic results from real analysis that you can use. Let me give you an example. Say we are computing log2(x). When x = 2^i for an integer i, log2(x) = i is exactly representable; and when i is even, it will be a singleton. So, we use such results to identify these inputs, handle them separately, and remove such singletons. We are left with the odd intervals, we generate a polynomial for them, and it produces correct results for all inputs, okay?
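As a small illustration of the singleton idea, here is a Python sketch that flags the log2 inputs whose result is exact; the even/odd wrinkle in the talk concerns the last bit in the target representation, and this sketch only detects the exact cases:

    import math

    def log2_result_is_exact(x):
        # For log2, the output is exactly representable precisely when x is a
        # power of two: log2(2**i) = i.  Such inputs get singleton-style
        # intervals, so an implementation peels them off with a cheap check
        # before any polynomial is involved.
        m, _ = math.frexp(x)     # x = m * 2**e with m in [0.5, 1)
        return m == 0.5          # true exactly for powers of two

    print(log2_result_is_exact(8.0), math.log2(8.0))     # True  3.0
    print(log2_result_is_exact(10.0), math.log2(10.0))   # False 3.321928...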
So, what are the takeaways so far? There is one more coming, but the three takeaways until now: directly approximate the correctly rounded result; there is an interval of values around the correctly rounded result; and frame the problem as a linear program. And if you want a single polynomial that works for multiple representations and rounding modes, create a polynomial approximation with two additional bits, rounded with the round-to-odd mode, and it magically works. Okay? So, the last thing I want to talk about is how we actually solve this LP problem. What do we have? We have a single polynomial that works for all these representations. If you have generated it for 32-bit float, that includes Bfloat16 from Google, NVIDIA's TensorFloat32, and many of the representations that people care about. One of the questions I got when I gave this talk a year ago was: this is all great, it's amazingly good, but you're making my Bfloat16 library pay the cost of a 32-bit float library. Can you make it better? Again, this is why you meet people and get feedback: they ask questions and make you think. So, can we do better? This is where we introduced the notion of progressive polynomials. You have a single polynomial that produces correct results with the round-to-odd mode for all inputs and multiple representations. If you only care about a smaller representation, you evaluate just a prefix of the polynomial: a couple of terms produce correct results for Bfloat16, a few more terms produce correct results for TensorFloat32, and the entire polynomial produces correct results for everything. Okay? How do you do this? Just truncate the evaluation: discard the remaining terms. What it costs is more constraints in the LP formulation. The entire polynomial has to produce correct results with the round-to-odd mode, and if you also want correct results for Bfloat16 from the truncated prefix, the constraints for those inputs must not include the higher-order terms. A polynomial with this property is a progressive polynomial, okay? The only challenge now is that you have more constraints. With 4 billion inputs you already have 4 billion constraints, and these add more on top of that. Now you want to solve an LP problem with billions of constraints. How do we do this? We had been relying on LP solvers to solve these; LP solvers are well engineered, but they can't handle billions of constraints. They time out. The linear program that we generate, though, has small dimension. Even though we have 4 billion constraints, how many terms are in the polynomial? Very few. We are not looking for a billion-degree polynomial, right? We are looking for a polynomial of degree 5 or 6. The number of terms is pretty small, so our linear program is a linear program of small dimension, okay? So, we have an efficient algorithm to solve it. What is the key idea? If your linear program with K unknowns is full rank, okay, then there exist K critical constraints such that a solution satisfying them satisfies all of the constraints. If you just identify them, solving those solves the entire problem, okay? So, now, how do you identify these K constraints out of 4 billion constraints, right? This is where we employ a randomized algorithm inspired by a Clarkson result from '95 and its weighted-sampling method. Okay, I'll give you a sketch of the randomized algorithm. We have all of these 4 billion constraints, and we are going to do sampling. We assign a weight to each of these constraints. Okay? Initially, all the constraints have weight one. So, the first step is just uniformly sampling some constraints. How many constraints do we sample? If the polynomial has K terms, we sample 6K squared constraints; with 5 terms, that's 6 times 25, 150 constraints. And any modern LP solver can solve a problem of that size easily.
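Here is a minimal sketch of that inner step, assuming Python with scipy: hand a sampled set of interval constraints to an off-the-shelf LP solver and read the coefficients back. The intervals of width 1e-6 around ln(x) are just a stand-in for the oracle's rounding intervals, and there is no range reduction or weighted re-sampling in this toy:

    import numpy as np
    from scipy.optimize import linprog

    def fit_polynomial(samples, degree):
        # samples: (x, lo, hi) triples; find coefficients c with
        # lo <= c0 + c1*x + ... + c_d*x**degree <= hi at every sampled x.
        # It is a pure feasibility problem, so the LP objective is all zeros.
        A_ub, b_ub = [], []
        for x, lo, hi in samples:
            powers = [x**i for i in range(degree + 1)]
            A_ub.append(powers)
            b_ub.append(hi)                    # p(x) <= hi
            A_ub.append([-p for p in powers])
            b_ub.append(-lo)                   # p(x) >= lo
        res = linprog(np.zeros(degree + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (degree + 1))
        return res.x if res.success else None

    xs = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
    samples = [(x, np.log(x) - 1e-6, np.log(x) + 1e-6) for x in xs]
    print(fit_polynomial(samples, degree=5))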
We generate a polynomial from that sample. The sampled constraints are satisfied by the generated polynomial, but it may not satisfy many other constraints in the universe. So, we evaluate it and test, and it turns out the red constraints on the slide are not satisfied by the sampled polynomial. What do we do? We double their weights. What is the goal? In the next round of sampling, we want to increase the probability of picking them. So, we do weighted random sampling, okay? We discard the old sample and repeat the process. Now that we have doubled the weights, we re-sample, and some of the constraints whose weights were doubled are now in the sample. Okay? We repeat the process and check: the new polynomial violates some other points, so we double their weights again, do weighted random sampling, pick a new sample, and repeat. Eventually, the critical constraints end up in your sample and the generated polynomial satisfies the entire universe. In our paper, we provide a proof that this randomized algorithm terminates in roughly 6K log n iterations in expectation; that is the worst case. Designing a randomized algorithm is easy, right? Proving the bounds is the hard part. But still, the proof is doable. And the end of the story is that, in practice, this is much more effective: you don't need anywhere near that many iterations. So, what do we have? We now know the insights for generating correct libraries, and we know how to solve the resulting problems efficiently. These are all nice stories that I have been telling you for the last 30 minutes, but at the end of the day, what do you care about? Does it work? Okay. We built the collection of ten functions on this slide. But this is Papers We Love, right? I'm just presenting the results from the four papers. The project has about 19 functions right now. A single polynomial produces correct results for 805 different configurations, okay, and for all inputs. Just to give you a flavor of how mainstream libraries fare: these are libraries that have been optimized for decades, not just a couple of years, and they produce wrong results. Okay? Even CRLIBM, a correctly rounded double-precision library from INRIA, produces wrong results when re-purposed for other representations and rounding modes. So, the moral of the story is: RLIBM functions are correct. The next question is: what are we paying in terms of performance? On this slide I show the performance speedup of RLIBM functions over the baseline, which is the Intel math library, amazingly well optimized by people I respect. That library has two modes: you can approximate the float functions using float arithmetic, or you can approximate the float functions using double precision. Approximating the float functions using double precision reduces the number of wrong results; not correct for all inputs, but fewer wrong results. Compared to Intel's float functions, we are about 30% faster on average, and compared to the double-precision versions, about 50% faster. These slides are one year old now; we are talking about published papers, right, so there's a lag of a year. If you go and download the library from my GitHub and test it, the results are now more like 2x. I encourage you to check. We are faster. The moral of the story: we are not only correct, but also fast. That's the idea. And it's four of us building this library; I don't have an army of engineers.
In contrast, many production companies have teams of 30 or 40 people building these libraries. Imagine how much faster we could go if we had a team of that size. Now, the last thing I want to show is how we compare to CRLIBM: we are two times faster across all functions. All this is great. We have a small team, and we have been pushing these functions into the mainstream, collaborating with LLVM developers and pushing into the new libm that's being developed. Five of the functions have been implemented and are correctly rounded for all rounding modes. Paul Zimmermann has been helping us out in this effort. So, I think, if we are successful, for the next standard, which is supposed to come out in 2029, we want to give the IEEE standards committee enough evidence to mandate correct rounding, at least for 32 bits. Okay. So, to conclude, what are the four takeaways? One, directly approximate the correctly rounded result. Two, there is an interval of real values around the correctly rounded result; produce any value in it and you're good. Three, frame the problem as a linear program. And four, if you want a single polynomial that works for multiple representations and multiple rounding modes, create a correctly rounded implementation for the representation with two additional bits, using the round-to-odd mode, and it magically works for all representations. So, again, the goal of this talk is to encourage a lot of you to contribute to this area, and let's make our systems more robust. I want to thank the National Science Foundation for funding really crazy ideas and enabling us to explore this space. And thank you for your attention. [ Applause ] ZEESHAN: Thank you, Santosh. It's amazing. Four people doing work that whole companies haven't been able to do. It's amazing. He's around at Strange Loop as well. This is awesome, awesome work; I want people to take a look at it. And we're going to break until 3:20 for our next talk by Madeline, 18 minutes or so. So, we'll see you on the flip side.