Everybody ready to get started?
Yeah, so I'm Chris Thompson representing GDG Cloud Portland. And as a machine learning engineer, my goal is always to design automated, scalable machine learning pipelines. This past year, Kubeflow 1.0 was released to address some of the challenges associated with that. So our next speaker, Karl Weinmeister, will talk about how you can implement Kubeflow Pipelines, which is a component of Kubeflow, to improve scalability and reproducibility in the machine learning process. So with that, I'll leave it to Karl.
Thank you, Chris. Appreciate it. And it was great to start with Hannes' talk this afternoon. I think he really set the stage with TFX, the TensorFlow Extended set of components for machine learning. What we're going to talk about today is Kubeflow Pipelines, a runner that has a whole set of infrastructure around running your pipelines, tracking your experiments, and things like that. So let's go ahead and dive in. I'm going to share my screen, and we'll show a presentation and then a demo of Kubeflow Pipelines. So here we go. All right. So we're going to talk about building a reproducible ML workflow. That's really key here. When you're doing data science, as we all know, it's all about running experiments, trying new things. Eventually, though, you get to a certain point where you need to put things into production, and you need to make sure it works the same every time: that you've codified that complex set of steps from ingesting data to building models, pushing them into production, et cetera. And so we're going to go through how Kubeflow Pipelines can help you with that. And we'll see how the TFX components, which Hannes just spoke about, can fit into that. So we'll start by talking about the challenges. I think we just covered some of the challenges in the previous talk, so I'll be brief about that. We'll talk about Kubeflow as a whole, then really dive into the pipelines piece, and then show a demo.
So as we all know, when we're starting a project, getting going can be relatively easy, but getting to that right level of accuracy can often be very difficult. Many machine learning projects fail even in the early stages, in terms of getting the right data to get the right result. But once you have that result, it gets even harder: you need to make sure that your model, which needs to respond to changing data, changing requirements, updates, all kinds of things, remains something that your users trust. There's a whole new set of issues that come into play that make it harder. Some of the issues are things like a lack of continuous monitoring. You create your model, and we talked a little bit about data drift in the past talk: the data coming in changes. And if your model is trained on a particular corpus of data and now you're seeing different data come in, it's very likely that your accuracy starts to fall. And it's important that you have some way of knowing what that threshold is, where you want to get a notification and start looking into it before it becomes an emergency. Another issue you might see is something called training-serving skew. This is often the case where the code that you use for training is different from the code that you use in your production system, where the model is being used, in terms of processing the inputs and maybe doing some transformation steps.
It's very possible that maybe you have some code that you push to one of the systems and not the other, or there's a variety of different scenarios like that that can pop up. And what happens is that's another way of having issues in your production system. Training-serving skew is something to look for. Another key aspect of your requirements is knowing what freshness of the model you need. There's really a wide spectrum: if you're dealing with, here's one extreme example, news data, or maybe you're working with an e-commerce site where the products are changing very often, you might need to rebuild models frequently. On the other hand, there are certain model types that are very stable, where it's needed less often. So it's important, coming into designing your pipeline, to have a perspective on how often you'll need to create new models to meet the changing data and the accuracy that you need. OK, so let's now talk about Kubeflow and how this can help. And the mission statement for Kubeflow here is in terms of making it easy for everyone, from data scientists who aren't necessarily infrastructure experts to machine learning engineers who maybe want to work more on the platform side and building the infrastructure, and aren't necessarily working on the models.
So a couple of different stakeholders. And it's about the full lifecycle: development, deployment and management. So everything you need to do, end to end, in machine learning. There's a focus on portability: since you're building on top of a Kubernetes-based infrastructure, the pipelines that you build can work in various different places, whether it's cloud, on-prem, et cetera. And finally, because you're using Kubernetes, it has this fantastic way of distributing workloads across a cluster with pods that you can define, controlling how much you want to scale up or scale down, with monitoring tools and so on. And often you see in a larger enterprise that the enterprise is using Kubernetes for software and application development. So it's kind of nice to have one stack for DevOps, where you're doing both development and data science on it: a smaller learning curve, fewer moving parts. And so Kubeflow is about bringing ML and Kubernetes together, with some nice benefits for running your machine learning workflow. We're going to, again, focus on pipelines today, but it helps to understand what that Kubeflow foundation is. All the things that go into Kubeflow you see on this slide. Let's start clockwise from the top: Jupyter notebooks for development of your training code in Python. Then there are the training operators. This is where you actually do the training: you instruct it what kind of GPUs you want to use, what kind of framework you're using, and it will run the training job on the cluster there.
Then you see workflow building and pipelines, where we're going to focus today, and data management. There's also some capability around versioning and sharing the data; that's a key part of reproducibility as well. Moving on to tools: there's also a component of Kubeflow called Katib that does your hyperparameter tuning, and TensorBoard to do some debugging on your TensorFlow model and look at it in more detail. And then there are the Kubernetes tools for logging, monitoring, things like that. One of the newer capabilities that was added to Kubeflow is metadata management. There are APIs so that as you build a model as part of your pipeline, you can add information: standard information like when was it built, what framework was used, et cetera, plus custom metadata that you can add to know more about the asset that was being built. And finally, the serving piece: once you've built your model, why not use the Kubernetes cluster to serve it as well, in the same infrastructure? So we'll talk about that briefly. So Kubeflow is built inside of a Kubernetes cluster. Traffic coming in can either go to the central dashboard, a nice UI where you can see the pipelines, notebooks, TensorBoard, all kinds of different things, or to the operators, and I'm going to show how you can kick off training jobs with those.
And then there's the serving component, where I'll talk briefly about the APIs; then we'll get into pipelines. OK. So if you want to get started with Kubeflow, if you never have before, and you're familiar with Kubernetes, you'll find that the commands and concepts are similar. In the Kubernetes world, there's a utility called kubectl that's used for management of your Kubernetes cluster. There's kfctl for Kubeflow control. And you see here a few examples from the GitHub repo of different YAML files, which define the services that I spoke about on the previous couple of slides, around the Jupyter server, serving, training, et cetera. This is what lays down the cluster and the necessary services for various different platforms, and this is what it takes to install. And once you have Kubeflow going, I'll walk through a couple of the capabilities. You have the notebook server, so you can create various servers and then launch JupyterHub from that for training. You can see there are a variety of different YAML files; if you go to the Kubeflow GitHub repo, you'll see examples. Here is a PyTorch example on the left side. And what you see is where you would define the name of your PyTorch job, and you would point it to one of the standard container images with the training code, with maybe some arguments that go into that container.
And then you see you can specify more details around things like using a GPU or CPU within that YAML file to launch a training job. Here you're actually using the Kubernetes command, kubectl, to go ahead and launch that operation, put that training code into the cluster, and start the pods with the Python code that's going to run the training. And then if you want to check the status, there's a Kubernetes command that looks at the jobs running on the cluster, and maybe filters by a certain label, to see what's happening there. And you can observe the training happen. For serving, basically there's a framework called KFServing, where there are two different APIs, predict and explain, for your common use cases. Hannes talked briefly about this concept of canary and sort of regular endpoints. So it supports that capability, where you can have a standard endpoint that maybe gets 90-plus percent of your traffic. Then maybe you're piloting a new model and you don't want to quite switch everybody over to it yet, but you want to test it. That's where you set up a canary endpoint. And so when you use KFServing, there is infrastructure available for that. And then whatever endpoint you pick, there's an ability to transform the data coming into your endpoint. Then you could potentially have some code to provide an explanation, if you have some interpretability code, and then you have the ability to actually use the model to create a prediction and return the result back to the user.
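To make that concrete, here's a minimal sketch of what calling a served model can look like, assuming KFServing's v1 prediction protocol; the host, model name, and feature values below are hypothetical.

```python
# Minimal sketch of calling a KFServing predict endpoint (v1 prediction
# protocol). The host, model name, and feature values are hypothetical.
import requests

url = "http://taxi-tips.example.com/v1/models/taxi-tips:predict"
payload = {"instances": [[12.5, 3, 1.0]]}  # one feature vector per instance

response = requests.post(url, json=payload)
print(response.json())  # e.g. {"predictions": [0.87]}
```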
And so this is how you can create a REST endpoint in your cluster to accept input and serve your model in a scalable way, taking advantage of Knative's autoprovisioning and things like that. OK, so now you've heard about the foundation of Kubeflow. Let's dive a little deeper into pipelines. So a pipeline is a set of steps that perform that machine learning workflow. Kubeflow Pipelines is specifically built for machine learning workflows. It can handle other general-purpose workflows, but it's really aimed at solving the problems machine learning engineers have. So let's walk through an example of using the Kubeflow Pipelines SDK to build a pipeline. What you would do is define your pipeline using these decorators: you have your Python code, and then you can annotate the code with a few keywords that specify things like, I'm about to tell you that this is a pipeline, maybe add some metadata around it, here's the name and description of the pipeline. And then you would have a method whose signature defines the parameters coming into your pipeline. Here we have some examples: maybe each time the pipeline is run, you have some options like how many training steps, or maybe the data coming into the pipeline.
And you can specify defaults so that each time it's run, you don't have to enter seven different things if you don't want to, but you have the ability to override what's going into the pipeline. And then you define the steps in the pipeline by importing the components package. What you're doing, to basically get access to a component, is use the load_component_from_url method in the SDK, which points to a YAML file; then you'll essentially have a definition of the component. You haven't instantiated it yet, but you kind of have access to the class. So the next step is you instantiate it; here, for instance, we have a copy component with the source and target directory. The next step you see here is how we would show dependencies between the components. So say we've got two components: how does it know that one comes after the other? Here, we're doing it explicitly. There are methods like .after() where you can explicitly specify the order. However, pipelines are smart enough to know that if, say, an input parameter of one component is coming from the output parameter of another, it will figure out the order of operations, and you don't need to do this.
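Pulling those pieces together, here's a minimal sketch of what that looks like in code, assuming the KFP v1 SDK; the component URL, parameter names, and paths are hypothetical.

```python
import kfp.components as comp
from kfp import dsl

# Load a component definition from its YAML descriptor (URL is hypothetical).
copy_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/example/components/copy/component.yaml')

@dsl.pipeline(
    name='Taxi tip prediction',
    description='An example pipeline with overridable parameters.')
def taxi_pipeline(train_steps: int = 1000,
                  data_path: str = 'gs://example-bucket/taxi-data'):
    # Instantiating a loaded component creates one step in the pipeline graph.
    stage_data = copy_op(source='gs://example-bucket/raw', target=data_path)
    publish = copy_op(source=data_path, target='gs://example-bucket/out')
    # No data dependency links these two steps, so order them explicitly.
    publish.after(stage_data)
```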
So generally you can just define your components, and assuming the information flows from one to the next, you don't have to do that. But in some cases, you might want to be more specific about parallelizing things, and that's where you use some of these different conditions and control constructs like that. OK, so you've created your pipeline; now how do you deploy it? There is a compiler package, and you compile that Python code into a .tar.gz file. Then you can programmatically upload it, or, as we're seeing here in the UI, you upload the file or maybe point to a URL, and then you'll see it in the dashboard. When it imports, what you see on the right side is not an actual run, but the graph of the different components and what the user would see during a full pipeline run. And so really, what I hope comes across here is that you're able to define your pipeline in Python, in the same language that the data scientists are using, and with a couple of small steps to compile it, you have it in a format that can be executed in the Pipelines UI. And there are lots of reusable components already, so you see some of these different popular operations: extracting data from a database, transforming the data, and so on and so forth.
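Circling back to that deployment step for a second, here's a minimal sketch of the compile-and-upload path, assuming the KFP v1 SDK; the endpoint host and names are hypothetical, and `taxi_pipeline` is the pipeline function from the earlier sketch.

```python
import kfp
from kfp import compiler

# Compile the pipeline function into an archive the Pipelines UI can import.
compiler.Compiler().compile(taxi_pipeline, 'taxi_pipeline.tar.gz')

# Or skip the UI and upload programmatically (the host is hypothetical).
client = kfp.Client(host='https://my-kfp-endpoint.example.com')
client.upload_pipeline('taxi_pipeline.tar.gz',
                       pipeline_name='taxi-tip-prediction')
```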
Even for the components I just talked about: if you go to GitHub, inside of the pipelines repo there's a components folder, and there you'll see a nice set of arguments and what each component does, so that you can pull it into the pipeline that you've built. So now we're going to talk a little bit about how Kubeflow and TFX can work nicely together. If you're using the Kubeflow Pipelines SDK that I just talked about, you can pull in any of the TFX components; there are Kubeflow SDK components that interface with the TFX components, so it's very easy to pull them in. There is also a TFX SDK that runs Kubeflow pipelines, and that's new. That's something that I want to talk about more in the rest of the presentation. So you have two different approaches: the traditional Kubeflow Pipelines approach, where you can use Kubeflow components and TFX components, and now a TFX-centric SDK. That's what you're seeing here on this slide: there are two different approaches coming in on the left side, from the top, either the TFX SDK or the Kubeflow Pipelines SDK that we talked about. So depending on, maybe, if you're coming more from a TensorFlow perspective, where that's what your models are in, you're familiar with TFX, and you really want to embrace that ecosystem, you can use that directly.
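For that TFX-centric route, here's a minimal sketch of handing a TFX pipeline to Kubeflow Pipelines, assuming the TFX SDK's Kubeflow runner from that era; `my_tfx_pipeline` is a hypothetical, already-defined TFX pipeline object.

```python
# Minimal sketch: orchestrate a TFX pipeline on Kubeflow Pipelines.
# Assumes the TFX SDK's Kubeflow runner; `my_tfx_pipeline` is a hypothetical
# tfx.orchestration.pipeline.Pipeline defined elsewhere.
from tfx.orchestration.kubeflow import kubeflow_dag_runner

# Produces a pipeline package that can be uploaded to the Pipelines UI.
kubeflow_dag_runner.KubeflowDagRunner().run(my_tfx_pipeline)
```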
Maybe if you're doing more custom components or using some other frameworks, you'd want to use the Kubeflow Pipelines SDK. You know, there are a couple of different ways to do things. So now let's talk about how to build your own custom component. We talked about building a pipeline and the components that you could pull in off the shelf. Now, what if you wanted to build your own? How hard is that? Let's look at that. I'm going to show you three slides which walk through an example of training. We're not really getting deep into the training itself, but more showing the structure. OK, so on the first slide here, we're showing the YAML file. This is how you describe your component. This is where you name it, you say what the inputs are, what the outputs are. And then a key link here is to the container image: where is the code, or rather, where is the Docker container with the code that will run as this component? So that's all this is. It's a descriptor, in YAML format. And this is, if you recall back a few slides ago, what we pulled from when we were trying to access a component: it would access this descriptor file. And then the next part is the Dockerfile, where you define the image. So here's where you might base it off of a base image.
Let's say TensorFlow. You maybe update it, add a couple more packages to the base image, and then you define what the entry point is into the image, or what is going to be executed inside of that Docker container: maybe some model training code, in this case. And then finally, let's look at the model training code itself. This is kind of where everything comes together. So you have a main method, and you would pass those different arguments. These all match up to the inputs to that component: where is your data, the number of training steps, all those kinds of things. And then you have your code that does whatever you need it to do, in this file or set of files. So that's it as far as building a Kubeflow component that goes into a pipeline. All right, so to summarize what we saw: Kubeflow is a cloud-native, multi-cloud solution for ML. It allows for composable, portable, scalable ML pipelines. And if you have a Kubernetes cluster, you can run Kubeflow. I will also say, if you aren't super familiar with Kubernetes, there are some utilities that allow you to run a local instance of Kubeflow to get started, some kind of smaller setup like MiniKF. So it's very easy to get started with all the tools that have been built by the ecosystem.
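To tie that custom-component walkthrough together, here's a minimal sketch of what the entry-point training script could look like; the argument names are hypothetical and would need to match the inputs declared in the component's YAML descriptor.

```python
# train.py: a minimal sketch of the script the Dockerfile's entry point runs.
# The argument names are hypothetical; they must match the component.yaml inputs.
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-path', type=str, required=True)
    parser.add_argument('--train-steps', type=int, default=1000)
    args = parser.parse_args()
    # ... load data from args.data_path, train for args.train_steps steps,
    # and write the model to whatever output location the component declares ...
    print(f'Training for {args.train_steps} steps on {args.data_path}')

if __name__ == '__main__':
    main()
```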
OK, all right. So now let's spend a few minutes and go over a demo here. All right. So what I'm going to show you is Kubeflow Pipelines implemented in Google Cloud, right? There's lots of different ways to run it; what I'm showing is a standalone implementation of Kubeflow Pipelines, where it's just the pipelines piece, not the whole Kubeflow platform. It's very easy to get started here; that's why I'm demoing it this way. I simply click new instance, and then I'll be able to create the cluster with Kubeflow Pipelines on it. I can specify a couple of different options and click deploy, and it will set it all up for me. So it's very convenient. I already have this going, and it tells me the name of my cluster, the version of KFP that's running, et cetera. So let's go ahead and click that link to open the dashboard and walk through a few things. So here are the pipelines. And if you recall where I was mentioning before about uploading the pipeline, here's where you would do it. You can see that there are a few existing pipelines already here. For today's demo, I want to show you the TFX taxi prediction model. So this is a classification model. It uses all the TFX components, and it looks at a bunch of taxi data and tries to predict: is the taxi driver going to be tipped 20 percent or more? A binary classification problem. Here's what the graph looks like.
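For reference, here's a hedged sketch of how the first few TFX components in a pipeline like this get wired together, assuming a recent TFX Python SDK; the input path is hypothetical.

```python
# Sketch of the data-ingestion and validation stages of a TFX taxi pipeline.
# Assumes a recent TFX SDK; the input path is hypothetical.
from tfx.components import (CsvExampleGen, StatisticsGen, SchemaGen,
                            ExampleValidator)

# Pull in CSV taxi data and produce train/eval example splits.
example_gen = CsvExampleGen(input_base='gs://example-bucket/taxi-csv')

# Compute per-feature statistics (counts, missing %, mean, variance, ...).
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

# Infer a schema (feature names, types, categorical domains) from the stats.
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])

# Check incoming data against the schema and surface anomalies.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
```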
I'm going to briefly change gears and show you how to find out more about this demo if you want to try it yourself. On tensorflow.org, under the TFX tutorials, you're going to find a demo on TFX and Cloud Pipelines working together, and there you'll be able to run this demo. OK, so let's go ahead and dive a little deeper here. So here is the pipeline. Let's go and look at experiments. Experiments are where we have a list of all the runs of the pipeline, right? So the pipeline is kind of like the class, and the experiment holds the runs of the pipeline. So let's dive down into it. And you can see I've already run it, just for the purposes of today's demo; always good to run things in advance. Everything ran smoothly. I think this will also give you another perspective on what we saw in Hannes' talk, where he showed you all the different components. Here's another example. What this does is, first, it uses the ExampleGen component, and what that's doing is pulling from CSV files and doing the train/test split and all that. And so you can see, with each of the different steps in the pipeline, Kubeflow will show you the artifacts produced.
It shows what the inputs and outputs to each step were, even things like the log from the Docker container that was executed. And so that's all here. Then let's look at the next step, OK? Once we've done our train/test split, here's the StatisticsGen component. This is pretty cool, because this component shows you, let me go a little more full screen on this, your standard statistics for all the different features coming in: how many were there, what percent was missing, what's the mean and variance; understanding the data that's being used for your model. The next thing is SchemaGen. This is another neat component that allows you to look at the training data and infer what the schema should be. So it tells you things like, OK, there's a company field, it's of type string, so on and so forth; categorical variables, et cetera. All that information is here, and it's going to create a schema out of that. Then here we're using a component called ExampleValidator, and this works together with the schema component and your test data: it basically takes your schema and says, OK, let's put some new data against it and see how well it holds up. And so what it does is generate a report here, and you see some anomalies. So it looks like there were a couple of new cab companies that weren't in the training data, which makes sense.
So this is where you might say, well, my schema was a little strict; I don't actually want to say these are the only cab companies. And here's a case where, I guess, a type of credit card wasn't in the training data and came through in the test data, so it might mean a revision to the schema. So this is something where, back to the concept of training-serving skew or drift, you can kind of monitor by looking at the ExampleValidator component. Then there's transformation, and there's training, where, let's see, if we go into the logs, you'd see things like TensorFlow reporting the loss and all that. You can see within this as it's training, and as you're running in real time, you'll see this unfold; you'll see green or red depending on whether it's successful. So it's kind of fun to watch at times. And then there's the Evaluator, which uses TensorFlow Model Analysis. What that does for you is it not only gives you your standard accuracy statistics, but, and I think this is an important feature, the ability to look at different slices of your data. So you want to look at your accuracy in terms of how it affects all of your users.
Right. Not just on the whole, but making sure that if we slice the data different ways, we see how folks are getting affected by that. So here, for the taxicab example, what's happening is it's looking at the taxi trips that start at 1:00 in the morning versus 9:00 in the morning, et cetera, and what the accuracy is for each. So you could see here: hmm, maybe because I have fewer trips at 1:00 in the morning, I have fewer training samples, and my accuracy is a little bit lower. Maybe I need to collect a few more examples there. And it does actually show you that, so my theory was correct here; we're kind of validating that. It just gives you a lot of great information about your training data and the accuracy across these different features. And this all comes out of the box when you're in TensorFlow Extended. OK, I think I'm running a little short on time, but as Hannes mentioned, there's this concept of pushing your model to production once you reach a certain accuracy level, which the Evaluator will tell you about. All right. So just to summarize: what we've talked about is using Kubeflow Pipelines and how to use TFX with it. I don't think I have time for questions, but feel free to ping me, on Twitter (@kweinmeister) or LinkedIn, if you have any questions. Thank you.
Thanks a lot, Karl. That's awesome. Check out Kubeflow Pipelines; they also have great documentation if you have any other technical questions or anything, but definitely, Karl is one of the experts, so this is a great opportunity to ask any questions you might have about that. But thanks, Karl. Really appreciate it.
I know we're running out of time, but I think we can ask a few questions. I actually put it on the screen. I don't know if you...
Yeah, a couple of them got answered, I think. One of them, though, was asking about the demo. Obviously we saw it; how can we access that? Are there any other demos that you would recommend, Karl, besides the taxicab demo? And can they find those in the same place?
Yeah, so if they go to the Kubeflow Pipelines repo, there's an examples folder that has a lot of great examples. I think the taxicab example is the best one if you want to learn how TFX is used. But, I know that Kubeflow has a lot to learn. So sometimes people look at the very basic component tutorials, like, hey, I just want to know the bare minimum: if I want to create a new container out of a few lines of code, how do I do that? So what I do think is helpful, too, are those simple tutorials that give you little tiny pieces of Kubeflow and explain them bit by bit.
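For a sense of what that bare-minimum case can look like, here's a minimal sketch of a lightweight component built straight from a Python function, assuming the KFP v1 SDK; the function itself is just a toy.

```python
# Minimal sketch of a "lightweight" component, assuming the KFP v1 SDK:
# a plain Python function turned into a containerized pipeline step.
import kfp.components as comp

def add(a: float, b: float) -> float:
    """A toy step: add two numbers."""
    return a + b

# Wrap the function so it can be used as a step inside a pipeline.
add_op = comp.func_to_container_op(add)
```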
So, yeah, I found when I was learning, it was a big, big undertaking. There's a lot to it. It's really, really useful.
As you get deeper, you'll find so many cool little things that aren't totally obvious right off the bat. So definitely play with it; I think that is the best course of action for learning. Cool. Like I said, most of the questions got answered, but I encourage people to go play with Kubeflow Pipelines. I've been playing with it for a few months, and I really, really like it, so check it out. Cool. Cool. Thank you so much. Thank you. Thank you.
Thank you.