
Merlin project templates

We dive into the architecture, working with the platform, and a product use case. Our new machine learning platform is based on an open source stack and technologies. Using open source tooling end-to-end was important to us because we wanted to both draw from and contribute to the most up-to-date technologies and their communities, as well as retain the agility to evolve the platform to our users' needs.

Merlin's objective is to enable Shopify's teams to train, test, deploy, serve, and monitor machine learning models efficiently and quickly. To get there, we focused on:

- Scalability: robust infrastructure that can scale up our machine learning workflows
- Fast Iterations: tools that reduce friction and increase productivity for our data scientists and machine learning engineers by minimizing the gap between prototyping and production
- Flexibility: users can use any libraries or packages they need for their models

For the first iteration of Merlin, we focused on enabling training and batch inference on the platform.

Merlin Architecture

[A high level diagram of Merlin's architecture]

Merlin gives our users the tools to run their machine learning workflows. Typically, large scale data modeling and processing at Shopify happens in other parts of our data platform, using tools such as Spark. The data and features are then saved to our data lake or to Pano, our feature store. Merlin uses these features and datasets as inputs to the machine learning tasks it runs, such as preprocessing, training, and batch inference.

With Merlin, each use case runs in a dedicated environment defined by its tasks, dependencies, and required resources; we call these environments Merlin Workspaces. These dedicated environments also enable distributed computing and scalability for the machine learning tasks that run on them. Behind the scenes, Merlin Workspaces are Ray clusters that we deploy on our Kubernetes cluster, and they are designed to be short lived for batch jobs, since processing only happens for a certain amount of time. We built the Merlin API as a consolidated service that allows the creation of Merlin Workspaces on demand. Our users can then use their Merlin Workspace from Jupyter Notebooks to prototype their work, or orchestrate it through Airflow or Oozie.

Merlin's architecture, and Merlin Workspaces in particular, are enabled by one of our core components: Ray. Ray is an open source framework that provides a simple, universal API for building distributed systems and tools to parallelize machine learning workflows. It has a large ecosystem of applications, libraries, and tools dedicated to machine learning, such as distributed scikit-learn, XGBoost, TensorFlow, and PyTorch.
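As a rough sketch of what working against a Merlin Workspace can look like, the snippet below connects to a remote Ray cluster with Ray Client and fans a small task out across its workers. The host, port, and preprocess function are illustrative placeholders, not Merlin's actual API.

```python
import ray

# Illustrative placeholders: in practice the Ray head address would come
# from the Merlin Workspace created through the Merlin API.
cluster_host = "example-merlin-workspace.internal"
cluster_port = 10001

# Connect a notebook (or an orchestrated job) to the remote Ray cluster
# via Ray Client.
ray.init(f"ray://{cluster_host}:{cluster_port}")

@ray.remote
def preprocess(shard):
    # Stand-in for a real preprocessing step that runs on a cluster worker.
    return len(shard)

# Fan the shards out across the cluster and gather the results.
shards = [["a"], ["b", "c"], ["d", "e", "f"]]
print(ray.get([preprocess.remote(s) for s in shards]))
```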


When using Ray, you get a cluster that enables you to distribute your computation across multiple CPUs and machines. In the following example, we train a model using Ray:
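The sketch below shows one way to do this with Ray's joblib backend, which distributes scikit-learn's internal parallelism across the cluster. The dataset, model, and hyperparameter grid are illustrative, and ray.init(address="auto") assumes the code is already running inside a workspace's Ray cluster.

```python
import ray
from ray.util.joblib import register_ray

import joblib
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Attach to the existing Ray cluster (assumed to be the one backing the
# Merlin Workspace this code runs in).
ray.init(address="auto")

# Register Ray as a joblib backend so scikit-learn can distribute its
# parallel work across the cluster's workers.
register_ray()

digits = datasets.load_digits()
param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01], "kernel": ["rbf"]}
search = GridSearchCV(svm.SVC(), param_grid, n_jobs=-1)

# Run the grid search with Ray scheduling the individual fits.
with joblib.parallel_backend("ray"):
    search.fit(digits.data, digits.target)

print(search.best_params_)
```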









