Machine learning explores the study and construction of algorithms that learn and make predictions based on data. In the field of machine learning, data scientists, who specialize in analyzing data, are responsible for writing and modifying such algorithms.
Initially, a data scientist writes an algorithm based on a set of data features. This is generally an iterative process in which the data scientist explores different algorithms for predictive purposes. In this process, the amount of data and the number of features chosen for analysis may change. The data itself can take many forms: sparse or dense, compressed or uncompressed. Once the data and its analysis no longer fit on a single machine, operations are typically scaled to a cluster of machines. In summary, analysis is an iterative process over changes in the feature set and changes in the amount and type of data, which leads to the customization of algorithms. We can consider this to be “domain specific analytics.”
Analysis using a single machine drifts to a Big Data problem
Generally, a data scientist writes an algorithm in R, Python or another scripting language using a data set small enough to fit on a single machine. At a certain point, the algorithm works for a given situation.
How does an organization solve a Big Data Machine Learning problem?
Generally, in an organization, a data scientist writes an algorithm using a small data set that fits on a single machine and gets that “prototype” working. Then a systems programmer who is an expert in clustered environments gets involved to run the algorithm on a cluster with larger-scale data. The two iterate, adjusting the algorithm and communicating continuously, until they are satisfied with the outcome. Many organizations make this model work reasonably well, but it comes with quite a few challenges. The data scientist writes the algorithm in R, Python or another scripting language, and that script then needs to run efficiently and effectively on target platforms such as Apache Spark or Hadoop, which is not a trivial job.
Challenges in analyzing vertically across the business?
As noted earlier, this is “domain specific analytics.” What happens if the data scientist needs to analyze the data vertically across the business? An organization may have expertise in a particular business, say a “Car Selling” or “Car Manufacturing” business. Do we expect a person who specializes in a particular business to be an analyst as well? Probably not, since it’s difficult to find a person with such a set of skills. We see at least four types of issues in such a situation:
- It’s difficult to hire a person with multi-dimensional skills.
- The algorithm has to be written twice: first for a single machine with small data and then for a cluster with big data.
- The process of doing domain specific analysis is iterative and will need to change per situation.
- Human factors slow down the effort: the data scientist and the systems programmer must communicate continuously to get an algorithm that was originally written for small data on a single machine running successfully on a cluster.
Challenges for Data Scientists?
How do you expect a data scientist to deal with such dynamism? Do you expect a data scientist to handle these variations in data or runtime environment when he or she writes an algorithm? Ideally, we would like a data scientist to be able to write an algorithm that is independent of data characteristics and runtime environment.
Motivation for “Declarative Machine Learning”
This motivated us to develop “Declarative Machine Learning” so that data scientists can write an algorithm in an expressive language. An algorithm written by a data scientist should be independent of data characteristics, scale of data, and the runtime environment where the algorithm will run. Data scientists should have the flexibility to write new algorithms, reuse existing algorithms, or customize algorithms as needed. We wanted data scientists to be able to treat this as a single machine problem. This leads to four high-level requirements:
- High-level semantics: A data scientist should be able to write an algorithm in a high-level language without focusing on any low-level implementation details. He or she should be able to express goals through easy semantics. A data scientist should be able to understand the semantics and debug easily as needed.
- Flexibility: A data scientist should have flexibility to leverage existing algorithms with or without any customization. A data scientist should be able to write new algorithms easily.
- Data independence: A data scientist should not worry about data characteristics while writing the algorithms. Data could be sparse or dense, analyzed per row or per column, cached or not, and compressed or uncompressed, but it should be possible to write the algorithm without considering any of these characteristics.
- Scale independence: The size of the data could be small or large. It could fit on a single machine for analysis or need to be spread across a distributed environment, and the algorithm should not have to change either way.
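To make the data independence requirement concrete, here is a minimal sketch in plain Python. It is not SystemML code: the function names, the dictionary-based sparse format, and the 0.4 sparsity threshold are all assumptions made for illustration. The point is only that the caller writes one matrix-vector product, and the choice of sparse or dense storage is made from the data itself.

```python
# Illustrative sketch only: names and the threshold are assumptions,
# not SystemML internals.

SPARSITY_THRESHOLD = 0.4  # assumed cutoff: below this share of non-zeros, store sparsely


def to_internal(rows):
    """Pick a storage format from the data itself, not from the algorithm."""
    total = sum(len(r) for r in rows)
    nonzeros = sum(1 for r in rows for v in r if v != 0)
    if total and nonzeros / total < SPARSITY_THRESHOLD:
        # Sparse: one {column: value} dict per row, zeros omitted.
        return ("sparse", [{j: v for j, v in enumerate(r) if v != 0} for r in rows])
    return ("dense", [list(r) for r in rows])


def matvec(matrix, vector):
    """Matrix-vector product; the caller never names a storage format."""
    kind, data = to_internal(matrix)
    if kind == "sparse":
        # Only touch the stored non-zero cells.
        return [sum(v * vector[j] for j, v in row.items()) for row in data]
    return [sum(v * x for v, x in zip(row, vector)) for row in data]


dense_input = [[1, 2], [3, 4]]
sparse_input = [[0, 0, 5], [0, 0, 0], [7, 0, 0]]

print(matvec(dense_input, [1, 1]))      # takes the dense path
print(matvec(sparse_input, [1, 2, 3]))  # takes the sparse path
```

The same call works for both inputs; only the internal representation differs. A real system would make this decision with far more information (compression, caching, access pattern), but the separation of concerns is the same.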
Based on these requirements, we need something that can understand algorithms written in a high-level language and transform those into instructions to be executed on a target environment based on data characteristics. We realized we needed to leverage database optimization technologies and other database features to handle such a transformation. We developed a query optimizer to do the transformation from high-level statements written in a script to runtime instructions on a target environment based on characteristics of the data and runtime environment. The goal is to have a high-level language with the ability to scale in many different dimensions to many different data characteristics so that we can iterate faster.
The query optimizer is based on database query optimization techniques. The query optimizer will read statements written in a high-level language. These statements will be converted into smaller statement blocks. Based on data characteristics and the available runtime environment, individual smaller blocks get translated into generic data flow representations called high-level operations (HOPs). Subsequently, the optimizer applies dynamic rewrites and optimizes HOPs to generate low-level operations (LOPs) that will be a generic representation of the runtime execution plan. With current support, if the amount of memory required to run a particular instruction is available on a single node, then that instruction will be run in memory on the single node (Control Program); otherwise, that instruction will run in a distributed environment such as Spark or Hadoop as per the user’s preference.
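The memory-based decision described above can be sketched roughly as follows. This is a simplified illustration in Python, not the actual optimizer: the function names, the dense worst-case estimate of 8 bytes per cell, and the 2 GiB budget are assumptions chosen for the example.

```python
# Illustrative sketch: the names, the 8-bytes-per-cell estimate, and the
# memory budget are assumptions, not SystemML's real cost model.

BYTES_PER_DOUBLE = 8


def dense_size_bytes(rows, cols):
    """Worst-case memory estimate for a dense matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


def choose_exec_type(operand_dims, output_dims, memory_budget_bytes,
                     distributed_backend="SPARK"):
    """Return "CP" (single-node control program) if inputs plus output fit
    in the budget; otherwise return the user's preferred distributed backend."""
    needed = sum(dense_size_bytes(r, c) for r, c in operand_dims)
    needed += dense_size_bytes(*output_dims)
    return "CP" if needed <= memory_budget_bytes else distributed_backend


budget = 2 * 1024**3  # assume 2 GiB is available to the control program

# X (1,000 x 100) times w (100 x 1): well under the budget.
small = choose_exec_type([(1_000, 100), (100, 1)], (1_000, 1), budget)

# X (10,000,000 x 1,000) times w (1,000 x 1): X alone needs about 80 GB.
large = choose_exec_type([(10_000_000, 1_000), (1_000, 1)], (10_000_000, 1), budget)

print(small, large)
```

Because the decision is made per instruction, a single script can mix in-memory and distributed steps without the author changing anything in the algorithm.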
The above diagram shows a pictorial representation of SystemML. At the top of the diagram, you can see that a user can write an algorithm in an R-like or Python-like language supported by SystemML. SystemML has the capability to expand language support for other languages as well.
An algorithm, which is expressed as a set of statements in a script, will be parsed by the parser for validation and then transformed into smaller blocks. These blocks go through static and dynamic rewrites to generate high-level operations (HOPs) that are a generic representation of data flow. Any static transformation is independent of data characteristics, whereas dynamic transformation is based on data characteristics. At the HOP level, the known data size is labeled and propagated across the HOP tree for a given block. Based on data size, the target runtime platform gets determined at the HOP level.
Every HOP gets transformed to one or more low-level operations (LOPs) that are a generic representation of the runtime execution plan. These transformations are based on dynamic rewrites and dynamic recompilation. Physical operators are substituted in the runtime execution plan based on data and runtime characteristics, and then the runtime execution plan is executed on the target environment.
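As a rough illustration of size propagation across a HOP tree and the lowering of HOPs to LOPs, consider the toy Python sketch below. The `Hop` class, the operator names, the LOCAL/DISTRIBUTED labels, and the cell-count threshold are hypothetical and far simpler than what a real compiler does; the sketch only shows the shape of the two passes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Hop:
    """Toy high-level operator node (hypothetical, for illustration only)."""
    op: str                                  # "read", "transpose", or "matmult"
    inputs: List["Hop"] = field(default_factory=list)
    name: str = ""                           # variable name for "read" nodes
    dims: Optional[Tuple[int, int]] = None   # (rows, cols); known up front only for reads


def propagate_dims(hop):
    """Fill in dims bottom-up so every HOP knows its output size."""
    for child in hop.inputs:
        propagate_dims(child)
    if hop.op == "transpose":
        rows, cols = hop.inputs[0].dims
        hop.dims = (cols, rows)
    elif hop.op == "matmult":
        (r1, c1), (r2, c2) = hop.inputs[0].dims, hop.inputs[1].dims
        if c1 != r2:
            raise ValueError(f"dimension mismatch: {c1} vs {r2}")
        hop.dims = (r1, c2)
    return hop.dims


def lower(hop, max_local_cells, lops=None):
    """Emit one LOP per HOP in post-order, tagged with an execution type."""
    if lops is None:
        lops = []
    for child in hop.inputs:
        lower(child, max_local_cells, lops)
    rows, cols = hop.dims
    # Run locally only if the output and every input are small enough.
    sizes = [rows * cols] + [c.dims[0] * c.dims[1] for c in hop.inputs]
    exec_type = "LOCAL" if max(sizes) <= max_local_cells else "DISTRIBUTED"
    label = f"read {hop.name}" if hop.op == "read" else hop.op
    lops.append(f"{exec_type} {label} -> {rows}x{cols}")
    return lops


# HOP tree for the expression t(X) %*% y
X = Hop("read", name="X", dims=(1_000_000, 10))
y = Hop("read", name="y", dims=(1_000_000, 1))
root = Hop("matmult", [Hop("transpose", [X]), y])

propagate_dims(root)
plan = lower(root, max_local_cells=2_000_000)
for lop in plan:
    print(lop)
```

Here the sizes known at the leaves determine the size of every intermediate result, and those sizes in turn drive where each low-level operation is placed.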
SystemML Compilation Chain
We have implemented several common algorithms. These algorithms illustrate how SystemML can be leveraged to write algorithms in a high-level language with R-like or Python-like syntax very easily.
Algorithms in the SystemML package
Cox Proportional Hazard Regression Analysis
Generalized Linear Model
Kaplan-Meier Survival Analysis
L2 Support Vector Machine
Minimal Support Vector Machine
Multi Log Regression
Principal Component Analysis
Step GLM
Step Linear Regression DS
StratStats
Transform
Univar-Stats
We briefly looked at Declarative Machine Learning and discussed domain specific knowledge. We also discussed the language and compiler support needed to implement Declarative Machine Learning. A primary goal of this approach is to allow data scientists to write efficient and effective algorithms and improve productivity without thinking about data and runtime characteristics.