To better support data-intensive workflows which are typically built out of various independently developed executables, this paper proposes extensions to parallel database systems called User-Defined eXecutables (UDX) and collective queries. UDX facilitates the description of workflows by enabling seamless integrations of external executables into SQL statements without any efforts to write programs confirming to strict specifications of databases. A collective query is an SQL query whose results are distributed to multiple clients and then processed by them in parallel, using arbitrary UDX. It provides efficient parallelization of executables through the data transfer optimization algorithms that distribute query results to multiple clients, taking both communication cost and computational loads into account. We implement this concept in a system called Para Lite, a parallel database system based on a popular lightweight database SQLite. Our experiments show that Para Lite has several times higher performance over Hive for typical SQL tasks and has 10x speedup compared to a commercial DBMS for executables. In addition, this paper studies a real-world text processing workflow and builds it on top of Para Lite, Hadoop, Hive and general files. Our experiences indicate that Para Lite outperforms other systems in both productivity and performance for the workflow.
展开▼