Several studies have been published on genetic programming as a way to discover motifs in proteins and other biological data. These studies have been small, and have often used domain knowledge to improve the search. In this paper we present a genetic programming algorithm that does not use domain knowledge, with results on 44 different protein families. We demonstrate that our list-based representation, given a fixed amount of processing resources, is able to discover meaningful motifs with good classification performance, sometimes comparable to or even surpassing that of motifs found in a database of manually created motifs. We also investigate the introduction of gaps in our algorithm; this appears to give a small increase in classification accuracy and recall, but with reduced precision.