Nostril: A nonsense string evaluator written in Python
- Hucka, Michael
A number of research efforts have investigated extracting and analyzing textual information contained in software artifacts. However, source code files can contain meaningless text, such as random strings used as markers or test cases, and code extraction methods can themselves make mistakes and produce garbled text. When such methods are used in processing pipelines without human intervention, it is often important to include a data cleaning step before passing tokens extracted from source code to subsequent analysis or machine learning algorithms. A basic (and often unmentioned) part of that step is filtering out nonsense tokens. Nostril ("Nonsense String Evaluator") is a Python 3 module that can infer whether a given word or text string is likely to be nonsense or meaningful text. A "meaningful" string of characters is one constructed from real or real-looking English words or fragments of real words (even if the words are runtogetherlikethis). The main use case for Nostril is to decide whether short strings returned by source code mining methods are likely to be program identifiers (of classes, functions, variables, etc.) or random or otherwise non-identifier strings.
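To make the data cleaning step concrete, the following is a minimal sketch of a token filter with a pluggable nonsense predicate. The names `toy_is_nonsense` and `filter_tokens`, along with the two thresholds, are hypothetical and introduced only for illustration. The heuristic is a crude stand-in based on vowel ratio and consonant runs; it is not Nostril's method, which this abstract does not describe. In a real pipeline one would presumably pass Nostril's own predicate in its place.

```python
from typing import Callable, Iterable, List

VOWELS = set("aeiou")


def toy_is_nonsense(token: str,
                    max_consonant_run: int = 4,
                    min_vowel_ratio: float = 0.2) -> bool:
    """Crude stand-in heuristic (an assumption; NOT Nostril's algorithm).

    Flags a token when it has too few vowels or too long a run of
    consecutive consonants. Both thresholds are arbitrary choices.
    """
    letters = [c for c in token.lower() if c.isalpha()]
    if not letters:
        return True  # nothing alphabetic to judge

    vowel_count = sum(c in VOWELS for c in letters)

    # Track the longest run of consecutive non-vowel letters.
    run = longest = 0
    for c in letters:
        run = 0 if c in VOWELS else run + 1
        longest = max(longest, run)

    return (vowel_count / len(letters) < min_vowel_ratio
            or longest > max_consonant_run)


def filter_tokens(tokens: Iterable[str],
                  is_nonsense: Callable[[str], bool] = toy_is_nonsense
                  ) -> List[str]:
    """Keep only the tokens that the predicate does not flag as nonsense."""
    return [t for t in tokens if not is_nonsense(t)]


if __name__ == "__main__":
    mined = ["getUserName", "xkqzvbnmw", "runtogetherlikethis", "bcdfgh"]
    print(filter_tokens(mined))
```

Keeping the predicate as a parameter lets the cleaning step stay the same while the detector is swapped out, for example to compare a simple heuristic against a trained evaluator on the same mined tokens.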