Nostril: A nonsense string evaluator written in Python

Creators: Hucka, Michael¹

1. California Institute of Technology

Description

A number of research efforts have investigated extracting and analyzing textual information contained in software artifacts. However, source code files can contain meaningless text, such as random strings used as markers or test cases, and code extraction methods sometimes make mistakes and produce garbled text. In processing pipelines that run without human intervention, it is often important to include a data cleaning step before passing tokens extracted from source code to subsequent analysis or machine learning algorithms. Thus, a basic (and often unmentioned) step is to filter out nonsense tokens. Nostril ("Nonsense String Evaluator") is a Python 3 module that can infer whether a given word or text string is likely to be nonsense or meaningful text. A "meaningful" string of characters is one constructed from real or real-looking English words or fragments of real words (even if the words are runtogetherlikethis). The main use case for Nostril is to decide whether short strings returned by source code mining methods are likely to be program identifiers (of classes, functions, variables, etc.) or random or other non-identifier strings.
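The data cleaning step described above can be sketched as a filter that takes a nonsense predicate as a parameter. This is a minimal illustration only: `filter_tokens` and `toy_is_nonsense` are hypothetical names, and the toy heuristic (no vowels, or a run of five or more consonants) is a crude stand-in, not the method Nostril uses. In a real pipeline, the predicate would be replaced with Nostril's own detector.

```python
from typing import Callable, Iterable, List

VOWELS = set("aeiou")


def toy_is_nonsense(token: str) -> bool:
    """Crude stand-in heuristic (an assumption, NOT Nostril's algorithm).

    Flags a token as nonsense if it contains no letters, no vowels,
    or a run of five or more consecutive consonants.
    """
    letters = [c for c in token.lower() if c.isalpha()]
    if not letters:
        return True
    has_vowel = False
    consonant_run = 0
    for c in letters:
        if c in VOWELS:
            has_vowel = True
            consonant_run = 0
        else:
            consonant_run += 1
            if consonant_run >= 5:
                return True
    return not has_vowel


def filter_tokens(
    tokens: Iterable[str],
    is_nonsense: Callable[[str], bool] = toy_is_nonsense,
) -> List[str]:
    """Keep only the tokens that the predicate does not flag as nonsense."""
    return [t for t in tokens if not is_nonsense(t)]


if __name__ == "__main__":
    mined = ["getUserName", "bfrgtyhjk", "runtogetherlikethis", "xkcdqwrtz"]
    print(filter_tokens(mined))
```

Passing the predicate as an argument keeps the pipeline step independent of any particular detector, so the toy heuristic can be swapped out without touching the filtering code.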

Files

Files (294.5 MB)

Name	MD5	Size
nostril-1.1.1.tar.gz	2e37e9967a073373817a15d6fa7f931e	146.7 MB
nostril-1.1.1.zip	28db84beaaea394e2a9ab8ec3c9de343	147.8 MB
