Published May 11, 2018 | Version 1.1.1
Software Open

Nostril: A nonsense string evaluator written in Python

  • 1. California Institute of Technology

Description

A number of research efforts have investigated extracting and analyzing textual information contained in software artifacts. However, source code files can contain meaningless text, such as random strings used as markers or test cases, and code extraction methods can also make mistakes and produce garbled text. When such methods are used in processing pipelines without human intervention, it is often important to include a data cleaning step before passing tokens extracted from source code to subsequent analysis or machine learning algorithms. Thus, a basic (and often unmentioned) step is to filter out nonsense tokens.

Nostril ("Nonsense String Evaluator") is a Python 3 module that can infer whether a given word or text string is likely to be nonsense or meaningful text. A "meaningful" string of characters is one constructed from real or real-looking English words or fragments of real words (even if the words are runtogetherlikethis). The main use case for Nostril is to decide whether short strings returned by source code mining methods are likely to be program identifiers (of classes, functions, variables, etc.), or random or other non-identifier strings.
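The data cleaning step described above can be sketched as a small filter that drops tokens flagged by a nonsense classifier. This is only an illustration under stated assumptions: `looks_like_nonsense` and `filter_tokens` are hypothetical names, and the vowel-ratio heuristic is a crude stand-in, not the method Nostril uses. In a real pipeline, a proper classifier such as the one Nostril provides would be passed in as the predicate.

```python
from typing import Callable, Iterable, List

VOWELS = set("aeiou")


def looks_like_nonsense(token: str) -> bool:
    """Crude stand-in classifier (hypothetical; NOT Nostril's algorithm).

    Flags a token when it has very few vowels or a long run of consonants.
    """
    letters = [c for c in token.lower() if c.isalpha()]
    if not letters:
        return True  # nothing alphabetic to judge, so treat as nonsense

    vowel_ratio = sum(c in VOWELS for c in letters) / len(letters)

    # Length of the longest run of consecutive consonants.
    longest = run = 0
    for c in letters:
        run = 0 if c in VOWELS else run + 1
        longest = max(longest, run)

    return vowel_ratio < 0.2 or longest >= 5


def filter_tokens(
    tokens: Iterable[str],
    is_nonsense: Callable[[str], bool] = looks_like_nonsense,
) -> List[str]:
    """Keep only the tokens that the predicate does not flag as nonsense."""
    return [t for t in tokens if not is_nonsense(t)]


if __name__ == "__main__":
    extracted = ["getint", "bunchofwords", "zxcvbnmlkjhgfdsq", "qwrtpsdf"]
    print(filter_tokens(extracted))
```

Passing the classifier as a parameter keeps the filtering step independent of how nonsense is detected, so the stand-in heuristic can be swapped for a trained detector without changing the rest of the pipeline.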

Files

nostril-1.1.1.zip
Files (294.5 MB)
Name Size
md5:2e37e9967a073373817a15d6fa7f931e 146.7 MB
md5:28db84beaaea394e2a9ab8ec3c9de343 147.8 MB

Additional details

Created:
September 9, 2022
Modified:
September 9, 2022