NLP tab

From visone manual
Revision as of 19:49, 14 July 2011 by Fratz

The NLP tab contains visone's Natural Language Processing functionality, which can be used to create networks from text written in a natural language (usually English).

The NLP tab is not shown by default. In order for it to be enabled, visone must be started with the command line option

-e de.visone.nlp.NLPExtension

If texts with long sentences are to be processed, it is also advisable to add the JVM option -Xmx1g to prevent visone from running out of memory during network creation.
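As a rough illustration, the two options above could be combined into one launch command. The sketch below only assembles and prints such a command. It assumes visone is started from a jar file called visone.jar, which is a hypothetical name; how visone is actually launched depends on the installation.

```python
import shlex

# Assumption: visone is started from a jar named "visone.jar" (hypothetical name).
VISONE_JAR = "visone.jar"


def build_launch_command(jar=VISONE_JAR, max_heap="1g", enable_nlp=True):
    """Assemble a java command line that enables the NLP extension."""
    cmd = ["java"]
    if max_heap:
        # JVM option that raises the maximum heap size, e.g. -Xmx1g
        cmd.append(f"-Xmx{max_heap}")
    cmd += ["-jar", jar]
    if enable_nlp:
        # visone option named in the text above
        cmd += ["-e", "de.visone.nlp.NLPExtension"]
    return cmd


# Print the command as a single shell-quoted line
print(shlex.join(build_launch_command()))
```

The JVM option is placed before -jar so that it is read by the Java runtime, while -e follows the jar so that it is passed on to visone.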

In addition, the files englishPCFG.ser.gz and left3words-wsj-0-18.tagger should be downloaded from the Stanford Natural Language Processing Group's website and placed in the visone working directory. This will allow centering resonance analysis and word net analysis to work without explicitly specifying a parser or tagger file (see below).

centering resonance analysis

This performs a centering resonance analysis (CRA) of the input text.

  • quick layout runs a quick layout after network creation. This may take some time for huge networks.
  • The stopword filter can either use the built-in list of stopwords (common function words that can be assumed to carry no meaning for the topic of the text), or a file containing one lowercase stopword per line, with empty lines and comments (starting with #) ignored. In addition, a minimum word length can be set, so that all words shorter than this length are always ignored.
  • A comma-separated list of center tags can be specified. These are the node names as used by the parser, e.g. NP (noun phrase) or VP (verb phrase). Phrases can be nested, such that an NP contains more NPs. If take centers from bottom level is selected, the innermost phrases are used; otherwise the top-level phrase is used, which produces a much denser network.
  • A parser file must be specified. The default path points to englishPCFG.ser.gz in the visone working directory, so if the file was placed there as recommended, nothing needs to be selected.
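The stopword file format and the minimum word length described above can be sketched as follows. This is a minimal illustration, not visone's own code; the function names are made up for the example, and it is an assumption that words are lowercased before being compared with the list.

```python
def parse_stopwords(lines):
    """Collect stopwords: one word per line; blank lines and # comments are skipped."""
    stopwords = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        stopwords.add(entry)
    return stopwords


def keep_word(word, stopwords, min_length=0):
    """Return True if the word passes the minimum-length and stopword filters."""
    if len(word) < min_length:
        return False
    # Assumption: comparison is done on the lowercased word,
    # because the file is expected to hold lowercase entries.
    return word.lower() not in stopwords


# Lines as they might appear in a stopword file
sample_file = ["# function words", "the", "", "of", "and"]
stopwords = parse_stopwords(sample_file)

words = ["The", "analysis", "of", "a", "network"]
kept = [w for w in words if keep_word(w, stopwords, min_length=2)]
print(kept)
```

Here "The" and "of" are removed by the stopword list, and "a" is removed by the minimum word length of 2.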

word net analysis

This performs a word net analysis (WNA) of the input text.

  • For WNA, either a window size must be set, or use co-occurrence must be selected, in which case all words within a sentence are connected.
  • WNA can filter according to POS (part-of-speech) tags. The default tag set selects nouns, verbs, adjectives, adverbs and foreign words. Alternatively, an external file with one tag per line can be specified; in this case, empty lines and comments (starting with #) are ignored. The default tagger uses the Penn Treebank Tagset.
  • A tagger file must be specified. The default path points to left3words-wsj-0-18.tagger in the visone working directory, so if the file was placed there as recommended, nothing needs to be selected.

See the centering resonance analysis section for the quick layout and stopword filter options.
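The difference between a window size and sentence-wide co-occurrence can be illustrated with a small sketch. It is not visone's implementation. In particular, it assumes that a window of size n links words that are at most n - 1 positions apart within a sentence; the page does not define the window more precisely.

```python
from itertools import combinations


def word_network_edges(sentences, window_size=None):
    """Return the set of undirected word pairs to connect.

    sentences: a list of token lists, one per sentence.
    window_size=None links every pair of words in a sentence (co-occurrence).
    Otherwise only words at most window_size - 1 positions apart are linked
    (assumed meaning of the window).
    """
    edges = set()
    for tokens in sentences:
        for i, j in combinations(range(len(tokens)), 2):
            if window_size is not None and j - i >= window_size:
                continue  # the two words do not fit into one window
            if tokens[i] != tokens[j]:
                edges.add(frozenset((tokens[i], tokens[j])))
    return edges


sentences = [["networks", "reveal", "hidden", "structure"]]
print(len(word_network_edges(sentences, window_size=2)))  # neighbouring words only
print(len(word_network_edges(sentences)))                 # all pairs in the sentence
```

Under this assumption, a small window yields a sparse chain of neighbouring words, while co-occurrence turns every sentence into a fully connected group.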

crawl link structure

This function captures the link structure of a website into a network. One node is created for each page, and a (directed) edge for every link from one page to another. Each node will have the following attributes:

  • url: the URL of the page
  • text: the text content of the page (if enabled)

Each edge will have an attribute named linktext containing the text of the hyperlink represented by that edge. In addition, if a link was found but not followed, it will be marked as unconfirmed.

The only required setting is the url field, which must contain the URL of the page at which the crawler should start, such as http://visone.info/wiki/index.php/Main_Page for the visone wiki. To allow handling of extremely large graphs, no layout is computed after crawling has finished, so all nodes are shown lying on a single point. Click the quick layout button to see the network structure.

For finer control, the following options can be adjusted:

  • browser: The browser that visone will masquerade as, to avoid being blocked. Select one from the list, or paste any user agent string.
  • max depth: The maximum length of a chain of links that will be followed. For example, 0 will not follow any links, 1 will follow all links on the start page, and 2 will follow all links on the start page and on pages referenced by the start page.
  • page limit: The maximum number of pages that will be downloaded. This is not a limit on the number of nodes created, because nodes are created for every page that is found, even if it is never downloaded.
  • same host only: If selected, links to different hosts will not be followed. For example, if a page on visone.info links to another page on visone.info, this link will be followed, but a link to google.com will not.
  • store text: If selected, the text content of every page that is downloaded will be stored in the text attribute of its node.
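How max depth, page limit and same host only interact can be sketched with a breadth-first traversal over an in-memory link table, so that nothing is actually downloaded. This is an illustration under stated assumptions, not visone's crawler: the function name and the links dictionary are made up for the example, and an edge is treated as unconfirmed when its target page was never downloaded.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start, links, max_depth=2, page_limit=100, same_host_only=True):
    """Breadth-first crawl over `links`, a dict mapping URL -> list of linked URLs.

    Returns (nodes, edges, downloaded); edges maps (source, target)
    to "confirmed" or "unconfirmed".
    """
    start_host = urlparse(start).netloc
    nodes = {start}
    edges = {}
    downloaded = set()
    queue = deque([(start, 0)])
    while queue and len(downloaded) < page_limit:
        url, depth = queue.popleft()
        if url in downloaded:
            continue
        downloaded.add(url)  # stands in for the actual page download
        for target in links.get(url, []):
            nodes.add(target)  # a node is created for every page that is found
            edges[(url, target)] = "unconfirmed"
            if depth >= max_depth:
                continue  # link chain would become longer than max depth
            if same_host_only and urlparse(target).netloc != start_host:
                continue  # links to other hosts are recorded but not followed
            queue.append((target, depth + 1))
    # Assumption: an edge counts as confirmed once its target was downloaded
    for key in list(edges):
        if key[1] in downloaded:
            edges[key] = "confirmed"
    return nodes, edges, downloaded


links = {
    "http://a.example/": ["http://a.example/x", "http://b.example/"],
    "http://a.example/x": ["http://a.example/y"],
}
nodes, edges, downloaded = crawl("http://a.example/", links, max_depth=1)
print(len(nodes), len(downloaded))
```

With max depth 1, the start page and the page it links to on the same host are downloaded, while the external page and the page found at depth 2 still appear as nodes. This mirrors the remark that page limit does not bound the number of nodes.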

crawler expert options

  • crawl rule selection: This option decides which parts of the webpage will be considered textual content.
    • standard uses a default rule that should work for most websites.
    • domain specific selects the rule based on the URL, so that for example only the actual text of Wikipedia articles is extracted. The name of the rule file is generated from the URL, for example, the rule file for http://www.informatik.uni-konstanz.de/algo is algo.www.informatik.uni-konstanz.de.rule. Some rules, such as the one for Wikipedia, are already built in.
    • specified uses the rule selected in the crawl rule dropdown.
  • cookies: If the website requires a login, paste the HTTP Cookie: header value here. (Obtaining this value might be complicated, though.)
  • ignore errors: If this option is not selected, any error encountered during crawling, such as the common page not found error, will terminate the crawler. Leaving it unselected is only recommended if it is known that there are no dead links.
  • mean delay: Delay after a successful page download, in milliseconds.
  • cool down after error: Delay after a failed page download, in milliseconds.
  • respect nofollow: If selected, visone will observe the nofollow specification and will not follow any links marked as such.
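The rule file name given for the domain specific setting can be reproduced with a few lines of code. The sketch below is only a guess generalized from that single example: it assumes the name consists of the first path segment, the host name and the suffix .rule. How visone treats URLs with several path segments or with no path at all is not stated on this page.

```python
from urllib.parse import urlparse


def rule_file_name(url):
    """Guess the rule file name for a URL (pattern inferred from one example)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # Assumption: only the first path segment is used, placed before the host
    parts = segments[:1] + [parsed.netloc]
    return ".".join(parts) + ".rule"


print(rule_file_name("http://www.informatik.uni-konstanz.de/algo"))
```

For the URL from the example above, this yields algo.www.informatik.uni-konstanz.de.rule.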

crawl text

This function downloads the textual content of a single webpage and stores it in a file that can be used as an input file for the text analysis methods discussed above.

Use output file to select the name of the file to (over-)write. For the url, browser and expert options, see above.