NLP tab

From visone manual

Revision as of 18:22, 14 July 2011

The NLP tab contains visone's Natural Language Processing functionality, which can be used to create networks from text written in a natural language (usually English).

The NLP tab is not shown by default. In order for it to be enabled, visone must be started with the command line option

-e de.visone.nlp.NLPExtension

If texts with long sentences are to be processed, it is furthermore advisable to add the JVM option -Xmx1g to prevent visone from running out of memory during network creation.
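Putting these together, a launch command might look like the following sketch, assuming visone is started from a jar file named visone.jar (the actual file name depends on your installation):

```
java -Xmx1g -jar visone.jar -e de.visone.nlp.NLPExtension
```

Note that -Xmx1g must come before -jar, since it is addressed to the Java VM itself rather than to visone.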

In addition, the files englishPCFG.ser.gz and left3words-wsj-0-18.tagger should be downloaded from the Stanford Natural Language Processing Group's website and placed in the visone working directory. This will allow centering resonance analysis and word net analysis to work without explicitly specifying a parser or tagger file (see below).

centering resonance analysis

...

word net analysis

...

crawl link structure

This function captures the link structure of a website into a network. One node is created for each page, and a (directed) edge for every link from one page to another. Each node will have the following attributes:

  • url: the URL of the page
  • text: the text content of the page (if enabled)

Each edge will have an attribute named linktext containing the text of the hyperlink represented by that edge. In addition, if a link was found but not followed, it will be marked as unconfirmed.
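The linktext attribute is simply the visible text of the hyperlink. As an illustration (a Python sketch using only the standard library, not visone's actual Java implementation), the following extracts each hyperlink together with the text that would become its edge's linktext:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, linktext) pairs, i.e. one prospective edge per link."""
    def __init__(self):
        super().__init__()
        self.links = []    # list of (href, linktext) tuples
        self._href = None  # href of the <a> tag we are currently inside, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkExtractor()
parser.feed('<p><a href="http://visone.info/">visone home</a></p>')
print(parser.links)  # [('http://visone.info/', 'visone home')]
```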

All that is required to use the crawler is the url field, which must be filled with the URL of the page at which the crawler should start, such as http://visone.info/wiki/index.php/Main_Page for the visone wiki. After crawling has finished, the network is deliberately not laid out, so that extremely large graphs can be handled; as a result, all nodes will be shown lying on a single point. Click the quick layout button to see the network structure.

For finer control, the following options can be adjusted:

  • browser: The browser that visone will masquerade as, to avoid being blocked. Select one from the list, or paste any user agent string.
  • max depth: The maximum length of a chain of links that will be followed. For example, 0 will not follow any links, 1 will follow all links on the start page, and 2 will follow all links on the start page and on pages referenced by the start page.
  • page limit: The maximum number of pages that will be downloaded. This is not a limit on the number of nodes created, because nodes are created for every page that is found, even if it is never downloaded.
  • same host only: If selected, links to different hosts will not be followed. For example, if a page on visone.info links to another page on visone.info, this link will be followed, but a link to google.com will not.
  • store text: If selected, the text content of every page that is downloaded will be stored in the text attribute of its node.
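How max depth, page limit and same host only interact can be sketched as a breadth-first crawl loop. This is an illustrative Python sketch, not visone's implementation: fetch_links is a hypothetical stand-in for downloading a page and extracting its links, and the boolean node values record whether a page was actually downloaded (False corresponds to an unconfirmed link target).

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, fetch_links, max_depth=2, page_limit=100, same_host_only=True):
    """Breadth-first crawl sketch: returns nodes {url: downloaded?} and edge list."""
    host = urlparse(start_url).hostname
    nodes, edges = {start_url: False}, []
    queue = deque([(start_url, 0)])
    downloaded = 0
    while queue and downloaded < page_limit:
        url, depth = queue.popleft()
        nodes[url] = True                    # page actually downloaded
        downloaded += 1
        for target in fetch_links(url):
            edges.append((url, target))
            if target not in nodes:
                nodes[target] = False        # node created even if never downloaded
                follow = depth < max_depth
                if same_host_only and urlparse(target).hostname != host:
                    follow = False           # links to other hosts stay unconfirmed
                if follow:
                    queue.append((target, depth + 1))
    return nodes, edges

# Tiny in-memory "website" used in place of real HTTP requests:
site = {"http://a.example/": ["http://a.example/p1", "http://b.example/"],
        "http://a.example/p1": []}
nodes, edges = crawl("http://a.example/", lambda u: site.get(u, []), max_depth=1)
print(nodes)  # the off-host page is found (a node exists) but never downloaded
```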

crawler expert options

  • crawl rule selection: This option decides which parts of the webpage will be considered textual content.
    • standard uses a default rule that should work for most websites.
    • domain specific selects the rule based on the URL, so that for example only the actual text of Wikipedia articles is extracted. The name of the rule file is generated from the URL, for example, the rule file for http://www.informatik.uni-konstanz.de/algo is algo.www.informatik.uni-konstanz.de.rule. Some rules, such as the one for Wikipedia, are already built in.
    • specified uses the rule selected in the crawl rule dropdown.
  • cookies: If the website requires a login, paste the HTTP Cookie: header value here. (Obtaining this value might be complicated, though.)
  • ignore errors: If this option is not selected, any error encountered during crawling, such as the common page not found, will terminate the crawler. Leaving it unchecked is only recommended when the site is known to contain no dead links.
  • mean delay: Delay after a successful page download, in milliseconds.
  • cool down after error: Delay after a failed page download, in milliseconds.
  • respect nofollow: If selected, visone will observe the nofollow specification and will not follow any links so marked.
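The rule file name derivation described under domain specific can be reproduced for the example given above. This is a Python sketch of the naming scheme as documented; visone's exact handling of edge cases such as ports, query strings, or trailing slashes is not specified here:

```python
from urllib.parse import urlparse

def rule_file_name(url):
    # Derive the rule file name as <last path segment>.<host>.rule,
    # matching the example in the text.
    parsed = urlparse(url)
    segment = parsed.path.strip("/").split("/")[-1]
    return f"{segment}.{parsed.hostname}.rule"

print(rule_file_name("http://www.informatik.uni-konstanz.de/algo"))
# algo.www.informatik.uni-konstanz.de.rule
```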

crawl text

The purpose of this function is to download the textual content of a single web page and store it in a file that can be used as an input file for the text analysis methods discussed above.

Use output file to select the name of the file to (over-)write. For the url, browser and expert options, see the crawl link structure section above.