NLP tab: Difference between revisions

From visone manual
Revision as of 17:29, 14 July 2011

The NLP tab contains visone's Natural Language Processing functionality, which can be used to create networks from text written in a natural language (usually English).

The NLP tab is not shown by default. To enable it, start visone with the command line option

-e de.visone.nlp.NLPExtension

If texts with long sentences are to be processed, it is also advisable to add the Java option -Xmx1g, which raises the maximum heap size to 1 GB and prevents visone from running out of memory during network creation.
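The two options above are of different kinds: -Xmx1g is read by the Java virtual machine, while -e is read by visone itself, so their position on the command line matters. The following sketch assembles a launch command under the assumption that visone is started as an executable JAR. The file name visone.jar and the helper visone_command are placeholders chosen for illustration, not names taken from the visone distribution.

```python
import shlex


def visone_command(jar_path="visone.jar", max_heap="1g", enable_nlp=True):
    """Build an argument list for launching visone (hypothetical helper).

    jar_path is a placeholder for wherever the visone JAR actually lives.
    """
    cmd = ["java"]
    if max_heap:
        # JVM options such as -Xmx must come before -jar.
        cmd.append(f"-Xmx{max_heap}")
    cmd += ["-jar", jar_path]
    if enable_nlp:
        # Arguments after the JAR path are passed on to visone.
        cmd += ["-e", "de.visone.nlp.NLPExtension"]
    return cmd


# Print the command as a single shell-quoted line.
print(shlex.join(visone_command()))
```

With the placeholder defaults this prints java -Xmx1g -jar visone.jar -e de.visone.nlp.NLPExtension; replace the JAR path with the one used by your installation.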

In addition, the files englishPCFG.ser.gz and left3words-wsj-0-18.tagger should be downloaded from the Stanford Natural Language Processing Group's website and placed in the visone working directory. This will allow centering resonance analysis and word net analysis to work without explicitly specifying a parser or tagger file (see below).

centering resonance analysis

...

word net analysis

...

crawl link structure

This function captures the link structure of a website into a network. One node is created for each page, and a (directed) edge for every link from one page to another. Each node will have the following attributes:

  • url: the URL of the page
  • text: the text content of the page (if enabled)

Each edge will have an attribute named linktext containing the text of the hyperlink represented by that edge. In addition, if a link was found but not followed, it will be marked as unconfirmed.

All that is required to use the crawler is the url field, which must be filled with the URL of the page at which the crawler should start, such as http://visone.info/wiki/index.php/Main_Page for the visone wiki. After crawling has finished, the network is not laid out automatically, so that extremely large graphs can be handled; all nodes are therefore shown lying on a single point. Click the quick layout button to see the network structure.

For finer control, the following options can be adjusted:

  • browser: The browser that visone will masquerade as, to avoid being blocked. Select one from the list, or paste any user agent string.
  • max depth: The maximum length of a chain of links that will be followed. For example, 0 will not follow any links, 1 will follow all links on the start page, and 2 will follow all links on the start page and on pages referenced by the start page.
  • page limit: The maximum number of pages that will be downloaded. This is not a limit on the number of nodes created, because nodes are created for every page that is found, even if it is never downloaded.
  • same host only: If selected, links to different hosts will not be followed. For example, if a page on visone.info links to another page on visone.info, this link will be followed, but a link to google.com will not.
  • store text: If selected, the text content of every page that is downloaded will be stored in the text attribute of its node.
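How the options above interact can be illustrated with a minimal breadth-first crawl. This is a sketch of the described behaviour, not visone's actual implementation: the function crawl, the fetch callback and the dictionary layout of nodes and edges are all hypothetical. It also assumes that an edge counts as unconfirmed when its target page was never downloaded, and it leaves out the browser option because fetch stands in for the real HTTP download.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, fetch, max_depth=2, page_limit=100,
          same_host_only=True, store_text=True):
    """Breadth-first crawl returning (nodes, edges).

    fetch(url) is a hypothetical callback that returns (page_text, links),
    where links is a list of (target_url, link_text) pairs.
    """
    nodes = {start_url: {"url": start_url, "text": None}}
    edges = []
    queue = deque([(start_url, 0)])   # pages scheduled for download
    scheduled = {start_url}
    downloaded = set()

    # page limit bounds the number of downloads, not the number of nodes.
    while queue and len(downloaded) < page_limit:
        url, depth = queue.popleft()
        text, links = fetch(url)
        downloaded.add(url)
        if store_text:
            nodes[url]["text"] = text
        for target, link_text in links:
            # A node is created for every page found, even if never downloaded.
            nodes.setdefault(target, {"url": target, "text": None})
            edges.append({"source": url, "target": target,
                          "linktext": link_text})
            same_host = urlparse(target).netloc == urlparse(url).netloc
            # Links on a page at depth d are followed only while d < max_depth.
            if (target not in scheduled and depth < max_depth
                    and (same_host or not same_host_only)):
                scheduled.add(target)
                queue.append((target, depth + 1))

    # Assumption: an edge is unconfirmed if its target was never downloaded.
    for edge in edges:
        edge["unconfirmed"] = edge["target"] not in downloaded
    return nodes, edges
```

With max_depth=0 only the start page is downloaded, yet every page it links to still appears as a node, connected by an unconfirmed edge. The same happens to pages left in the queue when page_limit is reached, which is why the page limit does not bound the number of nodes.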

crawl text

...