Import options dialog: Difference between revisions

From visone manual
 
(16 intermediate revisions by 2 users not shown)
Line 1: Line 1:
visone can import data from comma-separated value (CSV) files. Since these do not come with an unequivocal specification of how to interpret them, some choices must be made. Therefore, whenever you open a CSV file, visone shows you an import options dialog. This page explains the various options; more on importing data from various sources is given in the [[Data_input_(tutorial)|data input tutorial]].
visone can import network data from comma-separated value (CSV) files. Since these do not come with an unequivocal specification of how to interpret them, some choices must be made. Therefore, whenever you open a CSV file via the [[file menu]], visone shows you the import options dialog.  


== The variants of comma-separated-value (CSV) files ==
== Data formats ==


A comma-separated-value file can be thought of as a plain-text file that encodes a table, i.e., a data array that has rows and columns - sometimes also referred to as a matrix. In a network context, there are different possibilities to encode information about nodes, links, or attribute information in such tables. The first three (adjacency matrices, link lists, and adjacency lists) provide information about nodes and links and can be opened via the [[file menu]] by selecting the appropriate file type. Attribute tables provide information about node or link attributes and can be opened via the [[attribute manager]]. What follows is a short characterization of these file types; more exhaustive explanation about how to read them is given in the following sections.
The import options dialog is able to handle four types of data formats:
* adjacency matrix
'''Adjacency matrix files.''' An adjacency matrix encodes for all pairs of nodes (indexing the rows and columns of the table) whether or not there is a link connecting these nodes. An example is shown in the following.
* link list
  ;A;B;C;D
* adjacency list
  A;0;'''1''';1;1
* node list
  B;'''0''';0;1;0
  C;1;1;0;0
  D;1;0;0;0
The first row and the first column are the labels of the nodes; the remaining part encodes whether there is a link from the node indexing the row to the node indexing the column. For instance, the character '''1''' in the row indexed by <tt>A</tt> and the column indexed by <tt>B</tt> indicates that there is a link going from <tt>A</tt> to <tt>B</tt>; the '''0''' in the row <tt>B</tt> and column <tt>A</tt> indicates that there is no link in the reverse direction.
 
'''Link list files.''' A link list contains as many rows as there are links in the network and (in its most basic form) a link list contains two columns where the entry in the first column is the identifier for the source node and the entry in the second column denotes the target node of the link. The following example
  A;C
  C;B
  B;A
  A;D
defines four links: from node <tt>A</tt> to node <tt>C</tt>, from <tt>C</tt> to <tt>B</tt>, etc. Note that link lists use less space than adjacency matrices - especially if the network is very sparse (i.e., when the number of links divided by the number of node pairs is a small value close to zero).
 
'''Adjacency list files.''' An adjacency list has as many rows as there are nodes in the network. Each row may have a different length and the row associated with a node lists all neighbors of this node. For instance, in the following example
  A;2;3;1
  B;0;2
  C;1;0
  D;0
the labels at the beginning of each row are the node identifiers (these ids are optional); <code>A</code> is the label of the node with index 0, <code>B</code> is the label of the node with index 1, etc; the list <code>2;3;1</code> following the label <code>A</code> defines that there are links from <code>A</code> to the node with index 2 (labeled <code>C</code>), to the node with index 3 (<code>D</code>), and to the node with index 1 (<code>B</code>).
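
The reading rule described above can be sketched in a few lines of Python. This is only an illustration of the format, not visone's implementation; the function name <code>read_adjacency_list</code> and its parameters are hypothetical, and skipping empty cells is an assumption of this sketch.

```python
def read_adjacency_list(text, delimiter=";", node_labels=True):
    """Turn an adjacency list into (source, target) pairs.

    Hypothetical helper: each row optionally starts with a node label,
    followed by the 0-based row indices of that node's neighbors.
    """
    rows = [line.split(delimiter) for line in text.splitlines() if line.strip()]
    # Without labels, the node in row i is simply named "i".
    labels = [row[0] if node_labels else str(i) for i, row in enumerate(rows)]
    links = []
    for i, row in enumerate(rows):
        neighbors = row[1:] if node_labels else row
        for cell in neighbors:
            if cell == "":  # assumption: empty cells carry no information
                continue
            links.append((labels[i], labels[int(cell)]))
    return links


EXAMPLE = "A;2;3;1\nB;0;2\nC;1;0\nD;0\n"
links = read_adjacency_list(EXAMPLE)
print(links[:3])  # the three links that start at node A
```

For the example file, the first row yields the links from <code>A</code> to <code>C</code>, <code>D</code>, and <code>B</code>, in that order.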


'''Attribute tables.''' An attribute table has one row that lists the attribute names (in the example below this is <tt>id;age;smokes</tt>) followed by as many rows as there are nodes in the network.
A brief introduction to these four data formats is given in the [[Data input (tutorial)#The variants of comma-separated-value (CSV) files|data input tutorial]].
  id;age;smokes
To select the data format, use the topmost drop-down menu ('''data format'''). The dialog then consists of three parts. The upper part shows the options for interpreting data of the chosen data format; these options vary by data format and are described in the following sections. The middle part ('''file format''') is the same for all data formats and contains the options for reading the file (see the last section for a detailed description). The bottom part ('''preview''') shows the table as it will be interpreted with the current settings, which lets you check whether the options are set correctly.
  A;23;false
  B;28;true
  C;19;true
  D;27;false
One of the columns (<tt>id</tt> in the example above) lists the unique node identifiers (it is not necessarily the first column and not necessarily labeled <tt>id</tt>). The other columns list the attribute values of the attribute whose name is given in the respective column header. Attribute files are not read via the [[file menu]] but can only be added to an existing network via the [[attribute manager]].


'''Note: visone cannot read an adjacency matrix and node attributes from a single file.''' That is, a file like
Note that the import options dialog is designed to create nodes and links from a given data format. To add node and link attributes to a given network, use the [[attribute manager]]. A detailed description of adding attributes is also given in the [[Data input (tutorial)#Importing node and link attributes|data input tutorial]].
  ;A;B;C;D;age;smokes
  A;0;1;1;1;23;false
  B;1;0;1;0;28;true
  C;1;1;0;0;19;true
  D;1;0;0;0;27;false
(having the interpretation that the first four columns specify an adjacency matrix and the last two columns define node attribute values) cannot be opened. Rather you have to split this into two separate files, one containing the adjacency matrix which can be opened via the [[file menu]] and the other containing the attribute table which can be opened via the [[attribute manager]]. (For instance, in MS Excel you could split the file by selecting columns, copying them, and pasting them into a new table.)


The next four sections explain the various options that have to be set when reading CSV files.


== Adjacency matrix files ==
=== Adjacency matrix files ===


To open an adjacency matrix, use the [[file menu]], click on '''open...''', select ''files of type'' '''CSV files (.txt, .csv)''' in the file chooser, navigate to the file you want to open, and click on the '''ok''' button. Then the [[import options dialog]] opens (shown below).
To open an adjacency matrix, use the [[file menu]], click on '''open...''', select ''files of type'' '''CSV files (.txt, .csv)''' in the file chooser, navigate to the file you want to open, and click on the '''ok''' button. If '''adjacency matrix''' is selected as ''data format'' in the topmost drop-down menu, the import options dialog is as shown below (with a preview displaying your data).


[[File:Import_options_dialog_adjacency_matrix.png]]
[[File:Import_options_dialog_adjacency_matrix.png]]


The semantics of the various options is explained in the following.
The semantics of the various data format options is explained in the following. The file format options are explained in [[Import_options_dialog#File format|the last section]].


* '''data format''' is used to distinguish between adjacency matrix files and other types of CSV files (here choose '''adjacency matrix''').
* '''network type''' can be ''one mode'' or ''two mode''. In the adjacency matrix of a one mode network the rows and columns are indexed by the same set of nodes; for a two mode network (for instance, a network connecting authors to the articles they have written), the rows and columns are indexed by different sets of nodes (authors and articles, respectively, in the example).
* '''network type''' can be ''one mode'' or ''two mode''. In the adjacency matrix of a one mode network the rows and columns are indexed by the same set of nodes; for a two mode network (for instance, a network connecting authors to the articles they have written), the rows and columns are indexed by different sets of nodes (authors and articles, respectively, in the example).
* '''link attribute type''' can be ''decimal'' or ''text''. The entries of the adjacency matrix (which are either numbers or character strings) are saved in a link attribute of the newly opened network; this option defines the type of this attribute (''decimal'' for numerical attributes and ''text'' for categorical).
* '''link attribute type''' can be ''decimal'' or ''text''. The entries of the adjacency matrix (which are either numbers or character strings) are saved in a link attribute of the newly opened network; this option defines the type of this attribute (''decimal'' for numerical attributes and ''text'' for categorical).
* The check boxes '''row labels''' and '''header''' indicate whether the first column (respectively first row) lists the node identifiers (rather than entries of the adjacency matrix). If unchecked, then the node identifiers will be the numbers from <math>0</math> to <math>n-1</math> (when there are <math>n</math> nodes in the network).
* The check boxes '''row labels''' and '''header''' indicate whether the first column (respectively first row) lists the node identifiers (rather than entries of the adjacency matrix). If unchecked, then the node identifiers will be the numbers from <math>0</math> to <math>n-1</math> (when there are <math>n</math> nodes in the network).
* The check box '''directed edges''' is used to choose between directed and undirected networks.
* The check box '''directed edges''' is used to choose between directed and undirected networks.
* The '''file format''' can be ''MS Excel'', ''OpenOffice'' (default CSV output of these software programs, respectively), or ''user defined''. If it is set to ''user defined'' you have to specify the following options.
* '''cell delimiter''' defines the character that separates one matrix cell from the next. In the examples above, the cell delimiter is the semicolon (''';''') but it can as well be a comma, colon, TAB, or SPACE character.
* '''textframe''' can be double quotes, quotes, or NONE. Textframes are necessary if the matrix-cell entries themselves contain the cell delimiter. (For instance, if the cell delimiter is SPACE and the row/column labels are ''"firstname lastname"''; the quotes tell visone that the cell does not end after ''firstname''.)
* The '''merge empty cells''' checkbox tells visone whether repeated cell delimiters should be treated as one. This option is for instance necessary when reading the [[Newcomb_Fraternity_(data)|Newcomb Fraternity data]] (of which an excerpt is shown below)
  0  7 12 11 10  4 13 14 15 16  3  9  1  5  8  6  2
  8  0 16  1 11 12  2 14 10 13 15  6  7  9  5  3  4
  13 10  0  7  8 11  9 15  6  5  2  1 16 12  4 14  3
  ...
where the cell delimiter (the SPACE character) is sometimes repeated to enhance (human) readability.


The bottom part of the dialog shows the table in the way that it will be interpreted with the current setting. This part allows you to recognize whether the options are set correctly.


When you have set the options, click on the '''ok''' button to open the file.
=== Link list files ===


== Link list files ==
To open a link list, use the [[file menu]], click on '''open...''', select ''files of type'' '''CSV files (.txt, .csv)''' in the file chooser, navigate to the file you want to open, and click on the '''ok''' button. If '''link list''' is selected as ''data format'' in the topmost drop-down menu, the import options dialog is as shown below (with a preview displaying your data).
 
To open a link list, use the [[file menu]], click on '''open...''', select ''files of type'' '''CSV files (.txt, .csv)''' in the file chooser, navigate to the file you want to open, and click on the '''ok''' button. Then the [[import options dialog]] opens (shown below).


[[File:Import_options_dialog_link_list.png]]
[[File:Import_options_dialog_link_list.png]]


Many options have the same meaning as when reading adjacency matrices (explained above). The most crucial difference is that in the first row of the preview area you select two specific columns, one containing the source of the link (indicated by the label '''source''' in the very first row) and one containing the target of the link (indicated by  '''target'''). The other columns contain link attributes that you might choose to import (if set to '''enabled''') or ignore (if set to '''disabled''').
The data format options '''network type''', '''header''' and '''directed edges''' have the same meaning as when [[Import_options_dialog#Adjacency_matrix_files|reading adjacency matrices]] (explained above). The most crucial difference is that in the first row of the preview area you select two specific columns, one containing the source of the link (indicated by the label '''source''' in the very first row) and one containing the target of the link (indicated by '''target'''). The other columns contain link attributes that you might choose to import (if set to '''enabled''') or ignore (if set to '''disabled''').


In the example above (see the tutorial on [[Wikipedia_edit_networks_(tutorial)|Wikipedia edit networks]] to learn more about this data), the column with the header '''ActiveUser''' contains the link source and the column labeled '''Target''' contains the link target. The other columns ('''WordCount''', '''InteractionType''', etc) hold the values of various link attributes that are newly created if not already in the network. Note that the type of these attributes can be set to be ''text'', ''integer'', ''decimal'', etc in the second row of the preview area.
In the example above (see the tutorial on [[Wikipedia_edit_networks_(tutorial)|Wikipedia edit networks]] to learn more about this data), the column with the header '''ActiveUser''' contains the link source and the column labeled '''Target''' contains the link target. The other columns ('''WordCount''', '''InteractionType''', etc) hold the values of various link attributes that are newly created if not already in the network. Note that the type of these attributes can be set to be ''text'', ''integer'', ''decimal'', etc in the second row of the preview area.


== Adjacency list files ==
If the links in a link list have associated time information (encoding when the interaction happened), or if the order in the file is meaningful and can be interpreted to mean that interactions near the beginning of the file happened earlier, you might consider opening them as ''event list files'' (this is illustrated in the [[Event_networks_(tutorial)|tutorial on event networks]]).


To open an adjacency list, use the [[file menu]], click on '''open...''', select '''files of type''' ''adjacency list files (.txt, .csv)'' in the file chooser, navigate to the file you want to open, and click on the '''ok''' button. Then the [[import options dialog]] opens (shown below).
The file format options are explained in [[Import_options_dialog#File format|the last section]]. Note that nodes in the link list will be consecutively numbered, according to their first appearance in the link list.


[[File:Import_options_adj_list.png]]


* The '''header''' checkbox defines whether the first row is a header giving the numbers of nodes and links in the file (rather than the adjacency list of the first node).
=== Adjacency list files ===
* '''node labels''' indicates whether the first column lists the node identifiers (if unchecked, then nodes are numbered consecutively and the <math>i</math>'th row lists the neighbors of the <math>i</math>'th node).
* '''directed''' defines whether links are treated as directed or undirected.


The other options have the same meaning as when reading adjacency matrices (explained above).
To open an adjacency list, use the [[file menu]], click on '''open...''', select '''files of type''' ''adjacency list files (.txt, .csv)'' in the file chooser, navigate to the file you want to open, and click on the '''ok''' button. If '''adjacency list''' is selected as ''data format'' in the topmost drop-down menu, the import options dialog is as shown below (with a preview displaying your data).


== Importing node and link attributes ==
[[File:Import_options_adj_list.png]]


To open an attribute table and add attribute values to the nodes or links of a network that is already opened in visone, use the [[attribute manager]] which can be started by clicking on the icon [[File:Attribute_manager.png|link=attribute_manager]] in visone's toolbar.
The check box '''directed edges''' is used to choose between directed and undirected networks.  


In the attribute manager, choose the '''node''' or '''link''' button in the top row, '''import & export''' on the left, operation '''import''' in the drop-down menu, and select the CSV file that contains the attributes you want to import
The other available options are file format options (explained in the [[Import_options_dialog#File_format|last section]]). Note that since the line length in an adjacency list often varies, leaving the ''merge empty cells'' checkbox unchecked is a frequent source of errors when importing this data format.


[[File:Attribute_manager_import.png]]
=== Node list files ===


Before clicking on '''apply''' it is very important to correctly set the value in the ''join by'' drop-down menu. This should point to the name of the attribute that identifies the nodes (or links) and tells visone which column in the imported CSV file holds these identifiers. The identifying attribute must be identical to the header of the column that contains the identifiers (see below). The other columns contain the names and values of the other attributes. For instance, the node with '''id''' ''A'' gets a value of ''23'' for the '''age''' attribute and the value ''false'' for the '''smokes''' attribute. (When reading attribute tables you need column headers.) 
A node list is a list of all nodes with their attribute values and has the same format as an attribute table (see example below).  
   id;age;smokes
   id;age;smokes
   A;23;false
   A;23;false
Line 108: Line 63:
   C;19;true
   C;19;true
   D;27;false
   D;27;false
Clicking on the '''apply''' button opens the import options which have the same meaning as when reading adjacency matrices (explained above).
To open a node list, use the [[file menu]], click on '''open...''', select ''files of type'' '''CSV files (.txt, .csv)''' in the file chooser, navigate to the file you want to open, and click on the '''ok''' button. If '''node list''' is selected as ''data format'' in the topmost drop-down menu, the import options dialog is as shown below (with a preview displaying your data). Importing a node list creates a network of isolated nodes. In this case it is possible to add node attributes while creating the network, using the '''header''' checkbox (shown below).
 
[[File:Import_options_node_list.png]]
 
== File format ==
 
To read the comma-separated value (CSV) file correctly, visone needs some information about the used file format:
 
[[File:Import Options Dialog FileFormat.png]]
 
* The following '''presets''' are provided for selection: ''MS Excel'', ''OpenOffice'' (default CSV output of these software programs, respectively), and ''user defined''. If ''user defined'' is selected, you have to specify all the following options.
* '''cell delimiter''' defines the character that separates one matrix cell from the next. The most commonly used cell delimiter is the semicolon (''';''') but it can as well be a comma, colon, TAB, or SPACE character.
* The '''encoding''' drop-down menu selects the character encoding of the file.
* '''textframe''' can be double quotes, quotes, or NONE. Textframes are necessary if the matrix-cell entries themselves contain the cell delimiter. (For instance, if the cell delimiter is SPACE and the row/column labels are ''"firstname lastname"''; the quotes tell visone that the cell does not end after ''firstname''.)
* The '''merge empty cells''' checkbox tells visone whether repeated cell delimiters should be treated as one. This option is for instance necessary when reading the [[Newcomb_Fraternity_(data)|Newcomb Fraternity data]] (of which an excerpt is shown below)
  0  7 12 11 10  4 13 14 15 16  3  9  1  5  8  6  2
  8  0 16  1 11 12  2 14 10 13 15  6  7  9  5  3  4
  13 10  0  7  8 11  9 15  6  5  2  1 16 12  4 14  3
  ...
where the cell delimiter (the SPACE character) is sometimes repeated to enhance (human) readability.
* The '''ignore lines starting with''' option lets you specify a prefix; lines that begin with it are ignored while reading the file.

Latest revision as of 08:34, 26 May 2015

visone can import network data from comma-separated value (CSV) files. Since these do not come with an unequivocal specification of how to interpret them, some choices must be made. Therefore, whenever you open a CSV file via the file menu, visone shows you the import options dialog.

Data formats

The import options dialog is able to handle four types of data formats:

  • adjacency matrix
  • link list
  • adjacency list
  • node list

A brief introduction to these four data formats is given in the data input tutorial. To select the data format, use the topmost drop-down menu (data format). The dialog then consists of three parts. The upper part shows the options for interpreting data of the chosen data format; these options vary by data format and are described in the following sections. The middle part (file format) is the same for all data formats and contains the options for reading the file (see the last section for a detailed description). The bottom part (preview) shows the table as it will be interpreted with the current settings, which lets you check whether the options are set correctly.

Note that the import options dialog is designed to create nodes and links from a given data format. To add node and link attributes to a given network, use the attribute manager. A detailed description of adding attributes is also given in the data input tutorial.


Adjacency matrix files

To open an adjacency matrix, use the file menu, click on open..., select files of type CSV files (.txt, .csv) in the file chooser, navigate to the file you want to open, and click on the ok button. If adjacency matrix is selected as data format in the topmost drop-down menu, the import options dialog is as shown below (with a preview displaying your data).

Import options dialog adjacency matrix.png

The semantics of the various data format options is explained in the following. The file format options are explained in the last section.

  • network type can be one mode or two mode. In the adjacency matrix of a one mode network the rows and columns are indexed by the same set of nodes; for a two mode network (for instance, a network connecting authors to the articles they have written), the rows and columns are indexed by different sets of nodes (authors and articles, respectively, in the example).
  • link attribute type can be decimal or text. The entries of the adjacency matrix (which are either numbers or character strings) are saved in a link attribute of the newly opened network; this option defines the type of this attribute (decimal for numerical attributes and text for categorical).
  • The check boxes row labels and header indicate whether the first column (respectively first row) lists the node identifiers (rather than entries of the adjacency matrix). If unchecked, then the node identifiers will be the numbers from 0 to n-1 (when there are n nodes in the network).
  • The check box directed edges is used to choose between directed and undirected networks.
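
How these options interact can be illustrated with a short Python sketch. It is not visone's code: the function name read_adjacency_matrix and its parameters are hypothetical and only mirror the check boxes above, the entries are read as decimal values, and treating an entry of 0 as "no link" is an assumption of this sketch.

```python
import csv
import io


def read_adjacency_matrix(text, delimiter=";", row_labels=True,
                          header=True, directed=True):
    """Return {(source, target): value} for all nonzero matrix entries."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if header:
        # With row labels, the top-left cell of the header row is empty.
        col_ids = rows[0][1:] if row_labels else rows[0]
        rows = rows[1:]
    else:
        n_cols = len(rows[0]) - (1 if row_labels else 0)
        col_ids = [str(j) for j in range(n_cols)]  # identifiers 0 .. n-1
    links = {}
    for i, row in enumerate(rows):
        source = row[0] if row_labels else str(i)
        cells = row[1:] if row_labels else row
        for target, cell in zip(col_ids, cells):
            value = float(cell)  # "decimal" link attribute type
            if value == 0:  # assumption: 0 means there is no link
                continue
            # Undirected: (A, B) and (B, A) collapse into a single link.
            key = (source, target) if directed else tuple(sorted((source, target)))
            links[key] = value
    return links


SAMPLE = ";A;B;C;D\nA;0;1;1;1\nB;0;0;1;0\nC;1;1;0;0\nD;1;0;0;0\n"
directed_links = read_adjacency_matrix(SAMPLE)
undirected_links = read_adjacency_matrix(SAMPLE, directed=False)
print(len(directed_links), len(undirected_links))
```

In the sample matrix, the entry in row A and column B is 1 while the entry in row B and column A is 0, so the directed reading keeps only the link from A to B, whereas the undirected reading stores a single link between the two nodes.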


Link list files

To open a link list, use the file menu, click on open..., select files of type CSV files (.txt, .csv) in the file chooser, navigate to the file you want to open, and click on the ok button. If link list is selected as data format in the topmost drop-down menu, the import options dialog is as shown below (with a preview displaying your data).

Import options dialog link list.png

The data format options network type, header and directed edges have the same meaning as when reading adjacency matrices (explained above). The most crucial difference is that in the first row of the preview area you select two specific columns, one containing the source of the link (indicated by the label source in the very first row) and one containing the target of the link (indicated by target). The other columns contain link attributes that you might choose to import (if set to enabled) or ignore (if set to disabled).

In the example above (see the tutorial on Wikipedia edit networks to learn more about this data), the column with the header ActiveUser contains the link source and the column labeled Target contains the link target. The other columns (WordCount, InteractionType, etc) hold the values of various link attributes that are newly created if not already in the network. Note that the type of these attributes can be set to be text, integer, decimal, etc in the second row of the preview area.

If the links in a link list have associated time information (encoding when the interaction happened), or if the order in the file is meaningful and can be interpreted to mean that interactions near the beginning of the file happened earlier, you might consider opening them as event list files (this is illustrated in the tutorial on event networks).

The file format options are explained in the last section. Note that nodes in the link list will be consecutively numbered, according to their first appearance in the link list.
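
The numbering by first appearance can be sketched as follows. This is a minimal illustration rather than visone's implementation: the function read_link_list and its parameters source_col and target_col are hypothetical stand-ins for the source and target column selection, and starting the numbering at 0 is an assumption.

```python
import csv
import io


def read_link_list(text, delimiter=";", source_col=0, target_col=1):
    """Return (index, links): node numbers by first appearance, and the links."""
    index = {}  # node label -> consecutive number
    links = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue
        pair = []
        for label in (row[source_col], row[target_col]):
            if label not in index:
                index[label] = len(index)  # next free number
            pair.append(index[label])
        links.append(tuple(pair))
    return index, links


index, links = read_link_list("A;C\nC;B\nB;A\nA;D\n")
print(index)
print(links)
```

In this four-link file, C receives a lower number than B because C appears first, even though B precedes it alphabetically.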


Adjacency list files

To open an adjacency list, use the file menu, click on open..., select files of type adjacency list files (.txt, .csv) in the file chooser, navigate to the file you want to open, and click on the ok button. If adjacency list is selected as data format in the topmost drop-down menu, the import options dialog is as shown below (with a preview displaying your data).

Import options adj list.png

The check box directed edges is used to choose between directed and undirected networks.

The other available options are file format options (explained in the last section). Note that since the line length in an adjacency list often varies, leaving the merge empty cells checkbox unchecked is a frequent source of errors when importing this data format.


Node list files

A node list is a list of all nodes with their attribute values and has the same format as an attribute table (see example below).

 id;age;smokes
 A;23;false
 B;28;true
 C;19;true
 D;27;false

To open a node list, use the file menu, click on open..., select files of type CSV files (.txt, .csv) in the file chooser, navigate to the file you want to open, and click on the ok button. If node list is selected as data format in the topmost drop-down menu, the import options dialog is as shown below (with a preview displaying your data). Importing a node list creates a network of isolated nodes. In this case it is possible to add node attributes while creating the network, using the header checkbox (shown below).

Import options node list.png
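
What importing a node list amounts to can be sketched in Python. The helper read_node_list and its id_column parameter are hypothetical; the sketch assumes the header row is present and leaves all attribute values as text.

```python
import csv
import io


def read_node_list(text, delimiter=";", id_column="id"):
    """Return {node_id: {attribute: value}}; no links are created."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    nodes = {}
    for row in reader:
        node_id = row.pop(id_column)  # the remaining columns are attributes
        nodes[node_id] = dict(row)
    return nodes


NODE_LIST = "id;age;smokes\nA;23;false\nB;28;true\nC;19;true\nD;27;false\n"
nodes = read_node_list(NODE_LIST)
print(nodes["A"])
```

The result has one entry per row of the file and no links, which corresponds to a network of isolated nodes that carry the age and smokes attributes.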

File format

To read the comma-separated value (CSV) file correctly, visone needs some information about the used file format:

Import Options Dialog FileFormat.png

  • The following presets are provided for selection: MS Excel, OpenOffice (default CSV output of these software programs, respectively), and user defined. If user defined is selected, you have to specify all the following options.
  • cell delimiter defines the character that separates one matrix cell from the next. The most commonly used cell delimiter is the semicolon (;) but it can as well be a comma, colon, TAB, or SPACE character.
  • The encoding drop-down menu selects the character encoding of the file.
  • textframe can be double quotes, quotes, or NONE. Textframes are necessary if the matrix-cell entries themselves contain the cell delimiter. (For instance, if the cell delimiter is SPACE and the row/column labels are "firstname lastname"; the quotes tell visone that the cell does not end after firstname.)
  • The merge empty cells checkbox tells visone whether repeated cell delimiters should be treated as one. This option is for instance necessary when reading the Newcomb Fraternity data (of which an excerpt is shown below)
  0  7 12 11 10  4 13 14 15 16  3  9  1  5  8  6  2
  8  0 16  1 11 12  2 14 10 13 15  6  7  9  5  3  4
 13 10  0  7  8 11  9 15  6  5  2  1 16 12  4 14  3
 ...

where the cell delimiter (the SPACE character) is sometimes repeated to enhance (human) readability.

  • The ignore lines starting with option lets you specify a prefix; lines that begin with it are ignored while reading the file.
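
The effect of the file format options on how a line is split into cells can be sketched with Python's standard csv module. The function read_cells and its parameter names are hypothetical and merely echo the option names above; in particular, dropping every empty cell is only an approximation of "treating repeated delimiters as one".

```python
import csv


def read_cells(text, delimiter=";", textframe='"',
               merge_empty_cells=False, ignore_prefix=None):
    """Split each kept line of text into a list of cells."""
    lines = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if ignore_prefix and line.startswith(ignore_prefix):
            continue  # "ignore lines starting with"
        lines.append(line)
    if textframe is None:
        # No textframe: quote characters are treated as ordinary text.
        reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(lines, delimiter=delimiter, quotechar=textframe)
    rows = []
    for row in reader:
        if merge_empty_cells:
            # Repeated delimiters show up as empty cells; drop them.
            row = [cell for cell in row if cell != ""]
        rows.append(row)
    return rows


line = "8  0 16  1"
print(read_cells(line, delimiter=" "))
print(read_cells(line, delimiter=" ", merge_empty_cells=True))
print(read_cells('"Ann Smith" 3', delimiter=" "))
```

Without merging, the doubled SPACE characters in the excerpt produce empty cells between the numbers; with merging, only the four numbers remain. The last call shows the textframe at work: the quotes keep the SPACE inside the label from ending the cell.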